scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-02 22:25:48 +00:00

Author	SHA1	Message	Date
MaciekCisowski	439001b8c2	service_level_controller: fix small typo in exception message Closes #10136	2022-02-26 22:23:26 +02:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes were applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Pavel Solodovnikov	b958e85c54	utils: atomic_vector: rename `for_each` to `thread_for_each` To emphasize that the function requires `seastar::thread` context to function properly. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:29:12 +03:00
Piotr Sarna	4bfaa7d9fc	Merge 'Service levels: fix undefined behaviours' from Eliran Sinvani This mini series contains two fixes that are bundled together since the second one assumes that the first one exists (or it will not fix anything really...). The two problems were: 1. When certain operations are called on a service level controller which doesn't have its data accessor set, it can lead to a crash since some operations will still try to dereference the accessor pointer. 2. The cql environment test initialized the accessor with a sharded<system_distributed_data>&, however this sharded class itself is not initialized (sharded::start wasn't called), so the same operations that were unsafe for null dereference will now crash trying to access an uninitialized sharded instance. Closes #9468 * github.com:scylladb/scylla: CQL test environment: Fix bad initialization order Service Level Controller: Fix possible dereference of a null pointer	2021-10-18 08:53:53 +02:00
Eliran Sinvani	6d3e8055f9	Service Level Controller: Fix possible dereference of a null pointer If the service level controller doesn't have its data accessor set, calls for getting distributed information might dereference this unset accessor pointer. Here we add code that will return a result as if there is no data available to the accessor (a behaviour which is roughly equivalent to a null data accessor). Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2021-10-12 13:27:50 +03:00
Avi Kivity	f6d59c33ff	service: service_level_controller: drop unused variable sl_compare Reported by gcc 11.	2021-10-10 18:16:50 +03:00
Eliran Sinvani	c38ceafdcf	Service Level Controller: Add an extension point to the API (#9374) In order to ease future extensions to the information being sent by the service level configuration change API, we pack the additional parameters (other than the service level options) to the interface in a structure. This will allow an easy expansion in the future if more parameters need to be sent to the observer. Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2021-10-01 10:20:28 +03:00
Eliran Sinvani	7f44736939	Service Levels: do not notify stale service level removals Before this commit, the service_level_controller would notify the subscribers on stale deletes, meaning deletes of locally nonexistent service_levels. The code flow shouldn't ever get to such a state, but as long as this condition is checked instead of being asserted it is worthwhile to change the code to be safe. Closes #9253	2021-08-26 18:27:52 +03:00
Eliran Sinvani	47d3862b63	Service Level Controller: Add a listener API for service level config changes This change adds an API for registering a listener for service_level configuration changes. It notifies about removal, addition and change of a service level. The hidden assumption is that some listeners are going to create and/or manage service level specific resources, and this is what guided the timing of the call to the subscriber. Addition and change of a service level are called before the actual change takes place; this guarantees that resource creation can take place before the service level or new config starts to be used. The deletion notification is called only after the deletion took place, and this guarantees that the service level can't be active and the resources created can be safely destroyed.	2021-08-16 11:38:59 +03:00
Pavel Emelyanov	c39f04fa6f	code: Remove storage-service header from irrelevant places Some .cc files over the code include the storage service for no real need. Drop the header and include (in some) what's really needed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-22 18:50:19 +03:00
Eliran Sinvani	ccdef39d21	Service Level Controller: Stop configuration polling loop upon leaving the cluster This change subscribes service_level_controller for node life cycle notifications and uses the notification of leaving the cluster for the current node to stop the configuration polling loop. If the loop continues to run, its queries will fail consistently since the nodes will not answer queries. It is worth mentioning that the queries failing in the current state of the code is harmless but noisy, since after 90 seconds, if the scylla process is not shut down, the failures will start to generate failure logs every 90 seconds, which is confusing for users. Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2021-07-14 09:31:40 +03:00
Avi Kivity	a55b434a2b	treewide: extend copyright statements to present day	2021-06-06 19:18:49 +03:00
Eliran Sinvani	f2091bb227	workload prioritization: Reduce the logging sensitivity to "glitches" in availability Before this patch every failure to pull the configuration was reported as a warning. However this is confusing for users for two reasons: 1. It pollutes the logs if the configuration is polled, which is Scylla's mode of operation. Such a line is logged every failed iteration. 2. It confuses users because even though this level is warning, it logs out an exception and the log message contains the word failed. We see it a lot during QA runs and customer questions from the field. Point 2 is only solvable by reducing the verbosity of the logged information, which will make debugging harder. Point 1 is addressed here in the following manner: first, the one-shot configuration pull function no longer handles the exception itself. This is OK because it is harmless to fail once or twice in a row in configuration pulling, like in every other query; the caller is the one responsible for handling the exception and logging the information. Second, the polling loop captures the exceptions thrown from the configuration pulling function and only reports an error with the latest exception if the polling has failed in consecutive iterations over the last 90 seconds. This value was chosen because this is about the empirical worst case time that it takes a node to notice that one of the other nodes in the cluster is down (hence not querying it). It is not important for the user or for us to be notified of temporary glitches in availability (through this error at least), and since we are eventually consistent it is OK that some nodes will catch up with the configuration later than others. We also set a threshold in which if the configuration still couldn't be retrieved then the logging level is bumped to ERROR. Closes #8574	2021-05-24 10:51:47 +02:00
Piotr Sarna	368a6976ff	qos: allow returning combined service level options Originally, the API for finding a service level controller returned its name, which also implied that only a single service level may be active for a user and provide its options. After adding timeout parameters it makes more sense to return a result which combines multiple service level parameters - e.g. a user can be attached to one level for read timeouts and a separate one for write timeouts.	2021-05-10 12:39:41 +02:00
Eliran Sinvani	fc93133cbe	Service level controller: fix wrong default service level removal log An out of block log print resulted in repeated prints about removal of the default service level. The period of this print is every time the configuration is scanned for changes. It happens when the default service level is one of the last on the map (sorted as in the map). Fixes #8567 Closes #8576	2021-05-03 09:08:41 +03:00
Eliran Sinvani	02d37cb133	workload prioritization: Fix configuration change detection The configuration detection is based on a loop that advances two iterators and compares the two collections to deduce the configuration change. In order to correctly deduce the changes, the iteration has to follow the key (service level name) order for both of the collections. If it doesn't, the results are undefined and in some cases can lead to a crash of the system. The bug is that the _service_level_db field was implemented using an unordered_map, which obviously doesn't guarantee the configuration change detection assumption. The fix was simply to change the field type to a map instead of unordered_map. Another problem is that when a static service level (i.e. default) is at the end of the keys list, it is repeatedly being deleted - which doesn't really do anything, since deleting a static service level just restores its default values, but it is still wrong.	2021-04-27 12:29:31 +02:00
Eliran Sinvani	946fc6af08	workload prioritization: add exception protection in configuration polling Exceptions around the loop polling were not handled properly. This is an issue because if an unhandled exception slips out to the configuration polling loop itself it will break it. When the configuration polling loop is broken, any further change to the configuration will not be acted upon in the nodes where the loop is broken until the node is restarted. The chances for exceptions are now greater than before since in one of the previous commits we started querying the workload prioritization configuration table with a sensible, shorter timeout. This change also adds a logger for the workload prioritization module and some logging, mainly around the configuration polling loop. Most logs are added in the info level since they are not expected to happen frequently, but when they do we would like to have some information by default regarding what broke the loop.	2021-04-27 12:29:31 +02:00
Piotr Sarna	55ae110774	qos: make sure to wait for sl updates on shutdown The service level controller spawns an updating thread, which wasn't properly waited for during shutdown. This behavior is now fixed. In order to make the shutdown order more standardized, the operation is split into two phases - draining and stopping. Tests: manual Fixes #8468	2021-04-22 09:58:27 +02:00
Piotr Sarna	41951d34ad	qos: add waiting for the updater future The distributed data updater used to spawn a future without waiting for it. It was quite safe, since the future had its own abort source, but it's better to remember it and wait for it during stop() anyway.	2021-04-12 16:01:04 +02:00
Eliran Sinvani	a54ea4667b	service/qos: adding service level controller adding the service level controller implementation. The implementation follows the design in: https://docs.google.com/document/d/1RrSTZ3ZX86-YDt2POwAVwFeKN9uX8frEvATJda5n1FU/edit?usp=sharing Some interfaces were added for registration with system components. The method of registration is chosen over a constructor parameter, due to the components being initialized prior to the service level controller being created. Message-Id: <e9c4e7d5b411062b6a553f5c6861e7875cd71d2c.1609171761.git.sarna@scylladb.com>	2021-04-12 16:01:04 +02:00

20 Commits
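
Note on commit 02d37cb133: the two-iterator change detection it describes can be sketched as below. This is a minimal illustration, not the code from the Scylla tree. The names `sl_options`, `change`, `change_kind`, `sl_map` and `diff_service_levels` are hypothetical and chosen only for this example. The sketch assumes both collections are `std::map`, whose iteration follows sorted key order. That ordering is the property the commit says the original `unordered_map` lacked.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a service level's options (not Scylla's real type).
struct sl_options {
    int shares = 0;
    bool operator==(const sl_options& o) const { return shares == o.shares; }
};

enum class change_kind { added, removed, updated };

// One detected difference between the current and the incoming configuration.
struct change {
    change_kind kind;
    std::string name;
    bool operator==(const change& o) const { return kind == o.kind && name == o.name; }
};

using sl_map = std::map<std::string, sl_options>;

// Merge-style diff: advance two iterators over both maps in key order.
// This is only correct because std::map iterates in sorted key order.
std::vector<change> diff_service_levels(const sl_map& current, const sl_map& incoming) {
    std::vector<change> out;
    auto cur = current.begin();
    auto inc = incoming.begin();
    while (cur != current.end() && inc != incoming.end()) {
        if (cur->first < inc->first) {
            // Key exists only in the current map: it was removed.
            out.push_back({change_kind::removed, cur->first});
            ++cur;
        } else if (inc->first < cur->first) {
            // Key exists only in the incoming map: it was added.
            out.push_back({change_kind::added, inc->first});
            ++inc;
        } else {
            // Same key on both sides: report it only if the options differ.
            if (!(cur->second == inc->second)) {
                out.push_back({change_kind::updated, cur->first});
            }
            ++cur;
            ++inc;
        }
    }
    // Whatever remains on either side is a pure removal or a pure addition.
    for (; cur != current.end(); ++cur) {
        out.push_back({change_kind::removed, cur->first});
    }
    for (; inc != incoming.end(); ++inc) {
        out.push_back({change_kind::added, inc->first});
    }
    return out;
}
```

With an unordered container the two iterators would not visit keys in a comparable order, so the `<` comparisons above could report spurious additions and removals.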