scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-28 10:41:12 +00:00

Author	SHA1	Message	Date
MaciekCisowski	439001b8c2	service_level_controller: fix small typo in exception message Closes #10136	2022-02-26 22:23:26 +02:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes we applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Pavel Solodovnikov	b958e85c54	utils: atomic_vector: rename `for_each` to `thread_for_each` To emphasize that the function requires `seastar::thread` context to function properly. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2022-01-11 09:29:12 +03:00
Piotr Sarna	4bfaa7d9fc	Merge 'Service levels: fix undefined behaviours' from Eliran Sinvani This mini series contains two fixes that are bundled together since the second one assumes that the first one exists (or it will not fix anything really...). The two problems were: 1. When certain operations are called on a service level controller which doesn't have its data accessor set, it can lead to a crash since some operations will still try to dereference the accessor pointer. 2. The cql environment test initialized the accessor with a sharded<system_distributed_data>&, however this sharded class itself is not initialized (sharded::start wasn't called), so the same operations that were unsafe because of the null dereference would now crash trying to access an uninitialized sharded instance. Closes #9468 * github.com:scylladb/scylla: CQL test environment: Fix bad initialization order Service Level Controller: Fix possible dereference of a null pointer	2021-10-18 08:53:53 +02:00
Eliran Sinvani	6d3e8055f9	Service Level Controller: Fix possible dereference of a null pointer If the service level controller doesn't have its data accessor set, calls that fetch distributed information might dereference this unset accessor pointer. Here we add code that will return a result as if there is no data available to the accessor (a behaviour which is roughly equivalent to a null data accessor). Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2021-10-12 13:27:50 +03:00
Avi Kivity	f6d59c33ff	service: service_level_controller: drop unused variable sl_compare Reported by gcc 11.	2021-10-10 18:16:50 +03:00
Eliran Sinvani	c38ceafdcf	Service Level Controller: Add an extension point to the API (#9374) In order to ease future extensions to the information being sent by the service level configuration change API, we pack the additional parameters (other than the service level options) to the interface in a structure. This will allow an easy expansion in the future if more parameters need to be sent to the observer. Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2021-10-01 10:20:28 +03:00
Eliran Sinvani	7f44736939	Service Levels: do not notify stale service level removals Before this commit, the service_level_controller would notify the subscribers on stale deletes, meaning deletes of locally nonexistent service_levels. The code flow shouldn't ever get to such a state, but as long as this condition is checked instead of being asserted it is worthwhile to change the code to be safe. Closes #9253	2021-08-26 18:27:52 +03:00
Eliran Sinvani	47d3862b63	Service Level Controller: Add a listener API for service level config changes This change adds an API for registering a listener for service_level configuration changes. It notifies about removal, addition and change of a service level. The hidden assumption is that some listeners are going to create and/or manage service level specific resources, and this is what guided the timing of the call to the subscriber. Addition and change of a service level are notified before the actual change takes place; this guarantees that resource creation can take place before the service level or new config starts to be used. The deletion notification is called only after the deletion took place, and this guarantees that the service level can't be active and the resources created can be safely destroyed.	2021-08-16 11:38:59 +03:00
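The notification ordering described in 47d3862b63 (add and change before the update, removal after the erase) can be sketched as follows. This is a minimal illustration, not Scylla's listener API: `sl_registry`, `sl_listener`, `sl_options` and the callback names are hypothetical stand-ins. The early return on a missing entry mirrors the "no stale removal notifications" rule from 7f44736939.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; not Scylla's actual types.
struct sl_options { int shares = 0; };

struct sl_listener {
    std::function<void(const std::string&)> on_before_add;
    std::function<void(const std::string&)> on_before_change;
    std::function<void(const std::string&)> on_after_remove;
};

class sl_registry {
    std::map<std::string, sl_options> _levels;
    std::vector<sl_listener> _listeners;
public:
    void subscribe(sl_listener l) { _listeners.push_back(std::move(l)); }

    bool contains(const std::string& name) const { return _levels.count(name) != 0; }

    // Add or change: listeners run BEFORE the map is updated, so they can
    // create resources before the level (or its new config) is in use.
    void set(const std::string& name, sl_options opts) {
        const bool exists = contains(name);
        for (auto& l : _listeners) {
            (exists ? l.on_before_change : l.on_before_add)(name);
        }
        _levels[name] = opts;
    }

    // Remove: listeners run AFTER the erase, so the level can no longer be
    // active when resources are destroyed. A removal of a level that is not
    // present locally (a stale delete) produces no notification.
    void remove(const std::string& name) {
        if (_levels.erase(name) == 0) {
            return;
        }
        for (auto& l : _listeners) {
            l.on_after_remove(name);
        }
    }
};
```

A listener that checks `contains(name)` inside its callbacks would see the level absent during `on_before_add`, present during `on_before_change`, and absent again during `on_after_remove`.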
Pavel Emelyanov	c39f04fa6f	code: Remove storage-service header from irrelevant places Some .cc files over the code include the storage service for no real need. Drop the header and include (in some) what's really needed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-22 18:50:19 +03:00
Eliran Sinvani	ccdef39d21	Service Level Controller: Stop configuration polling loop upon leaving the cluster This change subscribes service_level_controller for node life cycle notifications and uses the notification of leaving the cluster for the current node to stop the configuration polling loop. If the loop continues to run, its queries will fail consistently since the nodes will not answer queries. It is worth mentioning that the queries failing in the current state of the code is harmless but noisy, since after 90 seconds, if the scylla process is not shut down, the failures will start to generate failure logs every 90 seconds, which is confusing for users. Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2021-07-14 09:31:40 +03:00
Avi Kivity	a55b434a2b	treewide: extend copyright statements to present day	2021-06-06 19:18:49 +03:00
Piotr Sarna	389a0a52c9	treewide: revamp workload type for service levels This patch is not backward compatible with its original, but it's considered fine, since the original workload types were not yet part of any release. The changes include: - instead of using 'unspecified' for declaring that there's no workload type for a particular service level, NULL is used for that purpose; NULL is the standard way of representing lack of data - introducing a delete marker, which accompanies NULL and makes it possible to distinguish between wanting to forcibly reset a workload type to unspecified and not wanting to change the previous value - updating the tests accordingly These changes come in as a single patch, because they're intertwined with each other and the tests for workload types are already in place; an attempt to split them proved to be more complicated than it's worth. Tests: unit(release) Closes #8763	2021-05-31 18:18:33 +03:00
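The NULL-plus-delete-marker scheme from 389a0a52c9 can be illustrated with a small sketch. The names `workload_update` and `apply_update` are hypothetical and chosen for this example only; `std::nullopt` plays the role of NULL.

```cpp
#include <optional>

enum class workload_type { interactive, batch };

// Hypothetical representation of a requested update, for illustration only.
struct workload_update {
    std::optional<workload_type> value; // std::nullopt plays the role of NULL
    bool delete_marker = false;         // only meaningful when value is NULL
};

// Returns the workload type after applying the update:
//  - a concrete value replaces the current one;
//  - NULL with the delete marker forcibly resets to "no workload type";
//  - NULL without the marker leaves the previous value untouched.
std::optional<workload_type> apply_update(std::optional<workload_type> current,
                                          const workload_update& update) {
    if (update.value) {
        return update.value;
    }
    if (update.delete_marker) {
        return std::nullopt;
    }
    return current;
}
```

Without the marker, a NULL in the update could not distinguish "reset this parameter" from "this parameter was not mentioned".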
Piotr Sarna	578543603d	qos: add workload_type service level parameter The workload type is currently one of three values: - unspecified - interactive - batch By defining the workload type, the service level makes it easier for other components to decide what to do in overload scenarios. E.g. if the workload is interactive, requests can be shed earlier, while if it's batched (or unspecified), shedding does not take place. Conversely, batch workloads could accept long full scan operations.	2021-05-27 13:02:22 +02:00
Avi Kivity	50f3bbc359	Merge "treewide: various header cleanups" from Pavel S " The patch set is an assorted collection of header cleanups, e.g: * Reduce number of boost includes in header files * Switch to forward declarations in some places A quick measurement was performed to see if these changes provide any improvement in build times (ccache cleaned and existing build products wiped out). The results are posted below (`/usr/bin/time -v ninja dev-build`) for 24 cores/48 threads CPU setup (AMD Threadripper 2970WX). Before: Command being timed: "ninja dev-build" User time (seconds): 28262.47 System time (seconds): 824.85 Percent of CPU this job got: 3979% Elapsed (wall clock) time (h:mm:ss or m:ss): 12:10.97 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 2129888 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 1402838 Minor (reclaiming a frame) page faults: 124265412 Voluntary context switches: 1879279 Involuntary context switches: 1159999 Swaps: 0 File system inputs: 0 File system outputs: 11806272 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0 After: Command being timed: "ninja dev-build" User time (seconds): 26270.81 System time (seconds): 767.01 Percent of CPU this job got: 3905% Elapsed (wall clock) time (h:mm:ss or m:ss): 11:32.36 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 2117608 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 1400189 Minor (reclaiming a frame) page faults: 117570335 Voluntary context switches: 1870631 Involuntary context switches: 1154535 Swaps: 0 File system inputs: 0 File system outputs: 11777280 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page 
size (bytes): 4096 Exit status: 0 The observed improvement is about 5% of total wall clock time for `dev-build` target. Also, all commits make sure that headers stay self-sufficient, which would help to further improve the situation in the future. " * 'feature/header_cleanups_v1' of https://github.com/ManManson/scylla: transport: remove extraneous `qos/service_level_controller` includes from headers treewide: remove evidently unneeded storage_proxy includes from some places service_level_controller: remove extraneous `service/storage_service.hh` include sstables/writer: remove extraneous `service/storage_service.hh` include treewide: remove extraneous database.hh includes from headers treewide: reduce boost headers usage in scylla header files cql3: remove extraneous includes from some headers cql3: various forward declaration cleanups utils: add missing <limits> header in `extremum_tracking.hh`	2021-05-24 14:24:20 +03:00
Eliran Sinvani	f2091bb227	workload prioritization: Reduce the logging sensitivity to "glitches" in availability Before this patch every failure to pull the configuration was reported as a warning. However this is confusing for users for two reasons: 1. It pollutes the logs if the configuration is polled, which is Scylla's mode of operation. Such a line is logged every failed iteration. 2. It confuses users because even though this level is warning, it logs out an exception and the log message contains the word failed. We see it a lot during QA runs and customer questions from the field. Point 2 is only solvable by reducing the verbosity of the logged information, which will make debugging harder. Point 1 is addressed here in the following manner. First, the one shot configuration pull function is not handling the exception itself; this is OK because it is harmless to fail once or twice in a row in configuration pulling, like in every other query, and the caller is the one that will be responsible to handle the exception and log the information. Second, the polling loop captures the exceptions being thrown from the configuration pulling function and only reports an error with the latest exception if the polling has failed in consecutive iterations over the last 90 seconds. This value was chosen because this is about the empirical worst case time that it takes a node to notice that one of the other nodes in the cluster is down (hence not querying it). It is not important for the user or for us to be notified on temporary glitches in availability (through this error at least), and since we are eventually consistent it is OK that some nodes will catch up with the configuration later than others. We also set a threshold in which if the configuration still couldn't be retrieved then the logging level is bumped to ERROR. Closes #8574	2021-05-24 10:51:47 +02:00
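The "report only after failures persist for 90 seconds" rule from f2091bb227 can be sketched with a small tracker that the polling loop consults on every iteration. `failure_tracker` and `log_action` are hypothetical names. Restarting the window after each report is an assumption, based on the later note in ccdef39d21 that failure logs appear every 90 seconds.

```cpp
#include <chrono>
#include <optional>

using clock_type = std::chrono::steady_clock;

enum class log_action { none, report };

// Hypothetical helper, for illustration only: tracks how long polling has
// been failing without an intervening success.
class failure_tracker {
    std::chrono::seconds _threshold;
    std::optional<clock_type::time_point> _window_start;
public:
    explicit failure_tracker(std::chrono::seconds threshold) : _threshold(threshold) {}

    // A successful poll ends the current run of consecutive failures.
    void on_success() { _window_start.reset(); }

    // Called with the time of a failed poll. Returns report only once
    // failures have persisted for the whole threshold; the window then
    // restarts, so a continuing outage is reported once per threshold.
    log_action on_failure(clock_type::time_point now) {
        if (!_window_start) {
            _window_start = now;
        }
        if (now - *_window_start >= _threshold) {
            _window_start = now;
            return log_action::report;
        }
        return log_action::none;
    }
};
```

Passing the time point in explicitly, rather than reading the clock inside the tracker, keeps the sketch deterministic and easy to exercise.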
Piotr Sarna	17f4a55664	qos: remove unused with_user_service_level helper This helper function is an artifact of forward-porting service levels, and it wouldn't even compile when used because of mismatched function declarations. It's not used anywhere in the open-source code, so it's removed to avoid future merge conflicts. Message-Id: <c9f421d0c4c1a807626775d324fd35b4c72505fe.1621845335.git.sarna@scylladb.com>	2021-05-24 11:42:51 +03:00
Pavel Solodovnikov	0663aa6ca1	service_level_controller: remove extraneous `service/storage_service.hh` include Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-05-20 02:18:41 +03:00
Piotr Sarna	368a6976ff	qos: allow returning combined service level options Originally, the API for finding a service level controller returned its name, which also implied that only a single service level may be active for a user and provide its options. After adding timeout parameters it makes more sense to return a result which combines multiple service level parameters - e.g. a user can be attached to one level for read timeouts and a separate one for write timeouts.	2021-05-10 12:39:41 +02:00
Piotr Sarna	cbedefb0f9	qos: add a way of merging service level options In order to combine multiple service level options coming from multiple roles, a helper function is provided to merge two of them. The semantics depend on each parameter, but for timeouts, which are the only parameters at the time of writing this message, the minimum value of the two is taken. That in particular means that when service level A has timeout = 50ms and service level B has timeout = 1s, the resulting service level options would set the timeout to 50ms.	2021-05-10 12:39:41 +02:00
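The merge semantics described in cbedefb0f9 (the smaller of two timeouts wins) can be sketched as below. `sl_options` and `merge` here are simplified, hypothetical stand-ins; the handling of an unset timeout (the set value wins) is an assumption not stated in the commit message.

```cpp
#include <algorithm>
#include <chrono>
#include <optional>

// Simplified, hypothetical stand-in with a single optional timeout.
struct sl_options {
    std::optional<std::chrono::milliseconds> timeout;
};

// Merge two service level options; for timeouts the minimum is taken.
// Assumption: if only one side sets a timeout, that value is kept.
sl_options merge(const sl_options& a, const sl_options& b) {
    sl_options result;
    if (a.timeout && b.timeout) {
        result.timeout = std::min(*a.timeout, *b.timeout);
    } else {
        result.timeout = a.timeout ? a.timeout : b.timeout;
    }
    return result;
}
```

With service level A at 50ms and service level B at 1s, as in the commit message, the merged options carry a 50ms timeout.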
Piotr Sarna	4ba1ac57a1	cql3: add preserving default values for per-sl timeouts In order for per-service-level timeouts to work as expected, a special value is reserved for internally marking the timeouts as deleted.	2021-05-10 11:48:14 +02:00
Piotr Sarna	fb4e8951f5	qos: make getting service level public	2021-05-10 11:48:14 +02:00
Piotr Sarna	06d0e1853d	qos: make finding service level public	2021-05-10 11:48:14 +02:00
Piotr Sarna	aa37974192	cql3: add timeout to service level params Timeout value can now be properly parsed from CQL.	2021-05-10 10:43:21 +02:00
Piotr Sarna	3339ea1d0d	qos: add timeout to service level info Service level information now consists of the timeout config, which stores the timeout value for all operations.	2021-05-10 10:22:11 +02:00
Eliran Sinvani	fc93133cbe	Service level controller: fix wrong default service level removal log An out of block log print resulted in repeated prints about removal of the default service level. The period of this print is every time the configuration is scanned for changes. It happens when the default service level is one of the last on the map (sorted as in the map). Fixes #8567 Closes #8576	2021-05-03 09:08:41 +03:00
Eliran Sinvani	02d37cb133	workload prioritization: Fix configuration change detection The configuration detection is based on a loop that advances two iterators and compares the two collections for deducing the configuration change. In order to correctly deduce the changes, the iteration has to follow the key (service level name) order for both of the collections. If it doesn't, the results are undefined and in some cases can lead to a crash of the system. The bug is that the _service_level_db field was implemented using an unordered_map, which obviously doesn't guarantee the configuration change detection assumption. The fix was simply to change the field type to a map instead of unordered_map. Another problem is that when a static service level (i.e. default) is at the end of the keys list, it is repeatedly being deleted - which doesn't really do anything, since deleting a static service level just restores its default values, but it is still wrong.	2021-04-27 12:29:31 +02:00
Eliran Sinvani	946fc6af08	workload prioritization: add exception protection in configuration polling Exceptions around the loop polling were not handled properly. This is an issue due to the fact that if an unhandled exception slips out to the configuration polling loop itself it will break it. When the configuration polling loop is broken, any further change to the configuration will not be acted upon in the nodes where the loop is broken until the node is restarted. The chances for exceptions are now greater than before since in one of the previous commits we started querying the workload prioritization configuration table with a sensible, shorter timeout. This change also adds a logger for the workload prioritization module and some logging mainly around the configuration polling loop. Most logs are added in the info level since they are not expected to happen frequently but when they do we would like to have some information by default regarding what broke the loop.	2021-04-27 12:29:31 +02:00
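The two-iterator change detection that 02d37cb133 relies on can be sketched as a merge-style walk over two `std::map`s. It is only correct because both containers iterate in key order, which is the guarantee `std::unordered_map` lacks. The `change` record and the `int` payload are hypothetical simplifications of the real service level options.

```cpp
#include <map>
#include <string>
#include <vector>

enum class change_kind { added, removed, updated };

// Hypothetical record of one detected change, for illustration only.
struct change {
    change_kind kind;
    std::string name;
};

// Walk both maps in key order with one iterator each and classify keys as
// removed (only in current), added (only in incoming) or updated.
std::vector<change> diff(const std::map<std::string, int>& current,
                         const std::map<std::string, int>& incoming) {
    std::vector<change> out;
    auto c = current.begin();
    auto n = incoming.begin();
    while (c != current.end() && n != incoming.end()) {
        if (c->first == n->first) {
            if (c->second != n->second) {
                out.push_back({change_kind::updated, c->first});
            }
            ++c;
            ++n;
        } else if (c->first < n->first) {
            out.push_back({change_kind::removed, c->first});
            ++c;
        } else {
            out.push_back({change_kind::added, n->first});
            ++n;
        }
    }
    // Whatever remains on either side exists on that side only.
    for (; c != current.end(); ++c) {
        out.push_back({change_kind::removed, c->first});
    }
    for (; n != incoming.end(); ++n) {
        out.push_back({change_kind::added, n->first});
    }
    return out;
}
```

If either container iterated in arbitrary order, the `<` comparison would misclassify keys as added or removed, which is the failure mode the commit describes.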
Piotr Sarna	55ae110774	qos: make sure to wait for sl updates on shutdown The service level controller spawns an updating thread, which wasn't properly waited for during shutdown. This behavior is now fixed. In order to make the shutdown order more standardized, the operation is split into two phases - draining and stopping. Tests: manual Fixes #8468	2021-04-22 09:58:27 +02:00
Piotr Sarna	3626bc253d	service: make enable_shared_from_this inheritance public Without being public, making shared pointer from the service level accessor is not accessible outside of the class.	2021-04-12 16:31:27 +02:00
Eliran Sinvani	8493e19840	qos: Add a standard implementation for service level data accessor service_level_controller defines an interface for accessing the service level distributed data, this patch implements a standard implementation of the interface that delegates to the system distributed keyspace. Message-Id: <25e68302f6f4d4fe5fcb66ea19159ad68506ba64.1609175314.git.sarna@scylladb.com>	2021-04-12 16:01:04 +02:00
Piotr Sarna	41951d34ad	qos: add waiting for the updater future The distributed data updater used to spawn a future without waiting for it. It was quite safe, since the future had its own abort source, but it's better to remember it and wait for it during stop() anyway.	2021-04-12 16:01:04 +02:00
Eliran Sinvani	a54ea4667b	service/qos: adding service level controller Adding the service level controller implementation. The implementation follows the design in: https://docs.google.com/document/d/1RrSTZ3ZX86-YDt2POwAVwFeKN9uX8frEvATJda5n1FU/edit?usp=sharing Some interfaces were added for registration with system components. The method of registration is chosen over a constructor parameter, due to the components being initialized prior to the service level controller being created. Message-Id: <e9c4e7d5b411062b6a553f5c6861e7875cd71d2c.1609171761.git.sarna@scylladb.com>	2021-04-12 16:01:04 +02:00
Eliran Sinvani	4fea0762c2	service/qos: add common definitions Adding common definitions that will be used by the performance isolation classes. Mainly defines the common ground for configuring a service level through the service level options structure. Message-Id: <12476f4a8e21af3a4c7a892683940698f3beacce.1609160860.git.sarna@scylladb.com>	2021-04-12 15:58:09 +02:00

34 Commits