Commit Graph

34 Commits

Author SHA1 Message Date
MaciekCisowski
439001b8c2 service_level_controller: fix small typo in exception message
Closes #10136
2022-02-26 22:23:26 +02:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes we applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Pavel Solodovnikov
b958e85c54 utils: atomic_vector: rename for_each to thread_for_each
To emphasize that the function requires `seastar::thread`
context to function properly.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Piotr Sarna
4bfaa7d9fc Merge 'Service levels: fix undefined behaviours' from Eliran Sinvani
This mini series contains two fixes that are bundled together since the
second one assumes that the first one exists (or it will not fix
anything really...), the two problems were:
1. When certain operations are called on a service level controller
   which doesn't have it's data accessor set, it can lead to a crash
since some operations will still try to dereference the accessor
pointer.
2. The cql environment test initialized the accessor with a
   sharded<system_distributed_data>& however this sharded class as
itself is not initialized (sharded::start wasn't called), so for the
same that were unsafe for null dereference the accessor will now crash
for trying to access uninitialized sharded instance.

Closes #9468

* github.com:scylladb/scylla:
  CQL test environment: Fix bad initialization order
  Service Level Controller: Fix possible dereference of a null pointer
2021-10-18 08:53:53 +02:00
Eliran Sinvani
6d3e8055f9 Service Level Controller: Fix possible dereference of a null pointer
If the service level controller don't have his data accessor set,
calls for getting of distributed information might dereference this
unset pointer for the accessor. Here we add code that will return
a result as if there is no data available to the accessor (a behaviour
which is roughly equivalent to a null data accessor).

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
2021-10-12 13:27:50 +03:00
Avi Kivity
f6d59c33ff service: service_level_controller: drop unused variable sl_compare
Reported by gcc 11.
2021-10-10 18:16:50 +03:00
Eliran Sinvani
c38ceafdcf Service Level Controller: Add an extention point to the API (#9374)
In order to ease future extensions to the information being sent
by the service level configuration change API, we pack the additional
parameters (other the the service level options) to the interface in a
structure. This will allow an easy expansion in the future if more
parameters needs to be sent to the observer.i

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
2021-10-01 10:20:28 +03:00
Eliran Sinvani
7f44736939 Service Levels: do not notify stale service level removals
Before this commit, the service_level_controller will notify
the subscribers on stale deletes, meaning, deletes of localy
non exixtent service_levels.
The code flow shouldn't ever get to such a state, but as long
as this condition is checked instead of being asserted it is
worthwhile to change the code to be safe.

Closes #9253
2021-08-26 18:27:52 +03:00
Eliran Sinvani
47d3862b63 Service Level Controller: Add a listener API for service level config
changes

This change adds an api for registering a listener for service_level
configuration chanhes. It notifies about removal addition and change of
service level.
The hidden assumption is that some listeners are going to create and/or
manage service level specific resources and this it what guided the
time of the call to the subscriber.
Addition and change of a service level are called before the actual
change takes place, this guaranties that resource creation can take
place before the service level or new config starts to be used.
The deletion notification is called only after the deletion took place
and this guranties that the service level can't be active and the
resources created can be safely destroyed.
2021-08-16 11:38:59 +03:00
Pavel Emelyanov
c39f04fa6f code: Remove storage-service header from irrelevant places
Some .cc files over the code include the storage service
for no real need. Drop the header and include (in some)
what's really needed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-22 18:50:19 +03:00
Eliran Sinvani
ccdef39d21 Service Level Controller: Stop configuration polling loop upon leaving
the cluster

This change subscribes service_level_controller for nodes life cycle
notifications and uses the notification of leaving the cluster for
the current node to stop the configuration polling loop. If the loop
continues to run it's queries will fail consistently since the nodes
will not answers to queries. It is worth mentioning that the queries
failing in the current state of code is harmles but noisy since after
90 seconsd, if the scylla process is not shut down the failures will
start to generate failure logs every 90 seconds which is confusing for
users.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
2021-07-14 09:31:40 +03:00
Avi Kivity
a55b434a2b treewide: extent copyright statements to present day 2021-06-06 19:18:49 +03:00
Piotr Sarna
389a0a52c9 treewide: revamp workload type for service levels
This patch is not backward compatible with its original,
but it's considered fine, since the original workload types were not
yet part of any release.
The changes include:
 - instead of using 'unspecified' for declaring that there's no workload
   type for a particular service level, NULL is used for that purpose;
   NULL is the standard way of representing lack of data
 - introducing a delete marker, which accompanies NULL and makes it
   possible to distinguish between wanting to forcibly reset a workload
   type to unspecified and not wanting to change the previous value
 - updating the tests accordingly

These changes come in as a single patch, because they're intertwined
with each other and the tests for workload types are already in place;
an attempt to split them proved to be more complicated than it's worth.

Tests: unit(release)

Closes #8763
2021-05-31 18:18:33 +03:00
Piotr Sarna
578543603d qos: add workload_type service level parameter
The workload type is currently one of three values:
 - unspecified
 - interactive
 - batch

By defining the workload type, the service level makes it easier
for other components to decide what to do in overload scenarios.
E.g. if the workload is interactive, requests can be shed earlier,
while if it's batched (or unspecified), shedding does not take place.
Conversely, batch workloads could accept long full scan operations.
2021-05-27 13:02:22 +02:00
Avi Kivity
50f3bbc359 Merge "treewide: various header cleanups" from Pavel S
"
The patch set is an assorted collection of header cleanups, e.g:
* Reduce number of boost includes in header files
* Switch to forward declarations in some places

A quick measurement was performed to see if these changes
provide any improvement in build times (ccache cleaned and
existing build products wiped out).

The results are posted below (`/usr/bin/time -v ninja dev-build`)
for 24 cores/48 threads CPU setup (AMD Threadripper 2970WX).

Before:

	Command being timed: "ninja dev-build"
	User time (seconds): 28262.47
	System time (seconds): 824.85
	Percent of CPU this job got: 3979%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 12:10.97
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 2129888
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1402838
	Minor (reclaiming a frame) page faults: 124265412
	Voluntary context switches: 1879279
	Involuntary context switches: 1159999
	Swaps: 0
	File system inputs: 0
	File system outputs: 11806272
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

After:

	Command being timed: "ninja dev-build"
	User time (seconds): 26270.81
	System time (seconds): 767.01
	Percent of CPU this job got: 3905%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 11:32.36
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 2117608
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1400189
	Minor (reclaiming a frame) page faults: 117570335
	Voluntary context switches: 1870631
	Involuntary context switches: 1154535
	Swaps: 0
	File system inputs: 0
	File system outputs: 11777280
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

The observed improvement is about 5% of total wall clock time
for `dev-build` target.

Also, all commits make sure that headers stay self-sufficient,
which would help to further improve the situation in the future.
"

* 'feature/header_cleanups_v1' of https://github.com/ManManson/scylla:
  transport: remove extraneous `qos/service_level_controller` includes from headers
  treewide: remove evidently unneded storage_proxy includes from some places
  service_level_controller: remove extraneous `service/storage_service.hh` include
  sstables/writer: remove extraneous `service/storage_service.hh` include
  treewide: remove extraneous database.hh includes from headers
  treewide: reduce boost headers usage in scylla header files
  cql3: remove extraneous includes from some headers
  cql3: various forward declaration cleanups
  utils: add missing <limits> header in `extremum_tracking.hh`
2021-05-24 14:24:20 +03:00
Eliran Sinvani
f2091bb227 workload prioritization: Reduce the logging sensitivity to "glitches" in availability
Before this patch every failure to pull the configuration have been
reported as a warning. However this is confusing for users for two
reasons:
1. It pollutes the logs if the configuration is polled which is Scylla's
   mode of operation. Such a line is logged every failed iteration.
2. It confuses users because even though this level is warning, it logs
   out an exception and the log message contains the word failed.

We see it a lot during QA runs and customer questions from the field.

Point 2 is only solvable by reducing the verbosity of the logged
information, which will make debugging harder.
Point 1 is addressed here in the following manner, first the
one shot configuration pull function is not handling the exception
itself, this is OK because it is harmless to fail once or twice in a
row in configuration pulling like in every other query, the caller is
the one that will be responsible to handle the exception and log the
information. Second, the polling loop capture the exceptions being
thrown from the configuration pulling function and only report an error
with the latest exception if the polling has failed in consecutive
iterations over the last 90 seconds. This value was chosen because this
is about the empirical worst case time that it takes to a node to notice
one of the other nodes in the cluster is down (hence not querying it).

It is not important for the user or to us to be notified on temporary
glitches in availability (through this error at least) and since we are
eventually consistent is ok that some nodes will catch up with the
configuration later than others.

We also set a threshold in which if the configuration still couldn't be
retrieved then the logging level is bumped to ERROR.

Closes #8574
2021-05-24 10:51:47 +02:00
Piotr Sarna
17f4a55664 qos: remove unused with_user_service_level helper
This helper function is an artifact of forward-porting service levels,
and it wouldn't even compile when used because of mismatched
function declarations. It's not used anywhere in the open-source code,
so it's removed to avoid future merge conflicts.

Message-Id: <c9f421d0c4c1a807626775d324fd35b4c72505fe.1621845335.git.sarna@scylladb.com>
2021-05-24 11:42:51 +03:00
Pavel Solodovnikov
0663aa6ca1 service_level_controller: remove extraneous service/storage_service.hh include
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-20 02:18:41 +03:00
Piotr Sarna
368a6976ff qos: allow returning combined service level options
Originally, the API for finding a service level controller returned
its name, which also implied that only a single service level
may be active for a user and provide its options.
After adding timeout parameters it makes more sense to return a result
which combines multiple service level parameters - e.g. a user
can be attached to one level for read timeouts and a separate one
for write timeouts.
2021-05-10 12:39:41 +02:00
Piotr Sarna
cbedefb0f9 qos: add a way of merging service level options
In order to combine multiple service level options coming from
multiple roles, a helper function is provided to merge two
of them. The semantics depend on each parameter, but for timeouts,
which are the only parameters at the time of writing this message,
the minimum value of the two is taken. That in particular means
that when service level A has timeout = 50ms and service level B
has timeout = 1s, the resulting service level options
would set the timeout to 50ms.
2021-05-10 12:39:41 +02:00
Piotr Sarna
4ba1ac57a1 cql3: add preserving default values for per-sl timeouts
In order for per-service-level timeouts to work as expected,
a special value is reserved for internally marking the timeouts
as deleted.
2021-05-10 11:48:14 +02:00
Piotr Sarna
fb4e8951f5 qos: make getting service level public 2021-05-10 11:48:14 +02:00
Piotr Sarna
06d0e1853d qos: make finding service level public 2021-05-10 11:48:14 +02:00
Piotr Sarna
aa37974192 cql3: add timeout to service level params
Timeout value can now be properly parsed from CQL.
2021-05-10 10:43:21 +02:00
Piotr Sarna
3339ea1d0d qos: add timeout to service level info
Service level information now consists of the timeout config,
which stores the timeout value for all operations.
2021-05-10 10:22:11 +02:00
Eliran Sinvani
fc93133cbe Service level controller: fix wrong default service level removal log
An out of block log print resulted in repeated prints about removal of
the default service level. The period of this print is every time the
configuration is scanned for changes. It happens when the default
service level is one of the last on the map (sorted as in the map).

Fixes #8567

Closes #8576
2021-05-03 09:08:41 +03:00
Eliran Sinvani
02d37cb133 workload prioritization: Fix configuration change detection
The configuration detection is based on a loop that
advances two iterators and compares the two collection
for deducing the configuration change. In order to
correctly deduce the changes the iteration have to be
according to the key (service level name) order for both
of the collections. If it doesn't happen the results are
undefined and in some cases can lead to a crash of the
system. The bug is that the _service_level_db field was
implemented using an unordered_map which obviously don't
guarantie the configuration change detection assumption.
The fix was simply to change the field type to a map
instead of unordered_map.

Another problem is that when a static service level (i.e
default) is at the end of the keys list, it is repeatedly
being deleted - which doesn't really do anything since deleting
a static service level is just retaining it's defult values
but it is stil wrong.
2021-04-27 12:29:31 +02:00
Eliran Sinvani
946fc6af08 workload prioritization: add exception protection in configuration polling
Exceptions around the loop polling were not handled properly.
This is an issue due to the fact that if an unhandled exception
slips out to the configuration polling loop itself it will break
it. When the configuration polling loop is broken, any further
change to the configuration will not be acted uppon in the nodes
where the loop is broken until the node is restarted. The chances
for exceptions are now greater than before since in one of the
previous commits we started quering the workload prioritization
configuration table with a sensible, shorter timeout.
This change also adds a logger for the workload prioritization
module and some logging mainly arround the configuration polling loop.

Most logs are added in the info level since they are not expected to
happen frequently but when they do we would like to have some
information by default regarding what broke the loop.
2021-04-27 12:29:31 +02:00
Piotr Sarna
55ae110774 qos: make sure to wait for sl updates on shutdown
The service level controller spawns an updating thread,
which wasn't properly waited for during shutdown.
This behavior is now fixed.
In order to make the shutdown order more standardized,
the operation is split into two phases - draining and stopping.

Tests: manual

Fixes #8468
2021-04-22 09:58:27 +02:00
Piotr Sarna
3626bc253d service: make enable_shared_from_this inheritance public
Without being public, making shared pointer from the service level
accessor is not accessible outside of the class.
2021-04-12 16:31:27 +02:00
Eliran Sinvani
8493e19840 qos: Add a standard implementation for service level data accessor
service_level_controller defines an interface for accessing the service
level distributed data, this patch implements a standard implementation
of the interface that delegates to the system distributed keyspace.
Message-Id: <25e68302f6f4d4fe5fcb66ea19159ad68506ba64.1609175314.git.sarna@scylladb.com>
2021-04-12 16:01:04 +02:00
Piotr Sarna
41951d34ad qos: add waiting for the updater future
The distributed data updated used to spawn a future without waiting
for it. It was quite safe, since the future had its own abort source,
but it's better to remember it and wait for it during stop() anyway.
2021-04-12 16:01:04 +02:00
Eliran Sinvani
a54ea4667b service/qos: adding service level controller
adding the service level controller implementation. The implementation
follows the design in:
https://docs.google.com/document/d/1RrSTZ3ZX86-YDt2POwAVwFeKN9uX8frEvATJda5n1FU/edit?usp=sharing
Some interfaces were added for registration with system componnents.
The method of registration is chosen over a constructor parameter, due to
the componnets being initialized prior to the service level controller being created.
Message-Id: <e9c4e7d5b411062b6a553f5c6861e7875cd71d2c.1609171761.git.sarna@scylladb.com>
2021-04-12 16:01:04 +02:00
Eliran Sinvani
4fea0762c2 service/qos: add common definitions
Adding common definitions that will be used by the
performance isolation classes. Mainly defines the
common ground for configuring a service level
through the service level options structure.

Message-Id: <12476f4a8e21af3a4c7a892683940698f3beacce.1609160860.git.sarna@scylladb.com>
2021-04-12 15:58:09 +02:00