Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual-licensed.
For the latter case, I chose (AGPL-3.0-or-later AND Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes were applied mechanically with a script, except to
licenses/README.md.
Closes #9937
To emphasize that the function requires a `seastar::thread`
context to work properly.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This mini series contains two fixes that are bundled together since the
second one assumes that the first one exists (or it would not fix
anything really...). The two problems were:
1. When certain operations are called on a service level controller
which doesn't have its data accessor set, they can lead to a crash,
since some operations will still try to dereference the accessor
pointer.
2. The CQL environment test initialized the accessor with a
sharded<system_distributed_data>&; however, this sharded instance
itself is not initialized (sharded::start wasn't called), so the same
operations that previously risked a null dereference will now crash
trying to access an uninitialized sharded instance.
Closes #9468
* github.com:scylladb/scylla:
CQL test environment: Fix bad initialization order
Service Level Controller: Fix possible dereference of a null pointer
If the service level controller doesn't have its data accessor set,
calls that fetch distributed information might dereference this
unset accessor pointer. Here we add code that returns a result as
if no data is available to the accessor (a behaviour which is
roughly equivalent to a null data accessor).
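The guard described above can be sketched roughly as follows; the type
and function names here are illustrative stand-ins, not Scylla's actual
API:

```cpp
#include <string>
#include <vector>

// Hypothetical sketch of the null-accessor guard; names are
// assumptions, not the real interface.
struct service_level_record {
    std::string name;
};

struct data_accessor {
    std::vector<service_level_record> get_distributed_info() const {
        return {{"sl1"}};
    }
};

class service_level_controller {
    const data_accessor* _accessor = nullptr;  // may legitimately be unset
public:
    void set_accessor(const data_accessor* a) { _accessor = a; }

    // Instead of dereferencing an unset accessor, behave as if no data
    // is available (roughly equivalent to a null data accessor).
    std::vector<service_level_record> get_distributed_info() const {
        if (!_accessor) {
            return {};
        }
        return _accessor->get_distributed_info();
    }
};
```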
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
In order to ease future extensions to the information being sent
by the service level configuration change API, we pack the additional
parameters (other than the service level options) passed to the
interface into a structure. This will allow an easy expansion in the
future if more parameters need to be sent to the observer.
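A minimal sketch of the packing idea; the struct and field names below
are assumptions for illustration, not the actual interface:

```cpp
#include <string>

// Packing the extra parameters into a single struct lets new fields be
// appended later without changing every observer's signature.
struct service_level_info {
    std::string name;  // future parameters can be added here
};

struct qos_configuration_observer {
    virtual void on_update(const service_level_info& info) = 0;
    virtual ~qos_configuration_observer() = default;
};

// A trivial observer used only to demonstrate the interface.
struct recording_observer : qos_configuration_observer {
    std::string last;
    void on_update(const service_level_info& info) override {
        last = info.name;
    }
};
```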
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Before this commit, the service_level_controller would notify
the subscribers on stale deletes, meaning deletes of locally
non-existent service levels.
The code flow shouldn't ever get to such a state, but as long
as this condition is checked instead of being asserted, it is
worthwhile to make the code safe.
Closes #9253
changes
This change adds an API for registering a listener for service level
configuration changes. It notifies about removal, addition, and change
of a service level.
The hidden assumption is that some listeners are going to create and/or
manage service-level-specific resources, and this is what guided the
timing of the calls to the subscribers.
Addition and change notifications are delivered before the actual
change takes place; this guarantees that resource creation can happen
before the service level or its new configuration starts to be used.
The deletion notification is delivered only after the deletion took
place, which guarantees that the service level can't be active anymore
and the resources created for it can be safely destroyed.
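The ordering contract above can be sketched like this (all names here
are illustrative, not the real subscriber interface): add/change
subscribers run before the registry is mutated, removal subscribers
after.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Sketch of the notification ordering: resources can be created before
// the level is used, and destroyed only once it is inactive.
struct subscriber {
    std::function<void(const std::string&)> before_add_or_change;
    std::function<void(const std::string&)> after_remove;
};

class registry {
    std::map<std::string, int> _levels;
    std::vector<subscriber*> _subs;
public:
    void subscribe(subscriber* s) { _subs.push_back(s); }
    void add_or_update(const std::string& name, int opts) {
        for (auto* s : _subs) s->before_add_or_change(name); // resources first
        _levels[name] = opts;                                // then apply
    }
    void remove(const std::string& name) {
        _levels.erase(name);                                 // deactivate first
        for (auto* s : _subs) s->after_remove(name);         // then clean up
    }
    bool contains(const std::string& name) const {
        return _levels.count(name) != 0;
    }
};
```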
Some .cc files across the code base include the storage service
header for no real need. Drop the header and include (in some files)
what's really needed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
the cluster
This change subscribes service_level_controller to node life cycle
notifications and uses the notification of the current node leaving
the cluster to stop the configuration polling loop. If the loop
continues to run, its queries will fail consistently since the other
nodes will not answer them. It is worth mentioning that the failing
queries are harmless in the current state of the code, but noisy:
after 90 seconds, if the scylla process is not shut down, the failures
will start to generate failure logs every 90 seconds, which is
confusing for users.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Before this patch, every failure to pull the configuration was
reported as a warning. However, this is confusing for users for two
reasons:
1. It pollutes the logs if the configuration is polled, which is
Scylla's mode of operation. Such a line is logged on every failed
iteration.
2. It confuses users because even though the level is only warning,
the message logs an exception and contains the word "failed".
We see it a lot during QA runs and customer questions from the field.
Point 2 is only solvable by reducing the verbosity of the logged
information, which would make debugging harder.
Point 1 is addressed here in the following manner: first, the
one-shot configuration pull function no longer handles the exception
itself. This is OK because it is harmless to fail once or twice in a
row when pulling the configuration, as with any other query; the
caller is the one responsible for handling the exception and logging
the information. Second, the polling loop captures the exceptions
thrown from the configuration pulling function and only reports an
error with the latest exception if the polling has failed in
consecutive iterations over the last 90 seconds. This value was chosen
because it is about the empirical worst-case time it takes for a node
to notice that one of the other nodes in the cluster is down (and
hence stop querying it).
It is not important for the user or for us to be notified of temporary
glitches in availability (through this error at least), and since we
are eventually consistent, it is OK that some nodes catch up with the
configuration later than others.
We also set a threshold after which, if the configuration still
couldn't be retrieved, the logging level is bumped to ERROR.
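The suppression policy described above can be sketched as a small
state machine (names and the exact reporting API are assumptions): a
single failed pull stays silent, and an error is surfaced only once
the failures have been continuous for the grace period.

```cpp
#include <chrono>
#include <optional>

using clk = std::chrono::steady_clock;

// Sketch: report an error only after consecutive failures have lasted
// longer than the grace period (90s in the commit above).
class failure_reporter {
    std::optional<clk::time_point> _failing_since;
    clk::duration _grace;
public:
    explicit failure_reporter(clk::duration grace) : _grace(grace) {}

    // Called once per polling iteration; returns true when an error
    // should be logged for a sustained failure window.
    bool on_result(bool ok, clk::time_point now) {
        if (ok) {
            _failing_since.reset();  // any success resets the window
            return false;
        }
        if (!_failing_since) {
            _failing_since = now;    // first failure: stay quiet
            return false;
        }
        return now - *_failing_since >= _grace;
    }
};
```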
Closes #8574
Originally, the API for finding a service level controller returned
its name, which also implied that only a single service level
may be active for a user and provide its options.
After adding timeout parameters it makes more sense to return a result
which combines multiple service level parameters - e.g. a user
can be attached to one level for read timeouts and a separate one
for write timeouts.
A log print placed outside its intended block resulted in repeated
prints about removal of the default service level. The print recurred
every time the configuration was scanned for changes. It happens when
the default service level is one of the last in the (key-sorted) map.
Fixes #8567
Closes #8576
The configuration change detection is based on a loop that
advances two iterators and compares the two collections
to deduce the configuration change. In order to
correctly deduce the changes, the iteration has to be
in key (service level name) order for both
of the collections. If that doesn't hold, the results are
undefined and in some cases can lead to a crash of the
system. The bug is that the _service_level_db field was
implemented using an unordered_map, which obviously doesn't
guarantee the ordering assumption the change detection relies on.
The fix was simply to change the field type to a map
instead of an unordered_map.
Another problem is that when a static service level (i.e.
the default one) is at the end of the key list, it is repeatedly
being deleted - which doesn't really do anything, since deleting
a static service level just retains its default values,
but it is still wrong.
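The two-iterator comparison described above can be sketched as a
sorted-merge diff (the function and enum names are illustrative). It
only works because std::map iterates in key order; std::unordered_map
gives no such guarantee, which is exactly the bug being fixed.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

enum class change { added, removed, updated };

// Sketch: deduce configuration changes by advancing two iterators over
// key-ordered collections in lockstep.
std::vector<std::pair<std::string, change>>
diff(const std::map<std::string, int>& old_cfg,
     const std::map<std::string, int>& new_cfg) {
    std::vector<std::pair<std::string, change>> out;
    auto it_old = old_cfg.begin();
    auto it_new = new_cfg.begin();
    while (it_old != old_cfg.end() || it_new != new_cfg.end()) {
        if (it_new == new_cfg.end() ||
            (it_old != old_cfg.end() && it_old->first < it_new->first)) {
            out.emplace_back(it_old->first, change::removed);  // only in old
            ++it_old;
        } else if (it_old == old_cfg.end() ||
                   it_new->first < it_old->first) {
            out.emplace_back(it_new->first, change::added);    // only in new
            ++it_new;
        } else {
            if (it_old->second != it_new->second) {
                out.emplace_back(it_new->first, change::updated);
            }
            ++it_old;
            ++it_new;
        }
    }
    return out;
}
```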
Exceptions around the polling loop were not handled properly.
This is an issue because if an unhandled exception
slips out to the configuration polling loop itself, it will break
the loop. When the configuration polling loop is broken, any further
change to the configuration will not be acted upon on the nodes
where the loop is broken until the node is restarted. The chances
of exceptions are now greater than before, since in one of the
previous commits we started querying the workload prioritization
configuration table with a sensible, shorter timeout.
This change also adds a logger for the workload prioritization
module and some logging, mainly around the configuration polling loop.
Most logs are added at the info level since they are not expected to
happen frequently, but when they do we would like to have some
information by default regarding what broke the loop.
The service level controller spawns an updating thread,
which wasn't properly waited for during shutdown.
This behavior is now fixed.
In order to make the shutdown order more standardized,
the operation is split into two phases - draining and stopping.
Tests: manual
Fixes#8468
The distributed data updater used to spawn a future without waiting
for it. It was quite safe, since the future had its own abort source,
but it's better to remember it and wait for it during stop() anyway.
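The pattern can be sketched in plain C++ (a std::thread and an atomic
flag standing in for Seastar's future and abort_source; all names are
illustrative): the background task is remembered at start() and joined
in stop() instead of being left detached.

```cpp
#include <atomic>
#include <thread>

// Sketch: remember the spawned background task and wait for it in
// stop(), rather than fire-and-forget.
class distributed_data_updater {
    std::atomic<bool> _abort{false};  // stand-in for the abort source
    std::thread _updater;             // remembered, not detached
    std::atomic<int> _iterations{0};
public:
    void start() {
        _updater = std::thread([this] {
            while (!_abort.load()) {
                ++_iterations;        // one update round
                std::this_thread::yield();
            }
        });
    }
    void stop() {
        _abort.store(true);           // signal the loop to exit...
        if (_updater.joinable()) {
            _updater.join();          // ...and actually wait for it
        }
    }
    int iterations() const { return _iterations.load(); }
};
```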