Commit Graph

51 Commits

Author SHA1 Message Date
Michał Jadwiszczak
4e236f3392 cql3/statements/create_service_level: forbid creating SL starting with $
Tenant names starting with `$` are reserved for internal ones.
Forbid creating new service level which name starts with `$`
and log a warning for existing service levels with `$` prefix.

(cherry picked from commit d729d1b272)

Closes scylladb/scylladb#20198
2024-08-19 16:20:48 +03:00
Emil Maskovsky
0770069dda raft: use the abort source reference in raft group0 client interface
Most callers of the raft group0 client interface are passing a real
source instance, so we can use the abort source reference in the client
interface. This change makes the code simpler and more consistent.

(cherry picked from commit 2dbe9ef2f2)
2024-08-01 19:36:00 +02:00
Pavel Emelyanov
a30337e719 service_level_controller_test: Use topology::is_me() helper
The on_leave_cluster() callback needs to check if the leaving node is
the local one. It currently compares endpoint with the my_address()
obtained via pretty long dependency chain of

  auth_service->query_processor->storage_proxy->database->token_metadata

This patch makes the whole thing _much_ shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-14 15:47:12 +03:00
Pavel Emelyanov
634c066c43 service_level_controller: Add dependency on shared_token_metadata
The controller needs to access topology, so it needs the token metadata
at hand.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-14 15:43:01 +03:00
Wojciech Mitros
8472c46c8a service_level_controller: coroutinize notify_service_level_removed
To avoid conflicts arising from the discrepancy between different
versions of the repository, use coroutines instead of continuations
in service_level_controller::notify_service_level_removed().

Closes scylladb/scylladb#18525
2024-05-06 14:20:49 +03:00
Piotr Dulikowski
64ba620dc2 Merge 'hinted handoff: Use host IDs instead of IPs in the module' from Dawid Mędrek
This pull request introduces host ID in the Hinted Handoff module. Nodes are now identified by their host IDs instead of their IPs. The conversion occurs on the boundary between the module and `storage_proxy.hh`, but aside from that, IPs have been erased.

The changes take into considerations that there might still be old hints, still identified by IPs, on disk – at start-up, we map them to host IDs if it's possible so that they're not lost.

Refs scylladb/scylladb#6403
Fixes scylladb/scylladb#12278

Closes scylladb/scylladb#15567

* github.com:scylladb/scylladb:
  docs: Update Hinted Handoff documentation
  db/hints: Add endpoint_downtime_not_bigger_than()
  db/hints: Migrate hinted handoff when cluster feature is enabled
  db/hints: Handle arbitrary directories in resource manager
  db/hints: Start using hint_directory_manager
  db/hints: Enforce providing IP in get_ep_manager()
  db/hints: Introduce hint_directory_manager
  db/hints/resource_manager: Update function description
  db/hints: Coroutinize space_watchdog::scan_one_ep_dir()
  db/hints: Expose update lock of space watchdog
  db/hints: Add function for migrating hint directories to host ID
  db/hints: Take both IP and host ID when storing hints
  db/hints: Prepare initializing endpoint managers for migrating from IP to host ID
  db/hints: Migrate to locator::host_id
  db/hints: Remove noexcept in do_send_one_mutation()
  service: Add locator::host_id to on_leave_cluster
  service: Fix indentation
  db/hints: Fix indentation
2024-05-06 09:58:18 +02:00
Benny Halevy
ebff5f5d70 everywhere: include seastar headers using angle brackets
seastar is an external library therefore it should
use the system-include syntax.

Closes scylladb/scylladb#18513
2024-05-06 10:00:31 +03:00
Dawid Medrek
54ae9797b9 service: Add locator::host_id to on_leave_cluster
We extend the function
endpoint_lifecycle_subscriber::on_leave_cluster
by another argument -- locator::host_id.
It's more convenient to have a consistent
pair of IP and host ID.
2024-04-26 22:44:03 +02:00
Kefu Chai
e2f3fed373 service: qos: fix a typo
s/accesor/accessor/

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18124
2024-04-03 10:33:54 +02:00
Marcin Maliszkiewicz
ff17a29b54 service: qos: create separate function for reloading data accessor
Scylla's main is already too long, it's better to contain this logic inside qos service.
2024-03-26 17:26:19 +01:00
Michał Jadwiszczak
a08918a320 main: create raft dda if sl data was migrated
Create `raft_service_levels_distributed_data_accessor` if service levels
were migrated to v2 table.
This supports raft recovery mode, as service levels will be read from v2
table in the mode.
2024-03-21 23:14:57 +01:00
Michał Jadwiszczak
dab909b1d1 service:qos: store information about sl data migration
Save information whether service levels data was migrated to v2 table.
The information is stored in `system.scylla_local` table. It's
written with raft command and included in raft snapshot.
2024-03-21 23:14:57 +01:00
Michał Jadwiszczak
2917ec5d51 service:qos: service levels migration
Migrate data from `system_distributes.service_levels` to
`system.service_levels_v2` during raft topology upgrade.

Migration process reads data from old table with CL ALL
and inserts the data to the new table via raft.
2024-03-21 23:14:57 +01:00
Michał Jadwiszczak
159a6a2169 service:qos: fix is_v2() method 2024-03-21 23:14:57 +01:00
Michał Jadwiszczak
fd32f5162a service:qos: add a method to upgrade data accessor 2024-03-21 23:14:57 +01:00
Michał Jadwiszczak
d5fa0747d7 service:qos: add abort_source for group0 operations
Add mechanism to abort ongoing group0 operations while draining
service_level_controller or leaving the cluster.
2024-03-21 23:14:57 +01:00
Michał Jadwiszczak
71c07addb5 service:qos: use group0_guard in data accessor
Adjust service_level_controller and
service_level_controller::service_level_distributed_data_accessor
interfaces to take `group0_guard` while adding/altering/dropping a
service level.
2024-03-21 23:14:57 +01:00
Michał Jadwiszczak
da82c5f0b0 cql3:statements: run service level statements on shard0 with raft guard
To migrate service levels to be raft managed, obtain `group0_guard` to
be able to pass it to service_level_controller's methods.

Using this mechanism also automatically provides retries in case of
concurrent group0 operation.
2024-03-21 23:14:57 +01:00
Michał Jadwiszczak
c0e22fcb9c service:qos: fix indentation 2024-03-21 23:14:57 +01:00
Michał Jadwiszczak
1f3c6b2813 service:qos: coroutinize some of the methods
Functions:
- `service_level_controller::set_distributed_service_level()`
- `service_level_controller::drop_distributed_service_level()`
- `service_level_controller::drain()`

Coroutines increase readability of those functions.
2024-03-21 23:14:57 +01:00
Marcin Maliszkiewicz
9f172f1843 auth: coroutinize functions in standard_role_manager
Affected functions are: find_record, create_default_role_if_missing,
create_or_replace, drop, modify_membership, query_all, get_attribute,
set_attribute, remove_attribute
2024-03-01 16:25:14 +01:00
Avi Kivity
7cb1c10fed treewide: replace seastar::future::get0() with seastar::future::get()
get0() dates back from the days where Seastar futures carried tuples, and
get0() was a way to get the first (and usually only) element. Now
it's a distraction, and Seastar is likely to deprecate and remove it.

Replace with seastar::future::get(), which does the same thing.
2024-02-02 22:12:57 +08:00
Botond Dénes
f22fc88a64 Merge 'Configure service levels interval' from Michał Jadwiszczak
Service level controller updates itself in interval. However the interval time is hardcoded in main to 10 seconds and it leads to long sleeps in some of the tests.

This patch moves this value to `service_levels_interval_ms` command line option and sets this value to 0.5s in cql-pytest.

Closes scylladb/scylladb#16394

* github.com:scylladb/scylladb:
  test:cql-pytest: change service levels intervals in tests
  configure service levels interval
2024-01-17 12:24:49 +02:00
Kefu Chai
ece2bd2f6e service: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16764
2024-01-15 13:29:33 +02:00
Michał Jadwiszczak
f6a464ad81 configure service levels interval
So far the service levels interval, responsible for updating SL configuration,
was hardcoded in main.
Now it's extracted to `service_levels_interval_ms` option.
2024-01-12 10:28:24 +01:00
Benny Halevy
0b310c471c service_level_controller: use locator::topology rather than fb_utilities
Expose cql3::query_processor in auth::service
to get to the topology via storage_proxy.replica::database

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-05 10:17:47 +02:00
Yaniv Kaul
c658bdb150 Typos: fix typos in comments
Fixes some typos as found by codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.

Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2023-12-02 22:37:22 +02:00
Michał Jadwiszczak
1b08338fe7 service:qos: add option to include effective names to SLO
Allow to include `slo_effective_names` in `service_level_options`
to be able to determine from which service level the specific option value comes from.
2023-11-30 13:07:20 +01:00
Pavel Emelyanov
66e43912d6 code: Switch to seastar API level 7
In that level no io_priority_class-es exist. Instead, all the IO happens
in the context of current sched-group. File API no longer accepts prio
class argument (and makes io_intent arg mandatory to impls).

So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritants to updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command

The first change is huge and was made semi-autimatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields

Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass intent to lower file (if applicatble)

Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile

The scylla-gdb.py update is a bit hairry -- it needs to use task queues
list for IO classes names and shares, but to detect it should it checks
for the "commitlog" group is present.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13963
2023-06-06 13:29:16 +03:00
Kefu Chai
0cb842797a treewide: do not define/capture unused variables
these warnings are found by Clang-17 after removing
`-Wno-unused-lambda-capture` and '-Wno-unused-variable' from
the list of disabled warnings in `configure.py`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-15 22:57:18 +02:00
Avi Kivity
afc06f0017 messaging: forward-declare types in messaging_service.hh
messaging_service.hh is a switchboard - it includes many things,
and many things include it. Therefore, changes in the things it
includes affect many translation units.

Reduce the dependencies by forward-declaring as much as possible.
This isn't pretty, but it reduces compile time and recompilations.

Other headers adjusted as needed so everything (including
`ninja dev-headers`) still compile.

Closes #10755
2022-06-09 15:52:12 +03:00
MaciekCisowski
439001b8c2 service_level_controller: fix small typo in exception message
Closes #10136
2022-02-26 22:23:26 +02:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes we applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Pavel Solodovnikov
b958e85c54 utils: atomic_vector: rename for_each to thread_for_each
To emphasize that the function requires `seastar::thread`
context to function properly.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Piotr Sarna
4bfaa7d9fc Merge 'Service levels: fix undefined behaviours' from Eliran Sinvani
This mini series contains two fixes that are bundled together since the
second one assumes that the first one exists (or it will not fix
anything really...), the two problems were:
1. When certain operations are called on a service level controller
   which doesn't have it's data accessor set, it can lead to a crash
since some operations will still try to dereference the accessor
pointer.
2. The cql environment test initialized the accessor with a
   sharded<system_distributed_data>& however this sharded class as
itself is not initialized (sharded::start wasn't called), so for the
same that were unsafe for null dereference the accessor will now crash
for trying to access uninitialized sharded instance.

Closes #9468

* github.com:scylladb/scylla:
  CQL test environment: Fix bad initialization order
  Service Level Controller: Fix possible dereference of a null pointer
2021-10-18 08:53:53 +02:00
Eliran Sinvani
6d3e8055f9 Service Level Controller: Fix possible dereference of a null pointer
If the service level controller don't have his data accessor set,
calls for getting of distributed information might dereference this
unset pointer for the accessor. Here we add code that will return
a result as if there is no data available to the accessor (a behaviour
which is roughly equivalent to a null data accessor).

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
2021-10-12 13:27:50 +03:00
Avi Kivity
f6d59c33ff service: service_level_controller: drop unused variable sl_compare
Reported by gcc 11.
2021-10-10 18:16:50 +03:00
Eliran Sinvani
c38ceafdcf Service Level Controller: Add an extention point to the API (#9374)
In order to ease future extensions to the information being sent
by the service level configuration change API, we pack the additional
parameters (other the the service level options) to the interface in a
structure. This will allow an easy expansion in the future if more
parameters needs to be sent to the observer.i

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
2021-10-01 10:20:28 +03:00
Eliran Sinvani
7f44736939 Service Levels: do not notify stale service level removals
Before this commit, the service_level_controller will notify
the subscribers on stale deletes, meaning, deletes of localy
non exixtent service_levels.
The code flow shouldn't ever get to such a state, but as long
as this condition is checked instead of being asserted it is
worthwhile to change the code to be safe.

Closes #9253
2021-08-26 18:27:52 +03:00
Eliran Sinvani
47d3862b63 Service Level Controller: Add a listener API for service level config
changes

This change adds an api for registering a listener for service_level
configuration chanhes. It notifies about removal addition and change of
service level.
The hidden assumption is that some listeners are going to create and/or
manage service level specific resources and this it what guided the
time of the call to the subscriber.
Addition and change of a service level are called before the actual
change takes place, this guaranties that resource creation can take
place before the service level or new config starts to be used.
The deletion notification is called only after the deletion took place
and this guranties that the service level can't be active and the
resources created can be safely destroyed.
2021-08-16 11:38:59 +03:00
Pavel Emelyanov
c39f04fa6f code: Remove storage-service header from irrelevant places
Some .cc files over the code include the storage service
for no real need. Drop the header and include (in some)
what's really needed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-22 18:50:19 +03:00
Eliran Sinvani
ccdef39d21 Service Level Controller: Stop configuration polling loop upon leaving
the cluster

This change subscribes service_level_controller for nodes life cycle
notifications and uses the notification of leaving the cluster for
the current node to stop the configuration polling loop. If the loop
continues to run it's queries will fail consistently since the nodes
will not answers to queries. It is worth mentioning that the queries
failing in the current state of code is harmles but noisy since after
90 seconsd, if the scylla process is not shut down the failures will
start to generate failure logs every 90 seconds which is confusing for
users.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
2021-07-14 09:31:40 +03:00
Avi Kivity
a55b434a2b treewide: extent copyright statements to present day 2021-06-06 19:18:49 +03:00
Eliran Sinvani
f2091bb227 workload prioritization: Reduce the logging sensitivity to "glitches" in availability
Before this patch every failure to pull the configuration have been
reported as a warning. However this is confusing for users for two
reasons:
1. It pollutes the logs if the configuration is polled which is Scylla's
   mode of operation. Such a line is logged every failed iteration.
2. It confuses users because even though this level is warning, it logs
   out an exception and the log message contains the word failed.

We see it a lot during QA runs and customer questions from the field.

Point 2 is only solvable by reducing the verbosity of the logged
information, which will make debugging harder.
Point 1 is addressed here in the following manner, first the
one shot configuration pull function is not handling the exception
itself, this is OK because it is harmless to fail once or twice in a
row in configuration pulling like in every other query, the caller is
the one that will be responsible to handle the exception and log the
information. Second, the polling loop capture the exceptions being
thrown from the configuration pulling function and only report an error
with the latest exception if the polling has failed in consecutive
iterations over the last 90 seconds. This value was chosen because this
is about the empirical worst case time that it takes to a node to notice
one of the other nodes in the cluster is down (hence not querying it).

It is not important for the user or to us to be notified on temporary
glitches in availability (through this error at least) and since we are
eventually consistent is ok that some nodes will catch up with the
configuration later than others.

We also set a threshold in which if the configuration still couldn't be
retrieved then the logging level is bumped to ERROR.

Closes #8574
2021-05-24 10:51:47 +02:00
Piotr Sarna
368a6976ff qos: allow returning combined service level options
Originally, the API for finding a service level controller returned
its name, which also implied that only a single service level
may be active for a user and provide its options.
After adding timeout parameters it makes more sense to return a result
which combines multiple service level parameters - e.g. a user
can be attached to one level for read timeouts and a separate one
for write timeouts.
2021-05-10 12:39:41 +02:00
Eliran Sinvani
fc93133cbe Service level controller: fix wrong default service level removal log
An out of block log print resulted in repeated prints about removal of
the default service level. The period of this print is every time the
configuration is scanned for changes. It happens when the default
service level is one of the last on the map (sorted as in the map).

Fixes #8567

Closes #8576
2021-05-03 09:08:41 +03:00
Eliran Sinvani
02d37cb133 workload prioritization: Fix configuration change detection
The configuration detection is based on a loop that
advances two iterators and compares the two collection
for deducing the configuration change. In order to
correctly deduce the changes the iteration have to be
according to the key (service level name) order for both
of the collections. If it doesn't happen the results are
undefined and in some cases can lead to a crash of the
system. The bug is that the _service_level_db field was
implemented using an unordered_map which obviously don't
guarantie the configuration change detection assumption.
The fix was simply to change the field type to a map
instead of unordered_map.

Another problem is that when a static service level (i.e
default) is at the end of the keys list, it is repeatedly
being deleted - which doesn't really do anything since deleting
a static service level is just retaining it's defult values
but it is stil wrong.
2021-04-27 12:29:31 +02:00
Eliran Sinvani
946fc6af08 workload prioritization: add exception protection in configuration polling
Exceptions around the loop polling were not handled properly.
This is an issue due to the fact that if an unhandled exception
slips out to the configuration polling loop itself it will break
it. When the configuration polling loop is broken, any further
change to the configuration will not be acted uppon in the nodes
where the loop is broken until the node is restarted. The chances
for exceptions are now greater than before since in one of the
previous commits we started quering the workload prioritization
configuration table with a sensible, shorter timeout.
This change also adds a logger for the workload prioritization
module and some logging mainly arround the configuration polling loop.

Most logs are added in the info level since they are not expected to
happen frequently but when they do we would like to have some
information by default regarding what broke the loop.
2021-04-27 12:29:31 +02:00
Piotr Sarna
55ae110774 qos: make sure to wait for sl updates on shutdown
The service level controller spawns an updating thread,
which wasn't properly waited for during shutdown.
This behavior is now fixed.
In order to make the shutdown order more standardized,
the operation is split into two phases - draining and stopping.

Tests: manual

Fixes #8468
2021-04-22 09:58:27 +02:00
Piotr Sarna
41951d34ad qos: add waiting for the updater future
The distributed data updated used to spawn a future without waiting
for it. It was quite safe, since the future had its own abort source,
but it's better to remember it and wait for it during stop() anyway.
2021-04-12 16:01:04 +02:00