The supergroup replaces streaming (a.k.a. maintenance as well) group, inherits 200 shares from it and consists of four sub-groups (all have equal shares of 200 withing the new supergroup)
* maintenance_compaction. This group configures `compaction_manager::maintenance_sg()` group. User-triggered compaction runs in it
* backup. This group configures `snapshot_ctl::config::backup_sched_group`. Native backup activity runs there
* maintenance. It's a new "visible" name, everything that was called "maintenance" in the code ran in "streaming" group. Now it will run in "maintenance". The activities include those that don't communicate over RPC (see below why)
* `tablet_allocator::balance_tablets()`
* `sstables_manager::components_reclaim_reload_fiber()`
* `tablet_storage_group_manager::merge_completion_fiber()`
* metrics exporting http server altogether
* streaming. This is purely existing streaming group that just moves under the new supergroup. Everything else that was run there, continues doing so, including
* hints sender
* all view building related components (update generator, builder, workers)
* repair
* stream_manager
* messaging service (except for verb handlers that switch groups)
* join_cluster() activity
* REST API
* ... something else I forgot
The `--maintenance_io_throughput_mb_per_sec` option is introduced. It controls the IO throughput limit applied to the maintenance supergroup. If not set, the `--stream_io_throughput_mb_per_sec` option is used to preserve backward compatibility.
All new sched groups inherit `request_class::maintenance` (however, "backup" seem not to make any requests yet).
Moving more activities from "streaming" into "maintenance" (or its own group) is possible, but one will need to take care of RPC group switching. The thing is that when a client makes an RPC call, the server may switch to one of pre-negotiated scheduling groups. Verbs for existing activities that run in "streaming" group are routed through RPC index that negotiates "streaming" group on the server side. If any of that client code moves to some other group, server will still run the handlers in "streaming" which is not quite expected. That's one of the main reasons why only the selected fibers were moved to their own "maintenance" group. Similar for backup -- this code doesn't use RPC, so it can be moved. Restoring code uses load-and-stream and corresponding RPCs, so it cannot be just moved into its own new group.
Fixes SCYLLADB-351
New feature, not backporting
Closesscylladb/scylladb#28542
* github.com:scylladb/scylladb:
code: Add maintenance/maintenance group
backup: Add maintenance/backup group
compaction: Add maintenance/maintenance_compaction group
main: Introduce maintenance supergroup
main: Move all maintenance sched group into streaming one
database: Use local variable for current_scheduling_group
code: Live-update IO throughputs from main
For counter updates, use a counter ID that is constructed from the
node's rack instead of the node's host ID.
A rack can have at most two active tablet replicas at a time: a single
normal tablet replica, and during tablet migration there are two active
replicas, the normal and pending replica. Therefore we can have two
unique counter IDs per rack that are reused by all replicas in the rack.
We construct the counter ID from the rack UUID, which is constructed
from the name "dc:rack". The pending replica uses a deterministic
variation of the rack's counter ID by negating it.
This improves the performance and size of counter cells by having less
unique counter IDs and less counter shards in a counter cell.
Previously the number of counter shards was the number of different
host_id's that updated the counter, which can be typically the number of
nodes in the cluster and continue growing indefinitely when nodes are
replaced. with the rack-based counter id the number of counter shards
will be at most twice the number of different racks (including removed
racks, which should not be significant).
Fixes SCYLLADB-356
backport not needed - an enhancement
Closesscylladb/scylladb#28901
* github.com:scylladb/scylladb:
docs/dev: add counters doc
counters: reuse counter IDs by rack
For counter updates, use a counter ID that is constructed from the
node's rack instead of the node's host ID.
A rack can have at most two active tablet replicas at a time: a single
normal tablet replica, and during tablet migration there are two active
replicas, the normal and pending replica. Therefore we can have two
unique counter IDs per rack that are reused by all replicas in the rack.
We construct the counter ID from the rack UUID, which is constructed
from the name "dc:rack". The pending replica uses a deterministic
variation of the rack's counter ID by negating it.
This improves the performance and size of counter cells by having less
unique counter IDs and less counter shards in a counter cell.
Previously the number of counter shards was the number of different
host_id's that updated the counter, which can be typically the number of
nodes in the cluster and continue growing indefinitely when nodes are
replaced. with the rack-based counter id the number of counter shards
will be at most twice the number of different racks (including removed
racks, which should not be significant).
Fixes SCYLLADB-356
Add .set_skip_when_empty() to four metrics in replica/database.cc that
are only incremented on very rare error paths and are almost always zero:
- database::dropped_view_updates: view updates dropped due to overload.
NOTE: this metric appears to never be incremented in the current
codebase and may be a candidate for removal.
- database::multishard_query_failed_reader_stops: documented as a 'hard
badness counter' that should always be zero. NOTE: no increment site
was found in the current codebase; may be a candidate for removal.
- database::multishard_query_failed_reader_saves: documented as a 'hard
badness counter' that should always be zero.
- database::total_writes_rejected_due_to_out_of_space_prevention: only
fires when disk utilization is critical and user table writes are
disabled, a very rare operational state.
These metrics create unnecessary reporting overhead when they are
perpetually zero. set_skip_when_empty() suppresses them from metrics
output until they become non-zero.
AI-Assisted: yes
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Closesscylladb/scylladb#29345
Extend `database::add_column_family()` with a `storage_mode` argument.
If the table is under vnodes-to-tablets migration and the storage mode
is "tablets", create a tablet ERM.
Make the distributed loader determine the storage mode from topology
(`intended_storage_mode` column in system.topology).
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
And move some activities from streaming group into it, namely
- tablet_allocator background group
- sstables_manager-s components reclaimer
- tablet storage group manager merge completion fiber
- prometheus
All other activity that was in streaming group remains there, but can be
moved to this group (or to new maintenance subgroup) later.
All but prometheus are patched here, prometheus still uses the
maintenance_sched_group variable in main.cc, so it transparently
moves into new group
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The snapshot_ctl::backup_task_impl runs in configured scheduling group.
Now it's streaming one. This patch introduces the maintenance/backup
group and re-configures backup task with it.
The group gets its --backup_io_throughput_mb_per_sec option that
controls bandwidth limit for this sub-group only.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Compaction manager tells compaction_sched_group from
maintenance_compaction_sched_group. The latter, however, is set to be
"streaming" group. This patch adds real maintenance_compaction group
under the maintenance supergroup and makes compaction manager use it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The classify_request() helper captures current scheduling group into
local variable and compares it with groups from db_config to decide
which "class" it belongs to.
One if uses current_scheduling_group(), while it could use the local
variable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A compaction group has a separator buffer that holds the mixed segments
alive until the separator buffer is flushed. A mixed segment can be
freed only after all separator buffers that hold writes from the segment
are flushed.
Typically a separator buffer is flushed when it becomes full. However
it's possible for example that one compaction groups is filled slower
than others and holds many segments.
To fix this we trigger a separator flush periodically for separator
buffers that hold old segments. We track the active segment sequence
number and for each separator buffer the oldest sequence number it
holds.
Fix the logstor recovery to work with compaction groups. When recovering
a segment find its token range and add it to the appropriate compaction
groups. if it doesn't fit in a single compaction group then write each
record to its compaction group's separator buffer.
Change the primary index to be a btree that is ordered by token,
similarly to a memtable, and create a index per-table instead of a
single global index.
Add a segment_set member to replica::compaction_group that manages the
logstor segments that belong to the compaction group, similarly to how
it manages sstables. Add also a separator buffer in each compaction
group.
When writing a mutation to a compaction group, the mutation is written
to the active segment and to the separator buffer of the compaction
group, and when the separator buffer is flushed the segment is added to
the compaction_group's segment set.
when logstor is enabled, update the db dirty memory limits dynamically.
previously the threshold is set to 0.5 of the available memory, so 0.5
goes to memtables and 0.5 to others (cache).
when logstor is enabled, we calculate the available memory excluding
logstor, and divide it evenly between memtables and cache.
implement freeing all segments of a table for table truncate.
first do barrier to flush all active and mixed segments and put all the
table's data in compaction groups, then stop compaction for the table,
then free the table's segments and remove the live entries from the
index.
add barrier operation that forces switch of the active segment and
separator, and waits for all existing segments to close and all
separators to flush.
add tracking of the total separator debt - writes that were written to a
separator and waiting to be flushed, and add flow control to keep the
debt in control by delaying normal writes.
initial implementation of the separator. it replaces "mixed" segments -
segments that have records from different groups, to segments by group.
every write is written to the active segment and to a buffer in the
active separator. the active separator has in-memory buffers by group.
at some threshold number of segments we switch the active segment and
separator atomically, and start flushing the separator.
the separator is flushed by writing the buffers into new non-mixed
segments, adding them to a compaction group, and frees the mixed
segments.
The limiter scans ranges to decide whether or not to rate-limit the
query. However, when considering each range only the front one's token
is accounted. This looks like a misprint.
The limiter was introduced in cc9a2ad41f
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29050
This test was observed to fail in CI recently but there is not enough information in the logs to figure out what went wrong. This PR makes a few improvements to make the next investigation easier, should it be needed:
* storage-service: add table name to mutation write failure error messages.
* database: the `database_apply` error injection used to cause trouble, catching writes to bystander tables, making tests flaky. To eliminate this, it gained a filter to apply only to non-system keyspaces. Unfortunately, this still allows it to catch writes to the trace tables. While this should not fail the test, it reduces observability, as some traces disappear. Improve this error injection to only apply to selected table. Also merge it with the `database_apply_wait` error injection, to streamline the code a bit.
* test/test_data_resurrection_in_memtable.py: dump data from the datable, before the checks for expected data, so if checks fail, the data in the table is known.
Refs: SCYLLADB-812
Refs: SCYLLADB-870
Fixes: SCYLLADB-1050 (by restricting `database_apply` error injection, so it doesn't affect writes to system traces)
Backport: test related improvement, no backport
Closesscylladb/scylladb#28899
* github.com:scylladb/scylladb:
test/cluster/test_data_resurrection_in_memtable.py: dump rows before check
replica/database: consolidate the two database_apply error injections
service/storage_proxy: add name of table to error message for write errors
Into a single database_apply one. Add three parameters:
* ks_name and cf_name to filter the tables to be affected
* what - what to do: throw or wait
This leads to smaller footprint in the code and improved filtering for
table names at the cost of some extra error injection params in the
tests.
The method is called from storage_proxy::mutate_hint() which is in turn called from hint_mutation::apply_locally(). The latter is either called from directly by hint sender, which already runs in streaming group, or via RPC HINT_MUTATION handler which uses index 1 that negotiates streaming group as well.
To be sure, add a debugging check for current group being the expected one.
Code cleanup, not backporting
Closesscylladb/scylladb#28545
* github.com:scylladb/scylladb:
hint: Don't switch group in database::apply_hint()
hint_sender: Switch to sender group on stop either
Compaction and statement groups are carried over on those configs, but
are in fact unused. Drop both.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28540
There is a handful of places in the code related to dictionary
compression which calls get_units to acquire semaphore units but the
returned future is not awaited, seemingly by mistake. The result of
get_units is assigned to a variable - which is reasonable at a glance
because the semaphore units need to be assigned to a variable in order
to control their scope - but at the same time if co_await is mistakenly
omitted, like here, doing so will silence the nodiscard check of
seastar::future and, effectively, the get_units call will be nearly
useless. Unfortunately, this is an easy mistake to make.
Fix the places in the code that acquire semaphore units via get_units
but never await the future returned by it. I found them by manual code
inspection, so I hope that I didn't miss any.
Closesscylladb/scylladb#28581
The method is called from storage_proxy::mutate_hint() which is in turn
called from hint_mutation::apply_locally(). The latter is either called
from directly by hint sender, which already runs in streaming group, or
via RPC HINT_MUTATION handler which uses index 1 that negotiates streaming
group as well.
To be sure, add a debugging check for current group being the expected one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This field is only used to initialize the following _memtable_controller
one. It's simpler just to do the initialization with whatever value the
field itself is initialized and drop the field itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28539
The new parameter parametrizes the factor used to reject a read
during admission. Its value shall be between 0.0 and 1.0 where
+ 0.0 means a read will never get rejected during admission
+ 1.0 means a read will immediatelly get rejected during admission
Although passing values outside the interaval is possible, they
will have the exact same effects as they were clamped to [0.0, 1.0].
The commit 59faa6d, introduces a new parameter called cpu_concurrency
and sets its default value to 1 which violates the commit fbb83dd that
removes all default values from constructors but one used by the unit
tests.
The patch removes the default value of the cpu_concurrency parameter
and alters tests to use the test dedicated reader_concurrency_semaphore
constructor wherever possible.
In 8df61f6d99 we changed the requirements for creating materialized
views and MV-based indexes - instead of requiring the
rf_rack_valid_keyspaces flag to be set, we now require the keyspace to
be RF-rack-valid at the time of creation, and it is enforced to remain
RF-rack-valid while the MV exists. This validation is done in the cql
create view/index statements.
The same should be done also for alternator - when creating a table with
GSI or LSI, or when adding a GSI to an existing table, previously we
required the flag rf_rack_valid_keyspaces to be set. Now we change it to
instead check if the keyspace is RF-rack-valid, and if not the operation
fails with an appropriate error.
Fixes https://github.com/scylladb/scylladb/issues/28214
backport to 2025.4 to add RF-rack-valid enforcements in alternator
Closesscylladb/scylladb#28154
* github.com:scylladb/scylladb:
locator: document the exception type of assert_rf_rack_valid_keyspace
alternator: don't require rf_rack flag for indexes, validate instead
The function assert_rf_rack_valid_keyspace uses the exception type
std::invalid_argument when the RF-rack validation fails. Document it and
change all callers to catch this specific exception type when checking
for RF-rack validation failures, so that other exception types can be
propagated properly.
Add enforce_rack_list option. When the option is set to true,
all tablet keyspaces have rack list replication factor.
When the option is on:
- CREATE STATEMENT always auto-extends rf to rack lists;
- ALTER STATEMENT fails when there is numeric rf in any DC.
The flag is set to false by default and a node needs to be restarted
in order to change its value. Starting a node with enforce_rack_list
option will fail, if there are any tablet keyspaces with numeric rf
in any DC.
enforce_rack_list is a per-node option and a user needs to ensure
that no tablet keyspace is altered or created while nodes in
the cluster don't have the consistent value.
Mark rf_rack_valid_keyspaces as deprecated.
Fixes: https://github.com/scylladb/scylladb/issues/26399.
New feature; no backport needed
Closesscylladb/scylladb#28084
* github.com:scylladb/scylladb:
test: add test for enforce_rack_list option
db: mark rf_rack_valid_keyspaces as deprecated
config: add enforce_rack_list option
Revert "alternator: require rf_rack_valid_keyspaces when creating index"
And keep the options for now in the local_snapshot_writer.
The options will be used by following patches to pass
extra metadata like the snapshot creation time, expiration time, etc.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add enforce_rack_list option. When the option is set to true,
all tablet keyspaces have rack list replication factor.
When the option is on:
- CREATE STATEMENT always auto-extends rf to rack lists;
- ALTER STATEMENT fails when there is numeric rf in any DC.
The flag is set to false by default and a node needs to be restarted
in order to change its value. Starting a node with enforce_rack_list
option will fail, if there are any tablet keyspaces with numeric rf
in any DC.
enforce_rack_list is a per-node option and a user needs to ensure
that no tablet keyspace is altered or created while nodes in
the cluster don't have the consistent value.
Allow creating materialized views and secondary indexes in a tablets keyspace only if it's RF-rack-valid, and enforce RF-rack-validity while the keyspace has views by restricting some operations:
* Altering a keyspace's RF if it would make the keyspace RF-rack-invalid
* Adding a node in a new rack
* Removing / Decommissioning the last node in a rack
Previously the config option `rf_rack_valid_keyspaces` was required for creating views. We now remove this restriction - it's not needed because we always maintain RF-rack-validity for keyspaces with views.
The restrictions are relevant only for keyspaces with numerical RF. Keyspace with rack-list-based RF are always RF-rack-valid.
Fixesscylladb/scylladb#23345
Fixes https://github.com/scylladb/scylladb/issues/26820
backport to relevant versions for materialized views with tablets since it depends on rf-rack validity
Closesscylladb/scylladb#26354
* github.com:scylladb/scylladb:
docs: update RF-rack restrictions
cql3: don't apply RF-rack restrictions on vector indexes
cql3: add warning when creating mv/index with tablets about rf-rack
service/tablet_allocator: always allow tablet merge of tables with views
locator: extend rf-rack validation for rack lists
test: test rf-rack validity when creating keyspace during node ops
locator: fix rf-rack validation during node join/remove
test: test topology restrictions for views with tablets
test: add test_topology_ops_with_rf_rack_valid
topology coordinator: restrict node join/remove to preserve RF-rack validity
topology coordinator: add validation to node remove
locator: extend rf-rack validation functions
view: change validate_view_keyspace to allow MVs if RF=Racks
db: enforce rf-rack-validity for keyspaces with views
replica/db: add enforce_rf_rack_validity_for_keyspace helper
db: remove enforce parameter from check_rf_rack_validity
test: adjust test to not break rf-rack validity
The comment was added in 83323e155e
Since then, table::seal_active_memtable was improved to guarantee
waiting on oustanding flushes on success (See d55a2ac762), so
we can remove this TODO comment (it also not covered by any issue
so nobody is planned to ever work on it).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
can_flush might return a different value for each shard
so check it right before deciding whether to flush or clear a memtable
shard.
Note that under normal condition can_flush would always return true
now that it checks only the presence of the seal memtable function
rather than check memtable_list::empty().
Fixes#27639
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that we have a unit test proving that it's safe to flush an
empty memtable list there is no need to distinguish between
may_flush and can_flush.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now it's database::snapshot_table_on_all_shards(). This is symmetric to
database::truncate_table_on_all_shards().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Extend the locator function assert_rf_rack_valid_keyspace to accept
arbitrary topology dc-rack maps and nodes instead of using the current
token metadata.
This allows us to add a new variant of the function that checks rf-rack
validity given a topology change that we want to apply. we will use it
to check that rf-rack validity will be maintained before applying the
topology change.
The possible topology changes for the check are node add and node remove
/ decommission. These operations can change the number of normal racks -
if a new node is added to a new rack, or the last node is removed from a
rack.
The function validate_view_keyspace checks if a keyspace is eligible for
having materialized views, and it is used for validation when creating a
MV or a MV-based index.
Previously, it was required that the rf_rack_valid_keyspaces option is
set in order for tablets-based keyspaces to be considered eligible, and
the RF-rack condition was enforced when the option is set.
Instead of this, we change the validation to allow MVs in a keyspace if
the RF-rack condition is satisfied for the keyspace - regardless of the
config option.
We remove the config validation for views on startup that validates the
option `rf_rack_valid_keyspaces` is set if there are any views with
tablets, since this is not required anymore.
We can do this without worrying about upgrades because this change will
be effective from 2025.4 where MVs with tablets are first out of
experimental phase.
We update the test for MV and index restrictions in tablets keyspaces
according to the new requirements.
* Create MV/index: previously the test checked that it's allowed only if
the config option `rf_rack_valid_keyspaces` is set. This is changed
now so it's always allowed to create MV/index if the keyspace is
RF-rack-valid. Update the test to verify that we can create MV/index
when the keyspace is RF-rack-valid, even if the rf_rack option is not
set, and verify that it fails when the keyspace is RF-rack-invalid.
* Alter: Add a new test to verify that while a keyspace has views, it
can't be altered to become RF-rack-invalid.
Extend the RF-rack-validity enforcement to keyspaces that have views,
regardless of the option `rf_rack_valid_keyspaces`.
Previously, RF-rack-validity was enforced when `rf_rack_valid_keyspaces`
was set for all keyspaces. Now we want to allow creating MVs in tablet
keyspaces that are RF-rack-valid and enforce the RF-rack-validity even
if the config option is not set.