To improve the behavior of the removenode operation, we want to issue
a global topology barrier after the removenode has been applied.
However, this requires changing the topology state machine to add a new
state (left_token_ring) to the removenode flow, which is not supported
by older nodes.
To allow rolling upgrades, we add a feature flag
REMOVENODE_WITH_LEFT_TOKEN_RING that controls whether the new removenode
flow is used.
The feature will be used later in this patch series:
- To avoid unnecessary operations when the feature is not enabled.
- To guard new API endpoints from being used before the cluster is
ready to use them.
- To implement update tests (by disabling/enabling the feature).
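A minimal sketch of what such a gate can look like, assuming the feature
is exposed as a cluster feature string (the helper below is hypothetical,
not the actual topology coordinator code):
```cpp
#include <set>
#include <string>

// Hypothetical helper: use the new removenode flow only when every node
// advertises support, i.e. the cluster feature is enabled.
bool use_left_token_ring_flow(const std::set<std::string>& enabled_features) {
    return enabled_features.contains("REMOVENODE_WITH_LEFT_TOKEN_RING");
}
```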
Ref: scylladb/scylla-enterprise#5699
This reverts commit faad0167d7, which causes
a regression in
test_two_tablets_concurrent_repair_and_migration_repair_writer_level
in debug mode (with ~5%-10% probability).
Fixes #27510.
Closes scylladb/scylladb#27560
This patch adds tablet repair progress report support so that the user
can use the /task_manager/task_status API to query the progress.
To support this, a new system table is introduced to record the
user-request-related info, i.e., the start and the end of the request.
The progress stays accurate when a tablet split or merge happens in the
middle of the request, since the tokens of the tablet are recorded when
the request is started and when the repair of each tablet is finished.
The repair of an original tablet is considered finished when the
finished ranges cover the original tablet's token ranges.
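A minimal sketch of that coverage check, assuming half-open, non-wrapping
[start, end) token ranges (names are illustrative, not the actual repair
code):
```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

using token_range = std::pair<int64_t, int64_t>;  // half-open [start, end)

// The original tablet repair is finished once the union of the finished
// ranges covers the token range recorded when the request started.
bool covers(std::vector<token_range> finished, const token_range& original) {
    std::sort(finished.begin(), finished.end());
    int64_t covered_up_to = original.first;
    for (const auto& [start, end] : finished) {
        if (start > covered_up_to) {
            return false;  // a gap remains before this range
        }
        covered_up_to = std::max(covered_up_to, end);
        if (covered_up_to >= original.second) {
            return true;   // the original range is fully covered
        }
    }
    return false;
}
```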
After this patch, the /task_manager/task_status API will report correct
progress_total and progress_completed.
Fixes #22564
Fixes #26896
Closes scylladb/scylladb#26924
And switch to std::source_location.
The upcoming seastar update will deprecate its compatibility layer.
The patch is
for f in $(git grep -l 'seastar::compat::source_location'); do
sed -e 's/seastar::compat::source_location/std::source_location/g' -i $f;
done
and the removal of a few header includes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#27309
The reason for this seastar update is to pick up the fixed handling
of the `integer` type in `seastar-json2code`, which is needed
for further development of the ScyllaDB REST API.
The following changes were introduced to ScyllaDB code to ensure it
compiles with the updated seastar:
- Removed `seastar/util/modules.hh` includes, as the file was removed
from seastar
- Modified `metrics::impl::labels_type` construction in
`test/boost/group0_test.cc`, because it now requires `escaped_string`
* seastar 340e14a7...8c3fba7a (32):
> Merge 'Remove net::packet usage from dns.cc' from Pavel Emelyanov
dns: Optimize packet sending for newer c-ares versions
dns: Replace net::packet with vector<temporary_buffer>
dns: Remove unused local variable
dns: Remove pointless for () loop wrapping
dns: Introduce do_sendv_tcp() method
dns: Introduce do_send_udp() method
> test: Add http rules test of matching order
> Merge 'Generalize packet_data_source into memory_data_source' from Pavel Emelyanov
memcached: Patch test to use memory_data_source
memcached: Use memory_data_source in server
rpc: Use memory_data_sink without constructing net::packet
util: Generalize packet_data_source into memory_data_source
> tests: coroutines: restore "explicit this" tests
> reactor: remove blocking of SIGILL
> Merge 'Update compilers in GH actions scripts' from Pavel Emelyanov
github: Use gcc-14
github: Use clang-20
> Merge 'Reinforce DNS reverse resolution test ' from Pavel Emelyanov
test: Make test_resolve() try several addresses
test: Coroutinize test_resolve() helper
> modules: make module support standards-compliant
> Merge 'Fix incorrect union access in dns resolver' from Pavel Emelyanov
dns: Squash two if blocks together
dns: Do not check tcp entry for udp type
> coroutine: Fix compilation of execute_involving_handle_destruction_in_await_suspend
> promise: Document that promise is resolved at most once
> coroutine: exception: workaround broken destroy coroutine handle in await_suspend
> socket: Return unspecified socket_address for unconnected socket
> smp: Fix exception safety of invoke_on_... internal copying
> Merge 'Improve loads evaluation by reactor' from Pavel Emelyanov
reactor: Keep loads timer on reactor
reactor: Update loads evaluation loop
> Merge 'scripts: add 'integer' type to seastar-json2code' from Andrzej Jackowski
test: extend tests/unit/api.json to use 'integer' type
scripts: add 'integer' type to seastar-json2code
> Merge 'Sanitize tls::session::do_put(_one)? overloads' from Pavel Emelyanov
tls: Rename do_put_one(temporary_buffer) into do_put()
tls: Fix indentation after previous patch
tls: Move semaphore grab into iterating do_put()
> net: tcp: change unsent queue from packets to temporary_buffer:s
> timer: Enable highres timer based on next timeout value
> rpc: Add a new constructor in closed_error to accept string argument
> memcache: Implement own data sink for responses
> Merge 'file: recursive_remove_directory: general cleanup' from Avi Kivity
file: do_recursive_remove_directory(): move object when popping from queue
file: do_recursive_remove_directory(): adjust indentation
file: do_recursive_remove_directory(): coroutinize
file: do_recursive_remove_directory(): simplify conditional
file: do_recursive_remove_directory(): remove wrong const
file: do_recursive_remove_directory(): clean up work_entry
> tests: Move thread_context_switch_test into perf/
> test: Add unit test for append_challenged_posix_file
> Merge 'Prometheus metrics handler optimization' from Travis Downs
prometheus: optimize metrics aggregation
prometheus: move and test aggregate_by helper
prometheus: various optimizations
metrics: introduce escaped_string for label values
metric:value: implement + in terms of +=
tests: add prometheus text format acceptance tests
extract memory_data_sink.hh
metrics_perf: enhance metrics bench
> demos: Simplify udp_zero_copy_demo's way of preparing the packet
> metrics: Remove deprecated make_...-ers
> Merge 'Make slab_test be BOOST kind' from Pavel Emelyanov
test: Use BOOST_REQUIRE checkers
test: Replace some SEASTAR_ASSERT-s with static_assert-s
test: Convert slab test into boost kind
> Merge 'Coroutinize lister_test' from Pavel Emelyanov
test: Fix indentation after previuous patch
test: Coroutinize lister_test lister::report() method
test: Coroutinize lister_test main code
> file: recursive_remove_directory(): use a list instead of a deque
> Merge 'Stop using packets in tls data_sink and session' from Pavel Emelyanov
tls: Stop using net::packet in session::put()
tls: Fix indentation after previous patch
tls: Split session::do_put()
tls: Mark some session methods private
Closes scylladb/scylladb#27240
Add precompiled header support to CMakeLists.txt and configure.py -
it improves compilation time by approximately 10%.
A new header `stdafx.hh` is added; don't include it manually -
the compiler will include it for you. The header contains includes from
external libraries used by Scylla: seastar, the standard library,
Linux headers, and zlib.
The feature is enabled by default; to disable it, use the CMake option
`Scylla_USE_PRECOMPILED_HEADER` or `configure.py --disable-precompiled-header`.
The feature should be disabled when checking headers - otherwise
you might get false negatives on missing includes from seastar / abseil and so on.
Note: the following configuration needs to be added to ccache.conf:
sloppiness = pch_defines,time_macros,include_file_mtime,include_file_ctime
Closes scylladb/scylladb#26617
Now that counters work with tablets, allow creating a table with
counters in a tablets-enabled keyspace, and remove the warning about
counters not being supported when creating a keyspace with tablets.
We allow using counters with tablets only when all nodes are upgraded
and support counters with tablets. We add a new feature flag to
determine whether this is the case.
Fixes scylladb/scylladb#18180
This commit:
- Increases the number of allowed scheduling groups to allow the
creation of `sl:driver`.
- Adds the `DRIVER_SERVICE_LEVEL` feature, which prevents creating
`sl:driver` until all nodes have increased the number of
scheduling groups.
- Starts using `get_create_driver_service_level_mutations`
to unconditionally create `sl:driver` on
`raft_initialize_discovery_leader`. The purpose of this code
path is ensuring existence of `sl:driver` in new system and tests.
- Starts using `migrate_to_driver_service_level` to create `sl:driver`
if it is not already present. The creation of `sl:driver` is
managed by `topology_coordinator`, similar to other system keyspace
updates, such as the `view_builder` migration. The purpose of this
code path is handling upgrades.
- Modifies related tests to pass after `sl:driver` is added.
Later in this patch series, `sl:driver` will be used by
`transport/server` to handle selected traffic, such as the driver's
schema and topology fetches.
Refs: scylladb/scylladb#24411
Allows the per-DC replication factor to be either a string, holding a
numerical value, or a list of strings, holding a list of rack names.
The rack list is not respected yet by the tablet allocator; this is
achieved in a subsequent commit.
This changes the format of options stored in the flattened map
in system_schema.keyspaces#replication. Values which are rack lists
are converted into multiple entries, with the list index appended to
the key with ':' as the separator.
For example, this extended map:
{
'dc1': '3',
'dc2': ['rack1', 'rack2']
}
is stored as a flattened map:
{
'dc1': '3',
'dc2:0': 'rack1',
'dc2:1': 'rack2'
}
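As an illustration, a minimal sketch of the flattening step described
above (the types and names are hypothetical, not the actual schema code):
```cpp
#include <cstddef>
#include <map>
#include <string>
#include <variant>
#include <vector>

// A per-DC value is either a numerical RF ('3') or a list of rack names.
using rf_value = std::variant<std::string, std::vector<std::string>>;

std::map<std::string, std::string>
flatten_replication_options(const std::map<std::string, rf_value>& opts) {
    std::map<std::string, std::string> flat;
    for (const auto& [dc, value] : opts) {
        if (auto* rf = std::get_if<std::string>(&value)) {
            flat[dc] = *rf;  // numerical RF is kept as-is
        } else {
            const auto& racks = std::get<std::vector<std::string>>(value);
            for (std::size_t i = 0; i < racks.size(); ++i) {
                // rack lists become multiple entries: 'dc2:0' -> 'rack1'
                flat[dc + ":" + std::to_string(i)] = racks[i];
            }
        }
    }
    return flat;
}
```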
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>
Extend the `sstable_format` config enum with an "ms" value
and, if it's enabled (in the config and in cluster features),
use it for new sstables on the node.
(Before this commit, writing `ms` sstables was only possible
in unit tests, via internal APIs. After this commit, the format
can be enabled in the config and the database will write it during
normal operation.)
As of this commit, the new format is not the default yet.
(But it will become the default in a later commit in the same series).
Add a new feature flag cdc_with_tablets to protect the schema changes
that are required for the CDC-with-tablets feature. We will also use it
to allow starting to use CDC in tablets-based keyspaces only once all
nodes are upgraded and support this feature.
Sometimes gossiper operations invoked from storage_service and other components run under a non-gossiper scheduling group. If these operations acquire gossiper locks, priority inversion can occur: higher-priority gossiper tasks may wait behind lower-priority tasks (e.g. streaming), which can cause gossiper slowness or even failures.
This patch ensures that gossiper operations requiring locks on gossiper structures are explicitly executed in the gossiper scheduling group. To help detect similar issues in the future, a warning is logged whenever a gossiper lock is acquired under a non-gossiper scheduling group.
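A minimal sketch of the approach, assuming Seastar's
with_scheduling_group (the helper name and the warning hook are
hypothetical; header locations vary across seastar versions):
```cpp
#include <seastar/core/future-util.hh>   // with_scheduling_group
#include <seastar/core/scheduling.hh>    // scheduling_group, current_scheduling_group
#include <seastar/util/noncopyable_function.hh>
#include <utility>

// Hypothetical helper: run lock-taking gossiper work in the gossiper
// scheduling group, warning when invoked from a foreign group so that
// similar priority inversions are easy to spot.
seastar::future<> run_in_gossip_group(seastar::scheduling_group gossip_sg,
        seastar::noncopyable_function<seastar::future<>()> work) {
    if (seastar::current_scheduling_group() != gossip_sg) {
        // the real patch logs a warning here
    }
    return seastar::with_scheduling_group(gossip_sg, std::move(work));
}
```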
Fixes scylladb/scylladb#25907
Refs: scylladb/scylladb#25702
Backport: this patch fixes an issue with the gossiper operations scheduling group that might affect topology operations, so a backport is needed to 2025.1, 2025.2, 2025.3.
Closes scylladb/scylladb#25981
* https://github.com/scylladb/scylladb:
gossiper: ensure gossiper operations are executed in gossiper scheduling group
gossiper: fix wrong gossiper instance used in `force_remove_endpoint`
Sometimes gossiper operations invoked from storage_service and other components
run under a non-gossiper scheduling group. If these operations acquire gossiper
locks, priority inversion can occur: higher-priority gossiper tasks may wait
behind lower-priority tasks (e.g. streaming), which can cause gossiper slowness
or even failures.
This patch ensures that gossiper operations requiring locks on gossiper
structures are explicitly executed in the gossiper scheduling group.
To help detect similar issues in the future, a warning is logged whenever
a gossiper lock is acquired under a non-gossiper scheduling group.
Fixes scylladb/scylladb#25907
`gossiper::force_remove_endpoint` is always executed on shard 0 using
`invoke_on`. Since each shard has its own `gossiper` instance, if
`force_remove_endpoint` is called from a shard other than shard 0,
`my_host_id()` may be invoked on the wrong `gossiper` object. This
results in undefined behavior due to unsynchronized access to resources
on another shard.
Remove a redundant code block inadvertently introduced in commit 4b3d160f34.
While the duplicate did not affect functionality, its presence could
cause confusion and maintenance issues.
This change does not alter behavior and is purely a cleanup.
Fixes: scylladb/scylladb#25999
Backport: The issue exists in all 2025 branches, so it should be
backported accordingly.
Closes scylladb/scylladb#26001
This change removes the addition of an empty state to `_endpoint_state_map`.
Instead, a new state is created locally and then published via replicate,
avoiding the issue of an empty state existing in `_endpoint_state_map`
before the preemption point. Since this resolves the issue tested in
`test_gossiper_empty_self_id_on_shadow_round`, the `xfail` mark has been removed.
Fixes: scylladb/scylladb#25831
This change adds a test for a race condition in `start_gossiping` that
can lead to an empty self state sent in `gossip_get_endpoint_states_response`.
Test for scylladb/scylladb#25831
In do_apply_state_locally, a race condition can occur if a task is
suspended at a preemption point while the node entry is not locked.
During this time, the host may be removed from _endpoint_state_map.
When the task resumes, this can lead to inserting an entry with an
empty host ID into the map, causing various errors, including a node
crash.
This change:
1. Adds a check after locking the map entry: if a gossip ACK update
does not contain a host ID, we verify that an entry with that host ID
still exists in the gossiper’s _endpoint_state_map.
2. Removes xfail from the test_gossiper_race test since the issue is now
fixed.
3. Adds exception handling in `do_shadow_round` to skip responses from
nodes that sent an empty host ID.
This re-applies the commit 13392a40d4 that
was reverted in 46aa59fe49, after fixing
the issues that caused the CI to fail.
Fixes: scylladb/scylladb#25702
Fixes: scylladb/scylladb#25621
Ref: scylladb/scylla-enterprise#5613
This change introduces a targeted test that simulates the gossiper race
condition observed during node decommissioning. The test delays gossip
state application and host ID lookup to reliably reproduce the scenario
where `gossiper::get_host_id()` is called on a removed endpoint,
potentially triggering an abort in `apply_new_states`.
There is a specific error injection added to widen the race window, in
order to increase the likelihood of hitting the race condition. The
error injection is designed to delay the application of gossip state
updates, for the specific node that is being decommissioned. This should
then result in the server abort in the gossiper.
This re-applies the commit 5dac4b38fb that
was reverted in dc44fca67c, but modified
to relax the check from "on_internal_error" to just a warning log. The
stricter check can be re-introduced later, once we are sure that all
remaining problems are resolved and it will not break the CI.
Refs: scylladb/scylladb#25621
Fixes: scylladb/scylladb#25721
In do_apply_state_locally, a race condition can occur if a task is
suspended at a preemption point while the node entry is not locked.
During this time, the host may be removed from _endpoint_state_map.
When the task resumes, this can lead to inserting an entry with an
empty host ID into the map, causing various errors, including a node
crash.
This change adds a check after locking the map entry: if a gossip ACK update
does not contain a host ID, we verify that an entry with that host ID
still exists in the gossiper’s _endpoint_state_map.
Fixes scylladb/scylladb#25702
Fixes scylladb/scylladb#25621
Ref scylladb/scylla-enterprise#5613
Closes scylladb/scylladb#25727
This change introduces a targeted test that simulates the gossiper race
condition observed during node decommissioning. The test delays gossip
state application and host ID lookup to reliably reproduce the scenario
where `gossiper::get_host_id()` is called on a removed endpoint,
potentially triggering an abort in `apply_new_states`.
There is a specific error injection added to widen the race window, in
order to increase the likelihood of hitting the race condition. The
error injection is designed to delay the application of gossip state
updates, for the specific node that is being decommissioned. This should
then result in the server abort in the gossiper.
Refs: scylladb/scylladb#25621
Fixes: scylladb/scylladb#25721
Backport: The test is primarily for an issue found in 2025.1, so it
needs to be backported to all the 2025.x branches.
Closes scylladb/scylladb#25685
This patch introduces `view_building_coordinator`, a single entity within the whole cluster responsible for building tablet-based views.
The view building coordinator takes a slightly different approach than the existing node-local view builder. The whole process is split into smaller view building tasks, one per tablet replica of the base table.
The coordinator builds one base table at a time and can choose another when all views of the currently processed base table are built.
The tasks are started by setting the `STARTED` state and are executed by the node-local view building worker. The tasks are scheduled so that each shard processes only one tablet at a time (multiple tasks can be started for a shard on a node because a table can have multiple views, but then all tasks have the same base table and tablet (last_token)). Once the coordinator starts the tasks, it sends the `work_on_view_building_tasks` RPC to start the tasks and receive their results.
This RPC is resilient to RPC failure or raft leader change, meaning that if one RPC call started a batch of tasks but then failed (for instance, the raft leader was changed and the caller aborted waiting for the response), the next RPC call will attach itself to the already started batch.
The coordinator plugs into handling tablet operations (migration/resize/RF change) and adjusts its tasks accordingly. At the start of each tablet operation, the coordinator aborts necessary view building tasks to prevent https://github.com/scylladb/scylladb/issues/21564. Then, new adjusted tasks are created at the end of the operation.
If the operation fails at any moment, the aborted tasks are rolled back.
The view building coordinator can also handle staging sstables using process_staging view building tasks. We do this because we don't want to start generating view updates from a staging sstable prematurely, before the writes are directed to the new replica (https://github.com/scylladb/scylladb/issues/19149).
For detailed description check: `docs/dev/view-building-coordinator.md`
Fixes https://github.com/scylladb/scylladb/issues/22288
Fixes https://github.com/scylladb/scylladb/issues/19149
Fixes https://github.com/scylladb/scylladb/issues/21564
Fixes https://github.com/scylladb/scylladb/issues/17603
Fixes https://github.com/scylladb/scylladb/issues/22586
Fixes https://github.com/scylladb/scylladb/issues/18826
Fixes https://github.com/scylladb/scylladb/issues/23930
---
This PR is a reimplementation of https://github.com/scylladb/scylladb/pull/21942
Closes scylladb/scylladb#23760
* github.com:scylladb/scylladb:
test/cluster: add view build status tests
test/cluster: add view building coordinator tests
utils/error_injection: allow to abort `injection_handler::wait_for_message()`
test: adjust existing tests
utils/error_injection: add injection with `sleep_abortable()`
db/view/view_builder: ignore `no_such_keyspace` exception
docs/dev: add view building coordinator documentation
db/view/view_building_worker: work on `process_staging` tasks
db/view/view_building_worker: register staging sstable to view building coordinator when needed
db/view/view_building_worker: discover staging sstables
db/view/view_building_worker: add method to register staging sstable
db/view/view_update_generator: add method to process staging sstables instantly
db/view/view_update_generator: extract generating updates from staging sstables to a method
db/view/view_update_generator: ignore tablet-based sstables
db/view/view_building_coordinator: update view build status on node join/left
db/view/view_building_coordinator: handle tablet operations
db/view: add view building task mutation builder
service/topology_coordinator: run view building coordinator
db/view: introduce `view_building_coordinator`
db/view/view_building_worker: update built views locally
db/view: introduce `view_building_worker`
db/view: extract common view building functionalities
db/view: prepare to create abstract `view_consumer`
message/messaging_service: add `work_on_view_building_tasks` RPC
service/topology_coordinator: make `term_changed_error` public
db/schema_tables: create/cleanup tasks when an index is created/dropped
service/migration_manager: cleanup view building state on drop keyspace
service/migration_manager: cleanup view building state on drop view
service/migration_manager: create view building tasks on create view
test/boost: enable proxy remote in some tests
service/migration_manager: pass `storage_proxy` to `prepare_keyspace_drop_announcement()`
service/migration_manager: coroutinize `prepare_new_view_announcement()`
service/storage_proxy: expose references to `system_keyspace` and `view_building_state_machine`
service: reload `view_building_state_machine` on group0 apply()
service/vb_coordinator: add currently processing base
db/system_keyspace: move `get_scylla_local_mutation()` up
db/system_keyspace: add `view_building_tasks` table
db/view: add view_building_state and views_state
db/system_keyspace: add method to get view build status map
db/view: extract `system.view_build_status_v2` cql statements to system_keyspace
db/system_keyspace: move `internal_system_query_state()` function earlier
db/view: ignore tablet-based views in `view_builder`
gms/feature_service: add VIEW_BUILDING_COORDINATOR feature
Add precompiled header support to CMakeLists.txt and configure.py -
it improves compilation time by approximately 10%.
A new header `stdafx.hh` is added; don't include it manually -
the compiler will include it for you. The header contains includes from
external libraries used by Scylla: seastar, the standard library,
Linux headers, and zlib.
The feature is enabled by default; to disable it, use the CMake option
`Scylla_USE_PRECOMPILED_HEADER` or `configure.py --disable-precompiled-header`.
The feature should be disabled when checking headers - otherwise
you might get false negatives on missing includes from seastar / abseil and so on.
Note: the following configuration needs to be added to ccache.conf:
sloppiness = pch_defines,time_macros
Closes #25182
The central idea of incremental repair is to allow repair participants
to select and repair only a portion of the dataset to speed up the
repair process. All repair participants must utilize an identical
selection method to repair and synchronize the same selected dataset.
There are two primary selection methods: time-based and file-based. The
time-based method selects data within a specified time frame. It is
versatile but less efficient, because it requires reading all of
the dataset and omitting data beyond the time frame. The file-based
method selects data from unrepaired SSTables and is more efficient
because it allows the entire SSTable to be omitted. This patch
implements the file-based selection method.
Incremental repair will only be supported for tablet tables; it will not
be supported for vnode tables. On one hand, the legacy vnodes are less
important to support. On the other hand, incremental repair for
vnodes is much harder to implement. With vnodes, an SSTable could contain
data for multiple vnode ranges. When a given vnode range is repaired,
only a portion of the SSTable is repaired. This complicates the
manipulation of SSTables significantly during both repair and
compaction. With tablets, an entire tablet is repaired, so an
SSTable is either fully repaired or not repaired at all, which is a huge
simplification.
This patch uses the repaired_at field from the sstables::statistics
component to mark an SSTable as repaired. It uses a virtual clock as the
repair timestamp, i.e., a monotonically increasing number for both the
repaired_at field of an SSTable and the sstables_repaired_at column in
the system.tablets table. Note that when an SSTable is not repaired, its
repaired_at field is set to the default value 0. The in-memory
being_repaired field of an SSTable is used to explicitly mark that an
SSTable has been selected. The following variables are used for
incremental repair:
- repaired_at (on-disk SSTable field): a 64-bit number that increases
sequentially.
- sstables_repaired_at (new column in the system.tablets table):
repaired_at <= sstables_repaired_at means the SSTable is repaired.
- being_repaired (in-memory SSTable field): a repair UUID that tells
which SSTables have participated in the repair.
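A minimal sketch of the resulting selection predicate (illustrative,
not the actual repair code):
```cpp
#include <cstdint>

// An SSTable whose repaired_at is covered by the tablet's
// sstables_repaired_at virtual clock is already repaired and can be
// skipped; repaired_at == 0 means the SSTable was never repaired.
bool sstable_is_repaired(uint64_t repaired_at, uint64_t sstables_repaired_at) {
    return repaired_at != 0 && repaired_at <= sstables_repaired_at;
}
```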
Initial test results:
1) Medium dataset results
Node amount: 3
Instance type: i4i.2xlarge
Disk usage per node: ~500GB
Cluster pre-populated with ~500GB of data before starting the repair job.
Results for Repair Timings:
The regular repair run took 210 mins.
Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s
The speedup is: 183 mins / 48s = 228X
2) Small dataset results
Node amount: 3
Instance type: i4i.2xlarge
Disk usage per node: ~167GB
Cluster pre-populated with ~167GB of data before starting the repair job.
Regular repair 1st run took 110s, 2nd and 3rd runs took 110s.
Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds.
The speedup is: 110s / 1.5s = 73X
3) Large dataset results
Node amount: 6
Instance type: i4i.2xlarge, 3 racks
50% of base load, 50% read/write
Dataset == Sum of data on each node
Dataset    Non-incremental repair (min:sec)
1.3 TiB    31:07
3.5 TiB    25:10
5.0 TiB    19:03
6.3 TiB    31:42
Dataset    Incremental repair (min:sec)
1.3 TiB    24:32
3.0 TiB    13:06
4.0 TiB    5:23
4.8 TiB    7:14
5.6 TiB    3:58
6.3 TiB    7:33
7.0 TiB    6:55
Fixes #22472
Closes scylladb/scylladb#24291
* github.com:scylladb/scylladb:
replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair
compaction: Move compaction_reenabler to compaction_reenabler.hh
topology_coordinator: Make rpc::remote_verb_error to warning level
repair: Add metrics for sstable bytes read and skipped from sstables
test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair
test.py: Add tests for tablet incremental repair
repair: Add tablet incremental repair support
compaction: Add tablet incremental repair support
feature_service: Add TABLET_INCREMENTAL_REPAIR feature
tablet_allocator: Add tablet_force_tablet_count_increase and decrease
repair: Add incremental helpers
sstable: Add being_repaired to sstable
sstables: Add set_repaired_at to metadata_collector
mutation_compactor: Introduce add operator to compaction_stats
tablet: Add sstables_repaired_at to system.tablets table
test: Fix drain api in task_manager_client.py
The feature is supported by all live versions since
version 5.4 / 2024.1.
(Although up to 6da758d74c
it could be disabled using the config option)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The following steps are performed in sequence as part of the
Raft-based recovery procedure:
- set `recovery_leader` to the host ID of the recovery leader in
`scylla.yaml` on all live nodes,
- send the `SIGHUP` signal to all Scylla processes to reload the config,
- perform a rolling restart (with the recovery leader being restarted
first).
These steps are not intuitive and more complicated than they need to be.
In this PR, we simplify them: from now on, it is enough to
set `recovery_leader` on each node just before restarting it.
Apart from making necessary changes in the code, we also update all
tests of the Raft-based recovery procedure and the user-facing
documentation.
Fixes scylladb/scylladb#25015
The Raft-based procedure was added in 2025.2. This PR makes the
procedure simpler and less error-prone, so it should be backported
to 2025.2 and 2025.3.
Closes scylladb/scylladb#25032
* github.com:scylladb/scylladb:
docs: document the option to set recovery_leader later
test: delay setting recovery_leader in the recovery procedure tests
gossip: add recovery_leader to gossip_digest_syn
db: system_keyspace: peers_table_read_fixup: remove rows with null host_id
db/config, gms/gossiper: change recovery_leader to UUID
db/config, utils: allow using UUID as a config option
When repairing a partition with many rows, we can store many fragments in a repair_row_on_wire object, which is sent as an rpc stream message.
This could cause reactor stalls when rpc stream compression is turned on, because the whole message is compressed at once, without any split.
This patch solves the problem at a higher level, by reducing the size of the messages sent to the rpc stream.
Tests are added to make sure the message split works.
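A minimal sketch of the batching idea, with hypothetical types standing
in for the actual rpc stream plumbing:
```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

constexpr std::size_t max_message_size = 128 * 1024;  // threshold is illustrative

// Instead of accumulating a whole partition into one huge message, flush
// the accumulated rows once a size threshold is crossed, so no single
// message is compressed in one go.
void send_rows(const std::vector<std::string>& rows,
               void (*send)(std::vector<std::string> batch)) {
    std::vector<std::string> batch;
    std::size_t batch_bytes = 0;
    for (const auto& row : rows) {
        if (!batch.empty() && batch_bytes + row.size() > max_message_size) {
            send(std::move(batch));  // flush before the message grows too big
            batch.clear();
            batch_bytes = 0;
        }
        batch.push_back(row);
        batch_bytes += row.size();
    }
    if (!batch.empty()) {
        send(std::move(batch));
    }
}
```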
Fixes #24808
Closes scylladb/scylladb#25002
* github.com:scylladb/scylladb:
repair: Avoid too many fragments in a single repair_row_on_wire
repair: Change partition_key_and_mutation_fragments to use chunked_vector
utils: Allow chunked_vector::erase to work with non-default-constructible type
When repairing a partition with many rows, we can store many fragments
in a repair_row_on_wire object, which is sent as an rpc stream message.
This could cause reactor stalls when rpc stream compression is
turned on, because the whole message is compressed at once, without
any split.
This patch solves the problem at a higher level, by reducing the
size of the messages sent to the rpc stream.
Tests are added to make sure the message split works.
Fixes #24808
Nowadays the way to configure an internal service is
1. service declares its config struct
2. caller (main/test/tool) fills the respective config with values it wants
3. the service is started with the config passed by value
The feature service code behaves likewise, but provides a helper method to create its config out of db::config. This PR moves this helper out of gms code, so that it doesn't mess with system-wide db::config and only needs its own small struct feature_config.
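A minimal sketch of the pattern (member names are illustrative; only
feature_config being a plain struct is from this PR):
```cpp
#include <set>
#include <string>
#include <utility>

// 1. The service declares its own small, plain config struct.
struct feature_config {
    std::set<std::string> disabled_features;
};

// 2-3. The caller (main/test/tool) fills it, and the service is started
// with the config passed by value.
class feature_service {
    feature_config _cfg;
public:
    explicit feature_service(feature_config cfg) : _cfg(std::move(cfg)) {}
};
```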
For reference, similar changes to other services: #23705, #20174, #19166
Closes scylladb/scylladb#25118
* github.com:scylladb/scylladb:
gms,init: Move get_disabled_features_from_db_config() from gms
code: Update callers generating feature service config
gms: Make feature_config a simple struct
gms: Split feature_config_from_db_config() into two
In the new Raft-based recovery procedure, live nodes join the new
group 0 one by one during a rolling restart. There is a time window when
some of them are in the old group 0, while others are in the new group
0. This causes a group 0 mismatch in `gossiper::handle_syn_msg`. The
current solution for this problem is to ignore group 0 mismatches if
`recovery_leader` is set on the local node and to ask the administrator
to perform the rolling restart in the following way:
- set `recovery_leader` in `scylla.yaml` on all live nodes,
- send the `SIGHUP` signal to all Scylla processes to reload the config,
- proceed with the rolling restart.
This commit makes `gossiper::handle_syn_msg` ignore group 0 mismatches
when exactly one of the two gossiping nodes has `recovery_leader` set.
We achieve this by adding `recovery_leader` to `gossip_digest_syn`.
This change makes setting `recovery_leader` earlier on all nodes and
reloading the config unnecessary. From now on, the administrator can
simply restart each node with `recovery_leader` set.
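In other words (a sketch of the rule only, not the actual gossiper code):
```cpp
// A group 0 ID mismatch in a SYN message is tolerated exactly when one
// side is mid-recovery (has recovery_leader set) and the other is not.
bool ignore_group0_id_mismatch(bool local_recovery_leader_set,
                               bool remote_recovery_leader_set) {
    return local_recovery_leader_set != remote_recovery_leader_set;
}
```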
However, note that nodes that join group 0 must have `recovery_leader`
set until all nodes join the new group 0. For example, assume that we
are in the middle of the rolling restart and one of the nodes in the new
group 0 crashes. It must be restarted with `recovery_leader` set, or
else it would reject `gossip_digest_syn` messages from nodes in the old
group 0. To avoid problems in such cases, we will continue to recommend
setting `recovery_leader` in `scylla.yaml` instead of passing it as
a command line argument.
We change the type of the `recovery_leader` config parameter and
`gossip_config::recovery_leader` from sstring to UUID. `recovery_leader`
is supposed to store host ID, so UUID is a natural choice.
After changing the type to UUID, if the user provides an incorrect UUID,
parsing `recovery_leader` will fail early, but the start-up will
continue. Outside the recovery procedure, `recovery_leader` will then be
ignored. In the recovery procedure, the start-up will fail on:
```
throw std::runtime_error(
"Cannot start - Raft-based topology has been enabled but persistent group 0 ID is not present. "
"If you are trying to run the Raft-based recovery procedure, you must set recovery_leader.");
```
Now that all callers are decoupled from the gms config generating code,
the latter can be decoupled from db::config.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All configs out there are plain structures without private members and
methods, used to simply carry the set of config values around. Make the
feature service config alike.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The helper in question generates the disabled features set and assigns
it on the config. This patch detaches the feature set generation into
another function. The former will go away eventually and the latter
will be kept around the main/test code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add the option to co-locate tablets of different tables, for example a base table and its CDC table, or a local index.
main changes and ideas:
* "table group" - a set of one or more tables that should be co-located. (Example: base table and CDC table). A group consists of one base table and zero or more children tables.
* new column `base_table` in `system.tablets`: when creating a new table, it can be set to point to a base table, which the new table's tablets will be co-located with. when it's set, the tablet map information should be retrieved from the base table map. the child map doesn't contain per-tablet information.
* co-located tables always have the same tablet count and the same tablet replicas. each tablet operation - migration, resize, repair - is applied on all tablets in a synchronized manner by the topology coordinator.
* resize decision for a group is made by combining the per-table hints and comparing the average tablet size (over all tablets in the group) with the target tablet size.
* the tablets load balancer works with the base table as a representative of the group. it represents a single migration unit with some `group_size` that is taken into account.
* view tablets are co-located with base tablets when the partition keys match.
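A minimal sketch of the group resize decision, with illustrative
thresholds (the actual load balancer hysteresis differs):
```cpp
#include <cstdint>

enum class resize_decision { none, split, merge };

// Compare the average tablet size across the whole group against the
// target tablet size; the 2x / 0.5x thresholds are illustrative only.
resize_decision decide_group_resize(uint64_t total_group_bytes,
                                    uint64_t tablet_count,
                                    uint64_t target_tablet_size) {
    const uint64_t avg = total_group_bytes / tablet_count;
    if (avg >= 2 * target_tablet_size) {
        return resize_decision::split;
    }
    if (avg <= target_tablet_size / 2) {
        return resize_decision::merge;
    }
    return resize_decision::none;
}
```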
Fixes https://github.com/scylladb/scylladb/issues/17043
Backport is not needed; this is preliminary work for support of MVs and CDC with tablets.
Closes scylladb/scylladb#22906
* github.com:scylladb/scylladb:
tablets: validate no clustering row mutations on co-located tables
raft_group0_client: extend validate_change to mixed_change type
docs: topology-over-raft: document co-located tables
tablet-mon.py: visual indication for co-located tablets
tablet-mon.py: handle co-located tablets
test/boost/view_schema_test.cc: fix race in wait_until_built
boost/tablets_test: test load balancing and resize of co-located tablets
test/tablets: test tablets colocation
tablets: co-locate view tablets with base when the partition keys match
test/pylib/tablets: common get_tablet_count api
test_mv_tablets: use get_tablet_replicas from common tablets api
test/pylib/tablets: fix test api to read tablet replicas from base table
tablets: allocator: create co-located tables in a single operation
alternator: prepare all new tables in a single announcement
migration_manager: add notification for creating multiple tables
tablets: read_tablet_transition_stage: read from base table
storage service: allow repair request only on base tables
tablets: keyspace_rf_change: apply on base table
storage service: generate tablet migration updates on base tables
tablets: replace all_tables method
tablets: split when all co-located tablets are ready
tablets: load balancer: sizing plan for table groups
tablets: load balancer: handle co-located tablets
tablets: allocate co-located tablets
tablets: handle migration of co-located tablets
storage service: add repair colocated tablets rpc
tablets: save and read tablet metadata of co-located tables
tablets: represent co-located tables in tablet metadata
tablets: add base_table column to system.tablets
docs: update system.tablets schema
When allocating tablets for a new table, add the option to create a
co-located tablet map with an existing base table.
The co-located tablet map is created with the base_table value set.
failure_detector_loop_for_node may be started on a shard before the
id->ip mapping is available there. Currently the code treats a missing
mapping as an internal error, but it uses the result for debug output
only, so let's relax the code to not assume the mapping is available.
Fixes #23407
Closes scylladb/scylladb#24614
Before we can eradicate numerical sstable generations entirely, this
series completes https://github.com/scylladb/scylladb/issues/20337
by disabling the use of numerical sstable generations where we can
and making sure the feature is never disabled.
Note that until the cluster feature is enabled in the startup process on first boot, numerical generation might be used for local system tables.
Refs #24248
* Enhancement. No backport required
Closes scylladb/scylladb#24554
* github.com:scylladb/scylladb:
feature_service: never disable UUID_SSTABLE_IDENTIFIERS
test: sstable_move_test: always use uuid sstable generation
test: sstable_directory_test: always use uuid sstable generation
sstables: sstable_generation_generator: set last_generation=0 by default
test: database_test: test_distributed_loader_with_pending_delete: use uuid sstable generation
test: lib: test_env: always use uuid sstable generation
test: sstable_test: always use uuid sstable generation
test: sstable_resharding_test::sstable_resharding_over_s3_test: use default use_uuid in config
test: sstable_datafile_test: compound_sstable_set_basic_test: use uuid sstable generation
test: sstable_compaction_test: always use uuid sstable generation