Previously, the prev_ip check caused problems for bootstrapping nodes.
Suppose a bootstrapping node A appears in the system.peers table of
some other node B. Its record has only ID and IP of the node A, due to
the special handling of bootstrapping nodes in raft_topology_update_ip.
Suppose node B gets temporarily isolated from the topology coordinator.
The topology coordinator fences out node B and successfully finishes
bootstrapping of the node A. Later, when the connectivity is restored,
topology_state_load runs on node B. Node A is already in the
normal state, but the gossiper on B might not have any state for
it yet. In this case, raft_topology_update_ip would not update
system.peers because the gossiper state is missing. Subsequently,
on_join/on_restart/on_alive events would skip updates because the IP
in gossiper matches the IP for that node in system.peers.
Removing the check avoids this issue, with negligible overhead:
* on_join/on_restart/on_alive happen only once in a
node’s lifetime
* topology_state_load already updates all nodes each time it runs.
This problem was found by a fencing test, which crashed a
node while another node was going through the bootstrapping
process. After restart the node saw that other node already
is in normal state, since the topology coordinator fenced out
this node and managed to finish the bootstrapping process
successfully. This test will be provided in a separate
fencing-for-paxos PR.
Closes scylladb/scylladb#25596
The previous implementation did not handle topology changes well:
* In `node_local_only` mode with CL=1, if the current node is pending, the CL is increased to 2, causing
`unavailable_exception`.
* If the current tablet is in `write_both_read_old` and we try to read with `node_local_only` on the new node, the replica list will be empty.
This patch changes `node_local_only` mode to always use `my_host_id` as the replica list. An explicit check ensures the current node is a replica for the operation; otherwise `on_internal_error` is called.
backport: not needed, since `node_local_only` is only used in LWT for tablets and it hasn't been released yet.
Closes scylladb/scylladb#25508
* github.com:scylladb/scylladb:
test_tablets_lwt: add test_lwt_during_migration
storage_proxy: node_local_only: always use my_host_id
To avoid dependency proliferation, switch to forward declarations.
In one case, we introduce indirection via std::unique_ptr and
deinline the constructor and destructor.
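As a rough illustration of the pattern (names invented here, not taken from the patch): the header forward-declares the heavy type and holds it via std::unique_ptr, and the constructor and destructor are defined out of line where the full type is visible, as std::unique_ptr's deleter requires.
```
// widget_holder.hh -- header: no #include of the heavy dependency.
#include <memory>

class heavy_dependency;  // forward declaration only

class widget_holder {
    std::unique_ptr<heavy_dependency> _dep;
public:
    widget_holder();   // declared here, defined in the .cc file,
    ~widget_holder();  // where heavy_dependency is a complete type
};

// widget_holder.cc -- the only place that needs the full type.
// #include "heavy_dependency.hh"
class heavy_dependency { /* ... */ };

widget_holder::widget_holder() : _dep(std::make_unique<heavy_dependency>()) {}
widget_holder::~widget_holder() = default;
```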
Ref #1
Closes scylladb/scylladb#25584
The previous implementation did not handle topology changes well:
* In node_local_only mode with CL=1, if the current node is pending,
the CL is raised to 2, causing unavailable_exception.
* If the current tablet is in write_both_read_old and we read with
node_local_only on the new node, the replica list is empty.
This patch changes node_local_only mode to always use my_host_id as
the replica list. An explicit check ensures the current node is a
replica for the operation; otherwise on_internal_error is called.
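A minimal, standalone sketch of the replica-list rule described above (the replica-set handling and names are illustrative placeholders, not the actual storage_proxy code):
```
#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <vector>

using host_id = uint64_t;

// Illustrative stand-in for on_internal_error().
[[noreturn]] void on_internal_error(const char* msg) {
    throw std::logic_error(msg);
}

// node_local_only: ignore the natural replicas derived from the ring/tablet
// state (which may be empty or pending during topology changes) and always
// target the local node, after verifying it really is a replica.
std::vector<host_id> replicas_for_node_local_only(
        host_id my_host_id,
        const std::vector<host_id>& natural_replicas) {
    if (std::find(natural_replicas.begin(), natural_replicas.end(), my_host_id)
            == natural_replicas.end()) {
        on_internal_error("node_local_only used on a node that is not a replica");
    }
    return {my_host_id};
}
```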
The central idea of incremental repair is to allow repair participants
to select and repair only a portion of the dataset to speed up the
repair process. All repair participants must utilize an identical
selection method to repair and synchronize the same selected dataset.
There are two primary selection methods: time-based and file-based. The
time-based method selects data within a specified time frame. It is
versatile but it is less efficient because it requires reading all of
the dataset and omitting data beyond the time frame. The file-based
method selects data from unrepaired SSTables and is more efficient
because it allows the entire SSTable to be omitted. This document patch
implements the file-based selection method.
Incremental repair will only be supported for tablet tables; it will not
be supported for vnode tables. On one hand, the legacy vnode is less
important to support. On the other hand, the incremental repair for
vnode is much harder to implement. With vnodes, an SSTable could contain
data for multiple vnode ranges. When a given vnode range is repaired,
only a portion of the SSTable is repaired. This complicates the
manipulation of SSTables significantly during both repair and
compaction. With tablets, an entire tablet is repaired, so an
sstable is either fully repaired or not repaired at all, which is a
huge simplification.
This patch uses the repaired_at field from the sstables::statistics
component to mark an SSTable as repaired. It uses a virtual clock as
the repair timestamp, i.e., a monotonically increasing number for the
repaired_at field of an SSTable and the sstables_repaired_at column in
the system.tablets table. Note that when an SSTable is not repaired,
its repaired_at field is left at the default value 0. The in-memory
being_repaired field of an SSTable explicitly marks that an SSTable
has been selected. The following variables are used for incremental
repair:
- repaired_at: on-disk SSTable field; a 64-bit number that increases
  sequentially.
- sstables_repaired_at: added to the system.tablets table;
  repaired_at <= sstables_repaired_at means the SSTable is repaired.
- being_repaired: in-memory SSTable field (added); a repair UUID that
  tells which SSTables have participated in the repair.
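A minimal, self-contained sketch of how these values interact (the struct and function names below are illustrative placeholders, not the actual sstable/tablet APIs):
```
#include <cstdint>
#include <optional>

// Illustrative stand-ins for the real sstable / tablet metadata.
struct sstable_meta {
    int64_t repaired_at = 0;               // on-disk; 0 == never repaired
    std::optional<int64_t> being_repaired;  // in-memory; id of the repair that selected it
};

struct tablet_meta {
    int64_t sstables_repaired_at = 0;       // column in system.tablets
};

// An SSTable counts as repaired if it was stamped at or before the tablet's
// last completed incremental repair (0 meaning it was never stamped).
bool is_repaired(const sstable_meta& sst, const tablet_meta& tablet) {
    return sst.repaired_at != 0 && sst.repaired_at <= tablet.sstables_repaired_at;
}

// File-based selection: only unrepaired SSTables that are not already
// selected by a running repair need to be read and synchronized.
bool select_for_incremental_repair(const sstable_meta& sst, const tablet_meta& tablet) {
    return !is_repaired(sst, tablet) && !sst.being_repaired.has_value();
}
```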
Initial test results:
1) Medium dataset results
Node amount: 3
Instance type: i4i.2xlarge
Disk usage per node: ~500GB
Cluster pre-populated with ~500GB of data before starting repairs job.
Results for Repair Timings:
The regular repair run took 210 mins.
Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s
The speedup is: 183 mins / 48s = 228X
2) Small dataset results
Node amount: 3
Instance type: i4i.2xlarge
Disk usage per node: ~167GB
Cluster pre-populated with ~167GB of data before starting the repairs job.
Regular repair 1st run took 110s, 2nd and 3rd runs took 110s.
Incremental repair 1st run took 110 seconds, 2nd and 3rd runs took 1.5 seconds.
The speedup is: 110s / 1.5s = 73X
3) Large dataset results
Node amount: 6
Instance type: i4i.2xlarge, 3 racks
50% of base load, 50% read/write
Dataset == Sum of data on each node
Dataset Non-incremental repair (minutes)
1.3 TiB 31:07
3.5 TiB 25:10
5.0 TiB 19:03
6.3 TiB 31:42
Dataset Incremental repair (minutes)
1.3 TiB 24:32
3.0 TiB 13:06
4.0 TiB 5:23
4.8 TiB 7:14
5.6 TiB 3:58
6.3 TiB 7:33
7.0 TiB 6:55
Fixes #22472
Closes scylladb/scylladb#24291
* github.com:scylladb/scylladb:
replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair
compaction: Move compaction_reenabler to compaction_reenabler.hh
topology_coordinator: Make rpc::remote_verb_error to warning level
repair: Add metrics for sstable bytes read and skipped from sstables
test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair
test.py: Add tests for tablet incremental repair
repair: Add tablet incremental repair support
compaction: Add tablet incremental repair support
feature_service: Add TABLET_INCREMENTAL_REPAIR feature
tablet_allocator: Add tablet_force_tablet_count_increase and decrease
repair: Add incremental helpers
sstable: Add being_repaired to sstable
sstables: Add set_repaired_at to metadata_collector
mutation_compactor: Introduce add operator to compaction_stats
tablet: Add sstables_repaired_at to system.tablets table
test: Fix drain api in task_manager_client.py
endpoint_filter() is used by batchlog to select nodes to replicate
to.
It contains an unordered_multimap data structure that maps rack names
to nodes.
It misuses std::unordered_map::bucket_count() to count the number of
racks. While values that share a key in a multimap will definitely
be in the same bucket, it's possible for values that don't share a
key to share a bucket. Therefore bucket_count() undercounts the
number of racks.
Fix this by using a more accurate data structure: a map of sets.
The patch changes validated.bucket_count() to validated.size()
and validated.size() to a new variable nr_validated.
The patch does cause an extra two allocations per rack (one for the
unordered_map node, one for the unordered_set bucket vector), but
this is only used for logged batches, so it is amortized over all
the mutations in the logged batch.
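A standalone illustration of the difference (simplified types, not the actual endpoint_filter() code):
```
#include <cassert>
#include <string>
#include <unordered_map>
#include <unordered_set>

int main() {
    using node = std::string;

    // Old shape: multimap from rack name to node. bucket_count() reports the
    // size of the hash table, not the number of distinct keys; in particular,
    // two racks can hash into the same bucket, so it does not reliably count
    // racks.
    std::unordered_multimap<std::string, node> by_rack_old;
    by_rack_old.emplace("rack1", "n1");
    by_rack_old.emplace("rack1", "n2");
    by_rack_old.emplace("rack2", "n3");
    // by_rack_old.bucket_count() is an implementation detail, not "2".

    // New shape: map of sets. size() is exactly the number of racks.
    std::unordered_map<std::string, std::unordered_set<node>> by_rack;
    by_rack["rack1"].insert("n1");
    by_rack["rack1"].insert("n2");
    by_rack["rack2"].insert("n3");
    assert(by_rack.size() == 2);

    // The total number of validated nodes now needs its own counter
    // (nr_validated in the patch), since size() counts racks.
    size_t nr_validated = 0;
    for (auto& [rack, nodes] : by_rack) {
        nr_validated += nodes.size();
    }
    assert(nr_validated == 3);
}
```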
Closes scylladb/scylladb#25493
This could happen when the peer node is shutting down. This is not
something we cannot recover from. The log level should be warning
instead of error, which our dtests treat as a test failure.
This was observed in the test_repair_one_node_alter_rf dtest.
The central idea of incremental repair is to allow repair participants
to select and repair only a portion of the dataset to speed up the
repair process. All repair participants must utilize an identical
selection method to repair and synchronize the same selected dataset.
There are two primary selection methods: time-based and file-based. The
time-based method selects data within a specified time frame. It is
versatile but it is less efficient because it requires reading all of
the dataset and omitting data beyond the time frame. The file-based
method selects data from unrepaired SSTables and is more efficient
because it allows the entire SSTable to be omitted. This document patch
implements the file-based selection method.
Incremental repair will only be supported for tablet tables; it will not
be supported for vnode tables. On one hand, the legacy vnode is less
important to support. On the other hand, the incremental repair for
vnode is much harder to implement. With vnodes, an SSTable could contain
data for multiple vnode ranges. When a given vnode range is repaired,
only a portion of the SSTable is repaired. This complicates the
manipulation of SSTables significantly during both repair and
compaction. With tablets, an entire tablet is repaired, so an
sstable is either fully repaired or not repaired at all, which is a
huge simplification.
This patch uses the repaired_at field from the sstables::statistics
component to mark an SSTable as repaired. It uses a virtual clock as
the repair timestamp, i.e., a monotonically increasing number for the
repaired_at field of an SSTable and the sstables_repaired_at column in
the system.tablets table. Note that when an SSTable is not repaired,
its repaired_at field is left at the default value 0. The in-memory
being_repaired field of an SSTable explicitly marks that an SSTable
has been selected. The following variables are used for incremental
repair:
- repaired_at: on-disk SSTable field; a 64-bit number that increases
  sequentially.
- sstables_repaired_at: added to the system.tablets table;
  repaired_at <= sstables_repaired_at means the SSTable is repaired.
- being_repaired: in-memory SSTable field (added); a repair UUID that
  tells which SSTables have participated in the repair.
Initial test results:
1) Medium dataset results
Node amount: 3
Instance type: i4i.2xlarge
Disk usage per node: ~500GB
Cluster pre-populated with ~500GB of data before starting repairs job.
Results for Repair Timings:
The regular repair run took 210 mins.
Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s
The speedup is: 183 mins / 48s = 228X
2) Small dataset results
Node amount: 3
Instance type: i4i.2xlarge
Disk usage per node: ~167GB
Cluster pre-populated with ~167GB of data before starting the repairs job.
Regular repair 1st run took 110s, 2nd and 3rd runs took 110s.
Incremental repair 1st run took 110 seconds, 2nd and 3rd runs took 1.5 seconds.
The speedup is: 110s / 1.5s = 73X
3) Large dataset results
Node amount: 6
Instance type: i4i.2xlarge, 3 racks
50% of base load, 50% read/write
Dataset == Sum of data on each node
Dataset Non-incremental repair (minutes)
1.3 TiB 31:07
3.5 TiB 25:10
5.0 TiB 19:03
6.3 TiB 31:42
Dataset Incremental repair (minutes)
1.3 TiB 24:32
3.0 TiB 13:06
4.0 TiB 5:23
4.8 TiB 7:14
5.6 TiB 3:58
6.3 TiB 7:33
7.0 TiB 6:55
Fixes #22472
This patch adds incremental repair support in compaction.
- The sstables are split into repaired and unrepaired sets.
- The repaired and unrepaired sets are compacted separately.
- The repaired_at field from the sstable and sstables_repaired_at from
the system.tablets table are used to decide whether an sstable is
repaired or not.
- Different compaction tasks, e.g., minor, major, scrub, split, are
serialized with tablet repair.
Enable runtime updates of vector_store_uri configuration without
requiring server restart.
This allows dynamically enabling, disabling, or switching the vector search service endpoint on the fly.
To improve clarity, the seastar::http::experimental::client is now wrapped in a private http_client class that also holds the host, address, and port information.
Tests have been added to verify that the client correctly handles transitions between enabled/disabled states and successfully switches traffic to a new endpoint after a configuration update.
Closes: VECTOR-102
No backport is needed as this is a new feature.
Closes scylladb/scylladb#25208
* github.com:scylladb/scylladb:
service/vector_store_client: Add live configuration update support
test/boost/vector_store_client_test.cc: Refactor vector store client test
service/vector_store_client: Refactor host_port struct created
service/vector_store_client: Refactor HTTP request creation
This change includes basic optimizations to locator::describe_ring,
mainly caching the per-endpoint information in an unordered_map
instead of looking it up in every inner loop.
This yields an improvement of 20% in cpu time.
With 45 nodes organized as 3 dcs, 3 racks per dc, 5 nodes per rack,
256 tokens per node, yielding 11520 ranges and 9 replicas per range,
describe_ring took
Before: 30 milliseconds (2.6 microseconds per range)
After: 24 milliseconds (2.1 microseconds per range)
Add respective unit test for vnode keyspace
and for tablets.
Fixes #24887
* backport up to 2025.1 as describe_ring slowness was hit in the field with large clusters
Closes scylladb/scylladb#24889
* github.com:scylladb/scylladb:
locator: util: optimize describe_ring
locator: util: construct_range_to_endpoint_map: pass is_vnode=true to get_natural_replicas
vnode_effective_replication_map: do_get_replicas: throw internal error if token not found in map
locator: effective_replication_map: get_natural_replicas: get is_vnode param
test: cluster: test_repair: add test_vnode_keyspace_describe_ring
This change includes basic optimizations to
locator::describe_ring, mainly caching the per-endpoint
information in an unordered_map instead of looking
it up in every inner loop.
This yields an improvement of 20% in cpu time.
With 45 nodes organized as 3 dcs, 3 racks per dc, 5 nodes per rack, 256 tokens per
node, yielding 11520 ranges and 9 replicas per range, describe_ring took
Before: 30 milliseconds (2.6 microseconds per range)
After: 24 milliseconds (2.1 microseconds per range)
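A minimal sketch of the caching pattern (hypothetical endpoint_details type and lookup function, not the real locator API):
```
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical per-endpoint information returned by describe_ring.
struct endpoint_details {
    std::string dc;
    std::string rack;
};

// Placeholder for the relatively expensive per-endpoint topology lookup.
endpoint_details lookup_details(const std::string& /*endpoint*/) {
    return {"dc1", "rack1"};
}

std::vector<endpoint_details> describe_replicas(
        const std::vector<std::vector<std::string>>& replicas_per_range) {
    // Cache: each endpoint is looked up once instead of once per range it
    // replicates (9 replicas x 11520 ranges in the benchmark above).
    std::unordered_map<std::string, endpoint_details> cache;
    std::vector<endpoint_details> out;
    for (const auto& range_replicas : replicas_per_range) {
        for (const auto& ep : range_replicas) {
            auto it = cache.find(ep);
            if (it == cache.end()) {
                it = cache.emplace(ep, lookup_details(ep)).first;
            }
            out.push_back(it->second);
        }
    }
    return out;
}
```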
Add respective unit test of describe_ring for tablets.
A unit test for vnodes already exists in
test/nodetool/test_describering.py
Fixes #24887
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To improve debuggability, we need to propagate original error messages
from Paxos verbs to the user. This change adds constructors that take
an error message directly, enabling better error reporting.
Additionally, functions such as write_timeout_to_read,
write_failure_to_read, etc. are updated to use these message-based
constructors. These functions are used in storage_proxy::cas to
convert between different error types, and without this change,
they could lose the original error message during conversion.
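A minimal sketch of the idea with placeholder exception types (the real classes carry consistency level, received/required counts, and more):
```
#include <stdexcept>
#include <string>

// Placeholder exception types, not the real CQL/Paxos exception classes.
struct mutation_write_timeout_exception : std::runtime_error {
    // New: constructor taking the original error message directly.
    explicit mutation_write_timeout_exception(std::string msg)
        : std::runtime_error(std::move(msg)) {}
};

struct read_timeout_exception : std::runtime_error {
    explicit read_timeout_exception(std::string msg)
        : std::runtime_error(std::move(msg)) {}
};

// storage_proxy::cas converts write-side errors into read-side ones. With
// message-based constructors, the original text from the remote verb is
// preserved instead of being replaced by a generic message.
read_timeout_exception write_timeout_to_read(const mutation_write_timeout_exception& e) {
    return read_timeout_exception(e.what());
}
```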
Enable runtime updates of vector_store_uri configuration without
requiring server restart.
This allows dynamically enabling, disabling, or switching the vector search node endpoint on the fly.
Introduce lightweight wrapper for seastar::http::experimental::client
This wrapper simplifies request creation by automatically injecting the host name.
https://github.com/scylladb/scylladb/issues/24962 introduced memtable overlap checks to cache tombstone GC. This was observed to be very strict and to greatly reduce the effectiveness of tombstone GC in the cache, especially for MV workloads, which regularly recycle old timestamps into new writes, so the memtable often has a smaller min live timestamp than the timestamps of the tombstones in the cache.
When creating a new memtable, save a snapshot of the tombstone gc state. This snapshot is used later to exclude this memtable from overlap checks for tombstones whose token has an expiry time (in the snapshot) larger than that of the tombstone, meaning all writes in this memtable were produced at a point in time when the tombstone had already expired. This has the following implications:
* The partition the tombstone is part of was already repaired at the time the memtable was created.
* All writes in the memtable were produced *after* this tombstone's expiry time, so these writes cannot possibly be relevant for this tombstone.
Based on this, such memtables are excluded from the overlap checks. With adequately frequent memtable flushes -- so that the tombstone gc state snapshot is refreshed -- most memtables should be excluded from overlap checks, greatly helping the cache's tombstone GC efficiency.
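A rough, heavily simplified sketch of the exclusion check, with hypothetical names; the real logic lives in max_purgeable and the cache/memtable readers and is more involved:
```
#include <cstdint>

// Hypothetical, simplified types; times are plain integers here.
using time_point = int64_t;

struct tombstone_info {
    time_point expiry;  // when the tombstone's data becomes purgeable
};

// Snapshot of the tombstone GC state taken when the memtable was created.
struct gc_state_snapshot {
    time_point threshold = 0;
    // Expiry threshold for a given token at memtable-creation time (derived
    // from the repair history in the real code).
    time_point expiry_threshold_for_token(int64_t) const { return threshold; }
};

struct memtable_view {
    gc_state_snapshot snapshot;
};

// If the tombstone had already expired (per the snapshot) when the memtable
// was created, every write in the memtable is newer than the tombstone's
// expiry and cannot be shadowed by it, so the overlap check can be skipped.
bool needs_overlap_check(const memtable_view& mt, int64_t token, const tombstone_info& t) {
    return t.expiry >= mt.snapshot.expiry_threshold_for_token(token);
}
```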
Fixes: https://github.com/scylladb/scylladb/issues/24962
Fixes a regression introduced by https://github.com/scylladb/scylladb/pull/23255, which was backported to all releases; this fix needs a backport to all releases as well.
Closes scylladb/scylladb#25033
* github.com:scylladb/scylladb:
docs/dev/tombstone.md: document the memtable overlap check elision optimization
test/boost/row_cache_test: add test for memtable overlap check elision
db/cache_mutation_reader: obtain gc-before and min-live-ts lazily
mutation/mutation_compactor: use max_purgeable::can_purge and max_purgeable::purge_result
db/cache_mutation_reader: use max_purgeable::can_purge()
replica/table: get_max_purgeable_fn_for_cache_underlying_reader(): use max_purgable::combine()
replica/database: memtable_list::get_max_purgeable(): set expiry-treshold
compaction/compaction_garbage_collector: max_purgeable: add expiry_treshold
replica/table: propagate gc_state to memtable_list
replica/memtable_list: add tombstone_gc_state* member
replica/memtable: add tombstone_gc_state_snapshot
tombstone_gc: introduce tombstone_gc_state_snapshot
tombstone_gc: extract shared state into shared_tombstone_gc_state
tombstone_gc: per_table_history_maps::_group0_gc_time: make it a value
tombstone_gc: fold get_group0_gc_time() into its caller
tombstone_gc: fold get_or_create_group0_gc_time() into update_group0_refresh_time()
tombstone_gc: fold get_or_create_repair_history_for_table() into update_repair_time()
tombstone_gc: refactor get_or_greate_repair_history_for_table()
replica/memtable_list: s/min_live_timestamp()/get_max_purgeable()/
db/read_context: return max_purgeable from get_max_purgeable()
compaction/compaction_garbage_collector: add formatter for max_purgeable
mutation: move definition of gc symbols to compaction.cc
compaction/compaction_garbage_collector: refactor max_purgeable into a class
test/boost/row_cache_test: refactor test_populating_reader_tombstone_gc_with_data_in_memtable
test: rewrite test_compacting_reader_tombstone_gc_with_data_in_memtable in C++
test/boost/row_cache_test: refactor cache tombstone GC with memtable overlap tests
Instead of storing it partially in tombstone_gc and partially in an
external map, move all external parts into the new
shared_tombstone_gc_state. This new class is responsible for
keeping and updating the repair history. tombstone_gc_state just keeps
const pointers to the shared state as before and is only responsible for
querying the tombstone gc before times.
This separation makes the code easier to follow and also enables further
patching of tombstone_gc_state.
Implement odd number voter enforcement in the group0 voter calculator to
ensure proper Raft consensus behavior. Raft consensus requires a majority
of voters to make decisions, and an odd number of voters is preferred
because an even number doesn't add additional reliability but introduces
the risk of scenarios where no group can make progress. If an even number
of voters is divided into two groups of equal size during a network
partition, neither group will have a majority and both will be unable to
commit new entries. With an odd number of voters, such equal partition
scenarios are impossible (unless the network is partitioned into at least
three groups).
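A minimal sketch of the odd-count rule (standalone, not the actual group0 voter calculator):
```
#include <cassert>
#include <cstddef>

// Pick the number of voters: the largest odd value not exceeding the number
// of candidate nodes. 2k voters tolerate the same number of failures (k-1)
// as 2k-1 voters, so an even count adds no reliability but enables an equal
// split with no majority on either side.
std::size_t choose_voter_count(std::size_t candidates) {
    if (candidates == 0) {
        return 0;
    }
    return (candidates % 2 == 0) ? candidates - 1 : candidates;
}

int main() {
    assert(choose_voter_count(4) == 3);
    assert(choose_voter_count(5) == 5);
    assert(choose_voter_count(1) == 1);
}
```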
Fixes: scylladb/scylladb#23266
Derive both vnode_effective_replication_map
and local_effective_replication_map from
static_effective_replication_map as both are static and per-keyspace.
However, local_effective_replication_map does not need vnodes
for the mapping of all tokens to the local node.
Refs #22733
* No backport required
Closes scylladb/scylladb#25222
* github.com:scylladb/scylladb:
locator: abstract_replication_strategy: implement local_replication_strategy
locator: vnode_effective_replication_map: convert clone_data_gently to clone_gently
locator: abstract_replication_map: rename make_effective_replication_map
locator: abstract_replication_map: rename calculate_effective_replication_map
replica: database: keyspace: rename {create,update}_effective_replication_map
locator: effective_replication_map_factory: rename create_effective_replication_map
locator: abstract_replication_strategy: rename vnode_effective_replication_map_ptr et. al
locator: abstract_replication_strategy: rename global_vnode_effective_replication_map
keyspace: rename get_vnode_effective_replication_map
dht: range_streamer: use naked e_r_m pointers
storage_service: use naked e_r_m pointers
alternator: ttl: use naked e_r_m pointers
locator: abstract_replication_strategy: define is_local
Derive both vnode_effective_replication_map
and local_effective_replication_map from
static_effective_replication_map as both are static and per-keyspace.
However, local_effective_replication_map does not need vnodes
for the mapping of all tokens to the local node.
Note that everywhere_replication_strategy is not abstracted in a similar
way, although it could, since the plan is to get rid of it
once all system keyspaces are converted to local or tablets replication
(and propagated everywhere if needed using raft group0)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
to *_static_effective_replication_map, in preparation
for separating local_effective_replication_map from
vnode_effective_replication_map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
to create_static_effective_replication_map, in preparation
for separating local_effective_replication_map from
vnode_effective_replication_map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
to static_effective_replication_map_ptr, in preparation
for separating local_effective_replication_map from
vnode_effective_replication_map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
to get_static_effective_replication_map, in preparation
for separating local_effective_replication_map from
vnode_effective_replication_map (both are per-keyspace).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Prepare for following patch that will separate
the local effective replication map from
vnode_effective_replication_map.
The caller is responsible to keep the
effective_replication_map_ptr alive while
in use by low-level async functions.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Prepare for following patch that will separate
the local effective replication map from
vnode_effective_replication_map.
The caller is responsible to keep the
effective_replication_map_ptr alive while
in use by low-level async functions.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Prepare for specializing the local replication strategy,
local effective replication map, et al. by defining
an is_local() predicate, similar to uses_tablets().
Note that is_vnode_based() still applies to the local replication
strategy.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
send_to_live_endpoints() computes sets of endpoints to
which we send mutations - remote endpoints (where we send
to each set as a whole, using forwarding), and local endpoints,
where we send directly. To make handling regular, each local
endpoint is treated as its own set. Thus, each local endpoint
and each datacenter receive one RPC call (or local call if the
coordinator is also a replica).
These sets are maintained in a std::unordered_map (for remote endpoints)
and a vector with the same value_type as the map (for local endpoints).
The key part of the vector payload is initialized to the empty string.
We simplify this by noting that the datacenter name is never used
after this computation, so the vector can hold just the replica sets,
without the fake datacenter name. The downstream variable `all` is
adjusted to point just to the replica set as well.
As a reward for our efforts, the vector's contents become nothrow
move constructible (no string), and we can convert it to a small_vector,
which reduces allocations in the common case of RF<=3.
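A schematic sketch of the shape change, with simplified stand-in types (the real code uses sstring, inet_address_vector_replica_set, and a small_vector type):
```
#include <string>
#include <utility>
#include <vector>

// Simplified stand-ins for the real types.
using endpoint = int;
using replica_set = std::vector<endpoint>;  // inet_address_vector_replica_set

// Before: local endpoints stored with a fake (empty) datacenter name, only so
// that the element type matches the unordered_map used for remote DCs.
using local_entry_before = std::pair<std::string, replica_set>;

// After: the DC name is never read again, so only the replica set is kept.
// In the patch the outer container also becomes a small_vector with inline
// capacity, so the common RF<=3 case needs no heap allocation for it.
using local_entry_after = replica_set;

std::vector<local_entry_after> make_local_sets(replica_set self) {
    std::vector<local_entry_after> local;
    local.push_back(std::move(self));  // each local endpoint is its own set
    return local;
}
```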
The reduction in allocations is visible in perf-simple-query --write
results:
```
before 165080.62 tps ( 60.3 allocs/op, 16.0 logallocs/op, 14.2 tasks/op, 53438 insns/op, 26705 cycles/op, 0 errors)
after 164513.83 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.2 tasks/op, 53347 insns/op, 26761 cycles/op, 0 errors)
```
The instruction count reduction is a not very impressive 70/op:
before
```
instructions_per_op:
mean= 53412.22 standard-deviation=32.12
median= 53420.53 median-absolute-deviation=20.32
maximum=53462.23 minimum=53290.06
```
after
```
instructions_per_op:
mean= 53350.32 standard-deviation=32.38
median= 53353.71 median-absolute-deviation=13.60
maximum=53415.20 minimum=53222.24
```
Perhaps the extra code from small_vector defeated some inlining,
which negated some of the gain from the reduced allocations. Perhaps
a build with full profiling will gain it back (my builds were without
pgo).
Closes scylladb/scylladb#25270
This pull request adds ANN OF queries.
The patch contains:
- CQL syntax for ORDER BY `vector_column_name` ANN OF `vector_literal` clause of SELECT statements.
- implementation of external ANN queries (using vector-store service)
- tests
Example syntax:
```
SELECT comment
FROM cycling.comments_vs
ORDER BY comment_vector ANN OF [0.1, 0.15, 0.3, 0.12, 0.05]
LIMIT 3;
```
The limit can be between 1 and 1000, the same as for Cassandra.
Co-authored-by: @janpiotrlakomy @smoczy123
Fixes: VECTOR-48
Fixes: VECTOR-46
Closes scylladb/scylladb#24444
* github.com:scylladb/scylladb:
cql3/statements: implement external `ANN OF` queries
vector_store_client: implement ann_error_visitor
test/cqlpy: check ANN queries disallow filtering properly
cassandra_tests: translate vector_invalid_query_test
cassandra_tests: copy vector_invalid_query_test from Cassandra
vector_index: make parameter names case insensitive
cql3/statements: add `ANN OF` queries support to select statements
cql/Cql.g: extend the grammar to allow for `ANN OF` queries
cql3/raw: add ANN ordering to the raw statement layer
The following steps are performed in sequence as part of the
Raft-based recovery procedure:
- set `recovery_leader` to the host ID of the recovery leader in
`scylla.yaml` on all live nodes,
- send the `SIGHUP` signal to all Scylla processes to reload the config,
- perform a rolling restart (with the recovery leader being restarted
first).
These steps are not intuitive and more complicated than they could be.
In this PR, we simplify these steps. From now on, we will be able to
simply set `recovery_leader` on each node just before restarting it.
Apart from making necessary changes in the code, we also update all
tests of the Raft-based recovery procedure and the user-facing
documentation.
Fixes scylladb/scylladb#25015
The Raft-based procedure was added in 2025.2. This PR makes the
procedure simpler and less error-prone, so it should be backported
to 2025.2 and 2025.3.
Closes scylladb/scylladb#25032
* github.com:scylladb/scylladb:
docs: document the option to set recovery_leader later
test: delay setting recovery_leader in the recovery procedure tests
gossip: add recovery_leader to gossip_digest_syn
db: system_keyspace: peers_table_read_fixup: remove rows with null host_id
db/config, gms/gossiper: change recovery_leader to UUID
db/config, utils: allow using UUID as a config option
Currently, `get_cas_shard` uses `sharder.shard_for_reads` to decide which shard to use for LWT execution—both on replicas and the coordinator.
If the coordinator is not a replica, `shard_for_reads` returns a default shard (shard 0). There are at least two problems with this:
* shard 0 can become overloaded, because all LWT coordinators-but-not-replicas are served on it.
* mismatch with replicas: the default shard doesn't match what `shard_for_reads` returns on replicas. This hinders the "same shard for client and server" RPC level optimization.
In this PR we change `get_cas_shard` to use a primary replica shard if the current node is not a replica. This guarantees that all LWT coordinators for the same tablet will be served on the same shard. This is important for LWT coordinator locks (`paxos::paxos_state::get_cas_lock`). Also, if all tablet replicas on different nodes live on the same shard, RPC optimization will make sure that no additional `smp::submit_to` will be needed on server side.
backport: not needed, since this fix applies only to LWT over tablets, and this feature is not released yet
Closes scylladb/scylladb#25224
* github.com:scylladb/scylladb:
test_tablets_lwt.py: make tests rf_rack_valid
test_tablets_lwt: add test_lwt_coordinator_shard
storage_proxy.cc: get_cas_shard: fallback to the primary replica shard
sharder: add try_get_shard_for_reads method
Currently the service levels cache is unnecessarily updated in every
call of `topology_state_load()`.
But it is enough to reload it only when a snapshot is loaded.
(The cache is also already updated when there is a change to one of
`service_levels_v2`, `role_members`, `role_attributes` tables.)
Fixes scylladb/scylladb#25114
Fixes scylladb/scylladb#23065
Closes scylladb/scylladb#25116
This commit introduces a config option 'tablet_load_stats_refresh_interval_in_seconds'
that allows overriding the default value without using error injection.
Fixes scylladb/scylladb#24641
Closes scylladb/scylladb#24746
Right now, service levels are migrated in one group0 command and auth is migrated in the next one. This has a bad effect on the group0 state reload logic - modifying service levels in group0 causes the effective service levels cache to be recalculated, and to do so we need to fetch information about all roles. If the reload happens after SL upgrade and before auth upgrade, the query for roles will be directed to the legacy auth tables in system_auth - and the query, being a potentially remote query, has a timeout. If the query times out, it will throw an exception which will break the group0 apply fiber and the node will need to be restarted to bring it back to work.
In order to solve this issue, make sure that the service level module does not start populating and using the service level cache until both service levels and auth are migrated to raft. This is achieved by adding the check both to the cache population logic and the effective service level getter - they now look at service level's accessor new method, `can_use_effective_service_level_cache` which takes a look at the auth version.
Fixes: scylladb/scylladb#24963
Should be backported to all versions which support upgrade to topology over raft - the issue described here may put the cluster into a state which is difficult to get out of (group0 apply fiber can break on multiple nodes, which necessitates their restart).
Closes scylladb/scylladb#25188
* github.com:scylladb/scylladb:
test: sl: verify that legacy auth is not queried in sl to raft upgrade
qos: don't populate effective service level cache until auth is migrated to raft
Currently, get_cas_shard uses shard_for_reads to decide which
shard to use for LWT execution—both on replicas and the coordinator.
If the coordinator is not a replica, shard_for_reads returns a default
shard (shard 0). There are at least two problems with this:
* shard 0 can become overloaded, because all LWT
coordinators-but-not-replicas are served on it.
* mismatch with replicas: the default shard doesn't match what
shard_for_reads returns on replicas. This hinders the "same shard for
client and server" RPC level optimization.
In this commit we change get_cas_shard to use a primary replica
shard if the current node is not a replica. This guarantees that all
LWT coordinators for the same tablet will be served on the same shard.
This is important for LWT coordinator locks
(paxos::paxos_state::get_cas_lock). Also, if all tablet replicas on
different nodes live on the same shard, RPC
optimization will make sure that no additional smp::submit_to will
be needed on the server side.
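A minimal sketch of the shard-selection rule (hypothetical types; the real code goes through the sharder and the effective replication map):
```
#include <cstdint>
#include <vector>

using shard_id = unsigned;
using host_id = uint64_t;

struct replica {
    host_id host;
    shard_id shard;
};

// Choose the shard that coordinates an LWT for one tablet. If this node is a
// replica, use its own replica shard (what shard_for_reads would return).
// Otherwise fall back to the primary replica's shard instead of shard 0, so
// that every coordinator of this tablet lands on the same shard and the
// "same shard on client and server" RPC optimization can apply.
// Assumes tablet_replicas is non-empty and ordered with the primary first.
shard_id get_cas_shard(host_id my_host, const std::vector<replica>& tablet_replicas) {
    for (const auto& r : tablet_replicas) {
        if (r.host == my_host) {
            return r.shard;
        }
    }
    return tablet_replicas.front().shard;
}
```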
Fixes scylladb/scylladb#20497
Right now, service levels are migrated in one group0 command and auth
is migrated in the next one. This has a bad effect on the group0 state
reload logic - modifying service levels in group0 causes the effective
service levels cache to be recalculated, and to do so we need to fetch
information about all roles. If the reload happens after SL upgrade and
before auth upgrade, the query for roles will be directed to the legacy
auth tables in system_auth - and the query, being a potentially remote
query, has a timeout. If the query times out, it will throw
an exception which will break the group0 apply fiber and the node will
need to be restarted to bring it back to work.
In order to solve this issue, make sure that the service level module
does not start populating and using the service level cache until both
service levels and auth are migrated to raft. This is achieved by adding
the check both to the cache population logic and the effective service
level getter - they now look at service level's accessor new method,
`can_use_effective_service_level_cache` which takes a look at the auth
version.
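A rough sketch of the gating idea (hypothetical enum and signature, loosely mirroring can_use_effective_service_level_cache):
```
// Hypothetical, simplified version of the auth version marker.
enum class auth_version { v1_legacy_tables, v2_on_raft };

// Populating the effective service level cache requires role information.
// Until auth is migrated to raft, that query would go to the legacy
// system_auth tables -- a potentially remote query with a timeout -- from
// the group0 apply path, so the cache stays disabled until the migration is
// done.
bool can_use_effective_service_level_cache(auth_version v) {
    return v == auth_version::v2_on_raft;
}
```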
Fixes: scylladb/scylladb#24963
Previously, `raft_group0::abort()` was called in `storage_service::do_drain` (introduced in #24418) to stop the group0 Raft server before destroying local storage. This was necessary because `raft::server` depends on storage (via `raft_sys_table_storage` and `group0_state_machine`).
However, this caused issues: services like `sstable_dict_autotrainer` and `auth::service`, which use `group0_client` but are not stopped by `storage_service`, could trigger use-after-free if `raft_group0` was destroyed too early. This can happen both during normal shutdown and when 'nodetool drain' is used.
This PR reworks the shutdown logic:
* Introduces `abort_and_drain()`, which aborts the server and waits for background tasks to finish, but keeps the server object alive. Clients will see `raft::stopped_error` if they try to access group0 after this method is called.
* Final destruction now happens in `abort_and_destroy()`, called later from `main.cc`, ensuring safe cleanup.
The `raft_server_for_group::aborted` is changed to a `shared_future`, as it is now awaited in both abort methods.
Node startup can fail before reaching `storage_service`, in which case `drain_on_shutdown()` and `abort_and_drain()` are never called. To ensure proper cleanup, `raft_group0` deinitialization logic must be included in both `abort_and_drain()` and `abort_and_destroy()`.
Refs #25115
Fixes #24625
Backport: the changes are complicated and not safe to backport, we'll backport a revert of the original patch (#24418) in a separate PR.
Closes scylladb/scylladb#25151
* https://github.com/scylladb/scylladb:
raft_group0: split shutdown into abort_and_drain and destroy
Revert "main.cc: fix group0 shutdown order"
Commit ddc3b6dcf5 added a check of group0 state in
get_schema_for_write(), but group0 client can only be used on shard 0,
and get_schema_for_write() can be called on any shard, so we cannot use
_group0_client there directly. Move the assert to where we already
use another group0 function and where it is guaranteed to run on
shard 0.
Closes scylladb/scylladb#25204
This PR enables **LWT (Lightweight Transactions)** support for tablet-based tables by leveraging **colocated tables**.
Currently, storing Paxos state in system tables causes two major issues:
* **Loss of Paxos state during tablet migration or base table rebuilds**
* When a tablet is migrated or the base table is rebuilt, system tables don't retain Paxos state.
* This breaks LWT correctness in certain scenarios.
* Failing test cases demonstrating this:
* test_lwt_state_is_preserved_on_tablet_migration
* test_lwt_state_is_preserved_on_rebuild
* **Shard misalignment and performance overhead**
* Tablets may be placed on arbitrary shards by the tablet balancer.
* Accessing Paxos state in system tables could require a shard jump, degrading performance.
We move Paxos state into a dedicated Paxos table, colocated with the base table:
* Each base table gets its own Paxos state table.
* This table is lazily created on the first LWT operation.
* Its tablets are colocated with those of the base table, ensuring:
* Co-migration during tablet movement
* Co-rebuilding with the base table
* Shard alignment for local access to Paxos state
Some reasoning for why this is sufficient to preserve LWT correctness is discussed in [2].
This PR addresses two issues from the "Why doesn't it work for tablets" section in [1]:
* Tablet migration vs LWT correctness
* Paxos table sharding
Other issues ("bounce to shard" and "locking for intranode_migration") have already been resolved in previous PRs.
References
[1] - [LWT over tablets design](https://docs.google.com/document/d/1CPm0N9XFUcZ8zILpTkfP5O4EtlwGsXg_TU4-1m7dTuM/edit?tab=t.0#heading=h.goufx7gx24yu)
[2] - [LWT: Paxos state and tablet balancer](https://docs.google.com/document/d/1-xubDo612GGgguc0khCj5ukmMGgLGCLWLIeG6GtHTY4/edit?tab=t.0)
[3] - [Colocated tables PR](https://github.com/scylladb/scylladb/pull/22906#issuecomment-3027123886)
[4] - [Possible LWT consistency violations after a topology change](https://github.com/scylladb/scylladb/issues/5251)
Backport: not needed because this is a new feature.
Closes scylladb/scylladb#24819
* github.com:scylladb/scylladb:
create_keyspace: fix warning for tablets
docs: fix lwt.rst
docs: fix tablets.rst
alternator: enable LWT
random_failures: enable execute_lwt_transaction
test_tablets_lwt: add test_paxos_state_table_permissions
test_tablets_lwt: add test_lwt_for_tablets_is_not_supported_without_raft
test_tablets_lwt: test timeout creating paxos state table
test_tablets_lwt: add test_lwt_concurrent_base_table_recreation
test_tablets_lwt: add test_lwt_state_is_preserved_on_rebuild
test_tablets_lwt: migrate test_lwt_support_with_tablets
test_tablets_lwt: add test_lwt_state_is_preserved_on_tablet_migration
test_tablets_lwt: add simple test for LWT
check_internal_table_permissions: handle Paxos state tables
client_state: extract check_internal_table_permissions
paxos_store: handle base table removal
database: get_base_table_for_tablet_colocation: handle paxos state table
paxos_state: use node_local_only mode to access paxos state
query_options: add node_local_only mode
storage_proxy: handle node_local_only in query
storage_proxy: handle node_local_only in mutate
storage_proxy: introduce node_local_only flag
abstract_replication_strategy: remove unused using
storage_proxy: add coordinator_mutate_options
storage_proxy: rename create_write_response_handler -> make_write_response_handler
storage_proxy: simplify mutate_prepare
paxos_state: lazily create paxos state table
migration_manager: add timeout to start_group0_operation and announce
paxos_store: use non-internal queries
qp: make make_internal_options public
paxos_store: conditional cf_id filter
paxos_store: coroutinize
feature_service: add LWT_WITH_TABLETS feature
paxos_state: inline system_keyspace functions into paxos_store
paxos_state: extract state access functions into paxos_store
Otherwise, tablet rebuild will be delayed for up to 60s, as the tablet
scheduler needs load stats for the new (replacing) node to make
decisions.
Fixes #25163
Closes scylladb/scylladb#25181
As copies are wasteful in many cases, it is better
to move the tablet_map if possible, or clone
it gently in an async fiber.
Add clone() and clone_gently() methods to
allow explicit copies.
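A simplified model of the shape of the change (illustrative only; the real tablet_map holds much more state and clone_gently() is asynchronous):
```
#include <vector>

// The copy constructor is disabled so copies must be explicit: either a
// synchronous clone() or, for large maps, clone_gently(), which in the real
// code runs in an async fiber and copies a chunk at a time.
struct tablet_map {
    std::vector<int> tablets;  // stands in for the per-tablet metadata

    tablet_map() = default;
    tablet_map(tablet_map&&) noexcept = default;              // moving stays cheap
    tablet_map& operator=(tablet_map&&) noexcept = default;
    tablet_map(const tablet_map&) = delete;                   // no accidental copies
    tablet_map& operator=(const tablet_map&) = delete;

    tablet_map clone() const {                                 // explicit copy
        tablet_map m;
        m.tablets = tablets;
        return m;
    }
};
```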
* minor optimization, no backport needed
Closes scylladb/scylladb#24978
* github.com:scylladb/scylladb:
tablets: prevent accidental copy of tablets_map
locator: tablets: get rid of synchronous mutate_tablet_map
The `token_metadata_impl` stores the sorted tokens in an `std::vector`.
With a large number of nodes, the size of this vector can grow quickly,
and updating it might lead to oversized allocations.
This commit changes `_sorted_tokens` to a `chunked_vector` to avoid such
issues. It also updates all related code to use `chunked_vector` instead
of `std::vector`.
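A small standalone illustration of the idea, using std::deque as a stand-in for a chunked container (utils::chunked_vector in the real code): sorted order and binary search are preserved while no single huge contiguous allocation is needed.
```
#include <algorithm>
#include <cstdint>
#include <deque>

using token = int64_t;

// std::deque stands in for utils::chunked_vector here: both store elements in
// fixed-size chunks, so growing to a very large number of tokens never needs
// one huge contiguous allocation the way std::vector does.
std::deque<token> sorted_tokens;

// The lookup pattern is unchanged: keep the container sorted and binary
// search it through its random-access iterators. Assumes at least one token.
token first_token_at_or_after(token t) {
    auto it = std::lower_bound(sorted_tokens.begin(), sorted_tokens.end(), t);
    return it == sorted_tokens.end() ? sorted_tokens.front() : *it;  // wrap around the ring
}
```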
Fixes #24876
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closes scylladb/scylladb#25027