6184 Commits

Author SHA1 Message Date
Pavel Emelyanov
f112e42ddd raft: Fix split mutations freeze
Commit faa0ee9844 accidentally broke the way split snapshot mutation was
frozen -- instead of appending the sub-mutation `m` the commit kept the
old variable name of `mut` which in the new code corresponds to "old"
non-split mutation

Fixes #29051

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29052
2026-03-24 08:53:50 +02:00
Botond Dénes
5573c3b18e Merge 'tablets: Fix deadlock in background storage group merge fiber' from Tomasz Grabiec
When it deadlocks, groups stop merging and compaction group merge
backlog will run-away.

Also, graceful shutdown will be blocked on it.

Found by flaky unit test
test_merge_chooses_best_replica_with_odd_count, which timed-out in 1
in 100 runs.

Reason for deadlock:

When storage groups are merged, the main compaction group of the new
storage group takes a compaction lock, which is appended to
_compaction_reenablers_for_merging, and released when the merge
completion fiber is done with the whole batch.

If we accumulate more than 1 merge cycle for the fiber, deadlock
occurs. Lock order will be this

Initial state:

 cg0: main
 cg1: main
 cg2: main
 cg3: main

After 1st merge:

 cg0': main [locked], merging_groups=[cg0.main, cg1.main]
 cg1': main [locked], merging_groups=[cg2.main, cg3.main]

After 2nd merge:

 cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main]

merge completion fiber will try to stop cg0'.main, which will be
blocked on compaction lock. which is held by the reenabler in
_compaction_reenablers_for_merging, hence deadlock.

The fix is to wait for background merge to finish before we start the
next merge. It's achieved by holding old erm in the background merge,
and doing a topology barrier from the merge finalizing transition.

Background merge is supposed to be a relatively quick operation, it's
stopping compaction groups. So may wait for active requests. It
shouldn't prolong the barrier indefinitely.

Tablet tests which trigger merge need to be adjusted to call the
barrier, otherwise they will be vulnerable to the deadlock.

Fixes SCYLLADB-928

Backport to >= 2025.4 because it's the earliest vulnerable due to f9021777d8.

Closes scylladb/scylladb#29007

* github.com:scylladb/scylladb:
  tablets: Fix deadlock in background storage group merge fiber
  replica: table: Propagate old erm to storage group merge
  test: boost: tablets_test: Save tablet metadata when ACKing split resize decision
  storage_service: Extract local_topology_barrier()
2026-03-20 09:05:52 +02:00
Piotr Dulikowski
171504c84f Merge 'auth: migrate some standard role manager APIs to use cache' from Marcin Maliszkiewicz
This patchset migrates: query_all_directly_granted, query_all,
get_attribute, query_attribute_for_all functions to use cache
instead of doing CQL queries. It also includes some preparatory
work which fixes cache update order and triggering.

Main motivation behind this is to make sure that all calls
from service_level_controller::auth_integration are cached,
which we achieve here.

Alternative implementation could move the whole auth_integration
data into auth cache but since auth_integration manages also lifetime
and contains service levels specific logic such solution would be
too complex for little (if any) gain.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-159
Backport: no, not a bug

Closes scylladb/scylladb#28791

* github.com:scylladb/scylladb:
  auth: switch query_attribute_for_all to use cache
  auth: switch get_attribute to use cache
  auth: cache: add heterogeneous map lookups
  auth: switch query_all to use cache
  auth: switch query_all_directly_granted to use cache
  auth: cache: add ability to go over all roles
  raft: service: reload auth cache before service levels
  service: raft: move update_service_levels_effective_cache check
2026-03-19 14:37:22 +01:00
Botond Dénes
2e47fd9f56 Merge 'tasks: do not fail the wait request if rpc fails' from Aleksandra Martyniuk
During decommission, we first mark a topology request as done, then shut
down a node and in the following steps we remove node from the topology.
Thus,  finished request does not imply that a node is removed from
the topology.

Due to that, in node_ops_virtual_task::wait, while gathering children
from the whole cluster, we may hit the connection exception - because
a node is still in topology, even though it is down.

Modify the get_children method to ignore the exception and warn
about the failure instead.

Keep token_metadata_ptr in get_children to prevent topology from changing.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-867

Needs backports to all versions

Closes scylladb/scylladb#29035

* github.com:scylladb/scylladb:
  tasks: fix indentation
  tasks: do not fail the wait request if rpc fails
  tasks: pass token_metadata_ptr to task_manager::virtual_task::impl::get_children
2026-03-19 10:03:18 +02:00
Aleksandra Martyniuk
d4fdeb4839 tasks: pass token_metadata_ptr to task_manager::virtual_task::impl::get_children
In get_children we get the vector of alive nodes with get_nodes.
Yet, between this and sending rpc to those nodes there might be
a preemption. Currently, the liveness of a node is checked once
again before the rpcs (only with gossiper not in topology - unlike
get_nodes).

Modify get_children, so that it keeps a token_metadata_ptr,
preventing topology from changing between get_nodes and rpcs.

Remove test_get_children as it checked if the get_children method
won't fail if a node is down after get_nodes - which cannot happen
currently.
2026-03-18 15:37:24 +01:00
Botond Dénes
ae17596c2a Merge 'Demote log level on split failure during shutdown' from Raphael Raph Carvalho
Since commit 509f2af8db, gate_closed_exception can be triggered for ongoing split during shutdown. The commit is correct, but it causes split failure on shutdown to log an error, which causes CI instability. Previously, aborted_exception would be triggered instead which is logged as warning. Let's do the same.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-951.
Fixes https://github.com/scylladb/scylladb/issues/24850.

Only 2026.1 is affected.

Closes scylladb/scylladb#29032

* github.com:scylladb/scylladb:
  replica: Demote log level on split failure during shutdown
  service: Demote log level on split failure during shutdown
2026-03-18 16:21:05 +02:00
Marcin Maliszkiewicz
61952cd985 raft: service: reload auth cache before service levels
Since service levels depend on auth data, and not other
way around, we need to ensure a proper loading order.
2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz
c4cfb278bc service: raft: move update_service_levels_effective_cache check
The auth::cache::includes_table function also covers role_members and
role_attributes. The existing check was removed because it blocked these
tables from triggering necessary cache updates.

While previously non-critical (due to unused attributes and table coupling),
maintaining a correct cache is essential for upcoming changes.
2026-03-18 09:06:20 +01:00
Dawid Mędrek
a8dd13731f Merge 'Improve debuggability of test/cluster/test_data_resurrection_in_memtable.py' from Botond Dénes
This test was observed to fail in CI recently but there is not enough information in the logs to figure out what went wrong. This PR makes a few improvements to make the next investigation easier, should it be needed:
* storage-service: add table name to mutation write failure error messages.
* database: the `database_apply` error injection used to cause trouble, catching writes to bystander tables, making tests flaky. To eliminate this, it gained a filter to apply only to non-system keyspaces. Unfortunately, this still allows it to catch writes to the trace tables. While this should not fail the test, it reduces observability, as some traces disappear. Improve this error injection to only apply to selected table. Also merge it with the `database_apply_wait` error injection, to streamline the code a bit.
* test/test_data_resurrection_in_memtable.py: dump data from the datable, before the checks for expected data, so if checks fail, the data in the table is known.

Refs: SCYLLADB-812
Refs: SCYLLADB-870
Fixes: SCYLLADB-1050 (by restricting `database_apply` error injection, so it doesn't affect writes to system traces)

Backport: test related improvement, no backport

Closes scylladb/scylladb#28899

* github.com:scylladb/scylladb:
  test/cluster/test_data_resurrection_in_memtable.py: dump rows before check
  replica/database: consolidate the two database_apply error injections
  service/storage_proxy: add name of table to error message for write errors
2026-03-17 13:35:19 +01:00
Calle Wilund
a5df2e79a7 storage_service: Wait for snapshot/backup before decommission
Fixes: SCYLLADB-244

Disables snapshot control such that any active ops finish/fail
before proceeding with decommission.
Note: snapshot control provided as argument, not member ref
due to storage_service being used from both main and cql_test_env.
(The latter has no snapshot_ctl to provide).

Could do the snapshot lockout on API level, but want to do
pre-checks before this.

Note: this just disables backup/snapshot fully. Could re-enable
after decommission, but this seems somewhat pointless.

v2:
* Add log message to snapshot shutdown
* Make test use log waiting instead of timeouts

Closes scylladb/scylladb#28980
2026-03-16 17:12:57 +02:00
Raphael S. Carvalho
b508f3dd38 service: Demote log level on split failure during shutdown
Since commit 509f2af8db, gate_closed_exception can be triggered
for ongoing split during shutdown. The commit is correct, but it
causes split failure on shutdown to log an error, which causes
CI instability. Previously, aborted_exception would be triggered
instead which is logged as warning. Let's do the same.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-951.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2026-03-16 11:52:00 -03:00
Piotr Dulikowski
d8b283e1fb Merge 'Add CQL forwarding for strongly consistent tables' from Wojciech Mitros
In this series we add support for forwarding strongly consistent CQL requests to suitable replicas, so that clients can issue reads/writes to any node and have the request executed on an appropriate tablet replica (and, for writes, on the Raft leader). We return the same CQL response as what the user would get while sending the request to the correct replica and we perform the same logging/stats updates on the request coordinator as if the coordinator was the appropriate replica.

The core mechanism of forwarding a strongly consistent request is sending an RPC containing the user's cql request frame to the appropriate replica and returning back a ready, serialized `cql_transport::response`. We do this in the CQL server - it is most prepared for handling these types and forwarding a request containing a CQL frame allows us to reuse near-top-level methods for CQL request handling in the new RPC handler (such as the general `process`)

For sending the RPC, the CQL server needs to obtain the information about who should it forward the request to. This requires knowledge about the tablet raft group members and leader. We obtain this information during the execution of a `cql3/strong_consistency` statement, and we return this information back to the CQL server using the generalized `bounce_to_shard` `response_message`, where we now store the information about either a shard, or a specific replica to which we should forward to. Similarly to `bounce_to_shard`, we need to handle this `result_message` in a loop - a replica may move during statement execution, or the Raft leader can change. We also use it for forwarding strongly consistent writes when we're not a member of the affected tablet raft group - in that case we need to forward the statement twice - once to any replica of the affected tablet, then that replica can find the leader and return this information to the coordinator, which allows the second request to be directed to the leader.

This feature also allows passing through exception messages which happened on the target replica while executing the statement. For that, many methods of the `cql_transport::cql_server::connection` for creating error responses needed to be moved to `cql_transport::cql_server`. And for final exception handling on the coordinator, we added additional error info to the RPC response, so that the handling can be performed without having the `result_message::exception` or `exception_ptr` itself.

Fixes [SCYLLADB-71](https://scylladb.atlassian.net/browse/SCYLLADB-71)

[SCYLLADB-71]: https://scylladb.atlassian.net/browse/SCYLLADB-71?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#27517

* github.com:scylladb/scylladb:
  test: add tests for CQL forwarding
  transport: enable CQL forwarding for strong consistency statements
  transport: add remote statement preparation for CQL forwarding
  transport: handle redirect responses in CQL forwarding
  transport: add exception handling for forwarded CQL requests
  transport: add basic CQL request forwarding
  idl: add a representation of client_state for forwarding
  cql_server: handle query, execute, batch in one case
  transport: inline process_on_shard in cql_server::process
  transport: extract process() to cql_server
  transport: add messaging_service to cql_server
  transport: add response reconstruction helpers for forwarding
  transport: generalize the bounce result message for bouncing to other nodes
  strong consistency: redirect requests to live replicas from the same rack
  transport: pass foreign_ptr into sleep_until_timeout_passes and move it to cql_server
  transport: extract the error handling from process_request_one
  transport: move error response helpers from connection to cql_server
2026-03-13 15:03:10 +01:00
Tomasz Grabiec
518470e89e Merge 'load_stats: improve tablet filtering for load stats' from Ferenc Szili
When computing table sizes via load_stats to determine if a split/merge is needed, we are filtering tablets which are being migrated, in order to avoid counting them twice (both on leaving and pending replica) in the total table size. The tablets are filtered so that they are counted on the leaving replica until the streaming stage, and on the pending replica after the streaming stage.

Currently, the procedure for collecting tablet sizes for load balancing also uses this same filter. This should be changed, because the load balancer needs to have as much information about tablet sizes as possible, and could ignore a node due to missing tablet sizes for tablets in the `write_both_read_new` and `use_new` stages.

For tablet size collection, we should include all the tablets which are currently taking up disk space. This means:
- on leaving replica, include all tablets until the `cleanup` stage
- on pending replica, include all tablets starting with the `write_both_read_new` and later stages

While this is an improvement, it causes problems with some of the tests, and therefore needs to be backported to 2026.1

Fixes: SCYLLADB-829

Closes scylladb/scylladb#28587

* github.com:scylladb/scylladb:
  load_stats: add filtering for tablet sizes
  load_stats: move tablet filtering for table size computation
  load_stats: bring the comment and code in sync
2026-03-13 13:08:11 +01:00
Gleb Natapov
fae5282c82 service level: fix crash during migration to driver server level
Before b59b3d4 the migration code checked that service level controller
is on v2 version before migration and the check also implicitly checked
that _sl_data_accessor field is already initialized, but now that the
check is gone the migration can start before service level controller is
fully initialized. Re add the check, but to a different place.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1049

Closes scylladb/scylladb#29021
2026-03-13 11:24:26 +01:00
Botond Dénes
fc8cebd671 Merge 'Verify components digests during component load and scrub in validate mode' from Taras Veretilnyk
This PR adds integrity verification for SSTable component files during loading. When component digests are present in Scylla metadata, the loader now validates each component's CRC32 digest against the stored expected value, catching silent corruption of component files. Index, Rows and Partitions components digests are also validated duriung scrub in validate mode

Added corruption tests that write an SSTable, flip a bit in a specific component file, then verify that reloading the SSTable detects the corruption and throws the expected exception.

Depends on https://github.com/scylladb/scylladb/pull/28338

Backport is not required, this is new feature

Fixes https://github.com/scylladb/scylladb/issues/20103

Closes scylladb/scylladb#28761

* github.com:scylladb/scylladb:
  test/cqlpy: test --ignore-component-digest-mismatch flag in scylla sstable upgrade
  docs: document --ignore-component-digest-mismatch flag for scylla sstable upgrade
  sstables: propagate ignore_component_digest_mismatch config to all load sites
  sstables: add option to ignore component digest mismatches
  sstable_compaction_test: Add scrub validate test for corrupted index
  sstables: add tests for component digest validation on corrupted SSTables
  sstables: validate index components digests during SSTable scrub in validate mode
  sstables: verify component digests on SSTable load
  sstables: add digest_file_random_access_reader for CRC32 digest computation
2026-03-13 09:55:55 +02:00
Tomasz Grabiec
1256a9faa7 tablets: Fix deadlock in background storage group merge fiber
When it deadlocks, groups stop merging and compaction group merge
backlog will run-away.

Also, graceful shutdown will be blocked on it.

Found by flaky unit test
test_merge_chooses_best_replica_with_odd_count, which timed-out in 1
in 100 runs.

Reason for deadlock:

When storage groups are merged, the main compaction group of the new
storage group takes a compaction lock, which is appended to
_compaction_reenablers_for_merging, and released when the merge
completion fiber is done with the whole batch.

If we accumulate more than 1 merge cycle for the fiber, deadlock
occurs. Lock order will be this

Initial state:

 cg0: main
 cg1: main
 cg2: main
 cg3: main

After 1st merge:

 cg0': main [locked], merging_groups=[cg0.main, cg1.main]
 cg1': main [locked], merging_groups=[cg2.main, cg3.main]

After 2nd merge:

 cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main]

merge completion fiber will try to stop cg0'.main, which will be
blocked on compaction lock. which is held by the reenabler in
_compaction_reenablers_for_merging, hence deadlock.

The fix is to wait for background merge to finish before we start the
next merge. It's achieved by holding old erm in the background merge,
and doing a topology barrier from the merge finalizing transition.

Background merge is supposed to be a relatively quick operation, it's
stopping compaction groups. So may wait for active requests. It
shouldn't prolong the barrier indefinitely.

Tablet boost unit tests which trigger merge need to be adjusted to
call the barrier, otherwise they will be vulnerable to the deadlock.

Two cluster tests were removed because they assumed that merge happens
in the backgournd. Now that it happens as part of merge finalization,
and blocks topology state machine, those tests deadlock because they
are unable to make topology changes (node bootstrap) while background
merge is blocked.

The test "test_tablets_merge_waits_for_lwt" needed to be adjusted. It
assumed that merge finalization doesn't wait for the erm held by the
LWT operation, and triggered tablet movement afterwards, and assumed
that this migration will issue a barrier which will block on the LWT
operation. After this commit, it's the barrier in merge finalization
which is blocked. The test was adjusted to use an earlier log mark
when waiting for "Got raft_topology_cmd::barrier_and_drain", which
will catch the barrier in merge finalization.

Fixes SCYLLADB-928
2026-03-12 22:45:01 +01:00
Tomasz Grabiec
279fcdd5ff storage_service: Extract local_topology_barrier()
Will be called in tests. It does the local part of the global topology
barrier.

The comment:

        // We capture the topology version right after the checks
        // above, before any yields. This is crucial since _topology_state_machine._topology
        // might be altered concurrently while this method is running,
        // which can cause the fence command to apply an invalid fence version.

was dropped, because it's no longer true after
fad6c41cee, and it doesn't make sense in
the context of local_topology_barrier(). We'd have to propagate the
version to local_topology_barrier(), but it's pointless. The fence
version is decided before calling the local barrier, and it will be
valid even if local version moves ahead.
2026-03-12 22:44:56 +01:00
Avi Kivity
e2eeef3e01 Merge 'service level: remove remnants of version 1 service level' from Gleb Natapov
can_use_effective_service_level_cache() always returns true now, so the function can be dropped entirely and all the code that assumes it may return false can be dropped as well. Also drop async versions of find_effective_service_level and get_user_scheduling_group since they are unused.

No need to backport, code removal,

Closes scylladb/scylladb#29002

* github.com:scylladb/scylladb:
  service level: make maybe_update_per_service_level_params synchronous
  service level: remove unused get_user_scheduling_group function
  service level: drop async find_effective_service_level
  service level: remove remnants of version 1 service level
2026-03-12 23:39:41 +02:00
Wojciech Mitros
170b82ddca idl: add a representation of client_state for forwarding
In the following patches, when we start allowing to forward CQL
requests to other nodes, we'll need to use the same client state
for executing the request on the destination node as we had on the
source. client_state contains many fields and we need to create
a new instance of it when we start handling the forwarded request,
so to prepare for the forwarding RPC, we add a serializable format
of the client_state as an IDL struct. The new class is missing some
fields that are not used while executing requests, and some whose
value is determined by the fact that the client state is used for
a forwarded request.
These include:
- driver name, driver version, client options - not used for executing
requests. Instead, we use these as data sources for the virtual
"clients" system table.
- auth_state - must be READY - we reached a bounce message, so we were
able to try executing the request locally
- _control_connection - used for altering a cql_server::connection, which
we don't have on the target node
- _default_timeout_config - used when updating service levels, also only
per-connection
- workload_type - used for deciding whether to allow shedding at the
start of processing the request, and for getting per-connection service
level params (for an API)
2026-03-12 17:48:58 +01:00
Wojciech Mitros
b4d66fda2e strong consistency: redirect requests to live replicas from the same rack
Forwarding CQL requests is not implemented yet, but we're already
prepared to return the target to forward to when trying to execute
strongly consistent requests. Currently, if we're not a replica
of the affected tablet, we redirect the request to the first replica
in the list.
This is not optimal, because this replica may be down or it may be
in another rack, making us perform cross-rack requests during forwarding.
Instead, we should forward the request to the replica from the same
rack and handle the case where the replica is down.

In this patch we change the replica selection for forwarding strongly
consistent requests, so that when the coordinator isn't a replica, it
redirects the request to the replica from the same rack.

If the replica from the same rack is down, or there is no replica in
our rack, we choose the next closest replica (preferring same-DC replicas
over other DCs). If no replica is alive, the query fails - the driver
should retry when some replica comes back up.
2026-03-12 17:48:54 +01:00
Patryk Jędrzejczak
f85628a9a0 group0: discovery: shorten the pause duration
Nodes currently pause group0 discovery for 1s. This case is always hit while
adding multiple nodes in parallel to an empty cluster by all nodes except the
one that becomes the group0 leader.

This is fine in production, but in tests, the slowdown is quite significant.
Every `manager.servers_add(n)` call for n > 1 becomes 1s slower when the
cluster is empty. Many cluster tests are affected.

In this commit, we decrease the sleep duration from 1s to 100ms to speed up
tests. The consequence of this change is that nodes might perform more steps
in group0 discovery, but the increase in CPU usage and network traffic should
be negligible.
2026-03-12 15:40:18 +01:00
Gleb Natapov
c67f876893 service level: make maybe_update_per_service_level_params synchronous
It does not call async functions any more.
2026-03-12 15:53:08 +02:00
Gleb Natapov
c30907b8f2 service level: remove unused get_user_scheduling_group function 2026-03-12 14:28:26 +02:00
Gleb Natapov
a934d8391d service level: drop async find_effective_service_level
find_cached_effective_service_level does exactly same thing now and it
is synchronous.
2026-03-12 14:28:26 +02:00
Gleb Natapov
f888f2dced service level: remove remnants of version 1 service level
can_use_effective_service_level_cache() always returns true now, so the
function can be dropped entirely and all the code that assumes it may
return false can be dropped as well.
2026-03-12 12:27:52 +02:00
Piotr Dulikowski
d9a277453e Merge 'cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race' from Alex Dathskovsky
query_processor::prepare() could race with prepared statement invalidation: after loading from the prepared cache, we converted the cached object to a checked weak pointer and then continued asynchronous work (including error-injection waitpoints). If invalidation happened in that window, the weak handle could no longer be promoted and the prepare path could fail nondeterministically.

This change keeps a strong cache entry reference alive across the whole critical section in prepare() by using a pinned cache accessor (get_pinned()), and only deriving the weak handle while the entry is pinned. This removes the lifetime gap without adding retry loops.

  Test coverage was extended in test/cluster/test_prepare_race.py:

  - reproduces the invalidation-during-prepare window with injection,
  - verifies prepare completes successfully,
  - then invalidates again and executes the same stale client prepared object,
  - confirms the driver transparently re-requests/re-prepares and execution succeeds.

  This change introduces:

  - no behavior change for normal prepare flow besides stronger lifetime guarantees,
  - no new protocol semantics,
  - preserves existing cache invalidation logic,
  - adds explicit cluster-level regression coverage for both the race and driver reprepare path.
  - pushes the re prepare operation twards the driver, the server will return unprepared error for the first time and the driver will have to re prepare during execution stage

Fixes: https://github.com/scylladb/scylladb/issues/27657

Backport to active branches recommended: No node crash, but user-visible PREPARE failures under rare schema-invalidation race; low-risk timeout-bounded retry improves robustness.

Closes scylladb/scylladb#28952

* github.com:scylladb/scylladb:
  transport/messages: hold pinned prepared entry in PREPARE result
  cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race
2026-03-11 12:09:23 +01:00
Patryk Jędrzejczak
37aeba9c8c Merge 'raft: add global read barrier to group0_batch::commit and switch auth and service levels' from Marcin Maliszkiewicz
This series adds a global read barrier to raft_group0_client, ensuring that Raft group0 mutations are applied on all live nodes before returning to the caller.

Currently, after a group0_batch::commit, the mutations are only guaranteed to be applied on the leader. Other nodes may still be catching up, leading to stale reads. This patch introduces a broadcast read barrier mechanism. Calling  send_group0_read_barrier_to_live_members after committing will cause the coordinator to send a read barrier RPC to all live nodes (discovered via gossiper) and waits for them to complete. This is best effort attempt to get cluster-wide visibility of the committed state before the response is returned to the user.

Auth and service levels write paths are switched to use this new mechanism.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-650

Backport: no, new feature

Closes scylladb/scylladb#28731

* https://github.com/scylladb/scylladb:
  test: add tests for global group0_batch barrier feature
  qos: switch service levels write paths to use global group0_batch barrier
  auth: switch write paths to use global group0_batch barrier
  raft: add function to broadcast read barrier request
  raft: add gossiper dependency to raft_group0_client
  raft: add read barrier RPC
2026-03-11 10:37:19 +01:00
Botond Dénes
475220b9c9 Merge 'Remove the rest of pre raft topology code' from Gleb Natapov
Remove the rest of the code that assumes that either group0 does not exist yet or a cluster is till not upgraded to raft topology. Both of those are not supported any more.

No need to backport since we remove functionality here.

Closes scylladb/scylladb#28841

* github.com:scylladb/scylladb:
  service level: remove version 1 service level code
  features: move GROUP0_SCHEMA_VERSIONING to deprecated features list
  migration_manager: remove unused forward definitions
  test: remove unused code
  auth: drop auth_migration_listener since it does nothing now
  schema: drop schema_registry_entry::maybe_sync() function
  schema: drop make_table_deleting_mutations since it should not be needed with raft
  schema: remove calculate_schema_digest function
  schema: drop recalculate_schema_version function and its uses
  migration_manager: drop check for group0_schema_versioning feature
  cdc: drop usage of cdc_local table and v1 generation definition
  storage_service: no need to add yourself to the topology during reboot since raft state loading already did it
  storage_service: remove unused functions
  group0: drop with_raft() function from group0_guard since it always returns true now
  gossiper: do not gossip TOKENS and CDC_GENERATION_ID any more
  gossiper: drop tokens from loaded_endpoint_state
  gossiper: remove unused functions
  storage_service: do not pass loaded_peer_features to join_topology()
  storage_service: remove unused fields from replacement_info
  gossiper: drop is_safe_for_restart() function and its use
  storage_service: remove unused variables from join_topology
  gossiper: remove the code that was only used in gossiper topology
  storage_service: drop the check for raft mode from recovery code
  cdc: remove legacy code
  test: remove unused injection points
  auth: remove legacy auth mode and upgrade code
  treewide: remove schema pull code since we never pull schema any more
  raft topology: drop upgrade_state and its type from the topology state machine since it is not used any longer
  group0: hoist the checks for an illegal upgrade into main.cc
  api: drop get_topology_upgrade_state and always report upgrade status as done
  service_level_controller: drop service level upgrade code
  test: drop run_with_raft_recovery parameter to cql_test_env
  group0: get rid of group0_upgrade_state
  storage_service: drop topology_change_kind as it is no longer needed
  storage_service: drop check_ability_to_perform_topology_operation since no upgrades can happen any more
  service_storage: remove unused functions
  storage_service: remove non raft rebuild code
  storage_service: set topology change kind only once
  group0: drop in_recovery function and its uses
  group0: rename use_raft to maintenance_mode and make it sync
2026-03-11 10:24:20 +02:00
Taras Veretilnyk
7214f5a0b6 sstables: propagate ignore_component_digest_mismatch config to all load sites
Add ignore_component_digest_mismatch option to db::config (default false).
When set, sstable loading logs a warning instead of throwing on component
digest mismatches, allowing a node to start up despite corrupted non-vital
components or bugs in digest calculation.

Propagate the config to all production sstable load paths:
- distributed_loader (node startup, upload dir processing)
- storage_service (tablet storage cloning)
- sstables_loader (load-and-stream, download tasks, attach)
- stream_blob (tablet streaming)
2026-03-10 19:24:05 +01:00
Aleksandra Martyniuk
e02b0d763c service: fix indentation 2026-03-10 14:44:52 +01:00
Aleksandra Martyniuk
0e0070f118 service: remove status_helper::tablets
Currently, status_helper::tablets, which keeps a vector of processed
tablet ids, is used only in tablet_virtual_task::get_status_helper,
so there is no point in returning it. Also, in get_status_helper,
it is used only to determine if any tablets are processed.

Remove status_helper::tablets. Use a flag instead of the vector
in get_status_helper.
2026-03-10 14:42:27 +01:00
Aleksandra Martyniuk
e5928497ce service: tasks: scan all tablets in tablet_virtual_task::wait
Currently, for repair tasks tablet_virtual_task::wait gathers the
ids of tablets that are to be repaired. The gathered set is later
used to check if the repair is still ongoing.

However, if the tablets are resized (split or merged), the gathered
set becomes irrelevant. Those, we may end up with invalid tablet id
error being thrown.

Wait until repair is done for all tablets in the table.
2026-03-10 14:42:21 +01:00
Alex
3ac4e258e8 transport/messages: hold pinned prepared entry in PREPARE result
result_message::prepared now owns a strong pinned prepared-cache entry instead of relying only on a weak pointer view. This closes the remaining lifetime gap after query_processor::prepare() returns, so users of the returned PREPARE message cannot observe an invalidated weak handle during subsequent
  processing.

  - update result_message::prepared::cql constructor to accept pinned entry
  - construct weak view from owned pinned entry inside the message
  - pass pinned cache entry from query_processor::prepare() into the message constructor
2026-03-10 14:17:57 +02:00
Gleb Natapov
b59b3d4f8a service level: remove version 1 service level code 2026-03-10 10:46:48 +02:00
Gleb Natapov
40ec0d4942 migration_manager: remove unused forward definitions 2026-03-10 10:46:48 +02:00
Gleb Natapov
74b5a8d43d schema: drop schema_registry_entry::maybe_sync() function
Schema is synced through group0 now. Drop all the test of the function
as well.
2026-03-10 10:46:47 +02:00
Gleb Natapov
08e33ad7f7 schema: drop recalculate_schema_version function and its uses
There is no need to recalculate schema version any more since it is set
by group0.
2026-03-10 10:46:39 +02:00
Gleb Natapov
7bb334a5dd migration_manager: drop check for group0_schema_versioning feature
We do not allow upgrading from a version that does not have it any
longer.
2026-03-10 10:39:59 +02:00
Gleb Natapov
4402b030ae cdc: drop usage of cdc_local table and v1 generation definition 2026-03-10 10:39:59 +02:00
Gleb Natapov
6769615ff1 storage_service: no need to add yourself to the topology during reboot since raft state loading already did it 2026-03-10 10:39:59 +02:00
Gleb Natapov
33fbda9f3b storage_service: remove unused functions 2026-03-10 10:39:58 +02:00
Gleb Natapov
0e3e7be335 group0: drop with_raft() function from group0_guard since it always returns true now
Also drop the code that assumed that the function can return false.
2026-03-10 10:39:58 +02:00
Gleb Natapov
4e56ca3c76 gossiper: do not gossip TOKENS and CDC_GENERATION_ID any more
They were used by legacy topology and cdc code only.
2026-03-10 10:39:58 +02:00
Gleb Natapov
706754dc24 gossiper: remove unused functions 2026-03-10 10:39:58 +02:00
Gleb Natapov
8ee4cdd4b7 storage_service: do not pass loaded_peer_features to join_topology()
They are not used there any longer.
2026-03-10 10:39:58 +02:00
Gleb Natapov
24c01f2289 storage_service: remove unused fields from replacement_info 2026-03-10 10:39:58 +02:00
Gleb Natapov
2d8722d204 gossiper: drop is_safe_for_restart() function and its use
The function checks that the node's state is not left or removed in
gossiper during restart, but with raft topology a removed node will
not be able to contact the cluster to get this information since it will
be banned.
2026-03-10 10:39:58 +02:00
Gleb Natapov
6f739a8ee4 storage_service: remove unused variables from join_topology 2026-03-10 10:39:58 +02:00
Gleb Natapov
d35b83bec8 gossiper: remove the code that was only used in gossiper topology
The topology state machine is always present now and can be passed to
the gossiper during creation.
2026-03-10 10:39:58 +02:00
Gleb Natapov
390eb46c1a storage_service: drop the check for raft mode from recovery code
In non raft mode the node will node boot at all, so the check is
redundant now.
2026-03-10 10:39:58 +02:00