Commit Graph

858 Commits

Author SHA1 Message Date
Michael Litvak
27fd0c119f db: disable tablet balancing with logstor
initially logstor tables will not support tablet migrations, so
disable tablet balancing if the experimental feature flag is set.
2026-03-18 19:24:26 +01:00
Gleb Natapov
4402b030ae cdc: drop usage of cdc_local table and v1 generation definition 2026-03-10 10:39:59 +02:00
Gleb Natapov
77f8f952b2 gossiper: drop tokens from loaded_endpoint_state 2026-03-10 10:39:58 +02:00
Gleb Natapov
6a7e850161 cdc: remove legacy code
The patch removes test/boost/cdc_generation_test.cc since it unit tests
cdc::limit_number_of_streams_if_needed function which is remove here.
2026-03-10 10:38:57 +02:00
Gleb Natapov
0cf726c81f raft topology: drop upgrade_state and its type from the topology state machine since it is not used any longer 2026-03-10 10:09:39 +02:00
Gleb Natapov
60a861c518 group0: hoist the checks for an illegal upgrade into main.cc
The checks are spread around now, but having then in one place and done
as early as possible simplifies the logic.
2026-03-10 10:09:39 +02:00
Tomasz Grabiec
b90fe19a42 Merge 'service: assert that tables updated via group0 use schema commitlog' from Aleksandra Martyniuk
Set enable_schema_commitlog for each group0 tables.

Assert that group0 tables use schema commitlog in ensure_group0_schema
(per each command).

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-914.

Needs backport to all live releases as all are vulnerable

Closes scylladb/scylladb#28876

* github.com:scylladb/scylladb:
  test: add test_group0_tables_use_schema_commitlog
  db: service: remove group0 tables from schema commitlog schema initializer
  service: ensure that tables updated via group0 use schema commitlog
  db: schema: remove set_is_group0_table param
2026-03-05 13:28:13 +01:00
Aleksandra Martyniuk
5306e26b83 db: service: remove group0 tables from schema commitlog schema initializer
Remove group0 tables from schema commitlog schema initializer.
The schema commitlog of group0 tables is ensured by set_is_group0_table.
2026-03-04 17:25:06 +01:00
Aleksandra Martyniuk
6b3b174704 db: schema: remove set_is_group0_table param
set_is_group0_table takes an enabled flag, based on which it decides
whether it's a group0 table. The method is called only with enabled = true.

Drop the param. For not group0 tables nothing should be set.
2026-03-04 17:24:34 +01:00
Avi Kivity
85bd6d0114 Merge 'Add multiple-shard persistent metadata storage for strongly consistent tables' from Wojciech Mitros
In this series we introduce new system tables and use them for storing the raft metadata
for strongly consistent tables. In contrast to the previously used raft group0 tables, the
new tables can store data on any shard. The tables also allow specifying the shard where
each partition should reside, which enables the tablets of strongly consistent tables to have
their raft group metadata co-located on the same shard as the tablet replica.

The new tables have almost the same schemas as the raft group0 tables. However, they
have an additional column in their partition keys. The additional column is the shard
that specifies where the data should be located. While a tablet and its corresponding
raft group server resides on some shard, it now writes and reads all requests to the
metadata tables using its shard in addition to the group_id.

The extra partition key column is used by the new partitioner and sharder which allow
this special shard routing. The partitioner encodes the shard in the token and the
sharder decodes the shard from the token. This approach for routing avoids any
additional lookups (for the tablet mapping) during operations on the new tables
and it also doesn't require keeping any state. It also doesn't interact negatively
with resharding - as long as tablets (and their corresponding raft metadata) occupy
some shard, we do not allow starting the node with a shard count lower than the
id of this shard. When increasing the shard count, the routing does not change,
similarly to how tablet allocation doesn't change.

To use the new tables, a new implementation of `raft::persistence` is added. Currently,
it's almost an exact copy of the `raft_sys_table_storage` which just uses the new tables,
but in the future we can modify it with changes specific to metadata (or mutation)
storage for strongly consistent tables. The new storage is used in the `groups_manager`,
which combined with the removal of some `this_shard_id() == 0` checks, allows strongly
consistent tables to be used on all shards.

This approach for making sure that the reads/writes to the new tables end up on the correct shards
won in the balance of complexity/usability/performance against a few other approaches we've considered.
They include:
1. Making the Raft server read/write directly to the database, skipping the sharder, on its shard, while using
the default partitioner/sharder. This approach could let us avoid changing the schema and there should be
no problems for reads and writes performed by the Raft server. However, in this approach we would input
data in tables conflicting with the placement determined by the sharder. As a result, any read going through
the sharder could miss the rows it was supposed to read. Even when reading all shards to find a specific value,
there is a risk of polluting the cache - the rows loaded on incorrect shards may persist in the cache for an unknown
amount of time. The cache may also mistakenly remember that a row is missing, even though it's actually present,
just on an incorrect shard.
Some of the issues with this approach could be worked around using another sharder which always returns
this_shard_id() when asked about a shard. It's not clear how such a sharder would implement a method like
`token_for_next_shard`, and how much simpler it would be compared to the current "identity" sharder.
2. Using a sharder depending on the current allocation of tablets on the node. This approach relies on the
knowledge of group_id -> shard mapping at any point in time in the cluster. For this approach we'd also
need to either add a custom partitioner which encodes the group_id in the token, or we'd need to track the
token(group_id) -> shard mapping. This approach has the benefit over the one used in the series of keeping
the partition key as just group_id. However, it requires more logic, and the access to the live state of the node
in the sharder, and it's not static - the same token may be sharded differently depending on the state of the
node - it shouldn't occur in practice, but if we changed the state of the node before adjusting the table data,
we would be unable to access/fix the stale data without artificially also changing the state of the node.
3. Using metadata tables co-located to the strongly consistent tables. This approach could simplify the
metadata migrations in the future, however it would require additional schema management of all co-located
metadata tables, and it's not even obvious what could be used as the partition key in these tables - some
metadata is per-raft-group, so we couldn't reuse the partition key of the strongly consistent table for it. And
finding and remembering a partition key that is routed to a specific shard is not a simple task. Finally, splits
and merges will most likely need special handling for metadata anyway, so we wouldn't even make use of
co-located table's splits and merges.

Fixes [SCYLLADB-361](https://scylladb.atlassian.net/browse/SCYLLADB-361)

[SCYLLADB-361]: https://scylladb.atlassian.net/browse/SCYLLADB-361?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#28509

* github.com:scylladb/scylladb:
  docs: add strong consistency doc
  test/cluster: add tests for strongly-consistent tables' metadata persistence
  raft: enable multi-shard raft groups for strongly consistent tablets
  test/raft: add unit tests for raft_groups_storage
  raft: add raft_groups_storage persistence class
  db: add system tables for strongly consistent tables' raft groups
  dht: add fixed_shard_partitioner and fixed_shard_sharder
  raft: add group_id -> shard mapping to raft_group_registry
  schema: add with_sharder overload accepting static_sharder reference
2026-03-04 08:55:43 +02:00
Wojciech Mitros
654fe4b1ca db: add system tables for strongly consistent tables' raft groups
Add three new system tables for storing raft state for strongly
consistent tablets, corresponding to the tables for group0:

- system.raft_groups: Stores the raft log, term/vote, snapshot_id,
  and commit_idx for each tablet's raft group.

- system.raft_groups_snapshots: Stores snapshot descriptors
  (index, term) for each group.

- system.raft_groups_snapshot_config: Stores the raft configuration
  (current and previous voters) for each snapshot.

These tables use a (shard, group_id) composite partition key with
the newly added raft_groups_partitioner and raft_groups_sharder, ensuring
data is co-located with the tablet replica that owns the raft group.

The tables are only created when the STRONGLY_CONSISTENT_TABLES experimental
feature is enabled.
2026-02-25 12:34:58 +01:00
Calle Wilund
fec7df7cbb topology::snapshot: Add expiry (ttl) to RPC/topo op
Not set yet, but includes it in messages so it can be properly
set in calling code. Will add entry to manifest.
2026-02-23 11:37:17 +01:00
Calle Wilund
8bb81f00f8 system_keyspace::topology_requests_entry: Add snapshot info to table
Adds required info to communicate snapshot requests via topology
coordinator.
2026-02-23 10:44:38 +01:00
Patryk Jędrzejczak
e21ecf69de raft topology: make some assertions non-crashing
Some assertions in the Raft-based topology are likely to cause crashes of
multiple nodes due to the consistent nature of the Raft-based code. If the
failing assertion is executed in the code run by each follower (e.g., the code
reloading the in-memory topology state machine), then all nodes can crash. If
the failing assertion is executed only by the leader (e.g., the topology
coordinator fiber), then multiple consecutive group0 leaders will chain-crash
until there is no group0 majority.

Crashing multiple nodes is much more severe than necessary. It's enough to
prevent the topology state machine from making more progress. This will
naturally happen after throwing a runtime error. The problematic fiber will be
killed or will keep failing in a loop. Note that it should be safe to block
the topology state machine, but not the whole group0, as the topology state
machine is mostly isolated from the rest of group0.

We replace some occurrences of `on_fatal_internal_error` and `SCYLLA_ASSERT`
with `on_internal_error`. These are not all occurrences, as some fatal
assertions make sense, for example, in the bootstrap procedure.
2026-02-12 13:10:03 +01:00
Patryk Jędrzejczak
4e984139b2 Merge 'strongly consistent tables: basic implementation' from Petr Gusev
In this PR we add a basic implementation of the strongly-consistent tables:
* generate raft group id when a strongly-consistent table is created
* persist it into system.tables table
* start raft groups on replicas when a strongly-consistent tablet_map reaches them
* add strongly-consistent version of the storage_proxy, with the `query` and `mutate` methods
* the `mutate` method submits a command to the tablets raft group, the query method reads the data with `raft.read_barrier()`
* strongly-consistent versions of the `select_statement` and `modification_statement` are added
* a basic `test_strong_consistency.py/test_basic_write_read` is added which to check that we can write and read data in a strongly consistent fashion.

Limitations:
* for now the strongly consistent tables can have tablets only on shard zero. This is because we (ab/re) use the existing raft system tables which live only on shard0. In the next PRs we'll create separate tables for the new tablets raft groups.
* No Scylla-side proxying - the test has to figure out who is the leader and submit the command to the right node. This will be fixed separately.
* No tablet balancing -- migration/split/merges require separate complicated code.

The new behavior is hidden behind `STRONGLY_CONSISTENT_TABLES` feature, which is enabled when the `STRONGLY_CONSISTENT_TABLES` experimental feature flag is set.

Requirements, specs and general overview of the feature can be found [here](https://scylladb.atlassian.net/wiki/spaces/RND/pages/91422722/Strong+Consistency). Short term implementation plan is [here](https://docs.google.com/document/d/1afKeeHaCkKxER7IThHkaAQlh2JWpbqhFLIQ3CzmiXhI/edit?tab=t.0#heading=h.thkorgfek290)

One can check the strongly consistent writes and reads locally via cqlsh:
scylla.yaml:
```
experimental_features:
  - strongly-consistent-tables
```

cqlsh:
```
CREATE KEYSPACE IF NOT EXISTS my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1} AND tablets = {'initial': 1} AND consistency = 'local';
CREATE TABLE my_ks.test (pk int PRIMARY KEY, c int);
INSERT INTO my_ks.test (pk, c) VALUES (10, 20);
SELECT * FROM my_ks.test WHERE pk = 10;
```

Fixes SCYLLADB-34
Fixes SCYLLADB-32
Fixes SCYLLADB-31
Fixes SCYLLADB-33
Fixes SCYLLADB-56

backport: no need

Closes scylladb/scylladb#27614

* https://github.com/scylladb/scylladb:
  test_encryption: capture stderr
  test/cluster: add test_strong_consistency.py
  raft_group_registry: disable metrics for non-0 groups
  strong consistency: implement select_statement::do_execute()
  cql: add select_statement.cc
  strong consistency: implement coordinator::query()
  cql: add modification_statement
  cql: add statement_helpers
  strong consistency: implement coordinator::mutate()
  raft.hh: make server::wait_for_leader() public
  strong_consistency: add coordinator
  modification_statement: make get_timeout public
  strong_consistency: add groups_manager
  strong_consistency: add state_machine and raft_command
  table: add get_max_timestamp_for_tablet
  tablets: generate raft group_id-s for new table
  tablet_replication_strategy: add consistency field
  tablets: add raft_group_id
  modification_statement: remove virtual where it's not needed
  modification_statement: inline prepare_statement()
  system_keyspace: disable tablet_balancing for strongly_consistent_tables
  cql: rename strongly_consistent statements to broadcast statements
2026-01-23 09:52:33 +01:00
Botond Dénes
4281d18c2e Merge 'schema: Apply sstable_compression_user_table_options to CQL aux and Alternator tables' from Nikos Dragazis
In PR 5b6570be52 we introduced the config option `sstable_compression_user_table_options` to allow adjusting the default compression settings for user tables. However, the new option was hooked into the CQL layer and applied only to CQL base tables, not to the whole spectrum of user tables: CQL auxiliary tables (materialized views, secondary indexes, CDC log tables), Alternator base tables, Alternator auxiliary tables (GSIs, LSIs, Streams).

This gap also led to inconsistent default compression algorithms after we changed the option’s default algorithm from LZ4 to LZ4WithDicts (adf9c426c2).

This series introduces a general “schema initializer” mechanism in `schema_builder` and uses it to apply the default compression settings uniformly across all user tables. This ensures that all base and aux tables take their default compression settings from config.

Fixes #26914.

Backport justification: LZ4WithDicts is the new default since 2025.4, but the config option exists since 2025.2. Based on severity, I suggest we backport only to 2025.4 to maintain consistency of the defaults.

Closes scylladb/scylladb#27204

* github.com:scylladb/scylladb:
  db/config: Update sstable_compression_user_table_options description
  schema: Add initializer for compression defaults
  schema: Generalize static configurators into schema initializers
  schema: Initialize static properties eagerly
  db: config: Add accessor for sstable_compression_user_table_options
  test: Check that CQL and Alternator tables respect compression config
2026-01-22 06:50:48 +02:00
Petr Gusev
998ee5b7fb system_keyspace: disable tablet_balancing for strongly_consistent_tables
We don't yet support tablet balancing for strongly consistent tables.
2026-01-21 14:56:00 +01:00
Botond Dénes
e01041d3ee db/system_keyspace: move remining tables out of v3 keyspace
The last remining tables in the v3 keyspace are those that are genuinely
distinct -- added by Cassandra 3.0 or >= ScyllaDB 2.0.
Move these out of the v3 keyspace too, with this the v3 keyspace is
defunct and removed.
2026-01-19 12:32:21 +02:00
Botond Dénes
ce57ef94bd db/system_keyspace: relocate truncated() and commitlog_cleanups()
The name variables of these tables is outside the v3 namespace but the
method defining their schema is in the v3 namespace. Relocate the
methods out from the v3 namespace, to the scope where the name variables
live.
The methods are moved to the private: part of system_keyspace, as they
don't have external users currently.
2026-01-19 12:32:21 +02:00
Botond Dénes
2ccb8ff666 db/system_keyspace: drop v3::local()
It is unused, the non-v3 variant is used instead.
2026-01-19 12:32:21 +02:00
Botond Dénes
b52a3f3a43 db/system_keyspace: remove duplicate table names from v3
Those table names that are effectively just an alias of the their
counterpart outside of the v3 namespace (struct).

scylla_local() is made public. Currently it is private, but it has
external users, working around the private designation by using the
public v3::scylla_local() alias. This change just makes the existing
status clear.
2026-01-19 12:32:21 +02:00
Gleb Natapov
bee5f63cb6 topology coordinator: complete pending operation for a replaced node
A replaced node may have pending operation on it. The replace operation
will move the node into the 'left' state and the request will never be
completed. More over the code does not expect left node to have a
request. It will try to process the request and will crash because the
node for the request will not be found.

The patch checks is the replaced node has peening request and completes
it with failure. It also changes topology loading code to skip requests
for nodes that are in a left state. This is not strictly needed, but
makes the code more robust.

Fixes #27990

Closes scylladb/scylladb#28009
2026-01-14 13:11:27 +01:00
Nikos Dragazis
d5ec66bc0c schema: Generalize static configurators into schema initializers
Extend the `static_configurator` mechanism to support initialization of
arbitrary schema properties, not only static ones, by passing a
`schema_builder` reference to the configurator interface.

As part of this change, rename `static_configurator` to
`schema_initializer` to better reflect its broader responsibility.

Add a checkpoint/restore mechanism to allow de-registering an
initializer (useful for testing; will be used in the next patch).

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 20:45:59 +02:00
Asias He
4f77dd058d repair: Add tablet repair progress report support
This patch adds tablet repair progress report support so that the user
could use the /task_manager/task_status API to query the progress.

In order to support this, a new system table is introduced to record the
user request related info, i.e, start of the request and end of the
request.

The progress is accurate when tablet split or merge happens in the
middle of the request, since the tokens of the tablet are recorded when
the request is started and when repair of each tablet is finished. The
original tablet repair is considered as finished when the finished
ranges cover the original tablet token ranges.

After this patch, the /task_manager/task_status API will report correct
progress_total and progress_completed.

Fixes #22564
Fixes #26896

Closes scylladb/scylladb#27679
2026-01-08 21:55:18 +02:00
Benny Halevy
f60033db63 db: system_keyspace: get_group0_history: unfreeze_gently
Prevent stall when the group0 history is too long using unfreeze_gently
rather than the synchronous unfreeze() function

Fixes #27872

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#27873
2025-12-29 12:00:54 +02:00
Patryk Jędrzejczak
d5c205194b Merge 'topology: Make removenode use left_token_ring state for global barrier' from Emil Maskovsky
Make the removenode operation go through the `left_token_ring` state, similar to decommission. This ensures that when removenode completes, all nodes in the cluster are aware of the topology change through a global token metadata barrier.

Previously, removenode would skip the `left_token_ring` state and go directly from `write_both_read_new` to `left` state. This meant that when the operation completed, some nodes might not yet know about the topology change, potentially causing issues with subsequent data plane requests.

Key changes:
- Both decommission and removenode now transition to `left_token_ring` state in the `write_both_read_new` handler
- In `left_token_ring` state, only decommissioning nodes receive the shutdown RPC (removed nodes are already dead)
- Updated documentation to reflect that both operations use this state

This change improves consistency guarantees for removenode operations by ensuring cluster-wide awareness before completion.

The change is protected by "REMOVENODE_WITH_LEFT_TOKEN_RING" feature flag to also support mixed clusters during e.g. upgrade.

Fixes:  scylladb/scylladb#25530

No backport: This fixes and issue found in tests. It can theoretically happen in production too, but wasn't reported in any customer issue, so a backport is not needed.

Closes scylladb/scylladb#26931

* https://github.com/scylladb/scylladb:
  topology: make removenode use left_token_ring state for global barrier
  topology: allow removing nodes not having tokens
  features: add feature flag for removenode via left token ring
2025-12-18 09:34:38 +01:00
Emil Maskovsky
9431826c52 topology: allow removing nodes not having tokens
For the changes to go through the left_token_ring state when
REMOVENODE_WITH_LEFT_TOKEN_RING feature is enabled, we need to allow
removing nodes to not have any tokens (similarly to decommissioning
nodes, which use the same sequence of states).

This means the tests also need to change to allow for this new behavior
- it can temporarily happen that a removing node has no tokens but is
still part of Raft group 0 (so there may be a temporary mismatch between
the token ring and group 0 membership).

Therefore, the `check_token_ring_and_group0_consistency` function is
replaced by `wait_for_token_ring_and_group0_consistency`, which waits
up to 30 seconds for consistency to be reached.
2025-12-17 13:31:11 +01:00
Tomasz Grabiec
c077283352 Merge 'service: support conversion of tablet keyspaces to rack-list using ALTER KEYSPACE' from Aleksandra Martyniuk
If a keyspace has a numeric replication factor in a DC and rf < #racks,
then the replicas of tablets in this keyspace can be distributed among
all racks in the DC (different for each tablet). With rack list, we need all
tablet replicas to be placed on the same racks. Hence, the conversion
requires tablet co-location.

After this series, the conversion can be done using ALTER KEYSPACE
statement. The statement that does this conversion in any DC is not
allowed to change a rf in any DC. So, if we have dc1 and dc2 with 3 racks
each and a keyspace ks then with a single ALTER KEYSPACE we can do:
- {dc1 : 2} -> {dc1 : [r1, r2]};
- {dc1 : 2, dc2: 2} ->  {dc1 : [r1, r2], dc2: [r2,r3]};
- {dc1 : 2, dc2: 2} -> {dc1 : [r1, r2], dc2: 2}
- {dc1 : 2} -> {dc1 : 2, dc2 : [r1]}
But we cannot do:
- {dc1 : 2} -> {dc1 : [r1, r2, r3]};
- {dc1 : 1, dc2 : [r1, r2] → dc1: [r1], dc2: [r1].

In order to do the co-locations rf change request is paused. Tablet
load balancer examines the paused rf change requests and schedules
necessary tablet migrations. During the process of co-location, no other
cross-rack migration is allowed.

Load balancer checks whether any paused rf change request is
ready to be resumed. If so, it puts the request back to global topology
request queue.

While an rf change request for a keyspace is running, any other rf change
of this keyspace will fail.

Fixes: #26398.

New feature, no backport

Closes scylladb/scylladb#27279

* github.com:scylladb/scylladb:
  test: add est_rack_list_conversion_with_two_replicas_in_rack
  test: test creating tablet_rack_list_colocation_plan
  test: add test_numeric_rf_to_rack_list_conversion test
  tasks: service: add global_topology_request_virtual_task
  cql3: statements: allow altering from numeric rf to rack list
  service: topology_coordinator: pause keyspace_rf_change request
  service: implement make_rack_list_colocation_plan
  service: add tablet_rack_list_colocation_plan
  cql3: reject concurrent alter of the same keyspace
  test: check paused rf change requests persistence
  db: service: add paused_rf_change_requests to system.topology
  service: pass topology and system_keyspace to load_balancer ctor
  service: tablet_allocator: extract load updates
  service: tablet_allocator: extract ensure_node
  tasks, system_keyspace: Introduce get_topology_request_entry_opt()
  node_ops: Drop get_pending_ids()
  node_ops: Drop redundant get_status_helper()
2025-12-17 10:05:06 +01:00
Aleksandra Martyniuk
9039dfa4a5 tasks: service: add global_topology_request_virtual_task
Add a service::topo::global_topology_request_virtual_task, which
covers the replication factor changes.

Currently, the global_topology_request_virtual_task can be aborted
only if it is paused.

The progress of the rf change isn't counted.
2025-12-16 13:31:22 +01:00
Aleksandra Martyniuk
08e5f35527 db: service: add paused_rf_change_requests to system.topology
In the following changes, we allow to alter from numeric rf to rack list.
Before the alter, two tablets of the same keyspace can have replicas
on different racks. To switch to rack list, we need to co-locate
the replicas. It will be achieved by pausing the keyspace_rf_change
and scheduling migrations.

We need to persist the ids of requests that are paused. A new column -
paused_rf_change_requests is added to system.topology table.

In this commit no data is kept in the new column.
2025-12-16 13:25:38 +01:00
Tomasz Grabiec
71e6ef90f4 tasks, system_keyspace: Introduce get_topology_request_entry_opt()
It's a cleanup. Better to return std::nullopt than faking an entry
with an id when require_entry == false.
2025-12-16 13:25:34 +01:00
Patryk Jędrzejczak
73db5c94de Merge 'db: api: service: introduce system.client_routes table and related API endpoints' from Andrzej Jackowski
`system.client_routes` is a system table that sets the target address and ports for each `host_id`, for one or more connection (e.g., Private Link) represented by `connection_id`. Cloud will write the table via REST, and drivers will read it via CQL to override values obtained from `system.local` and `system.peers`.

This patch series contains:
 - Introduction of `CLIENT_ROUTES` feature flag.
 - Implementation of raft-based `system.client_routes` table
 - Implementation of `v2/client-routes` POST/DELETE/GET endpoints
 - Implementation of new `CLIENT_ROUTES_CHANGE` event that is sent to drivers when `system.client_routes` is changed
 - New tests that verifies the aforementioned features

Ref: scylladb/scylla-enterprise#5699

For now, no automatic backport. However, the changes are planned to be release on `2025.4` either as a backport or a private build.

Closes scylladb/scylladb#27323

* https://github.com/scylladb/scylladb:
  docs: describe CLIENT_ROUTES_CHANGE extension
  test: add test for CLIENT_ROUTES event
  service: transport: add CLIENT_ROUTES_CHANGE event
  test: add cluster tests for client routes
  test: add API tests for client_routes endpoints
  test: add `timeout` parameter to `delete` in RESTClient
  test: allow json_body in send
  api: implement client_routes endpoints
  api: add client_routes.json
  service: main: add client_routes_service
  db: add system.client_routes table
  gms: add CLIENT_ROUTES feature
2025-12-16 10:38:27 +01:00
Andrzej Jackowski
8dde70d04c db: add system.client_routes table
Introduce `system.client_routes`, a system table that sets the target
address and ports for each `host_id`, for one or more connections
(e.g., Private Link) represented by `connection_id`. Cloud will write
the table via REST, and drivers will read it via CQL to override
values obtained from `system.local` and `system.peers`.

The table is Raft-managed to provide consistent replication across
nodes.

Schema overview: each row is identified by `(connection_id, host_id)`
and describes where clients should connect: `address` and one or more of
`port`, `tls_port`, `alternator_port`, `alternator_https_port`.
`host_id` is a UUID (just as in ScyllaDB) but `connection_id` can be
any string to accept formats of all cloud providers. `address` is
also a regular string because it can represent either an IP address or
a domain. Ports are optional in the sense that at least one of
the four must be provided.

Ref: scylladb/scylla-enterprise#5699
2025-12-15 13:08:05 +01:00
Pavel Emelyanov
3f7ee3ce5d Merge 'batchlog: make replay (flush) faster' from Botond Dénes
The batchlog table contains an entry for each logged batch that is processed by the local node as coordinator. These entries are typically very short lived, they are inserted when the batch is processed and deleted immediately after the batch is successfully applied.
When a table has `tombstone_gc = {'mode': 'repair'}` enabled, every repair has to flush all hints and batchlogs, so that we can be certain that there is no live data in any of these, older than the last repair. Since batches can contain member queries from any number of tables, the whole batchlog has to be flushed, even if repair-mode tombstone-gc is enabled for a single table.

Flushing the batchlog table happens by doing a batchlog replay. This involves reading the entire content of this table, and attempting to replay+delete any live entries (that are old enough to be replayed).  Under normal operating circumstances, 99%+ of the content of the batchlog table is partition tombstones.  Because of this, scanning the content of this table has to process thousands to millions of tombstones. This was observed to require up to 20 minutes to finish, causing repairs to slow down to a crawl, as the batchlog-flush has to be repeated at the end of the repair of each token-range.

When trying to address this problem, the first idea was that we should expedite the garbage-collection of these accumulated tombstones. This experiment failed, see https://github.com/scylladb/scylladb/pull/23752. The commitlog proved to be an impossible to bypass barrier, preventing quick garbage-collection of tombstones. So long as a single commit-log segment is alive, holding content from the batchlog table, all tombstones written after are blocked from GC.
The second approach, represented by this PR, is to not rely in tombstone GC to reduce the tombstone amount. Instead restructure the table such that a single higher-order tombstone can be used to shadow and allow for the eviction of the myriads of individual batchlog entry tombstones. This is realized by reorganizing the batchlog table such that individual batches are rows, not partitions.
This new schema is introduced by the new `system.batchlog_v2` table, introduced by this PR:

    CREATE TABLE system.batchlog_v2 (
        version int,
        stage int,
        shard int,
        written_at timestamp,
        id uuid,
        data blob,
        PRIMARY KEY ((version, stage, shard), written_at, id));

The new schema organization has the following goals:
1) Make post-replay batchlog cleanup possible with a simple range-tombstone. This allows dropping the individual dead batchlog entries, as they are shadowed by a higher level tombstone. This enables dropping tombstones without tombstone GC.
2) To make the above possible, introduce the stage key component: batchlog entries that fail the first replay attempt, are moved to the failed_replay stage, so the initial stage can be cleaned up safely.
3) Spread out the data among Scylla shards, via the batchlog shard column.
4) Make batchlog entries ordered by the batchlog create time (id). This allows for selecting batchlogs to replay, without post-filtering of batchlogs that are too young to be replayed.

Fixes: https://github.com/scylladb/scylladb/issues/23358

This is an improvement, normally not a backport-candidate. We might override this and backport to allow wider use of `tombstone_gc: {'mode': 'repair'}`.

Closes scylladb/scylladb#26671

* github.com:scylladb/scylladb:
  db/config: change batchlog_replay_cleanup_after_replays default to 1
  test/boost/batchlog_manager_test: add test for batchlog cleanup
  replica/mutation_dump: always set position weight for clustering positions
  service/storage_proxy: s/batch_replay_throw/storage_proxy_fail_replay_batch/
  test/lib: introduce error_injection.hh
  utils/error_injection: add debug log to disable() and disable_all()
  test/lib/cql_test_env: forward config to batchlog
  test/lib/cql_test_env: add batch type to execute_batch()
  test/lib/cql_assertions: add with_size(predicate) overload
  test/lib/cql_assertions: add source location to fail messages
  test/lib/cql_assertions: columns_assertions: add assert_for_columns_of_each_row()
  test/lib/cql_assertions: rows_assertions::assert_for_columns_of_row(): add index bound check
  test/lib/cql_assertions: columns_assertions: add T* with_typed_column() overload
  db/batchlog_manager: config: s/write_timeout/reply_timeot/
  db,service: switch to system.batchlog_v2
  db/system_keyspace: introduce system.batchlog_v2
  service,db: extract generation of batchlog delete mutation
  service,db: extract get_batchlog_mutation_for() from storage-proxy
  db/batchlog_manager: only consider propagation delay with tombstone-gc=repair
  db/batchlog_manager: don't drop entire batch if one mutations' table was dropped
  data_dictionary: table: add get_truncation_time()
  db/batchlog_manager: batch(): replace map_reduce() with simple loop
  db/batchlog_manager: finish coroutinizing replay_all_failed_batches
  db/batchlog_manager: improve replayAllFailedBatches logs
2025-12-15 15:05:19 +03:00
Avi Kivity
24264e24bb Revert "repair: Add tablet repair progress report support"
This reverts commit faad0167d7. It causes
a regression in

test_two_tablets_concurrent_repair_and_migration_repair_writer_level

in debug mode (with ~5%-10% probability).

Fixes #27510.

Closes scylladb/scylladb#27560
2025-12-11 12:18:11 +02:00
Asias He
faad0167d7 repair: Add tablet repair progress report support
This patch adds tablet repair progress report support so that the user
could use the /task_manager/task_status API to query the progress.

In order to support this, a new system table is introduced to record the
user request related info, i.e, start of the request and end of the
request.

The progress is accurate when tablet split or merge happens in the
middle of the request, since the tokens of the tablet are recorded when
the request is started and when repair of each tablet is finished. The
original tablet repair is considered as finished when the finished
ranges cover the original tablet token ranges.

After this patch, the /task_manager/task_status API will report correct
progress_total and progress_completed.

Fixes #22564
Fixes #26896

Closes scylladb/scylladb#26924
2025-12-08 13:35:19 +02:00
Tomasz Grabiec
d4014b7970 Drop legacy schema support
We switched to using v3 schema tables (in system_schema keyspace) in
2017, in 9eb91bc30b.

So no system should have the old schema any more.

No need to run legacy_schema_migrator on boot.

Closes scylladb/scylladb#27420
2025-12-07 00:09:13 +02:00
Botond Dénes
ee851266be db/system_keyspace: introduce system.batchlog_v2
Rearranges the system.batchlog schema as follows:

    CREATE TABLE system.batchlog_v2 (
        version int,
        stage int,
        shard int,
        written_at timestamp,
        id uuid,
        data blob,
        PRIMARY KEY ((version, stage, shard), written_at, id));

With the following goals:
1) Make post-replay batchlog cleanup possible with a simple
   range-tombstone. This allows dropping the individual dead batchlog
   entries, as they are shadowed by a higher level tombstone. This
   enables dropping tombstones without tombstone GC.
2) To make the above possible, introduce the stage key component:
   batchlog entries that fail the first replay attempt, are moved to the
   failed_replay stage, so the initial stage can be cleaned up safely.
3) Spread out the data among Scylla shards, via the batchlog shard
   column.
4) Make batchlog entries ordered by the batchlog create time (id). This
   allows for selecting batchlogs to replay, without post-filtering of
   batchlogs that are too young to be replayed.
2025-12-02 14:21:25 +02:00
Michał Jadwiszczak
24d69b4005 db/view/view_building_state: replace task's state with aborted flag
After previous commits, we can drop entire task's state and replace it
with single boolean flag, which determines if a task was aborted.

Once a task was aborted, it cannot get resurrected to a normal state.
2025-11-25 12:14:04 +01:00
Gautam Menghani
939fcc0603 db/system_keyspace: Remove the FIXME related to caching of large tables
Remove the FIXME comment for re-enabling caching of the large tables
since the tables are used infrequently [1].

[1] : github.com/scylladb/scylladb/pull/26789#issuecomment-3477540364

Fixes #26032

Signed-off-by: Gautam Menghani <gautam.opensource@gmail.com>

Closes scylladb/scylladb#26789
2025-11-21 12:34:34 +02:00
Petr Gusev
82da83d0e5 topology_state_machine: add excluded_tablet_nodes field
The topology_coordinator::is_excluded() creates a temporary hash
map for each call. This is probably not a performance problem since
left_nodes_rs contains only those left nodes that are referenced
from tablet replicas, this happens temporarily while e.g. a replaced
node is being rebuilt. On the other hand, why not just have a
dedicated field in the topology_state_machine, then this code wouldn't
look suspicious.
2025-11-12 12:27:43 +01:00
Michael Litvak
e7dbccd59e cdc: use chunked_vector instead of vector for stream ids
use utils::chunked_vector instead of std::vector to store cdc stream
sets for tablets.

a cdc stream set usually represents all streams for a specific table and
timestamp, and has a stream id per each tablet of the table. each stream
id is represented by 16 bytes. thus the vector could require quite large
contiguous allocations for a table that has many tablets. change it to
chunked_vector to avoid large contiguous allocations.

Fixes scylladb/scylladb#26791

Closes scylladb/scylladb#26792
2025-10-31 13:02:34 +01:00
Michael Litvak
8743422241 cdc: improve cdc metadata loading
when loading CDC streams metadata for tablets from the tables, read only
new entries from the history table instead of reading all entries. This
improves the CDC metadata reloading, making it more efficient and
predictable.

the CDC metadata is loaded as part of group0 reload whenever the
internal CDC tables are modified. on tablet split / merge, we create a
new CDC timestamp and streams by writing them to the cdc_streams_history
table by group0 operation, and when it's applied we reload the in-memory
CDC streams map by reading from the tables and constructing the updated map.

Previously, on every update, we would read the entire
cdc_streams_history entries for the changed table, constructing all its
streams and creating a new map from scratch.

We improve this now by reading only new entries from cdc_streams_history
and append them to the existing map. we can do this because we only
append new entries to cdc_streams_history with higher timestamp than all
previous entries.

This makes this reloading more efficient and predictable, because
previously we would read a number of entries that depends on the number
of tablets splits and merges, which increases over time and is
unbounded, whereas now we read only a single stream set on each update.

Fixes scylladb/scylladb#26732
2025-10-28 08:54:09 +01:00
Andrzej Jackowski
8953f96609 system_keyspace: add service_level_driver_created
This commit extends sytem.scylla_local table with an additional
key/value pair that can be used later in this patch series to
keep an information that `sl:driver` was already created. The purpose
of storing this information is to ensure that `sl:driver` is
not recreated after being intentionally removed.

A new mutation is included in `register_raft_pull_snapshot` to keep
`service_level_driver_created` in state machine shapshot, which is
required for proper propagation of the value when a new node is added
to the cluster.

Refs: scylladb/scylladb#24411
2025-10-08 08:24:23 +02:00
Nadav Har'El
926089746b message: move RPC compression from utils/ to message/
The directory utils/ is supposed to contain general-purpose utility
classes and functions, which are either already used across the project,
or are designed to be used across the project.

This patch moves 8 files out of utils/:

    utils/advanced_rpc_compressor.hh
    utils/advanced_rpc_compressor.cc
    utils/advanced_rpc_compressor_protocol.hh
    utils/stream_compressor.hh
    utils/stream_compressor.cc
    utils/dict_trainer.cc
    utils/dict_trainer.hh
    utils/shared_dict.hh

These 8 files together implement the compression feature of RPC.
None of them are used by any other Scylla component (e.g., sstables have
a different compression), or are ready to be used by another component,
so this patch moves all of them into message/, where RPC is implemented.

Theoretically, we may want in the future to use this cluster of classes
for some other component, but even then, we shouldn't just have these
files individually in utils/ - these are not useful stand-alone
utilities. One cannot use "shared_dict.hh" assuming it is some sort of
general-purpose shared hash table or something - it is completely
specific to compression and zstd, and specifically to its use in those
other classes.

Beyond moving these 8 files, this patch also contains changes to:
1. Fix includes to the 5 moved header files (.hh).
2. Fix configure.py, utils/CMakeLists.txt and message/CMakeLists.txt
   for the three moved source files (.cc).
3. In the moved files, change from the "utils::" namespace, to the
   "netw::" namespace used by RPC. Also needed to change a bunch
   of callers for the new namespace. Also, had to add "utils::"
   explicitly in several places which previously assumed the
   current namespace is "utils::".

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25149
2025-09-30 17:03:09 +03:00
Botond Dénes
86ed627fc4 compaction: move code to namespace compaction
The namespace usage in this directory is very inconsistent, with files
and classes scattered in:
* global namespace
* namespace compaction
* namespace sstables

With cases, where all three used in the same file. This code used to
live in sstables/ and some of it still retains namespace sstables as a
heritage of that time. The mismatch between the dir (future module) and
the namespace used is confusing, so finish the migration and move all
code in compaction/ to namespace compaction too.

This patch, although large, is mechanic and only the following kind of
changes are made:
* replace namespace sstable {} with namespace compaction {}
* add namespace compaction {}
* drop/add sstables::
* drop/add compaction::
* move around forward-declarations so they are in the correct namespace
  context

This refactoring revealed some awkward leftover coupling between
sstables and compaction, in sstables/sstable_set.cc, where the
make_sstable_set() methods of compaction strategies are implemented.
2025-09-25 15:03:56 +03:00
Ernest Zaslavsky
5ba5aec1f8 treewide: Move mutation related files to a mutation directory
As requested in #22104, moved the files and fixed other includes and build system.

Moved files:
 - combine.hh
 - collection_mutation.hh
 - collection_mutation.cc
 - converting_mutation_partition_applier.hh
 - converting_mutation_partition_applier.cc
 - counters.hh
 - counters.cc
 - timestamp.hh

Fixes: #22104

This is a cleanup, no need to backport

Closes scylladb/scylladb#25085
2025-09-24 13:23:38 +03:00
Avi Kivity
1258e7c165 Revert "Merge 'transport: service_level_controller: create and use driver service level' from Andrzej Jackowski"
This reverts commit fe7e63f109, reversing
changes made to b5f3f2f4c5. It is causing
test.py failures around cqlpy.

Fixes #26163

Closes scylladb/scylladb#26174
2025-09-22 09:32:46 +03:00
Michael Litvak
6043409c31 view_builder: register view on all shards atomically
When the view builder starts to build a new view, each shard registers
itself by writing the shard id and current token to the
scylla_views_builds_in_progress table.

Previously, this happened independently by each shard. We change it now
to register all shards "atomically" - when a shard registers itself, it
also registers all other shards with an empty status, if they aren't
registered yet. This ensures that we don't have a partial state in the
table where only some of the shards are registered, but we always have a
status for all shards.

The reason we want to register all shards atomically is that if it
happens that only some of the shards were registered, then we restart
and load the status from table, this doesn't work well for multiple
reasons.

One example is that to know how many shards we had previously, we take
the maximum shard id we see in the table. If it's different than the
current shard count, we will execute the reshard code. But of course, if
the last shard is missing from the table because it didn't register
itself, this calculation will be wrong, and we can't know the previous
number of shards.

This is a problem because suppose we have two shards, and shard 0
finished building the view but shard 1 didn't start. When we come up, we
will think that previously we had only a single shard and it completed
building everything, when in fact we built only half the view
approximately. The problem is that we don't have enough information in
the tables to know that.

There are additional problems related to reshard. In the reshard
function, whether it is executed because we actually do node reshard or
because we calculated the wrong number of previous shards, if the status
of some shard is missing then the calculation of new ranges will be
wrong. When some shard didn't make progress we should start building the
view from scratch. However, this doesn't happen if we don't have a
status for the shard, because the code looks only for shards that have a
status. In effect, this shard is considered complete even though it
didn't start. This could cause the view building to get stuck or
complete without building all tokens ranges.

By registering all shards atomically, this should solve the above
problems because we will always have statuses for all shards.

Fixes scylladb/scylladb#22989
2025-09-21 10:39:05 +02:00
Pavel Emelyanov
a1ea553fe1 code: Replace distributed<> with sharded<>
The latter is recommended in seastar, and the former was left as
compatibility alias. Latest seastar explicitly marks it as deprecated so
once the submodule is updated, compilation logs will explode.

Most of the patch is generated with

    for f in $(git grep -l '\<distributed<[A-Za-z0-9:_]*>') ; do sed -e 's/\<distributed<\([A-Za-z0-9:_]*\)>/sharded<\1>/g' -i $f; done
    for f in $(git grep -l distributed.hh); do sed -e 's/distributed.hh/sharded.hh/' -i $f ; done

and a small manual change in test/perf/perf.hh

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26136
2025-09-19 12:22:51 +02:00