3f7ee3ce5d introduced system.batchlog_v2, with a schema designed to speed up batchlog replays and make post-replay cleanups much more effective.
It did not introduce a cluster feature for the new table: since it is a node-local table, the cluster can switch to the new table gradually, one node at a time.
However, https://github.com/scylladb/scylladb/issues/27886 showed that the switch causes timeouts during upgrades in mixed clusters. Furthermore, switching to the new table unconditionally on upgraded nodes means that on rollback, the batches saved into the v2 table are lost.
This PR re-introduces v1 (`system.batchlog`) support and guards the use of the v2 table with a cluster feature, so mixed clusters keep using v1 and thus remain rollback-compatible.
For simplicity, the re-introduced v1 support omits post-replay cleanups. The cleanup in v1 was never particularly effective anyway, and we ended up disabling it for heavy batchlog users, so I don't think the lack of cleanup support is a problem.
Fixes: https://github.com/scylladb/scylladb/issues/27886
Needs backport to 2026.1, to fix upgrades for clusters using batches
Closes scylladb/scylladb#28736
* github.com:scylladb/scylladb:
test/boost/batchlog_manager_test: add tests for v1 batchlog
test/boost/batchlog_manager_test: make prepare_batches() work with both v1 and v2
test/boost/batchlog_manager_test: fix indentation
test/boost/batchlog_manager_test: extract prepare_batches() method
test/lib/cql_assertions: is_rows(): add dump parameter
tools/scylla-sstable: extract query result printers
tools/scylla-sstable: add std::ostream& arg to query result printers
repair/row_level: repair_flush_hints_batchlog_handler(): add all_replayed to finish log
db/batchlog_manager: re-add v1 support
db/batchlog_manager: return all_replayed from process_batch()
db/batchlog_manager: process_batch(): fix indentation
db/batchlog_manager: make batch() a standalone function
db/batchlog_manager: make structs stats public
db/batchlog_manager: allocate limiter on the stack
db/batchlog_manager: add feature_service dependency
gms/feature_service: add batchlog_v2 feature
The PR removes most of the code that assumes that group0 and raft topology are not enabled. It also makes sure that joining a cluster in no-raft mode, or upgrading a node to this version in a cluster that does not yet use raft topology, will fail.
Refs #15422
No backport needed since this removes functionality.
Closes scylladb/scylladb#28514
* https://github.com/scylladb/scylladb:
group0: fix indentation after previous patch
raft_group0: simplify get_group0_upgrade_state function since no upgrade can happen any more
raft_group0: move service::group0_upgrade_state to use fmt::formatter instead of iostream
raft_group0: remove unused code from raft_group0
node_ops: remove topology over node ops code
topology: fix indentation after the previous patch
topology: drop topology_change_enabled parameter from raft_group0 code
storage_service: remove unused handle_state_* functions
gossiper: drop wait_for_gossip_to_settle and deprecate the corresponding option
storage_service: fix indentation after the last patch
storage_service: remove gossiper bootstrapping code
storage_service: drop get_group_server_if_raft_topolgy_enabled
storage_service: drop is_topology_coordinator_enabled and its uses
storage_service: drop run_with_api_lock_in_gossiper_mode_only
topology: remove code that assumes raft_topology_change_enabled() may return false
test: schema_change_test: make test_schema_digest_does_not_change_with_disabled_features tests run in raft mode
test: schema_change_test: drop schema tests relevant for no raft mode only
topology: remove upgrade to raft topology code
group0: remove upgrade to group0 code
group0: refuse to boot if a cluster is still not in raft topology mode
storage_service: refuse to join a cluster in legacy mode
This series implements a new per-row TTL feature for CQL. The per-row TTL feature was requested in issue #13000. It is a feature that does not exist in Cassandra, and was inspired by DynamoDB's TTL feature - and under the hood uses the same implementation that we used in Alternator to implement this DynamoDB feature.
The new per-row TTL feature is completely separate from CQL's existing per-write (and per-cell) TTL, and both will be available to users.
In the per-row TTL feature, one column in the table is designated as the "TTL" column, and its value for a row is the expiration time for that row. The TTL column can be designated at table creation time, e.g.:
```cql
CREATE TABLE tab (
id int PRIMARY KEY,
t text,
expiration timestamp TTL
);
```
Or after the table already exists with:
```cql
ALTER TABLE tab TTL expiration
```
Expiration can also be disabled, with:
```cql
ALTER TABLE tab TTL NULL
```
The new per-row TTL feature has two features that users have been asking for:
1. A user can change the value of just the TTL column - without rewriting the entire row - to change the expiration time of the entire row (see the example after this list).
2. When an expired row is finally deleted, a CDC event about this deletion appears in the CDC log (if CDC is enabled), including - if a preimage is enabled - the content of the deleted row.
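For example, using the `tab` schema above, a single-column update is enough to push back a row's expiration:
```cql
-- Extend the lifetime of row id=1 without rewriting t:
UPDATE tab SET expiration = '2027-01-01 00:00:00+0000' WHERE id = 1;
```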
To achieve the second goal (CDC events), a row is not guaranteed to disappear at exactly its expiration time (as CQL's original TTL feature guarantees). Rather, the row is deleted some time later, depending on `alternator_ttl_period_in_seconds`; until the actual deletion, the row is still readable (and even writable). But we are guaranteed that when the row is finally deleted, the CDC event will come too.
The implementation uses the same background thread used by Alternator to periodically scan for expired items and delete them.
The expiration thread keeps the same metrics as it did for Alternator:
* `scylla_expiration_scan_passes`
* `scylla_expiration_scan_table`
* `scylla_expiration_items_deleted`
* `scylla_expiration_secondary_ranges_scanned`
The series begins with a few small preparation patches, followed by the main part of the feature (which isn't big, since we are just enabling the pre-existing Alternator expiration machinery for CQL) and finally 30 tests (single-node and multi-node tests) and documentation.
This series is a new feature, so traditionally it would not be backported. However, I wouldn't be surprised if we are asked to backport it so that customers will not need to wait for a new major release.
Fixes #13000
Closes scylladb/scylladb#28320
* github.com:scylladb/scylladb:
test/cqlpy: verify that a column can't be both STATIC and PRIMARY KEY
docs/cql: document the new CQL per-row TTL feature
test/cluster: tests for the new CQL per-row TTL feature
test/cqlpy: tests for the new CQL per-row TTL feature
test: set low alternator_ttl_period_in_seconds in CQL tests
cql ttl: fix ALTER TABLE to disable TTL if column is dropped
cql ttl: add setting/unsetting of TTL column to ALTER TABLE
cql ttl: add TTL column support to CREATE TABLE and DESC TABLE
ttl: add CQL support to Alternator's TTL expiration service
alternator ttl: move TTL_TAG_KEY to a header file
alternator ttl: remove unnecessary check of feature flag
cql: add "cql_row_ttl" cluster feature
alternator: fix error message if UpdateTimeToLive is not supported
The Alternator TTL feature uses an "expiration service", a background
thread on each shard which periodically scans for expired items and
deletes them. When writing the expiration service, we already
anticipated that the day would come when we'd want to use it for CQL
too. Well, now that we want to use it for CQL, we only need to make
two changes:
1. Before this patch, the expiration service was only started if
Alternator was enabled. Now we need to start it unconditionally,
as both Alternator and CQL will need to use it.
The performance impact of the new background threads, when not
needed, should be minimal: These threads will wake up every
alternator_ttl_period_in_seconds (by default - once a day) and
just check if any table has per-row TTL enabled, and if not, do
nothing.
2. Before this patch, the expiration-time column had to be of type
"decimal" - a variable-precision floating-point type. This made
sense in Alternator, where all numbers are of this type, but CQL
offers better and more efficient types for this purpose. In this
patch we add support for two additional types for the expiration
time column: the "timestamp" type (which uses millisecond precision,
which our implementation truncates to whole seconds) and the
"bigint" type, storing a number of seconds since the UNIX epoch.
We also support the smaller "int" type for compatibility with
existing data, but it is not recommended because a signed
32-bit integer counting time from 1970 will break in 2038.
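For example, combined with the new CQL syntax added later in this
series, the same expiration instant can be stored in either of the
recommended types (a sketch; 1767225600 is 2026-01-01 00:00:00 UTC):
```cql
-- timestamp variant (milliseconds, truncated to whole seconds):
CREATE TABLE ev1 (id int PRIMARY KEY, v text, expire_at timestamp TTL);
INSERT INTO ev1 (id, v, expire_at) VALUES (1, 'x', '2026-01-01 00:00:00+0000');

-- bigint variant (seconds since the UNIX epoch):
CREATE TABLE ev2 (id int PRIMARY KEY, v text, expire_at bigint TTL);
INSERT INTO ev2 (id, v, expire_at) VALUES (1, 'x', 1767225600);
```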
After this patch, the expiration service supports CQL tables, but there
is nothing yet that can enable it on CQL tables - i.e., nothing that
sets the appropriate tag on the table to tell the expiration service
which column is the expiration-time column. We'll add new syntax to
do this in the next patch.
At the moment, we leave the expiration service implementation in
its existing location - alternator/ttl.cc. This is despite the fact
that we now start it and use it also for CQL. For better modularity,
we should probably later move the expiration service implementation
to a separate module (directory).
Similarly, the expiration service's period is still configured via
alternator_ttl_period_in_seconds, which is now a misnomer because it
also affects CQL. Later we can rename this configuration parameter,
or alternatively, consider different scan periods for different tables
and table types, and have separate configuration for Alternator TTL
and CQL per-row TTL.
The metrics kept by the expiration service are the same metrics that
existed for Alternator TTL, and fortunately do not have "alternator"
in their name:
* scylla_expiration_scan_passes
* scylla_expiration_scan_table
* scylla_expiration_items_deleted
* scylla_expiration_secondary_ranges_scanned
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The patch removes the code protected by !raft_topology_change_enabled()
since it is no longer reachable. Drop test_lwt_for_tablets_is_not_supported_without_raft
since non-raft mode is no longer supported.
Some storage_service rpc verbs may check that a handler is executed
inside the gossiper scheduling group. For that, the expected group is
grabbed from the database.
This patch puts the gossiper sched group into the debug namespace and
makes this check use it from there. It removes one more place that uses
the database as a config provider.
Refs #28410
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#28427
The patch marks force-gossip-topology-changes as deprecated and removes
tests that use it. There is one test (test_different_group0_ids) which
is marked as xfail instead, since it looks like gossiper mode was used
there as a way to easily achieve a certain state; more investigation is
needed to determine whether the test can be fixed to use raft mode instead.
Closes scylladb/scylladb#28383
This reverts commit 7bf7ff785a. The commit
tried to add clean shutdown to `scylla perf` paths, but forgot at least
`scylla perf-alternator --workload wr`, which now crashes on an
uninitialized `c.as`.
Fixes #28473
Closes scylladb/scylladb#28478
Before, we could observe two identical
"starting auth service" messages in the log:
one from checkpoint(), the other from notify().
We remove the second one to stay consistent
with other services.
Closes scylladb/scylladb#28349
In fact, it's partially there already. When view_builder::start() is called, it first calls initialization code (the start_in_background() method), then kicks off do_build_step(), which runs a background fiber to perform build steps. The starting code inherits its scheduling group from main(), while the step fiber needs to run in a maintenance scheduling group, so it explicitly grabs one via database->db_config.
This PR mainly gets rid of the call to database::get_streaming_scheduling_group() from do_build_step(), in preparation for splitting the streaming scheduling group into parts (see SCYLLADB-351). To make this happen, do_build_step() is patched to inherit its scheduling group from view_builder::start(), and start() itself is called by main from the maintenance scheduling group (like for other view building services).
New feature (nested scheduling group), not backporting
Closes scylladb/scylladb#28386
* github.com:scylladb/scylladb:
view_builder: Start background in maintenance group
view_builder: Wake-up step fiber with condition variable
In a lambda returned from make_streaming_consumer() there's a check
that the current scheduling group is the streaming one. It came from
#17090, where streaming code was launched in the wrong sched group,
thus affecting user groups in a bad way.
The check is nice and useful, but it abuses replica::database by getting
unrelated information from it.
To preserve the check and to stop using the database as a provider of
configs, keep the streaming scheduling group handle in the debug
namespace. This emphasises that this global variable is purely for
debugging purposes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#28410
Currently view_builder::start() is called in the default scheduling
group. Once it initializes itself, it wakes up the step fiber, which
explicitly switches to the maintenance scheduling group.
This explicit switch made sense before the previous patch, when the
fiber was implemented as a serialized action. Now the fiber starts
directly from the .start() method and can inherit its scheduling group
from it.
With that, main code calls view_builder::start() in the maintenance
scheduling group, killing two birds with one stone. First, the step
fiber no longer needs to borrow its scheduling group indirectly via the
database. Second, the start_in_background() code itself runs in a more
suitable scheduling group.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently, tablet_allocator switches to the streaming scheduling group
that it gets from the database. It's not nice to use the database as a
provider of configs/scheduling_groups.
This patch adds a background scheduling group for tablet allocator
configured via its config and sets it to streaming group in main.cc
code.
This will help split the streaming scheduling group into more
elaborate groups under the maintenance supergroup: SCYLLADB-351
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#28356
Currently it grabs one from the database, but it's not nice to use the
database as a config/sched-groups provider.
This PR passes the scheduling group to use for sending hints via the
manager which, in turn, gets one from the proxy via its config (the
proxy config already carries configuration for the hints manager). The
group is initialized in main.cc code and is set to the maintenance one
(nowadays it's the same as the streaming group).
This will help split the streaming scheduling group into more
elaborate groups under the maintenance supergroup: SCYLLADB-351
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#28358
In this PR we add a basic implementation of the strongly-consistent tables:
* generate raft group id when a strongly-consistent table is created
* persist it into system.tables table
* start raft groups on replicas when a strongly-consistent tablet_map reaches them
* add strongly-consistent version of the storage_proxy, with the `query` and `mutate` methods
* the `mutate` method submits a command to the tablets raft group, the query method reads the data with `raft.read_barrier()`
* strongly-consistent versions of the `select_statement` and `modification_statement` are added
* a basic `test_strong_consistency.py/test_basic_write_read` is added to check that we can write and read data in a strongly consistent fashion.
Limitations:
* for now the strongly consistent tables can have tablets only on shard zero. This is because we (ab/re)use the existing raft system tables which live only on shard0. In the next PRs we'll create separate tables for the new tablets raft groups.
* No Scylla-side proxying - the test has to figure out who is the leader and submit the command to the right node. This will be fixed separately.
* No tablet balancing -- migration/split/merges require separate complicated code.
The new behavior is hidden behind the `STRONGLY_CONSISTENT_TABLES` cluster feature, which is enabled when the `STRONGLY_CONSISTENT_TABLES` experimental feature flag is set.
Requirements, specs and general overview of the feature can be found [here](https://scylladb.atlassian.net/wiki/spaces/RND/pages/91422722/Strong+Consistency). Short term implementation plan is [here](https://docs.google.com/document/d/1afKeeHaCkKxER7IThHkaAQlh2JWpbqhFLIQ3CzmiXhI/edit?tab=t.0#heading=h.thkorgfek290)
One can check the strongly consistent writes and reads locally via cqlsh:
scylla.yaml:
```
experimental_features:
- strongly-consistent-tables
```
cqlsh:
```
CREATE KEYSPACE IF NOT EXISTS my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1} AND tablets = {'initial': 1} AND consistency = 'local';
CREATE TABLE my_ks.test (pk int PRIMARY KEY, c int);
INSERT INTO my_ks.test (pk, c) VALUES (10, 20);
SELECT * FROM my_ks.test WHERE pk = 10;
```
Fixes SCYLLADB-34
Fixes SCYLLADB-32
Fixes SCYLLADB-31
Fixes SCYLLADB-33
Fixes SCYLLADB-56
backport: no need
Closes scylladb/scylladb#27614
* https://github.com/scylladb/scylladb:
test_encryption: capture stderr
test/cluster: add test_strong_consistency.py
raft_group_registry: disable metrics for non-0 groups
strong consistency: implement select_statement::do_execute()
cql: add select_statement.cc
strong consistency: implement coordinator::query()
cql: add modification_statement
cql: add statement_helpers
strong consistency: implement coordinator::mutate()
raft.hh: make server::wait_for_leader() public
strong_consistency: add coordinator
modification_statement: make get_timeout public
strong_consistency: add groups_manager
strong_consistency: add state_machine and raft_command
table: add get_max_timestamp_for_tablet
tablets: generate raft group_id-s for new table
tablet_replication_strategy: add consistency field
tablets: add raft_group_id
modification_statement: remove virtual where it's not needed
modification_statement: inline prepare_statement()
system_keyspace: disable tablet_balancing for strongly_consistent_tables
cql: rename strongly_consistent statements to broadcast statements
This commit avoids leaking a seastar::async future from two benchmark
tools: perf-alternator and perf-cql-raw. Additionally it adds an
abort_source for fast and clean shutdown.
Add the enforce_rack_list option. When the option is set to true,
all tablet keyspaces have a rack-list replication factor.
When the option is on:
- CREATE STATEMENT always auto-extends rf to rack lists;
- ALTER STATEMENT fails when there is numeric rf in any DC.
The flag is set to false by default, and a node needs to be restarted
in order to change its value. Starting a node with the
enforce_rack_list option enabled will fail if there are any tablet
keyspaces with numeric rf in any DC.
enforce_rack_list is a per-node option, and a user needs to ensure
that no tablet keyspace is altered or created while nodes in
the cluster have inconsistent values of it.
Mark rf_rack_valid_keyspaces as deprecated.
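For illustration, with `enforce_rack_list: true` the behavior would look like this (a sketch; assuming a DC `dc1` whose racks are r1, r2, r3):
```cql
-- CREATE with a numeric RF succeeds and is auto-extended to the
-- equivalent rack list, i.e. 'dc1': ['r1', 'r2', 'r3']:
CREATE KEYSPACE ks WITH replication =
    {'class': 'NetworkTopologyStrategy', 'dc1': 3};

-- ALTER with a numeric RF in any DC is rejected:
ALTER KEYSPACE ks WITH replication =
    {'class': 'NetworkTopologyStrategy', 'dc1': 5};
```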
Fixes: https://github.com/scylladb/scylladb/issues/26399
New feature; no backport needed
Closes scylladb/scylladb#28084
* github.com:scylladb/scylladb:
test: add test for enforce_rack_list option
db: mark rf_rack_valid_keyspaces as deprecated
config: add enforce_rack_list option
Revert "alternator: require rf_rack_valid_keyspaces when creating index"
In PR 5b6570be52 we introduced the config option `sstable_compression_user_table_options` to allow adjusting the default compression settings for user tables. However, the new option was hooked into the CQL layer and applied only to CQL base tables, not to the whole spectrum of user tables: CQL auxiliary tables (materialized views, secondary indexes, CDC log tables), Alternator base tables, Alternator auxiliary tables (GSIs, LSIs, Streams).
This gap also led to inconsistent default compression algorithms after we changed the option’s default algorithm from LZ4 to LZ4WithDicts (adf9c426c2).
This series introduces a general “schema initializer” mechanism in `schema_builder` and uses it to apply the default compression settings uniformly across all user tables. This ensures that all base and aux tables take their default compression settings from config.
Fixes#26914.
Backport justification: LZ4WithDicts is the new default since 2025.4, but the config option exists since 2025.2. Based on severity, I suggest we backport only to 2025.4 to maintain consistency of the defaults.
Closes scylladb/scylladb#27204
* github.com:scylladb/scylladb:
db/config: Update sstable_compression_user_table_options description
schema: Add initializer for compression defaults
schema: Generalize static configurators into schema initializers
schema: Initialize static properties eagerly
db: config: Add accessor for sstable_compression_user_table_options
test: Check that CQL and Alternator tables respect compression config
Add the `coordinator` class, which will be responsible for coordinating
reads and writes to strongly consistent tables. This commit includes
only the boilerplate; the methods will be implemented in separate
commits.
Add a `raft_group_id` column to `system.tablets` and to the `tablet_map`
class. The column is populated only when the
`strongly_consistent_tables` feature is enabled.
This feature is currently disabled by default and is enabled only when
the user sets the `STRONGLY_CONSISTENT_TABLES` experimental flag.
The `raft_group_id` column is added to `system.tablets` only when this
flag is set. This allows the schema to evolve freely while the feature
is experimental, without requiring complex migrations.
This patch changes the layout of user-facing scheduling groups from
/
`- statement
`- sl:default
`- sl:*
`- other groups (compaction, streaming, etc.)
into
/
`- user (supergroup)
`- statement
`- sl:default
`- sl:*
`- other groups (compaction, streaming, etc.)
The new supergroup has 1000 static shares and is name-less, in the
sense that there is only a variable in the code to refer to it; it is
not exported via metrics (should be fixed in seastar if we want to).
The moved groups don't change their names or shares, only move inside
the scheduling hierarchy.
The goal of the change is to improve resource consumption of sl:*
groups. Right now activities in low-shares service levels are scheduled
on par with e.g. streaming activity, which is considered to be a
low-prio one. Moving all sl:* groups into their own supergroup with
1000 shares changes the meaning of sl:* shares. From now on these share
values describe priorities of service levels relative to each other,
and the user activities compete with the rest of the system with 1000
shares, regardless of how many service levels there are.
Unit tests keep their user groups under the root supergroup (for
simplicity).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#28235
Add the enforce_rack_list option. When the option is set to true,
all tablet keyspaces have a rack-list replication factor.
When the option is on:
- CREATE STATEMENT always auto-extends rf to rack lists;
- ALTER STATEMENT fails when there is numeric rf in any DC.
The flag is set to false by default, and a node needs to be restarted
in order to change its value. Starting a node with the
enforce_rack_list option enabled will fail if there are any tablet
keyspaces with numeric rf in any DC.
enforce_rack_list is a per-node option, and a user needs to ensure
that no tablet keyspace is altered or created while nodes in
the cluster have inconsistent values of it.
Commit d54d409 (audit: write out to both table and syslog) unified
create_audit and start_audit, which moved the audit service creation later
in the startup sequence. This broke startup when audit is enabled because
view_builder prepares CQL queries before start_audit runs, and
query preparation calls audit_instance().local_is_initialized()
which crashes on the non-existent sharded service.
Move start_audit to run before view_builder::start() and other components
that may prepare CQL queries during their initialization.
Fixes SCYLLADB-252
Closes scylladb/scylladb#28139
In PR 5b6570be52 we introduced the config option
`sstable_compression_user_table_options` to allow adjusting the default
compression settings for user tables. However, the new option was hooked
into the CQL layer and applied only to CQL base tables, not to the whole
spectrum of user tables: CQL auxiliary tables (materialized views,
secondary indexes, CDC log tables), Alternator base tables, Alternator
auxiliary tables (GSIs, LSIs, Streams).
Fix this by moving the logic into the `schema_builder` via a schema
initializer. This ensures that the default compression settings are
applied uniformly regardless of how the table is created, while also
keeping the logic in a central place.
Register the initializer at startup in all executables where schemas are
being used (`scylla_main()`, `scylla_sstable_main()`, `cql_test_env`).
Finally, remove the ad-hoc logic from `create_table_statement`
(redundant as of this patch), remove the xfail markers from the relevant
tests and adjust `test_describe_cdc_log_table_create_statement` to
expect LZ4WithDicts as the default compressor.
Fixes#26914.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The `sstable_compression_user_table_options` config option determines
the default compression settings for user tables.
In patch 2fc812a1b9, the default value of this option was changed from
LZ4 to LZ4WithDicts, and fallback logic was introduced during startup
to temporarily revert the option to LZ4 until the dictionary compression
feature is enabled.
Replace this fallback logic with an accessor that returns the correct
settings depending on the feature flag. This is cleaner and more
consistent with the way we handle the `sstable_format` option, where the
same problem appears (see `get_preferred_sstable_version()`).
As a consequence, the configuration option must always be accessed
through this accessor. Add a comment to point this out.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Refs #27429
If running with broadcast_address != listen/cql/rpc address, topology
gets confused about the varying addresses. We need to special-case
resolving both addresses as "self", i.e., extend the broadcast_address
treatment to cql_address as well.
Added export of this via gossiper for symmetry.
Allow creating materialized views and secondary indexes in a tablets keyspace only if it's RF-rack-valid, and enforce RF-rack-validity while the keyspace has views by restricting some operations:
* Altering a keyspace's RF if it would make the keyspace RF-rack-invalid
* Adding a node in a new rack
* Removing / Decommissioning the last node in a rack
Previously the config option `rf_rack_valid_keyspaces` was required for creating views. We now remove this restriction - it's not needed because we always maintain RF-rack-validity for keyspaces with views.
The restrictions are relevant only for keyspaces with numerical RF. Keyspaces with rack-list-based RF are always RF-rack-valid.
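For example (a sketch; assuming a single DC `dc1` with exactly three racks), a numeric RF equal to the number of racks keeps the keyspace RF-rack-valid, so creating a view in it is allowed:
```cql
CREATE KEYSPACE ks WITH replication =
    {'class': 'NetworkTopologyStrategy', 'dc1': 3}
    AND tablets = {'enabled': true};
CREATE TABLE ks.t (pk int PRIMARY KEY, v int);
CREATE MATERIALIZED VIEW ks.t_by_v AS
    SELECT * FROM ks.t WHERE v IS NOT NULL AND pk IS NOT NULL
    PRIMARY KEY (v, pk);
```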
Fixes scylladb/scylladb#23345
Fixes https://github.com/scylladb/scylladb/issues/26820
backport to relevant versions for materialized views with tablets since it depends on rf-rack validity
Closes scylladb/scylladb#26354
* github.com:scylladb/scylladb:
docs: update RF-rack restrictions
cql3: don't apply RF-rack restrictions on vector indexes
cql3: add warning when creating mv/index with tablets about rf-rack
service/tablet_allocator: always allow tablet merge of tables with views
locator: extend rf-rack validation for rack lists
test: test rf-rack validity when creating keyspace during node ops
locator: fix rf-rack validation during node join/remove
test: test topology restrictions for views with tablets
test: add test_topology_ops_with_rf_rack_valid
topology coordinator: restrict node join/remove to preserve RF-rack validity
topology coordinator: add validation to node remove
locator: extend rf-rack validation functions
view: change validate_view_keyspace to allow MVs if RF=Racks
db: enforce rf-rack-validity for keyspaces with views
replica/db: add enforce_rf_rack_validity_for_keyspace helper
db: remove enforce parameter from check_rf_rack_validity
test: adjust test to not break rf-rack validity
Auth cache loading at startup races between the auth
service and the raft code, and it doesn't support
concurrency, causing it to crash.
We can't easily remove either of the places: during
raft recovery the snapshot is not loaded, and we rely
on loading the cache via the auth service. Therefore
we add a semaphore.
Fixes https://github.com/scylladb/scylladb/issues/27540
Closes scylladb/scylladb#27573
The function validate_view_keyspace checks if a keyspace is eligible for
having materialized views, and it is used for validation when creating a
MV or a MV-based index.
Previously, it was required that the rf_rack_valid_keyspaces option be
set in order for tablets-based keyspaces to be considered eligible, and
the RF-rack condition was enforced when the option was set.
Instead of this, we change the validation to allow MVs in a keyspace if
the RF-rack condition is satisfied for the keyspace - regardless of the
config option.
We remove the startup validation that requires the option
`rf_rack_valid_keyspaces` to be set if there are any views with
tablets, since this is not required anymore.
We can do this without worrying about upgrades because this change will
be effective from 2025.4, where MVs with tablets are first out of the
experimental phase.
We update the test for MV and index restrictions in tablets keyspaces
according to the new requirements.
* Create MV/index: previously the test checked that it's allowed only if
the config option `rf_rack_valid_keyspaces` is set. This is changed
now so it's always allowed to create MV/index if the keyspace is
RF-rack-valid. Update the test to verify that we can create MV/index
when the keyspace is RF-rack-valid, even if the rf_rack option is not
set, and verify that it fails when the keyspace is RF-rack-invalid.
* Alter: Add a new test to verify that while a keyspace has views, it
can't be altered to become RF-rack-invalid.
Simple refactoring: the enforce parameter is always given the value of
the `rf_rack_valid_keyspaces` option. Remove the parameter and use the
option value directly from the db config.
This will be useful for a later change to the enforcement conditions.
`system.client_routes` is a system table that sets the target address and ports for each `host_id`, for one or more connections (e.g., Private Link) represented by `connection_id`. Cloud will write the table via REST, and drivers will read it via CQL to override values obtained from `system.local` and `system.peers`.
This patch series contains:
- Introduction of `CLIENT_ROUTES` feature flag.
- Implementation of raft-based `system.client_routes` table
- Implementation of `v2/client-routes` POST/DELETE/GET endpoints
- Implementation of new `CLIENT_ROUTES_CHANGE` event that is sent to drivers when `system.client_routes` is changed
- New tests that verify the aforementioned features
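For illustration, drivers read the overrides with a plain CQL query (the exact column set is defined by the patches listed below):
```cql
-- Per the description above, rows map a host_id and connection_id to the
-- target address and ports a driver should use instead of the defaults:
SELECT * FROM system.client_routes;
```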
Ref: scylladb/scylla-enterprise#5699
For now, no automatic backport. However, the changes are planned to be released on `2025.4` either as a backport or a private build.
Closes scylladb/scylladb#27323
* https://github.com/scylladb/scylladb:
docs: describe CLIENT_ROUTES_CHANGE extension
test: add test for CLIENT_ROUTES event
service: transport: add CLIENT_ROUTES_CHANGE event
test: add cluster tests for client routes
test: add API tests for client_routes endpoints
test: add `timeout` parameter to `delete` in RESTClient
test: allow json_body in send
api: implement client_routes endpoints
api: add client_routes.json
service: main: add client_routes_service
db: add system.client_routes table
gms: add CLIENT_ROUTES feature
Introduce the CLIENT_ROUTES_CHANGE event to let drivers refresh
connections when `system.client_routes` is modified. Some deployments
(e.g., Private Link) require specific address/port mappings that can
change without topology changes, and drivers need to adapt promptly
to avoid connectivity issues.
This new EVENT type carries a change indicator plus the affected
`connection_ids` and `host_ids`. The only change value is
`UPDATE_NODES`, meaning one or more client routes were inserted,
updated, or deleted.
Drivers subscribe using the existing events mechanism, so no additional
`cql_protocol_extension` key is required.
Ref: scylladb/scylla-enterprise#5699
The batchlog table contains an entry for each logged batch that is processed by the local node as coordinator. These entries are typically very short-lived: they are inserted when the batch is processed and deleted immediately after the batch is successfully applied.
When a table has `tombstone_gc = {'mode': 'repair'}` enabled, every repair has to flush all hints and batchlogs, so that we can be certain that none of these holds live data older than the last repair. Since a batch's member statements can touch any number of tables, the whole batchlog has to be flushed, even if repair-mode tombstone-gc is enabled for only a single table.
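For reference, repair-mode tombstone GC is a per-table option:
```cql
ALTER TABLE ks.tbl WITH tombstone_gc = {'mode': 'repair'};
```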
Flushing the batchlog table happens by doing a batchlog replay. This involves reading the entire content of this table, and attempting to replay+delete any live entries (that are old enough to be replayed). Under normal operating circumstances, 99%+ of the content of the batchlog table is partition tombstones. Because of this, scanning the content of this table has to process thousands to millions of tombstones. This was observed to require up to 20 minutes to finish, causing repairs to slow down to a crawl, as the batchlog-flush has to be repeated at the end of the repair of each token-range.
When trying to address this problem, the first idea was that we should expedite the garbage-collection of these accumulated tombstones. This experiment failed, see https://github.com/scylladb/scylladb/pull/23752. The commitlog proved to be an impossible-to-bypass barrier, preventing quick garbage-collection of tombstones. As long as a single commit-log segment holding content from the batchlog table is alive, all tombstones written after it are blocked from GC.
The second approach, represented by this PR, is to not rely on tombstone GC to reduce the tombstone amount, but instead restructure the table such that a single higher-order tombstone can be used to shadow, and allow the eviction of, the myriads of individual batchlog-entry tombstones. This is realized by reorganizing the batchlog table such that individual batches are rows, not partitions.
This PR introduces the new schema via the new `system.batchlog_v2` table:
```cql
CREATE TABLE system.batchlog_v2 (
    version int,
    stage int,
    shard int,
    written_at timestamp,
    id uuid,
    data blob,
    PRIMARY KEY ((version, stage, shard), written_at, id));
```
The new schema organization has the following goals:
1) Make post-replay batchlog cleanup possible with a simple range-tombstone (see the sketch after this list). This allows dropping the individual dead batchlog entries, as they are shadowed by a higher-level tombstone, without relying on tombstone GC.
2) To make the above possible, introduce the stage key component: batchlog entries that fail the first replay attempt are moved to the failed_replay stage, so the initial stage can be cleaned up safely.
3) Spread out the data among Scylla shards, via the batchlog shard column.
4) Make batchlog entries ordered by the batchlog create time (id). This allows for selecting batchlogs to replay, without post-filtering of batchlogs that are too young to be replayed.
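To illustrate goal 1: in CQL terms, the post-replay cleanup amounts to a single range deletion per partition, shadowing every individual entry tombstone below it (a sketch of the mutation the manager generates internally; the bound values are placeholders):
```cql
-- Drop everything already replayed in one (version, stage, shard)
-- partition, up to a safe written_at horizon:
DELETE FROM system.batchlog_v2
 WHERE version = 0 AND stage = 0 AND shard = 0
   AND written_at < '2025-01-01 00:00:00+0000';
```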
Fixes: https://github.com/scylladb/scylladb/issues/23358
This is an improvement, normally not a backport-candidate. We might override this and backport to allow wider use of `tombstone_gc: {'mode': 'repair'}`.
Closes scylladb/scylladb#26671
* github.com:scylladb/scylladb:
db/config: change batchlog_replay_cleanup_after_replays default to 1
test/boost/batchlog_manager_test: add test for batchlog cleanup
replica/mutation_dump: always set position weight for clustering positions
service/storage_proxy: s/batch_replay_throw/storage_proxy_fail_replay_batch/
test/lib: introduce error_injection.hh
utils/error_injection: add debug log to disable() and disable_all()
test/lib/cql_test_env: forward config to batchlog
test/lib/cql_test_env: add batch type to execute_batch()
test/lib/cql_assertions: add with_size(predicate) overload
test/lib/cql_assertions: add source location to fail messages
test/lib/cql_assertions: columns_assertions: add assert_for_columns_of_each_row()
test/lib/cql_assertions: rows_assertions::assert_for_columns_of_row(): add index bound check
test/lib/cql_assertions: columns_assertions: add T* with_typed_column() overload
db/batchlog_manager: config: s/write_timeout/reply_timeout/
db,service: switch to system.batchlog_v2
db/system_keyspace: introduce system.batchlog_v2
service,db: extract generation of batchlog delete mutation
service,db: extract get_batchlog_mutation_for() from storage-proxy
db/batchlog_manager: only consider propagation delay with tombstone-gc=repair
db/batchlog_manager: don't drop entire batch if one mutation's table was dropped
data_dictionary: table: add get_truncation_time()
db/batchlog_manager: batch(): replace map_reduce() with simple loop
db/batchlog_manager: finish coroutinizing replay_all_failed_batches
db/batchlog_manager: improve replayAllFailedBatches logs
The alien thread was a solution for reactor stalls caused by indivisible
password‑hashing tasks (scylladb/scylladb#24524). However, because
there is only one alien thread, overall hashing throughput was reduced
(see, e.g., scylladb/scylla-enterprise#5711). To address this,
the alien‑thread solution is reverted, and a hashing implementation
with yielding will be introduced later in this patch series.
This reverts commit 9574513ec1.
We saw that in large clusters the direct failure detector may cause large task queues to accumulate. The series addresses this issue and also moves the code into the correct scheduling group.
Fixes https://github.com/scylladb/scylladb/issues/27142
Backport to all versions where 60f1053087 was backported, since it should improve performance in large clusters.
Closes scylladb/scylladb#27387
* github.com:scylladb/scylladb:
direct_failure_detector: run direct failure detector in the gossiper scheduling group
raft: drop invoke_on from the pinger verb handler
direct_failure_detector: pass timeout to direct_fd_ping verb
We switched to using v3 schema tables (in system_schema keyspace) in
2017, in 9eb91bc30b.
So no system should have the old schema any more.
No need to run legacy_schema_migrator on boot.
Closes scylladb/scylladb#27420
The storage_service REST API uses `group0` internally. Before this
patch, it was possible to send an HTTP request before `group0` was
initialized, which resulted in a segmentation fault. Therefore,
this patch delays the setup of the storage_service REST API.
Additionally, `test_rest_api_on_startup` is added to reproduce the problem.
Fixes: https://github.com/scylladb/scylladb/issues/27130
No backport. It's a crash fix, but the crash is possible only if a request is sent in a very specific phase of a node start.
Closes scylladb/scylladb#27410
* github.com:scylladb/scylladb:
test: add test_rest_api_on_startup
main: delay setup of storage_service REST API