scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-23 00:02:37 +00:00

Author	SHA1	Message	Date
Piotr Dulikowski	7c2b1ea0b5	Merge 'view_building: fix tombstone_warn_threshold warnings' from Michał Jadwiszczak `system.view_building_tasks` is a single-partition Raft group0 table (pk = `"view_building"`, CK = timeuuid). When `clean_finished_tasks()` deletes hundreds of finished tasks, the physical rows remain in SSTables until compaction. Any subsequent read of the partition counts every column of every tombstoned row as a dead cell, triggering `tombstone_warn_threshold` warnings in large clusters. Two-part fix: 1. Range tombstones instead of row tombstones (commits 2–3) Instead of one row tombstone per finished task, find the minimum alive task UUID (`min_alive_uuid`) and emit a single range tombstone `[before_all, min_alive_uuid)` covering all tasks below that boundary. This reduces the tombstone count significantly and also benefits future compaction. 2. Bounded scan with `min_task_id` (commits 4–6) Even with range tombstones, physical rows remain until compaction and still count as dead cells during reads. The only way to avoid them is to not read them at all. - Add a `min_task_id timeuuid` static column to `system.view_building_tasks`. - On every GC, write `min_task_id = min_alive_uuid` atomically with the range tombstone (same Raft batch). - On reload, read `min_task_id` first using a static-only partition slice (empty `_row_ranges` + `always_return_static_content`): the SSTable reader stops immediately after the static row before processing any clustering tombstones — zero dead cells counted. - Use `AND id >= min_task_id` as a lower bound for the main task scan, skipping all tombstoned rows. The static-only read and the bounded scan are gated on the `VIEW_BUILDING_TASKS_MIN_TASK_ID` cluster feature so mixed-version clusters fall back to the full scan. The issue is not critical, so the fix shouldn't be backported. 
Fixes SCYLLADB-657 Closes scylladb/scylladb#28929 * github.com:scylladb/scylladb: test/cluster/test_view_building_coordinator: add reproducer for tombstone threshold warning docs: document tombstone avoidance in view_building_tasks view_building: add `task_uuid_generator` to `view_building_task_mutation_builder` view_building: introduce `task_uuid_generator` view_building: store `min_alive_uuid` in view building state view_building: set min_task_id when GC-ing finished tasks view_building: add min_task_id support to view_building_task_mutation_builder view_building: add min_task_id static column and bounded scan to system_keyspace view_building: use range tombstone when GC-ing finished tasks view_building: add range tombstone support to view_building_task_mutation_builder view_building: introduce VIEW_BUILDING_TASKS_MIN_TASK_ID cluster feature	2026-05-12 12:38:25 +03:00
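The effect of the two-part fix above can be illustrated with a toy model. This is a minimal Python sketch and not ScyllaDB code: the `TaskPartition` class, its methods, and the use of small integers in place of timeuuids are all hypothetical. It only models the idea that rows below `min_alive_uuid` stay physically present until compaction, and that a scan bounded by `min_task_id` never visits them.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class TaskPartition:
    """Toy stand-in for the single system.view_building_tasks partition."""
    # task id -> True if live, False if covered by a tombstone but not yet compacted
    rows: Dict[int, bool] = field(default_factory=dict)
    # models the static column min_task_id (None until the first GC)
    min_task_id: Optional[int] = None

    def add_task(self, task_id: int) -> None:
        self.rows[task_id] = True

    def gc_finished(self, finished: Set[int]) -> None:
        """Model one GC pass: a range tombstone [before_all, min_alive) plus
        a min_task_id update, applied together."""
        alive = [t for t, live in self.rows.items() if live and t not in finished]
        if not alive:
            return  # the no-alive-task case is not modeled here (assumption)
        min_alive = min(alive)
        for t in self.rows:
            if t < min_alive:
                self.rows[t] = False  # dead, but still physically present
        self.min_task_id = min_alive

    def scan(self, bounded: bool) -> Tuple[List[int], int]:
        """Return (live task ids, number of dead rows the read had to visit)."""
        lower = self.min_task_id if bounded else None
        alive, dead = [], 0
        for t in sorted(self.rows):
            if lower is not None and t < lower:
                continue  # models `AND id >= min_task_id`: the row is never read
            if self.rows[t]:
                alive.append(t)
            else:
                dead += 1
        return alive, dead
```

With ten tasks of which the first five are finished, the unbounded scan in this model visits five dead rows while the bounded scan visits none; both return the same live tasks. Finished tasks above the boundary are deliberately left out of the model.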
Avi Kivity	5a887362e3	Merge 'Remove legacy tables creation code' from Gleb Natapov Drop the creation code for the `service_levels` and `cdc_generation_descriptions_v2` tables since it is no longer needed. Old clusters will still have these tables because they were created earlier. The series also contains a small improvement around group0 creation. No backport needed since this removes functionality. Closes scylladb/scylladb#29482 * github.com:scylladb/scylladb: db/system_distributed_keyspace: remove system_distributed_everywhere since it is unused db/system_distributed_keyspace: drop CDC_TOPOLOGY_DESCRIPTION and CDC_GENERATIONS_V2 db/system_distributed_keyspace: remove unused code db/system_distributed_keyspace: drop old cdc_generation_descriptions_v2 table db/system_distributed_keyspace: drop old service_levels table fix indent after the previous patch group0: call setup_group0 only when needed	2026-05-10 14:46:21 +03:00
Michał Jadwiszczak	e0942bb45a	view_building: introduce VIEW_BUILDING_TASKS_MIN_TASK_ID cluster feature This feature will be used to gate the use of min_task_id static column in system.view_building_tasks, which will be added in a subsequent commit. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-22 09:10:12 +02:00
Aleksandra Martyniuk	7cdf7d62a2	gms: add keyspace_multi_rf_change feature	2026-04-17 09:58:05 +02:00
Benny Halevy	ce00d61917	db: implement large_data virtual tables with feature flag gating Replace the physical system.large_partitions, system.large_rows, and system.large_cells CQL tables with virtual tables that read from LargeDataRecords stored in SSTable scylla metadata (tag 13). The transition is gated by a new LARGE_DATA_VIRTUAL_TABLES cluster feature flag: - Before the feature is enabled: the old physical tables remain in all_tables(), CQL writes are active, no virtual tables are registered. This ensures safe rollback during rolling upgrades. - After the feature is enabled: old physical tables are dropped from disk via legacy_drop_table_on_all_shards(), virtual tables are registered on all shards, and CQL writes are skipped via skip_cql_writes() in cql_table_large_data_handler. Key implementation details: - Three virtual table classes (large_partitions_virtual_table, large_rows_virtual_table, large_cells_virtual_table) extend streaming_virtual_table with cross-shard record collection. - generate_legacy_id() gains a version parameter; virtual tables use version 1 to get different UUIDs than the old physical tables. - compaction_time is derived from SSTable generation UUID at display time via UUID_gen::unix_timestamp(). - Legacy SSTables without LargeDataRecords emit synthetic summary rows based on above_threshold > 0 in LargeDataStats. - The activation logic uses two paths: when the feature is already enabled (test env, restart), it runs as a coroutine; when not yet enabled, it registers a when_enabled callback that runs inside seastar::async from feature_service::enable(). - sstable_3_x_test updated to use a simplified large_data_test_handler and validate LargeDataRecords in SSTable metadata directly.	2026-04-16 08:49:02 +03:00
Gleb Natapov	8713eda271	db/system_distributed_keyspace: drop old cdc_generation_descriptions_v2 table The generation management moved to raft and old table is no longer used.	2026-04-15 15:48:48 +03:00
Tomasz Grabiec	01fb97ee78	locator: tablets: Support arbitrary tablet boundaries There are several reasons we want to do that. One is that it will give us more flexibility in distributing the load. We can subdivide tablets at any points, and achieve more evenly-sized tablets. In particular, we can isolate large partitions into separate tablets. Another reason is vnode-to-tablet migration. We could construct a tablet map which matches exactly the vnode boundaries, so migration can happen transparently from the CQL-coordinator's point of view. Implementation details: We store a vector of tokens which represent tablet boundaries in the tablet_id_map. tablet_id keeps its meaning, it's an index into vector of tablets. To avoid logarithmic lookup of tablet_id from the token, we introduce a lookup structure with power-of-two aligned buckets, and store the tablet_id of the tablet which owns the first token in the bucket. This way, lookup needs to consider tablet id range which overlaps with one bucket. If boundaries are more or less aligned, there are around 1-2 tablets overlapping with a bucket, and the lookup is still O(1). Amount of memory used increased, but not significantly relative to old size (because tablet_info is currently fat): For 131'072 tablets: Before: Size of tablet_metadata in memory: 57456 KiB After: Size of tablet_metadata in memory: 59504 KiB	2026-04-15 01:25:14 +02:00
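The bucketed lookup described in the commit above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the actual `tablet_id_map` code: the class name `TabletIdMap`, the field names, the use of an unsigned 64-bit token space, and the choice to store each tablet's last owned token (inclusive) are all hypothetical simplifications.

```python
import bisect
from typing import List

TOKEN_BITS = 64  # assumption: tokens modeled as unsigned 64-bit integers


class TabletIdMap:
    """Maps a token to a tablet id using power-of-two aligned buckets."""

    def __init__(self, last_tokens: List[int], log2_buckets: int) -> None:
        # last_tokens[i] is the last token owned by tablet i (inclusive).
        # Boundaries must be strictly increasing and cover the whole token space.
        assert all(a < b for a, b in zip(last_tokens, last_tokens[1:]))
        assert last_tokens[-1] == 2**TOKEN_BITS - 1
        self.last_tokens = last_tokens
        self.shift = TOKEN_BITS - log2_buckets
        # For every bucket, remember the tablet owning the bucket's first token.
        self.bucket_first = [
            self.slow_lookup(b << self.shift) for b in range(1 << log2_buckets)
        ]

    def slow_lookup(self, token: int) -> int:
        """Reference O(log n) lookup: first tablet whose last token >= token."""
        return bisect.bisect_left(self.last_tokens, token)

    def tablet_id(self, token: int) -> int:
        """Bucketed lookup: start at the bucket's first tablet and walk forward.
        With roughly aligned boundaries only 1-2 tablets overlap a bucket."""
        tid = self.bucket_first[token >> self.shift]
        while self.last_tokens[tid] < token:
            tid += 1
        return tid
```

The forward walk always terminates because the last tablet ends at the maximum token. The cost of a lookup is proportional to the number of tablets overlapping one bucket, which is why the commit message describes it as O(1) when boundaries are more or less aligned.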
Piotr Dulikowski	9fc2c65d18	Merge 'cql3: implement WRITETIME() and TTL() of individual elements of map, set, and UDT' from Nadav Har'El In commit `727f68e0f5` we added the ability to SELECT: * Individual elements of a map: `SELECT map_col[key]`. * Individual elements of a set: `SELECT set_col[key]` returns key if the key exists in the set, or null if it doesn't, which allows checking whether the element exists in the set. * Individual pieces of a UDT: `SELECT udt_col.field`. But at the time, we didn't provide any way to retrieve the meta-data for this value, namely its timestamp and TTL. We did not support `SELECT TIMESTAMP(collection[key])`, or `SELECT TIMESTAMP(udt.field)`. Users requested to support such SELECTs in the past (see issue #15427), and Cassandra 5.0 added support for this feature - for both maps and sets and udts - so we also need this feature for compatibility. This feature was also requested recently by vector-search developers, who wanted to read Alternator columns - stored as map elements, not individual columns - with their WRITETIME information. The first four patches in this series add the feature (in four smaller patches instead of one big one), the fifth and sixth patches add tests (cqlpy and boost tests, respectively). The seventh patch adds documentation. All the new tests pass on Cassandra 5, failed on Scylla before the present fix, and pass with it. The fix was surprisingly difficult. Our existing implementation (from `727f68e0f5` building on earlier machinery) doesn't just "read" `map_col[key]` and allow us to return just its timestamp. Rather, the implementation reads the entire map, serializes it in some temporary format that does not include the timestamps and ttls, and then takes the subscript key, at which point we no longer have the timestamp or ttl of the element. So the fix had to cross all these layers of the implementation. 
While adding support for UDT fields in a pre-existing grammar nonterminal "subscriptExpr", we unintentionally added support for UDT fields also in LWT expressions (which used this nonterminal). LWT missing support for UDT fields was a long-time known compatibility issue (#13624) so we unintentionally fixed it :-) Actually, to completely fix it we needed another small change in the expression implementation, so the eighth patch in this series does this. Fixes #15427 Fixes #13624 Closes scylladb/scylladb#29134 * github.com:scylladb/scylladb: cql3: support UDT fields in LWT expressions cql3: document WRITETIME() and TTL() for elements of map, set or UDT test/boost: test WRITETIME() and TTL() on map collection elements test/cqlpy: test WRITETIME() and TTL() on element of map, set or UDT cql3: prepare and evaluate WRITETIME/TTL on collection elements and UDT fields cql3: parse per-element timestamps/TTLs in the selection layer cql3: add extended wire format for per-element timestamps and TTLs cql3: extend WRITETIME/TTL grammar to accept collection and UDT elements	2026-04-14 12:35:46 +02:00
Avi Kivity	0ae22a09d4	LICENSE: Update to version 1.1 Updated terms of non-commercial use (must be a never-customer).	2026-04-12 19:46:33 +03:00
Nadav Har'El	bb63db34e5	cql3: add extended wire format for per-element timestamps and TTLs Introduce the infrastructure needed to transport per-element timestamps and TTL expiry times from replicas to coordinators, required for WRITETIME(col[key]) / TTL(col[key]) and WRITETIME(col.field) / TTL(col.field). * Add a 'writetime_ttl_individual_element' cluster feature flag that guards usage of the new wire format during rolling upgrades: the extended format is only emitted and consumed when every node in the cluster supports it. * Implement serialize_for_cql_with_timestamps() (types/types.cc), a variant of serialize_for_cql() that appends a per-element section to the regular CQL bytes, listing each live element's serialized key, timestamp, and expiry. The format is: [uint32 cql_len][cql bytes] [int32 entry_count] [per entry: (int32 key_len)(key bytes)(int64 timestamp)(int64 expiry)] expiry is -1 when the element has no TTL. * Add partition_slice::option::send_collection_timestamps and modify write_cell() (mutation_partition.cc) to use the new function serialize_for_cql_with_timestamps() when this option is available. This commit stands alone with no user-visible effect: nothing yet sets the new partition-slice option. The next patch adds the selection-layer code that sets the option and parses the extended response.	2026-04-12 11:49:06 +03:00
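The extended wire format listed in the commit above can be sketched as a round trip. This is a minimal Python illustration, not the `serialize_for_cql_with_timestamps()` implementation: the function names below are hypothetical, and big-endian byte order is an assumption (the commit message gives field widths but not endianness). A missing TTL is represented as `None` in Python and encoded as expiry `-1`, as the message describes.

```python
import struct
from typing import List, Optional, Tuple

Entry = Tuple[bytes, int, Optional[int]]  # (serialized key, timestamp, expiry or None)


def pack_with_timestamps(cql_bytes: bytes, entries: List[Entry]) -> bytes:
    """[uint32 cql_len][cql bytes][int32 entry_count]
    then per entry: (int32 key_len)(key bytes)(int64 timestamp)(int64 expiry)."""
    out = [struct.pack(">I", len(cql_bytes)), cql_bytes,
           struct.pack(">i", len(entries))]
    for key, timestamp, expiry in entries:
        out.append(struct.pack(">i", len(key)))
        out.append(key)
        # expiry is -1 when the element has no TTL
        out.append(struct.pack(">qq", timestamp, -1 if expiry is None else expiry))
    return b"".join(out)


def unpack_with_timestamps(buf: bytes) -> Tuple[bytes, List[Entry]]:
    """Inverse of pack_with_timestamps()."""
    (cql_len,) = struct.unpack_from(">I", buf, 0)
    pos = 4
    cql_bytes = buf[pos:pos + cql_len]
    pos += cql_len
    (count,) = struct.unpack_from(">i", buf, pos)
    pos += 4
    entries: List[Entry] = []
    for _ in range(count):
        (key_len,) = struct.unpack_from(">i", buf, pos)
        pos += 4
        key = buf[pos:pos + key_len]
        pos += key_len
        timestamp, expiry = struct.unpack_from(">qq", buf, pos)
        pos += 16
        entries.append((key, timestamp, None if expiry == -1 else expiry))
    return cql_bytes, entries
```

Because the regular CQL bytes come first and are length-prefixed, a reader in this sketch can recover the plain value without touching the per-element section.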
Avi Kivity	8ccee6803e	Merge 'Remove upgrade view builder' from Gleb Natapov Since we no longer support upgrading from versions that do not support v2 of the "view building status" code (building status is managed by raft), we can remove the v1 code and the upgrade code and make sure we do not boot with the old "builder status" version. v2 was introduced by `8d25a4d678`, which is included in scylla-2025.1.0. No backport needed since this is code removal. Closes scylladb/scylladb#29105 * github.com:scylladb/scylladb: view: drop unused v1 builder code view: remove upgrade to raft code	2026-04-12 00:39:26 +03:00
Nikos Dragazis	3e2dc078c9	feature_service: Add vnodes_to_tablets_migrations feature Vnodes-to-tablets migrations require cluster-level support: the REST API and the group0 state need to be supported by all nodes. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-03-24 11:06:38 +02:00
Michael Litvak	ed852a2af2	db: add logstor experimental feature flag add a new experimental feature flag for key-value tables with the new logstor storage engine.	2026-03-18 19:24:26 +01:00
Gleb Natapov	77d3245e02	view: remove upgrade to raft code Since we no longer support upgrading from versions that do not support v2 of the view building code, we can remove the upgrade code and make sure we do not boot with the old builder version.	2026-03-18 17:45:40 +02:00
Gleb Natapov	b633ec1779	features: move GROUP0_SCHEMA_VERSIONING to deprecated features list	2026-03-10 10:46:48 +02:00
Marcin Maliszkiewicz	a83ee6cf66	Merge 'db/batchlog_manager: re-add v1 support for mixed clusters' from Botond Dénes `3f7ee3ce5d` introduced system.batchlog_v2, with a schema designed to speed up batchlog replays and make post-replay cleanups much more effective. It did not introduce a cluster feature for the new table, because it is a node-local table, so the cluster can switch to the new table gradually, one node at a time. However, https://github.com/scylladb/scylladb/issues/27886 showed that the switch causes timeouts during upgrades in mixed clusters. Furthermore, switching to the new table unconditionally on upgraded nodes means that on rollback, the batches saved into the v2 table are lost. This PR re-introduces v1 (`system.batchlog`) support and guards the use of the v2 table with a cluster feature, so mixed clusters keep using v1 and thus remain rollback-compatible. The re-introduced v1 support doesn't support post-replay cleanups, for simplicity. The cleanup in v1 was never particularly effective anyway and we ended up disabling it for heavy batchlog users, so I don't think the lack of support for cleanup is a problem. 
Fixes: https://github.com/scylladb/scylladb/issues/27886 Needs backport to 2026.1, to fix upgrades for clusters using batches Closes scylladb/scylladb#28736 * github.com:scylladb/scylladb: test/boost/batchlog_manager_test: add tests for v1 batchlog test/boost/batchlog_manager_test: make prepare_batches() work with both v1 and v2 test/boost/batchlog_manager_test: fix indentation test/boost/batchlog_manager_test: extract prepare_batches() method test/lib/cql_assertions: is_rows(): add dump parameter tools/scylla-sstable: extract query result printers tools/scylla-sstable: add std::ostream& arg to query result printers repair/row_level: repair_flush_hints_batchlog_handler(): add all_replayed to finish log db/batchlog_manager: re-add v1 support db/batchlog_manager: return all_replayed from process_batch() db/batchlog_manager: process_bath() fix indentation db/batchlog_manager: make batch() a standalone function db/batchlog_manager: make structs stats public db/batchlog_manager: allocate limiter on the stack db/batchlog_manager: add feature_service dependency gms/feature_service: add batchlog_v2 feature	2026-03-02 12:09:10 +01:00
Patryk Jędrzejczak	9a9202c909	Merge 'Remove gossiper topology code' from Gleb Natapov The PR removes most of the code that assumes that group0 and raft topology are not enabled. It also makes sure that joining a cluster in non-raft mode, or upgrading a node to this version in a cluster that does not yet use raft topology, will fail. Refs #15422 No backport needed since this removes functionality. Closes scylladb/scylladb#28514 * https://github.com/scylladb/scylladb: group0: fix indentation after previous patch raft_group0: simplify get_group0_upgrade_state function since no upgrade can happen any more raft_group0: move service::group0_upgrade_state to use fmt::formatter instead of iostream raft_group0: remove unused code from raft_group0 node_ops: remove topology over node ops code topology: fix indentation after the previous patch topology: drop topology_change_enabled parameter from raft_group0 code storage_service: remove unused handle_state_* functions gossiper: drop wait_for_gossip_to_settle and deprecate correspondent option storage_service: fix indentation after the last patch storage_service: remove gossiper bootstrapping code storage_service: drop get_group_server_if_raft_topolgy_enabled storage_service: drop is_topology_coordinator_enabled and its uses storage_service: drop run_with_api_lock_in_gossiper_mode_only topology: remove code that assumes raft_topology_change_enabled() may return false test: schema_change_test: make test_schema_digest_does_not_change_with_disabled_features tests run in raft mode test: schema_change_test: drop schema tests relevant for no raft mode only topology: remove upgrade to raft topology code group0: remove upgrade to group0 code group0: refuse to boot if a cluster is still is not in a raft topology mode storage_service: refuse to join a cluster in legacy mode	2026-02-27 14:43:41 +01:00
Marcin Maliszkiewicz	a03ebe1a29	Merge 'cql: implement a new per-row TTL feature' from Nadav Har'El This series implements a new per-row TTL feature for CQL. The per-row TTL feature was requested in issue #13000. It is a feature that does not exist in Cassandra, and was inspired by DynamoDB's TTL feature - and under the hood uses the same implementation that we used in Alternator to implement this DynamoDB feature. The new per-row TTL feature is completely separate from CQL's existing per-write (and per-cell) TTL, and both will be available to users. In the per-row TTL feature, one column in the table is designated as the "TTL" column, and its value for a row is the expiration time for that row. The TTL column can be designated at table creation time, e.g.: ```cql CREATE TABLE tab ( id int PRIMARY KEY, t text, expiration timestamp TTL ); ``` Or after the table already exists with: ```cql ALTER TABLE tab TTL expiration ``` Expiration can also be disabled, with: ```cql ALTER TABLE tab TTL NULL ``` The new per-row TTL feature has two features that users have been asking for: 1. A user can change the value of just the TTL column - without rewriting the entire row - to change the expiration time of the entire row. 2. When an expired row is finally deleted, a CDC event about this deletion appears in the CDC log (if CDC is enabled), including - if a preimage is enabled - the content of the deleted row. To achieve the second goal (CDC events), a row is not guaranteed to disappear at exactly its expiration time (as CQL's original TTL feature guarantees). Rather, the row is deleted some time later, depending on `alternator_ttl_period_in_seconds`; Until the actual deletion, the row is still readable (and even writable). But we are guaranteed that when the row is finally deleted, the CDC event will come too. The implementation uses the same background thread used by Alternator to periodically scan for expired items and delete them. 
The expiration thread keeps the same metrics as it did for Alternator: * `scylla_expiration_scan_passes` * `scylla_expiration_scan_table` * `scylla_expiration_items_deleted` * `scylla_expiration_secondary_ranges_scanned` The series begins with a few small preparation patches, followed by the main part of the feature (which isn't big, since we are just enabling the pre-existing Alternator expiration machinery for CQL) and finally 30 tests (single-node and multi-node tests) and documentation. This series is a new feature, so traditionally would not be backported. However, I wouldn't be surprised if we are asked to backport it so that customers will not need to wait for a new major release. Fixes #13000 Closes scylladb/scylladb#28320 * github.com:scylladb/scylladb: test/cqlpy: verify that a column can't be both STATIC and PRIMARY KEY docs/cql: document the new CQL per-row TTL feature test/cluster: tests for the new CQL per-row TTL feature test/cqlpy: tests for the new CQL per-row TTL feature test: set low alternator_ttl_period_in_seconds in CQL tests cql ttl: fix ALTER TABLE to disable TTL if column is dropped cql ttl: add setting/unsetting of TTL column to ALTER TABLE cql ttl: add TTL column support to CREATE TABLE and DESC TABLE ttl: add CQL support to Alternator's TTL expiration service alternator ttl: move TTL_TAG_KEY to a header file alternator ttl: remove unnecessary check of feature flag cql: add "cql_row_ttl" cluster feature alternator: fix error message if UpdateTimeToLive is not supported	2026-02-26 15:29:12 +01:00
Nadav Har'El	4f4e93b695	cql: add "cql_row_ttl" cluster feature This patch adds a new cluster feature "CQL_ROW_TTL", for the new CQL per-row TTL feature. With this patch, this node reports supporting this feature, but the CQL per-row TTL feature can only be used once all the nodes in the cluster supports the feature. In other words, user requests to enable per-row TTL on a table should check this feature flag (on the whole cluster) before proceeding. This is needed because the implementation of the per-row-TTL expiration requires the cooperation of all nodes to participate in scanning for expired items, so the feature can't be trusted until all nodes participate in it. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-02-25 14:59:41 +02:00
Gleb Natapov	a8a167623a	topology: remove code that assumes raft_topology_change_enabled() may return false The patch removes the code protected by !raft_topology_change_enabled() since it is no longer reachable. Drop test_lwt_for_tablets_is_not_supported_without_raft since non-raft mode is no longer supported.	2026-02-25 10:08:30 +02:00
Calle Wilund	3075311f21	feature_service: Add SNAPSHOT_AS_TOPOLOGY_OPERATION feature To detect if cluster can do coordinated snapshot	2026-02-23 10:44:41 +01:00
Botond Dénes	c901ab53d2	gms/feature_service: add batchlog_v2 feature	2026-02-20 07:03:45 +02:00
Łukasz Paszkowski	f06094aa95	topology_coordinator: add write_both_read_old_fallback_cleanup state Yet another barrier-failure scenario exists in the `write_both_read_new` state. When the barrier fails, the tablet is expected to transition to `cleanup_target`, but because barrier execution is asynchronous, the cleanup transition can be skipped entirely and the tablet may continue forward instead. Both `write_both_read_new` and `cleanup_target` modify read and write selectors. In this situation, a barrier is required, and transitioning directly between these states without one is unsafe. Introduce an intermediate `write_both_read_old_fallback_cleanup` state that modifies only a read selector and can be entered without a barrier (there is no need to wait for all nodes to start using the "new" read selector). From there, the tablet can proceed to `cleanup_target`, where the required barriers are enforced. This also avoids changing both selectors in a single step. A direct transition from `write_both_read_new` to `cleanup_target` updates both selectors at once, which can leave coordinators using the old selector for writes and the new selector for reads, causing reads to miss preceding writes. By routing through the fallback state, selectors are updated in order—read first, then write—preserving read-after-write correctness.	2026-01-26 13:14:37 +01:00
Tomasz Grabiec	a009644c7d	raft_topology, tablets: Drain tablets in parallel with other topology operations Allows other topology operations to execute while tablets are being drained on decommission. In particular, bootstrap on scale-out. This is important for elasticity. Allows multiple decommission/removenode to happen in parallel, which is important for efficiency. Flow of decommission/removenode request: 1) pending and paused, has tablet replicas on target node. Tablet scheduler will start draining tablets. 2) No tablets on target node, request is pending but not paused 3) Request is scheduled, node is in transition 4) Request is done Nodes are considered draining as soon as there is a leave or remove request on them. If there are tablet replicas present on the target node, the request is in a paused state and will not be picked by topology coordinator. The paused state is computed from topology state automatically on reload. When request is not paused, its execution starts in write_both_read_old state. The old tablet_draining state is not entered (it's deprecated now). Tablet load balancing will yield the state machine as soon as some request is no longer paused and ready to be scheduled, based on standard preemption mechanics. The test case test_explicit_tablet_movement_during_decommission is removed. It verifies that tablet move API works during tablet draining transition. After this PR, we no longer enter this transition, so the test doesn't work. It loses its purpose, because movement during normal tablet balancing is not special and tested elsewhere.	2026-01-18 15:36:05 +01:00
Patryk Jędrzejczak	eee2b6c7af	Merge 'tablets: Make balancing disabling RPC preempt tablet transitions' from Tomasz Grabiec Disabling of balancing waits for topology state machine to become idle, to guarantee that no migrations are happening or will happen after the call returns. But it doesn't interrupt the scheduler, which means the call can take arbitrary amount of time. It may wait for tablet repair to be finished, which can take many hours. We should do it via topology request, which will interrupt the tablet scheduler. Enabling of balancing can be immediate. Fixes https://github.com/scylladb/scylladb/issues/27647 Fixes #27210 Closes scylladb/scylladb#27736 * https://github.com/scylladb/scylladb: test: Verify that repair doesn't block disabling of tablet load balancing tablets: Make balancing disabling call preempt tablet transitions	2026-01-08 21:55:19 +02:00
Asias He	4f77dd058d	repair: Add tablet repair progress report support This patch adds tablet repair progress report support so that the user could use the /task_manager/task_status API to query the progress. In order to support this, a new system table is introduced to record the user request related info, i.e, start of the request and end of the request. The progress is accurate when tablet split or merge happens in the middle of the request, since the tokens of the tablet are recorded when the request is started and when repair of each tablet is finished. The original tablet repair is considered as finished when the finished ranges cover the original tablet token ranges. After this patch, the /task_manager/task_status API will report correct progress_total and progress_completed. Fixes #22564 Fixes #26896 Closes scylladb/scylladb#27679	2026-01-08 21:55:18 +02:00
Tomasz Grabiec	ccdb301731	tablets: Make balancing disabling call preempt tablet transitions This patch modifies the RESTful API handler which disables tablet balancing to use a topology request to wait for already running tablet transitions. Before, it was just waiting for topology to be idle, so it could wait much longer than necessary, also for operations which are not affected by the flag, like repair. And repair can take hours. A new request type is introduced for this synchronization: noop_request. It will preempt the tablet scheduler, and when the request executes, we know all later tablet transitions will respect the "balancing disabled" flag, and only things which are unaffected by the flag, like repair, will be scheduled. Fixes #27647	2026-01-05 13:22:08 +01:00
Ferenc Szili	b7ebd73e53	load_balancer: add cluster feature for size based balancing This patch adds a cluster feature size_based_load_balancing which, until enabled, will force capacity based balancing. This is needed because during rolling upgrades some of the nodes will have incomplete data in load_stats (missing tablet sizes and effective_capacity) which are needed for size based balancing to make good decisions and issue correct migrations.	2025-12-27 11:39:08 +01:00
Emil Maskovsky	ba6fabfc88	features: add feature flag for removenode via left token ring To improve the behavior of the removenode operation, we want to issue a global topology barrier after the removenode has been applied. However, this requires changing the topology state machine to add a new state (left_token_ring) to the removenode flow, which is not supported by older nodes. To allow rolling upgrades, we add a feature flag REMOVENODE_WITH_LEFT_TOKEN_RING that controls whether the new removenode flow is used.	2025-12-17 13:31:11 +01:00
Andrzej Jackowski	2e7070d3b7	gms: add CLIENT_ROUTES feature The feature will be used later in this patch series: - To avoid unnecessary operations when the feature is not enabled - To guard new API endpoints from being used before the cluster is ready to use them. - To implement update tests (by disabling/enabling the feature) Ref: scylladb/scylla-enterprise#5699	2025-12-15 13:08:04 +01:00
Avi Kivity	24264e24bb	Revert "repair: Add tablet repair progress report support" This reverts commit `faad0167d7`. It causes a regression in test_two_tablets_concurrent_repair_and_migration_repair_writer_level in debug mode (with ~5%-10% probability). Fixes #27510. Closes scylladb/scylladb#27560	2025-12-11 12:18:11 +02:00
Asias He	faad0167d7	repair: Add tablet repair progress report support This patch adds tablet repair progress report support so that the user could use the /task_manager/task_status API to query the progress. In order to support this, a new system table is introduced to record the user request related info, i.e, start of the request and end of the request. The progress is accurate when tablet split or merge happens in the middle of the request, since the tokens of the tablet are recorded when the request is started and when repair of each tablet is finished. The original tablet repair is considered as finished when the finished ranges cover the original tablet token ranges. After this patch, the /task_manager/task_status API will report correct progress_total and progress_completed. Fixes #22564 Fixes #26896 Closes scylladb/scylladb#26924	2025-12-08 13:35:19 +02:00
Michael Litvak	9208b2f317	cql3: allow counters with tablets Now that counters work with tablets, allow to create a table with counters in a tablets-enabled keyspace, and remove the warning about counters not being supported when creating a keyspace with tablets. We allow to use counters with tablets only when all nodes are upgraded and support counters with tablets. We add a new feature flag to determine if this is the case. Fixes scylladb/scylladb#18180	2025-11-03 16:04:37 +01:00
Gleb Natapov	eb9112a4a2	db: experimental consistent-tablets option The option will be used to hide the consistent tablets feature until it is ready.	2025-10-15 11:27:10 +03:00
Andrzej Jackowski	c59a7db1c9	service_level_controller: automatically create `sl:driver` This commit: - Increases the number of allowed scheduling groups to allow the creation of `sl:driver`. - Adds the `DRIVER_SERVICE_LEVEL` feature, which prevents creating `sl:driver` until all nodes have increased the number of scheduling groups. - Starts using `get_create_driver_service_level_mutations` to unconditionally create `sl:driver` on `raft_initialize_discovery_leader`. The purpose of this code path is ensuring existence of `sl:driver` in new system and tests. - Starts using `migrate_to_driver_service_level` to create `sl:driver` if it is not already present. The creation of `sl:driver` is managed by `topology_coordinator`, similar to other system keyspace updates, such as the `view_builder` migration. The purpose of this code path is handling upgrades. - Modifies related tests to pass after `sl:driver` is added. Later in this patch series, `sl:driver` will be used by `transport/server` to handle selected traffic, such as the driver's schema and topology fetches. Refs: scylladb/scylladb#24411	2025-10-08 08:24:43 +02:00
Tomasz Grabiec	66755db062	locator, cql3: Support rack lists in replication options Allows per-DC replication factor to be either a string, holding a numerical value, or a list of strings, holding a list of rack names. The rack list is not respected yet by the tablet allocator, this is achieved in subsequent commit. This changes the format of options stored in the flattened map in system_schema.keyspaces#replication. Values which are rack lists, are converted into multiple entries, with the list index appended to the key with ':' as the separator: For example, this extended map: { 'dc1': '3', 'dc2': ['rack1', 'rack2'] } is stored as a flattened map: { 'dc1': '3', 'dc2:0': 'rack1', 'dc2:1': 'rack2' } Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>	2025-10-02 19:42:39 +02:00
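The flattening described in the commit above can be sketched as follows. This is a minimal illustration of the key format only (list index appended to the key with ':' as the separator); the function name `flatten_replication_options` is hypothetical and is not taken from the ScyllaDB code base.

```python
def flatten_replication_options(options):
    """Flatten rack-list values into 'key:index' entries.

    Hypothetical helper: string values are kept as-is, list values are
    expanded into one entry per element.
    """
    flat = {}
    for key, value in options.items():
        if isinstance(value, list):
            # A rack list becomes several entries: 'dc2:0', 'dc2:1', ...
            for index, rack in enumerate(value):
                flat[f"{key}:{index}"] = rack
        else:
            # A numeric replication factor stays a plain string entry.
            flat[key] = value
    return flat


extended = {"dc1": "3", "dc2": ["rack1", "rack2"]}
flattened = flatten_replication_options(extended)
print(flattened)
```

Applied to the extended map from the commit message, this yields the flattened map given there.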
Michał Chojnowski	ef11dc57c1	db/config: expose "ms" format to the users via database config Extend the `sstable_format` config enum with a "ms" value, and, if it's enabled (in the config and in cluster features), use it for new sstables on the node. (Before this commit, writing `ms` sstables should only be possible in unit tests, via internal APIs. After this commit, the format can be enabled in the config and the database will write it during normal operation). As of this commit, the new format is not the default yet. (But it will become the default in a later commit in the same series).	2025-09-29 22:15:25 +02:00
Avi Kivity	1258e7c165	Revert "Merge 'transport: service_level_controller: create and use `driver` service level' from Andrzej Jackowski" This reverts commit `fe7e63f109`, reversing changes made to `b5f3f2f4c5`. It is causing test.py failures around cqlpy. Fixes #26163 Closes scylladb/scylladb#26174	2025-09-22 09:32:46 +03:00
Avi Kivity	fe7e63f109	Merge 'transport: service_level_controller: create and use `driver` service level' from Andrzej Jackowski This patch series: - Increases the number of allowed scheduling groups to allow creation of `sl:driver` - Implements `create_driver_service_level` that creates `sl:driver` with shares=200 if it wasn't already created - Implements creation of `sl:driver` for new systems and tests in `raft_initialize_discovery_leader` - Modifies `topology_coordinator` to create `sl:driver` after upgrades. - Implements using `sl:driver` for new connections in `transport/server` - Adds to `transport/server` recognition of driver's control connections and forcing them to keep using `sl:driver`. - Adds tests to verify the new functionality - Modifies existing tests to let them pass after `sl:driver` is added - Modifies the documentation to contain new `sl:driver` The changes were evaluated by a test with the following scenario ([test_connections-sl-driver.py](https://github.com/user-attachments/files/22021273/test_connections-sl-driver.py)): - Start ScyllaDB with one node - Create 1000 keyspaces, 1 table in each keyspace - Start `cassandra-stress` (`-rate threads=50 -mode native cql3`) - Run connection storm with 1000 sessions (100 python processes, 10 sessions each) The maximum latency during connection storm dropped from 224.94ms to 41.43ms (these numbers are averages over 20 test executions, where max latency was in [140ms, 361ms] before the change and [31.4ms, 61.5ms] after). The snippet of cassandra-stress output from the moment of connection storm: Before: ``` type total ops, op/s, pk/s, row/s, mean, med, .95, .99, .999, max, time, stderr, errors, gc: #, max ms, sum ms, sdv ms, mb ... 
total, 789206, 85887, 85887, 85887, 0.6, 0.3, 2.0, 2.0, 2.5, 5.0, 9.0, 0.09679, 0, 0, 0, 0, 0, 0 total, 909322, 120116, 120116, 120116, 0.4, 0.2, 1.9, 2.0, 2.1, 3.1, 10.0, 0.09053, 0, 0, 0, 0, 0, 0 total, 964392, 55070, 55070, 55070, 0.9, 0.4, 2.0, 4.5, 7.7, 18.9, 11.0, 0.09203, 0, 0, 0, 0, 0, 0 total, 975705, 11313, 11313, 11313, 4.4, 3.5, 6.5, 24.5, 82.7, 83.0, 12.0, 0.11713, 0, 0, 0, 0, 0, 0 total, 987548, 11843, 11843, 11843, 4.2, 3.5, 6.5, 33.7, 48.6, 51.5, 13.0, 0.13366, 0, 0, 0, 0, 0, 0 total, 995422, 7874, 7874, 7874, 6.3, 4.0, 7.7, 85.6, 112.9, 113.5, 14.0, 0.14753, 0, 0, 0, 0, 0, 0 total, 1007228, 11806, 11806, 11806, 4.3, 3.5, 6.5, 29.1, 43.8, 87.1, 15.0, 0.15598, 0, 0, 0, 0, 0, 0 total, 1012840, 5612, 5612, 5612, 8.2, 5.0, 11.5, 121.8, 166.6, 170.1, 16.0, 0.16535, 0, 0, 0, 0, 0, 0 total, 1016186, 3346, 3346, 3346, 13.4, 7.4, 20.1, 204.9, 207.6, 210.4, 17.0, 0.17405, 0, 0, 0, 0, 0, 0 total, 1025462, 9276, 9276, 9276, 6.3, 3.9, 9.6, 74.6, 206.8, 210.0, 18.0, 0.17800, 0, 0, 0, 0, 0, 0 total, 1035979, 10517, 10517, 10517, 4.8, 3.5, 6.7, 38.5, 82.6, 83.0, 19.0, 0.18120, 0, 0, 0, 0, 0, 0 total, 1047488, 11509, 11509, 11509, 4.3, 3.5, 6.0, 32.6, 72.3, 74.0, 20.0, 0.18334, 0, 0, 0, 0, 0, 0 total, 1077456, 29968, 29968, 29968, 1.7, 1.6, 2.9, 3.6, 7.0, 8.2, 21.0, 0.17943, 0, 0, 0, 0, 0, 0 total, 1105490, 28034, 28034, 28034, 1.8, 1.8, 3.5, 4.6, 5.3, 13.8, 22.0, 0.17609, 0, 0, 0, 0, 0, 0 total, 1132221, 26731, 26731, 26731, 1.9, 1.8, 3.8, 5.2, 8.4, 11.1, 23.0, 0.17314, 0, 0, 0, 0, 0, 0 total, 1162149, 29928, 29928, 29928, 1.7, 1.7, 3.0, 4.5, 8.0, 9.1, 24.0, 0.16950, 0, 0, 0, 0, 0, 0 ... ``` After: ``` type total ops, op/s, pk/s, row/s, mean, med, .95, .99, .999, max, time, stderr, errors, gc: #, max ms, sum ms, sdv ms, mb ... 
total, 822863, 94379, 94379, 94379, 0.5, 0.3, 2.0, 2.0, 2.1, 3.7, 9.0, 0.06669, 0, 0, 0, 0, 0, 0 total, 937337, 114474, 114474, 114474, 0.4, 0.2, 2.0, 2.0, 2.1, 3.4, 10.0, 0.06301, 0, 0, 0, 0, 0, 0 total, 986630, 49293, 49293, 49293, 1.0, 1.0, 2.0, 2.1, 17.9, 19.0, 11.0, 0.07318, 0, 0, 0, 0, 0, 0 total, 1026734, 40104, 40104, 40104, 1.2, 1.0, 2.0, 2.2, 6.3, 7.1, 12.0, 0.08410, 0, 0, 0, 0, 0, 0 total, 1066124, 39390, 39390, 39390, 1.3, 1.0, 2.0, 2.2, 2.6, 3.4, 13.0, 0.09108, 0, 0, 0, 0, 0, 0 total, 1103082, 36958, 36958, 36958, 1.3, 1.1, 2.1, 2.5, 3.1, 4.2, 14.0, 0.09643, 0, 0, 0, 0, 0, 0 total, 1141987, 38905, 38905, 38905, 1.3, 1.0, 2.0, 2.4, 11.4, 12.7, 15.0, 0.09894, 0, 0, 0, 0, 0, 0 total, 1180023, 38036, 38036, 38036, 1.3, 1.0, 2.0, 3.7, 5.6, 7.1, 16.0, 0.10070, 0, 0, 0, 0, 0, 0 total, 1216481, 36458, 36458, 36458, 1.4, 1.0, 2.1, 3.6, 4.7, 5.0, 17.0, 0.10210, 0, 0, 0, 0, 0, 0 total, 1256819, 40338, 40338, 40338, 1.2, 1.0, 2.0, 2.2, 3.5, 5.4, 18.0, 0.10173, 0, 0, 0, 0, 0, 0 total, 1295122, 38303, 38303, 38303, 1.3, 1.0, 2.0, 2.4, 21.0, 21.1, 19.0, 0.10136, 0, 0, 0, 0, 0, 0 total, 1334743, 39621, 39621, 39621, 1.3, 1.0, 2.0, 2.3, 3.3, 4.0, 20.0, 0.10055, 0, 0, 0, 0, 0, 0 total, 1375579, 40836, 40836, 40836, 1.2, 1.0, 2.0, 2.1, 3.4, 5.7, 21.0, 0.09927, 0, 0, 0, 0, 0, 0 total, 1415576, 39997, 39997, 39997, 1.2, 1.0, 2.0, 2.3, 3.2, 4.1, 22.0, 0.09807, 0, 0, 0, 0, 0, 0 total, 1449268, 33692, 33692, 33692, 1.5, 1.4, 2.5, 3.2, 4.2, 5.6, 23.0, 0.09800, 0, 0, 0, 0, 0, 0 total, 1471873, 22605, 22605, 22605, 2.2, 2.0, 4.8, 5.9, 7.0, 7.9, 24.0, 0.10015, 0, 0, 0, 0, 0, 0 ... ``` Fixes: https://github.com/scylladb/scylladb/issues/24411 This is a new feature, so no backport needed. 
Closes scylladb/scylladb#25412 * github.com:scylladb/scylladb: docs: workload-prioritization: add driver service level test: add test to verify use of `sl:driver` transport: use `sl:driver` to handle driver's control connections transport: whitespace only change in update_scheduling_group transport: call update_scheduling_group for non-auth connections generic_server: transport: start using `sl:driver` for new connections test: add test_desc_* for driver service level test: service_levels: add tests for sl:driver creation and removal test: add reload_raft_topology_state() to ScyllaRESTAPIClient service_level_controller: automatically create `sl:driver` service_level_controller: methods to create driver service level service_level_controller: handle special sl:driver in DESC output topology_coordinator: add service_level_controller reference system_keyspace: add service_level_driver_created test: add MAX_USER_SERVICE_LEVELS	2025-09-18 19:45:17 +03:00
Andrzej Jackowski	6f678a2d1f	service_level_controller: automatically create `sl:driver` This commit: - Increases the number of allowed scheduling groups to allow the creation of `sl:driver`. - Adds the `DRIVER_SERVICE_LEVEL` feature, which prevents creating `sl:driver` until all nodes have increased the number of scheduling groups. - Starts using `get_create_driver_service_level_mutations` to unconditionally create `sl:driver` on `raft_initialize_discovery_leader`. The purpose of this code path is ensuring existence of `sl:driver` in new system and tests. - Starts using `migrate_to_driver_service_level` to create `sl:driver` if it is not already present. The creation of `sl:driver` is managed by `topology_coordinator`, similar to other system keyspace updates, such as the `view_builder` migration. The purpose of this code path is handling upgrades. - Modifies related tests to pass after `sl:driver` is added. Later in this patch series, `sl:driver` will be used by `transport/server` to handle selected traffic, such as the driver's schema and topology fetches. Refs: scylladb/scylladb#24411	2025-09-18 09:28:32 +02:00
Michael Litvak	5f1caebcc7	cdc: add cdc_with_tablets feature flag Add a new feature flag cdc_with_tablets to protect the schema changes that are required for the CDC with tablets feature. We will also use it to allow using CDC in tablets-based keyspaces only once all nodes are upgraded and support this feature.	2025-09-17 14:47:11 +02:00
Michał Jadwiszczak	7dba3667c9	gms/feature_service: add VIEW_BUILDING_COORDINATOR feature	2025-08-27 08:55:46 +02:00
Avi Kivity	611918056a	Merge 'repair: Add tablet incremental repair support' from Asias He The central idea of incremental repair is to allow repair participants to select and repair only a portion of the dataset to speed up the repair process. All repair participants must utilize an identical selection method to repair and synchronize the same selected dataset. There are two primary selection methods: time-based and file-based. The time-based method selects data within a specified time frame. It is versatile but it is less efficient because it requires reading all of the dataset and omitting data beyond the time frame. The file-based method selects data from unrepaired SSTables and is more efficient because it allows the entire SSTable to be omitted. This document patch implements the file-based selection method. Incremental repair will only be supported for tablet tables; it will not be supported for vnode tables. On one hand, the legacy vnode is less important to support. On the other hand, the incremental repair for vnode is much harder to implement. With vnodes, an SSTable could contain data for multiple vnode ranges. When a given vnode range is repaired, only a portion of the SSTable is repaired. This complicates the manipulation of SSTables significantly during both repair and compaction. With tablets, an entire tablet is repaired, so an SSTable is either fully repaired or not repaired, which is a huge simplification. This patch uses the repaired_at from sstables::statistics component to mark an SSTable as repaired. It uses a virtual clock as the repair timestamp, i.e., using a monotonically increasing number for the repaired_at field of an SSTable and sstables_repaired_at column in system.tablets table. Notice that when an SSTable is not repaired, the repaired_at field will be set to the default value 0. The being_repaired in memory field of an SSTable is used to explicitly mark that an SSTable is being selected. 
The following variables are used for incremental repair: The repaired_at on disk field of a SSTable is used. - A 64-bit number increases sequentially The sstables_repaired_at is added to the system.tablets table. - repaired_at <= sstables_repaired_at means the sstable is repaired The being_repaired in memory field of a SSTable is added. - A repair UUID tells which sstable has participated in the repair Initial test results: 1) Medium dataset results Node amount: 3 Instance type: i4i.2xlarge Disk usage per node: ~500GB Cluster pre-populated with ~500GB of data before starting repairs job. Results for Repair Timings: The regular repair run took 210 mins. Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s The speedup is: 183 mins / 48s = 228X 2) Small dataset results Node amount: 3 Instance type: i4i.2xlarge Disk usage per node: ~167GB Cluster pre-populated with ~167GB of data before starting the repairs job. Regular repair 1st run took 110s, 2nd and 3rd runs took 110s. Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds. 
The speedup is: 110s / 1.5s = 73X 3) Large dataset results Node amount: 6 Instance type: i4i.2xlarge, 3 racks 50% of base load, 50% read/write Dataset == Sum of data on each node Dataset Non-incremental repair (minutes) 1.3 TiB 31:07 3.5 TiB 25:10 5.0 TiB 19:03 6.3 TiB 31:42 Dataset Incremental repair (minutes) 1.3 TiB 24:32 3.0 TiB 13:06 4.0 TiB 5:23 4.8 TiB 7:14 5.6 TiB 3:58 6.3 TiB 7:33 7.0 TiB 6:55 Fixes #22472 Closes scylladb/scylladb#24291 * github.com:scylladb/scylladb: replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair compaction: Move compaction_reenabler to compaction_reenabler.hh topology_coordinator: Make rpc::remote_verb_error to warning level repair: Add metrics for sstable bytes read and skipped from sstables test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair test.py: Add tests for tablet incremental repair repair: Add tablet incremental repair support compaction: Add tablet incremental repair support feature_service: Add TABLET_INCREMENTAL_REPAIR feature tablet_allocator: Add tablet_force_tablet_count_increase and decrease repair: Add incremental helpers sstable: Add being_repaired to sstable sstables: Add set_repaired_at to metadata_collector mutation_compactor: Introduce add operator to compaction_stats tablet: Add sstables_repaired_at to system.tablets table test: Fix drain api in task_manager_client.py	2025-08-19 13:13:22 +03:00
Asias He	2ecd42f369	feature_service: Add TABLET_INCREMENTAL_REPAIR feature	2025-08-11 10:10:08 +08:00
Benny Halevy	0ad1898f0a	feature_service: move UUID_SSTABLE_IDENTIFIERS to supported_feature_set The feature is supported by all live versions since version 5.4 / 2024.1. (Although up to `6da758d74c` it could be disabled using the config option) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-08 11:46:15 +03:00
Botond Dénes	2985c343ed	Merge 'repair: Avoid too many fragments in a single repair_row_on_wire' from Asias He When repairing a partition with many rows, we can store many fragments in a repair_row_on_wire object which is sent as a rpc stream message. This could cause reactor stalls when the rpc stream compression is turned on, because the compression compresses the whole message without any split and compression. This patch solves the problem at the higher level by reducing the message size that is sent to the rpc stream. Tests are added to make sure the message split works. Fixes #24808 Closes scylladb/scylladb#25002 * github.com:scylladb/scylladb: repair: Avoid too many fragments in a single repair_row_on_wire repair: Change partition_key_and_mutation_fragments to use chunked_vector utils: Allow chunked_vector::erase to work with non-default-constructible type	2025-07-29 17:45:57 +03:00
Asias He	e28c75aa79	repair: Avoid too many fragments in a single repair_row_on_wire When repairing a partition with many rows, we can store many fragments in a repair_row_on_wire object which is sent as a rpc stream message. This could cause reactor stalls when the rpc stream compression is turned on, because the compression compresses the whole message without any split and compression. This patch solves the problem at the higher level by reducing the message size that is sent to the rpc stream. Tests are added to make sure the message split works. Fixes #24808	2025-07-29 13:43:53 +08:00
Botond Dénes	f3ed27bd9e	Merge 'Move feature-service config creation code out of feature-service itself' from Pavel Emelyanov Nowadays the way to configure an internal service is 1. service declares its config struct 2. caller (main/test/tool) fills the respective config with values it wants 3. the service is started with the config passed by value The feature service code behaves likewise, but provides a helper method to create its config out of db::config. This PR moves this helper out of gms code, so that it doesn't mess with system-wide db::config and only needs its own small struct feature_config. For the reference: similar changes with other services: #23705 , #20174 , #19166 Closes scylladb/scylladb#25118 * github.com:scylladb/scylladb: gms,init: Move get_disabled_features_from_db_config() from gms code: Update callers generating feature service config gms: Make feature_config a simple struct gms: Split feature_config_from_db_config() into two	2025-07-29 08:17:49 +03:00
Petr Gusev	ab03badc15	feature_service: add LWT_WITH_TABLETS feature We will need this feature to determine if it's safe to enable LWTs for a tablet-based table.	2025-07-24 16:39:50 +02:00
Pavel Emelyanov	52455f93b6	gms,init: Move get_disabled_features_from_db_config() from gms Now that all callers are decoupled from gms config generating code, the latter can be decoupled from the db::config. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-07-21 19:20:17 +03:00

201 Commits