scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-22 07:42:16 +00:00

Author	SHA1	Message	Date
Piotr Dulikowski	7c2b1ea0b5	Merge 'view_building: fix tombstone_warn_threshold warnings' from Michał Jadwiszczak `system.view_building_tasks` is a single-partition Raft group0 table (pk = `"view_building"`, CK = timeuuid). When `clean_finished_tasks()` deletes hundreds of finished tasks, the physical rows remain in SSTables until compaction. Any subsequent read of the partition counts every column of every tombstoned row as a dead cell, triggering `tombstone_warn_threshold` warnings in large clusters. Two-part fix: 1. Range tombstones instead of row tombstones (commits 2–3) Instead of one row tombstone per finished task, find the minimum alive task UUID (`min_alive_uuid`) and emit a single range tombstone `[before_all, min_alive_uuid)` covering all tasks below that boundary. This reduces the tombstone count significantly and also benefits future compaction. 2. Bounded scan with `min_task_id` (commits 4–6) Even with range tombstones, physical rows remain until compaction and still count as dead cells during reads. The only way to avoid them is to not read them at all. - Add a `min_task_id timeuuid` static column to `system.view_building_tasks`. - On every GC, write `min_task_id = min_alive_uuid` atomically with the range tombstone (same Raft batch). - On reload, read `min_task_id` first using a static-only partition slice (empty `_row_ranges` + `always_return_static_content`): the SSTable reader stops immediately after the static row before processing any clustering tombstones — zero dead cells counted. - Use `AND id >= min_task_id` as a lower bound for the main task scan, skipping all tombstoned rows. The static-only read and the bounded scan are gated on the `VIEW_BUILDING_TASKS_MIN_TASK_ID` cluster feature so mixed-version clusters fall back to the full scan. The issue is not critical, so the fix shouldn't be backported. 
Fixes SCYLLADB-657 Closes scylladb/scylladb#28929 * github.com:scylladb/scylladb: test/cluster/test_view_building_coordinator: add reproducer for tombstone threshold warning docs: document tombstone avoidance in view_building_tasks view_building: add `task_uuid_generator` to `view_building_task_mutation_builder` view_building: introduce `task_uuid_generator` view_building: store `min_alive_uuid` in view building state view_building: set min_task_id when GC-ing finished tasks view_building: add min_task_id support to view_building_task_mutation_builder view_building: add min_task_id static column and bounded scan to system_keyspace view_building: use range tombstone when GC-ing finished tasks view_building: add range tombstone support to view_building_task_mutation_builder view_building: introduce VIEW_BUILDING_TASKS_MIN_TASK_ID cluster feature	2026-05-12 12:38:25 +03:00
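The two-part fix described in the merge message above can be modelled in a few lines. This is only a minimal sketch, not the code from the series: plain integers stand in for timeuuid clustering keys, a dict stands in for the partition, and the helper names (`min_alive_id`, `ids_covered_by_range_tombstone`, `bounded_scan`) are hypothetical. How finished tasks above the boundary are handled, and what happens when no task is alive, is not stated in the message and is not modelled here.

```python
from typing import Dict, List, Optional


def min_alive_id(tasks: Dict[int, bool]) -> Optional[int]:
    """Return the smallest id of a task that is not finished, or None.

    `tasks` maps task id -> finished flag. Integers are a stand-in for
    timeuuids (assumption made for this sketch only).
    """
    alive = [tid for tid, finished in tasks.items() if not finished]
    return min(alive) if alive else None


def ids_covered_by_range_tombstone(tasks: Dict[int, bool], boundary: int) -> List[int]:
    """Ids deleted by one range tombstone [before_all, boundary)."""
    return sorted(tid for tid in tasks if tid < boundary)


def bounded_scan(tasks: Dict[int, bool], min_task_id: int) -> Dict[int, bool]:
    """Model of `AND id >= min_task_id`: rows below the bound are never read."""
    return {tid: finished for tid, finished in tasks.items() if tid >= min_task_id}


tasks = {1: True, 2: True, 3: False, 4: True, 5: False}
boundary = min_alive_id(tasks)
covered = ids_covered_by_range_tombstone(tasks, boundary)
visible = bounded_scan(tasks, boundary)
```

In this model one tombstone replaces the two row tombstones for ids 1 and 2, and the bounded scan starts at id 3, so the deleted rows are not visited at all.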
Avi Kivity	5a887362e3	Merge 'Remove legacy tables creation code' from Gleb Natapov Drop the creation code for the `service_levels` and `cdc_generation_descriptions_v2` tables since it is no longer needed. Old clusters will still have these tables because they were created earlier. The series also contains a small improvement around group0 creation. No backport needed since this removes functionality. Closes scylladb/scylladb#29482 * github.com:scylladb/scylladb: db/system_distributed_keyspace: remove system_distributed_everywhere since it is unused db/system_distributed_keyspace: drop CDC_TOPOLOGY_DESCRIPTION and CDC_GENERATIONS_V2 db/system_distributed_keyspace: remove unused code db/system_distributed_keyspace: drop old cdc_generation_descriptions_v2 table db/system_distributed_keyspace: drop old service_levels table fix indent after the previous patch group0: call setup_group0 only when needed	2026-05-10 14:46:21 +03:00
Tomasz Grabiec	d6346e68c1	Merge 'prevent gossiper from marking nodes as down in tests unexpectedly' from Patryk Jędrzejczak This PR includes two changes that make gossiper much less likely to mark nodes as down in tests unexpectedly, and cause test flakiness in issues like SCYLLADB-864: - fixing false node conviction when echo succeeds, - increasing the failure_detector_timeout fixture. Fixes: SCYLLADB-864 No need for backport: related CI failures are rare, and merging #29522 made them even more unlikely (I haven't seen one since then, but it's still possible to reproduce locally on dev machines). Closes scylladb/scylladb#29755 * github.com:scylladb/scylladb: test/cluster: increase failure_detector_timeout gossiper: fix false node conviction when echo succeeds	2026-05-06 14:01:15 +02:00
Patryk Jędrzejczak	efe0e39d85	gossiper: fix false node conviction when echo succeeds failure_detector_loop_for_node() could falsely convict a healthy node even when the echo succeeded. The code computed diff = now - last (time since last successful echo) and checked diff > max_duration unconditionally, regardless of whether the current echo failed or succeeded. This caused flakiness in tests that decrease the failure detector timeout. We currently run #CPUs tests concurrently, and since cluster tests start multiple nodes with 2 shards, multiple shards contend for one CPU. As a result, some tasks can become abnormally slow and block the failure detector loop execution for a few seconds. Fix by only checking diff > max_duration when the echo actually failed. Note that we send echo with the timeout equal to `max_duration` anyway, so the receiver will be marked as down if it really doesn't respond.	2026-05-05 15:12:32 +02:00
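The conviction rule described in the commit above can be sketched as follows. This is an illustration under assumptions, not the code in `failure_detector_loop_for_node()`: the function names and the use of float seconds are hypothetical, and only the before/after condition is taken from the commit message.

```python
def should_convict_before_fix(echo_ok: bool, now: float, last: float, max_duration: float) -> bool:
    """Old behaviour: the time since the last successful echo is checked
    unconditionally, so a stalled loop can convict a healthy node."""
    diff = now - last
    return diff > max_duration


def should_convict(echo_ok: bool, now: float, last: float, max_duration: float) -> bool:
    """Fixed behaviour: diff > max_duration is only checked when the
    current echo actually failed."""
    if echo_ok:
        return False
    diff = now - last
    return diff > max_duration


# The loop was blocked for 10 s, but the echo that finally ran succeeded.
stalled_but_healthy_old = should_convict_before_fix(True, now=10.0, last=0.0, max_duration=2.0)
stalled_but_healthy_new = should_convict(True, now=10.0, last=0.0, max_duration=2.0)
```

A node that really stops responding is still convicted, because its echo fails and the elapsed time exceeds `max_duration`.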
Yaniv Michael Kaul	93722f2c89	gms/gossiper: fix use-after-move in do_send_ack2_msg The second logger.debug() call accesses ack2_msg after it was moved via std::move() in the co_await send_gossip_digest_ack2 call. This is undefined behavior. Fix by formatting ack2_msg to a string before the move, then using that cached string in both debug log calls. FIXES: https://scylladb.atlassian.net/browse/SCYLLADB-1778 Closes scylladb/scylladb#29227	2026-04-30 07:07:39 +03:00
Michał Jadwiszczak	e0942bb45a	view_building: introduce VIEW_BUILDING_TASKS_MIN_TASK_ID cluster feature This feature will be used to gate the use of min_task_id static column in system.view_building_tasks, which will be added in a subsequent commit. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-22 09:10:12 +02:00
Aleksandra Martyniuk	7cdf7d62a2	gms: add keyspace_multi_rf_change feature	2026-04-17 09:58:05 +02:00
Benny Halevy	ce00d61917	db: implement large_data virtual tables with feature flag gating Replace the physical system.large_partitions, system.large_rows, and system.large_cells CQL tables with virtual tables that read from LargeDataRecords stored in SSTable scylla metadata (tag 13). The transition is gated by a new LARGE_DATA_VIRTUAL_TABLES cluster feature flag: - Before the feature is enabled: the old physical tables remain in all_tables(), CQL writes are active, no virtual tables are registered. This ensures safe rollback during rolling upgrades. - After the feature is enabled: old physical tables are dropped from disk via legacy_drop_table_on_all_shards(), virtual tables are registered on all shards, and CQL writes are skipped via skip_cql_writes() in cql_table_large_data_handler. Key implementation details: - Three virtual table classes (large_partitions_virtual_table, large_rows_virtual_table, large_cells_virtual_table) extend streaming_virtual_table with cross-shard record collection. - generate_legacy_id() gains a version parameter; virtual tables use version 1 to get different UUIDs than the old physical tables. - compaction_time is derived from SSTable generation UUID at display time via UUID_gen::unix_timestamp(). - Legacy SSTables without LargeDataRecords emit synthetic summary rows based on above_threshold > 0 in LargeDataStats. - The activation logic uses two paths: when the feature is already enabled (test env, restart), it runs as a coroutine; when not yet enabled, it registers a when_enabled callback that runs inside seastar::async from feature_service::enable(). - sstable_3_x_test updated to use a simplified large_data_test_handler and validate LargeDataRecords in SSTable metadata directly.	2026-04-16 08:49:02 +03:00
Avi Kivity	59ec93b86b	Merge 'Allow arbitrary tablet boundaries and count' from Tomasz Grabiec There are several reasons we want to do that. One is that it will give us more flexibility in distributing the load. We can subdivide tablets at any token, and achieve more evenly-sized tablets. In particular, we can isolate large partitions into separate tablets. We can also split and merge incrementally individual tablets. Currently, we do it for the whole table or nothing, which makes splits and merges take longer and cause wide swings of the count. This is not implemented in this PR yet, we still split/merge the whole table. Another reason is vnode to tablets migration. We now could construct a tablet map which matches exactly the vnode boundaries, so migration can happen transparently from CQL-coordinator point of view. Tablet count is still a power-of-two by default for newly created tables. It may be different if tablet map is created by non-standard means, or if per-table tablet option "pow2_count" is set to "false". 
build/release/scylla perf-tablets: Memory footprint for 131k tablets increased from 56 MiB to 58.1 MiB (+3.5%) Before: ``` Generating tablet metadata Total tablet count: 131072 Size of tablet_metadata in memory: 57456 KiB Copied in 0.014346 [ms] Cleared in 0.002698 [ms] Saved in 1234.685303 [ms] Read in 445.577881 [ms] Read mutations in 299.596313 [ms] 128 mutations Read required hosts in 247.482742 [ms] Size of canonical mutations: 33.945053 [MiB] Disk space used by system.tablets: 1.456761 [MiB] Tablet metadata reload: full 407.69ms partial 2.65ms ``` After: ``` Generating tablet metadata Total tablet count: 131072 Size of tablet_metadata in memory: 59504 KiB Copied in 0.032475 [ms] Cleared in 0.002965 [ms] Saved in 1093.877441 [ms] Read in 387.027100 [ms] Read mutations in 255.752121 [ms] 128 mutations Read required hosts in 211.202805 [ms] Size of canonical mutations: 33.954453 [MiB] Disk space used by system.tablets: 1.450162 [MiB] Tablet metadata reload: full 354.50ms partial 2.19ms ``` Closes scylladb/scylladb#28459 * github.com:scylladb/scylladb: test: boost: tablets: Add test for merge with arbitrary tablet count tablets, database: Advertise 'arbitrary' layout in snapshot manifest tablets: Introduce pow2_count per-table tablet option tablets: Prepare for non-power-of-two tablet count tablets: Implement merged tablet_map constructor on top of for_each_sibling_tablets() tablets: Prepare resize_decision to hold data in decisions tablets: table: Make storage_group handle arbitrary merge boundaries tablets: Make stats update post-merge work with arbitrary merge boundaries locator: tablets: Support arbitrary tablet boundaries locator: tablets: Introduce tablet_map::get_split_token() dht: Introduce get_uniform_tokens()	2026-04-15 18:57:22 +03:00
Andrzej Jackowski	78926d9c96	test/random_failures: remove gossip shadow round injection Commit `c17c4806a1` removed check_for_endpoint_collision() from the fresh bootstrap path, which was the only code path that called do_shadow_round() for new nodes. Since the gossip shadow round is no longer executed during bootstrap, remove the stop_during_gossip_shadow_round error injection from the test. The entry is marked as REMOVED_ rather than deleted to preserve the shuffle order for seed-based test reproducibility. The injection point in gms/gossiper.cc is also removed since it is no longer used by any test. Fixes: SCYLLADB-1466 Closes scylladb/scylladb#29460	2026-04-15 16:30:55 +02:00
Gleb Natapov	8713eda271	db/system_distributed_keyspace: drop old cdc_generation_descriptions_v2 table The generation management moved to raft and old table is no longer used.	2026-04-15 15:48:48 +03:00
Tomasz Grabiec	01fb97ee78	locator: tablets: Support arbitrary tablet boundaries There are several reasons we want to do that. One is that it will give us more flexibility in distributing the load. We can subdivide tablets at any points, and achieve more evenly-sized tablets. In particular, we can isolate large partitions into separate tablets. Another reason is vnode-to-tablet migration. We could construct a tablet map which matches exactly the vnode boundaries, so migration can happen transparently from the CQL-coordinator's point of view. Implementation details: We store a vector of tokens which represent tablet boundaries in the tablet_id_map. tablet_id keeps its meaning, it's an index into vector of tablets. To avoid logarithmic lookup of tablet_id from the token, we introduce a lookup structure with power-of-two aligned buckets, and store the tablet_id of the tablet which owns the first token in the bucket. This way, lookup needs to consider tablet id range which overlaps with one bucket. If boundaries are more or less aligned, there are around 1-2 tablets overlapping with a bucket, and the lookup is still O(1). Amount of memory used increased, but not significantly relative to old size (because tablet_info is currently fat): For 131'072 tablets: Before: Size of tablet_metadata in memory: 57456 KiB After: Size of tablet_metadata in memory: 59504 KiB	2026-04-15 01:25:14 +02:00
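The bucketed lookup described in the commit above can be sketched like this. It is a simplified model under stated assumptions, not the `tablet_id_map` implementation: tokens are unsigned integers in a small 16-bit space, each tablet is identified by the last token it owns, and the class and parameter names (`TabletLookup`, `token_bits`, `bucket_bits`) are hypothetical.

```python
from bisect import bisect_left
from typing import List


class TabletLookup:
    """Maps a token to a tablet id using power-of-two aligned buckets.

    last_tokens[i] is the last token owned by tablet i; the list must be
    sorted and its final entry must be the maximum token.
    """

    def __init__(self, last_tokens: List[int], token_bits: int = 16, bucket_bits: int = 4):
        self.last_tokens = list(last_tokens)
        # Shifting a token right by this amount yields its bucket index.
        self.shift = token_bits - bucket_bits
        # For each bucket, remember the tablet owning the bucket's first token.
        self.bucket_first = [
            bisect_left(self.last_tokens, bucket << self.shift)
            for bucket in range(1 << bucket_bits)
        ]

    def tablet_of(self, token: int) -> int:
        # Start at the tablet owning the first token of the bucket, then
        # walk forward over the few tablets that overlap the bucket.
        tid = self.bucket_first[token >> self.shift]
        while self.last_tokens[tid] < token:
            tid += 1
        return tid


last_tokens = [100, 5000, 5001, 40000, 65535]
lookup = TabletLookup(last_tokens)
owner = lookup.tablet_of(5001)
```

When boundaries are roughly evenly spread, the forward walk visits only one or two tablets per bucket, which is the constant-time behaviour the commit message describes; a binary search over `last_tokens` would give the same answer in logarithmic time.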
Piotr Dulikowski	9fc2c65d18	Merge 'cql3: implement WRITETIME() and TTL() of individual elements of map, set, and UDT' from Nadav Har'El In commit `727f68e0f5` we added the ability to SELECT: * Individual elements of a map: `SELECT map_col[key]`. * Individual elements of a set: `SELECT set_col[key]` returns key if the key exists in the set, or null if it doesn't, allowing to check if the element exists in the set. * Individual pieces of a UDT: `SELECT udt_col.field`. But at the time, we didn't provide any way to retrieve the meta-data for this value, namely its timestamp and TTL. We did not support `SELECT TIMESTAMP(collection[key])`, or `SELECT TIMESTAMP(udt.field)`. Users requested to support such SELECTs in the past (see issue #15427), and Cassandra 5.0 added support for this feature - for both maps and sets and udts - so we also need this feature for compatibility. This feature was also requested recently by vector-search developers, who wanted to read Alternator columns - stored as map elements, not individual columns - with their WRITETIME information. The first four patches in this series adds the feature (in four smaller patches instead one big one), the fifth and sixth patches add tests (cqlpy and boost tests, respectively). The seventh patch adds documentation. All the new tests pass on Cassandra 5, failed on Scylla before the present fix, and pass with it. The fix was surprisingly difficult. Our existing implementation (from `727f68e0f5` building on earlier machinery) doesn't just "read" `map_col[key]` and allow us to return just its timestamp. Rather, the implementation reads the entire map, serializes it in some temporary format that does not include the timestamps and ttls, and then takes the subscript key, at which point we no longer have the timestamp or ttl of the element. So the fix had to cross all these layers of the implementation. 
While adding support for UDT fields in a pre-existing grammar nonterminal "subscriptExpr", we unintentionally added support for UDT fields also in LWT expressions (which used this nonterminal). LWT missing support for UDT fields was a long-time known compatibility issue (#13624) so we unintentionally fixed it :-) Actually, to completely fix it we needed another small change in the expression implementation, so the eighth patch in this series does this. Fixes #15427 Fixes #13624 Closes scylladb/scylladb#29134 * github.com:scylladb/scylladb: cql3: support UDT fields in LWT expressions cql3: document WRITETIME() and TTL() for elements of map, set or UDT test/boost: test WRITETIME() and TTL() on map collection elements test/cqlpy: test WRITETIME() and TTL() on element of map, set or UDT cql3: prepare and evaluate WRITETIME/TTL on collection elements and UDT fields cql3: parse per-element timestamps/TTLs in the selection layer cql3: add extended wire format for per-element timestamps and TTLs cql3: extend WRITETIME/TTL grammar to accept collection and UDT elements	2026-04-14 12:35:46 +02:00
Avi Kivity	0ae22a09d4	LICENSE: Update to version 1.1 Updated terms of non-commercial use (must be a never-customer).	2026-04-12 19:46:33 +03:00
Nadav Har'El	bb63db34e5	cql3: add extended wire format for per-element timestamps and TTLs Introduce the infrastructure needed to transport per-element timestamps and TTL expiry times from replicas to coordinators, required for WRITETIME(col[key]) / TTL(col[key]) and WRITETIME(col.field) / TTL(col.field). * Add a 'writetime_ttl_individual_element' cluster feature flag that guards usage of the new wire format during rolling upgrades: the extended format is only emitted and consumed when every node in the cluster supports it. * Implement serialize_for_cql_with_timestamps() (types/types.cc), a variant of serialize_for_cql() that appends a per-element section to the regular CQL bytes, listing each live element's serialized key, timestamp, and expiry. The format is: [uint32 cql_len][cql bytes] [int32 entry_count] [per entry: (int32 key_len)(key bytes)(int64 timestamp)(int64 expiry)] expiry is -1 when the element has no TTL. * Add partition_slice::option::send_collection_timestamps and modify write_cell() (mutation_partition.cc) to use the new function serialize_for_cql_with_timestamps() when this option is available. This commit stands alone with no user-visible effect: nothing yet sets the new partition-slice option. The next patch adds the selection-layer code that sets the option and parses the extended response.	2026-04-12 11:49:06 +03:00
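The extended format listed in the commit above can be illustrated with a small encoder and decoder. This is a sketch of the layout as written in the message, not `serialize_for_cql_with_timestamps()` itself: big-endian byte order is an assumption (the message does not state it), and the function names are hypothetical.

```python
import struct
from typing import List, Optional, Tuple

Entry = Tuple[bytes, int, Optional[int]]  # (serialized key, timestamp, expiry or None)


def encode_with_timestamps(cql_bytes: bytes, entries: List[Entry]) -> bytes:
    """[uint32 cql_len][cql bytes][int32 entry_count]
    then per entry: (int32 key_len)(key bytes)(int64 timestamp)(int64 expiry)."""
    parts = [struct.pack(">I", len(cql_bytes)), cql_bytes, struct.pack(">i", len(entries))]
    for key, timestamp, expiry in entries:
        parts.append(struct.pack(">i", len(key)))
        parts.append(key)
        # expiry is -1 when the element has no TTL
        parts.append(struct.pack(">qq", timestamp, -1 if expiry is None else expiry))
    return b"".join(parts)


def decode_with_timestamps(buf: bytes) -> Tuple[bytes, List[Entry]]:
    (cql_len,) = struct.unpack_from(">I", buf, 0)
    pos = 4
    cql_bytes = buf[pos:pos + cql_len]
    pos += cql_len
    (count,) = struct.unpack_from(">i", buf, pos)
    pos += 4
    entries: List[Entry] = []
    for _ in range(count):
        (key_len,) = struct.unpack_from(">i", buf, pos)
        pos += 4
        key = buf[pos:pos + key_len]
        pos += key_len
        timestamp, expiry = struct.unpack_from(">qq", buf, pos)
        pos += 16
        entries.append((key, timestamp, None if expiry == -1 else expiry))
    return cql_bytes, entries


sample = [(b"k", 1000, None), (b"j", 2000, 1_800_000_000)]
encoded = encode_with_timestamps(b"abc", sample)
decoded = decode_with_timestamps(encoded)
```

Because the regular CQL bytes come first and are length-prefixed, a reader that only needs the value can take the leading section and skip the per-element section.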
Avi Kivity	8ccee6803e	Merge 'Remove upgrade view builder' from Gleb Natapov Since we no longer support upgrades from versions that do not support v2 of the "view building status" code (where building status is managed by raft), we can remove the v1 code and the upgrade code and make sure we do not boot with the old "builder status" version. v2 was introduced by `8d25a4d678`, which is included in scylla-2025.1.0. No backport needed since this is code removal. Closes scylladb/scylladb#29105 * github.com:scylladb/scylladb: view: drop unused v1 builder code view: remove upgrade to raft code	2026-04-12 00:39:26 +03:00
Gleb Natapov	d0576c109f	gossiper: send shutdown notifications in parallel	2026-04-09 13:31:40 +03:00
Gleb Natapov	1586fa65af	gms: remove unused code Also moved version_string(...) and make_token_string(...) to private: — they are internal helpers used only by normal(), not part of the public API	2026-04-09 13:31:40 +03:00
Gleb Natapov	e17fc180a0	gossiper: print node state from raft topology in the logs Raft topology holds the node's real state now. Gossiper states are now set to NORMAL and SHUTDOWN only.	2026-04-09 13:31:40 +03:00
Gleb Natapov	8439154851	gossiper: use is_shutdown instead of code it manually	2026-04-09 13:31:39 +03:00
Gleb Natapov	7d700d0377	gossiper: mark endpoint_state(inet_address ip) constructor as explicit The get_live_members function called is_shutdown with an inet_address argument, which caused a temporary endpoint_state to be created. Fix it by prohibiting the implicit conversion and calling the correct is_shutdown function instead.	2026-04-09 13:31:39 +03:00
Gleb Natapov	6df4f572d5	gossiper: remove unused code	2026-04-09 13:31:39 +03:00
Gleb Natapov	67102496c8	gossiper: drop last use of LEFT state and drop the state Decommission sets the LEFT gossiper state only to prevent the node from issuing a shutdown notification during shutdown. Since the notification code now checks the state in raft topology, this is no longer needed.	2026-04-09 13:31:39 +03:00
Gleb Natapov	54d2c95094	gossiper: drop unused STATUS_BOOTSTRAPPING state	2026-04-09 13:31:38 +03:00
Gleb Natapov	7c895ced19	gossiper: rename is_dead_state to is_left since this is all that the function checks now.	2026-04-09 13:31:38 +03:00
Gleb Natapov	7dfb0577b8	gossiper: use raft topology state instead of gossiper one when checking node's state Raft topology state is the source of truth for a node's state, so use it instead of the gossiper one.	2026-04-09 13:31:38 +03:00
Gleb Natapov	c17c4806a1	storage_service: drop check_for_endpoint_collision function All the checks that it does are also done by join coordinator and the join coordinator uses more reliable raft state instead of gossiper one.	2026-04-09 13:31:37 +03:00
Gleb Natapov	681aa9ebe1	gossiper: remove unused REMOVED_TOKEN state	2026-04-09 13:31:37 +03:00
Gleb Natapov	5af17aa578	gossiper: remove unused advertise_token_removed function	2026-04-09 13:31:36 +03:00
Nikos Dragazis	3e2dc078c9	feature_service: Add vnodes_to_tablets_migrations feature Vnodes-to-tablets migrations require cluster-level support: the REST API and the group0 state need to be supported by all nodes. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-03-24 11:06:38 +02:00
Michael Litvak	ed852a2af2	db: add logstor experimental feature flag add a new experimental feature flag for key-value tables with the new logstor storage engine.	2026-03-18 19:24:26 +01:00
Gleb Natapov	77d3245e02	view: remove upgrade to raft code Since we no longer support upgrades from versions that do not support v2 of the view building code, we can remove the upgrade code and make sure we do not boot with the old builder version.	2026-03-18 17:45:40 +02:00
Gleb Natapov	b633ec1779	features: move GROUP0_SCHEMA_VERSIONING to deprecated features list	2026-03-10 10:46:48 +02:00
Gleb Natapov	4402b030ae	cdc: drop usage of cdc_local table and v1 generation definition	2026-03-10 10:39:59 +02:00
Gleb Natapov	4e56ca3c76	gossiper: do not gossip TOKENS and CDC_GENERATION_ID any more They were used by legacy topology and cdc code only.	2026-03-10 10:39:58 +02:00
Gleb Natapov	77f8f952b2	gossiper: drop tokens from loaded_endpoint_state	2026-03-10 10:39:58 +02:00
Gleb Natapov	706754dc24	gossiper: remove unused functions	2026-03-10 10:39:58 +02:00
Gleb Natapov	2d8722d204	gossiper: drop is_safe_for_restart() function and its use The function checks that the node's state is not left or removed in gossiper during restart, but with raft topology a removed node will not be able to contact the cluster to get this information since it will be banned.	2026-03-10 10:39:58 +02:00
Gleb Natapov	d35b83bec8	gossiper: remove the code that was only used in gossiper topology The topology state machine is always present now and can be passed to the gossiper during creation.	2026-03-10 10:39:58 +02:00
Gleb Natapov	6a7e850161	cdc: remove legacy code The patch removes test/boost/cdc_generation_test.cc since it unit-tests the cdc::limit_number_of_streams_if_needed function, which is removed here.	2026-03-10 10:38:57 +02:00
Gleb Natapov	0b508c5f96	test: remove unused injection points Also remove test_auth_raft_command_split test which is irrelevant since `5ba7d1b116` because it does not use the function that injects max sized command after the commit.	2026-03-10 10:09:39 +02:00
Marcin Maliszkiewicz	a83ee6cf66	Merge 'db/batchlog_manager: re-add v1 support for mixed clusters' from Botond Dénes `3f7ee3ce5d` introduced system.batchlog_v2, with a schema designed to speed up batchlog replays and make post-replay cleanups much more effective. It did not introduce a cluster feature for the new table, because it is a node-local table, so the cluster can switch to the new table gradually, one node at a time. However, https://github.com/scylladb/scylladb/issues/27886 showed that the switch causes timeouts during upgrades, in mixed clusters. Furthermore, switching to the new table unconditionally on upgraded nodes means that on rollback, the batches saved into the v2 table are lost. This PR re-introduces v1 (`system.batchlog`) support and guards the use of the v2 table with a cluster feature, so mixed clusters keep using v1 and thus remain rollback-compatible. The re-introduced v1 support doesn't support post-replay cleanups for simplicity. The cleanup in v1 was never particularly effective anyway and we ended up disabling it for heavy batchlog users, so I don't think the lack of support for cleanup is a problem. 
Fixes: https://github.com/scylladb/scylladb/issues/27886 Needs backport to 2026.1, to fix upgrades for clusters using batches Closes scylladb/scylladb#28736 * github.com:scylladb/scylladb: test/boost/batchlog_manager_test: add tests for v1 batchlog test/boost/batchlog_manager_test: make prepare_batches() work with both v1 and v2 test/boost/batchlog_manager_test: fix indentation test/boost/batchlog_manager_test: extract prepare_batches() method test/lib/cql_assertions: is_rows(): add dump parameter tools/scylla-sstable: extract query result printers tools/scylla-sstable: add std::ostream& arg to query result printers repair/row_level: repair_flush_hints_batchlog_handler(): add all_replayed to finish log db/batchlog_manager: re-add v1 support db/batchlog_manager: return all_replayed from process_batch() db/batchlog_manager: process_bath() fix indentation db/batchlog_manager: make batch() a standalone function db/batchlog_manager: make structs stats public db/batchlog_manager: allocate limiter on the stack db/batchlog_manager: add feature_service dependency gms/feature_service: add batchlog_v2 feature	2026-03-02 12:09:10 +01:00
Patryk Jędrzejczak	9a9202c909	Merge 'Remove gossiper topology code' from Gleb Natapov The PR removes most of the code that assumes that group0 and raft topology is not enabled. It also makes sure that joining a cluster in no raft mode or upgrading a node in a cluster that not yet uses raft topology to this version will fail. Refs #15422 No backport needed since this removes functionality. Closes scylladb/scylladb#28514 * https://github.com/scylladb/scylladb: group0: fix indentation after previous patch raft_group0: simplify get_group0_upgrade_state function since no upgrade can happen any more raft_group0: move service::group0_upgrade_state to use fmt::formatter instead of iostream raft_group0: remove unused code from raft_group0 node_ops: remove topology over node ops code topology: fix indentation after the previous patch topology: drop topology_change_enabled parameter from raft_group0 code storage_service: remove unused handle_state_* functions gossiper: drop wait_for_gossip_to_settle and deprecate correspondent option storage_service: fix indentation after the last patch storage_service: remove gossiper bootstrapping code storage_service: drop get_group_server_if_raft_topolgy_enabled storage_service: drop is_topology_coordinator_enabled and its uses storage_service: drop run_with_api_lock_in_gossiper_mode_only topology: remove code that assumes raft_topology_change_enabled() may return false test: schema_change_test: make test_schema_digest_does_not_change_with_disabled_features tests run in raft mode test: schema_change_test: drop schema tests relevant for no raft mode only topology: remove upgrade to raft topology code group0: remove upgrade to group0 code group0: refuse to boot if a cluster is still is not in a raft topology mode storage_service: refuse to join a cluster in legacy mode	2026-02-27 14:43:41 +01:00
Marcin Maliszkiewicz	a03ebe1a29	Merge 'cql: implement a new per-row TTL feature' from Nadav Har'El This series implements a new per-row TTL feature for CQL. The per-row TTL feature was requested in issue #13000. It is a feature that does not exist in Cassandra, and was inspired by DynamoDB's TTL feature - and under the hood uses the same implementation that we used in Alternator to implement this DynamoDB feature. The new per-row TTL feature is completely separate from CQL's existing per-write (and per-cell) TTL, and both will be available to users. In the per-row TTL feature, one column in the table is designated as the "TTL" column, and its value for a row is the expiration time for that row. The TTL column can be designated at table creation time, e.g.: ```cql CREATE TABLE tab ( id int PRIMARY KEY, t text, expiration timestamp TTL ); ``` Or after the table already exists with: ```cql ALTER TABLE tab TTL expiration ``` Expiration can also be disabled, with: ```cql ALTER TABLE tab TTL NULL ``` The new per-row TTL feature has two features that users have been asking for: 1. A user can change the value of just the TTL column - without rewriting the entire row - to change the expiration time of the entire row. 2. When an expired row is finally deleted, a CDC event about this deletion appears in the CDC log (if CDC is enabled), including - if a preimage is enabled - the content of the deleted row. To achieve the second goal (CDC events), a row is not guaranteed to disappear at exactly its expiration time (as CQL's original TTL feature guarantees). Rather, the row is deleted some time later, depending on `alternator_ttl_period_in_seconds`; Until the actual deletion, the row is still readable (and even writable). But we are guaranteed that when the row is finally deleted, the CDC event will come too. The implementation uses the same background thread used by Alternator to periodically scan for expired items and delete them. 
The expiration thread keeps the same metrics as it did for Alternator: * `scylla_expiration_scan_passes` * `scylla_expiration_scan_table` * `scylla_expiration_items_deleted` * `scylla_expiration_secondary_ranges_scanned` The series begins with a few small preparation patches, followed by the main part of the feature (which isn't big, since we are just enabling the pre-existing Alternator expiration machinary for CQL) and finally 30 tests (single-node and multi-node tests) and documentation. This series is a new feature, so traditionally would not be backported. However, I wouldn't be surprised if we will be requested to backport it so that customers will not need to wait for a new major release. Fixes #13000 Closes scylladb/scylladb#28320 * github.com:scylladb/scylladb: test/cqlpy: verify that a column can't be both STATIC and PRIMARY KEY docs/cql: document the new CQL per-row TTL feature test/cluster: tests for the new CQL per-row TTL feature test/cqlpy: tests for the new CQL per-row TTL feature test: set low alternator_ttl_period_in_seconds in CQL tests cql ttl: fix ALTER TABLE to disable TTL if column is dropped cql ttl: add setting/unsetting of TTL column to ALTER TABLE cql ttl: add TTL column support to CREATE TABLE and DESC TABLE ttl: add CQL support to Alternator's TTL expiration service alternator ttl: move TTL_TAG_KEY to a header file alternator ttl: remove unnecessary check of feature flag cql: add "cql_row_ttl" cluster feature alternator: fix error message if UpdateTimeToLive is not supported	2026-02-26 15:29:12 +01:00
Nadav Har'El	4f4e93b695	cql: add "cql_row_ttl" cluster feature This patch adds a new cluster feature "CQL_ROW_TTL", for the new CQL per-row TTL feature. With this patch, this node reports supporting this feature, but the CQL per-row TTL feature can only be used once all the nodes in the cluster supports the feature. In other words, user requests to enable per-row TTL on a table should check this feature flag (on the whole cluster) before proceeding. This is needed because the implementation of the per-row-TTL expiration requires the cooperation of all nodes to participate in scanning for expired items, so the feature can't be trusted until all nodes participate in it. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-02-25 14:59:41 +02:00
Avi Kivity	511fab1f28	gossiper: exit failure detector sleep faster When running unit tests, there's a visible ~1-second sleep when gossip exits the failure detector loop. Improve this by adding a condition variable for exiting the loop and signaling it when any of the exit conditions are satisfied: the abort_source is pulled, the gossiper is shut down, or the sleep is complete. We can't just use the abort_source because gossip can be shut down independently of the rest of the system. To see the improvement, I ran cql_query_test in dev mode: Before: $ time ./build/dev/test/boost/combined_tests -t cql_query_test -- --smp 2 > /dev/null 2>&1 real 2m26.904s user 0m24.307s sys 0m13.402s After: $ time ./build/dev/test/boost/combined_tests -t cql_query_test -- --smp 2 > /dev/null 2>&1 real 0m26.579s user 0m24.671s sys 0m13.636s Two minutes of real-time saved. Real-life improvement in test.py will be lower, because of the overhead of launching pytest for each test case. Closes scylladb/scylladb#28649	2026-02-25 11:41:02 +02:00
Gleb Natapov	1a57f2b22d	gossiper: drop wait_for_gossip_to_settle and deprecate correspondent option The function is unused now and the option that allows to skip the wait is no longer needed as well.	2026-02-25 10:08:31 +02:00
Gleb Natapov	a8a167623a	topology: remove code that assumes raft_topology_change_enabled() may return false The patch removes the code protected by !raft_topology_change_enabled() since it is no longer reachable. Drop test_lwt_for_tablets_is_not_supported_without_raft since non-raft mode is no longer supported.	2026-02-25 10:08:30 +02:00
Calle Wilund	3075311f21	feature_service: Add SNAPSHOT_AS_TOPOLOGY_OPERATION feature To detect if cluster can do coordinated snapshot	2026-02-23 10:44:41 +01:00
Botond Dénes	c901ab53d2	gms/feature_service: add batchlog_v2 feature	2026-02-20 07:03:45 +02:00


1437 Commits