scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-23 18:10:39 +00:00

Author	SHA1	Message	Date
Botond Dénes	6ce0968960	compaction: release GC'ed sstables incrementally during compaction Garbage collected sstables created during incremental compaction are deleted only at the end of the compaction, which increases the memory footprint. This is inefficient, especially considering that the related input sstables are released regularly during compaction. This commit implements incremental release of GC sstables after each output sstable is sealed. Unlike regular input sstables, GC sstables use a different exhaustion predicate: a GC sstable is only released when its token range no longer overlaps with any remaining input sstable. This is because GC sstables hold tombstones that may shadow data in still-alive overlapping input sstables; releasing them prematurely would cause data resurrection. Fixes #5563 Closes scylladb/scylladb#28984	2026-04-17 18:20:47 +03:00
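The exhaustion predicate described in the commit above can be sketched as follows. This is a minimal illustration, not the Scylla implementation: the `SSTable` record, the integer tokens, and the function names are all hypothetical, and token ranges are assumed to be closed intervals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SSTable:
    # Hypothetical stand-in for an sstable: a name plus a closed token range.
    name: str
    first_token: int
    last_token: int


def overlaps(a: SSTable, b: SSTable) -> bool:
    """True if the closed token ranges of a and b share at least one token."""
    return a.first_token <= b.last_token and b.first_token <= a.last_token


def releasable_gc_sstables(gc_sstables, remaining_inputs):
    """Return the GC sstables that overlap no remaining input sstable.

    A GC sstable that still overlaps a live input is kept, because its
    tombstones may shadow data in that input.
    """
    return [
        gc for gc in gc_sstables
        if not any(overlaps(gc, inp) for inp in remaining_inputs)
    ]


gc_low = SSTable("gc-low", 0, 10)
gc_high = SSTable("gc-high", 50, 60)
remaining = [SSTable("input-3", 40, 55)]
print([s.name for s in releasable_gc_sstables([gc_low, gc_high], remaining)])
```

In this example only `gc-low` is released, since `gc-high` still overlaps the remaining input.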
Avi Kivity	999e108139	Merge 'test: lib: fix broken retry in start_docker_service' from Dario Mirovic The retry loop in `start_docker_service` passes the parse callbacks via `std::move` into `create_handler` on each iteration. After the first iteration, the moved-from `std::function` objects are empty. All subsequent retries skip output parsing entirely and immediately treat the service as successfully started. This defeats the entire purpose of the retry mechanism. Fix by passing the callbacks by copy instead of move, so the original callbacks remain valid across retries. Fixes SCYLLADB-1542 This is a CI stability issue and should be backported. Closes scylladb/scylladb#29504 * github.com:scylladb/scylladb: test/lib: fix typos in proc_utils, gcs_fixture, and dockerized_service test: gcs_fixture: rename container from "local-kms" to "fake-gcs-server" test: fix proc_utils.cc formatting from previous commit test: lib: use unique container name per retry attempt test: lib: fix broken retry in start_docker_service	2026-04-16 21:48:25 +03:00
Botond Dénes	d006c4c476	Merge 'Untie (partially) cql3/statements from db::config' from Pavel Emelyanov There's a bunch of db::config options that are used by cql3/statements/ code. For that they use data_dictionary/database as a proxy to get db::config reference. This PR moves most of these accessed options onto cql_config Options migrated to cql_config: 1. select_internal_page_size 2. strict_allow_filtering 3. enable_parallelized_aggregation 4. batch_size_warn_threshold_in_kb 5. batch_size_fail_threshold_in_kb 6. 7 keyspace replication restriction options 7. 2 TWCS restriction options 8. restrict_future_timestamp 9. strict_is_not_null_in_views (with view_restrictions struct) 10. enable_create_table_with_compact_storage Some options need special treatment and are still abused via database, namely: 1. enable_logstor 2. cluster_name 3. partitioner 4. endpoint_snitch Fixing components inter-dependencies, not backporting Closes scylladb/scylladb#29424 * github.com:scylladb/scylladb: cql3: Move enable_create_table_with_compact_storage to cql_config cql3: Move strict_is_not_null_in_views to cql_config cql3: Move restrict_future_timestamp to cql_config cql3: Move TWCS restriction options to cql_config cql3: Move keyspace restriction options to cql_config cql3: Move batch_size_fail_threshold_in_kb to cql_config cql3: Move batch_size_warn_threshold_in_kb to cql_config cql3: Move enable_parallelized_aggregation to cql_config cql3: Move strict_allow_filtering to cql_config cql3: Move select_internal_page_size to cql_config test: Fix cql_test_env to use updateable cql_config from db::config cql3: Add cql_config parameter to parsed_statement::prepare()	2026-04-16 14:04:43 +03:00
Dario Mirovic	50e498ac0d	test/lib: fix typos in proc_utils, gcs_fixture, and dockerized_service Fix assorted typos in comments, strings, and identifiers: - path_preprend -> path_prepend (proc_utils.hh, proc_utils.cc) - laúnch -> launch (proc_utils.cc) - hand/fail -> hang/fail (dockerized_service.py) - inconvinient -> inconvenient (dockerized_service.py) - priviledges -> privileges (gcs_fixture.hh) - remove double semicolon (gcs_fixture.cc) Refs SCYLLADB-1542	2026-04-16 10:58:55 +02:00
Dario Mirovic	11b5997eaf	test: gcs_fixture: rename container from "local-kms" to "fake-gcs-server" The GCS fixture's fake-gcs-server container was named "local-kms", copy-pasted from the AWS KMS fixture. It happened when both were refactored to use the shared start_docker_service helper (`bc544eb08e`). Rename to "fake-gcs-server" to match the Python-side naming and avoid confusion in logs. Refs SCYLLADB-1542	2026-04-16 10:58:52 +02:00
Dario Mirovic	dc7f848bf8	test: fix proc_utils.cc formatting from previous commit Fix indentation of lines moved inside the for-loop in start_docker_service (lines 208-225). Refs SCYLLADB-1542	2026-04-16 10:55:48 +02:00
Dario Mirovic	be4d32c474	test: lib: use unique container name per retry attempt The container name is generated once before the retry loop, so all retry attempts reuse the same name. Move the name generation inside the loop so each attempt gets a fresh name via the incrementing counter, consistent with the comment "publish port ephemeral, allows parallel instances". Formatting changes (indentation) of lines 208-225 in test/lib/proc_utils.cc will be fixed in the next commit. Refs SCYLLADB-1542	2026-04-16 10:55:04 +02:00
Benny Halevy	ce00d61917	db: implement large_data virtual tables with feature flag gating Replace the physical system.large_partitions, system.large_rows, and system.large_cells CQL tables with virtual tables that read from LargeDataRecords stored in SSTable scylla metadata (tag 13). The transition is gated by a new LARGE_DATA_VIRTUAL_TABLES cluster feature flag: - Before the feature is enabled: the old physical tables remain in all_tables(), CQL writes are active, no virtual tables are registered. This ensures safe rollback during rolling upgrades. - After the feature is enabled: old physical tables are dropped from disk via legacy_drop_table_on_all_shards(), virtual tables are registered on all shards, and CQL writes are skipped via skip_cql_writes() in cql_table_large_data_handler. Key implementation details: - Three virtual table classes (large_partitions_virtual_table, large_rows_virtual_table, large_cells_virtual_table) extend streaming_virtual_table with cross-shard record collection. - generate_legacy_id() gains a version parameter; virtual tables use version 1 to get different UUIDs than the old physical tables. - compaction_time is derived from SSTable generation UUID at display time via UUID_gen::unix_timestamp(). - Legacy SSTables without LargeDataRecords emit synthetic summary rows based on above_threshold > 0 in LargeDataStats. - The activation logic uses two paths: when the feature is already enabled (test env, restart), it runs as a coroutine; when not yet enabled, it registers a when_enabled callback that runs inside seastar::async from feature_service::enable(). - sstable_3_x_test updated to use a simplified large_data_test_handler and validate LargeDataRecords in SSTable metadata directly.	2026-04-16 08:49:02 +03:00
Benny Halevy	cb6004b625	db: call initialize_virtual_tables from shard 0 only Move the smp::invoke_on_all dispatch from the callers into initialize_virtual_tables() itself, so the function is called once from shard 0 and internally distributes the per-shard virtual table setup to all shards. This simplifies the callers and allows a single place to add cross-shard coordination logic (e.g. feature-gated table registration) in future commits.	2026-04-16 08:49:02 +03:00
Benny Halevy	90d4ff34fb	test: add LargeDataRecords round-trip unit tests Add three new test cases to sstable_3_x_test.cc that verify the LargeDataRecords metadata written by the SSTable writer can be read back after open_data(): - test_large_data_records_round_trip: verifies partition_size, row_size, and cell_size records are written with correct field semantics when thresholds are exceeded - test_large_data_records_top_n_bounded: verifies the bounded min-heap keeps only the top-N largest entries per type - test_large_data_records_none_when_below_threshold: verifies no records are written when data is below all thresholds Also wire large_data_records_per_sstable from db_config into the test env's sstables_manager::config so that config changes propagate through the updateable_value chain to configure_writer().	2026-04-16 08:49:02 +03:00
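The bounded min-heap behaviour checked by test_large_data_records_top_n_bounded can be sketched like this. The class and method names are hypothetical and the records are reduced to `(size, key)` pairs; the actual LargeDataRecords layout is not shown in the commit message.

```python
import heapq


class TopN:
    """Keep only the `limit` largest (size, key) entries seen so far."""

    def __init__(self, limit: int):
        self.limit = limit
        self._heap = []  # min-heap: the smallest retained entry is at index 0

    def add(self, size: int, key: str) -> None:
        if self.limit <= 0:
            return
        entry = (size, key)
        if len(self._heap) < self.limit:
            heapq.heappush(self._heap, entry)
        elif entry > self._heap[0]:
            # The new entry beats the smallest retained one, so swap them.
            heapq.heapreplace(self._heap, entry)

    def largest_first(self):
        return sorted(self._heap, reverse=True)


top = TopN(limit=3)
for size, key in [(5, "a"), (1, "b"), (9, "c"), (7, "d"), (3, "e")]:
    top.add(size, key)
print(top.largest_first())
```

Keeping the smallest retained entry at the heap root makes each insertion O(log N) while memory stays bounded by the limit.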
Pavel Emelyanov	728eb20b42	test: Fix cql_test_env to use updateable cql_config from db::config The test environment was creating cql_config with hardcoded default values that were never updated when system.config was modified via CQL. This broke tests that dynamically change configuration values (e.g., TWCS tests). Fix by creating cql_config from db::config using sharded_parameter, which ensures updateable_value fields track the actual db::config sources and reflect changes made during test execution. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-04-16 07:57:26 +03:00
Dario Mirovic	336dab1eec	test: lib: fix broken retry in start_docker_service The retry loop in start_docker_service passes the parse callbacks via std::move into create_handler on each iteration. After the first iteration, the moved-from std::function objects are empty. All subsequent retries skip output parsing entirely and immediately treat the service as successfully started. This defeats the entire purpose of the retry mechanism. Fix by passing the callbacks by copy instead of move, so the original callbacks remain valid across retries. Fixes SCYLLADB-1542	2026-04-15 15:25:52 +02:00
Tomasz Grabiec	b6a7023f68	tablets: Prepare for non-power-of-two tablet count This is a step towards more flexibility in managing tablets. A prerequisite before we can split individual tablets, isolating hot partitions, and evening-out tablet sizes by shifting boundaries. After this patch, the system can handle tables with arbitrary tablet count. Tablet allocator is still rounding up desired tablet count to the nearest power of two when allocating tablets for a new table, so unless the tablet map is allocated in some other way, the counts will be still a power of two. We plan to utilize arbitrary count when migrating from vnodes to tablets, by creating a tablet map which matches vnode boundaries. One of the reasons we don't give up on power-of-two by default yet is that it creates an issue with merges. If tablet count is odd, one of the tablets doesn't have a sibling and will not be merged. That can obviously cause imbalance of token space and tablet sizes between tablets. To limit the impact, this patch dynamically chooses which tablet to isolate when initiating a merge. The largest tablet is chosen, as that will minimize imbalance. Otherwise, if we always chose the last tablet to isolate, its size would remain the same while other tablets double in size with each odd-count merge, leading to imbalance. The imbalance will still be there, but the difference in tablet sizes is limited to 2x. Example (3 tablets): [0] owns 1/3 of tokens [1] owns 1/3 of tokens [2] owns 1/3 of tokens After merge: [0] owns 2/3 of tokens [1] owns 1/3 of tokens What we would like instead: Step 1 (split [1]): [0] owns 1/3 of tokens [1] old 1.left, owns 1/6 of tokens [2] old 1.right, owns 1/6 of tokens [3] owns 1/3 of tokens Step 2 (merge): [0] owns 1/2 of tokens [1] owns 1/2 of tokens To do that, we need to be able to split individual tablets, but we're not there yet.	2026-04-15 10:40:55 +02:00
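The token-ownership arithmetic in the example above can be reproduced with a small sketch. The `merge` and `split` helpers are hypothetical and model only the fractions of token space, not real tablet maps. As a simplification of this sketch, the isolated tablet must sit at an even index so that the remaining tablets pair up with adjacent siblings.

```python
from fractions import Fraction


def merge(owned, isolate=None):
    """Merge adjacent tablets pairwise.

    `owned` lists the fraction of token space owned by each tablet.
    With an odd tablet count, `isolate` names the tablet left unmerged.
    """
    if (len(owned) % 2 == 1) != (isolate is not None):
        raise ValueError("isolate must be given exactly when the count is odd")
    if isolate is not None and isolate % 2 != 0:
        raise ValueError("sketch assumption: isolated index must be even")
    merged = []
    i = 0
    while i < len(owned):
        if i == isolate:
            merged.append(owned[i])
            i += 1
        else:
            merged.append(owned[i] + owned[i + 1])
            i += 2
    return merged


def split(owned, index):
    """Split one tablet into two halves of equal token ownership."""
    half = owned[index] / 2
    return owned[:index] + [half, half] + owned[index + 1:]


third = Fraction(1, 3)
after_merge = merge([third] * 3, isolate=2)   # isolate the last tablet
after_split = split([third] * 3, 1)           # step 1: split tablet [1]
balanced = merge(after_split)                 # step 2: merge pairs
print(after_merge, after_split, balanced)
```

Isolating a tablet leaves ownership at 2/3 and 1/3, while splitting first yields two tablets owning 1/2 each, matching the two outcomes described in the commit message.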
Avi Kivity	0ae22a09d4	LICENSE: Update to version 1.1 Updated terms of non-commercial use (must be a never-customer).	2026-04-12 19:46:33 +03:00
Avi Kivity	8ccee6803e	Merge 'Remove upgrade view builder' from Gleb Natapov Since we no longer support upgrading from versions that do not support v2 of the "view building status" code (in which building status is managed by raft), we can remove the v1 code and the upgrade code, and make sure we do not boot with the old "builder status" version. v2 was introduced by `8d25a4d678`, which is included in scylla-2025.1.0. No backport needed since this is code removal. Closes scylladb/scylladb#29105 * github.com:scylladb/scylladb: view: drop unused v1 builder code view: remove upgrade to raft code	2026-04-12 00:39:26 +03:00
Avi Kivity	ca80ee8586	Merge 'Introduce maintenance scheduling supergroup and do initial population' from Pavel Emelyanov The supergroup replaces streaming (a.k.a. maintenance as well) group, inherits 200 shares from it and consists of four sub-groups (all have equal shares of 200 within the new supergroup) * maintenance_compaction. This group configures `compaction_manager::maintenance_sg()` group. User-triggered compaction runs in it * backup. This group configures `snapshot_ctl::config::backup_sched_group`. Native backup activity runs there * maintenance. It's a new "visible" name, everything that was called "maintenance" in the code ran in "streaming" group. Now it will run in "maintenance". The activities include those that don't communicate over RPC (see below why) * `tablet_allocator::balance_tablets()` * `sstables_manager::components_reclaim_reload_fiber()` * `tablet_storage_group_manager::merge_completion_fiber()` * metrics exporting http server altogether * streaming. This is purely existing streaming group that just moves under the new supergroup. Everything else that was run there, continues doing so, including * hints sender * all view building related components (update generator, builder, workers) * repair * stream_manager * messaging service (except for verb handlers that switch groups) * join_cluster() activity * REST API * ... something else I forgot The `--maintenance_io_throughput_mb_per_sec` option is introduced. It controls the IO throughput limit applied to the maintenance supergroup. If not set, the `--stream_io_throughput_mb_per_sec` option is used to preserve backward compatibility. All new sched groups inherit `request_class::maintenance` (however, "backup" seems not to make any requests yet). Moving more activities from "streaming" into "maintenance" (or its own group) is possible, but one will need to take care of RPC group switching. 
The thing is that when a client makes an RPC call, the server may switch to one of pre-negotiated scheduling groups. Verbs for existing activities that run in "streaming" group are routed through RPC index that negotiates "streaming" group on the server side. If any of that client code moves to some other group, server will still run the handlers in "streaming" which is not quite expected. That's one of the main reasons why only the selected fibers were moved to their own "maintenance" group. Similar for backup -- this code doesn't use RPC, so it can be moved. Restoring code uses load-and-stream and corresponding RPCs, so it cannot be just moved into its own new group. Fixes SCYLLADB-351 New feature, not backporting Closes scylladb/scylladb#28542 * github.com:scylladb/scylladb: code: Add maintenance/maintenance group backup: Add maintenance/backup group compaction: Add maintenance/maintenance_compaction group main: Introduce maintenance supergroup main: Move all maintenance sched group into streaming one database: Use local variable for current_scheduling_group code: Live-update IO throughputs from main	2026-04-12 00:34:48 +03:00
Pavel Emelyanov	58e59e8c0d	Merge 'test: add test_sstable_clone_preserves_staging_state' from Benny Halevy Add a test that verifies filesystem_storage::clone preserves the sstable state: an sstable in staging is cloned to a new generation, the clone is re-loaded from the staging directory, and its state is asserted to still be staging. The change proves that https://scylladb.atlassian.net/browse/SCYLLADB-1205 is invalid, and can be closed. * No functional change and no backport needed Closes scylladb/scylladb#29209 * github.com:scylladb/scylladb: test: add test_sstable_clone_preserves_staging_state test: derive sstable state from directory in test_env::make_sstable sstables: log debug message in filesystem_storage::clone	2026-04-07 17:02:04 +03:00
Ernest Zaslavsky	437a581b04	sstable_utils: add `get_storage` and `open_file` helpers Add a non-const `get_storage` accessor to expose underlying storage, and an `open_file` helper to access sstable component files directly. These are needed so compaction tests can read and write sstable components.	2026-04-05 11:07:17 +03:00
Ernest Zaslavsky	2ad2dbae03	test_env: delay unplugging sstable registry Unplugging the mock sstable_registry happened too early in the test environment. During sstable destruction, components may still need access to the registry, so the unplugging is moved to a later stage.	2026-04-05 11:07:17 +03:00
Marcin Maliszkiewicz	f988ec18cb	test/lib: fix port in-use detection in start_docker_service Previously, the result of when_all was discarded. when_all stores exceptions in the returned futures rather than throwing, so the outer catch(in_use&) could never trigger. Now we capture the when_all result and inspect each future individually to properly detect in_use from either stream. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1216 Closes scylladb/scylladb#29219	2026-03-25 11:45:53 +02:00
Benny Halevy	22f2010477	test: derive sstable state from directory in test_env::make_sstable Instead of always passing sstable_state::normal, infer the state from the last component of the directory path by comparing against the known state subdirectory constants (staging_dir, upload_dir, quarantine_dir). Any unrecognized path component (the common case for normal-state sstables) maps to sstable_state::normal. When a non-normal state is detected, strip the state subdirectory from dir so that the base table directory is passed to storage.	2026-03-24 16:48:01 +02:00
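The directory-based inference described in the commit above can be sketched as follows. The subdirectory names `staging`, `upload`, and `quarantine` are assumed values for the `staging_dir`, `upload_dir`, and `quarantine_dir` constants, and the function name and example path are hypothetical.

```python
from pathlib import PurePosixPath

# Assumed values of the state subdirectory constants.
STATE_SUBDIRS = {"staging", "upload", "quarantine"}


def state_and_base_dir(directory: str):
    """Infer the sstable state from the last path component.

    Returns (state, base_dir). For a non-normal state the state
    subdirectory is stripped so that base_dir is the table directory.
    """
    path = PurePosixPath(directory)
    if path.name in STATE_SUBDIRS:
        return path.name, str(path.parent)
    # Any unrecognized last component means a normal-state sstable.
    return "normal", str(path)


print(state_and_base_dir("/data/ks/t-1234/staging"))
print(state_and_base_dir("/data/ks/t-1234"))
```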
Pavel Emelyanov	3b9398dfc8	Merge 'encryption: fix deadlock in encrypted_data_source::get()' from Ernest Zaslavsky When encrypted_data_source::get() caches a trailing block in _next, the next call takes it directly — bypassing input_stream::read(), which checks _eof. It then calls input_stream::read_exactly() on the already-drained stream. Unlike read(), read_up_to(), and consume(), read_exactly() does not check _eof when the buffer is empty, so it calls _fd.get() on a source that already returned EOS. In production this manifested as stuck encrypted SSTable component downloads during tablet restore: the underlying chunked_download_source hung forever on the post-EOS get(), causing 4 tablets to never complete. The stuck files were always block-aligned sizes (8k, 12k) where _next gets populated and the source is fully consumed in the same call. Fix by checking _input.eof() before calling read_exactly(). When the stream already reached EOF, buf2 is known to be empty, so the call is skipped entirely. A comprehensive test is added that uses a strict_memory_source which fails on post-EOS get(), reproducing the exact code path that caused the production deadlock. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1128 Backport to 2025.3/4 and 2026.1 is needed since it fixes a bug that may bite us in production, to be on the safe side Closes scylladb/scylladb#29110 * github.com:scylladb/scylladb: encryption: fix deadlock in encrypted_data_source::get() test_lib: mark `limiting_data_source_impl` as not `final` Fix formatting after previous patch Fix indentation after previous patch test_lib: make limiting_data_source_impl available to tests	2026-03-23 17:12:44 +03:00
Pavel Emelyanov	6f43e8562e	compaction: Add maintenance/maintenance_compaction group Compaction manager tells compaction_sched_group from maintenance_compaction_sched_group. The latter, however, is set to be "streaming" group. This patch adds real maintenance_compaction group under the maintenance supergroup and makes compaction manager use it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-23 16:00:02 +03:00
Avi Kivity	6b259babeb	Merge 'logstor: initial log-structured storage for key-value tables' from Michael Litvak Introduce an initial and experimental implementation of an alternative log-structured storage engine for key-value tables. Main flows and components: * The storage is composed of 32MB files, each file divided to segments of size 128k. We write to them sequentially records that contain a mutation and additional metadata. Records are written to a buffer first and then written to the active segment sequentially in 4k sized blocks. * The primary index in memory maps keys to their location on disk. It is a B-tree per-table that is ordered by tokens, similar to a memtable. * On reads we calculate the key and look it up in the primary index, then read the mutation from disk with a single disk IO. * On writes we write the record to a buffer, wait for it to be written to disk, then update the index with the new location, and free the previous record. * We track the used space in each segment. When overwriting a record, we increase the free space counter for the segment of the previous record that becomes dead. We store the segments in a histogram by usage. * The compaction process takes segments with low utilization, reads them and writes the live records to new segments, and frees the old segments. * Segments are initially "mixed" - we write to the active segment records from all tables and all tablets. The "separator" process rewrites records from mixed segments into new segments that are organized by compaction groups (tablets), and frees the mixed segments. Each write is written to the active segment and to a separator buffer of the compaction group, which is eventually flushed to a new segment in the compaction group. Currently this mode is experimental and requires an experimental flag to be enabled. Some things that are not supported yet are strong consistency, tablet migration, tablet split/merge, big mutations, tombstone gc, ttl. 
to use, add to config: ``` enable_logstor: true experimental_features: - logstor ``` create a table: ``` CREATE TABLE ks.t(pk int PRIMARY KEY, a int, v text) WITH storage_engine = 'logstor'; ``` INSERT, SELECT, DELETE work as expected UPDATE not supported yet no backport - new feature Closes scylladb/scylladb#28706 * github.com:scylladb/scylladb: logstor: trigger separator flush for buffers that hold old segments docs/dev: add logstor documentation logstor: recover segments into compaction groups logstor: range read logstor: change index to btree by token per table logstor: move segments to replica::compaction_group db: update dirty mem limits dynamically logstor: track memory usage logstor: logstor stats api logstor: compaction buffer pool logstor: separator: flush buffer when full logstor: hold segment until index updates logstor: truncate table logstor: enable/disable compaction per table logstor: separator buffer pool test: logstor: add separator and compaction tests logstor: segment and separator barrier logstor: separator debt controller logstor: compaction controller logstor: recovery: recover mixed segments using separator logstor: wait for pending reads in compaction logstor: separator logstor: compaction groups logstor: cache files for read logstor: recovery: initial logstor: add segment generation logstor: reserve segments for compaction logstor: index: buckets logstor: add buffer header logstor: add group_id logstor: record generation logstor: generation utility logstor: use RIPEMD-160 for index key test: add test_logstor.py api: add logstor compaction trigger endpoint replica: add logstor to db schema: add logstor cf property logstor: initial commit db: disable tablet balancing with logstor db: add logstor experimental feature flag	2026-03-20 00:18:09 +02:00
Ernest Zaslavsky	f74a54f005	test_lib: mark `limiting_data_source_impl` as not `final`	2026-03-19 13:54:54 +02:00
Ernest Zaslavsky	151e945d9f	Fix formatting after previous patch	2026-03-19 13:54:44 +02:00
Ernest Zaslavsky	537747cf5d	Fix indentation after previous patch	2026-03-19 13:48:53 +02:00
Ernest Zaslavsky	2535164542	test_lib: make limiting_data_source_impl available to tests Relocate the `limiting_data_source_impl` declaration to the header file so that test code can access it directly.	2026-03-19 13:48:53 +02:00
Michael Litvak	2128b1b15c	replica: add logstor to db Add a single logstor instance in the database that is used for writing and reading to tables with kv storage	2026-03-18 19:24:26 +01:00
Gleb Natapov	77d3245e02	view: remove upgrade to raft code Since we no longer support upgrading from versions that do not support v2 of the view building code, we can remove the upgrade code and make sure we do not boot with the old builder version.	2026-03-18 17:45:40 +02:00
Tomasz Grabiec	5ee61f067d	test: cql_test_env: Respect enable-index-cache config Mirrors the code in main.cc	2026-03-18 16:25:20 +01:00
Piotr Dulikowski	d8b283e1fb	Merge 'Add CQL forwarding for strongly consistent tables' from Wojciech Mitros In this series we add support for forwarding strongly consistent CQL requests to suitable replicas, so that clients can issue reads/writes to any node and have the request executed on an appropriate tablet replica (and, for writes, on the Raft leader). We return the same CQL response as what the user would get while sending the request to the correct replica and we perform the same logging/stats updates on the request coordinator as if the coordinator was the appropriate replica. The core mechanism of forwarding a strongly consistent request is sending an RPC containing the user's cql request frame to the appropriate replica and returning back a ready, serialized `cql_transport::response`. We do this in the CQL server - it is most prepared for handling these types and forwarding a request containing a CQL frame allows us to reuse near-top-level methods for CQL request handling in the new RPC handler (such as the general `process`) For sending the RPC, the CQL server needs to obtain the information about who should it forward the request to. This requires knowledge about the tablet raft group members and leader. We obtain this information during the execution of a `cql3/strong_consistency` statement, and we return this information back to the CQL server using the generalized `bounce_to_shard` `response_message`, where we now store the information about either a shard, or a specific replica to which we should forward to. Similarly to `bounce_to_shard`, we need to handle this `result_message` in a loop - a replica may move during statement execution, or the Raft leader can change. 
We also use it for forwarding strongly consistent writes when we're not a member of the affected tablet raft group - in that case we need to forward the statement twice - once to any replica of the affected tablet, then that replica can find the leader and return this information to the coordinator, which allows the second request to be directed to the leader. This feature also allows passing through exception messages which happened on the target replica while executing the statement. For that, many methods of the `cql_transport::cql_server::connection` for creating error responses needed to be moved to `cql_transport::cql_server`. And for final exception handling on the coordinator, we added additional error info to the RPC response, so that the handling can be performed without having the `result_message::exception` or `exception_ptr` itself. Fixes [SCYLLADB-71](https://scylladb.atlassian.net/browse/SCYLLADB-71) [SCYLLADB-71]: https://scylladb.atlassian.net/browse/SCYLLADB-71?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Closes scylladb/scylladb#27517 * github.com:scylladb/scylladb: test: add tests for CQL forwarding transport: enable CQL forwarding for strong consistency statements transport: add remote statement preparation for CQL forwarding transport: handle redirect responses in CQL forwarding transport: add exception handling for forwarded CQL requests transport: add basic CQL request forwarding idl: add a representation of client_state for forwarding cql_server: handle query, execute, batch in one case transport: inline process_on_shard in cql_server::process transport: extract process() to cql_server transport: add messaging_service to cql_server transport: add response reconstruction helpers for forwarding transport: generalize the bounce result message for bouncing to other nodes strong consistency: redirect requests to live replicas from the same rack transport: pass foreign_ptr into 
sleep_until_timeout_passes and move it to cql_server transport: extract the error handling from process_request_one transport: move error response helpers from connection to cql_server	2026-03-13 15:03:10 +01:00
Avi Kivity	b228eb26e6	Merge 'dbuild: Use slirp4netns network in dbuild nested containers' from Calle Wilund Fixes #25084 Add slirp4netns and use it for nested containers. This allows nested container port aliasing, helping CI stability. Note: this contains an updated Dockerfile for the dbuild image, but because of the chicken-and-egg problem, the dbuild script for now force-installs slirp4netns before anything else. Updates the mock server handling (boost as well as pytest) to use ephemeral ports and query them from the container, ensuring we don't get port collisions. Includes a timeout increase and a tweak to our scylla_cluster handling, ensuring we don't deadlock when the pipe size is less than required for our sys notify messages. Closes scylladb/scylladb#28727 * github.com:scylladb/scylladb: gcs_fixture: Change to use docker helper aws_kms_fixture: Modify to use docker helper test/lib/proc_util: Add docker helper pytest: use ephemeral port publish for docker mock servers dbuild: Use container network in dbuild nested containers scylla_cluster: Read notify sock in background to prevent deadlock	2026-03-12 23:49:25 +02:00
Wojciech Mitros	b4d66fda2e	strong consistency: redirect requests to live replicas from the same rack Forwarding CQL requests is not implemented yet, but we're already prepared to return the target to forward to when trying to execute strongly consistent requests. Currently, if we're not a replica of the affected tablet, we redirect the request to the first replica in the list. This is not optimal, because this replica may be down or it may be in another rack, making us perform cross-rack requests during forwarding. Instead, we should forward the request to the replica from the same rack and handle the case where the replica is down. In this patch we change the replica selection for forwarding strongly consistent requests, so that when the coordinator isn't a replica, it redirects the request to the replica from the same rack. If the replica from the same rack is down, or there is no replica in our rack, we choose the next closest replica (preferring same-DC replicas over other DCs). If no replica is alive, the query fails - the driver should retry when some replica comes back up.	2026-03-12 17:48:54 +01:00
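The selection order described in the commit above (same rack, then same DC, then other DCs, live replicas only) can be sketched as follows. The `Replica` record, the function name, and the exception type are hypothetical; the real code operates on tablet replica sets and topology objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    # Hypothetical stand-in for a tablet replica and its location.
    host: str
    dc: str
    rack: str
    alive: bool


def pick_forward_target(replicas, my_dc: str, my_rack: str) -> Replica:
    """Pick the closest live replica to forward a request to."""

    def distance(r: Replica) -> int:
        if r.dc == my_dc and r.rack == my_rack:
            return 0  # same rack
        if r.dc == my_dc:
            return 1  # same DC, different rack
        return 2      # different DC

    live = [r for r in replicas if r.alive]
    if not live:
        # No replica is alive: the query fails and the driver may retry.
        raise RuntimeError("no live replica to forward to")
    return min(live, key=distance)


replicas = [
    Replica("n1", "dc1", "r1", alive=False),
    Replica("n2", "dc2", "r1", alive=True),
    Replica("n3", "dc1", "r2", alive=True),
]
print(pick_forward_target(replicas, my_dc="dc1", my_rack="r1").host)
```

Here the same-rack replica `n1` is down, so the same-DC replica `n3` is preferred over `n2` in the other DC.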
Gleb Natapov	c67f876893	service level: make maybe_update_per_service_level_params synchronous It does not call async functions any more.	2026-03-12 15:53:08 +02:00
Calle Wilund	bc544eb08e	gcs_fixture: Change to use docker helper	2026-03-11 12:32:02 +01:00
Calle Wilund	eb2dfe04e1	aws_kms_fixture: Modify to use docker helper	2026-03-11 12:32:02 +01:00
Calle Wilund	4a8afd9649	test/lib/proc_util: Add docker helper Adds a boost test equivalent of dockerized_service that launches a dockerized mock service on an ephemeral port, queries the port, and returns the process.	2026-03-11 12:32:02 +01:00
Patryk Jędrzejczak	37aeba9c8c	Merge 'raft: add global read barrier to group0_batch::commit and switch auth and service levels' from Marcin Maliszkiewicz This series adds a global read barrier to raft_group0_client, ensuring that Raft group0 mutations are applied on all live nodes before returning to the caller. Currently, after a group0_batch::commit, the mutations are only guaranteed to be applied on the leader. Other nodes may still be catching up, leading to stale reads. This patch introduces a broadcast read barrier mechanism. Calling send_group0_read_barrier_to_live_members after committing will cause the coordinator to send a read barrier RPC to all live nodes (discovered via gossiper) and wait for them to complete. This is a best-effort attempt to get cluster-wide visibility of the committed state before the response is returned to the user. Auth and service levels write paths are switched to use this new mechanism. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-650 Backport: no, new feature Closes scylladb/scylladb#28731 * https://github.com/scylladb/scylladb: test: add tests for global group0_batch barrier feature qos: switch service levels write paths to use global group0_batch barrier auth: switch write paths to use global group0_batch barrier raft: add function to broadcast read barrier request raft: add gossiper dependency to raft_group0_client raft: add read barrier RPC	2026-03-11 10:37:19 +01:00
Botond Dénes	475220b9c9	Merge 'Remove the rest of pre raft topology code' from Gleb Natapov Remove the rest of the code that assumes that either group0 does not exist yet or a cluster is still not upgraded to raft topology. Both of those are not supported any more. No need to backport since we remove functionality here. Closes scylladb/scylladb#28841 * github.com:scylladb/scylladb: service level: remove version 1 service level code features: move GROUP0_SCHEMA_VERSIONING to deprecated features list migration_manager: remove unused forward definitions test: remove unused code auth: drop auth_migration_listener since it does nothing now schema: drop schema_registry_entry::maybe_sync() function schema: drop make_table_deleting_mutations since it should not be needed with raft schema: remove calculate_schema_digest function schema: drop recalculate_schema_version function and its uses migration_manager: drop check for group0_schema_versioning feature cdc: drop usage of cdc_local table and v1 generation definition storage_service: no need to add yourself to the topology during reboot since raft state loading already did it storage_service: remove unused functions group0: drop with_raft() function from group0_guard since it always returns true now gossiper: do not gossip TOKENS and CDC_GENERATION_ID any more gossiper: drop tokens from loaded_endpoint_state gossiper: remove unused functions storage_service: do not pass loaded_peer_features to join_topology() storage_service: remove unused fields from replacement_info gossiper: drop is_safe_for_restart() function and its use storage_service: remove unused variables from join_topology gossiper: remove the code that was only used in gossiper topology storage_service: drop the check for raft mode from recovery code cdc: remove legacy code test: remove unused injection points auth: remove legacy auth mode and upgrade code treewide: remove schema pull code since we never pull schema any more raft topology: drop upgrade_state and 
its type from the topology state machine since it is not used any longer group0: hoist the checks for an illegal upgrade into main.cc api: drop get_topology_upgrade_state and always report upgrade status as done service_level_controller: drop service level upgrade code test: drop run_with_raft_recovery parameter to cql_test_env group0: get rid of group0_upgrade_state storage_service: drop topology_change_kind as it is no longer needed storage_service: drop check_ability_to_perform_topology_operation since no upgrades can happen any more service_storage: remove unused functions storage_service: remove non raft rebuild code storage_service: set topology change kind only once group0: drop in_recovery function and its uses group0: rename use_raft to maintenance_mode and make it sync	2026-03-11 10:24:20 +02:00
Botond Dénes	81e214237f	Merge 'Add digests for all sstable components in scylla metadata' from Taras Veretilnyk This pull request adds support for calculating and storing CRC32 digests for all SSTable components. This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream. Several test cases were introduced to verify expected behaviour. Additionally, this PR adds a new rewrite component mechanism for safe sstable component rewriting. Previously, rewriting an sstable component (e.g., via rewrite_statistics) created a temporary file that was renamed to the final name after sealing. This allowed crash recovery by simply removing the temporary file on startup. However, with component digests stored in scylla_metadata (#20100), replacing a component like Statistics requires atomically updating both the component and scylla_metadata with the new digest, which is impossible with POSIX rename. The new mechanism creates a clone sstable with a fresh generation: - Hard-links all components from the source except the component being rewritten and scylla_metadata - Copies original sstable components pointer and recognized components from the source - Invokes a modifier callback to adjust the new sstable before rewriting - Writes the modified component along with updated scylla_metadata containing the new digest - Seals the new sstable with a temporary TOC - Replaces the old sstable atomically, the same way as it is done in compaction This is built on the rewrite_sstables compaction framework to support batch operations (e.g., following incremental repair). In case of any failure during the whole process, the sstable will be automatically deleted on node startup due to temporary TOC persistence. 
Backport is not required, it is a new feature Fixes https://github.com/scylladb/scylladb/issues/20100, https://github.com/scylladb/scylladb/issues/27453 Closes scylladb/scylladb#28338 * github.com:scylladb/scylladb: docs: document components_digests subcomponent and trailing digest in Scylla.db sstable_compaction_test: Add tests for perform_component_rewrite sstable_test: add verification testcases of SSTable components digests persistance sstables: store digest of all sstable components in scylla metadata sstables: replace rewrite_statistics with new rewrite component mechanism sstables: add new rewrite component mechanism for safe sstable component rewriting compaction: add compaction_group_view method to specify sstable version sstables: add null_data_sink and serialized_checksum for checksum-only calculation sstables: extract default write open flags into a constant sstables: Add write_simple_with_digest for component checksumming sstables: Extract file writer closing logic into separate methods sstables: Implement CRC32 digest-only writer	2026-03-10 16:02:53 +02:00
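The digest-while-writing idea in the merge above (a writer that checksums every component as it is written, plus a sink for checksum-only calculation) can be sketched as follows. This is an illustrative Python model, not the C++ implementation: `Crc32DigestWriter`, `NullSink`, and `write_components` are hypothetical names, and the component names in the usage note are placeholders.

```python
import io
import zlib


class Crc32DigestWriter:
    """Wraps a binary sink and keeps a running CRC32 of all bytes written."""

    def __init__(self, sink):
        self._sink = sink
        self._crc = 0

    def write(self, data: bytes) -> int:
        # zlib.crc32 accepts the previous value, so the digest is incremental.
        self._crc = zlib.crc32(data, self._crc)
        return self._sink.write(data)

    @property
    def digest(self) -> int:
        return self._crc & 0xFFFFFFFF


class NullSink:
    """Discards data; useful when only the checksum is needed."""

    def write(self, data: bytes) -> int:
        return len(data)


def write_components(components):
    """Write each component through a digest writer and collect the digests.

    `components` maps a component name to an iterable of byte chunks.
    The returned dict models the per-component digests kept in metadata.
    """
    digests = {}
    for name, chunks in components.items():
        writer = Crc32DigestWriter(io.BytesIO())
        for chunk in chunks:
            writer.write(chunk)
        digests[name] = writer.digest
    return digests
```

For example, `write_components({"Statistics": [b"ab", b"cd"]})` yields the same digest for `"Statistics"` as `zlib.crc32(b"abcd")`, regardless of how the bytes were chunked.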
Gleb Natapov	4660f908f9	auth: drop auth_migration_listener since it does nothing now	2026-03-10 10:46:48 +02:00
Gleb Natapov	d35b83bec8	gossiper: remove the code that was only used in gossiper topology The topology state machine is always present now and can be passed to the gossiper during creation.	2026-03-10 10:39:58 +02:00
Gleb Natapov	6a7e850161	cdc: remove legacy code The patch removes test/boost/cdc_generation_test.cc since it unit tests the cdc::limit_number_of_streams_if_needed function, which is removed here.	2026-03-10 10:38:57 +02:00
Gleb Natapov	1d188f0394	auth: remove legacy auth mode and upgrade code A system needs to be upgraded to use v2 auth before moving to this ScyllaDB version otherwise the boot will fail.	2026-03-10 10:09:39 +02:00
Gleb Natapov	61cc091364	test: drop run_with_raft_recovery parameter to cql_test_env It is unused.	2026-03-10 10:09:38 +02:00
Gleb Natapov	00083b42a7	group0: get rid of group0_upgrade_state Simplify code by getting rid of group0_upgrade_state since upgrade is no longer supported, so no need to track its state. A non-upgraded node will simply not boot; to detect that, the patch checks the state directly from the system table.	2026-03-10 10:09:38 +02:00
Marcin Maliszkiewicz	cbae84a926	raft: add gossiper dependency to raft_group0_client In the following commit, raft_group0_client will send a read barrier RPC to all alive nodes; it takes the list of nodes from the gossiper.	2026-03-09 15:15:59 +01:00
Patryk Jędrzejczak	4c8dba15f1	Merge 'strong_consistency/state_machine: ensure and upgrade mutations schema' from Michał Jadwiszczak This patch fixes 2 issues within strong consistency state machine: - it might happen that apply is called before the schema is delivered to the node - on the other hand, the apply may be called after the schema was changed and purged from the schema registry The first problem is fixed by doing `group0.read_barrier()` before applying the mutations. The second one is solved by upgrading the mutations using column mappings in case the version of the mutations' schema is older. Fixes SCYLLADB-428 Strong consistency is in experimental phase, no need to backport. Closes scylladb/scylladb#28546 * https://github.com/scylladb/scylladb: test/cluster/test_strong_consistency: add reproducer for old schema during apply test/cluster/test_strong_consistency: add reproducer for missing schema during apply test/cluster/test_strong_consistency: extract common function raft_group_registry: allow to drop append entries requests for specific raft group strong_consistency/state_machine: find and hold schemas of applying mutations strong_consistency/state_machine: pull necessary dependencies db/schema_tables: add `get_column_mapping_if_exists()`	2026-03-09 09:49:22 +01:00
Michał Jadwiszczak	33a16940be	strong_consistency/state_machine: pull necessary dependencies Both the migration manager and the system keyspace will be used in the next commit. The first one is needed to execute the group0 read barrier, and we need the system keyspace to get column mappings.	2026-03-05 12:33:17 +01:00


1742 Commits