scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-09 08:23:29 +00:00

Author	SHA1	Message	Date
Avi Kivity	04b54f363b	Merge 'Enable vnodes-to-tablets migrations with arbitrary tokens' from Nikos Dragazis This PR removes the power-of-two token constraint from vnodes-to-tablets migrations, allowing clusters with randomly generated tokens to migrate without manual token reassignment. Previously, migrations required vnode tokens to be a power of two and aligned. In practice, these conditions are not met with Scylla's default random token assignment, so the constraint is a blocker for real-world use. With the introduction of arbitrary tablet boundaries in PR #28459, the tablet layer can now support arbitrary tablet boundaries. This PR builds on that capability to allow arbitrary vnode tokens during migration. When the highest vnode token does not coincide with the end of the token ring, the vnode wraps around, but tablets do not support that. This is handled by splitting it into two tablets: one covering the tail end of the ring and one covering the beginning. Testing has been updated accordingly: existing cluster tests now use randomly generated tokens instead of precomputed power-of-two values, and a new Boost test validates the wrap-around tablet boundary logic. Fixes SCYLLADB-724. New feature, no backport is needed. Closes scylladb/scylladb#29319 * github.com:scylladb/scylladb: test: Use arbitrary tokens in vnodes->tablets migration tests test: boost: Add test for wrap-around vnodes storage_service: Support vnodes->tablets migrations w/ arbitrary tokens storage_service: Hoist migration precondition	2026-04-17 00:46:35 +03:00
Avi Kivity	999e108139	Merge 'test: lib: fix broken retry in start_docker_service' from Dario Mirovic The retry loop in `start_docker_service` passes the parse callbacks via `std::move` into `create_handler` on each iteration. After the first iteration, the moved-from `std::function` objects are empty. All subsequent retries skip output parsing entirely and immediately treat the service as successfully started. This defeats the entire purpose of the retry mechanism. Fix by passing the callbacks by copy instead of move, so the original callbacks remain valid across retries. Fixes SCYLLADB-1542 This is a CI stability issue and should be backported. Closes scylladb/scylladb#29504 * github.com:scylladb/scylladb: test/lib: fix typos in proc_utils, gcs_fixture, and dockerized_service test: gcs_fixture: rename container from "local-kms" to "fake-gcs-server" test: fix proc_utils.cc formatting from previous commit test: lib: use unique container name per retry attempt test: lib: fix broken retry in start_docker_service	2026-04-16 21:48:25 +03:00
Radosław Cybulski	c5ed6b22ae	alternator: add CHILD_SHARDS filtering Add a `CHILD_SHARDS` filter to the `DescribeStream` command. When it is used, the user needs to pass a parent stream shard ID in the JSON `ShardFilter.ShardId` field. DescribeStream will then return only the stream shards that are direct descendants of the given parent stream shard. Each stream shard covers a consecutive part of the token space. A stream shard Q is considered a child of stream shard W when at least one token belongs to the token spaces of both shards. The filtering algorithm itself is somewhat complicated; more details are in the comments in streams.cc. CHILD_SHARDS is Amazon functionality and is required by KCL. Add unit tests. Fixes: #25160 Closes scylladb/scylladb#28189	2026-04-16 18:27:55 +03:00
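The parent/child rule stated in the commit above (a shard is a child when it shares at least one token with the parent) can be sketched as a plain interval-overlap check. This is only an illustration of that definition, not the algorithm in streams.cc, which the commit describes as more involved. The representation of a shard as an inclusive `(first_token, last_token)` pair and the names `ranges_overlap` and `child_shards` are assumptions made for this example.

```python
# Hypothetical model: each stream shard is an inclusive (first_token, last_token) pair.
def ranges_overlap(a, b):
    """Return True if the two inclusive token ranges share at least one token."""
    return a[0] <= b[1] and b[0] <= a[1]


def child_shards(parent, candidates):
    """Keep only the candidate shards whose token range overlaps the parent's."""
    return [shard for shard in candidates if ranges_overlap(parent, shard)]


parent = (0, 99)
next_generation = [(-50, -1), (0, 49), (50, 120), (121, 200)]
print(child_shards(parent, next_generation))  # [(0, 49), (50, 120)]
```

Shards that end before the parent starts, or start after it ends, are dropped; a shard that only partially overlaps the parent still counts as a child.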
Piotr Szymaniak	d0c3f78d76	test/alternator: extend local TTL streams timeout Increase the non-AWS wait in the TTL streams test to reduce vnode CI flakes caused by delayed expiration visibility. Fixes SCYLLADB-1556 Closes scylladb/scylladb#29516	2026-04-16 15:53:35 +03:00
Emil Maskovsky	91df3795fc	encryption: cover system.raft table in system_info_encryption Extend system_info_encryption to encrypt system.raft SSTables. system.raft contains the Raft log, which may hold sensitive user data (e.g. batched mutations), so it warrants the same treatment as system.batchlog and system.paxos. During upgrade, existing unencrypted system.raft SSTables remain readable. Existing data is rewritten encrypted via compaction, or immediately via nodetool upgradesstables -a. Update the operator-facing system_info_encryption description to mention system.raft and add a focused test that verifies the schema extension is present on system.raft. Fixes: CUSTOMER-268 Backport: 2026.1 - closes an encryption-at-rest coverage gap: system.raft may persist sensitive user-originated data unencrypted; backport to the current LTS. Closes scylladb/scylladb#29242	2026-04-16 13:22:10 +02:00
Botond Dénes	d006c4c476	Merge 'Untie (partially) cql3/statements from db::config' from Pavel Emelyanov There's a bunch of db::config options that are used by cql3/statements/ code. For that they use data_dictionary/database as a proxy to get db::config reference. This PR moves most of these accessed options onto cql_config Options migrated to cql_config: 1. select_internal_page_size 2. strict_allow_filtering 3. enable_parallelized_aggregation 4. batch_size_warn_threshold_in_kb 5. batch_size_fail_threshold_in_kb 6. 7 keyspace replication restriction options 7. 2 TWCS restriction options 8. restrict_future_timestamp 9. strict_is_not_null_in_views (with view_restrictions struct) 10. enable_create_table_with_compact_storage Some options need special treatment and are still abused via database, namely: 1. enable_logstor 2. cluster_name 3. partitioner 4. endpoint_snitch Fixing components inter-dependencies, not backporting Closes scylladb/scylladb#29424 * github.com:scylladb/scylladb: cql3: Move enable_create_table_with_compact_storage to cql_config cql3: Move strict_is_not_null_in_views to cql_config cql3: Move restrict_future_timestamp to cql_config cql3: Move TWCS restriction options to cql_config cql3: Move keyspace restriction options to cql_config cql3: Move batch_size_fail_threshold_in_kb to cql_config cql3: Move batch_size_warn_threshold_in_kb to cql_config cql3: Move enable_parallelized_aggregation to cql_config cql3: Move strict_allow_filtering to cql_config cql3: Move select_internal_page_size to cql_config test: Fix cql_test_env to use updateable cql_config from db::config cql3: Add cql_config parameter to parsed_statement::prepare()	2026-04-16 14:04:43 +03:00
Botond Dénes	88a8324e68	Merge 'db: store large data records in SSTable metadata and serve via virtual tables' from Benny Halevy `system.large_partitions`, `system.large_rows`, and `system.large_cells` store records keyed by SSTable name. When SSTables are migrated between shards or nodes (resharding, streaming, decommission), the records are lost because the destination never writes entries for the migrated SSTables. This patch series moves the source of truth for large data records into the SSTable's scylla metadata component (new `LargeDataRecords` tag 13) and reimplements the three `system.large_` tables as virtual tables that query live SSTables on demand. A cluster feature flag (`LARGE_DATA_VIRTUAL_TABLES`) gates the transition for safe rolling upgrades. When the cluster feature is enabled, each node drops the old system large_ tables and starts serving the corresponding tables using virtual tables that represent the large data records now stored on the sstables. Note that the virtual tables will be empty after upgrade until the sstables that contained large data are rewritten; therefore it is recommended to run upgrade sstables compaction or major compaction to repopulate the sstables scylla-metadata with large data records. 1. keys: move key_to_str() to keys/keys.hh - make the helper reusable across large_data_handler, virtual tables, and scylla-sstable 2. sstables: add LargeDataRecords metadata type (tag 13) - new struct with binary-serialized key fields, scylla-sstable JSON support, format documentation 3. large_data_handler: rename partition_above_threshold to above_threshold_result - generalize the struct for reuse 4. large_data_handler: return above_threshold_result from maybe_record_large_cells - separate booleans for cell size vs collection elements thresholds 5. sstables: populate LargeDataRecords from writer - bounded min-heaps (one per large_data_type), configurable top-N via `compaction_large_data_records_per_sstable` 6. test: add LargeDataRecords round-trip unit tests - verify write/read, top-N bounding, below-threshold behavior 7. db: call initialize_virtual_tables from shard 0 only - preparatory refactoring to enable cross-shard coordination 8. db: implement large_data virtual tables with feature flag gating - three virtual table classes, feature flag activation, legacy SSTable fallback, dual-threshold dedup, cross-shard collection Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1276 * Although this fixes a bug where large data entries are effectively lost when sstables are renamed or migrated, the changes are intrusive and do not warrant a backport Closes scylladb/scylladb#29257 * github.com:scylladb/scylladb: db: implement large_data virtual tables with feature flag gating db: call initialize_virtual_tables from shard 0 only test: add LargeDataRecords round-trip unit tests sstables: populate LargeDataRecords from writer large_data_handler: return above_threshold_result from maybe_record_large_cells large_data_handler: rename partition_above_threshold to above_threshold_result sstables: add LargeDataRecords metadata type (tag 13) sstables: add fmt::formatter for large_data_type keys: move key_to_str() to keys/keys.hh	2026-04-16 14:03:31 +03:00
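The "bounded min-heap" mentioned in item 5 of the series above is a standard way to keep only the N largest entries seen so far. The following is a minimal sketch of that general technique using Python's `heapq`, not the SSTable writer's actual code. The function name `top_n_large_records`, the `(size, key)` record shape, and the strict `size > threshold` comparison are assumptions for illustration.

```python
import heapq


def top_n_large_records(records, n, threshold):
    """Return up to n largest (size, key) records whose size exceeds threshold.

    The heap root is always the smallest record kept so far, so a new record
    only has to beat the root to displace it.
    """
    if n <= 0:
        return []
    heap = []
    for size, key in records:
        if size <= threshold:
            continue  # below-threshold data produces no record (assumed semantics)
        if len(heap) < n:
            heapq.heappush(heap, (size, key))
        elif size > heap[0][0]:
            heapq.heapreplace(heap, (size, key))
    return sorted(heap, reverse=True)


records = [(5, "a"), (50, "b"), (20, "c"), (70, "d")]
print(top_n_large_records(records, n=2, threshold=10))  # [(70, 'd'), (50, 'b')]
```

Memory stays bounded by `n` regardless of how many records are seen, which is the property a configurable per-SSTable limit would rely on.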
Pavel Emelyanov	207d3b4a68	test_backup: Remove create_schema() helper Test Remove the create_schema() helper function and inline its logic directly into the four call sites. This simplifies the code by eliminating a trivial wrapper. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Closes scylladb/scylladb#29406	2026-04-16 12:57:26 +03:00
Botond Dénes	830d28a889	Merge 'Use standard helpers to create ks:cf and populate it in test_backup.py' from Pavel Emelyanov The PR removed the create_and_ks() helper from backup test and patches all callers to create keyspace, table and populate them with standard explicit facilities. While patching it turned out that one test doesn't need to populate the table, so it even becomes tiny bit shorter and faster Enhancing test, not backporting Closes scylladb/scylladb#29417 * github.com:scylladb/scylladb: test_backup: Remove create_ks_and_cf helper Test test_backup: Replace create_ks_and_cf with async patterns Test test_backup: Add if-True blocks for indentation Test	2026-04-16 12:54:21 +03:00
Nikos Dragazis	7abcf94823	test: Use arbitrary tokens in vnodes->tablets migration tests The migration tests used to start nodes with pre-computed power-of-two tokens. This was required because the migration itself only supported power-of-two aligned tokens. Now that arbitrary tokens are supported, switch the tests to use Scylla's default random token assignment. Switching to arbitrary tokens makes the tests non-deterministic, but the migration aspects that are affected by the token distribution (resharding, wrap-around vnode split) are out of scope for these tests and covered by dedicated tests. Add a `get_all_vnode_tokens()` helper that queries system.topology at runtime to discover the actual token layout, and derive expected tablet counts from that. Also account for the possible extra wrap-around tablet when the last vnode token does not coincide with MAX_TOKEN. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-04-16 12:47:27 +03:00
Nikos Dragazis	26f0c038af	test: boost: Add test for wrap-around vnodes Add a Boost test to verify that `prepare_for_tablets_migration()` produces the correct tablet boundaries when a wrap-around vnode exists. Tablets cannot wrap around the token ring as vnodes do; the last token of the last tablet must always be MAX_TOKEN. When the last vnode token does not coincide with MAX_TOKEN, the wrap-around vnode must be split into two tablets. The test is parameterized over both cases: unaligned (split expected) and aligned (no split expected). Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-04-16 12:47:16 +03:00
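The wrap-around rule described in the commit above can be sketched as follows: every vnode token becomes the last token of a tablet, and if the highest vnode token is not MAX_TOKEN, one extra tablet ending at MAX_TOKEN is appended to cover the tail of the ring. This is an illustrative sketch, not the logic of `prepare_for_tablets_migration()`. The helper name `tablet_last_tokens` and the value used for `MAX_TOKEN` (the largest signed 64-bit integer) are assumptions.

```python
# Assumed for illustration: tokens are signed 64-bit integers.
MAX_TOKEN = 2**63 - 1


def tablet_last_tokens(vnode_tokens):
    """Return the last token of each tablet derived from the vnode tokens.

    The final tablet must end at MAX_TOKEN because tablets cannot wrap around
    the ring. When the highest vnode token is lower than MAX_TOKEN, the
    wrap-around vnode is split: its head becomes the first tablet and its
    tail becomes an extra tablet ending at MAX_TOKEN.
    """
    last_tokens = sorted(set(vnode_tokens))
    if not last_tokens:
        raise ValueError("at least one vnode token is required")
    if last_tokens[-1] != MAX_TOKEN:
        last_tokens.append(MAX_TOKEN)
    return last_tokens


# Unaligned case: three vnodes yield four tablets.
print(len(tablet_last_tokens([-100, 0, 500])))  # 4
# Aligned case: the last vnode already ends at MAX_TOKEN, so no split.
print(len(tablet_last_tokens([-100, MAX_TOKEN])))  # 2
```

This also matches the expectation in the migration tests that the tablet count equals the vnode count plus one possible extra wrap-around tablet.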
Botond Dénes	c355df4461	Merge 'test: Lower default log level from DEBUG to INFO' from Artsiom Mishuta 1. test.py - Removed --log-level=DEBUG flag from pytest args 2. test/pytest.ini - Changed log_level to INFO (it was set to DEBUG in test.py), changed log_file_level from DEBUG to INFO, added clarifying comments + minor fix [test/pylib: save logs on success only during teardown phase](`0ede308a04`) Previously, when --save-log-on-success was enabled, logs were saved for every test phase (setup, call, teardown) in 3 files. Restrict it to only the teardown phase, which contains all 3 in case of test success, to avoid redundant log entries. Closes scylladb/scylladb#29086 * github.com:scylladb/scylladb: test/pylib: save logs on success only during teardown phase test: Lower default log level from DEBUG to INFO	2026-04-16 12:46:11 +03:00
Botond Dénes	9bfcc25cf7	Merge 'streaming: stream_blob: hold table for streaming' from Michael Litvak When initializing streaming sources in tablet_stream_files_handler we use a reference to the table. We should hold the table while doing so, because otherwise the table may be dropped and destroyed when we yield. Use the table.stream_in_progress() phaser to hold the table while we access it. For sstable file streaming we can release the table after the snapshot is initialized, and the table may be dropped safely because the files are held by the snapshot and we don't access the table anymore. There was a single access to the table for logging but it is replaced by a pre-calculated variable. Logstor segment streaming currently doesn't support discarding the segments while they are streamed: when the table is dropped it discards the segments by overwriting and freeing them, so they shouldn't be accessed after that. Therefore, in that case continue to hold the table until streaming is completed. Fixes [SCYLLADB-1533](https://scylladb.atlassian.net/browse/SCYLLADB-1533) It's a pre-existing use-after-free issue in sstable file streaming so it should be backported to all releases. It's also made worse with the recent changes of logstor, and also affects non-logstor tables, so the logstor fixes should be in the same release (2026.2). [SCYLLADB-1533]: https://scylladb.atlassian.net/browse/SCYLLADB-1533?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Closes scylladb/scylladb#29488 * github.com:scylladb/scylladb: test: test drop table during streaming streaming: stream_blob: hold table for streaming	2026-04-16 12:12:42 +03:00
Dario Mirovic	50e498ac0d	test/lib: fix typos in proc_utils, gcs_fixture, and dockerized_service Fix assorted typos in comments, strings, and identifiers: - path_preprend -> path_prepend (proc_utils.hh, proc_utils.cc) - laúnch -> launch (proc_utils.cc) - hand/fail -> hang/fail (dockerized_service.py) - inconvinient -> inconvenient (dockerized_service.py) - priviledges -> privileges (gcs_fixture.hh) - remove double semicolon (gcs_fixture.cc) Refs SCYLLADB-1542	2026-04-16 10:58:55 +02:00
Dario Mirovic	11b5997eaf	test: gcs_fixture: rename container from "local-kms" to "fake-gcs-server" The GCS fixture's fake-gcs-server container was named "local-kms", copy-pasted from the AWS KMS fixture. It happened when both were refactored to use the shared start_docker_service helper (`bc544eb08e`). Rename to "fake-gcs-server" to match the Python-side naming and avoid confusion in logs. Refs SCYLLADB-1542	2026-04-16 10:58:52 +02:00
Dario Mirovic	dc7f848bf8	test: fix proc_utils.cc formatting from previous commit Fix indentation of lines moved inside the for-loop in start_docker_service (lines 208-225). Refs SCYLLADB-1542	2026-04-16 10:55:48 +02:00
Dario Mirovic	be4d32c474	test: lib: use unique container name per retry attempt The container name is generated once before the retry loop, so all retry attempts reuse the same name. Move the name generation inside the loop so each attempt gets a fresh name via the incrementing counter, consistent with the comment "publish port ephemeral, allows parallel instances". Formatting changes (indentation) of lines 208-225 in test/lib/proc_utils.cc will be fixed in the next commit. Refs SCYLLADB-1542	2026-04-16 10:55:04 +02:00
Botond Dénes	33682fd14e	Merge 'sstables/storage_manager: fix race between object storage config update and keyspace creation' from Dimitrios Symonidis Previously, config_updater used a serialized_action to trigger update_config() when object_storage_endpoints changed. Because serialized_action::trigger() always schedules the action as a new reactor task (via semaphore::wait().then()), there was a window between the config value becoming visible to the REST API and update_config() actually running. This allowed a concurrent CREATE KEYSPACE to see the new endpoint via is_known_endpoint() before storage_manager had registered it in _object_storage_endpoints. Now config observers run synchronously in a reactor turn and must not suspend. Split the previous monolithic async update_config() coroutine into two phases: - Sync (in the observer, never suspends): storage_manager::_object_storage_endpoints is updated in place; for already-instantiated clients, update_config_sync swaps the new config atomically - Async (per-client gate): background fibers finish the work that can't run in the observer — S3 refreshes credentials under _creds_sem; GCS drains and closes the replaced client. Config reloads triggered by SIGHUP are applied on shard 0 and then broadcast to all other shards. An rwlock has been also introduced to make sure that the configuration has been propagated to all cores. This guarantees that a client requesting a config via the REST API will see a consistent snapshot Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-757 Fixes: [28141](https://github.com/scylladb/scylladb/issues/28141) Closes scylladb/scylladb#28950 * github.com:scylladb/scylladb: test/object_store: verify object storage client creation and live reconfiguration sstables/utils/s3: split config update into sync and async parts test_config: improve logging for wait_for_config API db: introduce read-write lock to synchronize config updates with REST API	2026-04-16 10:20:43 +03:00
Botond Dénes	8e7ba7efe2	Merge 'commitlog: fix segment replay order by using ordered map per shard' from Sergey Zolotukhin The commitlog replayer groups segments by shard using a std::unordered_multimap, then iterates per-shard segments via equal_range(). However, equal_range() does not guarantee iteration order for elements with the same key, so segments could be replayed out of order within a shard. Correct segment ordering is required for: - Fragmented entry reconstruction, which accumulates fragments across segments and depends on ascending order for efficient processing. - Commitlog-based storage used by the strongly consistent tables feature, which relies on replayed raft items being stored in order. Fix by changing the data structure from std::unordered_multimap<unsigned, commitlog::descriptor> to std::unordered_map<unsigned, utils::chunked_vector<commitlog::descriptor>> Since the descriptors are inserted from a std::set ordered by ID, the vector preserves insertion (and thus ID) order. The per-shard iteration now simply iterates the vector, guaranteeing correct replay order. Fixes: SCYLLADB-1411 Backport: It looks like this issue doesn't cause any trouble, and is required only by the strong consistent tables, so no backporting required. Closes scylladb/scylladb#29372 * github.com:scylladb/scylladb: commitlog: add test to verify segment replay order commitlog: fix replay order by using ordered map per shard	2026-04-16 09:55:27 +03:00
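The data-structure change in the commit above (a map from shard to an ordered sequence, filled from an ID-ordered source) can be illustrated with a short sketch. It is a Python analogue of the idea, not the C++ replayer code. The descriptor shape `(shard, segment_id)` and the function name `group_segments_by_shard` are assumptions for this example.

```python
from collections import defaultdict


def group_segments_by_shard(descriptors):
    """Group (shard, segment_id) descriptors into per-shard lists.

    Descriptors are visited in ascending segment ID order, and each list
    preserves insertion order, so every shard replays its segments in
    ascending ID order.
    """
    per_shard = defaultdict(list)
    for shard, segment_id in sorted(descriptors, key=lambda d: d[1]):
        per_shard[shard].append(segment_id)
    return dict(per_shard)


descriptors = [(1, 30), (0, 10), (1, 20), (0, 40)]
print(group_segments_by_shard(descriptors))  # {0: [10, 40], 1: [20, 30]}
```

The ordering guarantee comes from the list, not the map: the outer mapping can stay unordered because only the per-shard sequence matters for replay.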
Benny Halevy	ce00d61917	db: implement large_data virtual tables with feature flag gating Replace the physical system.large_partitions, system.large_rows, and system.large_cells CQL tables with virtual tables that read from LargeDataRecords stored in SSTable scylla metadata (tag 13). The transition is gated by a new LARGE_DATA_VIRTUAL_TABLES cluster feature flag: - Before the feature is enabled: the old physical tables remain in all_tables(), CQL writes are active, no virtual tables are registered. This ensures safe rollback during rolling upgrades. - After the feature is enabled: old physical tables are dropped from disk via legacy_drop_table_on_all_shards(), virtual tables are registered on all shards, and CQL writes are skipped via skip_cql_writes() in cql_table_large_data_handler. Key implementation details: - Three virtual table classes (large_partitions_virtual_table, large_rows_virtual_table, large_cells_virtual_table) extend streaming_virtual_table with cross-shard record collection. - generate_legacy_id() gains a version parameter; virtual tables use version 1 to get different UUIDs than the old physical tables. - compaction_time is derived from SSTable generation UUID at display time via UUID_gen::unix_timestamp(). - Legacy SSTables without LargeDataRecords emit synthetic summary rows based on above_threshold > 0 in LargeDataStats. - The activation logic uses two paths: when the feature is already enabled (test env, restart), it runs as a coroutine; when not yet enabled, it registers a when_enabled callback that runs inside seastar::async from feature_service::enable(). - sstable_3_x_test updated to use a simplified large_data_test_handler and validate LargeDataRecords in SSTable metadata directly.	2026-04-16 08:49:02 +03:00
Benny Halevy	cb6004b625	db: call initialize_virtual_tables from shard 0 only Move the smp::invoke_on_all dispatch from the callers into initialize_virtual_tables() itself, so the function is called once from shard 0 and internally distributes the per-shard virtual table setup to all shards. This simplifies the callers and allows a single place to add cross-shard coordination logic (e.g. feature-gated table registration) in future commits.	2026-04-16 08:49:02 +03:00
Benny Halevy	90d4ff34fb	test: add LargeDataRecords round-trip unit tests Add three new test cases to sstable_3_x_test.cc that verify the LargeDataRecords metadata written by the SSTable writer can be read back after open_data(): - test_large_data_records_round_trip: verifies partition_size, row_size, and cell_size records are written with correct field semantics when thresholds are exceeded - test_large_data_records_top_n_bounded: verifies the bounded min-heap keeps only the top-N largest entries per type - test_large_data_records_none_when_below_threshold: verifies no records are written when data is below all thresholds Also wire large_data_records_per_sstable from db_config into the test env's sstables_manager::config so that config changes propagate through the updateable_value chain to configure_writer().	2026-04-16 08:49:02 +03:00
Pavel Emelyanov	728eb20b42	test: Fix cql_test_env to use updateable cql_config from db::config The test environment was creating cql_config with hardcoded default values that were never updated when system.config was modified via CQL. This broke tests that dynamically change configuration values (e.g., TWCS tests). Fix by creating cql_config from db::config using sharded_parameter, which ensures updateable_value fields track the actual db::config sources and reflect changes made during test execution. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-04-16 07:57:26 +03:00
Nadav Har'El	f0e9177130	Merge 'audit/alternator: Make Alternator requests audited' from Piotr Szymaniak Each Alternator API call results in the request being audited, provided the auditing is enabled. Both successful as well as the failed requests are audited, with few exceptions. The chosen audit types for the operations: - CreateTable - DDL - DescribeTable - QUERY - DeleteTable - DDL - UpdateTable - DDL - PutItem - DML - UpdateItem - DML - GetItem - QUERY - DeleteItem - DML - ListTables - QUERY - Scan - QUERY - DescribeEndpoints - QUERY - BatchWriteItem - DML - BatchGetItem - QUERY - Query - QUERY - TagResource - DDL - UntagResource - DDL - ListTagsOfResource - QUERY - UpdateTimeToLive - DDL - DescribeTimeToLive - QUERY - ListStreams - QUERY - DescribeStream - QUERY - GetShardIterator - QUERY - GetRecords - QUERY - DescribeContinuousBackups - QUERY FIXME: The tests are now covering the new functionality only partially. Fixes: scylladb/scylla-enterprise#3796 Fixes: SCYLLADB-467 No need to backport, new functionality. Closes scylladb/scylladb#27953 * github.com:scylladb/scylladb: audit/alternator: support audit_tables=alternator.<table> shorthand audit/alternator: Add negative audit tests audit/alternator: Add testing of auditing audit/alternator: Audit requests audit/alternator: Refactor in preparation for auditing Alternator	2026-04-15 22:17:57 +03:00
Nikos Dragazis	d38f44208a	test/cqlpy: Harden mutation_fragments tests against background flushes Several tests in test_select_from_mutation_fragments.py assume that all mutations end up in a single SSTable. This assumption can be violated by background memtable flushes triggered by commitlog disk pressure. Since the Scylla node is taken from a pool, it may carry unflushed data from prior tests that prevents closed segments from being recycled, thereby increasing the commitlog disk usage. A main source of such pressure is keyspace-level flushes from earlier tests in this module, which rotate commitlog segments without flushing system tables (e.g., `system.compaction_history`), leaving closed segments dirty. Additionally, prior tests in the same module may have left unflushed data on the shared test table (`test_table` fixture), keeping commitlog segments dirty on its behalf as well. When commitlog disk usage exceeds its threshold, the system flushes the test table to reclaim those segments, potentially splitting a running test's mutations across multiple SSTables. This was observed in CI, where test_paging failed because its data was split across two SSTables, resulting in more mutation fragments than the hardcoded expected count. This patch fixes the affected tests in two ways: 1. Where possible, tests are reworked to not assume a single SSTable: - test_paging - test_slicing_rows - test_many_partition_scan 2. Where rework is impractical, major compaction is added after writes and before validation to ensure that only one SSTable will exist: - test_smoke - test_count - test_metadata_and_value - test_slicing_range_tombstone_changes Fixes SCYLLADB-1375. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#29389	2026-04-15 21:46:00 +03:00
Michael Litvak	cc94467097	test: test drop table during streaming Add a test that drops a table while tablet streaming is running for the table. The table is dropped after taking the storage snapshot and initializing streaming sources - after that streaming should be able to complete or abort correctly if the table is dropped. We want to verify there is no incorrect access to the destroyed table. The test covers both types of streaming in stream_blob - sstables and logstor segments.	2026-04-15 19:23:00 +02:00
Avi Kivity	59ec93b86b	Merge 'Allow arbitrary tablet boundaries and count' from Tomasz Grabiec There are several reasons we want to do that. One is that it will give us more flexibility in distributing the load. We can subdivide tablets at any token, and achieve more evenly-sized tablets. In particular, we can isolate large partitions into separate tablets. We can also split and merge incrementally individual tablets. Currently, we do it for the whole table or nothing, which makes splits and merges take longer and cause wide swings of the count. This is not implemented in this PR yet, we still split/merge the whole table. Another reason is vnode to tablets migration. We now could construct a tablet map which matches exactly the vnode boundaries, so migration can happen transparently from CQL-coordinator point of view. Tablet count is still a power-of-two by default for newly created tables. It may be different if tablet map is created by non-standard means, or if per-table tablet option "pow2_count" is set to "false". 
build/release/scylla perf-tablets: Memory footprint for 131k tablets increased from 56 MiB to 58.1 MiB (+3.5%) Before: ``` Generating tablet metadata Total tablet count: 131072 Size of tablet_metadata in memory: 57456 KiB Copied in 0.014346 [ms] Cleared in 0.002698 [ms] Saved in 1234.685303 [ms] Read in 445.577881 [ms] Read mutations in 299.596313 [ms] 128 mutations Read required hosts in 247.482742 [ms] Size of canonical mutations: 33.945053 [MiB] Disk space used by system.tablets: 1.456761 [MiB] Tablet metadata reload: full 407.69ms partial 2.65ms ``` After: ``` Generating tablet metadata Total tablet count: 131072 Size of tablet_metadata in memory: 59504 KiB Copied in 0.032475 [ms] Cleared in 0.002965 [ms] Saved in 1093.877441 [ms] Read in 387.027100 [ms] Read mutations in 255.752121 [ms] 128 mutations Read required hosts in 211.202805 [ms] Size of canonical mutations: 33.954453 [MiB] Disk space used by system.tablets: 1.450162 [MiB] Tablet metadata reload: full 354.50ms partial 2.19ms ``` Closes scylladb/scylladb#28459 * github.com:scylladb/scylladb: test: boost: tablets: Add test for merge with arbitrary tablet count tablets, database: Advertise 'arbitrary' layout in snapshot manifest tablets: Introduce pow2_count per-table tablet option tablets: Prepare for non-power-of-two tablet count tablets: Implement merged tablet_map constructor on top of for_each_sibling_tablets() tablets: Prepare resize_decision to hold data in decisions tablets: table: Make storage_group handle arbitrary merge boundaries tablets: Make stats update post-merge work with arbitrary merge boundaries locator: tablets: Support arbitrary tablet boundaries locator: tablets: Introduce tablet_map::get_split_token() dht: Introduce get_uniform_tokens()	2026-04-15 18:57:22 +03:00
Andrzej Jackowski	78926d9c96	test/random_failures: remove gossip shadow round injection Commit `c17c4806a1` removed check_for_endpoint_collision() from the fresh bootstrap path, which was the only code path that called do_shadow_round() for new nodes. Since the gossip shadow round is no longer executed during bootstrap, remove the stop_during_gossip_shadow_round error injection from the test. The entry is marked as REMOVED_ rather than deleted to preserve the shuffle order for seed-based test reproducibility. The injection point in gms/gossiper.cc is also removed since it is no longer used by any test. Fixes: SCYLLADB-1466 Closes scylladb/scylladb#29460	2026-04-15 16:30:55 +02:00
Dario Mirovic	336dab1eec	test: lib: fix broken retry in start_docker_service The retry loop in start_docker_service passes the parse callbacks via std::move into create_handler on each iteration. After the first iteration, the moved-from std::function objects are empty. All subsequent retries skip output parsing entirely and immediately treat the service as successfully started. This defeats the entire purpose of the retry mechanism. Fix by passing the callbacks by copy instead of move, so the original callbacks remain valid across retries. Fixes SCYLLADB-1542	2026-04-15 15:25:52 +02:00
Asias He	4137a4229c	test: Stabilize tablet incremental repair error test Use async tablet repair task flow to avoid a race where client timeout returns while server-side repair continues after injections are disabled. Start repair with await_completion=false, assert it does not complete within timeout under injection, abort/wait the task, then verify sstables_repaired_at is unchanged. Fixes SCYLLADB-1184 Closes scylladb/scylladb#29452	2026-04-15 16:24:43 +03:00
Dimitrios Symonidis	ca003680a7	test/object_store: verify object storage client creation and live reconfiguration	2026-04-15 14:28:39 +02:00
Dimitrios Symonidis	a958da0ab9	test_config: improve logging for wait_for_config API	2026-04-15 14:28:31 +02:00
Botond Dénes	00d8470554	Merge 'test: filter benign shutdown errors in tests that grep logs directly' from Marcin Maliszkiewicz Tests that call grep_for_errors() directly and assert no errors can fail spuriously due to benign RPC errors during graceful shutdown (e.g. "connection dropped: Semaphore broken"), which are already filtered by the after_test hook via filter_errors(). Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1464 Backport: no, tests fix (we may decide to backport later if it occurs on release branches) Closes scylladb/scylladb#29463 * github.com:scylladb/scylladb: test: filter benign errors in tests that grep logs during shutdown test: filter_errors: support list[list[str]] error groups	2026-04-15 14:40:15 +03:00
Marcin Maliszkiewicz	53b6e9fda5	Merge 'Make DESCRIBE CLUSTER get cluster information from storage_service' from Pavel Emelyanov Currently the statement returns cluster, partitioner and snitch names by accessing global db::config via database. As part of an effort to detach components from global db::config, this PR tweaks the statement handler to get the cluster information from some other source. Currently the needed cluster information is stored in different components, but they are all under the storage_service umbrella, which seems to be a good central source of truth. Unit test included. Cleaning components inter-dependencies, not backporting Closes scylladb/scylladb#29429 * github.com:scylladb/scylladb: test: Add test_describe_cluster_sanity for DESCRIBE CLUSTER validation describe_statement: Get cluster info from storage_service storage_service: Add describe_cluster() method query_processor: Expose storage_service accessor	2026-04-15 14:40:15 +03:00
Botond Dénes	4a2d032c6f	Merge 'query: result_set: change row member to a chunked vector' from Benny Halevy To prevent large memory allocations. This series shows over 3% improvement in perf-simple-query throughput. ``` $ build/release/scylla perf-simple-query --default-log-level=error --smp=1 --random-seed=1855519715 random-seed=1855519715 enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... Before: random-seed=1775976514 enable-cache=1 enable-index-cache=1 sstable-summary-ratio=0.0005 sstable-format=me Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... 336345.11 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32788 insns/op, 12430 cycles/op, 0 errors) 348748.14 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32794 insns/op, 12335 cycles/op, 0 errors) 349012.63 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32800 insns/op, 12326 cycles/op, 0 errors) 350629.97 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32770 insns/op, 12270 cycles/op, 0 errors) 348585.00 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32804 insns/op, 12338 cycles/op, 0 errors) throughput: mean= 346664.17 standard-deviation=5825.77 median= 348748.14 median-absolute-deviation=2348.46 maximum=350629.97 minimum=336345.11 instructions_per_op: mean= 32791.35 standard-deviation=13.60 median= 32794.47 median-absolute-deviation=8.65 maximum=32804.45 minimum=32769.57 cpu_cycles_per_op: mean= 12340.05 standard-deviation=57.57 median= 12335.05 median-absolute-deviation=13.94 maximum=12430.42 minimum=12270.28 After: random-seed=1775976514 enable-cache=1 enable-index-cache=1 sstable-summary-ratio=0.0005 sstable-format=me Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no} Disabling auto 
compaction Creating 10000 partitions... 353770.85 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32762 insns/op, 11893 cycles/op, 0 errors) 364447.98 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32738 insns/op, 11818 cycles/op, 0 errors) 365268.97 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32734 insns/op, 11788 cycles/op, 0 errors) 344304.87 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32746 insns/op, 12506 cycles/op, 0 errors) 362263.57 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32756 insns/op, 11888 cycles/op, 0 errors) throughput: mean= 358011.25 standard-deviation=8916.76 median= 362263.57 median-absolute-deviation=6436.74 maximum=365268.97 minimum=344304.87 instructions_per_op: mean= 32747.06 standard-deviation=11.85 median= 32745.80 median-absolute-deviation=9.36 maximum=32762.18 minimum=32734.01 cpu_cycles_per_op: mean= 11978.65 standard-deviation=298.06 median= 11887.96 median-absolute-deviation=160.96 maximum=12505.72 minimum=11788.49 ``` Refs #28511 (Refs rather than Fixes for the lack of a reproducer unit test) * No backport needed as the issue is rare and not severe Closes scylladb/scylladb#28631 * github.com:scylladb/scylladb: query: result_set: change row member to a chunked vector query: result_set_row: make noexcept query: non_null_data_value: assert is_nothrow_move_constructible and assignable types: data_value: assert is_nothrow_move_constructible and assignable	2026-04-15 14:40:15 +03:00
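The "over 3% improvement" claim can be checked against the summary statistics quoted in this message. This is a small arithmetic sketch using only the reported mean and median values; the helper name `pct_change` is illustrative.

```python
# Summary statistics copied from the perf-simple-query output above.
before = {"tps_mean": 346664.17, "tps_median": 348748.14, "cycles_mean": 12340.05}
after = {"tps_mean": 358011.25, "tps_median": 362263.57, "cycles_mean": 11978.65}

def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100

mean_gain = pct_change(before["tps_mean"], after["tps_mean"])
median_gain = pct_change(before["tps_median"], after["tps_median"])
cycles_change = pct_change(before["cycles_mean"], after["cycles_mean"])

print(f"mean throughput:   {mean_gain:+.2f}%")
print(f"median throughput: {median_gain:+.2f}%")
print(f"cycles per op:     {cycles_change:+.2f}%")
```

Note that the "After" run has a larger standard deviation (8916.76 vs. 5825.77 tps), so the median comparison is a useful cross-check on the mean.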
Nadav Har'El	1eb8d170dd	Merge 'vector_index: allow recreating vector indexes on the same column' from Dawid Pawlik This series allows creating multiple vector indexes on the same column so users can rebuild an index without losing query availability. The intended flow is: 1. Create a new vector index on a column that already has one. 2. Keep serving ANN queries from the old index while the new one is being built. 3. Verify the new index is ready. 4. Automatically switch to the remaining index. 5. Drop the old index. To make that deterministic, `index_version` is changed from the base table schema version to a real creation timeuuid. When multiple vector indexes exist on the same column, ANN query planning now picks the index according to the routing implemented in Vector Store (newest serving index). This keeps queries on the old index until the new one is up and ready. This patch also removes the create-time restriction that rejected a second vector index on the same column. Name collisions are still rejected as before. Test coverage is updated accordingly: - Scylla now verifies that two vector indexes can coexist on the same column. - Cassandra/SAI behavior is still covered and is still expected to reject duplicate indexes on the same column. Fixes: VECTOR-610 Closes scylladb/scylladb#29407 * github.com:scylladb/scylladb: docs: document vector index metadata and duplicate handling test/cqlpy: cover vector index duplicate creation rules vector_index: allow multiple named indexes on one column vector_index: store `index_version` as creation timeuuid	2026-04-15 14:40:15 +03:00
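The "newest serving index" rule described in this message can be illustrated with a minimal sketch. The `VectorIndex` type, its `created_at` field (an integer stand-in for the timestamp carried by the `index_version` timeuuid) and the `pick_index` helper are all hypothetical; the actual routing lives in Vector Store and is not shown in the source.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass(frozen=True)
class VectorIndex:
    name: str
    created_at: int  # hypothetical stand-in for the creation timeuuid's timestamp
    serving: bool    # hypothetical flag: True once the index is built and ready

def pick_index(indexes: Iterable[VectorIndex]) -> Optional[VectorIndex]:
    """Return the most recently created index that is already serving, if any."""
    serving = [idx for idx in indexes if idx.serving]
    return max(serving, key=lambda idx: idx.created_at, default=None)

old = VectorIndex("idx_old", created_at=100, serving=True)
building = VectorIndex("idx_new", created_at=200, serving=False)
ready = VectorIndex("idx_new", created_at=200, serving=True)

print(pick_index([old, building]).name)  # queries stay on the old index
print(pick_index([old, ready]).name)     # queries switch once the new one serves
```

Ordering by creation time rather than by base table schema version is what makes the choice deterministic when two indexes share a column.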
Botond Dénes	5891efc2ca	Merge 'service: add missing replicas if tablet rebuild was rolled back' from Aleksandra Martyniuk An RF change of a tablet keyspace starts tablet rebuilds. Even if any of the rebuilds is rolled back (because the pending replica was excluded), the RF change request finishes successfully. In this case we end up with a replica state that isn't compatible with the expected keyspace replication. Modify the topology coordinator so that, when it would otherwise be idle, it checks whether any replicas are missing. If so, it moves to transition_state::tablet_migration and runs the required rebuilds. If a new RF change request encounters an invalid replica state, it fails. The state will be fixed later, after which the equivalent ALTER KEYSPACE statement will be allowed. Fixes: SCYLLADB-109. Requires backport to all versions with tablet keyspace rf change. Closes scylladb/scylladb#28709 * github.com:scylladb/scylladb: test: add test_failed_tablet_rebuild_is_retried_on_alter test: add a test to ensure that failed rebuilds are retried service: fail ALTER KEYSPACE if replicas do not satisfy the replication service: retry failed tablet rebuilds service: maybe_start_tablet_migration returns std::optional<group0_guard>	2026-04-15 14:40:15 +03:00
Pavel Emelyanov	a428472e50	db: Remove redundant enable_logstor config option The enable_logstor configuration option is redundant with the 'logstor' experimental feature flag. Consolidate to a single gate: use the experimental feature to control both whether logstor is available for table creation and whether it is initialized at database startup. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#29427	2026-04-15 14:40:15 +03:00
Botond Dénes	87eb20ba33	Merge 'cql: Include parallelized queries in the scylla_cql_select_partition_range_scan_no_bypass_cache metric' from Tomasz Grabiec This metric is used to catch execution of scans which go via row cache, which can have bad effect on performance. Since `f344bd0aaa`, aggregate queries go via new statement class: parallelized_select_statement. This class inherits from select_statement directly rather than from primary_key_select_statement. The range scan detection logic (_range_scan, _range_scan_no_bypass_cache) was only in primary_key_select_statement's constructor, so parallelized queries were not counted in select_partition_range_scan and select_partition_range_scan_no_bypass_cache metrics. Fix by moving the range scan detection into select_statement's constructor, so that all subclasses get it. No backport: enhancement Closes scylladb/scylladb#29422 * github.com:scylladb/scylladb: cql: Include parallelized queries in the scylla_cql_select_partition_range_scan_no_bypass_cache metric test: cluster: dtest: Fix double-counting of metrics	2026-04-15 14:40:15 +03:00
Botond Dénes	aecb6b1d76	Merge 'auth: sanitize {USER} substitution in LDAP URL template' from Piotr Smaron `LDAPRoleManager` interpolated usernames directly into `ldap_url_template`, allowing LDAP filter injection and URL structure manipulation via crafted usernames. This PR adds two layers of encoding when substituting `{USER}`: 1. RFC 4515 filter escaping: neutralises `*`, `(`, `)`, `\`, NUL 2. URL percent-encoding: prevents `%`, `?`, `#` from breaking `ldap_url_parse`'s component splitting or undoing the filter escaping It also adds `validate_query_template()` at startup to reject templates that place `{USER}` outside the filter component (e.g. in the host or base DN), where filter escaping would be the wrong defense. Fixes: SCYLLADB-1309 Compatibility note: Templates with `{USER}` in the host, base DN, attributes, or extensions were previously silently accepted. They are now rejected at startup with a descriptive error. Only templates with `{USER}` in the filter component (after the third `?`) are valid. Due to its severity, should be backported to all maintained versions. Closes scylladb/scylladb#29388 * github.com:scylladb/scylladb: auth: sanitize {USER} substitution in LDAP URL templates test/ldap: add LDAP filter-injection reproducers	2026-04-15 14:40:15 +03:00
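The two encoding layers and the template check described in this message can be sketched as follows. This is an illustration only, not the code from the PR: the function names `escape_filter_value`, `validate_template` and `substitute_user` are hypothetical, and the validation assumes the template can be split naively on `?` into the LDAP URL components (host/base DN, attributes, scope, filter, extensions).

```python
from urllib.parse import quote

# RFC 4515 escapes for the characters that are special inside a filter value.
_FILTER_ESCAPES = {
    "\\": r"\5c",
    "*": r"\2a",
    "(": r"\28",
    ")": r"\29",
    "\x00": r"\00",
}

def escape_filter_value(value: str) -> str:
    """Layer 1: escape filter metacharacters one character at a time."""
    return "".join(_FILTER_ESCAPES.get(ch, ch) for ch in value)

def validate_template(template: str) -> None:
    """Reject templates with {USER} anywhere but the filter (after the third '?')."""
    for index, component in enumerate(template.split("?")):
        if "{USER}" in component and index != 3:
            raise ValueError("{USER} is only allowed in the filter component")

def substitute_user(template: str, username: str) -> str:
    """Layer 2: percent-encode the escaped value, then substitute it."""
    validate_template(template)
    encoded = quote(escape_filter_value(username), safe="")
    return template.replace("{USER}", encoded)

template = "ldap://host/dc=example?cn?sub?(uid={USER})"
print(substitute_user(template, "alice"))
print(substitute_user(template, "*)(uid=*"))
```

Escaping per character avoids double-escaping the backslashes that the filter layer itself introduces, and percent-encoding afterwards keeps `%`, `?` and `#` in a username from being reinterpreted when the URL is parsed.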
Artsiom Mishuta	146a67cf6f	test: explicitly wait for schema agreement in create_new_test_keyspace Add an explicit wait_for_schema_agreement() call after CREATE KEYSPACE in create_new_test_keyspace to ensure all nodes have applied the schema before proceeding. Closes scylladb/scylladb#29371	2026-04-15 14:40:15 +03:00
Pavel Emelyanov	54e3c648a5	test/cluster/dtest: improve diagnostics in test_update_schema_while_node_is_killed The alter_table case has a known failure where point lookups at QUORUM return 0 rows after node2 restarts, even though: - the schema was correctly synced (ALTER TABLE received from cluster) - the data commitlog was replayed (21 mutations, 0 skipped) - all 3 nodes were alive, so QUORUM (2/3) should be satisfiable by node1+node3 regardless of node2's state The LIMIT 1 table scan succeeds (data is present somewhere), but specific key lookups return empty. This points to a bug in how node2, acting as coordinator after restart, routes single-partition reads — most likely stale tablet routing metadata. Add diagnostics to help distinguish data loss from a coordinator/routing bug on the next failure: - log which key is missing - dump all rows visible at QUORUM - query each node individually at ONE consistency for the missing key Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Closes scylladb/scylladb#29350	2026-04-15 14:40:15 +03:00
Piotr Szymaniak	4c93c2af62	audit/alternator: support audit_tables=alternator.<table> shorthand The real keyspace name of an Alternator table T is "alternator_T". Expand the "alternator.T" format used in the audit_tables config flag to the real keyspace name at parse time, so users don't need to spell out the internal "alternator_T.T" form.	2026-04-15 12:29:15 +02:00
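The shorthand expansion described in this message can be sketched as below. The function names are hypothetical, and the sketch assumes `audit_tables` is a comma-separated list of `keyspace.table` entries; it is not the parser from the patch.

```python
def expand_audit_table(entry: str) -> str:
    """Expand 'alternator.T' to the real 'alternator_T.T'; leave other entries alone."""
    keyspace, dot, table = entry.partition(".")
    if dot and keyspace == "alternator" and table:
        return f"alternator_{table}.{table}"
    return entry

def parse_audit_tables(value: str) -> list:
    """Split a comma-separated audit_tables value and expand each entry."""
    return [expand_audit_table(e.strip()) for e in value.split(",") if e.strip()]

print(parse_audit_tables("alternator.T, ks.tbl"))
```

Doing the expansion once at parse time means the rest of the audit code only ever sees real keyspace names.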
Piotr Szymaniak	0714d8aded	audit/alternator: Add negative audit tests Add tests for the unhappy path of Alternator audit logging: - Category filtering: operations are not logged when their category (DML, QUERY, DDL) is excluded from audit_categories. - Keyspace filtering: operations on a keyspace not listed in audit_keyspaces are not logged. - Error entries: a failed operation (thrown exception after audit_info is set) produces an audit entry with error=true. - Empty-keyspace bypass: global operations like ListTables and DescribeEndpoints are logged regardless of audit_keyspaces because should_log() short-circuits on an empty keyspace.	2026-04-15 12:29:15 +02:00
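The filtering rules these tests exercise can be summarized in a small sketch. The signature below is hypothetical and deliberately ignores `audit_tables`; it assumes the category check runs before the empty-keyspace short-circuit, which the message does not state explicitly.

```python
def should_log(category: str, keyspace: str,
               audit_categories: set, audit_keyspaces: set) -> bool:
    """Hypothetical model of the audit decision described above."""
    if category not in audit_categories:
        return False   # category filtering
    if not keyspace:
        return True    # empty-keyspace bypass for global operations
    return keyspace in audit_keyspaces  # keyspace filtering

# A global operation such as ListTables carries no keyspace.
print(should_log("QUERY", "", {"QUERY"}, set()))
# An operation on a keyspace that is not listed is skipped.
print(should_log("DML", "alternator_other", {"DML"}, {"alternator_T"}))
```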
Piotr Szymaniak	ad05b44931	audit/alternator: Add testing of auditing Add a new test file, `test/alternator/test_audit.py`, containing a suite of tests for all auditing operations.	2026-04-15 12:29:15 +02:00
Tomasz Grabiec	84361194c2	test: boost: tablets: Add test for merge with arbitrary tablet count	2026-04-15 10:40:56 +02:00
Tomasz Grabiec	7af9f5366d	tablets, database: Advertise 'arbitrary' layout in snapshot manifest Currently, the manifest advertises "powof2", which is wrong for arbitrary count and boundaries. Introduce a new kind of layout called "arbitrary", and produce it if the tablet map doesn't conform to "powof2" layout. We should also produce tablet boundaries in this case, but that's worked on in a different PR: https://github.com/scylladb/scylladb/pull/28525	2026-04-15 10:40:56 +02:00
Tomasz Grabiec	b6a7023f68	tablets: Prepare for non-power-of-two tablet count This is a step towards more flexibility in managing tablets. It is a prerequisite for splitting individual tablets, isolating hot partitions, and evening out tablet sizes by shifting boundaries. After this patch, the system can handle tables with arbitrary tablet count. Tablet allocator is still rounding up desired tablet count to the nearest power of two when allocating tablets for a new table, so unless the tablet map is allocated in some other way, the counts will still be a power of two. We plan to utilize arbitrary count when migrating from vnodes to tablets, by creating a tablet map which matches vnode boundaries. One of the reasons we don't give up on power-of-two by default yet is that it creates an issue with merges. If tablet count is odd, one of the tablets doesn't have a sibling and will not be merged. That can obviously cause imbalance of token space and tablet sizes between tablets. To limit the impact, this patch dynamically chooses which tablet to isolate when initiating a merge. The largest tablet is chosen, as that will minimize imbalance. Otherwise, if we always chose the last tablet to isolate, its size would remain the same while other tablets double in size with each odd-count merge, leading to imbalance. The imbalance will still be there, but the difference in tablet sizes is limited to 2x. Example (3 tablets): [0] owns 1/3 of tokens; [1] owns 1/3 of tokens; [2] owns 1/3 of tokens. After merge: [0] owns 2/3 of tokens; [1] owns 1/3 of tokens. What we would like instead: Step 1 (split [1]): [0] owns 1/3 of tokens; [1] old 1.left, owns 1/6 of tokens; [2] old 1.right, owns 1/6 of tokens; [3] owns 1/3 of tokens. Step 2 (merge): [0] owns 1/2 of tokens; [1] owns 1/2 of tokens. To do that, we need to be able to split individual tablets, but we're not there yet.	2026-04-15 10:40:55 +02:00
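The 3-tablet example in this message can be reproduced with exact fractions. This is a toy model of token ownership only: the `merge` and `split` helpers are hypothetical, tablets are merged strictly in adjacent pairs, and none of this reflects how the tablet map is actually implemented.

```python
from fractions import Fraction

def merge(shares, isolated=None):
    """Merge adjacent tablets pairwise, leaving the tablet at `isolated` untouched."""
    merged = []
    i = 0
    while i < len(shares):
        if i == isolated:
            merged.append(shares[i])
            i += 1
            continue
        if i + 1 >= len(shares) or i + 1 == isolated:
            raise ValueError(f"tablet {i} has no sibling to merge with")
        merged.append(shares[i] + shares[i + 1])
        i += 2
    return merged

def split(shares, index):
    """Split the tablet at `index` into two halves of equal token ownership."""
    half = shares[index] / 2
    return shares[:index] + [half, half] + shares[index + 1:]

third = Fraction(1, 3)
tablets = [third, third, third]

# Odd-count merge with one tablet isolated: ownership becomes 2/3 and 1/3.
print(merge(tablets, isolated=2))

# The desired alternative: split tablet [1] first, then merge evenly.
print(split(tablets, 1))
print(merge(split(tablets, 1)))
```

The second path ends with two tablets owning 1/2 each, which is why the ability to split individual tablets is described as the missing piece.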
Tomasz Grabiec	66fc7967b8	tablets: Prepare resize_decision to hold data in decisions merge decision will carry a plan - which replica to isolate. So construction from a string will no longer do.	2026-04-15 10:40:55 +02:00
Nadav Har'El	022add117e	test/cluster: fix flaky test test_row_ttl_scheduling_group The test test/cluster/test_ttl_row.py::test_row_ttl_scheduling_group wants to verify that the new CQL per-row TTL feature does all its work (expiration scanning, deletion of expired items) on all nodes in the "streaming" scheduling group, not in the statement scheduling group. As originally written, the test couldn't require that it uses exactly zero time in the statement scheduling group - because some things do happen there - specifically the ALTER TABLE request we use to enable TTL. So the test checked that the time in the "wrong" group is less than 0.2 of the total time, not zero. But in one CI run, we got to exactly 0.2 and the test failed. Running this test locally, I see the margin is pretty narrow: The test almost always fails if I set the threshold ratio to 0.1. The solution in this patch is to move the ALTER TABLE work to a different scheduling group (by using an additional service level). After doing that the CPU usage in sl:default goes down to exactly zero - not close to zero but exactly zero. However, it seems that there is always some rare background work in sl:default and in debug builds it can come out more than 0ms (e.g., in one test we saw 1ms), so we keep checking that sl:default is much lower than sl:stream - not exactly zero. Incidentally, I converted the serial loop adding the 200 rows in the test's setup to a parallel loop, to make the test setup slightly faster. I also added to the test a sanity check that the scheduling group sl:default that we are measuring that TTL does zero work in, is actually the scheduling group that normal writes work in (to avoid the risk of having a test that verifies that some irrelevant scheduling group is unsurprisingly getting zero usage...). Fixes SCYLLADB-1495. Closes scylladb/scylladb#29447	2026-04-15 08:42:29 +03:00

11487 Commits