scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-04 14:03:06 +00:00

Author	SHA1	Message	Date
Nadav Har'El	0afc730b7b	alternator: reject empty attribute names Alternator has a function validate_attr_name_length() used to validate an attribute name passed in different operations like PutItem, UpdateItem, GetItem, etc. It fails the request if the attribute name is longer than 65535 characters. It turns out that we forgot to check if the attribute name length isn’t 0 - which should be forbidden as well! This patch fixes the validation code, and also adds a test that confirms that after this patch empty attribute names are rejected - just like DynamoDB does - whereas before this patch they were silently accepted. We want to fix this issue now, because in a later patch we intend to use the same validation function also for vector indexes - and want it to be accurate. Fixes SCYLLADB-1069. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-04-16 13:28:15 +03:00
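The rule described in the commit above (reject names that are empty or longer than 65535) can be sketched as follows. This is a minimal Python illustration, not Alternator's C++ code: the function name mirrors `validate_attr_name_length()`, but the signature, the exception type, and counting the length in UTF-8 bytes are assumptions of this sketch.

```python
MAX_ATTR_NAME_LENGTH = 65535  # upper limit quoted in the commit message


def validate_attr_name_length(name: str) -> None:
    """Raise ValueError for an empty or over-long attribute name (hypothetical sketch)."""
    size = len(name.encode("utf-8"))  # assumption: length is counted in UTF-8 bytes
    if size == 0:
        # The case the commit says was previously accepted silently.
        raise ValueError("attribute name must not be empty")
    if size > MAX_ATTR_NAME_LENGTH:
        raise ValueError(f"attribute name is longer than {MAX_ATTR_NAME_LENGTH}")
```

Both bounds live in one helper, so any operation that reuses it gets the same behaviour.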
Nadav Har'El	f0e9177130	Merge 'audit/alternator: Make Alternator requests audited' from Piotr Szymaniak Each Alternator API call results in the request being audited, provided the auditing is enabled. Both successful as well as the failed requests are audited, with few exceptions. The chosen audit types for the operations: - CreateTable - DDL - DescribeTable - QUERY - DeleteTable - DDL - UpdateTable - DDL - PutItem - DML - UpdateItem - DML - GetItem - QUERY - DeleteItem - DML - ListTables - QUERY - Scan - QUERY - DescribeEndpoints - QUERY - BatchWriteItem - DML - BatchGetItem - QUERY - Query - QUERY - TagResource - DDL - UntagResource - DDL - ListTagsOfResource - QUERY - UpdateTimeToLive - DDL - DescribeTimeToLive - QUERY - ListStreams - QUERY - DescribeStream - QUERY - GetShardIterator - QUERY - GetRecords - QUERY - DescribeContinuousBackups - QUERY FIXME: The tests are now covering the new functionality only partially. Fixes: scylladb/scylla-enterprise#3796 Fixes: SCYLLADB-467 No need to backport, new functionality. Closes scylladb/scylladb#27953 * github.com:scylladb/scylladb: audit/alternator: support audit_tables=alternator.<table> shorthand audit/alternator: Add negative audit tests audit/alternator: Add testing of auditing audit/alternator: Audit requests audit/alternator: Refactor in preparation for auditing Alternator	2026-04-15 22:17:57 +03:00
Nikos Dragazis	d38f44208a	test/cqlpy: Harden mutation_fragments tests against background flushes Several tests in test_select_from_mutation_fragments.py assume that all mutations end up in a single SSTable. This assumption can be violated by background memtable flushes triggered by commitlog disk pressure. Since the Scylla node is taken from a pool, it may carry unflushed data from prior tests that prevents closed segments from being recycled, thereby increasing the commitlog disk usage. A main source of such pressure is keyspace-level flushes from earlier tests in this module, which rotate commitlog segments without flushing system tables (e.g., `system.compaction_history`), leaving closed segments dirty. Additionally, prior tests in the same module may have left unflushed data on the shared test table (`test_table` fixture), keeping commitlog segments dirty on its behalf as well. When commitlog disk usage exceeds its threshold, the system flushes the test table to reclaim those segments, potentially splitting a running test's mutations across multiple SSTables. This was observed in CI, where test_paging failed because its data was split across two SSTables, resulting in more mutation fragments than the hardcoded expected count. This patch fixes the affected tests in two ways: 1. Where possible, tests are reworked to not assume a single SSTable: - test_paging - test_slicing_rows - test_many_partition_scan 2. Where rework is impractical, major compaction is added after writes and before validation to ensure that only one SSTable will exist: - test_smoke - test_count - test_metadata_and_value - test_slicing_range_tombstone_changes Fixes SCYLLADB-1375. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#29389	2026-04-15 21:46:00 +03:00
Avi Kivity	59ec93b86b	Merge 'Allow arbitrary tablet boundaries and count' from Tomasz Grabiec There are several reasons we want to do that. One is that it will give us more flexibility in distributing the load. We can subdivide tablets at any token, and achieve more evenly-sized tablets. In particular, we can isolate large partitions into separate tablets. We can also split and merge incrementally individual tablets. Currently, we do it for the whole table or nothing, which makes splits and merges take longer and cause wide swings of the count. This is not implemented in this PR yet, we still split/merge the whole table. Another reason is vnode to tablets migration. We now could construct a tablet map which matches exactly the vnode boundaries, so migration can happen transparently from CQL-coordinator point of view. Tablet count is still a power-of-two by default for newly created tables. It may be different if tablet map is created by non-standard means, or if per-table tablet option "pow2_count" is set to "false". 
build/release/scylla perf-tablets: Memory footprint for 131k tablets increased from 56 MiB to 58.1 MiB (+3.5%) Before: ``` Generating tablet metadata Total tablet count: 131072 Size of tablet_metadata in memory: 57456 KiB Copied in 0.014346 [ms] Cleared in 0.002698 [ms] Saved in 1234.685303 [ms] Read in 445.577881 [ms] Read mutations in 299.596313 [ms] 128 mutations Read required hosts in 247.482742 [ms] Size of canonical mutations: 33.945053 [MiB] Disk space used by system.tablets: 1.456761 [MiB] Tablet metadata reload: full 407.69ms partial 2.65ms ``` After: ``` Generating tablet metadata Total tablet count: 131072 Size of tablet_metadata in memory: 59504 KiB Copied in 0.032475 [ms] Cleared in 0.002965 [ms] Saved in 1093.877441 [ms] Read in 387.027100 [ms] Read mutations in 255.752121 [ms] 128 mutations Read required hosts in 211.202805 [ms] Size of canonical mutations: 33.954453 [MiB] Disk space used by system.tablets: 1.450162 [MiB] Tablet metadata reload: full 354.50ms partial 2.19ms ``` Closes scylladb/scylladb#28459 * github.com:scylladb/scylladb: test: boost: tablets: Add test for merge with arbitrary tablet count tablets, database: Advertise 'arbitrary' layout in snapshot manifest tablets: Introduce pow2_count per-table tablet option tablets: Prepare for non-power-of-two tablet count tablets: Implement merged tablet_map constructor on top of for_each_sibling_tablets() tablets: Prepare resize_decision to hold data in decisions tablets: table: Make storage_group handle arbitrary merge boundaries tablets: Make stats update post-merge work with arbitrary merge boundaries locator: tablets: Support arbitrary tablet boundaries locator: tablets: Introduce tablet_map::get_split_token() dht: Introduce get_uniform_tokens()	2026-04-15 18:57:22 +03:00
Andrzej Jackowski	78926d9c96	test/random_failures: remove gossip shadow round injection Commit `c17c4806a1` removed check_for_endpoint_collision() from the fresh bootstrap path, which was the only code path that called do_shadow_round() for new nodes. Since the gossip shadow round is no longer executed during bootstrap, remove the stop_during_gossip_shadow_round error injection from the test. The entry is marked as REMOVED_ rather than deleted to preserve the shuffle order for seed-based test reproducibility. The injection point in gms/gossiper.cc is also removed since it is no longer used by any test. Fixes: SCYLLADB-1466 Closes scylladb/scylladb#29460	2026-04-15 16:30:55 +02:00
Asias He	4137a4229c	test: Stabilize tablet incremental repair error test Use async tablet repair task flow to avoid a race where client timeout returns while server-side repair continues after injections are disabled. Start repair with await_completion=false, assert it does not complete within timeout under injection, abort/wait the task, then verify sstables_repaired_at is unchanged. Fixes SCYLLADB-1184 Closes scylladb/scylladb#29452	2026-04-15 16:24:43 +03:00
Botond Dénes	00d8470554	Merge 'test: filter benign shutdown errors in tests that grep logs directly' from Marcin Maliszkiewicz Tests that call grep_for_errors() directly and assert no errors can fail spuriously due to benign RPC errors during graceful shutdown (e.g. "connection dropped: Semaphore broken"), which are already filtered by the after_test hook via filter_errors(). Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1464 Backport: no, tests fix (we may decide to backport later if it occurs on release branches) Closes scylladb/scylladb#29463 * github.com:scylladb/scylladb: test: filter benign errors in tests that grep logs during shutdown test: filter_errors: support list[list[str]] error groups	2026-04-15 14:40:15 +03:00
Marcin Maliszkiewicz	53b6e9fda5	Merge 'Make DESCRIBE CLUSTER get cluster information from storage_service' from Pavel Emelyanov Currently the statement returns cluster, partitioner and snitch names by accessing global db::config via database. As the part of an effort to detach components from global db::config, this PR tweaks the statement handler to get the cluster information from some other source. Currently the needed cluster information is stored in different components, but they are all under storage_service umbrella which seems to be a good central source of this truth. Unit test included. Cleaning components inter-dependencies, not backporting Closes scylladb/scylladb#29429 * github.com:scylladb/scylladb: test: Add test_describe_cluster_sanity for DESCRIBE CLUSTER validation describe_statement: Get cluster info from storage_service storage_service: Add describe_cluster() method query_processor: Expose storage_service accessor	2026-04-15 14:40:15 +03:00
Botond Dénes	4a2d032c6f	Merge 'query: result_set: change row member to a chunked vector' from Benny Halevy To prevent large memory allocations. This series shows over 3% improvement in perf-simple-query throughput. ``` $ build/release/scylla perf-simple-query --default-log-level=error --smp=1 --random-seed=1855519715 random-seed=1855519715 enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... Before: random-seed=1775976514 enable-cache=1 enable-index-cache=1 sstable-summary-ratio=0.0005 sstable-format=me Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... 336345.11 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32788 insns/op, 12430 cycles/op, 0 errors) 348748.14 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32794 insns/op, 12335 cycles/op, 0 errors) 349012.63 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32800 insns/op, 12326 cycles/op, 0 errors) 350629.97 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32770 insns/op, 12270 cycles/op, 0 errors) 348585.00 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32804 insns/op, 12338 cycles/op, 0 errors) throughput: mean= 346664.17 standard-deviation=5825.77 median= 348748.14 median-absolute-deviation=2348.46 maximum=350629.97 minimum=336345.11 instructions_per_op: mean= 32791.35 standard-deviation=13.60 median= 32794.47 median-absolute-deviation=8.65 maximum=32804.45 minimum=32769.57 cpu_cycles_per_op: mean= 12340.05 standard-deviation=57.57 median= 12335.05 median-absolute-deviation=13.94 maximum=12430.42 minimum=12270.28 After: random-seed=1775976514 enable-cache=1 enable-index-cache=1 sstable-summary-ratio=0.0005 sstable-format=me Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no} Disabling auto 
compaction Creating 10000 partitions... 353770.85 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32762 insns/op, 11893 cycles/op, 0 errors) 364447.98 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32738 insns/op, 11818 cycles/op, 0 errors) 365268.97 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32734 insns/op, 11788 cycles/op, 0 errors) 344304.87 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32746 insns/op, 12506 cycles/op, 0 errors) 362263.57 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32756 insns/op, 11888 cycles/op, 0 errors) throughput: mean= 358011.25 standard-deviation=8916.76 median= 362263.57 median-absolute-deviation=6436.74 maximum=365268.97 minimum=344304.87 instructions_per_op: mean= 32747.06 standard-deviation=11.85 median= 32745.80 median-absolute-deviation=9.36 maximum=32762.18 minimum=32734.01 cpu_cycles_per_op: mean= 11978.65 standard-deviation=298.06 median= 11887.96 median-absolute-deviation=160.96 maximum=12505.72 minimum=11788.49 ``` Refs #28511 (Refs rather than Fixes for the lack of a reproducer unit test) * No backport needed as the issue is rare and not severe Closes scylladb/scylladb#28631 * github.com:scylladb/scylladb: query: result_set: change row member to a chunked vector query: result_set_row: make noexcept query: non_null_data_value: assert is_nothrow_move_constructible and assignable types: data_value: assert is_nothrow_move_constructible and assignable	2026-04-15 14:40:15 +03:00
Nadav Har'El	1eb8d170dd	Merge 'vector_index: allow recreating vector indexes on the same column' from Dawid Pawlik This series allows creating multiple vector indexes on the same column so users can rebuild an index without losing query availability. The intended flow is: 1. Create a new vector index on a column that already has one. 2. Keep serving ANN queries from the old index while the new one is being built. 3. Verify the new index is ready. 4. Automatically switch to the remaining index. 5. Drop the old index. To make that deterministic, `index_version` is changed from the base table schema version to a real creation timeuuid. When multiple vector indexes exist on the same column, ANN query planning now picks the index according to the routing implemented in Vector Store (newest serving index). This keeps queries on the old index until the new one is up and ready. This patch also removes the create-time restriction that rejected a second vector index on the same column. Name collisions are still rejected as before. Test coverage is updated accordingly: - Scylla now verifies that two vector indexes can coexist on the same column. - Cassandra/SAI behavior is still covered and is still expected to reject duplicate indexes on the same column. Fixes: VECTOR-610 Closes scylladb/scylladb#29407 * github.com:scylladb/scylladb: docs: document vector index metadata and duplicate handling test/cqlpy: cover vector index duplicate creation rules vector_index: allow multiple named indexes on one column vector_index: store `index_version` as creation timeuuid	2026-04-15 14:40:15 +03:00
Botond Dénes	5891efc2ca	Merge 'service: add missing replicas if tablet rebuild was rolled back' from Aleksandra Martyniuk RF change of tablet keyspace starts tablet rebuilds. Even if any of the rebuilds is rolled back (because pending replica was excluded), rf change request finishes successfully. In this case we end up with the state of the replicas that isn't compatible with the expected keyspace replication. Modify topology coordinator so that if it were to be idle, it starts checking if there are any missing replicas. It moves to transition_state::tablet_migration and run required rebuilds. If a new RF change request encounters invalid state of replicas it fails. The state will be fixed later and the analogical ALTER KEYSPACE statement will be allowed. Fixes: SCYLLADB-109. Requires backport to all versions with tablet keyspace rf change. Closes scylladb/scylladb#28709 * github.com:scylladb/scylladb: test: add test_failed_tablet_rebuild_is_retried_on_alter test: add a test to ensure that failed rebuilds are retried service: fail ALTER KEYSPACE if replicas do not satisfy the replication service: retry failed tablet rebuilds service: maybe_start_tablet_migration returns std::optional<group0_guard>	2026-04-15 14:40:15 +03:00
Pavel Emelyanov	a428472e50	db: Remove redundant enable_logstor config option The enable_logstor configuration option is redundant with the 'logstor' experimental feature flag. Consolidate to a single gate: use the experimental feature to control both whether logstor is available for table creation and whether it is initialized at database startup. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#29427	2026-04-15 14:40:15 +03:00
Botond Dénes	87eb20ba33	Merge 'cql: Include parallelized queries in the scylla_cql_select_partition_range_scan_no_bypass_cache metric' from Tomasz Grabiec This metric is used to catch execution of scans which go via row cache, which can have bad effect on performance. Since `f344bd0aaa`, aggregate queries go via new statement class: parallelized_select_statement. This class inherits from select_statement directly rather than from primary_key_select_statement. The range scan detection logic (_range_scan, _range_scan_no_bypass_cache) was only in primary_key_select_statement's constructor, so parallelized queries were not counted in select_partition_range_scan and select_partition_range_scan_no_bypass_cache metrics. Fix by moving the range scan detection into select_statement's constructor, so that all subclasses get it. No backport: enhancement Closes scylladb/scylladb#29422 * github.com:scylladb/scylladb: cql: Include parallelized queries in the scylla_cql_select_partition_range_scan_no_bypass_cache metric test: cluster: dtest: Fix double-counting of metrics	2026-04-15 14:40:15 +03:00
Botond Dénes	aecb6b1d76	Merge 'auth: sanitize {USER} substitution in LDAP URL template' from Piotr Smaron `LDAPRoleManager` interpolated usernames directly into `ldap_url_template`, allowing LDAP filter injection and URL structure manipulation via crafted usernames. This PR adds two layers of encoding when substituting `{USER}`: 1. RFC 4515 filter escaping: neutralises `*`, `(`, `)`, `\`, NUL 2. URL percent-encoding: prevents `%`, `?`, `#` from breaking `ldap_url_parse`'s component splitting or undoing the filter escaping It also adds `validate_query_template()` at startup to reject templates that place `{USER}` outside the filter component (e.g. in the host or base DN), where filter escaping would be the wrong defense. Fixes: SCYLLADB-1309 Compatibility note: Templates with `{USER}` in the host, base DN, attributes, or extensions were previously silently accepted. They are now rejected at startup with a descriptive error. Only templates with `{USER}` in the filter component (after the third `?`) are valid. Fixes: SCYLLADB-1309 Due to severeness, should be backported to all maintained versions. Closes scylladb/scylladb#29388 * github.com:scylladb/scylladb: auth: sanitize {USER} substitution in LDAP URL templates test/ldap: add LDAP filter-injection reproducers	2026-04-15 14:40:15 +03:00
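The two encoding layers named in the commit above can be illustrated with a short Python sketch. It is not the `LDAPRoleManager` code: the helper names `escape_ldap_filter_value` and `substitute_user` are hypothetical, and the sketch assumes `{USER}` appears only in the filter component of the template. The escape table follows RFC 4515 (backslash plus two hex digits for `*`, `(`, `)`, `\` and NUL), and the percent-encoding uses the standard library's `urllib.parse.quote`.

```python
from urllib.parse import quote

# RFC 4515: each special character becomes a backslash followed by two hex digits.
_FILTER_ESCAPES = {
    "\\": r"\5c",
    "*": r"\2a",
    "(": r"\28",
    ")": r"\29",
    "\x00": r"\00",
}


def escape_ldap_filter_value(value: str) -> str:
    """Layer 1: neutralise characters that have meaning inside an LDAP filter."""
    return "".join(_FILTER_ESCAPES.get(ch, ch) for ch in value)


def substitute_user(url_template: str, username: str) -> str:
    """Layer 2: percent-encode the escaped value so '%', '?' and '#' cannot
    change how the URL is split into components (hypothetical helper)."""
    encoded = quote(escape_ldap_filter_value(username), safe="")
    return url_template.replace("{USER}", encoded)
```

The order matters in this sketch: filter escaping runs first, and percent-encoding then protects the backslashes it introduced, so URL decoding yields the escaped filter value rather than the raw username.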
Artsiom Mishuta	146a67cf6f	test: explicitly wait for schema agreement in create_new_test_keyspace Add an explicit wait_for_schema_agreement() call after CREATE KEYSPACE in create_new_test_keyspace to ensure all nodes have applied the schema before proceeding. Closes scylladb/scylladb#29371	2026-04-15 14:40:15 +03:00
Pavel Emelyanov	54e3c648a5	test/cluster/dtest: improve diagnostics in test_update_schema_while_node_is_killed The alter_table case has a known failure where point lookups at QUORUM return 0 rows after node2 restarts, even though: - the schema was correctly synced (ALTER TABLE received from cluster) - the data commitlog was replayed (21 mutations, 0 skipped) - all 3 nodes were alive, so QUORUM (2/3) should be satisfiable by node1+node3 regardless of node2's state The LIMIT 1 table scan succeeds (data is present somewhere), but specific key lookups return empty. This points to a bug in how node2, acting as coordinator after restart, routes single-partition reads — most likely stale tablet routing metadata. Add diagnostics to help distinguish data loss from a coordinator/routing bug on the next failure: - log which key is missing - dump all rows visible at QUORUM - query each node individually at ONE consistency for the missing key Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Closes scylladb/scylladb#29350	2026-04-15 14:40:15 +03:00
Piotr Szymaniak	4c93c2af62	audit/alternator: support audit_tables=alternator.<table> shorthand The real keyspace name of an Alternator table T is "alternator_T". Expand the "alternator.T" format used in the audit_tables config flag to the real keyspace name at parse time, so users don't need to spell out the internal "alternator_T.T" form.	2026-04-15 12:29:15 +02:00
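The shorthand expansion described above amounts to a small string rewrite. The following Python sketch shows the idea only; `expand_audit_table` is a hypothetical name, not the function used in the Scylla source, and the real parsing of `audit_tables` is likely to handle more cases.

```python
def expand_audit_table(entry: str) -> str:
    """Expand 'alternator.T' to the real 'alternator_T.T' form (hypothetical sketch).

    Entries that do not use the 'alternator' pseudo-keyspace are returned unchanged.
    """
    keyspace, sep, table = entry.partition(".")  # split at the first dot only
    if sep and keyspace == "alternator":
        return f"alternator_{table}.{table}"
    return entry
```

For example, `expand_audit_table("alternator.T")` gives `"alternator_T.T"`, while an entry already in `keyspace.table` form passes through untouched.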
Piotr Szymaniak	0714d8aded	audit/alternator: Add negative audit tests Add tests for the unhappy path of Alternator audit logging: - Category filtering: operations are not logged when their category (DML, QUERY, DDL) is excluded from audit_categories. - Keyspace filtering: operations on a keyspace not listed in audit_keyspaces are not logged. - Error entries: a failed operation (thrown exception after audit_info is set) produces an audit entry with error=true. - Empty-keyspace bypass: global operations like ListTables and DescribeEndpoints are logged regardless of audit_keyspaces because should_log() short-circuits on an empty keyspace.	2026-04-15 12:29:15 +02:00
Piotr Szymaniak	ad05b44931	audit/alternator: Add testing of auditing There is a new test file created, `test/alternator/test_audit.py`. The file contains a suite of tests of all auditing operations.	2026-04-15 12:29:15 +02:00
Tomasz Grabiec	84361194c2	test: boost: tablets: Add test for merge with arbitrary tablet count	2026-04-15 10:40:56 +02:00
Tomasz Grabiec	7af9f5366d	tablets, database: Advertise 'arbitrary' layout in snapshot manifest Currently, the manifest advertises "powof2", which is wrong for arbitrary count and boundaries. Introduce a new kind of layout called "arbitrary", and produce it if the tablet map doesn't conform to "powof2" layout. We should also produce tablet boundaries in this case, but that's worked on in a different PR: https://github.com/scylladb/scylladb/pull/28525	2026-04-15 10:40:56 +02:00
Tomasz Grabiec	b6a7023f68	tablets: Prepare for non-power-of-two tablet count This is a step towards more flexibility in managing tablets. A prerequisite before we can split individual tablets, isolating hot partitions, and evening-out tablet sizes by shifting boundaries. After this patch, the system can handle tables with arbitrary tablet count. Tablet allocator is still rounding up desired tablet count to the nearest power of two when allocating tablets for a new table, so unless the tablet map is allocated in some other way, the counts will be still a power of two. We plan to utilize arbitrary count when migrating from vnodes to tablets, by creating a tablet map which matches vnode boundaries. One of the reasons we don't give up on power-of-two by default yet is that it creates an issue with merges. If tablet count is odd, one of the tablets doesn't have a sibling and will not be merged. That can obviously cause imbalance of token space and tablet sizes between tablets. To limit the impact, this patch dynamically chooses which tablet to isolate when initiating a merge. The largest tablet is chosen, as that will minimize imbalance. Otherwise, if we always chose the last tablet to isolate, its size would remain the same while other tablets double in size with each odd-count merge, leading to imbalance. The imbalance will still be there, but the difference in tablet sizes is limited to 2x. Example (3 tablets): [0] owns 1/3 of tokens [1] owns 1/3 of tokens [2] owns 1/3 of tokens After merge: [0] owns 2/3 of tokens [1] owns 1/3 of tokens What we would like instead: Step 1 (split [1]): [0] owns 1/3 of tokens [1] old 1.left, owns 1/6 of tokens [2] old 1.right, owns 1/6 of tokens [3] owns 1/3 of tokens Step 2 (merge): [0] owns 1/2 of tokens [1] owns 1/2 of tokens To do that, we need to be able to split individual tablets, but we're not there yet.	2026-04-15 10:40:55 +02:00
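The 3-tablet example in the commit above can be checked with exact fractions. This is a Python sketch of the arithmetic only: `merge_siblings` and `split_tablet` are hypothetical helpers, and the merge here always leaves the last tablet unpaired, whereas the patch chooses the largest tablet to isolate (the two coincide in the example because all tablets are equal).

```python
from fractions import Fraction


def merge_siblings(owned):
    """Merge adjacent pairs of tablets; with an odd count the last one has no sibling."""
    merged = [owned[i] + owned[i + 1] for i in range(0, len(owned) - 1, 2)]
    if len(owned) % 2:
        merged.append(owned[-1])  # isolated tablet keeps its share
    return merged


def split_tablet(owned, index):
    """Split one tablet into two halves, leaving the others untouched."""
    half = owned[index] / 2
    return owned[:index] + [half, half] + owned[index + 1:]


third = Fraction(1, 3)
direct = merge_siblings([third, third, third])                      # 2/3 and 1/3
balanced = merge_siblings(split_tablet([third, third, third], 1))   # 1/2 and 1/2
```

The direct merge leaves a 2x difference between tablets, while splitting tablet [1] first yields two equal halves, which is the outcome the commit says needs per-tablet splits.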
Tomasz Grabiec	66fc7967b8	tablets: Prepare resize_decision to hold data in decisions merge decision will carry a plan - which replica to isolate. So construction from a string will no longer do.	2026-04-15 10:40:55 +02:00
Nadav Har'El	022add117e	test/cluster: fix flaky test test_row_ttl_scheduling_group The test test/cluster/test_ttl_row.py::test_row_ttl_scheduling_group wants to verify that the new CQL per-row TTL feature does all its work (expiration scanning, deletion of expired items) on all nodes in the "streaming" scheduling group, not in the statement scheduling group. As originally written, the test couldn't require that it uses exactly zero time in the statement scheduling group - because some things do happen there - specifically the ALTER TABLE request we use to enable TTL. So the test checked that the time in the "wrong" group is less than 0.2 of the total time, not zero. But in one CI run, we got to exactly 0.2 and the test failed. Running this test locally, I see the margin is pretty narrow: The test almost always fails if I set the threshold ratio to 0.1. The solution in this patch is to move the ALTER TABLE work to a different scheduling group (by using an additional service level). After doing that the CPU usage in sl:default goes down to exactly zero - not close to zero but exactly zero. However, it seems that there is always some rare background work in sl:default and in debug builds it can come out more than 0ms (e.g., in one test we saw 1ms), so we keep checking that sl:default is much lower than sl:stream - not exactly zero. Incidentally, I converted the serial loop adding the 200 rows in the test's setup to a parallel loop, to make the test setup slightly faster. I also added to the test a sanity check that the scheduling group sl:default that we are measuring that TTL does zero work in, is actually the scheduling group that normal writes work in (to avoid the risk of having a test that verifies that some irrelevant scheduling group is unsurprisingly getting zero usage...). Fixes SCYLLADB-1495. Closes scylladb/scylladb#29447	2026-04-15 08:42:29 +03:00
Tomasz Grabiec	01fb97ee78	locator: tablets: Support arbitrary tablet boundaries There are several reasons we want to do that. One is that it will give us more flexibility in distributing the load. We can subdivide tablets at any points, and achieve more evenly-sized tablets. In particular, we can isolate large partitions into separate tablets. Another reason is vnode-to-tablet migration. We could construct a tablet map which matches exactly the vnode boundaries, so migration can happen transparently from the CQL-coordinator's point of view. Implementation details: We store a vector of tokens which represent tablet boundaries in the tablet_id_map. tablet_id keeps its meaning, it's an index into vector of tablets. To avoid logarithmic lookup of tablet_id from the token, we introduce a lookup structure with power-of-two aligned buckets, and store the tablet_id of the tablet which owns the first token in the bucket. This way, lookup needs to consider tablet id range which overlaps with one bucket. If boundaries are more or less aligned, there are around 1-2 tablets overlapping with a bucket, and the lookup is still O(1). Amount of memory used increased, but not significantly relative to old size (because tablet_info is currently fat): For 131'072 tablets: Before: Size of tablet_metadata in memory: 57456 KiB After: Size of tablet_metadata in memory: 59504 KiB	2026-04-15 01:25:14 +02:00
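The bucketed lookup described in the commit above can be sketched in Python. This is an illustration under stated assumptions, not the `tablet_id_map` implementation: the class name `BucketedTabletMap`, the unsigned 16-bit token space, and representing each boundary as the last token a tablet owns are all choices made for this sketch.

```python
class BucketedTabletMap:
    """Map a token to a tablet id using power-of-two aligned buckets (sketch)."""

    def __init__(self, last_tokens, log2_buckets, token_bits=16):
        # last_tokens[i] is the last token owned by tablet i (inclusive), ascending;
        # the final entry must be the maximum token so every token has an owner.
        assert last_tokens[-1] == (1 << token_bits) - 1
        self.last_tokens = last_tokens
        self.shift = token_bits - log2_buckets
        self.bucket_first = []
        tid = 0
        for bucket in range(1 << log2_buckets):
            first_token = bucket << self.shift
            # Advance to the tablet that owns the first token of this bucket.
            while last_tokens[tid] < first_token:
                tid += 1
            self.bucket_first.append(tid)

    def tablet_of(self, token):
        # Start from the tablet owning the bucket's first token, then scan forward
        # over the (usually 1-2) tablets that overlap the bucket.
        tid = self.bucket_first[token >> self.shift]
        while self.last_tokens[tid] < token:
            tid += 1
        return tid
```

When boundaries are roughly aligned with the buckets, the forward scan touches only a tablet or two, which is the O(1) behaviour the commit describes; badly skewed boundaries would make the scan longer in this sketch.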
Tomasz Grabiec	82acdae74b	locator: tablets: Introduce tablet_map::get_split_token() And reimplement existing split-related methods around it. This way we avoid calling dht::compaction_group_of(), and assuming anything about tablet boundaries or tablet count being a power of two. This will make later refactoring easier.	2026-04-15 01:24:48 +02:00
Tomasz Grabiec	2e1d41c206	dht: Introduce get_uniform_tokens()	2026-04-15 01:24:48 +02:00
Tomasz Grabiec	a58243bc1e	Merge 'hint_sender: send hints to all tablet replicas if the tablet leaving due to RF--' from Ferenc Szili Currently, hints that are sent to tablet replicas which are leaving due to RF-- can be lost, because `hint_sender` only checks if the destination host is leaving. To avoid this, we add a new method `effective_replication_map::is_leaving(host, token)` which checks if the tablet identified by the given token is leaving the host. This method is called by the `hint_sender` to check if the hint should be sent only to the destination host, or to all the replicas. This way, we increase consistency. For v-node based ERPs, `is_leaving()` calls `token_metadata::is_leaving(host)`. Fixes: SCYLLADB-287 This is an improvement, and backport is not needed. Closes scylladb/scylladb#28770 * github.com:scylladb/scylladb: test: verify hints are delivered during tablet RF reduction hint_sender: use per-tablet is_leaving() to avoid losing hints on RF reduction erm: add is_leaving() to effective_replication_map	2026-04-14 22:51:34 +02:00
Tomasz Grabiec	7fe4ae16f0	Merge 'table: don't create new split compaction groups if main compaction group is disabled' from Ferenc Szili Fixes a race condition where tablet split can crash the server during truncation. `truncate_table_on_all_shards()` disables compaction on all existing compaction groups, then later calls `discard_sstables()` which asserts that compaction is disabled. Between these two points, tablet split can call `set_split_mode()`, which creates new compaction groups via `make_empty_group()` — these start with `compaction_disabled_counter == 0`. When `discard_sstables()` checks its assertion, it finds these new groups and fires `on_internal_error`, aborting the server. In `storage_group::set_split_mode()`, before creating new compaction groups, check whether the main compaction group has compaction disabled. If it does, bail out early and return `false` (not ready). This is safe because the split will be retried once truncation completes and re-enables compaction. A new regression test `test_split_emitted_during_truncate` reproduces the exact interleaving using two error injection points: - `database_truncate_wait` — pauses truncation after compaction is disabled but before `discard_sstables()` runs. - `tablet_split_monitor_wait` (new, in `service/storage_service.cc`) — pauses the split monitor at the start of `process_tablet_split_candidate()`. The test creates a single-tablet table, triggers both operations, uses the injection points to force the problematic ordering, then verifies that truncation completes successfully and the split finishes afterward. Fixes: SCYLLADB-1035 This needs to be backported to all currently supported version. Closes scylladb/scylladb#29250 * github.com:scylladb/scylladb: test: add test_split_emitted_during_truncate table: fix race between tablet split and truncate	2026-04-14 22:00:40 +02:00
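The early bail-out described above can be modelled with a toy Python sketch. It is not the `storage_group` code: the classes are hypothetical stand-ins, creating exactly two split groups is an assumption, and the return value is simplified to "split groups exist" rather than the real readiness condition.

```python
class CompactionGroup:
    def __init__(self):
        # New groups start with compaction enabled, as the commit notes.
        self.compaction_disabled_counter = 0


class StorageGroup:
    def __init__(self):
        self.main = CompactionGroup()
        self.split_groups = []

    def set_split_mode(self) -> bool:
        """Create split groups unless compaction is disabled on the main group."""
        if self.main.compaction_disabled_counter > 0:
            # Truncation is in progress: report "not ready" and let the split retry.
            return False
        if not self.split_groups:
            self.split_groups = [CompactionGroup(), CompactionGroup()]
        return True
```

In this model, a truncation that has bumped `compaction_disabled_counter` never observes freshly created groups with a zero counter, which is the interleaving the regression test forces.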
Avi Kivity	21d9f54a9a	partition_snapshot_row_cursor: fix reversed maybe_refresh() losing latest version entry In partition_snapshot_row_cursor::maybe_refresh(), the !is_in_latest_version() path calls lower_bound(_position) on the latest version's rows to find the cursor's position in that version. When lower_bound returns null (the cursor is positioned above all entries in the latest version in table order), the code unconditionally sets _background_continuity = true and allows the subsequent if(!it) block to erase the latest version's entry from the heap. This is correct for forward traversal: null means there are no more entries ahead, so removing the version from the heap is safe. However, in reversed mode, null from lower_bound means the cursor is above all entries in table order -- those entries are BELOW the cursor in query order and will be visited LATER during reversed traversal. Erasing the heap entry permanently loses them, causing live rows to be skipped. The fix mirrors what prepare_heap() already does correctly: when lower_bound returns null in reversed mode, use std::prev(rows.end()) to keep the last entry in the heap instead of erasing it. Add test_reversed_maybe_refresh_keeps_latest_version_entry to mvcc_test, alongside the existing reversed cursor tests. The test creates a two-version partition snapshot (v0 with range tombstones, v1 with a live row positioned below all v0 entries in table order), and traverses in reverse calling maybe_refresh() at each step -- directly exercising the buggy code path. The test fails without the fix. The bug was introduced by `6b7473be53` ("Handle non-evictable snapshots", 2022-11-21), which added null-iterator handling for non-evictable snapshots (memtable snapshots lack the trailing dummy entry that evictable snapshots have). prepare_heap() got correct reversed-mode handling at that time, but maybe_refresh() received only forward-mode logic. 
The bug is intermittent because multiple mechanisms cause iterators_valid() to return false, forcing maybe_refresh() to take the full rebuild path via prepare_heap() (which handles reversed mode correctly): - Mutation cleaner merging versions in the background (changes change_mark) - LSA segment compaction during reserve() (invalidates references) - B-tree rebalancing on partition insertion (invalidates references) - Debug mode's always-true need_preempt() creating many multi-version partitions via preempted apply_monotonically() A dtest reproducer confirmed the same root cause: with 100K overlapping range tombstones creating a massively multi-version memtable partition (287K preemption events), the reversed scan's latest_iterator was observed jumping discontinuously during a version transition -- the latest version's heap entry was erased -- causing the query to walk the entire partition without finding the live row. Fixes: SCYLLADB-1253 Closes scylladb/scylladb#29368	2026-04-14 21:50:25 +02:00
Nadav Har'El	986167a416	Merge 'cql3: fix authorization bypass via BATCH prepared cache poisoning' from Marcin Maliszkiewicz execute_batch_without_checking_exception_message() inserted entries into the authorized prepared cache before verifying that check_access() succeeded. A failed BATCH therefore left behind cached 'authorized' entries that later let a direct EXECUTE of the same prepared statement skip the authorization check entirely. Move the cache insertion after the access check so that entries are only cached on success. This matches the pattern already used by do_execute_prepared() for individual EXECUTE requests. Introduced in `98f5e49ea8` Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1221 Backport: all supported versions Closes scylladb/scylladb#29432 * github.com:scylladb/scylladb: test/cqlpy: add reproducer for BATCH prepared auth cache bypass cql3: fix authorization bypass via BATCH prepared cache poisoning	2026-04-14 22:31:54 +03:00
Pavel Emelyanov	cec44dc68d	test: Add test_describe_cluster_sanity for DESCRIBE CLUSTER validation Add parametrized integration test that verifies DESCRIBE CLUSTER returns correct values in both normal and maintenance modes. The parametrization keeps the validation logic (CQL queries and assertions) identical for both modes, while the setup phase is mode-specific. This ensures the same assertions apply to both cluster states: - partitioner is org.apache.cassandra.dht.Murmur3Partitioner - snitch is org.apache.cassandra.locator.SimpleSnitch - cluster name matches system.local cluster_name Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-14 19:33:21 +03:00
Radosław Cybulski	4b984212ba	alternator: improve parsing / generating of StreamArn parameter Previously, the ARNs that Alternator emitted did not follow Amazon's standard format. After our attempt to run KCL with Scylla we discovered a few issues. Amazon's ARN looks like this: arn:partition:service:region:account-id:resource-type/resource-id for example: arn:aws:dynamodb:us-west-2:111122223333:table/TestTable/stream/2015-05-11T21:21:33.291 KCL checks that: - an ARN returned by Alternator calls fits the basic Amazon ARN pattern shown above, - the region consists only of lowercase letters and `-`, with no underscore character - the account-id consists only of digits (exactly 12) - the service is `dynamodb` - the partition starts with `aws` The patch updates our code handling ARNs to match those findings. 1. Split the `stream_arn` object into `stream_arn` - an ARN for streams only - and `stream_shard_id` - an id value for stream shards. The latter receives the original implementation. The former emits and parses ARNs in Amazon style. 2. Update the new `stream_arn` class to encode keyspace and table together, separating them by `@`. The new ARN looks like this: arn:aws:dynamodb:us-east-1:000000000000:table/TestKeyspace@TestTable/stream/2015-05-11T21:21:33.291 3. Hardcode `dynamodb` as service, `aws` as partition, `us-east-1` as region and `000000000000` as account-id (must have 12 digits) 4. Update the code handling ARNs for tag manipulation so that it can parse Amazon-style ARNs. The emitting code is left intact - the parser is now capable of parsing both styles. 5. Add unit tests. Fixes #28350 Fixes: SCYLLADB-539 Fixes: #28142 Closes scylladb/scylladb#28187	2026-04-14 18:07:05 +03:00
Marcin Maliszkiewicz	de19714763	Merge 'cql3: prepare list statements metadata_id during prepare statement, send the correct metadata_id directly to the client' from Alex Dathskovsky This series makes result metadata handling for auth LIST statements consistent and adds coverage for the driver-visible behavior. The first patch makes the result-column metadata construction shared across the affected statements, so the metadata shape used for PREPARE and EXECUTE stays uniform and easier to reason about. The second patch adds regression coverage for both sides of the metadata-id flow: - a Python auth-cluster test verifies that prepared LIST ROLES OF returns a non-empty result metadata id and that a later EXECUTE reuses it without METADATA_CHANGED - a Boost transport test covers the recovery path where the client sends an empty request metadata id and the server responds with METADATA_CHANGED and the full metadata Together these patches tighten the implementation and protect the prepared-metadata-id behavior exposed to drivers. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1218 backport: this change should be backported to all active branches to help the driver operation Closes scylladb/scylladb#29347	2026-04-14 16:09:49 +02:00
Avi Kivity	ebdfa10c8f	test: fix flaky test_incremental_repair_race_window_promotes_unrepaired_data The test waited for two "Finished tablet repair" log messages on the coordinator, expecting one per tablet. But there are two log sources that emit messages matching this pattern: repair module (repair/repair.cc:2329): "Finished tablet repair for table=..." topology coordinator (topology_coordinator.cc:2083): "Finished tablet repair host=..." When the coordinator is also a repair replica (always the case with RF=3 and 3 nodes), both messages appear in the coordinator log for the same tablet within 1ms of each other. The test consumed both, thinking both tablets were done, while the second tablet repair was still running. From the CI failure logs: 04:08:09.658 Found: repair[...]: Finished tablet repair for table=... global_tablet_id=e42fd650-3542-11f1-9756-85403784a622:0 04:08:09.660 Found: raft_topology - Finished tablet repair host=... tablet=e42fd650-3542-11f1-9756-85403784a622:0 Both messages are for tablet :0. Tablet :1 repair had not finished yet. The test then wrote keys 20-29 while the second tablet repair was still in progress. That repair flushed the memtable (via prepare_sstables_for_incremental_repair), including keys 20-29 in the repair scan, and mark_sstable_as_repaired set repaired_at=2 on the resulting sstable. This caused the assertion failure on servers[0]: "should not have post-repair keys in repaired sstables, got: {20, 21, 22, 23, 24, 25, 26, 27, 28, 29}" Fix by matching "Finished tablet repair host=" which is unique to the topology coordinator message and avoids the ambiguity. Also fix an incorrect comment that said being_repaired=null when at that point in the test being_repaired is still set to the session_id (the delay_end_repair_update injection prevents end_repair from running). Fixes: SCYLLADB-1478 Closes scylladb/scylladb#29444	2026-04-14 13:32:51 +02:00
Piotr Dulikowski	9fc2c65d18	Merge 'cql3: implement WRITETIME() and TTL() of individual elements of map, set, and UDT' from Nadav Har'El In commit `727f68e0f5` we added the ability to SELECT: * Individual elements of a map: `SELECT map_col[key]`. * Individual elements of a set: `SELECT set_col[key]` returns key if the key exists in the set, or null if it doesn't, which makes it possible to check whether the element exists in the set. * Individual pieces of a UDT: `SELECT udt_col.field`. But at the time, we didn't provide any way to retrieve the meta-data for this value, namely its timestamp and TTL. We did not support `SELECT TIMESTAMP(collection[key])`, or `SELECT TIMESTAMP(udt.field)`. Users have requested support for such SELECTs in the past (see issue #15427), and Cassandra 5.0 added support for this feature - for maps, sets and UDTs - so we also need this feature for compatibility. This feature was also requested recently by vector-search developers, who wanted to read Alternator columns - stored as map elements, not individual columns - with their WRITETIME information. The first four patches in this series add the feature (in four smaller patches instead of one big one), the fifth and sixth patches add tests (cqlpy and boost tests, respectively). The seventh patch adds documentation. All the new tests pass on Cassandra 5, failed on Scylla before the present fix, and pass with it. The fix was surprisingly difficult. Our existing implementation (from `727f68e0f5` building on earlier machinery) doesn't just "read" `map_col[key]` and allow us to return just its timestamp. Rather, the implementation reads the entire map, serializes it in some temporary format that does not include the timestamps and ttls, and then takes the subscript key, at which point we no longer have the timestamp or ttl of the element. So the fix had to cross all these layers of the implementation. 
While adding support for UDT fields in a pre-existing grammar nonterminal "subscriptExpr", we unintentionally added support for UDT fields also in LWT expressions (which used this nonterminal). LWT missing support for UDT fields was a long-time known compatibility issue (#13624) so we unintentionally fixed it :-) Actually, to completely fix it we needed another small change in the expression implementation, so the eighth patch in this series does this. Fixes #15427 Fixes #13624 Closes scylladb/scylladb#29134 * github.com:scylladb/scylladb: cql3: support UDT fields in LWT expressions cql3: document WRITETIME() and TTL() for elements of map, set or UDT test/boost: test WRITETIME() and TTL() on map collection elements test/cqlpy: test WRITETIME() and TTL() on element of map, set or UDT cql3: prepare and evaluate WRITETIME/TTL on collection elements and UDT fields cql3: parse per-element timestamps/TTLs in the selection layer cql3: add extended wire format for per-element timestamps and TTLs cql3: extend WRITETIME/TTL grammar to accept collection and UDT elements	2026-04-14 12:35:46 +02:00
Dawid Pawlik	800dec2180	test/cqlpy: cover vector index duplicate creation rules Add cqlpy tests for the current CREATE INDEX behavior of vector indexes. Cover named and unnamed duplicates, IF NOT EXISTS, coexistence of multiple named vector indexes on the same column, interactions between named and unnamed indexes, and the same-name-on-different-table case.	2026-04-14 12:21:38 +02:00
Marcin Maliszkiewicz	db5e4f2cb8	test/cqlpy: add reproducer for BATCH prepared auth cache bypass An unprivileged user could bypass authorization checks by exploiting the BATCH prepared statement cache: 1. Prepare an INSERT on a table the user has no access to 2. Execute it inside a BATCH — gets Unauthorized 3. Execute the same prepared INSERT directly — succeeds	2026-04-14 10:37:42 +02:00
Marcin Maliszkiewicz	8401e9cbbd	test: filter benign errors in tests that grep logs during shutdown Apply filter_errors() to grep_for_errors() results in test_split_stopped_on_shutdown and test_group0_apply_while_node_is_being_shutdown. Without filtering, benign RPC errors like 'connection dropped: Semaphore broken' that occur during graceful shutdown cause spurious test failures.	2026-04-13 18:33:41 +02:00
Marcin Maliszkiewicz	e78e6cd584	test: filter_errors: support list[list[str]] error groups Accept both list[str] (from distinct_errors=True) and list[list[str]] (from distinct_errors=False) in filter_errors(), matching against the first line of each error group. This allows tests that call grep_for_errors() with default arguments to pipe results directly through filter_errors().	2026-04-13 18:33:29 +02:00
Alex	fdce8824a5	test/cluster: cover prepared LIST metadata ids in one setup Precompute the expected metadata-id hashes for the prepared LIST auth and service-level statements and verify that PREPARE returns them while EXECUTE reuses the prepared metadata without METADATA_CHANGED. Run all cases in a single auth-cluster test after preparing the cluster, role, and service level once through the regular manager fixture.	2026-04-13 19:13:12 +03:00
Alex	0f6d9ffd22	cql: expose stable result metadata for prepared LIST statements Prepared LIST statements did not calculate metadata in the PREPARE path and sent an empty-string hash to the client, so the metadata_id was not recalculated correctly. This patch moves metadata construction into get_result_metadata() for the affected LIST statements and reuses that metadata when building the result set. This gives PREPARE a stable metadata id for LIST ROLES, LIST USERS, LIST PERMISSIONS and the service-level variants. This patch also adds a new boost test that verifies that when an EXECUTE request carries an empty result metadata id while the server has a real metadata id for the result set, the response is marked METADATA_CHANGED and includes the full result metadata plus the server metadata id. This covers the recovery path for clients that send an empty or otherwise unusable metadata id instead of a matching cached one.	2026-04-13 17:49:27 +03:00
Ferenc Szili	e904e7a715	test: add test_split_emitted_during_truncate Add a regression test that reproduces the race between tablet split and truncation. The test: 1. Creates a single-tablet table and inserts data. 2. Triggers truncation and pauses it (via database_truncate_wait) after compaction is disabled but before discard_sstables() runs. 3. Triggers tablet split and pauses it (via tablet_split_monitor_wait) at the start of process_tablet_split_candidate(). 4. Releases split so set_split_mode() creates new compaction groups. 5. Waits for the set_split_mode log confirming the groups exist. 6. Releases truncation so discard_sstables() encounters the new groups. 7. Verifies truncation completes and split finishes. Adds a tablet_split_monitor_wait error injection point in process_tablet_split_candidate() to allow pausing the split monitor before it enters the split loop.	2026-04-13 11:05:03 +02:00
Avi Kivity	0ae22a09d4	LICENSE: Update to version 1.1 Updated terms of non-commercial use (must be a never-customer).	2026-04-12 19:46:33 +03:00
Avi Kivity	22949bae52	Merge 'logstor: implement tablet split/merge and migration' from Michael Litvak implement tablet split, tablet merge and tablet migration for tables that use the experimental logstor storage engine. * tablet merge simply merges the histograms of segments of one compaction group with another. * for tablet split we take the segments from the source compaction group, read them and write all live records to separate segments according to the split classifier, and move separated segments to the target compaction groups. * for tablet migration we use stream_blob, similarly to file streaming of sstables. we add a new op type for streaming a logstor segment. on the source we take a snapshot of the segments with an input stream that reads the segment, and on the target we create a sink that allocates a new segment on the target shard and writes to it. * we also do some improvements for recovery and loading of segments. we add a segment header that contains useful information for non-mixed segments, such as the table and token range. Refs SCYLLADB-770 no backport - still a new and experimental feature Closes scylladb/scylladb#29207 * github.com:scylladb/scylladb: test: logstor: additional logstor tests docs/dev: add logstor on-disk format section logstor: add version and crc to buffer header test: logstor: tablet split/merge and migration logstor: enable tablet balancing logstor: streaming of logstor segments using stream_blob logstor: add take_logstor_snapshot logstor: segment input/output stream logstor: implement compaction_group::cleanup logstor: tablet split logstor: tablet merge logstor: add compaction reenabler logstor: add segment header logstor: serialize writes to active segment replica: extend compaction_group functions for logstor replica: add compaction_group_for_logstor_segment logstor: code cleanup	2026-04-12 16:11:12 +03:00
Nadav Har'El	33dbb63aef	cql3: support UDT fields in LWT expressions In an earlier patch, we used the CQL grammar's "subscriptExpr" in the rule for WRITETIME() and TTL(). But since we also wanted these to support UDT fields (x.a), not just collection subscripts (x[3]), we expanded subscriptExpr to also support the field syntax. But LWT expressions already used this subscriptExpr, which meant that LWT expressions unintentionally gained support for UDT fields. Missing support for UDT fields in LWT is a long-standing known Cassandra-compatibility bug (#13624), and now our grammar finally supports the missing syntax. But supporting the syntax is not enough for correct implementation of this feature - we also need to fix the expression handling: Two bugs prevented expressions like `v.a = 0` from working in LWT IF clauses, where `v` is a column of user-defined type. The first bug was in get_lhs_receiver() in prepare_expr.cc: it lacked a handler for field_selection nodes, causing an "unexpected expression" internal error when preparing a condition like `IF v.a = 0`. The fix adds a handler that returns a column_specification whose type is taken from the prepared field_selection's type field. The second bug was in search_and_replace() in expression.cc: when recursing into a field_selection node it reconstructed it with only `structure` and `field`, silently dropping the `field_idx` and `type` fields that are set during preparation. As a result, any transformation that uses search_and_replace() on a prepared expression containing a field_selection — such as adjust_for_collection_as_maps() called from column_condition_prepare() — would zero out those fields. At evaluation time, type_of() on the field_selection returned a null data_type pointer, causing a segmentation fault when the comparison operator tried to call ->equal() through it. The fix preserves field_idx and type when reconstructing the node. Fixes #13624.	2026-04-12 14:28:01 +03:00
Nadav Har'El	a544dae047	test/boost: test WRITETIME() and TTL() on map collection elements Add tests in test/boost/expr_test.cc for the low-level implementation of writetime() and ttl() on a map element. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-04-12 14:28:01 +03:00
Nadav Har'El	ccb94618cc	test/cqlpy: test WRITETIME() and TTL() on element of map, set or UDT This patch adds many tests verifying the behavior of WRITETIME() and TTL() on individual elements of maps, sets and UDTs, serving as a regression test for issue #15427. We also add tests verifying our understanding of related issues like WRITETIME() and TTL() of entire collections and of individual elements of frozen collections. All new tests pass on Cassandra 5.0, helping to verify that our implementation is compatible with Cassandra. They also pass on ScyllaDB after the previous patch (most didn't before that patch). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-04-12 14:27:40 +03:00
Benny Halevy	e4f0539acf	query: result_set: change row member to a chunked vector To prevent large memory allocations. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-04-12 10:00:49 +03:00
Avi Kivity	8ccee6803e	Merge 'Remove upgrade view builder' from Gleb Natapov Since we no longer support upgrading from versions that do not support v2 of the "view building status" code (where building status is managed by raft), we can remove the v1 code and the upgrade code, and make sure we do not boot with the old "builder status" version. The v2 version was introduced by `8d25a4d678`, which is included in scylla-2025.1.0. No backport needed since this is code removal. Closes scylladb/scylladb#29105 * github.com:scylladb/scylladb: view: drop unused v1 builder code view: remove upgrade to raft code	2026-04-12 00:39:26 +03:00
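
The ARN rules listed in commit 4b984212ba (partition starting with `aws`, service `dynamodb`, a region without underscores, a 12-digit account-id, and keyspace and table joined by `@`) can be illustrated with a small parser. This is only a sketch under assumptions, not Alternator's actual C++ code: `ARN_PATTERN` and `parse_stream_arn` are hypothetical names, and digits are allowed in the region because the commit's own examples (`us-west-2`, `us-east-1`) contain them.

```python
import re

# Hypothetical pattern mirroring the checks described in commit 4b984212ba.
# Assumption: digits are permitted in the region, since the commit's example
# regions ("us-west-2", "us-east-1") contain them.
ARN_PATTERN = re.compile(
    r"^arn:(?P<partition>aws[a-z-]*)"                  # partition starts with "aws"
    r":(?P<service>dynamodb)"                          # service is "dynamodb"
    r":(?P<region>[a-z0-9-]+)"                         # no underscores or uppercase
    r":(?P<account>[0-9]{12})"                         # exactly 12 digits
    r":table/(?P<keyspace>[^/@]+)@(?P<table>[^/@]+)"   # keyspace@table
    r"/stream/(?P<label>.+)$"                          # stream label (timestamp)
)


def parse_stream_arn(arn: str) -> tuple[str, str, str]:
    """Return (keyspace, table, stream label) or raise ValueError."""
    match = ARN_PATTERN.match(arn)
    if match is None:
        raise ValueError(f"not a stream ARN of the expected shape: {arn!r}")
    return match.group("keyspace"), match.group("table"), match.group("label")


if __name__ == "__main__":
    example = (
        "arn:aws:dynamodb:us-east-1:000000000000:"
        "table/TestKeyspace@TestTable/stream/2015-05-11T21:21:33.291"
    )
    print(parse_stream_arn(example))
```

A single anchored regular expression keeps every rule from the commit message visible in one place; a real implementation would also need to accept the older ARN style for tags that the commit says the parser still understands.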

11455 Commits