scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-04 14:03:06 +00:00

Author	SHA1	Message	Date
Nadav Har'El	2e274bbdba	alternator: split executor.cc even more This patch continues the effort to split the huge executor.cc (5000 lines before this patch) even more. In this patch we introduce a new source file, executor_util.cc, for various utility functions that are used for many different operations and therefore are useful to have in a header file. These utility functions will now be in executor_util.cc and executor_util.hh - instead of executor.cc and executor.hh. Various source files, including executor.cc, the executor_read.cc introduced in the previous patch, as well as older source files like streams.cc, ttl.cc and serialization.cc, use the new header file. This patch removes over 700 lines of code from executor.cc, and also removes a large number of utility function declarations from executor.hh. Originally, executor.hh was meant to be about the interface that the Alternator server needs to execute the different DynamoDB API operations - and after this patch it returns closer to this original goal. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-04-16 14:30:16 +03:00
Nadav Har'El	751da00692	alternator: split alternator/executor.cc Already six years ago, in #5783, we noticed that alternator/executor.cc has grown too large. The previous patches added hundreds more lines to it to implement vector search, and it reached a whopping 7,000 lines of code. This is too much. This patch splits from executor.cc two major chunks: 1. The implementation of read requests - GetItem, BatchGetItem, Query (base table, GSI/LSI, and vector-search), and Scan - was moved to a new source file alternator/executor_read.cc. The new file has 2,000 lines. 2. Moved 250 lines of template functions dealing with attribute paths and maps of them to a new header file, attribute_path.hh. These utilities are used for many different operations - various read operations use them for ProjectionExpression, and UpdateItem uses them for modifications to nested attributes, so we need the new header file from both executor.cc and executor_read.cc. The remaining executor.cc is still pretty big, 5,000 lines, and contains write operations (PutItem, UpdateItem, DeleteItem, BatchWriteItem) as well as various table and other operations, and also many utility functions used by many types of operations, so we can later continue this refactoring effort. Refs #5783 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-04-16 14:30:10 +03:00
Nadav Har'El	83670d2493	alternator: validate vector index attribute values on write When a table has a vector index, writes to the indexed attribute (via PutItem, UpdateItem, or BatchWriteItem) must supply a value that is a vector of the appropriate length: It must be a list of exactly the declared number of elements, where each element is a numeric type ("N") representable as a 32-bit float. Before this patch, invalid values were silently accepted and the item was simply not indexed (it was skipped by the vector store when it read this item). Now these writes are rejected with a ValidationException. This is analogous to the existing validation of GSI/LSI key attribute values - in DynamoDB after a certain attribute becomes the key of a GSI or LSI, the user is no longer allowed to write the same type. The implementation we add here is also analogous to the implementation of the GSI/LSI key validation. The GSI/LSI key validation is done by validate_value_if_index_key / si_key_attributes, and in this patch we add the vector-index parallels: vector_index_attributes() collects the attribute name and declared dimensions for every vector index in the schema, and validate_value_if_vector_index_attribute() enforces the type limitations. For efficiency in the common case where a table has no vector indexes and no GSIs/LSIs, both validation functions are out-of-line and each call site guards the call with an explicit empty() check, so no function-call overhead is incurred when there is nothing to validate. For UpdateItem, the map of vector index attributes is cached in update_item_operation (alongside the existing _key_attributes cache) to avoid recomputing it on every call to update_attribute().	2026-04-16 13:31:49 +03:00
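The validation rule described in commit 83670d2493 can be illustrated with a minimal Python sketch. It is not the Alternator implementation (which is C++): the function name `validate_vector_value` is hypothetical, the value is assumed to arrive in DynamoDB JSON form (`{"L": [{"N": "..."}]}`), and "representable as a 32-bit float" is assumed here to mean finite and within float32 range.

```python
import math
import struct

def validate_vector_value(value, dimensions):
    """Return None if `value` looks like a valid vector, else an error string.

    Hypothetical helper: `value` is a DynamoDB-JSON attribute value and
    `dimensions` is the declared length of the vector index.
    """
    items = value.get("L") if isinstance(value, dict) else None
    if not isinstance(items, list) or len(items) != dimensions:
        return f"expected a list of exactly {dimensions} elements"
    for item in items:
        # Each element must be a number, i.e. a single-key {"N": "..."} map.
        if not isinstance(item, dict) or set(item) != {"N"}:
            return 'every element must have numeric type "N"'
        try:
            number = float(item["N"])
            # struct raises OverflowError if the value exceeds float32 range.
            struct.pack("<f", number)
        except (TypeError, ValueError, OverflowError):
            return "element is not representable as a 32-bit float"
        if not math.isfinite(number):
            return "element is not representable as a 32-bit float"
    return None
```

A caller would turn a non-None result into a ValidationException, matching the behavior the commit message describes.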
Nadav Har'El	aea7b6a66b	alternator: DescribeTable for vector index: add IndexStatus and Backfilling Add to DescribeTable's output for VectorIndexes two fields - IndexStatus and Backfilling - which are intended to exactly mirror these two fields that exist for GlobalSecondaryIndexes: When a vector index is added, IndexStatus is "CREATING" before the index is usable, and "ACTIVE" when it is finally usable for a Query. During the "CREATING" phase, "Backfilling" may be set to true when the index is currently being backfilled (the table is scanned and an index is built). A user is expected to call DescribeTable in a loop after creating a vector index (via either CreateTable or UpdateTable) and only call Query on the index after the IndexStatus is finally ACTIVE. Calling Query earlier, while IndexStatus is still CREATING, will result in an error. In the current implementation, Alternator does not track the state of the vector index, so it needs to contact the vector store to inquire about the state of the index - using a new function introduced in this patch that uses an existing vector-store API. This makes DescribeTable slower on tables that have vector indexes, because the vector store is contacted on every DescribeTable call. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-04-16 13:31:49 +03:00
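The "call DescribeTable in a loop" pattern from commit aea7b6a66b might look like the sketch below. The helper name and its parameters are hypothetical, and the response layout (`Table` -> `VectorIndexes` -> entries with `IndexName` and `IndexStatus`) is an assumption modeled on how DynamoDB reports GlobalSecondaryIndexes; only the field names IndexStatus and Backfilling and the values CREATING and ACTIVE come from the commit message. The `describe_table` argument is any zero-argument callable that returns the DescribeTable response, so no client library is assumed.

```python
import time

def wait_until_vector_index_active(describe_table, index_name, timeout=60.0,
                                   interval=1.0, sleep=time.sleep,
                                   clock=time.monotonic):
    """Poll DescribeTable until the named vector index reports ACTIVE.

    `describe_table` is a caller-supplied callable (hypothetical) returning
    the DescribeTable response as a dict. Raises TimeoutError on timeout.
    """
    deadline = clock() + timeout
    while True:
        table = describe_table()["Table"]
        for index in table.get("VectorIndexes", []):  # assumed layout
            if (index.get("IndexName") == index_name
                    and index.get("IndexStatus") == "ACTIVE"):
                return index
        if clock() >= deadline:
            raise TimeoutError(f"vector index {index_name!r} is not ACTIVE")
        sleep(interval)
```

Because every DescribeTable call contacts the vector store, a polling interval on the order of seconds is probably a reasonable default.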
Nadav Har'El	e43a2e5086	alternator: implement Query with a vector index We introduce to the Query request a new "VectorSearch" parameter, which takes a mandatory "QueryVector" (a value which must be a numeric vector of the right length) and "Limit". The "Limit" of a vector search (Query with VectorSearch) determines the number of nearest neighbors to return, and does not allow pagination (ExclusiveStartKey is not allowed). ConsistentRead=True is also not allowed on a vector search query. The "Select"/"ProjectionExpression"/"AttributesToGet" parameters are also supported, requesting which attributes to fetch. Using Select=ALL_PROJECTED_ATTRIBUTES means read only the attributes found in the vector index - currently only the key columns - so it is significantly faster than ALL_ATTRIBUTES because it doesn't require reading the items from the base table. The "FilterExpression" parameter is also supported. Like in DynamoDB's traditional Query, this does post-filtering, i.e., removing some of the results returned by the vector index that don't match the filter, and as a result fewer than Limit results may be returned. Pre-filtering (done on the vector store, and always returns Limit results) is not yet implemented.	2026-04-16 13:31:47 +03:00
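To see why post-filtering can return fewer than Limit results, consider this toy sketch. It is a brute-force stand-in, not the vector store's algorithm: the function name, the `vec` attribute, and the use of Euclidean distance are all assumptions made for illustration.

```python
import math

def vector_search(items, query_vector, limit, filter_fn=None):
    """Return up to `limit` nearest items, then apply an optional post-filter.

    The filter runs AFTER the nearest-neighbor cut, so it can only shrink
    the result; it never pulls in farther items to refill it.
    """
    nearest = sorted(items, key=lambda it: math.dist(it["vec"], query_vector))
    nearest = nearest[:limit]
    if filter_fn is None:
        return nearest
    return [item for item in nearest if filter_fn(item)]

items = [
    {"id": 1, "vec": [0.0, 0.0], "color": "red"},
    {"id": 2, "vec": [0.1, 0.0], "color": "blue"},
    {"id": 3, "vec": [5.0, 5.0], "color": "red"},
]
# Limit=2 selects ids 1 and 2; the filter then drops id 2, leaving one result
# even though a second red item (id 3) exists farther away.
hits = vector_search(items, [0.0, 0.0], 2, lambda it: it["color"] == "red")
```

Pre-filtering, which the commit says is not yet implemented, would instead apply the filter before the cut and so could return both red items.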
Nadav Har'El	68e34c57e1	alternator: fix bug in describe_multi_item() In commit `a55c5e9ec7`, the function describe_multi_item() got a new item_callback parameter, that can be used to calculate the size of the item. This new parameter has a default, an empty noncopyable_function. But an empty noncopyable_function shouldn't be called - exactly like std::function, it throws std::bad_function_call if called when empty. So describe_multi_item() should only call this item_callback if it's not empty. This became a problem in the next patch, implementing vector search query, which called describe_multi_item with the default item_callback. But in general, the function should be usable with the default parameter (or we shouldn't have defined a default value for this parameter!). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-04-16 13:30:02 +03:00
Nadav Har'El	ffe1029b7c	alternator: prevent adding GSI conflicting with a vector index All the "indexes" we implement in Alternator - GSI, LSI and the new vector index - share the same IndexName namespace, which we'll use in Query to refer to the index. In the previous patch we already prevented adding a vector index with the same name as an existing GSI or LSI. In this patch we also prevent the reverse - adding a GSI with the name of an existing vector index. Additionally, one cannot add a GSI on a key that is already the key of a vector index: The types conflict: The key of a vector index must be a vector column, while the key of a GSI must have a standard key type (string, binary or number). We have tests for this later, in the big test patch.	2026-04-16 13:30:02 +03:00
Nadav Har'El	82de16f92c	alternator: implement UpdateTable with a vector index After an earlier patch allowed CreateTable to create vector indexes together with a table, in this patch we add to UpdateTable the ability to add a new vector index to an existing table, as well as the ability to delete a vector index from an existing table. The implementation is inspired by DynamoDB's syntax for GSI - just like GSI has GlobalSecondaryIndexUpdates with "Create" and "Delete" operations, for vector indexes we have VectorIndexUpdates supporting Create and Delete. "Update" is not yet supported - we didn't implement yet any parameter that can be updated - but we can easily implement it in the future.	2026-04-16 13:30:02 +03:00
Nadav Har'El	217090a996	alternator: implement DescribeTable with a vector index In this patch we add to DescribeTable the ability to list the vector indexes enabled on an Alternator table.	2026-04-16 13:30:02 +03:00
Nadav Har'El	e156d67177	alternator: implement CreateTable with a vector index ScyllaDB supports the "vector search" feature in CQL. In this patch we start the path to adding vector search support also to Alternator. In this patch, we implement CreateTable support - allowing the user to enable vector search in a new table. The following patches will enable additional operations like UpdateTable (adding a vector index to an existing table or deleting a vector index from an existing table) and DescribeTable. Extensive tests for all these features will come at the end of the series. Those tests were written in parallel with writing this implementation so they cover (hopefully) every nook and cranny of the implementation.	2026-04-16 13:29:58 +03:00
Nadav Har'El	0afc730b7b	alternator: reject empty attribute names Alternator has a function validate_attr_name_length() used to validate an attribute name passed in different operations like PutItem, UpdateItem, GetItem, etc. It fails the request if the attribute name is longer than 65535 characters. It turns out that we forgot to check if the attribute name length isn’t 0 - which should be forbidden as well! This patch fixes the validation code, and also adds a test that confirms that after this patch empty attribute names are rejected - just like DynamoDB does - whereas before this patch they were silently accepted. We want to fix this issue now, because in a later patch we intend to use the same validation function also for vector indexes - and want it to be accurate. Fixes SCYLLADB-1069. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-04-16 13:28:15 +03:00
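The fix in commit 0afc730b7b amounts to adding a lower bound to an existing upper-bound check. A minimal Python sketch follows; the function name mirrors the C++ one from the commit message, but the exception type and the choice to count with `len()` (the real code may count bytes) are assumptions.

```python
MAX_ATTR_NAME_LENGTH = 65535  # upper limit quoted in the commit message

def validate_attr_name_length(name):
    """Raise ValueError for an empty or over-long attribute name.

    Sketch only: Alternator reports this as a request failure; ValueError
    is a stand-in, and the length unit here is Python characters.
    """
    if len(name) == 0:
        # The check that was missing before the fix.
        raise ValueError("attribute name must not be empty")
    if len(name) > MAX_ATTR_NAME_LENGTH:
        raise ValueError(
            f"attribute name is longer than {MAX_ATTR_NAME_LENGTH} characters")
```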
Nadav Har'El	8948a50f3b	cdc: fix on_pre_create_column_families to create CDC log for vector search The vector-search feature, which is already supported in CQL, introduced the somewhat confusing feature of enabling CDC without explicitly enabling CDC: When a vector index is enabled on a table, CDC is "enabled" for it even if the user didn't ask to enable CDC. For this, some code in cdc/log.cc began to use cdc_enabled() instead of checking schema.cdc_options.enabled() directly. This cdc_enabled() function checks if either this enabled() is true, or has_vector_index() is true. But there's another twist to this story: To write with CDC, we also need to create the CDC log table: 1. In CQL, a vector index can only be added on an existing table (with CREATE INDEX), so the hook on_before_update_column_family() is the one that noticed that a vector index was added, and created the CDC log table. 2. But in Alternator, a vector index can be created up-front with a brand-new table (in CreateTable), so the hook for a new table - on_pre_create_column_families(), also needs to create the CDC log table. It already did, but incorrectly checked just the explicit CDC-enabled flag instead of the new cdc_enabled() function that also allows vector index. So this patch just fixes on_pre_create_column_families to use cdc_enabled(). Before this patch, when a vector index will be created in Alternator with CreateTable, an attempt to write to the table (PutItem) will fail because it will try to write to the CDC log, which wasn't created. After this patch, it works. The reproducing test is test_putitem_vectorindex_createtable (introduced in a later patch). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-04-16 13:28:15 +03:00
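The difference between the explicit CDC flag and cdc_enabled() in commit 8948a50f3b can be modeled as below. This is a simplified Python illustration, not the C++ code in cdc/log.cc: the `Schema` dataclass and `tables_needing_cdc_log` helper are hypothetical stand-ins for the schema object and the on_pre_create_column_families() hook.

```python
from dataclasses import dataclass

@dataclass
class Schema:
    """Hypothetical, minimal stand-in for a table schema."""
    name: str
    cdc_explicitly_enabled: bool = False  # models schema.cdc_options.enabled()
    has_vector_index: bool = False        # models has_vector_index()

def cdc_enabled(schema):
    # CDC counts as enabled if requested explicitly OR implied by a vector index.
    return schema.cdc_explicitly_enabled or schema.has_vector_index

def tables_needing_cdc_log(new_schemas):
    """Names of brand-new tables for which a CDC log table must be created."""
    # Before the fix, the equivalent check looked only at the explicit flag.
    return [s.name for s in new_schemas if cdc_enabled(s)]
```

With the buggy check, a table created with only a vector index would be skipped here, and a later write would fail because the CDC log table does not exist.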
Nadav Har'El	f0e9177130	Merge 'audit/alternator: Make Alternator requests audited' from Piotr Szymaniak Each Alternator API call results in the request being audited, provided the auditing is enabled. Both successful as well as the failed requests are audited, with few exceptions. The chosen audit types for the operations: - CreateTable - DDL - DescribeTable - QUERY - DeleteTable - DDL - UpdateTable - DDL - PutItem - DML - UpdateItem - DML - GetItem - QUERY - DeleteItem - DML - ListTables - QUERY - Scan - QUERY - DescribeEndpoints - QUERY - BatchWriteItem - DML - BatchGetItem - QUERY - Query - QUERY - TagResource - DDL - UntagResource - DDL - ListTagsOfResource - QUERY - UpdateTimeToLive - DDL - DescribeTimeToLive - QUERY - ListStreams - QUERY - DescribeStream - QUERY - GetShardIterator - QUERY - GetRecords - QUERY - DescribeContinuousBackups - QUERY FIXME: The tests are now covering the new functionality only partially. Fixes: scylladb/scylla-enterprise#3796 Fixes: SCYLLADB-467 No need to backport, new functionality. Closes scylladb/scylladb#27953 * github.com:scylladb/scylladb: audit/alternator: support audit_tables=alternator.<table> shorthand audit/alternator: Add negative audit tests audit/alternator: Add testing of auditing audit/alternator: Audit requests audit/alternator: Refactor in preparation for auditing Alternator	2026-04-15 22:17:57 +03:00
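The operation-to-category list in merge f0e9177130 is easier to scan as a lookup table. The sketch below transcribes it into a Python dict; the table contents come from the commit message, while the `audit_category` helper is hypothetical and does not reflect how the C++ code stores the mapping.

```python
# Audit category per Alternator operation, as listed in the merge message.
AUDIT_CATEGORY = {
    "CreateTable": "DDL", "DescribeTable": "QUERY", "DeleteTable": "DDL",
    "UpdateTable": "DDL", "PutItem": "DML", "UpdateItem": "DML",
    "GetItem": "QUERY", "DeleteItem": "DML", "ListTables": "QUERY",
    "Scan": "QUERY", "DescribeEndpoints": "QUERY", "BatchWriteItem": "DML",
    "BatchGetItem": "QUERY", "Query": "QUERY", "TagResource": "DDL",
    "UntagResource": "DDL", "ListTagsOfResource": "QUERY",
    "UpdateTimeToLive": "DDL", "DescribeTimeToLive": "QUERY",
    "ListStreams": "QUERY", "DescribeStream": "QUERY",
    "GetShardIterator": "QUERY", "GetRecords": "QUERY",
    "DescribeContinuousBackups": "QUERY",
}

def audit_category(operation):
    """Return the audit category for an operation, or None if unlisted."""
    return AUDIT_CATEGORY.get(operation)
```

In short: operations that change table definitions or metadata are DDL, item writes are DML, and every read-only operation is QUERY.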
Nikos Dragazis	d38f44208a	test/cqlpy: Harden mutation_fragments tests against background flushes Several tests in test_select_from_mutation_fragments.py assume that all mutations end up in a single SSTable. This assumption can be violated by background memtable flushes triggered by commitlog disk pressure. Since the Scylla node is taken from a pool, it may carry unflushed data from prior tests that prevents closed segments from being recycled, thereby increasing the commitlog disk usage. A main source of such pressure is keyspace-level flushes from earlier tests in this module, which rotate commitlog segments without flushing system tables (e.g., `system.compaction_history`), leaving closed segments dirty. Additionally, prior tests in the same module may have left unflushed data on the shared test table (`test_table` fixture), keeping commitlog segments dirty on its behalf as well. When commitlog disk usage exceeds its threshold, the system flushes the test table to reclaim those segments, potentially splitting a running test's mutations across multiple SSTables. This was observed in CI, where test_paging failed because its data was split across two SSTables, resulting in more mutation fragments than the hardcoded expected count. This patch fixes the affected tests in two ways: 1. Where possible, tests are reworked to not assume a single SSTable: - test_paging - test_slicing_rows - test_many_partition_scan 2. Where rework is impractical, major compaction is added after writes and before validation to ensure that only one SSTable will exist: - test_smoke - test_count - test_metadata_and_value - test_slicing_range_tombstone_changes Fixes SCYLLADB-1375. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#29389	2026-04-15 21:46:00 +03:00
Avi Kivity	59ec93b86b	Merge 'Allow arbitrary tablet boundaries and count' from Tomasz Grabiec There are several reasons we want to do that. One is that it will give us more flexibility in distributing the load. We can subdivide tablets at any token, and achieve more evenly-sized tablets. In particular, we can isolate large partitions into separate tablets. We can also split and merge incrementally individual tablets. Currently, we do it for the whole table or nothing, which makes splits and merges take longer and cause wide swings of the count. This is not implemented in this PR yet, we still split/merge the whole table. Another reason is vnode to tablets migration. We now could construct a tablet map which matches exactly the vnode boundaries, so migration can happen transparently from CQL-coordinator point of view. Tablet count is still a power-of-two by default for newly created tables. It may be different if tablet map is created by non-standard means, or if per-table tablet option "pow2_count" is set to "false". 
build/release/scylla perf-tablets: Memory footprint for 131k tablets increased from 56 MiB to 58.1 MiB (+3.5%) Before: ``` Generating tablet metadata Total tablet count: 131072 Size of tablet_metadata in memory: 57456 KiB Copied in 0.014346 [ms] Cleared in 0.002698 [ms] Saved in 1234.685303 [ms] Read in 445.577881 [ms] Read mutations in 299.596313 [ms] 128 mutations Read required hosts in 247.482742 [ms] Size of canonical mutations: 33.945053 [MiB] Disk space used by system.tablets: 1.456761 [MiB] Tablet metadata reload: full 407.69ms partial 2.65ms ``` After: ``` Generating tablet metadata Total tablet count: 131072 Size of tablet_metadata in memory: 59504 KiB Copied in 0.032475 [ms] Cleared in 0.002965 [ms] Saved in 1093.877441 [ms] Read in 387.027100 [ms] Read mutations in 255.752121 [ms] 128 mutations Read required hosts in 211.202805 [ms] Size of canonical mutations: 33.954453 [MiB] Disk space used by system.tablets: 1.450162 [MiB] Tablet metadata reload: full 354.50ms partial 2.19ms ``` Closes scylladb/scylladb#28459 * github.com:scylladb/scylladb: test: boost: tablets: Add test for merge with arbitrary tablet count tablets, database: Advertise 'arbitrary' layout in snapshot manifest tablets: Introduce pow2_count per-table tablet option tablets: Prepare for non-power-of-two tablet count tablets: Implement merged tablet_map constructor on top of for_each_sibling_tablets() tablets: Prepare resize_decision to hold data in decisions tablets: table: Make storage_group handle arbitrary merge boundaries tablets: Make stats update post-merge work with arbitrary merge boundaries locator: tablets: Support arbitrary tablet boundaries locator: tablets: Introduce tablet_map::get_split_token() dht: Introduce get_uniform_tokens()	2026-04-15 18:57:22 +03:00
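One way to picture arbitrary tablet boundaries is a sorted list of per-tablet last tokens searched by bisection, in contrast to a power-of-two layout where the tablet index can be derived from the token arithmetically. The sketch below is only an illustration under that assumption: the representation (tablet i owns the token range ending at, and including, `last_tokens[i]`) and the function name are not taken from the ScyllaDB source.

```python
import bisect

def tablet_for_token(last_tokens, token):
    """Map a token to a tablet index given arbitrary, sorted boundaries.

    Assumed layout: tablet i owns (last_tokens[i-1], last_tokens[i]],
    and the final entry covers the top of the token range.
    """
    # bisect_left finds the first boundary >= token, i.e. the owning tablet.
    index = bisect.bisect_left(last_tokens, token)
    if index == len(last_tokens):
        raise ValueError("token is beyond the last tablet boundary")
    return index

# Three tablets (not a power of two) with uneven boundaries.
boundaries = [-100, 50, 2**63 - 1]
```

The lookup costs O(log n) in the tablet count, and nothing in it requires the count to be a power of two or the tablets to be evenly sized.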
Andrzej Jackowski	78926d9c96	test/random_failures: remove gossip shadow round injection Commit `c17c4806a1` removed check_for_endpoint_collision() from the fresh bootstrap path, which was the only code path that called do_shadow_round() for new nodes. Since the gossip shadow round is no longer executed during bootstrap, remove the stop_during_gossip_shadow_round error injection from the test. The entry is marked as REMOVED_ rather than deleted to preserve the shuffle order for seed-based test reproducibility. The injection point in gms/gossiper.cc is also removed since it is no longer used by any test. Fixes: SCYLLADB-1466 Closes scylladb/scylladb#29460	2026-04-15 16:30:55 +02:00
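Why mark an entry as REMOVED_ instead of deleting it? A seeded shuffle depends on the list length, so deleting an element can change where every other element lands, while renaming it in place leaves the permutation untouched. The sketch below demonstrates this with Python's `random` module; the list contents other than `stop_during_gossip_shadow_round` are hypothetical, and the real test's shuffling code may differ.

```python
import random

def shuffled(entries, seed):
    """Return a copy of `entries` shuffled deterministically by `seed`."""
    result = list(entries)
    random.Random(seed).shuffle(result)
    return result

original = ["stop_a", "stop_during_gossip_shadow_round", "stop_b", "stop_c"]
# Same length, same positions: only the retired entry's name changes.
renamed = ["stop_a", "REMOVED_stop_during_gossip_shadow_round", "stop_b", "stop_c"]

def live_order(entries, seed):
    """Shuffled order of the entries that are still in use."""
    return [e for e in shuffled(entries, seed)
            if not e.startswith("REMOVED_")
            and e != "stop_during_gossip_shadow_round"]
```

For any seed, `live_order(original, seed)` equals `live_order(renamed, seed)`, so a failure seen with an old seed replays the remaining injections in the same order.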
Asias He	4137a4229c	test: Stabilize tablet incremental repair error test Use async tablet repair task flow to avoid a race where client timeout returns while server-side repair continues after injections are disabled. Start repair with await_completion=false, assert it does not complete within timeout under injection, abort/wait the task, then verify sstables_repaired_at is unchanged. Fixes SCYLLADB-1184 Closes scylladb/scylladb#29452	2026-04-15 16:24:43 +03:00
dependabot[bot]	d584e8e321	build(deps): bump sphinx-scylladb-theme from 1.9.1 to 1.9.2 in /docs Bumps [sphinx-scylladb-theme](https://github.com/scylladb/sphinx-scylladb-theme) from 1.9.1 to 1.9.2. - [Release notes](https://github.com/scylladb/sphinx-scylladb-theme/releases) - [Commits](https://github.com/scylladb/sphinx-scylladb-theme/compare/1.9.1...1.9.2) --- updated-dependencies: - dependency-name: sphinx-scylladb-theme dependency-version: 1.9.2 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Closes scylladb/scylladb#29476	2026-04-15 14:57:37 +03:00
Gleb Natapov	ca24dd4a5f	topology coordinator: log request cancellation only when requests are really canceled Currently cancellation is logged in get_next_task, but the function is called by tablets code as well where we do not act upon its result, only yield to the topology coordinator. But the topology coordinator will not necessarily do the cancellation as well since it can be busy with tablets migration. As a result cancellation is logged, but not done, which is confusing. Fix it by logging cancellation when it actually happens. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1409 Closes scylladb/scylladb#29471	2026-04-15 14:46:59 +03:00
Botond Dénes	280fe7cfb7	Merge 'Make inclusion of config.hh cheaper' from Nadav Har'El This is an attempt (mostly suggested and implemented by AI, but with a few hours of human babysitting...), to somewhat reduce compilation time by picking one template, named_value<T>, which is used in more than a hundred source files through the config.hh header, and making it use external instantiation: The different methods of named_value<T> for various T are instantiated only once (in config.cc), and the individual translation units don't need to compile them a hundred times. The resulting saving is a little underwhelming: The total object-file size goes down about 1% (from 346,200 before the patch to 343,488 after the patch), and previous experience shows that this object-file size is proportional to the compilation time, most of which involves code generation. But I haven't been able to measure speedup of the build itself. 1% is not nothing, but not a huge saving either. Though arguably, with 50 more of these patches, we can make the build twice faster :-) Refs #1. Closes scylladb/scylladb#28992 * github.com:scylladb/scylladb: config: move named_value<T> method bodies out-of-line config: suppress named_value<T> instantiation in every source file	2026-04-15 14:40:15 +03:00
Botond Dénes	00d8470554	Merge 'test: filter benign shutdown errors in tests that grep logs directly' from Marcin Maliszkiewicz Tests that call grep_for_errors() directly and assert no errors can fail spuriously due to benign RPC errors during graceful shutdown (e.g. "connection dropped: Semaphore broken"), which are already filtered by the after_test hook via filter_errors(). Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1464 Backport: no, tests fix (we may decide to backport later if it occurs on release branches) Closes scylladb/scylladb#29463 * github.com:scylladb/scylladb: test: filter benign errors in tests that grep logs during shutdown test: filter_errors: support list[list[str]] error groups	2026-04-15 14:40:15 +03:00
Piotr Szymaniak	5b00675bf0	storage_proxy: expedite speculative retry on replica disconnect When a replica disconnects during a digest read (e.g., during decommission), the speculating_read_executor now immediately fires the pending speculative retry instead of waiting for the timer. On DISCONNECT, the digest_read_resolver invokes an _on_disconnect callback set by the executor. The callback cancels the speculate timer and rearms it to clock_type::now() (lowres_clock::now() = thread-local memory read, no syscall). The existing timer callback fires on the next reactor poll with all its logic intact — checking is_completed(), calling add_wait_targets(1), sending the request, and incrementing speculative_digest_reads/speculative_data_reads. The notification is fire-and-forget: on_error() does NOT absorb the DISCONNECT. The existing error arithmetic in digest_read_resolver already handles this correctly because _target_count_for_cl accounts for the speculative target. For never_speculating_read_executor (no spare target) and always_speculating_read_executor (all requests sent upfront), _on_disconnect is never set — no behavior change. Fixes scylladb/scylladb#26307 Closes scylladb/scylladb#29428	2026-04-15 14:40:15 +03:00
Raphael S. Carvalho	a2eed4bb45	service: Use optimistic replicas in all_sibling_tablet_replicas_colocated all_sibling_tablet_replicas_colocated was using committed ti.replicas to decide whether sibling tablets are co-located and merge can be finalized. This caused a false non-co-located window when a co-located pair was moved by the load balancer: as both tablets migrate together, their del_transition commits may land in different Raft rounds. After the first commit, ti.replicas diverge temporarily (one tablet shows the new position, the other the old), causing all_sibling_tablet_replicas_colocated to return false. This clears finalize_resize, allowing the load balancer to start new cascading migrations that delay merge finalization by tens of seconds. Fix this by using the optimistic replica view (trinfo->next when transitioning, ti.replicas otherwise) — the same view the load balancer uses for load accounting — so finalize_resize stays populated throughout an in-flight migration and no spurious cascades are triggered. Steps that lead to the problem: 1. Merge is triggered. The load balancer generates co-location migrations for all sibling pairs that are not yet on the same shard. Some pairs finish co-location before others. 2. Once all pairs are co-located in committed state, all_sibling_tablet_replicas_colocated returns true and finalize_resize is set. Meanwhile the load balancer may have already started a regular LB migration on one co-located pair (both tablets are stable and the load balancer is free to move them). 3. The LB migration moves both tablets together (colocated_tablets). Their two del_transition commits land in separate Raft rounds. After the first commit, ti.replicas[t1] = new position but ti.replicas[t2] = old position. 4. In this window, all_sibling_tablet_replicas_colocated sees the pair as NOT co-located, clears finalize_resize, and the load balancer generates new migrations for other tablets to rebalance the load that the pair move created. 5. 
Those new migrations can take tens of seconds to stream, keeping the coordinator in handle_tablet_migration mode and preventing maybe_start_tablet_resize_finalization from being called. The merge finalization is delayed until all those cascaded migrations complete. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-821. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1459. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Closes scylladb/scylladb#29465	2026-04-15 14:40:15 +03:00
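The false "not co-located" window from commit a2eed4bb45 can be reproduced in a few lines. This Python model is a simplification: the dataclasses and function names are hypothetical, replicas are reduced to a set of (host, shard) pairs, and only the rule "use trinfo->next when transitioning, ti.replicas otherwise" is taken from the commit message.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TabletInfo:
    replicas: frozenset  # committed replica set, modeled as (host, shard) pairs

@dataclass
class TransitionInfo:
    next: frozenset      # replica set the in-flight migration is moving to

def optimistic_replicas(ti: TabletInfo, trinfo: Optional[TransitionInfo]):
    # Same view the load balancer is said to use for load accounting.
    return trinfo.next if trinfo is not None else ti.replicas

def colocated_committed(t1, t2):
    return t1.replicas == t2.replicas

def colocated_optimistic(t1, tr1, t2, tr2):
    return optimistic_replicas(t1, tr1) == optimistic_replicas(t2, tr2)

old = frozenset({("host1", 0)})
new = frozenset({("host2", 3)})
# Step 3 of the scenario: t1's del_transition has committed, t2's has not.
t1, tr1 = TabletInfo(new), None
t2, tr2 = TabletInfo(old), TransitionInfo(next=new)
```

Here `colocated_committed(t1, t2)` is False while `colocated_optimistic(t1, tr1, t2, tr2)` is True, which is the distinction that keeps finalize_resize populated during the migration.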
Marcin Maliszkiewicz	53b6e9fda5	Merge 'Make DESCRIBE CLUSTER get cluster information from storage_service' from Pavel Emelyanov Currently the statement returns cluster, partitioner and snitch names by accessing global db::config via database. As the part of an effort to detach components from global db::config, this PR tweaks the statement handler to get the cluster information from some other source. Currently the needed cluster information is stored in different components, but they are all under storage_service umbrella which seems to be a good central source of this truth. Unit test included. Cleaning components inter-dependencies, not backporting Closes scylladb/scylladb#29429 * github.com:scylladb/scylladb: test: Add test_describe_cluster_sanity for DESCRIBE CLUSTER validation describe_statement: Get cluster info from storage_service storage_service: Add describe_cluster() method query_processor: Expose storage_service accessor	2026-04-15 14:40:15 +03:00
Botond Dénes	d0e99e018b	reader_concurrency_semaphore: drop unused stop_ext_{pre,post}() Left over from primordial times, when reader_concurrency_semaphore was a base class for extensions in the separate enterprise repository. Also remove the now unneeded virtual marker from the destructor. Closes scylladb/scylladb#29399	2026-04-15 14:40:15 +03:00
Botond Dénes	4a2d032c6f	Merge 'query: result_set: change row member to a chunked vector' from Benny Halevy To prevent large memory allocations. This series shows over 3% improvement in perf-simple-query throughput. ``` $ build/release/scylla perf-simple-query --default-log-level=error --smp=1 --random-seed=1855519715 random-seed=1855519715 enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... Before: random-seed=1775976514 enable-cache=1 enable-index-cache=1 sstable-summary-ratio=0.0005 sstable-format=me Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... 336345.11 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32788 insns/op, 12430 cycles/op, 0 errors) 348748.14 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32794 insns/op, 12335 cycles/op, 0 errors) 349012.63 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32800 insns/op, 12326 cycles/op, 0 errors) 350629.97 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32770 insns/op, 12270 cycles/op, 0 errors) 348585.00 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32804 insns/op, 12338 cycles/op, 0 errors) throughput: mean= 346664.17 standard-deviation=5825.77 median= 348748.14 median-absolute-deviation=2348.46 maximum=350629.97 minimum=336345.11 instructions_per_op: mean= 32791.35 standard-deviation=13.60 median= 32794.47 median-absolute-deviation=8.65 maximum=32804.45 minimum=32769.57 cpu_cycles_per_op: mean= 12340.05 standard-deviation=57.57 median= 12335.05 median-absolute-deviation=13.94 maximum=12430.42 minimum=12270.28 After: random-seed=1775976514 enable-cache=1 enable-index-cache=1 sstable-summary-ratio=0.0005 sstable-format=me Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no} Disabling auto 
compaction Creating 10000 partitions... 353770.85 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32762 insns/op, 11893 cycles/op, 0 errors) 364447.98 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32738 insns/op, 11818 cycles/op, 0 errors) 365268.97 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32734 insns/op, 11788 cycles/op, 0 errors) 344304.87 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32746 insns/op, 12506 cycles/op, 0 errors) 362263.57 tps ( 58.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 32756 insns/op, 11888 cycles/op, 0 errors) throughput: mean= 358011.25 standard-deviation=8916.76 median= 362263.57 median-absolute-deviation=6436.74 maximum=365268.97 minimum=344304.87 instructions_per_op: mean= 32747.06 standard-deviation=11.85 median= 32745.80 median-absolute-deviation=9.36 maximum=32762.18 minimum=32734.01 cpu_cycles_per_op: mean= 11978.65 standard-deviation=298.06 median= 11887.96 median-absolute-deviation=160.96 maximum=12505.72 minimum=11788.49 ``` Refs #28511 (Refs rather than Fixes for the lack of a reproducer unit test) * No backport needed as the issue is rare and not severe Closes scylladb/scylladb#28631 * github.com:scylladb/scylladb: query: result_set: change row member to a chunked vector query: result_set_row: make noexcept query: non_null_data_value: assert is_nothrow_move_constructible and assignable types: data_value: assert is_nothrow_move_constructible and assignable	2026-04-15 14:40:15 +03:00
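The "over 3%" figure in merge 4a2d032c6f can be checked directly from the five throughput samples quoted for each run. The short Python calculation below uses only numbers from the commit message; the variable names are illustrative.

```python
from statistics import mean

# tps samples copied from the "Before" and "After" runs in the commit message.
before = [336345.11, 348748.14, 349012.63, 350629.97, 348585.00]
after = [353770.85, 364447.98, 365268.97, 344304.87, 362263.57]

mean_before = mean(before)  # reported as 346664.17
mean_after = mean(after)    # reported as 358011.25
improvement = mean_after / mean_before - 1.0  # relative throughput gain
```

The means match the reported values and the relative gain comes out at roughly 3.3%. Note that the "After" standard deviation (8916.76) is close to the size of the difference between the means, so five samples give only a rough estimate.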
Nadav Har'El	1eb8d170dd	Merge 'vector_index: allow recreating vector indexes on the same column' from Dawid Pawlik This series allows creating multiple vector indexes on the same column so users can rebuild an index without losing query availability. The intended flow is: 1. Create a new vector index on a column that already has one. 2. Keep serving ANN queries from the old index while the new one is being built. 3. Verify the new index is ready. 4. Automatically switch to the remaining index. 5. Drop the old index. To make that deterministic, `index_version` is changed from the base table schema version to a real creation timeuuid. When multiple vector indexes exist on the same column, ANN query planning now picks the index according to the routing implemented in Vector Store (newest serving index). This keeps queries on the old index until the new one is up and ready. This patch also removes the create-time restriction that rejected a second vector index on the same column. Name collisions are still rejected as before. Test coverage is updated accordingly: - Scylla now verifies that two vector indexes can coexist on the same column. - Cassandra/SAI behavior is still covered and is still expected to reject duplicate indexes on the same column. Fixes: VECTOR-610 Closes scylladb/scylladb#29407 * github.com:scylladb/scylladb: docs: document vector index metadata and duplicate handling test/cqlpy: cover vector index duplicate creation rules vector_index: allow multiple named indexes on one column vector_index: store `index_version` as creation timeuuid	2026-04-15 14:40:15 +03:00
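The "newest serving index" rule from merge 1eb8d170dd can be sketched as a simple selection. This is a hedged illustration, not the Vector Store routing code: the `VectorIndex` dataclass, its `serving` flag, and the integer `created_at` (standing in for the timestamp embedded in the `index_version` timeuuid) are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class VectorIndex:
    name: str
    created_at: int  # stand-in for the creation time in the timeuuid
    serving: bool    # whether the index is built and able to answer queries

def pick_index(indexes):
    """Return the newest index that is serving, or None if none is ready."""
    serving = [index for index in indexes if index.serving]
    if not serving:
        return None
    return max(serving, key=lambda index: index.created_at)

old = VectorIndex("idx_v1", created_at=100, serving=True)
new = VectorIndex("idx_v2", created_at=200, serving=False)
```

While `idx_v2` is still building, `pick_index([old, new])` returns `idx_v1`; once `new.serving` becomes True the same call returns `idx_v2`, after which the old index can be dropped without a gap in availability.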
Botond Dénes	a9c86fc2e4	docs: document schema subcomponent in sstable-scylla-format.md Commit `234f905` (sstables: scylla_metadata: add schema member) added a new Schema subcomponent (tag 11) to scylla_metadata. Document it in the sstable Scylla format reference: - Add schema to the subcomponent grammar enumeration - Add a summary entry describing the subcomponent (tag 11) and its purpose - Add a detailed ## schema subcomponent section with the binary grammar, covering table_id, table_schema_version, keyspace_name, table_name and the column_description array (column_kind, column_name, column_type) Fixes https://github.com/scylladb/scylladb/issues/27960 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Closes scylladb/scylladb#28983	2026-04-15 14:40:15 +03:00
Botond Dénes	5891efc2ca	Merge 'service: add missing replicas if tablet rebuild was rolled back' from Aleksandra Martyniuk An RF change of a tablet keyspace starts tablet rebuilds. Even if any of the rebuilds is rolled back (because the pending replica was excluded), the RF change request finishes successfully. In this case we end up with a state of the replicas that isn't compatible with the expected keyspace replication. Modify the topology coordinator so that, if it were to be idle, it starts checking whether there are any missing replicas. It moves to transition_state::tablet_migration and runs the required rebuilds. If a new RF change request encounters an invalid state of replicas, it fails. The state will be fixed later and the analogous ALTER KEYSPACE statement will be allowed. Fixes: SCYLLADB-109. Requires backport to all versions with tablet keyspace rf change. Closes scylladb/scylladb#28709 * github.com:scylladb/scylladb: test: add test_failed_tablet_rebuild_is_retried_on_alter test: add a test to ensure that failed rebuilds are retried service: fail ALTER KEYSPACE if replicas do not satisfy the replication service: retry failed tablet rebuilds service: maybe_start_tablet_migration returns std::optional<group0_guard>	2026-04-15 14:40:15 +03:00
David Garcia	0eaa42c846	docs: Makefile: drop redundant -t $(FLAG) from sphinx options Related scylladb/scylladb-docs-homepage#153. make multiversion failed under Sphinx 8+ with: ``` sphinx-build: error: argument --tag/-t: expected one argument subprocess.CalledProcessError: Command '(..., '-m', 'sphinx', '-t', '-D', 'smv_metadata_path=...', ..., 'manual')' returned non-zero exit status 2. make: *** [multiversion] Error 1 ``` sphinx-multiversion's arg forwarding splits `-t manual`, sending `-t` into the options slot and `manual` to the trailing FILENAMES positional. Sphinx 7 silently tolerated the dangling `-t`; Sphinx 8+'s stricter argparse CLI rejects it. Instead, it now reads FLAGS from an env variable. How to test: ```` make multiversion make FLAG=opensource multiversion ```` Both complete and switch variants correctly. chore: rm empty lines Closes scylladb/scylladb#29472	2026-04-15 14:40:15 +03:00
dependabot[bot]	280ffe107f	build(deps): bump sphinx-multiversion-scylla in /docs Bumps [sphinx-multiversion-scylla](https://holzhaus.github.io/sphinx-multiversion/) from 0.3.7 to 0.3.8. --- updated-dependencies: - dependency-name: sphinx-multiversion-scylla dependency-version: 0.3.8 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Closes scylladb/scylladb#29466	2026-04-15 14:40:15 +03:00
Raphael S. Carvalho	1529605b32	logstor: Fix dangling reference captures and shadowed loc variable Three bugs fixed in segment_manager.cc: 1. write_to_separator(): captured [&index] where index was a local coroutine-frame reference. The future is stored in buf.pending_updates and resolved later in flush_separator_buffer(), by which time the enclosing coroutine frame is destroyed, making &index a dangling pointer. This is a use-after-free that manifests as a segfault. Fix: capture index_ptr (raw pointer by value) instead. 2. add_segment_to_compaction_group(): same dangling [&index] pattern inside the for_each_live_record lambda during recovery. Same fix applied. 3. write(): local 'auto loc = seg->allocate(...)' shadowed the outer 'log_location loc', causing the function to always return a zero-initialized log_location{}. Fix: remove 'auto' so the assignment targets the outer variable. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Closes scylladb/scylladb#29451	2026-04-15 14:40:15 +03:00
Tomasz Grabiec	266a225416	utils: avoid exceptions in disk_space_monitor polling loop The poll loop used condition_variable::wait(timeout) to sleep between iterations. On every normal timeout expiry, this threw a condition_variable_timed_out exception, which incremented the C++ exception metric and triggered false alerts for support. Replace the timed wait with a seastar::timer that broadcasts the condition variable on expiry, combined with an untimed wait(). The timer is cancelled automatically on scope exit when the wait is woken early by trigger_poll() or abort. Fixes SCYLLADB-1477 Closes scylladb/scylladb#29438	2026-04-15 14:40:15 +03:00
Pavel Emelyanov	a428472e50	db: Remove redundant enable_logstor config option The enable_logstor configuration option is redundant with the 'logstor' experimental feature flag. Consolidate to a single gate: use the experimental feature to control both whether logstor is available for table creation and whether it is initialized at database startup. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#29427	2026-04-15 14:40:15 +03:00
Botond Dénes	87eb20ba33	Merge 'cql: Include parallelized queries in the scylla_cql_select_partition_range_scan_no_bypass_cache metric' from Tomasz Grabiec This metric is used to catch execution of scans which go via the row cache, which can have a bad effect on performance. Since `f344bd0aaa`, aggregate queries go via a new statement class: parallelized_select_statement. This class inherits from select_statement directly rather than from primary_key_select_statement. The range scan detection logic (_range_scan, _range_scan_no_bypass_cache) was only in primary_key_select_statement's constructor, so parallelized queries were not counted in select_partition_range_scan and select_partition_range_scan_no_bypass_cache metrics. Fix by moving the range scan detection into select_statement's constructor, so that all subclasses get it. No backport: enhancement Closes scylladb/scylladb#29422 * github.com:scylladb/scylladb: cql: Include parallelized queries in the scylla_cql_select_partition_range_scan_no_bypass_cache metric test: cluster: dtest: Fix double-counting of metrics	2026-04-15 14:40:15 +03:00
Botond Dénes	aecb6b1d76	Merge 'auth: sanitize {USER} substitution in LDAP URL template' from Piotr Smaron `LDAPRoleManager` interpolated usernames directly into `ldap_url_template`, allowing LDAP filter injection and URL structure manipulation via crafted usernames. This PR adds two layers of encoding when substituting `{USER}`: 1. RFC 4515 filter escaping: neutralises `*`, `(`, `)`, `\`, NUL 2. URL percent-encoding: prevents `%`, `?`, `#` from breaking `ldap_url_parse`'s component splitting or undoing the filter escaping It also adds `validate_query_template()` at startup to reject templates that place `{USER}` outside the filter component (e.g. in the host or base DN), where filter escaping would be the wrong defense. Fixes: SCYLLADB-1309 Compatibility note: Templates with `{USER}` in the host, base DN, attributes, or extensions were previously silently accepted. They are now rejected at startup with a descriptive error. Only templates with `{USER}` in the filter component (after the third `?`) are valid. Fixes: SCYLLADB-1309 Due to its severity, this should be backported to all maintained versions. Closes scylladb/scylladb#29388 * github.com:scylladb/scylladb: auth: sanitize {USER} substitution in LDAP URL templates test/ldap: add LDAP filter-injection reproducers	2026-04-15 14:40:15 +03:00
Artsiom Mishuta	146a67cf6f	test: explicitly wait for schema agreement in create_new_test_keyspace Add an explicit wait_for_schema_agreement() call after CREATE KEYSPACE in create_new_test_keyspace to ensure all nodes have applied the schema before proceeding. Closes scylladb/scylladb#29371	2026-04-15 14:40:15 +03:00
Pavel Emelyanov	54e3c648a5	test/cluster/dtest: improve diagnostics in test_update_schema_while_node_is_killed The alter_table case has a known failure where point lookups at QUORUM return 0 rows after node2 restarts, even though: - the schema was correctly synced (ALTER TABLE received from cluster) - the data commitlog was replayed (21 mutations, 0 skipped) - all 3 nodes were alive, so QUORUM (2/3) should be satisfiable by node1+node3 regardless of node2's state The LIMIT 1 table scan succeeds (data is present somewhere), but specific key lookups return empty. This points to a bug in how node2, acting as coordinator after restart, routes single-partition reads — most likely stale tablet routing metadata. Add diagnostics to help distinguish data loss from a coordinator/routing bug on the next failure: - log which key is missing - dump all rows visible at QUORUM - query each node individually at ONE consistency for the missing key Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Closes scylladb/scylladb#29350	2026-04-15 14:40:15 +03:00
Piotr Szymaniak	4c93c2af62	audit/alternator: support audit_tables=alternator.<table> shorthand The real keyspace name of an Alternator table T is "alternator_T". Expand the "alternator.T" format used in the audit_tables config flag to the real keyspace name at parse time, so users don't need to spell out the internal "alternator_T.T" form.	2026-04-15 12:29:15 +02:00
Piotr Szymaniak	0714d8aded	audit/alternator: Add negative audit tests Add tests for the unhappy path of Alternator audit logging: - Category filtering: operations are not logged when their category (DML, QUERY, DDL) is excluded from audit_categories. - Keyspace filtering: operations on a keyspace not listed in audit_keyspaces are not logged. - Error entries: a failed operation (thrown exception after audit_info is set) produces an audit entry with error=true. - Empty-keyspace bypass: global operations like ListTables and DescribeEndpoints are logged regardless of audit_keyspaces because should_log() short-circuits on an empty keyspace.	2026-04-15 12:29:15 +02:00
Piotr Szymaniak	ad05b44931	audit/alternator: Add testing of auditing There is a new test file created, `test/alternator/test_audit.py`. The file contains a suite of tests of all auditing operations.	2026-04-15 12:29:15 +02:00
Piotr Szymaniak	6913efab5c	audit/alternator: Audit requests Both the successful ones as well as the failed ones are audited. Each Alternator operation sets up audit metadata via an executor::maybe_audit() helper, which checks will_log() and only heap-allocates audit_info_alternator when auditing is enabled. DDL and metadata operations pass no consistency level; data read/write operations pass the actual CL used. BatchWriteItem and BatchGetItem guard table name collection with will_log() to avoid unnecessary work when auditing is disabled. ListStreams audits the input table name rather than collecting output table names during iteration. UntagResource sets up auditing after parameter validation. Exception re-throw in server.cc uses co_return coroutine::exception(). The chosen audit types for the operations: - CreateTable - DDL - DescribeTable - QUERY - DeleteTable - DDL - UpdateTable - DDL - PutItem - DML - UpdateItem - DML - GetItem - QUERY - DeleteItem - DML - ListTables - QUERY - Scan - QUERY - DescribeEndpoints - QUERY - BatchWriteItem - DML - BatchGetItem - QUERY - Query - QUERY - TagResource - DDL - UntagResource - DDL - ListTagsOfResource - QUERY - UpdateTimeToLive - DDL - DescribeTimeToLive - QUERY - ListStreams - QUERY - DescribeStream - QUERY - GetShardIterator - QUERY - GetRecords - QUERY - DescribeContinuousBackups - QUERY	2026-04-15 11:55:42 +02:00
Piotr Szymaniak	9646ee05bd	audit/alternator: Refactor in preparation for auditing Alternator Prepare the audit API for auditing Alternator. The API provides externally callable `inspect()` functions for both CQL and Alternator. Both variants of the function unpack their parameters and then call a common `maybe_log()`, which can then call `log()` when conditions are met. Also, while I was at it, (const) references were favoured over raw pointers. The Alternator audit_info subclass (audit_info_alternator) carries an optional consistency level: only data read/write operations have a meaningful CL, while DDL and metadata queries store an empty string in the audit table and syslog (matching the existing write_login behavior). The storage helpers are updated accordingly. Add a will_log(category, keyspace, table) method that checks whether an operation should be audited (category check AND keyspace/table filtering) without requiring a constructed audit_info object. should_log() delegates to will_log().	2026-04-15 11:46:44 +02:00
Tomasz Grabiec	84361194c2	test: boost: tablets: Add test for merge with arbitrary tablet count	2026-04-15 10:40:56 +02:00
Tomasz Grabiec	7af9f5366d	tablets, database: Advertise 'arbitrary' layout in snapshot manifest Currently, the manifest advertises "powof2", which is wrong for arbitrary count and boundaries. Introduce a new kind of layout called "arbitrary", and produce it if the tablet map doesn't conform to "powof2" layout. We should also produce tablet boundaries in this case, but that's worked on in a different PR: https://github.com/scylladb/scylladb/pull/28525	2026-04-15 10:40:56 +02:00
Tomasz Grabiec	50fbac6ea6	tablets: Introduce pow2_count per-table tablet option By default it's true, in which case tablet count of the table is rounded up to a power of two. This option allows lifting this, in which case the count can be arbitrary. This will allow testing the logic of arbitrary tablet count.	2026-04-15 10:40:56 +02:00
Tomasz Grabiec	b6a7023f68	tablets: Prepare for non-power-of-two tablet count This is a step towards more flexibility in managing tablets. A prerequisite before we can split individual tablets, isolating hot partitions, and evening-out tablet sizes by shifting boundaries. After this patch, the system can handle tables with arbitrary tablet count. Tablet allocator is still rounding up desired tablet count to the nearest power of two when allocating tablets for a new table, so unless the tablet map is allocated in some other way, the counts will be still a power of two. We plan to utilize arbitrary count when migrating from vnodes to tablets, by creating a tablet map which matches vnode boundaries. One of the reasons we don't give up on power-of-two by default yet is that it creates an issue with merges. If tablet count is odd, one of the tablets doesn't have a sibling and will not be merged. That can obviously cause imbalance of token space and tablet sizes between tablets. To limit the impact, this patch dynamically chooses which tablet to isolate when initiating a merge. The largest tablet is chosen, as that will minimize imbalance. Otherwise, if we always chose the last tablet to isolate, its size would remain the same while other tablets double in size with each odd-count merge, leading to imbalance. The imbalance will still be there, but the difference in tablet sizes is limited to 2x. Example (3 tablets): [0] owns 1/3 of tokens [1] owns 1/3 of tokens [2] owns 1/3 of tokens After merge: [0] owns 2/3 of tokens [1] owns 1/3 of tokens What we would like instead: Step 1 (split [1]): [0] owns 1/3 of tokens [1] old 1.left, owns 1/6 of tokens [2] old 1.right, owns 1/6 of tokens [3] owns 1/3 of tokens Step 2 (merge): [0] owns 1/2 of tokens [1] owns 1/2 of tokens To do that, we need to be able to split individual tablets, but we're not there yet.	2026-04-15 10:40:55 +02:00
Tomasz Grabiec	f54daef4ec	tablets: Implement merged tablet_map constructor on top of for_each_sibling_tablets() This way it doesn't need to know how the scheduler chose to merge tablets. We'll have less duplication of logic.	2026-04-15 10:40:55 +02:00
Tomasz Grabiec	66fc7967b8	tablets: Prepare resize_decision to hold data in decisions merge decision will carry a plan - which replica to isolate. So construction from a string will no longer do.	2026-04-15 10:40:55 +02:00
Tomasz Grabiec	d543f260bd	tablets: table: Make storage_group handle arbitrary merge boundaries We only assume that new tablets have boundaries which are equal to some boundaries of old tablets. In preparation for supporting arbitrary merge plan, where any replica can be isolated (not merged with siblings) by the merge plan.	2026-04-15 10:40:55 +02:00
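
The odd-count merge described in commit b6a7023f68 above can be illustrated with a small sketch. This is not Scylla's implementation: `pick_isolated` and `merge` are hypothetical helpers, sizes are abstract numbers, and the restriction of candidates to even indices is an assumption made here so that the tablets on either side of the isolated one can still pair with an adjacent sibling.

```python
from fractions import Fraction


def pick_isolated(sizes):
    """Return the index of the tablet to leave unmerged, or None for an even count.

    Hypothetical helper. Assumption of this sketch: only even indices are
    candidates, so every other tablet still has an adjacent sibling.
    """
    if len(sizes) % 2 == 0:
        return None
    candidates = range(0, len(sizes), 2)
    return max(candidates, key=lambda i: sizes[i])


def merge(sizes, isolated=None):
    """Merge adjacent pairs of tablets, keeping the isolated tablet as-is."""
    merged = []
    i = 0
    while i < len(sizes):
        if i == isolated:
            merged.append(sizes[i])  # no sibling: carried over unchanged
            i += 1
        else:
            merged.append(sizes[i] + sizes[i + 1])  # sibling pair becomes one tablet
            i += 2
    return merged


# The 3-tablet example from the commit message, isolating the last tablet.
third = Fraction(1, 3)
after_merge = merge([third, third, third], isolated=2)

# A skewed table: isolating the largest tablet keeps sizes within 2x,
# while always isolating the last tablet does not.
skewed = [1, 1, 4, 1, 1]
isolate_largest = merge(skewed, pick_isolated(skewed))
isolate_last = merge(skewed, len(skewed) - 1)
```

With the skewed input, isolating the largest tablet yields sizes `[2, 4, 2]`, whereas isolating the last one yields `[2, 5, 1]`, which is the kind of imbalance the commit message aims to limit.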


53237 Commits