In PR 5b6570be52 we introduced the config option `sstable_compression_user_table_options` to allow adjusting the default compression settings for user tables. However, the new option was hooked into the CQL layer and applied only to CQL base tables, not to the whole spectrum of user tables: CQL auxiliary tables (materialized views, secondary indexes, CDC log tables), Alternator base tables, Alternator auxiliary tables (GSIs, LSIs, Streams).
This gap also led to inconsistent default compression algorithms after we changed the option’s default algorithm from LZ4 to LZ4WithDicts (adf9c426c2).
This series introduces a general “schema initializer” mechanism in `schema_builder` and uses it to apply the default compression settings uniformly across all user tables. This ensures that all base and aux tables take their default compression settings from config.
Fixes #26914.
Backport justification: LZ4WithDicts is the new default since 2025.4, but the config option exists since 2025.2. Based on severity, I suggest we backport only to 2025.4 to maintain consistency of the defaults.
- (cherry picked from commit 4ec7a064a9)
- (cherry picked from commit 76b2d0f961)
- (cherry picked from commit 5b4aa4b6a6)
- (cherry picked from commit d5ec66bc0c)
- (cherry picked from commit 1e37781d86)
- (cherry picked from commit 7fa1f87355)
Parent PR: #27204
Closes scylladb/scylladb#28305
* github.com:scylladb/scylladb:
db/config: Update sstable_compression_user_table_options description
schema: Add initializer for compression defaults
schema: Generalize static configurators into schema initializers
schema: Initialize static properties eagerly
db: config: Add accessor for sstable_compression_user_table_options
test: Check that CQL and Alternator tables respect compression config
test/cqlpy: test compression setting for auxiliary table
test/alternator: tests for schema of Alternator table
Previously we only inspected std::system_error inside
std::nested_exception to support a specific TLS-related failure
mode. However, nested exceptions may contain any type, including
other restartable (retryable) errors. This change unwraps one
nested exception per iteration and re-applies all known handlers
until a match is found or the chain is exhausted.
Closes scylladb/scylladb#28240
(cherry picked from commit cb2aa85cf5)
Closes scylladb/scylladb#28344
Extend the `static_configurator` mechanism to support initialization of
arbitrary schema properties, not only static ones, by passing a
`schema_builder` reference to the configurator interface.
As part of this change, rename `static_configurator` to
`schema_initializer` to better reflect its broader responsibility.
Add a checkpoint/restore mechanism to allow de-registering an
initializer (useful for testing; will be used in the next patch).
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit d5ec66bc0c)
The `sstable_compression_user_table_options` config option determines
the default compression settings for user tables.
In patch 2fc812a1b9, the default value of this option was changed from
LZ4 to LZ4WithDicts, and fallback logic was introduced during startup
to temporarily revert the option to LZ4 until the dictionary compression
feature is enabled.
Replace this fallback logic with an accessor that returns the correct
settings depending on the feature flag. This is cleaner and more
consistent with the way we handle the `sstable_format` option, where the
same problem appears (see `get_preferred_sstable_version()`).
As a consequence, the configuration option must always be accessed
through this accessor. Add a comment to point this out.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 76b2d0f961)
The API contract in partition_version.hh states that when dealing with
evictable entries, a real cache tracker pointer has to be passed to all
methods that ask for it. The nonpopulating reader violates this, passing
a nullptr to the snapshot. This was observed to cause a crash when a
concurrent cache read accessed the snapshot with the null tracker.
A reproducer is included which fails before and passes after the fix.
Fixes: #26847
Closes scylladb/scylladb#28163
(cherry picked from commit a53f989d2f)
Closes scylladb/scylladb#28279
reader_permit::release_base_resources() is a soft evict for the permit:
it releases the resources acquired during admission. This is used in
cases where a single process owns multiple permits, creating a risk of
deadlock, as is the case for repair. In this case,
release_base_resources() acts as a manual eviction mechanism to prevent
permits from blocking each other's admission.
Recently we found a bad interaction between release_base_resources() and
permit eviction. Repair uses both mechanisms: it marks its permits as
inactive and later also calls release_base_resources(). This practice
might be worth reconsidering, but the fact remains that there is a bug
in the reader permit which causes the base resources to be released
twice when release_base_resources() is called on an already evicted
permit. This is incorrect and is fixed in this patch.
Improve release_base_resources():
* make _base_resources const
* move the signal call inside the `if (_base_resources_consumed())` block
* use reader_permit::impl::signal() instead of
reader_concurrency_semaphore::signal()
* all places where base resources are released now call
release_base_resources()
A reproducer unit test is added, which fails before and passes after the
fix.
Fixes: #28083
Closes scylladb/scylladb#28155
(cherry picked from commit b7bc48e7b7)
Closes scylladb/scylladb#28245
The loop that unwraps nested exceptions rethrows the nested exception, saves a pointer to the temporary std::exception& inner on the stack, then continues. This pointer thus ends up pointing at a destroyed temporary.
Closes scylladb/scylladb#28143
(cherry picked from commit 829bd9b598)
Closes scylladb/scylladb#28243
This patch adds tablet repair progress reporting so that the user
can use the /task_manager/task_status API to query the progress.
To support this, a new system table is introduced to record the
user-request-related info, i.e., the start and the end of the
request.
The progress remains accurate even when a tablet split or merge happens
in the middle of the request, since the tokens of the tablet are
recorded when the request starts and when the repair of each tablet
finishes. The original tablet repair is considered finished when the
finished ranges cover the original tablet's token ranges.
After this patch, the /task_manager/task_status API will report correct
progress_total and progress_completed.
Fixes #22564
Fixes #26896
Closes scylladb/scylladb#27679
(cherry picked from commit 4f77dd058d)
Closes scylladb/scylladb#28065
The `make_key` lambda erroneously allocates a fixed 8-byte buffer
(`sizeof(s.size())`) for variable-length strings, potentially causing
uninitialized bytes to be included. If such bytes exist and they are
not valid UTF-8 characters, deserialization fails:
```
ERROR 2026-01-16 08:18:26,062 [shard 0:main] testlog - snapshot_list_contains_dropped_tables: cql env callback failed, error: exceptions::invalid_request_exception (Exception while binding column p1: marshaling error: Validation failed - non-UTF8 character in a UTF8 string, at byte offset 7)
```
Fixes #28195.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closes scylladb/scylladb#28197
(cherry picked from commit 8aca7b0eb9)
Closes scylladb/scylladb#28209
Fixes #27992
When doing a commitlog oversized allocation, we lock out all other writers by grabbing
the _request_controller semaphore fully (max capacity).
We thereafter assert that the semaphore is in fact zero. However, due to how the
bookkeeping works here, the semaphore can in fact become negative (some paths will not
actually wait for the semaphore, because this could deadlock).
Thus, if, after we grab the semaphore and execution actually returns to us (task schedule),
new_buffer via segment::allocate is called (due to a non-fully-full segment), we might
in fact grab the segment overhead from zero, resulting in a negative semaphore.
The same problem applies later when we try to sanity-check the return of our permits.
The fix is trivial: just accept less-than-zero values, and take the same possible
less-than-zero value into account in the exit check (when returning units).
Added a whitebox (special callback interface for sync) unit test that provokes/creates
the race condition explicitly (and reliably).
Closes scylladb/scylladb#27998
(cherry picked from commit a7cdb602e1)
Closes scylladb/scylladb#28099
The VECTOR_SEARCH_INDEXING permission didn't work on CDC tables because we mistakenly checked for vector indexes on the CDC table instead of the base table.
This patch fixes that and adds a test that validates this behavior.
Fixes: VECTOR-476
Closes scylladb/scylladb#28050
(cherry picked from commit e2e479f20d)
Closes scylladb/scylladb#28068
Currently, database::truncate_table_on_all_shards calls table::can_flush only on the coordinator shard,
so it may miss shards with dirty data if the coordinator shard happens to have empty memtables, leading to clearing memtables with dirty data rather than flushing them.
This change fixes that by making flush safe to call even if the memtable list is empty, and calling it on every shard that can flush (i.e., seal_immediate_fn is engaged).
Also, change database_test::do_with_some_data to use random keys instead of hard-coded key names, to reproduce this issue with `snapshot_list_contains_dropped_tables`.
Fixes #27639
* The issue has existed since forever and might cause data loss due to wrongly clearing the memtable, so it needs a backport to all live versions
- (cherry picked from commit ec4069246d)
- (cherry picked from commit 5be6b80936)
- (cherry picked from commit 0342a24ee0)
- (cherry picked from commit 02ee341a03)
- (cherry picked from commit 2a803d2261)
- (cherry picked from commit 93b827c185)
- (cherry picked from commit ebd667a8e0)
Parent PR: #27643
Closes scylladb/scylladb#28074
* https://github.com/scylladb/scylladb:
test: database_test: do_with_some_data: randomize keys
database: truncate_table_on_all_shards: drop outdated TODO comment
database: truncate_table_on_all_shards: consider can_flush on all shards
memtable_list: unify can_flush and may_flush
test: database_test: add test_flush_empty_table_waits_on_outstanding_flush
replica: table, storage_group, compaction_group: add needs_flush
test: database_test: do_with_some_data_in_thread: accept void callback function
Currently, tablet allocation intentionally ignores the current load
(introduced by commit 1e407ab), which can cause identical shard
selection when allocating a small number of tablets in the same topology.
When a tablet allocator is asked to allocate N tablets (where N is smaller
than the number of shards on a node), it selects the first N lowest shards.
If multiple such tables are created, each allocator run picks the same
shards, leading to tablet imbalance across shards.
This change initializes the load sketch with the current shard load,
scaled into the [0,1] range, ensuring allocation remains even while
still starting from the globally least-loaded shards.
Fixes https://github.com/scylladb/scylladb/issues/27620
Closes https://github.com/scylladb/scylladb/pull/27802
Closes scylladb/scylladb#28060
Due to the recent changes in the vector store service,
the service needs to read two of the system tables
to function correctly. This was not accounted for
when the new permission was added. This patch fixes that
by allowing these tables (group0_history and versions)
to be read with the VECTOR_SEARCH_INDEXING permission.
We also add a test that validates this behavior.
Fixes: SCYLLADB-73
Closes scylladb/scylladb#27546
(cherry picked from commit ce3320a3ff)
Closes scylladb/scylladb#28042
Parent PR: #27546
The semaphore has detection and protection against regular resource
leaks, where some resources go unaccounted for and are not released by
the time the semaphore is destroyed. There is no detection or protection
against negative leaks, where resources are "made up" out of thin air.
This kind of leak looks benign at first sight: a few extra resources
won't hurt anyone as long as the amount is small. But it turns out that
even a single extra count resource can defeat a very important
anti-deadlock protection in can_admit_read(): the special case which
admits a new permit regardless of memory resources when all original
count resources are available. This check uses ==, so if resources >
original, the protection is defeated indefinitely. Instead of just
changing == to >=, we add detection of such negative leaks to signal(),
via on_internal_error_noexcept().
At this time I still don't know how this negative leak happens (the code
doesn't confess); with this detection, hopefully we'll get a clue from
tests or the field. Note that on_internal_error_noexcept() will not
generate a coredump unless ScyllaDB is explicitly configured to do so.
In production, it will just generate an error log with a backtrace.
The detection also clamps _resources to _initial_resources, to
prevent any damage from the negative leak.
I just noticed that there is no unit test for the deadlock protection
described above, so one is added in this PR, even if it is only loosely
related to the rest of the patch.
Fixes: SCYLLADB-163
Closes scylladb/scylladb#27764
(cherry picked from commit e4da0afb8d)
Closes scylladb/scylladb#28004
With randomized keys, and since we're inserting only 2 keys,
it is possible that they would end up owned only by a single shard,
reproducing #27639 in snapshot_list_contains_dropped_tables.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ebd667a8e0)
Test that table::flush waits on outstanding flushes, even if the active memtable is empty
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 0342a24ee0)
Many test cases already assume `func` is being called in a seastar
thread, and although the function they pass returns a (ready) future,
it serves no purpose other than to conform to the interface.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ec4069246d)
Fixes #24346
When reading, we check for each entry and each chunk whether advancing
there will hit EOF of the segment. However, if the last chunk being read
has the last entry _exactly_ matching the chunk size, and the chunk ends
at _exactly_ the segment size (preset size, typically 32 MB), we did not
check the position, and instead complained about not being able to read.
This has literally _never_ happened in an actual commitlog (that was
replayed, at least), but has apparently happened more and more in hints
replay.
The fix is simple: just check the file position against the size when
advancing said position, i.e. when reading (skipping already does).
v2:
* Added unit test
Closes scylladb/scylladb#27236
(cherry picked from commit 59c87025d1)
Closes scylladb/scylladb#27346
Currently, batchlog replay is considered successful even if all batches fail
to be sent (they are replayed later). However, repair requires all batches
to be sent successfully. Currently, if the batchlog isn't cleared, the repair
never learns this and updates the repair_time anyway. If GC mode is set to
"repair", this means that tombstones written before the repair_time (minus
propagation_delay) can be GC'd while not all batches were replayed.
Consider a scenario:
- Table t has a row with (pk=1, v=0);
- There is an entry in the batchlog that sets (pk=1, v=1) in table t;
- The row with pk=1 is deleted from table t;
- Table t is repaired:
- batchlog replay fails;
- repair_time is updated;
- propagation_delay seconds passes and the tombstone of pk=1 is GC'd;
- batchlog is replayed and (pk=1, v=1) inserted - data resurrection!
Do not update repair_time if sending any batch fails. The data is still repaired.
For tablet repair, the repair runs, but at the end the exception is passed
to the topology coordinator. Thanks to that, the repair_time isn't updated.
The repair request isn't removed either, due to which the repair will need
to rerun.
Apart from that, a batch is removed from the batchlog if its version is invalid
or unknown. The condition on which we consider a batch too fresh to replay
is updated to consider propagation_delay.
Fixes: https://github.com/scylladb/scylladb/issues/24415
Data resurrection fix; needs backport to all versions
- (cherry picked from commit 502b03dbc6)
- (cherry picked from commit 904183734f)
- (cherry picked from commit 7f20b66eff)
- (cherry picked from commit e1b2180092)
- (cherry picked from commit d436233209)
- (cherry picked from commit 1935268a87)
- (cherry picked from commit 6fc43f27d0)
Parent PR: #26319
Closes scylladb/scylladb#26766
* github.com:scylladb/scylladb:
repair: throw if flush failed in get_flush_time
db: fix indentation
test: add reproducer for data resurrection
repair: fail tablet repair if any batch wasn't sent successfully
db/batchlog_manager: fix making decision to skip batch replay
db: repair: throw if replay fails
db/batchlog_manager: delete batch with incorrect or unknown version
db/batchlog_manager: coroutinize replay_all_failed_batches
The tablet scheduler should not emit conflicting migrations for the same
tablet. This was addressed initially in scylladb/scylladb#26038 but the
check is missing in the merge colocation plan, so add it there as well.
Without this check, the merge colocation plan could generate a
conflicting migration for a tablet that is already scheduled for
migration, as the test demonstrates.
This can cause correctness problems, because if the load balancer
generates two migrations for a single tablet, both will be written as
mutations, and the resulting mutation could contain mixed cells from
both migrations.
Fixes scylladb/scylladb#27304
Closes scylladb/scylladb#27312
(cherry picked from commit 97b7c03709)
Closes scylladb/scylladb#27331
Return a flag determining whether all the batches were sent successfully in
batchlog_manager::replay_all_failed_batches (batches skipped due to being
too fresh are not counted). Throw in repair_flush_hints_batchlog_handler
if not all batches were replayed, to ensure that repair_time isn't updated.
(cherry picked from commit 7f20b66eff)
Vector indexes are going to be supported only for tablets (see VECTOR-322).
As a result, tests using vector indexes will be failing when run with vnodes.
This change ensures tests using vector indexes run exclusively with tablets.
Fixes: VECTOR-49
Closes scylladb/scylladb#27233
More efficient than 100 pings.
There was one ping in the test which was done "so this shard notices the
clock advance". It's not necessary, since observing a completed SMP
call implies that the local shard sees the clock advancement done within it.
(cherry picked from commit f83c4ffc68)
The primary issue with the old method is that each update is a separate
cross-shard call, and all later updates queue behind it. If one of the
shards has high latency for such calls, the queue may accumulate and
the system will appear unresponsive for mapping changes on non-zero shards.
This happened in the field when one of the shards was overloaded with
sstables and compaction work, which caused frequent stalls which
delayed polling for ~100ms. A queue of 3k address updates
accumulated. This made bootstrap impossible, since nodes couldn't
learn about the IP mapping for the bootstrapping node and streaming
failed.
To protect against that, use a more efficient method of replication
which requires a single cross-shard call to replicate all prior
updates.
It is also more reliable: if replication fails transiently for some
reason, we don't give up and fail all later updates.
Fixes #26865
Fixes #26835
(cherry picked from commit 4a85ea8eb2)
Key goals:
- efficient (batching updates)
- reliable (no lost updates)
Will be used in data structures maintained on one designated owning
shard and replicated to other shards.
(cherry picked from commit ed8d127457)
Somehow, the line of code responsible for freeing flushed nodes
in `trie_writer` is missing from the implementation.
This effectively means that `trie_writer` keeps the whole index in
memory until the index writer is closed, which for many datasets
is a guaranteed OOM.
Fix that, and add a test that catches this.
Fixes scylladb/scylladb#27082
Closes scylladb/scylladb#27083
(cherry picked from commit d8e299dbb2)
Closes scylladb/scylladb#27122
TemporaryHashes.db is a temporary sstable component used during ms
sstable writes. It's different from other sstable components in that
it's not included in the TOC. Because of this, it has a special case in
the logic that deletes unfinished sstables on boot.
(After Scylla dies in the middle of a sstable write).
But there's a bug in that special case,
which causes Scylla to forget to delete other components from the same unfinished sstable.
The code intends only to delete the TemporaryHashes.db file from the
`_state->generations_found` multimap, but it accidentally also deletes
the file's sibling components from the multimap. Fix that.
Also, extend a related test so that it would catch the problem before the fix.
Fixes scylladb/scylladb#26393
Bugfix, needs backport to 2025.4.
- (cherry picked from commit 16cb223d7f)
- (cherry picked from commit 6efb807c1a)
Parent PR: #26394
Closes scylladb/scylladb#26409
* github.com:scylladb/scylladb:
sstables/sstable_directory: don't forget to delete other components when deleting TemporaryHashes.db
test/boost/database_test: fix two no-op distributed loader tests
_last_key is a multi-fragment buffer.
Some prefix of _last_key (up to _last_key_mismatch) is
unneeded because it's already a part of the trie.
Some suffix of _last_key (after needed_prefix) is unneeded
because _last_key can be differentiated from its neighbors even without it.
The job of write_last_key() is to find the middle fragments,
(containing the range `[_last_key_mismatch, needed_prefix)`)
trim the first and last of the middle fragments appropriately,
and feed them to the trie writer.
But there's an error in the current logic,
in the case where `_last_key_mismatch` falls on a fragment boundary.
To describe it with an example, if the key is fragmented like
`aaa|bbb|ccc`, `_last_key_mismatch == 3`, and `needed_prefix == 7`,
then the intended output to the trie writer is `bbb|c`,
but the actual output is `|bbb|c`. (I.e. the first fragment is empty).
Technically the trie writer could handle empty fragments,
but it has an assertion against them, because they are a questionable thing.
Fix that.
We also extend bti_index_test so that it's able to hit the assert
violation (before the patch). The reason why it wasn't able to do that
before the patch is that the violation requires decorated keys to differ
on the _first_ byte of a partition key column, but the keys generated
by the test only differed on the last byte of the column.
(Because the test was using sequential integers to make the values more
human-readable during debugging). So we modify the key generation
to use random values that can differ on any position.
Fixes scylladb/scylladb#26819
Closes scylladb/scylladb#26839
(cherry picked from commit b82c2aec96)
Closes scylladb/scylladb#26903
`sstable_compression_user_table_options` allows configuring a node-global SSTable compression algorithm for user tables via scylla.yaml. The current default is LZ4Compressor (inherited from Cassandra).
Make LZ4WithDictsCompressor the new default. Metrics from real datasets in the field have shown significant improvements in compression ratios.
If the dictionary compression feature is not enabled in the cluster (e.g., during an upgrade), fall back to the `LZ4Compressor`. Once the feature is enabled, flip the default back to the dictionary compressor using a listener callback.
Fixes #26610.
- (cherry picked from commit d95ebe7058)
- (cherry picked from commit 96e727d7b9)
- (cherry picked from commit 2fc812a1b9)
- (cherry picked from commit a0bf932caa)
Parent PR: #26697
Closes scylladb/scylladb#26830
* github.com:scylladb/scylladb:
test/cluster: Add test for default SSTable compressor
db/config: Change default SSTable compressor to LZ4WithDictsCompressor
db/config: Deprecate sstable_compression_dictionaries_allow_in_ddl
boost/cql_query_test: Get expected compressor from config
This patch implements the changes required by the Vector Store authorization, as described in https://scylladb.atlassian.net/wiki/spaces/RND/pages/107085899/Vector+Store+Authentication+And+Authorization+To+ScyllaDB, that is:
- adding a new permission VECTOR_SEARCH_INDEXING, grantable only on ALL KEYSPACES
- allowing users with that permission to perform SELECT queries, but only on tables with a vector index
- increasing the number of scheduling groups by one to allow users to create a service level for a vector store user
- adjusting the tests and documentation
These changes are needed because the vector indexes are managed by an external service, Vector Store, which needs to read the tables to create the indexes in its memory. We would like to limit the privileges of that service to a minimum, following the principle of least privilege; therefore we add a new permission, one that allows SELECTs conditional on the existence of a vector index on the table.
Fixes: VECTOR-201
Fixes: https://github.com/scylladb/scylladb/issues/26804
Backport reasoning:
Backport to 2025.4 is required, as this can make upgrading clusters more difficult if we only add it in 2026.1. As of now, Scylla Cloud requires version 2025.4 to enable vector search, and the permission is set by the orchestrator, so there is no chance that someone will try to add this permission during an upgrade. In 2026.1 it would be more difficult.
- (cherry picked from commit ae86bfadac)
- (cherry picked from commit 3025a35aa6)
- (cherry picked from commit 6a69bd770a)
- (cherry picked from commit e8fb745965)
- (cherry picked from commit 3db2e67478)
Parent PR: #25976
Closes scylladb/scylladb#26805
* github.com:scylladb/scylladb:
docs: adjust docs for VS auth changes
test: add tests for VECTOR_SEARCH_INDEXING permission
cql: allow VECTOR_SEARCH_INDEXING users to select
auth: add possibility to check for any permission in set
auth: add a new permission VECTOR_SEARCH_INDEXING
Use utils::chunked_vector instead of std::vector to store CDC stream
sets for tablets.
A CDC stream set usually represents all streams for a specific table and
timestamp, and has a stream id per tablet of the table. Each stream
id is represented by 16 bytes; thus the vector could require quite large
contiguous allocations for a table that has many tablets. Change it to
chunked_vector to avoid large contiguous allocations.
Fixes scylladb/scylladb#26791
Closes scylladb/scylladb#26792
(cherry picked from commit e7dbccd59e)
Closes scylladb/scylladb#26828
Since 5b6570be52, the default SSTable compression algorithm for user
tables is no longer hardcoded; it can be configured via the
`sstable_compression_user_table_options.sstable_compression` option in
scylla.yaml.
Modify the `test_table_compression` test to get the expected value from
the configuration.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit d95ebe7058)
This commit adds tests to verify the expected
behavior of the VECTOR_SEARCH_INDEXING permission,
that is, allowing GRANTing this permission only on
ALL KEYSPACES and allowing SELECT queries only on tables
with vector indexes when the user has this permission.
(cherry picked from commit e8fb745965)
Sometimes file::list_directory() returns entries without the type set. In
that case the lister calls file_type() on the entry name to get it. If
the call returns a disengaged type, the code assumes that some error
occurred and resolves into an exception.
That's not correct. The file_type() method returns a disengaged type only
if the file being inspected is missing (i.e. on ENOENT errno). But this
can validly happen if a file is removed between readdir and stat. In
that case it's not "some error happened"; the entry should just be
skipped. If some error had happened, file_type() would resolve into an
exceptional future on its own.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#26595
(cherry picked from commit d9bfbeda9a)
Closes scylladb/scylladb#26767
introduce helper functions that can be used for garbage collecting old
cdc streams for tablets-based keyspaces.
- get_new_base_for_gc: finds a new base timestamp given a TTL, such that
all older timestamps and streams can be removed.
- get_cdc_stream_gc_mutations: given new base timestamp and streams,
builds mutations that update the internal cdc tables and remove the
older streams.
- garbage_collect_cdc_streams_for_table: combines the two functions
above to find a new base and build mutations to update it for a
specific table
- garbage_collect_cdc_streams: builds gc mutations for all cdc tables
(cherry picked from commit 440caeabcb)
Apply two main changes to the s3_client error handling
1. Add a loop to s3_client's `make_request` for the case when the retry strategy will not help because the request itself has to be updated, for example on authentication token expiration or a stale timestamp in the request header
2. Refine the way we handle exceptions in the `chunked_download_source` background fiber: we now carry the original `exception_ptr` and also wrap EVERY exception in `filler_exception`, to prevent the retry strategy from retrying the request altogether
Fixes: https://github.com/scylladb/scylladb/issues/26483
Should be backported to 2025.3 and 2025.4 to prevent deadlocks and failures in these versions
- (cherry picked from commit 55fb2223b6)
- (cherry picked from commit db1ca8d011)
- (cherry picked from commit 185d5cd0c6)
- (cherry picked from commit 116823a6bc)
- (cherry picked from commit 43acc0d9b9)
- (cherry picked from commit 58a1cff3db)
- (cherry picked from commit 1d34657b14)
- (cherry picked from commit 4497325cd6)
- (cherry picked from commit fdd0d66f6e)
Parent PR: #26527
Closes scylladb/scylladb#26650
* github.com:scylladb/scylladb:
s3_client: tune logging level
s3_client: add logging
s3_client: improve exception handling for chunked downloads
s3_client: fix indentation
s3_client: add max for client level retries
s3_client: remove `s3_retry_strategy`
s3_client: support high-level request retries
s3_client: just reformat `make_request`
s3_client: unify `make_request` implementation
Refactor the wrapping exception used in `chunked_download_source` to
prevent the retry strategy from reattempting failed requests. The new
implementation preserves the original `exception_ptr`, making the root
cause clearer and easier to diagnose.
(cherry picked from commit 1d34657b14)
The `compaction_strategy_state` class holds strategy specific state via
a `std::variant` containing different state types. When a compaction
strategy performs compaction, it retrieves a reference to its state from
the `compaction_strategy_state` object. If the table's compaction
strategy is ALTERed while a compaction is in progress, the
`compaction_strategy_state` object gets replaced, destroying the old
state. This leaves the ongoing compaction holding a dangling reference,
resulting in a use after free.
Fix this by using `seastar::shared_ptr` for the state variant
alternatives (`leveled_compaction_strategy_state_ptr` and
`time_window_compaction_strategy_state_ptr`). The compaction strategies
now hold a copy of the shared_ptr, ensuring the state remains valid for
the duration of the compaction even if the strategy is altered.
The `compaction_strategy_state` itself is still passed by reference and
only the variant alternatives use shared_ptrs. This allows ongoing
compactions to retain ownership of the state independently of the
wrapper's lifetime.
Fixes #25913
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 18c071c94b)
It turns out that Boost assertions are thread-unsafe
(and can't be used from multiple threads concurrently).
This sometimes causes the test to fail with cryptic log corruptions.
Fix that by switching to thread-safe checks.
Fixes scylladb/scylladb#24982
Closes scylladb/scylladb#26472
(cherry picked from commit 7c6e84e2ec)
Closes scylladb/scylladb#26554
The test cases in the file aren't run via an existing interface like
`do_with_cql_env`, but they rely on a more direct approach -- calling
one of the schema loader tools. Because of that, they manage the
`db::config` object on their own and don't enable the configuration
option `rf_rack_valid_keyspaces`.
That hasn't been a problem so far since the test doesn't attempt to
create RF-rack-invalid keyspaces anyway. However, in an upcoming commit,
we're going to further restrict views with tablets and require that the
option is enabled.
To prepare for that, we enable the option in all test cases. It's only
necessary in a small subset of them, but it won't hurt to enforce it
everywhere, so let's do that.
Refs scylladb/scylladb#23958
(cherry picked from commit d6fcd18540)
TemporaryHashes.db is a temporary sstable component used during `ms`
sstable writes. It differs from other sstable components in that
it's not included in the TOC. Because of this, it has a special case in
the logic that deletes unfinished sstables on boot
(after Scylla dies in the middle of an sstable write).
But there's a bug in that special case,
which causes Scylla to forget to delete other components from the same unfinished sstable.
The code intends only to delete the TemporaryHashes.db file from the
`_state->generations_found` multimap, but it accidentally also deletes
the file's sibling components from the multimap. Fix that.
Fixes scylladb/scylladb#26393
(cherry picked from commit 6efb807c1a)
There are two tests which effectively check nothing.
They intend to check that the distributed loader removes "leftover"
sstable files. So they create some incomplete sstables, run the test env
on the directory, and check that the files disappeared.
But the test env completely clears the test directory before
the distributed loader looks at the files, so the tests succeed trivially.
Fix that by adding a config knob to the test env which instructs it
not to clear the directory before the test.
(cherry picked from commit 16cb223d7f)
This is yet another part in the BTI index project.
Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25626
Next parts: make `ms` the default. Then, general tweaks and improvements. Later, potentially a full `da` format implementation.
This patch series introduces a new, Scylla-only sstable format version `ms`, which is like `me`, but with the index components (Summary.db and Index.db) replaced with BTI index components (Partitions.db and Rows.db), as they are in Cassandra 5.0's `da` format version.
(Eventually we want to just implement `da`, but there are several other changes (unrelated to the index files) between `me` and `da`. By adding this `ms` as an intermediate step we can adapt the new index formats without dragging all the other changes into the mix (and raising the risk of regressions, which is already high)).
The high-level structure of the PR is:
1. Introduce new component types — `Partitions` and `Rows`.
2. Teach `class sstable` to open them when they exist.
3. Teach the sstable writer how to write index data to them.
4. Teach `class sstable` and unit tests how to deal with sstables that have no `Index` or `Summary` (but have `Partitions` and `Rows` instead).
5. Introduce the new sstable version `ms`, specify that it has `Partitions` and `Rows` instead of `Index` and `Summary`.
6. Prepare unit tests for the appearance of `ms`.
7. Enable `ms` in unit tests.
8. Make `ms` enablable via db::config (with a silent fallback to `me` until the new `MS_SSTABLE_FORMAT` cluster feature is enabled).
9. Prepare integration tests for the appearance of `ms`.
10. Enable both `ms` and `me` in tests where we want both versions to be tested.
This series doesn't make `ms` the default yet, because that requires teaching Scylla Manager and a few dtests about the new format first. It can be enabled by setting `sstable_format: ms` in the config.
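For example, a hypothetical `scylla.yaml` fragment (the option name is taken from the sentence above):

```yaml
# Opt into the new BTI-based sstable format explicitly. Until the
# MS_SSTABLE_FORMAT cluster feature is enabled on all nodes,
# Scylla silently falls back to `me`.
sstable_format: ms
```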
Per a review request, here is an example from `perf_fast_forward`, demonstrating some motivation for a new format. (Although not the main one. The main motivations are getting rid of restrictions on the RAM:disk ratio, and index read throughput for datasets with tiny partitions). The dataset was populated with `build/release/scylla perf-fast-forward --smp=1 --sstable-format=$VERSION --data-directory=data.$VERSION --column-index-size-in-kb=1 --populate --random-seed=0`.
This test involves a partition with 1000000 clustering rows (with 32-bit keys and 100-byte values) and ~500 index blocks, and queries a few particular rows from the partition. Since the branching factor for the BIG promoted index is 2 (it's a binary search), the lookup involves ~11.2 sequential page reads per row. The BTI format has a more reasonable branching factor, so it involves ~2.3 page reads per row.
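As a rough sanity check of the BIG number (my back-of-the-envelope reasoning, not part of the measurements), assuming each binary-search probe lands on a distinct page:

```latex
% Binary search over ~500 promoted-index blocks:
\lceil \log_2 500 \rceil = 9 \text{ probes,}
% plus a couple of pages for the partition-index lookup,
% which is in the ballpark of the measured ~11.2 reads per row.
```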
`build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/me --run-tests=large-partition-select-few-rows`:
```
offset stride rows iterations avg aio aio (KiB)
500000 1 1 70 18.0 18 128
500001 1 1 647 19.0 19 132
0 1000000 1 748 15.0 15 116
0 500000 2 372 29.0 29 284
0 250000 4 227 56.0 56 504
0 125000 8 116 106.0 106 928
0 62500 16 67 195.0 195 1732
```
`build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/ms --run-tests=large-partition-select-few-rows`:
```
offset stride rows iterations avg aio aio (KiB)
500000 1 1 51 5.1 5 20
500001 1 1 64 5.3 5 20
0 1000000 1 679 4.0 4 16
0 500000 2 492 8.0 8 88
0 250000 4 804 16.0 16 232
0 125000 8 409 31.0 31 516
0 62500 16 97 54.0 54 1056
```
Index file size comparison for the default `perf_fast_forward` tables with `--random-seed=0`:
Large partition table (dominated by intra-partition index): 2.4 MB with `me`, 732 kB with `ms`.
For the small partitions table (dominated by inter-partition index): 11 MB with `me`, 8.4 MB with `ms`.
External tests:
I ran the SCT test `longevity-mv-si-4days-streaming-test` on 6 nodes with 30 shards each for 8 hours. No anomalies were observed.
New functionality, no backport needed.
Closes scylladb/scylladb#26215
* github.com:scylladb/scylladb:
test/boost/bloom_filter_test: add test_rebuild_from_temporary_hashes
test/cluster: add test_bti_index.py
test: prepare bypass_cache_test.py for `ms` sstables
sstables/trie/bti_index_reader: add a failure injection in advance_lower_and_check_if_present
test/cqlpy/test_sstable_validation.py: prepare the test for `ms` sstables
tools/scylla-sstable: add `--sstable-version=?` to `scylla sstable write`
db/config: expose "ms" format to the users via database config
test: in Python tests, prepare some sstable filename regexes for `ms`
sstables: add `ms` to `all_sstable_versions`
test/boost/sstable_3_x_test: add `ms` sstables to multi-version tests
test/lib/index_reader_assertions: skip some row index checks for BTI indexes
test/boost/sstable_inexact_index_test: explicitly use a `me` sstable
test/boost/sstable_datafile_test: skip test_broken_promoted_index_is_skipped for `ms` sstables
test/resource: add `ms` sample sstable files for relevant tests
test/boost/sstable_compaction_test: prepare for `ms` sstables.
test/boost/index_reader_test: prepare for `ms` sstables
test/boost/bloom_filter_tests: prepare for `ms` sstables
test/boost/sstable_datafile_test: prepare for `ms` sstables
test/boost/sstable_test: prepare for `ms` sstables.
sstables: introduce `ms` sstable format version
tools/scylla-sstable: default to "preferred" sstable version, not "highest"
sstables/mx/reader: use the same hashed_key for the bloom filter and the index reader
sstables/trie/bti_index_reader: allow the caller to pass a precalculated murmur hash
sstables/trie/bti_partition_index_writer: in add(), get the key hash from the caller
sstables/mx: make Index and Summary components optional
sstables: open Partitions.db early when it's needed to populate key range for sharding metadata
sstables: adapt sstable::set_first_and_last_keys to sstables without Summary
sstables: implement an alternative way to rebuild bloom filters for sstables without Index
utils/bloom_filter: add `add(const hashed_key&)`
sstables: adapt estimated_keys_for_range to sstables without Summary
sstables: make `sstable::estimated_keys_for_range` asynchronous
sstables/sstable: compute get_estimated_key_count() from Statistics instead of Summary
replica/database: add table::estimated_partitions_in_range()
sstables/mx: implement sstable::has_partition_key using a regular read
sstables: use BTI index for queries, when present and enabled
sstables/mx/writer: populate BTI index files
sstables: create and open BTI index files, when enabled
sstables: introduce Partitions and Rows component types
sstables/mx/writer: make `_pi_write_m.partition_tombstone` a `sstables::deletion_time`
`SELECT` commands with SERIAL consistency level are historically allowed for vnode-based views, even though they don't provide linearizability guarantees and in general don't make much sense. In this PR we prohibit LWTs for tablet-based views, but preserve old behavior for vnode-based views for compatibility. Similar logic is applied to CDC log tables.
We also add a general check that disallows colocating a table with another colocated table, since this is not needed for now.
Fixes https://github.com/scylladb/scylladb/issues/26258
backports: not needed (a new feature)
Closes scylladb/scylladb#26284
* github.com:scylladb/scylladb:
cql_test_env.cc: log exception when callback throws
lwt: prohibit for tablet-based views and cdc logs
tablets: disallow chains of colocated tables
database: get_base_table_for_tablet_colocation: extract table_id_by_name lambda
Add `ms` to tests which already test many format versions.
The tests check that sstable files in newer versions are
the same as in `mc`.
Arbitrarily, for `ms`, we only check the files common
between `mc` and `ms`.
If we want to extend this test more, so that it checks
that `Partitions.db` and `Rows.db` don't change over time,
we have to add `ms` versions of all the sstables under
`test/resources` which are used in this test. We won't do that
in this patch series. And I'm not sure if we want to do that at all.