scylladb

Author	SHA1	Message	Date
Botond Dénes	66db95c048	Merge 'Preserve PyKMIP logs from failed KMIP tests' from Nikos Dragazis This PR extends the `tmpdir` class with an option to preserve the directory if the destructor is called during stack unwinding. It also uses this feature in KMIP tests, where the tmpdir contains PyKMIP server logs, which may be useful when diagnosing test failures. Fixes #25339. Not so important to be backported. Closes scylladb/scylladb#25367 * github.com:scylladb/scylladb: encryption_at_rest_test: Preserve tmpdir from failing KMIP tests test/lib: Add option to preserve tmpdir on exception	2025-08-19 13:17:29 +03:00
Avi Kivity	611918056a	Merge 'repair: Add tablet incremental repair support' from Asias He The central idea of incremental repair is to allow repair participants to select and repair only a portion of the dataset to speed up the repair process. All repair participants must utilize an identical selection method to repair and synchronize the same selected dataset. There are two primary selection methods: time-based and file-based. The time-based method selects data within a specified time frame. It is versatile but less efficient because it requires reading the entire dataset and omitting data beyond the time frame. The file-based method selects data from unrepaired SSTables and is more efficient because it allows entire SSTables to be omitted. This patch implements the file-based selection method. Incremental repair will only be supported for tablet tables; it will not be supported for vnode tables. On one hand, the legacy vnode is less important to support. On the other hand, incremental repair for vnodes is much harder to implement. With vnodes, an SSTable could contain data for multiple vnode ranges. When a given vnode range is repaired, only a portion of the SSTable is repaired. This complicates the manipulation of SSTables significantly during both repair and compaction. With tablets, an entire tablet is repaired, so an SSTable is either fully repaired or not repaired at all, which is a huge simplification. This patch uses the repaired_at field from the sstables::statistics component to mark an SSTable as repaired. It uses a virtual clock as the repair timestamp, i.e., a monotonically increasing number for the repaired_at field of an SSTable and the sstables_repaired_at column in the system.tablets table. Notice that when an SSTable is not repaired, the repaired_at field is set to the default value 0. The being_repaired in-memory field of an SSTable is used to explicitly mark that an SSTable is being selected. 
The following variables are used for incremental repair: The repaired_at on disk field of a SSTable is used. - A 64-bit number increases sequentially The sstables_repaired_at is added to the system.tablets table. - repaired_at <= sstables_repaired_at means the sstable is repaired The being_repaired in memory field of a SSTable is added. - A repair UUID tells which sstable has participated in the repair Initial test results: 1) Medium dataset results Node amount: 3 Instance type: i4i.2xlarge Disk usage per node: ~500GB Cluster pre-populated with ~500GB of data before starting repairs job. Results for Repair Timings: The regular repair run took 210 mins. Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s The speedup is: 183 mins / 48s = 228X 2) Small dataset results Node amount: 3 Instance type: i4i.2xlarge Disk usage per node: ~167GB Cluster pre-populated with ~167GB of data before starting the repairs job. Regular repair 1st run took 110s, 2nd and 3rd runs took 110s. Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds. 
The speedup is: 110s / 1.5s = 73X 3) Large dataset results Node amount: 6 Instance type: i4i.2xlarge, 3 racks 50% of base load, 50% read/write Dataset == Sum of data on each node Dataset Non-incremental repair (minutes) 1.3 TiB 31:07 3.5 TiB 25:10 5.0 TiB 19:03 6.3 TiB 31:42 Dataset Incremental repair (minutes) 1.3 TiB 24:32 3.0 TiB 13:06 4.0 TiB 5:23 4.8 TiB 7:14 5.6 TiB 3:58 6.3 TiB 7:33 7.0 TiB 6:55 Fixes #22472 Closes scylladb/scylladb#24291 * github.com:scylladb/scylladb: replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair compaction: Move compaction_reenabler to compaction_reenabler.hh topology_coordinator: Make rpc::remote_verb_error to warning level repair: Add metrics for sstable bytes read and skipped from sstables test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair test.py: Add tests for tablet incremental repair repair: Add tablet incremental repair support compaction: Add tablet incremental repair support feature_service: Add TABLET_INCREMENTAL_REPAIR feature tablet_allocator: Add tablet_force_tablet_count_increase and decrease repair: Add incremental helpers sstable: Add being_repaired to sstable sstables: Add set_repaired_at to metadata_collector mutation_compactor: Introduce add operator to compaction_stats tablet: Add sstables_repaired_at to system.tablets table test: Fix drain api in task_manager_client.py	2025-08-19 13:13:22 +03:00
Botond Dénes	f8b79d563a	Merge 's3: Minor refactoring and beautification of S3 client and tests' from Ernest Zaslavsky This pull request introduces minor code refactoring and aesthetic improvements to the S3 client and its associated test suite. The changes focus on enhancing readability, consistency, and maintainability without altering any functional behavior. No backport is required, as the modifications are purely cosmetic and do not impact functionality or compatibility. Closes scylladb/scylladb#25490 * github.com:scylladb/scylladb: s3_client: relocate `req` creation closer to usage s3_client: reformat long logging lines for readability s3_test: extract file writing code to a function	2025-08-18 18:48:42 +03:00
Avi Kivity	96956e48c4	Merge 'utils: stall_free: detect clear_gently method of const payload types' from Benny Halevy Currently, when a container or smart pointer holds a const payload type, utils::clear_gently does not detect the object's clear_gently method as the method is non-const and requires a mutable object, as in the following example in class tablet_metadata: ``` using tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>; using table_to_tablet_map = std::unordered_map<table_id, tablet_map_ptr>; ``` That said, when a container is cleared gently the elements it holds are destroyed anyhow, so we'd like to allow clearing them gently before destruction. This change still doesn't allow directly calling utils::clear_gently on const objects. Also adds respective unit tests. Fixes #24605 Fixes #25026 * This is an optimization that is not strictly required to backport (as https://github.com/scylladb/scylladb/pull/24618 dealt with clear_gently of `tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>` well enough) Closes scylladb/scylladb#24606 * github.com:scylladb/scylladb: utils: stall_free: detect clear_gently method of const payload types utils: stall_free: clear gently a foreign shared ptr only when use_count==1	2025-08-18 12:52:02 +03:00
Pavel Emelyanov	4f55af9578	Merge 'test.py: pytest: support --mode/--repeat in a common way for all tests' from Evgeniy Naydanov Implement repetition of files using `pytest_collect_file` hook: run file collection as many times as needed to cover all `--mode`/`--repeat` combinations. Store build mode and run ID in the stash of each repeated item. Some additional changes: - Add `TestSuiteConfig` class to handle all operations with `test_config.yaml` - Add support for `run_first` option in `test_config.yaml` - Move disabled test logic to `pytest_collect_file` hook. These changes make it possible to remove custom logic for `--mode`, `--repeat`, and disabled tests in the code for C++ tests and prepare for switching the Python/CQLApproval/Topology tests to the pytest runner. Also, this PR includes required refactoring changes and fixes: - Simplify support of C++ tests: remove redundant facade abstraction and put all code into 3 files: `base.py`, `boost.py`, and `unit.py` - Remove unused imports in `test.py` - Use the constant for `"suite.yaml"` string - Some test suites have their own test runners based on pytest, and they don't need everything we use for `test.py`. Move all code related to the `test.py` framework to `test/pylib/runner.py` and use it as a plugin conditionally (by using the `SCYLLA_TEST_RUNNER` env variable). - Add `cwd` parameter to `run_process()` methods in the `resource_gather` module to avoid using `os.chdir()` (and sort parameters in the same order as in `subprocess.Popen`). - `extra_scylla_cmdline_options` is a list of command-line arguments, and each argument should be a separate item. A few configuration files have the `--reactor-backend` option added in a format that doesn't follow this rule. 
This PR is a refactoring step for https://github.com/scylladb/scylladb/pull/25443 Closes scylladb/scylladb#25465 * github.com:scylladb/scylladb: test.py: pytest: support --mode/--repeat in a common way for all tests test.py: pytest: streamline suite configuration handling test.py: refactor: remove unused imports in test.py test.py: fix run with bare pytest after merge of scylladb/scylladb#24573 test.py: refactor: move framework-related code to test.pylib.runner test.py: resource_gather: add cwd parameter to run_process() test.py: refactor: use proper format for extra_scylla_cmdline_options	2025-08-18 12:24:04 +03:00
Avi Kivity	e9928b31b8	Merge 'sstables/trie: add BTI key translation routines' from Michał Chojnowski This is yet another part in the BTI index project. Overarching issue: https://github.com/scylladb/scylladb/issues/19191 Previous part: https://github.com/scylladb/scylladb/pull/25396 Next part: implementing sstable index writers and readers on top of the abstract trie writers/readers. The new code added in this PR isn't used outside of tests yet, but it's posted as a separate PR for reviewability. This series provides translation routines for ring positions and clustering positions from Scylla's native in-memory structures to BTI's byte-comparable encoding. This translation is performed whenever a new decorated key or clustering block is added to a BTI index, and whenever a BTI index is queried for a range of positions. For a description of the encoding, see `fad1f74570/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md (multi-component-sequences-partition-or-clustering-keys-tuples-bounds-and-nulls)` The translation logic, with all the fragment awareness, lazy evaluation and avoidable copies, is fairly bloated for the common cases of simple and small keys. This is a potential optimization target for later. No backports needed, new functionality. Closes scylladb/scylladb#25506 * github.com:scylladb/scylladb: sstables/trie: add BTI key translation routines tests/lib: extract generate_all_strings to test/lib tests/lib: extract nondeterministic_choice_stack to test/lib sstables/trie/trie_traversal: extract comparable_bytes_iterator to its own file sstables/mx: move clustering_info from writer.cc to types.hh sstables/trie: allow `comparable_bytes_iterator` to return a mutable span dht/ring_position: add ring_position_view::weight()	2025-08-18 11:55:26 +03:00
Asias He	f9021777d8	compaction: Add tablet incremental repair support This patch adds incremental repair support in compaction. - The sstables are split into repaired and unrepaired sets. - The repaired and unrepaired sets are compacted separately. - The repaired_at from the sstable and sstables_repaired_at from the system.tablets table are used to decide whether an sstable is repaired or not. - Different compaction tasks, e.g., minor, major, scrub, split, are serialized with tablet repair.	2025-08-18 11:01:21 +08:00
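The repaired/unrepaired split used by compaction can be sketched as follows, based on the rule stated in the merge description of this series (`repaired_at <= sstables_repaired_at` means repaired, and `repaired_at` defaults to 0 for unrepaired SSTables). This is a minimal illustration, not ScyllaDB's implementation: the `SSTable` dataclass and the functions `is_repaired` and `split_for_compaction` are hypothetical names, and the explicit check for 0 is an assumption about how the default value is handled.

```python
from dataclasses import dataclass


@dataclass
class SSTable:
    """Hypothetical stand-in for an SSTable; only the fields needed here."""
    name: str
    repaired_at: int = 0  # 0 is the default for an SSTable that was never repaired


def is_repaired(sst: SSTable, sstables_repaired_at: int) -> bool:
    # Assumption: 0 means "never repaired"; otherwise the SSTable counts as
    # repaired when its repaired_at is not newer than the tablet's
    # sstables_repaired_at value from system.tablets.
    return sst.repaired_at != 0 and sst.repaired_at <= sstables_repaired_at


def split_for_compaction(sstables, sstables_repaired_at):
    """Split SSTables into (repaired, unrepaired) sets that compact separately."""
    repaired = [s for s in sstables if is_repaired(s, sstables_repaired_at)]
    unrepaired = [s for s in sstables if not is_repaired(s, sstables_repaired_at)]
    return repaired, unrepaired
```

Keeping the two sets apart means a compaction never merges repaired data with unrepaired data, so an SSTable stays either fully repaired or fully unrepaired.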
Evgeniy Naydanov	e44b26b809	test.py: pytest: support --mode/--repeat in a common way for all tests Implement repetition of files using pytest_collect_file hook: run file collection as many times as needed to cover all --mode/--repeat combinations. Also move disabled test logic to this hook. Store build mode and run_id in pytest item stashes. Simplify support of C++ tests: remove redundant facade abstraction and put all code into 3 files: base.py, boost.py, and unit.py Add support for `run_first` option in test_config.yaml	2025-08-17 15:26:23 +00:00
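The idea of collecting each test file once per `--mode`/`--repeat` combination can be sketched in plain Python. This is only an illustration of the combinatorics, not the actual `pytest_collect_file` hook: the function `expand_runs` is a hypothetical name, and numbering run IDs from 1 is an assumption.

```python
import itertools


def expand_runs(modes, repeat):
    """Return one (build_mode, run_id) pair per required collection pass.

    Hypothetical helper: a collection hook could iterate over these pairs,
    collect the file once per pair, and store the pair with the collected item.
    """
    return list(itertools.product(modes, range(1, repeat + 1)))
```

For example, two modes with `repeat=2` give four collection passes for every test file.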
Evgeniy Naydanov	cb4d9b8a09	test.py: refactor: use proper format for extra_scylla_cmdline_options `extra_scylla_cmdline_options` is a list of command-line arguments, and each argument should be a separate item. A few configuration files have the `--reactor-backend` option added in a format that doesn't follow this rule.	2025-08-17 12:32:35 +00:00
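Why each argument should be its own list item can be shown with a short sketch. The option values below are purely illustrative and the exact malformed entries in the configuration files are not shown in this log; `normalize_cmdline_options` is a hypothetical helper, not part of `test.py`.

```python
import shlex


def normalize_cmdline_options(options):
    """Split any item that packs several arguments into one string.

    A list item such as "--flag value" would reach the process as a single
    argv entry; splitting it yields one entry per argument.
    """
    normalized = []
    for item in options:
        normalized.extend(shlex.split(item))
    return normalized
```

An item like `"--reactor-backend linux-aio"` becomes two separate entries, while an item that is already a single argument is left unchanged.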
Michał Chojnowski	413dcf8891	sstables/trie: add BTI key translation routines This file provides translation routines for ring positions and clustering positions from Scylla's native in-memory structures to BTI's byte-comparable encoding. This translation is performed whenever a new decorated key or clustering block is added to a BTI index, and whenever a BTI index is queried for a range of positions. For a description of the encoding, see `fad1f74570/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md (multi-component-sequences-partition-or-clustering-keys-tuples-bounds-and-nulls)` The translation logic, with all the fragment awareness, lazy evaluation and avoidable copies, is fairly bloated for the common cases of simple and small keys. This is a potential optimization target for later.	2025-08-15 11:13:00 +02:00
Michał Chojnowski	5e76708335	tests/lib: extract generate_all_strings to test/lib This util will be used in another test file in a later commit, so hoist it to `test/lib`.	2025-08-14 22:38:38 +02:00
Taras Veretilnyk	30ff5942c6	database_test: fix race in test_drop_quarantined_sstables The test_drop_quarantined_sstables test could fail due to a race between compaction and quarantining of SSTables. If compaction selects an SSTable before it is moved to quarantine, and change_state is called during compaction, the SSTable may already be removed, resulting in a std::filesystem_error due to missing files. This patch resolves the issue by wrapping the quarantine operation inside run_with_compaction_disabled(). This ensures compaction is paused on the compaction group view while SSTables are being quarantined, preventing the race. Additionally, update the test to quarantine up to 1/5 of the SSTables instead of a single random one, and increase the number of SSTables generated to improve the test scenario. Fixes scylladb/scylladb#25487 Closes scylladb/scylladb#25494	2025-08-14 20:23:42 +03:00
Taras Veretilnyk	367eaf46c5	keys: from_nodetool_style_string don't split single partition keys Users with single-column partition keys that contain colon characters were unable to use certain REST APIs and 'nodetool' commands, because the API split key by colon regardless of the partition key schema. Affected commands: - 'nodetool getendpoints' - 'nodetool getsstables' Affected endpoints: - '/column_family/sstables/by_key' - '/storage_service/natural_endpoints' Refs: #16596 - This does not fully fix the issue, as users with compound keys will face the issue if any column of the partition key contains a colon character. Closes scylladb/scylladb#24829	2025-08-14 19:52:04 +03:00
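The behavior change can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual `from_nodetool_style_string` code: the function name `split_nodetool_key` and its `partition_key_columns` parameter are hypothetical.

```python
def split_nodetool_key(key: str, partition_key_columns: int) -> list[str]:
    """Split a nodetool-style key string into partition key components.

    With a single-column partition key the string is taken as-is, so colons
    inside the value are preserved. Compound keys are still split on every
    colon, which is why the issue is not fully fixed for them.
    """
    if partition_key_columns == 1:
        return [key]
    return key.split(":")
```

A key such as `"a:b"` is returned whole for a single-column partition key, while a two-column key whose first column contains a colon still splits into too many components.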
Avi Kivity	1ef6697949	Merge 'service/vector_store_client: Add live configuration update support' from Karol Nowacki Enable runtime updates of the vector_store_uri configuration without requiring a server restart. This makes it possible to enable, disable, or switch the vector search service endpoint on the fly. To improve clarity, seastar::experimental::http::client is now wrapped in a private http_client class that also holds the host, address, and port information. Tests have been added to verify that the client correctly handles transitions between enabled/disabled states and successfully switches traffic to a new endpoint after a configuration update. Closes: VECTOR-102 No backport is needed as this is a new feature. Closes scylladb/scylladb#25208 * github.com:scylladb/scylladb: service/vector_store_client: Add live configuration update support test/boost/vector_store_client_test.cc: Refactor vector store client test service/vector_store_client: Refactor host_port struct created service/vector_store_client: Refactor HTTP request creation	2025-08-14 19:45:06 +03:00
Avi Kivity	fe6e1071d3	Merge 'locator: util: optimize describe_ring' from Benny Halevy This change includes basic optimizations to locator::describe_ring, mainly caching the per-endpoint information in an unordered_map instead of looking them up in every inner-loop. This yields an improvement of 20% in cpu time. With 45 nodes organized as 3 dcs, 3 racks per dc, 5 nodes per rack, 256 tokens per node, yielding 11520 ranges and 9 replicas per range, describe_ring took Before: 30 milliseconds (2.6 microseconds per range) After: 24 milliseconds (2.1 microseconds per range) Add respective unit test for vnode keyspace and for tablets. Fixes #24887 * backport up to 2025.1 as describe_ring slowness was hit in the field with large clusters Closes scylladb/scylladb#24889 * github.com:scylladb/scylladb: locator: util: optimize describe_ring locator: util: construct_range_to_endpoint_map: pass is_vnode=true to get_natural_replicas vnode_effective_replication_map: do_get_replicas: throw internal error if token not found in map locator: effective_replication_map: get_natural_replicas: get is_vnode param test: cluster: test_repair: add test_vnode_keyspace_describe_ring	2025-08-14 19:39:17 +03:00
Ernest Zaslavsky	29960b83b5	s3_test: extract file writing code to a function Reduce code doing the same over and over again by extracting file writing code to a function	2025-08-14 16:18:43 +03:00
Avi Kivity	66173c06a3	Merge 'Eradicate the ability to create new sstables with numerical sstable generation' from Benny Halevy Remove support for generating numerical sstable generation for new sstables. Loading such sstables is still supported but new sstables are always created with a uuid generation. This is possible since: * All live versions (since 5.4 / `f014ccf369`) now support uuid sstable generations. * The `uuid_sstable_identifiers_enabled` config option (that is unused from version 2025.2 / `6da758d74c`) controls only the use of uuid generations when creating new sstables. SSTables with uuid generations should still be properly loaded by older versions, even if `uuid_sstable_identifiers_enabled` is set to `false`. Fixes #24248 * Enhancement, no backport needed Closes scylladb/scylladb#24512 * github.com:scylladb/scylladb: streaming: stream_blob: use the table sstable_generation_generator replica: distributed_loader: process_upload_dir: use the table sstable_generation_generator sstables: sstable_generation_generator: stop tracking highest generation replica: table: get rid of update_sstables_known_generation sstables: sstable_directory: stop tracking highest_generation replica: distributed_loader: stop tracking highest_generation sstables: sstable_generation: get rid of uuid_identifiers bool class sstables_manager: drop uuid_sstable_identifiers feature_service: move UUID_SSTABLE_IDENTIFIERS to supported_feature_set test: cql_query_test: add test_sstable_load_mixed_generation_type test: sstable_datafile_test: move copy_directory helper to test/lib/test_utils test: database_test: move table_dir helper to test/lib/test_utils	2025-08-14 11:54:33 +03:00
Pavel Emelyanov	eaec7c9b2e	Merge 'cql3: add default replication strategy to `create_keyspace_statement`' from Dario Mirovic When creating a new keyspace, both replication strategy and replication factor must be stated. For example: `CREATE KEYSPACE ks WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'replication_factor' : 3 };` This syntax is verbose, and in all but some testing scenarios `NetworkTopologyStrategy` is used. This patch allows skipping replication strategy name, filling it with `NetworkTopologyStrategy` when that happens. The following syntax is now valid: `CREATE KEYSPACE ks WITH REPLICATION = { 'replication_factor' : 3 };` and will give the same result as the previous, more explicit one. Fixes https://github.com/scylladb/scylladb/issues/16029 Backport is not needed. This is an enhancement for future releases. Closes scylladb/scylladb#25236 * github.com:scylladb/scylladb: docs/cql: update documentation for default replication strategy test/cqlpy: add keyspace creation default strategy test cql3: add default replication strategy to `create_keyspace_statement`	2025-08-14 11:18:36 +03:00
Ernest Zaslavsky	dd51e50f60	s3_client: add memory fallback in `chunked_download_source` Introduce fallback logic in `chunked_download_source` to handle memory exhaustion. When memory is low, feed the `deque` with only one uncounted buffer at a time. This allows slow but steady progress without getting stuck on the memory semaphore. Fixes: https://github.com/scylladb/scylladb/issues/25453 Fixes: https://github.com/scylladb/scylladb/issues/25262 Closes scylladb/scylladb#25452	2025-08-14 09:52:10 +03:00
Michał Chojnowski	72818a98e0	tests/lib: extract nondeterministic_choice_stack to test/lib This util will be used in another test file in a later commit, so hoist it to `test/lib`.	2025-08-14 02:06:34 +02:00
Benny Halevy	50abeb1270	locator: util: optimize describe_ring This change includes basic optimizations to locator::describe_ring, mainly caching the per-endpoint information in an unordered_map instead of looking them up in every inner-loop. This yields an improvement of 20% in cpu time. With 45 nodes organized as 3 dcs, 3 racks per dc, 5 nodes per rack, 256 tokens per node, yielding 11520 ranges and 9 replicas per range, describe_ring took Before: 30 milliseconds (2.6 microseconds per range) After: 24 milliseconds (2.1 microseconds per range) Add respective unit test of describe_ring for tablets. A unit test for vnodes already exists in test/nodetool/test_describering.py Fixes #24887 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-13 12:42:25 +03:00
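The figures quoted in this commit message can be checked with a few lines of arithmetic. The helper names below are hypothetical and exist only for this worked example; the inputs are the numbers from the message.

```python
def ring_ranges(nodes: int, tokens_per_node: int) -> int:
    """Number of vnode ranges in the ring: one per token."""
    return nodes * tokens_per_node


def micros_per_range(total_ms: float, ranges: int) -> float:
    """Average time spent per range, in microseconds."""
    return total_ms * 1000 / ranges


def relative_improvement(before: float, after: float) -> float:
    """Fraction of time saved relative to the original duration."""
    return (before - after) / before


nodes = 3 * 3 * 5                 # 3 dcs x 3 racks per dc x 5 nodes per rack
ranges = ring_ranges(nodes, 256)  # 256 tokens per node
```

With 45 nodes the ring has 11520 ranges, so 30 ms and 24 ms correspond to roughly 2.6 and 2.1 microseconds per range, and the saving is 20% of the original time.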
Ernest Zaslavsky	380c73ca03	s3_client: make memory semaphore acquisition abortable Add `abort_source` to the `get_units` call for the memory semaphore in the S3 client, allowing the acquisition process to be aborted. Fixes: https://github.com/scylladb/scylladb/issues/25454 Closes scylladb/scylladb#25469	2025-08-13 08:48:55 +03:00
Dario Mirovic	bc8bb0873d	cql3: add default replication strategy to `create_keyspace_statement` When creating a new keyspace, both replication strategy and replication factor must be stated. For example: `CREATE KEYSPACE ks WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'replication_factor' : 3 };` This syntax is verbose, and in all but some testing scenarios `NetworkTopologyStrategy` is used. This patch allows skipping replication strategy name, filling it with `NetworkTopologyStrategy` when that happens. The following syntax is now valid: `CREATE KEYSPACE ks WITH REPLICATION = { 'replication_factor' : 3 };` and will give the same result as the previous, more explicit one. Fixes #16029	2025-08-13 01:51:53 +02:00
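The defaulting rule can be sketched with a plain dictionary of replication options. This is an illustration of the behavior described in the message, not the code in `create_keyspace_statement`; the function name `with_default_strategy` is hypothetical.

```python
DEFAULT_STRATEGY = "NetworkTopologyStrategy"


def with_default_strategy(replication: dict) -> dict:
    """Return replication options with 'class' filled in when it is omitted."""
    options = dict(replication)  # copy, so the caller's dict is not modified
    options.setdefault("class", DEFAULT_STRATEGY)
    return options
```

Options that already name a strategy are returned unchanged, so the explicit syntax keeps working exactly as before.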
Taras Veretilnyk	b7097b2993	database_test: fix abandoned futures in test_drop_quarantined_sstables The lambda passed to do_with_cql_env_thread() in test_drop_quarantined_sstables was mistakenly written as a coroutine. This change replaces co_await with .get() calls on futures and changes lambda return type to void. Fixes scylladb/scylladb#25427 Closes scylladb/scylladb#25431	2025-08-12 13:31:06 +03:00
Karol Nowacki	22a133df9b	service/vector_store_client: Add live configuration update support Enable runtime updates of the vector_store_uri configuration without requiring a server restart. This makes it possible to enable, disable, or switch the vector search node endpoint on the fly.	2025-08-12 08:12:53 +02:00
Karol Nowacki	152274735e	test/boost/vector_store_client_test.cc: Refactor vector store client test Consolidate consecutive setup functions into a dedicated helper. Extract test table creation into a separate function. Remove redundant assertions to improve clarity.	2025-08-12 08:12:53 +02:00
Tomasz Grabiec	9fd312d157	Merge 'row_cache: add memtable overlap checks elision optimization for tombstone gc' from Botond Dénes https://github.com/scylladb/scylladb/issues/24962 introduced memtable overlap checks to cache tombstone GC. This was observed to be very strict and to greatly reduce the effectiveness of tombstone GC in the cache, especially for MV workloads, which regularly recycle old timestamps into new writes, so the memtable often has a smaller min live timestamp than the timestamp of the tombstones in the cache. When creating a new memtable, save a snapshot of the tombstone gc state. This snapshot is used later to exclude this memtable from overlap checks for tombstones whose token has an expiry time larger than that of the tombstone, meaning: all writes in this memtable were produced at a point in time when the current tombstone had already expired. This has the following implications: * The partition the tombstone is part of was already repaired at the time the memtable was created. * All writes in the memtable were produced after this tombstone's expiry time, so these writes cannot possibly be relevant for this tombstone. Based on this, such memtables are excluded from the overlap checks. With adequately frequent memtable flushes -- so that the tombstone gc state snapshot is refreshed -- most memtables should be excluded from overlap checks, greatly helping the cache's tombstone GC efficiency. 
Fixes: https://github.com/scylladb/scylladb/issues/24962 Fixes a regression introduced by https://github.com/scylladb/scylladb/pull/23255 which was backported to all releases, needs backport to all releases as well Closes scylladb/scylladb#25033 * github.com:scylladb/scylladb: docs/dev/tombstone.md: document the memtable overlap check elision optimization test/boost/row_cache_test: add test for memtable overlap check elision db/cache_mutation_reader: obtain gc-before and min-live-ts lazily mutation/mutation_compactor: use max_purgeable::can_purge and max_purgeable::purge_result db/cache_mutation_reader: use max_purgeable::can_purge() replica/table: get_max_purgeable_fn_for_cache_underlying_reader(): use max_purgable::combine() replica/database: memtable_list::get_max_purgeable(): set expiry-treshold compaction/compaction_garbage_collector: max_purgeable: add expiry_treshold replica/table: propagate gc_state to memtable_list replica/memtable_list: add tombstone_gc_state* member replica/memtable: add tombstone_gc_state_snapshot tombstone_gc: introduce tombstone_gc_state_snapshot tombstone_gc: extract shared state into shared_tombstone_gc_state tombstone_gc: per_table_history_maps::_group0_gc_time: make it a value tombstone_gc: fold get_group0_gc_time() into its caller tombstone_gc: fold get_or_create_group0_gc_time() into update_group0_refresh_time() tombstone_gc: fold get_or_create_repair_history_for_table() into update_repair_time() tombstone_gc: refactor get_or_greate_repair_history_for_table() replica/memtable_list: s/min_live_timestamp()/get_max_purgeable()/ db/read_context: return max_purgeable from get_max_purgeable() compaction/compaction_garbage_collector: add formatter for max_purgeable mutation: move definition of gc symbols to compaction.cc compaction/compaction_garbage_collector: refactor max_purgeable into a class test/boost/row_cache_test: refactor test_populating_reader_tombstone_gc_with_data_in_memtable test: rewrite 
test_compacting_reader_tombstone_gc_with_data_in_memtable in C++ test/boost/row_cache_test: refactor cache tombstone GC with memtable overlap tests	2025-08-11 23:54:59 +02:00
Michał Chojnowski	3017dbb204	sstables/trie: add trie traversal routines `trie::node_reader`, added in a previous series, contains encoding-aware logic for traversing a single node (or a batch of nodes) during a trie search. This commit adds encoding-agnostic functions which drive the `trie::node_reader` in a loop to traverse the whole branch. Together, the added functions (`traverse`, `step`, `step_back`) and the data structure they modify (`ancestor_trail`) constitute a trie cursor. We might later wrap them into some `trie_cursor` class, but regardless of whether we are going to do that, keeping them (also) as free functions makes them easier to test. Closes scylladb/scylladb#25396	2025-08-11 19:15:09 +03:00
Botond Dénes	65c770f21a	test/boost/row_cache_test: add test for memtable overlap check elision	2025-08-11 17:20:12 +03:00
Botond Dénes	cfac9691ff	compaction/compaction_garbage_collector: max_purgeable: add expiry_treshold Allow possibly avoiding overlap checks in the case where the source of the min-live timestamp is known to only contain data which was written after the expiry threshold. The expiry threshold is the upper bound of tombstone.deletion_time that was already expired at the time of obtaining this expiry threshold value. This means that any write originating from after this point in time was generated at a time when such a tombstone was already expired. Hence these writes are not relevant for the purposes of overlap checks with the tombstone and so their min-live timestamp can be ignored. This is important for MV workloads, where writes generated now can have timestamps going far back in time, possibly blocking tombstone GC of much older [shadowable] tombstones.	2025-08-11 17:20:11 +03:00
Patryk Jędrzejczak	e14c5e3890	Merge 'raft: enforce odd number of voters in group0' from Emil Maskovsky raft: enforce odd number of voters in group0 Implement odd number voter enforcement in the group0 voter calculator to ensure proper Raft consensus behavior. Raft consensus requires a majority of voters to make decisions, and an odd number of voters is preferred because an even number doesn't add reliability but introduces the risk of scenarios where no group can make progress. If an even number of voters is divided into two groups of equal size during a network partition, neither group will have a majority and both will be unable to commit new entries. With an odd number of voters, such equal partition scenarios are impossible (unless the network is partitioned into at least three groups). Fixes: scylladb/scylladb#23266 No backport: This is a new change that is to be only deployed in the new version, so it will not be backported. Closes scylladb/scylladb#25332 * https://github.com/scylladb/scylladb: raft: enforce odd number of voters in group0 test/raft: adapt test_tablets_lwt.py for odd voter number enforcement test/raft: adapt test_raft_no_quorum.py for odd voter enforcement	2025-08-11 15:44:21 +02:00
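The reasoning about majorities can be made concrete with a small sketch. The functions below are hypothetical and deliberately simplified: the real group0 voter calculator is not shown in this log, and dropping one voter when the count is even is only an assumption about how an odd number could be enforced.

```python
def majority(voters: int) -> int:
    """Smallest number of voters that forms a Raft majority."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can fail while a majority remains available."""
    return voters - majority(voters)


def enforce_odd(candidates: int) -> int:
    """Assumed rule: use the largest odd number of voters not above candidates."""
    if candidates <= 0:
        return 0
    return candidates if candidates % 2 == 1 else candidates - 1
```

Three voters and four voters both tolerate exactly one failure, so the fourth voter adds no reliability, and a 2/2 network split of four voters leaves neither side with the three votes a majority requires.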
Benny Halevy	23ac80fc6b	utils: stall_free: detect clear_gently method of const payload types Currently, when a container or smart pointer holds a const payload type, utils::clear_gently does not detect the object's clear_gently method as the method is non-const and requires a mutable object, as in the following example in class tablet_metadata: ``` using tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>; using table_to_tablet_map = std::unordered_map<table_id, tablet_map_ptr>; ``` That said, when a container is cleared gently the elements it holds are destroyed anyhow, so we'd like to allow clearing them gently before destruction. This change still doesn't allow directly calling utils::clear_gently on const objects. Also adds respective unit tests. Fixes #24605 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-11 14:22:01 +03:00
Benny Halevy	cb9db2f396	utils: stall_free: clear gently a foreign shared ptr only when use_count==1 Unlike clear_gently of SharedPtr, clear_gently of a `foreign_ptr<shared_ptr<T>>` calls clear_gently on the contained object even if it's still shared and may still be in use. This change examines the foreign shared pointer's use_count and calls clear_gently on the shard object only when its use_count reaches 1. Fixes #25026 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-11 14:21:32 +03:00
Tomasz Grabiec	f7c001deff	Merge 'key: clustering_bounds_comparator: avoid thread_local initialization guard overhead' from Avi Kivity I noticed clustering_bounds_comparator was running an unnecessary thread_local initialization guard. This series switches the variable to constinit initialization, removing the guard. Performance measurements (perf-simple-query) show an unimpressive 20 instruction per op reduction. However, each instruction counts! Before: ``` throughput: mean= 203642.54 standard-deviation=1102.99 median= 204328.69 median-absolute-deviation=955.56 maximum=204624.13 minimum=202222.19 instructions_per_op: mean= 42097.59 standard-deviation=40.07 median= 42111.83 median-absolute-deviation=30.65 maximum=42139.88 minimum=42044.91 cpu_cycles_per_op: mean= 22664.81 standard-deviation=131.28 median= 22581.10 median-absolute-deviation=111.57 maximum=22832.30 minimum=22553.24 ``` After: ``` throughput: mean= 204397.73 standard-deviation=2277.71 median= 204942.95 median-absolute-deviation=2191.54 maximum=207588.30 minimum=202162.80 instructions_per_op: mean= 42087.21 standard-deviation=27.30 median= 42092.75 median-absolute-deviation=20.33 maximum=42108.33 minimum=42041.51 cpu_cycles_per_op: mean= 22589.79 standard-deviation=219.24 median= 22544.82 median-absolute-deviation=191.98 maximum=22835.11 minimum=22303.52 ``` (Very) minor performance improvement, no backport suggested. Closes scylladb/scylladb#25259 * github.com:scylladb/scylladb: keys: clustering_bounds_comparator: make thread_local _empty_prefix constinit keys: make empty creation clustering_key_prefix constexpr managed_bytes: make empty managed_bytes constexpr friendly keys: clustering_bounds_comparator: make _empty_prefix a prefix	2025-08-11 13:20:38 +02:00
Botond Dénes	ab633590f1	tombstone_gc: introduce tombstone_gc_state_snapshot Returns gc-before times, identical to what tombstone_gc_state would have returned at the point of taking the snapshot.	2025-08-11 07:09:14 +03:00
Botond Dénes	614d17347a	tombstone_gc: extract shared state into shared_tombstone_gc_state Instead of storing it partially in tombstone_gc and partially in an external map. Move all external parts into the new shared_tombstone_gc_state. This new class is responsible for keeping and updating the repair history. tombstone_gc_state just keeps const pointers to the shared state as before and is only responsible for querying the tombstone gc before times. This separation makes the code easier to follow and also enables further patching of tombstone_gc_state.	2025-08-11 07:09:14 +03:00
Botond Dénes	ef7d49cd21	compaction/compaction_garbage_collector: refactor max_purgeable into a class Make members private, add getters and constructors. This struct will get more functionality soon, so class is a better fit.	2025-08-11 07:09:13 +03:00
Botond Dénes	c150bdd59c	test/boost/row_cache_test: refactor test_populating_reader_tombstone_gc_with_data_in_memtable This test currently uses gc_grace_seconds=0. The introduction of memtable overlap elision will break these tests because the optimization is always active with this tombstone-gc. Switch the tests to use tombstone-gc=repair, which allows for greater control over when the memtable overlap elision is triggered. This requires a move to vnodes, as tombstone-gc=repair doesn't work with RF=1 currently, and using RF=3 won't work with tablets.	2025-08-11 07:09:13 +03:00
Botond Dénes	c052f2ad1d	test: rewrite test_compacting_reader_tombstone_gc_with_data_in_memtable in C++ This test will soon need to be changed to use tombstone-gc=repair. This cannot work as of now, as the test uses a single-node cluster. The options are the following: * Make it use more than one node * Make repair work with single-node clusters * Rewrite in C++ where repair can be done synthetically We chose the last option, as it is the simplest one both in terms of code and runtime footprint. The new test is in test/boost/row_cache_test.cc Two changes were made during the migration: * Change the name to test_populating_reader_tombstone_gc_with_data_in_memtable to better express which cache component this test is targeting; * Use NullCompactionStrategy on the table instead of disabling auto-compaction.	2025-08-11 07:09:13 +03:00
Botond Dénes	e4c048ada1	test/boost/row_cache_test: refactor cache tombstone GC with memtable overlap tests These tests currently use tombstone-gc=immediate. The introduction of memtable overlap elision will break these tests because the optimization is always active with this tombstone-gc. Switch the tests to use tombstone-gc=repair, which allows for greater control over when the memtable overlap elision is triggered. This requires a move to vnodes, as tombstone-gc=repair doesn't work with RF=1 currently, and using RF=3 won't work with tablets.	2025-08-11 07:09:13 +03:00
Emil Maskovsky	7c54401d3d	raft: enforce odd number of voters in group0 Implement odd number voter enforcement in the group0 voter calculator to ensure proper Raft consensus behavior. Raft consensus requires a majority of voters to make decisions, and an odd number of voters is preferred because an even number doesn't add additional reliability but introduces the risk of scenarios where no group can make progress. If an even number of voters is divided into two groups of equal size during a network partition, neither group will have a majority and both will be unable to commit new entries. With an odd number of voters, such equal partition scenarios are impossible (unless the network is partitioned into at least three groups). Fixes: scylladb/scylladb#23266	2025-08-08 19:49:20 +02:00
Benny Halevy	0a20834d2a	replica: table: get rid of update_sstables_known_generation It is not needed anymore. With that database::_sstable_generation_generator can be a regular member rather than optional and initialized later. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-08 11:46:21 +03:00
Benny Halevy	42cb25c470	sstables: sstable_directory: stop tracking highest_generation It is not needed anymore as we always generate uuid generations. Convert sstable_directory_test_table_simple_empty_directory_scan to use the newly added empty() method instead of checking the highest generation seen. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-08 11:46:21 +03:00
Benny Halevy	b01524c5a3	replica: distributed_loader: stop tracking highest_generation It is not needed anymore as we always generate uuid generations. Move highest_generation_seen(sharded<sstables::sstable_directory>& directory) to sstables/sstable_directory module. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-08 11:46:21 +03:00
Benny Halevy	6cc964ef16	sstables: sstable_generation: get rid of uuid_identifiers bool class Now that all call sites enable uuid_identifiers. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-08 11:46:21 +03:00
Raphael S. Carvalho	beaaf00fac	test: Add test that compaction doesn't cross logical group boundary Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:58:01 +03:00
Raphael S. Carvalho	d351b0726b	replica: Introduce views in compaction_group for incremental repair Wired the unrepaired, repairing and repaired views into compaction_group. Also the repaired filter was wired, so tablet_storage_group_manager can implement the procedure to classify the sstable. Based on this classifier, we can decide which view a sstable belongs to, at any given point in time. Additionally, we made changes to compaction_group_view to return only sstables that belong to the underlying view. From this point on, repaired, repairing and unrepaired sets are connected to compaction manager through their views. And that guarantees sstables on different groups cannot be compacted together. Repairing view specifically has compaction disabled on it altogether, we can revert this later if we want, to allow repairing sstables to be compacted with one another. The benefit of this logical approach is having the classifier as the single source of truth. Otherwise, we'd need to keep the sstable location consistent with global metadata, creating complexity. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:58:00 +03:00
Raphael S. Carvalho	9d3755f276	replica: Futurize retrieval of sstable sets in compaction_group_view This will allow upcoming work to gently produce a sstable set for each compaction group view. Example: repaired and unrepaired. Locking strategy for compaction's sstable selection: Since sstable retrieval path became futurized, tasks in compaction manager will now hold the write lock (compaction_state::lock) when retrieving the sstable list, feeding them into compaction strategy, and finally registering selected sstables as compacting. The last step prevents another concurrent task from picking the same sstable. Previously, all those steps were atomic, but we have seen stall in that area in large installations, so futurization of that area would come sooner or later. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:58:00 +03:00
Raphael S. Carvalho	2c4a9ba70c	treewide: Rename table_state to compaction_group_view Since table_state is a view to a compaction group, it makes sense to rename it as so. With upcoming incremental repair, each replica::compaction_group will be actually two compaction groups, so there will be two views for each replica::compaction_group. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:51:28 +03:00
Asias He	acc367c522	tests: adjust for incremental repair The separation of sstables into the logical repaired and unrepaired virtual sets requires some adjustments for certain tests, in particular for those that look at number of compaction tasks or number of sstables. The following tests need adjustment: * test/cluster/tasks/test_tablet_tasks.py * test/boost/memtable_test.cc The adjustments are done in such a way that they accommodate both the case where there is separate repaired/unrepaired states and when there isn't.	2025-08-08 06:49:17 +03:00

4105 Commits
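
The majority arithmetic behind the 'raft: enforce odd number of voters in group0' commit message can be illustrated with a small sketch. This is not the group0 voter calculator itself: the helper names (`quorum`, `tolerated_failures`, `can_split_evenly`, `enforce_odd`) are hypothetical, and the rule of dropping one voter from an even count is only an assumption about how such enforcement might look.

```cpp
#include <cstddef>

// Smallest number of voters that forms a Raft majority.
constexpr std::size_t quorum(std::size_t voters) {
    return voters / 2 + 1;
}

// Number of voters that may be lost while a majority is still reachable.
constexpr std::size_t tolerated_failures(std::size_t voters) {
    return voters == 0 ? 0 : voters - quorum(voters);
}

// True if a two-way partition can leave both sides without a majority,
// which happens only when the voters can be split into equal halves.
constexpr bool can_split_evenly(std::size_t voters) {
    return voters > 0 && voters % 2 == 0;
}

// Hypothetical enforcement rule (an assumption, not the actual ScyllaDB
// logic): an even candidate count is reduced by one to make it odd.
constexpr std::size_t enforce_odd(std::size_t candidates) {
    return can_split_evenly(candidates) ? candidates - 1 : candidates;
}

// Going from 3 to 4 voters does not add fault tolerance, going to 5 does.
static_assert(tolerated_failures(3) == 1);
static_assert(tolerated_failures(4) == 1);
static_assert(tolerated_failures(5) == 2);
```

With 4 voters a 2/2 partition leaves both sides below the quorum of 3, while 3 voters tolerate the same single failure without that risk, which is the trade-off the commit message describes.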