scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-03 05:26:58 +00:00

Author	SHA1	Message	Date
Michał Chojnowski	3017dbb204	sstables/trie: add trie traversal routines `trie::node_reader`, added in a previous series, contains encoding-aware logic for traversing a single node (or a batch of nodes) during a trie search. This commit adds encoding-agnostic functions which drive the `trie::node_reader` in a loop to traverse the whole branch. Together, the added functions (`traverse`, `step`, `step_back`) and the data structure they modify (`ancestor_trail`) constitute a trie cursor. We might later wrap them into some `trie_cursor` class, but regardless of whether we are going to do that, keeping them (also) as free functions makes them easier to test. Closes scylladb/scylladb#25396	2025-08-11 19:15:09 +03:00
Botond Dénes	65c770f21a	test/boost/row_cache_test: add test for memtable overlap check elision	2025-08-11 17:20:12 +03:00
Botond Dénes	cfac9691ff	compaction/compaction_garbage_collector: max_purgeable: add expiry_treshold Allow possibly avoiding overlap checks in the case where the source of the min-live timestamp is known to only contain data which was written after the expiry threshold. The expiry threshold is the upper bound of tombstone.deletion_time that was already expired at the time of obtaining this expiry threshold value. This means that any write originating after this point in time was generated at a time when such a tombstone was already expired. Hence these writes are not relevant for the purposes of overlap checks with the tombstone, and so their min-live timestamp can be ignored. This is important for MV workloads, where writes generated now can have timestamps going far back in time, possibly blocking tombstone GC of much older [shadowable] tombstones.	2025-08-11 17:20:11 +03:00
Patryk Jędrzejczak	e14c5e3890	Merge 'raft: enforce odd number of voters in group0' from Emil Maskovsky raft: enforce odd number of voters in group0 Implement odd number voter enforcement in the group0 voter calculator to ensure proper Raft consensus behavior. Raft consensus requires a majority of voters to make decisions, and an odd number of voters is preferred because an even number doesn't add additional reliability but introduces the risk of scenarios where no group can make progress. If an even number of voters is divided into two groups of equal size during a network partition, neither group will have a majority and both will be unable to commit new entries. With an odd number of voters, such equal partition scenarios are impossible (unless the network is partitioned into at least three groups). Fixes: scylladb/scylladb#23266 No backport: This is a new change that is to be only deployed in the new version, so it will not be backported. Closes scylladb/scylladb#25332 * https://github.com/scylladb/scylladb: raft: enforce odd number of voters in group0 test/raft: adapt test_tablets_lwt.py for odd voter number enforcement test/raft: adapt test_raft_no_quorum.py for odd voter enforcement	2025-08-11 15:44:21 +02:00
Benny Halevy	23ac80fc6b	utils: stall_free: detect clear_gently method of const payload types Currently, when a container or smart pointer holds a const payload type, utils::clear_gently does not detect the object's clear_gently method as the method is non-const and requires a mutable object, as in the following example in class tablet_metadata: ``` using tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>; using table_to_tablet_map = std::unordered_map<table_id, tablet_map_ptr>; ``` That said, when a container is cleared gently the elements it holds are destroyed anyhow, so we'd like to allow clearing them gently before destruction. This change still doesn't allow directly calling utils::clear_gently on const objects. Also adds the respective unit tests. Fixes #24605 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-11 14:22:01 +03:00
Benny Halevy	cb9db2f396	utils: stall_free: clear gently a foreign shared ptr only when use_count==1 Unlike clear_gently of SharedPtr, clear_gently of a `foreign_ptr<shared_ptr<T>>` calls clear_gently on the contained object even if it's still shared and may still be in use. This change examines the foreign shared pointer's use_count and calls clear_gently on the shared object only when its use_count reaches 1. Fixes #25026 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-11 14:21:32 +03:00
Tomasz Grabiec	f7c001deff	Merge 'key: clustering_bounds_comparator: avoid thread_local initialization guard overhead' from Avi Kivity I noticed clustering_bounds_comparator was running an unnecessary thread_local initialization guard. This series switches the variable to constinit initialization, removing the guard. Performance measurements (perf-simple-query) show an unimpressive 20 instruction per op reduction. However, each instruction counts! Before: ``` throughput: mean= 203642.54 standard-deviation=1102.99 median= 204328.69 median-absolute-deviation=955.56 maximum=204624.13 minimum=202222.19 instructions_per_op: mean= 42097.59 standard-deviation=40.07 median= 42111.83 median-absolute-deviation=30.65 maximum=42139.88 minimum=42044.91 cpu_cycles_per_op: mean= 22664.81 standard-deviation=131.28 median= 22581.10 median-absolute-deviation=111.57 maximum=22832.30 minimum=22553.24 ``` After: ``` throughput: mean= 204397.73 standard-deviation=2277.71 median= 204942.95 median-absolute-deviation=2191.54 maximum=207588.30 minimum=202162.80 instructions_per_op: mean= 42087.21 standard-deviation=27.30 median= 42092.75 median-absolute-deviation=20.33 maximum=42108.33 minimum=42041.51 cpu_cycles_per_op: mean= 22589.79 standard-deviation=219.24 median= 22544.82 median-absolute-deviation=191.98 maximum=22835.11 minimum=22303.52 ``` (Very) minor performance improvement, no backport suggested. Closes scylladb/scylladb#25259 * github.com:scylladb/scylladb: keys: clustering_bounds_comparator: make thread_local _empty_prefix constinit keys: make empty creation clustering_key_prefix constexpr managed_bytes: make empty managed_bytes constexpr friendly keys: clustering_bounds_comparator: make _empty_prefix a prefix	2025-08-11 13:20:38 +02:00
Botond Dénes	ab633590f1	tombstone_gc: introduce tombstone_gc_state_snapshot Returns gc-before times, identical to what tombstone_gc_state would have returned at the point of taking the snapshot.	2025-08-11 07:09:14 +03:00
Botond Dénes	614d17347a	tombstone_gc: extract shared state into shared_tombstone_gc_state Instead of storing it partially in tombstone_gc and partially in an external map, move all external parts into the new shared_tombstone_gc_state. This new class is responsible for keeping and updating the repair history. tombstone_gc_state just keeps const pointers to the shared state as before and is only responsible for querying the tombstone gc before times. This separation makes the code easier to follow and also enables further patching of tombstone_gc_state.	2025-08-11 07:09:14 +03:00
Botond Dénes	ef7d49cd21	compaction/compaction_garbage_collector: refactor max_purgeable into a class Make members private, add getters and constructors. This struct will get more functionality soon, so class is a better fit.	2025-08-11 07:09:13 +03:00
Botond Dénes	c150bdd59c	test/boost/row_cache_test: refactor test_populating_reader_tombstone_gc_with_data_in_memtable This test currently uses gc_grace_seconds=0. The introduction of memtable overlap elision will break these tests because the optimization is always active with this tombstone-gc. Switch the tests to use tombstone-gc=repair, which allows for greater control over when the memtable overlap elision is triggered. This requires a move to vnodes, as tombstone-gc=repair doesn't work with RF=1 currently, and using RF=3 won't work with tablets.	2025-08-11 07:09:13 +03:00
Botond Dénes	c052f2ad1d	test: rewrite test_compacting_reader_tombstone_gc_with_data_in_memtable in C++ This test will soon need to be changed to use tombstone-gc=repair. This cannot work as of now, as the test uses a single-node cluster. The options are the following: * Make it use more than one node * Make repair work with single node clusters * Rewrite in C++ where repair can be done synthetically We chose the last option, as it is the simplest one both in terms of code and runtime footprint. The new test is in test/boost/row_cache_test.cc Two changes were made during the migration * Change the name to test_populating_reader_tombstone_gc_with_data_in_memtable to better express which cache component this test is targeting; * Use NullCompactionStrategy on the table instead of disabling auto-compaction.	2025-08-11 07:09:13 +03:00
Botond Dénes	e4c048ada1	test/boost/row_cache_test: refactor cache tombstone GC with memtable overlap tests These tests currently use tombstone-gc=immediate. The introduction of memtable overlap elision will break these tests because the optimization is always active with this tombstone-gc. Switch the tests to use tombstone-gc=repair, which allows for greater control over when the memtable overlap elision is triggered. This requires a move to vnodes, as tombstone-gc=repair doesn't work with RF=1 currently, and using RF=3 won't work with tablets.	2025-08-11 07:09:13 +03:00
Emil Maskovsky	7c54401d3d	raft: enforce odd number of voters in group0 Implement odd number voter enforcement in the group0 voter calculator to ensure proper Raft consensus behavior. Raft consensus requires a majority of voters to make decisions, and an odd number of voters is preferred because an even number doesn't add additional reliability but introduces the risk of scenarios where no group can make progress. If an even number of voters is divided into two groups of equal size during a network partition, neither group will have a majority and both will be unable to commit new entries. With an odd number of voters, such equal partition scenarios are impossible (unless the network is partitioned into at least three groups). Fixes: scylladb/scylladb#23266	2025-08-08 19:49:20 +02:00
Benny Halevy	0a20834d2a	replica: table: get rid of update_sstables_known_generation It is not needed anymore. With that database::_sstable_generation_generator can be a regular member rather than optional and initialized later. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-08 11:46:21 +03:00
Benny Halevy	42cb25c470	sstables: sstable_directory: stop tracking highest_generation It is not needed anymore as we always generate uuid generations. Convert sstable_directory_test_table_simple_empty_directory_scan to use the newly added empty() method instead of checking the highest generation seen. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-08 11:46:21 +03:00
Benny Halevy	b01524c5a3	replica: distributed_loader: stop tracking highest_generation It is not needed anymore as we always generate uuid generations. Move highest_generation_seen(sharded<sstables::sstable_directory>& directory) to sstables/sstable_directory module. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-08 11:46:21 +03:00
Benny Halevy	6cc964ef16	sstables: sstable_generation: get rid of uuid_identifiers bool class Now that all call sites enable uuid_identifiers. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-08 11:46:21 +03:00
Raphael S. Carvalho	beaaf00fac	test: Add test that compaction doesn't cross logical group boundary Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:58:01 +03:00
Raphael S. Carvalho	d351b0726b	replica: Introduce views in compaction_group for incremental repair Wired the unrepaired, repairing and repaired views into compaction_group. Also the repaired filter was wired, so tablet_storage_group_manager can implement the procedure to classify the sstable. Based on this classifier, we can decide which view an sstable belongs to, at any given point in time. Additionally, we made changes to compaction_group_view to return only sstables that belong to the underlying view. From this point on, repaired, repairing and unrepaired sets are connected to compaction manager through their views. And that guarantees sstables on different groups cannot be compacted together. Repairing view specifically has compaction disabled on it altogether; we can revert this later if we want, to allow repairing sstables to be compacted with one another. The benefit of this logical approach is having the classifier as the single source of truth. Otherwise, we'd need to keep the sstable location consistent with global metadata, creating complexity. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:58:00 +03:00
Raphael S. Carvalho	9d3755f276	replica: Futurize retrieval of sstable sets in compaction_group_view This will allow upcoming work to gently produce an sstable set for each compaction group view. Example: repaired and unrepaired. Locking strategy for compaction's sstable selection: Since the sstable retrieval path became futurized, tasks in compaction manager will now hold the write lock (compaction_state::lock) when retrieving the sstable list, feeding them into compaction strategy, and finally registering selected sstables as compacting. The last step prevents another concurrent task from picking the same sstable. Previously, all those steps were atomic, but we have seen stalls in that area in large installations, so futurization of that area would come sooner or later. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:58:00 +03:00
Raphael S. Carvalho	2c4a9ba70c	treewide: Rename table_state to compaction_group_view Since table_state is a view to a compaction group, it makes sense to rename it as so. With upcoming incremental repair, each replica::compaction_group will be actually two compaction groups, so there will be two views for each replica::compaction_group. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:51:28 +03:00
Asias He	acc367c522	tests: adjust for incremental repair The separation of sstables into the logical repaired and unrepaired virtual sets requires some adjustments for certain tests, in particular for those that look at number of compaction tasks or number of sstables. The following tests need adjustment: * test/cluster/tasks/test_tablet_tasks.py * test/boost/memtable_test.cc The adjustments are done in such a way that they accommodate both the case where there are separate repaired/unrepaired states and the case where there aren't.	2025-08-08 06:49:17 +03:00
Avi Kivity	8164f72f6e	Merge 'Separate local_effective_replication_map from vnode_effective_replication_map' from Benny Halevy Derive both vnode_effective_replication_map and local_effective_replication_map from static_effective_replication_map as both are static and per-keyspace. However, local_effective_replication_map does not need vnodes for the mapping of all tokens to the local node. Refs #22733 * No backport required Closes scylladb/scylladb#25222 * github.com:scylladb/scylladb: locator: abstract_replication_strategy: implement local_replication_strategy locator: vnode_effective_replication_map: convert clone_data_gently to clone_gently locator: abstract_replication_map: rename make_effective_replication_map locator: abstract_replication_map: rename calculate_effective_replication_map replica: database: keyspace: rename {create,update}_effective_replication_map locator: effective_replication_map_factory: rename create_effective_replication_map locator: abstract_replication_strategy: rename vnode_effective_replication_map_ptr et. al locator: abstract_replication_strategy: rename global_vnode_effective_replication_map keyspace: rename get_vnode_effective_replication_map dht: range_streamer: use naked e_r_m pointers storage_service: use naked e_r_m pointers alternator: ttl: use naked e_r_m pointers locator: abstract_replication_strategy: define is_local	2025-08-07 12:51:43 +03:00
Avi Kivity	90eb6e6241	Merge 'sstables/trie: implement BTI node format serialization and traversal' from Michał Chojnowski This is the next part in the BTI index project. Overarching issue: https://github.com/scylladb/scylladb/issues/19191 Previous part: https://github.com/scylladb/scylladb/pull/25154 Next part: implementing a trie cursor (the "set to key, step forwards, step backwards" thing) on top of the `node_reader` added here. The new code added here is not used for anything yet, but it's posted as a separate PR to keep things reviewably small. This part implements the BTI trie node encoding, as described in https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md#trie-nodes. It contains the logic for encoding the abstract in-memory `writer_node`s (added in the previous PR) into the on-disk format, and the logic for traversing the on-disk nodes during a read. New functionality, no backporting needed. Closes scylladb/scylladb#25317 * github.com:scylladb/scylladb: sstables/trie: add tests for BTI node serialization and traversal sstables/trie: implement BTI node traversal sstables/trie: implement BTI serialization utils/cached_file: add get_shared_page() utils/cached_file: replace a std::pair with a named struct	2025-08-07 12:15:42 +03:00
Benny Halevy	02b922ac40	test: cql_query_test: add test_sstable_load_mixed_generation_type Test that we can load sstables with mixed, numerical and uuid generation types, and verify the expected data. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-07 12:04:23 +03:00
Benny Halevy	9b65856a26	test: sstable_datafile_test: move copy_directory helper to test/lib/test_utils It's a generic helper that can be used by all tests. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-07 12:04:23 +03:00
Benny Halevy	7c9ce235d7	test: database_test: move table_dir helper to test/lib/test_utils It's a generic helper that can be used by all tests. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-07 12:04:23 +03:00
Nikos Dragazis	ee92fcc078	encryption_at_rest_test: Preserve tmpdir from failing KMIP tests The KMIP tests start a local PyKMIP server and configure it to write logs in the test's temporary directory (`tmpdir`). However, the tmpdir is a RAII object that deletes the directory once it goes out of scope, causing PyKMIP server logs to be lost on test failures. To assist with debugging, preserve the whole directory if the test failed with an exception. Allow the user to disable this by setting the SCYLLA_TEST_PRESERVE_TMP_ON_EXCEPTION environment variable. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-06 16:29:19 +03:00
Benny Halevy	6dbbb80aae	locator: abstract_replication_strategy: implement local_replication_strategy Derive both vnode_effective_replication_map and local_effective_replication_map from static_effective_replication_map as both are static and per-keyspace. However, local_effective_replication_map does not need vnodes for the mapping of all tokens to the local node. Note that everywhere_replication_strategy is not abstracted in a similar way, although it could be, since the plan is to get rid of it once all system keyspaces are converted to local or tablets replication (and propagated everywhere if needed using raft group0) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:05:11 +03:00
Benny Halevy	babb4a41a8	locator: abstract_replication_map: rename calculate_effective_replication_map to calculate_vnode_effective_replication_map since it is specific to vnode-based range calculations. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:03:53 +03:00
Benny Halevy	cbad497859	locator: abstract_replication_strategy: rename vnode_effective_replication_map_ptr et. al to static_effective_replication_map_ptr, in preparation for separating local_effective_replication_map from vnode_effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:03:53 +03:00
Benny Halevy	bd62421c05	keyspace: rename get_vnode_effective_replication_map to get_static_effective_replication_map, in preparation for separating local_effective_replication_map from vnode_effective_replication_map (both are per-keyspace). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 13:40:43 +03:00
Pavel Emelyanov	0616407be5	Merge 'rest_api: add endpoint which drops all quarantined sstables' from Taras Veretilnyk Added a new POST endpoint `/storage_service/drop_quarantined_sstables` to the REST API. This endpoint allows dropping all quarantined SSTables either globally or for a specific keyspace and tables. Optional query parameters `keyspace` and `tables` (comma-separated table names) can be provided to limit the scope of the operation. Fixes scylladb/scylladb#19061 Backport is not required, it is new functionality Closes scylladb/scylladb#25063 * github.com:scylladb/scylladb: docs: Add documentation for the nodetool dropquarantinedsstables command nodetool: add command for dropping quarantine sstables rest_api: add endpoint which drops all quarantined sstables	2025-08-06 11:55:15 +03:00
Karol Nowacki	032e8f9030	test/boost/vector_store_client_test.cc: Fix flaky tests The vector_store_client_test was observed to be flaky, sometimes hanging while waiting for a response from the HTTP server. Problem: The default load balancing algorithm (in Seastar's posix_server_socket_impl::accept) could route an incoming connection to a different shard than the one executing the test. Because the HTTP server is a non-sharded service running only on the test's originating shard, any connection submitted to another shard would never be handled, causing the test client to hang waiting for a response. Solution: The patch resolves the issue by explicitly setting the fixed cpu load balancing algorithm. This ensures that incoming connections are always handled on the same shard where the HTTP server is running. Closes scylladb/scylladb#25314	2025-08-06 11:24:51 +03:00
Michał Chojnowski	9930cd59eb	sstables/trie: add tests for BTI node serialization and traversal Adds tests which check that nodes serialized by `bti_node_sink` are readable by `bti_node_reader` with the right result. (Note: there are no tests which check compatibility of the encoded nodes with Cassandra or with handwritten hexdumps. There are only tests for mutual compatibility between Scylla's writers and readers. This can be considered a gap in testing.)	2025-08-05 21:48:24 +02:00
Pavel Emelyanov	10056a8c6d	Merge 'Simplify credential reload: remove internal expiration checks' from Ernest Zaslavsky This PR introduces a refinement in how credential renewal is triggered. Previously, the system attempted to renew credentials one hour before their expiration, but the credentials provider did not recognize them as expired, resulting in a no-op renewal that returned existing credentials. This led the timer fiber to immediately retry renewal, causing a renewal storm. To resolve this, we remove the expiration check (and any other checks) from the `reload` method, assuming that whoever calls this method knows what they are doing. Fixes: https://github.com/scylladb/scylladb/issues/25044 Should be backported to 2025.3 since we need this fix for the restore Closes scylladb/scylladb#24961 * github.com:scylladb/scylladb: s3_creds: code cleanup s3_creds: Make `reload` unconditional s3_creds: Add test exposing credentials renewal issue	2025-08-05 17:49:13 +03:00
Avi Kivity	4c785b31c7	Merge 'List Alternator clients in system.clients virtual table' from Nadav Har'El Before this series, the "system.clients" virtual table listed active connections (and their various properties, like client address, logged in username and client version) only for CQL requests. This series also adds Alternator clients to system.clients. One of the interesting use cases of this new feature is understanding exactly which SDK a user is using, without inspecting their application code. Different SDKs pass different "User-Agent" headers in requests, and that User-Agent will be visible in the system.clients entries for Alternator requests as the "driver_name" field. Unlike CQL where logged in username, driver name, etc. apply to a complete connection, in the Alternator API, different requests can theoretically be signed by different users and carry different headers but still arrive over the same HTTP connection. So instead of listing the currently open Alternator connections, we will list the currently active requests. The first three patches introduce utilities that will be useful in the implementation. The fourth patch is the implementation itself (which is quite simple with the utility introduced in the second patch), and the fifth patch is a regression test for the new feature. The sixth patch adds documentation, the seventh patch refactors generic_server to use the newly introduced utility class and reduce code duplication, and the eighth patch adds a small check to an existing check of CQL's system.clients. Fixes #24993 This patch adds a new feature, so doesn't require a backport. Nevertheless, if we want it to get to existing customers more quickly to allow us to better understand their use case by reading the system.clients table, we may want to consider backporting this patch to existing branches. 
There is some risk involved in this patch, because it adds code that gets run on every Alternator request, so a bug in it can cause problems for every Alternator request. Closes scylladb/scylladb#25178 * github.com:scylladb/scylladb: test/cqlpy: slightly strengthen test for system.clients generic_server: use utils::scoped_item_list docs/alternator: document the system.clients system table in Alternator alternator: add test for Alternator clients in system.clients alternator: list active Alternator requests in system.clients utils: unit test for utils::scoped_item_list utils: add a scoped_item_list utility class utils: add "fatal" version of utils::on_internal_error()	2025-08-05 15:55:41 +03:00
Ernest Zaslavsky	e4ebe6a309	s3_creds: Make `reload` unconditional Assume that any caller invoking `reload` intends to refresh credentials. Remove conditional logic that checks for expiration before reloading.	2025-08-03 17:41:35 +03:00
Ernest Zaslavsky	68855c90ca	s3_creds: Add test exposing credentials renewal issue Add a test demonstrating that renewing credentials does not update their expiration. After requesting credentials again, the expiration remains unchanged, indicating no actual update occurred.	2025-08-03 17:41:25 +03:00
Avi Kivity	8b1bf46086	Merge 'sstables: introduce trie_writer' from Michał Chojnowski This is the first part of a larger project meant to implement a trie-based index format. (The same or almost the same as Cassandra's BTI). As of this patch, the new code isn't used for anything yet, but we introduce it separately from its users to keep PRs small enough for reviewability. This commit introduces trie_writer, a class responsible for turning a stream of (key, value) pairs (already sorted by key) into a stream of serializable nodes, such that: 1. Each node lies entirely within one page (guaranteed). 2. Parents are located in the same page as their children (best-effort). 3. Padding (unused space) is minimized (best-effort). It does mostly what you would expect a "sorted keys -> trie" builder to do. The hard part is calculating the sizes of nodes (which, in a well-packed on-disk format, depend on the exact offsets of the node from its children) and grouping them into pages. This implementation mostly follows Cassandra's design of the same thing. There are some differences, though. Notable ones: 1. The writer operates on chains of characters, rather than single characters. In Cassandra's implementation, the writer creates one node per character. A single long key can be translated to thousands of nodes. We create only one node per key. (Actually we split very long keys into a few nodes, but that's arbitrary and beside the point). For BTI's partition key index this doesn't matter. Since it only stores a minimal unique prefix of each key, and the trie is very balanced (due to token randomness), the average number of new characters added per key is very close to 1 anyway. (And the string-based logic might actually be a small pessimization, since manipulating a 1-byte string might be costlier than manipulating a single byte). But the row index might store arbitrarily long entries, and in that case the character-based logic might result in catastrophically bad performance. 
For reference: when writing a partition index, the total processing cost of a single node in the trie_writer is on the order of 800 instructions. Total processing cost of a single tiny partition during a `upgradesstables` operation is on the order of 10000 instructions. A small INSERT is on the order of 40000 instructions. So processing a single 1000-character clustering key in the trie_writer could cost as much as 20 INSERTs, which is scary. Even 100-character keys can be very expensive. With extremely long keys like that, the string-based logic is more than ~100x cheaper than character-based logic. (Note that only new characters matter here. If two index entries share a prefix, that prefix is only processed once. And the index is only populated with the minimal prefix needed to distinguish neighbours. So in practice, long chains might not happen often. But still, they are possible). I don't know if it makes sense to care about this case, but I figured the potential for problems is too big to ignore, so I switched to chain-based logic. 2. In the (assumed to be rare) case when a grouped subtree turns out to be bigger than a full page after revising the estimate, Cassandra splits it in a different way than us. For testability, there is some separation between the logic responsible for turning a stream of keys into a stream of nodes, and the logic responsible for turning a stream of nodes into a stream of bytes. This commit only includes the first part. It doesn't implement the target on-disk format yet. The serialization logic is passed to trie_writer via a template parameter. There is only one test added in this commit, which attempts to be exhaustive, by testing all possible datasets up to some size. The run time of the test grows exponentially with the parameter size. I picked a set of parameters which runs fast enough while still being expressive enough to cover all the logic. (I checked the code coverage). 
But I also tested it with greater parameters on my own machine (and with DEVELOPER_BUILD enabled, which adds extra sanitization). Refs scylladb/scylladb#19191 New functionality, no backporting needed. Closes scylladb/scylladb#25154 * github.com:scylladb/scylladb: sstables: introduce trie_writer utils/bit_cast: add object_representation()	2025-08-01 20:23:24 +03:00
Nikos Dragazis	2656fca504	test: Use in-memory SQLite for PyKMIP server The PyKMIP server uses an SQLite database to store artifacts such as encryption keys. By default, SQLite performs a full journal and data flush to disk on every CREATE TABLE operation. Each operation triggers three fdatasync(2) calls. If we multiply this by 16, that is the number of tables created by the server, we get a significant number of file syncs, which can last for several seconds on slow machines. This behavior has led to CI stability issues from KMIP unit tests where the server failed to complete its schema creation within the 20-second timeout (observed on spider9 and spider11). Fix this by configuring the server to use an in-memory SQLite. Fixes #24842. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#24995	2025-08-01 12:11:27 +03:00
Nadav Har'El	20b31987e1	utils: unit test for utils::scoped_item_list The previous patch introduced a new utility class, utils::scoped_item_list. This patch adds a comprehensive unit test for the new class. We test basic usage of scoped_item_list, its size() and empty() methods, how items are removed from the list when their handle goes out of scope, how a handle's move constructor works, how items can be read and written through their handles, and finally that removing an item during a for_each_gently() iteration doesn't break the iteration. One thing I still didn't figure out how to properly test is how removing an item during multiple iterations that run concurrently fixes multiple iterators. I believe the code is correct there (we just have a list of ongoing iterations - instead of just one), but haven't yet found a way to reproduce this situation in a test. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-01 02:15:04 +03:00
Michał Chojnowski	c8682af418	sstables: introduce trie_writer This is the first part of a larger project meant to implement a trie-based index format. (The same or almost the same as Cassandra's BTI). As of this patch, the new code isn't used for anything yet, but we introduce it separately from its users to keep PRs small enough for reviewability. This commit introduces trie_writer, a class responsible for turning a stream of (key, value) pairs (already sorted by key) into a stream of serializable nodes, such that: 1. Each node lies entirely within one page (guaranteed). 2. Parents are located in the same page as their children (best-effort). 3. Padding (unused space) is minimized (best-effort). It does mostly what you would expect a "sorted keys -> trie" builder to do. The hard part is calculating the sizes of nodes (which, in a well-packed on-disk format, depend on the exact offsets of the node from its children) and grouping them into pages. This implementation mostly follows Cassandra's design of the same thing. There are some differences, though. Notable ones: 1. The writer operates on chains of characters, rather than single characters. In Cassandra's implementation, the writer creates one node per character. A single long key can be translated to thousands of nodes. We create only one node per key. (Actually we split very long keys into a few nodes, but that's arbitrary and beside the point). For BTI's partition key index this doesn't matter. Since it only stores a minimal unique prefix of each key, and the trie is very balanced (due to token randomness), the average number of new characters added per key is very close to 1 anyway. (And the string-based logic might actually be a small pessimization, since manipulating a 1-byte string might be costlier than manipulating a single byte). But the row index might store arbitrarily long entries, and in that case the character-based logic might result in catastrophically bad performance. 
For reference: when writing a partition index, the total processing cost of a single node in the trie_writer is on the order of 800 instructions. Total processing cost of a single tiny partition during an `upgradesstables` operation is on the order of 10000 instructions. A small INSERT is on the order of 40000 instructions. So processing a single 1000-character clustering key in the trie_writer could cost as much as 20 INSERTs, which is scary. Even 100-character keys can be very expensive. With extremely long keys like that, the string-based logic is more than ~100x cheaper than character-based logic. (Note that only new characters matter here. If two index entries share a prefix, that prefix is only processed once. And the index is only populated with the minimal prefix needed to distinguish neighbours. So in practice, long chains might not happen often. But still, they are possible). I don't know if it makes sense to care about this case, but I figured the potential for problems is too big to ignore, so I switched to chain-based logic. 2. In the (assumed to be rare) case when a grouped subtree turns out to be bigger than a full page after revising the estimate, Cassandra splits it in a different way than us. For testability, there is some separation between the logic responsible for turning a stream of keys into a stream of nodes, and the logic responsible for turning a stream of nodes into a stream of bytes. This commit only includes the first part. It doesn't implement the target on-disk format yet. The serialization logic is passed to trie_writer via a template parameter. There is only one test added in this commit, which attempts to be exhaustive, by testing all possible datasets up to some size. The run time of the test grows exponentially with the parameter size. I picked a set of parameters which runs fast enough while still being expressive enough to cover all the logic. (I checked the code coverage). 
But I also tested it with greater parameters on my own machine (and with DEVELOPER_BUILD enabled, which adds extra sanitization).	2025-07-31 12:51:37 +02:00
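The cost estimate in the trie_writer message above can be reproduced with a small back-of-the-envelope sketch. It is not the trie_writer algorithm: it only counts the "new" characters each sorted key adds past the prefix it shares with its predecessor, which is the number of nodes a one-node-per-character writer would create. The two cost constants are the order-of-magnitude figures quoted in the commit message; the function names are hypothetical.

```python
NODE_COST = 800        # instructions per trie node (figure from the commit message)
INSERT_COST = 40_000   # instructions per small INSERT (figure from the commit message)


def shared_prefix_len(a, b):
    """Length of the longest common prefix of two byte strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def new_chars_per_key(sorted_keys):
    """For each key, count the characters past the prefix shared with the previous key."""
    counts = []
    prev = b""
    for key in sorted_keys:
        counts.append(len(key) - shared_prefix_len(prev, key))
        prev = key
    return counts


def insert_equivalents(new_chars):
    """Cost of creating one node per new character, expressed in small INSERTs."""
    return new_chars * NODE_COST / INSERT_COST


print(new_chars_per_key([b"apple", b"apply", b"banana"]))  # prints [5, 1, 6]
print(insert_equivalents(1000))  # prints 20.0
```

With a single 1000-character key and nothing shared, the per-character approach pays for 1000 nodes, which is where the "as much as 20 INSERTs" figure comes from.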
Calle Wilund	43f7eecf9e	compress: move compress.cc/hh to sstables/compressor Fixes #22106 Moves the shared compress components to sstables, and renames them to match the class type. Adjusts includes, removing redundant/unneeded ones where possible. Closes scylladb/scylladb#25103	2025-07-31 13:10:41 +03:00
Pavel Emelyanov	34608450c5	Merge 'qos: don't populate effective service level cache until auth is migrated to raft' from Piotr Dulikowski Right now, service levels are migrated in one group0 command and auth is migrated in the next one. This has a bad effect on the group0 state reload logic - modifying service levels in group0 causes the effective service levels cache to be recalculated, and to do so we need to fetch information about all roles. If the reload happens after SL upgrade and before auth upgrade, the query for roles will be directed to the legacy auth tables in system_auth - and the query, being a potentially remote query, has a timeout. If the query times out, it will throw an exception which will break the group0 apply fiber and the node will need to be restarted to bring it back to work. In order to solve this issue, make sure that the service level module does not start populating and using the service level cache until both service levels and auth are migrated to raft. This is achieved by adding the check both to the cache population logic and the effective service level getter - they now look at service level's accessor new method, `can_use_effective_service_level_cache` which takes a look at the auth version. Fixes: scylladb/scylladb#24963 Should be backported to all versions which support upgrade to topology over raft - the issue described here may put the cluster into a state which is difficult to get out of (group0 apply fiber can break on multiple nodes, which necessitates their restart). Closes scylladb/scylladb#25188 * github.com:scylladb/scylladb: test: sl: verify that legacy auth is not queried in sl to raft upgrade qos: don't populate effective service level cache until auth is migrated to raft	2025-07-31 13:05:27 +03:00
Avi Kivity	5c6c944797	managed_bytes: make empty managed_bytes constexpr friendly Sprinkle constexpr where needed to make the default constructor, move constructor, and destructor constexpr. Add a test to verify. This is needed to make a thread_local variable containing an empty managed_bytes constinit, reducing thread-local guards.	2025-07-29 23:51:43 +03:00
Botond Dénes	2985c343ed	Merge 'repair: Avoid too many fragments in a single repair_row_on_wire' from Asias He When repairing a partition with many rows, we can store many fragments in a repair_row_on_wire object which is sent as a rpc stream message. This could cause reactor stalls when the rpc stream compression is turned on, because the compression compresses the whole message without any split and compression. This patch solves the problem at the higher level by reducing the message size that is sent to the rpc stream. Tests are added to make sure the message split works. Fixes #24808 Closes scylladb/scylladb#25002 * github.com:scylladb/scylladb: repair: Avoid too many fragments in a single repair_row_on_wire repair: Change partition_key_and_mutation_fragments to use chunked_vector utils: Allow chunked_vector::erase to work with non-default-constructible type	2025-07-29 17:45:57 +03:00
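The general technique behind the repair fix above, capping how much goes into one stream message, can be sketched as follows. This is an illustration under stated assumptions, not the code from the patch: the commit message does not say whether the real limit is a byte size or a fragment count, so the byte threshold here, along with the name `split_into_batches`, is hypothetical.

```python
def split_into_batches(fragments, max_batch_bytes):
    """Group fragments, in order, into batches of at most max_batch_bytes each.

    A single fragment larger than the limit is placed in a batch of its own.
    """
    batches = []
    current, current_bytes = [], 0
    for frag in fragments:
        # Close the current batch if adding this fragment would exceed the limit.
        if current and current_bytes + len(frag) > max_batch_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(frag)
        current_bytes += len(frag)
    if current:
        batches.append(current)
    return batches


fragments = [b"x" * 4] * 5
batches = split_into_batches(fragments, max_batch_bytes=10)
print([len(batch) for batch in batches])  # prints [2, 2, 1]
```

Each batch would then be sent as its own message, so no single compression call has to process the whole partition at once.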
Piotr Dulikowski	2bb800c004	qos: don't populate effective service level cache until auth is migrated to raft Right now, service levels are migrated in one group0 command and auth is migrated in the next one. This has a bad effect on the group0 state reload logic - modifying service levels in group0 causes the effective service levels cache to be recalculated, and to do so we need to fetch information about all roles. If the reload happens after SL upgrade and before auth upgrade, the query for roles will be directed to the legacy auth tables in system_auth - and the query, being a potentially remote query, has a timeout. If the query times out, it will throw an exception which will break the group0 apply fiber and the node will need to be restarted to bring it back to work. In order to solve this issue, make sure that the service level module does not start populating and using the service level cache until both service levels and auth are migrated to raft. This is achieved by adding the check both to the cache population logic and the effective service level getter - they now look at service level's accessor new method, `can_use_effective_service_level_cache` which takes a look at the auth version. Fixes: scylladb/scylladb#24963	2025-07-29 11:37:37 +02:00
Asias He	266a518e4c	repair: Change partition_key_and_mutation_fragments to use chunked_vector With the change in "repair: Avoid too many fragments in a single repair_row_on_wire", the std::list<frozen_mutation_fragment> _mfs; in partition_key_and_mutation_fragments will not contain a large number of fragments any more. Switch to using chunked_vector.	2025-07-29 13:43:17 +08:00

4728 Commits