scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-26 11:30:36 +00:00

Author	SHA1	Message	Date
Lakshmi Narayanan Sreethar	7cdda510ee	compaction/scrub: register sstables for compaction before validation When `scrub --validate` runs, it collects all candidate sstables at the start and validates them one by one in separate compaction tasks. However, scrub in validate mode does not register these sstables for compaction, which allows regular compaction to pick them up and potentially compact them away before validation begins. This leads to scrub failures because the sstables can no longer be found. This patch fixes the issue by first disabling compaction, collecting the sstables, and then registering them for compaction before starting validation. This ensures that the enqueued sstables remain available for the entire duration of the scrub validation task. Fixes #23363 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-09-16 15:29:57 +05:30
Aleksandra Martyniuk	75b772adfb	db: optimize cache invalidation following repair/streaming Currently, if a new sstable is created during repair/streaming, we invalidate its whole token range in cache. If the sstable is sparse, we unnecessarily clear too much data. Modify cache invalidation, so that only the partitions present in the sstable are cleared. To check whether a partition is present in the sstable, we use bloom filters. Bloom filters may return false positives and show that an sstable contains a partition, even though it does not. Due to that we may invalidate a bit more than we need to, but the cache will be in valid state. An issue arises when we do not invalidate two consecutive partitions that are continuous. The sstable may contain a token that falls between these partitions, breaking the continuity. To check that, we would need to scan sstable index. However, such a change would noticeably complicate the invalidation, both performance and code. In this change, sstable index reader isn't used. Instead, the continuity flag is unset for all scanned partitions. This comes at a cost of heavier reads, as we will need to verify continuity when reading more than one partition from cache. Fixes: https://github.com/scylladb/scylladb/issues/9136. Closes scylladb/scylladb#25996	2025-09-14 19:48:14 +03:00
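A minimal Python sketch of the invalidation strategy described in 75b772adfb, not the Scylla implementation. `CacheEntry`, `invalidate_after_sstable` and the `may_contain` callback are hypothetical names. The callback stands in for a bloom filter query, so it may report false positives but never false negatives.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    key: str
    # True means "no partition is missing between this entry and its neighbour".
    continuous: bool = True


def invalidate_after_sstable(cache, may_contain):
    """Drop only the entries the new sstable might contain.

    Surviving entries lose their continuity flag, because the sstable may
    hold a key that falls between two of them and the index is not scanned.
    """
    survivors = {}
    for key, entry in cache.items():
        if may_contain(key):
            continue  # possibly stale: evict (a false positive only costs a re-read)
        entry.continuous = False
        survivors[key] = entry
    return survivors


# Hypothetical data: the new sstable contains only partition "b".
cache = {k: CacheEntry(k) for k in ("a", "b", "c", "d")}
sstable_keys = {"b"}
cache = invalidate_after_sstable(cache, lambda k: k in sstable_keys)
```

The trade-off matches the commit message: fewer partitions are evicted, but later multi-partition reads have to re-verify continuity.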
Lakshmi Narayanan Sreethar	1d1e572962	sstables: skip bloom filter rebuilds with minimal savings If a bloom filter was built with a bad partition estimate, it is rebuilt right before the sstable is sealed. The rebuild is already skipped if the current bitset size results in a false-positive rate within 75%–125% of the configured value. This patch adds additional conditions to prevent rebuilds when the savings are minimal. It also skips rebuilding for garbage collected sstables, since they will be dropped soon anyway. Also updated and added more test cases to cover these new criteria for bloom filter rebuilds. Fixes #25464 Fixes #25468 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#25968	2025-09-14 18:19:50 +03:00
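The pre-existing 75%–125% check mentioned in 1d1e572962 can be sketched as follows. This is an illustration under assumptions: it uses the textbook false-positive estimate `(1 - e^(-k*n/m))^k`, which may differ from how Scylla computes the rate, and the function names and the sample numbers are hypothetical. The additional "minimal savings" and garbage-collected-sstable conditions added by the patch are not modelled.

```python
import math


def expected_fp_rate(bits, partitions, hashes):
    """Textbook bloom filter false-positive estimate for m bits, n keys, k hashes."""
    return (1.0 - math.exp(-hashes * partitions / bits)) ** hashes


def rebuild_needed(bits, actual_partitions, hashes, configured_fp_rate):
    """Skip the rebuild when the real rate is within 75%-125% of the configured one."""
    actual = expected_fp_rate(bits, actual_partitions, hashes)
    return not (0.75 * configured_fp_rate <= actual <= 1.25 * configured_fp_rate)


# A filter sized for about 1000 partitions at a 1% target (9585 bits, 7 hashes).
good_estimate = rebuild_needed(9585, 1000, 7, 0.01)  # estimate was accurate
bad_estimate = rebuild_needed(9585, 2000, 7, 0.01)   # twice as many partitions arrived
```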
Avi Kivity	c91b326d5a	Merge 'transport: replace throwing protocol_exception with returns' from Dario Mirovic Replace throwing `protocol_exception` with returning it as a result or an exceptional future in the transport server module. The goal is to improve performance. Most of the `protocol_exception` throws were made from `fragmented_temporary_buffer` module, by passing `exception_thrower()` to its `read` methods. `fragmented_temporary_buffer` is changed so that it now accepts an exception creator, not exception thrower. `fragmented_temporary_buffer_concepts::ExceptionCreator` concept replaced `fragmented_temporary_buffer_concepts::ExceptionThrower` and all methods that have been throwing now return failed result of type `utils::result_with_eptr`. This change is then propagated to the callers. The scope of this patch is `protocol_exception`, so commitlog just calls `.value()` method on the result. If the result failed, that will throw the exception from the result, as defined by `utils::result_with_eptr_throw_policy`. This means that the behavior of commitlog module stays the same. transport server module handles results gracefully. All the caller functions that return non-future value `T` now return `utils::result_with_eptr<T>`. When the caller is a function that returns a future, and it receives failed result, `make_exception_future(std::move(failed_result).value())` is returned. The rest of the callstack up to the transport server `handle_error` function is already working without throwing, and that's how zero throws is achieved. cql3 module changes do the same as transport server module. Benchmark that is not yet merged has commit `67fbe35833e2d23a8e9c2dcb5e04580231d8ec96`, [GitHub diff view](https://github.com/scylladb/scylladb/compare/master...nuivall:scylladb:perf_cql_raw). It uses either read or write query. 
Command line used: ``` ./build/release/scylla perf-cql-raw --workdir ~/tmp/scylladir --smp 1 --developer-mode 1 --workload write --duration 300 --concurrency 1000 --username cassandra --password cassandra 2>/dev/null ``` The only thing changed across runs is `--workload write`/`--workload read`. Built and run on `release` target. <details> ``` throughput: mean= 36946.04 standard-deviation=1831.28 median= 37515.49 median-absolute-deviation=1544.52 maximum=39748.41 minimum=28443.36 instructions_per_op: mean= 108105.70 standard-deviation=965.19 median= 108052.56 median-absolute-deviation=53.47 maximum=124735.92 minimum=107899.00 cpu_cycles_per_op: mean= 70065.73 standard-deviation=2328.50 median= 69755.89 median-absolute-deviation=1250.85 maximum=92631.48 minimum=66479.36 ⏱ real=5:11.08 user=2:00.20 sys=2:25.55 cpu=85% ``` ``` throughput: mean= 40718.30 standard-deviation=2237.16 median= 41194.39 median-absolute-deviation=1723.72 maximum=43974.56 minimum=34738.16 instructions_per_op: mean= 117083.62 standard-deviation=40.74 median= 117087.54 median-absolute-deviation=31.95 maximum=117215.34 minimum=116874.30 cpu_cycles_per_op: mean= 58777.43 standard-deviation=1225.70 median= 58724.65 median-absolute-deviation=776.03 maximum=64740.54 minimum=55922.58 ⏱ real=5:12.37 user=27.461 sys=3:54.53 cpu=83% ``` ``` throughput: mean= 37107.91 standard-deviation=1698.58 median= 37185.53 median-absolute-deviation=1300.99 maximum=40459.85 minimum=29224.83 instructions_per_op: mean= 108345.12 standard-deviation=931.33 median= 108289.82 median-absolute-deviation=55.97 maximum=124394.65 minimum=108188.37 cpu_cycles_per_op: mean= 70333.79 standard-deviation=2247.71 median= 69985.47 median-absolute-deviation=1212.65 maximum=92219.10 minimum=65881.72 ⏱ real=5:10.98 user=2:40.01 sys=1:45.84 cpu=85% ``` ``` throughput: mean= 38353.12 standard-deviation=1806.46 median= 38971.17 median-absolute-deviation=1365.79 maximum=41143.64 minimum=32967.57 instructions_per_op: mean= 117270.60 
standard-deviation=35.50 median= 117268.07 median-absolute-deviation=16.81 maximum=117475.89 minimum=117073.74 cpu_cycles_per_op: mean= 57256.00 standard-deviation=1039.17 median= 57341.93 median-absolute-deviation=634.50 maximum=61993.62 minimum=54670.77 ⏱ real=5:12.82 user=4:10.79 sys=11.530 cpu=83% ``` This shows ~240 instructions per op increase for reads and ~180 instructions per op increase for writes. Tests have been run multiple times, with almost identical results. Each run lasted 300 seconds. Number of operations executed is roughly 38k per second × 300 seconds = 11.4M ops. Update: I have repeated the benchmark with a clean state: rebooted the computer, put it in performance mode, rebuilt, and closed other apps that might affect CPU and disk usage. run count: 5 times before and 5 times after the patch duration: 300 seconds Average write throughput median before patch: 41155.99 Average write throughput median after patch: 42193.22 Median absolute deviation is also lower now, with values in range 350-550, while the previous runs' values were in range 750-1350. </details> Built and run on `release` target. 
<details> ./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false --bypass-cache 2>/dev/null ``` throughput: mean= 14910.90 standard-deviation=477.72 median= 14956.73 median-absolute-deviation=294.16 maximum=16061.18 minimum=13198.68 instructions_per_op: mean= 659591.63 standard-deviation=495.85 median= 659595.46 median-absolute-deviation=324.91 maximum=661184.94 minimum=658001.49 cpu_cycles_per_op: mean= 213301.49 standard-deviation=2724.27 median= 212768.64 median-absolute-deviation=1403.85 maximum=225837.15 minimum=208110.12 ⏱ real=5:19.26 user=5:00.22 sys=15.827 cpu=98% ``` ./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false 2>/dev/null ``` throughput: mean= 93345.45 standard-deviation=4499.00 median= 93915.52 median-absolute-deviation=2764.41 maximum=104343.64 minimum=79816.66 instructions_per_op: mean= 65556.11 standard-deviation=97.42 median= 65545.11 median-absolute-deviation=71.51 maximum=65806.75 minimum=65346.25 cpu_cycles_per_op: mean= 34160.75 standard-deviation=803.02 median= 33927.16 median-absolute-deviation=453.08 maximum=39285.19 minimum=32547.13 ⏱ real=5:03.23 user=4:29.46 sys=29.255 cpu=98% ``` ./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache true 2>/dev/null ``` throughput: mean= 206982.18 standard-deviation=15894.64 median= 208893.79 median-absolute-deviation=9923.41 maximum=232630.14 minimum=127393.34 instructions_per_op: mean= 35983.27 standard-deviation=6.12 median= 35982.75 median-absolute-deviation=3.75 maximum=36008.24 minimum=35952.14 cpu_cycles_per_op: mean= 17374.87 standard-deviation=985.06 median= 17140.81 median-absolute-deviation=368.86 maximum=26125.38 minimum=16421.99 ⏱ real=5:01.23 user=4:57.88 sys=0.124 cpu=98% ``` ./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false --bypass-cache 2>/dev/null ``` throughput: mean= 16198.26 
standard-deviation=902.41 median= 16094.02 median-absolute-deviation=588.58 maximum=17890.10 minimum=13458.74 instructions_per_op: mean= 659752.73 standard-deviation=488.08 median= 659789.16 median-absolute-deviation=334.35 maximum=660881.69 minimum=658460.82 cpu_cycles_per_op: mean= 216070.70 standard-deviation=3491.26 median= 215320.37 median-absolute-deviation=1678.06 maximum=232396.48 minimum=209839.86 ⏱ real=5:17.33 user=4:55.87 sys=18.425 cpu=99% ``` ./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false 2>/dev/null ``` throughput: mean= 97067.79 standard-deviation=2637.79 median= 97058.93 median-absolute-deviation=1477.30 maximum=106338.97 minimum=87457.60 instructions_per_op: mean= 65695.66 standard-deviation=58.43 median= 65695.93 median-absolute-deviation=37.67 maximum=65947.76 minimum=65547.05 cpu_cycles_per_op: mean= 34300.20 standard-deviation=704.66 median= 34143.92 median-absolute-deviation=321.72 maximum=38203.68 minimum=33427.46 ⏱ real=5:03.22 user=4:31.56 sys=29.164 cpu=99% ``` ./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache true 2>/dev/null ``` throughput: mean= 223495.91 standard-deviation=6134.95 median= 224825.90 median-absolute-deviation=3302.09 maximum=234859.90 minimum=193209.69 instructions_per_op: mean= 35981.41 standard-deviation=3.16 median= 35981.13 median-absolute-deviation=2.12 maximum=35991.46 minimum=35972.55 cpu_cycles_per_op: mean= 17482.26 standard-deviation=281.82 median= 17424.08 median-absolute-deviation=143.91 maximum=19120.68 minimum=16937.43 ⏱ real=5:01.23 user=4:58.54 sys=0.136 cpu=99% ``` </details> Fixes: #24567 This PR is a continuation of #24738 [transport: remove throwing protocol_exception on connection start](https://github.com/scylladb/scylladb/pull/24738). This PR does not solve a burning issue, but is rather an improvement in the same direction. As it is just an enhancement, it should not be backported. 
Closes scylladb/scylladb#25408 * github.com:scylladb/scylladb: test/cqlpy: add protocol exception tests test/cqlpy: `test_protocol_exceptions.py` refactor message frame building test/cqlpy: `test_protocol_exceptions.py` refactor duplicate code transport: replace `make_frame` throw with return result cql3: remove throwing `protocol_exception` transport: replace throw in validate_utf8 with result_with_exception_ptr return transport: replace throwing protocol_exception with returns utils: add result_with_exception_ptr test/cqlpy: add unknown compression algorithm test case	2025-09-10 21:54:15 +03:00
Avi Kivity	fc64333040	Merge 'sstables/trie: add BTI index readers and writers' from Michał Chojnowski This is yet another part in the BTI index project. Overarching issue: https://github.com/scylladb/scylladb/issues/19191 Previous part: https://github.com/scylladb/scylladb/pull/25506/ Next part: plugging the BTI index readers and writers into sstable readers and writers. The new code added in this PR isn't used outside of tests yet, but it's posted as a separate PR for reviewability. This series implements, on top of the key translation logic and the abstract trie writing and traversal logic, a writer and a reader of sstable index files (which map primary keys to positions in Data.db), as described in `f16fb6765b/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md`. Caveats: 1. I think the added test has reasonable coverage, but that depends on running it multiple times. (Though it shouldn't need more than a few runs to catch any bug it covers). It's somewhat awkward as a test meant for running in CI; it's better as something you run many times after a relevant change. 2. These readers and writers are intended to be compatible with Cassandra, but I did NOT do any compatibility testing. The writers and readers added here have only been tested against each other, not against Cassandra's readers and writers. 3. This didn't undergo any proper benchmarking and optimization work. I was doing some measurements in the past, but everything was rewritten so much since then that my old measurements are effectively invalidated. Frankly I have no idea what the performance of all this branchy-branchy logic is now. No backports needed, new functionality. 
Closes scylladb/scylladb#25626 * github.com:scylladb/scylladb: test/manual: add bti_cassandra_compatibility_test test/lib/random_schema: add some constraints for generated uuid and time/date values test/lib/random_utils: add a variant of get_bytes which takes an `engine&` test/boost: add bti_index_test sstables/writer: add an accessor for the current write position in Data.db sstables/trie: introduce bti_index_reader sstables/trie: add bti_partition_index_writer.cc sstables/trie: add bti_row_index_writer.cc utils/bit_cast: add a new overload of write_unaligned() sstables/trie: add trie_writer::add_partial() sstables/consumer: add read_56() sstables/trie: make bti_node_reader::page_ptr copy-constructible sstables: extract abstract_index_reader from index_reader.hh to its own header sstables/trie: add an accessor to the file_writer under bti_node_sink sstables/types: make `deletion_time::operator tombstone()` const sstables/types: add sstables::deletion_time::make_live() sstables/trie: fix a special case in max_offset_from_child sstables/trie: handle `partition_region`s other than `clustered` in BTI position encoding sstables/trie: rewrite lcb_mismatch to handle fragment invalidation test/boost/bti_key_translation_test: fix a compilation error hidden behind `if constexpr`	2025-09-10 21:48:52 +03:00
Wojciech Mitros	1f9be235b8	mv: delete previously undetected ghost rows in PRUNE MATERIALIZED VIEW statement The PRUNE MATERIALIZED VIEW statement is supposed to remove ghost rows from the view. Ghost rows are rows in the view with no corresponding row in the base table. Before this patch, only rows whose primary key columns of the base table had different values than any of the base rows were treated as ghost rows by the PRUNE statement. However, view rows which have a column in their primary key that's not in the base primary can also be ghost rows if this column has a different value than the base row with the same values of remaining primary key columns. That's because these rows won't be deleted unless we change value of this column in the base table to this specific value. In this patch we add a check for this column in the PRUNE MATERIALIZED VIEW logic. If this column isn't the same in the base table and the view, these rows are also deleted. Fixes https://github.com/scylladb/scylladb/issues/25655 Closes scylladb/scylladb#25720	2025-09-10 07:35:00 +02:00
Pawel Pery	61ee630f42	vector_store_client: add timeouts to tests Sometimes `vector_store_client_test_ann_request` test hangs up. It is hard to reproduce. It seems that the problem is that tests are unreliable in case of stalled requests. This patch attaches a timer to the abort_source to ensure that the test will finish with a timeout at least. Fixes: VECTOR-150 Fixes: #25234 Closes scylladb/scylladb#25301	2025-09-08 10:20:48 +03:00
Michał Chojnowski	927c7ecbb0	test/boost: add bti_index_test Adds a fat unit test (integration test?) for BTI index writers and readers. The test generates a small random dataset for the index writer, writes it to a BTI file, and then attempts to run all possible (and legal) sequences (up to a certain length) of index reader operations and check that the results match the expectation (dictated by a "simple" reference index reader). It is currently the only test for this new feature, but I think it's reasonably good. (I validated the coverage by looking at LLVM coverage reports and by manually adding bugs in various places and checking that the test catches it after a reasonably small number of runs). (Note that in a later series, when we hook up BTI indexes to Data.db readers/writers, the feature will also be indirectly tested by the corresponding integration tests). This is a randomized test. As such, its power grows with the number of runs. In particular, a single run has a decently high probability of not exercising parts of the code at all. (E.g. the generated dataset might have no clustering keys). Also this is a slow test. (With the default parameters, ~1s in release mode on my PC, several seconds in debug mode. (And note that this is after a custom parameter downsizing for debug mode, because this test is slowed extremely badly by debug mode, due to the forced preemption after every future)). For those two reasons, I'm not glad to include it in the test suite that runs in CI. Instead of running it once in every CI run, I'd very much rather have it run 10000 times after the tested feature is touched, and before releases. Unfortunately we don't have a precedent for something like that.	2025-09-07 00:32:02 +02:00
Michał Chojnowski	0684dbb5bd	sstables/trie: fix a special case in max_offset_from_child A `writer_node` contains a chain of multiple BTI nodes. `writer_node::_node_size` is the size (in bytes) of the entire chain. But the parent of the `writer_node` wants to know the offset to the rootmost node in the chain. This can be deduced from the `writer_node::_transition_length` and the `writer_node::_node_size`. But there's an error in the logic for that, for the special case when there are two nodes in the chain. The rootmost node will be SINGLE_NOPAYLOAD_12 if and only if the leafmost node is smaller than 16 bytes, which is true if and only if `_node_size` is smaller than 19 bytes. But the current logic compares `_node_size` against 16. That's incorrect. This patch fixes that. There was a test for this branch, but it was not good enough. It only tested payloads of size 1 and 20, but the problem is only revealed by payloads of size 13-14. The test has been extended to cover all sizes between 1 and 20.	2025-09-07 00:30:15 +02:00
Michał Chojnowski	b2d793552f	sstables/trie: handle `partition_region`s other than `clustered` in BTI position encoding As far as I know, positions outside the clustering region are never passed to sstable indexes. But since they are representable within the argument type of lazy_comparable_bytes_from_clustering_position, let's make it handle them.	2025-09-07 00:30:08 +02:00
Avi Kivity	49b0751980	test: nonwrapping_interval_test: verify an interval of tokens is trivial Since dht::token is trivial, an interval<dht::token> ought to be trivial too. Verify that.	2025-09-06 18:41:00 +03:00
Michał Chojnowski	3c3ed867e6	test/boost/bti_key_translation_test: fix a compilation error hidden behind `if constexpr` The branch inside the `if constexpr (debug)` contains a piece of template code that doesn't typecheck properly. (I used this code before committing it, but apparently I let it become outdated when some changes around it happened). Fix that.	2025-09-04 15:02:29 +02:00
Calle Wilund	21adfd8a60	test::boost::gcp_object_storage_test: Initial unit tests for GCP obj storage Allows testing using either local mock server (installed or using docker), or real GCP project (not tested as of writing this). v2: Try podman if docker is unavailable v3: Ensure we check log output on fake-gcs, because when using podman, the published port will be connectible even though the actual server is not up yet. v4: Use ephemeral port forward in docker/podman to allow us to run parallel instances. Also adjust credentials and port finding in test. v5: Re-ensure no parallel tests for this: We seem to time out in podman trying to fetch image for X parallel tests v6: Remove the ephemeral port stuff. Because of course this does not work with our podman-in-podman. Do brute-force port speculation instead. v7: Up timeout for server start to allow docker pull. v8: Fix string check error v9: Add explicit docker image version	2025-09-01 18:14:20 +00:00
Avi Kivity	bae66cc0d8	Merge 'types: add byte-comparable format support for collections' from Lakshmi Narayanan Sreethar This PR builds on the byte comparable support introduced in #23541 to add byte comparable support for all the collection types. This implementation adheres to the byte-comparable format specification in https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md Refs https://github.com/scylladb/scylladb/issues/19407 New feature - backport not required. Closes scylladb/scylladb#25603 * github.com:scylladb/scylladb: types/comparable_bytes: add compatibility testcases for collection types types/comparable_bytes: update compatibility testcase to support collection types types/comparable_bytes: support empty type types/comparable_bytes: support reversed types types/comparable_bytes: support vector cql3 type types/comparable_bytes: support tuple and UDT cql3 type types/comparable_bytes: support map cql3 type types/comparable_bytes: support set and list cql3 types types/comparable_bytes: introduce encode/decode_component types/comparable_bytes: introduce to_comparable_bytes/from_comparable_bytes	2025-08-31 15:53:27 +03:00
Avi Kivity	bc5773f777	Merge 'Add out of space prevention mechanisms' from Łukasz Paszkowski When scaling out is delayed or fails, it is crucial to ensure that clusters remain operational and recoverable even under extreme conditions. To achieve this, the following proactive measures are implemented: - reject writes - includes: inserts, updates, deletes, counter updates, hints, read+repair and lwt writes - applicable to: user tables, views, CDC log, audit, cql tracing - stop running compactions/repairs and prevent new ones from starting - reject incoming tablet migrations The aforementioned mechanisms are automatically enabled when a node's disk utilization reaches the critical level (default: 98%) and disabled when the utilization drops below the threshold. Apart from that, the series adds tests that require mounted volumes to simulate out of space. The paths to the volumes can be provided using a pytest argument, i.e. `--space-limited-dirs`. When not provided, tests are skipped. Test scenarios: 1. Start a cluster and write data until one of the nodes reaches 90% of the disk utilization 2. Perform an operation that would take the nodes over 100% 3. The nodes should not exceed the critical disk utilization (98% by default) 4. Scale out the cluster by adding one node per rack 5. Retry or wait for the operation from step 2 The operation is: writing data, running compactions, building materialized views, running repair, migrating tablets (caused by RF change, decommission). The test is successful if no nodes run out of space, the operation from step 2 is aborted/paused/timed out and the operation from step 5 is successful. 
`perf-simple-query --smp 1 -m 1G` results obtained for fixed 400MHz frequency: Read path (before) ``` instructions_per_op: mean= 39661.51 standard-deviation=34.53 median= 39655.39 median-absolute-deviation=23.33 maximum=39708.71 minimum=39622.61 ``` Read path (after) ``` instructions_per_op: mean= 39691.68 standard-deviation=34.54 median= 39683.14 median-absolute-deviation=11.94 maximum=39749.32 minimum=39656.63 ``` Write path (before): ``` instructions_per_op: mean= 50942.86 standard-deviation=97.69 median= 50974.11 median-absolute-deviation=34.25 maximum=51019.23 minimum=50771.60 ``` Write path (after): ``` instructions_per_op: mean= 51000.15 standard-deviation=115.04 median= 51043.93 median-absolute-deviation=52.19 maximum=51065.81 minimum=50795.00 ``` Fixes: https://github.com/scylladb/scylladb/issues/14067 Refs: https://github.com/scylladb/scylladb/issues/2871 No backport, as it is a new feature. Closes scylladb/scylladb#23917 * github.com:scylladb/scylladb: tests/cluster: Add new storage tests test/scylla_cluster: Override workdir when passed via cmdline streaming: Reject incoming migrations storage_service: extend locator::load_stats to collect per-node critical disk utilization flag repair_service: Add a facility to disable the service compaction_manager: Subscribe to out of space controller compaction_manager: Replace enabled/disabled states with running state database: Add critical_disk_utilization mode database can be moved to disk_space_monitor: add subscription API for threshold-based disk space monitoring docs: Add feature documentation config: Add critical_disk_utilization_level option replica/exceptions: Add a new custom replica exception	2025-08-30 18:47:57 +03:00
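The threshold behaviour described in bc5773f777 (enter a protective mode at the critical level, leave it once utilization drops below) can be sketched in Python as follows. `DiskSpaceGuard` and its methods are hypothetical and only model write rejection. The real series also pauses compactions and repairs and rejects incoming tablet migrations.

```python
class DiskSpaceGuard:
    """Toy model of a critical-disk-utilization switch (default level: 98%)."""

    def __init__(self, critical_level=0.98):
        self.critical_level = critical_level
        self.critical = False

    def on_utilization(self, used_bytes, total_bytes):
        # Enabled when utilization reaches the level, disabled when it drops below.
        self.critical = used_bytes / total_bytes >= self.critical_level
        return self.critical

    def write(self, store, key, value):
        if self.critical:
            raise RuntimeError("critical disk utilization: write rejected")
        store[key] = value


guard = DiskSpaceGuard()
store = {}

guard.on_utilization(used_bytes=50, total_bytes=100)
guard.write(store, "k1", "v1")  # accepted

guard.on_utilization(used_bytes=99, total_bytes=100)
try:
    guard.write(store, "k2", "v2")
    rejected = False
except RuntimeError:
    rejected = True  # rejected while above the critical level

guard.on_utilization(used_bytes=90, total_bytes=100)
guard.write(store, "k3", "v3")  # accepted again after space is freed
```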
Calle Wilund	cc9eb321a1	commitlog: Ensure segment deletion is re-entrant Fixes #25709 If we have large allocations, spanning more than one segment, and the internal segment references from lead to secondary are the only thing keeping a segment alive, the implicit drop in discard_unused_segments and orphan_all can cause a recursive call to discard_unused_segments, which in turn can lead to vector corruption/crash, or even double free of segment (iterator confusion). Need to separate the modification of the vector (_segments) from actual releasing of objects. Using temporaries is the easiest solution. To further reduce recursion, we can also do an early clear of segment dependencies in callbacks from segment release (cf release). Closes scylladb/scylladb#25719	2025-08-30 08:24:57 +02:00
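The "temporaries" approach from cc9eb321a1 can be illustrated with a small Python model. This is a sketch of the general pattern, not the commitlog code: `Segment`, `SegmentManager` and the unconditional re-entrant call in `release()` are hypothetical stand-ins for the implicit drop described above.

```python
class Segment:
    def __init__(self, name, manager, unused):
        self.name = name
        self.manager = manager
        self.unused = unused

    def release(self):
        # Stand-in for the implicit drop: releasing one segment can free
        # another and re-enter the discard routine.
        self.manager.discard_unused_segments()


class SegmentManager:
    def __init__(self):
        self.segments = []
        self.released = []

    def discard_unused_segments(self):
        # Step 1: finish mutating the container before releasing anything.
        doomed = [s for s in self.segments if s.unused]
        self.segments = [s for s in self.segments if not s.unused]
        # Step 2: release from the temporary list; a re-entrant call now
        # sees a consistent container and finds nothing left to discard.
        for seg in doomed:
            self.released.append(seg.name)
            seg.release()


manager = SegmentManager()
manager.segments = [
    Segment("a", manager, unused=True),
    Segment("b", manager, unused=False),
    Segment("c", manager, unused=True),
]
manager.discard_unused_segments()
```

Each unused segment is released exactly once, even though every release re-enters `discard_unused_segments()`.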
Piotr Dulikowski	7ccb50514d	Merge 'Introduce view building coordinator' from Michał Jadwiszczak This patch introduces `view_building_coordinator`, a single entity within the whole cluster responsible for building tablet-based views. The view building coordinator takes a slightly different approach than the existing node-local view builder. The whole process is split into smaller view building tasks, one per each tablet replica of the base table. The coordinator builds one base table at a time and it can choose another when all views of the currently processed base table are built. The tasks are started by setting `STARTED` state and they are executed by node-local view building worker. The tasks are scheduled in a way that each shard processes only one tablet at a time (multiple tasks can be started for a shard on a node because a table can have multiple views but then all tasks have the same base table and tablet (last_token)). Once the coordinator starts the tasks, it sends `work_on_view_building_tasks` RPC to start the tasks and receive their results. This RPC is resilient to RPC failure or raft leader change, meaning if one RPC call started a batch of tasks but then failed (for instance the raft leader was changed and caller aborted waiting for the response), next RPC call will attach itself to the already started batch. The coordinator plugs into handling tablet operations (migration/resize/RF change) and adjusts its tasks accordingly. At the start of each tablet operation, the coordinator aborts necessary view building tasks to prevent https://github.com/scylladb/scylladb/issues/21564. Then, new adjusted tasks are created at the end of the operation. If the operation fails at any moment, aborted tasks are rolled back. The view building coordinator can also handle staging sstables using process_staging view building tasks. 
We do this because we don't want to start generating view updates from a staging sstable prematurely, before the writes are directed to the new replica (https://github.com/scylladb/scylladb/issues/19149). For detailed description check: `docs/dev/view-building-coordinator.md` Fixes https://github.com/scylladb/scylladb/issues/22288 Fixes https://github.com/scylladb/scylladb/issues/19149 Fixes https://github.com/scylladb/scylladb/issues/21564 Fixes https://github.com/scylladb/scylladb/issues/17603 Fixes https://github.com/scylladb/scylladb/issues/22586 Fixes https://github.com/scylladb/scylladb/issues/18826 Fixes https://github.com/scylladb/scylladb/issues/23930 --- This PR is reimplementation of https://github.com/scylladb/scylladb/pull/21942 Closes scylladb/scylladb#23760 * github.com:scylladb/scylladb: test/cluster: add view build status tests test/cluster: add view building coordinator tests utils/error_injection: allow to abort `injection_handler::wait_for_message()` test: adjust existing tests utils/error_injection: add injection with `sleep_abortable()` db/view/view_builder: ignore `no_such_keyspace` exception docs/dev: add view building coordinator documentation db/view/view_building_worker: work on `process_staging` tasks db/view/view_building_worker: register staging sstable to view building coordinator when needed db/view/view_building_worker: discover staging sstables db/view/view_building_worker: add method to register staging sstable db/view/view_update_generator: add method to process staging sstables instantly db/view/view_update_generator: extract generating updates from staging sstables to a method db/view/view_update_generator: ignore tablet-based sstables db/view/view_building_coordinator: update view build status on node join/left db/view/view_building_coordinator: handle tablet operations db/view: add view building task mutation builder service/topology_coordinator: run view building coordinator db/view: introduce `view_building_coordinator` 
db/view/view_building_worker: update built views locally db/view: introduce `view_building_worker` db/view: extract common view building functionalities db/view: prepare to create abstract `view_consumer` message/messaging_service: add `work_on_view_building_tasks` RPC service/topology_coordinator: make `term_changed_error` public db/schema_tables: create/cleanup tasks when an index is created/dropped service/migration_manager: cleanup view building state on drop keyspace service/migration_manager: cleanup view building state on drop view service/migration_manager: create view building tasks on create view test/boost: enable proxy remote in some tests service/migration_manager: pass `storage_proxy` to `prepare_keyspace_drop_announcement()` service/migration_manager: coroutinize `prepare_new_view_announcement()` service/storage_proxy: expose references to `system_keyspace` and `view_building_state_machine` service: reload `view_building_state_machine` on group0 apply() service/vb_coordinator: add currently processing base db/system_keyspace: move `get_scylla_local_mutation()` up db/system_keyspace: add `view_building_tasks` table db/view: add view_building_state and views_state db/system_keyspace: add method to get view build status map db/view: extract `system.view_build_status_v2` cql statements to system_keyspace db/system_keyspace: move `internal_system_query_state()` function earlier db/view: ignore tablet-based views in `view_builder` gms/feature_service: add VIEW_BUILDING_COORDINATOR feature	2025-08-29 17:28:44 +02:00
Łukasz Paszkowski	9539e80e54	compaction_manager: Subscribe to out of space controller	2025-08-29 14:56:07 +02:00
Lakshmi Narayanan Sreethar	4547f6f188	types/comparable_bytes: update compatibility testcase to support collection types The `abstract_type::from_string()` method used to parse the input data doesn't support collections yet. So the collection testdata will be passed as JSON strings to the testcase. This patch updates the testcase to adapt to this workaround. Also, extended the testcase to verify that Scylla's implementation can successfully decode the byte comparable output encoded by Cassandra. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar	0997b3533c	types/comparable_bytes: support empty type Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar	b799101a09	types/comparable_bytes: support reversed types A reversed type is first encoded using the underlying type and then all the bits are flipped to ensure that the lexicographical sort order is reversed. During decode, the bytes are flipped first and then decoded using the underlying type. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-08-29 12:26:22 +05:30
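The bit-flipping rule described in the reversed-type commit above can be illustrated with a minimal sketch. It assumes a hypothetical fixed-width underlying encoding (`encode_u32`, big-endian unsigned); that helper is a stand-in for illustration, not Scylla's actual encoder.

```python
def encode_u32(value: int) -> bytes:
    # Hypothetical underlying encoding: 4-byte big-endian unsigned integer,
    # whose lexicographic byte order matches numeric order.
    return value.to_bytes(4, "big")


def encode_reversed(encoded_underlying: bytes) -> bytes:
    # Flip every bit so that lexicographic comparison is inverted.
    return bytes(b ^ 0xFF for b in encoded_underlying)


def decode_reversed(encoded: bytes) -> bytes:
    # Flipping the bits again restores the underlying encoding.
    return bytes(b ^ 0xFF for b in encoded)


values = [7, 1, 300, 42]
ascending = sorted(values, key=encode_u32)
descending = sorted(values, key=lambda v: encode_reversed(encode_u32(v)))
```

With a fixed-width underlying encoding, sorting by the flipped bytes yields the reverse of the natural order; variable-length encodings additionally need to be prefix-free for this to hold.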
Lakshmi Narayanan Sreethar	6c2a3e2c51	types/comparable_bytes: support vector cql3 type The CQL vector type encoding is similar to the lists, where each element is transformed into a byte-comparable format and prefixed with a component marker. The sequence is terminated with a terminator marker to indicate the end of the collection. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar	1ccfe522f1	types/comparable_bytes: support tuple and UDT cql3 type The CQL tuple and UDT types share the same internal implementation and therefore use the same byte comparable encoding. The encoding is similar to lists, where each element is transformed into a byte-comparable format and prefixed with a component marker. The sequence is terminated with a terminator marker to indicate the end of the collection. TODO: Add duplicate test items to maps, lists and sets For maps, add more entries that share keys ex map1 : key1 : value1, key2 : value2 map2 : key1 : value4 map3 : key2 : value5 etc Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar	ca38c15a97	types/comparable_bytes: support map cql3 type The CQL map type is encoded as a sequence of key-value pairs. Each key and each value is individually prefixed with a component marker, and the sequence is terminated with a terminator marker to indicate the end of the collection. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar	4d5e5f0c84	types/comparable_bytes: support set and list cql3 types The CQL set and list types are encoded as a sequence of elements, where each element is transformed into a byte-comparable format and prefixed with a component marker. The sequence is terminated with a terminator marker to indicate the end of the collection. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar	8e46e8be01	types/comparable_bytes: introduce encode/decode_component The components of a collection, such as an element from a list, set, or vector; a key or value from a map; or a field from a tuple, share the same encode and decode logic. During encode, the component is transformed into the byte comparable format and is prefixed with the `NEXT_COMPONENT` marker. During decode, the component is transformed back into its serialized form and is prefixed with the serialized size. A null component is encoded as a single `NEXT_COMPONENT_NULL` marker and during decode, a `-1` is written to the serialized output. This commit introduces few helper methods that implement the above mentioned encode and decode logics. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-08-29 12:26:21 +05:30
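The component encode/decode rules described above can be sketched roughly as follows. The marker byte values and the 4-byte signed size prefix are assumptions chosen for illustration (they follow the Cassandra ByteComparable description as far as I recall it); the function names are hypothetical and the elements are assumed to be already in byte-comparable form.

```python
import struct
from typing import Iterable, Optional

# Assumed marker values, for illustration only.
NEXT_COMPONENT = 0x40
NEXT_COMPONENT_NULL = 0x3E
TERMINATOR = 0x38


def encode_component(comparable: Optional[bytes]) -> bytes:
    # A null component is a single NEXT_COMPONENT_NULL marker; otherwise
    # the byte-comparable value is prefixed with NEXT_COMPONENT.
    if comparable is None:
        return bytes([NEXT_COMPONENT_NULL])
    return bytes([NEXT_COMPONENT]) + comparable


def encode_sequence(components: Iterable[Optional[bytes]]) -> bytes:
    # Collections (list, set, vector, tuple) are a run of components
    # closed by a terminator marker.
    out = bytearray()
    for component in components:
        out += encode_component(component)
    out.append(TERMINATOR)
    return bytes(out)


def serialize_component(value: Optional[bytes]) -> bytes:
    # Decode side: the serialized form is prefixed with its size,
    # and a null component is written as -1 (assumed 32-bit, big-endian).
    if value is None:
        return struct.pack(">i", -1)
    return struct.pack(">i", len(value)) + value


encoded = encode_sequence([b"\x01", None])
```

Because the terminator is smaller than both component markers in this sketch, a sequence that is a strict prefix of another sorts first.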
Dario Mirovic	8120709231	transport: replace `make_frame` throw with return result `cql_transport::response::make_frame` used to throw `protocol_exception`. With this change it will return `result_with_exception_ptr<sstring>` instead. Code changes are propagated to `cql_transport::cql_server::response::make_message` and from there to `cql_transport::cql_server::connection::write_response`. `write_response` continuation calling `make_message` used to transform the exception from `make_message` to an exception future, and now the logic stays the same, just explicitly stated at this code layer, so the behavior is not changed. Refs: #24567	2025-08-28 23:33:33 +02:00
Dario Mirovic	51995af258	transport: replace throwing protocol_exception with returns Replace throwing `protocol_exception` with returning it as a result or an exceptional future in the transport server module. The goal is to improve performance. Most of the `protocol_exception` throws were made from `fragmented_temporary_buffer` module, by passing `exception_thrower()` to its `read*` methods. `fragmented_temporary_buffer` is changed so that it now accepts an exception creator, not exception thrower. `fragmented_temporary_buffer_concepts::ExceptionCreator` concept replaced `fragmented_temporary_buffer_concepts::ExceptionThrower` and all methods that have been throwing now return failed result of type `utils::result_with_exception_ptr`. This change is then propagated to the callers. The scope of this patch is `protocol_exception`, so commitlog just calls `.value()` method on the result. If the result failed, that will throw the exception from the result, as defined by `utils::result_with_exception_ptr_throw_policy`. This means that the behavior of commitlog module stays the same. transport server module handles results gracefully. All the caller functions that return non-future value `T` now return `utils::result_with_exception_ptr<T>`. When the caller is a function that returns a future, and it receives failed result, `make_exception_future(std::move(failed_result).value())` is returned. The rest of the callstack up to the transport server `handle_error` function is already working without throwing, and that's how zero throws is achieved. Fixes: #24567	2025-08-28 23:31:36 +02:00
Łukasz Paszkowski	3e740d25b5	disk_space_monitor: add subscription API for threshold-based disk space monitoring Introduce the `subscribe` method to disk_space_monitor, allowing clients to register callbacks triggered when disk utilization crosses a configurable threshold. The API supports flexible trigger options, including notifications on threshold crossing and direction (above/below). This enables more granular and efficient disk space monitoring for consumers.	2025-08-28 18:06:37 +02:00
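The threshold-crossing subscription idea can be illustrated with a small sketch. This is not the `disk_space_monitor` API: the class, method signatures, and callback shape below are hypothetical, and only the idea of notifying subscribers when utilization crosses a threshold in either direction comes from the commit message.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Subscription:
    threshold: float
    callback: Callable[[float, bool], None]
    above: bool = False  # which side of the threshold we were on last


class DiskSpaceMonitorSketch:
    def __init__(self) -> None:
        self._subscriptions: List[Subscription] = []

    def subscribe(self, threshold: float,
                  callback: Callable[[float, bool], None]) -> Subscription:
        sub = Subscription(threshold, callback)
        self._subscriptions.append(sub)
        return sub

    def update(self, utilization: float) -> None:
        # Fire the callback only when the utilization crosses the threshold,
        # reporting the new direction (True means above).
        for sub in self._subscriptions:
            now_above = utilization > sub.threshold
            if now_above != sub.above:
                sub.above = now_above
                sub.callback(utilization, now_above)


events: List[Tuple[float, bool]] = []
monitor = DiskSpaceMonitorSketch()
monitor.subscribe(0.8, lambda used, above: events.append((used, above)))
for sample in (0.5, 0.85, 0.9, 0.7):
    monitor.update(sample)
```

Only the samples that change sides (0.85 going above, 0.7 going below) trigger the callback; the sample at 0.9 stays above the threshold and is not reported.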
Michał Jadwiszczak	233f4dcee3	db/view/view_building_worker: register staging sstable to view building coordinator when needed Change the return type of `check_needs_view_update_path()`. Instead of returning a bool that tells whether to use the staging directory (and register to `view_update_generator`) or the normal directory, the function now returns an enum with possible values: - `normal_directory` - use normal directory for the sstable - `staging_directly_to_generator` - use staging directory and register to `view_update_generator` - `staging_managed_by_vbc` - use staging directory but don't register it to `view_update_generator` but create view building tasks for later The third option is new, it's used when the table has any view which is currently being built. In this case, registering it to `view_update_generator` prematurely may lead to base-view inconsistency (for example when a replica is in a pending state).	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	6e3e287a39	db/schema_tables: create/cleanup tasks when an index is created/dropped Similarly as in previous commits, create view building tasks when an index is created and cleanup view building status when it's dropped.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	19651b4978	test/boost: enable proxy remote in some tests After a few next patches, creating/dropping a view in tablet keyspace will require a remote proxy to obtain references to system keyspace and view building state. Because of this, remote proxy needs to be explicitly enabled in boost tests which create views.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	204f61ffe1	service/migration_manager: pass `storage_proxy` to `prepare_keyspace_drop_announcement()` The reference is needed to get `view_building_state_machine`.	2025-08-27 08:55:47 +02:00
Nadav Har'El	e2c99436cf	Merge 'cdc, vector_search: enable CDC when the index is created' from Dawid Pawlik When a vector index is created in Scylla, it is initially built using a full scan of the database. After that, it stays up to date by tracking changes through CDC, which should be automatically enabled when the vector index is created. When a user attempts to enable Vector Search (VS), the system checks whether Change Data Capture (CDC) is enabled and properly configured: 1. CDC is not enabled - CDC is automatically enabled with the minimum required TTL (Time-to-Live) for VS (24 hours) and the delta mode set to 'full' or post-image is enabled. - If the user later tries to reduce the CDC TTL below 24 hours or set delta mode to 'keys' with post-image disabled, the action fails. - Error message: Clearly states that CDC TTL must be at least 24 hours and delta mode must be set to 'full' or post-image must be enabled for VS to function. 2. CDC is already enabled - If CDC TTL is ≥ 24 hours and delta mode is set to 'full' or post-image is enabled: VS is enabled successfully. - If CDC TTL is < 24 hours or delta mode is set to 'keys' with post-image disabled: The VS enabling process fails. - Error message: Informs the user that CDC TTL must be at least 24 hours, delta mode must be set to 'full' or post-image must be enabled, and provides a link to documentation on how to update the TTL, delta mode, and post-image. When a user attempts to disable CDC when VS is enabled, the action will fail and the user will be informed by error message that clearly states that VS needs to be disabled (vector indexes have to be dropped) first. Full setup requirements and steps will be detailed in the documentation of Vector Search. 
Co-authored-by: @smoczy123 Fixes: VECTOR-27 Fixes: VECTOR-25 Closes scylladb/scylladb#25179 * github.com:scylladb/scylladb: test/cqlpy: ensure Vector Search CDC options test/boost: adjust CDC boost tests for Vector Search test/cql: add Vector Search CDC enable/disable test cdc, vector_index: provide minimal option setup for Vector Search test/cqlpy: adjust describe table tests with CDC for Vector Search describe, cdc: adjust describe for cdc log tables cdc: enable CDC log when vector index is created test/cqlpy: run vector_index tests only on vnodes vector_index: check if vector index exists in schema	2025-08-26 23:01:32 +03:00
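The CDC requirements listed in the Vector Search merge above reduce to a simple predicate. The sketch below models them with a hypothetical `CdcOptions` record; the field names are illustrative, and special TTL values are not modelled.

```python
from dataclasses import dataclass

MIN_TTL_SECONDS = 24 * 60 * 60  # minimum CDC TTL required for Vector Search


@dataclass
class CdcOptions:
    enabled: bool
    ttl_seconds: int
    delta: str       # 'full' or 'keys'
    postimage: bool


def satisfies_vector_search(opts: CdcOptions) -> bool:
    # TTL must be at least 24 hours, and either delta mode is 'full'
    # or post-image is enabled.
    return (opts.enabled
            and opts.ttl_seconds >= MIN_TTL_SECONDS
            and (opts.delta == "full" or opts.postimage))


ok_full = satisfies_vector_search(CdcOptions(True, 86400, "full", False))
ok_postimage = satisfies_vector_search(CdcOptions(True, 172800, "keys", True))
bad_ttl = satisfies_vector_search(CdcOptions(True, 3600, "full", True))
bad_delta = satisfies_vector_search(CdcOptions(True, 86400, "keys", False))
```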
Avi Kivity	8815491085	treewide: include boost headers as "system" headers Boost is external to the project so treat its headers as "system" headers and include them with angle brackets. Closes scylladb/scylladb#25619	2025-08-22 17:21:24 +03:00
Robert Bindar	3291a5cc75	Fix dbuild boost::gregorian usage error On my dbuild runs, compiler complained about no member "gregorian" in namespace boost in the user_function_test.cc file. Was also noticed in CI. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com> Closes scylladb/scylladb#25593	2025-08-21 11:32:47 +03:00
Dawid Pawlik	61c7b935e1	test/boost: adjust CDC boost tests for Vector Search Adjust name conflict and permissions tests when enabling CDC for Vector Search. Add test that checks if CDC with vector column is setup properly.	2025-08-20 17:20:37 +02:00
Botond Dénes	66db95c048	Merge 'Preserve PyKMIP logs from failed KMIP tests' from Nikos Dragazis This PR extends the `tmpdir` class with an option to preserve the directory if the destructor is called during stack unwinding. It also uses this feature in KMIP tests, where the tmpdir contains PyKMIP server logs, which may be useful when diagnosing test failures. Fixes #25339. Not so important to be backported. Closes scylladb/scylladb#25367 * github.com:scylladb/scylladb: encryption_at_rest_test: Preserve tmpdir from failing KMIP tests test/lib: Add option to preserve tmpdir on exception	2025-08-19 13:17:29 +03:00
Avi Kivity	611918056a	Merge 'repair: Add tablet incremental repair support' from Asias He The central idea of incremental repair is to allow repair participants to select and repair only a portion of the dataset to speed up the repair process. All repair participants must utilize an identical selection method to repair and synchronize the same selected dataset. There are two primary selection methods: time-based and file-based. The time-based method selects data within a specified time frame. It is versatile but it is less efficient because it requires reading all of the dataset and omitting data beyond the time frame. The file-based method selects data from unrepaired SSTables and is more efficient because it allows the entire SSTable to be omitted. This patch implements the file-based selection method. Incremental repair will only be supported for tablet tables; it will not be supported for vnode tables. On one hand, the legacy vnode is less important to support. On the other hand, the incremental repair for vnode is much harder to implement. With vnodes, an SSTable could contain data for multiple vnode ranges. When a given vnode range is repaired, only a portion of the SSTable is repaired. This complicates the manipulation of SSTables significantly during both repair and compaction. With tablets, an entire tablet is repaired so that an sstable is either fully repaired or not repaired, which is a huge simplification. This patch uses the repaired_at from sstables::statistics component to mark a sstable as repaired. It uses a virtual clock as the repair timestamp, i.e., using a monotonically increasing number for the repaired_at field of a SSTable and sstables_repaired_at column in system.tablets table. Notice that when a sstable is not repaired, the repaired_at field will be set to the default value 0. The being_repaired in memory field of a SSTable is used to explicitly mark that a SSTable is being selected. 
The following variables are used for incremental repair:
The repaired_at on disk field of a SSTable is used.
- A 64-bit number increases sequentially
The sstables_repaired_at is added to the system.tablets table.
- repaired_at <= sstables_repaired_at means the sstable is repaired
The being_repaired in memory field of a SSTable is added.
- A repair UUID tells which sstable has participated in the repair

Initial test results:

1) Medium dataset results
Node amount: 3
Instance type: i4i.2xlarge
Disk usage per node: ~500GB
Cluster pre-populated with ~500GB of data before starting repairs job.
Results for Repair Timings:
The regular repair run took 210 mins.
Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s
The speedup is: 183 mins / 48s = 228X

2) Small dataset results
Node amount: 3
Instance type: i4i.2xlarge
Disk usage per node: ~167GB
Cluster pre-populated with ~167GB of data before starting the repairs job.
Regular repair 1st run took 110s, 2nd and 3rd runs took 110s.
Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds.
The speedup is: 110s / 1.5s = 73X

3) Large dataset results
Node amount: 6
Instance type: i4i.2xlarge, 3 racks
50% of base load, 50% read/write
Dataset == Sum of data on each node

Dataset    Non-incremental repair (minutes)
1.3 TiB    31:07
3.5 TiB    25:10
5.0 TiB    19:03
6.3 TiB    31:42

Dataset    Incremental repair (minutes)
1.3 TiB    24:32
3.0 TiB    13:06
4.0 TiB    5:23
4.8 TiB    7:14
5.6 TiB    3:58
6.3 TiB    7:33
7.0 TiB    6:55

Fixes #22472 Closes scylladb/scylladb#24291 * github.com:scylladb/scylladb: replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair compaction: Move compaction_reenabler to compaction_reenabler.hh topology_coordinator: Make rpc::remote_verb_error to warning level repair: Add metrics for sstable bytes read and skipped from sstables test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair test.py: Add tests for tablet incremental repair repair: Add tablet incremental repair support compaction: Add tablet incremental repair support feature_service: Add TABLET_INCREMENTAL_REPAIR feature tablet_allocator: Add tablet_force_tablet_count_increase and decrease repair: Add incremental helpers sstable: Add being_repaired to sstable sstables: Add set_repaired_at to metadata_collector mutation_compactor: Introduce add operator to compaction_stats tablet: Add sstables_repaired_at to system.tablets table test: Fix drain api in task_manager_client.py	2025-08-19 13:13:22 +03:00
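The repaired/unrepaired classification described in the incremental repair merge above can be sketched as follows. The comparison `repaired_at <= sstables_repaired_at` and the default value 0 for unrepaired sstables come from the commit message; treating 0 explicitly as "never repaired" is an assumption of this sketch, and the function names and sample data are hypothetical. The last two lines recompute the quoted speedups.

```python
from typing import Dict, List, Tuple


def is_repaired(repaired_at: int, sstables_repaired_at: int) -> bool:
    # repaired_at == 0 is the default for an sstable that was never repaired
    # (assumed here to always mean "unrepaired").
    return repaired_at != 0 and repaired_at <= sstables_repaired_at


def split_by_repair_state(sstables: Dict[str, int],
                          sstables_repaired_at: int) -> Tuple[List[str], List[str]]:
    # Partition sstable names into the repaired and unrepaired sets,
    # which are then compacted separately.
    repaired: List[str] = []
    unrepaired: List[str] = []
    for name, repaired_at in sstables.items():
        if is_repaired(repaired_at, sstables_repaired_at):
            repaired.append(name)
        else:
            unrepaired.append(name)
    return repaired, unrepaired


# Hypothetical sstables mapped to their on-disk repaired_at values.
sstables = {"sst-a": 0, "sst-b": 3, "sst-c": 5, "sst-d": 6}
repaired, unrepaired = split_by_repair_state(sstables, sstables_repaired_at=5)

medium_speedup = (183 * 60) / 48   # 183 minutes vs 48 seconds
small_speedup = 110 / 1.5          # 110 seconds vs 1.5 seconds
```

The recomputed ratios come out at roughly 228.75 and 73.3, consistent with the 228X and 73X figures quoted in the message.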
Botond Dénes	f8b79d563a	Merge 's3: Minor refactoring and beautification of S3 client and tests' from Ernest Zaslavsky This pull request introduces minor code refactoring and aesthetic improvements to the S3 client and its associated test suite. The changes focus on enhancing readability, consistency, and maintainability without altering any functional behavior. No backport is required, as the modifications are purely cosmetic and do not impact functionality or compatibility. Closes scylladb/scylladb#25490 * github.com:scylladb/scylladb: s3_client: relocate `req` creation closer to usage s3_client: reformat long logging lines for readability s3_test: extract file writing code to a function	2025-08-18 18:48:42 +03:00
Avi Kivity	96956e48c4	Merge 'utils: stall_free: detect clear_gently method of const payload types' from Benny Halevy Currently, when a container or smart pointer holds a const payload type, utils::clear_gently does not detect the object's clear_gently method as the method is non-const and requires a mutable object, as in the following example in class tablet_metadata: ``` using tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>; using table_to_tablet_map = std::unordered_map<table_id, tablet_map_ptr>; ``` That said, when a container is cleared gently the elements it holds are destroyed anyhow, so we'd like to allow clearing them gently before destruction. This change still doesn't allow directly calling utils::clear_gently on const objects. Respective unit tests are added. Fixes #24605 Fixed #25026 * This is an optimization that is not strictly required to backport (as https://github.com/scylladb/scylladb/pull/24618 dealt with clear_gently of `tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>` well enough) Closes scylladb/scylladb#24606 * github.com:scylladb/scylladb: utils: stall_free: detect clear_gently method of const payload types utils: stall_free: clear gently a foreign shared ptr only when use_count==1	2025-08-18 12:52:02 +03:00
Pavel Emelyanov	4f55af9578	Merge 'test.py: pytest: support --mode/--repeat in a common way for all tests' from Evgeniy Naydanov Implement repetition of files using `pytest_collect_file` hook: run file collection as many times as needed to cover all `--mode`/`--repeat` combinations. Store build mode and run ID to the stash of repeated item. Some additional changes done: - Add `TestSuiteConfig` class to handle all operations with `test_config.yaml` - Add support for `run_first` option in `test_config.yaml` - Move disabled test logic to `pytest_collect_file` hook. These changes allow us to remove custom logic for `--mode`, `--repeat`, and disabled tests in the code for C++ tests and prepare for switching of Python/CQLApproval/Topology tests to pytest runner. Also, this PR includes required refactoring changes and fixes: - Simplify support of C++ tests: remove redundant facade abstraction and put all code into 3 files: `base.py`, `boost.py`, and `unit.py` - Remove unused imports in `test.py` - Use the constant for `"suite.yaml"` string - Some test suites have their own test runners based on pytest, and they don't need all the stuff we use for `test.py`. Move all code related to `test.py` framework to `test/pylib/runner.py` and use it as a plugin conditionally (by using `SCYLLA_TEST_RUNNER` env variable.) - Add `cwd` parameter to `run_process()` methods in `resource_gather` module to avoid using `os.chdir()` (and sort parameters in the same order as in `subprocess.Popen`.) - `extra_scylla_cmdline_options` is a list of commandline arguments and, actually, each argument should be a separate item. A few configuration files have `--reactor-backend` option added in the format which doesn't follow this rule. 
This PR is a refactoring step for https://github.com/scylladb/scylladb/pull/25443 Closes scylladb/scylladb#25465 * github.com:scylladb/scylladb: test.py: pytest: support --mode/--repeat in a common way for all tests test.py: pytest: streamline suite configuration handling test.py: refactor: remove unused imports in test.py test.py: fix run with bare pytest after merge of scylladb/scylladb#24573 test.py: refactor: move framework-related code to test.pylib.runner test.py: resource_gather: add cwd parameter to run_process() test.py: refactor: use proper format for extra_scylla_cmdline_options	2025-08-18 12:24:04 +03:00
Avi Kivity	e9928b31b8	Merge 'sstables/trie: add BTI key translation routines' from Michał Chojnowski This is yet another part in the BTI index project. Overarching issue: https://github.com/scylladb/scylladb/issues/19191 Previous part: https://github.com/scylladb/scylladb/pull/25396 Next part: implementing sstable index writers and readers on top of the abstract trie writers/readers. The new code added in this PR isn't used outside of tests yet, but it's posted as a separate PR for reviewability. This series provides translation routines for ring positions and clustering positions from Scylla's native in-memory structures to BTI's byte-comparable encoding. This translation is performed whenever a new decorated key or clustering block are added to a BTI index, and whenever a BTI index is queried for a range of positions. For a description of the encoding, see `fad1f74570/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md (multi-component-sequences-partition-or-clustering-keys-tuples-bounds-and-nulls)` The translation logic, with all the fragment awareness, lazy evaluation and avoidable copies, is fairly bloated for the common cases of simple and small keys. This is a potential optimization target for later. No backports needed, new functionality. Closes scylladb/scylladb#25506 * github.com:scylladb/scylladb: sstables/trie: add BTI key translation routines tests/lib: extract generate_all_strings to test/lib tests/lib: extract nondeterministic_choice_stack to test/lib sstables/trie/trie_traversal: extract comparable_bytes_iterator to its own file sstables/mx: move clustering_info from writer.cc to types.hh sstables/trie: allow `comparable_bytes_iterator` to return a mutable span dht/ring_position: add ring_position_view::weight()	2025-08-18 11:55:26 +03:00
Asias He	f9021777d8	compaction: Add tablet incremental repair support This patch adds incremental_repair support in compaction. - The sstables are split into repaired and unrepaired sets. - Repaired and unrepaired sets compact separately. - The repaired_at from sstable and sstables_repaired_at from system.tablets table are used to decide if a sstable is repaired or not. - Different compaction tasks, e.g., minor, major, scrub, split, are serialized with tablet repair.	2025-08-18 11:01:21 +08:00
Evgeniy Naydanov	e44b26b809	test.py: pytest: support --mode/--repeat in a common way for all tests Implement repetition of files using pytest_collect_file hook: run file collection as many times as needed to cover all --mode/--repeat combinations. Also move disabled test logic to this hook. Store build mode and run_id in pytest item stashes. Simplify support of C++ tests: remove redundant facade abstraction and put all code into 3 files: base.py, boost.py, and unit.py Add support for `run_first` option in test_config.yaml	2025-08-17 15:26:23 +00:00
Evgeniy Naydanov	cb4d9b8a09	test.py: refactor: use proper format for extra_scylla_cmdline_options `extra_scylla_cmdline_options` is a list of commandline arguments and, actually, each argument should be a separate item. A few configuration files have `--reactor-backend` option added in the format which doesn't follow this rule.	2025-08-17 12:32:35 +00:00
Michał Chojnowski	413dcf8891	sstables/trie: add BTI key translation routines This file provides translation routines for ring positions and clustering positions from Scylla's native in-memory structures to BTI's byte-comparable encoding. This translation is performed whenever a new decorated key or clustering block are added to a BTI index, and whenever a BTI index is queried for a range of positions. For a description of the encoding, see `fad1f74570/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md (multi-component-sequences-partition-or-clustering-keys-tuples-bounds-and-nulls)` The translation logic, with all the fragment awareness, lazy evaluation and avoidable copies, is fairly bloated for the common cases of simple and small keys. This is a potential optimization target for later.	2025-08-15 11:13:00 +02:00
Michał Chojnowski	5e76708335	tests/lib: extract generate_all_strings to test/lib This util will be used in another test file in a later commit, so hoist it to `test/lib`.	2025-08-14 22:38:38 +02:00
Taras Veretilnyk	30ff5942c6	database_test: fix race in test_drop_quarantined_sstables The test_drop_quarantined_sstables test could fail due to a race between compaction and quarantining of SSTables. If compaction selects an SSTable before it is moved to quarantine, and change_state is called during compaction, the SSTable may already be removed, resulting in a std::filesystem_error due to missing files. This patch resolves the issue by wrapping the quarantine operation inside run_with_compaction_disabled(). This ensures compaction is paused on the compaction group view while SSTables are being quarantined, preventing the race. Additionally, updates the test to quarantine up to 1/5 of the SSTables instead of one random SSTable, and increases the number of sstables generated to improve the test scenario. Fixes scylladb/scylladb#25487 Closes scylladb/scylladb#25494	2025-08-14 20:23:42 +03:00
Taras Veretilnyk	367eaf46c5	keys: from_nodetool_style_string don't split single partition keys Users with single-column partition keys that contain colon characters were unable to use certain REST APIs and 'nodetool' commands, because the API split key by colon regardless of the partition key schema. Affected commands: - 'nodetool getendpoints' - 'nodetool getsstables' Affected endpoints: - '/column_family/sstables/by_key' - '/storage_service/natural_endpoints' Refs: #16596 - This does not fully fix the issue, as users with compound keys will face the issue if any column of the partition key contains a colon character. Closes scylladb/scylladb#24829	2025-08-14 19:52:04 +03:00
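The behaviour change described in the `from_nodetool_style_string` commit can be illustrated with a small sketch. The function below is a hypothetical stand-in, not the real implementation: it only models the rule that a key is split on colons when the partition key has more than one column.

```python
from typing import List


def split_nodetool_key(key: str, partition_key_columns: int) -> List[str]:
    # A single-column partition key is taken verbatim, even if it
    # contains colon characters.
    if partition_key_columns == 1:
        return [key]
    # Compound keys are still split on ':' (so a colon inside a component
    # remains ambiguous, as the commit message notes).
    return key.split(":")


single = split_nodetool_key("user:42", partition_key_columns=1)
compound = split_nodetool_key("eu:42", partition_key_columns=2)
ambiguous = split_nodetool_key("eu:west:42", partition_key_columns=2)
```

The last call shows the remaining limitation: a two-column key whose first component contains a colon splits into three parts, which no longer matches the schema.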


4142 Commits