scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-25 09:11:10 +00:00

Author	SHA1	Message	Date
Pavel Emelyanov	06006a6328	test/refresh: Prepare indentation for new_test_keyspace in test_refresh_deletes_uploaded_sstables Wrap the test body under if True: to pre-indent it, making the subsequent patch that introduces new_test_keyspace a pure content change with no whitespace noise. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-19 18:42:40 +03:00
Pavel Emelyanov	67d8cde42d	test/refresh: Decouple test_refresh_deletes_uploaded_sstables from backup tests Replace create_cluster() from object_store/test_backup.py with a plain manager.servers_add(2) call. The test does not use object storage, so there is no need to pull in the backup helper along with its config and logging knobs. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-19 18:42:36 +03:00
Pavel Emelyanov	04f046d2d8	test/refresh: Remove unused wait_for_cql_and_get_hosts import Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-19 18:42:32 +03:00
Artsiom Mishuta	0ede308a04	test/pylib: save logs on success only during teardown phase Previously, when --save-log-on-success was enabled, logs were saved for every test phase (setup, call, teardown) in 3 files. Restrict it to only the teardown phase, which contains all 3 in case of test success, to avoid redundant log entries.	2026-03-19 16:35:22 +01:00
Artsiom Mishuta	cbc07569c0	test: Lower default log level from DEBUG to INFO 1. test.py — Removed --log-level=DEBUG flag from pytest args 2. test/pytest.ini — Changed log_level to INFO (that was set DEBUG in test.py), changed log_file_level from DEBUG to INFO, added clarifying comments	2026-03-19 16:32:30 +01:00
Dario Mirovic	d2c44722e1	test: cluster: fix log clear race condition in test_audit.py assert_entries_were_added: - takes a "before" snapshot of the audit log - yields to execute a statement - takes an "after" snapshot of the audit log - computes new rows by diffing "after" minus "before" If an audit entry generated by prepare() arrives between the snapshot and the diff, it inflates the new row count and the test fails with assert 2 <= 1. Fix by: - Adding clear_audit_logs() at the end of prepare(), after all setup - Waiting for the "completed re-reading configuration file" log message after server_update_config - Draining pending syslog lines before clearing the buffer Refs SCYLLADB-573	2026-03-19 16:12:13 +01:00
Dario Mirovic	821f8696a7	test: pylib: shut down exclusive cql connections in ManagerClient get_cql_exclusive() creates a Cluster object per call, but never records it. driver_close() cannot shut it down. The cluster's internal scheduler thread then tries to submit work to an already shut down executor. This causes RuntimeError: RuntimeError: cannot schedule new futures after shutdown Fix this by tracking every exclusive Cluster in a list and shutting them all down in driver_close(). Refs SCYLLADB-573	2026-03-19 16:12:13 +01:00
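The fix described in 821f8696a7 (record every exclusive Cluster, then shut them all down in driver_close()) can be sketched as follows. This is a minimal illustration, not the actual ManagerClient code: `FakeCluster` and `ClientSketch` are hypothetical stand-ins, and the real method takes driver-specific arguments that are omitted here.

```python
class FakeCluster:
    """Hypothetical stand-in for a driver Cluster object."""

    def __init__(self):
        self.is_shutdown = False

    def shutdown(self):
        self.is_shutdown = True


class ClientSketch:
    """Hypothetical client that tracks every exclusive cluster it creates."""

    def __init__(self, cluster_factory=FakeCluster):
        self._cluster_factory = cluster_factory
        # Every cluster handed out is recorded here so it can be closed later.
        self._exclusive_clusters = []

    def get_cql_exclusive(self):
        cluster = self._cluster_factory()
        self._exclusive_clusters.append(cluster)
        return cluster

    def driver_close(self):
        # Shut down all recorded clusters, then forget them.
        for cluster in self._exclusive_clusters:
            cluster.shutdown()
        self._exclusive_clusters.clear()
```

Keeping the list on the client means no caller has to remember to close its own connection, which is the failure mode the commit message describes.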
Dario Mirovic	d94999f87b	test: cluster: fix multinode audit entry comparison in test_audit.py assert_entries_were_added computes new audit rows by slicing the "after" list at the length of the "before" list: rows_after[len(rows_before):]. This assumes new rows always appear at the tail of the combined sorted list. In a multinode setup, each node generates its own event_time timestamps. A new row from node A can sort before an old row from node B, breaking the tail assumption. The assertion "new rows are not the last rows in the audit table" then fires. Fix this by splitting the before/after lists per node and computing the new rows tail independently for each node. This guarantees that per node ordering, which is monotonic, is respected, and the combined new rows are sorted afterwards. Refs SCYLLADB-573	2026-03-19 16:12:13 +01:00
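The per-node tail computation described in d94999f87b can be sketched like this. It is an illustration under stated assumptions, not the code in test_audit.py: rows are modeled as hypothetical `(node, event_time, statement)` tuples and the function name is invented.

```python
from collections import defaultdict


def new_rows_per_node(rows_before, rows_after):
    """Return the rows added between two snapshots, computed per node.

    Rows are hypothetical (node, event_time, statement) tuples.
    """

    def group(rows):
        by_node = defaultdict(list)
        for row in rows:
            by_node[row[0]].append(row)
        # Timestamps are only assumed to be monotonic within one node.
        for node_rows in by_node.values():
            node_rows.sort(key=lambda r: r[1])
        return by_node

    before = group(rows_before)
    after = group(rows_after)

    new_rows = []
    for node, node_after in after.items():
        # Take the tail for this node only.
        new_rows.extend(node_after[len(before.get(node, [])):])
    # Sort the combined new rows afterwards.
    return sorted(new_rows, key=lambda r: r[1])
```

With a "before" snapshot of `("A", 10, ...)` and `("B", 20, ...)` and a new row `("A", 15, ...)`, a global slice at `len(rows_before)` would pick the old row from node B, while the per-node version returns only the new row from node A.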
Dario Mirovic	249a6cec1b	test: cluster: dtest: remove old audit tests Since audit tests have been migrated to test/cluster/test_audit.py, old tests in test/cluster/dtest/audit_test.py have to be removed. Refs SCYLLADB-573	2026-03-19 16:12:13 +01:00
Dario Mirovic	adc790a8bf	test: cluster: group migrated audit tests for cluster reuse This patch reorganizes the execution flow of the test functions. They are grouped to enable cluster reuse between specific test functions. One of the main contributors to the test execution time is the cluster preparation. This patch significantly reduces the total test execution time by making far fewer cluster preparation calls and reusing clusters more often. Execution time on the developer machine drops by around 38%: - before: 4m 29s - after: 2m 47s Fixes SCYLLADB-573	2026-03-19 16:11:47 +01:00
Dario Mirovic	967b7ff6bf	test: cluster: enable migrated audit tests and make them work Move audit tests from test/cluster/dtest to test/cluster. The test/cluster environment has less overhead, and the audit tests are heavy and take a long time to execute. This patch is part of an effort to improve audit test suite performance. This patch refactors the tests so that they execute correctly, and enables them. A follow-up patch will remove the audit tests in test/cluster/dtest. All the tests are confirmed to be running after the change. No dead code present. Test test_audit_categories_invalid is not parametrized anymore. It never used the parametrized helper class, so it just ran the same logic three times. This is why there are now 74, and not 76, test executions. Refs SCYLLADB-573	2026-03-19 16:07:28 +01:00
Dario Mirovic	8367509b3b	test: pylib: manager_client: specify AuthProvider in get_cql_exclusive This patch allows ManagerClient.get_cql_exclusive to accept AuthProvider as parameter. This will be used in a follow up patch which migrates audit test suite to test/cluster and requires this functionality for some tests. Refs SCYLLADB-573	2026-03-19 15:35:24 +01:00
Dario Mirovic	0a7a69345c	test: pylib: scylla cluster after_test log fix Before any test, a pool of ScyllaCluster objects is created. At the beginning of a test suite, a ScyllaClusterManager is created, and given a reference to the pool. At the end of a test suite, the ScyllaClusterManager is destroyed. Before each test case: - ManagerClient is constructed and connected to the ScyllaClusterManager of that test suite - A ScyllaCluster object is fetched from the pool - If the pool is empty, a new ScyllaCluster object is created - If the pool is not empty, a cached ScyllaCluster object is returned After each test case: - Return ScyllaCluster object from ManagerClient to the pool - If the cluster is dirty, the pool destroys it - If the cluster is clean, the pool caches it - ManagerClient is destroyed Many actions mark a cluster as dirty. Normal test execution will always make the cluster be destroyed upon returning to the pool. ManagerClient.mark_clean is not used in the tests. When it is used, the flow with cluster reuse happens. The bug is that the log file is closed even if cluster is not dirty. This causes an error when trying to log to a reused cluster server. The solution in this patch is to not close the log file if the cluster is not dirty. Upon cluster reuse the log file will be open and functional. Another approach would be to reopen the log file if closed, but this approach seems more clean. Refs SCYLLADB-573	2026-03-19 15:35:24 +01:00
Dario Mirovic	899ae71349	test: audit: copy audit test from dtest This patch just copies the audit test suite from dtest and disables it in the test config file. Later patches will update the code and enable the test suite. Refs SCYLLADB-573	2026-03-19 15:35:24 +01:00
Andrzej Jackowski	4deeb7ebfc	test: add new guardrail tests matching documentation scenarios Add tests for RF guardrails (min/max warn/fail, RF=0 bypass, threshold=-1 disable, ALTER KEYSPACE) and write consistency level guardrails to cover all scenarios described in guardrails.rst. Test runtime (dev): test_guardrail_replication_strategy - 6s test_guardrail_write_consistency_level - 5s Refs: SCYLLADB-257	2026-03-19 15:07:03 +01:00
Andrzej Jackowski	2a03c634c0	test: add metric assertions to guardrail replication strategy tests Verify that guardrail violations increment the corresponding metrics. Refs: SCYLLADB-257	2026-03-19 15:07:03 +01:00
Andrzej Jackowski	81c4e717e2	test: use regex matching in guardrail replication strategy tests Replace loose substring assertions with regex-based matching against the exact server message formats. Add regex constants for all guardrail messages and rewrite create_ks_and_assert_warnings_and_errors() to verify count and content of warnings and failures. Refs: SCYLLADB-257	2026-03-19 15:07:03 +01:00
Avi Kivity	5e7fb08bf3	Merge 'Fix bad performance for densely populated partition index pages' from Tomasz Grabiec This applies to small partition workload where index pages have high partition count, and the index doesn't fit in cache. It was observed that the count can be in the order of hundreds. In such a workload pages undergo constant population, LSA compaction, and LSA eviction, which has severe impact on CPU utilization. Refs https://scylladb.atlassian.net/browse/SCYLLADB-620 This PR reduces the impact by several changes: - reducing memory footprint in the partition index. Assuming partition key size is 16 bytes, the cost dropped from 96 bytes to 36 bytes per partition. - flattening the object graph and amortizing storage. Storing entries directly in the vector. Storing all key values in a single managed_bytes. Making index_entry a trivial struct. - index entries and key storage are now trivially moveable, and batched inside vector storage so LSA migration can use memcpy(), which amortizes the cost per key. This reduces the cost of LSA segment compaction. - LSA eviction is now pretty much constant time for the whole page regardless of the number of entries, because elements are trivial and batched inside vectors. Page eviction cost dropped from 50 us to 1 us. 
Performance evaluated with: scylla perf-simple-query -c1 -m200M --partitions=1000000 Before: ``` 7774.96 tps (166.0 allocs/op, 521.7 logallocs/op, 54.0 tasks/op, 802428 insns/op, 430457 cycles/op, 0 errors) 7511.08 tps (166.1 allocs/op, 527.2 logallocs/op, 54.0 tasks/op, 804185 insns/op, 430752 cycles/op, 0 errors) 7740.44 tps (166.3 allocs/op, 526.2 logallocs/op, 54.2 tasks/op, 805347 insns/op, 432117 cycles/op, 0 errors) 7818.72 tps (165.2 allocs/op, 517.6 logallocs/op, 53.7 tasks/op, 794965 insns/op, 427751 cycles/op, 0 errors) 7865.49 tps (165.1 allocs/op, 513.3 logallocs/op, 53.6 tasks/op, 788898 insns/op, 425171 cycles/op, 0 errors) ``` After (+318%): ``` 32492.40 tps (130.7 allocs/op, 12.8 logallocs/op, 36.1 tasks/op, 109236 insns/op, 103203 cycles/op, 0 errors) 32591.99 tps (130.4 allocs/op, 12.8 logallocs/op, 36.0 tasks/op, 108947 insns/op, 102889 cycles/op, 0 errors) 32514.52 tps (130.6 allocs/op, 12.8 logallocs/op, 36.0 tasks/op, 109118 insns/op, 103219 cycles/op, 0 errors) 32491.14 tps (130.6 allocs/op, 12.8 logallocs/op, 36.0 tasks/op, 109349 insns/op, 103272 cycles/op, 0 errors) 32582.90 tps (130.5 allocs/op, 12.8 logallocs/op, 36.0 tasks/op, 109269 insns/op, 102872 cycles/op, 0 errors) 32479.43 tps (130.6 allocs/op, 12.8 logallocs/op, 36.0 tasks/op, 109313 insns/op, 103242 cycles/op, 0 errors) 32418.48 tps (130.7 allocs/op, 12.8 logallocs/op, 36.1 tasks/op, 109201 insns/op, 103301 cycles/op, 0 errors) 31394.14 tps (130.7 allocs/op, 12.8 logallocs/op, 36.1 tasks/op, 109267 insns/op, 103301 cycles/op, 0 errors) 32298.55 tps (130.7 allocs/op, 12.8 logallocs/op, 36.1 tasks/op, 109323 insns/op, 103551 cycles/op, 0 errors) ``` When the workload is miss-only, with both row cache and index cache disabled (no cache maintenance cost): perf-simple-query -c1 -m200M --duration 6000 --partitions=100000 --enable-index-cache=0 --enable-cache=0 Before: ``` 9124.57 tps (146.2 allocs/op, 789.0 logallocs/op, 45.3 tasks/op, 889320 insns/op, 357937 cycles/op, 0 errors) 
9437.23 tps (146.1 allocs/op, 789.3 logallocs/op, 45.3 tasks/op, 889613 insns/op, 357782 cycles/op, 0 errors) 9455.65 tps (146.0 allocs/op, 787.4 logallocs/op, 45.2 tasks/op, 887606 insns/op, 357167 cycles/op, 0 errors) 9451.22 tps (146.0 allocs/op, 787.4 logallocs/op, 45.3 tasks/op, 887627 insns/op, 357357 cycles/op, 0 errors) 9429.50 tps (146.0 allocs/op, 787.4 logallocs/op, 45.3 tasks/op, 887761 insns/op, 358148 cycles/op, 0 errors) 9430.29 tps (146.1 allocs/op, 788.2 logallocs/op, 45.3 tasks/op, 888501 insns/op, 357679 cycles/op, 0 errors) 9454.08 tps (146.0 allocs/op, 787.3 logallocs/op, 45.3 tasks/op, 887545 insns/op, 357132 cycles/op, 0 errors) ``` After (+55%): ``` 14484.84 tps (150.7 allocs/op, 6.5 logallocs/op, 44.7 tasks/op, 396164 insns/op, 229490 cycles/op, 0 errors) 14526.21 tps (150.8 allocs/op, 6.5 logallocs/op, 44.8 tasks/op, 396401 insns/op, 228824 cycles/op, 0 errors) 14567.53 tps (150.7 allocs/op, 6.5 logallocs/op, 44.7 tasks/op, 396319 insns/op, 228701 cycles/op, 0 errors) 14545.63 tps (150.6 allocs/op, 6.5 logallocs/op, 44.7 tasks/op, 395889 insns/op, 228493 cycles/op, 0 errors) 14626.06 tps (150.5 allocs/op, 6.5 logallocs/op, 44.7 tasks/op, 395254 insns/op, 227891 cycles/op, 0 errors) 14593.74 tps (150.5 allocs/op, 6.5 logallocs/op, 44.7 tasks/op, 395480 insns/op, 227993 cycles/op, 0 errors) 14538.10 tps (150.8 allocs/op, 6.5 logallocs/op, 44.8 tasks/op, 397035 insns/op, 228831 cycles/op, 0 errors) 14527.18 tps (150.8 allocs/op, 6.5 logallocs/op, 44.8 tasks/op, 396992 insns/op, 228839 cycles/op, 0 errors) ``` Same as above, but with summary ratio increased from 0.0005 to 0.005 (smaller pages): Before: ``` 33906.70 tps (146.1 allocs/op, 83.6 logallocs/op, 45.1 tasks/op, 170553 insns/op, 98104 cycles/op, 0 errors) 32696.16 tps (146.0 allocs/op, 83.5 logallocs/op, 45.1 tasks/op, 170369 insns/op, 98405 cycles/op, 0 errors) 33889.05 tps (146.1 allocs/op, 83.6 logallocs/op, 45.1 tasks/op, 170551 insns/op, 98135 cycles/op, 0 errors) 33893.24 tps 
(146.1 allocs/op, 83.5 logallocs/op, 45.1 tasks/op, 170488 insns/op, 98168 cycles/op, 0 errors) 33836.73 tps (146.1 allocs/op, 83.6 logallocs/op, 45.1 tasks/op, 170528 insns/op, 98226 cycles/op, 0 errors) 33897.61 tps (146.0 allocs/op, 83.5 logallocs/op, 45.1 tasks/op, 170428 insns/op, 98081 cycles/op, 0 errors) 33834.73 tps (146.1 allocs/op, 83.5 logallocs/op, 45.1 tasks/op, 170438 insns/op, 98178 cycles/op, 0 errors) 33776.31 tps (146.3 allocs/op, 83.9 logallocs/op, 45.2 tasks/op, 170958 insns/op, 98418 cycles/op, 0 errors) 33808.08 tps (146.3 allocs/op, 83.9 logallocs/op, 45.2 tasks/op, 170940 insns/op, 98388 cycles/op, 0 errors) ``` After (+18%): ``` 40081.51 tps (148.2 allocs/op, 4.4 logallocs/op, 45.0 tasks/op, 121047 insns/op, 82231 cycles/op, 0 errors) 40005.85 tps (148.6 allocs/op, 4.4 logallocs/op, 45.2 tasks/op, 121327 insns/op, 82545 cycles/op, 0 errors) 39816.75 tps (148.3 allocs/op, 4.4 logallocs/op, 45.1 tasks/op, 121067 insns/op, 82419 cycles/op, 0 errors) 39953.11 tps (148.1 allocs/op, 4.4 logallocs/op, 45.0 tasks/op, 121027 insns/op, 82258 cycles/op, 0 errors) 40073.96 tps (148.2 allocs/op, 4.4 logallocs/op, 45.0 tasks/op, 121006 insns/op, 82313 cycles/op, 0 errors) 39882.25 tps (148.2 allocs/op, 4.4 logallocs/op, 45.0 tasks/op, 120925 insns/op, 82320 cycles/op, 0 errors) 39916.08 tps (148.3 allocs/op, 4.4 logallocs/op, 45.1 tasks/op, 121054 insns/op, 82393 cycles/op, 0 errors) 39786.30 tps (148.2 allocs/op, 4.4 logallocs/op, 45.0 tasks/op, 121027 insns/op, 82465 cycles/op, 0 errors) 38662.45 tps (148.3 allocs/op, 4.4 logallocs/op, 45.0 tasks/op, 121108 insns/op, 82312 cycles/op, 0 errors) 39849.42 tps (148.3 allocs/op, 4.4 logallocs/op, 45.1 tasks/op, 121098 insns/op, 82447 cycles/op, 0 errors) ``` Closes scylladb/scylladb#28603 * github.com:scylladb/scylladb: sstables: mx: index_reader: Optimize parsing for no promoted index case vint: Use std::countl_zero() test: sstable_partition_index_cache_test: Validate scenario of pages with sparse 
promoted index placement sstables: mx: index_reader: Amoritze partition key storage managed_bytes: Hoist write_fragmented() to common header utils: managed_vector: Use std::uninitialized_move() to move objects sstables: mx: index_reader: Keep promoted_index info next to index_entry sstables: mx: index_reader: Extract partition_index_page::clear_gently() sstables: mx: index_reader: Shave-off 16 bytes from index_entry by using raw_token sstables: mx: index_reader: Reduce allocation_section overhead during index page parsing by batching allocation sstables: mx: index_reader: Keep index_entry directly in the vector dht: Introduce raw_token test: perf_simple_query: Add 'sstable-format' command-line option test: perf_simple_query: Add 'sstable-summary-ratio' command-line option test: perf-simple-query: Add option to disable index cache test: cql_test_env: Respect enable-index-cache config	2026-03-19 14:42:50 +02:00
Ernest Zaslavsky	aa9da87e97	encryption: fix deadlock in encrypted_data_source::get() When encrypted_data_source::get() caches a trailing block in _next, the next call takes it directly — bypassing input_stream::read(), which checks _eof. It then calls input_stream::read_exactly() on the already-drained stream. Unlike read(), read_up_to(), and consume(), read_exactly() does not check _eof when the buffer is empty, so it calls _fd.get() on a source that already returned EOS. In production this manifested as stuck encrypted SSTable component downloads during tablet restore: the underlying chunked_download_source hung forever on the post-EOS get(), causing 4 tablets to never complete. The stuck files were always block-aligned sizes (8k, 12k) where _next gets populated and the source is fully consumed in the same call. Fix by checking _input.eof() before calling read_exactly(). When the stream already reached EOF, buf2 is known to be empty, so the call is skipped entirely. A comprehensive test is added that uses a strict_memory_source which fails on post-EOS get(), reproducing the exact code path that caused the production deadlock.	2026-03-19 13:54:54 +02:00
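The guard described in aa9da87e97 can be modeled in a few lines. This is a simplified Python sketch, not the Seastar or Scylla code: `StrictSource`, `BufferedInput` and `read_block` are hypothetical names, and the strict source raises where the production source hung.

```python
class StrictSource:
    """Hypothetical source that fails if get() is called after end of stream."""

    def __init__(self, chunks):
        self._chunks = list(chunks)
        self._eos_returned = False

    def get(self):
        if self._eos_returned:
            raise RuntimeError("get() called after end of stream")
        if not self._chunks:
            self._eos_returned = True
            return b""
        return self._chunks.pop(0)


class BufferedInput:
    """Hypothetical stream whose read_exactly() does not check for EOF."""

    def __init__(self, source):
        self._source = source
        self._buf = b""
        self._eof = False

    def eof(self):
        return self._eof

    def read_exactly(self, n):
        # Deliberately unguarded: it asks the source again even after EOS.
        while len(self._buf) < n:
            chunk = self._source.get()
            if not chunk:
                self._eof = True
                break
            self._buf += chunk
        out, self._buf = self._buf[:n], self._buf[n:]
        return out


def read_block(stream, n):
    # The guard: skip the read when the stream is already known to be drained.
    if stream.eof():
        return b""
    return stream.read_exactly(n)
```

Calling `read_block` twice on a fully consumed stream returns an empty buffer the second time, whereas calling `read_exactly` directly would reach the source after end of stream.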
Ernest Zaslavsky	f74a54f005	test_lib: mark `limiting_data_source_impl` as not `final`	2026-03-19 13:54:54 +02:00
Ernest Zaslavsky	151e945d9f	Fix formatting after previous patch	2026-03-19 13:54:44 +02:00
Andrzej Jackowski	517bb8655d	test: extract ks_opts helper in test_guardrail_replication_strategy Factor out ks_opts() to build keyspace options with tablets handling and use it across all existing replication strategy guardrail tests. No behavioral changes. This facilitates further modification of the tests later in this patch series. Refs: SCYLLADB-257	2026-03-19 12:49:41 +01:00
Ernest Zaslavsky	537747cf5d	Fix indentation after previous patch	2026-03-19 13:48:53 +02:00
Ernest Zaslavsky	2535164542	test_lib: make limiting_data_source_impl available to tests Relocate the `limiting_data_source_impl` declaration to the header file so that test code can access it directly.	2026-03-19 13:48:53 +02:00
Botond Dénes	86d7c82993	test/cluster/test_repair.py: use tablets in test_repair_timestamp_difference After repair, the test does a major to compact all sstables into a single one, so the results can be simply checked by a select from mutation_fragments() query. Sometimes off-strategy happens parallel to this major, so after the major there are still 2 sstables, resulting in the test failing when checking that the query returns just a single row. To fix, just use tablets for the test table, tablets don't use off-strategy anymore. Fixes: SCYLLADB-940 Closes scylladb/scylladb#29071	2026-03-19 12:42:18 +03:00
Michael Litvak	399260a6c0	test: mv: fix flaky wait for commitlog sync Previously the test test_interrupt_view_build_shard_registration stopped the node ungracefully and used commitlog periodic mode to persist the view build progress in a not very reliable way. It can happen that due to timing issues, the view build progress is not persisted, or some of it is persisted in a different ordering than expected. To make the test more reliable we change it to stop the node gracefully, so the commitlog is persisted in a graceful and consistent way, without using the periodic mode delay. We need to also change the injection for the shutdown to not get stuck. Fixes SCYLLADB-1005 Closes scylladb/scylladb#29008	2026-03-19 10:41:21 +01:00
Pavel Emelyanov	f27dc12b7c	Merge 'Fix directory lister leak in table::get_snapshot_details: ' from Benny Halevy As reported in SCYLLADB-1013, the directory lister must be closed also when an exception is thrown. For example, see backtrace below: ``` seastar::on_internal_error(seastar::logger&, std::basic_string_view<char, std::char_traits<char>>) at ./build/release/seastar/./seastar/src/core/on_internal_error.cc:57 directory_lister::~directory_lister() at ./utils/lister.cc:77 replica::table::get_snapshot_details(std::filesystem::__cxx11::path, std::filesystem::__cxx11::path) (.resume) at ./replica/table.cc:4081 std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<db::snapshot_ctl::table_snapshot_details>::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/coroutine:247 (inlined by) seastar::internal::coroutine_traits_base<db::snapshot_ctl::table_snapshot_details>::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:129 seastar::reactor::task_queue::run_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2695 (inlined by) seastar::reactor::task_queue_group::run_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3201 seastar::reactor::task_queue_group::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3185 (inlined by) seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3353 seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3245 seastar::app_template::run_deprecated(int, char, std::function<void ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:266 seastar::app_template::run(int, char, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:160 scylla_main(int, char*) at ./main.cc:756 ``` Fixes: [SCYLLADB-1013](https://scylladb.atlassian.net/browse/SCYLLADB-1013) Requires backport to 2026.1 since 
the leak exists since `004c08f525` [SCYLLADB-1013]: https://scylladb.atlassian.net/browse/SCYLLADB-1013?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Closes scylladb/scylladb#29084 * github.com:scylladb/scylladb: test/boost/database_test: add test_snapshot_ctl_details_exception_handling table: get_snapshot_details: fix indentation inside try block table: per-snapshot get_snapshot_details: fix typo in comment table: per-snapshot get_snapshot_details: always close lister using try/catch table: get_snapshot_details: always close lister using deferred_close	2026-03-19 12:40:23 +03:00
Raphael S. Carvalho	3143134968	test: avoid split/major compaction deadlock in tablet split test Run keyspace compaction asynchronously in `test_tombstone_gc_correctness_during_tablet_split` and only await it after `split_sstable_rewrite` is disabled. The problem is that `keyspace_compaction()` starts with a flush, and that flush can take around five seconds. During that window the split compaction is stopped before major compaction is retried. The stop aborts the in-flight major compaction attempt, then the split proceeds far enough to enter the `split_sstable_rewrite` injection point. At that point the test used to wait synchronously for major compaction to finish, but major compaction cannot finish yet: when it retries, it needs the same semaphore that is still effectively tied up behind the blocked split rewrite. So the test waits for major compaction, while the split waits for the injection to be released, and the code that would release that injection never runs. Starting major compaction as a task breaks that cycle. The test can first disable `split_sstable_rewrite`, let the split get out of the way, and only then wait for major compaction to complete. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-827. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#29066	2026-03-19 11:12:21 +02:00
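The cycle described in 3143134968, and how starting major compaction as a task breaks it, can be sketched with plain asyncio. This is an illustrative model only: the semaphore, the event standing in for the `split_sstable_rewrite` injection, and the coroutine names are assumptions, not the test's real helpers.

```python
import asyncio


async def run_scenario():
    semaphore = asyncio.Semaphore(1)      # stand-in for the shared compaction semaphore
    injection_released = asyncio.Event()  # stand-in for the injection point
    order = []

    async def split():
        async with semaphore:
            # The split holds the semaphore while blocked on the injection.
            await injection_released.wait()
            order.append("split done")

    async def major_compaction():
        async with semaphore:
            order.append("major done")

    split_task = asyncio.create_task(split())
    await asyncio.sleep(0)  # let the split take the semaphore first

    # Start major compaction as a task instead of awaiting it right here.
    # Awaiting it at this point would never return, because the line that
    # releases the injection would not be reached.
    major_task = asyncio.create_task(major_compaction())

    injection_released.set()  # let the split get out of the way
    await major_task          # only now wait for major compaction
    await split_task
    return order
```

Running `asyncio.run(run_scenario())` completes with the split finishing before major compaction, which mirrors the ordering the commit message describes.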
Botond Dénes	2e47fd9f56	Merge 'tasks: do not fail the wait request if rpc fails' from Aleksandra Martyniuk During decommission, we first mark a topology request as done, then shut down a node and in the following steps we remove node from the topology. Thus, finished request does not imply that a node is removed from the topology. Due to that, in node_ops_virtual_task::wait, while gathering children from the whole cluster, we may hit the connection exception - because a node is still in topology, even though it is down. Modify the get_children method to ignore the exception and warn about the failure instead. Keep token_metadata_ptr in get_children to prevent topology from changing. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-867 Needs backports to all versions Closes scylladb/scylladb#29035 * github.com:scylladb/scylladb: tasks: fix indentation tasks: do not fail the wait request if rpc fails tasks: pass token_metadata_ptr to task_manager::virtual_task::impl::get_children	2026-03-19 10:03:18 +02:00
Michael Litvak	31d339e54a	logstor: trigger separator flush for buffers that hold old segments A compaction group has a separator buffer that holds the mixed segments alive until the separator buffer is flushed. A mixed segment can be freed only after all separator buffers that hold writes from the segment are flushed. Typically a separator buffer is flushed when it becomes full. However, it is possible, for example, that one compaction group is filled more slowly than others and holds many segments. To fix this we trigger a separator flush periodically for separator buffers that hold old segments. We track the active segment sequence number and, for each separator buffer, the oldest sequence number it holds.	2026-03-18 19:24:28 +01:00
Michael Litvak	a0da07e5b7	logstor: recover segments into compaction groups Fix the logstor recovery to work with compaction groups. When recovering a segment find its token range and add it to the appropriate compaction groups. if it doesn't fit in a single compaction group then write each record to its compaction group's separator buffer.	2026-03-18 19:24:28 +01:00
Michael Litvak	24379acc76	logstor: range read extend the logstor mutation reader to support range read	2026-03-18 19:24:28 +01:00
Michael Litvak	e7c3942d43	logstor: move segments to replica::compaction_group Add a segment_set member to replica::compaction_group that manages the logstor segments that belong to the compaction group, similarly to how it manages sstables. Add also a separator buffer in each compaction group. When writing a mutation to a compaction group, the mutation is written to the active segment and to the separator buffer of the compaction group, and when the separator buffer is flushed the segment is added to the compaction_group's segment set.	2026-03-18 19:24:28 +01:00
Michael Litvak	bd66edee5c	logstor: truncate table implement freeing all segments of a table for table truncate. first do barrier to flush all active and mixed segments and put all the table's data in compaction groups, then stop compaction for the table, then free the table's segments and remove the live entries from the index.	2026-03-18 19:24:27 +01:00
Michael Litvak	37c485e3d1	test: logstor: add separator and compaction tests	2026-03-18 19:24:27 +01:00
Michael Litvak	31aefdc07d	logstor: segment and separator barrier add barrier operation that forces switch of the active segment and separator, and waits for all existing segments to close and all separators to flush.	2026-03-18 19:24:27 +01:00
Michael Litvak	600ec82bec	logstor: separator initial implementation of the separator. it replaces "mixed" segments - segments that have records from different groups, to segments by group. every write is written to the active segment and to a buffer in the active separator. the active separator has in-memory buffers by group. at some threshold number of segments we switch the active segment and separator atomically, and start flushing the separator. the separator is flushed by writing the buffers into new non-mixed segments, adding them to a compaction group, and frees the mixed segments.	2026-03-18 19:24:27 +01:00
Michael Litvak	5a16980845	logstor: recovery: initial initial and basic recovery implementation. * find all files, read their segments and populate the index with the newest record for each key. * find which segments are used and build the usage histogram	2026-03-18 19:24:26 +01:00
Michael Litvak	521fca5c92	logstor: index: buckets divide the primary index to buckets, each bucket containing a btree. the bucket is determined by using bits from the key hash.	2026-03-18 19:24:26 +01:00
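The bucketing scheme described in 521fca5c92 can be illustrated as follows. This is a sketch under loud assumptions: a plain dict stands in for the per-bucket btree, SHA-1 stands in for the 20-byte hash (the series uses RIPEMD-160, which is not available in every Python build), and `BUCKET_BITS` and the class name are invented for the example.

```python
import hashlib

BUCKET_BITS = 4  # hypothetical: 2**4 = 16 buckets


class BucketedIndex:
    """Maps a key hash to an on-disk location, split into buckets."""

    def __init__(self, bucket_bits=BUCKET_BITS):
        self._bucket_bits = bucket_bits
        # One container per bucket; a dict stands in for the btree.
        self._buckets = [dict() for _ in range(1 << bucket_bits)]

    @staticmethod
    def key_hash(key: bytes) -> bytes:
        # 20-byte digest; SHA-1 is an assumption made for portability.
        return hashlib.sha1(key).digest()

    def _bucket_for(self, digest: bytes) -> dict:
        # Use the top bits of the hash to pick the bucket.
        top = int.from_bytes(digest[:4], "big")
        return self._buckets[top >> (32 - self._bucket_bits)]

    def put(self, key: bytes, location: int) -> None:
        digest = self.key_hash(key)
        self._bucket_for(digest)[digest] = location

    def get(self, key: bytes):
        digest = self.key_hash(key)
        return self._bucket_for(digest).get(digest)
```

Because the bucket is derived from the hash itself, a lookup touches exactly one bucket and each bucket's container stays smaller than one container for the whole index.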
Michael Litvak	ddd72a16b0	logstor: add group_id add group_id value to each log record that is passed with the mutation when writing it. the group_id will be used to group log records in segments, such that a segment will contain records only from a single group. this will be useful for tablet migration. we want for each tablet to have their own segments with all their records, so we can migrate them efficiently by copying these segments. the group_id value is set to a value equivalent to the tablet id.	2026-03-18 19:24:26 +01:00
Michael Litvak	5f649dd39f	logstor: use RIPEMD-160 for index key use a 20-byte hash function for the index key to make hash collisions very unlikely. we assume there are no hash collisions.	2026-03-18 19:24:26 +01:00
Michael Litvak	a521bcbcee	test: add test_logstor.py add basic tests for key-value tables with logstor storage	2026-03-18 19:24:26 +01:00
Michael Litvak	1ae1f37ec1	api: add logstor compaction trigger endpoint add a new api endpoint that triggers logstor compaction.	2026-03-18 19:24:26 +01:00
Michael Litvak	2128b1b15c	replica: add logstor to db Add a single logstor instance in the database that is used for writing and reading to tables with kv storage	2026-03-18 19:24:26 +01:00
Michael Litvak	9172cc172e	schema: add logstor cf property add a schema property for tables with logstor storage	2026-03-18 19:24:26 +01:00
Michael Litvak	0b1343747f	logstor: initial commit initial implementation of the logstor storage engine for key-value tables that supports writes, reads and basic compaction. main components: * logstor: this is the main interface to users that supports writing and reading back mutations, and manages the internal components. * index: the primary index in-memory that maps a key to a location on disk. * write buffer: writes go initially to a write buffer. it accumulates multiple records in a buffer and writes them to the segment manager in 4k sized blocks. * segment manager: manages the storage - files, segments, compaction. it manages file and segment allocation, and writes 4k aligned buffers to the active segment sequentially. it tracks the used space in each segment. the compaction finds segment with low space usage and writes them to new segments, and frees the old segments.	2026-03-18 19:24:26 +01:00
Avi Kivity	46a6f8e1d3	Merge 'auth: add maintenance_socket_authorizer' from Dario Mirovic GRANT/REVOKE fails on the maintenance socket connections, because maintenance_auth_service uses allow_all_authorizer. allow_all_authorizer allows all operations, but not GRANT/REVOKE, because they make no sense in its context. This has been observed during PGO run failure in operations from ./pgo/conf/auth.cql file. This patch introduces maintenance_socket_authorizer that supports the capabilities of default_authorizer ('CassandraAuthorizer') without needing authorization. Refs SCYLLADB-1070 This is an improvement, no need for backport. Closes scylladb/scylladb#29080 * github.com:scylladb/scylladb: test: use NetworkTopologyStrategy in maintenance socket tests test: use cleanup fixture in maintenance socket auth tests auth: add maintenance_socket_authorizer	2026-03-18 19:29:57 +02:00
Gleb Natapov	77d3245e02	view: remove upgrade to raft code Since we no longer support upgrading from versions that do not support v2 of the view building code, we can remove the upgrade code and make sure we do not boot with an old builder version.	2026-03-18 17:45:40 +02:00
Tomasz Grabiec	6017688445	test: sstable_partition_index_cache_test: Validate scenario of pages with sparse promoted index placement	2026-03-18 16:25:21 +01:00
Tomasz Grabiec	f55bb154ec	sstables: mx: index_reader: Amoritze partition key storage This change reduces the cost of partition index page construction and LSA migration. This is achieved by several things working together: - index entries don't store keys as separate small objects (managed_bytes). They are written into one fragmented managed_bytes storage, and entries hold an offset into it. Before, we paid 16 bytes for managed_bytes plus LSA descriptor for the storage (1 byte) plus back-reference in the storage (8 bytes), so 25 bytes. Now we only pay 4 bytes for the size offset. If keys are 16 bytes, that's a reduction from 31 bytes to 20 bytes per key. - index entries and key storage are now trivially moveable, so LSA migration can use memcpy(), which amortizes the cost per key. LSA eviction is now trivial and constant time for the whole page regardless of the number of entries. Page eviction dropped from 14 us to 1 us. This improves throughput in a CPU-bound miss-heavy read workload where the partition index doesn't fit in memory. 
scylla perf-simple-query -c1 -m200M --partitions=1000000 Before: 15328.25 tps (150.0 allocs/op, 14.1 logallocs/op, 45.4 tasks/op, 286769 insns/op, 218134 cycles/op, 0 errors) 15279.01 tps (149.9 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 287696 insns/op, 218637 cycles/op, 0 errors) 15347.78 tps (149.7 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 285851 insns/op, 217795 cycles/op, 0 errors) 15403.68 tps (149.6 allocs/op, 14.1 logallocs/op, 45.2 tasks/op, 285111 insns/op, 216984 cycles/op, 0 errors) 15189.47 tps (150.0 allocs/op, 14.1 logallocs/op, 45.5 tasks/op, 289509 insns/op, 219602 cycles/op, 0 errors) 15295.04 tps (149.8 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 288021 insns/op, 218545 cycles/op, 0 errors) 15162.01 tps (149.8 allocs/op, 14.1 logallocs/op, 45.4 tasks/op, 291265 insns/op, 220451 cycles/op, 0 errors) After: 21620.18 tps (148.4 allocs/op, 13.4 logallocs/op, 43.7 tasks/op, 176817 insns/op, 153183 cycles/op, 0 errors) 20644.03 tps (149.8 allocs/op, 13.5 logallocs/op, 44.3 tasks/op, 187941 insns/op, 160409 cycles/op, 0 errors) 20588.06 tps (150.1 allocs/op, 13.5 logallocs/op, 44.5 tasks/op, 188090 insns/op, 160818 cycles/op, 0 errors) 20789.29 tps (149.5 allocs/op, 13.5 logallocs/op, 44.2 tasks/op, 186495 insns/op, 159382 cycles/op, 0 errors) 20977.89 tps (149.5 allocs/op, 13.4 logallocs/op, 44.2 tasks/op, 183969 insns/op, 158140 cycles/op, 0 errors) 21125.34 tps (149.1 allocs/op, 13.4 logallocs/op, 44.1 tasks/op, 183204 insns/op, 156925 cycles/op, 0 errors) 21244.42 tps (148.6 allocs/op, 13.4 logallocs/op, 43.8 tasks/op, 181276 insns/op, 155973 cycles/op, 0 errors) Mostly because the index now fits in memory. When it doesn't, the benefits are still visible due to lower LSA overhead.	2026-03-18 16:25:21 +01:00
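The storage layout described in f55bb154ec (all keys of a page written into one buffer, with entries holding only an offset) can be illustrated with a small Python model. It shows the layout idea only and says nothing about the C++ memory costs quoted above; `KeyPage` is a hypothetical name.

```python
class KeyPage:
    """Stores all keys of a page in one buffer; entries keep only offsets."""

    def __init__(self, keys):
        self._offsets = []
        storage = bytearray()
        for key in keys:
            # Each entry records where its key starts in the shared buffer.
            self._offsets.append(len(storage))
            storage += key
        self._end = len(storage)
        self._storage = bytes(storage)

    def __len__(self):
        return len(self._offsets)

    def key(self, i):
        # A key ends where the next one starts (or at the end of the buffer).
        start = self._offsets[i]
        end = self._offsets[i + 1] if i + 1 < len(self._offsets) else self._end
        return self._storage[start:end]
```

Since the page owns a single buffer and a flat list of offsets, moving or dropping the page is one operation over two objects rather than one per key.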

11801 Commits