scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-24 18:40:38 +00:00

Author	SHA1	Message	Date
Benny Halevy	ca9ff134b8	sstables: log debug message in filesystem_storage::clone	2026-03-24 12:26:03 +02:00
Botond Dénes	772b32d9f7	test/scylla_gdb: fix flakiness by preparing objects at test time Fixtures previously ran GDB once (module scope) to find live objects (sstables, tasks, schemas) and stored their addresses. Tests then reused those addresses in separate GDB invocations. Sometimes these addresses became stale and the test hit a use-after-free (e.g. sstables compacted away between invocations). Fix by dropping the fixtures. The helper functions used by the fixtures to obtain the required objects are converted to gdb convenience functions, which can be used in the same expression as the test command invocation. Thus, the object is acquired on demand at the moment it is used, so it is guaranteed to be fresh and relevant. Fixes: SCYLLADB-1020 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Closes scylladb/scylladb#28999	2026-03-23 16:54:03 +02:00
Piotr Dulikowski	60fb5270a9	logstor: fix fmt::format use with std::filesystem::path The version of fmt installed on my machine refuses to work with `std::filesystem::path` directly. Add `.string()` calls in places that attempt to print paths directly in order to make them work. Closes scylladb/scylladb#29148	2026-03-23 15:15:52 +01:00
Pavel Emelyanov	3b9398dfc8	Merge 'encryption: fix deadlock in encrypted_data_source::get()' from Ernest Zaslavsky When encrypted_data_source::get() caches a trailing block in _next, the next call takes it directly — bypassing input_stream::read(), which checks _eof. It then calls input_stream::read_exactly() on the already-drained stream. Unlike read(), read_up_to(), and consume(), read_exactly() does not check _eof when the buffer is empty, so it calls _fd.get() on a source that already returned EOS. In production this manifested as stuck encrypted SSTable component downloads during tablet restore: the underlying chunked_download_source hung forever on the post-EOS get(), causing 4 tablets to never complete. The stuck files were always block-aligned sizes (8k, 12k) where _next gets populated and the source is fully consumed in the same call. Fix by checking _input.eof() before calling read_exactly(). When the stream already reached EOF, buf2 is known to be empty, so the call is skipped entirely. A comprehensive test is added that uses a strict_memory_source which fails on post-EOS get(), reproducing the exact code path that caused the production deadlock. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1128 Backport to 2025.3/4 and 2026.1 is needed since it fixes a bug that may bite us in production, to be on the safe side Closes scylladb/scylladb#29110 * github.com:scylladb/scylladb: encryption: fix deadlock in encrypted_data_source::get() test_lib: mark `limiting_data_source_impl` as not `final` Fix formatting after previous patch Fix indentation after previous patch test_lib: make limiting_data_source_impl available to tests	2026-03-23 17:12:44 +03:00
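The failure mode can be reproduced in a small Python model. This is a hypothetical sketch, not Seastar's API: `StrictSource`, `InputStream`, and `next_block` are illustrative stand-ins. The model follows the description above. `read()` honors the EOF flag, while `read_exactly()` pulls from the source whenever its buffer is empty. `StrictSource`, like the test's `strict_memory_source`, raises on any `get()` after end-of-stream. `next_block` applies the fix by checking `eof()` first.

```python
class StrictSource:
    """Hypothetical source: returns chunks, then b"" once; any later get() is an error."""

    def __init__(self, chunks):
        self._chunks = list(chunks)
        self._eos_returned = False

    def get(self):
        if self._eos_returned:
            raise RuntimeError("get() called after end of stream")
        if self._chunks:
            return self._chunks.pop(0)
        self._eos_returned = True
        return b""


class InputStream:
    """Hypothetical model of the input_stream behavior described in the commit."""

    def __init__(self, source):
        self._source = source
        self._buf = b""
        self._eof = False

    def eof(self):
        return self._eof

    def read(self):
        # Checks the EOF flag before touching the source.
        if self._buf:
            data, self._buf = self._buf, b""
            return data
        if self._eof:
            return b""
        data = self._source.get()
        self._eof = not data
        return data

    def read_exactly(self, n):
        # Does NOT check the EOF flag: an empty buffer always triggers a get().
        out = b""
        while len(out) < n:
            if not self._buf:
                chunk = self._source.get()
                if not chunk:
                    self._eof = True
                    break
                self._buf = chunk
            take = min(n - len(out), len(self._buf))
            out += self._buf[:take]
            self._buf = self._buf[take:]
        return out


def next_block(stream, cached, block_size):
    """Return the cached trailing block plus any following data (models the fixed get())."""
    if stream.eof():
        return cached  # stream already drained: the second read is known to be empty
    return cached + stream.read_exactly(block_size)
```

Once the source has returned end-of-stream, `next_block` hands back the cached block without touching the source. Calling `read_exactly()` directly would issue the post-EOS `get()`.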
Piotr Dulikowski	df68d0c0f7	directories: add missing seastar/util/closeable.hh include Without this include the file would not compile on its own. The issue was most likely masked by the use of precompiled headers in our CI. Closes scylladb/scylladb#29170	2026-03-23 15:46:56 +03:00
Yaniv Michael Kaul	051107f5bc	scylla-gdb: fix sstable-summary crash on ms-format sstables The 'scylla sstable-summary' GDB command crashes with 'ValueError: Argument "count" should be greater than zero' when inspecting ms-format (trie-based) sstables. This happens because ms-format sstables don't populate the traditional summary structure, leaving all fields zeroed out, which causes gdb.read_memory() to be called with a zero count. Fix by: - Adding zero-length guards to sstring.to_hex() and sstring.as_bytes() to return early when the data length is zero, consistent with the existing guard in managed_bytes.get(). - Adding the same guard to scylla_sstable_summary.to_hex(). - Detecting ms-format sstables (version == 5) early in scylla_sstable_summary.invoke() and printing an informative message instead of attempting to read the unpopulated summary. Fixes: SCYLLADB-1180 Closes scylladb/scylladb#29162	2026-03-23 12:44:47 +02:00
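The zero-length guard can be sketched as follows. `read_memory` is an injected callable standing in for GDB's `read_memory`, so the sketch runs outside GDB. `to_hex` is a simplified, hypothetical version of the helpers named above. The guard returns early instead of issuing a zero-byte read, which is what raised the `ValueError` on ms-format sstables.

```python
def to_hex(read_memory, address, length):
    """Hex-dump `length` bytes at `address`.

    read_memory(address, count) stands in for gdb's read_memory so this runs outside GDB.
    """
    if length == 0:
        return ""  # zeroed (e.g. unpopulated ms-format) summary fields: nothing to read
    return bytes(read_memory(address, length)).hex()
```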
Piotr Szymaniak	c8e7e20c5c	test/cluster: retry create_table on transient schema agreement timeout In test_index_requires_rf_rack_valid_keyspace, the create_table call for a plain tablet-based table can fail with 'Unable to reach schema agreement' after the server's 10s timeout is exceeded. This happens when schema gossip propagation across the 4-node cluster takes longer than expected after a sequence of rapid schema changes earlier in the test. Add a retry (up to 2 attempts) on schema agreement errors for this specific create_table call rather than increasing the server-side timeout. Fixes: SCYLLADB-1135 Closes scylladb/scylladb#29132	2026-03-23 10:45:30 +02:00
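A retry limited to one specific error might look like the Python sketch below. `retry_on_schema_agreement` is a hypothetical helper, not part of the test framework. Matching on the text `Unable to reach schema agreement` is an assumption about how the failure surfaces. Any other exception, or a failure on the last attempt, is re-raised unchanged.

```python
import asyncio

SCHEMA_AGREEMENT_ERROR = "Unable to reach schema agreement"  # assumed error text


async def retry_on_schema_agreement(op, attempts=2):
    """Await op(), retrying only when it fails with a schema-agreement error (hypothetical)."""
    for attempt in range(1, attempts + 1):
        try:
            return await op()
        except Exception as exc:
            # Re-raise unrelated errors immediately, and the transient one on the last attempt.
            if SCHEMA_AGREEMENT_ERROR not in str(exc) or attempt == attempts:
                raise


if __name__ == "__main__":
    calls = []

    async def flaky_create_table():
        calls.append(1)
        if len(calls) == 1:
            raise RuntimeError(SCHEMA_AGREEMENT_ERROR + " after 10 seconds")
        return "created"

    print(asyncio.run(retry_on_schema_agreement(flaky_create_table)))
```

The retry stays local to one call rather than raising the server-side timeout. As a result, other schema-agreement failures remain visible.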
Yaniv Kaul	fb1f995d6b	.github/workflows/backport-pr-fixes-validation.yaml: workflow does not contain permissions (Potential fix for code scanning alert no. 139) Potential fix for https://github.com/scylladb/scylladb/security/code-scanning/139, To fix the problem, explicitly restrict the `GITHUB_TOKEN` permissions for this workflow/job so it has only what is needed. The script reads PR data and repository info (which is covered by `contents: read`/default read scopes) and posts a comment via `github.rest.issues.createComment`, which requires `issues: write`. No other write scopes (e.g., `contents: write`, `pull-requests: write`) are necessary. The best fix without changing functionality is to add a `permissions` block scoped to this job (or at the workflow root). Since we only see a single job here, we’ll add it under `check-fixes-prefix`. Concretely, in `.github/workflows/backport-pr-fixes-validation.yaml`, between the `runs-on: ubuntu-latest` line (line 10) and `steps:` (line 11), add: ```yaml permissions: contents: read issues: write ``` This keeps the token minimally privileged while still allowing the script to create issue/PR comments. Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Closes scylladb/scylladb#27810	2026-03-23 10:30:01 +02:00
Piotr Smaron	32225797cd	dtest: fix flaky test_writes_schema_recreated_while_node_down `read_barrier(session2)` was supposed to ensure `node2` has caught up on schema before a CL=ALL write. But `patient_cql_connection(node2)` creates a cluster-aware driver session `(TokenAwarePolicy(DCAwareRoundRobinPolicy()))` that can route the barrier CQL statement to any node — not necessarily `node2`. If the barrier runs on `node1` or `node3` (which already have the new schema), it's a no-op, and `node2` remains stale, thus the observed `WriteFailure`. The fix is to switch to `patient_exclusive_cql_connection(node2)`, which uses `WhiteListRoundRobinPolicy([node2_ip])` to pin all CQL to `node2`. This is already the established pattern used by other tests in the same file. Fixes: SCYLLADB-1139 No need to backport yet, appeared only on master. Closes scylladb/scylladb#29151	2026-03-23 10:25:54 +02:00
Michał Chojnowski	f29525f3a6	test/boost/cache_algorithm_test: disable sstable compression to avoid giant index pages The test intentionally creates huge index pages. But since `5e7fb08bf3`, the index reader allocates a block of memory for a whole index page, instead of incrementally allocating small pieces during index parsing. This giant allocation causes the test to fail spuriously in CI sometimes. Fix this by disabling sstable compression on the test table, which puts a hard cap of 2000 keys per index page. Fixes: SCYLLADB-1152 Closes scylladb/scylladb#29152	2026-03-23 09:57:11 +02:00
Raphael S. Carvalho	05b11a3b82	sstables_loader: use new sstable add path Use add_new_sstable_and_update_cache() when attaching SSTables downloaded by the node-scoped local loader. This is the correct variant for new SSTables: it can unlink the SSTable on failure to add it, and it can split the SSTable if a tablet split is in progress. The older add_sstable_and_update_cache() helper is intended for preexisting SSTables that are already stable on disk. Additionally, downloaded SSTables are now left unsealed (TemporaryTOC) until they are successfully added to the table's SSTable set. The download path (download_fully_contained_sstables) passes leave_unsealed=true to create_stream_sink, and attach_sstable opens the SSTable with unsealed_sstable=true and seals it only inside the on_add callback — matching the pattern used by stream_blob.cc and storage_service.cc for tablet streaming. This prevents a data-resurrection hazard: previously, if the process crashed between download and attach_sstable, or if attach_sstable failed mid-loop, sealed (TOC) SSTables would remain in the table directory and be reloaded by distributed_loader on restart. With TemporaryTOC, sstable_directory automatically cleans them up on restart instead. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1085. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#29072	2026-03-23 10:33:04 +03:00
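The sealing protocol can be illustrated with plain files. This is a hypothetical Python sketch, not the `sstable_directory` code:

- An sstable is a set of files sharing a prefix. The names used here, such as `me-1-big`, are simplified and are not real sstable generation names.
- An unsealed sstable has only a `TemporaryTOC.txt` component.
- Sealing renames that component to `TOC.txt`.
- A restart deletes every sstable that still has a `TemporaryTOC`.

```python
import os
import tempfile

TEMP_TOC = "-TemporaryTOC.txt"
TOC = "-TOC.txt"


def write_unsealed(directory, name):
    """Create a toy sstable with a data component and only a TemporaryTOC."""
    for suffix in ("-Data.db", TEMP_TOC):
        open(os.path.join(directory, name + suffix), "w").close()


def seal(directory, name):
    """Seal by renaming TemporaryTOC to TOC (done only once the sstable is attached)."""
    os.rename(os.path.join(directory, name + TEMP_TOC), os.path.join(directory, name + TOC))


def cleanup_on_restart(directory):
    """Remove all components of sstables that were never sealed; return their names."""
    unsealed = {f[: -len(TEMP_TOC)] for f in os.listdir(directory) if f.endswith(TEMP_TOC)}
    for f in os.listdir(directory):
        if any(f.startswith(name + "-") for name in unsealed):
            os.remove(os.path.join(directory, f))
    return unsealed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        write_unsealed(d, "me-1-big")  # "crashed" before attach: stays unsealed
        write_unsealed(d, "me-2-big")
        seal(d, "me-2-big")            # attached successfully
        print(cleanup_on_restart(d), sorted(os.listdir(d)))
```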
Piotr Szymaniak	f511264831	alternator/test: fix test_ttl_with_load_and_decommission flaky Connection refused error The native Scylla nodetool reports ECONNREFUSED as 'Connection refused', not as 'ConnectException' (which is the Java nodetool format). Add 'Connection refused' to the valid_errors list so that transient connection failures during concurrent decommission/bootstrap topology changes are properly tolerated. Fixes SCYLLADB-1167 Closes scylladb/scylladb#29156	2026-03-22 11:01:45 +02:00
Piotr Dulikowski	cc695bc3f7	Merge 'vector_search: fix race condition on connection timeout' from Karol Nowacki When a `with_connect` operation timed out, the underlying connection attempt continued to run in the reactor. This could lead to a crash if the connection was established/rejected after the client object had already been destroyed. This issue was observed during the teardown phase of an upcoming high-availability test case. This commit fixes the race condition by ensuring the connection attempt is properly canceled on timeout. Additionally, the explicit TLS handshake previously forced during the connection is now deferred to the first I/O operation, which is the default and preferred behavior. Fixes: SCYLLADB-832 Backports to 2026.1 and 2025.4 are required, as this issue also exists on those branches and is causing CI flakiness. Closes scylladb/scylladb#29031 * github.com:scylladb/scylladb: vector_search: test: fix flaky test vector_search: fix race condition on connection timeout	2026-03-20 11:12:04 +01:00
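Python's `asyncio.wait_for` happens to behave the way this fix intends. On timeout, it cancels the awaited operation instead of leaving it running. The analogy below is only a sketch, since Seastar's timeout and cancellation primitives differ. `connect_with_timeout` and `slow_connect` are hypothetical names.

```python
import asyncio


async def connect_with_timeout(connect, timeout):
    """Await connect() for at most `timeout` seconds (hypothetical helper).

    On timeout, asyncio.wait_for cancels the pending attempt and waits for the
    cancellation to finish, so the attempt cannot complete later and touch a
    client object that no longer exists.
    """
    return await asyncio.wait_for(connect(), timeout)


async def demo():
    state = {"completed": False, "cancelled": False}

    async def slow_connect():
        try:
            await asyncio.sleep(10)  # a peer that never answers in time
            state["completed"] = True
        except asyncio.CancelledError:
            state["cancelled"] = True
            raise  # always propagate cancellation

    try:
        await connect_with_timeout(slow_connect, 0.01)
    except asyncio.TimeoutError:
        pass
    return state


if __name__ == "__main__":
    print(asyncio.run(demo()))
```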
Petr Gusev	4bfcd035ae	test_fencing: add missing await-s Fixes SCYLLADB-1099 Closes scylladb/scylladb#29133	2026-03-20 10:55:35 +01:00
Pavel Emelyanov	c4a0f6f2e6	object_store: Don't leave dangling objects by iterating moved-from names vector The code in upload_file std::move()-s the vector of names into the merge_objects() method, then iterates over this vector to delete objects. The iteration is apparently a no-op on the moved-from vector. The fix is to make the merge_objects() helper take the vector of names by const reference -- the method doesn't modify the names collection, and the caller keeps one in stable storage. Fixes #29060 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#29061	2026-03-20 10:09:30 +02:00
Pavel Emelyanov	712ba5a31f	utils: Use yielding directory_lister in owner verification Switch directories::do_verify_owner_and_mode() from lister::scan_dir() to utils::directory_lister while preserving the previous hidden-entry behavior. Make do_verify_subpath use lister::filter_type directly so the verification helper can pass it straight into directory_lister, and keep a single yielding iteration loop for directory traversal. This removes one more scan_dir user, towards removing scan_dir from the code. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Closes scylladb/scylladb#29064	2026-03-20 10:08:38 +02:00
Pavel Emelyanov	961fc9e041	s3: Don't rearm credential timers when credentials are not refreshed The update_credentials_and_rearm() may get "empty" credentials from _creds_provider_chain.get_aws_credentials() -- it doesn't throw, but returns a default-initialized value. In that case expires_at is set to time_point::min, and it's probably not a good idea to arm the refresh timer, and an even worse idea to subtract 1h from it. Fixes #29056 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#29057	2026-03-20 10:07:01 +02:00
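The guard can be sketched in Python, modeling "empty" credentials as an expiry of `datetime.min`. `next_refresh_time` and the one-hour `REFRESH_MARGIN` are illustrative assumptions based on the description above. In Python, subtracting an hour from `datetime.min` raises `OverflowError`. That makes the hazard of subtracting 1h from a minimal time point easy to see.

```python
from datetime import datetime, timedelta

REFRESH_MARGIN = timedelta(hours=1)  # assumed: refresh one hour before expiry


def next_refresh_time(expires_at):
    """Return when to refresh credentials, or None if the timer should not be armed.

    Empty (default-initialized) credentials are modeled as expires_at == datetime.min.
    """
    if expires_at == datetime.min:
        return None  # nothing was refreshed, so don't arm the timer
    return expires_at - REFRESH_MARGIN
```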
Pavel Emelyanov	0a8dc4532b	s3: Fix missing upload ID in copy_part trace log The format string had two {} placeholders but three arguments, so the _upload_id argument was left out of the formatted message. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#29053	2026-03-20 10:05:44 +02:00
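Python's `str.format` behaves the same way: surplus positional arguments are silently ignored. The sketch below uses `string.Formatter().parse` to count replacement fields, so a mismatch can be caught in a test. The log text and argument values are hypothetical, not the actual `copy_part` trace message.

```python
import string


def count_placeholders(fmt):
    """Count replacement fields such as {} or {0} in a str.format-style string."""
    return sum(1 for _, field, _, _ in string.Formatter().parse(fmt) if field is not None)


# Hypothetical trace line with two placeholders but three arguments.
fmt = "copy part {} of {}"
args = ("part-3", "bucket/key", "upload-42")
print(fmt.format(*args))  # prints "copy part part-3 of bucket/key"; the third argument is dropped
print(count_placeholders(fmt), len(args))  # 2 placeholders vs 3 arguments
```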
Botond Dénes	bb5c328a16	Merge 'Squash two primary-replica restoration tests together' from Pavel Emelyanov The test_restore_primary_replica_same_domain and test_restore_primary_replica_different_domain tests have a lot in common. Previously, each of them was also split into two, for four tests in total; now there are two, and they can be squashed as well, since the lines-of-code savings are still worth it. This is the continuation of #28569 Tests improvement, not backporting Closes scylladb/scylladb#28994 * github.com:scylladb/scylladb: test: Replace a bunch of ternary operators with an if-else block test: Squash test_restore_primary_replica_same\|different_domain tests test: Use the same regexp in test_restore_primary_replica_different\|same_domain-s	2026-03-20 10:05:16 +02:00
Pavel Emelyanov	ea2a214959	test/backup: Use unique_name() for backup prefix instead of cf_dir The do_test_backup_abort() fetched the node's workdir and resolved cf_dir solely to construct a unique-ish backup prefix: prefix = f'{cf_dir}/backup' The comment already acknowledged this was only "unique(ish)" — relying on the UUID-derived cf_dir name as a uniqueness source is roundabout. unique_name() is already imported and used for exactly this purpose elsewhere in the file. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#29030	2026-03-20 10:04:22 +02:00
Pavel Emelyanov	65032877d4	api: Move /storage_service/toppartitions from storage_service.cc to column_family.cc The endpoint URL remains intact. Having it next to another toppartitions endpoint (the /column_family/toppartitions one) is natural. This endpoint only needs sharded<replica::database>&, grabs it from http_context and doesn't use any other service. In column_family.cc the database reference is already available as a parameter. One more user of http_context.db is gone. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Closes scylladb/scylladb#28996	2026-03-20 09:52:33 +02:00
Botond Dénes	de0bdf1a65	Merge 'Decouple test_refresh_deletes_uploaded_sstables from backup test-suite' from Pavel Emelyanov The test in question uses several helpers from the backup suite, but it doesn't really need them -- the operations it wants to perform can be done with standard pylib methods. While at it, also remove some dangling, effectively unused local variables from this test (these were apparently left over from the backup tests this one was copied from and reworked). Enhancing tests, not backporting Closes scylladb/scylladb#29130 * github.com:scylladb/scylladb: test/refresh: Simplify refresh invocation test/refresh: Remove r_servers alias for servers test/refresh: Replace check_mutation_replicas with a plain CQL SELECT test/refresh: Inline keyspace/table/data setup in test_refresh_deletes_uploaded_sstables test/refresh: Prepare indentation for new_test_keyspace in test_refresh_deletes_uploaded_sstables test/refresh: Decouple test_refresh_deletes_uploaded_sstables from backup tests test/refresh: Remove unused wait_for_cql_and_get_hosts import	2026-03-20 09:29:15 +02:00
Botond Dénes	97430e2df5	Merge 'Fix object storage lister entries walking loop' from Pavel Emelyanov Two issues were found in the lister returned by gs_client_wrapper::make_object_lister(): the lister can report EOF too early when a filter is active, and there is a potential vector out-of-bounds access. Fixes #29058 The code appeared in 2026.1, so it is worth fixing there as well. Closes scylladb/scylladb#29059 * github.com:scylladb/scylladb: sstables: Fix object storage lister not resetting position in batch vector sstables: Fix object storage lister skipping entries when filter is active	2026-03-20 09:12:42 +02:00
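A filtered, paginated walk that avoids both bugs might look like the Python sketch below. `fetch_page` and `keep` are hypothetical callables, not the `gs_client_wrapper` API. Each batch is iterated from its own start. The end of the listing is decided by the absence of a continuation token, not by whether the filtered batch produced any entries.

```python
def list_objects(fetch_page, keep):
    """Yield the names accepted by `keep` from a paginated listing (hypothetical model).

    fetch_page(token) returns (names, next_token); next_token is None on the last page.
    """
    token = ""
    while True:
        names, token = fetch_page(token)
        # Walk each new batch from its first element; never reuse a position from
        # a previous (possibly longer) batch.
        for name in names:
            if keep(name):
                yield name
        # End of listing depends only on the continuation token, not on whether
        # the filter let anything through in this batch.
        if token is None:
            return
```

A loop that stopped at the first batch with no accepted entries would miss anything on later pages.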
Botond Dénes	5573c3b18e	Merge 'tablets: Fix deadlock in background storage group merge fiber' from Tomasz Grabiec When it deadlocks, groups stop merging and the compaction group merge backlog runs away. Also, graceful shutdown will be blocked on it. Found by the flaky unit test test_merge_chooses_best_replica_with_odd_count, which timed out in 1 of 100 runs. Reason for deadlock: When storage groups are merged, the main compaction group of the new storage group takes a compaction lock, which is appended to _compaction_reenablers_for_merging, and released when the merge completion fiber is done with the whole batch. If we accumulate more than 1 merge cycle for the fiber, deadlock occurs. The lock order is as follows. Initial state: cg0: main cg1: main cg2: main cg3: main After 1st merge: cg0': main [locked], merging_groups=[cg0.main, cg1.main] cg1': main [locked], merging_groups=[cg2.main, cg3.main] After 2nd merge: cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main] The merge completion fiber will try to stop cg0'.main, which blocks on the compaction lock, which is held by the reenabler in _compaction_reenablers_for_merging, hence the deadlock. The fix is to wait for the background merge to finish before we start the next merge. It's achieved by holding the old erm in the background merge, and doing a topology barrier from the merge finalizing transition. Background merge is supposed to be a relatively quick operation: it's stopping compaction groups, so it may wait for active requests. It shouldn't prolong the barrier indefinitely. Tablet tests which trigger merge need to be adjusted to call the barrier, otherwise they will be vulnerable to the deadlock. Fixes SCYLLADB-928 Backport to >= 2025.4 because it's the earliest version vulnerable, due to `f9021777d8`. 
Closes scylladb/scylladb#29007 * github.com:scylladb/scylladb: tablets: Fix deadlock in background storage group merge fiber replica: table: Propagate old erm to storage group merge test: boost: tablets_test: Save tablet metadata when ACKing split resize decision storage_service: Extract local_topology_barrier()	2026-03-20 09:05:52 +02:00
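The fix never starts a merge while the previous background merge is still running. That shape can be sketched with asyncio. `MergeCoordinator` is hypothetical. Its `barrier()` plays the role of the topology barrier issued from the merge finalizing transition. The real code achieves the wait by holding the old erm rather than a task handle.

```python
import asyncio


class MergeCoordinator:
    """Hypothetical: serialize merges so a new one never overlaps the previous background merge."""

    def __init__(self):
        self._background = None

    async def barrier(self):
        # Wait for the previous background merge (if any) to complete.
        if self._background is not None:
            await self._background

    async def start_merge(self, merge):
        await self.barrier()  # plays the role of the topology barrier before the next merge
        self._background = asyncio.create_task(merge())


async def demo():
    log = []
    coord = MergeCoordinator()

    async def merge(n):
        log.append(f"start {n}")
        await asyncio.sleep(0.01)  # e.g. stopping the merged compaction groups
        log.append(f"end {n}")

    await coord.start_merge(lambda: merge(1))
    await coord.start_merge(lambda: merge(2))
    await coord.barrier()
    return log


if __name__ == "__main__":
    print(asyncio.run(demo()))
```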
Botond Dénes	34473302b0	Merge 'docs: document existing guardrails' from Andrzej Jackowski This patch series introduces new documentation for existing guardrails. Moreover: - Warning / failure messages of recently added write CL guardrails (SCYLLADB-259) are rephrased, so all guardrails have similar messages. - Some new tests are added, to help verify the correctness of the documentation and avoid situations where the documentation and implementation diverge. Fixes: [SCYLLADB-257](https://scylladb.atlassian.net/browse/SCYLLADB-257) No backport, just new docs and tests. [SCYLLADB-257]: https://scylladb.atlassian.net/browse/SCYLLADB-257?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Closes scylladb/scylladb#29011 * github.com:scylladb/scylladb: test: add new guardrail tests matching documentation scenarios test: add metric assertions to guardrail replication strategy tests test: use regex matching in guardrail replication strategy tests test: extract ks_opts helper in test_guardrail_replication_strategy docs: document CQL guardrails cql: improve write consistency level guardrail messages	2026-03-20 08:56:00 +02:00
artem.penner	9898e5700b	scylla-node-exporter: Add systemd collector to node exporter This PR enables the node_exporter systemd collector and configures the unit whitelist to include scylla-server.service and systemd-coredump services. Motivation: We currently lack visibility into system-level service states, which is critical for diagnosing stability issues. This configuration enables two specific use cases: - Detecting Coredump Loops: We encounter scenarios where ScyllaDB enters a restart loop. To pinpoint SIGSEGV (coredumps) as the root cause, we need to track when the systemd-coredump service becomes active, indicating a dump is being processed. - Identifying Startup Failures: We need to detect when the scylla-server unit enters a failed state. This is essential for catching unrecoverable errors (e.g., corrupted commitlogs or configuration bugs) that prevent the server from starting. example of promql queries: - `node_systemd_unit_state{name=~"systemd-coredump@.*", state="active"} == 1` - `node_systemd_unit_state{name="scylla-server.service", state="failed"} == 1` Closes #28402	2026-03-20 08:39:56 +02:00
Andrzej Jackowski	10c4b9b5b0	test: verify signal() detects resource negative leak in rcs reader_concurrency_semaphore::signal() guards against available resources exceeding the initial limit after a signal, which would indicate a bug such as double-returning resources. It reports the issue via on_internal_error_noexcept and clamps resources back to the initial values. However, before this commit there were no tests that verified this behavior, so bugs like SCYLLADB-1014 went undetected. Add a test that artificially signals resources that were never consumed and verifies that signal() detects the negative leak and clamps available resources back to the initial limit. Refs: SCYLLADB-1014 Fixes: SCYLLADB-1031 Closes scylladb/scylladb#28993	2026-03-20 09:21:20 +03:00
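The check under test can be modeled in a few lines. `SemaphoreModel` is a hypothetical simplification that tracks only a count and a memory budget. `internal_errors` stands in for `on_internal_error_noexcept`. Signaling resources that were never consumed pushes availability above the initial limit. The model reports this and clamps it.

```python
class SemaphoreModel:
    """Hypothetical model of reader_concurrency_semaphore resource accounting (count, memory)."""

    def __init__(self, count, memory):
        self.initial = (count, memory)
        self.count, self.memory = count, memory
        self.internal_errors = 0  # stands in for on_internal_error_noexcept reports

    def consume(self, count, memory):
        self.count -= count
        self.memory -= memory

    def signal(self, count, memory):
        self.count += count
        self.memory += memory
        init_count, init_memory = self.initial
        if self.count > init_count or self.memory > init_memory:
            # More available than the initial limit: resources were returned twice
            # (a "negative leak"). Report it and clamp back to the initial values.
            self.internal_errors += 1
            self.count = min(self.count, init_count)
            self.memory = min(self.memory, init_memory)
```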
Botond Dénes	f9adbc7548	test/cqlpy/test_tombstone_limit.py: disable tombstone-gc for test table Since `7564a56dc8`, all tables default to repair-mode tombstone-gc, which is identical to immediate-mode for RF=1 tables. Consequently, the tombstones written by the tests in this test file are immediately collectible, and with some unlucky timing, some of them can be collected before the end of the test, failing the empty-page prefix check because the empty pages prefix will be smaller than expected based on the number of tombstones written. Disable tombstone-gc to remove this source of flakiness. Fixes: SCYLLADB-1062 Closes scylladb/scylladb#29077	2026-03-20 09:14:29 +03:00
Michał Chojnowski	6b18d95dec	test: add a missing reconnect_driver in test_sstable_compression_dictionaries_upgrade.py Need to work around https://github.com/scylladb/python-driver/issues/295, lest a CQL query fail spuriously after the cluster restart. Fixes: SCYLLADB-1114 Closes scylladb/scylladb#29118	2026-03-20 09:05:14 +03:00
Botond Dénes	89388510a0	test/cluster/test_data_resurrection_in_memtable.py: use explicit CL The test has expectations w.r.t. which write makes it to which nodes: * inserts make it to all nodes * delete makes it to all-1 (QUORUM) nodes However, this was not expressed with CL, and the default CL=ONE allowed some nodes to miss the writes, thus violating the test's expectations about which data is present on which nodes. This resulted in the test being flaky and failing on the data checks. Use explicit CL for the ingestion to prevent this. The improvements to the test introduced in `a8dd13731f` were of great help in investigating this: traces are now available and the check happens after the data was dumped to logs. Fixes: SCYLLADB-870 Fixes: SCYLLADB-812 Fixes: SCYLLADB-1102 Closes scylladb/scylladb#29128	2026-03-20 09:02:57 +03:00
Avi Kivity	6b259babeb	Merge 'logstor: initial log-structured storage for key-value tables' from Michael Litvak Introduce an initial and experimental implementation of an alternative log-structured storage engine for key-value tables. Main flows and components: * The storage is composed of 32MB files, each file divided into segments of size 128k. We sequentially write records that contain a mutation and additional metadata to them. Records are written to a buffer first and then written to the active segment sequentially in 4k sized blocks. * The primary index in memory maps keys to their location on disk. It is a B-tree per-table that is ordered by tokens, similar to a memtable. * On reads we calculate the key and look it up in the primary index, then read the mutation from disk with a single disk IO. * On writes we write the record to a buffer, wait for it to be written to disk, then update the index with the new location, and free the previous record. * We track the used space in each segment. When overwriting a record, we increase the free space counter for the segment of the previous record that becomes dead. We store the segments in a histogram by usage. * The compaction process takes segments with low utilization, reads them and writes the live records to new segments, and frees the old segments. * Segments are initially "mixed" - we write records from all tables and all tablets to the active segment. The "separator" process rewrites records from mixed segments into new segments that are organized by compaction groups (tablets), and frees the mixed segments. Each write is written to the active segment and to a separator buffer of the compaction group, which is eventually flushed to a new segment in the compaction group. Currently this mode is experimental and requires an experimental flag to be enabled. Features not supported yet include strong consistency, tablet migration, tablet split/merge, big mutations, tombstone gc, and ttl. 
to use, add to config: ``` enable_logstor: true experimental_features: - logstor ``` create a table: ``` CREATE TABLE ks.t(pk int PRIMARY KEY, a int, v text) WITH storage_engine = 'logstor'; ``` INSERT, SELECT, DELETE work as expected UPDATE not supported yet no backport - new feature Closes scylladb/scylladb#28706 * github.com:scylladb/scylladb: logstor: trigger separator flush for buffers that hold old segments docs/dev: add logstor documentation logstor: recover segments into compaction groups logstor: range read logstor: change index to btree by token per table logstor: move segments to replica::compaction_group db: update dirty mem limits dynamically logstor: track memory usage logstor: logstor stats api logstor: compaction buffer pool logstor: separator: flush buffer when full logstor: hold segment until index updates logstor: truncate table logstor: enable/disable compaction per table logstor: separator buffer pool test: logstor: add separator and compaction tests logstor: segment and separator barrier logstor: separator debt controller logstor: compaction controller logstor: recovery: recover mixed segments using separator logstor: wait for pending reads in compaction logstor: separator logstor: compaction groups logstor: cache files for read logstor: recovery: initial logstor: add segment generation logstor: reserve segments for compaction logstor: index: buckets logstor: add buffer header logstor: add group_id logstor: record generation logstor: generation utility logstor: use RIPEMD-160 for index key test: add test_logstor.py api: add logstor compaction trigger endpoint replica: add logstor to db schema: add logstor cf property logstor: initial commit db: disable tablet balancing with logstor db: add logstor experimental feature flag	2026-03-20 00:18:09 +02:00
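The index and free-space accounting described above can be modeled in a toy Python class. `LogStore` is hypothetical. Records are just sizes, and there is no I/O, separator, or per-table B-tree. Compaction selects victims with a linear scan against a live-ratio threshold instead of the usage histogram. Only the 128k segment size comes from the description.

```python
from collections import defaultdict

SEGMENT_SIZE = 128 * 1024  # 128k segments, per the description above


class LogStore:
    """Toy model of logstor's index and per-segment dead-space accounting (no real I/O)."""

    def __init__(self):
        self.index = {}                # key -> (segment_id, record_size)
        self.used = defaultdict(int)   # segment_id -> bytes written
        self.dead = defaultdict(int)   # segment_id -> bytes held by overwritten records
        self.active = 0

    def write(self, key, size):
        if self.used[self.active] + size > SEGMENT_SIZE:
            self.active += 1           # active segment is full: move to the next one
        self.used[self.active] += size
        old = self.index.get(key)
        self.index[key] = (self.active, size)   # point the index at the new location
        if old is not None:
            self.dead[old[0]] += old[1]         # the previous record is now dead

    def live_ratio(self, seg):
        return (self.used[seg] - self.dead[seg]) / SEGMENT_SIZE

    def compact(self, threshold):
        """Rewrite live records out of low-utilization segments, then free those segments."""
        victims = [s for s in list(self.used)
                   if s != self.active and self.live_ratio(s) < threshold]
        for key, (seg, size) in list(self.index.items()):
            if seg in victims:
                self.write(key, size)           # relocate the live record
        for seg in victims:
            del self.used[seg]
            self.dead.pop(seg, None)
        return victims
```

Overwrites only bump a dead-byte counter on the old segment, so picking compaction victims needs no scan of the data itself.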
Avi Kivity	062751fcec	Merge 'db/config: enable ms sstable format by default' from Łukasz Paszkowski Trie-based sstable indexes are supposed to be (hopefully) a better default than the old BIG indexes. Make the new format the default for new clusters by naming ms in the default scylla.yaml. New functionality. No backport needed. This PR is basically Michał's https://github.com/scylladb/scylladb/pull/26377, plus Jakub's https://github.com/scylladb/scylladb/pull/27332 fixing `sstables_manager::get_highest_supported_format()`, and one test fix. Closes scylladb/scylladb#28960 * github.com:scylladb/scylladb: db/config: announce ms format as highest supported db/config: enable `ms` sstable format by default cluster/dtest/bypass_cache_test: switch from highest_supported_sstable_format to chosen_sstable_format api/system: add /system/chosen_sstable_version test/cluster/dtest: reduce num_tokens to 16	2026-03-19 18:19:01 +02:00
Pavel Emelyanov	969dddb630	test/refresh: Simplify refresh invocation take_snapshot return values were unused so drop them. do_refresh was a thin wrapper around load_new_sstables that added no logic; inline it directly into the gather expression. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-19 18:42:57 +03:00
Pavel Emelyanov	de21572b31	test/refresh: Remove r_servers alias for servers r_servers = servers was a no-op assignment; use servers directly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-19 18:42:52 +03:00
Pavel Emelyanov	20b1531e6d	test/refresh: Replace check_mutation_replicas with a plain CQL SELECT The goal of test_refresh_deletes_uploaded_sstables is to verify that sstables are removed from the upload directory after refresh. The replica check was just a sanity guard; a simple SELECT of all keys is sufficient and much lighter. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-19 18:42:48 +03:00
Pavel Emelyanov	c591b9ebe2	test/refresh: Inline keyspace/table/data setup in test_refresh_deletes_uploaded_sstables Replace create_dataset() with explicit keyspace creation via new_test_keyspace, inline CREATE TABLE, and direct cql.run_async inserts — matching the pattern used in do_test_streaming_scopes. This removes the last dependency on backup helpers for dataset setup and makes the test self-contained. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-19 18:42:44 +03:00
Pavel Emelyanov	06006a6328	test/refresh: Prepare indentation for new_test_keyspace in test_refresh_deletes_uploaded_sstables Wrap the test body under if True: to pre-indent it, making the subsequent patch that introduces new_test_keyspace a pure content change with no whitespace noise. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-19 18:42:40 +03:00
Pavel Emelyanov	67d8cde42d	test/refresh: Decouple test_refresh_deletes_uploaded_sstables from backup tests Replace create_cluster() from object_store/test_backup.py with a plain manager.servers_add(2) call. The test does not use object storage, so there is no need to pull in the backup helper along with its config and logging knobs. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-19 18:42:36 +03:00
Pavel Emelyanov	04f046d2d8	test/refresh: Remove unused wait_for_cql_and_get_hosts import Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-19 18:42:32 +03:00
Botond Dénes	e8b37d1a89	Merge 'doc: fix the installation section' from Anna Stuchlik This PR fixes the Installation page: - Replaces `http` with `https` in the download command. - Replaces the Open Source example from the Installation section for CentOS (we overlooked this example before). Fixes https://github.com/scylladb/scylladb/issues/29087 This update affects all supported versions and should be backported as a bug fix. Closes scylladb/scylladb#29088 * github.com:scylladb/scylladb: doc: remove the Open Source Example from Installation doc: replace http with https in the installation instructions	2026-03-19 17:13:53 +02:00
Dario Mirovic	5d51501a0b	pgo: use maintenance socket for CQL setup in PGO training The default 'cassandra' superuser was removed from ScyllaDB, which broke PGO training. exec_cql.py relied on username/password auth ('cassandra'/'cassandra') to execute setup CQL scripts like auth.cql and counters.cql. Switch exec_cql.py to connect via the Unix domain maintenance socket instead. The maintenance socket bypasses authentication, so no credentials are needed. Additionally, create the 'cassandra' superuser via the maintenance socket during the populate phase so that cassandra-stress keeps working, since cassandra-stress hardcodes user=cassandra password=cassandra. Changes: - exec_cql.py: replace host/port/username/password arguments with a single --socket argument; add connect_maintenance_socket() with wait-until-ready logic - pgo.py: add maintenance_socket_path() helper; update populate_auth_conns() and populate_counters() to pass the socket path to exec_cql.py Fixes SCYLLADB-1070 Closes scylladb/scylladb#29081	2026-03-19 16:52:36 +02:00
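A minimal sketch of the kind of wait-until-ready loop mentioned above, assuming that "ready" simply means something accepts connections on the Unix socket path. `wait_for_unix_socket` is a hypothetical name; this is not the code in exec_cql.py, which would open the actual CQL session over the socket once it is up.

```python
import socket
import time


def wait_for_unix_socket(path: str, timeout: float = 30.0, interval: float = 0.1) -> None:
    """Poll until a listener accepts connections on the Unix socket at `path`.

    Hypothetical helper: raises TimeoutError if nothing accepts within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    last_error = None
    while time.monotonic() < deadline:
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            s.connect(path)
            return  # a listener is up; the CQL session can be opened now
        except OSError as e:  # e.g. FileNotFoundError, ConnectionRefusedError
            last_error = e
            time.sleep(interval)
        finally:
            s.close()  # runs on both the success and the retry path
    raise TimeoutError(f"socket {path!r} not ready after {timeout}s: {last_error}")
```

Polling the socket itself, rather than sleeping for a fixed time, lets the script proceed as soon as the server starts listening.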
Andrzej Jackowski	4deeb7ebfc	test: add new guardrail tests matching documentation scenarios Add tests for RF guardrails (min/max warn/fail, RF=0 bypass, threshold=-1 disable, ALTER KEYSPACE) and write consistency level guardrails to cover all scenarios described in guardrails.rst. Test runtime (dev): test_guardrail_replication_strategy - 6s test_guardrail_write_consistency_level - 5s Refs: SCYLLADB-257	2026-03-19 15:07:03 +01:00
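A sketch of the RF guardrail scenarios listed above for a single datacenter: fail thresholds take precedence over warn thresholds, a threshold of -1 disables that check, and RF=0 bypasses the checks. The strict `<`/`>` comparisons and the parameter names are assumptions for illustration, not ScyllaDB's implementation.

```python
from typing import Optional

DISABLED = -1  # a threshold of -1 switches that check off


def check_rf(rf: int, min_warn: int = DISABLED, min_fail: int = DISABLED,
             max_warn: int = DISABLED, max_fail: int = DISABLED) -> Optional[str]:
    """Classify one datacenter's RF as None (ok), "warn", or "fail" (assumed semantics)."""
    if rf == 0:
        return None  # RF=0 means no replicas in this DC; guardrails are bypassed
    if (min_fail != DISABLED and rf < min_fail) or (max_fail != DISABLED and rf > max_fail):
        return "fail"
    if (min_warn != DISABLED and rf < min_warn) or (max_warn != DISABLED and rf > max_warn):
        return "warn"
    return None
```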
Andrzej Jackowski	2a03c634c0	test: add metric assertions to guardrail replication strategy tests Verify that guardrail violations increment the corresponding metrics. Refs: SCYLLADB-257	2026-03-19 15:07:03 +01:00
Andrzej Jackowski	81c4e717e2	test: use regex matching in guardrail replication strategy tests Replace loose substring assertions with regex-based matching against the exact server message formats. Add regex constants for all guardrail messages and rewrite create_ks_and_assert_warnings_and_errors() to verify count and content of warnings and failures. Refs: SCYLLADB-257	2026-03-19 15:07:03 +01:00
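A sketch of why regex-based verification is stricter than loose substring checks. The message text and `MIN_RF_WARN_RE` are hypothetical; the real tests match the server's exact formats. `fullmatch` rejects altered or extra text, and the length check catches missing or duplicate warnings.

```python
import re

# Hypothetical message format, for illustration only.
MIN_RF_WARN_RE = re.compile(
    r"replication factor (\d+) is below the minimum warn threshold (\d+)")


def assert_warnings(warnings: list[str], pattern: re.Pattern,
                    expected: list[tuple[int, int]]) -> None:
    """Check both the number of warnings and each one's full text and captured values."""
    assert len(warnings) == len(expected), f"expected {len(expected)} warnings, got {warnings!r}"
    for text, (rf, threshold) in zip(warnings, expected):
        m = pattern.fullmatch(text)
        assert m is not None, f"unexpected warning text: {text!r}"
        assert (int(m.group(1)), int(m.group(2))) == (rf, threshold), text
```

A check such as `"threshold" in w` would pass for every malformed case that this helper rejects.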
Anna Stuchlik	6b1df5202c	doc: remove the instructions to install old versions from Web Installer The Web Installer page includes instructions to install the old pre-2025.1 Enterprise versions, which are no longer supported (since we released 2026.1). This commit removes those redundant and misleading instructions. Fixes https://github.com/scylladb/scylladb/issues/29099 Closes scylladb/scylladb#29103	2026-03-19 15:47:00 +02:00
Piotr Dulikowski	171504c84f	Merge 'auth: migrate some standard role manager APIs to use cache' from Marcin Maliszkiewicz This patchset migrates the query_all_directly_granted, query_all, get_attribute, and query_attribute_for_all functions to use the cache instead of doing CQL queries. It also includes some preparatory work which fixes cache update order and triggering. The main motivation is to make sure that all calls from service_level_controller::auth_integration are cached, which this series achieves. An alternative implementation could move the whole auth_integration data into the auth cache, but since auth_integration also manages lifetime and contains service-level-specific logic, such a solution would be too complex for little (if any) gain. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-159 Backport: no, not a bug Closes scylladb/scylladb#28791 * github.com:scylladb/scylladb: auth: switch query_attribute_for_all to use cache auth: switch get_attribute to use cache auth: cache: add heterogeneous map lookups auth: switch query_all to use cache auth: switch query_all_directly_granted to use cache auth: cache: add ability to go over all roles raft: service: reload auth cache before service levels service: raft: move update_service_levels_effective_cache check	2026-03-19 14:37:22 +01:00
Avi Kivity	5e7fb08bf3	Merge 'Fix bad performance for densely populated partition index pages' from Tomasz Grabiec This applies to small-partition workloads where index pages have a high partition count and the index doesn't fit in cache. The count was observed to be on the order of hundreds. In such a workload, pages undergo constant population, LSA compaction, and LSA eviction, which has a severe impact on CPU utilization. Refs https://scylladb.atlassian.net/browse/SCYLLADB-620 This PR reduces the impact by several changes: - reducing memory footprint in the partition index. Assuming partition key size is 16 bytes, the cost dropped from 96 bytes to 36 bytes per partition. - flattening the object graph and amortizing storage. Storing entries directly in the vector. Storing all key values in a single managed_bytes. Making index_entry a trivial struct. - index entries and key storage are now trivially movable and batched inside vector storage, so LSA migration can use memcpy(), which amortizes the cost per key. This reduces the cost of LSA segment compaction. - LSA eviction is now pretty much constant time for the whole page regardless of the number of entries, because elements are trivial and batched inside vectors. Page eviction cost dropped from 50 us to 1 us.
Performance evaluated with: scylla perf-simple-query -c1 -m200M --partitions=1000000 Before: ``` 7774.96 tps (166.0 allocs/op, 521.7 logallocs/op, 54.0 tasks/op, 802428 insns/op, 430457 cycles/op, 0 errors) 7511.08 tps (166.1 allocs/op, 527.2 logallocs/op, 54.0 tasks/op, 804185 insns/op, 430752 cycles/op, 0 errors) 7740.44 tps (166.3 allocs/op, 526.2 logallocs/op, 54.2 tasks/op, 805347 insns/op, 432117 cycles/op, 0 errors) 7818.72 tps (165.2 allocs/op, 517.6 logallocs/op, 53.7 tasks/op, 794965 insns/op, 427751 cycles/op, 0 errors) 7865.49 tps (165.1 allocs/op, 513.3 logallocs/op, 53.6 tasks/op, 788898 insns/op, 425171 cycles/op, 0 errors) ``` After (+318%): ``` 32492.40 tps (130.7 allocs/op, 12.8 logallocs/op, 36.1 tasks/op, 109236 insns/op, 103203 cycles/op, 0 errors) 32591.99 tps (130.4 allocs/op, 12.8 logallocs/op, 36.0 tasks/op, 108947 insns/op, 102889 cycles/op, 0 errors) 32514.52 tps (130.6 allocs/op, 12.8 logallocs/op, 36.0 tasks/op, 109118 insns/op, 103219 cycles/op, 0 errors) 32491.14 tps (130.6 allocs/op, 12.8 logallocs/op, 36.0 tasks/op, 109349 insns/op, 103272 cycles/op, 0 errors) 32582.90 tps (130.5 allocs/op, 12.8 logallocs/op, 36.0 tasks/op, 109269 insns/op, 102872 cycles/op, 0 errors) 32479.43 tps (130.6 allocs/op, 12.8 logallocs/op, 36.0 tasks/op, 109313 insns/op, 103242 cycles/op, 0 errors) 32418.48 tps (130.7 allocs/op, 12.8 logallocs/op, 36.1 tasks/op, 109201 insns/op, 103301 cycles/op, 0 errors) 31394.14 tps (130.7 allocs/op, 12.8 logallocs/op, 36.1 tasks/op, 109267 insns/op, 103301 cycles/op, 0 errors) 32298.55 tps (130.7 allocs/op, 12.8 logallocs/op, 36.1 tasks/op, 109323 insns/op, 103551 cycles/op, 0 errors) ``` When the workload is miss-only, with both row cache and index cache disabled (no cache maintenance cost): perf-simple-query -c1 -m200M --duration 6000 --partitions=100000 --enable-index-cache=0 --enable-cache=0 Before: ``` 9124.57 tps (146.2 allocs/op, 789.0 logallocs/op, 45.3 tasks/op, 889320 insns/op, 357937 cycles/op, 0 errors) 
9437.23 tps (146.1 allocs/op, 789.3 logallocs/op, 45.3 tasks/op, 889613 insns/op, 357782 cycles/op, 0 errors) 9455.65 tps (146.0 allocs/op, 787.4 logallocs/op, 45.2 tasks/op, 887606 insns/op, 357167 cycles/op, 0 errors) 9451.22 tps (146.0 allocs/op, 787.4 logallocs/op, 45.3 tasks/op, 887627 insns/op, 357357 cycles/op, 0 errors) 9429.50 tps (146.0 allocs/op, 787.4 logallocs/op, 45.3 tasks/op, 887761 insns/op, 358148 cycles/op, 0 errors) 9430.29 tps (146.1 allocs/op, 788.2 logallocs/op, 45.3 tasks/op, 888501 insns/op, 357679 cycles/op, 0 errors) 9454.08 tps (146.0 allocs/op, 787.3 logallocs/op, 45.3 tasks/op, 887545 insns/op, 357132 cycles/op, 0 errors) ``` After (+55%): ``` 14484.84 tps (150.7 allocs/op, 6.5 logallocs/op, 44.7 tasks/op, 396164 insns/op, 229490 cycles/op, 0 errors) 14526.21 tps (150.8 allocs/op, 6.5 logallocs/op, 44.8 tasks/op, 396401 insns/op, 228824 cycles/op, 0 errors) 14567.53 tps (150.7 allocs/op, 6.5 logallocs/op, 44.7 tasks/op, 396319 insns/op, 228701 cycles/op, 0 errors) 14545.63 tps (150.6 allocs/op, 6.5 logallocs/op, 44.7 tasks/op, 395889 insns/op, 228493 cycles/op, 0 errors) 14626.06 tps (150.5 allocs/op, 6.5 logallocs/op, 44.7 tasks/op, 395254 insns/op, 227891 cycles/op, 0 errors) 14593.74 tps (150.5 allocs/op, 6.5 logallocs/op, 44.7 tasks/op, 395480 insns/op, 227993 cycles/op, 0 errors) 14538.10 tps (150.8 allocs/op, 6.5 logallocs/op, 44.8 tasks/op, 397035 insns/op, 228831 cycles/op, 0 errors) 14527.18 tps (150.8 allocs/op, 6.5 logallocs/op, 44.8 tasks/op, 396992 insns/op, 228839 cycles/op, 0 errors) ``` Same as above, but with summary ratio increased from 0.0005 to 0.005 (smaller pages): Before: ``` 33906.70 tps (146.1 allocs/op, 83.6 logallocs/op, 45.1 tasks/op, 170553 insns/op, 98104 cycles/op, 0 errors) 32696.16 tps (146.0 allocs/op, 83.5 logallocs/op, 45.1 tasks/op, 170369 insns/op, 98405 cycles/op, 0 errors) 33889.05 tps (146.1 allocs/op, 83.6 logallocs/op, 45.1 tasks/op, 170551 insns/op, 98135 cycles/op, 0 errors)
33893.24 tps (146.1 allocs/op, 83.5 logallocs/op, 45.1 tasks/op, 170488 insns/op, 98168 cycles/op, 0 errors) 33836.73 tps (146.1 allocs/op, 83.6 logallocs/op, 45.1 tasks/op, 170528 insns/op, 98226 cycles/op, 0 errors) 33897.61 tps (146.0 allocs/op, 83.5 logallocs/op, 45.1 tasks/op, 170428 insns/op, 98081 cycles/op, 0 errors) 33834.73 tps (146.1 allocs/op, 83.5 logallocs/op, 45.1 tasks/op, 170438 insns/op, 98178 cycles/op, 0 errors) 33776.31 tps (146.3 allocs/op, 83.9 logallocs/op, 45.2 tasks/op, 170958 insns/op, 98418 cycles/op, 0 errors) 33808.08 tps (146.3 allocs/op, 83.9 logallocs/op, 45.2 tasks/op, 170940 insns/op, 98388 cycles/op, 0 errors) ``` After (+18%): ``` 40081.51 tps (148.2 allocs/op, 4.4 logallocs/op, 45.0 tasks/op, 121047 insns/op, 82231 cycles/op, 0 errors) 40005.85 tps (148.6 allocs/op, 4.4 logallocs/op, 45.2 tasks/op, 121327 insns/op, 82545 cycles/op, 0 errors) 39816.75 tps (148.3 allocs/op, 4.4 logallocs/op, 45.1 tasks/op, 121067 insns/op, 82419 cycles/op, 0 errors) 39953.11 tps (148.1 allocs/op, 4.4 logallocs/op, 45.0 tasks/op, 121027 insns/op, 82258 cycles/op, 0 errors) 40073.96 tps (148.2 allocs/op, 4.4 logallocs/op, 45.0 tasks/op, 121006 insns/op, 82313 cycles/op, 0 errors) 39882.25 tps (148.2 allocs/op, 4.4 logallocs/op, 45.0 tasks/op, 120925 insns/op, 82320 cycles/op, 0 errors) 39916.08 tps (148.3 allocs/op, 4.4 logallocs/op, 45.1 tasks/op, 121054 insns/op, 82393 cycles/op, 0 errors) 39786.30 tps (148.2 allocs/op, 4.4 logallocs/op, 45.0 tasks/op, 121027 insns/op, 82465 cycles/op, 0 errors) 38662.45 tps (148.3 allocs/op, 4.4 logallocs/op, 45.0 tasks/op, 121108 insns/op, 82312 cycles/op, 0 errors) 39849.42 tps (148.3 allocs/op, 4.4 logallocs/op, 45.1 tasks/op, 121098 insns/op, 82447 cycles/op, 0 errors) ```
Closes scylladb/scylladb#28603 * github.com:scylladb/scylladb: sstables: mx: index_reader: Optimize parsing for no promoted index case vint: Use std::countl_zero() test: sstable_partition_index_cache_test: Validate scenario of pages with sparse promoted index placement sstables: mx: index_reader: Amortize partition key storage managed_bytes: Hoist write_fragmented() to common header utils: managed_vector: Use std::uninitialized_move() to move objects sstables: mx: index_reader: Keep promoted_index info next to index_entry sstables: mx: index_reader: Extract partition_index_page::clear_gently() sstables: mx: index_reader: Shave-off 16 bytes from index_entry by using raw_token sstables: mx: index_reader: Reduce allocation_section overhead during index page parsing by batching allocation sstables: mx: index_reader: Keep index_entry directly in the vector dht: Introduce raw_token test: perf_simple_query: Add 'sstable-format' command-line option test: perf_simple_query: Add 'sstable-summary-ratio' command-line option test: perf-simple-query: Add option to disable index cache test: cql_test_env: Respect enable-index-cache config	2026-03-19 14:42:50 +02:00
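A toy Python model of the layout idea in the PR above. Python cannot show the memcpy() or LSA effects; it only illustrates storing fixed-size entries by value and packing every key into one shared buffer addressed by offset. `IndexEntry`, `PartitionIndexPage`, and `data_position` are hypothetical names, not the types in the sstables mx index reader.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    # Only fixed-size fields: the key is referenced by offset, not by pointer.
    token: int
    key_offset: int  # start of this entry's key in the page's shared key buffer
    key_size: int
    data_position: int  # hypothetical: partition offset in the data file


class PartitionIndexPage:
    """Entries stored by value in one list; all keys packed into one buffer."""

    def __init__(self, rows):
        # rows: iterable of (token, key_bytes, data_position), sorted by token
        self._keys = bytearray()
        self._entries: list[IndexEntry] = []
        for token, key, pos in rows:
            self._entries.append(IndexEntry(token, len(self._keys), len(key), pos))
            self._keys += key

    def __len__(self) -> int:
        return len(self._entries)

    def key(self, i: int) -> bytes:
        e = self._entries[i]
        return bytes(self._keys[e.key_offset:e.key_offset + e.key_size])

    def lower_bound(self, token: int) -> int:
        """Index of the first entry whose token is >= `token` (binary search)."""
        lo, hi = 0, len(self._entries)
        while lo < hi:
            mid = (lo + hi) // 2
            if self._entries[mid].token < token:
                lo = mid + 1
            else:
                hi = mid
        return lo
```

In C++ the benefit comes from the entries and the key buffer being trivially movable, so relocating or freeing a page touches a couple of contiguous allocations rather than one object per key. The Python model makes no performance claim.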
Botond Dénes	4981e72607	Merge 'replica: avoid unnecessary computation on token lookup hot path' from Łukasz Paszkowski `storage_group_of()` sits on the replica-side token lookup hot path, yet it called `tablet_map::get_tablet_id_and_range_side()`, which always computes both the tablet id and the post-split range side — even though most callers only need the storage group id. The range-side computation is only relevant when a storage group is in tablet splitting mode, but we were paying for it unconditionally on every lookup. This series fixes that by: 1. Adding `tablet_map::get_tablet_range_side()` so the range side can be computed independently when needed. 2. Adding lazy `select_compaction_group()` overloads that defer the range-side computation until splitting mode is actually active. 3. Switching `storage_group_of()` to use the cheaper `get_tablet_id()` path, only computing the range side on demand. Improvements. No backport is required. Closes scylladb/scylladb#28963 * github.com:scylladb/scylladb: replica/table: avoid computing token range side in storage_group_of() on hot path replica/compaction_group: add lazy select_compaction_group() overloads locator/tablets: add tablet_map::get_tablet_range_side()	2026-03-19 14:27:12 +02:00
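A sketch of the lazy pattern described above, under a hypothetical model in which tokens are unsigned 64-bit integers and the tablet count is a power of two. In this model a tablet id is the token's top bits and the post-split range side is the next bit. The range side is passed as a callable and is only evaluated when the storage group is splitting. None of these names or formulas are ScyllaDB's actual `tablet_map` API.

```python
from typing import Callable

TOKEN_BITS = 64  # hypothetical: tokens modeled as unsigned 64-bit integers


def tablet_id(token: int, log2_tablets: int) -> int:
    """Cheap path: the tablet (storage group) a token belongs to."""
    return token >> (TOKEN_BITS - log2_tablets)


def range_side(token: int, log2_tablets: int) -> int:
    """Which half (0 = left, 1 = right) of its tablet the token lands in after a split."""
    return (token >> (TOKEN_BITS - log2_tablets - 1)) & 1


def select_compaction_group(main: str, halves: tuple[str, str], splitting: bool,
                            side: Callable[[], int]) -> str:
    """Pick a compaction group; `side` is only evaluated while splitting."""
    if not splitting:
        return main
    return halves[side()]


def compaction_group_for(token: int, log2_tablets: int,
                         groups: dict[int, tuple], splitting: set[int]) -> str:
    tid = tablet_id(token, log2_tablets)  # always computed
    main, halves = groups[tid]
    return select_compaction_group(main, halves, tid in splitting,
                                   lambda: range_side(token, log2_tablets))
```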
Ernest Zaslavsky	aa9da87e97	encryption: fix deadlock in encrypted_data_source::get() When encrypted_data_source::get() caches a trailing block in _next, the next call takes it directly — bypassing input_stream::read(), which checks _eof. It then calls input_stream::read_exactly() on the already-drained stream. Unlike read(), read_up_to(), and consume(), read_exactly() does not check _eof when the buffer is empty, so it calls _fd.get() on a source that already returned EOS. In production this manifested as stuck encrypted SSTable component downloads during tablet restore: the underlying chunked_download_source hung forever on the post-EOS get(), causing 4 tablets to never complete. The stuck files were always block-aligned sizes (8k, 12k) where _next gets populated and the source is fully consumed in the same call. Fix by checking _input.eof() before calling read_exactly(). When the stream already reached EOF, buf2 is known to be empty, so the call is skipped entirely. A comprehensive test is added that uses a strict_memory_source which fails on post-EOS get(), reproducing the exact code path that caused the production deadlock.	2026-03-19 13:54:54 +02:00
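A toy Python model of the failure mode and the fix, assuming a source that errors out on any `get()` after it has returned end-of-stream, like the strict_memory_source test helper described above. `StrictSource`, `InputStream`, `BLOCK`, and `next_block` are hypothetical stand-ins, not Seastar's `input_stream`/`data_source` API. The model's `read_exactly()` reproduces the behavior the commit describes: with an empty buffer it goes straight to the source without consulting `_eof`.

```python
class StrictSource:
    """Returns the given chunks, then b"" once (EOS); any later get() raises."""

    def __init__(self, chunks):
        self._chunks = list(chunks)
        self._eos_returned = False

    def get(self) -> bytes:
        if self._eos_returned:
            raise RuntimeError("get() called after end of stream")
        if self._chunks:
            return self._chunks.pop(0)
        self._eos_returned = True
        return b""


class InputStream:
    def __init__(self, source: StrictSource):
        self._src = source
        self._buf = b""
        self._eof = False

    def eof(self) -> bool:
        return self._eof

    def read_exactly(self, n: int) -> bytes:
        # Deliberately no _eof check: with an empty buffer this always asks the
        # source again, which is the behavior the commit message describes.
        while len(self._buf) < n:
            chunk = self._src.get()
            if not chunk:
                self._eof = True
                break
            self._buf += chunk
        out, self._buf = self._buf[:n], self._buf[n:]
        return out


BLOCK = 4  # hypothetical cipher block size


def next_block(stream: InputStream, check_eof: bool = True) -> bytes:
    if check_eof and stream.eof():
        return b""  # the fix: the stream is drained, so skip read_exactly()
    return stream.read_exactly(BLOCK)
```

A short read that consumes EOS in the same call mirrors the block-aligned case from the commit message. After that read, only the `eof()` check keeps the next call from reaching the source again.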
Ernest Zaslavsky	f74a54f005	test_lib: mark `limiting_data_source_impl` as not `final`	2026-03-19 13:54:54 +02:00

52784 Commits