scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-09 08:23:29 +00:00

Author	SHA1	Message	Date
Calle Wilund	df56f6bdc2	memtable_test::memtable_flush_period: Change sleep to use injection signal instead Fixes: SCYLLADB-942 Adds an injection signal _from_ table::seal_active_memtable to allow us to reliably wait for flushing. And does so. Closes scylladb/scylladb#29070 (cherry picked from commit `0013f22374`) Closes scylladb/scylladb#29116	2026-04-16 16:29:24 +02:00
Asias He	92f8f2c2db	test: Stabilize tablet incremental repair error test Use async tablet repair task flow to avoid a race where client timeout returns while server-side repair continues after injections are disabled. Start repair with await_completion=false, assert it does not complete within timeout under injection, abort/wait the task, then verify sstables_repaired_at is unchanged. Fixes SCYLLADB-1184 Closes scylladb/scylladb#29452 (cherry picked from commit `4137a4229c`) Closes scylladb/scylladb#29500	2026-04-16 10:40:39 +03:00
Tomasz Grabiec	e992d76489	Merge 'table: don't create new split compaction groups if main compaction group is disabled' from Ferenc Szili Fixes a race condition where tablet split can crash the server during truncation. `truncate_table_on_all_shards()` disables compaction on all existing compaction groups, then later calls `discard_sstables()` which asserts that compaction is disabled. Between these two points, tablet split can call `set_split_mode()`, which creates new compaction groups via `make_empty_group()` — these start with `compaction_disabled_counter == 0`. When `discard_sstables()` checks its assertion, it finds these new groups and fires `on_internal_error`, aborting the server. In `storage_group::set_split_mode()`, before creating new compaction groups, check whether the main compaction group has compaction disabled. If it does, bail out early and return `false` (not ready). This is safe because the split will be retried once truncation completes and re-enables compaction. A new regression test `test_split_emitted_during_truncate` reproduces the exact interleaving using two error injection points: - `database_truncate_wait` — pauses truncation after compaction is disabled but before `discard_sstables()` runs. - `tablet_split_monitor_wait` (new, in `service/storage_service.cc`) — pauses the split monitor at the start of `process_tablet_split_candidate()`. The test creates a single-tablet table, triggers both operations, uses the injection points to force the problematic ordering, then verifies that truncation completes successfully and the split finishes afterward. Fixes: SCYLLADB-1035 This needs to be backported to all currently supported versions. Closes scylladb/scylladb#29250 * github.com:scylladb/scylladb: test: add test_split_emitted_during_truncate table: fix race between tablet split and truncate (cherry picked from commit `7fe4ae16f0`) Closes scylladb/scylladb#29478	2026-04-16 10:39:29 +03:00
Avi Kivity	196db8931e	partition_snapshot_row_cursor: fix reversed maybe_refresh() losing latest version entry In partition_snapshot_row_cursor::maybe_refresh(), the !is_in_latest_version() path calls lower_bound(_position) on the latest version's rows to find the cursor's position in that version. When lower_bound returns null (the cursor is positioned above all entries in the latest version in table order), the code unconditionally sets _background_continuity = true and allows the subsequent if(!it) block to erase the latest version's entry from the heap. This is correct for forward traversal: null means there are no more entries ahead, so removing the version from the heap is safe. However, in reversed mode, null from lower_bound means the cursor is above all entries in table order -- those entries are BELOW the cursor in query order and will be visited LATER during reversed traversal. Erasing the heap entry permanently loses them, causing live rows to be skipped. The fix mirrors what prepare_heap() already does correctly: when lower_bound returns null in reversed mode, use std::prev(rows.end()) to keep the last entry in the heap instead of erasing it. Add test_reversed_maybe_refresh_keeps_latest_version_entry to mvcc_test, alongside the existing reversed cursor tests. The test creates a two-version partition snapshot (v0 with range tombstones, v1 with a live row positioned below all v0 entries in table order), and traverses in reverse calling maybe_refresh() at each step -- directly exercising the buggy code path. The test fails without the fix. The bug was introduced by `6b7473be53` ("Handle non-evictable snapshots", 2022-11-21), which added null-iterator handling for non-evictable snapshots (memtable snapshots lack the trailing dummy entry that evictable snapshots have). prepare_heap() got correct reversed-mode handling at that time, but maybe_refresh() received only forward-mode logic. 
The bug is intermittent because multiple mechanisms cause iterators_valid() to return false, forcing maybe_refresh() to take the full rebuild path via prepare_heap() (which handles reversed mode correctly): - Mutation cleaner merging versions in the background (changes change_mark) - LSA segment compaction during reserve() (invalidates references) - B-tree rebalancing on partition insertion (invalidates references) - Debug mode's always-true need_preempt() creating many multi-version partitions via preempted apply_monotonically() A dtest reproducer confirmed the same root cause: with 100K overlapping range tombstones creating a massively multi-version memtable partition (287K preemption events), the reversed scan's latest_iterator was observed jumping discontinuously during a version transition -- the latest version's heap entry was erased -- causing the query to walk the entire partition without finding the live row. Fixes: SCYLLADB-1253 Closes scylladb/scylladb#29368 (cherry picked from commit `21d9f54a9a`) Closes scylladb/scylladb#29480	2026-04-15 18:54:57 +03:00
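The null-handling decision described in the commit above can be sketched in Python. This is a simplified model, not the ScyllaDB code: rows are plain integers in table order, `bisect_left` stands in for `lower_bound`, and the function name and the `fixed` flag are hypothetical.

```python
from bisect import bisect_left
from typing import Optional, Sequence


def latest_version_entry(rows: Sequence[int], position: int,
                         reversed_mode: bool, fixed: bool) -> Optional[int]:
    """Return the row kept in the heap for the latest version, or None if erased.

    Hypothetical model: `rows` is sorted in table order, `position` is the cursor.
    """
    idx = bisect_left(rows, position)  # stand-in for lower_bound(_position)
    if idx < len(rows):
        return rows[idx]
    # lower_bound returned "null": the cursor is above every row in table order.
    if reversed_mode and fixed and rows:
        # Reversed traversal still has to visit these rows, so keep the last
        # one, analogous to std::prev(rows.end()).
        return rows[-1]
    # Forward traversal (or the pre-fix reversed path): erase the heap entry.
    return None


rows = [1, 2, 3]
print(latest_version_entry(rows, 10, reversed_mode=False, fixed=True))
print(latest_version_entry(rows, 10, reversed_mode=True, fixed=False))
print(latest_version_entry(rows, 10, reversed_mode=True, fixed=True))
```

In this model, only the reversed, fixed case keeps an entry when the cursor sits above all rows; the pre-fix reversed case drops it, which corresponds to the skipped live rows described in the commit message.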
Nadav Har'El	e436db01e3	Merge 'cql3: fix authorization bypass via BATCH prepared cache poisoning' from Marcin Maliszkiewicz execute_batch_without_checking_exception_message() inserted entries into the authorized prepared cache before verifying that check_access() succeeded. A failed BATCH therefore left behind cached 'authorized' entries that later let a direct EXECUTE of the same prepared statement skip the authorization check entirely. Move the cache insertion after the access check so that entries are only cached on success. This matches the pattern already used by do_execute_prepared() for individual EXECUTE requests. Introduced in `98f5e49ea8` Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1221 Backport: all supported versions Closes scylladb/scylladb#29432 * github.com:scylladb/scylladb: test/cqlpy: add reproducer for BATCH prepared auth cache bypass cql3: fix authorization bypass via BATCH prepared cache poisoning (cherry picked from commit `986167a416`) Closes scylladb/scylladb#29479	2026-04-15 17:14:20 +02:00
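The ordering issue in the BATCH fix above can be illustrated with a small Python model. It is a sketch under stated assumptions: the `Server` class, its methods, and the `cache_before_check` flag are hypothetical and only mimic the described behaviour (cache insertion before versus after the access check).

```python
class Unauthorized(Exception):
    """Raised when a user lacks permission for a statement."""


class Server:
    def __init__(self, permissions):
        self.permissions = set(permissions)   # allowed (user, statement) pairs
        self.authorized_cache = set()         # pairs that may skip the check

    def check_access(self, user, stmt):
        if (user, stmt) not in self.permissions:
            raise Unauthorized(f"{user} may not run {stmt}")

    def execute(self, user, stmt):
        # Direct EXECUTE: a cache hit skips the authorization check.
        if (user, stmt) not in self.authorized_cache:
            self.check_access(user, stmt)
            self.authorized_cache.add((user, stmt))
        return "ok"

    def execute_batch(self, user, stmts, cache_before_check):
        for stmt in stmts:
            if cache_before_check:
                # Pre-fix ordering: the entry is cached even if the check fails.
                self.authorized_cache.add((user, stmt))
                self.check_access(user, stmt)
            else:
                # Post-fix ordering: cache only after a successful check.
                self.check_access(user, stmt)
                self.authorized_cache.add((user, stmt))
        return "ok"


def bypass_possible(cache_before_check):
    """Return True if a failed BATCH lets a later EXECUTE skip authorization."""
    server = Server(permissions=[])
    try:
        server.execute_batch("mallory", ["stmt1"], cache_before_check)
    except Unauthorized:
        pass
    try:
        server.execute("mallory", "stmt1")
        return True
    except Unauthorized:
        return False


print(bypass_possible(cache_before_check=True))
print(bypass_possible(cache_before_check=False))
```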
Michał Hudobski	d46ff9b405	vector_search: forward non-primary key restrictions to Vector Store service Include non-primary key restrictions (e.g. regular column filters) in the filter JSON sent to the Vector Store service. Previously only partition key and clustering column restrictions were forwarded, so filtering on regular columns was silently ignored. Add get_nonprimary_key_restrictions() getter to statement_restrictions. Add unit tests for non-primary key equality, range, and bind marker restrictions in filter_test. Fixes: SCYLLADB-970 Closes scylladb/scylladb#29019 (cherry picked from commit `7d648961ed`) Closes scylladb/scylladb#29437	2026-04-12 14:39:48 +03:00
Michał Jadwiszczak	e5d82bf857	test: fix flaky test_create_index_synchronous_updates trace event race The test_create_index_synchronous_updates test in test_secondary_index_properties.py was intermittently failing with 'assert found_wanted_trace' because the expected trace event 'Forcing ... view update to be synchronous' was missing from the trace events returned by get_query_trace(). Root cause: trace events are written asynchronously to system_traces.events. The Python driver's populate() method considers a trace complete once the session row in system_traces.sessions has duration IS NOT NULL, then reads events exactly once. Since the session row and event rows are written as separate mutations with no transactional guarantee, the driver can read an incomplete set of events. Evidence from the failed CI run logs: - The entire test (CREATE TABLE through DROP TABLE) completed in ~300ms (01:38:54,859 - 01:38:55,157) - The INSERT with tracing happened in a ~50ms window between the second CREATE INDEX completing (01:38:55,108) and DROP TABLE starting (01:38:55,157) - The 'Forcing ... synchronous' trace message is generated during the INSERT write path (db/view/view.cc:2061), so it was produced, but not yet flushed to system_traces.events when the driver read them - This matches the known limitation documented in test/alternator/ test_tracing.py: 'we have no way to know whether the tracing events returned is the entire trace' Fix: replace the single-shot trace.events read with a retry loop that directly queries system_traces.events until the expected event appears (with a 30s timeout). Use ConsistencyLevel.ONE since system_traces has RF=2 and cqlpy tests run on a single-node cluster. The same race condition pattern exists in test_mv_synchronous_updates in test_materialized_view.py (which this test was modeled after), so the same fix is proactively applied there as well. 
Fixes SCYLLADB-1314 Closes scylladb/scylladb#29374 (cherry picked from commit `568f20396a`) Closes scylladb/scylladb#29395	2026-04-12 14:32:31 +03:00
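The retry loop described in the fix above can be sketched generically. This is not the test's actual code: `wait_for_event`, `fake_fetch`, and the event strings are hypothetical stand-ins for repeatedly querying system_traces.events until the expected message shows up or a timeout expires.

```python
import time
from typing import Callable, Iterable


def wait_for_event(fetch_events: Callable[[], Iterable[str]], wanted: str,
                   timeout: float = 30.0, interval: float = 0.1) -> bool:
    """Poll fetch_events() until an event containing `wanted` appears."""
    deadline = time.monotonic() + timeout
    while True:
        if any(wanted in event for event in fetch_events()):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical event source: the wanted event becomes visible on the third read,
# imitating event rows that are flushed after the session row.
calls = {"n": 0}


def fake_fetch():
    calls["n"] += 1
    events = ["Parsing statement"]
    if calls["n"] >= 3:
        events.append("Forcing view update to be synchronous")
    return events


found = wait_for_event(fake_fetch, "synchronous", timeout=5.0, interval=0.01)
print(found, calls["n"])
```

A single read, as the driver's one-shot approach effectively does, would have returned only the first event here; the loop keeps reading until the late event arrives.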
Marcin Maliszkiewicz	fac9795325	Merge 'ldap: fix double-free of LDAPMessage in poll_results()' from Andrzej Jackowski In the unregistered-ID branch, ldap_msgfree() was called on a result already owned by an RAII ldap_msg_ptr, causing a double-free on scope exit. Remove the redundant manual free. Fixes: SCYLLADB-1344 Backport: 2026.1, 2025.4, 2025.1 - it's a memory corruption, with a one-line fix, so better backport it everywhere. Closes scylladb/scylladb#29302 * github.com:scylladb/scylladb: test: ldap: add regression test for double-free on unregistered message ID ldap: fix double-free of LDAPMessage in poll_results() (cherry picked from commit `895fdb6d29`) Closes scylladb/scylladb#29393	2026-04-12 14:31:00 +03:00
Botond Dénes	fb81acb7aa	Merge 'cql3: fix null handling in data_value formatting' from Dario Mirovic `data_value::to_parsable_string()` crashes with a null pointer dereference when called on a `null` data_value. Return `"null"` instead. Added tests after the fix. Manually checked that tests fail without the fix. Fixes SCYLLADB-1350 This is a fix that prevents format crash. No known occurrence in production, but backport is desirable. Closes scylladb/scylladb#29262 * github.com:scylladb/scylladb: test: boost: test null data value to_parsable_string cql3: fix null handling in data_value formatting (cherry picked from commit `816f2bf163`) Closes scylladb/scylladb#29384	2026-04-10 13:09:02 +02:00
Michał Chojnowski	da53b8798f	test: add a missing reconnect_driver in test_sstable_compression_dictionaries_upgrade.py Need to work around https://github.com/scylladb/python-driver/issues/295, lest a CQL query fail spuriously after the cluster restart. Fixes: SCYLLADB-1114 Closes scylladb/scylladb#29118 (cherry picked from commit `6b18d95dec`) Closes scylladb/scylladb#29146	2026-04-06 22:07:57 +03:00
Botond Dénes	3d167dd36e	Merge 'Alternator: add per-table batch latency metrics and test coverage' from Amnon Heiman This series fixes a metrics visibility gap in Alternator and adds regression coverage. Until now, BatchGetItem and BatchWriteItem updated global latency histograms but did not consistently update per-table latency histograms. As a result, table-level latency dashboards could miss batch traffic. It updates the batch read/write paths to compute request duration once and record it in both global and per-table latency metrics. Add the missing tests, including a metric-agnostic helper and a dedicated per-table latency test that verifies latency counters increase for item and batch operations. This change is metrics-only (no API/behavior change for requests) and improves observability consistency between global and per-table views. Fixes #28721 We assume the alternator per-table metrics exist, but the batch ones are not updated Closes scylladb/scylladb#28732 * github.com:scylladb/scylladb: test(alternator): add per-table latency coverage for item and batch ops alternator: track per-table latency for batch get/write operations (cherry picked from commit `035aa90d4b`) Closes scylladb/scylladb#29067	2026-04-06 22:03:00 +03:00
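The "compute the duration once, record it in both places" pattern from the Alternator change above can be sketched as follows. The `LatencyStats` class, `timed_batch` helper, and table names are hypothetical and only illustrate the idea; they are not Alternator's metrics API.

```python
import time
from collections import defaultdict


class LatencyStats:
    """Minimal stand-in for a latency histogram."""

    def __init__(self):
        self.samples = []

    def record(self, seconds):
        self.samples.append(seconds)


global_latency = LatencyStats()
per_table_latency = defaultdict(LatencyStats)


def timed_batch(tables, operation):
    """Run a batch operation and record one duration globally and per table."""
    start = time.perf_counter()
    result = operation()
    duration = time.perf_counter() - start    # measured once
    global_latency.record(duration)
    for table in set(tables):                 # each touched table, once
        per_table_latency[table].record(duration)
    return result


result = timed_batch(["t1", "t2", "t1"], lambda: "done")
print(result, len(global_latency.samples), sorted(per_table_latency))
```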
Botond Dénes	c93c037d39	Merge 'service: tasks: return successful status if a table was dropped' from Aleksandra Martyniuk tablet_virtual_task::wait throws if a table on which a tablet operation was working is dropped. Treat the tablet operation as successful if a table is dropped. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-494 Needs backport to all live releases Closes scylladb/scylladb#28933 * github.com:scylladb/scylladb: test: add test_tablet_repair_wait_with_table_drop service: tasks: return successful status if a table was dropped (cherry picked from commit `1e41db5948`) Closes scylladb/scylladb#28965	2026-04-06 14:25:13 +03:00
Piotr Dulikowski	3107d9083e	Merge 'vector_search: fix race condition on connection timeout' from Karol Nowacki When a `with_connect` operation timed out, the underlying connection attempt continued to run in the reactor. This could lead to a crash if the connection was established/rejected after the client object had already been destroyed. This issue was observed during the teardown phase of an upcoming high-availability test case. This commit fixes the race condition by ensuring the connection attempt is properly canceled on timeout. Additionally, the explicit TLS handshake previously forced during the connection is now deferred to the first I/O operation, which is the default and preferred behavior. Fixes: SCYLLADB-832 Backports to 2026.1 and 2025.4 are required, as this issue also exists on those branches and is causing CI flakiness. Closes scylladb/scylladb#29031 * github.com:scylladb/scylladb: vector_search: test: fix flaky test vector_search: fix race condition on connection timeout (cherry picked from commit `cc695bc3f7`) Closes scylladb/scylladb#29157	2026-04-06 14:24:39 +03:00
Pavel Emelyanov	70b9ae04ff	Merge 'encryption: fix deadlock in encrypted_data_source::get()' from Ernest Zaslavsky When encrypted_data_source::get() caches a trailing block in _next, the next call takes it directly — bypassing input_stream::read(), which checks _eof. It then calls input_stream::read_exactly() on the already-drained stream. Unlike read(), read_up_to(), and consume(), read_exactly() does not check _eof when the buffer is empty, so it calls _fd.get() on a source that already returned EOS. In production this manifested as stuck encrypted SSTable component downloads during tablet restore: the underlying chunked_download_source hung forever on the post-EOS get(), causing 4 tablets to never complete. The stuck files were always block-aligned sizes (8k, 12k) where _next gets populated and the source is fully consumed in the same call. Fix by checking _input.eof() before calling read_exactly(). When the stream already reached EOF, buf2 is known to be empty, so the call is skipped entirely. A comprehensive test is added that uses a strict_memory_source which fails on post-EOS get(), reproducing the exact code path that caused the production deadlock. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1128 Backport to 2025.3/4 and 2026.1 is needed since it fixes a bug that may bite us in production, to be on the safe side Closes scylladb/scylladb#29110 * github.com:scylladb/scylladb: encryption: fix deadlock in encrypted_data_source::get() test_lib: mark `limiting_data_source_impl` as not `final` Fix formatting after previous patch Fix indentation after previous patch test_lib: make limiting_data_source_impl available to tests (cherry picked from commit `3b9398dfc8`) Closes scylladb/scylladb#29198	2026-04-06 14:23:05 +03:00
Botond Dénes	abfa4d0272	Merge 'test: cluster: Deflake test_startup_with_keyspaces_violating_rf_rack_valid_keyspaces' from Dawid Mędrek The test was flaky. The scenario looked like this: 1. Stop server 1. 2. Set its rf_rack_valid_keyspaces configuration option to true. 3. Create an RF-rack-invalid keyspace. 4. Start server 1 and expect a failure during start-up. It was wrong. We cannot predict when the Raft mutation corresponding to the newly created keyspace will arrive at the node or when it will be processed. If the check of the RF-rack-valid keyspaces we perform at start-up was done before that, it won't include the keyspace. This will lead to a test failure. Unfortunately, it's not feasible to perform a read barrier during start-up. What's more, although it would help the test, it wouldn't be useful otherwise. Because of that, we simply fix the test, at least for now. The new scenario looks like this: 1. Disable the rf_rack_valid_keyspaces configuration option on server 1. 2. Start the server. 3. Create an RF-rack-invalid keyspace. 4. Perform a read barrier on server 1. This will ensure that it has observed all Raft mutations, and we won't run into the same problem. 5. Stop the node. 6. Set its rf_rack_valid_keyspaces configuration option to true. 7. Try to start the node and observe a failure. This will make the test perform consistently. --- I ran the test (in dev mode, on my local machine) three times before these changes, and three times with them. I include the time results below. Before: ``` real 0m47.570s user 0m41.631s sys 0m8.634s real 0m50.495s user 0m42.499s sys 0m8.607s real 0m50.375s user 0m41.832s sys 0m8.789s ``` After: ``` real 0m50.509s user 0m43.535s sys 0m9.715s real 0m50.857s user 0m44.185s sys 0m9.811s real 0m50.873s user 0m44.289s sys 0m9.737s ``` Fixes SCYLLADB-1137 Backport: The test is present on all supported branches, and so we should backport these changes to them. 
Closes scylladb/scylladb#29218 * github.com:scylladb/scylladb: test: cluster: Deflake test_startup_with_keyspaces_violating_rf_rack_valid_keyspaces test: cluster: Mark test with @pytest.mark.asyncio in test_multidc.py (cherry picked from commit `d52fbf7ada`) Closes scylladb/scylladb#29247	2026-04-06 14:22:08 +03:00
Avi Kivity	8bdc97924e	Merge 'test: fix race condition in test_crashed_node_substitution' from Sergey Zolotukhin `test_crashed_node_substitution` intermittently failed: ```python assert len(gossiper_eps) == (len(server_eps) + 1) ``` The test crashed the node right after a single ACK2 handshake (`finished do_send_ack2_msg`), assuming the node state was visible to all peers. However, since gossip is eventually consistent, the update may not have propagated yet, so some nodes did not see the failed node. This change: Wait until the gossiper state is visible on peers before continuing the test and asserting. Fixes: [SCYLLADB-921](https://scylladb.atlassian.net/browse/SCYLLADB-921). backport: this issue may affect CI for all branches, so should be backported to all versions. Closes scylladb/scylladb#29254 * github.com:scylladb/scylladb: test: test_crashed_node_substitution: add docstring and fix whitespace test: fix race condition in test_crashed_node_substitution (cherry picked from commit `b708e5d7c9`) Closes scylladb/scylladb#29258	2026-04-06 14:21:46 +03:00
Botond Dénes	253fa9519f	test/encryption: wait for topology convergence after abrupt restart test_reboot uses a custom restart function that SIGKILLs and restarts nodes sequentially. After all nodes are back up, the test proceeded directly to reads after wait_for_cql_and_get_hosts(), which only confirms CQL reachability. While a node is restarted, other nodes might execute global token metadata barriers, which advance the topology fence version. The restarted node has to learn about the new version before it can send reads/writes to the other nodes. The test issues reads as soon as the CQL port is opened, which might happen before the last restarted node learns of the latest topology version. If this node acts as a coordinator for reads/writes before this happens, these will fail as the other nodes will reject the ops with the outdated topology fence version. Fix this by replacing wait_for_cql_and_get_hosts() on the abrupt-restart path with the more robust get_ready_cql(), which makes sure servers see each other before refreshing the cql connection. This should ensure that nodes have exchanged gossip and converged on topology state before any reads are executed. The rolling_restart() path is unaffected as it handles this internally. Fixes: SCYLLADB-557 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Closes scylladb/scylladb#29211 (cherry picked from commit `854c374ebf`) Closes scylladb/scylladb#29260	2026-04-06 14:21:26 +03:00
Botond Dénes	70b7652e64	test/cluster: fix flaky test_cleanup_stop by using asyncio.sleep The test was using time.sleep(1) (a blocking call) to wait after scheduling the stop_compaction task, intending to let it register on the server before releasing the sstable_cleanup_wait injection point. However, time.sleep() blocks the asyncio event loop entirely, so the asyncio.create_task(stop_compaction) task never gets to run during the sleep. After the sleep, the directly-awaited message_injection() runs first, releasing the injection point before stop_compaction is even sent. By the time stop_compaction reaches Scylla, the cleanup has already completed successfully -- no exception is raised and the test fails. Fix by replacing time.sleep(1) with await asyncio.sleep(1), which yields control to the event loop and allows the stop_compaction task to actually send its HTTP request before message_injection is called. Fixes: SCYLLADB-834 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Closes scylladb/scylladb#29202 (cherry picked from commit `068a7894aa`) Closes scylladb/scylladb#29277	2026-04-06 14:20:46 +03:00
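The difference between time.sleep() and asyncio.sleep() described above can be reproduced in a few lines. This is a minimal sketch: `background()` is a hypothetical stand-in for the stop_compaction task, and the "main" step stands in for the directly awaited message_injection() call.

```python
import asyncio
import time


async def run(sleep_blocking: bool) -> list[str]:
    """Return the order in which the background task and the main flow ran."""
    order: list[str] = []

    async def background() -> None:
        # Stand-in for the task that sends the stop_compaction request.
        order.append("background")

    task = asyncio.create_task(background())
    if sleep_blocking:
        time.sleep(0.05)           # blocks the event loop; the task cannot start
    else:
        await asyncio.sleep(0.05)  # yields to the loop; the task runs meanwhile
    order.append("main")           # stand-in for message_injection()
    await task
    return order


blocking_order = asyncio.run(run(sleep_blocking=True))
yielding_order = asyncio.run(run(sleep_blocking=False))
print(blocking_order)   # ['main', 'background']
print(yielding_order)   # ['background', 'main']
```

With the blocking sleep, the scheduled task only gets to run at the final `await task`, after the main step, which matches the failure mode in the commit message.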
Andrzej Jackowski	27604deebb	test: use exclusive driver connection in test_limited_concurrency_of_writes Use get_cql_exclusive(node1) so the driver only connects to node1 and never attempts to contact the stopped node2. The test was flaky because the driver received `Host has been marked down or removed` from node2. Fixes: SCYLLADB-1227 Closes scylladb/scylladb#29268 (cherry picked from commit `ab43420d30`) Closes scylladb/scylladb#29278	2026-04-06 14:20:25 +03:00
Tomasz Grabiec	cd7baebc8b	tests: address_map_test: Fix flakiness in debug mode due to task reordering Debug mode shuffles task position in the queue. So the following is possible: 1) shard 1 calls manual_clock::advance(). This expires timers on shard 1 and queues a background smp call to shard 0 which will expire timers there 2) the smp::submit_to(0, ...) from shard 1 called by the test submits the call 3) shard 0 creates tasks for both calls, but (2) is run first, and preempts the reactor 4) shard 1 sees the completion, completes m_svc.invoke_on(1, ..) 5) shard 0 inserts the completion from (4) before task from (1) 6) the check on shard 0: m.find(id1) fails because the timer is not expired yet To fix that, wait for timer expiration on shard 0, so that the test doesn't depend on task execution order. Note: I was not able to reproduce the problem locally using test.py --mode debug --repeat 1000. It happens in Jenkins very rarely. Which is expected as the scenario which leads to this is quite unlikely. Fixes SCYLLADB-1265 Closes scylladb/scylladb#29290 (cherry picked from commit `2ec47a8a21`) Closes scylladb/scylladb#29309	2026-04-06 14:18:39 +03:00
Andrzej Jackowski	c5f57815a5	test: protect populate_range in row_cache_test from bad_alloc When test_exception_safety_of_update_from_memtable was converted from manual fail_after()/catch to with_allocation_failures() in `74db08165d`, the populate_range() call ended up inside the failure injection scope without a scoped_critical_alloc_section guard. The other two tests converted in the same commit (test_exception_safety_of_transitioning... and test_exception_safety_of_partition_scan) were correctly guarded. Without the guard, the allocation failure injector can sometimes target an allocation point inside the cleanup path of populate_range(). In a rare corner case, this triggers a bad_alloc in a noexcept context (reader_concurrency_semaphore::stop()), causing std::terminate. Fixes SCYLLADB-1346 Closes scylladb/scylladb#29321 (cherry picked from commit `8c0920202b`) Closes scylladb/scylladb#29331	2026-04-06 14:17:12 +03:00
Avi Kivity	95e422db48	Merge 'service_levels: mark v2 migration complete on empty legacy table' from Alex Dathskovsky During raft-topology upgrade in 2026.1, service_level_controller::migrate_to_v2() returns early when system_distributed.service_levels is empty. This skips the service_level_version = 2 write, so the cluster is never marked as upgraded to service levels v2 even though there is no data to migrate. Subsequent upgrades may then fail the startup check which requires service_level_version == 2. Remove the early return and let the migration commit the version marker even when there are no legacy service levels rows to copy. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1198 backport: should be backported to all versions that can be upgraded to 2026.2 Closes scylladb/scylladb#29333 * github.com:scylladb/scylladb: test/auth_cluster: cover empty legacy table in service level upgrade service_levels: mark v2 migration complete on empty legacy table	2026-04-06 14:07:48 +03:00
Alex	faba13d2b7	test/auth_cluster: cover empty legacy table in service level upgrade Add a cluster test that upgrades to raft topology with an empty legacy `system_distributed.service_levels` table and verifies that the migration still marks `service_level_version` as `2`.	2026-04-05 19:46:15 +03:00
Botond Dénes	f2111c011f	Merge 'Demote log level on split failure during shutdown' from Raphael Raph Carvalho Since commit `509f2af8db`, gate_closed_exception can be triggered for ongoing split during shutdown. The commit is correct, but it causes split failure on shutdown to log an error, which causes CI instability. Previously, aborted_exception would be triggered instead which is logged as warning. Let's do the same. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-951. Fixes https://github.com/scylladb/scylladb/issues/24850. Only 2026.1 is affected. Closes scylladb/scylladb#29032 * github.com:scylladb/scylladb: replica: Demote log level on split failure during shutdown service: Demote log level on split failure during shutdown (cherry picked from commit `ae17596c2a`) Closes scylladb/scylladb#29115	2026-04-05 14:30:55 +03:00
Piotr Dulikowski	d2b12329ab	Merge 'Return HTTP error description in Vector Store client' from Szymon Wasik The `service_error` struct: `6dc2c42f8b/service/vector_store_client.hh (L64)` currently stores just the error status code. For this reason whenever the HTTP error occurs, only the error code can be forwarded to the client. For example see here: `6dc2c42f8b/service/vector_store_client.cc (L580)` For this reason in the output of the drivers full description of the error is missing which forces user to take a look into Scylla server logs. The objective of this PR is to extend the support for HTTP errors in Vector Store client to handle messages as well. Moreover, it removes the quadratic reallocation in response_content_to_sstring() helper function that is used for getting the response in case of error. Fixes: VECTOR-189 Closes scylladb/scylladb#26139 * github.com:scylladb/scylladb: vector_search: Avoid quadratic reallocation in response_content_to_sstring vector_store_client: Return HTTP error description, not just code (cherry picked from commit `38a2829f69`) Closes scylladb/scylladb#29312	2026-04-03 17:53:40 +02:00
Patryk Jędrzejczak	d5c7f29734	raft_group0: join_group0: fix join hang when node joins group 0 before post_server_start A joining node hung forever if the topology coordinator added it to the group 0 configuration before the node reached `post_server_start`. In that case, `server->get_configuration().contains(my_id)` returned true and the node broke out of the join loop early, skipping `post_server_start`. `_join_node_group0_started` was therefore never set, so the node's `join_node_response` RPC handler blocked indefinitely. Meanwhile the topology coordinator's `respond_to_joining_node` call (which has no timeout) hung forever waiting for the reply that never came. Fix by only taking the early-break path when not starting as a follower (i.e. when the node is the discovery leader or is restarting). A joining node must always reach `post_server_start`. We also provide a regression test. It takes 6s in dev mode. Fixes SCYLLADB-959 Closes scylladb/scylladb#29266 (cherry picked from commit `b9f82f6f23`) Closes scylladb/scylladb#29291	2026-04-01 09:58:20 +02:00
Pavel Emelyanov	233da83dd9	Merge 'Fix directory lister leak in table::get_snapshot_details: ' from Benny Halevy As reported in SCYLLADB-1013, the directory lister must be closed also when an exception is thrown. For example, see backtrace below: ``` seastar::on_internal_error(seastar::logger&, std::basic_string_view<char, std::char_traits<char>>) at ./build/release/seastar/./seastar/src/core/on_internal_error.cc:57 directory_lister::~directory_lister() at ./utils/lister.cc:77 replica::table::get_snapshot_details(std::filesystem::__cxx11::path, std::filesystem::__cxx11::path) (.resume) at ./replica/table.cc:4081 std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<db::snapshot_ctl::table_snapshot_details>::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/coroutine:247 (inlined by) seastar::internal::coroutine_traits_base<db::snapshot_ctl::table_snapshot_details>::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:129 seastar::reactor::task_queue::run_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2695 (inlined by) seastar::reactor::task_queue_group::run_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3201 seastar::reactor::task_queue_group::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3185 (inlined by) seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3353 seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3245 seastar::app_template::run_deprecated(int, char, std::function<void ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:266 seastar::app_template::run(int, char, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:160 scylla_main(int, char*) at ./main.cc:756 ``` Fixes: [SCYLLADB-1013](https://scylladb.atlassian.net/browse/SCYLLADB-1013) Requires backport to 2026.1 since 
the leak exists since `004c08f525` [SCYLLADB-1013]: https://scylladb.atlassian.net/browse/SCYLLADB-1013?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Closes scylladb/scylladb#29084 * github.com:scylladb/scylladb: test/boost/database_test: add test_snapshot_ctl_details_exception_handling table: get_snapshot_details: fix indentation inside try block table: per-snapshot get_snapshot_details: fix typo in comment table: per-snapshot get_snapshot_details: always close lister using try/catch table: get_snapshot_details: always close lister using deferred_close (cherry picked from commit `f27dc12b7c`) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#29125	2026-03-25 10:04:39 +01:00
Michał Chojnowski	804842e95c	test/boost/cache_algorithm_test: disable sstable compression to avoid giant index pages The test intentionally creates huge index pages. But since `5e7fb08bf3`, the index reader allocates a block of memory for a whole index page, instead of incrementally allocating small pieces during index parsing. This giant allocation causes the test to fail spuriously in CI sometimes. Fix this by disabling sstable compression on the test table, which puts a hard cap of 2000 keys per index page. Fixes: SCYLLADB-1152 Closes scylladb/scylladb#29152 (cherry picked from commit `f29525f3a6`) Closes scylladb/scylladb#29172	2026-03-24 16:02:48 +02:00
Botond Dénes	4f77cb621f	Merge 'tablets: Fix deadlock in background storage group merge fiber' from Tomasz Grabiec When it deadlocks, groups stop merging and compaction group merge backlog will run-away. Also, graceful shutdown will be blocked on it. Found by flaky unit test test_merge_chooses_best_replica_with_odd_count, which timed-out in 1 in 100 runs. Reason for deadlock: When storage groups are merged, the main compaction group of the new storage group takes a compaction lock, which is appended to _compaction_reenablers_for_merging, and released when the merge completion fiber is done with the whole batch. If we accumulate more than 1 merge cycle for the fiber, deadlock occurs. Lock order will be this Initial state: cg0: main cg1: main cg2: main cg3: main After 1st merge: cg0': main [locked], merging_groups=[cg0.main, cg1.main] cg1': main [locked], merging_groups=[cg2.main, cg3.main] After 2nd merge: cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main] merge completion fiber will try to stop cg0'.main, which will be blocked on compaction lock. which is held by the reenabler in _compaction_reenablers_for_merging, hence deadlock. The fix is to wait for background merge to finish before we start the next merge. It's achieved by holding old erm in the background merge, and doing a topology barrier from the merge finalizing transition. Background merge is supposed to be a relatively quick operation, it's stopping compaction groups. So may wait for active requests. It shouldn't prolong the barrier indefinitely. Tablet tests which trigger merge need to be adjusted to call the barrier, otherwise they will be vulnerable to the deadlock. Fixes SCYLLADB-928 Backport to >= 2025.4 because it's the earliest vulnerable due to `f9021777d8`. 
Closes scylladb/scylladb#29007 * github.com:scylladb/scylladb: tablets: Fix deadlock in background storage group merge fiber replica: table: Propagate old erm to storage group merge test: boost: tablets_test: Save tablet metadata when ACKing split resize decision storage_service: Extract local_topology_barrier() (cherry picked from commit `5573c3b18e`) Closes scylladb/scylladb#29144	2026-03-21 01:37:30 +01:00
Raphael S. Carvalho	eb6c333e1b	streaming: Release space incrementally during file streaming File streaming releases the file descriptors of a tablet being streamed only at the very end of streaming. This means that if a compaction on the largest tier of the streaming tablet finishes after streaming started, there will always be ~2x space amplification for that single tablet. Since there can be up to 4 tablets being migrated away, it can add up to a significant amount, since nodes are pushed to a substantial usage of available space (~90%). We want to optimize this by dropping the reference to an sstable after it was fully streamed. This way, we reduce the chances of hitting 2x space amplification for a given tablet. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Fixes https://scylladb.atlassian.net/browse/SCYLLADB-790. Closes scylladb/scylladb#28505 (cherry picked from commit `5b550e94a6`) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#28769	2026-03-20 15:50:46 +02:00
Calle Wilund	8d21636a81	test_internode_compression: Add await for "run" coro:s Fixes: SCYLLADB-907 Closes scylladb/scylladb#28885 (cherry picked from commit `35aab75256`) Closes scylladb/scylladb#28905	2026-03-20 10:32:31 +02:00
Calle Wilund	4da8641d83	test_encryption: Fix test_system_auth_encryption Fixes: SCYLLADB-915 The test was quite broken: it did not wait for coroutines, and a number of its checks were no longer even close to valid (this is a ported dtest, and not a very good one). Closes scylladb/scylladb#28887 (cherry picked from commit `ef795eda5b`) Closes scylladb/scylladb#28966	2026-03-20 10:30:48 +02:00
Łukasz Paszkowski	3ab789e1ca	test/storage: harden out-of-space prevention tests around restart and disk-utilization transitions The tests in test_out_of_space_prevention.py are flaky. Three issues contribute: 1. After creating/removing the blob file that simulates disk pressure, the tests immediately checked derived state (e.g., "compaction_manager - Drained") without first confirming the disk space monitor had detected the utilization change. Fix: explicitly wait for "Reached/Dropped below critical disk utilization level" right after creating/removing the blob file, before checking downstream effects. 2. Several tests called `manager.driver_connect()` or omitted reconnection entirely after `server_restart()` / `server_start()`. The pre-existing driver session can silently reconnect multiple times, causing subsequent CQL queries to fail. Fix: call `reconnect_driver()` after every node restart. Additionally, call `wait_for_cql_and_get_hosts()` where CQL is used afterward, to ensure all connection pools are established. 3. Some log assertions used marks captured before a restart, so they could match pre-restart messages or miss messages emitted in the correct post-restart window. Fix: refresh marks at the right points. Apart from that, the patch fixes a typo: autotoogle -> autotoggle. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-655 Closes scylladb/scylladb#28626 (cherry picked from commit `826fd5d6c3`) Closes scylladb/scylladb#28967	2026-03-20 10:30:23 +02:00
Asias He	25a17282bd	test: Fix coordinator assumption in do_test_tablet_incremental_repair_merge_error The first node in the cluster is not guaranteed to be the coordinator node. Hardcoding node 0 as the coordinator causes test flakiness. This patch dynamically finds the actual coordinator node and targets it for error injection, log checking, and restarts. Additionally, inject `tablet_force_tablet_count_decrease_once` across all servers to force the tablet merge process to trigger once. Fixes SCYLLADB-865 Closes scylladb/scylladb#28945 (cherry picked from commit `e0483f6001`) Closes scylladb/scylladb#28969	2026-03-20 10:30:00 +02:00
Botond Dénes	7afcc56128	db,compaction: use utils::chunked_vector for cache invalidation ranges Instead of dht::partition_ranges_vector, which is an std::vector<> and have been seen to cause large allocations when calculating ranges to be invalidated after compaction: seastar_memory - oversized allocation: 147456 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at [Backtrace #0] void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89 (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./seastar/src/util/backtrace.cc:99 seastar::current_tasktrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:136 seastar::current_backtrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:169 seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:840 seastar::memory::cpu_pages::check_large_allocation(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:903 (inlined by) seastar::memory::cpu_pages::allocate_large(unsigned int, bool) at ./build/release/seastar/./seastar/src/core/memory.cc:910 (inlined by) seastar::memory::allocate_large(unsigned long, bool) at ./build/release/seastar/./seastar/src/core/memory.cc:1533 (inlined by) seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:1679 seastar::memory::allocate(unsigned long) at ././seastar/src/core/memory.cc:1698 (inlined by) operator new(unsigned long) at ././seastar/src/core/memory.cc:2440 (inlined by) std::__new_allocator<interval<dht::ring_position>>::allocate(unsigned long, void const) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/new_allocator.h:151 (inlined by) std::allocator<interval<dht::ring_position>>::allocate(unsigned long) at 
/usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/allocator.h:203 (inlined by) std::allocator_traits<std::allocator<interval<dht::ring_position>>>::allocate(std::allocator<interval<dht::ring_position>>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/alloc_traits.h:614 (inlined by) std::_Vector_base<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>>::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/stl_vector.h:387 (inlined by) std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>>::reserve(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/vector.tcc:79 dht::to_partition_ranges(utils::chunked_vector<interval<dht::token>, 131072ul> const&, seastar::bool_class<utils::can_yield_tag>) at ./dht/i_partitioner.cc:347 compaction::compaction::get_ranges_for_invalidation(std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>> const&) at ./compaction/compaction.cc:619 (inlined by) compaction::compaction::get_compaction_completion_desc(std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>>, std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>>) at ./compaction/compaction.cc:719 (inlined by) compaction::regular_compaction::replace_remaining_exhausted_sstables() at ./compaction/compaction.cc:1362 compaction::compaction::finish(std::chrono::time_point<db_clock, std::chrono::duration<long, std::ratio<1l, 1000l>>>, std::chrono::time_point<db_clock, std::chrono::duration<long, std::ratio<1l, 1000l>>>) at ./compaction/compaction.cc:1021 compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0::operator()() at ./compaction/compaction.cc:1960 (inlined by) 
compaction::compaction_result std::__invoke_impl<compaction::compaction_result, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(std::__invoke_other, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/invoke.h:63 (inlined by) std::__invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type std::__invoke<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/invoke.h:98 (inlined by) decltype(auto) std::__apply_impl<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0, std::tuple<>>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&, std::integer_sequence<unsigned long, ...>) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/tuple:2920 (inlined by) decltype(auto) std::apply<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0, std::tuple<>>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/tuple:2935 (inlined by) seastar::future<compaction::compaction_result> seastar::futurize<compaction::compaction_result>::apply<compaction::compaction::run(std::unique_ptr<compaction::compaction, 
std::default_delete<compaction::compaction>>)::$_0>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&) at ././seastar/include/seastar/core/future.hh:1930 (inlined by) seastar::futurize<std::invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type>::type seastar::async<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(seastar::thread_attributes, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&)::'lambda'()::operator()() const at ././seastar/include/seastar/core/thread.hh:267 (inlined by) seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::futurize<std::invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type>::type seastar::async<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(seastar::thread_attributes, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&)::'lambda'()>::call(seastar::noncopyable_function<void ()> const) at ././seastar/include/seastar/util/noncopyable_function.hh:138 seastar::noncopyable_function<void ()>::operator()() const at ./build/release/seastar/./seastar/include/seastar/util/noncopyable_function.hh:224 (inlined by) seastar::thread_context::main() at ./build/release/seastar/./seastar/src/core/thread.cc:318 dht::partition_ranges_vector is used on the hot path, so just convert the problematic user -- cache invalidation -- to use utils::chunked_vector<dht::partition_range> instead. 
Fixes: SCYLLADB-121 Closes scylladb/scylladb#28855 (cherry picked from commit `13ff9c4394`) Closes scylladb/scylladb#28975	2026-03-20 10:29:23 +02:00
Botond Dénes	3e9b984020	Merge 'service: tasks: scan all tablets in tablet_virtual_task::wait' from Aleksandra Martyniuk Currently, for repair tasks tablet_virtual_task::wait gathers the ids of tablets that are to be repaired. The gathered set is later used to check if the repair is still ongoing. However, if the tablets are resized (split or merged), the gathered set becomes irrelevant. Thus, we may end up with an invalid tablet id error being thrown. Wait until repair is done for all tablets in the table. Fixes: https://github.com/scylladb/scylladb/issues/28202 Backport to 2026.1 needed as it contains the change introducing the issue `d51b1fea94` Closes scylladb/scylladb#28323 * github.com:scylladb/scylladb: service: fix indentation test: add test_tablet_repair_wait service: remove status_helper::tablets service: tasks: scan all tablets in tablet_virtual_task::wait (cherry picked from commit `3fed6f9eff`) Closes scylladb/scylladb#28991	2026-03-20 10:28:03 +02:00
Piotr Dulikowski	35cd7f9239	Merge 'cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race' from Alex Dathskovsky query_processor::prepare() could race with prepared statement invalidation: after loading from the prepared cache, we converted the cached object to a checked weak pointer and then continued asynchronous work (including error-injection waitpoints). If invalidation happened in that window, the weak handle could no longer be promoted and the prepare path could fail nondeterministically. This change keeps a strong cache entry reference alive across the whole critical section in prepare() by using a pinned cache accessor (get_pinned()), and only deriving the weak handle while the entry is pinned. This removes the lifetime gap without adding retry loops. Test coverage was extended in test/cluster/test_prepare_race.py: - reproduces the invalidation-during-prepare window with injection, - verifies prepare completes successfully, - then invalidates again and executes the same stale client prepared object, - confirms the driver transparently re-requests/re-prepares and execution succeeds. This change introduces: - no behavior change for normal prepare flow besides stronger lifetime guarantees, - no new protocol semantics, - preserves existing cache invalidation logic, - adds explicit cluster-level regression coverage for both the race and driver reprepare path. - pushes the re-prepare operation towards the driver: the server will return an unprepared error the first time, and the driver will have to re-prepare during the execution stage Fixes: https://github.com/scylladb/scylladb/issues/27657 Backport to active branches recommended: No node crash, but user-visible PREPARE failures under rare schema-invalidation race; low-risk timeout-bounded retry improves robustness. 
Closes scylladb/scylladb#28952 * github.com:scylladb/scylladb: transport/messages: hold pinned prepared entry in PREPARE result cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race (cherry picked from commit `d9a277453e`) Closes scylladb/scylladb#29001	2026-03-20 10:27:04 +02:00
Botond Dénes	fef7750eb6	Merge 'tasks: do not fail the wait request if rpc fails' from Aleksandra Martyniuk During decommission, we first mark a topology request as done, then shut down a node and in the following steps we remove node from the topology. Thus, finished request does not imply that a node is removed from the topology. Due to that, in node_ops_virtual_task::wait, while gathering children from the whole cluster, we may hit the connection exception - because a node is still in topology, even though it is down. Modify the get_children method to ignore the exception and warn about the failure instead. Keep token_metadata_ptr in get_children to prevent topology from changing. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-867 Needs backports to all versions Closes scylladb/scylladb#29035 * github.com:scylladb/scylladb: tasks: fix indentation tasks: do not fail the wait request if rpc fails tasks: pass token_metadata_ptr to task_manager::virtual_task::impl::get_children (cherry picked from commit `2e47fd9f56`) Closes scylladb/scylladb#29126	2026-03-20 10:24:30 +02:00
Patryk Jędrzejczak	1398a55d16	test: test_remove_garbage_group0_members: wait for token ring and group0 consistency before removenode The removenode initiator could have an outdated token ring (still considering the node removed by the previous removenode a token owner) and unexpectedly reject the operation. Fix that by waiting for token ring and group0 consistency before removenode. Note that the test already checks that consistency, but only for one node, which is different from the removenode initiator. This test has been removed in master together with the code being tested (the gossip-based topology). Hence, the fix is submitted directly to 2026.1. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1103 Backport to all supported branches (other than 2026.1), as the test can fail there. Closes scylladb/scylladb#29108	2026-03-20 10:22:40 +02:00
Avi Kivity	d4e454b5bc	Merge 'Fix bad performance for densely populated partition index pages' from Tomasz Grabiec This applies to small partition workload where index pages have high partition count, and the index doesn't fit in cache. It was observed that the count can be in the order of hundreds. In such a workload pages undergo constant population, LSA compaction, and LSA eviction, which has severe impact on CPU utilization. Refs https://scylladb.atlassian.net/browse/SCYLLADB-620 This PR reduces the impact by several changes: - reducing memory footprint in the partition index. Assuming partition key size is 16 bytes, the cost dropped from 96 bytes to 36 bytes per partition. - flattening the object graph and amortizing storage. Storing entries directly in the vector. Storing all key values in a single managed_bytes. Making index_entry a trivial struct. - index entries and key storage are now trivially moveable, and batched inside vector storage so LSA migration can use memcpy(), which amortizes the cost per key. This reduces the cost of LSA segment compaction. - LSA eviction is now pretty much constant time for the whole page regardless of the number of entries, because elements are trivial and batched inside vectors. Page eviction cost dropped from 50 us to 1 us. 
Performance evaluated with: scylla perf-simple-query -c1 -m200M --partitions=1000000 Before: ``` 7774.96 tps (166.0 allocs/op, 521.7 logallocs/op, 54.0 tasks/op, 802428 insns/op, 430457 cycles/op, 0 errors) 7511.08 tps (166.1 allocs/op, 527.2 logallocs/op, 54.0 tasks/op, 804185 insns/op, 430752 cycles/op, 0 errors) 7740.44 tps (166.3 allocs/op, 526.2 logallocs/op, 54.2 tasks/op, 805347 insns/op, 432117 cycles/op, 0 errors) 7818.72 tps (165.2 allocs/op, 517.6 logallocs/op, 53.7 tasks/op, 794965 insns/op, 427751 cycles/op, 0 errors) 7865.49 tps (165.1 allocs/op, 513.3 logallocs/op, 53.6 tasks/op, 788898 insns/op, 425171 cycles/op, 0 errors) ``` After (+318%): ``` 32492.40 tps (130.7 allocs/op, 12.8 logallocs/op, 36.1 tasks/op, 109236 insns/op, 103203 cycles/op, 0 errors) 32591.99 tps (130.4 allocs/op, 12.8 logallocs/op, 36.0 tasks/op, 108947 insns/op, 102889 cycles/op, 0 errors) 32514.52 tps (130.6 allocs/op, 12.8 logallocs/op, 36.0 tasks/op, 109118 insns/op, 103219 cycles/op, 0 errors) 32491.14 tps (130.6 allocs/op, 12.8 logallocs/op, 36.0 tasks/op, 109349 insns/op, 103272 cycles/op, 0 errors) 32582.90 tps (130.5 allocs/op, 12.8 logallocs/op, 36.0 tasks/op, 109269 insns/op, 102872 cycles/op, 0 errors) 32479.43 tps (130.6 allocs/op, 12.8 logallocs/op, 36.0 tasks/op, 109313 insns/op, 103242 cycles/op, 0 errors) 32418.48 tps (130.7 allocs/op, 12.8 logallocs/op, 36.1 tasks/op, 109201 insns/op, 103301 cycles/op, 0 errors) 31394.14 tps (130.7 allocs/op, 12.8 logallocs/op, 36.1 tasks/op, 109267 insns/op, 103301 cycles/op, 0 errors) 32298.55 tps (130.7 allocs/op, 12.8 logallocs/op, 36.1 tasks/op, 109323 insns/op, 103551 cycles/op, 0 errors) ``` When the workload is miss-only, with both row cache and index cache disabled (no cache maintenance cost): perf-simple-query -c1 -m200M --duration 6000 --partitions=100000 --enable-index-cache=0 --enable-cache=0 Before: ``` 9124.57 tps (146.2 allocs/op, 789.0 logallocs/op, 45.3 tasks/op, 889320 insns/op, 357937 cycles/op, 0 errors) 
9437.23 tps (146.1 allocs/op, 789.3 logallocs/op, 45.3 tasks/op, 889613 insns/op, 357782 cycles/op, 0 errors) 9455.65 tps (146.0 allocs/op, 787.4 logallocs/op, 45.2 tasks/op, 887606 insns/op, 357167 cycles/op, 0 errors) 9451.22 tps (146.0 allocs/op, 787.4 logallocs/op, 45.3 tasks/op, 887627 insns/op, 357357 cycles/op, 0 errors) 9429.50 tps (146.0 allocs/op, 787.4 logallocs/op, 45.3 tasks/op, 887761 insns/op, 358148 cycles/op, 0 errors) 9430.29 tps (146.1 allocs/op, 788.2 logallocs/op, 45.3 tasks/op, 888501 insns/op, 357679 cycles/op, 0 errors) 9454.08 tps (146.0 allocs/op, 787.3 logallocs/op, 45.3 tasks/op, 887545 insns/op, 357132 cycles/op, 0 errors) ``` After (+55%): ``` 14484.84 tps (150.7 allocs/op, 6.5 logallocs/op, 44.7 tasks/op, 396164 insns/op, 229490 cycles/op, 0 errors) 14526.21 tps (150.8 allocs/op, 6.5 logallocs/op, 44.8 tasks/op, 396401 insns/op, 228824 cycles/op, 0 errors) 14567.53 tps (150.7 allocs/op, 6.5 logallocs/op, 44.7 tasks/op, 396319 insns/op, 228701 cycles/op, 0 errors) 14545.63 tps (150.6 allocs/op, 6.5 logallocs/op, 44.7 tasks/op, 395889 insns/op, 228493 cycles/op, 0 errors) 14626.06 tps (150.5 allocs/op, 6.5 logallocs/op, 44.7 tasks/op, 395254 insns/op, 227891 cycles/op, 0 errors) 14593.74 tps (150.5 allocs/op, 6.5 logallocs/op, 44.7 tasks/op, 395480 insns/op, 227993 cycles/op, 0 errors) 14538.10 tps (150.8 allocs/op, 6.5 logallocs/op, 44.8 tasks/op, 397035 insns/op, 228831 cycles/op, 0 errors) 14527.18 tps (150.8 allocs/op, 6.5 logallocs/op, 44.8 tasks/op, 396992 insns/op, 228839 cycles/op, 0 errors) ``` Same as above, but with summary ratio increased from 0.0005 to 0.005 (smaller pages): Before: ``` 33906.70 tps (146.1 allocs/op, 83.6 logallocs/op, 45.1 tasks/op, 170553 insns/op, 98104 cycles/op, 0 errors) 32696.16 tps (146.0 allocs/op, 83.5 logallocs/op, 45.1 tasks/op, 170369 insns/op, 98405 cycles/op, 0 errors) 33889.05 tps (146.1 allocs/op, 83.6 logallocs/op, 45.1 tasks/op, 170551 insns/op, 98135 cycles/op, 0 errors) 33893.24 tps 
(146.1 allocs/op, 83.5 logallocs/op, 45.1 tasks/op, 170488 insns/op, 98168 cycles/op, 0 errors) 33836.73 tps (146.1 allocs/op, 83.6 logallocs/op, 45.1 tasks/op, 170528 insns/op, 98226 cycles/op, 0 errors) 33897.61 tps (146.0 allocs/op, 83.5 logallocs/op, 45.1 tasks/op, 170428 insns/op, 98081 cycles/op, 0 errors) 33834.73 tps (146.1 allocs/op, 83.5 logallocs/op, 45.1 tasks/op, 170438 insns/op, 98178 cycles/op, 0 errors) 33776.31 tps (146.3 allocs/op, 83.9 logallocs/op, 45.2 tasks/op, 170958 insns/op, 98418 cycles/op, 0 errors) 33808.08 tps (146.3 allocs/op, 83.9 logallocs/op, 45.2 tasks/op, 170940 insns/op, 98388 cycles/op, 0 errors) ``` After (+18%): ``` 40081.51 tps (148.2 allocs/op, 4.4 logallocs/op, 45.0 tasks/op, 121047 insns/op, 82231 cycles/op, 0 errors) 40005.85 tps (148.6 allocs/op, 4.4 logallocs/op, 45.2 tasks/op, 121327 insns/op, 82545 cycles/op, 0 errors) 39816.75 tps (148.3 allocs/op, 4.4 logallocs/op, 45.1 tasks/op, 121067 insns/op, 82419 cycles/op, 0 errors) 39953.11 tps (148.1 allocs/op, 4.4 logallocs/op, 45.0 tasks/op, 121027 insns/op, 82258 cycles/op, 0 errors) 40073.96 tps (148.2 allocs/op, 4.4 logallocs/op, 45.0 tasks/op, 121006 insns/op, 82313 cycles/op, 0 errors) 39882.25 tps (148.2 allocs/op, 4.4 logallocs/op, 45.0 tasks/op, 120925 insns/op, 82320 cycles/op, 0 errors) 39916.08 tps (148.3 allocs/op, 4.4 logallocs/op, 45.1 tasks/op, 121054 insns/op, 82393 cycles/op, 0 errors) 39786.30 tps (148.2 allocs/op, 4.4 logallocs/op, 45.0 tasks/op, 121027 insns/op, 82465 cycles/op, 0 errors) 38662.45 tps (148.3 allocs/op, 4.4 logallocs/op, 45.0 tasks/op, 121108 insns/op, 82312 cycles/op, 0 errors) 39849.42 tps (148.3 allocs/op, 4.4 logallocs/op, 45.1 tasks/op, 121098 insns/op, 82447 cycles/op, 0 errors) ``` Closes scylladb/scylladb#28603 * github.com:scylladb/scylladb: sstables: mx: index_reader: Optimize parsing for no promoted index case vint: Use std::countl_zero() test: sstable_partition_index_cache_test: Validate scenario of pages with sparse 
promoted index placement sstables: mx: index_reader: Amoritze partition key storage managed_bytes: Hoist write_fragmented() to common header utils: managed_vector: Use std::uninitialized_move() to move objects sstables: mx: index_reader: Keep promoted_index info next to index_entry sstables: mx: index_reader: Extract partition_index_page::clear_gently() sstables: mx: index_reader: Shave-off 16 bytes from index_entry by using raw_token sstables: mx: index_reader: Reduce allocation_section overhead during index page parsing by batching allocation sstables: mx: index_reader: Keep index_entry directly in the vector dht: Introduce raw_token test: perf_simple_query: Add 'sstable-format' command-line option test: perf_simple_query: Add 'sstable-summary-ratio' command-line option test: perf-simple-query: Add option to disable index cache test: cql_test_env: Respect enable-index-cache config (cherry picked from commit `5e7fb08bf3`) Closes scylladb/scylladb#29136	2026-03-20 01:21:49 +01:00
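The "+318%" figure quoted in d4e454b5bc can be cross-checked from the first Before/After pair of runs. The sketch below assumes the figure is the relative change of the mean tps; the message does not say how it was computed, and the helper names are hypothetical.

```python
from statistics import mean

# tps values copied from the first Before/After pair in the commit message.
before = [7774.96, 7511.08, 7740.44, 7818.72, 7865.49]
after = [32492.40, 32591.99, 32514.52, 32491.14, 32582.90,
         32479.43, 32418.48, 31394.14, 32298.55]


def improvement_pct(old, new):
    """Relative change of the mean throughput, in percent."""
    return (mean(new) / mean(old) - 1.0) * 100.0


def footprint_saving_pct(old_bytes, new_bytes):
    """Relative reduction in per-partition memory cost, in percent."""
    return (1.0 - new_bytes / old_bytes) * 100.0


if __name__ == "__main__":
    print(f"throughput: +{improvement_pct(before, after):.0f}%")
    print(f"index footprint: -{footprint_saving_pct(96, 36):.1f}%")
```

The second helper applies the same arithmetic to the stated drop from 96 to 36 bytes per partition.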
Botond Dénes	69f78ce74a	Merge 'perf-alternator: wait for alternator port before running workload' from Marcin Maliszkiewicz This patch is mostly for the purpose of running the pgo CI job. We may receive a connection error if asyncio.sleep(5) in pgo.py is not a sufficient waiting time. pgo.py does wait for a port, but only for cql; in any case, it's better to have a high-level check here than to try to wait for the alternator port there. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1071 Backport: 2026.1 - it failed on CI for that build Closes scylladb/scylladb#29063 * github.com:scylladb/scylladb: perf: add abort_source support to wait-for-port loops perf-alternator: wait for alternator port before running workload (cherry picked from commit `172c786079`) Closes scylladb/scylladb#29098	2026-03-18 13:13:01 +01:00
Patryk Jędrzejczak	3513ce6069	test: test_raft_no_quorum: decrease group0_raft_op_timeout_in_ms after quorum loss `test_raft_no_quorum.py::test_cannot_add_new_node` is currently flaky in dev mode. The bootstrap of the first node can fail due to `add_entry()` timing out (with the 1s timeout set by the test case). Other test cases in this test file could fail in the same way as well, so we need a general fix. We don't want to increase the timeout in dev mode, as it would slow down the test. The solution is to keep the timeout unchanged, but set it only after quorum is lost. This prevents unexpected timeouts of group0 operations with almost no impact on the test running time. A note about the new `update_group0_raft_op_timeout` function: waiting for the log seems to be necessary only for `test_quorum_lost_during_node_join_response_handler`, but let's do it for all test cases just in case (including `test_can_restart` that shouldn't be flaky currently). Fixes https://scylladb.atlassian.net/browse/SCYLLADB-913 Closes scylladb/scylladb#28998 (cherry picked from commit `526e5986fe`) Closes scylladb/scylladb#29068	2026-03-17 18:04:27 +01:00
Piotr Dulikowski	0ca7253315	Merge 'vector_search: fix TLS server name with IP' from Karol Nowacki SNI works only with DNS hostnames. Adding an IP address causes warnings on the server side. This change adds SNI only if it is not an IP address. This change has no unit tests, as this behavior is not critical, since it causes a warning on the server side. The critical part, that the server name is verified, is already covered. This PR also adds warning logs to improve future troubleshooting of connections to the vector-store nodes. Fixes: VECTOR-528 Backports to 2025.04 and 2026.01 are required, as these branches are also affected. Closes scylladb/scylladb#28637 * github.com:scylladb/scylladb: vector_search: fix TLS server name with IP vector_search: add warn log for failed ann requests (cherry picked from commit `23ed0d4df8`) Closes scylladb/scylladb#28964	2026-03-17 15:30:07 +01:00
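The rule described in 0ca7253315 (send SNI only when the server name is not an IP address) can be sketched as follows. This is an illustrative Python helper, not the vector_search code; `sni_server_name` is a hypothetical name.

```python
import ipaddress
from typing import Optional


def sni_server_name(host: str) -> Optional[str]:
    """Return the name to send as SNI, or None if host is an IP literal."""
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return host  # not an IP literal, so treat it as a DNS hostname
    return None  # IP literal (IPv4 or IPv6): skip SNI
```

Parsing with `ipaddress.ip_address` covers both IPv4 and IPv6 literals, which a hand-written dotted-quad check would miss.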
Piotr Dulikowski	c7ac3b5394	db: view: mutate_MV: don't hold keyspace ref across preemption Currently, the view_update_generator::mutate_MV function acquires a reference to the keyspace relevant to the operation, then it calls max_concurrent_for_each and uses that reference inside the lambda passed to that function. max_concurrent_for_each can preempt and there is no mechanism that makes sure that the keyspace is alive until the view updates are generated, so it is possible that the keyspace is freed by the time the reference is used. Fix the issue by precomputing the necessary information based on the keyspace reference right away, and then passing that information by value to the other parts of the code. It turns out that we only need to know whether the keyspace uses tablets and whether it uses a network topology strategy. Fixes: scylladb/scylladb#28925 Closes scylladb/scylladb#28928 (cherry picked from commit `42d70baad3`) Closes scylladb/scylladb#28968	2026-03-17 13:35:22 +01:00
Piotr Dulikowski	d6ed05efc1	Merge '[Backport 2026.1] mv: allow skipping view updates when a collection is unmodified' from Scylladb[bot] mv: allow skipping view updates when a collection is unmodified When we generate view updates, we check whether we can skip the entire view update if all columns selected by the view are unmodified. However, for collection columns, we only check if they were unset before and after the update. In this patch we add a check for the actual collection contents. We perform this check for both virtual and non-virtual selections. When the column is only a virtual column in the view, it would be enough to check the liveness of each collection cell, however for that we'd need to deserialize the entire collection anyway, which should be effectively as expensive as comparing all of its bytes. Fixes: SCYLLADB-996 - (cherry picked from commit `01ddc17ab9`) Parent PR: #28839 Closes scylladb/scylladb#28977 * github.com:scylladb/scylladb: Merge 'mv: allow skipping view updates when a collection is unmodified' from Wojciech Mitros mv: remove dead code in view_updates::can_skip_view_updates	2026-03-17 13:34:28 +01:00
Aleksandra Martyniuk	b307c9301d	nodetool: cluster repair: do not fail if a table was dropped nodetool cluster repair without additional params repairs all tablet keyspaces in a cluster. Currently, if a table is dropped while the command is running, all tables are repaired but the command finishes with a failure. Modify nodetool cluster repair. If a table wasn't specified (i.e. all tables are repaired), the command finishes successfully even if a table was dropped. If a table was specified and it does not exist (e.g. because it was dropped before the repair was requested), then the behavior remains unchanged. Fixes: SCYLLADB-568. Closes scylladb/scylladb#28739 (cherry picked from commit `2e68f48068`) Closes scylladb/scylladb#29006	2026-03-14 22:28:00 +02:00
Marcin Maliszkiewicz	2bd10bff5e	Merge 'test_proxy_protocol: fix flaky system.clients visibility checks' from Piotr Smaron `test_proxy_protocol_port_preserved_in_system_clients` failed because it didn't see the just created connection in system.clients immediately. The last lines of the stacktrace are: ``` # Complete CQL handshake await do_cql_handshake(reader, writer) # Now query system.clients using the driver to see our connection cql = manager.get_cql() rows = list(cql.execute( f"SELECT address, port FROM system.clients WHERE address = '{fake_src_addr}' ALLOW FILTERING" )) # We should find our connection with the fake source address and port > assert len(rows) > 0, f"Expected to find connection from {fake_src_addr} in system.clients" E AssertionError: Expected to find connection from 203.0.113.200 in system.clients E assert 0 > 0 E + where 0 = len([]) ``` Explanation: we first await for the hand-made connection to be completed, then, via another connection, we're querying system.clients, and we don't get this hand-made connection in the resultset. The solution is to replace the bare cql.execute() calls with await wait_for_results(), a helper that polls via cql.run_async() until the expected row count is reached (30 s timeout, 100 ms period). Fixes: SCYLLADB-819 The flaky test is present on master and in previous release, so backporting only there. Closes scylladb/scylladb#28849 * github.com:scylladb/scylladb: test_proxy_protocol: introduce extra logging to aid debugging test_proxy_protocol: fix flaky system.clients visibility checks (cherry picked from commit `4150c62f29`) Closes scylladb/scylladb#28951	2026-03-10 22:48:14 +02:00
Avi Kivity	1105d83893	Merge 'mv: allow skipping view updates when a collection is unmodified' from Wojciech Mitros When we generate view updates, we check whether we can skip the entire view update if all columns selected by the view are unmodified. However, for collection columns, we only check if they were unset before and after the update. In this patch we add a check for the actual collection contents. We perform this check for both virtual and non-virtual selections. When the column is only a virtual column in the view, it would be enough to check the liveness of each collection cell, however for that we'd need to deserialize the entire collection anyway, which should be effectively as expensive as comparing all of its bytes. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-808 Closes scylladb/scylladb#28839 * github.com:scylladb/scylladb: mv: allow skipping view updates when a collection is unmodified mv: allow skipping view updates if an empty collection remains unset (cherry picked from commit `01ddc17ab9`)	2026-03-10 21:27:23 +01:00
Tomasz Grabiec	a8fd9936a3	Merge 'service: assert that tables updated via group0 use schema commitlog' from Aleksandra Martyniuk Set enable_schema_commitlog for each group0 table. Assert that group0 tables use schema commitlog in ensure_group0_schema (for each command). Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-914. Needs backport to all live releases as all are vulnerable. Closes scylladb/scylladb#28876 * github.com:scylladb/scylladb: test: add test_group0_tables_use_schema_commitlog db: service: remove group0 tables from schema commitlog schema initializer service: ensure that tables updated via group0 use schema commitlog db: schema: remove set_is_group0_table param (cherry picked from commit `b90fe19a42`) Closes scylladb/scylladb#28916	2026-03-10 11:58:03 +01:00
Botond Dénes	9190d42863	Merge 'repair: Fix rwlock in compaction_state and lock holder lifecycle' from Raphael Raph Carvalho Consider this: - repair takes the lock holder - tablet merge fiber destroys the compaction group and the compaction state - repair fails - repair destroys the lock holder This is observed in the test: ``` repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036] Repair 1 out of 1 tablets: table=sec_index.users range=(432345564227567615,504403158265495551] replicas=[0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea:15, 498e354c-1254-4d8d-a565-2f5c6523845a:9, 5208598c-84f0-4526-bb7f-573728592172:28] ... repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: Started to repair 1 out of 1 tables in keyspace=sec_index, table=users, table_id=ea2072d0-ccd9-11f0-8dba-c5ab01bffb77, repair_reason=repair repair - Enable incremental repair for table=sec_index.users range=(432345564227567615,504403158265495551] table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: get_sync_boundary: got error from node=0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea, keyspace=sec_index, table=users, range=(432345564227567615,504403158265495551], error=seastar::rpc::remote_verb_error (Compaction state for table [0x60f008fa34c0] not found) compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge .... scylla[10793] Segmentation fault on shard 28, in scheduling group streaming ``` The rwlock in compaction_state could be destroyed before the lock holder of the rwlock is destroyed. This causes a use-after-free when the lock holder is destroyed. To fix it, users of the repair lock will now be waited for when a compaction group is being stopped. That way, the compaction group, which controls the lifetime of the rwlock, cannot be destroyed while the lock is held. Additionally, the merge completion fiber, which might remove groups, is properly serialized with incremental repair. The issue can be reproduced consistently using a sanitize build and cannot be reproduced after the fix. Fixes #27365 Closes scylladb/scylladb#28823 * github.com:scylladb/scylladb: repair: Fix rwlock in compaction_state and lock holder lifecycle repair: Prevent repair lock holder leakage after table drop (cherry picked from commit `509f2af8db`) Closes scylladb/scylladb#28934	2026-03-09 10:25:47 +02:00

10744 Commits