scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-21 23:32:15 +00:00

Author	SHA1	Message	Date
Ernest Zaslavsky	b992ead4bb	sstables_loader: hold token_metadata_ptr to prevent use-after-free in tablet_restore_task_impl::run() `tablet_restore_task_impl::run()` iterates `topo.get_datacenter_racks().at(dc) | std::views::keys` in a range-based for loop that contains a `co_await` in its body. The original code obtained `topo` as a raw `const auto&` by dereferencing the temporary `lw_shared_ptr` returned from `get_token_metadata_ptr()`. Because no copy of the `lw_shared_ptr` was kept, the ref count did not stay elevated: const auto& topo = db.get_token_metadata().get_topology(); // ↑ temporary lw_shared_ptr — destroyed immediately When the coroutine suspends at the inner `co_await`, a Raft topology update can run on the reactor, calling `shared_token_metadata::set()` with a new token_metadata. The old token_metadata's refcount then drops to 0, its destructor fires `clear_and_dispose_impl()`, and `topology::clear_gently()` is scheduled as a background fire-and-forget future. That future erases nodes from `_dc_racks` one-by-one with yields between batches. When the coroutine resumes and the range-based for loop advances its iterator, the iterator's hash-node pointer references freed memory: AddressSanitizer: heap-use-after-free sstables_loader.cc:1204:31 in sstables_loader::tablet_restore_task_impl::run() (.resume) Fix: store the `token_metadata_ptr` (`lw_shared_ptr<token_metadata>`) in a local variable on the coroutine frame. Because coroutine locals survive across suspension points, the elevated ref count keeps the token_metadata (and its topology) alive for the duration of the loop, making the existing range iteration safe without any data copying. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-2149 Closes scylladb/scylladb#29995	2026-05-22 01:10:25 +02:00
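The lifetime rule behind this fix can be illustrated with a minimal sketch in standard C++. It assumes `std::shared_ptr` as a stand-in for Seastar's `lw_shared_ptr`, and the names `metadata`, `metadata_holder` and `survives_update` are hypothetical, not ScyllaDB APIs. The update is simulated by a plain call rather than a real coroutine suspension.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for token_metadata: just some data to iterate over.
struct metadata {
    std::map<std::string, int> dc_racks;
};

// Hypothetical stand-in for shared_token_metadata.
struct metadata_holder {
    std::shared_ptr<const metadata> current = std::make_shared<metadata>();

    // Returns an owning pointer by value, like a get_..._ptr() accessor.
    std::shared_ptr<const metadata> get() const { return current; }

    // Installs a new version; the old one is destroyed once its last owner lets go.
    void set(std::shared_ptr<const metadata> next) {
        assert(next != nullptr);
        current = std::move(next);
    }
};

// Returns true if the version observed before the update is still alive after it.
// keep_owner == false models the buggy code (only a temporary pointer existed),
// keep_owner == true models the fix (an owning pointer kept in a local variable).
bool survives_update(metadata_holder& holder, bool keep_owner) {
    std::weak_ptr<const metadata> observer = holder.get();  // non-owning observer
    std::shared_ptr<const metadata> owner;
    if (keep_owner) {
        owner = holder.get();  // keeps the ref count elevated across the update
    }
    holder.set(std::make_shared<metadata>());  // simulates the concurrent update
    return !observer.expired();
}
```

With `keep_owner` set to false the old version is destroyed by `set()`, so any reference or iterator into it would dangle. With `keep_owner` set to true the old version stays alive until `owner` goes out of scope.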
Avi Kivity	305346a3ec	Merge 'Don't materialize collections into intermediate representations' from Botond Dénes Collections have an age-old problem in ScyllaDB: they had to be unserialized into an intermediate representation for any access or manipulation. The intermediate representation needs effort to produce and also requires additional memory to store. Both can be significant for large collections. This intermediate representation is then either discarded immediately after use, or re-serialized again. This problem was significant enough for us to consider the use of collections as somewhat of an anti-pattern. But our customers keep using it. Alternator is also a heavy user of collections. This PR aims to solve this problem once and for all. The plan is as follows: * Promote direct use of the serialized collection format: - Add accessor methods to `collection_mutation_view` which read from the serialized format directly: `tomb()`, `size()` and `begin()`/`end()`. - Add a `collection_mutation_writer` which provides container semantics for generating a serialized `collection_mutation` directly on the go (`push_back()`). * Replace all usage of `collection_mutation_description`, `collection_mutation_view_description` and friends with use of the new infrastructure. * Drop the old infrastructure, to avoid accidental regressions. Continues the work started by https://github.com/scylladb/scylladb/pull/29033 and takes it to its conclusion. 
To help focus review, here is a summary of the patches: * [1, 2] preparatory refactoring: drop some unused abstract_type params * [3, 6] introduce new infrastructure to write and read serialized collections directly; this is the meat of the PR * [6, -1) replace all usage of old materializing infrastructure with usage of the new one * [-1] drop old infrastructure Command: ``` dbuild -it -- build/release/scylla perf-simple-query --collection=16 -c1 -m2G --default-log-level=error ``` \| Metric \| Before \| After \| Change \| \|--------------------------\|--------:\|--------:\|------------\| \| Throughput (median tps) \| 315,760 \| 332,021 \| +5.1% \| \| Instructions/op (median) \| 53,776 \| 48,681 \| -9.5% \| \| CPU cycles/op (median) \| 17,365 \| 16,471 \| -5.1% \| \| Allocations/op \| 85.1 \| 82.1 \| -3.5% \| Significant improvement. Throughput is up ~5%, and both instruction count and cycle count are meaningfully reduced. --- Command: ``` dbuild -it -- build/release/scylla perf-simple-query --collection=16 -c1 -m2G --default-log-level=error --write ``` \| Metric \| Before \| After \| Change \| \|--------------------------\|----------:\|---------:\|-----------\| \| Throughput (median tps) \| 150,823 \| 149,678 \| -0.8% \| \| Instructions/op (median) \| 108,388 \| 103,858 \| -4.2% \| \| CPU cycles/op (median) \| 34,860 \| 35,371 \| +1.5% \| \| Allocations/op \| ~105–108 \| ~102–103 \| -3.0% \| Mixed, mostly neutral. Throughput is essentially flat (within noise). Instructions/op improved by ~4%, allocations dropped slightly, but cycles/op edged up marginally. 
--- Command: ``` dbuild -it -- build/release/scylla perf-alternator --workload write --developer-mode=1 --alternator-port=8000 --alternator-write-isolation=unsafe -c1 -m2G --default-log-level=error ``` \| Metric \| Before \| After \| Change \| \|--------------------------\|--------:\|-------:\|-----------\| \| Throughput (median tps) \| 55,777 \| 56,051 \| +0.5% \| \| Instructions/op (median) \| 246,215 \|246,610 \| +0.2% \| \| CPU cycles/op (median) \| 77,641 \| 77,020 \| -0.8% \| \| Allocations/op \| 340.4 \| 335.4 \| -1.5% \| Essentially neutral. All metrics are within noise margins. Slight reduction in allocations and cycles, negligible otherwise. --- The change has a clear, substantial positive effect on reads (~5% throughput gain, ~9.5% fewer instructions per op). The write and alternator paths are unaffected in practice — changes there are within measurement noise. No regressions are apparent. This is expected: https://github.com/scylladb/scylladb/pull/29033 did the heavy lifting when it comes to the write path, this PR finishes the job, mostly improving reads. Fixes: #3602 Improvement, no backport. 
Closes scylladb/scylladb#29127 * github.com:scylladb/scylladb: mutation/collection_mutation: make collection_mutation::_data private mutation_collection: drop collection_mutation_description and friends test: move away from collection_mutation_description tree: move away from collection_mutation_description test: move away from collection_mutation_view::with_deserialized() tree: move away from collection_mutation_view::with_deserialized() types: fix indendation, left broken by previous commit types: move away from collection_mutation_view::with_deserialized() types: serialize_for_cql(): use throwing_assert() instead of SCYLLA_ASSERT() schema: column_computation: move away from collection_mutation_view::with_deserialized() mutation: move away from collection_mutation_view::with_deserialized() alternator: move away from collection_mutation_view::with_deserialized() cdc: move away from collection_mutation_view::with_deserialized() mutation/collection_mutation: printer: don't deserialize collections mutation/collection_mutation: difference(): don't deserialize collections mutation/collection_mutation: merge(): don't deserialize collections mutation/collection_mutation: extract compact_and_expire() to free function mutation/collection_mutation: refactor empty(), is_any_live() and last_update() compaction_garbage_collector: pass collection_mutation to collect() test/boost/mutation_test: add tests for collection_mutation_{view,writer} mutation/collaction_mutation: collection_mutation_view: add methods to inspect content mutation/collection_mutation: add collection_mutation_writer mutation/collection_mutation: collection_mutation(): generate valid collection mutation/collection_mutation: collection_mutation(): remove unused abstract_type param mutation/atomic_cell: drop unused type param from from_bytes()	2026-05-21 17:10:40 +03:00
Patryk Jędrzejczak	1ed3f5c4af	Merge 'storage_service: cancel write handlers during drain to prevent shutdown deadlock' from Petr Gusev Fixes a shutdown deadlock where a node hangs because `stale_versions_in_use()` blocks on stale `token_metadata` versions held by write handlers whose `MUTATION_DONE` responses can never arrive (transport is already stopped). Two manifestations depending on whether the shutting-down node is the topology coordinator: - Coordinator: do_drain → wait_for_group0_stop deadlocks because the topology coordinator fiber is stuck in barrier_and_drain → stale_versions_in_use(). - Non-coordinator: ss::stop → uninit_messaging_service deadlocks because the barrier_and_drain RPC handler holds the gate open. The non-coordinator case was fixed in PR #24714 (cancel all write requests on storage_proxy shutdown), but its test never actually failed — the write handler always captured the current token_metadata version because `pause_before_barrier_and_drain` used `one_shot=True,` so only the first `barrier_and_drain` was paused. The topology state hadn't advanced by that point, meaning the write handler's ERM version matched the current version and `stale_versions_in_use()` returned immediately. The coordinator case was not covered at all. Cancel all write response handlers on all shards right after `stop_transport()` in `do_drain()`. This releases their ERMs and the associated stale token_metadata versions, unblocking `stale_versions_in_use()`. Fixed the test to ensure the write handler holds a stale version: use one_shot=False, let the first barrier_and_drain through (version still current), then wait for the second one (version now stale). Extended to cover both coordinator and non-coordinator shutdown on the same 2-node cluster. 
Also includes supporting changes: - error_injection: release wait_for_message waiters on disable() so the test can atomically unblock paused handlers - error_injection: add non-shared mode to wait_for_message for per-invocation message semantics - scylla_cluster.py: allow stop() to bypass start_stop_lock so SIGKILL works while stop_gracefully is blocked Fixes: SCYLLADB-1842 Refs: scylladb/scylladb#23665 backports: SCYLLADB-1842 reported a failure in 2025.1, so we need to backport to all versions starting from 2025.1 Closes scylladb/scylladb#29882 * https://github.com/scylladb/scylladb: storage_service: cancel write handlers during drain to prevent shutdown deadlock test_unfinished_writes_during_shutdown: extend to cover coordinator shutdown test_unfinished_writes_during_shutdown: fix to reproduce the shutdown deadlock test_unfinished_writes_during_shutdown: await add_last_node_task instead of cancelling it test_unfinished_writes_during_shutdown: add timeout and deadlock detection for shutdown_task test: scylla_cluster: allow stop() to bypass start_stop_lock error_injection: add non-shared mode to wait_for_message error_injection: release waiters when injection is disabled	2026-05-21 15:43:36 +02:00
Piotr Dulikowski	6148316f66	Merge 'db/view/view_building_coordinator: add flag to mark if any remote work was finished' from Michał Jadwiszczak There is a small window, just after the view building coordinator releases the group0 guard and before it waits on view_building_state_machine's CV, in which the coordinator may miss a CV broadcast triggered by finished remote work. To fix it, this patch adds a boolean flag, which is set to true before broadcasting the CV and is checked before awaiting on the CV. Fixes SCYLLADB-2029 The problem is not critical but it should be backported to 2025.4 and newer versions, all of which contain the view building coordinator. Closes scylladb/scylladb#27313 * github.com:scylladb/scylladb: test/cluster/test_view_building_coordinator: add reproducer db/view/view_building_coordinator: add flag to mark if any remote work was finished	2026-05-21 15:11:58 +02:00
Pavel Emelyanov	4b13b24695	Merge 's3: make S3 connection pool size configurable per scheduling group' from Ernest Zaslavsky The S3 client creates a separate HTTP connection pool per scheduling group. Previously, the pool size was hardcoded as shares/100, yielding 1-10 connections. This was not tunable and could under-provision connections for groups with low share counts. Changes - A missing include (short_streams.hh) in sstables_loader.cc is added first to fix CMake builds where the header is not transitively included. - The hardcoded per-share divisor is replaced with a per-shard connection budget. The new `object_storage_connections_per_shard` config option (default 128) specifies the total number of connections available on each shard. Connections are distributed proportionally across scheduling groups based on their shares: `max_connections = budget * group_shares / total_shares`. Remainder connections are assigned to the group with the most shares. When a new scheduling group client is created, all existing groups are rebalanced via `set_maximum_connections`. Creation and rebalance are serialized with a semaphore to prevent concurrent rebalances from racing. - The config option is made live-updateable: a `storage_manager` observer propagates changes to all existing S3 clients, triggering rebalance under the same semaphore. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1704 No backport needed since this change affects KS on object storage which is not operational yet. Closes scylladb/scylladb#29719 * github.com:scylladb/scylladb: s3: make connections_per_shard live-updateable s3: distribute connection pool proportionally across scheduling groups	2026-05-21 12:12:36 +03:00
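The proportional split described in this merge (`max_connections = budget * group_shares / total_shares`, with the remainder going to the group with the most shares) can be sketched as follows. This is an illustration under assumptions, not the S3 client code: the function name `distribute_connections` and the use of plain vectors are hypothetical, and ties between equally large groups are resolved in favor of the first one, which the commit message does not specify.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Splits `budget` connections across groups in proportion to their shares.
// Integer division rounds each group's portion down; the leftover connections
// are given to the group with the most shares.
std::vector<unsigned> distribute_connections(unsigned budget,
                                             const std::vector<unsigned>& shares) {
    std::vector<unsigned> result(shares.size(), 0);
    unsigned long long total = 0;
    for (unsigned s : shares) {
        total += s;
    }
    if (total == 0) {
        return result;  // no groups, or all groups have zero shares
    }
    unsigned assigned = 0;
    std::size_t largest = 0;
    for (std::size_t i = 0; i < shares.size(); ++i) {
        // Widen before multiplying so budget * shares cannot overflow.
        result[i] = static_cast<unsigned>(
            static_cast<unsigned long long>(budget) * shares[i] / total);
        assigned += result[i];
        if (shares[i] > shares[largest]) {
            largest = i;
        }
    }
    assert(assigned <= budget);
    result[largest] += budget - assigned;  // hand out the remainder
    return result;
}
```

For example, a budget of 128 split across groups with 1000, 200 and 100 shares gives rounded-down portions of 98, 19 and 9; the 2 leftover connections go to the first group, for a final split of 100, 19 and 9.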
Andrzej Jackowski	f8156702de	tree: add missing -present to copyright headers ~2076 files used "Copyright (C) YYYY-present ScyllaDB" while ~88 files used "Copyright (C) YYYY ScyllaDB". This inconsistency leads to unnecessary code review discussions and gradual spread of the less common format. Standardize all ScyllaDB copyright headers to use -present. Fixes SCYLLADB-1984 Closes scylladb/scylladb#29876	2026-05-21 10:57:42 +02:00
Wojciech Mitros	13c043903d	strong_consistency: cache leader location for non-replica nodes When a non-replica node handles a strongly consistent write, it must forward the request to a replica. If the closest replica is not the leader, the request gets redirected again, causing an extra roundtrip. Add a leader location cache in groups_manager, keyed by raft group_id. After a write request is forwarded, the CQL transport layer records the final node as the leader in the cache. Subsequent write requests from the same node for the same group are forwarded directly to the cached leader, eliminating the extra roundtrip. The cache is only used for writes. Reads can be served by any replica, so they skip the cache and use proximity-based routing instead. Cache entries are validated at use time: if the cached leader is no longer a replica (e.g. after tablet migration), the entry is evicted and the normal closest-replica path is taken. This prevents a scenario where two nodes keep redirecting to each other because both think that the other is the leader but actually both are non-replicas - such loop is broken as soon as the tablet maps are updated. On token_metadata updates, entries for groups that no longer exist (e.g. table dropped, tablet merged) are evicted. Entries for groups that still exist are kept — use-time validation handles staleness. An on_node_resolved callback is propagated through the redirect/bounce path so the transport layer can update the cache generically without coupling to the strong-consistency coordinator. The coordinator creates the callback only for writes (capturing the groups_manager and group_id) and attaches it to the bounce message; the transport layer invokes it once the final node is known, keeping the forwarding infrastructure subsystem-agnostic. We also add a test which verifies that after the initial redirect, following requests to the same node avoid the extra redirect and forward directly to the leader. 
Fixes: SCYLLADB-1064 Closes scylladb/scylladb#29392	2026-05-21 10:32:56 +02:00
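The cache behavior described in this commit (record the leader after a forward, validate the entry at use time, evict entries for groups that no longer exist) can be sketched with standard containers. The sketch assumes integer IDs and a replica set passed in by the caller; `leader_cache`, `record`, `lookup` and `retain_groups` are hypothetical names, not the groups_manager interface.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <set>
#include <unordered_map>

using group_id = int;  // hypothetical stand-in for the raft group_id
using node_id = int;   // hypothetical stand-in for a host identifier

class leader_cache {
    std::unordered_map<group_id, node_id> _leaders;
public:
    // Called once the final node that handled a forwarded write is known.
    void record(group_id g, node_id leader) { _leaders[g] = leader; }

    // Use-time validation: the cached leader is returned only if it is still
    // a replica; otherwise the entry is evicted and the caller falls back to
    // the closest-replica path.
    std::optional<node_id> lookup(group_id g, const std::set<node_id>& replicas) {
        auto it = _leaders.find(g);
        if (it == _leaders.end()) {
            return std::nullopt;
        }
        if (replicas.count(it->second) == 0) {
            _leaders.erase(it);
            return std::nullopt;
        }
        return it->second;
    }

    // On a metadata update, drop entries for groups that no longer exist.
    // Entries for surviving groups are kept even if they might be stale.
    void retain_groups(const std::set<group_id>& live_groups) {
        for (auto it = _leaders.begin(); it != _leaders.end();) {
            if (live_groups.count(it->first) == 0) {
                it = _leaders.erase(it);
            } else {
                ++it;
            }
        }
    }

    std::size_t size() const { return _leaders.size(); }
};
```

Validating at lookup time rather than on every metadata update keeps the update path cheap, and a stale entry costs at most one fallback to the normal routing path.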
Gleb Natapov	cc034f84c5	schema: ensure committed_by_group0 is set for all non-system tables on boot Tables created before the GROUP0_SCHEMA_VERSIONING feature was enabled have committed_by_group0 = null in system_schema.scylla_tables. This causes maybe_delete_schema_version() to delete their version cell, forcing the legacy hash-based schema version computation path. Add ensure_committed_by_group0() which runs on boot and fixes up any non-system tables where committed_by_group0 is not true (null or false): 1. Queries system_schema.scylla_tables for rows where committed_by_group0 is null or false, skipping system keyspaces (system, system_schema). 2. Takes a group0 guard 3. Re-checks after the raft barrier in case another node already fixed it. 4. For each table needing fixup, creates a mutation writing the version cell (from the in-memory schema). The committed_by_group0 = true flag is stamped by add_committed_by_group0_flag() inside announce(). 5. Announces via raft group0. 6. Retries with a small random delay on group0_concurrent_modification. On other nodes, schema_applier will detect these as "altered" tables (scylla_tables mutation changed), but since the actual table definition is unchanged, update_column_family is effectively a no-op. This is a prerequisite for eventually removing the legacy hash-based schema versioning code path. Closes scylladb/scylladb#29911	2026-05-21 10:22:07 +02:00
Patryk Jędrzejczak	cbadc3d675	test: fix flaky test_raft_snapshot_truncation by waiting for async log truncation Snapshot creation and raft log truncation happen asynchronously in the IO fiber after a schema change completes. The test was querying system.raft immediately after the schema change returned, racing with the IO fiber's store_snapshot_descriptor call. Replace immediate assertions with wait_for polling loops: - log_size == 0: wait for log truncation after drop keyspace - new_snap_id != original_snap_id: wait for new snapshot to be persisted Fixes: SCYLLADB-2120 Closes scylladb/scylladb#29967	2026-05-21 10:50:00 +03:00
Botond Dénes	d1cb102bd2	mutation/collection_mutation: make collection_mutation::_data private Nobody should be looking at the raw data storage directly. The collection can be inspected via collection_mutation_view. Added a data() && accessor, to be able to extract the raw data for storage in atomic_cell_or_collection.	2026-05-21 10:36:59 +03:00
Artsiom Mishuta	2259307c2e	test.py: remove redundant pytest.mark.asyncio decorators Fixes: SCYLLADB-1935	2026-05-21 10:36:47 +03:00
Botond Dénes	076b55fdc0	mutation_collection: drop collection_mutation_description and friends Drop: * collection_mutation_description * collection_mutation_view_description * collection_mutation_input_stream * collection_mutation_view::with_deserialized() * All serialized/deserialize methods for/from collection_mutation_description All these are now unused.	2026-05-21 10:23:29 +03:00
Botond Dénes	da7903de79	test: move away from collection_mutation_description Use collection_mutation_writer instead.	2026-05-21 10:23:29 +03:00
Botond Dénes	636e2877e2	tree: move away from collection_mutation_description Use collection_mutation_writer instead. Add to_managed_bytes() to cql3::raw_value to help avoid some copies. A special note for sstables/kl/reader.cc: this conversion is not straightforward, so we accumulate a list of cells and feed them to the writer at the end. This is sub-optimal, but this code is rarely used, so it is best to be conservative.	2026-05-21 10:23:29 +03:00
Botond Dénes	c76ab90fb2	test: move away from collection_mutation_view::with_deserialized() Use the collection_mutation_view directly.	2026-05-21 10:23:29 +03:00
Botond Dénes	35a776d043	tree: move away from collection_mutation_view::with_deserialized() Use the collection_mutation_view directly. This is the remainder after the previous patches collecting larger changes by module.	2026-05-21 10:23:29 +03:00
Botond Dénes	7815ec6f83	types: fix indendation, left broken by previous commit	2026-05-21 10:23:29 +03:00
Botond Dénes	76c2e1c5f3	types: move away from collection_mutation_view::with_deserialized() Use the collection_mutation_view directly.	2026-05-21 10:23:29 +03:00
Botond Dénes	4f442d13bd	types: serialize_for_cql(): use throwing_assert() instead of SCYLLA_ASSERT() Good practice in general. Also prepares the ground for calling serialize_for_cql() from serialize_for_cql_with_timestamps(). The latter already switched to throwing_assert(), avoid regressing to a crash.	2026-05-21 10:23:29 +03:00
Botond Dénes	73564acfa6	schema: column_computation: move away from collection_mutation_view::with_deserialized() Use the collection_mutation_view directly. Also use managed_bytes_view for collection keys to match collection_mutation_view::iterator::value_type and to avoid unnecessary key linearization.	2026-05-21 10:23:29 +03:00
Botond Dénes	9501c2a57e	mutation: move away from collection_mutation_view::with_deserialized() Use the collection_mutation_view directly.	2026-05-21 10:23:29 +03:00
Botond Dénes	16da8103ce	alternator: move away from collection_mutation_view::with_deserialized() Use the collection_mutation_view directly.	2026-05-21 10:23:28 +03:00
Botond Dénes	727cdeb951	cdc: move away from collection_mutation_view::with_deserialized() Use the collection_mutation_view directly. Requires changing key type from bytes_view to managed_bytes_view, to match collection_mutation_view::iterator::value_type and more importantly to avoid unnecessary key linearization.	2026-05-21 10:23:28 +03:00
Botond Dénes	e8a23362c7	mutation/collection_mutation: printer: don't deserialize collections Work with collection_mutation_view directly instead.	2026-05-21 10:23:28 +03:00
Botond Dénes	5ded1ebfc2	mutation/collection_mutation: difference(): don't deserialize collections Work with collection_mutation_view directly instead. Use collection_mutation_writer to generate the serialized output directly, instead of accumulating cells in temporary storage.	2026-05-21 10:23:28 +03:00
Botond Dénes	019fd7af2b	mutation/collection_mutation: merge(): don't deserialize collections Work with collection_mutation_view directly instead. Use collection_mutation_writer to generate the serialized output directly, instead of accumulating cells in temporary storage.	2026-05-21 10:23:28 +03:00
Botond Dénes	7c8b5681f4	mutation/collection_mutation: extract compact_and_expire() to free function The new free-function variant operates on a collection_mutation_view directly, instead of on collection_mutation_description.	2026-05-21 10:23:15 +03:00
Botond Dénes	f8ac8540bd	Merge 'logstor: compare records by timestamp and segment sequence number' from Michael Litvak Add the record timestamp. The timestamp is extracted from the row marker of the mutation when we write it. When inserting a record into the index, we compare it with the existing record, and insert it only if it has a newer timestamp. Add a segment sequence number, a global (per-shard) increasing number that is allocated when getting a new segment for writing and is written in the buffer headers of the segment. It is used to distinguish between buffers written to different generations of a segment, and during recovery to break ties by keeping the record from the newest segment. Refs https://scylladb.atlassian.net/browse/SCYLLADB-770 no backport - logstor is a new feature Closes scylladb/scylladb#29933 * github.com:scylladb/scylladb: test: logstor: add basic delete test logstor: rewrite segment seq num from streaming logstor: add segment sequence number logstor: get_segment helper logstor: compare records by timestamp	2026-05-21 08:44:18 +03:00
Botond Dénes	f783869440	mutation/collection_mutation: refactor empty(), is_any_live() and last_update() Use the new infrastructure introduced in the previous commit: size(), tomb() and iterator to streamline the code of these pre-existing methods.	2026-05-21 08:34:21 +03:00
Botond Dénes	7be054dda9	compaction_garbage_collector: pass collection_mutation to collect() Change the collection overload of compaction_garbage_collector::collect() to accept a collection_mutation (a serialized blob) instead of collection_mutation_description (an owned, deserialized representation). Change collection_mutation_description::compact_and_expire() (the only user) to build a serialized collection_mutation, with the new collection_mutation_writer directly. Allows dropping collection_mutation_description from the compaction_garbage_collector interface, getting us one step closer to removing it. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-21 08:34:21 +03:00
Botond Dénes	c5d12d44c6	test/boost/mutation_test: add tests for collection_mutation_{view,writer} Test the new facilities for producing and inspecting serialized collection mutations directly, without intermediate formats.	2026-05-21 08:34:21 +03:00
Botond Dénes	6a729b4e9e	mutation/collaction_mutation: collection_mutation_view: add methods to inspect content Add size, tombstone getters and begin()/end() to iterate over the cells. Allows inspecting the content of the collection in-place, without deserializing into an intermediate representation (collection_mutation_description[_view]). This will be used to gradually replace all usage of collection_mutation_description[_view]. Deliberately avoiding the use of collection_mutation_input_stream, as that one is also a target for elimination. The names for the accessors are tomb() and size(), following existing conventions in mutation/. Also rename is_empty() -> empty() to align with this convention. There is only a single caller to update. All accessors (new or pre-existing) are made to work with a default-constructed `collection_mutation` (i.e. one containing an empty buffer). No users yet.	2026-05-21 08:34:21 +03:00
Botond Dénes	11d7128bf6	mutation/collection_mutation: add collection_mutation_writer Create a serialized collection element-by-element, with size not known up-front. Appended elements are written into the buffer directly. Allows replacing the current practice of first creating a vector of cells, then serializing at the end.	2026-05-21 08:34:21 +03:00
Botond Dénes	b5a301e78a	mutation/collection_mutation: collection_mutation(): generate valid collection Currently the collection_mutation default constructor just default-initializes the underlying managed_bytes, which is not a valid collection mutation value. This is an accident waiting to happen, because attempting to deserialize or interpret this value would result in buffer overflow. Generate a valid collection value in the default constructor. This is just 5 bytes, small enough to fit into managed_byte's small buffer, so it even avoids allocation.	2026-05-21 08:34:21 +03:00
Botond Dénes	24fdfa34dd	mutation/collection_mutation: collection_mutation(): remove unused abstract_type param	2026-05-21 08:34:21 +03:00
Botond Dénes	6926375c4d	mutation/atomic_cell: drop unused type param from from_bytes() The abstract_type parameter was needed by the old IMR-based atomic-cell serialization format. When we reverted to the simpler managed_bytes representation, the parameter became a no-op but was never removed. Drop it from all overloads of atomic_cell_view::from_bytes() and atomic_cell_mutable_view::from_bytes() and update all call sites. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-21 08:34:21 +03:00
Petr Gusev	2927f0dd21	storage_service: cancel write handlers during drain to prevent shutdown deadlock When a node shuts down, do_drain() calls stop_transport() which tears down the messaging service. After this point, MUTATION_DONE responses from replicas can no longer reach the coordinator, so any in-flight write_response_handlers will never complete naturally. These handlers hold ERMs referencing stale token_metadata versions. If the topology coordinator calls barrier_and_drain (either on itself or via RPC), it blocks in stale_versions_in_use() waiting for these stale versions to be released. This causes: - On the coordinator node: do_drain -> wait_for_group0_stop deadlock (the topology coordinator fiber is stuck in barrier_and_drain). - On non-coordinator nodes: ss::stop -> uninit_messaging_service deadlock (the barrier_and_drain RPC handler holds the gate open). Fix: cancel all write response handlers on all shards right after stop_transport() in do_drain(). This releases their ERMs and the associated stale token_metadata versions, unblocking stale_versions_in_use(). Heap-allocate _write_handlers_gate and add an allow_new parameter to cancel_all_write_response_handlers(). When allow_new=true (used by do_drain), the gate is closed and swapped with a fresh one — existing handlers are waited on while new handlers can still be created. This avoids blocking internal writes (paxos learn, compaction history updates) that still need to create handlers during the remainder of the drain sequence. When allow_new=false (used by drain_on_shutdown), the gate is closed permanently — no new handlers can be created after final shutdown. Update test_lwt_shutdown to wait for 'Stop transport: done' instead of 'Shutting down storage proxy RPC verbs'. 
The latter message is now only logged after do_drain() completes, but do_drain() blocks in cancel_all_write_response_handlers() waiting for the background paxos learn handler — which is exactly what the test needs to release before shutdown can proceed. Fixes: SCYLLADB-1842 Refs: scylladb/scylladb#23665	2026-05-20 22:21:45 +02:00
Ernest Zaslavsky	86f678a592	s3: make connections_per_shard live-updateable Wire the object_storage_connections_per_shard config option as LiveUpdate so it can be changed at runtime without restart. When the value changes, the storage_manager observer propagates it to all existing S3 clients, which rebalance their connection pools under the rebalance semaphore.	2026-05-20 20:45:14 +03:00
Ernest Zaslavsky	b9e1dcc0fe	s3: distribute connection pool proportionally across scheduling groups The S3 client creates a separate HTTP connection pool per scheduling group. Previously, the pool size was computed per-group using a per-share multiplier (connections = shares * multiplier), which did not account for the total number of groups sharing the shard's connection budget. Replace the per-share multiplier with a per-shard connection budget: the new object_storage_connections_per_shard config option (default 100) specifies the total number of connections available on each shard. When a new scheduling group's client is created, connections are distributed proportionally across all groups based on their shares (connections = budget * group_shares / total_shares), and existing groups are rebalanced via set_maximum_connections. When the endpoint_config has an explicit max_connections override, it is used directly without proportional distribution.	2026-05-20 20:45:04 +03:00
Petr Gusev	5bc3e84d1e	test_unfinished_writes_during_shutdown: extend to cover coordinator shutdown The existing test only covers the case where the shutting-down node is NOT the topology coordinator (deadlocks in uninit_messaging_service). When the node IS the coordinator, the deadlock manifests differently: the topology coordinator fiber calls barrier_and_drain on itself (without messaging), and do_drain -> wait_for_group0_stop blocks because the coordinator can't stop while stale_versions_in_use is waiting on the uncancelled write handler. Run the test twice on the same 2-node cluster (RF=2): - Run 1: target is a non-coordinator - Restore cluster state (restart target, decommission added node) - Run 2: target is the topology coordinator Use CL=ONE so the write completes from the local replica even with the other server's response paused. Mark as xfail since this reproduces bugs not yet fixed on this branch. Refs: SCYLLADB-1842	2026-05-20 17:22:23 +02:00
Petr Gusev	a093be9ca9	test_unfinished_writes_during_shutdown: fix to reproduce the shutdown deadlock The test was written for another case, and was not supposed to reproduce the issue that was fixed in this PR. Fix the test to reproduce the real scenario: 1. Use one_shot=False for pause_before_barrier_and_drain so the injection fires on every barrier_and_drain RPC, not just the first. 2. Let the first barrier_and_drain through (at this point the write handler's ERM version matches the current token_metadata version). 3. Wait for the second barrier_and_drain. Between the two calls, topology_state_load installs a new token_metadata version. The write handler still holds the old version's ERM — now stale. 4. After stop_transport completes, disable the injection (rather than sending a single message) to release the paused handler and any subsequent ones that arrived during stop_transport. The 'disabled' flag in injection_shared_data ensures all waiters wake up. With these changes the test reliably fails (shutdown deadlock within 15s) on the unfixed code and passes on the fixed version from `e0dc73f52a` ('Cancel all write requests on storage_proxy shutdown'). Refs: scylladb/scylladb#23665	2026-05-20 17:22:23 +02:00
Petr Gusev	32002f6443	test_unfinished_writes_during_shutdown: await add_last_node_task instead of cancelling it asyncio cancel() only affects the client-side coroutine. The server-side addserver handler in the cluster manager continues running. If it can't complete (e.g. no raft quorum because the target node is shut down), the orphaned handler blocks _after_test cleanup for 120s. Await the task instead so it completes cleanly (we restart the target node first to restore quorum).	2026-05-20 17:22:16 +02:00
Petr Gusev	fa01f74ae6	test_unfinished_writes_during_shutdown: add timeout and deadlock detection for shutdown_task Add a 15s timeout around the shutdown_task await. If the timeout fires, the deadlock is reproduced (shutdown hung because stale_versions_in_use blocks on a write handler holding a stale token_metadata version). When the timeout fires, explicitly kill the node via server_stop() so that the manager's _after_test handler does not wait 120s for the stuck stop_gracefully request. Then fail the test with a clear message.	2026-05-20 17:21:56 +02:00
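The timeout-and-kill pattern described in this commit can be sketched roughly as follows. This is a minimal illustration, not the test's actual code: `await_shutdown`, `kill_node` and `demo_timeout` are hypothetical names, and the use of `asyncio.shield()` is an assumption made here so that the timeout does not cancel the shutdown task itself.

```python
import asyncio

async def await_shutdown(shutdown_task, kill_node, timeout=15.0):
    """Wait for a graceful shutdown; on timeout, kill the node and fail."""
    try:
        # shield() (an assumption of this sketch) keeps the timeout from
        # cancelling the shutdown task itself.
        await asyncio.wait_for(asyncio.shield(shutdown_task), timeout)
    except asyncio.TimeoutError:
        # Kill the stuck node explicitly so later cleanup does not wait on it.
        await kill_node()
        raise AssertionError(
            f"shutdown did not finish within {timeout}s: deadlock reproduced")

async def demo_timeout():
    stuck = asyncio.Event()
    killed = []

    async def hung_shutdown():
        # Stands in for a graceful stop that never completes on its own.
        await stuck.wait()

    async def kill_node():
        # Stands in for a forced stop; killing the node unblocks the shutdown.
        killed.append(True)
        stuck.set()

    task = asyncio.create_task(hung_shutdown())
    try:
        await await_shutdown(task, kill_node, timeout=0.05)
    except AssertionError as exc:
        message = str(exc)
    else:
        message = ""
    await task
    return killed, message
```

In the demo the shutdown only finishes after the forced kill, and the caller sees a clear failure message instead of a long cleanup stall.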
Petr Gusev	c88120abca	test: scylla_cluster: allow stop() to bypass start_stop_lock Remove the @stop_event and @start_stop_lock decorators from ScyllaServer.stop() so it can SIGKILL a server even while stop_gracefully() holds the lock (e.g. the node is deadlocked during shutdown and stop_gracefully is blocked on cmd.wait()). A local copy of self.cmd is used because there are await points after which another coroutine (stop_gracefully) may set self.cmd to None. The concurrent stop_gracefully() unblocks once the process dies from SIGKILL since its cmd.wait() returns. Also make shutdown_control_connection a plain (non-async) function since it contains no await points — this makes it obvious that no coroutine interleaving is possible inside it.	2026-05-20 17:05:54 +02:00
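The reason for taking a local copy of `self.cmd` can be illustrated with a toy model. `FakeProcess` and `FakeServer` below are hypothetical stand-ins, not the real `ScyllaServer`; the only point is that another coroutine may reset the attribute while `stop()` is suspended at an await point.

```python
import asyncio

class FakeProcess:
    """Stand-in for a child process; it dies one loop tick after kill()."""
    def __init__(self):
        self._dead = asyncio.Event()
        self.returncode = None

    def kill(self):
        self.returncode = -9
        asyncio.get_running_loop().call_soon(self._dead.set)

    async def wait(self):
        await self._dead.wait()
        return self.returncode

class FakeServer:
    def __init__(self):
        self.cmd = FakeProcess()
        self.exit_code = None

    async def stop_gracefully(self):
        await self.cmd.wait()   # blocked while the node is "deadlocked"
        self.cmd = None

    async def stop(self):
        cmd = self.cmd          # local copy taken before any await point
        if cmd is None:
            return
        cmd.kill()
        await cmd.wait()        # stop_gracefully() may clear self.cmd here
        # Reading self.cmd.returncode instead could fail, because self.cmd
        # may already be None at this point.
        self.exit_code = cmd.returncode

async def demo_local_copy():
    server = FakeServer()
    graceful = asyncio.create_task(server.stop_gracefully())
    await asyncio.sleep(0)      # let stop_gracefully() block on wait()
    await server.stop()
    await graceful
    return server.exit_code, server.cmd
```

In this model `stop_gracefully()` resumes first once the process dies and sets `self.cmd` to `None`, yet `stop()` still reads the exit code through its local reference.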
Petr Gusev	324a08295d	error_injection: add non-shared mode to wait_for_message Add a 'share' parameter to wait_for_message (default true, preserving existing behavior). When share=false, each handler invocation requires its own dedicated message to proceed — a message consumed by one handler is not visible to others. Use share=false for the pause_before_barrier_and_drain injection in raft_topology_cmd_handler. The topology coordinator sends multiple barrier_and_drain RPCs during a single topology transition (one per state change). With share=true a single message_injection call releases all handlers. With share=false the test can release them one at a time, controlling exactly which topology state the write handler's ERM captures.	2026-05-20 17:05:54 +02:00
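The difference between the shared and non-shared modes can be modelled in a few lines. This is a toy model written for illustration, assuming simple message counters; `InjectionModel` and its fields are hypothetical and do not mirror the C++ implementation.

```python
import asyncio

class InjectionModel:
    """Toy model of shared vs. non-shared message delivery."""
    def __init__(self):
        self.cond = asyncio.Condition()
        self.sent = 0       # messages sent so far
        self.consumed = 0   # messages claimed by non-shared waiters

    async def send_message(self):
        async with self.cond:
            self.sent += 1
            self.cond.notify_all()

    async def wait_for_message(self, share=True):
        async with self.cond:
            if share:
                # Any message sent after we started waiting releases us.
                seen = self.sent
                await self.cond.wait_for(lambda: self.sent > seen)
            else:
                # Each waiter needs a message nobody else has claimed.
                await self.cond.wait_for(lambda: self.sent > self.consumed)
                self.consumed += 1

async def released_after_one_message(share, handlers=3):
    inj = InjectionModel()
    done = []

    async def handler(i):
        await inj.wait_for_message(share=share)
        done.append(i)

    tasks = [asyncio.create_task(handler(i)) for i in range(handlers)]
    await asyncio.sleep(0.01)   # let all handlers block
    await inj.send_message()
    await asyncio.sleep(0.01)   # let released handlers run
    count = len(done)
    for t in tasks:
        t.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return count
```

With three paused handlers, one message releases all of them in shared mode and exactly one in non-shared mode, which is what lets a test step through the `barrier_and_drain` RPCs one at a time.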
Petr Gusev	bc4dc13e94	error_injection: release waiters when injection is disabled When an error injection is disabled (via disable() or disable_all()), any handlers currently suspended in wait_for_message() must be woken up so they can proceed instead of hanging until timeout. Add a 'disabled' flag to injection_shared_data. When disable() or disable_all() is called, set the flag and broadcast the condition variable. The wait_for_message() predicate checks the flag and returns true immediately, letting the handler continue. This makes disable() atomic with respect to releasing waiters: it both wakes up blocked handlers and removes the injection from the enabled map in one call. This avoids races that would occur with separate message_injection() + disable() calls — message_injection() after disable() fails because the injection is already gone, and disable() after message_injection() risks a new handler hitting the injection between the two calls. Concrete example: test_unfinished_writes_during_shutdown pauses barrier_and_drain RPC handlers via wait_for_message. During shutdown, the test calls disable_injection() to simultaneously release the paused handler and prevent any new barrier_and_drain RPCs from getting stuck.	2026-05-20 17:05:54 +02:00
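The disable-releases-waiters behaviour can be sketched as a small model. The names below (`InjectionRegistry`, the dictionary fields) are hypothetical and chosen for illustration; the sketch only assumes what the commit message states: a `disabled` flag checked by the wait predicate, a broadcast on disable, and removal from the enabled map in the same call.

```python
import asyncio

class InjectionRegistry:
    """Toy model; not the real error_injection API."""
    def __init__(self):
        self.enabled = {}   # injection name -> shared state

    def enable(self, name):
        self.enabled[name] = {
            "cond": asyncio.Condition(), "messages": 0, "disabled": False}

    async def wait_for_message(self, name):
        data = self.enabled.get(name)
        if data is None:
            return          # injection not enabled: the handler passes through
        async with data["cond"]:
            # The predicate also checks the disabled flag, so disable()
            # wakes every waiter.
            await data["cond"].wait_for(
                lambda: data["disabled"] or data["messages"] > 0)

    async def disable(self, name):
        # Removing the entry first means new handlers no longer see it.
        data = self.enabled.pop(name, None)
        if data is None:
            return
        async with data["cond"]:
            data["disabled"] = True
            data["cond"].notify_all()

async def demo_disable():
    reg = InjectionRegistry()
    reg.enable("pause_before_barrier_and_drain")
    released = []

    async def handler(i):
        await reg.wait_for_message("pause_before_barrier_and_drain")
        released.append(i)

    tasks = [asyncio.create_task(handler(i)) for i in range(2)]
    await asyncio.sleep(0.01)   # both handlers are now paused
    paused_before = len(released)
    await reg.disable("pause_before_barrier_and_drain")
    await asyncio.gather(*tasks)
    # A handler arriving after disable() must not get stuck.
    await asyncio.wait_for(handler(99), timeout=1)
    return paused_before, sorted(released)
```

A single `disable()` call both wakes the paused handlers and keeps later handlers from pausing, so no separate message is needed.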
Michał Jadwiszczak	eac9449967	test/test_mv_building: ensure nodes see each other after restart In SCYLLADB-2058 we observed a timeout exception while querying the base table after restarting nodes 2 and 3. Unfortunately, the logs don't give us much useful information about the root cause. This patch adds basic checks that the nodes see each other after the restart and that the CQL connection sees the restarted node. It doesn't guarantee that the error won't occur again: in the logs from SCYLLADB-2058, each node sees the others via gossip after part of the cluster is restarted. In case the error occurs again, this commit also increases the logging level of `cql_server` and `storage_proxy`. Refs SCYLLADB-2058 Closes scylladb/scylladb#29951	2026-05-20 14:11:41 +02:00
Marcin Maliszkiewicz	83823149e9	Merge 'audit: implement audit_rules config' from Andrzej Jackowski This patch series adds `audit_rules`, a new audit configuration option for fine-grained, role-aware audit filtering with per-rule sink routing. Rules can be configured in `scylla.yaml` or updated live through `system.config` without restarting the node. Each rule specifies target sinks (`table`, `syslog`), statement categories, qualified table name patterns, and role patterns. Table and role patterns use POSIX `fnmatch` with extended glob syntax. For table-scoped categories (`DML`, `DDL`, `QUERY`), a rule matches only when the category, role, and qualified table name all match. For table-independent categories (`AUTH`, `ADMIN`, `DCL`), the table filter is ignored. Empty category or role lists match nothing; an empty table list matches nothing only for table-scoped categories. The new rules are additive with the existing `audit_categories`, `audit_keyspaces`, and `audit_tables` settings: both mechanisms are evaluated for each audit event, and the final sink set is the union of all matches. To avoid evaluating glob patterns on every audit event, audit rules use a preprocessed cache of known roles and tables. The cache is kept in sync through group0 role/table snapshots, role-change notifications, and schema migration notifications. For known entities, rule matching uses precomputed role/table rule sets; unknown entities fall back to direct rule evaluation. When `audit_rules` is empty, per-event rule matching returns immediately and does not evaluate glob patterns. Audit still keeps known role/table metadata in sync while audit is enabled, so rules can be enabled later through live configuration updates without restarting the node. Performance Measured with `perf-simple-query --smp 1 --duration 100` against a null syslog socket. 
Results show no regression when audit is disabled, and audit-rules performance has at most 1% more instructions than legacy config for equivalent workloads:
```
==============================================================================================================================================
Configuration                                | Binary   | throughput (tps) | insns/op        | cpu_cycles/op   | alloc/op | logal/op | task/op
==============================================================================================================================================
audit=none [1]                               | baseline | 206922.4         | 36591.6         | 15348.3         | 58.1     | 0.0      | 14.1
audit=none [1]                               | this PR  | 207856.4 (+0.5%) | 36544.9 (-0.1%) | 15274.0 (-0.5%) | 58.1     | 0.0      | 14.1
----------------------------------------------------------------------------------------------------------------------------------------------
audit=syslog keyspaces=ks [2]                | baseline | 94871.8          | 54163.0         | 27172.4         | 72.0     | 0.0      | 24.0
audit=syslog keyspaces=ks [2]                | this PR  | 96138.4 (+1.3%)  | 54072.3 (-0.2%) | 26699.3 (-1.7%) | 72.0     | 0.0      | 24.0
audit=syslog audit-rules=ks [3]              | this PR  | 95142.1 (+0.3%)  | 54457.8 (+0.5%) | 26953.8 (-0.8%) | 72.0     | 0.0      | 24.0
----------------------------------------------------------------------------------------------------------------------------------------------
audit=syslog keyspaces=ks-non-existent [4]   | baseline | 213997.8         | 36735.6         | 14848.1         | 58.1     | 0.0      | 14.1
audit=syslog keyspaces=ks-non-existent [4]   | this PR  | 219297.2 (+2.5%) | 36667.3 (-0.2%) | 14500.1 (-2.3%) | 58.1     | 0.0      | 14.1
audit=syslog audit-rules=ks-non-existent [5] | this PR  | 211038.7 (-1.4%) | 36999.7 (+0.7%) | 15048.6 (+1.4%) | 58.1     | 0.0      | 14.1
==============================================================================================================================================
[1] ./scylla perf-simple-query --smp 1 --duration 100 --audit "none"
[2] ./scylla perf-simple-query --smp 1 --duration 100 --audit "syslog" --audit-keyspaces "ks" --audit-categories "DCL,DDL,AUTH,DML,QUERY" --audit-unix-socket-path "/tmp/audit-null.sock"
[3] ./scylla perf-simple-query --smp 1 --duration 100 --audit "syslog" --audit-rules '[{"sinks":["syslog"],"categories":["DCL","DDL","AUTH","DML","QUERY"],"qualified_table_names":["ks."],"roles":[""]}]' --audit-unix-socket-path "/tmp/audit-null.sock"
[4] ./scylla perf-simple-query --smp 1 --duration 100 --audit "syslog" --audit-keyspaces "ks-non-existent" --audit-categories "DCL,DDL,AUTH,DML,QUERY" --audit-unix-socket-path "/tmp/audit-null.sock"
[5] ./scylla perf-simple-query --smp 1 --duration 100 --audit "syslog" --audit-rules '[{"sinks":["syslog"],"categories":["DCL","DDL","AUTH","DML","QUERY"],"qualified_table_names":["ks-non-existent."],"roles":[""]}]' --audit-unix-socket-path "/tmp/audit-null.sock"

audit-null.sock was created with `socat -u UNIX-RECV:/tmp/audit-null.sock,type=2 OPEN:/dev/null`
```
Fixes: SCYLLADB-1430 No backport: new feature Closes scylladb/scylladb#29267 * github.com:scylladb/scylladb: test: alternator: audit: rules filtering and batch bypass test: perf: add --audit-rules option to perf-simple-query docs: add audit rules section to the auditing guide test: audit: cover role and schema cache notifications test: audit: cover audit rules cluster behavior audit: rebuild rule caches on group0 snapshot and role changes audit: refresh rule caches on schema, role, and config changes audit: route matching rules to configured sinks test: cover preprocessed audit rule cache audit: add preprocessed rule matching cache audit: pass sink targets to storage helpers test: audit: cover rule matching semantics audit: add rule matching
and sink helpers test: audit: cover audit_rules configuration config: add live audit_rules option test: cover audit rule parsing and validation audit: define audit_rule type with parsing and validation	2026-05-20 14:10:45 +02:00
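The rule-matching semantics described in the `audit_rules` merge above can be approximated in a short sketch. It reuses the rule field names from the `--audit-rules` JSON in the commit message, but everything else is an assumption: `rule_matches` and `sinks_for` are hypothetical helpers, and Python's `fnmatchcase` stands in for POSIX `fnmatch` with extended glob syntax, which it does not fully reproduce. The legacy `audit_categories`/`audit_keyspaces`/`audit_tables` path and the preprocessed cache are not modelled.

```python
from fnmatch import fnmatchcase

# Categories whose rules also filter on the qualified table name.
TABLE_SCOPED = {"DML", "DDL", "QUERY"}

def rule_matches(rule, category, role, table=None):
    """Return True if one rule matches the audit event."""
    if category not in rule["categories"]:
        return False            # an empty category list matches nothing
    if not any(fnmatchcase(role, p) for p in rule["roles"]):
        return False            # an empty role list matches nothing
    if category in TABLE_SCOPED:
        # An empty table list matches nothing for table-scoped categories.
        return table is not None and any(
            fnmatchcase(table, p) for p in rule["qualified_table_names"])
    return True                 # AUTH, ADMIN, DCL ignore the table filter

def sinks_for(rules, category, role, table=None):
    """Union of the sinks of every matching rule."""
    sinks = set()
    for rule in rules:
        if rule_matches(rule, category, role, table):
            sinks.update(rule["sinks"])
    return sinks

example_rules = [
    {"sinks": ["syslog"], "categories": ["DML", "AUTH"],
     "qualified_table_names": ["ks.*"], "roles": ["*"]},
    {"sinks": ["table"], "categories": ["DML"],
     "qualified_table_names": [], "roles": ["admin*"]},
]
```

For instance, `sinks_for(example_rules, "AUTH", "alice")` yields the `syslog` sink even without a table name, while the second rule never matches because its table list is empty and `DML` is table-scoped.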
Gleb Natapov	c2cc7ebf39	test: fix test_cas_semaphore flakiness due to paxos state table creation timeout The test was starting Scylla with --write-request-timeout-in-ms=500 on the command line. This tight timeout also applied to paxos state table creation, which goes through raft and can take longer than 500ms on slow platforms (e.g. aarch64/dev). When the first batch of CAS requests triggered paxos state table creation under error injection, the raft schema change could still be in-flight when the second batch fired, causing spurious WriteTimeout failures unrelated to the semaphore bug being tested. Fix by changing the write timeout at runtime via the REST API: lower it to 500ms only for the error-injection CAS phase (after table creation is done), then restore it to 10000ms before the second batch that must succeed. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-2104 Closes scylladb/scylladb#29969	2026-05-20 13:06:17 +02:00
Avi Kivity	6df04c9e5b	Update seastar submodule Changed seastar::http::experimental to seastar::http to reflect graduation of the seastar http API. Changed call to seastar::rename_file() (in sstables/storage.cc, sstables/sstable_directory.cc, sstables/sstables.cc and db/hints/internal/hint_storage.cc) to reflect new default parameter. Updated scylla_gdb test helper get_task() to work with updated accept loop in Seastar. This is just test code (attempts to find a task to operate on), not used in real scylla-gdb.py work, but nevertheless the adjustment keeps backward compatibility. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1798 Fixes https://scylladb.atlassian.net/browse/SCYLLADB-2043 * seastar 485a62b2...510f3148 (43): > reactor_backend: fix iocb double-free and shutdown hang during AIO teardown > file: fix default DMA alignment > http: add to_reply() to redirect_exception with extra-header support > core: propagate syscall errors via `coroutine::exception` > file: assert dma alignments are powers of two > doc: Document undocumented io_tester features and fix output example > backtrace: print the build_id along with the backtrace > reactor: default to oneline backtraces > Merge 'json: formatter: support types with user-defined conversion to sstring' from Benny Halevy tests: json_formatter: test formatter::write with string types json: formatter: support types with user-defined conversion to sstring > httpd_test: fix build failure with Seastar_SSTRING=OFF > net/tls: introduce ssl_call wrapper for SSL I/O > build: disable unused command line argument error for C++ module > coroutine/generator: fix setup of generator's waiting task > tests/tls: set 1000-day validity for self-signed CA cert > net: tls: openssl: disable certificate compression > reactor: reduce steady_clock::now() calls per scheduling quantum > fair_queue: remove notify_request_finished() > loop: use small_vector for parallel_for_each_state incomplete futures > dodge false sharing in spinlock >
Merge 'Handle nowait support for reads and writes independently' from Pavel Emelyanov file: Change nowait_works mode detection file: Introduce read-only nowait_mode filesystem: Make nowait_works bit a enum class too file: Make nowait_works bit a enum class > Merge 'net/tls: improve OpenSSL error queue hygiene' from Gellért Peresztegi-Nagy net/tls: assert clean error queue before SSL operations net/tls: clear error queue after successful SSL operations net/tls: clear error queue after successful SSL_CTX_new net/tls: drain error queue on unexpected error codes net/tls: use make_openssl_error for BIO creation failure > vla.hh: add missing includes > Merge 'smp: make smp::count non-static' from Avi Kivity smp: convert all smp::count usages to instance-aware alternatives smp: add per-instance shard_count and this_smp() infrastructure disk_params: document pre-init smp::count access with explicit 0 reactor_backend: document pre-init smp::count access with explicit 0 tests: alien_test: pass shard count to alien thread explicitly > build: fix cmake missing ninja on Ubuntu 26.04 > rpc: Fix uint64 wraparound of expired timeout in send_entry() > Merge 'Generalize some RPC tests' from Pavel Emelyanov tests: Generalize async connection-based scheduling RPC tests tests: Generalize sync connection-based scheduling RPC tests tests: Remove redundant variadic/nonvariadic RPC tuple tests tests: Generalize max timeout RPC tests > net: tls: openssl: Share BIO ptrs across shards > http: fix compilation on clang 22 with c++26 > build: openssl tools needed for test cert generation > reactor: support rename2 > future: fix forwarding of reference types > Merge 'Zero-copy http chunked data sink' from Pavel Emelyanov http: Make chunked data sink zero-copy tests/prometheus_http: Rewrite on top of http::client tests/httpd: Rewrite content_length_limit on top of http::client > tests: Replace ad-hoc http_consumer with production HTTP parser > Merge 'co_return to accept same expressions and types 
as return' from Alexey Bashtanov tests/unit/{coroutines,futures}: strict types on co_return and set_value api: introduce version 10: core/{coroutine,future}: make `co_return` more strict with types core/{coroutine,future}: preparations to fix `co_return` type semantics > Merge 'Perftune.py: add special handling for mlx5 rss queues number calculation' from Vladislav Zolotarov perftune.py: NetPerfTuner: enhance RSS (a.k.a. "Rx") queues accounting for mlx5 devices perftune.py: update docstring of NetPerfTuner.__get_rps_cpus() method perftune.py: add a method that parses and models the output of the 'ethtool -l' command for a given interface > httpd: rewrite do_accepts/do_accept_one as coroutines > file: add mmap support to file > http: Move client code out of experimental namespace > file: add hugetlbfs support to file system detection > tests: Replace test_source_impl with util::as_input_stream > tests: Replace buf_source_impl with util::as_input_stream > Merge 'rpc_tester: expose throuput for rpc tester' from Marcin Szopa rpc_tester: remove unused payload size variable from job_rpc_streaming class rpc_tester: add start time tracking for throughput calculation, print throughput and msg/s for job_rpc rpc_tester: refactor result emission to use dedicated functions for messages and throughput > iostream: cast first argument of `std::min` to `size_t` Closes scylladb/scylladb#29952	2026-05-20 13:47:12 +03:00


54038 Commits