`tablet_restore_task_impl::run()` iterates `topo.get_datacenter_racks().at(dc) |
std::views::keys` in a range-based for loop that contains a `co_await` in its body.
The original code obtained `topo` as a raw `const auto&` by dereferencing the
temporary `lw_shared_ptr` returned from `get_token_metadata_ptr()`. Because no
copy of the `lw_shared_ptr` was kept, the ref count did not stay elevated:
const auto& topo = db.get_token_metadata().get_topology();
// ↑ temporary lw_shared_ptr — destroyed immediately
When the coroutine suspends at the inner `co_await`, a Raft topology update can
run on the reactor, calling `shared_token_metadata::set()` with a new
token_metadata. The old token_metadata`s refcount then drops to 0, its destructor
fires `clear_and_dispose_impl()`, and `topology::clear_gently()` is scheduled as
a background fire-and-forget future. That future erases nodes from `_dc_racks`
one-by-one with yields between batches.
When the coroutine resumes and the range-based for loop advances its iterator,
the iterator`s hash-node pointer references freed memory:
AddressSanitizer: heap-use-after-free sstables_loader.cc:1204:31
in sstables_loader::tablet_restore_task_impl::run() (.resume)
Fix: store the `token_metadata_ptr` (`lw_shared_ptr<token_metadata>`) in a local
variable on the coroutine frame. Because coroutine locals survive across
suspension points, the elevated ref count keeps the token_metadata (and its
topology) alive for the duration of the loop, making the existing range iteration
safe without any data copying.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-2149Closesscylladb/scylladb#29995
Collections have an age-old problem in ScyllaDB: they had to be unserialized into an intermediate representation for any access or manipulation. The intermediate representation needs effort to produce and also requires additional memory to store. Both can be significant for large collections. This intermediate representation is then either discarded immediately after use, or re-serialized again.
This problem was significant enough for us to consider the use of collections as somewhat of an anti-pattern. But our customers keep using it. Alternator is also a heavy user of collections.
This PR aims to solve this problem once and for all. The plan is as follows:
* Promote direct use of the serialized collection format:
- Add accessor methods to `collection_mutation_view` which read from the serialized format directly: `tomb()`, `size()` and `begin()`/`end()`.
- Add a `collection_mutation_writer` which provides container semantics for generating a serialized `collection_mutation` directly on the go (`push_back()`).
* Replace all usage of `collection_mutation_description`, `collection_mutation_view_description` and friends with use of the new infrastructure.
* Drop the old infrastructure, to avoid accidental regressions.
Continues the work started by https://github.com/scylladb/scylladb/pull/29033 and takes it to its conclusion.
To help focus review, here is a summary of the patches:
* [1, 2] preparatory refactoring: drop some unused abstract_type params
* [3, 6] introduce new infrastructure to write and read serialized collections directly; this is the meat of the PR
* [6, -1) replace all usage of old materializing infrastructure with usage of the new one
* [-1] drop old infrastructure
**Command:**
```
dbuild -it -- build/release/scylla perf-simple-query --collection=16 -c1 -m2G --default-log-level=error
```
| Metric | Before | After | Change |
|--------------------------|--------:|--------:|------------|
| Throughput (median tps) | 315,760 | 332,021 | **+5.1%** |
| Instructions/op (median) | 53,776 | 48,681 | **-9.5%** |
| CPU cycles/op (median) | 17,365 | 16,471 | **-5.1%** |
| Allocations/op | 85.1 | 82.1 | **-3.5%** |
**Significant improvement.** Throughput is up ~5%, and both instruction count and cycle count are meaningfully reduced.
---
**Command:**
```
dbuild -it -- build/release/scylla perf-simple-query --collection=16 -c1 -m2G --default-log-level=error --write
```
| Metric | Before | After | Change |
|--------------------------|----------:|---------:|-----------|
| Throughput (median tps) | 150,823 | 149,678 | **-0.8%** |
| Instructions/op (median) | 108,388 | 103,858 | **-4.2%** |
| CPU cycles/op (median) | 34,860 | 35,371 | **+1.5%** |
| Allocations/op | ~105–108 | ~102–103 | **-3.0%** |
**Mixed, mostly neutral.** Throughput is essentially flat (within noise). Instructions/op improved by ~4%, allocations dropped slightly, but cycles/op edged up marginally.
---
**Command:**
```
dbuild -it -- build/release/scylla perf-alternator --workload write --developer-mode=1 --alternator-port=8000 --alternator-write-isolation=unsafe -c1 -m2G --default-log-level=error
```
| Metric | Before | After | Change |
|--------------------------|--------:|-------:|-----------|
| Throughput (median tps) | 55,777 | 56,051 | **+0.5%** |
| Instructions/op (median) | 246,215 |246,610 | **+0.2%** |
| CPU cycles/op (median) | 77,641 | 77,020 | **-0.8%** |
| Allocations/op | 340.4 | 335.4 | **-1.5%** |
**Essentially neutral.** All metrics are within noise margins. Slight reduction in allocations and cycles, negligible otherwise.
---
The change has a **clear, substantial positive effect on reads** (~5% throughput gain, ~9.5% fewer instructions per op).
The write and alternator paths are **unaffected in practice** — changes there are within measurement noise. No regressions are apparent.
This is expected: https://github.com/scylladb/scylladb/pull/29033 did the heavy lifting when it comes to the write path, this PR finishes the job, mostly improving reads.
Fixes: #3602
Improvement, no backport.
Closesscylladb/scylladb#29127
* github.com:scylladb/scylladb:
mutation/collection_mutation: make collection_mutation::_data private
mutation_collection: drop collection_mutation_description and friends
test: move away from collection_mutation_description
tree: move away from collection_mutation_description
test: move away from collection_mutation_view::with_deserialized()
tree: move away from collection_mutation_view::with_deserialized()
types: fix indendation, left broken by previous commit
types: move away from collection_mutation_view::with_deserialized()
types: serialize_for_cql(): use throwing_assert() instead of SCYLLA_ASSERT()
schema: column_computation: move away from collection_mutation_view::with_deserialized()
mutation: move away from collection_mutation_view::with_deserialized()
alternator: move away from collection_mutation_view::with_deserialized()
cdc: move away from collection_mutation_view::with_deserialized()
mutation/collection_mutation: printer: don't deserialize collections
mutation/collection_mutation: difference(): don't deserialize collections
mutation/collection_mutation: merge(): don't deserialize collections
mutation/collection_mutation: extract compact_and_expire() to free function
mutation/collection_mutation: refactor empty(), is_any_live() and last_update()
compaction_garbage_collector: pass collection_mutation to collect()
test/boost/mutation_test: add tests for collection_mutation_{view,writer}
mutation/collaction_mutation: collection_mutation_view: add methods to inspect content
mutation/collection_mutation: add collection_mutation_writer
mutation/collection_mutation: collection_mutation(): generate valid collection
mutation/collection_mutation: collection_mutation(): remove unused abstract_type param
mutation/atomic_cell: drop unused type param from from_bytes()
Fixes a shutdown deadlock where a node hangs because `stale_versions_in_use()` blocks on stale `token_metadata` versions held by write handlers whose `MUTATION_DONE` responses can never arrive (transport is already stopped).
Two manifestations depending on whether the shutting-down node is the topology coordinator:
- Coordinator: do_drain → wait_for_group0_stop deadlocks because the topology coordinator fiber is stuck in barrier_and_drain → stale_versions_in_use().
- Non-coordinator: ss::stop → uninit_messaging_service deadlocks because the barrier_and_drain RPC handler holds the gate open.
The non-coordinator case was fixed in PR #24714 (cancel all write requests on storage_proxy shutdown), but its test never actually failed — the write handler always captured the current token_metadata version because `pause_before_barrier_and_drain` used `one_shot=True,` so only the first `barrier_and_drain` was paused. The topology state hadn't advanced by that point, meaning the write handler's ERM version matched the current version and `stale_versions_in_use()` returned immediately. The coordinator case was not covered at all.
Cancel all write response handlers on all shards right after `stop_transport()` in `do_drain()`. This releases their ERMs and the associated stale token_metadata versions, unblocking `stale_versions_in_use()`.
Fixed the test to ensure the write handler holds a stale version: use one_shot=False, let the first barrier_and_drain through (version still current), then wait for the second one (version now stale). Extended to cover both coordinator and non-coordinator shutdown on the same 2-node cluster.
Also includes supporting changes:
- error_injection: release wait_for_message waiters on disable() so the test can atomically unblock paused handlers
- error_injection: add non-shared mode to wait_for_message for per-invocation message semantics
- scylla_cluster.py: allow stop() to bypass start_stop_lock so SIGKILL works while stop_gracefully is blocked
Fixes: SCYLLADB-1842
Refs: scylladb/scylladb#23665
backports: SCYLLADB-1842 reported a failure in 2025.1, so we need to backport to all versions starting from 2025.1
Closesscylladb/scylladb#29882
* https://github.com/scylladb/scylladb:
storage_service: cancel write handlers during drain to prevent shutdown deadlock
test_unfinished_writes_during_shutdown: extend to cover coordinator shutdown
test_unfinished_writes_during_shutdown: fix to reproduce the shutdown deadlock
test_unfinished_writes_during_shutdown: await add_last_node_task instead of cancelling it
test_unfinished_writes_during_shutdown: add timeout and deadlock detection for shutdown_task
test: scylla_cluster: allow stop() to bypass start_stop_lock
error_injection: add non-shared mode to wait_for_message
error_injection: release waiters when injection is disabled
There is small windows just after view building coordinator releases
group0 guard and before it waits on view_building_state_machine's CV,
when the coordinator may miss CV broadcast triggered by finished remote
work.
To fix it, this patch adds a boolean flag, which is set to true before
broadcasting the CV and is checked before awaiting on the CV.
Fixes SCYLLADB-2029
The problem is not critical but it should be backported to 2025.4 and newer version, all of them contains view building coordinator.
Closesscylladb/scylladb#27313
* github.com:scylladb/scylladb:
test/cluster/test_view_building_coordinator: add reproducer
db/view/view_building_coordinator: add flag to mark if any remote work was finished
The S3 client creates a separate HTTP connection pool per scheduling group. Previously, the pool size was hardcoded as shares/100, yielding 1-10 connections. This was not tunable and could under-provision connections for groups with low share counts.
**Changes**
- A missing include (short_streams.hh) in sstables_loader.cc is added first to fix CMake builds where the header is not transitively included.
- The hardcoded per-share divisor is replaced with a per-shard connection budget. The new `object_storage_connections_per_shard` config option (default 128) specifies the total number of connections available on each shard. Connections are distributed proportionally across scheduling groups based on their shares: `max_connections = budget * group_shares / total_shares`. Remainder connections are assigned to the group with the most shares. When a new scheduling group client is created, all existing groups are rebalanced via `set_maximum_connections`. Creation and rebalance are serialized with a semaphore to prevent concurrent rebalances from racing.
- The config option is made live-updateable: a `storage_manager` observer propagates changes to all existing S3 clients, triggering rebalance under the same semaphore.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1704
No backport needed since this change affects KS on object storage which is not operational yet.
Closesscylladb/scylladb#29719
* github.com:scylladb/scylladb:
s3: make connections_per_shard live-updateable
s3: distribute connection pool proportionally across scheduling groups
~2076 files used "Copyright (C) YYYY-present ScyllaDB" while
~88 files used "Copyright (C) YYYY ScyllaDB". This
inconsistency leads to unnecessary code review discussions
and gradual spread of the less common format.
Standardize all ScyllaDB copyright headers to use -present.
Fixes SCYLLADB-1984
Closesscylladb/scylladb#29876
When a non-replica node handles a strongly consistent write, it must
forward the request to a replica. If the closest replica is not the
leader, the request gets redirected again, causing an extra roundtrip.
Add a leader location cache in groups_manager, keyed by raft group_id.
After a write request is forwarded, the CQL transport layer records the
final node as the leader in the cache. Subsequent write requests from
the same node for the same group are forwarded directly to the cached
leader, eliminating the extra roundtrip.
The cache is only used for writes. Reads can be served by any replica,
so they skip the cache and use proximity-based routing instead.
Cache entries are validated at use time: if the cached leader is no
longer a replica (e.g. after tablet migration), the entry is evicted
and the normal closest-replica path is taken. This prevents a scenario
where two nodes keep redirecting to each other because both think that
the other is the leader but actually both are non-replicas - such loop
is broken as soon as the tablet maps are updated.
On token_metadata updates, entries for groups that no longer exist
(e.g. table dropped, tablet merged) are evicted. Entries for groups
that still exist are kept — use-time validation handles staleness.
An on_node_resolved callback is propagated through the redirect/bounce
path so the transport layer can update the cache generically without
coupling to the strong-consistency coordinator. The coordinator creates
the callback only for writes (capturing the groups_manager and
group_id) and attaches it to the bounce message; the transport layer
invokes it once the final node is known, keeping the forwarding
infrastructure subsystem-agnostic.
We also add a test which verifies that after the initial redirect,
following requests to the same node avoid the extra redirect and
forward directly to the leader.
Fixes: SCYLLADB-1064
Closesscylladb/scylladb#29392
Tables created before the GROUP0_SCHEMA_VERSIONING feature was enabled
have committed_by_group0 = null in system_schema.scylla_tables. This
causes maybe_delete_schema_version() to delete their version cell,
forcing the legacy hash-based schema version computation path.
Add ensure_committed_by_group0() which runs on boot and fixes up any
non-system tables where committed_by_group0 is not true (null or false):
1. Queries system_schema.scylla_tables for rows where committed_by_group0
is null or false, skipping system keyspaces (system, system_schema).
2. Takes a group0 guard
3. Re-checks after the raft barrier in case another node already fixed it.
4. For each table needing fixup, creates a mutation writing the version
cell (from the in-memory schema). The committed_by_group0 = true flag
is stamped by add_committed_by_group0_flag() inside announce().
5. Announces via raft group0.
6. Retries with a small random delay on group0_concurrent_modification.
On other nodes, schema_applier will detect these as "altered" tables
(scylla_tables mutation changed), but since the actual table definition
is unchanged, update_column_family is effectively a no-op.
This is a prerequisite for eventually removing the legacy hash-based
schema versioning code path.
Closesscylladb/scylladb#29911
Snapshot creation and raft log truncation happen asynchronously in the
IO fiber after a schema change completes. The test was querying
system.raft immediately after the schema change returned, racing with
the IO fiber's store_snapshot_descriptor call.
Replace immediate assertions with wait_for polling loops:
- log_size == 0: wait for log truncation after drop keyspace
- new_snap_id != original_snap_id: wait for new snapshot to be persisted
Fixes: SCYLLADB-2120
Closesscylladb/scylladb#29967
Nobody should be looking at the raw data storage directly. The
collection can be inspected via collection_mutation_view.
Added a data() && accessor, to be able to extract the raw data for
storage in atomic_cell_or_collection.
Drop:
* collection_mutation_description
* collection_mutation_view_description
* collection_mutation_input_stream
* collection_mutation_view::with_deserialized()
* All serialized/deserialize methods for/from
collection_mutation_description
All these are now unused.
Use collection_mutation_writer instead.
Add to_managed_bytes() to cql3::raw_value to help avoid some copies.
A special note for sstables/kl/reader.cc: this conversion is not
straighforward, so we accumulate a list of cells and feed to the writer
at the end. This is sub-optimal but this code is rarely used, best to be
conservative.
Good practice in general. Also prepares the ground for calling
serialize_for_cql() from serialize_for_cql_with_timestamps(). The latter
already switched to throwing_assert(), avoid regressing to a crash.
Use the collection_mutation_view directly.
Also use managed_bytes_view for collection keys to match
collection_mutation_view::iterator::value_type and to avoid
unnecessary key linearization.
Use the collection_mutation_view directly.
Requires changing key type from bytes_view to managed_bytes_view, to
match collection_mutation_view::iterator::value_type and more
importantly to avoid unnecessary key linearization.
Work with collection_mutation_view directly instead
Use collection_mutation_writer to generate the serialized output
directly, instead of accumulating cells in temporary storage.
Work with collection_mutation_view directly instead.
Use collection_mutation_writer to generate the serialized output
directly, instead of accumulating cells in temporary storage.
Add the record timestamp. The timestamp is extracted from the row marker
of the mutation when we write it.
When inserting a record to index, we compare it with the existing
record, and insert it only if it has newer timestamp.
Add a segment sequence number that is a global (per-shard) increasing
number that is allocated when getting a new segment for write, and is
written in buffer headers in the segment.
It is used to distinguish between buffers written to different generations
of a segment, and for recovery to break ties by keeping the record
from the newest segment.
Refs https://scylladb.atlassian.net/browse/SCYLLADB-770
no backport - logstor is a new feature
Closesscylladb/scylladb#29933
* github.com:scylladb/scylladb:
test: logstor: add basic delete test
logstor: rewrite segment seq num from streaming
logstor: add segment sequence number
logstor: get_segment helper
logstor: compare records by timestamp
Change the collection overload of compaction_garbage_collector::collect()
to accept a collection_mutation (a serialized blob) instead of
collection_mutation_description (an owned, deserialized
representation).
Change collection_mutation_description::compact_and_expire() (the only
user) to build a serialized collection_mutation, with the new
collection_mutation_writer directly.
Allows dropping collection_mutation_description from the
compaction_garbage_collector interface, getting us one step closer to
removing it.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add size, tombstone getters and begin()/end() to iterate over the cells.
Allows inspecting the content of the collection in-place, without
deserializing into an intermediate representation
(collection_mutation_description[_view]). This will be used to gradually
replace all usage of collection_mutation_description[_view].
Deliberately avoding the use of collection_mutation_input_stream as that
one is also a target for elimination.
The names for the accessors are tomb() and size(), following existing
conventions in mutation/. Also rename is_empty() -> empty() to align
with this convention. There is a single caller to update only.
All accessors (new or pre-existing) are made to work with
default-constructed `collection_mutation` (i.e. one containing empty
buffer).
No users yet.
Create a serialized collection element-by-element, with size not known
up-front. Appended elements are written into the buffer directly.
Allows replacing the current practice of first creating a
vector of cells, then serializing at the end.
Currently the collection_mutation default constructor just
default-initializes the underlying managed_bytes, which is not a valid
collection mutation value. This is an accident waiting to happen,
because attempting to deserialize or interpret this value would result
in buffer overflow. Generate a valid collection value in the default
constructor. This is just 5 bytes, small enough to fit into
managed_byte's small buffer, so it even avoids allocation.
The abstract_type parameter was needed by the old IMR-based atomic-cell
serialization format. When we reverted to the simpler managed_bytes
representation, the parameter became a no-op but was never removed.
Drop it from all overloads of atomic_cell_view::from_bytes() and
atomic_cell_mutable_view::from_bytes() and update all call sites.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a node shuts down, do_drain() calls stop_transport() which tears
down the messaging service. After this point, MUTATION_DONE responses
from replicas can no longer reach the coordinator, so any in-flight
write_response_handlers will never complete naturally. These handlers
hold ERMs referencing stale token_metadata versions.
If the topology coordinator calls barrier_and_drain (either on itself
or via RPC), it blocks in stale_versions_in_use() waiting for these
stale versions to be released. This causes:
- On the coordinator node: do_drain -> wait_for_group0_stop deadlock
(the topology coordinator fiber is stuck in barrier_and_drain).
- On non-coordinator nodes: ss::stop -> uninit_messaging_service
deadlock (the barrier_and_drain RPC handler holds the gate open).
Fix: cancel all write response handlers on all shards right after
stop_transport() in do_drain(). This releases their ERMs and the
associated stale token_metadata versions, unblocking
stale_versions_in_use().
Heap-allocate _write_handlers_gate and add an allow_new parameter to
cancel_all_write_response_handlers(). When allow_new=true (used by
do_drain), the gate is closed and swapped with a fresh one — existing
handlers are waited on while new handlers can still be created. This
avoids blocking internal writes (paxos learn, compaction history
updates) that still need to create handlers during the remainder of
the drain sequence. When allow_new=false (used by drain_on_shutdown),
the gate is closed permanently — no new handlers can be created after
final shutdown.
Update test_lwt_shutdown to wait for 'Stop transport: done' instead
of 'Shutting down storage proxy RPC verbs'. The latter message is
now only logged after do_drain() completes, but do_drain() blocks
in cancel_all_write_response_handlers() waiting for the background
paxos learn handler — which is exactly what the test needs to release
before shutdown can proceed.
Fixes: SCYLLADB-1842
Refs: scylladb/scylladb#23665
Wire the object_storage_connections_per_shard config option as
LiveUpdate so it can be changed at runtime without restart. When
the value changes, the storage_manager observer propagates it to
all existing S3 clients, which rebalance their connection pools
under the rebalance semaphore.
The S3 client creates a separate HTTP connection pool per scheduling
group. Previously, the pool size was computed per-group using a
per-share multiplier (connections = shares * multiplier), which did
not account for the total number of groups sharing the shard's
connection budget.
Replace the per-share multiplier with a per-shard connection budget:
the new object_storage_connections_per_shard config option (default
100) specifies the total number of connections available on each
shard. When a new scheduling group's client is created, connections
are distributed proportionally across all groups based on their
shares (connections = budget * group_shares / total_shares), and
existing groups are rebalanced via set_maximum_connections.
When the endpoint_config has an explicit max_connections override,
it is used directly without proportional distribution.
The existing test only covers the case where the shutting-down node is
NOT the topology coordinator (deadlocks in uninit_messaging_service).
When the node IS the coordinator, the deadlock manifests differently:
the topology coordinator fiber calls barrier_and_drain on itself
(without messaging), and do_drain -> wait_for_group0_stop blocks
because the coordinator can't stop while stale_versions_in_use is
waiting on the uncancelled write handler.
Run the test twice on the same 2-node cluster (RF=2):
- Run 1: target is a non-coordinator
- Restore cluster state (restart target, decommission added node)
- Run 2: target is the topology coordinator
Use CL=ONE so the write completes from the local replica even with
the other server's response paused.
Mark as xfail since this reproduces bugs not yet fixed on this branch.
Refs: SCYLLADB-1842
The test was written for another case, and was not supposed to
reproduce the issue that was fixed in this PR.
Fix the test to reproduce the real scenario:
1. Use one_shot=False for pause_before_barrier_and_drain so the
injection fires on every barrier_and_drain RPC, not just the first.
2. Let the first barrier_and_drain through (at this point the write
handler's ERM version matches the current token_metadata version).
3. Wait for the second barrier_and_drain. Between the two calls,
topology_state_load installs a new token_metadata version. The
write handler still holds the old version's ERM — now stale.
4. After stop_transport completes, disable the injection (rather than
sending a single message) to release the paused handler and any
subsequent ones that arrived during stop_transport. The 'disabled'
flag in injection_shared_data ensures all waiters wake up.
With these changes the test reliably fails (shutdown deadlock within
15s) on the unfixed code and passes on the fixed version from
e0dc73f52a ('Cancel all write requests on storage_proxy shutdown').
Refs: scylladb/scylladb#23665
asyncio cancel() only affects the client-side coroutine. The
server-side addserver handler in the cluster manager continues
running. If it can't complete (e.g. no raft quorum because the
target node is shut down), the orphaned handler blocks _after_test
cleanup for 120s.
Await the task instead so it completes cleanly (we restart the
target node first to restore quorum).
Add a 15s timeout around the shutdown_task await. If the timeout
fires, the deadlock is reproduced (shutdown hung because
stale_versions_in_use blocks on a write handler holding a stale
token_metadata version).
When the timeout fires, explicitly kill the node via
server_stop() so that the manager's _after_test handler does not
wait 120s for the stuck stop_gracefully request. Then fail the
test with a clear message.
Remove the @stop_event and @start_stop_lock decorators from
ScyllaServer.stop() so it can SIGKILL a server even while
stop_gracefully() holds the lock (e.g. the node is deadlocked
during shutdown and stop_gracefully is blocked on cmd.wait()).
A local copy of self.cmd is used because there are await points
after which another coroutine (stop_gracefully) may set self.cmd
to None. The concurrent stop_gracefully() unblocks once the
process dies from SIGKILL since its cmd.wait() returns.
Also make shutdown_control_connection a plain (non-async) function
since it contains no await points — this makes it obvious that no
coroutine interleaving is possible inside it.
Add a 'share' parameter to wait_for_message (default true, preserving
existing behavior). When share=false, each handler invocation requires
its own dedicated message to proceed — a message consumed by one
handler is not visible to others.
Use share=false for the pause_before_barrier_and_drain injection in
raft_topology_cmd_handler. The topology coordinator sends multiple
barrier_and_drain RPCs during a single topology transition (one per
state change). With share=true a single message_injection call
releases all handlers. With share=false the test can release them
one at a time, controlling exactly which topology state the write
handler's ERM captures.
When an error injection is disabled (via disable() or disable_all()),
any handlers currently suspended in wait_for_message() must be woken up
so they can proceed instead of hanging until timeout.
Add a 'disabled' flag to injection_shared_data. When disable() or
disable_all() is called, set the flag and broadcast the condition
variable. The wait_for_message() predicate checks the flag and returns
true immediately, letting the handler continue.
This makes disable() atomic with respect to releasing waiters: it both
wakes up blocked handlers and removes the injection from the enabled
map in one call. This avoids races that would occur with separate
message_injection() + disable() calls — message_injection() after
disable() fails because the injection is already gone, and
disable() after message_injection() risks a new handler hitting the
injection between the two calls.
Concrete example: test_unfinished_writes_during_shutdown pauses
barrier_and_drain RPC handlers via wait_for_message. During shutdown,
the test calls disable_injection() to simultaneously release the
paused handler and prevent any new barrier_and_drain RPCs from
getting stuck.
In SCYLLADB-2058 we observed a timeout exception while querying the base
table after restarting nodes 2 and 3.
Unfortunately, logs don't give us much useful information about the
root cause.
This patch adds basic checks that nodes see each other after the restart
and that the cql connection sees restarted node.
It doesn't guarantee that the error won't occur again - in logs from
SCYLLADB-2058 we see that each node sees other via gossip after part of
the cluster is restarted.
In case the error will occur again, this commit also increases logging
level of `cql_server` and `storage_proxy`.
Refs SCYLLADB-2058
Closesscylladb/scylladb#29951
This patch series adds `audit_rules`, a new audit configuration option for fine-grained, role-aware audit filtering with per-rule sink routing. Rules can be configured in `scylla.yaml` or updated live through `system.config` without restarting the node. Each rule specifies target sinks (`table`, `syslog`), statement categories, qualified table name patterns, and role patterns. Table and role patterns use POSIX `fnmatch` with extended glob syntax. For table-scoped categories (`DML`, `DDL`, `QUERY`), a rule matches only when the category, role, and qualified table name all match. For table-independent categories (`AUTH`, `ADMIN`, `DCL`), the table filter is ignored. Empty category or role lists match nothing; an empty table list matches nothing only for table-scoped categories. The new rules are additive with the existing `audit_categories`, `audit_keyspaces`, and `audit_tables` settings: both mechanisms are evaluated for each audit event, and the final sink set is the union of all matches.
To avoid evaluating glob patterns on every audit event, audit rules use a preprocessed cache of known roles and tables. The cache is kept in sync through group0 role/table snapshots, role-change notifications, and schema migration notifications. For known entities, rule matching uses precomputed role/table rule sets; unknown entities fall back to direct rule evaluation. When `audit_rules` is empty, per-event rule matching returns immediately and does not evaluate glob patterns. Audit still keeps known role/table metadata in sync while audit is enabled, so rules can be enabled later through live configuration updates without restarting the node.
**Performance**
Measured with `perf-simple-query --smp 1 --duration 100` against a null syslog socket. Results show no regression when audit is disabled, and audit-rules performance has at most 1% more instructions than legacy config for equivalent workloads:
```
===============================================================================================================================================================================
Configuration | Binary | throughput (tps) | insns/op | cpu_cycles/op | alloc/op | logal/op | task/op
===============================================================================================================================================================================
audit=none [1] | baseline | 206922.4 | 36591.6 | 15348.3 | 58.1 | 0.0 | 14.1
audit=none [1] | this PR | 207856.4 (+0.5%) | 36544.9 (-0.1%) | 15274.0 (-0.5%) | 58.1 | 0.0 | 14.1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
audit=syslog keyspaces=ks [2] | baseline | 94871.8 | 54163.0 | 27172.4 | 72.0 | 0.0 | 24.0
audit=syslog keyspaces=ks [2] | this PR | 96138.4 (+1.3%) | 54072.3 (-0.2%) | 26699.3 (-1.7%) | 72.0 | 0.0 | 24.0
audit=syslog audit-rules=ks [3] | this PR | 95142.1 (+0.3%) | 54457.8 (+0.5%) | 26953.8 (-0.8%) | 72.0 | 0.0 | 24.0
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
audit=syslog keyspaces=ks-non-existent [4] | baseline | 213997.8 | 36735.6 | 14848.1 | 58.1 | 0.0 | 14.1
audit=syslog keyspaces=ks-non-existent [4] | this PR | 219297.2 (+2.5%) | 36667.3 (-0.2%) | 14500.1 (-2.3%) | 58.1 | 0.0 | 14.1
audit=syslog audit-rules=ks-non-existent [5] | this PR | 211038.7 (-1.4%) | 36999.7 (+0.7%) | 15048.6 (+1.4%) | 58.1 | 0.0 | 14.1
===============================================================================================================================================================================
[1] ./scylla perf-simple-query --smp 1 --duration 100 --audit "none"
[2] ./scylla perf-simple-query --smp 1 --duration 100 --audit "syslog" --audit-keyspaces "ks" --audit-categories "DCL,DDL,AUTH,DML,QUERY" --audit-unix-socket-path "/tmp/audit-null.sock"
[3] ./scylla perf-simple-query --smp 1 --duration 100 --audit "syslog" --audit-rules '[{"sinks":["syslog"],"categories":["DCL","DDL","AUTH","DML","QUERY"],"qualified_table_names":["ks.*"],"roles":["*"]}]' --audit-unix-socket-path "/tmp/audit-null.sock"
[4] ./scylla perf-simple-query --smp 1 --duration 100 --audit "syslog" --audit-keyspaces "ks-non-existent" --audit-categories "DCL,DDL,AUTH,DML,QUERY" --audit-unix-socket-path "/tmp/audit-null.sock"
[5] ./scylla perf-simple-query --smp 1 --duration 100 --audit "syslog" --audit-rules '[{"sinks":["syslog"],"categories":["DCL","DDL","AUTH","DML","QUERY"],"qualified_table_names":["ks-non-existent.*"],"roles":["*"]}]' --audit-unix-socket-path "/tmp/audit-null.sock"
audit-null.sock was created with `socat -u UNIX-RECV:/tmp/audit-null.sock,type=2 OPEN:/dev/null`
```
Fixes: SCYLLADB-1430
No backport: new feature
Closesscylladb/scylladb#29267
* github.com:scylladb/scylladb:
test: alternator: audit: rules filtering and batch bypass
test: perf: add --audit-rules option to perf-simple-query
docs: add audit rules section to the auditing guide
test: audit: cover role and schema cache notifications
test: audit: cover audit rules cluster behavior
audit: rebuild rule caches on group0 snapshot and role changes
audit: refresh rule caches on schema, role, and config changes
audit: route matching rules to configured sinks
test: cover preprocessed audit rule cache
audit: add preprocessed rule matching cache
audit: pass sink targets to storage helpers
test: audit: cover rule matching semantics
audit: add rule matching and sink helpers
test: audit: cover audit_rules configuration
config: add live audit_rules option
test: cover audit rule parsing and validation
audit: define audit_rule type with parsing and validation
The test was starting Scylla with --write-request-timeout-in-ms=500 on the
command line. This tight timeout also applied to paxos state table creation,
which goes through raft and can take longer than 500ms on slow platforms
(e.g. aarch64/dev). When the first batch of CAS requests triggered paxos
state table creation under error injection, the raft schema change could
still be in-flight when the second batch fired, causing spurious WriteTimeout
failures unrelated to the semaphore bug being tested.
Fix by changing the write timeout at runtime via the REST API: lower it to
500ms only for the error-injection CAS phase (after table creation is done),
then restore it to 10000ms before the second batch that must succeed.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-2104Closesscylladb/scylladb#29969
Changed seastar::http::experimental to seastar::http to reflect
graduation of the seastar http API.
Changed call to seastar::rename_file() (in sstables/storage.cc,
sstables/sstable_directory.cc, sstable/sstables.cc and
db/hints/internal/hint_storage.cc) to reflect new default parameter.
Updated scylla_gdb test helper get_task() to work with updated
accept loop in Seatar. This is just test code (attempts to find
a task to operate on), not used in real scylla-gdb.py work, but
nevertheless the adjustment keeps backward compatibility.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1798
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-2043
* seastar 485a62b2...510f3148 (43):
> reactor_backend: fix iocb double-free and shutdown hang during AIO teardown
> file: fix default DMA alignment
> http: add to_reply() to redirect_exception with extra-header support
> core: propagate syscall errors via `coroutine::exception`
> file: assert dma alignments are powers of two
> doc: Document undocumented io_tester features and fix output example
> backtrace: print the build_id along with the backtrace
> reactor: default to oneline backtraces
> Merge 'json: formatter: support types with user-defined conversion to sstring' from Benny Halevy
tests: json_formatter: test formatter::write with string types
json: formatter: support types with user-defined conversion to sstring
> httpd_test: fix build failure with Seastar_SSTRING=OFF
> net/tls: introduce ssl_call wrapper for SSL I/O
> build: disable unused command line argument error for C++ module
> coroutine/generator: fix setup of generator's waiting task
> tests/tls: set 1000-day validity for self-signed CA cert
> net: tls: openssl: disable certificate compression
> reactor: reduce steady_clock::now() calls per scheduling quantum
> fair_queue: remove notify_request_finished()
> loop: use small_vector for parallel_for_each_state incomplete futures
> dodge false sharing in spinlock
> Merge 'Handle nowait support for reads and writes independently' from Pavel Emelyanov
file: Change nowait_works mode detection
file: Introduce read-only nowait_mode
filesystem: Make nowait_works bit a enum class too
file: Make nowait_works bit a enum class
> Merge 'net/tls: improve OpenSSL error queue hygiene' from Gellért Peresztegi-Nagy
net/tls: assert clean error queue before SSL operations
net/tls: clear error queue after successful SSL operations
net/tls: clear error queue after successful SSL_CTX_new
net/tls: drain error queue on unexpected error codes
net/tls: use make_openssl_error for BIO creation failure
> vla.hh: add missing includes
> Merge 'smp: make smp::count non-static' from Avi Kivity
smp: convert all smp::count usages to instance-aware alternatives
smp: add per-instance shard_count and this_smp() infrastructure
disk_params: document pre-init smp::count access with explicit 0
reactor_backend: document pre-init smp::count access with explicit 0
tests: alien_test: pass shard count to alien thread explicitly
> build: fix cmake missing ninja on Ubuntu 26.04
> rpc: Fix uint64 wraparound of expired timeout in send_entry()
> Merge 'Generalize some RPC tests' from Pavel Emelyanov
tests: Generalize async connection-based scheduling RPC tests
tests: Generalize sync connection-based scheduling RPC tests
tests: Remove redundant variadic/nonvariadic RPC tuple tests
tests: Generalize max timeout RPC tests
> net: tls: openssl: Share BIO ptrs across shards
> http: fix compilation on clang 22 with c++26
> build: openssl tools needed for test cert generation
> reactor: support rename2
> future: fix forwarding of reference types
> Merge 'Zero-copy http chunked data sink' from Pavel Emelyanov
http: Make chunked data sink zero-copy
tests/prometheus_http: Rewrite on top of http::client
tests/httpd: Rewrite content_length_limit on top of http::client
> tests: Replace ad-hoc http_consumer with production HTTP parser
> Merge 'co_return to accept same expressions and types as return' from Alexey Bashtanov
tests/unit/{coroutines,futures}: strict types on co_return and set_value
api: introduce version 10:
core/{coroutine,future}: make `co_return` more strict with types
core/{coroutine,future}: preparations to fix `co_return` type semantics
> Merge 'Perftune.py: add special handling for mlx5 rss queues number calculation' from Vladislav Zolotarov
perftune.py: NetPerfTuner: enhance RSS (a.k.a. "Rx") queues accounting for mlx5 devices
perftune.py: update docstring of NetPerfTuner.__get_rps_cpus() method
perftune.py: add a method that parses and models the output of the 'ethtool -l' command for a given interface
> httpd: rewrite do_accepts/do_accept_one as coroutines
> file: add mmap support to file
> http: Move client code out of experimental namespace
> file: add hugetlbfs support to file system detection
> tests: Replace test_source_impl with util::as_input_stream
> tests: Replace buf_source_impl with util::as_input_stream
> Merge 'rpc_tester: expose throuput for rpc tester' from Marcin Szopa
rpc_tester: remove unused payload size variable from job_rpc_streaming class
rpc_tester: add start time tracking for throughput calculation, print throughput and msg/s for job_rpc
rpc_tester: refactor result emission to use dedicated functions for messages and throughput
> iostream: cast first argument of `std::min` to `size_t`
Closesscylladb/scylladb#29952