When the cache is constructed with expiry == 0 the underlying storage is
never instantiated and get_ptr() asserts via caching_enabled(). This is
fine for callers that need a handle into the cache, but it makes get_ptr()
unusable for write-only insertions on caches whose expiry is configurable
at runtime (e.g. caches driven by a LiveUpdate config option that the
operator may set to 0).
Add a new insert(k, load) method on loading_cache that returns a future<>
and is a no-op when caching is disabled, otherwise forwards to
get_ptr(k, load) and discards the resulting handle. This completes the
disabled-mode safety contract of the cache for the write side, mirroring
the fallback that get() already provides for the read side.
Switch authorized_prepared_statements_cache::insert() from
get_ptr().discard_result() to the new insert(), which fixes the crash
'Assertion caching_enabled() failed' in
authorized_prepared_statements_cache::insert() that occurs when
permissions_validity_in_ms is set to 0 and a prepared statement is
executed under authentication.
Fixes SCYLLADB-1699
The only test requires SCYLLA_ENABLE_ERROR_INJECTION. In modes without it
(e.g. release) the suite was empty, so pytest exited with code 5
("no tests collected") and CI failed. Add a no-op case in that branch
so collection always yields at least one test.
Deterministic reproducer using an error injection point placed in
table_helper::insert() between cache_table_info() and execute(). The
test parks fiber A at the injection, drops the target table (evicting
the prepared_statements_cache entry), runs fiber B which nulls
_insert_stmt, then releases fiber A. Without the fix this crashes in
execute(); with the fix fiber A holds a local strong ref and proceeds.
Uses the new waiters() API to synchronize with fiber A's entry into
the injection.
Returns the number of fibers currently suspended in wait_for_message()
for a named injection. Lets tests synchronize precisely with code parked
on an injection point.
insert() held no local strong ref to the prepared modification_statement
across the suspension in execute(). On a single shard:
1. Fiber A suspends inside _insert_stmt->execute().
2. DROP TABLE / DROP KEYSPACE on the target, or LRU eviction, removes
the prepared_statements_cache entry, releasing its strong ref.
3. Fiber B re-enters cache_table_info(), sees _prepared_stmt
(checked_weak_ptr) invalidated, and runs _insert_stmt = nullptr,
releasing the last strong ref. The modification_statement is freed.
4. Fiber A resumes inside execute() and touches freed *this.
Pin strong ref to _insert_stmt locally before the suspension.
- Revert the previous "test.py: fix test collection bug" commit (92c09d10) which worked around broken deduplication by filtering items without `BUILD_MODE` in `pytest_collection_modifyitems`. This approach masked the root cause and is superseded by the proper fixes below.
- Backport pytest 9.0.3's argument normalization algorithm into `test.py` to work around broken deduplication in pytest 8.3.5 ([pytest-dev/pytest#12083](https://github.com/pytest-dev/pytest/issues/12083)). Duplicate or subsumed test paths (e.g. `test/cql` and `test/cql/lua_test.cql`) are collapsed before invoking pytest. Revert when upgrading to pytest 9.x.
- Return a `DisabledFile` collector instead of an empty list in `pytest_collect_file` when all modes are disabled for a file, fixing a bug where subsequent files would not get their stash items set (`REPEATING_FILES`). Restructure `pytest_collect_file` to use a walrus operator (`if repeats := ...`) with a single `remove(file_path)` and `return collectors` at the end, eliminating the early return.
- Add `--keep-duplicates` CLI argument to bypass deduplication and forward to pytest.
- Move `RUN_ID` assignment from `pytest_collect_file` to `modify_pytest_item`. A shared `run_ids` cache (`defaultdict[tuple[str, str], count]`) is created in `pytest_collection_modifyitems` and passed to `modify_pytest_item`, keyed by `(build_mode, nodeid)` so each mode gets independent counters. This ensures unique run IDs even when `--keep-duplicates` causes the same file to be collected multiple times.
- Fix `--repeat` option default from string `"1"` to int `1` — argparse only applies `type=` to CLI-parsed values, not defaults.
pytest normally deduplicates overlapping test arguments — e.g. `test/cql test/cql/lua_test.cql` collects `lua_test.cql` only once. The original `test.py` never performed this deduplication, and the pytest version in the toolchain image (8.3.5) has a bug that breaks it ([pytest-dev/pytest#12083](https://github.com/pytest-dev/pytest/issues/12083).)
Since we are moving to bare pytest, `test.py` should match pytest's default behavior: deduplicate. Because we cannot easily upgrade pytest, commit 2 backports the deduplication logic from pytest 9.0.3.
To match pytest's interface, `--keep-duplicates` is added as an opt-out. This lets a user intentionally run overlapping paths — e.g. `./test.py test/blah test/blah/test_foo.py --keep-duplicates` runs `test_foo.py` twice. The flag is forwarded to pytest and also skips the backported deduplication in `test.py`.
- Revert 92c09d10 which filtered items without `BUILD_MODE` in `pytest_collection_modifyitems` and added an early return in `CppFile.collect()`. This workaround is superseded by the proper deduplication and `DisabledFile` fixes.
- Add `_CollectionArgument` dataclass (`order=True`, `__contains__` for subsumption) and `_deduplicate_test_args()` function, adapted from pytest 9.0.3. Marked with a TODO to remove once we update to pytest 9.x.
- Call `_deduplicate_test_args()` on `options.name` before passing to pytest.
- Add `DisabledFile(pytest.File)` that skips collection with an informative message instead of returning an empty list.
- Restructure `pytest_collect_file` to use walrus operator: `if repeats := ...:` / `else:` — single `remove(file_path)` at end, no early return.
- Add `--keep-duplicates` argument that skips deduplication and is forwarded to pytest.
- Create a shared `run_ids` cache in `pytest_collection_modifyitems` and pass it to `modify_pytest_item`, which assigns unique sequential RUN_IDs via `itertools.count`. The cache is keyed by `(build_mode, nodeid)` so each mode gets independent counters.
- Remove `RUN_ID` from `_STASH_KEYS_TO_COPY` — it is no longer set on collectors.
- Remove `CppFile.run_id` cached_property. `CppTestCase` now reads `RUN_ID` from its own item stash.
- Fix `--repeat` option default from `"1"` to `1` and drop redundant `int()` cast.
Closes SCYLLADB-1730
Closesscylladb/scylladb#29665
* github.com:scylladb/scylladb:
test: add --keep-duplicates and assign RUN_ID via shared cache
test/pylib/runner: fix disabled file collection
test.py: deduplicate CLI test arguments before passing to pytest
Revert "test.py: fix test collection bug"
The second logger.debug() call accesses ack2_msg after it was moved
via std::move() in the co_await send_gossip_digest_ack2 call.
This is undefined behavior.
Fix by formatting ack2_msg to a string before the move, then using
that cached string in both debug log calls.
FIXES: https://scylladb.atlassian.net/browse/SCYLLADB-1778Closesscylladb/scylladb#29227
lock_tables_metadata() acquires a write lock on tables_metadata._cf_lock
on every shard. It used invoke_on_all(), which dispatches lock
acquisitions to all shards in parallel via parallel_for_each +
smp::submit_to.
When two fibers call lock_tables_metadata() concurrently, this can
deadlock. parallel_for_each starts all iterations unconditionally:
even when the local shard's lock attempt blocks (because the other
fiber already holds it), SMP messages are still sent to remote shards.
Both fibers' lock-acquisition messages land in the per-shard SMP
queues. The SMP queue itself is FIFO, but process_incoming() drains
it and schedules each item as a reactor task via add_task(), which —
in debug and sanitize builds with SEASTAR_SHUFFLE_TASK_QUEUE — shuffles
each newly added task against all pending tasks in the same scheduling
group's reactor task queue. This means fiber A's lock acquisition can
be reordered past fiber B's (and past unrelated tasks) on a given shard.
If fiber A wins the lock on shard X while fiber B wins on shard Y, this
creates a classic cross-shard lock-ordering deadlock (circular wait).
In production builds without SEASTAR_SHUFFLE_TASK_QUEUE, the reactor
task queue is FIFO. Still, even in release builds, the SMP queues can
reorder messages even, so the deadlock is still possible, even if it's
much less likely. In debug and sanitize builds, the task-queue shuffle
makes the deadlock very likely whenever both fibers' lock-acquisition
tasks are pending simultaneously in the reactor task queue on any shard.
This deadlock was exposed by ce00d61917 ("db: implement large_data
virtual tables with feature flag gating", merged as 88a8324e68),
which introduced legacy_drop_table_on_all_shards as a second caller
of lock_tables_metadata(). When LARGE_DATA_VIRTUAL_TABLES is enabled
during topology_state_load (via feature_service::enable), two fibers
can race:
1. activate_large_data_virtual_tables() — calls
legacy_drop_table_on_all_shards() which calls
lock_tables_metadata() synchronously via .get()
2. reload_schema_in_bg() — fires as a background fiber from
TABLE_DIGEST_INSENSITIVE_TO_EXPIRY, eventually reaches
schema_applier::commit() which also calls lock_tables_metadata()
If both reach lock_tables_metadata() while the lock is free on all
shards, the parallel acquisition creates the deadlock opportunity.
The deadlock blocks topology_state_load() from completing, which
prevents the bootstrapping node from finishing its topology state
transitions. The coordinator's topology coordinator then waits for
the node to reach the expected state, but the node is stuck, so
eventually the read_barrier times out after 300 seconds.
Fix by acquiring the shard 0 lock first before attempting to
acquire any other lock. Whichever fiber wins shard 0 is
guaranteed to acquire all remaining shards before the other fiber
can proceed past shard 0, eliminating the circular-wait condition.
Tested manually with 2 approaches:
1. causing different shard locks to be acquired by different
lock_tables_metadata() calls by adding different sleeps depending
on the lock_tables_metadata() call and target shard - this reproduced
the issue consistently
2. matching the time point at which both fibers reach lock_tables_metadata()
adding a single sleep to one of the fibers - this heavily depends on
the machine so we can't create a universal reproducer this way, but
it did result in the observed failure on my machine after finding the
right sleep time
Also added a unit test for concurrent lock_tables_metadata() calls.
Fixes: SCYLLADB-1694
Fixes: SCYLLADB-1644
Fixes: SCYLLADB-1684
Closesscylladb/scylladb#29678
In do_execute_cql_with_timeout(), when the prepared statement was not found in the cache, we called qp.prepare() and stored the returned result_message::prepared in a local variable scoped to the 'if' block. We then extracted ps_ptr (a checked_weak_ptr to the prepared statement) from the message, let the message go out of scope at the end of the 'if', and used ps_ptr after a co_await on st->execute().
Since 3ac4e258e8 ("transport/messages: hold pinned prepared entry in PREPARE result"), result_message::prepared owns a strong pinned reference to the prepared cache entry. While qp.prepare() runs it also holds its own pin on the entry, so on return the entry has at least the pin owned by the returned message. As long as that message is alive, the cache entry cannot be purged and the weak handle inside ps_ptr remains promotable.
The lifetime gap manifested only in debug builds. qp.prepare() returns a ready future on the cache-miss path, so in release builds the co_await resumes synchronously: control flows from the assignment of ps_ptr straight into st->execute() with no opportunity for any other task (in particular, prepared cache invalidation triggered by a concurrent schema change) to run in between. Debug builds, however, force a reactor preemption point on every co_await even when the awaited future is ready. With prepared_msg already destroyed at the end of the 'if' block, the only remaining handle on the cache entry was the weak ps_ptr, and the preemption gave a concurrent cache purge
- triggered, for example, by Raft schema changes received during a node restart - the chance to drop the entry. The subsequent execute() then failed when promoting the weak pointer with
checked_ptr_is_null_exception.
The exception propagated out of the Paxos prepare path as a generic std::exception with no type information in the log, surfacing on the coordinator as:
WriteFailure: Failed to prepare ballot ... Replica errors:
host_id ... -> seastar::rpc::remote_verb_error (std::exception)
Hoist the result_message::prepared into the outer scope so the pinned cache entry stays alive across co_await st->execute(...), closing the window in which a concurrent cache purge could invalidate the weak handle.
Fixes SCYLLADB-1173
backport: the patch is simple, we can backport it to all versions with "LWT over tablets" feature. Note that the problem is only in test runs in debug configuration, production is not affected.
Closesscylladb/scylladb#29675
* https://github.com/scylladb/scylladb:
table_helper: retry insert prepare on concurrent cache invalidation
paxos_state: keep prepared message alive across statement execution
Since commit e942c074f2 changed _tasks from std::list<shared_ptr<...>>
to a boost::intrusive_list, iterating yields raw compaction_task_executor
objects rather than shared_ptr wrappers. The GDB script was updated to
use intrusive_list() but still wrapped elements in seastar_shared_ptr(),
causing 'gdb.error: There is no member or method named _p' when
compaction tasks are active.
Move the seastar_shared_ptr unwrapping to the 6.2 compatibility
fallback path only, since the intrusive list path yields objects
directly.
Fixes: SCYLLADB-1762
Closesscylladb/scylladb#29690
This series addresses three problems in the audit startup/shutdown
sequence:
1. [BUG] Shutdown SIGABRT. During graceful shutdown, deferred stops run in reverse order of construction. With the audit service constructed after the maintenance socket, audit was destroyed first, and in-flight queries on the maintenance socket could hit the destroyed audit service (assertion failure in sharded::local()).
2. [BUG] Startup audit bypass. The maintenance socket opened before audit storage was initialized, allowing queries (e.g. creating a superuser) to bypass auditing in that window.
3. [PROBLEM] Blocks SCYLLADB-1430. The existing order prevents audit configuration from being driven by group0 state, because audit started before group0.
The series is organized as: a test-helper refactor, a test for the audited maintenance-socket flow, a startup-phase split, the construction-order fix and its shutdown-race test, and finally the storage-before-socket fix and its startup-window test.
Fixes SCYLLADB-1615
No backport, bugs don't seem severe enough to justify backporting.
Closesscylladb/scylladb#29539
* github.com:scylladb/scylladb:
audit: assert storage ordering invariants at runtime
audit: start maintenance socket after audit storage
audit: move audit construction before maintenance socket
audit: split startup into construction and storage phases
test: audit: verify maintenance socket operations are audited
test: audit: parameterize source address in audit assertions
Commit 8d34127684 ("sstables: clean up TemporaryHashes file in wipe()")
unconditionally calls filename(..., component_type::TemporaryHashes)
inside filesystem_storage::wipe(). However, the TemporaryHashes
component is only registered in the component map of the 'ms' sstable
format. For older formats (ka, la, mc, md, me) the lookup goes through
sstable_version_constants::get_component_map(version).at(...) and throws
std::out_of_range.
The exception is then swallowed by the outer catch(...) in wipe(), which
just logs and ignores. As a side effect, the subsequent
remove_file(new_toc_name) is never reached and the TemporaryTOC
('*-TOC.txt.tmp') file is left as an orphan on disk after every unlink()
of a non-'ms' sstable.
Guard the lookup with get_component_map(version).contains() so the
cleanup is only attempted for formats that actually define the
component.
Add a regression test in test/boost/sstable_directory_test.cc that
creates an 'me'-format sstable, unlinks it and asserts that the sstable
directory is left empty. Without the fix the test fails with a leftover
'me-...-TOC.txt.tmp' file.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1697Closesscylladb/scylladb#29620
`ScyllaCluster.nodelist()` creates new `ScyllaNode` objects on every call,
so per-node state set via `set_smp()`, `set_log_level()`, and
`_adjust_smp_and_memory()` was lost. This meant `set_smp()` had no effect
when `cluster.start()` was called after it, since `start_nodes()` calls
`nodelist()` internally which creates fresh nodes with default values.
- Add debug logging for smp/memory in ScyllaNode
- Store per-node settings (smp, memory, log levels) in a
`ScyllaCluster._node_resources` dict keyed by server_id, so they survive
`nodelist()` reconstruction. `ScyllaNode` restores its state from this dict
on construction and saves it back whenever `set_smp()`, `set_log_level()`,
or `_adjust_smp_and_memory()` modifies it.
- Add a reproducer test verifying `set_smp()` takes effect on restart
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1629
--
No backport needed: this only fixes dtest infrastructure, no production code
is affected.
Closesscylladb/scylladb#29549
* github.com:scylladb/scylladb:
test/cluster/dtest: add test for node.set_smp() persistence
test/cluster/dtest: cache ScyllaNode instances in ScyllaCluster
test/cluster/dtest/ccmlib/scylla_node: add debug logging
Add --keep-duplicates CLI argument to bypass deduplication and forward
to pytest, allowing duplicate test file arguments to be collected
multiple times.
Move RUN_ID assignment from pytest_collect_file to modify_pytest_item.
All File collectors for the same source file share a single run_ids
dict (via RUN_ID_CACHE stash key), so items from duplicate collection
arguments (e.g. with --keep-duplicates) automatically get unique IDs.
Remove CppFile.run_id cached_property — CppTestCase now reads RUN_ID
from its own item stash, which is set during modify_pytest_item.
Fix --repeat option default from string "1" to int 1 — argparse only
applies type= to CLI-parsed values, not defaults.
Co-Authored-By: Claude Opus 4.6 (200K context) <noreply@anthropic.com>
Return a DisabledFile collector instead of an empty list when all modes
are disabled for a file. Returning an empty list caused subsequent
files to not get their stash items set because file_path was never
removed from REPEATING_FILES.
Co-Authored-By: Claude Opus 4.6 (200K context) <noreply@anthropic.com>
Backport the argument normalization algorithm from pytest 9.0.3 to
work around broken deduplication in pytest 8.3.5
(https://github.com/pytest-dev/pytest/issues/12083).
Duplicate or subsumed test paths (e.g. 'test/cql' and
'test/cql/lua_test.cql') are now collapsed before invoking pytest.
Revert this commit when upgrading to pytest 9.x.
Co-Authored-By: Claude Opus 4.6 (200K context) <noreply@anthropic.com>
Abort if audit storage fails to start rather than silently
running with an unaudited maintenance socket. Also assert
that storage is already stopped when the audit service is
destroyed, documenting the defer-stack ordering requirement.
Refs SCYLLADB-1615
Refs SCYLLADB-1695
Without this, there is a window after startup where queries on
the maintenance socket bypass auditing because audit storage
is not yet initialized.
Fixes SCYLLADB-1615
During graceful shutdown, deferred stops run in reverse order of
construction. When the audit service was constructed after the
maintenance socket, audit was destroyed first. A DML query
still in-flight on the maintenance socket could then bypass
auditing entirely.
Move construction as early as possible so the audit service
outlives the maintenance socket on the defer stack, and to
maximise the window in which attempts to use audit before
storage is ready are caught with on_internal_error_noexcept.
Refs SCYLLADB-1615
The table-based audit backend needs Raft to create its keyspace,
but the audit service must exist earlier so that CQL paths don't
silently skip auditing.
Split startup into two phases: construction and storage
initialization. Queries arriving between the two phases are
logged as errors.
This is a refactoring commit and the split sections will be
moved later in this patch series.
Refs SCYLLADB-1615
User creation via the maintenance socket should produce audit
entries, as this is the recommended flow for creating the
initial superuser when default credentials are disabled.
The test is parametrized by audit backend (table and syslog).
The maintenance socket source address is "::" because Seastar
returns a zero-initialised in6_addr for AF_UNIX sockets.
Test time in dev: 0.6s
Refs SCYLLADB-1615
Some tests in test_tablets.py read system_schema.keyspaces from an arbitrary node that may not have applied the latest schema change yet. Pin the read to a specific node and issue a read barrier before querying, ensuring the node has up-to-date data.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1700
Test fix; no backport
Closesscylladb/scylladb#29655
* github.com:scylladb/scylladb:
test: fix flaky rack list conversion tests by using read barrier
test: fix flaky test_enforce_rack_list_option by using read barrier
table_helper::insert() retrieves the prepared statement via
cache_table_info() and then dereferences _prepared_stmt to read
bound_names. _prepared_stmt is a checked_weak_ptr into the prepared
statements cache and can be invalidated at any time by a concurrent
purge (for example, on a schema change).
cache_table_info() (re-)prepares the statement and assigns
_prepared_stmt before returning, and the strong pin held by the
result_message::prepared returned from qp.prepare() keeps the cache
entry alive only for the duration of try_prepare(). After try_prepare()
returns, the pin is gone and _prepared_stmt is the only remaining
handle on the entry.
In release builds this is fine: the chain of ready-future co_awaits
between try_prepare() finishing and _prepared_stmt->bound_names being
read resumes synchronously, so no other task -- in particular, no
cache purge -- can run in that window. In debug builds, however,
Seastar inserts a reactor preemption point on every co_await even when
the awaited future is ready. That preemption window is wide enough for
a concurrent invalidation to drop the freshly installed cache entry,
turning _prepared_stmt into a null weak handle and crashing the
subsequent dereference with checked_ptr_is_null_exception.
Wrap the cache_table_info() call in a loop that re-attempts the
preparation until a synchronous post-resume check finds _prepared_stmt
still valid. The check runs in the same task immediately after the
co_await resumes, with no co_await between the check and the
dereference, so a purge cannot slip in. _insert_stmt is a strong
shared_ptr to the statement object and is not affected by cache
invalidation, so it remains safe to use across the final co_await on
execute().
The other caller of cache_table_info(),
trace_keyspace_helper::apply_events_mutation(), accesses only the
strong _insert_stmt via insert_stmt() and never dereferences the weak
_prepared_stmt, so it is unaffected.
Refs SCYLLADB-1173
Parallelize SSTable creation using parallel_for_each. The file
count is made a parameter with a default of 64, allowing future
S3/GCS variants to use a smaller count if needed.
Parallelize SSTable creation using parallel_for_each and reduce
the SSTable count from 256 to 64 for S3/GCS variants. The local
test variant retains the original 256 count.
Parallelize SSTable creation across all sub-tests using
parallel_for_each and reduce the SSTable count from 256 to 64 for
S3/GCS variants.
Re-enable the S3 test variant that was previously disabled due to
taking 4+ minutes. With parallel creation and reduced count, the
test now completes in a reasonable time.
Pre-extract mutation pairs and use parallel_for_each with
make_sstable_containing_async to create SSTables concurrently
instead of sequentially. The post-creation loop still runs serially
to collect token ranges and generations.
Pre-extract mutation pairs and use parallel_for_each with
make_sstable_containing_async to create SSTables concurrently
instead of sequentially. The post-creation loop still runs serially
to collect token ranges and generations that depend on SSTable order.
Use parallel_for_each with make_sstable_containing_async to create
SSTables concurrently instead of sequentially, reducing wall-clock
time on remote storage backends (S3/GCS).
Use parallel_for_each with make_sstable_containing_async to create
SSTables concurrently instead of sequentially, reducing wall-clock
time on remote storage backends (S3/GCS).
Raise log levels for s3 and gcp_storage from debug to trace, and add
trace-level logging for http and default_http_retry_strategy modules.
This provides better visibility into storage backend interactions
when debugging slow or failing compaction tests on remote storage.
The original make_memtable used seastar::thread::yield() for
preemption, which required all callers to run inside a
seastar::thread context. This prevented the utilities from being
used directly in coroutines or parallel_for_each lambdas.
Make the primary functions — make_memtable, make_sstable_containing,
and verify_mutation — return future<> directly. Callers now .get()
explicitly when in seastar::thread context, or co_await when in
a coroutine.
make_memtable now uses coroutine::maybe_yield() instead of
seastar::thread::yield(). verify_mutation is converted to
coroutines as well.
Requested in:
https://github.com/scylladb/scylladb/pull/29416#pullrequestreview-4112296282
test_exception_safety_of_update_from_memtable called make_memtable
inside the row_cache::external_updater callback. external_updater
runs as a synchronous execute() call that must not yield, but
make_memtable calls seastar::thread::yield() every 10th mutation.
The bug was latent because the test only inserted 5 mutations, so
the yield was never reached. Move the call before the callback.
Prerequisite for the next patch, which changes make_memtable to
call make_memtable_async().get() -- that would yield on every
mutation via coroutine::maybe_yield(), making this bug visible.
Increase max_connections from the default to 32 for the S3 endpoint
used in tests. This allows more concurrent HTTP connections to the S3
backend, which is needed to benefit from parallel SSTable creation
that will be introduced in subsequent commits.
mark_for_deletion() only set an in-memory flag; the actual file
deletion ran lazily when the last shared_sstable reference dropped,
leaving a window in which a follow-up scan of the upload directory
(e.g. a second 'nodetool refresh --load-and-stream') could observe a
partially-deleted sstable and fail with malformed_sstable_exception.
Force the unlink to complete before stream() returns. For tablet
streaming, partially-contained sstables span multiple per-tablet
batches, so a defer_unlinking flag postpones the unlink until after
all sstables are streamed; for vnodes and fully-contained sstables are streamed
only once and could be removed just after being streamed.
Added a FIXME on object_storage_base::wipe and strengthened the doc on storage::wipe to
make the never-fails contract explicit
Commit 2b7aa32 (topology_coordinator: Refresh load stats after
table is created or altered) registered topology_coordinator as a
schema change listener and added on_create_column_family which
fire-and-forgets _tablet_load_stats_refresh.trigger(). The
triggered task runs on the gossip scheduling group via
with_scheduling_group and accesses the topology_coordinator via
'this'.
stop() unregisters the listener but does not wait for any
in-flight refresh task. If a notification fires between
_tablet_load_stats_refresh.join() in run() and unregister_listener
in stop(), the scheduled task can outlive the topology_coordinator
and access freed memory after run_topology_coordinator's coroutine
frame is destroyed.
Wait for the refresh to complete in stop() after unregistering the
listener, ensuring no task can fire after destruction.
Fixes SCYLLADB-1728
Backport to 2026.1 and 2026.2, because the issue was introduced in 2b7aa32Closesscylladb/scylladb#29653
* https://github.com/scylladb/scylladb:
test: tablet_stats: reproduce shutdown refresh race
topology_coordinator: join tablet load stats refresh in stop()
Add a test that reproduces SCYLLADB-1629: set_smp() had no effect
because nodelist() created new ScyllaNode objects on every call,
losing the _smp_set_during_test value. The test fails without the
fix in the previous patch.
ScyllaCluster.nodelist() was creating new ScyllaNode objects on every
call, so per-node state set via set_smp(), set_log_level(), and
_adjust_smp_and_memory() was lost between calls.
Fix by caching ScyllaNode instances in a list populated by
_add_nodes() using the list returned by servers_add() in populate().
Nodes are assigned monotonically increasing names (node1, node2, ...).
nodelist() simply returns the cached list.
The LDAP role manager's `_cache_pruner` background fiber periodically calls cache::reload_all_permissions(). Two races cause it to hit SCYLLA_ASSERT(_permission_loader):
- Cross-shard race: The pruner `used _cache.container().invoke_on_all()` to reload permissions on every shard. Since both `service::start()` and `sharded<service>::stop()` execute per-shard in parallel, the pruner on one shard could call reload_all_permissions() on another shard before that shard set its loader (startup) or after it cleared its loader (shutdown). Each shard runs its own pruner instance, so reloading locally is sufficient — this also removes redundant N² reload calls.
- Intra-shard race: `service::stop()` cleared the permission loader and stopped the role manager concurrently (via when_all_succeed). A mid-reload pruner could yield and then call the now-null loader. Fixed by stopping the role manager first so the pruner is fully drained before the loader is cleared.
Fixes SCYLLADB-1679
Backport to 2026.2, introduced in 7eedf50c12Closesscylladb/scylladb#29605
* github.com:scylladb/scylladb:
auth: make shutdown the exact reverse of startup
test: ldap: add test for pruner crash during shutdown
auth: start authorizer and set permission loader before role manager
auth: stop role manager before clearing permission loader
auth: reload LDAP permission cache on local shard only
In certain circumstances current way of collecting can be error-prone. Collection can stop when the first file is skipped in the mode leaving the rest of the files in CLI not collected.
Another issue that if the file specified twice, with directory and file explicitly, it will produce incorrect CppFile in the stash causing KeyError.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1714
No backport, test framework bug fix only.
Closesscylladb/scylladb#29634
* github.com:scylladb/scylladb:
test.py: fix framework test
test.py: fix test collection bug
In do_execute_cql_with_timeout(), when the prepared statement was not
found in the cache, we called qp.prepare() and stored the returned
result_message::prepared in a local variable scoped to the 'if' block.
We then extracted ps_ptr (a checked_weak_ptr to the prepared statement)
from the message, let the message go out of scope at the end of the
'if', and used ps_ptr after a co_await on st->execute().
Since 3ac4e258e8 ("transport/messages: hold pinned prepared entry in
PREPARE result"), result_message::prepared owns a strong pinned
reference to the prepared cache entry. While qp.prepare() runs it also
holds its own pin on the entry, so on return the entry has at least
the pin owned by the returned message. As long as that message is
alive, the cache entry cannot be purged and the weak handle inside
ps_ptr remains promotable.
The lifetime gap manifested only in debug builds. qp.prepare() returns
a ready future on the cache-miss path, so in release builds the
co_await resumes synchronously: control flows from the assignment of
ps_ptr straight into st->execute() with no opportunity for any other
task (in particular, prepared cache invalidation triggered by a
concurrent schema change) to run in between. Debug builds, however,
force a reactor preemption point on every co_await even when the
awaited future is ready. With prepared_msg already destroyed at the
end of the 'if' block, the only remaining handle on the cache entry
was the weak ps_ptr, and the preemption gave a concurrent cache purge
- triggered, for example, by Raft schema changes received during a
node restart - the chance to drop the entry. The subsequent execute()
then failed when promoting the weak pointer with
checked_ptr_is_null_exception.
The exception propagated out of the Paxos prepare path as a generic
std::exception with no type information in the log, surfacing on the
coordinator as:
WriteFailure: Failed to prepare ballot ... Replica errors:
host_id ... -> seastar::rpc::remote_verb_error (std::exception)
Hoist the result_message::prepared into the outer scope so the pinned
cache entry stays alive across co_await st->execute(...), closing the
window in which a concurrent cache purge could invalidate the weak
handle.
Fixes SCYLLADB-1173
- Ensure servers[1] is not the topology coordinator before restarting it, preventing the leader death + re-election + re-repair sequence that masked the compaction-merge bug
- Add a retry loop that detects post-restart leadership transfer to servers[1] via direct coordinator query, retrying up to 5 times
Fixes: SCYLLADB-1478
Backporting to 2026.2, which sees the failure regularly.
Closesscylladb/scylladb#29671
* github.com:scylladb/scylladb:
test/cluster/test_incremental_repair: add retry for residual leadership race
test/cluster/test_incremental_repair: fix flaky coordinator-change scenario
The coordinator can receive a schema-change notification after run()
finishes but before stop() unregisters listeners. The test pins that
window with error injections and verifies stop() waits for the refresh
instead of letting it outlive the coordinator.
Test time in dev: 9.51s
Refs SCYLLADB-1728
Commit 2b7aa3211d made schema changes trigger tablet load stats
refreshes in the background. A notification can still arrive after
run() stops the periodic refresher and before the coordinator object
is destroyed.
Move lifecycle subscription cleanup to stop() and join the serialized
refresh there after unregistering refresh trigger sources. This keeps
the coordinator alive until notification-triggered refresh work has
completed.
Fixes SCYLLADB-1728
There is a small race window where Raft leadership could transfer back
to servers[1] between the ensure_group0_leader_on() check and the
actual restart. If this happens, the new coordinator re-initiates
repair and masks the compaction-merge bug.
Extract the core test logic into _do_race_window_promotes_unrepaired_data()
which directly checks get_topology_coordinator() after restart and raises
_LeadershipTransferred if servers[1] became coordinator. The test
function calls this helper in a retry loop (up to 5 attempts).
Refs: SCYLLADB-1478
The test_incremental_repair_race_window_promotes_unrepaired_data test
was flaky because it hardcodes servers[1] as the restart target but did
not ensure servers[1] was NOT the topology coordinator.
When servers[1] happened to be the Raft group0 leader (topology
coordinator), restarting it killed the leader, forced a new election,
and the new coordinator re-initiated tablet repair. This re-repair
flushes memtables on all replicas via take_storage_snapshot() and marks
the resulting sstables as repaired -- causing post-repair keys to appear
in repaired sstables on servers[0] and servers[2]. The test then hit
the wrong assertion (servers[0]/[2] contaminated).
Fix: before starting the repair, check whether servers[1] is the
topology coordinator. If so, move leadership to another server via
ensure_group0_leader_on() so that restarting servers[1] only kills a
follower -- which does not trigger an election or coordinator change.
Reproducibility was confirmed by forcing leadership to servers[1] via
ensure_group0_leader_on() and observing deterministic failure with all
three servers showing post-repair keys in repaired sstables (confirming
the re-repair scenario), then verifying the fix passes reliably.
Fixes: SCYLLADB-1478
test_numeric_rf_to_rack_list_conversion and
test_numeric_rf_to_rack_list_conversion_abort were reading
system_schema.keyspaces from an arbitrary node that may not have
applied the latest schema change yet. Pin the read to a specific node
and issue a read barrier before querying, ensuring the node has
up-to-date data.
The test was reading system_schema.keyspaces from an arbitrary node
that may not have applied the latest schema change yet. Pin the read
to a specific node and issue a read barrier before querying, ensuring
the node has up-to-date data.