Verify that the load balancer does not issue intranode migrations when
the load difference between shards is within the size_based_balance_threshold,
and that it does issue migrations when the difference exceeds the threshold.
(cherry picked from commit 6856f51097)
The intranode shard balancing loop only stopped when the most-loaded
and least-loaded shard were the same (src == dst), meaning it would
keep issuing migrations until the load difference reached exactly 0.
This caused unnecessary migrations for negligible imbalances.
Apply the same is_balanced() threshold check that is already used for
inter-node balancing, so that intranode migrations stop when the
relative load difference between shards is within the configured
size_based_balance_threshold (default 1%).
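A minimal sketch of the threshold check described above, written in Python for brevity. The exact formula used by is_balanced() is not stated here: the relative difference below (difference divided by the larger load) is an assumption, and should_migrate() is a hypothetical helper name.

```python
def is_balanced(src_load: float, dst_load: float, threshold: float = 0.01) -> bool:
    """Return True when the relative load difference is within the threshold.

    Assumption: "relative difference" means (max - min) / max; the real
    is_balanced() may normalize differently.
    """
    hi, lo = max(src_load, dst_load), min(src_load, dst_load)
    if hi == 0:
        return True  # both shards are empty, nothing to balance
    return (hi - lo) / hi <= threshold


def should_migrate(shard_loads, threshold: float = 0.01) -> bool:
    """Decide whether the intranode loop should issue another migration."""
    return not is_balanced(max(shard_loads), min(shard_loads), threshold)
```

With the old src == dst stop condition, a 0.5% imbalance would still trigger migrations; with the threshold check it does not.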
(cherry picked from commit aaead10e5d)
test_create_role_mixed_cluster calls servers_add(2) to bootstrap two old
nodes concurrently, then adds a new node before issuing CREATE ROLE. The
concurrent bootstraps trigger the well-known Python driver bug
(scylladb/python-driver#317): two on_add notifications race in
update_created_pools, causing a second pool to be created for a host whose
pool was already established. If CREATE ROLE is in-flight on the old pool
when it is closed, the driver retries on the new pool, executing the
statement twice. The second execution fails with "Role ... already exists",
making the test flaky.
Fix by using CREATE ROLE IF NOT EXISTS. This is safe because unique_name()
generates a timestamp+random suffix that is guaranteed to be unique; the
role can "already exist" only due to the driver double-execution bug, never
due to a real conflict.
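The effect of the double execution can be sketched with a small in-memory model. FakeCluster and execute_twice are hypothetical names used only for illustration; they do not correspond to anything in the driver or the test suite.

```python
class RoleAlreadyExists(Exception):
    pass


class FakeCluster:
    """Hypothetical in-memory stand-in for the set of existing roles."""

    def __init__(self):
        self.roles = set()

    def create_role(self, name, if_not_exists=False):
        if name in self.roles:
            if if_not_exists:
                return  # second execution is a harmless no-op
            raise RoleAlreadyExists(name)
        self.roles.add(name)


def execute_twice(cluster, name, if_not_exists):
    """Simulate the driver re-running an in-flight statement on a new pool."""
    cluster.create_role(name, if_not_exists)
    cluster.create_role(name, if_not_exists)
```

Without IF NOT EXISTS the second call raises; with it, the role is created once and the retry is absorbed.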
This is the same workaround that has been applied many times elsewhere in
our test suite for exactly the same root cause:
- CREATE KEYSPACE was changed to CREATE KEYSPACE IF NOT EXISTS (scylladb#18368,
later generalised in scylladb#22399 via new_test_keyspace helpers)
- DROP KEYSPACE was changed to DROP KEYSPACE IF EXISTS (scylladb#29487)
Fixes: SCYLLADB-1811
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#29732
(cherry picked from commit 983eb5ab43)
Closes scylladb/scylladb#29743
The test uses a 10ms read timeout to exercise code paths that handle
timed-out reads without throwing C++ exceptions. As part of setup, it
inserts rows and flushes them to two SSTables, then runs a warm-up
SELECT to populate internal caches (e.g. the auth cache) before the
real test begins.
The reason for this warm-up read was the possibility that the first
read does additional operations (such as reading and caching
authentication) that might throw exceptions internally. I couldn't
verify that such exceptions actually happen in today's code, but
they might (re)appear in the future, so we should keep the warm-up
SELECT.
On slow CI machines (aarch64, debug build), that warm-up SELECT can
take longer than 10ms to read from the two SSTables. When it does, the
read times out: the coordinator receives 0 responses from the local
replica within the deadline and propagates a read_timeout_exception.
Since the exception is not caught, it escapes the test lambda, is
logged as "cql env callback failed", and causes Boost.Test to report a
C++ failure at the do_with_cql_env_thread call site. This matches the
CI failure seen in SCYLLADB-1774:
ERROR ... replica_read_timeout_no_exception: cql env callback failed,
error: exceptions::read_timeout_exception (Operation timed out for
replica_read_timeout_no_exception.tbl - received only 0 responses
from 1 CL=ONE.)
The CI log also shows that only 12 reads were admitted (the warm-up
read plus the 11 reads from the two prepare() calls and CREATE/INSERT
statements made earlier), and the current permit was stuck in
need_cpu state -- the reactor hadn't had a chance to schedule the read
before the 10ms window elapsed.
The fix catches read_timeout_exception from the warm-up SELECT and
retries until the read succeeds. The warm-up is required for
correctness: some lazy-init code paths (e.g. auth cache population)
use C++ exceptions for control flow internally. Those exceptions must
be absorbed before the cxx_exceptions baseline is sampled inside
execute_test(); otherwise they would appear in the delta and cause a
false test failure. Simply ignoring a timed-out warm-up is not safe,
because the lazy-init exceptions would then fire during the 1000 test
reads, inflating cxx_exceptions_after relative to
cxx_exceptions_before.
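The retry loop can be sketched as follows, in Python rather than the test's C++. Here `read` is a hypothetical callable standing in for the warm-up SELECT, TimeoutError stands in for read_timeout_exception, and the max_attempts bound is an assumption added so the sketch cannot loop forever.

```python
def warm_up(read, max_attempts=1000):
    """Retry `read` until it completes without timing out.

    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            read()
            return attempt
        except TimeoutError:
            continue  # slow machine: retry until the lazy init has run
    raise RuntimeError("warm-up read never succeeded")
```

Only once warm_up() has returned would the exception-counter baseline be sampled, so any lazy-init exceptions are already behind us.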
No other calls in setup are susceptible to the 10ms read timeout:
- CREATE KEYSPACE, CREATE TABLE, INSERT, and flush use the write
timeout (10s) and are not reads.
- e.prepare() goes through the query processor without reading table
data, so it is not subject to the read timeout.
- The semaphore manipulation in Test 2 is internal and has no timeout.
- All 1000 reads in execute_test() are expected to fail, so a timeout
there is the happy path, not a failure.
The 10ms timeout itself is fine for the test's purpose: it is
deliberately aggressive so that reads reliably time out on the hot path
being tested. The problem was only that the pre-test warm-up was not
guarded against the same timeout.
Fixes: SCYLLADB-1830
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#29731
(cherry picked from commit 1f15e05946)
Closes scylladb/scylladb#29760
In debug mode, this test can time out during tablet merge. While the
test already decreases the number of tables in debug mode (20 tables,
instead of 200 for dev mode), this is not enough, and the test can still
time out during merge. This change reduces the number of tables from 20
to 5 in debug mode.
It also lowers the log level for load_balancer to debug. This should make
any potential future problems with this test easier to investigate.
Fixes: SCYLLADB-1863
Closes scylladb/scylladb#29682
(cherry picked from commit ec4b483e88)
Closes scylladb/scylladb#29786
- Handle dropped tables gracefully in the tablet load balancer's `get_schema_and_rs()` instead of aborting with `on_internal_error`
- The load balancer operates on a token metadata snapshot but accesses the live schema for table lookups. A DROP TABLE applied by another fiber between coroutine yield points can remove a table from the live schema while it still exists in the snapshot, causing an abort.
`get_schema_and_rs()` now returns `std::optional` and logs a debug-level message instead of aborting when a table is missing. All callers skip dropped tables:
- `make_sizing_plan`: skips to next table
- `make_resize_plan`: skips to next table (merge suppression is moot)
- `check_constraints`: returns `skip_info{}` with empty viable targets
- `get_rs`: returns `nullptr`, checked by `check_constraints`
The call chain is: `make_plan` → `make_internode_plan` → `check_constraints` → `get_rs` → `get_schema_and_rs`. The `make_internode_plan` coroutine has multiple `co_await` yield points (`maybe_yield`, `pick_candidate`) between building the candidate tablet list and checking replication constraints. A DROP TABLE schema mutation applied during any of these yields removes the table from `_db.get_tables_metadata()` while the candidate list still references it.
Added `test_load_balancing_with_dropped_table` which simulates the race by capturing a token metadata snapshot, dropping the table, then calling `balance_tablets` with the stale snapshot.
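The skip-on-missing behaviour described above can be sketched like this, in Python rather than C++. The dictionary-based schema and the simplified signatures are assumptions for illustration; only the function names come from the description.

```python
from typing import Optional


def get_schema_and_rs(live_tables: dict, table_id: str) -> Optional[tuple]:
    """Return (schema, replication_strategy), or None if the table was dropped."""
    return live_tables.get(table_id)


def make_sizing_plan(snapshot_table_ids, live_tables):
    """Plan only for tables that still exist in the live schema."""
    plan = []
    for table_id in snapshot_table_ids:
        found = get_schema_and_rs(live_tables, table_id)
        if found is None:
            continue  # dropped after the snapshot was taken: skip it
        schema, rs = found
        plan.append((table_id, schema, rs))
    return plan
```

A table that is present in the stale snapshot but missing from the live schema simply produces no plan entry.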
Fixes: SCYLLADB-1905
This fix needs to be backported to versions: 2025.4, 2026.1
- (cherry picked from commit 4987204f71)
- (cherry picked from commit 6b3e18c4a9)
Parent PR: #29585
Closes scylladb/scylladb#29818
* github.com:scylladb/scylladb:
test: verify load balancer handles dropped tables gracefully
tablet_allocator: handle dropped tables gracefully in get_schema_and_rs
Alternator Streams now supports tablets, so stop skipping the TTL Streams test in tablet mode and stop forcing vnodes in the Streams audit test.
Refs SCYLLADB-463
Closes scylladb/scylladb#29697
(cherry picked from commit 459c1dc32f)
Closes scylladb/scylladb#29819
After `load_and_stream` (e.g. via `nodetool refresh --load-and-stream`)
returns success, source sstable files in the `upload/` directory may
still be on disk. `mark_for_deletion()` only sets an in-memory flag; the
actual file deletion runs lazily when the last `shared_sstable`
reference drops.
This leaves a window between API success and physical deletion where a
follow-up scan of the upload directory can detect sstables that are about to be deleted.
This can cause a failure because the sstable may be wiped while it is being processed.
The fix:
Force unlink to complete before `stream()` returns, so the upload
directory is in a consistent state by the time the API reports success.
For tablet streaming, partially-contained sstables participate in
multiple per-tablet batches; eagerly unlinking after each batch would
break the next batch that still needs to read the file. A
`defer_unlinking` flag on the streamer postpones the explicit unlink
until after all batches complete (called once at the end of
`tablet_sstable_streamer::stream()`). Vnode streaming unlinks eagerly at the end of
`stream_sstable_mutations`.
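A minimal sketch of the eager versus deferred unlink, in Python rather than C++. The Streamer class and its methods are hypothetical; only the `defer_unlinking` flag name comes from the description.

```python
class Streamer:
    """Hypothetical streamer that unlinks eagerly or defers until the end."""

    def __init__(self, defer_unlinking=False):
        self.defer_unlinking = defer_unlinking
        self.pending = []   # sstables waiting for the final unlink
        self.unlinked = []  # order in which files were removed

    def unlink(self, sst):
        if sst not in self.unlinked:
            self.unlinked.append(sst)

    def stream_batch(self, sstables):
        for sst in sstables:
            # ... the sstable's mutations would be streamed here ...
            if self.defer_unlinking:
                if sst not in self.pending:
                    self.pending.append(sst)  # a later batch may still read it
            else:
                self.unlink(sst)

    def finish(self):
        """Unlink everything that was deferred, once all batches are done."""
        for sst in self.pending:
            self.unlink(sst)
        self.pending.clear()
```

In the tablet case an sstable shared by two batches stays on disk until finish(); in the vnode case it is removed as soon as its single batch completes.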
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1647
Backport is required, as this fixes a bug that was introduced in 517a4dc4df.
- (cherry picked from commit 7cdf215999)
- (cherry picked from commit 784127c40b)
Parent PR: #29599
Closes scylladb/scylladb#29845
* github.com:scylladb/scylladb:
sstables_loader: synchronously unlink streamed sstables before returning
sstables: make sstable::unlink() idempotent
The procedure to migrate a vnodes-based keyspace to tablets-based keyspace
has been labeled as experimental.
Fixes SCYLLADB-1932
Closes scylladb/scylladb#29834
(cherry picked from commit 1f7d20f701)
Closes scylladb/scylladb#29849
When `force_capacity_based_balancing` is enabled and a node is being drained/excluded, the tablet allocator incorrectly aborts balancing due to incomplete tablet stats - even though capacity-based balancing doesn't depend on tablet sizes.
The tablet allocator normally waits for complete load stats before balancing. An exception exists for drained+excluded nodes (they're unreachable and won't return stats). However, when forced capacity-based balancing is active, this exception was not being applied, causing the balancer to reject the drain plan.
Adjust the condition in `tablet_allocator.cc` so that the "ignore missing data for drained nodes" logic applies regardless of whether capacity-based balancing is forced.
Added a Boost unit test that forces capacity-based balancing and verifies a drained/excluded node gets its tablets migrated even when tablet size stats are missing.
This bug was introduced in 2026.1, so this needs to be backported to 2026.1 and 2026.2
Fixes: SCYLLADB-1953
- (cherry picked from commit 906d2b817e)
- (cherry picked from commit f7bc8f5fa7)
Parent PR: #29791
Closes scylladb/scylladb#29866
* github.com:scylladb/scylladb:
test: boost: add drain test for forced capacity-based balancing
service: allow draining with forced capacity-based balancing
When a table's compaction is disabled via 'enabled': 'false', the DESCRIBE
output incorrectly showed NullCompactionStrategy instead of the actual strategy.
This happened because schema_properties() called compaction_strategy(), which
returns compaction_strategy_type::null when compaction is disabled. Fix it by
using configured_compaction_strategy(), which always returns the real strategy
type - consistent with how schema_tables.cc serializes it to disk.
Fixes SCYLLADB-1353
Closes scylladb/scylladb#29804
(cherry picked from commit 8d6f031a4a)
Closes scylladb/scylladb#29867
The test samples sl:default runtime before and after setup writes to
prove that it measures the scheduling group used by regular CQL writes.
The metric is exported in milliseconds, so a single 200-row batch may
not be visible immediately, or may be too small in some environments.
Keep the original 200-row table size, but wait up to 30 seconds for the
metric to advance. If it does not, retry the same writes before TTL is
enabled. The retries update the same keys, so the expiration part of the
test still waits for exactly the original number of rows.
In 100 local runs with N=200 rows, the observed delta of
`ms_statement_before - ms_statement_before_write` was: min=4.0,
max=16.0, mean=8.13, and median=8.0. Therefore, it looks possible that
in a rare corner case the delta drops even to 0.
Fixes SCYLLADB-1869
Closes scylladb/scylladb#29797
(cherry picked from commit 89261bf759)
Closes scylladb/scylladb#29868
In the test we perform 2 consecutive writes where the first write
is supposed to increase the view update backlog above the mv
admission control threshold and the second one is expected to be
rejected because of that.
On each node/shard we have 2 types of view update backlogs:
1. for deciding whether we should admit writes
2. for propagating the backlog information to other nodes/shards.
For the second write to be rejected, it must be performed on a node
and shard which updated its backlog of type 1.
The view update backlog of type 2. is immediately increased on the
base table replica. For this backlog to be registered as a backlog
of type 1., it needs to be either carried by gossip (happening once
every second) or by attaching it to a replica write response. We
don't want to increase the runtime of tests unnecessarily, so we don't
wait and we rely on the second mechanism. The response to the first
base table write (the one causing increase in the backlog) carries
the increased backlog to the coordinator of this write. So for the
second write to observe the increased backlog, it needs to be coordinated
on the same node+shard as the first write.
We make sure that both writes are coordinated on the same node+shard by
using prepared statements combined with setting the host in `run_async`.
Both writes target the same partition and with prepared statements we
route them directly to the correct shard.
That was the idea, at least. In practice, for the driver to learn the
correct shard, it first needs to learn the token->shard mapping from
the server. For vnodes it can predict the shard by calculating the token
of the affected partition, but for tablets, it had no opportunity to
learn the tablet->shard mapping so the first write may route to any shard.
Additionally, we aren't guaranteed that the driver established connections
to all shards on all nodes at the point of any write. So if a connection
finishes establishing between the two writes, this may also cause us to
coordinate these 2 writes on different shards, leading to a missed view
backlog growth and not-rejected second write.
We fix this in this patch by running the test using one shard on each node.
This way, as long as we perform both writes on the same node, they'll also
be coordinated on the same shard. This also makes the prepared statement and
BoundStatement unnecessary — we can use SimpleStatement with
FallthroughRetryPolicy directly.
Fixes: SCYLLADB-1957
Closes scylladb/scylladb#29862
(cherry picked from commit f3cf20803b)
Closes scylladb/scylladb#29873
Add a Boost unit test that forces capacity-based balancing through
configuration and verifies that a drained and excluded node will be
drained of its tablets when tablet size stats are missing.
The test covers the regression where the allocator rejected the plan due
to incomplete tablet stats, even though forced capacity-based balancing
does not depend on tablet sizes.
(cherry picked from commit f7bc8f5fa7)
When force_capacity_based_balancing is enabled, the tablet allocator
balances by node and shard capacity rather than by tablet sizes.
When the data needed for load balancing is incomplete, the balancer
fails and waits until load_stats is available and correct for all the
nodes. An exception to this is when a node is being drained and
excluded: it is unreachable, and will not return. In this case
the balancer has to do its best and ignore the missing data.
This patch fixes a bug where forcing capacity based balancing made the
balancer not ignore missing data in these cases, and instead abort the
balancing.
(cherry picked from commit 906d2b817e)
The MV Select Statement description was missing the word "columns" and
used incorrect verb agreement, making the sentence grammatically broken
and ambiguous.
docs/cql/mv.rst: "which of the base table is included" →
"which of the base table columns are included"
Fixes #29662
Closes #29663
Co-authored-by: annastuchlik <37244380+annastuchlik@users.noreply.github.com>
(cherry picked from commit 9e7d67612c)
Closes scylladb/scylladb#29835
When a user calls the repair API with identical startToken and endToken
values, the code creates a wrapping interval (T, T]. This causes
unwrap() to split it into (-inf, T] and (T, +inf), covering the entire
token ring and triggering a full repair.
Reject such requests early with an error message matching
Cassandra's behavior: "Start and end tokens must be different."
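Why an interval with equal endpoints ends up covering the whole ring can be sketched as follows. This is a simplified Python model, not the actual interval code: tokens are plain integers, None stands for an infinite bound, and validate_repair_range is a hypothetical name.

```python
def unwrap(start, end):
    """Split a (start, end] token interval into non-wrapping pieces.

    An interval whose start is not below its end wraps around the ring,
    which is how (T, T] turns into (-inf, T] plus (T, +inf).
    """
    if start < end:
        return [(start, end)]
    return [(None, end), (start, None)]


def validate_repair_range(start, end):
    """Reject identical tokens before they can turn into a full-ring repair."""
    if start == end:
        raise ValueError("Start and end tokens must be different.")
    return unwrap(start, end)
```

Calling unwrap(5, 5) yields two pieces that together span every token, which is the accidental full repair the early rejection prevents.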
Fixes: CUSTOMER-368
Closes scylladb/scylladb#29821
(cherry picked from commit 0204372156)
Closes scylladb/scylladb#29836
Fixes: SCYLLADB-1936
In case we abort a decommission operation, the snapshot/backup
mechanism needs to remain open.
This change moves it to after raft_decommission.
In the case of a cluster snapshot, a node's ownership
(or not) of tables will be serialized by raft anyway, so it
should remain consistent. In that case we at worst coordinate
from a node in "leave" status.
In the case of a local snapshot, ownership matters less;
only the sstables on disk matter, and they should not change.
In the case of backup, this operates on a snapshot, the state of which
is not affected.
Adds an injection point for testing.
v2:
- Added injection point to ensure test can abort decommission
Closes scylladb/scylladb#29667
(cherry picked from commit 2cc1a2c406)
Closes scylladb/scylladb#29848
mark_for_deletion() only set an in-memory flag; the actual file
deletion ran lazily when the last shared_sstable reference dropped,
leaving a window in which a follow-up scan of the upload directory
(e.g. a second 'nodetool refresh --load-and-stream') could observe a
partially-deleted sstable and fail with malformed_sstable_exception.
Force the unlink to complete before stream() returns. For tablet
streaming, partially-contained sstables span multiple per-tablet
batches, so a defer_unlinking flag postpones the unlink until after
all sstables are streamed; vnode sstables and fully-contained sstables are streamed
only once and can be removed right after being streamed.
Added a FIXME on object_storage_base::wipe and strengthened the doc on storage::wipe to
make the never-fails contract explicit.
(cherry picked from commit 784127c40b)
Avoid duplicate work when unlink() is called more than once on the
same sstable. This happens when a caller invokes unlink() explicitly
on an sstable that is also marked for deletion: the destructor's
close_files() path would otherwise call unlink() again, re-firing
_on_delete, double-counting _stats.on_delete() and double-invoking
_manager.on_unlink().
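The idempotency guard can be sketched as below, in Python rather than C++. The class is a hypothetical stand-in: the single counter represents the callbacks and stats that must not fire twice.

```python
class SSTable:
    """Hypothetical stand-in for an sstable with an idempotent unlink()."""

    def __init__(self):
        self._unlinked = False
        self.on_delete_calls = 0  # stands in for _on_delete / stats / manager hooks

    def unlink(self):
        if self._unlinked:
            return  # already unlinked: skip callbacks and stats
        self._unlinked = True
        self.on_delete_calls += 1
```

An explicit unlink() followed by the destructor-path unlink() then counts the deletion once.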
(cherry picked from commit 7cdf215999)
The function database::create_local_system_table calls
get_tables_metadata().hold_write_lock(), but does not co_await the
returned future. Effectively, this code does not guarantee mutual
exclusion because it does not wait for the lock to be acquired and does
not guarantee that the lock is held long enough.
Fix this by adding the co_await that was missing.
Found by manual inspection. This code is not known to have caused any
problems so far, but it's clearly wrong - hence the fix.
Fixes: SCYLLADB-1916
Closes scylladb/scylladb#29806
(cherry picked from commit bc482bfdea)
Closes scylladb/scylladb#29815
Add test_load_balancing_with_dropped_table that simulates the race between
DROP TABLE and the load balancer by capturing a token metadata snapshot
before dropping the table, then passing the stale snapshot to
balance_tablets(). Verifies it completes without aborting and produces no
migrations for the dropped table.
(cherry picked from commit 6b3e18c4a9)
The load balancer's get_schema_and_rs() would trigger on_internal_error when
a table present in the token metadata snapshot had been concurrently dropped
from the live schema. This race is possible because the balancer coroutine
yields between building the candidate list and checking replication
constraints, allowing a DROP TABLE schema mutation to be applied by another
fiber in the meantime.
Change get_schema_and_rs() to return {nullptr, nullptr} for dropped tables
instead of aborting. Update all callers to skip dropped tables:
- make_sizing_plan: continue to next table
- make_resize_plan: continue to next table (merge suppression is moot)
- check_constraints: return skip_info with empty viable targets
- get_rs: return nullptr, checked by check_constraints
(cherry picked from commit 4987204f71)
The BTI partition index trie writer flushes all buffered nodes at the
end of each SSTable via complete_until_depth(0), called from
bti_partition_index_writer_impl::finish(). This is a tight synchronous
loop that writes trie nodes through file_writer::write(), which uses a
buffered output_stream: individual writes that fit in the buffer are
plain memcpy operations returning a ready future, so .get() never
yields. As a result the reactor can stall for several milliseconds on
large SSTables.
The entire call chain runs inside seastar::async() (via
sstable::write_components()), so seastar::thread::maybe_yield() is
safe to call here. Add it at the top of both tight loops:
- complete_until_depth(), which iterates over trie depth
- lay_out_children(), which iterates over child branches per node
Fixes SCYLLADB-1885
Closes scylladb/scylladb#29798
(cherry picked from commit d0813769ec)
Closes scylladb/scylladb#29810
Commit 2b7aa32 (topology_coordinator: Refresh load stats after
table is created or altered) registered topology_coordinator as a
schema change listener and added on_create_column_family which
fire-and-forgets _tablet_load_stats_refresh.trigger(). The
triggered task runs on the gossip scheduling group via
with_scheduling_group and accesses the topology_coordinator via
'this'.
stop() unregisters the listener but does not wait for any
in-flight refresh task. If a notification fires between
_tablet_load_stats_refresh.join() in run() and unregister_listener
in stop(), the scheduled task can outlive the topology_coordinator
and access freed memory after run_topology_coordinator's coroutine
frame is destroyed.
Wait for the refresh to complete in stop() after unregistering the
listener, ensuring no task can fire after destruction.
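The shutdown ordering can be sketched with asyncio as an analogy. This is not the Seastar code: the Coordinator class, its task set, and the counter are hypothetical, and asyncio tasks only approximate fire-and-forget futures.

```python
import asyncio


class Coordinator:
    """Hypothetical coordinator with a fire-and-forget refresh."""

    def __init__(self):
        self._refresh_tasks = set()
        self._listening = True
        self.refreshed = 0

    def on_create_column_family(self):
        """Schema listener callback: start a refresh without waiting for it."""
        if not self._listening:
            return
        task = asyncio.create_task(self._refresh())
        self._refresh_tasks.add(task)
        task.add_done_callback(self._refresh_tasks.discard)

    async def _refresh(self):
        await asyncio.sleep(0)  # stands in for the scheduled refresh work
        self.refreshed += 1

    async def stop(self):
        self._listening = False  # unregister the listener first
        if self._refresh_tasks:
            # then wait for anything already in flight
            await asyncio.gather(*self._refresh_tasks)
```

After stop() returns, no refresh is pending and later notifications are ignored, so nothing can touch the object once it is destroyed.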
Fixes SCYLLADB-1728
Backport to 2026.1 and 2026.2, because the issue was introduced in 2b7aa32
Closes scylladb/scylladb#29653
* https://github.com/scylladb/scylladb:
test: tablet_stats: reproduce shutdown refresh race
topology_coordinator: join tablet load stats refresh in stop()
(cherry picked from commit d9dd3bfe53)
Closes scylladb/scylladb#29686
Fix the persistent flakiness in `test_incremental_repair_race_window_promotes_unrepaired_data` (SCYLLADB-1478, reopened).
After restarting servers[1], the topology coordinator can initiate a **residual re-repair** when it sees tablets stuck in the `repair` stage. This re-repair flushes memtables on all replicas and marks post-repair data as repaired, contaminating the test state and masking the compaction-merge bug the test is designed to detect. The assertion then fails on the *next* retry because the previous attempt's re-repair left behind repaired sstables containing post-repair keys.
1. **Propagating `current_key` through the exception** — correctly advanced the key counter on retry, but the contaminated tablet metadata from the prior re-repair (repaired sstables with post-repair keys) was still present, causing assertion failures on the next attempt.
2. **DROP TABLE + CREATE TABLE between retries** — the tablet metadata (sstables_repaired_at, repair stage) is tied to the tablet identity, and recreating the table in the same keyspace still showed residual state issues.
Instead of trying to clean up contaminated state, each retry creates a **completely fresh keyspace** (unique name via `create_new_test_keyspace`). This gives entirely new tablets with no residual repair metadata from prior attempts. Combined with broader detection of coordinator changes and residual re-repairs, the test reliably retries before any contamination can cause false failures.
The detection is now comprehensive:
- **Broadened coordinator check**: any coordinator change (`new_coord != coord`), not just migration to servers[1]
- **Re-repair detection** at three points: post-restart, during the compaction poll, and after injection release — grep for `"Initiating tablet repair host="` in the coordinator log
1. **`test: extract _setup_table_for_race_window helper`** — pure code-movement refactor that extracts keyspace+table+data+repair1+data+flush into a reusable helper. Easily verifiable as a no-op behavioral change.
2. **`test: fix race window test flakiness from residual re-repair`** — the actual fix: broadened detection logic + re-repair grep at 3 points + fresh-keyspace retry on exception.
Passed 1000 consecutive runs with the fix applied. Without the fix, about 2% flakiness was observed in debug mode.
Fixes: SCYLLADB-1743
So far, we haven't observed flakiness of this test on branches, so not backporting yet. Will backport if seen.
Closes scylladb/scylladb#29721
* github.com:scylladb/scylladb:
test: fix race window test flakiness from residual re-repair
test: extract _setup_table_for_race_window helper for race window test
(cherry picked from commit d33bb6ea00)
Closes scylladb/scylladb#29761
insert() held no local strong ref to the prepared modification_statement
across the suspension in execute(). On a single shard:
1. Fiber A suspends inside _insert_stmt->execute().
2. DROP TABLE / DROP KEYSPACE on the target, or LRU eviction, removes
the prepared_statements_cache entry, releasing its strong ref.
3. Fiber B re-enters cache_table_info(), sees _prepared_stmt
(checked_weak_ptr) invalidated, and runs _insert_stmt = nullptr,
releasing the last strong ref. The modification_statement is freed.
4. Fiber A resumes inside execute() and touches freed *this.
Pin a strong ref to _insert_stmt locally before the suspension.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1667
Backport: all supported branches; it's a long-present memory corruption bug
Closes scylladb/scylladb#29588
* github.com:scylladb/scylladb:
test/boost: add dummy case to table_helper_test for non-injection modes
test/boost: add regression test for table_helper insert() UAF
utils/error_injection: add waiters() API
table_helper: fix use-after-free on prepared-statement invalidation
(cherry picked from commit efcc0b6376)
Closes scylladb/scylladb#29747
In the test_delete_partition_rows_from_table_with_mv case we perform
a deletion of a large partition to verify that the deletion will
self-throttle when generating many view updates.
Before the deletion, we first build the materialized view, which causes
the view update backlog to grow. The backlog should be back to empty
when the view building finishes, and we do wait for that to happen, but
the information about the backlog drop may not be propagated to the
delete coordinator in time - the gossip interval is 1s and we perform
no other writes between the nodes in the meantime, so we don't make use
of the "piggyback" mechanism of propagating view backlog either. If the
coordinator thinks that the backlog is high on the replica, it may reject
the delete, failing this test.
We change this in this patch - after the view is built, we perform an
extra write from the coordinator. When the write finishes, the coordinator
will have the up-to-date view backlog and can proceed with the DELETE.
Additionally, we enable the "update_backlog_immediately" injection, which
makes the node backlog (the highest backlog across shards) update immediately
after each change.
Fixes: SCYLLADB-1877
Closes scylladb/scylladb#29775
(cherry picked from commit ab12083525)
Closes scylladb/scylladb#29793
The `vector_store_client_test_dns_resolving_repeated` test was intermittently
timing out on CI. The exact root cause is not fully understood, but the
hypothesis is that a single trigger signal can be lost somewhere (not exactly
known where). This is not an issue for the production code because the refresh
trigger is called multiple times, whenever all configured nodes are
unreachable.
Fixes SCYLLADB-1794
Backport to 2026.1 and 2026.2, as the same CI flakiness can occur on these branches.
- (cherry picked from commit 4722be1289)
- (cherry picked from commit 207de967fb)
Parent PR: #29752
Closes scylladb/scylladb#29784
* github.com:scylladb/scylladb:
vector_search: test: default timeout in test_dns_resolving_repeated
vector_search: test: fix flaky test_dns_resolving_repeated
The auth cache crashes when it encounters rows in role_permissions that have a live row marker but no permissions column. These “ghost rows” were created by the now-removed auth v2 migration, which used INSERT (creating row markers) instead of UPDATE.
When permissions were later revoked, the row marker remained while the permissions column became null. An empty collection appears as null, since its lifetime is based only on its element's cells.
As a result, when the cache reloads and expects the permissions column to exist, it hits a missing_column exception.
The series removes dead code that was the primary crash site, adds has() guards to the remaining access paths, and includes a test reproducer.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1816
Backport: all supported versions 2026.1, 2025.4, 2025.1
- (cherry picked from commit 797bc28aae)
- (cherry picked from commit c44625ebdf)
- (cherry picked from commit df69a5c79b)
- (cherry picked from commit 5c5306c692)
Parent PR: #29757
Closes scylladb/scylladb#29783
* github.com:scylladb/scylladb:
test: add reproducer for auth cache crash on missing permissions column
auth: tolerate missing permissions column in authorize()
auth: add defensive has() guard for role_attributes value column
auth: remove unused permissions field from cache role_record
Commit cf237e060a introduced 'from test.pylib.driver_utils import
safe_driver_shutdown' in pgo/exec_cql.py. This module runs during PGO
profile training (a build step) where the test package is not on the
Python path, causing an immediate ModuleNotFoundError on both x86 and
ARM. Revert to plain cluster.shutdown() which is sufficient for the
single-use PGO training scenario.
Fixes: SCYLLADB-1862
Closes scylladb/scylladb#29746
(cherry picked from commit 65eabda833)
Closes scylladb/scylladb#29785
Replace explicit 1-second timeouts in repeat_until() with the default
STANDARD_WAIT (10s). The 1-second timeout could be too aggressive for
loaded CI environments where lowres_clock granularity (~10ms) combined
with OS scheduling delays and resource contention (-c2 -m2G) could cause
the loop to expire before the DNS refresh task completes its cycle.
This also unifies test timeouts across test cases.
(cherry picked from commit 207de967fb)
Move trigger_dns_resolver() inside the repeat_until loop instead of
calling it once before the loop.
The test was intermittently timing out on CI. The exact root cause is not
fully understood, but the hypothesis is that a single trigger signal can
be lost somewhere (not exactly known where). This is not an issue for the
production code because the refresh trigger is called multiple times,
in every query where all configured nodes are unreachable.
By triggering inside the loop, we ensure the signal is re-sent on
each iteration until the resolver actually performs the refresh and
picks up the new (failing) DNS resolution. This makes the test
resilient to timing-dependent signal loss without changing production
code.
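The difference between triggering once and triggering inside the loop can be sketched as follows. This is a simplified Python model: the real repeat_until() is time-based rather than iteration-based, and LossyResolver is a hypothetical resolver that drops its first few trigger signals.

```python
def repeat_until(condition, max_iterations=100):
    """Poll `condition` until it returns True or the iteration budget runs out."""
    for _ in range(max_iterations):
        if condition():
            return True
    return False


class LossyResolver:
    """Hypothetical resolver that loses the first `lost` trigger signals."""

    def __init__(self, lost):
        self.lost = lost
        self.refreshed = False

    def trigger(self):
        if self.lost > 0:
            self.lost -= 1  # this signal is dropped
        else:
            self.refreshed = True


def wait_for_refresh(resolver):
    def attempt():
        resolver.trigger()  # re-send the signal on every iteration
        return resolver.refreshed
    return repeat_until(attempt)
```

A single trigger before the loop never recovers from one lost signal, while the in-loop trigger succeeds on the next iteration.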
Fixes: SCYLLADB-1794
(cherry picked from commit 4722be1289)
Ghost rows in role_permissions with a live row marker but no permissions
column can occur when permissions created via INSERT (e.g. by the removed
auth v2 migration) are later revoked. The row marker survives the revoke,
leaving a row visible to queries but with permissions=null.
Add a has() guard before accessing the permissions column, matching the
pattern already used in list_all(). Return NONE permissions for such
ghost rows instead of crashing.
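The guard can be sketched as follows. The `row` type here is a hypothetical, simplified model in which an absent map entry plays the role of a null cell; it is not the real result-set row class, and the empty set stands in for NONE permissions:

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical row: a column missing from `cells` models a null cell, as in
// a ghost row that has a live row marker but no regular columns.
struct row {
    std::map<std::string, std::set<std::string>> cells;

    bool has(const std::string& column) const {
        return cells.count(column) != 0;
    }
    const std::set<std::string>& get_set(const std::string& column) const {
        return cells.at(column);  // throws std::out_of_range if the cell is null
    }
};

// Returns the permission set of a row, or an empty set for a ghost row.
inline std::set<std::string> permissions_of(const row& r) {
    if (!r.has("permissions")) {
        return {};
    }
    return r.get_set("permissions");
}
```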
(cherry picked from commit df69a5c79b)
Add a has() check before accessing the value column in role_attributes
to tolerate ghost rows with missing regular columns. In practice this
is unlikely to be a problem since attributes are not typically revoked,
but the guard is added for consistency and defensive programming.
(cherry picked from commit c44625ebdf)
The permissions field in role_record was populated by fetch_role() but
never read. Authorization uses cached_permissions instead, which is
loaded via the permission_loader callback. Remove the dead field and
its fetch code.
The removed code also did not check for missing columns before accessing
the permissions set, which could crash on ghost rows left by the removed
auth v2 migration. The migration used INSERT (creating row markers),
and when permissions were later revoked, the row marker survived while
the permissions column became null.
(cherry picked from commit 797bc28aae)
The LDAP role manager's `_cache_pruner` background fiber periodically calls cache::reload_all_permissions(). Two races cause it to hit SCYLLA_ASSERT(_permission_loader):
- Cross-shard race: The pruner used `_cache.container().invoke_on_all()` to reload permissions on every shard. Since both `service::start()` and `sharded<service>::stop()` execute per-shard in parallel, the pruner on one shard could call reload_all_permissions() on another shard before that shard set its loader (startup) or after it cleared its loader (shutdown). Each shard runs its own pruner instance, so reloading locally is sufficient; this also removes redundant N² reload calls.
- Intra-shard race: `service::stop()` cleared the permission loader and stopped the role manager concurrently (via when_all_succeed). A mid-reload pruner could yield and then call the now-null loader. Fixed by stopping the role manager first so the pruner is fully drained before the loader is cleared.
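The intra-shard ordering can be illustrated with a single-threaded toy model. Everything below is hypothetical: `toy_service` is not the real auth service, the exception stands in for SCYLLA_ASSERT, and the pruner step is placed by hand at the one interleaving that matters:

```cpp
#include <functional>
#include <stdexcept>

// Toy model of one shard: a loader callback plus a pruner that uses it.
struct toy_service {
    std::function<void()> permission_loader;  // empty once cleared
    bool pruner_running = true;

    // One pruner step: reloads permissions through the loader.
    void pruner_tick() {
        if (!pruner_running) {
            return;
        }
        if (!permission_loader) {
            throw std::logic_error("permission loader is null");
        }
        permission_loader();
    }
    void stop_role_manager() { pruner_running = false; }
    void clear_loader() { permission_loader = nullptr; }
};

// Returns true if a pruner step that runs between the two shutdown actions
// completes without hitting a null loader.
inline bool shutdown_is_safe(bool stop_role_manager_first) {
    toy_service s;
    s.permission_loader = [] {};
    try {
        if (stop_role_manager_first) {
            s.stop_role_manager();
            s.pruner_tick();  // pruner already drained: nothing to do
            s.clear_loader();
        } else {
            s.clear_loader();
            s.pruner_tick();  // pruner still live: calls a null loader
            s.stop_role_manager();
        }
    } catch (const std::logic_error&) {
        return false;
    }
    return true;
}
```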
Fixes SCYLLADB-1679
Backport to 2026.2, introduced in 7eedf50c12
Closes scylladb/scylladb#29605
* github.com:scylladb/scylladb:
auth: make shutdown the exact reverse of startup
test: ldap: add test for pruner crash during shutdown
auth: start authorizer and set permission loader before role manager
auth: stop role manager before clearing permission loader
auth: reload LDAP permission cache on local shard only
(cherry picked from commit b0f988afc4)
Closes scylladb/scylladb#29681
When `permissions_validity_in_ms` is set to 0, executing a prepared statement under authentication crashes with:
```
Assertion `caching_enabled()' failed.
at utils/loading_cache.hh:319
in authorized_prepared_statements_cache::insert
```
`loading_cache::get_ptr()` asserts when caching is disabled (expiry == 0), but `authorized_prepared_statements_cache::insert()` was using it purely for its side effect of populating the cache, which is meaningless when caching is off.
Add a new `loading_cache::insert(k, load)` method that is a no-op when caching is disabled and otherwise forwards to `get_ptr()`. Switch `authorized_prepared_statements_cache::insert()` to use it. This completes the disabled-mode safety contract of the cache for the write side, mirroring the fallback that `get()` already provides for the read side.
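A rough sketch of the contract, using a toy cache rather than the real `utils::loading_cache`. The class name, the `int` value type, and the use of an exception in place of the assertion are all assumptions made for illustration:

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>

// Toy cache. An expiry of zero means caching is disabled.
class toy_loading_cache {
    std::chrono::milliseconds _expiry;
    std::map<std::string, int> _entries;
public:
    using loader = std::function<int(const std::string&)>;

    explicit toy_loading_cache(std::chrono::milliseconds expiry) : _expiry(expiry) {}

    bool caching_enabled() const { return _expiry.count() != 0; }
    std::size_t size() const { return _entries.size(); }

    // Stand-in for get_ptr(): only valid while caching is enabled.
    const int& get_ptr(const std::string& key, const loader& load) {
        if (!caching_enabled()) {
            throw std::logic_error("caching_enabled() failed");
        }
        auto it = _entries.find(key);
        if (it == _entries.end()) {
            it = _entries.emplace(key, load(key)).first;
        }
        return it->second;
    }

    // Populate-only entry point: a no-op when caching is disabled,
    // otherwise forwards to get_ptr().
    void insert(const std::string& key, const loader& load) {
        if (!caching_enabled()) {
            return;
        }
        get_ptr(key, load);
    }
};
```

Callers that only want the side effect of populating the cache can then call `insert()` unconditionally, whatever the configured expiry.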
Includes a regression test in `test/boost/loading_cache_test.cc` plus a positive test for the new `insert()` overload.
Fixes SCYLLADB-1699
The crash was introduced a long time ago and is present in all live versions, from 2025.1 onward. There are no client tickets, but the fix should be backported.
Closes scylladb/scylladb#29638
* github.com:scylladb/scylladb:
test: boost: regression test for loading_cache::insert with caching disabled
utils: loading_cache: add insert() that is a no-op when caching is disabled
(cherry picked from commit c00fee0316)
Closes scylladb/scylladb#29762
Add more logging to the barrier and drain RPCs to try to pinpoint https://github.com/scylladb/scylladb/issues/26281
Backport, since we want this logging in place if the issue occurs in the field.
Fixes: SCYLLADB-1836
Refs: #26281
Closes scylladb/scylladb#29735
* https://github.com/scylladb/scylladb:
session, raft_topology: add periodic warnings for hung drain and stale version waits
session: add info-level logging to drain_closing_sessions
raft_topology: log sub-step progress in local_topology_barrier
raft_topology: log read_barrier progress in topology cmd handler
(cherry picked from commit b69d00b0a7)
Closes scylladb/scylladb#29763
If start_server_for_group0() successfully registers a server in
_raft_gr._servers but a subsequent step (e.g. enable_in_memory_state_machine())
throws, the server is never destroyed because abort_and_drain()/destroy()
check std::get_if<raft::group_id>(&_group0) which was only set after the
entire with_scheduling_group block completed.
Move _group0.emplace<raft::group_id>() inside the lambda, immediately after
start_server_for_group() succeeds, so that cleanup paths can always find
and destroy the registered server.
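The effect of moving the assignment can be shown with a heavily simplified model. Here `registry` stands in for _raft_gr._servers and `group0` for the _group0 variant; the types and the `record_early` switch are hypothetical and exist only to compare the two placements:

```cpp
#include <optional>
#include <set>
#include <stdexcept>

struct toy_group0 {
    std::set<int> registry;     // registered servers
    std::optional<int> group0;  // what the cleanup paths look at

    // Registers the server, then runs a later step that may throw.
    void start(int gid, bool later_step_fails, bool record_early) {
        registry.insert(gid);
        if (record_early) {
            group0 = gid;  // new placement: right after registration
        }
        if (later_step_fails) {
            throw std::runtime_error("later start-up step failed");
        }
        group0 = gid;      // old placement: only after everything succeeded
    }

    // Cleanup can only destroy the server that `group0` names.
    void destroy() {
        if (group0) {
            registry.erase(*group0);
            group0.reset();
        }
    }
};

// Returns true if no server is left registered after a failed start.
inline bool cleaned_up_after_failed_start(bool record_early) {
    toy_group0 g;
    try {
        g.start(1, true, record_early);
    } catch (const std::runtime_error&) {
        // start-up failed; fall through to cleanup
    }
    g.destroy();
    return g.registry.empty();
}
```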
This fixes the assertion:
"raft_group_registry - stop(): server for group ... is not destroyed"
which manifests during shutdown after an upgrade where topology_state_load()
fails due to netw::unknown_address.
Backport: Yes, to 2026.1, 2026.2, as it causes a crash on upgrades
Refs: SCYLLADB-1217
Refs: CUSTOMER-340
Refs: CUSTOMER-335
Fixes: SCYLLADB-1809
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
AI-assisted: Yes, Opencode/Opus 4.6
Closes scylladb/scylladb#29702
(cherry picked from commit 6179406467)
Closes scylladb/scylladb#29742
In do_execute_cql_with_timeout(), when the prepared statement was not found in the cache, we called qp.prepare() and stored the returned result_message::prepared in a local variable scoped to the 'if' block. We then extracted ps_ptr (a checked_weak_ptr to the prepared statement) from the message, let the message go out of scope at the end of the 'if', and used ps_ptr after a co_await on st->execute().
Since 3ac4e258e8 ("transport/messages: hold pinned prepared entry in PREPARE result"), result_message::prepared owns a strong pinned reference to the prepared cache entry. While qp.prepare() runs it also holds its own pin on the entry, so on return the entry has at least the pin owned by the returned message. As long as that message is alive, the cache entry cannot be purged and the weak handle inside ps_ptr remains promotable.
The lifetime gap manifested only in debug builds. qp.prepare() returns a ready future on the cache-miss path, so in release builds the co_await resumes synchronously: control flows from the assignment of ps_ptr straight into st->execute() with no opportunity for any other task (in particular, prepared cache invalidation triggered by a concurrent schema change) to run in between. Debug builds, however, force a reactor preemption point on every co_await even when the awaited future is ready. With prepared_msg already destroyed at the end of the 'if' block, the only remaining handle on the cache entry was the weak ps_ptr, and the preemption gave a concurrent cache purge (triggered, for example, by Raft schema changes received during a node restart) the chance to drop the entry. The subsequent execute() then failed when promoting the weak pointer with checked_ptr_is_null_exception.
The exception propagated out of the Paxos prepare path as a generic std::exception with no type information in the log, surfacing on the coordinator as:
WriteFailure: Failed to prepare ballot ... Replica errors:
host_id ... -> seastar::rpc::remote_verb_error (std::exception)
Hoist the result_message::prepared into the outer scope so the pinned cache entry stays alive across co_await st->execute(...), closing the window in which a concurrent cache purge could invalidate the weak handle.
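The lifetime issue can be reproduced in miniature with standard smart pointers. In this sketch `std::shared_ptr` plays the pinned reference held by the PREPARE result and `std::weak_ptr` plays the checked_weak_ptr; `toy_cache` and `survives_purge` are hypothetical names, and the explicit `purge()` call stands in for work that runs at the preemption point:

```cpp
#include <memory>
#include <string>
#include <unordered_map>

// Toy prepared-statement cache: the cache owns one strong reference per entry.
struct toy_cache {
    std::unordered_map<std::string, std::shared_ptr<std::string>> entries;

    std::shared_ptr<std::string> prepare(const std::string& query) {
        auto& e = entries[query];
        if (!e) {
            e = std::make_shared<std::string>(query);
        }
        return e;  // the caller's copy pins the entry
    }
    void purge() { entries.clear(); }  // e.g. invalidation on schema change
};

// Returns whether the weak handle is still promotable after a purge that
// runs between prepare and execute.
inline bool survives_purge(bool hoist_pin) {
    toy_cache cache;
    std::shared_ptr<std::string> pin;   // outer-scope holder (the fix)
    std::weak_ptr<std::string> ps_ptr;
    {
        auto prepared_msg = cache.prepare("stmt");
        ps_ptr = prepared_msg;
        if (hoist_pin) {
            pin = prepared_msg;
        }
    }  // prepared_msg is destroyed here
    cache.purge();  // stands in for a concurrent purge during the co_await
    return !ps_ptr.expired();
}
```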
Fixes SCYLLADB-1173
Backport: the patch is simple, so we can backport it to all versions with the "LWT over tablets" feature. Note that the problem only occurs in test runs in debug configuration; production is not affected.
Closes scylladb/scylladb#29675
* https://github.com/scylladb/scylladb:
table_helper: retry insert prepare on concurrent cache invalidation
paxos_state: keep prepared message alive across statement execution
(cherry picked from commit 15f35577ed)
Closes scylladb/scylladb#29701
The test hardcoded the expected number of coordinator elections
(2, 3, 4, 5) for each phase. If a prior phase triggered an extra
election, subsequent phases would wait for a count that was already
reached or would never match.
Fix by reading the current election count before each operation and
expecting exactly one more, making each phase independent of prior
history.
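The two checks can be contrasted in a few lines. The helper names are hypothetical, and in the real test the counts are read from the cluster rather than passed in:

```cpp
// Old approach: compare against an absolute count hardcoded for the phase.
inline bool elected_hardcoded(int current_count, int expected_absolute) {
    return current_count == expected_absolute;
}

// New approach: compare against the count sampled just before the operation
// and expect exactly one more election.
inline bool elected_relative(int count_before, int current_count) {
    return current_count == count_before + 1;
}
```

For example, if an earlier phase caused one extra election, a phase that hardcodes 3 starts at 3 and ends at 4: the hardcoded check never observes the new election, while the relative check does.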
Also add wait_for_no_pending_topology_transition() calls after each
coordinator election to ensure the topology state machine has fully
settled before proceeding with restarts and further operations.
Decrease the failure detector timeout (failure_detector_timeout_in_ms)
to 2000 ms on all test nodes so that coordinator crashes are detected
faster, reducing test wallclock time and timeout-related flakiness.
Enable raft_topology=trace logging on all test nodes to aid
post-failure diagnosis. Add diagnostic logging in
wait_new_coordinator_elected().
Fixes: SCYLLADB-1790
Closes scylladb/scylladb#29284
(cherry picked from commit 8afdae24d2)
Closes scylladb/scylladb#29723
Commit 8d34127684 ("sstables: clean up TemporaryHashes file in wipe()")
unconditionally calls filename(..., component_type::TemporaryHashes)
inside filesystem_storage::wipe(). However, the TemporaryHashes
component is only registered in the component map of the 'ms' sstable
format. For older formats (ka, la, mc, md, me) the lookup goes through
sstable_version_constants::get_component_map(version).at(...) and throws
std::out_of_range.
The exception is then swallowed by the outer catch(...) in wipe(), which
just logs and ignores. As a side effect, the subsequent
remove_file(new_toc_name) is never reached and the TemporaryTOC
('*-TOC.txt.tmp') file is left as an orphan on disk after every unlink()
of a non-'ms' sstable.
Guard the lookup with get_component_map(version).contains() so the
cleanup is only attempted for formats that actually define the
component.
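A minimal sketch of the guarded lookup, assuming hypothetical per-format maps. The enum, the file names, and `component_map()` are illustrative only; the one fact taken from the message above is that just the 'ms' format registers TemporaryHashes:

```cpp
#include <map>
#include <optional>
#include <string>

enum class component { TOC, Data, TemporaryHashes };

// Hypothetical component maps keyed by sstable format version.
inline const std::map<component, std::string>& component_map(const std::string& version) {
    static const std::map<component, std::string> ms = {
        {component::TOC, "TOC.txt"},
        {component::Data, "Data.db"},
        {component::TemporaryHashes, "TemporaryHashes.db.tmp"}};
    static const std::map<component, std::string> older = {
        {component::TOC, "TOC.txt"},
        {component::Data, "Data.db"}};
    return version == "ms" ? ms : older;
}

// Guarded lookup: returns no name instead of letting at() throw
// std::out_of_range for formats that lack the component.
inline std::optional<std::string> temporary_hashes_name(const std::string& version) {
    const auto& m = component_map(version);
    if (m.count(component::TemporaryHashes) == 0) {  // contains() in C++20
        return std::nullopt;
    }
    return m.at(component::TemporaryHashes);
}
```

Because the lookup no longer throws, the cleanup steps that follow it in the same try block still run for older formats.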
Add a regression test in test/boost/sstable_directory_test.cc that
creates an 'me'-format sstable, unlinks it and asserts that the sstable
directory is left empty. Without the fix the test fails with a leftover
'me-...-TOC.txt.tmp' file.
Fixes: SCYLLADB-1767
Closes scylladb/scylladb#29620
(cherry picked from commit 7e14ea5ac8)
Closes scylladb/scylladb#29692