The MV Select Statement description was missing the word "columns" and
used incorrect verb agreement, making the sentence grammatically broken
and ambiguous.
docs/cql/mv.rst: "which of the base table is included" →
"which of the base table columns are included"
Fixes#29662Closes#29663
Co-authored-by: annastuchlik <37244380+annastuchlik@users.noreply.github.com>
(cherry picked from commit 9e7d67612c)
Closesscylladb/scylladb#29835Closesscylladb/scylladb#29865
The function database::create_local_system_table calls
get_tables_metadata().hold_write_lock(), but does not co_await the
returned future. Effectively, this code does not guarantee mutual
exclusion because it does not wait for the lock to be acquired and does
not guarantee that the lock is held long enough.
Fix this by adding the co_await that was missing.
Found by manual inspection. This code is not known to have caused any
problems so far, but it's clearly wrong - hence the fix.
Fixes: SCYLLADB-1916
Closesscylladb/scylladb#29806
(cherry picked from commit bc482bfdea)
Closesscylladb/scylladb#29815Closesscylladb/scylladb#29833
The BTI partition index trie writer flushes all buffered nodes at the
end of each SSTable via complete_until_depth(0), called from
bti_partition_index_writer_impl::finish(). This is a tight synchronous
loop that writes trie nodes through file_writer::write(), which uses a
buffered output_stream: individual writes that fit in the buffer are
plain memcpy operations returning a ready future, so .get() never
yields. As a result the reactor can stall for several milliseconds on
large SSTables.
The entire call chain runs inside seastar::async() (via
sstable::write_components()), so seastar::thread::maybe_yield() is
safe to call here. Add it at the top of both tight loops:
- complete_until_depth(), which iterates over trie depth
- lay_out_children(), which iterates over child branches per node
Fixes SCYLLADB-1885
Closesscylladb/scylladb#29798
(cherry picked from commit d0813769ec)
Closesscylladb/scylladb#29810Closesscylladb/scylladb#29816
insert() held no local strong ref to the prepared modification_statement
across the suspension in execute(). On a single shard:
1. Fiber A suspends inside _insert_stmt->execute().
2. DROP TABLE / DROP KEYSPACE on the target, or LRU eviction, removes
the prepared_statements_cache entry, releasing its strong ref.
3. Fiber B re-enters cache_table_info(), sees _prepared_stmt
(checked_weak_ptr) invalidated, and runs _insert_stmt = nullptr,
releasing the last strong ref. The modification_statement is freed.
4. Fiber A resumes inside execute() and touches freed *this.
Pin strong ref to _insert_stmt locally before the suspension.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1667
Backport: all supported branches, it's memory corruption bug, long present
Closesscylladb/scylladb#29588
* github.com:scylladb/scylladb:
test/boost: add dummy case to table_helper_test for non-injection modes
test/boost: add regression test for table_helper insert() UAF
utils/error_injection: add waiters() API
table_helper: fix use-after-free on prepared-statement invalidation
(cherry picked from commit efcc0b6376)
Closesscylladb/scylladb#29747Closesscylladb/scylladb#29802
If start_server_for_group0() successfully registers a server in
_raft_gr._servers but a subsequent step (e.g. enable_in_memory_state_machine())
throws, the server is never destroyed because abort_and_drain()/destroy()
check std::get_if<raft::group_id>(&_group0) which was only set after the
entire with_scheduling_group block completed.
Move _group0.emplace<raft::group_id>() inside the lambda, immediately after
start_server_for_group() succeeds, so that cleanup paths can always find
and destroy the registered server.
This fixes the assertion:
"raft_group_registry - stop(): server for group ... is not destroyed"
which manifests during shutdown after an upgrade where topology_state_load()
fails due to netw::unknown_address.
Backport: Yes, to 2026.1, 2026.2, as it causes a crash on upgrades
Refs: SCYLLADB-1217
Refs: CUSTOMER-340
Refs: CUSTOMER-335
Fixes: SCYLLADB-1809
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
AI-assisted: Yes, Opencode/Opus 4.6
Closesscylladb/scylladb#29702
(cherry picked from commit 6179406467)
Closesscylladb/scylladb#29742Closesscylladb/scylladb#29754
During incremental repair, each tablet replica holds three SSTable views:
UNREPAIRED, REPAIRING, and REPAIRED. The repair lifecycle is:
1. Replicas snapshot unrepaired SSTables and mark them REPAIRING.
2. Row-level repair streams missing rows between replicas.
3. mark_sstable_as_repaired() runs on all replicas, rewriting the
SSTables with repaired_at = sstables_repaired_at + 1 (e.g. N+1).
4. The coordinator atomically commits sstables_repaired_at=N+1 and
the end_repair stage to Raft, then broadcasts
repair_update_compaction_ctrl which calls clear_being_repaired().
The bug lives in the window between steps 3 and 4. After step 3, each
replica has on-disk SSTables with repaired_at=N+1, but sstables_repaired_at
in Raft is still N. The classifier therefore sees:
is_repaired(N, sst{repaired_at=N+1}) == false
sst->being_repaired == null (lost on restart, or not yet set)
and puts them in the UNREPAIRED view. If a new write arrives and is
flushed (repaired_at=0), STCS minor compaction can fire immediately and
merge the two SSTables. The output gets repaired_at = max(N+1, 0) = N+1
because compaction preserves the maximum repaired_at of its inputs.
Once step 4 commits sstables_repaired_at=N+1, the compacted output is
classified REPAIRED on the affected replica even though it contains data
that was never part of the repair scan. Other replicas, which did not
experience this compaction, classify the same rows as UNREPAIRED. This
divergence is never healed by future repairs because the repaired set is
considered authoritative. The result is data resurrection: deleted rows
can reappear after the next compaction that merges unrepaired data with the
wrongly-promoted repaired SSTable.
The fix has two layers:
Layer 1 (in-memory, fast path): mark_sstable_as_repaired() now also calls
mark_as_being_repaired(session) on the new SSTables it writes. This keeps
them in the REPAIRING view from the moment they are created until
repair_update_compaction_ctrl clears the flag after step 4, covering the
race window in the normal (no-restart) case.
Layer 2 (durable, restart-safe): a new is_being_repaired() helper on
tablet_storage_group_manager detects the race window even after a node
restart, when being_repaired has been lost from memory. It checks:
sst.repaired_at == sstables_repaired_at + 1
AND tablet transition kind == tablet_transition_kind::repair
Both conditions survive restarts: repaired_at is on-disk in SSTable
metadata, and the tablet transition is persisted in Raft. Once the
coordinator commits sstables_repaired_at=N+1 (step 4), is_repaired()
returns true and the SSTable naturally moves to the REPAIRED view.
The classifier in make_repair_sstable_classifier_func() is updated to call
is_being_repaired(sst, sstables_repaired_at) in place of the previous
sst->being_repaired.uuid().is_null() check.
A new test, test_incremental_repair_race_window_promotes_unrepaired_data,
reproduces the bug by:
- Running repair round 1 to establish sstables_repaired_at=1.
- Injecting delay_end_repair_update to hold the race window open.
- Running repair round 2 so all replicas complete mark_sstable_as_repaired
(repaired_at=2) but the coordinator has not yet committed step 4.
- Writing post-repair keys to all replicas and flushing servers[1] to
create an SSTable with repaired_at=0 on disk.
- Restarting servers[1] so being_repaired is lost from memory.
- Waiting for autocompaction to merge the two SSTables on servers[1].
- Asserting that the merged SSTable contains post-repair keys (the bug)
and that servers[0] and servers[2] do not see those keys as repaired.
NOTE FOR MAINTAINER: Copilot initially only implemented Layer 1 (the
in-memory being_repaired guard), missing the restart scenario entirely.
I pointed out that being_repaired is lost on restart and guided Copilot
to add the durable Layer 2 check. I also polished the implementation:
moving is_being_repaired into tablet_storage_group_manager so it can
reuse the already-held _tablet_map (avoiding an ERM lookup and try/catch),
passing sstables_repaired_at in from the classifier to avoid re-reading it,
and using compaction_group_for_sstable inside the function rather than
threading a tablet_id parameter through the classifier.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1239.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#29244
(cherry picked from commit 16e387d5f9)
(cherry picked from commit cc3dcc4ba8)
Closesscylladb/scylladb#29411
Add more logging to barrier and drain rpc to try and pinpoint https://github.com/scylladb/scylladb/issues/26281
Bakport since we want to have it if it happens in the field.
Fixes: SCYLLADB-1837
Refs: #26281
- (cherry picked from commit 11b838e71e)
- (cherry picked from commit e88ce09372)
- (cherry picked from commit 385915c101)
- (cherry picked from commit d2b695aa64)
Parent PR: #29735Closesscylladb/scylladb#29770
* https://github.com/scylladb/scylladb:
session, raft_topology: add periodic warnings for hung drain and stale version waits
session: add info-level logging to drain_closing_sessions
raft_topology: log sub-step progress in local_topology_barrier
raft_topology: log read_barrier progress in topology cmd handler
token_metadata: improve stale versions diagnostics
The estimate() function in the size_estimates virtual reader only
considered sstables local to the shard that happened to own the
keyspace's partition key token. Since sstables are distributed across
shards, this caused partition count estimates to be approximately
1/smp_count of the actual value.
This bug has been present since the virtual reader was introduced in
225648780d.
Use db.container().map_reduce0() to aggregate sstable estimates
across all shards. Each shard contributes its local count and
estimated_histogram, which are then merged to produce the correct
total.
Also fix the `test_partitions_estimate_full_overlap` test which becomes
flaky (xpassing ~1% of runs) because autocompaction could merge the
two overlapping sstables before the size estimate was read. Wrap the
test body in nodetool.no_autocompaction_context to prevent this race.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1179
Refs https://github.com/scylladb/scylladb/issues/9083Closesscylladb/scylladb#29286
(cherry picked from commit 6f364fd3b7)
Closesscylladb/scylladb#29381
The `vector_store_client_test_dns_resolving_repeated` test was intermittently
timing out on CI. The exact root cause is not fully understood, but the
hypothesis is that a single trigger signal can be lost somewhere (not exactly
known where). This is not an issue for the production code because refresh
trigger will be called multiple times whenever all configured nodes will be
unreachable.
Fixes SCYLLADB-1794
Backport to 2026.1 and 2026.2, as the same CI flakiness can occur on these branches.
- (cherry picked from commit e9240587f4)
- (cherry picked from commit 44249c0a75)
Parent PR: #29752Closesscylladb/scylladb#29796
* github.com:scylladb/scylladb:
vector_search: test: default timeout in test_dns_resolving_repeated
vector_search: test: fix flaky test_dns_resolving_repeated
Replace explicit 1-second timeouts in repeat_until() with the default
STANDARD_WAIT (10s). The 1-second timeout could be too aggressive for
loaded CI environments where lowres_clock granularity (~10ms) combined
with OS scheduling delays and resource contention (-c2 -m2G) could cause
the loop to expire before the DNS refresh task completes its cycle.
This also unifies test timeouts across test cases.
(cherry picked from commit 207de967fb)
Move trigger_dns_resolver() inside the repeat_until loop instead of
calling it once before the loop.
The test was intermittently timing out on CI. The exact root cause is not
fully understood, but the hypothesis is that a single trigger signal can
be lost somewhere (not exactly known where). This is not an issue for the
production code because refresh trigger will be called multiple times -
in every query where all configured nodes will be unreachable.
By triggering inside the loop, we ensure the signal is re-sent on
each iteration until the resolver actually performs the refresh and
picks up the new (failing) DNS resolution. This makes the test
resilient to timing-dependent signal loss without changing production
code.
Fixes: SCYLLADB-1794
(cherry picked from commit 4722be1289)
When `permissions_validity_in_ms` is set to 0, executing a prepared statement under authentication crashes with:
```
Assertion `caching_enabled()' failed.
at utils/loading_cache.hh:319
in authorized_prepared_statements_cache::insert
```
`loading_cache::get_ptr()` asserts when caching is disabled (expiry == 0), but `authorized_prepared_statements_cache::insert()` was using it purely for its side effect of populating the cache, which is meaningless when caching is off.
Add a new `loading_cache::insert(k, load)` method that is a no-op when caching is disabled and otherwise forwards to `get_ptr()`. Switch `authorized_prepared_statements_cache::insert()` to use it. This
completes the disabled-mode safety contract of the cache for the write side, mirroring the fallback that `get()` already provides for the read side.
Includes a regression test in `test/boost/loading_cache_test.cc` plus a positive test for the new `insert()` overload.
Fixes SCYLLADB-1699
The crash is introduced a long time ago. It is present on all the live versions, from 2025.1 onward. No client tickets, but it should be backported.
Closesscylladb/scylladb#29638
* github.com:scylladb/scylladb:
test: boost: regression test for loading_cache::insert with caching disabled
utils: loading_cache: add insert() that is a no-op when caching is disabled
(cherry picked from commit c00fee0316)
Closesscylladb/scylladb#29762Closesscylladb/scylladb#29782
This is manual backport of https://github.com/scylladb/scylladb/pull/29757, because the changes are required on 2026.1 ASAP.
===
The auth cache crashes when it encounters rows in role_permissions that have a live row marker but no permissions column. These “ghost rows” were created by the now-removed auth v2 migration, which used INSERT (creating row markers) instead of UPDATE.
When permissions were later revoked, the row marker remained while the permissions column became null. An empty collection appears as null, since its lifetime is based only on its element's cells.
As a result, when the cache reloads and expects the permissions column to exist, it hits a missing_column exception.
The series removes dead code that was the primary crash site, adds has() guards to the remaining access paths, and includes a test reproducer.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1816
Backport: all supported versions 2026.1, 2025.4, 2025.1
Parent PR: https://github.com/scylladb/scylladb/pull/29757Closesscylladb/scylladb#29771
* github.com:scylladb/scylladb:
test: add reproducer for auth cache crash on missing permissions column
auth: tolerate missing permissions column in authorize()
auth: add defensive has() guard for role_attributes value column
auth: remove unused permissions field from cache role_record
Add periodic warning timers (every 5 minutes) to help diagnose hangs in
barrier_and_drain:
- drain_closing_sessions(): warn if semaphore acquisition or session gate
close is taking too long, reporting the gate count to show how many
guards are still alive.
- local_topology_barrier(): warn if stale_versions_in_use() is taking
too long, reporting the current stale version trackers.
- session::gate_count(): new public accessor for diagnostic purposes.
These warnings help distinguish between the two possible hang points
in barrier_and_drain (stale versions vs session drain) and provide
ongoing visibility into what's blocking progress.
(cherry picked from commit d2b695aa64)
drain_closing_sessions() is called as part of the barrier_and_drain
topology command and can block on two things: acquiring the drain
semaphore (if another drain is in progress) and waiting for individual
sessions to close (which blocks until all session guards are released).
Previously, all logging in this function was at debug level, making it
invisible in production logs. When barrier_and_drain hangs, there is no
way to tell whether the function is waiting for the semaphore, waiting
for a specific session to close, or was never called.
Promote logging to info level and add messages at each blocking point:
before/after semaphore acquisition (with count of sessions to drain),
before/after each individual session close (with session id), and at
function completion. This makes it possible to identify the exact
session blocking a topology operation from the node log alone.
(cherry picked from commit 385915c101)
When a node processes a barrier_and_drain topology command, it performs
two potentially long-running operations inside local_topology_barrier():
waiting for stale token metadata versions to be released
(stale_versions_in_use) and draining closing sessions
(drain_closing_sessions). Either of these can hang indefinitely -- for
example, stale_versions_in_use blocks until all references to previous
token metadata versions are released, which depends on in-flight
requests completing.
Previously, the only logging was a single 'done' message at the end,
making it impossible to determine which sub-step was blocking when a
barrier_and_drain RPC appeared stuck on a node. In a recent CI failure,
a node never responded to barrier_and_drain during a removenode
operation, and the logs showed the RPC was received but nothing about
what it was waiting on internally.
Add info-level logging before each blocking sub-step, including the
topology version for correlation. This allows diagnosing hangs by
showing whether the node is stuck waiting for stale metadata versions,
stuck draining sessions, or never reached these steps at all.
(cherry picked from commit e88ce09372)
When a raft topology command (e.g. barrier_and_drain) is received by a
node, the handler first performs a raft read_barrier to ensure it sees
the latest topology state. This read_barrier can hang indefinitely if
raft cannot achieve quorum, but there was no logging around it, making
it impossible to tell whether the handler was stuck at this step or
somewhere else.
Add info-level logging before and after the read_barrier call in
raft_topology_cmd_handler, including the command type, index, and term.
This allows diagnosing hangs by showing whether the node entered the
read_barrier and whether it completed, narrowing down the root cause
when a topology command RPC appears stuck on the receiver side.
(cherry picked from commit 11b838e71e)
Before waiting on stale_versions_in_use(), we log the stale versions
the barrier_and_drain handler will wait for, along with the number of
token_metadata references representing each version.
To achieve this, we store a pointer to token_metadata in
version_tracker, traverse the _trackers list, and output all items
with a version smaller than the latest. Since token_metadata
contains the version_tracker instance, it is guaranteed to remain
alive during traversal. To count references, token_metadata now
inherits from enable_lw_shared_from_this.
This helps diagnose tablet migration stalls and allows more
deterministic tests: when a barrier is expected to block, we can
verify that the log contains the expected stale versions rather
than checking that the barrier_and_drain is blocked on
stale_versions_in_use() for a fixed amount of time.
(cherry picked from commit e39f4b399c)
Ghost rows in role_permissions with a live row marker but no permissions
column can occur when permissions created via INSERT (e.g. by the removed
auth v2 migration) are later revoked. The row marker survives the revoke,
leaving a row visible to queries but with permissions=null.
Add a has() guard before accessing the permissions column, matching the
pattern already used in list_all(). Return NONE permissions for such
ghost rows instead of crashing.
(cherry picked from commit df69a5c79b)
Add a has() check before accessing the value column in role_attributes
to tolerate ghost rows with missing regular columns. In practice this
is unlikely to be a problem since attributes are not typically revoked,
but the guard is added for consistency and defensive programming.
(cherry picked from commit c44625ebdf)
The permissions field in role_record was populated by fetch_role() but
never read. Authorization uses cached_permissions instead, which is
loaded via the permission_loader callback. Remove the dead field and
its fetch code.
The removed code also did not check for missing columns before accessing
the permissions set, which could crash on ghost rows left by the removed
auth v2 migration. The migration used INSERT (creating row markers),
and when permissions were later revoked, the row marker survived while
the permissions column became null.
(cherry picked from commit 797bc28aae)
parse_assert() accepts an optional `message` parameter that defaults
to nullptr. When the assertion fails and message is nullptr, it is
implicitly converted to sstring via the sstring(const char*) constructor,
which calls strlen(nullptr) -- undefined behavior that manifests as a
segfault in __strlen_evex.
This turns what should be a graceful malformed_sstable_exception into a
fatal crash. In the case of CUSTOMER-279, a corrupt SSTable triggered
parse_assert() during streaming (in continuous_data_consumer::
fast_forward_to()), causing a crash loop on the affected node.
Fix by guarding the nullptr case with a ternary, passing an empty
sstring() when message is null. on_parse_error() already handles
the empty-message case by substituting "parse_assert() failed".
Fixes: SCYLLADB-1672
Closesscylladb/scylladb#29285
(cherry picked from commit cfebe17592)
Closesscylladb/scylladb#29597
This PR solves a series of similar problems related to executing
methods on an already aborted `raft::server`. They materialize
in various ways:
* For `add_entry` and `modify_config`, a `raft::not_a_leader` with
a null ID will be returned IF forwarding is disabled. This wasn't
a big problem because forwarding has always been enabled for group0,
but it's something that's nice to fix. It's might be relevant for
strong consistency that will heavily rely on this code.
* For `wait_for_leader` and `wait_for_state_change`, the calls may
hang and never resolve. A more detailed scenario is provided in a
commit message.
For the last two methods, we also extend their descriptions to indicate
the new possible exception type, `raft::stopped_error`. This change is
correct since either we enter the functions and throw the exception
immediately (if the server has already been aborted), or it will be
thrown upon the call to `raft::server::abort`.
We fix both issues. A few reproducer tests have been included to verify
that the calls finish and throw the appropriate errors.
Fixes SCYLLADB-841
Backport: Although the hanging problems haven't been spotted so far
(at least to the best of my knowledge), it's best to avoid
running into a problem like that, so let's backport the
changes to all supported versions. They're small enough.
Closesscylladb/scylladb#28822
* https://github.com/scylladb/scylladb:
raft: Make methods throw stopped_error if server aborted
raft: Throw stopped_error if server aborted
test: raft: Introduce get_default_cluster
(cherry picked from commit bb1a798c2c)
Closesscylladb/scylladb#28903
When view_update_builder::on_results() hits the path where the update
fragment reader is already exhausted, it still needs to keep tracking
existing range tombstones and apply them to encountered rows.
Otherwise a row covered by an existing range tombstone can appear
alive while generating the view update and create a spurious view row.
Update the existing tombstone state even on the exhausted-reader path
and apply the effective tombstone to clustering rows before generating
the row tombstone update. Add a cqlpy regression test covering the
partition-delete-after-range-tombstone case.
Fixes: SCYLLADB-1649
Closesscylladb/scylladb#29481
(cherry picked from commit 073710a661)
Closesscylladb/scylladb#29649
Pin all external GitHub Actions to full commit SHAs and upgrade to
their latest major versions to reduce supply chain attack surface:
- actions/checkout: v3/v4/v5 -> v6.0.2
- actions/github-script: v7 -> v8.0.0
- actions/setup-python: v5 -> v6.2.0
- actions/upload-artifact: v4 -> v7.0.0
- astral-sh/setup-uv: v6 -> v8.0.0
- mheap/github-action-required-labels: v5.5.2 (pinned)
- redhat-plumbers-in-action/differential-shellcheck: v5.5.6 (pinned)
- codespell-project/actions-codespell: v2.2 (pinned, was @master)
Set FORCE_JAVASCRIPT_ACTIONS_TO_NODE24=true in all 21 workflows that
use JavaScript-based actions to opt into the Node.js 24 runtime now.
This resolves the deprecation warning:
"Node.js 20 actions are deprecated. Please check if updated versions
of these actions are available that support Node.js 24. Actions will
be forced to run with Node.js 24 by default starting June 2nd,
2026. Node.js 20 will be removed from the runner on September 16th,
2026."
See: https://github.blog/changelog/2025-09-19-deprecation-of-node-20-on-github-actions-runners/
scylladb/github-automation references are intentionally left at @main
as they are org-internal reusable workflows.
Fixes: SCYLLADB-1410
Backport: Backport is required for live branches that run GH actions:
2026.1, 2025.4, 2025.1 and 2024.1
Closesscylladb/scylladb#29525
(cherry picked from commit d2d7604188)
Closesscylladb/scylladb#29525
REST route removal unregisters handlers but does not wait for requests
that already entered storage_service. A request can therefore suspend
inside an async operation, restart proceeds to tear the service down,
and the coroutine later resumes against destroyed members such as
_topology_state_machine, _group0, or _sys_ks — a use-after-destruction
bug that surfaces as UBSAN dynamic-type failures (e.g. the crash seen
from topology_state_load()).
Fix this by holding storage_service::_async_gate from the entry
boundary of every externally-triggered async operation so that stop()
drains them before teardown begins. The gate is acquired in
run_with_api_lock, run_with_no_api_lock, and in individual REST
handlers that bypass those wrappers (reload_raft_topology_state,
mark_excluded, removenode, schema reload, topology-request
waits/abort, cleanup, ring/schema queries, SSTable dictionary
training/publish, and sampling).
Additionally, fix get_ownership() and abort_topology_request() which
forward work to shard 0 but were still referencing the caller-shard's
`this` pointer instead of the destination-shard instance, causing
silent cross-shard access to shard-local state.
Add a cluster regression test that repeatedly exercises the multi-shard
ownership REST path to cover the forwarding fix.
Fixes: SCYLLADB-1415
Should be backported to all branches, the code has been introduced around 2024.1 release.
Closesscylladb/scylladb#29373
* github.com:scylladb/scylladb:
storage_service: fix shard-0 forwarding in REST helpers
storage_service: gate REST-facing async operations during shutdown
storage_service: prepare for async gate in REST handlers
(cherry picked from commit 4043d95810)
Closesscylladb/scylladb#29611
Tracing events are written to system_traces.events with CL=ANY, so they are only guaranteed to be present on the local node of the query coordinator. Reading them back with the driver default (CL=LOCAL_ONE) may route the query to a replica that has not yet received all events, causing the assertion on 'digest mismatch, starting read repair' to fail intermittently.
Fix execute_with_tracing() to read tracing via the ResponseFuture API with query_cl=ConsistencyLevel.ALL, so events from all replicas are merged before the caller inspects them.
Fixes: SCYLLADB-1707
Backport: fixing flaky test, test failure only seen on master so far so no backport
- (cherry picked from commit b49cf6247f)
Parent PR: #29566Closesscylladb/scylladb#29631
* github.com:scylladb/scylladb:
test: fix flaky test_read_repair_with_trace_logging by reading tracing with CL=ALL
replica/database: consolidate the two database_apply error injections
Users ought to have possibility to create the local index for Vector Search
based only on a part of the partition key. This commits provides this by
removing requirements of 'full partition key only' for custom local index.
The commit updates docs to explain that local vector index can use only a part
of the partition key.
The commit implements cqlpy test to check fixed functionality.
Fixes: SCYLLADB-953
Needs to be backported to 2026.1 as it is a fix for local vector indexes.
Closesscylladb/scylladb#28931
(cherry picked from commit 7883f161bb)
Closesscylladb/scylladb#29543
Fixes: SCYLLADB-1706
The returned file object does not increment file pos as is. One line fix.
Added test to make sure this read path works as expected.
Closesscylladb/scylladb#29456
(cherry picked from commit c97ce32f47)
Closesscylladb/scylladb#29630
Tracing events are written to system_traces.events with CL=ANY, so they
are only guaranteed to be present on the local node of the query
coordinator. Reading them back with the driver default (CL=LOCAL_ONE)
may route the query to a replica that has not yet received all events,
causing the assertion on 'digest mismatch, starting read repair' to fail
intermittently.
Fix execute_with_tracing() to read tracing via the ResponseFuture API
with query_cl=ConsistencyLevel.ALL, so events from all replicas are
merged before the caller inspects them.
Fixes: SCYLLADB-1707
Closesscylladb/scylladb#29566
(cherry picked from commit b49cf6247f)
Into a single database_apply one. Add three parameters:
* ks_name and cf_name to filter the tables to be affected
* what - what to do: throw or wait
This leads to smaller footprint in the code and improved filtering for
table names at the cost of some extra error injection params in the
tests.
(cherry picked from commit f375aae257)
The new_test_keyspace context manager in test/cluster/util.py uses
DROP KEYSPACE without IF EXISTS during cleanup. The Python driver
has a known bug (scylladb/python-driver#317) where connection pool
renewal after concurrent node bootstraps causes double statement
execution. The DROP succeeds server-side, but the response is lost
when the old pool is closed. The driver retries on the new pool, and
gets ConfigurationException message "Cannot drop non existing keyspace".
The CREATE KEYSPACE in create_new_test_keyspace already uses IF NOT
EXISTS as a workaround for the same driver bug. This patch applies
the same approach to fix DROP KEYSPACE.
Fixes SCYLLADB-1538
Closesscylladb/scylladb#29487
(cherry picked from commit 40740104ab)
Closesscylladb/scylladb#29568
The view update builder ignored range tombstone changes from the update
stream when there all existing mutation fragments were already consumed.
The old code assumed range tombstones 'remove nothing pre-existing, so
we can ignore it', but this failed to update _update_current_tombstone.
Consequently, when a range delete and an insert within that range appeared
in the same batch, the range tombstone was not applied to the inserted row,
or was applied to a row outside the range that it covered causing it to
incorrectly survive/be deleted in the materialized view.
Fix by handling is_range_tombstone_change() fragments in the update-only
branch, updating _update_current_tombstone so subsequent clustering rows
correctly have the range tombstone applied to them.
Fixes SCYLLADB-1555
Closesscylladb/scylladb#29483
(cherry picked from commit 6011cb8a4c)
Closesscylladb/scylladb#29569
Three test cases in multishard_query_test.cc set the querier_cache entry
TTL to 2s and then assert, between pages of a stateful paged query, that
cached queriers are still present (population >= 1) and that
time_based_evictions stays 0.
The 2s TTL is not load-bearing for what these tests exercise — they are
checking the paging-cache handoff, not TTL semantics. But on busy CI
runners (SCYLLADB-1642 was observed on aarch64 release), scheduling
jitter between saving a reader and sampling the population can exceed
2s. When that happens, the TTL fires, both saved queriers are
time-evicted, population drops to 0, and the assertion
`require_greater_equal(saved_readers, 1u)` fails. The trailing
`require_equal(time_based_evictions, 0)` check never runs because the
earlier assertion has already aborted the iteration — which is why the
Jenkins failure surfaces only as a bare "C++ failure at seastar_test.cc:93".
Reproduced deterministically in test_read_with_partition_row_limits by
injecting a `seastar::sleep(2500ms)` between the save and the sample:
the hook then reports
population=0 inserts=2 drops=0 time_based_evictions=2 resource_based_evictions=0
and the assertion fires — matching the Jenkins symptoms exactly.
Bump the TTL to 60s in all three affected tests:
- test_read_with_partition_row_limits (confirmed repro for SCYLLADB-1642)
- test_read_all (same pattern, same invariants — suspect)
- test_read_all_multi_range (same pattern, same invariants — suspect)
Leave test_abandoned_read (1s TTL, actually tests TTL-driven eviction)
and test_evict_a_shard_reader_on_each_page (tests manual eviction via
evict_one(); its TTL is not load-bearing but the fix is deferred for a
separate review) unchanged.
Fixes: SCYLLADB-1683
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closesscylladb/scylladb#29564
(cherry picked from commit f5eb99f149)
Closesscylladb/scylladb#29607
Wait for cql on all hosts after restarting a server in the test.
The problem that was observed is that the test restarts servers[1] and
doesn't wait for the cql to be ready on it. On test teardown it drops
the keyspace, trying to execute it on the host that is not ready, and
fails.
Fixes SCYLLADB-1632
Closesscylladb/scylladb#29562
(cherry picked from commit 3468e8de8b)
Closesscylladb/scylladb#29624
The test creates a sync point immediately after writing 100 rows
with CL=ANY, without waiting for pending hint writes to complete.
store_hint() is fire-and-forget: it submits do_store_hint() to a gate
and returns immediately. do_store_hint() updates _last_written_rp only
after writing to the commitlog. If create_sync_point() is called before
all do_store_hint() coroutines complete, the captured replay position
is stale, and await_sync_point() returns DONE before all hints are
replayed, leaving some rows missing.
Fix by waiting for the size_of_hints_in_progress metric to reach zero
before creating the sync point, ensuring all in-flight hint writes have
completed and _last_written_rp is up to date. This follows the same
pattern already used in test_sync_point.
Fixes: SCYLLADB-1709
Closesscylladb/scylladb#29623
(cherry picked from commit 7634d3f7d4)
Closesscylladb/scylladb#29632
The log grep in get_sst_status searched from the beginning of the log
(no from_mark), so the second-repair assertions were checking cumulative
counts across both repairs rather than counts for the second repair alone.
The expected values (sst_add==2, sst_mark==2) relied on this cumulative
behaviour: 1 from the first repair + 1 from the second = 2. This works
when the second repair encounters exactly one unrepaired sstable, but
fails whenever the second repair sees two.
The second repair can see two unrepaired sstables when the 100 keys
inserted before it (via asyncio.gather) trigger a background auto-flush
before take_storage_snapshot runs. take_storage_snapshot always flushes
the memtable itself, so if an auto-flush already split the batch into two
sstables on disk, the second repair's snapshot contains both and logs
"Added sst" twice, making the cumulative count 3 instead of 2.
Fix: take a log mark per-server before each repair call and pass it to
get_sst_status so each check counts only the entries produced by that
repair. The expected values become 1/0/1 and 1/1/1 respectively,
independent of how many sstables happened to exist beforehand.
get_sst_status gains an optional from_mark parameter (default None)
which preserves existing call sites that intentionally grep from the
start of the log.
Fixes: SCYLLADB-1711
Closesscylladb/scylladb#29484
(cherry picked from commit d280517e27)
Closesscylladb/scylladb#29633
drain() signals the postponed_reevaluation condition variable to terminate
the postponed_compactions_reevaluation() coroutine but does not await its
completion. When enable() is called afterwards, it overwrites
_waiting_reevalution with a new coroutine, orphaning the old one. During
shutdown, really_do_stop() only awaits the latest coroutine via
_waiting_reevalution, leaving the orphaned coroutine still alive. After
sharded::stop() destroys the compaction_manager, the orphaned coroutine
resumes and reads freed memory (is_disabled() accesses _state).
Fix by introducing stop_postponed_compactions(), awaiting the reevaluation coroutine in
both drain() and stop() after signaling it, if postponed_compactions_reevaluation() is running.
It uses an std::optional<future<>> for _waiting_reevalution and std::exchange to leave
_waiting_reevalution disengaged when postponed_compactions_reevaluation() is not running.
This prevents a race between drain() and stop().
While at it, fix typo in _waiting_reevalution -> _waiting_reevaluation.
Fixes: SCYLLADB-1463
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#29443
(cherry picked from commit 05a00fe140)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#29527
This series makes result metadata handling for auth LIST statements consistent and adds coverage for the driver-visible behavior.
The first patch makes the result-column metadata construction shared across the affected statements, so the metadata shape used for PREPARE and EXECUTE stays uniform and easier to reason about.
The second patch adds regression coverage for both sides of the metadata-id flow:
- a Python auth-cluster test verifies that prepared LIST ROLES OF returns a non-empty result metadata id and that a later EXECUTE reuses it without METADATA_CHANGED
- a Boost transport test covers the recovery path where the client sends an empty request metadata id and the server responds with METADATA_CHANGED and the full metadata
Together these patches tighten the implementation and protect the prepared-metadata-id behavior exposed to drivers.
Fixes: SCYLLADB-1543
backport: this change should be backported to all active branches to help the driver operation
- (cherry picked from commit de19714763)
Parent PR: #29347Closesscylladb/scylladb#29477
* github.com:scylladb/scylladb:
test/cluster: cover prepared LIST metadata ids in one setup Precompute the expected metadata-id hashes for the prepared LIST auth and service-level statements and verify that PREPARE returns them while EXECUTE reuses the prepared metadata without METADATA_CHANGED. Run all cases in a single auth-cluster test after preparing the cluster, role, and service level once through the regular manager fixture.
cql: expose stable result metadata for prepared LIST statements Prepared LIST statements were not calculating metadata in PREPARE path, and sent empty string hash to client causing problematic behaviour where metadat_id was not recalculated correctly. This patch moves metadata construction into get_result_metadata() for the affected LIST statements and reuse that metadata when building the result set. This gives PREPARE a stable metadata id for LIST ROLES, LIST USERS, LIST PERMISSIONS and the service-level variants. This patch also adds a new boost test that verifies that when an EXECUTE request carries an empty result metadata id while the server has a real metadata id for the result set, the response is marked METADATA_CHANGED and includes the full result metadata plus the server metadata id. This covers the recovery path for clients that send an empty or otherwise unusable metadata id instead of a matching cached one.
`test_limit_concurrent_requests` could create far more tables than intended
because worker threads looped indefinitely and only the probe path terminated
the test. In practice, workers often hit `RequestLimitExceeded` first, but the
test kept running and creating tables, increasing memory pressure and causing
flakiness due to bad_alloc errors in logs.
Fix by replacing the old probe-driven termination with worker-driven
termination. Workers now run until any worker sees
`RequestLimitExceeded`.
Fixes SCYLLADB-1181
Closesscylladb/scylladb#29270
(cherry picked from commit b355bb70c2)
Closesscylladb/scylladb#29292
sstables_loader::load_and_stream holds a replica::table& reference via
the sstable_streamer for the entire streaming operation. If the table
is dropped concurrently (e.g. DROP TABLE or DROP KEYSPACE), the
reference becomes dangling and the next access crashes with SEGV.
This was observed in a longevity-50gb-12h-master test run where a
keyspace was dropped while load_and_stream was still streaming SSTables
from a previous batch.
Fix by acquiring a stream_in_progress() phaser guard in load_and_stream
before creating the streamer. table::stop() calls
_pending_streams_phaser.close() which blocks until all outstanding
guards are released, keeping the table alive for the duration of the
streaming operation.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1352Closesscylladb/scylladb#29403
(cherry picked from commit e5e6608f20)
Closesscylladb/scylladb#29558
Pass jvm_args=["--smp", "1"] on both cluster.start() calls to
ensure consistent shard count across restarts, avoiding resharding
on restart. Also pass wait_for_binary_proto=True to cluster.start()
to ensure the CQL port is ready before connecting.
Fixes: SCYLLADB-824
Closesscylladb/scylladb#29548
(cherry picked from commit 34adb0e069)
Closesscylladb/scylladb#29557
The `try-catch` expression is pretty much useless in its current form. If we return the future, the awaiting will only be performed by the caller, completely circumventing the exception handling.
As a result, instead of handling `raft::request_aborted` with a proper error message, the user will face `seastar::abort_requested_exception` whose message is cryptic at best. It doesn't even point to the root of the problem.
Fixes SCYLLADB-665
Backport: This is a small improvement and may help when debugging, so let's backport it to all supported versions.
- (cherry picked from commit c36623baad)
- (cherry picked from commit e4f2b62019)
- (cherry picked from commit fae71f79c2)
Parent PR: #28624Closesscylladb/scylladb#28753
* https://github.com/scylladb/scylladb:
test: raft: Add test_aborting_wait_for_state_change
raft: Describe exception types for wait_for_state_change and wait_for_leader
raft: Await instead of returning future in wait_for_state_change