Commit Graph

53623 Commits

Author SHA1 Message Date
Petr Gusev
a2b2a42936 storage_service: cancel write handlers during drain to prevent shutdown deadlock
When a node shuts down, do_drain() calls stop_transport() which tears
down the messaging service. After this point, MUTATION_DONE responses
from replicas can no longer reach the coordinator, so any in-flight
write_response_handlers will never complete naturally. These handlers
hold ERMs referencing stale token_metadata versions.

If the topology coordinator calls barrier_and_drain (either on itself
or via RPC), it blocks in stale_versions_in_use() waiting for these
stale versions to be released. This causes:
- On the coordinator node: do_drain -> wait_for_group0_stop deadlock
  (the topology coordinator fiber is stuck in barrier_and_drain).
- On non-coordinator nodes: ss::stop -> uninit_messaging_service
  deadlock (the barrier_and_drain RPC handler holds the gate open).

Fix: cancel all write response handlers on all shards right after
stop_transport() in do_drain(). This releases their ERMs and the
associated stale token_metadata versions, unblocking
stale_versions_in_use().

Heap-allocate _write_handlers_gate and add an allow_new parameter to
cancel_all_write_response_handlers(). When allow_new=true (used by
do_drain), the gate is closed and swapped with a fresh one — existing
handlers are waited on while new handlers can still be created. This
avoids blocking internal writes (paxos learn, compaction history
updates) that still need to create handlers during the remainder of
the drain sequence. When allow_new=false (used by drain_on_shutdown),
the gate is closed permanently — no new handlers can be created after
final shutdown.

Update test_lwt_shutdown to wait for 'Stop transport: done' instead
of 'Shutting down storage proxy RPC verbs'. The latter message is
now only logged after do_drain() completes, but do_drain() blocks
in cancel_all_write_response_handlers() waiting for the background
paxos learn handler — which is exactly what the test needs to release
before shutdown can proceed.

Fixes: SCYLLADB-2163
Refs: scylladb/scylladb#23665
(cherry picked from commit 2927f0dd21)
2026-05-21 18:58:06 +00:00
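The close-and-swap behaviour described in a2b2a42936 can be illustrated with a small model. This is only a sketch in Python with asyncio, not the Seastar code: `Gate`, `ToyProxy`, `create_handler` and `cancel_all` are hypothetical names, and cancelling a handler is reduced to releasing its hold on the gate.

```python
import asyncio


class GateClosed(Exception):
    """Raised when a handler tries to enter a closed gate."""


class Gate:
    """Toy gate: counts holders; close() rejects new ones and waits for the rest."""

    def __init__(self):
        self._holders = 0
        self._closed = False
        self._idle = asyncio.Event()
        self._idle.set()

    def enter(self):
        if self._closed:
            raise GateClosed()
        self._holders += 1
        self._idle.clear()

    def leave(self):
        self._holders -= 1
        if self._holders == 0:
            self._idle.set()

    async def close(self):
        self._closed = True
        await self._idle.wait()


class ToyProxy:
    """Hypothetical stand-in for the write-handler bookkeeping."""

    def __init__(self):
        self.gate = Gate()
        self.pending = []  # gates held by in-flight handlers

    def create_handler(self):
        gate = self.gate
        gate.enter()  # raises GateClosed after a permanent close
        self.pending.append(gate)

    async def cancel_all(self, allow_new):
        old_gate, pending = self.gate, self.pending
        self.pending = []
        if allow_new:
            self.gate = Gate()  # later handlers enter the fresh gate
        for gate in pending:
            gate.leave()  # a cancelled handler releases its hold
        await old_gate.close()


async def demo():
    proxy = ToyProxy()
    proxy.create_handler()
    await proxy.cancel_all(allow_new=True)   # drain: swap the gate
    proxy.create_handler()                   # internal write still allowed
    await proxy.cancel_all(allow_new=False)  # final shutdown: close for good
    try:
        proxy.create_handler()
    except GateClosed:
        return "closed"
    return "open"


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this model the swap happens before the old gate is closed, so a handler created while the close is still waiting lands on the new gate instead of failing.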
Petr Gusev
1268ab6f92 test_unfinished_writes_during_shutdown: extend to cover coordinator shutdown
The existing test only covers the case where the shutting-down node is
NOT the topology coordinator (deadlocks in uninit_messaging_service).
When the node IS the coordinator, the deadlock manifests differently:
the topology coordinator fiber calls barrier_and_drain on itself
(without messaging), and do_drain -> wait_for_group0_stop blocks
because the coordinator can't stop while stale_versions_in_use is
waiting on the uncancelled write handler.

Run the test twice on the same 2-node cluster (RF=2):
- Run 1: target is a non-coordinator
- Restore cluster state (restart target, decommission added node)
- Run 2: target is the topology coordinator

Use CL=ONE so the write completes from the local replica even with
the other server's response paused.

Mark as xfail since this reproduces bugs not yet fixed on this branch.

Refs: SCYLLADB-1842
(cherry picked from commit 5bc3e84d1e)
2026-05-21 18:58:06 +00:00
Petr Gusev
b147ab4418 test_unfinished_writes_during_shutdown: fix to reproduce the shutdown deadlock
The test was written for another case, and was not supposed to
reproduce the issue that was fixed in this PR.

Fix the test to reproduce the real scenario:

1. Use one_shot=False for pause_before_barrier_and_drain so the
   injection fires on every barrier_and_drain RPC, not just the first.

2. Let the first barrier_and_drain through (at this point the write
   handler's ERM version matches the current token_metadata version).

3. Wait for the second barrier_and_drain. Between the two calls,
   topology_state_load installs a new token_metadata version. The
   write handler still holds the old version's ERM — now stale.

4. After stop_transport completes, disable the injection (rather than
   sending a single message) to release the paused handler and any
   subsequent ones that arrived during stop_transport. The 'disabled'
   flag in injection_shared_data ensures all waiters wake up.

With these changes the test reliably fails (shutdown deadlock within
15s) on the unfixed code and passes on the fixed version from
e0dc73f52a ('Cancel all write requests on storage_proxy shutdown').

Refs: scylladb/scylladb#23665
(cherry picked from commit a093be9ca9)
2026-05-21 18:58:05 +00:00
Petr Gusev
8489493323 test_unfinished_writes_during_shutdown: await add_last_node_task instead of cancelling it
asyncio cancel() only affects the client-side coroutine. The
server-side addserver handler in the cluster manager continues
running. If it can't complete (e.g. no raft quorum because the
target node is shut down), the orphaned handler blocks _after_test
cleanup for 120s.

Await the task instead so it completes cleanly (we restart the
target node first to restore quorum).

(cherry picked from commit 32002f6443)
2026-05-21 18:58:05 +00:00
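The point made in 8489493323, that cancelling an asyncio task only stops the client-side coroutine, can be reproduced in isolation. The sketch below is an assumption-laden model: `server_handler` and `client_request` are hypothetical, and the "server" is just another task in the same event loop.

```python
import asyncio


async def demo():
    server_done = asyncio.Event()
    server_tasks = []

    async def server_handler():
        # Work that continues regardless of what the client does.
        await asyncio.sleep(0.05)
        server_done.set()

    async def client_request():
        # Sending the request starts independent server-side work.
        server_tasks.append(asyncio.create_task(server_handler()))
        await server_done.wait()

    client = asyncio.create_task(client_request())
    await asyncio.sleep(0)  # let the request start
    client.cancel()
    try:
        await client
    except asyncio.CancelledError:
        pass

    still_running = not server_tasks[0].done()
    await server_tasks[0]  # awaiting lets the server side finish cleanly
    return still_running


if __name__ == "__main__":
    print(asyncio.run(demo()))
```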
Petr Gusev
8addbed0dc test_unfinished_writes_during_shutdown: add timeout and deadlock detection for shutdown_task
Add a 15s timeout around the shutdown_task await. If the timeout
fires, the deadlock is reproduced (shutdown hung because
stale_versions_in_use blocks on a write handler holding a stale
token_metadata version).

When the timeout fires, explicitly kill the node via
server_stop() so that the manager's _after_test handler does not
wait 120s for the stuck stop_gracefully request. Then fail the
test with a clear message.

(cherry picked from commit fa01f74ae6)
2026-05-21 18:58:04 +00:00
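The timeout-then-kill pattern from 8addbed0dc might look roughly like the following. This is a sketch, not the test's code: `await_shutdown` and the `kill` callback are hypothetical, and `asyncio.shield` is used here so that the timeout does not cancel the shutdown task itself.

```python
import asyncio


async def await_shutdown(shutdown_task, kill, timeout=15.0):
    """Wait for a graceful shutdown; on timeout, kill the node and fail."""
    try:
        await asyncio.wait_for(asyncio.shield(shutdown_task), timeout)
    except asyncio.TimeoutError:
        await kill()  # avoid leaving a stuck graceful stop behind
        raise AssertionError(f"shutdown did not finish within {timeout}s")


async def demo():
    dead = asyncio.Event()

    async def stuck_shutdown():
        await dead.wait()  # only returns once the process is killed

    async def kill():
        dead.set()

    task = asyncio.create_task(stuck_shutdown())
    try:
        await await_shutdown(task, kill, timeout=0.05)
    except AssertionError:
        await task  # the graceful stop unblocks after the kill
        return "deadlock detected"
    return "clean shutdown"


if __name__ == "__main__":
    print(asyncio.run(demo()))
```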
Petr Gusev
de55e0472a test: scylla_cluster: allow stop() to bypass start_stop_lock
Remove the @stop_event and @start_stop_lock decorators from
ScyllaServer.stop() so it can SIGKILL a server even while
stop_gracefully() holds the lock (e.g. the node is deadlocked
during shutdown and stop_gracefully is blocked on cmd.wait()).

A local copy of self.cmd is used because there are await points
after which another coroutine (stop_gracefully) may set self.cmd
to None. The concurrent stop_gracefully() unblocks once the
process dies from SIGKILL since its cmd.wait() returns.

Also make shutdown_control_connection a plain (non-async) function
since it contains no await points — this makes it obvious that no
coroutine interleaving is possible inside it.

(cherry picked from commit c88120abca)
2026-05-21 18:58:04 +00:00
Petr Gusev
09aca68f71 error_injection: add non-shared mode to wait_for_message
Add a 'share' parameter to wait_for_message (default true, preserving
existing behavior). When share=false, each handler invocation requires
its own dedicated message to proceed — a message consumed by one
handler is not visible to others.

Use share=false for the pause_before_barrier_and_drain injection in
raft_topology_cmd_handler. The topology coordinator sends multiple
barrier_and_drain RPCs during a single topology transition (one per
state change). With share=true a single message_injection call
releases all handlers. With share=false the test can release them
one at a time, controlling exactly which topology state the write
handler's ERM captures.

(cherry picked from commit 324a08295d)
2026-05-21 18:58:04 +00:00
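The difference between the two modes in 09aca68f71 can be modelled with a counter and a condition variable. The class below is a hypothetical Python model, not the C++ error_injection API; in particular, treating shared mode as "any received message releases every waiter" is a simplifying assumption.

```python
import asyncio


class ToyInjection:
    def __init__(self, share=True):
        self.share = share
        self.received = 0  # messages sent so far
        self.consumed = 0  # messages taken by non-shared waiters
        self.cond = asyncio.Condition()

    async def send_message(self):
        async with self.cond:
            self.received += 1
            self.cond.notify_all()

    async def wait_for_message(self):
        async with self.cond:
            if self.share:
                # Shared: one message is visible to every waiter.
                await self.cond.wait_for(lambda: self.received > 0)
            else:
                # Non-shared: each waiter consumes its own message.
                await self.cond.wait_for(lambda: self.received > self.consumed)
                self.consumed += 1


async def released_after_one_message(share):
    inj = ToyInjection(share)
    released = []

    async def handler(name):
        await inj.wait_for_message()
        released.append(name)

    tasks = [asyncio.create_task(handler(n)) for n in ("first", "second")]
    await asyncio.sleep(0)     # let both handlers block
    await inj.send_message()   # a single message
    await asyncio.sleep(0.01)  # give released handlers time to finish
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return len(released)


if __name__ == "__main__":
    print(asyncio.run(released_after_one_message(True)))
    print(asyncio.run(released_after_one_message(False)))
```

With `share=False` a test can send one message per paused handler and so step through them one at a time.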
Petr Gusev
3655879f48 error_injection: release waiters when injection is disabled
When an error injection is disabled (via disable() or disable_all()),
any handlers currently suspended in wait_for_message() must be woken up
so they can proceed instead of hanging until timeout.

Add a 'disabled' flag to injection_shared_data. When disable() or
disable_all() is called, set the flag and broadcast the condition
variable. The wait_for_message() predicate checks the flag and returns
true immediately, letting the handler continue.

This makes disable() atomic with respect to releasing waiters: it both
wakes up blocked handlers and removes the injection from the enabled
map in one call. This avoids races that would occur with separate
message_injection() + disable() calls — message_injection() after
disable() fails because the injection is already gone, and
disable() after message_injection() risks a new handler hitting the
injection between the two calls.

Concrete example: test_unfinished_writes_during_shutdown pauses
barrier_and_drain RPC handlers via wait_for_message. During shutdown,
the test calls disable_injection() to simultaneously release the
paused handler and prevent any new barrier_and_drain RPCs from
getting stuck.

(cherry picked from commit bc4dc13e94)
2026-05-21 18:58:03 +00:00
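The 'disabled' flag described in 3655879f48 can be sketched as an extra term in the wait predicate. Everything below is a hypothetical Python model (`PausePoint`, `ToyInjections`); it only illustrates why disabling both wakes current waiters and lets later callers pass straight through.

```python
import asyncio


class PausePoint:
    def __init__(self):
        self.received = 0
        self.disabled = False
        self.cond = asyncio.Condition()

    async def wait_for_message(self):
        async with self.cond:
            # Return as soon as a message arrives or the injection is disabled.
            await self.cond.wait_for(lambda: self.disabled or self.received > 0)


class ToyInjections:
    def __init__(self):
        self.enabled = {}

    def enable(self, name):
        self.enabled[name] = PausePoint()

    async def disable(self, name):
        point = self.enabled.pop(name, None)  # new callers no longer see it
        if point is None:
            return
        async with point.cond:
            point.disabled = True
            point.cond.notify_all()  # wake everyone already waiting

    async def inject(self, name):
        point = self.enabled.get(name)
        if point is not None:
            await point.wait_for_message()


async def demo():
    injections = ToyInjections()
    injections.enable("pause")
    paused = asyncio.create_task(injections.inject("pause"))
    await asyncio.sleep(0)
    was_blocked = not paused.done()
    await injections.disable("pause")
    await asyncio.wait_for(paused, 1)                      # released by disable
    await asyncio.wait_for(injections.inject("pause"), 1)  # passes through
    return was_blocked


if __name__ == "__main__":
    print(asyncio.run(demo()))
```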
Patryk Jędrzejczak
d5a27e1cf1 test: fix flaky test_raft_snapshot_truncation by waiting for async log truncation
Snapshot creation and raft log truncation happen asynchronously in the
IO fiber after a schema change completes. The test was querying
system.raft immediately after the schema change returned, racing with
the IO fiber's store_snapshot_descriptor call.

Replace immediate assertions with wait_for polling loops:
- log_size == 0: wait for log truncation after drop keyspace
- new_snap_id != original_snap_id: wait for new snapshot to be persisted

Fixes: SCYLLADB-2157

Closes scylladb/scylladb#29967

(cherry picked from commit cbadc3d675)

Closes scylladb/scylladb#29999
2026-05-21 16:06:24 +02:00
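Replacing an immediate assertion with a polling loop, as d5a27e1cf1 does, follows a common shape. The helper below is a self-contained sketch named `poll_until`; it is not the test suite's own `wait_for`, and the simulated log sizes are made up for the demo.

```python
import asyncio
import time


async def poll_until(check, timeout=5.0, period=0.01):
    """Call check() until it returns a non-None value or the deadline passes."""
    deadline = time.monotonic() + timeout
    while True:
        value = await check()
        if value is not None:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met before the deadline")
        await asyncio.sleep(period)


async def demo():
    # Simulated raft log sizes observed on successive polls.
    observed = iter([3, 3, 0])

    async def log_truncated():
        return True if next(observed) == 0 else None

    return await poll_until(log_truncated)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```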
Avi Kivity
0afe3dcfd5 Update seastar submodule (default DMA alignment)
* seastar 4d268e0ef5...74f19b81ca (1):
  > file: fix default DMA alignment

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-2043

Ref 6df04c9e5b
2026-05-20 18:57:19 +03:00
Jenkins Promoter
c4c38aeeda Update pgo profiles - aarch64
2026-05-20 15:41:19 +03:00
Jenkins Promoter
11c7df5510 Update pgo profiles - x86_64
2026-05-20 14:35:23 +03:00
Jenkins Promoter
bd3335221e Update ScyllaDB version to: 2026.2.0-rc3
2026-05-19 23:20:28 +03:00
Patryk Jędrzejczak
6466cded43 Merge '[Backport 2026.2] cql: fix missing TABLETS_ROUTING_V1 payload after CAS shard bounce' from Scylladb[bot]
After an internal CAS shard bounce, check_locality() was evaluating
against this_shard_id() of the post-bounce shard — which is the correct
tablet shard — so it returned nullopt, and LWT/SERIAL responses omitted
the tablets-routing-v1 custom payload. The client never learned the
correct tablet map.

Fix by recording the original entry shard in client_state (initialized
to this_shard_id() at construction, preserved across shard bounces via
client_state_for_another_shard) and passing it to check_locality() so
it compares against the client's actual routing decision.

No host_id tracking or forwarded_client_state IDL changes are needed
because CAS shard bounces are always intra-node.

Fixes SCYLLADB-2041

backport: need to backport to all versions with LWT over tablets

- (cherry picked from commit 167a3c9c50)
- (cherry picked from commit 8a76ec7e65)
- (cherry picked from commit 738b7b4a86)
- (cherry picked from commit 9e3209e4a3)

Parent PR: #29910

Closes scylladb/scylladb#29948

* https://github.com/scylladb/scylladb:
  cql: refactor add_tablet_info to take tablet_routing_info directly
  cql: fix UB dereference of nullopt tablet_info in execute_with_condition
  test/boost: add regression test for missing tablet routing after CAS bounce
  cql: fix missing TABLETS_ROUTING_V1 payload after CAS shard bounce
2026-05-19 10:24:49 +02:00
Ernest Zaslavsky
8173885012 test: fix use-after-free in start_docker_service retry path
start_docker_service is a coroutine that took docker_args and
image_args by const reference. Its caller start_fake_gcs_server
is a regular function that passes temporaries (initializer lists)
and immediately returns a future. The temporaries are destroyed
when the caller returns, leaving the coroutine holding dangling
references.

On the first loop iteration this works by luck (memory not yet
reused), but on retry (after "address already in use") the
params.append_range(image_args) reads freed memory, causing
use-after-free that manifests as std::bad_alloc or broken_promise
in non-sanitizer builds.

Fix by taking docker_args and image_args by value so the coroutine
frame owns the vectors for its entire lifetime.

Fixes: SCYLLADB-2081

Closes scylladb/scylladb#29932

(cherry picked from commit 834eed10d9)

Closes scylladb/scylladb#29943
2026-05-18 19:24:55 +03:00
Petr Gusev
1c3fde8abb cql: refactor add_tablet_info to take tablet_routing_info directly
Change add_tablet_info() to accept locator::tablet_routing_info instead
of destructured (tablet_replica_set, token_range) pair. This simplifies
all three call sites.

Remove the empty-replicas guard inside add_tablet_info(): the only
producer of tablet_routing_info is tablet ERM's check_locality(), which
returns either nullopt (correctly routed) or info with replicas copied
from tablet_info — a tablet always has replicas. All callers already
check for nullopt before calling add_tablet_info(), so by the time we
enter the function replicas are guaranteed non-empty.

(cherry picked from commit 9e3209e4a3)
2026-05-18 13:39:09 +00:00
Petr Gusev
b8b08ea89f cql: fix UB dereference of nullopt tablet_info in execute_with_condition
When check_locality() returns nullopt (correctly routed LWT), the
optional tablet_info was unconditionally dereferenced in the lambda
capture list: tablet_info->tablet_replicas, tablet_info->token_range.

The code previously masked this by initializing tablet_info with an
empty-but-present value, so the dereference happened to work but
only because the empty tablet_replicas made add_tablet_info() a no-op.
After check_locality() overwrites it with nullopt, the dereference
is UB.

Fix by initializing tablet_info as empty (nullopt) and guarding the
dereference.

(cherry picked from commit 738b7b4a86)
2026-05-18 13:39:09 +00:00
Petr Gusev
377cb5bc53 test/boost: add regression test for missing tablet routing after CAS bounce
Add test_tablet_routing_info_after_cas_shard_bounce that verifies
TABLETS_ROUTING_V1 payload is returned after an internal CAS shard
bounce.

The test simulates the transport-layer bounce: it creates a table whose
single tablet replica lands on a shard different from the test thread,
executes an LWT (which bounces), then transfers client_state via
client_state_for_another_shard (preserving _original_shard) and
re-executes on the tablet shard. The test asserts that check_locality()
correctly detects the misrouting and returns tablet routing info.

Refs SCYLLADB-2041

(cherry picked from commit 8a76ec7e65)
2026-05-18 13:39:08 +00:00
Petr Gusev
c66f8fa6fd cql: fix missing TABLETS_ROUTING_V1 payload after CAS shard bounce
After an internal CAS shard bounce, check_locality() was evaluating
against this_shard_id() of the post-bounce shard — which is the correct
tablet shard — so it returned nullopt, and LWT/SERIAL responses omitted
the tablets-routing-v1 custom payload. The client never learned the
correct tablet map.

Fix by recording the original entry shard in client_state (initialized
to this_shard_id() at construction, preserved across shard bounces via
client_state_for_another_shard) and passing it to check_locality() so
it compares against the client's actual routing decision.

No host_id tracking or forwarded_client_state IDL changes are needed
because CAS shard bounces are always intra-node.

Fixes SCYLLADB-2041

(cherry picked from commit 167a3c9c50)
2026-05-18 13:39:08 +00:00
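The effect of comparing against the original entry shard, as c66f8fa6fd describes, can be shown with a toy model. The functions below are hypothetical simplifications in Python: shards are plain integers and the routing info is reduced to a small dict.

```python
from typing import Optional


def check_locality(tablet_shard: int, request_shard: int) -> Optional[dict]:
    """Return routing info when the request hit the wrong shard, else None."""
    if request_shard == tablet_shard:
        return None
    return {"tablet_shard": tablet_shard}


def respond_after_bounce(original_shard, tablet_shard, use_original_shard):
    # After the bounce the statement runs on the tablet's own shard.
    this_shard = tablet_shard
    compared = original_shard if use_original_shard else this_shard
    return check_locality(tablet_shard, compared)


if __name__ == "__main__":
    # Client sent the request to shard 0; the tablet lives on shard 3.
    print(respond_after_bounce(0, 3, use_original_shard=False))  # old behaviour
    print(respond_after_bounce(0, 3, use_original_shard=True))   # fixed behaviour
```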
Jenkins Promoter
172cd07bb7 Update ScyllaDB version to: 2026.2.0-rc2
scylla-2026.2.0-rc2-candidate-20260518093053 scylla-2026.2.0-rc2
2026-05-18 14:22:29 +03:00
Andrzej Jackowski
fb33abaee6 test: storage: retry fusermount3 unmount on teardown
After stopping scylla server processes, the FUSE daemon
(fuse2fs) may still be processing file handle closures.
An immediate fusermount3 -u can fail with 'device busy',
causing spurious test failures on teardown.

Retry the unmount up to 10 times with 0.5s delay between
attempts, and capture stderr for diagnostics.

Fixes: SCYLLADB-2066

Closes scylladb/scylladb#29920

(cherry picked from commit 61e5ec9888)

Closes scylladb/scylladb#29930
2026-05-18 12:25:22 +03:00
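A retry loop of the kind fb33abaee6 describes (up to 10 attempts, 0.5s apart, stderr kept for diagnostics) could be written as follows. This is a sketch rather than the test's code: `unmount_with_retry` is a hypothetical name, and the `run` and `sleep` parameters exist only so the demo can substitute fakes.

```python
import subprocess
import time
from types import SimpleNamespace


def unmount_with_retry(mountpoint, attempts=10, delay=0.5,
                       run=subprocess.run, sleep=time.sleep):
    """Run `fusermount3 -u`, retrying while the mount is still busy."""
    last_stderr = ""
    for attempt in range(1, attempts + 1):
        result = run(["fusermount3", "-u", mountpoint],
                     capture_output=True, text=True)
        if result.returncode == 0:
            return attempt  # number of attempts it took
        last_stderr = result.stderr
        if attempt < attempts:
            sleep(delay)
    raise RuntimeError(
        f"fusermount3 -u {mountpoint} failed after {attempts} attempts: "
        f"{last_stderr.strip()}"
    )


def demo():
    calls = []

    def fake_run(cmd, **kwargs):
        # Pretend the device is busy for the first two attempts.
        calls.append(cmd)
        busy = len(calls) < 3
        return SimpleNamespace(returncode=1 if busy else 0,
                               stderr="device busy" if busy else "")

    return unmount_with_retry("/mnt/test", run=fake_run, sleep=lambda s: None)


if __name__ == "__main__":
    print(demo())
```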
Avi Kivity
c1c0c96643 Merge '[Backport 2026.2] QOS: self-heal stale V1-to-V2 migration state on upgrade' from Scylladb[bot]
service_levels: self-heal stale v1 marker after raft topology upgrade

This PR handles an upgrade corner case where a node may already be using
raft topology, while `system.scylla_local` still marks service levels as v1.

The problem was introduced by commit 2917ec5d51
("service:qos: service levels migration"), which added the service-levels
migration from `system_distributed.service_levels` to
`system.service_levels_v2` as part of the raft topology upgrade.

However, if the cluster had no service levels configured, there was no data
to migrate. In that case, the migration path could leave the local version
marker unchanged, so the node would later observe an inconsistent state:

  * raft topology is already enabled;
  * service levels are still marked as v1 in `system.scylla_local`.

Such clusters can be left in a stale state and fail startup during upgrade to
2026.2.

This PR makes the upgrade path self-healing.

The first commit restores `service_level_controller::migrate_to_v2()`, giving
us a group0-based path for writing the service-levels v2 state even after raft
topology is already in use.

The second commit wires this path into startup. When the node detects the
stale raft-topology + service-levels-v1 state, it retries the migration a
bounded number of times and updates the version marker to v2 instead of
failing startup.

With this change, clusters that were left in this stale state can recover
automatically during upgrade to 2026.

Fixes: SCYLLADB-2038

backport: 2026.2, 2026.1; this functionality is needed when upgrading older servers

- (cherry picked from commit ac0a19aab8)
- (cherry picked from commit c2014f7e50)
- (cherry picked from commit 6188bf3e01)

Parent PR: #29749

Closes scylladb/scylladb#29905

* github.com:scylladb/scylladb:
  test/auth_cluster: simulate v1 state in self-heal test
    When skip_service_levels_v2_initialization is used, write an explicit
    v1 service level version marker while skipping v2 initialization. This
    lets the restart test exercise self-healing from v1 to v2.
  qos: self-heal stale service levels version on startup
  qos: reintroduce service levels v2 migration self-heal
2026-05-17 19:33:58 +03:00
Avi Kivity
b924f66425 Merge '[Backport 2026.2] test: limits: optimize test_max_cells to avoid large allocations and fragmentation' from Scylladb[bot]
The `test_max_cells` test was flaky due to `std::bad_alloc` caused by Seastar buddy allocator fragmentation. The root causes are:
1. The doubling loop with 24 iterations of CREATE/INSERT/DROP fragmented the allocator
2. The test built the whole batch as a single string that takes contiguous memory

Also, some iterations inserted zero rows, but still did CREATE/DROP table which also contributed to the fragmentation.

This patch series:
- Skips iterations that insert zero rows
- Creates the table once, truncates it after each test iteration
- Switches to prepared statements

Investigation results are presented in detail in https://scylladb.atlassian.net/browse/SCYLLADB-1645

Fixes SCYLLADB-1645

CI stability improvement. Backport to versions that have this test.

- (cherry picked from commit 3debae9a37)
- (cherry picked from commit 0fd6f6f292)
- (cherry picked from commit 0574055b73)

Parent PR: #29759

Closes scylladb/scylladb#29926

* github.com:scylladb/scylladb:
  test: prepare max cells inserts
  test: reuse max cells schema
  test: limits: skip empty max cells iterations
2026-05-17 19:31:32 +03:00
Botond Dénes
c1a5fea937 docs: expand OCI Object Storage configuration section
The existing OCI section in admin.rst was a minimal stub that only showed
a config snippet without explaining how to actually set up connectivity.

Add documentation for:
- The OCI S3-compatible endpoint URL format (namespace + region)
- That credentials must be set explicitly via AWS_ACCESS_KEY_ID /
  AWS_SECRET_ACCESS_KEY using OCI Customer Secret Keys (unlike AWS,
  OCI has no instance metadata fallback compatible with STS/EC2)
- A note that iam_role_arn is AWS-specific and should be omitted for OCI

Fixes: SCYLLADB-2047

Closes scylladb/scylladb#29689

(cherry picked from commit 8a305dd6c7)

Closes scylladb/scylladb#29916
2026-05-16 19:34:25 +03:00
Piotr Szymaniak
e87c4d80aa alternator/doc: update Streams compatibility docs
Alternator Streams graduated from experimental in #29604.  Update the
compatibility and FAQ docs accordingly:

- Replace the "Experimental API features" section with a new
  "Alternator Streams" section that lists known differences without
  the experimental framing.
- Expand the alternator_streams_increased_compatibility paragraph to
  explain both consequences of leaving it off (spurious no-op events
  and inaccurate INSERT/MODIFY distinction) and the performance cost
  of enabling it (LWT path for every write).
- Drop the stale ShardFilter limitation (now implemented).
- Replace the alternator-streams FAQ example with
  strongly-consistent-tables so the multi-feature syntax example
  remains useful.

Fixes SCYLLADB-462

Closes scylladb/scylladb#29695

(cherry picked from commit ac3fff897a)

Closes scylladb/scylladb#29921
2026-05-16 19:34:01 +03:00
Aleksandra Martyniuk
4b8672256d service: skip load_sketch unload for excluded nodes on RF shrink
When an RF change shrinks replicas on a DC and the node being shrunk is
excluded, refresh_tablet_load_stats() only provides load_stats for that
node if it has a cached snapshot from when the node was still up. If the
snapshot is missing or predates the tables being shrunk (e.g. they were
created after the node went down), stats stay incomplete. In that case
load_sketch::unload() called from make_rf_change_plan() throws:

    Can't provide accurate load computation with incomplete load_stats
    for host: <uuid>

Since an excluded node is not expected to come back, load_stats will
never become complete, and the topology coordinator retries the plan
infinitely, hanging ALTER KEYSPACE.

Add a check for excluded nodes and skip unload() for them: we are
removing the replica, so accurate load data for that node is not
needed. For all other node states the throw-and-retry behavior is
preserved.

Modify test_excludenode_shrink_rf to always trigger the bug: a new
error injection 'force_down_node_load_stats_invalid' forces the
invalid-stats path in refresh_tablet_load_stats() for a down node, so
the test does not depend on whether the load-stats refresher happened
to cache the excluded node's stats while it was still up.

Fixes: SCYLLADB-2056

Closes scylladb/scylladb#29622

(cherry picked from commit d874d355c2)

Closes scylladb/scylladb#29927
2026-05-16 19:18:18 +03:00
Marcin Maliszkiewicz
44784a5169 test: prepare max cells inserts
Switch from raw CQL batch string to using a prepared statement.
The old approach constructed the entire 50-row batch as a single
CQL text string (~19.8 MiB with 32768 column names spelled out
per row). This caused large contiguous allocations in the server.

Fixes SCYLLADB-1645

(cherry picked from commit 0574055b73)
2026-05-15 19:04:12 +00:00
Marcin Maliszkiewicz
69fbe36d1a test: reuse max cells schema
Extract table creation into _create_max_cell_count_table(). Call
it once before the loop instead of creating and dropping the table
on every iteration. Use TRUNCATE instead of DROP TABLE between
iterations to clear data while keeping the schema.

This avoids repeated schema operations that fragment the Seastar
buddy allocator's address space with scattered small allocations.

Refs SCYLLADB-1645

(cherry picked from commit 0fd6f6f292)
2026-05-15 19:04:11 +00:00
Marcin Maliszkiewicz
c6fca009d1 test: limits: skip empty max cells iterations
The doubling loop in test_max_cells started from cells=1. Since
each row has MAX_CELLS_COLUMNS (32768) cells, iterations where
cells < MAX_CELLS_COLUMNS produced zero rows (cells // columns = 0).
Those iterations only did CREATE TABLE / DROP TABLE with no data
inserted.

Start the loop from MAX_CELLS_COLUMNS and use a while loop.

Co-authored-by: Dario Mirovic <dario.mirovic@scylladb.com>

Refs SCYLLADB-1645

(cherry picked from commit 3debae9a37)
2026-05-15 19:04:11 +00:00
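The arithmetic behind c6fca009d1 is easy to check. In the sketch below `MAX_CELLS_COLUMNS` comes from the commit message, while `LIMIT` is an assumed upper bound chosen so that the old loop runs 24 iterations, matching the count mentioned in the merge description; the function names are hypothetical.

```python
MAX_CELLS_COLUMNS = 32768
LIMIT = 2 ** 23  # assumed bound, not taken from the test


def rows_per_iteration(start, limit=LIMIT):
    """Rows inserted on each iteration of the doubling loop."""
    rows = []
    cells = start
    while cells <= limit:
        rows.append(cells // MAX_CELLS_COLUMNS)
        cells *= 2
    return rows


old = rows_per_iteration(start=1)
new = rows_per_iteration(start=MAX_CELLS_COLUMNS)

if __name__ == "__main__":
    print(len(old), old.count(0))  # iterations, of which empty
    print(len(new), new.count(0))
```

Under these assumptions, 15 of the 24 old iterations insert no rows, and starting at `MAX_CELLS_COLUMNS` leaves 9 iterations that all insert data.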
Jenkins Promoter
51e187b6e2 Update pgo profiles - aarch64
2026-05-15 04:44:30 +03:00
Marcin Maliszkiewicz
38fb4c0f2d Merge '[Backport 2026.2] docs/cql: fix syntax errors in CQL examples' from Yaniv Kaul
Fix 4 genuine CQL syntax errors in documentation examples, found by automated extraction and execution of doc code blocks against a live ScyllaDB instance.

- **insert.rst**: `USING TTL 86400 IF NOT EXISTS` → `IF NOT EXISTS USING TTL 86400` (wrong clause order produces syntax error)
- **ddl.rst**: Missing opening quote in ALTER KEYSPACE example (`dc2'` → `'dc2'`)
- **ddl.rst**: Hyphenated column names need double-quoting; also fix PRIMARY KEY referencing non-existent `customer_id` instead of `cust_id`
- **types.rst**: UDT `address` contains nested collections, so it must be `frozen<address>` when used as a column type

Parent PR: #29765

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-2027

Closes scylladb/scylladb#29899

* github.com:scylladb/scylladb:
  test/cqlpy: add tests for hyphenated column names
  docs/cql: fix UDT example to use frozen<address>
  docs/cql: fix CREATE TABLE example with hyphenated column names
  docs/cql: fix missing opening quote in ALTER KEYSPACE example
  docs/cql: fix INSERT example clause order (IF NOT EXISTS before USING)
2026-05-14 16:59:07 +02:00
Alex
aa20b65d0b test/auth_cluster: simulate v1 state in self-heal test
When skip_service_levels_v2_initialization is used, write an explicit
v1 service level version marker while skipping v2 initialization. This
lets the restart test exercise self-healing from v1 to v2.

(cherry picked from commit 6188bf3e01)
2026-05-14 15:33:39 +03:00
Alex
e5aab3b260 qos: self-heal stale service levels version on startup
Add self_heal_service_levels_version() and use it during startup when
the node is already on raft topology but service levels are still marked
as v1.

In that stale state, migrate service levels to v2 through group0 instead
of failing startup.

(cherry picked from commit c2014f7e50)
2026-05-14 15:33:39 +03:00
Alex
5d9927f4b0 qos: reintroduce service levels v2 migration self-heal
migrate_to_v2() was removed after gossip-based service level migration
support was dropped, since upgraded nodes were expected to already use
service levels v2.

However, clusters affected by the old migration bug may reach raft topology
while system.scylla_local still has a stale service level version. Restore
the migration helper so startup can self-heal those nodes by writing the v2
state through group0.

(cherry picked from commit ac0a19aab8)
2026-05-14 10:47:56 +00:00
Yaniv Michael Kaul
368bd2af1d test/cqlpy: add tests for hyphenated column names
Verify that double-quoted column names with hyphens (e.g. "my-col")
work correctly for CREATE TABLE, INSERT, and SELECT. Also verify that
unquoted hyphenated names are rejected with a syntax error.

(cherry picked from commit 7557c64f20)
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-2027
2026-05-14 09:26:42 +03:00
Yaniv Michael Kaul
209bf75c21 docs/cql: fix UDT example to use frozen<address>
The 'address' UDT contains a nested collection (map<text, frozen<phone>>),
so it must be frozen when used as a column type. Non-frozen UDTs with
nested non-frozen collections are not supported.

(cherry picked from commit d13a56be2e)
2026-05-13 07:02:53 +00:00
Yaniv Michael Kaul
f6dacaa18b docs/cql: fix CREATE TABLE example with hyphenated column names
Column names containing hyphens must be double-quoted. Also fix
the PRIMARY KEY reference from 'customer_id' (non-existent) to
'cust_id' (the actual column).

(cherry picked from commit 5c528e4e02)
2026-05-13 07:02:53 +00:00
Yaniv Michael Kaul
ebc3256e6a docs/cql: fix missing opening quote in ALTER KEYSPACE example
The dc2 key was missing its opening single quote: dc2' should be 'dc2'.

(cherry picked from commit 3e2b0f844c)
2026-05-13 07:02:52 +00:00
Yaniv Michael Kaul
ad2ba2b888 docs/cql: fix INSERT example clause order (IF NOT EXISTS before USING)
The grammar requires IF NOT EXISTS to appear before USING TTL,
not after. The example had 'USING TTL 86400 IF NOT EXISTS' which
produces a syntax error.

(cherry picked from commit 815aad50af)
2026-05-13 07:02:52 +00:00
Nadav Har'El
5de73f5480 test/cluster/auth_cluster: use CREATE ROLE IF NOT EXISTS to fix flaky test
test_create_role_mixed_cluster calls servers_add(2) to bootstrap two old
nodes concurrently, then adds a new node before issuing CREATE ROLE.  The
concurrent bootstraps trigger the well-known Python driver bug
(scylladb/python-driver#317): two on_add notifications race in
update_created_pools, causing a second pool to be created for a host whose
pool was already established.  If CREATE ROLE is in-flight on the old pool
when it is closed, the driver retries on the new pool, executing the
statement twice.  The second execution fails with "Role ... already exists",
making the test flaky.

Fix by using CREATE ROLE IF NOT EXISTS.  This is safe because unique_name()
generates a timestamp+random suffix that is guaranteed to be unique; the
role can "already exist" only due to the driver double-execution bug, never
due to a real conflict.

This is the same workaround that has been applied many times elsewhere in
our test suite for exactly the same root cause:
- CREATE KEYSPACE was changed to CREATE KEYSPACE IF NOT EXISTS (scylladb#18368,
  later generalised in scylladb#22399 via new_test_keyspace helpers)
- DROP KEYSPACE was changed to DROP KEYSPACE IF EXISTS (scylladb#29487)

Fixes: SCYLLADB-1811

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#29732

(cherry picked from commit 983eb5ab43)

Closes scylladb/scylladb#29743
2026-05-13 09:30:52 +03:00
Nadav Har'El
594e8f35b4 test: fix replica_read_timeout_no_exception flakiness on slow systems
The test uses a 10ms read timeout to exercise code paths that handle
timed-out reads without throwing C++ exceptions.  As part of setup, it
inserts rows and flushes them to two SSTables, then runs a warm-up
SELECT to populate internal caches (e.g. the auth cache) before the
real test begins.

The reason for this warm-up read was the possibility that the first
read does additional operations (such as reading and caching
authentication) that might throw exceptions internally. I couldn't
verify that such exceptions actually happen in today's code, but
they might (re)appear in the future, so we should keep the warm-up
SELECT.

On slow CI machines (aarch64, debug build), that warm-up SELECT can
take longer than 10ms to read from the two SSTables.  When it does, the
read times out: the coordinator receives 0 responses from the local
replica within the deadline and propagates a read_timeout_exception.
Since the exception is not caught, it escapes the test lambda, is
logged as "cql env callback failed", and causes Boost.Test to report a
C++ failure at the do_with_cql_env_thread call site.  This matches the
CI failure seen in SCYLLADB-1774:

  ERROR ... replica_read_timeout_no_exception: cql env callback failed,
  error: exceptions::read_timeout_exception (Operation timed out for
  replica_read_timeout_no_exception.tbl - received only 0 responses
  from 1 CL=ONE.)

The CI log also shows that only 12 reads were admitted (the warm-up
read plus the 11 reads from the two prepare() calls and CREATE/INSERT
statements made earlier), and the current permit was stuck in
need_cpu state -- the reactor hadn't had a chance to schedule the read
before the 10ms window elapsed.

The fix catches read_timeout_exception from the warm-up SELECT and
retries until the read succeeds. The warm-up is required for
correctness: some lazy-init code paths (e.g. auth cache population)
use C++ exceptions for control flow internally. Those exceptions must
be absorbed before the cxx_exceptions baseline is sampled inside
execute_test(); otherwise they would appear in the delta and cause a
false test failure. Simply ignoring a timed-out warm-up is not safe,
because the lazy-init exceptions would then fire during the 1000 test
reads, inflating cxx_exceptions_after relative to
cxx_exceptions_before.
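
The retry described above can be sketched as follows. This is a
minimal Python illustration, not the C++ test code: ReadTimeout,
warm_up and make_flaky_select are hypothetical names, and the attempt
cap is an assumption of the sketch (the message only says the warm-up
retries until the read succeeds).

```python
class ReadTimeout(Exception):
    """Hypothetical stand-in for exceptions::read_timeout_exception."""


def warm_up(select, max_attempts=1000):
    """Run `select` until it succeeds, swallowing only read timeouts.

    Returns the number of attempts made. The attempt cap is an
    assumption of this sketch so a broken setup cannot loop forever.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            select()
            return attempt
        except ReadTimeout:
            continue  # slow machine: retry until the read completes
    raise RuntimeError("warm-up read never succeeded")


def make_flaky_select(timeouts_before_success):
    """Build a fake SELECT that times out a fixed number of times."""
    remaining = [timeouts_before_success]

    def select():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ReadTimeout()

    return select
```

Only the timeout is caught; any other exception from the warm-up
would still escape and fail the test.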

No other calls in setup are susceptible to the 10ms read timeout:
- CREATE KEYSPACE, CREATE TABLE, INSERT, and flush use the write
  timeout (10s) and are not reads.
- e.prepare() goes through the query processor without reading table
  data, so it is not subject to the read timeout.
- The semaphore manipulation in Test 2 is internal and has no timeout.
- All 1000 reads in execute_test() are expected to fail, so a timeout
  there is the happy path, not a failure.

The 10ms timeout itself is fine for the test's purpose: it is
deliberately aggressive so that reads reliably time out on the hot path
being tested.  The problem was only that the pre-test warm-up was not
guarded against the same timeout.

Fixes: SCYLLADB-1830

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#29731

(cherry picked from commit 1f15e05946)

Closes scylladb/scylladb#29760
2026-05-13 09:28:04 +03:00
Ferenc Szili
b1fad45a6d test: fix flaky test_tablets_split_merge_with_many_tables
In debug mode, this test can time out during tablets merge. While the
test already decreases the number of tables in debug mode (20 tables,
instead of 200 for dev mode), this is not enough, and the test can still
time out during merge. This change reduces the number of tables from 20
to 5 in debug mode.

It also drops the log level for load_balancer to debug. This should make
any potential future problems with this test easier to investigate.

Fixes: SCYLLADB-1863

Closes scylladb/scylladb#29682

(cherry picked from commit ec4b483e88)

Closes scylladb/scylladb#29786
2026-05-13 09:18:30 +03:00
Botond Dénes
a0a61fe81f Merge '[Backport 2026.2] load_balancer: fix tablet allocator dropped table' from Scylladb[bot]
- Handle dropped tables gracefully in the tablet load balancer's `get_schema_and_rs()` instead of aborting with `on_internal_error`
- The load balancer operates on a token metadata snapshot but accesses the live schema for table lookups. A DROP TABLE applied by another fiber between coroutine yield points can remove a table from the live schema while it still exists in the snapshot, causing an abort.

`get_schema_and_rs()` now returns `std::optional` and logs a warning at debug log level instead of aborting when a table is missing. All callers skip dropped tables:
- `make_sizing_plan`: skips to next table
- `make_resize_plan`: skips to next table (merge suppression is moot)
- `check_constraints`: returns `skip_info{}` with empty viable targets
- `get_rs`: returns `nullptr`, checked by `check_constraints`

The call chain is: `make_plan` → `make_internode_plan` → `check_constraints` → `get_rs` → `get_schema_and_rs`. The `make_internode_plan` coroutine has multiple `co_await` yield points (`maybe_yield`, `pick_candidate`) between building the candidate tablet list and checking replication constraints. A DROP TABLE schema mutation applied during any of these yields removes the table from `_db.get_tables_metadata()` while the candidate list still references it.

Added `test_load_balancing_with_dropped_table` which simulates the race by capturing a token metadata snapshot, dropping the table, then calling `balance_tablets` with the stale snapshot.
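
The skip-on-missing pattern can be sketched roughly as below. This is a simplified Python model, not the C++ code: `live_tables` is a hypothetical dict standing in for the live schema, `None` plays the role of an empty `std::optional`, and only the `make_sizing_plan` caller is modeled.

```python
from typing import Optional


def get_schema_and_rs(live_tables: dict, table_id: str) -> Optional[tuple]:
    """Return (schema, replication strategy), or None if the table was dropped."""
    return live_tables.get(table_id)


def make_sizing_plan(snapshot_table_ids, live_tables):
    """Plan only for tables that still exist in the live schema."""
    planned = []
    for table_id in snapshot_table_ids:
        entry = get_schema_and_rs(live_tables, table_id)
        if entry is None:
            continue  # dropped after the snapshot was taken: skip it
        planned.append(table_id)
    return planned
```

A table that appears in the stale snapshot but not in the live schema is simply left out of the plan instead of triggering an abort.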

Fixes: SCYLLADB-1905

This fix needs to be backported to versions: 2025.4, 2026.1

- (cherry picked from commit 4987204f71)
- (cherry picked from commit 6b3e18c4a9)

Parent PR: #29585

Closes scylladb/scylladb#29818

* github.com:scylladb/scylladb:
  test: verify load balancer handles dropped tables gracefully
  tablet_allocator: handle dropped tables gracefully in get_schema_and_rs
2026-05-13 09:16:30 +03:00
Piotr Szymaniak
4d00019eff test/alternator: stop avoiding tablets in Streams tests
Alternator Streams now supports tablets, so stop skipping the TTL Streams test in tablet mode and stop forcing vnodes in the Streams audit test.

Refs SCYLLADB-463

Closes scylladb/scylladb#29697

(cherry picked from commit 459c1dc32f)

Closes scylladb/scylladb#29819
2026-05-13 09:15:20 +03:00
Botond Dénes
3f57cdf7d7 Merge '[Backport 2026.2] sstables_loader: ensure upload directory is empty when load_and_stream returns' from Scylladb[bot]
After `load_and_stream` (e.g. via `nodetool refresh --load-and-stream`)
returns success, source sstable files in the `upload/` directory may
still be on disk. `mark_for_deletion()` only sets an in-memory flag; the
actual file deletion runs lazily when the last `shared_sstable`
reference drops.

This leaves a window between API success and physical deletion where a
follow-up scan of the upload directory can detect sstables that are about to be deleted.
This can cause failures because an sstable may already be wiped while it is being processed.

Fix:
Force unlink to complete before `stream()` returns, so the upload
directory is in a consistent state by the time the API reports success.
For tablet streaming, partially-contained sstables participate in
multiple per-tablet batches; eagerly unlinking after each batch would
break the next batch that still needs to read the file. A
`defer_unlinking` flag on the streamer postpones the explicit unlink
until after all batches complete (called once at the end of
`tablet_sstable_streamer::stream()`). Vnode streaming unlinks eagerly at the end of
`stream_sstable_mutations`.
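
Why eager unlinking breaks multi-batch tablet streaming can be seen in a small model. This is a hypothetical Python sketch, not the actual streamer: the `Streamer` class, its `on_disk` set and the batch layout are assumptions made for illustration; only the `defer_unlinking` name comes from the description above.

```python
class Streamer:
    """Hypothetical model of a streamer with a defer_unlinking flag."""

    def __init__(self, defer_unlinking):
        self.defer_unlinking = defer_unlinking
        self.pending = []     # sstables streamed but not yet unlinked
        self.on_disk = set()  # files currently present in upload/

    def stream_batch(self, batch):
        for sst in batch:
            if sst not in self.on_disk:
                # a later batch needs a file that is already gone
                raise FileNotFoundError(sst)
            if sst not in self.pending:
                self.pending.append(sst)
        if not self.defer_unlinking:
            self.unlink_pending()  # eager: unlink right after the batch

    def unlink_pending(self):
        for sst in self.pending:
            self.on_disk.discard(sst)
        self.pending.clear()

    def stream(self, batches):
        for batch in batches:
            self.stream_batch(batch)
        self.unlink_pending()  # upload/ is empty by the time we return
```

With `defer_unlinking=True`, an sstable shared by two batches survives until both have been streamed, and the directory is still empty when `stream()` returns.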

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1647

Backport is required, as this fixes a bug that was introduced in 517a4dc4df.

- (cherry picked from commit 7cdf215999)
- (cherry picked from commit 784127c40b)

Parent PR: #29599

Closes scylladb/scylladb#29845

* github.com:scylladb/scylladb:
  sstables_loader: synchronously unlink streamed sstables before returning
  sstables: make sstable::unlink() idempotent
2026-05-13 09:13:33 +03:00
Anna Stuchlik
664fc5bcc6 doc: mark Vector Search in Alternator as Cloud-only
This commit adds the information missing from the Alternator docs
that Vector Search is only available in ScyllaDB Cloud.

Fixes https://github.com/scylladb/scylladb/issues/29661

Closes scylladb/scylladb#29664

(cherry picked from commit 4c01556f79)

Closes scylladb/scylladb#29847
2026-05-13 09:12:26 +03:00
Anna Stuchlik
1b6784df56 doc: label Migration from Vnodes to Tablets as experimental
The procedure to migrate a vnodes-based keyspace to a tablets-based keyspace
has been labeled as experimental.

Fixes SCYLLADB-1932

Closes scylladb/scylladb#29834

(cherry picked from commit 1f7d20f701)

Closes scylladb/scylladb#29849
2026-05-13 09:11:09 +03:00
Botond Dénes
473320df18 Merge '[Backport 2026.2] load_balance: fix drain with forced capacity-based balancing' from Scylladb[bot]
When `force_capacity_based_balancing` is enabled and a node is being drained/excluded, the tablet allocator incorrectly aborts balancing due to incomplete tablet stats - even though capacity-based balancing doesn't depend on tablet sizes.

The tablet allocator normally waits for complete load stats before balancing. An exception exists for drained+excluded nodes (they're unreachable and won't return stats). However, when forced capacity-based balancing is active, this exception was not being applied, causing the balancer to reject the drain plan.

Adjust the condition in `tablet_allocator.cc` so that the "ignore missing data for drained nodes" logic applies regardless of whether capacity-based balancing is forced.

Added a Boost unit test that forces capacity-based balancing and verifies a drained/excluded node gets its tablets migrated even when tablet size stats are missing.

This bug was introduced in 2026.1, so this needs to be backported to 2026.1 and 2026.2

Fixes: SCYLLADB-1953

- (cherry picked from commit 906d2b817e)
- (cherry picked from commit f7bc8f5fa7)

Parent PR: #29791

Closes scylladb/scylladb#29866

* github.com:scylladb/scylladb:
  test: boost: add drain test for forced capacity-based balancing
  service: allow draining with forced capacity-based balancing
2026-05-13 09:05:42 +03:00
Botond Dénes
ceae68b487 schema: fix DESCRIBE showing NullCompactionStrategy when compaction is disabled
When a table's compaction is disabled via 'enabled': 'false', the DESCRIBE
output incorrectly showed NullCompactionStrategy instead of the actual strategy.
This happened because schema_properties() called compaction_strategy(), which
returns compaction_strategy_type::null when compaction is disabled. Fix it by
using configured_compaction_strategy(), which always returns the real strategy
type - consistent with how schema_tables.cc serializes it to disk.
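
The difference between the two accessors can be modeled as below. This
is a simplified Python sketch, not the C++ schema class: the Schema
dataclass, its fields and describe_compaction_class are hypothetical,
and strategy types are represented as plain strings.

```python
from dataclasses import dataclass


@dataclass
class Schema:
    """Hypothetical, simplified model of the two accessors."""
    configured: str       # strategy named in the table's options
    enabled: bool = True  # the 'enabled' compaction option

    def compaction_strategy(self):
        # effective strategy: the null strategy when compaction is disabled
        return self.configured if self.enabled else "NullCompactionStrategy"

    def configured_compaction_strategy(self):
        # always the strategy the user configured
        return self.configured


def describe_compaction_class(schema):
    """Strategy name as the fixed DESCRIBE path would report it."""
    return schema.configured_compaction_strategy()
```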

Fixes SCYLLADB-1353

Closes scylladb/scylladb#29804

(cherry picked from commit 8d6f031a4a)

Closes scylladb/scylladb#29867
2026-05-13 08:59:59 +03:00
Andrzej Jackowski
3df25f1952 test: wait for TTL scheduling sanity metric
The test samples sl:default runtime before and after setup writes to
prove that it measures the scheduling group used by regular CQL writes.
The metric is exported in milliseconds, so a single 200-row batch may
not be visible immediately, or may be too small in some environments.

Keep the original 200-row table size, but wait up to 30 seconds for the
metric to advance. If it does not, retry the same writes before TTL is
enabled. The retries update the same keys, so the expiration part of the
test still waits for exactly the original number of rows.
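
The wait-then-retry logic can be sketched as below. This is a minimal
Python illustration, not the actual test: wait_for_metric_advance, its
parameters and the max_retries cap are assumptions of the sketch; only
the 30-second wait and the retry of the same writes come from the
description above.

```python
import time


def wait_for_metric_advance(read_metric, baseline, write_rows,
                            timeout=30.0, poll=0.1, max_retries=5,
                            clock=time.monotonic, sleep=time.sleep):
    """Wait until read_metric() exceeds baseline, redoing the writes on timeout.

    Returns the number of extra write rounds that were needed.
    """
    for retries in range(max_retries + 1):
        deadline = clock() + timeout
        while clock() < deadline:
            if read_metric() > baseline:
                return retries
            sleep(poll)
        write_rows()  # same keys again, so the row count does not grow
    raise TimeoutError("metric did not advance")
```

The clock and sleep parameters are injectable only so the loop can be
exercised without real waiting.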

In 100 local runs with N=200 rows, the observed delta of
`ms_statement_before - ms_statement_before_write` was: min=4.0,
max=16.0, mean=8.13, and median=8.0. It therefore seems possible that,
in a rare corner case, the delta drops all the way to 0.

Fixes SCYLLADB-1869

Closes scylladb/scylladb#29797

(cherry picked from commit 89261bf759)

Closes scylladb/scylladb#29868
2026-05-13 08:59:23 +03:00