Commit Graph

1344 Commits

Author SHA1 Message Date
Wojciech Mitros
13c043903d strong_consistency: cache leader location for non-replica nodes
When a non-replica node handles a strongly consistent write, it must
forward the request to a replica. If the closest replica is not the
leader, the request gets redirected again, causing an extra roundtrip.

Add a leader location cache in groups_manager, keyed by raft group_id.
After a write request is forwarded, the CQL transport layer records the
final node as the leader in the cache. Subsequent write requests from
the same node for the same group are forwarded directly to the cached
leader, eliminating the extra roundtrip.

The cache is only used for writes. Reads can be served by any replica,
so they skip the cache and use proximity-based routing instead.

Cache entries are validated at use time: if the cached leader is no
longer a replica (e.g. after tablet migration), the entry is evicted
and the normal closest-replica path is taken. This prevents a scenario
where two nodes keep redirecting to each other because both think that
the other is the leader but actually both are non-replicas - such loop
is broken as soon as the tablet maps are updated.

On token_metadata updates, entries for groups that no longer exist
(e.g. table dropped, tablet merged) are evicted. Entries for groups
that still exist are kept — use-time validation handles staleness.

An on_node_resolved callback is propagated through the redirect/bounce
path so the transport layer can update the cache generically without
coupling to the strong-consistency coordinator. The coordinator creates
the callback only for writes (capturing the groups_manager and
group_id) and attaches it to the bounce message; the transport layer
invokes it once the final node is known, keeping the forwarding
infrastructure subsystem-agnostic.

We also add a test which verifies that after the initial redirect,
following requests to the same node avoid the extra redirect and
forward directly to the leader.

Fixes: SCYLLADB-1064

Closes scylladb/scylladb#29392
2026-05-21 10:32:56 +02:00
Gleb Natapov
cc034f84c5 schema: ensure committed_by_group0 is set for all non-system tables on boot
Tables created before the GROUP0_SCHEMA_VERSIONING feature was enabled
have committed_by_group0 = null in system_schema.scylla_tables. This
causes maybe_delete_schema_version() to delete their version cell,
forcing the legacy hash-based schema version computation path.

Add ensure_committed_by_group0() which runs on boot and fixes up any
non-system tables where committed_by_group0 is not true (null or false):

1. Queries system_schema.scylla_tables for rows where committed_by_group0
   is null or false, skipping system keyspaces (system, system_schema).
2. Takes a group0 guard
3. Re-checks after the raft barrier in case another node already fixed it.
4. For each table needing fixup, creates a mutation writing the version
   cell (from the in-memory schema). The committed_by_group0 = true flag
   is stamped by add_committed_by_group0_flag() inside announce().
5. Announces via raft group0.
6. Retries with a small random delay on group0_concurrent_modification.

On other nodes, schema_applier will detect these as "altered" tables
(scylla_tables mutation changed), but since the actual table definition
is unchanged, update_column_family is effectively a no-op.

This is a prerequisite for eventually removing the legacy hash-based
schema versioning code path.

Closes scylladb/scylladb#29911
2026-05-21 10:22:07 +02:00
Patryk Jędrzejczak
cbadc3d675 test: fix flaky test_raft_snapshot_truncation by waiting for async log truncation
Snapshot creation and raft log truncation happen asynchronously in the
IO fiber after a schema change completes. The test was querying
system.raft immediately after the schema change returned, racing with
the IO fiber's store_snapshot_descriptor call.

Replace immediate assertions with wait_for polling loops:
- log_size == 0: wait for log truncation after drop keyspace
- new_snap_id != original_snap_id: wait for new snapshot to be persisted

Fixes: SCYLLADB-2120

Closes scylladb/scylladb#29967
2026-05-21 10:50:00 +03:00
Artsiom Mishuta
2259307c2e test.py: remove redundant pytest.mark.asyncio decorators
Fixes: SCYLLADB-1935
2026-05-21 10:36:47 +03:00
Botond Dénes
f8ac8540bd Merge 'logstor: compare records by timestamp and segment sequence number' from Michael Litvak
Add the record timestamp. The timestamp is extracted from the row marker
of the mutation when we write it.
When inserting a record to index, we compare it with the existing
record, and insert it only if it has newer timestamp.

Add a segment sequence number that is a global (per-shard) increasing
number that is allocated when getting a new segment for write, and is
written in buffer headers in the segment.
It is used to distinguish between buffers written to different generations
of a segment, and for recovery to break ties by keeping the record
from the newest segment.

Refs https://scylladb.atlassian.net/browse/SCYLLADB-770

no backport - logstor is a new feature

Closes scylladb/scylladb#29933

* github.com:scylladb/scylladb:
  test: logstor: add basic delete test
  logstor: rewrite segment seq num from streaming
  logstor: add segment sequence number
  logstor: get_segment helper
  logstor: compare records by timestamp
2026-05-21 08:44:18 +03:00
Michał Jadwiszczak
eac9449967 test/test_mv_building: ensure nodes see each other after restart
In SCYLLADB-2058 we observed a timeout exception while querying the base
table after restarting nodes 2 and 3.
Unfortunately, logs don't give us much useful information about the
root cause.

This patch adds basic checks that nodes see each other after the restart
and that the cql connection sees restarted node.
It doesn't guarantee that the error won't occur again - in logs from
SCYLLADB-2058 we see that each node sees other via gossip after part of
the cluster is restarted.

In case the error will occur again, this commit also increases logging
level of `cql_server` and `storage_proxy`.

Refs SCYLLADB-2058

Closes scylladb/scylladb#29951
2026-05-20 14:11:41 +02:00
Marcin Maliszkiewicz
83823149e9 Merge 'audit: implement audit_rules config' from Andrzej Jackowski
This patch series adds `audit_rules`, a new audit configuration option for fine-grained, role-aware audit filtering with per-rule sink routing. Rules can be configured in `scylla.yaml` or updated live through `system.config` without restarting the node. Each rule specifies target sinks (`table`, `syslog`), statement categories, qualified table name patterns, and role patterns. Table and role patterns use POSIX `fnmatch` with extended glob syntax. For table-scoped categories (`DML`, `DDL`, `QUERY`), a rule matches only when the category, role, and qualified table name all match. For table-independent categories (`AUTH`, `ADMIN`, `DCL`), the table filter is ignored. Empty category or role lists match nothing; an empty table list matches nothing only for table-scoped categories. The new rules are additive with the existing `audit_categories`, `audit_keyspaces`, and `audit_tables` settings: both mechanisms are evaluated for each audit event, and the final sink set is the union of all matches.

To avoid evaluating glob patterns on every audit event, audit rules use a preprocessed cache of known roles and tables. The cache is kept in sync through group0 role/table snapshots, role-change notifications, and schema migration notifications. For known entities, rule matching uses precomputed role/table rule sets; unknown entities fall back to direct rule evaluation. When `audit_rules` is empty, per-event rule matching returns immediately and does not evaluate glob patterns. Audit still keeps known role/table metadata in sync while audit is enabled, so rules can be enabled later through live configuration updates without restarting the node.

**Performance**
Measured with `perf-simple-query --smp 1 --duration 100` against a null syslog socket. Results show no regression when audit is disabled, and audit-rules performance has at most 1% more instructions than legacy config for equivalent workloads:

```
===============================================================================================================================================================================
Configuration                                     | Binary     |         throughput (tps) | insns/op                 | cpu_cycles/op            | alloc/op | logal/op | task/op
===============================================================================================================================================================================
audit=none [1]                                    | baseline   |                 206922.4 |                  36591.6 |                  15348.3 |     58.1 |      0.0 |    14.1
audit=none [1]                                    | this PR    |        207856.4  (+0.5%) |         36544.9  (-0.1%) |         15274.0  (-0.5%) |     58.1 |      0.0 |    14.1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
audit=syslog keyspaces=ks [2]                     | baseline   |                  94871.8 |                  54163.0 |                  27172.4 |     72.0 |      0.0 |    24.0
audit=syslog keyspaces=ks [2]                     | this PR    |         96138.4  (+1.3%) |         54072.3  (-0.2%) |         26699.3  (-1.7%) |     72.0 |      0.0 |    24.0
audit=syslog audit-rules=ks [3]                   | this PR    |         95142.1  (+0.3%) |         54457.8  (+0.5%) |         26953.8  (-0.8%) |     72.0 |      0.0 |    24.0
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
audit=syslog keyspaces=ks-non-existent [4]        | baseline   |                 213997.8 |                  36735.6 |                  14848.1 |     58.1 |      0.0 |    14.1
audit=syslog keyspaces=ks-non-existent [4]        | this PR    |        219297.2  (+2.5%) |         36667.3  (-0.2%) |         14500.1  (-2.3%) |     58.1 |      0.0 |    14.1
audit=syslog audit-rules=ks-non-existent [5]      | this PR    |        211038.7  (-1.4%) |         36999.7  (+0.7%) |         15048.6  (+1.4%) |     58.1 |      0.0 |    14.1
===============================================================================================================================================================================

[1] ./scylla perf-simple-query --smp 1 --duration 100 --audit "none"
[2] ./scylla perf-simple-query --smp 1 --duration 100 --audit "syslog" --audit-keyspaces "ks" --audit-categories "DCL,DDL,AUTH,DML,QUERY" --audit-unix-socket-path "/tmp/audit-null.sock"
[3] ./scylla perf-simple-query --smp 1 --duration 100 --audit "syslog" --audit-rules '[{"sinks":["syslog"],"categories":["DCL","DDL","AUTH","DML","QUERY"],"qualified_table_names":["ks.*"],"roles":["*"]}]' --audit-unix-socket-path "/tmp/audit-null.sock"
[4] ./scylla perf-simple-query --smp 1 --duration 100 --audit "syslog" --audit-keyspaces "ks-non-existent" --audit-categories "DCL,DDL,AUTH,DML,QUERY" --audit-unix-socket-path "/tmp/audit-null.sock"
[5] ./scylla perf-simple-query --smp 1 --duration 100 --audit "syslog" --audit-rules '[{"sinks":["syslog"],"categories":["DCL","DDL","AUTH","DML","QUERY"],"qualified_table_names":["ks-non-existent.*"],"roles":["*"]}]' --audit-unix-socket-path "/tmp/audit-null.sock"

audit-null.sock was created with `socat -u UNIX-RECV:/tmp/audit-null.sock,type=2 OPEN:/dev/null`
```

Fixes: SCYLLADB-1430
No backport: new feature

Closes scylladb/scylladb#29267

* github.com:scylladb/scylladb:
  test: alternator: audit: rules filtering and batch bypass
  test: perf: add --audit-rules option to perf-simple-query
  docs: add audit rules section to the auditing guide
  test: audit: cover role and schema cache notifications
  test: audit: cover audit rules cluster behavior
  audit: rebuild rule caches on group0 snapshot and role changes
  audit: refresh rule caches on schema, role, and config changes
  audit: route matching rules to configured sinks
  test: cover preprocessed audit rule cache
  audit: add preprocessed rule matching cache
  audit: pass sink targets to storage helpers
  test: audit: cover rule matching semantics
  audit: add rule matching and sink helpers
  test: audit: cover audit_rules configuration
  config: add live audit_rules option
  test: cover audit rule parsing and validation
  audit: define audit_rule type with parsing and validation
2026-05-20 14:10:45 +02:00
Gleb Natapov
c2cc7ebf39 test: fix test_cas_semaphore flakiness due to paxos state table creation timeout
The test was starting Scylla with --write-request-timeout-in-ms=500 on the
command line. This tight timeout also applied to paxos state table creation,
which goes through raft and can take longer than 500ms on slow platforms
(e.g. aarch64/dev). When the first batch of CAS requests triggered paxos
state table creation under error injection, the raft schema change could
still be in-flight when the second batch fired, causing spurious WriteTimeout
failures unrelated to the semaphore bug being tested.

Fix by changing the write timeout at runtime via the REST API: lower it to
500ms only for the error-injection CAS phase (after table creation is done),
then restore it to 10000ms before the second batch that must succeed.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-2104

Closes scylladb/scylladb#29969
2026-05-20 13:06:17 +02:00
Andrzej Jackowski
f03398fdba test: audit: cover role and schema cache notifications
Verify on a multi-node cluster that role creation/alter/
drop and table/materialized-view create/drop trigger
updates to the preprocessed audit-rules cache on every
node, and that a matching DML on the newly created table
is audited via the cache.

Refs SCYLLADB-1430
2026-05-20 06:55:15 +02:00
Andrzej Jackowski
7f61d7662d test: audit: cover audit rules cluster behavior
Cluster-level tests should validate rule matching, live
updates, sink routing, role filtering, and error handling
without rerunning the broader audit suite.

Add audit_rules to LIVE_AUDIT_KEYS so the test framework
tracks it as a live-updatable config key. Test that rules
with empty categories or roles match nothing, that DML
rules coexist with legacy audit config, AUTH rules fire
on login events, CQL and REST API update paths reject
invalid JSON, per-rule sink routing works for table and
syslog, role-based filtering works across sessions, and
sink mismatch produces a warning in server logs.

Refs SCYLLADB-1430
2026-05-20 06:55:15 +02:00
Tomasz Grabiec
f0dea67a87 Merge 'transport: add per-service-level transport request latency histogram' from Piotr Smaron and Marcin Maliszkiewicz
Add a per-scheduling-group latency histogram on the transport level that measures the full CQL request lifetime: from fetching the request buffer until the response is written to the socket.

Today latencies are accounted only on the storage proxy level, leaving the time spent in the transport layer (response queue wait + actual I/O) unaccounted. Having both transport and storage proxy latencies allows operators to tell where latency accumulates.

The metric is exposed as scylla_transport_cql_request_latency_histogram with the scheduling_group_name label, following the cql_ prefix convention of all other per-SG transport metrics.

Fixes: SCYLLADB-1691

New feature, no backport.

Closes scylladb/scylladb#29878

* github.com:scylladb/scylladb:
  test/cluster: add test for per-service-level transport request latency histogram
  transport: add per-service-level transport request latency histogram
2026-05-20 01:12:14 +02:00
Michael Litvak
eecbead541 test: wait for others_not_see_server before exclude
Between stopping a server and excluding it, wait for other nodes to see
the server as down, otherwise exclude may see the server as alive and
fail.

Fixes SCYLLADB-2110

Closes scylladb/scylladb#29966
2026-05-19 19:36:54 +02:00
Piotr Smaron
810ed6eedc test/cluster: add test for per-service-level transport request latency histogram
Verify that the new scylla_transport_cql_request_latency_histogram metric
correctly records transport-level request latencies per service level.

Uses error injection to pause a request mid-flight and verifies that the
histogram is not updated while the request is paused (since the response
has not been written yet), and is updated after the request completes.

Co-authored-by: Marcin Maliszkiewicz <marcinmal@scylladb.com>
2026-05-19 16:07:33 +02:00
Benny Halevy
97e03762c5 test/cluster/test_keyspace_rf: extend test_create_keyspace_with_default_replication_factor for tablets rack lists
Add more racks to dc2 to verify that the default replication factor
covers all available racks (rather than e.g. limited to 3).

With tablets and rf_rack_valid_keyspaces, verify also the automatically
selected rack list.

Restrict the extension to non-debug build modes to prevent running out
of memory with --repeat=100.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#29931
2026-05-19 10:44:24 +03:00
Andrei Chekun
6414c48fc2 test.py: rewrite resource gather
Python tests requires different handling of metrics gathering from
cgroup than C++ tests. pytest do not execute each python tests in
a separate process, so we can't put it there and get the metrics.
The idea is to put the whole pytest process to the cgroup and get the
metrics. This will work because pytest runs the threads as as completely
separate processes and inside the thread it will run tests consequently.
Additionally, to simplify system resource monitor moved to pytest main
thread.
2026-05-18 12:23:40 +02:00
Evgeniy Naydanov
39a10d6d67 test: remove dead suite subclasses and legacy execution pipeline
After all test suites migrated to test_config.yaml with type: Python,
the specialized suite classes (Topology, CQLApproval, Run, Tool) and
the legacy execution pipeline (find_tests, run_test, TestSuite.run,
Test.run) became unreachable. Remove all this dead code.

Deleted files:
- suite/topology.py, suite/cql_approval.py, suite/run.py, suite/tool.py

Simplified:
- base.py: remove run_test(), read_log(), TestSuite.run(),
  add_test_list(), build_test_list(), all_tests(), test_count(),
  SUITE_CONFIG_FILENAME, disabled/flaky test tracking, and dead
  Test attributes (args, core_args, valid_exit_codes, allure_dir,
  is_flaky, is_cancelled, etc.)
- python.py: remove PythonTestSuite.run(), PythonTest.run(),
  _prepare_pytest_params(), pattern, test_file_ext, xmlout,
  server_log, scylla_env setup, and shlex import.
  Simplify run_ctx() to take no parameters.
- runner.py: remove --scylla-log-filename option,
  print_scylla_log_filename fixture, SUITE_CONFIG_FILENAME import,
  and suite.yaml probe in TestSuiteConfig.from_pytest_node().
- __init__.py: remove re-exports of deleted classes.
- test_config.yaml: Topology -> Python, Approval -> Python.
- conftest files: run_ctx(options=...) -> run_ctx().
- docs/dev/testing.md: update to reflect current pytest-based
  architecture, log paths, and removed features.

Co-Authored-By: Claude Opus 4.6 (200K context) <noreply@anthropic.com>

Closes scylladb/scylladb#29613
2026-05-17 22:16:31 +03:00
Michael Litvak
c3ab341234 test: logstor: add basic delete test
extend the test with basic DELETE queries and verify the results are as
expected.
2026-05-17 17:22:43 +02:00
Michael Litvak
c18d616f64 logstor: compare records by timestamp
Add the record timestamp. The timestamp is extracted from the row marker
of the mutation when we write it.

When inserting a record to index, we compare it with the existing
record, and insert it only if it has newer timestamp.
2026-05-17 17:22:26 +02:00
Andrzej Jackowski
61e5ec9888 test: storage: retry fusermount3 unmount on teardown
After stopping scylla server processes, the FUSE daemon
(fuse2fs) may still be processing file handle closures.
An immediate fusermount3 -u can fail with 'device busy',
causing spurious test failures on teardown.

Retry the unmount up to 10 times with 0.5s delay between
attempts, and capture stderr for diagnostics.

Fixes: SCYLLADB-2049

Closes scylladb/scylladb#29920
2026-05-16 19:36:48 +03:00
Piotr Dulikowski
460cb1656e Merge 'test: limits: optimize test_max_cells to avoid large allocations and fragmentation' from Dario Mirovic
The `test_max_cells` test was flaky due to `std::bad_alloc` caused by Seastar buddy allocator fragmentation. The root causes are:
1. The doubling loop with 24 iterations of CREATE/INSERT/DROP fragmented the allocator
2. The test built the whole batch as a single string that takes contiguous memory

Also, some iterations inserted zero rows, but still did CREATE/DROP table which also contributed to the fragmentation.

This patch series:
- Skips iterations that insert zero rows
- Creates the table once, truncates it after each test iteration
- Switches to prepared statements

Investigation results are presented in detail in https://scylladb.atlassian.net/browse/SCYLLADB-1645

Fixes SCYLLADB-1645

CI stability improvement. Backport to versions that have this test.

Closes scylladb/scylladb#29759

* github.com:scylladb/scylladb:
  test: prepare max cells inserts
  test: reuse max cells schema
  test: limits: skip empty max cells iterations
2026-05-15 18:12:48 +02:00
Aleksandra Martyniuk
d874d355c2 service: skip load_sketch unload for excluded nodes on RF shrink
When an RF change shrinks replicas on a DC and the node being shrunk is
excluded, refresh_tablet_load_stats() only provides load_stats for that
node if it has a cached snapshot from when the node was still up. If the
snapshot is missing or predates the tables being shrunk (e.g. they were
created after the node went down), stats stay incomplete. In that case
load_sketch::unload() called from make_rf_change_plan() throws:

    Can't provide accurate load computation with incomplete load_stats
    for host: <uuid>

Since an excluded node is not expected to come back, load_stats will
never become complete, and the topology coordinator retries the plan
infinitely, hanging ALTER KEYSPACE.

Add a check for excluded nodes and skip unload() for them: we are
removing the replica, so accurate load data for that node is not
needed. For all other node states the throw-and-retry behavior is
preserved.

Modify test_excludenode_shrink_rf to always trigger the bug: a new
error injection 'force_down_node_load_stats_invalid' forces the
invalid-stats path in refresh_tablet_load_stats() for a down node, so
the test does not depend on whether the load-stats refresher happened
to cache the excluded node's stats while it was still up.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1702.

Closes scylladb/scylladb#29622
2026-05-15 17:46:28 +02:00
Marcin Maliszkiewicz
0574055b73 test: prepare max cells inserts
Switch from raw CQL batch string to using a prepared statement.
The old approach constructed the entire 50-row batch as a single
CQL text string (~19.8 MiB with 32768 column names spelled out
per row). This caused large contiguous allocations in the server.

Fixes SCYLLADB-1645
2026-05-14 17:25:39 +02:00
Marcin Maliszkiewicz
0fd6f6f292 test: reuse max cells schema
Extract table creation into _create_max_cell_count_table(). Call
it once before the loop instead of creating and dropping the table
on every iteration. Use TRUNCATE instead of DROP TABLE between
iterations to clear data while keeping the schema.

This avoids repeated schema operations that fragment the Seastar
buddy allocator's address space with scattered small allocations.

Refs SCYLLADB-1645
2026-05-14 17:24:53 +02:00
Marcin Maliszkiewicz
3debae9a37 test: limits: skip empty max cells iterations
The doubling loop in test_max_cells started from cells=1. Since
each row has MAX_CELLS_COLUMNS (32768) cells, iterations where
cells < MAX_CELLS_COLUMNS produced zero rows (cells // columns = 0).
Those iterations only did CREATE TABLE / DROP TABLE with no data
inserted.

Start the loop from MAX_CELLS_COLUMNS and use a while loop.

Co-authored-by: Dario Mirovic <dario.mirovic@scylladb.com>

Refs SCYLLADB-1645
2026-05-14 17:00:15 +02:00
Piotr Dulikowski
5b269be37b Merge 'test/cluster/test_view_building_coordinator: migrate test from dtest' from Michał Jadwiszczak
Move `materialized_views_test.py::TestMaterializedViews::test_do_not_finish_view_building_with_hints`
test from dtest to test.py.

The dtest was throttling down IO throughput in the hope that the view
building won't be finished too soon. This introduces some unreliability,
which can be solved by using error injection and pausing view building
until we stop necessary nodes.

This patch adds 2 tests: one for tablet-based view and one for vnode-based. Both of the tests use error injection to pause view building.

Fixes [SCYLLADB-1261](https://scylladb.atlassian.net/browse/SCYLLADB-1261)

The issue was seen in 2026.2, so we should backport this patch to this version.

[SCYLLADB-1261]: https://scylladb.atlassian.net/browse/SCYLLADB-1261?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#29788

* github.com:scylladb/scylladb:
  test/cluster/mv/test_mv_building: add similar test for vnode-based view
  test/cluster/test_view_building_coordinator: migrate test from dtest
  db/view/view_building_worker: add more logs when flushing base table
2026-05-14 15:34:26 +02:00
Michał Jadwiszczak
5c84cff78a test/cluster/mv/test_mv_building: add similar test for vnode-based view
In the dtest repo, the test run for both vnode and tablet based views.

Since in test.py infra we're using error injection to pause the view
building process, we need separate tests for those two cases.
2026-05-14 10:52:44 +02:00
Piotr Dulikowski
0c016cecc3 Merge 'QOS: self-heal stale V1-to-V2 migration state on upgrade' from Alex Dathskovsky
service_levels: self-heal stale v1 marker after raft topology upgrade

This PR handles an upgrade corner case where a node may already be using
raft topology, while `system.scylla_local` still marks service levels as v1.

The problem was introduced by commit 2917ec5d51
("service:qos: service levels migration"), which added the service-levels
migration from `system_distributed.service_levels` to
`system.service_levels_v2` as part of the raft topology upgrade.

However, if the cluster had no service levels configured, there was no data
to migrate. In that case, the migration path could leave the local version
marker unchanged, so the node would later observe an inconsistent state:

  * raft topology is already enabled;
  * service levels are still marked as v1 in `system.scylla_local`.

Such clusters can be left in a stale state and fail startup during upgrade to
2026.2

This PR makes the upgrade path self-healing.

The first commit restores `service_level_controller::migrate_to_v2()`, giving
us a group0-based path for writing the service-levels v2 state even after raft
topology is already in use.

The second commit wires this path into startup. When the node detects the
stale raft-topology + service-levels-v1 state, it retries the migration a
bounded number of times and updates the version marker to v2 instead of
failing startup.

With this change, clusters that were left in this stale state can recover
automatically during upgrade to 2026.
Fixes: SCYLLADB-1807

backport: 2026.2 2026.1 we need this functionality when we are upgrading older servers

Closes scylladb/scylladb#29749

* github.com:scylladb/scylladb:
  test/auth_cluster: simulate v1 state in self-heal test When skip_service_levels_v2_initialization is used, write an explicit v1 service level version marker while skipping v2 initialization. This lets the restart test exercise self-healing from v1 to v2.
  qos: self-heal stale service levels version on startup
  qos: reintroduce service levels v2 migration self-heal
2026-05-14 10:32:43 +02:00
Michał Jadwiszczak
b887f8cb2b test/cluster/test_view_building_coordinator: migrate test from dtest
Move `materialized_views_test.py::TestMaterializedViews::test_do_not_finish_view_building_with_hints`
test from dtest to test.py.

The dtest was throttling down IO throughput in the hope that the view
building won't be finished too soon. This introduces some unreliability,
which can be solved by using error injection and pausing view building
until we stop necessary nodes.

Fixes SCYLLADB-1261
2026-05-14 10:23:42 +02:00
Avi Kivity
f2ab911a46 Merge 'test/cluster: fix server-starting functions to wait for all ports' from Nadav Har'El
This series fixes a recurring source of flaky tests in the cluster test suite.

When a test configures Scylla to listen on non-default ports (e.g. a custom Alternator port, proxy-protocol port or shard-aware port), server_add() and server_start() would declare the server ready by polling the hardcoded standard CQL and Alternator ports. Those ports can become available slightly before the custom ports finish binding, so the test could start using the custom port before it was open — causing intermittent failures.

The fix for each affected test was to pass `expected_server_up_state=ServerUpState.SERVING` explicitly, which waits for Scylla's sd_notify("STATUS=serving") signal instead. That signal is sent only after all configured listeners are fully open, so it is always the right readiness signal regardless of the port configuration. This workaround was applied again in PR #29737 and will keep being needed for every new test that uses a non-default port.

This series makes ServerUpState.SERVING the default at every level of the server start/add call stack so no test needs to remember it:

* Make server_add(), servers_add(), server_start() et al. all  default to ServerUpState.SERVING.
* Document that server_add/server_start wait for all ports to be  ready,  so future test authors understand what the functions guarantee.
* Remove now-redundant expected_server_up_state=SERVING from exiting tests.
* A small optimization: Fix check_serving_notification() returning False on first completion. When the sd_notify future completed, the function correctly updated _received_serving but still returned False, wasting one 100ms polling interval. Return self._received_serving directly.

Closes scylladb/scylladb#29758

* github.com:scylladb/scylladb:
  test/pylib: fix missing protocol_version=4 on control_cluster
  scylla_cluster: guard poll_status() set_result() calls against cancelled future
  test/cluster: avoid repeated CQL checks and leaks while waiting for SERVING
  test/cluster: fix check_serving_notification() inefficiency
  test/cluster: remove now-redundant expected_server_up_state=SERVING
  test/cluster: document that add/start waits for all ports to be ready
  test/cluster: update remaining CQL_ALTERNATOR_QUERIED defaults to SERVING
  test/cluster: fix server_add/server_start hanging when starting in maintenance mode
  main: notify "entering maintenance mode" after the maintenance CQL server is ready
  test/cluster: make server_start() default to ServerUpState.SERVING
  test/cluster: make server_add() default to ServerUpState.SERVING
2026-05-13 21:23:18 +03:00
Alex
6188bf3e01 test/auth_cluster: simulate v1 state in self-heal test
When skip_service_levels_v2_initialization is used, write an explicit
v1 service level version marker while skipping v2 initialization. This
lets the restart test exercise self-healing from v1 to v2.
2026-05-13 17:55:20 +03:00
Piotr Dulikowski
dc05bd35bb Merge 'strong_consistency: limit available consistency levels in strong consistent requests' from Michał Jadwiszczak
Strong consistent requests take different patch then EC requests and consistency levels don’t map well.
We should limit available consistency levels in SC request to avoid ignoring them silently, which may cause confusion to user.

For writes, there is only one option:
- QUORUM/LOCAL_QUORUM (multi DC is not supported yet, so both of those CLs have the same effect) - we need quorum of replicas to successfully commit new mutations to Raft log.

For reads, there are 2 options:
- QUORUM/LOCAL_QUORUM - if user wants to be sure he sees latest data and the query needs to execute `read_barrier()`, which requires quorum of replicas
- ONE/LOCAL_ONE - if user just wants to read data from one replica without synchronization

All tests were updated to use LOCAL_QUORUM for both read and writes.

Fixes SCYLLADB-1766

SC is in experimental phase and this patch is an improvement, no backport needed.

Closes scylladb/scylladb#29691

* github.com:scylladb/scylladb:
  strong_consistency: allow QUORUM/LOCAL_QUORUM and ONE/LOCAL_ONE for reads
  strong_consistency: allow only QUORUM/LOCAL_QUORUM CL for writes
2026-05-13 16:31:05 +02:00
Piotr Smaron
0fcae72530 test: bootstrap tombstone gc repair cluster sequentially
Avoid concurrent topology changes in the tombstone GC repair setup, where debug-mode nodes running hinted handoff and materialized view startup work can time out while applying Raft entries before the test starts.

Keep the sequential path opt-in so unrelated repair tests still exercise concurrent bootstrap behavior.

Closes scylladb/scylladb#29829
2026-05-13 13:58:44 +03:00
Patryk Jędrzejczak
3f2ff5a13f Merge 'Remove raft_group0::finish_setup_after_join' from Gleb Natapov
The function does nothing useful now.

No backport needed. Removes code.

Closes scylladb/scylladb#29828

* https://github.com/scylladb/scylladb:
  raft_group0: remove finish_setup_after_join function
  raft_group0: fix indentation after the last change
  raft_group: drop unneeded checks
2026-05-13 10:53:37 +02:00
Yaniv Michael Kaul
5d6f160129 test: update get_scylla_2025_1_executable() to use 2025.1.12
Update the hardcoded 2025.1.0 binary URL to the latest 2025.1.12
release for upgrade tests.

The 2025.1.12 binary now supports and enforces the
rf_rack_valid_keyspaces option which the test harness enables by
default. Since test_sstable_compression_dictionaries_upgrade creates
a 2-node cluster in a single rack with RF=2, it violates the
constraint. Disable the option explicitly for this test.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#29714
2026-05-12 23:20:55 +02:00
Michał Jadwiszczak
68f0cf6fac strong_consistency: allow only QUORUM/LOCAL_QUORUM CL for writes
To successfully write data to strong consistent table, a quorum of
replicas need to be used to save the data to Raft log.

So the only reasonable consistency level is QUORUM/LOCAL_QUORUM
(currently SC doesn't support multi DC).
2026-05-12 23:20:03 +02:00
Wojciech Mitros
f3cf20803b test: run test_mv_admission_control_exception on one shard
In the test we perform 2 consecutive writes where the first write
is supposed to increase the view update backlog above the mv
admission control threshold and the second one is expected to be
rejected because of that.

On each node/shard we have 2 types of view update backlogs:
 1. for deciding whether we should admit writes
 2. for propagating the backlog information to other nodes/shards.

For the second write to be rejected, it must be performed on a node
and shard which updated its backlog of type 1.

The view update backlog of type 2. is immediately increased on the
base table replica. For this backlog to be registered as a backlog
of type 1., it needs to be either carried by gossip (happening once
every second) or by attaching it to a replica write response. We
don't want to increase the runtime of tests unnecessarily, so we don't
wait and we rely on the second mechanism. The response to the first
base table write (the one causing increase in the backlog) carries
the increased backlog to the coordinator of this write. So for the
second write to observe the increased backlog, it needs to be coordinated
on the same node+shard as the first write.

We make sure that both writes are coordinated on the same node+shard by
using prepared statements combined with setting the host in `run_async`.
Both writes target the same partition and with prepared statements we
route them directly to the correct shard.

That was the idea, at least. In practice, for the driver to learn the
correct shard, it first needs to learn the token->shard mapping from
the server. For vnodes it can expect a shard by calculating the token
of the affected partition, but for tablets, it had no opportunity to
learn the tablet->shard mapping so the first write may route to any shard.
Additionally, we aren't guaranteed that the driver established connections
to all shards on all nodes at the point of any write. So if a connection
finishes establishing between the two writes, this may also cause us to
coordinate these 2 writes on different shards, leading to a missed view
backlog growth and not-rejected second write.

We fix this in this patch by running the test using one shard on each node.
This way, as long as we perform both writes on the same node, they'll also
be coordinated on the same shard. This also makes the prepared statement and
BoundStatement unnecessary — we can use SimpleStatement with
FallthroughRetryPolicy directly.

Fixes: SCYLLADB-1901

Closes scylladb/scylladb#29862
2026-05-12 17:34:19 +02:00
Piotr Dulikowski
129f193116 Merge 'strong_consistency: implement basic coordinator metrics' from Michał Jadwiszczak
Add per-shard metrics for strong consistency coordinator operations (latency, timeouts, bounces, status unknown) under the `"strong_consistency_coordinator"` category. These are analogous to the eventual consistency metrics in `storage_proxy_stats`, enabling direct performance comparison between the two consistency modes.

The metrics are simplified compared to `storage_proxy_stats` — no breakdown by table, tablet, scheduling group, or DC, only per-shard.

Fixes SCYLLADB-1343

Strong consistency is still in experimental phase, no need to backport.

Closes scylladb/scylladb#29318

* github.com:scylladb/scylladb:
  test/strong_consistency: verify metrics
  strong_consistency: wire up metrics to operations
  strong_consistency: add stats struct and metrics registration
2026-05-12 16:15:51 +02:00
Botond Dénes
e95eb21a16 Merge 'Tablet-aware restore' from Pavel Emelyanov
The mechanics of the restore is like this

- A /storage_service/tablets/restore API is called with (keyspace, table, endpoint, bucket, manifests) parameters
  - First, it populates the system_distributed.snapshot_sstables table with the data read from the manifests
  - Then it emplaces a bunch of tablet transitions (of a new "restore" kind), one for each tablet
- The topology coordinator handles the "restore" transition by calling a new RESTORE_TABLET RPC against all the current tablet replicas
- Each replica handles the RPC verb by
  - Reading the snapshot_sstables table
  - Filtering the read sstable infos against current node and tablet being handled
  - Downloading and attaching the filtered sstables

This PR includes system_distributed.snapshot_sstables table from @robertbindar and preparation work from @kreuzerkrieg that extracts raw sstables downloading and attaching from existing generic sstables loading code.

This is first step towards SCYLLADB-197 and lacks many things. In particular
- the API only works for single-DC cluster
- the caller needs to "lock" tablet boundaries with min/max tablet count
- not abortable
- no progress tracking
- sub-optimal (re-kicking API on restore will re-download everything again)
- not re-attacheable (if API node dies, restoration proceeds, but the caller cannot "wait" for it to complete via other node)
- nodes download sstables in maintenance/streaming sched gorup (should be moved to maintenance/backup)

Other follow-up items:
- have an actual swagger object specification for `backup_location`

Closes #28436
Closes #28657
Closes #28773

Closes scylladb/scylladb#28763

* github.com:scylladb/scylladb:
  docs: Update topology_over_raft.md with `restore` transition kind
  test: Add test for backup vs migration race
  test: Restore resilience test
  sstables_loader: Fail tablet-restore task if not all sstables were downloaded
  sstables_loader: mark sstables as downloaded after attaching
  sstables_loader: return shared_sstable from attach_sstable
  db: add update_sstable_download_status method
  db: add downloaded column to snapshot_sstables
  db: extract snapshot_sstables TTL into class constant
  test: Add a test for tablet-aware restore
  tablets: Implement tablet-aware cluster-wide restore
  messaging: Add RESTORE_TABLET RPC verb
  sstables_loader: Add method to download and attach sstables for a tablet
  tablets: Add restore_config to tablet_transition_info
  sstables_loader: Add restore_tablets task skeleton
  test: Add rest_client helper to kick newly introduced API endpoint
  api: Add /storage_service/tablets/restore endpoint skeleton
  sstables_loader: Add keyspace and table arguments to manfiest loading helper
  sstables_loader_helpers: just reformat the code
  sstables_loader_helpers: generalize argument and variable names
  sstables_loader_helpers: generalize get_sstables_for_tablet
  sstables_loader_helpers: add token getters for tablet filtering
  sstables_loader_helpers: remove underscores from struct members
  sstables_loader: move download_sstable and get_sstables_for_tablet
  sstables_loader: extract single-tablet SST filtering
  sstables_loader: make download_sstable static
  sstables_loader: fix formating of the new `download_sstable` function
  sstables_loader: extract single SST download into a function
  sstables_loader: add shard_id to minimal_sst_info
  sstables_loader: add function for parsing backup manifests
  split utility functions for creating test data from database_test
  export make_storage_options_config from lib/test_services
  rjson: Add helpers for conversions to dht::token and sstable_id
  Add system_distributed_keyspace.snapshot_sstables
  add get_system_distributed_keyspace to cql_test_env
  code: Add system_distributed_keyspace dependency to sstables_loader
  storage_service: Export export handle_raft_rpc() helper
  storage_service: Export do_tablet_operation()
  storage_service: Split transit_tablet() into two
  tablets: Add braces around tablet_transition_kind::repair switch
2026-05-12 16:24:13 +03:00
Andrzej Jackowski
89261bf759 test: wait for TTL scheduling sanity metric
The test samples sl:default runtime before and after setup writes to
prove that it measures the scheduling group used by regular CQL writes.
The metric is exported in milliseconds, so a single 200-row batch may
not be visible immediately, or may be too small in some environments.

Keep the original 200-row table size, but wait up to 30 seconds for the
metric to advance. If it does not, retry the same writes before TTL is
enabled. The retries update the same keys, so the expiration part of the
test still waits for exactly the original number of rows.

In a local 100-run with N=200 rows, the observed delta of
`ms_statement_before - ms_statement_before_write` was: min=4.0,
max=16.0, mean=8.13, and median=8.0. Therefore, it looks possible that
in a rare corner case the delta drops even to 0.

Fixes SCYLLADB-1869

Closes scylladb/scylladb#29797
2026-05-12 12:38:25 +03:00
Piotr Dulikowski
7c2b1ea0b5 Merge 'view_building: fix tombstone_warn_threshold warnings' from Michał Jadwiszczak
`system.view_building_tasks` is a single-partition Raft group0 table (pk = `"view_building"`, CK = timeuuid). When `clean_finished_tasks()` deletes hundreds of finished tasks, the physical rows remain in SSTables until compaction. Any subsequent read of the partition counts every column of every tombstoned row
  as a dead cell, triggering `tombstone_warn_threshold` warnings in large clusters.

Two-part fix:

**1. Range tombstones instead of row tombstones (commits 2–3)**

Instead of one row tombstone per finished task, find the minimum alive task UUID (`min_alive_uuid`) and emit a single range tombstone `[before_all, min_alive_uuid)` covering all tasks below that boundary. This reduces the tombstone count significantly and also benefits future compaction.

**2. Bounded scan with `min_task_id` (commits 4–6)**

Even with range tombstones, physical rows remain until compaction and still count as dead cells during reads. The only way to avoid them is to not read them at all.

   - Add a `min_task_id timeuuid` static column to `system.view_building_tasks`.
   - On every GC, write `min_task_id = min_alive_uuid` atomically with the range tombstone (same Raft batch).
   - On reload, read `min_task_id` first using a **static-only partition slice** (empty `_row_ranges` + `always_return_static_content`): the SSTable reader stops immediately after the static row before processing any clustering tombstones — zero dead cells counted.
   - Use `AND id >= min_task_id` as a lower bound for the main task scan, skipping all tombstoned rows.

The static-only read and the bounded scan are gated on the `VIEW_BUILDING_TASKS_MIN_TASK_ID` cluster feature so mixed-version clusters fall back to the full scan.

The issue is not critical, so the fix shouldn't be backported.

Fixes SCYLLADB-657

Closes scylladb/scylladb#28929

* github.com:scylladb/scylladb:
  test/cluster/test_view_building_coordinator: add reproducer for tombstone threshold warning
  docs: document tombstone avoidance in view_building_tasks
  view_building: add `task_uuid_generator` to `view_building_task_mutation_builder`
  view_building: introduce `task_uuid_generator`
  view_building: store `min_alive_uuid` in view building state
  view_building: set min_task_id when GC-ing finished tasks
  view_building: add min_task_id support to view_building_task_mutation_builder
  view_building: add min_task_id static column and bounded scan to system_keyspace
  view_building: use range tombstone when GC-ing finished tasks
  view_building: add range tombstone support to view_building_task_mutation_builder
  view_building: introduce VIEW_BUILDING_TASKS_MIN_TASK_ID cluster feature
2026-05-12 12:38:25 +03:00
Pavel Emelyanov
150345cc52 Merge 'test: per-bucket isolation for S3/GCS object storage tests' from Ernest Zaslavsky
This series adds per-test bucket isolation to all S3 and GCS object storage tests. Previously, every test shared a single pre-created bucket, which meant tests could interfere with each other through leftover objects and could not run concurrently across multiple `test.py` processes without risking collisions.

New `create_bucket`, `delete_bucket`, and `delete_bucket_with_objects` methods on `s3::client`, following the existing `make_request` pattern. `create_bucket` handles the `BUCKET_ALREADY_OWNED_BY_YOU` error gracefully.

A new `s3_test_fixture` RAII class for C++ Boost tests that creates a uniquely-named bucket on construction (derived from the Boost test name and pid) and tears down everything — objects, bucket, client — on destruction. All S3 tests in `s3_test.cc` are migrated to use it, removing manual `deferred_delete_object` and `deferred_close` boilerplate. The minio server policy is broadened to allow dynamic bucket creation/deletion.

A `client::make` overload that accepts a custom `retry_strategy`, used in tests with a fast 1ms retry delay instead of exponential backoff, significantly reducing test runtime for transient errors during bucket lifecycle operations.

Python-side (`test/cluster/object_store`): each pytest fixture (`object_storage`, `s3_storage`, `s3_server`) now creates a unique bucket per test function via `create_test_bucket()` and destroys it on teardown. Bucket names are sanitized from the pytest node name with a short UUID suffix for uniqueness.

Object storage helpers (`S3Server`, `MinioWrapper`, `GSFront`, `GSServerImpl`, factory functions, CQL helpers, `s3_server` fixture) are extracted from `test/cluster/object_store/conftest.py` into a shared `test/pylib/object_storage.py` module, eliminating duplication across test suites. The conftest becomes a thin re-export wrapper. Old class names are preserved as aliases for backward compatibility.

| Test Name                                                    | new test specific retry strategy execution time (ms) | original execution time (ms) |   Δ (ms) | Speedup |
|--------------------------------------------------------------|----------------:|-------------:|---------:|--------:|
| test_client_upload_file_multi_part_with_remainder_proxy      |          19,261 |       61,395 | −42,134  | **3.2×** |
| test_client_upload_file_multi_part_without_remainder_proxy   |          16,901 |       53,688 | −36,787  | **3.2×** |
| test_client_upload_file_single_part_proxy                    |           3,478 |        6,789 |  −3,311  | **2.0×** |
| test_client_multipart_copy_upload_proxy                      |           1,303 |        1,619 |    −316  | 1.2×    |
| test_client_put_get_object_proxy                             |             150 |          365 |    −215  | **2.4×** |
| test_client_readable_file_stream_proxy                       |             125 |          327 |    −202  | **2.6×** |
| test_small_object_copy_proxy                                 |             205 |          389 |    −184  | 1.9×    |
| test_client_put_get_tagging_proxy                            |             181 |          350 |    −169  | 1.9×    |
| test_client_multipart_upload_proxy                           |           1,252 |        1,416 |    −164  | 1.1×    |
| test_client_list_objects_proxy                               |             729 |          881 |    −152  | 1.2×    |
| test_chunked_download_data_source_with_delays_proxy          |             830 |          960 |    −130  | 1.2×    |
| test_client_readable_file_proxy                              |             148 |          279 |    −131  | 1.9×    |
| test_client_upload_file_multi_part_with_remainder_minio      |           3,358 |        3,170 |    +188  | 0.9×    |
| test_client_upload_file_multi_part_without_remainder_minio   |           3,131 |        2,929 |    +202  | 0.9×    |
| test_client_upload_file_single_part_minio                    |             519 |          421 |     +98  | 0.8×    |
| test_download_data_source_proxy                              |             180 |          237 |     −57  | 1.3×    |
| test_client_list_objects_incomplete_proxy                     |             590 |          641 |     −51  | 1.1×    |
| test_large_object_copy_proxy                                 |             952 |          991 |     −39  | 1.0×    |
| test_client_multipart_upload_fallback_proxy                  |             148 |          185 |     −37  | 1.3×    |
| test_client_multipart_copy_upload_minio                      |             641 |          674 |     −33  | 1.1×    |

No backport needed — this is a test infrastructure improvement with no production code impact beyond the new `s3::client` methods.

Closes scylladb/scylladb#29508

* github.com:scylladb/scylladb:
  test: extract object storage helpers to test/pylib/object_storage.py
  test: add per-test bucket isolation to object_store fixtures
  s3: add client::make overload with custom retry strategy
  test: add s3_test_fixture and migrate tests to per-bucket isolation
  s3: add create_bucket and delete_bucket to client
2026-05-12 12:38:24 +03:00
Pavel Emelyanov
19820910f8 test: Add test for backup vs migration race
The test starts regular backup+restore on a smaller cluster, but prior
to it spawns tablet migration from one node to another and locks it in
the middle with the help of block_tablet_streaming injection (even
though tablets have no data and there's nothing to stream, the injection
is located early enough to work).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:24 +03:00
Pavel Emelyanov
3bcefa42c5 test: Restore resilience test
The test checks that losing one of nodes from the cluster while restore
is handled. In particular:

- losing an API node makes the task waiting API to throw (apparently)
- losing coordinator or replica node makes the API call to fail, because
  some tablets should fail to get restored. If the coordinator is lost,
  it triggers coordinator re-election and new coordinator still notices
  that a tablet that was replicated to "old" coordinator failed to get
  restored and fails the restore anyway

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:24 +03:00
Pavel Emelyanov
69b8f76a32 sstables_loader: Fail tablet-restore task if not all sstables were downloaded
When the storage_service::restore_tablets() resolves, it only means that
tablet transitions are done, including restore transitions, but not
necessarily that they succeeded. So before resolving the restoration
task with success need to check if all sstables were downloaded and, if
not, resolve the task with exception.

Test included. It uses fault-injection to abort downloading of a single
sstable early, then checks that the error was properly propagated back
to the task waiting API

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:24 +03:00
Pavel Emelyanov
4137211cf4 test: Add a test for tablet-aware restore
The test is derived from test_restore_with_streaming_scopes() one, with
the excaption that it doesn't check for streaming directions, doesn't
check mutations right after creation and doesn't loop over scoped
sub-tests, because there's no scope concept here.

Also it verifies just two topologies, it seems to be enough. The scopes
test has many topologies because of the nature of the scoped restore,
with cluster-wide restore such flexibility is not required.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:23 +03:00
Calle Wilund
2cc1a2c406 storage_service: Disable snapshots after raft decommission
Fixes: SCYLLADB-1693

In case we abort a decommission operation, the snapshot/backup
mechanism need to remain open.

This change moves it to after raft_decommission.

In the case of a cluster snapshot, our nodes ownership
or not of tables will be serialized by raft anyway, so
should remain consistent. In that case we at worst coordinate
from a node in "leave" status

In the case of a local snapshot, ownership matters less,
only sstables on disk, which should not change.

In the case of backup, this operates on a snapshot, state of which
is not affected.

Adds an injection point for testing.

v2:
- Added injection point to ensure test can abort decommission

Closes scylladb/scylladb#29667
2026-05-11 17:04:09 +03:00
Piotr Smaron
71542206bc cql: return InvalidRequest for oversized partition/clustering keys
When a partition key or clustering key value exceeds the 64 KiB limit
(65535 bytes serialized), Scylla used to raise a generic
std::runtime_error "Key size too large: N > M" from the low-level
compound-key serializer. That error surfaced to clients as a CQL
server error (code 0x0000, "NoHostAvailable"-looking), which is both
ugly and incompatible with Cassandra - Cassandra returns a clean
InvalidRequest with the message "Key length of N is longer than
maximum of M".

Fix this at the single chokepoint: compound_type::serialize_value in
keys/compound.hh. The serializer is on every path that materializes a
key - INSERT/UPDATE/DELETE/BATCH build mutations through it, and
SELECT builds partition and clustering ranges through it - so a single
throw replacement produces a clean InvalidRequest consistently across
all paths and all key shapes (single, compound PK, composite CK).

The previous approach on this PR branch patched three call sites in
cql3/restrictions/statement_restrictions.cc, which only covered
SELECT, duplicated the check, and placed it mid-restrictions code
(flagged in review). Dropping those changes in favour of the
root-cause fix here.

Un-xfail the tests this fixes:
- test/cqlpy/test_key_length.py: test_insert_65k_pk, test_insert_65k_ck,
  test_where_65k_pk, test_where_65k_ck, test_insert_65k_ck_composite,
  test_insert_total_compound_pk_err, test_insert_total_composite_ck_err.
- test/cqlpy/cassandra_tests/.../insert_test.py: testPKInsertWithValueOver64K,
  testCKInsertWithValueOver64K.
- test/cqlpy/cassandra_tests/.../select_test.py: testPKQueryWithValueOver64K.

test_insert_65k_pk_compound stays xfail: its oversized value gets
rejected by the Python driver's CQL wire-protocol encoder (see
CASSANDRA-19270) before reaching the server, so the fix can't apply.
Updated its reason. testCKQueryWithValueOver64K stays xfail with an
updated reason: Cassandra silently returns empty for an oversized
clustering key in WHERE, while Scylla now throws InvalidRequest - a
deliberate choice mirroring the partition-key case, documented in
the discussion on #10366.

Add three tight-boundary tests (addressing review feedback on the
previous revision) that pin MAX+1 behaviour for SELECT and INSERT of
both partition and clustering keys.

Update test/cluster/dtest/limits_test.py to match the new message
("Key length of \\d+ is longer than maximum of 65535").

fixes #10366
fixes #12247

Co-authored-by: Alexander Turetskiy <someone.tur@gmail.com>

Closes scylladb/scylladb#23433
2026-05-11 16:56:35 +03:00
Gleb Natapov
c3d2f0bde9 raft_group0: remove finish_setup_after_join function
The only thing it does not change a bootstrapping node to become a voter
in case the cluster does not support limited voters feature. But the
feature was introduced in 2025.2 and direct upgrade from 2025.1 to
version newer than 2026.1 is not supported. But even if such upgrade is
done the removed code has affect only during bootstrap, not during
regular boot.

Also remove the upgrade test since after the patch suppressing the
feature on the first boot will no longer behave correctly.
2026-05-11 15:38:36 +03:00
Asias He
0204372156 repair: Reject repair requests where start and end tokens are equal
When a user calls the repair API with identical startToken and endToken
values, the code creates a wrapping interval (T, T]. This causes
unwrap() to split it into (-inf, T] and (T, +inf), covering the entire
token ring and triggering a full repair.

Reject such requests early with an error message matching
Cassandra's behavior: "Start and end tokens must be different."

Fixes: https://scylladb.atlassian.net/browse/CUSTOMER-358

Closes scylladb/scylladb#29821
2026-05-11 14:08:20 +03:00
Botond Dénes
ad7ac62835 Merge ' Add a node_owner column (locator::host_id) to system.sstables and make it part of the partition key' from Dimitrios Symonidis
Add a node_owner column (locator::host_id) to system.sstables and make it part of the partition key, so the primary key becomesv PRIMARY KEY ((table_id, node_owner), generation).

This is the first step toward moving the sstables registry into system_distributed: once distributed, each node's startup scan  must read only the rows it owns, which requires the owning node to be part of the partition key. Partitioning by (table_id, node_owner) turns that scan into a single-partition read of exactly the local node's rows.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1562
No need to backport this, keyspace over object storage is experimental feature

Closes scylladb/scylladb#29659

* github.com:scylladb/scylladb:
  db, sstables: add node_owner to sstables registry primary key
  db, sstables: rename sstables registry column owner to table_id
2026-05-11 14:08:19 +03:00