SSTable unlinking is async, so in some cases it may happen that
the upload dir is not empty immediately after refresh is done.
This patch adjusts test_refresh_deletes_uploaded_sstables so
it waits with a timeout till the upload dir becomes empty
instead of just assuming the API will sync on sstables being
gone.
Fixes SCYLLADB-1190
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Closesscylladb/scylladb#29215
Remove unused `pytest.mark.single_node` marker from `TestCQLAudit`.
Rename `TestCQLAudit` to `CQLAuditTester` to reflect that it is a test helper, not a test class. This avoids accidental pytest collection and subsequent warning about `__init__`.
Logs before the fixes:
```
test/cluster/test_audit.py:514: 14 warnings
/home/dario/dev/scylladb/test/cluster/test_audit.py:514: PytestCollectionWarning: cannot collect test class 'TestCQLAudit' because it has a __init__ constructor (from: cluster/test_audit.py)
@pytest.mark.single_node
```
Fixes SCYLLADB-1237
This is an addition to the latest master code. No backport needed.
Closesscylladb/scylladb#29237
* github.com:scylladb/scylladb:
test: audit: rename TestCQLAudit to CQLAuditTester
test: audit: remove unused pytest.mark.single_node
pytest tries to collect tests for execution in several ways.
One is to pick all classes that start with 'Test'. Those classes
must not have custom '__init__' constructor. TestCQLAudit does.
TestCQLAudit after migration from test/cluster/dtest is not a test
class anymore, but rather a helper class. There are two ways to fix
this:
1. Add __init__ = False to the TestCQLAudit class
2. Rename it to not start with 'Test'
Option 2 feels better because the new name itself does not convey
the wrong message about its role.
Fixes SCYLLADB-1237
Remove unused pytest.mark.single_node in TestCQLAudit class.
This is a leftover from audit tests migration from
test/cluster/dtest to test/cluster.
Refs SCYLLADB-1237
Previously, the result of when_all was discarded. when_all stores
exceptions in the returned futures rather than throwing, so the outer
catch(in_use&) could never trigger. Now we capture the when_all result
and inspect each future individually to properly detect in_use from
either stream.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1216Closesscylladb/scylladb#29219
Change wait_for() defaults from period=1s/no backoff to period=0.1s
with 1.5x backoff capped at 1.0s. This catches fast conditions in
100ms instead of 1000ms, benefiting ~100 call sites automatically.
Add completion logging with elapsed time and iteration count.
Tested local with test/cluster/test_fencing.py::test_fence_hints (dev mode),
log output:
wait_for(at_least_one_hint_failed) completed in 0.83s (4 iterations)
wait_for(exactly_one_hint_sent) completed in 1.34s (5 iterations)
Fixes SCYLLADB-738
Closesscylladb/scylladb#29173
The test was flaky. The scenario looked like this:
1. Stop server 1.
2. Set its rf_rack_valid_keyspaces configuration option to true.
3. Create an RF-rack-invalid keyspace.
4. Start server 1 and expect a failure during start-up.
It was wrong. We cannot predict when the Raft mutation corresponding to
the newly created keyspace will arrive at the node or when it will be
processed. If the check of the RF-rack-valid keyspaces we perform at
start-up was done before that, it won't include the keyspace. This will
lead to a test failure.
Unfortunately, it's not feasible to perform a read barrier during
start-up. What's more, although it would help the test, it wouldn't be
useful otherwise. Because of that, we simply fix the test, at least for
now.
The new scenario looks like this:
1. Disable the rf_rack_valid_keyspaces configuration option on server 1.
2. Start the server.
3. Create an RF-rack-invalid keyspace.
4. Perform a read barrier on server 1. This will ensure that it has
observed all Raft mutations, and we won't run into the same problem.
5. Stop the node.
6. Set its rf_rack_valid_keyspaces configuration option to true.
7. Try to start the node and observe a failure.
This will make the test perform consistently.
---
I ran the test (in dev mode, on my local machine) three times before
these changes, and three times with them. I include the time results
below.
Before:
```
real 0m47.570s
user 0m41.631s
sys 0m8.634s
real 0m50.495s
user 0m42.499s
sys 0m8.607s
real 0m50.375s
user 0m41.832s
sys 0m8.789s
```
After:
```
real 0m50.509s
user 0m43.535s
sys 0m9.715s
real 0m50.857s
user 0m44.185s
sys 0m9.811s
real 0m50.873s
user 0m44.289s
sys 0m9.737s
```
Fixes SCYLLADB-1137
Backport: The test is present on all supported branches, and so we
should backport these changes to them.
Closesscylladb/scylladb#29218
* github.com:scylladb/scylladb:
test: cluster: Deflake test_startup_with_keyspaces_violating_rf_rack_valid_keyspaces
test: cluster: Mark test with @pytest.mark.asyncio in test_multidc.py
This PR contains two small improvements to `test_incremental_repair.py`
motivated by the sporadic failure of
`test_tablet_incremental_repair_and_scrubsstables_abort`.
The test fails with `assert 3 == 2` on `len(sst_add)` in the second
repair round. The extra SSTable has `repaired_at=0`, meaning scrub
unexpectedly produced more unrepaired SSTables than anticipated. Since
scrub (and compaction in general) logs at DEBUG level and the test did
not enable debug logging, the existing logs do not contain enough
information to determine the root cause.
**Commit 1** fixes a long-standing typo in the helper function name
(`preapre` -> `prepare`).
**Commit 2** enables `compaction=debug` for the Scylla nodes started by
`do_tablet_incremental_repair_and_ops`, which covers all
`test_tablet_incremental_repair_and_*` variants. This will capture full
compaction/scrub activity on the next reproduction, making the failure
diagnosable.
Refs: SCYLLADB-1086
Backport: test improvement, no backport
Closesscylladb/scylladb#29175
* https://github.com/scylladb/scylladb:
test/cluster/test_incremental_repair.py: enable compaction DEBUG logs in do_tablet_incremental_repair_and_ops
test/cluster/test_incremental_repair.py: fix typo preapre -> prepare
Two issues prevented the precompiled header from compiling
successfully when using CMake directly (rather than the
configure.py + ninja build system):
a) Propagate build flags to Rust binding targets reusing the
PCH. The wasmtime_bindings and inc targets reuse the PCH
from scylla-precompiled-header, which is compiled with
Seastar's flags (including sanitizer flags in
Debug/Sanitize modes). Without matching compile options,
the compiler rejects the PCH due to flag mismatch (e.g.,
-fsanitize=address). Link these targets against
Seastar::seastar to inherit the required compile options.
Closesscylladb/scylladb#28941
The test was flaky. The scenario looked like this:
1. Stop server 1.
2. Set its rf_rack_valid_keyspaces configuration option to true.
3. Create an RF-rack-invalid keyspace.
4. Start server 1 and expect a failure during start-up.
It was wrong. We cannot predict when the Raft mutation corresponding to
the newly created keyspace will arrive at the node or when it will be
processed. If the check of the RF-rack-valid keyspaces we perform at
start-up was done before that, it won't include the keyspace. This will
lead to a test failure.
Unfortunately, it's not feasible to perform a read barrier during
start-up. What's more, although it would help the test, it wouldn't be
useful otherwise. Because of that, we simply fix the test, at least for
now.
The new scenario looks like this:
1. Disable the rf_rack_valid_keyspaces configuration option on server 1.
2. Start the server.
3. Create an RF-rack-invalid keyspace.
4. Perform a read barrier on server 1. This will ensure that it has
observed all Raft mutations, and we won't run into the same problem.
5. Stop the node.
6. Set its rf_rack_valid_keyspaces configuration option to true.
7. Try to start the node and observe a failure.
This will make the test perform consistently.
---
I ran the test (in dev mode, on my local machine) three times before
these changes, and three times with them. I include the time results
below.
Before:
```
real 0m47.570s
user 0m41.631s
sys 0m8.634s
real 0m50.495s
user 0m42.499s
sys 0m8.607s
real 0m50.375s
user 0m41.832s
sys 0m8.789s
```
After:
```
real 0m50.509s
user 0m43.535s
sys 0m9.715s
real 0m50.857s
user 0m44.185s
sys 0m9.811s
real 0m50.873s
user 0m44.289s
sys 0m9.737s
```
Fixes SCYLLADB-1137
ERMs created in `calculate_vnode_effective_replication_map` have RF computed based
on the old token metadata during a topology change. The reading replicas, however,
are computed based on the new token metadata (`target_token_metadata`) when
`read_new` is true. That can create a mismatch for EverywhereStrategy during some
topology changes - RF can be equal to the number of reading replicas +-1. During
bootstrap, this can cause the
`everywhere_replication_strategy::sanity_check_read_replicas` check to fail in
debug mode.
We fix the check in this commit by allowing one more reading replica when
`read_new` is true.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1147Closesscylladb/scylladb#29150
Before these changes, we would send mutations to the node and
immediately query the metrics to see how many hints had been written.
However, that could lead to random failures of the test: even if the
mutations have finished executing, hints are stored asynchronously, so
we don't have a guarantee they have already been processed.
To prevent such failures, we rewrite the check: we will perform multiple
checks against the metrics until we have confirmed that the hints have
indeed been written or we hit the timeout.
We're generous with the timeout: we give the test 60 seconds. That
should be enough time to avoid flakiness even on super slow machines,
and if the test does fail, we will know something is really wrong.
As a bonus, we improve the test in general too. We explicitly express
the preconditions we rely on, as well as bump the log level. If the
test fails in the future, it might be very difficult do debug it
without this additional information.
Fixes SCYLLADB-1133
Backport: The test is present on all supported branches. To avoid
running into more failures, we should backport these changes
to them.
Closesscylladb/scylladb#29191
* github.com:scylladb/scylladb:
test: cluster: Increase log level in test_write_cl_any_to_dead_node_generates_hints
test: cluster: Await all mutations concurrently in test_write_cl_any_to_dead_node_generates_hints
test: cluster: Specify min_tablet_count in test_write_cl_any_to_dead_node_generates_hints
test: cluster: Use new_test_table in test_write_cl_any_to_dead_node_generates_hints
test: cluster: Introduce auxiliary function keyspace_has_tablets
test: cluster: Deflake test_write_cl_any_to_dead_node_generates_hints
Reduce the timeout for one test to 60 minutes. The longest test we had
so far was ~10-15 minutes. So reducing this timeout is pretty safe and
should help with hanging tests.
Closesscylladb/scylladb#29212
Migrate audit tests from test/cluster/dtest to test/cluster. Optimize their execution time through cluster reuse.
The audit test suite is heavy. There are more than 70 test executions. Environment preparation is a significant part of each test case execution time.
This PR:
1. Copies audit tests from test/cluster/dtest to test/cluster, refactoring and enabling them
2. Groups tests functions by non-live cluster configuration variations to enable cluster reuse between them
- Execution time reduced from 4m 29s to 2m 47s, which is ~38% execution time decrease
3. Removes the old audit tests from test/cluster/dtest
Includes two supporting changes:
- Allow specifying `AuthProvider` in `ManagerClient.get_cql_exclusive`
- Fix server log file handling for clean clusters
Refs [SCYLLADB-573](https://scylladb.atlassian.net/browse/SCYLLADB-573)
This PR is an improvement and does not require a backport.
[SCYLLADB-573]: https://scylladb.atlassian.net/browse/SCYLLADB-573?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQClosesscylladb/scylladb#28650
* github.com:scylladb/scylladb:
test: cluster: fix log clear race condition in test_audit.py
test: pylib: shut down exclusive cql connections in ManagerClient
test: cluster: fix multinode audit entry comparison in test_audit.py
test: cluster: dtest: remove old audit tests
test: cluster: group migrated audit tests for cluster reuse
test: cluster: enable migrated audit tests and make them work
test: pylib: manager_client: specify AuthProvider in get_cql_exclusive
test: pylib: scylla cluster after_test log fix
test: audit: copy audit test from dtest
Maintenance socket path used for PGO is in the node workdir.
When the node workdir path is too long, the maintenance socket path
(workdir/cql.m) can exceed the Unix domain socket sun_path limit
and failing the PGO training pipeline.
To prevent this:
- pass an explicit --maintenance-socket override
pointing to a short determinitic path in /tmp derived from the MD5
hash of the workdir maintenance socket path
- update maintenance_socket_path to return the matching short path
so that exec_cql.py connects to the right socket
The short path socket files are cleaned up after the cluster stops.
The path is using MD5 hash of the workdir path, so it is deterministic.
Fixes SCYLLADB-1070
Closesscylladb/scylladb#29149
Commit faa0ee9844 accidentally broke the way split snapshot mutation was
frozen -- instead of appending the sub-mutation `m` the commit kept the
old variable name of `mut` which in the new code corresponds to "old"
non-split mutation
Fixes#29051
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29052
There's a flaw in table::query() -- calling querier_opt->close() can dereferences a disengaged std::optional. The fix pretty simple. Once fixed, there are two if-s checking for querier_opt being engaged or not that are worth being merged.
The problem doesn't really shows itself becase table::query() is not called with null saved_querier, so the de-facto if is always correct. However, better to be on safe-side.
The problem doesn't show itself for real, not worth backporting
Closesscylladb/scylladb#29142
* github.com:scylladb/scylladb:
table: merge adjacent querier_opt checks in query()
table: don't close a disengaged querier in query()
Potential fix for https://github.com/scylladb/scylladb/security/code-scanning/147.
To fix the problem, add an explicit `permissions:` block to the workflow
(either at the top level or inside the `trigger-jenkins` job) that
constrains the `GITHUB_TOKEN` to the minimal necessary privileges. This
codifies least-privilege in the workflow itself instead of relying on
repository or organization defaults.
The best minimal, non‑breaking change is to define a root‑level
`permissions:` block with read‑only contents access because the job does
not perform any write operations to the repository, nor does it interact
with issues, pull requests, or other GitHub resources. A conservative,
widely accepted baseline is `contents: read`. If later steps require more
permissions, they can be added explicitly, but for this snippet, no such
need is visible.
Concretely, in `.github/workflows/trigger_jenkins.yaml`, insert:
```yaml
permissions:
contents: read
```
between the `name:` block and the `on:` block (e.g., after line 2).
No additional methods, imports, or definitions are needed since this is
a pure YAML configuration change and does not alter runtime behavior of
the existing shell steps.
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Closesscylladb/scylladb#27815
Potential fix for https://github.com/scylladb/scylladb/security/code-scanning/169.
In general, the fix is to add an explicit `permissions:` block to the
workflow (at the root level or per job) so that the `GITHUB_TOKEN` has
only the minimal scopes needed. Since this job only reads event data and
uses secrets to talk to Jenkins, we can restrict `GITHUB_TOKEN` to
read‑only repository contents.
The single best fix here is to add a top‑level `permissions:` block
right under the `name:` (and before `on:`) in
`.github/workflows/trigger-scylla-ci.yaml`, setting `contents: read`.
This applies to all jobs in the workflow, including `trigger-jenkins`,
and does not alter any existing steps or logic. No additional imports or
methods are needed, as this is purely a YAML configuration change for
GitHub Actions.
Concretely, edit `.github/workflows/trigger-scylla-ci.yaml` to insert:
```yaml
permissions:
contents: read
```
after line 1. No other lines in the file need to change.
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Closesscylladb/scylladb#27812
We increase the log level of `hints_manager` to TRACE in the test.
If it fails, it may be incredibly difficult to debug it without any
additional information.
The test relies on the assumption that mutations will be distributed
more or less uniformly over the nodes. Although in practice this should
not be possible, theoretically it's possible that there's only one
tablet allocated for the table.
To clearly indicate this precondition, we explicitly set the property
`min_tablet_count` when creating the table. This way, we have a gurantee
that the table has multiple tablets. The load balancer should now take
care of distributing them over the nodes equally. Thanks to that,
`servers[1]` will have some tablets, and so it'll be the target for some
of the mutations we perform.
The context manager is the de-facto standard in the test suite. It will
also allow us for a prettier way to conditionally enable per-table
tablet options in the following commit.
The function is adapted from its counterpart in the cqlpy test suite:
cqlpy/util.py::keyspace_has_tablets. We will use it in a commit in this
series to conditionally set tablet properties when creating a table.
It might also be useful in general.
Before these changes, we would send mutations to the node and
immediately query the metrics to see how many hints had been written.
However, that could lead to random failures of the test: even if the
mutations have finished executing, hints are stored asynchronously, so
we don't have a guarantee they have already been processed.
To prevent such failures, we rewrite the check: we will perform multiple
checks against the metrics until we have confirmed that the hints have
indeed been written or we hit the timeout.
We're generous with the timeout: we give the test 60 seconds. That
should be enough time to avoid flakiness even on super slow machines,
and if the test does fail, we will know something is really wrong.
Fixes SCYLLADB-1133
Fixtures previously ran GDB once (module scope) to find live objects
(sstables, tasks, schemas) and stored their addresses. Tests then
reused those addresses in separate GDB invocations. Sometimes these
addresses would become stale and the test would step on use-after-free
(e.g. sstables compacted away between invocations).
Fix by dropping the fixtures. The helper functions used by the fixtures
to obtain the required objects are converted to gdb convenience
functions, which can be used in the same expression as the test command
invocation. Thus, the object is aquired on-demand at the moment it is
used, so it is guaranteed to be fresh and relevant.
Fixes: SCYLLADB-1020
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#28999
The version of fmt installed on my machine refuses to work with
`std::filesystem::path` directly. Add `.string()` calls in places that
attempt to print paths directly in order to make them work.
Closesscylladb/scylladb#29148
When encrypted_data_source::get() caches a trailing block in _next, the next call takes it directly — bypassing input_stream::read(), which checks _eof. It then calls input_stream::read_exactly() on the already-drained stream. Unlike read(), read_up_to(), and consume(), read_exactly() does not check _eof when the buffer is empty, so it calls _fd.get() on a source that already returned EOS.
In production this manifested as stuck encrypted SSTable component downloads during tablet restore: the underlying chunked_download_source hung forever on the post-EOS get(), causing 4 tablets to never complete. The stuck files were always block-aligned sizes (8k, 12k) where _next gets populated and the source is fully consumed in the same call.
Fix by checking _input.eof() before calling read_exactly(). When the stream already reached EOF, buf2 is known to be empty, so the call is skipped entirely.
A comprehensive test is added that uses a strict_memory_source which fails on post-EOS get(), reproducing the exact code path that caused the production deadlock.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1128
Backport to 2025.3/4 and 2026.1 is needed since it fixes a bug that may bite us in production, to be on the safe side
Closesscylladb/scylladb#29110
* github.com:scylladb/scylladb:
encryption: fix deadlock in encrypted_data_source::get()
test_lib: mark `limiting_data_source_impl` as not `final`
Fix formatting after previous patch
Fix indentation after previous patch
test_lib: make limiting_data_source_impl available to tests
The test sporadically fails because scrub produces an unexpected number
of SSTables. Compaction logs are needed to diagnose why, but were not
captured since scrub runs at DEBUG level. Enable compaction=debug for
the servers started by do_tablet_incremental_repair_and_ops so the next
reproduction provides enough information to root-cause the issue.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Without this include the file would not compile on its own. The issue
was most likely masked by the use of precompiled headers in our CI.
Closesscylladb/scylladb#29170
The 'scylla sstable-summary' GDB command crashes with
'ValueError: Argument "count" should be greater than zero' when
inspecting ms-format (trie-based) sstables. This happens because
ms-format sstables don't populate the traditional summary structure,
leaving all fields zeroed out, which causes gdb.read_memory() to be
called with a zero count.
Fix by:
- Adding zero-length guards to sstring.to_hex() and sstring.as_bytes()
to return early when the data length is zero, consistent with the
existing guard in managed_bytes.get().
- Adding the same guard to scylla_sstable_summary.to_hex().
- Detecting ms-format sstables (version == 5) early in
scylla_sstable_summary.invoke() and printing an informative message
instead of attempting to read the unpopulated summary.
Fixes: SCYLLADB-1180
Closesscylladb/scylladb#29162
In test_index_requires_rf_rack_valid_keyspace, the create_table call
for a plain tablet-based table can fail with 'Unable to reach schema
agreement' after the server's 10s timeout is exceeded. This happens
when schema gossip propagation across the 4-node cluster takes longer
than expected after a sequence of rapid schema changes earlier in the
test.
Add a retry (up to 2 attempts) on schema agreement errors for this
specific create_table call rather than increasing the server-side
timeout.
Fixes: SCYLLADB-1135
Closesscylladb/scylladb#29132
Potential fix for https://github.com/scylladb/scylladb/security/code-scanning/139,
To fix the problem, explicitly restrict the `GITHUB_TOKEN` permissions
for this workflow/job so it has only what is needed. The script reads PR
data and repository info (which is covered by `contents: read`/default
read scopes) and posts a comment via `github.rest.issues.createComment`,
which requires `issues: write`. No other write scopes (e.g., `contents:
write`, `pull-requests: write`) are necessary.
The best fix without changing functionality is to add a `permissions`
block scoped to this job (or at the workflow root). Since we only see a
single job here, we’ll add it under `check-fixes-prefix`. Concretely, in
`.github/workflows/backport-pr-fixes-validation.yaml`, between the
`runs-on: ubuntu-latest` line (line 10) and `steps:` (line 11), add:
```yaml
permissions:
contents: read
issues: write
```
This keeps the token minimally privileged while still allowing the script
to create issue/PR comments.
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Closesscylladb/scylladb#27810
`read_barrier(session2)` was supposed to ensure `node2` has caught up on schema
before a CL=ALL write. But `patient_cql_connection(node2)` creates a
cluster-aware driver session `(TokenAwarePolicy(DCAwareRoundRobinPolicy()))`
that can route the barrier CQL statement to any node — not necessarily `node2`.
If the barrier runs on `node1` or `node3` (which already have the new schema),
it's a no-op, and `node2` remains stale, thus the observed `WriteFailure`.
The fix is to switch to `patient_exclusive_cql_connection(node2)`,
which uses `WhiteListRoundRobinPolicy([node2_ip])` to pin all CQL to `node2`.
This is already the established pattern used by other tests in the same file.
Fixes: SCYLLADB-1139
No need to backport yet, appeared only on master.
Closesscylladb/scylladb#29151
The test intentionally creates huge index pages.
But since 5e7fb08bf3,
the index reader allocates a block of memory for a whole index page,
instead of incrementally allocating small pieces during index parsing.
This giant allocation causes the test to fail spuriously in CI sometimes.
Fix this by disabling sstable compression on the test table,
which puts a hard cap of 2000 keys per index page.
Fixes: SCYLLADB-1152
Closesscylladb/scylladb#29152
Use add_new_sstable_and_update_cache() when attaching SSTables
downloaded by the node-scoped local loader.
This is the correct variant for new SSTables: it can unlink the
SSTable on failure to add it, and it can split the SSTable if a
tablet split is in progress. The older
add_sstable_and_update_cache() helper is intended for preexisting
SSTables that are already stable on disk.
Additionally, downloaded SSTables are now left unsealed (TemporaryTOC)
until they are successfully added to the table's SSTable set. The
download path (download_fully_contained_sstables) passes
leave_unsealed=true to create_stream_sink, and attach_sstable opens
the SSTable with unsealed_sstable=true and seals it only inside the
on_add callback — matching the pattern used by stream_blob.cc and
storage_service.cc for tablet streaming.
This prevents a data-resurrection hazard: previously, if the process
crashed between download and attach_sstable, or if attach_sstable
failed mid-loop, sealed (TOC) SSTables would remain in the table
directory and be reloaded by distributed_loader on restart. With
TemporaryTOC, sstable_directory automatically cleans them up on
restart instead.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1085.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#29072
The native Scylla nodetool reports ECONNREFUSED as 'Connection refused',
not as 'ConnectException' (which is the Java nodetool format). Add
'Connection refused' to the valid_errors list so that transient
connection failures during concurrent decommission/bootstrap topology
changes are properly tolerated.
Fixes SCYLLADB-1167
Closesscylladb/scylladb#29156
After the previous fix both guarding if-s start with 'if (querier_opt &&'.
Merge them into a single outer 'if (querier_opt)' block to avoid the
redundant check and make the structure easier to follow.
No functional change.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a `with_connect` operation timed out, the underlying connection
attempt continued to run in the reactor. This could lead to a crash
if the connection was established/rejected after the client object had
already been destroyed. This issue was observed during the teardown
phase of a upcoming high-availability test case.
This commit fixes the race condition by ensuring the connection attempt
is properly canceled on timeout.
Additionally, the explicit TLS handshake previously forced during the
connection is now deferred to the first I/O operation, which is the
default and preferred behavior.
Fixes: SCYLLADB-832
Backports to 2026.1 and 2025.4 are required, as this issue also exists on those branches and is causing CI flakiness.
Closesscylladb/scylladb#29031
* github.com:scylladb/scylladb:
vector_search: test: fix flaky test
vector_search: fix race condition on connection timeout
The condition guarding querier_opt->close() was:
When saved_querier is null the short-circuit makes the whole condition true
regardless of whether querier_opt is engaged. If partition_ranges is empty,
query_state::done() is true before the while-loop body ever runs, so querier_opt
is never created. Calling querier_opt->close() then dereferences a disengaged
std::optional — undefined behaviour.
Fix by checking querier_opt first:
This preserves all existing semantics (close when not saving, or when saving
wouldn't be useful) while making the no-querier path safe.
Why this doesn't surface today: the sole production call site, database::query(),
in practice. The API header documents nullptr as valid ("Pass nullptr when
queriers are not saved"), so the bug is real but latent.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The code in upload_file std::move()-s vector of names into
merge_objects() method, then iterates over this vector to delete
objects. The iteration is apparently a no-op on moved-from vector.
The fix is to make merge_objects() helper get vector of names by const
reference -- the method doesn't modify the names collection, the caller
keeps one in stable storage.
Fixes#29060
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29061
Switch directories::do_verify_owner_and_mode() from lister::scan_dir() to
utils::directory_lister while preserving the previous hidden-entry
behavior.
Make do_verify_subpath use lister::filter_type directly so the
verification helper can pass it straight into directory_lister, and keep
a single yielding iteration loop for directory traversal.
Minus one scan_dir user twards scan_dir removal from code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29064
The update_credentials_and_rearm() may get "empty" credentials from
_creds_provider_chain.get_aws_credentials() -- it doesn't throw, but
returns default-initialized value. In that case the expires_at will be
set to time_point::min, and it's probably not a good idea to arm the
refresh timer and, even worse idea, to subtract 1h from it.
Fixes#29056
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29057
The format string had two {} placeholders but three arguments, the
_upload_id one is skipped from formatting
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29053
The test_restore_primary_replica_same_domain and test_restore_primary_replica_different_domain tests have very much in common. Previously both tests were also split each into two, so we have four tests, and now we have two that can also be squashed, the lines-of-code savings still worth it.
This is the continuation of #28569
Tests improvement, not backporting
Closesscylladb/scylladb#28994
* github.com:scylladb/scylladb:
test: Replace a bunch of ternary operators with an if-else block
test: Squash test_restore_primary_replica_same|different_domain tests
test: Use the same regexp in test_restore_primary_replica_different|same_domain-s
The do_test_backup_abort() fetched the node's workdir and resolved cf_dir
solely to construct a unique-ish backup prefix:
prefix = f'{cf_dir}/backup'
The comment already acknowledged this was only "unique(ish)" — relying
on the UUID-derived cf_dir name as a uniqueness source is roundabout.
unique_name() is already imported and used for exactly this purpose
elsewhere in the file.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29030
The endpoint URL remains intact. Having it next to another toppartitions
endpoint (the /column_family/toppartitions one) is natural.
This endpoint only needs sharded<replica::database>&, grabs it from
http_context and doesn't use any other service. In column_family.cc the
database reference is already available as a parameter. Once more user
of http_context.db is gone.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#28996
The test in question uses several helpers from the backup sute, but it doesn't really need them -- the operations it want to perform can be performed with standard pylib methods. "While at it" also collect some dangling effectively unused local variables from this test (these were apparently left from backup tests this one was copied-and-reworked from)
Enhancing tests, not backporting
Closesscylladb/scylladb#29130
* github.com:scylladb/scylladb:
test/refresh: Simplify refresh invocation
test/refresh: Remove r_servers alias for servers
test/refresh: Replace check_mutation_replicas with a plain CQL SELECT
test/refresh: Inline keyspace/table/data setup in test_refresh_deletes_uploaded_sstables
test/refresh: Prepare indentation for new_test_keyspace in test_refresh_deletes_uploaded_sstables
test/refresh: Decouple test_refresh_deletes_uploaded_sstables from backup tests
test/refresh: Remove unused wait_for_cql_and_get_hosts import
Two issues found in the lister returned by gs_client_wrapper::make_object_lister()
Lister can report EOF too early in case filter is active, another one is potential vector out-of-bounds access
Fixes#29058
The code appeared in 2026.1, worth fixing it there as well
Closesscylladb/scylladb#29059
* github.com:scylladb/scylladb:
sstables: Fix object storage lister not resetting position in batch vector
sstables: Fix object storage lister skipping entries when filter is active
When it deadlocks, groups stop merging and compaction group merge
backlog will run-away.
Also, graceful shutdown will be blocked on it.
Found by flaky unit test
test_merge_chooses_best_replica_with_odd_count, which timed-out in 1
in 100 runs.
Reason for deadlock:
When storage groups are merged, the main compaction group of the new
storage group takes a compaction lock, which is appended to
_compaction_reenablers_for_merging, and released when the merge
completion fiber is done with the whole batch.
If we accumulate more than 1 merge cycle for the fiber, deadlock
occurs. Lock order will be this
Initial state:
cg0: main
cg1: main
cg2: main
cg3: main
After 1st merge:
cg0': main [locked], merging_groups=[cg0.main, cg1.main]
cg1': main [locked], merging_groups=[cg2.main, cg3.main]
After 2nd merge:
cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main]
merge completion fiber will try to stop cg0'.main, which will be
blocked on compaction lock. which is held by the reenabler in
_compaction_reenablers_for_merging, hence deadlock.
The fix is to wait for background merge to finish before we start the
next merge. It's achieved by holding old erm in the background merge,
and doing a topology barrier from the merge finalizing transition.
Background merge is supposed to be a relatively quick operation, it's
stopping compaction groups. So may wait for active requests. It
shouldn't prolong the barrier indefinitely.
Tablet tests which trigger merge need to be adjusted to call the
barrier, otherwise they will be vulnerable to the deadlock.
Fixes SCYLLADB-928
Backport to >= 2025.4 because it's the earliest vulnerable due to f9021777d8.
Closesscylladb/scylladb#29007
* github.com:scylladb/scylladb:
tablets: Fix deadlock in background storage group merge fiber
replica: table: Propagate old erm to storage group merge
test: boost: tablets_test: Save tablet metadata when ACKing split resize decision
storage_service: Extract local_topology_barrier()
This patch series introduces a new documentation for exiting guardrails.
Moreover:
- Warning / failure messages of recently added write CL guardrails (SCYLLADB-259) are rephrased, so all guardrails have similar messages.
- Some new tests are added, to help verify the correctness of the documentation and avoid situations where the documentation and implementation diverge.
Fixes: [SCYLLADB-257](https://scylladb.atlassian.net/browse/SCYLLADB-257)
No backport, just new docs and tests.
[SCYLLADB-257]: https://scylladb.atlassian.net/browse/SCYLLADB-257?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQClosesscylladb/scylladb#29011
* github.com:scylladb/scylladb:
test: add new guardrail tests matching documentation scenarios
test: add metric assertions to guardrail replication strategy tests
test: use regex matching in guardrail replication strategy tests
test: extract ks_opts helper in test_guardrail_replication_strategy
docs: document CQL guardrails
cql: improve write consistency level guardrail messages
This PR enables the node_exporter systemd collector and configures the unit whitelist to include scylla-server.service and systemd-coredump services.
**Motivation**: We currently lack visibility into system-level service states, which is critical for diagnosing stability issues.
This configuration enables two specific use cases:
- Detecting Coredump Loops: We encounter scenarios where ScyllaDB enters a restart loop. To pinpoint SIGSEGV (coredumps) as the root cause, we need to track when the systemd-coredump service becomes active, indicating a dump is being processed.
- Identifying Startup Failures: We need to detect when the scylla-server unit enters a failed state. This is essential for catching unrecoverable errors (e.g., corrupted commitlogs or configuration bugs) that prevent the server from starting.
example of promql queries:
- `node_systemd_unit_state{name=~"systemd-coredump@.*", state="active"} == 1`
- `node_systemd_unit_state{name="scylla-server.service", state="failed"} == 1`
Closes#28402
reader_concurrency_semaphore::signal() guards against available
resources exceeding the initial limit after a signal, which would
indicate a bug such as double-returning resources. It reports the
issue via on_internal_error_noexcept and clamps resources back to
the initial values. However, before this commit there were no tests
that verified this behavior, so bugs like SCYLLADB-1014 went
undetected.
Add a test that artificially signals resources that were never
consumed and verifies that signal() detects the negative leak and
clamps available resources back to the initial limit.
Refs: SCYLLADB-1014
Fixes: SCYLLADB-1031
Closesscylladb/scylladb#28993
Since 7564a56dc8, all tables default to
repair-mode tombstone-gc, which is identical to immediate-mode for RF=1
tables. Consequently the tombstones written by the tests in this test
file are immediately collectible and with some unlucky timing, some of
them can be collected before the end of the test, failing the empty-page
prefix check because the empty pages prefix will be smaller than
expected based on the number of tombstones written.
Disable tombstone-gc to remove this source of flakyness.
Fixes: SCYLLADB-1062
Closesscylladb/scylladb#29077
The test has expectation w.r.t which write makes it to which nodes:
* inserts make it to all nodes
* delete makes it to all-1 (QUORUM) node
However, this was not expressed with CL, and the default CL=ONE allowed
for some nodes missing the writes and this violating the tests
expectations on what data is persent on which nodes. This resulted on
the test being flaky and failing on the data checks.
Use explicit CL for the ingestion to prevent this.
The improvements to the test introduced in
a8dd13731f was of great help in
investigating this: traces are now available and the check happens after
the data was dumped to logs.
Fixes: SCYLLADB-870
Fixes: SCYLLADB-812
Fixes: SCYLLADB-1102
Closesscylladb/scylladb#29128
Introduce an initial and experimental implementation of an alternative log-structured storage engine for key-value tables.
Main flows and components:
* The storage is composed of 32MB files, each file divided to segments of size 128k. We write to them sequentially records that contain a mutation and additional metadata. Records are written to a buffer first and then written to the active segment sequentially in 4k sized blocks.
* The primary index in memory maps keys to their location on disk. It is a B-tree per-table that is ordered by tokens, similar to a memtable.
* On reads we calculate the key and look it up in the primary index, then read the mutation from disk with a single disk IO.
* On writes we write the record to a buffer, wait for it to be written to disk, then update the index with the new location, and free the previous record.
* We track the used space in each segment. When overwriting a record, we increase the free space counter for the segment of the previous record that becomes dead. We store the segments in a histogram by usage.
* The compaction process takes segments with low utilization, reads them and writes the live records to new segments, and frees the old segments.
* Segments are initially "mixed" - we write to the active segment records from all tables and all tablets. The "separator" process rewrites records from mixed segments into new segments that are organized by compaction groups (tablets), and frees the mixed segments. Each write is written to the active segment and to a separator buffer of the compaction group, which is eventually flushed to a new segment in the compaction group.
Currently this mode is experimental and requires an experimental flag to be enabled.
Some things that are not supported yet are strong consistency, tablet migration, tablet split/merge, big mutations, tombstone gc, ttl.
to use, add to config:
```
enable_logstor: true
experimental_features:
- logstor
```
create a table:
```
CREATE TABLE ks.t(pk int PRIMARY KEY, a int, v text) WITH storage_engine = 'logstor';
```
INSERT, SELECT, DELETE work as expected
UPDATE not supported yet
no backport - new feature
Closesscylladb/scylladb#28706
* github.com:scylladb/scylladb:
logstor: trigger separator flush for buffers that hold old segments
docs/dev: add logstor documentation
logstor: recover segments into compaction groups
logstor: range read
logstor: change index to btree by token per table
logstor: move segments to replica::compaction_group
db: update dirty mem limits dynamically
logstor: track memory usage
logstor: logstor stats api
logstor: compaction buffer pool
logstor: separator: flush buffer when full
logstor: hold segment until index updates
logstor: truncate table
logstor: enable/disable compaction per table
logstor: separator buffer pool
test: logstor: add separator and compaction tests
logstor: segment and separator barrier
logstor: separator debt controller
logstor: compaction controller
logstor: recovery: recover mixed segments using separator
logstor: wait for pending reads in compaction
logstor: separator
logstor: compaction groups
logstor: cache files for read
logstor: recovery: initial
logstor: add segment generation
logstor: reserve segments for compaction
logstor: index: buckets
logstor: add buffer header
logstor: add group_id
logstor: record generation
logstor: generation utility
logstor: use RIPEMD-160 for index key
test: add test_logstor.py
api: add logstor compaction trigger endpoint
replica: add logstor to db
schema: add logstor cf property
logstor: initial commit
db: disable tablet balancing with logstor
db: add logstor experimental feature flag
Trie-based sstable indexes are supposed to be (hopefully) a better default than the old BIG indexes.
Make the new format a new default for new clusters by naming ms in the default scylla.yaml.
New functionality. No backport needed.
This PR is basically Michał's one https://github.com/scylladb/scylladb/pull/26377, Jakub's https://github.com/scylladb/scylladb/pull/27332 fixing `sstables_manager::get_highest_supported_format()` and one test fix.
Closesscylladb/scylladb#28960
* github.com:scylladb/scylladb:
db/config: announce ms format as highest supported
db/config: enable `ms` sstable format by default
cluster/dtest/bypass_cache_test: switch from highest_supported_sstable_format to chosen_sstable_format
api/system: add /system/chosen_sstable_version
test/cluster/dtest: reduce num_tokens to 16
take_snapshot return values were unused so drop them. do_refresh was a
thin wrapper around load_new_sstables that added no logic; inline it
directly into the gather expression.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The goal of test_refresh_deletes_uploaded_sstables is to verify that
sstables are removed from the upload directory after refresh. The replica
check was just a sanity guard; a simple SELECT of all keys is sufficient
and much lighter.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Replace create_dataset() with explicit keyspace creation via new_test_keyspace,
inline CREATE TABLE, and direct cql.run_async inserts — matching the pattern
used in do_test_streaming_scopes. This removes the last dependency on backup
helpers for dataset setup and makes the test self-contained.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Wrap the test body under if True: to pre-indent it, making the subsequent
patch that introduces new_test_keyspace a pure content change with no
whitespace noise.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace create_cluster() from object_store/test_backup.py with a plain
manager.servers_add(2) call. The test does not use object storage, so
there is no need to pull in the backup helper along with its config and
logging knobs.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This PR fixes the Installation page:
- Replaces `http `with `https `in the download command.
- Replaces the Open Source example from the Installation section for CentOS (we overlooked this example before).
Fixes https://github.com/scylladb/scylladb/issues/29087
Fixes https://github.com/scylladb/scylladb/issues/29087
This update affects all supported versions and should be backported as a bug fix.
Closesscylladb/scylladb#29088
* github.com:scylladb/scylladb:
doc: remove the Open Source Example from Installation
doc: replace http with https in the installation instructions
assert_entries_were_added:
- takes a "before" snapshot of the audit log
- yields to execute a statement
- takes an "after" snapshot of the audit log
- computes new rows by diffing "after" minus "before"
If an audit entry generated by prepare() arrives between the snapshot
and the diff, it inflates the new row count and the test fails with
assert 2 <= 1.
Fix by:
- Adding clear_audit_logs() at the end of prepare(), after all setup
- Waiting for the "completed re-reading configuration file" log message
after server_update_config
- Draining pending syslog lines before clearing the buffer
Refs SCYLLADB-573
get_cql_exclusive() creates a Cluster object per call, but never
records it. driver_close() cannot shut it down. The cluster's
internal scheduler thread then tries to submit work to an already
shut down executor. This causes RuntimeError:
RuntimeError: cannot schedule new futures after shutdown
Fix this by tracking every exclusive Cluster in a list and shutting
them all down in driver_close().
Refs SCYLLADB-573
assert_entries_were_added computes new audit rows by slicing the "after"
list at the length of the "before" list: rows_after[len(rows_before):].
This assumes new rows always appear at the tail of the combined sorted
list. In a multinode setup, each node generates its own event_time
timestamps. A new row from node A can sort before an old row from node
B, breaking the tail assumption. The assertion "new rows are not the
last rows in the audit table" then fires.
Fix this by splitting the before/after lists per node and computing the
new rows tail independently for each node. This guarantees that per node
ordering, which is monotonic, is respected, and the combined new rows
are sorted afterwards.
Refs SCYLLADB-573
Since audit tests have been migrated to test/cluster/test_audit.py,
old tests in test/cluster/dtest/audit_test.py have to be removed.
Refs SCYLLADB-573
This patch reorganizes the execution flow of the test functions.
They are grouped to enable cluster reuse between specific test
functions. One of the main contributors to the test execution time
is the cluster preparation. This patch significantly reduces the
total test execution time by having way less new cluster preparation
calls and more cluster reuse.
Performance increase on the developer machine is around 38%:
- before: 4m 29s
- after: 2m 47s
Fixes SCYLLADB-573
Make audit tests from test/cluster/dtest to test/cluster.
test/cluster environment has less overhead, and audit tests
are heavy, their execution taking lots of time. This patch
is part of an effort to improve audit test suite performance.
This patch refactors the tests so that they execute correctly,
as well as enables them. A follow up patch will remove the
audit tests in test/cluster/dtest.
All the tests are confirmed to be running after the change.
No dead code present.
Test test_audit_categories_invalid is not parametrized anymore.
It never used the parametrized helper class, so it just ran
the same logic three times. This is why there are now 74,
and not 76, test executions.
Refs SCYLLADB-573
The default 'cassandra' superuser was removed from ScyllaDB, which
broke PGO training. exec_cql.py relied on username/password auth
('cassandra'/'cassandra') to execute setup CQL scripts like auth.cql
and counters.cql.
Switch exec_cql.py to connect via the Unix domain maintenance socket
instead. The maintenance socket bypasses authentication, no credentials
are needed. Additionally, create the 'cassandra' superuser via the
maintenance socket during the populate phase, so that cassandra-stress
keeps working. cassandra-stress hardcodes user=cassandra password=cassandra.
Changes:
- exec_cql.py: replace host/port/username/password arguments with a
single --socket argument; add connect_maintenance_socket() with
wait ready logic
- pgo.py: add maintenance_socket_path() helper; update
populate_auth_conns() and populate_counters() to pass the socket
path to exec_cql.py
Fixes SCYLLADB-1070
Closesscylladb/scylladb#29081
This patch allows ManagerClient.get_cql_exclusive to accept AuthProvider
as parameter. This will be used in a follow up patch which migrates
audit test suite to test/cluster and requires this functionality for
some tests.
Refs SCYLLADB-573
Before any test, a pool of ScyllaCluster objects is created.
At the beginning of a test suite, a ScyllaClusterManager is created,
and given a reference to the pool.
At the end of a test suite, the ScyllaClusterManager is destroyed.
Before each test case:
- ManagerClient is constructed and connected to the ScyllaClusterManager
of that test suite
- A ScyllaCluster object is fetched from the pool
- If the pool is empty, a new ScyllaCluster object is created
- If the pool is not empty, a cached ScyllaCluster object is returned
After each test case:
- Return ScyllaCluster object from ManagerClient to the pool
- If the cluster is dirty, the pool destroys it
- If the cluster is clean, the pool caches it
- ManagerClient is destroyed
Many actions mark a cluster as dirty. Normal test execution will always
make the cluster be destroyed upon returning to the pool.
ManagerClient.mark_clean is not used in the tests. When it is used,
the flow with cluster reuse happens.
The bug is that the log file is closed even if cluster is not dirty.
This causes an error when trying to log to a reused cluster server.
The solution in this patch is to not close the log file if the cluster
is not dirty. Upon cluster reuse the log file will be open and functional.
Another approach would be to reopen the log file if closed, but this
approach seems more clean.
Refs SCYLLADB-573
This patch just copies the audit test suite from dtest and
disables it in the test config file. Later patches will
update the code and enable the test suite.
Refs SCYLLADB-573
Add tests for RF guardrails (min/max warn/fail, RF=0 bypass,
threshold=-1 disable, ALTER KEYSPACE) and write consistency level
guardrails to cover all scenarios described in guardrails.rst.
Test runtime (dev):
test_guardrail_replication_strategy - 6s
test_guardrail_write_consistency_level - 5s
Refs: SCYLLADB-257
Replace loose substring assertions with regex-based matching against
the exact server message formats. Add regex constants for all
guardrail messages and rewrite create_ks_and_assert_warnings_and_errors()
to verify count and content of warnings and failures.
Refs: SCYLLADB-257
The Web Installer page includes instructions to install the old pre-2025.1 Enterprise versions,
which are no longer supported (since we released 2026.1).
This commit removes those redundant and misleading instructions.
Fixes https://github.com/scylladb/scylladb/issues/29099Closesscylladb/scylladb#29103
This patchset migrates: query_all_directly_granted, query_all,
get_attribute, query_attribute_for_all functions to use cache
instead of doing CQL queries. It also includes some preparatory
work which fixes cache update order and triggering.
Main motivation behind this is to make sure that all calls
from service_level_controller::auth_integration are cached,
which we achieve here.
Alternative implementation could move the whole auth_integration
data into auth cache but since auth_integration manages also lifetime
and contains service levels specific logic such solution would be
too complex for little (if any) gain.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-159
Backport: no, not a bug
Closesscylladb/scylladb#28791
* github.com:scylladb/scylladb:
auth: switch query_attribute_for_all to use cache
auth: switch get_attribute to use cache
auth: cache: add heterogeneous map lookups
auth: switch query_all to use cache
auth: switch query_all_directly_granted to use cache
auth: cache: add ability to go over all roles
raft: service: reload auth cache before service levels
service: raft: move update_service_levels_effective_cache check
`storage_group_of()` sits on the replica-side token lookup hot path, yet it called `tablet_map::get_tablet_id_and_range_side()`, which always computes both the tablet id and the post-split range side — even though most callers only need the storage group id.
The range-side computation is only relevant when a storage group is in tablet splitting mode, but we were paying for it unconditionally on every lookup.
This series fixes that by:
1. Adding `tablet_map::get_tablet_range_side()` so the range side can be computed independently when needed.
2. Adding lazy `select_compaction_group()` overloads that defer the range-side computation until splitting mode is actually active.
3. Switching `storage_group_of()` to use the cheaper `get_tablet_id()` path, only computing the range side on demand.
Improvements. No backport is required.
Closesscylladb/scylladb#28963
* github.com:scylladb/scylladb:
replica/table: avoid computing token range side in storage_group_of() on hot path
replica/compaction_group: add lazy select_compaction_group() overloads
locator/tablets: add tablet_map::get_tablet_range_side()
When encrypted_data_source::get() caches a trailing block in
_next, the next call takes it directly — bypassing
input_stream::read(), which checks _eof. It then calls
input_stream::read_exactly() on the already-drained stream.
Unlike read(), read_up_to(), and consume(), read_exactly()
does not check _eof when the buffer is empty, so it calls
_fd.get() on a source that already returned EOS.
In production this manifested as stuck encrypted SSTable
component downloads during tablet restore: the underlying
chunked_download_source hung forever on the post-EOS get(),
causing 4 tablets to never complete. The stuck files were
always block-aligned sizes (8k, 12k) where _next gets
populated and the source is fully consumed in the same call.
Fix by checking _input.eof() before calling read_exactly().
When the stream already reached EOF, buf2 is known to be
empty, so the call is skipped entirely.
A comprehensive test is added that uses a strict_memory_source
which fails on post-EOS get(), reproducing the exact code
path that caused the production deadlock.
Factor out ks_opts() to build keyspace options with tablets handling
and use it across all existing replication strategy guardrail tests.
No behavioral changes.
This facilitates further modification of the tests later in this
patch series.
Refs: SCYLLADB-257
After repair, the test does a major to compact all sstables into a
single one, so the results can be simply checked by a select from
mutation_fragments() query. Sometimes off-strategy happens parallel to
this major, so after the major there are still 2 sstables, resulting in
the test failing when checking that the query returns just a single row.
To fix, just use tablets for the test table, tablets don't use
off-strategy anymore.
Fixes: SCYLLADB-940
Closesscylladb/scylladb#29071
Previously the test test_interrupt_view_build_shard_registration stopped
the node ungracefully and used commitlog periodic mode to persist the
view build progress in a not very reliable way.
It can happen that due to timing issues, the view build progress is not
persisted, or some of it is persisted in a different ordering than
expected.
To make the test more reliable we change it to stop the node gracefully,
so the commitlog is persisted in a graceful and consistent way, without
using the periodic mode delay. We need to also change the injection for
the shutdown to not get stuck.
Fixes SCYLLADB-1005
Closesscylladb/scylladb#29008
As reported in SCYLLADB-1013, the directory lister must be closed also when an exception is thrown.
For example, see backtrace below:
```
seastar::on_internal_error(seastar::logger&, std::basic_string_view<char, std::char_traits<char>>) at ./build/release/seastar/./seastar/src/core/on_internal_error.cc:57
directory_lister::~directory_lister() at ./utils/lister.cc:77
replica::table::get_snapshot_details(std::filesystem::__cxx11::path, std::filesystem::__cxx11::path) (.resume) at ./replica/table.cc:4081
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<db::snapshot_ctl::table_snapshot_details>::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/coroutine:247
(inlined by) seastar::internal::coroutine_traits_base<db::snapshot_ctl::table_snapshot_details>::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:129
seastar::reactor::task_queue::run_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2695
(inlined by) seastar::reactor::task_queue_group::run_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3201
seastar::reactor::task_queue_group::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3185
(inlined by) seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3353
seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3245
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:266
seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:160
scylla_main(int, char**) at ./main.cc:756
```
Fixes: [SCYLLADB-1013](https://scylladb.atlassian.net/browse/SCYLLADB-1013)
* Requires backport to 2026.1 since the leak exists since 004c08f525
[SCYLLADB-1013]: https://scylladb.atlassian.net/browse/SCYLLADB-1013?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQClosesscylladb/scylladb#29084
* github.com:scylladb/scylladb:
test/boost/database_test: add test_snapshot_ctl_details_exception_handling
table: get_snapshot_details: fix indentation inside try block
table: per-snapshot get_snapshot_details: fix typo in comment
table: per-snapshot get_snapshot_details: always close lister using try/catch
table: get_snapshot_details: always close lister using deferred_close
Run keyspace compaction asynchronously in
`test_tombstone_gc_correctness_during_tablet_split` and only await it
after `split_sstable_rewrite` is disabled.
The problem is that `keyspace_compaction()` starts with a flush, and that
flush can take around five seconds. During that window the split
compaction is stopped before major compaction is retried. The stop aborts
the in-flight major compaction attempt, then the split proceeds far enough
to enter the `split_sstable_rewrite` injection point.
At that point the test used to wait synchronously for major compaction to
finish, but major compaction cannot finish yet: when it retries, it needs
the same semaphore that is still effectively tied up behind the blocked
split rewrite. So the test waits for major compaction, while the split
waits for the injection to be released, and the code that would release
that injection never runs.
Starting major compaction as a task breaks that cycle. The test can first
disable `split_sstable_rewrite`, let the split get out of the way, and
only then wait for major compaction to complete.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-827.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#29066
During decommission, we first mark a topology request as done, then shut
down a node and in the following steps we remove node from the topology.
Thus, finished request does not imply that a node is removed from
the topology.
Due to that, in node_ops_virtual_task::wait, while gathering children
from the whole cluster, we may hit the connection exception - because
a node is still in topology, even though it is down.
Modify the get_children method to ignore the exception and warn
about the failure instead.
Keep token_metadata_ptr in get_children to prevent topology from changing.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-867
Needs backports to all versions
Closesscylladb/scylladb#29035
* github.com:scylladb/scylladb:
tasks: fix indentation
tasks: do not fail the wait request if rpc fails
tasks: pass token_metadata_ptr to task_manager::virtual_task::impl::get_children
Document that `SELECT ... WHERE` clause currently accepts only conjunctions
of relations joined by `AND` (`OR` is not supported), and that
parentheses cannot be used to group boolean subexpressions.
Add an unsupported query example and point readers to equivalent `IN`
rewrites when applicable.
This problem has been raised by one of our users in
https://forum.scylladb.com/t/error-parsing-query-or-unsupported-statement/5299,
and while one could infer answer to user's question by looking at the
syntax of the `SELECT ... WHERE`, it's not immediately obvious to
non-advanced users, so clarifying these concepts is justified.
Fixes: SCYLLADB-1116
Closesscylladb/scylladb#29100
A compaction group has a separator buffer that holds the mixed segments
alive until the separator buffer is flushed. A mixed segment can be
freed only after all separator buffers that hold writes from the segment
are flushed.
Typically a separator buffer is flushed when it becomes full. However
it's possible for example that one compaction groups is filled slower
than others and holds many segments.
To fix this we trigger a separator flush periodically for separator
buffers that hold old segments. We track the active segment sequence
number and for each separator buffer the oldest sequence number it
holds.
Fix the logstor recovery to work with compaction groups. When recovering
a segment find its token range and add it to the appropriate compaction
groups. if it doesn't fit in a single compaction group then write each
record to its compaction group's separator buffer.
Change the primary index to be a btree that is ordered by token,
similarly to a memtable, and create a index per-table instead of a
single global index.
Add a segment_set member to replica::compaction_group that manages the
logstor segments that belong to the compaction group, similarly to how
it manages sstables. Add also a separator buffer in each compaction
group.
When writing a mutation to a compaction group, the mutation is written
to the active segment and to the separator buffer of the compaction
group, and when the separator buffer is flushed the segment is added to
the compaction_group's segment set.
when logstor is enabled, update the db dirty memory limits dynamically.
previously the threshold is set to 0.5 of the available memory, so 0.5
goes to memtables and 0.5 to others (cache).
when logstor is enabled, we calculate the available memory excluding
logstor, and divide it evenly between memtables and cache.
add a write gate to write_buffer. when writing a record to the write
buffer, the gate is held and passed back to the caller, and the caller
holds the gate until the write operation is complete, including
follow-up operations such as updating the index after the write.
in particular, when writing a mutation in logstor::write, the write
buffer is held open until the write is completed and updated in the
index.
when writing the write buffer to the active segment, we write the buffer
and then wait for the write buffer gate to close, i.e. we wait for all
index updates to complete before proceeding. the segment is held open
until all the write operations and index updates are complete.
this property is useful for correctness: when a segment is closed we
know that all the writes to it are updated in the index. this is needed
in compaction for example, where we take closed segments and check
which records in them are alive by looking them up in the index. if the
index is not updated yet then it will be wrong.
implement freeing all segments of a table for table truncate.
first do barrier to flush all active and mixed segments and put all the
table's data in compaction groups, then stop compaction for the table,
then free the table's segments and remove the live entries from the
index.
add barrier operation that forces switch of the active segment and
separator, and waits for all existing segments to close and all
separators to flush.
add tracking of the total separator debt - writes that were written to a
separator and waiting to be flushed, and add flow control to keep the
debt in control by delaying normal writes.
on recovery we may find mixed segments. recover them by adding them to a
separator, reading all their records and writing them to the separator,
and flush the separator.
we free a segment from compaction after updating all live records in the
segment to point to new locations in the index. we need to ensure they
are no running operations that use the old locations before we free the
segment.
initial implementation of the separator. it replaces "mixed" segments -
segments that have records from different groups, to segments by group.
every write is written to the active segment and to a buffer in the
active separator. the active separator has in-memory buffers by group.
at some threshold number of segments we switch the active segment and
separator atomically, and start flushing the separator.
the separator is flushed by writing the buffers into new non-mixed
segments, adding them to a compaction group, and frees the mixed
segments.
initial and basic recovery implementation.
* find all files, read their segments and populate the index with the
newest record for each key.
* find which segments are used and build the usage histogram
add segment generation number that is incremented when the segment is
reused, and it's written to every buffer that is written to the segment.
this is useful for recovery.
reserve segments for compaction so it always has enough segments to run
and doesn't get stuck.
do the compaction writes into full new segments instead of the active
segment.
add group_id value to each log record that is passed with the mutation
when writing it.
the group_id will be used to group log records in segments, such that a
segment will contain records only from a single group.
this will be useful for tablet migration. we want for each tablet to
have their own segments with all their records, so we can migrate them
efficiently by copying these segments.
the group_id value is set to a value equivalent to the tablet id.
basic utility for generation numbers that will be useful next. a
generation number is an unsigned integer that can be incremented and
compared even if it wraparounds, assuming the values we compare were
written around the same time.
initial implementation of the logstor storage engine for key-value
tables that supports writes, reads and basic compaction.
main components:
* logstor: this is the main interface to users that supports writing and
reading back mutations, and manages the internal components.
* index: the primary index in-memory that maps a key to a location on
disk.
* write buffer: writes go initially to a write buffer. it accumulates
multiple records in a buffer and writes them to the segment manager in
4k sized blocks.
* segment manager: manages the storage - files, segments, compaction. it
manages file and segment allocation, and writes 4k aligned buffers to
the active segment sequentially. it tracks the used space in each
segment. the compaction finds segment with low space usage and writes
them to new segments, and frees the old segments.
GRANT/REVOKE fails on the maintenance socket connections, because maintenance_auth_service uses allow_all_authorizer. allow_all_authorizer allows all operations, but not GRANT/REVOKE, because they make no sense in its context.
This has been observed during PGO run failure in operations from ./pgo/conf/auth.cql file.
This patch introduces maintenance_socket_authorizer that supports the capabilities of default_authorizer ('CassandraAuthorizer') without needing authorization.
Refs SCYLLADB-1070
This is an improvement, no need for backport.
Closesscylladb/scylladb#29080
* github.com:scylladb/scylladb:
test: use NetworkTopologyStrategy in maintenance socket tests
test: use cleanup fixture in maintenance socket auth tests
auth: add maintenance_socket_authorizer
Make the pattern static const so it is compiled once at first call rather
than on every Content-Range header parse.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29054
It handles 0, and could generate better code for that. On Broadwell
architecture, it translates to a single instruction (LZCNT). We're
still on Westmere, so it translates to BSR with a conditional move.
Also, drop unnecessary casts and bit arithmetic, which saves a few
instructions.
Move to header so that it's inlined in parsers.
This change reduces the cost of partition index page construction and
LSA migration. This is achieved by several things working together:
- index entries don't store keys as separate small objects (managed_bytes)
They are written into one managed_bytes fragmented storage, entries
hold offset into it.
Before, we paid 16 bytes for managed_bytes plus LSA descriptor for
the storage (1 byte) plus back-reference in the storage (8 bytes),
so 25 bytes. Now we only pay 4 bytes for the size offset. If keys are 16
bytes, that's a reduction from 31 bytes to 20 bytes per key.
- index entries and key storage are now trivially moveable, so LSA
migration can use memcpy() which amortizes the cost per key.
memcpy().
LSA eviction is now trivial and constant time for the whole page
regardless of the number of entries. Page eviction dropped from
14 us to 1 us.
This improves throughput in a CPU-bound miss-heavy read workload where
the partition index doesn't fit in memory.
scylla perf-simple-query -c1 -m200M --partitions=1000000
Before:
15328.25 tps (150.0 allocs/op, 14.1 logallocs/op, 45.4 tasks/op, 286769 insns/op, 218134 cycles/op, 0 errors)
15279.01 tps (149.9 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 287696 insns/op, 218637 cycles/op, 0 errors)
15347.78 tps (149.7 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 285851 insns/op, 217795 cycles/op, 0 errors)
15403.68 tps (149.6 allocs/op, 14.1 logallocs/op, 45.2 tasks/op, 285111 insns/op, 216984 cycles/op, 0 errors)
15189.47 tps (150.0 allocs/op, 14.1 logallocs/op, 45.5 tasks/op, 289509 insns/op, 219602 cycles/op, 0 errors)
15295.04 tps (149.8 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 288021 insns/op, 218545 cycles/op, 0 errors)
15162.01 tps (149.8 allocs/op, 14.1 logallocs/op, 45.4 tasks/op, 291265 insns/op, 220451 cycles/op, 0 errors)
After:
21620.18 tps (148.4 allocs/op, 13.4 logallocs/op, 43.7 tasks/op, 176817 insns/op, 153183 cycles/op, 0 errors)
20644.03 tps (149.8 allocs/op, 13.5 logallocs/op, 44.3 tasks/op, 187941 insns/op, 160409 cycles/op, 0 errors)
20588.06 tps (150.1 allocs/op, 13.5 logallocs/op, 44.5 tasks/op, 188090 insns/op, 160818 cycles/op, 0 errors)
20789.29 tps (149.5 allocs/op, 13.5 logallocs/op, 44.2 tasks/op, 186495 insns/op, 159382 cycles/op, 0 errors)
20977.89 tps (149.5 allocs/op, 13.4 logallocs/op, 44.2 tasks/op, 183969 insns/op, 158140 cycles/op, 0 errors)
21125.34 tps (149.1 allocs/op, 13.4 logallocs/op, 44.1 tasks/op, 183204 insns/op, 156925 cycles/op, 0 errors)
21244.42 tps (148.6 allocs/op, 13.4 logallocs/op, 43.8 tasks/op, 181276 insns/op, 155973 cycles/op, 0 errors)
Mostly because the index now fits in memory.
When it doesn't, the benefits are still visible due to lower LSA overhead.
It's shorter, and is supposed to be optimized for trivially-moveable
types.
Important for managed_vector<index_entry>, which can have lots of
elements.
Densely populated pages have no promoted index (small partitions), so
we can save space in such workloads by keeping promoted index in a
separate vector.
For workloads which do have a promoted index, pages have only one
partition. There aren't many such pages and they are long-lived, so
the extra allocation of the vector is amortized.
promoted_index class is removed, and replaced with equivalent
parsed_promoted_index_entry for simplicity. Because it's removed,
make_cursor() is moved into the index_reader class.
Reducing the size of index_entry is important for performence if pages
are densly populated. It helps to reduce LSA allocator pressure and
compaction/eviction speed.
This change, combined with the earlier change "Shave-off 16 bytes from
index_entry by using raw_token", gives significant improvement in
throughput in perf_simple_query run where the index doesn't fit in
memory:
scylla perf-simple-query -c1 -m200M --partitions=1000000
Before:
9714.78 tps (170.9 allocs/op, 16.9 logallocs/op, 55.3 tasks/op, 494788 insns/op, 343920 cycles/op, 0 errors)
9603.13 tps (171.6 allocs/op, 17.0 logallocs/op, 55.6 tasks/op, 502358 insns/op, 348344 cycles/op, 0 errors)
9621.43 tps (171.9 allocs/op, 17.0 logallocs/op, 55.8 tasks/op, 500612 insns/op, 347508 cycles/op, 0 errors)
9597.75 tps (171.6 allocs/op, 17.0 logallocs/op, 55.6 tasks/op, 501428 insns/op, 348604 cycles/op, 0 errors)
9615.54 tps (171.6 allocs/op, 16.9 logallocs/op, 55.6 tasks/op, 501313 insns/op, 347935 cycles/op, 0 errors)
9577.03 tps (171.8 allocs/op, 17.0 logallocs/op, 55.7 tasks/op, 503283 insns/op, 349251 cycles/op, 0 errors)
After:
15328.25 tps (150.0 allocs/op, 14.1 logallocs/op, 45.4 tasks/op, 286769 insns/op, 218134 cycles/op, 0 errors)
15279.01 tps (149.9 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 287696 insns/op, 218637 cycles/op, 0 errors)
15347.78 tps (149.7 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 285851 insns/op, 217795 cycles/op, 0 errors)
15403.68 tps (149.6 allocs/op, 14.1 logallocs/op, 45.2 tasks/op, 285111 insns/op, 216984 cycles/op, 0 errors)
15189.47 tps (150.0 allocs/op, 14.1 logallocs/op, 45.5 tasks/op, 289509 insns/op, 219602 cycles/op, 0 errors)
15295.04 tps (149.8 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 288021 insns/op, 218545 cycles/op, 0 errors)
15162.01 tps (149.8 allocs/op, 14.1 logallocs/op, 45.4 tasks/op, 291265 insns/op, 220451 cycles/op, 0 errors)
The std::optional<> adds 8 bytes.
And dht::token adds 8 bytes due to _kind, which in this case is always
kind::key.
The size changd from 56 to 48 bytes.
During decommission, we first mark a topology request as done, then shut
down a node and in the following steps we remove node from the topology.
Thus, finished request does not imply that a node is removed from
the topology.
Due to that, in node_ops_virtual_task::wait, while gathering children
from the whole cluster, we may hit the connection exception - because
a node is still in topology, even though it is down.
Modify the get_children method to ignore the exception and warn
about the failure instead.
In get_children we get the vector of alive nodes with get_nodes.
Yet, between this and sending rpc to those nodes there might be
a preemption. Currently, the liveness of a node is checked once
again before the rpcs (only with gossiper not in topology - unlike
get_nodes).
Modify get_children, so that it keeps a token_metadata_ptr,
preventing topology from changing between get_nodes and rpcs.
Remove test_get_children as it checked if the get_children method
won't fail if a node is down after get_nodes - which cannot happen
currently.
Fixes: SCYLLADB-942
Adds an injection signal _from_ table::seal_active_memtable to allow us to
reliably wait for flushing. And does so.
Closesscylladb/scylladb#29070
Since commit 509f2af8db, gate_closed_exception can be triggered for ongoing split during shutdown. The commit is correct, but it causes split failure on shutdown to log an error, which causes CI instability. Previously, aborted_exception would be triggered instead which is logged as warning. Let's do the same.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-951.
Fixes https://github.com/scylladb/scylladb/issues/24850.
Only 2026.1 is affected.
Closesscylladb/scylladb#29032
* github.com:scylladb/scylladb:
replica: Demote log level on split failure during shutdown
service: Demote log level on split failure during shutdown
The limiter scans ranges to decide whether or not to rate-limit the
query. However, when considering each range only the front one's token
is accounted. This looks like a misprint.
The limiter was introduced in cc9a2ad41f
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29050
A followup of the merge of two test cases that happened in the previous
patch. Both used `foo = N if domain == bar else M` to evaluate the
parameters for topology. Using if-else block makes it immediately obvious
which topology and scope apply for each domain value without having to
evaluate multiple inline conditionals.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The two tests differ only in the way they set up the topology for the
cluster and the post-restore checks against the resulting streams.
The merge happens with the help of a "scope_is_same" boolean parameter
and corresponding updates in the topology setup and post-checks.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The one in "different domain" test is simpler because the test performs
less checks. Next patch will merge both tests and making regexp-s look
identical makes the merge even smother.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Update test running instructions to reflect unified pytest-based runner.
The test.py now requires full test paths with file extensions for both
C++ and Python tests.
No backport: The change is only relevant for recent test.py changes in
master.
Closesscylladb/scylladb#29062
The auth::cache::includes_table function also covers role_members and
role_attributes. The existing check was removed because it blocked these
tables from triggering necessary cache updates.
While previously non-critical (due to unused attributes and table coupling),
maintaining a correct cache is essential for upcoming changes.
Verify that the directory listers opened by get_snapshot_details
are properly closed when handling an (injected) exception.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The comment says the snapshot directory may contain a `schema.sql` file,
but the code treats `schema.cql` as the special-case schema file.
Reported-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Since this is a coroutine, we cannot just use deferred_close,
but rather we need to catch an error, close the lister, and then
return the error, is applicable.
Fixes: SCYLLADB-1013
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
NetworkTopologyStrategy is the preferred choice. We should not
use SimpleStrategy anymore. This patch changes the topology strategy
for all the maintenance socket tests.
Refs SCYLLADB-1070
Add a cql_clusters pytest fixture that tracks CQL driver Cluster
objects and shuts them down automatically after test completion.
This replaces manual shutdown() calls at the end of each test.
Also consolidate shutdown() calls in retry helpers into finally
blocks for consistent cleanup.
Refs SCYLLADB-1070
GRANT/REVOKE fails on the maintenance socket connections,
because maintenance_auth_service uses allow_all_authorizer.
allow_all_authorizer allows all operations, but not GRANT/REVOKE,
because they make no sense in its context.
This has been observed during PGO run failure in operations from
./pgo/conf/auth.cql file.
This patch introduces maintenance_socket_authorizer that supports
the capabilities of default_authorizer ('CassandraAuthorizer')
without needing authorization.
Refs SCYLLADB-1070
This patch is mostly for the purpose of running pgo CI job.
We may receive connection error if asyncio.sleep(5) in
pgo.py is not sufficient waiting time.
In pgo.py we do wait for port but only for cql,
anyway it's better to have high level check than
trying to wait for alternator port there.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1071
Backport: 2026.1 - it failed on CI for that build
Closesscylladb/scylladb#29063
* github.com:scylladb/scylladb:
perf: add abort_source support to wait-for-port loops
perf-alternator: wait for alternator port before running workload
- fix s3::range max value for object size which is 50TiB and not 5.
- refactor constants to make it accessible for all interested parties, also reuse these constants in tests
No need to backport, doubt we will encounter an object larger than 5TiB
Closesscylladb/scylladb#28601
* github.com:scylladb/scylladb:
s3_client: reorganize tests in part_size_calculation_test
s3_client: switch using s3 limits constants in tests
s3_client: fix the s3::range max object size
s3_client: remove "aws" prefix from object limits constants
s3_client: make s3 object limits accessible
This commit replaces the Open Soruce example from the Installation section for CentOS.
We updated the example for Ubuntu, but not for CentOS.
We don't want to have any Open Source information in the docs.
Fixes https://github.com/scylladb/scylladb/issues/29087
This test was observed to fail in CI recently but there is not enough information in the logs to figure out what went wrong. This PR makes a few improvements to make the next investigation easier, should it be needed:
* storage-service: add table name to mutation write failure error messages.
* database: the `database_apply` error injection used to cause trouble, catching writes to bystander tables, making tests flaky. To eliminate this, it gained a filter to apply only to non-system keyspaces. Unfortunately, this still allows it to catch writes to the trace tables. While this should not fail the test, it reduces observability, as some traces disappear. Improve this error injection to only apply to selected table. Also merge it with the `database_apply_wait` error injection, to streamline the code a bit.
* test/test_data_resurrection_in_memtable.py: dump data from the datable, before the checks for expected data, so if checks fail, the data in the table is known.
Refs: SCYLLADB-812
Refs: SCYLLADB-870
Fixes: SCYLLADB-1050 (by restricting `database_apply` error injection, so it doesn't affect writes to system traces)
Backport: test related improvement, no backport
Closesscylladb/scylladb#28899
* github.com:scylladb/scylladb:
test/cluster/test_data_resurrection_in_memtable.py: dump rows before check
replica/database: consolidate the two database_apply error injections
service/storage_proxy: add name of table to error message for write errors
Previously, all stream-table fixtures in test_streams.py used scope="function",
forcing a fresh table to be created for every test, slowing down the test a bit
(though not much), and discouraging writing small new tests.
This was a workaround for a DynamoDB quirk (that Alternator doesn't have):
LATEST shard iterators have a time slack and may point slightly before the true
stream head, causing leftover events from a previous test to appear in the next
test's reads.
The first two tests in this series fix small problems that turn up once we start
sharing test tables in test_streams.py. The final patch fixes the "LATEST" problem
and enables sharing the test table by using "module" scope fixtures instead of
"function".
After this series, test_streams.py run time went down a bit, from 20.2 seconds to 17.7 seconds.
Closesscylladb/scylladb#28972
* github.com:scylladb/scylladb:
test/alternator: speed up test_streams.py by using module-scope fixtures
test/alternator: test_streams.py don't use fixtures in 4 tests
test/alternator: fix do_test() in test_streams.py
This test relies on the cache entry being evicted after 200ms past the
TTL. This may not happen on a busy CI machine. Make the test less
reliant on timing by using eventually_true().
Simplify the test by dropping the second entry, it doesn't add anything
to the test.
Fixes: SCYLLADB-811
Closesscylladb/scylladb#28958
This fixtures starts the mock server and immediately connects to it to
setup the expected requests. The connection attempt might be too early,
so there is a retry loop with a timeout. The loop currently checks for
requests.exception.ConnectionError. We've seen a case where the
connection is successful but the request fails with 404. The mock
started the server but didn't setup the routes yet. Add a retry for http
404 to handle this.
Fixes: SCYLLADB-966
Closesscylladb/scylladb#29003
When handling `repair_stream_cmd::end_of_current_rows`, passing the
foreign list directly to `put_row_diff_handler` triggered a massive
synchronous deep copy on the destination shard. Additionally, destroying
the list triggered a synchronous deallocation on the source shard. This
blocked the reactor and triggered the CPU stall detector.
This commit fixes the issue by introducing `clone_gently()` to copy the
list elements one by one, and leveraging the existing
`utils::clear_gently()` to destroy them. Both utilize
`seastar::coroutine::maybe_yield()` to allow the reactor to breathe
during large cross-shard transfers and cleanups.
Fixes SCYLLADB-403
Closesscylladb/scylladb#28979
vector
The lister loop in get() pre-fetches records in batches and keeps them
in a _info vector, iterating over it with the help of _pos cursor. When
the vector is re-read, the cursor must be reset too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The lister loop in get() method looks weird. It uses do-while(false)
loop and calls continue; inside when filter asks to skip a entry.
Skipping, thus, aborts the whole thing and EOF-s, which is not what's
supposed to happen.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This series fixes a metrics visibility gap in Alternator and adds regression coverage.
Until now, BatchGetItem and BatchWriteItem updated global latency histograms but did not consistently update per-table latency histograms. As a result, table-level latency dashboards could miss batch traffic.
It updates the batch read/write paths to compute request duration once and record it in both global and per-table latency metrics.
Add the missing tests, including a metric-agnostic helper and a dedicated per-table latency test that verifies latency counters increase for item and batch operations.
This change is metrics-only (no API/behavior change for requests) and improves observability consistency between global and per-table views.
Fixes#28721
**We assume the alternator per-table metrics exist, but the batch ones are not updated**
Closesscylladb/scylladb#28732
* github.com:scylladb/scylladb:
test(alternator): add per-table latency coverage for item and batch ops
alternator: track per-table latency for batch get/write operations
Remove outdated references to filtering on columns provided in the
index definition, and remove the note about equal relations (= and IN)
being the only supported operations. Vector search filtering currently
supports WHERE clauses on primary key columns only.
Closesscylladb/scylladb#28949
Permits in the `waiting_for_memory` state represent already-executing reads that are blocked on memory allocation. Preemptively aborting them is wasteful -- these reads have already consumed resources and made progress, so they should be allowed to complete.
Restrict the preemptive abort check in maybe_admit_waiters() to only apply to permits in the `waiting_for_admission` state, and tighten the state validation in `on_preemptive_aborted()` accordingly.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1016
Backport not needed. The commit introducing replica load shedding is not part of 2026.1
Closesscylladb/scylladb#29025
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: skip preemptive abort for permits waiting for memory
reader_concurrency_semaphore_test: detect memory leak on preemptive abort of waiting_for_memory permit
Check abort_source on each retry iteration in
wait_for_alternator and wait_for_cql so the
wait can be interrupted on shutdown.
Didn't use sleep_abortable as the sleep is very short
anyway.
Fixes: SCYLLADB-244
Disables snapshot control such that any active ops finish/fail
before proceeding with decommission.
Note: snapshot control provided as argument, not member ref
due to storage_service being used from both main and cql_test_env.
(The latter has no snapshot_ctl to provide).
Could do the snapshot lockout on API level, but want to do
pre-checks before this.
Note: this just disables backup/snapshot fully. Could re-enable
after decommission, but this seems somewhat pointless.
v2:
* Add log message to snapshot shutdown
* Make test use log waiting instead of timeouts
Closesscylladb/scylladb#28980
This patch is mostly for the purpose of running pgo CI job.
We may receive connection error if asyncio.sleep(5) in
pgo.py is not sufficient waiting time.
In pgo.py we do wait for port but only for cql,
anyway it's better to have high level check than
trying to wait for alternator port there.
Some tests, when create a cluster, configure nodes with the rf-rack-valid option, because sometimes they want to have it OFF. For that the option is explicitly carried around, but the cluster creating helper can guess this option itself -- out of the provided topology and replication factor.
Removing this option simplifies the code and (which a nicer outcome) the test "signature" that's used e.g. in command-line to run a specific test.
Improving tests, not backporting
Closesscylladb/scylladb#28860
* github.com:scylladb/scylladb:
test: Relax topology_rf_validity parameter for some tests
test: Auto detect rf-rack-valid option in create_cluster()
Dtest failed with:
table - Failed to load SSTable .../me-3gyn_0qwi_313gw2n2y90v2j4fcv-big-Data.db
of origin memtable due to std::runtime_error (Cannot split
.../me-3gyn_0qwi_313gw2n2y90v2j4fcv-big-Data.db because manager has compaction
disabled, reason might be out of space prevention), it will be unlinked...
The reason is that the error above is being triggered when the cause is
shutdown, not out of space prevention. Let's distinguish between the two
cases and log the error with warning level on shutdown.
Fixes https://github.com/scylladb/scylladb/issues/24850.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
`test_raft_no_quorum.py::test_cannot_add_new_node` is currently flaky in dev
mode. The bootstrap of the first node can fail due to `add_entry()` timing
out (with the 1s timeout set by the test case).
Other test cases in this test file could fail in the same way as well, so we
need a general fix. We don't want to increase the timeout in dev mode, as it
would slow down the test. The solution is to keep the timeout unchanged, but
set it only after quorum is lost. This prevents unexpected timeouts of group0
operations with almost no impact on the test running time.
A note about the new `update_group0_raft_op_timeout` function: waiting for
the log seems to be necessary only for
`test_quorum_lost_during_node_join_response_handler`, but let's do it
for all test cases just in case (including `test_can_restart` that shouldn't
be flaky currently).
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-913Closesscylladb/scylladb#28998
Since commit 509f2af8db, gate_closed_exception can be triggered
for ongoing split during shutdown. The commit is correct, but it
causes split failure on shutdown to log an error, which causes
CI instability. Previously, aborted_exception would be triggered
instead which is logged as warning. Let's do the same.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-951.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Replaced multiple per-action workflow jobs with a single consolidated
call to main_pr_events_jira_sync.yml. Added 'edited' event trigger.
This makes CI actions in PRs more readable and workflow execution faster.
Fixes:PM-253
Closesscylladb/scylladb#29042
changes in this commit:
1)rename class from 'TestContext' to 'Context' so pytest will not consider this class as a test
2)extend pytest filterwarnings list to ignore warnings from external libs
3) use datetime.datetime.now(datetime.UTC) unstead datetime.datetime.utcnow()
4) use ResultSet.one() instead ResultSet[0]
Fixes SCYLLADB-904
Fixes SCYLLADB-908
Related SCYLLADB-902
Closesscylladb/scylladb#28956
The test assumes that the sleep duration will be at least the value of
the sleep parameter. However, the actual sleep time can be slightly less
than requested (e.g., a 100ms sleep request might result in a 99ms
sleep).
This commit adjusts the test's time comparison to be more lenient,
preventing test flakiness.
When a `with_connect` operation timed out, the underlying connection
attempt continued to run in the reactor. This could lead to a crash
if the connection was established/rejected after the client object had
already been destroyed. This issue was observed during the teardown
phase of a upcoming high-availability test case.
This commit fixes the race condition by ensuring the connection attempt
is properly canceled on timeout.
Additionally, the explicit TLS handshake previously forced during the
connection is now deferred to the first I/O operation, which is the
default and preferred behavior.
Fixes: SCYLLADB-832
In this series we add support for forwarding strongly consistent CQL requests to suitable replicas, so that clients can issue reads/writes to any node and have the request executed on an appropriate tablet replica (and, for writes, on the Raft leader). We return the same CQL response as what the user would get while sending the request to the correct replica and we perform the same logging/stats updates on the request coordinator as if the coordinator was the appropriate replica.
The core mechanism of forwarding a strongly consistent request is sending an RPC containing the user's cql request frame to the appropriate replica and returning back a ready, serialized `cql_transport::response`. We do this in the CQL server - it is most prepared for handling these types and forwarding a request containing a CQL frame allows us to reuse near-top-level methods for CQL request handling in the new RPC handler (such as the general `process`)
For sending the RPC, the CQL server needs to obtain the information about who should it forward the request to. This requires knowledge about the tablet raft group members and leader. We obtain this information during the execution of a `cql3/strong_consistency` statement, and we return this information back to the CQL server using the generalized `bounce_to_shard` `response_message`, where we now store the information about either a shard, or a specific replica to which we should forward to. Similarly to `bounce_to_shard`, we need to handle this `result_message` in a loop - a replica may move during statement execution, or the Raft leader can change. We also use it for forwarding strongly consistent writes when we're not a member of the affected tablet raft group - in that case we need to forward the statement twice - once to any replica of the affected tablet, then that replica can find the leader and return this information to the coordinator, which allows the second request to be directed to the leader.
This feature also allows passing through exception messages which happened on the target replica while executing the statement. For that, many methods of the `cql_transport::cql_server::connection` for creating error responses needed to be moved to `cql_transport::cql_server`. And for final exception handling on the coordinator, we added additional error info to the RPC response, so that the handling can be performed without having the `result_message::exception` or `exception_ptr` itself.
Fixes [SCYLLADB-71](https://scylladb.atlassian.net/browse/SCYLLADB-71)
[SCYLLADB-71]: https://scylladb.atlassian.net/browse/SCYLLADB-71?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQClosesscylladb/scylladb#27517
* github.com:scylladb/scylladb:
test: add tests for CQL forwarding
transport: enable CQL forwarding for strong consistency statements
transport: add remote statement preparation for CQL forwarding
transport: handle redirect responses in CQL forwarding
transport: add exception handling for forwarded CQL requests
transport: add basic CQL request forwarding
idl: add a representation of client_state for forwarding
cql_server: handle query, execute, batch in one case
transport: inline process_on_shard in cql_server::process
transport: extract process() to cql_server
transport: add messaging_service to cql_server
transport: add response reconstruction helpers for forwarding
transport: generalize the bounce result message for bouncing to other nodes
strong consistency: redirect requests to live replicas from the same rack
transport: pass foreign_ptr into sleep_until_timeout_passes and move it to cql_server
transport: extract the error handling from process_request_one
transport: move error response helpers from connection to cql_server
Update warn and fail messages for the write_consistency_levels_warned
and write_consistency_levels_disallowed guardrails to include the
configuration option name and actionable guidance. The main motivation
is to make the messages follow the conventions of other guardrails.
Refs: SCYLLADB-257
When computing table sizes via load_stats to determine if a split/merge is needed, we are filtering tablets which are being migrated, in order to avoid counting them twice (both on leaving and pending replica) in the total table size. The tablets are filtered so that they are counted on the leaving replica until the streaming stage, and on the pending replica after the streaming stage.
Currently, the procedure for collecting tablet sizes for load balancing also uses this same filter. This should be changed, because the load balancer needs to have as much information about tablet sizes as possible, and could ignore a node due to missing tablet sizes for tablets in the `write_both_read_new` and `use_new` stages.
For tablet size collection, we should include all the tablets which are currently taking up disk space. This means:
- on leaving replica, include all tablets until the `cleanup` stage
- on pending replica, include all tablets starting with the `write_both_read_new` and later stages
While this is an improvement, it causes problems with some of the tests, and therefore needs to be backported to 2026.1
Fixes: SCYLLADB-829
Closesscylladb/scylladb#28587
* github.com:scylladb/scylladb:
load_stats: add filtering for tablet sizes
load_stats: move tablet filtering for table size computation
load_stats: bring the comment and code in sync
Tests that call create_cluster() helper no longer need to carry the
rf-validity parameter. This simplifies the code and test signature.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The helper accepts its as boolean argument, but it can easily estimate
one from the provided topology.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Before b59b3d4 the migration code checked that service level controller
is on v2 version before migration and the check also implicitly checked
that _sl_data_accessor field is already initialized, but now that the
check is gone the migration can start before service level controller is
fully initialized. Re add the check, but to a different place.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1049Closesscylladb/scylladb#29021
Permits in the `waiting_for_memory` state represent already-executing
reads that are blocked on memory allocation. Preemptively aborting
them is wasteful -- these reads have already consumed resources and
made progress, so they should be allowed to complete.
Restrict the preemptive abort check in maybe_admit_waiters() to only
apply to permits in the `waiting_for_admission` state, and tighten
the state validation in `on_preemptive_aborted()` accordingly.
Adjust the following tests:
+ test_reader_concurrency_semaphore_abort_preemptively_aborted_permit
no longer relies on requesting memory
+ test_reader_concurrency_semaphore_preemptive_abort_requested_memory_leak
adjusted to the fix
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1016
- Added VECTOR to the comma-separated list of Jira project keys in `call_sync_milestone_to_jira.yml`.
- The `jira_project_keys` value changed from `SCYLLADB,CUSTOMER,SMI,RELENG` to `SCYLLADB,CUSTOMER,SMI,RELENG,VECTOR`.
- The VECTOR project needs to sync with scylladb.git milestones, so that when a GitHub milestone is created or closed in scylladb/scylladb, the corresponding Jira release is also created or released in the VECTOR project.
- Previously only SCYLLADB, CUSTOMER, SMI, and RELENG projects were synced.
Fixes:PM-220
Closesscylladb/scylladb#29014
This PR adds integrity verification for SSTable component files during loading. When component digests are present in Scylla metadata, the loader now validates each component's CRC32 digest against the stored expected value, catching silent corruption of component files. Index, Rows and Partitions components digests are also validated duriung scrub in validate mode
Added corruption tests that write an SSTable, flip a bit in a specific component file, then verify that reloading the SSTable detects the corruption and throws the expected exception.
Depends on https://github.com/scylladb/scylladb/pull/28338
Backport is not required, this is new feature
Fixes https://github.com/scylladb/scylladb/issues/20103Closesscylladb/scylladb#28761
* github.com:scylladb/scylladb:
test/cqlpy: test --ignore-component-digest-mismatch flag in scylla sstable upgrade
docs: document --ignore-component-digest-mismatch flag for scylla sstable upgrade
sstables: propagate ignore_component_digest_mismatch config to all load sites
sstables: add option to ignore component digest mismatches
sstable_compaction_test: Add scrub validate test for corrupted index
sstables: add tests for component digest validation on corrupted SSTables
sstables: validate index components digests during SSTable scrub in validate mode
sstables: verify component digests on SSTable load
sstables: add digest_file_random_access_reader for CRC32 digest computation
Fix several test cases that did not await async tasks:
- test_restart_leaving_replica_during_cleanup
- test_restart_in_cleanup_stage_after_cleanup
- test_tablet_back_and_forth_migration
- test_staging_backlog_is_preserved_with_file_based_streaming
Fixes SCYLLADB-910
* Minor fixes, no backport needed
Closesscylladb/scylladb#28908
* github.com:scylladb/scylladb:
test_tablets_migration: test_staging_backlog_is_preserved_with_file_based_streaming: convert for loop to asyncio.gather
test_tablets_migration: test_tablet_back_and_forth_migration: await move_tablet
test_tablets_migration: test_restart_in_cleanup_stage_after_cleanup: await move_task
test_tablets_migration: test_restart_leaving_replica_during_cleanup: await move_task
test_tablets_migration: drop unused imports from cassandra.query
Fixes#25084
Add slirp4netns and use for nested containers. This will allow nested container port aliasing, helping CI stability.
Note: this contains and updated Dockerfile for dbuild image, but since chicken and eggs, right now will force install slirp4netns before anything in dbuild script.
Updates the mock server handling to use ephemeral ports and query from container, ensuring we don't get port collisions. (boost as well as pytest).
Includes a timeout up, and a tweak to our scylla_cluster handling, ensuring we don't deadlock when pipe size is less than requires for our sys notify messages.
Closesscylladb/scylladb#28727
* github.com:scylladb/scylladb:
gcs_fixture: Change to use docker helper
aws_kms_fixture: Modify to use docker helper
test/lib/proc_util: Add docker helper
pytest: use ephemeral port publish for docker mock servers
dbuild: Use container network in dbuild nested containers
scylla_cluster: Read notify sock in background to prevent deadlock
A few days ago, in commit 7b30a39 we added to pytest.ini the option
xfail_strict. This option causes every time a test XPASSes, i.e., an xfail
test actually passes - to be considered an error and fail the test.
But some tests demonstrate a timing-related bug and do not reproduce the
bug every single time. An example we noticed in one CI run is:
test/cluster/test_alternator.py::test_alternator_concurrent_rmw_same_partition_different_server
This test reproduces a timing-related bug (if you do an LWT write to
one partition on to two different coordinators "at the same time", you
can get a failure), but only most of the time, not 100% of the time.
The solution is to add "strict=False" for the xfail marker on this specific
test. This undoes the xfail_strict for this specific test, accepting that
this specific test can either pass or fail. Note that this does NOT make
this test worthless - we still see this test failing most of the time, and
when a developer finally fixes this issue, the test will begin to pass all
the time.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-941
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#29016
When it deadlocks, groups stop merging and compaction group merge
backlog will run-away.
Also, graceful shutdown will be blocked on it.
Found by flaky unit test
test_merge_chooses_best_replica_with_odd_count, which timed-out in 1
in 100 runs.
Reason for deadlock:
When storage groups are merged, the main compaction group of the new
storage group takes a compaction lock, which is appended to
_compaction_reenablers_for_merging, and released when the merge
completion fiber is done with the whole batch.
If we accumulate more than 1 merge cycle for the fiber, deadlock
occurs. Lock order will be this
Initial state:
cg0: main
cg1: main
cg2: main
cg3: main
After 1st merge:
cg0': main [locked], merging_groups=[cg0.main, cg1.main]
cg1': main [locked], merging_groups=[cg2.main, cg3.main]
After 2nd merge:
cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main]
merge completion fiber will try to stop cg0'.main, which will be
blocked on compaction lock. which is held by the reenabler in
_compaction_reenablers_for_merging, hence deadlock.
The fix is to wait for background merge to finish before we start the
next merge. It's achieved by holding old erm in the background merge,
and doing a topology barrier from the merge finalizing transition.
Background merge is supposed to be a relatively quick operation, it's
stopping compaction groups. So may wait for active requests. It
shouldn't prolong the barrier indefinitely.
Tablet boost unit tests which trigger merge need to be adjusted to
call the barrier, otherwise they will be vulnerable to the deadlock.
Two cluster tests were removed because they assumed that merge happens
in the backgournd. Now that it happens as part of merge finalization,
and blocks topology state machine, those tests deadlock because they
are unable to make topology changes (node bootstrap) while background
merge is blocked.
The test "test_tablets_merge_waits_for_lwt" needed to be adjusted. It
assumed that merge finalization doesn't wait for the erm held by the
LWT operation, and triggered tablet movement afterwards, and assumed
that this migration will issue a barrier which will block on the LWT
operation. After this commit, it's the barrier in merge finalization
which is blocked. The test was adjusted to use an earlier log mark
when waiting for "Got raft_topology_cmd::barrier_and_drain", which
will catch the barrier in merge finalization.
Fixes SCYLLADB-928
Needs to be ordered before split finalization, because storage_group
must be in split mode already at finalization time. There must be
split-ready compaction groups, otherwise finalization fails with this
error:
Found 0 split ready compaction groups, but expected 2 instead.
Exposed by increased split activity in tests.
Will be called in tests. It does the local part of the global topology
barrier.
The comment:
// We capture the topology version right after the checks
// above, before any yields. This is crucial since _topology_state_machine._topology
// might be altered concurrently while this method is running,
// which can cause the fence command to apply an invalid fence version.
was dropped, because it's no longer true after
fad6c41cee, and it doesn't make sense in
the context of local_topology_barrier(). We'd have to propagate the
version to local_topology_barrier(), but it's pointless. The fence
version is decided before calling the local barrier, and it will be
valid even if local version moves ahead.
This is short cleanup after recent removal of creating default cassandra superuser and auth-v1 code removal.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1036
Backport: no, just code cleanup
Closesscylladb/scylladb#29004
* github.com:scylladb/scylladb:
auth: remove DEFAULT_SUPERUSER_NAME constant and dead DEFAULT_USER_PASSWORD
auth: use configurable default_superuser in describe_roles
auth: move default_superuser to common, remove _superuser member
auth: use LOCAL_ONE for all auth queries
auth: remove get_auth_ks_name indirection
can_use_effective_service_level_cache() always returns true now, so the function can be dropped entirely and all the code that assumes it may return false can be dropped as well. Also drop async versions of find_effective_service_level and get_user_scheduling_group since they are unused.
No need to backport, code removal,
Closesscylladb/scylladb#29002
* github.com:scylladb/scylladb:
service level: make maybe_update_per_service_level_params synchronous
service level: remove unused get_user_scheduling_group function
service level: drop async find_effective_service_level
service level: remove remnants of version 1 service level
Introduced by 54bddeb3b5, the yield was
added to write_cell(), to also help the general case where there is no
collection. Arguably this was unnecessary and this patch moves the yield
to write_collection(), to the cell write loop instead, so regular cells
don't have to poll the preempt flag.
Closesscylladb/scylladb#29013
This PR shortens two sleeps from 1s to 100ms to speed up bootstrap in tests.
The changed sleeps are:
- the pause duration in group0 discovery,
- the retry period in `wait_for_cql`.
Refs: https://scylladb.atlassian.net/browse/SCYLLADB-918
No backport: performance improvements mostly relevant to tests.
Closesscylladb/scylladb#29020
* github.com:scylladb/scylladb:
test: pylib: util: wait for CQL being ready with a shorter period
group0: discovery: shorten the pause duration
Add basic cluster tests for CQL forwarding.
The test cases include:
- basic reads and writes
- prepared statements with binds
- forwarding from a non-replica
- exception passthrough during forwarding (using an injection)
- re-preparing a statement on the target node, even if the user
query is also an EXECUTE request on a prepared statement
- verification metric updates
The existing test_basic_write_read was modified so that a few extra
cases could be validated on the same cluster.
We enable CQL forwarding by starting to return the bounce_to_node
result message in redirect_statement() instead of throwing. The
forwarding code introduced in the preceding patches reacts to these
messages, allowing the requests to be forwarded.
With the update, some tests assuming that requests can't be forwarded
need to be adjusted, so we do that as well.
During forwarding of CQL EXECUTE requests, the target node may
not have the prepared statement in its cache. If we do have this
statement as a coordinator, instead of returning PREPARED NOT FOUND
to the client, we want to prepare the statement ourselves on target
node.
For that, we add a new FORWARD_CQL_PREPARE RPC. We use the new RPC
after gettting the prepared_not_found status during forwarding. When
we try to forward a request, we always have the query string (we
decide whether to forward based on this query), so we can always use
the new RPC when getting the prepared_not_found status.
After receiving the response, we try forwarding the EXECUTE request
again.
During CQL forwarding, when the target node can't handle the request,
it will find another node which can execute the request or which knows
where the request can be executed. We return this information in
responses to CQL forwarding, and in this patch, we add handling of
this kind of a response.
After getting a redirect response, we retry forwarding to the returned
host/shard until success or timeout. This can happen many times during
a single request, when we first forward to a replica and later to the
coordinator, or when a replica/coordinator migrated while we were
performing the forwarding
When a forwarded request fails on the remote node, we can't use the
exception handling that happens in process_request_one because we
don't go through this code path. Instead, we use the previously
extracted cql_server::handle_exception handler, which performs
all accounting on the forwarded-to node, and which prepares the
response. For the read_failure_exception_with_timeout exception,
we need to perform the sleep on the source node, so we return the
timeout in the forwarding response and use it on the source node
to know how long to sleep without any extra calculations.
The handle_forward_execute() method is extracted from the inline handler
lambda to make the error catching wrapper cleaner.
Add the infrastructure for forwarding CQL requests to other nodes.
When a process() call results in a node bounce (as opposed to a shard
bounce), the coordinator serializes the request and sends it via the
FORWARD_CQL_EXECUTE RPC verb to the target node.
In this patch we omit several features that allow handling more
scenarios that can happen when trying to forward a CQL request,
but the RPC request and response are already prepared for them.
They will be handled in the following commits.
Use rolling_max_tracker to record gross bytes allocated during each
CQL parse. The rolling maximum is then added to the memory estimate
for incoming QUERY and PREPARE requests so that the admission control
in the CQL transport layer accounts for parsing overhead.
The measured memory footprint serves as upper bound rather than
exact number but it's purpose is to prevent OOMs under unprepared
statements heavy load.
In benchmark 1G memory node shows decrease of non-LSA memory usage
from peak 320MB (our coordinator budget is 10% of 1G) to 96MB. While
tps drops from 1.2 kops to 0.8 kops. Drop in tps is expected as
memory admission kicks in trying to prevent OOM.
This is phase 1 of OOM prevention, potential next steps:
- add second admission in query_processor::get_statement trying to prevent potential thundering herd problem
- decrease cql_server memory pool size
- count reads in the memory pool
- add per service level memory pool and a shared one
Related https://scylladb.atlassian.net/browse/SCYLLADB-740
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-938
Backport: no, new feature, but we may reconsider if some customer needs it
Closesscylladb/scylladb#28919
* github.com:scylladb/scylladb:
cql3: track CQL parsing memory cost and use it for admission control
utils: add rolling max tracker
In the following patches, when we start allowing to forward CQL
requests to other nodes, we'll need to use the same client state
for executing the request on the destination node as we had on the
source. client_state contains many fields and we need to create
a new instance of it when we start handling the forwarded request,
so to prepare for the forwarding RPC, we add a serializable format
of the client_state as an IDL struct. The new class is missing some
fields that are not used while executing requests, and some whose
value is determined by the fact that the client state is used for
a forwarded request.
These include:
- driver name, driver version, client options - not used for executing
requests. Instead, we use these as data sources for the virtual
"clients" system table.
- auth_state - must be READY - we reached a bounce message, so we were
able to try executing the request locally
- _control_connection - used for altering a cql_server::connection, which
we don't have on the target node
- _default_timeout_config - used when updating service levels, also only
per-connection
- workload_type - used for deciding whether to allow shedding at the
start of processing the request, and for getting per-connection service
level params (for an API)
Currently we perform the same steps when handling query, execute
and batch CQL requests. So instead of creating multiple functions
performing these steps, we can handle them all in one fallthrough
case in cql_server::connection::process_request_one.
The process_on_shard method is relatively short, it's only used
in the process() method and the Process concept that is uses
is as long as the function itself. This area will be made more
complex by the following patches for cql forwarding, so we simplify
it by inlining process_on_shard in cql_server::process.
Move process() and process_on_shard() from cql_server::connection to
cql_server. The process() method is no longer a template - instead, it
takes an opcode parameter and uses get_process_fn_for_opcode() to select
the appropriate internal processing function.
The process_query, process_execute, and process_batch wrappers on
connection now delegate to _server.process() with the appropriate opcode.
This refactoring is preparation for CQL request forwarding, where
process() will need to be called from a context other than connection
- the forwarding RPC handler).
The messaging service will be used by cql_server to register RPC
handlers for forwarding CQL requests between nodes.
We pass it through the controller to cql_server.
Expose response::flags() and response::extract_body(), and a new constructor.
It will be needed for creating a cql_transport::response from the response body returned
during CQL forwarding.
In the following patches, we'll start allowing forwarding requests to strongly
consistent tables so that they'll get executed on the suitable tablet Raft group
members. For that we'll reuse the approach that we already have for bouncing
requests to other shards - we'll try to execute a request locally, and the
result of that will be a bounce message with another replica as the target.
In this patch we generalize the former bounce_to_shard result message so that
it will be able to specify the target of the bounce as another shard or specific
replica.
We also rename it to result_message::bounce so that it stops implying that only
another shard may be its target.
Aside from the host_id and the shard, the new message also includes the timeout,
because in the service handling the forwarding we won't have the access to it,
and it's needed for specifying how long we should wait for the forwarded
requests. It also includes an information whether this is a write request
to return correct timeout response in case the deadline is exceeded.
We will return other hosts in the new bounce message when executing requests to
strongly consistent tables when we can't handle the request because we aren't
a suitable replica. We can't handle this message yet, so we don't return it
anywhere and we still assume that every bounce message is a bounce to the same
host.
Forwarding CQL requests is not implemented yet, but we're already
prepared to return the target to forward to when trying to execute
strongly consistent requests. Currently, if we're not a replica
of the affected tablet, we redirect the request to the first replica
in the list.
This is not optimal, because this replica may be down or it may be
in another rack, making us perform cross-rack requests during forwarding.
Instead, we should forward the request to the replica from the same
rack and handle the case where the replica is down.
In this patch we change the replica selection for forwarding strongly
consistent requests, so that when the coordinator isn't a replica, it
redirects the request to the replica from the same rack.
If the replica from the same rack is down, or there is no replica in
our rack, we choose the next closest replica (preferring same-DC replicas
over other DCs). If no replica is alive, the query fails - the driver
should retry when some replica comes back up.
A permit in `waiting_for_memory` state can be preemptively aborted by
maybe_admit_waiters(). This is wrong: such permits have already been
admitted and are actively processing a read — they are merely blocked
waiting for memory under serialize-limit pressure.
When `on_preemptive_aborted()` fires on a `waiting_for_memory` permit,
it does not clear `_requested_memory`. A subsequent `request_memory()`
call accumulatesa on top of the stale value, causing `on_granted_memory()`
to consume more than resource_units tracks.
This commit adds a test that confirms that scenario by counting
internal_errors.
The test_raft_voters_multidc_kill_dc scenario had become weaker after group0 voter count was made always odd.
In particular, the old num_nodes == 1 case (dc1=2, dc2=1, dc3=1) could pass even without the intended balancing logic, because with 3 voters total we naturally get one voter per DC.
This change restores coverage of the original intent:
- Replace num_nodes parametrization with explicit DC triples.
- Use (3, 1, 1) to force a meaningful asymmetric topology where voter placement logic is required.
- Keep a larger topology case (6, 3, 3) for broader coverage.
- Mark (6, 3, 3) as skip_mode(debug) with reason:
larger topology case is too slow in debug on minipcs.
Also updated comments/docstring to match the new setup.
Fixes: SCYLLADB-794
backport: None, it is done to deflake minipcs that will start working only on master
Closesscylladb/scylladb#29000
Change sleep_until_timeout_passes() to accept a foreign_ptr<std::unique_ptr<response>>.
We can easily create the foreign_ptr for the responses created in the CQL server,
but we'll need this when we get responses when forwarding CQL statements - the responses
may come from other shards.
We also move it from cql_server::connection to cql_server, because for forwarded CQL
requests, we'll need to handle it at the cql_server level.
The method also loses its const qualifier - the abort_source that we pass into
sleep_abortable needs to be non-const. Apparently, we could still use it in a const
method of cql_server::connection because we passed it as _server._abort_source which
caused the const qualifier to be lost.
After stop() moved _reaper, in-flight with_connection() callbacks could
still call reap(), which accessed the moved-from future causing a
SIGSEGV in future_base::detach_promise(). Add a seastar::gate so
stop() waits for all in-flight operations before moving _reaper.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1043Closesscylladb/scylladb#29015
`wait_for_cql` is used in hundreds, if not thousands, of places in tests.
We shouldn't waste up to 1s for every call.
Also, the 1s period is clearly too long compared to the bootstrap time,
which is usually 0-3s in dev mode.
The following test speeds up from 50s to 42s with the change:
```
for _ in range(10):
servers = await manager.servers_add(3)
await manager.get_ready_cql(servers)
```
Nodes currently pause group0 discovery for 1s. This case is always hit while
adding multiple nodes in parallel to an empty cluster by all nodes except the
one that becomes the group0 leader.
This is fine in production, but in tests, the slowdown is quite significant.
Every `manager.servers_add(n)` call for n > 1 becomes 1s slower when the
cluster is empty. Many cluster tests are affected.
In this commit, we decrease the sleep duration from 1s to 100ms to speed up
tests. The consequence of this change is that nodes might perform more steps
in group0 discovery, but the increase in CPU usage and network traffic should
be negligible.
Currently the test iterates on all servers and calls manager.api.disable_injection
but it doesn't await those calls.
Use asyncio.gather to await all calls in parallel.
Co-authored-by: Copilot CLI
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
serialize_collection_mutation() copies the serialized collection into
the returned collection_mutation object. Change to move to avoid the
copy.
Fixes: SCYLLADB-1041
Closesscylladb/scylladb#29010
can_use_effective_service_level_cache() always returns true now, so the
function can be dropped entirely and all the code that assumes it may
return false can be dropped as well.
The Alternator test test_compressed_request.py::test_gzip_request_oversized
checks that a very large request that compresses to a small size is still
rejected. This test passed on Alternator, but used to fail on DynamoDB
because DynamoDB didn't reject this case. This was a bug in DynamoDB
(a "decompression bomb" vulnerability), and after I reported it, it
was fixed.
So now this test does pass on DynamoDB (after a small modification to
allow for different error codes). So remove its scylla_only marker,
and make the comment true to the current state.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#28820
Use rolling_max_tracker to record gross bytes allocated during each
CQL parse. The rolling maximum is then added to the memory estimate
for incoming QUERY and PREPARE requests so that the admission control
in the CQL transport layer accounts for parsing overhead.
The measured memory footprint serves as upper bound rather than
exact number but it's purpose is to prevent OOMs under unprepared
statements heavy load.
In benchmark 1G memory node shows decrease of non-LSA memory usage
from peak 320MB (our coordinator budget is 10% of 1G) to 96MB. While
tps drops from 1.2 kops to 0.8 kops. Drop in tps is expected as
memory admission kicks in trying to prevent OOM.
In this patch we replace every single use of SCYLLA_ASSERT(), abort() and assert() in the cql3/ directory by throwing_assert().
The problem with SCYLLA_ASSERT()/abort()/assert() is that when it fails, it crashes Scylla. This is almost always a bad idea (see #7871 discussing why), but it's even riskier in front-end code like cql3/: In front-end code, there is a risk that due to a bug in our code, a specific user request can cause Scylla to crash. A malicious user can send this query to all nodes and crash the entire cluster. When the user is not malicious, it causes a small problem (a failing request) to become a much worse crash - and worse, the user has no idea which request is causing this crash and the crash will repeat if the same request is tried again.
All of this is solved by using the new throwing_assert(), which is the same as SCYLLA_ASSERT() but throws an exception (using on_internal_error()) instead of crashing. The exception will prevent the code path with the invalid assumption from continuing, but will result in only the current user request being aborted, with a clear error message reporting the internal server error due to an assertion failure.
I reviewed all the changes that I did in these patches to check that (to the best of my understanding) none of the assertions in cql3/ involve the sort of serious corruption that might require crashing the Scylla node entirely.
throwing_assert() also improves logging of assertion failures compared to the original SCYLLA_ASSERT()/abort() - SCYLLA_ASSERT() printed a message to stderr which in many installations is lost, and abort() often prints no message at all. But throwing_assert() uses Scylla's standard logger, and also includes a backtrace in the log message.
Fixes#13970 (Exorcise assertions from CQL code paths)
Refs #7871 (Exorcise assertions from Scylla)
Closesscylladb/scylladb#28847
* github.com:scylladb/scylladb:
cql3: remove unnecessary assert()
cql3: replace abort() by throwing_assert()
cql3: Replace SCYLLA_ASSERT by throwing_assert
The `test/cqlpy/cassandra_tests/validation/entities/json_test.py::testJsonOrdering` was failing because of differences between Cassandra and Scylla in printing
JSON floating point values - e.g. Cassandra prints 30.0, where Scylla prints 30.
Both are valid, so in this patch, instead of comparing strings, we compare parsed JSON using `EquivalentJson`.
Fixes#28467Closesscylladb/scylladb#28924
Replace the hardcoded meta::DEFAULT_SUPERUSER_NAME comparison with
default_superuser(_qp) which reads from the auth_superuser_name
config option. This makes the IF NOT EXISTS clause in DESCRIBE
output correct for clusters with a non-default superuser name.
_"A journey of a thousand miles begins with a single step" Lao Tzu_
ScyllaDB uses estimated_histogram in many places.
We already have a more efficient alternative: approx_exponential_histogram. It is both CPU and
memory-efficient and can be exported as Prometheus native histograms.
Its main limitation (which has its benefits) is that the bucket layout is fixed at compile time, so
histograms with different configurations cannot be mixed.
The end goal is to replace all uses of estimated_histogram in the codebase.
That migration needs a few small API adjustments, so I am splitting the work
into steps for easier review.
This series is the first step. It introduces a base template for fixed-size
estimated histograms, and switches the Alternator's estimated_histogram with the template.
This change is self-contained and valuable on its own, while keeping the scope limited.
Minor adjustments were made to the code and tests so that the tests would pass.
Follow-up PRs will apply the same pattern to the rest of the code.
**New feature no need to backport**
Closesscylladb/scylladb#28987
* github.com:scylladb/scylladb:
alternator: migrate to operation_size_kb histograms
test/alternator/test_metrics.py: Update the bucket in the histogram search
alternator: Use batch_histogram for batch size histograms
estimated_histogram.hh: adds estimated_histogram_with_max
When we forward CQL statements, we'll need to handle the errors
on the destination node. Only for read_failure_exception_with_timeout
exception, we'll still need to wait until timeout passes on the
source node.
For that we extract the exception handling to a separate method.
Additionally, we separate the waiting and all other handling,
so that all handling aside from waiting will be reusable after
forwarding, and we'll also be able to sleep on the source node
if necessary.
These methods are used only in the error handler in the cql server,
and outside of 3 cases, they don't need any information from the
cql_server::connection. We move them from cql_server::connection
to cql_server, so that they can be used in the following patches
for methods for CQL request forwarding where we'll have no instance
of cql_server::connection on the node forwarded to.
After the change the methods require no access to the server's
or connection's fields, so we also make them static methods.
Switch Alternator operation-size metrics from the legacy estimated
histogram implementation to estimated_histogram_with_max<512> and export
them through the native approx-exponential histogram path.
Add a dedicated operation-size histogram type alias based on
estimated_histogram_with_max<512>.
Replace all per-operation size histograms (GetItem/PutItem/DeleteItem/
UpdateItem/BatchGetItem/BatchWriteItem) with the new type.
Remove the custom legacy histogram-to-metrics adapter and use
to_metrics_histogram() for operation size metrics, aligning export
behavior with other approx-exponential histograms.
Update Alternator metrics tests to compute expected le bucket boundaries using
approx-exponential bucket math (including deduplication of equal
bounds), so assertions match the new exported histogram schema.
Update bucket helper signatures to use (max, precision) parameters and keep
+Inf handling unchanged.
Replace byte-to-KB ceiling conversion with plain integer division (bytes
/ 1024): histogram export already reports each bucket by its upper bound
(le), so rounding input values up before bucketing is unnecessary and
would over-shift borderline samples into higher buckets.
Move default_superuser() to auth::meta in common.{hh,cc} and remove the
cached _superuser member from both standard_role_manager and
password_authenticator. The superuser name comes from config which is
immutable at runtime, so caching it is unnecessary.
Removes auth-v1 hack for cassandra superuser as auth-v1
code no longer exists.
Also CL is not really used when quering raft replicated
tables (like auth ones), but LOCAL_ONE is the least confusing
one.
Replace get_auth_ks_name(qp) with db::system_keyspace::NAME directly.
The function always returned the constant "system" and its qp
parameter was unused.
nodetool cluster repair without additional params repairs all tablet
keyspaces in a cluster. Currently, if a table is dropped while
the command is running, all tables are repaired but the command finishes
with a failure.
Modify nodetool cluster repair. If a table wasn't specified
(i.e. all tables are repaired), the command finishes successfully
even if a table was dropped.
If a table was specified and it does not exist (e.g. because it was
dropped before the repair was requested), then the behavior remains
unchanged.
Fixes: SCYLLADB-568.
Closesscylladb/scylladb#28739
What changed
* Added closed to milestone event types in
call_sync_milestone_to_jira.yml (types: [created] -> types: [created, closed])
* Added VECTOR to the list of Jira project keys being synced
(jira_project_keys: SCYLLADB,CUSTOMER,SMI,RELENG -> jira_project_keys: SCYLLADB,CUSTOMER,SMI,RELENG,VECTOR)
Why (Requirements Summary)
* The call_sync_milestone_to_jira.yml workflow only triggered on
milestone creation. When a GitHub milestone is closed, the
corresponding Jira versions (in SCYLLADB, CUSTOMER, SMI, RELENG
projects) should be marked as released. Adding the closed trigger
enables the called workflow (main_sync_milestone_to_jira_release.yml
in github-automation) to handle both creating and releasing Jira
versions from GitHub milestone events.
* Added the VECTOR project so its Jira versions are also created/released
when milestones are created or closed in scylladb.git.
* This is consistent with the same change already applied to the staging
and scylla-machine-image repos.
Fixes:PM-216
Update call_sync_milestone_to_jira.yml in scylladb.git - add close trigger and VECTOR project sync
Closesscylladb/scylladb#28981
This patch adds estimated_histogram_with_max template that will be a
based for specific estimated_histograms, eventually replacing the current
struct implementation.
Introduce estimated_histogram_with_max<Max> as a reusable wrapper around
approx_exponential_histogram<1, Max, 4>, providing merge support and the
same add helpers used by existing estimated_histogra type.
Add estimated_histogram_with_max_merge()
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Fix an invalid condition, when searching for a parent shard, when table
is based on vnodes. Shards have associated with them `last token` -
token, than marks the end of the range of tokens they consume (inclusive).
An additional assumptions are whole token space is used and
(for vnodes) token space wraps around.
Previously code looked like this:
auto pid = std::upper_bound(..., [](const dht::token& t, const cdc::stream_id& id) {
return t < id.token();
});
if (pid != pids.begin()) {
pid = std::prev(pid);
}
An `upper_bound` call with `t < id.token()` means it is looking for
an iterator, for which value `t < id.token()` changed to true,
which effectively means a position, where iterator is bigger
then searched value. Then we move iterator backward once if possible.
Assuming token space <-2, 2> and parents [0, 2], when we search for:
- -1 -> we will get 0, it's first, so we can't move backward, so 0 (ok)
- 0 -> we will get 2, it's not first, so we go back and we return 0 (ok)
- 1 -> we will get 2, it's not first, so we go back and we return 0
(not ok - should be 2)
The fix is to replace it with `std::lower_bound` and remove conditional
backward motion. Since we've a guarantees that whole token space is used
if `std::lower_bound` ends with `end()` value, then we have a wrap
around case and we need to pick `begin()` as result.
Fixes#28354
Fixes: SCYLLADB-537
Closesscylladb/scylladb#28382
Changes dockerized_service to use ephermal port publish, and
query the published port from podman/docker.
Modifies client code to use slightly changed usage syntax.
query_processor::prepare() could race with prepared statement invalidation: after loading from the prepared cache, we converted the cached object to a checked weak pointer and then continued asynchronous work (including error-injection waitpoints). If invalidation happened in that window, the weak handle could no longer be promoted and the prepare path could fail nondeterministically.
This change keeps a strong cache entry reference alive across the whole critical section in prepare() by using a pinned cache accessor (get_pinned()), and only deriving the weak handle while the entry is pinned. This removes the lifetime gap without adding retry loops.
Test coverage was extended in test/cluster/test_prepare_race.py:
- reproduces the invalidation-during-prepare window with injection,
- verifies prepare completes successfully,
- then invalidates again and executes the same stale client prepared object,
- confirms the driver transparently re-requests/re-prepares and execution succeeds.
This change introduces:
- no behavior change for normal prepare flow besides stronger lifetime guarantees,
- no new protocol semantics,
- preserves existing cache invalidation logic,
- adds explicit cluster-level regression coverage for both the race and driver reprepare path.
- pushes the re prepare operation twards the driver, the server will return unprepared error for the first time and the driver will have to re prepare during execution stage
Fixes: https://github.com/scylladb/scylladb/issues/27657
Backport to active branches recommended: No node crash, but user-visible PREPARE failures under rare schema-invalidation race; low-risk timeout-bounded retry improves robustness.
Closesscylladb/scylladb#28952
* github.com:scylladb/scylladb:
transport/messages: hold pinned prepared entry in PREPARE result
cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race
Remove the host network setting, ensuring we use private networks
(slirp4netns). This will allow nested container port aliasing,
helping CI stability (can use ephemeral ports and container
introspection).
This also makes the nested podman setup non-conditional,
since we only run podman containers inside dbuild, and need
the setup regardless if host container is docker or not.
Starts a thread to process scylla notify messages (NOTIFY_SOCKET)
instead of just processing inline, non-blocking. This because it
is possible for the pipe created to be to small to hold enough
messages for us to reach the point where we otherwise even read
from said pipe, allowing other end (scylla) to proceed.
This series adds a global read barrier to raft_group0_client, ensuring that Raft group0 mutations are applied on all live nodes before returning to the caller.
Currently, after a group0_batch::commit, the mutations are only guaranteed to be applied on the leader. Other nodes may still be catching up, leading to stale reads. This patch introduces a broadcast read barrier mechanism. Calling send_group0_read_barrier_to_live_members after committing will cause the coordinator to send a read barrier RPC to all live nodes (discovered via gossiper) and waits for them to complete. This is best effort attempt to get cluster-wide visibility of the committed state before the response is returned to the user.
Auth and service levels write paths are switched to use this new mechanism.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-650
Backport: no, new feature
Closesscylladb/scylladb#28731
* https://github.com/scylladb/scylladb:
test: add tests for global group0_batch barrier feature
qos: switch service levels write paths to use global group0_batch barrier
auth: switch write paths to use global group0_batch barrier
raft: add function to broadcast read barrier request
raft: add gossiper dependency to raft_group0_client
raft: add read barrier RPC
Remove the rest of the code that assumes that either group0 does not exist yet or a cluster is till not upgraded to raft topology. Both of those are not supported any more.
No need to backport since we remove functionality here.
Closesscylladb/scylladb#28841
* github.com:scylladb/scylladb:
service level: remove version 1 service level code
features: move GROUP0_SCHEMA_VERSIONING to deprecated features list
migration_manager: remove unused forward definitions
test: remove unused code
auth: drop auth_migration_listener since it does nothing now
schema: drop schema_registry_entry::maybe_sync() function
schema: drop make_table_deleting_mutations since it should not be needed with raft
schema: remove calculate_schema_digest function
schema: drop recalculate_schema_version function and its uses
migration_manager: drop check for group0_schema_versioning feature
cdc: drop usage of cdc_local table and v1 generation definition
storage_service: no need to add yourself to the topology during reboot since raft state loading already did it
storage_service: remove unused functions
group0: drop with_raft() function from group0_guard since it always returns true now
gossiper: do not gossip TOKENS and CDC_GENERATION_ID any more
gossiper: drop tokens from loaded_endpoint_state
gossiper: remove unused functions
storage_service: do not pass loaded_peer_features to join_topology()
storage_service: remove unused fields from replacement_info
gossiper: drop is_safe_for_restart() function and its use
storage_service: remove unused variables from join_topology
gossiper: remove the code that was only used in gossiper topology
storage_service: drop the check for raft mode from recovery code
cdc: remove legacy code
test: remove unused injection points
auth: remove legacy auth mode and upgrade code
treewide: remove schema pull code since we never pull schema any more
raft topology: drop upgrade_state and its type from the topology state machine since it is not used any longer
group0: hoist the checks for an illegal upgrade into main.cc
api: drop get_topology_upgrade_state and always report upgrade status as done
service_level_controller: drop service level upgrade code
test: drop run_with_raft_recovery parameter to cql_test_env
group0: get rid of group0_upgrade_state
storage_service: drop topology_change_kind as it is no longer needed
storage_service: drop check_ability_to_perform_topology_operation since no upgrades can happen any more
service_storage: remove unused functions
storage_service: remove non raft rebuild code
storage_service: set topology change kind only once
group0: drop in_recovery function and its uses
group0: rename use_raft to maintenance_mode and make it sync
The `service_error` struct: 6dc2c42f8b/service/vector_store_client.hh (L64)
currently stores just the error status code. For this reason whenever the HTTP error occurs, only the error code can be forwarded to the client. For example see here: 6dc2c42f8b/service/vector_store_client.cc (L580)
For this reason in the output of the drivers full description of the error is missing which forces user to take a look into Scylla server logs.
The objective of this PR is to extend the support for HTTP errors in Vector Store client to handle messages as well.
Moreover, it removes the quadratic reallocation in response_content_to_sstring() helper function that is used for getting the response in case of error.
Fixes: VECTOR-189
Closesscylladb/scylladb#26139
* github.com:scylladb/scylladb:
vector_search: Avoid quadratic reallocation in response_content_to_sstring
vector_store_client: Return HTTP error description, not just code
Refs: SCYLLADB-557
We should use full replication in KS/CF creation and population,
for at least two reasons:
1.) Ensure we wait fully for and write to all nodes
2.) Make test more "real", behaving like a proper cluster
Closesscylladb/scylladb#28959
In cql3/, there was one call to assert() (not SCYLLA_ASSERT or
throwing_assert), and it was:
const auto shard_num = smp::count;
assert(shard_num > 0)
Rather than converting this assert() to throwing_assert() as I did in
previous patches, I decided to outright remove it: Seastar guarantees
that smp::count is not zero. Many other places in the code use
smp::count assuming that it is correct, no other place bothers to assert
it isn't zero.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
After the previous patch replaced all SCYLLA_ASSERT() calls by
throwing_assert(), this patch also replaces all calls to abort().
All these abort() calls are supposedly cases that can never happen,
but if they ever do happen because of a bug, in none of these places
we absolutely need to crash - and exception that aborts the current
operation should be enough.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In this patch we replace every single use of SCYLLA_ASSERT() in the cql3/
directory by throwing_assert().
The problem with SCYLLA_ASSERT() is that when it fails, it crashes Scylla.
This is almost always a bad idea (see #7871 discussing why), but it's even
riskier in front-end code like cql3/: In front-end code, there is a risk
that due to a bug in our code, a specific user request can cause Scylla
to crash. A malicious user can send this query to all nodes and crash
the entire cluster. When the user is not malicious, it causes a small
problem (a failing request) to become a much worse crash - and worse,
the user has no idea which request is causing this crash and the crash
will repeat if the same request is tried again.
All of this is solved by using the new throwing_assert(), which is the
same as SCYLLA_ASSERT() but throws an exception (using on_internal_error())
instead of crashing. The exception will prevent the code path with the
invalid assumption from continuing, but will result in only the current
user request being aborted, with a clear error message reporting the
internal server error due to an assertion failure.
I reviewed all the changes that I did in this patch to check that (to the
best of my understanding) none of the assertions in cql3/ involve the
sort of serious corruption that might require crashing the Scylla node
entirely.
throwing_assert() also improves logging of assertion failures compared
to the original SCYLLA_ASSERT() - SCYLLA_ASSERT() printed a message to
stderr which in many installations is lost, whereas throwing_assert()
uses Scylla's standard logger, and also includes a backtrace in the
log message.
Fixes#13970 (Exorcise assertions from CQL code paths)
Refs #7871 (Exorcise assertions from Scylla)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
To restore how streaming scopes work there are two tests that greatly duplicate each other -- test_restore_with_streaming_scopes from cluster/object_store suite and test_refresh_with_streaming_scopes from cluster suite.
This patch generalizes both into a do_test_streaming_scopes() non-test function
Closesscylladb/scylladb#28874
* github.com:scylladb/scylladb:
test: Re-sort comments around do_test_streaming_scopes()
test: Split do_load_sstables()
test: Drop load_fn argument from do_load_sstables()
test: Re-use do_test_streaming_scopes() in refresh test
test: Introduce SSTablesOnLocalStorage
test: Introduce SSTablesOnObjectStorage
test: Move test_restore_with_streaming_scopes() into do_test_streaming_scopes()
Document how to migrate a ScyllaDB cluster to different instance
types using the add-and-replace node cycling approach.
Closes: QAINFRA-42
Closesscylladb/scylladb#28458
Recently, in commit 7b30a39, we added to pytest.ini the option xfail_strict.
This option causes every time a test XPASSes, i.e., an XFAIL test actually
passes, to be considered an error and fail the test.
While this has some benefits, it's a big problem when running tests
against a reference implementation like DynamoDB or Cassandra: We
typically mark a test "xfail" if the test shows a known bug - i.e., if
the test fails on Scylla but passes on the reference system (DynamoDB
or Cassandra). This means that when running "test/cqlpy/run-cassandra"
or "test/alternator/run --aws", we expect to see many tests XPASS,
and now this will cause these runs to "fail".
So in this patch we add the xfail_strict=false to cqlpy/run-cassandra
and alternator/run --aws. This option is not added to cqlpy/run or
to alternator/run without --aws, and also doesn't affect test.py or
Jenkins.
P.S. This is another nail in the coffin of doing "cd test/alternator;
pytest --aws". You should get used to running Alternator tests through
test/alternator/run, even if you don't need to run Scylla (the "--aws"
option doesn't run Scylla).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#28973
During tests I noticed that if the number of tablets is very small,
say 2, and the number of nodes is 3 (2 shards per node), using the
number of storage groups on each shard, a shard may end up holding 0 groups,
whilst the other holds 1 group. And in some nodes even both shards have
0 groups.
Taking the minimum among shards here was showing in manifests a tablet
count of 0 for all 3 nodes, which is incorrect.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Closesscylladb/scylladb#28978
Currently, for repair tasks tablet_virtual_task::wait gathers the
ids of tablets that are to be repaired. The gathered set is later
used to check if the repair is still ongoing.
However, if the tablets are resized (split or merged), the gathered
set becomes irrelevant. Those, we may end up with invalid tablet id
error being thrown.
Wait until repair is done for all tablets in the table.
Fixes: https://github.com/scylladb/scylladb/issues/28202
Backport to 2026.1 needed as it contains the change introducing the issue d51b1fea94Closesscylladb/scylladb#28323
* github.com:scylladb/scylladb:
service: fix indentation
test: add test_tablet_repair_wait
service: remove status_helper::tablets
service: tasks: scan all tablets in tablet_virtual_task::wait
A bug or some bad operator intervention can lead to a sstable existing in a
node after the tablet replica was moved to a different node.
This will result sstable loading during boot failing, requiring operator
intervention.
The log today just dumps the name of the "orphaned" sstable, but one
investigating it might want to know which process (repair, memtable, whatever)
generated that sstable, if the sstable was created locally or remotely,
and the current replica set of the underlying tablet.
From the original identifier, we can know the exact time the sstable was
created on its original node. From the current id, we know the time it
was created on the current node.
All this info can help the investigator to correlate with events in other nodes
(includes actions from the coordinator) to get closer to the root cause.
The new log will look like this:
"Unable to load SSTable .../me-3gyg_1fsw_2u0u826b00b71vc46o-big-Data.db
(originated from compaction with id 913f41c0-18c2-11f1-8f08-cb8521b3f330
on host e483238c-2287-4022-8bc4-b4f1c4cb2b0d)
of tablet 6 (replica set: [e483238c-2287-4022-8bc4-b4f1c4cb2b0d:0])"
Refs https://scylladb.atlassian.net/browse/SCYLLADB-788.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#28921
Vector deserialization is an operation which performance is critical for
vector similarity search feature because it is frequently executed during
rescoring operation. Some of the identified performance bottlenecks
for it include:
1. Per-element virtual dispatch in deserialize(): each of the N elements
went through visit() which switches on ~28 type variants. For a
1024-dimension float vector, that's 1024 redundant type switches when
the element type is the same for all of them.
2. Redundant work in split_fragmented(): value_length_if_fixed() was
called inside the loop (N virtual calls), and no reserve() was done
on the output vector causing repeated reallocations.
This series fixes both:
- Introduce deserialize_vector_visitor that dispatches on the element
type once for the entire vector, then loops inside the resolved
handler. Simple numeric types (float, int, etc.) call
deserialize_value() directly with no virtual dispatch per element.
String types (ascii, utf8) get a dedicated handler that skips
make_empty() (sstring has no empty_t constructor). Complex types
(list, map, tuple, etc.) fall back to per-element dispatch.
- In split_fragmented(), reserve the output vector to _dimension and
cache value_length_if_fixed() before the loop.
Benchmark results (1024-dim float vector, release build, -O3 -flto):
deserialize: 15.73 us -> 11.70 us (1.34x, 26% faster)
split_fragmented: 10.34 us -> 7.45 us (1.39x, 28% faster)
References: SCYLLADB-471
Backport: none, unless we observe some critical performance improvement for quantization.
Closesscylladb/scylladb#28618
* github.com:scylladb/scylladb:
types: optimize reading vector fragments
types: optimize vector deserialization for high-dimensional vectors
Similarly to LWTs, we reject queries with user-provided timestamps
when they target strongly consistent tables.
Such statements could force us to rewrite history, and that contradicts
the philosophy of linearizability we aim for.
Fixes SCYLLADB-879
Closesscylladb/scylladb#28867
* seastar d2953d2a...4d268e0e (32):
> Merge 'prometheus: support multiple __name__ filters and prefixed names' from Travis Downs
doc: update prometheus.md with __name__ filter enhancements
prometheus: support prefixed names in __name__ filter
prometheus: add benchmarks for name filter performance
prometheus: support multiple __name__ query parameters
prometheus: move write_body_args to header
> fair_queue: Subtract from _queued_capacity on pop_front()
> memory: expose cumulative allocated bytes statistic
> Merge 'Add ability to configure IO bandwidth limit for supergroup' from Pavel Emelyanov
test: IO bandiwdth throttler unit tests
code: Add ability to configure IO bandwidth limit for supergroup
io_queue: Have more than one throttler par class
io_queue: Introduce bandwidth_throttler helper class
io_queue: Nest io_group::priotiy_class_data-s
io_queue: Update class bandwidth on group's class data
io_queue: Make io_group::priority_class_data::tokens() static
fair_queue: Introduce group (un)plugging
> Fix _shard_to_numa_node_mapping double population
> Use exception parameter in log_timer_callback_exception()
> Fix wakeup_granularity() fallback debug-fs reading
> test_fixture: Fix SEASTAR_FIXTURE_THREAD_TEST_CASE thread not propagated
> build: support tuning -ffile-prefix-map
> test: Remove unused C::dup() method of testing class
> src/core/reactor: introduce reactor::get_backend_name()
> util/process: add pid() accessor
> Merge 'Add source location to task and tasktrace object' from Radosław Cybulski
coroutine.hh: disable source_location for GCC to avoid ICE
reactor: improve do_dump_task_queue reporting
Use source_location in `do_dump_task_queue`
Update backtrace with source locations of resume points
Add calls to update resume_point
Add a std::source_location (resume_point) to task object.
> Merge 'Refine posix file .dup() implementation' from Pavel Emelyanov
file: Templatize posix_file_handle_impl
file: Don't dup() non-read-only files
file: Split ..._impl::dup() implementations
test: Add a simple test for dup()
> Merge 'Deprecate reactor::make_pollable_fd(socket_address, int)' from Pavel Emelyanov
reactor: Deprecate make_pollable_fd()
net/posix: Create file_desc for sockets in-place
reactor,net: Keep sock_need_nonblock boolean on posix_network_stack
net/posix: Re-format constructor initializer lists
> Merge 'test: add fuzz testing infrastructure and sstring fuzzer' from Travis Downs
test: add fuzz tests to CI workflow
test: add sstring differential fuzzer
test: add fuzz testing infrastructure
> Introduce "integrated queue length" metrics and use it for IO classes (#3210)
> reactor: Remove get_sg_data(unsigned) overload
> memcached: Stop using scattered_message
> reactor: Mark uptime() method const
> alien: Remove deprecated run_on and submit_to calls
> file: make open_flags and access_flags constexpr
> scheduling: Unfriend some methods from scheduling_group
> reactor: Move _dying bit to epoll backend
> file: coroutinize the with_file templates
> configure: validate --cook ingredient names
> fix trailing whitespace
> Merge 'Estimate timing overhead, allow failing if it is too high' from Travis Downs
perf_tests: document overhead column and threshold options
perf_tests: add measurement overhead tracking and warnings
perf_tests: remove inline/hot attributes from time_measurement methods
perf_tests: move time_measurement class to implementation file
perf_tests: move perf counters into time_measurement singleton
> rpc: log handler type
> Merge 'Add pre-commit with trailing whitespace hook' from Travis Downs
Add GitHub Actions workflow for pre-commit enforcement
Add pre-commit setup documentation to HACKING.md
Add pre-commit configuration with trailing-whitespace hook
Remove trailing whitespace from source files
> posix-stack: Make internal::posix_connect() resolve exceptions into futures
> sstring: fix npos to be size_t for consistency with std::string
Closesscylladb/scylladb#28954
There was a redundant work in split_fragmented(): value_length_if_fixed() was
called inside the loop (N virtual calls), and no reserve() was done
on the output vector causing repeated reallocations.
This patch reserves the output vector to _dimension and
caches value_length_if_fixed() before the loop.
Additionally, split read_vector_element() into two specialized functions:
read_vector_element_fixed() and read_vector_element_variable(), and hoist
the branch on fixed_len outside the loop in split_fragmented() and
deserialize_loop(). This avoids a conditional branch per element in the
hot path.
Benchmark results (1024-dim float vector, release build, -O3 -flto):
10.34 us -> 7.45 us (1.39x, 28% faster)
Verify that scylla sstable upgrade fails when an sstable has a
corrupted Statistics component digest, and succeeds when the
--ignore-component-digest-mismatch flag is provided.
Add ignore_component_digest_mismatch option to db::config (default false).
When set, sstable loading logs a warning instead of throwing on component
digest mismatches, allowing a node to start up despite corrupted non-vital
components or bugs in digest calculation.
Propagate the config to all production sstable load paths:
- distributed_loader (node startup, upload dir processing)
- storage_service (tablet storage cloning)
- sstables_loader (load-and-stream, download tasks, attach)
- stream_blob (tablet streaming)
Add `ignore_component_digest_mismatch` option to `sstable_open_config`
that logs a warning instead of throwing `malformed_sstable_exception`
on component digest mismatch. This is useful for recovering sstables
with corrupted non-vital components or working around bugs in digest
calculation.
Expose the option in scylla-sstable via the
`--ignore-component-digest-mismatch` flag for the upgrade operation.
Generalize corrupt_sstable() and scrub_validate_corrupted_file() to
accept a component_type parameter, defaulting to Data, so they can be
reused for corrupting other components.
Add tests that verify SSTable component digest validation detects
corruption on load. Each test writes an SSTable, corrupts a specific
component file by flipping a bit, then asserts that reloading the
SSTable throws malformed_sstable_exception with the expected digest
mismatch message.
Add integrity verification for SSTable component files by validating
their CRC32 digests against the expected values stored in Scylla
metadata during SSTable loading.
The following components are validated on load: TOC, Scylla metadata,
CompressionInfo, Statistics, Summary, and Filter.
Add a new reader that wraps file_random_access_reader and computes a
running CRC32 digest over the data as it is read. The digest accumulates
across sequential read_exactly() calls and is reset on seek(), since a
non-sequential read invalidates the running checksum.
One of the performance bottlenecks while deserializing vectors was
per-element virtual dispatch in deserialize(): each of the N elements
went through visit() which switches on ~28 type variants. For a
1024-dimension float vector, that's 1024 redundant type switches when
the element type is the same for all of them.
This patch introduces deserialize_vector_visitor that dispatches on the element
type once for the entire vector, then loops inside the resolved
handler. Simple numeric types (float, int, etc.) call
deserialize_value() directly with no virtual dispatch per element.
String types (ascii, utf8) get a dedicated handler that skips
make_empty() (sstring has no empty_t constructor). Complex types
(list, map, tuple, etc.) fall back to per-element dispatch.
Benchmark results (1024-dim float vector, release build, -O3 -flto):
15.73 us -> 11.70 us (1.34x, 26% faster)
`e4da0afb8d5491bf995cbd1d7a7efb966c79ac34` introduces a protection
against resources that are "made up" of thin air to
`reader_concurrency_semaphore`. If there are more `_resources` than
the `_initial_resources`, it means there is a negative leak, and
`on_internal_error_noexcept` is called. In addition to it,
`_resources` is set to `std::max(_resources, _initial_resources)`.
However, the commit message of `e4da0afb8d5491bf995cbd1d7a7efb966c79ac34`
states the opposite: "The detection also clamps the
_resources to _initial_resources, to prevent any damage".
Before this commit, the protection mechanism doesn't clamp
`_resources` to `_initial_resources` but instead keeps `_resources` high,
possibly even indefinitely growing. This commit changes `std::max` to
`std::min` to make the code behave as intended.
Refs: SCYLLADB-163
Closesscylladb/scylladb#28982
Pre-compute the total size and allocate a single uninitialized sstring
before copying the buffers, following the pattern from Seastar's
read_entire_stream_contiguous(). This avoids iterative reallocation
which is O(n^2) for large responses.
This simple patch adds support for storing the HTTP error description
that Vector Store client receives from vector store. Until now it was
just printed to the log but it was not returned. For this reason it
was not forwarded to the drivers which forced users to access ScyllaDB
server logs to understand what is wrong with Vector Store.
This patch also updates formatter to print the message next to the
error code.
Fixes: VECTOR-189
Previously, all stream-table fixtures in this test file used
scope="function", forcing a fresh table to be created for every test,
slowing down the test a bit (though not much), and discouraging writing
small new tests.
This was a workaround for a DynamoDB quirk (that Alternator doesn't have):
LATEST shard iterators have a time slack and may point slightly before
the true stream head, causing leftover events from a previous test to
appear in the next test's reads.
We fix this by draining the stream inside latest_iterators() and
shards_and_latest_iterators() after obtaining the LATEST iterators:
fetch records in a loop until two consecutive polling rounds both return
empty, guaranteeing the iterators are positioned past all pre-existing
events before the caller writes anything. With this guarantee in place,
all stream-table fixtures can safely use scope="module".
After this patch, test_streams.py continues to pass on DynamoDB.
On Alternator, the test file's run time went down a bit, from
20.2 seconds to 17.7 seconds.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In the next patch, we plan to make the fixtures in test_streams.py
shared between tests. Most tests work well with shared tables, but two
(test_streams_trim_horizon and test_streams_starting_sequence_number)
were written to expect a new table with an empty history, and two
other (test_streams_closed_read and test_streams_disabled_stream) want
to disable streaming and would break a shared table.
So this patch we modify these four tests to create their own new table
instead of using a fixture.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This pull request adds support for calculation and storing CRC32 digests for all SSTable components.
This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure
and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream.
Several test cases where introduced to verify expected behaviour.
Additionally, this PR adds new rewrite component mechanism for safe sstable component rewriting.
Previously, rewriting an sstable component (e.g., via rewrite_statistics) created a temporary file that was renamed to the final name after sealing. This allowed crash recovery by simply removing the temporary file on startup.
However, with component digests stored in scylla_metadata (#20100),
replacing a component like Statistics requires atomically updating both the component
and scylla_metadata with the new digest - impossible with POSIX rename.
The new mechanism creates a clone sstable with a fresh generation:
- Hard-links all components from the source except the component being rewritten and scylla_metadata
- Copies original sstable components pointer and recognized components from the source
- Invokes a modifier callback to adjust the new sstable before rewriting
- Writes the modified component along with updated scylla_metadata containing the new digest
- Seals the new sstable with a temporary TOC
- Replaces the old sstable atomically, the same way as it is done in compaction
This is built on the rewrite_sstables compaction framework to support batch operations (e.g., following incremental repair).
In case of any failure durning the whole process, sstable will be automatically deleted on the node startup due to
temporary toc persistence.
Backport is not required, it is a new feature
Fixes https://github.com/scylladb/scylladb/issues/20100, https://github.com/scylladb/scylladb/issues/27453Closesscylladb/scylladb#28338
* github.com:scylladb/scylladb:
docs: document components_digests subcomponent and trailing digest in Scylla.db
sstable_compaction_test: Add tests for perform_component_rewrite
sstable_test: add verification testcases of SSTable components digests persistance
sstables: store digest of all sstable components in scylla metadata
sstables: replace rewrite_statistics with new rewrite component mechanism
sstables: add new rewrite component mechanism for safe sstable component rewriting
compaction: add compaction_group_view method to specify sstable version
sstables: add null_data_sink and serialized_checksum for checksum-only calculation
sstables: extract default write open flags into a constant
sstables: Add write_simple_with_digest for component checksumming
sstables: Extract file writer closing logic into separate methods
sstables: Implement CRC32 digest-only writer
Currently, status_helper::tablets, which keeps a vector of processed
tablet ids, is used only in tablet_virtual_task::get_status_helper,
so there is no point in returning it. Also, in get_status_helper,
it is used only to determine if any tablets are processed.
Remove status_helper::tablets. Use a flag instead of the vector
in get_status_helper.
Currently, for repair tasks tablet_virtual_task::wait gathers the
ids of tablets that are to be repaired. The gathered set is later
used to check if the repair is still ongoing.
However, if the tablets are resized (split or merged), the gathered
set becomes irrelevant. Those, we may end up with invalid tablet id
error being thrown.
Wait until repair is done for all tablets in the table.
There is a race condition in driver that raises the RuntimeException.
This pollutes the output, so this PR is just silencing this exception.
Fixes: SCYLLADB-900
Closesscylladb/scylladb#28957
result_message::prepared now owns a strong pinned prepared-cache entry instead of relying only on a weak pointer view. This closes the remaining lifetime gap after query_processor::prepare() returns, so users of the returned PREPARE message cannot observe an invalidated weak handle during subsequent
processing.
- update result_message::prepared::cql constructor to accept pinned entry
- construct weak view from owned pinned entry inside the message
- pass pinned cache entry from query_processor::prepare() into the message constructor
query_processor::prepare() could race with prepared statement invalidation: after loading from the prepared cache, we converted the cached object to a checked weak pointer and then continued asynchronous work (including error-injection waitpoints). If invalidation happened in that window, the weak
handle could no longer be promoted and the prepare path could fail nondeterministically.
This change keeps a strong cache entry reference alive across the whole critical section in prepare() by using a pinned cache accessor (get_pinned()), and only deriving the weak handle while the entry is pinned. This removes the lifetime gap without adding retry loops.
Test coverage was extended in test/cluster/test_prepare_race.py:
- reproduces the invalidation-during-prepare window with injection,
- verifies prepare completes successfully,
- then invalidates again and executes the same stale client prepared object,
- confirms the driver transparently re-requests/re-prepares and execution succeeds.
This change introduces:
- no behavior change for normal prepare flow besides stronger lifetime guarantees,
- no new protocol semantics,
- preserves existing cache invalidation logic,
- adds explicit cluster-level regression coverage for both the race and driver reprepare path.
- pushes the re prepare operation twards the driver, the server will return unprepared error for the first time and the driver will have to re prepare during execution stage
Fixed several places where ScyllaLogFile.grep() was called without await, resulting in checking coroutine objects for truthiness instead of actual log matches.
Fixes: SCYLLADB-903
No backport, framework fix and one test fix.
Closesscylladb/scylladb#28909
* github.com:scylladb/scylladb:
test.py: fix unawaited ScyllaLogFile.grep() coroutines
tests: fix test_group0_recovers_after_partial_command_application
When dropping a table, make_drop_table_or_view_mutations() creates
a point tombstone in system_schema.columns for every column in the table.
The clustering key of system_schema.columns is (table_name, column_name).
A clustering key with only the table_name component acts as a prefix
tombstone. That tombstone covers all columns belonging to that table.
This approach is already used by make_table_deleting_mutations() during
CREATE TABLE.
Apply the same prefix tombstone approach to DROP TABLE for the columns,
view_virtual_columns, computed_columns, and dropped_columns schema tables.
This reduces tombstone accumulation in schema table sstables.
In test_max_cells test case, which repeatedly creates and drops a table
with 32768 columns, overall test time improved from ~180s to ~157s, which
is ~12.7% improvement.
Refs SCYLLADB-815
Closesscylladb/scylladb#28976
The function checks that the node's state is not left or removed in
gossiper during restart, but with raft topology a removed node will
not be able to contact the cluster to get this information since it will
be banned.
Also remove test_auth_raft_command_split test which is irrelevant since 5ba7d1b116
because it does not use the function that injects max sized command after the
commit.
Schema pull was used by legacy schema code which is not supported for a
long time now and during legacy recovery which is no longer supported as
well. It can be dropped now.
Simplify code by getting rid of group0_upgrade_state since upgrade is no
longer supported, so no need to track its state. The none upgraded node
will simply not boot and to detect that the patch checks the state
directly from the system table.
The only support mode is topology_change_kind::raft, so always set it in
storage_service::join_cluster during join or regular boot. Drop the check
for legacy mode from raft_group0::setup_group0_if_exist since the mode
will not be set at this point any longer. The wrong upgrade will still
be detected in storage_service::join_cluster where topology.upgrade_state
is checked directly.
group0_upgrade_state::recovery is now used only in maintenance mode
so rename the function to indicate it. Also there is no preemption point
in the function any more and it can be a regular function, not a
co-routine.
The test description of refreshing test is very elaborated and it's
worth having it as the description of the streaming scopes test itself.
Callers of the helper can go with smaller descriptions.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This helper does two things -- sorts sstables per server according to
scope in use and calls sstables_storage.restore(). The code looks better
if the sorting of sstables stays in a helper and the call for .restore()
is moved to the caller.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now it's possible to replace the whole body of the
test_refresh_with_streaming_scopes() test by calling the corresponding
helper function from backup/restore test module. This helper does
exactly the same, and the SSTablesOnLocalStorage class provides the
necessary save/restore implementations.
One more thing to mention -- the refreshing test for some reason only
wants to run with restored min-tablet-count equal to the original one.
The do_test_streaming_scopes() needs to account for that, as it runs the
tests for more options.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This class implements some of the sstables manipulations performed by
test_refresh_with_streaming_scopes(). It's here to facilitate next patch
that will use it to call do_test_streaming_scopes() helper.
This patch moves two blocks of code out of the test into this new class.
The shutil.rmtree(tmpbackup) is seemingly lost, but it really isn't --
the tmpbackup variable holds a name of a _subdir_ inside servers'
workdirs. This path doesn't really exist on disk on its own, so removing
it is a no-op.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The class in question performs two operations for
do_test_streaming_scopes(): saves sstables and restores them. Current
caller of the helper is the test_restore_with_streaming_scopes() test
that need to backup sstables on object storage and restore them from
there with the restoration API. The SSTablesOnObjectStorage class does
exactly that.
The change in do_load_sstables() that checks for sstables_storage to be
non None is needed to keep test_refresh_with_streaming_scopes() work --
that test doesn't provide sstables_storage (yet) and the function in
question will call the load_fn callback. Next patch will eliminate it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The body of this test is duplicated by
test_refresh_with_streaming_scopes() test from other module. Keeping it
in a non-test top-level function will help generalizing these two tests.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A few days ago, in commit 7b30a3981b we added to pytest.ini the option
xfail_strict. This option causes every time a test XPASSes, i.e., an
xfail test actually passes - to be considered an error and fail the test.
But some tests demonstrate a timing-related bug and do not reproduce the
bug every single time. An example we noticed in one CI run is:
test/cqlpy/test_secondary_index.py::test_unbuilt_index_not_used
This test reproduces a timing-related bug (if you read from a secondary
index "too quickly" you can get wrong results), but only about 90% of the
time, not 100% of the time.
The solution is to add "strict=False" for the xfail marker on this
specific test. This undoes the xfail_strict for this specific test,
accepting that this specific test can either pass or fail. Note that
this does NOT make this test worthless - we still see this test failing
most of the time, and when a developer finally fixes this issue, the
test will begin to pass all the time.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-956
(we'll probably need to follow up this fix with the same fix for other
xfail tests that can sometime pass).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#28942
When we generate view updates, we check whether we can skip the
entire view update if all columns selected by the view are unmodified.
However, for collection columns, we only check if they were unset
before and after the update.
In this patch we add a check for the actual collection contents.
We perform this check for both virtual and non-virtual selections.
When the column is only a virtual column in the view, it would be
enough to check the liveness of each collection cell, however for
that we'd need to deserialize the entire collection anyway, which
should be effectively as expensive as comparing all of its bytes.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-808Closesscylladb/scylladb#28839
* github.com:scylladb/scylladb:
mv: allow skipping view updates when a collection is unmodified
mv: allow skipping view updates if an empty collection remains unset
Instead of dht::partition_ranges_vector, which is an std::vector<> and
have been seen to cause large allocations when calculating ranges to be
invalidated after compaction:
seastar_memory - oversized allocation: 147456 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at
[Backtrace #0]
void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89
(inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./seastar/src/util/backtrace.cc:99
seastar::current_tasktrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:136
seastar::current_backtrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:169
seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:840
seastar::memory::cpu_pages::check_large_allocation(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:903
(inlined by) seastar::memory::cpu_pages::allocate_large(unsigned int, bool) at ./build/release/seastar/./seastar/src/core/memory.cc:910
(inlined by) seastar::memory::allocate_large(unsigned long, bool) at ./build/release/seastar/./seastar/src/core/memory.cc:1533
(inlined by) seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:1679
seastar::memory::allocate(unsigned long) at ././seastar/src/core/memory.cc:1698
(inlined by) operator new(unsigned long) at ././seastar/src/core/memory.cc:2440
(inlined by) std::__new_allocator<interval<dht::ring_position>>::allocate(unsigned long, void const*) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/new_allocator.h:151
(inlined by) std::allocator<interval<dht::ring_position>>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/allocator.h:203
(inlined by) std::allocator_traits<std::allocator<interval<dht::ring_position>>>::allocate(std::allocator<interval<dht::ring_position>>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/alloc_traits.h:614
(inlined by) std::_Vector_base<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>>::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/stl_vector.h:387
(inlined by) std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>>::reserve(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/vector.tcc:79
dht::to_partition_ranges(utils::chunked_vector<interval<dht::token>, 131072ul> const&, seastar::bool_class<utils::can_yield_tag>) at ./dht/i_partitioner.cc:347
compaction::compaction::get_ranges_for_invalidation(std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>> const&) at ./compaction/compaction.cc:619
(inlined by) compaction::compaction::get_compaction_completion_desc(std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>>, std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>>) at ./compaction/compaction.cc:719
(inlined by) compaction::regular_compaction::replace_remaining_exhausted_sstables() at ./compaction/compaction.cc:1362
compaction::compaction::finish(std::chrono::time_point<db_clock, std::chrono::duration<long, std::ratio<1l, 1000l>>>, std::chrono::time_point<db_clock, std::chrono::duration<long, std::ratio<1l, 1000l>>>) at ./compaction/compaction.cc:1021
compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0::operator()() at ./compaction/compaction.cc:1960
(inlined by) compaction::compaction_result std::__invoke_impl<compaction::compaction_result, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(std::__invoke_other, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/invoke.h:63
(inlined by) std::__invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type std::__invoke<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/invoke.h:98
(inlined by) decltype(auto) std::__apply_impl<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0, std::tuple<>>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&, std::integer_sequence<unsigned long, ...>) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/tuple:2920
(inlined by) decltype(auto) std::apply<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0, std::tuple<>>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/tuple:2935
(inlined by) seastar::future<compaction::compaction_result> seastar::futurize<compaction::compaction_result>::apply<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&) at ././seastar/include/seastar/core/future.hh:1930
(inlined by) seastar::futurize<std::invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type>::type seastar::async<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(seastar::thread_attributes, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&)::'lambda'()::operator()() const at ././seastar/include/seastar/core/thread.hh:267
(inlined by) seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::futurize<std::invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type>::type seastar::async<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(seastar::thread_attributes, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&)::'lambda'()>::call(seastar::noncopyable_function<void ()> const*) at ././seastar/include/seastar/util/noncopyable_function.hh:138
seastar::noncopyable_function<void ()>::operator()() const at ./build/release/seastar/./seastar/include/seastar/util/noncopyable_function.hh:224
(inlined by) seastar::thread_context::main() at ./build/release/seastar/./seastar/src/core/thread.cc:318
dht::partition_ranges_vector is used on the hot path, so just convert
the problematic user -- cache invalidation -- to use
utils::chunked_vector<dht::partition_range> instead.
Fixes: SCYLLADB-121
Closesscylladb/scylladb#28855
Fixed several places where ScyllaLogFile.grep() was called without
await, resulting in checking coroutine objects for truthiness instead
of actual log matches.
Fixes: SCYLLADB-903
Many tests in test/alternator/test_streams.py use a do_test() function
which performs a user-defined function that runs some write requests,
and then verifies that the expected output appears on the stream.
Because DynamoDB drops do-nothing changes from the stream - such as
writing to an item a value that it already has - these tests need to
write to a different item each time, so do_test() invents a random key
and passes it to the user-defined function to use. But... we had a bug,
the random number generation was done only once, instead of every time.
The fix is to do the random number generation on every call.
We never noticed this bug when each test used a brand new table. But the
next patch will make the tests share the test table, and tests start
to fail. It's especially visible if you run the same test twice against
DynamoDB, e.g.,
test/alternator/run --count 2 --aws \
test_streams.py::test_streams_putitem_keys_only
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
`storage_group_of()` is on the replica-side token lookup hot path but
used `tablet_map::get_tablet_id_and_range_side()`, which computes both
tablet id and post-split range side.
Most callers only need the storage group id. Switch `storage_group_of()`
to use `get_tablet_id()` via `tablet_id_for_token()`, and select the
compaction group via new overloads that compute the range side only
when splitting mode is active.
Change `storage_group::select_compaction_group()` to accept a token
(and tablet_map) and compute the tablet range side only when
splitting_mode() is active.
Add an overload for selecting the compaction group for an sstable
spanning a token range.
Add `tablet_map::get_tablet_range_side(token)` to compute the
post-split range side without computing the tablet id.
Pure addition, no behavior change.
Trie-based sstable indexes are supposed to be (hopefully)
a better default than the old BIG indexes.
Make them the new default.
If we change our mind, this change can be reverted later.
Trie-based indexes and older indexes have a difference in metrics,
and the test uses the metrics to check for bypass cache.
To choose the right metrics, it uses highest_supported_sstable_format,
which is inappropriate, because the sstable format chosen for writes
by Scylla might be different than highest_supported_sstable_format.
Use chosen_sstable_format instead.
Returns the sstable version currently chosen for use in for new sstables.
We are adding it because some tests want to know what format they are
writing (tests using upgradesstable, tests which check stats that only
apply to one of the index types, etc).
(Currently they are using `highest_supported_sstable_format` for this
purpose, which is inappropriate, and will become invalid if a non-latest
format is the default).
cluster.dtest_alternator_tests.test_slow_query_logging performs
a bootstrap with 768 token ranges.
It works with `me` sstables, which have 2 open file descriptors
per open sstable, but with `ms` sstables, which have 3 open
file descriptors per open sstable, it fails with EMFILE.
To avoid this problem, let's just decrease the number of vnodes
for in the test suite. It's appropriate anyway, because it avoids some
unneeded work without weakening the tests.
(Note: pylib-based have been setting `num_tokens` to 16 for a long time too).
This breaks `bypass_cache_test`, which is written in a way that expects
a certain number of token ranges. We adjust the relevant parameter
accordingly.
This function ensures that all alive nodes executed
read barrier.
It will be usefull for the following commits which would
eventually delay returning response to the user until
mutations are applied on other nodes so that the user
may perceive better data consistency accross nodes.
When a user tries to use ALTER TABLE on a materialized view,
the resulting error message is `Cannot use ALTER TABLE on Materialized View`.
The intention behind this error is that ALTER MATERIALIZED VIEW should
be used instead.
But we observed that some users interpret this error message as a general
"You cannot do any ALTER on this thing".
This patch enhances the error message (and others similar to it)
to prevent the confusion.
Closesscylladb/scylladb#28831
tablet_virtual_task::wait throws if a table on which a tablet operation
was working is dropped.
Treat the tablet operation as successful if a table is dropped.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-494
Needs backport to all live releases
Closesscylladb/scylladb#28933
* github.com:scylladb/scylladb:
test: add test_tablet_repair_wait_with_table_drop
service: tasks: return successful status if a table was dropped
SNI works only with DNS hostnames. Adding an IP address causes warnings
on the server side.
This change adds SNI only if it is not an IP address.
This change has no unit tests, as this behavior is not critical,
since it causes a warning on the server side.
The critical part, that the server name is verified, is already covered.
This PR also adds warning logs to improve future troubleshooting of connections to the vector-store nodes.
Fixes: VECTOR-528
Backports to 2025.04 and 2026.01 are required, as these branches are also affected.
Closesscylladb/scylladb#28637
* github.com:scylladb/scylladb:
vector_search: fix TLS server name with IP
vector_search: add warn log for failed ann requests
The first node in the cluster is not guaranteed to be the coordinator
node. Hardcoding node 0 as the coordinator causes test flakiness. This
patch dynamically finds the actual coordinator node and targets it for
error injection, log checking, and restarts.
Additionally, inject `tablet_force_tablet_count_decrease_once` across
all servers to force the tablet merge process to trigger once.
Fixes SCYLLADB-865
Closesscylladb/scylladb#28945
Mention that role and permission changes are durable but may
not be immediately visible on other nodes due to asynchronous
replication.
Fixes: SCYLLADB-651
Closesscylladb/scylladb#28900
Currently, the view_update_generator::mutate_MV function acquires a
reference to the keyspace relevant to the operation, then it calls
max_concurrent_for_each and uses that reference inside the lambda passed
to that function. max_concurrent_for_each can preempt and there is no
mechanism that makes sure that the keyspace is alive until the view
updates are generated, so it is possible that the keyspace is freed by
the time the reference is used.
Fix the issue by precomputing the necessary information based on the
keyspace reference right away, and then passing that information by
value to the other parts of the code. It turns out that we only need to
know whether the keyspace uses tablets and whether it uses a network
topology strategy.
Fixes: scylladb/scylladb#28925Closesscylladb/scylladb#28928
The tests in test_out_of_space_prevention.py are flaky. Three issues contribute:
1. After creating/removing the blob file that simulates disk pressure,
the tests immediately checked derived state (e.g., "compaction_manager
- Drained") without first confirming the disk space monitor had detected
the utilization change. Fix: explicitly wait for "Reached/Dropped below
critical disk utilization level" right after creating/removing the blob
file, before checking downstream effects.
2. Several tests called `manager.driver_connect()` or omitted reconnection
entirely after `server_restart()` / `server_start()`. The pre-existing
driver session can silently reconnect multiple times, causing subsequent
CQL queries to fail. Fix: call `reconnect_driver()` after every node restart.
Additionally, call `wait_for_cql_and_get_hosts()` where CQL is used afterward,
to ensure all connection pools are established.
3. Some log assertions used marks captured before a restart, so they could
match pre-restart messages or miss messages emitted in the correct post-restart
window. Fix: refresh marks at the right points.
Apart from that, the patch fixes a typo: autotoogle -> autotoggle.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-655Closesscylladb/scylladb#28626
Fixes: SCYLLADB-915
Test was quite broken; Not waiting for coro:s, as well as a bunch
of checks no longer even close to valid (this is a ported dtest, and
not a very good one).
Closesscylladb/scylladb#28887
vector_search: small improvements
This PR addresses several minor code quality issues and style inconsistencies within the vector_search module.
No backport is needed as these improvements are not visible to the end user.
Closesscylladb/scylladb#28718
* github.com:scylladb/scylladb:
vector_search: fix names of private members
vector_search: remove unused global variable
Tests use `start_writes` as a simple write workload to test that writes
succeed when they should (e.g., there is no availability loss), but not to
test performance. There is no reason to overload the CPU, which can lead to
test failures.
I suspect this function to be the cause of SCYLLADB-929, where the failures
of `test_raft_recovery_user_data` (that creates multiple write workloads
with `start_writes`) indicated that the machine was overloaded.
The relevant observations:
- two runs failed at the same time in debug mode,
- there were many reactor stalls and RPC timeouts in the logs (leading to
unexpected events like servers marking each other down and group0
leader changes).
I didn't prove that `start_writes` really caused this, but adding this sleep
should be a good change, even if I'm wrong.
The number of writes performed by the test decreases 30-50 times with the
sleep.
Note that some other util functions like `start_writes_to_cdc_table` have
such a sleep.
This PR also contains some minor updates to `test_raft_recovery_user_data`.
Fixes SCYLLADB-929
No backport:
- the failures were observed only in master CI,
- no proof that the change fixes the issue, so backports could be a waste
of time.
Closesscylladb/scylladb#28917
* github.com:scylladb/scylladb:
test: test_raft_recovery_user_data: replace asyncio.gather with gather_safely
test: test_raft_recovery_user_data: use the exclude_node API
test: test_raft_recovery_user_data: drop tablet_load_stats_cfg
test: cluster: util: sleep for 0.01s between writes in do_writes
This patch adds a test file, test/alternator/test_encoding.py, testing
how Alternator stores its data in the underlying CQL database. We test
how tables are named, and how attributes of different types are encoded
into CQL.
The test, which begins with a long comment, also doubles as developer-
oriented *documention* on how Alternator encodes its data in the CQL
database. This documentation is not intended for end-users - we do
not want to officially support reading or writing Alternator tables
through CQL. But once in a while, this information can come in handy
for developers.
More importantly, this test will also serve as a regression test,
verifying that Alternator's encoding doesn't change unintentionally.
If we make an unintentional change to the way that Alternator stores
its data, this can break upgrades: The new code might not be able to
read or write the old table with its old encoding. So it's important
to make sure we never make such unintentional changes to the encoding
of Alternator's data. If we ever do make *intentional* changes to
Alternator's data encoding, we will need to fix the relevant test;
But also not forget to make sure that the new code is able to read
the old encoding as well.
The new tests use both "dynamodb" (Alternator) and "cql" fixtures,
to test how CQL sees the Alternator tables. So naturally are these
tests are marked "scylla_only" and skipped on DynamoDB.
Fixes#19770.
Closesscylladb/scylladb#28866
The motivations for this patch are as follows:
- Guardrails should follow similar conventions, e.g. for config names,
metrics names, testing. Keeping guardrails together makes it easier
to find and compare existing guardrails when new guardrails are
implemented.
- The configuration is used to auto-generate the documentation
(particularly, the `configuration-parameters` page). Currently,
the order of parameters in the documentation is inconsistent (e.g.
`minimum_replication_factor_fail_threshold` before
`minimum_replication_factor_warn_threshold` but
`maximum_replication_factor_fail_threshold` after
`maximum_replication_factor_warn_threshold`), which can be confusing
to customers.
Fixes: SCYLLADB-256
Closesscylladb/scylladb#28932
This series improves timeout handling consistency across the test framework and makes build-mode effects explicit in LWT tests. (starting with LWT test that got flaky)
1. Centralize timeout scaling
Introduce scale_timeout(timeout) fixture in runner.py to provide a single, consistent mechanism for scaling test timeouts based on build mode.
Previously, timeout adjustments were done in an ad-hoc manner across different helpers and tests. Centralizing the logic:
Ensures consistent behavior across the test suite
Simplifies maintenance and reasoning about timeout behavior
Reduces duplication and per-test scaling logic
This becomes increasingly important as tests run on heterogeneous hardware configurations, where different build modes (especially debug) can significantly impact execution time.
2. Make scale_timeout explicit in LWT helpers
Propagate scale_timeout explicitly through BaseLWTTester and Worker, validating it at construction time instead of relying on implicit pytest fixture injection inside helper classes.
Additionally:
Update wait_for_phase_ops() and wait_for_tablet_count() to use scale_timeout_by_mode() for consistent polling behavior across modes
Update all LWT test call sites to pass build_mode explicitly
Increase default timeout values, as the previous defaults were too short and prone to flakiness, particularly under slower configurations such as debug builds
Overall, this series improves determinism, reduces flakiness, and makes the interaction between build mode and test timing explicit and maintainable.
backport: not required just an enhansment for test.py infra
Closesscylladb/scylladb#28840
* https://github.com/scylladb/scylladb:
test/auth_cluster: align service-level timeout expectations with scaled config
test/lwt: propagate scale_timeout through LWT helpers; scale resize waits Pass scale_timeout explicitly through BaseLWTTester and Worker, validating it at construction time instead of relying on implicit pytest fixture injection inside helper classes. Update wait_for_phase_ops() and wait_for_tablet_count() to use scale_timeout_by_mode() so polling behavior remains consistent across build modes. Adjust LWT test call sites to pass scale_timeout explicitly. Increase default timeout values, as the previous defaults were too short and prone to flakiness under slower configurations (notably debug/dev builds).
test/pylib: introduce scale_timeout fixture helper
This patch fixes 2 issues within strong consistency state machine:
- it might happen that apply is called before the schema is delivered to the node
- on the other hand, the apply may be called after the schema was changed and purged from the schema registry
The first problem is fixed by doing `group0.read_barrier()` before applying the mutations.
The second one is solved by upgrading the mutations using column mappings in case the version of the mutations' schema is older.
Fixes SCYLLADB-428
Strong consistency is in experimental phase, no need to backport.
Closesscylladb/scylladb#28546
* https://github.com/scylladb/scylladb:
test/cluster/test_strong_consistency: add reproducer for old schema during apply
test/cluster/test_strong_consistency: add reproducer for missing schema during apply
test/cluster/test_strong_consistency: extract common function
raft_group_registry: allow to drop append entries requests for specific raft group
strong_consistency/state_machine: find and hold schemas of applying mutations
strong_consistency/state_machine: pull necessary dependencies
db/schema_tables: add `get_column_mapping_if_exists()`
`test_proxy_protocol_port_preserved_in_system_clients` failed because it
didn't see the just created connection in system.clients immediately. The
last lines of the stacktrace are:
```
# Complete CQL handshake
await do_cql_handshake(reader, writer)
# Now query system.clients using the driver to see our connection
cql = manager.get_cql()
rows = list(cql.execute(
f"SELECT address, port FROM system.clients WHERE address = '{fake_src_addr}' ALLOW FILTERING"
))
# We should find our connection with the fake source address and port
> assert len(rows) > 0, f"Expected to find connection from {fake_src_addr} in system.clients"
E AssertionError: Expected to find connection from 203.0.113.200 in system.clients
E assert 0 > 0
E + where 0 = len([])
```
Explanation: we first await for the hand-made connection to be completed,
then, via another connection, we're querying system.clients, and we don't
get this hand-made connection in the resultset.
The solution is to replace the bare cql.execute() calls with await wait_for_results(), a helper
that polls via cql.run_async() until the expected row count is reached
(30 s timeout, 100 ms period).
Fixes: SCYLLADB-819
The flaky test is present on master and in previous release, so backporting only there.
Closesscylladb/scylladb#28849
* github.com:scylladb/scylladb:
test_proxy_protocol: introduce extra logging to aid debugging
test_proxy_protocol: fix flaky system.clients visibility checks
Move all ${{ }} expression interpolations into env: blocks so they are
passed as environment variables instead of being expanded directly into
shell scripts. This prevents an attacker from escaping the heredoc in
the Validate Comment Trigger step and executing arbitrary commands on
the runner.
The Verify Org Membership step is hardened in the same way for
defense-in-depth.
Refs: GHSA-9pmq-v59g-8fxp
Fixes: SCYLLADB-954
Closesscylladb/scylladb#28935
When we generate view updates, we check whether we can skip the
entire view update if all columns selected by the view are unmodified.
However, for collection columns, we only check if they were unset
before and after the update.
In this patch we add a check for the actual collection contents.
We perform this check for both virtual and non-virtual selections.
When the column is only a virtual column in the view, it would be
enough to check the liveness of each collection cell, however for
that we'd need to deserialize the entire collection anyway, which
should be effectively as expensive as comparing all of its bytes.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-808
Currently, when we generate view updates, we skip the view update if
all columns selected by the view are unchanged in the base table update.
However, this does not apply for collection columns - if the base table
has a collection regular column, we never allow skipping generating
view updates and the reason for that is missing implementation.
We can easily relax this for the case where the collection was missing
before and after the update - in this commit we move the check for
collections after the check for missing cells.
The ini-level strict_config was removed/never existed as a config key in pytest 8 — it's only a command-line flag(and back in pytest 9)
In pytest 8.3.5, the equivalent is the --strict-config CLI flag, not an ini option
Fixes SCYLLADB-955
Closesscylladb/scylladb#28939
Document the new `components_digests` subcomponent (tag 12) added to the
Scylla.db metadata component, which stores CRC32 digests of all checksummed
SSTable component files. Also document the trailing CRC32 digest that
stores digest of the scylla metadata itself.
Add two test cases to verify the correctness of the perform_component_rewrite
functionality:
- test_perform_component_rewrite_single_sstable: Tests rewriting the Statistics
component of a single sstable
- test_perform_component_rewrite_multiple_sstables: Tests rewriting 5 out of 10
sstables
Adds a generic test helper that writes a random SSTable, reloads it, and
verifies that the persisted CRC32 digest for each component matches the
digest computed from disk. Those covers all checksummed components test cases.
This change replaces plain file_writer with crc32_digest_file_writer
for all SSTable components that should be checksummed. The resulting component
digests are stored in scylla metadata component.
This also extends new rewrite component mechanism,
to rewrite metadata with updated digest together with the component.
When the local entry with `read_idx` belongs to the current term, it's
safe to update the local `commit_idx` to `read_idx`.
The motivation for this change is to speed up read barriers. `wait_for_apply`
executed at the end of `read_barrier` is delayed until the follower learns
that the entry with `read_idx` is committed. It usually happens quickly in
the `read_quorum` message. However, non-voters don't receive this message,
so they have to wait for `append_entries`. If no new entries are being
added, `append_entries` can come only from `fsm::tick_leader()`. For group0,
this happens once every 100ms.
The issue above significantly slows down cluster setups in tests. Nodes
join group0 as non-voters, and then they are met with several read barriers
just after a write to group0. One example is `global_token_metadata_barrier`
in `write_both_read_new` performed just after `update_topology_state` in
`write_both_read_old`.
I tested the performance impact of this change with the following test:
```python
for _ in range(10):
await manager.servers_add(3)
```
It consistently takes 44-45s with the change and 50-51s without the change
in dev mode.
No backport:
- non-critical performance improvement mostly relevant in tests,
- the change requires some soak time in master.
Closesscylladb/scylladb#28891
* github.com:scylladb/scylladb:
raft: server: fix the repeating typo
raft: clarify the comment about read_barrier_reply
raft: read_barrier: update local commit_idx to read_idx when it's safe
raft: log: clarify the specification of term_for
In case of an error, we want to see the contents of the system.clients
table to have a better understanding of what happened - whether the
row(s) are really missing or maybe they are there, but 1 digit doesn't
match or the row is half-written.
We'll therefore query for the whole table on the CQL side, and then
filter out the rows we want to later proceed with on the python side.
This way we can dump the contents of the whole system.clients table if
something goes south.
`test_proxy_protocol_port_preserved_in_system_clients` failed because it
didn't see the just created connection in system.clients immediately. The
last lines of the stacktrace are:
```
# Complete CQL handshake
await do_cql_handshake(reader, writer)
# Now query system.clients using the driver to see our connection
cql = manager.get_cql()
rows = list(cql.execute(
f"SELECT address, port FROM system.clients WHERE address = '{fake_src_addr}' ALLOW FILTERING"
))
# We should find our connection with the fake source address and port
> assert len(rows) > 0, f"Expected to find connection from {fake_src_addr} in system.clients"
E AssertionError: Expected to find connection from 203.0.113.200 in system.clients
E assert 0 > 0
E + where 0 = len([])
```
Explanation: we first await for the hand-made connection to be completed,
then, via another connection, we're querying system.clients, and we don't
get this hand-made connection in the resultset.
The solution is to replace the bare cql.execute() calls with await wait_for_results(), a helper
that polls via cql.run_async() until the expected row count is reached
(30 s timeout, 100 ms period).
Fixes: SCYLLADB-819
tablet_virtual_task::wait throws if a table on which a tablet operation
was working is dropped.
Treat the tablet operation as successful if a table is dropped.
The one accepts long list of arguments, some of those is not really needed. Also some callers can be relaxed not to provide default values for arguments with such.
Improving tests, not backporting
Closesscylladb/scylladb#28861
* github.com:scylladb/scylladb:
test: Remove passing default "expected_replicas" to check_mutation_replicas()
test: Remove scope and primary-replica-only arguments from check_mutation_replicas() helper
Previous code performed endian conversion by bulk-copying raw bytes
into a std::vector<float> and then iterating over it via a
reinterpret_cast<uint32_t*> pointer. Accessing float storage through a
uint32_t* violates C++ strict aliasing rules, giving the compiler
freedom to reorder or elide the stores, causing undefined behavior.
Replace the two-pass approach with a single-pass loop using
seastar::consume_be<uint32_t>() and std::bit_cast<float>(), which is
both well-defined and auto-vectorizable.
Follow-up #28754Closesscylladb/scylladb#28912
Add support for non_gating, the opposite of gating in dtest terminology, tests in test.py
codebase
This test will/should not be run by any current gating job (ci/next/nightly)
Closesscylladb/scylladb#28902
HostRegistry initialized in several places in the framework, this can
lead to the overlapping IP, even though the possibility is low it's not
zero. This PR makes host registry initialized once for the master
thread and pytest. To avoid communication between with workers, each
worker will get its own subnet that it can use solely for its own goals.
This simplifies the solution while providing the way to avoid overlapping IP's.
Closesscylladb/scylladb#28520
This PR disables running FXAIL tests on ci run to speed it up.
tests will continue run on "nightly" job and FAIL on unexpected pass
and will continue run on "NEXT" job and NOT FAIL on unexpected pass
Closesscylladb/scylladb#28886
The purpose of `add_column_for_post_processing` is to add columns that are required for processing of a query,
but are not part of SELECT clause and shouldn't be returned. They are added to the final result set, but later are not serialized.
Mainly it is used for filtering and grouping columns, with a special case of `WHERE primary_key IN ... ORDER BY ...` when the whole result set needs additional final sorting,
and ordering columns must be added as well.
There was a bug that manifested in #9435, #8100 and was actually identified in #22061.
In case of selection with processing (e.g functions involved), result set row is formed in two stages.
Initially it is a list of columns fetched from replicas - on which filtering and grouping is performed.
After that the actual selection is resolved and the final number of columns can change.
Ordering is performed on this final shape, but the ordering column index returned by `add_column_for_post_processing` refereed to initial shape.
If selection refereed to the same column twice (e.g. `v, TTL(v)` as in #9435) final row was longer than initial and ordering refereed to incorrect column.
If a function in selection refereed to multiple columns (e.g. as_json(.., ..) which #8100 effectively uses) the final row was shorter
and ordering tried to use a non-existing column.
This patch fixes the problem by making sure that column index of the final result set is used for ordering.
The previously crashing test `cassandra_tests/validation/entities/json_test.py::testJsonOrdering` doesn't have to be skipped, but now it is failing on issue #28467.
Fixes#9435Fixes#8100Fixes#22061Closesscylladb/scylladb#28472
Tests use `start_writes` as a simple write workload to test that writes
succeed when they should (e.g., there is no availability loss), but not to
test performance. There is no reason to overload the CPU, which can lead to
test failures.
I suspect this function to be the cause of SCYLLADB-929, where the failures
of `test_raft_recovery_user_data` (that creates multiple write workloads
with `start_writes`) indicated that the machine was overloaded.
The relevant observations:
- two runs failed at the same time in debug mode,
- there were many reactor stalls and RPC timeouts in the logs (leading to
unexpected events like servers marking each other down and group0
leader changes).
I didn't prove that `start_writes` really caused this, but adding this sleep
should be a good change, even if I'm wrong.
The number of writes performed by the test decreases 30-50 times with the
sleep.
Note that some other util functions like `start_writes_to_cdc_table` have
such a sleep.
Fixes SCYLLADB-929
Similar to `raft_drop_incoming_append_entries`, the new error injection
`raft_drop_incoming_append_entries_for_specified_group` skips handler
for `raft_append_entries` RPC but it allows to specify id of raft group
for which the requests should be dropped.
The id of a raft group should be passed in error injection parameters
under `value` key.
It might happen that a strong consistency command will arrive to a node:
- before it knows about the schema
- after the schema was changes and the old version was removed from the
memory
To fix the first case, it's enough to perform a read barrier on group0.
In case of the second one, we can use column mapping the upgrade the
mutation to newer schema.
Also, we should hold pointers to schemas until we finish `_db.apply()`,
so the schema is valid for the whole time.
And potentially we should hold multiple pointers because commands passed
to `state_machine::apply()` may contain mutations to different schema
versions.
This commit relies on a fact that the tablet raft group and its state
machine is created only after the table is created locally on the node.
Fixes SCYLLADB-428
Set enable_schema_commitlog for each group0 tables.
Assert that group0 tables use schema commitlog in ensure_group0_schema
(per each command).
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-914.
Needs backport to all live releases as all are vulnerable
Closesscylladb/scylladb#28876
* github.com:scylladb/scylladb:
test: add test_group0_tables_use_schema_commitlog
db: service: remove group0 tables from schema commitlog schema initializer
service: ensure that tables updated via group0 use schema commitlog
db: schema: remove set_is_group0_table param
Consider this:
- repair takes the lock holder
- tablet merge filber destories the compaction group and the compaction state
- repair fails
- repair destroy the lock holder
This is observed in the test:
```
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036] Repair 1 out of 1 tablets: table=sec_index.users range=(432345564227567615,504403158265495551] replicas=[0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea:15, 498e354c-1254-4d8d-a565-2f5c6523845a:9, 5208598c-84f0-4526-bb7f-573728592172:28]
...
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: Started to repair 1 out of 1 tables in keyspace=sec_index, table=users, table_id=ea2072d0-ccd9-11f0-8dba-c5ab01bffb77, repair_reason=repair
repair - Enable incremental repair for table=sec_index.users range=(432345564227567615,504403158265495551]
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: get_sync_boundary: got error from node=0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea, keyspace=sec_index, table=users, range=(432345564227567615,504403158265495551], error=seastar::rpc::remote_verb_error (Compaction state for table [0x60f008fa34c0] not found)
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge
....
scylla[10793] Segmentation fault on shard 28, in scheduling group streaming
```
The rwlock in compaction_state could be destroyed before the lock holder
of the rwlock is destroyed. This causes user after free when the lock
the holder is destroyed.
To fix it, users of repair lock will now be waited when a compaction
group is being stopped.
That way, compaction group - which controls the lifetime of rwlock -
cannot be destroyed while the lock is held.
Additionally, the merge completion fiber - that might remove groups -
is properly serialized with incremental repair.
The issue can be reproduced using sanitize build consistently and can not
be reproduced after the fix.
Fixes#27365Closesscylladb/scylladb#28823
* github.com:scylladb/scylladb:
repair: Fix rwlock in compaction_state and lock holder lifecycle
repair: Prevent repair lock holder leakage after table drop
The comment could be misleading. It could suggest that the returned index is
already safe to read. That's not necessarily true. The entry with the
returned index could, for example, be dropped by the leader if the leader's
entry with this index had a different term.
When the local entry with `read_idx` belongs to the current term, it's
safe to update the local `commit_idx` to `read_idx`. The argument for
safety is in the new comment above `maybe_update_commit_idx_for_read`.
The motivation for this change is to speed up read barriers. `wait_for_apply`
executed at the end of `read_barrier` is delayed until the follower learns
that the entry with `read_idx` is committed. It usually happens quickly in
the `read_quorum` message. However, non-voters don't receive this message,
so they have to wait for `append_entries`. If no new entries are being
added, `append_entries` can come only from `fsm::tick_leader()`. For group0,
this happens once every 100ms.
The issue above significantly slows down cluster setups in tests. Nodes
join group0 as non-voters, and then they are met with several read barriers
just after a write to group0. One example is `global_token_metadata_barrier`
in `write_both_read_new` performed just after `update_topology_state` in
`write_both_read_old`.
Writing a test for this change would be difficult, so we trust the nemesis
tests to do the job. They have already found consistency issues in read
barriers. See #10578.
Both migration manager and system keyspace will be used in next commit.
The first one is needed to execute group0 read barrier and we need
system keyspace to get column mappings.
Use scale_timeout_by_mode() in make_scylla_conf() to derive
request_timeout_in_ms in test/pylib/scylla_cluster.py.
Update test_connections_parameters_auto_update in
test/cluster/auth_cluster/test_raft_service_levels.py to expect the
mode-specific timeout string returned by the REST endpoint after this
scaling change.
Pass scale_timeout explicitly through BaseLWTTester and Worker, validating it at construction time instead of relying on implicit pytest fixture injection inside helper classes.
Update wait_for_phase_ops() and wait_for_tablet_count() to use scale_timeout_by_mode() so polling behavior remains consistent across build modes.
Adjust LWT test call sites to pass scale_timeout explicitly.
Increase default timeout values, as the previous defaults were too short and prone to flakiness under slower configurations (notably debug/dev builds).
Introduce scale_timeout(mode) to centralize test timeout scaling logic based on build mode, the function will return a callable that will handle the timeout by mode.
This ensures consistent timeout behavior across test helpers and eliminates ad-hoc per-test scaling adjustments.
Centralizing the logic improves maintainability and makes timeout behavior easier to reason about.
This becomes increasingly important as we run tests on heterogeneous hardware configurations.
Different build modes (especially debug) can significantly affect execution time, and having a single scaling mechanism helps keep test stability predictable across environments.
No functional change beyond unifying existing timeout scaling behavior.
This commit updates the documentation for the unified installer.
- The Open Source example is replaced with version 2025.1 (Source Available, currently supported, LTS).
- The info about CentOS 7 is removed (no longer supported).
- Java 8 is removed.
- The example for cassandra-stress is removed (as it was already removed on other installation pages).
Fixes https://github.com/scylladb/scylladb/issues/28150Closesscylladb/scylladb#28152
In scenarios where we want to firsty check if a column mapping exists
and if we don't want do flow control with exception, it is very wasteful
to do
```
if (column_mapping_exists()) {
get_column_mapping();
}
```
especially in a hot path like `state_machine::apply()` becase this will
execute 2 internal queries.
This commit introduces `get_column_mapping_if_exists()` function,
which simply wrapps result of `get_column_mapping()` in optional and
doesn't throw an exception if the mapping doesn't exist.
this commit enables 3 strict pytest options:
strict_config - if any warnings encountered while parsing the pytest section of the configuration file will raise errors.
xfail_strict - if markers not registered in the markers section of the configuration file will raise errors.
strict-markers - if tests marked with @pytest.mark.xfail that actually succeed will by default fail the test suite
and fix errors that occur after enabling these options
Closesscylladb/scylladb#28859
Currently, repair-mode tombstone-gc cannot be used on tables with RF=1. We want to make repair-mode the default for all tablet tables (and more, see https://github.com/scylladb/scylladb/issues/22814), but currently a keyspace created with RF=1 and later altered to RF>1 will end up using timeout-mode tombstone gc. This is because the repair-mode tombstone-gc code relies on repair history to determine the gc-before time for keys/ranges. RF=1 tables cannot run repairs so they will have empty repair history and consequently won't be able to purge tombstones.
This PR solves this by keeping a registry of RF=1 tables and consulting this registry when creating `tombstone_gc_state` objects. If the table is RF=1, tombstone-gc will work as if the table used immediate-mode tombstone-gc. The registry is updated on each replication update. As soon as the table is not RF=1 anymore, the tombstone-gc reverts to the natural repair-mode behaviour.
After this PR, tombstone-gc defaults to repair-mode for all tables, regardless of RF and tablets/vnodes.
Fixes: SCYLLADB-106.
New feature, no backport required.
Closesscylladb/scylladb#22945
* github.com:scylladb/scylladb:
test/{boost,cluster}: add test for tombstone gc mode=repair with RF=1
tombstone_gc: allow use of repair-mode for RF=1 tables
replica/table: update rf=1 table registry in shared tombstone-gc state
tombstone_gc: tombstone_gc_before_getter: consider RF when getting gc before time
tombstone_gc: unpack per_table_history_maps
tombstone_gc: extract _group0_gc_time from per_table_history_map
tombstone_gc: drop tombstone_gc_state(nullptr) ctor and operator bool()
test/lib/random_schema: use timeout-mode tombstone_gc
tombstone_gc_options: add C++ friendly constructor
test: move away from tombstone_gc_state(nullptr) ctor
treewide: move away from tombstone_gc_state(nullptr) ctor
sstable: move away from tombstone_gc_mode::operator bool()
replica/table: add get_tombstone_gc_state()
compaction: use tombstone_gc_state with value semantics
db/row_cache: use tombstone_gc_state with value semantics
tombstone_gc: introduce tombstone_gc_state::for_tests()
The test test_mv_merge_allowed asserts in two places that the tablet
count is 2. It does so by calling an async function but, mistakenly, the
returned coroutine was not awaited. The coroutine is, apparently, truthy
so the assertions always passed.
Fix the test to properly await the coroutines in the assertions.
Fixes: SCYLLADB-905
Closesscylladb/scylladb#28875
This PR solves a series of similar problems related to executing
methods on an already aborted `raft::server`. They materialize
in various ways:
* For `add_entry` and `modify_config`, a `raft::not_a_leader` with
a null ID will be returned IF forwarding is disabled. This wasn't
a big problem because forwarding has always been enabled for group0,
but it's something that's nice to fix. It's might be relevant for
strong consistency that will heavily rely on this code.
* For `wait_for_leader` and `wait_for_state_change`, the calls may
hang and never resolve. A more detailed scenario is provided in a
commit message.
For the last two methods, we also extend their descriptions to indicate
the new possible exception type, `raft::stopped_error`. This change is
correct since either we enter the functions and throw the exception
immediately (if the server has already been aborted), or it will be
thrown upon the call to `raft::server::abort`.
We fix both issues. A few reproducer tests have been included to verify
that the calls finish and throw the appropriate errors.
Fixes SCYLLADB-841
Backport: Although the hanging problems haven't been spotted so far
(at least to the best of my knowledge), it's best to avoid
running into a problem like that, so let's backport the
changes to all supported versions. They're small enough.
Closesscylladb/scylladb#28822
* https://github.com/scylladb/scylladb:
raft: Make methods throw stopped_error if server aborted
raft: Throw stopped_error if server aborted
test: raft: Introduce get_default_cluster
Into a single database_apply one. Add three parameters:
* ks_name and cf_name to filter the tables to be affected
* what - what to do: throw or wait
This leads to smaller footprint in the code and improved filtering for
table names at the cost of some extra error injection params in the
tests.
This patch series implements `write_consistency_levels_warned` and `write_consistency_levels_disallowed` guardrails, allowing the configuration of which consistency levels are unwanted for writes. The motivation for these guardrails is to forbid writing with consistency levels that don't provide high durability guarantees (like CL=ANY, ONE, or LOCAL_ONE).
Neither guardrail is enabled by default, so as not to disrupt clusters that are currently using any of the CLs for writes. The warning guardrail may seem harmless, as it only adds a warning to the CQL response; however, enabling it can significantly increase network traffic (as a warning message is added to each response) and also decrease throughput due to additional allocations required to prepare the warning. Therefore, both guardrails should be enabled with care. The newly added `writes_per_consistency_level` metric, which is incremented unconditionally, can help decide whether a guardrail can be safely enabled in an existing cluster.
This commit adds additional `if` instructions on the critical path. However, based on the `perf_simple_query` benchmark for writes, the difference is marginal (~40 additional instructions, which is a relative difference smaller than 0.001).
BEFORE:
```
291443.35 tps ( 53.3 allocs/op, 16.0 logallocs/op, 14.2 tasks/op, 48067 insns/op, 18885 cycles/op, 0 errors)
throughput:
mean= 289743.07 standard-deviation=6075.60
median= 291424.69 median-absolute-deviation=1702.56
maximum=292498.27 minimum=261920.06
instructions_per_op:
mean= 48072.30 standard-deviation=21.15
median= 48074.49 median-absolute-deviation=12.07
maximum=48119.87 minimum=48019.89
cpu_cycles_per_op:
mean= 18884.09 standard-deviation=56.43
median= 18877.33 median-absolute-deviation=14.71
maximum=19155.48 minimum=18821.57
```
AFTER:
```
290108.83 tps ( 53.3 allocs/op, 16.0 logallocs/op, 14.2 tasks/op, 48121 insns/op, 18988 cycles/op, 0 errors)
throughput:
mean= 289105.08 standard-deviation=3626.58
median= 290018.90 median-absolute-deviation=1072.25
maximum=291110.44 minimum=274669.98
instructions_per_op:
mean= 48117.57 standard-deviation=18.58
median= 48114.51 median-absolute-deviation=12.08
maximum=48162.18 minimum=48087.18
cpu_cycles_per_op:
mean= 18953.43 standard-deviation=28.76
median= 18945.82 median-absolute-deviation=20.84
maximum=19023.93 minimum=18916.46
```
Fixes: SCYLLADB-259
Refs: SCYLLADB-739
No backport, it's a new feature
Closesscylladb/scylladb#28570
* github.com:scylladb/scylladb:
scylla.yaml: add write CL guardrails to scylla.yaml
scylla.yaml: reorganize guardrails config to be in one place
test: add cluster tests for write CL guardrails
test: implement test_guardrail_write_consistency_level
cql3: start using write CL guardrails
cql3/query_processor: implement metrics to track CL of writes
db: cql3/query_processor: add write_consistency_levels enum_sets
config: add write_consistency_levels_* guardrails configuration
The test/alternator/run script currently fails, Scylla fails to boot
complaining that "--alternator-ttl-period-in-seconds" is specified
twice (which is, unfortunately, not allowed). The problem is that
recently we started to set this option in test/cqlpy/run.py, for
CQL's new row-level TTL, so now it is no longer needed in
test/alternator/run - and in fact not allowed and we must remove it.
This patch only affects the script test/alternator/run, and has no
affect on running tests through test.py or Jenkins.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#28868
This test forgot to await its check() calls, which is the pass-condition
of the test. Once the await was added, the test started failing. Turns
out, the test was broken, but this was never discovered, because due to
the missing await, the errors were not propagated.
This patch adds the missing await and fixes the discovered problems:
* Use cql.run_async() instead of cql.execute()
* Fix json path for timestamp
* Missing flush/compact
Fixes: SCYLLADB-911
Closesscylladb/scylladb#28883
Recently we started to rely on the options "--auth-superuser-name"
and "--auth-superuser-salted-password" to ensure that a
cassandra/cassandra user exists for tests - without those options
a default superuser no longer exists.
This broke "test/cqlpy/run --release" for old releases, earlier
than 5.4 (in the enterprise stream, 2024.1 or earlier), because
those old release didn't have this option.
So in this patch we fix the "--release" logic that removes these
options from the command line when running these old versions.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#28894
The helper in question duplicates the functionality of `take_snapshot()` one from the same file. The only difference is that it additionally creates keyspace:table with yet another helper, but that helper is also going to be removed (as continuation of #28600 and #28608)
Enhancing tests, not backporting
Closesscylladb/scylladb#28834
* github.com:scylladb/scylladb:
test_backup: Remove prepare_snapshot_for_backup()
test_backup: Patch test_simple_backup_and_restore to use take_snapshot()
test_backup: Patch backup tests to use take_snapshot()
test_backup: Add helper to take snapshot on a single server
8e9c7397c5 made `rs` a reference, which can
lead to use-after-free. The `normal_nodes` map containing the referenced
value can be destroyed before the last use of `rs` when the topology state
is reloaded after a context switch on some `co_await`. The following move
assignment in `storage_service::topology_state_load` causes this:
```
_topology_state_machine._topology = co_await _sys_ks.local().load_topology_state(tablet_hosts);
```
This issue has been discovered in next-2026.1 CI after queueing the
backport of #28558. `test_truncate_during_topology_change` failed after
ASan reported a heap-use-after-free in
```
co_await _repair.local().bootstrap_with_repair(get_token_metadata_ptr(), rs.ring.value().tokens, session);
```
This test enables `delay_bootstrap_120s`, which makes the bug much more
likely to reproduce, but it could happen elsewhere.
No backport needed, as the only backport of #28558 hasn't been merged yet.
The backport PR will cherry-pick this commit.
Closesscylladb/scylladb#28772
Set enable_schema_commitlog for each group0 tables.
Assert that group0 tables use schema commitlog in ensure_group0_schema
(per each command).
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-914.
set_is_group0_table takes an enabled flag, based on which it decides
whether it's a group0 table. The method is called only with enabled = true.
Drop the param. For not group0 tables nothing should be set.
After the previous changes in `raft::server::{add_entry, modify_config}`
(cf. SCYLLADB-841), we also go through other methods of `raft::server`
and verify that they handle the aborted state properly.
I found two methods that do not:
(A) `wait_for_leader`
(B) `wait_for_state_change`
What happened before these changes?
In case (A), the dangerous scenario occurred when `_leader_promise` was
empty on entering the function. In that case, we would construct the
promise and wait on the corresponding future. However, if the server
had been already aborted before the call, the future would never
resolve and we'd be effectively stuck.
Case (B) is fully analogous: instead of `_leader_promise`, we'd work
with `_stte_change_promise`.
There's probably a more proper solution to this problem, but since I'm
not familiar with the internal code of Raft, I fix it this way. We can
improve it further in the future.
We provide two simple validation tests. They verify that after aborting
a `raft::server`, the calls:
* do not hang (the tests would time out otherwise),
* throw raft::stopped_error.
Fixes SCYLLADB-841
Before the change, calling `add_entry` or `modify_config` on an already
aborted Raft server could result in an error `not_a_leader` containing
a null server ID. It was possible precisely when forwarding was
disabled in the server configuration.
`not_a_leader` is supposed to return the ID of the current leader,
so that was wrong. Furthermore, the description of the function
specified that if a server is aborted, then it should throw
`stopped_error`.
We fix that issue. A few small reproducer tests were provided to verify
that the functions behave correctly with and without forwarding enabled.
Refs SCYLLADB-841
Two calls in test_client_routes_upgrade were missing `await`,
so they were never actually executed. This caused Python
to emit RuntimeWarning about unawaited coroutines, and more
importantly, the test skipped important verification steps, which
could mask real bugs or cause flakiness.
Additionally, increase 10s timeouts to 60s to avoid flakiness in slow
environments. Although these tests haven't failed so far, similar
issues have already been observed in other tests with too-short
timeouts.
Fixes: [SCYLLADB-909](https://scylladb.atlassian.net/browse/SCYLLADB-909)
Backport to 2026.1, as the test is also there.
[SCYLLADB-909]: https://scylladb.atlassian.net/browse/SCYLLADB-909?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQClosesscylladb/scylladb#28877
* github.com:scylladb/scylladb:
test: increase timeouts in test_client_routes.py
test: add missing awaits in test_client_routes_upgrade
The main loops iterating over vector components were not vectorized due to:
- "cannot prove it is safe to reorder floating-point operations"
- "Cannot vectorize early exit loop with more than one early exit"
The first issue is fixed with adding `#pragma clang fp contract(fast) reassociate(on)`, which allows compiler to optimize floating point operations.
The second issue is solved by refactoring the operations in the affected loop.
Additionally using float operations instead of double increases throughput and numerical accuracy is not the main consideration in vector search scenarios.
Performance measured:
- scylla built using dbuild
- using https://github.com/zilliztech/VectorDBBench (modified to call `SELECT id, similarity_cosine({vector<float, 1536>}, {vector<float, 1536>}) ...` without ANN search):
- client concurrency 20
before: ~2250 QPS
`float` operations: ~2350 QPS
`compute_cosine_similarity` vectorization: ~2500QPS
`extract_float_vector` vectorization: ~3000QPS
Follow-up https://github.com/scylladb/scylladb/pull/28615
Ref https://scylladb.atlassian.net/browse/SCYLLADB-764Closesscylladb/scylladb#28754
Increase 10s timeouts to 60s to avoid flakiness in slow
environments. Although these tests haven't failed so far, similar
issues have already been observed in other tests with too-short
timeouts.
Test execution time is unaffected; the entire suite in `dev` takes ~30s
before and after this change.
Two calls in test_client_routes_upgrade were missing `await`,
so they were never actually executed. This caused Python
to emit RuntimeWarning about unawaited coroutines, and more
importantly, the test skipped important verification steps, which
could mask real bugs or cause flakiness.
Fixes: SCYLLADB-909
The comparator used to sort per-IP client rows was not a strict-weak-ordering (it could return true in both directions for some pairs), which makes `std::ranges::sort` behavior undefined. A concrete pair that breaks it (and is realistic in system.clients):
a = (port=9042, client_type="cql")
b = (port=10000, client_type="alternator")
With the current comparator:
cmp(a,b) = (9042 < 10000) || ("cql" < "alternator") = true || false = true
cmp(b,a) = (10000 < 9042) || ("alternator" < "cql") = false || true = true
So both directions are true, meaning there is no valid ordering that sort can achieve.
The fix is to sort lexicographically by (port, client_type) to match the table's clustering key and ensure deterministic ordering.
Closesscylladb/scylladb#28844
This series closes a gap in the approx_exponential_histogram implementation to
cover integer values starting from small Min values.
While the original implementation was focused on durations, where this limitation
was not an issue, over time, there has been a growing need for histograms that
cover smaller values, such as the number of SSTables or the number of items in a
batch.
The reason for the original limitation is inherent to the exponential histogram
math. The previous code required Min to be at least Precision to avoid negative
bit shifts in the exponential calculations.
After this series, approx_exponential_histogram allows Min to be smaller than
Precision by scaling values during indexing. The value is shifted left by
log2 Precision minus log2 Min or zero whichever is larger, and the existing
exponential math is applied. Bucket limits are then scaled back to the original
units. This keeps insertion and retrieval O(1) without runtime branching, at the
cost of repeated bucket limits for some values in the Min to Precision range.
Additional tests cover the new behavior.
Relates to #2785
** New feature, no need to backport. **
Closesscylladb/scylladb#28371
* github.com:scylladb/scylladb:
estimated_histogram_test.cc: add to_metrics_histogram test
histogram_metrics_helper.hh: Support Min < Precision
estimated_histogram_test.cc: Add tests for approx_exponential_histogram with Min<Precision
estimated_histogram.hh: support Min less than Precision histograms
`isclose` function checks if returned similarity floats are close enough to expected value, but it doesn't `assert` by itself.
Several tests missed that `assert`, effectively always passing.
With this patch similarity values checks are wrapped in helper function `assert_similarity` with predefined tolerance.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-877Closesscylladb/scylladb#28748
This patch series removes creation of default 'cassandra:cassandra' superuser on system start.
Disable creation of a superuser with default 'cassandra:cassandra' credentials to improve security. The current flow requires clients to create another superuser and then drop the default `cassandra:cassandra' role. For those who do, there is a time window where the default credentials exist. For those who do not, that role stays. We want to improve security by forcing the client to either use config to specify default values for default superuser name and password or use cqlsh over maintenance socket connection to explicitly create/alter a superuser role.
The patch series:
- Enable role modification over the maintenance socket
- Stop using default 'cassandra' value for default superuser, skipping creation instead
Design document: https://scylladb.atlassian.net/wiki/spaces/RND/pages/165773327/Drop+default+cassandra+superuserFixesscylladb/scylla-enterprise#5657
This is an improvement. It does not need a backport.
Closesscylladb/scylladb#27215
* github.com:scylladb/scylladb:
config: enable maintenance socket in workdir by default
docs: auth: do not specify password with -p option
docs: update documentation related to default superuser
test: maintenance socket role management
test: cluster: add logs to test_maintenance_socket.py
test: pylib: fix connect_driver handling when adding and starting server
auth: do not create default 'cassandra:cassandra' superuser
auth: remove redundant DEFAULT_USER_NAME from password authenticator
auth: enable role management operations via maintenance socket
client_state: add has_superuser method
client_state: add _bypass_auth_checks flag
auth: let maintenance_socket_role_manager know if node is in maintenance mode
auth: remove class registrator usage
auth: instantiate auth service with factory functors
auth: add service constructor with factory functors
auth: add transitional.hh file
service: qos: handle special scheduling group case for maintenance socket
service: qos: use _auth_integration as condition for using _auth_integration
The method is called from storage_proxy::mutate_hint() which is in turn called from hint_mutation::apply_locally(). The latter is either called from directly by hint sender, which already runs in streaming group, or via RPC HINT_MUTATION handler which uses index 1 that negotiates streaming group as well.
To be sure, add a debugging check for current group being the expected one.
Code cleanup, not backporting
Closesscylladb/scylladb#28545
* github.com:scylladb/scylladb:
hint: Don't switch group in database::apply_hint()
hint_sender: Switch to sender group on stop either
This change is a bit more careful, as the test collects files from
snapshot directory several times. Before patching it to use the helper,
it collected _all_ the files. Now the helper only provides TOC-s, but
that's fine -- the only check that relies on that may also re-collect
TOC-s and compare new set with old set.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some of those tests need to update the hard-coded 'backup' snapshot name
to use the one provided by take_snapshot() helper. Other than that, the
patching is pretty straightforward.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The take_snapshot() helper returns a dict(server: list[string]). When
there's only one server to work with, it's more handy to just get a
single list of sstables.
Next patches will make use of that helper.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Modify the methods which calculate the default gc mode as well as that
which validates whether repair-mode can be used at all, so both accepts
use of repair-mode on RF=1 tables.
This de-facto changes the default tombstone-gc to repair-mode for all
tables. Documentation is updated accordingly.
Some tests need adjusting:
* cqlpy/test_select_from_mutation_fragments.py: disable GC for some test
cases because this patch makes tombstones they write subject to GC
when using defaults.
* test/cluster/test_mv.py::test_mv_tombstone_gc_not_inherited used
repair-mode as a non-default for the base table and expected the MV to
revert to default. Another mode has to be used as the non-default
(immediate).
* test/cqlpy/test_tools.py::test_scylla_sstable_dump_schema: don't
compare tombstone_gc schema extension when comparing dumped schema vs.
original. The tool's schema loader doesn't have access to the keyspace
definition so it will come up with different defaults for
tombstone-gc.
* test/boost/row_cache_test.cc::test_populating_cache_with_expired_and_nonexpired_tombstones
sets tombstone expiry assuming the tombstone-gc timeout-mode default.
Change the CREATE TABLE statement to set the expected mode.
Most of the functionality is tested in cqlpy tests located in
`test_guardrail_write_consistency_level.py`. Add two tests
that require the cluster framework:
- `test_invalid_write_cl_guardrail_config` checks the node startup
path when incorrect `write_consistency_levels_warned` and
`write_consistency_levels_disallowed` values are used.
- `test_write_cl_default` checks the behavior of the default
configuration using a multi-node cluster.
Tests execution time:
- Dev: 10s
- Debug: 18s
Refs: SCYLLADB-259
Implement basic tests for write consistency level guardrails,
verifying that they work for each type of write request (inserts,
updates, deletes, logged batches, unlogged batches, conditional batches,
and counter operations).
All tests are marked as Scylla-only because they currently don't
pass with Cassandra due to differences in handling superusers (see:
SCYLLADB-882).
Tests execution time:
- Dev: 3s
- Debug: 14s
Refs: SCYLLADB-259
Refs: SCYLLADB-882
In this series we introduce new system tables and use them for storing the raft metadata
for strongly consistent tables. In contrast to the previously used raft group0 tables, the
new tables can store data on any shard. The tables also allow specifying the shard where
each partition should reside, which enables the tablets of strongly consistent tables to have
their raft group metadata co-located on the same shard as the tablet replica.
The new tables have almost the same schemas as the raft group0 tables. However, they
have an additional column in their partition keys. The additional column is the shard
that specifies where the data should be located. While a tablet and its corresponding
raft group server resides on some shard, it now writes and reads all requests to the
metadata tables using its shard in addition to the group_id.
The extra partition key column is used by the new partitioner and sharder which allow
this special shard routing. The partitioner encodes the shard in the token and the
sharder decodes the shard from the token. This approach for routing avoids any
additional lookups (for the tablet mapping) during operations on the new tables
and it also doesn't require keeping any state. It also doesn't interact negatively
with resharding - as long as tablets (and their corresponding raft metadata) occupy
some shard, we do not allow starting the node with a shard count lower than the
id of this shard. When increasing the shard count, the routing does not change,
similarly to how tablet allocation doesn't change.
To use the new tables, a new implementation of `raft::persistence` is added. Currently,
it's almost an exact copy of the `raft_sys_table_storage` which just uses the new tables,
but in the future we can modify it with changes specific to metadata (or mutation)
storage for strongly consistent tables. The new storage is used in the `groups_manager`,
which combined with the removal of some `this_shard_id() == 0` checks, allows strongly
consistent tables to be used on all shards.
This approach for making sure that the reads/writes to the new tables end up on the correct shards
won in the balance of complexity/usability/performance against a few other approaches we've considered.
They include:
1. Making the Raft server read/write directly to the database, skipping the sharder, on its shard, while using
the default partitioner/sharder. This approach could let us avoid changing the schema and there should be
no problems for reads and writes performed by the Raft server. However, in this approach we would input
data in tables conflicting with the placement determined by the sharder. As a result, any read going through
the sharder could miss the rows it was supposed to read. Even when reading all shards to find a specific value,
there is a risk of polluting the cache - the rows loaded on incorrect shards may persist in the cache for an unknown
amount of time. The cache may also mistakenly remember that a row is missing, even though it's actually present,
just on an incorrect shard.
Some of the issues with this approach could be worked around using another sharder which always returns
this_shard_id() when asked about a shard. It's not clear how such a sharder would implement a method like
`token_for_next_shard`, and how much simpler it would be compared to the current "identity" sharder.
2. Using a sharder depending on the current allocation of tablets on the node. This approach relies on the
knowledge of group_id -> shard mapping at any point in time in the cluster. For this approach we'd also
need to either add a custom partitioner which encodes the group_id in the token, or we'd need to track the
token(group_id) -> shard mapping. This approach has the benefit over the one used in the series of keeping
the partition key as just group_id. However, it requires more logic, and the access to the live state of the node
in the sharder, and it's not static - the same token may be sharded differently depending on the state of the
node - it shouldn't occur in practice, but if we changed the state of the node before adjusting the table data,
we would be unable to access/fix the stale data without artificially also changing the state of the node.
3. Using metadata tables co-located to the strongly consistent tables. This approach could simplify the
metadata migrations in the future, however it would require additional schema management of all co-located
metadata tables, and it's not even obvious what could be used as the partition key in these tables - some
metadata is per-raft-group, so we couldn't reuse the partition key of the strongly consistent table for it. And
finding and remembering a partition key that is routed to a specific shard is not a simple task. Finally, splits
and merges will most likely need special handling for metadata anyway, so we wouldn't even make use of
co-located table's splits and merges.
Fixes [SCYLLADB-361](https://scylladb.atlassian.net/browse/SCYLLADB-361)
[SCYLLADB-361]: https://scylladb.atlassian.net/browse/SCYLLADB-361?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQClosesscylladb/scylladb#28509
* github.com:scylladb/scylladb:
docs: add strong consistency doc
test/cluster: add tests for strongly-consistent tables' metadata persistence
raft: enable multi-shard raft groups for strongly consistent tablets
test/raft: add unit tests for raft_groups_storage
raft: add raft_groups_storage persistence class
db: add system tables for strongly consistent tables' raft groups
dht: add fixed_shard_partitioner and fixed_shard_sharder
raft: add group_id -> shard mapping to raft_group_registry
schema: add with_sharder overload accepting static_sharder reference
The default 100ms timeout for client readiness in tests is too
aggressive. In some test environments, this is not enough time for
client creation, which involves address resolution and TLS certificate
reading, leading to flaky tests.
This commit increases the default client creation timeout to 10 seconds.
This makes the tests more robust, especially in slower execution
environments, and prevents similar flakiness in other test cases.
Fixes: VECTOR-547, SCYLLADB-802, SCYLLADB-825, SCYLLADB-826
Backport to 2025.4 and 2026.1, as the same problem occurs on these branches and can potentially make the CI flaky there as well.
Closesscylladb/scylladb#28846
* github.com:scylladb/scylladb:
vector_search: test: include ANN error in assertion
vector_search: test: fix HTTPS client test flakiness
When we create a materialized view, we consider 2 cases:
1. the view's primary key contains a column that is not
in the primary key of the base table
2. the view's primary key doesn't contain such a column
In the 2nd case, we add all columns from the base table
to the schema of the view (as virtual columns). As a result,
all of these columns are effectively "selected" in
view_updates::can_skip_view_updates. Same thing happens when
we add new columns to the base table using ALTER.
Because of this, we can never have !column_is_selected and
!has_base_non_pk_columns_in_view_pk at the same time. And
thus, the check (!column_is_selected
&& _base_info.has_base_non_pk_columns_in_view_pk) is always
the same as (!column_is_selected).
Because we immediately return after this check, the tail of
this function is also never reached - all checks after the
(column_is_selected) are affected by this. Also, the condition
(!column_is_selected && base_has_nonexpiring_marker) is always
false at the point it is called. And this in turn makes the
`base_has_nonexpiring_marker` unused, so we delete it as well.
It's worth considering, why did we even have
`base_has_nonexpiring_marker` if it's effectively unused. We
initially introduced it in bd52e05ae2 and we (incorrectly)
used it to allow skipping view updates even if the liveness of
virtual columns changed. Soon after, in 5f85a7a821, we
started categorizing virtual columns as column_is_selected == true
and we moved the liveness checks for virtual columns to the
`if (column_is_selected)` clause, before the `base_has_nonexpiring_marker`
check. We changed this because even if we have a nonexpiring marker
right now, it may be changed in the future, in which case the liveness
of the view row will depend on liveness of the virtual columns and
we'll need to have the view updates from the time the row marker was
nonexpiring.
Closesscylladb/scylladb#28838
Fixes issue #12818 with the following docs changes:
docs/dev/system_keyspace.md: Added missing system tables, added table of contents (TOC), added categories
Closesscylladb/scylladb#27789
For 2025.3 and 2025.4 this test runs order of magnitude
slower in debug mode. Potentially due to passwords::check
running in alien thread and overwhelming the CPU (this is
fixed in newer versions).
Decreasing the number of connections in test makes it fast
again, without breaking reproducibility.
As additional measure we double the timeout.
The fix is now cherry-picked to master as sometimes
test fails there too.
(cherry picked from commit 1f1fc2c2ac)
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-795
backport: 2026.1, already on other stable branches
Closesscylladb/scylladb#28848
* github.com:scylladb/scylladb:
test: add more logs to test_startup_no_auth_response
test: decrease strain in test_startup_response
Enable verification of write consistency level guardrails in
`modification_statement` and `batch_statement`.
Neither guardrail is enabled by default, so as not to disrupt clusters
that are currently using any of the CLs for writes. The warning
guardrail may seem harmless, as it only adds a warning to the CQL
response; however, enabling it can significantly increase network
traffic (as a warning message is added to each response) and also
decrease throughput due to additional allocations required to prepare
the warning. Therefore, both guardrails should be enabled with care.
The newly added `writes_per_consistency_level` metric, which is
incremented unconditionally, can help decide whether a guardrail can
be safely enabled in an existing cluster.
This commit adds additional `if` instructions on the critical path.
However, based on the `perf_simple_query` benchmark for writes,
the difference is marginal (~40 additional instructions, which is
a relative difference smaller than 0.001).
BEFORE:
```
291443.35 tps ( 53.3 allocs/op, 16.0 logallocs/op, 14.2 tasks/op, 48067 insns/op, 18885 cycles/op, 0 errors)
throughput:
mean= 289743.07 standard-deviation=6075.60
median= 291424.69 median-absolute-deviation=1702.56
maximum=292498.27 minimum=261920.06
instructions_per_op:
mean= 48072.30 standard-deviation=21.15
median= 48074.49 median-absolute-deviation=12.07
maximum=48119.87 minimum=48019.89
cpu_cycles_per_op:
mean= 18884.09 standard-deviation=56.43
median= 18877.33 median-absolute-deviation=14.71
maximum=19155.48 minimum=18821.57
```
AFTER:
```
290108.83 tps ( 53.3 allocs/op, 16.0 logallocs/op, 14.2 tasks/op, 48121 insns/op, 18988 cycles/op, 0 errors)
throughput:
mean= 289105.08 standard-deviation=3626.58
median= 290018.90 median-absolute-deviation=1072.25
maximum=291110.44 minimum=274669.98
instructions_per_op:
mean= 48117.57 standard-deviation=18.58
median= 48114.51 median-absolute-deviation=12.08
maximum=48162.18 minimum=48087.18
cpu_cycles_per_op:
mean= 18953.43 standard-deviation=28.76
median= 18945.82 median-absolute-deviation=20.84
maximum=19023.93 minimum=18916.46
```
Fixes: SCYLLADB-259
Consider this:
- repair takes the lock holder
- tablet merge filber destories the compaction group and the compaction state
- repair fails
- repair destroy the lock holder
This is observed in the test:
```
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036] Repair 1 out of 1 tablets: table=sec_index.users range=(432345564227567615,504403158265495551] replicas=[0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea:15, 498e354c-1254-4d8d-a565-2f5c6523845a:9, 5208598c-84f0-4526-bb7f-573728592172:28]
...
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: Started to repair 1 out of 1 tables in keyspace=sec_index, table=users, table_id=ea2072d0-ccd9-11f0-8dba-c5ab01bffb77, repair_reason=repair
repair - Enable incremental repair for table=sec_index.users range=(432345564227567615,504403158265495551]
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: get_sync_boundary: got error from node=0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea, keyspace=sec_index, table=users, range=(432345564227567615,504403158265495551], error=seastar::rpc::remote_verb_error (Compaction state for table [0x60f008fa34c0] not found)
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge
....
scylla[10793] Segmentation fault on shard 28, in scheduling group streaming
```
The rwlock in compaction_state could be destroyed before the lock holder
of the rwlock is destroyed. This causes user after free when the lock
the holder is destroyed.
To fix it, users of repair lock will now be waited when a compaction
group is being stopped.
That way, compaction group - which controls the lifetime of rwlock -
cannot be destroyed while the lock is held.
Additionally, the merge completion fiber - that might remove groups -
is properly serialized with incremental repair.
The issue can be reproduced using sanitize build consistently and can not
be reproduced after the fix.
Fixes#27365
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Prevent repair lock holder from being leaked in repair_service when table
is dropped midway.
The leakage might result in use-after-free later, since the repair lock
itself will be gone after table drop.
The RPC verb that removes the lock on success path will not be called
by coordinator after table was dropped.
Refs #27365.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-896.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We want to enable maintenance socket by default.
This will prevent users from having to reboot a server to enable it.
Also, there is little point in having maintenance socket that is turned off,
and we want users to use it. After this patch series, they will have
to use it. Note that while config seeding exists, we do not encourage it
for production deployments.
This patch changes default maintenance_socket value from ignore to workdir.
This enables maintenance socket without specifying an explicit path.
Refs SCYLLADB-409
Specifying password with -p option is considered unsafe.
The password will be saved in bash history.
The preferred approach is to enter the password when prompted.
Any approach that passes the password via command line arguments
makes that password visible in process options (ps command), no matter
if the password is passed directly or as an environment variable.
Refs SCYLLADB-409
Update create superuser procedure:
- Remove notes about default `cassandra` superuser
- Add create superuser using existing superuser section
- Update create superuser by using `scylla.yaml` config
- Add create superuser using maintenance socket
Update password reset procedure:
- Add maintenance socket approach
- Remove the old approach with deleting all the roles
Update enabling authentication with downtime and during runtime:
- Mention creating new superuser over the maintenance socket
- Remove default superuser usage
Update enable authorization:
- Mention creating new superuser over the maintenance socket
- Remove mention of default superuser
Reasoning for deletion of the old approach:
- [old] Needs cluster downtime, removes all roles, needs recreation of roles,
needs maintenance socket anyways, if config values are not used for superuser
- [new] No cluster downtime, possibly one node restart to enable maintenance
socket, faster
Refs SCYLLADB-409
Introduce a test that cover:
- Server startup without credentials config seeding with no roles created
- Await maintenance socket role management to be enabled
- `CREATE ROLE`, `ALTER ROLE`, and `DROP ROLE` statement execution success
All the tests in the test_maintenance_socket.py module take 2-3 seconds
to execute.
Explicitly shut down Cluster objects to prevent 'RuntimeError: cannot
schedule new futures after shutdown'.
Refs SCYLLADB-409
Add logs to test_maintenance_socket.py test test_maintenance_socket.
This approach offers additional visibility in case of test failure.
Such logs will be added to new tests in a follow up patch in this
patch series.
Refs SCYLLADB-409
When connect_driver=False, the expected server up state should be
capped to HOST_ID_QUERIED. This is to avoid waiting for CQL readiness,
which requires a superuser to be present.
This logic was only in ScyllaCluster.server_start. ManagerClient.server_add
with start=True and connect_driver=False would still wait for CQL and hang
if no superuser is present. The workaround was to call
ManagerClient.server_add(start=False, connect_driver=False) followed by
ManagerClient.server_start(connect_driver=False).
This patch moves the capping from ScyllaCluster.server_start to
ManagerClient.server_add and ManagerClient.server_start, where connect_driver
is processed. ScyllaCluster only receives the already resolved
expected_server_up_state value.
Refs SCYLLADB-409
Changes the behavior of default superuser creation.
Previously, without configuration 'cassandra:cassandra' credentials
were used. Now default superuser creation is skipped if not configured.
The two ways to create default superuser are:
- Config file - auth_superuser_name and auth_superuser_salted_password fields
- Maintenance socket - connect over maintenance socket and CREATE/ALTER ROLE ...
Behavior changes:
Old behavior:
- No config - 'cassandra:cassandra' created
- auth_superuser_name only - <name>:cassandra created
- auth_superuser_salted_password only - 'cassandra:<password>' created
- Both specified - '<name>:<password>' created
New behavior:
- No config - no default superuser
- Requires maintenance socket setup
- auth_superuser_name only - '<name>:' created WITHOUT password
- Requires maintenance socket setup
- auth_superuser_salted_password only - no default superuser
- Both specified - '<name>:<password>' created
Fixes SCYLLADB-409
Introduce maintenance_socket_authenticator and rework
maintenance_socket_role_manager to support role management operations.
Maintenance auth service uses allow_all_authenticator. To allow
role modification statements over the maintenance socket connections,
we need to treat the maintenance socket connections as superusers and
give them proper access rights.
Possible approaches are:
1. Modify allow_all_authenticator with conditional logic that
password_authenticator already does
2. Modify password_authenticator with conditional logic specific
for the maintenance socket connections
3. Extend password_authenticator, overriding the methods that differ
Option 3 is chosen: maintenance_socket_authenticator extends
password_authenticator with authentication disabled.
The maintenance_socket_role_manager is reworked to lazily create a
standard_role_manager once the node joins the cluster, delegating role
operations to it. In maintenance mode role operations remain disabled.
Refs SCYLLADB-409
Encapsulate the superuser check in client_state so that it
respects _bypass_auth_checks. Connections that bypass auth
(internal callers and the maintenance socket) are always
considered superusers.
Migrate existing call sites from auth::has_superuser(service, user)
to client_state.has_superuser(). Also add _bypass_auth_checks
handling to ensure_not_anonymous().
Refs SCYLLADB-409
Authorization checks were previously skipped based on the
_is_internal flag. This couples two concerns: marking client
state as internal and bypassing authorization.
Introduce _bypass_auth_checks to handle only the authorization
bypass. Internal client state sets it to true, preserving current
behavior. External client state accepts it as a constructor
parameter, defaulting to false.
This will allow maintenance socket connections to skip
authorization without being marked as internal.
Refs SCYLLADB-409
This patch is part of preparations for dropping 'cassandra::cassandra'
default superuser. When that is implemented, maintenance_socket_role_manager
will have two modes of work:
1. in maintenance mode, where role operations are forbidden
2. in normal mode, where role operations are allowed
To execute the role operations, the node has to join a cluster.
In maintenance mode the node does not join a cluster.
This patch lets maintenance_socket_role_manager know if it works under
maintenance mode and returns appropriate error message when role
operations execution is requested.
Refs SCYLLADB-409
This patch removes class registrator usage in auth module.
It is not used after switching to factory functor initialization
of auth service.
Several role manager, authenticator, and authorizer name variables
are returned as well, and hardcoded inside qualified_java_name method,
since that is the only place they are ever used.
Refs SCYLLADB-409
Auth service is instantiated with the constructor that accepts
service_config, which then uses class registrator to instantiate
authorizer, authenticator, and role manager.
This patch switches to instantiating auth service via the constructor
that accepts factory functors. This is a step towards removing
usage of class registrator.
Refs SCYLLADB-409
Auth service can be initialized:
- [current] by passing instantiated authorizer, authenticator, role manager
- [current] by passing service_config, which then uses class registrator to instantiate authorizer, authenticator, role manager
- This approach is easy to use with sharded services
- [new] by passing factory functors which instantiate authorizer, authenticator, role manager
- This approach is also easy to use with sharded services
Refs SCYLLADB-409
In a follow-up patch in this patch series class registrator will be removed.
Adding transitional.hh file will be necessary to expose the authenticator and authorizer.
Refs SCYLLADB-409
service_level_controller has special handling for maintenance socket connections.
If the current user is not a named user, it should use the default scheduling group.
The reason is that the maintenance socket can communicate with Scylla before
auth_integration is registered.
The guard is already present, but it was omitted in get_cached_user_scheduling_group.
This also fixes flakiness in test_maintenance_socket.py tests.
Refs SCYLLADB-409
Maintenance socket connections can be established before _auth_integration is
initialized. The fix introduced with scylladb/scylladb#26856 PR check for
the value of user variable. For maintenance socket connections it will be an
anonymous user, and will fall back to using default scheduling group.
This patch changes the criteria for using default scheduling group from
the user variable to checking the _auth_integration variable itself:
- If _auth_integration is not initialized, use default scheduling group
- If _auth_integration is initialized, let it choose the scheduling group
Refs SCYLLADB-409
Add `write_consistency_levels_disallowed_violations` and
`write_consistency_levels_warned_violations` metrics to track
violations of write_consistency_levels guardrails.
Add `writes_per_consistency_level` to track what CL is used by
writes, regardless of the guardrails configuration.
Data gathered by this metric can be used to decide whether enabling
a particular write consistency level guardrail in a particular
existing cluster is safe.
Refs: SCYLLADB-259
Add enum_sets to query_processor that track the configuration
values of `write_consistency_levels_warned` and
`write_consistency_levels_disallowed`.
Refs: SCYLLADB-259
We introduce a function creating a Raft cluster with parameters usually
used by the tests. This will avoid code duplication, especially after
introducing new tests in the following commits.
Note that the test `test_aborting_wait_for_state_change` has changed:
the previous complex configuration was unnecessary for it (I wrote it).
These two are only used to print into logs on error. However, their
values can be found from previous logs and test execution context.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the test fails, the assertion message does not include
the error from the ANN request.
This change enhances the assertion to include the specific ANN error,
making it easier to diagnose test failures.
The default 100ms timeout for client readiness in tests is too
aggressive. In some test environments, this is not enough time for
client creation, which involves address resolution and TLS certificate
reading, leading to flaky tests.
This commit increases the default client creation timeout to 10 seconds.
This makes the tests more robust, especially in slower execution
environments, and prevents similar flakiness in other test cases.
Fixes: VECTOR-547, SCYLLADB-802
On every update of the ERM, update the state of the current table in the
registry of RF=1 tables in shared tombstone gc state.
Ensures that tombstone gc stops collection of tombstones in immediate
mode as soon as the table starts transitioning away from RF=1.
Currently we cannot use repair-mode tombstone gc on RF=1 tables, because
such tables don't need repair and so there won't be repair history to
use to produce gc_before times.
Introduce shared_tombstone_gc_state::_rf_one_tables which will keep a
registry of RF=1 tables. Keeping this up to date is left to outside code
(table.cc). Consult the registry to determine whether a table is RF=1 or
not, so the repair history check can be ellided for rf=1 tables.
Not wired in yet into the table code.
Doesn't belong there. Also, having it as a separate member of
shared_tombstone_gc_state makes updating _group0_gc_time cheaper, as the
update doesn't have to do a copy-mutate-swap of the history maps.
This is the current de-facto default for all tests using random schema
and some are apparently relying on this. Make this explicit to avoid
upsetting tests, by the impending change of this default to repair.
It is ambigous, use the appropriate no-gc or gc-all factories instead,
as appropriate.
A special note for mutation::compacted(): according to the comment above
it, it doesn't drop expired tombstones but as it is currently, it
actually does. Change the tombstone gc param for the underlying call to
compact_for_compaction() to uphold the comment. This is used in tests
mostly, so no fallout expected.
Tests are handled in the next commit, to reduce noise.
Two tests in mutation_test.cc have to be updated:
* test_compactor_range_tombstone_spanning_many_pages
has to be updated in this commit, as it uses
mutation_partition::compact_for_query() as well as
compact_for_query(). The test passes default constructed
tombstone_gc() to the latter while the former now uses no-gc
creating a mismatch in tombstone gc behaviour, resulting in test
failure. Update the test to also pass no-gc to compact_for_query().
* test_query_digest similarly uses mutation_partition::query_mutation()
and another compaction method, having to match the no-gc now used in
query_mutation().
It is ambiguous, use tombstone_gc_mode::is_gc_enabled() instead.
Note that the two has slightly different meanings, operator bool()
returned true when repair-history related functionality was enabled.
This is fine, because the only two users are logs, where the two
meanings are close enough. All other users were eliminated or migrated
already, taking the change in meaning into account.
Instead of keeping a pointer to it. Replace nullptr with
tombstone_gc_state::no_gc().
This object is now designed to be used as a value-type, after recent
refactoring.
To replace the usage of tombstone_gc_state(nullptr) usage in tests
specifically. This more verbose factory method will hopefully convey
that this is not to be used in production code. The nullptr constructor
doesn't convey this and in fact it was used in production code
here-and-there.
This commit introduces pure pytest logging into a file
Previously, test.py used pytest as a script(not a framework) and just captured pytest stdout and logged this data by itself
This commit sets up the log files format that additionaly display Python processName, threadName adn taskName because test.py test cases use them, and now it is so hard to investigate issues that are connected with parallelism inside test case themselve
In addition, commit splits the logging of different pytest workers(xdist) into different files. If pytest workers have ho failed test - log file for these workers will be deleted
There is also additional logging for failures that will contain a separate file per test failure and contain the error itself (stacktrace) and all capture logs from stdout, stderr during the test run. With --save-log-on-success it will be a separate file per test on pass as well
All this new functionality works with the new xdit scheduler (--test-py-init=True)
Fixes SCYLLADB-713
Closesscylladb/scylladb#28705
Introduced a new max_tablet_count tablet option that caps the maximum number of tablets a table can have. This feature is designed primarily for backup and restore workflows.
During backup, when load balancing is disabled for snapshot consistency, the current tablet count is recorded in the backup manifest.
During restore, max_tablet_count is set to this recorded value, ensuring the restored table's tablet count never exceeds the original snapshot's tablet distribution.
This guarantee enables efficient file-based SSTable streaming during restore, as each SSTable remains fully contained within a single tablet boundary.
Closesscylladb/scylladb#28450
The secondary index mechanism is currently used to determine the target column.
This mechanism works incorrectly for vector indexes with filtering because
it returns the last specified column as the target (vectors) column.
However, the syntax for a vector index requires the first column to be the target:
```
CREATE CUSTOM INDEX ON t(vectors, users) USING 'vector_index';
```
This discrepancy eventually leads to the following exception when performing an
ANN search on a vector index with filtering columns:
````
ANN ordering by vector requires the column to be indexed using 'vector_index'
````
This commit fixes the issue by introducing dedicated logic for vector indexes
to correctly identify the target(vectors) column.
Fixes: SCYLLADB-635
Closesscylladb/scylladb#28740
Split input sstable(s) into multiple output sstables based on the provided
token boundaries. The input sstable(s) are divided according to the specified
split tokens, creating one output sstable per token range.
Fixes: SCYLLADB-10
Closesscylladb/scylladb#28741
For 2025.3 and 2025.4 this test runs order of magnitude
slower in debug mode. Potentially due to passwords::check
running in alien thread and overwhelming the CPU (this is
fixed in newer versions).
Decreasing the number of connections in test makes it fast
again, without breaking reproducibility.
As additional measure we double the timeout.
The fix is now cherry-picked to master as sometimes
test fails there too.
(cherry picked from commit 1f1fc2c2ac)
The query (and in certain modes the write) operations uses virtual table facility inside `cql_test_env`. The schema of the sstable is created as a table in `cql_test_env`. This involves registering all UDTs with the keyspace, so they are available for lookups.
This was done with a flat loop over all column types, but this is not enough. UDTs might be nested in other types, like collections. One has to do a traversal of the type tree and register every UDT on the way.
This PR changes the flat loop to a recursive traversal of the type tree. The query operation now works with UDTs, no matter how deeply nested they are.
Backport: Implements missing functionality of a tool, no backport.
Closesscylladb/scylladb#28798
* github.com:scylladb/scylladb:
tools/scylla-sstable: create_table_in_cql_env(): register UDTs recursively
tools/scylla-sstable: generalize dump_if_user_type
tools/scylla-sstable: move dump_if_user_type() definition
3f7ee3ce5d introduced system.batchlog_v2, with a schema designed to speed up batchlog replays and make post-replay cleanups much more effective.
It did not introduce a cluster feature for the new table, because it is node local table, so the cluster can switch to the new table gradually, one node at a time.
However, https://github.com/scylladb/scylladb/issues/27886 showed that the switching causes timeouts during upgrades, in mixed clusters. Furthermore, switching to the new table unconditionally on upgrades nodes, means that on rollback, the batches saved into the v2 table are lost.
This PR introduces re-introduces v1 (`system.batchlog`) support and guards the use of the v2 table with a cluster feature, so mixed clusters keep using v1 and thus be rollback-compatible.
The re-introduced v1 support doesn't support post-replay cleanups for simplicity. The cleanup in v1 was never particularly effective anyway and we ended up disabling it for heavy batchlog users, so I don't think the lack of support for cleanup is a problem.
Fixes: https://github.com/scylladb/scylladb/issues/27886
Needs backport to 2026.1, to fix upgrades for clusters using batches
Closesscylladb/scylladb#28736
* github.com:scylladb/scylladb:
test/boost/batchlog_manager_test: add tests for v1 batchlog
test/boost/batchlog_manager_test: make prepare_batches() work with both v1 and v2
test/boost/batchlog_manager_test: fix indentation
test/boost/batchlog_manager_test: extract prepare_batches() method
test/lib/cql_assertions: is_rows(): add dump parameter
tools/scylla-sstable: extract query result printers
tools/scylla-sstable: add std::ostream& arg to query result printers
repair/row_level: repair_flush_hints_batchlog_handler(): add all_replayed to finish log
db/batchlog_manager: re-add v1 support
db/batchlog_manager: return all_replayed from process_batch()
db/batchlog_manager: process_bath() fix indentation
db/batchlog_manager: make batch() a standalone function
db/batchlog_manager: make structs stats public
db/batchlog_manager: allocate limiter on the stack
db/batchlog_manager: add feature_service dependency
gms/feature_service: add batchlog_v2 feature
The test is currently flaky with `reuse_ip = True`. The issue is that the
test retries replace before the first replace is rolled back and the
first replacing node is removed from gossip. The second replacing node
can see the entry of the first replacing node in gossip. This entry has
a newer generation than the entry of the node being replaced, and both
replacing nodes have the same IP as the node being replaced. Therefore,
the second replacing node incorrectly considers this entry as the entry
of the node being replaced. This entry is missing rack and DC, so the
second replace fails with
```
ERROR 2026-02-24 21:19:03,420 [shard 0:main] init - Startup failed:
std::runtime_error (Cannot replace node
8762a9d2-3b30-4e66-83a1-98d16c5dd007/127.61.127.1 with a node on
a different data center or rack.
Current location=UNKNOWN_DC/UNKNOWN_RACK, new location=dc1/rack2)
```
Fixes SCYLLADB-805
Closesscylladb/scylladb#28829
Following becb48b586 it seems we have a regression with trigger CI logic
The Verify Org Membership step used gh api /orgs/scylladb/members/$AUTHOR
with GITHUB_TOKEN to check if the user is an org member. However,
GITHUB_TOKEN does not have read:org scope, so the API call fails for all
users — even actual scylladb org members — causing CI triggers to be
silently skipped.
Replace the API call with the author_association field from the GitHub
event payload, which is set by GitHub itself and requires no special
token permissions. This allows any scylladb org member (MEMBER or OWNER)
to trigger CI via comment, regardless of whether they authored the PR.
Closesscylladb/scylladb#28837
remove the test since it's not relevant anymore, it's not testing what
it's supposed to test and it's unstable.
the purpose of the test was to reproduce an issue in the legacy view
builder where a view starts to build at token T2 and then all tokens
[T1, end) with T1<T2 migrate to another node while it's still building,
exposing an issue when the view builder wraparounds the token ring.
this is not relevant anymore because now view building with tablets is
done via the view building coordinator for tablets, and all views start
to build from the first token with no wraparound.
besides, the test is unstable due to relying too much on specific
timing, which was useful for investigating and fixing the original issue
but not anymore.
Fixes SCYLLADB-842
Closesscylladb/scylladb#28842
The issue was fixed by commit cc03f5c89d
("cql3: support literals and bind variables in selectors"), so the
xfail marker is no longer needed.
Closesscylladb/scylladb#28776
The test uses create_dataset helper duplicating the existing code that does the same. This PR patches basic tests to use standard facilities.
Also the PR simplifies the 3-level nested loops used to combine several sets of restoration parameters by using itertools.product facility.
Continuation of #28600.
Cleaning tests, not backporting
Closesscylladb/scylladb#28608
* github.com:scylladb/scylladb:
test/object_store: Use itertools.product() for deeply nested loops
test/object_store: Replace dataset creation usage with standard methods
test/object_store: Shift indentation right for test_restore_with_streaming_scopes
The perf-simple-query tests were not restricted on CPU count,
so on a 96-CPU machine, they would run on 96 CPUs, and time out
in debug mode.
All restrict memory usage and add --overprovisioned so that
pinning is disabled. Apply that to all tests.
Closesscylladb/scylladb#28821
The helper is very simple yet generic -- it takes a snapshot of a keyspace on all servers and collects the resulting sstables from workdirs. Re-using it in all test cases saves some lines of code. Also, the method is "sequential", making it "parallel" reduces the waiting time a bit.
Will help generalizing existing backup/restore tests to support clustered snapshot/backup/restore API (see #28525) later.
Cleaning up tests, not backporting.
Closesscylladb/scylladb#28660
* github.com:scylladb/scylladb:
test/backup: Run keyspace flush and snapshot taking API in parallel
test/backup: Re-use take_snapshot() helper in do_abort_restore()
test/backup: Move take_snapshot() helper up
Split the chained inject_parameter().value_or() expression into two
separate named variables for clarity.
Use condition_variable::when() instead of wait(). when() is the
coroutine-native API, designed specifically for co_await contexts —
it avoids the heap allocation of a promise_waiter that wait() uses,
and instead uses a stack-based awaiter.
Closesscylladb/scylladb#28824
The PR removes most of the code that assumes that group0 and raft topology is not enabled. It also makes sure that joining a cluster in no raft mode or upgrading a node in a cluster that not yet uses raft topology to this version will fail.
Refs #15422
No backport needed since this removes functionality.
Closesscylladb/scylladb#28514
* https://github.com/scylladb/scylladb:
group0: fix indentation after previous patch
raft_group0: simplify get_group0_upgrade_state function since no upgrade can happen any more
raft_group0: move service::group0_upgrade_state to use fmt::formatter instead of iostream
raft_group0: remove unused code from raft_group0
node_ops: remove topology over node ops code
topology: fix indentation after the previous patch
topology: drop topology_change_enabled parameter from raft_group0 code
storage_service: remove unused handle_state_* functions
gossiper: drop wait_for_gossip_to_settle and deprecate correspondent option
storage_service: fix indentation after the last patch
storage_service: remove gossiper bootstrapping code
storage_service: drop get_group_server_if_raft_topolgy_enabled
storage_service: drop is_topology_coordinator_enabled and its uses
storage_service: drop run_with_api_lock_in_gossiper_mode_only
topology: remove code that assumes raft_topology_change_enabled() may return false
test: schema_change_test: make test_schema_digest_does_not_change_with_disabled_features tests run in raft mode
test: schema_change_test: drop schema tests relevant for no raft mode only
topology: remove upgrade to raft topology code
group0: remove upgrade to group0 code
group0: refuse to boot if a cluster is still is not in a raft topology mode
storage_service: refuse to join a cluster in legacy mode
This commit adds the upgrade guide for version 2026.1.
According to the new upgrade policy, the user can now upgrade to the major version (2026.1)
from any previous minor version.
So instead of adding a separate guide form 2025.4 to 2026.1, we need a guide from 2025.x to 2026.1.
In addition, this commit:
- Updates the upgrade policy for reflect the above change.
- Removes the upgrade guides for the previous version.
Fixes https://github.com/scylladb/scylladb/issues/28533
Fixes https://github.com/scylladb/scylladb/issues/28532Closesscylladb/scylladb#28789
assert(), and SCYLLA_ASSERT() are evil (Refs #7871) because they can cause the entire Scylla cluster to crash mysteriously instead of cleanly failing the specific request that encountered a serious problem of failed pre-requisite.
In this two-patch series, in the first patch we introduce a new macro throwing_assert(), a convenient drop-in replacement for SCYLLA_ASSERT() but which has all the benefits of on_internal_error() instead of the dangers of SCYLLA_ASSERT().
In the second patch we use the new function to replace every call to SCYLLA_ASSERT() in Alternator by the new throwing_assert().
Here is an example from the second patch to demonstrate the power of this approach: The Alternator code uses the attrs_column() function to retrieve the ":attrs" column of a schema. Since every Alternator table always has an ":attrs" column in its schema, we felt safe to SCYLLA_ASSERT() that this column exists. However, imagine that one day because of a bug, one Alternator table is missing this column. Or maybe not a bug - maybe a malicious user on a shared cluster found a way to deliberately delete this column (e.g, with a CQL command!) and this check fails. Before this patch, the entire Scylla node will crash. If the same request is sent to all nodes - the entire cluster will crash. The user might not even know which request caused this crash. In contrast, after this patch, the specific operation - e.g., PutItem - will get an exception. Only this operation, and nothing else, will be aborted, and the user who sent this request will even get an "Internal Server Error" with the assertion-failure message, alerting them that this specific query is causing problems, while other queries might work normally.
There's no need to backport this patch - unless it becomes annoying that other branches don't have the throwing_assert() function and we want it to ease other backports.
Fixes#28308.
Closesscylladb/scylladb#28445
* github.com:scylladb/scylladb:
alternator: replace SCYLLA_ASSERT with throwing_assert
utils: introduce throwing_assert(), a safe replacement for assert
In nonroot installations, the install.sh script was hardcoding the
api_ui_dir and api_doc_dir paths to /opt/scylladb/ in scylla.yaml,
even though the actual files were installed to a different location
(typically ~/scylladb). This caused REST API endpoints like
/api-doc/failure_detector/ to fail with "transfer closed with
outstanding read data remaining" error because Scylla couldn't find
the API documentation files at the configured paths.
Fix this by using the $prefix variable instead of hardcoded
/opt/scylladb/ paths. This ensures that:
- In regular installations: $prefix = /opt/scylladb (no change)
- In nonroot installations: $prefix = ~/scylladb (paths now correct)
Fixes: SCYLLADB-721
Backport: The hardcoded paths in install.sh have been present since
the nonroot installation feature was introduced, making REST API
endpoints non-functional in all nonroot installations across all
live versions of Scylla.
Closesscylladb/scylladb#28805
The concurrency semaphore gates uninitialized connections across all
do_accepts loops, but was initialized to a fixed value regardless of
how many listeners exist. With multiple listeners competing for the
same units, each effectively gets less than the configured concurrency.
Initialize the semaphore to concurrency - 1 and signal 1 per listen()
call, so total capacity is concurrency - 1 + nr_listeners. This
guarantees each listener's accept loop can have at least one unit
available.
It mainly fixes problem when setting uninitialized_connections_semaphore_cpu_concurrency
config value to 1 would result in not being able to process connections, as only 1 out of 2
listeners got the semaphore.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-762
Backport: no, it's a minor problem
Closesscylladb/scylladb#28747
* github.com:scylladb/scylladb:
test: add test_uninitialized_conns_semaphore
generic_server: fix waiters count in shed log
generic_server: scale connection concurrency semaphore by listener count
Previously, rewriting an sstable component (e.g., via rewrite_statistics) created a temporary file that was renamed
to the final name after sealing. This allows crash recovery by simply removing the temporary file on startup.
However, this approach won't work once component digests are stored in scylla_metadata,
as replacing a component like Statistics will require atomically updating both the component
and scylla_metadata with the new digest—impossible with POSIX rename.
The new mechanism creates a clone sstable with a fresh generation:
- Hard-links all components from the source except the component being rewritten and scylla metadata if update_sstable_id is true
- Copies original sstable components pointer and recognized components from the source
- Invokes a modifier callback to adjust the new sstable before rewriting
- Writes the modified component. If update_sstable_id is true, reads scylla metadata, generates new sstable_id and rewrites it.
- Seals the new sstable with a temporary TOC
- Replaces the old sstable atomically, the same way as it is done in compaction
This is built on the rewrite_sstables compaction framework to support batch operations (e.g., following incremental repair).
In case of any failure during the whole process, sstable will be automatically deleted on the node startup due to
temporary toc persistence.
This prepares the infrastructure for component digests. Once digests are introduced in scylla_metadata
this mechanism will be extended to also rewrite scylla metadata with the updated digest alongside the modified component, ensuring atomic updates of both.
Add make_sstable() overload that accepts sstable_version_types parameter
to compaction_group_view interface and all implementations.
This will be useful in rewrite component mechanism, as we
need to preserve sstable version when creating the new one for the replacement.
Introduce a null_data_sink and make_digest_calculator implementation that discards
all writes, enabling checksum calculation without file I/O. This allows the
existing checksummed_file_writer to be used for digest computation
without writing data to disk.
This will be used in a future commit to calculate the scylla metadata
component checksum before writing it to disk, allowing the component
to store its own checksum.
The futurization refactoring in 9d3755f276 ("replica: Futurize
retrieval of sstable sets in compaction_group_view") changed
maybe_wait_for_sstable_count_reduction() from a single predicated
wait:
```
co_await cstate.compaction_done.wait([..] {
return num_runs_for_compaction() <= threshold
|| !can_perform_regular_compaction(t);
});
```
to a while loop with a predicated wait:
```
while (can_perform_regular_compaction(t)
&& co_await num_runs_for_compaction() > threshold) {
co_await cstate.compaction_done.wait([this, &t] {
return !can_perform_regular_compaction(t);
});
}
```
This was necessary because num_runs_for_compaction() became a
coroutine (returns future<size_t>) and can no longer be called
inside a condition_variable predicate (which must be synchronous).
However, the inner wait's predicate — !can_perform_regular_compaction(t)
— only returns true when compaction is disabled or the table is being
removed. During normal operation, every signal() from compaction_done
wakes the waiter, the predicate returns false, and the waiter
immediately goes back to sleep without ever re-checking the outer
while loop's num_runs_for_compaction() condition.
This causes memtable flushes to hang forever in
maybe_wait_for_sstable_count_reduction() whenever the sstable run
count exceeds the threshold, because completed compactions signal
compaction_done but the signal is swallowed by the predicate.
Fix by replacing the predicated wait with a bare wait(), so that
any signal (including from completed compactions) causes the outer
while loop to re-evaluate num_runs_for_compaction().
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-610Closesscylladb/scylladb#28801
This series implements a new per-row TTL feature for CQL. The per-row TTL feature was requested in issue #13000. It is a feature that does not exist in Cassandra, and was inspired by DynamoDB's TTL feature - and under the hood uses the same implementation that we used in Alternator to implement this DynamoDB feature.
The new per-row TTL feature is completely separate from CQL's existing per-write (and per-cell) TTL, and both will be available to users.
In the per-row TTL feature, one column in the table is designated as the "TTL" column, and its value for a row is the expiration time for that row. The TTL column can be designated at table creation time, e.g.:
```cql
CREATE TABLE tab (
id int PRIMARY KEY,
t text,
expiration timestamp TTL
);
```
Or after the table already exists with:
```cql
ALTER TABLE tab TTL expiration
```
Expiration can also be disabled, with:
```cql
ALTER TABLE tab TTL NULL
```
The new per-row TTL feature has two features that users have been asking for:
1. A user can change the value of just the TTL column - without rewriting the entire row - to change the expiration time of the entire row.
2. When an expired row is finally deleted, a CDC event about this deletion appears in the CDC log (if CDC is enabled), including - if a preimage is enabled - the content of the deleted row.
To achieve the second goal (CDC events), a row is not guaranteed to disappear at exactly its expiration time (as CQL's original TTL feature guarantees). Rather, the row is deleted some time later, depending on `alternator_ttl_period_in_seconds`; Until the actual deletion, the row is still readable (and even writable). But we are guaranteed that when the row is finally deleted, the CDC event will come too.
The implementation uses the same background thread used by Alternator to periodically scan for expired items and delete them.
The expiration thread keeps the same metrics as it did for Alternator:
* `scylla_expiration_scan_passes`
* `scylla_expiration_scan_table`
* `scylla_expiration_items_deleted`
* `scylla_expiration_secondary_ranges_scanned`
The series begins with a few small preparation patches, followed by the main part of the feature (which isn't big, since we are just enabling the pre-existing Alternator expiration machinary for CQL) and finally 30 tests (single-node and multi-node tests) and documentation.
This series is a new feature, so traditionally would not be backported. However, I wouldn't be surprised if we will be requested to backport it so that customers will not need to wait for a new major release.
Fixes#13000Closesscylladb/scylladb#28320
* github.com:scylladb/scylladb:
test/cqlpy: verify that a column can't be both STATIC and PRIMARY KEY
docs/cql: document the new CQL per-row TTL feature
test/cluster: tests for the new CQL per-row TTL feature
test/cqlpy: tests for the new CQL per-row TTL feature
test: set low alternator_ttl_period_in_seconds in CQL tests
cql ttl: fix ALTER TABLE to disable TTL if column is dropped
cql ttl: add setting/unsetting of TTL column to ALTER TABLE
cql ttl: add TTL column support to CREATE TABLE and DESC TABLE
ttl: add CQL support to Alternator's TTL expiration service
alternator ttl: move TTL_TAG_KEY to a header file
alternator ttl: remove unnecessary check of feature flag
cql: add "cql_row_ttl" cluster feature
alternator: fix error message if UpdateTimeToLive is not supported
This patch adds filtering for tablet sizes collected in load_stats.
This is needed to improve the chances that the balancer will have all
the tablet sizes for the node, and that way avoid having the node
ignored during balancing.
Switch vector dimension handling to fixed-width `uint32_t` type,
update parsing/validation, and add boundary tests.
The dimension is parsed as `unsigned long` at first which is guaranteed
to be **at least** 32-bit long, which is safe to downcast to `uint32_t`.
Move `MAX_VECTOR_DIMENSION` from `cql3_type::raw_vector` to `cql3_type`
to ensure public visibility for checks outside the class.
Add tests to verify the type boundaries.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-223
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Co-authored-by: Dawid Pawlik <dawid.pawlik@scylladb.com>
Closesscylladb/scylladb#28762
This patch moves the table size tablet filtering code from a lambda in
storage_service::load_stats_for_tablet_based_tables() to the code
section where it will be used:
tablet_storage_group_manager::table_load_stats()
This is needed to better accomodate the next commit which will add
code for filtering tablets for tablet sizes.
Table sizes are collected in load stats, and are filtered according to
the migration stage, so as to avoid double accounting of tablet sizes.
The comment which explains the cut-off migration stage (cleanup) and
the cut-off stage in the code (streaming) are not the same.
This patch fixes that.
The test test_build_view_with_large_row creates a materialized view and
expects the view to be built with a timeout of 5 seconds. It was
observed to fail because the timeout is too short on slow machines.
Increase the timeout to 60 seconds to make the test less flaky on slow
machines. Similarly for the other tests in the file that have a timeout
for view build, increase the timeout to 60 seconds to be consistent and
safer.
Fixes SCYLLADB-769
Closesscylladb/scylladb#28817
We have observed a bug that caused Scylla to crash due to metrics double
registration. This bug is really difficult to reproduce and was seen
only once in the wild. We think that it may be caused by a request
in-flight keeping a reference to the stats object, making it not
deregister when the index is dropped, which casues a double registration
when we recreate the index, however we are not 100% sure.
This patch makes it so the metrics always get deregistered when we drop the index, which should fix the double registration bug.
Fixes: #27252Closesscylladb/scylladb#28655
Audit tests have been slow. They rely on wait_for function. This function first sleeps for the duration of the time step specified, and then calls the given function. The audit tests need 0.02-0.03 seconds for the given function, but the operation lasts around 1.02-1.03 seconds, since step is 1 second.
This patch modifies wait_for dtest function so it first executes the given function, and afterwards calls time.sleep(step). This reduces time needed for the given function from 1.03 to 0.03 seconds.
Total audit tests suite speedup is 3x. On the developer machine the time is reduced from 13+ minutes to 4 minutes.
This patch also improves performance of some alternator tests that use the same wait_for dtest function.
`wait_for` in dtest framework has default time step reduced to make the environment more responsive and test execution faster.
Refs SCYLLADB-573
This is a performance improvement of testing framework. No need to backport.
Closesscylladb/scylladb#28590
* github.com:scylladb/scylladb:
dtest: shorten default sleep step in wait_for
dtest: wait_for speedup
Add a test that exercises to_metrics_histogram when Min is smaller than
Precision.
The test verifies duplicate integer bounds are collapsed,
counts remain cumulative, and native histogram metadata is still present
with the expected schema and min id.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
to_metrics_histogram now collapses duplicate integer bucket bounds
caused by Min less than Precision scaling while always keeping native
histogram metadata.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Add three test cases to verify the hybrid linear/exponential bucketing:
- test_histogram_min_1_bucket_limits: Validates bucket lower limits
- test_histogram_min_1_basic: Tests value insertion and bucket distribution
- test_histogram_min_1_statistics: Tests min(), max(), quantile(), and mean()
- test_histogram_min_2_precision_4: Test min == 2 and precision 4.
These tests cover the new Min<Precision mode with Precision=4, verifying both the
linear range and exponential range.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
approx_exponential_histogram is a pseudo exponential histogram implementation
that can insert and retrieve values into and from buckets in O 1 time.
The implementation uses power of two ranges and splits them linearly into
buckets. The number of buckets per power of two range is called Precision.
The original implementation aimed at covering large value ranges had a
limitation. The histogram Min value had to be greater than or equal to
Precision. As a result code that needs histograms for small integer values
could not use this implementation efficiently.
This change addresses that gap by handling the case where Min is less than
Precision. For Min smaller than Precision the value is scaled by a power of
two factor during indexing so the existing exponential math can be reused
without runtime branching. Bucket limits are scaled back to the original
units which can lead to repeated bucket limits in the Min to Precision
range for integer values.
Example with Min 2 and Precision 4
Buckets 2 2 3 3 4 5 6 7 8 10 12 14 and so on
Implementation details
Introduce SHIFT based on log2 Precision minus log2 Min when positive
Scale Min and Max by SHIFT for all exponential calculations
Compute NUM_BUCKETS using the standard log2 Max over Min formula
Use scaled value in find_bucket_index to avoid fractional bucket steps
Return bucket limits by scaling back to original units
Constraint relaxed from Min greater or equal to Precision to allow any Min
less than Max still power of two
This change maintains backward compatibility with existing histograms
while enabling efficient tracking of small integer values.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Add missing tests for per-table Alternator latency metrics to ensure recent
per-table latency accounting is actually validated.
Changes in this patch:
Refactor latency assertion helper into check_sets_latency_by_metric(),
parameterized by metric name.
Keep existing behavior by implementing check_sets_latency() as a wrapper
over scylla_alternator_op_latency.
Add test_item_latency_per_table() to verify
scylla_alternator_table_op_latency_count increases for:
PutItem, GetItem, DeleteItem, UpdateItem, BatchWriteItem,
and BatchGetItem.
This closes a test gap where only global latency metrics were checked, while
per-table latency metrics were not covered.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Batch operations were updating only global latency histograms, which left
table-level latency metrics incomplete.
This change computes request duration once at the end of each operation and
reuses it to update both global and per-table latency stats:
Latencies are stored per table used,
This aligns batch read/write metric behavior with other operations and improves
per-table observability.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This workflow calls the reusable backport-with-jira workflow from
scylladb/github-automation to enable automatic backport PR creation with
Jira sub-issue integration.
The workflow triggers on:
- Push to master/next-*/branch-* branches (for promotion events)
- PR labeled with backport/X.X pattern (for manual backport requests)
- PR closed/merged on version branches (for chain backport processing)
Features enabled by calling the shared workflow:
- Creates Jira sub-issues under the main issue for each backport version
- Sorts versions descending (highest first: 2025.4 -> 2025.3 -> 2025.2)
- Cherry-picks from previous version branch to avoid repeated conflicts
- On Jira API failure: adds comment to main issue, applies 'jira-sub-issue-creation-failed' label, continues with PR
Closesscylladb/scylladb#28804
While adding the new syntax "TTL" to CREATE TABLE, I noticed that the
parser actually allows a column to be defined as "STATIC PRIMARY KEY".
So I add here a small test to verify that this is not really allowed:
The syntax "c int STATIC PRIMARY KEY" is accepted, but then rejected
by a later check. The syntax "c int PRIMARY KEY STATIC" is rejected
as a syntax error.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add user-facing documentation for the new CQL per-row TTL feature,
in docs/cql/cql-extensions.md.
Also mention (and link) the new alternative TTL feature in a few
relevant documents about the old (per-write) TTL, about CDC,
and about the CREATE TABLE and ALTER TABLE commands.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The previous patch added single-node functional tests (in test/cqlpy)
for everything which was possible to test on a single node. In this
patch we add four tests that we couldn't test on a single node, using
the test/cluster test framework:
1. Test that the TTL expiration work - both the scanning threads and
the actual deletion work on all nodes - happens on the "streaming"
scheduling group.
2. Test that even if one of the cluster's nodes is down, still all
the items get expired - another node "takes over" the dead node's
work.
3. Test that rolling upgrade works as designed for the CQL per-row TTL
feature: Before every single node in the cluster is upgraded to
support this feature, a TTL column cannot be enabled on a table.
And as soon as the last node of the cluster is upgraded, the TTL
feature begins to work completely (you don't need to reboot all
the nodes again).
4. Test that expiration works correctly on a multi-DC setup. The test
doesn't check the efficiency of this process - i.e., that today each
DC scans part of the data, reading with LOCAL_QUORUM, and writing
the deletions across the entire cluster. Rather, the test only
verifies the correctness - that expired rows do get deleted -
for the usual case the data across the DCs is consistent.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch contains 27 functional tests (in the test/cqlpy framework)
for the new CQL per-row TTL feature. The tests cover the TTL column
configuration statements (CREATE TABLE, ALTER TABLE) as well as the
actual item expiration or non-expiration depending on the value of
the expiration-time column - and also CDC events generated on expiration
and the metrics generated by the expiration process.
These tests were written together with the code, as in "test-driven
development", so they aim to cover every corner case considered during
the development, and they reproduce every bug and misstep seen during
the development process. As a result, they hopefully achieve very high
code coverage - but since we don't have a working code-coverage tool,
I can't report any specific code coverage numbers.
These tests check everything which we can check on single-node cluster.
The next patch will add additional multi-node tests for things we can't
check here with a single node - such as the scheduling group used by the
distributed work, the effect of dead nodes on the TTL functionality, and
the process of rolling upgrade.
The tests in this patch do NOT try to stress the background expiration
scanning threads, or to check how they handle topology changes, large
amounts of data or clusters spanning multiple DCs. These tests also don't
test the performance impact of these scanning threads. Because the
expiration scanning thread is identical to the one already used by
Alternator TTL, we assume that many of these aspects were already tested
for Alternator TTL and did not change when the same implementation is
used for the new CQL feature.
All new tests pass on ScyllaDB. Because the per-row TTL feature is
a new ScyllaDB feature that does not exist on Cassandra, all these
tests are skipped on Cassandra.
Because some of these tests involve waiting for expiration, they can't
be very quick. Still, because we set alternator_ttl_period_in_seconds
to 0.5 seconds in the test framework, all 27 tests running sequentially
finish in roughly 6 seconds total, which we consider acceptable.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In test/alternator/run we set alternator_ttl_period_in_seconds to a very
low number (0.5 seconds) to allow TTL tests to expire items very quickly
and finish quickly.
Until now, we didn't need to do this for CQL tests, because they weren't
using this Alternator-only feature. Now that CQL uses the same expiration
feature with its original configuration parameter, we need to set it in
CQL tests too.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
If "ALTER TABLE tab DROP x" is done to delete column x, and column x was
the designated TTL column, then the per-row TTL feature should be disabled
on this table.
If we don't do this, the expiration scanner will continue to scan the
table trying to read the dropped column - which will be wasteful or
worse.
A test for this case is also included in test/cqlpy/test_ttl_row.py
in a later patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The previous patch added the ability in CREATE TABLE to designate one
of the regular columns as a "TTL column", to be used by the per-row TTL
feature (Refs #13000). In this patch we add to ALTER TABLE the ability
to enable per-row TTL on an existing table with a given column as the
TTL column:
ALTER TABLE tab TTL colname
and also the ability to disable per-row TTL with
ALTER TABLE tab TTL NULL
as in CREATE TABLE, the designated TTL column must be a regular column
(it can't be a primary key column or a static column), and must have
the types timestamp, bigint or int.
You can't enable per-row TTL if already enabled, or disable it if
already disabled. To change the TTL column on an existing table,
you must first disable TTL, and then re-enable it with the new column.
A large collection of functional tests (in test/cqlpy), for every detail
of this patch, will come in a later patch in this series.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch enables the per-row TTL feature in CQL (Refs #13000).
This patch allows the user to create a new table with one of its columns
designated as the TTL column with a syntax like:
CREATE TABLE tab (
id int PRIMARY KEY,
t text,
expiration timestamp TTL
);
The column marked "TTL" must have the "timestamp", "bigint" or "int"
types (the choice of these types was explained in the previous patch),
and there can only be one such column. We decided not to allow a column
to be both a primary key column and a TTL column - although it would
have worked (it's supported in Alternator), I considered this non-useful
and confusing, and decided not to allow it in CQL. A TTL column also
can't be a static column.
We save the information of which column is the TTL column in a tag which
is read by the "expiration service" - originally a part of Alternator's
TTL implementation. After the previous patch, the expiration service is
running and knows how to understand CQL tables, so the CQL per-row TTL
feature will start to work.
This patch also implements DESC TABLE, printing the word "TTL" in the
right place of the output.
This patch doesn't yet implement ALTER TABLE that should allow enabling
or disabling the TTL column setting on an existing table - we'll do that
in the next patch.
A large collection of functional tests (in test/cqlpy), for every detail
of this feature will be added in a later patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The Alternator TTL feature uses an "expiration service", a background
thread on each shard which periodically scans for expired items and
deletes them. When writing the expiration service, we already
anticipated that the day will come that we'll want to use it for CQL
too. Well, now that we want to use it for CQL, we only need to make
two changes:
1. Before this patch, the expiration service was only started if
Alternator was enabled. Now we need to start it unconditionally,
as both Alternator and CQL will need to use it.
The performance impact of the new background threads, when not
needed, should be minimal: These threads will wake up every
alternator_ttl_period_in_seconds (by default - once a day) and
just check if any table has per-row TTL enabled, and if not, do
nothing.
2. Before this patch, the expiration-time column had to be of type
"decimal" - a variable-precision floating-point type. This made
sense in Alternator - where all numbers are of this type, but CQL
offers better and more efficient types for this purpose. In this
patch we add support for two additional types for the expiration
time column: The "timestamp" type (which uses millisecond precision,
which our implementation truncates to whole seconds) and for the
"bigint" type storing a number of seconds since the UNIX epoch.
We also support the smaller "int" type for compatibility with
existing data, but it is not recommended because a signed
32-bit integer counting time from 1970 will break in 2038.
After this patch, the expiration service supports CQL tables, but there
is nothing yet that can enable it on CQL tables - i.e., nothing that
sets the appropriate tag on the table to tell the expiration service
which column is the expiration-time column. We'll add new syntax to
do this in the next patch.
At the moment, we leave the expiration service implementation in
its existing location - alternator/ttl.cc. This is despite the fact
that we now start it and use it also for CQL. For better modularity,
we should probably later move the expiration service implementation
to a separate module (directory).
Similarly, the expiration service's period is still configured via
alternator_ttl_period_in_seconds, which is now a misnomer because it
also affects CQL. Later we can rename this configuration parameter,
or alternatively, consider different scan periods for different tables
and table types, and have separate configuration for Alternator TTL
and CQL per-row TTL.
The metrics kept by the expiration service are the same metrics existing
for Alternator TTL, and fortunately do not have the name "alternator" in
their name:
* scylla_expiration_scan_passes
* scylla_expiration_scan_table
* scylla_expiration_items_deleted
* scylla_expiration_secondary_ranges_scanned
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
TTL_TAG_KEY stores the name of the tag in which we store the name of the
table's expiration-time column, for Alternator's TTL feature.
We already need this name in two source files, and soon we'll need it
in more files - as we want to use the same implementation also for for
a new per-row TTL feature in CQL. So it's time to move the declaration
of this variable to a new header file - alternator/ttl_tag.hh.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Every node that supports the Alternator TTL feature should start its
background expiration-checking thread, *without* checking if other
nodes support this feature. This patch removes the unnecessary check.
Indeed, until all other nodes enable this feature, the background thread
will have nothing to do. but when finally all nodes have this feature -
we need this thread to already be on - without requiring another reboot
of all nodes to start this thread.
In practice, this change won't change anything on modern installations
because this feature is already three years old and always enabled on
modern clusters. But I don't want to repeat the same mistake for the new
CQL per-row TTL feature, so better fix it in Alternator too.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a new cluster feature "CQL_ROW_TTL", for the new CQL
per-row TTL feature.
With this patch, this node reports supporting this feature, but the CQL
per-row TTL feature can only be used once all the nodes in the cluster
supports the feature. In other words, user requests to enable per-row TTL
on a table should check this feature flag (on the whole cluster) before
proceeding.
This is needed because the implementation of the per-row-TTL expiration
requires the cooperation of all nodes to participate in scanning for
expired items, so the feature can't be trusted until all nodes participate
in it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Since commit 2dedb5ea75, the Alternator
TTL feature is no longer experimental. It is still a "cluster feature"
meaning it cannot be used on a partially-upgraded cluster until the entire
cluster supports this feature.
The error message we printed when the cluster doesn't support this
feature was outdated, referring to the no-longer-existing experimental
feature. So this patch fixes the error message.
Since this feature is already three years old, nobody is likely to ever
see this error message (it can be seen only by someone upgrading an
even older cluster, during the rolling upgrade), but better not have
wrong error messages in the code, even if it's not seen by users.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Replace all calls to SCYLLA_ASSSERT() in Alternator by the better and
safer throwing_assert() introduced in the previous patch.
As a result of this patch, if one of the call sites for these asserts
is buggy and ever fails, only the involved operation will be killed
by an exception, instead of crashing the whole server - and often the
entire cluster (as the same buggy request reaches all nodes and
crashes them all).
Additionally, this patch replaces a few existing uses in Alternator
of on_internal_error() with a non-interesting message with a
more-or-less equivalent, but shorter, throwing_assert(). The idea is
to convert the verbose idiom:
if (!condition) {
on_internal_error(logger, "some error message")
}
With the shorter
throwing_assert(condition)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch introduces throwing_assert(cond), a better and safer
replacement for assert(cond) or SCYLLA_ASSERT(cond). It aims to
eventually replace all assertions in Scylla and provide a real solution
to issue #7871 ("exorcise assertions from Scylla").
throwing_assert() is based on the existing on_internal_error() and
inherits all its benefits, but brings with it the *convenience* of
assert() and SCYLLA_ASSERT(): No need for a separate if(), new strings,
etc. For example, you can do write just one line of throwing_assert():
throwing_assert(p != nullptr);
Instead of much more verbose on_internal_error:
if (p == nullptr) {
utils::on_internal_error("assertion failed: p != nullptr")
}
Like assert() and SCYLLA_ASSERT(), in our tests throwing_assert() dumps
core on failure. But its advantage over the other assertion functions
like becomes clear in production:
* assert() is compiled-out in release builds. This means that the
condition is not checked, and the code after the failed condition
continues to run normally, potentially to disasterous consequences.
In contrast, throwing_assert() continues to check the condition even in
release builds, and if the condition is false it throws an exception.
This ensures that the code following the condition doesn't run.
* SCYLLA_ASSERT() in release builds checks the condition and *crashes*
Scylla if the condition is not met.
In contrast, throwing_assert() doesn't crash, but throws an exception.
This means that the specific operation that encountered the error
is aborted, instead of the entire server. It often also means that
the user of this operation will see this error somehow and know
which operation failed - instead of encountering a mysterious
server (or even whole-cluster crash) without any indication which
operation caused it.
Another benefit of throwing_assert() is that it logs the error message
(and also a backtrace!) to Scylla's usual logging mechanisms - not to
stderr like assert and SCYLLA_ASSERT write, where users sometimes can't
see what is written.
Fixes#28308.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add a new docs/dev document for the strongly consistent tables feature.
For now, it only contains information about the Raft metadata persistence,
but it should be updated as more of the strong-consistency components are
added.
In this patch we add various tests for checking how strongly consistent
tables work while allowing their tablets to reside on non-0 shards and
while using the new persistent storage for their raft metadata.
The tests verify that:
- strongly consistent tables' tablets can be allocated on different shards
and we can write/read from them
- the raft metadata is persistent across restarts even with disruptions
- the sharder correctly routes metadata queries to specified shards
- we can correctly perform multi-shard reads from the metadata tables
- we can read using just the group_id (without shard) using ALLOW FILTERING
For the tests we add logging to the sharder and partitioner and we add
some extra logs for observability.
In this patch we allow strongly consistent tables to have tablets on
shards different than 0.
For that, we remove the checks for shard 0 for the non-group0 raft
groups, and we allow the tablet allocator to place tablets of
strongly consistent tables on shards different than 0.
We also start using the new storage (raft::persistence) for strongly
consistent tables, added in the preceding commits.
Most functions of the new storage for raft groups for strongly
consistent tables are the same as for the system raft table
storage, so we reuse the tests for them to test the new storage.
We add additional tests for checking the new raft groups partitioner
and sharder, and for verifying that writes using storages for different
shards do not affect the data read on different shards.
We also add a test for checking the snapshot_descriptor present after
the storage bootstrap - for both system and strongly consistent storages
we check that the storage contains the initial descriptor.
Add raft_groups_storage, a raft::persistence implementation for
strongly consistent tablet groups.
Currently, it's almost an exact copy of the raft_sys_table_storage that
uses the new raft tables for strongly consistent tables (raft_groups,
raft_groups_snapshots, raft_groups_snapshot_config) which have
a (shard, group_id) partition key.
In the future, the mutation, term and commit_idx data will be stored
differently for for strongly consistent tables than for group0, which
will differentiate this class from the original raft_sys_table_storage.
The storage is created for each raft group server and it takes a shard
parameter at construction time to ensure all queries target the correct
partition (and thus shard).
Add three new system tables for storing raft state for strongly
consistent tablets, corresponding to the tables for group0:
- system.raft_groups: Stores the raft log, term/vote, snapshot_id,
and commit_idx for each tablet's raft group.
- system.raft_groups_snapshots: Stores snapshot descriptors
(index, term) for each group.
- system.raft_groups_snapshot_config: Stores the raft configuration
(current and previous voters) for each snapshot.
These tables use a (shard, group_id) composite partition key with
the newly added raft_groups_partitioner and raft_groups_sharder, ensuring
data is co-located with the tablet replica that owns the raft group.
The tables are only created when the STRONGLY_CONSISTENT_TABLES experimental
feature is enabled.
Add a custom partitioner and sharder that will be used for Raft tables
for strongly consistent tables. These tables will have partition keys
of the form (shard, group_id) and the partitioner creates tokens that
encode the target shard in the high 16 bits.
Token layout:
[shard: 16 bits][partition key hash: 48 bits]
This encoding guarantees that raft group data will be located on the
same shard as the tablet replica corresponding to that raft group as long
we use the tablet replica's shard as the value in the partition key.
Storing the shard directly in the partition key avoids additional lookups
for request routing to the incoming new raft tables.
For even more simplicity, we avoid biasing between uint64_t and int64_t
by limiting the acceptable shard ids up to 32767 (leaving the top bit 0),
which results in the same value of the token when interpreting either as
uint64_t or int64_t.
The sharder decodes the shard by extracting the high bits, which is
shard-count independent. This allows the partition key:shard mapping
to remain the same even during smp changes (only increases are allowed,
the same limitation as for tablets).
This series closes a gap in how CQL request and response sizes are reported.
Previously, request_size and response_size were tracked as simple counters,
providing only cumulative totals per shard. This made it difficult to understand
the distribution of message sizes and identify potential issues with very large
or very small requests.
After this series, the CQL transport reports detailed histogram metrics showing
the distribution of request and response sizes. These histograms are tracked
per-instance, per-type (per ops), and per-scheduling-group, providing
much better visibility into CQL traffic patterns.
The histograms are collected for QUERY, EXECUTE, and BATCH operations, which are
the primary data path operations where message size distribution is most relevant.
This data can help identify:
- Clients sending unexpectedly large requests
- Operations with oversized result sets
- Scheduling group differences in traffic patterns
To support this, the series extends the approx_exponential_histogram template to
handle accurate sum, adds a bytes_histogram type alias optimized for byte-range measurements (1KB to 1GB).
The existing per-shard counter metrics are maintained for backward compatibility.
Metrics example:
```
scylla_transport_cql_request_bytes{kind="BATCH",scheduling_group_name="sl:default",shard="0"} 129808
scylla_transport_cql_request_bytes{kind="EXECUTE",scheduling_group_name="sl:default",shard="0"} 227409
scylla_transport_cql_request_bytes{kind="PREPARE",scheduling_group_name="sl:default",shard="0"} 631
scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:default",shard="0"} 2809
scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:driver",shard="0"} 4079
scylla_transport_cql_request_bytes{kind="REGISTER",scheduling_group_name="sl:default",shard="0"} 98
scylla_transport_cql_request_bytes{kind="STARTUP",scheduling_group_name="sl:driver",shard="0"} 432
scylla_transport_cql_request_histogram_bytes_sum{kind="QUERY",scheduling_group_name="sl:driver"} 4079
scylla_transport_cql_request_histogram_bytes_count{kind="QUERY",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1024.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2048.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4096.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8192.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16384.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="32768.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="65536.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="131072.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="262144.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="524288.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1048576.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2097152.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4194304.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8388608.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16777216.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="33554432.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="67108864.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="134217728.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="268435456.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="536870912.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1073741824.000000",scheduling_group_name="sl:driver"} 57
```
**The field sees it as an important issue**
Fixes#14850Closesscylladb/scylladb#28419
* github.com:scylladb/scylladb:
test/boost/estimated_histogram_test.cc: Switch to real Sum
transport/server: to bytes_histogram
approx_exponential_histogram: Add sum() method for accurate value tracking
utils/estimated_histogram.hh: Add bytes_histogram
When running unit tests, there's a visible ~1-second sleep
when gossip exits the failure detector loop.
Improve this by adding a condition variable for exiting the loop
and signaling it when any of the exit conditions are satisfied:
the abort_source is pulled, the gossiper is shut down, or the sleep
is complete. We can't just use the abort_source because gossip can be
shut down independently of the rest of the system.
To see the improvement, I ran cql_query_test in dev mode:
Before:
$ time ./build/dev/test/boost/combined_tests -t cql_query_test -- --smp 2 > /dev/null 2>&1
real 2m26.904s
user 0m24.307s
sys 0m13.402s
After:
$ time ./build/dev/test/boost/combined_tests -t cql_query_test -- --smp 2 > /dev/null 2>&1
real 0m26.579s
user 0m24.671s
sys 0m13.636s
Two minutes of real-time saved.
Real-life improvement in test.py will be lower, because of the overhead
of launching pytest for each test case.
Closesscylladb/scylladb#28649
When a permit is preemptively aborted, store the corresponding
exception in permit's member: `reader_permit::impl::_ex`.
This makes preemptively-aborted permits consistently report aborted()
and prevents them from being treated as eligible for inactive
registration in `register_inactive_read()`, avoiding assertion
failures on unexpected permit state.
Closesscylladb/scylladb#28591
scylladb/scylla container image doesn't include systemctl binary, while it
is used by perftune.py script shipped within the same image.
Scylla Operator runs this script to tune Scylla nodes/containers,
expecting its all dependencies to be available in the container's PATH.
Without systemctl, the script fails on systems that run irqbalance
(e.g., on EKS nodes) as the script tries to reconfigure irqbalance and
restart it via systemctl afterwards.
Fixes: scylladb/scylla-operator#3080Closesscylladb/scylladb#28567
Refs: SCYLLADB-193
Adds a "snapshot_table" topology operation and associated data structure/table columns to support dispatching a snapshot operation as a topo coordinator op.
Logic is similar, and thus broken out and semi-shared with, truncation.
Also adds optional tablet metadata to manifest, listing all tablets present in a given snapshot, as well as
tablet sstable ownership, repair status, and token ranges.
As per description in SCYLLADB-193, the alternative snapshot mechanism is in
a separate namespace under 'tablets', which while dubious is the desired destination.
The API is accessed via `nodetool cluster snapshot`, which more or less mirrors `nodetool snapshot`, but using topo op.
TTL is added to message propagation as a separate patch here, since it is not (yet) used from API (or nodetool).
Requires a syntax for both API and command line.
Closesscylladb/scylladb#28525
* github.com:scylladb/scylladb:
topology::snapshot: Add expiry (ttl) to RPC/topo op
test_snapshot_with_tablets: Extend test to check manifest content
table::manifest: Add tablet info to manifest.json
test::test_snapshot_with_tablets: Add small test for topo coordinated snapshot
scylla-nodetool: Add "cluster snapshot" command
api::storage_service: Add tablets/snapshots command for cluster level snapshot
db::snapshot-ctl: Add method to do snapshot using topo coordinator
storage_proxy: Add snapshot_keyspace method
topology_coordinator: Add handler for snapshot_tables
storage_proxy: Add handler for SNAPSHOT_WITH_TABLETS
messaging_service: Add SNAPSHOT_WITH_TABLETS verb
feature_service: Add SNAPSHOT_AS_TOPOLOGY_OPERATION feature
topology_mutation: Add setter for snapshot part of row
system_keyspace::topology_requests_entry: Add snapshot info to table
topology_state_machine: Add snapshot_tables operation
topology_coordinator: Break out logic from handle_truncate_table
storage_proxy: Break out logic from request_truncate_with_tablets
test/object_store: Remove create_ks_and_cf() helper
test/object_store: Replace create_ks_and_cf() usage with standard methods
test/object_store: Shift indentation right for test cases
Migrate cluster tests directory to be handled by pytest. This is the next step in process of unification of the tests and migration to the pytest.
With this PR cluster test will be executed with the full path to the file instead of `suite/test` paradigm.
Backport is not needed because it framework enhancement.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-46Closesscylladb/scylladb#27618
* github.com:scylladb/scylladb:
test.py: remove setsid from the framework
test.py: rename suite.yaml to test_config.yaml
test.py: add cluster tests to be executed by pytest
test.py: add random seed for topology tests reproducibility
test.py: add explicit default values to pytest options
test.py: replace SCYLLA env var with build_mode fixture
Recently we suffered a regression on how Alternator TTL behaves when a node goes down when tablets are used.
Usually, expiration of data in a particular tablet are handled by this tablet's "primary replica". However, if that node is down, we want another node to perform these expiration until the primary replica goes back online. We created a function `tablet_map::get_secondary_replica()` to select that "other node". We don't care too much what the "secondary replica" means, but we do care that it's different from the primary replica - if it's the same the expiration of that tablet will never be done.
It turns out that recently, in commits 817fdad and d88036d, the implementation of get_primary_replica() changed without a corresponding change to get_secondary_replica(). After those changes, the two functions are mismatched, and sometimes return the same node for both primary and secondary replica.
Unfortunately, although we had a dtest for the handling of a dead node in Alternator TTL, it failed to reproduce this bug, so this regression was missed - nothing else besides Alternator TTL ever used the get_secondary_replica() function.
So this series, in addition to fixing the bug, we add two tests that reproduce this bug (fail before the fix, pass with the fix):
1. A unit test that checks that get_secondary_replica() always returns a different node from get_primary_replica()
2. A cluster test based on the original dtest, which does reproduce this bug in Alternator TTL where some of the data was never expired (but only failed in release build, for an unknown reason).
Fixes SCYLLADB-777.
Closesscylladb/scylladb#28771
* github.com:scylladb/scylladb:
test: add unit test for tablet_map::get_secondary_replica()
test, alternator: add test for TTL expiration with a node down
locator: fix get_secondary_replica() to match get_primary_replica()
The path removes the code protected by !raft_topology_change_enabled()
since it is no longer reachable. Drop test_lwt_for_tablets_is_not_supported_without_raft
since not raft mode is no longer supported.
It is not enough to go over all column types and register the UDTs. UDTs
might be nested in other types, like collections. One has to do a
traversal of the type tree and register every UDT on the way. That is
what this patch does.
This function is used by the query and write operations, which should
now both work with nested UDTs.
Add a test which fails before and passes after this patch.
Audit tests have been slow. They rely on wait_for function.
This function first sleeps for the duration of the time step
specified, and then calls the given function. The audit tests
need 0.02-0.03 seconds for the given function, but the operation
lasts around 1.02-1.03 seconds, since step is 1 second.
This patch modifies wait_for dtest function so it first executes
the given function, and afterwards calls time.sleep(step). This
reduces time needed for the given function from 1.03 to 0.03 seconds.
Total audit tests suite speedup is 3x. On the developer machine
the time is reduced from 13+ minutes to 4 minutes.
This patch also improves performance of some alternator tests that
use the same wait_for dtest function.
Refs SCYLLADB-573
The current way of checking the boost's stdout can have a race
condition when pytest will try to read the file before it was really
flushed. So this PR should eliminate this possibility.
Closesscylladb/scylladb#28783
PR #28703 was merged into master but not with the latest version of the
changes. This patch is an incremental fix for this.
Currently, the elements of the tablet_sizes_per_shard vector are
incremented in separate shards. This is prone to false sharing of cache
lines, and ping-pong of memory, which leads to reduced performance.
In this patch, in order to avoid cache line collisions while updating
the sum of tablet sizes per shard, we align the counter to 64 bytes.
Fixes: SCYLLADB-678
Closesscylladb/scylladb#28757
This pr adds await for each one of the tasks to wait for the MV schema to be added successfully
and then to start the server shutdown
With this change we dont need will not get the shutdown races.
Closesscylladb/scylladb#28774
This commit removes the information that Alternator doesn't support tablets.
The limitation is no longer valid.
Fixes SCYLLADB-778
Closesscylladb/scylladb#28781
`test_autoretrain_dict` sporadically fails because the default
compression algorithm was changed after the test was written.
`9ffa62a986815709d0a09c705d2d0caf64776249` was an attempt to fix it by
changing the compression configuration during node startup. However,
the configuration change had an incorrect YAML format and was
ignored by ScyllaDB. This commit fixes it.
Fixes: scylladb/scylladb#28204Closesscylladb/scylladb#28746
For a while, we have seen coroutine related tests (those that use the
coroutine_task fixture) fail occasionally, because no coroutine frame is
found. Multiple attempts were made to make this problem self-diagnosing
and dump enough information to be able to debug this post-mortem. To no
avail so far. A lot of time was invested into this this benign issue:
See the long discussion at https://github.com/scylladb/scylladb/issues/22501.
It is not known if the bug is in gdb, or the gdb script trying to find
the coroutine frame. In any case, both are only used for debugging, so
we can tolerate occasional failures -- we are forced to do so when
working with gdb anyway.
Instead of piling on more effor there, just skip these tests when the
problem occurs. This solves the CI flakyness.
Fixes: #22501Closesscylladb/scylladb#28745
Add --continue-after-error true to perf-cql-raw and perf-alternator
tests, and --stop-on-error false to perf-simple-query test, so that
tests don't abort on the first error.
Reason for this is that tests are flaky with example failure:
Perf test failed: std::runtime_error (server returned ERROR to EXECUTE)
When CPU is starved on CI we can return timeouts and/or other errors.
The change should make tests more robust on the expense of smaller test
scope. But those tests were written mostly to test startup sequence
as it differs from Scylla's starup.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-759Closesscylladb/scylladb#28767
With previous architecture, scylla servers were handled by test.py and
if pytest fails, test.py was responsible for stopping scylla processes.
Now with only pytest handling, there is no such mechanism, that's why
I'm removing the setsid, so when the parent pytest process closes it
will automatically close all child including any started process during
testing. This will allow to not leave any scylla process in case pytest
was killed.
cluster tests are now executed by pytest also
Run pytest in an executor to avoid blocking the event loop, allowing
resource monitoring to run concurrently
Logic for passing the arguments to pytest changed due to the fact that
almost all directories now executed by pytest and directories that are
not handled excluded in pytest.ini
Modify the threads count for debug mode, because with the default
logic CI agents die
Set TOPOLOGY_RANDOM_FAILURES_TEST_SHUFFLE_SEED environment variable
during pytest configuration to enable to ensure that all xdist workers will
discover the same scope of the tests. This is a known limitation of the
xdist plugi where the discovered tests should be consistenta across
master and workers.
Add explicit default values to pytest command line options to prevent
issues when running tests with pytest's parallel execution where
options are not present on upper conftest, so they're just not set at all.
Replace direct usage of SCYLLA environment variable with the build_mode
pytest fixture and path_to helper function. This makes tests more
flexible and consistent with the test framework. Also this allows to use
tests with xdist, where environment variable can be left in the master
process and will not be set in the workers
Add using the fixture to get the scylla binary from the suite, this will
align with getting relocatable Scylla exe.
Lua doesn't have separate integer and floating point numbers,
so we check if a number can fit in an integer and if so convert
it to an integer.
The conversion routine invokes undefined behavior (and even
acknowledges it!). More recent compilers changed their behavior
when casting infinities, breaking test_user_function_double_return
which tests this conversion.
Fix by tightening the conversion to not invoke undefined behavior.
Closesscylladb/scylladb#28503
This patchset:
- ensures the loading semaphore is acquired in cross-shard callbacks
- fixes iterator invalidation problem when reloading all cached permissions
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-780
Backport: no, affected code not released yet
Closesscylladb/scylladb#28766
* github.com:scylladb/scylladb:
auth: cache: fix permissions iterator invalidation in reload_all_permissions
auth/cache: acquire _loading_sem in cross-shard callbacks
The hostent::addr_list is deprecated in favor of address_entry::addr
field that contains the very same addresses.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28565
- add an overload to the rest http client to accept retry strategy instance as an argument
- remove hand rolled error handling from object storage client and replace with common machinery that supports handling and retrying when appropriate
No backport neede since it is only refactoring
Closesscylladb/scylladb#28161
* github.com:scylladb/scylladb:
object_storage: add retryable machinery to object storage
rest_client: add `simple_send` overload
To handle RPC from other nodes, we need to be able to redirect the
requests for each raft group to the shard that owns it. We need to
be able to do the redirection on all shards, so to achieve that, on
all shards we need to store the information about which shard is
occupied by each Raft group server.
For that we add a group_id -> shard mapping to the raft_group_registry.
The mapping is filled out when starting raft servers, it's emptied
when we abort raft servers. We use it when registering RPC verb handlers,
so that regardless of the shard handling the RPC, the work on the raft
group can be performed on the corresponding shard.
This patch adds a unit test for tablet_map::get_secondary_replica().
It was never officially defined how the "primary" and "secondary"
replicas were chosen, and their implementation changed over time,
but the one invariant that this test verifies is that the secondary
replica and the primary replica must be a different node.
This test reproduces issue SCYLLADB-777, where we discovered that
the get_primary_replica() changed without a corresponding change to
get_primary_replica(). So before the previous patch, this test failed,
and after the previous patch - it passes.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We have many single-node functional tests for Alternator TTL in
test/alternator/test_ttl.py. This patch adds a multi-node test in
test/cluster/test_alternator.py. The new test verifies that:
1. Even though Alternator TTL splits the work of scanning and expiring
items between nodes, all the items get correctly expired.
2. When one node is down, all the items still expire because the
"secondary" owner of each token range takes over expiring the
items in this range while the "primary" owner is down.
This new test is actually a port of a test we already had in dtest
(alternator_ttl_tests.py::test_multinode_expiration). This port is
faster and smaller then the original (fewer nodes, fewer rows), but it
still found a regression (SCYLLADB-777) that dtest missed - the new test
failed when running with tablets and in release build mode.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The function tablet_map::get_secondary_replica() is used by Alternator
TTL to choose a node different from get_primary_replica(). Unfortunately,
recently (commits 817fdad and d88037d) the implementation of the latter
function changed, without changing the former. So this patch changes
the former to match.
The next two patches will have two tests that fail before this patch,
and pass with it:
1. A unit test that checks that get_secondary_replica() returns a
different node than get_primary_replica().
2. An Alternator TTL test that checks that when a node is down,
expirations still happen because the secondary replica takes over
the primary replica's work.
Fixes SCYLLADB-777
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Move the storage test suite from test/storage/ to test/cluster/storage/
to consolidate related cluster-based tests.This removes the standalone
test/storage/suite.yaml as the tests will use the cluster's test configuration.
Initially these tests were in cluster, but to use unshare at first
iteration they were moved outside. Now they are using another way to
handle volumes without unshare, they should be in cluster
Closesscylladb/scylladb#28634
They were running in recovery to reuse existing system tables without
group0 id, but since we want to remove recovery mode we need to
re-generate the tables.
This patch removes ability of a cluster to upgrade from not having
group0 to having one. This ability is used in gossiper based recovery
procedure that is deprecated and removed in this version. Also remove
tests that uses the procedure.
We are going to drop legacy topology mode (gossiper mode) and no longer
allow ScyllaDB to start in this mode. This patch refuses to boot if a
cluster is not in raft topology mode yet. It may happen if a node of a
cluster that is not yet in a raft topology is upgraded to a newer
version. If this happens the node has to be downgraded. Raft topology
has to be enabled on a cluster and then the node can be upgraded again.
We are going to drop legacy topology mode (gossiper mode) and no longer
allow ScyllaDB to start in this mode. This patch disallows a node to
join a cluster that is still in legacy mode. A cluster needs to enable
raft mode first.
The ANN vector queries with all-zero vectors are allowed even on vector indexes with similarity function set to cosine.
When enabling the rescoring option, those queries would fail as the rescoring calls `similarity_cosine` function underneath, causing an `InvalidRequest` exception as all-zero vectors were not allowed matching Cassandra's behaviour.
To eliminate the discrepancy we want the all-zero vector `similarity_cosine` calls to pass, but return the NaN as the cosine similarity for zero vectors is mathematically incorrect. We decided not to use arbitrary values contrary to USearch, for which the distance (not to be confused with similarity) is defined as cos(0, 0) = 0, cos(0, x) = 1 while supporting the range of values [0, 2].
If we wanted to convert that to similarity, that would mean sim_cos(0, x) = 0.5, which does not support mathematical reasoning why that would be more similar than for example vectors marking obtuse angles.
It's safe to assume that all-zero vectors for cosine similarity shouldn't make any impact, therefore we return NaN and eliminate them from best results.
Adjusted the tests accordingly to check both proper Cassandra and Scylla's behaviour.
Fixes: SCYLLADB-456
Backport to 2026.1 needed, as it fixes the bug for ANN vector queries using rescoring introduced there.
Closesscylladb/scylladb#28609
* github.com:scylladb/scylladb:
test/vector_search: add reproducer for rescoring with zero vectors
vector_search: return NaN for similarity_cosine with all-zero vectors
This patch series moves `test/cluster/dtest/guardrails_test.py`
to `test/cluster/test_guardrails.py`, and migrates it from `cluster/dtest/`
to `cluster/` framework.
There are two motivations for moving the test:
- Execution time reduction (from 12s to 9s in 'dev' in my env)
- Facilitate adding new tests to the `guardrails_test.py` file
No backport, `dtest/guardrails_test.py` is only on master
Closesscylladb/scylladb#28737
* github.com:scylladb/scylladb:
test: move dtest/guardrails_test.py to test_guardrails.py
test: prepare guardrails_test.py to be moved to test/cluster/
The inner loops in reload_all_permissions iterate role's permissions
and _anonymous_permissions maps across yield points. Concurrent
load_permissions calls (which don't hold _loading_sem) can emplace
into those same maps during a yield, potentially triggering a rehash
that invalidates the active iterator.
We want to avoid adding semaphore acquire in load_permissions
because it's on a common path (get_permissions).
Fixing by snapshotting the keys into a vector before iterating with
yields, so no long-lived map iterator is held across suspension
points.
This series hardens MV shutdown behavior by fixing lifecycle tracking for detached view-builder callbacks and aligning update handling with the same async dispatch style used by create/drop.
Patch 1 refactors on_update_view to use a dedicated coroutine dispatcher (dispatch_update_view), keeping update logic serialized under the existing view-builder lock and consistent with the callback architecture already used for create/drop paths.
Patch 2 adds explicit callback lifetime coordination in view_builder:
- introduce a seastar::gate member
- acquire _ops_gate.hold() when launching detached create/update/drop dispatch futures
- keep the hold alive until each detached future resolves
- close the gate during view_builder::drain() so shutdown waits for in-flight callback work before final teardown
Together, these changes reduce shutdown race exposure in MV event handling while preserving existing behavior for normal operation.
Testing:
- pytest --test-py-init test/cluster/mv (47 passed, 7 skipped)
backport: not required started happening in master
fixes: SCYLLADB-687
Closesscylladb/scylladb#28648
* github.com:scylladb/scylladb:
db/view: gate detached view-builder callbacks during shutdown
db:view: refactor on_update_view to use coroutine dispatcher
Takes set of ks->tables tuples and issues snapshot for each.
If feature is enabled, keyspace is non-local, and uses tablets,
will issue topo coordinator call across cluster.
Keyspaces not fitting the above will just go to "normal" (node
local) snapshot.
Makes request_truncate_with_tablets use a parameterized helper,
because eventually we will want to use almost identical logic
for other ops, like snapshot.
To create a keyspace theres new_test_keyspace helper
Table is created with a single cql.run_async with explicit schema
Dataset is populated with a single parallel INSERT as well
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is preparational patch. Next will need to replace
foo()
bar()
with
with something() as s:
foo()
bar()
Effectively -- only add the `with something()` line. Not to shift the
whole file right together with that future change, do it here.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
distribute_role() modifies _roles on non-zero shards via
invoke_on_others() without holding _loading_sem. Similarly, load_all()'s
invoke_on_others() callback calls prune_all() without the semaphore.
When these run concurrently with reload_all_permissions(), which
iterates _roles across yield points, an insertion can trigger
absl::flat_hash_map::resize(), freeing the backing storage while
an iterator still references it.
Fix by acquiring _loading_sem on the target shard in both
distribute_role()'s and load_all()'s invoke_on_others callbacks,
serializing all _roles mutations with coroutines that iterate
the map.
The test_restore_with_streaming_scopes want to run some loop body for
all (almost) combinations of scope, primary-replica-only and min tablet
count. For that three nested loops are used. Using itertools.product()
makes the code shorter, less indented and more explicit.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Two places are fixed
1. The call to create_dataset() is replaced with three "library"
methods. This makes it explicit which options and schema are used
for that. Eventually, the large and bulky create_dataset will be
removed
2. The part that restores data into a fresh new table calls some CQLs by
hand, and partially re-uses variables obtained from previous call to
create_dataset(). Using the same "library" methods to re-create an
empty table makes this part much simpler
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is preparational patch. Next will need to replace
foo()
bar()
with
with something() as s:
foo()
bar()
Effectively -- only add the `with something()` line. Not to shift the
whole file right together with that future change, do it here.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The take_snapshot() helper runs these API sequentially for every server.
Running them with asyncio.gather() slightly reduces the wait-time thus
improving the total runtime.
Before:
CPU utilization: 2.1%
real 0m33,871s
user 0m22,500s
sys 0m13,207s
After:
CPU utilization: 2.4%
real 0m29,532s
user 0m22,351s
sys 0m12,890s
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test in question does _exactly_ what this helper does, but in a
longer way. The only difference is that it uses server_id as key to dict
with sstable components, but it's easy to tune.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
So that it's not in the middle of tests themselves, but near other
"helper" functions in the .py file
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
remove hand rolled error handling from object storage client
and replace with common machinery that supports exception
handling and retrying when appropriate
The concurrency semaphore gates uninitialized connections across all
do_accepts loops, but was initialized to a fixed value regardless of
how many listeners exist. With multiple listeners competing for the
same units, each effectively gets less than the configured concurrency.
Initialize the semaphore to concurrency - 1 and signal 1 per listen()
call, so total capacity is concurrency - 1 + nr_listeners. This
guarantees each listener's accept loop can have at least one unit
available.
The test uses create_ks_and_cf helper duplicating the existing code that does the same. This PR patches basic tests to use standard facilities. Also it prepares the ground for testing keyspace storage options with rf=3
Cleaning tests, not backporting
Closesscylladb/scylladb#28600
* https://github.com/scylladb/scylladb:
test/object_store: Remove create_ks_and_cf() helper
test/object_store: Replace create_ks_and_cf() usage with standard methods
test/object_store: Shift indentation right for test cases
Currently, test_secondary_index.py::test_indexing_paging_and_aggregation
is very slow, and the slowest test in the test/cqlpy framework: It takes
around 13 seconds on dev build, and because it is CPU-bound (doesn't sleep),
it is much slower on debug builds. The reason for this slowness is that it
needs to set up and read over 10,000 rows which is the default
select_internal_page_size.
But after the patches in pull request (#25368), we can configure
select_internal_page_size, so in this patch we change the test to
temporarily reduce this option to just 50, and then the test can reach
the same code paths with just 142 rows instead of 20120 rows before this
patch.
As a result, the test should now be 140 times faster than it was before.
In practice, because of some fixed overheads (the test creates several
tables and indexes), in dev build mode the test run speedup is "only"
26-fold (to around half a second).
I verified that removing the code added in bb08af7 indeed makes the new
shorter test fail - and this is the only test in test_secondary_index.py
that starts to fail besides test_index_paging_group_by which is also
related (so my revert didn't just break secondary indexing completely).
So the shorter test is still a good regression test.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#28268
The future toolchain did not build the sanitizers, so debug
executables did not link. Fix by not disabling the sanitizers.
Closesscylladb/scylladb#28733
The test_restore_with_streaming_scopes among other things checks how data streams flow while restoring. Whether or not to check the streams is decided based on the min tablet count value, which is compared with a hardcoded 512. This value of 512 matched the tablet count used by this test until it was "optimized" by #27839, where this number changed to 5 and streaming checks became off.
Good news is that the very same checks are still performed by test_refresh_with_streaming_scopes. But it's better to have a working restoration test anyway.
Minor test fix, not backporting
Closesscylladb/scylladb#28607
* github.com:scylladb/scylladb:
test: Fix the condition for streaming directions validation
test: Split test_backup.py::check_data_is_back() into two
Currently, the test assumes that when
'topology_coordinator_pause_before_processing_backlog: waiting' is
logged, the task for decommission must be there. This was based on the
assumption that topology coordinator is idle and decommission request
wakes it up. But if the server is slow enough, it may still be running
the load balancer in reaction to table creation, and block on that
injection point before decommission request was added.
Fix by waiting for the task to appear rather than the injection.
Fixes SCYLLADB-715
Only 2026.1 vulnerable.
Closesscylladb/scylladb#28688
* github.com:scylladb/scylladb:
test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance
test: cluster: task_manager_client: Introduce wait_task_appears()
tests: pylib: util: Add exponential backoff to wait_for
There's a bunch of incremental repair tests that want to call scylla
sstable command. For that they try to find where scylla binary by
scanning /proc directory (see local_process_id and get_scylla_path
helpers).
There's shorter way -- just call manager.get_server_exe().
Same for backup-restore test.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28676
There are three tests and a function with a pair of boolean parameters
called by those. It's less code if the function becomes a test with
parameters.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28677
The test_backup_simple creates a ks/cf, takes a snapshot, backs it up,
then checks that the files were uploaded. The test_backup_move does the
same, but also plays with 'move_files' parameter to be true/false.
In fact, the "move" test was the copy of "simple" one that dropepd check
for scheduling group being "streaming" (backup with --move-files can
check the same, it's not bad), and check for destination bucket to
contain needed files (same here -- checking that files arrived to bucket
after --move-files is good).
In the end of the day, after the change backup test is run two times,
instead of three, and performs extra checks for --move-files case.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28606
https://github.com/scylladb/scylladb/pull/25746 added a new column to `system.clients`: `client_options frozen<map<text, text>>`. This column stores all options sent by the client in the `STARTUP` message.
This PR also added `CLIENT_OPTIONS` to the list of values sent in `SUPPORTED` message, and documented that drivers can send their configuration (as JSON) in `STARTUP` under this key.
Documentation for the new column was not added to the description of `system.clients` table, and documentation about the new `STARTUP` key was added in `protocol-extensions.md`, but in the section about shard awareness extension.
This PR adds missing `system.clients` column description, moves the documentation of `CLIENT_OPTIONS` into its own section, and expands it a bit.
Backport: none, because this fixes internal documentation.
Closesscylladb/scylladb#28126
* github.com:scylladb/scylladb:
protocol-extensions.md: Fix client_options docs
system_keyspace.md: Add client_options column
system_keyspace.md: Fix order in system.clients
Doing it with format("{}", foo) is correct, but to_string is
a bit more lightweight.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28630
The `try-catch` expression is pretty much useless in its current form. If we return the future, the awaiting will only be performed by the caller, completely circumventing the exception handling.
As a result, instead of handling `raft::request_aborted` with a proper error message, the user will face `seastar::abort_requested_exception` whose message is cryptic at best. It doesn't even point to the root of the problem.
Fixes SCYLLADB-665
Backport: This is a small improvement and may help when debugging, so let's backport it to all supported versions.
Closesscylladb/scylladb#28624
* https://github.com/scylladb/scylladb:
test: raft: Add test_aborting_wait_for_state_change
raft: Describe exception types for wait_for_state_change and wait_for_leader
raft: Await instead of returning future in wait_for_state_change
This commit moves `guardrails_test.py`, prepared in the previous
commit of this patch series, to `test/cluster/test_guardrails.py`.
It also cleans up `suite.yaml`.
Disable `test/cluster/dtest/guardrails_test.py` in `suite.yaml` and
make it compatible with the `test/cluster/` framework. This will
allow moving this file from `test/cluster/dtest/` to `test/cluster/`
in the next commit of this patch series.
There are two motivations for moving the test:
- Execution time reduction (from 12s to 9s in 'dev' in my env)
- Facilitate adding new tests to the `guardrails_test.py` file
There are 3 metrics (that goes in every compaction_history entry):
total_tombstone_purge_attempt
total_tombstone_purge_failure_due_to_overlapping_with_memtable
total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable
When a tombstone is not expired (e.g. doesn't satisfy "gc_before" or
grace period), it can be currently accounted as failure due to
overlapping with either memtable or uncompacting sstable.
So those 2 last metrics have noise of *unexpired* tombstones.
What we should do is to only account for expired tombstones in all
those 3 metrics. We lose the info of knowing the amount of tombstones
processed by compaction, now we'll only know about the expired ones.
But those metrics were primarily added for explaining why expired
tombstones cannot be removed.
We could have alternatively added a new field
purge_failure_due_to_being_unexpired or something, but
it requires adding a new field to compaction_history.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-737.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#28669
Links were pointing to the `debian` subdirectory. However, there docker build was refactored to use `redhat`: 1abf981a73, see https://github.com/scylladb/scylladb/pull/22910
No backport, just a README link fixes.
Closesscylladb/scylladb#28699
* github.com:scylladb/scylladb:
docs: fix path to the build_docker.sh which was moved from debian to redhat subdirectory
docs: fix link to docker build README.MD
Make the actual table name a parameter and add logic to adapt to the
variant used.
Also add dump_to_log::yes to is_rows() invokation to help debuging
tests.
When set to true, the query results will be logged by the testlog logger
with debug level. A huge help when debugging failures around cql
assertions: seeing the actual query result is often enough to
immediately understand why the test failed.
system.batchlog will still have to be used while the cluster is
upgrading from an older version, which doesn't know v2 yet.
Re-add support for replaying v1 batchlogs. The switch to v2 will happen
after the BATCHLOG_V2 cluster feature is enabled.
The only external user -- storage_proxy -- only needs a minor
adjustment: switch between the table names. The rest is handled
transparently by the db/batchlog.hh interface and the batchlog_manager.
process_batch() currently returns stop_iteration::no from all control
paths. This is not useful. Return the all_replayed output param instead.
This requires making the batch() lambda a coroutine, but considering the
amount of work process_batch() does (send multiple writes), this should
be inconsequential.
The patchset fixes abort_source implementation for perf-alternator and perf-cql-raw. It moves
run_standalone function to common code in perf.hh with necessary templating.
We also add extensive testing so that it's more difficult to break the tooling in the future.
Fixes SCYLLADB-560
Backport: no, internal tooling improvement
Closesscylladb/scylladb#28541
* github.com:scylladb/scylladb:
test: cluster: add tests for perf tools
test: perf: fix port race condition on startup in connect workload
test: perf: prepare benchmarks to bind to custom host
test: perf: make perf-alterantor remote port configurable
test: perf: fix ASAN leak warnings in perf-alternator
Reapply "main: test: add future and abort_source to after_init_func"
Some assertions in the Raft-based topology are likely to cause crashes of
multiple nodes due to the consistent nature of the Raft-based code. If the
failing assertion is executed in the code run by each follower (e.g., the code
reloading the in-memory topology state machine), then all nodes can crash. If
the failing assertion is executed only by the leader (e.g., the topology
coordinator fiber), then multiple consecutive group0 leaders will chain-crash
until there is no group0 majority.
Crashing multiple nodes is much more severe than necessary. It's enough to
prevent the topology state machine from making more progress. This will
naturally happen after throwing a runtime error. The problematic fiber will be
killed or will keep failing in a loop. Note that it should be safe to block
the topology state machine, but not the whole group0, as the topology state
machine is mostly isolated from the rest of group0.
We replace some occurrences of `on_fatal_internal_error` and `SCYLLA_ASSERT`
with `on_internal_error`. These are not all occurrences, as some fatal
assertions make sense, for example, in the bootstrap procedure.
We also raise an internal error to prevent a segmentation fault in a few places.
Fixes#27987
Backporting this PR is not required, but we can consider it at least for 2026.1
because:
- it is LTS,
- the changes are low-risk,
- there shouldn't be many conflicts.
Closesscylladb/scylladb#28558
* github.com:scylladb/scylladb:
raft topology: prevent accessing nullptr returned by topology::find
raft topology: make some assertions non-crashing
In https://github.com/scylladb/scylladb/pull/27262 table audit has been
re-enabled by default in `scylla.yaml`, logging certain categories to a table,
which should make new Scylla deployments have audit enabled.
Now, in the next release, we also want to enable audit in `db/config.cc`,
which should enable audit for all deployments, which don't explicitly configure
audit otherwise in `scylla.yaml` (or via cmd line).
BTW. Because this commit aligns audit's default config values in `db/config.cc`
to those of `scylla.yaml`, `docs/reference/configuration-parameters.rst`, which
is based on `db/config.cc` will start showing that table audit is the default.
Refs: https://github.com/scylladb/scylladb/issues/28355
Refs: https://scylladb.atlassian.net/browse/SCYLLADB-222
No backport: table audit has been enabled in 2026.1 in `scylla.yaml`,
and should be always on starting from the next release,
which is the release we're currently merging to (2026.2).
Closesscylladb/scylladb#28376
* github.com:scylladb/scylladb:
docs: decommission: note audit ks may require ALTERing
docs: mention table audit enabled by default
audit: disable DDL by default
db/config: enable table audit by default
test/cluster: fix `test_table_desc_read_barrier` assertion
test/cluster: adjust audit in tests involving decommissioning its ks
audit_test: fix incorrect config in `test_audit_type_none`
Compaction and statement groups are carried over on those configs, but
are in fact unused. Drop both.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28540
There are four tests that check how restore with primary-replica-only option works in various scopes and topologies. Cases that check same-racks and same-datacenters are very very similar, so are those that check different-racks and different-datacenters. Parametrizing them and merging saves lots of code (+30 lines, -116 lines)
It's probably worth merging the resulting same-domain with different-domain tests, because the similarity is still large in both, but the result becomes too if-y, so not done here. Maybe later.
Improving tests, not backporting
Closesscylladb/scylladb#28569
* https://github.com/scylladb/scylladb:
test: Merge test_restore_primary_replica_different_... tests
test: Merge test_restore_primary_replica_same_... tests
test: Don't specify expected_replicas in test_restore_primary_replica_different_dc_scope_all
test: Remove local r_servers variable from test_restore_primary_replica_different_dc_scope_all
Fix the build of the test and the upload operation flow
No need to backport since it is only a test we barely use
Closesscylladb/scylladb#28595
* github.com:scylladb/scylladb:
s3_perf: fix upload operation flow
s3_perf: fix the CMake build
Tablet migration keeps sstable snapshot during streaming, which may
cause temporary increase in disk utilization if compaction is running
concurrently. SSTables compacted away are kept on disk until streaming
is done with them. The more tablets we allow to migrate concurrently,
the higher disk space can rise. When the target tablet size is
configured correcly, every tablet should own about 1% of disk
space. So concurrency of 4 shouldn't put us at risk. But target tablet
size is not chosen dynamically yet, and it may not be aligned with
disk capacity.
Also, tablet sizes can temporarily grow above the target, up to 2x
before the split starts, and some more because splits take a while to
complete.
To reduce the impact from this, reduce concurrency of
migration. Concurrency of 2 should still be enough to saturate
resources on the leaving shard.
Also, reducing concurrency means that load balancing is more
responsive to preemption. There will be less bandwidth sharing, so
scheduled migrations complete faster. This is important for scale-out,
where we bootstrap a node and want to start migrations to that new
node as soon as possible.
Refs scylladb/siren#15317Closesscylladb/scylladb#28563
* github.com:scylladb/scylladb:
tablets, config: Reduce migration concurrency to 2
tablets: load_balancer: Always accept migration if the load is 0
config, tablets: Make tablet migration concurrency configurable
SNI works only with DNS hostnames. Adding an IP address causes warnings
on the server side.
This change adds SNI only if it is not an IP address.
This change has no unit tests, as this behavior is not critical,
since it causes a warning on the server side.
The critical part, that the server name is verified, is already covered.
Fixes: VECTOR-528
In order to simplify troubleshooting connection problems, this patch
adds an extra warn log that prints the error for the vector search
request whenever it fails.
The methods of `raft::server` are abortable and if the passed
`abort_source` is triggered, they throw `raft::request_aborted`.
We document that.
Although `raft::server` is an interface, this is consistent with
the descriptions of its other methods.
The `try-catch` expression is pretty much useless in its current form.
If we return the future, the awaiting will only be performed by the
caller, completely circumventing the exception handling.
As a result, instead of handling `raft::request_aborted` with a proper
error message, the user will face `seastar::abort_requested_exception`
whose message is cryptic at best. It doesn't even point to the root
of the problem.
Fixes SCYLLADB-665
Due to lack of checks present in process_execute_internal from
transport/server.cc needs_authorization bool was always set to true
doing some extra work (check_access()) for each request.
We mirror the logic in this patch in test env which perf-simple-query
uses. This can also potentially improve runtime of unittests (marginally).
Note that bug is only in perf tool not scylla itself, the fix
decreases insns/op by around 10%:
Before: 41065 insns/op
After: 37452 insns/op
Command: ./build/release/scylla perf-simple-query --duration 5 --smp 1
Fixes https://github.com/scylladb/scylladb/issues/27941Closesscylladb/scylladb#28704
Using an outdated image can cause problems when `microdnf update`
runs, if the distribution doesn't maintain good update hygiene.
Although, I suspect that when update failures happen they're really
caused by propagation delay of packages to mirrors.
Fix by using --pull=always to get a fresh image.
Ref https://scylladb.atlassian.net/browse/SCYLLADB-714Closesscylladb/scylladb#28680
In storage_service::load_stats_for_tablet_based_tables(), we are passing
a reference to sum_tablet_sizes to the lambda which increments this value
on each shard via map_reduce0(). This means we could have a race
condition because this is executed on separate threads/CPUs.
This patch fixed the problem by collecting the sums by shard into a
vector, then summing those up.
Refs: SCYLLADB-678
Closesscylladb/scylladb#28703
interval_data's move constructor is conditionally noexcept. It
contains a throw statemnt for the case that the underlying type's
move constructor can throw; that throw statemnt is never executed
if we're in the noexept branch. Clang 23 however doesn't understand
that, and warns about throwing in a noexcept function.
Fix that by rewriting the logic using seastar::defer(). In the
noexcept case, the optimizer should eliminate it as dead code.
Closesscylladb/scylladb#28710
Correct the upload operation logic. The previous flow incorrectly
checked for the test file on S3 even when performing operations that do
not download the file, such as uploads.
Remove bootstrap and decomission from allowed_repair_based_node_ops.
Using RBNO over streaming for these operations has no benefits, as they
are not exposed to the out-of-date replica problem that replace,
removenode and rebuild are.
On top of that, RBNO is known to have problems with empty user tables.
Using streaming for boostrap and decomission is safe and faster
than RBNO in all condition, especially when the table is small.
One test needs adjustment as it relies on RBNO being used for all node
ops.
Fixes: SCYLLADB-105
Closesscylladb/scylladb#28080
It checks if all workloads can be properly
executed with succesfull startup and teardown.
Especially testing alternator in remote mode is important
because it's invoked like this during pgo training in pgo.py.
Test runtime:
Release - 24s
Debug - 1m 15s
Test time consists mostly of Scylla startup in various modes.
Other workloads at startup call prepopulate() which connects
with retry loop therefore it waits until cql port is open.
This commit adds a single place where we will wait for port
for all workloads.
Timeout is set to 5 minutes so that even slowest machines
are able to start.
There is a handful of places in the code related to dictionary
compression which calls get_units to acquire semaphore units but the
returned future is not awaited, seemingly by mistake. The result of
get_units is assigned to a variable - which is reasonable at a glance
because the semaphore units need to be assigned to a variable in order
to control their scope - but at the same time if co_await is mistakenly
omitted, like here, doing so will silence the nodiscard check of
seastar::future and, effectively, the get_units call will be nearly
useless. Unfortunately, this is an easy mistake to make.
Fix the places in the code that acquire semaphore units via get_units
but never await the future returned by it. I found them by manual code
inspection, so I hope that I didn't miss any.
Closesscylladb/scylladb#28581
With audit feature enabled, it's not immediately obvious that its
pseudo-system keyspace `audit` may require adjusting its RF across DCs
before decommissioning a node, and this should be documented.
DDL audit category doesn't make sense if its enabled by default on its
own, as no DDL statements are going to be audited if audit_keyspaces/audit_tables
setting is empty. This may be counter-intuitive to our users, who may
expect to actually see these statements logged if we're enabling this by
default. Also, it doesn't make sense to enable a setting by default if
it has no effect.
Additionally, listed all possible audit categories for user's
convenience.
In https://github.com/scylladb/scylladb/pull/27262 table audit has been
re-enabled by default in `scylla.yaml`, logging certain categories to a table,
which should make new Scylla deployments have audit enabled.
Now, in the next release, we also want to enable audit in `db/config.cc`,
which should enable audit for all deployments, which don't explicitly configure
audit otherwise in `scylla.yaml` (or via cmd line).
BTW. Because this commit aligns audit's default config values in `db/config.cc`
to those of `scylla.yaml`, `docs/reference/configuration-parameters.rst`, which
is based on `db/config.cc` will start showing that table audit is the default.
Refs: https://github.com/scylladb/scylladb/issues/28355
Refs: https://scylladb.atlassian.net/browse/SCYLLADB-222
The test `assertion desc_schema[0] == desc_schema[1]` does a direct
list comparison, which is order-sensitive. Before enabling audit by default,
both nodes would return only the test keyspace/table, so the order
didn't matter. With audit enabled, there will be multiple keyspaces,
and they can be returned in different order by different nodes.
When table audit is enabled, Scylla creates the "audit" ks with
NetworkTopologyStrategy and RF=3. During node decommission, streaming can fail
for the audit ks with "zero replica after the removal" when all nodes from a DC
are removed, and so we have to ALTER audit ks to either zero the number of its
replicas, to allow for a clear decommission, or have them in the 2nd DC.
BTW. https://github.com/scylladb/scylladb/issues/27395 is the same change, but
in dtests repository.
Passing Python `None` to setup is incorrect, because config updates are sent
as a dict and `None` is treated as "unset" - meaning: use Scylla's default.
Using the explicit string "none" to guarantee that audit is disabled.
There is no point running repair for tables using RF one. Row level
repair will skip it but the auto repair scheduler will keep scheduling
such repairs since repair_time could not be updated.
Skip such repairs at the scheduler level for auto repair.
If the request is issued by user, we will have to schedule such
repair otherwise the user request will never be finished.
Fixes SCYLLADB-561
Closesscylladb/scylladb#28640
This commit introduces four changes:
- In the `table` example, singular forms (node, partition) are changed to
plural forms (nodes, partitions). Currently, the default `table`
audit configuration is RF=3 and writes use CL=ONE. Therefore,
a `table` audit log write failure should not be caused by a single
node unavailability, and plural forms are more adequate.
- In the `table` example, unreachability due to network issues is
mentioned because with RF=3, audit failure due to network problems
is more likely to happen than a simultaneous failure of three
nodes (such network failures happened in SCYLLADB-706).
- In the `syslog` example, a slash `/` is changed to `or`, so `table`
and `syslog` examples have similar structure.
- As the `syslog` line is already being changed, I also change `unix`
to `Unix`, as the capitalized form is the correct one.
Refs SCYLLADB-706
Closesscylladb/scylladb#28702
The connection's `cpu_concurrency_t` struct tracks the state of a connection
to manage the admission of new requests and prevent CPU overload during
connection storms. When a connection holds units (allowed only 0 or 1), it is
considered to be in the "CPU state" and contributes to the concurrency limits
used when accepting new connections.
The bug stems from the fact that `counted_data_source_impl::get` and
`counted_data_sink_impl::put` calls can interleave during execution. This
occurs because of `should_parallelize` and `_ready_to_respond`, the latter being
a future chain that can run in the background while requests are being read.
Consequently, while reading request (N), the system may concurrently be
writing the response for request (N-1) on the same connection.
This interleaving allows `return_all()` to be called twice before the
subsequent `consume_units()` is invoked. While the second `return_all()` call
correctly returns 0 units, the matching `consume_units()` call would
mistakenly take an extra unit from the semaphore. Over time, a connection
blocked on a read operation could end up holding an unreturned semaphore
unit. If this pattern repeats across multiple connections, the semaphore
units are eventually depleted, preventing the server from accepting any
new connections.
The fix ensures that we always consume the exact number of units that were
previously returned. With this change, interleaved operations behave as
follows:
get() return_all — returns 1 unit
put() return_all — returns 0 units
get() consume_units — takes back 1 unit
put() consume_units — takes back 0 units
Logically, the networking phase ends when the first network operation
concludes. But more importantly, when a network operation
starts, we no longer hold any units.
Other solutions are possible but the chosen one seems to be the
simplest and safest to backport.
Fixes SCYLLADB-485
Backport: all supported affected versions, bug introduced with initial feature implementation in: ed3e4f33fdClosesscylladb/scylladb#28530
* github.com:scylladb/scylladb:
test: auth_cluster: add test for hanged AUTHENTICATING connections
transport: fix connection code to consume only initially taken semaphore units
This patchset replaces permissions cache based on loading_cache with a new unified (permissions and roles), full, coherent auth cache.
Reason for the change is that we want to improve scenarios under stress and simplify operation manuals. New cache doesn't require any tweaking. And it behaves particularly better in scenarios with lots of schema entities (e.g. tables) combined with unprepared queries. Old cache can generate few thousands of extra internal tps due to cache refresh.
Benchmark of unprepared statements (just to populate the cache) with 1000 tables shows 3k tps of internal reads reduction and 9.1% reduction of median instructions per op. So many tables were used to show resource impact, cache could be filled with other resource types to show the same improvement.
Backport: no, it's a new feature.
Fixes https://github.com/scylladb/scylladb/issues/7397
Fixes https://github.com/scylladb/scylladb/issues/3693
Fixes https://github.com/scylladb/scylladb/issues/2589
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-147Closesscylladb/scylladb#28078
* github.com:scylladb/scylladb:
test: boost: add auth cache tests
auth: add cache size metrics
docs: conf: update permissions cache documentation
auth: remove old permissions cache
auth: use unified cache for permissions
auth: ldap: add permissions reload to unified cache
auth: add permissions cache to auth/cache
auth: add service::revoke_all as main entry point
auth: explicitly life-extend resource in auth_migration_listener
The hostent::addr_list is deprecated in favor of address_entry::addr
field that contains the very same addresses.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28566
All users of it had been updated to get the streaming group elsewhere,
so this getter is no longer needed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28527
sccache combines the functions of ccache and distcc, and
promises to support C++20 modules in the future. Switch
to sccache in anticipation of modules support.
The documentation is adjusted since cache will be
persistent for sccache without further work.
Closesscylladb/scylladb#28524
There are some places that get `map<foo, bar>` and return it to the caller as `"key": string(foo), "value": string(bar)` json. For that there's `map_to_key_value()` helper in api.hh that re-formats the map into a vector of json elements and returns it, letting seastar json-ize that vector.
Recently in seastar there appeared stream_range_as_array() helper that helps streaming any range without converting it into intermediate collection. Some of the hottest users of `map_to_key_value()` had been converted, this PR converts few remainders and removes the helper in question to encourage further usage of the stream_range_as_array().
Code cleanup, not backporting
Closesscylladb/scylladb#28491
* github.com:scylladb/scylladb:
api: Remove map_to_key_value() helpers
api: Streamify view_build_statuses handler
api: Streamify few more storage_service/ handlers
api: Add map_to_json() helper
api: Coroutinize view_build_statuses handler
The "--primary-replica-only" ("-pro") flag was previously ignored by
the `restore` operation. This patch ensures the argument is parsed and
applied correctly.
Closesscylladb/scylladb#28490
in s3::Range class start using s3 global constant for two reasons:
1) uniformity, no need to introduce semantically same constant in each class
2) the value was wrong
Detached migration callbacks (on_create_view, on_update_view, on_drop_view)
can race with view_builder::drain() teardown.
Add a lifetime gate to view_builder and wire callback launches through
_ops_gate.hold() so each detached dispatch future is tracked until it
completes (finally keeps the hold alive). During shutdown, drain()
now waits for all tracked callback work with _ops_gate.close().
This ensures drain does not proceed past callback lifetime while shutdown is in
progress, and ignores only gate_closed_exception at callback entry as the
expected shutdown path.
Fixes parsing of comma-separated seed lists in "init.cc" and "cql_test_env.cc" to use the standard `split_comma_separated_list` utility, avoiding manual `npos` arithmetic. The previous code relied on `npos` being `uint32_t(-1)`, which would not overflow in `uint64_t` target and exit the loop as expected. With Seastar's upcoming change to make `npos` `size_t(-1)`, this would wrap around to zero and cause an infinite loop.
Switch to `split_comma_separated_list` standardized way of tokenization that is also used in other places in the code. Empty tokens are handled as before. This prevents startup hangs and test failures when Seastar is updated.
The other commit also removes the unnecessary creation of temporary `gms::inet_address()` objects when calling `std::set<gms::inet_address>::emplace()`.
Refs: https://github.com/scylladb/seastar/pull/3236
No backport: The problem will only appear in master after the Seastar will be upgraded. The old code works with the Seastar before https://github.com/scylladb/seastar/pull/3236 (although by accident because of different integer bitsizes).
Closesscylladb/scylladb#28573
* github.com:scylladb/scylladb:
init: fix infinite loop on npos wrap with updated Seastar
init: remove unnecessary object creation in emplace calls
test_node_ops_tasks.py::test_get_children fails due to timeout of
tasks_vt_get_children injection in debug mode. Compared to a successful
run, no clear root cause stands out.
Extend the message timeout of tasks_vt_get_children from 10s to 60s.
Fixes: #28295.
Closesscylladb/scylladb#28599
What changed
Updated .github/workflows/call_sync_milestone_to_jira.yml to include SMI in jira_project_keys
Why (Requirements Summary)
Adding SMI to create releases in the SMI Jira project based on new milestones from scylladb.git.
This will create a new release in the SMI Jira project when a milestone is added to scylladb.git.
Fixes:PM-190
Closesscylladb/scylladb#28585
Right now the slowest test in the test/cqlpy directory is
cassandra_tests/validation/entities/collections_test.py::
testMapWithLargePartition
This test (translated from Cassandra's unit test), just wants to verify
that we can write and flush a partition with a single large map - with
200 items totalling around 2MB in size.
200 items totalling 2MB is large, but not huge, and is not the reason
why this test was so so slow (around 9 seconds). It turns out that most
of the test time was spent in Python code, preparing a 2MB random string
the slowest possible way. But there is no need for this string to be
random at all - we only care about the large size of the value, not the
specific characters in it!
Making the characters written in this text constant instead of random
made it 20 times fast - it now takes less than half a second.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#28271
File streaming only releases the file descriptors of a tablet being
streamed in the very streaming end. Which means that if the streaming
tablet has compaction on largest tier finished after streaming
started, there will be always ~2x space amplification for that
single tablet. Since there can be up to 4 tablets being migrated
away, it can add up to a significant amount, since nodes are pushed
to a substantial usage of available space (~90%).
We want to optimize this by dropping reference to a sstable after
it was fully streamed. This way, we reduce the chances of hitting
2x space amplification for a given tablet.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#28505
Fedora 45 tightened the default installation checks [1]. As a result
the cassandra-stress rpm we provide no longer installs.
Install it with --no-gpgchecks as a workaround. It's our own package
so we trust it. Later we'll sign it properly.
We install its dependencies via the normal methods so they're still
checked.
[1] https://fedoraproject.org/wiki/Changes/Enforcing_signature_checking_by_defaultClosesscylladb/scylladb#28687
Today S3 client has well established and well testes (hopefully) http request retry strategy, in the rest of clients it looks like we are trying to achieve the same writing the same code over and over again and of course missing corner cases that already been addressed in the S3 client.
This PR aims to extract the code that could assist other clients to detect the retryability of an error originating from the http client, reuse the built in seastar http client retryability and to minimize the boilerplate of http client exception handling
No backport needed since it is only refactoring of the existing code
Closesscylladb/scylladb#28250
* github.com:scylladb/scylladb:
exceptions: add helper to build a chain of error handlers
http: extract error classification code
aws_error: extract `retryable` from aws_error
- Correct `calc_part_size` function since it could return more than 10k parts
- Add tests
- Add more checks in `calc_part_size` to comply with S3 limits
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-640
Must be ported back to 2025.3/4 and 2026.1 since we may encounter this bug in production clusters
Closesscylladb/scylladb#28592
* github.com:scylladb/scylladb:
s3_client: add more constrains to the calc_part_size
s3_client: add tests for calc_part_size
s3_client: correct multipart part-size logic to respect 10k limit
Currently, the test assumes that when
'topology_coordinator_pause_before_processing_backlog: waiting' is
logged, the task for decommission must be there. This was based on the
assumption that topology coordinator is idle and decommission request
wakes it up. But if the server is slow enough, it may still be running
the load balancer in reaction to table creation, and block on that
injection point before decommission request was added.
Fix by waiting for the task to appear rather than the injection.
Fixes SCYLLADB-715
Improves performance of deserialization of vector data for calculating similarity functions.
Instead of deserializing vector data into a std::vector<data_value>, we deserialize directly into a std::vector<float>
and then pass it to similarity functions as a std::span<const float>.
This avoids overhead of data_value allocations and conversions.
Example QPS of `SELECT id, similarity_cosine({vector<float, 1536>}, {vector<float, 1536>}) ...`:
client concurrency 1: before: ~135 QPS, after: ~1005 QPS
client concurrency 20: before: ~280 QPS, after: ~2097 QPS
Measured using https://github.com/zilliztech/VectorDBBench (modified to call above query without ANN search)
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-471Closesscylladb/scylladb#28615
Fixes#28678
If replenish loop exits the sleep condition, with an empty queue,
when "_shutdown" is already set, a waiter might get stuck, unsignalled
waiting for segments, even though we are exiting.
Simply move queue abort to always be done on loop exit.
Closesscylladb/scylladb#28679
Fixes parsing of comma-separated seed lists in "init.cc" and
"cql_test_env.cc" to use the standard `split_comma_separated_list`
utility, avoiding manual `npos` arithmetic. The previous code relied on
`npos` being `uint32_t(-1)`, which would not overflow in `uint64_t`
target and exit the loop as expected. With Seastar's upcoming change
to make `npos` `size_t(-1)`, this would wrap around to zero and cause
an infinite loop.
Switch to `split_comma_separated_list` standardized way of tokenization
that is also used in other places in the code. Empty tokens are handled
as before. This prevents startup hangs and test failures when Seastar is
updated.
Refs: scylladb/seastar#3236
The cache is covered already with general auth
dtests but some cases are more tricky and easier
to express directly as calls to cache class.
For such tests boost test file was added.
The LDAP server may change role-chain assignments without notifying
Scylla. As a result, effective permissions can change, so some form of
polling is required.
Currently, this is handled via cache expiration. However, the unified
cache is designed to be consistent and does not support expiration.
To provide an equivalent mechanism for LDAP, we will periodically
reload the permissions portion of the new cache at intervals matching
the previously configured expiration time.
We want to get rid of loading cache because its periodic
refresh logic generates a lot of internal load when there
is many entries. Also our operation procedures involve tweaking
the config while new unified cache is supposed to work out
of the box.
In the following commit we'll need to add some
cache related logic (removing resource permissions).
This logic doesn't depend on authorizer so it should
be managed by the service itself.
The connection's cpu_concurrency_t struct tracks the state of a connection
to manage the admission of new requests and prevent CPU overload during
connection storms. When a connection holds units (allowed only 0 or 1), it is
considered to be in the "CPU state" and contributes to the concurrency limits
used when accepting new connections.
The bug stems from the fact that `counted_data_source_impl::get` and
`counted_data_sink_impl::put` calls can interleave during execution. This
occurs because of `should_parallelize` and `_ready_to_respond`, the latter being
a future chain that can run in the background while requests are being read.
Consequently, while reading request (N), the system may concurrently be
writing the response for request (N-1) on the same connection.
This interleaving allows `return_all()` to be called twice before the
subsequent `consume_units()` is invoked. While the second `return_all()` call
correctly returns 0 units, the matching `consume_units()` call would
mistakenly take an extra unit from the semaphore. Over time, a connection
blocked on a read operation could end up holding an unreturned semaphore
unit. If this pattern repeats across multiple connections, the semaphore
units are eventually depleted, preventing the server from accepting any
new connections.
The fix ensures that we always consume the exact number of units that were
previously returned. With this change, interleaved operations behave as
follows:
get() return_all — returns 1 unit
put() return_all — returns 0 units
get() consume_units — takes back 1 unit
put() consume_units — takes back 0 units
Logically, the networking phase ends when the first network operation
concludes. But more importantly, when a network operation
starts, we no longer hold any units.
Other solutions are possible but the chosen one seems to be the
simplest and safest to backport.
Fixes SCYLLADB-485
Added .github/workflows/close_issue_for_scylla_employee.yml workflow file to automatically close issues opened by ScyllaDB associates
We want to allow external users to open issues in the scylladb repo, but for ScyllaDB associates, we would like them to open issues in Jira instead. If a ScyllaDB associates opens by mistake an issue in scylladb.git repo, the issue will be closed automatically with an appropriate comment explaining that the issue should be opened in Jira.
This is a new github action, and does not require any code backport.
Fixes: PM-64
Closesscylladb/scylladb#28212
What changed
Added new workflow file .github/workflows/call_jira_sync_pr_milestone.yml
Why (Requirements Summary)
Adds a GitHub Action that will be triggered when a milestone is set or removed from a PR
When milestone is added (milestoned event), calls main_jira_sync_pr_milestone_set.yml from github-automation.git, which will add the version to the 'Fix Versions' field in the relevant linked Jira issue
When milestone is removed (demilestoned event), calls main_jira_sync_pr_milestone_removed.yml from github-automation.git, which will remove the version from the 'Fix Versions' field in the relevant linked Jira issue
Testing was performed in staging.git and the STAG Jira project.
Fixes:PM-177
Closesscylladb/scylladb#28575
Fixes#28398Fixes#28399
When used as path elements in google storage paths, the object names need to be URL encoded. Due to
a.) tests not really using prefixes including non-url valid chars (i.e. / etc)
and
b.) the mock server used for most testing not enforcing this particular aspect,
this was missed.
Modified unit tests to use prefixing for all names, so when running real GS, any errors like this will show.
"Real" GCS also behaves a bit different when listing with pager, compared to mock;
The former will not give a pager token for last page, only penultimate.
Adds handling for this.
Needs backport to the releases that have (though might not really use) the feature, as it is technically possible to use google storage for backup and whatnot there, and it should work as expected.
Closesscylladb/scylladb#28400
* github.com:scylladb/scylladb:
utils/gcp/object_storage: URL-encode object names in URL:s
utils::gcp::object_storage: Fix list object pager end condition detection
The current manager flow have a flaw. It will trigger pytest.fail when
it found errors on teardown regardless if the test was already failed.
This will create an additional record in JUnit report with the same name
and Jenkins will not be able to show the logs correctly. So to avoid
this, this PR changes logic slightly.
Now manager will check that test failed or not to avoid two fails for
the same test in the report.
If test passed, manager will check the cluster status and fail if
something wrong with a status of it. There is no need to check the
cluster status in case of test fail.
If test passed, and cluster status if OK, but there are unexpected
errors in the logs, test will fail as well. But this check will gather
all information about the errors and potential stacktraces and will only
fail the test if it's not yet failed to avoid double entry in report.
Closesscylladb/scylladb#28633
The test was marked with xfail in #28383, as it needed to be updated to
work with the Raft-based topology. We are doing that in this patch.
With the Raft-based topology, there is no reason to check that nodes with
different group0 IDs cannot merge their topology/token_metadata. That is
clearly impossible, as doing any topology change requires being in the
same group0. So, the original regression test doesn't make sense.
We can still test that nodes with different group0 IDs cannot gossip with
each other, so we keep the test. It's very fast anyway.
No backport, test update.
Closesscylladb/scylladb#28571
* github.com:scylladb/scylladb:
test: run test_different_group0_ids in all modes
test: make test_different_group0_ids work with the Raft-based topology
Currently, the load balancing simulator computes node, shard and tablet load based on tablet count.
This patch changes the load balancing simulator to be tablet size aware. It generates random tablet sizes with a normal distribution, and a mean value of `default_target_tablet_size`, and reports the computed load for nodes and tables based on tablet size sum, instead of tablet count.
This is the last patch in the size based load balancing series. It is the last PR in the Size Based Load Balancing series:
- First part for tablet size collection via load_stats: scylladb/scylladb#26035
- Second part reconcile load_stats: scylladb/scylladb#26152
- The third part for load_sketch changes: scylladb/scylladb#26153
- The fourth part which performs tablet load balancing based on tablet size: scylladb/scylladb#26254
- The fifth part changes the load balancing simulator: scylladb/scylladb#26438
This is a new feature and backport is not needed.
Closesscylladb/scylladb#26438
* github.com:scylladb/scylladb:
test, simulator: compute load based on tablet size instead of count
test, simulator: generate tablet sizes and update load_stats
test, simulator: postpone creation of load_stats_ptr
In b03d520aff ("cql3: introduce similarity functions syntax") we
added vector similarity functions to the grammar. The grammar had to
be modified because we wanted to support literals as vector similarity
function arguments, and the general function syntax in selectors
did not allow that.
In cc03f5c89d ("cql3: support literals and bind variables in
selectors") we extended the selector function call grammar to allow
literals as function arguments.
Here, we remove the special case for vector similarity functions as
the general case in function calls covers all the possibilities the
special case does.
As a side effect, the vector similarity function names are no longer
reserved.
Note: the grammar change fixes an inconsistency with how the vector
similarity functions were evaluated: typically, when a USE statement
is in effect, an unqualified function is first matched against functions
in the keyspace, and only if there is no match is the system keyspace
checked. But with the previous implementation vector similarity functions
ignored the USE keyspace and always matched only the system keyspace.
This small inconsistency doesn't matter in practice because user defined
functions are still experimental, and no one would name a UDF to conflict
with a system function, but it is still good to fix it.
Closesscylladb/scylladb#28481
Currently, if the test fail, pytest will output only some basic information
about the fail. With this change, it will output the last 300 lines of the
boost/seastar test output.
Also add capturing the output of the failed tests to JUnit report, so it
will be present in the report on Jenkins.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-449Closesscylladb/scylladb#28535
In ebda2fd4db ("test: cql_test_env: increase file descriptor limit"),
we raised the open file limit for cql_test_env. Here, we raise it for sstables::test_env
as well, to fix a couple of twcs resharding tests failing outside dbuild. These tests
open 256 sstables, and with 2 files/sstable + resharding work it is understandable
that they overflow the 1024 limit.
No backport: this is a quality of life improvement for developers running outside dbuild, but they can use dbuild for branches.
Closesscylladb/scylladb#28646
* github.com:scylladb/scylladb:
test: sstables::test_env: adjust file open limit
test: extract cql_test_env's adjust_rlimit() for reuse
* seastar f55dc7eb...d2953d2a (13):
> io_tester: Revive IO bandwidth configuration
> Merge 'io_tester: add vectorized I/O support' from Travis Downs
doc: add vectorized I/O options to io-tester.md
io_tester: add vectorized I/O support
> Merge 'Remove global scheduling group ID bitmap' from Pavel Emelyanov
reactor: Drop sched group IDs bitmap
reactor: Allocate scheduling group on shard-0 first
reactor: Detach init_scheduling_group_specific_data()
reactor: Coroutinize create_scheduling_group()
> set_iterator: increase compatibility with C++ ranges
> test: fix race condition in test_connection_statistics
> Add Claude Code project instructions
> reactor: Unfriend pollable_fd via pollable_fd_state::make()
> Merge 'rpc_tester: introduce rpc_streaming job based on streaming API' from Jakub Czyszczoń
apps: rpc_tester: Add STREAM_UNIDIRECTIONAL job We introduce an unidirectional streaming to the rpc_streaming job.
apps: rpc_tester: Add STREAM_BIDIRECTIONAL job This commit extends the rpc_tester with rpc_streaming job that uses rpc::sink<> and rpc::source<> to stream data between the client and the server.
> treewide: remove remnants of SEASTAR_MODULE
> test: Tune abort-accept test to use more readable async()
> build: support sccache as a compiler cache (#3205)
> posix-stack: Reuse parent class _reuseport from child
> Merge 'reactor_backend: Fix another busy spin bug in the epoll backend' from Stephan Dollberg
tests: Add unit test for epoll busy spin bug
reactor_backend: Fix another busy spin bug in epoll
Closesscylladb/scylladb#28513
Previous implementation of Scylla lifecycle brought flakiness to the test.
This change leaves lifecycle management up to PythonTest.run_ctx,
which implements more stability logic for setup/teardown.
Replace pexpect-driven GDB interaction with GDB batch mode:
- Avoids DeprecationWarning: "This process is multi-threaded, use of forkpty()
may lead to deadlocks in the child.", which ultimately caused CI deadlocks.
- Removes timeout-driven flakiness on slow systems - no interactive waits/timeouts.
- Produces cleaner, more direct assertions around command execution and output.
- Trade-off: batch mode adds ~10s per command per test,
but with --dist=worksteal this is ~10% overall runtime increase across the suite.
Closesscylladb/scylladb#28484
After PR https://github.com/scylladb/scylladb/pull/28396 reduced
the test volumes to 20MiB to speed up test_out_of_space_prevention.py,
keeping the original 0.8 critical disk utilization threshold can make
the tests flaky: transient disk usage (e.g. commitlog segment churn)
can push the node into ENOSPC during the run.
These tests do not write much data, so reduce the critical disk
utilization threshold to 0.5. With 20MiB volumes this leaves ~10MiB
of headroom for temporary growth during the test.
Fixes: https://github.com/scylladb/scylladb/issues/28463Closesscylladb/scylladb#28593
test_maintenance_socket with new way of running is flaky. Looks like the
driver tries to reconnect with an old maintenance socket from previous
driver and fails. This PR adds white list for connection that stabilize
the test
test_no_removed_node_event_on_ip_change was flaky on CI, while the issue
never reproduced locally. The assumption that under load we have race
condition and trying to check the logs before message is arrived. Small
for loop to retry added to avoid such situation.
Closesscylladb/scylladb#28635
The test can currently fail like this:
```
> await cql.run_async(f"ALTER TABLE {ks}.test WITH tablets = {{'min_tablet_count': 1}}")
E cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.158.27.9:9042 datacenter1>: <Error from server: code=0000 [Server error] message="Failed to apply group 0 change due to concurrent modification">})
```
The following happens:
- node A is restarted and becomes the group0 leader,
- the driver sends the ALTER TABLE request to node B,
- the request hits group 0 concurrent modification error 10 times and fails
because node A performs tablet migrations at the the same time.
What is unexpected is that even though the driver session uses the default
retry policy, the driver doesn't retry the request on node A. The request
is guaranteed to succeed on node A because it's the only node adding group0
entries.
The driver doesn't retry the request on node A because of a missing
`wait_for_cql_and_get_hosts` call. We add it in this commit. We also reconnect
the driver just in case to prevent hitting scylladb/python-driver#295.
Moreover, we can revert the workaround from
4c9efc08d8, as the fix from this commit also
prevents DROP KEYSPACE failures.
The commit has been tested in byo with `_concurrent_ddl_retries{0}` to
verify that node A really can't hit group 0 concurrent modification error
and always receives the ALTER TABLE request from the driver. All 300 runs in
each build mode passed.
Fixes#25938Closesscylladb/scylladb#28632
When `test_autoretrain_dict` was originally written, the default
`sstable_compression_user_table_options` was `LZ4Compressor`. The
test assumed (correctly) that initially the compression doesn't use
a trained dictionary, and later in the test scenario, it changed
the algorithm to one with a dictionary.
However, the default `sstable_compression_user_table_options` is now
`LZ4WithDictsCompressor`, so the old assumption is no longer correct.
As a result, the assertion that data is initially not compressed well
may or may not fail depending on dictionary training timing.
To fix this, this commit explicitly sets `ZstdCompressor`
as the initial `sstable_compression_user_table_options`, ensuring that
the assumption that initial compression is without a dictionary
is always met.
Note: `ZstdCompressor` differs from the former default `LZ4Compressor`.
However, it's a better choice — the test aims to show the benefit of
using a dictionary, not the benefit of Zstd over LZ4 (and the test uses
ZstdWithDictsCompressor as the algorithm with the dictionary).
Fixes: https://github.com/scylladb/scylladb/issues/28204
Backport: 2025.4, as test already failed there (and also backport to 2026.1 to make everything consistent).
Closesscylladb/scylladb#28625
* github.com:scylladb/scylladb:
test: explicitly set compression algorithm in test_autoretrain_dict
test: remove unneeded semicolons from python test
Harden get_scylla_2025_1_executable() by improving error reporting when subprocesses fail,
increasing curl's retry count for more resilient downloads, and enabling --retry-all-errors to retry on all failures.
Fixes https://github.com/scylladb/scylladb/issues/27745
Backport: no, it's not a bug fix
Closesscylladb/scylladb#28628
* github.com:scylladb/scylladb:
test: pylib: retry on all errors in get_scylla_2025_1_executable curl's call
test: pylib: increase curl's number of retries when downloading scylla
test: pylib: improve error reporting in get_scylla_2025_1_executable
Previously, global_tablet_token_metadata_barrier() could proceed with
fencing even if some nodes did not acknowledge the barrier_and_drain.
This could cause problems:
* In scylladb/scylladb#26864, replica locks did not provide mutual
exclusion, because “fenced out” requests from old topology versions
could run in parallel with requests using newer versions.
* In scylladb/scylladb#26375, the barrier could succeed even though we
did not wait for closed sessions to become unused. This could leave
aborted repair or streaming tasks running concurrently after a tablet
transition was aborted, and thus running concurrently with the next
transition.
In this commit we add a parameter drain_all_nodes: bool to
the global_token_metadata_barrier function. If this parameter is set,
the barrier waits for all nodes to acknowledge the barrier_and_drain
round of RPCs. If any of the nodes are not accessible or throw an error,
such errors are rethrown to the caller. We set this parameter only in
global_tablet_token_metadata_barrier since for topology migrations
the old behavior should be preserved. In case of errors, the tablet
migration is blocked until the problem goes away by itself or the
problematic node is added to the ignore_nodes list.
The test_fenced_out_on_tablet_migration_while_handling_paxos_verb is
removed: with tablets, we now drain all nodes, so after a successful
barrier_and_drain round there can be no coordinators with an old
topology version. The fence_token check after executing a request on
a replica is therefore unnecessary for tablets, but still required for
vnodes, where topology changes do not wait for all nodes.
Topology fencing is covered by test_fence_lwt_during_bootstrap.
Fixesscylladb/scylladb#26864Fixesscylladb/scylladb#26375
Add explicit erm-holding variables in all replica-side RPC handlers.
This is required to ensure that tablet migration waits for in-flight
replica requests even if a non-replica coordinator has been fenced out.
Holding erms on the replica side may increase the global-barrier wait
time, since the barrier must drain these requests. We believe this
is acceptable because:
* We already hold erms during replica-side request execution, but in
an ad-hoc, non-systemic way in lower layers of storage_proxy
(e.g. in sp::mutate_locally and do_query_tablets).
* Replica requests are bounded by replica-side timeouts, so the
global-barrier wait time cannot exceed the maximum of these timeouts.
For Paxos verbs, we use token_metadata_guard, which wraps the ERM and
automatically refreshes it when tablet migration does not affect the
current token; see the token_metadata_guard comments for details.
We use this guard only for Paxos verbs because regular reads and writes
already hold raw erms in storage_proxy and on the coordinators.
The erms must be held in all RPC handlers that support fencing — that
is, those with a fencing_token parameter in storage_proxy.idl.
Counter updates already hold erms in
mutate_counter_on_leader_and_replicate.
Fix test_tablets2::test_timed_out_reader_after_cleanup: the tablets
barrier now waits for all nodes. As a result, the replica read
is expected to finish, rather than fail due to the tablet having
moved as it did previously. The test is renamed to
test_tablets_barrier_waits_for_replica_erms to better reflect its
purpose.
Refs scylladb/scylladb#26864
Before waiting on stale_versions_in_use(), we log the stale versions
the barrier_and_drain handler will wait for, along with the number of
token_metadata references representing each version.
To achieve this, we store a pointer to token_metadata in
version_tracker, traverse the _trackers list, and output all items
with a version smaller than the latest. Since token_metadata
contains the version_tracker instance, it is guaranteed to remain
alive during traversal. To count references, token_metadata now
inherits from enable_lw_shared_from_this.
This helps diagnose tablet migration stalls and allows more
deterministic tests: when a barrier is expected to block, we can
verify that the log contains the expected stale versions rather
than checking that the barrier_and_drain is blocked on
stale_versions_in_use() for a fixed amount of time.
One of the tests check that amount of the PK should be more than 2, but
the method that creates it can return table with less keys. This leads
to flakiness and to avoid it, this PR ensures that table will have at
least 3 PK
Closesscylladb/scylladb#28636
on_update_view() currently runs its serialized logic inline via with_semaphore()
from a detached callback path, while create/drop already use dedicated async
dispatchers.
Refactor update handling to follow the same pattern:
- add dispatch_update_view(sstring ks_name, sstring view_name)
- move update logic into that coroutine
- acquire the existing view-builder lock via get_or_adopt_view_builder_lock()
- keep existing behavior for missing base/view state
- keep background invocation semantics from on_update_view()
This aligns update/create/drop flow and keeps async lifecycle handling and a first step to fix shutdown issue.
The twcs compaction tests open more than 1024 files (not
so good), and will fail in a user session with the default
soft limit (1024).
Attempt to raise the limit so the tests pass. On a modern
systemd installation the hard limit is >500,000, so this
will work.
There's no problem in dbuild since it raises the file limit
globally.
Most likely, the root cause of the flaky test was that the TLS handshake hung for an extended period (60s). This caused
the test case to fail because the ANN request duration exceeded the test case timeout.
The PR introduces two changes:
* Mitigation of the hanging TLS handshake: This issue likely occurred because the test performed certificate rewrites
simultaneously with ANN requests that utilize those certificates.
* Production code fix: This addresses a bug where the TLS handshake itself was not covered by the connection timeout.
Since tls::connect does not perform the handshake immediately, the handshake only occurs during the first write
operation, potentially bypassing connect timeout.
Fixes: #28012
Backport to 2026.01 and 2025.04 is needed, as these branches are also affected and may experience CI flakiness due to this test.
Closesscylladb/scylladb#28617
* github.com:scylladb/scylladb:
vector_search: Fix missing timeout on TLS handshake
vector_search: test: Fix flaky cert rewrite test
Extract the commonly used `open_flags::wo | open_flags::create |
open_flags::exclusive` into a reusable constant
`sstable_write_open_flags` to reduce duplication.
Introduce new methods to write SSTable components while calculating
and returning their CRC32 checksums. This adds:
- make_digests_component_file_writer(): creates a crc32_digest_file_writer
for component writing with checksum tracking
- write_simple_with_digest() and do_write_simple_with_digest(): write
components and return the full checksum value
Introduce template parameter to checksummed file writer to support
digest-only calculation without storing chunk checksums.
This will be needed for future to calculate digest of other components.
CI currently fails in release and debug modes if the PR only changes
a test run only in dev mode. There is no reason to wait for the CI fix,
as there is no reason to run this test only in dev mode in the first
place. The test is very fast.
The test was marked with xfail in #28383, as it needed to be updated to
work with the Raft-based topology. We are doing that in this patch.
With the Raft-based topology, there is no reason to check that nodes with
different group0 IDs cannot merge their topology/token_metadata. That is
clearly impossible, as doing any topology change requires being in the
same group0. So, the original regression test doesn't make sense.
We can still test that nodes with different group0 IDs cannot gossip with
each other, so we keep the test. It's very fast anyway.
It's difficult to say if our download backend would always return
transient error correctly so that the curl could retry. Instead it's
more robust to always retry on error.
By default curl does exponential backoff, and we want to keep that
but there is time cap of 10 minutes, so with 40 retries we'd wait
long time, instead we set the cap to 60 seconds.
Total waiting time (excluding receiving request time):
before - 17m
after - 35m
When `test_autoretrain_dict` was originally written, the default
`sstable_compression_user_table_options` was `LZ4Compressor`. The
test assumed (correctly) that initially the compression doesn't use
a trained dictionary, and later in the test scenario, it changed
the algorithm to one with a dictionary.
However, the default `sstable_compression_user_table_options` is now
`LZ4WithDictsCompressor`, so the old assumption is no longer correct.
As a result, the assertion that data is initially not compressed well
may or may not fail depending on dictionary training timing.
To fix this, this commit explicitly sets `ZstdCompressor`
as the initial `sstable_compression_user_table_options`, ensuring that
the assumption that initial compression is without a dictionary
is always met.
Note: `ZstdCompressor` differs from the former default `LZ4Compressor`.
However, it's a better choice — the test aims to show the benefit of
using a dictionary, not the benefit of Zstd over LZ4 (and the test uses
ZstdWithDictsCompressor as the algorithm with the dictionary).
Fixes: scylladb/scylladb#28204
Some assertions in the Raft-based topology are likely to cause crashes of
multiple nodes due to the consistent nature of the Raft-based code. If the
failing assertion is executed in the code run by each follower (e.g., the code
reloading the in-memory topology state machine), then all nodes can crash. If
the failing assertion is executed only by the leader (e.g., the topology
coordinator fiber), then multiple consecutive group0 leaders will chain-crash
until there is no group0 majority.
Crashing multiple nodes is much more severe than necessary. It's enough to
prevent the topology state machine from making more progress. This will
naturally happen after throwing a runtime error. The problematic fiber will be
killed or will keep failing in a loop. Note that it should be safe to block
the topology state machine, but not the whole group0, as the topology state
machine is mostly isolated from the rest of group0.
We replace some occurrences of `on_fatal_internal_error` and `SCYLLA_ASSERT`
with `on_internal_error`. These are not all occurrences, as some fatal
assertions make sense, for example, in the bootstrap procedure.
This patch changes the load balancing simulator so that it computes
table load based on tablet sizes instead of tablet count.
best_shard_overcommit measured minimal allowed overcommit in cases
where the number of tablets can not be evenly distributed across
all the available shards. This is still the case, but instead of
computing it as an integer div_ceil() of the average shard load,
it is now computed by allocating the tablet sizes using the
largest-tablet-first method. From these, we can get the lowest
overcommit for the given set of nodes, shards and tablet sizes.
This change adds a random tablet size generator. The tablet sizes are
created in load_stats.
Further changes to the load balance simulator:
- apply_plan() updates the load_stats after a migration plan is issued by the
load balancer,
- adds the option to set a command line option which controls the tablet size
deviation factor.
With size based load balancing, we will have to move the tablet size in
load_stats after each internode migration issued by balance_tablets().
This will be done in a subsequent commit in apply_plan() which is
called from rebalance_tablets().
Currently, rebalance_tablets() is passed a load_stats_ptr which is
defined as:
using load_stats_ptr = lw_shared_ptr<const load_stats>;
Because this is a pointer to const, apply_plan() can't modify it.
So, we pass a reference to load_stats to rebalance_tablets() and create
a load_stats_ptr from it for each call to balance_tablets().
test_remove_node_violating_rf_rack_with_rack_list creates a cluster
with four nodes. One of the nodes is excluded, then another one is
stopped, excluded, and removed. If the two stopped nodes were both
voters, the majority is lost and the cluster loses its raft leader.
As a result, the node cannot be removed and the operation times out.
Add the 5th node to the cluster. This way the majority is always up.
Fixes: https://github.com/scylladb/scylladb/issues/28596.
Closesscylladb/scylladb#28610
The test creates a single node cluster, then creates 3 tables which
remain empty. Then it adds another node with half the disk capacity of
the first one, and then it waits for the balancer to migrate tablets to
the newly added node by calling the quiesce topology API. The number of
tablets on the smaller node should be exactly half the number of tablets
on the larger node.
After waiting for quiesce topology, we could have a situation where we
query the number of tablets from the node which still hasn't processed
the last tablet migrations and updated system.tablets.
This patch adds a read barrier so that both nodes see the same tablets
metadata before we query the number of tablets.
Fixes: SCYLLADB-603
Closesscylladb/scylladb#28598
Currently the TLS handshake in the vector search client does not have a timeout.
This is because tls::connect does not perform handshake itself; the handshake
is deferred until the first read/write operation is performed. This can lead to long
hangs on ANN requests.
This commit calls tls::check_session_is_resumed() after tls::connect
to force the handshake to happen immediately and to run under with_timeout.
The test is flaky most likely because when TLS certificate rewrite
happens simultaneously with an ANN request, the handshake can hang for a
long time (~60s). This leads to a timeout in the test case.
This change introduces a checkpoint in the test so that it will
wait for the certificate rewrite to happen before sending an ANN request,
which should prevent the handshake from hanging and make the test more reliable.
Fixes: #28012
The test `test_sync_point` had a few shortcomings that made it flaky
or simply wrong:
1. We were verifying that hints were written by checking the size of
in-flight hints. However, that could potentially lead to problems
in rare situations.
For instance, if all of the hints failed to be written to disk, the
size of in-flight hints would drop to zero, but creating a sync point
would correspond to the empty state.
In such a situation, we should fail immediately and indicate what
the cause was.
2. A sync point corresponds to the hints that have already been written
to disk. The number of those is tracked by the metric `written`.
It's a much more reliable way to make sure that hints have been
written to the commitlog. That ensures that the sync point we'll
create will really correspond to those hints.
3. The auxiliary function `wait_for` used in the test works like this:
it executes the passed callback and looks at the result. If it's
`None`, it retries it. Otherwise, the callback is deemed to have
finished its execution and no further retries will be attempted.
Before this commit, we simply returned a bool, and so the code was
wrong. We improve it.
---
Note that this fixesscylladb/scylladb#28203, which was a manifestation
of scylladb/scylladb#25879. We created a sync point that corresponded
to the empty state, and so it immediately resolved, even when node 3
was still dead.
As a bonus, we rewrite the auxiliary code responsible for fetching
metrics and manipulating sync points. Now it's asynchronous and
uses the existing standard mechanisms available to developers.
Furthermore, we reduce the time needed for executing
`test_sync_point` by 27 seconds.
---
The total difference in time needed to execute the whole test file
(on my local machine, in dev mode):
Before:
CPU utilization: 0.9%
real 2m7.811s
user 0m25.446s
sys 0m16.733s
After:
CPU utilization: 1.1%
real 1m40.288s
user 0m25.218s
sys 0m16.566s
---
Refs scylladb/scylladb#25879Fixesscylladb/scylladb#28203
Backport: This improves the stability of our CI, so let's
backport it to all supported versions.
Closesscylladb/scylladb#28602
* github.com:scylladb/scylladb:
test: cluster: Reduce wait time in test_sync_point
test: cluster: Fix test_sync_point
test: cluster: Await sync points asynchronously
test: cluster: Create sync points asynchronously
test: cluster: Fetch hint metrics asynchronously
The ANN vector queries with all-zero vectors are allowed even on vector
indexes with similarity function set to cosine.
When enabling the rescoring option, those queries would fail as the rescoring
calls `similarity_cosine` function underneath, causing an `InvalidRequest` exception
as all-zero vectors were not allowed matching Cassandra's behaviour.
To eliminate the discrepancy we want the all-zero vector `similarity_cosine` calls to pass,
but return the NaN as the cosine similarity for zero vectors is mathematically incorrect.
We decided not to use arbitrary values contrary to USearch, for which the distance
(not to be confused with similarity) is defined as cos(0, 0) = 0, cos(0, x) = 1 while
supporting the range of values [0, 2].
If we wanted to convert that to similarity, that would mean sim_cos(0, x) = 0.5,
which does not support mathematical reasoning why that would be more similar than
for example vectors marking obtuse angles.
It's safe to assume that all-zero vectors for cosine similarity shouldn't make any impact,
therefore we return NaN and eliminate them from best results.
Adjusted the tests accordingly to check both proper Cassandra and Scylla's behaviour.
Fixes: SCYLLADB-456
Commit ea8a661119 tried to reduce the dataset for restoration tests.
While doing it effectively disabled part of itself -- the checks for
streaming directions were never ran after this change. The thing is that
this check only runs if restored tablet count matches some hardcoded one
of 512. This was the real dataset size of the test before the
aforementioned commit, but after it it had changed to over values, and
the comparison with 512 became always False.
Fix it with a local variable to prevent such mistakes in the future.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This method does two things -- checks that the data is indeed back, and
validates streaming directions. The latter is not quite about "data is
back", so better to have it as explicit dedicated method.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If everything is OK, the sync point will not resolve with node 3 dead.
As a result, the waiting will use all of the time we allocate for it,
i.e. 30 seconds. That's a lot of time.
There's no easy way to verify that the sync point will NOT resolve, but
let's at least reduce the waiting to 3 seconds. If there's a bug, it
should be enough to trigger it at some point, while reducing the average
time needed for CI.
The test had a few shortcomings that made it flaky or simply wrong:
1. We were verifying that hints were written by checking the size of
in-flight hints. However, that could potentially lead to problems
in rare situations.
For instance, if all of the hints failed to be written to disk, the
size of in-flight hints would drop to zero, but creating a sync point
would correspond to the empty state.
In such a situation, we should fail immediately and indicate what
the cause was.
2. A sync point corresponds to the hints that have already been written
to disk. The number of those is tracked by the metric `written`.
It's a much more reliable way to make sure that hints have been
written to the commitlog. That ensures that the sync point we'll
create will really correspond to those hints.
3. The auxiliary function `wait_for` used in the test works like this:
it executes the passed callback and looks at the result. If it's
`None`, it retries it. Otherwise, the callback is deemed to have
finished its execution and no further retries will be attempted.
Before this commit, we simply returned a bool, and so the code was
wrong. We improve it.
Note that this fixesscylladb/scylladb#28203, which was a manifestation
of scylladb/scylladb#25879. We created a sync point that corresponded
to the empty state, and so it immediately resolved, even when node 3
was still dead.
Refs scylladb/scylladb#25879Fixesscylladb/scylladb#28203
To create a keyspace theres new_test_keyspace helper
Table is created with a single cql.run_async with explicit schema
Dataset is populated with a single parallel INSERT as well
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is preparational patch. Next will need to replace
foo()
bar()
with
with something() as s:
foo()
bar()
Effectively -- only add the `with something()` line. Not to shift the
whole file right together with that future change, do it here.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The previous calculation could produce more than 10,000 parts for large
uploads because we mixed values in bytes and MiB when determining the
part size. This could result in selecting a part size that still
exceeded the AWS multipart upload limit. The updated logic now ensures
the number of parts never exceeds the allowed maximum.
This change also aligns the implementation with the code comment: we
prefer a 50 MiB part size because it provides the best performance, and
we use it whenever it fits within the 10,000-part limit. If it does not,
we increase the part size (in bytes, aligned to MiB) to stay within the
limit.
Add a schema_builder::with_sharder() overload that accepts a const
reference to dht::static_sharder. This allows schemas to use custom
sharder instances instead of only static sharder configurations.
This is needed to support tables that use custom partitioning and
sharding strategies, such as the incoming raft metadata tables for
strongly consistent tables.
Generalize error handling by creating exception dispatcher which allows to write error handlers by sequentially applying handlers the same way one would write `catch ()` blocks
The method is called from storage_proxy::mutate_hint() which is in turn
called from hint_mutation::apply_locally(). The latter is either called
from directly by hint sender, which already runs in streaming group, or
via RPC HINT_MUTATION handler which uses index 1 that negotiates streaming
group as well.
To be sure, add a debugging check for current group being the expected one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently sender only switches group for hints sending on start. It's
worth doing the same on stop too for consistency. There's nothing to
compete with at this point.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similarly to previous patch, the handler can stream the map of build
statuses. Unlike previous patch, it doesn't need to fmt::format() key
and value, as these are strings already.
It could be a map_to_json<string, string> partial specialization, but
there's so far only one caller, so probably not worth it yet.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Like get_token_endpoint one streams the map that it got from storage
service, the get_ownership and get_effective_ownership can do the same.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The get_token_endpoint handler converts iterator of std::map into
generated maplist_mapper type. Next patch will do the same for more
handlers, so it's good to have a helper converter for it.
As a nice side effect, it's possible to avoid multiline lambda argument
to stream_range_as_array().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This reverts commit bcd1758911, reversing
changes made to b2c2a99741.
There is a design decision to not introduce additional test
orchestration tool for scylladb.git (see comments for #27499). One
commit has already been reverted in 55c7bc7. Last CI runs made validator
test flaky, so it is a time to remove all remaining validator tests.
It needs a backport to 2026.1 to remove remaining validator tests from there.
Fixes: VECTOR-497
Closesscylladb/scylladb#28568
When running a gdb command, we check that the string 'Error'
does not appear within the output. However, if the command output
includes the string 'Error' as part of its normal operation, this
generates a false positive. In fact the task_histogram can include
the string 'error::Error' from the Rust core::error module.
Allow for that and only match 'Error' that isn't 'error::Error'.
Fixes#28516.
Closesscylladb/scylladb#28574
The difference is very tiny:
@@ -1,12 +1,12 @@
@pytest.mark.asyncio
async def test_restore_primary_replica_same_...(manager: ManagerClient, object_storage):
''' comment '''
- topology = topo(rf = 4, nodes = 8, racks = 2, dcs = 1)
- scope = "rack"
+ topology = topo(rf = 4, nodes = 8, racks = 2, dcs = 2)
+ scope = "dc"
ks = 'ks'
cf = 'cf'
@@ -42,7 +42,7 @@ async def test_restore_primary_replica_s
for r in res:
nodes_by_operation[r[1].group(1)].append(r[1].group(2))
- scope_nodes = set([ str(host_ids[s.server_id]) for s in servers if s.rack == servers[i].rack ])
+ scope_nodes = set([ str(host_ids[s.server_id]) for s in servers if s.datacenter == servers[i].datacenter ])
for op, nodes in nodes_by_operation.items():
logger.info(f'Operation {op} streamed to nodes {nodes}')
assert len(nodes) == 1, "Each streaming operation should stream to exactly one primary replica"
The (removed in the above example) test description comments differ only
in their usage of "rack" and "dc" words.
Squashing them into one parametrized test makes perfect sense.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Tablet migration keeps sstable snapshot during streaming, which may
cause temporary increase in disk utilization if compaction is running
concurrently. SStables compacted away are kept on disk until streaming
is done with them. The more tablets we allow to migrate concurrently,
the higher disk space can rise. When the target tablet size is
configured correcly, every tablet should own about 1% of disk
space. So concurrency of 4 shouldn't put us at risk. But target tablet
size is not chosen dynamically yet, and it may not be aligned with
disk capacity.
Also, tablet sizes can temporary grow above the target, up to 2x
before the split starts, and some more because splits take a while to
complete.
The reduce impact from this, reduce concurrency of
migation. Concurrency of 2 should still be enough to saturate
resources on the leaving shard.
Also, reducing concurrency means that load balancing is more
responsive to preemption. There will be less bandwidth sharing, so
scheduled migrations complete faster. This is important for scale-out,
where we bootstrap a node and want to start migrations to that new
node as soon as possible.
Refs scylladb/siren#15317
Different transitions have different weights, and limits are
configurable. We don't want a situation where a high-cost migration
is cut off by limits and the system can make no progress.
For example, repair uses weight 2 for read concurrency. Migrating
co-located tablets scales the cost by the number of co-located
tablets.
The test `test_size_based_load_balancing.py::test_balance_empty_tablets`
waits for tablet load stats to be refreshed and uses the
`short_tablet_stats_refresh_interval` injection to speed up the refresh
interval.
This injection has no effect; it was replaced by the
`tablet_load_stats_refresh_interval_in_seconds` config option (patch: 1d6808aec4),
so the test currently waits for 60 seconds (default refresh interval).
Use the config option. This reduces the execution time to ~8 seconds.
Fixes SCYLLADB-556.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closesscylladb/scylladb#28536
This field is only used to initialize the following _memtable_controller
one. It's simpler just to do the initialization with whatever value the
field itself is initialized and drop the field itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28539
This patch fixes the few remaining cases of XPASS in test/cqlpy and test/alternator.
These are tests which, when written, reproduced a bug and therefore were marked "xfail", but some time later the bug was fixed and we either did not notice it was ever fixed, or just forgot to remove the xfail marker.
Removing the no-longer-needed xfail markers is good for test hygiene, but more importantly is needed to avoid regressions in those already-fixed areas (if a test is already marked xfail, it can start to fail in a new way and we wouldn't notice).
Backport not needed, xpass doesn't bother anyone.
Closesscylladb/scylladb#28441
* github.com:scylladb/scylladb:
test/cqlpy: remove xfail from tests for fixed issue 7972
test/cqlpy: remove xfail from tests for fixed issue 10358
test/cqlpy: remove xfail from passing test testInvalidNonFrozenUDTRelation
test/alternator: remove xfail from passing test_update_item_increases_metrics_for_new_item_size_only
The goal of this small pull request is to reproduce issue #28439, which found a bug in the Alternator Streams output when BatchWriteItem is called to write multiple items in the same partition, and always_use_lwt write isolation mode is used.
* The first patch reproduces this specific bug in Alternator Streams.
* The second patch adds missing (Fixes#28171) tests for BatchWriteItem in different write modes, and shows that BatchWriteItem itself works correctly - the bug is just in Alternator Streams' reporting of this write.
Closesscylladb/scylladb#28528
* github.com:scylladb/scylladb:
test/alternator: add test for BatchWriteItem with different write isolations
test/alternator: reproducer for Alternator Streams bug
It turns out that the cdc driver requires permissions to two additional system tables. This patch adds them to VECTOR_SEARCH_INDEXING and modifies the unit tests. The integration with vector store was tested manually, integration tests will be added in vector-store repository in a follow up PR.
Fixes: SCYLLADB-522
Closesscylladb/scylladb#28519
Alternator's various write operations have different code paths for the
different write isolation modes. Because most of the test suite runs in
only a single write mode (currently - only_rmw_uses_lwt), we already
introduced a test file test/alternator/test_write_isolation.py for
checking the different write operations in *all* four write isolation
modes.
But we missed testing one write operation - BatchWriteItem. This
operation isn't very "interesting" because it doesn't support *any*
read-modify-option option (it doesn't support UpdateExpression,
ConditionExpression or ReturnValues), but even without those, the
pure write code still has different code paths with and without LWT,
and should be tested. So we add the missing test here - and it passes.
In issue #28439 we discovered a bug that can be seen in Alternator
Streams in the case of BatchWriteItem with multiple writes to the
same partition and always_use_lwt mode. The fact that the test added
here passes shows that the bug is NOT in BatchWriteItem itself, which
works correctly in this case - but only in the Alternator Streams layer.
Fixes#28171
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a reproducer for an Alternator Streams bug described in
issue #28439, where the stream returns the wrong events (and fewer of
them) in the following specific combination of the following circumstances:
1. A BatchWriteItem operation writing multiple items to the *same*
partition.
2. The "always_use_lwt" write isolation mode is used. (the bug doesn't
occur in other write isolation modes).
We didn't catch this bug earlier because the Alternator Streams test
we had for BatchWriteItem had multiple items in multiple partitions,
and we missed the multiple-items-in-one-partition case. Moreover,
today we run all the tests in only_rmw_uses_lwt mode (in the past,
we did use always_use_lwt, but changed recently in commit e7257b1393
following commit 76a766c that changed test.py).
As issue #28439 explains, the underlying cause of the bug is that the
always_use_lwt causes the multiple items to be written with the same
timestamp, which confused the Alternator Streams code reading the CDC
log. The bug is not in BatchWriteItem itself, or in ScyllaDB CDC, but
just in the Alternator Streams layer.
The test in this patch is parameterized to run on each of the four
write isolation modes, and currently fails (and so marked xfail) just
for the one mode 'always_use_lwt'. The test is scylla_only, as its
purpose is to checks the different write isolation mode - which don't
exist in AWS DynamoDB.
Refs #28439
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Improve events printing, when test in test_streams.py failed.
New code will print both expected and received events (keys, previous
image, new image and type).
New code will explicitly mark, at which output event comparison failed.
Fixes#28455Closesscylladb/scylladb#28476
The usual Scylla shutdown in a cluster test takes ~2.1s. 2s come from
```
co_await sleep(std::chrono::milliseconds(_gcfg.shutdown_announce_ms));
```
as the default value of `shutdown_announce_in_ms` is 2000. This sleep
makes every `server_stop_gracefully` call 2s slower. There are ~300 such
calls in cluster tests (note that some come from `rolling_restart`). So,
it looks like this sleep makes cluster tests 300 * 2s = 10min slower.
Indeed, `./test.py --mode=dev cluster` takes 61min instead of 71min
on the potwor machine (the one in the Warsaw office) without it.
We set `shutdown_announce_in_ms` to 0 for all cluster tests to make them
faster.
The sleep is completely unnecessary in tests. Removing it could introduce
flakiness, but if that's the case, then the test for which it happens is
incorrect in the first place. Tests shouldn't assume that all nodes
receive and handle the shutdown message in 2s. They should use functions
like `server_not_sees_other_server` instead, which are faster and more
reliable.
Improvement of the tests running time, so no backport. The fix of
`test_tablets_parallel_decommission` may have to be backported to
2026.1, but it can be done manually.
Closesscylladb/scylladb#28464
* github.com:scylladb/scylladb:
test: pylib: scylla_cluster: set shutdown_announce_in_ms to 0
test: test_tablets_parallel_decommission: prevent group0 majority loss
test: delete test_service_levels_work_during_recovery
The handler appeared back in c9e710dca3. In this commit it performed the
"core" part of the task -- the do_build_range() method -- inside the
streaming sched group. The setup code looks seemingly was copied from the
view_builder::do_build_step() method and got the explicit switch of the
scheduling group.
The switch looks both -- justified and not. On one hand, it makes it
explict that the activity runs in the streaming scheduling group. On the
other hand, the verb already uses RPC index on 1, which is negotiated to
be run in streaming group anyway. On the "third hand", even though being
explicit the switch happens too late, as there exists a lot of other
activities performed by the handler that seems to also belong to the
same scheduling group, but which is not switched into explicitly.
By and large, it seems better to avoid the explicit switch and rely on
the RPC-level negotiation-based sched group switching.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28397
This is the continuation of #28363 , this time about getting gossiper scheduling group via database.
Several places that do it already have gossiper at hand and should better get the group from it.
Eventually, this will allow to get rid of database::get_gossip_scheduling_group().
Refining inter-components API, not backporting
Closesscylladb/scylladb#28412
* github.com:scylladb/scylladb:
gossiper: Export its scheduling group for those who need it
migration_manager: Reorder members
Recently we had a question whether key columns can have any supported
type. I knew that actually - they can't, that key columns can have only
the types S(tring), B(inary) or N(umber), and that is all. But it turns
out we never had a test that confirms this understanding is true.
We did have a test for it for GSI key types already,
test_gsi.py::test_gsi_invalid_key_types, but we didn't have one for the
base table. So in this patch we add this missing test, and confirm that,
indeed, both DynamoDB and Alternator refuse a key attribute with any
type other than S, B or N.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#28479
Current way is checking only fail during the test phase, and it will
miss the cases when fail happens on another phase. This PR eliminate
this, so every phase will have modified node reporter to enrich the
JUnit XML report with custom attribute function_path.
Closesscylladb/scylladb#28462
Current way is always assumed that the error happened in the test file,
but that not always true. This PR will show the error from the boost
logger where actually error is happened.
Closesscylladb/scylladb#28429
1. fmt::localtime is deprecated.
2. We should really print times in UTC, especially on the cloud.
3. The current log message does not print the timezone so it'd unclear
to anyone reading the lof message if the expiration time is in the
local timezone or in GMT/UTC.
Fixes the following warning:
```
gms/gossiper.cc:2428:28: warning: 'localtime' is deprecated [-Wdeprecated-declarations]
2428 | endpoint, fmt::localtime(clk::to_time_t(expire_time)), expire_time.time_since_epoch().count(),
| ^
/usr/include/fmt/chrono.h:538:1: note: 'localtime' has been explicitly marked deprecated here
538 | FMT_DEPRECATED inline auto localtime(std::time_t time) -> std::tm {
| ^
/usr/include/fmt/base.h:207:28: note: expanded from macro 'FMT_DEPRECATED'
207 | # define FMT_DEPRECATED [[deprecated]]
| ^
```
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#28434
Some storage_service rpc verbs may checks that a handler is executed
inside gossiper scheduling group. For that, the expected group is
grabbed from database.
This patch puts the gossiper sched group into debug namespace and makes
this check use it from there. It removes one more place that uses
database as config provider.
Refs #28410
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28427
The test test_to_json_double used to fail due to #7972, but this issue
was already fixed in Scylla 5.1 and we didn't notice.
So remove the xfail marker from this test, and also update another test
which still xfails but no longer due to this issue.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The tests testWithUnsetValues and testFilteringWithoutIndices used to fail
due to #10358, but this issue was already fixed three years ago, when the
UNSET-checking code was cleaned up, and the test is now passing.
So remove the xfail marker from these tests.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The test testInvalidNonFrozenUDTRelation used to fail due to #10632
(an incorrectly-printed column name in an error message) and was marked
"xfail". But this issue has already been fixed two years ago, and
the test is now passing. So remove the xfail marker.
The test test_metrics.py::test_update_item_increases_metrics_for_new_item_size_only
tests whether the Alternator metrics report the exactly-DynamoDB-compatible
WCU number. It is parameterized with two cases - one that uses
alternator_force_read_before_write and one which doesn't.
The case that uses alternator_force_read_before_write is expected to
measure the "accurate" WCU, and currently it doesn't, so the test
rightly xfails.
But the case that doesn't use alternator_force_read_before_write is not
expected to measure the "accurate" WCU and has a different expectation,
so this case actually passes. But because the entire test is marked
xfail, it is reported as "XPASS" - unexpected pass.
Fix this by marking only the "True" case with xfail, while the "False"
case is not marked. After this pass, the True case continues to XFAIL
and the False case passes normally, instead of XPASS.
Also removed a sentence promising that the failing case will be solved
"by the next PR". Clearly this didn't happen. Maybe we even have such
a PR open (?), but it won't the "the next PR" even if merged today.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Paxos state tables are internal tables fully managed by Scylla
and they shouldn't be exposed to the user nor they shouldn't be backed up.
This commit hides those kind of tables from all listings and if such table
is directly described with `DESC ks."tbl$paxos"`, the description is generated
withing a comment and a note for the user is added.
Fixes https://github.com/scylladb/scylladb/issues/28183
LWT on tablets and paxos state tables are present in 2025.4, so the patch should be backported to this version.
Closesscylladb/scylladb#28230
* github.com:scylladb/scylladb:
test/cqlpy: add reproducer for hidden Paxos table being shown by DESC
cql3/statements/describe_statement: hide paxos state tables
Copilot found these typos in comments and variable name in alternator/,
so might as well fix them.
There are no functional changes in this patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#28447
This patch series copies `guardrails_test.py` from scylla-dtest, fix it and enables it.
The motivation is to unify the test execution of guardrails test, as some tests (`cqlpy/test_guardrail_...`) were already in scylladb repo, and some were in `scylla-dtest`.
Fixes: SCYLLADB-255
No backport, just test migration
Closesscylladb/scylladb#28454
* github.com:scylladb/scylladb:
test: refactor test_all_rf_limits in guardrails_test.py
test: specify exceptions being caught in guardrails_test.py
test: enable guardrails_test.py
test: add wait_other_notice to test_default_rf in guardrails_test.py
test: copy guardrails_test.py from scylla-dtest
Next Fedora will likely not have toxiproxy packaged [1]. Adapt
by installing it directly. To avoid changing the current toolchain,
add a ./install-dependencies --future option. This will allow us
to easily go back to the packages if the Fedora bug is fixed.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=2426954Closesscylladb/scylladb#28444
Modern toxiproxy interprets `-h` as help and requires the subcommand
subject (e.g. the proxy name) to be after the subcommand switches.
Arrange the command line in the way it likes, and spell out the
subcommands to be more comprehensible.
Closesscylladb/scylladb#28442
related PR: https://github.com/scylladb/scylladb/pull/27527
This PR changes test.py logic of parsing boost test cases to use -- --list_json_content
and pass boost labels as pytests markers
using -- --list_json_content is not ideal and currenly require to implement severall [workarounds](https://github.com/scylladb/scylladb/pull/27527#issuecomment-3765499812), but having the ability to support boost labels in pytest is worth it. because now we can apply the tiering mechanism for the boost tests as well
Fixes SCYLLADB-246
Closesscylladb/scylladb#28232
* github.com:scylladb/scylladb:
test: add nightly label
test.py: support boost labels in test.py
Hints destined for some other node can only be drained after the other node is no longer a replica of any vnode or tablet. In case when tablets are present, a node might still technically be a replica of some tablets after it moved to left state. When it no longer is a replica of any tablet, it becomes "released" and storage service generates a notification about it. Hinted handoff listens to this notification and kicks off draining hints after getting it.
The current implementation of the "released" notification would trigger every time raft topology state is reloaded and a left node without any tokens is present in the raft topology. Although draining hints is idempotent, generating duplicate notifications is wasteful and recently became very noisy after in 44de563 verbosity of the draining-related log messages have been increased. The verbosity increase itself makes sense as draining is supposed to be a rare operation, but the duplicate notification bug now needs to be addressed.
Fix the duplicate notification problem by passing the list of previously released nodes to the `storage_service::raft_topology_update_ip` function and filtering based on it. If this function processes the topology state for the first time, it will not produce any notifications. This is fine as hinted handoff is prepared to detect "released" nodes during the startup sequence in main.cc and start draining the hints there, if needed.
Fixes: scylladb/scylladb#28301
Refs: scylladb/scylladb#25031
The log messages added in 44de563 cause a lot of noise during topology operations and tablet migrations, so the fix should be backported to all affected versions (2025.4 and 2026.1).
Closesscylladb/scylladb#28367
* github.com:scylladb/scylladb:
storage_service: fix indentation after previous patch
raft topology: generate notification about released nodes only once
raft topology: extract "released" nodes calculation to external function
We currently make the local node the only token owner (that owns the
whole ring) in maintenance mode, but we don't update the topology properly.
The node is present in the topology, but in the `none` state. That's how
it's inserted by `tm.get_topology().set_host_id_cfg(host_id);` in
`scylla_main`. As a result, the node started in maintenance mode crashes
in the following way in the presence of a vnodes-based keyspace with the
NetworkTopologyStrategy:
```
scylla: locator/network_topology_strategy.cc:207:
locator::natural_endpoints_tracker::natural_endpoints_tracker(
const token_metadata &, const network_topology_strategy::dc_rep_factor_map &):
Assertion `!_token_owners.empty() && !_racks.empty()' failed.
```
Both `_token_owners` and `_racks` are empty. The reason is that
`_tm.get_datacenter_token_owners()` and
`_tm.get_datacenter_racks_token_owners()` called above filter out nodes
in the `none` state.
This bug basically made maintenance mode unusable in customer clusters.
We fix it by changing the node state to `normal`.
We also extend `test_maintenance_mode` to provide a reproducer for
Fixes#27988
This PR must be backported to all branches, as maintenance mode is
currently unusable everywhere.
Closesscylladb/scylladb#28322
* github.com:scylladb/scylladb:
test: test_maintenance_mode: enable maintenance mode properly
test: test_maintenance_mode: shutdown cluster connections
test: test_maintenance_mode: run with different keyspace options
test: test_maintenance_mode: check that group0 is disabled by creating a keyspace
test: test_maintenance_mode: get rid of the conditional skip
test: test_maintenance_mode: remove the redundant value from the query result
storage_proxy: skip validate_read_replica in maintenance mode
storage_service: set up topology properly in maintenance mode
Before this commit, `test_all_rf_limits` was implemented in a
repetitive manner, making it harder to understand how the guardrails
were tested. This commit refactors the test to reduce code redundancy
and verify the guardrails more explicitly.
Before this commit, the test caught a broad `Exception`. This change
specifies the expected exceptions to avoid a situation where the product
or test is broken and it goes undetected.
The usual Scylla shutdown in a cluster test takes ~2.1s. 2s come from
```
co_await sleep(std::chrono::milliseconds(_gcfg.shutdown_announce_ms));
```
as the default value of `shutdown_announce_in_ms` is 2000. This sleep
makes every `server_stop_gracefully` call 2s slower. There are ~300 such
calls in cluster tests (note that some come from `rolling_restart`). So,
it looks like this sleep makes cluster tests 300 * 2s = 10min slower.
Indeed, `./test.py --mode=dev cluster` takes 61min instead of 71min
on the potwor machine (the one in the Warsaw office) without it.
We set `shutdown_announce_in_ms` to 0 for all cluster tests to make them
faster.
The sleep is completely unnecessary in tests. Removing it could introduce
flakiness, but if that's the case, then the test for which it happens is
incorrect in the first place. Tests shouldn't assume that all nodes
receive and handle the shutdown message in 2s. They should use functions
like `server_not_sees_other_server` instead, which are faster and more
reliable.
Both of the changed test cases stop two out of four nodes when there are
three group0 voters in the cluster. If one of the two live nodes is
a non-voter (node 1, specifically, as node 0 is the leader), a temporary
majority loss occurs, which can cause the following operations to fail.
In the case of `test_tablets_are_rebuilt_in_parallel`, the `exclude_node`
API can fail. In the case of `test_remove_is_canceled_if_there_is_node_down`,
removenode can fail with an unexpected error message:
```
"service::raft_operation_timeout_error (group
[46dd9cf1-fe21-11f0-baa0-03429f562ff5] raft operation [read_barrier] timed out)"
```
Somehow, these test cases are currently not flaky, but they become flaky in
the following commit.
We can consider backporting this commit to 2026.1 to prevent flakiness.
The test becomes flaky in one of the following commits. However, there is
no need to fix it, as we should delete it anyway. We are in the process of
removing the gossip-based topology from the code base, which includes the
recovery mode. We don't have to rewrite the test to use the new Raft-based
recovery procedure, as there is nothing interesting to test (no regression
to legacy service levels).
add nightly label for test
test_foreign_reader_as_mutation_source
as an example of usinf boost labels pytest as markers
command to test :
./tools/toolchain/dbuild pytest --test-py-init --collect-only -q -m=nightly test/boost
output:
boost/mutation_reader_test.cc::test_foreign_reader_as_mutation_source.debug.1
boost/mutation_reader_test.cc::test_foreign_reader_as_mutation_source.release.1
boost/mutation_reader_test.cc::test_foreign_reader_as_mutation_source.dev.1
The patch marks force-gossip-topology-changes as deprecated and removes
tests that use it. There is one test (test_different_group0_ids) which
is marked as xfail instead since it looks like gossiper mode was used
there as a way to easily achieve a certain state, so more investigation
is needed if the tests can be fixed to use raft mode instead.
Closesscylladb/scylladb#28383
This reverts commit 7bf7ff785a. The commit
tried to add clean shutdown to `scylla perf` paths, but forgot at least
`scylla perf-alternator --workload wr` which now crashes on uninitialized
`c.as`.
Fixes#28473Closesscylladb/scylladb#28478
Add support for literals in the SELECT clause. This allows
SELECT fn(column, 4) or SELECT fn(column, ?).
Note, "SELECT 7 FROM tab" becomes valid in the grammar, but is still
not accepted because of failed type inference - we cannot infer the
type of 7, and don't have a favored type for literals (like C favors
int). We might relax this later.
In the WHERE clause, and Cassandra in the SELECT clause, type hints
can also resolve type ambiguity: (bigint)7 or (text)?. But this is
deferred to a later patch.
A few changes to the grammar are needed on top of adding a `value`
alternative to `unaliasedSelector`:
- vectorSimilarityArg gained access to `value` via `unaliasedSelector`,
so it loses that alternate to avoid ambiguity. We may drop
`vectorSimilarityArg` later.
- COUNT(1) became ambiguous via the general function path (since
function arguments can now be literals), so we remove this case
from the COUNT special cases, remaining with count(*).
- SELECT JSON and SELECT DISTINCT became "ambiguous enough" for
ANTLR to complain, though as far as I can tell `value` does not
add real ambiguity. The solution is to commit early (via "=>") to
a parsing path.
Due to the loss of count(1) recognition in the parser, we have to
special-case it in prepare. We may relax it to count any expression
later, like modern Cassandra and SQL.
Testing is awkward because of the type inference problem in top-level.
We test via the set_intersection() function and via lua functions.
Example:
```
cqlsh> CREATE FUNCTION ks.sum(a int, b int) RETURNS NULL ON NULL INPUT RETURNS int LANGUAGE LUA AS 'return a + b';
cqlsh> SELECT ks.sum(1, 2) FROM system.local;
ks.sum(1, 2)
--------------
3
(1 rows)
cqlsh>
```
(There are no suitable system functions!)
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-296Closesscylladb/scylladb#28256
`system.cluster_status` is missing the rack info compared to `nodetool status`
that is supposed to be equivalent. It has probably been an omission.
Closesscylladb/scylladb#28457
The hander of raft_topology_cmd::command::stream_ranges switches to
streaming scheduling group to perform data streaming in it. It grabs the
group from database db_config, which's not great. There's streaming
manager at hand in storage service handlers, since it's using its
functionality, it should use _its_ scheduling group.
This will help splitting the streaming scheduling group into more
elaborated groups under the maintenance supergroup: SCYLLADB-351
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28363
Before we could observe two exactly the same
"starting auth service" messages in the log.
One from checkpoint() the other from notify().
We remove the second one to stay consistent
with other services.
Closesscylladb/scylladb#28349
Adds --json-result option to perf-cql-raw and perf-alternator, the same as perf-simple-query has.
It is useful for automating test runs.
Related: https://scylladb.atlassian.net/browse/SCYLLADB-434
Bacport: no, original benchmark is not backported
Closesscylladb/scylladb#28451
* github.com:scylladb/scylladb:
test: perf: add example commands to perf-alternator and perf-cql-raw
test: perf: add option to write results to json in perf-cql-raw
test: perf: add option to write results to json in perf-alternator
test: perf: move write_json_result to a common file
When the topology coordinator refreshes load_stats, it caches load_stats for every node. In case the node becomes unresponsive, and fresh load_stats can not be read from the node, the cached version of load_stats will be used. This is to allow the load balancer to have at least some information about the table sizes and disk capacities of the host.
During load_stats refresh, we aggregate the table sizes from all the nodes. This procedure calls db.find_column_family() for each table_id found in load_stats. This function will throw if the table is not found. This will cause load_stats refresh to fail.
It is also possible for a table to have been dropped between the time load_stats has been prepared on the host, and the time it is processed on the topology coordinator. This would also cause an exception in the refresh procedure.
This fixes this problem by checking if the table still exists.
Fixes: #28359Closesscylladb/scylladb#28440
* github.com:scylladb/scylladb:
test: add test and reproducer for load_stats refresh exception
load_stats: handle dropped tables when refreshing load_stats
This patch adds a test and reproducer for the issue where the load_stats
refresh procedure throws exceptions if any of the tables have been
dropped since load_stats was produced.
We extend the test to provide a reproducer for #27988 and to avoid
similar bugs in the future.
The test slows down from ~14s to ~19s on my local machine in dev
mode. It seems reasonable.
In the following commit, we make the rest run with multiple keyspaces,
and the old check becomes inconvenient. We also move it below to the
part of the code that won't be executed for each keyspace.
Additionally, we check if the error message is as expected.
This skip has already caused trouble.
After 0668c642a2, the skip was always hit, and
the test was silently doing nothing. This made us miss #26816 for a long
time. The test was fixed in 222eab45f8, but we
should get rid of the skip anyway.
We increase the number of writes from 256 to 1000 to make the chance of not
finding the key on server A even lower. If that still happens, it must be
due to a bug, so we fail the test. We also make the test insert rows until
server A is a replica of one row. The expected number of inserted rows is
a small constant, so it should, in theory, make the test faster and cleaner
(we need one row on server A, so we insert exactly one such row).
It's possible to make the test fully deterministic, by e.g., hardcoding
the key and tokens of all nodes via `initial_token`, but I'm afraid it would
make the test "too deterministic" and could hide a bug.
In maintenance mode, the local node adds only itself to the topology. However,
the effective replication map of a keyspace with tablets enabled contains all
tablet replicas. It gets them from the tablets map, not the topology. Hence,
`network_topology_strategy::sanity_check_read_replicas` hits
```
throw std::runtime_error(format("Requested location for node {} not in topology. backtrace {}", id, lazy_backtrace()));
```
for tablet replicas other than the local node.
As a result, all requests to a keyspace with tablets enabled and RF > 1 fail
in debug mode (`validate_read_replica` does nothing in other modes). We don't
want to skip maintenance mode tests in debug mode, so we skip the check in
maintenance mode.
We move the `is_debug_build()` check because:
- `validate_read_replicas` is a static function with no access to the config,
- we want the `!_db.local().get_config().maintenance_mode()` check to be
dropped by the compiler in non-debug builds.
We also suppress `-Wunneeded-internal-declaration` with `[[maybe_unused]]`.
We currently make the local node the only token owner (that owns the
whole ring) in maintenance mode, but we don't update the topology properly.
The node is present in the topology, but in the `none` state. That's how
it's inserted by `tm.get_topology().set_host_id_cfg(host_id);` in
`scylla_main`. As a result, the node started in maintenance mode crashes
in the following way in the presence of a vnodes-based keyspace with the
NetworkTopologyStrategy:
```
scylla: locator/network_topology_strategy.cc:207:
locator::natural_endpoints_tracker::natural_endpoints_tracker(
const token_metadata &, const network_topology_strategy::dc_rep_factor_map &):
Assertion `!_token_owners.empty() && !_racks.empty()' failed.
```
Both `_token_owners` and `_racks` are empty. The reason is that
`_tm.get_datacenter_token_owners()` and
`_tm.get_datacenter_racks_token_owners()` called above filter out nodes
in the `none` state.
This bug basically made maintenance mode unusable in customer clusters.
We fix it by changing the node state to `normal`. We also update its
rack, datacenter, and shards count. Rack and datacenter are present in the
topology somehow, but there is nothing wrong with updating them again.
The shard count is also missing, so we better update it to avoid other
issues.
Fixes#27988
After guardrails_test.py has been migrated to test.py and fixed in
previous commits of this patch series, it can finally be enabled.
Fixes: SCYLLADB-255
This commit adds `wait_other_notice=True` to `cluster.populate` in
`guardrails_test.py`. Without this, `test_default_rf` sometimes fails
because `NetworkTopologyStrategy` setting fails before
the node knows about all other DCs.
Refs: SCYLLADB-255
This commit copies guardrails_test.py from dtest repository and
(temporarily) disables it, as it requires improvement in following
commits of this patch series before being enabled.
Refs: SCYLLADB-255
Schema is already a member of select statement, avoiding
the call saves around 400 cpu instructions on a select
request hot path.
Closesscylladb/scylladb#28328
When the topology coordinator refreshes load_stats, it caches load_stats
for every node. In case the node becomes unresponsive, and fresh
load_stats can not be read from the node, the cached version of
load_stats will be used. This is to allow the load balancer to
have at least some information about the table sizes and disk capacities
of the host.
During load_stats refresh, we aggregate the table sizes from all the
nodes. This procedure calls db.find_column_family() for each table_id
found in load_stats. This function will throw if the table is not found.
This will cause load_stats refresh to fail.
It is also possible for a table to have been dropped between the time
load_stats has been prepared on the host, and the time it is processed
on the topology coordinator. This would also cause an exception in the
refresh procedure.
This patch fixes this problem by checking if the table still exists.
Vector Search feature needs to support creating vector indexes with additional
filtering column. There will be two types of indexes: global which indexes
vectors per table, and local which indexes vectors per partition key. The new
syntaxes are based on ScyllaDB's Global Secondary Index and Local Secondary
Index. Vector indexes don't use secondary indexes functionalities in any way -
all indexing, filtering and processing data will be done on Vector Store side.
This patch allows creating vector indexes using this CQL syntax:
```
CREATE TABLE IF NOT EXISTS cycling.comments_vs (
commenter text,
comment text,
comment_vector VECTOR <FLOAT, 5>,
created_at timestamp,
discussion_board_id int,
country text,
lang text,
PRIMARY KEY ((commenter, discussion_board_id), created_at)
);
CREATE CUSTOM INDEX IF NOT EXISTS global_ann_index
ON cycling.comments_vs(comment_vector, country, lang) USING 'vector_index'
WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };
CREATE CUSTOM INDEX IF NOT EXISTS local_ann_index
ON cycling.comments_vs((commenter, discussion_board_id), comment_vector, country, lang)
USING 'vector_index'
WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };
```
Currently, if we run these queries to create indexes we will receive such errors:
```
InvalidRequest: Error from server: code=2200 [Invalid query] message="Vector index can only be created on a single column"
InvalidRequest: Error from server: code=2200 [Invalid query] message="Local index definition must contain full partition key only. Redundant column: XYZ"
```
This commit refactors `vector_index::check_target` to correctly validate
columns building the index. Vector-store currently support filtering by native
types, so the type of columns is checked. The first column from the list must
be a vector (to build index based on these vectors), so it is also checked.
Allowed types for columns are native types without counter (it is not possible
to create a table with counter and vector) and without duration (it is not
possible to correctly compare durations, this type is even not allowed in
secondary indexes).
This commits adds cqlpy test to check errors while creating indexes.
Fixes: SCYLLADB-298
This needs to be backported to version 2026.1 as this is a fix for filtering support.
Closesscylladb/scylladb#28366
These were marked xfail due to #8077 (the column name was wrong),
but it was fixed long ago for 5.4 (exact commit not known).
Remove the xfail markers to prevent regressions.
Closesscylladb/scylladb#28432
This series optimizes role lookup by moving find_record into standard_role_manager and switching it to use the auth cache. This allows reverting can_login to its original simpler form, ensuring hot paths are properly cached while maintaining consistency via group0_guard.
Backport: no, it's not a bug fix.
Closesscylladb/scylladb#28329
* github.com:scylladb/scylladb:
auth: bring back previous version of standard_role_manager::can_login
auth: switch find_record to use cache
auth: make find_record and callers standard_role_manager members
The method was coroutinized by 6df07f7ff7. Back then thecoroutine::switch_to() wasn't available, and the code used with_scheduling_group() to call coroutinized lambdas. Those lambdas were implemented as on-stack variables to solve the capture list lifetime problems. As a result, the code looks like
```
auto flush = [] {
... // do the flushing
auto post_flush = [] {
... // do the post-flushing
}
co_return co_await with_scheduling_group(group_b, post_flush);
};
co_return co_await with_scheduling_group(group_a, flush);
```
which is a bit clumsy. Now we have switch_to() and can make the code flow of this method more readable, like this
```
co_await switch_to(group_a);
... // do the flushing
co_await switch_to(group_b);
... // do the post-flushing
```
Code cleanup, not backporting
Closesscylladb/scylladb#28430
* github.com:scylladb/scylladb:
table: Fix indentation after previous patch
table: Use coroutine::switch_to() in try_flush_memtable_to_sstable()
test_alternator_proxy_protocol starts a node and connects via the alternator ports.
Starting a node, by default, waits until the CQL ports are up. This does not guarantee
that the alternator ports are up (they will be up very soon after this), so there is a short
window where a connection to the alternator ports will fail.
Fix by adding a ServerUpState=SERVING mode, which waits for the node to report
to its supervisor (systemd, which we are pretending to be) that its ports are open.
The test is then adjusted to request this new ServerUpState.
Fixes#28210Fixes#28211
Flaky tests are only in master and branch-2026.1, so backporting there.
Closesscylladb/scylladb#28291
* github.com:scylladb/scylladb:
test: test_alternator_proxy_protocol: wait for the node to report itself as serving
test: cluster_manager: add ability to wait for supervisor STATUS=serving
It allows dropping the local lambdas passed into with_scheduling_group()
calls. Overall the code flow becomes more readable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This commit moves the "Ungrouped properties" category to the end of the
properties list. The properties are now published in the documentation,
and it doesn't look good if the list starts with ungrouped properties.
This patch was taken over from Anna Stuchlik <anna.stuchlik@scylladb.com>.
Closesscylladb/scylladb#28343
Move to `replica/`, drop `flat` from name and drop unused usages as well as unused includes.
Code cleanup, no backport
Closesscylladb/scylladb#28353
* github.com:scylladb/scylladb:
replica/partition_snapshot_reader: remove unused includes
partition_snapshot_reader: remove "flat" from name
mv partition_snapshot_reader.hh -> replica/
This test case was observed to take over 2 minutes to run on CI
machines, contributing to already bloated CI run times.
Disable this test in debug mode. This test checks for memtable flush
being slowed down when compaction can't keep up. So this test needs to
overwhelm the CPU by definition. On the other hand, this is not a
correctness test, there are such tests for the memtable and compaction
already, so it is not critical to run this in debug mode, it is not
expected to catch any use-after-free and such.
Closesscylladb/scylladb#28407
This commit replaces the previous approach of running pytest inside
GDB’s Python interpreter. Instead, tests are executed by driving a
persistent GDB process externally using pexpect.
- pexpect: Python library for controlling interactive programs
(used here to send commands to GDB and capture its output)
- persistent GDB: keep one GDB session alive across multiple tests
instead of starting a new process for each test
Tests can now be executed via `./test.py gdb` or with
`pytest test/scylla_gdb`. This improves performance and
makes failures easier to debug since pytest no longer runs
hidden inside GDB subprocesses.
Closesscylladb/scylladb#24804
When reads arrive, they have to wait for admission on the reader
concurrency semaphore. If the node is overloaded, the reads will
be queued. They can time out while in the queue, but will not time
out once admitted.
Once the shard is sufficiently loaded, it is possible that most
queued reads will time out, because the average time it takes to
for a queued read to be admitted is around that of the timeout.
If a read times out, any work we already did, or are about to do
on it is wasted effort. Therefore, the patch tries to prevent it
by checking if an admitted read has a chance to complete in time
and abort it if not. It uses the following criteria:
if read's remaining time <= read's timeout when arrived to the semaphore * live updateable preemptive_abort_factor;
the read is rejected and the next one from the wait list is considered.
Fixes https://github.com/scylladb/scylladb/issues/14909
Fixes: SCYLLADB-353
Backport is not needed. Better to first observe its impact.
Closesscylladb/scylladb#21649
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: Check during admission if read may timeout
permit_reader::impl: Replace break with return after evicting inactive permit on timeout
reader_concurrency_semaphore: Add preemptive_abort_factor to constructors
config: Add parameters to control reads' preemptive_abort_factor
permit_reader: Add a new state: preemptive_aborted
reader_concurrency_semaphore: validate waiters counter when dequeueing a waiting permit
reader_concurrency_semaphore: Remove cpu_concurrency's default value
Contains various improvements to tablet load balancer. Batched together to save on the bill for CI.
Most notably:
- Make plan summary more concise, and print info only about present elements.
- Print rack name in addition to DC name when making a per-rack plan
- Print "Not possible to achieve balance" only when this is the final plan with no active migrations
- Print per-node stats when "Not possible to achieve balance" is printed
- amortize metrics lookup cost
- avoid spamming logs with per-node "Node {} does not have complete tablet stats, ignoring"
Backport to 2026.1: since the changes enhance debuggability and are relatively low risk
Fixes#28423Fixes#28422Closesscylladb/scylladb#28337
* github.com:scylladb/scylladb:
tablets: tablet_allocator.cc: Convert tabs to spaces
tablets: load_balancer: Warn about incomplete stats once for all offending nodes
tablets: load_balancer: Improve node stats printout
tablets: load_balancer: Warn about imbalance only when there are no more active migrations
tablets: load_balancer: Extract print_node_stats()
tablet: load_balancer: Use empty() instead of size() where applicable
tablets: Fix redundancy in migration_plan::empty()
tablets: Cache pointer to stats during plan-making
tablets: load_balancer: Print rack in addition to DC when giving context
tablets: load_balancer: Make plan summary concise
tablets: load_balancer: Move "tablet_migration_bypass" injection point to make_plan()
Now that the sum function in the histogram uses true values instead of
an estimate, the test should reflect that.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
In fact, it's partially there already. When view_builder::start() is called is first calls initialization code (the start_in_background() method), then kicks do_build_step() that runs a background fiber to perform build steps. The starting code inherits scheduling group from main(). And the step fiber code needs to run itself in a maintenance scheduling group, so it explicitly grabs one via database->db_config.
This PR mainly gets rid of the call to database::get_streaming_scheduling_group() from do_build_step() as preparation to splitting the streaming scheduling group into parts (see SCYLLADB-351). To make it happen the do_build_step() is patched to inherit its scheduling group from view_builder::start() and the start() itself is called by main from maintenance scheduling group (like for other view building services).
New feature (nested scheduling group), not backporting
Closesscylladb/scylladb#28386
* github.com:scylladb/scylladb:
view_builder: Start background in maintenance group
view_builder: Wake-up step fiber with condition variable
In a lambda returned from make_streaming_consumer() there's a check for
current scheudling group being streaming one. It came from #17090 where
streaming code was launched in wrong sched group thus affecting user
groups in a bad way.
The check is nice and useful, but it abuses replica::database by getting
unrelated information from it.
To preserve the check and to stop using database as provider of configs,
keep the streaming scheduling group handle in the debug namespace. This
emphasises that this global variable is purely for debugging purposes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28410
Explain what automatic repair is and how to configure it. While at it, improve the existing repair documentation a bit.
Fixes: SCYLLADB-130
This PR missed the 2026.1 branch date, so it needs backport to 2026.1, where the auto repair feature debuts.
Closesscylladb/scylladb#28199
* github.com:scylladb/scylladb:
docs: add feature page for automatic repair
docs: inter-link incremental-repair and repair documents
docs: incremental-repair: fix curl example
One of the best features of the pytest framework is "assertion
rewriting": If your test does for example "assert a + 1 == b", the
assertion is "rewritten" so that if it fails it tells you not only
that "a+1" and "b" are not equal, what the non-equal values are,
how they are not equal (e.g., find different elements of arrays) and
how each side of the equality was calculated.
But pytest can only "rewrite" assertion that it sees. If you call a
utility function checksomething() from another module and that utility
function calls assert, it will not be able to rewrite it, and you'll
get ugly, hard-to-debug, assertion failures.
This problem is especially noticable in tests we translated from
Cassandra, in test/cqlpy/cassandra_tests. Those tests use a bunch of
assertion-performing utility functions like assertRows() et al.
Those utility functions are defined in a separate source file,
porting.py, so by default do not get their assertions rewritten.
We had a solution for this: test/cqlpy/cassandra_test/__init__.py had:
pytest.register_assert_rewrite("cassandra_tests.porting")
This tells pytest to rewrite assertions in porting.py the first time
that it is imported.
It used to work well, but recently it stopped working. This is because
we change the module paths recently, and it should be written as
test.cqlpy.cassandra_tests.porting.
I verified by editing one of the cassandra_tests to make a bad check
that indeed this statement stopped working, and fixing the module
path in this way solves it, and makes assertion rewriting work
again.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#28411
Currently view_builder::start() is called in default scheduling group.
Once it initializes itself, it wakes up the step fiber that explicitly
switches to maintenance scheduling group.
This explicit switch made sence before previous patch, when the fiber
was implemented as a serialized action. Now the fiber starts directly
from .start() method and can inherit scheduling group from it.
Said that, main code calls view_builder::start() in maintenance
scheduling group killing two birds with one stone. First, the step fiber
no longer needs borrow its scheduling group indirectly via database.
Second, the start_in_background() code itself runs in a more suitable
scheduling group.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
View builder runs a background fiber that perform build steps. To kick
the fiber it uses serizlized action, but it's an overkill -- nobody
waits for the action to finish, but on stop, when it's joined.
This patch uses condition variable to kick the fiber, and starts it
instantly, in the place where serialized action was first kicked.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are several places in the code that need to explicitly switch into
gossiper scheduling group. For that they currently call database to
provide the group, but it's better to get gossiper sched group from
gossiper itself, all the more so all those places have gossiper at hand.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is to initialize dependency references, in particular gossiper&,
before _group0_barrier. The latter will need to access this->_gossiper
in the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Explain what the feature is and how to confiture it.
Inter-link all the repair related pages, so one can discover all about
repair, regardless of which page they land on.
When a shard on a replica is overloaded, it breaks down completely,
throughput collapses, latencies go through the roof and the
node/shard can even become completely unresponsive to new connection
attempts.
When reads arrive, they have to wait for admission on the reader
concurrency semaphore. If the node is overloaded, the reads will
be queued and thus they can time out while being in the queue or during
the execution. In the latter case, the timeout does not always
result in the read being aborted.
Once the shard is sufficiently loaded, it is possible that most
queued reads will time out, because the average time it takes
for a queued read to be admitted is around that of the timeout.
If a read times out, any work we already did, or are about to do
on it is wasted effort. Therefore, the patch tries to prevent it
by checking if an admitted read has a chance to complete in time
and abort it if not. It uses the following cryteria:
if read's remaining time <= read's timeout when arrived to the semaphore * preemptive factor;
the read is rejected and the next one from the wait list is
considered.
Evicting an inactive permit destroyes the permit object when the
reader is closed, making any further member access invalid. Switch
from break to an early return to prevent any possible use-after-free
after evict() in the state::inactive timeout path.
The new parameter parametrizes the factor used to reject a read
during admission. Its value shall be between 0.0 and 1.0 where
+ 0.0 means a read will never get rejected during admission
+ 1.0 means a read will immediatelly get rejected during admission
Although passing values outside the interaval is possible, they
will have the exact same effects as they were clamped to [0.0, 1.0].
A permit gets into the preemptive_aborted state when:
- times out;
- gets rejected from execution due to high chance its execution would
not finalize on time;
Being in this state means a permit was removed from the wait list,
its internal timer was canceled and semaphore's statistic
`total_reads_shed_due_to_overload` increased.
Fix a subtle but damaging failure mode in the tablet migration state machine: when a barrier fails, the follow-up barrier is triggered asynchronously, and cleanup can get skipped for that iteration. On the next loop, the original failure may no longer be visible (because the failing node got excluded), so the tablet can incorrectly move forward instead of entering `cleanup_target`.
To make cleanup reliable this PR:
Adds an additional “fallback cleanup” stage
- `write_both_read_old_fallback_cleanup`
that does not modify read/write selectors. This stage is safe to enter immediately after a barrier failure, and it funnels the tablet into cleanup with the required barriers.
Avoids changing both read and write selectors in a single step transitioning from `write_both_read_new` to `cleanup_target`. The fallback path updates selectors in a safe order: read first, then write.
Allows a direct no-barrier transition from `allow_write_both_read_old` to `cleanup_target` after failure, because in that specific case `cleanup_target` doesn’t change selectors and the hop is safe.
No need for backport. It's an improvement. Currently, tablets transition to `cleanup_target` eventually via failed streaming.
Closesscylladb/scylladb#28169
* github.com:scylladb/scylladb:
topology_coordinator: add write_both_read_old_fallback_cleanup state
topology_coordinator: allow cleanup_target transition from streaming/rebuild_repair without barrier
topology_coordinator: allow cleanup_target transition without barrier after failure in write_both_read_old
topology_coordinator: allow cleanup_target transition without barrier after failure in allow_write_both_read_old
This patch replaces simple counters with bytes_histogram for tracking
CQL request and response sizes, enabling better visibility into message
size distribution.
Changes:
- Replace request_size and response_size metrics with bytes_histogram in
cql_sg_stats::request_kind_stats
- Per-shard metrics continue to be reported as before
- QUERY, EXECUTE, and BATCH operations now report per-node, per-scheduling-group
histograms of bytes sent and received, providing detailed insight into these
operations
Other CQL operations (e.g., PREPARE, OPTIONS) are not included in per-node
histogram reporting as they are less performance-critical, but can be added
in the future if proven useful.
Metrics example:
```
# HELP scylla_transport_cql_request_bytes Counts the total number of received bytes in CQL messages of a specific kind.
# TYPE scylla_transport_cql_request_bytes counter
scylla_transport_cql_request_bytes{kind="BATCH",scheduling_group_name="sl:default",shard="0"} 129808
scylla_transport_cql_request_bytes{kind="EXECUTE",scheduling_group_name="sl:default",shard="0"} 227409
scylla_transport_cql_request_bytes{kind="PREPARE",scheduling_group_name="sl:default",shard="0"} 631
scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:default",shard="0"} 2809
scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:driver",shard="0"} 4079
scylla_transport_cql_request_bytes{kind="REGISTER",scheduling_group_name="sl:default",shard="0"} 98
scylla_transport_cql_request_bytes{kind="STARTUP",scheduling_group_name="sl:driver",shard="0"} 432
# HELP scylla_transport_cql_request_histogram_bytes A histogram of received bytes in CQL messages of a specific kind and specific scheduling group.
# TYPE scylla_transport_cql_request_histogram_bytes histogram
scylla_transport_cql_request_histogram_bytes_sum{kind="QUERY",scheduling_group_name="sl:driver"} 4079
scylla_transport_cql_request_histogram_bytes_count{kind="QUERY",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1024.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2048.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4096.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8192.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16384.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="32768.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="65536.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="131072.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="262144.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="524288.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1048576.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2097152.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4194304.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8388608.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16777216.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="33554432.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="67108864.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="134217728.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="268435456.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="536870912.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1073741824.000000",scheduling_group_name="sl:driver"} 57
```
Previously, histogram sums were estimated by multiplying bucket offsets
by their counts, which produces inaccurate results - typically too high
when using upper limits or too low when using lower limits.
This patch adds accurate sum tracking to approx_exponential_histogram:
- Adds a _sum member variable to track the actual sum of all values
- Implements sum() method to return the accumulated total
- Updates add() to increment _sum for each value
- Modifies to_metrics_histogram() helper to use the new sum() method
This change is important as histograms will be used instead of counters for
byte statistics, where accurate totals are essential for metrics reporting.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
For various use cases, we need to report byte histograms, such as for
request and reply message sizes.
This patch introduce bytes_histogram as a type alias for
approx_exponential_histogram configured to track byte values from 1KB to
1GB with power-of-2 buckets (Precision=1).
This provides a convenient, performance-efficient histogram for
measuring message sizes, payload sizes, and other byte-based metrics.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
There are few places that use raft_group0_client as a way to get to system_keyspace. Mostly they can live without it -- either the needed reference is already at hand, or it's (ab)used to get to the database reference. The only place that really needs the system keyspace is the state merger code that needs last state ID. For that, the explicit helper method is added to group0_client.
Refining API between components, not backporting
Closesscylladb/scylladb#28387
* github.com:scylladb/scylladb:
raft_group0_client: Dont export system keyspace
raft_group0_client: Add and use get_last_group0_state_id()
group0_state_machine: Call ensure_group0_sched() with data_dictionary
view_building_worker: Use its own system_keyspace& reference
During test.py run, noticed this warning:
```
10:38:22 test/cqlpy/cassandra_tests/validation/operations/insert_update_if_condition_test.py:14: 32 warnings
10:38:22 /jenkins/workspace/releng-testing/scylla-ci/scylla/test/cqlpy/cassandra_tests/validation/operations/insert_update_if_condition_test.py:14: PytestAssertRewriteWarning: Module already imported so cannot be rewritten: test.cqlpy.cassandra_tests.porting
10:38:22 pytest.register_assert_rewrite('test.cqlpy.cassandra_tests.porting')
```
The insert_update_if_condition_test.py was calling
pytest.register_assert_rewrite() for the porting module, but this
registration is already handled by cassandra_tests/__init__.py which
is automatically loaded before any test runs.
Closesscylladb/scylladb#28409
The include-what-you-use workflow fails with
```
Invalid workflow file: .github/workflows/iwyu.yaml#L25
The workflow is not valid. .github/workflows/iwyu.yaml (Line: 25, Col: 3): Error calling workflow 'scylladb/scylladb/.github/workflows/read-toolchain.yaml@257054deffbef0bde95f0428dc01ad10d7b30093'. The nested job 'read-toolchain' is requesting 'contents: read', but is only allowed 'contents: none'.
```
Fix by adding the correct permissions.
Closesscylladb/scylladb#28390
Enhance the skip_mode marker to accept either a single mode string
or a list of modes, allowing tests to be skipped across multiple
build configurations with a single marker.
Before:
@pytest.mark.skip_mode("dev", reason="...")
@pytest.mark.skip_mode("debug", reason="...")
After:
@pytest.mark.skip_mode(["dev", "debug"], reason="...")
This reduces duplication when the same skip condition applies
to multiple build modes.
Closesscylladb/scylladb#28406
The test is currently flaky. It tries to get the host ID of the bootstrapping
node via the REST API after the node crashes. This can obviously fail. The
test usually doesn't fail, though, as it relies on the host ID being saved
in `ScyllaServer._host_id` at this point by `ScyllaServer.try_get_host_id()`
repeatedly called in `ScyllaServer.start()`. However, with a very fast crash
and unlucky timings, no such call may succeed.
We deflake the test by getting the host ID before the crash. Note that at this
point, the bootstrapping node must be serving the REST API requests because
`await log.wait_for("finished do_send_ack2_msg")` above guarantees that the
node has started the gossip shadow round, which happens after starting the REST
API.
Fixes#28385Closesscylladb/scylladb#28388
Currently, tablet_allocator switches to streaming scheduling group that
it gets from database. It's not nice to use database as provider of
configs/scheduling_groups.
This patch adds a background scheduling group for tablet allocator
configured via its config and sets it to streaming group in main.cc
code.
This will help splitting the streaming scheduling group into more
elaborated groups under the maintenance supergroup: SCYLLADB-351
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28356
This PR reduces the runtime of `test_out_of_space_prevention.py` by addressing two main sources of overhead: slow “critical utilization” setup and delayed tablet load stats propagation. Combined, these changes cut the module’s total execution time from 324s to 185s.
Improvements. No backup is required.
Closesscylladb/scylladb#28396
* github.com:scylladb/scylladb:
test/storage: speed up out-of-space prevention tests by using smaller volumes
test/storage: reduce tablet load stats refresh interval to speed up OOS prevention tests
As a next step of migration to the pytest runner, this PR moves
responsibility for nodetool tests execution solely to the pytest.
Closesscylladb/scylladb#28348
Otherwise, it may be only a temporary situation due to lack of
candidates, and may be unnecessarily alerting.
Also, print node stats to allow assessing how bad the situation is on
the spot. Those stats can hint to a cause of imbalance, if balancing
is per-DC and racks have different capacity.
Load-balancing can be now per-rack instead of per-DC. So just printing
"in DC" is confusing. If we're balancing a rack, we should print which
rack is that.
Before:
load_balancer - Prepared 1 migration plans, out of which there were 1 tablet migration(s) and 0 resize decision(s) and 0 tablet repair(s) and 0 rack-list colocation(s)
After:
load_balancer - Prepared plan: migrations: 1
We print only stats about elements which are present.
When building with `--disable-precompiled-header`, view.cc failed to
compile due to missing <seastar/coroutine/all.hh> include, which provides
`coroutine::all`.
The problem doesn't manifest when precompiled headers are used, which is
the default. So that's likely why it was missed by the CI.
Adding the explicit include fixes the build.
Fixes: scylladb/scylladb#28378
Ref: scylladb/scylladb#28093
No backport: This problem is only present in master.
Closesscylladb/scylladb#28379
Fixes#28398
When used as path elements in google storage paths, the object names
need to be URL encoded. Due to a.) tests not really using prefixes including
non-url valid chars (i.e. / etc) and the mock server used for most
testing not enforcing this particular aspect, this was missed.
Modified unit tests to use prefixing for all names, so when run
in real GS, any errors like this will show.
Fixes#28399
When iterating with pager, the mock server and real GCS behaves differently.
The latter will not give a pager token for last page, only penultimate.
Need to handle.
In production environments, we observed cases where the S3 client would repeatedly fail to connect due to DNS entries becoming stale. Because the existing logic only attempted the first resolved address and lacked a way to refresh DNS state, the client could get stuck in a failure loop.
Introduce RR TTL and connection failure retry to
- re-resolve the RR in a timely manner
- forcefully reset and re-resolve addresses
- add a special case when the TTL is 0 and the record must be resolved for every request
Fixes: CUSTOMER-96
Fixes: CUSTOMER-139
Should be backported to 2025.3/4 and 2026.1 since we already encountered it in the production clusters for 2025.3
Closesscylladb/scylladb#27891
* github.com:scylladb/scylladb:
connection_factory: includes cleanup
dns_connection_factory: refine the move constructor
connection_factory: retry on failure
connection_factory: introduce TTL timer
connection_factory: get rid of shared_future in dns_connection_factory
connection_factory: extract connection logic into a member
connection_factory: remove unnecessary `else`
connection_factory: use all resolved DNS addresses
s3_test: remove client double-close
Use the new ServerUpState=SERVING mechanism to wait to the alternator
ports to be up, rather than relying on the default waiting for CQL,
which happens earlier and therefore opens a window where a connection to
the alternator ports will fail.
When running under systemd, ScyllaDB sends a STATUS=serving message
to systemd. Co-opt this mechanism by setting up NOTIFY_SOCKET, thus
making the cluster manager pretend it is systemd. Users of the cluster
manager can now wait for the node to report itself up, rather than
having to parse log files or retry connections.
This parameter was not mentioned at all anywhere in the documentation.
Add an explanation of this parameter: why we need it, what is the
default and how it can be changed.
Closesscylladb/scylladb#28132
Hints destined for some other node can only be drained after the other
node is no longer a replica of any vnode or tablet. In case when tablets
are present, a node might still technically be a replica of some tablets
after it moved to left state. When it no longer is a replica of any
tablet, it becomes "released" and storage service generates a
notification about it. Hinted handoff listens to this notification and
kicks off draining hints after getting it.
The current implementation of the "released" notification would trigger
every time raft topology state is reloaded and a left node without any
tokens is present in the raft topology. Although draining hints is
idempotent, generating duplicate notifications is wasteful and recently
became very noisy after in 44de563 verbosity of the draining-related log
messages have been increased. The verbosity increase itself makes sense
as draining is supposed to be a rare operation, but the duplicate
notification bug now needs to be addressed.
Fix the duplicate notification problem by passing the list of previously
released nodes to the `storage_service::raft_topology_update_ip`
function and filtering based on it. If this function processes the
topology state for the first time, it will not produce any
notifications. This is fine as hinted handoff is prepared to detect
"released" nodes during the startup sequence in main.cc and start
draining the hints there, if needed.
Fixes: #28301
Refs: #25031
The commit 59faa6d, introduces a new parameter called cpu_concurrency
and sets its default value to 1 which violates the commit fbb83dd that
removes all default values from constructors but one used by the unit
tests.
The patch removes the default value of the cpu_concurrency parameter
and alters tests to use the test dedicated reader_concurrency_semaphore
constructor wherever possible.
Tests in test_out_of_space_prevention.py spend a large fraction of
time creating a random “blob” file to cross the 0.8 critical disk
utilization threshold. With 100MB volumes this requires writing
~70–80MB of data, which is slow inside Docker/Podman-backed volumes.
Most tests only use ~11MB of data, so large volumes are unnecessary.
Reduce the test volume size to 20MB so the critical threshold is
reached at ~16MB and the blob file is much smaller.
This cuts ~5–6s per test.
Set `--tablet-load-stats-refresh-interval-in-seconds=1` for this module’s
clusters applicable to all tests. This significantly reduces runtime
for the slowest cases:
- test_reject_split_compaction: 75.62s -> 23.04s
- test_split_compaction_not_triggered: 69.36s -> 22.98s
In the following commits we will need to compare the set of released
nodes before and after reload of raft topology state. Moving the logic
that calculates such a set to a separate function will make it easier to
do.
Now system_keyspace reference is used internally by the client code
itself, no need to encourage other services abuse it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are several places that want to get last state id and for that
they make raft_group0_client() export system_keyspace reference.
This patch adds a helper method to provide the needed ID.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a validation for tables being used by group0 commands are marked
with the respective prop. For it the caller code needs to provide
database reference and it gets one from client -> system_keyspace chain.
There's more explicit way -- get the data_dictionary via proxy.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some code in the worker need to mess with system_keyspace&. While
there's a reference on it from the worker object, it gets one via
group0 -> group0_client, which is a bit an overkill.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch set eliminates special audit info guard used before for batch statements
and simplifies audit::inspect function by returning quickly if audit is not needed.
It saves around 300 instructions on a request's hot path.
Related: https://github.com/scylladb/scylladb/issues/27941
Backport: no, not a bug
Closesscylladb/scylladb#28326
* github.com:scylladb/scylladb:
audit: replace batch dynamic_cast with static_cast
audit: eliminate dynamic_cast to batch_statement in inspect
audit: cql: remove create_no_audit_info
audit: add batch bool to audit_info class
Currently it grabs one from database, but it's not nice to use database
as config/sched-groups provider.
This PR passes the scheduling group to use for sending hints via manager
which, in turn, gets one from proxy via its config (proxy config already
carries configuration for hints manager). The group is initialized in
main.cc code and is set to the maintenance one (nowadays it's the same
as streaming group).
This will help splitting the streaming scheduling group into more
elaborated groups under the maintenance supergroup: SCYLLADB-351
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28358
consistent_cluster_management is deprecated since scylla-5.2 and no
longer used by Scylladb, so it should not be used by test either.
Closesscylladb/scylladb#28340
These two streams mostly play together. The former provides an input_stream from read from in-memory temporary buffers, the latter wraps it to limit the size of provided temporary buffers. Both are used to test contiguous data consumer, also the buffer_input_stream has a caller in sstables reversing reader.
This PR removes the buffer_input_stream in favor of seastar memory_data_source, and moves the limiting_input_stream into test/lib.
Enanching testing code, not backporting
Closesscylladb/scylladb#28352
* github.com:scylladb/scylladb:
code: Move limiting data source to test/lib
util: Simplify limiting_data_source API
util: Remove buffer_input_stream
test: Use seastar::util::temporary_buffer_data_source in data consumer test
sstables: Use seastar::util::as_input_stream() in mx reader
This compaction group testing is useless because the machinery for it
to work was removed. This was useful in the early tablet days, where
we wanted to test compaction groups directly. Today groups are stressed
and tested on every tablet test.
I see a ~40% reduction time after this patch, since database_test is
one of the most (if not the most) time consuming in boost suite.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#28324
db: view: refactor semaphore usage in create/drop view paths
Refactor the construction and usage of semaphore units in the create and drop view flows.
The previous semaphore handling was hard to follow (as noted while working on https://github.com/scylladb/scylladb/pull/27929), so this change restructures unit creation and movement to follow a clearer and symmetric pattern across shards.
The semaphore usage model is now documented with a detailed in-code comment to make the intended behavior and invariants explicit.
As part of the refactor, the control flow is modernized by replacing continuation-based logic with coroutine-style code, improving readability and maintainability.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-250
backport: not required, this is a refactor
Closesscylladb/scylladb#28093
* github.com:scylladb/scylladb:
db: view: extend try/catch scope in handle_create_view_local The try/catch region is extended to cover step functions and inner helpers, which may throw or abort during view creation. This change is safe because we are just swolowing more parts that may throw due to semaphore abortion or any other abortion request, and doesnt change the logic
db: view: refine create/drop coroutine signatures Refactor the create/drop coroutine interfaces to accept parameters as const references, enabling a clearer workflow and safer data flow.
db: view: switch from continuations to coroutines Refactor the flow and style of create and drop view to use coroutines instead of continuations. This simplifies the logic, improves readability, and makes the code easier to maintain and extend. This commit also utilizes the get_view_builder_units function that was added in the previous commit. this commit also introduces a new alisasing for optional unit type for simpler and more readable functions that use this type
db: view: introduce helper to acquire or reuse semaphore units Introduce a small helper that acquires semaphore units when needed or reuses units provided by the caller. This centralizes semaphore handling, simplifies the current logic, and enables refactoring the view create/drop path to a coroutine-based implementation instead of continuation-style code.
db: view: add detailed comments on semaphore bookkeeping and serialized create/drop on shard 0
Previously, we wanted to make minimal changes with regards to the new
unified auth cache. However, as a result, some calls on the hot path
were missed. Now we have switched the underlying find_record call
to use the cache. Since caching is now at a lower level, we bring
back the original code.
Since every write-type auth statement takes group0_guard at the beginning,
we hold read_apply_mutex and cannot have a running raft apply during our
operation. Therefore, the auth cache and internal CQL reads return the same,
consistent results. This makes it safe to read via cache instead of internal
CQL.
LDAP is an exception, but it is eventually consistent anyway.
The partition snapshot lives in mutation/, however mutation/ is a lower
level concept than a mutation reader. The next best place for this
reader is the replica/ directory, where the memtable, its main user,
also lives.
Also move the code to the replica namespace.
test/boost/mvcc_test.cc includes this header but doesn't use anything
from it. Instead of updating the include path, just drop the unused
include.
Filter the content of sstable(s), including or excluding the specified partitions. Partitions can be provided on the command line via `--partition`, or in a file via `--partitions-file`. Produces one output sstable per input sstable -- if the filter selects at least one partition in the respective input sstable. Output sstables are placed in the path provided via `--oputput-dir`. Use `--merge` to filter all input sstables combined, producing one output sstable.
Fixes: #13076
New functionality, no backport.
Closesscylladb/scylladb#27836
* github.com:scylladb/scylladb:
tools/scylla-sstable: introduce filter command
tools/scylla-sstable: remove --unsafe-accept-nonempty-output-dir
tools/scylla-sstable: make partition_set ordered
tools/scylla-stable: remove unused boost/algorithm/string.hpp include
Generally square brackets are non allowed in URI, while pytest uses it
the test name to show that there were additional parameters for the same
test. When such a test fail it shows the directory correctly in Jenkins,
however attempt to download only this will fail, because of the square
brackets in URI. This change substitute the square brackets with round
brackets.
Closesscylladb/scylladb#28226
Yet another barrier-failure scenario exists in the `write_both_read_new`
state. When the barrier fails, the tablet is expected to transition
to `cleanup_target`, but because barrier execution is asynchronous,
the cleanup transition can be skipped entirely and the tablet may
continue forward instead.
Both `write_both_read_new` and `cleanup_target` modify read and write
selectors. In this situation, a barrier is required, and transitioning
directly between these states without one is unsafe.
Introduce an intermediate `write_both_read_old_fallback_cleanup`
state that modifies only a read selector and can be entered without
a barrier (there is no need to wait for all nodes to start using the
"new" read selector). From there, the tablet can proceed to `cleanup_target`,
where the required barriers are enforced.
This also avoids changing both selectors in a single step. A direct
transition from `write_both_read_new` to `cleanup_target` updates
both selectors at once, which can leave coordinators using the old
selector for writes and the new selector for reads, causing reads to
miss preceding writes.
By routing through the fallback state, selectors are updated in
order—read first, then write—preserving read-after-write correctness.
In both `streaming` and `rebuild_repair` stages, the read/write
selectors are unchanged compared to the preceding stage. Because
entry into these stages is already fenced by a barrier from
`write_both_read_old`, and the `cleanup_target` itself requires
barrier, rolling back directly to `cleanup_target` is safe without
an additional barrier.
A similar barrier-failure scenario exists in the `write_both_read_old`
state. If the barrier fails, the tablet is expected to transition to
`cleanup_target`, but due to the barrier being evaluated asynchronously
the cleanup path can be skipped and the tablet may continue forward
instead.
In `write_both_read_old`, we already switched group0 writes from old
to both, while the barrier may not have executed yet. As a result,
nodes can be at most one step apart (some still use old, others use
both).
Transitioning to `cleanup_target` reverts the write selector back to
old. Nodes still differ by at most one step (old vs both), so the
transition is safe without an additional barrier.
This prevents cleanup from being skipped while keeping selector semantics
and barrier guarantees intact.
When a tablet is in `allow_write_both_read_old`, progressing normally
requires a barrier. If this first barrier fails, the tablet is supposed
to transition to `cleanup_target` on the next iteration:
```
case locator::tablet_transition_stage::allow_write_both_read_old:
if (action_failed(tablet_state.barriers[trinfo.stage])) {
if (check_excluded_replicas()) {
transition_to_with_barrier(locator::tablet_transition_stage::cleanup_target);
break;
}
}
if (do_barrier()) {
...
}
break;
```
That transition itself requires a barrier, which is executed asynchronously.
Because the barrier runs in the background, the cleanup logic is skipped in
that iteration.
On the following iteration, `action_failed(barriers[stage])` no longer
returns true, since the node that caused the original barrier failure
has been excluded. The barrier is therefore observed as successful,
and the tablet incorrectly proceeds to the next stage instead of entering
`cleanup_target`.
Since `cleanup_target` does not modify read/write selectors, the transition
can be done safely without a barrier, simplifying the state machine and
ensuring cleanup is not skipped.
Without it, the tablet would still eventually reach `cleanup_target` via
`write_both_read_old` and `streaming`, but that path is unnecessary.
The try/catch region is extended to cover step functions and inner helpers,
which may throw or abort during view creation.
This change is safe because we are just swolowing more parts that may throw due to semaphore abortion
or any other abortion request, and doesnt change the logic
Refactor the flow and style of create and drop view to use coroutines instead of continuations.
This simplifies the logic, improves readability, and makes the code
easier to maintain and extend. This commit also utilizes the get_view_builder_units function that was added in the previous commit.
this commit also introduces a new alisasing for optional unit type for simpler and more readable functions that use this type
Introduce a small helper that acquires semaphore units when needed or
reuses units provided by the caller.
This centralizes semaphore handling, simplifies the current logic, and
enables refactoring the view create/drop path to a coroutine-based
implementation instead of continuation-style code.
Commit 0156e97560 ("storage_proxy: cas: reject for
tablets-enabled tables") marked a bunch of LWT tests as
XFAIL with tablets enabled, pending resolution of #18066.
But since that event is now in the past, we undo the XFAIL
markings (or in some cases, use an any-keyspace fixture
instead of a vnodes-only fixture).
Ref #18066.
Closesscylladb/scylladb#28336
Only two tests use it now -- the limit-data-source-test iself and a test
that validates continuous_data_consumer template.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The source maintains "limit generator" -- a function that returns the
maximum size of bytes to return from the next buffer.
Currently all callers just return constant numbers from it. Passing a
function that returns non-constant one can, probably, be used for a
fuzzy test, but even the limiting-data-source-test itself doesn't do it,
so what's the point...
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test creates buffer_data_source_impl and wraps it with limiting data
source. The former data_source duplicates the functionality of the
existing seastar temporary_buffer_data_source.
This patch makes the test code use seastar facility. The
buffer_data_source_impl will be removed soon.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Right now the code uses make_buffer_input_stream() helper that creates
an input stream with buffer_data_source_impl inside which, in turn,
provides the data_source_impl API over a single temporary_buffer.
Seastar has the very same facility, it's better to use it. Eventually
the buffer_data_source_impl will be removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We don't need a special guard value, it's
only being filled for batch statements for
which we can simply ignore the value.
Not having special value allows us to return
fast when audit is not enabled.
The user can now discover the general explanatio of repair when reading
about incremental repair, useful if they don't know what repair is.
The user can now discover incremental repair while reading the generic
repair procedure document.
This PR refactors the streaming subsystem to support direct download of fully contained sstables. Instead of streaming these files, they are downloaded and attached directly to their corresponding tables. This approach reduces overhead, simplifies logic, and improves efficiency. Expected node scope restore performance improvement: ~4 times faster in best case scenario when all sstables are fully contained.
1. Add storage options field to sstable Introduce a data member to store storage options, enabling distinction between local and object storage types.
2. Add method to create component source Extend the storage interface with a public method to create a data_source for any sstable component.
3. Inline streamer instance creation Remove make_sstable_streamer and inline its usage to allow different sets of arguments at call sites.
4. Skip streaming empty sstable sets Avoid unnecessary streaming calls when the sstable set is empty.
5. Enable direct download of contained sstables Replace streaming of fully contained sstables with direct download, attaching them to their corresponding table.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-200
Refs: https://github.com/scylladb/scylladb/issues/23908
No need to backport as this code targets 2026.2 release (for tablet-aware restore)
Closesscylladb/scylladb#26834
* github.com:scylladb/scylladb:
tests: reuse test_backup_broken_streaming
streaming: enable direct download of contained sstables
storage: add method to create component source
streaming: keep sharded database reference on tablet_sstable_streamer
streaming: skip streaming empty sstable sets
streaming: inline streamer instance creation
tests: fix incorrect backup/restore test flow
Previously we only inspected std::system_error inside
std::nested_exception to support a specific TLS-related failure
mode. However, nested exceptions may contain any type, including
other restartable (retryable) errors. This change unwraps one
nested exception per iteration and re-applies all known handlers
until a match is found or the chain is exhausted.
Closesscylladb/scylladb#28240
In this PR, we fix two bugs present in `boost_test_tree_lister` that
affected the output of `--list_json_content` added in
scylladb/scylladb@afde5f668a:
* The labels test units use were duplicated in the output.
* If a test suite or a test file didn't contain any tests, it wasn't
listed in the output.
Refs scylladb/scylladb#25415
Backport: not needed. The code hasn't been used anywhere yet.
Closesscylladb/scylladb#28255
* github.com:scylladb/scylladb:
test/lib/boost_test_tree_lister.cc: Record empty test suites
test/lib/boost_test_tree_lister.cc: Deduplicate labels
Finishing the deprecation of the skip_mode function in favor of
pytest.mark.skip_mode. This PR is only cleaning and migrating leftover tests
that are still used and old way of skip_mode.
Closesscylladb/scylladb#28299
Move state management from dns_connection_factory into state class
itself to encapsulate its internal state and stop managing it from the
`dns_connection_factory`
Instead of streaming fully contained sstables, download them directly
and attach them to their corresponding table. This simplifies the
process and avoids unnecessary streaming overhead.
Filter the content of sstable(s), including or excluding the specified
partitions. Partitions can be provided on the command line via
`--partition`, or in a file via `--partitions-file`.
Produces one output sstable per input sstable -- if the filter selects
at least one partition in the respective input sstable.
Output sstables are placed in the path provided via `--oputput-dir`.
Use `--merge` to filter all input sstables combined, producing one
output sstable.
This flag was added to operations which have an --output-dir
command-line arguments. These operations write sstables and need a
directory where to write them. Back in the numeric-generation world this
posed a problem: if the directory contained any sstable, generation
clash was almost guaranteed, because each scylla-sstable command
invokation would start output generations from 1. To avoid this, empty
output directory was a requirement, with the
--unsafe-accept-nonempty-output-dir allowing for a force-override.
Now in the timeuuid generation days, all this is not necessary anymore:
generations are unique, so it is not a problem if the output directory
already contains sstables: the probability of generation clash is almost
0. Even if it happens, the tool will just simply fail to write the new
sstable with the clashing generation.
Remove this historic relic of a flag and the related logic, it is just a
pointless nuissance nowadays.
When one request is super slow and req/s high
in theory we have a collision on id, this patch
avoids that by reusing id and aborting when there
is no free one (unlikely).
This commit avoids leaking seastar::async future from two benchmark
tools: perf-alternator and perf-cql-raw. Additionally it adds
abort_source for fast and clean shutdown.
This patch adds a reproducer test showing issue #28183 - that when LWT
is used, hidden tables "...$paxos" are created but they are unexpectedly
shown by DESC TABLES, DESC SCHEMA and DESC KEYSPACE.
The new test was failing (in three places) on Scylla, as those internal
(and illegally-named) tables are listed, and passes on Cassandra
(which doesn't add hidden tables for LWT).
The commit also contains another test, which verifies if direct
description of paxos state table is wrapped in comment.
Refs #28183.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Remove the `make_sstable_streamer` function and inline its usage where
needed. This change allows passing different sets of arguments
directly at the call sites.
When working directly with sstable components, the provided name should
be only the file name without path prefixes. Any prefixing tokens
belong in the 'prefix' argument, as the name suggests.
Paxos state tables are internal tables fully managed by Scylla
and they shouldn't be exposed to the user nor they shouldn't be backed up.
This commit hides those kind of tables from all listings and if such table
is directly described with `DESC ks."tbl$paxos"`, the description is generated
withing a comment and a note for the user is added.
Fixesscylladb/scylladb#28183
Before this commit, if a test file or a test suite didn't include
any actual test cases, it was ignored by `boost_test_tree_lister`.
However, this information is useful; for example, it allows us to tell
if the test file the user wants to run doesn't exist or simply doesn't
contain any tests. The kind of error we would return to them should be
different depending on which situation we're dealing with.
We start including those empty suites and files in the output of
`--list_json_content`.
---
Examples (with additional formatting):
* Consider the following test file, `test/boost/dummy_test.cc` [1]:
```
BOOST_AUTO_TEST_SUITE(dummy_suite1)
BOOST_AUTO_TEST_SUITE(dummy_suite2)
BOOST_AUTO_TEST_SUITE_END()
BOOST_AUTO_TEST_SUITE_END()
BOOST_AUTO_TEST_SUITE(dummy_suite3)
BOOST_AUTO_TEST_SUITE_END()
```
Before this commit:
```
$ ./build/debug/test/boost/dummy_test -- --list_json_content
[{"file": "test/boost/dummy_test.cc", "content": {"suites": [], "tests": []}}]
```
After this commit:
```
$ ./build/debug/test/boost/dummy_test -- --list_json_content
[{"file":"test/boost/dummy_test.cc", "content": {"suites": [
{"name": "dummy_suite1", "suites": [
{"name": "dummy_suite2", "suites": [], "tests": []}
], "tests": []},
{"name": "dummy_suite3", "suites": [], "tests": []}
], "tests": []}}]
```
* Consider the same test file as in Example 1, but also assume it's compiled
into `test/boost/combined_tests`.
Before this commit:
```
$ ./build/debug/test/boost/combined_tests -- --list_json_content | grep dummy
$
```
After this commit:
```
$ ./build/debug/test/boost/combined_tests -- --list_json_content
[..., {"file": "test/boost/dummy_test.cc", "content": {"suites": [
{"name": "dummy_suite1", "suites":
[{"name": "dummy_suite2", "suites": [], "tests": []}],
"tests": []},
{"name": "dummy_suite3", "suites": [], "tests": []}],
"tests":[]}}, ...]
```
[1] Note that the example is simplified. As of now, it's not possible to use
`--list_json_content` with a file without any Boost tests. That will
result in the following error: `Test setup error: test tree is empty`.
Refs scylladb/scylladb#25415
In scylladb/scylladb@afde5f668a, we
implemented custom collection of information about Boost tests
in the repository. The solution boiled down to traversing through
the test tree via callbacks provided by Boost.Test and calling that
code from a global fixture. This way, the code is called automatically
by the framework.
Unfortunately, for an unknown reason, this leads to labels of test units
being duplicated. We haven't found the root cause yet and so we
deduplicate the labels manually.
---
Example (with additional formatting):
Consider the following test in the file `test/boost/dummy_test.cc`:
```
SEASTAR_TEST_CASE(dummy_case, *boost::unit_test::label("mylabel1")) {
return make_ready_future();
}
```
Before this commit:
```
$ ./build/dev/test/boost/dummy_test -- --list_json_content
[{"file": "test/boost/dummy_test.cc", "content": {"suites": [],
"tests": [{"name": "dummy_case", "labels": "mylabel1,mylabel1"}]}
}]
```
After this commit:
```
$ ./build/dev/test/boost/dummy_test -- --list_json_content
[{"file": "test/boost/dummy_test.cc", "content": {"suites": [],
"tests": [{"name": "dummy_case", "labels": "mylabel1"}]}
}]
```
Refs scylladb/scylladb#25415
Refactor streams.cc - turn `.then` calls into coroutines.
Reduces amount of clutter, lambdas and referenced variables.
Note - the code is kept at the same indentation level to ease review,
the next commit will fix this.
When this column and relevant SUPPORTED key were added, the
documentation was mistakenly put in the section about shard awareness
extension. This commit moves the documentation into a dedicated section.
I also expended it to describe both the new column and the new SUPPORTED
key.
- name:Comment and close if author email is scylladb.com
uses:actions/github-script@v7
with:
github-token:${{ secrets.GITHUB_TOKEN }}
script:|
const issue = context.payload.issue;
const actor = context.actor;
// Get user data (only public email is available)
const { data: user } = await github.rest.users.getByUsername({
username: actor,
});
const email = user.email || "";
console.log(`Actor: ${actor}, public email: ${email || "<none>"}`);
// Only continue if email exists and ends with @scylladb.com
if (!email || !email.toLowerCase().endsWith("@scylladb.com")) {
console.log("User is not a scylladb.com email (or email not public); skipping.");
return;
}
const owner = context.repo.owner;
const repo = context.repo.repo;
const issue_number = issue.number;
const body = "Issues in this repository are closed automatically. Scylla associates should use Jira to manage issues.\nPlease move this issue to Jira https://scylladb.atlassian.net/jira/software/c/projects/SCYLLADB/list";
CLEANER_DIRS:test/unit exceptions alternator api auth cdc compaction db dht gms index lang message mutation mutation_writer node_ops raft redis replica service
seastar::metrics::make_histogram("batch_item_count_histogram",seastar::metrics::description("Histogram of the number of items in a batch request"),labels,
seastar::metrics::make_histogram("batch_item_count_histogram",seastar::metrics::description("Histogram of the number of items in a batch request"),labels,
co_returnapi_error::unknown_operation("UpdateTimeToLive not yet supported. Experimental support is available if the 'alternator-ttl' experimental feature is enabled on all nodes.");
co_returnapi_error::unknown_operation("UpdateTimeToLive not yet supported. Upgrade all nodes to a version that supports it.");
// Should never happen - we verified the column's type
// before starting the scan.
[[unlikely]]
on_internal_error(tlogger,format("expiration scanner value of unsupported type {} in column {}",meta[*expiration_column]->type->cql3_type_name(),scan_ctx.column_name));
"summary":"Trigger compaction of the key-value storage",
"type":"void",
"nickname":"logstor_compaction",
"produces":[
"application/json"
],
"parameters":[
{
"name":"major",
"description":"When true, perform a major compaction",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/logstor_flush",
"operations":[
{
"method":"POST",
"summary":"Trigger flush of logstor storage",
"type":"void",
"nickname":"logstor_flush",
"produces":[
"application/json"
],
"parameters":[]
}
]
},
{
"path":"/storage_service/active_repair/",
"operations":[
@@ -3085,6 +3124,48 @@
}
]
},
{
"path":"/storage_service/tablets/snapshots",
"operations":[
{
"method":"POST",
"summary":"Takes the snapshot for the given keyspaces/tables. A snapshot name must be specified.",
"type":"void",
"nickname":"take_cluster_snapshot",
"produces":[
"application/json"
],
"parameters":[
{
"name":"tag",
"description":"the tag given to the snapshot",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"keyspace",
"description":"Keyspace(s) to snapshot. Multiple keyspaces can be provided using a comma-separated list. If omitted, snapshot all keyspaces.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"table",
"description":"Table(s) to snapshot. Multiple tables (in a single keyspace) can be provided using a comma-separated list. If omitted, snapshot all tables in the given keyspace(s).",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/quiesce_topology",
"operations":[
@@ -3187,6 +3268,38 @@
}
]
},
{
"path":"/storage_service/logstor_info",
"operations":[
{
"method":"GET",
"summary":"Logstor segment information for one table",
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.