Compare commits

...

587 Commits

Author SHA1 Message Date
Yaniv Michael Kaul
8c293dfe36 schema/schema_registry: reformat code style
Reformat constructor initializer lists, brace placement, and
line wrapping for consistency.

The seastar logger already checks is_enabled() before formatting
arguments, so explicit guards around trace calls with simple
variable arguments are unnecessary.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:41 +02:00
Yaniv Michael Kaul
97fb93a5b4 utils/gcp/object_storage: fix dead code and format string bug
Severity: low (dead code) + correctness (format string bug)

- Remove unused error_msg variable (parsed JSON but result discarded)
- Fix format string: "()" should be "({})" so _session_path is
  actually included in the debug log instead of being silently discarded
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:41 +02:00
Yaniv Michael Kaul
357b94ac60 utils/logalloc: remove redundant format() wrapper in reclamation rate log
Severity: low

format("{:.3f}", ...) was eagerly allocating a string to pass to
the logger. Move the precision specifier directly into the log format
string, letting the logger handle formatting lazily.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:40 +02:00
Yaniv Michael Kaul
037f291d48 dht/boot_strapper: reformat code style
Reformat function parameters, constructor initializer lists, and
line wrapping for consistency.

The seastar logger already checks is_enabled() before formatting
arguments, so explicit guards around debug calls with simple
variable arguments are unnecessary.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:40 +02:00
Yaniv Michael Kaul
b246a70404 dht/range_streamer: reformat code style
Reformat indentation, brace placement, lambda formatting, and
line wrapping for consistency.

The seastar logger already checks is_enabled() before formatting
arguments, so explicit guards around debug calls with simple
variable arguments are unnecessary.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:40 +02:00
Yaniv Michael Kaul
a071d54160 cdc/split: reformat code style
Reformat indentation, brace placement, lambda formatting, and
line wrapping for consistency.

The seastar logger already checks is_enabled() before formatting
arguments, so explicit guards around trace calls with simple
variable arguments are unnecessary.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:40 +02:00
Yaniv Michael Kaul
62d369c113 auth/ldap_role_manager: reformat code style
Reformat indentation, brace placement, and line wrapping for
consistency.

The seastar logger already checks is_enabled() before formatting
arguments, so explicit guards around debug calls with simple
variable arguments are unnecessary.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:40 +02:00
Yaniv Michael Kaul
bce09e055c locator/topology: defer expensive format() calls in debug log with value_of()
Severity: medium

Five format("{}", ...) ternary expressions create temporary strings
for a debug log on every topology node update. Use seastar::value_of()
to defer the formatting to log-time, avoiding 5 heap allocations when
debug logging is disabled.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:40 +02:00
Yaniv Michael Kaul
37004ac4dc locator/tablets: reformat code style
Reformat indentation, brace placement, lambda formatting, and
line wrapping for consistency.

The seastar logger already checks is_enabled() before formatting
arguments, so explicit guards around debug calls with simple
variable arguments are unnecessary.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:40 +02:00
Yaniv Michael Kaul
6f619908ce locator/network_topology_strategy: reformat code style
Reformat indentation, brace placement, and line wrapping for
consistency.

The seastar logger already checks is_enabled() before formatting
arguments, so explicit guards around debug calls with simple
variable arguments are unnecessary.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:40 +02:00
Yaniv Michael Kaul
0fe552b074 cql3/query_processor: reformat code style
Reformat indentation, brace placement, lambda formatting, and
line wrapping for consistency.

The seastar logger already checks is_enabled() before formatting
arguments, so explicit guards around debug calls with simple
variable arguments are unnecessary.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:39 +02:00
Yaniv Michael Kaul
14482e2e07 gms/gossiper: guard expensive gossip message formatting in debug/trace logs
Severity: high (includes correctness bug fix)

- Guard formatting of gossip_digest_syn + ack at debug level (high: large
  message structures formatted per gossip round)
- Guard formatting of gossip_digest_ack at trace level (high: includes
  full endpoint_state_map)
- Fix use-after-move bug: ack2_msg was read after std::move at line 405.
  Capture formatted string before the move, guard both log calls (high:
  correctness bug + waste)
- Guard nodes vector formatting inside parallel_for_each loop (high:
  O(N^2) formatting across N iterations)
- Guard live_endpoints + _endpoints_to_talk_with formatting (medium)
- Guard live_members set formatting called twice per gossip round (medium)
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:39 +02:00
Yaniv Michael Kaul
4bda6dded5 replica/database: defer eager set formatting in debug log with value_of()
Severity: medium

fmt::format("{}", ks_names_set) in a ternary expression eagerly
formats an unordered_set<sstring> even when debug logging is disabled.
Use seastar::value_of() to defer the formatting to log-time.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:39 +02:00
Yaniv Michael Kaul
0be4fac5e5 db/schema_applier: reformat code style
Reformat indentation, brace placement, lambda formatting, and
line wrapping for consistency.

The seastar logger already checks is_enabled() before formatting
arguments, so explicit guards around trace calls with simple
variable arguments are unnecessary.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:39 +02:00
Yaniv Michael Kaul
8818a0347e streaming/stream_transfer_task: reformat code style
Reformat indentation, brace placement, lambda formatting, and
line wrapping for consistency.

The seastar logger already checks is_enabled() before formatting
arguments, so explicit guards around debug calls with simple
variable arguments are unnecessary.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:39 +02:00
Yaniv Michael Kaul
489bdb2f93 repair/repair: defer percentage calculations in debug log with value_of()
Severity: low

Six percentage calculation function calls were evaluated eagerly as
debug log arguments on every repair range completion. Use
seastar::value_of() to defer the computations to log-time, skipping
them when debug logging is disabled.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:39 +02:00
Yaniv Michael Kaul
9efd94199d repair/row_level: avoid redundant coroutine invocations in debug logs
Severity: high

working_row_hashes().get() was being called multiple times purely for
debug log arguments, rebuilding the entire hash set each time. Cache
the result and wrap debug logs in is_enabled() guards to avoid the
expensive coroutine re-invocation when debug logging is disabled.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:39 +02:00
Yaniv Michael Kaul
10085ddb25 compaction/leveled_compaction_strategy: guard async I/O in debug log
Severity: high

co_await table_s.main_sstable_set() was called solely to get the
sstable count for a debug log message. This performs async I/O on
every LCS compaction even when debug logging is disabled. Wrap in
an is_enabled() guard.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:38 +02:00
Yaniv Michael Kaul
8635a2aa6f compaction/time_window_compaction_strategy: defer O(n^2) overlap computation in debug log with value_of()
Severity: high

Two sstable_set_overlapping_count() calls (O(n^2) overlap computation)
were passed as direct arguments to a debug log, causing them to be
evaluated eagerly even when debug logging is disabled. Use
seastar::value_of() to defer the computation to log-time.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:38 +02:00
Yaniv Michael Kaul
c7bb7b34c0 compaction/compaction_manager: avoid eager evaluation in debug log arguments
Severity: low-to-medium

- Remove redundant format() wrapping in debug log at consume_sstable()
  (low: one heap allocation per sstable consumed)
- Move get_compactions(filter).size() inside is_enabled() guard in
  do_stop_ongoing_compactions() (medium: builds entire compaction vector
  just for .size() in log message)
- Guard eager ::format() of descriptor.sstables with is_enabled(debug)
  (medium-high: formats entire sstable vector on every minor compaction)
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:38 +02:00
Marcin Maliszkiewicz
66be0f4577 Merge 'test: cluster: audit test suite optimization' from Dario Mirovic
Migrate audit tests from test/cluster/dtest to test/cluster. Optimize their execution time through cluster reuse.

The audit test suite is heavy. There are more than 70 test executions. Environment preparation is a significant part of each test case execution time.

This PR:
1. Copies audit tests from test/cluster/dtest to test/cluster, refactoring and enabling them
2. Groups tests functions by non-live cluster configuration variations to enable cluster reuse between them
    - Execution time reduced from 4m 29s to 2m 47s, which is ~38% execution time decrease
3. Removes the old audit tests from test/cluster/dtest

Includes two supporting changes:
- Allow specifying `AuthProvider` in `ManagerClient.get_cql_exclusive`
- Fix server log file handling for clean clusters

Refs [SCYLLADB-573](https://scylladb.atlassian.net/browse/SCYLLADB-573)

This PR is an improvement and does not require a backport.

[SCYLLADB-573]: https://scylladb.atlassian.net/browse/SCYLLADB-573?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#28650

* github.com:scylladb/scylladb:
  test: cluster: fix log clear race condition in test_audit.py
  test: pylib: shut down exclusive cql connections in ManagerClient
  test: cluster: fix multinode audit entry comparison in test_audit.py
  test: cluster: dtest: remove old audit tests
  test: cluster: group migrated audit tests for cluster reuse
  test: cluster: enable migrated audit tests and make them work
  test: pylib: manager_client: specify AuthProvider in get_cql_exclusive
  test: pylib: scylla cluster after_test log fix
  test: audit: copy audit test from dtest
2026-03-24 09:29:52 +01:00
Dario Mirovic
120f381a9d pgo: fix maintenance socket path too long
Maintenance socket path used for PGO is in the node workdir.
When the node workdir path is too long, the maintenance socket path
(workdir/cql.m) can exceed the Unix domain socket sun_path limit
and failing the PGO training pipeline.

To prevent this:
- pass an explicit --maintenance-socket override
  pointing to a short determinitic path in /tmp derived from the MD5
  hash of the workdir maintenance socket path
- update maintenance_socket_path to return the matching short path
  so that exec_cql.py connects to the right socket

The short path socket files are cleaned up after the cluster stops.

The path is using MD5 hash of the workdir path, so it is deterministic.

Fixes SCYLLADB-1070

Closes scylladb/scylladb#29149
2026-03-24 09:17:10 +01:00
Pavel Emelyanov
f112e42ddd raft: Fix split mutations freeze
Commit faa0ee9844 accidentally broke the way split snapshot mutation was
frozen -- instead of appending the sub-mutation `m` the commit kept the
old variable name of `mut` which in the new code corresponds to "old"
non-split mutation

Fixes #29051

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29052
2026-03-24 08:53:50 +02:00
Botond Dénes
56c375b1f3 Merge 'table: don't close a disengaged querier in query()' from Pavel Emelyanov
There's a flaw in table::query() -- calling querier_opt->close() can dereferences a disengaged std::optional. The fix pretty simple. Once fixed, there are two if-s checking for querier_opt being engaged or not that are worth being merged.

The problem doesn't really shows itself becase table::query() is not called with null saved_querier, so the de-facto if is always correct. However, better to be on safe-side.

The problem doesn't show itself for real, not worth backporting

Closes scylladb/scylladb#29142

* github.com:scylladb/scylladb:
  table: merge adjacent querier_opt checks in query()
  table: don't close a disengaged querier in query()
2026-03-24 08:47:35 +02:00
Yaniv Kaul
e59a21752d .github/workflows/trigger_jenkins.yaml: add workflow permissions
Potential fix for https://github.com/scylladb/scylladb/security/code-scanning/147.

To fix the problem, add an explicit `permissions:` block to the workflow
(either at the top level or inside the `trigger-jenkins` job) that
constrains the `GITHUB_TOKEN` to the minimal necessary privileges. This
codifies least-privilege in the workflow itself instead of relying on
repository or organization defaults.

The best minimal, non‑breaking change is to define a root‑level
`permissions:` block with read‑only contents access because the job does
not perform any write operations to the repository, nor does it interact
with issues, pull requests, or other GitHub resources. A conservative,
widely accepted baseline is `contents: read`. If later steps require more
permissions, they can be added explicitly, but for this snippet, no such
need is visible.

Concretely, in `.github/workflows/trigger_jenkins.yaml`, insert:

```yaml
permissions:
  contents: read
```

between the `name:` block and the `on:` block (e.g., after line 2).
No additional methods, imports, or definitions are needed since this is
a pure YAML configuration change and does not alter runtime behavior of
the existing shell steps.

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27815
2026-03-24 08:40:30 +02:00
Yaniv Kaul
85a531819b .github/workflows/trigger-scylla-ci.yaml: add permissions to workflow
Potential fix for https://github.com/scylladb/scylladb/security/code-scanning/169.

In general, the fix is to add an explicit `permissions:` block to the
workflow (at the root level or per job) so that the `GITHUB_TOKEN` has
only the minimal scopes needed. Since this job only reads event data and
uses secrets to talk to Jenkins, we can restrict `GITHUB_TOKEN` to
read‑only repository contents.

The single best fix here is to add a top‑level `permissions:` block
right under the `name:` (and before `on:`) in
`.github/workflows/trigger-scylla-ci.yaml`, setting `contents: read`.
This applies to all jobs in the workflow, including `trigger-jenkins`,
and does not alter any existing steps or logic. No additional imports or
methods are needed, as this is purely a YAML configuration change for
GitHub Actions.

Concretely, edit `.github/workflows/trigger-scylla-ci.yaml` to insert:

```yaml
permissions:
  contents: read
```

after line 1. No other lines in the file need to change.

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27812
2026-03-24 08:37:49 +02:00
Botond Dénes
772b32d9f7 test/scylla_gdb: fix flakiness by preparing objects at test time
Fixtures previously ran GDB once (module scope) to find live objects
(sstables, tasks, schemas) and stored their addresses. Tests then
reused those addresses in separate GDB invocations. Sometimes these
addresses would become stale and the test would step on use-after-free
(e.g. sstables compacted away between invocations).

Fix by dropping the fixtures. The helper functions used by the fixtures
to obtain the required objects are converted to gdb convenience
functions, which can be used in the same expression as the test command
invocation. Thus, the object is aquired on-demand at the moment it is
used, so it is guaranteed to be fresh and relevant.

Fixes: SCYLLADB-1020

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#28999
2026-03-23 16:54:03 +02:00
Piotr Dulikowski
60fb5270a9 logstor: fix fmt::format use with std::filesystem::path
The version of fmt installed on my machine refuses to work with
`std::filesystem::path` directly. Add `.string()` calls in places that
attempt to print paths directly in order to make them work.

Closes scylladb/scylladb#29148
2026-03-23 15:15:52 +01:00
Pavel Emelyanov
3b9398dfc8 Merge 'encryption: fix deadlock in encrypted_data_source::get()' from Ernest Zaslavsky
When encrypted_data_source::get() caches a trailing block in _next, the next call takes it directly — bypassing input_stream::read(), which checks _eof. It then calls input_stream::read_exactly() on the already-drained stream. Unlike read(), read_up_to(), and consume(), read_exactly() does not check _eof when the buffer is empty, so it calls _fd.get() on a source that already returned EOS.

In production this manifested as stuck encrypted SSTable component downloads during tablet restore: the underlying chunked_download_source hung forever on the post-EOS get(), causing 4 tablets to never complete. The stuck files were always block-aligned sizes (8k, 12k) where _next gets populated and the source is fully consumed in the same call.

Fix by checking _input.eof() before calling read_exactly(). When the stream already reached EOF, buf2 is known to be empty, so the call is skipped entirely.

A comprehensive test is added that uses a strict_memory_source which fails on post-EOS get(), reproducing the exact code path that caused the production deadlock.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1128

Backport to 2025.3/4 and 2026.1 is needed since it fixes a bug that may bite us in production, to be on the safe side

Closes scylladb/scylladb#29110

* github.com:scylladb/scylladb:
  encryption: fix deadlock in encrypted_data_source::get()
  test_lib: mark `limiting_data_source_impl` as not `final`
  Fix formatting after previous patch
  Fix indentation after previous patch
  test_lib: make limiting_data_source_impl available to tests
2026-03-23 17:12:44 +03:00
Piotr Dulikowski
df68d0c0f7 directories: add missing seastar/util/closeable.hh include
Without this include the file would not compile on its own. The issue
was most likely masked by the use of precompiled headers in our CI.

Closes scylladb/scylladb#29170
2026-03-23 15:46:56 +03:00
Yaniv Michael Kaul
051107f5bc scylla-gdb: fix sstable-summary crash on ms-format sstables
The 'scylla sstable-summary' GDB command crashes with
'ValueError: Argument "count" should be greater than zero' when
inspecting ms-format (trie-based) sstables. This happens because
ms-format sstables don't populate the traditional summary structure,
leaving all fields zeroed out, which causes gdb.read_memory() to be
called with a zero count.

Fix by:
- Adding zero-length guards to sstring.to_hex() and sstring.as_bytes()
  to return early when the data length is zero, consistent with the
  existing guard in managed_bytes.get().
- Adding the same guard to scylla_sstable_summary.to_hex().
- Detecting ms-format sstables (version == 5) early in
  scylla_sstable_summary.invoke() and printing an informative message
  instead of attempting to read the unpopulated summary.

Fixes: SCYLLADB-1180

Closes scylladb/scylladb#29162
2026-03-23 12:44:47 +02:00
Piotr Szymaniak
c8e7e20c5c test/cluster: retry create_table on transient schema agreement timeout
In test_index_requires_rf_rack_valid_keyspace, the create_table call
for a plain tablet-based table can fail with 'Unable to reach schema
agreement' after the server's 10s timeout is exceeded. This happens
when schema gossip propagation across the 4-node cluster takes longer
than expected after a sequence of rapid schema changes earlier in the
test.

Add a retry (up to 2 attempts) on schema agreement errors for this
specific create_table call rather than increasing the server-side
timeout.

Fixes: SCYLLADB-1135

Closes scylladb/scylladb#29132
2026-03-23 10:45:30 +02:00
Yaniv Kaul
fb1f995d6b .github/workflows/backport-pr-fixes-validation.yaml: workflow does not contain permissions (Potential fix for code scanning alert no. 139)
Potential fix for https://github.com/scylladb/scylladb/security/code-scanning/139,

To fix the problem, explicitly restrict the `GITHUB_TOKEN` permissions
for this workflow/job so it has only what is needed. The script reads PR
data and repository info (which is covered by `contents: read`/default
read scopes) and posts a comment via `github.rest.issues.createComment`,
which requires `issues: write`. No other write scopes (e.g., `contents:
write`, `pull-requests: write`) are necessary.

The best fix without changing functionality is to add a `permissions`
block scoped to this job (or at the workflow root). Since we only see a
single job here, we’ll add it under `check-fixes-prefix`. Concretely, in
`.github/workflows/backport-pr-fixes-validation.yaml`, between the
`runs-on: ubuntu-latest` line (line 10) and `steps:` (line 11), add:

```yaml
    permissions:
      contents: read
      issues: write
```

This keeps the token minimally privileged while still allowing the script
to create issue/PR comments.

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27810
2026-03-23 10:30:01 +02:00
Piotr Smaron
32225797cd dtest: fix flaky test_writes_schema_recreated_while_node_down
`read_barrier(session2)` was supposed to ensure `node2` has caught up on schema
before a CL=ALL write. But `patient_cql_connection(node2)` creates a
cluster-aware driver session `(TokenAwarePolicy(DCAwareRoundRobinPolicy()))`
that can route the barrier CQL statement to any node — not necessarily `node2`.
If the barrier runs on `node1` or `node3` (which already have the new schema),
it's a no-op, and `node2` remains stale, thus the observed `WriteFailure`.
The fix is to switch to `patient_exclusive_cql_connection(node2)`,
which uses `WhiteListRoundRobinPolicy([node2_ip])` to pin all CQL to `node2`.
This is already the established pattern used by other tests in the same file.

Fixes: SCYLLADB-1139

No need to backport yet, appeared only on master.

Closes scylladb/scylladb#29151
2026-03-23 10:25:54 +02:00
Michał Chojnowski
f29525f3a6 test/boost/cache_algorithm_test: disable sstable compression to avoid giant index pages
The test intentionally creates huge index pages.
But since 5e7fb08bf3,
the index reader allocates a block of memory for a whole index page,
instead of incrementally allocating small pieces during index parsing.
This giant allocation causes the test to fail spuriously in CI sometimes.

Fix this by disabling sstable compression on the test table,
which puts a hard cap of 2000 keys per index page.

Fixes: SCYLLADB-1152

Closes scylladb/scylladb#29152
2026-03-23 09:57:11 +02:00
Raphael S. Carvalho
05b11a3b82 sstables_loader: use new sstable add path
Use add_new_sstable_and_update_cache() when attaching SSTables
downloaded by the node-scoped local loader.

This is the correct variant for new SSTables: it can unlink the
SSTable on failure to add it, and it can split the SSTable if a
tablet split is in progress. The older
add_sstable_and_update_cache() helper is intended for preexisting
SSTables that are already stable on disk.

Additionally, downloaded SSTables are now left unsealed (TemporaryTOC)
until they are successfully added to the table's SSTable set. The
download path (download_fully_contained_sstables) passes
leave_unsealed=true to create_stream_sink, and attach_sstable opens
the SSTable with unsealed_sstable=true and seals it only inside the
on_add callback — matching the pattern used by stream_blob.cc and
storage_service.cc for tablet streaming.

This prevents a data-resurrection hazard: previously, if the process
crashed between download and attach_sstable, or if attach_sstable
failed mid-loop, sealed (TOC) SSTables would remain in the table
directory and be reloaded by distributed_loader on restart. With
TemporaryTOC, sstable_directory automatically cleans them up on
restart instead.

Fixes  https://scylladb.atlassian.net/browse/SCYLLADB-1085.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#29072
2026-03-23 10:33:04 +03:00
Piotr Szymaniak
f511264831 alternator/test: fix test_ttl_with_load_and_decommission flaky Connection refused error
The native Scylla nodetool reports ECONNREFUSED as 'Connection refused',
not as 'ConnectException' (which is the Java nodetool format). Add
'Connection refused' to the valid_errors list so that transient
connection failures during concurrent decommission/bootstrap topology
changes are properly tolerated.

Fixes SCYLLADB-1167

Closes scylladb/scylladb#29156
2026-03-22 11:01:45 +02:00
Pavel Emelyanov
7dce43363e table: merge adjacent querier_opt checks in query()
After the previous fix both guarding if-s start with 'if (querier_opt &&'.
Merge them into a single outer 'if (querier_opt)' block to avoid the
redundant check and make the structure easier to follow.

No functional change.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-20 14:48:08 +03:00
Piotr Dulikowski
cc695bc3f7 Merge 'vector_search: fix race condition on connection timeout' from Karol Nowacki
When a `with_connect` operation timed out, the underlying connection
attempt continued to run in the reactor. This could lead to a crash
if the connection was established/rejected after the client object had
already been destroyed. This issue was observed during the teardown
phase of a upcoming high-availability test case.

This commit fixes the race condition by ensuring the connection attempt
is properly canceled on timeout.

Additionally, the explicit TLS handshake previously forced during the
connection is now deferred to the first I/O operation, which is the
default and preferred behavior.

Fixes: SCYLLADB-832

Backports to 2026.1 and 2025.4 are required, as this issue also exists on those branches and is causing CI flakiness.

Closes scylladb/scylladb#29031

* github.com:scylladb/scylladb:
  vector_search: test: fix flaky test
  vector_search: fix race condition on connection timeout
2026-03-20 11:12:04 +01:00
Petr Gusev
4bfcd035ae test_fencing: add missing await-s
Fixes SCYLLADB-1099

Closes scylladb/scylladb#29133
2026-03-20 10:55:35 +01:00
Pavel Emelyanov
9c1c41df03 table: don't close a disengaged querier in query()
The condition guarding querier_opt->close() was:

When saved_querier is null the short-circuit makes the whole condition true
regardless of whether querier_opt is engaged.  If partition_ranges is empty,
query_state::done() is true before the while-loop body ever runs, so querier_opt
is never created.  Calling querier_opt->close() then dereferences a disengaged
std::optional — undefined behaviour.

Fix by checking querier_opt first:

This preserves all existing semantics (close when not saving, or when saving
wouldn't be useful) while making the no-querier path safe.

Why this doesn't surface today: the sole production call site, database::query(),
in practice.  The API header documents nullptr as valid ("Pass nullptr when
queriers are not saved"), so the bug is real but latent.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-20 12:25:13 +03:00
Pavel Emelyanov
c4a0f6f2e6 object_store: Don't leave dangling objects by iterating moved-from names vector
The code in upload_file std::move()-s vector of names into
merge_objects() method, then iterates over this vector to delete
objects. The iteration is apparently a no-op on moved-from vector.

The fix is to make merge_objects() helper get vector of names by const
reference -- the method doesn't modify the names collection, the caller
keeps one in stable storage.

Fixes #29060

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29061
2026-03-20 10:09:30 +02:00
Pavel Emelyanov
712ba5a31f utils: Use yielding directory_lister in owner verification
Switch directories::do_verify_owner_and_mode() from lister::scan_dir() to
utils::directory_lister while preserving the previous hidden-entry
behavior.

Make do_verify_subpath use lister::filter_type directly so the
verification helper can pass it straight into directory_lister, and keep
a single yielding iteration loop for directory traversal.

Minus one scan_dir user twards scan_dir removal from code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29064
2026-03-20 10:08:38 +02:00
Pavel Emelyanov
961fc9e041 s3: Don't rearm credential timers when credentials are not refreshed
The update_credentials_and_rearm() may get "empty" credentials from
_creds_provider_chain.get_aws_credentials() -- it doesn't throw, but
returns default-initialized value. In that case the expires_at will be
set to time_point::min, and it's probably not a good idea to arm the
refresh timer and, even worse idea, to subtract 1h from it.

Fixes #29056

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29057
2026-03-20 10:07:01 +02:00
Pavel Emelyanov
0a8dc4532b s3: Fix missing upload ID in copy_part trace log
The format string had two {} placeholders but three arguments, the
_upload_id one is skipped from formatting

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29053
2026-03-20 10:05:44 +02:00
Botond Dénes
bb5c328a16 Merge 'Squash two primary-replica restoration tests together' from Pavel Emelyanov
The test_restore_primary_replica_same_domain and test_restore_primary_replica_different_domain tests have very much in common. Previously both tests were also split each into two, so we have four tests, and now we have two that can also be squashed, the lines-of-code savings still worth it.

This is the continuation of #28569

Tests improvement, not backporting

Closes scylladb/scylladb#28994

* github.com:scylladb/scylladb:
  test: Replace a bunch of ternary operators with an if-else block
  test: Squash test_restore_primary_replica_same|different_domain tests
  test: Use the same regexp in test_restore_primary_replica_different|same_domain-s
2026-03-20 10:05:16 +02:00
Pavel Emelyanov
ea2a214959 test/backup: Use unique_name() for backup prefix instead of cf_dir
The do_test_backup_abort() fetched the node's workdir and resolved cf_dir
solely to construct a unique-ish backup prefix:

    prefix = f'{cf_dir}/backup'

The comment already acknowledged this was only "unique(ish)" — relying
on the UUID-derived cf_dir name as a uniqueness source is roundabout.
unique_name() is already imported and used for exactly this purpose
elsewhere in the file.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29030
2026-03-20 10:04:22 +02:00
Pavel Emelyanov
65032877d4 api: Move /storage_service/toppartitions from storage_service.cc to column_family.cc
The endpoint URL remains intact. Having it next to another toppartitions
endpoint (the /column_family/toppartitions one) is natural.

This endpoint only needs sharded<replica::database>&, grabs it from
http_context and doesn't use any other service. In column_family.cc the
database reference is already available as a parameter. Once more user
of http_context.db is gone.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#28996
2026-03-20 09:52:33 +02:00
Botond Dénes
de0bdf1a65 Merge 'Decouple test_refresh_deletes_uploaded_sstables from backup test-suite' from Pavel Emelyanov
The test in question uses several helpers from the backup sute, but it doesn't really need them -- the operations it want to perform can be performed with standard pylib methods. "While at it" also collect some dangling effectively unused local variables from this test (these were apparently left from backup tests this one was copied-and-reworked from)

Enhancing tests, not backporting

Closes scylladb/scylladb#29130

* github.com:scylladb/scylladb:
  test/refresh: Simplify refresh invocation
  test/refresh: Remove r_servers alias for servers
  test/refresh: Replace check_mutation_replicas with a plain CQL SELECT
  test/refresh: Inline keyspace/table/data setup in test_refresh_deletes_uploaded_sstables
  test/refresh: Prepare indentation for new_test_keyspace in test_refresh_deletes_uploaded_sstables
  test/refresh: Decouple test_refresh_deletes_uploaded_sstables from backup tests
  test/refresh: Remove unused wait_for_cql_and_get_hosts import
2026-03-20 09:29:15 +02:00
Botond Dénes
97430e2df5 Merge 'Fix object storage lister entries walking loop' from Pavel Emelyanov
Two issues found in the lister returned by gs_client_wrapper::make_object_lister()
Lister can report EOF too early in case filter is active, another one is potential vector out-of-bounds access

Fixes #29058

The code appeared in 2026.1, worth fixing it there as well

Closes scylladb/scylladb#29059

* github.com:scylladb/scylladb:
  sstables: Fix object storage lister not resetting position in batch vector
  sstables: Fix object storage lister skipping entries when filter is active
2026-03-20 09:12:42 +02:00
Botond Dénes
5573c3b18e Merge 'tablets: Fix deadlock in background storage group merge fiber' from Tomasz Grabiec
When it deadlocks, groups stop merging and compaction group merge
backlog will run-away.

Also, graceful shutdown will be blocked on it.

Found by flaky unit test
test_merge_chooses_best_replica_with_odd_count, which timed-out in 1
in 100 runs.

Reason for deadlock:

When storage groups are merged, the main compaction group of the new
storage group takes a compaction lock, which is appended to
_compaction_reenablers_for_merging, and released when the merge
completion fiber is done with the whole batch.

If we accumulate more than 1 merge cycle for the fiber, deadlock
occurs. Lock order will be this

Initial state:

 cg0: main
 cg1: main
 cg2: main
 cg3: main

After 1st merge:

 cg0': main [locked], merging_groups=[cg0.main, cg1.main]
 cg1': main [locked], merging_groups=[cg2.main, cg3.main]

After 2nd merge:

 cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main]

merge completion fiber will try to stop cg0'.main, which will be
blocked on compaction lock. which is held by the reenabler in
_compaction_reenablers_for_merging, hence deadlock.

The fix is to wait for background merge to finish before we start the
next merge. It's achieved by holding old erm in the background merge,
and doing a topology barrier from the merge finalizing transition.

Background merge is supposed to be a relatively quick operation, it's
stopping compaction groups. So may wait for active requests. It
shouldn't prolong the barrier indefinitely.

Tablet tests which trigger merge need to be adjusted to call the
barrier, otherwise they will be vulnerable to the deadlock.

Fixes SCYLLADB-928

Backport to >= 2025.4 because it's the earliest vulnerable due to f9021777d8.

Closes scylladb/scylladb#29007

* github.com:scylladb/scylladb:
  tablets: Fix deadlock in background storage group merge fiber
  replica: table: Propagate old erm to storage group merge
  test: boost: tablets_test: Save tablet metadata when ACKing split resize decision
  storage_service: Extract local_topology_barrier()
2026-03-20 09:05:52 +02:00
Botond Dénes
34473302b0 Merge 'docs: document existing guardrails' from Andrzej Jackowski
This patch series introduces a new documentation for exiting guardrails.

Moreover:
 - Warning / failure messages of recently added write CL guardrails (SCYLLADB-259) are rephrased, so all guardrails have similar messages.
 - Some new tests are added, to help verify the correctness of the documentation and avoid situations where the documentation and implementation diverge.

Fixes: [SCYLLADB-257](https://scylladb.atlassian.net/browse/SCYLLADB-257)

No backport, just new docs and tests.

[SCYLLADB-257]: https://scylladb.atlassian.net/browse/SCYLLADB-257?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#29011

* github.com:scylladb/scylladb:
  test: add new guardrail tests matching documentation scenarios
  test: add metric assertions to guardrail replication strategy tests
  test: use regex matching in guardrail replication strategy tests
  test: extract ks_opts helper in test_guardrail_replication_strategy
  docs: document CQL guardrails
  cql: improve write consistency level guardrail messages
2026-03-20 08:56:00 +02:00
artem.penner
9898e5700b scylla-node-exporter: Add systemd collector to node exporter
This PR enables the node_exporter systemd collector and configures the unit whitelist to include scylla-server.service and systemd-coredump services.

**Motivation**: We currently lack visibility into system-level service states, which is critical for diagnosing stability issues.

This configuration enables two specific use cases:
- Detecting Coredump Loops: We encounter scenarios where ScyllaDB enters a restart loop. To pinpoint SIGSEGV (coredumps) as the root cause, we need to track when the systemd-coredump service becomes active, indicating a dump is being processed.
- Identifying Startup Failures: We need to detect when the scylla-server unit enters a failed state. This is essential for catching unrecoverable errors (e.g., corrupted commitlogs or configuration bugs) that prevent the server from starting.

example of promql queries:
- `node_systemd_unit_state{name=~"systemd-coredump@.*", state="active"} == 1`
- `node_systemd_unit_state{name="scylla-server.service", state="failed"} == 1`

Closes #28402
2026-03-20 08:39:56 +02:00
Andrzej Jackowski
10c4b9b5b0 test: verify signal() detects resource negative leak in rcs
reader_concurrency_semaphore::signal() guards against available
resources exceeding the initial limit after a signal, which would
indicate a bug such as double-returning resources. It reports the
issue via on_internal_error_noexcept and clamps resources back to
the initial values. However, before this commit there were no tests
that verified this behavior, so bugs like SCYLLADB-1014 went
undetected.

Add a test that artificially signals resources that were never
consumed and verifies that signal() detects the negative leak and
clamps available resources back to the initial limit.

Refs: SCYLLADB-1014
Fixes: SCYLLADB-1031

Closes scylladb/scylladb#28993
2026-03-20 09:21:20 +03:00
Botond Dénes
f9adbc7548 test/cqlpy/test_tombstone_limit.py: disable tombstone-gc for test table
Since 7564a56dc8, all tables default to
repair-mode tombstone-gc, which is identical to immediate-mode for RF=1
tables. Consequently the tombstones written by the tests in this test
file are immediately collectible and with some unlucky timing, some of
them can be collected before the end of the test, failing the empty-page
prefix check because the empty pages prefix will be smaller than
expected based on the number of tombstones written.
Disable tombstone-gc to remove this source of flakyness.

Fixes: SCYLLADB-1062

Closes scylladb/scylladb#29077
2026-03-20 09:14:29 +03:00
Michał Chojnowski
6b18d95dec test: add a missing reconnect_driver in test_sstable_compression_dictionaries_upgrade.py
Need to work around https://github.com/scylladb/python-driver/issues/295,
lest a CQL query fail spuriously after the cluster restart.

Fixes: SCYLLADB-1114

Closes scylladb/scylladb#29118
2026-03-20 09:05:14 +03:00
Botond Dénes
89388510a0 test/cluster/test_data_resurrection_in_memtable.py: use explicit CL
The test has expectation w.r.t which write makes it to which nodes:
* inserts make it to all nodes
* delete makes it to all-1 (QUORUM) node

However, this was not expressed with CL, and the default CL=ONE allowed
for some nodes missing the writes and this violating the tests
expectations on what data is persent on which nodes. This resulted on
the test being flaky and failing on the data checks.

Use explicit CL for the ingestion to prevent this.

The improvements to the test introduced in
a8dd13731f was of great help in
investigating this: traces are now available and the check happens after
the data was dumped to logs.

Fixes: SCYLLADB-870
Fixes: SCYLLADB-812
Fixes: SCYLLADB-1102

Closes scylladb/scylladb#29128
2026-03-20 09:02:57 +03:00
Avi Kivity
6b259babeb Merge 'logstor: initial log-structured storage for key-value tables' from Michael Litvak
Introduce an initial and experimental implementation of an alternative log-structured storage engine for key-value tables.

Main flows and components:
* The storage is composed of 32MB files, each file divided to segments of size 128k. We write to them sequentially records that contain a mutation and additional metadata. Records are written to a buffer first and then written to the active segment sequentially in 4k sized blocks.
* The primary index in memory maps keys to their location on disk. It is a B-tree per-table that is ordered by tokens, similar to a memtable.
* On reads we calculate the key and look it up in the primary index, then read the mutation from disk with a single disk IO.
* On writes we write the record to a buffer, wait for it to be written to disk, then update the index with the new location, and free the previous record.
* We track the used space in each segment. When overwriting a record, we increase the free space counter for the segment of the previous record that becomes dead. We store the segments in a histogram by usage.
* The compaction process takes segments with low utilization, reads them and writes the live records to new segments, and frees the old segments.
* Segments are initially "mixed" - we write to the active segment records from all tables and all tablets. The "separator" process rewrites records from mixed segments into new segments that are organized by compaction groups (tablets), and frees the mixed segments. Each write is written to the active segment and to a separator buffer of the compaction group, which is eventually flushed to a new segment in the compaction group.

Currently this mode is experimental and requires an experimental flag to be enabled.
Some things that are not supported yet are strong consistency, tablet migration, tablet split/merge, big mutations, tombstone gc, ttl.

to use, add to config:
```
enable_logstor: true

experimental_features:
  - logstor
```

create a table:
```
CREATE TABLE ks.t(pk int PRIMARY KEY, a int, v text) WITH storage_engine = 'logstor';
```

INSERT, SELECT, DELETE work as expected
UPDATE not supported yet

no backport - new feature

Closes scylladb/scylladb#28706

* github.com:scylladb/scylladb:
  logstor: trigger separator flush for buffers that hold old segments
  docs/dev: add logstor documentation
  logstor: recover segments into compaction groups
  logstor: range read
  logstor: change index to btree by token per table
  logstor: move segments to replica::compaction_group
  db: update dirty mem limits dynamically
  logstor: track memory usage
  logstor: logstor stats api
  logstor: compaction buffer pool
  logstor: separator: flush buffer when full
  logstor: hold segment until index updates
  logstor: truncate table
  logstor: enable/disable compaction per table
  logstor: separator buffer pool
  test: logstor: add separator and compaction tests
  logstor: segment and separator barrier
  logstor: separator debt controller
  logstor: compaction controller
  logstor: recovery: recover mixed segments using separator
  logstor: wait for pending reads in compaction
  logstor: separator
  logstor: compaction groups
  logstor: cache files for read
  logstor: recovery: initial
  logstor: add segment generation
  logstor: reserve segments for compaction
  logstor: index: buckets
  logstor: add buffer header
  logstor: add group_id
  logstor: record generation
  logstor: generation utility
  logstor: use RIPEMD-160 for index key
  test: add test_logstor.py
  api: add logstor compaction trigger endpoint
  replica: add logstor to db
  schema: add logstor cf property
  logstor: initial commit
  db: disable tablet balancing with logstor
  db: add logstor experimental feature flag
2026-03-20 00:18:09 +02:00
Avi Kivity
062751fcec Merge 'db/config: enable ms sstable format by default' from Łukasz Paszkowski
Trie-based sstable indexes are supposed to be (hopefully) a better default than the old BIG indexes.
Make the new format a new default for new clusters by naming ms in the default scylla.yaml.

New functionality. No backport needed.

This PR is basically Michał's one https://github.com/scylladb/scylladb/pull/26377, Jakub's  https://github.com/scylladb/scylladb/pull/27332 fixing `sstables_manager::get_highest_supported_format()` and one test fix.

Closes scylladb/scylladb#28960

* github.com:scylladb/scylladb:
  db/config: announce ms format as highest supported
  db/config: enable `ms` sstable format by default
  cluster/dtest/bypass_cache_test: switch from highest_supported_sstable_format to chosen_sstable_format
  api/system: add /system/chosen_sstable_version
  test/cluster/dtest: reduce num_tokens to 16
2026-03-19 18:19:01 +02:00
Pavel Emelyanov
969dddb630 test/refresh: Simplify refresh invocation
take_snapshot return values were unused so drop them. do_refresh was a
thin wrapper around load_new_sstables that added no logic; inline it
directly into the gather expression.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-19 18:42:57 +03:00
Pavel Emelyanov
de21572b31 test/refresh: Remove r_servers alias for servers
r_servers = servers was a no-op assignment; use servers directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-19 18:42:52 +03:00
Pavel Emelyanov
20b1531e6d test/refresh: Replace check_mutation_replicas with a plain CQL SELECT
The goal of test_refresh_deletes_uploaded_sstables is to verify that
sstables are removed from the upload directory after refresh. The replica
check was just a sanity guard; a simple SELECT of all keys is sufficient
and much lighter.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-19 18:42:48 +03:00
Pavel Emelyanov
c591b9ebe2 test/refresh: Inline keyspace/table/data setup in test_refresh_deletes_uploaded_sstables
Replace create_dataset() with explicit keyspace creation via new_test_keyspace,
inline CREATE TABLE, and direct cql.run_async inserts — matching the pattern
used in do_test_streaming_scopes. This removes the last dependency on backup
helpers for dataset setup and makes the test self-contained.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-19 18:42:44 +03:00
Pavel Emelyanov
06006a6328 test/refresh: Prepare indentation for new_test_keyspace in test_refresh_deletes_uploaded_sstables
Wrap the test body under if True: to pre-indent it, making the subsequent
patch that introduces new_test_keyspace a pure content change with no
whitespace noise.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-19 18:42:40 +03:00
Pavel Emelyanov
67d8cde42d test/refresh: Decouple test_refresh_deletes_uploaded_sstables from backup tests
Replace create_cluster() from object_store/test_backup.py with a plain
manager.servers_add(2) call. The test does not use object storage, so
there is no need to pull in the backup helper along with its config and
logging knobs.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-19 18:42:36 +03:00
Pavel Emelyanov
04f046d2d8 test/refresh: Remove unused wait_for_cql_and_get_hosts import
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-19 18:42:32 +03:00
Botond Dénes
e8b37d1a89 Merge 'doc: fix the installation section' from Anna Stuchlik
This PR fixes the Installation page:

- Replaces `http `with `https `in the download command.
- Replaces the Open Source example from the Installation section for CentOS (we overlooked this example before).

Fixes https://github.com/scylladb/scylladb/issues/29087
Fixes https://github.com/scylladb/scylladb/issues/29087

This update affects all supported versions and should be backported as a bug fix.

Closes scylladb/scylladb#29088

* github.com:scylladb/scylladb:
  doc: remove the Open Source Example from Installation
  doc: replace http with https in the installation instructions
2026-03-19 17:13:53 +02:00
Dario Mirovic
d2c44722e1 test: cluster: fix log clear race condition in test_audit.py
assert_entries_were_added:
- takes a "before" snapshot of the audit log
- yields to execute a statement
- takes an "after" snapshot of the audit log
- computes new rows by diffing "after" minus "before"

If an audit entry generated by prepare() arrives between the snapshot
and the diff, it inflates the new row count and the test fails with
assert 2 <= 1.

Fix by:
- Adding clear_audit_logs() at the end of prepare(), after all setup
- Waiting for the "completed re-reading configuration file" log message
  after server_update_config
- Draining pending syslog lines before clearing the buffer

Refs SCYLLADB-573
2026-03-19 16:12:13 +01:00
Dario Mirovic
821f8696a7 test: pylib: shut down exclusive cql connections in ManagerClient
get_cql_exclusive() creates a Cluster object per call, but never
records it. driver_close() cannot shut it down. The cluster's
internal scheduler thread then tries to submit work to an already
shut down executor. This causes RuntimeError:

RuntimeError: cannot schedule new futures after shutdown

Fix this by tracking every exclusive Cluster in a list and shutting
them all down in driver_close().

Refs SCYLLADB-573
2026-03-19 16:12:13 +01:00
Dario Mirovic
d94999f87b test: cluster: fix multinode audit entry comparison in test_audit.py
assert_entries_were_added computes new audit rows by slicing the "after"
list at the length of the "before" list: rows_after[len(rows_before):].
This assumes new rows always appear at the tail of the combined sorted
list. In a multinode setup, each node generates its own event_time
timestamps. A new row from node A can sort before an old row from node
B, breaking the tail assumption. The assertion "new rows are not the
last rows in the audit table" then fires.

Fix this by splitting the before/after lists per node and computing the
new rows tail independently for each node. This guarantees that per node
ordering, which is monotonic, is respected, and the combined new rows
are sorted afterwards.

Refs SCYLLADB-573
2026-03-19 16:12:13 +01:00
Dario Mirovic
249a6cec1b test: cluster: dtest: remove old audit tests
Since audit tests have been migrated to test/cluster/test_audit.py,
old tests in test/cluster/dtest/audit_test.py have to be removed.

Refs SCYLLADB-573
2026-03-19 16:12:13 +01:00
Dario Mirovic
adc790a8bf test: cluster: group migrated audit tests for cluster reuse
This patch reorganizes the execution flow of the test functions.
They are grouped to enable cluster reuse between specific test
functions. One of the main contributors to the test execution time
is the cluster preparation. This patch significantly reduces the
total test execution time by having way less new cluster preparation
calls and more cluster reuse.

Performance increase on the developer machine is around 38%:
- before: 4m 29s
- after: 2m 47s

Fixes SCYLLADB-573
2026-03-19 16:11:47 +01:00
Dario Mirovic
967b7ff6bf test: cluster: enable migrated audit tests and make them work
Make audit tests from test/cluster/dtest to test/cluster.
test/cluster environment has less overhead, and audit tests
are heavy, their execution taking lots of time. This patch
is part of an effort to improve audit test suite performance.

This patch refactors the tests so that they execute correctly,
as well as enables them. A follow up patch will remove the
audit tests in test/cluster/dtest.

All the tests are confirmed to be running after the change.
No dead code present.

Test test_audit_categories_invalid is not parametrized anymore.
It never used the parametrized helper class, so it just ran
the same logic three times. This is why there are now 74,
and not 76, test executions.

Refs SCYLLADB-573
2026-03-19 16:07:28 +01:00
Dario Mirovic
5d51501a0b pgo: use maintenance socket for CQL setup in PGO training
The default 'cassandra' superuser was removed from ScyllaDB, which
broke PGO training. exec_cql.py relied on username/password auth
('cassandra'/'cassandra') to execute setup CQL scripts like auth.cql
and counters.cql.

Switch exec_cql.py to connect via the Unix domain maintenance socket
instead. The maintenance socket bypasses authentication, no credentials
are needed. Additionally, create the 'cassandra' superuser via the
maintenance socket during the populate phase, so that cassandra-stress
keeps working. cassandra-stress hardcodes user=cassandra password=cassandra.

Changes:
- exec_cql.py: replace host/port/username/password arguments with a
  single --socket argument; add connect_maintenance_socket() with
  wait ready logic
- pgo.py: add maintenance_socket_path() helper; update
  populate_auth_conns() and populate_counters() to pass the socket
  path to exec_cql.py

Fixes SCYLLADB-1070

Closes scylladb/scylladb#29081
2026-03-19 16:52:36 +02:00
Dario Mirovic
8367509b3b test: pylib: manager_client: specify AuthProvider in get_cql_exclusive
This patch allows ManagerClient.get_cql_exclusive to accept AuthProvider
as parameter. This will be used in a follow up patch which migrates
audit test suite to test/cluster and requires this functionality for
some tests.

Refs SCYLLADB-573
2026-03-19 15:35:24 +01:00
Dario Mirovic
0a7a69345c test: pylib: scylla cluster after_test log fix
Before any test, a pool of ScyllaCluster objects is created.

At the beginning of a test suite, a ScyllaClusterManager is created,
and given a reference to the pool.
At the end of a test suite, the ScyllaClusterManager is destroyed.

Before each test case:
- ManagerClient is constructed and connected to the ScyllaClusterManager
  of that test suite
- A ScyllaCluster object is fetched from the pool
  - If the pool is empty, a new ScyllaCluster object is created
  - If the pool is not empty, a cached ScyllaCluster object is returned

After each test case:
- Return ScyllaCluster object from ManagerClient to the pool
  - If the cluster is dirty, the pool destroys it
  - If the cluster is clean, the pool caches it
- ManagerClient is destroyed

Many actions mark a cluster as dirty. Normal test execution will always
make the cluster be destroyed upon returning to the pool.
ManagerClient.mark_clean is not used in the tests. When it is used,
the flow with cluster reuse happens.

The bug is that the log file is closed even if cluster is not dirty.
This causes an error when trying to log to a reused cluster server.

The solution in this patch is to not close the log file if the cluster
is not dirty. Upon cluster reuse the log file will be open and functional.

Another approach would be to reopen the log file if closed, but this
approach seems more clean.

Refs SCYLLADB-573
2026-03-19 15:35:24 +01:00
Dario Mirovic
899ae71349 test: audit: copy audit test from dtest
This patch just copies the audit test suite from dtest and
disables it in the test config file. Later patches will
update the code and enable the test suite.

Refs SCYLLADB-573
2026-03-19 15:35:24 +01:00
Andrzej Jackowski
4deeb7ebfc test: add new guardrail tests matching documentation scenarios
Add tests for RF guardrails (min/max warn/fail, RF=0 bypass,
threshold=-1 disable, ALTER KEYSPACE) and write consistency level
guardrails to cover all scenarios described in guardrails.rst.

Test runtime (dev):
test_guardrail_replication_strategy - 6s
test_guardrail_write_consistency_level - 5s

Refs: SCYLLADB-257
2026-03-19 15:07:03 +01:00
Andrzej Jackowski
2a03c634c0 test: add metric assertions to guardrail replication strategy tests
Verify that guardrail violations increment the corresponding metrics.

Refs: SCYLLADB-257
2026-03-19 15:07:03 +01:00
Andrzej Jackowski
81c4e717e2 test: use regex matching in guardrail replication strategy tests
Replace loose substring assertions with regex-based matching against
the exact server message formats. Add regex constants for all
guardrail messages and rewrite create_ks_and_assert_warnings_and_errors()
to verify count and content of warnings and failures.

Refs: SCYLLADB-257
2026-03-19 15:07:03 +01:00
Anna Stuchlik
6b1df5202c doc: remove the instructions to install old versions from Web Installer
The Web Installer page includes instructions to install the old pre-2025.1 Enterprise versions,
which are no longer supported (since we released 2026.1).

This commit removes those redundant and misleading instructions.

Fixes https://github.com/scylladb/scylladb/issues/29099

Closes scylladb/scylladb#29103
2026-03-19 15:47:00 +02:00
Piotr Dulikowski
171504c84f Merge 'auth: migrate some standard role manager APIs to use cache' from Marcin Maliszkiewicz
This patchset migrates: query_all_directly_granted, query_all,
get_attribute, query_attribute_for_all functions to use cache
instead of doing CQL queries. It also includes some preparatory
work which fixes cache update order and triggering.

Main motivation behind this is to make sure that all calls
from service_level_controller::auth_integration are cached,
which we achieve here.

Alternative implementation could move the whole auth_integration
data into auth cache but since auth_integration manages also lifetime
and contains service levels specific logic such solution would be
too complex for little (if any) gain.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-159
Backport: no, not a bug

Closes scylladb/scylladb#28791

* github.com:scylladb/scylladb:
  auth: switch query_attribute_for_all to use cache
  auth: switch get_attribute to use cache
  auth: cache: add heterogeneous map lookups
  auth: switch query_all to use cache
  auth: switch query_all_directly_granted to use cache
  auth: cache: add ability to go over all roles
  raft: service: reload auth cache before service levels
  service: raft: move update_service_levels_effective_cache check
2026-03-19 14:37:22 +01:00
Avi Kivity
5e7fb08bf3 Merge 'Fix bad performance for densely populated partition index pages' from Tomasz Grabiec
This applies to small partition workload where index pages have high partition count, and the index doesn't fit in cache. It was observed that the count can be in the order of hundreds. In such a workload pages undergo constant population, LSA compaction, and LSA eviction, which has severe impact on CPU utilization.

Refs https://scylladb.atlassian.net/browse/SCYLLADB-620

This PR reduces the impact by several changes:

  - reducing memory footprint in the partition index. Assuming partition key size is 16 bytes, the cost dropped from 96 bytes to 36 bytes per partition.

  - flattening the object graph and amortizing storage. Storing entries directly in the vector. Storing all key values in a single managed_bytes. Making index_entry a trivial struct.

  - index entries and key storage are now trivially moveable, and batched inside vector storage
    so LSA migration can use memcpy(), which amortizes the cost per key. This reduces the cost of LSA segment compaction.

 - LSA eviction is now pretty much constant time for the whole page
   regardless of the number of entries, because elements are trivial and batched inside vectors.
   Page eviction cost dropped from 50 us to 1 us.

Performance evaluated with:

   scylla perf-simple-query -c1 -m200M --partitions=1000000

Before:

```
7774.96 tps (166.0 allocs/op, 521.7 logallocs/op,  54.0 tasks/op,  802428 insns/op,  430457 cycles/op,        0 errors)
7511.08 tps (166.1 allocs/op, 527.2 logallocs/op,  54.0 tasks/op,  804185 insns/op,  430752 cycles/op,        0 errors)
7740.44 tps (166.3 allocs/op, 526.2 logallocs/op,  54.2 tasks/op,  805347 insns/op,  432117 cycles/op,        0 errors)
7818.72 tps (165.2 allocs/op, 517.6 logallocs/op,  53.7 tasks/op,  794965 insns/op,  427751 cycles/op,        0 errors)
7865.49 tps (165.1 allocs/op, 513.3 logallocs/op,  53.6 tasks/op,  788898 insns/op,  425171 cycles/op,        0 errors)
```

After (+318%):

```
32492.40 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109236 insns/op,  103203 cycles/op,        0 errors)
32591.99 tps (130.4 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  108947 insns/op,  102889 cycles/op,        0 errors)
32514.52 tps (130.6 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109118 insns/op,  103219 cycles/op,        0 errors)
32491.14 tps (130.6 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109349 insns/op,  103272 cycles/op,        0 errors)
32582.90 tps (130.5 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109269 insns/op,  102872 cycles/op,        0 errors)
32479.43 tps (130.6 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109313 insns/op,  103242 cycles/op,        0 errors)
32418.48 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109201 insns/op,  103301 cycles/op,        0 errors)
31394.14 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109267 insns/op,  103301 cycles/op,        0 errors)
32298.55 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109323 insns/op,  103551 cycles/op,        0 errors)
```

When the workload is miss-only, with both row cache and index cache disabled (no cache maintenance cost):

  perf-simple-query -c1 -m200M --duration 6000 --partitions=100000 --enable-index-cache=0 --enable-cache=0

Before:

```
9124.57 tps (146.2 allocs/op, 789.0 logallocs/op,  45.3 tasks/op,  889320 insns/op,  357937 cycles/op,        0 errors)
9437.23 tps (146.1 allocs/op, 789.3 logallocs/op,  45.3 tasks/op,  889613 insns/op,  357782 cycles/op,        0 errors)
9455.65 tps (146.0 allocs/op, 787.4 logallocs/op,  45.2 tasks/op,  887606 insns/op,  357167 cycles/op,        0 errors)
9451.22 tps (146.0 allocs/op, 787.4 logallocs/op,  45.3 tasks/op,  887627 insns/op,  357357 cycles/op,        0 errors)
9429.50 tps (146.0 allocs/op, 787.4 logallocs/op,  45.3 tasks/op,  887761 insns/op,  358148 cycles/op,        0 errors)
9430.29 tps (146.1 allocs/op, 788.2 logallocs/op,  45.3 tasks/op,  888501 insns/op,  357679 cycles/op,        0 errors)
9454.08 tps (146.0 allocs/op, 787.3 logallocs/op,  45.3 tasks/op,  887545 insns/op,  357132 cycles/op,        0 errors)
```

After (+55%):

```
14484.84 tps (150.7 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  396164 insns/op,  229490 cycles/op,        0 errors)
14526.21 tps (150.8 allocs/op,   6.5 logallocs/op,  44.8 tasks/op,  396401 insns/op,  228824 cycles/op,        0 errors)
14567.53 tps (150.7 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  396319 insns/op,  228701 cycles/op,        0 errors)
14545.63 tps (150.6 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  395889 insns/op,  228493 cycles/op,        0 errors)
14626.06 tps (150.5 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  395254 insns/op,  227891 cycles/op,        0 errors)
14593.74 tps (150.5 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  395480 insns/op,  227993 cycles/op,        0 errors)
14538.10 tps (150.8 allocs/op,   6.5 logallocs/op,  44.8 tasks/op,  397035 insns/op,  228831 cycles/op,        0 errors)
14527.18 tps (150.8 allocs/op,   6.5 logallocs/op,  44.8 tasks/op,  396992 insns/op,  228839 cycles/op,        0 errors)
```

Same as above, but with summary ratio increased from 0.0005 to 0.005 (smaller pages):

Before:

```
33906.70 tps (146.1 allocs/op,  83.6 logallocs/op,  45.1 tasks/op,  170553 insns/op,   98104 cycles/op,        0 errors)
32696.16 tps (146.0 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170369 insns/op,   98405 cycles/op,        0 errors)
33889.05 tps (146.1 allocs/op,  83.6 logallocs/op,  45.1 tasks/op,  170551 insns/op,   98135 cycles/op,        0 errors)
33893.24 tps (146.1 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170488 insns/op,   98168 cycles/op,        0 errors)
33836.73 tps (146.1 allocs/op,  83.6 logallocs/op,  45.1 tasks/op,  170528 insns/op,   98226 cycles/op,        0 errors)
33897.61 tps (146.0 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170428 insns/op,   98081 cycles/op,        0 errors)
33834.73 tps (146.1 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170438 insns/op,   98178 cycles/op,        0 errors)
33776.31 tps (146.3 allocs/op,  83.9 logallocs/op,  45.2 tasks/op,  170958 insns/op,   98418 cycles/op,        0 errors)
33808.08 tps (146.3 allocs/op,  83.9 logallocs/op,  45.2 tasks/op,  170940 insns/op,   98388 cycles/op,        0 errors)
```

After (+18%):

```
40081.51 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121047 insns/op,   82231 cycles/op,        0 errors)
40005.85 tps (148.6 allocs/op,   4.4 logallocs/op,  45.2 tasks/op,  121327 insns/op,   82545 cycles/op,        0 errors)
39816.75 tps (148.3 allocs/op,   4.4 logallocs/op,  45.1 tasks/op,  121067 insns/op,   82419 cycles/op,        0 errors)
39953.11 tps (148.1 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121027 insns/op,   82258 cycles/op,        0 errors)
40073.96 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121006 insns/op,   82313 cycles/op,        0 errors)
39882.25 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  120925 insns/op,   82320 cycles/op,        0 errors)
39916.08 tps (148.3 allocs/op,   4.4 logallocs/op,  45.1 tasks/op,  121054 insns/op,   82393 cycles/op,        0 errors)
39786.30 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121027 insns/op,   82465 cycles/op,        0 errors)
38662.45 tps (148.3 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121108 insns/op,   82312 cycles/op,        0 errors)
39849.42 tps (148.3 allocs/op,   4.4 logallocs/op,  45.1 tasks/op,  121098 insns/op,   82447 cycles/op,        0 errors)
```

Closes scylladb/scylladb#28603

* github.com:scylladb/scylladb:
  sstables: mx: index_reader: Optimize parsing for no promoted index case
  vint: Use std::countl_zero()
  test: sstable_partition_index_cache_test: Validate scenario of pages with sparse promoted index placement
  sstables: mx: index_reader: Amoritze partition key storage
  managed_bytes: Hoist write_fragmented() to common header
  utils: managed_vector: Use std::uninitialized_move() to move objects
  sstables: mx: index_reader: Keep promoted_index info next to index_entry
  sstables: mx: index_reader: Extract partition_index_page::clear_gently()
  sstables: mx: index_reader: Shave-off 16 bytes from index_entry by using raw_token
  sstables: mx: index_reader: Reduce allocation_section overhead during index page parsing by batching allocation
  sstables: mx: index_reader: Keep index_entry directly in the vector
  dht: Introduce raw_token
  test: perf_simple_query: Add 'sstable-format' command-line option
  test: perf_simple_query: Add 'sstable-summary-ratio' command-line option
  test: perf-simple-query: Add option to disable index cache
  test: cql_test_env: Respect enable-index-cache config
2026-03-19 14:42:50 +02:00
Botond Dénes
4981e72607 Merge 'replica: avoid unnecessary computation on token lookup hot path' from Łukasz Paszkowski
`storage_group_of()` sits on the replica-side token lookup hot path, yet it called `tablet_map::get_tablet_id_and_range_side()`, which always computes both the tablet id and the post-split range side — even though most callers only need the storage group id.

The range-side computation is only relevant when a storage group is in tablet splitting mode, but we were paying for it unconditionally on every lookup.

This series fixes that by:

1. Adding `tablet_map::get_tablet_range_side()` so the range side can be computed independently when needed.
2. Adding lazy `select_compaction_group()` overloads that defer the range-side computation until splitting mode is actually active.
3. Switching `storage_group_of()` to use the cheaper `get_tablet_id()` path, only computing the range side on demand.

Improvements. No backport is required.

Closes scylladb/scylladb#28963

* github.com:scylladb/scylladb:
  replica/table: avoid computing token range side in storage_group_of() on hot path
  replica/compaction_group: add lazy select_compaction_group() overloads
  locator/tablets: add tablet_map::get_tablet_range_side()
2026-03-19 14:27:12 +02:00
Ernest Zaslavsky
aa9da87e97 encryption: fix deadlock in encrypted_data_source::get()
When encrypted_data_source::get() caches a trailing block in
_next, the next call takes it directly — bypassing
input_stream::read(), which checks _eof. It then calls
input_stream::read_exactly() on the already-drained stream.
Unlike read(), read_up_to(), and consume(), read_exactly()
does not check _eof when the buffer is empty, so it calls
_fd.get() on a source that already returned EOS.

In production this manifested as stuck encrypted SSTable
component downloads during tablet restore: the underlying
chunked_download_source hung forever on the post-EOS get(),
causing 4 tablets to never complete. The stuck files were
always block-aligned sizes (8k, 12k) where _next gets
populated and the source is fully consumed in the same call.

Fix by checking _input.eof() before calling read_exactly().
When the stream already reached EOF, buf2 is known to be
empty, so the call is skipped entirely.

A comprehensive test is added that uses a strict_memory_source
which fails on post-EOS get(), reproducing the exact code
path that caused the production deadlock.
2026-03-19 13:54:54 +02:00
Ernest Zaslavsky
f74a54f005 test_lib: mark limiting_data_source_impl as not final 2026-03-19 13:54:54 +02:00
Ernest Zaslavsky
151e945d9f Fix formatting after previous patch 2026-03-19 13:54:44 +02:00
Andrzej Jackowski
517bb8655d test: extract ks_opts helper in test_guardrail_replication_strategy
Factor out ks_opts() to build keyspace options with tablets handling
and use it across all existing replication strategy guardrail tests.
No behavioral changes.

This facilitates further modification of the tests later in this
patch series.

Refs: SCYLLADB-257
2026-03-19 12:49:41 +01:00
Andrzej Jackowski
9b24d9ee7d docs: document CQL guardrails
Add docs/cql/guardrails.rst covering replication factor, replication
strategy, write consistency level, and compact storage guardrails.

Fixes: SCYLLADB-257
2026-03-19 12:49:41 +01:00
Ernest Zaslavsky
537747cf5d Fix indentation after previous patch 2026-03-19 13:48:53 +02:00
Ernest Zaslavsky
2535164542 test_lib: make limiting_data_source_impl available to tests
Relocate the `limiting_data_source_impl` declaration to the header file
so that test code can access it directly.
2026-03-19 13:48:53 +02:00
Botond Dénes
86d7c82993 test/cluster/test_repair.py: use tablets in test_repair_timestamp_difference
After repair, the test does a major to compact all sstables into a
single one, so the results can be simply checked by a select from
mutation_fragments() query. Sometimes off-strategy happens parallel to
this major, so after the major there are still 2 sstables, resulting in
the test failing when checking that the query returns just a single row.
To fix, just use tablets for the test table, tablets don't use
off-strategy anymore.

Fixes: SCYLLADB-940

Closes scylladb/scylladb#29071
2026-03-19 12:42:18 +03:00
Michael Litvak
399260a6c0 test: mv: fix flaky wait for commitlog sync
Previously the test test_interrupt_view_build_shard_registration stopped
the node ungracefully and used commitlog periodic mode to persist the
view build progress in a not very reliable way.

It can happen that due to timing issues, the view build progress is not
persisted, or some of it is persisted in a different ordering than
expected.

To make the test more reliable we change it to stop the node gracefully,
so the commitlog is persisted in a graceful and consistent way, without
using the periodic mode delay. We need to also change the injection for
the shutdown to not get stuck.

Fixes SCYLLADB-1005

Closes scylladb/scylladb#29008
2026-03-19 10:41:21 +01:00
Pavel Emelyanov
f27dc12b7c Merge 'Fix directory lister leak in table::get_snapshot_details: ' from Benny Halevy
As reported in SCYLLADB-1013, the directory lister must be closed also when an exception is thrown.

For example, see backtrace below:
```
seastar::on_internal_error(seastar::logger&, std::basic_string_view<char, std::char_traits<char>>) at ./build/release/seastar/./seastar/src/core/on_internal_error.cc:57
directory_lister::~directory_lister() at ./utils/lister.cc:77
replica::table::get_snapshot_details(std::filesystem::__cxx11::path, std::filesystem::__cxx11::path) (.resume) at ./replica/table.cc:4081
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<db::snapshot_ctl::table_snapshot_details>::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/coroutine:247
 (inlined by) seastar::internal::coroutine_traits_base<db::snapshot_ctl::table_snapshot_details>::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:129
seastar::reactor::task_queue::run_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2695
 (inlined by) seastar::reactor::task_queue_group::run_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3201
seastar::reactor::task_queue_group::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3185
 (inlined by) seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3353
seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3245
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:266
seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:160
scylla_main(int, char**) at ./main.cc:756
```

Fixes: [SCYLLADB-1013](https://scylladb.atlassian.net/browse/SCYLLADB-1013)

* Requires backport to 2026.1 since the leak exists since 004c08f525

[SCYLLADB-1013]: https://scylladb.atlassian.net/browse/SCYLLADB-1013?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#29084

* github.com:scylladb/scylladb:
  test/boost/database_test: add test_snapshot_ctl_details_exception_handling
  table: get_snapshot_details: fix indentation inside try block
  table: per-snapshot get_snapshot_details: fix typo in comment
  table: per-snapshot get_snapshot_details: always close lister using try/catch
  table: get_snapshot_details: always close lister using deferred_close
2026-03-19 12:40:23 +03:00
Raphael S. Carvalho
3143134968 test: avoid split/major compaction deadlock in tablet split test
Run keyspace compaction asynchronously in
`test_tombstone_gc_correctness_during_tablet_split` and only await it
after `split_sstable_rewrite` is disabled.

The problem is that `keyspace_compaction()` starts with a flush, and that
flush can take around five seconds. During that window the split
compaction is stopped before major compaction is retried. The stop aborts
the in-flight major compaction attempt, then the split proceeds far enough
to enter the `split_sstable_rewrite` injection point.

At that point the test used to wait synchronously for major compaction to
finish, but major compaction cannot finish yet: when it retries, it needs
the same semaphore that is still effectively tied up behind the blocked
split rewrite. So the test waits for major compaction, while the split
waits for the injection to be released, and the code that would release
that injection never runs.

Starting major compaction as a task breaks that cycle. The test can first
disable `split_sstable_rewrite`, let the split get out of the way, and
only then wait for major compaction to complete.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-827.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#29066
2026-03-19 11:12:21 +02:00
Botond Dénes
2e47fd9f56 Merge 'tasks: do not fail the wait request if rpc fails' from Aleksandra Martyniuk
During decommission, we first mark a topology request as done, then shut
down a node and in the following steps we remove node from the topology.
Thus,  finished request does not imply that a node is removed from
the topology.

Due to that, in node_ops_virtual_task::wait, while gathering children
from the whole cluster, we may hit the connection exception - because
a node is still in topology, even though it is down.

Modify the get_children method to ignore the exception and warn
about the failure instead.

Keep token_metadata_ptr in get_children to prevent topology from changing.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-867

Needs backports to all versions

Closes scylladb/scylladb#29035

* github.com:scylladb/scylladb:
  tasks: fix indentation
  tasks: do not fail the wait request if rpc fails
  tasks: pass token_metadata_ptr to task_manager::virtual_task::impl::get_children
2026-03-19 10:03:18 +02:00
Piotr Smaron
a2ad57062f docs/cql: clarify WHERE clause boolean limitations
Document that `SELECT ... WHERE` clause currently accepts only conjunctions
of relations joined by `AND` (`OR` is not supported), and that
parentheses cannot be used to group boolean subexpressions.
Add an unsupported query example and point readers to equivalent `IN`
rewrites when applicable.
This problem has been raised by one of our users in
https://forum.scylladb.com/t/error-parsing-query-or-unsupported-statement/5299,
and while one could infer answer to user's question by looking at the
syntax of the `SELECT ... WHERE`, it's not immediately obvious to
non-advanced users, so clarifying these concepts is justified.

Fixes: SCYLLADB-1116

Closes scylladb/scylladb#29100
2026-03-19 09:47:22 +02:00
Michael Litvak
31d339e54a logstor: trigger separator flush for buffers that hold old segments
A compaction group has a separator buffer that holds the mixed segments
alive until the separator buffer is flushed. A mixed segment can be
freed only after all separator buffers that hold writes from the segment
are flushed.

Typically a separator buffer is flushed when it becomes full. However
it's possible for example that one compaction groups is filled slower
than others and holds many segments.

To fix this we trigger a separator flush periodically for separator
buffers that hold old segments. We track the active segment sequence
number and for each separator buffer the oldest sequence number it
holds.
2026-03-18 19:24:28 +01:00
Michael Litvak
ad87eda835 docs/dev: add logstor documentation 2026-03-18 19:24:28 +01:00
Michael Litvak
a0da07e5b7 logstor: recover segments into compaction groups
Fix the logstor recovery to work with compaction groups. When recovering
a segment find its token range and add it to the appropriate compaction
groups. if it doesn't fit in a single compaction group then write each
record to its compaction group's separator buffer.
2026-03-18 19:24:28 +01:00
Michael Litvak
24379acc76 logstor: range read
extend the logstor mutation reader to support range read
2026-03-18 19:24:28 +01:00
Michael Litvak
a9d0211a64 logstor: change index to btree by token per table
Change the primary index to be a btree that is ordered by token,
similarly to a memtable, and create a index per-table instead of a
single global index.
2026-03-18 19:24:28 +01:00
Michael Litvak
e7c3942d43 logstor: move segments to replica::compaction_group
Add a segment_set member to replica::compaction_group that manages the
logstor segments that belong to the compaction group, similarly to how
it manages sstables. Add also a separator buffer in each compaction
group.

When writing a mutation to a compaction group, the mutation is written
to the active segment and to the separator buffer of the compaction
group, and when the separator buffer is flushed the segment is added to
the compaction_group's segment set.
2026-03-18 19:24:28 +01:00
Michael Litvak
d69f7eb0ee db: update dirty mem limits dynamically
when logstor is enabled, update the db dirty memory limits dynamically.

previously the threshold is set to 0.5 of the available memory, so 0.5
goes to memtables and 0.5 to others (cache).

when logstor is enabled, we calculate the available memory excluding
logstor, and divide it evenly between memtables and cache.
2026-03-18 19:24:27 +01:00
Michael Litvak
65cd0b5639 logstor: track memory usage
add logstor::get_memory_usage() that returns an estimate of the memory
usage by logstor. add tracking to how many unique keys are held in the
index.
2026-03-18 19:24:27 +01:00
Michael Litvak
b7bdb1010a logstor: logstor stats api
add api to get logstor statistics about segments for a table
2026-03-18 19:24:27 +01:00
Michael Litvak
8bd3bd7e2a logstor: compaction buffer pool
pre-allocate write buffers for compaction
2026-03-18 19:24:27 +01:00
Michael Litvak
caf5aa47c2 logstor: separator: flush buffer when full
flush separator buffers when they become full and switched instead of
aggregating all the buffers and flushing them when the separator is
switched.
2026-03-18 19:24:27 +01:00
Michael Litvak
6ddb7a4d13 logstor: hold segment until index updates
add a write gate to write_buffer. when writing a record to the write
buffer, the gate is held and passed back to the caller, and the caller
holds the gate until the write operation is complete, including
follow-up operations such as updating the index after the write.

in particular, when writing a mutation in logstor::write, the write
buffer is held open until the write is completed and updated in the
index.

when writing the write buffer to the active segment, we write the buffer
and then wait for the write buffer gate to close, i.e. we wait for all
index updates to complete before proceeding. the segment is held open
until all the write operations and index updates are complete.

this property is useful for correctness: when a segment is closed we
know that all the writes to it are updated in the index. this is needed
in compaction for example, where we take closed segments and check
which records in them are alive by looking them up in the index. if the
index is not updated yet then it will be wrong.
2026-03-18 19:24:27 +01:00
Michael Litvak
bd66edee5c logstor: truncate table
implement freeing all segments of a table for table truncate.

first do barrier to flush all active and mixed segments and put all the
table's data in compaction groups, then stop compaction for the table,
then free the table's segments and remove the live entries from the
index.
2026-03-18 19:24:27 +01:00
Michael Litvak
489efca47c logstor: enable/disable compaction per table
add functions to enable or disable compaction for a specific compaction
group or for all compaction groups of a table.
2026-03-18 19:24:27 +01:00
Michael Litvak
21db4f3ed8 logstor: separator buffer pool
pre-allocate write buffers for the separator
2026-03-18 19:24:27 +01:00
Michael Litvak
37c485e3d1 test: logstor: add separator and compaction tests 2026-03-18 19:24:27 +01:00
Michael Litvak
31aefdc07d logstor: segment and separator barrier
add barrier operation that forces switch of the active segment and
separator, and waits for all existing segments to close and all
separators to flush.
2026-03-18 19:24:27 +01:00
Michael Litvak
1231fafb46 logstor: separator debt controller
add tracking of the total separator debt - writes that were written to a
separator and waiting to be flushed, and add flow control to keep the
debt in control by delaying normal writes.
2026-03-18 19:24:27 +01:00
Michael Litvak
17cb173e18 logstor: compaction controller
adjust compaction shares by the compaction overhead: how many segments
compaction writes to generate a single free segment for new writes.
2026-03-18 19:24:27 +01:00
Michael Litvak
1da1bb9d99 logstor: recovery: recover mixed segments using separator
on recovery we may find mixed segments. recover them by adding them to a
separator, reading all their records and writing them to the separator,
and flush the separator.
2026-03-18 19:24:27 +01:00
Michael Litvak
b78cc787a6 logstor: wait for pending reads in compaction
we free a segment from compaction after updating all live records in the
segment to point to new locations in the index. we need to ensure they
are no running operations that use the old locations before we free the
segment.
2026-03-18 19:24:27 +01:00
Michael Litvak
600ec82bec logstor: separator
initial implementation of the separator. it replaces "mixed" segments -
segments that have records from different groups, to segments by group.

every write is written to the active segment and to a buffer in the
active separator. the active separator has in-memory buffers by group.
at some threshold number of segments we switch the active segment and
separator atomically, and start flushing the separator.

the separator is flushed by writing the buffers into new non-mixed
segments, adding them to a compaction group, and frees the mixed
segments.
2026-03-18 19:24:27 +01:00
Michael Litvak
009fc3757a logstor: compaction groups
divide the segments in the compaction manager to compaction group.
compaction will compact only segments from a single compaction group at
a time.
2026-03-18 19:24:27 +01:00
Michael Litvak
b3293f8579 logstor: cache files for read
keep all files for all segments open for read to improve reads.
2026-03-18 19:24:26 +01:00
Michael Litvak
5a16980845 logstor: recovery: initial
initial and basic recovery implementation.
* find all files, read their segments and populate the index with the
  newest record for each key.
* find which segments are used and build the usage histogram
2026-03-18 19:24:26 +01:00
Michael Litvak
bc9fc96579 logstor: add segment generation
add segment generation number that is incremented when the segment is
reused, and it's written to every buffer that is written to the segment.
this is useful for recovery.
2026-03-18 19:24:26 +01:00
Michael Litvak
719f7cca57 logstor: reserve segments for compaction
reserve segments for compaction so it always has enough segments to run
and doesn't get stuck.

do the compaction writes into full new segments instead of the active
segment.
2026-03-18 19:24:26 +01:00
Michael Litvak
521fca5c92 logstor: index: buckets
divide the primary index to buckets, each bucket containing a btree. the
bucket is determined by using bits from the key hash.
2026-03-18 19:24:26 +01:00
Michael Litvak
99c3b1998a logstor: add buffer header
add a buffer header in each write buffer we write that contains some
information that can be useful for recovery and reading.
2026-03-18 19:24:26 +01:00
Michael Litvak
ddd72a16b0 logstor: add group_id
add group_id value to each log record that is passed with the mutation
when writing it.

the group_id will be used to group log records in segments, such that a
segment will contain records only from a single group.

this will be useful for tablet migration. we want for each tablet to
have their own segments with all their records, so we can migrate them
efficiently by copying these segments.

the group_id value is set to a value equivalent to the tablet id.
2026-03-18 19:24:26 +01:00
Michael Litvak
08bea860ef logstor: record generation
add a record generation number for each record so we can compare
records and find which one is newer.
2026-03-18 19:24:26 +01:00
Michael Litvak
28f820eb1c logstor: generation utility
basic utility for generation numbers that will be useful next. a
generation number is an unsigned integer that can be incremented and
compared even if it wraparounds, assuming the values we compare were
written around the same time.
2026-03-18 19:24:26 +01:00
Michael Litvak
5f649dd39f logstor: use RIPEMD-160 for index key
use a 20-byte hash function for the index key to make hash collisions
very unlikely. we assume there are no hash collisions.
2026-03-18 19:24:26 +01:00
Michael Litvak
a521bcbcee test: add test_logstor.py
add basic tests for key-value tables with logstor storage
2026-03-18 19:24:26 +01:00
Michael Litvak
1ae1f37ec1 api: add logstor compaction trigger endpoint
add a new api endpoint that triggers logstor compaction.
2026-03-18 19:24:26 +01:00
Michael Litvak
2128b1b15c replica: add logstor to db
Add a single logstor instance in the database that is used for writing
and reading to tables with kv storage
2026-03-18 19:24:26 +01:00
Michael Litvak
9172cc172e schema: add logstor cf property
add a schema property for tables with logstor storage
2026-03-18 19:24:26 +01:00
Michael Litvak
0b1343747f logstor: initial commit
initial implementation of the logstor storage engine for key-value
tables that supports writes, reads and basic compaction.

main components:
* logstor: this is the main interface to users that supports writing and
  reading back mutations, and manages the internal components.
* index: the primary index in-memory that maps a key to a location on
  disk.
* write buffer: writes go initially to a write buffer. it accumulates
  multiple records in a buffer and writes them to the segment manager in
  4k sized blocks.
* segment manager: manages the storage - files, segments, compaction. it
  manages file and segment allocation, and writes 4k aligned buffers to
  the active segment sequentially. it tracks the used space in each
  segment. the compaction finds segment with low space usage and writes
  them to new segments, and frees the old segments.
2026-03-18 19:24:26 +01:00
Michael Litvak
27fd0c119f db: disable tablet balancing with logstor
initially logstor tables will not support tablet migrations, so
disable tablet balancing if the experimental feature flag is set.
2026-03-18 19:24:26 +01:00
Michael Litvak
ed852a2af2 db: add logstor experimental feature flag
add a new experimental feature flag for key-value tables with the new
logstor storage engine.
2026-03-18 19:24:26 +01:00
Anna Stuchlik
88b98fac3a doc: update the warning about shared dictionary training
This commit updates the inadequate warning on the Advanced Internode (RPC) Compression page.

The warning is replaced with a note about how training data is encrypted.

Fixes https://github.com/scylladb/scylladb/issues/29109

Closes scylladb/scylladb#29111
2026-03-18 19:35:18 +02:00
Avi Kivity
46a6f8e1d3 Merge 'auth: add maintenance_socket_authorizer' from Dario Mirovic
GRANT/REVOKE fails on the maintenance socket connections, because maintenance_auth_service uses allow_all_authorizer. allow_all_authorizer allows all operations, but not GRANT/REVOKE, because they make no sense in its context.

This has been observed during PGO run failure in operations from ./pgo/conf/auth.cql file.

This patch introduces maintenance_socket_authorizer that supports the capabilities of default_authorizer ('CassandraAuthorizer') without needing authorization.

Refs SCYLLADB-1070

This is an improvement, no need for backport.

Closes scylladb/scylladb#29080

* github.com:scylladb/scylladb:
  test: use NetworkTopologyStrategy in maintenance socket tests
  test: use cleanup fixture in maintenance socket auth tests
  auth: add maintenance_socket_authorizer
2026-03-18 19:29:57 +02:00
Pavel Emelyanov
d6c01be09b s3/client: Don't reconstruct regex on every parse_content_range call
Make the pattern static const so it is compiled once at first call rather
than on every Content-Range header parse.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29054
2026-03-18 17:56:33 +02:00
Tomasz Grabiec
4410e9c61a sstables: mx: index_reader: Optimize parsing for no promoted index case
It's a common case with small partition workloads.
2026-03-18 16:25:21 +01:00
Tomasz Grabiec
32f8609b89 vint: Use std::countl_zero()
It handles 0, and could generate better code for that. On Broadwell
architecture, it translates to a single instruction (LZCNT). We're
still on Westmere, so it translates to BSR with a conditional move.

Also, drop unnecessary casts and bit arithmetic, which saves a few
instructions.

Move to header so that it's inlined in parsers.
2026-03-18 16:25:21 +01:00
Tomasz Grabiec
6017688445 test: sstable_partition_index_cache_test: Validate scenario of pages with sparse promoted index placement 2026-03-18 16:25:21 +01:00
Tomasz Grabiec
f55bb154ec sstables: mx: index_reader: Amoritze partition key storage
This change reduces the cost of partition index page construction and
LSA migration. This is achieved by several things working together:

 - index entries don't store keys as separate small objects (managed_bytes)
   They are written into one managed_bytes fragmented storage, entries
   hold offset into it.

   Before, we paid 16 bytes for managed_bytes plus LSA descriptor for
   the storage (1 byte) plus back-reference in the storage (8 bytes),
   so 25 bytes. Now we only pay 4 bytes for the size offset. If keys are 16
   bytes, that's a reduction from 31 bytes to 20 bytes per key.

 - index entries and key storage are now trivially moveable, so LSA
   migration can use memcpy() which amortizes the cost per key.
   memcpy().

   LSA eviction is now trivial and constant time for the whole page
   regardless of the number of entries. Page eviction dropped from
   14 us to 1 us.

This improves throughput in a CPU-bound miss-heavy read workload where
the partition index doesn't fit in memory.

  scylla perf-simple-query -c1 -m200M --partitions=1000000

Before:

    15328.25 tps (150.0 allocs/op,  14.1 logallocs/op,  45.4 tasks/op,  286769 insns/op,  218134 cycles/op,        0 errors)
    15279.01 tps (149.9 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  287696 insns/op,  218637 cycles/op,        0 errors)
    15347.78 tps (149.7 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  285851 insns/op,  217795 cycles/op,        0 errors)
    15403.68 tps (149.6 allocs/op,  14.1 logallocs/op,  45.2 tasks/op,  285111 insns/op,  216984 cycles/op,        0 errors)
    15189.47 tps (150.0 allocs/op,  14.1 logallocs/op,  45.5 tasks/op,  289509 insns/op,  219602 cycles/op,        0 errors)
    15295.04 tps (149.8 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  288021 insns/op,  218545 cycles/op,        0 errors)
    15162.01 tps (149.8 allocs/op,  14.1 logallocs/op,  45.4 tasks/op,  291265 insns/op,  220451 cycles/op,        0 errors)

After:

    21620.18 tps (148.4 allocs/op,  13.4 logallocs/op,  43.7 tasks/op,  176817 insns/op,  153183 cycles/op,        0 errors)
    20644.03 tps (149.8 allocs/op,  13.5 logallocs/op,  44.3 tasks/op,  187941 insns/op,  160409 cycles/op,        0 errors)
    20588.06 tps (150.1 allocs/op,  13.5 logallocs/op,  44.5 tasks/op,  188090 insns/op,  160818 cycles/op,        0 errors)
    20789.29 tps (149.5 allocs/op,  13.5 logallocs/op,  44.2 tasks/op,  186495 insns/op,  159382 cycles/op,        0 errors)
    20977.89 tps (149.5 allocs/op,  13.4 logallocs/op,  44.2 tasks/op,  183969 insns/op,  158140 cycles/op,        0 errors)
    21125.34 tps (149.1 allocs/op,  13.4 logallocs/op,  44.1 tasks/op,  183204 insns/op,  156925 cycles/op,        0 errors)
    21244.42 tps (148.6 allocs/op,  13.4 logallocs/op,  43.8 tasks/op,  181276 insns/op,  155973 cycles/op,        0 errors)

Mostly because the index now fits in memory.

When it doesn't, the benefits are still visible due to lower LSA overhead.
2026-03-18 16:25:21 +01:00
Tomasz Grabiec
1452e92567 managed_bytes: Hoist write_fragmented() to common header 2026-03-18 16:25:20 +01:00
Tomasz Grabiec
75e6412b1c utils: managed_vector: Use std::uninitialized_move() to move objects
It's shorter, and is supposed to be optimized for trivially-moveable
types.

Important for managed_vector<index_entry>, which can have lots of
elements.
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
50dc7c6dd8 sstables: mx: index_reader: Keep promoted_index info next to index_entry
Densely populated pages have no promoted index (small partitions), so
we can save space in such workloads by keeping promoted index in a
separate vector.

For workloads which do have a promoted index, pages have only one
partition. There aren't many such pages and they are long-lived, so
the extra allocation of the vector is amortized.

promoted_index class is removed, and replaced with equivalent
parsed_promoted_index_entry for simplicity. Because it's removed,
make_cursor() is moved into the index_reader class.

Reducing the size of index_entry is important for performence if pages
are densly populated. It helps to reduce LSA allocator pressure and
compaction/eviction speed.

This change, combined with the earlier change "Shave-off 16 bytes from
index_entry by using raw_token", gives significant improvement in
throughput in perf_simple_query run where the index doesn't fit in
memory:

  scylla perf-simple-query -c1 -m200M --partitions=1000000

Before:

9714.78 tps (170.9 allocs/op,  16.9 logallocs/op,  55.3 tasks/op,  494788 insns/op,  343920 cycles/op,        0 errors)
9603.13 tps (171.6 allocs/op,  17.0 logallocs/op,  55.6 tasks/op,  502358 insns/op,  348344 cycles/op,        0 errors)
9621.43 tps (171.9 allocs/op,  17.0 logallocs/op,  55.8 tasks/op,  500612 insns/op,  347508 cycles/op,        0 errors)
9597.75 tps (171.6 allocs/op,  17.0 logallocs/op,  55.6 tasks/op,  501428 insns/op,  348604 cycles/op,        0 errors)
9615.54 tps (171.6 allocs/op,  16.9 logallocs/op,  55.6 tasks/op,  501313 insns/op,  347935 cycles/op,        0 errors)
9577.03 tps (171.8 allocs/op,  17.0 logallocs/op,  55.7 tasks/op,  503283 insns/op,  349251 cycles/op,        0 errors)

After:

15328.25 tps (150.0 allocs/op,  14.1 logallocs/op,  45.4 tasks/op,  286769 insns/op,  218134 cycles/op,        0 errors)
15279.01 tps (149.9 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  287696 insns/op,  218637 cycles/op,        0 errors)
15347.78 tps (149.7 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  285851 insns/op,  217795 cycles/op,        0 errors)
15403.68 tps (149.6 allocs/op,  14.1 logallocs/op,  45.2 tasks/op,  285111 insns/op,  216984 cycles/op,        0 errors)
15189.47 tps (150.0 allocs/op,  14.1 logallocs/op,  45.5 tasks/op,  289509 insns/op,  219602 cycles/op,        0 errors)
15295.04 tps (149.8 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  288021 insns/op,  218545 cycles/op,        0 errors)
15162.01 tps (149.8 allocs/op,  14.1 logallocs/op,  45.4 tasks/op,  291265 insns/op,  220451 cycles/op,        0 errors)
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
5e228a8387 sstables: mx: index_reader: Extract partition_index_page::clear_gently()
There will be more elements to clear. And partition_index_page should
know how to clear itself.
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
2d77e4fc28 sstables: mx: index_reader: Shave-off 16 bytes from index_entry by using raw_token
The std::optional<> adds 8 bytes.

And dht::token adds 8 bytes due to _kind, which in this case is always
kind::key.

The size changd from 56 to 48 bytes.
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
e9c98274b5 sstables: mx: index_reader: Reduce allocation_section overhead during index page parsing by batching allocation
If the page has many entries, we continuously enter and leave the
allocating section for every key. This can be avoided by batching LSA
operations for the whole page, after collecting all the entries.

Later optimizations will also build on this, where we will allocate
fragmented storage for keys in LSA using a single managed_bytes
constructor.

This alone brings only a minor improvement, but it does reduce LSA
allocations, probably due to less frequent memory reclamation:

  scylla perf-simple-query -c1 -m200M --duration 6000 --partitions=1000000

Before:

  9560.42 tps (172.2 allocs/op,  19.6 logallocs/op,  57.7 tasks/op,  567741 insns/op,  345158 cycles/op,        0 errors)
  9445.95 tps (173.1 allocs/op,  19.7 logallocs/op,  58.1 tasks/op,  579075 insns/op,  352173 cycles/op,        0 errors)
  9576.75 tps (172.2 allocs/op,  19.6 logallocs/op,  57.6 tasks/op,  572004 insns/op,  347373 cycles/op,        0 errors)
  9597.16 tps (172.2 allocs/op,  19.6 logallocs/op,  57.6 tasks/op,  569615 insns/op,  346618 cycles/op,        0 errors)
  9454.07 tps (173.5 allocs/op,  19.8 logallocs/op,  58.3 tasks/op,  579213 insns/op,  351569 cycles/op,        0 errors)

After:

  9562.21 tps (172.0 allocs/op,  17.0 logallocs/op,  55.8 tasks/op,  499225 insns/op,  347832 cycles/op,        0 errors)
  9480.20 tps (172.3 allocs/op,  17.0 logallocs/op,  55.9 tasks/op,  507271 insns/op,  350640 cycles/op,        0 errors)
  9512.42 tps (172.1 allocs/op,  17.0 logallocs/op,  55.9 tasks/op,  504247 insns/op,  350392 cycles/op,        0 errors)
  9498.45 tps (172.4 allocs/op,  17.1 logallocs/op,  55.9 tasks/op,  505765 insns/op,  350320 cycles/op,        0 errors)
  9076.30 tps (173.5 allocs/op,  17.1 logallocs/op,  56.5 tasks/op,  512791 insns/op,  354792 cycles/op,        0 errors)
  9542.62 tps (171.9 allocs/op,  17.0 logallocs/op,  55.8 tasks/op,  502532 insns/op,  348922 cycles/op,        0 errors)
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
0e0f9f41b3 sstables: mx: index_reader: Keep index_entry directly in the vector
Partition index entries are relatively small, and if the workload has
small partitions, index pages have a lot of elements. Currently, index
entries are indirected via managed_ref, which causes increased cost of
LSA eviction and compaction. This patch amortizes this cost by storing
them dierctly in the managed_chunked_vector.

This gives about 23% improvement in throughput in perf-simple-query
for a workload where the index doesn't fit in memory:

  scylla perf-simple-query -c1 -m200M --duration 6000 --partitions=1000000

Before:

  7774.96 tps (166.0 allocs/op, 521.7 logallocs/op,  54.0 tasks/op,  802428 insns/op,  430457 cycles/op,        0 errors)
  7511.08 tps (166.1 allocs/op, 527.2 logallocs/op,  54.0 tasks/op,  804185 insns/op,  430752 cycles/op,        0 errors)
  7740.44 tps (166.3 allocs/op, 526.2 logallocs/op,  54.2 tasks/op,  805347 insns/op,  432117 cycles/op,        0 errors)
  7818.72 tps (165.2 allocs/op, 517.6 logallocs/op,  53.7 tasks/op,  794965 insns/op,  427751 cycles/op,        0 errors)
  7865.49 tps (165.1 allocs/op, 513.3 logallocs/op,  53.6 tasks/op,  788898 insns/op,  425171 cycles/op,        0 errors)

After:

  9560.42 tps (172.2 allocs/op,  19.6 logallocs/op,  57.7 tasks/op,  567741 insns/op,  345158 cycles/op,        0 errors)
  9445.95 tps (173.1 allocs/op,  19.7 logallocs/op,  58.1 tasks/op,  579075 insns/op,  352173 cycles/op,        0 errors)
  9576.75 tps (172.2 allocs/op,  19.6 logallocs/op,  57.6 tasks/op,  572004 insns/op,  347373 cycles/op,        0 errors)
  9597.16 tps (172.2 allocs/op,  19.6 logallocs/op,  57.6 tasks/op,  569615 insns/op,  346618 cycles/op,        0 errors)
  9454.07 tps (173.5 allocs/op,  19.8 logallocs/op,  58.3 tasks/op,  579213 insns/op,  351569 cycles/op,        0 errors)

Disabling the partition index doesn't improve the throuhgput beyond
that.
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
b6bfdeb111 dht: Introduce raw_token
Most tokens stored in data structures are for key-scoped tokens, and
we don't need to pay for token::kind storage.
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
3775593e53 test: perf_simple_query: Add 'sstable-format' command-line option 2026-03-18 16:25:20 +01:00
Tomasz Grabiec
6ee9bc63eb test: perf_simple_query: Add 'sstable-summary-ratio' command-line option 2026-03-18 16:25:20 +01:00
Tomasz Grabiec
38d130d9d0 test: perf-simple-query: Add option to disable index cache 2026-03-18 16:25:20 +01:00
Tomasz Grabiec
5ee61f067d test: cql_test_env: Respect enable-index-cache config
Mirrors the code in main.cc
2026-03-18 16:25:20 +01:00
Aleksandra Martyniuk
2d16083ba6 tasks: fix indentation 2026-03-18 15:37:24 +01:00
Aleksandra Martyniuk
1fbf3a4ba1 tasks: do not fail the wait request if rpc fails
During decommission, we first mark a topology request as done, then shut
down a node and in the following steps we remove node from the topology.
Thus,  finished request does not imply that a node is removed from
the topology.

Due to that, in node_ops_virtual_task::wait, while gathering children
from the whole cluster, we may hit the connection exception - because
a node is still in topology, even though it is down.

Modify the get_children method to ignore the exception and warn
about the failure instead.
2026-03-18 15:37:24 +01:00
Aleksandra Martyniuk
d4fdeb4839 tasks: pass token_metadata_ptr to task_manager::virtual_task::impl::get_children
In get_children we get the vector of alive nodes with get_nodes.
Yet, between this and sending rpc to those nodes there might be
a preemption. Currently, the liveness of a node is checked once
again before the rpcs (only with gossiper not in topology - unlike
get_nodes).

Modify get_children, so that it keeps a token_metadata_ptr,
preventing topology from changing between get_nodes and rpcs.

Remove test_get_children as it checked if the get_children method
won't fail if a node is down after get_nodes - which cannot happen
currently.
2026-03-18 15:37:24 +01:00
Calle Wilund
0013f22374 memtable_test::memtable_flush_period: Change sleep to use injection signal instead
Fixes: SCYLLADB-942

Adds an injection signal _from_ table::seal_active_memtable to allow us to
reliably wait for flushing. And does so.

Closes scylladb/scylladb#29070
2026-03-18 16:23:13 +02:00
Botond Dénes
ae17596c2a Merge 'Demote log level on split failure during shutdown' from Raphael Raph Carvalho
Since commit 509f2af8db, gate_closed_exception can be triggered for ongoing split during shutdown. The commit is correct, but it causes split failure on shutdown to log an error, which causes CI instability. Previously, aborted_exception would be triggered instead which is logged as warning. Let's do the same.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-951.
Fixes https://github.com/scylladb/scylladb/issues/24850.

Only 2026.1 is affected.

Closes scylladb/scylladb#29032

* github.com:scylladb/scylladb:
  replica: Demote log level on split failure during shutdown
  service: Demote log level on split failure during shutdown
2026-03-18 16:21:05 +02:00
Pavel Emelyanov
8b1ca6dcd6 database: Rate limit all tokens from a range
The limiter scans ranges to decide whether or not to rate-limit the
query. However, when considering each range only the front one's token
is accounted. This looks like a misprint.

The limiter was introduced in cc9a2ad41f

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29050
2026-03-18 13:50:48 +01:00
Pavel Emelyanov
d68c92ec04 test: Replace a bunch of ternary operators with an if-else block
A followup of the merge of two test cases that happened in the previous
patch. Both used `foo = N if domain == bar else M` to evaluate the
parameters for topology. Using if-else block makes it immediately obvious
which topology and scope apply for each domain value without having to
evaluate multiple inline conditionals.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-18 13:08:36 +03:00
Pavel Emelyanov
b1d4fc5e6e test: Squash test_restore_primary_replica_same|different_domain tests
The two tests differ only in the way they set up the topology for the
cluster and the post-restore checks against the resulting streams.

The merge happens with the help of a "scope_is_same" boolean parameter
and corresponding updates in the topology setup and post-checks.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-18 13:08:36 +03:00
Pavel Emelyanov
21c603a79e test: Use the same regexp in test_restore_primary_replica_different|same_domain-s
The one in "different domain" test is simpler because the test performs
less checks. Next patch will merge both tests and making regexp-s look
identical makes the merge even smother.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-18 13:07:09 +03:00
Emil Maskovsky
34f3916e7d .github: update test instructions for unified pytest runner
Update test running instructions to reflect unified pytest-based runner.

The test.py now requires full test paths with file extensions for both
C++ and Python tests.

No backport: The change is only relevant for recent test.py changes in
master.

Closes scylladb/scylladb#29062
2026-03-18 09:28:28 +01:00
Marcin Maliszkiewicz
04bf631d7f auth: switch query_attribute_for_all to use cache 2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz
cf578fd81a auth: switch get_attribute to use cache 2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz
06d16b6ea2 auth: cache: add heterogeneous map lookups
Some callers have only string_view role name,
they shouldn't need to allocate sstring to do the
lookup.
2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz
7fdb1118f5 auth: switch query_all to use cache 2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz
fca11c5a21 auth: switch query_all_directly_granted to use cache 2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz
6f682f7eb1 auth: cache: add ability to go over all roles
This is needed to implement auth service api where
we list all roles.
2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz
61952cd985 raft: service: reload auth cache before service levels
Since service levels depend on auth data, and not other
way around, we need to ensure a proper loading order.
2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz
c4cfb278bc service: raft: move update_service_levels_effective_cache check
The auth::cache::includes_table function also covers role_members and
role_attributes. The existing check was removed because it blocked these
tables from triggering necessary cache updates.

While previously non-critical (due to unused attributes and table coupling),
maintaining a correct cache is essential for upcoming changes.
2026-03-18 09:06:20 +01:00
Benny Halevy
c2a6d1e930 test/boost/database_test: add test_snapshot_ctl_details_exception_handling
Verify that the directory listers opened by get_snapshot_details
are properly closed when handling an (injected) exception.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-18 09:37:44 +02:00
Benny Halevy
6dc4ea766b table: get_snapshot_details: fix indentation inside try block
Whitespace-only change: indent the loop body one level inside the
try block added in the previous commit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-18 09:28:50 +02:00
Benny Halevy
b09d45b89a table: per-snapshot get_snapshot_details: fix typo in comment
The comment says the snapshot directory may contain a `schema.sql` file,
but the code treats `schema.cql` as the special-case schema file.

Reported-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-18 09:27:40 +02:00
Benny Halevy
580cc309d2 table: per-snapshot get_snapshot_details: always close lister using try/catch
Since this is a coroutine, we cannot just use deferred_close,
but rather we need to catch an error, close the lister, and then
return the error, is applicable.

Fixes: SCYLLADB-1013

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-18 09:27:23 +02:00
Benny Halevy
78c817f71e table: get_snapshot_details: always close lister using deferred_close
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-18 09:26:26 +02:00
Dario Mirovic
71e6918f28 test: use NetworkTopologyStrategy in maintenance socket tests
NetworkTopologyStrategy is the preferred choice. We should not
use SimpleStrategy anymore. This patch changes the topology strategy
for all the maintenance socket tests.

Refs SCYLLADB-1070
2026-03-17 20:20:47 +01:00
Dario Mirovic
278535e4e3 test: use cleanup fixture in maintenance socket auth tests
Add a cql_clusters pytest fixture that tracks CQL driver Cluster
objects and shuts them down automatically after test completion.
This replaces manual shutdown() calls at the end of each test.

Also consolidate shutdown() calls in retry helpers into finally
blocks for consistent cleanup.

Refs SCYLLADB-1070
2026-03-17 20:15:30 +01:00
Dario Mirovic
2e4b72c6b9 auth: add maintenance_socket_authorizer
GRANT/REVOKE fails on the maintenance socket connections,
because maintenance_auth_service uses allow_all_authorizer.
allow_all_authorizer allows all operations, but not GRANT/REVOKE,
because they make no sense in its context.

This has been observed during PGO run failure in operations from
./pgo/conf/auth.cql file.

This patch introduces maintenance_socket_authorizer that supports
the capabilities of default_authorizer ('CassandraAuthorizer')
without needing authorization.

Refs SCYLLADB-1070
2026-03-17 19:19:41 +01:00
Botond Dénes
172c786079 Merge 'perf-alternator: wait for alternator port before running workload' from Marcin Maliszkiewicz
This patch is mostly for the purpose of running pgo CI job.

We may receive connection error if asyncio.sleep(5) in
pgo.py is not sufficient waiting time.

In pgo.py we do wait for port but only for cql,
anyway it's better to have high level check than
trying to wait for alternator port there.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1071
Backport: 2026.1 - it failed on CI for that build

Closes scylladb/scylladb#29063

* github.com:scylladb/scylladb:
  perf: add abort_source support to wait-for-port loops
  perf-alternator: wait for alternator port before running workload
2026-03-17 18:38:11 +02:00
Botond Dénes
5d868dcc55 Merge 's3_client: fix s3::range max value for object size' from Ernest Zaslavsky
- fix s3::range max value for object size which is 50TiB and not 5.
- refactor constants to make it accessible for all interested parties, also reuse these constants in tests

No need to backport, doubt we will encounter an object larger than 5TiB

Closes scylladb/scylladb#28601

* github.com:scylladb/scylladb:
  s3_client: reorganize tests in part_size_calculation_test
  s3_client: switch using s3 limits constants in tests
  s3_client: fix the s3::range max object size
  s3_client: remove "aws" prefix from object limits constants
  s3_client: make s3 object limits accessible
2026-03-17 16:34:42 +02:00
Anna Stuchlik
f4a6bb1885 doc: remove the Open Source Example from Installation
This commit replaces the Open Soruce example from the Installation section for CentOS.
We updated the example for Ubuntu, but not for CentOS.
We don't want to have any Open Source information in the docs.

Fixes https://github.com/scylladb/scylladb/issues/29087
2026-03-17 14:54:32 +01:00
Anna Stuchlik
95bc8911dd doc: replace http with https in the installation instructions
Fixes https://github.com/scylladb/scylladb/issues/17227
2026-03-17 14:46:16 +01:00
Dawid Mędrek
a8dd13731f Merge 'Improve debuggability of test/cluster/test_data_resurrection_in_memtable.py' from Botond Dénes
This test was observed to fail in CI recently but there is not enough information in the logs to figure out what went wrong. This PR makes a few improvements to make the next investigation easier, should it be needed:
* storage-service: add table name to mutation write failure error messages.
* database: the `database_apply` error injection used to cause trouble, catching writes to bystander tables, making tests flaky. To eliminate this, it gained a filter to apply only to non-system keyspaces. Unfortunately, this still allows it to catch writes to the trace tables. While this should not fail the test, it reduces observability, as some traces disappear. Improve this error injection to only apply to selected table. Also merge it with the `database_apply_wait` error injection, to streamline the code a bit.
* test/test_data_resurrection_in_memtable.py: dump data from the datable, before the checks for expected data, so if checks fail, the data in the table is known.

Refs: SCYLLADB-812
Refs: SCYLLADB-870
Fixes: SCYLLADB-1050 (by restricting `database_apply` error injection, so it doesn't affect writes to system traces)

Backport: test related improvement, no backport

Closes scylladb/scylladb#28899

* github.com:scylladb/scylladb:
  test/cluster/test_data_resurrection_in_memtable.py: dump rows before check
  replica/database: consolidate the two database_apply error injections
  service/storage_proxy: add name of table to error message for write errors
2026-03-17 13:35:19 +01:00
Botond Dénes
318aa07158 Merge ' test/alternator: use module-scope fixtures in test_streams.py ' from Nadav Har'El
Previously, all stream-table fixtures in test_streams.py used scope="function",
forcing a fresh table to be created for every test, slowing down the test a bit
(though not much), and discouraging writing small new tests.

 This was a workaround for a DynamoDB quirk (that Alternator doesn't have):
LATEST shard iterators have a time slack and may point slightly before  the true
stream head, causing leftover events from a previous test to appear in the next
test's reads.

The first two tests in this series fix small problems that turn up once we start
sharing test tables in test_streams.py. The final patch fixes the "LATEST" problem
and enables sharing the test table by using "module" scope fixtures instead of
"function".

After this series, test_streams.py run time went down a bit, from 20.2 seconds to 17.7 seconds.

Closes scylladb/scylladb#28972

* github.com:scylladb/scylladb:
  test/alternator: speed up test_streams.py by using module-scope fixtures
  test/alternator: test_streams.py don't use fixtures in 4 tests
  test/alternator: fix do_test() in test_streams.py
2026-03-17 13:56:16 +02:00
Ernest Zaslavsky
7f597aca67 cmake: fix broken build
Add raft_util.idl.hh to cmake to generate the code properly

Closes scylladb/scylladb#29055
2026-03-17 10:35:34 +01:00
Botond Dénes
dbe70cddca test/boost/querier_cache_test: make test_time_based_cache_eviction less sensitive to timing
This test relies on the cache entry being evicted after 200ms past the
TTL. This may not happen on a busy CI machine. Make the test less
reliant on timing by using eventually_true().
Simplify the test by dropping the second entry, it doesn't add anything
to the test.

Fixes: SCYLLADB-811

Closes scylladb/scylladb#28958
2026-03-17 10:32:23 +01:00
Botond Dénes
0fd51c4adb test/nodetool: rest_api_mock_server: add retry for status code 404
This fixtures starts the mock server and immediately connects to it to
setup the expected requests. The connection attempt might be too early,
so there is a retry loop with a timeout. The loop currently checks for
requests.exception.ConnectionError. We've seen a case where the
connection is successful but the request fails with 404. The mock
started the server but didn't setup the routes yet. Add a retry for http
404 to handle this.

Fixes: SCYLLADB-966

Closes scylladb/scylladb#29003
2026-03-17 10:30:23 +01:00
Asias He
6cb263bab0 repair: Prevent CPU stall during cross-shard row copy and destruction
When handling `repair_stream_cmd::end_of_current_rows`, passing the
foreign list directly to `put_row_diff_handler` triggered a massive
synchronous deep copy on the destination shard. Additionally, destroying
the list triggered a synchronous deallocation on the source shard. This
blocked the reactor and triggered the CPU stall detector.

This commit fixes the issue by introducing `clone_gently()` to copy the
list elements one by one, and leveraging the existing
`utils::clear_gently()` to destroy them. Both utilize
`seastar::coroutine::maybe_yield()` to allow the reactor to breathe
during large cross-shard transfers and cleanups.

Fixes SCYLLADB-403

Closes scylladb/scylladb#28979
2026-03-17 11:05:15 +02:00
Pavel Emelyanov
9fe19ec9d9 sstables: Fix object storage lister not resetting position in batch
vector

The lister loop in get() pre-fetches records in batches and keeps them
in a _info vector, iterating over it with the help of _pos cursor. When
the vector is re-read, the cursor must be reset too.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-17 10:32:42 +03:00
Pavel Emelyanov
1a6a7647c6 sstables: Fix object storage lister skipping entries when filter is active
The lister loop in get() method looks weird. It uses do-while(false)
loop and calls continue; inside when filter asks to skip a entry.
Skipping, thus, aborts the whole thing and EOF-s, which is not what's
supposed to happen.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-17 10:32:40 +03:00
Botond Dénes
035aa90d4b Merge 'Alternator: add per-table batch latency metrics and test coverage' from Amnon Heiman
This series fixes a metrics visibility gap in Alternator and adds regression coverage.

Until now, BatchGetItem and BatchWriteItem updated global latency histograms but did not consistently update per-table latency histograms. As a result, table-level latency dashboards could miss batch traffic.

It updates the batch read/write paths to compute request duration once and record it in both global and per-table latency metrics.

Add the missing tests, including a metric-agnostic helper and a dedicated per-table latency test that verifies latency counters increase for item and batch operations.

This change is metrics-only (no API/behavior change for requests) and improves observability consistency between global and per-table views.

Fixes #28721

**We assume the alternator per-table metrics exist, but the batch ones are not updated**

Closes scylladb/scylladb#28732

* github.com:scylladb/scylladb:
  test(alternator): add per-table latency coverage for item and batch ops
  alternator: track per-table latency for batch get/write operations
2026-03-16 17:18:00 +02:00
Michał Hudobski
40d180a7ef docs: update vector search filtering to reflect primary key support only
Remove outdated references to filtering on columns provided in the
index definition, and remove the note about equal relations (= and IN)
being the only supported operations. Vector search filtering currently
supports WHERE clauses on primary key columns only.

Closes scylladb/scylladb#28949
2026-03-16 17:16:16 +02:00
Botond Dénes
9de8d6798e Merge 'reader_concurrency_semaphore: skip preemptive abort for permits waiting for memory' from Łukasz Paszkowski
Permits in the `waiting_for_memory` state represent already-executing reads that are blocked on memory allocation. Preemptively aborting them is wasteful -- these reads have already consumed resources and made progress, so they should be allowed to complete.

Restrict the preemptive abort check in maybe_admit_waiters() to only apply to permits in the `waiting_for_admission` state, and tighten the state validation in `on_preemptive_aborted()` accordingly.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1016

Backport not needed. The commit introducing replica load shedding is not part of 2026.1

Closes scylladb/scylladb#29025

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore: skip preemptive abort for permits waiting for memory
  reader_concurrency_semaphore_test: detect memory leak on preemptive abort of waiting_for_memory permit
2026-03-16 17:14:25 +02:00
Marcin Maliszkiewicz
9318c80203 perf: add abort_source support to wait-for-port loops
Check abort_source on each retry iteration in
wait_for_alternator and wait_for_cql so the
wait can be interrupted on shutdown.

Didn't use sleep_abortable as the sleep is very short
anyway.
2026-03-16 16:14:10 +01:00
Calle Wilund
a5df2e79a7 storage_service: Wait for snapshot/backup before decommission
Fixes: SCYLLADB-244

Disables snapshot control such that any active ops finish/fail
before proceeding with decommission.
Note: snapshot control provided as argument, not member ref
due to storage_service being used from both main and cql_test_env.
(The latter has no snapshot_ctl to provide).

Could do the snapshot lockout on API level, but want to do
pre-checks before this.

Note: this just disables backup/snapshot fully. Could re-enable
after decommission, but this seems somewhat pointless.

v2:
* Add log message to snapshot shutdown
* Make test use log waiting instead of timeouts

Closes scylladb/scylladb#28980
2026-03-16 17:12:57 +02:00
Marcin Maliszkiewicz
edf0148bee perf-alternator: wait for alternator port before running workload
This patch is mostly for the purpose of running pgo CI job.

We may receive connection error if asyncio.sleep(5) in
pgo.py is not sufficient waiting time.

In pgo.py we do wait for port but only for cql,
anyway it's better to have high level check than
trying to wait for alternator port there.
2026-03-16 16:07:52 +01:00
bitpathfinder
85d5073234 test: Fix non-awaited coroutine in test_gossiper_empty_self_id_on_shadow_round
The line with the error was not actually needed and has therefore been removed.

Fixes: SCYLLADB-906

Closes scylladb/scylladb#28884
2026-03-16 17:07:36 +02:00
Botond Dénes
3e4e0c57b8 Merge 'Relax rf-rack-valid-keyspace option in backup/restore tests' from Pavel Emelyanov
Some tests, when create a cluster, configure nodes with the rf-rack-valid option, because sometimes they want to have it OFF. For that the option is explicitly carried around, but the cluster creating helper can guess this option itself -- out of the provided topology and replication factor.

Removing this option simplifies the code and (which a nicer outcome) the test "signature" that's used e.g. in command-line to run a specific test.

Improving tests, not backporting

Closes scylladb/scylladb#28860

* github.com:scylladb/scylladb:
  test: Relax topology_rf_validity parameter for some tests
  test: Auto detect rf-rack-valid option in create_cluster()
2026-03-16 17:06:46 +02:00
Raphael S. Carvalho
ee87b66033 replica: Demote log level on split failure during shutdown
Dtest failed with:

table - Failed to load SSTable .../me-3gyn_0qwi_313gw2n2y90v2j4fcv-big-Data.db
of origin memtable due to std::runtime_error (Cannot split
.../me-3gyn_0qwi_313gw2n2y90v2j4fcv-big-Data.db because manager has compaction
disabled, reason might be out of space prevention), it will be unlinked...

The reason is that the error above is being triggered when the cause is
shutdown, not out of space prevention. Let's distinguish between the two
cases and log the error with warning level on shutdown.

Fixes https://github.com/scylladb/scylladb/issues/24850.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2026-03-16 12:03:17 -03:00
Patryk Jędrzejczak
526e5986fe test: test_raft_no_quorum: decrease group0_raft_op_timeout_in_ms after quorum loss
`test_raft_no_quorum.py::test_cannot_add_new_node` is currently flaky in dev
mode. The bootstrap of the first node can fail due to `add_entry()` timing
out (with the 1s timeout set by the test case).

Other test cases in this test file could fail in the same way as well, so we
need a general fix. We don't want to increase the timeout in dev mode, as it
would slow down the test. The solution is to keep the timeout unchanged, but
set it only after quorum is lost. This prevents unexpected timeouts of group0
operations with almost no impact on the test running time.

A note about the new `update_group0_raft_op_timeout` function: waiting for
the log seems to be necessary only for
`test_quorum_lost_during_node_join_response_handler`, but let's do it
for all test cases just in case (including `test_can_restart` that shouldn't
be flaky currently).

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-913

Closes scylladb/scylladb#28998
2026-03-16 16:58:15 +02:00
Raphael S. Carvalho
b508f3dd38 service: Demote log level on split failure during shutdown
Since commit 509f2af8db, gate_closed_exception can be triggered
for ongoing split during shutdown. The commit is correct, but it
causes split failure on shutdown to log an error, which causes
CI instability. Previously, aborted_exception would be triggered
instead which is logged as warning. Let's do the same.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-951.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2026-03-16 11:52:00 -03:00
Dani Tweig
bc0952781a Update Jira sync calling workflow to consolidated view
Replaced multiple per-action workflow jobs with a single consolidated
call to main_pr_events_jira_sync.yml. Added 'edited' event trigger.
This makes CI actions in PRs more readable and workflow execution faster.

Fixes:PM-253

Closes scylladb/scylladb#29042
2026-03-16 08:25:32 +02:00
Artsiom Mishuta
755d528135 test.py: fix warnings
changes in this commit:
1)rename class from 'TestContext' to  'Context' so pytest will not consider this class as a test

2)extend pytest filterwarnings list to ignore warnings from external libs

3) use datetime.datetime.now(datetime.UTC) unstead  datetime.datetime.utcnow()

4) use  ResultSet.one() instead  ResultSet[0]

Fixes SCYLLADB-904
Fixes SCYLLADB-908
Related SCYLLADB-902

Closes scylladb/scylladb#28956
2026-03-15 12:00:10 +02:00
Karol Nowacki
7659a5b878 vector_search: test: fix flaky test
The test assumes that the sleep duration will be at least the value of
the sleep parameter. However, the actual sleep time can be slightly less
than requested (e.g., a 100ms sleep request might result in a 99ms
sleep).

This commit adjusts the test's time comparison to be more lenient,
preventing test flakiness.
2026-03-13 16:28:22 +01:00
Karol Nowacki
5474cc6cc2 vector_search: fix race condition on connection timeout
When a `with_connect` operation timed out, the underlying connection
attempt continued to run in the reactor. This could lead to a crash
if the connection was established/rejected after the client object had
already been destroyed. This issue was observed during the teardown
phase of a upcoming high-availability test case.

This commit fixes the race condition by ensuring the connection attempt
is properly canceled on timeout.

Additionally, the explicit TLS handshake previously forced during the
connection is now deferred to the first I/O operation, which is the
default and preferred behavior.

Fixes: SCYLLADB-832
2026-03-13 16:28:22 +01:00
Piotr Dulikowski
d8b283e1fb Merge 'Add CQL forwarding for strongly consistent tables' from Wojciech Mitros
In this series we add support for forwarding strongly consistent CQL requests to suitable replicas, so that clients can issue reads/writes to any node and have the request executed on an appropriate tablet replica (and, for writes, on the Raft leader). We return the same CQL response as what the user would get while sending the request to the correct replica and we perform the same logging/stats updates on the request coordinator as if the coordinator was the appropriate replica.

The core mechanism of forwarding a strongly consistent request is sending an RPC containing the user's cql request frame to the appropriate replica and returning back a ready, serialized `cql_transport::response`. We do this in the CQL server - it is most prepared for handling these types and forwarding a request containing a CQL frame allows us to reuse near-top-level methods for CQL request handling in the new RPC handler (such as the general `process`)

For sending the RPC, the CQL server needs to obtain the information about who should it forward the request to. This requires knowledge about the tablet raft group members and leader. We obtain this information during the execution of a `cql3/strong_consistency` statement, and we return this information back to the CQL server using the generalized `bounce_to_shard` `response_message`, where we now store the information about either a shard, or a specific replica to which we should forward to. Similarly to `bounce_to_shard`, we need to handle this `result_message` in a loop - a replica may move during statement execution, or the Raft leader can change. We also use it for forwarding strongly consistent writes when we're not a member of the affected tablet raft group - in that case we need to forward the statement twice - once to any replica of the affected tablet, then that replica can find the leader and return this information to the coordinator, which allows the second request to be directed to the leader.

This feature also allows passing through exception messages which happened on the target replica while executing the statement. For that, many methods of the `cql_transport::cql_server::connection` for creating error responses needed to be moved to `cql_transport::cql_server`. And for final exception handling on the coordinator, we added additional error info to the RPC response, so that the handling can be performed without having the `result_message::exception` or `exception_ptr` itself.

Fixes [SCYLLADB-71](https://scylladb.atlassian.net/browse/SCYLLADB-71)

[SCYLLADB-71]: https://scylladb.atlassian.net/browse/SCYLLADB-71?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#27517

* github.com:scylladb/scylladb:
  test: add tests for CQL forwarding
  transport: enable CQL forwarding for strong consistency statements
  transport: add remote statement preparation for CQL forwarding
  transport: handle redirect responses in CQL forwarding
  transport: add exception handling for forwarded CQL requests
  transport: add basic CQL request forwarding
  idl: add a representation of client_state for forwarding
  cql_server: handle query, execute, batch in one case
  transport: inline process_on_shard in cql_server::process
  transport: extract process() to cql_server
  transport: add messaging_service to cql_server
  transport: add response reconstruction helpers for forwarding
  transport: generalize the bounce result message for bouncing to other nodes
  strong consistency: redirect requests to live replicas from the same rack
  transport: pass foreign_ptr into sleep_until_timeout_passes and move it to cql_server
  transport: extract the error handling from process_request_one
  transport: move error response helpers from connection to cql_server
2026-03-13 15:03:10 +01:00
Andrzej Jackowski
60aaea8547 cql: improve write consistency level guardrail messages
Update warn and fail messages for the write_consistency_levels_warned
and write_consistency_levels_disallowed guardrails to include the
configuration option name and actionable guidance. The main motivation
is to make the messages follow the conventions of other guardrails.

Refs: SCYLLADB-257
2026-03-13 14:40:45 +01:00
Tomasz Grabiec
518470e89e Merge 'load_stats: improve tablet filtering for load stats' from Ferenc Szili
When computing table sizes via load_stats to determine if a split/merge is needed, we are filtering tablets which are being migrated, in order to avoid counting them twice (both on leaving and pending replica) in the total table size. The tablets are filtered so that they are counted on the leaving replica until the streaming stage, and on the pending replica after the streaming stage.

Currently, the procedure for collecting tablet sizes for load balancing also uses this same filter. This should be changed, because the load balancer needs to have as much information about tablet sizes as possible, and could ignore a node due to missing tablet sizes for tablets in the `write_both_read_new` and `use_new` stages.

For tablet size collection, we should include all the tablets which are currently taking up disk space. This means:
- on leaving replica, include all tablets until the `cleanup` stage
- on pending replica, include all tablets starting with the `write_both_read_new` and later stages

While this is an improvement, it causes problems with some of the tests, and therefore needs to be backported to 2026.1

Fixes: SCYLLADB-829

Closes scylladb/scylladb#28587

* github.com:scylladb/scylladb:
  load_stats: add filtering for tablet sizes
  load_stats: move tablet filtering for table size computation
  load_stats: bring the comment and code in sync
2026-03-13 13:08:11 +01:00
Pavel Emelyanov
d544d8602d test: Relax topology_rf_validity parameter for some tests
Tests that call create_cluster() helper no longer need to carry the
rf-validity parameter. This simplifies the code and test signature.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-13 14:30:32 +03:00
Pavel Emelyanov
313985fed7 test: Auto detect rf-rack-valid option in create_cluster()
The helper accepts its as boolean argument, but it can easily estimate
one from the provided topology.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-13 14:30:32 +03:00
Gleb Natapov
fae5282c82 service level: fix crash during migration to driver server level
Before b59b3d4 the migration code checked that service level controller
is on v2 version before migration and the check also implicitly checked
that _sl_data_accessor field is already initialized, but now that the
check is gone the migration can start before service level controller is
fully initialized. Re add the check, but to a different place.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1049

Closes scylladb/scylladb#29021
2026-03-13 11:24:26 +01:00
Łukasz Paszkowski
4c4d043a3b reader_concurrency_semaphore: skip preemptive abort for permits waiting for memory
Permits in the `waiting_for_memory` state represent already-executing
reads that are blocked on memory allocation. Preemptively aborting
them is wasteful -- these reads have already consumed resources and
made progress, so they should be allowed to complete.

Restrict the preemptive abort check in maybe_admit_waiters() to only
apply to permits in the `waiting_for_admission` state, and tighten
the state validation in `on_preemptive_aborted()` accordingly.

Adjust the following tests:
+ test_reader_concurrency_semaphore_abort_preemptively_aborted_permit
  no longer relies on requesting memory
+ test_reader_concurrency_semaphore_preemptive_abort_requested_memory_leak
  adjusted to the fix

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1016
2026-03-13 09:50:05 +01:00
Dani Tweig
aa46a0f4e0 Add VECTOR to the list of synced milestones in scylladb.git
- Added VECTOR to the comma-separated list of Jira project keys in `call_sync_milestone_to_jira.yml`.
- The `jira_project_keys` value changed from `SCYLLADB,CUSTOMER,SMI,RELENG` to `SCYLLADB,CUSTOMER,SMI,RELENG,VECTOR`.

- The VECTOR project needs to sync with scylladb.git milestones, so that when a GitHub milestone is created or closed in scylladb/scylladb, the corresponding Jira release is also created or released in the VECTOR project.
- Previously only SCYLLADB, CUSTOMER, SMI, and RELENG projects were synced.

Fixes:PM-220

Closes scylladb/scylladb#29014
2026-03-13 09:58:41 +02:00
Botond Dénes
fc8cebd671 Merge 'Verify components digests during component load and scrub in validate mode' from Taras Veretilnyk
This PR adds integrity verification for SSTable component files during loading. When component digests are present in Scylla metadata, the loader now validates each component's CRC32 digest against the stored expected value, catching silent corruption of component files. Index, Rows and Partitions components digests are also validated duriung scrub in validate mode

Added corruption tests that write an SSTable, flip a bit in a specific component file, then verify that reloading the SSTable detects the corruption and throws the expected exception.

Depends on https://github.com/scylladb/scylladb/pull/28338

Backport is not required, this is new feature

Fixes https://github.com/scylladb/scylladb/issues/20103

Closes scylladb/scylladb#28761

* github.com:scylladb/scylladb:
  test/cqlpy: test --ignore-component-digest-mismatch flag in scylla sstable upgrade
  docs: document --ignore-component-digest-mismatch flag for scylla sstable upgrade
  sstables: propagate ignore_component_digest_mismatch config to all load sites
  sstables: add option to ignore component digest mismatches
  sstable_compaction_test: Add scrub validate test for corrupted index
  sstables: add tests for component digest validation on corrupted SSTables
  sstables: validate index components digests during SSTable scrub in validate mode
  sstables: verify component digests on SSTable load
  sstables: add digest_file_random_access_reader for CRC32 digest computation
2026-03-13 09:55:55 +02:00
Avi Kivity
ae8a418744 Merge 'Await async calls in test tablets migration' from Benny Halevy
Fix several test cases that did not await async tasks:
- test_restart_leaving_replica_during_cleanup
- test_restart_in_cleanup_stage_after_cleanup
- test_tablet_back_and_forth_migration
- test_staging_backlog_is_preserved_with_file_based_streaming

Fixes SCYLLADB-910

* Minor fixes, no backport needed

Closes scylladb/scylladb#28908

* github.com:scylladb/scylladb:
  test_tablets_migration: test_staging_backlog_is_preserved_with_file_based_streaming: convert for loop to asyncio.gather
  test_tablets_migration: test_tablet_back_and_forth_migration: await move_tablet
  test_tablets_migration: test_restart_in_cleanup_stage_after_cleanup: await move_task
  test_tablets_migration: test_restart_leaving_replica_during_cleanup: await move_task
  test_tablets_migration: drop unused imports from cassandra.query
2026-03-13 00:20:29 +02:00
Avi Kivity
b228eb26e6 Merge 'dbuild: Use slirp4netns network in dbuild nested containers' from Calle Wilund
Fixes #25084

Add slirp4netns and use for nested containers. This will allow nested container port aliasing, helping CI stability.

Note: this contains and updated Dockerfile for dbuild image, but since chicken and eggs, right now will force install slirp4netns before anything in dbuild script.

Updates the mock server handling to use ephemeral ports and query from container, ensuring we don't get port collisions. (boost as well as pytest).

Includes a timeout up, and a tweak to our scylla_cluster handling, ensuring we don't deadlock when pipe size is less than requires for our sys notify messages.

Closes scylladb/scylladb#28727

* github.com:scylladb/scylladb:
  gcs_fixture: Change to use docker helper
  aws_kms_fixture: Modify to use docker helper
  test/lib/proc_util: Add docker helper
  pytest: use ephemeral port publish for docker mock servers
  dbuild: Use container network in dbuild nested containers
  scylla_cluster: Read notify sock in background to prevent deadlock
2026-03-12 23:49:25 +02:00
Nadav Har'El
ad832c263e test/cluster: mark test_alternator_concurrent_rmw_same_partition_different_server not strictly xfail
A few days ago, in commit 7b30a39 we added to pytest.ini the option
xfail_strict. This option causes every time a test XPASSes, i.e., an xfail
test actually passes - to be considered an error and fail the test.

But some tests demonstrate a timing-related bug and do not reproduce the
bug every single time. An example we noticed in one CI run is:

test/cluster/test_alternator.py::test_alternator_concurrent_rmw_same_partition_different_server

This test reproduces a timing-related bug (if you do an LWT write to
one partition on to two different coordinators "at the same time", you
can get a failure), but only most of the time, not 100% of the time.

The solution is to add "strict=False" for the xfail marker on this specific
test. This undoes the xfail_strict for this specific test, accepting that
this specific test can either pass or fail. Note that this does NOT make
this test worthless - we still see this test failing most of the time, and
when a developer finally fixes this issue, the test will begin to pass all
the time.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-941
Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#29016
2026-03-12 23:46:23 +02:00
Tomasz Grabiec
1256a9faa7 tablets: Fix deadlock in background storage group merge fiber
When it deadlocks, groups stop merging and compaction group merge
backlog will run-away.

Also, graceful shutdown will be blocked on it.

Found by flaky unit test
test_merge_chooses_best_replica_with_odd_count, which timed-out in 1
in 100 runs.

Reason for deadlock:

When storage groups are merged, the main compaction group of the new
storage group takes a compaction lock, which is appended to
_compaction_reenablers_for_merging, and released when the merge
completion fiber is done with the whole batch.

If we accumulate more than 1 merge cycle for the fiber, deadlock
occurs. Lock order will be this

Initial state:

 cg0: main
 cg1: main
 cg2: main
 cg3: main

After 1st merge:

 cg0': main [locked], merging_groups=[cg0.main, cg1.main]
 cg1': main [locked], merging_groups=[cg2.main, cg3.main]

After 2nd merge:

 cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main]

merge completion fiber will try to stop cg0'.main, which will be
blocked on compaction lock. which is held by the reenabler in
_compaction_reenablers_for_merging, hence deadlock.

The fix is to wait for background merge to finish before we start the
next merge. It's achieved by holding old erm in the background merge,
and doing a topology barrier from the merge finalizing transition.

Background merge is supposed to be a relatively quick operation, it's
stopping compaction groups. So may wait for active requests. It
shouldn't prolong the barrier indefinitely.

Tablet boost unit tests which trigger merge need to be adjusted to
call the barrier, otherwise they will be vulnerable to the deadlock.

Two cluster tests were removed because they assumed that merge happens
in the backgournd. Now that it happens as part of merge finalization,
and blocks topology state machine, those tests deadlock because they
are unable to make topology changes (node bootstrap) while background
merge is blocked.

The test "test_tablets_merge_waits_for_lwt" needed to be adjusted. It
assumed that merge finalization doesn't wait for the erm held by the
LWT operation, and triggered tablet movement afterwards, and assumed
that this migration will issue a barrier which will block on the LWT
operation. After this commit, it's the barrier in merge finalization
which is blocked. The test was adjusted to use an earlier log mark
when waiting for "Got raft_topology_cmd::barrier_and_drain", which
will catch the barrier in merge finalization.

Fixes SCYLLADB-928
2026-03-12 22:45:01 +01:00
Tomasz Grabiec
7706c9e8c4 replica: table: Propagate old erm to storage group merge 2026-03-12 22:45:01 +01:00
Tomasz Grabiec
582a4abeb6 test: boost: tablets_test: Save tablet metadata when ACKing split resize decision
Needs to be ordered before split finalization, because storage_group
must be in split mode already at finalization time. There must be
split-ready compaction groups, otherwise finalization fails with this
error:

  Found 0 split ready compaction groups, but expected 2 instead.

Exposed by increased split activity in tests.
2026-03-12 22:45:01 +01:00
Tomasz Grabiec
279fcdd5ff storage_service: Extract local_topology_barrier()
Will be called in tests. It does the local part of the global topology
barrier.

The comment:

        // We capture the topology version right after the checks
        // above, before any yields. This is crucial since _topology_state_machine._topology
        // might be altered concurrently while this method is running,
        // which can cause the fence command to apply an invalid fence version.

was dropped, because it's no longer true after
fad6c41cee, and it doesn't make sense in
the context of local_topology_barrier(). We'd have to propagate the
version to local_topology_barrier(), but it's pointless. The fence
version is decided before calling the local barrier, and it will be
valid even if local version moves ahead.
2026-03-12 22:44:56 +01:00
Avi Kivity
03186ce60d Merge 'Cleanup after auth v1 and default superuser code removal' from Marcin Maliszkiewicz
This is short cleanup after recent removal of creating default cassandra superuser and auth-v1 code removal.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1036

Backport: no, just code cleanup

Closes scylladb/scylladb#29004

* github.com:scylladb/scylladb:
  auth: remove DEFAULT_SUPERUSER_NAME constant and dead DEFAULT_USER_PASSWORD
  auth: use configurable default_superuser in describe_roles
  auth: move default_superuser to common, remove _superuser member
  auth: use LOCAL_ONE for all auth queries
  auth: remove get_auth_ks_name indirection
2026-03-12 23:44:32 +02:00
Avi Kivity
e2eeef3e01 Merge 'service level: remove remnants of version 1 service level' from Gleb Natapov
can_use_effective_service_level_cache() always returns true now, so the function can be dropped entirely and all the code that assumes it may return false can be dropped as well. Also drop async versions of find_effective_service_level and get_user_scheduling_group since they are unused.

No need to backport, code removal,

Closes scylladb/scylladb#29002

* github.com:scylladb/scylladb:
  service level: make maybe_update_per_service_level_params synchronous
  service level: remove unused get_user_scheduling_group function
  service level: drop async find_effective_service_level
  service level: remove remnants of version 1 service level
2026-03-12 23:39:41 +02:00
Botond Dénes
eed3a6d407 sstables/mx/writer: move post-cell write yield to collection write loop
Introduced by 54bddeb3b5, the yield was
added to write_cell(), to also help the general case where there is no
collection. Arguably this was unnecessary and this patch moves the yield
to write_collection(), to the cell write loop instead, so regular cells
don't have to poll the preempt flag.

Closes scylladb/scylladb#29013
2026-03-12 21:26:35 +02:00
Avi Kivity
e8a6706d6e Merge 'shorten some sleeps to speed up bootstrap in tests' from Patryk Jędrzejczak
This PR shortens two sleeps from 1s to 100ms to speed up bootstrap in tests.
The changed sleeps are:
- the pause duration in group0 discovery,
- the retry period in `wait_for_cql`.

Refs: https://scylladb.atlassian.net/browse/SCYLLADB-918

No backport: performance improvements mostly relevant to tests.

Closes scylladb/scylladb#29020

* github.com:scylladb/scylladb:
  test: pylib: util: wait for CQL being ready with a shorter period
  group0: discovery: shorten the pause duration
2026-03-12 21:17:05 +02:00
Wojciech Mitros
32974770b0 test: add tests for CQL forwarding
Add basic cluster tests for CQL forwarding.
The test cases include:
- basic reads and writes
- prepared statements with binds
- forwarding from a non-replica
- exception passthrough during forwarding (using an injection)
- re-preparing a statement on the target node, even if the user
  query is also an EXECUTE request on a prepared statement
- verification metric updates

The existing test_basic_write_read was modified so that a few extra
cases could be validated on the same cluster.
2026-03-12 19:43:35 +01:00
Wojciech Mitros
916a9995c1 transport: enable CQL forwarding for strong consistency statements
We enable CQL forwarding by starting to return the bounce_to_node
result message in redirect_statement() instead of throwing. The
forwarding code introduced in the preceding patches reacts to these
messages, allowing the requests to be forwarded.

With the update, some tests assuming that requests can't be forwarded
need to be adjusted, so we do that as well.
2026-03-12 19:43:35 +01:00
Wojciech Mitros
21a7b036a5 transport: add remote statement preparation for CQL forwarding
During forwarding of CQL EXECUTE requests, the target node may
not have the prepared statement in its cache. If we do have this
statement as a coordinator, instead of returning PREPARED NOT FOUND
to the client, we want to prepare the statement ourselves on target
node.
For that, we add a new FORWARD_CQL_PREPARE RPC. We use the new RPC
after gettting the prepared_not_found status during forwarding. When
we try to forward a request, we always have the query string (we
decide whether to forward based on this query), so we can always use
the new RPC when getting the prepared_not_found status.
After receiving the response, we try forwarding the EXECUTE request
again.
2026-03-12 19:43:35 +01:00
Wojciech Mitros
96a5e1c7ce transport: handle redirect responses in CQL forwarding
During CQL forwarding, when the target node can't handle the request,
it will find another node which can execute the request or which knows
where the request can be executed. We return this information in
responses to CQL forwarding, and in this patch, we add handling of
this kind of a response.
After getting a redirect response, we retry forwarding to the returned
host/shard until success or timeout. This can happen many times during
a single request, when we first forward to a replica and later to the
coordinator, or when a replica/coordinator migrated while we were
performing the forwarding
2026-03-12 19:43:31 +01:00
Wojciech Mitros
8816d3038c transport: add exception handling for forwarded CQL requests
When a forwarded request fails on the remote node, we can't use the
exception handling that happens in process_request_one because we
don't go through this code path. Instead, we use the previously
extracted cql_server::handle_exception handler, which performs
all accounting on the forwarded-to node, and which prepares the
response. For the read_failure_exception_with_timeout exception,
we need to perform the sleep on the source node, so we return the
timeout in the forwarding response and use it on the source node
to know how long to sleep without any extra calculations.

The handle_forward_execute() method is extracted from the inline handler
lambda to make the error catching wrapper cleaner.
2026-03-12 19:41:37 +01:00
Wojciech Mitros
23bff5dfef transport: add basic CQL request forwarding
Add the infrastructure for forwarding CQL requests to other nodes.
When a process() call results in a node bounce (as opposed to a shard
bounce), the coordinator serializes the request and sends it via the
FORWARD_CQL_EXECUTE RPC verb to the target node.

In this patch we omit several features that allow handling more
scenarios that can happen when trying to forward a CQL request,
but the RPC request and response are already prepared for them.
They will be handled in the following commits.
2026-03-12 19:41:35 +01:00
Avi Kivity
76b6784c1a Merge 'cql3: track CQL parsing memory cost and use it for admission control' from Marcin Maliszkiewicz
Use rolling_max_tracker to record gross bytes allocated during each
CQL parse.  The rolling maximum is then added to the memory estimate
for incoming QUERY and PREPARE requests so that the admission control
in the CQL transport layer accounts for parsing overhead.

The measured memory footprint serves as upper bound rather than
exact number but it's purpose is to prevent OOMs under unprepared
statements heavy load.

In benchmark 1G memory node shows decrease of non-LSA memory usage
from peak 320MB (our coordinator budget is 10% of 1G) to 96MB. While
tps drops from 1.2 kops to 0.8 kops. Drop in tps is expected as
memory admission kicks in trying to prevent OOM.

This is phase 1 of OOM prevention, potential next steps:
- add second admission in query_processor::get_statement trying to prevent potential thundering herd problem
- decrease cql_server memory pool size
- count reads in the memory pool
- add per service level memory pool and a shared one

Related https://scylladb.atlassian.net/browse/SCYLLADB-740
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-938

Backport: no, new feature, but we may reconsider if some customer needs it

Closes scylladb/scylladb#28919

* github.com:scylladb/scylladb:
  cql3: track CQL parsing memory cost and use it for admission control
  utils: add rolling max tracker
2026-03-12 19:59:52 +02:00
Wojciech Mitros
170b82ddca idl: add a representation of client_state for forwarding
In the following patches, when we start allowing to forward CQL
requests to other nodes, we'll need to use the same client state
for executing the request on the destination node as we had on the
source. client_state contains many fields and we need to create
a new instance of it when we start handling the forwarded request,
so to prepare for the forwarding RPC, we add a serializable format
of the client_state as an IDL struct. The new class is missing some
fields that are not used while executing requests, and some whose
value is determined by the fact that the client state is used for
a forwarded request.
These include:
- driver name, driver version, client options - not used for executing
requests. Instead, we use these as data sources for the virtual
"clients" system table.
- auth_state - must be READY - we reached a bounce message, so we were
able to try executing the request locally
- _control_connection - used for altering a cql_server::connection, which
we don't have on the target node
- _default_timeout_config - used when updating service levels, also only
per-connection
- workload_type - used for deciding whether to allow shedding at the
start of processing the request, and for getting per-connection service
level params (for an API)
2026-03-12 17:48:58 +01:00
Wojciech Mitros
b4a7fefe20 cql_server: handle query, execute, batch in one case
Currently we perform the same steps when handling query, execute
and batch CQL requests. So instead of creating multiple functions
performing these steps, we can handle them all in one fallthrough
case in cql_server::connection::process_request_one.
2026-03-12 17:48:58 +01:00
Wojciech Mitros
dadb87047c transport: inline process_on_shard in cql_server::process
The process_on_shard method is relatively short, it's only used
in the process() method and the Process concept that is uses
is as long as the function itself. This area will be made more
complex by the following patches for cql forwarding, so we simplify
it by inlining process_on_shard in cql_server::process.
2026-03-12 17:48:58 +01:00
Wojciech Mitros
24cdc3a10d transport: extract process() to cql_server
Move process() and process_on_shard() from cql_server::connection to
cql_server. The process() method is no longer a template - instead, it
takes an opcode parameter and uses get_process_fn_for_opcode() to select
the appropriate internal processing function.

The process_query, process_execute, and process_batch wrappers on
connection now delegate to _server.process() with the appropriate opcode.

This refactoring is preparation for CQL request forwarding, where
process() will need to be called from a context other than connection
- the forwarding RPC handler).
2026-03-12 17:48:57 +01:00
Wojciech Mitros
0e3469e89c transport: add messaging_service to cql_server
The messaging service will be used by cql_server to register RPC
handlers for forwarding CQL requests between nodes.
We pass it through the controller to cql_server.
2026-03-12 17:48:57 +01:00
Wojciech Mitros
1376caf980 transport: add response reconstruction helpers for forwarding
Expose response::flags() and response::extract_body(), and a new constructor.
It will be needed for creating a cql_transport::response from the response body returned
during CQL forwarding.
2026-03-12 17:48:57 +01:00
Wojciech Mitros
e44820ba1f transport: generalize the bounce result message for bouncing to other nodes
In the following patches, we'll start allowing forwarding requests to strongly
consistent tables so that they'll get executed on the suitable tablet Raft group
members. For that we'll reuse the approach that we already have for bouncing
requests to other shards - we'll try to execute a request locally, and the
result of that will be a bounce message with another replica as the target.
In this patch we generalize the former bounce_to_shard result message so that
it will be able to specify the target of the bounce as another shard or specific
replica.
We also rename it to result_message::bounce so that it stops implying that only
another shard may be its target.
Aside from the host_id and the shard, the new message also includes the timeout,
because in the service handling the forwarding we won't have the access to it,
and it's needed for specifying how long we should wait for the forwarded
requests. It also includes an information whether this is a write request
to return correct timeout response in case the deadline is exceeded.
We will return other hosts in the new bounce message when executing requests to
strongly consistent tables when we can't handle the request because we aren't
a suitable replica. We can't handle this message yet, so we don't return it
anywhere and we still assume that every bounce message is a bounce to the same
host.
2026-03-12 17:48:57 +01:00
Wojciech Mitros
b4d66fda2e strong consistency: redirect requests to live replicas from the same rack
Forwarding CQL requests is not implemented yet, but we're already
prepared to return the target to forward to when trying to execute
strongly consistent requests. Currently, if we're not a replica
of the affected tablet, we redirect the request to the first replica
in the list.
This is not optimal, because this replica may be down or it may be
in another rack, making us perform cross-rack requests during forwarding.
Instead, we should forward the request to the replica from the same
rack and handle the case where the replica is down.

In this patch we change the replica selection for forwarding strongly
consistent requests, so that when the coordinator isn't a replica, it
redirects the request to the replica from the same rack.

If the replica from the same rack is down, or there is no replica in
our rack, we choose the next closest replica (preferring same-DC replicas
over other DCs). If no replica is alive, the query fails - the driver
should retry when some replica comes back up.
2026-03-12 17:48:54 +01:00
Andrzej Jackowski
3b9cd52a95 reader_concurrency_semaphore_test: detect memory leak on preemptive abort of waiting_for_memory permit
A permit in `waiting_for_memory` state can be preemptively aborted by
maybe_admit_waiters(). This is wrong: such permits have already been
admitted and are actively processing a read — they are merely blocked
waiting for memory under serialize-limit pressure.

When `on_preemptive_aborted()` fires on a `waiting_for_memory` permit,
it does not clear `_requested_memory`. A subsequent `request_memory()`
call accumulatesa on top of the stale value, causing `on_granted_memory()`
to consume more than resource_units tracks.

This commit adds a test that confirms that scenario by counting
internal_errors.
2026-03-12 17:09:34 +01:00
Alex
7fd39ba586 test/cluster: strengthen raft voters multi-DC test and tune debug runtime
The test_raft_voters_multidc_kill_dc scenario had become weaker after group0 voter count was made always odd.
  In particular, the old num_nodes == 1 case (dc1=2, dc2=1, dc3=1) could pass even without the intended balancing logic, because with 3 voters total we naturally get one voter per DC.

  This change restores coverage of the original intent:

  - Replace num_nodes parametrization with explicit DC triples.
  - Use (3, 1, 1) to force a meaningful asymmetric topology where voter placement logic is required.
  - Keep a larger topology case (6, 3, 3) for broader coverage.
  - Mark (6, 3, 3) as skip_mode(debug) with reason:
    larger topology case is too slow in debug on minipcs.

  Also updated comments/docstring to match the new setup.

Fixes: SCYLLADB-794

backport: None, it is done to deflake minipcs that will start working only on master

Closes scylladb/scylladb#29000
2026-03-12 17:07:45 +01:00
Wojciech Mitros
309abc44d9 transport: pass foreign_ptr into sleep_until_timeout_passes and move it to cql_server
Change sleep_until_timeout_passes() to accept a foreign_ptr<std::unique_ptr<response>>.
We can easily create the foreign_ptr for the responses created in the CQL server,
but we'll need this when we get responses when forwarding CQL statements - the responses
may come from other shards.
We also move it from cql_server::connection to cql_server, because for forwarded CQL
requests, we'll need to handle it at the cql_server level.
The method also loses its const qualifier - the abort_source that we pass into
sleep_abortable needs to be non-const. Apparently, we could still use it in a const
method of cql_server::connection because we passed it as _server._abort_source which
caused the const qualifier to be lost.
2026-03-12 16:03:14 +01:00
Marcin Maliszkiewicz
975cd60e05 ldap: fix use-after-move crash in ldap_reuser::reap()
After stop() moved _reaper, in-flight with_connection() callbacks could
still call reap(), which accessed the moved-from future causing a
SIGSEGV in future_base::detach_promise(). Add a seastar::gate so
stop() waits for all in-flight operations before moving _reaper.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1043

Closes scylladb/scylladb#29015
2026-03-12 16:48:45 +02:00
Patryk Jędrzejczak
c50cf32793 test: pylib: util: wait for CQL being ready with a shorter period
`wait_for_cql` is used in hundreds, if not thousands, of places in tests.
We shouldn't waste up to 1s for every call.

Also, the 1s period is clearly too long compared to the bootstrap time,
which is usually 0-3s in dev mode.

The following test speeds up from 50s to 42s with the change:
```
for _ in range(10):
    servers = await manager.servers_add(3)
    await manager.get_ready_cql(servers)
```
2026-03-12 15:40:19 +01:00
Patryk Jędrzejczak
f85628a9a0 group0: discovery: shorten the pause duration
Nodes currently pause group0 discovery for 1s. This case is always hit while
adding multiple nodes in parallel to an empty cluster by all nodes except the
one that becomes the group0 leader.

This is fine in production, but in tests, the slowdown is quite significant.
Every `manager.servers_add(n)` call for n > 1 becomes 1s slower when the
cluster is empty. Many cluster tests are affected.

In this commit, we decrease the sleep duration from 1s to 100ms to speed up
tests. The consequence of this change is that nodes might perform more steps
in group0 discovery, but the increase in CPU usage and network traffic should
be negligible.
2026-03-12 15:40:18 +01:00
Gleb Natapov
c67f876893 service level: make maybe_update_per_service_level_params synchronous
It does not call async functions any more.
2026-03-12 15:53:08 +02:00
Benny Halevy
b3fec20960 test_tablets_migration: test_staging_backlog_is_preserved_with_file_based_streaming: convert for loop to asyncio.gather
Currently the test iterates on all servers and calls manager.api.disable_injection
but it doesn't await those calls.
Use asyncio.gather to await all calls in parallel.

Co-authored-by: Copilot CLI
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-12 15:26:40 +02:00
Benny Halevy
61d5a2df02 test_tablets_migration: test_tablet_back_and_forth_migration: await move_tablet
Co-authored-by: Copilot CLI
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-12 15:26:40 +02:00
Benny Halevy
b8655748a2 test_tablets_migration: test_restart_in_cleanup_stage_after_cleanup: await move_task
Co-authored-by: Copilot CLI
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-12 15:26:40 +02:00
Benny Halevy
10dccc2c4e test_tablets_migration: test_restart_leaving_replica_during_cleanup: await move_task
Co-authored-by: Copilot CLI
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-12 15:26:40 +02:00
Benny Halevy
c9d653fb1e test_tablets_migration: drop unused imports from cassandra.query
Co-authored-by: Copilot CLI
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-12 15:26:40 +02:00
Gleb Natapov
c30907b8f2 service level: remove unused get_user_scheduling_group function 2026-03-12 14:28:26 +02:00
Gleb Natapov
a934d8391d service level: drop async find_effective_service_level
find_cached_effective_service_level does exactly same thing now and it
is synchronous.
2026-03-12 14:28:26 +02:00
Botond Dénes
15cfa5beeb mutation/collection_mutation: don't copy the serialized collection
serialize_collection_mutation() copies the serialized collection into
the returned collection_mutation object. Change to move to avoid the
copy.

Fixes: SCYLLADB-1041

Closes scylladb/scylladb#29010
2026-03-12 13:57:40 +02:00
Gleb Natapov
f888f2dced service level: remove remnants of version 1 service level
can_use_effective_service_level_cache() always returns true now, so the
function can be dropped entirely and all the code that assumes it may
return false can be dropped as well.
2026-03-12 12:27:52 +02:00
Nadav Har'El
27f0510280 test/alternator: test_gzip_request_oversized now passes on AWS
The Alternator test test_compressed_request.py::test_gzip_request_oversized
checks that a very large request that compresses to a small size is still
rejected. This test passed on Alternator, but used to fail on DynamoDB
because DynamoDB didn't reject this case. This was a bug in DynamoDB
(a "decompression bomb" vulnerability), and after I reported it, it
was fixed.

So now this test does pass on DynamoDB (after a small modification to
allow for different error codes). So remove its scylla_only marker,
and make the comment true to the current state.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28820
2026-03-12 10:41:56 +01:00
Marcin Maliszkiewicz
b277d9d9aa cql3: track CQL parsing memory cost and use it for admission control
Use rolling_max_tracker to record gross bytes allocated during each
CQL parse.  The rolling maximum is then added to the memory estimate
for incoming QUERY and PREPARE requests so that the admission control
in the CQL transport layer accounts for parsing overhead.

The measured memory footprint serves as upper bound rather than
exact number but it's purpose is to prevent OOMs under unprepared
statements heavy load.

In benchmark 1G memory node shows decrease of non-LSA memory usage
from peak 320MB (our coordinator budget is 10% of 1G) to 96MB. While
tps drops from 1.2 kops to 0.8 kops. Drop in tps is expected as
memory admission kicks in trying to prevent OOM.
2026-03-12 10:16:10 +01:00
Botond Dénes
0b19a6de85 tombstone_gc: tombstone_gc_state::for_tests(): remove unused param
Closes scylladb/scylladb#28923
2026-03-12 10:01:42 +01:00
Marcin Maliszkiewicz
2d22eea2f9 Merge 'cql3: Replace SCYLLA_ASSERT and abort by throwing_assert' from Nadav Har'El
In this patch we replace every single use of SCYLLA_ASSERT(), abort() and assert() in the cql3/ directory by throwing_assert().

The problem with SCYLLA_ASSERT()/abort()/assert() is that when it fails, it crashes Scylla. This is almost always a bad idea (see #7871 discussing why), but it's even riskier in front-end code like cql3/: In front-end code, there is a risk that due to a bug in our code, a specific user request can cause Scylla to crash. A malicious user can send this query to all nodes and crash the entire cluster. When the user is not malicious, it causes a small problem (a failing request) to become a much worse crash - and worse, the user has no idea which request is causing this crash and the crash will repeat if the same request is tried again.

All of this is solved by using the new throwing_assert(), which is the same as SCYLLA_ASSERT() but throws an exception (using on_internal_error()) instead of crashing. The exception will prevent the code path with the invalid assumption from continuing, but will result in only the current user request being aborted, with a clear error message reporting the internal server error due to an assertion failure.

I reviewed all the changes that I did in these patches to check that (to the best of my understanding) none of the assertions in cql3/ involve the sort of serious corruption that might require crashing the Scylla node entirely.

throwing_assert() also improves logging of assertion failures compared to the original SCYLLA_ASSERT()/abort() - SCYLLA_ASSERT() printed a message to stderr which in many installations is lost, and abort() often prints no message at all. But throwing_assert() uses Scylla's standard logger, and also includes a backtrace in the log message.

Fixes #13970 (Exorcise assertions from CQL code paths)
Refs #7871 (Exorcise assertions from Scylla)

Closes scylladb/scylladb#28847

* github.com:scylladb/scylladb:
  cql3: remove unnecessary assert()
  cql3: replace abort() by throwing_assert()
  cql3: Replace SCYLLA_ASSERT by throwing_assert
2026-03-12 09:09:24 +01:00
Szymon Malewski
3116db6c2d test: fix testJsonOrdering
The `test/cqlpy/cassandra_tests/validation/entities/json_test.py::testJsonOrdering` was failing because of differences between Cassandra and Scylla in printing
JSON floating point values - e.g. Cassandra prints 30.0, where Scylla prints 30.
Both are valid, so in this patch, instead of comparing strings, we compare parsed JSON using `EquivalentJson`.

Fixes #28467

Closes scylladb/scylladb#28924
2026-03-12 09:07:08 +01:00
Marcin Maliszkiewicz
5b2a07b408 utils: add rolling max tracker
We will use it later to track parser memory
usage via per query samples.

Tests runtime in dev: 1.6s
2026-03-12 08:56:41 +01:00
Marcin Maliszkiewicz
54ef8fca57 auth: remove DEFAULT_SUPERUSER_NAME constant and dead DEFAULT_USER_PASSWORD
DEFAULT_SUPERUSER_NAME is no longer referenced after removing the
role_part special-casing in describe_roles. DEFAULT_USER_PASSWORD was
dead code too.
2026-03-12 08:46:00 +01:00
Marcin Maliszkiewicz
029410e159 auth: use configurable default_superuser in describe_roles
Replace the hardcoded meta::DEFAULT_SUPERUSER_NAME comparison with
default_superuser(_qp) which reads from the auth_superuser_name
config option. This makes the IF NOT EXISTS clause in DESCRIBE
output correct for clusters with a non-default superuser name.
2026-03-12 08:45:47 +01:00
Nadav Har'El
09a399ae3c Merge 'Replace estimated_histogram with approx_exponential_histogram - alternator' from Amnon Heiman
_"A journey of a thousand miles begins with a single step" Lao Tzu_

ScyllaDB uses estimated_histogram in many places.
We already have a more efficient alternative: approx_exponential_histogram. It is both CPU and
memory-efficient and can be exported as Prometheus native histograms.

Its main limitation (which has its benefits) is that the bucket layout is fixed at compile time, so
histograms with different configurations cannot be mixed.

The end goal is to replace all uses of estimated_histogram in the codebase.
That migration needs a few small API adjustments, so I am splitting the work
into steps for easier review.

This series is the first step. It introduces a base template for fixed-size
estimated histograms, and switches the Alternator's estimated_histogram with the template.
This change is self-contained and valuable on its own, while keeping the scope limited.

Minor adjustments were made to the code and tests so that the tests would pass.

Follow-up PRs will apply the same pattern to the rest of the code.

**New feature no need to backport**

Closes scylladb/scylladb#28987

* github.com:scylladb/scylladb:
  alternator: migrate to operation_size_kb histograms
  test/alternator/test_metrics.py: Update the bucket in the histogram search
  alternator: Use batch_histogram for batch size histograms
  estimated_histogram.hh: adds estimated_histogram_with_max
2026-03-12 00:06:16 +02:00
Wojciech Mitros
b1bd206147 transport: extract the error handling from process_request_one
When we forward CQL statements, we'll need to handle the errors
on the destination node. Only for read_failure_exception_with_timeout
exception, we'll still need to wait until timeout passes on the
source node.
For that we extract the exception handling to a separate method.
Additionally, we separate the waiting and all other handling,
so that all handling aside from waiting will be reusable after
forwarding, and we'll also be able to sleep on the source node
if necessary.
2026-03-11 19:40:47 +01:00
Wojciech Mitros
6184b1d5ea transport: move error response helpers from connection to cql_server
These methods are used only in the error handler in the cql server,
and outside of 3 cases, they don't need any information from the
cql_server::connection. We move them from cql_server::connection
to cql_server, so that they can be used in the following patches
for methods for CQL request forwarding where we'll have no instance
of cql_server::connection on the node forwarded to.

After the change the methods require no access to the server's
or connection's fields, so we also make them static methods.
2026-03-11 19:40:47 +01:00
Amnon Heiman
1339a44163 alternator: migrate to operation_size_kb histograms
Switch Alternator operation-size metrics from the legacy estimated
histogram implementation to estimated_histogram_with_max<512> and export
them through the native approx-exponential histogram path.

Add a dedicated operation-size histogram type alias based on
estimated_histogram_with_max<512>.
Replace all per-operation size histograms (GetItem/PutItem/DeleteItem/
UpdateItem/BatchGetItem/BatchWriteItem) with the new type.

Remove the custom legacy histogram-to-metrics adapter and use
to_metrics_histogram() for operation size metrics, aligning export
behavior with other approx-exponential histograms.
Update Alternator metrics tests to compute expected le bucket boundaries using
approx-exponential bucket math (including deduplication of equal
bounds), so assertions match the new exported histogram schema.
Update bucket helper signatures to use (max, precision) parameters and keep
+Inf handling unchanged.

Replace byte-to-KB ceiling conversion with plain integer division (bytes
/ 1024): histogram export already reports each bucket by its upper bound
(le), so rounding input values up before bucketing is unnecessary and
would over-shift borderline samples into higher buckets.
2026-03-11 17:29:14 +02:00
Marcin Maliszkiewicz
adc840919b auth: move default_superuser to common, remove _superuser member
Move default_superuser() to auth::meta in common.{hh,cc} and remove the
cached _superuser member from both standard_role_manager and
password_authenticator. The superuser name comes from config which is
immutable at runtime, so caching it is unnecessary.
2026-03-11 16:28:38 +01:00
Marcin Maliszkiewicz
993e06c1ae auth: use LOCAL_ONE for all auth queries
Removes auth-v1 hack for cassandra superuser as auth-v1
code no longer exists.

Also CL is not really used when quering raft replicated
tables (like auth ones), but LOCAL_ONE is the least confusing
one.
2026-03-11 16:27:15 +01:00
Marcin Maliszkiewicz
6d1153687a auth: remove get_auth_ks_name indirection
Replace get_auth_ks_name(qp) with db::system_keyspace::NAME directly.
The function always returned the constant "system" and its qp
parameter was unused.
2026-03-11 16:26:47 +01:00
David
79f9967eaa docs: update theme 1.9
Motivation

Upgrades Sphinx to 9.x, MyST Parser to 5.x, Python to 3.11+–3.14, Node.js to 22, and replaces Poetry with uv for dependency management.

Changelog: https://github.com/scylladb/sphinx-scylladb-theme/blob/master/docs/source/upgrade/CHANGELOG.md#190---26-february-2026
How to test

* Make sure you are using Python 3.11-3.14:
* python --version
* Install uv:
* make setupenv
* Build the docs:
* make preview
* Docs should render without errors at http://127.0.0.1:5500

Closes scylladb/scylladb#28971
2026-03-11 16:56:51 +02:00
Aleksandra Martyniuk
2e68f48068 nodetool: cluster repair: do not fail if a table was dropped
nodetool cluster repair without additional params repairs all tablet
keyspaces in a cluster. Currently, if a table is dropped while
the command is running, all tables are repaired but the command finishes
with a failure.

Modify nodetool cluster repair. If a table wasn't specified
(i.e. all tables are repaired), the command finishes successfully
even if a table was dropped.

If a table was specified and it does not exist (e.g. because it was
dropped before the repair was requested), then the behavior remains
unchanged.

Fixes: SCYLLADB-568.

Closes scylladb/scylladb#28739
2026-03-11 16:35:04 +02:00
Dani Tweig
45d7d9a96c .github/workflow: also call call_sync_milestone_to_jira.yml for close milestone event
What changed

* Added closed to milestone event types in
  call_sync_milestone_to_jira.yml (types: [created] -> types: [created, closed])
* Added VECTOR to the list of Jira project keys being synced
  (jira_project_keys: SCYLLADB,CUSTOMER,SMI,RELENG -> jira_project_keys: SCYLLADB,CUSTOMER,SMI,RELENG,VECTOR)

Why (Requirements Summary)

* The call_sync_milestone_to_jira.yml workflow only triggered on
  milestone creation. When a GitHub milestone is closed, the
  corresponding Jira versions (in SCYLLADB, CUSTOMER, SMI, RELENG
  projects) should be marked as released. Adding the closed trigger
  enables the called workflow (main_sync_milestone_to_jira_release.yml
  in github-automation) to handle both creating and releasing Jira
  versions from GitHub milestone events.
* Added the VECTOR project so its Jira versions are also created/released
  when milestones are created or closed in scylladb.git.
* This is consistent with the same change already applied to the staging
  and scylla-machine-image repos.

Fixes:PM-216

Update call_sync_milestone_to_jira.yml in scylladb.git - add close trigger and VECTOR project sync

Closes scylladb/scylladb#28981
2026-03-11 15:56:55 +02:00
Amnon Heiman
69fbcd32bd test/alternator/test_metrics.py: Update the bucket in the histogram search 2026-03-11 15:24:05 +02:00
Amnon Heiman
50af1f3671 alternator: Use batch_histogram for batch size histograms
Switch batch-related histograms to estimated_histogram_with_max.
Results with better memory consumption and improve efficiency.
2026-03-11 15:21:25 +02:00
Amnon Heiman
b22162c719 estimated_histogram.hh: adds estimated_histogram_with_max
This patch adds estimated_histogram_with_max template that will be a
based for specific estimated_histograms, eventually replacing the current
struct implementation.

Introduce estimated_histogram_with_max<Max> as a reusable wrapper around
approx_exponential_histogram<1, Max, 4>, providing merge support and the
same add helpers used by existing estimated_histogra type.

Add estimated_histogram_with_max_merge()

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-03-11 15:02:37 +02:00
Radosław Cybulski
fe8117feee alternator: fix shard's parent calculation for vnodes
Fix an invalid condition, when searching for a parent shard, when table
is based on vnodes. Shards have associated with them `last token` -
token, than marks the end of the range of tokens they consume (inclusive).
An additional assumptions are whole token space is used and
(for vnodes) token space wraps around.

Previously code looked like this:
    auto pid = std::upper_bound(..., [](const dht::token& t, const cdc::stream_id& id) {
                    return t < id.token();
    });
    if (pid != pids.begin()) {
        pid = std::prev(pid);
    }

An `upper_bound` call with `t < id.token()` means it is looking for
an iterator, for which value `t < id.token()` changed to true,
which effectively means a position, where iterator is bigger
then searched value. Then we move iterator backward once if possible.
Assuming token space <-2, 2> and parents [0, 2], when we search for:
- -1 -> we will get 0, it's first, so we can't move backward, so 0 (ok)
- 0 -> we will get 2, it's not first, so we go back and we return 0 (ok)
- 1 -> we will get 2, it's not first, so we go back and we return 0
      (not ok - should be 2)

The fix is to replace it with `std::lower_bound` and remove conditional
backward motion. Since we've a guarantees that whole token space is used
if `std::lower_bound` ends with `end()` value, then we have a wrap
around case and we need to pick `begin()` as result.

Fixes #28354
Fixes: SCYLLADB-537

Closes scylladb/scylladb#28382
2026-03-11 14:51:42 +02:00
Calle Wilund
bc544eb08e gcs_fixture: Change to use docker helper 2026-03-11 12:32:02 +01:00
Calle Wilund
eb2dfe04e1 aws_kms_fixture: Modify to use docker helper 2026-03-11 12:32:02 +01:00
Calle Wilund
4a8afd9649 test/lib/proc_util: Add docker helper
Adds boost test equivalent of dockerized_service to
handle launching dockerized mock service using ephermal port,
query port and return the process.
2026-03-11 12:32:02 +01:00
Calle Wilund
3e8a9a0beb pytest: use ephemeral port publish for docker mock servers
Changes dockerized_service to use ephermal port publish, and
query the published port from podman/docker.
Modifies client code to use slightly changed usage syntax.
2026-03-11 12:32:01 +01:00
Piotr Dulikowski
d9a277453e Merge 'cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race' from Alex Dathskovsky
query_processor::prepare() could race with prepared statement invalidation: after loading from the prepared cache, we converted the cached object to a checked weak pointer and then continued asynchronous work (including error-injection waitpoints). If invalidation happened in that window, the weak handle could no longer be promoted and the prepare path could fail nondeterministically.

This change keeps a strong cache entry reference alive across the whole critical section in prepare() by using a pinned cache accessor (get_pinned()), and only deriving the weak handle while the entry is pinned. This removes the lifetime gap without adding retry loops.

  Test coverage was extended in test/cluster/test_prepare_race.py:

  - reproduces the invalidation-during-prepare window with injection,
  - verifies prepare completes successfully,
  - then invalidates again and executes the same stale client prepared object,
  - confirms the driver transparently re-requests/re-prepares and execution succeeds.

  This change introduces:

  - no behavior change for normal prepare flow besides stronger lifetime guarantees,
  - no new protocol semantics,
  - preserves existing cache invalidation logic,
  - adds explicit cluster-level regression coverage for both the race and driver reprepare path.
  - pushes the re prepare operation twards the driver, the server will return unprepared error for the first time and the driver will have to re prepare during execution stage

Fixes: https://github.com/scylladb/scylladb/issues/27657

Backport to active branches recommended: No node crash, but user-visible PREPARE failures under rare schema-invalidation race; low-risk timeout-bounded retry improves robustness.

Closes scylladb/scylladb#28952

* github.com:scylladb/scylladb:
  transport/messages: hold pinned prepared entry in PREPARE result
  cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race
2026-03-11 12:09:23 +01:00
Calle Wilund
e3e940bc47 dbuild: Use container network in dbuild nested containers
Remove the host network setting, ensuring we use private networks
(slirp4netns). This will allow nested container port aliasing,
helping CI stability (can use ephemeral ports and container
introspection).

This also makes the nested podman setup non-conditional,
since we only run podman containers inside dbuild, and need
the setup regardless if host container is docker or not.
2026-03-11 12:05:51 +01:00
Calle Wilund
8a56eafd39 scylla_cluster: Read notify sock in background to prevent deadlock
Starts a thread to process scylla notify messages (NOTIFY_SOCKET)
instead of just processing inline, non-blocking. This because it
is possible for the pipe created to be to small to hold enough
messages for us to reach the point where we otherwise even read
from said pipe, allowing other end (scylla) to proceed.
2026-03-11 11:59:00 +01:00
Patryk Jędrzejczak
37aeba9c8c Merge 'raft: add global read barrier to group0_batch::commit and switch auth and service levels' from Marcin Maliszkiewicz
This series adds a global read barrier to raft_group0_client, ensuring that Raft group0 mutations are applied on all live nodes before returning to the caller.

Currently, after a group0_batch::commit, the mutations are only guaranteed to be applied on the leader. Other nodes may still be catching up, leading to stale reads. This patch introduces a broadcast read barrier mechanism. Calling  send_group0_read_barrier_to_live_members after committing will cause the coordinator to send a read barrier RPC to all live nodes (discovered via gossiper) and waits for them to complete. This is best effort attempt to get cluster-wide visibility of the committed state before the response is returned to the user.

Auth and service levels write paths are switched to use this new mechanism.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-650

Backport: no, new feature

Closes scylladb/scylladb#28731

* https://github.com/scylladb/scylladb:
  test: add tests for global group0_batch barrier feature
  qos: switch service levels write paths to use global group0_batch barrier
  auth: switch write paths to use global group0_batch barrier
  raft: add function to broadcast read barrier request
  raft: add gossiper dependency to raft_group0_client
  raft: add read barrier RPC
2026-03-11 10:37:19 +01:00
Botond Dénes
54bddeb3b5 sstables/mx/writer: yield after writing a cell
With the goal of avoiding stalls on writing large collections, like
below:
    ++[0#1/1 100%] addr=0x5422d1e total=32 count=1 avg=32:
    |              seastar::backtrace<seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}> at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:85
    ++           - addr=0x541b6d4:
    |              seastar::backtrace_buffer::append_backtrace_oneline at ./build/release/seastar/./seastar/src/core/reactor.cc:811
    |              (inlined by) seastar::print_with_backtrace at ./build/release/seastar/./seastar/src/core/reactor.cc:838
    ++           - addr=0x541afb7:
    |              seastar::internal::cpu_stall_detector::generate_trace at ./build/release/seastar/./seastar/src/core/reactor.cc:1479
    ++           - addr=0x541b86c:
    |              seastar::internal::cpu_stall_detector::maybe_report at ./build/release/seastar/./seastar/src/core/reactor.cc:1214
    |              (inlined by) seastar::internal::cpu_stall_detector::on_signal at ./build/release/seastar/./seastar/src/core/reactor.cc:1234
    |              (inlined by) seastar::reactor::block_notifier at ./build/release/seastar/./seastar/src/core/reactor.cc:1548
    /opt/scylladb/libreloc/libc.so.6: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=f83d43b9b4b0ed5c2bd0a1613bf33e08ee054c93, for GNU/Linux 3.2.0, not stripped

    ++           - addr=/opt/scylladb/libreloc/libc.so.6+0x1a28f:
    |              sigpending at ??:0
    ++           - addr=0x1760bf6:
    |              std::basic_string_view<signed char, std::char_traits<signed char> >::remove_prefix at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/string_view:302
    |              (inlined by) managed_bytes_basic_view<(mutable_view)0>::remove_prefix at ././utils/managed_bytes.hh:421
    |              (inlined by) _Z11read_simpleIlTk14FragmentedView24managed_bytes_basic_viewIL12mutable_view0EEET_RT0_ at ././utils/fragment_range.hh:365
    |              (inlined by) _ZL9get_fieldIlTk14FragmentedView24managed_bytes_basic_viewIL12mutable_view0EEQsr3stdE12is_trivial_vIT_EES3_T0_j at ././mutation/atomic_cell.hh:62
    |              (inlined by) atomic_cell_type::timestamp at ././mutation/atomic_cell.hh:103
    |              (inlined by) basic_atomic_cell_view<(mutable_view)0>::timestamp at ././mutation/atomic_cell.hh:232
    |              (inlined by) sstables::mc::writer::write_cell at ./sstables/mx/writer.cc:1101
    |              (inlined by) sstables::mc::writer::write_collection(bytes_ostream&, clustering_key_prefix const*, column_definition const&, collection_mutation_view, sstables::mc::writer::row_time_properties const&, bool)::$_0::operator() at ./sstables/mx/writer.cc:1233
    |              (inlined by) collection_mutation_view::with_deserialized<sstables::mc::writer::write_collection(bytes_ostream&, clustering_key_prefix const*, column_definition const&, collection_mutation_view, sstables::mc::writer::row_time_properties const&, bool)::$_0> at ././mutation/collection_mutation.hh:97
    |              (inlined by) sstables::mc::writer::write_collection at ./sstables/mx/writer.cc:1221
    ++           - addr=0x1677af3:
    |              sstables::mc::writer::write_cells at ./sstables/mx/writer.cc:1261
    |              (inlined by) sstables::mc::writer::write_row_body at ./sstables/mx/writer.cc:1287
    |              (inlined by) sstables::mc::writer::write_clustered at ./sstables/mx/writer.cc:1377
    |              (inlined by) _ZN8sstables2mc6writer15write_clusteredI14clustering_rowQ9ClusteredIT_EEEvRKS4_9tombstone at ./sstables/mx/writer.cc:766
    |              (inlined by) sstables::mc::writer::consume at ./sstables/mx/writer.cc:1425

Putting the yield in write_cell() instead of in write_collection() means
that writing any row benefits from the added yield point in the middle.

Refs: SCYLLADB-964

Closes scylladb/scylladb#28948
2026-03-11 10:34:55 +01:00
Botond Dénes
475220b9c9 Merge 'Remove the rest of pre raft topology code' from Gleb Natapov
Remove the rest of the code that assumes that either group0 does not exist yet or a cluster is till not upgraded to raft topology. Both of those are not supported any more.

No need to backport since we remove functionality here.

Closes scylladb/scylladb#28841

* github.com:scylladb/scylladb:
  service level: remove version 1 service level code
  features: move GROUP0_SCHEMA_VERSIONING to deprecated features list
  migration_manager: remove unused forward definitions
  test: remove unused code
  auth: drop auth_migration_listener since it does nothing now
  schema: drop schema_registry_entry::maybe_sync() function
  schema: drop make_table_deleting_mutations since it should not be needed with raft
  schema: remove calculate_schema_digest function
  schema: drop recalculate_schema_version function and its uses
  migration_manager: drop check for group0_schema_versioning feature
  cdc: drop usage of cdc_local table and v1 generation definition
  storage_service: no need to add yourself to the topology during reboot since raft state loading already did it
  storage_service: remove unused functions
  group0: drop with_raft() function from group0_guard since it always returns true now
  gossiper: do not gossip TOKENS and CDC_GENERATION_ID any more
  gossiper: drop tokens from loaded_endpoint_state
  gossiper: remove unused functions
  storage_service: do not pass loaded_peer_features to join_topology()
  storage_service: remove unused fields from replacement_info
  gossiper: drop is_safe_for_restart() function and its use
  storage_service: remove unused variables from join_topology
  gossiper: remove the code that was only used in gossiper topology
  storage_service: drop the check for raft mode from recovery code
  cdc: remove legacy code
  test: remove unused injection points
  auth: remove legacy auth mode and upgrade code
  treewide: remove schema pull code since we never pull schema any more
  raft topology: drop upgrade_state and its type from the topology state machine since it is not used any longer
  group0: hoist the checks for an illegal upgrade into main.cc
  api: drop get_topology_upgrade_state and always report upgrade status as done
  service_level_controller: drop service level upgrade code
  test: drop run_with_raft_recovery parameter to cql_test_env
  group0: get rid of group0_upgrade_state
  storage_service: drop topology_change_kind as it is no longer needed
  storage_service: drop check_ability_to_perform_topology_operation since no upgrades can happen any more
  service_storage: remove unused functions
  storage_service: remove non raft rebuild code
  storage_service: set topology change kind only once
  group0: drop in_recovery function and its uses
  group0: rename use_raft to maintenance_mode and make it sync
2026-03-11 10:24:20 +02:00
Piotr Dulikowski
38a2829f69 Merge 'Return HTTP error description in Vector Store client' from Szymon Wasik
The `service_error` struct: 6dc2c42f8b/service/vector_store_client.hh (L64)
currently stores just the error status code. For this reason whenever the HTTP error occurs, only the error code can be forwarded to the client. For example see here: 6dc2c42f8b/service/vector_store_client.cc (L580)
For this reason in the output of the drivers full description of the error is missing which forces user to take a look into Scylla server logs.

The objective of this PR is to extend the support for HTTP errors in Vector Store client to handle messages as well.

Moreover, it removes the quadratic reallocation in response_content_to_sstring() helper function that is used for getting the response in case of error.

Fixes: VECTOR-189

Closes scylladb/scylladb#26139

* github.com:scylladb/scylladb:
  vector_search: Avoid quadratic reallocation in response_content_to_sstring
  vector_store_client: Return HTTP error description, not just code
2026-03-11 09:19:27 +01:00
Calle Wilund
6d8ac23731 test_encryption: Use maximum replication in _smoke_test
Refs: SCYLLADB-557

We should use full replication in KS/CF creation and population,
for at least two reasons:
1.) Ensure we wait fully for and write to all nodes
2.) Make test more "real", behaving like a proper cluster

Closes scylladb/scylladb#28959
2026-03-11 09:54:57 +02:00
Nadav Har'El
00a819bcd8 cql3: remove unnecessary assert()
In cql3/, there was one call to assert() (not SCYLLA_ASSERT or
throwing_assert), and it was:

        const auto shard_num = smp::count;
        assert(shard_num > 0)

Rather than converting this assert() to throwing_assert() as I did in
previous patches, I decided to outright remove it: Seastar guarantees
that smp::count is not zero. Many other places in the code use
smp::count assuming that it is correct, no other place bothers to assert
it isn't zero.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-03-11 09:43:24 +02:00
Nadav Har'El
34eec020b3 cql3: replace abort() by throwing_assert()
After the previous patch replaced all SCYLLA_ASSERT() calls by
throwing_assert(), this patch also replaces all calls to abort().

All these abort() calls are supposedly cases that can never happen,
but if they ever do happen because of a bug, in none of these places
we absolutely need to crash - and exception that aborts the current
operation should be enough.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-03-11 09:43:11 +02:00
Nadav Har'El
c87d6407ed cql3: Replace SCYLLA_ASSERT by throwing_assert
In this patch we replace every single use of SCYLLA_ASSERT() in the cql3/
directory by throwing_assert().

The problem with SCYLLA_ASSERT() is that when it fails, it crashes Scylla.
This is almost always a bad idea (see #7871 discussing why), but it's even
riskier in front-end code like cql3/: In front-end code, there is a risk
that due to a bug in our code, a specific user request can cause Scylla
to crash. A malicious user can send this query to all nodes and crash
the entire cluster. When the user is not malicious, it causes a small
problem (a failing request) to become a much worse crash - and worse,
the user has no idea which request is causing this crash and the crash
will repeat if the same request is tried again.

All of this is solved by using the new throwing_assert(), which is the
same as SCYLLA_ASSERT() but throws an exception (using on_internal_error())
instead of crashing. The exception will prevent the code path with the
invalid assumption from continuing, but will result in only the current
user request being aborted, with a clear error message reporting the
internal server error due to an assertion failure.

I reviewed all the changes that I did in this patch to check that (to the
best of my understanding) none of the assertions in cql3/ involve the
sort of serious corruption that might require crashing the Scylla node
entirely.

throwing_assert() also improves logging of assertion failures compared
to the original SCYLLA_ASSERT() - SCYLLA_ASSERT() printed a message to
stderr which in many installations is lost, whereas throwing_assert()
uses Scylla's standard logger, and also includes a backtrace in the
log message.

Fixes #13970 (Exorcise assertions from CQL code paths)
Refs #7871 (Exorcise assertions from Scylla)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-03-11 09:41:20 +02:00
Botond Dénes
99fa912f1b Merge 'Generalize streaming scopes tests' from Pavel Emelyanov
To restore how streaming scopes work there are two tests that greatly duplicate each other -- test_restore_with_streaming_scopes from cluster/object_store suite and test_refresh_with_streaming_scopes from cluster suite.

This patch generalizes both into a do_test_streaming_scopes() non-test function

Closes scylladb/scylladb#28874

* github.com:scylladb/scylladb:
  test: Re-sort comments around do_test_streaming_scopes()
  test: Split do_load_sstables()
  test: Drop load_fn argument from do_load_sstables()
  test: Re-use do_test_streaming_scopes() in refresh test
  test: Introduce SSTablesOnLocalStorage
  test: Introduce SSTablesOnObjectStorage
  test: Move test_restore_with_streaming_scopes() into do_test_streaming_scopes()
2026-03-11 09:35:21 +02:00
Dmitriy Kruglov
cee44716db docs: add cluster platform migration procedure
Document how to migrate a ScyllaDB cluster to different instance
types using the add-and-replace node cycling approach.

Closes: QAINFRA-42

Closes scylladb/scylladb#28458
2026-03-11 09:31:35 +02:00
Nadav Har'El
401dc1894c test/alternator,cqlpy: avoid xfail_strict against DynamoDB/Cassandra
Recently, in commit 7b30a39, we added to pytest.ini the option xfail_strict.
This option causes every time a test XPASSes, i.e., an XFAIL test actually
passes, to be considered an error and fail the test.

While this has some benefits, it's a big problem when running tests
against a reference implementation like DynamoDB or Cassandra: We
typically mark a test "xfail" if the test shows a known bug - i.e., if
the test fails on Scylla but passes on the reference system (DynamoDB
or Cassandra). This means that when running "test/cqlpy/run-cassandra"
or "test/alternator/run --aws", we expect to see many tests XPASS,
and now this will cause these runs to "fail".

So in this patch we add the xfail_strict=false to cqlpy/run-cassandra
and alternator/run --aws. This option is not added to cqlpy/run or
to alternator/run without --aws, and also doesn't affect test.py or
Jenkins.

P.S. This is another nail in the coffin of doing "cd test/alternator;
pytest --aws". You should get used to running Alternator tests through
test/alternator/run, even if you don't need to run Scylla (the "--aws"
option doesn't run Scylla).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28973
2026-03-11 09:29:30 +02:00
Robert Bindar
29619e48d7 replica/table: calculate manifest tablet_count from tablet map
During tests I noticed that if the number of tablets is very small,
say 2, and the number of nodes is 3 (2 shards per node), using the
number of storage groups on each shard, a shard may end up holding 0 groups,
whilst the other holds 1 group. And in some nodes even both shards have
0 groups.
Taking the minimum among shards here was showing in manifests a tablet
count of 0 for all 3 nodes, which is incorrect.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#28978
2026-03-11 09:27:04 +02:00
Botond Dénes
3fed6f9eff Merge 'service: tasks: scan all tablets in tablet_virtual_task::wait' from Aleksandra Martyniuk
Currently, for repair tasks tablet_virtual_task::wait gathers the
ids of tablets that are to be repaired. The gathered set is later
used to check if the repair is still ongoing.

However, if the tablets are resized (split or merged), the gathered
set becomes irrelevant. Those, we may end up with invalid tablet id
error being thrown.

Wait until repair is done for all tablets in the table.

Fixes: https://github.com/scylladb/scylladb/issues/28202

Backport to 2026.1 needed as it contains the change introducing the issue d51b1fea94

Closes scylladb/scylladb#28323

* github.com:scylladb/scylladb:
  service: fix indentation
  test: add test_tablet_repair_wait
  service: remove status_helper::tablets
  service: tasks: scan all tablets in tablet_virtual_task::wait
2026-03-11 09:24:07 +02:00
Raphael S. Carvalho
cc5b1acadf Improve log when sstable load fails due to missing tablet replica
A bug or some bad operator intervention can lead to a sstable existing in a
node after the tablet replica was moved to a different node.

This will result sstable loading during boot failing, requiring operator
intervention.
The log today just dumps the name of the "orphaned" sstable, but one
investigating it might want to know which process (repair, memtable, whatever)
generated that sstable, if the sstable was created locally or remotely,
and the current replica set of the underlying tablet.
From the original identifier, we can know the exact time the sstable was
created on its original node. From the current id, we know the time it
was created on the current node.
All this info can help the investigator to correlate with events in other nodes
(includes actions from the coordinator) to get closer to the root cause.

The new log will look like this:

"Unable to load SSTable .../me-3gyg_1fsw_2u0u826b00b71vc46o-big-Data.db
(originated from compaction with id 913f41c0-18c2-11f1-8f08-cb8521b3f330
on host e483238c-2287-4022-8bc4-b4f1c4cb2b0d)
of tablet 6 (replica set: [e483238c-2287-4022-8bc4-b4f1c4cb2b0d:0])"

Refs https://scylladb.atlassian.net/browse/SCYLLADB-788.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#28921
2026-03-11 06:20:34 +02:00
Avi Kivity
b17e1259e3 Merge 'types: optimize vector deserialization for high-dimensional vectors' from Szymon Wasik
Vector deserialization is an operation which performance is critical for
vector similarity search feature because it is frequently executed during
rescoring operation. Some of the identified performance bottlenecks
for it include:

1. Per-element virtual dispatch in deserialize(): each of the N elements
   went through visit() which switches on ~28 type variants. For a
   1024-dimension float vector, that's 1024 redundant type switches when
   the element type is the same for all of them.

2. Redundant work in split_fragmented(): value_length_if_fixed() was
   called inside the loop (N virtual calls), and no reserve() was done
   on the output vector causing repeated reallocations.

This series fixes both:

- Introduce deserialize_vector_visitor that dispatches on the element
  type once for the entire vector, then loops inside the resolved
  handler. Simple numeric types (float, int, etc.) call
  deserialize_value() directly with no virtual dispatch per element.
  String types (ascii, utf8) get a dedicated handler that skips
  make_empty() (sstring has no empty_t constructor). Complex types
  (list, map, tuple, etc.) fall back to per-element dispatch.

- In split_fragmented(), reserve the output vector to _dimension and
  cache value_length_if_fixed() before the loop.

Benchmark results (1024-dim float vector, release build, -O3 -flto):

  deserialize:      15.73 us -> 11.70 us  (1.34x, 26% faster)
  split_fragmented: 10.34 us ->  7.45 us  (1.39x, 28% faster)

References: SCYLLADB-471

Backport: none, unless we observe some critical performance improvement for quantization.

Closes scylladb/scylladb#28618

* github.com:scylladb/scylladb:
  types: optimize reading vector fragments
  types: optimize vector deserialization for high-dimensional vectors
2026-03-11 00:39:46 +02:00
Dawid Mędrek
167feabe1a cql3: Reject user-provided timestamps for strongly consistent tables
Similarly to LWTs, we reject queries with user-provided timestamps
when they target strongly consistent tables.

Such statements could force us to rewrite history, and that contradicts
the philosophy of linearizability we aim for.

Fixes SCYLLADB-879

Closes scylladb/scylladb#28867
2026-03-10 22:11:39 +02:00
Marcin Maliszkiewicz
8ae80a32c0 Update seastar submodule
* seastar d2953d2a...4d268e0e (32):
  > Merge 'prometheus: support multiple __name__ filters and prefixed names' from Travis Downs
    doc: update prometheus.md with __name__ filter enhancements
    prometheus: support prefixed names in __name__ filter
    prometheus: add benchmarks for name filter performance
    prometheus: support multiple __name__ query parameters
    prometheus: move write_body_args to header
  > fair_queue: Subtract from _queued_capacity on pop_front()
  > memory: expose cumulative allocated bytes statistic
  > Merge 'Add ability to configure IO bandwidth limit for supergroup' from Pavel Emelyanov
    test: IO bandiwdth throttler unit tests
    code: Add ability to configure IO bandwidth limit for supergroup
    io_queue: Have more than one throttler par class
    io_queue: Introduce bandwidth_throttler helper class
    io_queue: Nest io_group::priotiy_class_data-s
    io_queue: Update class bandwidth on group's class data
    io_queue: Make io_group::priority_class_data::tokens() static
    fair_queue: Introduce group (un)plugging
  > Fix _shard_to_numa_node_mapping double population
  > Use exception parameter in log_timer_callback_exception()
  > Fix wakeup_granularity() fallback debug-fs reading
  > test_fixture: Fix SEASTAR_FIXTURE_THREAD_TEST_CASE thread not propagated
  > build: support tuning -ffile-prefix-map
  > test: Remove unused C::dup() method of testing class
  > src/core/reactor: introduce reactor::get_backend_name()
  > util/process: add pid() accessor
  > Merge 'Add source location to task and tasktrace object' from Radosław Cybulski
    coroutine.hh: disable source_location for GCC to avoid ICE
    reactor: improve do_dump_task_queue reporting
    Use source_location in `do_dump_task_queue`
    Update backtrace with source locations of resume points
    Add calls to update resume_point
    Add a std::source_location (resume_point) to task object.
  > Merge 'Refine posix file .dup() implementation' from Pavel Emelyanov
    file: Templatize posix_file_handle_impl
    file: Don't dup() non-read-only files
    file: Split ..._impl::dup() implementations
    test: Add a simple test for dup()
  > Merge 'Deprecate reactor::make_pollable_fd(socket_address, int)' from Pavel Emelyanov
    reactor: Deprecate make_pollable_fd()
    net/posix: Create file_desc for sockets in-place
    reactor,net: Keep sock_need_nonblock boolean on posix_network_stack
    net/posix: Re-format constructor initializer lists
  > Merge 'test: add fuzz testing infrastructure and sstring fuzzer' from Travis Downs
    test: add fuzz tests to CI workflow
    test: add sstring differential fuzzer
    test: add fuzz testing infrastructure
  > Introduce "integrated queue length" metrics and use it for IO classes (#3210)
  > reactor: Remove get_sg_data(unsigned) overload
  > memcached: Stop using scattered_message
  > reactor: Mark uptime() method const
  > alien: Remove deprecated run_on and submit_to calls
  > file: make open_flags and access_flags constexpr
  > scheduling: Unfriend some methods from scheduling_group
  > reactor: Move _dying bit to epoll backend
  > file: coroutinize the with_file templates
  > configure: validate --cook ingredient names
  > fix trailing whitespace
  > Merge 'Estimate timing overhead, allow failing if it is too high' from Travis Downs
    perf_tests: document overhead column and threshold options
    perf_tests: add measurement overhead tracking and warnings
    perf_tests: remove inline/hot attributes from time_measurement methods
    perf_tests: move time_measurement class to implementation file
    perf_tests: move perf counters into time_measurement singleton
  > rpc: log handler type
  > Merge 'Add pre-commit with trailing whitespace hook' from Travis Downs
    Add GitHub Actions workflow for pre-commit enforcement
    Add pre-commit setup documentation to HACKING.md
    Add pre-commit configuration with trailing-whitespace hook
    Remove trailing whitespace from source files
  > posix-stack: Make internal::posix_connect() resolve exceptions into futures
  > sstring: fix npos to be size_t for consistency with std::string

Closes scylladb/scylladb#28954
2026-03-10 22:06:58 +02:00
Szymon Wasik
7fae78d2b0 types: optimize reading vector fragments
There was a redundant work in split_fragmented(): value_length_if_fixed() was
called inside the loop (N virtual calls), and no reserve() was done
on the output vector causing repeated reallocations.

This patch reserves the output vector to _dimension and
caches value_length_if_fixed() before the loop.

Additionally, split read_vector_element() into two specialized functions:
read_vector_element_fixed() and read_vector_element_variable(), and hoist
the branch on fixed_len outside the loop in split_fragmented() and
deserialize_loop(). This avoids a conditional branch per element in the
hot path.

Benchmark results (1024-dim float vector, release build, -O3 -flto):
10.34 us ->  7.45 us  (1.39x, 28% faster)
2026-03-10 20:17:31 +01:00
Taras Veretilnyk
579269b3c5 test/cqlpy: test --ignore-component-digest-mismatch flag in scylla sstable upgrade
Verify that scylla sstable upgrade fails when an sstable has a
corrupted Statistics component digest, and succeeds when the
--ignore-component-digest-mismatch flag is provided.
2026-03-10 19:24:05 +01:00
Taras Veretilnyk
fc4c82b962 docs: document --ignore-component-digest-mismatch flag for scylla sstable upgrade 2026-03-10 19:24:05 +01:00
Taras Veretilnyk
7214f5a0b6 sstables: propagate ignore_component_digest_mismatch config to all load sites
Add ignore_component_digest_mismatch option to db::config (default false).
When set, sstable loading logs a warning instead of throwing on component
digest mismatches, allowing a node to start up despite corrupted non-vital
components or bugs in digest calculation.

Propagate the config to all production sstable load paths:
- distributed_loader (node startup, upload dir processing)
- storage_service (tablet storage cloning)
- sstables_loader (load-and-stream, download tasks, attach)
- stream_blob (tablet streaming)
2026-03-10 19:24:05 +01:00
Taras Veretilnyk
c123f637ea sstables: add option to ignore component digest mismatches
Add `ignore_component_digest_mismatch` option to `sstable_open_config`
that logs a warning instead of throwing `malformed_sstable_exception`
on component digest mismatch. This is useful for recovering sstables
with corrupted non-vital components or working around bugs in digest
calculation.

Expose the option in scylla-sstable via the
`--ignore-component-digest-mismatch` flag for the upgrade operation.
2026-03-10 19:24:05 +01:00
Taras Veretilnyk
95420014ea sstable_compaction_test: Add scrub validate test for corrupted index
Generalize corrupt_sstable() and scrub_validate_corrupted_file() to
accept a component_type parameter, defaulting to Data, so they can be
reused for corrupting other components.
2026-03-10 19:24:05 +01:00
Taras Veretilnyk
a3912cf7f1 sstables: add tests for component digest validation on corrupted SSTables
Add tests that verify SSTable component digest validation detects
corruption on load. Each test writes an SSTable, corrupts a specific
component file by flipping a bit, then asserts that reloading the
SSTable throws malformed_sstable_exception with the expected digest
mismatch message.
2026-03-10 19:24:05 +01:00
Taras Veretilnyk
e78a3d2c44 sstables: validate index components digests during SSTable scrub in validate mode 2026-03-10 19:24:05 +01:00
Taras Veretilnyk
9decbdeab0 sstables: verify component digests on SSTable load
Add integrity verification for SSTable component files by validating
their CRC32 digests against the expected values stored in Scylla
metadata during SSTable loading.

The following components are validated on load: TOC, Scylla metadata,
CompressionInfo, Statistics, Summary, and Filter.
2026-03-10 19:24:05 +01:00
Taras Veretilnyk
478c1eaec5 sstables: add digest_file_random_access_reader for CRC32 digest computation
Add a new reader that wraps file_random_access_reader and computes a
running CRC32 digest over the data as it is read. The digest accumulates
across sequential read_exactly() calls and is reset on seek(), since a
non-sequential read invalidates the running checksum.
2026-03-10 19:24:05 +01:00
Szymon Wasik
6c0ef8eb92 types: optimize vector deserialization for high-dimensional vectors
One of the performance bottlenecks while deserializing vectors was
per-element virtual dispatch in deserialize(): each of the N elements
went through visit() which switches on ~28 type variants. For a
1024-dimension float vector, that's 1024 redundant type switches when
the element type is the same for all of them.

This patch introduces deserialize_vector_visitor that dispatches on the element
type once for the entire vector, then loops inside the resolved
handler. Simple numeric types (float, int, etc.) call
deserialize_value() directly with no virtual dispatch per element.
String types (ascii, utf8) get a dedicated handler that skips
make_empty() (sstring has no empty_t constructor). Complex types
(list, map, tuple, etc.) fall back to per-element dispatch.

Benchmark results (1024-dim float vector, release build, -O3 -flto):
15.73 us -> 11.70 us  (1.34x, 26% faster)
2026-03-10 18:21:34 +01:00
Andrzej Jackowski
9247dff8c2 reader_concurrency_semaphore: fix leak workaround
`e4da0afb8d5491bf995cbd1d7a7efb966c79ac34` introduces a protection
against resources that are "made up" of thin air to
`reader_concurrency_semaphore`. If there are more `_resources` than
the `_initial_resources`, it means there is a negative leak, and
`on_internal_error_noexcept` is called. In addition to it,
`_resources` is set to `std::max(_resources, _initial_resources)`.

However, the commit message of `e4da0afb8d5491bf995cbd1d7a7efb966c79ac34`
states the opposite: "The detection also clamps the
_resources to _initial_resources, to prevent any damage".

Before this commit, the protection mechanism doesn't clamp
`_resources` to `_initial_resources` but instead keeps `_resources` high,
possibly even indefinitely growing. This commit changes `std::max` to
`std::min` to make the code behave as intended.

Refs: SCYLLADB-163

Closes scylladb/scylladb#28982
2026-03-10 18:57:31 +02:00
Szymon Wasik
74d86d3fe9 vector_search: Avoid quadratic reallocation in response_content_to_sstring
Pre-compute the total size and allocate a single uninitialized sstring
before copying the buffers, following the pattern from Seastar's
read_entire_stream_contiguous(). This avoids iterative reallocation
which is O(n^2) for large responses.
2026-03-10 17:45:55 +01:00
Szymon Wasik
d27610f138 vector_store_client: Return HTTP error description, not just code
This simple patch adds support for storing the HTTP error description
that Vector Store client receives from vector store. Until now it was
just printed to the log but it was not returned. For this reason it
was not forwarded to the drivers which forced users to access ScyllaDB
server logs to understand what is wrong with Vector Store.

This patch also updates formatter to print the message next to the
error code.

Fixes: VECTOR-189
2026-03-10 17:22:30 +01:00
Nadav Har'El
92ee959e9b test/alternator: speed up test_streams.py by using module-scope fixtures
Previously, all stream-table fixtures in this test file used
scope="function", forcing a fresh table to be created for every test,
slowing down the test a bit (though not much), and discouraging writing
small new tests.

This was a workaround for a DynamoDB quirk (that Alternator doesn't have):
LATEST shard iterators have a time slack and may point slightly before
the true stream head, causing leftover events from a previous test to
appear in the next test's reads.

We fix this by draining the stream inside latest_iterators() and
shards_and_latest_iterators() after obtaining the LATEST iterators:
fetch records in a loop until two consecutive polling rounds both return
empty, guaranteeing the iterators are positioned past all pre-existing
events before the caller writes anything.  With this guarantee in place,
all stream-table fixtures can safely use scope="module".

After this patch, test_streams.py continues to pass on DynamoDB.
On Alternator, the test file's run time went down a bit, from
20.2 seconds to 17.7 seconds.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-03-10 17:14:04 +02:00
Nadav Har'El
6ac1f1333f test/alternator: test_streams.py don't use fixtures in 4 tests
In the next patch, we plan to make the fixtures in test_streams.py
shared between tests. Most tests work well with shared tables, but two
(test_streams_trim_horizon and test_streams_starting_sequence_number)
were written to expect a new table with an empty history, and two
other (test_streams_closed_read and test_streams_disabled_stream) want
to disable streaming and would break a shared table.

So this patch we modify these four tests to create their own new table
instead of using a fixture.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-03-10 17:12:33 +02:00
Botond Dénes
81e214237f Merge 'Add digests for all sstable components in scylla metadata' from Taras Veretilnyk
This pull request adds support for calculation and storing CRC32 digests for all SSTable components.
This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure
and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream.
Several test cases where introduced to verify expected behaviour.

Additionally, this PR adds new rewrite component mechanism for safe sstable component rewriting.
Previously, rewriting an sstable component (e.g., via rewrite_statistics) created a temporary file that was renamed to the final name after sealing. This allowed crash recovery by simply removing the temporary file on startup.

However, with component digests stored in scylla_metadata (#20100),
replacing a component like Statistics requires atomically updating both the component
and scylla_metadata with the new digest - impossible with POSIX rename.

The new mechanism creates a clone sstable with a fresh generation:

- Hard-links all components from the source except the component being rewritten and scylla_metadata
- Copies original sstable components pointer and recognized components from the source
- Invokes a modifier callback to adjust the new sstable before rewriting
- Writes the modified component along with updated scylla_metadata containing the new digest
- Seals the new sstable with a temporary TOC
- Replaces the old sstable atomically, the same way as it is done in compaction

This is built on the rewrite_sstables compaction framework to support batch operations (e.g., following incremental repair).
In case of any failure durning the whole process, sstable will be automatically deleted on the node startup due to
temporary toc persistence.

Backport is not required, it is a new feature

Fixes https://github.com/scylladb/scylladb/issues/20100, https://github.com/scylladb/scylladb/issues/27453

Closes scylladb/scylladb#28338

* github.com:scylladb/scylladb:
  docs: document components_digests subcomponent and trailing digest in Scylla.db
  sstable_compaction_test: Add tests for perform_component_rewrite
  sstable_test: add verification testcases of SSTable components digests persistance
  sstables: store digest of all sstable components in scylla metadata
  sstables: replace rewrite_statistics with new rewrite component mechanism
  sstables: add new rewrite component mechanism for safe sstable component rewriting
  compaction: add compaction_group_view method to specify sstable version
  sstables: add null_data_sink and serialized_checksum for checksum-only calculation
  sstables: extract default write open flags into a constant
  sstables: Add write_simple_with_digest for component checksumming
  sstables: Extract file writer closing logic into separate methods
  sstables: Implement CRC32 digest-only writer
2026-03-10 16:02:53 +02:00
Aleksandra Martyniuk
e02b0d763c service: fix indentation 2026-03-10 14:44:52 +01:00
Aleksandra Martyniuk
02257d1429 test: add test_tablet_repair_wait
Add a test that checks if tablet_virtual_task::wait won't fail if
tablets are merged.
2026-03-10 14:42:27 +01:00
Aleksandra Martyniuk
0e0070f118 service: remove status_helper::tablets
Currently, status_helper::tablets, which keeps a vector of processed
tablet ids, is used only in tablet_virtual_task::get_status_helper,
so there is no point in returning it. Also, in get_status_helper,
it is used only to determine if any tablets are processed.

Remove status_helper::tablets. Use a flag instead of the vector
in get_status_helper.
2026-03-10 14:42:27 +01:00
Aleksandra Martyniuk
e5928497ce service: tasks: scan all tablets in tablet_virtual_task::wait
Currently, for repair tasks tablet_virtual_task::wait gathers the
ids of tablets that are to be repaired. The gathered set is later
used to check if the repair is still ongoing.

However, if the tablets are resized (split or merged), the gathered
set becomes irrelevant. Those, we may end up with invalid tablet id
error being thrown.

Wait until repair is done for all tablets in the table.
2026-03-10 14:42:21 +01:00
Andrei Chekun
c36df5ecf4 test.py: eliminite drivers exception
There is a race condition in driver that raises the RuntimeException.
This pollutes the output, so this PR is just silencing this exception.

Fixes: SCYLLADB-900

Closes scylladb/scylladb#28957
2026-03-10 14:31:36 +02:00
Alex
3ac4e258e8 transport/messages: hold pinned prepared entry in PREPARE result
result_message::prepared now owns a strong pinned prepared-cache entry instead of relying only on a weak pointer view. This closes the remaining lifetime gap after query_processor::prepare() returns, so users of the returned PREPARE message cannot observe an invalidated weak handle during subsequent
  processing.

  - update result_message::prepared::cql constructor to accept pinned entry
  - construct weak view from owned pinned entry inside the message
  - pass pinned cache entry from query_processor::prepare() into the message constructor
2026-03-10 14:17:57 +02:00
Alex
27051d9a7c cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race
query_processor::prepare() could race with prepared statement invalidation: after loading from the prepared cache, we converted the cached object to a checked weak pointer and then continued asynchronous work (including error-injection waitpoints). If invalidation happened in that window, the weak
  handle could no longer be promoted and the prepare path could fail nondeterministically.

  This change keeps a strong cache entry reference alive across the whole critical section in prepare() by using a pinned cache accessor (get_pinned()), and only deriving the weak handle while the entry is pinned. This removes the lifetime gap without adding retry loops.

  Test coverage was extended in test/cluster/test_prepare_race.py:

  - reproduces the invalidation-during-prepare window with injection,
  - verifies prepare completes successfully,
  - then invalidates again and executes the same stale client prepared object,
  - confirms the driver transparently re-requests/re-prepares and execution succeeds.

  This change introduces:

  - no behavior change for normal prepare flow besides stronger lifetime guarantees,
  - no new protocol semantics,
  - preserves existing cache invalidation logic,
  - adds explicit cluster-level regression coverage for both the race and driver reprepare path.
  - pushes the re prepare operation twards the driver, the server will return unprepared error for the first time and the driver will have to re prepare during execution stage
2026-03-10 14:17:57 +02:00
Piotr Dulikowski
37f8cdf485 Merge 'test.py: fix unawaited ScyllaLogFile.grep() coroutines' from Andrei Chekun
Fixed several places where ScyllaLogFile.grep() was called without await, resulting in checking coroutine objects for truthiness instead of actual log matches.

Fixes: SCYLLADB-903

No backport, framework fix and one test fix.

Closes scylladb/scylladb#28909

* github.com:scylladb/scylladb:
  test.py: fix unawaited ScyllaLogFile.grep() coroutines
  tests: fix test_group0_recovers_after_partial_command_application
2026-03-10 12:29:23 +01:00
Dario Mirovic
f72081194c db: use prefix tombstones in DROP TABLE schema mutations
When dropping a table, make_drop_table_or_view_mutations() creates
a point tombstone in system_schema.columns for every column in the table.

The clustering key of system_schema.columns is (table_name, column_name).
A clustering key with only the table_name component acts as a prefix
tombstone. That tombstone covers all columns belonging to that table.
This approach is already used by make_table_deleting_mutations() during
CREATE TABLE.

Apply the same prefix tombstone approach to DROP TABLE for the columns,
view_virtual_columns, computed_columns, and dropped_columns schema tables.
This reduces tombstone accumulation in schema table sstables.

In test_max_cells test case, which repeatedly creates and drops a table
with 32768 columns, overall test time improved from ~180s to ~157s, which
is ~12.7% improvement.

Refs SCYLLADB-815

Closes scylladb/scylladb#28976
2026-03-10 11:59:00 +01:00
Gleb Natapov
b59b3d4f8a service level: remove version 1 service level code 2026-03-10 10:46:48 +02:00
Gleb Natapov
b633ec1779 features: move GROUP0_SCHEMA_VERSIONING to deprecated features list 2026-03-10 10:46:48 +02:00
Gleb Natapov
40ec0d4942 migration_manager: remove unused forward definitions 2026-03-10 10:46:48 +02:00
Gleb Natapov
aa9eb0ef8c test: remove unused code 2026-03-10 10:46:48 +02:00
Gleb Natapov
4660f908f9 auth: drop auth_migration_listener since it does nothing now 2026-03-10 10:46:48 +02:00
Gleb Natapov
74b5a8d43d schema: drop schema_registry_entry::maybe_sync() function
Schema is synced through group0 now. Drop all the test of the function
as well.
2026-03-10 10:46:47 +02:00
Gleb Natapov
b9f3281af6 schema: drop make_table_deleting_mutations since it should not be needed with raft
Also remove the test since it is no longer relevant
2026-03-10 10:46:47 +02:00
Gleb Natapov
f76199e5c2 schema: remove calculate_schema_digest function
It is used by the test only, so remove the test and its data as well.
2026-03-10 10:46:47 +02:00
Gleb Natapov
08e33ad7f7 schema: drop recalculate_schema_version function and its uses
There is no need to recalculate schema version any more since it is set
by group0.
2026-03-10 10:46:39 +02:00
Gleb Natapov
7bb334a5dd migration_manager: drop check for group0_schema_versioning feature
We do not allow upgrading from a version that does not have it any
longer.
2026-03-10 10:39:59 +02:00
Gleb Natapov
4402b030ae cdc: drop usage of cdc_local table and v1 generation definition 2026-03-10 10:39:59 +02:00
Gleb Natapov
6769615ff1 storage_service: no need to add yourself to the topology during reboot since raft state loading already did it 2026-03-10 10:39:59 +02:00
Gleb Natapov
33fbda9f3b storage_service: remove unused functions 2026-03-10 10:39:58 +02:00
Gleb Natapov
0e3e7be335 group0: drop with_raft() function from group0_guard since it always returns true now
Also drop the code that assumed that the function can return false.
2026-03-10 10:39:58 +02:00
Gleb Natapov
4e56ca3c76 gossiper: do not gossip TOKENS and CDC_GENERATION_ID any more
They were used by legacy topology and cdc code only.
2026-03-10 10:39:58 +02:00
Gleb Natapov
77f8f952b2 gossiper: drop tokens from loaded_endpoint_state 2026-03-10 10:39:58 +02:00
Gleb Natapov
706754dc24 gossiper: remove unused functions 2026-03-10 10:39:58 +02:00
Gleb Natapov
8ee4cdd4b7 storage_service: do not pass loaded_peer_features to join_topology()
They are not used there any longer.
2026-03-10 10:39:58 +02:00
Gleb Natapov
24c01f2289 storage_service: remove unused fields from replacement_info 2026-03-10 10:39:58 +02:00
Gleb Natapov
2d8722d204 gossiper: drop is_safe_for_restart() function and its use
The function checks that the node's state is not left or removed in
gossiper during restart, but with raft topology a removed node will
not be able to contact the cluster to get this information since it will
be banned.
2026-03-10 10:39:58 +02:00
Gleb Natapov
6f739a8ee4 storage_service: remove unused variables from join_topology 2026-03-10 10:39:58 +02:00
Gleb Natapov
d35b83bec8 gossiper: remove the code that was only used in gossiper topology
The topology state machine is always present now and can be passed to
the gossiper during creation.
2026-03-10 10:39:58 +02:00
Gleb Natapov
390eb46c1a storage_service: drop the check for raft mode from recovery code
In non raft mode the node will node boot at all, so the check is
redundant now.
2026-03-10 10:39:58 +02:00
Gleb Natapov
6a7e850161 cdc: remove legacy code
The patch removes test/boost/cdc_generation_test.cc since it unit tests
cdc::limit_number_of_streams_if_needed function which is remove here.
2026-03-10 10:38:57 +02:00
Gleb Natapov
0b508c5f96 test: remove unused injection points
Also remove test_auth_raft_command_split test which is irrelevant since 5ba7d1b116
because it does not use the function that injects max sized command after the
commit.
2026-03-10 10:09:39 +02:00
Gleb Natapov
1d188f0394 auth: remove legacy auth mode and upgrade code
A system needs to be upgraded to use v2 auth before moving to this
ScyllaDB version otherwise the boot will fail.
2026-03-10 10:09:39 +02:00
Gleb Natapov
02fc4ad0a9 treewide: remove schema pull code since we never pull schema any more
Schema pull was used by legacy schema code which is not supported for a
long time now and during legacy recovery which is no longer supported as
well. It can be dropped now.
2026-03-10 10:09:39 +02:00
Gleb Natapov
0cf726c81f raft topology: drop upgrade_state and its type from the topology state machine since it is not used any longer 2026-03-10 10:09:39 +02:00
Gleb Natapov
60a861c518 group0: hoist the checks for an illegal upgrade into main.cc
The checks are spread around now, but having then in one place and done
as early as possible simplifies the logic.
2026-03-10 10:09:39 +02:00
Gleb Natapov
1ff98c89e3 api: drop get_topology_upgrade_state and always report upgrade status as done
Non upgraded version will not boot any longer.
2026-03-10 10:09:38 +02:00
Gleb Natapov
be153a4eb7 service_level_controller: drop service level upgrade code
We do not allow upgrade from a version that is not updated yet, so the
code is not used any longer.
2026-03-10 10:09:38 +02:00
Gleb Natapov
61cc091364 test: drop run_with_raft_recovery parameter to cql_test_env
It is unused.
2026-03-10 10:09:38 +02:00
Gleb Natapov
00083b42a7 group0: get rid of group0_upgrade_state
Simplify code by getting rid of group0_upgrade_state since upgrade is no
longer supported, so no need to track its state. The none upgraded node
will simply not boot and to detect that the patch checks the state
directly from the system table.
2026-03-10 10:09:38 +02:00
Gleb Natapov
d4b55de214 storage_service: drop topology_change_kind as it is no longer needed
The mode is always raft, so no need to keep a variable that tracks that.
2026-03-10 10:09:38 +02:00
Gleb Natapov
68ea6aa0a6 storage_service: drop check_ability_to_perform_topology_operation since no upgrades can happen any more 2026-03-10 10:09:38 +02:00
Gleb Natapov
06652948f3 service_storage: remove unused functions
raft_topology_change_enabled and upgrade_state_to_topology_op_kind are
not use any more. Remove the code.
2026-03-10 10:09:38 +02:00
Gleb Natapov
e8c72b7ba0 storage_service: remove non raft rebuild code
Only raft is supported now.
2026-03-10 10:09:38 +02:00
Gleb Natapov
49ebab971d storage_service: set topology change kind only once
The only support mode is topology_change_kind::raft, so always set it in
storage_service::join_cluster during join or regular boot. Drop the check
for legacy mode from raft_group0::setup_group0_if_exist since the mode
will not be set at this point any longer. The wrong upgrade will still
be detected in storage_service::join_cluster where topology.upgrade_state
is checked directly.
2026-03-10 10:09:38 +02:00
Gleb Natapov
4e072977d4 group0: drop in_recovery function and its uses
Legacy recovery procedure is no longer supported and the code can be
dropped.
2026-03-10 10:09:38 +02:00
Gleb Natapov
770762edd8 group0: rename use_raft to maintenance_mode and make it sync
group0_upgrade_state::recovery is now used only in maintenance mode
so rename the function to indicate it. Also there is no preemption point
in the function any more and it can be a regular function, not a
co-routine.
2026-03-10 10:09:33 +02:00
Pavel Emelyanov
61af7c8300 test: Re-sort comments around do_test_streaming_scopes()
The test description of refreshing test is very elaborated and it's
worth having it as the description of the streaming scopes test itself.
Callers of the helper can go with smaller descriptions.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-10 10:00:09 +03:00
Pavel Emelyanov
5ce3597c25 test: Split do_load_sstables()
This helper does two things -- sorts sstables per server according to
scope in use and calls sstables_storage.restore(). The code looks better
if the sorting of sstables stays in a helper and the call for .restore()
is moved to the caller.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-10 10:00:09 +03:00
Pavel Emelyanov
8c1fb2b39a test: Drop load_fn argument from do_load_sstables()
Now all callers provide the sstables_storage argument and the load_fn is
effectively unused.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-10 09:59:08 +03:00
Pavel Emelyanov
59051ccc28 test: Re-use do_test_streaming_scopes() in refresh test
Now it's possible to replace the whole body of the
test_refresh_with_streaming_scopes() test by calling the corresponding
helper function from backup/restore test module. This helper does
exactly the same, and the SSTablesOnLocalStorage class provides the
necessary save/restore implementations.

One more thing to mention -- the refreshing test for some reason only
wants to run with restored min-tablet-count equal to the original one.
The do_test_streaming_scopes() needs to account for that, as it runs the
tests for more options.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-10 09:59:07 +03:00
Pavel Emelyanov
f6f1cb0391 test: Introduce SSTablesOnLocalStorage
This class implements some of the sstables manipulations performed by
test_refresh_with_streaming_scopes(). It's here to facilitate next patch
that will use it to call do_test_streaming_scopes() helper.

This patch moves two blocks of code out of the test into this new class.

The shutil.rmtree(tmpbackup) is seemingly lost, but it really isn't --
the tmpbackup variable holds a name of a _subdir_ inside servers'
workdirs. This path doesn't really exist on disk on its own, so removing
it is a no-op.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-10 09:58:40 +03:00
Pavel Emelyanov
dae4da1810 test: Introduce SSTablesOnObjectStorage
The class in question performs two operations for
do_test_streaming_scopes(): saves sstables and restores them. Current
caller of the helper is the test_restore_with_streaming_scopes() test
that need to backup sstables on object storage and restore them from
there with the restoration API. The SSTablesOnObjectStorage class does
exactly that.

The change in do_load_sstables() that checks for sstables_storage to be
non None is needed to keep test_refresh_with_streaming_scopes() work --
that test doesn't provide sstables_storage (yet) and the function in
question will call the load_fn callback. Next patch will eliminate it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-10 09:58:39 +03:00
Pavel Emelyanov
5a033dea47 test: Move test_restore_with_streaming_scopes() into do_test_streaming_scopes()
The body of this test is duplicated by
test_refresh_with_streaming_scopes() test from other module. Keeping it
in a non-test top-level function will help generalizing these two tests.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-10 09:57:53 +03:00
Nadav Har'El
d78ea3d498 test/cqlpy: mark test_unbuilt_index_not_used not strictly xfail
A few days ago, in commit 7b30a3981b we added to pytest.ini the option
xfail_strict. This option causes every time a test XPASSes, i.e., an
xfail test actually passes - to be considered an error and fail the test.

But some tests demonstrate a timing-related bug and do not reproduce the
bug every single time. An example we noticed in one CI run is:

    test/cqlpy/test_secondary_index.py::test_unbuilt_index_not_used

This test reproduces a timing-related bug (if you read from a secondary
index "too quickly" you can get wrong results), but only about 90% of the
time, not 100% of the time.

The solution is to add "strict=False" for the xfail marker on this
specific test. This undoes the xfail_strict for this specific test,
accepting that this specific test can either pass or fail. Note that
this does NOT make this test worthless - we still see this test failing
most of the time, and when a developer finally fixes this issue, the
test will begin to pass all the time.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-956
(we'll probably need to follow up this fix with the same fix for other
xfail tests that can sometime pass).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28942
2026-03-09 22:48:20 +02:00
Avi Kivity
01ddc17ab9 Merge 'mv: allow skipping view updates when a collection is unmodified' from Wojciech Mitros
When we generate view updates, we check whether we can skip the
entire view update if all columns selected by the view are unmodified.
However, for collection columns, we only check if they were unset
before and after the update.
In this patch we add a check for the actual collection contents.
We perform this check for both virtual and non-virtual selections.
When the column is only a virtual column in the view, it would be
enough to check the liveness of each collection cell, however for
that we'd need to deserialize the entire collection anyway, which
should be effectively as expensive as comparing all of its bytes.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-808

Closes scylladb/scylladb#28839

* github.com:scylladb/scylladb:
  mv: allow skipping view updates when a collection is unmodified
  mv: allow skipping view updates if an empty collection remains unset
2026-03-09 22:46:01 +02:00
Botond Dénes
13ff9c4394 db,compaction: use utils::chunked_vector for cache invalidation ranges
Instead of dht::partition_ranges_vector, which is an std::vector<> and
have been seen to cause large allocations when calculating ranges to be
invalidated after compaction:

    seastar_memory - oversized allocation: 147456 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at
    [Backtrace #0]
    void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89
     (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./seastar/src/util/backtrace.cc:99
    seastar::current_tasktrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:136
    seastar::current_backtrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:169
    seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:840
    seastar::memory::cpu_pages::check_large_allocation(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:903
     (inlined by) seastar::memory::cpu_pages::allocate_large(unsigned int, bool) at ./build/release/seastar/./seastar/src/core/memory.cc:910
     (inlined by) seastar::memory::allocate_large(unsigned long, bool) at ./build/release/seastar/./seastar/src/core/memory.cc:1533
     (inlined by) seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:1679
    seastar::memory::allocate(unsigned long) at ././seastar/src/core/memory.cc:1698
     (inlined by) operator new(unsigned long) at ././seastar/src/core/memory.cc:2440
     (inlined by) std::__new_allocator<interval<dht::ring_position>>::allocate(unsigned long, void const*) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/new_allocator.h:151
     (inlined by) std::allocator<interval<dht::ring_position>>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/allocator.h:203
     (inlined by) std::allocator_traits<std::allocator<interval<dht::ring_position>>>::allocate(std::allocator<interval<dht::ring_position>>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/alloc_traits.h:614
     (inlined by) std::_Vector_base<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>>::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/stl_vector.h:387
     (inlined by) std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>>::reserve(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/vector.tcc:79
    dht::to_partition_ranges(utils::chunked_vector<interval<dht::token>, 131072ul> const&, seastar::bool_class<utils::can_yield_tag>) at ./dht/i_partitioner.cc:347
    compaction::compaction::get_ranges_for_invalidation(std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>> const&) at ./compaction/compaction.cc:619
     (inlined by) compaction::compaction::get_compaction_completion_desc(std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>>, std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>>) at ./compaction/compaction.cc:719
     (inlined by) compaction::regular_compaction::replace_remaining_exhausted_sstables() at ./compaction/compaction.cc:1362
    compaction::compaction::finish(std::chrono::time_point<db_clock, std::chrono::duration<long, std::ratio<1l, 1000l>>>, std::chrono::time_point<db_clock, std::chrono::duration<long, std::ratio<1l, 1000l>>>) at ./compaction/compaction.cc:1021
    compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0::operator()() at ./compaction/compaction.cc:1960
     (inlined by) compaction::compaction_result std::__invoke_impl<compaction::compaction_result, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(std::__invoke_other, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/invoke.h:63
     (inlined by) std::__invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type std::__invoke<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/invoke.h:98
     (inlined by) decltype(auto) std::__apply_impl<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0, std::tuple<>>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&, std::integer_sequence<unsigned long, ...>) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/tuple:2920
     (inlined by) decltype(auto) std::apply<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0, std::tuple<>>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/tuple:2935
     (inlined by) seastar::future<compaction::compaction_result> seastar::futurize<compaction::compaction_result>::apply<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&) at ././seastar/include/seastar/core/future.hh:1930
     (inlined by) seastar::futurize<std::invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type>::type seastar::async<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(seastar::thread_attributes, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&)::'lambda'()::operator()() const at ././seastar/include/seastar/core/thread.hh:267
     (inlined by) seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::futurize<std::invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type>::type seastar::async<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(seastar::thread_attributes, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&)::'lambda'()>::call(seastar::noncopyable_function<void ()> const*) at ././seastar/include/seastar/util/noncopyable_function.hh:138
    seastar::noncopyable_function<void ()>::operator()() const at ./build/release/seastar/./seastar/include/seastar/util/noncopyable_function.hh:224
     (inlined by) seastar::thread_context::main() at ./build/release/seastar/./seastar/src/core/thread.cc:318

dht::partition_ranges_vector is used on the hot path, so just convert
the problematic user -- cache invalidation -- to use
utils::chunked_vector<dht::partition_range> instead.

Fixes: SCYLLADB-121

Closes scylladb/scylladb#28855
2026-03-09 22:04:54 +02:00
Andrei Chekun
8acba40c84 test.py: fix unawaited ScyllaLogFile.grep() coroutines
Fixed several places where ScyllaLogFile.grep() was called without
await, resulting in checking coroutine objects for truthiness instead
of actual log matches.

Fixes: SCYLLADB-903
2026-03-09 19:41:07 +01:00
Andrei Chekun
224a11be65 tests: fix test_group0_recovers_after_partial_command_application
Due to the fact that grep logs was not awaited this issue was masked.
With adding await for log grep it started to fail. This PR fixes the
test.
2026-03-09 19:41:07 +01:00
Nadav Har'El
16e7a88a02 test/alternator: fix do_test() in test_streams.py
Many tests in test/alternator/test_streams.py use a do_test() function
which performs a user-defined function that runs some write requests,
and then verifies that the expected output appears on the stream.

Because DynamoDB drops do-nothing changes from the stream - such as
writing to an item a value that it already has - these tests need to
write to a different item each time, so do_test() invents a random key
and passes it to the user-defined function to use. But... we had a bug,
the random number generation was done only once, instead of every time.
The fix is to do the random number generation on every call.

We never noticed this bug when each test used a brand new table. But the
next patch will make the tests share the test table, and tests start
to fail. It's especially visible if you run the same test twice against
DynamoDB, e.g.,

test/alternator/run --count 2 --aws \
    test_streams.py::test_streams_putitem_keys_only

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-03-09 19:21:53 +02:00
Łukasz Paszkowski
147b355326 replica/table: avoid computing token range side in storage_group_of() on hot path
`storage_group_of()` is on the replica-side token lookup hot path but
used `tablet_map::get_tablet_id_and_range_side()`, which computes both
tablet id and post-split range side.

Most callers only need the storage group id. Switch `storage_group_of()`
to use `get_tablet_id()` via `tablet_id_for_token()`, and select the
compaction group via new overloads that compute the range side only
when splitting mode is active.
2026-03-09 17:59:36 +01:00
Łukasz Paszkowski
419e9aa323 replica/compaction_group: add lazy select_compaction_group() overloads
Change `storage_group::select_compaction_group()` to accept a token
(and tablet_map) and compute the tablet range side only when
splitting_mode() is active.

Add an overload for selecting the compaction group for an sstable
spanning a token range.
2026-03-09 17:59:36 +01:00
Łukasz Paszkowski
3f70611504 locator/tablets: add tablet_map::get_tablet_range_side()
Add `tablet_map::get_tablet_range_side(token)` to compute the
post-split range side without computing the tablet id.

Pure addition, no behavior change.
2026-03-09 17:59:36 +01:00
Jakub Smolar
7cdd979158 db/config: announce ms format as highest supported
Uncomment the feature flag check in get_highest_supported_format()
to return MS format when supported, otherwise fall back to ME.
2026-03-09 17:12:09 +01:00
Michał Chojnowski
949fc85217 db/config: enable ms sstable format by default
Trie-based sstable indexes are supposed to be (hopefully)
a better default than the old BIG indexes.
Make them the new default.

If we change our mind, this change can be reverted later.
2026-03-09 17:12:09 +01:00
Michał Chojnowski
6b413e3959 cluster/dtest/bypass_cache_test: switch from highest_supported_sstable_format to chosen_sstable_format
Trie-based indexes and older indexes have a difference in metrics,
and the test uses the metrics to check for bypass cache.
To choose the right metrics, it uses highest_supported_sstable_format,
which is inappropriate, because the sstable format chosen for writes
by Scylla might be different than highest_supported_sstable_format.

Use chosen_sstable_format instead.
2026-03-09 17:12:09 +01:00
Michał Chojnowski
b89840c4b9 api/system: add /system/chosen_sstable_version
Returns the sstable version currently chosen for use in for new sstables.

We are adding it because some tests want to know what format they are
writing (tests using upgradesstable, tests which check stats that only
apply to one of the index types, etc).

(Currently they are using `highest_supported_sstable_format` for this
purpose, which is inappropriate, and will become invalid if a non-latest
format is the default).
2026-03-09 17:12:09 +01:00
Michał Chojnowski
9280a039ee test/cluster/dtest: reduce num_tokens to 16
cluster.dtest_alternator_tests.test_slow_query_logging performs
a bootstrap with 768 token ranges.

It works with `me` sstables, which have 2 open file descriptors
per open sstable, but with `ms` sstables, which have 3 open
file descriptors per open sstable, it fails with EMFILE.

To avoid this problem, let's just decrease the number of vnodes
for in the test suite. It's appropriate anyway, because it avoids some
unneeded work without weakening the tests.
(Note: pylib-based have been setting `num_tokens` to 16 for a long time too).

This breaks `bypass_cache_test`, which is written in a way that expects
a certain number of token ranges. We adjust the relevant parameter
accordingly.
2026-03-09 17:12:09 +01:00
Marcin Maliszkiewicz
96a2b0e634 test: add tests for global group0_batch barrier feature
Runtime: 16s in dev mode
2026-03-09 15:15:59 +01:00
Marcin Maliszkiewicz
6723ced684 qos: switch service levels write paths to use global group0_batch barrier
This ensures that we return auth functions only after
we wait until all live nodes apply our mutations.
2026-03-09 15:15:59 +01:00
Marcin Maliszkiewicz
fe79fdf090 auth: switch write paths to use global group0_batch barrier
This ensures that we return auth functions only after
we wait until all live nodes apply our mutations.
2026-03-09 15:15:59 +01:00
Marcin Maliszkiewicz
4c8681a927 raft: add function to broadcast read barrier request
This function ensures that all alive nodes executed
read barrier.

It will be usefull for the following commits which would
eventually delay returning response to the user until
mutations are applied on other nodes so that the user
may perceive better data consistency accross nodes.
2026-03-09 15:15:59 +01:00
Marcin Maliszkiewicz
cbae84a926 raft: add gossiper dependency to raft_group0_client
In following commit raft_group0_client will send read
barrier RPC to all alive nodes, it takes list of the nodes
from gossiper.
2026-03-09 15:15:59 +01:00
Marcin Maliszkiewicz
8422fbca9f raft: add read barrier RPC
The RPC does read barrier on a destination node.

It will be issued in following commits
to live nodes to assure that command was applied
everywhere.
2026-03-09 15:15:59 +01:00
Michał Chojnowski
ff60a5f1e5 cql3: suggest ALTER MATERIALIZED VIEW to users trying to use ALTER TABLE on a view
When a user tries to use ALTER TABLE on a materialized view,
the resulting error message is `Cannot use ALTER TABLE on Materialized View`.

The intention behind this error is that ALTER MATERIALIZED VIEW should
be used instead.

But we observed that some users interpret this error message as a general
"You cannot do any ALTER on this thing".

This patch enhances the error message (and others similar to it)
to prevent the confusion.

Closes scylladb/scylladb#28831
2026-03-09 15:07:21 +01:00
Botond Dénes
1e41db5948 Merge 'service: tasks: return successful status if a table was dropped' from Aleksandra Martyniuk
tablet_virtual_task::wait throws if a table on which a tablet operation
was working is dropped.

Treat the tablet operation as successful if a table is dropped.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-494

Needs backport to all live releases

Closes scylladb/scylladb#28933

* github.com:scylladb/scylladb:
  test: add test_tablet_repair_wait_with_table_drop
  service: tasks: return successful status if a table was dropped
2026-03-09 16:04:44 +02:00
Piotr Dulikowski
23ed0d4df8 Merge 'vector_search: fix TLS server name with IP' from Karol Nowacki
SNI works only with DNS hostnames. Adding an IP address causes warnings
on the server side.
This change adds SNI only if it is not an IP address.

This change has no unit tests, as this behavior is not critical,
since it causes a warning on the server side.
The critical part, that the server name is verified, is already covered.

This PR also adds warning logs to improve future troubleshooting of connections to the vector-store nodes.

Fixes: VECTOR-528

Backports to 2025.04 and 2026.01 are required, as these branches are also affected.

Closes scylladb/scylladb#28637

* github.com:scylladb/scylladb:
  vector_search: fix TLS server name with IP
  vector_search: add warn log for failed ann requests
2026-03-09 15:03:22 +01:00
Asias He
e0483f6001 test: Fix coordinator assumption in do_test_tablet_incremental_repair_merge_error
The first node in the cluster is not guaranteed to be the coordinator
node. Hardcoding node 0 as the coordinator causes test flakiness.  This
patch dynamically finds the actual coordinator node and targets it for
error injection, log checking, and restarts.

Additionally, inject `tablet_force_tablet_count_decrease_once` across
all servers to force the tablet merge process to trigger once.

Fixes SCYLLADB-865

Closes scylladb/scylladb#28945
2026-03-09 15:27:45 +02:00
Marcin Maliszkiewicz
b6a7484520 docs: note eventual visibility of auth changes
Mention that role and permission changes are durable but may
not be immediately visible on other nodes due to asynchronous
replication.

Fixes: SCYLLADB-651

Closes scylladb/scylladb#28900
2026-03-09 14:07:10 +01:00
Piotr Dulikowski
42d70baad3 db: view: mutate_MV: don't hold keyspace ref across preemption
Currently, the view_update_generator::mutate_MV function acquires a
reference to the keyspace relevant to the operation, then it calls
max_concurrent_for_each and uses that reference inside the lambda passed
to that function. max_concurrent_for_each can preempt and there is no
mechanism that makes sure that the keyspace is alive until the view
updates are generated, so it is possible that the keyspace is freed by
the time the reference is used.

Fix the issue by precomputing the necessary information based on the
keyspace reference right away, and then passing that information by
value to the other parts of the code. It turns out that we only need to
know whether the keyspace uses tablets and whether it uses a network
topology strategy.

Fixes: scylladb/scylladb#28925

Closes scylladb/scylladb#28928
2026-03-09 15:04:26 +02:00
Łukasz Paszkowski
826fd5d6c3 test/storage: harden out-of-space prevention tests around restart and disk-utilization transitions
The tests in test_out_of_space_prevention.py are flaky. Three issues contribute:

1. After creating/removing the blob file that simulates disk pressure,
   the tests immediately checked derived state (e.g., "compaction_manager
   - Drained") without first confirming the disk space monitor had detected
   the utilization change. Fix: explicitly wait for "Reached/Dropped below
   critical disk utilization level" right after creating/removing the blob
   file, before checking downstream effects.

2. Several tests called `manager.driver_connect()` or omitted reconnection
   entirely after `server_restart()` / `server_start()`. The pre-existing
   driver session can silently reconnect multiple times, causing subsequent
   CQL queries to fail. Fix: call `reconnect_driver()` after every node restart.
   Additionally, call `wait_for_cql_and_get_hosts()` where CQL is used afterward,
   to ensure all connection pools are established.

3. Some log assertions used marks captured before a restart, so they could
   match pre-restart messages or miss messages emitted in the correct post-restart
   window. Fix: refresh marks at the right points.

Apart from that, the patch fixes a typo: autotoogle -> autotoggle.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-655

Closes scylladb/scylladb#28626
2026-03-09 14:45:09 +02:00
Calle Wilund
ef795eda5b test_encryption: Fix test_system_auth_encryption
Fixes: SCYLLADB-915

Test was quite broken; Not waiting for coro:s, as well as a bunch
of checks no longer even close to valid (this is a ported dtest, and
not a very good one).

Closes scylladb/scylladb#28887
2026-03-09 14:38:31 +02:00
Marcin Maliszkiewicz
f177259316 Merge 'vector_search: small improvements' from Karol Nowacki
vector_search: small improvements

This PR addresses several minor code quality issues and style inconsistencies within the vector_search module.

No backport is needed as these improvements are not visible to the end user.

Closes scylladb/scylladb#28718

* github.com:scylladb/scylladb:
  vector_search: fix names of private members
  vector_search: remove unused global variable
2026-03-09 11:42:35 +01:00
Botond Dénes
6bba4f7ca1 Merge 'test: cluster: util: sleep for 0.01s between writes in do_writes' from Patryk Jędrzejczak
Tests use `start_writes` as a simple write workload to test that writes
succeed when they should (e.g., there is no availability loss), but not to
test performance. There is no reason to overload the CPU, which can lead to
test failures.

I suspect this function to be the cause of SCYLLADB-929, where the failures
of `test_raft_recovery_user_data` (that creates multiple write workloads
with `start_writes`) indicated that the machine was overloaded.
The relevant observations:
- two runs failed at the same time in debug mode,
- there were many reactor stalls and RPC timeouts in the logs (leading to
  unexpected events like servers marking each other down and group0
  leader changes).

I didn't prove that `start_writes` really caused this, but adding this sleep
should be a good change, even if I'm wrong.

The number of writes performed by the test decreases 30-50 times with the
sleep.

Note that some other util functions like `start_writes_to_cdc_table` have
such a sleep.

This PR also contains some minor updates to `test_raft_recovery_user_data`.

Fixes SCYLLADB-929

No backport:
- the failures were observed only in master CI,
- no proof that the change fixes the issue, so backports could be a waste
  of time.

Closes scylladb/scylladb#28917

* github.com:scylladb/scylladb:
  test: test_raft_recovery_user_data: replace asyncio.gather with gather_safely
  test: test_raft_recovery_user_data: use the exclude_node API
  test: test_raft_recovery_user_data: drop tablet_load_stats_cfg
  test: cluster: util: sleep for 0.01s between writes in do_writes
2026-03-09 12:12:04 +02:00
Nadav Har'El
47e8206482 test/alternator: test, and document, Alternator's data encoding
This patch adds a test file, test/alternator/test_encoding.py, testing
how Alternator stores its data in the underlying CQL database. We test
how tables are named, and how attributes of different types are encoded
into CQL.

The test, which begins with a long comment, also doubles as developer-
oriented *documention* on how Alternator encodes its data in the CQL
database. This documentation is not intended for end-users - we do
not want to officially support reading or writing Alternator tables
through CQL. But once in a while, this information can come in handy
for developers.

More importantly, this test will also serve as a regression test,
verifying that Alternator's encoding doesn't change unintentionally.
If we make an unintentional change to the way that Alternator stores
its data, this can break upgrades: The new code might not be able to
read or write the old table with its old encoding. So it's important
to make sure we never make such unintentional changes to the encoding
of Alternator's data. If we ever do make *intentional* changes to
Alternator's data encoding, we will need to fix the relevant test;
But also not forget to make sure that the new code is able to read
the old encoding as well.

The new tests use both "dynamodb" (Alternator) and "cql" fixtures,
to test how CQL sees the Alternator tables. So naturally are these
tests are marked "scylla_only" and skipped on DynamoDB.

Fixes #19770.

Closes scylladb/scylladb#28866
2026-03-09 10:50:09 +01:00
Andrzej Jackowski
6fb5ab78eb db/config: move guardrails config to one place and reorder
The motivations for this patch are as follows:
 - Guardrails should follow similar conventions, e.g. for config names,
   metrics names, testing. Keeping guardrails together makes it easier
   to find and compare existing guardrails when new guardrails are
   implemented.
 - The configuration is used to auto-generate the documentation
   (particularly, the `configuration-parameters` page). Currently,
   the order of parameters in the documentation is inconsistent (e.g.
   `minimum_replication_factor_fail_threshold` before
   `minimum_replication_factor_warn_threshold` but
   `maximum_replication_factor_fail_threshold` after
   `maximum_replication_factor_warn_threshold`), which can be confusing
   to customers.

Fixes: SCYLLADB-256

Closes scylladb/scylladb#28932
2026-03-09 10:50:00 +01:00
Patryk Jędrzejczak
46b7170347 Merge 'test/pylib: centralize timeout scaling and propagate build_mode in LWT helpers' from Alex Dathskovsky
This series improves timeout handling consistency across the test framework and makes build-mode effects explicit in LWT tests. (starting with LWT test that got flaky)

1. Centralize timeout scaling
Introduce scale_timeout(timeout) fixture in runner.py to provide a single, consistent mechanism for scaling test timeouts based on build mode.
Previously, timeout adjustments were done in an ad-hoc manner across different helpers and tests. Centralizing the logic:
Ensures consistent behavior across the test suite
Simplifies maintenance and reasoning about timeout behavior
Reduces duplication and per-test scaling logic
This becomes increasingly important as tests run on heterogeneous hardware configurations, where different build modes (especially debug) can significantly impact execution time.

2. Make scale_timeout explicit in LWT helpers
Propagate scale_timeout explicitly through BaseLWTTester and Worker, validating it at construction time instead of relying on implicit pytest fixture injection inside helper classes.
Additionally:
Update wait_for_phase_ops() and wait_for_tablet_count() to use scale_timeout_by_mode() for consistent polling behavior across modes
Update all LWT test call sites to pass build_mode explicitly
Increase default timeout values, as the previous defaults were too short and prone to flakiness, particularly under slower configurations such as debug builds

Overall, this series improves determinism, reduces flakiness, and makes the interaction between build mode and test timing explicit and maintainable.

backport: not required just an enhansment for test.py infra

Closes scylladb/scylladb#28840

* https://github.com/scylladb/scylladb:
  test/auth_cluster: align service-level timeout expectations with scaled config
  test/lwt: propagate scale_timeout through LWT helpers; scale resize waits Pass scale_timeout explicitly through BaseLWTTester and Worker, validating it at construction time instead of relying on implicit pytest fixture injection inside helper classes. Update wait_for_phase_ops() and wait_for_tablet_count() to use scale_timeout_by_mode() so polling behavior remains consistent across build modes. Adjust LWT test call sites to pass scale_timeout explicitly. Increase default timeout values, as the previous defaults were too short and prone to flakiness under slower configurations (notably debug/dev builds).
  test/pylib: introduce scale_timeout fixture helper
2026-03-09 10:28:19 +01:00
Patryk Jędrzejczak
4c8dba15f1 Merge 'strong_consistency/state_machine: ensure and upgrade mutations schema' from Michał Jadwiszczak
This patch fixes 2 issues within strong consistency state machine:
- it might happen that apply is called before the schema is delivered to the node
- on the other hand, the apply may be called after the schema was changed and purged from the schema registry

The first problem is fixed by doing `group0.read_barrier()` before applying the mutations.
The second one is solved by upgrading the mutations using column mappings in case the version of the mutations' schema is older.

Fixes SCYLLADB-428

Strong consistency is in experimental phase, no need to backport.

Closes scylladb/scylladb#28546

* https://github.com/scylladb/scylladb:
  test/cluster/test_strong_consistency: add reproducer for old schema during apply
  test/cluster/test_strong_consistency: add reproducer for missing schema during apply
  test/cluster/test_strong_consistency: extract common function
  raft_group_registry: allow to drop append entries requests for specific raft group
  strong_consistency/state_machine: find and hold schemas of applying mutations
  strong_consistency/state_machine: pull necessary dependencies
  db/schema_tables: add `get_column_mapping_if_exists()`
2026-03-09 09:49:22 +01:00
Marcin Maliszkiewicz
4150c62f29 Merge 'test_proxy_protocol: fix flaky system.clients visibility checks' from Piotr Smaron
`test_proxy_protocol_port_preserved_in_system_clients` failed because it
didn't see the just created connection in system.clients immediately. The
last lines of the stacktrace are:
```
            # Complete CQL handshake
            await do_cql_handshake(reader, writer)

            # Now query system.clients using the driver to see our connection
            cql = manager.get_cql()
            rows = list(cql.execute(
                f"SELECT address, port FROM system.clients WHERE address = '{fake_src_addr}' ALLOW FILTERING"
            ))

            # We should find our connection with the fake source address and port
>           assert len(rows) > 0, f"Expected to find connection from {fake_src_addr} in system.clients"
E           AssertionError: Expected to find connection from 203.0.113.200 in system.clients
E           assert 0 > 0
E            +  where 0 = len([])
```
Explanation: we first await for the hand-made connection to be completed,
then, via another connection, we're querying system.clients, and we don't
get this hand-made connection in the resultset.
The solution is to replace the bare cql.execute() calls with await wait_for_results(), a helper
 that polls via cql.run_async() until the expected row count is reached
 (30 s timeout, 100 ms period).

Fixes: SCYLLADB-819

The flaky test is present on master and in previous release, so backporting only there.

Closes scylladb/scylladb#28849

* github.com:scylladb/scylladb:
  test_proxy_protocol: introduce extra logging to aid debugging
  test_proxy_protocol: fix flaky system.clients visibility checks
2026-03-09 08:37:57 +01:00
Yaron Kaikov
977bdd6260 .github/workflows/trigger-scylla-ci: fix heredoc injection in trigger-scylla-ci workflow
Move all ${{ }} expression interpolations into env: blocks so they are
passed as environment variables instead of being expanded directly into
shell scripts. This prevents an attacker from escaping the heredoc in
the Validate Comment Trigger step and executing arbitrary commands on
the runner.

The Verify Org Membership step is hardened in the same way for
defense-in-depth.

Refs: GHSA-9pmq-v59g-8fxp
Fixes: SCYLLADB-954

Closes scylladb/scylladb#28935
2026-03-08 21:34:51 +02:00
Wojciech Mitros
0008976e2f mv: allow skipping view updates when a collection is unmodified
When we generate view updates, we check whether we can skip the
entire view update if all columns selected by the view are unmodified.
However, for collection columns, we only check if they were unset
before and after the update.
In this patch we add a check for the actual collection contents.
We perform this check for both virtual and non-virtual selections.
When the column is only a virtual column in the view, it would be
enough to check the liveness of each collection cell, however for
that we'd need to deserialize the entire collection anyway, which
should be effectively as expensive as comparing all of its bytes.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-808
2026-03-08 16:23:22 +01:00
Wojciech Mitros
7d1e0a2e4d mv: allow skipping view updates if an empty collection remains unset
Currently, when we generate view updates, we skip the view update if
all columns selected by the view are unchanged in the base table update.
However, this does not apply for collection columns - if the base table
has a collection regular column, we never allow skipping generating
view updates and the reason for that is missing implementation.
We can easily relax this for the case where the collection was missing
before and after the update - in this commit we move the check for
collections after the check for missing cells.
2026-03-08 16:22:27 +01:00
Artsiom Mishuta
fda68811e8 test.py: fix strict-config argument.
The ini-level strict_config was removed/never existed as a config key in pytest 8 — it's only a command-line flag(and  back in pytest 9)
In pytest 8.3.5, the equivalent is the --strict-config CLI flag, not an ini option

Fixes SCYLLADB-955

Closes scylladb/scylladb#28939
2026-03-08 16:09:29 +02:00
Taras Veretilnyk
739dd59ebc docs: document components_digests subcomponent and trailing digest in Scylla.db
Document the new `components_digests` subcomponent (tag 12) added to the
Scylla.db metadata component, which stores CRC32 digests of all checksummed
SSTable component files. Also document the trailing CRC32 digest that
stores digest of the scylla metadata itself.
2026-03-06 21:58:15 +01:00
Taras Veretilnyk
2b1c37396a sstable_compaction_test: Add tests for perform_component_rewrite
Add two test cases to verify the correctness of the perform_component_rewrite
functionality:
- test_perform_component_rewrite_single_sstable: Tests rewriting the Statistics
  component of a single sstable
- test_perform_component_rewrite_multiple_sstables: Tests rewriting 5 out of 10
  sstables
2026-03-06 21:58:15 +01:00
Taras Veretilnyk
591d13e942 sstable_test: add verification testcases of SSTable components digests persistance
Adds a generic test helper that writes a random SSTable, reloads it, and
verifies that the persisted CRC32 digest for each component matches the
digest computed from disk. Those covers all checksummed components test cases.
2026-03-06 21:58:15 +01:00
Taras Veretilnyk
54af4a26ca sstables: store digest of all sstable components in scylla metadata
This change replaces plain file_writer with crc32_digest_file_writer
for all SSTable components that should be checksummed. The resulting component
digests are stored in scylla metadata component.

This also extends new rewrite component mechanism,
to rewrite metadata with updated digest together with the component.
2026-03-06 21:58:10 +01:00
Dawid Mędrek
5feed00caa Merge 'raft: read_barrier: update local commit_idx to read_idx when it's safe' from Patryk Jędrzejczak
When the local entry with `read_idx` belongs to the current term, it's
safe to update the local `commit_idx` to `read_idx`.

The motivation for this change is to speed up read barriers. `wait_for_apply`
executed at the end of `read_barrier` is delayed until the follower learns
that the entry with `read_idx` is committed. It usually happens quickly in
the `read_quorum` message. However, non-voters don't receive this message,
so they have to wait for `append_entries`. If no new entries are being
added, `append_entries` can come only from `fsm::tick_leader()`. For group0,
this happens once every 100ms.

The issue above significantly slows down cluster setups in tests. Nodes
join group0 as non-voters, and then they are met with several read barriers
just after a write to group0. One example is `global_token_metadata_barrier`
in `write_both_read_new` performed just after `update_topology_state` in
`write_both_read_old`.

I tested the performance impact of this change with the following test:
```python
for _ in range(10):
    await manager.servers_add(3)
```
It consistently takes 44-45s with the change and 50-51s without the change
in dev mode.

No backport:
- non-critical performance improvement mostly relevant in tests,
- the change requires some soak time in master.

Closes scylladb/scylladb#28891

* github.com:scylladb/scylladb:
  raft: server: fix the repeating typo
  raft: clarify the comment about read_barrier_reply
  raft: read_barrier: update local commit_idx to read_idx when it's safe
  raft: log: clarify the specification of term_for
2026-03-06 18:50:08 +01:00
Aleksandra Martyniuk
40dca578c5 test: add test_tablet_repair_wait_with_table_drop 2026-03-06 15:08:29 +01:00
Piotr Smaron
f12e4ea42b test_proxy_protocol: introduce extra logging to aid debugging
In case of an error, we want to see the contents of the system.clients
table to have a better understanding of what happened - whether the
row(s) are really missing or maybe they are there, but 1 digit doesn't
match or the row is half-written.
We'll therefore query for the whole table on the CQL side, and then
filter out the rows we want to later proceed with on the python side.
This way we can dump the contents of the whole system.clients table if
something goes south.
2026-03-06 14:50:12 +01:00
Piotr Smaron
d8cf2c5f23 test_proxy_protocol: fix flaky system.clients visibility checks
`test_proxy_protocol_port_preserved_in_system_clients` failed because it
didn't see the just created connection in system.clients immediately. The
last lines of the stacktrace are:
```
            # Complete CQL handshake
            await do_cql_handshake(reader, writer)

            # Now query system.clients using the driver to see our connection
            cql = manager.get_cql()
            rows = list(cql.execute(
                f"SELECT address, port FROM system.clients WHERE address = '{fake_src_addr}' ALLOW FILTERING"
            ))

            # We should find our connection with the fake source address and port
>           assert len(rows) > 0, f"Expected to find connection from {fake_src_addr} in system.clients"
E           AssertionError: Expected to find connection from 203.0.113.200 in system.clients
E           assert 0 > 0
E            +  where 0 = len([])
```
Explanation: we first await for the hand-made connection to be completed,
then, via another connection, we're querying system.clients, and we don't
get this hand-made connection in the resultset.
The solution is to replace the bare cql.execute() calls with await wait_for_results(), a helper
 that polls via cql.run_async() until the expected row count is reached
 (30 s timeout, 100 ms period).

Fixes: SCYLLADB-819
2026-03-06 14:49:59 +01:00
Aleksandra Martyniuk
dd634c329f service: tasks: return successful status if a table was dropped
tablet_virtual_task::wait throws if a table on which a tablet operation
was working is dropped.

Treat the tablet operation as successful if a table is dropped.
2026-03-06 14:37:44 +01:00
Botond Dénes
4fdc0a5316 Merge 'Relax test's check_mutation_replicas() argument list' from Pavel Emelyanov
The one accepts long list of arguments, some of those is not really needed. Also some callers can be relaxed not to provide default values for arguments with such.

Improving tests, not backporting

Closes scylladb/scylladb#28861

* github.com:scylladb/scylladb:
  test: Remove passing default "expected_replicas" to check_mutation_replicas()
  test: Remove scope and primary-replica-only arguments from check_mutation_replicas() helper
2026-03-06 11:25:00 +02:00
Szymon Malewski
d817e56e87 vector_similarity_fcts.cc: fix strict aliasing violation in extract_float_vector
Previous code performed endian conversion by bulk-copying raw bytes
into a std::vector<float> and then iterating over it via a
reinterpret_cast<uint32_t*> pointer. Accessing float storage through a
uint32_t* violates C++ strict aliasing rules, giving the compiler
freedom to reorder or elide the stores, causing undefined behavior.

Replace the two-pass approach with a single-pass loop using
seastar::consume_be<uint32_t>() and std::bit_cast<float>(), which is
both well-defined and auto-vectorizable.

Follow-up #28754

Closes scylladb/scylladb#28912
2026-03-06 09:15:45 +01:00
Artsiom Mishuta
5d7a73cc5b test.py add support if non_gating tests
Add support for non_gating, the opposite of gating in dtest terminology, tests in test.py
codebase

This test will/should not be run by any current gating job (ci/next/nightly)

Closes scylladb/scylladb#28902
2026-03-06 09:39:32 +02:00
Andrei Chekun
01498a00d5 test.py: make HostRegistry singleton
HostRegistry initialized in several places in the framework, this can
lead to the overlapping IP, even though the possibility is low it's not
zero. This PR makes host registry initialized once for the master
thread and pytest. To avoid communication between with workers, each
worker will get its own subnet that it can use solely for its own goals.
This simplifies the solution while providing the way to avoid overlapping IP's.

Closes scylladb/scylladb#28520
2026-03-06 09:25:29 +02:00
Artsiom Mishuta
2be4d8074d test.py disable XFail tests on CI run
This PR disables running FXAIL tests on ci run to speed it up.

tests will continue run on "nightly" job and FAIL on unexpected pass
and will continue run on "NEXT" job and NOT FAIL on unexpected pass

Closes scylladb/scylladb#28886
2026-03-06 09:12:06 +02:00
Szymon Malewski
f9d213547f cql3: selection: fix add_column_for_post_processing for ORDER BY
The purpose of `add_column_for_post_processing` is to add columns that are required for processing of a query,
but are not part of SELECT clause and shouldn't be returned. They are added to the final result set, but later are not serialized.
Mainly it is used for filtering and grouping columns, with a special case of `WHERE primary_key IN ...  ORDER BY ...` when the whole result set needs additional final sorting,
and ordering columns must be added as well.
There was a bug that manifested in #9435, #8100 and was actually identified in #22061.
In case of selection with processing (e.g functions involved), result set row is formed in two stages.
Initially it is a list of columns fetched from replicas - on which filtering and grouping is performed.
After that the actual selection is resolved and the final number of columns can change.
Ordering is performed on this final shape, but the ordering column index returned by `add_column_for_post_processing` refereed to initial shape.
If selection refereed to the same column twice (e.g. `v, TTL(v)` as in #9435) final row was longer than initial and ordering refereed to incorrect column.
If a function in selection refereed to multiple columns (e.g. as_json(.., ..) which #8100 effectively uses) the final row was shorter
and ordering tried to use a non-existing column.

This patch fixes the problem by making sure that column index of the final result set is used for ordering.

The previously crashing test `cassandra_tests/validation/entities/json_test.py::testJsonOrdering` doesn't have to be skipped, but now it is failing on issue #28467.

Fixes #9435
Fixes #8100
Fixes #22061

Closes scylladb/scylladb#28472
2026-03-05 19:22:34 +02:00
Patryk Jędrzejczak
c8c57850d9 test: test_raft_recovery_user_data: replace asyncio.gather with gather_safely 2026-03-05 17:13:52 +01:00
Patryk Jędrzejczak
c3aa4ed23c test: test_raft_recovery_user_data: use the exclude_node API
The API is now available.
2026-03-05 17:13:52 +01:00
Patryk Jędrzejczak
dd75687251 test: test_raft_recovery_user_data: drop tablet_load_stats_cfg
The issue has been fixed.
2026-03-05 17:13:52 +01:00
Patryk Jędrzejczak
52940c4f31 test: cluster: util: sleep for 0.01s between writes in do_writes
Tests use `start_writes` as a simple write workload to test that writes
succeed when they should (e.g., there is no availability loss), but not to
test performance. There is no reason to overload the CPU, which can lead to
test failures.

I suspect this function to be the cause of SCYLLADB-929, where the failures
of `test_raft_recovery_user_data` (that creates multiple write workloads
with `start_writes`) indicated that the machine was overloaded.
The relevant observations:
- two runs failed at the same time in debug mode,
- there were many reactor stalls and RPC timeouts in the logs (leading to
  unexpected events like servers marking each other down and group0
  leader changes).

I didn't prove that `start_writes` really caused this, but adding this sleep
should be a good change, even if I'm wrong.

The number of writes performed by the test decreases 30-50 times with the
sleep.

Note that some other util functions like `start_writes_to_cdc_table` have
such a sleep.

Fixes SCYLLADB-929
2026-03-05 17:13:40 +01:00
Calle Wilund
ab3d3d8638 build: add slirp4netns to dependencies
Needed for port forwarded podman-in-podman containers

[avi:
  - move from Dockerfile to install-dependencies.sh so non-container
  builds also get it
  - regenerate frozen toolchain with optimized clang from

    https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-aarch64.tar.gz
    https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-x86_64.tar.gz
]

Closes scylladb/scylladb#28870
2026-03-05 17:44:17 +02:00
Michał Jadwiszczak
37bbbd3a27 test/cluster/test_strong_consistency: add reproducer for old schema during apply 2026-03-05 13:50:20 +01:00
Michał Jadwiszczak
6aef4d3541 test/cluster/test_strong_consistency: add reproducer for missing schema during apply 2026-03-05 13:50:16 +01:00
Michał Jadwiszczak
4795f5840f test/cluster/test_strong_consistency: extract common function 2026-03-05 13:47:43 +01:00
Michał Jadwiszczak
3548b7ad38 raft_group_registry: allow to drop append entries requests for specific raft group
Similar to `raft_drop_incoming_append_entries`, the new error injection
`raft_drop_incoming_append_entries_for_specified_group` skips handler
for `raft_append_entries` RPC but it allows to specify id of raft group
for which the requests should be dropped.

The id of a raft group should be passed in error injection parameters
under `value` key.
2026-03-05 13:47:43 +01:00
Michał Jadwiszczak
b0cffb2e81 strong_consistency/state_machine: find and hold schemas of applying mutations
It might happen that a strong consistency command will arrive to a node:
- before it knows about the schema
- after the schema was changes and the old version was removed from the
  memory

To fix the first case, it's enough to perform a read barrier on group0.
In case of the second one, we can use column mapping the upgrade the
mutation to newer schema.

Also, we should hold pointers to schemas until we finish `_db.apply()`,
so the schema is valid for the whole time.
And potentially we should hold multiple pointers because commands passed
to `state_machine::apply()` may contain mutations to different schema
versions.

This commit relies on a fact that the tablet raft group and its state
machine is created only after the table is created locally on the node.

Fixes SCYLLADB-428
2026-03-05 13:47:40 +01:00
Tomasz Grabiec
b90fe19a42 Merge 'service: assert that tables updated via group0 use schema commitlog' from Aleksandra Martyniuk
Set enable_schema_commitlog for each group0 tables.

Assert that group0 tables use schema commitlog in ensure_group0_schema
(per each command).

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-914.

Needs backport to all live releases as all are vulnerable

Closes scylladb/scylladb#28876

* github.com:scylladb/scylladb:
  test: add test_group0_tables_use_schema_commitlog
  db: service: remove group0 tables from schema commitlog schema initializer
  service: ensure that tables updated via group0 use schema commitlog
  db: schema: remove set_is_group0_table param
2026-03-05 13:28:13 +01:00
Botond Dénes
509f2af8db Merge 'repair: Fix rwlock in compaction_state and lock holder lifecycle' from Raphael Raph Carvalho
Consider this:

- repair takes the lock holder
- tablet merge filber destories the compaction group and the compaction state
- repair fails
- repair destroy the lock holder

This is observed in the test:

```
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036] Repair 1 out of 1 tablets: table=sec_index.users range=(432345564227567615,504403158265495551] replicas=[0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea:15, 498e354c-1254-4d8d-a565-2f5c6523845a:9, 5208598c-84f0-4526-bb7f-573728592172:28]

...

repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: Started to repair 1 out of 1 tables in keyspace=sec_index, table=users, table_id=ea2072d0-ccd9-11f0-8dba-c5ab01bffb77, repair_reason=repair
repair - Enable incremental repair for table=sec_index.users range=(432345564227567615,504403158265495551]
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: get_sync_boundary: got error from node=0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea, keyspace=sec_index, table=users, range=(432345564227567615,504403158265495551], error=seastar::rpc::remote_verb_error (Compaction state for table [0x60f008fa34c0] not found)
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge

....

scylla[10793] Segmentation fault on shard 28, in scheduling group streaming
```

The rwlock in compaction_state could be destroyed before the lock holder
of the rwlock is destroyed. This causes user after free when the lock
the holder is destroyed.

To fix it, users of repair lock will now be waited when a compaction
group is being stopped.
That way, compaction group - which controls the lifetime of rwlock -
cannot be destroyed while the lock is held.
Additionally, the merge completion fiber - that might remove groups -
is properly serialized with incremental repair.

The issue can be reproduced using sanitize build consistently and can not
be reproduced after the fix.

Fixes #27365

Closes scylladb/scylladb#28823

* github.com:scylladb/scylladb:
  repair: Fix rwlock in compaction_state and lock holder lifecycle
  repair: Prevent repair lock holder leakage after table drop
2026-03-05 14:18:25 +02:00
Patryk Jędrzejczak
f1978d8a22 raft: server: fix the repeating typo 2026-03-05 13:06:08 +01:00
Patryk Jędrzejczak
5a43695f6a raft: clarify the comment about read_barrier_reply
The comment could be misleading. It could suggest that the returned index is
already safe to read. That's not necessarily true. The entry with the
returned index could, for example, be dropped by the leader if the leader's
entry with this index had a different term.
2026-03-05 13:06:08 +01:00
Patryk Jędrzejczak
1ae2ae50a6 raft: read_barrier: update local commit_idx to read_idx when it's safe
When the local entry with `read_idx` belongs to the current term, it's
safe to update the local `commit_idx` to `read_idx`. The argument for
safety is in the new comment above `maybe_update_commit_idx_for_read`.

The motivation for this change is to speed up read barriers. `wait_for_apply`
executed at the end of `read_barrier` is delayed until the follower learns
that the entry with `read_idx` is committed. It usually happens quickly in
the `read_quorum` message. However, non-voters don't receive this message,
so they have to wait for `append_entries`. If no new entries are being
added, `append_entries` can come only from `fsm::tick_leader()`. For group0,
this happens once every 100ms.

The issue above significantly slows down cluster setups in tests. Nodes
join group0 as non-voters, and then they are met with several read barriers
just after a write to group0. One example is `global_token_metadata_barrier`
in `write_both_read_new` performed just after `update_topology_state` in
`write_both_read_old`.

Writing a test for this change would be difficult, so we trust the nemesis
tests to do the job. They have already found consistency issues in read
barriers. See #10578.
2026-03-05 13:06:08 +01:00
Patryk Jędrzejczak
1cbd0da519 raft: log: clarify the specification of term_for
When `idx > last_idx()`, the function does an out-of-bounds access to `_log`.
This may look contradictory to the current specification.
2026-03-05 13:06:07 +01:00
Michał Jadwiszczak
33a16940be strong_consistency/state_machine: pull necessary dependencies
Both migration manager and system keyspace will be used in next commit.
The first one is needed to execute group0 read barrier and we need
system keyspace to get column mappings.
2026-03-05 12:33:17 +01:00
Alex
b32ef8ecd5 test/auth_cluster: align service-level timeout expectations with scaled config
Use scale_timeout_by_mode() in make_scylla_conf() to derive
  request_timeout_in_ms in test/pylib/scylla_cluster.py.

  Update test_connections_parameters_auto_update in
  test/cluster/auth_cluster/test_raft_service_levels.py to expect the
  mode-specific timeout string returned by the REST endpoint after this
  scaling change.
2026-03-05 13:32:15 +02:00
Alex
a66565cc42 test/lwt: propagate scale_timeout through LWT helpers; scale resize waits
Pass scale_timeout explicitly through BaseLWTTester and Worker, validating it at construction time instead of relying on implicit pytest fixture injection inside helper classes.
Update wait_for_phase_ops() and wait_for_tablet_count() to use scale_timeout_by_mode() so polling behavior remains consistent across build modes.
Adjust LWT test call sites to pass scale_timeout explicitly.
Increase default timeout values, as the previous defaults were too short and prone to flakiness under slower configurations (notably debug/dev builds).
2026-03-05 13:07:09 +02:00
Alex
73f1a65203 test/pylib: introduce scale_timeout fixture helper
Introduce scale_timeout(mode) to centralize test timeout scaling logic based on build mode, the function will return a callable that will handle the timeout by mode.
This ensures consistent timeout behavior across test helpers and eliminates ad-hoc per-test scaling adjustments.
Centralizing the logic improves maintainability and makes timeout behavior easier to reason about.
This becomes increasingly important as we run tests on heterogeneous hardware configurations.
Different build modes (especially debug) can significantly affect execution time, and having a single scaling mechanism helps keep test stability predictable across environments.

No functional change beyond unifying existing timeout scaling behavior.
2026-03-05 13:07:09 +02:00
Anna Stuchlik
855c503c63 doc: fix the unified installer instructions
This commit updates the documentation for the unified installer.

- The Open Source example is replaced with version 2025.1 (Source Available, currently supported, LTS).
- The info about CentOS 7 is removed (no longer supported).
- Java 8 is removed.
- The example for cassandra-stress is removed (as it was already removed on other installation pages).

Fixes https://github.com/scylladb/scylladb/issues/28150

Closes scylladb/scylladb#28152
2026-03-05 12:57:06 +02:00
Michał Jadwiszczak
d25be9e389 db/schema_tables: add get_column_mapping_if_exists()
In scenarios where we want to firsty check if a column mapping exists
and if we don't want do flow control with exception, it is very wasteful
to do
```
if (column_mapping_exists()) {
  get_column_mapping();
}
```
especially in a hot path like `state_machine::apply()` becase this will
execute 2 internal queries.

This commit introduces `get_column_mapping_if_exists()` function,
which simply wrapps result of `get_column_mapping()` in optional and
doesn't throw an exception if the mapping doesn't exist.
2026-03-05 11:55:57 +01:00
Artsiom Mishuta
7b30a3981b test.py: enable strict_config,xfail_strict,strict-markers
this commit enables 3 strict pytest options:

strict_config - if any warnings encountered while parsing the pytest section of the configuration file will raise errors.
xfail_strict - if markers not registered in the markers section of the configuration file will raise errors.
strict-markers - if tests marked with @pytest.mark.xfail that actually succeed will by default fail the test suite

and fix errors that occur after enabling these options

Closes scylladb/scylladb#28859
2026-03-05 12:54:26 +02:00
Dawid Mędrek
7564a56dc8 Merge 'tombstone_gc: allow using repair-mode tombstone gc with RF=1 tables' from Botond Dénes
Currently, repair-mode tombstone-gc cannot be used on tables with RF=1. We want to make repair-mode the default for all tablet tables (and more, see https://github.com/scylladb/scylladb/issues/22814), but currently a keyspace created with RF=1 and later altered to RF>1 will end up using timeout-mode tombstone gc. This is because the repair-mode tombstone-gc code relies on repair history to determine the gc-before time for keys/ranges. RF=1 tables cannot run repairs so they will have empty repair history and consequently won't be able to purge tombstones.
This PR solves this by keeping a registry of RF=1 tables and consulting this registry when creating `tombstone_gc_state` objects. If the table is RF=1, tombstone-gc will work as if the table used immediate-mode tombstone-gc. The registry is updated on each replication update. As soon as the table is not RF=1 anymore, the tombstone-gc reverts to the natural repair-mode behaviour.

After this PR, tombstone-gc defaults to repair-mode for all tables, regardless of RF and tablets/vnodes.

Fixes: SCYLLADB-106.

New feature, no backport required.

Closes scylladb/scylladb#22945

* github.com:scylladb/scylladb:
  test/{boost,cluster}: add test for tombstone gc mode=repair with RF=1
  tombstone_gc: allow use of repair-mode for RF=1 tables
  replica/table: update rf=1 table registry in shared tombstone-gc state
  tombstone_gc: tombstone_gc_before_getter: consider RF when getting gc before time
  tombstone_gc: unpack per_table_history_maps
  tombstone_gc: extract _group0_gc_time from per_table_history_map
  tombstone_gc: drop tombstone_gc_state(nullptr) ctor and operator bool()
  test/lib/random_schema: use timeout-mode tombstone_gc
  tombstone_gc_options: add C++ friendly constructor
  test: move away from tombstone_gc_state(nullptr) ctor
  treewide: move away from tombstone_gc_state(nullptr) ctor
  sstable: move away from tombstone_gc_mode::operator bool()
  replica/table: add get_tombstone_gc_state()
  compaction: use tombstone_gc_state with value semantics
  db/row_cache: use tombstone_gc_state with value semantics
  tombstone_gc: introduce tombstone_gc_state::for_tests()
2026-03-05 11:50:31 +01:00
Piotr Dulikowski
a2669e9983 test: test_mv_merge_allowed: add mistakenly omitted awaits
The test test_mv_merge_allowed asserts in two places that the tablet
count is 2. It does so by calling an async function but, mistakenly, the
returned coroutine was not awaited. The coroutine is, apparently, truthy
so the assertions always passed.

Fix the test to properly await the coroutines in the assertions.

Fixes: SCYLLADB-905

Closes scylladb/scylladb#28875
2026-03-05 11:29:23 +01:00
Avi Kivity
5ae40caa6d dist: tune tcp_mem to 3% of total memory in scylla-kernel-conf package
tcp_mem defaults to 9% of total memory. ScyllaDB defaults to 93%. The
sum is more than 100%.

Fix by tuning tcp_mem to 3% of total memory.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-734

Closes scylladb/scylladb#28700
2026-03-05 12:51:04 +03:00
Patryk Jędrzejczak
bb1a798c2c Merge 'raft: Throw stopped_error if server aborted' from Dawid Mędrek
This PR solves a series of similar problems related to executing
methods on an already aborted `raft::server`. They materialize
in various ways:

* For `add_entry` and `modify_config`, a `raft::not_a_leader` with
  a null ID will be returned IF forwarding is disabled. This wasn't
  a big problem because forwarding has always been enabled for group0,
  but it's something that's nice to fix. It's might be relevant for
  strong consistency that will heavily rely on this code.

* For `wait_for_leader` and `wait_for_state_change`, the calls may
  hang and never resolve. A more detailed scenario is provided in a
  commit message.

For the last two methods, we also extend their descriptions to indicate
the new possible exception type, `raft::stopped_error`. This change is
correct since either we enter the functions and throw the exception
immediately (if the server has already been aborted), or it will be
thrown upon the call to `raft::server::abort`.

We fix both issues. A few reproducer tests have been included to verify
that the calls finish and throw the appropriate errors.

Fixes SCYLLADB-841

Backport: Although the hanging problems haven't been spotted so far
          (at least to the best of my knowledge), it's best to avoid
          running into a problem like that, so let's backport the
          changes to all supported versions. They're small enough.

Closes scylladb/scylladb#28822

* https://github.com/scylladb/scylladb:
  raft: Make methods throw stopped_error if server aborted
  raft: Throw stopped_error if server aborted
  test: raft: Introduce get_default_cluster
2026-03-05 10:47:39 +01:00
Botond Dénes
cd13a911cc test/cluster/test_data_resurrection_in_memtable.py: dump rows before check
So that if the check of expected rows fail, we have a dump to look at
and see what is different.
2026-03-05 11:44:02 +02:00
Botond Dénes
f375aae257 replica/database: consolidate the two database_apply error injections
Into a single database_apply one. Add three parameters:
* ks_name and cf_name to filter the tables to be affected
* what - what to do: throw or wait

This leads to smaller footprint in the code and improved filtering for
table names at the cost of some extra error injection params in the
tests.
2026-03-05 11:44:02 +02:00
Marcin Maliszkiewicz
c3f59e4fa1 Merge 'cql3: implement write_consistency_levels guardrails' from Andrzej Jackowski
This patch series implements `write_consistency_levels_warned` and `write_consistency_levels_disallowed` guardrails, allowing the configuration of which consistency levels are unwanted for writes. The motivation for these guardrails is to forbid writing with consistency levels that don't provide high durability guarantees (like CL=ANY, ONE, or LOCAL_ONE).

Neither guardrail is enabled by default, so as not to disrupt clusters that are currently using any of the CLs for writes. The warning guardrail may seem harmless, as it only adds a warning to the CQL response; however, enabling it can significantly increase network traffic (as a warning message is added to each response) and also decrease throughput due to additional allocations required to prepare the warning. Therefore, both guardrails should be enabled with care. The newly added `writes_per_consistency_level` metric, which is incremented unconditionally, can help decide whether a guardrail can be safely enabled in an existing cluster.

This commit adds additional `if` instructions on the critical path. However, based on the `perf_simple_query` benchmark for writes, the difference is marginal (~40 additional instructions, which is a relative difference smaller than 0.001).

BEFORE:
```
291443.35 tps ( 53.3 allocs/op,  16.0 logallocs/op,  14.2 tasks/op,   48067 insns/op,   18885 cycles/op,        0 errors)
throughput:
 mean=   289743.07 standard-deviation=6075.60
 median= 291424.69 median-absolute-deviation=1702.56
 maximum=292498.27 minimum=261920.06
instructions_per_op:
 mean=   48072.30 standard-deviation=21.15
 median= 48074.49 median-absolute-deviation=12.07
 maximum=48119.87 minimum=48019.89
cpu_cycles_per_op:
 mean=   18884.09 standard-deviation=56.43
 median= 18877.33 median-absolute-deviation=14.71
 maximum=19155.48 minimum=18821.57
```

AFTER:
```
290108.83 tps ( 53.3 allocs/op,  16.0 logallocs/op,  14.2 tasks/op,   48121 insns/op,   18988 cycles/op,        0 errors)
throughput:
 mean=   289105.08 standard-deviation=3626.58
 median= 290018.90 median-absolute-deviation=1072.25
 maximum=291110.44 minimum=274669.98
instructions_per_op:
 mean=   48117.57 standard-deviation=18.58
 median= 48114.51 median-absolute-deviation=12.08
 maximum=48162.18 minimum=48087.18
cpu_cycles_per_op:
 mean=   18953.43 standard-deviation=28.76
 median= 18945.82 median-absolute-deviation=20.84
 maximum=19023.93 minimum=18916.46
```

Fixes: SCYLLADB-259
Refs: SCYLLADB-739
No backport, it's a new feature

Closes scylladb/scylladb#28570

* github.com:scylladb/scylladb:
  scylla.yaml: add write CL guardrails to scylla.yaml
  scylla.yaml: reorganize guardrails config to be in one place
  test: add cluster tests for write CL guardrails
  test: implement test_guardrail_write_consistency_level
  cql3: start using write CL guardrails
  cql3/query_processor: implement metrics to track CL of writes
  db: cql3/query_processor: add write_consistency_levels enum_sets
  config: add write_consistency_levels_* guardrails configuration
2026-03-05 09:55:38 +01:00
Botond Dénes
44b8cad3df service/storage_proxy: add name of table to error message for write errors
It is useful to know what table the failed write belongs to.
2026-03-05 10:51:12 +02:00
Yauheni Khatsianevich
aa85f5a9c3 test: migrating alternator ttl tests to scylla repo
migrating alternator_ttl_tests.py to scylla repo as part of
deprecating dtest framework
migrated tests:
- test_ttl_with_load_and_decommission

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-869

Closes scylladb/scylladb#28858
2026-03-05 10:04:14 +02:00
Nadav Har'El
8e32d97be6 test/alternator: fix run script
The test/alternator/run script currently fails, Scylla fails to boot
complaining that "--alternator-ttl-period-in-seconds" is specified
twice (which is, unfortunately, not allowed). The problem is that
recently we started to set this option in test/cqlpy/run.py, for
CQL's new row-level TTL, so now it is no longer needed in
test/alternator/run - and in fact not allowed and we must remove it.

This patch only affects the script test/alternator/run, and has no
affect on running tests through test.py or Jenkins.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28868
2026-03-05 10:06:38 +03:00
Botond Dénes
9b2242c752 test/cluster/test_repair.py: fix test_repair_timtestamp_difference
This test forgot to await its check() calls, which is the pass-condition
of the test. Once the await was added, the test started failing. Turns
out, the test was broken, but this was never discovered, because due to
the missing await, the errors were not propagated.
This patch adds the missing await and fixes the discovered problems:
* Use cql.run_async() instead of cql.execute()
* Fix json path for timestamp
* Missing flush/compact

Fixes: SCYLLADB-911

Closes scylladb/scylladb#28883
2026-03-05 10:04:49 +03:00
Nadav Har'El
af07718fff test/cqlpy: fix "run --release" for versions 5.4 or older
Recently we started to rely on the options "--auth-superuser-name"
and "--auth-superuser-salted-password" to ensure that a
cassandra/cassandra user exists for tests - without those options
a default superuser no longer exists.

This broke "test/cqlpy/run --release" for old releases, earlier
than 5.4 (in the enterprise stream, 2024.1 or earlier), because
those old release didn't have this option.

So in this patch we fix the "--release" logic that removes these
options from the command line when running these old versions.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28894
2026-03-05 09:59:46 +03:00
Botond Dénes
5e7b966d37 Merge 'Remove prepare_snapshot_for_backup() helper from backup/restore tests' from Pavel Emelyanov
The helper in question duplicates the functionality of `take_snapshot()` one from the same file. The only difference is that it additionally creates keyspace:table with yet another helper, but that helper is also going to be removed (as continuation of #28600 and #28608)

Enhancing tests, not backporting

Closes scylladb/scylladb#28834

* github.com:scylladb/scylladb:
  test_backup: Remove prepare_snapshot_for_backup()
  test_backup: Patch test_simple_backup_and_restore to use take_snapshot()
  test_backup: Patch backup tests to use take_snapshot()
  test_backup: Add helper to take snapshot on a single server
2026-03-05 06:54:07 +02:00
Dani Tweig
25fc8ef14c Add RELENG to milestone-to-Jira sync project keys
Closes scylladb/scylladb#28889
2026-03-05 06:51:21 +02:00
Calle Wilund
35aab75256 test_internode_compression: Add await for "run" coro:s
Fixes: SCYLLADB-907

Closes scylladb/scylladb#28885
2026-03-05 06:50:33 +02:00
Patryk Jędrzejczak
2a3476094e storage_service: raft_topology_cmd_handler: fix use-after-free
8e9c7397c5 made `rs` a reference, which can
lead to use-after-free. The `normal_nodes` map containing the referenced
value can be destroyed before the last use of `rs` when the topology state
is reloaded after a context switch on some `co_await`. The following move
assignment in `storage_service::topology_state_load` causes this:
```
_topology_state_machine._topology = co_await _sys_ks.local().load_topology_state(tablet_hosts);
```

This issue has been discovered in next-2026.1 CI after queueing the
backport of #28558. `test_truncate_during_topology_change` failed after
ASan reported a heap-use-after-free in
```
co_await _repair.local().bootstrap_with_repair(get_token_metadata_ptr(), rs.ring.value().tokens, session);
```
This test enables `delay_bootstrap_120s`, which makes the bug much more
likely to reproduce, but it could happen elsewhere.

No backport needed, as the only backport of #28558 hasn't been merged yet.
The backport PR will cherry-pick this commit.

Closes scylladb/scylladb#28772
2026-03-05 01:50:22 +01:00
Aleksandra Martyniuk
156c29f962 test: add test_group0_tables_use_schema_commitlog 2026-03-04 17:25:06 +01:00
Aleksandra Martyniuk
5306e26b83 db: service: remove group0 tables from schema commitlog schema initializer
Remove group0 tables from schema commitlog schema initializer.
The schema commitlog of group0 tables is ensured by set_is_group0_table.
2026-03-04 17:25:06 +01:00
Aleksandra Martyniuk
690b2c4142 service: ensure that tables updated via group0 use schema commitlog
Set enable_schema_commitlog for each group0 tables.

Assert that group0 tables use schema commitlog in ensure_group0_schema
(per each command).

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-914.
2026-03-04 17:25:04 +01:00
Aleksandra Martyniuk
6b3b174704 db: schema: remove set_is_group0_table param
set_is_group0_table takes an enabled flag, based on which it decides
whether it's a group0 table. The method is called only with enabled = true.

Drop the param. For not group0 tables nothing should be set.
2026-03-04 17:24:34 +01:00
Aleksandra Martyniuk
57f1e46204 test: cluster: tasks: await set_task_ttl
Await set_task_ttl in test_tablet_repair_task_children.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-912

Closes scylladb/scylladb#28882
2026-03-04 17:58:37 +02:00
Dawid Mędrek
d44fc00c4c raft: Make methods throw stopped_error if server aborted
After the previous changes in `raft::server::{add_entry, modify_config}`
(cf. SCYLLADB-841), we also go through other methods of `raft::server`
and verify that they handle the aborted state properly.

I found two methods that do not:

(A) `wait_for_leader`
(B) `wait_for_state_change`

What happened before these changes?

In case (A), the dangerous scenario occurred when `_leader_promise` was
empty on entering the function. In that case, we would construct the
promise and wait on the corresponding future. However, if the server
had been already aborted before the call, the future would never
resolve and we'd be effectively stuck.

Case (B) is fully analogous: instead of `_leader_promise`, we'd work
with `_stte_change_promise`.

There's probably a more proper solution to this problem, but since I'm
not familiar with the internal code of Raft, I fix it this way. We can
improve it further in the future.

We provide two simple validation tests. They verify that after aborting
a `raft::server`, the calls:

* do not hang (the tests would time out otherwise),
* throw raft::stopped_error.

Fixes SCYLLADB-841
2026-03-04 16:28:11 +01:00
Dawid Mędrek
c200d6ab4f raft: Throw stopped_error if server aborted
Before the change, calling `add_entry` or `modify_config` on an already
aborted Raft server could result in an error `not_a_leader` containing
a null server ID. It was possible precisely when forwarding was
disabled in the server configuration.

`not_a_leader` is supposed to return the ID of the current leader,
so that was wrong. Furthermore, the description of the function
specified that if a server is aborted, then it should throw
`stopped_error`.

We fix that issue. A few small reproducer tests were provided to verify
that the functions behave correctly with and without forwarding enabled.

Refs SCYLLADB-841
2026-03-04 16:28:08 +01:00
Marcin Maliszkiewicz
9697b6013f Merge 'test: add missing awaits in test_client_routes_upgrade' from Andrzej Jackowski
Two calls in test_client_routes_upgrade were missing `await`,
so they were never actually executed. This caused Python
to emit RuntimeWarning about unawaited coroutines, and more
importantly, the test skipped important verification steps, which
could mask real bugs or cause flakiness.

Additionally, increase 10s timeouts to 60s to avoid flakiness in slow
environments. Although these tests haven't failed so far, similar
issues have already been observed in other tests with too-short
timeouts.

Fixes: [SCYLLADB-909](https://scylladb.atlassian.net/browse/SCYLLADB-909)

Backport to 2026.1, as the test is also there.

[SCYLLADB-909]: https://scylladb.atlassian.net/browse/SCYLLADB-909?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#28877

* github.com:scylladb/scylladb:
  test: increase timeouts in test_client_routes.py
  test: add missing awaits in test_client_routes_upgrade
2026-03-04 15:26:34 +01:00
Szymon Malewski
212bd6ae1a vector: Vectorize loops in similarity functions
The main loops iterating over vector components were not vectorized due to:
- "cannot prove it is safe to reorder floating-point operations"
- "Cannot vectorize early exit loop with more than one early exit"
The first issue is fixed with adding `#pragma clang fp contract(fast) reassociate(on)`, which allows compiler to optimize floating point operations.
The second issue is solved by refactoring the operations in the affected loop.
Additionally using float operations instead of double increases throughput and numerical accuracy is not the main consideration in vector search scenarios.

Performance measured:
- scylla built using dbuild
- using https://github.com/zilliztech/VectorDBBench (modified to call `SELECT id, similarity_cosine({vector<float, 1536>}, {vector<float, 1536>}) ...` without ANN search):
- client concurrency 20
before: ~2250 QPS
`float` operations: ~2350 QPS
`compute_cosine_similarity` vectorization: ~2500QPS
`extract_float_vector` vectorization: ~3000QPS

Follow-up https://github.com/scylladb/scylladb/pull/28615
Ref https://scylladb.atlassian.net/browse/SCYLLADB-764

Closes scylladb/scylladb#28754
2026-03-04 15:14:53 +01:00
Andrzej Jackowski
221b78cb81 test: increase timeouts in test_client_routes.py
Increase 10s timeouts to 60s to avoid flakiness in slow
environments. Although these tests haven't failed so far, similar
issues have already been observed in other tests with too-short
timeouts.

Test execution time is unaffected; the entire suite in `dev` takes ~30s
before and after this change.
2026-03-04 13:40:30 +01:00
Andrzej Jackowski
527c4141da test: add missing awaits in test_client_routes_upgrade
Two calls in test_client_routes_upgrade were missing `await`,
so they were never actually executed. This caused Python
to emit RuntimeWarning about unawaited coroutines, and more
importantly, the test skipped important verification steps, which
could mask real bugs or cause flakiness.

Fixes: SCYLLADB-909
2026-03-04 13:34:37 +01:00
Piotr Smaron
a31cb18324 db: fix UB in system.clients row sorting
The comparator used to sort per-IP client rows was not a strict-weak-ordering (it could return true in both directions for some pairs), which makes `std::ranges::sort` behavior undefined. A concrete pair that breaks it (and is realistic in system.clients):
a = (port=9042, client_type="cql")
b = (port=10000, client_type="alternator")
With the current comparator:
cmp(a,b) = (9042 < 10000) || ("cql" < "alternator") = true || false = true
cmp(b,a) = (10000 < 9042) || ("alternator" < "cql") = false || true = true
So both directions are true, meaning there is no valid ordering that sort can achieve.

The fix is to sort lexicographically by (port, client_type) to match the table's clustering key and ensure deterministic ordering.

Closes scylladb/scylladb#28844
2026-03-04 14:10:49 +03:00
Avi Kivity
c331796d28 Merge 'Support Min < Precision for approx_exponential_histogram' from Amnon Heiman
This series closes a gap in the approx_exponential_histogram implementation to
cover integer values starting from small Min values.

While the original implementation was focused on durations, where this limitation
was not an issue, over time, there has been a growing need for histograms that
cover smaller values, such as the number of SSTables or the number of items in a
batch.

The reason for the original limitation is inherent to the exponential histogram
math. The previous code required Min to be at least Precision to avoid negative
bit shifts in the exponential calculations.

After this series, approx_exponential_histogram allows Min to be smaller than
Precision by scaling values during indexing. The value is shifted left by
log2 Precision minus log2 Min or zero whichever is larger, and the existing
exponential math is applied. Bucket limits are then scaled back to the original
units. This keeps insertion and retrieval O(1) without runtime branching, at the
cost of repeated bucket limits for some values in the Min to Precision range.

Additional tests cover the new behavior.
Relates to #2785

** New feature, no need to backport. **

Closes scylladb/scylladb#28371

* github.com:scylladb/scylladb:
  estimated_histogram_test.cc: add to_metrics_histogram test
  histogram_metrics_helper.hh: Support Min < Precision
  estimated_histogram_test.cc: Add tests for approx_exponential_histogram with Min<Precision
  estimated_histogram.hh: support Min less than Precision histograms
2026-03-04 12:43:26 +02:00
Szymon Malewski
4c4673e8f9 test: vector_similarity: Fix similarity value checks
`isclose` function checks if returned similarity floats are close enough to expected value, but it doesn't `assert` by itself.
Several tests missed that `assert`, effectively always passing.
With this patch similarity values checks are wrapped in helper function `assert_similarity` with predefined tolerance.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-877

Closes scylladb/scylladb#28748
2026-03-04 09:53:32 +01:00
Marcin Maliszkiewicz
c7d3f80863 Merge 'auth: do not create default 'cassandra:cassandra' superuser' from Dario Mirovic
This patch series removes creation of default 'cassandra:cassandra' superuser on system start.

Disable creation of a superuser with default 'cassandra:cassandra' credentials to improve security. The current flow requires clients to create another superuser and then drop the default `cassandra:cassandra' role. For those who do, there is a time window where the default credentials exist. For those who do not, that role stays. We want to improve security by forcing the client to either use config to specify default values for default superuser name and password or use cqlsh over maintenance socket connection to explicitly create/alter a superuser role.

The patch series:
- Enable role modification over the maintenance socket
- Stop using default 'cassandra' value for default superuser, skipping creation instead

Design document: https://scylladb.atlassian.net/wiki/spaces/RND/pages/165773327/Drop+default+cassandra+superuser

Fixes scylladb/scylla-enterprise#5657

This is an improvement. It does not need a backport.

Closes scylladb/scylladb#27215

* github.com:scylladb/scylladb:
  config: enable maintenance socket in workdir by default
  docs: auth: do not specify password with -p option
  docs: update documentation related to default superuser
  test: maintenance socket role management
  test: cluster: add logs to test_maintenance_socket.py
  test: pylib: fix connect_driver handling when adding and starting server
  auth: do not create default 'cassandra:cassandra' superuser
  auth: remove redundant DEFAULT_USER_NAME from password authenticator
  auth: enable role management operations via maintenance socket
  client_state: add has_superuser method
  client_state: add _bypass_auth_checks flag
  auth: let maintenance_socket_role_manager know if node is in maintenance mode
  auth: remove class registrator usage
  auth: instantiate auth service with factory functors
  auth: add service constructor with factory functors
  auth: add transitional.hh file
  service: qos: handle special scheduling group case for maintenance socket
  service: qos: use _auth_integration as condition for using _auth_integration
2026-03-04 09:43:57 +01:00
Piotr Dulikowski
85dcbfae9a Merge 'hint: Don't switch group in database::apply_hint()' from Pavel Emelyanov
The method is called from storage_proxy::mutate_hint() which is in turn called from hint_mutation::apply_locally(). The latter is either called from directly by hint sender, which already runs in streaming group, or via RPC HINT_MUTATION handler which uses index 1 that negotiates streaming group as well.

To be sure, add a debugging check for current group being the expected one.

Code cleanup, not backporting

Closes scylladb/scylladb#28545

* github.com:scylladb/scylladb:
  hint: Don't switch group in database::apply_hint()
  hint_sender: Switch to sender group on stop either
2026-03-04 09:36:38 +01:00
Pavel Emelyanov
5793e305b5 test_backup: Remove prepare_snapshot_for_backup()
It's now unused

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-04 11:33:43 +03:00
Pavel Emelyanov
ffbd9a3218 test_backup: Patch test_simple_backup_and_restore to use take_snapshot()
This change is a bit more careful, as the test collects files from
snapshot directory several times. Before patching it to use the helper,
it collected _all_ the files. Now the helper only provides TOC-s, but
that's fine -- the only check that relies on that may also re-collect
TOC-s and compare new set with old set.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-04 11:33:43 +03:00
Pavel Emelyanov
c1b0ac141b test_backup: Patch backup tests to use take_snapshot()
Some of those tests need to update the hard-coded 'backup' snapshot name
to use the one provided by take_snapshot() helper. Other than that, the
patching is pretty straightforward.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-04 11:33:43 +03:00
Pavel Emelyanov
ea17c26fd9 test_backup: Add helper to take snapshot on a single server
The take_snapshot() helper returns a dict(server: list[string]). When
there's only one server to work with, it's more handy to just get a
single list of sstables.

Next patches will make use of that helper.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-04 11:31:39 +03:00
Botond Dénes
e7487c21e4 test/{boost,cluster}: add test for tombstone gc mode=repair with RF=1 2026-03-04 09:45:38 +02:00
Botond Dénes
5998a859f7 tombstone_gc: allow use of repair-mode for RF=1 tables
Modify the methods which calculate the default gc mode as well as that
which validates whether repair-mode can be used at all, so both accepts
use of repair-mode on RF=1 tables.

This de-facto changes the default tombstone-gc to repair-mode for all
tables. Documentation is updated accordingly.

Some tests need adjusting:
* cqlpy/test_select_from_mutation_fragments.py: disable GC for some test
  cases because this patch makes tombstones they write subject to GC
  when using defaults.
* test/cluster/test_mv.py::test_mv_tombstone_gc_not_inherited used
  repair-mode as a non-default for the base table and expected the MV to
  revert to default. Another mode has to be used as the non-default
  (immediate).
* test/cqlpy/test_tools.py::test_scylla_sstable_dump_schema: don't
  compare tombstone_gc schema extension when comparing dumped schema vs.
  original. The tool's schema loader doesn't have access to the keyspace
  definition so it will come up with different defaults for
  tombstone-gc.
* test/boost/row_cache_test.cc::test_populating_cache_with_expired_and_nonexpired_tombstones
  sets tombstone expiry assuming the tombstone-gc timeout-mode default.
  Change the CREATE TABLE statement to set the expected mode.
2026-03-04 09:44:24 +02:00
Andrzej Jackowski
c0e94828de scylla.yaml: add write CL guardrails to scylla.yaml
Disabled by default. This change is introduced only to document the
guardrail.

Refs: SCYLLADB-259
2026-03-04 08:00:17 +01:00
Andrzej Jackowski
038f89ede4 scylla.yaml: reorganize guardrails config to be in one place
Also change the format of the section header and add "#" to empty
lines, so that in the future no one splits the section by adding new
configs.
2026-03-04 08:00:17 +01:00
Andrzej Jackowski
ec42fdfd01 test: add cluster tests for write CL guardrails
Most of the functionality is tested in cqlpy tests located in
`test_guardrail_write_consistency_level.py`. Add two tests
that require the cluster framework:

- `test_invalid_write_cl_guardrail_config` checks the node startup
  path when incorrect `write_consistency_levels_warned` and
  `write_consistency_levels_disallowed` values are used.
- `test_write_cl_default` checks the behavior of the default
  configuration using a multi-node cluster.

Tests execution time:
 - Dev: 10s
 - Debug: 18s

Refs: SCYLLADB-259
2026-03-04 08:00:17 +01:00
Andrzej Jackowski
446539f12f test: implement test_guardrail_write_consistency_level
Implement basic tests for write consistency level guardrails,
verifying that they work for each type of write request (inserts,
updates, deletes, logged batches, unlogged batches, conditional batches,
and counter operations).

All tests are marked as Scylla-only because they currently don't
pass with Cassandra due to differences in handling superusers (see:
SCYLLADB-882).

Tests execution time:
 - Dev: 3s
 - Debug: 14s

Refs: SCYLLADB-259
Refs: SCYLLADB-882
2026-03-04 08:00:13 +01:00
Avi Kivity
85bd6d0114 Merge 'Add multiple-shard persistent metadata storage for strongly consistent tables' from Wojciech Mitros
In this series we introduce new system tables and use them for storing the raft metadata
for strongly consistent tables. In contrast to the previously used raft group0 tables, the
new tables can store data on any shard. The tables also allow specifying the shard where
each partition should reside, which enables the tablets of strongly consistent tables to have
their raft group metadata co-located on the same shard as the tablet replica.

The new tables have almost the same schemas as the raft group0 tables. However, they
have an additional column in their partition keys. The additional column is the shard
that specifies where the data should be located. While a tablet and its corresponding
raft group server resides on some shard, it now writes and reads all requests to the
metadata tables using its shard in addition to the group_id.

The extra partition key column is used by the new partitioner and sharder which allow
this special shard routing. The partitioner encodes the shard in the token and the
sharder decodes the shard from the token. This approach for routing avoids any
additional lookups (for the tablet mapping) during operations on the new tables
and it also doesn't require keeping any state. It also doesn't interact negatively
with resharding - as long as tablets (and their corresponding raft metadata) occupy
some shard, we do not allow starting the node with a shard count lower than the
id of this shard. When increasing the shard count, the routing does not change,
similarly to how tablet allocation doesn't change.

To use the new tables, a new implementation of `raft::persistence` is added. Currently,
it's almost an exact copy of the `raft_sys_table_storage` which just uses the new tables,
but in the future we can modify it with changes specific to metadata (or mutation)
storage for strongly consistent tables. The new storage is used in the `groups_manager`,
which combined with the removal of some `this_shard_id() == 0` checks, allows strongly
consistent tables to be used on all shards.

This approach for making sure that the reads/writes to the new tables end up on the correct shards
won in the balance of complexity/usability/performance against a few other approaches we've considered.
They include:
1. Making the Raft server read/write directly to the database, skipping the sharder, on its shard, while using
the default partitioner/sharder. This approach could let us avoid changing the schema and there should be
no problems for reads and writes performed by the Raft server. However, in this approach we would input
data in tables conflicting with the placement determined by the sharder. As a result, any read going through
the sharder could miss the rows it was supposed to read. Even when reading all shards to find a specific value,
there is a risk of polluting the cache - the rows loaded on incorrect shards may persist in the cache for an unknown
amount of time. The cache may also mistakenly remember that a row is missing, even though it's actually present,
just on an incorrect shard.
Some of the issues with this approach could be worked around using another sharder which always returns
this_shard_id() when asked about a shard. It's not clear how such a sharder would implement a method like
`token_for_next_shard`, and how much simpler it would be compared to the current "identity" sharder.
2. Using a sharder depending on the current allocation of tablets on the node. This approach relies on the
knowledge of group_id -> shard mapping at any point in time in the cluster. For this approach we'd also
need to either add a custom partitioner which encodes the group_id in the token, or we'd need to track the
token(group_id) -> shard mapping. This approach has the benefit over the one used in the series of keeping
the partition key as just group_id. However, it requires more logic, and the access to the live state of the node
in the sharder, and it's not static - the same token may be sharded differently depending on the state of the
node - it shouldn't occur in practice, but if we changed the state of the node before adjusting the table data,
we would be unable to access/fix the stale data without artificially also changing the state of the node.
3. Using metadata tables co-located to the strongly consistent tables. This approach could simplify the
metadata migrations in the future, however it would require additional schema management of all co-located
metadata tables, and it's not even obvious what could be used as the partition key in these tables - some
metadata is per-raft-group, so we couldn't reuse the partition key of the strongly consistent table for it. And
finding and remembering a partition key that is routed to a specific shard is not a simple task. Finally, splits
and merges will most likely need special handling for metadata anyway, so we wouldn't even make use of
co-located table's splits and merges.

Fixes [SCYLLADB-361](https://scylladb.atlassian.net/browse/SCYLLADB-361)

[SCYLLADB-361]: https://scylladb.atlassian.net/browse/SCYLLADB-361?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#28509

* github.com:scylladb/scylladb:
  docs: add strong consistency doc
  test/cluster: add tests for strongly-consistent tables' metadata persistence
  raft: enable multi-shard raft groups for strongly consistent tablets
  test/raft: add unit tests for raft_groups_storage
  raft: add raft_groups_storage persistence class
  db: add system tables for strongly consistent tables' raft groups
  dht: add fixed_shard_partitioner and fixed_shard_sharder
  raft: add group_id -> shard mapping to raft_group_registry
  schema: add with_sharder overload accepting static_sharder reference
2026-03-04 08:55:43 +02:00
Piotr Dulikowski
2fb981413a Merge 'vector_search: test: fix HTTPS client test flakiness' from Karol Nowacki
The default 100ms timeout for client readiness in tests is too
aggressive. In some test environments, this is not enough time for
client creation, which involves address resolution and TLS certificate
reading, leading to flaky tests.

This commit increases the default client creation timeout to 10 seconds.
This makes the tests more robust, especially in slower execution
environments, and prevents similar flakiness in other test cases.

Fixes: VECTOR-547, SCYLLADB-802, SCYLLADB-825, SCYLLADB-826

Backport to 2025.4 and 2026.1, as the same problem occurs on these branches and can potentially make the CI flaky there as well.

Closes scylladb/scylladb#28846

* github.com:scylladb/scylladb:
  vector_search: test: include ANN error in assertion
  vector_search: test: fix HTTPS client test flakiness
2026-03-04 08:55:43 +02:00
Wojciech Mitros
38f02b8d76 mv: remove dead code in view_updates::can_skip_view_updates
When we create a materialized view, we consider 2 cases:
1. the view's primary key contains a column that is not
in the primary key of the base table
2. the view's primary key doesn't contain such a column

In the 2nd case, we add all columns from the base table
to the schema of the view (as virtual columns). As a result,
all of these columns are effectively "selected" in
view_updates::can_skip_view_updates. Same thing happens when
we add new columns to the base table using ALTER.
Because of this, we can never have !column_is_selected and
!has_base_non_pk_columns_in_view_pk at the same time. And
thus, the check (!column_is_selected
&& _base_info.has_base_non_pk_columns_in_view_pk) is always
the same as (!column_is_selected).
Because we immediately return after this check, the tail of
this function is also never reached - all checks after the
(column_is_selected) are affected by this. Also, the condition
(!column_is_selected && base_has_nonexpiring_marker) is always
false at the point it is called. And this in turn makes the
`base_has_nonexpiring_marker` unused, so we delete it as well.

It's worth considering, why did we even have
`base_has_nonexpiring_marker` if it's effectively unused. We
initially introduced it in bd52e05ae2 and we (incorrectly)
used it to allow skipping view updates even if the liveness of
virtual columns changed. Soon after, in 5f85a7a821, we
started categorizing virtual columns as column_is_selected == true
and we moved the liveness checks for virtual columns to the
`if (column_is_selected)` clause, before the `base_has_nonexpiring_marker`
check. We changed this because even if we have a nonexpiring marker
right now, it may be changed in the future, in which case the liveness
of the view row will depend on liveness of the virtual columns and
we'll need to have the view updates from the time the row marker was
nonexpiring.

Closes scylladb/scylladb#28838
2026-03-04 08:55:43 +02:00
Geoff Montee
0eb5603ebd Docs: describe the system tables
Fixes issue #12818 with the following docs changes:

docs/dev/system_keyspace.md: Added missing system tables, added table of contents (TOC), added categories

Closes scylladb/scylladb#27789
2026-03-04 08:55:43 +02:00
Botond Dénes
f156bcddab Merge 'test: decrease strain in test_startup_response' from Marcin Maliszkiewicz
For 2025.3 and 2025.4 this test runs order of magnitude
slower in debug mode. Potentially due to passwords::check
running in alien thread and overwhelming the CPU (this is
fixed in newer versions).

Decreasing the number of connections in test makes it fast
again, without breaking reproducibility.

As additional measure we double the timeout.

The fix is now cherry-picked to master as sometimes
test fails there too.

(cherry picked from commit 1f1fc2c2ac)

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-795

backport: 2026.1, already on other stable branches

Closes scylladb/scylladb#28848

* github.com:scylladb/scylladb:
  test: add more logs to test_startup_no_auth_response
  test: decrease strain in test_startup_response
2026-03-04 08:55:43 +02:00
Andrzej Jackowski
bb359b3b78 cql3: start using write CL guardrails
Enable verification of write consistency level guardrails in
`modification_statement` and `batch_statement`.

Neither guardrail is enabled by default, so as not to disrupt clusters
that are currently using any of the CLs for writes. The warning
guardrail may seem harmless, as it only adds a warning to the CQL
response; however, enabling it can significantly increase network
traffic (as a warning message is added to each response) and also
decrease throughput due to additional allocations required to prepare
the warning. Therefore, both guardrails should be enabled with care.
The newly added `writes_per_consistency_level` metric, which is
incremented unconditionally, can help decide whether a guardrail can
be safely enabled in an existing cluster.

This commit adds additional `if` instructions on the critical path.
However, based on the `perf_simple_query` benchmark for writes,
the difference is marginal (~40 additional instructions, which is
a relative difference smaller than 0.001).

BEFORE:
```
291443.35 tps ( 53.3 allocs/op,  16.0 logallocs/op,  14.2 tasks/op,   48067 insns/op,   18885 cycles/op,        0 errors)
throughput:
	mean=   289743.07 standard-deviation=6075.60
	median= 291424.69 median-absolute-deviation=1702.56
	maximum=292498.27 minimum=261920.06
instructions_per_op:
	mean=   48072.30 standard-deviation=21.15
	median= 48074.49 median-absolute-deviation=12.07
	maximum=48119.87 minimum=48019.89
cpu_cycles_per_op:
	mean=   18884.09 standard-deviation=56.43
	median= 18877.33 median-absolute-deviation=14.71
	maximum=19155.48 minimum=18821.57
```

AFTER:
```
290108.83 tps ( 53.3 allocs/op,  16.0 logallocs/op,  14.2 tasks/op,   48121 insns/op,   18988 cycles/op,        0 errors)
throughput:
	mean=   289105.08 standard-deviation=3626.58
	median= 290018.90 median-absolute-deviation=1072.25
	maximum=291110.44 minimum=274669.98
instructions_per_op:
	mean=   48117.57 standard-deviation=18.58
	median= 48114.51 median-absolute-deviation=12.08
	maximum=48162.18 minimum=48087.18
cpu_cycles_per_op:
	mean=   18953.43 standard-deviation=28.76
	median= 18945.82 median-absolute-deviation=20.84
	maximum=19023.93 minimum=18916.46
```

Fixes: SCYLLADB-259
2026-03-04 07:26:00 +01:00
Asias He
225b10b683 repair: Fix rwlock in compaction_state and lock holder lifecycle
Consider this:

- repair takes the lock holder
- tablet merge filber destories the compaction group and the compaction state
- repair fails
- repair destroy the lock holder

This is observed in the test:

```
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036] Repair 1 out of 1 tablets: table=sec_index.users range=(432345564227567615,504403158265495551] replicas=[0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea:15, 498e354c-1254-4d8d-a565-2f5c6523845a:9, 5208598c-84f0-4526-bb7f-573728592172:28]

...

repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: Started to repair 1 out of 1 tables in keyspace=sec_index, table=users, table_id=ea2072d0-ccd9-11f0-8dba-c5ab01bffb77, repair_reason=repair
repair - Enable incremental repair for table=sec_index.users range=(432345564227567615,504403158265495551]
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: get_sync_boundary: got error from node=0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea, keyspace=sec_index, table=users, range=(432345564227567615,504403158265495551], error=seastar::rpc::remote_verb_error (Compaction state for table [0x60f008fa34c0] not found)
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge

....

scylla[10793] Segmentation fault on shard 28, in scheduling group streaming
```

The rwlock in compaction_state could be destroyed before the lock holder
of the rwlock is destroyed. This causes user after free when the lock
the holder is destroyed.

To fix it, users of repair lock will now be waited when a compaction
group is being stopped.
That way, compaction group - which controls the lifetime of rwlock -
cannot be destroyed while the lock is held.
Additionally, the merge completion fiber - that might remove groups -
is properly serialized with incremental repair.

The issue can be reproduced using sanitize build consistently and can not
be reproduced after the fix.

Fixes #27365

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2026-03-03 21:05:15 -03:00
Raphael S. Carvalho
1d8903d9f7 repair: Prevent repair lock holder leakage after table drop
Prevent repair lock holder from being leaked in repair_service when table
is dropped midway.
The leakage might result in use-after-free later, since the repair lock
itself will be gone after table drop.
The RPC verb that removes the lock on success path will not be called
by coordinator after table was dropped.

Refs #27365.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-896.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2026-03-03 21:05:10 -03:00
Dario Mirovic
06af4480ea config: enable maintenance socket in workdir by default
We want to enable maintenance socket by default.
This will prevent users from having to reboot a server to enable it.
Also, there is little point in having maintenance socket that is turned off,
and we want users to use it. After this patch series, they will have
to use it. Note that while config seeding exists, we do not encourage it
for production deployments.

This patch changes default maintenance_socket value from ignore to workdir.
This enables maintenance socket without specifying an explicit path.

Refs SCYLLADB-409
2026-03-04 00:01:07 +01:00
Dario Mirovic
6e83fb5029 docs: auth: do not specify password with -p option
Specifying password with -p option is considered unsafe.
The password will be saved in bash history.
The preferred approach is to enter the password when prompted.
Any approach that passes the password via command line arguments
makes that password visible in process options (ps command), no matter
if the password is passed directly or as an environment variable.

Refs SCYLLADB-409
2026-03-04 00:01:07 +01:00
Dario Mirovic
afafb8a8fa docs: update documentation related to default superuser
Update create superuser procedure:
- Remove notes about default `cassandra` superuser
- Add create superuser using existing superuser section
- Update create superuser by using `scylla.yaml` config
- Add create superuser using maintenance socket

Update password reset procedure:
- Add maintenance socket approach
- Remove the old approach with deleting all the roles

Update enabling authentication with downtime and during runtime:
- Mention creating new superuser over the maintenance socket
- Remove default superuser usage

Update enable authorization:
- Mention creating new superuser over the maintenance socket
- Remove mention of default superuser

Reasoning for deletion of the old approach:
- [old] Needs cluster downtime, removes all roles, needs recreation of roles,
  needs maintenance socket anyways, if config values are not used for superuser
- [new] No cluster downtime, possibly one node restart to enable maintenance
  socket, faster

Refs SCYLLADB-409
2026-03-04 00:01:07 +01:00
Dario Mirovic
3db74aaf5f test: maintenance socket role management
Introduce a test that cover:
- Server startup without credentials config seeding with no roles created
- Await maintenance socket role management to be enabled
- `CREATE ROLE`, `ALTER ROLE`, and `DROP ROLE` statement execution success

All the tests in the test_maintenance_socket.py module take 2-3 seconds
to execute.

Explicitly shut down Cluster objects to prevent 'RuntimeError: cannot
schedule new futures after shutdown'.

Refs SCYLLADB-409
2026-03-03 23:57:50 +01:00
Dario Mirovic
f74fe22386 test: cluster: add logs to test_maintenance_socket.py
Add logs to test_maintenance_socket.py test test_maintenance_socket.
This approach offers additional visibility in case of test failure.
Such logs will be added to new tests in a follow up patch in this
patch series.

Refs SCYLLADB-409
2026-03-03 23:42:25 +01:00
Dario Mirovic
0e5ddec2a8 test: pylib: fix connect_driver handling when adding and starting server
When connect_driver=False, the expected server up state should be
capped to HOST_ID_QUERIED. This is to avoid waiting for CQL readiness,
which requires a superuser to be present.

This logic was only in ScyllaCluster.server_start. ManagerClient.server_add
with start=True and connect_driver=False would still wait for CQL and hang
if no superuser is present. The workaround was to call
ManagerClient.server_add(start=False, connect_driver=False) followed by
ManagerClient.server_start(connect_driver=False).

This patch moves the capping from ScyllaCluster.server_start to
ManagerClient.server_add and ManagerClient.server_start, where connect_driver
is processed. ScyllaCluster only receives the already resolved
expected_server_up_state value.

Refs SCYLLADB-409
2026-03-03 23:42:25 +01:00
Dario Mirovic
fd17dcbec8 auth: do not create default 'cassandra:cassandra' superuser
Changes the behavior of default superuser creation.
Previously, without configuration 'cassandra:cassandra' credentials
were used. Now default superuser creation is skipped if not configured.

The two ways to create default superuser are:
- Config file - auth_superuser_name and auth_superuser_salted_password fields
- Maintenance socket - connect over maintenance socket and CREATE/ALTER ROLE ...

Behavior changes:

Old behavior:
- No config - 'cassandra:cassandra' created
- auth_superuser_name only - <name>:cassandra created
- auth_superuser_salted_password only - 'cassandra:<password>' created
- Both specified - '<name>:<password>' created

New behavior:
- No config - no default superuser
    - Requires maintenance socket setup
- auth_superuser_name only - '<name>:' created WITHOUT password
    - Requires maintenance socket setup
- auth_superuser_salted_password only - no default superuser
- Both specified - '<name>:<password>' created

Fixes SCYLLADB-409
2026-03-03 23:42:25 +01:00
Dario Mirovic
9dc1deccf3 auth: remove redundant DEFAULT_USER_NAME from password authenticator
Remove redundant DEFAULT_USER_NAME from password_authenticator.cc file.
It is just a copy of meta::DEFAULT_SUPERUSER_NAME.

Refs SCYLLADB-409
2026-03-03 23:42:25 +01:00
Dario Mirovic
45628cf041 auth: enable role management operations via maintenance socket
Introduce maintenance_socket_authenticator and rework
maintenance_socket_role_manager to support role management operations.

Maintenance auth service uses allow_all_authenticator. To allow
role modification statements over the maintenance socket connections,
we need to treat the maintenance socket connections as superusers and
give them proper access rights.

Possible approaches are:
1. Modify allow_all_authenticator with conditional logic that
   password_authenticator already does
2. Modify password_authenticator with conditional logic specific
   for the maintenance socket connections
3. Extend password_authenticator, overriding the methods that differ

Option 3 is chosen: maintenance_socket_authenticator extends
password_authenticator with authentication disabled.

The maintenance_socket_role_manager is reworked to lazily create a
standard_role_manager once the node joins the cluster, delegating role
operations to it. In maintenance mode role operations remain disabled.

Refs SCYLLADB-409
2026-03-03 23:41:05 +01:00
Dario Mirovic
6a1edab2ac client_state: add has_superuser method
Encapsulate the superuser check in client_state so that it
respects _bypass_auth_checks. Connections that bypass auth
(internal callers and the maintenance socket) are always
considered superusers.

Migrate existing call sites from auth::has_superuser(service, user)
to client_state.has_superuser(). Also add _bypass_auth_checks
handling to ensure_not_anonymous().

Refs SCYLLADB-409
2026-03-03 22:31:35 +01:00
Dario Mirovic
d765b5b309 client_state: add _bypass_auth_checks flag
Authorization checks were previously skipped based on the
_is_internal flag. This couples two concerns: marking client
state as internal and bypassing authorization.

Introduce _bypass_auth_checks to handle only the authorization
bypass. Internal client state sets it to true, preserving current
behavior. External client state accepts it as a constructor
parameter, defaulting to false.

This will allow maintenance socket connections to skip
authorization without being marked as internal.

Refs SCYLLADB-409
2026-03-03 22:31:35 +01:00
Dario Mirovic
b68656b59f auth: let maintenance_socket_role_manager know if node is in maintenance mode
This patch is part of preparations for dropping 'cassandra::cassandra'
default superuser. When that is implemented, maintenance_socket_role_manager
will have two modes of work:
1. in maintenance mode, where role operations are forbidden
2. in normal mode, where role operations are allowed

To execute the role operations, the node has to join a cluster.
In maintenance mode the node does not join a cluster.

This patch lets maintenance_socket_role_manager know if it works under
maintenance mode and returns appropriate error message when role
operations execution is requested.

Refs SCYLLADB-409
2026-03-03 22:31:35 +01:00
Dario Mirovic
3bef493a35 auth: remove class registrator usage
This patch removes class registrator usage in auth module.
It is not used after switching to factory functor initialization
of auth service.

Several role manager, authenticator, and authorizer name variables
are returned as well, and hardcoded inside qualified_java_name method,
since that is the only place they are ever used.

Refs SCYLLADB-409
2026-03-03 22:31:35 +01:00
Dario Mirovic
eab24ff3b0 auth: instantiate auth service with factory functors
Auth service is instantiated with the constructor that accepts
service_config, which then uses class registrator to instantiate
authorizer, authenticator, and role manager.

This patch switches to instantiating auth service via the constructor
that accepts factory functors. This is a step towards removing
usage of class registrator.

Refs SCYLLADB-409
2026-03-03 22:31:35 +01:00
Dario Mirovic
bfff07eacb auth: add service constructor with factory functors
Auth service can be initialized:
- [current] by passing instantiated authorizer, authenticator, role manager
- [current] by passing service_config, which then uses class registrator to instantiate authorizer, authenticator, role manager
    - This approach is easy to use with sharded services
- [new] by passing factory functors which instantiate authorizer, authenticator, role manager
    - This approach is also easy to use with sharded services

Refs SCYLLADB-409
2026-03-03 22:31:35 +01:00
Dario Mirovic
e8e00c874b auth: add transitional.hh file
In a follow-up patch in this patch series class registrator will be removed.
Adding transitional.hh file will be necessary to expose the authenticator and authorizer.

Refs SCYLLADB-409
2026-03-03 22:31:35 +01:00
Dario Mirovic
e5218157de service: qos: handle special scheduling group case for maintenance socket
service_level_controller has special handling for maintenance socket connections.
If the current user is not a named user, it should use the default scheduling group.

The reason is that the maintenance socket can communicate with Scylla before
auth_integration is registered.

The guard is already present, but it was omitted in get_cached_user_scheduling_group.

This also fixes flakiness in test_maintenance_socket.py tests.

Refs SCYLLADB-409
2026-03-03 22:31:35 +01:00
Dario Mirovic
dc9a90d7cb service: qos: use _auth_integration as condition for using _auth_integration
Maintenance socket connections can be established before _auth_integration is
initialized. The fix introduced with scylladb/scylladb#26856 PR check for
the value of user variable. For maintenance socket connections it will be an
anonymous user, and will fall back to using default scheduling group.

This patch changes the criteria for using default scheduling group from
the user variable to checking the _auth_integration variable itself:
- If _auth_integration is not initialized, use default scheduling group
- If _auth_integration is initialized, let it choose the scheduling group

Refs SCYLLADB-409
2026-03-03 22:31:35 +01:00
Andrzej Jackowski
371cdb3c81 cql3/query_processor: implement metrics to track CL of writes
Add `write_consistency_levels_disallowed_violations` and
`write_consistency_levels_warned_violations` metrics to track
violations of write_consistency_levels guardrails.

Add `writes_per_consistency_level` to track what CL is used by
writes, regardless of the guardrails configuration.
Data gathered by this metric can be used to decide whether enabling
a particular write consistency level guardrail in a particular
existing cluster is safe.

Refs: SCYLLADB-259
2026-03-03 21:18:11 +01:00
Andrzej Jackowski
3606934458 db: cql3/query_processor: add write_consistency_levels enum_sets
Add enum_sets to query_processor that track the configuration
values of `write_consistency_levels_warned` and
`write_consistency_levels_disallowed`.

Refs: SCYLLADB-259
2026-03-03 20:28:57 +01:00
Dawid Mędrek
7fd083e329 test: raft: Introduce get_default_cluster
We introduce a function creating a Raft cluster with parameters usually
used by the tests. This will avoid code duplication, especially after
introducing new tests in the following commits.

Note that the test `test_aborting_wait_for_state_change` has changed:
the previous complex configuration was unnecessary for it (I wrote it).
2026-03-03 18:50:21 +01:00
Pavel Emelyanov
b768753c0f test: Remove passing default "expected_replicas" to check_mutation_replicas()
The value of None is default, callers don't need to specify it
explicitly

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-03 17:28:21 +03:00
Pavel Emelyanov
b8ae9ede63 test: Remove scope and primary-replica-only arguments from check_mutation_replicas() helper
These two are only used to print into logs on error. However, their
values can be found from previous logs and test execution context.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-03 17:26:25 +03:00
Karol Nowacki
45477d9c6b vector_search: test: include ANN error in assertion
When the test fails, the assertion message does not include
the error from the ANN request.

This change enhances the assertion to include the specific ANN error,
making it easier to diagnose test failures.
2026-03-03 14:19:20 +01:00
Karol Nowacki
ab6c222fc4 vector_search: test: fix HTTPS client test flakiness
The default 100ms timeout for client readiness in tests is too
aggressive. In some test environments, this is not enough time for
client creation, which involves address resolution and TLS certificate
reading, leading to flaky tests.

This commit increases the default client creation timeout to 10 seconds.
This makes the tests more robust, especially in slower execution
environments, and prevents similar flakiness in other test cases.

Fixes: VECTOR-547, SCYLLADB-802
2026-03-03 14:19:20 +01:00
Botond Dénes
4f5310bc72 replica/table: update rf=1 table registry in shared tombstone-gc state
On every update of the ERM, update the state of the current table in the
registry of RF=1 tables in shared tombstone gc state.
Ensures that tombstone gc stops collection of tombstones in immediate
mode as soon as the table starts transitioning away from RF=1.
2026-03-03 14:09:28 +02:00
Botond Dénes
7c2c63ab43 tombstone_gc: tombstone_gc_before_getter: consider RF when getting gc before time
Currently we cannot use repair-mode tombstone gc on RF=1 tables, because
such tables don't need repair and so there won't be repair history to
use to produce gc_before times.
Introduce shared_tombstone_gc_state::_rf_one_tables which will keep a
registry of RF=1 tables. Keeping this up to date is left to outside code
(table.cc). Consult the registry to determine whether a table is RF=1 or
not, so the repair history check can be ellided for rf=1 tables.
Not wired in yet into the table code.
2026-03-03 14:09:28 +02:00
Botond Dénes
074006749c tombstone_gc: unpack per_table_history_maps
It is now a class with a single member, replace usage with that of the
member (through an alias to reduce churn).
2026-03-03 14:09:28 +02:00
Botond Dénes
d6e2d44759 tombstone_gc: extract _group0_gc_time from per_table_history_map
Doesn't belong there. Also, having it as a separate member of
shared_tombstone_gc_state makes updating _group0_gc_time cheaper, as the
update doesn't have to do a copy-mutate-swap of the history maps.
2026-03-03 14:09:28 +02:00
Botond Dénes
5fd9fc3056 tombstone_gc: drop tombstone_gc_state(nullptr) ctor and operator bool()
Both are ambiguous and all users were migrated away to more meaningful
alternatives. They are now unused, drop them.
2026-03-03 14:09:28 +02:00
Botond Dénes
a785c0cf41 test/lib/random_schema: use timeout-mode tombstone_gc
This is the current de-facto default for all tests using random schema
and some are apparently relying on this. Make this explicit to avoid
upsetting tests, by the impending change of this default to repair.
2026-03-03 14:09:28 +02:00
Botond Dénes
d10e622a3b tombstone_gc_options: add C++ friendly constructor
So one can create options with strong types, instead of from a map of
strings.
2026-03-03 14:09:28 +02:00
Botond Dénes
6004e84f18 test: move away from tombstone_gc_state(nullptr) ctor
Use for_tests() instead (or no_gc() where approriate).
2026-03-03 14:09:28 +02:00
Botond Dénes
3c34598d88 treewide: move away from tombstone_gc_state(nullptr) ctor
It is ambigous, use the appropriate no-gc or gc-all factories instead,
as appropriate.
A special note for mutation::compacted(): according to the comment above
it, it doesn't drop expired tombstones but as it is currently, it
actually does. Change the tombstone gc param for the underlying call to
compact_for_compaction() to uphold the comment. This is used in tests
mostly, so no fallout expected.

Tests are handled in the next commit, to reduce noise.

Two tests in mutation_test.cc have to be updated:

* test_compactor_range_tombstone_spanning_many_pages
  has to be updated in this commit, as it uses
  mutation_partition::compact_for_query() as well as
  compact_for_query(). The test passes default constructed
  tombstone_gc() to the latter while the former now uses no-gc
  creating a mismatch in tombstone gc behaviour, resulting in test
  failure. Update the test to also pass no-gc to compact_for_query().

* test_query_digest similarly uses mutation_partition::query_mutation()
  and another compaction method, having to match the no-gc now used in
  query_mutation().
2026-03-03 14:09:28 +02:00
Botond Dénes
04b001daa6 sstable: move away from tombstone_gc_mode::operator bool()
It is ambiguous, use tombstone_gc_mode::is_gc_enabled() instead.
Note that the two has slightly different meanings, operator bool()
returned true when repair-history related functionality was enabled.
This is fine, because the only two users are logs, where the two
meanings are close enough. All other users were eliminated or migrated
already, taking the change in meaning into account.
2026-03-03 14:09:28 +02:00
Botond Dénes
6364e35403 replica/table: add get_tombstone_gc_state()
Shorthand for
get_compaction_manager().get_shared_tombstone_gc_state().get_tombstone_gc_state().
2026-03-03 14:09:28 +02:00
Botond Dénes
f3ee6a0bd1 compaction: use tombstone_gc_state with value semantics
Instead of passing around references to it, pass around values.

This object is now designed to be used as a value-type, after recent
refactoring.
2026-03-03 14:09:27 +02:00
Botond Dénes
83e20d920e db/row_cache: use tombstone_gc_state with value semantics
Instead of keeping a pointer to it. Replace nullptr with
tombstone_gc_state::no_gc().

This object is now designed to be used as a value-type, after recent
refactoring.
2026-03-03 14:09:27 +02:00
Botond Dénes
041ab593c7 tombstone_gc: introduce tombstone_gc_state::for_tests()
To replace the usage of tombstone_gc_state(nullptr) usage in tests
specifically. This more verbose factory method will hopefully convey
that this is not to be used in production code. The nullptr constructor
doesn't convey this and in fact it was used in production code
here-and-there.
2026-03-03 14:09:27 +02:00
Artsiom Mishuta
5c84a76b28 test.py: setup pytest logger
This commit introduces pure pytest logging into a file

Previously, test.py used pytest as a script(not a framework) and just captured pytest stdout and logged this data by itself

This commit sets up the log files format that additionaly display Python processName, threadName adn taskName because test.py test cases use them, and now it is so hard to investigate issues that are connected with parallelism inside test case themselve

In addition, commit splits the logging of different pytest workers(xdist) into different files. If pytest workers have ho failed test - log file for these workers will be deleted

There is also additional logging for failures that will contain a separate file per test failure and contain the error itself (stacktrace) and all capture logs from stdout, stderr during the test run. With --save-log-on-success it will be a separate file per test on pass as well

All this new functionality works with the new xdit scheduler (--test-py-init=True)

Fixes SCYLLADB-713

Closes scylladb/scylladb#28705
2026-03-03 11:49:01 +01:00
Dimitrios Symonidis
80b74d7df2 tablet options: Add max_tablet_count tablet option to enforce tablet count upper bounds
Introduced a new max_tablet_count tablet option that caps the maximum number of tablets a table can have. This feature is designed primarily for backup and restore workflows.
During backup, when load balancing is disabled for snapshot consistency, the current tablet count is recorded in the backup manifest.
During restore, max_tablet_count is set to this recorded value, ensuring the restored table's tablet count never exceeds the original snapshot's tablet distribution.
This guarantee enables efficient file-based SSTable streaming during restore, as each SSTable remains fully contained within a single tablet boundary.

Closes scylladb/scylladb#28450
2026-03-03 11:19:24 +03:00
Marcin Maliszkiewicz
d95939d69a test: add more logs to test_startup_no_auth_response
When test fails with assert connections_observed
we would like to know if it was unable to connect
or execute query in attempt_good_connection
2026-03-02 14:53:46 +01:00
Marcin Maliszkiewicz
91126eb2fb test: decrease strain in test_startup_response
For 2025.3 and 2025.4 this test runs order of magnitude
slower in debug mode. Potentially due to passwords::check
running in alien thread and overwhelming the CPU (this is
fixed in newer versions).

Decreasing the number of connections in test makes it fast
again, without breaking reproducibility.

As additional measure we double the timeout.

The fix is now cherry-picked to master as sometimes
test fails there too.

(cherry picked from commit 1f1fc2c2ac)
2026-03-02 14:46:51 +01:00
Karol Nowacki
647172d4b8 vector_search: fix names of private members
According to coding style in Scylla,
member variables are prefixed with underscore.
2026-03-02 14:08:16 +01:00
Karol Nowacki
f2308b000f vector_search: remove unused global variable 2026-03-02 14:08:07 +01:00
Taras Veretilnyk
5bbc44ed12 sstables: replace rewrite_statistics with new rewrite component mechanism
This commits migrates all callers that used rewrite_statistics to new
rewrite component mechanism.
2026-02-26 22:38:55 +01:00
Taras Veretilnyk
51c345aaf6 sstables: add new rewrite component mechanism for safe sstable component rewriting
Previously, rewriting an sstable component (e.g., via rewrite_statistics) created a temporary file that was renamed
to the final name after sealing. This allows crash recovery by simply removing the temporary file on startup.

However, this approach won't work once component digests are stored in scylla_metadata,
as replacing a component like Statistics will require atomically updating both the component
and scylla_metadata with the new digest—impossible with POSIX rename.

The new mechanism creates a clone sstable with a fresh generation:
- Hard-links all components from the source except the component being rewritten and scylla metadata if update_sstable_id is true
- Copies original sstable components pointer and recognized components from the source
- Invokes a modifier callback to adjust the new sstable before rewriting
- Writes the modified component. If update_sstable_id is true, reads scylla metadata, generates new sstable_id and rewrites it.
- Seals the new sstable with a temporary TOC
- Replaces the old sstable atomically, the same way as it is done in compaction

This is built on the rewrite_sstables compaction framework to support batch operations (e.g., following incremental repair).
In case of any failure during the whole process, sstable will be automatically deleted on the node startup due to
temporary toc persistence.

This prepares the infrastructure for component digests. Once digests are introduced in scylla_metadata
this mechanism will be extended to also rewrite scylla metadata with the updated digest alongside the modified component, ensuring atomic updates of both.
2026-02-26 22:38:55 +01:00
Taras Veretilnyk
4aa0a3acf9 compaction: add compaction_group_view method to specify sstable version
Add make_sstable() overload that accepts sstable_version_types parameter
to compaction_group_view interface and all implementations.
This will be useful in rewrite component mechanism, as we
need to preserve sstable version when creating the new one for the replacement.
2026-02-26 22:38:55 +01:00
Taras Veretilnyk
16ea7a8c1c sstables: add null_data_sink and serialized_checksum for checksum-only calculation
Introduce a null_data_sink and make_digest_calculator implementation that discards
all writes, enabling checksum calculation without file I/O. This allows the
existing checksummed_file_writer to be used for digest computation
without writing data to disk.

This will be used in a future commit to calculate the scylla metadata
component checksum before writing it to disk, allowing the component
to store its own checksum.
2026-02-26 22:38:51 +01:00
Ferenc Szili
f22d75a57e load_stats: add filtering for tablet sizes
This patch adds filtering for tablet sizes collected in load_stats.
This is needed to improve the chances that the balancer will have all
the tablet sizes for the node, and that way avoid having the node
ignored during balancing.
2026-02-26 15:17:39 +01:00
Ferenc Szili
d0a5a1d5d0 load_stats: move tablet filtering for table size computation
This patch moves the table size tablet filtering code from a lambda in
storage_service::load_stats_for_tablet_based_tables() to the code
section where it will be used:
tablet_storage_group_manager::table_load_stats()

This is needed to better accomodate the next commit which will add
code for filtering tablets for tablet sizes.
2026-02-26 11:07:53 +01:00
Ferenc Szili
9cd2a04e79 load_stats: bring the comment and code in sync
Table sizes are collected in load stats, and are filtered according to
the migration stage, so as to avoid double accounting of tablet sizes.
The comment which explains the cut-off migration stage (cleanup) and
the cut-off stage in the code (streaming) are not the same.

This patch fixes that.
2026-02-26 11:03:33 +01:00
Amnon Heiman
5db971c2f9 estimated_histogram_test.cc: add to_metrics_histogram test
Add a test that exercises to_metrics_histogram when Min is smaller than
Precision.

The test verifies duplicate integer bounds are collapsed,
counts remain cumulative, and native histogram metadata is still present
with the expected schema and min id.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-02-26 09:00:52 +02:00
Amnon Heiman
0b4f28ae21 histogram_metrics_helper.hh: Support Min < Precision
to_metrics_histogram now collapses duplicate integer bucket bounds
caused by Min less than Precision scaling while always keeping native
histogram metadata.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-02-26 09:00:38 +02:00
Amnon Heiman
af6371c11f estimated_histogram_test.cc: Add tests for approx_exponential_histogram with Min<Precision
Add three test cases to verify the hybrid linear/exponential bucketing:
- test_histogram_min_1_bucket_limits: Validates bucket lower limits
- test_histogram_min_1_basic: Tests value insertion and bucket distribution
- test_histogram_min_1_statistics: Tests min(), max(), quantile(), and mean()
- test_histogram_min_2_precision_4: Test min == 2 and precision 4.

These tests cover the new Min<Precision mode with Precision=4, verifying both the
linear range and exponential range.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-02-26 08:53:06 +02:00
Amnon Heiman
6c21e5f80c estimated_histogram.hh: support Min less than Precision histograms
approx_exponential_histogram is a pseudo exponential histogram implementation
that can insert and retrieve values into and from buckets in O 1 time.
The implementation uses power of two ranges and splits them linearly into
buckets. The number of buckets per power of two range is called Precision.

The original implementation aimed at covering large value ranges had a
limitation. The histogram Min value had to be greater than or equal to
Precision. As a result code that needs histograms for small integer values
could not use this implementation efficiently.

This change addresses that gap by handling the case where Min is less than
Precision. For Min smaller than Precision the value is scaled by a power of
two factor during indexing so the existing exponential math can be reused
without runtime branching. Bucket limits are scaled back to the original
units which can lead to repeated bucket limits in the Min to Precision
range for integer values.

Example with Min 2 and Precision 4
Buckets 2 2 3 3 4 5 6 7 8 10 12 14 and so on

Implementation details
Introduce SHIFT based on log2 Precision minus log2 Min when positive
Scale Min and Max by SHIFT for all exponential calculations
Compute NUM_BUCKETS using the standard log2 Max over Min formula
Use scaled value in find_bucket_index to avoid fractional bucket steps
Return bucket limits by scaling back to original units
Constraint relaxed from Min greater or equal to Precision to allow any Min
less than Max still power of two

This change maintains backward compatibility with existing histograms
while enabling efficient tracking of small integer values.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-02-26 00:46:14 +02:00
Amnon Heiman
aca5284b13 test(alternator): add per-table latency coverage for item and batch ops
Add missing tests for per-table Alternator latency metrics to ensure recent
per-table latency accounting is actually validated.

Changes in this patch:

Refactor latency assertion helper into check_sets_latency_by_metric(),
parameterized by metric name.
Keep existing behavior by implementing check_sets_latency() as a wrapper
over scylla_alternator_op_latency.
Add test_item_latency_per_table() to verify
scylla_alternator_table_op_latency_count increases for:
PutItem, GetItem, DeleteItem, UpdateItem, BatchWriteItem,
and BatchGetItem.
This closes a test gap where only global latency metrics were checked, while
per-table latency metrics were not covered.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-02-25 20:51:18 +02:00
Amnon Heiman
29e0b4e08c alternator: track per-table latency for batch get/write operations
Batch operations were updating only global latency histograms, which left
table-level latency metrics incomplete.

This change computes request duration once at the end of each operation and
reuses it to update both global and per-table latency stats:

Latencies are stored per table used,

This aligns batch read/write metric behavior with other operations and improves
per-table observability.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-02-25 20:51:18 +02:00
Wojciech Mitros
d1ff8f1db3 docs: add strong consistency doc
Add a new docs/dev document for the strongly consistent tables feature.
For now, it only contains information about the Raft metadata persistence,
but it should be updated as more of the strong-consistency components are
added.
2026-02-25 12:34:58 +01:00
Wojciech Mitros
97dc88d6b6 test/cluster: add tests for strongly-consistent tables' metadata persistence
In this patch we add various tests for checking how strongly consistent
tables work while allowing their tablets to reside on non-0 shards and
while using the new persistent storage for their raft metadata.
The tests verify that:
- strongly consistent tables' tablets can be allocated on different shards
and we can write/read from them
- the raft metadata is persistent across restarts even with disruptions
- the sharder correctly routes metadata queries to specified shards
- we can correctly perform multi-shard reads from the metadata tables
- we can read using just the group_id (without shard) using ALLOW FILTERING

For the tests we add logging to the sharder and partitioner and we add
some extra logs for observability.
2026-02-25 12:34:58 +01:00
Wojciech Mitros
f841c0522d raft: enable multi-shard raft groups for strongly consistent tablets
In this patch we allow strongly consistent tables to have tablets on
shards different than 0.
For that, we remove the checks for shard 0 for the non-group0 raft
groups, and  we allow the tablet allocator to place tablets of
strongly consistent tables on shards different than 0.
We also start using the new storage (raft::persistence) for strongly
consistent tables, added in the preceding commits.
2026-02-25 12:34:58 +01:00
Wojciech Mitros
ffe32e8e4d test/raft: add unit tests for raft_groups_storage
Most functions of the new storage for raft groups for strongly
consistent tables are the same as for the system raft table
storage, so we reuse the tests for them to test the new storage.

We add additional tests for checking the new raft groups partitioner
and sharder, and for verifying that writes using storages for different
shards do not affect the data read on different shards.

We also add a test for checking the snapshot_descriptor present after
the storage bootstrap - for both system and strongly consistent storages
we check that the storage contains the initial descriptor.
2026-02-25 12:34:58 +01:00
Wojciech Mitros
16977d7aa0 raft: add raft_groups_storage persistence class
Add raft_groups_storage, a raft::persistence implementation for
strongly consistent tablet groups.

Currently, it's almost an exact copy of the raft_sys_table_storage that
uses the new raft tables for strongly consistent tables (raft_groups,
raft_groups_snapshots, raft_groups_snapshot_config) which have
a (shard, group_id) partition key.

In the future, the mutation, term and commit_idx data will be stored
differently for for strongly consistent tables than for group0, which
will differentiate this class from the original raft_sys_table_storage.

The storage is created for each raft group server and it takes a shard
parameter at construction time to ensure all queries target the correct
partition (and thus shard).
2026-02-25 12:34:58 +01:00
Wojciech Mitros
654fe4b1ca db: add system tables for strongly consistent tables' raft groups
Add three new system tables for storing raft state for strongly
consistent tablets, corresponding to the tables for group0:

- system.raft_groups: Stores the raft log, term/vote, snapshot_id,
  and commit_idx for each tablet's raft group.

- system.raft_groups_snapshots: Stores snapshot descriptors
  (index, term) for each group.

- system.raft_groups_snapshot_config: Stores the raft configuration
  (current and previous voters) for each snapshot.

These tables use a (shard, group_id) composite partition key with
the newly added raft_groups_partitioner and raft_groups_sharder, ensuring
data is co-located with the tablet replica that owns the raft group.

The tables are only created when the STRONGLY_CONSISTENT_TABLES experimental
feature is enabled.
2026-02-25 12:34:58 +01:00
Wojciech Mitros
cb0caea8bf dht: add fixed_shard_partitioner and fixed_shard_sharder
Add a custom partitioner and sharder that will be used for Raft tables
for strongly consistent tables. These tables will have partition keys
of the form (shard, group_id) and the partitioner creates tokens that
encode the target shard in the high 16 bits.

Token layout:
  [shard: 16 bits][partition key hash: 48 bits]

This encoding guarantees that raft group data will be located on the
same shard as the tablet replica corresponding to that raft group as long
we use the tablet replica's shard as the value in the partition key.

Storing the shard directly in the partition key avoids additional lookups
for request routing to the incoming new raft tables.
For even more simplicity, we avoid biasing between uint64_t and int64_t
by limiting the acceptable shard ids up to 32767 (leaving the top bit 0),
which results in the same value of the token when interpreting either as
uint64_t or int64_t.

The sharder decodes the shard by extracting the high bits, which is
shard-count independent. This allows the partition key:shard mapping
to remain the same even during smp changes (only increases are allowed,
the same limitation as for tablets).
2026-02-25 12:34:51 +01:00
Andrzej Jackowski
e2c4b0a733 config: add write_consistency_levels_* guardrails configuration
Add guardrails configuration that can be used later in this
patch series.

Refs: SCYLLADB-259
2026-02-25 10:30:03 +01:00
Wojciech Mitros
c1b3fec11a raft: add group_id -> shard mapping to raft_group_registry
To handle RPC from other nodes, we need to be able to redirect the
requests for each raft group to the shard that owns it. We need to
be able to do the redirection on all shards, so to achieve that, on
all shards we need to store the information about which shard is
occupied by each Raft group server.
For that we add a group_id -> shard mapping to the raft_group_registry.
The mapping is filled out when starting raft servers, it's emptied
when we abort raft servers. We use it when registering RPC verb handlers,
so that regardless of the shard handling the RPC, the work on the raft
group can be performed on the corresponding shard.
2026-02-23 15:34:56 +01:00
Karol Nowacki
ca7f9a8baf vector_search: fix TLS server name with IP
SNI works only with DNS hostnames. Adding an IP address causes warnings
on the server side.
This change adds SNI only if it is not an IP address.

This change has no unit tests, as this behavior is not critical,
since it causes a warning on the server side.
The critical part, that the server name is verified, is already covered.

Fixes: VECTOR-528
2026-02-19 13:00:03 +01:00
Karol Nowacki
6205aad601 vector_search: add warn log for failed ann requests
In order to simplify troubleshooting connection problems, this patch
adds an extra warn log that prints the error for the vector search
request whenever it fails.
2026-02-19 13:00:03 +01:00
Ernest Zaslavsky
afac984632 s3_client: reorganize tests in part_size_calculation_test
just group all BOOST_REQUIRE_EXCEPTION tests in one block and
remove artificial scopes
2026-02-18 12:12:04 +02:00
Ernest Zaslavsky
1a20877afe s3_client: switch using s3 limits constants in tests
instead of using magic numbers, switch using s3 limit constants to
make it clearer what and why is tested
2026-02-18 12:12:04 +02:00
Ernest Zaslavsky
d763bdabc2 s3_client: fix the s3::range max object size
in s3::Range class start using s3 global constant for two reasons:
1) uniformity, no need to introduce semantically same constant in each class
2) the value was wrong
2026-02-18 12:12:04 +02:00
Ernest Zaslavsky
24e70b30c8 s3_client: remove "aws" prefix from object limits constants
remove "aws" prefix from object limits constants since it is
irrelevant and unnecessary when sitting under s3 namespace
2026-02-18 12:12:04 +02:00
Ernest Zaslavsky
329c156600 s3_client: make s3 object limits accessible
make s3 limits constants publicly accessible to reuse it later
2026-02-18 12:12:04 +02:00
Taras Veretilnyk
f140ab0332 sstables: extract default write open flags into a constant
Extract the commonly used `open_flags::wo | open_flags::create |
open_flags::exclusive` into a reusable constant
`sstable_write_open_flags` to reduce duplication.
2026-02-13 14:27:01 +01:00
Taras Veretilnyk
c8281b7b8b sstables: Add write_simple_with_digest for component checksumming
Introduce new methods to write SSTable components while calculating
and returning their CRC32 checksums. This adds:

- make_digests_component_file_writer(): creates a crc32_digest_file_writer
  for component writing with checksum tracking
- write_simple_with_digest() and do_write_simple_with_digest(): write
  components and return the full checksum value
2026-02-13 14:27:01 +01:00
Taras Veretilnyk
1bf934c77c sstables: Extract file writer closing logic into separate methods
Refactor the consume_end_of_stream() method by extracting the inline
file writer closing logic into dedicated methods:
- close_index_writer()
- close_partitions_writer()
- close_rows_writer()
2026-02-13 14:27:01 +01:00
Taras Veretilnyk
dec5e48666 sstables: Implement CRC32 digest-only writer
Introduce template parameter to checksummed file writer to support
digest-only calculation without storing chunk checksums.
This will be needed for future to calculate digest of other components.
2026-02-13 14:27:00 +01:00
Wojciech Mitros
c5a44b0f88 schema: add with_sharder overload accepting static_sharder reference
Add a schema_builder::with_sharder() overload that accepts a const
reference to dht::static_sharder. This allows schemas to use custom
sharder instances instead of only static sharder configurations.

This is needed to support tables that use custom partitioning and
sharding strategies, such as the incoming raft metadata tables for
strongly consistent tables.
2026-02-10 10:52:00 +01:00
Pavel Emelyanov
83e64b516a hint: Don't switch group in database::apply_hint()
The method is called from storage_proxy::mutate_hint() which is in turn
called from hint_mutation::apply_locally(). The latter is either called
from directly by hint sender, which already runs in streaming group, or
via RPC HINT_MUTATION handler which uses index 1 that negotiates streaming
group as well.

To be sure, add a debugging check for current group being the expected one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-09 08:54:51 +03:00
Pavel Emelyanov
727f1be11c hint_sender: Switch to sender group on stop either
Currently sender only switches group for hints sending on start. It's
worth doing the same on stop too for consistency. There's nothing to
compete with at this point.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-09 08:54:51 +03:00
2762 changed files with 25904 additions and 17858 deletions

View File

@@ -55,22 +55,26 @@ ninja build/<mode>/test/boost/<test_name>
ninja build/<mode>/scylla
# Run all tests in a file
./test.py --mode=<mode> <test_path>
./test.py --mode=<mode> test/<suite>/<test_name>.py
# Run a single test case from a file
./test.py --mode=<mode> <test_path>::<test_function_name>
./test.py --mode=<mode> test/<suite>/<test_name>.py::<test_function_name>
# Run all tests in a directory
./test.py --mode=<mode> test/<suite>/
# Examples
./test.py --mode=dev alternator/
./test.py --mode=dev cluster/test_raft_voters::test_raft_limited_voters_retain_coordinator
./test.py --mode=dev test/alternator/
./test.py --mode=dev test/cluster/test_raft_voters.py::test_raft_limited_voters_retain_coordinator
./test.py --mode=dev test/cqlpy/test_json.py
# Optional flags
./test.py --mode=dev cluster/test_raft_no_quorum -v # Verbose output
./test.py --mode=dev cluster/test_raft_no_quorum --repeat 5 # Repeat test 5 times
./test.py --mode=dev test/cluster/test_raft_no_quorum.py -v # Verbose output
./test.py --mode=dev test/cluster/test_raft_no_quorum.py --repeat 5 # Repeat test 5 times
```
**Important:**
- Use path without `.py` extension (e.g., `cluster/test_raft_no_quorum`, not `cluster/test_raft_no_quorum.py`)
- Use full path with `.py` extension (e.g., `test/cluster/test_raft_no_quorum.py`, not `cluster/test_raft_no_quorum`)
- To run a single test case, append `::<test_function_name>` to the file path
- Add `-v` for verbose output
- Add `--repeat <num>` to repeat a test multiple times

View File

@@ -1,6 +1,6 @@
version: 2
updates:
- package-ecosystem: "pip"
- package-ecosystem: "uv"
directory: "/docs"
schedule:
interval: "daily"

View File

@@ -8,6 +8,9 @@ on:
jobs:
check-fixes-prefix:
runs-on: ubuntu-latest
permissions:
contents: read
issues: write
steps:
- name: Check PR body for "Fixes" prefix patterns
uses: actions/github-script@v7

View File

@@ -1,8 +1,8 @@
name: Sync Jira Based on PR Events
name: Sync Jira Based on PR Events
on:
pull_request_target:
types: [opened, ready_for_review, review_requested, labeled, unlabeled, closed]
types: [opened, edited, ready_for_review, review_requested, labeled, unlabeled, closed]
permissions:
contents: read
@@ -10,32 +10,9 @@ permissions:
issues: write
jobs:
jira-sync-pr-opened:
if: github.event.action == 'opened'
uses: scylladb/github-automation/.github/workflows/main_jira_sync_pr_opened.yml@main
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}
jira-sync-in-review:
if: github.event.action == 'ready_for_review' || github.event.action == 'review_requested'
uses: scylladb/github-automation/.github/workflows/main_jira_sync_in_review.yml@main
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}
jira-sync-add-label:
if: github.event.action == 'labeled'
uses: scylladb/github-automation/.github/workflows/main_jira_sync_add_label.yml@main
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}
jira-status-remove-label:
if: github.event.action == 'unlabeled'
uses: scylladb/github-automation/.github/workflows/main_jira_sync_remove_label.yml@main
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}
jira-status-pr-closed:
if: github.event.action == 'closed'
uses: scylladb/github-automation/.github/workflows/main_jira_sync_pr_closed.yml@main
jira-sync:
uses: scylladb/github-automation/.github/workflows/main_pr_events_jira_sync.yml@main
with:
caller_action: ${{ github.event.action }}
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

View File

@@ -1,14 +1,14 @@
name: Call Jira release creation for new milestone
name: Call Jira release creation for new milestone
on:
milestone:
types: [created]
types: [created, closed]
jobs:
sync-milestone-to-jira:
uses: scylladb/github-automation/.github/workflows/main_sync_milestone_to_jira_release.yml@main
with:
# Comma-separated list of Jira project keys
jira_project_keys: "SCYLLADB,CUSTOMER,SMI"
jira_project_keys: "SCYLLADB,CUSTOMER,SMI,RELENG,VECTOR"
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

View File

@@ -19,6 +19,8 @@ on:
jobs:
release:
permissions:
pages: write
id-token: write
contents: write
runs-on: ubuntu-latest
steps:
@@ -31,7 +33,9 @@ jobs:
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.10"
python-version: "3.12"
- name: Install uv
uses: astral-sh/setup-uv@v6
- name: Set up env
run: make -C docs FLAG="${{ env.FLAG }}" setupenv
- name: Build docs

View File

@@ -29,7 +29,9 @@ jobs:
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.10"
python-version: "3.12"
- name: Install uv
uses: astral-sh/setup-uv@v6
- name: Set up env
run: make -C docs FLAG="${{ env.FLAG }}" setupenv
- name: Build docs

View File

@@ -1,4 +1,6 @@
name: Trigger Scylla CI Route
permissions:
contents: read
on:
issue_comment:
@@ -14,14 +16,20 @@ jobs:
steps:
- name: Verify Org Membership
id: verify_author
env:
EVENT_NAME: ${{ github.event_name }}
PR_AUTHOR: ${{ github.event.pull_request.user.login }}
PR_ASSOCIATION: ${{ github.event.pull_request.author_association }}
COMMENT_AUTHOR: ${{ github.event.comment.user.login }}
COMMENT_ASSOCIATION: ${{ github.event.comment.author_association }}
shell: bash
run: |
if [[ "${{ github.event_name }}" == "pull_request_target" ]]; then
AUTHOR="${{ github.event.pull_request.user.login }}"
ASSOCIATION="${{ github.event.pull_request.author_association }}"
if [[ "$EVENT_NAME" == "pull_request_target" ]]; then
AUTHOR="$PR_AUTHOR"
ASSOCIATION="$PR_ASSOCIATION"
else
AUTHOR="${{ github.event.comment.user.login }}"
ASSOCIATION="${{ github.event.comment.author_association }}"
AUTHOR="$COMMENT_AUTHOR"
ASSOCIATION="$COMMENT_ASSOCIATION"
fi
if [[ "$ASSOCIATION" == "MEMBER" || "$ASSOCIATION" == "OWNER" ]]; then
echo "member=true" >> $GITHUB_OUTPUT
@@ -33,13 +41,11 @@ jobs:
- name: Validate Comment Trigger
if: github.event_name == 'issue_comment'
id: verify_comment
env:
COMMENT_BODY: ${{ github.event.comment.body }}
shell: bash
run: |
BODY=$(cat << 'EOF'
${{ github.event.comment.body }}
EOF
)
CLEAN_BODY=$(echo "$BODY" | grep -v '^[[:space:]]*>')
CLEAN_BODY=$(echo "$COMMENT_BODY" | grep -v '^[[:space:]]*>')
if echo "$CLEAN_BODY" | grep -qi '@scylladbbot' && echo "$CLEAN_BODY" | grep -qi 'trigger-ci'; then
echo "trigger=true" >> $GITHUB_OUTPUT

View File

@@ -1,5 +1,8 @@
name: Trigger next gating
permissions:
contents: read
on:
push:
branches:

View File

@@ -13,7 +13,8 @@
#include <string_view>
#include "alternator/auth.hh"
#include <fmt/format.h>
#include "auth/password_authenticator.hh"
#include "db/consistency_level_type.hh"
#include "db/system_keyspace.hh"
#include "service/storage_proxy.hh"
#include "alternator/executor.hh"
#include "cql3/selection/selection.hh"
@@ -25,8 +26,8 @@ namespace alternator {
static logging::logger alogger("alternator-auth");
future<std::string> get_key_from_roles(service::storage_proxy& proxy, auth::service& as, std::string username) {
schema_ptr schema = proxy.data_dictionary().find_schema(auth::get_auth_ks_name(as.query_processor()), "roles");
future<std::string> get_key_from_roles(service::storage_proxy& proxy, std::string username) {
schema_ptr schema = proxy.data_dictionary().find_schema(db::system_keyspace::NAME, "roles");
partition_key pk = partition_key::from_single_value(*schema, utf8_type->decompose(username));
dht::partition_range_vector partition_ranges{dht::partition_range(dht::decorate_key(*schema, pk))};
std::vector<query::clustering_range> bounds{query::clustering_range::make_open_ended_both_sides()};
@@ -39,7 +40,7 @@ future<std::string> get_key_from_roles(service::storage_proxy& proxy, auth::serv
auto partition_slice = query::partition_slice(std::move(bounds), {}, query::column_id_vector{salted_hash_col->id, can_login_col->id}, selection->get_query_options());
auto command = ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice,
proxy.get_max_result_size(partition_slice), query::tombstone_limit(proxy.get_tombstone_limit()));
auto cl = auth::password_authenticator::consistency_for_user(username);
auto cl = db::consistency_level::LOCAL_ONE;
service::client_state client_state{service::client_state::internal_tag()};
service::storage_proxy::coordinator_query_result qr = co_await proxy.query(schema, std::move(command), std::move(partition_ranges), cl,

View File

@@ -20,6 +20,6 @@ namespace alternator {
using key_cache = utils::loading_cache<std::string, std::string, 1>;
future<std::string> get_key_from_roles(service::storage_proxy& proxy, auth::service& as, std::string username);
future<std::string> get_key_from_roles(service::storage_proxy& proxy, std::string username);
}

View File

@@ -3463,7 +3463,11 @@ future<executor::request_return_type> executor::batch_write_item(client_state& c
if (should_add_wcu) {
rjson::add(ret, "ConsumedCapacity", std::move(consumed_capacity));
}
_stats.api_operations.batch_write_item_latency.mark(std::chrono::steady_clock::now() - start_time);
auto duration = std::chrono::steady_clock::now() - start_time;
_stats.api_operations.batch_write_item_latency.mark(duration);
for (const auto& w : per_table_wcu) {
w.first->api_operations.batch_write_item_latency.mark(duration);
}
co_return rjson::print(std::move(ret));
}
@@ -4974,7 +4978,12 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
if (!some_succeeded && eptr) {
co_await coroutine::return_exception_ptr(std::move(eptr));
}
_stats.api_operations.batch_get_item_latency.mark(std::chrono::steady_clock::now() - start_time);
auto duration = std::chrono::steady_clock::now() - start_time;
_stats.api_operations.batch_get_item_latency.mark(duration);
for (const table_requests& rs : requests) {
lw_shared_ptr<stats> per_table_stats = get_stats_from_schema(_proxy, *rs.schema);
per_table_stats->api_operations.batch_get_item_latency.mark(duration);
}
if (is_big(response)) {
co_return make_streamed(std::move(response));
} else {

View File

@@ -411,8 +411,8 @@ future<std::string> server::verify_signature(const request& req, const chunked_c
}
}
auto cache_getter = [&proxy = _proxy, &as = _auth_service] (std::string username) {
return get_key_from_roles(proxy, as, std::move(username));
auto cache_getter = [&proxy = _proxy] (std::string username) {
return get_key_from_roles(proxy, std::move(username));
};
return _key_cache.get_ptr(user, cache_getter).then_wrapped([this, &req, &content,
user = std::move(user),
@@ -771,7 +771,7 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr
if (!username.empty()) {
client_state.set_login(auth::authenticated_user(username));
}
co_await client_state.maybe_update_per_service_level_params();
client_state.maybe_update_per_service_level_params();
tracing::trace_state_ptr trace_state = maybe_trace_query(client_state, username, op, content, _max_users_query_size_in_trace_output.get());
tracing::trace(trace_state, "{}", op);

View File

@@ -14,20 +14,6 @@
namespace alternator {
const char* ALTERNATOR_METRICS = "alternator";
static seastar::metrics::histogram estimated_histogram_to_metrics(const utils::estimated_histogram& histogram) {
seastar::metrics::histogram res;
res.buckets.resize(histogram.bucket_offsets.size());
uint64_t cumulative_count = 0;
res.sample_count = histogram._count;
res.sample_sum = histogram._sample_sum;
for (size_t i = 0; i < res.buckets.size(); i++) {
auto& v = res.buckets[i];
v.upper_bound = histogram.bucket_offsets[i];
cumulative_count += histogram.buckets[i];
v.count = cumulative_count;
}
return res;
}
static seastar::metrics::label column_family_label("cf");
static seastar::metrics::label keyspace_label("ks");
@@ -151,21 +137,21 @@ static void register_metrics_with_optional_table(seastar::metrics::metric_groups
seastar::metrics::make_counter("batch_item_count", seastar::metrics::description("The total number of items processed across all batches"), labels,
stats.api_operations.batch_get_item_batch_total)(op("BatchGetItem")).aggregate(aggregate_labels).set_skip_when_empty(),
seastar::metrics::make_histogram("batch_item_count_histogram", seastar::metrics::description("Histogram of the number of items in a batch request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.api_operations.batch_get_item_histogram);})(op("BatchGetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
[&stats]{ return to_metrics_histogram(stats.api_operations.batch_get_item_histogram);})(op("BatchGetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("batch_item_count_histogram", seastar::metrics::description("Histogram of the number of items in a batch request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.api_operations.batch_write_item_histogram);})(op("BatchWriteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
[&stats]{ return to_metrics_histogram(stats.api_operations.batch_write_item_histogram);})(op("BatchWriteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.get_item_op_size_kb);})(op("GetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
[&stats]{ return to_metrics_histogram(stats.operation_sizes.get_item_op_size_kb);})(op("GetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.put_item_op_size_kb);})(op("PutItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
[&stats]{ return to_metrics_histogram(stats.operation_sizes.put_item_op_size_kb);})(op("PutItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.delete_item_op_size_kb);})(op("DeleteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
[&stats]{ return to_metrics_histogram(stats.operation_sizes.delete_item_op_size_kb);})(op("DeleteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.update_item_op_size_kb);})(op("UpdateItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
[&stats]{ return to_metrics_histogram(stats.operation_sizes.update_item_op_size_kb);})(op("UpdateItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.batch_get_item_op_size_kb);})(op("BatchGetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
[&stats]{ return to_metrics_histogram(stats.operation_sizes.batch_get_item_op_size_kb);})(op("BatchGetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.batch_write_item_op_size_kb);})(op("BatchWriteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
[&stats]{ return to_metrics_histogram(stats.operation_sizes.batch_write_item_op_size_kb);})(op("BatchWriteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
});
seastar::metrics::label expression_label("expression");

View File

@@ -16,6 +16,8 @@
#include "cql3/stats.hh"
namespace alternator {
using batch_histogram = utils::estimated_histogram_with_max<128>;
using op_size_histogram = utils::estimated_histogram_with_max<512>;
// Object holding per-shard statistics related to Alternator.
// While this object is alive, these metrics are also registered to be
@@ -76,34 +78,34 @@ public:
utils::timed_rate_moving_average_summary_and_histogram batch_get_item_latency;
utils::timed_rate_moving_average_summary_and_histogram get_records_latency;
utils::estimated_histogram batch_get_item_histogram{22}; // a histogram that covers the range 1 - 100
utils::estimated_histogram batch_write_item_histogram{22}; // a histogram that covers the range 1 - 100
batch_histogram batch_get_item_histogram;
batch_histogram batch_write_item_histogram;
} api_operations;
// Operation size metrics
struct {
// Item size statistics collected per table and aggregated per node.
// Each histogram covers the range 0 - 446. Resolves #25143.
// Each histogram covers the range 0 - 512. Resolves #25143.
// A size is the retrieved item's size.
utils::estimated_histogram get_item_op_size_kb{30};
op_size_histogram get_item_op_size_kb;
// A size is the maximum of the new item's size and the old item's size.
utils::estimated_histogram put_item_op_size_kb{30};
op_size_histogram put_item_op_size_kb;
// A size is the deleted item's size. If the deleted item's size is
// unknown (i.e. read-before-write wasn't necessary and it wasn't
// forced by a configuration option), it won't be recorded on the
// histogram.
utils::estimated_histogram delete_item_op_size_kb{30};
op_size_histogram delete_item_op_size_kb;
// A size is the maximum of existing item's size and the estimated size
// of the update. This will be changed to the maximum of the existing item's
// size and the new item's size in a subsequent PR.
utils::estimated_histogram update_item_op_size_kb{30};
op_size_histogram update_item_op_size_kb;
// A size is the sum of the sizes of all items per table. This means
// that a single BatchGetItem / BatchWriteItem updates the histogram
// for each table that it has items in.
// The sizes are the retrieved items' sizes grouped per table.
utils::estimated_histogram batch_get_item_op_size_kb{30};
op_size_histogram batch_get_item_op_size_kb;
// The sizes are the the written items' sizes grouped per table.
utils::estimated_histogram batch_write_item_op_size_kb{30};
op_size_histogram batch_write_item_op_size_kb;
} operation_sizes;
// Count of authentication and authorization failures, counted if either
// alternator_enforce_authorization or alternator_warn_authorization are
@@ -140,7 +142,7 @@ public:
cql3::cql_stats cql_stats;
// Enumeration of expression types only for stats
// if needed it can be extended e.g. per operation
// if needed it can be extended e.g. per operation
enum expression_types {
UPDATE_EXPRESSION,
CONDITION_EXPRESSION,
@@ -164,7 +166,7 @@ struct table_stats {
void register_metrics(seastar::metrics::metric_groups& metrics, const stats& stats);
inline uint64_t bytes_to_kb_ceil(uint64_t bytes) {
return (bytes + 1023) / 1024;
return (bytes) / 1024;
}
}

View File

@@ -33,6 +33,8 @@
#include "data_dictionary/data_dictionary.hh"
#include "utils/rjson.hh"
static logging::logger elogger("alternator-streams");
/**
* Base template type to implement rapidjson::internal::TypeHelper<...>:s
* for types that are ostreamable/string constructible/castable.
@@ -428,6 +430,25 @@ using namespace std::chrono_literals;
// Dynamo docs says no data shall live longer than 24h.
static constexpr auto dynamodb_streams_max_window = 24h;
// find the parent shard in previous generation for the given child shard
// takes care of wrap-around case in vnodes
// prev_streams must be sorted by token
const cdc::stream_id& find_parent_shard_in_previous_generation(db_clock::time_point prev_timestamp, const utils::chunked_vector<cdc::stream_id> &prev_streams, const cdc::stream_id &child) {
if (prev_streams.empty()) {
// something is really wrong - streams are empty
// let's try internal_error in hope it will be notified and fixed
on_internal_error(elogger, fmt::format("streams are empty for cdc generation at {} ({})", prev_timestamp, prev_timestamp.time_since_epoch().count()));
}
auto it = std::lower_bound(prev_streams.begin(), prev_streams.end(), child.token(), [](const cdc::stream_id& id, const dht::token& t) {
return id.token() < t;
});
if (it == prev_streams.end()) {
// wrap around case - take first
it = prev_streams.begin();
}
return *it;
}
future<executor::request_return_type> executor::describe_stream(client_state& client_state, service_permit permit, rjson::value request) {
_stats.api_operations.describe_stream++;
@@ -578,16 +599,8 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
auto shard = rjson::empty_object();
if (prev != e) {
auto& pids = prev->second.streams;
auto pid = std::upper_bound(pids.begin(), pids.end(), id.token(), [](const dht::token& t, const cdc::stream_id& id) {
return t < id.token();
});
if (pid != pids.begin()) {
pid = std::prev(pid);
}
if (pid != pids.end()) {
rjson::add(shard, "ParentShardId", shard_id(prev->first, *pid));
}
auto &pid = find_parent_shard_in_previous_generation(prev->first, prev->second.streams, id);
rjson::add(shard, "ParentShardId", shard_id(prev->first, pid));
}
last.emplace(ts, id);

View File

@@ -243,7 +243,7 @@
"GOSSIP_DIGEST_SYN",
"GOSSIP_DIGEST_ACK2",
"GOSSIP_SHUTDOWN",
"DEFINITIONS_UPDATE",
"UNUSED__DEFINITIONS_UPDATE",
"TRUNCATE",
"UNUSED__REPLICATION_FINISHED",
"MIGRATION_REQUEST",

View File

@@ -1295,6 +1295,45 @@
}
]
},
{
"path":"/storage_service/logstor_compaction",
"operations":[
{
"method":"POST",
"summary":"Trigger compaction of the key-value storage",
"type":"void",
"nickname":"logstor_compaction",
"produces":[
"application/json"
],
"parameters":[
{
"name":"major",
"description":"When true, perform a major compaction",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/logstor_flush",
"operations":[
{
"method":"POST",
"summary":"Trigger flush of logstor storage",
"type":"void",
"nickname":"logstor_flush",
"produces":[
"application/json"
],
"parameters":[]
}
]
},
{
"path":"/storage_service/active_repair/",
"operations":[
@@ -3229,6 +3268,38 @@
}
]
},
{
"path":"/storage_service/logstor_info",
"operations":[
{
"method":"GET",
"summary":"Logstor segment information for one table",
"type":"table_logstor_info",
"nickname":"logstor_info",
"produces":[
"application/json"
],
"parameters":[
{
"name":"keyspace",
"description":"The keyspace",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"table",
"description":"table name",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/retrain_dict",
"operations":[
@@ -3637,6 +3708,47 @@
}
}
},
"logstor_hist_bucket":{
"id":"logstor_hist_bucket",
"properties":{
"bucket":{
"type":"long"
},
"count":{
"type":"long"
},
"min_data_size":{
"type":"long"
},
"max_data_size":{
"type":"long"
}
}
},
"table_logstor_info":{
"id":"table_logstor_info",
"description":"Per-table logstor segment distribution",
"properties":{
"keyspace":{
"type":"string"
},
"table":{
"type":"string"
},
"compaction_groups":{
"type":"long"
},
"segments":{
"type":"long"
},
"data_size_histogram":{
"type":"array",
"items":{
"$ref":"logstor_hist_bucket"
}
}
}
},
"tablet_repair_result":{
"id":"tablet_repair_result",
"description":"Tablet repair result",

View File

@@ -209,6 +209,21 @@
"parameters":[]
}
]
},
{
"path":"/system/chosen_sstable_version",
"operations":[
{
"method":"GET",
"summary":"Get sstable version currently chosen for use in new sstables",
"type":"string",
"nickname":"get_chosen_sstable_version",
"produces":[
"application/json"
],
"parameters":[]
}
]
}
]
}

View File

@@ -122,9 +122,9 @@ future<> unset_thrift_controller(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_thrift_controller(ctx, r); });
}
future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, service::raft_group0_client& group0_client) {
return ctx.http_server.set_routes([&ctx, &ss, &group0_client] (routes& r) {
set_storage_service(ctx, r, ss, group0_client);
future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>& ssc, service::raft_group0_client& group0_client) {
return ctx.http_server.set_routes([&ctx, &ss, &ssc, &group0_client] (routes& r) {
set_storage_service(ctx, r, ss, ssc, group0_client);
});
}

View File

@@ -98,7 +98,7 @@ future<> set_server_config(http_context& ctx, db::config& cfg);
future<> unset_server_config(http_context& ctx);
future<> set_server_snitch(http_context& ctx, sharded<locator::snitch_ptr>& snitch);
future<> unset_server_snitch(http_context& ctx);
future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, service::raft_group0_client&);
future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>&, service::raft_group0_client&);
future<> unset_server_storage_service(http_context& ctx);
future<> set_server_client_routes(http_context& ctx, sharded<service::client_routes_service>& cr);
future<> unset_server_client_routes(http_context& ctx);

View File

@@ -18,7 +18,9 @@
#include "utils/assert.hh"
#include "utils/estimated_histogram.hh"
#include <algorithm>
#include <sstream>
#include "db/data_listeners.hh"
#include "utils/hash.hh"
#include "storage_service.hh"
#include "compaction/compaction_manager.hh"
#include "unimplemented.hh"
@@ -342,6 +344,56 @@ uint64_t accumulate_on_active_memtables(replica::table& t, noncopyable_function<
return ret;
}
static
future<json::json_return_type>
rest_toppartitions_generic(sharded<replica::database>& db, std::unique_ptr<http::request> req) {
bool filters_provided = false;
std::unordered_set<std::tuple<sstring, sstring>, utils::tuple_hash> table_filters {};
if (auto filters = req->get_query_param("table_filters"); !filters.empty()) {
filters_provided = true;
std::stringstream ss { filters };
std::string filter;
while (!filters.empty() && ss.good()) {
std::getline(ss, filter, ',');
table_filters.emplace(parse_fully_qualified_cf_name(filter));
}
}
std::unordered_set<sstring> keyspace_filters {};
if (auto filters = req->get_query_param("keyspace_filters"); !filters.empty()) {
filters_provided = true;
std::stringstream ss { filters };
std::string filter;
while (!filters.empty() && ss.good()) {
std::getline(ss, filter, ',');
keyspace_filters.emplace(std::move(filter));
}
}
// when the query is empty return immediately
if (filters_provided && table_filters.empty() && keyspace_filters.empty()) {
apilog.debug("toppartitions query: processing results");
cf::toppartitions_query_results results;
results.read_cardinality = 0;
results.write_cardinality = 0;
return make_ready_future<json::json_return_type>(results);
}
api::req_param<std::chrono::milliseconds, unsigned> duration{*req, "duration", 1000ms};
api::req_param<unsigned> capacity(*req, "capacity", 256);
api::req_param<unsigned> list_size(*req, "list_size", 10);
apilog.info("toppartitions query: #table_filters={} #keyspace_filters={} duration={} list_size={} capacity={}",
!table_filters.empty() ? std::to_string(table_filters.size()) : "all", !keyspace_filters.empty() ? std::to_string(keyspace_filters.size()) : "all", duration.value, list_size.value, capacity.value);
return seastar::do_with(db::toppartitions_query(db, std::move(table_filters), std::move(keyspace_filters), duration.value, list_size, capacity), [] (db::toppartitions_query& q) {
return run_toppartitions_query(q);
});
}
void set_column_family(http_context& ctx, routes& r, sharded<replica::database>& db) {
cf::get_column_family_name.set(r, [&db] (const_req req){
std::vector<sstring> res;
@@ -1047,6 +1099,10 @@ void set_column_family(http_context& ctx, routes& r, sharded<replica::database>&
});
});
ss::toppartitions_generic.set(r, [&db] (std::unique_ptr<http::request> req) {
return rest_toppartitions_generic(db, std::move(req));
});
cf::force_major_compaction.set(r, [&ctx, &db](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
if (!req->get_query_param("split_output").empty()) {
fail(unimplemented::cause::API);
@@ -1213,6 +1269,7 @@ void unset_column_family(http_context& ctx, routes& r) {
cf::get_sstable_count_per_level.unset(r);
cf::get_sstables_for_key.unset(r);
cf::toppartitions.unset(r);
ss::toppartitions_generic.unset(r);
cf::force_major_compaction.unset(r);
ss::get_load.unset(r);
ss::get_metrics_load.unset(r);

View File

@@ -17,9 +17,7 @@
#include "gms/feature_service.hh"
#include "schema/schema_builder.hh"
#include "sstables/sstables_manager.hh"
#include "utils/hash.hh"
#include <optional>
#include <sstream>
#include <stdexcept>
#include <time.h>
#include <algorithm>
@@ -612,56 +610,6 @@ rest_get_token_endpoint(http_context& ctx, sharded<service::storage_service>& ss
co_return json::json_return_type(stream_range_as_array(token_endpoints, &map_to_json<dht::token, gms::inet_address>));
}
static
future<json::json_return_type>
rest_toppartitions_generic(http_context& ctx, std::unique_ptr<http::request> req) {
bool filters_provided = false;
std::unordered_set<std::tuple<sstring, sstring>, utils::tuple_hash> table_filters {};
if (auto filters = req->get_query_param("table_filters"); !filters.empty()) {
filters_provided = true;
std::stringstream ss { filters };
std::string filter;
while (!filters.empty() && ss.good()) {
std::getline(ss, filter, ',');
table_filters.emplace(parse_fully_qualified_cf_name(filter));
}
}
std::unordered_set<sstring> keyspace_filters {};
if (auto filters = req->get_query_param("keyspace_filters"); !filters.empty()) {
filters_provided = true;
std::stringstream ss { filters };
std::string filter;
while (!filters.empty() && ss.good()) {
std::getline(ss, filter, ',');
keyspace_filters.emplace(std::move(filter));
}
}
// when the query is empty return immediately
if (filters_provided && table_filters.empty() && keyspace_filters.empty()) {
apilog.debug("toppartitions query: processing results");
httpd::column_family_json::toppartitions_query_results results;
results.read_cardinality = 0;
results.write_cardinality = 0;
return make_ready_future<json::json_return_type>(results);
}
api::req_param<std::chrono::milliseconds, unsigned> duration{*req, "duration", 1000ms};
api::req_param<unsigned> capacity(*req, "capacity", 256);
api::req_param<unsigned> list_size(*req, "list_size", 10);
apilog.info("toppartitions query: #table_filters={} #keyspace_filters={} duration={} list_size={} capacity={}",
!table_filters.empty() ? std::to_string(table_filters.size()) : "all", !keyspace_filters.empty() ? std::to_string(keyspace_filters.size()) : "all", duration.value, list_size.value, capacity.value);
return seastar::do_with(db::toppartitions_query(ctx.db, std::move(table_filters), std::move(keyspace_filters), duration.value, list_size, capacity), [] (db::toppartitions_query& q) {
return run_toppartitions_query(q);
});
}
static
json::json_return_type
rest_get_release_version(sharded<service::storage_service>& ss, const_req& req) {
@@ -835,9 +783,31 @@ rest_force_keyspace_flush(http_context& ctx, std::unique_ptr<http::request> req)
static
future<json::json_return_type>
rest_decommission(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
rest_logstor_compaction(http_context& ctx, std::unique_ptr<http::request> req) {
bool major = false;
if (auto major_param = req->get_query_param("major"); !major_param.empty()) {
major = validate_bool(major_param);
}
apilog.info("logstor_compaction: major={}", major);
auto& db = ctx.db;
co_await replica::database::trigger_logstor_compaction_on_all_shards(db, major);
co_return json_void();
}
static
future<json::json_return_type>
rest_logstor_flush(http_context& ctx, std::unique_ptr<http::request> req) {
apilog.info("logstor_flush");
auto& db = ctx.db;
co_await replica::database::flush_logstor_separator_on_all_shards(db);
co_return json_void();
}
static
future<json::json_return_type>
rest_decommission(sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>& ssc, std::unique_ptr<http::request> req) {
apilog.info("decommission");
return ss.local().decommission().then([] {
return ss.local().decommission(ssc).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
}
@@ -1553,6 +1523,54 @@ rest_sstable_info(http_context& ctx, std::unique_ptr<http::request> req) {
});
}
static
future<json::json_return_type>
rest_logstor_info(http_context& ctx, std::unique_ptr<http::request> req) {
auto keyspace = api::req_param<sstring>(*req, "keyspace", {}).value;
auto table = api::req_param<sstring>(*req, "table", {}).value;
if (table.empty()) {
table = api::req_param<sstring>(*req, "cf", {}).value;
}
if (keyspace.empty()) {
throw bad_param_exception("The query parameter 'keyspace' is required");
}
if (table.empty()) {
throw bad_param_exception("The query parameter 'table' is required");
}
keyspace = validate_keyspace(ctx, keyspace);
auto tid = validate_table(ctx.db.local(), keyspace, table);
auto& cf = ctx.db.local().find_column_family(tid);
if (!cf.uses_logstor()) {
throw bad_param_exception(fmt::format("Table {}.{} does not use logstor", keyspace, table));
}
return do_with(replica::logstor::table_segment_stats{}, [keyspace = std::move(keyspace), table = std::move(table), tid, &ctx] (replica::logstor::table_segment_stats& merged_stats) {
return ctx.db.map_reduce([&merged_stats](replica::logstor::table_segment_stats&& shard_stats) {
merged_stats += shard_stats;
}, [tid](const replica::database& db) {
return db.get_logstor_table_segment_stats(tid);
}).then([&merged_stats, keyspace = std::move(keyspace), table = std::move(table)] {
ss::table_logstor_info result;
result.keyspace = keyspace;
result.table = table;
result.compaction_groups = merged_stats.compaction_group_count;
result.segments = merged_stats.segment_count;
for (const auto& bucket : merged_stats.histogram) {
ss::logstor_hist_bucket hist;
hist.count = bucket.count;
hist.max_data_size = bucket.max_data_size;
result.data_size_histogram.push(std::move(hist));
}
return make_ready_future<json::json_return_type>(stream_object(result));
});
});
}
static
future<json::json_return_type>
rest_reload_raft_topology_state(sharded<service::storage_service>& ss, service::raft_group0_client& group0_client, std::unique_ptr<http::request> req) {
@@ -1572,10 +1590,7 @@ rest_upgrade_to_raft_topology(sharded<service::storage_service>& ss, std::unique
static
future<json::json_return_type>
rest_raft_topology_upgrade_status(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
const auto ustate = co_await ss.invoke_on(0, [] (auto& ss) {
return ss.get_topology_upgrade_state();
});
co_return sstring(format("{}", ustate));
co_return sstring("done");
}
static
@@ -1785,9 +1800,8 @@ rest_bind(FuncType func, BindArgs&... args) {
return std::bind_front(func, std::ref(args)...);
}
void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_service>& ss, service::raft_group0_client& group0_client) {
void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>& ssc, service::raft_group0_client& group0_client) {
ss::get_token_endpoint.set(r, rest_bind(rest_get_token_endpoint, ctx, ss));
ss::toppartitions_generic.set(r, rest_bind(rest_toppartitions_generic, ctx));
ss::get_release_version.set(r, rest_bind(rest_get_release_version, ss));
ss::get_scylla_release_version.set(r, rest_bind(rest_get_scylla_release_version, ss));
ss::get_schema_version.set(r, rest_bind(rest_get_schema_version, ss));
@@ -1802,7 +1816,9 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::reset_cleanup_needed.set(r, rest_bind(rest_reset_cleanup_needed, ctx, ss));
ss::force_flush.set(r, rest_bind(rest_force_flush, ctx));
ss::force_keyspace_flush.set(r, rest_bind(rest_force_keyspace_flush, ctx));
ss::decommission.set(r, rest_bind(rest_decommission, ss));
ss::decommission.set(r, rest_bind(rest_decommission, ss, ssc));
ss::logstor_compaction.set(r, rest_bind(rest_logstor_compaction, ctx));
ss::logstor_flush.set(r, rest_bind(rest_logstor_flush, ctx));
ss::move.set(r, rest_bind(rest_move, ss));
ss::remove_node.set(r, rest_bind(rest_remove_node, ss));
ss::exclude_node.set(r, rest_bind(rest_exclude_node, ss));
@@ -1851,6 +1867,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::retrain_dict.set(r, rest_bind(rest_retrain_dict, ctx, ss, group0_client));
ss::estimate_compression_ratios.set(r, rest_bind(rest_estimate_compression_ratios, ctx, ss));
ss::sstable_info.set(r, rest_bind(rest_sstable_info, ctx));
ss::logstor_info.set(r, rest_bind(rest_logstor_info, ctx));
ss::reload_raft_topology_state.set(r, rest_bind(rest_reload_raft_topology_state, ss, group0_client));
ss::upgrade_to_raft_topology.set(r, rest_bind(rest_upgrade_to_raft_topology, ss));
ss::raft_topology_upgrade_status.set(r, rest_bind(rest_raft_topology_upgrade_status, ss));
@@ -1867,7 +1884,6 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
void unset_storage_service(http_context& ctx, routes& r) {
ss::get_token_endpoint.unset(r);
ss::toppartitions_generic.unset(r);
ss::get_release_version.unset(r);
ss::get_scylla_release_version.unset(r);
ss::get_schema_version.unset(r);
@@ -1881,6 +1897,8 @@ void unset_storage_service(http_context& ctx, routes& r) {
ss::reset_cleanup_needed.unset(r);
ss::force_flush.unset(r);
ss::force_keyspace_flush.unset(r);
ss::logstor_compaction.unset(r);
ss::logstor_flush.unset(r);
ss::decommission.unset(r);
ss::move.unset(r);
ss::remove_node.unset(r);
@@ -1928,6 +1946,7 @@ void unset_storage_service(http_context& ctx, routes& r) {
ss::get_ownership.unset(r);
ss::get_effective_ownership.unset(r);
ss::sstable_info.unset(r);
ss::logstor_info.unset(r);
ss::reload_raft_topology_state.unset(r);
ss::upgrade_to_raft_topology.unset(r);
ss::raft_topology_upgrade_status.unset(r);
@@ -2144,6 +2163,7 @@ void unset_snapshot(http_context& ctx, routes& r) {
ss::start_backup.unset(r);
cf::get_true_snapshots_size.unset(r);
cf::get_all_true_snapshots_size.unset(r);
ss::decommission.unset(r);
}
}

View File

@@ -66,7 +66,7 @@ struct scrub_info {
scrub_info parse_scrub_options(const http_context& ctx, std::unique_ptr<http::request> req);
void set_storage_service(http_context& ctx, httpd::routes& r, sharded<service::storage_service>& ss, service::raft_group0_client&);
void set_storage_service(http_context& ctx, httpd::routes& r, sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>&, service::raft_group0_client&);
void unset_storage_service(http_context& ctx, httpd::routes& r);
void set_sstables_loader(http_context& ctx, httpd::routes& r, sharded<sstables_loader>& sst_loader);
void unset_sstables_loader(http_context& ctx, httpd::routes& r);

View File

@@ -190,6 +190,13 @@ void set_system(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(seastar::to_sstring(format));
});
});
hs::get_chosen_sstable_version.set(r, [&ctx] (std::unique_ptr<request> req) {
return smp::submit_to(0, [&ctx] {
auto format = ctx.db.local().get_user_sstables_manager().get_preferred_sstable_version();
return make_ready_future<json::json_return_type>(seastar::to_sstring(format));
});
});
}
}

View File

@@ -19,12 +19,12 @@ target_sources(scylla_auth
permission.cc
resource.cc
role_or_anonymous.cc
roles-metadata.cc
sasl_challenge.cc
saslauthd_authenticator.cc
service.cc
standard_role_manager.cc
transitional.cc
maintenance_socket_authenticator.cc
maintenance_socket_role_manager.cc)
target_include_directories(scylla_auth
PUBLIC
@@ -48,4 +48,4 @@ if (Scylla_USE_PRECOMPILED_HEADER_USE)
target_precompile_headers(scylla_auth REUSE_FROM scylla-precompiled-header)
endif()
check_headers(check-headers scylla_auth
GLOB_RECURSE ${CMAKE_CURRENT_SOURCE_DIR}/*.hh)
GLOB_RECURSE ${CMAKE_CURRENT_SOURCE_DIR}/*.hh)

View File

@@ -9,19 +9,9 @@
#include "auth/allow_all_authenticator.hh"
#include "service/migration_manager.hh"
#include "utils/class_registrator.hh"
namespace auth {
constexpr std::string_view allow_all_authenticator_name("org.apache.cassandra.auth.AllowAllAuthenticator");
// To ensure correct initialization order, we unfortunately need to use a string literal.
static const class_registrator<
authenticator,
allow_all_authenticator,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&,
cache&> registration("org.apache.cassandra.auth.AllowAllAuthenticator");
}

View File

@@ -9,18 +9,9 @@
#include "auth/allow_all_authorizer.hh"
#include "auth/common.hh"
#include "utils/class_registrator.hh"
namespace auth {
constexpr std::string_view allow_all_authorizer_name("org.apache.cassandra.auth.AllowAllAuthorizer");
// To ensure correct initialization order, we unfortunately need to use a string literal.
static const class_registrator<
authorizer,
allow_all_authorizer,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&> registration("org.apache.cassandra.auth.AllowAllAuthorizer");
}

View File

@@ -26,7 +26,7 @@ extern const std::string_view allow_all_authorizer_name;
class allow_all_authorizer final : public authorizer {
public:
allow_all_authorizer(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&) {
allow_all_authorizer(cql3::query_processor&) {
}
virtual future<> start() override {

View File

@@ -47,7 +47,7 @@ void cache::set_permission_loader(permission_loader_func loader) {
_permission_loader = std::move(loader);
}
lw_shared_ptr<const cache::role_record> cache::get(const role_name_t& role) const noexcept {
lw_shared_ptr<const cache::role_record> cache::get(std::string_view role) const noexcept {
auto it = _roles.find(role);
if (it == _roles.end()) {
return {};
@@ -55,6 +55,16 @@ lw_shared_ptr<const cache::role_record> cache::get(const role_name_t& role) cons
return it->second;
}
void cache::for_each_role(const std::function<void(const role_name_t&, const role_record&)>& func) const {
for (const auto& [name, record] : _roles) {
func(name, *record);
}
}
size_t cache::roles_count() const noexcept {
return _roles.size();
}
future<permission_set> cache::get_permissions(const role_or_anonymous& role, const resource& r) {
std::unordered_map<resource, permission_set>* perms_cache;
lw_shared_ptr<role_record> role_ptr;
@@ -209,9 +219,6 @@ future<> cache::prune_all() noexcept {
}
future<> cache::load_all() {
if (legacy_mode(_qp)) {
co_return;
}
SCYLLA_ASSERT(this_shard_id() == 0);
auto units = co_await get_units(_loading_sem, 1, _as);
@@ -263,9 +270,6 @@ future<> cache::gather_inheriting_roles(std::unordered_set<role_name_t>& roles,
}
future<> cache::load_roles(std::unordered_set<role_name_t> roles) {
if (legacy_mode(_qp)) {
co_return;
}
SCYLLA_ASSERT(this_shard_id() == 0);
auto units = co_await get_units(_loading_sem, 1, _as);

View File

@@ -9,6 +9,7 @@
#pragma once
#include <seastar/core/abort_source.hh>
#include <string_view>
#include <unordered_set>
#include <unordered_map>
@@ -19,7 +20,7 @@
#include <seastar/core/semaphore.hh>
#include <seastar/core/metrics_registration.hh>
#include <absl/container/flat_hash_map.h>
#include "absl-flat_hash_map.hh"
#include "auth/permission.hh"
#include "auth/common.hh"
@@ -42,8 +43,8 @@ public:
std::unordered_set<role_name_t> member_of;
std::unordered_set<role_name_t> members;
sstring salted_hash;
std::unordered_map<sstring, sstring> attributes;
std::unordered_map<sstring, permission_set> permissions;
std::unordered_map<sstring, sstring, sstring_hash, sstring_eq> attributes;
std::unordered_map<sstring, permission_set, sstring_hash, sstring_eq> permissions;
private:
friend cache;
// cached permissions include effects of role's inheritance
@@ -52,7 +53,7 @@ public:
};
explicit cache(cql3::query_processor& qp, abort_source& as) noexcept;
lw_shared_ptr<const role_record> get(const role_name_t& role) const noexcept;
lw_shared_ptr<const role_record> get(std::string_view role) const noexcept;
void set_permission_loader(permission_loader_func loader);
future<permission_set> get_permissions(const role_or_anonymous& role, const resource& r);
future<> prune(const resource& r);
@@ -61,8 +62,15 @@ public:
future<> load_roles(std::unordered_set<role_name_t> roles);
static bool includes_table(const table_id&) noexcept;
// Returns the number of roles in the cache.
size_t roles_count() const noexcept;
// The callback doesn't suspend (no co_await) so it observes the state
// of the cache atomically.
void for_each_role(const std::function<void(const role_name_t&, const role_record&)>& func) const;
private:
using roles_map = absl::flat_hash_map<role_name_t, lw_shared_ptr<role_record>>;
using roles_map = absl::flat_hash_map<role_name_t, lw_shared_ptr<role_record>, sstring_hash, sstring_eq>;
roles_map _roles;
// anonymous permissions map exists mainly due to compatibility with
// higher layers which use role_or_anonymous to get permissions.

View File

@@ -13,14 +13,11 @@
#include <boost/regex.hpp>
#include <fmt/ranges.h>
#include "utils/class_registrator.hh"
#include "utils/to_string.hh"
#include "data_dictionary/data_dictionary.hh"
#include "cql3/query_processor.hh"
#include "db/config.hh"
static const auto CERT_AUTH_NAME = "com.scylladb.auth.CertificateAuthenticator";
const std::string_view auth::certificate_authenticator_name(CERT_AUTH_NAME);
static logging::logger clogger("certificate_authenticator");
@@ -30,13 +27,6 @@ static const std::string cfg_query_attr = "query";
static const std::string cfg_source_subject = "SUBJECT";
static const std::string cfg_source_altname = "ALTNAME";
static const class_registrator<auth::authenticator
, auth::certificate_authenticator
, cql3::query_processor&
, ::service::raft_group0_client&
, ::service::migration_manager&
, auth::cache&> cert_auth_reg(CERT_AUTH_NAME);
enum class auth::certificate_authenticator::query_source {
subject, altname
};
@@ -99,7 +89,7 @@ future<> auth::certificate_authenticator::stop() {
}
std::string_view auth::certificate_authenticator::qualified_java_name() const {
return certificate_authenticator_name;
return "com.scylladb.auth.CertificateAuthenticator";
}
bool auth::certificate_authenticator::require_authentication() const {

View File

@@ -27,8 +27,6 @@ namespace auth {
class cache;
extern const std::string_view certificate_authenticator_name;
class certificate_authenticator : public authenticator {
enum class query_source;
std::vector<std::pair<query_source, boost::regex>> _queries;

View File

@@ -14,18 +14,11 @@
#include <seastar/core/sharded.hh>
#include "mutation/canonical_mutation.hh"
#include "schema/schema_fwd.hh"
#include "mutation/timestamp.hh"
#include "utils/assert.hh"
#include "utils/exponential_backoff_retry.hh"
#include "cql3/query_processor.hh"
#include "cql3/statements/create_table_statement.hh"
#include "schema/schema_builder.hh"
#include "service/migration_manager.hh"
#include "service/raft/group0_state_machine.hh"
#include "timeout_config.hh"
#include "utils/error_injection.hh"
#include "db/system_keyspace.hh"
namespace auth {
@@ -33,22 +26,14 @@ namespace meta {
namespace legacy {
constinit const std::string_view AUTH_KS("system_auth");
constinit const std::string_view USERS_CF("users");
} // namespace legacy
constinit const std::string_view AUTH_PACKAGE_NAME("org.apache.cassandra.auth.");
} // namespace meta
static logging::logger auth_log("auth");
bool legacy_mode(cql3::query_processor& qp) {
return qp.auth_version < db::auth_version_t::v2;
}
std::string_view get_auth_ks_name(cql3::query_processor& qp) {
if (legacy_mode(qp)) {
return meta::legacy::AUTH_KS;
}
return db::system_keyspace::NAME;
std::string default_superuser(cql3::query_processor& qp) {
return qp.db().get_config().auth_superuser_name();
}
// Func must support being invoked more than once.
@@ -65,47 +50,6 @@ future<> do_after_system_ready(seastar::abort_source& as, seastar::noncopyable_f
}).discard_result();
}
static future<> create_legacy_metadata_table_if_missing_impl(
std::string_view table_name,
cql3::query_processor& qp,
std::string_view cql,
::service::migration_manager& mm) {
SCYLLA_ASSERT(this_shard_id() == 0); // once_among_shards makes sure a function is executed on shard 0 only
auto db = qp.db();
auto parsed_statement = cql3::query_processor::parse_statement(cql, cql3::dialect{});
auto& parsed_cf_statement = static_cast<cql3::statements::raw::cf_statement&>(*parsed_statement);
parsed_cf_statement.prepare_keyspace(meta::legacy::AUTH_KS);
auto statement = static_pointer_cast<cql3::statements::create_table_statement>(
parsed_cf_statement.prepare(db, qp.get_cql_stats())->statement);
const auto schema = statement->get_cf_meta_data(qp.db());
const auto uuid = generate_legacy_id(schema->ks_name(), schema->cf_name());
schema_builder b(schema);
b.set_uuid(uuid);
schema_ptr table = b.build();
if (!db.has_schema(table->ks_name(), table->cf_name())) {
auto group0_guard = co_await mm.start_group0_operation();
auto ts = group0_guard.write_timestamp();
try {
co_return co_await mm.announce(co_await ::service::prepare_new_column_family_announcement(qp.proxy(), table, ts),
std::move(group0_guard), format("auth: create {} metadata table", table->cf_name()));
} catch (const exceptions::already_exists_exception&) {}
}
}
future<> create_legacy_metadata_table_if_missing(
std::string_view table_name,
cql3::query_processor& qp,
std::string_view cql,
::service::migration_manager& mm) noexcept {
return futurize_invoke(create_legacy_metadata_table_if_missing_impl, table_name, qp, cql, mm);
}
::service::query_state& internal_distributed_query_state() noexcept {
#ifdef DEBUG
// Give the much slower debug tests more headroom for completing auth queries.
@@ -140,56 +84,6 @@ static future<> announce_mutations_with_guard(
return group0_client.add_entry(std::move(group0_cmd), std::move(group0_guard), as, timeout);
}
future<> announce_mutations_with_batching(
::service::raft_group0_client& group0_client,
start_operation_func_t start_operation_func,
std::function<::service::mutations_generator(api::timestamp_type t)> gen,
seastar::abort_source& as,
std::optional<::service::raft_timeout> timeout) {
// account for command's overhead, it's better to use smaller threshold than constantly bounce off the limit
size_t memory_threshold = group0_client.max_command_size() * 0.75;
utils::get_local_injector().inject("auth_announce_mutations_command_max_size",
[&memory_threshold] {
memory_threshold = 1000;
});
size_t memory_usage = 0;
utils::chunked_vector<canonical_mutation> muts;
// guard has to be taken before we execute code in gen as
// it can do read-before-write and we want announce_mutations
// operation to be linearizable with other such calls,
// for instance if we do select and then delete in gen
// we want both to operate on the same data or fail
// if someone else modified it in the middle
std::optional<::service::group0_guard> group0_guard;
group0_guard = co_await start_operation_func(as);
auto timestamp = group0_guard->write_timestamp();
auto g = gen(timestamp);
while (auto mut = co_await g()) {
muts.push_back(canonical_mutation{*mut});
memory_usage += muts.back().representation().size();
if (memory_usage >= memory_threshold) {
if (!group0_guard) {
group0_guard = co_await start_operation_func(as);
timestamp = group0_guard->write_timestamp();
}
co_await announce_mutations_with_guard(group0_client, std::move(muts), std::move(*group0_guard), as, timeout);
group0_guard = std::nullopt;
memory_usage = 0;
muts = {};
}
}
if (!muts.empty()) {
if (!group0_guard) {
group0_guard = co_await start_operation_func(as);
timestamp = group0_guard->write_timestamp();
}
co_await announce_mutations_with_guard(group0_client, std::move(muts), std::move(*group0_guard), as, timeout);
}
}
future<> announce_mutations(
cql3::query_processor& qp,
::service::raft_group0_client& group0_client,

View File

@@ -21,12 +21,7 @@
using namespace std::chrono_literals;
namespace replica {
class database;
}
namespace service {
class migration_manager;
class query_state;
}
@@ -40,10 +35,8 @@ namespace meta {
namespace legacy {
extern constinit const std::string_view AUTH_KS;
extern constinit const std::string_view USERS_CF;
} // namespace legacy
constexpr std::string_view DEFAULT_SUPERUSER_NAME("cassandra");
extern constinit const std::string_view AUTH_PACKAGE_NAME;
} // namespace meta
@@ -52,12 +45,7 @@ constexpr std::string_view PERMISSIONS_CF = "role_permissions";
constexpr std::string_view ROLE_MEMBERS_CF = "role_members";
constexpr std::string_view ROLE_ATTRIBUTES_CF = "role_attributes";
// This is a helper to check whether auth-v2 is on.
bool legacy_mode(cql3::query_processor& qp);
// We have legacy implementation using different keyspace
// and need to parametrize depending on runtime feature.
std::string_view get_auth_ks_name(cql3::query_processor& qp);
std::string default_superuser(cql3::query_processor& qp);
template <class Task>
future<> once_among_shards(Task&& f) {
@@ -71,12 +59,6 @@ future<> once_among_shards(Task&& f) {
// Func must support being invoked more than once.
future<> do_after_system_ready(seastar::abort_source& as, seastar::noncopyable_function<future<>()> func);
future<> create_legacy_metadata_table_if_missing(
std::string_view table_name,
cql3::query_processor&,
std::string_view cql,
::service::migration_manager&) noexcept;
///
/// Time-outs for internal, non-local CQL queries.
///
@@ -84,20 +66,6 @@ future<> create_legacy_metadata_table_if_missing(
::service::raft_timeout get_raft_timeout() noexcept;
// Execute update query via group0 mechanism, mutations will be applied on all nodes.
// Use this function when need to perform read before write on a single guard or if
// you have more than one mutation and potentially exceed single command size limit.
using start_operation_func_t = std::function<future<::service::group0_guard>(abort_source&)>;
future<> announce_mutations_with_batching(
::service::raft_group0_client& group0_client,
// since we can operate also in topology coordinator context where we need stronger
// guarantees than start_operation from group0_client gives we allow to inject custom
// function here
start_operation_func_t start_operation_func,
std::function<::service::mutations_generator(api::timestamp_type t)> gen,
seastar::abort_source& as,
std::optional<::service::raft_timeout> timeout);
// Execute update query via group0 mechanism, mutations will be applied on all nodes.
future<> announce_mutations(
cql3::query_processor& qp,

View File

@@ -26,7 +26,6 @@ extern "C" {
#include "cql3/untyped_result_set.hh"
#include "exceptions/exceptions.hh"
#include "utils/log.hh"
#include "utils/class_registrator.hh"
namespace auth {
@@ -40,111 +39,14 @@ static constexpr std::string_view PERMISSIONS_NAME = "permissions";
static logging::logger alogger("default_authorizer");
// To ensure correct initialization order, we unfortunately need to use a string literal.
static const class_registrator<
authorizer,
default_authorizer,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&> password_auth_reg("org.apache.cassandra.auth.CassandraAuthorizer");
default_authorizer::default_authorizer(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm)
: _qp(qp)
, _migration_manager(mm) {
default_authorizer::default_authorizer(cql3::query_processor& qp)
: _qp(qp) {
}
default_authorizer::~default_authorizer() {
}
static const sstring legacy_table_name{"permissions"};
bool default_authorizer::legacy_metadata_exists() const {
return _qp.db().has_schema(meta::legacy::AUTH_KS, legacy_table_name);
}
future<bool> default_authorizer::legacy_any_granted() const {
static const sstring query = seastar::format("SELECT * FROM {}.{} LIMIT 1", meta::legacy::AUTH_KS, PERMISSIONS_CF);
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
{},
cql3::query_processor::cache_internal::yes).then([](::shared_ptr<cql3::untyped_result_set> results) {
return !results->empty();
});
}
future<> default_authorizer::migrate_legacy_metadata() {
alogger.info("Starting migration of legacy permissions metadata.");
static const sstring query = seastar::format("SELECT * FROM {}.{}", meta::legacy::AUTH_KS, legacy_table_name);
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
cql3::query_processor::cache_internal::no).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
return do_with(
row.get_as<sstring>("username"),
parse_resource(row.get_as<sstring>(RESOURCE_NAME)),
::service::group0_batch::unused(),
[this, &row](const auto& username, const auto& r, auto& mc) {
const permission_set perms = permissions::from_strings(row.get_set<sstring>(PERMISSIONS_NAME));
return grant(username, perms, r, mc);
});
}).finally([results] {});
}).then([] {
alogger.info("Finished migrating legacy permissions metadata.");
}).handle_exception([](std::exception_ptr ep) {
alogger.error("Encountered an error during migration!");
std::rethrow_exception(ep);
});
}
future<> default_authorizer::start_legacy() {
static const sstring create_table = fmt::format(
"CREATE TABLE {}.{} ("
"{} text,"
"{} text,"
"{} set<text>,"
"PRIMARY KEY({}, {})"
") WITH gc_grace_seconds={}",
meta::legacy::AUTH_KS,
PERMISSIONS_CF,
ROLE_NAME,
RESOURCE_NAME,
PERMISSIONS_NAME,
ROLE_NAME,
RESOURCE_NAME,
90 * 24 * 60 * 60); // 3 months.
return once_among_shards([this] {
return create_legacy_metadata_table_if_missing(
PERMISSIONS_CF,
_qp,
create_table,
_migration_manager).then([this] {
_finished = do_after_system_ready(_as, [this] {
return async([this] {
_migration_manager.wait_for_schema_agreement(_qp.db().real_database(), db::timeout_clock::time_point::max(), &_as).get();
if (legacy_metadata_exists()) {
if (!legacy_any_granted().get()) {
migrate_legacy_metadata().get();
return;
}
alogger.warn("Ignoring legacy permissions metadata since role permissions exist.");
}
});
});
});
});
}
future<> default_authorizer::start() {
if (legacy_mode(_qp)) {
return start_legacy();
}
return make_ready_future<>();
}
@@ -161,7 +63,7 @@ default_authorizer::authorize(const role_or_anonymous& maybe_role, const resourc
const sstring query = seastar::format("SELECT {} FROM {}.{} WHERE {} = ? AND {} = ?",
PERMISSIONS_NAME,
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
PERMISSIONS_CF,
ROLE_NAME,
RESOURCE_NAME);
@@ -185,21 +87,13 @@ default_authorizer::modify(
std::string_view op,
::service::group0_batch& mc) {
const sstring query = seastar::format("UPDATE {}.{} SET {} = {} {} ? WHERE {} = ? AND {} = ?",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
PERMISSIONS_CF,
PERMISSIONS_NAME,
PERMISSIONS_NAME,
op,
ROLE_NAME,
RESOURCE_NAME);
if (legacy_mode(_qp)) {
co_return co_await _qp.execute_internal(
query,
db::consistency_level::ONE,
internal_distributed_query_state(),
{permissions::to_strings(set), sstring(role_name), resource.name()},
cql3::query_processor::cache_internal::no).discard_result();
}
co_await collect_mutations(_qp, mc, query,
{permissions::to_strings(set), sstring(role_name), resource.name()});
}
@@ -218,7 +112,7 @@ future<std::vector<permission_details>> default_authorizer::list_all() const {
ROLE_NAME,
RESOURCE_NAME,
PERMISSIONS_NAME,
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
PERMISSIONS_CF);
const auto results = co_await _qp.execute_internal(
@@ -243,74 +137,16 @@ future<std::vector<permission_details>> default_authorizer::list_all() const {
future<> default_authorizer::revoke_all(std::string_view role_name, ::service::group0_batch& mc) {
try {
const sstring query = seastar::format("DELETE FROM {}.{} WHERE {} = ?",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
PERMISSIONS_CF,
ROLE_NAME);
if (legacy_mode(_qp)) {
co_await _qp.execute_internal(
query,
db::consistency_level::ONE,
internal_distributed_query_state(),
{sstring(role_name)},
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_await collect_mutations(_qp, mc, query, {sstring(role_name)});
}
co_await collect_mutations(_qp, mc, query, {sstring(role_name)});
} catch (const exceptions::request_execution_exception& e) {
alogger.warn("CassandraAuthorizer failed to revoke all permissions of {}: {}", role_name, e);
}
}
future<> default_authorizer::revoke_all_legacy(const resource& resource) {
static const sstring query = seastar::format("SELECT {} FROM {}.{} WHERE {} = ? ALLOW FILTERING",
ROLE_NAME,
get_auth_ks_name(_qp),
PERMISSIONS_CF,
RESOURCE_NAME);
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
{resource.name()},
cql3::query_processor::cache_internal::no).then_wrapped([this, resource](future<::shared_ptr<cql3::untyped_result_set>> f) {
try {
auto res = f.get();
return parallel_for_each(
res->begin(),
res->end(),
[this, res, resource](const cql3::untyped_result_set::row& r) {
static const sstring query = seastar::format("DELETE FROM {}.{} WHERE {} = ? AND {} = ?",
get_auth_ks_name(_qp),
PERMISSIONS_CF,
ROLE_NAME,
RESOURCE_NAME);
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
{r.get_as<sstring>(ROLE_NAME), resource.name()},
cql3::query_processor::cache_internal::no).discard_result().handle_exception(
[resource](auto ep) {
try {
std::rethrow_exception(ep);
} catch (const exceptions::request_execution_exception& e) {
alogger.warn("CassandraAuthorizer failed to revoke all permissions on {}: {}", resource, e);
}
});
});
} catch (const exceptions::request_execution_exception& e) {
alogger.warn("CassandraAuthorizer failed to revoke all permissions on {}: {}", resource, e);
return make_ready_future();
}
});
}
future<> default_authorizer::revoke_all(const resource& resource, ::service::group0_batch& mc) {
if (legacy_mode(_qp)) {
co_return co_await revoke_all_legacy(resource);
}
if (resource.kind() == resource_kind::data &&
data_resource_view(resource).is_keyspace()) {
revoke_all_keyspace_resources(resource, mc);
@@ -321,7 +157,7 @@ future<> default_authorizer::revoke_all(const resource& resource, ::service::gro
auto gen = [this, name] (api::timestamp_type t) -> ::service::mutations_generator {
const sstring query = seastar::format("SELECT {} FROM {}.{} WHERE {} = ? ALLOW FILTERING",
ROLE_NAME,
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
PERMISSIONS_CF,
RESOURCE_NAME);
auto res = co_await _qp.execute_internal(
@@ -331,7 +167,7 @@ future<> default_authorizer::revoke_all(const resource& resource, ::service::gro
cql3::query_processor::cache_internal::no);
for (const auto& r : *res) {
const sstring query = seastar::format("DELETE FROM {}.{} WHERE {} = ? AND {} = ?",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
PERMISSIONS_CF,
ROLE_NAME,
RESOURCE_NAME);
@@ -356,7 +192,7 @@ void default_authorizer::revoke_all_keyspace_resources(const resource& ks_resour
const sstring query = seastar::format("SELECT {}, {} FROM {}.{}",
ROLE_NAME,
RESOURCE_NAME,
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
PERMISSIONS_CF);
auto res = co_await _qp.execute_internal(
query,
@@ -371,7 +207,7 @@ void default_authorizer::revoke_all_keyspace_resources(const resource& ks_resour
continue;
}
const sstring query = seastar::format("DELETE FROM {}.{} WHERE {} = ? AND {} = ?",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
PERMISSIONS_CF,
ROLE_NAME,
RESOURCE_NAME);

View File

@@ -27,14 +27,12 @@ namespace auth {
class default_authorizer : public authorizer {
cql3::query_processor& _qp;
::service::migration_manager& _migration_manager;
abort_source _as{};
future<> _finished{make_ready_future<>()};
public:
default_authorizer(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&);
default_authorizer(cql3::query_processor&);
~default_authorizer();
@@ -59,16 +57,6 @@ public:
virtual const resource_set& protected_resources() const override;
private:
future<> start_legacy();
bool legacy_metadata_exists() const;
future<> revoke_all_legacy(const resource&);
future<bool> legacy_any_granted() const;
future<> migrate_legacy_metadata();
future<> modify(std::string_view, permission_set, const resource&, std::string_view, ::service::group0_batch&);
void revoke_all_keyspace_resources(const resource& ks_resource, ::service::group0_batch& mc);

View File

@@ -24,7 +24,6 @@
#include "exceptions/exceptions.hh"
#include "seastarx.hh"
#include "service/raft/raft_group0_client.hh"
#include "utils/class_registrator.hh"
#include "db/config.hh"
#include "utils/exponential_backoff_retry.hh"
@@ -33,7 +32,7 @@ namespace {
logger mylog{"ldap_role_manager"}; // `log` is taken by math.
struct url_desc_deleter {
void operator()(LDAPURLDesc *p) {
void operator()(LDAPURLDesc* p) {
ldap_free_urldesc(p);
}
};
@@ -41,7 +40,7 @@ struct url_desc_deleter {
using url_desc_ptr = std::unique_ptr<LDAPURLDesc, url_desc_deleter>;
url_desc_ptr parse_url(std::string_view url) {
LDAPURLDesc *desc = nullptr;
LDAPURLDesc* desc = nullptr;
if (ldap_url_parse(url.data(), &desc)) {
mylog.error("error in ldap_url_parse({})", url);
}
@@ -54,8 +53,12 @@ std::vector<sstring> get_attr_values(LDAP* ld, LDAPMessage* res, const char* att
mylog.debug("Analyzing search results");
for (auto e = ldap_first_entry(ld, res); e; e = ldap_next_entry(ld, e)) {
struct deleter {
void operator()(berval** p) { ldap_value_free_len(p); }
void operator()(char* p) { ldap_memfree(p); }
void operator()(berval** p) {
ldap_value_free_len(p);
}
void operator()(char* p) {
ldap_memfree(p);
}
};
const std::unique_ptr<char, deleter> dname(ldap_get_dn(ld, e));
mylog.debug("Analyzing entry {}", dname.get());
@@ -72,50 +75,37 @@ std::vector<sstring> get_attr_values(LDAP* ld, LDAPMessage* res, const char* att
return values;
}
const char* ldap_role_manager_full_name = "com.scylladb.auth.LDAPRoleManager";
} // anonymous namespace
namespace auth {
static const class_registrator<
role_manager,
ldap_role_manager,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&,
cache&> registration(ldap_role_manager_full_name);
ldap_role_manager::ldap_role_manager(
std::string_view query_template, std::string_view target_attr, std::string_view bind_name, std::string_view bind_password,
uint32_t permissions_update_interval_in_ms,
utils::observer<uint32_t> permissions_update_interval_in_ms_observer,
cql3::query_processor& qp, ::service::raft_group0_client& rg0c, ::service::migration_manager& mm, cache& cache)
: _std_mgr(qp, rg0c, mm, cache), _group0_client(rg0c), _query_template(query_template), _target_attr(target_attr), _bind_name(bind_name)
, _bind_password(bind_password)
, _permissions_update_interval_in_ms(permissions_update_interval_in_ms)
, _permissions_update_interval_in_ms_observer(std::move(permissions_update_interval_in_ms_observer))
, _connection_factory(bind(std::mem_fn(&ldap_role_manager::reconnect), std::ref(*this)))
, _cache(cache)
, _cache_pruner(make_ready_future<>()) {
ldap_role_manager::ldap_role_manager(std::string_view query_template, std::string_view target_attr, std::string_view bind_name, std::string_view bind_password,
uint32_t permissions_update_interval_in_ms, utils::observer<uint32_t> permissions_update_interval_in_ms_observer, cql3::query_processor& qp,
::service::raft_group0_client& rg0c, ::service::migration_manager& mm, cache& cache)
: _std_mgr(qp, rg0c, mm, cache)
, _group0_client(rg0c)
, _query_template(query_template)
, _target_attr(target_attr)
, _bind_name(bind_name)
, _bind_password(bind_password)
, _permissions_update_interval_in_ms(permissions_update_interval_in_ms)
, _permissions_update_interval_in_ms_observer(std::move(permissions_update_interval_in_ms_observer))
, _connection_factory(bind(std::mem_fn(&ldap_role_manager::reconnect), std::ref(*this)))
, _cache(cache)
, _cache_pruner(make_ready_future<>()) {
}
ldap_role_manager::ldap_role_manager(cql3::query_processor& qp, ::service::raft_group0_client& rg0c, ::service::migration_manager& mm, cache& cache)
: ldap_role_manager(
qp.db().get_config().ldap_url_template(),
qp.db().get_config().ldap_attr_role(),
qp.db().get_config().ldap_bind_dn(),
qp.db().get_config().ldap_bind_passwd(),
qp.db().get_config().permissions_update_interval_in_ms(),
qp.db().get_config().permissions_update_interval_in_ms.observe([this] (const uint32_t& v) { _permissions_update_interval_in_ms = v; }),
qp,
rg0c,
mm,
cache) {
: ldap_role_manager(qp.db().get_config().ldap_url_template(), qp.db().get_config().ldap_attr_role(), qp.db().get_config().ldap_bind_dn(),
qp.db().get_config().ldap_bind_passwd(), qp.db().get_config().permissions_update_interval_in_ms(),
qp.db().get_config().permissions_update_interval_in_ms.observe([this](const uint32_t& v) {
_permissions_update_interval_in_ms = v;
}),
qp, rg0c, mm, cache) {
}
std::string_view ldap_role_manager::qualified_java_name() const noexcept {
return ldap_role_manager_full_name;
return "com.scylladb.auth.LDAPRoleManager";
}
const resource_set& ldap_role_manager::protected_resources() const {
@@ -124,17 +114,16 @@ const resource_set& ldap_role_manager::protected_resources() const {
future<> ldap_role_manager::start() {
if (!parse_url(get_url("dummy-user"))) { // Just need host and port -- any user should do.
return make_exception_future(
std::runtime_error(fmt::format("error getting LDAP server address from template {}", _query_template)));
return make_exception_future(std::runtime_error(fmt::format("error getting LDAP server address from template {}", _query_template)));
}
_cache_pruner = futurize_invoke([this] () -> future<> {
_cache_pruner = futurize_invoke([this]() -> future<> {
while (true) {
try {
co_await seastar::sleep_abortable(std::chrono::milliseconds(_permissions_update_interval_in_ms), _as);
} catch (const seastar::sleep_aborted&) {
co_return; // ignore
}
co_await _cache.container().invoke_on_all([] (cache& c) -> future<> {
co_await _cache.container().invoke_on_all([](cache& c) -> future<> {
try {
co_await c.reload_all_permissions();
} catch (...) {
@@ -176,7 +165,7 @@ future<conn_ptr> ldap_role_manager::connect() {
future<conn_ptr> ldap_role_manager::reconnect() {
unsigned retries_left = 5;
using namespace std::literals::chrono_literals;
conn_ptr conn = co_await exponential_backoff_retry::do_until_value(1s, 32s, _as, [this, &retries_left] () -> future<std::optional<conn_ptr>> {
conn_ptr conn = co_await exponential_backoff_retry::do_until_value(1s, 32s, _as, [this, &retries_left]() -> future<std::optional<conn_ptr>> {
if (!retries_left) {
co_return conn_ptr{};
}
@@ -199,11 +188,13 @@ future<conn_ptr> ldap_role_manager::reconnect() {
future<> ldap_role_manager::stop() {
_as.request_abort();
return std::move(_cache_pruner).then([this] {
return _std_mgr.stop();
}).then([this] {
return _connection_factory.stop();
});
return std::move(_cache_pruner)
.then([this] {
return _std_mgr.stop();
})
.then([this] {
return _connection_factory.stop();
});
}
future<> ldap_role_manager::create(std::string_view name, const role_config& config, ::service::group0_batch& mc) {
@@ -232,43 +223,42 @@ future<role_set> ldap_role_manager::query_granted(std::string_view grantee_name,
if (!desc) {
return make_exception_future<role_set>(std::runtime_error(format("Error parsing URL {}", url)));
}
return _connection_factory.with_connection([this, desc = std::move(desc), grantee_name_ = sstring(grantee_name)]
(ldap_connection& conn) -> future<role_set> {
sstring grantee_name = std::move(grantee_name_);
ldap_msg_ptr res = co_await conn.search(desc->lud_dn, desc->lud_scope, desc->lud_filter, desc->lud_attrs,
/*attrsonly=*/0, /*serverctrls=*/nullptr, /*clientctrls=*/nullptr,
/*timeout=*/nullptr, /*sizelimit=*/0);
mylog.trace("query_granted: got search results");
const auto mtype = ldap_msgtype(res.get());
if (mtype != LDAP_RES_SEARCH_ENTRY && mtype != LDAP_RES_SEARCH_RESULT && mtype != LDAP_RES_SEARCH_REFERENCE) {
mylog.error("ldap search yielded result {} of type {}", static_cast<const void*>(res.get()), mtype);
co_return coroutine::exception(std::make_exception_ptr(std::runtime_error("ldap_role_manager: search result has wrong type")));
}
std::vector<sstring> values = get_attr_values(conn.get_ldap(), res.get(), _target_attr.c_str());
auth::role_set valid_roles{grantee_name};
// Each value is a role to be granted.
co_await parallel_for_each(values, [this, &valid_roles] (const sstring& ldap_role) {
return _std_mgr.exists(ldap_role).then([&valid_roles, &ldap_role] (bool exists) {
if (exists) {
valid_roles.insert(ldap_role);
} else {
mylog.error("unrecognized role received from LDAP: {}", ldap_role);
return _connection_factory.with_connection(
[this, desc = std::move(desc), grantee_name_ = sstring(grantee_name)](ldap_connection& conn) -> future<role_set> {
sstring grantee_name = std::move(grantee_name_);
ldap_msg_ptr res = co_await conn.search(desc->lud_dn, desc->lud_scope, desc->lud_filter, desc->lud_attrs,
/*attrsonly=*/0, /*serverctrls=*/nullptr, /*clientctrls=*/nullptr,
/*timeout=*/nullptr, /*sizelimit=*/0);
mylog.trace("query_granted: got search results");
const auto mtype = ldap_msgtype(res.get());
if (mtype != LDAP_RES_SEARCH_ENTRY && mtype != LDAP_RES_SEARCH_RESULT && mtype != LDAP_RES_SEARCH_REFERENCE) {
mylog.error("ldap search yielded result {} of type {}", static_cast<const void*>(res.get()), mtype);
co_return coroutine::exception(std::make_exception_ptr(std::runtime_error("ldap_role_manager: search result has wrong type")));
}
});
});
std::vector<sstring> values = get_attr_values(conn.get_ldap(), res.get(), _target_attr.c_str());
auth::role_set valid_roles{grantee_name};
co_return std::move(valid_roles);
});
// Each value is a role to be granted.
co_await parallel_for_each(values, [this, &valid_roles](const sstring& ldap_role) {
return _std_mgr.exists(ldap_role).then([&valid_roles, &ldap_role](bool exists) {
if (exists) {
valid_roles.insert(ldap_role);
} else {
mylog.error("unrecognized role received from LDAP: {}", ldap_role);
}
});
});
co_return std::move(valid_roles);
});
}
future<role_to_directly_granted_map>
ldap_role_manager::query_all_directly_granted(::service::query_state& qs) {
future<role_to_directly_granted_map> ldap_role_manager::query_all_directly_granted(::service::query_state& qs) {
role_to_directly_granted_map result;
auto roles = co_await query_all(qs);
for (auto& role: roles) {
for (auto& role : roles) {
auto granted_set = co_await query_granted(role, recursive_role_query::no);
for (auto& granted: granted_set) {
for (auto& granted : granted_set) {
if (granted != role) {
result.insert({role, granted});
}
@@ -282,7 +272,7 @@ future<role_set> ldap_role_manager::query_all(::service::query_state& qs) {
}
future<> ldap_role_manager::create_role(std::string_view role_name) {
return smp::submit_to(0, [this, role_name] () -> future<> {
return smp::submit_to(0, [this, role_name]() -> future<> {
int retries = 10;
while (true) {
auto guard = co_await _group0_client.start_operation(_as, ::service::raft_timeout{});
@@ -294,8 +284,8 @@ future<> ldap_role_manager::create_role(std::string_view role_name) {
} catch (const role_already_exists&) {
// ok
} catch (const ::service::group0_concurrent_modification& ex) {
mylog.warn("Failed to auto-create role \"{}\" due to guard conflict.{}.",
role_name, retries ? " Retrying" : " Number of retries exceeded, giving up");
mylog.warn("Failed to auto-create role \"{}\" due to guard conflict.{}.", role_name,
retries ? " Retrying" : " Number of retries exceeded, giving up");
if (retries--) {
continue;
}
@@ -340,8 +330,7 @@ future<bool> ldap_role_manager::can_login(std::string_view role_name) {
return _std_mgr.can_login(role_name);
}
future<std::optional<sstring>> ldap_role_manager::get_attribute(
std::string_view role_name, std::string_view attribute_name, ::service::query_state& qs) {
future<std::optional<sstring>> ldap_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state& qs) {
return _std_mgr.get_attribute(role_name, attribute_name, qs);
}

View File

@@ -57,8 +57,7 @@ class ldap_role_manager : public role_manager {
cache& cache ///< Passed to standard_role_manager.
);
/// Retrieves LDAP configuration entries from qp and invokes the other constructor. Required by
/// class_registrator<role_manager>.
/// Retrieves LDAP configuration entries from qp and invokes the other constructor.
ldap_role_manager(cql3::query_processor& qp, ::service::raft_group0_client& rg0c, ::service::migration_manager& mm, cache& cache);
/// Thrown when query-template parsing fails.

View File

@@ -0,0 +1,31 @@
/*
* Copyright (C) 2026-present ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* SPDX-License-Identifier: (LicenseRef-ScyllaDB-Source-Available-1.0 and Apache-2.0)
*/
#include "auth/maintenance_socket_authenticator.hh"
namespace auth {
maintenance_socket_authenticator::~maintenance_socket_authenticator() {
}
future<> maintenance_socket_authenticator::start() {
return make_ready_future<>();
}
future<> maintenance_socket_authenticator::ensure_superuser_is_created() const {
return make_ready_future<>();
}
bool maintenance_socket_authenticator::require_authentication() const {
return false;
}
} // namespace auth

View File

@@ -0,0 +1,36 @@
/*
* Copyright (C) 2026-present ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* SPDX-License-Identifier: (LicenseRef-ScyllaDB-Source-Available-1.0 and Apache-2.0)
*/
#pragma once
#include <seastar/core/shared_future.hh>
#include "password_authenticator.hh"
namespace auth {
// maintenance_socket_authenticator is used for clients connecting to the
// maintenance socket. It does not require authentication,
// while still allowing the managing of roles and their credentials.
class maintenance_socket_authenticator : public password_authenticator {
public:
using password_authenticator::password_authenticator;
virtual ~maintenance_socket_authenticator();
virtual future<> start() override;
virtual future<> ensure_superuser_is_created() const override;
bool require_authentication() const override;
};
} // namespace auth

View File

@@ -0,0 +1,37 @@
/*
* Copyright (C) 2026-present ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* SPDX-License-Identifier: (LicenseRef-ScyllaDB-Source-Available-1.0 and Apache-2.0)
*/
#pragma once
#include "auth/default_authorizer.hh"
#include "auth/permission.hh"
namespace auth {
// maintenance_socket_authorizer is used for clients connecting to the
// maintenance socket. It grants all permissions unconditionally (like
// AllowAllAuthorizer) while still supporting grant/revoke operations
// (delegated to the underlying CassandraAuthorizer / default_authorizer).
class maintenance_socket_authorizer : public default_authorizer {
public:
using default_authorizer::default_authorizer;
~maintenance_socket_authorizer() override = default;
future<> start() override {
return make_ready_future<>();
}
future<permission_set> authorize(const role_or_anonymous&, const resource&) const override {
return make_ready_future<permission_set>(permissions::ALL);
}
};
} // namespace auth

View File

@@ -13,23 +13,48 @@
#include <string_view>
#include "auth/cache.hh"
#include "cql3/description.hh"
#include "utils/class_registrator.hh"
#include "utils/log.hh"
#include "utils/on_internal_error.hh"
namespace auth {
constexpr std::string_view maintenance_socket_role_manager_name = "com.scylladb.auth.MaintenanceSocketRoleManager";
static logging::logger log("maintenance_socket_role_manager");
static const class_registrator<
role_manager,
maintenance_socket_role_manager,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&,
cache&> registration(sstring{maintenance_socket_role_manager_name});
future<> maintenance_socket_role_manager::ensure_role_operations_are_enabled() {
if (_is_maintenance_mode) {
on_internal_error(log, "enabling role operations not allowed in maintenance mode");
}
if (_std_mgr.has_value()) {
on_internal_error(log, "role operations are already enabled");
}
_std_mgr.emplace(_qp, _group0_client, _migration_manager, _cache);
return _std_mgr->start();
}
void maintenance_socket_role_manager::set_maintenance_mode() {
if (_std_mgr.has_value()) {
on_internal_error(log, "cannot enter maintenance mode after role operations have been enabled");
}
_is_maintenance_mode = true;
}
maintenance_socket_role_manager::maintenance_socket_role_manager(
cql3::query_processor& qp,
::service::raft_group0_client& rg0c,
::service::migration_manager& mm,
cache& c)
: _qp(qp)
, _group0_client(rg0c)
, _migration_manager(mm)
, _cache(c)
, _std_mgr(std::nullopt)
, _is_maintenance_mode(false) {
}
std::string_view maintenance_socket_role_manager::qualified_java_name() const noexcept {
return maintenance_socket_role_manager_name;
return "com.scylladb.auth.MaintenanceSocketRoleManager";
}
const resource_set& maintenance_socket_role_manager::protected_resources() const {
@@ -43,81 +68,161 @@ future<> maintenance_socket_role_manager::start() {
}
future<> maintenance_socket_role_manager::stop() {
return make_ready_future<>();
return _std_mgr ? _std_mgr->stop() : make_ready_future<>();
}
future<> maintenance_socket_role_manager::ensure_superuser_is_created() {
return make_ready_future<>();
return _std_mgr ? _std_mgr->ensure_superuser_is_created() : make_ready_future<>();
}
template<typename T = void>
future<T> operation_not_supported_exception(std::string_view operation) {
future<T> operation_not_available_in_maintenance_mode_exception(std::string_view operation) {
return make_exception_future<T>(
std::runtime_error(fmt::format("role manager: {} operation not supported through maintenance socket", operation)));
std::runtime_error(fmt::format("role manager: {} operation not available through maintenance socket in maintenance mode", operation)));
}
future<> maintenance_socket_role_manager::create(std::string_view role_name, const role_config&, ::service::group0_batch&) {
return operation_not_supported_exception("CREATE");
template<typename T = void>
future<T> manager_not_ready_exception(std::string_view operation) {
return make_exception_future<T>(
std::runtime_error(fmt::format("role manager: {} operation not available because manager not ready yet (role operations not enabled)", operation)));
}
future<> maintenance_socket_role_manager::validate_operation(std::string_view name) const {
if (_is_maintenance_mode) {
return operation_not_available_in_maintenance_mode_exception(name);
}
if (!_std_mgr) {
return manager_not_ready_exception(name);
}
return make_ready_future<>();
}
future<> maintenance_socket_role_manager::create(std::string_view role_name, const role_config& c, ::service::group0_batch& mc) {
auto f = validate_operation("CREATE");
if (f.failed()) {
return f;
}
return _std_mgr->create(role_name, c, mc);
}
future<> maintenance_socket_role_manager::drop(std::string_view role_name, ::service::group0_batch& mc) {
return operation_not_supported_exception("DROP");
auto f = validate_operation("DROP");
if (f.failed()) {
return f;
}
return _std_mgr->drop(role_name, mc);
}
future<> maintenance_socket_role_manager::alter(std::string_view role_name, const role_config_update&, ::service::group0_batch&) {
return operation_not_supported_exception("ALTER");
future<> maintenance_socket_role_manager::alter(std::string_view role_name, const role_config_update& u, ::service::group0_batch& mc) {
auto f = validate_operation("ALTER");
if (f.failed()) {
return f;
}
return _std_mgr->alter(role_name, u, mc);
}
future<> maintenance_socket_role_manager::grant(std::string_view grantee_name, std::string_view role_name, ::service::group0_batch& mc) {
return operation_not_supported_exception("GRANT");
auto f = validate_operation("GRANT");
if (f.failed()) {
return f;
}
return _std_mgr->grant(grantee_name, role_name, mc);
}
future<> maintenance_socket_role_manager::revoke(std::string_view revokee_name, std::string_view role_name, ::service::group0_batch& mc) {
return operation_not_supported_exception("REVOKE");
auto f = validate_operation("REVOKE");
if (f.failed()) {
return f;
}
return _std_mgr->revoke(revokee_name, role_name, mc);
}
future<role_set> maintenance_socket_role_manager::query_granted(std::string_view grantee_name, recursive_role_query) {
return operation_not_supported_exception<role_set>("QUERY GRANTED");
future<role_set> maintenance_socket_role_manager::query_granted(std::string_view grantee_name, recursive_role_query m) {
auto f = validate_operation("QUERY GRANTED");
if (f.failed()) {
return make_exception_future<role_set>(f.get_exception());
}
return _std_mgr->query_granted(grantee_name, m);
}
future<role_to_directly_granted_map> maintenance_socket_role_manager::query_all_directly_granted(::service::query_state&) {
return operation_not_supported_exception<role_to_directly_granted_map>("QUERY ALL DIRECTLY GRANTED");
future<role_to_directly_granted_map> maintenance_socket_role_manager::query_all_directly_granted(::service::query_state& qs) {
auto f = validate_operation("QUERY ALL DIRECTLY GRANTED");
if (f.failed()) {
return make_exception_future<role_to_directly_granted_map>(f.get_exception());
}
return _std_mgr->query_all_directly_granted(qs);
}
future<role_set> maintenance_socket_role_manager::query_all(::service::query_state&) {
return operation_not_supported_exception<role_set>("QUERY ALL");
future<role_set> maintenance_socket_role_manager::query_all(::service::query_state& qs) {
auto f = validate_operation("QUERY ALL");
if (f.failed()) {
return make_exception_future<role_set>(f.get_exception());
}
return _std_mgr->query_all(qs);
}
future<bool> maintenance_socket_role_manager::exists(std::string_view role_name) {
return operation_not_supported_exception<bool>("EXISTS");
auto f = validate_operation("EXISTS");
if (f.failed()) {
return make_exception_future<bool>(f.get_exception());
}
return _std_mgr->exists(role_name);
}
future<bool> maintenance_socket_role_manager::is_superuser(std::string_view role_name) {
return make_ready_future<bool>(true);
auto f = validate_operation("IS SUPERUSER");
if (f.failed()) {
return make_exception_future<bool>(f.get_exception());
}
return _std_mgr->is_superuser(role_name);
}
future<bool> maintenance_socket_role_manager::can_login(std::string_view role_name) {
return make_ready_future<bool>(true);
auto f = validate_operation("CAN LOGIN");
if (f.failed()) {
return make_exception_future<bool>(f.get_exception());
}
return _std_mgr->can_login(role_name);
}
future<std::optional<sstring>> maintenance_socket_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state&) {
return operation_not_supported_exception<std::optional<sstring>>("GET ATTRIBUTE");
future<std::optional<sstring>> maintenance_socket_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state& qs) {
auto f = validate_operation("GET ATTRIBUTE");
if (f.failed()) {
return make_exception_future<std::optional<sstring>>(f.get_exception());
}
return _std_mgr->get_attribute(role_name, attribute_name, qs);
}
future<role_manager::attribute_vals> maintenance_socket_role_manager::query_attribute_for_all(std::string_view attribute_name, ::service::query_state&) {
return operation_not_supported_exception<role_manager::attribute_vals>("QUERY ATTRIBUTE");
future<role_manager::attribute_vals> maintenance_socket_role_manager::query_attribute_for_all(std::string_view attribute_name, ::service::query_state& qs) {
auto f = validate_operation("QUERY ATTRIBUTE FOR ALL");
if (f.failed()) {
return make_exception_future<role_manager::attribute_vals>(f.get_exception());
}
return _std_mgr->query_attribute_for_all(attribute_name, qs);
}
future<> maintenance_socket_role_manager::set_attribute(std::string_view role_name, std::string_view attribute_name, std::string_view attribute_value, ::service::group0_batch& mc) {
return operation_not_supported_exception("SET ATTRIBUTE");
auto f = validate_operation("SET ATTRIBUTE");
if (f.failed()) {
return f;
}
return _std_mgr->set_attribute(role_name, attribute_name, attribute_value, mc);
}
future<> maintenance_socket_role_manager::remove_attribute(std::string_view role_name, std::string_view attribute_name, ::service::group0_batch& mc) {
return operation_not_supported_exception("REMOVE ATTRIBUTE");
auto f = validate_operation("REMOVE ATTRIBUTE");
if (f.failed()) {
return f;
}
return _std_mgr->remove_attribute(role_name, attribute_name, mc);
}
future<std::vector<cql3::description>> maintenance_socket_role_manager::describe_role_grants() {
return operation_not_supported_exception<std::vector<cql3::description>>("DESCRIBE SCHEMA WITH INTERNALS");
auto f = validate_operation("DESCRIBE ROLE GRANTS");
if (f.failed()) {
return make_exception_future<std::vector<cql3::description>>(f.get_exception());
}
return _std_mgr->describe_role_grants();
}
} // namespace auth

View File

@@ -11,6 +11,7 @@
#include "auth/cache.hh"
#include "auth/resource.hh"
#include "auth/role_manager.hh"
#include "auth/standard_role_manager.hh"
#include <seastar/core/future.hh>
namespace cql3 {
@@ -24,13 +25,26 @@ class raft_group0_client;
namespace auth {
extern const std::string_view maintenance_socket_role_manager_name;
// This role manager is used by the maintenance socket. It has disabled all role management operations to not depend on
// system_auth keyspace, which may be not yet created when the maintenance socket starts listening.
// This role manager is used by the maintenance socket. It has disabled all role management operations
// in maintenance mode. In normal mode it delegates all operations to a standard_role_manager,
// which is created on demand when the node joins the cluster.
class maintenance_socket_role_manager final : public role_manager {
cql3::query_processor& _qp;
::service::raft_group0_client& _group0_client;
::service::migration_manager& _migration_manager;
cache& _cache;
std::optional<standard_role_manager> _std_mgr;
bool _is_maintenance_mode;
public:
maintenance_socket_role_manager(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, cache&) {}
void set_maintenance_mode() override;
// Ensures role management operations are enabled.
// It must be called once the node has joined the cluster.
// In the meantime all role management operations will fail.
future<> ensure_role_operations_are_enabled() override;
maintenance_socket_role_manager(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, cache&);
virtual std::string_view qualified_java_name() const noexcept override;
@@ -42,21 +56,21 @@ public:
virtual future<> ensure_superuser_is_created() override;
virtual future<> create(std::string_view role_name, const role_config&, ::service::group0_batch&) override;
virtual future<> create(std::string_view role_name, const role_config& c, ::service::group0_batch& mc) override;
virtual future<> drop(std::string_view role_name, ::service::group0_batch& mc) override;
virtual future<> alter(std::string_view role_name, const role_config_update&, ::service::group0_batch&) override;
virtual future<> alter(std::string_view role_name, const role_config_update& u, ::service::group0_batch& mc) override;
virtual future<> grant(std::string_view grantee_name, std::string_view role_name, ::service::group0_batch& mc) override;
virtual future<> revoke(std::string_view revokee_name, std::string_view role_name, ::service::group0_batch& mc) override;
virtual future<role_set> query_granted(std::string_view grantee_name, recursive_role_query) override;
virtual future<role_set> query_granted(std::string_view grantee_name, recursive_role_query m) override;
virtual future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state&) override;
virtual future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state& qs) override;
virtual future<role_set> query_all(::service::query_state&) override;
virtual future<role_set> query_all(::service::query_state& qs) override;
virtual future<bool> exists(std::string_view role_name) override;
@@ -64,15 +78,19 @@ public:
virtual future<bool> can_login(std::string_view role_name) override;
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state&) override;
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state& qs) override;
virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name, ::service::query_state&) override;
virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name, ::service::query_state& qs) override;
virtual future<> set_attribute(std::string_view role_name, std::string_view attribute_name, std::string_view attribute_value, ::service::group0_batch& mc) override;
virtual future<> remove_attribute(std::string_view role_name, std::string_view attribute_name, ::service::group0_batch& mc) override;
virtual future<std::vector<cql3::description>> describe_role_grants() override;
private:
future<> validate_operation(std::string_view name) const;
};
}

View File

@@ -26,10 +26,9 @@
#include "cql3/untyped_result_set.hh"
#include "utils/log.hh"
#include "service/migration_manager.hh"
#include "utils/class_registrator.hh"
#include "replica/database.hh"
#include "cql3/query_processor.hh"
#include "db/config.hh"
#include "db/system_keyspace.hh"
namespace auth {
@@ -37,29 +36,10 @@ constexpr std::string_view password_authenticator_name("org.apache.cassandra.aut
// name of the hash column.
static constexpr std::string_view SALTED_HASH = "salted_hash";
static constexpr std::string_view DEFAULT_USER_NAME = meta::DEFAULT_SUPERUSER_NAME;
static const sstring DEFAULT_USER_PASSWORD = sstring(meta::DEFAULT_SUPERUSER_NAME);
static logging::logger plogger("password_authenticator");
// To ensure correct initialization order, we unfortunately need to use a string literal.
static const class_registrator<
authenticator,
password_authenticator,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&,
cache&> password_auth_reg("org.apache.cassandra.auth.PasswordAuthenticator");
static thread_local auto rng_for_salt = std::default_random_engine(std::random_device{}());
static std::string_view get_config_value(std::string_view value, std::string_view def) {
return value.empty() ? def : value;
}
std::string password_authenticator::default_superuser(const db::config& cfg) {
return std::string(get_config_value(cfg.auth_superuser_name(), DEFAULT_USER_NAME));
}
password_authenticator::~password_authenticator() {
}
@@ -69,7 +49,6 @@ password_authenticator::password_authenticator(cql3::query_processor& qp, ::serv
, _migration_manager(mm)
, _cache(cache)
, _stopped(make_ready_future<>())
, _superuser(default_superuser(qp.db().get_config()))
{}
static bool has_salted_hash(const cql3::untyped_result_set_row& row) {
@@ -78,76 +57,18 @@ static bool has_salted_hash(const cql3::untyped_result_set_row& row) {
sstring password_authenticator::update_row_query() const {
return seastar::format("UPDATE {}.{} SET {} = ? WHERE {} = ?",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
meta::roles_table::name,
SALTED_HASH,
meta::roles_table::role_col_name);
}
static const sstring legacy_table_name{"credentials"};
bool password_authenticator::legacy_metadata_exists() const {
return _qp.db().has_schema(meta::legacy::AUTH_KS, legacy_table_name);
}
future<> password_authenticator::migrate_legacy_metadata() const {
plogger.info("Starting migration of legacy authentication metadata.");
static const sstring query = seastar::format("SELECT * FROM {}.{}", meta::legacy::AUTH_KS, legacy_table_name);
return _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state(),
cql3::query_processor::cache_internal::no).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
auto username = row.get_as<sstring>("username");
auto salted_hash = row.get_as<sstring>(SALTED_HASH);
static const auto query = seastar::format("UPDATE {}.{} SET {} = ? WHERE {} = ?",
meta::legacy::AUTH_KS,
meta::roles_table::name,
SALTED_HASH,
meta::roles_table::role_col_name);
return _qp.execute_internal(
query,
consistency_for_user(username),
internal_distributed_query_state(),
{std::move(salted_hash), username},
cql3::query_processor::cache_internal::no).discard_result();
}).finally([results] {});
}).then([] {
plogger.info("Finished migrating legacy authentication metadata.");
}).handle_exception([](std::exception_ptr ep) {
plogger.error("Encountered an error during migration!");
std::rethrow_exception(ep);
});
}
future<> password_authenticator::legacy_create_default_if_missing() {
const auto exists = co_await legacy::default_role_row_satisfies(_qp, &has_salted_hash, _superuser);
if (exists) {
co_return;
}
std::string salted_pwd(get_config_value(_qp.db().get_config().auth_superuser_salted_password(), ""));
if (salted_pwd.empty()) {
salted_pwd = passwords::hash(DEFAULT_USER_PASSWORD, rng_for_salt, _scheme);
}
const auto query = seastar::format("UPDATE {}.{} SET {} = ? WHERE {} = ?",
meta::legacy::AUTH_KS,
meta::roles_table::name,
SALTED_HASH,
meta::roles_table::role_col_name);
co_await _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state(),
{salted_pwd, _superuser},
cql3::query_processor::cache_internal::no);
plogger.info("Created default superuser authentication record.");
}
future<> password_authenticator::maybe_create_default_password() {
auto needs_password = [this] () -> future<bool> {
const sstring query = seastar::format("SELECT * FROM {}.{} WHERE is_superuser = true ALLOW FILTERING", get_auth_ks_name(_qp), meta::roles_table::name);
if (default_superuser(_qp).empty()) {
co_return false;
}
const sstring query = seastar::format("SELECT * FROM {}.{} WHERE is_superuser = true ALLOW FILTERING", db::system_keyspace::NAME, meta::roles_table::name);
auto results = co_await _qp.execute_internal(query,
db::consistency_level::LOCAL_ONE,
internal_distributed_query_state(), cql3::query_processor::cache_internal::yes);
@@ -157,7 +78,7 @@ future<> password_authenticator::maybe_create_default_password() {
bool has_default = false;
bool has_superuser_with_password = false;
for (auto& result : *results) {
if (result.get_as<sstring>(meta::roles_table::role_col_name) == _superuser) {
if (result.get_as<sstring>(meta::roles_table::role_col_name) == default_superuser(_qp)) {
has_default = true;
}
if (has_salted_hash(result)) {
@@ -178,12 +99,12 @@ future<> password_authenticator::maybe_create_default_password() {
co_return;
}
// Set default superuser's password.
std::string salted_pwd(get_config_value(_qp.db().get_config().auth_superuser_salted_password(), ""));
std::string salted_pwd(_qp.db().get_config().auth_superuser_salted_password());
if (salted_pwd.empty()) {
salted_pwd = passwords::hash(DEFAULT_USER_PASSWORD, rng_for_salt, _scheme);
co_return;
}
const auto update_query = update_row_query();
co_await collect_mutations(_qp, batch, update_query, {salted_pwd, _superuser});
co_await collect_mutations(_qp, batch, update_query, {salted_pwd, default_superuser(_qp)});
co_await std::move(batch).commit(_group0_client, _as, get_raft_timeout());
plogger.info("Created default superuser authentication record.");
}
@@ -216,58 +137,14 @@ future<> password_authenticator::start() {
_stopped = do_after_system_ready(_as, [this] {
return async([this] {
if (legacy_mode(_qp)) {
if (!_superuser_created_promise.available()) {
// Counterintuitively, we mark promise as ready before any startup work
// because wait_for_schema_agreement() below will block indefinitely
// without cluster majority. In that case, blocking node startup
// would lead to a cluster deadlock.
_superuser_created_promise.set_value();
}
_migration_manager.wait_for_schema_agreement(_qp.db().real_database(), db::timeout_clock::time_point::max(), &_as).get();
if (legacy::any_nondefault_role_row_satisfies(_qp, &has_salted_hash, _superuser).get()) {
if (legacy_metadata_exists()) {
plogger.warn("Ignoring legacy authentication metadata since nondefault data already exist.");
}
return;
}
if (legacy_metadata_exists()) {
migrate_legacy_metadata().get();
return;
}
legacy_create_default_if_missing().get();
}
utils::get_local_injector().inject("password_authenticator_start_pause", utils::wait_for_message(5min)).get();
if (!legacy_mode(_qp)) {
maybe_create_default_password_with_retries().get();
if (!_superuser_created_promise.available()) {
_superuser_created_promise.set_value();
}
maybe_create_default_password_with_retries().get();
if (!_superuser_created_promise.available()) {
_superuser_created_promise.set_value();
}
});
});
if (legacy_mode(_qp)) {
static const sstring create_roles_query = fmt::format(
"CREATE TABLE {}.{} ("
" {} text PRIMARY KEY,"
" can_login boolean,"
" is_superuser boolean,"
" member_of set<text>,"
" salted_hash text"
")",
meta::legacy::AUTH_KS,
meta::roles_table::name,
meta::roles_table::role_col_name);
return create_legacy_metadata_table_if_missing(
meta::roles_table::name,
_qp,
create_roles_query,
_migration_manager);
}
return make_ready_future<>();
});
}
@@ -277,15 +154,6 @@ future<> password_authenticator::stop() {
return _stopped.handle_exception_type([] (const sleep_aborted&) { }).handle_exception_type([](const abort_requested_exception&) {});
}
db::consistency_level password_authenticator::consistency_for_user(std::string_view role_name) {
// TODO: this is plain dung. Why treat hardcoded default special, but for example a user-created
// super user uses plain LOCAL_ONE?
if (role_name == DEFAULT_USER_NAME) {
return db::consistency_level::QUORUM;
}
return db::consistency_level::LOCAL_ONE;
}
std::string_view password_authenticator::qualified_java_name() const {
return password_authenticator_name;
}
@@ -315,20 +183,12 @@ future<authenticated_user> password_authenticator::authenticate(
const sstring password = credentials.at(PASSWORD_KEY);
try {
std::optional<sstring> salted_hash;
if (legacy_mode(_qp)) {
salted_hash = co_await get_password_hash(username);
if (!salted_hash) {
throw exceptions::authentication_exception("Username and/or password are incorrect");
}
} else {
auto role = _cache.get(username);
if (!role || role->salted_hash.empty()) {
throw exceptions::authentication_exception("Username and/or password are incorrect");
}
salted_hash = role->salted_hash;
auto role = _cache.get(username);
if (!role || role->salted_hash.empty()) {
throw exceptions::authentication_exception("Username and/or password are incorrect");
}
const bool password_match = co_await passwords::check(password, *salted_hash);
const auto& salted_hash = role->salted_hash;
const bool password_match = co_await passwords::check(password, salted_hash);
if (!password_match) {
throw exceptions::authentication_exception("Username and/or password are incorrect");
}
@@ -367,16 +227,7 @@ future<> password_authenticator::create(std::string_view role_name, const authen
}
const auto query = update_row_query();
if (legacy_mode(_qp)) {
co_await _qp.execute_internal(
query,
consistency_for_user(role_name),
internal_distributed_query_state(),
{std::move(*maybe_hash), sstring(role_name)},
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_await collect_mutations(_qp, mc, query, {std::move(*maybe_hash), sstring(role_name)});
}
co_await collect_mutations(_qp, mc, query, {std::move(*maybe_hash), sstring(role_name)});
}
future<> password_authenticator::alter(std::string_view role_name, const authentication_options& options, ::service::group0_batch& mc) {
@@ -387,38 +238,21 @@ future<> password_authenticator::alter(std::string_view role_name, const authent
const auto password = std::get<password_option>(*options.credentials).password;
const sstring query = seastar::format("UPDATE {}.{} SET {} = ? WHERE {} = ?",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
meta::roles_table::name,
SALTED_HASH,
meta::roles_table::role_col_name);
if (legacy_mode(_qp)) {
co_await _qp.execute_internal(
query,
consistency_for_user(role_name),
internal_distributed_query_state(),
{passwords::hash(password, rng_for_salt, _scheme), sstring(role_name)},
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_await collect_mutations(_qp, mc, query,
{passwords::hash(password, rng_for_salt, _scheme), sstring(role_name)});
}
co_await collect_mutations(_qp, mc, query,
{passwords::hash(password, rng_for_salt, _scheme), sstring(role_name)});
}
future<> password_authenticator::drop(std::string_view name, ::service::group0_batch& mc) {
const sstring query = seastar::format("DELETE {} FROM {}.{} WHERE {} = ?",
SALTED_HASH,
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
meta::roles_table::name,
meta::roles_table::role_col_name);
if (legacy_mode(_qp)) {
co_await _qp.execute_internal(
query, consistency_for_user(name),
internal_distributed_query_state(),
{sstring(name)},
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_await collect_mutations(_qp, mc, query, {sstring(name)});
}
co_await collect_mutations(_qp, mc, query, {sstring(name)});
}
future<custom_options> password_authenticator::query_custom_options(std::string_view role_name) const {
@@ -437,13 +271,13 @@ future<std::optional<sstring>> password_authenticator::get_password_hash(std::st
// that a map lookup string->statement is not gonna kill us much.
const sstring query = seastar::format("SELECT {} FROM {}.{} WHERE {} = ?",
SALTED_HASH,
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
meta::roles_table::name,
meta::roles_table::role_col_name);
const auto res = co_await _qp.execute_internal(
query,
consistency_for_user(role_name),
db::consistency_level::LOCAL_ONE,
internal_distributed_query_state(),
{role_name},
cql3::query_processor::cache_internal::yes);

View File

@@ -13,7 +13,6 @@
#include <seastar/core/abort_source.hh>
#include <seastar/core/shared_future.hh>
#include "db/consistency_level_type.hh"
#include "auth/authenticator.hh"
#include "auth/passwords.hh"
#include "auth/cache.hh"
@@ -44,15 +43,11 @@ class password_authenticator : public authenticator {
cache& _cache;
future<> _stopped;
abort_source _as;
std::string _superuser; // default superuser name from the config (may or may not be present in roles table)
shared_promise<> _superuser_created_promise;
// We used to also support bcrypt, SHA-256, and MD5 (ref. scylladb#24524).
constexpr static auth::passwords::scheme _scheme = passwords::scheme::sha_512;
public:
static db::consistency_level consistency_for_user(std::string_view role_name);
static std::string default_superuser(const db::config&);
password_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, cache&);
~password_authenticator();
@@ -90,12 +85,6 @@ public:
virtual future<> ensure_superuser_is_created() const override;
private:
bool legacy_metadata_exists() const;
future<> migrate_legacy_metadata() const;
future<> legacy_create_default_if_missing();
future<> maybe_create_default_password();
future<> maybe_create_default_password_with_retries();

View File

@@ -112,6 +112,11 @@ public:
virtual future<> stop() = 0;
///
/// Notify that the maintenance mode is starting.
///
virtual void set_maintenance_mode() {}
///
/// Ensure that superuser role exists.
///
@@ -119,6 +124,11 @@ public:
///
virtual future<> ensure_superuser_is_created() = 0;
///
/// Ensure role management operations are enabled. Some role managers may defer initialization.
///
virtual future<> ensure_role_operations_are_enabled() { return make_ready_future<>(); }
///
/// \returns an exceptional future with \ref role_already_exists for a role that has previously been created.
///

View File

@@ -1,68 +0,0 @@
/*
* Copyright (C) 2018-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "auth/roles-metadata.hh"
#include <seastar/core/format.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/sstring.hh>
#include "auth/common.hh"
#include "cql3/query_processor.hh"
#include "cql3/untyped_result_set.hh"
namespace auth {
namespace legacy {
future<bool> default_role_row_satisfies(
cql3::query_processor& qp,
std::function<bool(const cql3::untyped_result_set_row&)> p,
std::optional<std::string> rolename) {
const sstring query = seastar::format("SELECT * FROM {}.{} WHERE {} = ?",
auth::meta::legacy::AUTH_KS,
meta::roles_table::name,
meta::roles_table::role_col_name);
for (auto cl : { db::consistency_level::ONE, db::consistency_level::QUORUM }) {
auto results = co_await qp.execute_internal(query, cl
, internal_distributed_query_state()
, {rolename.value_or(std::string(auth::meta::DEFAULT_SUPERUSER_NAME))}
, cql3::query_processor::cache_internal::yes
);
if (!results->empty()) {
co_return p(results->one());
}
}
co_return false;
}
future<bool> any_nondefault_role_row_satisfies(
cql3::query_processor& qp,
std::function<bool(const cql3::untyped_result_set_row&)> p,
std::optional<std::string> rolename) {
const sstring query = seastar::format("SELECT * FROM {}.{}", auth::meta::legacy::AUTH_KS, meta::roles_table::name);
auto results = co_await qp.execute_internal(query, db::consistency_level::QUORUM
, internal_distributed_query_state(), cql3::query_processor::cache_internal::no
);
if (results->empty()) {
co_return false;
}
static const sstring col_name = sstring(meta::roles_table::role_col_name);
co_return std::ranges::any_of(*results, [&](const cql3::untyped_result_set_row& row) {
auto superuser = rolename ? std::string_view(*rolename) : meta::DEFAULT_SUPERUSER_NAME;
const bool is_nondefault = row.get_as<sstring>(col_name) != superuser;
return is_nondefault && p(row);
});
}
} // namespace legacy
} // namespace auth

View File

@@ -8,18 +8,7 @@
#pragma once
#include <optional>
#include <string_view>
#include <functional>
#include <seastar/core/future.hh>
#include "seastarx.hh"
namespace cql3 {
class query_processor;
class untyped_result_set_row;
}
namespace auth {
@@ -35,26 +24,4 @@ constexpr std::string_view role_col_name{"role", 4};
} // namespace meta
namespace legacy {
///
/// Check that the default role satisfies a predicate, or `false` if the default role does not exist.
///
future<bool> default_role_row_satisfies(
cql3::query_processor&,
std::function<bool(const cql3::untyped_result_set_row&)>,
std::optional<std::string> rolename = {}
);
///
/// Check that any nondefault role satisfies a predicate. `false` if no nondefault roles exist.
///
future<bool> any_nondefault_role_row_satisfies(
cql3::query_processor&,
std::function<bool(const cql3::untyped_result_set_row&)>,
std::optional<std::string> rolename = {}
);
} // namespace legacy
} // namespace auth

View File

@@ -22,21 +22,11 @@
#include "db/config.hh"
#include "utils/log.hh"
#include "seastarx.hh"
#include "utils/class_registrator.hh"
namespace auth {
static logging::logger mylog("saslauthd_authenticator");
// To ensure correct initialization order, we unfortunately need to use a string literal.
static const class_registrator<
authenticator,
saslauthd_authenticator,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&,
cache&> saslauthd_auth_reg("com.scylladb.auth.SaslauthdAuthenticator");
saslauthd_authenticator::saslauthd_authenticator(cql3::query_processor& qp, ::service::raft_group0_client&, ::service::migration_manager&, cache&)
: _socket_path(qp.db().get_config().saslauthd_socket_path())
{}

View File

@@ -16,6 +16,8 @@
#include <algorithm>
#include <chrono>
#include <boost/algorithm/string.hpp>
#include <seastar/core/future-util.hh>
#include <seastar/core/shard_id.hh>
#include <seastar/core/sharded.hh>
@@ -23,8 +25,18 @@
#include "auth/allow_all_authenticator.hh"
#include "auth/allow_all_authorizer.hh"
#include "auth/certificate_authenticator.hh"
#include "auth/common.hh"
#include "auth/default_authorizer.hh"
#include "auth/ldap_role_manager.hh"
#include "auth/maintenance_socket_authenticator.hh"
#include "auth/maintenance_socket_authorizer.hh"
#include "auth/maintenance_socket_role_manager.hh"
#include "auth/password_authenticator.hh"
#include "auth/role_or_anonymous.hh"
#include "auth/saslauthd_authenticator.hh"
#include "auth/standard_role_manager.hh"
#include "auth/transitional.hh"
#include "cql3/functions/functions.hh"
#include "cql3/query_processor.hh"
#include "cql3/description.hh"
@@ -43,7 +55,6 @@
#include "service/raft/raft_group0_client.hh"
#include "mutation/timestamp.hh"
#include "utils/assert.hh"
#include "utils/class_registrator.hh"
#include "locator/abstract_replication_strategy.hh"
#include "data_dictionary/keyspace_metadata.hh"
#include "service/storage_service.hh"
@@ -63,88 +74,6 @@ static const sstring superuser_col_name("super");
static logging::logger log("auth_service");
class auth_migration_listener final : public ::service::migration_listener {
service& _service;
cql3::query_processor& _qp;
public:
explicit auth_migration_listener(service& s, cql3::query_processor& qp) : _service(s), _qp(qp) {
}
private:
void on_create_keyspace(const sstring& ks_name) override {}
void on_create_column_family(const sstring& ks_name, const sstring& cf_name) override {}
void on_create_user_type(const sstring& ks_name, const sstring& type_name) override {}
void on_create_function(const sstring& ks_name, const sstring& function_name) override {}
void on_create_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {}
void on_create_view(const sstring& ks_name, const sstring& view_name) override {}
void on_update_keyspace(const sstring& ks_name) override {}
void on_update_column_family(const sstring& ks_name, const sstring& cf_name, bool) override {}
void on_update_user_type(const sstring& ks_name, const sstring& type_name) override {}
void on_update_function(const sstring& ks_name, const sstring& function_name) override {}
void on_update_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {}
void on_update_view(const sstring& ks_name, const sstring& view_name, bool columns_changed) override {}
void on_drop_keyspace(const sstring& ks_name) override {
if (!legacy_mode(_qp)) {
// in non legacy path revoke is part of schema change statement execution
return;
}
// Do it in the background.
(void)do_with(auth::make_data_resource(ks_name), ::service::group0_batch::unused(), [this] (auto& r, auto& mc) mutable {
return _service.revoke_all(r, mc);
}).handle_exception([] (std::exception_ptr e) {
log.error("Unexpected exception while revoking all permissions on dropped keyspace: {}", e);
});
(void)do_with(auth::make_functions_resource(ks_name), ::service::group0_batch::unused(), [this] (auto& r, auto& mc) mutable {
return _service.revoke_all(r, mc);
}).handle_exception([] (std::exception_ptr e) {
log.error("Unexpected exception while revoking all permissions on functions in dropped keyspace: {}", e);
});
}
void on_drop_column_family(const sstring& ks_name, const sstring& cf_name) override {
if (!legacy_mode(_qp)) {
// in non legacy path revoke is part of schema change statement execution
return;
}
// Do it in the background.
(void)do_with(auth::make_data_resource(ks_name, cf_name), ::service::group0_batch::unused(), [this] (auto& r, auto& mc) mutable {
return _service.revoke_all(r, mc);
}).handle_exception([] (std::exception_ptr e) {
log.error("Unexpected exception while revoking all permissions on dropped table: {}", e);
});
}
void on_drop_user_type(const sstring& ks_name, const sstring& type_name) override {}
void on_drop_function(const sstring& ks_name, const sstring& function_name) override {
if (!legacy_mode(_qp)) {
// in non legacy path revoke is part of schema change statement execution
return;
}
// Do it in the background.
(void)do_with(auth::make_functions_resource(ks_name, function_name), ::service::group0_batch::unused(), [this] (auto& r, auto& mc) mutable {
return _service.revoke_all(r, mc);
}).handle_exception([] (std::exception_ptr e) {
log.error("Unexpected exception while revoking all permissions on dropped function: {}", e);
});
}
void on_drop_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {
if (!legacy_mode(_qp)) {
// in non legacy path revoke is part of schema change statement execution
return;
}
(void)do_with(auth::make_functions_resource(ks_name, aggregate_name), ::service::group0_batch::unused(), [this] (auto& r, auto& mc) mutable {
return _service.revoke_all(r, mc);
}).handle_exception([] (std::exception_ptr e) {
log.error("Unexpected exception while revoking all permissions on dropped aggregate: {}", e);
});
}
void on_drop_view(const sstring& ks_name, const sstring& view_name) override {}
};
static future<> validate_role_exists(const service& ser, std::string_view role_name) {
return ser.underlying_role_manager().exists(role_name).then([role_name](bool exists) {
if (!exists) {
@@ -157,7 +86,6 @@ service::service(
cache& cache,
cql3::query_processor& qp,
::service::raft_group0_client& g0,
::service::migration_notifier& mn,
std::unique_ptr<authorizer> z,
std::unique_ptr<authenticator> a,
std::unique_ptr<role_manager> r,
@@ -165,29 +93,26 @@ service::service(
: _cache(cache)
, _qp(qp)
, _group0_client(g0)
, _mnotifier(mn)
, _authorizer(std::move(z))
, _authenticator(std::move(a))
, _role_manager(std::move(r))
, _migration_listener(std::make_unique<auth_migration_listener>(*this, qp))
, _used_by_maintenance_socket(used_by_maintenance_socket) {}
service::service(
cql3::query_processor& qp,
::service::raft_group0_client& g0,
::service::migration_notifier& mn,
::service::migration_manager& mm,
const service_config& sc,
authorizer_factory authorizer_factory,
authenticator_factory authenticator_factory,
role_manager_factory role_manager_factory,
maintenance_socket_enabled used_by_maintenance_socket,
cache& cache)
: service(
cache,
qp,
g0,
mn,
create_object<authorizer>(sc.authorizer_java_name, qp, g0, mm),
create_object<authenticator>(sc.authenticator_java_name, qp, g0, mm, cache),
create_object<role_manager>(sc.role_manager_java_name, qp, g0, mm, cache),
authorizer_factory(),
authenticator_factory(),
role_manager_factory(),
used_by_maintenance_socket) {
}
@@ -220,9 +145,6 @@ future<> service::create_legacy_keyspace_if_missing(::service::migration_manager
}
future<> service::start(::service::migration_manager& mm, db::system_keyspace& sys_ks) {
auto auth_version = co_await sys_ks.get_auth_version();
// version is set in query processor to be easily available in various places we call auth::legacy_mode check.
_qp.auth_version = auth_version;
if (this_shard_id() == 0) {
co_await _cache.load_all();
}
@@ -252,22 +174,12 @@ future<> service::start(::service::migration_manager& mm, db::system_keyspace& s
&service::get_uncached_permissions,
this, std::placeholders::_1, std::placeholders::_2));
}
co_await once_among_shards([this] {
_mnotifier.register_listener(_migration_listener.get());
return make_ready_future<>();
});
}
future<> service::stop() {
_as.request_abort();
// Only one of the shards has the listener registered, but let's try to
// unregister on each one just to make sure.
return _mnotifier.unregister_listener(_migration_listener.get()).then([this] {
_cache.set_permission_loader(nullptr);
return make_ready_future<>();
}).then([this] {
return when_all_succeed(_role_manager->stop(), _authorizer->stop(), _authenticator->stop()).discard_result();
});
_cache.set_permission_loader(nullptr);
return when_all_succeed(_role_manager->stop(), _authorizer->stop(), _authenticator->stop()).discard_result();
}
future<> service::ensure_superuser_is_created() {
@@ -301,12 +213,16 @@ service::get_uncached_permissions(const role_or_anonymous& maybe_role, const res
}
future<permission_set> service::get_permissions(const role_or_anonymous& maybe_role, const resource& r) const {
if (legacy_mode(_qp) || _used_by_maintenance_socket) {
if (_used_by_maintenance_socket) {
return get_uncached_permissions(maybe_role, r);
}
return _cache.get_permissions(maybe_role, r);
}
void service::set_maintenance_mode() {
_role_manager->set_maintenance_mode();
}
future<bool> service::has_superuser(std::string_view role_name, const role_set& roles) const {
for (const auto& role : roles) {
if (co_await _role_manager->is_superuser(role)) {
@@ -342,6 +258,10 @@ static void validate_authentication_options_are_supported(
}
}
future<> service::ensure_role_operations_are_enabled() {
return _role_manager->ensure_role_operations_are_enabled();
}
future<> service::create_role(std::string_view name,
const role_config& config,
const authentication_options& options,
@@ -359,11 +279,6 @@ future<> service::create_role(std::string_view name,
ep = std::current_exception();
}
if (ep) {
// Rollback only in legacy mode as normally mutations won't be
// applied in case exception is raised
if (legacy_mode(_qp)) {
co_await underlying_role_manager().drop(name, mc);
}
std::rethrow_exception(std::move(ep));
}
}
@@ -442,11 +357,11 @@ future<std::vector<cql3::description>> service::describe_roles(bool with_hashed_
const bool authenticator_uses_password_hashes = _authenticator->uses_password_hashes();
auto produce_create_statement = [with_hashed_passwords] (const sstring& formatted_role_name,
const auto default_su = cql3::util::maybe_quote(default_superuser(_qp));
auto produce_create_statement = [&default_su, with_hashed_passwords] (const sstring& formatted_role_name,
const std::optional<sstring>& maybe_hashed_password, bool can_login, bool is_superuser) {
// Even after applying formatting to a role, `formatted_role_name` can only equal `meta::DEFAULT_SUPER_NAME`
// if the original identifier was equal to it.
const sstring role_part = formatted_role_name == meta::DEFAULT_SUPERUSER_NAME
const sstring role_part = formatted_role_name == default_su
? seastar::format("IF NOT EXISTS {}", formatted_role_name)
: formatted_role_name;
@@ -659,6 +574,10 @@ future<std::vector<cql3::description>> service::describe_auth(bool with_hashed_p
// Free functions.
//
void set_maintenance_mode(service& ser) {
ser.set_maintenance_mode();
}
future<bool> has_superuser(const service& ser, const authenticated_user& u) {
if (is_anonymous(u)) {
return make_ready_future<bool>(false);
@@ -667,6 +586,10 @@ future<bool> has_superuser(const service& ser, const authenticated_user& u) {
return ser.has_superuser(*u.name);
}
future<> ensure_role_operations_are_enabled(service& ser) {
return ser.underlying_role_manager().ensure_role_operations_are_enabled();
}
future<role_set> get_roles(const service& ser, const authenticated_user& u) {
if (is_anonymous(u)) {
return make_ready_future<role_set>();
@@ -849,83 +772,115 @@ future<> commit_mutations(service& ser, ::service::group0_batch&& mc) {
return ser.commit_mutations(std::move(mc));
}
future<> migrate_to_auth_v2(db::system_keyspace& sys_ks, ::service::raft_group0_client& g0, start_operation_func_t start_operation_func, abort_source& as) {
// FIXME: if this function fails it may leave partial data in the new tables
// that should be cleared
auto gen = [&sys_ks] (api::timestamp_type ts) -> ::service::mutations_generator {
auto& qp = sys_ks.query_processor();
for (const auto& cf_name : std::vector<sstring>{
"roles", "role_members", "role_attributes", "role_permissions"}) {
schema_ptr schema;
try {
schema = qp.db().find_schema(meta::legacy::AUTH_KS, cf_name);
} catch (const data_dictionary::no_such_column_family&) {
continue; // some tables might not have been created if they were not used
}
namespace {
std::vector<sstring> col_names;
for (const auto& col : schema->all_columns()) {
col_names.push_back(col.name_as_cql_string());
}
sstring val_binders_str = "?";
for (size_t i = 1; i < col_names.size(); ++i) {
val_binders_str += ", ?";
}
std::string_view get_short_name(std::string_view name) {
auto pos = name.find_last_of('.');
if (pos == std::string_view::npos) {
return name;
}
return name.substr(pos + 1);
}
std::vector<mutation> collected;
// use longer than usual timeout as we scan the whole table
// but not infinite or very long as we want to fail reasonably fast
const auto t = 5min;
const timeout_config tc{t, t, t, t, t, t, t};
::service::client_state cs(::service::client_state::internal_tag{}, tc);
::service::query_state qs(cs, empty_service_permit());
} // anonymous namespace
co_await qp.query_internal(
seastar::format("SELECT * FROM {}.{}", meta::legacy::AUTH_KS, cf_name),
db::consistency_level::ALL,
{},
1000,
[&qp, &cf_name, &col_names, &val_binders_str, &schema, ts, &collected] (const cql3::untyped_result_set::row& row) -> future<stop_iteration> {
std::vector<data_value_or_unset> values;
for (const auto& col : schema->all_columns()) {
if (row.has(col.name_as_text())) {
values.push_back(
col.type->deserialize(row.get_blob_unfragmented(col.name_as_text())));
} else {
values.push_back(unset_value{});
}
}
auto muts = co_await qp.get_mutations_internal(
seastar::format("INSERT INTO {}.{} ({}) VALUES ({})",
db::system_keyspace::NAME,
cf_name,
fmt::join(col_names, ", "),
val_binders_str),
internal_distributed_query_state(),
ts,
std::move(values));
if (muts.size() != 1) {
on_internal_error(log,
format("expecting single insert mutation, got {}", muts.size()));
}
authorizer_factory make_authorizer_factory(
std::string_view name,
sharded<cql3::query_processor>& qp) {
std::string_view short_name = get_short_name(name);
collected.push_back(std::move(muts[0]));
co_return stop_iteration::no;
},
std::move(qs));
if (boost::iequals(short_name, "AllowAllAuthorizer")) {
return [&qp] {
return std::make_unique<allow_all_authorizer>(qp.local());
};
} else if (boost::iequals(short_name, "CassandraAuthorizer")) {
return [&qp] {
return std::make_unique<default_authorizer>(qp.local());
};
} else if (boost::iequals(short_name, "TransitionalAuthorizer")) {
return [&qp] {
return std::make_unique<transitional_authorizer>(qp.local());
};
}
throw std::invalid_argument(fmt::format("Unknown authorizer: {}", name));
}
for (auto& m : collected) {
co_yield std::move(m);
}
}
co_yield co_await sys_ks.make_auth_version_mutation(ts,
db::system_keyspace::auth_version_t::v2);
authenticator_factory make_authenticator_factory(
std::string_view name,
sharded<cql3::query_processor>& qp,
::service::raft_group0_client& g0,
sharded<::service::migration_manager>& mm,
sharded<cache>& auth_cache) {
std::string_view short_name = get_short_name(name);
if (boost::iequals(short_name, "AllowAllAuthenticator")) {
return [&qp, &g0, &mm, &auth_cache] {
return std::make_unique<allow_all_authenticator>(qp.local(), g0, mm.local(), auth_cache.local());
};
} else if (boost::iequals(short_name, "PasswordAuthenticator")) {
return [&qp, &g0, &mm, &auth_cache] {
return std::make_unique<password_authenticator>(qp.local(), g0, mm.local(), auth_cache.local());
};
} else if (boost::iequals(short_name, "CertificateAuthenticator")) {
return [&qp, &g0, &mm, &auth_cache] {
return std::make_unique<certificate_authenticator>(qp.local(), g0, mm.local(), auth_cache.local());
};
} else if (boost::iequals(short_name, "SaslauthdAuthenticator")) {
return [&qp, &g0, &mm, &auth_cache] {
return std::make_unique<saslauthd_authenticator>(qp.local(), g0, mm.local(), auth_cache.local());
};
} else if (boost::iequals(short_name, "TransitionalAuthenticator")) {
return [&qp, &g0, &mm, &auth_cache] {
return std::make_unique<transitional_authenticator>(qp.local(), g0, mm.local(), auth_cache.local());
};
}
throw std::invalid_argument(fmt::format("Unknown authenticator: {}", name));
}
role_manager_factory make_role_manager_factory(
std::string_view name,
sharded<cql3::query_processor>& qp,
::service::raft_group0_client& g0,
sharded<::service::migration_manager>& mm,
sharded<cache>& auth_cache) {
std::string_view short_name = get_short_name(name);
if (boost::iequals(short_name, "CassandraRoleManager")) {
return [&qp, &g0, &mm, &auth_cache] {
return std::make_unique<standard_role_manager>(qp.local(), g0, mm.local(), auth_cache.local());
};
} else if (boost::iequals(short_name, "LDAPRoleManager")) {
return [&qp, &g0, &mm, &auth_cache] {
return std::make_unique<ldap_role_manager>(qp.local(), g0, mm.local(), auth_cache.local());
};
}
throw std::invalid_argument(fmt::format("Unknown role manager: {}", name));
}
authenticator_factory make_maintenance_socket_authenticator_factory(
sharded<cql3::query_processor>& qp,
::service::raft_group0_client& g0,
sharded<::service::migration_manager>& mm,
sharded<cache>& auth_cache) {
return [&qp, &g0, &mm, &auth_cache] {
return std::make_unique<maintenance_socket_authenticator>(qp.local(), g0, mm.local(), auth_cache.local());
};
}
authorizer_factory make_maintenance_socket_authorizer_factory(sharded<cql3::query_processor>& qp) {
return [&qp] {
return std::make_unique<maintenance_socket_authorizer>(qp.local());
};
}
role_manager_factory make_maintenance_socket_role_manager_factory(
sharded<cql3::query_processor>& qp,
::service::raft_group0_client& g0,
sharded<::service::migration_manager>& mm,
sharded<cache>& auth_cache) {
return [&qp, &g0, &mm, &auth_cache] {
return std::make_unique<maintenance_socket_role_manager>(qp.local(), g0, mm.local(), auth_cache.local());
};
co_await announce_mutations_with_batching(g0,
start_operation_func,
std::move(gen),
as,
std::nullopt);
}
}

View File

@@ -12,6 +12,7 @@
#include <memory>
#include <optional>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sstring.hh>
#include <seastar/util/bool_class.hh>
@@ -36,19 +37,16 @@ class query_processor;
namespace service {
class migration_manager;
class migration_notifier;
class migration_listener;
}
namespace auth {
class role_or_anonymous;
struct service_config final {
sstring authorizer_java_name;
sstring authenticator_java_name;
sstring role_manager_java_name;
};
/// Factory function types for creating auth module instances on each shard.
using authorizer_factory = std::function<std::unique_ptr<authorizer>()>;
using authenticator_factory = std::function<std::unique_ptr<authenticator>()>;
using role_manager_factory = std::function<std::unique_ptr<role_manager>()>;
///
/// Due to poor (in this author's opinion) decisions of Apache Cassandra, certain choices of one role-manager,
@@ -80,17 +78,12 @@ class service final : public seastar::peering_sharded_service<service> {
::service::raft_group0_client& _group0_client;
::service::migration_notifier& _mnotifier;
authorizer::ptr_type _authorizer;
authenticator::ptr_type _authenticator;
role_manager::ptr_type _role_manager;
// Only one of these should be registered, so we end up with some unused instances. Not the end of the world.
std::unique_ptr<::service::migration_listener> _migration_listener;
maintenance_socket_enabled _used_by_maintenance_socket;
abort_source _as;
@@ -100,7 +93,6 @@ public:
cache& cache,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_notifier&,
std::unique_ptr<authorizer>,
std::unique_ptr<authenticator>,
std::unique_ptr<role_manager>,
@@ -108,15 +100,15 @@ public:
///
/// This constructor is intended to be used when the class is sharded via \ref seastar::sharded. In that case, the
/// arguments must be copyable, which is why we delay construction with instance-construction instructions instead
/// arguments must be copyable, which is why we delay construction with instance-construction factories instead
/// of the instances themselves.
///
service(
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_notifier&,
::service::migration_manager&,
const service_config&,
authorizer_factory,
authenticator_factory,
role_manager_factory,
maintenance_socket_enabled,
cache&);
@@ -138,6 +130,11 @@ public:
///
future<permission_set> get_uncached_permissions(const role_or_anonymous&, const resource&) const;
///
/// Notify the service that the node is entering maintenance mode.
///
void set_maintenance_mode();
///
/// Query whether the named role has been granted a role that is a superuser.
///
@@ -147,6 +144,11 @@ public:
///
future<bool> has_superuser(std::string_view role_name) const;
///
/// Ensure that the role operations are enabled. Some role managers defer initialization.
///
future<> ensure_role_operations_are_enabled();
///
/// Create a role with optional authentication information.
///
@@ -192,12 +194,9 @@ public:
return *_role_manager;
}
cql3::query_processor& query_processor() const noexcept {
return _qp;
}
future<> commit_mutations(::service::group0_batch&& mc) {
return std::move(mc).commit(_group0_client, _as, ::service::raft_timeout{});
co_await std::move(mc).commit(_group0_client, _as, ::service::raft_timeout{});
co_await _group0_client.send_group0_read_barrier_to_live_members();
}
private:
@@ -208,8 +207,12 @@ private:
future<std::vector<cql3::description>> describe_permissions() const;
};
void set_maintenance_mode(service&);
future<bool> has_superuser(const service&, const authenticated_user&);
future<> ensure_role_operations_are_enabled(service&);
future<role_set> get_roles(const service&, const authenticated_user&);
future<permission_set> get_permissions(const service&, const authenticated_user&, const resource&);
@@ -393,7 +396,55 @@ future<std::vector<permission_details>> list_filtered_permissions(
// Finalizes write operations performed in auth by committing mutations via raft group0.
future<> commit_mutations(service& ser, ::service::group0_batch&& mc);
// Migrates data from old keyspace to new one which supports linearizable writes via raft.
future<> migrate_to_auth_v2(db::system_keyspace& sys_ks, ::service::raft_group0_client& g0, start_operation_func_t start_operation_func, abort_source& as);
///
/// Factory helper functions for creating auth module instances.
/// These are intended for use with sharded<service>::start() where copyable arguments are required.
/// The returned factories capture the sharded references and call .local() when invoked on each shard.
///
/// Creates an authorizer factory for config-selectable authorizer types.
/// @param name The authorizer class name (e.g., "CassandraAuthorizer", "AllowAllAuthorizer")
authorizer_factory make_authorizer_factory(
std::string_view name,
sharded<cql3::query_processor>& qp);
/// Creates an authenticator factory for config-selectable authenticator types.
/// @param name The authenticator class name (e.g., "PasswordAuthenticator", "AllowAllAuthenticator")
authenticator_factory make_authenticator_factory(
std::string_view name,
sharded<cql3::query_processor>& qp,
::service::raft_group0_client& g0,
sharded<::service::migration_manager>& mm,
sharded<cache>& cache);
/// Creates a role_manager factory for config-selectable role manager types.
/// @param name The role manager class name (e.g., "CassandraRoleManager")
role_manager_factory make_role_manager_factory(
std::string_view name,
sharded<cql3::query_processor>& qp,
::service::raft_group0_client& g0,
sharded<::service::migration_manager>& mm,
sharded<cache>& cache);
/// Creates a factory for the maintenance socket authenticator.
/// This authenticator is not config-selectable and is only used for the maintenance socket.
authenticator_factory make_maintenance_socket_authenticator_factory(
sharded<cql3::query_processor>& qp,
::service::raft_group0_client& g0,
sharded<::service::migration_manager>& mm,
sharded<cache>& cache);
/// Creates a factory for the maintenance socket authorizer.
/// This authorizer is not config-selectable and is only used for the maintenance socket.
/// It grants all permissions unconditionally while delegating grant/revoke to the default authorizer.
authorizer_factory make_maintenance_socket_authorizer_factory(sharded<cql3::query_processor>& qp);
/// Creates a factory for the maintenance socket role manager.
/// This role manager is not config-selectable and is only used for the maintenance socket.
role_manager_factory make_maintenance_socket_role_manager_factory(
sharded<cql3::query_processor>& qp,
::service::raft_group0_client& g0,
sharded<::service::migration_manager>& mm,
sharded<cache>& cache);
}

View File

@@ -28,15 +28,14 @@
#include "cql3/untyped_result_set.hh"
#include "cql3/util.hh"
#include "db/consistency_level_type.hh"
#include "db/system_keyspace.hh"
#include "exceptions/exceptions.hh"
#include "utils/error_injection.hh"
#include "utils/log.hh"
#include <seastar/core/loop.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include "service/raft/raft_group0_client.hh"
#include "utils/class_registrator.hh"
#include "service/migration_manager.hh"
#include "password_authenticator.hh"
#include "utils/managed_string.hh"
namespace auth {
@@ -44,59 +43,13 @@ namespace auth {
static logging::logger log("standard_role_manager");
static const class_registrator<
role_manager,
standard_role_manager,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&,
cache&> registration("org.apache.cassandra.auth.CassandraRoleManager");
static db::consistency_level consistency_for_role(std::string_view role_name) noexcept {
if (role_name == meta::DEFAULT_SUPERUSER_NAME) {
return db::consistency_level::QUORUM;
}
return db::consistency_level::LOCAL_ONE;
}
future<std::optional<standard_role_manager::record>> standard_role_manager::legacy_find_record(std::string_view role_name) {
const sstring query = seastar::format("SELECT * FROM {}.{} WHERE {} = ?",
get_auth_ks_name(_qp),
meta::roles_table::name,
meta::roles_table::role_col_name);
const auto results = co_await _qp.execute_internal(
query,
consistency_for_role(role_name),
internal_distributed_query_state(),
{sstring(role_name)},
cql3::query_processor::cache_internal::yes);
if (results->empty()) {
co_return std::optional<record>();
}
const cql3::untyped_result_set_row& row = results->one();
co_return std::make_optional(record{
row.get_as<sstring>(sstring(meta::roles_table::role_col_name)),
row.get_or<bool>("is_superuser", false),
row.get_or<bool>("can_login", false),
(row.has("member_of")
? row.get_set<sstring>("member_of")
: role_set())});
}
future<std::optional<standard_role_manager::record>> standard_role_manager::find_record(std::string_view role_name) {
if (legacy_mode(_qp)) {
return legacy_find_record(role_name);
}
auto name = sstring(role_name);
auto role = _cache.get(name);
auto role = _cache.get(role_name);
if (!role) {
return make_ready_future<std::optional<record>>(std::nullopt);
}
return make_ready_future<std::optional<record>>(std::make_optional(record{
.name = std::move(name),
.name = sstring(role_name),
.is_superuser = role->is_superuser,
.can_login = role->can_login,
.member_of = role->member_of
@@ -123,7 +76,6 @@ standard_role_manager::standard_role_manager(cql3::query_processor& qp, ::servic
, _migration_manager(mm)
, _cache(cache)
, _stopped(make_ready_future<>())
, _superuser(password_authenticator::default_superuser(qp.db().get_config()))
{}
std::string_view standard_role_manager::qualified_java_name() const noexcept {
@@ -138,79 +90,12 @@ const resource_set& standard_role_manager::protected_resources() const {
return resources;
}
future<> standard_role_manager::create_legacy_metadata_tables_if_missing() const {
static const sstring create_roles_query = fmt::format(
"CREATE TABLE {}.{} ("
" {} text PRIMARY KEY,"
" can_login boolean,"
" is_superuser boolean,"
" member_of set<text>,"
" salted_hash text"
")",
meta::legacy::AUTH_KS,
meta::roles_table::name,
meta::roles_table::role_col_name);
static const sstring create_role_members_query = fmt::format(
"CREATE TABLE {}.{} ("
" role text,"
" member text,"
" PRIMARY KEY (role, member)"
")",
meta::legacy::AUTH_KS,
ROLE_MEMBERS_CF);
static const sstring create_role_attributes_query = seastar::format(
"CREATE TABLE {}.{} ("
" role text,"
" name text,"
" value text,"
" PRIMARY KEY(role, name)"
")",
meta::legacy::AUTH_KS,
ROLE_ATTRIBUTES_CF);
return when_all_succeed(
create_legacy_metadata_table_if_missing(
meta::roles_table::name,
_qp,
create_roles_query,
_migration_manager),
create_legacy_metadata_table_if_missing(
ROLE_MEMBERS_CF,
_qp,
create_role_members_query,
_migration_manager),
create_legacy_metadata_table_if_missing(
ROLE_ATTRIBUTES_CF,
_qp,
create_role_attributes_query,
_migration_manager)).discard_result();
}
future<> standard_role_manager::legacy_create_default_role_if_missing() {
try {
const auto exists = co_await legacy::default_role_row_satisfies(_qp, &has_can_login, _superuser);
if (exists) {
co_return;
}
const sstring query = seastar::format("INSERT INTO {}.{} ({}, is_superuser, can_login) VALUES (?, true, true)",
meta::legacy::AUTH_KS,
meta::roles_table::name,
meta::roles_table::role_col_name);
co_await _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state(),
{_superuser},
cql3::query_processor::cache_internal::no).discard_result();
log.info("Created default superuser role '{}'.", _superuser);
} catch (const exceptions::unavailable_exception& e) {
log.warn("Skipped default role setup: some nodes were not ready; will retry");
throw e;
}
}
future<> standard_role_manager::maybe_create_default_role() {
if (default_superuser(_qp).empty()) {
co_return;
}
auto has_superuser = [this] () -> future<bool> {
const sstring query = seastar::format("SELECT * FROM {}.{} WHERE is_superuser = true ALLOW FILTERING", get_auth_ks_name(_qp), meta::roles_table::name);
const sstring query = seastar::format("SELECT * FROM {}.{} WHERE is_superuser = true ALLOW FILTERING", db::system_keyspace::NAME, meta::roles_table::name);
auto results = co_await _qp.execute_internal(query, db::consistency_level::LOCAL_ONE,
internal_distributed_query_state(), cql3::query_processor::cache_internal::yes);
for (const auto& result : *results) {
@@ -234,12 +119,12 @@ future<> standard_role_manager::maybe_create_default_role() {
// There is no superuser which has can_login field - create default role.
// Note that we don't check if can_login is set to true.
const sstring insert_query = seastar::format("INSERT INTO {}.{} ({}, is_superuser, can_login) VALUES (?, true, true)",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
meta::roles_table::name,
meta::roles_table::role_col_name);
co_await collect_mutations(_qp, batch, insert_query, {_superuser});
co_await collect_mutations(_qp, batch, insert_query, {default_superuser(_qp)});
co_await std::move(batch).commit(_group0_client, _as, get_raft_timeout());
log.info("Created default superuser role '{}'.", _superuser);
log.info("Created default superuser role '{}'.", default_superuser(_qp));
}
future<> standard_role_manager::maybe_create_default_role_with_retries() {
@@ -262,78 +147,12 @@ future<> standard_role_manager::maybe_create_default_role_with_retries() {
}
}
static const sstring legacy_table_name{"users"};
bool standard_role_manager::legacy_metadata_exists() {
return _qp.db().has_schema(meta::legacy::AUTH_KS, legacy_table_name);
}
future<> standard_role_manager::migrate_legacy_metadata() {
log.info("Starting migration of legacy user metadata.");
static const sstring query = seastar::format("SELECT * FROM {}.{}", meta::legacy::AUTH_KS, legacy_table_name);
return _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state(),
cql3::query_processor::cache_internal::no).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
role_config config;
config.is_superuser = row.get_or<bool>("super", false);
config.can_login = true;
return do_with(
row.get_as<sstring>("name"),
std::move(config),
::service::group0_batch::unused(),
[this](const auto& name, const auto& config, auto& mc) {
return create_or_replace(meta::legacy::AUTH_KS, name, config, mc);
});
}).finally([results] {});
}).then([] {
log.info("Finished migrating legacy user metadata.");
}).handle_exception([](std::exception_ptr ep) {
log.error("Encountered an error during migration!");
std::rethrow_exception(ep);
});
}
future<> standard_role_manager::start() {
return once_among_shards([this] () -> future<> {
if (legacy_mode(_qp)) {
co_await create_legacy_metadata_tables_if_missing();
}
auto handler = [this] () -> future<> {
const bool legacy = legacy_mode(_qp);
if (legacy) {
if (!_superuser_created_promise.available()) {
// Counterintuitively, we mark promise as ready before any startup work
// because wait_for_schema_agreement() below will block indefinitely
// without cluster majority. In that case, blocking node startup
// would lead to a cluster deadlock.
_superuser_created_promise.set_value();
}
co_await _migration_manager.wait_for_schema_agreement(_qp.db().real_database(), db::timeout_clock::time_point::max(), &_as);
if (co_await legacy::any_nondefault_role_row_satisfies(_qp, &has_can_login)) {
if (legacy_metadata_exists()) {
log.warn("Ignoring legacy user metadata since nondefault roles already exist.");
}
co_return;
}
if (legacy_metadata_exists()) {
co_await migrate_legacy_metadata();
co_return;
}
co_await legacy_create_default_role_if_missing();
}
if (!legacy) {
co_await maybe_create_default_role_with_retries();
if (!_superuser_created_promise.available()) {
_superuser_created_promise.set_value();
}
co_await maybe_create_default_role_with_retries();
if (!_superuser_created_promise.available()) {
_superuser_created_promise.set_value();
}
};
@@ -352,21 +171,12 @@ future<> standard_role_manager::ensure_superuser_is_created() {
return _superuser_created_promise.get_shared_future();
}
future<> standard_role_manager::create_or_replace(std::string_view auth_ks_name, std::string_view role_name, const role_config& c, ::service::group0_batch& mc) {
future<> standard_role_manager::create_or_replace(std::string_view role_name, const role_config& c, ::service::group0_batch& mc) {
const sstring query = seastar::format("INSERT INTO {}.{} ({}, is_superuser, can_login) VALUES (?, ?, ?)",
auth_ks_name,
db::system_keyspace::NAME,
meta::roles_table::name,
meta::roles_table::role_col_name);
if (auth_ks_name == meta::legacy::AUTH_KS) {
co_await _qp.execute_internal(
query,
consistency_for_role(role_name),
internal_distributed_query_state(),
{sstring(role_name), c.is_superuser, c.can_login},
cql3::query_processor::cache_internal::yes).discard_result();
} else {
co_await collect_mutations(_qp, mc, query, {sstring(role_name), c.is_superuser, c.can_login});
}
co_await collect_mutations(_qp, mc, query, {sstring(role_name), c.is_superuser, c.can_login});
}
future<>
@@ -376,7 +186,7 @@ standard_role_manager::create(std::string_view role_name, const role_config& c,
throw role_already_exists(role_name);
}
return create_or_replace(get_auth_ks_name(_qp), role_name, c, mc);
return create_or_replace(role_name, c, mc);
});
}
@@ -401,20 +211,11 @@ standard_role_manager::alter(std::string_view role_name, const role_config_updat
return make_ready_future<>();
}
const sstring query = seastar::format("UPDATE {}.{} SET {} WHERE {} = ?",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
meta::roles_table::name,
build_column_assignments(u),
meta::roles_table::role_col_name);
if (legacy_mode(_qp)) {
return _qp.execute_internal(
std::move(query),
consistency_for_role(role_name),
internal_distributed_query_state(),
{sstring(role_name)},
cql3::query_processor::cache_internal::no).discard_result();
} else {
return collect_mutations(_qp, mc, std::move(query), {sstring(role_name)});
}
return collect_mutations(_qp, mc, std::move(query), {sstring(role_name)});
});
}
@@ -425,11 +226,11 @@ future<> standard_role_manager::drop(std::string_view role_name, ::service::grou
// First, revoke this role from all roles that are members of it.
const auto revoke_from_members = [this, role_name, &mc] () -> future<> {
const sstring query = seastar::format("SELECT member FROM {}.{} WHERE role = ?",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
ROLE_MEMBERS_CF);
const auto members = co_await _qp.execute_internal(
query,
consistency_for_role(role_name),
db::consistency_level::LOCAL_ONE,
internal_distributed_query_state(),
{sstring(role_name)},
cql3::query_processor::cache_internal::no);
@@ -457,102 +258,33 @@ future<> standard_role_manager::drop(std::string_view role_name, ::service::grou
// Delete all attributes for that role
const auto remove_attributes_of = [this, role_name, &mc] () -> future<> {
const sstring query = seastar::format("DELETE FROM {}.{} WHERE role = ?",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
ROLE_ATTRIBUTES_CF);
if (legacy_mode(_qp)) {
co_await _qp.execute_internal(query, {sstring(role_name)},
cql3::query_processor::cache_internal::yes).discard_result();
} else {
co_await collect_mutations(_qp, mc, query, {sstring(role_name)});
}
co_await collect_mutations(_qp, mc, query, {sstring(role_name)});
};
// Finally, delete the role itself.
const auto delete_role = [this, role_name, &mc] () -> future<> {
const sstring query = seastar::format("DELETE FROM {}.{} WHERE {} = ?",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
meta::roles_table::name,
meta::roles_table::role_col_name);
if (legacy_mode(_qp)) {
co_await _qp.execute_internal(
query,
consistency_for_role(role_name),
internal_distributed_query_state(),
{sstring(role_name)},
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_await collect_mutations(_qp, mc, query, {sstring(role_name)});
}
co_await collect_mutations(_qp, mc, query, {sstring(role_name)});
};
co_await when_all_succeed(revoke_from_members, revoke_members_of, remove_attributes_of);
co_await delete_role();
}
future<>
standard_role_manager::legacy_modify_membership(
std::string_view grantee_name,
std::string_view role_name,
membership_change ch) {
const auto modify_roles = [this, role_name, grantee_name, ch] () -> future<> {
const auto query = seastar::format(
"UPDATE {}.{} SET member_of = member_of {} ? WHERE {} = ?",
get_auth_ks_name(_qp),
meta::roles_table::name,
(ch == membership_change::add ? '+' : '-'),
meta::roles_table::role_col_name);
co_await _qp.execute_internal(
query,
consistency_for_role(grantee_name),
internal_distributed_query_state(),
{role_set{sstring(role_name)}, sstring(grantee_name)},
cql3::query_processor::cache_internal::no).discard_result();
};
const auto modify_role_members = [this, role_name, grantee_name, ch] () -> future<> {
switch (ch) {
case membership_change::add: {
const sstring insert_query = seastar::format("INSERT INTO {}.{} (role, member) VALUES (?, ?)",
get_auth_ks_name(_qp),
ROLE_MEMBERS_CF);
co_return co_await _qp.execute_internal(
insert_query,
consistency_for_role(role_name),
internal_distributed_query_state(),
{sstring(role_name), sstring(grantee_name)},
cql3::query_processor::cache_internal::no).discard_result();
}
case membership_change::remove: {
const sstring delete_query = seastar::format("DELETE FROM {}.{} WHERE role = ? AND member = ?",
get_auth_ks_name(_qp),
ROLE_MEMBERS_CF);
co_return co_await _qp.execute_internal(
delete_query,
consistency_for_role(role_name),
internal_distributed_query_state(),
{sstring(role_name), sstring(grantee_name)},
cql3::query_processor::cache_internal::no).discard_result();
}
}
};
co_await when_all_succeed(modify_roles, modify_role_members).discard_result();
}
future<>
standard_role_manager::modify_membership(
std::string_view grantee_name,
std::string_view role_name,
membership_change ch,
::service::group0_batch& mc) {
if (legacy_mode(_qp)) {
co_return co_await legacy_modify_membership(grantee_name, role_name, ch);
}
const auto modify_roles = seastar::format(
"UPDATE {}.{} SET member_of = member_of {} ? WHERE {} = ?",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
meta::roles_table::name,
(ch == membership_change::add ? '+' : '-'),
meta::roles_table::role_col_name);
@@ -563,12 +295,12 @@ standard_role_manager::modify_membership(
switch (ch) {
case membership_change::add:
modify_role_members = seastar::format("INSERT INTO {}.{} (role, member) VALUES (?, ?)",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
ROLE_MEMBERS_CF);
break;
case membership_change::remove:
modify_role_members = seastar::format("DELETE FROM {}.{} WHERE role = ? AND member = ?",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
ROLE_MEMBERS_CF);
break;
default:
@@ -660,57 +392,21 @@ future<role_set> standard_role_manager::query_granted(std::string_view grantee_n
}
future<role_to_directly_granted_map> standard_role_manager::query_all_directly_granted(::service::query_state& qs) {
const sstring query = seastar::format("SELECT * FROM {}.{}",
get_auth_ks_name(_qp),
ROLE_MEMBERS_CF);
const auto results = co_await _qp.execute_internal(
query,
db::consistency_level::ONE,
qs,
cql3::query_processor::cache_internal::yes);
role_to_directly_granted_map roles_map;
std::transform(
results->begin(),
results->end(),
std::inserter(roles_map, roles_map.begin()),
[] (const cql3::untyped_result_set_row& row) {
return std::make_pair(row.get_as<sstring>("member"), row.get_as<sstring>("role")); }
);
_cache.for_each_role([&roles_map] (const cache::role_name_t& name, const cache::role_record& record) {
for (const auto& granted_role : record.member_of) {
roles_map.emplace(name, granted_role);
}
});
co_return roles_map;
}
future<role_set> standard_role_manager::query_all(::service::query_state& qs) {
const sstring query = seastar::format("SELECT {} FROM {}.{}",
meta::roles_table::role_col_name,
get_auth_ks_name(_qp),
meta::roles_table::name);
// To avoid many copies of a view.
static const auto role_col_name_string = sstring(meta::roles_table::role_col_name);
if (utils::get_local_injector().enter("standard_role_manager_fail_legacy_query")) {
if (legacy_mode(_qp)) {
throw std::runtime_error("standard_role_manager::query_all: failed due to error injection");
}
}
const auto results = co_await _qp.execute_internal(
query,
db::consistency_level::QUORUM,
qs,
cql3::query_processor::cache_internal::yes);
role_set roles;
std::transform(
results->begin(),
results->end(),
std::inserter(roles, roles.begin()),
[] (const cql3::untyped_result_set_row& row) {
return row.get_as<sstring>(role_col_name_string);}
);
roles.reserve(_cache.roles_count());
_cache.for_each_role([&roles] (const cache::role_name_t& name, const cache::role_record&) {
roles.insert(name);
});
co_return roles;
}
@@ -733,31 +429,26 @@ future<bool> standard_role_manager::can_login(std::string_view role_name) {
}
future<std::optional<sstring>> standard_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state& qs) {
const sstring query = seastar::format("SELECT name, value FROM {}.{} WHERE role = ? AND name = ?",
get_auth_ks_name(_qp),
ROLE_ATTRIBUTES_CF);
const auto result_set = co_await _qp.execute_internal(query, db::consistency_level::ONE, qs, {sstring(role_name), sstring(attribute_name)}, cql3::query_processor::cache_internal::yes);
if (!result_set->empty()) {
const cql3::untyped_result_set_row &row = result_set->one();
co_return std::optional<sstring>(row.get_as<sstring>("value"));
auto role = _cache.get(role_name);
if (!role) {
co_return std::nullopt;
}
co_return std::optional<sstring>{};
auto it = role->attributes.find(attribute_name);
if (it != role->attributes.end()) {
co_return it->second;
}
co_return std::nullopt;
}
future<role_manager::attribute_vals> standard_role_manager::query_attribute_for_all (std::string_view attribute_name, ::service::query_state& qs) {
return query_all(qs).then([this, attribute_name, &qs] (role_set roles) {
return do_with(attribute_vals{}, [this, attribute_name, roles = std::move(roles), &qs] (attribute_vals &role_to_att_val) {
return parallel_for_each(roles.begin(), roles.end(), [this, &role_to_att_val, attribute_name, &qs] (sstring role) {
return get_attribute(role, attribute_name, qs).then([&role_to_att_val, role] (std::optional<sstring> att_val) {
if (att_val) {
role_to_att_val.emplace(std::move(role), std::move(*att_val));
}
});
}).then([&role_to_att_val] () {
return make_ready_future<attribute_vals>(std::move(role_to_att_val));
});
});
future<role_manager::attribute_vals> standard_role_manager::query_attribute_for_all(std::string_view attribute_name, ::service::query_state& qs) {
attribute_vals result;
_cache.for_each_role([&result, attribute_name] (const cache::role_name_t& name, const cache::role_record& record) {
auto it = record.attributes.find(attribute_name);
if (it != record.attributes.end()) {
result.emplace(name, it->second);
}
});
co_return result;
}
future<> standard_role_manager::set_attribute(std::string_view role_name, std::string_view attribute_name, std::string_view attribute_value, ::service::group0_batch& mc) {
@@ -765,14 +456,10 @@ future<> standard_role_manager::set_attribute(std::string_view role_name, std::s
throw auth::nonexistant_role(role_name);
}
const sstring query = seastar::format("INSERT INTO {}.{} (role, name, value) VALUES (?, ?, ?)",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
ROLE_ATTRIBUTES_CF);
if (legacy_mode(_qp)) {
co_await _qp.execute_internal(query, {sstring(role_name), sstring(attribute_name), sstring(attribute_value)}, cql3::query_processor::cache_internal::yes).discard_result();
} else {
co_await collect_mutations(_qp, mc, query,
{sstring(role_name), sstring(attribute_name), sstring(attribute_value)});
}
co_await collect_mutations(_qp, mc, query,
{sstring(role_name), sstring(attribute_name), sstring(attribute_value)});
}
future<> standard_role_manager::remove_attribute(std::string_view role_name, std::string_view attribute_name, ::service::group0_batch& mc) {
@@ -780,14 +467,10 @@ future<> standard_role_manager::remove_attribute(std::string_view role_name, std
throw auth::nonexistant_role(role_name);
}
const sstring query = seastar::format("DELETE FROM {}.{} WHERE role = ? AND name = ?",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
ROLE_ATTRIBUTES_CF);
if (legacy_mode(_qp)) {
co_await _qp.execute_internal(query, {sstring(role_name), sstring(attribute_name)}, cql3::query_processor::cache_internal::yes).discard_result();
} else {
co_await collect_mutations(_qp, mc, query,
{sstring(role_name), sstring(attribute_name)});
}
co_await collect_mutations(_qp, mc, query,
{sstring(role_name), sstring(attribute_name)});
}
future<std::vector<cql3::description>> standard_role_manager::describe_role_grants() {

View File

@@ -40,7 +40,6 @@ class standard_role_manager final : public role_manager {
cache& _cache;
future<> _stopped;
abort_source _as;
std::string _superuser;
shared_promise<> _superuser_created_promise;
public:
@@ -97,24 +96,13 @@ private:
role_set member_of;
};
future<> create_legacy_metadata_tables_if_missing() const;
bool legacy_metadata_exists();
future<> migrate_legacy_metadata();
future<> legacy_create_default_role_if_missing();
future<> maybe_create_default_role();
future<> maybe_create_default_role_with_retries();
future<> create_or_replace(std::string_view auth_ks_name, std::string_view role_name, const role_config&, ::service::group0_batch&);
future<> legacy_modify_membership(std::string_view role_name, std::string_view grantee_name, membership_change);
future<> create_or_replace(std::string_view role_name, const role_config&, ::service::group0_batch&);
future<> modify_membership(std::string_view role_name, std::string_view grantee_name, membership_change, ::service::group0_batch& mc);
future<std::optional<record>> legacy_find_record(std::string_view role_name);
future<std::optional<record>> find_record(std::string_view role_name);
future<record> require_record(std::string_view role_name);
future<> collect_roles(

View File

@@ -8,244 +8,200 @@
* SPDX-License-Identifier: (LicenseRef-ScyllaDB-Source-Available-1.0 and Apache-2.0)
*/
#include "auth/transitional.hh"
#include "auth/authenticated_user.hh"
#include "auth/authenticator.hh"
#include "auth/authorizer.hh"
#include "auth/default_authorizer.hh"
#include "auth/password_authenticator.hh"
#include "auth/cache.hh"
#include "auth/permission.hh"
#include "service/raft/raft_group0_client.hh"
#include "utils/class_registrator.hh"
namespace auth {
static const sstring PACKAGE_NAME("com.scylladb.auth.");
static const sstring& transitional_authenticator_name() {
static const sstring name = PACKAGE_NAME + "TransitionalAuthenticator";
return name;
transitional_authenticator::transitional_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm, cache& cache)
: transitional_authenticator(std::make_unique<password_authenticator>(qp, g0, mm, cache)) {
}
static const sstring& transitional_authorizer_name() {
static const sstring name = PACKAGE_NAME + "TransitionalAuthorizer";
return name;
transitional_authenticator::transitional_authenticator(std::unique_ptr<authenticator> a)
: _authenticator(std::move(a)) {
}
class transitional_authenticator : public authenticator {
std::unique_ptr<authenticator> _authenticator;
future<> transitional_authenticator::start() {
return _authenticator->start();
}
public:
static const sstring PASSWORD_AUTHENTICATOR_NAME;
future<> transitional_authenticator::stop() {
return _authenticator->stop();
}
transitional_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm, cache& cache)
: transitional_authenticator(std::make_unique<password_authenticator>(qp, g0, mm, cache)) {
std::string_view transitional_authenticator::qualified_java_name() const {
return "com.scylladb.auth.TransitionalAuthenticator";
}
bool transitional_authenticator::require_authentication() const {
return true;
}
authentication_option_set transitional_authenticator::supported_options() const {
return _authenticator->supported_options();
}
authentication_option_set transitional_authenticator::alterable_options() const {
return _authenticator->alterable_options();
}
future<authenticated_user> transitional_authenticator::authenticate(const credentials_map& credentials) const {
auto i = credentials.find(authenticator::USERNAME_KEY);
if ((i == credentials.end() || i->second.empty())
&& (!credentials.contains(PASSWORD_KEY) || credentials.at(PASSWORD_KEY).empty())) {
// return anon user
return make_ready_future<authenticated_user>(anonymous_user());
}
transitional_authenticator(std::unique_ptr<authenticator> a)
: _authenticator(std::move(a)) {
}
virtual future<> start() override {
return _authenticator->start();
}
virtual future<> stop() override {
return _authenticator->stop();
}
virtual std::string_view qualified_java_name() const override {
return transitional_authenticator_name();
}
virtual bool require_authentication() const override {
return true;
}
virtual authentication_option_set supported_options() const override {
return _authenticator->supported_options();
}
virtual authentication_option_set alterable_options() const override {
return _authenticator->alterable_options();
}
virtual future<authenticated_user> authenticate(const credentials_map& credentials) const override {
auto i = credentials.find(authenticator::USERNAME_KEY);
if ((i == credentials.end() || i->second.empty())
&& (!credentials.contains(PASSWORD_KEY) || credentials.at(PASSWORD_KEY).empty())) {
return make_ready_future().then([this, &credentials] {
return _authenticator->authenticate(credentials);
}).handle_exception([](auto ep) {
try {
std::rethrow_exception(ep);
} catch (const exceptions::authentication_exception&) {
// return anon user
return make_ready_future<authenticated_user>(anonymous_user());
}
return make_ready_future().then([this, &credentials] {
return _authenticator->authenticate(credentials);
}).handle_exception([](auto ep) {
try {
std::rethrow_exception(ep);
} catch (const exceptions::authentication_exception&) {
// return anon user
return make_ready_future<authenticated_user>(anonymous_user());
}
});
}
virtual future<> create(std::string_view role_name, const authentication_options& options, ::service::group0_batch& mc) override {
return _authenticator->create(role_name, options, mc);
}
virtual future<> alter(std::string_view role_name, const authentication_options& options, ::service::group0_batch& mc) override {
return _authenticator->alter(role_name, options, mc);
}
virtual future<> drop(std::string_view role_name, ::service::group0_batch& mc) override {
return _authenticator->drop(role_name, mc);
}
virtual future<custom_options> query_custom_options(std::string_view role_name) const override {
return _authenticator->query_custom_options(role_name);
}
virtual bool uses_password_hashes() const override {
return _authenticator->uses_password_hashes();
}
virtual future<std::optional<sstring>> get_password_hash(std::string_view role_name) const override {
return _authenticator->get_password_hash(role_name);
}
virtual const resource_set& protected_resources() const override {
return _authenticator->protected_resources();
}
virtual ::shared_ptr<sasl_challenge> new_sasl_challenge() const override {
class sasl_wrapper : public sasl_challenge {
public:
sasl_wrapper(::shared_ptr<sasl_challenge> sasl)
: _sasl(std::move(sasl)) {
}
virtual bytes evaluate_response(bytes_view client_response) override {
try {
return _sasl->evaluate_response(client_response);
} catch (const exceptions::authentication_exception&) {
_complete = true;
return {};
}
}
virtual bool is_complete() const override {
return _complete || _sasl->is_complete();
}
virtual future<authenticated_user> get_authenticated_user() const override {
return futurize_invoke([this] {
return _sasl->get_authenticated_user().handle_exception([](auto ep) {
try {
std::rethrow_exception(ep);
} catch (const exceptions::authentication_exception&) {
// return anon user
return make_ready_future<authenticated_user>(anonymous_user());
}
});
});
}
const sstring& get_username() const override {
return _sasl->get_username();
}
private:
::shared_ptr<sasl_challenge> _sasl;
bool _complete = false;
};
return ::make_shared<sasl_wrapper>(_authenticator->new_sasl_challenge());
}
virtual future<> ensure_superuser_is_created() const override {
return _authenticator->ensure_superuser_is_created();
}
};
class transitional_authorizer : public authorizer {
std::unique_ptr<authorizer> _authorizer;
public:
transitional_authorizer(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm)
: transitional_authorizer(std::make_unique<default_authorizer>(qp, g0, mm)) {
}
transitional_authorizer(std::unique_ptr<authorizer> a)
: _authorizer(std::move(a)) {
}
~transitional_authorizer() {
}
virtual future<> start() override {
return _authorizer->start();
}
virtual future<> stop() override {
return _authorizer->stop();
}
virtual std::string_view qualified_java_name() const override {
return transitional_authorizer_name();
}
virtual future<permission_set> authorize(const role_or_anonymous&, const resource&) const override {
static const permission_set transitional_permissions =
permission_set::of<
permission::CREATE,
permission::ALTER,
permission::DROP,
permission::SELECT,
permission::MODIFY>();
return make_ready_future<permission_set>(transitional_permissions);
}
virtual future<> grant(std::string_view s, permission_set ps, const resource& r, ::service::group0_batch& mc) override {
return _authorizer->grant(s, std::move(ps), r, mc);
}
virtual future<> revoke(std::string_view s, permission_set ps, const resource& r, ::service::group0_batch& mc) override {
return _authorizer->revoke(s, std::move(ps), r, mc);
}
virtual future<std::vector<permission_details>> list_all() const override {
return _authorizer->list_all();
}
virtual future<> revoke_all(std::string_view s, ::service::group0_batch& mc) override {
return _authorizer->revoke_all(s, mc);
}
virtual future<> revoke_all(const resource& r, ::service::group0_batch& mc) override {
return _authorizer->revoke_all(r, mc);
}
virtual const resource_set& protected_resources() const override {
return _authorizer->protected_resources();
}
};
});
}
//
// To ensure correct initialization order, we unfortunately need to use string literals.
//
future<> transitional_authenticator::create(std::string_view role_name, const authentication_options& options, ::service::group0_batch& mc) {
return _authenticator->create(role_name, options, mc);
}
static const class_registrator<
auth::authenticator,
auth::transitional_authenticator,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&,
auth::cache&> transitional_authenticator_reg(auth::PACKAGE_NAME + "TransitionalAuthenticator");
future<> transitional_authenticator::alter(std::string_view role_name, const authentication_options& options, ::service::group0_batch& mc) {
return _authenticator->alter(role_name, options, mc);
}
static const class_registrator<
auth::authorizer,
auth::transitional_authorizer,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&> transitional_authorizer_reg(auth::PACKAGE_NAME + "TransitionalAuthorizer");
future<> transitional_authenticator::drop(std::string_view role_name, ::service::group0_batch& mc) {
return _authenticator->drop(role_name, mc);
}
future<custom_options> transitional_authenticator::query_custom_options(std::string_view role_name) const {
return _authenticator->query_custom_options(role_name);
}
bool transitional_authenticator::uses_password_hashes() const {
return _authenticator->uses_password_hashes();
}
future<std::optional<sstring>> transitional_authenticator::get_password_hash(std::string_view role_name) const {
return _authenticator->get_password_hash(role_name);
}
const resource_set& transitional_authenticator::protected_resources() const {
return _authenticator->protected_resources();
}
::shared_ptr<sasl_challenge> transitional_authenticator::new_sasl_challenge() const {
class sasl_wrapper : public sasl_challenge {
public:
sasl_wrapper(::shared_ptr<sasl_challenge> sasl)
: _sasl(std::move(sasl)) {
}
virtual bytes evaluate_response(bytes_view client_response) override {
try {
return _sasl->evaluate_response(client_response);
} catch (const exceptions::authentication_exception&) {
_complete = true;
return {};
}
}
virtual bool is_complete() const override {
return _complete || _sasl->is_complete();
}
virtual future<authenticated_user> get_authenticated_user() const override {
return futurize_invoke([this] {
return _sasl->get_authenticated_user().handle_exception([](auto ep) {
try {
std::rethrow_exception(ep);
} catch (const exceptions::authentication_exception&) {
// return anon user
return make_ready_future<authenticated_user>(anonymous_user());
}
});
});
}
const sstring& get_username() const override {
return _sasl->get_username();
}
private:
::shared_ptr<sasl_challenge> _sasl;
bool _complete = false;
};
return ::make_shared<sasl_wrapper>(_authenticator->new_sasl_challenge());
}
future<> transitional_authenticator::ensure_superuser_is_created() const {
return _authenticator->ensure_superuser_is_created();
}
transitional_authorizer::transitional_authorizer(cql3::query_processor& qp)
: transitional_authorizer(std::make_unique<default_authorizer>(qp)) {
}
transitional_authorizer::transitional_authorizer(std::unique_ptr<authorizer> a)
: _authorizer(std::move(a)) {
}
transitional_authorizer::~transitional_authorizer() {
}
future<> transitional_authorizer::start() {
return _authorizer->start();
}
future<> transitional_authorizer::stop() {
return _authorizer->stop();
}
std::string_view transitional_authorizer::qualified_java_name() const {
return "com.scylladb.auth.TransitionalAuthorizer";
}
future<permission_set> transitional_authorizer::authorize(const role_or_anonymous&, const resource&) const {
static const permission_set transitional_permissions =
permission_set::of<
permission::CREATE,
permission::ALTER,
permission::DROP,
permission::SELECT,
permission::MODIFY>();
return make_ready_future<permission_set>(transitional_permissions);
}
future<> transitional_authorizer::grant(std::string_view s, permission_set ps, const resource& r, ::service::group0_batch& mc) {
return _authorizer->grant(s, std::move(ps), r, mc);
}
future<> transitional_authorizer::revoke(std::string_view s, permission_set ps, const resource& r, ::service::group0_batch& mc) {
return _authorizer->revoke(s, std::move(ps), r, mc);
}
future<std::vector<permission_details>> transitional_authorizer::list_all() const {
return _authorizer->list_all();
}
future<> transitional_authorizer::revoke_all(std::string_view s, ::service::group0_batch& mc) {
return _authorizer->revoke_all(s, mc);
}
future<> transitional_authorizer::revoke_all(const resource& r, ::service::group0_batch& mc) {
return _authorizer->revoke_all(r, mc);
}
const resource_set& transitional_authorizer::protected_resources() const {
return _authorizer->protected_resources();
}
}

81
auth/transitional.hh Normal file
View File

@@ -0,0 +1,81 @@
/*
* Copyright (C) 2026-present ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* SPDX-License-Identifier: (LicenseRef-ScyllaDB-Source-Available-1.0 and Apache-2.0)
*/
#pragma once
#include "auth/authenticator.hh"
#include "auth/authorizer.hh"
#include "auth/cache.hh"
namespace cql3 {
class query_processor;
}
namespace service {
class raft_group0_client;
class migration_manager;
}
namespace auth {
///
/// Transitional authenticator that allows anonymous access when credentials are not provided
/// or authentication fails. Used for migration scenarios.
///
class transitional_authenticator : public authenticator {
std::unique_ptr<authenticator> _authenticator;
public:
transitional_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm, cache& cache);
transitional_authenticator(std::unique_ptr<authenticator> a);
virtual future<> start() override;
virtual future<> stop() override;
virtual std::string_view qualified_java_name() const override;
virtual bool require_authentication() const override;
virtual authentication_option_set supported_options() const override;
virtual authentication_option_set alterable_options() const override;
virtual future<authenticated_user> authenticate(const credentials_map& credentials) const override;
virtual future<> create(std::string_view role_name, const authentication_options& options, ::service::group0_batch& mc) override;
virtual future<> alter(std::string_view role_name, const authentication_options& options, ::service::group0_batch& mc) override;
virtual future<> drop(std::string_view role_name, ::service::group0_batch& mc) override;
virtual future<custom_options> query_custom_options(std::string_view role_name) const override;
virtual bool uses_password_hashes() const override;
virtual future<std::optional<sstring>> get_password_hash(std::string_view role_name) const override;
virtual const resource_set& protected_resources() const override;
virtual ::shared_ptr<sasl_challenge> new_sasl_challenge() const override;
virtual future<> ensure_superuser_is_created() const override;
};
///
/// Transitional authorizer that grants a fixed set of permissions to all users.
/// Used for migration scenarios.
///
class transitional_authorizer : public authorizer {
std::unique_ptr<authorizer> _authorizer;
public:
transitional_authorizer(cql3::query_processor& qp);
transitional_authorizer(std::unique_ptr<authorizer> a);
~transitional_authorizer();
virtual future<> start() override;
virtual future<> stop() override;
virtual std::string_view qualified_java_name() const override;
virtual future<permission_set> authorize(const role_or_anonymous&, const resource&) const override;
virtual future<> grant(std::string_view s, permission_set ps, const resource& r, ::service::group0_batch& mc) override;
virtual future<> revoke(std::string_view s, permission_set ps, const resource& r, ::service::group0_batch& mc) override;
virtual future<std::vector<permission_details>> list_all() const override;
virtual future<> revoke_all(std::string_view s, ::service::group0_batch& mc) override;
virtual future<> revoke_all(const resource& r, ::service::group0_batch& mc) override;
virtual const resource_set& protected_resources() const override;
};
} // namespace auth

View File

@@ -10,24 +10,15 @@
#include <random>
#include <unordered_set>
#include <algorithm>
#include <seastar/core/sleep.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include <seastar/util/later.hh>
#include "gms/endpoint_state.hh"
#include "gms/versioned_value.hh"
#include "keys/keys.hh"
#include "replica/database.hh"
#include "db/system_keyspace.hh"
#include "db/system_distributed_keyspace.hh"
#include "dht/token-sharding.hh"
#include "locator/token_metadata.hh"
#include "types/set.hh"
#include "gms/application_state.hh"
#include "gms/inet_address.hh"
#include "gms/gossiper.hh"
#include "gms/feature_service.hh"
#include "utils/assert.hh"
#include "utils/error_injection.hh"
#include "utils/UUID_gen.hh"
@@ -41,16 +32,6 @@
extern logging::logger cdc_log;
static int get_shard_count(const locator::host_id& endpoint, const gms::gossiper& g) {
auto ep_state = g.get_application_state_ptr(endpoint, gms::application_state::SHARD_COUNT);
return ep_state ? std::stoi(ep_state->value()) : -1;
}
static unsigned get_sharding_ignore_msb(const locator::host_id& endpoint, const gms::gossiper& g) {
auto ep_state = g.get_application_state_ptr(endpoint, gms::application_state::IGNORE_MSB_BITS);
return ep_state ? std::stoi(ep_state->value()) : 0;
}
namespace db {
extern thread_local data_type cdc_streams_set_type;
}
@@ -225,12 +206,6 @@ static std::vector<stream_id> create_stream_ids(
return result;
}
bool should_propose_first_generation(const locator::host_id& my_host_id, const gms::gossiper& g) {
return g.for_each_endpoint_state_until([&] (const gms::endpoint_state& eps) {
return stop_iteration(my_host_id < eps.get_host_id());
}) == stop_iteration::no;
}
bool is_cdc_generation_optimal(const cdc::topology_description& gen, const locator::token_metadata& tm) {
if (tm.sorted_tokens().size() != gen.entries().size()) {
// We probably have garbage streams from old generations
@@ -330,38 +305,6 @@ future<utils::chunked_vector<mutation>> get_cdc_generation_mutations_v3(
co_return co_await get_common_cdc_generation_mutations(s, pkey, std::move(get_ckey), desc, mutation_size_threshold, ts);
}
// non-static for testing
size_t limit_of_streams_in_topology_description() {
// Each stream takes 16B and we don't want to exceed 4MB so we can have
// at most 262144 streams but not less than 1 per vnode.
return 4 * 1024 * 1024 / 16;
}
// non-static for testing
topology_description limit_number_of_streams_if_needed(topology_description&& desc) {
uint64_t streams_count = 0;
for (auto& tr_desc : desc.entries()) {
streams_count += tr_desc.streams.size();
}
size_t limit = std::max(limit_of_streams_in_topology_description(), desc.entries().size());
if (limit >= streams_count) {
return std::move(desc);
}
size_t streams_per_vnode_limit = limit / desc.entries().size();
auto entries = std::move(desc).entries();
auto start = entries.back().token_range_end;
for (size_t idx = 0; idx < entries.size(); ++idx) {
auto end = entries[idx].token_range_end;
if (entries[idx].streams.size() > streams_per_vnode_limit) {
entries[idx].streams =
create_stream_ids(idx, start, end, streams_per_vnode_limit, entries[idx].sharding_ignore_msb);
}
start = end;
}
return topology_description(std::move(entries));
}
// Compute a set of tokens that split the token ring into vnodes.
static auto get_tokens(const std::unordered_set<dht::token>& bootstrap_tokens, const locator::token_metadata_ptr tmptr) {
auto tokens = tmptr->sorted_tokens();
@@ -419,364 +362,6 @@ db_clock::time_point new_generation_timestamp(bool add_delay, std::chrono::milli
return ts;
}
future<cdc::generation_id> generation_service::legacy_make_new_generation(const std::unordered_set<dht::token>& bootstrap_tokens, bool add_delay) {
const locator::token_metadata_ptr tmptr = _token_metadata.get();
// Fetch sharding parameters for a node that owns vnode ending with this token
// using gossiped application states.
auto get_sharding_info = [&] (dht::token end) -> std::pair<size_t, uint8_t> {
if (bootstrap_tokens.contains(end)) {
return {smp::count, _cfg.ignore_msb_bits};
} else {
auto endpoint = tmptr->get_endpoint(end);
if (!endpoint) {
throw std::runtime_error(
format("Can't find endpoint for token {}", end));
}
auto sc = get_shard_count(*endpoint, _gossiper);
return {sc > 0 ? sc : 1, get_sharding_ignore_msb(*endpoint, _gossiper)};
}
};
auto uuid = utils::make_random_uuid();
auto gen = make_new_generation_description(bootstrap_tokens, get_sharding_info, tmptr);
// Our caller should ensure that there are normal tokens in the token ring.
auto normal_token_owners = tmptr->count_normal_token_owners();
SCYLLA_ASSERT(normal_token_owners);
if (_feature_service.cdc_generations_v2) {
cdc_log.info("Inserting new generation data at UUID {}", uuid);
// This may take a while.
co_await _sys_dist_ks.local().insert_cdc_generation(uuid, gen, { normal_token_owners });
// Begin the race.
cdc::generation_id_v2 gen_id{new_generation_timestamp(add_delay, _cfg.ring_delay), uuid};
cdc_log.info("New CDC generation: {}", gen_id);
co_return gen_id;
}
// The CDC_GENERATIONS_V2 feature is not enabled: some nodes may still not understand the V2 format.
// We must create a generation in the old format.
// If the cluster is large we may end up with a generation that contains
// large number of streams. This is problematic because we store the
// generation in a single row (V1 format). For a generation with large number of rows
// this will lead to a row that can be as big as 32MB. This is much more
// than the limit imposed by commitlog_segment_size_in_mb. If the size of
// the row that describes a new generation grows above
// commitlog_segment_size_in_mb, the write will fail and the new node won't
// be able to join. To avoid such problem we make sure that such row is
// always smaller than 4MB. We do that by removing some CDC streams from
// each vnode if the total number of streams is too large.
gen = limit_number_of_streams_if_needed(std::move(gen));
cdc_log.warn(
"Creating a new CDC generation in the old storage format due to a partially upgraded cluster:"
" the CDC_GENERATIONS_V2 feature is known by this node, but not enabled in the cluster."
" The old storage format forces us to create a suboptimal generation."
" It is recommended to finish the upgrade and then create a new generation either by bootstrapping"
" a new node or running the checkAndRepairCdcStreams nodetool command.");
// Begin the race.
cdc::generation_id_v1 gen_id{new_generation_timestamp(add_delay, _cfg.ring_delay)};
co_await _sys_dist_ks.local().insert_cdc_topology_description(gen_id, std::move(gen), { normal_token_owners });
cdc_log.info("New CDC generation: {}", gen_id);
co_return gen_id;
}
/* Retrieves CDC streams generation timestamp from the given endpoint's application state (broadcasted through gossip).
* We might be during a rolling upgrade, so the timestamp might not be there (if the other node didn't upgrade yet),
* but if the cluster already supports CDC, then every newly joining node will propose a new CDC generation,
* which means it will gossip the generation's timestamp.
*/
static std::optional<cdc::generation_id> get_generation_id_for(const locator::host_id& endpoint, const gms::endpoint_state& eps) {
const auto* gen_id_ptr = eps.get_application_state_ptr(gms::application_state::CDC_GENERATION_ID);
if (!gen_id_ptr) {
return std::nullopt;
}
auto gen_id_string = gen_id_ptr->value();
cdc_log.trace("endpoint={}, gen_id_string={}", endpoint, gen_id_string);
return gms::versioned_value::cdc_generation_id_from_string(gen_id_string);
}
static future<std::optional<cdc::topology_description>> retrieve_generation_data_v2(
cdc::generation_id_v2 id,
db::system_keyspace& sys_ks,
db::system_distributed_keyspace& sys_dist_ks) {
auto cdc_gen = co_await sys_dist_ks.read_cdc_generation(id.id);
if (!cdc_gen && id.id.is_timestamp()) {
// If we entered legacy mode due to recovery, we (or some other node)
// might gossip about a generation that was previously propagated
// through raft. If that's the case, it will sit in
// the system.cdc_generations_v3 table.
//
// If the provided id is not a timeuuid, we don't want to query
// the system.cdc_generations_v3 table. This table stores generation
// ids as timeuuids. If the provided id is not a timeuuid, the
// generation cannot be in system.cdc_generations_v3. Also, the query
// would fail with a marshaling error.
cdc_gen = co_await sys_ks.read_cdc_generation_opt(id.id);
}
co_return cdc_gen;
}
static future<std::optional<cdc::topology_description>> retrieve_generation_data(
cdc::generation_id gen_id,
db::system_keyspace& sys_ks,
db::system_distributed_keyspace& sys_dist_ks,
db::system_distributed_keyspace::context ctx) {
return std::visit(make_visitor(
[&] (const cdc::generation_id_v1& id) {
return sys_dist_ks.read_cdc_topology_description(id, ctx);
},
[&] (const cdc::generation_id_v2& id) {
return retrieve_generation_data_v2(id, sys_ks, sys_dist_ks);
}
), gen_id);
}
static future<> do_update_streams_description(
cdc::generation_id gen_id,
db::system_keyspace& sys_ks,
db::system_distributed_keyspace& sys_dist_ks,
db::system_distributed_keyspace::context ctx) {
if (co_await sys_dist_ks.cdc_desc_exists(get_ts(gen_id), ctx)) {
cdc_log.info("Generation {}: streams description table already updated.", gen_id);
co_return;
}
// We might race with another node also inserting the description, but that's ok. It's an idempotent operation.
auto topo = co_await retrieve_generation_data(gen_id, sys_ks, sys_dist_ks, ctx);
if (!topo) {
throw no_generation_data_exception(gen_id);
}
co_await sys_dist_ks.create_cdc_desc(get_ts(gen_id), *topo, ctx);
cdc_log.info("CDC description table successfully updated with generation {}.", gen_id);
}
/* Inform CDC users about a generation of streams (identified by the given timestamp)
* by inserting it into the cdc_streams table.
*
* Assumes that the cdc_generation_descriptions table contains this generation.
*
* Returning from this function does not mean that the table update was successful: the function
* might run an asynchronous task in the background.
*/
static future<> update_streams_description(
cdc::generation_id gen_id,
db::system_keyspace& sys_ks,
shared_ptr<db::system_distributed_keyspace> sys_dist_ks,
noncopyable_function<unsigned()> get_num_token_owners,
abort_source& abort_src) {
try {
co_await do_update_streams_description(gen_id, sys_ks, *sys_dist_ks, { get_num_token_owners() });
} catch (...) {
cdc_log.warn(
"Could not update CDC description table with generation {}: {}. Will retry in the background.",
gen_id, std::current_exception());
// It is safe to discard this future: we keep system distributed keyspace alive.
(void)(([] (cdc::generation_id gen_id,
db::system_keyspace& sys_ks,
shared_ptr<db::system_distributed_keyspace> sys_dist_ks,
noncopyable_function<unsigned()> get_num_token_owners,
abort_source& abort_src) -> future<> {
while (true) {
try {
co_await sleep_abortable(std::chrono::seconds(60), abort_src);
} catch (seastar::sleep_aborted&) {
cdc_log.warn( "Aborted update CDC description table with generation {}", gen_id);
co_return;
}
try {
co_await do_update_streams_description(gen_id, sys_ks, *sys_dist_ks, { get_num_token_owners() });
co_return;
} catch (...) {
cdc_log.warn(
"Could not update CDC description table with generation {}: {}. Will try again.",
gen_id, std::current_exception());
}
}
})(gen_id, sys_ks, std::move(sys_dist_ks), std::move(get_num_token_owners), abort_src));
}
}
static db_clock::time_point as_timepoint(const utils::UUID& uuid) {
return db_clock::time_point(utils::UUID_gen::unix_timestamp(uuid));
}
static future<std::vector<db_clock::time_point>> get_cdc_desc_v1_timestamps(
db::system_distributed_keyspace& sys_dist_ks,
abort_source& abort_src,
const noncopyable_function<unsigned()>& get_num_token_owners) {
while (true) {
try {
co_return co_await sys_dist_ks.get_cdc_desc_v1_timestamps({ get_num_token_owners() });
} catch (...) {
cdc_log.warn(
"Failed to retrieve generation timestamps for rewriting: {}. Retrying in 60s.",
std::current_exception());
}
co_await sleep_abortable(std::chrono::seconds(60), abort_src);
}
}
// Contains a CDC log table's creation time (extracted from its schema's id)
// and its CDC TTL setting.
struct time_and_ttl {
db_clock::time_point creation_time;
int ttl;
};
/*
* See `maybe_rewrite_streams_descriptions`.
* This is the long-running-in-the-background part of that function.
* It returns the timestamp of the last rewritten generation (if any).
*/
static future<std::optional<cdc::generation_id_v1>> rewrite_streams_descriptions(
std::vector<time_and_ttl> times_and_ttls,
db::system_keyspace& sys_ks,
shared_ptr<db::system_distributed_keyspace> sys_dist_ks,
noncopyable_function<unsigned()> get_num_token_owners,
abort_source& abort_src) {
cdc_log.info("Retrieving generation timestamps for rewriting...");
auto tss = co_await get_cdc_desc_v1_timestamps(*sys_dist_ks, abort_src, get_num_token_owners);
cdc_log.info("Generation timestamps retrieved.");
// Find first generation timestamp such that some CDC log table may contain data before this timestamp.
// This predicate is monotonic w.r.t the timestamps.
auto now = db_clock::now();
std::sort(tss.begin(), tss.end());
auto first = std::partition_point(tss.begin(), tss.end(), [&] (db_clock::time_point ts) {
// partition_point finds first element that does *not* satisfy the predicate.
return std::none_of(times_and_ttls.begin(), times_and_ttls.end(),
[&] (const time_and_ttl& tat) {
// In this CDC log table there are no entries older than the table's creation time
// or (now - the table's ttl). We subtract 10s to account for some possible clock drift.
// If ttl is set to 0 then entries in this table never expire. In that case we look
// only at the table's creation time.
auto no_entries_older_than =
(tat.ttl == 0 ? tat.creation_time : std::max(tat.creation_time, now - std::chrono::seconds(tat.ttl)))
- std::chrono::seconds(10);
return no_entries_older_than < ts;
});
});
// Find first generation timestamp such that some CDC log table may contain data in this generation.
// This and all later generations need to be written to the new streams table.
if (first != tss.begin()) {
--first;
}
if (first == tss.end()) {
cdc_log.info("No generations to rewrite.");
co_return std::nullopt;
}
cdc_log.info("First generation to rewrite: {}", *first);
bool each_success = true;
co_await max_concurrent_for_each(first, tss.end(), 10, [&] (db_clock::time_point ts) -> future<> {
while (true) {
try {
co_return co_await do_update_streams_description(cdc::generation_id_v1{ts}, sys_ks, *sys_dist_ks, { get_num_token_owners() });
} catch (const no_generation_data_exception& e) {
cdc_log.error("Failed to rewrite streams for generation {}: {}. Giving up.", ts, e);
each_success = false;
co_return;
} catch (...) {
cdc_log.warn("Failed to rewrite streams for generation {}: {}. Retrying in 60s.", ts, std::current_exception());
}
co_await sleep_abortable(std::chrono::seconds(60), abort_src);
}
});
if (each_success) {
cdc_log.info("Rewriting stream tables finished successfully.");
} else {
cdc_log.info("Rewriting stream tables finished, but some generations could not be rewritten (check the logs).");
}
if (first != tss.end()) {
co_return cdc::generation_id_v1{*std::prev(tss.end())};
}
co_return std::nullopt;
}
future<> generation_service::maybe_rewrite_streams_descriptions() {
if (!_db.has_schema(_sys_dist_ks.local().NAME, _sys_dist_ks.local().CDC_DESC_V1)) {
// This cluster never went through a Scylla version which used this table
// or the user deleted the table. Nothing to do.
co_return;
}
if (co_await _sys_ks.local().cdc_is_rewritten()) {
co_return;
}
if (_cfg.dont_rewrite_streams) {
cdc_log.warn("Stream rewriting disabled. Manual administrator intervention may be required...");
co_return;
}
// For each CDC log table get the TTL setting (from CDC options) and the table's creation time
std::vector<time_and_ttl> times_and_ttls;
_db.get_tables_metadata().for_each_table([&] (table_id, lw_shared_ptr<replica::table> t) {
auto& s = *t->schema();
auto base = cdc::get_base_table(_db, s.ks_name(), s.cf_name());
if (!base) {
// Not a CDC log table.
return;
}
auto& cdc_opts = base->cdc_options();
if (!cdc_opts.enabled()) {
// This table is named like a CDC log table but it's not one.
return;
}
times_and_ttls.push_back(time_and_ttl{as_timepoint(s.id().uuid()), cdc_opts.ttl()});
});
if (times_and_ttls.empty()) {
// There's no point in rewriting old generations' streams (they don't contain any data).
cdc_log.info("No CDC log tables present, not rewriting stream tables.");
co_return co_await _sys_ks.local().cdc_set_rewritten(std::nullopt);
}
auto get_num_token_owners = [tm = _token_metadata.get()] { return tm->count_normal_token_owners(); };
// This code is racing with node startup. At this point, we're most likely still waiting for gossip to settle
// and some nodes that are UP may still be marked as DOWN by us.
// Let's sleep a bit to increase the chance that the first attempt at rewriting succeeds (it's still ok if
// it doesn't - we'll retry - but it's nice if we succeed without any warnings).
co_await sleep_abortable(std::chrono::seconds(10), _abort_src);
cdc_log.info("Rewriting stream tables in the background...");
auto last_rewritten = co_await rewrite_streams_descriptions(
std::move(times_and_ttls),
_sys_ks.local(),
_sys_dist_ks.local_shared(),
std::move(get_num_token_owners),
_abort_src);
co_await _sys_ks.local().cdc_set_rewritten(last_rewritten);
}
static void assert_shard_zero(const sstring& where) {
if (this_shard_id() != 0) {
on_internal_error(cdc_log, format("`{}`: must be run on shard 0", where));
}
}
class and_reducer {
private:
bool _result = true;
@@ -803,195 +388,26 @@ public:
}
};
class generation_handling_nonfatal_exception : public std::runtime_error {
using std::runtime_error::runtime_error;
};
constexpr char could_not_retrieve_msg_template[]
= "Could not retrieve CDC streams with timestamp {} upon gossip event. Reason: \"{}\". Action: {}.";
generation_service::generation_service(
config cfg, gms::gossiper& g, sharded<db::system_distributed_keyspace>& sys_dist_ks,
config cfg,
sharded<db::system_keyspace>& sys_ks,
abort_source& abort_src, const locator::shared_token_metadata& stm, gms::feature_service& f,
replica::database& db)
: _cfg(std::move(cfg))
, _gossiper(g)
, _sys_dist_ks(sys_dist_ks)
, _sys_ks(sys_ks)
, _abort_src(abort_src)
, _token_metadata(stm)
, _feature_service(f)
, _db(db)
{
}
future<> generation_service::stop() {
try {
co_await std::move(_cdc_streams_rewrite_complete);
} catch (...) {
cdc_log.error("CDC stream rewrite failed: ", std::current_exception());
}
if (_joined && (this_shard_id() == 0)) {
co_await leave_ring();
}
_stopped = true;
return make_ready_future<>();
}
generation_service::~generation_service() {
SCYLLA_ASSERT(_stopped);
}
future<> generation_service::after_join(std::optional<cdc::generation_id>&& startup_gen_id) {
assert_shard_zero(__PRETTY_FUNCTION__);
_gen_id = std::move(startup_gen_id);
_gossiper.register_(shared_from_this());
_joined = true;
// Retrieve the latest CDC generation seen in gossip (if any).
co_await legacy_scan_cdc_generations();
// Ensure that the new CDC stream description table has all required streams.
// See the function's comment for details.
//
// Since this depends on the entire cluster (and therefore we cannot guarantee
// timely completion), run it in the background and wait for it in stop().
_cdc_streams_rewrite_complete = maybe_rewrite_streams_descriptions();
}
future<> generation_service::leave_ring() {
assert_shard_zero(__PRETTY_FUNCTION__);
_joined = false;
co_await _gossiper.unregister_(shared_from_this());
}
future<> generation_service::on_join(gms::inet_address ep, locator::host_id id, gms::endpoint_state_ptr ep_state, gms::permit_id pid) {
return on_change(ep, id, ep_state->get_application_state_map(), pid);
}
future<> generation_service::on_change(gms::inet_address ep, locator::host_id id, const gms::application_state_map& states, gms::permit_id pid) {
assert_shard_zero(__PRETTY_FUNCTION__);
return make_ready_future<>();
}
future<> generation_service::check_and_repair_cdc_streams() {
// FIXME: support Raft group 0-based topology changes
if (!_joined) {
throw std::runtime_error("check_and_repair_cdc_streams: node not initialized yet");
}
std::optional<cdc::generation_id> latest = _gen_id;
_gossiper.for_each_endpoint_state([&] (const gms::endpoint_state& state) {
auto addr = state.get_host_id();
if (_gossiper.is_left(addr)) {
cdc_log.info("check_and_repair_cdc_streams ignored node {} because it is in LEFT state", addr);
return;
}
if (!_gossiper.is_normal(addr)) {
throw std::runtime_error(fmt::format("All nodes must be in NORMAL or LEFT state while performing check_and_repair_cdc_streams"
" ({} is in state {})", addr, _gossiper.get_gossip_status(state)));
}
const auto gen_id = get_generation_id_for(addr, state);
if (!latest || (gen_id && get_ts(*gen_id) > get_ts(*latest))) {
latest = gen_id;
}
});
auto tmptr = _token_metadata.get();
auto sys_dist_ks = get_sys_dist_ks();
bool should_regenerate = false;
if (!latest) {
cdc_log.warn("check_and_repair_cdc_streams: no generation observed in gossip");
should_regenerate = true;
} else if (std::holds_alternative<cdc::generation_id_v1>(*latest)
&& _feature_service.cdc_generations_v2) {
cdc_log.info(
"Cluster still using CDC generation storage format V1 (id: {}), even though it already understands the V2 format."
" Creating a new generation using V2.", *latest);
should_regenerate = true;
} else {
cdc_log.info("check_and_repair_cdc_streams: last generation observed in gossip: {}", *latest);
static const auto timeout_msg = "Timeout while fetching CDC topology description";
static const auto topology_read_error_note = "Note: this is likely caused by"
" node(s) being down or unreachable. It is recommended to check the network and"
" restart/remove the failed node(s), then retry checkAndRepairCdcStreams command";
static const auto exception_translating_msg = "Translating the exception to `request_execution_exception`";
std::optional<topology_description> gen;
try {
gen = co_await retrieve_generation_data(*latest, _sys_ks.local(), *sys_dist_ks, { tmptr->count_normal_token_owners() });
} catch (exceptions::request_timeout_exception& e) {
cdc_log.error("{}: \"{}\". {}.", timeout_msg, e.what(), exception_translating_msg);
throw exceptions::request_execution_exception(exceptions::exception_code::READ_TIMEOUT,
format("{}. {}.", timeout_msg, topology_read_error_note));
} catch (exceptions::unavailable_exception& e) {
static const auto unavailable_msg = "Node(s) unavailable while fetching CDC topology description";
cdc_log.error("{}: \"{}\". {}.", unavailable_msg, e.what(), exception_translating_msg);
throw exceptions::request_execution_exception(exceptions::exception_code::UNAVAILABLE,
format("{}. {}.", unavailable_msg, topology_read_error_note));
} catch (...) {
const auto ep = std::current_exception();
if (is_timeout_exception(ep)) {
cdc_log.error("{}: \"{}\". {}.", timeout_msg, ep, exception_translating_msg);
throw exceptions::request_execution_exception(exceptions::exception_code::READ_TIMEOUT,
format("{}. {}.", timeout_msg, topology_read_error_note));
}
// On exotic errors proceed with regeneration
cdc_log.error("Exception while reading CDC topology description: \"{}\". Regenerating streams anyway.", ep);
should_regenerate = true;
}
if (!gen) {
cdc_log.error(
"Could not find CDC generation with timestamp {} in distributed system tables (current time: {}),"
" even though some node gossiped about it.",
latest, db_clock::now());
should_regenerate = true;
} else if (!is_cdc_generation_optimal(*gen, *tmptr)) {
should_regenerate = true;
cdc_log.info("CDC generation {} needs repair, regenerating", latest);
}
}
if (!should_regenerate) {
if (latest != _gen_id) {
co_await legacy_do_handle_cdc_generation(*latest);
}
cdc_log.info("CDC generation {} does not need repair", latest);
co_return;
}
const auto new_gen_id = co_await legacy_make_new_generation({}, true);
// Need to artificially update our STATUS so other nodes handle the generation ID change
// FIXME: after 0e0282cd nodes do not require a STATUS update to react to CDC generation changes.
// The artificial STATUS update here should eventually be removed (in a few releases).
auto status = _gossiper.get_this_endpoint_state_ptr()->get_application_state_ptr(gms::application_state::STATUS);
if (!status) {
cdc_log.error("Our STATUS is missing");
cdc_log.error("Aborting CDC generation repair due to missing STATUS");
co_return;
}
// Update _gen_id first, so that legacy_do_handle_cdc_generation (which will get called due to the status update)
// won't try to update the gossiper, which would result in a deadlock inside add_local_application_state
_gen_id = new_gen_id;
co_await _gossiper.add_local_application_state(
std::pair(gms::application_state::CDC_GENERATION_ID, gms::versioned_value::cdc_generation_id(new_gen_id)),
std::pair(gms::application_state::STATUS, *status)
);
co_await _sys_ks.local().update_cdc_generation_id(new_gen_id);
}
future<> generation_service::handle_cdc_generation(cdc::generation_id_v2 gen_id) {
future<> generation_service::handle_cdc_generation(cdc::generation_id gen_id) {
auto ts = get_ts(gen_id);
if (co_await container().map_reduce(and_reducer(), [ts] (generation_service& svc) {
return !svc._cdc_metadata.prepare(ts);
@@ -1013,171 +429,8 @@ future<> generation_service::handle_cdc_generation(cdc::generation_id_v2 gen_id)
}
}
future<> generation_service::legacy_handle_cdc_generation(std::optional<cdc::generation_id> gen_id) {
assert_shard_zero(__PRETTY_FUNCTION__);
if (!gen_id) {
co_return;
}
if (!_sys_dist_ks.local_is_initialized() || !_sys_dist_ks.local().started()) {
on_internal_error(cdc_log, "Legacy handle CDC generation with sys.dist.ks. down");
}
// The service should not be listening for generation changes until after the node
// is bootstrapped and since the node leaves the ring on decommission
if (co_await container().map_reduce(and_reducer(), [ts = get_ts(*gen_id)] (generation_service& svc) {
return !svc._cdc_metadata.prepare(ts);
})) {
co_return;
}
bool using_this_gen = false;
try {
using_this_gen = co_await legacy_do_handle_cdc_generation_intercept_nonfatal_errors(*gen_id);
} catch (generation_handling_nonfatal_exception& e) {
cdc_log.warn(could_not_retrieve_msg_template, gen_id, e.what(), "retrying in the background");
legacy_async_handle_cdc_generation(*gen_id);
co_return;
} catch (...) {
cdc_log.error(could_not_retrieve_msg_template, gen_id, std::current_exception(), "not retrying");
co_return; // Exotic ("fatal") exception => do not retry
}
if (using_this_gen) {
cdc_log.info("Starting to use generation {}", *gen_id);
co_await update_streams_description(*gen_id, _sys_ks.local(), get_sys_dist_ks(),
[&tm = _token_metadata] { return tm.get()->count_normal_token_owners(); },
_abort_src);
}
}
void generation_service::legacy_async_handle_cdc_generation(cdc::generation_id gen_id) {
assert_shard_zero(__PRETTY_FUNCTION__);
(void)(([] (cdc::generation_id gen_id, shared_ptr<generation_service> svc) -> future<> {
while (true) {
co_await sleep_abortable(std::chrono::seconds(5), svc->_abort_src);
try {
bool using_this_gen = co_await svc->legacy_do_handle_cdc_generation_intercept_nonfatal_errors(gen_id);
if (using_this_gen) {
cdc_log.info("Starting to use generation {}", gen_id);
co_await update_streams_description(gen_id, svc->_sys_ks.local(), svc->get_sys_dist_ks(),
[&tm = svc->_token_metadata] { return tm.get()->count_normal_token_owners(); },
svc->_abort_src);
}
co_return;
} catch (generation_handling_nonfatal_exception& e) {
cdc_log.warn(could_not_retrieve_msg_template, gen_id, e.what(), "continuing to retry in the background");
} catch (...) {
cdc_log.error(could_not_retrieve_msg_template, gen_id, std::current_exception(), "not retrying anymore");
co_return; // Exotic ("fatal") exception => do not retry
}
if (co_await svc->container().map_reduce(and_reducer(), [ts = get_ts(gen_id)] (generation_service& svc) {
return svc._cdc_metadata.known_or_obsolete(ts);
})) {
co_return;
}
}
})(gen_id, shared_from_this()));
}
future<> generation_service::legacy_scan_cdc_generations() {
assert_shard_zero(__PRETTY_FUNCTION__);
std::optional<cdc::generation_id> latest;
_gossiper.for_each_endpoint_state([&] (const gms::endpoint_state& eps) {
auto gen_id = get_generation_id_for(eps.get_host_id(), eps);
if (!latest || (gen_id && get_ts(*gen_id) > get_ts(*latest))) {
latest = gen_id;
}
});
if (latest) {
cdc_log.info("Latest generation seen during startup: {}", *latest);
co_await legacy_handle_cdc_generation(latest);
} else {
cdc_log.info("No generation seen during startup.");
}
}
future<bool> generation_service::legacy_do_handle_cdc_generation_intercept_nonfatal_errors(cdc::generation_id gen_id) {
assert_shard_zero(__PRETTY_FUNCTION__);
// Use futurize_invoke to catch all exceptions from legacy_do_handle_cdc_generation.
return futurize_invoke([this, gen_id] {
return legacy_do_handle_cdc_generation(gen_id);
}).handle_exception([] (std::exception_ptr ep) -> future<bool> {
try {
std::rethrow_exception(ep);
} catch (exceptions::request_timeout_exception& e) {
throw generation_handling_nonfatal_exception(e.what());
} catch (exceptions::unavailable_exception& e) {
throw generation_handling_nonfatal_exception(e.what());
} catch (exceptions::read_failure_exception& e) {
throw generation_handling_nonfatal_exception(e.what());
} catch (...) {
const auto ep = std::current_exception();
if (is_timeout_exception(ep)) {
throw generation_handling_nonfatal_exception(format("{}", ep));
}
throw;
}
});
}
future<bool> generation_service::legacy_do_handle_cdc_generation(cdc::generation_id gen_id) {
assert_shard_zero(__PRETTY_FUNCTION__);
auto sys_dist_ks = get_sys_dist_ks();
auto gen = co_await retrieve_generation_data(gen_id, _sys_ks.local(), *sys_dist_ks, { _token_metadata.get()->count_normal_token_owners() });
if (!gen) {
// This may happen during raft upgrade when a node gossips about a generation that
// was propagated through raft and we didn't apply it yet.
throw generation_handling_nonfatal_exception(fmt::format(
"Could not find CDC generation {} in distributed system tables (current time: {}),"
" even though some node gossiped about it.",
gen_id, db_clock::now()));
}
// We always gossip about the generation with the greatest timestamp. Specific nodes may remember older generations,
// but eventually they forget when their clocks move past the latest generation's timestamp.
// The cluster as a whole is only interested in the last generation so restarting nodes may learn what it is.
// We assume that generation changes don't happen ``too often'' so every node can learn about a generation
// before it is superseded by a newer one which causes nodes to start gossiping the about the newer one.
// The assumption follows from the requirement of bootstrapping nodes sequentially.
if (!_gen_id || get_ts(*_gen_id) < get_ts(gen_id)) {
_gen_id = gen_id;
co_await _sys_ks.local().update_cdc_generation_id(gen_id);
co_await _gossiper.add_local_application_state(
gms::application_state::CDC_GENERATION_ID, gms::versioned_value::cdc_generation_id(gen_id));
}
// Return `true` iff the generation was inserted on any of our shards.
co_return co_await container().map_reduce(or_reducer(),
[ts = get_ts(gen_id), &gen] (generation_service& svc) -> future<bool> {
// We need to copy it here before awaiting anything to avoid destruction of the captures.
const auto timestamp = ts;
topology_description gen_copy = co_await gen->clone_async();
co_return svc._cdc_metadata.insert(timestamp, std::move(gen_copy));
});
}
shared_ptr<db::system_distributed_keyspace> generation_service::get_sys_dist_ks() {
assert_shard_zero(__PRETTY_FUNCTION__);
if (!_sys_dist_ks.local_is_initialized()) {
throw std::runtime_error("system distributed keyspace not initialized");
}
return _sys_dist_ks.local_shared();
}
db_clock::time_point get_ts(const generation_id& gen_id) {
return std::visit([] (auto& id) { return id.ts; }, gen_id);
return gen_id.ts;
}
future<mutation> create_table_streams_mutation(table_id table, db_clock::time_point stream_ts, const locator::tablet_map& map, api::timestamp_type ts) {

View File

@@ -34,16 +34,6 @@ namespace seastar {
class abort_source;
} // namespace seastar
namespace db {
class config;
class system_distributed_keyspace;
} // namespace db
namespace gms {
class inet_address;
class gossiper;
} // namespace gms
namespace locator {
class tablet_map;
} // namespace locator
@@ -153,23 +143,6 @@ struct cdc_stream_diff {
using table_streams = std::map<api::timestamp_type, committed_stream_set>;
class no_generation_data_exception : public std::runtime_error {
public:
no_generation_data_exception(cdc::generation_id generation_ts)
: std::runtime_error(fmt::format("could not find generation data for timestamp {}", generation_ts))
{}
};
/* Should be called when we're restarting and we noticed that we didn't save any streams timestamp in our local tables,
* which means that we're probably upgrading from a non-CDC/old CDC version (another reason could be
* that there's a bug, or the user messed with our local tables).
*
* It checks whether we should be the node to propose the first generation of CDC streams.
* The chosen condition is arbitrary, it only tries to make sure that no two nodes propose a generation of streams
* when upgrading, and nothing bad happens if they for some reason do (it's mostly an optimization).
*/
bool should_propose_first_generation(const locator::host_id& me, const gms::gossiper&);
/*
* Checks if the CDC generation is optimal, which is true if its `topology_description` is consistent
* with `token_metadata`.

View File

@@ -15,48 +15,22 @@
namespace cdc {
struct generation_id_v1 {
db_clock::time_point ts;
bool operator==(const generation_id_v1&) const = default;
};
struct generation_id_v2 {
struct generation_id {
db_clock::time_point ts;
utils::UUID id;
bool operator==(const generation_id_v2&) const = default;
bool operator==(const generation_id&) const = default;
};
using generation_id = std::variant<generation_id_v1, generation_id_v2>;
db_clock::time_point get_ts(const generation_id&);
} // namespace cdc
template <>
struct fmt::formatter<cdc::generation_id_v1> {
constexpr auto parse(format_parse_context& ctx) { return ctx.begin(); }
template <typename FormatContext>
auto format(const cdc::generation_id_v1& gen_id, FormatContext& ctx) const {
return fmt::format_to(ctx.out(), "{}", gen_id.ts);
}
};
template <>
struct fmt::formatter<cdc::generation_id_v2> {
constexpr auto parse(format_parse_context& ctx) { return ctx.begin(); }
template <typename FormatContext>
auto format(const cdc::generation_id_v2& gen_id, FormatContext& ctx) const {
return fmt::format_to(ctx.out(), "({}, {})", gen_id.ts, gen_id.id);
}
};
template <>
struct fmt::formatter<cdc::generation_id> {
constexpr auto parse(format_parse_context& ctx) { return ctx.begin(); }
template <typename FormatContext>
auto format(const cdc::generation_id& gen_id, FormatContext& ctx) const {
return std::visit([&ctx] (auto& id) {
return fmt::format_to(ctx.out(), "{}", id);
}, gen_id);
return fmt::format_to(ctx.out(), "({}, {})", gen_id.ts, gen_id.id);
}
};

View File

@@ -11,135 +11,51 @@
#include <seastar/core/sharded.hh>
#include "cdc/metadata.hh"
#include "cdc/generation_id.hh"
#include "gms/i_endpoint_state_change_subscriber.hh"
namespace db {
class system_distributed_keyspace;
class system_keyspace;
}
namespace gms {
class gossiper;
class feature_service;
}
namespace seastar {
class abort_source;
}
namespace locator {
class shared_token_metadata;
class tablet_map;
}
namespace cdc {
class generation_service : public peering_sharded_service<generation_service>
, public async_sharded_service<generation_service>
, public gms::i_endpoint_state_change_subscriber {
, public async_sharded_service<generation_service> {
public:
struct config {
unsigned ignore_msb_bits;
std::chrono::milliseconds ring_delay;
bool dont_rewrite_streams = false;
};
private:
bool _stopped = false;
// The node has joined the token ring. Set to `true` on `after_join` call.
bool _joined = false;
config _cfg;
gms::gossiper& _gossiper;
sharded<db::system_distributed_keyspace>& _sys_dist_ks;
sharded<db::system_keyspace>& _sys_ks;
abort_source& _abort_src;
const locator::shared_token_metadata& _token_metadata;
gms::feature_service& _feature_service;
replica::database& _db;
/* Maintains the set of known CDC generations used to pick streams for log writes (i.e., the partition keys of these log writes).
* Updated in response to certain gossip events (see the handle_cdc_generation function).
*/
/* Maintains the set of known CDC generations used to pick streams for log writes (i.e., the partition keys of these log writes). */
cdc::metadata _cdc_metadata;
/* The latest known generation timestamp and the timestamp that we're currently gossiping
* (as CDC_GENERATION_ID application state).
*
* Only shard 0 manages this, hence it will be std::nullopt on all shards other than 0.
* This timestamp is also persisted in the system.cdc_local table.
*
* On shard 0 this may be nullopt only in one special case: rolling upgrade, when we upgrade
* from an old version of Scylla that didn't support CDC. In that case one node in the cluster
* will create the first generation and start gossiping it; it may be us, or it may be some
* different node. In any case, eventually - after one of the nodes gossips the first timestamp
* - we'll catch on and this variable will be updated with that generation.
*/
std::optional<cdc::generation_id> _gen_id;
future<> _cdc_streams_rewrite_complete = make_ready_future<>();
public:
generation_service(config cfg, gms::gossiper&,
sharded<db::system_distributed_keyspace>&,
generation_service(config cfg,
sharded<db::system_keyspace>& sys_ks,
abort_source&, const locator::shared_token_metadata&,
gms::feature_service&, replica::database& db);
replica::database& db);
future<> stop();
~generation_service();
/* After the node bootstraps and creates a new CDC generation, or restarts and loads the last
* known generation timestamp from persistent storage, this function should be called with
* that generation timestamp moved in as the `startup_gen_id` parameter.
* This passes the responsibility of managing generations from the node startup code to this service;
* until then, the service remains dormant.
* The startup code is in `storage_service::join_topology`, hence
* `after_join` should be called at the end of that function.
* Precondition: the node has completed bootstrapping and system_distributed_keyspace is initialized.
* Must be called on shard 0 - that's where the generation management happens.
*/
future<> after_join(std::optional<cdc::generation_id>&& startup_gen_id);
future<> leave_ring();
cdc::metadata& get_cdc_metadata() {
return _cdc_metadata;
}
virtual future<> on_join(gms::inet_address, locator::host_id id, gms::endpoint_state_ptr, gms::permit_id) override;
virtual future<> on_change(gms::inet_address, locator::host_id id, const gms::application_state_map&, gms::permit_id) override;
future<> check_and_repair_cdc_streams();
/* Generate a new set of CDC streams and insert it into the internal distributed CDC generations table.
* Returns the ID of this new generation.
*
* Should be called when starting the node for the first time (i.e., joining the ring).
*
* Assumes that the system_distributed_keyspace service is initialized.
* `cluster_supports_generations_v2` must be `true` if and only if the `CDC_GENERATIONS_V2` feature is enabled.
*
* If `CDC_GENERATIONS_V2` is enabled, the new generation will be inserted into
* `system_distributed_everywhere.cdc_generation_descriptions_v2` and the returned ID will be in the v2 format.
* Otherwise the new generation will be limited in size, causing suboptimal stream distribution, it will be inserted
* into `system_distributed.cdc_generation_descriptions` and the returned ID will be in the v1 format.
* The second case should happen only when we create new generations in a mixed cluster.
*
* The caller of this function is expected to insert the ID into the gossiper as fast as possible,
* so that other nodes learn about the generation before their clocks cross the generation's timestamp
* (not guaranteed in the current implementation, but expected to be the common case;
* we assume that `ring_delay` is enough for other nodes to learn about the new generation).
*
* Legacy: used for gossiper-based topology changes.
*/
future<cdc::generation_id> legacy_make_new_generation(
const std::unordered_set<dht::token>& bootstrap_tokens, bool add_delay);
/* Retrieve the CDC generation with the given ID from local tables
* and start using it for CDC log writes if it's not obsolete.
* Precondition: the generation was committed using group 0 and locally applied.
*/
future<> handle_cdc_generation(cdc::generation_id_v2);
future<> handle_cdc_generation(cdc::generation_id);
future<> load_cdc_tablet_streams(std::optional<std::unordered_set<table_id>> changed_tables);
@@ -151,56 +67,6 @@ public:
future<utils::chunked_vector<mutation>> garbage_collect_cdc_streams_for_table(table_id table, std::optional<std::chrono::seconds> ttl, api::timestamp_type ts);
future<> garbage_collect_cdc_streams(utils::chunked_vector<canonical_mutation>& muts, api::timestamp_type ts);
private:
/* Retrieve the CDC generation which starts at the given timestamp (from a distributed table created for this purpose)
* and start using it for CDC log writes if it's not obsolete.
*
* Legacy: used for gossiper-based topology changes.
*/
future<> legacy_handle_cdc_generation(std::optional<cdc::generation_id>);
/* If `legacy_handle_cdc_generation` fails, it schedules an asynchronous retry in the background
* using `legacy_async_handle_cdc_generation`.
*
* Legacy: used for gossiper-based topology changes.
*/
void legacy_async_handle_cdc_generation(cdc::generation_id);
/* Wrapper around `legacy_do_handle_cdc_generation` which intercepts timeout/unavailability exceptions.
* Returns: legacy_do_handle_cdc_generation(ts).
*
* Legacy: used for gossiper-based topology changes.
*/
future<bool> legacy_do_handle_cdc_generation_intercept_nonfatal_errors(cdc::generation_id);
/* Returns `true` iff we started using the generation (it was not obsolete or already known),
* which means that this node might write some CDC log entries using streams from this generation.
*
* Legacy: used for gossiper-based topology changes.
*/
future<bool> legacy_do_handle_cdc_generation(cdc::generation_id);
/* Scan CDC generation timestamps gossiped by other nodes and retrieve the latest one.
* This function should be called once at the end of the node startup procedure
* (after the node is started and running normally, it will retrieve generations on gossip events instead).
*
* Legacy: used for gossiper-based topology changes.
*/
future<> legacy_scan_cdc_generations();
/* generation_service code might be racing with system_distributed_keyspace deinitialization
* (the deinitialization order is broken).
* Therefore, whenever we want to access sys_dist_ks in a background task,
* we need to check if the instance is still there. Storing the shared pointer will keep it alive.
*/
shared_ptr<db::system_distributed_keyspace> get_sys_dist_ks();
/* Part of the upgrade procedure. Useful in case where the version of Scylla that we're upgrading from
* used the "cdc_streams_descriptions" table. This procedure ensures that the new "cdc_streams_descriptions_v2"
* table contains streams of all generations that were present in the old table and may still contain data
* (i.e. there exist CDC log tables that may contain rows with partition keys being the stream IDs from
* these generations). */
future<> maybe_rewrite_streams_descriptions();
};
} // namespace cdc

View File

@@ -618,7 +618,7 @@ static void set_default_properties_log_table(schema_builder& b, const schema& s,
b.set_caching_options(caching_options::get_disabled_caching_options());
auto rs = generate_replication_strategy(ksm, db.get_token_metadata().get_topology());
auto tombstone_gc_ext = seastar::make_shared<tombstone_gc_extension>(get_default_tombstone_gc_mode(*rs, db.get_token_metadata(), false));
auto tombstone_gc_ext = seastar::make_shared<tombstone_gc_extension>(get_default_tombstone_gc_mode(*rs, false));
b.add_extension(tombstone_gc_extension::NAME, std::move(tombstone_gc_ext));
}

View File

@@ -76,14 +76,14 @@ struct partition_deletion {
using clustered_column_set = std::map<clustering_key, cdc::one_kind_column_set, clustering_key::less_compare>;
template<typename Container>
template <typename Container>
concept EntryContainer = requires(Container& container) {
// Parenthesized due to https://bugs.llvm.org/show_bug.cgi?id=45088
{ (container.atomic_entries) } -> std::same_as<std::vector<atomic_column_update>&>;
{ (container.nonatomic_entries) } -> std::same_as<std::vector<nonatomic_column_update>&>;
};
template<EntryContainer Container>
template <EntryContainer Container>
static void add_columns_affected_by_entries(cdc::one_kind_column_set& cset, const Container& cont) {
for (const auto& entry : cont.atomic_entries) {
cset.set(entry.id);
@@ -134,7 +134,7 @@ struct batch {
ret.emplace(clustering_key::make_empty(), all_columns);
}
auto process_change_type = [&] (const auto& changes) {
auto process_change_type = [&](const auto& changes) {
for (const auto& change : changes) {
auto& cset = ret[change.key];
cset.resize(s.regular_columns_count());
@@ -211,7 +211,9 @@ private:
public:
extract_collection_visitor(column_id id, std::map<change_key_t, row_update>& updates)
: _id(id), _updates(updates) {}
: _id(id)
, _updates(updates) {
}
void collection_tombstone(const tombstone& t) {
auto& entry = get_or_append_entry(t.timestamp + 1, gc_clock::duration(0));
@@ -226,7 +228,9 @@ public:
cell(key, c);
}
constexpr bool finished() const { return false; }
constexpr bool finished() const {
return false;
}
};
/* Visits all cells and tombstones in a row, putting the encountered changes into buckets
@@ -249,41 +253,46 @@ struct extract_row_visitor {
void collection_column(const column_definition& cdef, auto&& visit_collection) {
visit(*cdef.type, make_visitor(
[&] (const collection_type_impl& ctype) {
struct collection_visitor : public extract_collection_visitor<collection_visitor> {
data_type _value_type;
[&](const collection_type_impl& ctype) {
struct collection_visitor : public extract_collection_visitor<collection_visitor> {
data_type _value_type;
collection_visitor(column_id id, std::map<change_key_t, row_update>& updates, const collection_type_impl& ctype)
: extract_collection_visitor<collection_visitor>(id, updates), _value_type(ctype.value_comparator()) {}
collection_visitor(column_id id, std::map<change_key_t, row_update>& updates, const collection_type_impl& ctype)
: extract_collection_visitor<collection_visitor>(id, updates)
, _value_type(ctype.value_comparator()) {
}
data_type get_value_type(bytes_view) {
return _value_type;
}
} v(cdef.id, _updates, ctype);
data_type get_value_type(bytes_view) {
return _value_type;
}
} v(cdef.id, _updates, ctype);
visit_collection(v);
},
[&] (const user_type_impl& utype) {
struct udt_visitor : public extract_collection_visitor<udt_visitor> {
const user_type_impl& _utype;
visit_collection(v);
},
[&](const user_type_impl& utype) {
struct udt_visitor : public extract_collection_visitor<udt_visitor> {
const user_type_impl& _utype;
udt_visitor(column_id id, std::map<change_key_t, row_update>& updates, const user_type_impl& utype)
: extract_collection_visitor<udt_visitor>(id, updates), _utype(utype) {}
udt_visitor(column_id id, std::map<change_key_t, row_update>& updates, const user_type_impl& utype)
: extract_collection_visitor<udt_visitor>(id, updates)
, _utype(utype) {
}
data_type get_value_type(bytes_view key) {
return _utype.type(deserialize_field_index(key));
}
} v(cdef.id, _updates, utype);
data_type get_value_type(bytes_view key) {
return _utype.type(deserialize_field_index(key));
}
} v(cdef.id, _updates, utype);
visit_collection(v);
},
[&] (const abstract_type& o) {
throw std::runtime_error(format("extract_changes: unknown collection type:", o.name()));
}
));
visit_collection(v);
},
[&](const abstract_type& o) {
throw std::runtime_error(format("extract_changes: unknown collection type:", o.name()));
}));
}
constexpr bool finished() const { return false; }
constexpr bool finished() const {
return false;
}
};
struct extract_changes_visitor {
@@ -293,12 +302,8 @@ struct extract_changes_visitor {
extract_row_visitor v;
visit_row_cells(v);
for (auto& [ts_ttl, row_update]: v._updates) {
_result[ts_ttl.first].static_updates.push_back({
ts_ttl.second,
std::move(row_update.atomic_entries),
std::move(row_update.nonatomic_entries)
});
for (auto& [ts_ttl, row_update] : v._updates) {
_result[ts_ttl.first].static_updates.push_back({ts_ttl.second, std::move(row_update.atomic_entries), std::move(row_update.nonatomic_entries)});
}
}
@@ -319,24 +324,18 @@ struct extract_changes_visitor {
} v;
visit_row_cells(v);
for (auto& [ts_ttl, row_update]: v._updates) {
for (auto& [ts_ttl, row_update] : v._updates) {
// It is important that changes in the resulting `set_of_changes` are listed
// in increasing TTL order. The reason is explained in a comment in cdc/log.cc,
// search for "#6070".
auto [ts, ttl] = ts_ttl;
if (v._marker && ts == v._marker_ts && ttl == v._marker_ttl) {
_result[ts].clustered_inserts.push_back({
ttl,
ckey,
*v._marker,
std::move(row_update.atomic_entries),
{}
});
_result[ts].clustered_inserts.push_back({ttl, ckey, *v._marker, std::move(row_update.atomic_entries), {}});
auto& cr_insert = _result[ts].clustered_inserts.back();
bool clustered_update_exists = false;
for (auto& nonatomic_up: row_update.nonatomic_entries) {
for (auto& nonatomic_up : row_update.nonatomic_entries) {
// Updating a collection column with an INSERT statement implies inserting a tombstone.
//
// For example, suppose that we have:
@@ -362,12 +361,7 @@ struct extract_changes_visitor {
cr_insert.nonatomic_entries.push_back(std::move(nonatomic_up));
} else {
if (!clustered_update_exists) {
_result[ts].clustered_updates.push_back({
ttl,
ckey,
{},
{}
});
_result[ts].clustered_updates.push_back({ttl, ckey, {}, {}});
// Multiple iterations of this `for` loop (for different collection columns)
// might want to put their `nonatomic_up`s into an UPDATE change;
@@ -390,12 +384,7 @@ struct extract_changes_visitor {
}
}
} else {
_result[ts].clustered_updates.push_back({
ttl,
ckey,
std::move(row_update.atomic_entries),
std::move(row_update.nonatomic_entries)
});
_result[ts].clustered_updates.push_back({ttl, ckey, std::move(row_update.atomic_entries), std::move(row_update.nonatomic_entries)});
}
}
}
@@ -412,7 +401,9 @@ struct extract_changes_visitor {
_result[t.timestamp].partition_deletions = partition_deletion{t};
}
constexpr bool finished() const { return false; }
constexpr bool finished() const {
return false;
}
};
set_of_changes extract_changes(const mutation& m) {
@@ -426,13 +417,23 @@ namespace cdc {
struct find_timestamp_visitor {
api::timestamp_type _ts = api::missing_timestamp;
bool finished() const { return _ts != api::missing_timestamp; }
bool finished() const {
return _ts != api::missing_timestamp;
}
void visit(api::timestamp_type ts) { _ts = ts; }
void visit(const atomic_cell_view& cell) { visit(cell.timestamp()); }
void visit(api::timestamp_type ts) {
_ts = ts;
}
void visit(const atomic_cell_view& cell) {
visit(cell.timestamp());
}
void live_atomic_cell(const column_definition&, const atomic_cell_view& cell) { visit(cell); }
void dead_atomic_cell(const column_definition&, const atomic_cell_view& cell) { visit(cell); }
void live_atomic_cell(const column_definition&, const atomic_cell_view& cell) {
visit(cell);
}
void dead_atomic_cell(const column_definition&, const atomic_cell_view& cell) {
visit(cell);
}
void collection_tombstone(const tombstone& t) {
// A collection tombstone with timestamp T can be created with:
// UPDATE ks.t USING TIMESTAMP T + 1 SET X = null WHERE ...
@@ -441,15 +442,33 @@ struct find_timestamp_visitor {
// with cdc$time using timestamp T + 1 instead of T.
visit(t.timestamp + 1);
}
void live_collection_cell(bytes_view, const atomic_cell_view& cell) { visit(cell); }
void dead_collection_cell(bytes_view, const atomic_cell_view& cell) { visit(cell); }
void collection_column(const column_definition&, auto&& visit_collection) { visit_collection(*this); }
void marker(const row_marker& rm) { visit(rm.timestamp()); }
void static_row_cells(auto&& visit_row_cells) { visit_row_cells(*this); }
void clustered_row_cells(const clustering_key&, auto&& visit_row_cells) { visit_row_cells(*this); }
void clustered_row_delete(const clustering_key&, const tombstone& t) { visit(t.timestamp); }
void range_delete(const range_tombstone& t) { visit(t.tomb.timestamp); }
void partition_delete(const tombstone& t) { visit(t.timestamp); }
void live_collection_cell(bytes_view, const atomic_cell_view& cell) {
visit(cell);
}
void dead_collection_cell(bytes_view, const atomic_cell_view& cell) {
visit(cell);
}
void collection_column(const column_definition&, auto&& visit_collection) {
visit_collection(*this);
}
void marker(const row_marker& rm) {
visit(rm.timestamp());
}
void static_row_cells(auto&& visit_row_cells) {
visit_row_cells(*this);
}
void clustered_row_cells(const clustering_key&, auto&& visit_row_cells) {
visit_row_cells(*this);
}
void clustered_row_delete(const clustering_key&, const tombstone& t) {
visit(t.timestamp);
}
void range_delete(const range_tombstone& t) {
visit(t.tomb.timestamp);
}
void partition_delete(const tombstone& t) {
visit(t.timestamp);
}
};
/* Find some timestamp inside the given mutation.
@@ -505,8 +524,12 @@ struct should_split_visitor {
virtual ~should_split_visitor() = default;
inline bool finished() const { return _result; }
inline void stop() { _result = true; }
inline bool finished() const {
return _result;
}
inline void stop() {
_result = true;
}
void visit(api::timestamp_type ts, gc_clock::duration ttl = gc_clock::duration(0)) {
if (_ts != api::missing_timestamp && _ts != ts) {
@@ -517,15 +540,23 @@ struct should_split_visitor {
if (_ttl && *_ttl != ttl) {
return stop();
}
_ttl = { ttl };
_ttl = {ttl};
}
void visit(const atomic_cell_view& cell) { visit(cell.timestamp(), get_ttl(cell)); }
void visit(const atomic_cell_view& cell) {
visit(cell.timestamp(), get_ttl(cell));
}
void live_atomic_cell(const column_definition&, const atomic_cell_view& cell) { visit(cell); }
void dead_atomic_cell(const column_definition&, const atomic_cell_view& cell) { visit(cell); }
void live_atomic_cell(const column_definition&, const atomic_cell_view& cell) {
visit(cell);
}
void dead_atomic_cell(const column_definition&, const atomic_cell_view& cell) {
visit(cell);
}
void collection_tombstone(const tombstone& t) { visit(t.timestamp + 1); }
void collection_tombstone(const tombstone& t) {
visit(t.timestamp + 1);
}
virtual void live_collection_cell(bytes_view, const atomic_cell_view& cell) {
if (_had_row_marker) {
@@ -534,8 +565,12 @@ struct should_split_visitor {
}
visit(cell);
}
void dead_collection_cell(bytes_view, const atomic_cell_view& cell) { visit(cell); }
void collection_column(const column_definition&, auto&& visit_collection) { visit_collection(*this); }
void dead_collection_cell(bytes_view, const atomic_cell_view& cell) {
visit(cell);
}
void collection_column(const column_definition&, auto&& visit_collection) {
visit_collection(*this);
}
virtual void marker(const row_marker& rm) {
_had_row_marker = true;
@@ -606,8 +641,8 @@ bool should_split(const mutation& m, const per_request_options& options) {
cdc::inspect_mutation(m, v);
return v._result
// A mutation with no timestamp will be split into 0 mutations:
|| v._ts == api::missing_timestamp;
// A mutation with no timestamp will be split into 0 mutations:
|| v._ts == api::missing_timestamp;
}
// Returns true if the row state and the atomic and nonatomic entries represent
@@ -642,7 +677,7 @@ static bool entries_match_row_state(const schema_ptr& base_schema, const cell_ma
if (current_values.size() != update.cells.size()) {
return false;
}
std::unordered_map<sstring_view, bytes> current_values_map;
for (const auto& entry : current_values) {
const auto attr_name = std::string_view(value_cast<sstring>(entry.first));
@@ -711,8 +746,8 @@ bool should_skip(batch& changes, const mutation& base_mutation, change_processor
return true;
}
void process_changes_with_splitting(const mutation& base_mutation, change_processor& processor,
bool enable_preimage, bool enable_postimage, bool alternator_strict_compatibility) {
void process_changes_with_splitting(
const mutation& base_mutation, change_processor& processor, bool enable_preimage, bool enable_postimage, bool alternator_strict_compatibility) {
const auto base_schema = base_mutation.schema();
auto changes = extract_changes(base_mutation);
auto pk = base_mutation.key();
@@ -824,8 +859,8 @@ void process_changes_with_splitting(const mutation& base_mutation, change_proces
}
}
void process_changes_without_splitting(const mutation& base_mutation, change_processor& processor,
bool enable_preimage, bool enable_postimage, bool alternator_strict_compatibility) {
void process_changes_without_splitting(
const mutation& base_mutation, change_processor& processor, bool enable_preimage, bool enable_postimage, bool alternator_strict_compatibility) {
if (alternator_strict_compatibility) {
auto changes = extract_changes(base_mutation);
if (should_skip(changes.begin()->second, base_mutation, processor)) {
@@ -842,7 +877,7 @@ void process_changes_without_splitting(const mutation& base_mutation, change_pro
one_kind_column_set columns{base_schema->static_columns_count()};
if (!p.static_row().empty()) {
p.static_row().get().for_each_cell([&] (column_id id, const atomic_cell_or_collection& cell) {
p.static_row().get().for_each_cell([&](column_id id, const atomic_cell_or_collection& cell) {
columns.set(id);
});
processor.produce_preimage(nullptr, columns);
@@ -855,7 +890,7 @@ void process_changes_without_splitting(const mutation& base_mutation, change_pro
// Row deleted - include all columns in preimage
columns.set(0, base_schema->regular_columns_count(), true);
} else {
cr.row().cells().for_each_cell([&] (column_id id, const atomic_cell_or_collection& cell) {
cr.row().cells().for_each_cell([&](column_id id, const atomic_cell_or_collection& cell) {
columns.set(id);
});
}

View File

@@ -48,6 +48,7 @@
#include "mutation/mutation_fragment_stream_validator.hh"
#include "utils/assert.hh"
#include "utils/error_injection.hh"
#include "utils/chunked_vector.hh"
#include "utils/pretty_printers.hh"
#include "readers/multi_range.hh"
#include "readers/compacting.hh"
@@ -161,6 +162,7 @@ std::string_view to_string(compaction_type type) {
case compaction_type::Reshape: return "Reshape";
case compaction_type::Split: return "Split";
case compaction_type::Major: return "Major";
case compaction_type::RewriteComponent: return "RewriteComponent";
}
on_internal_error_noexcept(clogger, format("Invalid compaction type {}", int(type)));
return "(invalid)";
@@ -598,8 +600,7 @@ protected:
// Garbage collected sstables that were added to SSTable set and should be eventually removed from it.
std::vector<sstables::shared_sstable> _used_garbage_collected_sstables;
utils::observable<> _stop_request_observable;
// optional tombstone_gc_state that is used when gc has to check only the compacting sstables to collect tombstones.
std::optional<tombstone_gc_state> _tombstone_gc_state_with_commitlog_check_disabled;
tombstone_gc_state _tombstone_gc_state;
int64_t _output_repaired_at = 0;
private:
// Keeps track of monitors for input sstable.
@@ -611,23 +612,23 @@ private:
}
// Called in a seastar thread
dht::partition_range_vector
utils::chunked_vector<dht::partition_range>
get_ranges_for_invalidation(const std::vector<sstables::shared_sstable>& sstables) {
// If owned ranges is disengaged, it means no cleanup work was done and
// so nothing needs to be invalidated.
if (!_owned_ranges) {
return dht::partition_range_vector{};
return {};
}
auto owned_ranges = dht::to_partition_ranges(*_owned_ranges, utils::can_yield::yes);
auto owned_ranges = dht::to_partition_ranges_chunked(*_owned_ranges).get();
auto non_owned_ranges = sstables
| std::views::transform([] (const sstables::shared_sstable& sst) {
seastar::thread::maybe_yield();
return dht::partition_range::make({sst->get_first_decorated_key(), true},
{sst->get_last_decorated_key(), true});
}) | std::ranges::to<dht::partition_range_vector>();
}) | std::ranges::to<utils::chunked_vector<dht::partition_range>>();
return dht::subtract_ranges(*_schema, non_owned_ranges, std::move(owned_ranges)).get();
return dht::subtract_ranges(*_schema, std::move(non_owned_ranges), std::move(owned_ranges)).get();
}
protected:
compaction(compaction_group_view& table_s, compaction_descriptor descriptor, compaction_data& cdata, compaction_progress_monitor& progress_monitor, use_backlog_tracker use_backlog_tracker)
@@ -649,9 +650,12 @@ protected:
, _owned_ranges(std::move(descriptor.owned_ranges))
, _sharder(descriptor.sharder)
, _owned_ranges_checker(_owned_ranges ? std::optional<dht::incremental_owned_ranges_checker>(*_owned_ranges) : std::nullopt)
, _tombstone_gc_state_with_commitlog_check_disabled(descriptor.gc_check_only_compacting_sstables ? std::make_optional(_table_s.get_tombstone_gc_state().with_commitlog_check_disabled()) : std::nullopt)
, _tombstone_gc_state(_table_s.get_tombstone_gc_state())
, _progress_monitor(progress_monitor)
{
if (descriptor.gc_check_only_compacting_sstables) {
_tombstone_gc_state = _tombstone_gc_state.with_commitlog_check_disabled();
}
std::unordered_set<sstables::run_id> ssts_run_ids;
_contains_multi_fragment_runs = std::any_of(_sstables.begin(), _sstables.end(), [&ssts_run_ids] (sstables::shared_sstable& sst) {
return !ssts_run_ids.insert(sst->run_identifier()).second;
@@ -718,8 +722,8 @@ protected:
compaction_completion_desc
get_compaction_completion_desc(std::vector<sstables::shared_sstable> input_sstables, std::vector<sstables::shared_sstable> output_sstables) {
auto ranges_for_for_invalidation = get_ranges_for_invalidation(input_sstables);
return compaction_completion_desc{std::move(input_sstables), std::move(output_sstables), std::move(ranges_for_for_invalidation)};
auto ranges = get_ranges_for_invalidation(input_sstables);
return compaction_completion_desc{std::move(input_sstables), std::move(output_sstables), std::move(ranges)};
}
// Tombstone expiration is enabled based on the presence of sstable set.
@@ -849,8 +853,8 @@ private:
return _table_s.get_compaction_strategy().make_sstable_set(_table_s);
}
const tombstone_gc_state& get_tombstone_gc_state() const {
return _tombstone_gc_state_with_commitlog_check_disabled ? _tombstone_gc_state_with_commitlog_check_disabled.value() : _table_s.get_tombstone_gc_state();
tombstone_gc_state get_tombstone_gc_state() const {
return _tombstone_gc_state;
}
future<> setup() {
@@ -1050,7 +1054,7 @@ private:
return can_never_purge;
}
return [this] (const dht::decorated_key& dk, is_shadowable is_shadowable) {
return get_max_purgeable_timestamp(_table_s, *_selector, _compacting_for_max_purgeable_func, dk, _bloom_filter_checks, _compacting_max_timestamp, _tombstone_gc_state_with_commitlog_check_disabled.has_value(), is_shadowable);
return get_max_purgeable_timestamp(_table_s, *_selector, _compacting_for_max_purgeable_func, dk, _bloom_filter_checks, _compacting_max_timestamp, !_tombstone_gc_state.is_commitlog_check_enabled(), is_shadowable);
};
}
@@ -2048,6 +2052,7 @@ compaction_type compaction_type_options::type() const {
compaction_type::Reshape,
compaction_type::Split,
compaction_type::Major,
compaction_type::RewriteComponent,
};
static_assert(std::variant_size_v<compaction_type_options::options_variant> == std::size(index_to_type));
return index_to_type[_options.index()];
@@ -2084,6 +2089,9 @@ static std::unique_ptr<compaction> make_compaction(compaction_group_view& table_
std::unique_ptr<compaction> operator()(compaction_type_options::split split_options) {
return std::make_unique<split_compaction>(table_s, std::move(descriptor), cdata, std::move(split_options), progress_monitor);
}
std::unique_ptr<compaction> operator()(compaction_type_options::component_rewrite) {
throw std::runtime_error("component_rewrite compaction should be handled separately");
}
} visitor_factory{table_s, std::move(descriptor), cdata, progress_monitor};
return descriptor.options.visit(visitor_factory);
@@ -2101,7 +2109,7 @@ static future<compaction_result> scrub_sstables_validate_mode(compaction_descrip
validation_errors += co_await sst->validate(permit, cdata.abort, [&schema] (sstring what) {
scrub_compaction::report_validation_error(compaction_type::Scrub, *schema, what);
}, monitor_generator(sst));
}, monitor_generator(sst), true);
// Did validation actually finish because aborted?
if (cdata.is_stop_requested()) {
// Compaction manager will catch this exception and re-schedule the compaction.
@@ -2138,6 +2146,34 @@ future<compaction_result> scrub_sstables_validate_mode(compaction_descriptor des
co_return res;
}
future<compaction_result> rewrite_sstables_component(compaction_descriptor descriptor, compaction_group_view& table_s) {
return seastar::async([descriptor = std::move(descriptor), &table_s] () mutable {
compaction_result result {
.stats = {
.started_at = db_clock::now(),
},
};
const auto& options = descriptor.options.as<compaction_type_options::component_rewrite>();
bool update_id = static_cast<bool>(options.update_id);
// When rewriting a component, we cannot use the standard descriptor creator
// because we must preserve the sstable version.
auto creator = [&table_s] (sstables::shared_sstable sst) {
return table_s.make_sstable(sst->state(), sst->get_version());
};
result.new_sstables.reserve(descriptor.sstables.size());
for (auto& sst : descriptor.sstables) {
auto rewritten = sst->link_with_rewritten_component(creator, options.component_to_rewrite, options.modifier, update_id).get();
result.new_sstables.push_back(rewritten);
}
descriptor.replacer({std::move(descriptor.sstables), result.new_sstables});
result.stats.ended_at = db_clock::now();
return result;
});
}
future<compaction_result>
compact_sstables(compaction_descriptor descriptor, compaction_data& cdata, compaction_group_view& table_s, compaction_progress_monitor& progress_monitor) {
if (descriptor.sstables.empty()) {
@@ -2149,6 +2185,9 @@ compact_sstables(compaction_descriptor descriptor, compaction_data& cdata, compa
// Bypass the usual compaction machinery for dry-mode scrub
return scrub_sstables_validate_mode(std::move(descriptor), cdata, table_s, progress_monitor);
}
if (descriptor.options.type() == compaction_type::RewriteComponent) {
return rewrite_sstables_component(std::move(descriptor), table_s);
}
return compaction::run(make_compaction(table_s, std::move(descriptor), cdata, progress_monitor));
}

View File

@@ -12,10 +12,12 @@
#include <functional>
#include <optional>
#include <variant>
#include "sstables/component_type.hh"
#include "sstables/types_fwd.hh"
#include "sstables/sstable_set.hh"
#include "compaction_fwd.hh"
#include "mutation_writer/token_group_based_splitting_writer.hh"
#include "utils/chunked_vector.hh"
namespace compaction {
@@ -30,6 +32,7 @@ enum class compaction_type {
Reshape = 7,
Split = 8,
Major = 9,
RewriteComponent = 10,
};
struct compaction_completion_desc {
@@ -38,7 +41,7 @@ struct compaction_completion_desc {
// New, fresh SSTables that should be added to SSTable set, replacing the old ones.
std::vector<sstables::shared_sstable> new_sstables;
// Set of compacted partition ranges that should be invalidated in the cache.
dht::partition_range_vector ranges_for_cache_invalidation;
utils::chunked_vector<dht::partition_range> ranges_for_cache_invalidation;
};
// creates a new SSTable for a given shard
@@ -90,8 +93,15 @@ public:
struct split {
mutation_writer::classify_by_token_group classifier;
};
struct component_rewrite {
sstables::component_type component_to_rewrite;
std::function<void(sstables::sstable&)> modifier;
using update_sstable_id = bool_class<class update_sstable_id_tag>;
update_sstable_id update_id = update_sstable_id::yes;
};
private:
using options_variant = std::variant<regular, cleanup, upgrade, scrub, reshard, reshape, split, major>;
using options_variant = std::variant<regular, cleanup, upgrade, scrub, reshard, reshape, split, major, component_rewrite>;
private:
options_variant _options;
@@ -129,6 +139,10 @@ public:
return compaction_type_options(scrub{.operation_mode = mode, .quarantine_sstables = quarantine_sstables, .drop_unfixable = drop_unfixable_sstables});
}
static compaction_type_options make_component_rewrite(component_type component, std::function<void(sstables::sstable&)> modifier, component_rewrite::update_sstable_id update_id = component_rewrite::update_sstable_id::yes) {
return compaction_type_options(component_rewrite{.component_to_rewrite = component, .modifier = std::move(modifier), .update_id = update_id});
}
static compaction_type_options make_split(mutation_writer::classify_by_token_group classifier) {
return compaction_type_options(split{std::move(classifier)});
}

View File

@@ -46,6 +46,7 @@ public:
virtual reader_permit make_compaction_reader_permit() const = 0;
virtual sstables::sstables_manager& get_sstables_manager() noexcept = 0;
virtual sstables::shared_sstable make_sstable(sstables::sstable_state) const = 0;
virtual sstables::shared_sstable make_sstable(sstables::sstable_state, sstables::sstable_version_types) const = 0;
virtual sstables::sstable_writer_config configure_writer(sstring origin) const = 0;
virtual api::timestamp_type min_memtable_timestamp() const = 0;
virtual api::timestamp_type min_memtable_live_timestamp() const = 0;
@@ -54,7 +55,7 @@ public:
virtual future<> on_compaction_completion(compaction_completion_desc desc, sstables::offstrategy offstrategy) = 0;
virtual bool is_auto_compaction_disabled_by_user() const noexcept = 0;
virtual bool tombstone_gc_enabled() const noexcept = 0;
virtual const tombstone_gc_state& get_tombstone_gc_state() const noexcept = 0;
virtual tombstone_gc_state get_tombstone_gc_state() const noexcept = 0;
virtual compaction_backlog_tracker& get_backlog_tracker() = 0;
virtual const std::string get_group_id() const noexcept = 0;
virtual seastar::condition_variable& get_staging_done_condition() noexcept = 0;

View File

@@ -778,6 +778,7 @@ compaction_manager::get_incremental_repair_read_lock(compaction::compaction_grou
cmlog.debug("Get get_incremental_repair_read_lock for {} started", reason);
}
compaction::compaction_state& cs = get_compaction_state(&t);
auto gh = cs.gate.hold();
auto ret = co_await cs.incremental_repair_lock.hold_read_lock();
if (!reason.empty()) {
cmlog.debug("Get get_incremental_repair_read_lock for {} done", reason);
@@ -791,6 +792,7 @@ compaction_manager::get_incremental_repair_write_lock(compaction::compaction_gro
cmlog.debug("Get get_incremental_repair_write_lock for {} started", reason);
}
compaction::compaction_state& cs = get_compaction_state(&t);
auto gh = cs.gate.hold();
auto ret = co_await cs.incremental_repair_lock.hold_write_lock();
if (!reason.empty()) {
cmlog.debug("Get get_incremental_repair_write_lock for {} done", reason);
@@ -944,7 +946,7 @@ sstables::shared_sstable sstables_task_executor::consume_sstable() {
auto sst = _sstables.back();
_sstables.pop_back();
--_cm._stats.pending_tasks; // from this point on, switch_state(pending|active) works the same way as any other task
cmlog.debug("{}", format("consumed {}", sst->get_filename()));
cmlog.debug("consumed {}", sst->get_filename());
return sst;
}
@@ -1040,7 +1042,7 @@ compaction_manager::compaction_manager(config cfg, abort_source& as, tasks::task
_compaction_controller.set_max_shares(max_shares);
}))
, _strategy_control(std::make_unique<strategy_control>(*this))
, _tombstone_gc_state(_shared_tombstone_gc_state) {
{
tm.register_module(_task_manager_module->get_name(), _task_manager_module);
register_metrics();
// Bandwidth throttling is node-wide, updater is needed on single shard
@@ -1064,7 +1066,7 @@ compaction_manager::compaction_manager(tasks::task_manager& tm)
, _compaction_static_shares_observer(_cfg.static_shares.observe(_update_compaction_static_shares_action.make_observer()))
, _compaction_max_shares_observer(_cfg.max_shares.observe([] (const float& max_shares) {}))
, _strategy_control(std::make_unique<strategy_control>(*this))
, _tombstone_gc_state(_shared_tombstone_gc_state) {
{
tm.register_module(_task_manager_module->get_name(), _task_manager_module);
// No metric registration because this constructor is supposed to be used only by the testing
// infrastructure.
@@ -1206,7 +1208,6 @@ future<> compaction_manager::await_tasks(std::vector<shared_ptr<compaction_task_
std::vector<shared_ptr<compaction_task_executor>>
compaction_manager::do_stop_ongoing_compactions(sstring reason, std::function<bool(const compaction_group_view*)> filter, std::optional<compaction_type> type_opt) noexcept {
auto ongoing_compactions = get_compactions(filter).size();
auto tasks = _tasks
| std::views::filter([&filter, type_opt] (const auto& task) {
return filter(task.compacting_table()) && (!type_opt || task.compaction_type() == *type_opt);
@@ -1215,6 +1216,7 @@ compaction_manager::do_stop_ongoing_compactions(sstring reason, std::function<bo
| std::ranges::to<std::vector<shared_ptr<compaction_task_executor>>>();
logging::log_level level = tasks.empty() ? log_level::debug : log_level::info;
if (cmlog.is_enabled(level)) {
auto ongoing_compactions = get_compactions(filter).size();
std::string scope = "";
if (!tasks.empty()) {
const compaction_group_view* t = tasks.front()->compacting_table();
@@ -1266,9 +1268,15 @@ future<> compaction_manager::start(const db::config& cfg, utils::disk_space_moni
if (dsm && (this_shard_id() == 0)) {
_out_of_space_subscription = dsm->subscribe(cfg.critical_disk_utilization_level, [this] (auto threshold_reached) {
if (threshold_reached) {
return container().invoke_on_all([] (compaction_manager& cm) { return cm.drain(); });
return container().invoke_on_all([] (compaction_manager& cm) {
cm._in_critical_disk_utilization_mode = true;
return cm.drain();
});
}
return container().invoke_on_all([] (compaction_manager& cm) { cm.enable(); });
return container().invoke_on_all([] (compaction_manager& cm) {
cm._in_critical_disk_utilization_mode = false;
cm.enable();
});
});
}
@@ -1418,11 +1426,17 @@ protected:
compaction_strategy cs = t.get_compaction_strategy();
compaction_descriptor descriptor = co_await cs.get_sstables_for_compaction(t, _cm.get_strategy_control());
int weight = calculate_weight(descriptor);
cmlog.debug("Started minor compaction sstables={} sstables_reapired_at={} range={} uuid={} compaction_uuid={}",
descriptor.sstables, compacting_table()->get_sstables_repaired_at(),
compacting_table()->token_range(), uuid, _compaction_data.compaction_uuid);
bool debug_enabled = cmlog.is_enabled(log_level::debug);
if (debug_enabled) {
cmlog.debug("Started minor compaction sstables={} sstables_reapired_at={} range={} uuid={} compaction_uuid={}",
descriptor.sstables, compacting_table()->get_sstables_repaired_at(),
compacting_table()->token_range(), uuid, _compaction_data.compaction_uuid);
}
auto old_sstables = ::format("{}", descriptor.sstables);
sstring old_sstables;
if (debug_enabled) {
old_sstables = ::format("{}", descriptor.sstables);
}
if (descriptor.sstables.empty() || !can_proceed() || t.is_auto_compaction_disabled_by_user()) {
cmlog.debug("{}: sstables={} can_proceed={} auto_compaction={}", *this, descriptor.sstables.size(), can_proceed(), t.is_auto_compaction_disabled_by_user());
@@ -1452,8 +1466,10 @@ protected:
try {
bool should_update_history = this->should_update_history(descriptor.options.type());
compaction_result res = co_await compact_sstables(std::move(descriptor), _compaction_data, on_replace);
cmlog.debug("Finished minor compaction old_sstables={} new_sstables={} sstables_reapired_at={} range={} uuid={} compaction_uuid={}",
old_sstables, res.new_sstables, compacting_table()->get_sstables_repaired_at(), compacting_table()->token_range(), uuid, _compaction_data.compaction_uuid);
if (debug_enabled) {
cmlog.debug("Finished minor compaction old_sstables={} new_sstables={} sstables_reapired_at={} range={} uuid={} compaction_uuid={}",
old_sstables, res.new_sstables, compacting_table()->get_sstables_repaired_at(), compacting_table()->token_range(), uuid, _compaction_data.compaction_uuid);
}
finish_compaction();
if (should_update_history) {
// update_history can take a long time compared to
@@ -1786,6 +1802,41 @@ protected:
}
};
class rewrite_sstables_component_compaction_task_executor final : public rewrite_sstables_compaction_task_executor {
std::unordered_map<sstables::shared_sstable, sstables::shared_sstable>& _rewritten_sstables;
public:
rewrite_sstables_component_compaction_task_executor(compaction_manager& mgr,
throw_if_stopping do_throw_if_stopping,
compaction_group_view* t,
tasks::task_id parent_id,
compaction_type_options options,
std::vector<sstables::shared_sstable> sstables,
compacting_sstable_registration compacting,
std::unordered_map<sstables::shared_sstable, sstables::shared_sstable>& rewritten_sstables)
: rewrite_sstables_compaction_task_executor(mgr, do_throw_if_stopping, t, parent_id, options, {},
std::move(sstables), std::move(compacting), compaction_manager::can_purge_tombstones::no, "component_rewrite"),
_rewritten_sstables(rewritten_sstables)
{}
protected:
virtual future<compaction_manager::compaction_stats_opt> do_run() override {
compaction_stats stats{};
switch_state(state::pending);
auto maintenance_permit = co_await acquire_semaphore(_cm._maintenance_ops_sem);
while (!_sstables.empty()) {
auto sst = consume_sstable();
auto it = _rewritten_sstables.emplace(sst, sstables::shared_sstable{}).first;
auto res = co_await rewrite_sstable(std::move(sst));
_cm._validation_errors += res.stats.validation_errors;
stats += res.stats;
it->second = std::move(res.new_sstables.front());
}
co_return stats;
}
};
class split_compaction_task_executor final : public rewrite_sstables_compaction_task_executor {
compaction_type_options::split _opt;
public:
@@ -1899,6 +1950,28 @@ compaction_manager::rewrite_sstables(compaction_group_view& t, compaction_type_o
return perform_task_on_all_files<rewrite_sstables_compaction_task_executor>("rewrite", info, t, std::move(options), std::move(owned_ranges_ptr), std::move(get_func), throw_if_stopping::no, can_purge, std::move(options_desc));
}
future<compaction_manager::compaction_stats_opt>
compaction_manager::rewrite_sstables_component(compaction_group_view& t,
std::vector<sstables::shared_sstable>& sstables,
compaction_type_options options,
std::unordered_map<sstables::shared_sstable, sstables::shared_sstable>& rewritten_sstables,
tasks::task_info info) {
auto gh = start_compaction(t);
if (!gh) {
co_return std::nullopt;
}
if (sstables.empty()) {
co_return std::nullopt;
}
compacting_sstable_registration compacting(*this, get_compaction_state(&t));
compacting.register_compacting(sstables);
co_return co_await perform_compaction<rewrite_sstables_component_compaction_task_executor>(throw_if_stopping::no, info, &t, info.id,
std::move(options), std::move(sstables), std::move(compacting), rewritten_sstables);
}
class validate_sstables_compaction_task_executor : public sstables_task_executor {
compaction_manager::quarantine_invalid_sstables _quarantine_sstables;
public:
@@ -2289,6 +2362,16 @@ future<compaction_manager::compaction_stats_opt> compaction_manager::perform_spl
return perform_task_on_all_files<split_compaction_task_executor>("split", info, t, std::move(options), std::move(owned_ranges_ptr), std::move(get_sstables), throw_if_stopping::no);
}
std::exception_ptr compaction_manager::make_disabled_exception(compaction::compaction_group_view& cg) {
std::exception_ptr ex;
if (_in_critical_disk_utilization_mode) {
ex = std::make_exception_ptr(std::runtime_error("critical disk utilization"));
} else {
ex = std::make_exception_ptr(compaction_stopped_exception(cg.schema()->ks_name(), cg.schema()->cf_name(), "compaction disabled"));
}
return ex;
}
future<std::vector<sstables::shared_sstable>>
compaction_manager::maybe_split_new_sstable(sstables::shared_sstable sst, compaction_group_view& t, compaction_type_options::split opt) {
if (!split_compaction_task_executor::sstable_needs_split(sst, opt)) {
@@ -2298,8 +2381,7 @@ compaction_manager::maybe_split_new_sstable(sstables::shared_sstable sst, compac
// We don't want to prevent split because compaction is temporarily disabled on a view only for synchronization,
// which is unneeded against new sstables that aren't part of any set yet, so never use can_proceed(&t) here.
if (is_disabled()) {
co_return coroutine::exception(std::make_exception_ptr(std::runtime_error(format("Cannot split {} because manager has compaction disabled, " \
"reason might be out of space prevention", sst->get_filename()))));
co_return coroutine::exception(make_disabled_exception(t));
}
std::vector<sstables::shared_sstable> ret;
@@ -2323,6 +2405,18 @@ compaction_manager::maybe_split_new_sstable(sstables::shared_sstable sst, compac
co_return ret;
}
future<std::unordered_map<sstables::shared_sstable, sstables::shared_sstable>> compaction_manager::perform_component_rewrite(compaction::compaction_group_view& t,
tasks::task_info info,
std::vector<sstables::shared_sstable> sstables,
sstables::component_type component,
std::function<void(sstables::sstable&)> modifier,
compaction_type_options::component_rewrite::update_sstable_id update_id) {
std::unordered_map<sstables::shared_sstable, sstables::shared_sstable> rewritten_sstables;
rewritten_sstables.reserve(sstables.size());
co_await rewrite_sstables_component(t, sstables, compaction_type_options::make_component_rewrite(component, std::move(modifier), update_id), rewritten_sstables, info);
co_return rewritten_sstables;
}
// Submit a table to be scrubbed and wait for its termination.
future<compaction_manager::compaction_stats_opt> compaction_manager::perform_sstable_scrub(compaction_group_view& t, compaction_type_options::scrub opts, tasks::task_info info) {
auto scrub_mode = opts.operation_mode;
@@ -2387,6 +2481,8 @@ future<> compaction_manager::remove(compaction_group_view& t, sstring reason) no
if (!c_state.gate.is_closed()) {
auto close_gate = c_state.gate.close();
co_await stop_ongoing_compactions(reason, &t);
// Wait for users of incremental repair lock (can be either repair itself or maintenance compactions).
co_await c_state.incremental_repair_lock.write_lock();
co_await std::move(close_gate);
}

View File

@@ -55,6 +55,7 @@ class custom_compaction_task_executor;
class regular_compaction_task_executor;
class offstrategy_compaction_task_executor;
class rewrite_sstables_compaction_task_executor;
class rewrite_sstables_component_compaction_task_executor;
class split_compaction_task_executor;
class cleanup_sstables_compaction_task_executor;
class validate_sstables_compaction_task_executor;
@@ -114,6 +115,8 @@ private:
uint32_t _disabled_state_count = 0;
bool is_disabled() const { return _state != state::running || _disabled_state_count > 0; }
// precondition: is_disabled() is true.
std::exception_ptr make_disabled_exception(compaction::compaction_group_view& cg);
std::optional<future<>> _stop_future;
@@ -167,12 +170,9 @@ private:
std::unique_ptr<strategy_control> _strategy_control;
shared_tombstone_gc_state _shared_tombstone_gc_state;
// TODO: tombstone_gc_state should now have value semantics, but the code
// still uses it with reference semantics (inconsistently though).
// Drop this member, once the code is converted into using value semantics.
tombstone_gc_state _tombstone_gc_state;
utils::disk_space_monitor::subscription _out_of_space_subscription;
bool _in_critical_disk_utilization_mode = false;
private:
// Requires task->_compaction_state.gate to be held and task to be registered in _tasks.
future<compaction_stats_opt> perform_task(shared_ptr<compaction::compaction_task_executor> task, throw_if_stopping do_throw_if_stopping);
@@ -256,6 +256,12 @@ private:
future<compaction_stats_opt> rewrite_sstables(compaction::compaction_group_view& t, compaction_type_options options, owned_ranges_ptr, get_candidates_func, tasks::task_info info,
can_purge_tombstones can_purge = can_purge_tombstones::yes, sstring options_desc = "");
future<compaction_stats_opt> rewrite_sstables_component(compaction_group_view& t,
std::vector<sstables::shared_sstable>& sstables,
compaction_type_options options,
std::unordered_map<sstables::shared_sstable, sstables::shared_sstable>& rewritten_sstables,
tasks::task_info info);
// Stop all fibers, without waiting. Safe to be called multiple times.
void do_stop() noexcept;
future<> really_do_stop() noexcept;
@@ -364,6 +370,13 @@ public:
// Submit a table to be scrubbed and wait for its termination.
future<compaction_stats_opt> perform_sstable_scrub(compaction::compaction_group_view& t, compaction_type_options::scrub opts, tasks::task_info info);
future<std::unordered_map<sstables::shared_sstable, sstables::shared_sstable>> perform_component_rewrite(compaction::compaction_group_view& t,
tasks::task_info info,
std::vector<sstables::shared_sstable> sstables,
sstables::component_type component,
std::function<void(sstables::sstable&)> modifier,
compaction_type_options::component_rewrite::update_sstable_id update_id = compaction_type_options::component_rewrite::update_sstable_id::yes);
// Submit a table for major compaction.
future<> perform_major_compaction(compaction::compaction_group_view& t, tasks::task_info info, bool consider_only_existing_data = false);
@@ -456,10 +469,6 @@ public:
compaction::strategy_control& get_strategy_control() const noexcept;
const tombstone_gc_state& get_tombstone_gc_state() const noexcept {
return _tombstone_gc_state;
};
shared_tombstone_gc_state& get_shared_tombstone_gc_state() noexcept {
return _shared_tombstone_gc_state;
};
@@ -489,6 +498,7 @@ public:
friend class compaction::regular_compaction_task_executor;
friend class compaction::offstrategy_compaction_task_executor;
friend class compaction::rewrite_sstables_compaction_task_executor;
friend class compaction::rewrite_sstables_component_compaction_task_executor;
friend class compaction::cleanup_sstables_compaction_task_executor;
friend class compaction::validate_sstables_compaction_task_executor;
friend compaction_reenabler;

View File

@@ -33,8 +33,10 @@ future<compaction_descriptor> leveled_compaction_strategy::get_sstables_for_comp
auto candidate = manifest.get_compaction_candidates(*state->last_compacted_keys, state->compaction_counter);
if (!candidate.sstables.empty()) {
auto main_set = co_await table_s.main_sstable_set();
leveled_manifest::logger.debug("leveled: Compacting {} out of {} sstables", candidate.sstables.size(), main_set->size());
if (leveled_manifest::logger.is_enabled(logging::log_level::debug)) {
auto main_set = co_await table_s.main_sstable_set();
leveled_manifest::logger.debug("leveled: Compacting {} out of {} sstables", candidate.sstables.size(), main_set->size());
}
co_return candidate;
}

View File

@@ -15,6 +15,7 @@
#include "compaction_strategy_state.hh"
#include "utils/error_injection.hh"
#include <seastar/util/lazy.hh>
#include <ranges>
namespace compaction {
@@ -28,12 +29,12 @@ time_window_compaction_strategy_state_ptr time_window_compaction_strategy::get_s
}
const std::unordered_map<sstring, std::chrono::seconds> time_window_compaction_strategy_options::valid_window_units = {
{ "MINUTES", 60s }, { "HOURS", 3600s }, { "DAYS", 86400s }
};
{"MINUTES", 60s}, {"HOURS", 3600s}, {"DAYS", 86400s}};
const std::unordered_map<sstring, time_window_compaction_strategy_options::timestamp_resolutions> time_window_compaction_strategy_options::valid_timestamp_resolutions = {
{ "MICROSECONDS", timestamp_resolutions::microsecond },
{ "MILLISECONDS", timestamp_resolutions::millisecond },
const std::unordered_map<sstring, time_window_compaction_strategy_options::timestamp_resolutions>
time_window_compaction_strategy_options::valid_timestamp_resolutions = {
{"MICROSECONDS", timestamp_resolutions::microsecond},
{"MILLISECONDS", timestamp_resolutions::millisecond},
};
static std::chrono::seconds validate_compaction_window_unit(const std::map<sstring, sstring>& options) {
@@ -43,7 +44,8 @@ static std::chrono::seconds validate_compaction_window_unit(const std::map<sstri
if (tmp_value) {
auto valid_window_units_it = time_window_compaction_strategy_options::valid_window_units.find(tmp_value.value());
if (valid_window_units_it == time_window_compaction_strategy_options::valid_window_units.end()) {
throw exceptions::configuration_exception(fmt::format("Invalid window unit {} for {}", tmp_value.value(), time_window_compaction_strategy_options::COMPACTION_WINDOW_UNIT_KEY));
throw exceptions::configuration_exception(
fmt::format("Invalid window unit {} for {}", tmp_value.value(), time_window_compaction_strategy_options::COMPACTION_WINDOW_UNIT_KEY));
}
window_unit = valid_window_units_it->second;
}
@@ -59,10 +61,12 @@ static std::chrono::seconds validate_compaction_window_unit(const std::map<sstri
static int validate_compaction_window_size(const std::map<sstring, sstring>& options) {
auto tmp_value = compaction_strategy_impl::get_value(options, time_window_compaction_strategy_options::COMPACTION_WINDOW_SIZE_KEY);
int window_size = cql3::statements::property_definitions::to_long(time_window_compaction_strategy_options::COMPACTION_WINDOW_SIZE_KEY, tmp_value, time_window_compaction_strategy_options::DEFAULT_COMPACTION_WINDOW_SIZE);
int window_size = cql3::statements::property_definitions::to_long(time_window_compaction_strategy_options::COMPACTION_WINDOW_SIZE_KEY, tmp_value,
time_window_compaction_strategy_options::DEFAULT_COMPACTION_WINDOW_SIZE);
if (window_size <= 0) {
throw exceptions::configuration_exception(fmt::format("{} value ({}) must be greater than 1", time_window_compaction_strategy_options::COMPACTION_WINDOW_SIZE_KEY, window_size));
throw exceptions::configuration_exception(
fmt::format("{} value ({}) must be greater than 1", time_window_compaction_strategy_options::COMPACTION_WINDOW_SIZE_KEY, window_size));
}
return window_size;
@@ -82,26 +86,30 @@ static db_clock::duration validate_expired_sstable_check_frequency_seconds(const
try {
expired_sstable_check_frequency = std::chrono::seconds(std::stol(tmp_value.value()));
} catch (const std::exception& e) {
throw exceptions::syntax_exception(fmt::format("Invalid long value {} for {}", tmp_value.value(), time_window_compaction_strategy_options::EXPIRED_SSTABLE_CHECK_FREQUENCY_SECONDS_KEY));
throw exceptions::syntax_exception(fmt::format(
"Invalid long value {} for {}", tmp_value.value(), time_window_compaction_strategy_options::EXPIRED_SSTABLE_CHECK_FREQUENCY_SECONDS_KEY));
}
}
return expired_sstable_check_frequency;
}
static db_clock::duration validate_expired_sstable_check_frequency_seconds(const std::map<sstring, sstring>& options, std::map<sstring, sstring>& unchecked_options) {
static db_clock::duration validate_expired_sstable_check_frequency_seconds(
const std::map<sstring, sstring>& options, std::map<sstring, sstring>& unchecked_options) {
db_clock::duration expired_sstable_check_frequency = validate_expired_sstable_check_frequency_seconds(options);
unchecked_options.erase(time_window_compaction_strategy_options::EXPIRED_SSTABLE_CHECK_FREQUENCY_SECONDS_KEY);
return expired_sstable_check_frequency;
}
static time_window_compaction_strategy_options::timestamp_resolutions validate_timestamp_resolution(const std::map<sstring, sstring>& options) {
time_window_compaction_strategy_options::timestamp_resolutions timestamp_resolution = time_window_compaction_strategy_options::timestamp_resolutions::microsecond;
time_window_compaction_strategy_options::timestamp_resolutions timestamp_resolution =
time_window_compaction_strategy_options::timestamp_resolutions::microsecond;
auto tmp_value = compaction_strategy_impl::get_value(options, time_window_compaction_strategy_options::TIMESTAMP_RESOLUTION_KEY);
if (tmp_value) {
if (!time_window_compaction_strategy_options::valid_timestamp_resolutions.contains(tmp_value.value())) {
throw exceptions::configuration_exception(fmt::format("Invalid timestamp resolution {} for {}", tmp_value.value(), time_window_compaction_strategy_options::TIMESTAMP_RESOLUTION_KEY));
throw exceptions::configuration_exception(fmt::format(
"Invalid timestamp resolution {} for {}", tmp_value.value(), time_window_compaction_strategy_options::TIMESTAMP_RESOLUTION_KEY));
} else {
timestamp_resolution = time_window_compaction_strategy_options::valid_timestamp_resolutions.at(tmp_value.value());
}
@@ -110,7 +118,8 @@ static time_window_compaction_strategy_options::timestamp_resolutions validate_t
return timestamp_resolution;
}
static time_window_compaction_strategy_options::timestamp_resolutions validate_timestamp_resolution(const std::map<sstring, sstring>& options, std::map<sstring, sstring>& unchecked_options) {
static time_window_compaction_strategy_options::timestamp_resolutions validate_timestamp_resolution(
const std::map<sstring, sstring>& options, std::map<sstring, sstring>& unchecked_options) {
time_window_compaction_strategy_options::timestamp_resolutions timestamp_resolution = validate_timestamp_resolution(options);
unchecked_options.erase(time_window_compaction_strategy_options::TIMESTAMP_RESOLUTION_KEY);
return timestamp_resolution;
@@ -145,7 +154,7 @@ void time_window_compaction_strategy_options::validate(const std::map<sstring, s
compaction_strategy_impl::validate_min_max_threshold(options, unchecked_options);
auto it = options.find("enable_optimized_twcs_queries");
if (it != options.end() && it->second != "true" && it->second != "false") {
if (it != options.end() && it->second != "true" && it->second != "false") {
throw exceptions::configuration_exception(fmt::format("enable_optimized_twcs_queries value ({}) must be \"true\" or \"false\"", it->second));
}
unchecked_options.erase("enable_optimized_twcs_queries");
@@ -162,7 +171,9 @@ class classify_by_timestamp {
std::vector<int64_t> _known_windows;
public:
explicit classify_by_timestamp(time_window_compaction_strategy_options options) : _options(std::move(options)) { }
explicit classify_by_timestamp(time_window_compaction_strategy_options options)
: _options(std::move(options)) {
}
int64_t operator()(api::timestamp_type ts) {
const auto window = time_window_compaction_strategy::get_window_for(_options, ts);
if (const auto it = std::ranges::find(_known_windows, window); it != _known_windows.end()) {
@@ -190,7 +201,7 @@ uint64_t time_window_compaction_strategy::adjust_partition_estimate(const mutati
auto estimated_window_count = max_data_segregation_window_count;
auto default_ttl = std::chrono::duration_cast<std::chrono::microseconds>(s->default_time_to_live());
bool min_and_max_ts_available = ms_meta.min_timestamp && ms_meta.max_timestamp;
auto estimate_window_count = [this] (timestamp_type min_window, timestamp_type max_window) {
auto estimate_window_count = [this](timestamp_type min_window, timestamp_type max_window) {
const auto window_size = get_window_size(_options);
return (max_window + (window_size - 1) - min_window) / window_size;
};
@@ -210,21 +221,19 @@ uint64_t time_window_compaction_strategy::adjust_partition_estimate(const mutati
return partition_estimate / std::max(1UL, uint64_t(estimated_window_count));
}
mutation_reader_consumer time_window_compaction_strategy::make_interposer_consumer(const mutation_source_metadata& ms_meta, mutation_reader_consumer end_consumer) const {
if (ms_meta.min_timestamp && ms_meta.max_timestamp
&& get_window_for(_options, *ms_meta.min_timestamp) == get_window_for(_options, *ms_meta.max_timestamp)) {
mutation_reader_consumer time_window_compaction_strategy::make_interposer_consumer(
const mutation_source_metadata& ms_meta, mutation_reader_consumer end_consumer) const {
if (ms_meta.min_timestamp && ms_meta.max_timestamp &&
get_window_for(_options, *ms_meta.min_timestamp) == get_window_for(_options, *ms_meta.max_timestamp)) {
return end_consumer;
}
return [options = _options, end_consumer = std::move(end_consumer)] (mutation_reader rd) mutable -> future<> {
return mutation_writer::segregate_by_timestamp(
std::move(rd),
classify_by_timestamp(std::move(options)),
end_consumer);
return [options = _options, end_consumer = std::move(end_consumer)](mutation_reader rd) mutable -> future<> {
return mutation_writer::segregate_by_timestamp(std::move(rd), classify_by_timestamp(std::move(options)), end_consumer);
};
}
compaction_descriptor
time_window_compaction_strategy::get_reshaping_job(std::vector<sstables::shared_sstable> input, schema_ptr schema, reshape_config cfg) const {
compaction_descriptor time_window_compaction_strategy::get_reshaping_job(
std::vector<sstables::shared_sstable> input, schema_ptr schema, reshape_config cfg) const {
auto mode = cfg.mode;
std::vector<sstables::shared_sstable> single_window;
std::vector<sstables::shared_sstable> multi_window;
@@ -239,7 +248,7 @@ time_window_compaction_strategy::get_reshaping_job(std::vector<sstables::shared_
// Sort input sstables by first_key order
// to allow efficient reshaping of disjoint sstables.
std::sort(input.begin(), input.end(), [&schema] (const sstables::shared_sstable& a, const sstables::shared_sstable& b) {
std::sort(input.begin(), input.end(), [&schema](const sstables::shared_sstable& a, const sstables::shared_sstable& b) {
return dht::ring_position(a->get_first_decorated_key()).less_compare(*schema, dht::ring_position(b->get_first_decorated_key()));
});
@@ -253,31 +262,34 @@ time_window_compaction_strategy::get_reshaping_job(std::vector<sstables::shared_
}
}
auto is_disjoint = [&schema, mode, max_sstables] (const std::vector<sstables::shared_sstable>& ssts) {
auto is_disjoint = [&schema, mode, max_sstables](const std::vector<sstables::shared_sstable>& ssts) {
size_t tolerance = (mode == reshape_mode::relaxed) ? max_sstables : 0;
return sstable_set_overlapping_count(schema, ssts) <= tolerance;
};
clogger.debug("time_window_compaction_strategy::get_reshaping_job: offstrategy_threshold={} max_sstables={} multi_window={} disjoint={} single_window={} disjoint={}",
offstrategy_threshold, max_sstables,
multi_window.size(), !multi_window.empty() && sstable_set_overlapping_count(schema, multi_window) == 0,
single_window.size(), !single_window.empty() && sstable_set_overlapping_count(schema, single_window) == 0);
clogger.debug("time_window_compaction_strategy::get_reshaping_job: offstrategy_threshold={} max_sstables={} multi_window={} disjoint={} "
"single_window={} disjoint={}",
offstrategy_threshold, max_sstables, multi_window.size(), seastar::value_of([&] {
return !multi_window.empty() && sstable_set_overlapping_count(schema, multi_window) == 0;
}),
single_window.size(), seastar::value_of([&] {
return !single_window.empty() && sstable_set_overlapping_count(schema, single_window) == 0;
}));
auto get_job_size = [] (const std::vector<sstables::shared_sstable>& ssts) {
auto get_job_size = [](const std::vector<sstables::shared_sstable>& ssts) {
return std::ranges::fold_left(ssts | std::views::transform(std::mem_fn(&sstables::sstable::bytes_on_disk)), uint64_t(0), std::plus{});
};
// Targets a space overhead of 10%. All disjoint sstables can be compacted together as long as they won't
// cause an overhead above target. Otherwise, the job targets a maximum of #max_threshold sstables.
auto need_trimming = [&] (const std::vector<sstables::shared_sstable>& ssts, const uint64_t job_size, bool is_disjoint) {
auto need_trimming = [&](const std::vector<sstables::shared_sstable>& ssts, const uint64_t job_size, bool is_disjoint) {
const size_t min_sstables = 2;
auto is_above_target_size = job_size > target_job_size;
return (ssts.size() > max_sstables && !is_disjoint) ||
(ssts.size() > min_sstables && is_above_target_size);
return (ssts.size() > max_sstables && !is_disjoint) || (ssts.size() > min_sstables && is_above_target_size);
};
auto maybe_trim_job = [&need_trimming] (std::vector<sstables::shared_sstable>& ssts, uint64_t job_size, bool is_disjoint) {
auto maybe_trim_job = [&need_trimming](std::vector<sstables::shared_sstable>& ssts, uint64_t job_size, bool is_disjoint) {
while (need_trimming(ssts, job_size, is_disjoint)) {
auto sst = ssts.back();
ssts.pop_back();
@@ -294,7 +306,7 @@ time_window_compaction_strategy::get_reshaping_job(std::vector<sstables::shared_
// For example, if there are N sstables spanning window W, where N <= 32, then we can produce all data for W
// in a single compaction round, removing the need to later compact W to reduce its number of files.
auto sort_size = std::min(max_sstables, multi_window.size());
std::ranges::partial_sort(multi_window, multi_window.begin() + sort_size, std::ranges::less(), [] (const sstables::shared_sstable &a) {
std::ranges::partial_sort(multi_window, multi_window.begin() + sort_size, std::ranges::less(), [](const sstables::shared_sstable& a) {
return a->get_stats_metadata().max_timestamp;
});
maybe_trim_job(multi_window, job_size, disjoint);
@@ -334,8 +346,7 @@ time_window_compaction_strategy::get_reshaping_job(std::vector<sstables::shared_
return compaction_descriptor();
}
future<compaction_descriptor>
time_window_compaction_strategy::get_sstables_for_compaction(compaction_group_view& table_s, strategy_control& control) {
future<compaction_descriptor> time_window_compaction_strategy::get_sstables_for_compaction(compaction_group_view& table_s, strategy_control& control) {
auto state = get_state(table_s);
auto compaction_time = gc_clock::now();
auto candidates = co_await control.candidates(table_s);
@@ -369,10 +380,8 @@ time_window_compaction_strategy::get_sstables_for_compaction(compaction_group_vi
co_return compaction_descriptor(std::move(compaction_candidates));
}
time_window_compaction_strategy::bucket_compaction_mode
time_window_compaction_strategy::compaction_mode(const time_window_compaction_strategy_state& state,
const bucket_t& bucket, timestamp_type bucket_key,
timestamp_type now, size_t min_threshold) const {
time_window_compaction_strategy::bucket_compaction_mode time_window_compaction_strategy::compaction_mode(
const time_window_compaction_strategy_state& state, const bucket_t& bucket, timestamp_type bucket_key, timestamp_type now, size_t min_threshold) const {
// STCS will also be performed on older window buckets, to avoid a bad write and
// space amplification when something like read repair cause small updates to
// those past windows.
@@ -385,8 +394,7 @@ time_window_compaction_strategy::compaction_mode(const time_window_compaction_st
return bucket_compaction_mode::none;
}
std::vector<sstables::shared_sstable>
time_window_compaction_strategy::get_next_non_expired_sstables(compaction_group_view& table_s, strategy_control& control,
std::vector<sstables::shared_sstable> time_window_compaction_strategy::get_next_non_expired_sstables(compaction_group_view& table_s, strategy_control& control,
std::vector<sstables::shared_sstable> non_expiring_sstables, gc_clock::time_point compaction_time, time_window_compaction_strategy_state& state) {
auto most_interesting = get_compaction_candidates(table_s, control, non_expiring_sstables, state);
@@ -400,31 +408,29 @@ time_window_compaction_strategy::get_next_non_expired_sstables(compaction_group_
// if there is no sstable to compact in standard way, try compacting single sstable whose droppable tombstone
// ratio is greater than threshold.
std::erase_if(non_expiring_sstables, [this, compaction_time, &table_s] (const sstables::shared_sstable& sst) -> bool {
std::erase_if(non_expiring_sstables, [this, compaction_time, &table_s](const sstables::shared_sstable& sst) -> bool {
return !worth_dropping_tombstones(sst, compaction_time, table_s);
});
if (non_expiring_sstables.empty()) {
return {};
}
auto it = std::ranges::min_element(non_expiring_sstables, [] (auto& i, auto& j) {
auto it = std::ranges::min_element(non_expiring_sstables, [](auto& i, auto& j) {
return i->get_stats_metadata().min_timestamp < j->get_stats_metadata().min_timestamp;
});
return { *it };
return {*it};
}
std::vector<sstables::shared_sstable>
time_window_compaction_strategy::get_compaction_candidates(compaction_group_view& table_s, strategy_control& control,
std::vector<sstables::shared_sstable> candidate_sstables, time_window_compaction_strategy_state& state) {
std::vector<sstables::shared_sstable> time_window_compaction_strategy::get_compaction_candidates(compaction_group_view& table_s, strategy_control& control,
std::vector<sstables::shared_sstable> candidate_sstables, time_window_compaction_strategy_state& state) {
auto [buckets, max_timestamp] = get_buckets(std::move(candidate_sstables), _options);
// Update the highest window seen, if necessary
state.highest_window_seen = std::max(state.highest_window_seen, max_timestamp);
return newest_bucket(table_s, control, std::move(buckets), table_s.min_compaction_threshold(), table_s.schema()->max_compaction_threshold(),
state.highest_window_seen, state);
state.highest_window_seen, state);
}
timestamp_type
time_window_compaction_strategy::get_window_lower_bound(std::chrono::seconds sstable_window_size, timestamp_type timestamp) {
timestamp_type time_window_compaction_strategy::get_window_lower_bound(std::chrono::seconds sstable_window_size, timestamp_type timestamp) {
using namespace std::chrono;
// mask out window size from timestamp to get lower bound of its window
auto num_windows = microseconds(timestamp) / sstable_window_size;
@@ -432,8 +438,8 @@ time_window_compaction_strategy::get_window_lower_bound(std::chrono::seconds sst
return duration_cast<microseconds>(num_windows * sstable_window_size).count();
}
std::pair<std::map<timestamp_type, std::vector<sstables::shared_sstable>>, timestamp_type>
time_window_compaction_strategy::get_buckets(std::vector<sstables::shared_sstable> files, const time_window_compaction_strategy_options& options) {
std::pair<std::map<timestamp_type, std::vector<sstables::shared_sstable>>, timestamp_type> time_window_compaction_strategy::get_buckets(
std::vector<sstables::shared_sstable> files, const time_window_compaction_strategy_options& options) {
std::map<timestamp_type, std::vector<sstables::shared_sstable>> buckets;
timestamp_type max_timestamp = 0;
@@ -450,11 +456,13 @@ time_window_compaction_strategy::get_buckets(std::vector<sstables::shared_sstabl
return std::make_pair(std::move(buckets), max_timestamp);
}
}
} // namespace compaction
template <>
struct fmt::formatter<std::map<compaction::timestamp_type, std::vector<sstables::shared_sstable>>> {
constexpr auto parse(format_parse_context& ctx) { return ctx.begin(); }
constexpr auto parse(format_parse_context& ctx) {
return ctx.begin();
}
auto format(const std::map<compaction::timestamp_type, std::vector<sstables::shared_sstable>>& buckets, fmt::format_context& ctx) const {
auto out = fmt::format_to(ctx.out(), " buckets = {{\n");
for (auto& [timestamp, sstables] : buckets | std::views::reverse) {
@@ -466,9 +474,9 @@ struct fmt::formatter<std::map<compaction::timestamp_type, std::vector<sstables:
namespace compaction {
std::vector<sstables::shared_sstable>
time_window_compaction_strategy::newest_bucket(compaction_group_view& table_s, strategy_control& control, std::map<timestamp_type, std::vector<sstables::shared_sstable>> buckets,
int min_threshold, int max_threshold, timestamp_type now, time_window_compaction_strategy_state& state) {
std::vector<sstables::shared_sstable> time_window_compaction_strategy::newest_bucket(compaction_group_view& table_s, strategy_control& control,
std::map<timestamp_type, std::vector<sstables::shared_sstable>> buckets, int min_threshold, int max_threshold, timestamp_type now,
time_window_compaction_strategy_state& state) {
clogger.debug("time_window_compaction_strategy::newest_bucket:\n now {}\n{}", now, buckets);
for (auto&& [key, bucket] : buckets | std::views::reverse) {
@@ -509,8 +517,7 @@ time_window_compaction_strategy::newest_bucket(compaction_group_view& table_s, s
return {};
}
std::vector<sstables::shared_sstable>
time_window_compaction_strategy::trim_to_threshold(std::vector<sstables::shared_sstable> bucket, int max_threshold) {
std::vector<sstables::shared_sstable> time_window_compaction_strategy::trim_to_threshold(std::vector<sstables::shared_sstable> bucket, int max_threshold) {
auto n = std::min(bucket.size(), size_t(max_threshold));
// Trim the largest sstables off the end to meet the maxThreshold
std::ranges::partial_sort(bucket, bucket.begin() + n, std::ranges::less(), std::mem_fn(&sstables::sstable::ondisk_data_size));
@@ -542,8 +549,8 @@ future<int64_t> time_window_compaction_strategy::estimated_pending_compactions(c
co_return n;
}
std::vector<compaction_descriptor>
time_window_compaction_strategy::get_cleanup_compaction_jobs(compaction_group_view& table_s, std::vector<sstables::shared_sstable> candidates) const {
std::vector<compaction_descriptor> time_window_compaction_strategy::get_cleanup_compaction_jobs(
compaction_group_view& table_s, std::vector<sstables::shared_sstable> candidates) const {
std::vector<compaction_descriptor> ret;
for (auto&& [_, sstables] : get_buckets(std::move(candidates), _options).first) {
auto per_window_jobs = size_tiered_compaction_strategy(_stcs_options).get_cleanup_compaction_jobs(table_s, std::move(sstables));
@@ -556,4 +563,4 @@ std::unique_ptr<sstables::sstable_set_impl> time_window_compaction_strategy::mak
return std::make_unique<sstables::time_series_sstable_set>(ts.schema(), _options.enable_optimized_twcs_queries);
}
}
} // namespace compaction

View File

@@ -397,6 +397,17 @@ commitlog_total_space_in_mb: -1
# you can cache more hot rows
# column_index_size_in_kb: 64
# sstable format version for newly written sstables.
# Currently allowed values are `me` and `ms`.
# If not specified in the config, this defaults to `me`.
#
# The difference between `me` and `ms` are the data structures used
# in the primary index.
# In short, `ms` needs more CPU during sstable writes,
# but should behave better during reads,
# although it might behave worse for very long clustering keys.
sstable_format: ms
# Auto-scaling of the promoted index prevents running out of memory
# when the promoted index grows too large (due to partitions with many rows
# vs. too small column_index_size_in_kb). When the serialized representation
@@ -639,7 +650,7 @@ strict_is_not_null_in_views: true
# * workdir: the node will open the maintenance socket on the path <scylla's workdir>/cql.m,
# where <scylla's workdir> is a path defined by the workdir configuration option,
# * <socket path>: the node will open the maintenance socket on the path <socket path>.
maintenance_socket: ignore
maintenance_socket: workdir
# If set to true, configuration parameters defined with LiveUpdate option can be updated in runtime with CQL
# by updating system.config virtual table. If we don't want any configuration parameter to be changed in runtime
@@ -648,10 +659,9 @@ maintenance_socket: ignore
# e.g. for cloud users, for whom scylla's configuration should be changed only by support engineers.
# live_updatable_config_params_changeable_via_cql: true
# ****************
# * GUARDRAILS *
# ****************
#
# Guardrails options
#
# Guardrails to warn or fail when Replication Factor is smaller/greater than the threshold.
# Please note that the value of 0 is always allowed,
# which means that having no replication at all, i.e. RF = 0, is always valid.
@@ -661,6 +671,27 @@ maintenance_socket: ignore
# minimum_replication_factor_warn_threshold: 3
# maximum_replication_factor_warn_threshold: -1
# maximum_replication_factor_fail_threshold: -1
#
# Guardrails to warn about or disallow creating a keyspace with specific replication strategy.
# Each of these 2 settings is a list storing replication strategies considered harmful.
# The replication strategies to choose from are:
# 1) SimpleStrategy,
# 2) NetworkTopologyStrategy,
# 3) LocalStrategy,
# 4) EverywhereStrategy
#
# replication_strategy_warn_list:
# - SimpleStrategy
# replication_strategy_fail_list:
#
# Guardrail to enable the deprecated feature of CREATE TABLE WITH COMPACT STORAGE.
# enable_create_table_with_compact_storage: false
#
# Guardrails to limit usage of selected consistency levels for writes.
# Adding a warning to a CQL query response can significantly increase network
# traffic and decrease overall throughput.
# write_consistency_levels_warned: []
# write_consistency_levels_disallowed: []
#
# System information encryption settings
@@ -838,21 +869,6 @@ maintenance_socket: ignore
# key_namespace: <kmip key namespace> (optional)
#
# Guardrails to warn about or disallow creating a keyspace with specific replication strategy.
# Each of these 2 settings is a list storing replication strategies considered harmful.
# The replication strategies to choose from are:
# 1) SimpleStrategy,
# 2) NetworkTopologyStrategy,
# 3) LocalStrategy,
# 4) EverywhereStrategy
#
# replication_strategy_warn_list:
# - SimpleStrategy
# replication_strategy_fail_list:
# Guardrail to enable the deprecated feature of CREATE TABLE WITH COMPACT STORAGE.
# enable_create_table_with_compact_storage: false
# Control tablets for new keyspaces.
# Can be set to: disabled|enabled|enforced
#

View File

@@ -544,7 +544,6 @@ scylla_tests = set([
'test/boost/caching_options_test',
'test/boost/canonical_mutation_test',
'test/boost/cartesian_product_test',
'test/boost/cdc_generation_test',
'test/boost/cell_locker_test',
'test/boost/checksum_utils_test',
'test/boost/chunked_managed_vector_test',
@@ -619,6 +618,7 @@ scylla_tests = set([
'test/boost/reservoir_sampling_test',
'test/boost/result_utils_test',
'test/boost/rest_client_test',
'test/boost/rolling_max_tracker_test',
'test/boost/reusable_buffer_test',
'test/boost/rust_test',
'test/boost/s3_test',
@@ -896,6 +896,9 @@ scylla_core = (['message/messaging_service.cc',
'replica/multishard_query.cc',
'replica/mutation_dump.cc',
'replica/querier.cc',
'replica/logstor/segment_manager.cc',
'replica/logstor/logstor.cc',
'replica/logstor/write_buffer.cc',
'mutation/atomic_cell.cc',
'mutation/canonical_mutation.cc',
'mutation/frozen_mutation.cc',
@@ -1204,6 +1207,7 @@ scylla_core = (['message/messaging_service.cc',
'gms/application_state.cc',
'gms/inet_address.cc',
'dht/i_partitioner.cc',
'dht/fixed_shard.cc',
'dht/token.cc',
'dht/murmur3_partitioner.cc',
'dht/boot_strapper.cc',
@@ -1239,7 +1243,6 @@ scylla_core = (['message/messaging_service.cc',
'service/pager/query_pagers.cc',
'service/qos/qos_common.cc',
'service/qos/service_level_controller.cc',
'service/qos/standard_service_level_distributed_data_accessor.cc',
'service/qos/raft_service_level_distributed_data_accessor.cc',
'streaming/stream_task.cc',
'streaming/stream_session.cc',
@@ -1273,8 +1276,8 @@ scylla_core = (['message/messaging_service.cc',
'auth/common.cc',
'auth/default_authorizer.cc',
'auth/resource.cc',
'auth/roles-metadata.cc',
'auth/passwords.cc',
'auth/maintenance_socket_authenticator.cc',
'auth/password_authenticator.cc',
'auth/permission.cc',
'auth/service.cc',
@@ -1340,6 +1343,7 @@ scylla_core = (['message/messaging_service.cc',
'service/strong_consistency/groups_manager.cc',
'service/strong_consistency/coordinator.cc',
'service/strong_consistency/state_machine.cc',
'service/strong_consistency/raft_groups_storage.cc',
'service/raft/group0_state_id_handler.cc',
'service/raft/group0_state_machine.cc',
'service/raft/group0_state_machine_merger.cc',
@@ -1466,6 +1470,7 @@ idls = ['idl/gossip_digest.idl.hh',
'idl/query.idl.hh',
'idl/idl_test.idl.hh',
'idl/commitlog.idl.hh',
'idl/logstor.idl.hh',
'idl/tracing.idl.hh',
'idl/consistency_level.idl.hh',
'idl/cache_temperature.idl.hh',
@@ -1473,6 +1478,7 @@ idls = ['idl/gossip_digest.idl.hh',
'idl/messaging_service.idl.hh',
'idl/paxos.idl.hh',
'idl/raft.idl.hh',
'idl/raft_util.idl.hh',
'idl/raft_storage.idl.hh',
'idl/group0.idl.hh',
'idl/hinted_handoff.idl.hh',
@@ -1492,7 +1498,9 @@ idls = ['idl/gossip_digest.idl.hh',
'idl/gossip.idl.hh',
'idl/migration_manager.idl.hh',
"idl/node_ops.idl.hh",
"idl/tasks.idl.hh"
"idl/tasks.idl.hh",
"idl/client_state.idl.hh",
"idl/forward_cql.idl.hh",
]
scylla_tests_generic_dependencies = [
@@ -1585,6 +1593,7 @@ pure_boost_tests = set([
'test/boost/wrapping_interval_test',
'test/boost/range_tombstone_list_test',
'test/boost/reservoir_sampling_test',
'test/boost/rolling_max_tracker_test',
'test/boost/serialization_test',
'test/boost/small_vector_test',
'test/boost/top_k_test',
@@ -1733,6 +1742,7 @@ deps['test/boost/url_parse_test'] = ['utils/http.cc', 'test/boost/url_parse_test
deps['test/boost/murmur_hash_test'] = ['bytes.cc', 'utils/murmur_hash.cc', 'test/boost/murmur_hash_test.cc']
deps['test/boost/allocation_strategy_test'] = ['test/boost/allocation_strategy_test.cc', 'utils/logalloc.cc', 'utils/dynamic_bitset.cc', 'utils/labels.cc']
deps['test/boost/log_heap_test'] = ['test/boost/log_heap_test.cc']
deps['test/boost/rolling_max_tracker_test'] = ['test/boost/rolling_max_tracker_test.cc']
deps['test/boost/estimated_histogram_test'] = ['test/boost/estimated_histogram_test.cc']
deps['test/boost/summary_test'] = ['test/boost/summary_test.cc']
deps['test/boost/anchorless_list_test'] = ['test/boost/anchorless_list_test.cc']

View File

@@ -23,7 +23,7 @@ column_specification::column_specification(std::string_view ks_name_, std::strin
bool column_specification::all_in_same_table(const std::vector<lw_shared_ptr<column_specification>>& names)
{
SCYLLA_ASSERT(!names.empty());
throwing_assert(!names.empty());
auto first = names.front();
return std::all_of(std::next(names.begin()), names.end(), [first] (auto&& spec) {

View File

@@ -49,9 +49,9 @@ static cql3_type::kind get_cql3_kind(const abstract_type& t) {
cql3_type::kind operator()(const uuid_type_impl&) { return cql3_type::kind::UUID; }
cql3_type::kind operator()(const varint_type_impl&) { return cql3_type::kind::VARINT; }
cql3_type::kind operator()(const reversed_type_impl& r) { return get_cql3_kind(*r.underlying_type()); }
cql3_type::kind operator()(const tuple_type_impl&) { SCYLLA_ASSERT(0 && "no kind for this type"); }
cql3_type::kind operator()(const vector_type_impl&) { SCYLLA_ASSERT(0 && "no kind for this type"); }
cql3_type::kind operator()(const collection_type_impl&) { SCYLLA_ASSERT(0 && "no kind for this type"); }
cql3_type::kind operator()(const tuple_type_impl&) { throwing_assert(0 && "no kind for this type"); }
cql3_type::kind operator()(const vector_type_impl&) { throwing_assert(0 && "no kind for this type"); }
cql3_type::kind operator()(const collection_type_impl&) { throwing_assert(0 && "no kind for this type"); }
};
return visit(t, visitor{});
}
@@ -124,7 +124,7 @@ class cql3_type::raw_collection : public raw {
} else if (_kind == abstract_type::kind::map) {
return format("{}map<{}, {}>{}", start, _keys, _values, end);
}
abort();
throwing_assert(0 && "invalid raw_collection kind");
}
public:
raw_collection(const abstract_type::kind kind, shared_ptr<raw> keys, shared_ptr<raw> values)
@@ -150,7 +150,7 @@ public:
}
virtual cql3_type prepare_internal(const sstring& keyspace, const data_dictionary::user_types_metadata& user_types) override {
SCYLLA_ASSERT(_values); // "Got null values type for a collection";
throwing_assert(_values); // "Got null values type for a collection";
if (_values->is_counter()) {
throw exceptions::invalid_request_exception(format("Counters are not allowed inside collections: {}", *this));
@@ -190,7 +190,7 @@ private:
}
return cql3_type(set_type_impl::get_instance(_values->prepare_internal(keyspace, user_types).get_type(), !is_frozen()));
} else if (_kind == abstract_type::kind::map) {
SCYLLA_ASSERT(_keys); // "Got null keys type for a collection";
throwing_assert(_keys); // "Got null keys type for a collection";
if (_keys->is_duration()) {
throw exceptions::invalid_request_exception(format("Durations are not allowed as map keys: {}", *this));
}
@@ -198,7 +198,7 @@ private:
_values->prepare_internal(keyspace, user_types).get_type(),
!is_frozen()));
}
abort();
throwing_assert(0 && "do_prepare invalid kind");
}
};

View File

@@ -1603,7 +1603,7 @@ static cql3::raw_value do_evaluate(const collection_constructor& collection, con
case collection_constructor::style_type::vector:
return evaluate_vector(collection, inputs);
}
std::abort();
throwing_assert(0 && "do_evaluate invalid style");
}
static cql3::raw_value do_evaluate(const usertype_constructor& user_val, const evaluation_inputs& inputs) {

View File

@@ -876,7 +876,7 @@ cast_test_assignment(const cast& c, data_dictionary::database db, const sstring&
return assignment_testable::test_result::NOT_ASSIGNABLE;
}
} catch (exceptions::invalid_request_exception& e) {
abort();
throwing_assert(0 && "cast_test_assignment exception");
}
}

View File

@@ -544,7 +544,7 @@ functions::get_user_aggregates(const sstring& keyspace) const {
std::ranges::subrange<functions::declared_t::const_iterator>
functions::find(const function_name& name) const {
SCYLLA_ASSERT(name.has_keyspace()); // : "function name not fully qualified";
throwing_assert(name.has_keyspace()); // : "function name not fully qualified";
auto pair = _declared.equal_range(name);
return std::ranges::subrange(pair.first, pair.second);
}

View File

@@ -10,8 +10,9 @@
#include "types/types.hh"
#include "types/vector.hh"
#include "exceptions/exceptions.hh"
#include <span>
#include <bit>
#include <span>
#include <seastar/core/byteorder.hh>
namespace cql3 {
namespace functions {
@@ -30,14 +31,10 @@ std::vector<float> extract_float_vector(const bytes_opt& param, vector_dimension
expected_size, dimension, param->size()));
}
std::vector<float> result;
result.reserve(dimension);
bytes_view view(*param);
std::vector<float> result(dimension);
const char* p = reinterpret_cast<const char*>(param->data());
for (size_t i = 0; i < dimension; ++i) {
// read_simple handles network byte order (big-endian) conversion
uint32_t raw = read_simple<uint32_t>(view);
result.push_back(std::bit_cast<float>(raw));
result[i] = std::bit_cast<float>(consume_be<uint32_t>(p));
}
return result;
@@ -55,13 +52,14 @@ namespace {
// You should only use this function if you need to preserve the original vectors and cannot normalize
// them in advance.
float compute_cosine_similarity(std::span<const float> v1, std::span<const float> v2) {
double dot_product = 0.0;
double squared_norm_a = 0.0;
double squared_norm_b = 0.0;
#pragma clang fp contract(fast) reassociate(on) // Allow the compiler to optimize the loop.
float dot_product = 0.0;
float squared_norm_a = 0.0;
float squared_norm_b = 0.0;
for (size_t i = 0; i < v1.size(); ++i) {
double a = v1[i];
double b = v2[i];
float a = v1[i];
float b = v2[i];
dot_product += a * b;
squared_norm_a += a * a;
@@ -79,13 +77,14 @@ float compute_cosine_similarity(std::span<const float> v1, std::span<const float
}
float compute_euclidean_similarity(std::span<const float> v1, std::span<const float> v2) {
double sum = 0.0;
#pragma clang fp contract(fast) reassociate(on) // Allow the compiler to optimize the loop.
float sum = 0.0;
for (size_t i = 0; i < v1.size(); ++i) {
double a = v1[i];
double b = v2[i];
float a = v1[i];
float b = v2[i];
double diff = a - b;
float diff = a - b;
sum += diff * diff;
}
@@ -98,11 +97,12 @@ float compute_euclidean_similarity(std::span<const float> v1, std::span<const fl
// Assumes that both vectors are L2-normalized.
// This similarity is intended as an optimized way to perform cosine similarity calculation.
float compute_dot_product_similarity(std::span<const float> v1, std::span<const float> v2) {
double dot_product = 0.0;
#pragma clang fp contract(fast) reassociate(on) // Allow the compiler to optimize the loop.
float dot_product = 0.0;
for (size_t i = 0; i < v1.size(); ++i) {
double a = v1[i];
double b = v2[i];
float a = v1[i];
float b = v2[i];
dot_product += a * b;
}

View File

@@ -25,7 +25,7 @@ bool keyspace_element_name::has_keyspace() const
const sstring& keyspace_element_name::get_keyspace() const
{
SCYLLA_ASSERT(_ks_name);
throwing_assert(_ks_name);
return *_ks_name;
}

View File

@@ -62,7 +62,7 @@ lists::setter_by_index::fill_prepare_context(prepare_context& ctx) {
void
lists::setter_by_index::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
// we should not get here for frozen lists
SCYLLA_ASSERT(column.type->is_multi_cell()); // "Attempted to set an individual element on a frozen list";
throwing_assert(column.type->is_multi_cell()); // "Attempted to set an individual element on a frozen list";
auto index = expr::evaluate(_idx, params._options);
if (index.is_null()) {
@@ -105,7 +105,7 @@ lists::setter_by_uuid::requires_read() const {
void
lists::setter_by_uuid::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
// we should not get here for frozen lists
SCYLLA_ASSERT(column.type->is_multi_cell()); // "Attempted to set an individual element on a frozen list";
throwing_assert(column.type->is_multi_cell()); // "Attempted to set an individual element on a frozen list";
auto index = expr::evaluate(_idx, params._options);
auto value = expr::evaluate(*_e, params._options);
@@ -133,7 +133,7 @@ lists::setter_by_uuid::execute(mutation& m, const clustering_key_prefix& prefix,
void
lists::appender::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
const cql3::raw_value value = expr::evaluate(*_e, params._options);
SCYLLA_ASSERT(column.type->is_multi_cell()); // "Attempted to append to a frozen list";
throwing_assert(column.type->is_multi_cell()); // "Attempted to append to a frozen list";
do_append(value, m, prefix, column, params);
}
@@ -189,7 +189,7 @@ lists::do_append(const cql3::raw_value& list_value,
void
lists::prepender::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
SCYLLA_ASSERT(column.type->is_multi_cell()); // "Attempted to prepend to a frozen list";
throwing_assert(column.type->is_multi_cell()); // "Attempted to prepend to a frozen list";
cql3::raw_value lvalue = expr::evaluate(*_e, params._options);
if (lvalue.is_null()) {
return;
@@ -244,7 +244,7 @@ lists::discarder::requires_read() const {
void
lists::discarder::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
SCYLLA_ASSERT(column.type->is_multi_cell()); // "Attempted to delete from a frozen list";
throwing_assert(column.type->is_multi_cell()); // "Attempted to delete from a frozen list";
auto&& existing_list = params.get_prefetched_list(m.key(), prefix, column);
// We want to call bind before possibly returning to reject queries where the value provided is not a list.
@@ -300,7 +300,7 @@ lists::discarder_by_index::requires_read() const {
void
lists::discarder_by_index::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
SCYLLA_ASSERT(column.type->is_multi_cell()); // "Attempted to delete an item by index from a frozen list";
throwing_assert(column.type->is_multi_cell()); // "Attempted to delete an item by index from a frozen list";
cql3::raw_value index = expr::evaluate(*_e, params._options);
if (index.is_null()) {
throw exceptions::invalid_request_exception("Invalid null value for list index");

View File

@@ -45,7 +45,7 @@ maps::setter_by_key::fill_prepare_context(prepare_context& ctx) {
void
maps::setter_by_key::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
using exceptions::invalid_request_exception;
SCYLLA_ASSERT(column.type->is_multi_cell()); // "Attempted to set a value for a single key on a frozen map"m
throwing_assert(column.type->is_multi_cell()); // "Attempted to set a value for a single key on a frozen map"m
auto key = expr::evaluate(_k, params._options);
auto value = expr::evaluate(*_e, params._options);
if (key.is_null()) {
@@ -63,7 +63,7 @@ maps::setter_by_key::execute(mutation& m, const clustering_key_prefix& prefix, c
void
maps::putter::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
SCYLLA_ASSERT(column.type->is_multi_cell()); // "Attempted to add items to a frozen map";
throwing_assert(column.type->is_multi_cell()); // "Attempted to add items to a frozen map";
cql3::raw_value value = expr::evaluate(*_e, params._options);
do_put(m, prefix, params, value, column);
}
@@ -96,7 +96,7 @@ maps::do_put(mutation& m, const clustering_key_prefix& prefix, const update_para
void
maps::discarder_by_key::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
SCYLLA_ASSERT(column.type->is_multi_cell()); // "Attempted to delete a single key in a frozen map";
throwing_assert(column.type->is_multi_cell()); // "Attempted to delete a single key in a frozen map";
cql3::raw_value key = expr::evaluate(*_e, params._options);
if (key.is_null()) {
throw exceptions::invalid_request_exception("Invalid null map key");

View File

@@ -67,7 +67,7 @@ operation::set_element::prepare(data_dictionary::database db, const sstring& key
verify_no_aggregate_functions(mval, "SET clause");
return make_shared<maps::setter_by_key>(receiver, std::move(key), std::move(mval));
}
abort();
throwing_assert(0 && "prepare set_element collection type");
}
bool
@@ -166,7 +166,7 @@ operation::addition::prepare(data_dictionary::database db, const sstring& keyspa
} else if (ctype->get_kind() == abstract_type::kind::map) {
return make_shared<maps::putter>(receiver, std::move(v));
} else {
abort();
throwing_assert(0 && "prepare addition collection type");
}
}
@@ -216,7 +216,7 @@ operation::subtraction::prepare(data_dictionary::database db, const sstring& key
verify_no_aggregate_functions(v, "SET clause");
return ::make_shared<sets::discarder>(receiver, std::move(v));
}
abort();
throwing_assert(0 && "prepare subtraction collection type");
}
bool
@@ -267,7 +267,7 @@ operation::set_value::prepare(data_dictionary::database db, const sstring& keysp
} else if (k == abstract_type::kind::map) {
return make_shared<maps::setter>(receiver, std::move(v));
} else {
abort();
throwing_assert(0 && "prepare set_value collection type");
}
}
@@ -385,7 +385,7 @@ operation::element_deletion::prepare(data_dictionary::database db, const sstring
verify_no_aggregate_functions(key, "SET clause");
return make_shared<maps::discarder_by_key>(receiver, std::move(key));
}
abort();
throwing_assert(0 && "prepare element_deletion collection type");
}
expr::expression

View File

@@ -105,6 +105,7 @@ public:
static const std::chrono::minutes entry_expiry;
using key_type = prepared_cache_key_type;
using pinned_value_type = cache_value_ptr;
using value_type = checked_weak_ptr;
using statement_is_too_big = typename cache_type::entry_is_too_big;
@@ -116,9 +117,14 @@ public:
: _cache(size, entry_expiry, logger)
{}
template <typename LoadFunc>
future<pinned_value_type> get_pinned(const key_type& key, LoadFunc&& load) {
return _cache.get_ptr(key.key(), [load = std::forward<LoadFunc>(load)] (const cache_key_type&) { return load(); });
}
template <typename LoadFunc>
future<value_type> get(const key_type& key, LoadFunc&& load) {
return _cache.get_ptr(key.key(), [load = std::forward<LoadFunc>(load)] (const cache_key_type&) { return load(); }).then([] (cache_value_ptr v_ptr) {
return get_pinned(key, std::forward<LoadFunc>(load)).then([] (cache_value_ptr v_ptr) {
return make_ready_future<value_type>((*v_ptr)->checked_weak_from_this());
});
}

View File

@@ -11,6 +11,7 @@
#include "cql3/query_processor.hh"
#include <seastar/core/metrics.hh>
#include <seastar/core/memory.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/coroutine/parallel_for_each.hh>
#include <seastar/coroutine/as_future.hh>
@@ -47,13 +48,15 @@ const sstring query_processor::CQL_VERSION = "3.3.1";
const std::chrono::minutes prepared_statements_cache::entry_expiry = std::chrono::minutes(60);
struct query_processor::remote {
remote(service::migration_manager& mm, service::mapreduce_service& fwd,
service::storage_service& ss, service::raft_group0_client& group0_client,
service::strong_consistency::coordinator& _sc_coordinator)
: mm(mm), mapreducer(fwd), ss(ss), group0_client(group0_client)
, sc_coordinator(_sc_coordinator)
, gate("query_processor::remote")
{}
remote(service::migration_manager& mm, service::mapreduce_service& fwd, service::storage_service& ss, service::raft_group0_client& group0_client,
service::strong_consistency::coordinator& _sc_coordinator)
: mm(mm)
, mapreducer(fwd)
, ss(ss)
, group0_client(group0_client)
, sc_coordinator(_sc_coordinator)
, gate("query_processor::remote") {
}
service::migration_manager& mm;
service::mapreduce_service& mapreducer;
@@ -76,28 +79,42 @@ static service::query_state query_state_for_internal_call() {
return {service::client_state::for_internal_calls(), empty_service_permit()};
}
query_processor::query_processor(service::storage_proxy& proxy, data_dictionary::database db, service::migration_notifier& mn, vector_search::vector_store_client& vsc, query_processor::memory_config mcfg, cql_config& cql_cfg, utils::loading_cache_config auth_prep_cache_cfg, lang::manager& langm)
: _migration_subscriber{std::make_unique<migration_subscriber>(this)}
, _proxy(proxy)
, _db(db)
, _mnotifier(mn)
, _vector_store_client(vsc)
, _mcfg(mcfg)
, _cql_config(cql_cfg)
, _prepared_cache(prep_cache_log, _mcfg.prepared_statment_cache_size)
, _authorized_prepared_cache(std::move(auth_prep_cache_cfg), authorized_prepared_statements_cache_log)
, _auth_prepared_cache_cfg_cb([this] (uint32_t) { (void) _authorized_prepared_cache_config_action.trigger_later(); })
, _authorized_prepared_cache_config_action([this] { update_authorized_prepared_cache_config(); return make_ready_future<>(); })
, _authorized_prepared_cache_update_interval_in_ms_observer(_db.get_config().permissions_update_interval_in_ms.observe(_auth_prepared_cache_cfg_cb))
, _authorized_prepared_cache_validity_in_ms_observer(_db.get_config().permissions_validity_in_ms.observe(_auth_prepared_cache_cfg_cb))
, _lang_manager(langm)
{
query_processor::query_processor(service::storage_proxy& proxy, data_dictionary::database db, service::migration_notifier& mn,
vector_search::vector_store_client& vsc, query_processor::memory_config mcfg, cql_config& cql_cfg, utils::loading_cache_config auth_prep_cache_cfg,
lang::manager& langm)
: _migration_subscriber{std::make_unique<migration_subscriber>(this)}
, _proxy(proxy)
, _db(db)
, _mnotifier(mn)
, _vector_store_client(vsc)
, _mcfg(mcfg)
, _cql_config(cql_cfg)
, _prepared_cache(prep_cache_log, _mcfg.prepared_statment_cache_size)
, _authorized_prepared_cache(std::move(auth_prep_cache_cfg), authorized_prepared_statements_cache_log)
, _auth_prepared_cache_cfg_cb([this](uint32_t) {
(void)_authorized_prepared_cache_config_action.trigger_later();
})
, _authorized_prepared_cache_config_action([this] {
update_authorized_prepared_cache_config();
return make_ready_future<>();
})
, _authorized_prepared_cache_update_interval_in_ms_observer(_db.get_config().permissions_update_interval_in_ms.observe(_auth_prepared_cache_cfg_cb))
, _authorized_prepared_cache_validity_in_ms_observer(_db.get_config().permissions_validity_in_ms.observe(_auth_prepared_cache_cfg_cb))
, _lang_manager(langm)
, _write_consistency_levels_warned_observer(_db.get_config().write_consistency_levels_warned.observe([this](const auto& v) {
_write_consistency_levels_warned = to_consistency_level_set(v);
}))
, _write_consistency_levels_disallowed_observer(_db.get_config().write_consistency_levels_disallowed.observe([this](const auto& v) {
_write_consistency_levels_disallowed = to_consistency_level_set(v);
})) {
_write_consistency_levels_warned = to_consistency_level_set(_db.get_config().write_consistency_levels_warned());
_write_consistency_levels_disallowed = to_consistency_level_set(_db.get_config().write_consistency_levels_disallowed());
namespace sm = seastar::metrics;
namespace stm = statements;
using clevel = db::consistency_level;
sm::label cl_label("consistency_level");
sm::label who_label("who"); // Who queried system tables
sm::label who_label("who"); // Who queried system tables
const auto user_who_label_instance = who_label("user");
const auto internal_who_label_instance = who_label("internal");
@@ -105,17 +122,11 @@ query_processor::query_processor(service::storage_proxy& proxy, data_dictionary:
const auto system_ks_label_instance = ks_label("system");
std::vector<sm::metric_definition> qp_group;
qp_group.push_back(sm::make_counter(
"statements_prepared",
_stats.prepare_invocations,
sm::description("Counts the total number of parsed CQL requests.")));
qp_group.push_back(sm::make_counter("statements_prepared", _stats.prepare_invocations, sm::description("Counts the total number of parsed CQL requests.")));
for (auto cl = size_t(clevel::MIN_VALUE); cl <= size_t(clevel::MAX_VALUE); ++cl) {
qp_group.push_back(
sm::make_counter(
"queries",
_stats.queries_by_cl[cl],
sm::description("Counts queries by consistency level."),
{cl_label(clevel(cl)), basic_level}).set_skip_when_empty());
qp_group.push_back(sm::make_counter(
"queries", _stats.queries_by_cl[cl], sm::description("Counts queries by consistency level."), {cl_label(clevel(cl)), basic_level})
.set_skip_when_empty());
}
_metrics.add_group("query_processor", qp_group);
@@ -506,8 +517,34 @@ query_processor::query_processor(service::storage_proxy& proxy, data_dictionary:
_cql_stats.replication_strategy_fail_list_violations,
sm::description("Counts the number of replication_strategy_fail_list guardrail violations, "
"i.e. attempts to set a forbidden replication strategy in a keyspace via CREATE/ALTER KEYSPACE.")).set_skip_when_empty(),
sm::make_counter(
"forwarded_requests",
_cql_stats.forwarded_requests,
sm::description("Counts the total number of attempts to forward CQL requests to other nodes. One request may be forwarded multiple times, "
"particularly when a write is handled by a non-replica node.")).set_skip_when_empty(),
});
std::vector<sm::metric_definition> cql_cl_group;
for (auto cl = size_t(clevel::MIN_VALUE); cl <= size_t(clevel::MAX_VALUE); ++cl) {
cql_cl_group.push_back(sm::make_counter("writes_per_consistency_level", _cql_stats.writes_per_consistency_level[cl],
sm::description("Counts the number of writes for each consistency level."), {cl_label(clevel(cl)), basic_level})
.set_skip_when_empty());
}
_metrics.add_group("cql", cql_cl_group);
_metrics.add_group(
"cql", {
sm::make_counter("write_consistency_levels_disallowed_violations", _cql_stats.write_consistency_levels_disallowed_violations,
sm::description("Counts the number of write_consistency_levels_disallowed guardrail violations, "
"i.e. attempts to write with a forbidden consistency level."),
{basic_level}),
sm::make_counter("write_consistency_levels_warned_violations", _cql_stats.write_consistency_levels_warned_violations,
sm::description("Counts the number of write_consistency_levels_warned guardrail violations, "
"i.e. attempts to write with a discouraged consistency level."),
{basic_level}),
});
_mnotifier.register_listener(_migration_subscriber.get());
}
@@ -517,15 +554,13 @@ query_processor::~query_processor() {
}
}
std::pair<std::reference_wrapper<service::strong_consistency::coordinator>, gate::holder>
query_processor::acquire_strongly_consistent_coordinator() {
std::pair<std::reference_wrapper<service::strong_consistency::coordinator>, gate::holder> query_processor::acquire_strongly_consistent_coordinator() {
auto [remote_, holder] = remote();
return {remote_.get().sc_coordinator, std::move(holder)};
}
void query_processor::start_remote(service::migration_manager& mm, service::mapreduce_service& mapreducer,
service::storage_service& ss, service::raft_group0_client& group0_client,
service::strong_consistency::coordinator& sc_coordinator) {
void query_processor::start_remote(service::migration_manager& mm, service::mapreduce_service& mapreducer, service::storage_service& ss,
service::raft_group0_client& group0_client, service::strong_consistency::coordinator& sc_coordinator) {
_remote = std::make_unique<struct remote>(mm, mapreducer, ss, group0_client, sc_coordinator);
}
@@ -545,23 +580,24 @@ future<> query_processor::stop() {
}
future<::shared_ptr<cql_transport::messages::result_message>> query_processor::execute_with_guard(
std::function<future<::shared_ptr<cql_transport::messages::result_message>>(service::query_state&, ::shared_ptr<cql_statement>, const query_options&, std::optional<service::group0_guard>)> fn,
std::function<future<::shared_ptr<cql_transport::messages::result_message>>(
service::query_state&, ::shared_ptr<cql_statement>, const query_options&, std::optional<service::group0_guard>)>
fn,
::shared_ptr<cql_statement> statement, service::query_state& query_state, const query_options& options) {
// execute all statements that need group0 guard on shard0
if (this_shard_id() != 0) {
co_return ::make_shared<cql_transport::messages::result_message::bounce_to_shard>(0,
std::move(const_cast<cql3::query_options&>(options).take_cached_pk_function_calls()));
co_return bounce_to_shard(0, std::move(const_cast<cql3::query_options&>(options).take_cached_pk_function_calls()), false);
}
auto [remote_, holder] = remote();
size_t retries = remote_.get().mm.get_concurrent_ddl_retries();
while (true) {
while (true) {
try {
auto guard = co_await remote_.get().mm.start_group0_operation();
co_return co_await fn(query_state, statement, options, std::move(guard));
} catch (const service::group0_concurrent_modification& ex) {
log.warn("Failed to execute statement \"{}\" due to guard conflict.{}.",
statement->raw_cql_statement, retries ? " Retrying" : " Number of retries exceeded, giving up");
log.warn("Failed to execute statement \"{}\" due to guard conflict.{}.", statement->raw_cql_statement,
retries ? " Retrying" : " Number of retries exceeded, giving up");
if (retries--) {
continue;
}
@@ -570,29 +606,30 @@ future<::shared_ptr<cql_transport::messages::result_message>> query_processor::e
}
}
template<typename... Args>
future<::shared_ptr<result_message>>
query_processor::execute_maybe_with_guard(service::query_state& query_state, ::shared_ptr<cql_statement> statement, const query_options& options,
future<::shared_ptr<result_message>>(query_processor::*fn)(service::query_state&, ::shared_ptr<cql_statement>, const query_options&, std::optional<service::group0_guard>, Args...), Args... args) {
template <typename... Args>
future<::shared_ptr<result_message>> query_processor::execute_maybe_with_guard(service::query_state& query_state, ::shared_ptr<cql_statement> statement,
const query_options& options,
future<::shared_ptr<result_message>> (query_processor::*fn)(
service::query_state&, ::shared_ptr<cql_statement>, const query_options&, std::optional<service::group0_guard>, Args...),
Args... args) {
if (!statement->needs_guard(*this, query_state)) {
return (this->*fn)(query_state, std::move(statement), options, std::nullopt, std::forward<Args>(args)...);
}
static auto exec = [fn] (query_processor& qp, Args... args, service::query_state& query_state, ::shared_ptr<cql_statement> statement, const query_options& options, std::optional<service::group0_guard> guard) {
static auto exec = [fn](query_processor& qp, Args... args, service::query_state& query_state, ::shared_ptr<cql_statement> statement,
const query_options& options, std::optional<service::group0_guard> guard) {
return (qp.*fn)(query_state, std::move(statement), options, std::move(guard), std::forward<Args>(args)...);
};
return execute_with_guard(std::bind_front(exec, std::ref(*this), std::forward<Args>(args)...), std::move(statement), query_state, options);
}
future<::shared_ptr<result_message>>
query_processor::execute_direct_without_checking_exception_message(const std::string_view& query_string, service::query_state& query_state, dialect d, query_options& options) {
future<::shared_ptr<result_message>> query_processor::execute_direct_without_checking_exception_message(
const std::string_view& query_string, service::query_state& query_state, dialect d, query_options& options) {
log.trace("execute_direct: \"{}\"", query_string);
tracing::trace(query_state.get_trace_state(), "Parsing a statement");
auto p = get_statement(query_string, query_state.get_client_state(), d);
auto statement = p->statement;
if (statement->get_bound_terms() != options.get_values_count()) {
const auto msg = format("Invalid amount of bind variables: expected {:d} received {:d}",
statement->get_bound_terms(),
options.get_values_count());
const auto msg = format("Invalid amount of bind variables: expected {:d} received {:d}", statement->get_bound_terms(), options.get_values_count());
throw exceptions::invalid_request_exception(msg);
}
options.prepare(p->bound_names);
@@ -603,17 +640,13 @@ query_processor::execute_direct_without_checking_exception_message(const std::st
metrics.regularStatementsExecuted.inc();
#endif
auto user = query_state.get_client_state().user();
tracing::trace(query_state.get_trace_state(), "Processing a statement for authenticated user: {}", user ? (user->name ? *user->name : "anonymous") : "no user authenticated");
tracing::trace(query_state.get_trace_state(), "Processing a statement for authenticated user: {}",
user ? (user->name ? *user->name : "anonymous") : "no user authenticated");
return execute_maybe_with_guard(query_state, std::move(statement), options, &query_processor::do_execute_direct, std::move(p->warnings));
}
future<::shared_ptr<result_message>>
query_processor::do_execute_direct(
service::query_state& query_state,
shared_ptr<cql_statement> statement,
const query_options& options,
std::optional<service::group0_guard> guard,
cql3::cql_warnings_vec warnings) {
future<::shared_ptr<result_message>> query_processor::do_execute_direct(service::query_state& query_state, shared_ptr<cql_statement> statement,
const query_options& options, std::optional<service::group0_guard> guard, cql3::cql_warnings_vec warnings) {
auto access_future = co_await coroutine::as_future(statement->check_access(*this, query_state.get_client_state()));
if (access_future.failed()) {
co_await audit::inspect(statement, query_state, options, true);
@@ -638,26 +671,16 @@ query_processor::do_execute_direct(
co_return std::move(m);
}
future<::shared_ptr<result_message>>
query_processor::execute_prepared_without_checking_exception_message(
service::query_state& query_state,
shared_ptr<cql_statement> statement,
const query_options& options,
statements::prepared_statement::checked_weak_ptr prepared,
cql3::prepared_cache_key_type cache_key,
bool needs_authorization) {
return execute_maybe_with_guard(query_state, std::move(statement), options, &query_processor::do_execute_prepared, std::move(prepared), std::move(cache_key), needs_authorization);
future<::shared_ptr<result_message>> query_processor::execute_prepared_without_checking_exception_message(service::query_state& query_state,
shared_ptr<cql_statement> statement, const query_options& options, statements::prepared_statement::checked_weak_ptr prepared,
cql3::prepared_cache_key_type cache_key, bool needs_authorization) {
return execute_maybe_with_guard(
query_state, std::move(statement), options, &query_processor::do_execute_prepared, std::move(prepared), std::move(cache_key), needs_authorization);
}
future<::shared_ptr<result_message>>
query_processor::do_execute_prepared(
service::query_state& query_state,
shared_ptr<cql_statement> statement,
const query_options& options,
std::optional<service::group0_guard> guard,
statements::prepared_statement::checked_weak_ptr prepared,
cql3::prepared_cache_key_type cache_key,
bool needs_authorization) {
future<::shared_ptr<result_message>> query_processor::do_execute_prepared(service::query_state& query_state, shared_ptr<cql_statement> statement,
const query_options& options, std::optional<service::group0_guard> guard, statements::prepared_statement::checked_weak_ptr prepared,
cql3::prepared_cache_key_type cache_key, bool needs_authorization) {
if (needs_authorization) {
co_await statement->check_access(*this, query_state.get_client_state());
try {
@@ -671,8 +694,8 @@ query_processor::do_execute_prepared(
co_return co_await process_authorized_statement(std::move(statement), query_state, options, std::move(guard));
}
future<::shared_ptr<result_message>>
query_processor::process_authorized_statement(const ::shared_ptr<cql_statement> statement, service::query_state& query_state, const query_options& options, std::optional<service::group0_guard> guard) {
future<::shared_ptr<result_message>> query_processor::process_authorized_statement(const ::shared_ptr<cql_statement> statement,
service::query_state& query_state, const query_options& options, std::optional<service::group0_guard> guard) {
auto& client_state = query_state.get_client_state();
++_stats.queries_by_cl[size_t(options.get_consistency())];
@@ -682,43 +705,39 @@ query_processor::process_authorized_statement(const ::shared_ptr<cql_statement>
auto msg = co_await statement->execute_without_checking_exception_message(*this, query_state, options, std::move(guard));
if (msg) {
co_return std::move(msg);
co_return std::move(msg);
}
co_return ::make_shared<result_message::void_message>();
}
future<::shared_ptr<cql_transport::messages::result_message::prepared>>
query_processor::prepare(sstring query_string, service::query_state& query_state, cql3::dialect d) {
future<::shared_ptr<cql_transport::messages::result_message::prepared>> query_processor::prepare(
sstring query_string, service::query_state& query_state, cql3::dialect d) {
auto& client_state = query_state.get_client_state();
return prepare(std::move(query_string), client_state, d);
}
future<::shared_ptr<cql_transport::messages::result_message::prepared>>
query_processor::prepare(sstring query_string, const service::client_state& client_state, cql3::dialect d) {
future<::shared_ptr<cql_transport::messages::result_message::prepared>> query_processor::prepare(
sstring query_string, const service::client_state& client_state, cql3::dialect d) {
try {
auto key = compute_id(query_string, client_state.get_raw_keyspace(), d);
auto prep_ptr = co_await _prepared_cache.get(key, [this, &query_string, &client_state, d] {
auto prepared = get_statement(query_string, client_state, d);
prepared->calculate_metadata_id();
auto bound_terms = prepared->statement->get_bound_terms();
if (bound_terms > std::numeric_limits<uint16_t>::max()) {
throw exceptions::invalid_request_exception(
format("Too many markers(?). {:d} markers exceed the allowed maximum of {:d}",
bound_terms,
std::numeric_limits<uint16_t>::max()));
}
SCYLLA_ASSERT(bound_terms == prepared->bound_names.size());
return make_ready_future<std::unique_ptr<statements::prepared_statement>>(std::move(prepared));
});
auto prep_entry = co_await _prepared_cache.get_pinned(key, [this, &query_string, &client_state, d] {
auto prepared = get_statement(query_string, client_state, d);
prepared->calculate_metadata_id();
auto bound_terms = prepared->statement->get_bound_terms();
if (bound_terms > std::numeric_limits<uint16_t>::max()) {
throw exceptions::invalid_request_exception(
format("Too many markers(?). {:d} markers exceed the allowed maximum of {:d}", bound_terms, std::numeric_limits<uint16_t>::max()));
}
throwing_assert(bound_terms == prepared->bound_names.size());
return make_ready_future<std::unique_ptr<statements::prepared_statement>>(std::move(prepared));
});
const auto& warnings = prep_ptr->warnings;
const auto msg = ::make_shared<result_message::prepared::cql>(prepared_cache_key_type::cql_id(key), std::move(prep_ptr),
client_state.is_protocol_extension_set(cql_transport::cql_protocol_extension::LWT_ADD_METADATA_MARK));
for (const auto& w : warnings) {
msg->add_warning(w);
}
co_return ::shared_ptr<cql_transport::messages::result_message::prepared>(std::move(msg));
} catch(typename prepared_statements_cache::statement_is_too_big&) {
co_await utils::get_local_injector().inject("query_processor_prepare_wait_after_cache_get", utils::wait_for_message(std::chrono::seconds(60)));
auto msg = ::make_shared<result_message::prepared::cql>(prepared_cache_key_type::cql_id(key), std::move(prep_entry),
client_state.is_protocol_extension_set(cql_transport::cql_protocol_extension::LWT_ADD_METADATA_MARK));
co_return std::move(msg);
} catch (typename prepared_statements_cache::statement_is_too_big&) {
throw prepared_statement_is_too_big(query_string);
}
}
@@ -729,15 +748,15 @@ static std::string hash_target(std::string_view query_string, std::string_view k
return ret;
}
prepared_cache_key_type query_processor::compute_id(
std::string_view query_string,
std::string_view keyspace,
dialect d) {
prepared_cache_key_type query_processor::compute_id(std::string_view query_string, std::string_view keyspace, dialect d) {
return prepared_cache_key_type(md5_hasher::calculate(hash_target(query_string, keyspace)), d);
}
std::unique_ptr<prepared_statement>
query_processor::get_statement(const std::string_view& query, const service::client_state& client_state, dialect d) {
std::unique_ptr<prepared_statement> query_processor::get_statement(const std::string_view& query, const service::client_state& client_state, dialect d) {
// Measuring allocation cost requires that no yield points exist
// between bytes_before and bytes_after. It needs fixing if this
// function is ever futurized.
auto bytes_before = seastar::memory::stats().total_bytes_allocated();
std::unique_ptr<raw::parsed_statement> statement = parse_statement(query, d);
// Set keyspace for statement that require login
@@ -753,11 +772,12 @@ query_processor::get_statement(const std::string_view& query, const service::cli
audit_info->set_query_string(query);
p->statement->sanitize_audit_info();
}
auto bytes_after = seastar::memory::stats().total_bytes_allocated();
_parsing_cost_tracker.add_sample(bytes_after - bytes_before);
return p;
}
std::unique_ptr<raw::parsed_statement>
query_processor::parse_statement(const std::string_view& query, dialect d) {
std::unique_ptr<raw::parsed_statement> query_processor::parse_statement(const std::string_view& query, dialect d) {
try {
{
const char* error_injection_key = "query_processor-parse_statement-test_failure";
@@ -782,8 +802,7 @@ query_processor::parse_statement(const std::string_view& query, dialect d) {
}
}
std::vector<std::unique_ptr<raw::parsed_statement>>
query_processor::parse_statements(std::string_view queries, dialect d) {
std::vector<std::unique_ptr<raw::parsed_statement>> query_processor::parse_statements(std::string_view queries, dialect d) {
try {
auto statements = util::do_with_parser(queries, d, std::mem_fn(&cql3_parser::CqlParser::queries));
if (statements.empty()) {
@@ -812,15 +831,10 @@ std::pair<std::reference_wrapper<struct query_processor::remote>, gate::holder>
on_internal_error(log, "attempted to perform distributed query when `query_processor::remote` is unavailable");
}
query_options query_processor::make_internal_options(
const statements::prepared_statement::checked_weak_ptr& p,
const std::vector<data_value_or_unset>& values,
db::consistency_level cl,
int32_t page_size,
service::node_local_only node_local_only) const {
query_options query_processor::make_internal_options(const statements::prepared_statement::checked_weak_ptr& p, const std::vector<data_value_or_unset>& values,
db::consistency_level cl, int32_t page_size, service::node_local_only node_local_only) const {
if (p->bound_names.size() != values.size()) {
throw std::invalid_argument(
format("Invalid number of values. Expecting {:d} but got {:d}", p->bound_names.size(), values.size()));
throw std::invalid_argument(format("Invalid number of values. Expecting {:d} but got {:d}", p->bound_names.size(), values.size()));
}
auto ni = p->bound_names.begin();
raw_value_vector_with_unset bound_values;
@@ -828,32 +842,28 @@ query_options query_processor::make_internal_options(
bound_values.unset.resize(values.size());
for (auto& var : values) {
auto& n = *ni;
std::visit(overloaded_functor {
[&] (const data_value& v) {
if (v.type() == bytes_type) {
bound_values.values.emplace_back(cql3::raw_value::make_value(value_cast<bytes>(v)));
} else if (v.is_null()) {
bound_values.values.emplace_back(cql3::raw_value::make_null());
} else {
bound_values.values.emplace_back(cql3::raw_value::make_value(n->type->decompose(v)));
}
}, [&] (const unset_value&) {
bound_values.values.emplace_back(cql3::raw_value::make_null());
bound_values.unset[std::distance(p->bound_names.begin(), ni)] = true;
}
}, var);
std::visit(overloaded_functor{[&](const data_value& v) {
if (v.type() == bytes_type) {
bound_values.values.emplace_back(cql3::raw_value::make_value(value_cast<bytes>(v)));
} else if (v.is_null()) {
bound_values.values.emplace_back(cql3::raw_value::make_null());
} else {
bound_values.values.emplace_back(cql3::raw_value::make_value(n->type->decompose(v)));
}
},
[&](const unset_value&) {
bound_values.values.emplace_back(cql3::raw_value::make_null());
bound_values.unset[std::distance(p->bound_names.begin(), ni)] = true;
}},
var);
++ni;
}
return query_options(
cl,
std::move(bound_values),
cql3::query_options::specific_options {
.page_size = page_size,
.state = {},
.serial_consistency = db::consistency_level::SERIAL,
.timestamp = api::missing_timestamp,
.node_local_only = node_local_only
});
return query_options(cl, std::move(bound_values),
cql3::query_options::specific_options{.page_size = page_size,
.state = {},
.serial_consistency = db::consistency_level::SERIAL,
.timestamp = api::missing_timestamp,
.node_local_only = node_local_only});
}
statements::prepared_statement::checked_weak_ptr query_processor::prepare_internal(const sstring& query_string) {
@@ -875,11 +885,7 @@ struct internal_query_state {
};
internal_query_state query_processor::create_paged_state(
const sstring& query_string,
db::consistency_level cl,
const data_value_list& values,
int32_t page_size,
std::optional<service::query_state> qs) {
const sstring& query_string, db::consistency_level cl, const data_value_list& values, int32_t page_size, std::optional<service::query_state> qs) {
auto p = prepare_internal(query_string);
auto opts = make_internal_options(p, values, cl, page_size);
if (!qs) {
@@ -893,8 +899,7 @@ bool query_processor::has_more_results(cql3::internal_query_state& state) const
}
future<> query_processor::for_each_cql_result(
cql3::internal_query_state& state,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set::row&)> f) {
cql3::internal_query_state& state, noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set::row&)> f) {
do {
auto msg = co_await execute_paged_internal(state);
for (auto& row : *msg) {
@@ -905,17 +910,18 @@ future<> query_processor::for_each_cql_result(
} while (has_more_results(state));
}
future<::shared_ptr<untyped_result_set>>
query_processor::execute_paged_internal(internal_query_state& state) {
future<::shared_ptr<untyped_result_set>> query_processor::execute_paged_internal(internal_query_state& state) {
state.p->statement->validate(*this, service::client_state::for_internal_calls());
::shared_ptr<cql_transport::messages::result_message> msg =
co_await state.p->statement->execute(*this, *state.qs, *state.opts, std::nullopt);
::shared_ptr<cql_transport::messages::result_message> msg = co_await state.p->statement->execute(*this, *state.qs, *state.opts, std::nullopt);
class visitor : public result_message::visitor_base {
internal_query_state& _state;
query_processor& _qp;
public:
visitor(internal_query_state& state, query_processor& qp) : _state(state), _qp(qp) {
visitor(internal_query_state& state, query_processor& qp)
: _state(state)
, _qp(qp) {
}
virtual ~visitor() = default;
void visit(const result_message::rows& rmrs) override {
@@ -944,23 +950,14 @@ query_processor::execute_paged_internal(internal_query_state& state) {
co_return ::make_shared<untyped_result_set>(msg);
}
future<::shared_ptr<untyped_result_set>>
query_processor::execute_internal(
const sstring& query_string,
db::consistency_level cl,
const data_value_list& values,
cache_internal cache) {
future<::shared_ptr<untyped_result_set>> query_processor::execute_internal(
const sstring& query_string, db::consistency_level cl, const data_value_list& values, cache_internal cache) {
auto qs = query_state_for_internal_call();
co_return co_await execute_internal(query_string, cl, qs, values, cache);
}
future<::shared_ptr<untyped_result_set>>
query_processor::execute_internal(
const sstring& query_string,
db::consistency_level cl,
service::query_state& query_state,
const data_value_list& values,
cache_internal cache) {
future<::shared_ptr<untyped_result_set>> query_processor::execute_internal(
const sstring& query_string, db::consistency_level cl, service::query_state& query_state, const data_value_list& values, cache_internal cache) {
if (log.is_enabled(logging::log_level::trace)) {
log.trace("execute_internal: {}\"{}\" ({})", cache ? "(cached) " : "", query_string, fmt::join(values, ", "));
@@ -978,10 +975,7 @@ query_processor::execute_internal(
}
future<utils::chunked_vector<mutation>> query_processor::get_mutations_internal(
const sstring query_string,
service::query_state& query_state,
api::timestamp_type timestamp,
std::vector<data_value_or_unset> values) {
const sstring query_string, service::query_state& query_state, api::timestamp_type timestamp, std::vector<data_value_or_unset> values) {
log.debug("get_mutations_internal: \"{}\" ({})", query_string, fmt::join(values, ", "));
auto stmt = prepare_internal(query_string);
auto mod_stmt = dynamic_pointer_cast<cql3::statements::modification_statement>(stmt->statement);
@@ -999,12 +993,8 @@ future<utils::chunked_vector<mutation>> query_processor::get_mutations_internal(
co_return co_await mod_stmt->get_mutations(*this, opts, timeout, true, timestamp, query_state, json_cache, std::move(keys));
}
future<::shared_ptr<untyped_result_set>>
query_processor::execute_with_params(
statements::prepared_statement::checked_weak_ptr p,
db::consistency_level cl,
service::query_state& query_state,
const data_value_list& values) {
future<::shared_ptr<untyped_result_set>> query_processor::execute_with_params(
statements::prepared_statement::checked_weak_ptr p, db::consistency_level cl, service::query_state& query_state, const data_value_list& values) {
auto opts = make_internal_options(p, values, cl);
auto statement = p->statement;
@@ -1012,30 +1002,24 @@ query_processor::execute_with_params(
co_return ::make_shared<untyped_result_set>(msg);
}
future<::shared_ptr<result_message>>
query_processor::do_execute_with_params(
service::query_state& query_state,
shared_ptr<cql_statement> statement,
const query_options& options, std::optional<service::group0_guard> guard) {
future<::shared_ptr<result_message>> query_processor::do_execute_with_params(
service::query_state& query_state, shared_ptr<cql_statement> statement, const query_options& options, std::optional<service::group0_guard> guard) {
statement->validate(*this, service::client_state::for_internal_calls());
co_return co_await coroutine::try_future(statement->execute(*this, query_state, options, std::move(guard)));
}
future<::shared_ptr<cql_transport::messages::result_message>>
query_processor::execute_batch_without_checking_exception_message(
::shared_ptr<statements::batch_statement> batch,
service::query_state& query_state,
query_options& options,
future<::shared_ptr<cql_transport::messages::result_message>> query_processor::execute_batch_without_checking_exception_message(
::shared_ptr<statements::batch_statement> batch, service::query_state& query_state, query_options& options,
std::unordered_map<prepared_cache_key_type, authorized_prepared_statements_cache::value_type> pending_authorization_entries) {
auto access_future = co_await coroutine::as_future(batch->check_access(*this, query_state.get_client_state()));
co_await coroutine::parallel_for_each(pending_authorization_entries, [this, &query_state] (auto& e) -> future<> {
try {
co_await _authorized_prepared_cache.insert(*query_state.get_client_state().user(), e.first, std::move(e.second));
} catch (...) {
log.error("failed to cache the entry: {}", std::current_exception());
}
});
co_await coroutine::parallel_for_each(pending_authorization_entries, [this, &query_state](auto& e) -> future<> {
try {
co_await _authorized_prepared_cache.insert(*query_state.get_client_state().user(), e.first, std::move(e.second));
} catch (...) {
log.error("failed to cache the entry: {}", std::current_exception());
}
});
bool failed = access_future.failed();
co_await audit::inspect(batch, query_state, options, failed);
if (access_future.failed()) {
@@ -1044,30 +1028,28 @@ query_processor::execute_batch_without_checking_exception_message(
batch->validate();
batch->validate(*this, query_state.get_client_state());
_stats.queries_by_cl[size_t(options.get_consistency())] += batch->get_statements().size();
if (log.is_enabled(logging::log_level::trace)) {
if (log.is_enabled(logging::log_level::trace)) {
std::ostringstream oss;
for (const auto& s: batch->get_statements()) {
oss << std::endl << s.statement->raw_cql_statement;
for (const auto& s : batch->get_statements()) {
oss << std::endl << s.statement->raw_cql_statement;
}
log.trace("execute_batch({}): {}", batch->get_statements().size(), oss.str());
}
co_return co_await batch->execute(*this, query_state, options, std::nullopt);
}
future<service::broadcast_tables::query_result>
query_processor::execute_broadcast_table_query(const service::broadcast_tables::query& query) {
future<service::broadcast_tables::query_result> query_processor::execute_broadcast_table_query(const service::broadcast_tables::query& query) {
auto [remote_, holder] = remote();
co_return co_await service::broadcast_tables::execute(remote_.get().group0_client, query);
}
future<query::mapreduce_result>
query_processor::mapreduce(query::mapreduce_request req, tracing::trace_state_ptr tr_state) {
future<query::mapreduce_result> query_processor::mapreduce(query::mapreduce_request req, tracing::trace_state_ptr tr_state) {
auto [remote_, holder] = remote();
co_return co_await remote_.get().mapreducer.dispatch(std::move(req), std::move(tr_state));
}
future<::shared_ptr<messages::result_message>>
query_processor::execute_schema_statement(const statements::schema_altering_statement& stmt, service::query_state& state, const query_options& options, service::group0_batch& mc) {
future<::shared_ptr<messages::result_message>> query_processor::execute_schema_statement(
const statements::schema_altering_statement& stmt, service::query_state& state, const query_options& options, service::group0_batch& mc) {
if (this_shard_id() != 0) {
on_internal_error(log, "DDL must be executed on shard 0");
}
@@ -1121,7 +1103,8 @@ future<> query_processor::announce_schema_statement(const statements::schema_alt
co_await remote_.get().mm.announce(std::move(m), std::move(guard), description);
}
query_processor::migration_subscriber::migration_subscriber(query_processor* qp) : _qp{qp} {
query_processor::migration_subscriber::migration_subscriber(query_processor* qp)
: _qp{qp} {
}
void query_processor::migration_subscriber::on_create_keyspace(const sstring& ks_name) {
@@ -1147,10 +1130,7 @@ void query_processor::migration_subscriber::on_create_view(const sstring& ks_nam
void query_processor::migration_subscriber::on_update_keyspace(const sstring& ks_name) {
}
void query_processor::migration_subscriber::on_update_column_family(
const sstring& ks_name,
const sstring& cf_name,
bool columns_changed) {
void query_processor::migration_subscriber::on_update_column_family(const sstring& ks_name, const sstring& cf_name, bool columns_changed) {
// #1255: Ignoring columns_changed deliberately.
log.info("Column definitions for {}.{} changed, invalidating related prepared statements", ks_name, cf_name);
remove_invalid_prepared_statements(ks_name, cf_name);
@@ -1165,9 +1145,7 @@ void query_processor::migration_subscriber::on_update_function(const sstring& ks
void query_processor::migration_subscriber::on_update_aggregate(const sstring& ks_name, const sstring& aggregate_name) {
}
void query_processor::migration_subscriber::on_update_view(
const sstring& ks_name,
const sstring& view_name, bool columns_changed) {
void query_processor::migration_subscriber::on_update_view(const sstring& ks_name, const sstring& view_name, bool columns_changed) {
// scylladb/scylladb#16392 - Materialized views are also tables so we need at least handle
// them as such when changed.
on_update_column_family(ks_name, view_name, columns_changed);
@@ -1196,48 +1174,54 @@ void query_processor::migration_subscriber::on_drop_view(const sstring& ks_name,
remove_invalid_prepared_statements(ks_name, view_name);
}
void query_processor::migration_subscriber::remove_invalid_prepared_statements(
sstring ks_name,
std::optional<sstring> cf_name) {
_qp->_prepared_cache.remove_if([&] (::shared_ptr<cql_statement> stmt) {
void query_processor::migration_subscriber::remove_invalid_prepared_statements(sstring ks_name, std::optional<sstring> cf_name) {
_qp->_prepared_cache.remove_if([&](::shared_ptr<cql_statement> stmt) {
return this->should_invalidate(ks_name, cf_name, stmt);
});
}
bool query_processor::migration_subscriber::should_invalidate(
sstring ks_name,
std::optional<sstring> cf_name,
::shared_ptr<cql_statement> statement) {
bool query_processor::migration_subscriber::should_invalidate(sstring ks_name, std::optional<sstring> cf_name, ::shared_ptr<cql_statement> statement) {
return statement->depends_on(ks_name, cf_name);
}
future<> query_processor::query_internal(
const sstring& query_string,
db::consistency_level cl,
const data_value_list& values,
int32_t page_size,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)> f,
std::optional<service::query_state> qs) {
future<> query_processor::query_internal(const sstring& query_string, db::consistency_level cl, const data_value_list& values, int32_t page_size,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)> f, std::optional<service::query_state> qs) {
auto query_state = create_paged_state(query_string, cl, values, page_size, std::move(qs));
co_return co_await for_each_cql_result(query_state, std::move(f));
}
future<> query_processor::query_internal(
const sstring& query_string,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)> f) {
future<> query_processor::query_internal(const sstring& query_string, noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)> f) {
return query_internal(query_string, db::consistency_level::ONE, {}, 1000, std::move(f));
}
shared_ptr<cql_transport::messages::result_message> query_processor::bounce_to_shard(unsigned shard, cql3::computed_function_values cached_fn_calls) {
_proxy.get_stats().replica_cross_shard_ops++;
return ::make_shared<cql_transport::messages::result_message::bounce_to_shard>(shard, std::move(cached_fn_calls));
shared_ptr<cql_transport::messages::result_message> query_processor::bounce_to_shard(
unsigned shard, cql3::computed_function_values cached_fn_calls, bool track) {
if (track) {
_proxy.get_stats().replica_cross_shard_ops++;
}
const auto my_host_id = _proxy.get_token_metadata_ptr()->get_topology().my_host_id();
return ::make_shared<cql_transport::messages::result_message::bounce>(my_host_id, shard, std::move(cached_fn_calls));
}
shared_ptr<cql_transport::messages::result_message> query_processor::bounce_to_node(
locator::tablet_replica replica, cql3::computed_function_values cached_fn_calls, seastar::lowres_clock::time_point timeout, bool is_write) {
get_cql_stats().forwarded_requests++;
return ::make_shared<cql_transport::messages::result_message::bounce>(replica.host, replica.shard, std::move(cached_fn_calls), timeout, is_write);
}
query_processor::consistency_level_set query_processor::to_consistency_level_set(const query_processor::cl_option_list& levels) {
query_processor::consistency_level_set result;
for (const auto& opt : levels) {
result.set(static_cast<db::consistency_level>(opt));
}
return result;
}
void query_processor::update_authorized_prepared_cache_config() {
utils::loading_cache_config cfg;
cfg.max_size = _mcfg.authorized_prepared_cache_size;
cfg.expiry = std::min(std::chrono::milliseconds(_db.get_config().permissions_validity_in_ms()),
std::chrono::duration_cast<std::chrono::milliseconds>(prepared_statements_cache::entry_expiry));
std::chrono::duration_cast<std::chrono::milliseconds>(prepared_statements_cache::entry_expiry));
cfg.refresh = std::chrono::milliseconds(_db.get_config().permissions_update_interval_in_ms());
if (!_authorized_prepared_cache.update_config(std::move(cfg))) {
@@ -1249,4 +1233,4 @@ void query_processor::reset_cache() {
_authorized_prepared_cache.reset();
}
}
} // namespace cql3

View File

@@ -31,9 +31,12 @@
#include "vector_search/vector_store_client.hh"
#include "utils/assert.hh"
#include "utils/observable.hh"
#include "utils/rolling_max_tracker.hh"
#include "service/raft/raft_group0_client.hh"
#include "types/types.hh"
#include "db/auth_version.hh"
#include "db/consistency_level_type.hh"
#include "db/config.hh"
#include "utils/enum_option.hh"
#include "service/storage_proxy_fwd.hh"
@@ -132,6 +135,9 @@ private:
prepared_statements_cache _prepared_cache;
authorized_prepared_statements_cache _authorized_prepared_cache;
// Tracks the rolling maximum of gross bytes allocated during CQL parsing
utils::rolling_max_tracker _parsing_cost_tracker{1000};
std::function<void(uint32_t)> _auth_prepared_cache_cfg_cb;
serialized_action _authorized_prepared_cache_config_action;
utils::observer<uint32_t> _authorized_prepared_cache_update_interval_in_ms_observer;
@@ -142,6 +148,30 @@ private:
std::unordered_map<sstring, std::unique_ptr<statements::prepared_statement>> _internal_statements;
lang::manager& _lang_manager;
using cl_option_list = std::vector<enum_option<db::consistency_level_restriction_t>>;
/// Efficient bitmask-based set of consistency levels.
using consistency_level_set = enum_set<super_enum<db::consistency_level,
db::consistency_level::ANY,
db::consistency_level::ONE,
db::consistency_level::TWO,
db::consistency_level::THREE,
db::consistency_level::QUORUM,
db::consistency_level::ALL,
db::consistency_level::LOCAL_QUORUM,
db::consistency_level::EACH_QUORUM,
db::consistency_level::SERIAL,
db::consistency_level::LOCAL_SERIAL,
db::consistency_level::LOCAL_ONE>>;
consistency_level_set _write_consistency_levels_warned;
consistency_level_set _write_consistency_levels_disallowed;
utils::observer<cl_option_list> _write_consistency_levels_warned_observer;
utils::observer<cl_option_list> _write_consistency_levels_disallowed_observer;
static consistency_level_set to_consistency_level_set(const cl_option_list& levels);
public:
static const sstring CQL_VERSION;
@@ -186,6 +216,11 @@ public:
return _cql_stats;
}
/// Returns the estimated peak memory cost of CQL parsing.
size_t parsing_cost_estimate() const noexcept {
return _parsing_cost_tracker.current_max();
}
lang::manager& lang() { return _lang_manager; }
const vector_search::vector_store_client& vector_store_client() const noexcept {
@@ -196,8 +231,6 @@ public:
return _vector_store_client;
}
db::auth_version_t auth_version;
statements::prepared_statement::checked_weak_ptr get_prepared(const std::optional<auth::authenticated_user>& user, const prepared_cache_key_type& key) {
if (user) {
auto vp = _authorized_prepared_cache.find(*user, key);
@@ -477,7 +510,12 @@ public:
friend class migration_subscriber;
shared_ptr<cql_transport::messages::result_message> bounce_to_shard(unsigned shard, cql3::computed_function_values cached_fn_calls);
shared_ptr<cql_transport::messages::result_message> bounce_to_shard(unsigned shard, cql3::computed_function_values cached_fn_calls, bool track = true);
shared_ptr<cql_transport::messages::result_message> bounce_to_node(
locator::tablet_replica replica,
cql3::computed_function_values cached_fn_calls,
seastar::lowres_clock::time_point timeout,
bool is_write);
void update_authorized_prepared_cache_config();
@@ -493,6 +531,21 @@ public:
int32_t page_size = -1,
service::node_local_only node_local_only = service::node_local_only::no) const;
enum class write_consistency_guardrail_state { NONE, WARN, FAIL };
inline write_consistency_guardrail_state check_write_consistency_levels_guardrail(db::consistency_level cl) {
_cql_stats.writes_per_consistency_level[size_t(cl)]++;
if (_write_consistency_levels_disallowed.contains(cl)) [[unlikely]] {
_cql_stats.write_consistency_levels_disallowed_violations++;
return write_consistency_guardrail_state::FAIL;
}
if (_write_consistency_levels_warned.contains(cl)) [[unlikely]] {
_cql_stats.write_consistency_levels_warned_violations++;
return write_consistency_guardrail_state::WARN;
}
return write_consistency_guardrail_state::NONE;
}
private:
// Keep the holder until you stop using the `remote` services.
std::pair<std::reference_wrapper<remote>, gate::holder> remote();

View File

@@ -89,10 +89,10 @@ public:
*/
void merge(const bounds_slice& other) {
if (has_bound(statements::bound::START)) {
SCYLLA_ASSERT(!other.has_bound(statements::bound::START));
throwing_assert(!other.has_bound(statements::bound::START));
_bounds[get_idx(statements::bound::END)] = other._bounds[get_idx(statements::bound::END)];
} else {
SCYLLA_ASSERT(!other.has_bound(statements::bound::END));
throwing_assert(!other.has_bound(statements::bound::END));
_bounds[get_idx(statements::bound::START)] = other._bounds[get_idx(statements::bound::START)];
}
}

View File

@@ -61,7 +61,7 @@ void metadata::set_paging_state(lw_shared_ptr<const service::pager::paging_state
}
void metadata::maybe_set_paging_state(lw_shared_ptr<const service::pager::paging_state> paging_state) {
SCYLLA_ASSERT(paging_state);
throwing_assert(paging_state);
if (paging_state->get_remaining() > 0) {
set_paging_state(std::move(paging_state));
} else {
@@ -138,7 +138,7 @@ bool result_set::empty() const {
}
void result_set::add_row(std::vector<managed_bytes_opt> row) {
SCYLLA_ASSERT(row.size() == _metadata->value_count());
throwing_assert(row.size() == _metadata->value_count());
_rows.emplace_back(std::move(row));
}

View File

@@ -212,11 +212,20 @@ public:
}
virtual uint32_t add_column_for_post_processing(const column_definition& c) override {
uint32_t index = selection::add_column_for_post_processing(c);
auto it = std::find_if(_selectors.begin(), _selectors.end(), [&c](const expr::expression& e) {
auto col = expr::as_if<expr::column_value>(&e);
return col && col->col == &c;
});
if (it != _selectors.end()) {
return std::distance(_selectors.begin(), it);
}
add_column(c);
get_result_metadata()->add_non_serialized_column(c.column_specification);
_selectors.push_back(expr::column_value(&c));
if (_inner_loop.empty()) {
// Simple case: no aggregation
return index;
return _selectors.size() - 1;
} else {
// Complex case: aggregation, must pass through temporary
auto first_func = cql3::functions::aggregate_fcts::make_first_function(c.type);
@@ -470,10 +479,21 @@ std::vector<const column_definition*> selection::wildcard_columns(schema_ptr sch
return simple_selection::make(schema, std::move(columns), false);
}
uint32_t selection::add_column_for_post_processing(const column_definition& c) {
selection::add_column_result selection::add_column(const column_definition& c) {
auto index = index_of(c);
if (index != -1) {
return {index, false};
}
_columns.push_back(&c);
_metadata->add_non_serialized_column(c.column_specification);
return _columns.size() - 1;
return {_columns.size() - 1, true};
}
uint32_t selection::add_column_for_post_processing(const column_definition& c) {
auto col = add_column(c);
if (col.added) {
_metadata->add_non_serialized_column(c.column_specification);
}
return col.index;
}
::shared_ptr<selection> selection::from_selectors(data_dictionary::database db, schema_ptr schema, const sstring& ks, const std::vector<prepared_selector>& prepared_selectors) {

View File

@@ -130,6 +130,14 @@ public:
virtual std::vector<shared_ptr<functions::function>> used_functions() const { return {}; }
query::partition_slice::option_set get_query_options();
protected:
// Result of add_column: index in _columns and whether it was added now (or existed already).
struct add_column_result {
uint32_t index;
bool added;
};
// Adds a column to the _columns if not already present, returns add_column_result.
add_column_result add_column(const column_definition& c);
private:
static bool processes_selection(const std::vector<prepared_selector>& prepared_selectors);
@@ -348,7 +356,7 @@ public:
add_value(*def, static_row_iterator);
break;
default:
SCYLLA_ASSERT(0);
throwing_assert(0);
}
}
_builder.complete_row();

View File

@@ -34,7 +34,7 @@ sets::setter::execute(mutation& m, const clustering_key_prefix& row_key, const u
void
sets::adder::execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params) {
const cql3::raw_value value = expr::evaluate(*_e, params._options);
SCYLLA_ASSERT(column.type->is_multi_cell()); // "Attempted to add items to a frozen set";
throwing_assert(column.type->is_multi_cell()); // "Attempted to add items to a frozen set";
do_add(m, row_key, params, value, column);
}
@@ -77,7 +77,7 @@ sets::adder::do_add(mutation& m, const clustering_key_prefix& row_key, const upd
void
sets::discarder::execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params) {
SCYLLA_ASSERT(column.type->is_multi_cell()); // "Attempted to remove items from a frozen set";
throwing_assert(column.type->is_multi_cell()); // "Attempted to remove items from a frozen set";
cql3::raw_value svalue = expr::evaluate(*_e, params._options);
if (svalue.is_null()) {
@@ -98,7 +98,7 @@ sets::discarder::execute(mutation& m, const clustering_key_prefix& row_key, cons
void sets::element_discarder::execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params)
{
SCYLLA_ASSERT(column.type->is_multi_cell() && "Attempted to remove items from a frozen set");
throwing_assert(column.type->is_multi_cell() && "Attempted to remove items from a frozen set");
cql3::raw_value elt = expr::evaluate(*_e, params._options);
if (elt.is_null()) {
throw exceptions::invalid_request_exception("Invalid null set element");

View File

@@ -296,7 +296,7 @@ void alter_table_statement::drop_column(const query_options& options, const sche
std::pair<schema_ptr, std::vector<view_ptr>> alter_table_statement::prepare_schema_update(data_dictionary::database db, const query_options& options) const {
auto s = validation::validate_column_family(db, keyspace(), column_family());
if (s->is_view()) {
throw exceptions::invalid_request_exception("Cannot use ALTER TABLE on Materialized View");
throw exceptions::invalid_request_exception("Cannot use ALTER TABLE on Materialized View. (Did you mean ALTER MATERIALIZED VIEW)?");
}
const bool is_cdc_log_table = cdc::is_log_for_some_table(db.real_database(), s->ks_name(), s->cf_name());
@@ -368,7 +368,7 @@ std::pair<schema_ptr, std::vector<view_ptr>> alter_table_statement::prepare_sche
switch (_type) {
case alter_table_statement::type::add:
SCYLLA_ASSERT(_column_changes.size());
throwing_assert(_column_changes.size());
if (s->is_dense()) {
throw exceptions::invalid_request_exception("Cannot add new column to a COMPACT STORAGE table");
}
@@ -376,12 +376,12 @@ std::pair<schema_ptr, std::vector<view_ptr>> alter_table_statement::prepare_sche
break;
case alter_table_statement::type::alter:
SCYLLA_ASSERT(_column_changes.size() == 1);
throwing_assert(_column_changes.size() == 1);
invoke_column_change_fn(std::mem_fn(&alter_table_statement::alter_column));
break;
case alter_table_statement::type::drop:
SCYLLA_ASSERT(_column_changes.size());
throwing_assert(_column_changes.size());
if (!s->is_cql3_table()) {
throw exceptions::invalid_request_exception("Cannot drop columns from a non-CQL3 table");
}

View File

@@ -46,7 +46,7 @@ future<> alter_view_statement::check_access(query_processor& qp, const service::
view_ptr alter_view_statement::prepare_view(data_dictionary::database db) const {
schema_ptr schema = validation::validate_column_family(db, keyspace(), column_family());
if (!schema->is_view()) {
throw exceptions::invalid_request_exception("Cannot use ALTER MATERIALIZED VIEW on Table");
throw exceptions::invalid_request_exception("Cannot use ALTER MATERIALIZED VIEW on Table. (Did you mean ALTER TABLE)?");
}
if (!_properties) {

View File

@@ -25,7 +25,7 @@ attach_service_level_statement::attach_service_level_statement(sstring service_l
}
bool attach_service_level_statement::needs_guard(query_processor& qp, service::query_state& state) const {
return !auth::legacy_mode(qp) || state.get_service_level_controller().is_v2();
return true;
}
std::unique_ptr<cql3::statements::prepared_statement>

View File

@@ -11,7 +11,6 @@
#include "authentication_statement.hh"
#include "transport/messages/result_message.hh"
#include "cql3/query_processor.hh"
#include "auth/common.hh"
uint32_t cql3::statements::authentication_statement::get_bound_terms() const {
return 0;
@@ -26,7 +25,7 @@ future<> cql3::statements::authentication_statement::check_access(query_processo
}
bool cql3::statements::authentication_altering_statement::needs_guard(query_processor& qp, service::query_state&) const {
return !auth::legacy_mode(qp);
return true;
}
audit::statement_category cql3::statements::authentication_statement::category() const {

View File

@@ -14,7 +14,6 @@
#include "cql3/query_processor.hh"
#include "exceptions/exceptions.hh"
#include "db/cql_type_parser.hh"
#include "auth/common.hh"
uint32_t cql3::statements::authorization_statement::get_bound_terms() const {
return 0;
@@ -74,7 +73,7 @@ void cql3::statements::authorization_statement::maybe_correct_resource(auth::res
bool cql3::statements::authorization_altering_statement::needs_guard(
query_processor& qp, service::query_state&) const {
return !auth::legacy_mode(qp);
return true;
};
audit::statement_category cql3::statements::authorization_statement::category() const {

View File

@@ -259,6 +259,18 @@ future<shared_ptr<cql_transport::messages::result_message>> batch_statement::do_
if (options.getSerialConsistency() == null)
throw new InvalidRequestException("Invalid empty serial consistency level");
#endif
const auto cl = options.get_consistency();
const query_processor::write_consistency_guardrail_state guardrail_state = qp.check_write_consistency_levels_guardrail(cl);
if (guardrail_state == query_processor::write_consistency_guardrail_state::FAIL) {
return make_exception_future<shared_ptr<cql_transport::messages::result_message>>(
exceptions::invalid_request_exception(
format("Write consistency level {} is forbidden by the current configuration "
"setting of write_consistency_levels_disallowed. Please use a different "
"consistency level, or remove {} from write_consistency_levels_disallowed "
"set in the configuration.", cl, cl)));
}
for (size_t i = 0; i < _statements.size(); ++i) {
_statements[i].statement->restrictions().validate_primary_key(options.for_statement(i));
}
@@ -266,23 +278,33 @@ future<shared_ptr<cql_transport::messages::result_message>> batch_statement::do_
if (_has_conditions) {
++_stats.cas_batches;
_stats.statements_in_cas_batches += _statements.size();
return execute_with_conditions(qp, options, query_state);
return execute_with_conditions(qp, options, query_state).then([guardrail_state, cl] (auto result) {
if (guardrail_state == query_processor::write_consistency_guardrail_state::WARN) {
result->add_warning(format("Using write consistency level {} listed on the "
"write_consistency_levels_warned is not recommended.", cl));
}
return result;
});
}
++_stats.batches;
_stats.statements_in_batches += _statements.size();
auto timeout = db::timeout_clock::now() + get_timeout(query_state.get_client_state(), options);
return get_mutations(qp, options, timeout, local, now, query_state).then([this, &qp, &options, timeout, tr_state = query_state.get_trace_state(),
return get_mutations(qp, options, timeout, local, now, query_state).then([this, &qp, cl, timeout, tr_state = query_state.get_trace_state(),
permit = query_state.get_permit()] (utils::chunked_vector<mutation> ms) mutable {
return execute_without_conditions(qp, std::move(ms), options.get_consistency(), timeout, std::move(tr_state), std::move(permit));
}).then([] (coordinator_result<> res) {
return execute_without_conditions(qp, std::move(ms), cl, timeout, std::move(tr_state), std::move(permit));
}).then([guardrail_state, cl] (coordinator_result<> res) {
if (!res) {
return make_ready_future<shared_ptr<cql_transport::messages::result_message>>(
seastar::make_shared<cql_transport::messages::result_message::exception>(std::move(res).assume_error()));
}
return make_ready_future<shared_ptr<cql_transport::messages::result_message>>(
make_shared<cql_transport::messages::result_message::void_message>());
auto result = make_shared<cql_transport::messages::result_message::void_message>();
if (guardrail_state == query_processor::write_consistency_guardrail_state::WARN) {
result->add_warning(format("Using write consistency level {} listed on the "
"write_consistency_levels_warned is not recommended.", cl));
}
return make_ready_future<shared_ptr<cql_transport::messages::result_message>>(std::move(result));
});
}

View File

@@ -20,6 +20,7 @@
#include "cql3/attributes.hh"
#include "cql3/expr/expression.hh"
#include "cql3/expr/evaluate.hh"
#include "cql3/query_options.hh"
#include "cql3/query_processor.hh"
#include "cql3/values.hh"
#include "timeout_config.hh"
@@ -65,7 +66,7 @@ evaluate_prepared(
future<::shared_ptr<cql_transport::messages::result_message>>
broadcast_modification_statement::execute_without_checking_exception_message(query_processor& qp, service::query_state& qs, const query_options& options, std::optional<service::group0_guard> guard) const {
if (this_shard_id() != 0) {
co_return ::make_shared<cql_transport::messages::result_message::bounce_to_shard>(0, cql3::computed_function_values{});
co_return qp.bounce_to_shard(0, cql3::computed_function_values{}, false);
}
auto result = co_await qp.execute_broadcast_table_query(

View File

@@ -96,7 +96,7 @@ evaluate_prepared(
future<::shared_ptr<cql_transport::messages::result_message>>
broadcast_select_statement::execute_without_checking_exception_message(query_processor& qp, service::query_state& qs, const query_options& options, std::optional<service::group0_guard> guard) const {
if (this_shard_id() != 0) {
co_return ::make_shared<cql_transport::messages::result_message::bounce_to_shard>(0, cql3::computed_function_values{});
co_return qp.bounce_to_shard(0, cql3::computed_function_values{}, false);
}
auto result = co_await qp.execute_broadcast_table_query(

View File

@@ -51,7 +51,7 @@ public:
, _key(std::move(key_arg))
, _rows(schema_arg)
{
SCYLLA_ASSERT(_key.size() == 1 && query::is_single_partition(_key.front()));
throwing_assert(_key.size() == 1 && query::is_single_partition(_key.front()));
}
dht::partition_range_vector key() const {

View File

@@ -59,6 +59,8 @@ const sstring cf_prop_defs::COMPACTION_ENABLED_KEY = "enabled";
const sstring cf_prop_defs::KW_TABLETS = "tablets";
const sstring cf_prop_defs::KW_STORAGE_ENGINE = "storage_engine";
schema::extensions_map cf_prop_defs::make_schema_extensions(const db::extensions& exts) const {
schema::extensions_map er;
for (auto& p : exts.schema_extensions()) {
@@ -106,6 +108,7 @@ void cf_prop_defs::validate(const data_dictionary::database db, sstring ks_name,
KW_BF_FP_CHANCE, KW_MEMTABLE_FLUSH_PERIOD, KW_COMPACTION,
KW_COMPRESSION, KW_CRC_CHECK_CHANCE, KW_ID, KW_PAXOSGRACESECONDS,
KW_SYNCHRONOUS_UPDATES, KW_TABLETS,
KW_STORAGE_ENGINE,
});
static std::set<sstring> obsolete_keywords({
sstring("index_interval"),
@@ -196,6 +199,20 @@ void cf_prop_defs::validate(const data_dictionary::database db, sstring ks_name,
}
db::tablet_options::validate(*tablet_options_map);
}
if (has_property(KW_STORAGE_ENGINE)) {
auto storage_engine = get_string(KW_STORAGE_ENGINE, "");
if (storage_engine == "logstor") {
if (!db.features().logstor) {
throw exceptions::configuration_exception(format("The experimental feature 'logstor' must be enabled in order to use the 'logstor' storage engine."));
}
if (!db.get_config().enable_logstor()) {
throw exceptions::configuration_exception(format("The configuration option 'enable_logstor' must be set to true in the configuration in order to use the 'logstor' storage engine."));
}
} else {
throw exceptions::configuration_exception(format("Illegal value for '{}'", KW_STORAGE_ENGINE));
}
}
}
std::map<sstring, sstring> cf_prop_defs::get_compaction_type_options() const {
@@ -396,6 +413,13 @@ void cf_prop_defs::apply_to_builder(schema_builder& builder, schema::extensions_
if (auto tablet_options_opt = get_map(KW_TABLETS)) {
builder.set_tablet_options(std::move(*tablet_options_opt));
}
if (has_property(KW_STORAGE_ENGINE)) {
auto storage_engine = get_string(KW_STORAGE_ENGINE, "");
if (storage_engine == "logstor") {
builder.set_logstor();
}
}
}
void cf_prop_defs::validate_minimum_int(const sstring& field, int32_t minimum_value, int32_t default_value) const

View File

@@ -64,6 +64,8 @@ public:
static const sstring KW_TABLETS;
static const sstring KW_STORAGE_ENGINE;
// FIXME: In origin the following consts are in CFMetaData.
static constexpr int32_t DEFAULT_DEFAULT_TIME_TO_LIVE = 0;
static constexpr int32_t DEFAULT_MIN_INDEX_INTERVAL = 128;

Some files were not shown because too many files have changed in this diff Show More