Compare commits

..

230 Commits

Author SHA1 Message Date
Avi Kivity
b708e5d7c9 Merge 'test: fix race condition in test_crashed_node_substitution' from Sergey Zolotukhin
`test_crashed_node_substitution` intermittently failed:
```python
   assert len(gossiper_eps) == (len(server_eps) + 1)
```
The test crashed the node right after a single ACK2 handshake (`finished do_send_ack2_msg`), assuming the node state was visible to all peers. However, since gossip is eventually consistent, the update may not have propagated yet, so some nodes did not see the failed node.

This change: Wait until the gossiper state is visible on peers before continuing the test and asserting.
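The wait can be sketched as a polling helper; `peers`, `failed_host`, and `is_visible` below are hypothetical stand-ins for the test framework's server handles and its gossiper-state check:

```python
import asyncio
import time

async def wait_for_gossip_visibility(peers, failed_host, is_visible, timeout=60.0):
    """Poll every live peer until is_visible(peer, failed_host) confirms
    that the peer's gossiper has observed the crashed node's state."""
    deadline = time.monotonic() + timeout
    pending = set(peers)
    while pending:
        for peer in list(pending):
            if await is_visible(peer, failed_host):
                pending.discard(peer)
        if pending:
            if time.monotonic() > deadline:
                raise TimeoutError(f"gossip state still missing on: {sorted(pending)}")
            await asyncio.sleep(0.1)
```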

Fixes: [SCYLLADB-1256](https://scylladb.atlassian.net/browse/SCYLLADB-1256).

backport: this issue may affect CI for all branches, so it should be backported to all versions.

[SCYLLADB-1256]: https://scylladb.atlassian.net/browse/SCYLLADB-1256?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#29254

* github.com:scylladb/scylladb:
  test: test_crashed_node_substitution: add docstring and fix whitespace
  test: fix race condition in test_crashed_node_substitution
2026-03-26 21:40:33 +02:00
Petr Gusev
c38e312321 test_lwt_fencing_upgrade: fix quorum failure due to gossip lag
If lwt_workload() sends an update immediately after a
rolling restart, the coordinator might still see a replica as
down due to gossip lagging behind. Concurrently restarting another
node then leaves only one available replica, failing the
LOCAL_QUORUM requirement of the learn phase or of the eventually
consistent sp::query() in sp::cas(), and resulting in
a mutation_write_failure_exception.

We fix this problem by waiting for the restarted server
to see 2 other peers. The server_change_version
doesn't do that by default -- it passes
wait_others=0 to server_start().

Fixes SCYLLADB-1136

Closes scylladb/scylladb#29234
2026-03-26 21:25:53 +02:00
bitpathfinder
627a8294ed test: test_crashed_node_substitution: add docstring and fix whitespace
Add a description of the test's intent and scenario; remove extra blanks.
2026-03-26 18:40:17 +01:00
bitpathfinder
5a086ae9b7 test: fix race condition in test_crashed_node_substitution
`test_crashed_node_substitution` intermittently failed:
```
    assert len(gossiper_eps) == (len(server_eps) + 1)
```
The test crashed the node right after a single ACK2 handshake
("finished do_send_ack2_msg"), assuming the node state was
visible to all peers. However, since gossip is eventually
consistent, the update may not have propagated yet, so some
nodes did not see the failed node.

This change: Wait until the gossiper state is visible on
peers before continuing the test and asserting.

Fixes: SCYLLADB-1256.
2026-03-26 18:25:05 +01:00
Robert Bindar
c575bbf1e8 test_refresh_deletes_uploaded_sstables should wait for sstables to get deleted
SSTable unlinking is async, so in some cases it may happen that
the upload dir is not empty immediately after refresh is done.
This patch adjusts test_refresh_deletes_uploaded_sstables so
it waits with a timeout until the upload dir becomes empty
instead of just assuming the API will sync on the sstables
being gone.
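A minimal sketch of such a wait, assuming a plain directory path rather than the test framework's server handles:

```python
import time
from pathlib import Path

def wait_upload_dir_empty(upload_dir: str, timeout: float = 60.0,
                          period: float = 0.5) -> None:
    """SSTable unlinking is async, so poll with a timeout until the upload
    directory has no entries left instead of asserting right after refresh."""
    deadline = time.monotonic() + timeout
    while True:
        leftovers = [p.name for p in Path(upload_dir).iterdir()]
        if not leftovers:
            return
        if time.monotonic() > deadline:
            raise TimeoutError(f"upload dir still contains: {leftovers}")
        time.sleep(period)
```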

Fixes SCYLLADB-1190

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#29215
2026-03-26 08:43:14 +03:00
Marcin Maliszkiewicz
7fdd650009 Merge 'test: audit: clean up test helper class naming' from Dario Mirovic
Remove unused `pytest.mark.single_node` marker from `TestCQLAudit`.

Rename `TestCQLAudit` to `CQLAuditTester` to reflect that it is a test helper, not a test class. This avoids accidental pytest collection and subsequent warning about `__init__`.

Logs before the fixes:
```
test/cluster/test_audit.py:514: 14 warnings
  /home/dario/dev/scylladb/test/cluster/test_audit.py:514: PytestCollectionWarning: cannot collect test class 'TestCQLAudit' because it has a __init__ constructor (from: cluster/test_audit.py)
    @pytest.mark.single_node
```

Fixes SCYLLADB-1237

This is an addition to the latest master code. No backport needed.

Closes scylladb/scylladb#29237

* github.com:scylladb/scylladb:
  test: audit: rename TestCQLAudit to CQLAuditTester
  test: audit: remove unused pytest.mark.single_node
2026-03-25 15:30:16 +01:00
Dario Mirovic
552a2d0995 test: audit: rename TestCQLAudit to CQLAuditTester
pytest tries to collect tests for execution in several ways.
One is to pick all classes whose names start with 'Test'. Those
classes must not have a custom '__init__' constructor. TestCQLAudit does.

TestCQLAudit after migration from test/cluster/dtest is not a test
class anymore, but rather a helper class. There are two ways to fix
this:
1. Add __test__ = False to the TestCQLAudit class
2. Rename it to not start with 'Test'

Option 2 feels better because the new name itself does not convey
the wrong message about its role.
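The two options can be contrasted like this (class bodies are simplified stand-ins for the real helper):

```python
# Option 1: keep the Test* name but opt out of collection explicitly.
class TestCQLAuditOptOut:
    __test__ = False  # pytest's documented flag to skip collecting a class

    def __init__(self, cluster):
        self.cluster = cluster

# Option 2 (chosen in this commit): rename so pytest's default
# collection rule ("classes starting with 'Test'") never matches.
class CQLAuditTester:
    def __init__(self, cluster):
        self.cluster = cluster
```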

Fixes SCYLLADB-1237
2026-03-25 13:21:08 +01:00
Dario Mirovic
73de865ca3 test: audit: remove unused pytest.mark.single_node
Remove unused pytest.mark.single_node in TestCQLAudit class.
This is a leftover from audit tests migration from
test/cluster/dtest to test/cluster.

Refs SCYLLADB-1237
2026-03-25 13:18:37 +01:00
Marcin Maliszkiewicz
f988ec18cb test/lib: fix port in-use detection in start_docker_service
Previously, the result of when_all was discarded. when_all stores
exceptions in the returned futures rather than throwing, so the outer
catch (in_use&) could never trigger. Now we capture the when_all result
and inspect each future individually to properly detect in_use from
either stream.
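The fix itself is Seastar C++, but Python's asyncio has the same pitfall: `asyncio.gather(return_exceptions=True)` stores exceptions in the result list instead of raising them, so discarding the results silently swallows failures. A sketch of the analogous inspect-each-result pattern:

```python
import asyncio

class InUse(Exception):
    """Analogue of the in_use exception signalling an occupied port."""

async def probe(fail: bool) -> str:
    if fail:
        raise InUse("port already bound")
    return "ok"

async def port_in_use() -> bool:
    # Like Seastar's when_all, gather(return_exceptions=True) stores
    # exceptions in the result list instead of raising them, so the
    # results must be inspected -- discarding them means no except
    # clause downstream can ever trigger.
    results = await asyncio.gather(probe(False), probe(True),
                                   return_exceptions=True)
    return any(isinstance(r, InUse) for r in results)
```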

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1216

Closes scylladb/scylladb#29219
2026-03-25 11:45:53 +02:00
Artsiom Mishuta
cd1679934c test/pylib: use exponential backoff in wait_for()
Change wait_for() defaults from period=1s/no backoff to period=0.1s
with 1.5x backoff capped at 1.0s. This catches fast conditions in
100ms instead of 1000ms, benefiting ~100 call sites automatically.

Add completion logging with elapsed time and iteration count.

Tested locally with test/cluster/test_fencing.py::test_fence_hints (dev mode),
log output:

  wait_for(at_least_one_hint_failed) completed in 0.83s (4 iterations)
  wait_for(exactly_one_hint_sent) completed in 1.34s (5 iterations)
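The new schedule can be illustrated with a small generator (illustrative only; the real `wait_for()` lives in test/pylib):

```python
def backoff_periods(initial=0.1, factor=1.5, cap=1.0, total=5.0):
    """Yield the sleep periods of an exponential backoff that starts at
    `initial`, grows by `factor`, is capped at `cap`, and stops once the
    summed waiting time reaches `total`."""
    period, elapsed = initial, 0.0
    while elapsed < total:
        yield period
        elapsed += period
        period = min(period * factor, cap)
```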

Fixes SCYLLADB-738

Closes scylladb/scylladb#29173
2026-03-24 23:49:49 +02:00
Botond Dénes
d52fbf7ada Merge 'test: cluster: Deflake test_startup_with_keyspaces_violating_rf_rack_valid_keyspaces' from Dawid Mędrek
The test was flaky. The scenario looked like this:

1. Stop server 1.
2. Set its rf_rack_valid_keyspaces configuration option to true.
3. Create an RF-rack-invalid keyspace.
4. Start server 1 and expect a failure during start-up.

It was wrong. We cannot predict when the Raft mutation corresponding to
the newly created keyspace will arrive at the node or when it will be
processed. If the check of the RF-rack-valid keyspaces we perform at
start-up was done before that, it won't include the keyspace. This will
lead to a test failure.

Unfortunately, it's not feasible to perform a read barrier during
start-up. What's more, although it would help the test, it wouldn't be
useful otherwise. Because of that, we simply fix the test, at least for
now.

The new scenario looks like this:

1. Disable the rf_rack_valid_keyspaces configuration option on server 1.
2. Start the server.
3. Create an RF-rack-invalid keyspace.
4. Perform a read barrier on server 1. This will ensure that it has
   observed all Raft mutations, and we won't run into the same problem.
5. Stop the node.
6. Set its rf_rack_valid_keyspaces configuration option to true.
7. Try to start the node and observe a failure.

This will make the test perform consistently.

---

I ran the test (in dev mode, on my local machine) three times before
these changes, and three times with them. I include the time results
below.

Before:
```
real    0m47.570s
user    0m41.631s
sys     0m8.634s

real    0m50.495s
user    0m42.499s
sys     0m8.607s

real    0m50.375s
user    0m41.832s
sys     0m8.789s
```

After:
```
real    0m50.509s
user    0m43.535s
sys     0m9.715s

real    0m50.857s
user    0m44.185s
sys     0m9.811s

real    0m50.873s
user    0m44.289s
sys     0m9.737s
```

Fixes SCYLLADB-1137

Backport: The test is present on all supported branches, and so we
          should backport these changes to them.

Closes scylladb/scylladb#29218

* github.com:scylladb/scylladb:
  test: cluster: Deflake test_startup_with_keyspaces_violating_rf_rack_valid_keyspaces
  test: cluster: Mark test with @pytest.mark.asyncio in test_multidc.py
2026-03-24 21:09:19 +02:00
Patryk Jędrzejczak
141aa2d696 Merge 'test/cluster/test_incremental_repair.py: fix typo + enable compaction DEBUG logs' from Botond Dénes
This PR contains two small improvements to `test_incremental_repair.py`
motivated by the sporadic failure of
`test_tablet_incremental_repair_and_scrubsstables_abort`.

The test fails with `assert 3 == 2` on `len(sst_add)` in the second
repair round. The extra SSTable has `repaired_at=0`, meaning scrub
unexpectedly produced more unrepaired SSTables than anticipated. Since
scrub (and compaction in general) logs at DEBUG level and the test did
not enable debug logging, the existing logs do not contain enough
information to determine the root cause.

**Commit 1** fixes a long-standing typo in the helper function name
(`preapre` -> `prepare`).

**Commit 2** enables `compaction=debug` for the Scylla nodes started by
`do_tablet_incremental_repair_and_ops`, which covers all
`test_tablet_incremental_repair_and_*` variants. This will capture full
compaction/scrub activity on the next reproduction, making the failure
diagnosable.

Refs: SCYLLADB-1086

Backport: test improvement, no backport

Closes scylladb/scylladb#29175

* https://github.com/scylladb/scylladb:
  test/cluster/test_incremental_repair.py: enable compaction DEBUG logs in do_tablet_incremental_repair_and_ops
  test/cluster/test_incremental_repair.py: fix typo preapre -> prepare
2026-03-24 16:27:01 +01:00
Ernest Zaslavsky
c670183be8 cmake: fix precompiled header (PCH) creation
Two issues prevented the precompiled header from compiling
successfully when using CMake directly (rather than the
configure.py + ninja build system):

a) Propagate build flags to Rust binding targets reusing the
   PCH. The wasmtime_bindings and inc targets reuse the PCH
   from scylla-precompiled-header, which is compiled with
   Seastar's flags (including sanitizer flags in
   Debug/Sanitize modes). Without matching compile options,
   the compiler rejects the PCH due to flag mismatch (e.g.,
   -fsanitize=address). Link these targets against
   Seastar::seastar to inherit the required compile options.

Closes scylladb/scylladb#28941
2026-03-24 15:53:40 +02:00
Dawid Mędrek
e639dcda0b test: cluster: Deflake test_startup_with_keyspaces_violating_rf_rack_valid_keyspaces
The test was flaky. The scenario looked like this:

1. Stop server 1.
2. Set its rf_rack_valid_keyspaces configuration option to true.
3. Create an RF-rack-invalid keyspace.
4. Start server 1 and expect a failure during start-up.

It was wrong. We cannot predict when the Raft mutation corresponding to
the newly created keyspace will arrive at the node or when it will be
processed. If the check of the RF-rack-valid keyspaces we perform at
start-up runs before that, it won't include the keyspace, and the test
will fail.

Unfortunately, it's not feasible to perform a read barrier during
start-up. What's more, although it would help the test, it wouldn't be
useful otherwise. Because of that, we simply fix the test, at least for
now.

The new scenario looks like this:

1. Disable the rf_rack_valid_keyspaces configuration option on server 1.
2. Start the server.
3. Create an RF-rack-invalid keyspace.
4. Perform a read barrier on server 1. This will ensure that it has
   observed all Raft mutations, and we won't run into the same problem.
5. Stop the node.
6. Set its rf_rack_valid_keyspaces configuration option to true.
7. Try to start the node and observe a failure.

This will make the test perform consistently.

---

I ran the test (in dev mode, on my local machine) three times before
these changes, and three times with them. I include the time results
below.

Before:
```
real    0m47.570s
user    0m41.631s
sys     0m8.634s

real    0m50.495s
user    0m42.499s
sys     0m8.607s

real    0m50.375s
user    0m41.832s
sys     0m8.789s
```

After:
```
real    0m50.509s
user    0m43.535s
sys     0m9.715s

real    0m50.857s
user    0m44.185s
sys     0m9.811s

real    0m50.873s
user    0m44.289s
sys     0m9.737s
```

Fixes SCYLLADB-1137
2026-03-24 14:27:36 +01:00
Patryk Jędrzejczak
503a6e2d7e locator: everywhere_replication_strategy: fix sanity_check_read_replicas when read_new is true
ERMs created in `calculate_vnode_effective_replication_map` have RF computed based
on the old token metadata during a topology change. The reading replicas, however,
are computed based on the new token metadata (`target_token_metadata`) when
`read_new` is true. That can create a mismatch for EverywhereStrategy during some
topology changes - RF can be equal to the number of reading replicas +-1. During
bootstrap, this can cause the
`everywhere_replication_strategy::sanity_check_read_replicas` check to fail in
debug mode.

We fix the check in this commit by allowing one more reading replica when
`read_new` is true.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1147

Closes scylladb/scylladb#29150
2026-03-24 13:43:39 +01:00
Jenkins Promoter
0f02c0d6fa Update pgo profiles - x86_64 2026-03-24 14:11:38 +02:00
Dawid Mędrek
4fead4baae test: cluster: Mark test with @pytest.mark.asyncio in test_multidc.py
One of the tests,
test_startup_with_keyspaces_violating_rf_rack_valid_keyspaces,
didn't have the marker. Let's add it now.
2026-03-24 12:52:00 +01:00
Botond Dénes
ffd58ca1f0 Merge 'test: cluster: Deflake test_write_cl_any_to_dead_node_generates_hints' from Dawid Mędrek
Before these changes, we would send mutations to the node and
immediately query the metrics to see how many hints had been written.
However, that could lead to random failures of the test: even if the
mutations have finished executing, hints are stored asynchronously, so
we don't have a guarantee they have already been processed.

To prevent such failures, we rewrite the check: we will perform multiple
checks against the metrics until we have confirmed that the hints have
indeed been written or we hit the timeout.

We're generous with the timeout: we give the test 60 seconds. That
should be enough time to avoid flakiness even on super slow machines,
and if the test does fail, we will know something is really wrong.

As a bonus, we improve the test in general too. We explicitly express
the preconditions we rely on, as well as bump the log level. If the
test fails in the future, it might be very difficult to debug it
without this additional information.
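The polling check can be sketched like so; `get_metric` is a hypothetical async accessor for the hints-written metric:

```python
import asyncio
import time

async def wait_for_hints_written(get_metric, expected: int,
                                 timeout: float = 60.0,
                                 period: float = 0.5) -> int:
    """Re-query the hints-written metric until it reaches `expected` or
    the timeout expires, since hints are stored asynchronously."""
    deadline = time.monotonic() + timeout
    while True:
        written = await get_metric()
        if written >= expected:
            return written
        if time.monotonic() > deadline:
            raise TimeoutError(f"only {written}/{expected} hints written")
        await asyncio.sleep(period)
```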

Fixes SCYLLADB-1133

Backport: The test is present on all supported branches. To avoid
          running into more failures, we should backport these changes
          to them.

Closes scylladb/scylladb#29191

* github.com:scylladb/scylladb:
  test: cluster: Increase log level in test_write_cl_any_to_dead_node_generates_hints
  test: cluster: Await all mutations concurrently in test_write_cl_any_to_dead_node_generates_hints
  test: cluster: Specify min_tablet_count in test_write_cl_any_to_dead_node_generates_hints
  test: cluster: Use new_test_table in test_write_cl_any_to_dead_node_generates_hints
  test: cluster: Introduce auxiliary function keyspace_has_tablets
  test: cluster: Deflake test_write_cl_any_to_dead_node_generates_hints
2026-03-24 13:39:56 +02:00
Andrei Chekun
f6fd3bbea0 test.py: reduce timeout for one test
Reduce the timeout for one test to 60 minutes. The longest test we had
so far was ~10-15 minutes. So reducing this timeout is pretty safe and
should help with hanging tests.

Closes scylladb/scylladb#29212
2026-03-24 12:50:10 +02:00
Marcin Maliszkiewicz
66be0f4577 Merge 'test: cluster: audit test suite optimization' from Dario Mirovic
Migrate audit tests from test/cluster/dtest to test/cluster. Optimize their execution time through cluster reuse.

The audit test suite is heavy. There are more than 70 test executions. Environment preparation is a significant part of each test case execution time.

This PR:
1. Copies audit tests from test/cluster/dtest to test/cluster, refactoring and enabling them
2. Groups tests functions by non-live cluster configuration variations to enable cluster reuse between them
    - Execution time reduced from 4m 29s to 2m 47s, a ~38% decrease in execution time
3. Removes the old audit tests from test/cluster/dtest

Includes two supporting changes:
- Allow specifying `AuthProvider` in `ManagerClient.get_cql_exclusive`
- Fix server log file handling for clean clusters

Refs [SCYLLADB-573](https://scylladb.atlassian.net/browse/SCYLLADB-573)

This PR is an improvement and does not require a backport.

[SCYLLADB-573]: https://scylladb.atlassian.net/browse/SCYLLADB-573?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#28650

* github.com:scylladb/scylladb:
  test: cluster: fix log clear race condition in test_audit.py
  test: pylib: shut down exclusive cql connections in ManagerClient
  test: cluster: fix multinode audit entry comparison in test_audit.py
  test: cluster: dtest: remove old audit tests
  test: cluster: group migrated audit tests for cluster reuse
  test: cluster: enable migrated audit tests and make them work
  test: pylib: manager_client: specify AuthProvider in get_cql_exclusive
  test: pylib: scylla cluster after_test log fix
  test: audit: copy audit test from dtest
2026-03-24 09:29:52 +01:00
Dario Mirovic
120f381a9d pgo: fix maintenance socket path too long
Maintenance socket path used for PGO is in the node workdir.
When the node workdir path is too long, the maintenance socket path
(workdir/cql.m) can exceed the Unix domain socket sun_path limit
and fail the PGO training pipeline.

To prevent this:
- pass an explicit --maintenance-socket override
  pointing to a short deterministic path in /tmp derived from the MD5
  hash of the workdir maintenance socket path
- update maintenance_socket_path to return the matching short path
  so that exec_cql.py connects to the right socket

The short path socket files are cleaned up after the cluster stops.

The path uses the MD5 hash of the workdir path, so it is deterministic.
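The derivation can be sketched as follows; the helper name and `/tmp` prefix are illustrative, not the actual code:

```python
import hashlib

def short_socket_path(workdir_socket_path: str) -> str:
    """Map the (possibly too long) workdir socket path to a short,
    deterministic /tmp path, well under the ~108-byte sun_path limit."""
    digest = hashlib.md5(workdir_socket_path.encode()).hexdigest()
    return f"/tmp/pgo-{digest[:16]}.m"
```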

Fixes SCYLLADB-1070

Closes scylladb/scylladb#29149
2026-03-24 09:17:10 +01:00
Pavel Emelyanov
f112e42ddd raft: Fix split mutations freeze
Commit faa0ee9844 accidentally broke the way split snapshot mutation was
frozen -- instead of appending the sub-mutation `m`, the commit kept the
old variable name `mut`, which in the new code corresponds to the "old"
non-split mutation.

Fixes #29051

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29052
2026-03-24 08:53:50 +02:00
Botond Dénes
56c375b1f3 Merge 'table: don't close a disengaged querier in query()' from Pavel Emelyanov
There's a flaw in table::query() -- calling querier_opt->close() can dereference a disengaged std::optional. The fix is pretty simple. Once fixed, there are two if-s checking whether querier_opt is engaged that are worth being merged.

The problem doesn't really show itself because table::query() is not called with a null saved_querier, so the de-facto behavior is always correct. However, it's better to be on the safe side.

Since the problem doesn't show itself for real, it's not worth backporting.

Closes scylladb/scylladb#29142

* github.com:scylladb/scylladb:
  table: merge adjacent querier_opt checks in query()
  table: don't close a disengaged querier in query()
2026-03-24 08:47:35 +02:00
Yaniv Kaul
e59a21752d .github/workflows/trigger_jenkins.yaml: add workflow permissions
Potential fix for https://github.com/scylladb/scylladb/security/code-scanning/147.

To fix the problem, add an explicit `permissions:` block to the workflow
(either at the top level or inside the `trigger-jenkins` job) that
constrains the `GITHUB_TOKEN` to the minimal necessary privileges. This
codifies least-privilege in the workflow itself instead of relying on
repository or organization defaults.

The best minimal, non‑breaking change is to define a root‑level
`permissions:` block with read‑only contents access because the job does
not perform any write operations to the repository, nor does it interact
with issues, pull requests, or other GitHub resources. A conservative,
widely accepted baseline is `contents: read`. If later steps require more
permissions, they can be added explicitly, but for this snippet, no such
need is visible.

Concretely, in `.github/workflows/trigger_jenkins.yaml`, insert:

```yaml
permissions:
  contents: read
```

between the `name:` block and the `on:` block (e.g., after line 2).
No additional methods, imports, or definitions are needed since this is
a pure YAML configuration change and does not alter runtime behavior of
the existing shell steps.

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27815
2026-03-24 08:40:30 +02:00
Yaniv Kaul
85a531819b .github/workflows/trigger-scylla-ci.yaml: add permissions to workflow
Potential fix for https://github.com/scylladb/scylladb/security/code-scanning/169.

In general, the fix is to add an explicit `permissions:` block to the
workflow (at the root level or per job) so that the `GITHUB_TOKEN` has
only the minimal scopes needed. Since this job only reads event data and
uses secrets to talk to Jenkins, we can restrict `GITHUB_TOKEN` to
read‑only repository contents.

The single best fix here is to add a top‑level `permissions:` block
right under the `name:` (and before `on:`) in
`.github/workflows/trigger-scylla-ci.yaml`, setting `contents: read`.
This applies to all jobs in the workflow, including `trigger-jenkins`,
and does not alter any existing steps or logic. No additional imports or
methods are needed, as this is purely a YAML configuration change for
GitHub Actions.

Concretely, edit `.github/workflows/trigger-scylla-ci.yaml` to insert:

```yaml
permissions:
  contents: read
```

after line 1. No other lines in the file need to change.

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27812
2026-03-24 08:37:49 +02:00
Dawid Mędrek
148217bed6 test: cluster: Increase log level in test_write_cl_any_to_dead_node_generates_hints
We increase the log level of `hints_manager` to TRACE in the test.
If it fails, it may be incredibly difficult to debug it without any
additional information.
2026-03-23 19:19:17 +01:00
Dawid Mędrek
2b472fe7fd test: cluster: Await all mutations concurrently in test_write_cl_any_to_dead_node_generates_hints 2026-03-23 19:19:17 +01:00
Dawid Mędrek
ae12c712ce test: cluster: Specify min_tablet_count in test_write_cl_any_to_dead_node_generates_hints
The test relies on the assumption that mutations will be distributed
more or less uniformly over the nodes. Although in practice this should
not happen, theoretically it's possible that there's only one tablet
allocated for the table.

To clearly indicate this precondition, we explicitly set the property
`min_tablet_count` when creating the table. This way, we have a guarantee
that the table has multiple tablets. The load balancer should now take
care of distributing them over the nodes equally. Thanks to that,
`servers[1]` will have some tablets, and so it'll be the target for some
of the mutations we perform.
2026-03-23 19:19:14 +01:00
Dawid Mędrek
dd446aa442 test: cluster: Use new_test_table in test_write_cl_any_to_dead_node_generates_hints
The context manager is the de-facto standard in the test suite. It will
also allow us for a prettier way to conditionally enable per-table
tablet options in the following commit.
2026-03-23 19:07:01 +01:00
Dawid Mędrek
dea79b09a9 test: cluster: Introduce auxiliary function keyspace_has_tablets
The function is adapted from its counterpart in the cqlpy test suite:
cqlpy/util.py::keyspace_has_tablets. We will use it in a commit in this
series to conditionally set tablet properties when creating a table.
It might also be useful in general.
2026-03-23 19:07:01 +01:00
Dawid Mędrek
3d04fd1d13 test: cluster: Deflake test_write_cl_any_to_dead_node_generates_hints
Before these changes, we would send mutations to the node and
immediately query the metrics to see how many hints had been written.
However, that could lead to random failures of the test: even if the
mutations have finished executing, hints are stored asynchronously, so
we don't have a guarantee they have already been processed.

To prevent such failures, we rewrite the check: we will perform multiple
checks against the metrics until we have confirmed that the hints have
indeed been written or we hit the timeout.

We're generous with the timeout: we give the test 60 seconds. That
should be enough time to avoid flakiness even on super slow machines,
and if the test does fail, we will know something is really wrong.

Fixes SCYLLADB-1133
2026-03-23 19:06:57 +01:00
Botond Dénes
772b32d9f7 test/scylla_gdb: fix flakiness by preparing objects at test time
Fixtures previously ran GDB once (module scope) to find live objects
(sstables, tasks, schemas) and stored their addresses. Tests then
reused those addresses in separate GDB invocations. Sometimes these
addresses would become stale and the test would step on use-after-free
(e.g. sstables compacted away between invocations).

Fix by dropping the fixtures. The helper functions used by the fixtures
to obtain the required objects are converted to gdb convenience
functions, which can be used in the same expression as the test command
invocation. Thus, the object is acquired on-demand at the moment it is
used, so it is guaranteed to be fresh and relevant.

Fixes: SCYLLADB-1020

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#28999
2026-03-23 16:54:03 +02:00
Piotr Dulikowski
60fb5270a9 logstor: fix fmt::format use with std::filesystem::path
The version of fmt installed on my machine refuses to work with
`std::filesystem::path` directly. Add `.string()` calls in places that
attempt to print paths directly in order to make them work.

Closes scylladb/scylladb#29148
2026-03-23 15:15:52 +01:00
Pavel Emelyanov
3b9398dfc8 Merge 'encryption: fix deadlock in encrypted_data_source::get()' from Ernest Zaslavsky
When encrypted_data_source::get() caches a trailing block in _next, the next call takes it directly — bypassing input_stream::read(), which checks _eof. It then calls input_stream::read_exactly() on the already-drained stream. Unlike read(), read_up_to(), and consume(), read_exactly() does not check _eof when the buffer is empty, so it calls _fd.get() on a source that already returned EOS.

In production this manifested as stuck encrypted SSTable component downloads during tablet restore: the underlying chunked_download_source hung forever on the post-EOS get(), causing 4 tablets to never complete. The stuck files were always block-aligned sizes (8k, 12k) where _next gets populated and the source is fully consumed in the same call.

Fix by checking _input.eof() before calling read_exactly(). When the stream has already reached EOF, buf2 is known to be empty, so the call is skipped entirely.

A comprehensive test is added that uses a strict_memory_source which fails on post-EOS get(), reproducing the exact code path that caused the production deadlock.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1128

Backport to 2025.3/4 and 2026.1 is needed since it fixes a bug that may bite us in production, to be on the safe side

Closes scylladb/scylladb#29110

* github.com:scylladb/scylladb:
  encryption: fix deadlock in encrypted_data_source::get()
  test_lib: mark `limiting_data_source_impl` as not `final`
  Fix formatting after previous patch
  Fix indentation after previous patch
  test_lib: make limiting_data_source_impl available to tests
2026-03-23 17:12:44 +03:00
Botond Dénes
f5438e0587 test/cluster/test_incremental_repair.py: enable compaction DEBUG logs in do_tablet_incremental_repair_and_ops
The test sporadically fails because scrub produces an unexpected number
of SSTables. Compaction logs are needed to diagnose why, but were not
captured since scrub runs at DEBUG level. Enable compaction=debug for
the servers started by do_tablet_incremental_repair_and_ops so the next
reproduction provides enough information to root-cause the issue.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-23 15:48:26 +02:00
Botond Dénes
f6ab576ed9 test/cluster/test_incremental_repair.py: fix typo preapre -> prepare
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-23 15:48:12 +02:00
Piotr Dulikowski
df68d0c0f7 directories: add missing seastar/util/closeable.hh include
Without this include the file would not compile on its own. The issue
was most likely masked by the use of precompiled headers in our CI.

Closes scylladb/scylladb#29170
2026-03-23 15:46:56 +03:00
Yaniv Michael Kaul
051107f5bc scylla-gdb: fix sstable-summary crash on ms-format sstables
The 'scylla sstable-summary' GDB command crashes with
'ValueError: Argument "count" should be greater than zero' when
inspecting ms-format (trie-based) sstables. This happens because
ms-format sstables don't populate the traditional summary structure,
leaving all fields zeroed out, which causes gdb.read_memory() to be
called with a zero count.

Fix by:
- Adding zero-length guards to sstring.to_hex() and sstring.as_bytes()
  to return early when the data length is zero, consistent with the
  existing guard in managed_bytes.get().
- Adding the same guard to scylla_sstable_summary.to_hex().
- Detecting ms-format sstables (version == 5) early in
  scylla_sstable_summary.invoke() and printing an informative message
  instead of attempting to read the unpopulated summary.
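The guard pattern can be shown standalone; `read_memory` below is a stand-in for `gdb.read_memory`, which raises when `count` is zero:

```python
def read_hex(read_memory, address: int, length: int) -> str:
    """Zero-length guard pattern: the reader raises ValueError when
    count == 0 (as gdb.read_memory does), so short-circuit before
    calling it, consistent with the guard in managed_bytes.get()."""
    if length == 0:
        return ""
    return bytes(read_memory(address, length)).hex()
```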

Fixes: SCYLLADB-1180

Closes scylladb/scylladb#29162
2026-03-23 12:44:47 +02:00
Piotr Szymaniak
c8e7e20c5c test/cluster: retry create_table on transient schema agreement timeout
In test_index_requires_rf_rack_valid_keyspace, the create_table call
for a plain tablet-based table can fail with 'Unable to reach schema
agreement' after the server's 10s timeout is exceeded. This happens
when schema gossip propagation across the 4-node cluster takes longer
than expected after a sequence of rapid schema changes earlier in the
test.

Add a retry (up to 2 attempts) on schema agreement errors for this
specific create_table call rather than increasing the server-side
timeout.
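A sketch of the retry wrapper, assuming `create` is a hypothetical zero-argument coroutine function wrapping the actual create_table call:

```python
import asyncio

async def create_table_with_retry(create, attempts: int = 2,
                                  delay: float = 1.0):
    """Retry only on transient 'Unable to reach schema agreement'
    failures; any other error propagates immediately."""
    for attempt in range(1, attempts + 1):
        try:
            return await create()
        except Exception as e:
            if "schema agreement" not in str(e) or attempt == attempts:
                raise
            await asyncio.sleep(delay)
```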

Fixes: SCYLLADB-1135

Closes scylladb/scylladb#29132
2026-03-23 10:45:30 +02:00
Yaniv Kaul
fb1f995d6b .github/workflows/backport-pr-fixes-validation.yaml: workflow does not contain permissions (Potential fix for code scanning alert no. 139)
Potential fix for https://github.com/scylladb/scylladb/security/code-scanning/139,

To fix the problem, explicitly restrict the `GITHUB_TOKEN` permissions
for this workflow/job so it has only what is needed. The script reads PR
data and repository info (which is covered by `contents: read`/default
read scopes) and posts a comment via `github.rest.issues.createComment`,
which requires `issues: write`. No other write scopes (e.g., `contents:
write`, `pull-requests: write`) are necessary.

The best fix without changing functionality is to add a `permissions`
block scoped to this job (or at the workflow root). Since we only see a
single job here, we’ll add it under `check-fixes-prefix`. Concretely, in
`.github/workflows/backport-pr-fixes-validation.yaml`, between the
`runs-on: ubuntu-latest` line (line 10) and `steps:` (line 11), add:

```yaml
    permissions:
      contents: read
      issues: write
```

This keeps the token minimally privileged while still allowing the script
to create issue/PR comments.

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27810
2026-03-23 10:30:01 +02:00
Piotr Smaron
32225797cd dtest: fix flaky test_writes_schema_recreated_while_node_down
`read_barrier(session2)` was supposed to ensure `node2` has caught up on schema
before a CL=ALL write. But `patient_cql_connection(node2)` creates a
cluster-aware driver session `(TokenAwarePolicy(DCAwareRoundRobinPolicy()))`
that can route the barrier CQL statement to any node — not necessarily `node2`.
If the barrier runs on `node1` or `node3` (which already have the new schema),
it's a no-op, and `node2` remains stale, thus the observed `WriteFailure`.
The fix is to switch to `patient_exclusive_cql_connection(node2)`,
which uses `WhiteListRoundRobinPolicy([node2_ip])` to pin all CQL to `node2`.
This is already the established pattern used by other tests in the same file.

Fixes: SCYLLADB-1139

No need to backport yet, appeared only on master.

Closes scylladb/scylladb#29151
2026-03-23 10:25:54 +02:00
Michał Chojnowski
f29525f3a6 test/boost/cache_algorithm_test: disable sstable compression to avoid giant index pages
The test intentionally creates huge index pages.
But since 5e7fb08bf3,
the index reader allocates a block of memory for a whole index page,
instead of incrementally allocating small pieces during index parsing.
This giant allocation causes the test to fail spuriously in CI sometimes.

Fix this by disabling sstable compression on the test table,
which puts a hard cap of 2000 keys per index page.

Fixes: SCYLLADB-1152

Closes scylladb/scylladb#29152
2026-03-23 09:57:11 +02:00
Raphael S. Carvalho
05b11a3b82 sstables_loader: use new sstable add path
Use add_new_sstable_and_update_cache() when attaching SSTables
downloaded by the node-scoped local loader.

This is the correct variant for new SSTables: it can unlink the
SSTable on failure to add it, and it can split the SSTable if a
tablet split is in progress. The older
add_sstable_and_update_cache() helper is intended for preexisting
SSTables that are already stable on disk.

Additionally, downloaded SSTables are now left unsealed (TemporaryTOC)
until they are successfully added to the table's SSTable set. The
download path (download_fully_contained_sstables) passes
leave_unsealed=true to create_stream_sink, and attach_sstable opens
the SSTable with unsealed_sstable=true and seals it only inside the
on_add callback — matching the pattern used by stream_blob.cc and
storage_service.cc for tablet streaming.

This prevents a data-resurrection hazard: previously, if the process
crashed between download and attach_sstable, or if attach_sstable
failed mid-loop, sealed (TOC) SSTables would remain in the table
directory and be reloaded by distributed_loader on restart. With
TemporaryTOC, sstable_directory automatically cleans them up on
restart instead.

Fixes  https://scylladb.atlassian.net/browse/SCYLLADB-1085.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#29072
2026-03-23 10:33:04 +03:00
Piotr Szymaniak
f511264831 alternator/test: fix test_ttl_with_load_and_decommission flaky Connection refused error
The native Scylla nodetool reports ECONNREFUSED as 'Connection refused',
not as 'ConnectException' (which is the Java nodetool format). Add
'Connection refused' to the valid_errors list so that transient
connection failures during concurrent decommission/bootstrap topology
changes are properly tolerated.

Fixes SCYLLADB-1167

Closes scylladb/scylladb#29156
2026-03-22 11:01:45 +02:00
Pavel Emelyanov
7dce43363e table: merge adjacent querier_opt checks in query()
After the previous fix both guarding if-s start with 'if (querier_opt &&'.
Merge them into a single outer 'if (querier_opt)' block to avoid the
redundant check and make the structure easier to follow.

No functional change.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-20 14:48:08 +03:00
Piotr Dulikowski
cc695bc3f7 Merge 'vector_search: fix race condition on connection timeout' from Karol Nowacki
When a `with_connect` operation timed out, the underlying connection
attempt continued to run in the reactor. This could lead to a crash
if the connection was established/rejected after the client object had
already been destroyed. This issue was observed during the teardown
phase of a upcoming high-availability test case.

This commit fixes the race condition by ensuring the connection attempt
is properly canceled on timeout.

Additionally, the explicit TLS handshake previously forced during the
connection is now deferred to the first I/O operation, which is the
default and preferred behavior.

Fixes: SCYLLADB-832

Backports to 2026.1 and 2025.4 are required, as this issue also exists on those branches and is causing CI flakiness.

Closes scylladb/scylladb#29031

* github.com:scylladb/scylladb:
  vector_search: test: fix flaky test
  vector_search: fix race condition on connection timeout
2026-03-20 11:12:04 +01:00
Petr Gusev
4bfcd035ae test_fencing: add missing await-s
Fixes SCYLLADB-1099

Closes scylladb/scylladb#29133
2026-03-20 10:55:35 +01:00
Pavel Emelyanov
9c1c41df03 table: don't close a disengaged querier in query()
The condition guarding querier_opt->close() checked saved_querier before checking querier_opt.

When saved_querier is null the short-circuit makes the whole condition true
regardless of whether querier_opt is engaged.  If partition_ranges is empty,
query_state::done() is true before the while-loop body ever runs, so querier_opt
is never created.  Calling querier_opt->close() then dereferences a disengaged
std::optional — undefined behaviour.

Fix by checking querier_opt first.

This preserves all existing semantics (close when not saving, or when saving
wouldn't be useful) while making the no-querier path safe.

Why this doesn't surface today: the sole production call site, database::query(),
always passes a non-null saved_querier in practice.  The API header documents
nullptr as valid ("Pass nullptr when
queriers are not saved"), so the bug is real but latent.
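The short-circuit hazard can be sketched in Python, with None standing in for a disengaged std::optional (names are illustrative, not Scylla's actual code):

```python
class Querier:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

def close_if_needed_buggy(querier_opt, saved_querier):
    # Original order: when saved_querier is None the short-circuit fires
    # before querier_opt is ever examined.
    if saved_querier is None:
        querier_opt.close()  # AttributeError when querier_opt is None

def close_if_needed_fixed(querier_opt, saved_querier):
    # Fixed order: make sure the querier exists before anything else.
    if querier_opt is not None and saved_querier is None:
        querier_opt.close()
```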

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-20 12:25:13 +03:00
Pavel Emelyanov
c4a0f6f2e6 object_store: Don't leave dangling objects by iterating moved-from names vector
The code in upload_file std::move()-s the vector of names into the
merge_objects() method, then iterates over this vector to delete
objects. That iteration is a no-op on the moved-from vector.

The fix is to make the merge_objects() helper take the vector of names
by const reference -- the method doesn't modify the names collection,
and the caller keeps it in stable storage.

Fixes #29060

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29061
2026-03-20 10:09:30 +02:00
Pavel Emelyanov
712ba5a31f utils: Use yielding directory_lister in owner verification
Switch directories::do_verify_owner_and_mode() from lister::scan_dir() to
utils::directory_lister while preserving the previous hidden-entry
behavior.

Make do_verify_subpath use lister::filter_type directly so the
verification helper can pass it straight into directory_lister, and keep
a single yielding iteration loop for directory traversal.

One scan_dir user fewer, a step towards removing scan_dir from the code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29064
2026-03-20 10:08:38 +02:00
Pavel Emelyanov
961fc9e041 s3: Don't rearm credential timers when credentials are not refreshed
update_credentials_and_rearm() may get "empty" credentials from
_creds_provider_chain.get_aws_credentials() -- it doesn't throw, but
returns a default-initialized value. In that case expires_at will be
set to time_point::min, and it's not a good idea to arm the
refresh timer, and an even worse idea to subtract 1h from it.
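A Python analogue of the hazard, with datetime.min standing in for time_point::min (a sketch of the guard, not the actual S3 client code):

```python
from datetime import datetime, timedelta

def refresh_arm_time(expires_at):
    """Return when to rearm the refresh timer, or None for empty credentials."""
    if expires_at == datetime.min:  # default-initialized ("empty") credentials
        return None                 # don't arm the timer at all
    return expires_at - timedelta(hours=1)
```

Subtracting from the minimum value is not merely pointless; in Python it even raises OverflowError.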

Fixes #29056

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29057
2026-03-20 10:07:01 +02:00
Pavel Emelyanov
0a8dc4532b s3: Fix missing upload ID in copy_part trace log
The format string had two {} placeholders but three arguments, so the
_upload_id argument was silently skipped from formatting.
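Python's str.format shows the same silent behavior as the fmt-style call here: extra positional arguments beyond the available placeholders are ignored rather than reported.

```python
# Two placeholders, three arguments: the third is silently dropped,
# just like the _upload_id argument in the trace log described above.
msg = "copy part {} of object {}".format(1, "obj", "upload-123")
```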

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29053
2026-03-20 10:05:44 +02:00
Botond Dénes
bb5c328a16 Merge 'Squash two primary-replica restoration tests together' from Pavel Emelyanov
The test_restore_primary_replica_same_domain and test_restore_primary_replica_different_domain tests have a lot in common. Previously each was also split into two, so we had four tests; now we have two that can also be squashed, and the lines-of-code savings are still worth it.

This is the continuation of #28569

Tests improvement, not backporting

Closes scylladb/scylladb#28994

* github.com:scylladb/scylladb:
  test: Replace a bunch of ternary operators with an if-else block
  test: Squash test_restore_primary_replica_same|different_domain tests
  test: Use the same regexp in test_restore_primary_replica_different|same_domain-s
2026-03-20 10:05:16 +02:00
Pavel Emelyanov
ea2a214959 test/backup: Use unique_name() for backup prefix instead of cf_dir
The do_test_backup_abort() fetched the node's workdir and resolved cf_dir
solely to construct a unique-ish backup prefix:

    prefix = f'{cf_dir}/backup'

The comment already acknowledged this was only "unique(ish)" — relying
on the UUID-derived cf_dir name as a uniqueness source is roundabout.
unique_name() is already imported and used for exactly this purpose
elsewhere in the file.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29030
2026-03-20 10:04:22 +02:00
Pavel Emelyanov
65032877d4 api: Move /storage_service/toppartitions from storage_service.cc to column_family.cc
The endpoint URL remains intact. Having it next to another toppartitions
endpoint (the /column_family/toppartitions one) is natural.

This endpoint only needs sharded<replica::database>&, grabs it from
http_context and doesn't use any other service. In column_family.cc the
database reference is already available as a parameter. Once more user
of http_context.db is gone.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#28996
2026-03-20 09:52:33 +02:00
Botond Dénes
de0bdf1a65 Merge 'Decouple test_refresh_deletes_uploaded_sstables from backup test-suite' from Pavel Emelyanov
The test in question uses several helpers from the backup suite, but it doesn't really need them -- the operations it wants to perform can be done with standard pylib methods. While at it, also remove some dangling, effectively unused local variables from this test (apparently left over from the backup tests this one was copied and reworked from).

Enhancing tests, not backporting

Closes scylladb/scylladb#29130

* github.com:scylladb/scylladb:
  test/refresh: Simplify refresh invocation
  test/refresh: Remove r_servers alias for servers
  test/refresh: Replace check_mutation_replicas with a plain CQL SELECT
  test/refresh: Inline keyspace/table/data setup in test_refresh_deletes_uploaded_sstables
  test/refresh: Prepare indentation for new_test_keyspace in test_refresh_deletes_uploaded_sstables
  test/refresh: Decouple test_refresh_deletes_uploaded_sstables from backup tests
  test/refresh: Remove unused wait_for_cql_and_get_hosts import
2026-03-20 09:29:15 +02:00
Botond Dénes
97430e2df5 Merge 'Fix object storage lister entries walking loop' from Pavel Emelyanov
Two issues were found in the lister returned by gs_client_wrapper::make_object_lister():
it can report EOF too early when a filter is active, and there is a potential vector out-of-bounds access.
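A minimal sketch of the corrected walking loop (hypothetical Python, not the actual C++ lister): the position index is reset for every newly fetched batch, and EOF is reported only when the source itself is exhausted, not when a filtered batch comes out empty.

```python
class BatchLister:
    """Iterate entries fetched in batches, applying an optional filter."""
    def __init__(self, fetch_batch, predicate=lambda e: True):
        self._fetch = fetch_batch  # returns the next batch, [] at EOF
        self._pred = predicate
        self._batch, self._pos, self._eof = [], 0, False

    def get(self):
        while True:
            while self._pos < len(self._batch):
                entry = self._batch[self._pos]
                self._pos += 1
                if self._pred(entry):
                    return entry
            if self._eof:
                return None
            self._batch = self._fetch()
            self._pos = 0                 # reset position for the new batch
            self._eof = not self._batch   # EOF only when the source is empty,
                                          # not when the filtered view is
```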

Fixes #29058

The code appeared in 2026.1, worth fixing it there as well

Closes scylladb/scylladb#29059

* github.com:scylladb/scylladb:
  sstables: Fix object storage lister not resetting position in batch vector
  sstables: Fix object storage lister skipping entries when filter is active
2026-03-20 09:12:42 +02:00
Botond Dénes
5573c3b18e Merge 'tablets: Fix deadlock in background storage group merge fiber' from Tomasz Grabiec
When it deadlocks, groups stop merging and the compaction group merge
backlog will run away.

Also, graceful shutdown will be blocked on it.

Found by flaky unit test
test_merge_chooses_best_replica_with_odd_count, which timed out in 1
in 100 runs.

Reason for deadlock:

When storage groups are merged, the main compaction group of the new
storage group takes a compaction lock, which is appended to
_compaction_reenablers_for_merging, and released when the merge
completion fiber is done with the whole batch.

If we accumulate more than 1 merge cycle for the fiber, deadlock
occurs. The lock order will be as follows:

Initial state:

 cg0: main
 cg1: main
 cg2: main
 cg3: main

After 1st merge:

 cg0': main [locked], merging_groups=[cg0.main, cg1.main]
 cg1': main [locked], merging_groups=[cg2.main, cg3.main]

After 2nd merge:

 cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main]

merge completion fiber will try to stop cg0'.main, which will be
blocked on compaction lock. which is held by the reenabler in
_compaction_reenablers_for_merging, hence deadlock.

The fix is to wait for background merge to finish before we start the
next merge. It's achieved by holding old erm in the background merge,
and doing a topology barrier from the merge finalizing transition.

Background merge is supposed to be a relatively quick operation: it
stops compaction groups, so it may wait for active requests, but it
shouldn't prolong the barrier indefinitely.

Tablet tests which trigger merge need to be adjusted to call the
barrier, otherwise they will be vulnerable to the deadlock.
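The essence of the fix -- never start a new merge cycle while the previous background finalization is still running -- can be sketched with asyncio (illustrative only; the real code holds the old erm and uses a topology barrier):

```python
import asyncio

async def merge_cycles(cycles, finalize):
    """Run merge cycles, never letting two background finalizers overlap."""
    background = None
    for cycle in cycles:
        if background is not None:
            await background  # barrier: previous background merge must finish
        background = asyncio.ensure_future(finalize(cycle))
    if background is not None:
        await background
```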

Fixes SCYLLADB-928

Backport to >= 2025.4 because it's the earliest vulnerable due to f9021777d8.

Closes scylladb/scylladb#29007

* github.com:scylladb/scylladb:
  tablets: Fix deadlock in background storage group merge fiber
  replica: table: Propagate old erm to storage group merge
  test: boost: tablets_test: Save tablet metadata when ACKing split resize decision
  storage_service: Extract local_topology_barrier()
2026-03-20 09:05:52 +02:00
Botond Dénes
34473302b0 Merge 'docs: document existing guardrails' from Andrzej Jackowski
This patch series introduces new documentation for existing guardrails.

Moreover:
 - Warning / failure messages of recently added write CL guardrails (SCYLLADB-259) are rephrased, so all guardrails have similar messages.
 - Some new tests are added, to help verify the correctness of the documentation and avoid situations where the documentation and implementation diverge.

Fixes: [SCYLLADB-257](https://scylladb.atlassian.net/browse/SCYLLADB-257)

No backport, just new docs and tests.

[SCYLLADB-257]: https://scylladb.atlassian.net/browse/SCYLLADB-257?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#29011

* github.com:scylladb/scylladb:
  test: add new guardrail tests matching documentation scenarios
  test: add metric assertions to guardrail replication strategy tests
  test: use regex matching in guardrail replication strategy tests
  test: extract ks_opts helper in test_guardrail_replication_strategy
  docs: document CQL guardrails
  cql: improve write consistency level guardrail messages
2026-03-20 08:56:00 +02:00
artem.penner
9898e5700b scylla-node-exporter: Add systemd collector to node exporter
This PR enables the node_exporter systemd collector and configures the unit whitelist to include scylla-server.service and systemd-coredump services.

**Motivation**: We currently lack visibility into system-level service states, which is critical for diagnosing stability issues.

This configuration enables two specific use cases:
- Detecting Coredump Loops: We encounter scenarios where ScyllaDB enters a restart loop. To pinpoint SIGSEGV (coredumps) as the root cause, we need to track when the systemd-coredump service becomes active, indicating a dump is being processed.
- Identifying Startup Failures: We need to detect when the scylla-server unit enters a failed state. This is essential for catching unrecoverable errors (e.g., corrupted commitlogs or configuration bugs) that prevent the server from starting.

example of promql queries:
- `node_systemd_unit_state{name=~"systemd-coredump@.*", state="active"} == 1`
- `node_systemd_unit_state{name="scylla-server.service", state="failed"} == 1`

Closes #28402
2026-03-20 08:39:56 +02:00
Andrzej Jackowski
10c4b9b5b0 test: verify signal() detects resource negative leak in rcs
reader_concurrency_semaphore::signal() guards against available
resources exceeding the initial limit after a signal, which would
indicate a bug such as double-returning resources. It reports the
issue via on_internal_error_noexcept and clamps resources back to
the initial values. However, before this commit there were no tests
that verified this behavior, so bugs like SCYLLADB-1014 went
undetected.

Add a test that artificially signals resources that were never
consumed and verifies that signal() detects the negative leak and
clamps available resources back to the initial limit.
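The guard being exercised can be sketched with plain integers for the semaphore's resource counters (illustrative, not the real reader_concurrency_semaphore code):

```python
def signal(available, returned, initial):
    """Return resources; detect and clamp a negative leak."""
    available += returned
    leaked = available > initial
    if leaked:                 # more was returned than was ever consumed
        available = initial    # clamp back to the initial limit
    return available, leaked
```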

Refs: SCYLLADB-1014
Fixes: SCYLLADB-1031

Closes scylladb/scylladb#28993
2026-03-20 09:21:20 +03:00
Botond Dénes
f9adbc7548 test/cqlpy/test_tombstone_limit.py: disable tombstone-gc for test table
Since 7564a56dc8, all tables default to
repair-mode tombstone-gc, which is identical to immediate-mode for RF=1
tables. Consequently the tombstones written by the tests in this test
file are immediately collectible and with some unlucky timing, some of
them can be collected before the end of the test, failing the empty-page
prefix check because the empty pages prefix will be smaller than
expected based on the number of tombstones written.
Disable tombstone-gc to remove this source of flakiness.

Fixes: SCYLLADB-1062

Closes scylladb/scylladb#29077
2026-03-20 09:14:29 +03:00
Michał Chojnowski
6b18d95dec test: add a missing reconnect_driver in test_sstable_compression_dictionaries_upgrade.py
Need to work around https://github.com/scylladb/python-driver/issues/295,
lest a CQL query fail spuriously after the cluster restart.

Fixes: SCYLLADB-1114

Closes scylladb/scylladb#29118
2026-03-20 09:05:14 +03:00
Botond Dénes
89388510a0 test/cluster/test_data_resurrection_in_memtable.py: use explicit CL
The test has expectations w.r.t. which writes make it to which nodes:
* inserts make it to all nodes
* delete makes it to all-1 (QUORUM) node

However, this was not expressed with CL, and the default CL=ONE allowed
some nodes to miss the writes, violating the test's expectations
about what data is present on which nodes. This resulted in
the test being flaky and failing the data checks.

Use explicit CL for the ingestion to prevent this.

The improvements to the test introduced in
a8dd13731f were of great help in
investigating this: traces are now available and the check happens after
the data was dumped to logs.

Fixes: SCYLLADB-870
Fixes: SCYLLADB-812
Fixes: SCYLLADB-1102

Closes scylladb/scylladb#29128
2026-03-20 09:02:57 +03:00
Avi Kivity
6b259babeb Merge 'logstor: initial log-structured storage for key-value tables' from Michael Litvak
Introduce an initial and experimental implementation of an alternative log-structured storage engine for key-value tables.

Main flows and components:
* The storage is composed of 32MB files, each file divided into segments of size 128k. We sequentially write records that contain a mutation and additional metadata. Records are written to a buffer first and then written to the active segment sequentially in 4k sized blocks.
* The primary index in memory maps keys to their location on disk. It is a B-tree per-table that is ordered by tokens, similar to a memtable.
* On reads we calculate the key and look it up in the primary index, then read the mutation from disk with a single disk IO.
* On writes we write the record to a buffer, wait for it to be written to disk, then update the index with the new location, and free the previous record.
* We track the used space in each segment. When overwriting a record, we increase the free space counter for the segment of the previous record that becomes dead. We store the segments in a histogram by usage.
* The compaction process takes segments with low utilization, reads them and writes the live records to new segments, and frees the old segments.
* Segments are initially "mixed" - we write to the active segment records from all tables and all tablets. The "separator" process rewrites records from mixed segments into new segments that are organized by compaction groups (tablets), and frees the mixed segments. Each write is written to the active segment and to a separator buffer of the compaction group, which is eventually flushed to a new segment in the compaction group.

Currently this mode is experimental and requires an experimental flag to be enabled.
Some things that are not supported yet are strong consistency, tablet migration, tablet split/merge, big mutations, tombstone gc, ttl.

to use, add to config:
```
enable_logstor: true

experimental_features:
  - logstor
```

create a table:
```
CREATE TABLE ks.t(pk int PRIMARY KEY, a int, v text) WITH storage_engine = 'logstor';
```

INSERT, SELECT, and DELETE work as expected.
UPDATE is not supported yet.

no backport - new feature

Closes scylladb/scylladb#28706

* github.com:scylladb/scylladb:
  logstor: trigger separator flush for buffers that hold old segments
  docs/dev: add logstor documentation
  logstor: recover segments into compaction groups
  logstor: range read
  logstor: change index to btree by token per table
  logstor: move segments to replica::compaction_group
  db: update dirty mem limits dynamically
  logstor: track memory usage
  logstor: logstor stats api
  logstor: compaction buffer pool
  logstor: separator: flush buffer when full
  logstor: hold segment until index updates
  logstor: truncate table
  logstor: enable/disable compaction per table
  logstor: separator buffer pool
  test: logstor: add separator and compaction tests
  logstor: segment and separator barrier
  logstor: separator debt controller
  logstor: compaction controller
  logstor: recovery: recover mixed segments using separator
  logstor: wait for pending reads in compaction
  logstor: separator
  logstor: compaction groups
  logstor: cache files for read
  logstor: recovery: initial
  logstor: add segment generation
  logstor: reserve segments for compaction
  logstor: index: buckets
  logstor: add buffer header
  logstor: add group_id
  logstor: record generation
  logstor: generation utility
  logstor: use RIPEMD-160 for index key
  test: add test_logstor.py
  api: add logstor compaction trigger endpoint
  replica: add logstor to db
  schema: add logstor cf property
  logstor: initial commit
  db: disable tablet balancing with logstor
  db: add logstor experimental feature flag
2026-03-20 00:18:09 +02:00
Avi Kivity
062751fcec Merge 'db/config: enable ms sstable format by default' from Łukasz Paszkowski
Trie-based sstable indexes are supposed to be (hopefully) a better default than the old BIG indexes.
Make the new format the default for new clusters by naming `ms` in the default scylla.yaml.

New functionality. No backport needed.

This PR is basically Michał's one https://github.com/scylladb/scylladb/pull/26377, Jakub's  https://github.com/scylladb/scylladb/pull/27332 fixing `sstables_manager::get_highest_supported_format()` and one test fix.

Closes scylladb/scylladb#28960

* github.com:scylladb/scylladb:
  db/config: announce ms format as highest supported
  db/config: enable `ms` sstable format by default
  cluster/dtest/bypass_cache_test: switch from highest_supported_sstable_format to chosen_sstable_format
  api/system: add /system/chosen_sstable_version
  test/cluster/dtest: reduce num_tokens to 16
2026-03-19 18:19:01 +02:00
Pavel Emelyanov
969dddb630 test/refresh: Simplify refresh invocation
take_snapshot return values were unused so drop them. do_refresh was a
thin wrapper around load_new_sstables that added no logic; inline it
directly into the gather expression.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-19 18:42:57 +03:00
Pavel Emelyanov
de21572b31 test/refresh: Remove r_servers alias for servers
r_servers = servers was a no-op assignment; use servers directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-19 18:42:52 +03:00
Pavel Emelyanov
20b1531e6d test/refresh: Replace check_mutation_replicas with a plain CQL SELECT
The goal of test_refresh_deletes_uploaded_sstables is to verify that
sstables are removed from the upload directory after refresh. The replica
check was just a sanity guard; a simple SELECT of all keys is sufficient
and much lighter.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-19 18:42:48 +03:00
Pavel Emelyanov
c591b9ebe2 test/refresh: Inline keyspace/table/data setup in test_refresh_deletes_uploaded_sstables
Replace create_dataset() with explicit keyspace creation via new_test_keyspace,
inline CREATE TABLE, and direct cql.run_async inserts — matching the pattern
used in do_test_streaming_scopes. This removes the last dependency on backup
helpers for dataset setup and makes the test self-contained.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-19 18:42:44 +03:00
Pavel Emelyanov
06006a6328 test/refresh: Prepare indentation for new_test_keyspace in test_refresh_deletes_uploaded_sstables
Wrap the test body under if True: to pre-indent it, making the subsequent
patch that introduces new_test_keyspace a pure content change with no
whitespace noise.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-19 18:42:40 +03:00
Pavel Emelyanov
67d8cde42d test/refresh: Decouple test_refresh_deletes_uploaded_sstables from backup tests
Replace create_cluster() from object_store/test_backup.py with a plain
manager.servers_add(2) call. The test does not use object storage, so
there is no need to pull in the backup helper along with its config and
logging knobs.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-19 18:42:36 +03:00
Pavel Emelyanov
04f046d2d8 test/refresh: Remove unused wait_for_cql_and_get_hosts import
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-19 18:42:32 +03:00
Botond Dénes
e8b37d1a89 Merge 'doc: fix the installation section' from Anna Stuchlik
This PR fixes the Installation page:

- Replaces `http `with `https `in the download command.
- Replaces the Open Source example from the Installation section for CentOS (we overlooked this example before).

Fixes https://github.com/scylladb/scylladb/issues/29087

This update affects all supported versions and should be backported as a bug fix.

Closes scylladb/scylladb#29088

* github.com:scylladb/scylladb:
  doc: remove the Open Source Example from Installation
  doc: replace http with https in the installation instructions
2026-03-19 17:13:53 +02:00
Dario Mirovic
d2c44722e1 test: cluster: fix log clear race condition in test_audit.py
assert_entries_were_added:
- takes a "before" snapshot of the audit log
- yields to execute a statement
- takes an "after" snapshot of the audit log
- computes new rows by diffing "after" minus "before"

If an audit entry generated by prepare() arrives between the snapshot
and the diff, it inflates the new row count and the test fails with
assert 2 <= 1.

Fix by:
- Adding clear_audit_logs() at the end of prepare(), after all setup
- Waiting for the "completed re-reading configuration file" log message
  after server_update_config
- Draining pending syslog lines before clearing the buffer

Refs SCYLLADB-573
2026-03-19 16:12:13 +01:00
Dario Mirovic
821f8696a7 test: pylib: shut down exclusive cql connections in ManagerClient
get_cql_exclusive() creates a Cluster object per call, but never
records it. driver_close() cannot shut it down. The cluster's
internal scheduler thread then tries to submit work to an already
shut down executor. This causes RuntimeError:

RuntimeError: cannot schedule new futures after shutdown

Fix this by tracking every exclusive Cluster in a list and shutting
them all down in driver_close().
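The fix boils down to bookkeeping that can be sketched like this (`Cluster` here is any object with a shutdown() method, not the real cassandra-driver class):

```python
class DriverManager:
    """Track every exclusive Cluster so driver_close() can shut all down."""
    def __init__(self, cluster_factory):
        self._factory = cluster_factory
        self._exclusive_clusters = []

    def get_cql_exclusive(self, host):
        cluster = self._factory(host)
        self._exclusive_clusters.append(cluster)  # record it for cleanup
        return cluster

    def driver_close(self):
        for cluster in self._exclusive_clusters:
            cluster.shutdown()                    # stops scheduler threads too
        self._exclusive_clusters.clear()
```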

Refs SCYLLADB-573
2026-03-19 16:12:13 +01:00
Dario Mirovic
d94999f87b test: cluster: fix multinode audit entry comparison in test_audit.py
assert_entries_were_added computes new audit rows by slicing the "after"
list at the length of the "before" list: rows_after[len(rows_before):].
This assumes new rows always appear at the tail of the combined sorted
list. In a multinode setup, each node generates its own event_time
timestamps. A new row from node A can sort before an old row from node
B, breaking the tail assumption. The assertion "new rows are not the
last rows in the audit table" then fires.

Fix this by splitting the before/after lists per node and computing the
new rows tail independently for each node. This guarantees that per node
ordering, which is monotonic, is respected, and the combined new rows
are sorted afterwards.
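The per-node diff can be sketched as follows, with rows reduced to (node, event_time) pairs for illustration:

```python
from collections import defaultdict

def group_by_node(rows):
    grouped = defaultdict(list)
    for row in rows:  # row = (node, event_time)
        grouped[row[0]].append(row)
    return grouped

def new_rows(rows_before, rows_after):
    """Per-node tail diff: event_time is monotonic within a node only."""
    before = group_by_node(rows_before)
    result = []
    for node, rows in group_by_node(rows_after).items():
        result.extend(rows[len(before[node]):])  # tail past that node's old rows
    return sorted(result)
```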

Refs SCYLLADB-573
2026-03-19 16:12:13 +01:00
Dario Mirovic
249a6cec1b test: cluster: dtest: remove old audit tests
Since audit tests have been migrated to test/cluster/test_audit.py,
old tests in test/cluster/dtest/audit_test.py have to be removed.

Refs SCYLLADB-573
2026-03-19 16:12:13 +01:00
Dario Mirovic
adc790a8bf test: cluster: group migrated audit tests for cluster reuse
This patch reorganizes the execution flow of the test functions.
They are grouped to enable cluster reuse between specific test
functions. One of the main contributors to the test execution time
is the cluster preparation. This patch significantly reduces the
total test execution time by having way less new cluster preparation
calls and more cluster reuse.

Performance increase on the developer machine is around 38%:
- before: 4m 29s
- after: 2m 47s

Fixes SCYLLADB-573
2026-03-19 16:11:47 +01:00
Dario Mirovic
967b7ff6bf test: cluster: enable migrated audit tests and make them work
Move audit tests from test/cluster/dtest to test/cluster.
The test/cluster environment has less overhead, and audit tests
are heavy, their execution taking a lot of time. This patch
is part of an effort to improve audit test suite performance.

This patch refactors the tests so that they execute correctly,
as well as enables them. A follow up patch will remove the
audit tests in test/cluster/dtest.

All the tests are confirmed to be running after the change.
No dead code present.

Test test_audit_categories_invalid is not parametrized anymore.
It never used the parametrized helper class, so it just ran
the same logic three times. This is why there are now 74,
and not 76, test executions.

Refs SCYLLADB-573
2026-03-19 16:07:28 +01:00
Dario Mirovic
5d51501a0b pgo: use maintenance socket for CQL setup in PGO training
The default 'cassandra' superuser was removed from ScyllaDB, which
broke PGO training. exec_cql.py relied on username/password auth
('cassandra'/'cassandra') to execute setup CQL scripts like auth.cql
and counters.cql.

Switch exec_cql.py to connect via the Unix domain maintenance socket
instead. The maintenance socket bypasses authentication, so no
credentials are needed. Additionally, create the 'cassandra' superuser
via the maintenance socket during the populate phase, so that
cassandra-stress keeps working (it hardcodes user=cassandra
password=cassandra).

Changes:
- exec_cql.py: replace host/port/username/password arguments with a
  single --socket argument; add connect_maintenance_socket() with
  wait ready logic
- pgo.py: add maintenance_socket_path() helper; update
  populate_auth_conns() and populate_counters() to pass the socket
  path to exec_cql.py

Fixes SCYLLADB-1070

Closes scylladb/scylladb#29081
2026-03-19 16:52:36 +02:00
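The "wait ready" step described above can be sketched as a small poll loop. This is an illustrative Python sketch under assumed names and timeouts, not the actual exec_cql.py code:

```python
import os
import socket
import time

def wait_for_maintenance_socket(path, timeout=30.0, poll=0.5):
    """Poll until a Unix domain socket at `path` accepts connections.

    Hypothetical sketch of the "wait ready" logic the commit describes;
    the real helper in exec_cql.py may differ.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            try:
                s.connect(path)
                return True  # socket exists and is accepting connections
            except OSError:
                pass  # socket file present but server not listening yet
            finally:
                s.close()
        time.sleep(poll)
    return False
```

Polling with a deadline (rather than a single connect attempt) matters here because the socket file may appear before the server actually listens on it.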
Dario Mirovic
8367509b3b test: pylib: manager_client: specify AuthProvider in get_cql_exclusive
This patch allows ManagerClient.get_cql_exclusive to accept an
AuthProvider as a parameter. This will be used in a follow-up patch
which migrates the audit test suite to test/cluster and requires this
functionality for some tests.

Refs SCYLLADB-573
2026-03-19 15:35:24 +01:00
Dario Mirovic
0a7a69345c test: pylib: scylla cluster after_test log fix
Before any test, a pool of ScyllaCluster objects is created.

At the beginning of a test suite, a ScyllaClusterManager is created,
and given a reference to the pool.
At the end of a test suite, the ScyllaClusterManager is destroyed.

Before each test case:
- ManagerClient is constructed and connected to the ScyllaClusterManager
  of that test suite
- A ScyllaCluster object is fetched from the pool
  - If the pool is empty, a new ScyllaCluster object is created
  - If the pool is not empty, a cached ScyllaCluster object is returned

After each test case:
- Return ScyllaCluster object from ManagerClient to the pool
  - If the cluster is dirty, the pool destroys it
  - If the cluster is clean, the pool caches it
- ManagerClient is destroyed

Many actions mark a cluster as dirty. Normal test execution will always
cause the cluster to be destroyed upon returning to the pool.
ManagerClient.mark_clean is not used in the tests; when it is used,
the cluster reuse flow happens.

The bug is that the log file is closed even if the cluster is not
dirty. This causes an error when trying to log to a reused cluster
server.

The solution in this patch is to not close the log file if the cluster
is not dirty. Upon cluster reuse the log file will be open and
functional.

Another approach would be to reopen the log file if it is closed, but
the approach taken here seems cleaner.

Refs SCYLLADB-573
2026-03-19 15:35:24 +01:00
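The dirty/clean return path described above can be sketched in a few lines of Python; class and function names here are illustrative stand-ins, not the real pylib code:

```python
class ServerLogFile:
    """Minimal stand-in for a cluster server's log file handle."""
    def __init__(self):
        self.closed = False

    def write(self, line):
        if self.closed:
            raise ValueError("log file is closed")

    def close(self):
        self.closed = True

def return_to_pool(cluster_dirty, log_file):
    """The fix in a nutshell: only tear down (and close the log) for
    dirty clusters; a clean cluster is cached for reuse with its log
    file still open and functional."""
    if cluster_dirty:
        log_file.close()
        return "destroyed"
    return "cached"
```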
Dario Mirovic
899ae71349 test: audit: copy audit test from dtest
This patch just copies the audit test suite from dtest and
disables it in the test config file. Later patches will
update the code and enable the test suite.

Refs SCYLLADB-573
2026-03-19 15:35:24 +01:00
Andrzej Jackowski
4deeb7ebfc test: add new guardrail tests matching documentation scenarios
Add tests for RF guardrails (min/max warn/fail, RF=0 bypass,
threshold=-1 disable, ALTER KEYSPACE) and write consistency level
guardrails to cover all scenarios described in guardrails.rst.

Test runtime (dev):
test_guardrail_replication_strategy - 6s
test_guardrail_write_consistency_level - 5s

Refs: SCYLLADB-257
2026-03-19 15:07:03 +01:00
Andrzej Jackowski
2a03c634c0 test: add metric assertions to guardrail replication strategy tests
Verify that guardrail violations increment the corresponding metrics.

Refs: SCYLLADB-257
2026-03-19 15:07:03 +01:00
Andrzej Jackowski
81c4e717e2 test: use regex matching in guardrail replication strategy tests
Replace loose substring assertions with regex-based matching against
the exact server message formats. Add regex constants for all
guardrail messages and rewrite create_ks_and_assert_warnings_and_errors()
to verify count and content of warnings and failures.

Refs: SCYLLADB-257
2026-03-19 15:07:03 +01:00
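The substring-to-regex tightening described above can be illustrated like this; the message format and helper name are assumptions for the example, not the exact server wording or test code:

```python
import re

# Illustrative regex for a replication-factor guardrail warning; the
# exact server message format in ScyllaDB may differ.
RF_WARN_RE = re.compile(
    r"^Replication factor (?P<rf>\d+) exceeds the warn threshold (?P<thr>\d+)$"
)

def assert_guardrail_warning(warnings, rf, threshold):
    """Verify both the count and the content of warnings, instead of a
    loose `"threshold" in warning` substring check."""
    matches = [m for w in warnings if (m := RF_WARN_RE.match(w))]
    assert len(matches) == 1, warnings
    m = matches[0]
    assert int(m.group("rf")) == rf and int(m.group("thr")) == threshold
```

Anchored regexes with capture groups catch message drift and wrong counts that a substring check would silently accept.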
Anna Stuchlik
6b1df5202c doc: remove the instructions to install old versions from Web Installer
The Web Installer page includes instructions to install the old pre-2025.1 Enterprise versions,
which are no longer supported (since we released 2026.1).

This commit removes those redundant and misleading instructions.

Fixes https://github.com/scylladb/scylladb/issues/29099

Closes scylladb/scylladb#29103
2026-03-19 15:47:00 +02:00
Piotr Dulikowski
171504c84f Merge 'auth: migrate some standard role manager APIs to use cache' from Marcin Maliszkiewicz
This patchset migrates: query_all_directly_granted, query_all,
get_attribute, query_attribute_for_all functions to use cache
instead of doing CQL queries. It also includes some preparatory
work which fixes cache update order and triggering.

Main motivation behind this is to make sure that all calls
from service_level_controller::auth_integration are cached,
which we achieve here.

Alternative implementation could move the whole auth_integration
data into auth cache but since auth_integration manages also lifetime
and contains service levels specific logic such solution would be
too complex for little (if any) gain.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-159
Backport: no, not a bug

Closes scylladb/scylladb#28791

* github.com:scylladb/scylladb:
  auth: switch query_attribute_for_all to use cache
  auth: switch get_attribute to use cache
  auth: cache: add heterogeneous map lookups
  auth: switch query_all to use cache
  auth: switch query_all_directly_granted to use cache
  auth: cache: add ability to go over all roles
  raft: service: reload auth cache before service levels
  service: raft: move update_service_levels_effective_cache check
2026-03-19 14:37:22 +01:00
Avi Kivity
5e7fb08bf3 Merge 'Fix bad performance for densely populated partition index pages' from Tomasz Grabiec
This applies to small partition workloads where index pages have a high partition count and the index doesn't fit in cache. It was observed that the count can be on the order of hundreds. In such a workload pages undergo constant population, LSA compaction, and LSA eviction, which has a severe impact on CPU utilization.

Refs https://scylladb.atlassian.net/browse/SCYLLADB-620

This PR reduces the impact by several changes:

  - reducing memory footprint in the partition index. Assuming partition key size is 16 bytes, the cost dropped from 96 bytes to 36 bytes per partition.

  - flattening the object graph and amortizing storage. Storing entries directly in the vector. Storing all key values in a single managed_bytes. Making index_entry a trivial struct.

  - index entries and key storage are now trivially moveable, and batched inside vector storage
    so LSA migration can use memcpy(), which amortizes the cost per key. This reduces the cost of LSA segment compaction.

  - LSA eviction is now pretty much constant time for the whole page
    regardless of the number of entries, because elements are trivial and batched inside vectors.
    Page eviction cost dropped from 50 us to 1 us.

Performance evaluated with:

   scylla perf-simple-query -c1 -m200M --partitions=1000000

Before:

```
7774.96 tps (166.0 allocs/op, 521.7 logallocs/op,  54.0 tasks/op,  802428 insns/op,  430457 cycles/op,        0 errors)
7511.08 tps (166.1 allocs/op, 527.2 logallocs/op,  54.0 tasks/op,  804185 insns/op,  430752 cycles/op,        0 errors)
7740.44 tps (166.3 allocs/op, 526.2 logallocs/op,  54.2 tasks/op,  805347 insns/op,  432117 cycles/op,        0 errors)
7818.72 tps (165.2 allocs/op, 517.6 logallocs/op,  53.7 tasks/op,  794965 insns/op,  427751 cycles/op,        0 errors)
7865.49 tps (165.1 allocs/op, 513.3 logallocs/op,  53.6 tasks/op,  788898 insns/op,  425171 cycles/op,        0 errors)
```

After (+318%):

```
32492.40 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109236 insns/op,  103203 cycles/op,        0 errors)
32591.99 tps (130.4 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  108947 insns/op,  102889 cycles/op,        0 errors)
32514.52 tps (130.6 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109118 insns/op,  103219 cycles/op,        0 errors)
32491.14 tps (130.6 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109349 insns/op,  103272 cycles/op,        0 errors)
32582.90 tps (130.5 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109269 insns/op,  102872 cycles/op,        0 errors)
32479.43 tps (130.6 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109313 insns/op,  103242 cycles/op,        0 errors)
32418.48 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109201 insns/op,  103301 cycles/op,        0 errors)
31394.14 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109267 insns/op,  103301 cycles/op,        0 errors)
32298.55 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109323 insns/op,  103551 cycles/op,        0 errors)
```

When the workload is miss-only, with both row cache and index cache disabled (no cache maintenance cost):

  perf-simple-query -c1 -m200M --duration 6000 --partitions=100000 --enable-index-cache=0 --enable-cache=0

Before:

```
9124.57 tps (146.2 allocs/op, 789.0 logallocs/op,  45.3 tasks/op,  889320 insns/op,  357937 cycles/op,        0 errors)
9437.23 tps (146.1 allocs/op, 789.3 logallocs/op,  45.3 tasks/op,  889613 insns/op,  357782 cycles/op,        0 errors)
9455.65 tps (146.0 allocs/op, 787.4 logallocs/op,  45.2 tasks/op,  887606 insns/op,  357167 cycles/op,        0 errors)
9451.22 tps (146.0 allocs/op, 787.4 logallocs/op,  45.3 tasks/op,  887627 insns/op,  357357 cycles/op,        0 errors)
9429.50 tps (146.0 allocs/op, 787.4 logallocs/op,  45.3 tasks/op,  887761 insns/op,  358148 cycles/op,        0 errors)
9430.29 tps (146.1 allocs/op, 788.2 logallocs/op,  45.3 tasks/op,  888501 insns/op,  357679 cycles/op,        0 errors)
9454.08 tps (146.0 allocs/op, 787.3 logallocs/op,  45.3 tasks/op,  887545 insns/op,  357132 cycles/op,        0 errors)
```

After (+55%):

```
14484.84 tps (150.7 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  396164 insns/op,  229490 cycles/op,        0 errors)
14526.21 tps (150.8 allocs/op,   6.5 logallocs/op,  44.8 tasks/op,  396401 insns/op,  228824 cycles/op,        0 errors)
14567.53 tps (150.7 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  396319 insns/op,  228701 cycles/op,        0 errors)
14545.63 tps (150.6 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  395889 insns/op,  228493 cycles/op,        0 errors)
14626.06 tps (150.5 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  395254 insns/op,  227891 cycles/op,        0 errors)
14593.74 tps (150.5 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  395480 insns/op,  227993 cycles/op,        0 errors)
14538.10 tps (150.8 allocs/op,   6.5 logallocs/op,  44.8 tasks/op,  397035 insns/op,  228831 cycles/op,        0 errors)
14527.18 tps (150.8 allocs/op,   6.5 logallocs/op,  44.8 tasks/op,  396992 insns/op,  228839 cycles/op,        0 errors)
```

Same as above, but with summary ratio increased from 0.0005 to 0.005 (smaller pages):

Before:

```
33906.70 tps (146.1 allocs/op,  83.6 logallocs/op,  45.1 tasks/op,  170553 insns/op,   98104 cycles/op,        0 errors)
32696.16 tps (146.0 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170369 insns/op,   98405 cycles/op,        0 errors)
33889.05 tps (146.1 allocs/op,  83.6 logallocs/op,  45.1 tasks/op,  170551 insns/op,   98135 cycles/op,        0 errors)
33893.24 tps (146.1 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170488 insns/op,   98168 cycles/op,        0 errors)
33836.73 tps (146.1 allocs/op,  83.6 logallocs/op,  45.1 tasks/op,  170528 insns/op,   98226 cycles/op,        0 errors)
33897.61 tps (146.0 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170428 insns/op,   98081 cycles/op,        0 errors)
33834.73 tps (146.1 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170438 insns/op,   98178 cycles/op,        0 errors)
33776.31 tps (146.3 allocs/op,  83.9 logallocs/op,  45.2 tasks/op,  170958 insns/op,   98418 cycles/op,        0 errors)
33808.08 tps (146.3 allocs/op,  83.9 logallocs/op,  45.2 tasks/op,  170940 insns/op,   98388 cycles/op,        0 errors)
```

After (+18%):

```
40081.51 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121047 insns/op,   82231 cycles/op,        0 errors)
40005.85 tps (148.6 allocs/op,   4.4 logallocs/op,  45.2 tasks/op,  121327 insns/op,   82545 cycles/op,        0 errors)
39816.75 tps (148.3 allocs/op,   4.4 logallocs/op,  45.1 tasks/op,  121067 insns/op,   82419 cycles/op,        0 errors)
39953.11 tps (148.1 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121027 insns/op,   82258 cycles/op,        0 errors)
40073.96 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121006 insns/op,   82313 cycles/op,        0 errors)
39882.25 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  120925 insns/op,   82320 cycles/op,        0 errors)
39916.08 tps (148.3 allocs/op,   4.4 logallocs/op,  45.1 tasks/op,  121054 insns/op,   82393 cycles/op,        0 errors)
39786.30 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121027 insns/op,   82465 cycles/op,        0 errors)
38662.45 tps (148.3 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121108 insns/op,   82312 cycles/op,        0 errors)
39849.42 tps (148.3 allocs/op,   4.4 logallocs/op,  45.1 tasks/op,  121098 insns/op,   82447 cycles/op,        0 errors)
```

Closes scylladb/scylladb#28603

* github.com:scylladb/scylladb:
  sstables: mx: index_reader: Optimize parsing for no promoted index case
  vint: Use std::countl_zero()
  test: sstable_partition_index_cache_test: Validate scenario of pages with sparse promoted index placement
  sstables: mx: index_reader: Amortize partition key storage
  managed_bytes: Hoist write_fragmented() to common header
  utils: managed_vector: Use std::uninitialized_move() to move objects
  sstables: mx: index_reader: Keep promoted_index info next to index_entry
  sstables: mx: index_reader: Extract partition_index_page::clear_gently()
  sstables: mx: index_reader: Shave-off 16 bytes from index_entry by using raw_token
  sstables: mx: index_reader: Reduce allocation_section overhead during index page parsing by batching allocation
  sstables: mx: index_reader: Keep index_entry directly in the vector
  dht: Introduce raw_token
  test: perf_simple_query: Add 'sstable-format' command-line option
  test: perf_simple_query: Add 'sstable-summary-ratio' command-line option
  test: perf-simple-query: Add option to disable index cache
  test: cql_test_env: Respect enable-index-cache config
2026-03-19 14:42:50 +02:00
Botond Dénes
4981e72607 Merge 'replica: avoid unnecessary computation on token lookup hot path' from Łukasz Paszkowski
`storage_group_of()` sits on the replica-side token lookup hot path, yet it called `tablet_map::get_tablet_id_and_range_side()`, which always computes both the tablet id and the post-split range side — even though most callers only need the storage group id.

The range-side computation is only relevant when a storage group is in tablet splitting mode, but we were paying for it unconditionally on every lookup.

This series fixes that by:

1. Adding `tablet_map::get_tablet_range_side()` so the range side can be computed independently when needed.
2. Adding lazy `select_compaction_group()` overloads that defer the range-side computation until splitting mode is actually active.
3. Switching `storage_group_of()` to use the cheaper `get_tablet_id()` path, only computing the range side on demand.

Improvements. No backport is required.

Closes scylladb/scylladb#28963

* github.com:scylladb/scylladb:
  replica/table: avoid computing token range side in storage_group_of() on hot path
  replica/compaction_group: add lazy select_compaction_group() overloads
  locator/tablets: add tablet_map::get_tablet_range_side()
2026-03-19 14:27:12 +02:00
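The eager-id / lazy-range-side split can be illustrated as follows; the token arithmetic is made up for the example, and only the control flow mirrors the patch:

```python
def storage_group_of(token, tablet_count, splitting):
    """Sketch of the optimization: always compute the cheap tablet id,
    and only compute the post-split range side when splitting mode is
    actually active (the math here is illustrative, not ScyllaDB's
    token layout)."""
    tablet_id = token % tablet_count  # cheap path, needed by all callers
    if not splitting:
        return tablet_id, None
    # Deferred branch: the more expensive range-side computation runs
    # only on demand, matching the lazy select_compaction_group()
    # overloads described above.
    range_side = (token // tablet_count) % 2
    return tablet_id, range_side
```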
Ernest Zaslavsky
aa9da87e97 encryption: fix deadlock in encrypted_data_source::get()
When encrypted_data_source::get() caches a trailing block in
_next, the next call takes it directly — bypassing
input_stream::read(), which checks _eof. It then calls
input_stream::read_exactly() on the already-drained stream.
Unlike read(), read_up_to(), and consume(), read_exactly()
does not check _eof when the buffer is empty, so it calls
_fd.get() on a source that already returned EOS.

In production this manifested as stuck encrypted SSTable
component downloads during tablet restore: the underlying
chunked_download_source hung forever on the post-EOS get(),
causing 4 tablets to never complete. The stuck files were
always block-aligned sizes (8k, 12k) where _next gets
populated and the source is fully consumed in the same call.

Fix by checking _input.eof() before calling read_exactly().
When the stream already reached EOF, buf2 is known to be
empty, so the call is skipped entirely.

A comprehensive test is added that uses a strict_memory_source
which fails on post-EOS get(), reproducing the exact code
path that caused the production deadlock.
2026-03-19 13:54:54 +02:00
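The post-EOS hazard and the eof-check fix can be illustrated with a Python analogue. The classes below are illustrative stand-ins, not the C++ encrypted_data_source or strict_memory_source:

```python
class DrainedSourceError(Exception):
    pass

class StrictSource:
    """Stand-in for a strict source that fails on post-EOS get(),
    mirroring the strict_memory_source idea from the commit."""
    def __init__(self, chunks):
        self._chunks = list(chunks)
        self._eof = False

    def eof(self):
        return self._eof

    def get(self):
        if self._eof:
            raise DrainedSourceError("get() called after EOS")
        if not self._chunks:
            self._eof = True
            return b""
        return self._chunks.pop(0)

def read_trailing_block(stream, n):
    """The fix in miniature: skip the read entirely when the stream
    already hit EOF, since the buffer is then known to be empty."""
    if stream.eof():
        return b""
    return stream.get()[:n]
```

Without the eof() guard, the trailing-block path would call get() on an already-drained source, which is the shape of the hang seen on block-aligned files.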
Ernest Zaslavsky
f74a54f005 test_lib: mark limiting_data_source_impl as not final 2026-03-19 13:54:54 +02:00
Ernest Zaslavsky
151e945d9f Fix formatting after previous patch 2026-03-19 13:54:44 +02:00
Andrzej Jackowski
517bb8655d test: extract ks_opts helper in test_guardrail_replication_strategy
Factor out ks_opts() to build keyspace options with tablets handling
and use it across all existing replication strategy guardrail tests.
No behavioral changes.

This facilitates further modification of the tests later in this
patch series.

Refs: SCYLLADB-257
2026-03-19 12:49:41 +01:00
Andrzej Jackowski
9b24d9ee7d docs: document CQL guardrails
Add docs/cql/guardrails.rst covering replication factor, replication
strategy, write consistency level, and compact storage guardrails.

Fixes: SCYLLADB-257
2026-03-19 12:49:41 +01:00
Ernest Zaslavsky
537747cf5d Fix indentation after previous patch 2026-03-19 13:48:53 +02:00
Ernest Zaslavsky
2535164542 test_lib: make limiting_data_source_impl available to tests
Relocate the `limiting_data_source_impl` declaration to the header file
so that test code can access it directly.
2026-03-19 13:48:53 +02:00
Botond Dénes
86d7c82993 test/cluster/test_repair.py: use tablets in test_repair_timestamp_difference
After repair, the test does a major to compact all sstables into a
single one, so the results can be simply checked by a select from
mutation_fragments() query. Sometimes off-strategy happens parallel to
this major, so after the major there are still 2 sstables, resulting in
the test failing when checking that the query returns just a single row.
To fix, just use tablets for the test table, tablets don't use
off-strategy anymore.

Fixes: SCYLLADB-940

Closes scylladb/scylladb#29071
2026-03-19 12:42:18 +03:00
Michael Litvak
399260a6c0 test: mv: fix flaky wait for commitlog sync
Previously the test test_interrupt_view_build_shard_registration stopped
the node ungracefully and used commitlog periodic mode to persist the
view build progress in an unreliable way.

It can happen that due to timing issues, the view build progress is not
persisted, or some of it is persisted in a different ordering than
expected.

To make the test more reliable we change it to stop the node gracefully,
so the commitlog is persisted in a graceful and consistent way, without
using the periodic mode delay. We need to also change the injection for
the shutdown to not get stuck.

Fixes SCYLLADB-1005

Closes scylladb/scylladb#29008
2026-03-19 10:41:21 +01:00
Pavel Emelyanov
f27dc12b7c Merge 'Fix directory lister leak in table::get_snapshot_details: ' from Benny Halevy
As reported in SCYLLADB-1013, the directory lister must be closed also when an exception is thrown.

For example, see backtrace below:
```
seastar::on_internal_error(seastar::logger&, std::basic_string_view<char, std::char_traits<char>>) at ./build/release/seastar/./seastar/src/core/on_internal_error.cc:57
directory_lister::~directory_lister() at ./utils/lister.cc:77
replica::table::get_snapshot_details(std::filesystem::__cxx11::path, std::filesystem::__cxx11::path) (.resume) at ./replica/table.cc:4081
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<db::snapshot_ctl::table_snapshot_details>::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/coroutine:247
 (inlined by) seastar::internal::coroutine_traits_base<db::snapshot_ctl::table_snapshot_details>::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:129
seastar::reactor::task_queue::run_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2695
 (inlined by) seastar::reactor::task_queue_group::run_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3201
seastar::reactor::task_queue_group::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3185
 (inlined by) seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3353
seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3245
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:266
seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:160
scylla_main(int, char**) at ./main.cc:756
```

Fixes: [SCYLLADB-1013](https://scylladb.atlassian.net/browse/SCYLLADB-1013)

* Requires backport to 2026.1 since the leak exists since 004c08f525

[SCYLLADB-1013]: https://scylladb.atlassian.net/browse/SCYLLADB-1013?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#29084

* github.com:scylladb/scylladb:
  test/boost/database_test: add test_snapshot_ctl_details_exception_handling
  table: get_snapshot_details: fix indentation inside try block
  table: per-snapshot get_snapshot_details: fix typo in comment
  table: per-snapshot get_snapshot_details: always close lister using try/catch
  table: get_snapshot_details: always close lister using deferred_close
2026-03-19 12:40:23 +03:00
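The always-close-even-on-exception idea translates to Python roughly as follows; this is illustrative, since the real fix uses deferred_close and try/catch in C++:

```python
import contextlib
import os

def list_snapshot_files(path):
    """Analogue of closing the directory lister on every exit path:
    contextlib.closing guarantees the scandir handle is released even
    if iteration raises partway through."""
    it = os.scandir(path)
    with contextlib.closing(it):  # closed on success *and* on exception
        return sorted(entry.name for entry in it)
```

The bug class is the same in both languages: a resource acquired before a loop leaks when an exception escapes the loop, unless closing is tied to scope exit rather than to the happy path.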
Raphael S. Carvalho
3143134968 test: avoid split/major compaction deadlock in tablet split test
Run keyspace compaction asynchronously in
`test_tombstone_gc_correctness_during_tablet_split` and only await it
after `split_sstable_rewrite` is disabled.

The problem is that `keyspace_compaction()` starts with a flush, and that
flush can take around five seconds. During that window the split
compaction is stopped before major compaction is retried. The stop aborts
the in-flight major compaction attempt, then the split proceeds far enough
to enter the `split_sstable_rewrite` injection point.

At that point the test used to wait synchronously for major compaction to
finish, but major compaction cannot finish yet: when it retries, it needs
the same semaphore that is still effectively tied up behind the blocked
split rewrite. So the test waits for major compaction, while the split
waits for the injection to be released, and the code that would release
that injection never runs.

Starting major compaction as a task breaks that cycle. The test can first
disable `split_sstable_rewrite`, let the split get out of the way, and
only then wait for major compaction to complete.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-827.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#29066
2026-03-19 11:12:21 +02:00
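The reordered flow above - start major compaction as a background task, release the injection, then await - can be sketched with asyncio. The manager API here is a hypothetical stand-in for the test harness:

```python
import asyncio

async def run_split_test(manager):
    """Sketch of the deadlock-free ordering: kick off the major
    compaction as a task instead of awaiting it while the split is
    still parked on the injection point."""
    compaction = asyncio.create_task(manager.keyspace_compaction())
    # Unblock the split first, so the semaphore it holds is released...
    await manager.disable_injection("split_sstable_rewrite")
    # ...and only then wait for major compaction to complete.
    await compaction
```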
Botond Dénes
2e47fd9f56 Merge 'tasks: do not fail the wait request if rpc fails' from Aleksandra Martyniuk
During decommission, we first mark a topology request as done, then shut
down a node and in the following steps we remove node from the topology.
Thus,  finished request does not imply that a node is removed from
the topology.

Due to that, in node_ops_virtual_task::wait, while gathering children
from the whole cluster, we may hit the connection exception - because
a node is still in topology, even though it is down.

Modify the get_children method to ignore the exception and warn
about the failure instead.

Keep token_metadata_ptr in get_children to prevent topology from changing.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-867

Needs backports to all versions

Closes scylladb/scylladb#29035

* github.com:scylladb/scylladb:
  tasks: fix indentation
  tasks: do not fail the wait request if rpc fails
  tasks: pass token_metadata_ptr to task_manager::virtual_task::impl::get_children
2026-03-19 10:03:18 +02:00
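The warn-and-skip behavior can be sketched as follows; the function names are illustrative, not the actual task-manager API:

```python
import logging

def gather_children(nodes, fetch):
    """Collect children from every node, but log and skip nodes whose
    RPC fails (e.g. a decommissioned node still listed in the topology)
    instead of failing the whole wait request."""
    children = []
    for node in nodes:
        try:
            children.extend(fetch(node))
        except ConnectionError as e:
            logging.warning("failed to get children from %s: %s", node, e)
    return children
```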
Piotr Smaron
a2ad57062f docs/cql: clarify WHERE clause boolean limitations
Document that `SELECT ... WHERE` clause currently accepts only conjunctions
of relations joined by `AND` (`OR` is not supported), and that
parentheses cannot be used to group boolean subexpressions.
Add an unsupported query example and point readers to equivalent `IN`
rewrites when applicable.
This problem has been raised by one of our users in
https://forum.scylladb.com/t/error-parsing-query-or-unsupported-statement/5299,
and while one could infer answer to user's question by looking at the
syntax of the `SELECT ... WHERE`, it's not immediately obvious to
non-advanced users, so clarifying these concepts is justified.

Fixes: SCYLLADB-1116

Closes scylladb/scylladb#29100
2026-03-19 09:47:22 +02:00
Michael Litvak
31d339e54a logstor: trigger separator flush for buffers that hold old segments
A compaction group has a separator buffer that holds the mixed segments
alive until the separator buffer is flushed. A mixed segment can be
freed only after all separator buffers that hold writes from the segment
are flushed.

Typically a separator buffer is flushed when it becomes full. However,
it's possible, for example, that one compaction group fills more slowly
than others and holds many segments.

To fix this we trigger a separator flush periodically for separator
buffers that hold old segments. We track the active segment sequence
number and for each separator buffer the oldest sequence number it
holds.
2026-03-18 19:24:28 +01:00
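The age check - comparing the active segment sequence number against the oldest sequence number each separator buffer holds - can be sketched as follows; the names and threshold are illustrative:

```python
def buffers_to_flush(active_seq, buffers, max_age):
    """Given the active segment sequence number and, per separator
    buffer, the oldest sequence number it holds (None when empty),
    return the buffers that should be flushed early so old mixed
    segments can be freed."""
    return [name for name, oldest_seq in buffers.items()
            if oldest_seq is not None and active_seq - oldest_seq >= max_age]
```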
Michael Litvak
ad87eda835 docs/dev: add logstor documentation 2026-03-18 19:24:28 +01:00
Michael Litvak
a0da07e5b7 logstor: recover segments into compaction groups
Fix the logstor recovery to work with compaction groups. When recovering
a segment, find its token range and add it to the appropriate compaction
group. If it doesn't fit in a single compaction group, write each
record to its compaction group's separator buffer.
2026-03-18 19:24:28 +01:00
Michael Litvak
24379acc76 logstor: range read
extend the logstor mutation reader to support range reads
2026-03-18 19:24:28 +01:00
Michael Litvak
a9d0211a64 logstor: change index to btree by token per table
Change the primary index to be a btree that is ordered by token,
similarly to a memtable, and create an index per table instead of a
single global index.
2026-03-18 19:24:28 +01:00
Michael Litvak
e7c3942d43 logstor: move segments to replica::compaction_group
Add a segment_set member to replica::compaction_group that manages the
logstor segments that belong to the compaction group, similarly to how
it manages sstables. Add also a separator buffer in each compaction
group.

When writing a mutation to a compaction group, the mutation is written
to the active segment and to the separator buffer of the compaction
group, and when the separator buffer is flushed the segment is added to
the compaction_group's segment set.
2026-03-18 19:24:28 +01:00
Michael Litvak
d69f7eb0ee db: update dirty mem limits dynamically
when logstor is enabled, update the db dirty memory limits dynamically.

previously the threshold was set to 0.5 of the available memory, so 0.5
goes to memtables and 0.5 to others (cache).

when logstor is enabled, we calculate the available memory excluding
logstor, and divide it evenly between memtables and cache.
2026-03-18 19:24:27 +01:00
Michael Litvak
65cd0b5639 logstor: track memory usage
add logstor::get_memory_usage() that returns an estimate of the memory
usage by logstor. add tracking to how many unique keys are held in the
index.
2026-03-18 19:24:27 +01:00
Michael Litvak
b7bdb1010a logstor: logstor stats api
add api to get logstor statistics about segments for a table
2026-03-18 19:24:27 +01:00
Michael Litvak
8bd3bd7e2a logstor: compaction buffer pool
pre-allocate write buffers for compaction
2026-03-18 19:24:27 +01:00
Michael Litvak
caf5aa47c2 logstor: separator: flush buffer when full
flush separator buffers when they become full and are switched, instead
of aggregating all the buffers and flushing them only when the separator
is switched.
2026-03-18 19:24:27 +01:00
Michael Litvak
6ddb7a4d13 logstor: hold segment until index updates
add a write gate to write_buffer. when writing a record to the write
buffer, the gate is held and passed back to the caller, and the caller
holds the gate until the write operation is complete, including
follow-up operations such as updating the index after the write.

in particular, when writing a mutation in logstor::write, the write
buffer is held open until the write is completed and updated in the
index.

when writing the write buffer to the active segment, we write the buffer
and then wait for the write buffer gate to close, i.e. we wait for all
index updates to complete before proceeding. the segment is held open
until all the write operations and index updates are complete.

this property is useful for correctness: when a segment is closed we
know that all the writes to it are updated in the index. this is needed
in compaction for example, where we take closed segments and check
which records in them are alive by looking them up in the index. if the
index is not updated yet then it will be wrong.
2026-03-18 19:24:27 +01:00
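The hold-until-index-updated pattern can be illustrated with a minimal counting gate in Python - a sketch in the spirit of seastar::gate, not the actual implementation:

```python
class WriteGate:
    """Minimal counting gate: a write holds the gate until its
    follow-up index update completes; the segment is only safe to
    close once the gate has drained (pending == 0)."""
    def __init__(self):
        self.pending = 0

    def enter(self):
        self.pending += 1

    def leave(self):
        self.pending -= 1

    def drained(self):
        return self.pending == 0

def write_record(gate, index, key, location):
    """Hold the gate across both the write and its index update, so a
    closed segment is guaranteed to be fully reflected in the index."""
    gate.enter()
    try:
        index[key] = location  # follow-up index update under the gate
    finally:
        gate.leave()
```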
Michael Litvak
bd66edee5c logstor: truncate table
implement freeing all segments of a table for table truncate.

first do a barrier to flush all active and mixed segments and put all the
table's data in compaction groups, then stop compaction for the table,
then free the table's segments and remove the live entries from the
index.
2026-03-18 19:24:27 +01:00
Michael Litvak
489efca47c logstor: enable/disable compaction per table
add functions to enable or disable compaction for a specific compaction
group or for all compaction groups of a table.
2026-03-18 19:24:27 +01:00
Michael Litvak
21db4f3ed8 logstor: separator buffer pool
pre-allocate write buffers for the separator
2026-03-18 19:24:27 +01:00
Michael Litvak
37c485e3d1 test: logstor: add separator and compaction tests 2026-03-18 19:24:27 +01:00
Michael Litvak
31aefdc07d logstor: segment and separator barrier
add barrier operation that forces switch of the active segment and
separator, and waits for all existing segments to close and all
separators to flush.
2026-03-18 19:24:27 +01:00
Michael Litvak
1231fafb46 logstor: separator debt controller
add tracking of the total separator debt - writes that were written to a
separator and are waiting to be flushed - and add flow control to keep
the debt under control by delaying normal writes.
2026-03-18 19:24:27 +01:00
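A debt-based back-pressure curve of the kind described can be sketched like this; the thresholds and linear shape are assumptions, not logstor's actual policy:

```python
def write_delay_us(debt_bytes, soft_limit, hard_limit, max_delay_us=1000):
    """No delay below the soft limit, then a delay growing linearly up
    to max_delay_us at the hard limit, throttling normal writes until
    separator flushes pay the debt down."""
    if debt_bytes <= soft_limit:
        return 0
    frac = min(1.0, (debt_bytes - soft_limit) / (hard_limit - soft_limit))
    return int(frac * max_delay_us)
```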
Michael Litvak
17cb173e18 logstor: compaction controller
adjust compaction shares by the compaction overhead: how many segments
compaction writes to generate a single free segment for new writes.
2026-03-18 19:24:27 +01:00
Michael Litvak
1da1bb9d99 logstor: recovery: recover mixed segments using separator
on recovery we may find mixed segments. recover them by adding them to a
separator, reading all their records and writing them to the separator,
and flush the separator.
2026-03-18 19:24:27 +01:00
Michael Litvak
b78cc787a6 logstor: wait for pending reads in compaction
we free a segment from compaction after updating all live records in the
segment to point to new locations in the index. we need to ensure there
are no running operations that use the old locations before we free the
segment.
2026-03-18 19:24:27 +01:00
Michael Litvak
600ec82bec logstor: separator
initial implementation of the separator. it replaces "mixed" segments -
segments that have records from different groups - with segments by group.

every write is written to the active segment and to a buffer in the
active separator. the active separator has in-memory buffers by group.
at some threshold number of segments we switch the active segment and
separator atomically, and start flushing the separator.

the separator is flushed by writing the buffers into new non-mixed
segments, adding them to a compaction group, and freeing the mixed
segments.
2026-03-18 19:24:27 +01:00
Michael Litvak
009fc3757a logstor: compaction groups
divide the segments in the compaction manager into compaction groups.
compaction will compact only segments from a single compaction group at
a time.
2026-03-18 19:24:27 +01:00
Michael Litvak
b3293f8579 logstor: cache files for read
keep the files of all segments open for read to improve read performance.
2026-03-18 19:24:26 +01:00
Michael Litvak
5a16980845 logstor: recovery: initial
initial and basic recovery implementation.
* find all files, read their segments and populate the index with the
  newest record for each key.
* find which segments are used and build the usage histogram
2026-03-18 19:24:26 +01:00
Michael Litvak
bc9fc96579 logstor: add segment generation
add segment generation number that is incremented when the segment is
reused, and it's written to every buffer that is written to the segment.
this is useful for recovery.
2026-03-18 19:24:26 +01:00
Michael Litvak
719f7cca57 logstor: reserve segments for compaction
reserve segments for compaction so it always has enough segments to run
and doesn't get stuck.

do the compaction writes into full new segments instead of the active
segment.
2026-03-18 19:24:26 +01:00
Michael Litvak
521fca5c92 logstor: index: buckets
divide the primary index to buckets, each bucket containing a btree. the
bucket is determined by using bits from the key hash.
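
the bucket selection can be sketched like this (hypothetical names, not
the actual index code; assumes bucket_bits is between 1 and 63):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: derive the bucket from the top bits of the key
// hash, spreading keys evenly across 2^bucket_bits btrees.
inline std::size_t bucket_of(uint64_t key_hash, unsigned bucket_bits) {
    // bucket_bits must be in [1, 63]: shifting by 64 is undefined.
    return static_cast<std::size_t>(key_hash >> (64 - bucket_bits));
}
```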
2026-03-18 19:24:26 +01:00
Michael Litvak
99c3b1998a logstor: add buffer header
add a buffer header in each write buffer we write that contains some
information that can be useful for recovery and reading.
2026-03-18 19:24:26 +01:00
Michael Litvak
ddd72a16b0 logstor: add group_id
add group_id value to each log record that is passed with the mutation
when writing it.

the group_id will be used to group log records in segments, such that a
segment will contain records only from a single group.

this will be useful for tablet migration. we want each tablet to have
its own segments with all its records, so we can migrate them
efficiently by copying these segments.

the group_id value is set to a value equivalent to the tablet id.
2026-03-18 19:24:26 +01:00
Michael Litvak
08bea860ef logstor: record generation
add a record generation number for each record so we can compare
records and find which one is newer.
2026-03-18 19:24:26 +01:00
Michael Litvak
28f820eb1c logstor: generation utility
basic utility for generation numbers that will be useful next. a
generation number is an unsigned integer that can be incremented and
compared even if it wraps around, assuming the values we compare were
written around the same time.
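
the comparison trick can be sketched as follows (hypothetical name, not
the actual utility): interpret the unsigned difference as signed, which
is valid as long as the two values were produced within half the counter
range of each other - matching the "written around the same time"
assumption.

```cpp
#include <cstdint>

// Hypothetical sketch of wraparound-safe generation comparison:
// the unsigned subtraction wraps modulo 2^32, and casting the result
// to signed tells us which value is newer.
inline bool generation_newer(uint32_t a, uint32_t b) {
    return static_cast<int32_t>(a - b) > 0;
}
```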
2026-03-18 19:24:26 +01:00
Michael Litvak
5f649dd39f logstor: use RIPEMD-160 for index key
use a 20-byte hash function for the index key to make hash collisions
very unlikely. we assume there are no hash collisions.
2026-03-18 19:24:26 +01:00
Michael Litvak
a521bcbcee test: add test_logstor.py
add basic tests for key-value tables with logstor storage
2026-03-18 19:24:26 +01:00
Michael Litvak
1ae1f37ec1 api: add logstor compaction trigger endpoint
add a new api endpoint that triggers logstor compaction.
2026-03-18 19:24:26 +01:00
Michael Litvak
2128b1b15c replica: add logstor to db
Add a single logstor instance in the database that is used for writing
and reading to tables with kv storage
2026-03-18 19:24:26 +01:00
Michael Litvak
9172cc172e schema: add logstor cf property
add a schema property for tables with logstor storage
2026-03-18 19:24:26 +01:00
Michael Litvak
0b1343747f logstor: initial commit
initial implementation of the logstor storage engine for key-value
tables that supports writes, reads and basic compaction.

main components:
* logstor: this is the main interface to users that supports writing and
  reading back mutations, and manages the internal components.
* index: the primary index in-memory that maps a key to a location on
  disk.
* write buffer: writes go initially to a write buffer. it accumulates
  multiple records in a buffer and writes them to the segment manager in
  4k sized blocks.
* segment manager: manages the storage - files, segments, compaction. it
  manages file and segment allocation, and writes 4k aligned buffers to
  the active segment sequentially. it tracks the used space in each
  segment. the compaction finds segments with low space usage and writes
  them to new segments, and frees the old segments.
2026-03-18 19:24:26 +01:00
Michael Litvak
27fd0c119f db: disable tablet balancing with logstor
initially logstor tables will not support tablet migrations, so
disable tablet balancing if the experimental feature flag is set.
2026-03-18 19:24:26 +01:00
Michael Litvak
ed852a2af2 db: add logstor experimental feature flag
add a new experimental feature flag for key-value tables with the new
logstor storage engine.
2026-03-18 19:24:26 +01:00
Anna Stuchlik
88b98fac3a doc: update the warning about shared dictionary training
This commit updates the inadequate warning on the Advanced Internode (RPC) Compression page.

The warning is replaced with a note about how training data is encrypted.

Fixes https://github.com/scylladb/scylladb/issues/29109

Closes scylladb/scylladb#29111
2026-03-18 19:35:18 +02:00
Avi Kivity
46a6f8e1d3 Merge 'auth: add maintenance_socket_authorizer' from Dario Mirovic
GRANT/REVOKE fails on the maintenance socket connections, because maintenance_auth_service uses allow_all_authorizer. allow_all_authorizer allows all operations, but not GRANT/REVOKE, because they make no sense in its context.

This has been observed during PGO run failure in operations from ./pgo/conf/auth.cql file.

This patch introduces maintenance_socket_authorizer that supports the capabilities of default_authorizer ('CassandraAuthorizer') without needing authorization.

Refs SCYLLADB-1070

This is an improvement, no need for backport.

Closes scylladb/scylladb#29080

* github.com:scylladb/scylladb:
  test: use NetworkTopologyStrategy in maintenance socket tests
  test: use cleanup fixture in maintenance socket auth tests
  auth: add maintenance_socket_authorizer
2026-03-18 19:29:57 +02:00
Pavel Emelyanov
d6c01be09b s3/client: Don't reconstruct regex on every parse_content_range call
Make the pattern static const so it is compiled once at first call rather
than on every Content-Range header parse.
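
The change can be sketched roughly like this (hypothetical function and
pattern, not the ScyllaDB source): a function-local `static const`
regex is compiled once on first call instead of on every parse.

```cpp
#include <regex>
#include <string>

// Hypothetical sketch: parse the total size out of a Content-Range
// header. The static const pattern is constructed exactly once.
inline long content_range_total(const std::string& header) {
    static const std::regex pattern(R"(bytes (\d+)-(\d+)/(\d+))");
    std::smatch m;
    return std::regex_match(header, m, pattern) ? std::stol(m[3].str()) : -1;
}
```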

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29054
2026-03-18 17:56:33 +02:00
Tomasz Grabiec
4410e9c61a sstables: mx: index_reader: Optimize parsing for no promoted index case
It's a common case with small partition workloads.
2026-03-18 16:25:21 +01:00
Tomasz Grabiec
32f8609b89 vint: Use std::countl_zero()
It handles 0, and could generate better code for that. On Broadwell
architecture, it translates to a single instruction (LZCNT). We're
still on Westmere, so it translates to BSR with a conditional move.

Also, drop unnecessary casts and bit arithmetic, which saves a few
instructions.

Move to header so that it's inlined in parsers.
2026-03-18 16:25:21 +01:00
Tomasz Grabiec
6017688445 test: sstable_partition_index_cache_test: Validate scenario of pages with sparse promoted index placement 2026-03-18 16:25:21 +01:00
Tomasz Grabiec
f55bb154ec sstables: mx: index_reader: Amortize partition key storage
This change reduces the cost of partition index page construction and
LSA migration. This is achieved by several things working together:

 - index entries don't store keys as separate small objects (managed_bytes)
   They are written into one managed_bytes fragmented storage, entries
   hold offset into it.

   Before, we paid 16 bytes for managed_bytes plus LSA descriptor for
   the storage (1 byte) plus back-reference in the storage (8 bytes),
   so 25 bytes. Now we only pay 4 bytes for the size offset. If keys are 16
   bytes, that's a reduction from 31 bytes to 20 bytes per key.

 - index entries and key storage are now trivially moveable, so LSA
   migration can use memcpy(), which amortizes the cost per key.

   LSA eviction is now trivial and constant time for the whole page
   regardless of the number of entries. Page eviction dropped from
   14 us to 1 us.

This improves throughput in a CPU-bound miss-heavy read workload where
the partition index doesn't fit in memory.

  scylla perf-simple-query -c1 -m200M --partitions=1000000

Before:

    15328.25 tps (150.0 allocs/op,  14.1 logallocs/op,  45.4 tasks/op,  286769 insns/op,  218134 cycles/op,        0 errors)
    15279.01 tps (149.9 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  287696 insns/op,  218637 cycles/op,        0 errors)
    15347.78 tps (149.7 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  285851 insns/op,  217795 cycles/op,        0 errors)
    15403.68 tps (149.6 allocs/op,  14.1 logallocs/op,  45.2 tasks/op,  285111 insns/op,  216984 cycles/op,        0 errors)
    15189.47 tps (150.0 allocs/op,  14.1 logallocs/op,  45.5 tasks/op,  289509 insns/op,  219602 cycles/op,        0 errors)
    15295.04 tps (149.8 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  288021 insns/op,  218545 cycles/op,        0 errors)
    15162.01 tps (149.8 allocs/op,  14.1 logallocs/op,  45.4 tasks/op,  291265 insns/op,  220451 cycles/op,        0 errors)

After:

    21620.18 tps (148.4 allocs/op,  13.4 logallocs/op,  43.7 tasks/op,  176817 insns/op,  153183 cycles/op,        0 errors)
    20644.03 tps (149.8 allocs/op,  13.5 logallocs/op,  44.3 tasks/op,  187941 insns/op,  160409 cycles/op,        0 errors)
    20588.06 tps (150.1 allocs/op,  13.5 logallocs/op,  44.5 tasks/op,  188090 insns/op,  160818 cycles/op,        0 errors)
    20789.29 tps (149.5 allocs/op,  13.5 logallocs/op,  44.2 tasks/op,  186495 insns/op,  159382 cycles/op,        0 errors)
    20977.89 tps (149.5 allocs/op,  13.4 logallocs/op,  44.2 tasks/op,  183969 insns/op,  158140 cycles/op,        0 errors)
    21125.34 tps (149.1 allocs/op,  13.4 logallocs/op,  44.1 tasks/op,  183204 insns/op,  156925 cycles/op,        0 errors)
    21244.42 tps (148.6 allocs/op,  13.4 logallocs/op,  43.8 tasks/op,  181276 insns/op,  155973 cycles/op,        0 errors)

Mostly because the index now fits in memory.

When it doesn't, the benefits are still visible due to lower LSA overhead.
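
The layout change can be sketched as follows (hypothetical names, not
the actual index_reader code): entries hold a 4-byte offset into one
shared byte buffer instead of owning a separate allocation per key, so
the entry type is trivially copyable and a whole page can be relocated
with a single memcpy().

```cpp
#include <cstdint>
#include <type_traits>
#include <vector>

// Hypothetical sketch: all keys live in one shared storage buffer and
// each entry records where its key starts and how long it is.
struct index_entry {
    uint32_t key_offset;  // start of the key in the shared storage
    uint32_t key_size;
};
static_assert(std::is_trivially_copyable_v<index_entry>);

inline index_entry append_key(std::vector<uint8_t>& storage,
                              const uint8_t* key, uint32_t size) {
    index_entry e{static_cast<uint32_t>(storage.size()), size};
    storage.insert(storage.end(), key, key + size);
    return e;
}
```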
2026-03-18 16:25:21 +01:00
Tomasz Grabiec
1452e92567 managed_bytes: Hoist write_fragmented() to common header 2026-03-18 16:25:20 +01:00
Tomasz Grabiec
75e6412b1c utils: managed_vector: Use std::uninitialized_move() to move objects
It's shorter, and is supposed to be optimized for trivially-moveable
types.

Important for managed_vector<index_entry>, which can have lots of
elements.
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
50dc7c6dd8 sstables: mx: index_reader: Keep promoted_index info next to index_entry
Densely populated pages have no promoted index (small partitions), so
we can save space in such workloads by keeping promoted index in a
separate vector.

For workloads which do have a promoted index, pages have only one
partition. There aren't many such pages and they are long-lived, so
the extra allocation of the vector is amortized.

promoted_index class is removed, and replaced with equivalent
parsed_promoted_index_entry for simplicity. Because it's removed,
make_cursor() is moved into the index_reader class.

Reducing the size of index_entry is important for performance if pages
are densely populated. It helps to reduce LSA allocator pressure and
improve compaction/eviction speed.

This change, combined with the earlier change "Shave-off 16 bytes from
index_entry by using raw_token", gives significant improvement in
throughput in perf_simple_query run where the index doesn't fit in
memory:

  scylla perf-simple-query -c1 -m200M --partitions=1000000

Before:

9714.78 tps (170.9 allocs/op,  16.9 logallocs/op,  55.3 tasks/op,  494788 insns/op,  343920 cycles/op,        0 errors)
9603.13 tps (171.6 allocs/op,  17.0 logallocs/op,  55.6 tasks/op,  502358 insns/op,  348344 cycles/op,        0 errors)
9621.43 tps (171.9 allocs/op,  17.0 logallocs/op,  55.8 tasks/op,  500612 insns/op,  347508 cycles/op,        0 errors)
9597.75 tps (171.6 allocs/op,  17.0 logallocs/op,  55.6 tasks/op,  501428 insns/op,  348604 cycles/op,        0 errors)
9615.54 tps (171.6 allocs/op,  16.9 logallocs/op,  55.6 tasks/op,  501313 insns/op,  347935 cycles/op,        0 errors)
9577.03 tps (171.8 allocs/op,  17.0 logallocs/op,  55.7 tasks/op,  503283 insns/op,  349251 cycles/op,        0 errors)

After:

15328.25 tps (150.0 allocs/op,  14.1 logallocs/op,  45.4 tasks/op,  286769 insns/op,  218134 cycles/op,        0 errors)
15279.01 tps (149.9 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  287696 insns/op,  218637 cycles/op,        0 errors)
15347.78 tps (149.7 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  285851 insns/op,  217795 cycles/op,        0 errors)
15403.68 tps (149.6 allocs/op,  14.1 logallocs/op,  45.2 tasks/op,  285111 insns/op,  216984 cycles/op,        0 errors)
15189.47 tps (150.0 allocs/op,  14.1 logallocs/op,  45.5 tasks/op,  289509 insns/op,  219602 cycles/op,        0 errors)
15295.04 tps (149.8 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  288021 insns/op,  218545 cycles/op,        0 errors)
15162.01 tps (149.8 allocs/op,  14.1 logallocs/op,  45.4 tasks/op,  291265 insns/op,  220451 cycles/op,        0 errors)
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
5e228a8387 sstables: mx: index_reader: Extract partition_index_page::clear_gently()
There will be more elements to clear. And partition_index_page should
know how to clear itself.
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
2d77e4fc28 sstables: mx: index_reader: Shave-off 16 bytes from index_entry by using raw_token
The std::optional<> adds 8 bytes.

And dht::token adds 8 bytes due to _kind, which in this case is always
kind::key.

The size changed from 56 to 48 bytes.
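
A small illustration of the std::optional<> overhead (exact sizes are
implementation dependent, though typical for mainstream ABIs): the
engaged flag plus padding doubles an 8-byte payload.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical sketch: a bare 8-byte token vs. one wrapped in
// std::optional, whose engaged flag pads the struct to 16 bytes on
// common implementations.
struct raw_token { uint64_t value; };
struct opt_token { std::optional<uint64_t> value; };
```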
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
e9c98274b5 sstables: mx: index_reader: Reduce allocation_section overhead during index page parsing by batching allocation
If the page has many entries, we continuously enter and leave the
allocating section for every key. This can be avoided by batching LSA
operations for the whole page, after collecting all the entries.

Later optimizations will also build on this, where we will allocate
fragmented storage for keys in LSA using a single managed_bytes
constructor.

This alone brings only a minor improvement, but it does reduce LSA
allocations, probably due to less frequent memory reclamation:

  scylla perf-simple-query -c1 -m200M --duration 6000 --partitions=1000000

Before:

  9560.42 tps (172.2 allocs/op,  19.6 logallocs/op,  57.7 tasks/op,  567741 insns/op,  345158 cycles/op,        0 errors)
  9445.95 tps (173.1 allocs/op,  19.7 logallocs/op,  58.1 tasks/op,  579075 insns/op,  352173 cycles/op,        0 errors)
  9576.75 tps (172.2 allocs/op,  19.6 logallocs/op,  57.6 tasks/op,  572004 insns/op,  347373 cycles/op,        0 errors)
  9597.16 tps (172.2 allocs/op,  19.6 logallocs/op,  57.6 tasks/op,  569615 insns/op,  346618 cycles/op,        0 errors)
  9454.07 tps (173.5 allocs/op,  19.8 logallocs/op,  58.3 tasks/op,  579213 insns/op,  351569 cycles/op,        0 errors)

After:

  9562.21 tps (172.0 allocs/op,  17.0 logallocs/op,  55.8 tasks/op,  499225 insns/op,  347832 cycles/op,        0 errors)
  9480.20 tps (172.3 allocs/op,  17.0 logallocs/op,  55.9 tasks/op,  507271 insns/op,  350640 cycles/op,        0 errors)
  9512.42 tps (172.1 allocs/op,  17.0 logallocs/op,  55.9 tasks/op,  504247 insns/op,  350392 cycles/op,        0 errors)
  9498.45 tps (172.4 allocs/op,  17.1 logallocs/op,  55.9 tasks/op,  505765 insns/op,  350320 cycles/op,        0 errors)
  9076.30 tps (173.5 allocs/op,  17.1 logallocs/op,  56.5 tasks/op,  512791 insns/op,  354792 cycles/op,        0 errors)
  9542.62 tps (171.9 allocs/op,  17.0 logallocs/op,  55.8 tasks/op,  502532 insns/op,  348922 cycles/op,        0 errors)
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
0e0f9f41b3 sstables: mx: index_reader: Keep index_entry directly in the vector
Partition index entries are relatively small, and if the workload has
small partitions, index pages have a lot of elements. Currently, index
entries are indirected via managed_ref, which causes increased cost of
LSA eviction and compaction. This patch amortizes this cost by storing
them directly in the managed_chunked_vector.

This gives about 23% improvement in throughput in perf-simple-query
for a workload where the index doesn't fit in memory:

  scylla perf-simple-query -c1 -m200M --duration 6000 --partitions=1000000

Before:

  7774.96 tps (166.0 allocs/op, 521.7 logallocs/op,  54.0 tasks/op,  802428 insns/op,  430457 cycles/op,        0 errors)
  7511.08 tps (166.1 allocs/op, 527.2 logallocs/op,  54.0 tasks/op,  804185 insns/op,  430752 cycles/op,        0 errors)
  7740.44 tps (166.3 allocs/op, 526.2 logallocs/op,  54.2 tasks/op,  805347 insns/op,  432117 cycles/op,        0 errors)
  7818.72 tps (165.2 allocs/op, 517.6 logallocs/op,  53.7 tasks/op,  794965 insns/op,  427751 cycles/op,        0 errors)
  7865.49 tps (165.1 allocs/op, 513.3 logallocs/op,  53.6 tasks/op,  788898 insns/op,  425171 cycles/op,        0 errors)

After:

  9560.42 tps (172.2 allocs/op,  19.6 logallocs/op,  57.7 tasks/op,  567741 insns/op,  345158 cycles/op,        0 errors)
  9445.95 tps (173.1 allocs/op,  19.7 logallocs/op,  58.1 tasks/op,  579075 insns/op,  352173 cycles/op,        0 errors)
  9576.75 tps (172.2 allocs/op,  19.6 logallocs/op,  57.6 tasks/op,  572004 insns/op,  347373 cycles/op,        0 errors)
  9597.16 tps (172.2 allocs/op,  19.6 logallocs/op,  57.6 tasks/op,  569615 insns/op,  346618 cycles/op,        0 errors)
  9454.07 tps (173.5 allocs/op,  19.8 logallocs/op,  58.3 tasks/op,  579213 insns/op,  351569 cycles/op,        0 errors)

Disabling the partition index doesn't improve the throughput beyond
that.
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
b6bfdeb111 dht: Introduce raw_token
Most tokens stored in data structures are for key-scoped tokens, and
we don't need to pay for token::kind storage.
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
3775593e53 test: perf_simple_query: Add 'sstable-format' command-line option 2026-03-18 16:25:20 +01:00
Tomasz Grabiec
6ee9bc63eb test: perf_simple_query: Add 'sstable-summary-ratio' command-line option 2026-03-18 16:25:20 +01:00
Tomasz Grabiec
38d130d9d0 test: perf-simple-query: Add option to disable index cache 2026-03-18 16:25:20 +01:00
Tomasz Grabiec
5ee61f067d test: cql_test_env: Respect enable-index-cache config
Mirrors the code in main.cc
2026-03-18 16:25:20 +01:00
Aleksandra Martyniuk
2d16083ba6 tasks: fix indentation 2026-03-18 15:37:24 +01:00
Aleksandra Martyniuk
1fbf3a4ba1 tasks: do not fail the wait request if rpc fails
During decommission, we first mark a topology request as done, then shut
down a node, and in the following steps we remove the node from the topology.
Thus, a finished request does not imply that the node has been removed from
the topology.

Due to that, in node_ops_virtual_task::wait, while gathering children
from the whole cluster, we may hit the connection exception - because
a node is still in topology, even though it is down.

Modify the get_children method to ignore the exception and warn
about the failure instead.
2026-03-18 15:37:24 +01:00
Aleksandra Martyniuk
d4fdeb4839 tasks: pass token_metadata_ptr to task_manager::virtual_task::impl::get_children
In get_children we get the vector of alive nodes with get_nodes.
Yet, between this and sending rpc to those nodes there might be
a preemption. Currently, the liveness of a node is checked once
again before the rpcs (only with gossiper not in topology - unlike
get_nodes).

Modify get_children, so that it keeps a token_metadata_ptr,
preventing topology from changing between get_nodes and rpcs.

Remove test_get_children as it checked if the get_children method
won't fail if a node is down after get_nodes - which cannot happen
currently.
2026-03-18 15:37:24 +01:00
Calle Wilund
0013f22374 memtable_test::memtable_flush_period: Change sleep to use injection signal instead
Fixes: SCYLLADB-942

Adds an injection signal _from_ table::seal_active_memtable to allow us to
reliably wait for flushing. And does so.

Closes scylladb/scylladb#29070
2026-03-18 16:23:13 +02:00
Botond Dénes
ae17596c2a Merge 'Demote log level on split failure during shutdown' from Raphael Raph Carvalho
Since commit 509f2af8db, gate_closed_exception can be triggered for ongoing split during shutdown. The commit is correct, but it causes split failure on shutdown to log an error, which causes CI instability. Previously, aborted_exception would be triggered instead which is logged as warning. Let's do the same.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-951.
Fixes https://github.com/scylladb/scylladb/issues/24850.

Only 2026.1 is affected.

Closes scylladb/scylladb#29032

* github.com:scylladb/scylladb:
  replica: Demote log level on split failure during shutdown
  service: Demote log level on split failure during shutdown
2026-03-18 16:21:05 +02:00
Pavel Emelyanov
8b1ca6dcd6 database: Rate limit all tokens from a range
The limiter scans ranges to decide whether or not to rate-limit the
query. However, when considering each range only the front one's token
is accounted for. This looks like a misprint.

The limiter was introduced in cc9a2ad41f

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29050
2026-03-18 13:50:48 +01:00
Pavel Emelyanov
d68c92ec04 test: Replace a bunch of ternary operators with an if-else block
A followup of the merge of two test cases that happened in the previous
patch. Both used `foo = N if domain == bar else M` to evaluate the
parameters for topology. Using if-else block makes it immediately obvious
which topology and scope apply for each domain value without having to
evaluate multiple inline conditionals.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-18 13:08:36 +03:00
Pavel Emelyanov
b1d4fc5e6e test: Squash test_restore_primary_replica_same|different_domain tests
The two tests differ only in the way they set up the topology for the
cluster and the post-restore checks against the resulting streams.

The merge happens with the help of a "scope_is_same" boolean parameter
and corresponding updates in the topology setup and post-checks.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-18 13:08:36 +03:00
Pavel Emelyanov
21c603a79e test: Use the same regexp in test_restore_primary_replica_different|same_domain-s
The one in "different domain" test is simpler because the test performs
less checks. Next patch will merge both tests and making regexp-s look
identical makes the merge even smother.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-18 13:07:09 +03:00
Emil Maskovsky
34f3916e7d .github: update test instructions for unified pytest runner
Update test running instructions to reflect unified pytest-based runner.

The test.py now requires full test paths with file extensions for both
C++ and Python tests.

No backport: The change is only relevant for recent test.py changes in
master.

Closes scylladb/scylladb#29062
2026-03-18 09:28:28 +01:00
Marcin Maliszkiewicz
04bf631d7f auth: switch query_attribute_for_all to use cache 2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz
cf578fd81a auth: switch get_attribute to use cache 2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz
06d16b6ea2 auth: cache: add heterogeneous map lookups
Some callers have only string_view role name,
they shouldn't need to allocate sstring to do the
lookup.
2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz
7fdb1118f5 auth: switch query_all to use cache 2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz
fca11c5a21 auth: switch query_all_directly_granted to use cache 2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz
6f682f7eb1 auth: cache: add ability to go over all roles
This is needed to implement auth service api where
we list all roles.
2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz
61952cd985 raft: service: reload auth cache before service levels
Since service levels depend on auth data, and not other
way around, we need to ensure a proper loading order.
2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz
c4cfb278bc service: raft: move update_service_levels_effective_cache check
The auth::cache::includes_table function also covers role_members and
role_attributes. The existing check was removed because it blocked these
tables from triggering necessary cache updates.

While previously non-critical (due to unused attributes and table coupling),
maintaining a correct cache is essential for upcoming changes.
2026-03-18 09:06:20 +01:00
Benny Halevy
c2a6d1e930 test/boost/database_test: add test_snapshot_ctl_details_exception_handling
Verify that the directory listers opened by get_snapshot_details
are properly closed when handling an (injected) exception.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-18 09:37:44 +02:00
Benny Halevy
6dc4ea766b table: get_snapshot_details: fix indentation inside try block
Whitespace-only change: indent the loop body one level inside the
try block added in the previous commit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-18 09:28:50 +02:00
Benny Halevy
b09d45b89a table: per-snapshot get_snapshot_details: fix typo in comment
The comment says the snapshot directory may contain a `schema.sql` file,
but the code treats `schema.cql` as the special-case schema file.

Reported-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-18 09:27:40 +02:00
Benny Halevy
580cc309d2 table: per-snapshot get_snapshot_details: always close lister using try/catch
Since this is a coroutine, we cannot just use deferred_close,
but rather we need to catch an error, close the lister, and then
return the error, if applicable.

Fixes: SCYLLADB-1013

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-18 09:27:23 +02:00
Benny Halevy
78c817f71e table: get_snapshot_details: always close lister using deferred_close
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-18 09:26:26 +02:00
Dario Mirovic
71e6918f28 test: use NetworkTopologyStrategy in maintenance socket tests
NetworkTopologyStrategy is the preferred choice. We should not
use SimpleStrategy anymore. This patch changes the topology strategy
for all the maintenance socket tests.

Refs SCYLLADB-1070
2026-03-17 20:20:47 +01:00
Dario Mirovic
278535e4e3 test: use cleanup fixture in maintenance socket auth tests
Add a cql_clusters pytest fixture that tracks CQL driver Cluster
objects and shuts them down automatically after test completion.
This replaces manual shutdown() calls at the end of each test.

Also consolidate shutdown() calls in retry helpers into finally
blocks for consistent cleanup.

Refs SCYLLADB-1070
2026-03-17 20:15:30 +01:00
Dario Mirovic
2e4b72c6b9 auth: add maintenance_socket_authorizer
GRANT/REVOKE fails on the maintenance socket connections,
because maintenance_auth_service uses allow_all_authorizer.
allow_all_authorizer allows all operations, but not GRANT/REVOKE,
because they make no sense in its context.

This has been observed during PGO run failure in operations from
./pgo/conf/auth.cql file.

This patch introduces maintenance_socket_authorizer that supports
the capabilities of default_authorizer ('CassandraAuthorizer')
without needing authorization.

Refs SCYLLADB-1070
2026-03-17 19:19:41 +01:00
Botond Dénes
172c786079 Merge 'perf-alternator: wait for alternator port before running workload' from Marcin Maliszkiewicz
This patch is mostly for the purpose of running pgo CI job.

We may receive a connection error if the asyncio.sleep(5) in
pgo.py is not a sufficient wait.

In pgo.py we do wait for a port, but only the CQL one; in any
case it's better to have a high-level check here than to wait
for the alternator port there.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1071
Backport: 2026.1 - it failed on CI for that build

Closes scylladb/scylladb#29063

* github.com:scylladb/scylladb:
  perf: add abort_source support to wait-for-port loops
  perf-alternator: wait for alternator port before running workload
2026-03-17 18:38:11 +02:00
Botond Dénes
5d868dcc55 Merge 's3_client: fix s3::range max value for object size' from Ernest Zaslavsky
- fix s3::range max value for object size which is 50TiB and not 5.
- refactor constants to make it accessible for all interested parties, also reuse these constants in tests

No need to backport, doubt we will encounter an object larger than 5TiB

Closes scylladb/scylladb#28601

* github.com:scylladb/scylladb:
  s3_client: reorganize tests in part_size_calculation_test
  s3_client: switch using s3 limits constants in tests
  s3_client: fix the s3::range max object size
  s3_client: remove "aws" prefix from object limits constants
  s3_client: make s3 object limits accessible
2026-03-17 16:34:42 +02:00
Anna Stuchlik
f4a6bb1885 doc: remove the Open Source Example from Installation
This commit removes the Open Source example from the Installation section for CentOS.
We updated the example for Ubuntu, but not for CentOS.
We don't want to have any Open Source information in the docs.

Fixes https://github.com/scylladb/scylladb/issues/29087
2026-03-17 14:54:32 +01:00
Anna Stuchlik
95bc8911dd doc: replace http with https in the installation instructions
Fixes https://github.com/scylladb/scylladb/issues/17227
2026-03-17 14:46:16 +01:00
Dawid Mędrek
a8dd13731f Merge 'Improve debuggability of test/cluster/test_data_resurrection_in_memtable.py' from Botond Dénes
This test was observed to fail in CI recently but there is not enough information in the logs to figure out what went wrong. This PR makes a few improvements to make the next investigation easier, should it be needed:
* storage-service: add table name to mutation write failure error messages.
* database: the `database_apply` error injection used to cause trouble, catching writes to bystander tables, making tests flaky. To eliminate this, it gained a filter to apply only to non-system keyspaces. Unfortunately, this still allows it to catch writes to the trace tables. While this should not fail the test, it reduces observability, as some traces disappear. Improve this error injection to only apply to selected table. Also merge it with the `database_apply_wait` error injection, to streamline the code a bit.
* test/test_data_resurrection_in_memtable.py: dump data from the table, before the checks for expected data, so if checks fail, the data in the table is known.

Refs: SCYLLADB-812
Refs: SCYLLADB-870
Fixes: SCYLLADB-1050 (by restricting `database_apply` error injection, so it doesn't affect writes to system traces)

Backport: test related improvement, no backport

Closes scylladb/scylladb#28899

* github.com:scylladb/scylladb:
  test/cluster/test_data_resurrection_in_memtable.py: dump rows before check
  replica/database: consolidate the two database_apply error injections
  service/storage_proxy: add name of table to error message for write errors
2026-03-17 13:35:19 +01:00
Botond Dénes
318aa07158 Merge ' test/alternator: use module-scope fixtures in test_streams.py ' from Nadav Har'El
Previously, all stream-table fixtures in test_streams.py used scope="function",
forcing a fresh table to be created for every test, slowing down the test a bit
(though not much), and discouraging writing small new tests.

This was a workaround for a DynamoDB quirk (that Alternator doesn't have):
LATEST shard iterators have a time slack and may point slightly before the true
stream head, causing leftover events from a previous test to appear in the next
test's reads.

The first two tests in this series fix small problems that turn up once we start
sharing test tables in test_streams.py. The final patch fixes the "LATEST" problem
and enables sharing the test table by using "module" scope fixtures instead of
"function".

After this series, test_streams.py run time went down a bit, from 20.2 seconds to 17.7 seconds.

Closes scylladb/scylladb#28972

* github.com:scylladb/scylladb:
  test/alternator: speed up test_streams.py by using module-scope fixtures
  test/alternator: test_streams.py don't use fixtures in 4 tests
  test/alternator: fix do_test() in test_streams.py
2026-03-17 13:56:16 +02:00
Ernest Zaslavsky
7f597aca67 cmake: fix broken build
Add raft_util.idl.hh to cmake to generate the code properly

Closes scylladb/scylladb#29055
2026-03-17 10:35:34 +01:00
Botond Dénes
dbe70cddca test/boost/querier_cache_test: make test_time_based_cache_eviction less sensitive to timing
This test relies on the cache entry being evicted after 200ms past the
TTL. This may not happen on a busy CI machine. Make the test less
reliant on timing by using eventually_true().
Simplify the test by dropping the second entry; it doesn't add anything
to the test.

Fixes: SCYLLADB-811

Closes scylladb/scylladb#28958
2026-03-17 10:32:23 +01:00
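The eventually_true() approach above replaces a fixed post-TTL sleep with polling. A minimal sketch of the pattern (illustrative Python; the real test is a C++ Boost test using Scylla's own helper, so all names here are assumptions):

```python
import time

def eventually_true(condition, timeout=10.0, interval=0.05):
    """Poll `condition` until it returns True or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# Example: wait for a cache entry to be evicted instead of
# asserting immediately after the TTL has passed.
cache = {"entry": object()}
cache.pop("entry", None)          # eviction happens at some point
assert eventually_true(lambda: "entry" not in cache)
```

On a busy CI machine the condition may become true well after the nominal deadline, so polling with a generous timeout is less flaky than a single timed assertion.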
Botond Dénes
0fd51c4adb test/nodetool: rest_api_mock_server: add retry for status code 404
This fixtures starts the mock server and immediately connects to it to
setup the expected requests. The connection attempt might be too early,
so there is a retry loop with a timeout. The loop currently checks for
requests.exception.ConnectionError. We've seen a case where the
connection is successful but the request fails with 404. The mock
started the server but didn't setup the routes yet. Add a retry for http
404 to handle this.

Fixes: SCYLLADB-966

Closes scylladb/scylladb#29003
2026-03-17 10:30:23 +01:00
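The retry loop described above, extended to also retry on HTTP 404, could be sketched like this (illustrative Python; the function names and the NotReady stand-in for requests.exceptions.ConnectionError are assumptions, not the actual fixture code):

```python
import time

class NotReady(Exception):
    """Stands in for requests.exceptions.ConnectionError in this sketch."""

def connect_with_retry(attempt, timeout=5.0, interval=0.05):
    """Call `attempt` until it succeeds; retry on connection errors
    and on HTTP 404 (server is up but routes are not set up yet)."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            status = attempt()
            if status != 404:        # routes are ready
                return status
        except NotReady:
            pass                     # server not listening yet
        if time.monotonic() >= deadline:
            raise TimeoutError("mock server never became ready")
        time.sleep(interval)

# Simulated server: refuses the connection, then 404s, then serves.
responses = iter([NotReady, 404, 200])
def attempt():
    r = next(responses)
    if r is NotReady:
        raise NotReady()
    return r

assert connect_with_retry(attempt) == 200
```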
Pavel Emelyanov
9fe19ec9d9 sstables: Fix object storage lister not resetting position in batch
vector

The lister loop in get() pre-fetches records in batches and keeps them
in an _info vector, iterating over it with the help of the _pos cursor.
When the vector is re-read, the cursor must be reset too.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-17 10:32:42 +03:00
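The bug pattern above is generic: a batched iterator keeps a batch vector and a position cursor, and refilling the batch without resetting the cursor skips (or re-reads past) entries. A minimal sketch of the fixed behavior (illustrative Python mirroring the _info/_pos scheme; not the actual C++ lister):

```python
class BatchedLister:
    def __init__(self, source, batch_size):
        self._source = source      # iterator yielding all records
        self._batch_size = batch_size
        self._info = []            # current batch (like the _info vector)
        self._pos = 0              # cursor into the batch

    def get(self):
        """Return the next record, or None at EOF."""
        if self._pos == len(self._info):
            # Re-read the batch: the cursor MUST be reset too,
            # otherwise entries of the new batch are skipped.
            self._info = [x for _, x in
                          zip(range(self._batch_size), self._source)]
            self._pos = 0
            if not self._info:
                return None        # true EOF
        rec = self._info[self._pos]
        self._pos += 1
        return rec

lister = BatchedLister(iter(range(5)), batch_size=2)
out = []
while (r := lister.get()) is not None:
    out.append(r)
assert out == [0, 1, 2, 3, 4]
```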
Pavel Emelyanov
1a6a7647c6 sstables: Fix object storage lister skipping entries when filter is active
The lister loop in the get() method looks weird. It uses a
do-while(false) loop and calls continue; inside when the filter asks to
skip an entry. In a do-while(false) loop, continue jumps to the false
condition and exits, so skipping aborts the whole listing and reports
EOF, which is not what's supposed to happen.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-17 10:32:40 +03:00
Marcin Maliszkiewicz
9318c80203 perf: add abort_source support to wait-for-port loops
Check abort_source on each retry iteration in
wait_for_alternator and wait_for_cql so the
wait can be interrupted on shutdown.

Didn't use sleep_abortable as the sleep is very short
anyway.
2026-03-16 16:14:10 +01:00
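An abort-aware wait-for-port loop checks the abort flag on every retry iteration, so a shutdown interrupts the wait promptly. A sketch (illustrative Python using a threading.Event in place of Seastar's abort_source; names are assumptions):

```python
import threading
import time

def wait_for_port(is_open, abort: threading.Event,
                  timeout=30.0, interval=0.01):
    """Poll `is_open` until it returns True; bail out promptly
    if `abort` is set (e.g. on shutdown)."""
    deadline = time.monotonic() + timeout
    while not is_open():
        if abort.is_set():
            raise RuntimeError("aborted while waiting for port")
        if time.monotonic() >= deadline:
            raise TimeoutError("port never opened")
        time.sleep(interval)   # very short sleep, flag checked each round

abort = threading.Event()
opened_at = time.monotonic() + 0.05
wait_for_port(lambda: time.monotonic() >= opened_at, abort)  # succeeds

abort.set()
try:
    wait_for_port(lambda: False, abort)
    raise AssertionError("should have aborted")
except RuntimeError:
    pass
```

Because the per-iteration sleep is very short, checking the flag between sleeps is enough; an abortable sleep primitive would add little here, matching the commit's reasoning.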
Marcin Maliszkiewicz
edf0148bee perf-alternator: wait for alternator port before running workload
This patch is mostly for the purpose of running the pgo CI job.

We may receive a connection error if the asyncio.sleep(5) in
pgo.py is not a sufficient waiting time.

pgo.py does wait for a port, but only for CQL; in any case it's better
to have a high-level check here than to wait for the alternator port
there.
2026-03-16 16:07:52 +01:00
Raphael S. Carvalho
ee87b66033 replica: Demote log level on split failure during shutdown
Dtest failed with:

table - Failed to load SSTable .../me-3gyn_0qwi_313gw2n2y90v2j4fcv-big-Data.db
of origin memtable due to std::runtime_error (Cannot split
.../me-3gyn_0qwi_313gw2n2y90v2j4fcv-big-Data.db because manager has compaction
disabled, reason might be out of space prevention), it will be unlinked...

The reason is that the error above is being triggered when the cause is
shutdown, not out of space prevention. Let's distinguish between the two
cases and log the error with warning level on shutdown.

Fixes https://github.com/scylladb/scylladb/issues/24850.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2026-03-16 12:03:17 -03:00
Raphael S. Carvalho
b508f3dd38 service: Demote log level on split failure during shutdown
Since commit 509f2af8db, gate_closed_exception can be triggered
for ongoing split during shutdown. The commit is correct, but it
causes split failure on shutdown to log an error, which causes
CI instability. Previously, aborted_exception would be triggered
instead which is logged as warning. Let's do the same.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-951.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2026-03-16 11:52:00 -03:00
Karol Nowacki
7659a5b878 vector_search: test: fix flaky test
The test assumes that the sleep duration will be at least the value of
the sleep parameter. However, the actual sleep time can be slightly less
than requested (e.g., a 100ms sleep request might result in a 99ms
sleep).

This commit adjusts the test's time comparison to be more lenient,
preventing test flakiness.
2026-03-13 16:28:22 +01:00
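Since timers can fire marginally early, asserting `elapsed >= requested` exactly is flaky; the fix is a tolerant comparison. A sketch (illustrative Python; the slack value is an assumption, not the value used in the actual test):

```python
import time

def assert_slept_at_least(requested_s, elapsed_s, slack_s=0.005):
    """Allow the measured sleep to undershoot by a small slack,
    since e.g. a 100ms sleep request may result in a 99ms sleep."""
    assert elapsed_s >= requested_s - slack_s, (
        f"slept {elapsed_s:.4f}s, expected ~{requested_s}s")

start = time.monotonic()
time.sleep(0.05)
elapsed = time.monotonic() - start
assert_slept_at_least(0.05, elapsed)
```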
Karol Nowacki
5474cc6cc2 vector_search: fix race condition on connection timeout
When a `with_connect` operation timed out, the underlying connection
attempt continued to run in the reactor. This could lead to a crash
if the connection was established/rejected after the client object had
already been destroyed. This issue was observed during the teardown
phase of an upcoming high-availability test case.

This commit fixes the race condition by ensuring the connection attempt
is properly canceled on timeout.

Additionally, the explicit TLS handshake previously forced during the
connection is now deferred to the first I/O operation, which is the
default and preferred behavior.

Fixes: SCYLLADB-832
2026-03-13 16:28:22 +01:00
Andrzej Jackowski
60aaea8547 cql: improve write consistency level guardrail messages
Update warn and fail messages for the write_consistency_levels_warned
and write_consistency_levels_disallowed guardrails to include the
configuration option name and actionable guidance. The main motivation
is to make the messages follow the conventions of other guardrails.

Refs: SCYLLADB-257
2026-03-13 14:40:45 +01:00
Tomasz Grabiec
1256a9faa7 tablets: Fix deadlock in background storage group merge fiber
When it deadlocks, groups stop merging and the compaction group merge
backlog will run away.

Also, graceful shutdown will be blocked on it.

Found by the flaky unit test
test_merge_chooses_best_replica_with_odd_count, which timed out in 1
in 100 runs.

Reason for deadlock:

When storage groups are merged, the main compaction group of the new
storage group takes a compaction lock, which is appended to
_compaction_reenablers_for_merging, and released when the merge
completion fiber is done with the whole batch.

If we accumulate more than 1 merge cycle for the fiber, deadlock
occurs. The lock order will be this:

Initial state:

 cg0: main
 cg1: main
 cg2: main
 cg3: main

After 1st merge:

 cg0': main [locked], merging_groups=[cg0.main, cg1.main]
 cg1': main [locked], merging_groups=[cg2.main, cg3.main]

After 2nd merge:

 cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main]

The merge completion fiber will try to stop cg0'.main, which will
block on the compaction lock, which is held by the reenabler in
_compaction_reenablers_for_merging; hence the deadlock.

The fix is to wait for background merge to finish before we start the
next merge. It's achieved by holding old erm in the background merge,
and doing a topology barrier from the merge finalizing transition.

Background merge is supposed to be a relatively quick operation: it's
stopping compaction groups, so it may wait for active requests, but it
shouldn't prolong the barrier indefinitely.

Tablet boost unit tests which trigger merge need to be adjusted to
call the barrier, otherwise they will be vulnerable to the deadlock.

Two cluster tests were removed because they assumed that merge happens
in the background. Now that it happens as part of merge finalization,
and blocks the topology state machine, those tests deadlock because
they are unable to make topology changes (node bootstrap) while the
background merge is blocked.

The test "test_tablets_merge_waits_for_lwt" needed to be adjusted. It
assumed that merge finalization doesn't wait for the erm held by the
LWT operation, and triggered tablet movement afterwards, and assumed
that this migration will issue a barrier which will block on the LWT
operation. After this commit, it's the barrier in merge finalization
which is blocked. The test was adjusted to use an earlier log mark
when waiting for "Got raft_topology_cmd::barrier_and_drain", which
will catch the barrier in merge finalization.

Fixes SCYLLADB-928
2026-03-12 22:45:01 +01:00
Tomasz Grabiec
7706c9e8c4 replica: table: Propagate old erm to storage group merge 2026-03-12 22:45:01 +01:00
Tomasz Grabiec
582a4abeb6 test: boost: tablets_test: Save tablet metadata when ACKing split resize decision
Needs to be ordered before split finalization, because storage_group
must be in split mode already at finalization time. There must be
split-ready compaction groups, otherwise finalization fails with this
error:

  Found 0 split ready compaction groups, but expected 2 instead.

Exposed by increased split activity in tests.
2026-03-12 22:45:01 +01:00
Tomasz Grabiec
279fcdd5ff storage_service: Extract local_topology_barrier()
Will be called in tests. It does the local part of the global topology
barrier.

The comment:

        // We capture the topology version right after the checks
        // above, before any yields. This is crucial since _topology_state_machine._topology
        // might be altered concurrently while this method is running,
        // which can cause the fence command to apply an invalid fence version.

was dropped, because it's no longer true after
fad6c41cee, and it doesn't make sense in
the context of local_topology_barrier(). We'd have to propagate the
version to local_topology_barrier(), but it's pointless. The fence
version is decided before calling the local barrier, and it will be
valid even if local version moves ahead.
2026-03-12 22:44:56 +01:00
Nadav Har'El
92ee959e9b test/alternator: speed up test_streams.py by using module-scope fixtures
Previously, all stream-table fixtures in this test file used
scope="function", forcing a fresh table to be created for every test,
slowing down the test a bit (though not much), and discouraging writing
small new tests.

This was a workaround for a DynamoDB quirk (that Alternator doesn't have):
LATEST shard iterators have a time slack and may point slightly before
the true stream head, causing leftover events from a previous test to
appear in the next test's reads.

We fix this by draining the stream inside latest_iterators() and
shards_and_latest_iterators() after obtaining the LATEST iterators:
fetch records in a loop until two consecutive polling rounds both return
empty, guaranteeing the iterators are positioned past all pre-existing
events before the caller writes anything.  With this guarantee in place,
all stream-table fixtures can safely use scope="module".

After this patch, test_streams.py continues to pass on DynamoDB.
On Alternator, the test file's run time went down a bit, from
20.2 seconds to 17.7 seconds.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-03-10 17:14:04 +02:00
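The draining rule above is: fetch records in a loop and stop only after two consecutive empty polling rounds, which guarantees the iterators are positioned past all pre-existing events. A sketch (illustrative Python over a generic poll() callable, not actual DynamoDB API calls):

```python
def drain_stream(poll, required_empty_rounds=2):
    """Call poll() repeatedly, discarding records, until
    `required_empty_rounds` consecutive polls return nothing."""
    empty_rounds = 0
    drained = 0
    while empty_rounds < required_empty_rounds:
        records = poll()
        if records:
            drained += len(records)
            empty_rounds = 0     # reset: the stream is not quiet yet
        else:
            empty_rounds += 1
    return drained

# Leftover events can arrive in bursts with an empty poll in between;
# a single empty poll must not end the drain early.
batches = iter([[1, 2], [], [3], [], []])
assert drain_stream(lambda: next(batches, [])) == 3
```

Requiring two consecutive empty rounds (rather than one) is what protects against the burst-then-gap pattern shown in the example.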
Nadav Har'El
6ac1f1333f test/alternator: test_streams.py don't use fixtures in 4 tests
In the next patch, we plan to make the fixtures in test_streams.py
shared between tests. Most tests work well with shared tables, but two
(test_streams_trim_horizon and test_streams_starting_sequence_number)
were written to expect a new table with an empty history, and two
other (test_streams_closed_read and test_streams_disabled_stream) want
to disable streaming and would break a shared table.

So in this patch we modify these four tests to create their own new
table instead of using a fixture.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-03-10 17:12:33 +02:00
Nadav Har'El
16e7a88a02 test/alternator: fix do_test() in test_streams.py
Many tests in test/alternator/test_streams.py use a do_test() function
which invokes a user-defined function that runs some write requests,
and then verifies that the expected output appears on the stream.

Because DynamoDB drops do-nothing changes from the stream - such as
writing to an item a value that it already has - these tests need to
write to a different item each time, so do_test() invents a random key
and passes it to the user-defined function to use. But... we had a bug,
the random number generation was done only once, instead of every time.
The fix is to do the random number generation on every call.

We never noticed this bug when each test used a brand new table. But the
next patch will make the tests share the test table, and tests start
to fail. It's especially visible if you run the same test twice against
DynamoDB, e.g.,

test/alternator/run --count 2 --aws \
    test_streams.py::test_streams_putitem_keys_only

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-03-09 19:21:53 +02:00
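In Python, "the random number generation was done only once" is the classic default-argument pitfall: a default like `key=random_key()` is evaluated once at function definition time, not on every call. A minimal sketch of the bug and the fix (illustrative; the actual do_test() signature lives in test_streams.py, and a counter replaces the random key to keep the demo deterministic):

```python
import itertools

_next = itertools.count()

def fresh_key():
    # Stands in for the test's random key generation.
    return f"p{next(_next)}"

def do_test_buggy(test_fn, key=fresh_key()):   # BUG: evaluated once, at def time
    test_fn(key)
    return key

def do_test_fixed(test_fn, key=None):
    if key is None:
        key = fresh_key()                      # fresh key on every call
    test_fn(key)
    return key

noop = lambda k: None
assert do_test_buggy(noop) == do_test_buggy(noop)   # same key reused
assert do_test_fixed(noop) != do_test_fixed(noop)   # distinct keys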
Łukasz Paszkowski
147b355326 replica/table: avoid computing token range side in storage_group_of() on hot path
`storage_group_of()` is on the replica-side token lookup hot path but
used `tablet_map::get_tablet_id_and_range_side()`, which computes both
tablet id and post-split range side.

Most callers only need the storage group id. Switch `storage_group_of()`
to use `get_tablet_id()` via `tablet_id_for_token()`, and select the
compaction group via new overloads that compute the range side only
when splitting mode is active.
2026-03-09 17:59:36 +01:00
Łukasz Paszkowski
419e9aa323 replica/compaction_group: add lazy select_compaction_group() overloads
Change `storage_group::select_compaction_group()` to accept a token
(and tablet_map) and compute the tablet range side only when
splitting_mode() is active.

Add an overload for selecting the compaction group for an sstable
spanning a token range.
2026-03-09 17:59:36 +01:00
Łukasz Paszkowski
3f70611504 locator/tablets: add tablet_map::get_tablet_range_side()
Add `tablet_map::get_tablet_range_side(token)` to compute the
post-split range side without computing the tablet id.

Pure addition, no behavior change.
2026-03-09 17:59:36 +01:00
Jakub Smolar
7cdd979158 db/config: announce ms format as highest supported
Uncomment the feature flag check in get_highest_supported_format()
to return MS format when supported, otherwise fall back to ME.
2026-03-09 17:12:09 +01:00
Michał Chojnowski
949fc85217 db/config: enable ms sstable format by default
Trie-based sstable indexes are supposed to be (hopefully)
a better default than the old BIG indexes.
Make them the new default.

If we change our mind, this change can be reverted later.
2026-03-09 17:12:09 +01:00
Michał Chojnowski
6b413e3959 cluster/dtest/bypass_cache_test: switch from highest_supported_sstable_format to chosen_sstable_format
Trie-based indexes and older indexes have a difference in metrics,
and the test uses the metrics to check for bypass cache.
To choose the right metrics, it uses highest_supported_sstable_format,
which is inappropriate, because the sstable format chosen for writes
by Scylla might be different than highest_supported_sstable_format.

Use chosen_sstable_format instead.
2026-03-09 17:12:09 +01:00
Michał Chojnowski
b89840c4b9 api/system: add /system/chosen_sstable_version
Returns the sstable version currently chosen for use in new sstables.

We are adding it because some tests want to know what format they are
writing (tests using upgradesstable, tests which check stats that only
apply to one of the index types, etc).

(Currently they are using `highest_supported_sstable_format` for this
purpose, which is inappropriate, and will become invalid if a non-latest
format is the default).
2026-03-09 17:12:09 +01:00
Michał Chojnowski
9280a039ee test/cluster/dtest: reduce num_tokens to 16
cluster.dtest_alternator_tests.test_slow_query_logging performs
a bootstrap with 768 token ranges.

It works with `me` sstables, which have 2 open file descriptors
per open sstable, but with `ms` sstables, which have 3 open
file descriptors per open sstable, it fails with EMFILE.

To avoid this problem, let's just decrease the number of vnodes in the
test suite. It's appropriate anyway, because it avoids some unneeded
work without weakening the tests.
(Note: pylib-based tests have been setting `num_tokens` to 16 for a long time too).

This breaks `bypass_cache_test`, which is written in a way that expects
a certain number of token ranges. We adjust the relevant parameter
accordingly.
2026-03-09 17:12:09 +01:00
Botond Dénes
cd13a911cc test/cluster/test_data_resurrection_in_memtable.py: dump rows before check
So that if the check of expected rows fail, we have a dump to look at
and see what is different.
2026-03-05 11:44:02 +02:00
Botond Dénes
f375aae257 replica/database: consolidate the two database_apply error injections
Into a single database_apply one. Add three parameters:
* ks_name and cf_name to filter the tables to be affected
* what - what to do: throw or wait

This leads to smaller footprint in the code and improved filtering for
table names at the cost of some extra error injection params in the
tests.
2026-03-05 11:44:02 +02:00
Botond Dénes
44b8cad3df service/storage_proxy: add name of table to error message for write errors
It is useful to know what table the failed write belongs to.
2026-03-05 10:51:12 +02:00
Ernest Zaslavsky
afac984632 s3_client: reorganize tests in part_size_calculation_test
just group all BOOST_REQUIRE_EXCEPTION tests in one block and
remove artificial scopes
2026-02-18 12:12:04 +02:00
Ernest Zaslavsky
1a20877afe s3_client: switch to using s3 limit constants in tests
instead of using magic numbers, switch to using the s3 limit constants
to make it clearer what is tested and why
2026-02-18 12:12:04 +02:00
Ernest Zaslavsky
d763bdabc2 s3_client: fix the s3::range max object size
in the s3::Range class, start using the s3 global constant for two reasons:
1) uniformity: no need to introduce a semantically identical constant in each class
2) the value was wrong
2026-02-18 12:12:04 +02:00
Ernest Zaslavsky
24e70b30c8 s3_client: remove "aws" prefix from object limits constants
remove "aws" prefix from object limits constants since it is
irrelevant and unnecessary when sitting under s3 namespace
2026-02-18 12:12:04 +02:00
Ernest Zaslavsky
329c156600 s3_client: make s3 object limits accessible
make s3 limits constants publicly accessible to reuse it later
2026-02-18 12:12:04 +02:00
169 changed files with 8615 additions and 2233 deletions

View File

@@ -55,22 +55,26 @@ ninja build/<mode>/test/boost/<test_name>
ninja build/<mode>/scylla
# Run all tests in a file
./test.py --mode=<mode> <test_path>
./test.py --mode=<mode> test/<suite>/<test_name>.py
# Run a single test case from a file
./test.py --mode=<mode> <test_path>::<test_function_name>
./test.py --mode=<mode> test/<suite>/<test_name>.py::<test_function_name>
# Run all tests in a directory
./test.py --mode=<mode> test/<suite>/
# Examples
./test.py --mode=dev alternator/
./test.py --mode=dev cluster/test_raft_voters::test_raft_limited_voters_retain_coordinator
./test.py --mode=dev test/alternator/
./test.py --mode=dev test/cluster/test_raft_voters.py::test_raft_limited_voters_retain_coordinator
./test.py --mode=dev test/cqlpy/test_json.py
# Optional flags
./test.py --mode=dev cluster/test_raft_no_quorum -v # Verbose output
./test.py --mode=dev cluster/test_raft_no_quorum --repeat 5 # Repeat test 5 times
./test.py --mode=dev test/cluster/test_raft_no_quorum.py -v # Verbose output
./test.py --mode=dev test/cluster/test_raft_no_quorum.py --repeat 5 # Repeat test 5 times
```
**Important:**
- Use path without `.py` extension (e.g., `cluster/test_raft_no_quorum`, not `cluster/test_raft_no_quorum.py`)
- Use full path with `.py` extension (e.g., `test/cluster/test_raft_no_quorum.py`, not `cluster/test_raft_no_quorum`)
- To run a single test case, append `::<test_function_name>` to the file path
- Add `-v` for verbose output
- Add `--repeat <num>` to repeat a test multiple times

View File

@@ -8,6 +8,9 @@ on:
jobs:
check-fixes-prefix:
runs-on: ubuntu-latest
permissions:
contents: read
issues: write
steps:
- name: Check PR body for "Fixes" prefix patterns
uses: actions/github-script@v7

View File

@@ -1,4 +1,6 @@
name: Trigger Scylla CI Route
permissions:
contents: read
on:
issue_comment:

View File

@@ -1,5 +1,8 @@
name: Trigger next gating
permissions:
contents: read
on:
push:
branches:

View File

@@ -1295,6 +1295,45 @@
}
]
},
{
"path":"/storage_service/logstor_compaction",
"operations":[
{
"method":"POST",
"summary":"Trigger compaction of the key-value storage",
"type":"void",
"nickname":"logstor_compaction",
"produces":[
"application/json"
],
"parameters":[
{
"name":"major",
"description":"When true, perform a major compaction",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/logstor_flush",
"operations":[
{
"method":"POST",
"summary":"Trigger flush of logstor storage",
"type":"void",
"nickname":"logstor_flush",
"produces":[
"application/json"
],
"parameters":[]
}
]
},
{
"path":"/storage_service/active_repair/",
"operations":[
@@ -3229,6 +3268,38 @@
}
]
},
{
"path":"/storage_service/logstor_info",
"operations":[
{
"method":"GET",
"summary":"Logstor segment information for one table",
"type":"table_logstor_info",
"nickname":"logstor_info",
"produces":[
"application/json"
],
"parameters":[
{
"name":"keyspace",
"description":"The keyspace",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"table",
"description":"table name",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/retrain_dict",
"operations":[
@@ -3637,6 +3708,47 @@
}
}
},
"logstor_hist_bucket":{
"id":"logstor_hist_bucket",
"properties":{
"bucket":{
"type":"long"
},
"count":{
"type":"long"
},
"min_data_size":{
"type":"long"
},
"max_data_size":{
"type":"long"
}
}
},
"table_logstor_info":{
"id":"table_logstor_info",
"description":"Per-table logstor segment distribution",
"properties":{
"keyspace":{
"type":"string"
},
"table":{
"type":"string"
},
"compaction_groups":{
"type":"long"
},
"segments":{
"type":"long"
},
"data_size_histogram":{
"type":"array",
"items":{
"$ref":"logstor_hist_bucket"
}
}
}
},
"tablet_repair_result":{
"id":"tablet_repair_result",
"description":"Tablet repair result",

View File

@@ -209,6 +209,21 @@
"parameters":[]
}
]
},
{
"path":"/system/chosen_sstable_version",
"operations":[
{
"method":"GET",
"summary":"Get sstable version currently chosen for use in new sstables",
"type":"string",
"nickname":"get_chosen_sstable_version",
"produces":[
"application/json"
],
"parameters":[]
}
]
}
]
}

View File

@@ -18,7 +18,9 @@
#include "utils/assert.hh"
#include "utils/estimated_histogram.hh"
#include <algorithm>
#include <sstream>
#include "db/data_listeners.hh"
#include "utils/hash.hh"
#include "storage_service.hh"
#include "compaction/compaction_manager.hh"
#include "unimplemented.hh"
@@ -342,6 +344,56 @@ uint64_t accumulate_on_active_memtables(replica::table& t, noncopyable_function<
return ret;
}
static
future<json::json_return_type>
rest_toppartitions_generic(sharded<replica::database>& db, std::unique_ptr<http::request> req) {
bool filters_provided = false;
std::unordered_set<std::tuple<sstring, sstring>, utils::tuple_hash> table_filters {};
if (auto filters = req->get_query_param("table_filters"); !filters.empty()) {
filters_provided = true;
std::stringstream ss { filters };
std::string filter;
while (!filters.empty() && ss.good()) {
std::getline(ss, filter, ',');
table_filters.emplace(parse_fully_qualified_cf_name(filter));
}
}
std::unordered_set<sstring> keyspace_filters {};
if (auto filters = req->get_query_param("keyspace_filters"); !filters.empty()) {
filters_provided = true;
std::stringstream ss { filters };
std::string filter;
while (!filters.empty() && ss.good()) {
std::getline(ss, filter, ',');
keyspace_filters.emplace(std::move(filter));
}
}
// when the query is empty return immediately
if (filters_provided && table_filters.empty() && keyspace_filters.empty()) {
apilog.debug("toppartitions query: processing results");
cf::toppartitions_query_results results;
results.read_cardinality = 0;
results.write_cardinality = 0;
return make_ready_future<json::json_return_type>(results);
}
api::req_param<std::chrono::milliseconds, unsigned> duration{*req, "duration", 1000ms};
api::req_param<unsigned> capacity(*req, "capacity", 256);
api::req_param<unsigned> list_size(*req, "list_size", 10);
apilog.info("toppartitions query: #table_filters={} #keyspace_filters={} duration={} list_size={} capacity={}",
!table_filters.empty() ? std::to_string(table_filters.size()) : "all", !keyspace_filters.empty() ? std::to_string(keyspace_filters.size()) : "all", duration.value, list_size.value, capacity.value);
return seastar::do_with(db::toppartitions_query(db, std::move(table_filters), std::move(keyspace_filters), duration.value, list_size, capacity), [] (db::toppartitions_query& q) {
return run_toppartitions_query(q);
});
}
void set_column_family(http_context& ctx, routes& r, sharded<replica::database>& db) {
cf::get_column_family_name.set(r, [&db] (const_req req){
std::vector<sstring> res;
@@ -1047,6 +1099,10 @@ void set_column_family(http_context& ctx, routes& r, sharded<replica::database>&
});
});
ss::toppartitions_generic.set(r, [&db] (std::unique_ptr<http::request> req) {
return rest_toppartitions_generic(db, std::move(req));
});
cf::force_major_compaction.set(r, [&ctx, &db](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
if (!req->get_query_param("split_output").empty()) {
fail(unimplemented::cause::API);
@@ -1213,6 +1269,7 @@ void unset_column_family(http_context& ctx, routes& r) {
cf::get_sstable_count_per_level.unset(r);
cf::get_sstables_for_key.unset(r);
cf::toppartitions.unset(r);
ss::toppartitions_generic.unset(r);
cf::force_major_compaction.unset(r);
ss::get_load.unset(r);
ss::get_metrics_load.unset(r);

View File

@@ -17,9 +17,7 @@
#include "gms/feature_service.hh"
#include "schema/schema_builder.hh"
#include "sstables/sstables_manager.hh"
#include "utils/hash.hh"
#include <optional>
#include <sstream>
#include <stdexcept>
#include <time.h>
#include <algorithm>
@@ -612,56 +610,6 @@ rest_get_token_endpoint(http_context& ctx, sharded<service::storage_service>& ss
co_return json::json_return_type(stream_range_as_array(token_endpoints, &map_to_json<dht::token, gms::inet_address>));
}
static
future<json::json_return_type>
rest_toppartitions_generic(http_context& ctx, std::unique_ptr<http::request> req) {
bool filters_provided = false;
std::unordered_set<std::tuple<sstring, sstring>, utils::tuple_hash> table_filters {};
if (auto filters = req->get_query_param("table_filters"); !filters.empty()) {
filters_provided = true;
std::stringstream ss { filters };
std::string filter;
while (!filters.empty() && ss.good()) {
std::getline(ss, filter, ',');
table_filters.emplace(parse_fully_qualified_cf_name(filter));
}
}
std::unordered_set<sstring> keyspace_filters {};
if (auto filters = req->get_query_param("keyspace_filters"); !filters.empty()) {
filters_provided = true;
std::stringstream ss { filters };
std::string filter;
while (!filters.empty() && ss.good()) {
std::getline(ss, filter, ',');
keyspace_filters.emplace(std::move(filter));
}
}
// when the query is empty return immediately
if (filters_provided && table_filters.empty() && keyspace_filters.empty()) {
apilog.debug("toppartitions query: processing results");
httpd::column_family_json::toppartitions_query_results results;
results.read_cardinality = 0;
results.write_cardinality = 0;
return make_ready_future<json::json_return_type>(results);
}
api::req_param<std::chrono::milliseconds, unsigned> duration{*req, "duration", 1000ms};
api::req_param<unsigned> capacity(*req, "capacity", 256);
api::req_param<unsigned> list_size(*req, "list_size", 10);
apilog.info("toppartitions query: #table_filters={} #keyspace_filters={} duration={} list_size={} capacity={}",
!table_filters.empty() ? std::to_string(table_filters.size()) : "all", !keyspace_filters.empty() ? std::to_string(keyspace_filters.size()) : "all", duration.value, list_size.value, capacity.value);
return seastar::do_with(db::toppartitions_query(ctx.db, std::move(table_filters), std::move(keyspace_filters), duration.value, list_size, capacity), [] (db::toppartitions_query& q) {
return run_toppartitions_query(q);
});
}
static
json::json_return_type
rest_get_release_version(sharded<service::storage_service>& ss, const_req& req) {
@@ -833,6 +781,28 @@ rest_force_keyspace_flush(http_context& ctx, std::unique_ptr<http::request> req)
co_return json_void();
}
static
future<json::json_return_type>
rest_logstor_compaction(http_context& ctx, std::unique_ptr<http::request> req) {
bool major = false;
if (auto major_param = req->get_query_param("major"); !major_param.empty()) {
major = validate_bool(major_param);
}
apilog.info("logstor_compaction: major={}", major);
auto& db = ctx.db;
co_await replica::database::trigger_logstor_compaction_on_all_shards(db, major);
co_return json_void();
}
static
future<json::json_return_type>
rest_logstor_flush(http_context& ctx, std::unique_ptr<http::request> req) {
apilog.info("logstor_flush");
auto& db = ctx.db;
co_await replica::database::flush_logstor_separator_on_all_shards(db);
co_return json_void();
}
static
future<json::json_return_type>
rest_decommission(sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>& ssc, std::unique_ptr<http::request> req) {
@@ -1553,6 +1523,54 @@ rest_sstable_info(http_context& ctx, std::unique_ptr<http::request> req) {
});
}
static
future<json::json_return_type>
rest_logstor_info(http_context& ctx, std::unique_ptr<http::request> req) {
auto keyspace = api::req_param<sstring>(*req, "keyspace", {}).value;
auto table = api::req_param<sstring>(*req, "table", {}).value;
if (table.empty()) {
table = api::req_param<sstring>(*req, "cf", {}).value;
}
if (keyspace.empty()) {
throw bad_param_exception("The query parameter 'keyspace' is required");
}
if (table.empty()) {
throw bad_param_exception("The query parameter 'table' is required");
}
keyspace = validate_keyspace(ctx, keyspace);
auto tid = validate_table(ctx.db.local(), keyspace, table);
auto& cf = ctx.db.local().find_column_family(tid);
if (!cf.uses_logstor()) {
throw bad_param_exception(fmt::format("Table {}.{} does not use logstor", keyspace, table));
}
return do_with(replica::logstor::table_segment_stats{}, [keyspace = std::move(keyspace), table = std::move(table), tid, &ctx] (replica::logstor::table_segment_stats& merged_stats) {
return ctx.db.map_reduce([&merged_stats](replica::logstor::table_segment_stats&& shard_stats) {
merged_stats += shard_stats;
}, [tid](const replica::database& db) {
return db.get_logstor_table_segment_stats(tid);
}).then([&merged_stats, keyspace = std::move(keyspace), table = std::move(table)] {
ss::table_logstor_info result;
result.keyspace = keyspace;
result.table = table;
result.compaction_groups = merged_stats.compaction_group_count;
result.segments = merged_stats.segment_count;
for (const auto& bucket : merged_stats.histogram) {
ss::logstor_hist_bucket hist;
hist.count = bucket.count;
hist.max_data_size = bucket.max_data_size;
result.data_size_histogram.push(std::move(hist));
}
return make_ready_future<json::json_return_type>(stream_object(result));
});
});
}
static
future<json::json_return_type>
rest_reload_raft_topology_state(sharded<service::storage_service>& ss, service::raft_group0_client& group0_client, std::unique_ptr<http::request> req) {
@@ -1784,7 +1802,6 @@ rest_bind(FuncType func, BindArgs&... args) {
void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>& ssc, service::raft_group0_client& group0_client) {
ss::get_token_endpoint.set(r, rest_bind(rest_get_token_endpoint, ctx, ss));
ss::toppartitions_generic.set(r, rest_bind(rest_toppartitions_generic, ctx));
ss::get_release_version.set(r, rest_bind(rest_get_release_version, ss));
ss::get_scylla_release_version.set(r, rest_bind(rest_get_scylla_release_version, ss));
ss::get_schema_version.set(r, rest_bind(rest_get_schema_version, ss));
@@ -1800,6 +1817,8 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::force_flush.set(r, rest_bind(rest_force_flush, ctx));
ss::force_keyspace_flush.set(r, rest_bind(rest_force_keyspace_flush, ctx));
ss::decommission.set(r, rest_bind(rest_decommission, ss, ssc));
ss::logstor_compaction.set(r, rest_bind(rest_logstor_compaction, ctx));
ss::logstor_flush.set(r, rest_bind(rest_logstor_flush, ctx));
ss::move.set(r, rest_bind(rest_move, ss));
ss::remove_node.set(r, rest_bind(rest_remove_node, ss));
ss::exclude_node.set(r, rest_bind(rest_exclude_node, ss));
@@ -1848,6 +1867,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::retrain_dict.set(r, rest_bind(rest_retrain_dict, ctx, ss, group0_client));
ss::estimate_compression_ratios.set(r, rest_bind(rest_estimate_compression_ratios, ctx, ss));
ss::sstable_info.set(r, rest_bind(rest_sstable_info, ctx));
ss::logstor_info.set(r, rest_bind(rest_logstor_info, ctx));
ss::reload_raft_topology_state.set(r, rest_bind(rest_reload_raft_topology_state, ss, group0_client));
ss::upgrade_to_raft_topology.set(r, rest_bind(rest_upgrade_to_raft_topology, ss));
ss::raft_topology_upgrade_status.set(r, rest_bind(rest_raft_topology_upgrade_status, ss));
@@ -1864,7 +1884,6 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
void unset_storage_service(http_context& ctx, routes& r) {
ss::get_token_endpoint.unset(r);
ss::toppartitions_generic.unset(r);
ss::get_release_version.unset(r);
ss::get_scylla_release_version.unset(r);
ss::get_schema_version.unset(r);
@@ -1878,6 +1897,8 @@ void unset_storage_service(http_context& ctx, routes& r) {
ss::reset_cleanup_needed.unset(r);
ss::force_flush.unset(r);
ss::force_keyspace_flush.unset(r);
ss::logstor_compaction.unset(r);
ss::logstor_flush.unset(r);
ss::decommission.unset(r);
ss::move.unset(r);
ss::remove_node.unset(r);
@@ -1925,6 +1946,7 @@ void unset_storage_service(http_context& ctx, routes& r) {
ss::get_ownership.unset(r);
ss::get_effective_ownership.unset(r);
ss::sstable_info.unset(r);
ss::logstor_info.unset(r);
ss::reload_raft_topology_state.unset(r);
ss::upgrade_to_raft_topology.unset(r);
ss::raft_topology_upgrade_status.unset(r);

View File

@@ -190,6 +190,13 @@ void set_system(http_context& ctx, routes& r) {
return make_ready_future<json::json_return_type>(seastar::to_sstring(format));
});
});
hs::get_chosen_sstable_version.set(r, [&ctx] (std::unique_ptr<request> req) {
return smp::submit_to(0, [&ctx] {
auto format = ctx.db.local().get_user_sstables_manager().get_preferred_sstable_version();
return make_ready_future<json::json_return_type>(seastar::to_sstring(format));
});
});
}
}

View File

@@ -47,7 +47,7 @@ void cache::set_permission_loader(permission_loader_func loader) {
_permission_loader = std::move(loader);
}
lw_shared_ptr<const cache::role_record> cache::get(const role_name_t& role) const noexcept {
lw_shared_ptr<const cache::role_record> cache::get(std::string_view role) const noexcept {
auto it = _roles.find(role);
if (it == _roles.end()) {
return {};
@@ -55,6 +55,16 @@ lw_shared_ptr<const cache::role_record> cache::get(const role_name_t& role) cons
return it->second;
}
void cache::for_each_role(const std::function<void(const role_name_t&, const role_record&)>& func) const {
for (const auto& [name, record] : _roles) {
func(name, *record);
}
}
size_t cache::roles_count() const noexcept {
return _roles.size();
}
future<permission_set> cache::get_permissions(const role_or_anonymous& role, const resource& r) {
std::unordered_map<resource, permission_set>* perms_cache;
lw_shared_ptr<role_record> role_ptr;

View File

@@ -9,6 +9,7 @@
#pragma once
#include <seastar/core/abort_source.hh>
#include <string_view>
#include <unordered_set>
#include <unordered_map>
@@ -19,7 +20,7 @@
#include <seastar/core/semaphore.hh>
#include <seastar/core/metrics_registration.hh>
#include <absl/container/flat_hash_map.h>
#include "absl-flat_hash_map.hh"
#include "auth/permission.hh"
#include "auth/common.hh"
@@ -42,8 +43,8 @@ public:
std::unordered_set<role_name_t> member_of;
std::unordered_set<role_name_t> members;
sstring salted_hash;
std::unordered_map<sstring, sstring> attributes;
std::unordered_map<sstring, permission_set> permissions;
std::unordered_map<sstring, sstring, sstring_hash, sstring_eq> attributes;
std::unordered_map<sstring, permission_set, sstring_hash, sstring_eq> permissions;
private:
friend cache;
// cached permissions include effects of role's inheritance
@@ -52,7 +53,7 @@ public:
};
explicit cache(cql3::query_processor& qp, abort_source& as) noexcept;
lw_shared_ptr<const role_record> get(const role_name_t& role) const noexcept;
lw_shared_ptr<const role_record> get(std::string_view role) const noexcept;
void set_permission_loader(permission_loader_func loader);
future<permission_set> get_permissions(const role_or_anonymous& role, const resource& r);
future<> prune(const resource& r);
@@ -61,8 +62,15 @@ public:
future<> load_roles(std::unordered_set<role_name_t> roles);
static bool includes_table(const table_id&) noexcept;
// Returns the number of roles in the cache.
size_t roles_count() const noexcept;
// The callback doesn't suspend (no co_await) so it observes the state
// of the cache atomically.
void for_each_role(const std::function<void(const role_name_t&, const role_record&)>& func) const;
private:
using roles_map = absl::flat_hash_map<role_name_t, lw_shared_ptr<role_record>>;
using roles_map = absl::flat_hash_map<role_name_t, lw_shared_ptr<role_record>, sstring_hash, sstring_eq>;
roles_map _roles;
// anonymous permissions map exists mainly due to compatibility with
// higher layers which use role_or_anonymous to get permissions.

View File

@@ -0,0 +1,37 @@
/*
* Copyright (C) 2026-present ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* SPDX-License-Identifier: (LicenseRef-ScyllaDB-Source-Available-1.0 and Apache-2.0)
*/
#pragma once
#include "auth/default_authorizer.hh"
#include "auth/permission.hh"
namespace auth {
// maintenance_socket_authorizer is used for clients connecting to the
// maintenance socket. It grants all permissions unconditionally (like
// AllowAllAuthorizer) while still supporting grant/revoke operations
// (delegated to the underlying CassandraAuthorizer / default_authorizer).
class maintenance_socket_authorizer : public default_authorizer {
public:
using default_authorizer::default_authorizer;
~maintenance_socket_authorizer() override = default;
future<> start() override {
return make_ready_future<>();
}
future<permission_set> authorize(const role_or_anonymous&, const resource&) const override {
return make_ready_future<permission_set>(permissions::ALL);
}
};
} // namespace auth

View File

@@ -30,6 +30,7 @@
#include "auth/default_authorizer.hh"
#include "auth/ldap_role_manager.hh"
#include "auth/maintenance_socket_authenticator.hh"
#include "auth/maintenance_socket_authorizer.hh"
#include "auth/maintenance_socket_role_manager.hh"
#include "auth/password_authenticator.hh"
#include "auth/role_or_anonymous.hh"
@@ -866,6 +867,12 @@ authenticator_factory make_maintenance_socket_authenticator_factory(
};
}
authorizer_factory make_maintenance_socket_authorizer_factory(sharded<cql3::query_processor>& qp) {
return [&qp] {
return std::make_unique<maintenance_socket_authorizer>(qp.local());
};
}
role_manager_factory make_maintenance_socket_role_manager_factory(
sharded<cql3::query_processor>& qp,
::service::raft_group0_client& g0,

View File

@@ -434,6 +434,11 @@ authenticator_factory make_maintenance_socket_authenticator_factory(
sharded<::service::migration_manager>& mm,
sharded<cache>& cache);
/// Creates a factory for the maintenance socket authorizer.
/// This authorizer is not config-selectable and is only used for the maintenance socket.
/// It grants all permissions unconditionally while delegating grant/revoke to the default authorizer.
authorizer_factory make_maintenance_socket_authorizer_factory(sharded<cql3::query_processor>& qp);
/// Creates a factory for the maintenance socket role manager.
/// This role manager is not config-selectable and is only used for the maintenance socket.
role_manager_factory make_maintenance_socket_role_manager_factory(

View File

@@ -44,13 +44,12 @@ namespace auth {
static logging::logger log("standard_role_manager");
future<std::optional<standard_role_manager::record>> standard_role_manager::find_record(std::string_view role_name) {
auto name = sstring(role_name);
auto role = _cache.get(name);
auto role = _cache.get(role_name);
if (!role) {
return make_ready_future<std::optional<record>>(std::nullopt);
}
return make_ready_future<std::optional<record>>(std::make_optional(record{
.name = std::move(name),
.name = sstring(role_name),
.is_superuser = role->is_superuser,
.can_login = role->can_login,
.member_of = role->member_of
@@ -393,51 +392,21 @@ future<role_set> standard_role_manager::query_granted(std::string_view grantee_n
}
future<role_to_directly_granted_map> standard_role_manager::query_all_directly_granted(::service::query_state& qs) {
const sstring query = seastar::format("SELECT * FROM {}.{}",
db::system_keyspace::NAME,
ROLE_MEMBERS_CF);
const auto results = co_await _qp.execute_internal(
query,
db::consistency_level::ONE,
qs,
cql3::query_processor::cache_internal::yes);
role_to_directly_granted_map roles_map;
std::transform(
results->begin(),
results->end(),
std::inserter(roles_map, roles_map.begin()),
[] (const cql3::untyped_result_set_row& row) {
return std::make_pair(row.get_as<sstring>("member"), row.get_as<sstring>("role")); }
);
_cache.for_each_role([&roles_map] (const cache::role_name_t& name, const cache::role_record& record) {
for (const auto& granted_role : record.member_of) {
roles_map.emplace(name, granted_role);
}
});
co_return roles_map;
}
future<role_set> standard_role_manager::query_all(::service::query_state& qs) {
const sstring query = seastar::format("SELECT {} FROM {}.{}",
meta::roles_table::role_col_name,
db::system_keyspace::NAME,
meta::roles_table::name);
// To avoid many copies of a view.
static const auto role_col_name_string = sstring(meta::roles_table::role_col_name);
const auto results = co_await _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
qs,
cql3::query_processor::cache_internal::yes);
role_set roles;
std::transform(
results->begin(),
results->end(),
std::inserter(roles, roles.begin()),
[] (const cql3::untyped_result_set_row& row) {
return row.get_as<sstring>(role_col_name_string);}
);
roles.reserve(_cache.roles_count());
_cache.for_each_role([&roles] (const cache::role_name_t& name, const cache::role_record&) {
roles.insert(name);
});
co_return roles;
}
@@ -460,31 +429,26 @@ future<bool> standard_role_manager::can_login(std::string_view role_name) {
}
future<std::optional<sstring>> standard_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state& qs) {
const sstring query = seastar::format("SELECT name, value FROM {}.{} WHERE role = ? AND name = ?",
db::system_keyspace::NAME,
ROLE_ATTRIBUTES_CF);
const auto result_set = co_await _qp.execute_internal(query, db::consistency_level::ONE, qs, {sstring(role_name), sstring(attribute_name)}, cql3::query_processor::cache_internal::yes);
if (!result_set->empty()) {
const cql3::untyped_result_set_row &row = result_set->one();
co_return std::optional<sstring>(row.get_as<sstring>("value"));
auto role = _cache.get(role_name);
if (!role) {
co_return std::nullopt;
}
co_return std::optional<sstring>{};
auto it = role->attributes.find(attribute_name);
if (it != role->attributes.end()) {
co_return it->second;
}
co_return std::nullopt;
}
future<role_manager::attribute_vals> standard_role_manager::query_attribute_for_all (std::string_view attribute_name, ::service::query_state& qs) {
return query_all(qs).then([this, attribute_name, &qs] (role_set roles) {
return do_with(attribute_vals{}, [this, attribute_name, roles = std::move(roles), &qs] (attribute_vals &role_to_att_val) {
return parallel_for_each(roles.begin(), roles.end(), [this, &role_to_att_val, attribute_name, &qs] (sstring role) {
return get_attribute(role, attribute_name, qs).then([&role_to_att_val, role] (std::optional<sstring> att_val) {
if (att_val) {
role_to_att_val.emplace(std::move(role), std::move(*att_val));
}
});
}).then([&role_to_att_val] () {
return make_ready_future<attribute_vals>(std::move(role_to_att_val));
});
});
future<role_manager::attribute_vals> standard_role_manager::query_attribute_for_all(std::string_view attribute_name, ::service::query_state& qs) {
attribute_vals result;
_cache.for_each_role([&result, attribute_name] (const cache::role_name_t& name, const cache::role_record& record) {
auto it = record.attributes.find(attribute_name);
if (it != record.attributes.end()) {
result.emplace(name, it->second);
}
});
co_return result;
}
future<> standard_role_manager::set_attribute(std::string_view role_name, std::string_view attribute_name, std::string_view attribute_value, ::service::group0_batch& mc) {

View File

@@ -1268,9 +1268,15 @@ future<> compaction_manager::start(const db::config& cfg, utils::disk_space_moni
if (dsm && (this_shard_id() == 0)) {
_out_of_space_subscription = dsm->subscribe(cfg.critical_disk_utilization_level, [this] (auto threshold_reached) {
if (threshold_reached) {
return container().invoke_on_all([] (compaction_manager& cm) { return cm.drain(); });
return container().invoke_on_all([] (compaction_manager& cm) {
cm._in_critical_disk_utilization_mode = true;
return cm.drain();
});
}
return container().invoke_on_all([] (compaction_manager& cm) { cm.enable(); });
return container().invoke_on_all([] (compaction_manager& cm) {
cm._in_critical_disk_utilization_mode = false;
cm.enable();
});
});
}
@@ -2348,6 +2354,16 @@ future<compaction_manager::compaction_stats_opt> compaction_manager::perform_spl
return perform_task_on_all_files<split_compaction_task_executor>("split", info, t, std::move(options), std::move(owned_ranges_ptr), std::move(get_sstables), throw_if_stopping::no);
}
std::exception_ptr compaction_manager::make_disabled_exception(compaction::compaction_group_view& cg) {
std::exception_ptr ex;
if (_in_critical_disk_utilization_mode) {
ex = std::make_exception_ptr(std::runtime_error("critical disk utilization"));
} else {
ex = std::make_exception_ptr(compaction_stopped_exception(cg.schema()->ks_name(), cg.schema()->cf_name(), "compaction disabled"));
}
return ex;
}
future<std::vector<sstables::shared_sstable>>
compaction_manager::maybe_split_new_sstable(sstables::shared_sstable sst, compaction_group_view& t, compaction_type_options::split opt) {
if (!split_compaction_task_executor::sstable_needs_split(sst, opt)) {
@@ -2357,8 +2373,7 @@ compaction_manager::maybe_split_new_sstable(sstables::shared_sstable sst, compac
// We don't want to prevent split because compaction is temporarily disabled on a view only for synchronization,
// which is unneeded against new sstables that aren't part of any set yet, so never use can_proceed(&t) here.
if (is_disabled()) {
co_return coroutine::exception(std::make_exception_ptr(std::runtime_error(format("Cannot split {} because manager has compaction disabled, " \
"reason might be out of space prevention", sst->get_filename()))));
co_return coroutine::exception(make_disabled_exception(t));
}
std::vector<sstables::shared_sstable> ret;

View File

@@ -115,6 +115,8 @@ private:
uint32_t _disabled_state_count = 0;
bool is_disabled() const { return _state != state::running || _disabled_state_count > 0; }
// precondition: is_disabled() is true.
std::exception_ptr make_disabled_exception(compaction::compaction_group_view& cg);
std::optional<future<>> _stop_future;
@@ -170,6 +172,7 @@ private:
shared_tombstone_gc_state _shared_tombstone_gc_state;
utils::disk_space_monitor::subscription _out_of_space_subscription;
bool _in_critical_disk_utilization_mode = false;
private:
// Requires task->_compaction_state.gate to be held and task to be registered in _tasks.
future<compaction_stats_opt> perform_task(shared_ptr<compaction::compaction_task_executor> task, throw_if_stopping do_throw_if_stopping);

View File

@@ -397,6 +397,17 @@ commitlog_total_space_in_mb: -1
# you can cache more hot rows
# column_index_size_in_kb: 64
# sstable format version for newly written sstables.
# Currently allowed values are `me` and `ms`.
# If not specified in the config, this defaults to `me`.
#
# The difference between `me` and `ms` is the data structures used
# in the primary index.
# In short, `ms` needs more CPU during sstable writes,
# but should behave better during reads,
# although it might behave worse for very long clustering keys.
sstable_format: ms
# Auto-scaling of the promoted index prevents running out of memory
# when the promoted index grows too large (due to partitions with many rows
# vs. too small column_index_size_in_kb). When the serialized representation

View File

@@ -896,6 +896,9 @@ scylla_core = (['message/messaging_service.cc',
'replica/multishard_query.cc',
'replica/mutation_dump.cc',
'replica/querier.cc',
'replica/logstor/segment_manager.cc',
'replica/logstor/logstor.cc',
'replica/logstor/write_buffer.cc',
'mutation/atomic_cell.cc',
'mutation/canonical_mutation.cc',
'mutation/frozen_mutation.cc',
@@ -1467,6 +1470,7 @@ idls = ['idl/gossip_digest.idl.hh',
'idl/query.idl.hh',
'idl/idl_test.idl.hh',
'idl/commitlog.idl.hh',
'idl/logstor.idl.hh',
'idl/tracing.idl.hh',
'idl/consistency_level.idl.hh',
'idl/cache_temperature.idl.hh',

View File

@@ -265,7 +265,10 @@ future<shared_ptr<cql_transport::messages::result_message>> batch_statement::do_
if (guardrail_state == query_processor::write_consistency_guardrail_state::FAIL) {
return make_exception_future<shared_ptr<cql_transport::messages::result_message>>(
exceptions::invalid_request_exception(
format("Consistency level {} is not allowed for write operations", cl)));
format("Write consistency level {} is forbidden by the current configuration "
"setting of write_consistency_levels_disallowed. Please use a different "
"consistency level, or remove {} from write_consistency_levels_disallowed "
"set in the configuration.", cl, cl)));
}
for (size_t i = 0; i < _statements.size(); ++i) {
@@ -277,7 +280,8 @@ future<shared_ptr<cql_transport::messages::result_message>> batch_statement::do_
_stats.statements_in_cas_batches += _statements.size();
return execute_with_conditions(qp, options, query_state).then([guardrail_state, cl] (auto result) {
if (guardrail_state == query_processor::write_consistency_guardrail_state::WARN) {
result->add_warning(format("Write with consistency level {} is warned by guardrail configuration", cl));
result->add_warning(format("Using write consistency level {} listed on the "
"write_consistency_levels_warned is not recommended.", cl));
}
return result;
});
@@ -297,7 +301,8 @@ future<shared_ptr<cql_transport::messages::result_message>> batch_statement::do_
}
auto result = make_shared<cql_transport::messages::result_message::void_message>();
if (guardrail_state == query_processor::write_consistency_guardrail_state::WARN) {
result->add_warning(format("Write with consistency level {} is warned by guardrail configuration", cl));
result->add_warning(format("Using write consistency level {} listed on the "
"write_consistency_levels_warned is not recommended.", cl));
}
return make_ready_future<shared_ptr<cql_transport::messages::result_message>>(std::move(result));
});
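The guardrail check above has three outcomes: `FAIL` rejects the write with an `invalid_request_exception`, `WARN` attaches a warning to the result message, and otherwise the write proceeds silently. A minimal standalone model of that three-way decision (the enum and helper are illustrative, not Scylla's actual API):

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>

enum class guardrail_state { OK, WARN, FAIL };

// Returns std::nullopt to proceed silently, a warning string for WARN,
// and throws for FAIL -- mirroring the three branches above.
std::optional<std::string> check_write_cl(guardrail_state s, const std::string& cl) {
    switch (s) {
    case guardrail_state::FAIL:
        throw std::invalid_argument(
            "Write consistency level " + cl +
            " is forbidden by write_consistency_levels_disallowed");
    case guardrail_state::WARN:
        return "Using write consistency level " + cl +
               " listed in write_consistency_levels_warned is not recommended.";
    default:
        return std::nullopt;
    }
}
```

Note that in the real code the same check is duplicated across the conditional (CAS) and non-conditional paths of both `batch_statement` and `modification_statement`, which is why the message appears several times in the diff.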

View File

@@ -59,6 +59,8 @@ const sstring cf_prop_defs::COMPACTION_ENABLED_KEY = "enabled";
const sstring cf_prop_defs::KW_TABLETS = "tablets";
const sstring cf_prop_defs::KW_STORAGE_ENGINE = "storage_engine";
schema::extensions_map cf_prop_defs::make_schema_extensions(const db::extensions& exts) const {
schema::extensions_map er;
for (auto& p : exts.schema_extensions()) {
@@ -106,6 +108,7 @@ void cf_prop_defs::validate(const data_dictionary::database db, sstring ks_name,
KW_BF_FP_CHANCE, KW_MEMTABLE_FLUSH_PERIOD, KW_COMPACTION,
KW_COMPRESSION, KW_CRC_CHECK_CHANCE, KW_ID, KW_PAXOSGRACESECONDS,
KW_SYNCHRONOUS_UPDATES, KW_TABLETS,
KW_STORAGE_ENGINE,
});
static std::set<sstring> obsolete_keywords({
sstring("index_interval"),
@@ -196,6 +199,20 @@ void cf_prop_defs::validate(const data_dictionary::database db, sstring ks_name,
}
db::tablet_options::validate(*tablet_options_map);
}
if (has_property(KW_STORAGE_ENGINE)) {
auto storage_engine = get_string(KW_STORAGE_ENGINE, "");
if (storage_engine == "logstor") {
if (!db.features().logstor) {
throw exceptions::configuration_exception(format("The experimental feature 'logstor' must be enabled in order to use the 'logstor' storage engine."));
}
if (!db.get_config().enable_logstor()) {
throw exceptions::configuration_exception(format("The configuration option 'enable_logstor' must be set to true in the configuration in order to use the 'logstor' storage engine."));
}
} else {
throw exceptions::configuration_exception(format("Illegal value for '{}'", KW_STORAGE_ENGINE));
}
}
}
std::map<sstring, sstring> cf_prop_defs::get_compaction_type_options() const {
@@ -396,6 +413,13 @@ void cf_prop_defs::apply_to_builder(schema_builder& builder, schema::extensions_
if (auto tablet_options_opt = get_map(KW_TABLETS)) {
builder.set_tablet_options(std::move(*tablet_options_opt));
}
if (has_property(KW_STORAGE_ENGINE)) {
auto storage_engine = get_string(KW_STORAGE_ENGINE, "");
if (storage_engine == "logstor") {
builder.set_logstor();
}
}
}
void cf_prop_defs::validate_minimum_int(const sstring& field, int32_t minimum_value, int32_t default_value) const

View File

@@ -64,6 +64,8 @@ public:
static const sstring KW_TABLETS;
static const sstring KW_STORAGE_ENGINE;
// FIXME: In origin the following consts are in CFMetaData.
static constexpr int32_t DEFAULT_DEFAULT_TIME_TO_LIVE = 0;
static constexpr int32_t DEFAULT_MIN_INDEX_INTERVAL = 128;

View File

@@ -9,6 +9,7 @@
*/
#include "cql3/statements/cf_prop_defs.hh"
#include "utils/assert.hh"
#include <inttypes.h>
#include <boost/regex.hpp>
@@ -266,6 +267,13 @@ std::unique_ptr<prepared_statement> create_table_statement::raw_statement::prepa
stmt_warning("CREATE TABLE WITH COMPACT STORAGE is deprecated and will eventually be removed in a future version.");
}
if (_properties.properties()->has_property(cf_prop_defs::KW_STORAGE_ENGINE)) {
auto storage_engine = _properties.properties()->get_string(cf_prop_defs::KW_STORAGE_ENGINE, "");
if (storage_engine == "logstor" && !_column_aliases.empty()) {
throw exceptions::configuration_exception("The 'logstor' storage engine cannot be used with tables that have clustering columns");
}
}
auto& key_aliases = _key_aliases[0];
std::vector<data_type> key_types;
for (auto&& alias : key_aliases) {

View File

@@ -273,7 +273,10 @@ modification_statement::do_execute(query_processor& qp, service::query_state& qs
if (guardrail_state == query_processor::write_consistency_guardrail_state::FAIL) {
co_return coroutine::exception(
std::make_exception_ptr(exceptions::invalid_request_exception(
format("Consistency level {} is not allowed for write operations", cl))));
format("Write consistency level {} is forbidden by the current configuration "
"setting of write_consistency_levels_disallowed. Please use a different "
"consistency level, or remove {} from write_consistency_levels_disallowed "
"set in the configuration.", cl, cl))));
}
_restrictions->validate_primary_key(options);
@@ -281,7 +284,8 @@ modification_statement::do_execute(query_processor& qp, service::query_state& qs
if (has_conditions()) {
auto result = co_await execute_with_condition(qp, qs, options);
if (guardrail_state == query_processor::write_consistency_guardrail_state::WARN) {
result->add_warning(format("Write with consistency level {} is warned by guardrail configuration", cl));
result->add_warning(format("Using write consistency level {} listed on the "
"write_consistency_levels_warned is not recommended.", cl));
}
co_return result;
}
@@ -303,7 +307,8 @@ modification_statement::do_execute(query_processor& qp, service::query_state& qs
auto result = seastar::make_shared<cql_transport::messages::result_message::void_message>();
if (guardrail_state == query_processor::write_consistency_guardrail_state::WARN) {
result->add_warning(format("Write with consistency level {} is warned by guardrail configuration", cl));
result->add_warning(format("Using write consistency level {} listed on the "
"write_consistency_levels_warned is not recommended.", cl));
}
if (keys_size_one) {
auto&& table = s->table();

View File

@@ -679,6 +679,8 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"The directory where hints files are stored if hinted handoff is enabled.")
, view_hints_directory(this, "view_hints_directory", value_status::Used, "",
"The directory where materialized-view updates are stored while a view replica is unreachable.")
, logstor_directory(this, "logstor_directory", value_status::Used, "",
"The directory where data files for logstor storage are stored.")
, saved_caches_directory(this, "saved_caches_directory", value_status::Unused, "",
"The directory location where table key and row caches are stored.")
/**
@@ -862,6 +864,14 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"* offheap_objects Native memory, eliminating NIO buffer heap overhead.")
, memtable_cleanup_threshold(this, "memtable_cleanup_threshold", value_status::Invalid, .11,
"Ratio of occupied non-flushing memtable size to total permitted size for triggering a flush of the largest memtable. Larger values mean larger flushes and less compaction, but also less concurrent flush activity, which can make it difficult to keep your disks saturated under heavy write load.")
, logstor_disk_size_in_mb(this, "logstor_disk_size_in_mb", value_status::Used, 2048,
"Total size in megabytes allocated for logstor storage on disk.")
, logstor_file_size_in_mb(this, "logstor_file_size_in_mb", value_status::Used, 32,
"Total size in megabytes allocated for each logstor data file on disk.")
, logstor_separator_delay_limit_ms(this, "logstor_separator_delay_limit_ms", value_status::Used, 100,
"Maximum delay in milliseconds for logstor separator debt control.")
, logstor_separator_max_memory_in_mb(this, "logstor_separator_max_memory_in_mb", value_status::Used, 256,
"Maximum memory in megabytes for logstor separator memory buffers.")
, file_cache_size_in_mb(this, "file_cache_size_in_mb", value_status::Unused, 512,
"Total memory to use for SSTable-reading buffers.")
, memtable_flush_queue_size(this, "memtable_flush_queue_size", value_status::Unused, 4,
@@ -1281,6 +1291,7 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, enable_in_memory_data_store(this, "enable_in_memory_data_store", value_status::Used, false, "Enable in memory mode (system tables are always persisted).")
, enable_cache(this, "enable_cache", value_status::Used, true, "Enable cache.")
, enable_commitlog(this, "enable_commitlog", value_status::Used, true, "Enable commitlog.")
, enable_logstor(this, "enable_logstor", value_status::Used, false, "Enable the logstor storage engine.")
, volatile_system_keyspace_for_testing(this, "volatile_system_keyspace_for_testing", value_status::Used, false, "Don't persist system keyspace - testing only!")
, api_port(this, "api_port", value_status::Used, 10000, "Http Rest API port.")
, api_address(this, "api_address", value_status::Used, "", "Http Rest API address.")
@@ -1692,6 +1703,7 @@ void db::config::setup_directories() {
maybe_in_workdir(data_file_directories, "data");
maybe_in_workdir(hints_directory, "hints");
maybe_in_workdir(view_hints_directory, "view_hints");
maybe_in_workdir(logstor_directory, "logstor");
maybe_in_workdir(saved_caches_directory, "saved_caches");
}
@@ -1861,7 +1873,8 @@ std::map<sstring, db::experimental_features_t::feature> db::experimental_feature
{"keyspace-storage-options", feature::KEYSPACE_STORAGE_OPTIONS},
{"tablets", feature::UNUSED},
{"views-with-tablets", feature::UNUSED},
{"strongly-consistent-tables", feature::STRONGLY_CONSISTENT_TABLES}
{"strongly-consistent-tables", feature::STRONGLY_CONSISTENT_TABLES},
{"logstor", feature::LOGSTOR}
};
}

View File

@@ -117,7 +117,8 @@ struct experimental_features_t {
ALTERNATOR_STREAMS,
BROADCAST_TABLES,
KEYSPACE_STORAGE_OPTIONS,
STRONGLY_CONSISTENT_TABLES
STRONGLY_CONSISTENT_TABLES,
LOGSTOR,
};
static std::map<sstring, feature> map(); // See enum_option.
static std::vector<enum_option<experimental_features_t>> all();
@@ -201,6 +202,7 @@ public:
named_value<uint64_t> data_file_capacity;
named_value<sstring> hints_directory;
named_value<sstring> view_hints_directory;
named_value<sstring> logstor_directory;
named_value<sstring> saved_caches_directory;
named_value<sstring> commit_failure_policy;
named_value<sstring> disk_failure_policy;
@@ -244,6 +246,10 @@ public:
named_value<bool> defragment_memory_on_idle;
named_value<sstring> memtable_allocation_type;
named_value<double> memtable_cleanup_threshold;
named_value<uint32_t> logstor_disk_size_in_mb;
named_value<uint32_t> logstor_file_size_in_mb;
named_value<uint32_t> logstor_separator_delay_limit_ms;
named_value<uint32_t> logstor_separator_max_memory_in_mb;
named_value<uint32_t> file_cache_size_in_mb;
named_value<uint32_t> memtable_flush_queue_size;
named_value<uint32_t> memtable_flush_writers;
@@ -364,6 +370,7 @@ public:
named_value<bool> enable_in_memory_data_store;
named_value<bool> enable_cache;
named_value<bool> enable_commitlog;
named_value<bool> enable_logstor;
named_value<bool> volatile_system_keyspace_for_testing;
named_value<uint16_t> api_port;
named_value<sstring> api_address;

View File

@@ -336,6 +336,8 @@ schema_ptr scylla_tables(schema_features features) {
// since it is written to only after the cluster feature is enabled.
sb.with_column("tablets", map_type_impl::get_instance(utf8_type, utf8_type, false));
sb.with_column("storage_engine", utf8_type);
sb.with_hash_version();
s = sb.build();
}
@@ -1676,6 +1678,9 @@ mutation make_scylla_tables_mutation(schema_ptr table, api::timestamp_type times
m.set_clustered_cell(ckey, cdef, make_map_mutation(map, cdef, timestamp));
}
}
if (table->logstor_enabled()) {
m.set_clustered_cell(ckey, "storage_engine", "logstor", timestamp);
}
// In-memory tables are deprecated since scylla-2024.1.0
// FIXME: delete the column when there's no live version supporting it anymore.
// Writing it here breaks upgrade rollback to versions that do not support the in_memory schema_feature
@@ -2161,6 +2166,13 @@ static void prepare_builder_from_scylla_tables_row(const schema_ctxt& ctxt, sche
auto tablet_options = db::tablet_options(*opt_map);
builder.set_tablet_options(tablet_options.to_map());
}
if (auto storage_engine = table_row.get<sstring>("storage_engine")) {
if (*storage_engine == "logstor") {
builder.set_logstor();
} else {
throw std::invalid_argument(format("Invalid value for storage_engine: {}", *storage_engine));
}
}
}
schema_ptr create_table_from_mutations(const schema_ctxt& ctxt, schema_mutations sm, const data_dictionary::user_types_storage& user_types, schema_ptr cdc_schema, std::optional<table_schema_version> version)

View File

@@ -3052,7 +3052,7 @@ future<service::topology> system_keyspace::load_topology_state(const std::unorde
co_return ret;
}
const bool strongly_consistent_tables = _db.features().strongly_consistent_tables;
const bool tablet_balancing_not_supported = _db.features().strongly_consistent_tables || _db.features().logstor;
for (auto& row : *rs) {
if (!row.has("host_id")) {
@@ -3289,7 +3289,7 @@ future<service::topology> system_keyspace::load_topology_state(const std::unorde
ret.session = service::session_id(some_row.get_as<utils::UUID>("session"));
}
if (strongly_consistent_tables) {
if (tablet_balancing_not_supported) {
ret.tablet_balancing_enabled = false;
} else if (some_row.has("tablet_balancing_enabled")) {
ret.tablet_balancing_enabled = some_row.get_as<bool>("tablet_balancing_enabled");


@@ -2647,7 +2647,7 @@ future<> view_builder::add_new_view(view_ptr view, build_step& step) {
}
if (this_shard_id() == smp::count - 1) {
co_await utils::get_local_injector().inject("add_new_view_pause_last_shard", utils::wait_for_message(5min));
inject_failure("add_new_view_fail_last_shard");
}
co_await _sys_ks.register_view_for_building(view->ks_name(), view->cf_name(), step.current_token());


@@ -30,6 +30,31 @@ enum class token_kind {
after_all_keys,
};
// Represents a token for partition keys.
// Has a disengaged state, which sorts before all engaged states.
struct raw_token {
int64_t value;
/// Constructs a disengaged token.
raw_token() : value(std::numeric_limits<int64_t>::min()) {}
/// Constructs an engaged token.
/// The token must be of token_kind::key kind.
explicit raw_token(const token&);
explicit raw_token(int64_t v) : value(v) {};
std::strong_ordering operator<=>(const raw_token& o) const noexcept = default;
std::strong_ordering operator<=>(const token& o) const noexcept;
/// Returns true iff engaged.
explicit operator bool() const noexcept {
return value != std::numeric_limits<int64_t>::min();
}
};
using raw_token_opt = seastar::optimized_optional<raw_token>;
class token {
// INT64_MIN is not a legal token, but a special value used to represent
// infinity in token intervals.
@@ -52,6 +77,10 @@ public:
constexpr explicit token(int64_t d) noexcept : token(kind::key, normalize(d)) {}
token(raw_token raw) noexcept
: token(raw ? kind::key : kind::before_all_keys, raw.value)
{ }
// This constructor seems redundant with the bytes_view constructor, but
// it's necessary for IDL, which passes a deserialized_bytes_proxy here.
// (deserialized_bytes_proxy is convertible to bytes&&, but not bytes_view.)
@@ -223,6 +252,29 @@ public:
}
};
inline
raw_token::raw_token(const token& t)
: value(t.raw())
{
#ifdef DEBUG
assert(t._kind == token::kind::key);
#endif
}
inline
std::strong_ordering raw_token::operator<=>(const token& o) const noexcept {
switch (o._kind) {
case token::kind::after_all_keys:
return std::strong_ordering::less;
case token::kind::before_all_keys:
// before_all_keys shares its raw value with a disengaged raw_token and sorts before all keys,
// so we can order them by comparing raw values alone.
[[fallthrough]];
case token::kind::key:
return value <=> o._data;
}
}
inline constexpr std::strong_ordering tri_compare_raw(const int64_t l1, const int64_t l2) noexcept {
if (l1 == l2) {
return std::strong_ordering::equal;
@@ -329,6 +381,17 @@ struct fmt::formatter<dht::token> : fmt::formatter<string_view> {
}
};
template <>
struct fmt::formatter<dht::raw_token> : fmt::formatter<string_view> {
template <typename FormatContext>
auto format(const dht::raw_token& t, FormatContext& ctx) const {
if (!t) {
return fmt::format_to(ctx.out(), "null");
}
return fmt::format_to(ctx.out(), "{}", t.value);
}
};
namespace std {
template<>


@@ -1 +1 @@
SCYLLA_NODE_EXPORTER_ARGS="--collector.interrupts --collector.ethtool.metrics-include='(bw_in_allowance_exceeded|bw_out_allowance_exceeded|conntrack_allowance_exceeded|conntrack_allowance_available|linklocal_allowance_exceeded)' --collector.ethtool --no-collector.hwmon --no-collector.bcache --no-collector.btrfs --no-collector.fibrechannel --no-collector.infiniband --no-collector.ipvs --no-collector.nfs --no-collector.nfsd --no-collector.powersupplyclass --no-collector.rapl --no-collector.tapestats --no-collector.thermal_zone --no-collector.udp_queues --no-collector.zfs"
SCYLLA_NODE_EXPORTER_ARGS="--collector.interrupts --collector.ethtool.metrics-include='(bw_in_allowance_exceeded|bw_out_allowance_exceeded|conntrack_allowance_exceeded|conntrack_allowance_available|linklocal_allowance_exceeded)' --collector.ethtool --collector.systemd --collector.systemd.unit-include='^(scylla-server|systemd-coredump.*)\.service$' --no-collector.hwmon --no-collector.bcache --no-collector.btrfs --no-collector.fibrechannel --no-collector.infiniband --no-collector.ipvs --no-collector.nfs --no-collector.nfsd --no-collector.powersupplyclass --no-collector.rapl --no-collector.tapestats --no-collector.thermal_zone --no-collector.udp_queues --no-collector.zfs"


@@ -139,7 +139,7 @@ The ``WHERE`` clause
~~~~~~~~~~~~~~~~~~~~
The ``WHERE`` clause specifies which rows must be queried. It is composed of relations on the columns that are part of
the ``PRIMARY KEY``.
the ``PRIMARY KEY``, and relations can be joined only with ``AND`` (``OR`` and other logical operators are not supported).
Not all relations are allowed in a query. For instance, non-equal relations (where ``IN`` is considered as an equal
relation) on a partition key are not supported (see the use of the ``TOKEN`` method below to do non-equal queries on
@@ -200,6 +200,23 @@ The tuple notation may also be used for ``IN`` clauses on clustering columns::
WHERE userid = 'john doe'
AND (blog_title, posted_at) IN (('John''s Blog', '2012-01-01'), ('Extreme Chess', '2014-06-01'))
This tuple notation is different from boolean grouping. For example, the following query is not supported::
SELECT * FROM users
WHERE (country = 'BR' AND state = 'SP')
because parentheses are only allowed around a single relation, so this works: ``(country = 'BR') AND (state = 'SP')``, but this does not: ``(country = 'BR' AND state = 'SP')``.
Similarly, an extended query of the form::
SELECT * FROM users
WHERE (country = 'BR' AND state = 'SP')
OR (country = 'BR' AND state = 'RJ')
won't work for two reasons: it groups relations with parentheses and it uses ``OR``, neither of which is supported. When possible,
rewrite such queries with ``IN`` on the varying column, for example
``country = 'BR' AND state IN ('SP', 'RJ')``, or run multiple queries and merge
the results client-side.
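
For example, the unsupported ``OR`` query above can be rewritten as::

    SELECT * FROM users
    WHERE country = 'BR' AND state IN ('SP', 'RJ');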
The ``CONTAINS`` operator may only be used on collection columns (lists, sets, and maps). In the case of maps,
``CONTAINS`` applies to the map values. The ``CONTAINS KEY`` operator may only be used on map columns and applies to the
map keys.

docs/cql/guardrails.rst (new file)

@@ -0,0 +1,236 @@
.. highlight:: cql
.. _cql-guardrails:
CQL Guardrails
==============
ScyllaDB provides a set of configurable guardrail parameters that help operators
enforce best practices and prevent misconfigurations that could degrade cluster
health, availability, or performance. Guardrails operate at two severity levels:
* **Warn**: The request succeeds, but the server includes a warning in the CQL
response. Depending on the specific guardrail, the warning may also be logged on the server side.
* **Fail**: The request is rejected with an error/exception (the specific type
depends on the guardrail). The user must correct the request or adjust the
guardrail configuration to proceed.
.. note::
Guardrails are checked only when a statement is
executed. They do not retroactively validate existing keyspaces, tables, or
previously completed writes.
For the full list of configuration properties, including types, defaults, and
liveness information, see :doc:`Configuration Parameters </reference/configuration-parameters>`.
.. _guardrails-replication-factor:
Replication Factor Guardrails
-----------------------------
These four parameters control the minimum and maximum allowed replication factor
(RF) values. They are evaluated whenever a ``CREATE KEYSPACE`` or
``ALTER KEYSPACE`` statement is executed. Each data center's RF is checked
individually.
An RF of ``0`` — which means "do not replicate to this data center" — is
always allowed and never triggers a guardrail.
A threshold value of ``-1`` disables the corresponding check.
``minimum_replication_factor_warn_threshold``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If any data center's RF is set to a value greater than ``0`` and lower than
this threshold, the server attaches a warning to the CQL response identifying
the offending data center and RF value.
**When to use.** The default of ``3`` is the standard recommendation for
production clusters. An RF below ``3`` means that the cluster cannot tolerate
even a single node failure without data loss or read unavailability (assuming
``QUORUM`` consistency). Keep this at ``3`` unless your deployment has specific
constraints (e.g., a development or test cluster with fewer than 3 nodes).
``minimum_replication_factor_fail_threshold``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If any data center's RF is set to a value greater than ``0`` and lower than
this threshold, the request is rejected with a ``ConfigurationException``
identifying the offending data center and RF value.
**When to use.** Enable this parameter (e.g., set to ``3``) in production
environments where allowing a low RF would be operationally dangerous. Unlike
the warn threshold, this provides a hard guarantee that no keyspace can be
created or altered to have an RF below the limit.
``maximum_replication_factor_warn_threshold``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If any data center's RF exceeds this threshold, the server attaches a warning to the CQL response identifying
the offending data center and RF value.
**When to use.** An excessively high RF increases write amplification and
storage costs proportionally. For example, an RF of ``5`` means every write
is replicated to five nodes. Set this threshold to alert operators who
may unintentionally set an RF that is too high.
``maximum_replication_factor_fail_threshold``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If any data center's RF exceeds this threshold, the request is rejected with a ``ConfigurationException``
identifying the offending data center and RF value.
**When to use.** Enable this parameter to prevent accidental creation of
keyspaces with an unreasonably high RF. An extremely high RF wastes storage and
network bandwidth and can lead to write latency spikes. This is a hard limit —
the keyspace creation or alteration will not proceed until the RF is lowered.
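
As an illustrative example (the values shown are not the defaults), all four
thresholds can be set together in ``scylla.yaml``::

    minimum_replication_factor_warn_threshold: 3
    minimum_replication_factor_fail_threshold: 2
    maximum_replication_factor_warn_threshold: 5
    maximum_replication_factor_fail_threshold: -1   # disabled

With this configuration, a keyspace created with an RF of ``1`` in some data
center is rejected, an RF of ``2`` only produces a warning, and an RF above
``5`` also only produces a warning.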
**Metrics.** ScyllaDB exposes per-shard metrics that track the number of
times each replication factor guardrail has been triggered:
* ``scylla_cql_minimum_replication_factor_warn_violations``
* ``scylla_cql_minimum_replication_factor_fail_violations``
* ``scylla_cql_maximum_replication_factor_warn_violations``
* ``scylla_cql_maximum_replication_factor_fail_violations``
A sustained increase in any of these metrics indicates that
``CREATE KEYSPACE`` or ``ALTER KEYSPACE`` requests are hitting the configured
thresholds.
.. _guardrails-replication-strategy:
Replication Strategy Guardrails
-------------------------------
These two parameters control which replication strategies trigger warnings or
are rejected when a keyspace is created or altered.
``replication_strategy_warn_list``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If the replication strategy used in a ``CREATE KEYSPACE`` or ``ALTER KEYSPACE``
statement is on this list, the server attaches a warning to the CQL response
identifying the discouraged strategy and the affected keyspace.
**When to use.** ``SimpleStrategy`` is not recommended for production use.
It places replicas without awareness of data center or rack topology, which
can undermine fault tolerance in multi-DC deployments. Even in single-DC
deployments, ``NetworkTopologyStrategy`` is recommended because it keeps the
schema ready for future topology changes.
The default configuration warns on ``SimpleStrategy``, which is appropriate
for most deployments. If you have existing keyspaces that use
``SimpleStrategy``, see :doc:`Update Topology Strategy From Simple to Network
</operating-scylla/procedures/cluster-management/update-topology-strategy-from-simple-to-network>`
for the migration procedure.
``replication_strategy_fail_list``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If the replication strategy used in a ``CREATE KEYSPACE`` or ``ALTER KEYSPACE``
statement is on this list, the request is rejected with a
``ConfigurationException`` identifying the forbidden strategy and the affected
keyspace.
**When to use.** In production environments, add ``SimpleStrategy`` to this
list to enforce ``NetworkTopologyStrategy`` across all keyspaces. This helps
prevent new production keyspaces from being created with a topology-unaware
strategy.
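
A sketch of such a configuration in ``scylla.yaml`` (assuming the usual YAML
list syntax for list-typed options)::

    replication_strategy_warn_list:
      - SimpleStrategy
    replication_strategy_fail_list:
      - SimpleStrategy

With this in place, any ``CREATE KEYSPACE`` or ``ALTER KEYSPACE`` statement
using ``SimpleStrategy`` is rejected with a ``ConfigurationException``.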
**Metrics.** The following per-shard metrics track replication strategy
guardrail violations:
* ``scylla_cql_replication_strategy_warn_list_violations``
* ``scylla_cql_replication_strategy_fail_list_violations``
.. _guardrails-write-consistency-level:
Write Consistency Level Guardrails
----------------------------------
These two parameters control which consistency levels (CL) are allowed for
write operations (``INSERT``, ``UPDATE``, ``DELETE``, and ``BATCH``
statements).
Be aware that adding warnings to CQL responses can significantly increase
network traffic and reduce overall throughput.
``write_consistency_levels_warned``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If a write operation uses a consistency level on this list, the server attaches
a warning to the CQL response identifying the discouraged consistency level.
**When to use.** Use this parameter to alert application developers when they
use a consistency level that, while technically functional, is not recommended
for the workload. Common examples:
* **Warn on** ``ANY``: writes at ``ANY`` are acknowledged as soon as at least
one node (including a coordinator acting as a hinted handoff store) receives
the mutation. This means data may not be persisted on any replica node at
the time of acknowledgement, risking data loss if the coordinator fails
before hinted handoff completes.
* **Warn on** ``ALL``: writes at ``ALL`` require every replica to acknowledge
the write. If any single replica is down, the write fails. This significantly
reduces write availability.
``write_consistency_levels_disallowed``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If a write operation uses a consistency level on this list, the request is
rejected with an ``InvalidRequestException`` identifying the forbidden
consistency level.
**When to use.** Use this parameter to hard-block consistency levels that are
considered unsafe for your deployment:
* **Disallow** ``ANY``: in production environments, ``ANY`` is almost never
appropriate. It provides the weakest durability guarantee and is a common
source of data-loss incidents when operators or application developers use it
unintentionally.
* **Disallow** ``ALL``: in clusters where high write availability is critical,
blocking ``ALL`` prevents a single node failure from causing write
unavailability.
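
An illustrative configuration combining both options (not the defaults)::

    write_consistency_levels_warned:
      - ALL
    write_consistency_levels_disallowed:
      - ANY

With this in place, a write issued at ``ANY`` fails with an
``InvalidRequestException``, while a write at ``ALL`` succeeds but carries a
warning in the CQL response.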
**Metrics.** The following per-shard metrics track write consistency level
guardrail violations:
* ``scylla_cql_write_consistency_levels_warned_violations``
* ``scylla_cql_write_consistency_levels_disallowed_violations``
Additionally, ScyllaDB exposes the
``scylla_cql_writes_per_consistency_level`` metric, labeled by consistency
level, which tracks the total number of write requests per CL. This metric is
useful for understanding the current write-CL distribution across the cluster
*before* deciding which levels to warn on or disallow. For example, querying
this metric can reveal whether any application is inadvertently using ``ANY``
or ``ALL`` for writes.
.. _guardrails-compact-storage:
Compact Storage Guardrail
-------------------------
``enable_create_table_with_compact_storage``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This boolean parameter controls whether ``CREATE TABLE`` statements with the
deprecated ``COMPACT STORAGE`` option are allowed. Unlike the other guardrails,
it acts as a simple on/off switch rather than using separate warn and fail
thresholds.
**When to use.** Leave this at the default (``false``) for all new
deployments. ``COMPACT STORAGE`` is a legacy feature that will be permanently
removed in a future version of ScyllaDB. Set to ``true`` only if you have a specific,
temporary need to create compact storage tables (e.g., compatibility with legacy
applications during a migration). For details on the ``COMPACT STORAGE`` option, see
:ref:`Compact Tables <compact-tables>` in the Data Definition documentation.
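
For example, with the option at its default of ``false``, a statement such
as::

    CREATE TABLE ks.legacy_kv (pk int PRIMARY KEY, v blob) WITH COMPACT STORAGE;

is rejected; it is accepted only after
``enable_create_table_with_compact_storage`` is set to ``true``.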
Additional References
---------------------
* :doc:`Consistency Level </cql/consistency>`
* :doc:`Data Definition (CREATE/ALTER KEYSPACE) </cql/ddl>`
* :doc:`How to Safely Increase the Replication Factor </kb/rf-increase>`
* :doc:`Metrics Reference </reference/metrics>`


@@ -17,6 +17,7 @@ CQL Reference
secondary-indexes
time-to-live
functions
guardrails
wasm
json
mv
@@ -46,6 +47,7 @@ It allows you to create keyspaces and tables, insert and query tables, and more.
* :doc:`Data Types </cql/types>`
* :doc:`Definitions </cql/definitions>`
* :doc:`Global Secondary Indexes </cql/secondary-indexes>`
* :doc:`CQL Guardrails </cql/guardrails>`
* :doc:`Expiring Data with Time to Live (TTL) </cql/time-to-live>`
* :doc:`Functions </cql/functions>`
* :doc:`JSON Support </cql/json>`


@@ -1,347 +1,111 @@
# Prototype design: auditing all keyspaces and per-role auditing
# Introduction
## Summary
Similar to the approach described in CASSANDRA-12151, we add the
concept of an audit specification. An audit has a target (syslog or a
table) and a set of events/actions that it wants recorded. We
introduce new CQL syntax for Scylla users to describe and manipulate
audit specifications.
Extend the existing `scylla.yaml`-driven audit subsystem with two focused capabilities:
Prior art:
- Microsoft SQL Server [audit
description](https://docs.microsoft.com/en-us/sql/relational-databases/security/auditing/sql-server-audit-database-engine?view=sql-server-ver15)
- pgAudit [docs](https://github.com/pgaudit/pgaudit/blob/master/README.md)
- MySQL audit_log docs in
[MySQL](https://dev.mysql.com/doc/refman/8.0/en/audit-log.html) and
[Azure](https://docs.microsoft.com/en-us/azure/mysql/concepts-audit-logs)
- DynamoDB can [use CloudTrail](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/logging-using-cloudtrail.html) to log all events
1. allow auditing **all keyspaces** without enumerating them one by one
2. allow auditing only a configured set of **roles**
# CQL extensions
The prototype should stay close to the current implementation in `audit/`:
## Create an audit
- keep the existing backends (`table`, `syslog`, or both)
- keep the existing category / keyspace / table filters
- preserve live updates for audit configuration
- avoid any schema change to `audit.audit_log`
This is intentionally a small extension of the current auditing model, not a redesign around new CQL statements such as `CREATE AUDIT`.
## Motivation
Today Scylla exposes three main audit selectors:
- `audit_categories`
- `audit_tables`
- `audit_keyspaces`
This leaves two operational gaps:
1. **Auditing all keyspaces is cumbersome.**
Large installations may create keyspaces dynamically, or manage many tenant keyspaces. Requiring operators to keep
`audit_keyspaces` synchronized with the full keyspace list is error-prone and defeats the point of cluster-wide auditing.
2. **Auditing is all-or-nothing with respect to users.**
Once a category/keyspace/table combination matches, any authenticated user generating that traffic is audited.
Operators want to narrow the scope to specific tenants, service accounts, or privileged roles.
These two additions also work well together: "audit all keyspaces, but only for selected roles" is a practical way to reduce
both audit volume and performance impact.
## Goals
- Add a way to express "all keyspaces" in the current configuration model.
- Add a new role filter that limits auditing to selected roles.
- Preserve backwards compatibility for existing configurations.
- Keep the evaluation cheap on the request path.
- Support live configuration updates, consistent with the existing audit options.
## Non-goals
- Introducing `CREATE AUDIT`, `ALTER AUDIT`, or other new CQL syntax.
- Adding per-role audit destinations.
- Adding different categories per role.
- Expanding role matching through the full granted-role graph in the prototype.
- Changing the on-disk audit table schema.
## Current behavior
At the moment, audit logging is controlled by:
- `audit`
- `audit_categories`
- `audit_tables`
- `audit_keyspaces`
The current decision rule in `audit::should_log()` is effectively:
```text
category matches
&& (
keyspace is listed in audit_keyspaces
|| table is listed in audit_tables
|| category in {AUTH, ADMIN, DCL}
)
```

```cql
CREATE AUDIT [IF NOT EXISTS] audit-name WITH TARGET { SYSLOG | table-name }
[ AND TRIGGER KEYSPACE IN (ks1, ks2, ks3) ]
[ AND TRIGGER TABLE IN (tbl1, tbl2, tbl3) ]
[ AND TRIGGER ROLE IN (usr1, usr2, usr3) ]
[ AND TRIGGER CATEGORY IN (cat1, cat2, cat3) ]
;
```
Observations:
From this point on, every database event that matches all present
triggers will be recorded in the target. When the target is a table,
it behaves like the [current
design](https://docs.scylladb.com/operating-scylla/security/auditing/#table-storage).
- `AUTH`, `ADMIN`, and `DCL` are already global once their category is enabled.
- `DDL`, `DML`, and `QUERY` need a matching keyspace or table.
- An empty `audit_keyspaces` means "audit no keyspaces", not "audit every keyspace".
- There is no role-based filter; the authenticated user is recorded in the log but is not part of the decision.
- The exact implementation to preserve is in `audit/audit.cc` (`should_log()`, `inspect()`, and `inspect_login()`).
The audit name must be different from all other audits, unless IF NOT
EXISTS precedes it, in which case the existing audit must be identical
to the new definition. Case sensitivity and length limit are the same
as for table names.
## Proposed configuration
A trigger kind (i.e., `KEYSPACE`, `TABLE`, `ROLE`, or `CATEGORY`) can be
specified at most once.
### 1. Add `audit_all_keyspaces`
## Show an audit
Introduce a new live-update boolean option:
Examples:
```yaml
# Audit all keyspaces for matching categories
audit_all_keyspaces: true
# Audit all keyspaces for selected roles
audit_all_keyspaces: true
audit_roles: "alice,bob"
```

```cql
DESCRIBE AUDIT [audit-name ...];
```
Semantics:
Prints definitions of all audits named herein. If no names are
provided, prints all audits.
- `audit_all_keyspaces: false` keeps the existing behavior.
- `audit_all_keyspaces: true` makes every keyspace match.
- `audit_keyspaces` keeps its existing meaning: an explicit list of keyspaces, or no keyspace-wide auditing when left empty.
- `audit_all_keyspaces: true` and a non-empty `audit_keyspaces` must be rejected as invalid configuration,
because the two options express overlapping scope in different ways.
- A dedicated boolean is preferable to overloading `audit_keyspaces`, because it avoids changing the meaning of existing configurations.
- This also keeps the behavior aligned with today's `audit_tables` handling, where leaving `audit_tables` empty does not introduce a new wildcard syntax.
## Delete an audit
### 2. Add `audit_roles`
Introduce a new live-update configuration option:
```yaml
audit_roles: "alice,bob,service_api"
```

```cql
DROP AUDIT audit-name;
```
Semantics:
Stops logging events specified by this audit. Doesn't impact the
already logged events. If the target is a table, it remains as it is.
- empty `audit_roles` means **no role filtering**, preserving today's behavior
- non-empty `audit_roles` means audit only requests whose effective logged username matches one of the configured roles
- matching is byte-for-byte exact, using the same role name that is already written to the audit record's `username` column / syslog field
- the prototype should compare against the post-authentication role name from the session and audit log,
with no additional case folding or role-graph expansion
## Alter an audit
Examples:
```yaml
# Audit all roles in a single keyspace (current behavior, made explicit)
audit_keyspaces: "ks1"
audit_roles: ""
# Audit two roles across all keyspaces
audit_all_keyspaces: true
audit_roles: "alice,bob"
# Audit a service role, but only for selected tables
audit_tables: "ks1.orders,ks1.payments"
audit_roles: "billing_service"
```

```cql
ALTER AUDIT audit-name WITH {same syntax as CREATE}
```
## Decision rule after the change
Any trigger provided will be updated (or newly created, if previously
absent). To drop a trigger, use `IN *`.
After the prototype, the rule becomes:
## Permissions
```text
category matches
&& role matches
&& (
category in {AUTH, ADMIN, DCL}
|| audit_all_keyspaces
|| keyspace is listed in audit_keyspaces
|| table is listed in audit_tables
)
```
Only superusers can modify audits or turn them on and off.
Where:
Only superusers can read tables that are audit targets; no user can
modify them. Only superusers can drop tables that are audit targets,
after the audit itself is dropped. If a superuser doesn't drop a
target table, it remains in existence indefinitely.
- `role matches` is always true when `audit_roles` is empty
- `audit_all_keyspaces` is true when the new boolean option is enabled
# Implementation
For login auditing, the rule is simply:
```text
AUTH category enabled && role matches(login username)
```
## Implementation details
### Configuration parsing
Add a new config entry:
- `db::config::audit_all_keyspaces`
- `db::config::audit_roles`
It should mirror the existing audit selectors:
- `audit_all_keyspaces`: type `named_value<bool>`, liveness `LiveUpdate`, default `false`
- `audit_roles`: type `named_value<sstring>`, liveness `LiveUpdate`, default empty string
Parsing changes:
- keep `parse_audit_tables()` as-is
- keep `parse_audit_keyspaces()` semantics as-is
- add `parse_audit_roles()` that returns a set of role names
- normalize empty or whitespace-only keyspace lists to an empty configuration rather than treating them as real keyspace names
- add cross-field validation so `audit_all_keyspaces: true` cannot be combined with a non-empty
`audit_keyspaces`, both at startup and during live updates
To avoid re-parsing on every request, the `audit::audit` service should store:
## Efficient trigger evaluation
```c++
bool _audit_all_keyspaces;
std::set<sstring> _audited_keyspaces;
std::set<sstring> _audited_roles;
namespace audit {
/// Stores triggers from an AUDIT statement.
class triggers {
// Use trie structures for speedy string lookup.
optional<trie> _ks_trigger, _tbl_trigger, _usr_trigger;
// A logical-AND filter.
optional<unsigned> _cat_trigger;
public:
/// True iff every non-null trigger matches the corresponding ainf element.
bool should_audit(const audit_info& ainf);
};
} // namespace audit
```
Using a dedicated boolean keeps the hot-path check straightforward and avoids reinterpreting the existing
`_audited_keyspaces` selector.
To prevent modification of target tables, `audit::inspect()` will
check the statement and throw if it is disallowed, similar to what
`check_access()` currently does.
Using `std::set` for the explicit selectors keeps the prototype aligned with the current implementation and minimizes code churn.
If profiling later shows lookup cost matters here, the container choice can be revisited independently of the feature semantics.
## Persisting audit definitions
### Audit object changes
The current `audit_info` already carries:
- category
- keyspace
- table
- query text
The username is available separately from `service::query_state` and is already passed to storage helpers when an entry is written.
For the prototype there is no need to duplicate the username into `audit_info`.
Instead:
- change `should_log()` to take the effective username as an additional input
- change `should_log_login()` to check the username against `audit_roles`
- keep the storage helpers unchanged, because they already persist the username
- update the existing internal call sites in `inspect()` and `inspect_login()` to pass the username through
One possible interface shape is:
```c++
bool should_log(std::string_view username, const audit_info* info) const;
bool should_log_login(std::string_view username) const;
```
### Role semantics
For the prototype, "role" means the role name already associated with the current client session:
- successful authenticated sessions use the session's user name
- failed login events use the login name from the authentication attempt
- failed login events are still subject to `audit_roles`, matched against the attempted login name
This keeps the feature easy to explain and aligns the filter with what users already see in audit output.
The prototype should **not** try to expand inherited roles. If a user logs in as `alice` and inherits permissions from another role,
the audit filter still matches `alice`. This keeps the behavior deterministic and avoids expensive role graph lookups on the request path.
### Keyspace semantics
`audit_all_keyspaces: true` should affect any statement whose `audit_info` carries a keyspace name.
Important consequences:
- it makes `DDL` / `DML` / `QUERY` auditing effectively cluster-wide
- it does not change the existing global handling of `AUTH`, `ADMIN`, and `DCL`
- statements that naturally have no keyspace name continue to depend on their category-specific behavior
No extra schema or metadata scan is required: the request already carries the keyspace information needed for the decision.
## Backwards compatibility
This design keeps existing behavior intact:
- existing clusters that do not set `audit_roles` continue to audit all roles
- existing clusters that leave `audit_keyspaces` empty continue to audit no keyspaces
- existing explicit keyspace/table lists keep their current meaning
The feature is enabled only by a new explicit boolean, so existing `audit_keyspaces` values do not need to be reinterpreted.
The only newly-invalid combination is enabling `audit_all_keyspaces` while also listing explicit keyspaces.
## Operational considerations
### Performance and volume
`audit_all_keyspaces: true` can significantly increase audit volume, especially with `QUERY` and `DML`.
The intended mitigation is to combine it with:
- a narrow `audit_categories`
- a narrow `audit_roles`
That combination gives operators a simple and cheap filter model:
- first by category
- then by role
- then by keyspace/table scope
### Live updates
`audit_roles` should follow the same live-update behavior as the current audit filters.
Changing:
- `audit_roles`
- `audit_all_keyspaces`
- `audit_keyspaces`
- `audit_tables`
- `audit_categories`
should update the in-memory selectors on all shards without restarting the node.
### Prototype limitation
Because matching is done against the authenticated session role name, `audit_roles` cannot express "audit everyone who inherits role X".
Operators must list the concrete login roles they want to audit. This is a deliberate trade-off in the prototype to keep matching cheap
and avoid role graph lookups on every audited request.
Example: if `alice` inherits permissions from `admin_role`, configuring `audit_roles: "admin_role"` would not audit requests from
`alice`; to audit those requests, `alice` itself must be listed.
### Audit table schema
No schema change is needed. The audit table already includes `username`, which is sufficient for both storage and later analysis.
## Testing plan
The prototype should extend existing audit coverage rather than introduce a separate test framework.
### Parser / unit coverage
Add focused tests for:
- empty `audit_roles`
- specific `audit_roles`
- `audit_all_keyspaces: true`
- invalid mixed configuration: `audit_all_keyspaces: true` with non-empty `audit_keyspaces`
- empty or whitespace-only keyspace lists such as `",,,"` or `" "`, which should normalize to an empty configuration and therefore audit no keyspaces
- boolean config parsing for `audit_all_keyspaces`
### Behavioral coverage
Extend the existing audit tests in `test/cluster/dtest/audit_test.py` with scenarios such as:
1. `audit_all_keyspaces: true` audits statements in multiple keyspaces without listing them explicitly
2. `audit_roles: "alice"` logs requests from `alice` but not from `bob`
3. `audit_all_keyspaces: true` + `audit_roles: "alice"` only logs `alice`'s traffic cluster-wide
4. login auditing respects `audit_roles`
5. live-updating `audit_roles` changes behavior without restart
6. setting `audit_all_keyspaces: true` together with explicit `audit_keyspaces` is rejected with a clear error
## Future evolution
This prototype is deliberately small, but it fits a broader audit-spec design if we decide to revisit that later.
In a future CQL-driven design, these two additions map naturally to triggers such as:
- `TRIGGER KEYSPACE IN *`
- `TRIGGER ROLE IN (...)`
That means the prototype is not throwaway work: it improves the current operational model immediately while keeping a clean path
toward richer audit objects in the future.
Obviously, an audit definition must survive a server restart and stay
consistent among all nodes in a cluster. We'll accomplish both by
storing audits in a system table.

docs/dev/logstor.md Normal file

@@ -0,0 +1,124 @@
# Logstor
## Introduction
Logstor is a log-structured storage engine for ScyllaDB optimized for key-value workloads. It provides an alternative storage backend for key-value tables - tables with a partition key only, with no clustering columns.
Unlike the traditional LSM-tree based storage, logstor uses a log-structured approach with in-memory indexing, making it particularly suitable for workloads with frequent overwrites and point lookups.
## Architecture
Logstor consists of several key components:
### Components
#### Primary Index
The primary index resides entirely in memory and maps each partition key to its location in the log segments. It consists of one B-tree per table, ordered by token.
#### Segment Manager
The `segment_manager` handles the allocation and management of fixed-size segments (default 128KB). Segments are grouped into large files (default 32MB). Key responsibilities include:
- **Segment allocation**: Provides segments for writing new data
- **Space reclamation**: Tracks free space in each segment
- **Compaction**: Copies live data from sparse segments to reclaim space
- **Recovery**: Scans segments on startup to rebuild the index
- **Separator**: Rewrites segments containing records from different compaction groups into new segments, each holding records from a single compaction group.
The data in the segments consists of records of type `log_record`. Each record contains the value for some key as a `canonical_mutation` and additional metadata.
The `segment_manager` receives new writes via a `write_buffer` and writes them sequentially to the active segment with 4k-block alignment.
#### Write Buffer
The `write_buffer` manages a buffer of log records and handles their serialization, including headers and alignment. Multiple records can be written into the buffer, which is then handed to the segment manager as a unit.
The `buffered_writer` manages multiple write buffers for user writes, an active buffer and multiple flushing ones, to batch writes and manage backpressure.
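The 4k-block alignment mentioned above is a standard align-up computation. The sketch below is illustrative only; actual record headers and padding in logstor differ:

```python
BLOCK_SIZE = 4096  # 4k block alignment for segment writes

def align_up(n, alignment=BLOCK_SIZE):
    # Round n up to the next multiple of alignment.
    return (n + alignment - 1) // alignment * alignment

assert align_up(0) == 0
assert align_up(1) == 4096
assert align_up(4096) == 4096
assert align_up(4097) == 8192
```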
### Data Flow
**Write Path:**
1. Application writes mutation to logstor
2. Mutation is converted to a log record
3. Record is written to write buffer
4. The buffer is switched and written to the active segment.
5. Index is updated with new record locations
6. Old record locations (for overwrites) are marked as free
**Read Path:**
1. Application requests data for a partition key
2. Index lookup returns record location
3. Segment manager reads record from disk
4. Record is deserialized into a mutation and returned
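The write and read paths above boil down to an in-memory map from key to log location. The toy model below uses a flat dict instead of a per-table B-tree and a Python list standing in for on-disk segments; it is an assumption-laden sketch, not logstor's data layout:

```python
log = []        # stands in for segments on disk (append-only)
index = {}      # stands in for the in-memory primary index
free = set()    # locations of overwritten (dead) records

def write(key, value):
    if key in index:
        free.add(index[key])   # old location becomes reclaimable
    index[key] = len(log)      # index points at the newest record
    log.append((key, value))

def read(key):
    return log[index[key]][1]

write("k1", "v1")
write("k1", "v2")              # overwrite: old record marked free
assert read("k1") == "v2"
assert free == {0}
```

This is why the engine suits workloads with frequent overwrites: a write never touches old data in place, it only appends and marks the previous location free for later compaction.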
**Separator:**
1. When a record is written to the active segment, it is also written to its compaction group's separator buffer. The separator buffer holds a reference to the original segment.
2. The separator buffer is flushed when it is full, or when a flush is requested for another reason. Its contents are written into a new segment in the compaction group, and the index locations of the records are updated from the original mixed segments to the new segments in the compaction group.
3. After the separator buffer is flushed and all records from the original segment have been moved, it releases its reference to the segment. When there are no more references to the segment, it is freed.
**Compaction:**
1. The amount of live data is tracked for each segment in its segment_descriptor. The segment descriptors are stored in a histogram by live data.
2. A segment set from a single compaction group is submitted for compaction.
3. Compaction picks segments for compaction from the segment set. It chooses segments with the lowest utilization such that compacting them results in a net gain of free segments.
4. It reads the segments, finds all live records, and writes them into a write buffer. When the buffer is full, it is flushed into a new segment, and each record's index entry is updated to point to the new location.
5. After all live records are rewritten the old segments are freed.
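Step 3's selection policy can be sketched as a greedy pass over segments sorted by live data, keeping the prefix with the largest net gain of free segments. This is a simplified model under the default 128KB segment size; the real picker works on a histogram of segment descriptors:

```python
SEGMENT_SIZE = 128 * 1024  # default logstor segment size

def pick_for_compaction(live_bytes):
    """live_bytes: live-data byte count per candidate segment.
    Greedily take lowest-utilization segments first and keep the
    prefix whose compaction frees the most segments on net."""
    order = sorted(range(len(live_bytes)), key=lambda i: live_bytes[i])
    best_k, best_gain, live = 0, 0, 0
    for k, i in enumerate(order, start=1):
        live += live_bytes[i]
        needed = -(-live // SEGMENT_SIZE)  # ceil: segments for rewritten data
        gain = k - needed                  # segments freed minus segments used
        if gain > best_gain:
            best_gain, best_k = gain, k
    return order[:best_k]

# Two sparse segments compact into one new segment (net gain 1);
# the nearly-full and full segments are left alone.
assert pick_for_compaction([10 * 1024, 20 * 1024, 120 * 1024, 128 * 1024]) == [0, 1]
```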
## Usage
### Enabling Logstor
To use logstor, enable it in the configuration:
```yaml
enable_logstor: true
experimental_features:
- logstor
```
### Creating Tables
Tables using logstor must have no clustering columns and must be created with the `storage_engine` property set to 'logstor':
```cql
CREATE TABLE keyspace.user_profiles (
user_id uuid PRIMARY KEY,
name text,
email text,
metadata frozen<map<text, text>>
) WITH storage_engine = 'logstor';
```
### Basic Operations
**Insert/Update:**
```cql
INSERT INTO keyspace.table_name (pk, v) VALUES (1, 'value1');
INSERT INTO keyspace.table_name (pk, v) VALUES (2, 'value2');
-- Overwrite with new value
INSERT INTO keyspace.table_name (pk, v) VALUES (1, 'updated_value');
```
Currently, updates must write the full row. Updating individual columns is not yet supported. Each write replaces the entire partition.
**Select:**
```cql
SELECT * FROM keyspace.table_name WHERE pk = 1;
-- Returns: (1, 'updated_value')
SELECT pk, v FROM keyspace.table_name WHERE pk = 2;
-- Returns: (2, 'value2')
SELECT * FROM keyspace.table_name;
-- Returns: (1, 'updated_value'), (2, 'value2')
```
**Delete:**
```cql
DELETE FROM keyspace.table_name WHERE pk = 1;
```


@@ -52,7 +52,7 @@ Install ScyllaDB
.. code-block:: console
:substitutions:
sudo wget -O /etc/apt/sources.list.d/scylla.list http://downloads.scylladb.com/deb/debian/|UBUNTU_SCYLLADB_LIST|
sudo wget -O /etc/apt/sources.list.d/scylla.list https://downloads.scylladb.com/deb/debian/|UBUNTU_SCYLLADB_LIST|
#. Install ScyllaDB packages.
@@ -125,7 +125,7 @@ Install ScyllaDB
.. code-block:: console
:substitutions:
sudo curl -o /etc/yum.repos.d/scylla.repo -L http://downloads.scylladb.com/rpm/centos/|CENTOS_SCYLLADB_REPO|
sudo curl -o /etc/yum.repos.d/scylla.repo -L https://downloads.scylladb.com/rpm/centos/|CENTOS_SCYLLADB_REPO|
#. Install ScyllaDB packages.
@@ -133,19 +133,19 @@ Install ScyllaDB
sudo yum install scylla
Running the command installs the latest official version of ScyllaDB Open Source.
Alternatively, you can to install a specific patch version:
Running the command installs the latest official version of ScyllaDB.
Alternatively, you can install a specific patch version:
.. code-block:: console
sudo yum install scylla-<your patch version>
Example: The following example shows the command to install ScyllaDB 5.2.3.
Example: The following example shows installing ScyllaDB 2025.3.1.
.. code-block:: console
:class: hide-copy-button
sudo yum install scylla-5.2.3
sudo yum install scylla-2025.3.1
.. include:: /getting-started/_common/setup-after-install.rst


@@ -36,11 +36,8 @@ release versions, run:
curl -sSf get.scylladb.com/server | sudo bash -s -- --list-active-releases
Versions 2025.1 and Later
==============================
Run the command with the ``--scylla-version`` option to specify the version
you want to install.
To install a non-default version, run the command with the ``--scylla-version``
option to specify the version you want to install.
**Example**
@@ -50,20 +47,4 @@ you want to install.
curl -sSf get.scylladb.com/server | sudo bash -s -- --scylla-version |CURRENT_VERSION|
Versions Earlier than 2025.1
================================
To install a supported version of *ScyllaDB Enterprise*, run the command with:
* ``--scylla-product scylla-enterprise`` to specify that you want to install
ScyllaDB Entrprise.
* ``--scylla-version`` to specify the version you want to install.
For example:
.. code:: console
curl -sSf get.scylladb.com/server | sudo bash -s -- --scylla-product scylla-enterprise --scylla-version 2024.1
.. include:: /getting-started/_common/setup-after-install.rst


@@ -57,12 +57,11 @@ To enable shared dictionaries:
internode_compression_enable_advanced: true
rpc_dict_training_when: when_leader
.. warning:: Enabling shared dictionary training might leak unencrypted data to disk.
.. note::
Trained dictionaries contain randomly chosen samples of data transferred between
nodes. The data samples are persisted in the Raft log, which is not encrypted.
As a result, some data from otherwise encrypted tables might be stored on disk
unencrypted.
Some dictionary training data may be encrypted using storage-level encryption
(if enabled) instead of database-level encryption, meaning protection is
applied at the storage layer rather than within the database itself.
Reference


@@ -727,7 +727,12 @@ public:
// now we need one page more to be able to save one for next lap
auto fill_size = align_up(buf1.size(), block_size) + block_size - buf1.size();
auto buf2 = co_await _input.read_exactly(fill_size);
// If the underlying stream is already at EOF (e.g. buf1 came from
// cached _next while the previous read_exactly drained the source),
// skip the read_exactly call — it would return empty anyway.
auto buf2 = _input.eof()
? temporary_buffer<char>()
: co_await _input.read_exactly(fill_size);
temporary_buffer<char> output(buf1.size() + buf2.size());


@@ -172,6 +172,7 @@ public:
gms::feature rack_list_rf { *this, "RACK_LIST_RF"sv };
gms::feature driver_service_level { *this, "DRIVER_SERVICE_LEVEL"sv };
gms::feature strongly_consistent_tables { *this, "STRONGLY_CONSISTENT_TABLES"sv };
gms::feature logstor { *this, "LOGSTOR"sv };
gms::feature client_routes { *this, "CLIENT_ROUTES"sv };
gms::feature removenode_with_left_token_ring { *this, "REMOVENODE_WITH_LEFT_TOKEN_RING"sv };
gms::feature size_based_load_balancing { *this, "SIZE_BASED_LOAD_BALANCING"sv };


@@ -48,6 +48,7 @@ set(idl_headers
messaging_service.idl.hh
paxos.idl.hh
raft.idl.hh
raft_util.idl.hh
raft_storage.idl.hh
group0.idl.hh
hinted_handoff.idl.hh
@@ -55,6 +56,7 @@ set(idl_headers
storage_proxy.idl.hh
storage_service.idl.hh
strong_consistency/state_machine.idl.hh
logstor.idl.hh
group0_state_machine.idl.hh
mapreduce_request.idl.hh
replica_exception.idl.hh

idl/logstor.idl.hh Normal file

@@ -0,0 +1,28 @@
/*
* Copyright 2026-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "idl/frozen_schema.idl.hh"
#include "idl/token.idl.hh"
#include "mutation/canonical_mutation.hh"
namespace replica {
namespace logstor {
struct primary_index_key {
dht::decorated_key dk;
};
class log_record {
replica::logstor::primary_index_key key;
replica::logstor::record_generation generation;
table_id table;
canonical_mutation mut;
};
}
}


@@ -96,6 +96,9 @@ std::set<sstring> get_disabled_features_from_db_config(const db::config& cfg, st
if (!cfg.check_experimental(db::experimental_features_t::feature::STRONGLY_CONSISTENT_TABLES)) {
disabled.insert("STRONGLY_CONSISTENT_TABLES"s);
}
if (!cfg.check_experimental(db::experimental_features_t::feature::LOGSTOR)) {
disabled.insert("LOGSTOR"s);
}
if (!cfg.table_digest_insensitive_to_expiry()) {
disabled.insert("TABLE_DIGEST_INSENSITIVE_TO_EXPIRY"s);
}


@@ -42,7 +42,14 @@ void everywhere_replication_strategy::validate_options(const gms::feature_servic
sstring everywhere_replication_strategy::sanity_check_read_replicas(const effective_replication_map& erm, const host_id_vector_replica_set& read_replicas) const {
const auto replication_factor = erm.get_replication_factor();
if (read_replicas.size() > replication_factor) {
if (const auto& topo_info = erm.get_token_metadata().get_topology_change_info(); topo_info && topo_info->read_new) {
if (read_replicas.size() > replication_factor + 1) {
return seastar::format(
"everywhere_replication_strategy: the number of replicas for everywhere_replication_strategy is {}, "
"cannot be higher than replication factor {} + 1 during the 'read from new replicas' stage of a topology change",
read_replicas.size(), replication_factor);
}
} else if (read_replicas.size() > replication_factor) {
return seastar::format("everywhere_replication_strategy: the number of replicas for everywhere_replication_strategy is {}, cannot be higher than replication factor {}", read_replicas.size(), replication_factor);
}
return {};


@@ -531,6 +531,11 @@ tablet_id tablet_map::get_tablet_id(token t) const {
return tablet_id(dht::compaction_group_of(_log2_tablets, t));
}
tablet_range_side tablet_map::get_tablet_range_side(token t) const {
auto id_after_split = dht::compaction_group_of(_log2_tablets + 1, t);
return tablet_range_side(id_after_split & 0x1);
}
std::pair<tablet_id, tablet_range_side> tablet_map::get_tablet_id_and_range_side(token t) const {
auto id_after_split = dht::compaction_group_of(_log2_tablets + 1, t);
auto current_id = id_after_split >> 1;


@@ -611,6 +611,10 @@ public:
/// Returns tablet_id of a tablet which owns a given token.
tablet_id get_tablet_id(token) const;
// Returns the side of the tablet's range that a given token belongs to.
// Less expensive than get_tablet_id_and_range_side() when tablet_id is already known.
tablet_range_side get_tablet_range_side(token) const;
// Returns tablet_id and also the side of the tablet's range that a given token belongs to.
std::pair<tablet_id, tablet_range_side> get_tablet_id_and_range_side(token) const;


@@ -19,8 +19,6 @@
#include "gms/inet_address.hh"
#include "auth/allow_all_authenticator.hh"
#include "auth/allow_all_authorizer.hh"
#include "auth/maintenance_socket_authenticator.hh"
#include "auth/maintenance_socket_role_manager.hh"
#include <seastar/core/future.hh>
#include <seastar/core/signal.hh>
#include <seastar/core/timer.hh>
@@ -1964,6 +1962,11 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
checkpoint(stop_signal, "loading non-system sstables");
replica::distributed_loader::init_non_system_keyspaces(db, proxy, sys_ks).get();
checkpoint(stop_signal, "recovering logstor");
db.invoke_on_all([] (replica::database& db) {
return db.recover_logstor();
}).get();
// Depends on all keyspaces being initialized because after this call
// we can be reloading schema.
mm.local().register_feature_listeners();
@@ -2102,7 +2105,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
if (cfg->maintenance_socket() != "ignore") {
checkpoint(stop_signal, "starting maintenance auth service");
maintenance_auth_service.start(std::ref(qp), std::ref(group0_client),
auth::make_authorizer_factory(auth::allow_all_authorizer_name, qp),
auth::make_maintenance_socket_authorizer_factory(qp),
auth::make_maintenance_socket_authenticator_factory(qp, group0_client, mm, auth_cache),
auth::make_maintenance_socket_role_manager_factory(qp, group0_client, mm, auth_cache),
maintenance_socket_enabled::yes, std::ref(auth_cache)).get();


@@ -103,7 +103,7 @@ future<std::optional<tasks::task_status>> node_ops_virtual_task::get_status(task
.entity = stats.entity,
.progress_units = "",
.progress = tasks::task_manager::task::progress{},
.children = co_await get_children(get_module(), id, std::bind_front(&gms::gossiper::is_alive, &_ss.gossiper()))
.children = co_await get_children(get_module(), id, _ss.get_token_metadata_ptr())
};
}


@@ -8,9 +8,10 @@
"""exec_cql.py
Execute CQL statements from a file where each non-empty, non-comment line is exactly one CQL statement.
Connects via a Unix domain socket (maintenance socket), bypassing authentication.
Requires python cassandra-driver. Stops at first failure.
Usage:
./exec_cql.py --file ./conf/auth.cql [--host 127.0.0.1 --port 9042]
./exec_cql.py --file ./conf/auth.cql --socket /path/to/cql.m
"""
import argparse, os, sys
from typing import Sequence
@@ -26,18 +27,27 @@ def read_statements(path: str) -> list[tuple[int, str]]:
stms.append((lineno, line))
return stms
def exec_driver(statements: list[tuple[int, str]], host: str, port: int, timeout: float, username: str, password: str) -> int:
def exec_statements(statements: list[tuple[int, str]], socket_path: str, timeout: float) -> int:
"""Execute CQL statements via a Unix domain socket (maintenance socket).
The maintenance socket only starts listening after the auth subsystem is
fully initialised, so a successful connect means the node is ready.
"""
from cassandra.cluster import Cluster
from cassandra.connection import UnixSocketEndPoint # type: ignore
from cassandra.policies import WhiteListRoundRobinPolicy # type: ignore
ep = UnixSocketEndPoint(socket_path)
try:
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider # type: ignore
except Exception:
print('ERROR: cassandra-driver not installed. Install with: pip install cassandra-driver', file=sys.stderr)
cluster = Cluster(
contact_points=[ep],
load_balancing_policy=WhiteListRoundRobinPolicy([ep]),
)
session = cluster.connect()
except Exception as e:
print(f'ERROR: failed to connect to maintenance socket {socket_path}: {e}', file=sys.stderr)
return 2
auth_provider = None
if username != "":
auth_provider = PlainTextAuthProvider(username=username, password=password)
cluster = Cluster([host], port=port, auth_provider=auth_provider)
session = cluster.connect()
try:
for _, (lineno, s) in enumerate(statements, 1):
try:
@@ -50,13 +60,11 @@ def exec_driver(statements: list[tuple[int, str]], host: str, port: int, timeout
return 0
def main(argv: Sequence[str]) -> int:
ap = argparse.ArgumentParser(description='Execute one-line CQL statements from file (driver only)')
ap = argparse.ArgumentParser(description='Execute one-line CQL statements from file via maintenance socket')
ap.add_argument('--file', required=True)
ap.add_argument('--host', default='127.0.0.1')
ap.add_argument('--port', type=int, default=9042)
ap.add_argument('--socket', required=True,
help='Path to the Unix domain maintenance socket (<workdir>/cql.m)')
ap.add_argument('--timeout', type=float, default=30.0)
ap.add_argument('--username', default='cassandra')
ap.add_argument('--password', default='cassandra')
args = ap.parse_args(argv)
if not os.path.isfile(args.file):
print(f"File not found: {args.file}", file=sys.stderr)
@@ -65,7 +73,7 @@ def main(argv: Sequence[str]) -> int:
if not stmts:
print('No statements found', file=sys.stderr)
return 1
rc = exec_driver(stmts, args.host, args.port, args.timeout, args.username, args.password)
rc = exec_statements(stmts, args.socket, args.timeout)
if rc == 0:
print('All statements executed successfully')
return rc


@@ -15,6 +15,7 @@ from typing import Any, Optional
import asyncio
import contextlib
import glob
import hashlib
import json
import logging
import os
@@ -364,12 +365,14 @@ async def start_node(executable: PathLike, cluster_workdir: PathLike, addr: str,
llvm_profile_file = f"{addr}-%m.profraw"
scylla_workdir = f"{addr}"
logfile = f"{addr}.log"
socket = maintenance_socket_path(cluster_workdir, addr)
command = [
"env",
f"LLVM_PROFILE_FILE={llvm_profile_file}",
f"SCYLLA_HOME={os.path.realpath(os.getcwd())}", # We assume that the script has Scylla's `conf/` as its filesystem neighbour.
os.path.realpath(executable),
f"--workdir={scylla_workdir}",
f"--maintenance-socket={socket}",
"--ring-delay-ms=0",
"--developer-mode=yes",
"--memory=1G",
@@ -391,6 +394,7 @@ async def start_node(executable: PathLike, cluster_workdir: PathLike, addr: str,
f"--authenticator=PasswordAuthenticator",
f"--authorizer=CassandraAuthorizer",
] + list(extra_opts)
training_logger.info(f"Using maintenance socket {socket}")
return await run(['bash', '-c', fr"""exec {shlex.join(command)} >{q(logfile)} 2>&1"""], cwd=cluster_workdir)
async def start_cluster(executable: PathLike, addrs: list[str], cpusets: Optional[list[str]], workdir: PathLike, cluster_name: str, extra_opts: list[str]) -> list[Process]:
@@ -433,16 +437,25 @@ async def start_cluster(executable: PathLike, addrs: list[str], cpusets: Optiona
procs.append(proc)
await wait_for_node(proc, addrs[i], timeout)
except:
await stop_cluster(procs, addrs)
await stop_cluster(procs, addrs, cluster_workdir=workdir)
raise
return procs
async def stop_cluster(procs: list[Process], addrs: list[str]) -> None:
async def stop_cluster(procs: list[Process], addrs: list[str], cluster_workdir: PathLike) -> None:
"""Stops a Scylla cluster started with start_cluster().
Doesn't return until all nodes exit, even if stop_cluster() is cancelled.
"""
await clean_gather(*[cancel_process(p, timeout=60) for p in procs])
_cleanup_short_sockets(cluster_workdir, addrs)
def _cleanup_short_sockets(cluster_workdir: PathLike, addrs: list[str]) -> None:
"""Remove short maintenance socket files created in /tmp."""
for addr in addrs:
try:
os.unlink(maintenance_socket_path(cluster_workdir, addr))
except OSError:
pass
async def wait_for_port(addr: str, port: int) -> None:
await bash(fr'until printf "" >>/dev/tcp/{addr}/{port}; do sleep 0.1; done 2>/dev/null')
@@ -452,6 +465,33 @@ async def merge_profraw(directory: PathLike) -> None:
if glob.glob(f"{directory}/*.profraw"):
await bash(fr"llvm-profdata merge {q(directory)}/*.profraw -output {q(directory)}/prof.profdata")
def maintenance_socket_path(cluster_workdir: PathLike, addr: str) -> str:
"""Return the maintenance socket path for a node.
Returns a short deterministic path in /tmp (derived from an MD5 hash of
the natural ``<cluster_workdir>/<addr>/cql.m`` path) to stay within the
Unix domain socket length limit.
The same path is passed to Scylla via ``--maintenance-socket`` in
``start_node()``.
"""
natural = os.path.realpath(f"{cluster_workdir}/{addr}/cql.m")
path_hash = hashlib.md5(natural.encode()).hexdigest()[:12]
return os.path.join(tempfile.gettempdir(), f'pgo-{path_hash}.m')
async def setup_cassandra_user(workdir: PathLike, addr: str) -> None:
"""Create the ``cassandra`` superuser via the maintenance socket.
The default cassandra superuser is no longer seeded automatically, but
``cassandra-stress`` hardcodes ``user=cassandra password=cassandra``.
We create the role over the maintenance socket so that cassandra-stress
and other tools that rely on the default credentials keep working.
"""
socket = maintenance_socket_path(workdir, addr)
stmt = "CREATE ROLE cassandra WITH PASSWORD = 'cassandra' AND SUPERUSER = true AND LOGIN = true;"
f = q(socket)
# Write the statement to a temp file and execute it via exec_cql.py.
await bash(fr"""tmpf=$(mktemp); echo {q(stmt)} > "$tmpf"; python3 ./exec_cql.py --file "$tmpf" --socket {f}; rc=$?; rm -f "$tmpf"; exit $rc""")
async def get_bolt_opts(executable: PathLike) -> list[str]:
"""Returns the extra opts which have to be passed to a BOLT-instrumented Scylla
to trigger a generation of a BOLT profile file.
@@ -503,7 +543,7 @@ async def with_cluster(executable: PathLike, workdir: PathLike, cpusets: Optiona
yield addrs, procs
finally:
training_logger.info(f"Stopping the cluster in {workdir}")
await stop_cluster(procs, addrs)
await stop_cluster(procs, addrs, cluster_workdir=workdir)
training_logger.info(f"Stopped the cluster in {workdir}")
################################################################################
@@ -557,8 +597,10 @@ def kw(**kwargs):
@contextlib.asynccontextmanager
async def with_cs_populate(executable: PathLike, workdir: PathLike) -> AsyncIterator[str]:
"""Provides a Scylla cluster and waits for compactions to end before stopping it."""
"""Provides a Scylla cluster, creates the cassandra superuser, and waits
for compactions to end before stopping it."""
async with with_cluster(executable=executable, workdir=workdir) as (addrs, procs):
await setup_cassandra_user(workdir, addrs[0])
yield addrs[0]
async with asyncio.timeout(3600):
# Should it also flush memtables?
@@ -667,9 +709,10 @@ populators["decommission_dataset"] = populate_decommission
# AUTH CONNECTIONS STRESS ==================================================
async def populate_auth_conns(executable: PathLike, workdir: PathLike) -> None:
# Create roles, table and permissions via CQL script.
# Create roles, table and permissions via CQL script over the maintenance socket.
async with with_cs_populate(executable=executable, workdir=workdir) as server:
await bash(fr"python3 ./exec_cql.py --file conf/auth.cql --host {server}")
socket = maintenance_socket_path(workdir, server)
await bash(fr"python3 ./exec_cql.py --file conf/auth.cql --socket {q(socket)}")
async def train_auth_conns(executable: PathLike, workdir: PathLike) -> None:
# Repeatedly connect as the reader user and perform simple reads to stress
@@ -722,7 +765,8 @@ populators["si_dataset"] = populate_si
async def populate_counters(executable: PathLike, workdir: PathLike) -> None:
async with with_cs_populate(executable=executable, workdir=workdir) as server:
await bash(fr"python3 ./exec_cql.py --file conf/counters.cql --host {server}")
socket = maintenance_socket_path(workdir, server)
await bash(fr"python3 ./exec_cql.py --file conf/counters.cql --socket {q(socket)}")
# Sleeps added in reaction to schema disagreement errors.
# FIXME: get rid of this sleep and find a sane way to wait for schema
# agreement.


@@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:34a0955d2c5a88e18ddab0f1df085e10a17e14129c3e21de91e4f27ef949b6c4
size 6502668
oid sha256:d424ce6cc7f65338c34dd35881d23f5ad3425651d66e47dc2c3a20dc798848d4
size 6598648


@@ -68,6 +68,7 @@ public:
using resources = reader_resources;
friend class reader_permit;
friend struct reader_concurrency_semaphore_tester;
enum class evict_reason {
permit, // evicted due to permit shortage


@@ -9,6 +9,9 @@ target_sources(replica
memtable.cc
exceptions.cc
dirty_memory_manager.cc
logstor/segment_manager.cc
logstor/logstor.cc
logstor/write_buffer.cc
multishard_query.cc
mutation_dump.cc
schema_describe_helper.cc


@@ -17,6 +17,7 @@
// FIXME: un-nest compaction_reenabler, so we can forward declare it and remove this include.
#include "compaction/compaction_manager.hh"
#include "locator/tablets.hh"
#include "replica/logstor/compaction.hh"
#include "sstables/sstable_set.hh"
#include "utils/chunked_vector.hh"
#include <absl/container/flat_hash_map.h>
@@ -33,6 +34,10 @@ class effective_replication_map;
namespace replica {
namespace logstor {
class primary_index;
}
using enable_backlog_tracker = bool_class<class enable_backlog_tracker_tag>;
enum class repair_sstable_classification {
@@ -91,6 +96,12 @@ class compaction_group {
bool _tombstone_gc_enabled = true;
std::optional<compaction::compaction_backlog_tracker> _backlog_tracker;
repair_classifier_func _repair_sstable_classifier;
lw_shared_ptr<logstor::segment_set> _logstor_segments;
std::optional<logstor::separator_buffer> _logstor_separator;
std::vector<future<>> _separator_flushes;
seastar::semaphore _separator_flush_sem{1};
private:
std::unique_ptr<compaction_group_view> make_compacting_view();
std::unique_ptr<compaction_group_view> make_non_compacting_view();
@@ -223,6 +234,7 @@ public:
const std::vector<sstables::shared_sstable>& compacted_undeleted_sstables() const noexcept;
// Triggers regular compaction.
void trigger_compaction();
void trigger_logstor_compaction();
bool compaction_disabled() const;
future<unsigned> estimate_pending_compactions() const;
@@ -231,6 +243,7 @@ public:
size_t live_sstable_count() const noexcept;
uint64_t live_disk_space_used() const noexcept;
size_t logstor_disk_space_used() const noexcept;
sstables::file_size_stats live_disk_space_used_full_stats() const noexcept;
uint64_t total_disk_space_used() const noexcept;
sstables::file_size_stats total_disk_space_used_full_stats() const noexcept;
@@ -262,12 +275,37 @@ public:
compaction::compaction_manager& get_compaction_manager() noexcept;
const compaction::compaction_manager& get_compaction_manager() const noexcept;
logstor::segment_manager& get_logstor_segment_manager() noexcept;
const logstor::segment_manager& get_logstor_segment_manager() const noexcept;
logstor::compaction_manager& get_logstor_compaction_manager() noexcept;
const logstor::compaction_manager& get_logstor_compaction_manager() const noexcept;
logstor::primary_index& get_logstor_index() noexcept;
future<> split(compaction::compaction_type_options::split opt, tasks::task_info tablet_split_task_info);
void set_repair_sstable_classifier(repair_classifier_func repair_sstable_classifier) {
_repair_sstable_classifier = std::move(repair_sstable_classifier);
}
void add_logstor_segment(logstor::segment_descriptor& desc) {
_logstor_segments->add_segment(desc);
}
future<> discard_logstor_segments();
future<> flush_separator(std::optional<size_t> seq_num = std::nullopt);
logstor::separator_buffer& get_separator_buffer(size_t write_size);
logstor::segment_set& logstor_segments() noexcept {
return *_logstor_segments;
}
const logstor::segment_set& logstor_segments() const noexcept {
return *_logstor_segments;
}
friend class storage_group;
};
@@ -312,7 +350,14 @@ public:
const compaction_group_ptr& main_compaction_group() const noexcept;
const std::vector<compaction_group_ptr>& split_ready_compaction_groups() const;
compaction_group_ptr& select_compaction_group(locator::tablet_range_side) noexcept;
// Selects the compaction group for the given token. Computes the range side
// from the token only when in splitting mode. This avoids the cost of computing
// range side on the hot path when it's not needed.
compaction_group_ptr& select_compaction_group(dht::token, const locator::tablet_map&) noexcept;
// Selects the compaction group for an sstable spanning a token range.
// If the first and last tokens fall on different sides of the split point,
// the sstable belongs to the main compaction group.
compaction_group_ptr& select_compaction_group(dht::token first, dht::token last, const locator::tablet_map&) noexcept;
uint64_t live_disk_space_used() const;
@@ -432,7 +477,9 @@ public:
// refresh_mutation_source must be called when there are changes to data source
// structures but logical state of data is not changed (e.g. when state for a
// new tablet replica is allocated).
virtual void update_effective_replication_map(const locator::effective_replication_map& erm, noncopyable_function<void()> refresh_mutation_source) = 0;
virtual void update_effective_replication_map(const locator::effective_replication_map_ptr& old_erm,
const locator::effective_replication_map& erm,
noncopyable_function<void()> refresh_mutation_source) = 0;
virtual compaction_group& compaction_group_for_token(dht::token token) const = 0;
virtual compaction_group& compaction_group_for_key(partition_key_view key, const schema_ptr& s) const = 0;


@@ -76,6 +76,7 @@
#include "locator/abstract_replication_strategy.hh"
#include "timeout_config.hh"
#include "tombstone_gc.hh"
#include "logstor/logstor.hh"
#include "service/qos/service_level_controller.hh"
#include "replica/data_dictionary_impl.hh"
@@ -393,6 +394,13 @@ database::database(const db::config& cfg, database_config dbcfg, service::migrat
// Allow system tables a pool of 10 MB memory to write, but never block on other regions.
, _system_dirty_memory_manager(*this, 10 << 20, cfg.unspooled_dirty_soft_limit(), default_scheduling_group())
, _dirty_memory_manager(*this, dbcfg.available_memory * 0.50, cfg.unspooled_dirty_soft_limit(), dbcfg.statement_scheduling_group)
, _dirty_memory_threshold_controller([this] {
if (_logstor) {
size_t logstor_memory_usage = get_logstor_memory_usage();
size_t available_memory = _dbcfg.available_memory > logstor_memory_usage ? _dbcfg.available_memory - logstor_memory_usage : 0;
_dirty_memory_manager.update_threshold(available_memory * 0.50);
}
})
, _dbcfg(dbcfg)
, _memtable_controller(make_flush_controller(_cfg, _dbcfg, [this, limit = float(_dirty_memory_manager.throttle_threshold())] {
auto backlog = (_dirty_memory_manager.unspooled_dirty_memory()) / limit;
@@ -906,6 +914,50 @@ database::init_commitlog() {
});
}
future<>
database::init_logstor() {
dblog.info("Initializing logstor");
auto cfg = logstor::logstor_config{
.segment_manager_cfg = {
.base_dir = std::filesystem::path(_cfg.logstor_directory()),
.file_size = _cfg.logstor_file_size_in_mb() * 1024ull * 1024ull,
.disk_size = _cfg.logstor_disk_size_in_mb() * 1024ull * 1024ull,
.compaction_sg = _dbcfg.compaction_scheduling_group,
.compaction_static_shares = _cfg.compaction_static_shares,
.separator_sg = _dbcfg.memtable_scheduling_group,
.separator_delay_limit_ms = _cfg.logstor_separator_delay_limit_ms(),
.max_separator_memory = _cfg.logstor_separator_max_memory_in_mb() * 1024ull * 1024ull,
},
.flush_sg = _dbcfg.commitlog_scheduling_group,
};
_logstor = std::make_unique<logstor::logstor>(std::move(cfg));
_logstor->set_trigger_compaction_hook([this] {
trigger_logstor_compaction(false);
});
_logstor->set_trigger_separator_flush_hook([this] (size_t seq_num) {
(void)flush_logstor_separator(seq_num);
});
dblog.info("logstor initialized");
co_return;
}
future<>
database::recover_logstor() {
if (!_logstor) {
co_return;
}
co_await _logstor->do_recovery(*this);
co_await _logstor->start();
_dirty_memory_threshold_controller.arm_periodic(std::chrono::seconds(5));
}
future<> database::modify_keyspace_on_all_shards(sharded<database>& sharded_db, std::function<future<>(replica::database&)> func) {
// Run func first on shard 0
// to allow "seeding" of the effective_replication_map
@@ -1128,6 +1180,17 @@ void database::add_column_family(keyspace& ks, schema_ptr schema, column_family:
cf->set_truncation_time(db_clock::time_point::min());
}
if (schema->logstor_enabled()) {
if (!_cfg.enable_logstor()) {
throw std::runtime_error(fmt::format("The table {}.{} is using logstor storage but logstor is not enabled in the configuration", schema->ks_name(), schema->cf_name()));
}
if (!_logstor) {
on_internal_error(dblog, "The table is using logstor but logstor is not initialized");
}
cf->init_logstor(_logstor.get());
dblog.info("Table {}.{} is using logstor storage", schema->ks_name(), schema->cf_name());
}
auto uuid = schema->id();
if (_tables_metadata.contains(uuid)) {
throw std::invalid_argument("UUID " + uuid.to_sstring() + " already mapped");
@@ -1699,7 +1762,7 @@ static db::rate_limiter::can_proceed account_singular_ranges_to_rate_limit(
if (!range.is_singular()) {
continue;
}
auto token = dht::token::to_int64(ranges.front().start()->value().token());
auto token = dht::token::to_int64(range.start()->value().token());
if (limiter.account_operation(read_label, token, table_limit, rate_limit_info) == db::rate_limiter::can_proceed::no) {
// Don't return immediately - account all ranges first
ret = can_proceed::no;
@@ -2163,7 +2226,7 @@ static std::exception_ptr wrap_commitlog_add_error(const schema_ptr& s, const fr
future<> database::apply_with_commitlog(column_family& cf, const mutation& m, db::timeout_clock::time_point timeout) {
db::rp_handle h;
if (cf.commitlog() != nullptr && cf.durable_writes()) {
if (cf.commitlog() != nullptr && cf.durable_writes() && !cf.uses_logstor()) {
auto fm = freeze(m);
std::exception_ptr ex;
try {
@@ -2212,6 +2275,10 @@ future<> database::do_apply_many(const utils::chunked_vector<frozen_mutation>& m
auto s = local_schema_registry().get(muts[i].schema_version());
auto&& cf = find_column_family(muts[i].column_family_id());
if (cf.uses_logstor()) {
continue;
}
if (!cl) {
cl = cf.commitlog();
} else if (cl != cf.commitlog()) {
@@ -2248,16 +2315,16 @@ future<> database::do_apply(schema_ptr s, const frozen_mutation& m, tracing::tra
// assume failure until proven otherwise
auto update_writes_failed = defer([&] { ++_stats->total_writes_failed; });
utils::get_local_injector().inject("database_apply", [&s] () {
if (!is_system_keyspace(s->ks_name())) {
throw std::runtime_error("injected error");
co_await utils::get_local_injector().inject("database_apply", [&s] (auto& handler) -> future<> {
if (s->ks_name() != handler.get("ks_name") || s->cf_name() != handler.get("cf_name")) {
co_return;
}
});
co_await utils::get_local_injector().inject("database_apply_wait", [&] (auto& handler) -> future<> {
if (s->cf_name() == handler.get("cf_name")) {
dblog.info("database_apply_wait: wait");
if (handler.get("what") == "throw") {
throw std::runtime_error(format("injected error for {}.{}", s->ks_name(), s->cf_name()));
} else if (handler.get("what") == "wait") {
dblog.info("database_apply: wait");
co_await handler.wait_for_message(std::chrono::steady_clock::now() + std::chrono::minutes{5});
dblog.info("database_apply_wait: done");
dblog.info("database_apply: done");
}
});
@@ -2309,7 +2376,7 @@ future<> database::do_apply(schema_ptr s, const frozen_mutation& m, tracing::tra
// frames.
db::rp_handle h;
auto cl = cf.commitlog();
if (cl != nullptr && cf.durable_writes()) {
if (cl != nullptr && cf.durable_writes() && !cf.uses_logstor()) {
std::exception_ptr ex;
try {
commitlog_entry_writer cew(s, m, sync);
@@ -2633,6 +2700,9 @@ future<> database::start(sharded<qos::service_level_controller>& sl_controller,
_compaction_manager.enable();
}
co_await init_commitlog();
if (_cfg.enable_logstor()) {
co_await init_logstor();
}
}
future<> database::shutdown() {
@@ -2673,6 +2743,11 @@ future<> database::stop() {
co_await _commitlog->shutdown();
dblog.info("Shutting down commitlog complete");
}
if (_logstor) {
dblog.info("Shutting down logstor");
co_await _logstor->stop();
dblog.info("Shutting down logstor complete");
}
if (_schema_commitlog) {
dblog.info("Shutting down schema commitlog");
co_await _schema_commitlog->shutdown();
@@ -2807,6 +2882,53 @@ future<> database::drop_cache_for_keyspace_on_all_shards(sharded<database>& shar
});
}
future<> database::trigger_logstor_compaction_on_all_shards(sharded<database>& sharded_db, bool major) {
return sharded_db.invoke_on_all([major] (replica::database& db) {
return db.trigger_logstor_compaction(major);
});
}
void database::trigger_logstor_compaction(bool major) {
_tables_metadata.for_each_table([&] (table_id id, const lw_shared_ptr<table> tp) {
if (tp->uses_logstor()) {
tp->trigger_logstor_compaction();
}
});
}
future<> database::flush_logstor_separator_on_all_shards(sharded<database>& sharded_db) {
return sharded_db.invoke_on_all([] (replica::database& db) {
return db.flush_logstor_separator();
});
}
future<> database::flush_logstor_separator(std::optional<size_t> seq_num) {
return _tables_metadata.parallel_for_each_table([seq_num] (table_id, lw_shared_ptr<table> table) {
return table->flush_separator(seq_num);
});
}
future<logstor::table_segment_stats> database::get_logstor_table_segment_stats(table_id table) const {
return find_column_family(table).get_logstor_segment_stats();
}
size_t database::get_logstor_memory_usage() const {
if (!_logstor) {
return 0;
}
size_t m = 0;
m += _logstor->get_memory_usage();
get_tables_metadata().for_each_table([&m] (table_id, lw_shared_ptr<replica::table> table) {
if (table->uses_logstor()) {
m += table->get_logstor_memory_usage();
}
});
return m;
}
future<> database::snapshot_table_on_all_shards(sharded<database>& sharded_db, table_id uuid, sstring tag, db::snapshot_options opts) {
if (!opts.skip_flush) {
co_await flush_table_on_all_shards(sharded_db, uuid);
@@ -2927,6 +3049,7 @@ future<> database::truncate_table_on_all_shards(sharded<database>& sharded_db, s
co_await coroutine::parallel_for_each(views, [&] (lw_shared_ptr<replica::table> v) -> future<> {
co_await flush_or_clear(*v);
});
co_await cf.flush_separator();
// Since writes could be appended to active memtable between getting low_mark above
// and flush, the low_mark has to be adjusted to account for those writes, where
// memtable was flushed with a higher replay position than the one obtained above.
@@ -2968,6 +3091,8 @@ future<> database::truncate(db::system_keyspace& sys_ks, column_family& cf, std:
dblog.debug("Discarding sstable data for truncated CF + indexes");
// TODO: notify truncation
co_await cf.discard_logstor_segments();
db::replay_position rp = co_await cf.discard_sstables(truncated_at);
// TODO: indexes.
// Note: since discard_sstables was changed to only count tables owned by this shard,


@@ -16,6 +16,7 @@
#include <seastar/core/execution_stage.hh>
#include <seastar/core/when_all.hh>
#include "replica/global_table_ptr.hh"
#include "replica/logstor/compaction.hh"
#include "types/user.hh"
#include "utils/assert.hh"
#include "utils/hash.hh"
@@ -35,6 +36,7 @@
#include <seastar/core/gate.hh>
#include "db/commitlog/replay_position.hh"
#include "db/commitlog/commitlog_types.hh"
#include "logstor/logstor.hh"
#include "schema/schema_fwd.hh"
#include "db/view/view.hh"
#include "db/snapshot-ctl.hh"
@@ -544,6 +546,9 @@ private:
utils::phased_barrier _flush_barrier;
std::vector<view_ptr> _views;
logstor::logstor* _logstor = nullptr;
std::unique_ptr<logstor::primary_index> _logstor_index;
std::unique_ptr<cell_locker> _counter_cell_locks; // Memory-intensive; allocate only when needed.
// Labels used to identify writes and reads for this table in the rate_limiter structure.
@@ -611,6 +616,10 @@ public:
sstables::offstrategy offstrategy = sstables::offstrategy::no);
future<> add_sstables_and_update_cache(const std::vector<sstables::shared_sstable>& ssts);
bool add_logstor_segment(logstor::segment_descriptor&, dht::token first_token, dht::token last_token);
logstor::separator_buffer& get_logstor_separator_buffer(dht::token token, size_t write_size);
// Restricted to new sstables produced by external processes such as repair.
// The sstable might undergo split if table is in split mode.
// If no need for split, the input sstable will only be attached to the sstable set.
@@ -833,6 +842,21 @@ public:
// to issue disk operations safely.
void mark_ready_for_writes(db::commitlog* cl);
void init_logstor(logstor::logstor* ls);
bool uses_logstor() const {
return _logstor != nullptr;
}
logstor::primary_index& logstor_index() noexcept {
return *_logstor_index;
}
const logstor::primary_index& logstor_index() const noexcept {
return *_logstor_index;
}
size_t get_logstor_memory_usage() const;
// Creates a mutation reader which covers all data sources for this column family.
// Caller needs to ensure that column_family remains live (FIXME: relax this).
// Note: for data queries use query() instead.
@@ -858,6 +882,14 @@ public:
return make_mutation_reader(std::move(schema), std::move(permit), range, full_slice);
}
mutation_reader make_logstor_mutation_reader(schema_ptr s,
reader_permit permit,
const dht::partition_range& pr,
const query::partition_slice& slice,
tracing::trace_state_ptr trace_state,
streamed_mutation::forwarding fwd,
mutation_reader::forwarding fwd_mr) const;
// The streaming mutation reader differs from the regular mutation reader in that:
// - Reflects all writes accepted by replica prior to creation of the
// reader and a _bounded_ amount of writes which arrive later.
@@ -1047,6 +1079,7 @@ public:
bool needs_flush() const;
future<> clear(); // discards memtable(s) without flushing them to disk.
future<db::replay_position> discard_sstables(db_clock::time_point);
future<> discard_logstor_segments();
bool can_flush() const;
@@ -1098,6 +1131,7 @@ public:
void start_compaction();
void trigger_compaction();
void try_trigger_compaction(compaction_group& cg) noexcept;
void trigger_logstor_compaction();
// Triggers offstrategy compaction, if needed, in the background.
void trigger_offstrategy_compaction();
// Performs offstrategy compaction, if needed, returning
@@ -1126,6 +1160,22 @@ public:
return _compaction_manager;
}
logstor::segment_manager& get_logstor_segment_manager() noexcept {
return _logstor->get_segment_manager();
}
const logstor::segment_manager& get_logstor_segment_manager() const noexcept {
return _logstor->get_segment_manager();
}
logstor::compaction_manager& get_logstor_compaction_manager() noexcept {
return _logstor->get_compaction_manager();
}
future<> flush_separator(std::optional<size_t> seq_num = std::nullopt);
future<logstor::table_segment_stats> get_logstor_segment_stats() const;
table_stats& get_stats() const {
return _stats;
}
@@ -1613,6 +1663,8 @@ private:
dirty_memory_manager _system_dirty_memory_manager;
dirty_memory_manager _dirty_memory_manager;
timer<lowres_clock> _dirty_memory_threshold_controller;
database_config _dbcfg;
flush_controller _memtable_controller;
drain_progress _drain_progress {};
@@ -1655,6 +1707,8 @@ private:
bool _enable_autocompaction_toggle = false;
querier_cache _querier_cache;
std::unique_ptr<logstor::logstor> _logstor;
std::unique_ptr<db::large_data_handler> _large_data_handler;
std::unique_ptr<db::large_data_handler> _nop_large_data_handler;
@@ -1696,6 +1750,8 @@ public:
std::shared_ptr<data_dictionary::user_types_storage> as_user_types_storage() const noexcept;
const data_dictionary::user_types_storage& user_types() const noexcept;
future<> init_commitlog();
future<> init_logstor();
future<> recover_logstor();
const gms::feature_service& features() const { return _feat; }
future<> apply_in_memory(const frozen_mutation& m, schema_ptr m_schema, db::rp_handle&&, db::timeout_clock::time_point timeout);
future<> apply_in_memory(const mutation& m, column_family& cf, db::rp_handle&&, db::timeout_clock::time_point timeout);
@@ -1996,6 +2052,13 @@ public:
// a wrapper around flush_all_tables, allowing the caller to express intent more clearly
future<> flush_commitlog() { return flush_all_tables(); }
static future<> trigger_logstor_compaction_on_all_shards(sharded<database>& sharded_db, bool major);
void trigger_logstor_compaction(bool major);
static future<> flush_logstor_separator_on_all_shards(sharded<database>& sharded_db);
future<> flush_logstor_separator(std::optional<size_t> seq_num = std::nullopt);
future<logstor::table_segment_stats> get_logstor_table_segment_stats(table_id table) const;
size_t get_logstor_memory_usage() const;
static future<db_clock::time_point> get_all_tables_flushed_at(sharded<database>& sharded_db);
static future<> drop_cache_for_table_on_all_shards(sharded<database>& sharded_db, table_id id);


@@ -142,6 +142,16 @@ void region_group::notify_unspooled_pressure_relieved() {
_relief.signal();
}
void region_group::update_limits(size_t unspooled_hard_limit, size_t unspooled_soft_limit, size_t real_hard_limit) {
_cfg.unspooled_hard_limit = unspooled_hard_limit;
_cfg.unspooled_soft_limit = unspooled_soft_limit;
_cfg.real_hard_limit = real_hard_limit;
// check pressure with the new limits
update_real(0);
update_unspooled(0);
}
bool region_group::do_update_real_and_check_relief(ssize_t delta) {
_real_total_memory += delta;
@@ -211,9 +221,18 @@ dirty_memory_manager::dirty_memory_manager(replica::database& db, size_t thresho
.real_hard_limit = threshold,
.start_reclaiming = std::bind_front(&dirty_memory_manager::start_reclaiming, this)
}, deferred_work_sg)
, _threshold(threshold)
, _soft_limit(soft_limit)
, _flush_serializer(1)
, _waiting_flush(flush_when_needed()) {}
void dirty_memory_manager::update_threshold(size_t threshold) {
if (threshold != _threshold) {
_threshold = threshold;
_region_group.update_limits(threshold / 2, threshold * _soft_limit / 2, threshold);
}
}
void
dirty_memory_manager::setup_collectd(sstring namestr) {
namespace sm = seastar::metrics;


@@ -268,6 +268,8 @@ public:
}
void update_unspooled(ssize_t delta);
void update_limits(size_t unspooled_hard_limit, size_t unspooled_soft_limit, size_t real_hard_limit);
void increase_usage(logalloc::region* r) { // Called by memtable's region_listener
// It would be easier to call update, but it is unfortunately broken in boost versions up to at
// least 1.59.
@@ -395,6 +397,9 @@ class dirty_memory_manager {
// memory usage minus bytes that were already written to disk.
dirty_memory_manager_logalloc::region_group _region_group;
size_t _threshold;
double _soft_limit;
// We would like to serialize the flushing of memtables. While flushing many memtables
// simultaneously can sustain high levels of throughput, the memory is not freed until the
// memtable is totally gone. That means that if we have throttled requests, they will stay
@@ -483,6 +488,8 @@ public:
return _region_group;
}
void update_threshold(size_t threshold);
void revert_potentially_cleaned_up_memory(logalloc::region* from, int64_t delta) {
_region_group.update_real(-delta);
_region_group.update_unspooled(delta);


@@ -0,0 +1,177 @@
/*
* Copyright (C) 2026-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include "types.hh"
#include "utils/chunked_vector.hh"
#include "write_buffer.hh"
#include "utils/log_heap.hh"
namespace replica::logstor {
constexpr log_heap_options segment_descriptor_hist_options(4 * 1024, 3, 128 * 1024);
struct segment_set;
struct segment_descriptor : public log_heap_hook<segment_descriptor_hist_options> {
// free_space = segment_size - net_data_size
// initially set to segment_size
// when writing records, decrease by total net data size
// when freeing a record, increase by the record's net data size
size_t free_space{0};
size_t record_count{0};
segment_generation seg_gen{1};
segment_set* owner{nullptr}; // non-owning, set when added to a segment_set
void reset(size_t segment_size) noexcept {
free_space = segment_size;
record_count = 0;
}
size_t net_data_size(size_t segment_size) const noexcept {
return segment_size - free_space;
}
void on_free_segment() noexcept {
++seg_gen;
}
void on_write(size_t net_data_size, size_t cnt = 1) noexcept {
free_space -= net_data_size;
record_count += cnt;
}
void on_write(log_location loc) noexcept {
on_write(loc.size);
}
void on_free(size_t net_data_size, size_t cnt = 1) noexcept {
free_space += net_data_size;
record_count -= cnt;
}
void on_free(log_location loc) noexcept {
on_free(loc.size);
}
};
using segment_descriptor_hist = log_heap<segment_descriptor, segment_descriptor_hist_options>;
struct segment_set {
segment_descriptor_hist _segments;
size_t _segment_count{0};
void add_segment(segment_descriptor& desc) {
desc.owner = this;
_segments.push(desc);
++_segment_count;
}
void update_segment(segment_descriptor& desc) {
_segments.adjust_up(desc);
}
void remove_segment(segment_descriptor& desc) {
_segments.erase(desc);
desc.owner = nullptr;
--_segment_count;
}
size_t segment_count() const noexcept {
return _segment_count;
}
};
class segment_ref {
struct state {
log_segment_id id;
std::function<void()> on_last_release;
std::function<void()> on_failure;
bool flush_failure{false};
~state() {
if (!flush_failure) {
if (on_last_release) on_last_release();
} else {
if (on_failure) on_failure();
}
}
};
lw_shared_ptr<state> _state;
public:
segment_ref() = default;
// Copyable: copying increments the shared ref count
segment_ref(const segment_ref&) = default;
segment_ref& operator=(const segment_ref&) = default;
segment_ref(segment_ref&&) noexcept = default;
segment_ref& operator=(segment_ref&&) noexcept = default;
log_segment_id id() const noexcept { return _state->id; }
bool empty() const noexcept { return !_state; }
void set_flush_failure() noexcept { if (_state) _state->flush_failure = true; }
private:
friend class segment_manager_impl;
explicit segment_ref(log_segment_id id, std::function<void()> on_last_release, std::function<void()> on_failure)
: _state(make_lw_shared<state>(id, std::move(on_last_release), std::move(on_failure)))
{}
};
struct separator_buffer {
write_buffer* buf;
utils::chunked_vector<future<>> pending_updates;
utils::chunked_vector<segment_ref> held_segments;
std::optional<size_t> min_seq_num;
bool flushed{false};
separator_buffer(write_buffer* wb)
: buf(wb)
{}
~separator_buffer() {
if (!flushed && buf && buf->has_data()) {
for (auto& seg_ref : held_segments) {
seg_ref.set_flush_failure();
}
}
}
separator_buffer(const separator_buffer&) = delete;
separator_buffer& operator=(const separator_buffer&) = delete;
separator_buffer(separator_buffer&&) noexcept = default;
separator_buffer& operator=(separator_buffer&&) noexcept = default;
future<log_location_with_holder> write(log_record_writer writer) {
return buf->write(std::move(writer));
}
bool can_fit(const log_record_writer& writer) const noexcept {
return buf->can_fit(writer);
}
bool can_fit(size_t write_size) const noexcept {
return buf->can_fit(write_size);
}
};
class compaction_manager {
public:
virtual ~compaction_manager() = default;
virtual separator_buffer allocate_separator_buffer() = 0;
virtual future<> flush_separator_buffer(separator_buffer, replica::compaction_group&) = 0;
virtual void submit(replica::compaction_group&) = 0;
virtual future<> stop_ongoing_compactions(replica::compaction_group&) = 0;
};
}

replica/logstor/index.hh Normal file

@@ -0,0 +1,167 @@
/*
* Copyright (C) 2026-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include "dht/decorated_key.hh"
#include "dht/ring_position.hh"
#include "types.hh"
#include "utils/bptree.hh"
#include "utils/double-decker.hh"
#include "utils/phased_barrier.hh"
namespace replica::logstor {
class primary_index_entry {
dht::decorated_key _key;
index_entry _e;
struct {
bool _head : 1;
bool _tail : 1;
bool _train : 1;
} _flags{};
public:
primary_index_entry(dht::decorated_key key, index_entry e)
: _key(std::move(key))
, _e(std::move(e))
{ }
primary_index_entry(primary_index_entry&&) noexcept = default;
bool is_head() const noexcept { return _flags._head; }
void set_head(bool v) noexcept { _flags._head = v; }
bool is_tail() const noexcept { return _flags._tail; }
void set_tail(bool v) noexcept { _flags._tail = v; }
bool with_train() const noexcept { return _flags._train; }
void set_train(bool v) noexcept { _flags._train = v; }
const dht::decorated_key& key() const noexcept { return _key; }
const index_entry& entry() const noexcept { return _e; }
friend class primary_index;
friend dht::ring_position_view ring_position_view_to_compare(const primary_index_entry& e) { return e._key; }
};
class primary_index final {
public:
using partitions_type = double_decker<int64_t, primary_index_entry,
dht::raw_token_less_comparator, dht::ring_position_comparator,
16, bplus::key_search::linear>;
private:
partitions_type _partitions;
schema_ptr _schema;
size_t _key_count = 0;
mutable utils::phased_barrier _reads_phaser{"logstor_primary_index"};
public:
explicit primary_index(schema_ptr schema)
: _partitions(dht::raw_token_less_comparator{})
, _schema(std::move(schema))
{}
void set_schema(schema_ptr s) {
_schema = std::move(s);
}
void clear() {
_partitions.clear();
_key_count = 0;
}
utils::phased_barrier::operation start_read() const {
return _reads_phaser.start();
}
future<> await_pending_reads() {
return _reads_phaser.advance_and_await();
}
std::optional<index_entry> get(const primary_index_key& key) const {
auto it = _partitions.find(key.dk, dht::ring_position_comparator(*_schema));
if (it != _partitions.end()) {
return it->_e;
}
return std::nullopt;
}
std::optional<index_entry> exchange(const primary_index_key& key, index_entry new_entry) {
partitions_type::bound_hint hint;
auto i = _partitions.lower_bound(key.dk, dht::ring_position_comparator(*_schema), hint);
if (hint.match) {
auto old_entry = i->_e;
i->_e = std::move(new_entry);
return old_entry;
} else {
_partitions.emplace_before(i, key.dk.token().raw(), hint, key.dk, std::move(new_entry));
++_key_count;
return std::nullopt;
}
}
bool update_record_location(const primary_index_key& key, log_location old_location, log_location new_location) {
auto it = _partitions.find(key.dk, dht::ring_position_comparator(*_schema));
if (it != _partitions.end()) {
if (it->_e.location == old_location) {
it->_e.location = new_location;
return true;
}
}
return false;
}
std::pair<bool, std::optional<index_entry>> insert_if_newer(const primary_index_key& key, index_entry new_entry) {
partitions_type::bound_hint hint;
auto i = _partitions.lower_bound(key.dk, dht::ring_position_comparator(*_schema), hint);
if (hint.match) {
if (i->_e.generation < new_entry.generation) {
auto old_entry = i->_e;
i->_e = std::move(new_entry);
return {true, std::make_optional(old_entry)};
} else {
return {false, std::make_optional(i->_e)};
}
} else {
_partitions.emplace_before(i, key.dk.token().raw(), hint, key.dk, std::move(new_entry));
++_key_count;
return {true, std::nullopt};
}
}
bool erase(const primary_index_key& key, log_location loc) {
auto it = _partitions.find(key.dk, dht::ring_position_comparator(*_schema));
if (it != _partitions.end() && it->_e.location == loc) {
it.erase(dht::raw_token_less_comparator{});
--_key_count;
return true;
}
return false;
}
auto begin() const noexcept { return _partitions.begin(); }
auto end() const noexcept { return _partitions.end(); }
bool empty() const noexcept { return _partitions.empty(); }
size_t get_key_count() const noexcept { return _key_count; }
size_t get_memory_usage() const noexcept { return _key_count * sizeof(index_entry); }
// First entry with key >= pos (for positioning at range start)
partitions_type::const_iterator lower_bound(const dht::ring_position_view& pos) const {
return _partitions.lower_bound(pos, dht::ring_position_comparator(*_schema));
}
// First entry with key strictly > key (for advancing past a key after a yield)
partitions_type::const_iterator upper_bound(const dht::decorated_key& key) const {
return _partitions.upper_bound(key, dht::ring_position_comparator(*_schema));
}
};
}

replica/logstor/logstor.cc Normal file

@@ -0,0 +1,297 @@
/*
* Copyright (C) 2026-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "replica/logstor/logstor.hh"
#include <seastar/core/coroutine.hh>
#include <seastar/util/log.hh>
#include <seastar/core/future.hh>
#include "readers/from_mutations.hh"
#include "keys/keys.hh"
#include "replica/logstor/segment_manager.hh"
#include "replica/logstor/types.hh"
#include "utils/managed_bytes.hh"
#include <openssl/ripemd.h>
#include <openssl/evp.h>
namespace replica::logstor {
seastar::logger logstor_logger("logstor");
logstor::logstor(logstor_config config)
: _segment_manager(config.segment_manager_cfg)
, _write_buffer(_segment_manager, config.flush_sg) {
}
future<> logstor::do_recovery(replica::database& db) {
co_await _segment_manager.do_recovery(db);
}
future<> logstor::start() {
logstor_logger.info("Starting logstor");
co_await _segment_manager.start();
co_await _write_buffer.start();
logstor_logger.info("logstor started");
}
future<> logstor::stop() {
logstor_logger.info("Stopping logstor");
co_await _write_buffer.stop();
co_await _segment_manager.stop();
logstor_logger.info("logstor stopped");
}
size_t logstor::get_memory_usage() const {
return _segment_manager.get_memory_usage();
}
future<> logstor::write(const mutation& m, compaction_group& cg, seastar::gate::holder cg_holder) {
primary_index_key key(m.decorated_key());
table_id table = m.schema()->id();
auto& index = cg.get_logstor_index();
// TODO ?
record_generation gen = index.get(key)
.transform([](const index_entry& entry) {
return entry.generation + 1;
}).value_or(record_generation(1));
log_record record {
.key = key,
.generation = gen,
.table = table,
.mut = canonical_mutation(m)
};
return _write_buffer.write(std::move(record), &cg, std::move(cg_holder)).then_unpack([this, &index, gen, key = std::move(key)]
(log_location location, seastar::gate::holder op) {
index_entry new_entry {
.location = location,
.generation = gen,
};
auto old_entry = index.exchange(key, std::move(new_entry));
// If overwriting, free old record
if (old_entry) {
_segment_manager.free_record(old_entry->location);
}
}).handle_exception([] (std::exception_ptr ep) {
logstor_logger.error("Error writing mutation: {}", ep);
return make_exception_future<>(ep);
});
}
future<std::optional<log_record>> logstor::read(const primary_index& index, primary_index_key key) {
auto op = index.start_read();
auto entry_opt = index.get(key);
if (!entry_opt.has_value()) {
return make_ready_future<std::optional<log_record>>(std::nullopt);
}
const auto& entry = *entry_opt;
return _segment_manager.read(entry.location).then([key = std::move(key), op = std::move(op)] (log_record record) {
return std::optional<log_record>(std::move(record));
}).handle_exception([] (std::exception_ptr ep) {
logstor_logger.error("Error reading record: {}", ep);
return make_exception_future<std::optional<log_record>>(ep);
});
}
future<std::optional<canonical_mutation>> logstor::read(const schema& s, const primary_index& index, const dht::decorated_key& dk) {
primary_index_key key(dk);
return read(index, key).then([&dk] (std::optional<log_record> record_opt) -> std::optional<canonical_mutation> {
if (!record_opt.has_value()) {
return std::nullopt;
}
auto& record = *record_opt;
if (record.mut.key() != dk.key()) [[unlikely]] {
throw std::runtime_error(fmt::format(
"Key mismatch reading log entry: expected {}, got {}",
dk.key(), record.mut.key()
));
}
return std::optional<canonical_mutation>(std::move(record.mut));
});
}
segment_manager& logstor::get_segment_manager() noexcept {
return _segment_manager;
}
const segment_manager& logstor::get_segment_manager() const noexcept {
return _segment_manager;
}
compaction_manager& logstor::get_compaction_manager() noexcept {
return _segment_manager.get_compaction_manager();
}
const compaction_manager& logstor::get_compaction_manager() const noexcept {
return _segment_manager.get_compaction_manager();
}
mutation_reader logstor::make_reader(schema_ptr schema,
const primary_index& index,
reader_permit permit,
const dht::partition_range& pr,
const query::partition_slice& slice,
tracing::trace_state_ptr trace_state) {
class logstor_range_reader : public mutation_reader::impl {
logstor* _logstor;
const primary_index& _index;
dht::partition_range _pr;
query::partition_slice _slice;
tracing::trace_state_ptr _trace_state;
std::optional<dht::decorated_key> _last_key; // owns the key, safe across yields
mutation_reader_opt _current_partition_reader;
dht::ring_position_comparator _cmp;
// Finds the next iterator to process, safe to call after any co_await
primary_index::partitions_type::const_iterator find_next() const {
auto it = _last_key
? _index.upper_bound(*_last_key) // strictly after last key
: position_at_range_start(); // initial positioning
// An exclusive start bound is handled in position_at_range_start()
return it;
}
primary_index::partitions_type::const_iterator position_at_range_start() const {
if (!_pr.start()) {
return _index.begin();
}
auto it = _index.lower_bound(_pr.start()->value());
if (!_pr.start()->is_inclusive() && it != _index.end()) {
if (_cmp(it->key(), _pr.start()->value()) == 0) {
++it;
}
}
return it;
}
bool exceeds_range_end(const primary_index_entry& e) const {
if (!_pr.end()) return false;
auto c = _cmp(e.key(), _pr.end()->value());
return _pr.end()->is_inclusive() ? c > 0 : c >= 0;
}
public:
logstor_range_reader(schema_ptr s, const primary_index& idx, reader_permit p,
logstor* ls, dht::partition_range pr,
query::partition_slice slice, tracing::trace_state_ptr ts)
: impl(std::move(s), std::move(p))
, _logstor(ls), _index(idx), _pr(std::move(pr))
, _slice(std::move(slice)), _trace_state(std::move(ts))
, _cmp(*_schema)
{}
virtual future<> fill_buffer() override {
while (!is_buffer_full() && !_end_of_stream) {
// Drain current partition's reader first
if (_current_partition_reader) {
co_await _current_partition_reader->fill_buffer();
_current_partition_reader->move_buffer_content_to(*this);
if (!_current_partition_reader->is_end_of_stream()) {
continue;
}
co_await _current_partition_reader->close();
_current_partition_reader = std::nullopt;
// _last_key was already set when we opened the reader
}
// Find next key in range (safe after co_await since we use _last_key)
auto it = find_next();
if (it == _index.end() || exceeds_range_end(*it)) {
_end_of_stream = true;
break;
}
// Snapshot the key before yielding
auto current_key = it->key();
auto guard = reader_permit::awaits_guard(_permit);
auto cmut = co_await _logstor->read(*_schema, _index, current_key);
_last_key = current_key; // mark as visited even if not found (tombstoned)
if (!cmut) {
continue; // key was removed between index lookup and read
}
tracing::trace(_trace_state, "logstor_range_reader: fetched key {}", current_key);
_current_partition_reader = make_mutation_reader_from_mutations(
_schema, _permit, cmut->to_mutation(_schema),
_slice, streamed_mutation::forwarding::no
);
}
}
virtual future<> next_partition() override {
clear_buffer_to_next_partition();
if (!is_buffer_empty()) return make_ready_future<>();
_end_of_stream = false;
if (_current_partition_reader) {
auto fut = _current_partition_reader->close();
_current_partition_reader = std::nullopt;
return fut;
}
return make_ready_future<>();
}
virtual future<> fast_forward_to(const dht::partition_range& pr) override {
clear_buffer();
_end_of_stream = false;
_pr = pr;
_last_key = std::nullopt; // re-position from new range start
if (_current_partition_reader) {
auto fut = _current_partition_reader->close();
_current_partition_reader = std::nullopt;
return fut;
}
return make_ready_future<>();
}
virtual future<> fast_forward_to(position_range pr) override {
if (_current_partition_reader) {
clear_buffer();
return _current_partition_reader->fast_forward_to(std::move(pr));
}
return make_ready_future<>();
}
virtual future<> close() noexcept override {
if (_current_partition_reader) {
return _current_partition_reader->close();
}
return make_ready_future<>();
}
};
return make_mutation_reader<logstor_range_reader>(
std::move(schema), index, std::move(permit), this, pr, slice, std::move(trace_state)
);
}
void logstor::set_trigger_compaction_hook(std::function<void()> fn) {
_segment_manager.set_trigger_compaction_hook(std::move(fn));
}
void logstor::set_trigger_separator_flush_hook(std::function<void(size_t)> fn) {
_segment_manager.set_trigger_separator_flush_hook(std::move(fn));
}
}


@@ -0,0 +1,81 @@
/*
* Copyright (C) 2026-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include <seastar/core/future.hh>
#include <seastar/core/temporary_buffer.hh>
#include <optional>
#include <seastar/core/scheduling.hh>
#include "readers/mutation_reader.hh"
#include "replica/compaction_group.hh"
#include "types.hh"
#include "index.hh"
#include "segment_manager.hh"
#include "write_buffer.hh"
#include "mutation/mutation.hh"
#include "dht/decorated_key.hh"
namespace replica {
class compaction_group;
class database;
namespace logstor {
extern seastar::logger logstor_logger;
struct logstor_config {
segment_manager_config segment_manager_cfg;
seastar::scheduling_group flush_sg;
};
class logstor {
segment_manager _segment_manager;
buffered_writer _write_buffer;
public:
explicit logstor(logstor_config);
logstor(const logstor&) = delete;
logstor& operator=(const logstor&) = delete;
future<> do_recovery(replica::database&);
future<> start();
future<> stop();
size_t get_memory_usage() const;
segment_manager& get_segment_manager() noexcept;
const segment_manager& get_segment_manager() const noexcept;
compaction_manager& get_compaction_manager() noexcept;
const compaction_manager& get_compaction_manager() const noexcept;
future<> write(const mutation&, compaction_group&, seastar::gate::holder cg_holder);
future<std::optional<log_record>> read(const primary_index&, primary_index_key);
future<std::optional<canonical_mutation>> read(const schema&, const primary_index&, const dht::decorated_key&);
/// Create a mutation reader for a specific key
mutation_reader make_reader(schema_ptr schema,
const primary_index& index,
reader_permit permit,
const dht::partition_range& pr,
const query::partition_slice& slice,
tracing::trace_state_ptr trace_state = nullptr);
void set_trigger_compaction_hook(std::function<void()> fn);
void set_trigger_separator_flush_hook(std::function<void(size_t)> fn);
};
} // namespace logstor
} // namespace replica

(File diff suppressed because it is too large.)


@@ -0,0 +1,128 @@
/*
* Copyright (C) 2026-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include <cstdint>
#include <filesystem>
#include <seastar/core/shared_future.hh>
#include <seastar/core/file.hh>
#include <seastar/core/rwlock.hh>
#include <seastar/core/gate.hh>
#include <seastar/core/queue.hh>
#include <seastar/core/shared_ptr.hh>
#include "bytes_fwd.hh"
#include "replica/logstor/write_buffer.hh"
#include "types.hh"
#include "utils/updateable_value.hh"
namespace replica {
class database;
namespace logstor {
class compaction_manager;
class segment_set;
class primary_index;
static constexpr size_t default_segment_size = 128 * 1024;
static constexpr size_t default_file_size = 32 * 1024 * 1024;
/// Configuration for the segment manager
struct segment_manager_config {
std::filesystem::path base_dir;
size_t segment_size = default_segment_size;
size_t file_size = default_file_size;
size_t disk_size;
bool compaction_enabled = true;
size_t max_segments_per_compaction = 8;
seastar::scheduling_group compaction_sg;
utils::updateable_value<float> compaction_static_shares;
seastar::scheduling_group separator_sg;
uint32_t separator_delay_limit_ms;
size_t max_separator_memory = 1 * 1024 * 1024;
};
struct table_segment_histogram_bucket {
size_t count;
size_t max_data_size;
table_segment_histogram_bucket& operator+=(const table_segment_histogram_bucket& other) {
count += other.count;
max_data_size = std::max(max_data_size, other.max_data_size);
return *this;
}
};
struct table_segment_stats {
size_t compaction_group_count{0};
size_t segment_count{0};
std::vector<table_segment_histogram_bucket> histogram;
table_segment_stats& operator+=(const table_segment_stats& other) {
compaction_group_count += other.compaction_group_count;
segment_count += other.segment_count;
histogram.resize(std::max(histogram.size(), other.histogram.size()));
for (size_t i = 0; i < other.histogram.size(); i++) {
histogram[i] += other.histogram[i];
}
return *this;
}
};
class segment_manager_impl;
class log_index;
class segment_manager {
std::unique_ptr<segment_manager_impl> _impl;
private:
segment_manager_impl& get_impl() noexcept;
const segment_manager_impl& get_impl() const noexcept;
public:
static constexpr size_t block_alignment = 4096;
explicit segment_manager(segment_manager_config config);
~segment_manager();
segment_manager(const segment_manager&) = delete;
segment_manager& operator=(const segment_manager&) = delete;
future<> do_recovery(replica::database&);
future<> start();
future<> stop();
future<log_location> write(write_buffer& wb);
future<log_record> read(log_location location);
void free_record(log_location location);
future<> for_each_record(const std::vector<log_segment_id>& segments,
std::function<future<>(log_location, log_record)> callback);
compaction_manager& get_compaction_manager() noexcept;
const compaction_manager& get_compaction_manager() const noexcept;
void set_trigger_compaction_hook(std::function<void()> fn);
void set_trigger_separator_flush_hook(std::function<void(size_t)> fn);
size_t get_segment_size() const noexcept;
future<> discard_segments(segment_set&);
size_t get_memory_usage() const;
future<> await_pending_writes();
friend class segment_manager_impl;
};
}
}

replica/logstor/types.hh (new file, 80 lines)

@@ -0,0 +1,80 @@
/*
* Copyright (C) 2026-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include <cstdint>
#include <fmt/format.h>
#include "mutation/canonical_mutation.hh"
#include "replica/logstor/utils.hh"
#include "dht/decorated_key.hh"
namespace replica::logstor {
struct log_segment_id {
uint32_t value;
bool operator==(const log_segment_id& other) const noexcept = default;
auto operator<=>(const log_segment_id& other) const noexcept = default;
};
struct log_location {
log_segment_id segment;
uint32_t offset;
uint32_t size;
bool operator==(const log_location& other) const noexcept = default;
};
struct primary_index_key {
dht::decorated_key dk;
};
using record_generation = generation_base<uint16_t>;
using segment_generation = generation_base<uint16_t>;
struct index_entry {
log_location location;
record_generation generation;
bool operator==(const index_entry& other) const noexcept = default;
};
struct log_record {
primary_index_key key;
record_generation generation;
table_id table;
canonical_mutation mut;
};
}
// Format specialization declarations and implementations
template <>
struct fmt::formatter<replica::logstor::log_segment_id> : fmt::formatter<string_view> {
template <typename FormatContext>
auto format(const replica::logstor::log_segment_id& id, FormatContext& ctx) const {
return fmt::format_to(ctx.out(), "segment({})", id.value);
}
};
template <>
struct fmt::formatter<replica::logstor::log_location> : fmt::formatter<string_view> {
template <typename FormatContext>
auto format(const replica::logstor::log_location& loc, FormatContext& ctx) const {
return fmt::format_to(ctx.out(), "{{segment:{}, offset:{}, size:{}}}",
loc.segment, loc.offset, loc.size);
}
};
template <>
struct fmt::formatter<replica::logstor::primary_index_key> : fmt::formatter<string_view> {
template <typename FormatContext>
auto format(const replica::logstor::primary_index_key& key, FormatContext& ctx) const {
return fmt::format_to(ctx.out(), "{}", key.dk);
}
};

replica/logstor/utils.hh (new file, 104 lines)

@@ -0,0 +1,104 @@
/*
* Copyright (C) 2026-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include <concepts>
#include "serializer.hh"
namespace replica::logstor {
// An unsigned integer that can be incremented and compared with wraparound semantics.
template <std::unsigned_integral T>
class generation_base {
T _value;
public:
using underlying = T;
constexpr generation_base() noexcept : _value(0) {}
constexpr explicit generation_base(T value) noexcept : _value(value) {}
constexpr T value() const noexcept { return _value; }
constexpr generation_base& operator++() noexcept {
++_value;
return *this;
}
constexpr generation_base operator++(int) noexcept {
auto old = *this;
++_value;
return old;
}
constexpr generation_base& operator+=(T delta) noexcept {
_value += delta;
return *this;
}
constexpr generation_base operator+(T delta) const noexcept {
return generation_base(_value + delta);
}
constexpr bool operator==(const generation_base& other) const noexcept = default;
/// Comparison using wraparound semantics.
/// Returns true if this generation is less than other, accounting for wraparound.
/// Assumes generations are within half the value space of each other.
constexpr bool operator<(const generation_base& other) const noexcept {
// Use signed comparison after converting difference to signed type
// This handles wraparound: if diff > max/2, it's treated as negative
using signed_type = std::make_signed_t<T>;
auto diff = static_cast<signed_type>(_value - other._value);
return diff < 0;
}
constexpr bool operator<=(const generation_base& other) const noexcept {
return *this == other || *this < other;
}
constexpr bool operator>(const generation_base& other) const noexcept {
return other < *this;
}
constexpr bool operator>=(const generation_base& other) const noexcept {
return other <= *this;
}
};
}
template <std::unsigned_integral T>
struct fmt::formatter<replica::logstor::generation_base<T>> : fmt::formatter<T> {
template <typename FormatContext>
auto format(const replica::logstor::generation_base<T>& gen, FormatContext& ctx) const {
return fmt::formatter<T>::format(gen.value(), ctx);
}
};
namespace ser {
template <std::unsigned_integral T>
struct serializer<replica::logstor::generation_base<T>> {
template <typename Output>
static void write(Output& out, const replica::logstor::generation_base<T>& g) {
serializer<typename replica::logstor::generation_base<T>::underlying>::write(out, g.value());
}
template <typename Input>
static replica::logstor::generation_base<T> read(Input& in) {
auto val = serializer<typename replica::logstor::generation_base<T>::underlying>::read(in);
return replica::logstor::generation_base<T>(val);
}
template <typename Input>
static void skip(Input& in) {
serializer<typename replica::logstor::generation_base<T>::underlying>::skip(in);
}
};
}


@@ -0,0 +1,278 @@
/*
* Copyright (C) 2026-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "write_buffer.hh"
#include "segment_manager.hh"
#include "bytes_fwd.hh"
#include "logstor.hh"
#include "replica/logstor/types.hh"
#include <seastar/core/simple-stream.hh>
#include <seastar/core/with_scheduling_group.hh>
#include <seastar/core/on_internal_error.hh>
#include "serializer_impl.hh"
#include "idl/logstor.dist.hh"
#include "idl/logstor.dist.impl.hh"
#include <seastar/core/align.hh>
#include <seastar/core/aligned_buffer.hh>
namespace replica::logstor {
void log_record_writer::compute_size() const {
seastar::measuring_output_stream ms;
ser::serialize(ms, _record);
_size = ms.size();
}
void log_record_writer::write(ostream& out) const {
ser::serialize(out, _record);
}
// write_buffer
write_buffer::write_buffer(size_t buffer_size, bool with_record_copy)
: _buffer_size(buffer_size)
, _buffer(seastar::allocate_aligned_buffer<char>(buffer_size, 4096))
, _with_record_copy(with_record_copy)
{
if (_with_record_copy) {
_records_copy.reserve(_buffer_size / 100);
}
reset();
}
void write_buffer::reset() {
_stream = seastar::simple_memory_output_stream(_buffer.get(), _buffer_size);
_header_stream = _stream.write_substream(buffer_header_size);
_buffer_header = {};
_net_data_size = 0;
_record_count = 0;
_written = {};
_records_copy.clear();
_write_gate = {};
}
future<> write_buffer::close() {
if (!_write_gate.is_closed()) {
co_await _write_gate.close();
}
}
size_t write_buffer::get_max_write_size() const noexcept {
return _buffer_size - (buffer_header_size + record_header_size);
}
bool write_buffer::can_fit(size_t data_size) const noexcept {
// Calculate total space needed including header, data, and alignment padding
auto total_size = record_header_size + data_size;
auto aligned_size = align_up(total_size, record_alignment);
return aligned_size <= _stream.size();
}
bool write_buffer::has_data() const noexcept {
return offset_in_buffer() > buffer_header_size;
}
future<log_location_with_holder> write_buffer::write(log_record_writer writer, compaction_group* cg, seastar::gate::holder cg_holder) {
const auto data_size = writer.size();
if (!can_fit(data_size)) {
throw std::runtime_error(fmt::format("Write size {} exceeds buffer size {}", data_size, _stream.size()));
}
auto rh = record_header {
.data_size = data_size
};
ser::serialize(_stream, rh);
// Write actual data
size_t data_offset_in_buffer = offset_in_buffer();
auto data_out = _stream.write_substream(data_size);
writer.write(data_out);
_net_data_size += data_size;
_record_count++;
// Add padding to align record
pad_to_alignment(record_alignment);
auto record_location = [data_offset_in_buffer, data_size] (log_location base_location) {
return log_location {
.segment = base_location.segment,
.offset = base_location.offset + data_offset_in_buffer,
.size = data_size
};
};
if (_with_record_copy) {
_records_copy.push_back(record_in_buffer {
.writer = std::move(writer),
.offset_in_buffer = data_offset_in_buffer,
.data_size = data_size,
.loc = _written.get_shared_future().then(record_location),
.cg = cg,
.cg_holder = std::move(cg_holder)
});
}
// hold the write buffer until the write is complete, and pass the holder to the
// caller for follow-up operations that should continue holding the buffer, such
// as index updates.
auto op = _write_gate.hold();
return _written.get_shared_future().then([record_location, op = std::move(op)] (log_location base_location) mutable {
return std::make_tuple(record_location(base_location), std::move(op));
});
}
future<log_location> write_buffer::write_no_holder(log_record_writer writer) {
// Write and leave the gate immediately after the write.
// Use with care, only when the gate is not needed.
return write(std::move(writer)).then_unpack([] (log_location loc, seastar::gate::holder op) {
return loc;
});
}
void write_buffer::pad_to_alignment(size_t alignment) {
auto current_pos = offset_in_buffer();
auto next_pos = align_up(current_pos, alignment);
auto padding = next_pos - current_pos;
if (padding > 0) {
_stream.fill('\0', padding);
}
}
void write_buffer::finalize(size_t alignment) {
_buffer_header.data_size = static_cast<uint32_t>(offset_in_buffer() - buffer_header_size);
pad_to_alignment(alignment);
}
void write_buffer::write_header(segment_generation seg_gen) {
_buffer_header.magic = buffer_header_magic;
_buffer_header.seg_gen = seg_gen;
ser::serialize<buffer_header>(_header_stream, _buffer_header);
}
future<> write_buffer::complete_writes(log_location base_location) {
_written.set_value(base_location);
co_await close();
}
future<> write_buffer::abort_writes(std::exception_ptr ex) {
if (!_written.available()) {
_written.set_exception(std::move(ex));
}
co_await close();
}
std::vector<write_buffer::record_in_buffer>& write_buffer::records() {
if (!_with_record_copy) {
on_internal_error(logstor_logger, "requesting records but the write buffer has no record copy enabled");
}
return _records_copy;
}
size_t write_buffer::estimate_required_segments(size_t net_data_size, size_t record_count, size_t segment_size) {
// Calculate total size needed including headers and alignment padding
size_t total_size = record_header_size * record_count + net_data_size;
// The estimate ignores per-record alignment padding, so add a 10% overhead factor.
total_size = static_cast<size_t>(total_size * 1.1);
return align_up(total_size, segment_size) / segment_size;
}
// buffered_writer
buffered_writer::buffered_writer(segment_manager& sm, seastar::scheduling_group flush_sg)
: _sm(sm)
, _available_buffers(num_flushing_buffers)
, _flush_sg(flush_sg) {
_buffers.reserve(num_flushing_buffers + 1);
for (size_t i = 0; i < num_flushing_buffers + 1; ++i) {
_buffers.emplace_back(_sm.get_segment_size(), true);
}
_active_buffer = active_buffer {
.buf = &_buffers[0],
};
for (size_t i = 1; i < num_flushing_buffers + 1; ++i) {
_available_buffers.push(&_buffers[i]);
}
}
future<> buffered_writer::start() {
logstor_logger.info("Starting write buffer");
co_return;
}
future<> buffered_writer::stop() {
if (_async_gate.is_closed()) {
co_return;
}
logstor_logger.info("Stopping write buffer");
co_await _async_gate.close();
logstor_logger.info("Write buffer stopped");
}
future<log_location_with_holder> buffered_writer::write(log_record record, compaction_group* cg, seastar::gate::holder cg_holder) {
auto holder = _async_gate.hold();
log_record_writer writer(std::move(record));
if (writer.size() > _active_buffer.buf->get_max_write_size()) {
throw std::runtime_error(fmt::format("Write size {} exceeds buffer size {}", writer.size(), _active_buffer.buf->get_max_write_size()));
}
// Check if write fits in current buffer
while (!_active_buffer.buf->can_fit(writer)) {
co_await _buffer_switched.wait();
}
// Write to buffer at current position
auto fut = _active_buffer.buf->write(std::move(writer), cg, std::move(cg_holder));
// Trigger flush for the active buffer if not in progress
if (!std::exchange(_active_buffer.flush_requested, true)) {
(void)with_gate(_async_gate, [this] {
return switch_buffer().then([this] (write_buffer* old_buf) mutable {
return with_scheduling_group(_flush_sg, [this, old_buf] mutable {
return flush(old_buf);
});
});
});
}
co_return co_await std::move(fut);
}
future<write_buffer*> buffered_writer::switch_buffer() {
// Wait for and get the next available buffer
auto new_buf = co_await _available_buffers.pop_eventually();
auto next_active_buffer = active_buffer {
.buf = std::move(new_buf),
};
auto old_active_buffer = std::exchange(_active_buffer, std::move(next_active_buffer));
_buffer_switched.broadcast();
co_return std::move(old_active_buffer.buf);
}
future<> buffered_writer::flush(write_buffer* buf) {
co_await _sm.write(*buf);
// Return the flushed buffer to the available queue
buf->reset();
_available_buffers.push(std::move(buf));
}
}


@@ -0,0 +1,294 @@
/*
* Copyright (C) 2026-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include <seastar/core/future.hh>
#include <seastar/core/gate.hh>
#include <seastar/core/temporary_buffer.hh>
#include <seastar/core/aligned_buffer.hh>
#include <seastar/core/condition-variable.hh>
#include <seastar/core/scheduling.hh>
#include <seastar/core/semaphore.hh>
#include <seastar/core/queue.hh>
#include <seastar/core/simple-stream.hh>
#include <seastar/core/shared_future.hh>
#include "types.hh"
#include "serializer.hh"
namespace replica {
class compaction_group;
namespace logstor {
class segment_manager;
// Writer for log records that handles serialization and size computation
class log_record_writer {
using ostream = seastar::simple_memory_output_stream;
log_record _record;
mutable std::optional<size_t> _size;
void compute_size() const;
public:
explicit log_record_writer(log_record record)
: _record(std::move(record))
{}
// Get serialized size (computed lazily)
size_t size() const {
if (!_size) {
compute_size();
}
return *_size;
}
// Write the record to an output stream
void write(ostream& out) const;
const log_record& record() const {
return _record;
}
};
using log_location_with_holder = std::tuple<log_location, seastar::gate::holder>;
// Manages a single aligned buffer for accumulating records and writing
// them to the segment manager.
//
// usage:
//
// create write buffer with specified size:
// write_buffer wb(buffer_size);
// write data to the buffer if fits and get a future for the log location when flushed:
// log_record_writer writer(record);
// auto loc_fut = wb.write(writer);
// flush the buffer to the segment manager:
// co_await sm.write(wb);
// await individual write locations:
// auto record_loc = co_await std::move(loc_fut);
class write_buffer {
public:
using ostream = seastar::simple_memory_output_stream;
// buffer: buffer_header | record_1 | ... | record_n | 0-padding
// record: record_header | record_data | 0-padding
//
// buffer_header and record are aligned by record_alignment
// buffer_header and record_header have explicit sizes and serialization below
static constexpr uint32_t buffer_header_magic = 0x4c475342;
static constexpr size_t record_alignment = 8;
struct buffer_header {
uint32_t magic;
uint32_t data_size; // size of all records data following the buffer_header
segment_generation seg_gen;
uint16_t reserved1;
uint32_t reserved2;
};
static constexpr size_t buffer_header_size = 3 * sizeof(uint32_t) + sizeof(uint16_t) + sizeof(segment_generation::underlying);
static_assert(buffer_header_size % record_alignment == 0, "Buffer header size must be aligned by record_alignment");
struct record_header {
uint32_t data_size; // size of the record data following the record_header
};
static constexpr size_t record_header_size = sizeof(uint32_t);
private:
using aligned_buffer_type = std::unique_ptr<char[], free_deleter>;
size_t _buffer_size;
aligned_buffer_type _buffer;
seastar::simple_memory_output_stream _stream;
buffer_header _buffer_header;
seastar::simple_memory_output_stream _header_stream;
size_t _net_data_size{0};
size_t _record_count{0};
shared_promise<log_location> _written;
seastar::gate _write_gate;
struct record_in_buffer {
log_record_writer writer;
size_t offset_in_buffer;
size_t data_size;
future<log_location> loc;
compaction_group* cg;
seastar::gate::holder cg_holder;
};
bool _with_record_copy;
std::vector<record_in_buffer> _records_copy;
public:
write_buffer(size_t buffer_size, bool with_record_copy);
void reset();
write_buffer(const write_buffer&) = delete;
write_buffer& operator=(const write_buffer&) = delete;
write_buffer(write_buffer&&) noexcept = default;
write_buffer& operator=(write_buffer&&) noexcept = default;
future<> close();
size_t get_buffer_size() const noexcept { return _buffer_size; }
size_t offset_in_buffer() const noexcept { return _buffer_size - _stream.size(); }
bool can_fit(size_t data_size) const noexcept;
bool can_fit(const log_record_writer& writer) const noexcept {
return can_fit(writer.size());
}
bool has_data() const noexcept;
size_t get_max_write_size() const noexcept;
size_t get_net_data_size() const noexcept { return _net_data_size; }
size_t get_record_count() const noexcept { return _record_count; }
// Write a record to the buffer.
// Returns a future that will be resolved with the log location once flushed and a gate holder
// that keeps the write buffer open. The gate should be held for index updates after the write
// is done.
future<log_location_with_holder> write(log_record_writer, compaction_group*, seastar::gate::holder cg_holder);
future<log_location_with_holder> write(log_record_writer writer) {
return write(std::move(writer), nullptr, {});
}
// Write a record to the buffer.
// Returns a future that will be resolved with the log location once flushed.
// If there are follow-up operations to the write such as index updates then consider
// using write_with_holder instead to keep the write buffer open until those operations are complete.
future<log_location> write_no_holder(log_record_writer);
static size_t estimate_required_segments(size_t net_data_size, size_t record_count, size_t segment_size);
private:
const char* data() const noexcept { return _buffer.get(); }
void write_header(segment_generation);
// Get all records written to the buffer.
// with_record_copy must be set to true when creating the write_buffer.
std::vector<record_in_buffer>& records();
/// Complete all tracked writes with their locations when the buffer is flushed to base_location
future<> complete_writes(log_location base_location);
future<> abort_writes(std::exception_ptr);
void pad_to_alignment(size_t alignment);
void finalize(size_t alignment);
friend class segment_manager_impl;
friend class compaction_manager_impl;
};
// Manages multiple buffers, a single active buffer and multiple flushing buffers.
// When a switch is requested for the active buffer, it waits for a flushing buffer to
// become available, and continues to accumulate writes until then.
class buffered_writer {
static constexpr size_t num_flushing_buffers = 4;
segment_manager& _sm;
struct active_buffer {
write_buffer* buf;
bool flush_requested{false};
} _active_buffer;
std::vector<write_buffer> _buffers;
seastar::queue<write_buffer*> _available_buffers;
seastar::gate _async_gate;
seastar::condition_variable _buffer_switched;
seastar::scheduling_group _flush_sg;
public:
explicit buffered_writer(segment_manager& sm, seastar::scheduling_group flush_sg);
buffered_writer(const buffered_writer&) = delete;
buffered_writer& operator=(const buffered_writer&) = delete;
future<> start();
future<> stop();
future<log_location_with_holder> write(log_record, compaction_group* cg = nullptr, seastar::gate::holder cg_holder = {});
private:
future<write_buffer*> switch_buffer();
future<> flush(write_buffer*);
};
}
}
namespace ser {
template <>
struct serializer<replica::logstor::write_buffer::buffer_header> {
template <typename Output>
static void write(Output& out, const replica::logstor::write_buffer::buffer_header& h) {
serializer<uint32_t>::write(out, h.magic);
serializer<uint32_t>::write(out, h.data_size);
serializer<replica::logstor::segment_generation>::write(out, h.seg_gen);
serializer<uint16_t>::write(out, h.reserved1);
serializer<uint32_t>::write(out, h.reserved2);
}
template <typename Input>
static replica::logstor::write_buffer::buffer_header read(Input& in) {
replica::logstor::write_buffer::buffer_header h;
h.magic = serializer<uint32_t>::read(in);
h.data_size = serializer<uint32_t>::read(in);
h.seg_gen = serializer<replica::logstor::segment_generation>::read(in);
h.reserved1 = serializer<uint16_t>::read(in);
h.reserved2 = serializer<uint32_t>::read(in);
return h;
}
template <typename Input>
static void skip(Input& in) {
serializer<uint32_t>::skip(in);
serializer<uint32_t>::skip(in);
serializer<replica::logstor::segment_generation>::skip(in);
serializer<uint16_t>::skip(in);
serializer<uint32_t>::skip(in);
}
};
template <>
struct serializer<replica::logstor::write_buffer::record_header> {
template <typename Output>
static void write(Output& out, const replica::logstor::write_buffer::record_header& h) {
serializer<uint32_t>::write(out, h.data_size);
}
template <typename Input>
static replica::logstor::write_buffer::record_header read(Input& in) {
replica::logstor::write_buffer::record_header h;
h.data_size = serializer<uint32_t>::read(in);
return h;
}
template <typename Input>
static void skip(Input& in) {
serializer<uint32_t>::skip(in);
}
};
} // namespace ser


@@ -217,6 +217,17 @@ table::add_memtables_to_reader_list(std::vector<mutation_reader>& readers,
}
}
mutation_reader
table::make_logstor_mutation_reader(schema_ptr s,
reader_permit permit,
const dht::partition_range& pr,
const query::partition_slice& slice,
tracing::trace_state_ptr trace_state,
streamed_mutation::forwarding fwd,
mutation_reader::forwarding fwd_mr) const {
return _logstor->make_reader(std::move(s), logstor_index(), std::move(permit), pr, slice, std::move(trace_state));
}
mutation_reader
table::make_mutation_reader(schema_ptr s,
reader_permit permit,
@@ -229,6 +240,10 @@ table::make_mutation_reader(schema_ptr s,
return (*_virtual_reader).make_mutation_reader(s, std::move(permit), range, slice, trace_state, fwd, fwd_mr);
}
if (_logstor) [[unlikely]] {
return make_logstor_mutation_reader(s, std::move(permit), range, slice, std::move(trace_state), fwd, fwd_mr);
}
std::vector<mutation_reader> readers;
// We're assuming that cache and memtables are both read atomically
@@ -716,7 +731,9 @@ public:
return make_ready_future<>();
}
void update_effective_replication_map(const locator::effective_replication_map& erm, noncopyable_function<void()> refresh_mutation_source) override {}
void update_effective_replication_map(const locator::effective_replication_map_ptr& old_erm,
const locator::effective_replication_map& erm,
noncopyable_function<void()> refresh_mutation_source) override {}
compaction_group& compaction_group_for_token(dht::token token) const override {
return get_compaction_group();
@@ -762,6 +779,11 @@ public:
}
};
struct background_merge_guard {
compaction::compaction_reenabler compaction_guard;
locator::effective_replication_map_ptr erm_guard;
};
class tablet_storage_group_manager final : public storage_group_manager {
replica::table& _t;
locator::host_id _my_host_id;
@@ -782,7 +804,7 @@ class tablet_storage_group_manager final : public storage_group_manager {
utils::phased_barrier _merge_fiber_barrier;
std::optional<utils::phased_barrier::operation> _pending_merge_fiber_work;
// Holds compaction reenabler which disables compaction temporarily during tablet merge
std::vector<compaction::compaction_reenabler> _compaction_reenablers_for_merging;
std::vector<background_merge_guard> _compaction_reenablers_for_merging;
private:
const schema_ptr& schema() const {
return _t.schema();
@@ -806,7 +828,8 @@ private:
// Called when coordinator executes tablet merge. Tablet ids X and X+1 are merged into
// the new tablet id (X >> 1). In practice, that means storage groups for X and X+1
// are merged into a new storage group with id (X >> 1).
void handle_tablet_merge_completion(const locator::tablet_map& old_tmap, const locator::tablet_map& new_tmap);
void handle_tablet_merge_completion(locator::effective_replication_map_ptr old_erm,
const locator::tablet_map& old_tmap, const locator::tablet_map& new_tmap);
// When merge completes, compaction groups of sibling tablets are added to same storage
// group, but they're not merged yet into one, since the merge completion handler happens
@@ -822,9 +845,8 @@ private:
return tablet_map().get_tablet_id(t).value();
}
std::pair<size_t, locator::tablet_range_side> storage_group_of(dht::token t) const {
auto [id, side] = tablet_map().get_tablet_id_and_range_side(t);
auto idx = id.value();
size_t storage_group_of(dht::token t) const {
auto idx = tablet_id_for_token(t);
#ifndef SCYLLA_BUILD_MODE_RELEASE
if (idx >= tablet_count()) {
on_fatal_internal_error(tlogger, format("storage_group_of: index out of range: idx={} size_log2={} size={} token={}",
@@ -836,7 +858,7 @@ private:
idx, sg.token_range(), t));
}
#endif
return { idx, side };
return idx;
}
repair_classifier_func make_repair_sstable_classifier_func() const {
@@ -900,7 +922,9 @@ public:
std::exchange(_stop_fut, make_ready_future())).discard_result();
}
void update_effective_replication_map(const locator::effective_replication_map& erm, noncopyable_function<void()> refresh_mutation_source) override;
void update_effective_replication_map(const locator::effective_replication_map_ptr& old_erm,
const locator::effective_replication_map& erm,
noncopyable_function<void()> refresh_mutation_source) override;
compaction_group& compaction_group_for_token(dht::token token) const override;
utils::chunked_vector<storage_group_ptr> storage_groups_for_token_range(dht::token_range tr) const override;
@@ -911,7 +935,7 @@ public:
return log2ceil(tablet_map().tablet_count());
}
storage_group& storage_group_for_token(dht::token token) const override {
return storage_group_for_id(storage_group_of(token).first);
return storage_group_for_id(storage_group_of(token));
}
locator::combined_load_stats table_load_stats() const override;
@@ -959,9 +983,20 @@ size_t storage_group::to_idx(locator::tablet_range_side side) const {
return size_t(side);
}
compaction_group_ptr& storage_group::select_compaction_group(locator::tablet_range_side side) noexcept {
compaction_group_ptr& storage_group::select_compaction_group(dht::token token, const locator::tablet_map& tmap) noexcept {
if (splitting_mode()) {
return _split_ready_groups[to_idx(side)];
return _split_ready_groups[to_idx(tmap.get_tablet_range_side(token))];
}
return _main_cg;
}
compaction_group_ptr& storage_group::select_compaction_group(dht::token first, dht::token last, const locator::tablet_map& tmap) noexcept {
if (splitting_mode()) {
auto first_side = tmap.get_tablet_range_side(first);
auto last_side = tmap.get_tablet_range_side(last);
if (first_side == last_side) {
return _split_ready_groups[to_idx(first_side)];
}
}
return _main_cg;
}
@@ -1056,6 +1091,38 @@ future<> compaction_group::split(compaction::compaction_type_options::split opt,
}
}
future<> compaction_group::discard_logstor_segments() {
auto& sm = get_logstor_segment_manager();
co_await sm.discard_segments(*_logstor_segments);
}
future<> compaction_group::flush_separator(std::optional<size_t> seq_num) {
auto units = co_await get_units(_separator_flush_sem, 1);
auto pending = std::exchange(_separator_flushes, {});
if (_logstor_separator && (!seq_num || _logstor_separator->min_seq_num < *seq_num)) {
auto& cm = get_logstor_compaction_manager();
auto b = std::move(*_logstor_separator);
_logstor_separator.reset();
pending.push_back(cm.flush_separator_buffer(std::move(b), *this));
}
co_await when_all(pending.begin(), pending.end());
}
logstor::separator_buffer& compaction_group::get_separator_buffer(size_t write_size) {
if (!_logstor_separator || !_logstor_separator->can_fit(write_size)) {
auto& cm = get_logstor_compaction_manager();
if (_logstor_separator) {
auto b = std::move(*_logstor_separator);
_logstor_separator.reset();
std::erase_if(_separator_flushes, [](future<>& f) { return f.available(); });
_separator_flushes.push_back(cm.flush_separator_buffer(std::move(b), *this));
}
_logstor_separator.emplace(cm.allocate_separator_buffer());
}
return *_logstor_separator;
}
future<> storage_group::split(compaction::compaction_type_options::split opt, tasks::task_info tablet_split_task_info) {
if (set_split_mode()) {
co_return;
@@ -1222,9 +1289,9 @@ storage_group& table::storage_group_for_id(size_t i) const {
}
compaction_group& tablet_storage_group_manager::compaction_group_for_token(dht::token token) const {
auto [idx, range_side] = storage_group_of(token);
auto idx = storage_group_of(token);
auto& sg = storage_group_for_id(idx);
return *sg.select_compaction_group(range_side);
return *sg.select_compaction_group(token, tablet_map());
}
compaction_group& table::compaction_group_for_token(dht::token token) const {
@@ -1265,8 +1332,8 @@ compaction_group& table::compaction_group_for_key(partition_key_view key, const
}
compaction_group& tablet_storage_group_manager::compaction_group_for_sstable(const sstables::shared_sstable& sst) const {
auto [first_id, first_range_side] = storage_group_of(sst->get_first_decorated_key().token());
auto [last_id, last_range_side] = storage_group_of(sst->get_last_decorated_key().token());
auto first_id = storage_group_of(sst->get_first_decorated_key().token());
auto last_id = storage_group_of(sst->get_last_decorated_key().token());
auto sstable_desc = [] (const sstables::shared_sstable& sst) {
auto& identifier_opt = sst->sstable_identifier();
@@ -1289,12 +1356,10 @@ compaction_group& tablet_storage_group_manager::compaction_group_for_sstable(con
try {
auto& sg = storage_group_for_id(first_id);
if (first_range_side != last_range_side) {
return *sg.main_compaction_group();
}
return *sg.select_compaction_group(first_range_side);
return *sg.select_compaction_group(
sst->get_first_decorated_key().token(),
sst->get_last_decorated_key().token(),
tablet_map());
} catch (std::out_of_range& e) {
on_internal_error(tlogger, format("Unable to load SSTable {} of tablet {}, due to {}",
sstable_desc(sst),
@@ -1465,6 +1530,7 @@ table::add_new_sstable_and_update_cache(sstables::shared_sstable new_sst,
sstables::offstrategy offstrategy) {
std::vector<sstables::shared_sstable> ret, ssts;
std::exception_ptr ex;
log_level failure_log_level = log_level::error;
try {
bool trigger_compaction = offstrategy == sstables::offstrategy::no;
auto& cg = compaction_group_for_sstable(new_sst);
@@ -1486,6 +1552,9 @@ table::add_new_sstable_and_update_cache(sstables::shared_sstable new_sst,
co_await do_add_sstable_and_update_cache(cg, sst, offstrategy, trigger_compaction);
sst = nullptr;
}
} catch (compaction::compaction_stopped_exception&) {
failure_log_level = log_level::warn;
ex = std::current_exception();
} catch (...) {
ex = std::current_exception();
}
@@ -1493,13 +1562,13 @@ table::add_new_sstable_and_update_cache(sstables::shared_sstable new_sst,
if (ex) {
// on failed split, input sstable is unlinked here.
if (new_sst) {
tlogger.error("Failed to load SSTable {} of origin {} due to {}, it will be unlinked...", new_sst->get_filename(), new_sst->get_origin(), ex);
tlogger.log(failure_log_level, "Failed to load SSTable {} of origin {} due to {}, it will be unlinked...", new_sst->get_filename(), new_sst->get_origin(), ex);
co_await new_sst->unlink();
}
// on failure after successful split, sstables not attached yet will be unlinked
co_await coroutine::parallel_for_each(ssts, [&ex] (sstables::shared_sstable sst) -> future<> {
co_await coroutine::parallel_for_each(ssts, [&ex, failure_log_level] (sstables::shared_sstable sst) -> future<> {
if (sst) {
tlogger.error("Failed to load SSTable {} of origin {} due to {}, it will be unlinked...", sst->get_filename(), sst->get_origin(), ex);
tlogger.log(failure_log_level, "Failed to load SSTable {} of origin {} due to {}, it will be unlinked...", sst->get_filename(), sst->get_origin(), ex);
co_await sst->unlink();
}
});
@@ -1513,6 +1582,7 @@ table::add_new_sstables_and_update_cache(std::vector<sstables::shared_sstable> n
std::function<future<>(sstables::shared_sstable)> on_add) {
std::exception_ptr ex;
std::vector<sstables::shared_sstable> ret;
log_level failure_log_level = log_level::error;
// We rely on add_new_sstable_and_update_cache() to unlink the sstable fed into it,
// so the exception handling below will only have to unlink sstables not processed yet.
@@ -1522,14 +1592,17 @@ table::add_new_sstables_and_update_cache(std::vector<sstables::shared_sstable> n
std::ranges::move(ssts, std::back_inserter(ret));
}
} catch (compaction::compaction_stopped_exception&) {
failure_log_level = log_level::warn;
ex = std::current_exception();
} catch (...) {
ex = std::current_exception();
}
if (ex) {
co_await coroutine::parallel_for_each(new_ssts, [&ex] (sstables::shared_sstable sst) -> future<> {
co_await coroutine::parallel_for_each(new_ssts, [&ex, failure_log_level] (sstables::shared_sstable sst) -> future<> {
if (sst) {
tlogger.error("Failed to load SSTable {} of origin {} due to {}, it will be unlinked...", sst->get_filename(), sst->get_origin(), ex);
tlogger.log(failure_log_level, "Failed to load SSTable {} of origin {} due to {}, it will be unlinked...", sst->get_filename(), sst->get_origin(), ex);
co_await sst->unlink();
}
});
@@ -1568,6 +1641,19 @@ table::update_cache(compaction_group& cg, lw_shared_ptr<memtable> m, std::vector
}
}
bool table::add_logstor_segment(logstor::segment_descriptor& seg_desc, dht::token first_token, dht::token last_token) {
auto& cg = compaction_group_for_token(first_token);
if (&cg != &compaction_group_for_token(last_token)) {
return false;
}
cg.add_logstor_segment(seg_desc);
return true;
}
logstor::separator_buffer& table::get_logstor_separator_buffer(dht::token token, size_t write_size) {
return compaction_group_for_token(token).get_separator_buffer(write_size);
}
// Handles permit management only, used for situations where we don't want to inform
// the compaction manager about backlogs (i.e., tests)
class permit_monitor : public sstables::write_monitor {
@@ -1765,7 +1851,9 @@ table::seal_active_memtable(compaction_group& cg, flush_permit&& flush_permit) n
utils::get_local_injector().inject("table_seal_active_memtable_try_flush", []() {
throw std::system_error(ENOSPC, std::system_category(), "Injected error");
});
co_return co_await this->try_flush_memtable_to_sstable(cg, old, std::move(write_permit));
co_await this->try_flush_memtable_to_sstable(cg, old, std::move(write_permit));
// signal a memtable was sealed
utils::get_local_injector().receive_message("table_seal_post_flush_waiters");
});
undo_stats.reset();
@@ -2021,8 +2109,15 @@ size_t compaction_group::live_sstable_count() const noexcept {
return _main_sstables->size() + _maintenance_sstables->size();
}
size_t compaction_group::logstor_disk_space_used() const noexcept {
if (!_logstor_segments || !_t.uses_logstor()) {
return 0;
}
return _logstor_segments->segment_count() * _t.get_logstor_segment_manager().get_segment_size();
}
uint64_t compaction_group::live_disk_space_used() const noexcept {
return _main_sstables->bytes_on_disk() + _maintenance_sstables->bytes_on_disk();
return _main_sstables->bytes_on_disk() + _maintenance_sstables->bytes_on_disk() + logstor_disk_space_used();
}
sstables::file_size_stats compaction_group::live_disk_space_used_full_stats() const noexcept {
@@ -2372,6 +2467,12 @@ void table::trigger_compaction() {
});
}
void table::trigger_logstor_compaction() {
for_each_compaction_group([] (compaction_group& cg) {
cg.trigger_logstor_compaction();
});
}
void table::try_trigger_compaction(compaction_group& cg) noexcept {
try {
cg.trigger_compaction();
@@ -2380,6 +2481,51 @@ void table::try_trigger_compaction(compaction_group& cg) noexcept {
}
}
future<> table::flush_separator(std::optional<size_t> seq_num) {
if (!uses_logstor()) {
co_return;
}
// wait for all previous writes to be written to a separator buffer
co_await get_logstor_segment_manager().await_pending_writes();
// flush separator buffers
co_await parallel_foreach_compaction_group([seq_num] (compaction_group& cg) {
return cg.flush_separator(seq_num);
});
}
future<logstor::table_segment_stats> table::get_logstor_segment_stats() const {
logstor::table_segment_stats result;
if (!uses_logstor()) {
co_return std::move(result);
}
const auto segment_size = get_logstor_segment_manager().get_segment_size();
const auto bucket_count = 32;
const auto bucket_size = segment_size / bucket_count;
result.histogram.resize(bucket_count);
co_await const_cast<table*>(this)->parallel_foreach_compaction_group([&] (const compaction_group& cg) -> future<> {
const auto& cg_segments = cg.logstor_segments();
result.compaction_group_count++;
result.segment_count += cg_segments.segment_count();
for (const auto& desc : cg_segments._segments) {
co_await coroutine::maybe_yield();
auto data_size = desc.net_data_size(segment_size);
auto bucket_index = std::min<size_t>(data_size / bucket_size, bucket_count - 1);
auto& bucket = result.histogram[bucket_index];
bucket.count++;
bucket.max_data_size = std::max(bucket.max_data_size, data_size);
}
});
co_return std::move(result);
}
void compaction_group::trigger_compaction() {
// But not if we're locked out or stopping
if (!_async_gate.is_closed()) {
@@ -2390,6 +2536,14 @@ void compaction_group::trigger_compaction() {
}
}
void compaction_group::trigger_logstor_compaction() {
if (!_async_gate.is_closed() && !_t.is_auto_compaction_disabled_by_user()) {
if (_logstor_segments) {
get_logstor_compaction_manager().submit(*this);
}
}
}
void table::trigger_offstrategy_compaction() {
// Run in background.
// This is safe since the compaction task is tracked

@@ -2846,6 +3000,7 @@ compaction_group::compaction_group(table& t, size_t group_id, dht::token_range t
, _async_gate(format("[compaction_group {}.{} {}]", t.schema()->ks_name(), t.schema()->cf_name(), group_id))
, _backlog_tracker(t.get_compaction_strategy().make_backlog_tracker())
, _repair_sstable_classifier(std::move(repair_classifier))
, _logstor_segments(make_lw_shared<logstor::segment_set>())
{
}
@@ -2879,9 +3034,13 @@ future<> compaction_group::stop(sstring reason) noexcept {
for (auto view : all_views()) {
co_await _t._compaction_manager.stop_ongoing_compactions(reason, view);
}
if (_t.uses_logstor()) {
co_await get_logstor_compaction_manager().stop_ongoing_compactions(*this);
}
co_await _async_gate.close();
auto flush_future = co_await seastar::coroutine::as_future(flush());
co_await flush_separator();
co_await _flush_gate.close();
co_await _sstable_add_gate.close();
// FIXME: indentation
@@ -3198,7 +3357,9 @@ future<> tablet_storage_group_manager::merge_completion_fiber() {
}
}
void tablet_storage_group_manager::handle_tablet_merge_completion(const locator::tablet_map& old_tmap, const locator::tablet_map& new_tmap) {
void tablet_storage_group_manager::handle_tablet_merge_completion(locator::effective_replication_map_ptr old_erm,
const locator::tablet_map& old_tmap,
const locator::tablet_map& new_tmap) {
auto table_id = schema()->id();
size_t old_tablet_count = old_tmap.tablet_count();
size_t new_tablet_count = new_tmap.tablet_count();
@@ -3222,7 +3383,7 @@ void tablet_storage_group_manager::handle_tablet_merge_completion(const locator:
auto new_cg = make_lw_shared<compaction_group>(_t, new_tid, new_range, make_repair_sstable_classifier_func());
for (auto& view : new_cg->all_views()) {
auto cre = _t.get_compaction_manager().stop_and_disable_compaction_no_wait(*view, "tablet merging");
_compaction_reenablers_for_merging.push_back(std::move(cre));
_compaction_reenablers_for_merging.push_back(background_merge_guard{std::move(cre), old_erm});
}
auto new_sg = make_lw_shared<storage_group>(std::move(new_cg));
@@ -3255,7 +3416,11 @@ void tablet_storage_group_manager::handle_tablet_merge_completion(const locator:
_merge_completion_event.signal();
}
void tablet_storage_group_manager::update_effective_replication_map(const locator::effective_replication_map& erm, noncopyable_function<void()> refresh_mutation_source) {
void tablet_storage_group_manager::update_effective_replication_map(
const locator::effective_replication_map_ptr& old_erm,
const locator::effective_replication_map& erm,
noncopyable_function<void()> refresh_mutation_source)
{
auto* new_tablet_map = &erm.get_token_metadata().tablets().get_tablet_map(schema()->id());
auto* old_tablet_map = std::exchange(_tablet_map, new_tablet_map);
@@ -3271,7 +3436,7 @@ void tablet_storage_group_manager::update_effective_replication_map(const locato
if (utils::get_local_injector().is_enabled("tablet_force_tablet_count_decrease_once")) {
utils::get_local_injector().disable("tablet_force_tablet_count_decrease");
}
handle_tablet_merge_completion(*old_tablet_map, *new_tablet_map);
handle_tablet_merge_completion(old_erm, *old_tablet_map, *new_tablet_map);
}
// Allocate storage group if tablet is migrating in, or deallocate if it's migrating out.
@@ -3357,7 +3522,7 @@ void table::update_effective_replication_map(locator::effective_replication_map_
};
if (uses_tablets()) {
_sg_manager->update_effective_replication_map(*_erm, refresh_mutation_source);
_sg_manager->update_effective_replication_map(old_erm, *_erm, refresh_mutation_source);
}
if (old_erm) {
old_erm->invalidate();
@@ -4002,6 +4167,7 @@ future<std::unordered_map<sstring, table::snapshot_details>> table::get_snapshot
}
auto lister = directory_lister(snapshots_dir, lister::dir_entry_types::of<directory_entry_type::directory>());
auto close_lister = deferred_close(lister);
while (auto de = lister.get().get()) {
auto snapshot_name = de->name;
all_snapshots.emplace(snapshot_name, snapshot_details());
@@ -4009,6 +4175,9 @@ future<std::unordered_map<sstring, table::snapshot_details>> table::get_snapshot
auto& sd = all_snapshots.at(snapshot_name);
sd.total += details.total;
sd.live += details.live;
utils::get_local_injector().inject("get_snapshot_details", [&] (auto& handler) -> future<> {
throw std::runtime_error("Injected exception in get_snapshot_details");
}).get();
}
}
return all_snapshots;
@@ -4028,53 +4197,66 @@ future<table::snapshot_details> table::get_snapshot_details(fs::path snapshot_di
}
auto lister = directory_lister(snapshot_directory, snapshot_dir, lister::dir_entry_types::of<directory_entry_type::regular>());
while (auto de = co_await lister.get()) {
const auto& name = de->name;
future<stat_data> (&file_stat)(file& directory, std::string_view name, follow_symlink) noexcept = seastar::file_stat;
auto sd = co_await io_check(file_stat, snapshot_directory, name, follow_symlink::no);
auto size = sd.allocated_size;
std::exception_ptr ex;
try {
while (auto de = co_await lister.get()) {
const auto& name = de->name;
future<stat_data> (&file_stat)(file& directory, std::string_view name, follow_symlink) noexcept = seastar::file_stat;
auto sd = co_await io_check(file_stat, snapshot_directory, name, follow_symlink::no);
auto size = sd.allocated_size;
// The manifest and schema.sql files are the only files expected to be in this directory not belonging to the SSTable.
//
// All the others should just generate an exception: there is something wrong, so don't blindly
// add it to the size.
if (name != "manifest.json" && name != "schema.cql") {
details.total += size;
if (sd.number_of_links == 1) {
// File exists only in the snapshot directory.
details.live += size;
utils::get_local_injector().inject("per-snapshot-get_snapshot_details", [&] (auto& handler) -> future<> {
throw std::runtime_error("Injected exception in per-snapshot-get_snapshot_details");
}).get();
// The manifest and schema.cql files are the only files expected to be in this directory not belonging to the SSTable.
//
// All the others should just generate an exception: there is something wrong, so don't blindly
// add it to the size.
if (name != "manifest.json" && name != "schema.cql") {
details.total += size;
if (sd.number_of_links == 1) {
// File exists only in the snapshot directory.
details.live += size;
continue;
}
// If the number of links is greater than 1, it is still possible that the file is linked to another snapshot
// So check the datadir for the file too.
} else {
continue;
}
// If the number of links is greater than 1, it is still possible that the file is linked to another snapshot
// So check the datadir for the file too.
} else {
continue;
}
auto exists_in_dir = [&] (file& dir, const fs::path& path, std::string_view name) -> future<bool> {
try {
// File exists in the main SSTable directory. Snapshots are not contributing to size
auto psd = co_await io_check(file_stat, dir, name, follow_symlink::no);
// File in main SSTable directory must be hardlinked to the file in the snapshot dir with the same name.
if (psd.device_id != sd.device_id || psd.inode_number != sd.inode_number) {
dblog.warn("[{} device_id={} inode_number={} size={}] is not the same file as [{} device_id={} inode_number={} size={}]",
(path / name).native(), psd.device_id, psd.inode_number, psd.size,
(snapshot_dir / name).native(), sd.device_id, sd.inode_number, sd.size);
auto exists_in_dir = [&] (file& dir, const fs::path& path, std::string_view name) -> future<bool> {
try {
// File exists in the main SSTable directory. Snapshots are not contributing to size
auto psd = co_await io_check(file_stat, dir, name, follow_symlink::no);
// File in main SSTable directory must be hardlinked to the file in the snapshot dir with the same name.
if (psd.device_id != sd.device_id || psd.inode_number != sd.inode_number) {
dblog.warn("[{} device_id={} inode_number={} size={}] is not the same file as [{} device_id={} inode_number={} size={}]",
(path / name).native(), psd.device_id, psd.inode_number, psd.size,
(snapshot_dir / name).native(), sd.device_id, sd.inode_number, sd.size);
co_return false;
}
co_return true;
} catch (std::system_error& e) {
if (e.code() != std::error_code(ENOENT, std::system_category())) {
throw;
}
co_return false;
}
};
// Check staging dir first, as files might be moved from there to the datadir concurrently to this check
if ((!staging_dir || !co_await exists_in_dir(staging_directory, *staging_dir, name)) &&
!co_await exists_in_dir(data_directory, datadir, name)) {
details.live += size;
}
co_return true;
} catch (std::system_error& e) {
if (e.code() != std::error_code(ENOENT, std::system_category())) {
throw;
}
co_return false;
}
};
// Check staging dir first, as files might be moved from there to the datadir concurrently to this check
if ((!staging_dir || !co_await exists_in_dir(staging_directory, *staging_dir, name)) &&
!co_await exists_in_dir(data_directory, datadir, name)) {
details.live += size;
}
} catch (...) {
ex = std::current_exception();
}
co_await lister.close();
if (ex) {
co_await coroutine::return_exception_ptr(std::move(ex));
}
co_return details;
@@ -4261,6 +4443,18 @@ future<db::replay_position> table::discard_sstables(db_clock::time_point truncat
co_return rp;
}
future<> table::discard_logstor_segments() {
if (!uses_logstor()) {
co_return;
}
_logstor_index->clear();
co_await parallel_foreach_compaction_group([] (compaction_group& cg) {
return cg.discard_logstor_segments();
});
}
void table::mark_ready_for_writes(db::commitlog* cl) {
if (!_readonly) {
on_internal_error(dblog, ::format("table {}.{} is already writable", _schema->ks_name(), _schema->cf_name()));
@@ -4271,6 +4465,19 @@ void table::mark_ready_for_writes(db::commitlog* cl) {
_readonly = false;
}
void table::init_logstor(logstor::logstor* ls) {
_logstor = ls;
_logstor_index = std::make_unique<logstor::primary_index>(_schema);
}
size_t table::get_logstor_memory_usage() const {
size_t m = 0;
if (_logstor_index) {
m += _logstor_index->get_memory_usage();
}
return m;
}
db::commitlog* table::commitlog() const {
if (_readonly) [[unlikely]] {
on_internal_error(dblog, ::format("table {}.{} is readonly", _schema->ks_name(), _schema->cf_name()));
@@ -4295,6 +4502,9 @@ void table::set_schema(schema_ptr s) {
if (_counter_cell_locks) {
_counter_cell_locks->set_schema(s);
}
if (_logstor_index) {
_logstor_index->set_schema(s);
}
_schema = std::move(s);
for (auto&& v : _views) {
@@ -4522,6 +4732,11 @@ future<> table::apply(const mutation& m, db::rp_handle&& h, db::timeout_clock::t
auto& cg = compaction_group_for_token(m.token());
auto holder = cg.async_gate().hold();
if (_logstor) [[unlikely]] {
return _logstor->write(m, cg, std::move(holder));
}
return dirty_memory_region_group().run_when_memory_available([this, &m, h = std::move(h), &cg, holder = std::move(holder)] () mutable {
do_apply(cg, std::move(h), m);
}, timeout);
@@ -4537,6 +4752,10 @@ future<> table::apply(const frozen_mutation& m, schema_ptr m_schema, db::rp_hand
auto& cg = compaction_group_for_key(m.key(), m_schema);
auto holder = cg.async_gate().hold();
if (_logstor) [[unlikely]] {
return _logstor->write(m.unfreeze(m_schema), cg, std::move(holder));
}
return dirty_memory_region_group().run_when_memory_available([this, &m, m_schema = std::move(m_schema), h = std::move(h), &cg, holder = std::move(holder)]() mutable {
do_apply(cg, std::move(h), m, m_schema);
}, timeout);
@@ -4641,13 +4860,14 @@ table::query(schema_ptr query_schema,
}
std::optional<full_position> last_pos;
if (querier_opt && querier_opt->current_position()) {
last_pos.emplace(*querier_opt->current_position());
}
if (!saved_querier || (querier_opt && !querier_opt->are_limits_reached() && !qs.builder.is_short_read())) {
co_await querier_opt->close();
querier_opt = {};
if (querier_opt) {
if (querier_opt->current_position()) {
last_pos.emplace(*querier_opt->current_position());
}
if (!saved_querier || (!querier_opt->are_limits_reached() && !qs.builder.is_short_read())) {
co_await querier_opt->close();
querier_opt = {};
}
}
if (saved_querier) {
*saved_querier = std::move(querier_opt);
@@ -4737,6 +4957,10 @@ table::enable_auto_compaction() {
// see table::disable_auto_compaction() notes.
_compaction_disabled_by_user = false;
trigger_compaction();
if (uses_logstor()) {
trigger_logstor_compaction();
}
}
future<>
@@ -4768,11 +4992,18 @@ table::disable_auto_compaction() {
// - it will break computation of major compaction descriptor
// for new submissions
_compaction_disabled_by_user = true;
return with_gate(_async_gate, [this] {
return parallel_foreach_compaction_group_view([this] (compaction::compaction_group_view& view) {
return _compaction_manager.stop_ongoing_compactions("disable auto-compaction", &view, compaction::compaction_type::Compaction);
});
auto holder = _async_gate.hold();
co_await parallel_foreach_compaction_group_view([this] (compaction::compaction_group_view& view) {
return _compaction_manager.stop_ongoing_compactions("disable auto-compaction", &view, compaction::compaction_type::Compaction);
});
if (uses_logstor()) {
co_await parallel_foreach_compaction_group([this] (compaction_group& cg) {
return get_logstor_compaction_manager().stop_ongoing_compactions(cg);
});
}
}
void table::set_tombstone_gc_enabled(bool tombstone_gc_enabled) noexcept {
@@ -4985,6 +5216,26 @@ const compaction::compaction_manager& compaction_group::get_compaction_manager()
return _t.get_compaction_manager();
}
logstor::segment_manager& compaction_group::get_logstor_segment_manager() noexcept {
return _t.get_logstor_segment_manager();
}
const logstor::segment_manager& compaction_group::get_logstor_segment_manager() const noexcept {
return _t.get_logstor_segment_manager();
}
logstor::compaction_manager& compaction_group::get_logstor_compaction_manager() noexcept {
return _t.get_logstor_compaction_manager();
}
const logstor::compaction_manager& compaction_group::get_logstor_compaction_manager() const noexcept {
return _t.get_logstor_compaction_manager();
}
logstor::primary_index& compaction_group::get_logstor_index() noexcept {
return _t.logstor_index();
}
compaction::compaction_group_view& compaction_group::as_view_for_static_sharding() const {
return view_for_unrepaired_data();
}


@@ -87,6 +87,11 @@ target_include_directories(wasmtime_bindings
target_link_libraries(wasmtime_bindings
INTERFACE Rust::rust_combined)
if (Scylla_USE_PRECOMPILED_HEADER_USE)
# The PCH from scylla-precompiled-header is compiled with Seastar's compile
# flags, including sanitizer flags in Debug/Sanitize modes. Any target reusing
# this PCH must have matching compile options, otherwise the compiler rejects
# the PCH due to flag mismatch (e.g., -fsanitize=address).
target_link_libraries(wasmtime_bindings PRIVATE Seastar::seastar)
target_precompile_headers(wasmtime_bindings REUSE_FROM scylla-precompiled-header)
endif()
@@ -108,5 +113,6 @@ target_include_directories(inc
target_link_libraries(inc
INTERFACE Rust::rust_combined)
if (Scylla_USE_PRECOMPILED_HEADER_USE)
target_link_libraries(inc PRIVATE Seastar::seastar)
target_precompile_headers(inc REUSE_FROM scylla-precompiled-header)
endif()


@@ -592,6 +592,7 @@ bool operator==(const schema::user_properties& lhs, const schema::user_propertie
&& lhs.compaction_strategy == rhs.compaction_strategy
&& lhs.compaction_strategy_options == rhs.compaction_strategy_options
&& lhs.compaction_enabled == rhs.compaction_enabled
&& lhs.storage_engine == rhs.storage_engine
&& lhs.caching_options == rhs.caching_options
&& lhs.tablet_options == rhs.tablet_options
&& lhs.get_paxos_grace_seconds() == rhs.get_paxos_grace_seconds()
@@ -698,6 +699,7 @@ table_schema_version schema::calculate_digest(const schema::raw_schema& r) {
feed_hash(h, r._view_info);
feed_hash(h, r._indices_by_name);
feed_hash(h, r._is_counter);
feed_hash(h, r._props.storage_engine);
for (auto&& [name, ext] : r._props.extensions) {
feed_hash(h, name);
@@ -874,6 +876,9 @@ auto fmt::formatter<schema>::format(const schema& s, fmt::format_context& ctx) c
out = fmt::format_to(out, ",minIndexInterval={}", s._raw._props.min_index_interval);
out = fmt::format_to(out, ",maxIndexInterval={}", s._raw._props.max_index_interval);
out = fmt::format_to(out, ",speculativeRetry={}", s._raw._props.speculative_retry.to_sstring());
if (s.storage_engine() != storage_engine_type::normal) {
out = fmt::format_to(out, ",storage_engine={}", storage_engine_type_to_sstring(s.storage_engine()));
}
out = fmt::format_to(out, ",tablets={{");
if (s._raw._props.tablet_options) {
n = 0;
@@ -1210,6 +1215,9 @@ fragmented_ostringstream& schema::schema_properties(const schema_describe_helper
os << "\n AND memtable_flush_period_in_ms = " << fmt::to_string(memtable_flush_period());
os << "\n AND min_index_interval = " << fmt::to_string(min_index_interval());
os << "\n AND speculative_retry = '" << speculative_retry().to_sstring() << "'";
if (storage_engine() != storage_engine_type::normal) {
os << "\n AND storage_engine = '" << storage_engine_type_to_sstring(storage_engine()) << "'";
}
if (has_tablet_options()) {
os << "\n AND tablets = {";


@@ -175,6 +175,21 @@ public:
bool operator==(const speculative_retry& other) const = default;
};
enum class storage_engine_type {
normal,
logstor,
};
inline sstring storage_engine_type_to_sstring(storage_engine_type t) {
switch (t) {
case storage_engine_type::normal:
return "normal";
case storage_engine_type::logstor:
return "logstor";
}
throw std::invalid_argument(format("unknown storage engine type: {:d}\n", uint8_t(t)));
}
using index_options_map = std::unordered_map<sstring, sstring>;
enum class index_metadata_kind {
@@ -561,6 +576,7 @@ public:
compaction::compaction_strategy_type compaction_strategy = compaction::compaction_strategy_type::incremental;
std::map<sstring, sstring> compaction_strategy_options;
bool compaction_enabled = true;
storage_engine_type storage_engine = storage_engine_type::normal;
::caching_options caching_options;
std::optional<std::map<sstring, sstring>> tablet_options;
@@ -776,6 +792,14 @@ public:
return _raw._props.compaction_enabled;
}
storage_engine_type storage_engine() const {
return _raw._props.storage_engine;
}
bool logstor_enabled() const {
return _raw._props.storage_engine == storage_engine_type::logstor;
}
const cdc::options& cdc_options() const {
return _raw._props.get_cdc_options();
}


@@ -269,6 +269,11 @@ public:
enable_schema_commitlog();
}
schema_builder& set_logstor() {
_raw._props.storage_engine = storage_engine_type::logstor;
return *this;
}
class default_names {
public:
default_names(const schema_builder&);


@@ -952,6 +952,8 @@ class sstring:
@staticmethod
def to_hex(data, size):
if size == 0:
return ''
inf = gdb.selected_inferior()
return bytes(inf.read_memory(data, size)).hex()
@@ -974,6 +976,8 @@ class sstring:
return self.ref['u']['external']['str']
def as_bytes(self):
if len(self) == 0:
return b''
inf = gdb.selected_inferior()
return bytes(inf.read_memory(self.data(), len(self)))
@@ -5636,6 +5640,8 @@ class scylla_sstable_summary(gdb.Command):
self.inf = gdb.selected_inferior()
def to_hex(self, data, size):
if size == 0:
return ''
return bytes(self.inf.read_memory(data, size)).hex()
def invoke(self, arg, for_tty):
@@ -5647,6 +5653,10 @@ class scylla_sstable_summary(gdb.Command):
sst = seastar_lw_shared_ptr(arg).get().dereference()
else:
sst = arg
ms_version = int(gdb.parse_and_eval('sstables::sstable_version_types::ms'))
if int(sst['_version']) >= ms_version:
gdb.write("sstable uses ms format (trie-based index); summary is not populated.\n")
return
summary = seastar_lw_shared_ptr(sst['_components']['_value']).get().dereference()['summary']
gdb.write("header: {}\n".format(summary['header']))
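The `size == 0` guards added to the gdb helpers above avoid handing a zero-length read to `gdb.Inferior.read_memory()`, which can fail when the data pointer is null, as it is for an empty `sstring`. A minimal sketch of the same early-return pattern outside gdb, with `read_memory` as a stand-in for the inferior call (names here are illustrative):

```python
def read_memory(address, size):
    # Stand-in for gdb.Inferior.read_memory(): reading through a null
    # pointer fails even when the requested size is zero.
    if address == 0:
        raise MemoryError("cannot access memory at address 0x0")
    return bytes(size)

def to_hex(data, size):
    # Mirrors the patched helper: bail out before touching memory when
    # there is nothing to read.
    if size == 0:
        return ''
    return read_memory(data, size).hex()

# An empty sstring has a null data pointer and zero length; the guard
# keeps the helper from dereferencing it.
assert to_hex(0, 0) == ''
assert to_hex(0x1000, 4) == '00000000'
```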

@@ -227,8 +227,6 @@ future<> group0_state_machine::reload_modules(modules_to_reload modules) {
for (const auto& m : modules.entries) {
if (m.table == db::system_keyspace::service_levels_v2()->id()) {
update_service_levels_cache = true;
} else if (m.table == db::system_keyspace::role_members()->id() || m.table == db::system_keyspace::role_attributes()->id()) {
update_service_levels_effective_cache = true;
} else if (m.table == db::system_keyspace::dicts()->id()) {
auto pk_type = db::system_keyspace::dicts()->partition_key_type();
auto name_value = pk_type->deserialize_value(m.pk.representation());
@@ -247,6 +245,11 @@ future<> group0_state_machine::reload_modules(modules_to_reload modules) {
auto cdc_log_table_id = table_id(value_cast<utils::UUID>(uuid_type->deserialize_value(elements.front())));
update_cdc_streams.insert(cdc_log_table_id);
} else if (auth::cache::includes_table(m.table)) {
if (m.table == db::system_keyspace::role_members()->id() ||
m.table == db::system_keyspace::role_attributes()->id()) {
update_service_levels_effective_cache = true;
}
auto schema = _ss.get_database().find_schema(m.table);
const auto elements = m.pk.explode(*schema);
auto role = value_cast<sstring>(schema->partition_key_type()->
@@ -255,6 +258,9 @@ future<> group0_state_machine::reload_modules(modules_to_reload modules) {
}
}
if (update_auth_cache_roles.size()) {
co_await _ss.auth_cache().load_roles(std::move(update_auth_cache_roles));
}
if (update_service_levels_cache || update_service_levels_effective_cache) { // this also updates SL effective cache
co_await _ss.update_service_levels_cache(qos::update_both_cache_levels(update_service_levels_cache), qos::query_context::group0);
}
@@ -264,9 +270,6 @@ future<> group0_state_machine::reload_modules(modules_to_reload modules) {
if (update_cdc_streams.size()) {
co_await _ss.load_cdc_streams(std::move(update_cdc_streams));
}
if (update_auth_cache_roles.size()) {
co_await _ss.auth_cache().load_roles(std::move(update_auth_cache_roles));
}
}
future<> group0_state_machine::merge_and_apply(group0_state_machine_merger& merger) {

@@ -4653,6 +4653,7 @@ void storage_proxy::send_to_live_endpoints(storage_proxy::response_id_type respo
auto& stats = handler_ptr->stats();
auto& handler = *handler_ptr;
auto& global_stats = handler._proxy->_global_stats;
auto schema = handler_ptr->get_schema();
if (handler.get_targets().size() == 0) {
// Usually we remove the response handler when receiving responses from all targets.
@@ -4748,7 +4749,7 @@ void storage_proxy::send_to_live_endpoints(storage_proxy::response_id_type respo
}
// Waited on indirectly.
(void)f.handle_exception([response_id, forward_size, coordinator, handler_ptr, p = shared_from_this(), &stats] (std::exception_ptr eptr) {
(void)f.handle_exception([response_id, forward_size, coordinator, handler_ptr, p = shared_from_this(), &stats, schema] (std::exception_ptr eptr) {
++stats.writes_errors.get_ep_stat(handler_ptr->_effective_replication_map_ptr->get_topology(), coordinator);
error err = error::FAILURE;
std::optional<sstring> msg;
@@ -4762,8 +4763,8 @@ void storage_proxy::send_to_live_endpoints(storage_proxy::response_id_type respo
// ignore, disconnect will be logged by gossiper
} else if (const auto* e = try_catch_nested<seastar::gate_closed_exception>(eptr)) {
// may happen during shutdown, log and ignore it
slogger.warn("gate_closed_exception during mutation write to {}: {}",
coordinator, e->what());
slogger.warn("gate_closed_exception during mutation write to {}.{} on {}: {}",
schema->ks_name(), schema->cf_name(), coordinator, e->what());
} else if (try_catch<timed_out_error>(eptr)) {
// from lmutate(). Ignore so that logs are not flooded
// database total_writes_timedout counter was incremented.
@@ -4774,7 +4775,8 @@ void storage_proxy::send_to_live_endpoints(storage_proxy::response_id_type respo
} else if (auto* e = try_catch<replica::critical_disk_utilization_exception>(eptr)) {
msg = e->what();
} else {
slogger.error("exception during mutation write to {}: {}", coordinator, eptr);
slogger.error("exception during mutation write to {}.{} on {}: {}",
schema->ks_name(), schema->cf_name(), coordinator, eptr);
}
p->got_failure_response(response_id, coordinator, forward_size + 1, std::nullopt, err, std::move(msg));
});

@@ -910,7 +910,7 @@ future<> storage_service::merge_topology_snapshot(raft_snapshot snp) {
frozen_muts_to_apply.push_back(co_await freeze_gently(mut));
} else {
co_await for_each_split_mutation(std::move(mut), max_size, [&] (mutation m) -> future<> {
frozen_muts_to_apply.push_back(co_await freeze_gently(mut));
frozen_muts_to_apply.push_back(co_await freeze_gently(m));
});
}
}
@@ -3026,6 +3026,8 @@ future<> storage_service::drain() {
}
future<> storage_service::do_drain() {
co_await utils::get_local_injector().inject("storage_service_drain_wait", utils::wait_for_message(60s));
// Need to stop transport before group0, otherwise RPCs may fail with raft_group_not_found.
co_await stop_transport();
@@ -4016,6 +4018,9 @@ future<> storage_service::process_tablet_split_candidate(table_id table) noexcep
} catch (raft::request_aborted& ex) {
slogger.warn("Failed to complete splitting of table {} due to {}", table, ex);
break;
} catch (seastar::gate_closed_exception& ex) {
slogger.warn("Failed to complete splitting of table {} due to {}", table, ex);
break;
} catch (...) {
slogger.error("Failed to complete splitting of table {} due to {}, retrying after {} seconds",
table, std::current_exception(), split_retry.sleep_time());
@@ -4082,6 +4087,58 @@ future<> storage_service::snitch_reconfigured() {
}
}
future<> storage_service::local_topology_barrier() {
if (this_shard_id() != 0) {
co_await container().invoke_on(0, [] (storage_service& ss) {
return ss.local_topology_barrier();
});
co_return;
}
auto version = _topology_state_machine._topology.version;
utils::get_local_injector().inject("raft_topology_barrier_and_drain_fail_before", [] {
throw std::runtime_error("raft_topology_barrier_and_drain_fail_before injected exception");
});
co_await utils::get_local_injector().inject("pause_before_barrier_and_drain", utils::wait_for_message(std::chrono::minutes(5)));
if (_topology_state_machine._topology.tstate == topology::transition_state::write_both_read_old) {
for (auto& n : _topology_state_machine._topology.transition_nodes) {
if (!_address_map.find(locator::host_id{n.first.uuid()})) {
rtlogger.error("The topology transition is in a double write state but the IP of the node in transition is not known");
break;
}
}
}
co_await container().invoke_on_all([version] (storage_service& ss) -> future<> {
const auto current_version = ss._shared_token_metadata.get()->get_version();
rtlogger.info("Got raft_topology_cmd::barrier_and_drain, version {}, "
"current version {}, stale versions (version: use_count): {}",
version, current_version, ss._shared_token_metadata.describe_stale_versions());
// This shouldn't happen under normal operation, it's only plausible
// if the topology change coordinator has
// moved to another node and managed to update the topology
// parallel to this method. The previous coordinator
// should be inactive now, so it won't observe this
// exception. By returning exception we aim
// to reveal any other conditions where this may arise.
if (current_version != version) {
co_await coroutine::return_exception(std::runtime_error(
::format("raft topology: command::barrier_and_drain, the version has changed, "
"version {}, current_version {}, the topology change coordinator "
" had probably migrated to another node",
version, current_version)));
}
co_await ss._shared_token_metadata.stale_versions_in_use();
co_await get_topology_session_manager().drain_closing_sessions();
rtlogger.info("raft_topology_cmd::barrier_and_drain done");
});
}
future<raft_topology_cmd_result> storage_service::raft_topology_cmd_handler(raft::term_t term, uint64_t cmd_index, raft_topology_cmd cmd) {
raft_topology_cmd_result result;
rtlogger.info("topology cmd rpc {} is called index={}", cmd.cmd, cmd_index);
@@ -4109,12 +4166,6 @@ future<raft_topology_cmd_result> storage_service::raft_topology_cmd_handler(raft
state.last_index = cmd_index;
}
// We capture the topology version right after the checks
// above, before any yields. This is crucial since _topology_state_machine._topology
// might be altered concurrently while this method is running,
// which can cause the fence command to apply an invalid fence version.
const auto version = _topology_state_machine._topology.version;
switch (cmd.cmd) {
case raft_topology_cmd::command::barrier: {
utils::get_local_injector().inject("raft_topology_barrier_fail",
@@ -4153,44 +4204,7 @@ future<raft_topology_cmd_result> storage_service::raft_topology_cmd_handler(raft
}
break;
case raft_topology_cmd::command::barrier_and_drain: {
utils::get_local_injector().inject("raft_topology_barrier_and_drain_fail_before", [] {
throw std::runtime_error("raft_topology_barrier_and_drain_fail_before injected exception");
});
co_await utils::get_local_injector().inject("pause_before_barrier_and_drain", utils::wait_for_message(std::chrono::minutes(5)));
if (_topology_state_machine._topology.tstate == topology::transition_state::write_both_read_old) {
for (auto& n : _topology_state_machine._topology.transition_nodes) {
if (!_address_map.find(locator::host_id{n.first.uuid()})) {
rtlogger.error("The topology transition is in a double write state but the IP of the node in transition is not known");
break;
}
}
}
co_await container().invoke_on_all([version] (storage_service& ss) -> future<> {
const auto current_version = ss._shared_token_metadata.get()->get_version();
rtlogger.info("Got raft_topology_cmd::barrier_and_drain, version {}, "
"current version {}, stale versions (version: use_count): {}",
version, current_version, ss._shared_token_metadata.describe_stale_versions());
// This shouldn't happen under normal operation, it's only plausible
// if the topology change coordinator has
// moved to another node and managed to update the topology
// parallel to this method. The previous coordinator
// should be inactive now, so it won't observe this
// exception. By returning exception we aim
// to reveal any other conditions where this may arise.
if (current_version != version) {
co_await coroutine::return_exception(std::runtime_error(
::format("raft topology: command::barrier_and_drain, the version has changed, "
"version {}, current_version {}, the topology change coordinator "
" had probably migrated to another node",
version, current_version)));
}
co_await ss._shared_token_metadata.stale_versions_in_use();
co_await get_topology_session_manager().drain_closing_sessions();
rtlogger.info("raft_topology_cmd::barrier_and_drain done");
});
co_await local_topology_barrier();
co_await utils::get_local_injector().inject("raft_topology_barrier_and_drain_fail", [this] (auto& handler) -> future<> {
auto ks = handler.get("keyspace");

@@ -813,6 +813,9 @@ public:
future<bool> ongoing_rf_change(const group0_guard& guard, sstring ks) const;
future<> raft_initialize_discovery_leader(const join_node_request_params& params);
future<> initialize_done_topology_upgrade_state();
// Does the local part of global_token_metadata_barrier(), without a raft group0 barrier.
// In particular, waits for non-latest local erms to go away.
future<> local_topology_barrier();
private:
// State machine that is responsible for topology change
topology_state_machine& _topology_state_machine;

@@ -195,9 +195,9 @@ future<std::optional<tasks::task_status>> tablet_virtual_task::wait(tasks::task_
} else if (is_resize_task(task_type)) {
auto new_tablet_count = _ss.get_token_metadata().tablets().get_tablet_map(table).tablet_count();
res->status.state = new_tablet_count == tablet_count ? tasks::task_manager::task_state::suspended : tasks::task_manager::task_state::done;
res->status.children = task_type == locator::tablet_task_type::split ? co_await get_children(get_module(), id, std::bind_front(&gms::gossiper::is_alive, &_ss.gossiper())) : utils::chunked_vector<tasks::task_identity>{};
res->status.children = task_type == locator::tablet_task_type::split ? co_await get_children(get_module(), id, _ss.get_token_metadata_ptr()) : utils::chunked_vector<tasks::task_identity>{};
} else {
res->status.children = co_await get_children(get_module(), id, std::bind_front(&gms::gossiper::is_alive, &_ss.gossiper()));
res->status.children = co_await get_children(get_module(), id, _ss.get_token_metadata_ptr());
}
res->status.end_time = db_clock::now(); // FIXME: Get precise end time.
co_return res->status;
@@ -312,7 +312,7 @@ future<std::optional<status_helper>> tablet_virtual_task::get_status_helper(task
}
return make_ready_future();
});
res.status.children = co_await get_children(get_module(), id, std::bind_front(&gms::gossiper::is_alive, &_ss.gossiper()));
res.status.children = co_await get_children(get_module(), id, _ss.get_token_metadata_ptr());
} else if (is_migration_task(task_type)) { // Migration task.
auto tablet_id = hint.get_tablet_id();
res.pending_replica = tmap.get_tablet_transition_info(tablet_id)->pending_replica;
@@ -326,7 +326,7 @@ future<std::optional<status_helper>> tablet_virtual_task::get_status_helper(task
if (task_info.tablet_task_id.uuid() == id.uuid()) {
update_status(task_info, res.status, sched_nr);
res.status.state = tasks::task_manager::task_state::running;
res.status.children = task_type == locator::tablet_task_type::split ? co_await get_children(get_module(), id, std::bind_front(&gms::gossiper::is_alive, &_ss.gossiper())) : utils::chunked_vector<tasks::task_identity>{};
res.status.children = task_type == locator::tablet_task_type::split ? co_await get_children(get_module(), id, _ss.get_token_metadata_ptr()) : utils::chunked_vector<tasks::task_identity>{};
co_return res;
}
}

@@ -2229,6 +2229,19 @@ class topology_coordinator : public endpoint_lifecycle_subscriber
_tablet_allocator.set_load_stats(reconciled_stats);
}
}
// Wait for the background storage group merge to finish before releasing the state machine.
// Background merge holds the old erm, so a successful barrier joins with it.
// This guarantees that the background merge doesn't run concurrently with the next merge.
// Replica-side storage group merge takes compaction locks on the tablet's main compaction group, released
// by the background merge. If the next merge starts before the background merge finishes, it can cause a deadlock.
// The background merge fiber will try to stop a compaction group which is locked, and the lock is held
// by the background merge fiber.
tm = nullptr;
if (!guard) {
guard = co_await start_operation();
}
co_await global_tablet_token_metadata_barrier(std::move(guard));
}
using get_table_ids_func = std::function<std::unordered_set<table_id>(const db::system_keyspace::topology_requests_entry&)>;

@@ -201,95 +201,49 @@ public:
virtual future<std::optional<entry_info>> next_entry() = 0;
};
// Allocated inside LSA.
class promoted_index {
deletion_time _del_time;
uint64_t _promoted_index_start;
uint32_t _promoted_index_size;
uint32_t _num_blocks;
public:
promoted_index(const schema& s,
deletion_time del_time,
uint64_t promoted_index_start,
uint32_t promoted_index_size,
uint32_t num_blocks)
: _del_time{del_time}
, _promoted_index_start(promoted_index_start)
, _promoted_index_size(promoted_index_size)
, _num_blocks(num_blocks)
{ }
[[nodiscard]] deletion_time get_deletion_time() const { return _del_time; }
[[nodiscard]] uint32_t get_promoted_index_size() const { return _promoted_index_size; }
// Call under allocating_section.
// For sstable versions >= mc the returned cursor will be of type `bsearch_clustered_cursor`.
std::unique_ptr<clustered_index_cursor> make_cursor(shared_sstable,
reader_permit,
tracing::trace_state_ptr,
file_input_stream_options,
use_caching);
// Promoted index information produced by the parser.
struct parsed_promoted_index_entry {
deletion_time del_time;
uint64_t promoted_index_start;
uint32_t promoted_index_size;
uint32_t num_blocks;
};
using promoted_index = parsed_promoted_index_entry;
// A partition index element.
// Allocated inside LSA.
class index_entry {
private:
managed_bytes _key;
mutable std::optional<dht::token> _token;
uint64_t _position;
managed_ref<promoted_index> _index;
struct [[gnu::packed]] index_entry {
mutable int64_t raw_token;
uint64_t data_file_offset;
uint32_t key_offset;
public:
key_view get_key() const {
return key_view{_key};
}
// May allocate so must be called under allocating_section.
decorated_key_view get_decorated_key(const schema& s) const {
if (!_token) {
_token.emplace(s.get_partitioner().get_token(get_key()));
}
return decorated_key_view(*_token, get_key());
}
uint64_t position() const { return _position; };
std::optional<deletion_time> get_deletion_time() const {
if (_index) {
return _index->get_deletion_time();
}
return {};
}
index_entry(managed_bytes&& key, uint64_t position, managed_ref<promoted_index>&& index)
: _key(std::move(key))
, _position(position)
, _index(std::move(index))
{}
index_entry(index_entry&&) = default;
index_entry& operator=(index_entry&&) = default;
// Can be nullptr
const managed_ref<promoted_index>& get_promoted_index() const { return _index; }
managed_ref<promoted_index>& get_promoted_index() { return _index; }
uint32_t get_promoted_index_size() const { return _index ? _index->get_promoted_index_size() : 0; }
size_t external_memory_usage() const {
return _key.external_memory_usage() + _index.external_memory_usage();
}
uint64_t position() const { return data_file_offset; }
dht::raw_token token() const { return dht::raw_token(raw_token); }
};
// Required for optimized LSA migration of storage of managed_vector.
static_assert(std::is_trivially_move_assignable_v<index_entry>);
static_assert(std::is_trivially_move_assignable_v<parsed_promoted_index_entry>);
// A partition index page.
//
// Allocated in the standard allocator space but with an LSA allocator as the current allocator.
// So the shallow part is in the standard allocator but all indirect objects are inside LSA.
class partition_index_page {
public:
lsa::chunked_managed_vector<managed_ref<index_entry>> _entries;
lsa::chunked_managed_vector<index_entry> _entries;
managed_bytes _key_storage;
// Stores promoted index information of index entries.
// The i-th element corresponds to the i-th entry in _entries.
// Can be smaller than _entries. If _entries[i] doesn't have a matching element in _promoted_indexes then
// that entry doesn't have a promoted index.
// It's not chunked, because promoted index is present only when there are large partitions in the page,
// which also means the page will typically have only 1 entry due to the summary:data_file size ratio.
// Kept separately to avoid paying for storage cost in pages where no entry has a promoted index,
// which is typical in workloads with small partitions.
managed_vector<promoted_index> _promoted_indexes;
public:
partition_index_page() = default;
partition_index_page(partition_index_page&&) noexcept = default;
@@ -298,15 +252,68 @@ public:
bool empty() const { return _entries.empty(); }
size_t size() const { return _entries.size(); }
stop_iteration clear_gently() {
// Vectors have trivial storage, so are fast to destroy.
return stop_iteration::yes;
}
void clear_one_entry() {
_entries.pop_back();
}
bool has_promoted_index(size_t i) const {
return i < _promoted_indexes.size() && _promoted_indexes[i].promoted_index_size > 0;
}
/// Get promoted index for the i-th entry.
/// Call only when has_promoted_index(i) is true.
const promoted_index& get_promoted_index(size_t i) const {
return _promoted_indexes[i];
}
/// Get promoted index for the i-th entry.
/// Call only when has_promoted_index(i) is true.
promoted_index& get_promoted_index(size_t i) {
return _promoted_indexes[i];
}
/// Get promoted index size for the i-th entry.
uint32_t get_promoted_index_size(size_t i) const {
return has_promoted_index(i) ? get_promoted_index(i).promoted_index_size : 0;
}
/// Get deletion_time for partition represented by the i-th entry.
/// Returns disengaged optional if the entry doesn't have a promoted index, so we don't know the deletion_time.
/// It has to be read from the data file.
std::optional<deletion_time> get_deletion_time(size_t i) const {
if (has_promoted_index(i)) {
return get_promoted_index(i).del_time;
}
return {};
}
key_view get_key(size_t i) const {
auto start = _entries[i].key_offset;
auto end = i + 1 < _entries.size() ? _entries[i + 1].key_offset : _key_storage.size();
auto v = managed_bytes_view(_key_storage).prefix(end);
v.remove_prefix(start);
return key_view(v);
}
decorated_key_view get_decorated_key(const schema& s, size_t i) const {
auto key = get_key(i);
auto t = _entries[i].token();
if (!t) {
t = dht::raw_token(s.get_partitioner().get_token(key));
_entries[i].raw_token = t.value;
}
return decorated_key_view(dht::token(t), key);
}
size_t external_memory_usage() const {
size_t size = _entries.external_memory_usage();
for (auto&& e : _entries) {
size += sizeof(index_entry) + e->external_memory_usage();
}
size += _promoted_indexes.external_memory_usage();
size += _key_storage.external_memory_usage();
return size;
}
};
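The reworked `partition_index_page` above drops per-entry `managed_bytes` keys in favor of packed, trivially-movable entries whose `key_offset` points into one shared `_key_storage` blob: entry *i*'s key runs from its own offset to entry *i+1*'s offset, or to the end of storage for the last entry. A toy Python model of just the key packing (names are illustrative, not Scylla's; the sparse `_promoted_indexes` side array is omitted):

```python
class PackedIndexPage:
    """Toy model of the reworked partition_index_page: entries are
    small fixed-size records (data-file offset + key offset), and all
    partition keys are concatenated into one shared byte buffer."""

    def __init__(self):
        self.entries = []        # (data_file_offset, key_offset) pairs
        self.key_storage = b''   # all keys, back to back

    def add(self, key: bytes, data_file_offset: int) -> None:
        # key_offset records where this key starts in the shared buffer.
        self.entries.append((data_file_offset, len(self.key_storage)))
        self.key_storage += key

    def get_key(self, i: int) -> bytes:
        # Entry i's key ends where entry i+1's begins, or at the end of
        # the buffer for the last entry -- mirroring get_key() above.
        start = self.entries[i][1]
        end = (self.entries[i + 1][1] if i + 1 < len(self.entries)
               else len(self.key_storage))
        return self.key_storage[start:end]

page = PackedIndexPage()
page.add(b'apple', 0)
page.add(b'fig', 4096)
page.add(b'melon', 8192)
assert page.get_key(0) == b'apple'
assert page.get_key(1) == b'fig'
assert page.get_key(2) == b'melon'
```

Keeping entries trivially movable is what allows the patch's `static_assert(std::is_trivially_move_assignable_v<index_entry>)` and cheap LSA migration of the entry vector.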

@@ -25,14 +25,6 @@ namespace sstables {
extern seastar::logger sstlog;
extern thread_local mc::cached_promoted_index::metrics promoted_index_cache_metrics;
// Promoted index information produced by the parser.
struct parsed_promoted_index_entry {
deletion_time del_time;
uint64_t promoted_index_start;
uint32_t promoted_index_size;
uint32_t num_blocks;
};
// Partition index entry information produced by the parser.
struct parsed_partition_index_entry {
temporary_buffer<char> key;
@@ -53,9 +45,10 @@ class index_consumer {
schema_ptr _s;
logalloc::allocating_section _alloc_section;
logalloc::region& _region;
utils::chunked_vector<parsed_partition_index_entry> _parsed_entries;
size_t _max_promoted_index_entry_plus_one = 0; // Highest index +1 in _parsed_entries which has a promoted index.
size_t _key_storage_size = 0;
public:
index_list indexes;
index_consumer(logalloc::region& r, schema_ptr s)
: _s(s)
, _alloc_section(abstract_formatter([s] (fmt::format_context& ctx) {
@@ -64,36 +57,63 @@ public:
, _region(r)
{ }
~index_consumer() {
with_allocator(_region.allocator(), [&] {
indexes._entries.clear_and_release();
});
void consume_entry(parsed_partition_index_entry&& e) {
_key_storage_size += e.key.size();
_parsed_entries.emplace_back(std::move(e));
if (e.promoted_index) {
_max_promoted_index_entry_plus_one = std::max(_max_promoted_index_entry_plus_one, _parsed_entries.size());
}
}
void consume_entry(parsed_partition_index_entry&& e) {
_alloc_section(_region, [&] {
future<index_list> finalize() {
index_list result;
// In case of exception, need to deallocate under region allocator.
auto delete_result = seastar::defer([&] {
with_allocator(_region.allocator(), [&] {
managed_ref<promoted_index> pi;
if (e.promoted_index) {
pi = make_managed<promoted_index>(*_s,
e.promoted_index->del_time,
e.promoted_index->promoted_index_start,
e.promoted_index->promoted_index_size,
e.promoted_index->num_blocks);
}
auto key = managed_bytes(reinterpret_cast<const bytes::value_type*>(e.key.get()), e.key.size());
indexes._entries.emplace_back(make_managed<index_entry>(std::move(key), e.data_file_offset, std::move(pi)));
result._entries = {};
result._promoted_indexes = {};
result._key_storage = {};
});
});
auto i = _parsed_entries.begin();
size_t key_offset = 0;
while (i != _parsed_entries.end()) {
_alloc_section(_region, [&] {
with_allocator(_region.allocator(), [&] {
result._entries.reserve(_parsed_entries.size());
result._promoted_indexes.resize(_max_promoted_index_entry_plus_one);
if (result._key_storage.empty()) {
result._key_storage = managed_bytes(managed_bytes::initialized_later(), _key_storage_size);
}
managed_bytes_mutable_view key_out(result._key_storage);
key_out.remove_prefix(key_offset);
while (i != _parsed_entries.end()) {
parsed_partition_index_entry& e = *i;
if (e.promoted_index) {
result._promoted_indexes[result._entries.size()] = *e.promoted_index;
}
write_fragmented(key_out, std::string_view(e.key.begin(), e.key.size()));
result._entries.emplace_back(index_entry{dht::raw_token().value, e.data_file_offset, key_offset});
++i;
key_offset += e.key.size();
if (need_preempt()) {
break;
}
}
});
});
co_await coroutine::maybe_yield();
}
delete_result.cancel();
_parsed_entries.clear();
co_return std::move(result);
}
void prepare(uint64_t size) {
_alloc_section = logalloc::allocating_section();
_alloc_section(_region, [&] {
with_allocator(_region.allocator(), [&] {
indexes._entries.reserve(size);
});
});
_max_promoted_index_entry_plus_one = 0;
_key_storage_size = 0;
_parsed_entries.clear();
_parsed_entries.reserve(size);
}
};
@@ -198,10 +218,14 @@ public:
switch (_state) {
// START comes first, to make the handling of the 0-quantity case simpler
state_START:
case state::START:
sstlog.trace("{}: pos {} state {} - data.size()={}", fmt::ptr(this), current_pos(), state::START, data.size());
_state = state::KEY_SIZE;
break;
if (data.size() == 0) {
break;
}
[[fallthrough]];
case state::KEY_SIZE:
sstlog.trace("{}: pos {} state {}", fmt::ptr(this), current_pos(), state::KEY_SIZE);
_entry_offset = current_pos();
@@ -227,7 +251,16 @@ public:
case state::PROMOTED_SIZE:
sstlog.trace("{}: pos {} state {}", fmt::ptr(this), current_pos(), state::PROMOTED_SIZE);
_position = this->_u64;
if (read_vint_or_uint32(data) != continuous_data_consumer::read_status::ready) {
if (is_mc_format() && data.size() && *data.begin() == 0) { // promoted_index_size == 0
data.trim_front(1);
_consumer.consume_entry(parsed_partition_index_entry{
.key = std::move(_key),
.data_file_offset = _position,
.index_offset = _entry_offset,
.promoted_index = std::nullopt
});
goto state_START;
} else if (read_vint_or_uint32(data) != continuous_data_consumer::read_status::ready) {
_state = state::PARTITION_HEADER_LENGTH_1;
break;
}
@@ -339,33 +372,6 @@ inline file make_tracked_index_file(sstable& sst, reader_permit permit, tracing:
return tracing::make_traced_file(std::move(f), std::move(trace_state), format("{}:", sst.index_filename()));
}
inline
std::unique_ptr<clustered_index_cursor> promoted_index::make_cursor(shared_sstable sst,
reader_permit permit,
tracing::trace_state_ptr trace_state,
file_input_stream_options options,
use_caching caching)
{
if (sst->get_version() >= sstable_version_types::mc) [[likely]] {
seastar::shared_ptr<cached_file> cached_file_ptr = caching
? sst->_cached_index_file
: seastar::make_shared<cached_file>(make_tracked_index_file(*sst, permit, trace_state, caching),
sst->manager().get_cache_tracker().get_index_cached_file_stats(),
sst->manager().get_cache_tracker().get_lru(),
sst->manager().get_cache_tracker().region(),
sst->_index_file_size);
return std::make_unique<mc::bsearch_clustered_cursor>(*sst->get_schema(),
_promoted_index_start, _promoted_index_size,
promoted_index_cache_metrics, permit,
sst->get_column_translation(), cached_file_ptr, _num_blocks, trace_state, sst->features());
}
auto file = make_tracked_index_file(*sst, permit, std::move(trace_state), caching);
auto promoted_index_stream = make_file_input_stream(std::move(file), _promoted_index_start, _promoted_index_size,options);
return std::make_unique<scanning_clustered_index_cursor>(*sst->get_schema(), permit,
std::move(promoted_index_stream), _promoted_index_size, _num_blocks, std::nullopt);
}
// Less-comparator for lookups in the partition index.
class index_comparator {
dht::ring_position_comparator_for_sstables _tri_cmp;
@@ -376,27 +382,17 @@ public:
return _tri_cmp(e.get_decorated_key(), rp) < 0;
}
bool operator()(const index_entry& e, dht::ring_position_view rp) const {
return _tri_cmp(e.get_decorated_key(_tri_cmp.s), rp) < 0;
}
bool operator()(const managed_ref<index_entry>& e, dht::ring_position_view rp) const {
return operator()(*e, rp);
}
bool operator()(dht::ring_position_view rp, const managed_ref<index_entry>& e) const {
return operator()(rp, *e);
}
bool operator()(dht::ring_position_view rp, const summary_entry& e) const {
return _tri_cmp(e.get_decorated_key(), rp) > 0;
}
bool operator()(dht::ring_position_view rp, const index_entry& e) const {
return _tri_cmp(e.get_decorated_key(_tri_cmp.s), rp) > 0;
}
};
inline
std::strong_ordering index_entry_tri_cmp(const schema& s, partition_index_page& page, size_t idx, dht::ring_position_view rp) {
dht::ring_position_comparator_for_sstables tri_cmp(s);
return tri_cmp(page.get_decorated_key(s, idx), rp);
}
// Contains information about index_reader position in the index file
struct index_bound {
index_bound() = default;
@@ -537,7 +533,7 @@ private:
if (ex) {
return make_exception_future<index_list>(std::move(ex));
}
return make_ready_future<index_list>(std::move(bound.consumer->indexes));
return bound.consumer->finalize();
});
});
};
@@ -550,17 +546,18 @@ private:
if (bound.current_list->empty()) {
throw malformed_sstable_exception(format("missing index entry for summary index {} (bound {})", summary_idx, fmt::ptr(&bound)), _sstable->index_filename());
}
bound.data_file_position = bound.current_list->_entries[0]->position();
bound.data_file_position = bound.current_list->_entries[0].position();
bound.element = indexable_element::partition;
bound.end_open_marker.reset();
if (sstlog.is_enabled(seastar::log_level::trace)) {
sstlog.trace("index {} bound {}: page:", fmt::ptr(this), fmt::ptr(&bound));
logalloc::reclaim_lock rl(_region);
for (auto&& e : bound.current_list->_entries) {
for (size_t i = 0; i < bound.current_list->_entries.size(); ++i) {
auto& e = bound.current_list->_entries[i];
auto dk = dht::decorate_key(*_sstable->_schema,
e->get_key().to_partition_key(*_sstable->_schema));
sstlog.trace(" {} -> {}", dk, e->position());
bound.current_list->get_key(i).to_partition_key(*_sstable->_schema));
sstlog.trace(" {} -> {}", dk, e.position());
}
}
@@ -604,7 +601,13 @@ private:
// Valid if partition_data_ready(bound)
index_entry& current_partition_entry(index_bound& bound) {
parse_assert(bool(bound.current_list), _sstable->index_filename());
return *bound.current_list->_entries[bound.current_index_idx];
return bound.current_list->_entries[bound.current_index_idx];
}
// Valid if partition_data_ready(bound)
partition_index_page& current_page(index_bound& bound) {
parse_assert(bool(bound.current_list), _sstable->index_filename());
return *bound.current_list;
}
future<> advance_to_next_partition(index_bound& bound) {
@@ -617,7 +620,7 @@ private:
if (bound.current_index_idx + 1 < bound.current_list->size()) {
++bound.current_index_idx;
bound.current_pi_idx = 0;
bound.data_file_position = bound.current_list->_entries[bound.current_index_idx]->position();
bound.data_file_position = bound.current_list->_entries[bound.current_index_idx].position();
bound.element = indexable_element::partition;
bound.end_open_marker.reset();
return reset_clustered_cursor(bound);
@@ -680,9 +683,13 @@ private:
return advance_to_page(bound, summary_idx).then([this, &bound, pos, summary_idx] {
sstlog.trace("index {}: old page index = {}", fmt::ptr(this), bound.current_index_idx);
auto i = _alloc_section(_region, [&] {
-auto& entries = bound.current_list->_entries;
-return std::lower_bound(std::begin(entries) + bound.current_index_idx, std::end(entries), pos,
-index_comparator(*_sstable->_schema));
+auto& page = *bound.current_list;
+auto& s = *_sstable->_schema;
+auto r = std::views::iota(bound.current_index_idx, page._entries.size());
+auto it = std::ranges::partition_point(r, [&] (int idx) {
+return index_entry_tri_cmp(s, page, idx, pos) < 0;
+});
+return page._entries.begin() + bound.current_index_idx + std::ranges::distance(r.begin(), it);
});
// i is valid until next allocation point
auto& entries = bound.current_list->_entries;
@@ -697,7 +704,7 @@ private:
}
bound.current_index_idx = std::distance(std::begin(entries), i);
bound.current_pi_idx = 0;
-bound.data_file_position = (*i)->position();
+bound.data_file_position = (*i).position();
bound.element = indexable_element::partition;
bound.end_open_marker.reset();
sstlog.trace("index {}: new page index = {}, pos={}", fmt::ptr(this), bound.current_index_idx, bound.data_file_position);
@@ -800,6 +807,34 @@ public:
}
}
+static
+std::unique_ptr<clustered_index_cursor> make_cursor(const parsed_promoted_index_entry& pi,
+shared_sstable sst,
+reader_permit permit,
+tracing::trace_state_ptr trace_state,
+file_input_stream_options options,
+use_caching caching)
+{
+if (sst->get_version() >= sstable_version_types::mc) [[likely]] {
+seastar::shared_ptr<cached_file> cached_file_ptr = caching
+? sst->_cached_index_file
+: seastar::make_shared<cached_file>(make_tracked_index_file(*sst, permit, trace_state, caching),
+sst->manager().get_cache_tracker().get_index_cached_file_stats(),
+sst->manager().get_cache_tracker().get_lru(),
+sst->manager().get_cache_tracker().region(),
+sst->_index_file_size);
+return std::make_unique<mc::bsearch_clustered_cursor>(*sst->get_schema(),
+pi.promoted_index_start, pi.promoted_index_size,
+promoted_index_cache_metrics, permit,
+sst->get_column_translation(), cached_file_ptr, pi.num_blocks, trace_state, sst->features());
+}
+auto file = make_tracked_index_file(*sst, permit, std::move(trace_state), caching);
+auto promoted_index_stream = make_file_input_stream(std::move(file), pi.promoted_index_start, pi.promoted_index_size, options);
+return std::make_unique<scanning_clustered_index_cursor>(*sst->get_schema(), permit,
+std::move(promoted_index_stream), pi.promoted_index_size, pi.num_blocks, std::nullopt);
+}
// Ensures that partition_data_ready() returns true.
// Can be called only when !eof()
future<> read_partition_data() override {
@@ -835,10 +870,10 @@ public:
clustered_index_cursor* current_clustered_cursor(index_bound& bound) {
if (!bound.clustered_cursor) {
_alloc_section(_region, [&] {
-index_entry& e = current_partition_entry(bound);
-promoted_index* pi = e.get_promoted_index().get();
-if (pi) {
-bound.clustered_cursor = pi->make_cursor(_sstable, _permit, _trace_state,
+partition_index_page& page = current_page(bound);
+if (page.has_promoted_index(bound.current_index_idx)) {
+promoted_index& pi = page.get_promoted_index(bound.current_index_idx);
+bound.clustered_cursor = make_cursor(pi, _sstable, _permit, _trace_state,
get_file_input_stream_options(), _use_caching);
}
});
@@ -861,15 +896,15 @@ public:
// It may be unavailable for old sstables for which this information was not generated.
// Can be called only when partition_data_ready().
std::optional<sstables::deletion_time> partition_tombstone() override {
-return current_partition_entry(_lower_bound).get_deletion_time();
+return current_page(_lower_bound).get_deletion_time(_lower_bound.current_index_idx);
}
// Returns the key for current partition.
// Can be called only when partition_data_ready().
std::optional<partition_key> get_partition_key() override {
return _alloc_section(_region, [this] {
-index_entry& e = current_partition_entry(_lower_bound);
-return e.get_key().to_partition_key(*_sstable->_schema);
+return current_page(_lower_bound).get_key(_lower_bound.current_index_idx)
+.to_partition_key(*_sstable->_schema);
});
}
@@ -883,8 +918,8 @@ public:
// Returns the number of promoted index entries for the current partition.
// Can be called only when partition_data_ready().
uint64_t get_promoted_index_size() {
-index_entry& e = current_partition_entry(_lower_bound);
-return e.get_promoted_index_size();
+partition_index_page& page = current_page(_lower_bound);
+return page.get_promoted_index_size(_lower_bound.current_index_idx);
}
bool partition_data_ready() const override {
@@ -975,9 +1010,9 @@ public:
return make_ready_future<bool>(false);
}
return read_partition_data().then([this, key] {
-index_comparator cmp(*_sstable->_schema);
bool found = _alloc_section(_region, [&] {
-return cmp(key, current_partition_entry(_lower_bound)) == 0;
+auto& page = current_page(_lower_bound);
+return index_entry_tri_cmp(*_sstable->_schema, page, _lower_bound.current_index_idx, key) == 0;
});
return make_ready_future<bool>(found);
});


@@ -189,10 +189,11 @@ public:
{}
future<std::optional<directory_entry>> get() override {
std::filesystem::path dir(_prefix);
-do {
+while (true) {
if (_pos == _info.size()) {
_info.clear();
_info = co_await _client->list_objects(_bucket, _prefix, _paging);
_pos = 0;
}
if (_info.empty()) {
break;
@@ -203,7 +204,7 @@ public:
continue;
}
co_return ent;
-} while (false);
+}
co_return std::nullopt;
}
@@ -276,7 +277,7 @@ public:
co_await f.close();
auto names = ranges | std::views::transform([](auto& p) { return p.name; }) | std::ranges::to<std::vector<std::string>>();
-co_await _client->merge_objects(bucket, object, std::move(names), {}, as);
+co_await _client->merge_objects(bucket, object, names, {}, as);
co_await parallel_for_each(names, [this, bucket](auto& name) -> future<> {
co_await _client->delete_object(bucket, name);


@@ -257,14 +257,11 @@ public:
while (partial_page || i != _cache.end()) {
if (partial_page) {
auto preempted = with_allocator(_region.allocator(), [&] {
-while (!partial_page->empty()) {
-partial_page->clear_one_entry();
-if (need_preempt()) {
-return true;
-}
-}
+while (partial_page->clear_gently() != stop_iteration::yes) {
+return true;
+}
partial_page.reset();
-return false;
+return need_preempt();
});
if (preempted) {
auto key = (i != _cache.end()) ? std::optional(i->key()) : std::nullopt;


@@ -1132,7 +1132,6 @@ public:
friend class mc::writer;
friend class index_reader;
-friend class promoted_index;
friend class sstables_manager;
template <typename DataConsumeRowsContext>
friend future<std::unique_ptr<DataConsumeRowsContext>>


@@ -180,18 +180,11 @@ storage_manager::config_updater::config_updater(const db::config& cfg, storage_m
{}
sstables::sstable::version_types sstables_manager::get_highest_supported_format() const noexcept {
-// FIXME: start announcing `ms` here after it becomes the default.
-// (There are several tests which expect that new sstables are written with
-// the format reported by this API).
-//
-// After `ms` becomes the default, this function look like this:
-//
-// if (_features.ms_sstable) {
-// return sstable_version_types::ms;
-// } else {
-// return sstable_version_types::me;
-// }
-return sstable_version_types::me;
+if (_features.ms_sstable) {
+return sstable_version_types::ms;
+} else {
+return sstable_version_types::me;
+}
}
sstables::sstable::version_types sstables_manager::get_preferred_sstable_version() const {


@@ -221,10 +221,16 @@ private:
sst->set_sstable_level(0);
auto units = co_await sst_manager.dir_semaphore().get_units(1);
sstables::sstable_open_config cfg {
+.unsealed_sstable = true,
.ignore_component_digest_mismatch = db.get_config().ignore_component_digest_mismatch(),
};
co_await sst->load(table.get_effective_replication_map()->get_sharder(*table.schema()), cfg);
-co_await table.add_sstable_and_update_cache(sst);
+co_await table.add_new_sstable_and_update_cache(sst, [&sst_manager, sst] (sstables::shared_sstable loading_sst) -> future<> {
+if (loading_sst == sst) {
+auto writer_cfg = sst_manager.configure_writer(loading_sst->get_origin());
+co_await loading_sst->seal_sstable(writer_cfg.backup);
+}
+});
}
future<>
@@ -295,7 +301,8 @@ private:
sstables::sstable_state::normal,
sstables::sstable::component_basename(
_table.schema()->ks_name(), _table.schema()->cf_name(), descriptor.version, gen, descriptor.format, it->first),
-sstables::sstable_stream_sink_cfg{.last_component = std::next(it) == components.cend()});
+sstables::sstable_stream_sink_cfg{.last_component = std::next(it) == components.cend(),
+.leave_unsealed = true});
auto out = co_await sstable_sink->output(foptions, stream_options);
input_stream src(co_await [this, &it, sstable, f = files.at(it->first)]() -> future<input_stream<char>> {


@@ -400,7 +400,7 @@ task_manager::virtual_task::impl::impl(module_ptr module) noexcept
: _module(std::move(module))
{}
-future<utils::chunked_vector<task_identity>> task_manager::virtual_task::impl::get_children(module_ptr module, task_id parent_id, std::function<bool(locator::host_id)> is_host_alive) {
+future<utils::chunked_vector<task_identity>> task_manager::virtual_task::impl::get_children(module_ptr module, task_id parent_id, locator::token_metadata_ptr tmptr) {
auto ms = module->get_task_manager()._messaging;
if (!ms) {
auto ids = co_await module->get_task_manager().get_virtual_task_children(parent_id);
@@ -417,19 +417,18 @@ future<utils::chunked_vector<task_identity>> task_manager::virtual_task::impl::g
tmlogger.info("tasks_vt_get_children: waiting");
co_await handler.wait_for_message(std::chrono::steady_clock::now() + std::chrono::seconds{60});
});
-co_return co_await map_reduce(nodes, [ms, parent_id, is_host_alive = std::move(is_host_alive)] (auto host_id) -> future<utils::chunked_vector<task_identity>> {
-if (is_host_alive(host_id)) {
-return ser::tasks_rpc_verbs::send_tasks_get_children(ms, host_id, parent_id).then([host_id] (auto resp) {
-return resp | std::views::transform([host_id] (auto id) {
-return task_identity{
-.host_id = host_id,
-.task_id = id
-};
-}) | std::ranges::to<utils::chunked_vector<task_identity>>();
-});
-} else {
-return make_ready_future<utils::chunked_vector<task_identity>>();
-}
+co_return co_await map_reduce(nodes, [ms, parent_id] (auto host_id) -> future<utils::chunked_vector<task_identity>> {
+return ser::tasks_rpc_verbs::send_tasks_get_children(ms, host_id, parent_id).then([host_id] (auto resp) {
+return resp | std::views::transform([host_id] (auto id) {
+return task_identity{
+.host_id = host_id,
+.task_id = id
+};
+}) | std::ranges::to<utils::chunked_vector<task_identity>>();
+}).handle_exception_type([host_id, parent_id] (const rpc::closed_error& ex) {
+tmlogger.warn("Failed to get children of virtual task with id={} from node {}: {}", parent_id, host_id, ex);
+return utils::chunked_vector<task_identity>{};
+});
}, utils::chunked_vector<task_identity>{}, [] (auto a, auto&& b) {
std::move(b.begin(), b.end(), std::back_inserter(a));
return a;
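The get_children change above fans an RPC out to every node and, instead of pre-filtering hosts by liveness, converts a closed connection into an empty result so the whole map-reduce cannot fail on a node that died mid-query. A minimal asyncio sketch of that pattern (the helper names are invented for illustration, not Scylla's API):

```python
import asyncio

# Fan a per-host query out to all hosts; a host that drops the
# connection contributes no results instead of failing the gather
# (analogous to catching rpc::closed_error per node).
async def gather_children(hosts, get_children):
    async def query(host):
        try:
            return [(host, tid) for tid in await get_children(host)]
        except ConnectionError:
            return []
    chunks = await asyncio.gather(*(query(h) for h in hosts))
    return [ident for chunk in chunks for ident in chunk]
```

Catching the error per node rather than checking liveness up front also closes the race where a node dies between the liveness check and the RPC.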


@@ -19,6 +19,7 @@
#include "db_clock.hh"
#include "utils/log.hh"
#include "locator/host_id.hh"
+#include "locator/token_metadata_fwd.hh"
#include "schema/schema_fwd.hh"
#include "tasks/types.hh"
#include "utils/chunked_vector.hh"
@@ -282,7 +283,7 @@ public:
impl& operator=(impl&&) = delete;
virtual ~impl() = default;
protected:
-static future<utils::chunked_vector<task_identity>> get_children(module_ptr module, task_id parent_id, std::function<bool(locator::host_id)> is_host_alive);
+static future<utils::chunked_vector<task_identity>> get_children(module_ptr module, task_id parent_id, locator::token_metadata_ptr tmptr);
public:
virtual task_group get_group() const noexcept = 0;
// Returns std::nullopt if an operation with task_id isn't tracked by this virtual_task.


@@ -181,7 +181,7 @@ def parse_cmd_line() -> argparse.Namespace:
help="Run only tests for given build mode(s)")
parser.add_argument('--repeat', action="store", default="1", type=int,
help="number of times to repeat test execution")
-parser.add_argument('--timeout', action="store", default="24000", type=int,
+parser.add_argument('--timeout', action="store", default="3600", type=int,
help="timeout value for single test execution")
parser.add_argument('--session-timeout', action="store", default="24000", type=int,
help="timeout value for test.py/pytest session execution")


@@ -469,18 +469,6 @@ def test_get_records_nonexistent_iterator(dynamodbstreams):
# not allowed (see test_streams_change_type), and while removing and re-adding
# a stream is possible, it is very slow. So we create four different fixtures
# with the four different StreamViewType settings for these four fixtures.
-#
-# It turns out that DynamoDB makes reusing the same table in different tests
-# very difficult, because when we request a "LATEST" iterator we sometimes
-# miss the immediately following write (this issue doesn't happen in
-# ALternator, just in DynamoDB - presumably LATEST adds some time slack?)
-# So all the fixtures we create below have scope="function", meaning that a
-# separate table is created for each of the tests using these fixtures. This
-# slows the tests down a bit, but not by much (about 0.05 seconds per test).
-# It is still worthwhile to use a fixture rather than to create a table
-# explicitly - it is convenient, safe (the table gets deleted automatically)
-# and if in the future we can work around the DynamoDB problem, we can return
-# these fixtures to module scope.
@contextmanager
def create_table_ss(dynamodb, dynamodbstreams, type):
@@ -524,43 +512,43 @@ def create_table_s_no_ck(dynamodb, dynamodbstreams, type):
yield table, arn
table.delete()
-@pytest.fixture(scope="function")
+@pytest.fixture(scope="module")
def test_table_sss_new_and_old_images_lsi(dynamodb, dynamodbstreams):
yield from create_table_sss_lsi(dynamodb, dynamodbstreams, 'NEW_AND_OLD_IMAGES')
-@pytest.fixture(scope="function")
+@pytest.fixture(scope="module")
def test_table_ss_keys_only(dynamodb, dynamodbstreams):
with create_table_ss(dynamodb, dynamodbstreams, 'KEYS_ONLY') as stream:
yield stream
-@pytest.fixture(scope="function")
+@pytest.fixture(scope="module")
def test_table_ss_new_image(dynamodb, dynamodbstreams):
with create_table_ss(dynamodb, dynamodbstreams, 'NEW_IMAGE') as stream:
yield stream
-@pytest.fixture(scope="function")
+@pytest.fixture(scope="module")
def test_table_ss_old_image(dynamodb, dynamodbstreams):
with create_table_ss(dynamodb, dynamodbstreams, 'OLD_IMAGE') as stream:
yield stream
-@pytest.fixture(scope="function")
+@pytest.fixture(scope="module")
def test_table_ss_new_and_old_images(dynamodb, dynamodbstreams):
with create_table_ss(dynamodb, dynamodbstreams, 'NEW_AND_OLD_IMAGES') as stream:
yield stream
-@pytest.fixture(scope="function")
+@pytest.fixture(scope="module")
def test_table_s_no_ck_keys_only(dynamodb, dynamodbstreams):
yield from create_table_s_no_ck(dynamodb, dynamodbstreams, 'KEYS_ONLY')
-@pytest.fixture(scope="function")
+@pytest.fixture(scope="module")
def test_table_s_no_ck_new_image(dynamodb, dynamodbstreams):
yield from create_table_s_no_ck(dynamodb, dynamodbstreams, 'NEW_IMAGE')
-@pytest.fixture(scope="function")
+@pytest.fixture(scope="module")
def test_table_s_no_ck_old_image(dynamodb, dynamodbstreams):
yield from create_table_s_no_ck(dynamodb, dynamodbstreams, 'OLD_IMAGE')
-@pytest.fixture(scope="function")
+@pytest.fixture(scope="module")
def test_table_s_no_ck_new_and_old_images(dynamodb, dynamodbstreams):
yield from create_table_s_no_ck(dynamodb, dynamodbstreams, 'NEW_AND_OLD_IMAGES')
@@ -626,13 +614,30 @@ def list_shards(dynamodbstreams, arn):
# Utility function for getting shard iterators starting at "LATEST" for
# all the shards of the given stream arn.
+# On DynamoDB (but not Alternator), LATEST has a time slack: it may point to
+# a position slightly before the true end of the stream, so writes from a
+# previous test that reused the same table can appear to be "in the future"
+# relative to the returned iterators and therefore show up unexpectedly in
+# the current test's reads. To work around this we drain any already-pending
+# records from the iterators before returning them, so the caller is
+# guaranteed to see only events written *after* this call returns.
def latest_iterators(dynamodbstreams, arn):
iterators = []
for shard_id in list_shards(dynamodbstreams, arn):
iterators.append(dynamodbstreams.get_shard_iterator(StreamArn=arn,
ShardId=shard_id, ShardIteratorType='LATEST')['ShardIterator'])
assert len(set(iterators)) == len(iterators)
-return iterators
+# Drain any records that are already visible at the LATEST position.
+# We keep fetching until no more records are returned, which means that
+# the stream is caught up. This drain loop is not necessary on Alternator,
+# and needlessly slows the test down.
+if not dynamodbstreams._endpoint.host.endswith('.amazonaws.com'):
+return iterators
+while True:
+events = []
+iterators = fetch_more(dynamodbstreams, iterators, events)
+if events == []:
+return iterators
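The drain loop added to latest_iterators() is a general "read until quiescent" pattern: keep polling every iterator until one full pass yields nothing, so later reads only observe events written after the drain finished. A standalone sketch (the `get_records` callable and response shape are stand-ins mimicking the streams API):

```python
# Poll each iterator, advancing it via NextShardIterator, until a full
# pass over all iterators returns no records; then the stream position
# is caught up and the advanced iterators are returned to the caller.
def drain_to_latest(get_records, iterators):
    while True:
        got_any = False
        next_iterators = []
        for it in iterators:
            resp = get_records(it)
            got_any = got_any or bool(resp.get('Records'))
            next_iterators.append(resp.get('NextShardIterator', it))
        iterators = next_iterators
        if not got_any:
            return iterators
```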
# Similar to latest_iterators(), just also returns the shard id which produced
# each iterator.
@@ -641,7 +646,16 @@ def shards_and_latest_iterators(dynamodbstreams, arn):
for shard_id in list_shards(dynamodbstreams, arn):
shards_and_iterators.append((shard_id, dynamodbstreams.get_shard_iterator(StreamArn=arn,
ShardId=shard_id, ShardIteratorType='LATEST')['ShardIterator']))
-return shards_and_iterators
+# Drain pre-existing records from the iterators, for the same reason as
+# explained in latest_iterators() above.
+if not dynamodbstreams._endpoint.host.endswith('.amazonaws.com'):
+return shards_and_iterators
+while True:
+events = []
+new_iters = fetch_more(dynamodbstreams, [it for _, it in shards_and_iterators], events)
+shards_and_iterators = list(zip([sh for sh, _ in shards_and_iterators], new_iters))
+if events == []:
+return shards_and_iterators
# Utility function for fetching more content from the stream (given its
# array of iterators) into an "output" array. Call repeatedly to get more
@@ -806,9 +820,11 @@ def fetch_and_compare_events(dynamodb, dynamodbstreams, iterators, expected_even
# function "updatefunc" which is supposed to do some updates to the table
# and also return an expected_events list. do_test() then fetches the streams
# data and compares it to the expected_events using compare_events().
-def do_test(test_table_ss_stream, dynamodb, dynamodbstreams, updatefunc, mode, p = random_string(), c = random_string()):
+def do_test(test_table_ss_stream, dynamodb, dynamodbstreams, updatefunc, mode):
table, arn = test_table_ss_stream
iterators = latest_iterators(dynamodbstreams, arn)
+p = random_string()
+c = random_string()
expected_events = updatefunc(table, p, c)
fetch_and_compare_events(dynamodb, dynamodbstreams, iterators, expected_events, mode)
@@ -956,7 +972,7 @@ def test_streams_updateitem_old_image_empty_item(test_table_ss_old_image, dynamo
# columns they are only included in the preimage if they change.
# Currently fails in Alternator because the item's key is missing in
# OldImage (#6935) and the LSI key is also missing (#7030).
-@pytest.fixture(scope="function")
+@pytest.fixture(scope="module")
def test_table_ss_old_image_and_lsi(dynamodb, dynamodbstreams):
table = create_test_table(dynamodb,
Tags=TAGS,
@@ -1357,49 +1373,48 @@ def test_streams_after_sequence_number(test_table_ss_keys_only, dynamodbstreams)
# Test the "TRIM_HORIZON" iterator, which can be used to re-read *all* the
# previously-read events of the stream shard again.
-# NOTE: This test relies on the test_table_ss_keys_only fixture giving us a
-# brand new stream, with no old events saved from other tests. If we ever
-# change this, we should change this test to use a different fixture.
-def test_streams_trim_horizon(test_table_ss_keys_only, dynamodbstreams):
-table, arn = test_table_ss_keys_only
-shards_and_iterators = shards_and_latest_iterators(dynamodbstreams, arn)
-# Do two UpdateItem operations to the same key, that are expected to leave
-# two events in the stream.
-p = random_string()
-c = random_string()
-table.update_item(Key={'p': p, 'c': c},
-UpdateExpression='SET x = :val1', ExpressionAttributeValues={':val1': 3})
-table.update_item(Key={'p': p, 'c': c},
-UpdateExpression='SET x = :val1', ExpressionAttributeValues={':val1': 5})
-# Eventually, *one* of the stream shards will return the two events:
-timeout = time.time() + 15
-while time.time() < timeout:
-for (shard_id, iter) in shards_and_iterators:
-response = dynamodbstreams.get_records(ShardIterator=iter)
-if 'Records' in response and len(response['Records']) == 2:
-assert response['Records'][0]['dynamodb']['Keys'] == {'p': {'S': p}, 'c': {'S': c}}
-assert response['Records'][1]['dynamodb']['Keys'] == {'p': {'S': p}, 'c': {'S': c}}
-sequence_number_1 = response['Records'][0]['dynamodb']['SequenceNumber']
-sequence_number_2 = response['Records'][1]['dynamodb']['SequenceNumber']
-# If we use the TRIM_HORIZON iterator, we should receive the
-# same two events again, in the same order.
-# Note that we assume that the fixture gave us a brand new
-# stream, with no old events saved from other tests. If we
-# couldn't assume this, this test would need to become much
-# more complex, and would need to read from this shard until
-# we find the two events we are looking for.
-iter = dynamodbstreams.get_shard_iterator(StreamArn=arn,
-ShardId=shard_id, ShardIteratorType='TRIM_HORIZON')['ShardIterator']
+def test_streams_trim_horizon(dynamodb, dynamodbstreams):
+# This test needs a brand-new stream, without old data from other
+# tests, so we can't reuse the test_table_ss_keys_only fixture.
+with create_table_ss(dynamodb, dynamodbstreams, 'KEYS_ONLY') as (table, arn):
+shards_and_iterators = shards_and_latest_iterators(dynamodbstreams, arn)
+# Do two UpdateItem operations to the same key, that are expected to leave
+# two events in the stream.
+p = random_string()
+c = random_string()
+table.update_item(Key={'p': p, 'c': c},
+UpdateExpression='SET x = :val1', ExpressionAttributeValues={':val1': 3})
+table.update_item(Key={'p': p, 'c': c},
+UpdateExpression='SET x = :val1', ExpressionAttributeValues={':val1': 5})
+# Eventually, *one* of the stream shards will return the two events:
+timeout = time.time() + 15
+while time.time() < timeout:
+for (shard_id, iter) in shards_and_iterators:
response = dynamodbstreams.get_records(ShardIterator=iter)
-assert 'Records' in response
-assert len(response['Records']) == 2
-assert response['Records'][0]['dynamodb']['Keys'] == {'p': {'S': p}, 'c': {'S': c}}
-assert response['Records'][1]['dynamodb']['Keys'] == {'p': {'S': p}, 'c': {'S': c}}
-assert response['Records'][0]['dynamodb']['SequenceNumber'] == sequence_number_1
-assert response['Records'][1]['dynamodb']['SequenceNumber'] == sequence_number_2
-return
-time.sleep(0.5)
-pytest.fail("timed out")
+if 'Records' in response and len(response['Records']) == 2:
+assert response['Records'][0]['dynamodb']['Keys'] == {'p': {'S': p}, 'c': {'S': c}}
+assert response['Records'][1]['dynamodb']['Keys'] == {'p': {'S': p}, 'c': {'S': c}}
+sequence_number_1 = response['Records'][0]['dynamodb']['SequenceNumber']
+sequence_number_2 = response['Records'][1]['dynamodb']['SequenceNumber']
+# If we use the TRIM_HORIZON iterator, we should receive the
+# same two events again, in the same order.
+# Note that we assume that the fixture gave us a brand new
+# stream, with no old events saved from other tests. If we
+# couldn't assume this, this test would need to become much
+# more complex, and would need to read from this shard until
+# we find the two events we are looking for.
+iter = dynamodbstreams.get_shard_iterator(StreamArn=arn,
+ShardId=shard_id, ShardIteratorType='TRIM_HORIZON')['ShardIterator']
+response = dynamodbstreams.get_records(ShardIterator=iter)
+assert 'Records' in response
+assert len(response['Records']) == 2
+assert response['Records'][0]['dynamodb']['Keys'] == {'p': {'S': p}, 'c': {'S': c}}
+assert response['Records'][1]['dynamodb']['Keys'] == {'p': {'S': p}, 'c': {'S': c}}
+assert response['Records'][0]['dynamodb']['SequenceNumber'] == sequence_number_1
+assert response['Records'][1]['dynamodb']['SequenceNumber'] == sequence_number_2
+return
+time.sleep(0.5)
+pytest.fail("timed out")
# Test the StartingSequenceNumber information returned by DescribeStream.
# The DynamoDB documentation explains that StartingSequenceNumber is
@@ -1414,45 +1429,47 @@ def test_streams_trim_horizon(test_table_ss_keys_only, dynamodbstreams):
# that the important thing is that reading a shard starting at
# StartingSequenceNumber will result in reading all the available items -
# similar to how TRIM_HORIZON works. This is what the following test verifies.
-def test_streams_starting_sequence_number(test_table_ss_keys_only, dynamodbstreams):
-table, arn = test_table_ss_keys_only
-# Do two UpdateItem operations to the same key, that are expected to leave
-# two events in the stream.
-p = random_string()
-c = random_string()
-table.update_item(Key={'p': p, 'c': c},
-UpdateExpression='SET x = :val1', ExpressionAttributeValues={':val1': 3})
-table.update_item(Key={'p': p, 'c': c},
-UpdateExpression='SET x = :val1', ExpressionAttributeValues={':val1': 5})
-# Get for all the stream shards the iterator starting at the shard's
-# StartingSequenceNumber:
-response = dynamodbstreams.describe_stream(StreamArn=arn)
-shards = response['StreamDescription']['Shards']
-while 'LastEvaluatedShardId' in response['StreamDescription']:
-response = dynamodbstreams.describe_stream(StreamArn=arn,
-ExclusiveStartShardId=response['StreamDescription']['LastEvaluatedShardId'])
-shards.extend(response['StreamDescription']['Shards'])
-iterators = []
-for shard in shards:
-shard_id = shard['ShardId']
-start = shard['SequenceNumberRange']['StartingSequenceNumber']
-assert start.isdecimal()
-iterators.append(dynamodbstreams.get_shard_iterator(StreamArn=arn,
-ShardId=shard_id, ShardIteratorType='AT_SEQUENCE_NUMBER',
-SequenceNumber=start)['ShardIterator'])
+def test_streams_starting_sequence_number(dynamodb, dynamodbstreams):
+# This test needs a brand-new stream, without old data from other
+# tests, so we can't reuse the test_table_ss_keys_only fixture.
+with create_table_ss(dynamodb, dynamodbstreams, 'KEYS_ONLY') as (table, arn):
+# Do two UpdateItem operations to the same key, that are expected to leave
+# two events in the stream.
+p = random_string()
+c = random_string()
+table.update_item(Key={'p': p, 'c': c},
+UpdateExpression='SET x = :val1', ExpressionAttributeValues={':val1': 3})
+table.update_item(Key={'p': p, 'c': c},
+UpdateExpression='SET x = :val1', ExpressionAttributeValues={':val1': 5})
+# Get for all the stream shards the iterator starting at the shard's
+# StartingSequenceNumber:
+response = dynamodbstreams.describe_stream(StreamArn=arn)
+shards = response['StreamDescription']['Shards']
+while 'LastEvaluatedShardId' in response['StreamDescription']:
+response = dynamodbstreams.describe_stream(StreamArn=arn,
+ExclusiveStartShardId=response['StreamDescription']['LastEvaluatedShardId'])
+shards.extend(response['StreamDescription']['Shards'])
+iterators = []
+for shard in shards:
+shard_id = shard['ShardId']
+start = shard['SequenceNumberRange']['StartingSequenceNumber']
+assert start.isdecimal()
+iterators.append(dynamodbstreams.get_shard_iterator(StreamArn=arn,
+ShardId=shard_id, ShardIteratorType='AT_SEQUENCE_NUMBER',
+SequenceNumber=start)['ShardIterator'])
-# Eventually, *one* of the stream shards will return the two events:
-timeout = time.time() + 15
-while time.time() < timeout:
-for iter in iterators:
-response = dynamodbstreams.get_records(ShardIterator=iter)
-if 'Records' in response and len(response['Records']) == 2:
-assert response['Records'][0]['dynamodb']['Keys'] == {'p': {'S': p}, 'c': {'S': c}}
-assert response['Records'][1]['dynamodb']['Keys'] == {'p': {'S': p}, 'c': {'S': c}}
-return
-time.sleep(0.5)
+# Eventually, *one* of the stream shards will return the two events:
+timeout = time.time() + 15
+while time.time() < timeout:
+for iter in iterators:
+response = dynamodbstreams.get_records(ShardIterator=iter)
+if 'Records' in response and len(response['Records']) == 2:
+assert response['Records'][0]['dynamodb']['Keys'] == {'p': {'S': p}, 'c': {'S': c}}
+assert response['Records'][1]['dynamodb']['Keys'] == {'p': {'S': p}, 'c': {'S': c}}
+return
+time.sleep(0.5)
-pytest.fail("timed out")
+pytest.fail("timed out")
# Above we tested some specific operations in small tests aimed to reproduce
# a specific bug, in the following tests we do a all the different operations,
@@ -1746,50 +1763,49 @@ def test_stream_specification(test_table_stream_with_result, dynamodbstreams):
# that the right answer is that NextShardIterator should be *missing*
# (reproduces issue #7237).
@pytest.mark.xfail(reason="disabled stream is deleted - issue #7239")
-def test_streams_closed_read(test_table_ss_keys_only, dynamodbstreams):
-table, arn = test_table_ss_keys_only
-shards_and_iterators = shards_and_latest_iterators(dynamodbstreams, arn)
-# Do an UpdateItem operation that is expected to leave one event in the
-# stream.
-table.update_item(Key={'p': random_string(), 'c': random_string()},
-UpdateExpression='SET x = :val1', ExpressionAttributeValues={':val1': 5})
-# Disable streaming for this table. Note that the test_table_ss_keys_only
-# fixture has "function" scope so it is fine to ruin table, it will not
-# be used in other tests.
-disable_stream(dynamodbstreams, table)
+def test_streams_closed_read(dynamodb, dynamodbstreams):
+# This test can't use the shared table test_table_ss_keys_only,
+# because it wants to disable streaming, so let's create a new table:
+with create_table_ss(dynamodb, dynamodbstreams, 'KEYS_ONLY') as (table, arn):
+shards_and_iterators = shards_and_latest_iterators(dynamodbstreams, arn)
+# Do an UpdateItem operation that is expected to leave one event in the
+# stream.
+table.update_item(Key={'p': random_string(), 'c': random_string()},
+UpdateExpression='SET x = :val1', ExpressionAttributeValues={':val1': 5})
+disable_stream(dynamodbstreams, table)
-# Even after streaming is disabled for the table, we can still read
-# from the earlier stream (it is guaranteed to work for 24 hours).
-# The iterators we got earlier should still be fully usable, and
-# eventually *one* of the stream shards will return one event:
-timeout = time.time() + 15
-while time.time() < timeout:
-for (shard_id, iter) in shards_and_iterators:
-response = dynamodbstreams.get_records(ShardIterator=iter)
-if 'Records' in response and response['Records'] != []:
-# Found the shard with the data! Test that it only has
-# one event. NextShardIterator should either be missing now,
-# indicating that it is a closed shard (DynamoDB does this),
-# or, it may (and currently does in Alternator) return another
-# and reading from *that* iterator should then tell us that
-# we reached the end of the shard (i.e., zero results and
-# missing NextShardIterator).
-assert len(response['Records']) == 1
-if 'NextShardIterator' in response:
-response = dynamodbstreams.get_records(ShardIterator=response['NextShardIterator'])
-assert len(response['Records']) == 0
-assert not 'NextShardIterator' in response
-# Until now we verified that we can read the closed shard
-# using an old iterator. Let's test now that the closed
-# shard id is also still valid, and a new iterator can be
-# created for it, and the old data can be read from it:
-iter = dynamodbstreams.get_shard_iterator(StreamArn=arn,
-ShardId=shard_id, ShardIteratorType='TRIM_HORIZON')['ShardIterator']
+# Even after streaming is disabled for the table, we can still read
+# from the earlier stream (it is guaranteed to work for 24 hours).
+# The iterators we got earlier should still be fully usable, and
+# eventually *one* of the stream shards will return one event:
+timeout = time.time() + 15
+while time.time() < timeout:
+for (shard_id, iter) in shards_and_iterators:
response = dynamodbstreams.get_records(ShardIterator=iter)
-assert len(response['Records']) == 1
-return
-time.sleep(0.5)
-pytest.fail("timed out")
if 'Records' in response and response['Records'] != []:
# Found the shard with the data! Test that it only has
# one event. NextShardIterator should either be missing now,
# indicating that it is a closed shard (DynamoDB does this),
# or, it may (and currently does in Alternator) return another
# and reading from *that* iterator should then tell us that
# we reached the end of the shard (i.e., zero results and
# missing NextShardIterator).
assert len(response['Records']) == 1
if 'NextShardIterator' in response:
response = dynamodbstreams.get_records(ShardIterator=response['NextShardIterator'])
assert len(response['Records']) == 0
assert not 'NextShardIterator' in response
# Until now we verified that we can read the closed shard
# using an old iterator. Let's test now that the closed
# shard id is also still valid, and a new iterator can be
# created for it, and the old data can be read from it:
iter = dynamodbstreams.get_shard_iterator(StreamArn=arn,
ShardId=shard_id, ShardIteratorType='TRIM_HORIZON')['ShardIterator']
response = dynamodbstreams.get_records(ShardIterator=iter)
assert len(response['Records']) == 1
return
time.sleep(0.5)
pytest.fail("timed out")
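The comments in the test above spell out the closed-shard contract: a closed shard either omits `NextShardIterator` entirely (as DynamoDB does) or hands back one final iterator that yields zero records and no further iterator. A minimal sketch of a consumer that drains a shard under that contract; `FakeStreams` is a hypothetical stub standing in for the real `dynamodbstreams` client:

```python
class FakeStreams:
    """Hypothetical stub mimicking the closed-shard behavior described above:
    one page of records, then one final empty page with no NextShardIterator."""
    def __init__(self, records):
        self._pages = {
            'it-0': {'Records': records, 'NextShardIterator': 'it-1'},
            'it-1': {'Records': []},  # closed: no NextShardIterator
        }
    def get_records(self, ShardIterator):
        return self._pages[ShardIterator]

def drain_closed_shard(streams, iterator):
    """Read records until the shard signals it is closed."""
    records = []
    while iterator is not None:
        response = streams.get_records(ShardIterator=iterator)
        records.extend(response.get('Records', []))
        # A closed shard eventually stops returning NextShardIterator.
        iterator = response.get('NextShardIterator')
    return records
```

This loop handles both behaviors the comment describes, since a missing `NextShardIterator` key simply terminates the iteration.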
# In the above test (test_streams_closed_read) we used a disabled stream as
# a means to generate a closed shard, and tested the behavior of that closed
@@ -1800,84 +1816,83 @@ def test_streams_closed_read(test_table_ss_keys_only, dynamodbstreams):
# stream's shards should give an indication that they are all closed - but
# all these shards should still be readable.
@pytest.mark.xfail(reason="disabled stream is deleted - issue #7239")
def test_streams_disabled_stream(test_table_ss_keys_only, dynamodbstreams):
table, arn = test_table_ss_keys_only
iterators = latest_iterators(dynamodbstreams, arn)
# Do an UpdateItem operation that is expected to leave one event in the
# stream.
table.update_item(Key={'p': random_string(), 'c': random_string()},
UpdateExpression='SET x = :x', ExpressionAttributeValues={':x': 5})
def test_streams_disabled_stream(dynamodb, dynamodbstreams):
# This test can't use the shared table test_table_ss_keys_only,
# because it wants to disable streaming, so let's create a new table:
with create_table_ss(dynamodb, dynamodbstreams, 'KEYS_ONLY') as (table, arn):
iterators = latest_iterators(dynamodbstreams, arn)
# Do an UpdateItem operation that is expected to leave one event in the
# stream.
table.update_item(Key={'p': random_string(), 'c': random_string()},
UpdateExpression='SET x = :x', ExpressionAttributeValues={':x': 5})
# Wait for this one update to become available in the stream before we
# disable the stream. Otherwise, theoretically (although unlikely in
# practice) we may disable the stream before the update was saved to it.
timeout = time.time() + 15
found = False
while time.time() < timeout and not found:
for iter in iterators:
response = dynamodbstreams.get_records(ShardIterator=iter)
if 'Records' in response and len(response['Records']) > 0:
found = True
break
time.sleep(0.5)
assert found
# Wait for this one update to become available in the stream before we
# disable the stream. Otherwise, theoretically (although unlikely in
# practice) we may disable the stream before the update was saved to it.
timeout = time.time() + 15
found = False
while time.time() < timeout and not found:
for iter in iterators:
response = dynamodbstreams.get_records(ShardIterator=iter)
if 'Records' in response and len(response['Records']) > 0:
found = True
break
time.sleep(0.5)
assert found
# Disable streaming for this table. Note that the test_table_ss_keys_only
# fixture has "function" scope so it is fine to ruin table, it will not
# be used in other tests.
disable_stream(dynamodbstreams, table)
disable_stream(dynamodbstreams, table)
# Check that the stream ARN which we previously got for the disabled
# stream is still listed by ListStreams
arns = [stream['StreamArn'] for stream in dynamodbstreams.list_streams(TableName=table.name)['Streams']]
assert arn in arns
# Check that the stream ARN which we previously got for the disabled
# stream is still listed by ListStreams
arns = [stream['StreamArn'] for stream in dynamodbstreams.list_streams(TableName=table.name)['Streams']]
assert arn in arns
# DescribeStream on the disabled stream still works and lists its shards.
# All these shards are listed as being closed (i.e., should have
# EndingSequenceNumber). The basic details of the stream (e.g., the view
# type) are available and the status of the stream is DISABLED.
response = dynamodbstreams.describe_stream(StreamArn=arn)['StreamDescription']
assert response['StreamStatus'] == 'DISABLED'
assert response['StreamViewType'] == 'KEYS_ONLY'
assert response['TableName'] == table.name
shards_info = response['Shards']
while 'LastEvaluatedShardId' in response:
response = dynamodbstreams.describe_stream(StreamArn=arn, ExclusiveStartShardId=response['LastEvaluatedShardId'])['StreamDescription']
# DescribeStream on the disabled stream still works and lists its shards.
# All these shards are listed as being closed (i.e., should have
# EndingSequenceNumber). The basic details of the stream (e.g., the view
# type) are available and the status of the stream is DISABLED.
response = dynamodbstreams.describe_stream(StreamArn=arn)['StreamDescription']
assert response['StreamStatus'] == 'DISABLED'
assert response['StreamViewType'] == 'KEYS_ONLY'
assert response['TableName'] == table.name
shards_info.extend(response['Shards'])
print('Number of shards in stream: {}'.format(len(shards_info)))
for shard in shards_info:
assert 'EndingSequenceNumber' in shard['SequenceNumberRange']
assert shard['SequenceNumberRange']['EndingSequenceNumber'].isdecimal()
shards_info = response['Shards']
while 'LastEvaluatedShardId' in response:
response = dynamodbstreams.describe_stream(StreamArn=arn, ExclusiveStartShardId=response['LastEvaluatedShardId'])['StreamDescription']
assert response['StreamStatus'] == 'DISABLED'
assert response['StreamViewType'] == 'KEYS_ONLY'
assert response['TableName'] == table.name
shards_info.extend(response['Shards'])
print('Number of shards in stream: {}'.format(len(shards_info)))
for shard in shards_info:
assert 'EndingSequenceNumber' in shard['SequenceNumberRange']
assert shard['SequenceNumberRange']['EndingSequenceNumber'].isdecimal()
# We can get TRIM_HORIZON iterators for all these shards, to read all
# the old data they still have (this data should be saved for 24 hours
# after the stream was disabled)
iterators = []
for shard in shards_info:
iterators.append(dynamodbstreams.get_shard_iterator(StreamArn=arn,
ShardId=shard['ShardId'], ShardIteratorType='TRIM_HORIZON')['ShardIterator'])
# We can get TRIM_HORIZON iterators for all these shards, to read all
# the old data they still have (this data should be saved for 24 hours
# after the stream was disabled)
iterators = []
for shard in shards_info:
iterators.append(dynamodbstreams.get_shard_iterator(StreamArn=arn,
ShardId=shard['ShardId'], ShardIteratorType='TRIM_HORIZON')['ShardIterator'])
# We can read the one change we did in one of these iterators. The data
# should be available immediately - no need for retries with timeout.
nrecords = 0
for iter in iterators:
response = dynamodbstreams.get_records(ShardIterator=iter)
if 'Records' in response:
nrecords += len(response['Records'])
# The shard is closed, so NextShardIterator should either be missing
# now, indicating that it is a closed shard (DynamoDB does this),
# or, it may (and currently does in Alternator) return an iterator
# and reading from *that* iterator should then tell us that
# we reached the end of the shard (i.e., zero results and
# missing NextShardIterator).
if 'NextShardIterator' in response:
response = dynamodbstreams.get_records(ShardIterator=response['NextShardIterator'])
assert len(response['Records']) == 0
assert not 'NextShardIterator' in response
assert nrecords == 1
# We can read the one change we did in one of these iterators. The data
# should be available immediately - no need for retries with timeout.
nrecords = 0
for iter in iterators:
response = dynamodbstreams.get_records(ShardIterator=iter)
if 'Records' in response:
nrecords += len(response['Records'])
# The shard is closed, so NextShardIterator should either be missing
# now, indicating that it is a closed shard (DynamoDB does this),
# or, it may (and currently does in Alternator) return an iterator
# and reading from *that* iterator should then tell us that
# we reached the end of the shard (i.e., zero results and
# missing NextShardIterator).
if 'NextShardIterator' in response:
response = dynamodbstreams.get_records(ShardIterator=response['NextShardIterator'])
assert len(response['Records']) == 0
assert not 'NextShardIterator' in response
assert nrecords == 1
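Both tests above hand-roll the same retry idiom: set a deadline, poll, sleep briefly, fail on timeout. A generic helper capturing that pattern (the name and defaults here are illustrative, not part of the test suite):

```python
import time

def wait_for(condition, timeout_s=15.0, period_s=0.5):
    """Poll `condition` until it returns a truthy value or the deadline passes.

    Returns the truthy result, or raises TimeoutError on expiry.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s:.1f}s")
        time.sleep(period_s)
```

Using `time.monotonic()` rather than `time.time()` avoids spurious timeouts if the wall clock jumps during the test.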
# When streams are enabled for a table, we get a unique ARN which should be
# unique but not change unless streams are eventually disabled for this table.

@@ -62,7 +62,11 @@ SEASTAR_TEST_CASE(test_index_doesnt_flood_cache_in_small_partition_workload) {
// cfg.db_config->index_cache_fraction.set(1.0);
return do_with_cql_env_thread([] (cql_test_env& e) {
// We disable compactions because they cause confusing cache mispopulations.
e.execute_cql("CREATE TABLE ks.t(pk blob PRIMARY KEY) WITH compaction = { 'class' : 'NullCompactionStrategy' };").get();
// We disable compression because the sstable writer targets a specific
// (*compressed* data file size : summary file size) ratio,
// so the number of keys per index page becomes hard to control,
// and might be arbitrarily large.
e.execute_cql("CREATE TABLE ks.t(pk blob PRIMARY KEY) WITH compaction = { 'class' : 'NullCompactionStrategy' } AND compression = {'sstable_compression': ''};").get();
auto insert_query = e.prepare("INSERT INTO ks.t(pk) VALUES (?)").get();
auto select_query = e.prepare("SELECT * FROM t WHERE pk = ?").get();
@@ -154,7 +158,11 @@ SEASTAR_TEST_CASE(test_index_is_cached_in_big_partition_workload) {
// cfg.db_config->index_cache_fraction.set(0.0);
return do_with_cql_env_thread([] (cql_test_env& e) {
// We disable compactions because they cause confusing cache mispopulations.
e.execute_cql("CREATE TABLE ks.t(pk bigint, ck bigint, v blob, primary key (pk, ck)) WITH compaction = { 'class' : 'NullCompactionStrategy' };").get();
// We disable compression because the sstable writer targets a specific
// (*compressed* data file size : summary file size) ratio,
// so the number of keys per index page becomes hard to control,
// and might be arbitrarily large.
e.execute_cql("CREATE TABLE ks.t(pk bigint, ck bigint, v blob, primary key (pk, ck)) WITH compaction = { 'class' : 'NullCompactionStrategy' } AND compression = {'sstable_compression': ''};").get();
auto insert_query = e.prepare("INSERT INTO ks.t(pk, ck, v) VALUES (?, ?, ?)").get();
auto select_query = e.prepare("SELECT * FROM t WHERE pk = ? AND ck = ?").get();

@@ -1058,6 +1058,30 @@ SEASTAR_TEST_CASE(test_snapshot_ctl_true_snapshots_size) {
});
}
SEASTAR_TEST_CASE(test_snapshot_ctl_details_exception_handling) {
#ifndef SCYLLA_ENABLE_ERROR_INJECTION
testlog.debug("Skipping test as it depends on error injection. Please run in mode where it's enabled (debug,dev).\n");
return make_ready_future();
#endif
return do_with_some_data_in_thread({"cf"}, [] (cql_test_env& e) {
sharded<db::snapshot_ctl> sc;
sc.start(std::ref(e.db()), std::ref(e.get_storage_proxy()), std::ref(e.get_task_manager()), std::ref(e.get_sstorage_manager()), db::snapshot_ctl::config{}).get();
auto stop_sc = deferred_stop(sc);
auto& cf = e.local_db().find_column_family("ks", "cf");
take_snapshot(e).get();
utils::get_local_injector().enable("get_snapshot_details", true);
BOOST_REQUIRE_THROW(cf.get_snapshot_details().get(), std::runtime_error);
utils::get_local_injector().enable("per-snapshot-get_snapshot_details", true);
BOOST_REQUIRE_THROW(cf.get_snapshot_details().get(), std::runtime_error);
auto details = cf.get_snapshot_details().get();
BOOST_REQUIRE_EQUAL(details.size(), 1);
});
}
// toppartitions_query caused a lw_shared_ptr to cross shards when moving results, #5104
SEASTAR_TEST_CASE(toppartitions_cross_shard_schema_ptr) {
return do_with_cql_env_thread([] (cql_test_env& e) {

@@ -23,8 +23,11 @@
#include "test/lib/tmpdir.hh"
#include "test/lib/random_utils.hh"
#include "test/lib/exception_utils.hh"
#include "test/lib/limiting_data_source.hh"
#include "utils/io-wrappers.hh"
#include <seastar/util/memory-data-source.hh>
using namespace encryption;
static tmpdir dir;
@@ -595,6 +598,113 @@ SEASTAR_TEST_CASE(test_encrypted_data_source_simple) {
co_await test_random_data_source(sizes);
}
// Reproduces the production deadlock where encrypted SSTable component downloads
// got stuck during restore. The encrypted_data_source::get() caches a block in
// _next, then on the next call bypasses input_stream::read()'s _eof check and
// calls input_stream::read_exactly() — which does NOT check _eof when _buf is
// empty. This causes a second get() on the underlying source after EOS.
//
// In production the underlying source was chunked_download_source whose get()
// hung forever. Here we simulate it with a strict source that fails the test.
//
// The fix belongs in seastar's input_stream::read_exactly(): check _eof before
// calling _fd.get(), consistent with read(), read_up_to(), and consume().
static future<> test_encrypted_source_copy(size_t plaintext_size) {
testlog.info("test_encrypted_source_copy: plaintext_size={}", plaintext_size);
key_info info{"AES/CBC", 256};
auto k = ::make_shared<symmetric_key>(info);
// Step 1: Encrypt the plaintext into memory buffers
auto plaintext = generate_random<char>(plaintext_size);
std::vector<temporary_buffer<char>> encrypted_bufs;
{
data_sink sink(make_encrypted_sink(create_memory_sink(encrypted_bufs), k));
co_await sink.put(plaintext.clone());
co_await sink.close();
}
// Flatten encrypted buffers into a single contiguous buffer
size_t encrypted_total = 0;
for (const auto& b : encrypted_bufs) {
encrypted_total += b.size();
}
temporary_buffer<char> encrypted(encrypted_total);
size_t pos = 0;
for (const auto& b : encrypted_bufs) {
std::copy(b.begin(), b.end(), encrypted.get_write() + pos);
pos += b.size();
}
// Step 2: Create a data source from the encrypted data that fails on
// post-EOS get() — simulating a source like chunked_download_source
// that would hang forever in this situation.
class strict_memory_source final : public limiting_data_source_impl {
bool _eof = false;
public:
strict_memory_source(temporary_buffer<char> data, size_t chunk_size)
: limiting_data_source_impl(
data_source(std::make_unique<util::temporary_buffer_data_source>(std::move(data))),
chunk_size) {}
future<temporary_buffer<char>> get() override {
BOOST_REQUIRE_MESSAGE(!_eof,
"get() called on source after it already returned EOS — "
"this is the production deadlock: read_exactly() does not "
"check _eof before calling _fd.get()");
auto buf = co_await limiting_data_source_impl::get();
_eof = buf.empty();
co_return buf;
}
};
// Step 3: Wrap in encrypted_data_source and drain via consume() —
// the exact code path used by seastar::copy() which is what
// sstables_loader_helpers::download_sstable() calls.
// Try multiple chunk sizes to hit different alignment scenarios.
for (size_t chunk_size : {1ul, 7ul, 4096ul, 8192ul, encrypted_total, encrypted_total + 1}) {
if (chunk_size == 0) continue;
auto src = data_source(make_encrypted_source(
data_source(std::make_unique<strict_memory_source>(encrypted.clone(), chunk_size)), k));
auto in = input_stream<char>(std::move(src));
// consume() is what seastar::copy() uses internally. It calls
// encrypted_data_source::get() via _fd.get() until EOF.
size_t total_decrypted = 0;
co_await in.consume([&total_decrypted](temporary_buffer<char> buf) {
total_decrypted += buf.size();
return make_ready_future<consumption_result<char>>(continue_consuming{});
});
co_await in.close();
BOOST_REQUIRE_EQUAL(total_decrypted, plaintext_size);
}
}
SEASTAR_TEST_CASE(test_encrypted_source_copy_8k) {
co_await test_encrypted_source_copy(8192);
}
SEASTAR_TEST_CASE(test_encrypted_source_copy_4k) {
co_await test_encrypted_source_copy(4096);
}
SEASTAR_TEST_CASE(test_encrypted_source_copy_small) {
co_await test_encrypted_source_copy(100);
}
SEASTAR_TEST_CASE(test_encrypted_source_copy_12k) {
co_await test_encrypted_source_copy(12288);
}
SEASTAR_TEST_CASE(test_encrypted_source_copy_unaligned) {
co_await test_encrypted_source_copy(8193);
}
SEASTAR_TEST_CASE(test_encrypted_source_copy_1byte) {
co_await test_encrypted_source_copy(1);
}
SEASTAR_TEST_CASE(test_encrypted_data_source_fuzzy) {
std::mt19937_64 rand_gen(std::random_device{}());

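The comment block above pins the deadlock to `input_stream::read_exactly()` calling the underlying source again after it already returned end-of-stream. A Python analogue (an illustration only, not seastar code) of the fixed control flow, with a strict source that asserts if polled after EOS, mirroring the test's `strict_memory_source`:

```python
class StrictSource:
    """Fails if get() is called again after end-of-stream was returned."""
    def __init__(self, chunks):
        self._chunks = list(chunks)
        self._eof = False
    def get(self):
        assert not self._eof, "get() after EOS -- the production deadlock"
        if not self._chunks:
            self._eof = True
            return b''
        return self._chunks.pop(0)

class InputStream:
    def __init__(self, source):
        self._source = source
        self._buf = b''
        self._eof = False
    def read_exactly(self, n):
        out = b''
        while len(out) < n:
            if not self._buf:
                if self._eof:  # the fix: check EOF before asking for more
                    break
                self._buf = self._source.get()
                self._eof = not self._buf
                if self._eof:
                    break
            take = min(n - len(out), len(self._buf))
            out += self._buf[:take]
            self._buf = self._buf[take:]
        return out
```

Without the `if self._eof: break` guard, a short read at EOS followed by another `read_exactly()` would poll the source again, which is exactly the post-EOS `get()` the strict source forbids.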
@@ -1004,7 +1004,20 @@ SEASTAR_TEST_CASE(memtable_flush_compresses_mutations) {
}, db_config);
}
SEASTAR_TEST_CASE(memtable_flush_period) {
static auto check_has_error_injection() {
return boost::unit_test::precondition([](auto){
return
#ifdef SCYLLA_ENABLE_ERROR_INJECTION
true
#else
false
#endif
;
});
}
SEASTAR_TEST_CASE(memtable_flush_period, *check_has_error_injection()) {
#ifdef SCYLLA_ENABLE_ERROR_INJECTION
auto db_config = make_shared<db::config>();
db_config->enable_cache.set(false);
return do_with_cql_env_thread([](cql_test_env& env) {
@@ -1028,6 +1041,9 @@ SEASTAR_TEST_CASE(memtable_flush_period) {
t.apply(m);
BOOST_REQUIRE_EQUAL(t.sstables_count(), 0); // add mutation and check there are no sstables for this table
auto& errj = utils::get_local_injector();
errj.enable("table_seal_post_flush_waiters", true);
// change schema to set memtable flush period
// we use small value in this test but it is impossible to set the period less than 60000ms using ALTER TABLE construction
schema_builder b(t.schema());
@@ -1035,8 +1051,10 @@ SEASTAR_TEST_CASE(memtable_flush_period) {
schema_ptr s2 = b.build();
t.set_schema(s2);
sleep(500ms).get(); // wait until memtable flush starts at least once
BOOST_REQUIRE(t.sstables_count() == 1 || t.get_stats().pending_flushes > 0); // flush started
BOOST_TEST_MESSAGE("Wait for flush");
errj.inject("table_seal_post_flush_waiters", utils::wait_for_message(std::chrono::minutes(2))).get();
BOOST_TEST_MESSAGE("Flush received");
BOOST_REQUIRE(eventually_true([&] { // wait until memtable will be flushed at least once
return t.sstables_count() == 1;
}));
@@ -1047,6 +1065,10 @@ SEASTAR_TEST_CASE(memtable_flush_period) {
.produces(m)
.produces_end_of_stream();
}, db_config);
#else
BOOST_TEST_MESSAGE("Skipping test as it depends on error injection. Please run in mode where it's enabled (debug,dev)");
return make_ready_future<>();
#endif
}
SEASTAR_TEST_CASE(sstable_compaction_does_not_resurrect_data) {

@@ -15,6 +15,7 @@
#include "test/lib/cql_test_env.hh"
#include "test/lib/random_utils.hh"
#include "test/lib/exception_utils.hh"
#include "test/lib/eventually.hh"
#include "db/config.hh"
#include <fmt/ranges.h>
@@ -200,6 +201,10 @@ public:
return _sem;
}
const replica::querier_cache::stats& get_stats() const {
return _cache.get_stats();
}
dht::partition_range make_partition_range(bound begin, bound end) const {
return dht::partition_range::make({_mutations.at(begin.value()).decorated_key(), begin.is_inclusive()},
{_mutations.at(end.value()).decorated_key(), end.is_inclusive()});
@@ -562,24 +567,21 @@ SEASTAR_THREAD_TEST_CASE(test_time_based_cache_eviction) {
const auto entry1 = t.produce_first_page_and_save_data_querier(1);
seastar::sleep(500ms).get();
BOOST_REQUIRE_EQUAL(t.get_stats().time_based_evictions, 0);
const auto entry2 = t.produce_first_page_and_save_data_querier(2);
// Don't waste time retrying before the TTL is up
sleep(1s).get();
seastar::sleep(700ms).get();
eventually_true([&t] {
auto stats = t.get_stats();
return stats.time_based_evictions == 1;
});
t.assert_cache_lookup_data_querier(entry1.key, *t.get_schema(), entry1.expected_range, entry1.expected_slice)
.misses()
.no_drops()
.time_based_evictions();
seastar::sleep(700ms).get();
t.assert_cache_lookup_data_querier(entry2.key, *t.get_schema(), entry2.expected_range, entry2.expected_slice)
.misses()
.no_drops()
.time_based_evictions();
// There should be no inactive reads, the querier_cache should unregister
// the expired queriers.
BOOST_REQUIRE_EQUAL(t.get_semaphore().get_stats().inactive_reads, 0);

@@ -26,6 +26,7 @@
#include <fmt/ranges.h>
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/parallel_for_each.hh>
#include <seastar/testing/on_internal_error.hh>
#undef SEASTAR_TESTING_MAIN
#include <seastar/testing/test_case.hh>
#include <seastar/testing/thread_test_case.hh>
@@ -35,6 +36,13 @@
#include "replica/database.hh" // new_reader_base_cost is there :(
#include "db/config.hh"
// Provides access to private members of reader_concurrency_semaphore for testing.
struct reader_concurrency_semaphore_tester {
static void signal(reader_concurrency_semaphore& sem, reader_resources r) {
sem.signal(r);
}
};
BOOST_AUTO_TEST_SUITE(reader_concurrency_semaphore_test)
SEASTAR_THREAD_TEST_CASE(test_reader_concurrency_semaphore_clear_inactive_reads) {
@@ -2595,4 +2603,35 @@ SEASTAR_THREAD_TEST_CASE(test_reader_concurrency_semaphore_preemptive_abort_requ
permit2 = {};
}
// Verify that signal() detects and corrects a negative resource leak.
// When a bug causes available resources to exceed initial resources
// after signal(), the semaphore should report the negative leak via
// on_internal_error_noexcept and clamp _resources back to _initial_resources
// so that consumed_resources() never goes negative.
SEASTAR_THREAD_TEST_CASE(test_reader_concurrency_semaphore_signal_detects_negative_resource_leak) {
const auto initial = reader_resources{2, 2048};
reader_concurrency_semaphore semaphore(reader_concurrency_semaphore::for_tests{}, get_name(), initial.count, initial.memory);
auto stop_sem = deferred_stop(semaphore);
BOOST_REQUIRE_EQUAL(semaphore.available_resources(), initial);
BOOST_REQUIRE_EQUAL(semaphore.consumed_resources(), reader_resources{});
// Simulate a negative leak: signal more resources than were ever consumed.
// This would happen if a bug double-returned resources or inflated
// the amount returned to signal().
// signal() calls on_internal_error_noexcept which would abort in
// test mode, so temporarily disable that.
const auto leaked = reader_resources{1, 512};
{
seastar::testing::scoped_no_abort_on_internal_error no_abort;
reader_concurrency_semaphore_tester::signal(semaphore, leaked);
}
// signal() should have detected the over-return and clamped
// available resources back to initial.
BOOST_REQUIRE_EQUAL(semaphore.available_resources(), initial);
BOOST_REQUIRE_EQUAL(semaphore.consumed_resources(), reader_resources{});
}
BOOST_AUTO_TEST_SUITE_END()
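A compact Python sketch (names are illustrative, not the semaphore's API) of the clamping behavior the test above verifies: a `signal()` that would push available resources past the initial budget reports the leak and clamps back to the initial resources, so consumed resources never go negative:

```python
class LeakClampSemaphore:
    """Tracks (count, memory) resources; over-returns are reported and clamped."""
    def __init__(self, count, memory):
        self._initial = (count, memory)
        self._available = [count, memory]
        self.leaks_reported = 0
    def consume(self, count, memory):
        self._available[0] -= count
        self._available[1] -= memory
    def signal(self, count, memory):
        self._available[0] += count
        self._available[1] += memory
        # A buggy caller double-returned resources: report the negative
        # leak and clamp back so consumed() can never go negative.
        if (self._available[0] > self._initial[0]
                or self._available[1] > self._initial[1]):
            self.leaks_reported += 1
            self._available = list(self._initial)
    def available(self):
        return tuple(self._available)
```

In the real semaphore the report goes through `on_internal_error_noexcept`, which is why the test wraps the call in `scoped_no_abort_on_internal_error`.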

@@ -982,21 +982,29 @@ BOOST_AUTO_TEST_CASE(s3_fqn_manipulation) {
}
BOOST_AUTO_TEST_CASE(part_size_calculation_test) {
{
BOOST_REQUIRE_EXCEPTION(s3::calc_part_size(490_GiB, 5_MiB), std::runtime_error, [](const std::runtime_error& e) {
return std::string(e.what()).starts_with("too many parts: 100352 > 10000");
});
}
BOOST_REQUIRE_EXCEPTION(s3::calc_part_size(490_GiB, s3::minimum_part_size), std::runtime_error, [](const std::runtime_error& e) {
return std::string(e.what()).starts_with(format("too many parts: 100352 > {}", s3::maximum_parts_in_piece));
});
BOOST_REQUIRE_EXCEPTION(s3::calc_part_size(490_GiB, 4_MiB), std::runtime_error, [](const std::runtime_error& e) {
return std::string(e.what()).starts_with(format("part_size too small: 4194304 is smaller than minimum part size: {}", s3::minimum_part_size));
});
BOOST_REQUIRE_EXCEPTION(s3::calc_part_size(s3::maximum_object_size + 1, 0), std::runtime_error, [](const std::runtime_error& e) {
return std::string(e.what()).starts_with(
format("object size too large: {} is larger than maximum S3 object size: {}", s3::maximum_object_size + 1, s3::maximum_object_size));
});
BOOST_REQUIRE_EXCEPTION(s3::calc_part_size(1_TiB, s3::maximum_part_size + 1), std::runtime_error, [](const std::runtime_error& e) {
return std::string(e.what()).starts_with(
format("part_size too large: {} is larger than maximum part size: {}", s3::maximum_part_size + 1, s3::maximum_part_size));
});
size_t total_size = s3::minimum_part_size * (s3::maximum_parts_in_piece + 1); // 10001 parts at 5 MiB
BOOST_REQUIRE_EXCEPTION(s3::calc_part_size(total_size, s3::minimum_part_size), std::runtime_error, [](auto& e) {
return std::string(e.what()).starts_with(format("too many parts: 10001 > {}", s3::maximum_parts_in_piece));
});
{
auto [parts, size] = s3::calc_part_size(490_GiB, 100_MiB);
BOOST_REQUIRE_EQUAL(size, 100_MiB);
BOOST_REQUIRE(parts == 5018);
}
{
BOOST_REQUIRE_EXCEPTION(s3::calc_part_size(490_GiB, 4_MiB), std::runtime_error, [](const std::runtime_error& e) {
return std::string(e.what()).starts_with("part_size too small: 4194304 is smaller than minimum part size: 5242880");
});
}
{
auto [parts, size] = s3::calc_part_size(50_MiB, 0);
BOOST_REQUIRE_EQUAL(size, 50_MiB);
@@ -1013,24 +1021,14 @@ BOOST_AUTO_TEST_CASE(part_size_calculation_test) {
BOOST_REQUIRE(parts == 9839);
}
{
auto [parts, size] = s3::calc_part_size(50_MiB * 10000, 0);
auto [parts, size] = s3::calc_part_size(50_MiB * s3::maximum_parts_in_piece, 0);
BOOST_REQUIRE_EQUAL(size, 50_MiB);
BOOST_REQUIRE_EQUAL(parts, 10000);
BOOST_REQUIRE_EQUAL(parts, s3::maximum_parts_in_piece);
}
{
auto [parts, size] = s3::calc_part_size(50_MiB * 10000 + 1, 0);
auto [parts, size] = s3::calc_part_size(50_MiB * s3::maximum_parts_in_piece + 1, 0);
BOOST_REQUIRE(size > 50_MiB);
BOOST_REQUIRE(parts <= 10000);
}
{
BOOST_REQUIRE_EXCEPTION(s3::calc_part_size(50_TiB, 0), std::runtime_error, [](const std::runtime_error& e) {
return std::string(e.what()).starts_with("object size too large: 54975581388800 is larger than maximum S3 object size: 53687091200000");
});
}
{
BOOST_REQUIRE_EXCEPTION(s3::calc_part_size(1_TiB, 5_GiB + 1), std::runtime_error, [](const std::runtime_error& e) {
return std::string(e.what()).starts_with("part_size too large: 5368709121 is larger than maximum part size: 5368709120");
});
BOOST_REQUIRE(parts <= s3::maximum_parts_in_piece);
}
{
auto [parts, size] = s3::calc_part_size(5_TiB, 0);
@@ -1038,21 +1036,16 @@ BOOST_AUTO_TEST_CASE(part_size_calculation_test) {
BOOST_REQUIRE_EQUAL(size, 525_MiB);
}
{
auto [parts, size] = s3::calc_part_size(5_MiB * 10000, 5_MiB);
BOOST_REQUIRE_EQUAL(size, 5_MiB);
BOOST_REQUIRE_EQUAL(parts, 10000);
}
{
size_t total = 5_MiB * 10001; // 10001 parts at 5 MiB
BOOST_REQUIRE_EXCEPTION(
s3::calc_part_size(total, 5_MiB), std::runtime_error, [](auto& e) { return std::string(e.what()).starts_with("too many parts: 10001 > 10000"); });
auto [parts, size] = s3::calc_part_size(s3::minimum_part_size * s3::maximum_parts_in_piece, s3::minimum_part_size);
BOOST_REQUIRE_EQUAL(size, s3::minimum_part_size);
BOOST_REQUIRE_EQUAL(parts, s3::maximum_parts_in_piece);
}
{
size_t total = 500_GiB + 123; // odd size to force non-MiB alignment
auto [parts, size] = s3::calc_part_size(total, 0);
BOOST_REQUIRE(size % 1_MiB == 0); // aligned
BOOST_REQUIRE(parts <= 10000);
BOOST_REQUIRE(parts <= s3::maximum_parts_in_piece);
}
{
auto [parts, size] = s3::calc_part_size(6_MiB, 0);

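The assertions above pin down the limits `s3::calc_part_size` enforces: a 5 MiB minimum part, a 5 GiB maximum part, at most 10000 parts, and a maximum object of 5 GiB x 10000. A hedged Python sketch of just the validation rules for an explicitly chosen part size (the `part_size == 0` auto-sizing branch is deliberately omitted, since its exact rounding policy is not fully visible in this diff):

```python
MiB = 1 << 20
GiB = 1 << 30
MIN_PART = 5 * MiB            # s3::minimum_part_size
MAX_PART = 5 * GiB            # s3::maximum_part_size
MAX_PARTS = 10_000            # s3::maximum_parts_in_piece
MAX_OBJECT = MAX_PART * MAX_PARTS  # 53687091200000 bytes

def calc_part_size(total_size, part_size):
    """Validate an explicit part size against the S3 multipart limits."""
    if total_size > MAX_OBJECT:
        raise RuntimeError(f"object size too large: {total_size} is larger "
                           f"than maximum S3 object size: {MAX_OBJECT}")
    if part_size < MIN_PART:
        raise RuntimeError(f"part_size too small: {part_size} is smaller "
                           f"than minimum part size: {MIN_PART}")
    if part_size > MAX_PART:
        raise RuntimeError(f"part_size too large: {part_size} is larger "
                           f"than maximum part size: {MAX_PART}")
    parts = -(-total_size // part_size)  # ceiling division
    if parts > MAX_PARTS:
        raise RuntimeError(f"too many parts: {parts} > {MAX_PARTS}")
    return parts, part_size
```

For example, 490 GiB at 100 MiB parts gives ceil(501760 / 100) = 5018 parts, matching the expectation in the test, while 490 GiB at the 5 MiB minimum would need 100352 parts and is rejected.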
@@ -676,7 +676,7 @@ SEASTAR_TEST_CASE(test_system_schema_version_is_stable) {
// If you changed the schema of system.batchlog then this is expected to fail.
// Just replace expected version with the new version.
BOOST_REQUIRE_EQUAL(s->version(), table_schema_version(utils::UUID("1f504ac7-350f-37aa-8a9e-105b1325d8e3")));
BOOST_REQUIRE_EQUAL(s->version(), table_schema_version(utils::UUID("c3f984e4-f886-3616-bb80-f8c68ed93595")));
});
}

Some files were not shown because too many files have changed in this diff Show More