Compare commits

...

388 Commits

Author SHA1 Message Date
copilot-swe-agent[bot]
8e956fda27 Fix code style issues: remove trailing whitespace
Co-authored-by: tgrabiec <283695+tgrabiec@users.noreply.github.com>
2025-12-29 13:06:09 +00:00
copilot-swe-agent[bot]
d65c5926d5 Handle tablet merge finalization during node draining
When nodes are draining and all tablets have been drained, check if there
are pending merge finalizations. If so, transition to tablet_resize_finalization
to handle them before continuing with decommission. After finalization completes,
check if there are still nodes being drained and transition back to tablet_draining
if needed. This prevents merge finalization from being delayed until after
decommission completes, avoiding performance degradation from excessive per-shard
tablet counts during scale-in operations.

Co-authored-by: tgrabiec <283695+tgrabiec@users.noreply.github.com>
2025-12-29 13:03:59 +00:00
copilot-swe-agent[bot]
1390736e81 Handle tablet merge finalization inline during node draining
When nodes are draining and there are pending merge finalizations, process
the finalizations inline as part of the tablet migration track rather than
delaying them until after decommission completes. This prevents performance
degradation from excessive per-shard tablet counts during scale-in operations.

Co-authored-by: tgrabiec <283695+tgrabiec@users.noreply.github.com>
2025-12-29 12:57:48 +00:00
copilot-swe-agent[bot]
624331018d Fix tablet merge finalization delay during decommission
Remove the condition that allows tablet migrations during node draining
when there are pending resize finalizations. This ensures that merge
finalization is not delayed during decommission, which is important
for performance when the per-shard tablet count limit is exceeded.

Co-authored-by: tgrabiec <283695+tgrabiec@users.noreply.github.com>
2025-12-29 11:57:17 +00:00
copilot-swe-agent[bot]
3b4ff94437 Initial plan 2025-12-29 11:49:40 +00:00
Pavel Emelyanov
2e33234e91 util: Remove lister::rmdir()
There's seastar helper that does the same, no need to carry yet another
implementation

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27851
2025-12-28 19:46:19 +02:00
Avi Kivity
63e3a22f2e Merge 'group0_state_machine: don't update in-memory state machine until start' from Piotr Dulikowski
Group0 commands consist of one or more mutations and are supposed to be
atomic - i.e. the data structures that reflect the group0 tables state
are not supposed to be updated while only some mutations of a command
are applied, the logic responsible for that is not supposed to observe
an inconsistent state of group0 tables.

It turns out that this assumption can be broken if a node crashes in the
middle of applying a multi-mutation group0 command. Because these
mutations are, in general, applied separately, only some mutations might
survive a crash and a restart, so the group0 tables might be in an
inconsistent state. The current logic of group0_state_machine will
attempt to read the group0 tables' state as it was left after restart,
so it may observe inconsistent state.

This can confuse the node as it may observe a state that it was not
supposed to observe, or the state will just outright break some
invariants and trigger some sanity checks. One of those was observed in
https://github.com/scylladb/scylladb/issues/26945, where a command from the CDC generation
publisher fiber was partially applied. The fiber, in addition to
publishing generations, it removes old, expired generations as well.
Removal is done by removing data that describes the generation from
cdc_generations_v3 and by removing the generation's ID from the
committed generation list in the topology table. If only the first
mutation gets through but not the other one, on reload the node will see
a committed CDC generation without data, which will trigger an
on_internal_error check.

Fix this by delaying the moment when the in memory data structures are
first loaded. In 579dcf187a, a mechanism was introduced which persists the
commit index before applying commands that are considered committed.
Starting a raft server waits until commands are replayed up to that
point. The fix is to start the group0_state_machine in a mode which only
applies mutations - the aforementioned mechanism will re-apply the
commands which will, thanks to the mutation idempotency, bring the
group0 to a consistent state. After the group0 is known to be in
consistent state (so, after raft::server_impl::start) the in-memory data
structures of group0 are loaded for the first time.

There is an exception, however: schema tables. Information about schema
is actually loaded into memory earlier than the moment when group0 is
started. Applying changes to schema is done through the migration
manager module which compares the persisted state before and after the
schema mutations are applied and acts on that. Refactoring migration
manager is out of scope of this PR. However, this is not a problem
because the migration manager takes care to apply all of the mutations
given in a command in a single commitlog segment, so the initial schema
loading code should not see an inconsistent state due to the state being
partially applied.

The fix is accompanied by a reproducer of scylladb/scylladb#26945.

Fixes: scylladb/scylladb#26945

This is not a regression, so no need to backport.

Closes scylladb/scylladb#27528

* github.com:scylladb/scylladb:
  test: cluster: test for recovery after partial group0 command
  group0_state_machine: remove obsolete comment about group0 consistency
  group0_state_machine: don't update in-memory state machine until start
  group0_state_machine: move reloading out of std::visit
  service: raft: add state machine ref to raft_server_for_group
2025-12-28 13:59:26 +02:00
Pavel Emelyanov
e963a8d603 checked-file: Implement experimental_list_directory()
The method in question returns coroutine generator that co_yields
directory_entry-s. In case the method is not implemented, seastar
creates a fallback generator, that calls existing subscription-based
list_directory() and co_yields them. And since checked file doesn't yet
have it, fallback generator is used, thus skipping the lower file
yielding lister. Not nice.

This patch implements the generator lister for checked file, thus making
full use of lower file generator lister too.

A side note. It's not enough to implement it like

    return do_io_check([] {
        return lower_file->experimental_list_directory();
    });

like list_directory() does, since io-checking will _not_ happen on
directory reading itself, as it's supposed to.

This is the problem of the check_file::list_directory() implementation
-- it only checks for exception when creating the subscription (and it
really never happens), but reading the directory itself happens without
io checks.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27850
2025-12-28 13:37:44 +02:00
Yaron Kaikov
1ee89c9682 Revert "scripts: benign fixes flagged by CodeQL/PyLens"
This reverts commit 377c3ac072.

This breaks all artifact tests and cloud image build process

Closes scylladb/scylladb#27881
2025-12-28 09:49:49 +02:00
Pavel Emelyanov
bda1709734 Merge 'test: fix infinite loop in python log browsing code triggered from test_orphaned_sstables_on_startup' from Avi Kivity
Recently, test/cluster/test_tablet.py::test_orphaned_sstables_on_startup started
spinning in the log browsing code, part of a the test library that looks into log files
for expected or unexpected patterns. This reproduced somewhat in continuous
integration, and very reliably for me locally.

The test was introduced in fa10b0b390, a year ago.

There are two bugs involved: first, that we're looking for crashes in this test,
since in fact it is expected to crash. The node expectedly fails with an
on_internal_error. Second, the log browsing code contains an infinite loop
if the crash backtrace happens to be the last thing in the log. The series
fixes both bugs.

Fixes #27860.

While the bad code exists in release branches, it doesn't trigger there so far, so best
to only backport it if it starts manifesting there.

Closes scylladb/scylladb#27879

* github.com:scylladb/scylladb:
  test: pylib: log_browsing: fix infinite loop in find_backtraces()
  test: pylib/log_browsing, cluster/test_tablets: don't look for expected crashes
2025-12-26 10:45:56 +03:00
Nadav Har'El
9c50d29a00 test/boost: fix flaky test_inject_future_disabled
The test boost/error_injection_test.cc::test_inject_future_disabled
checks what happens when a sleep injection is *disabled*: The test
has a 10-millisecond-sleep injection and measures how much it takes.
The test expects it to take less than 10 milliseconds - in fact it
should take almost zero. But this is not guaranteed - on a slow debug
build and an overcommitted server this do-nothing injection can take
some time, and in one run (#27798) it took 14 milliseconds - and the
test failed.

The solution is easy - make the sleep-that-doesn't-happen much longer -
e.g., 10 whole seconds. Since this sleep still doesn't happen, we
expect the injection to return in less - much less - than 10 seconds.
This 10 seconds is so ridiculously high we don't expect the do-nothing
injection to take 10 seconds, not even a ridiculously busy test machine.

Fixes #27798

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27874
2025-12-25 20:46:31 +02:00
Avi Kivity
92996ce9fa test: pylib: log_browsing: fix infinite loop in find_backtraces()
The find_backtraces() function uses a very convoluted loop to
read the log file. The loop fails to terminate if the last thing
in the log file is the backtrace, since the loop termination condition
(`not line`) continues to be true.

It's not clear why this did not reliably hit before, but it now
reliably reproduces for me on both x86 and aarch64. Perhaps timing
changed, or perhaps previously we had more text on the log.
2025-12-25 20:22:17 +02:00
Avi Kivity
50a3460441 test: pylib/log_browsing, cluster/test_tablets: don't look for expected crashes
test_tablets.test_orphaned_sstables_on_startup verifies that an
on_internal_error("Unable to load SSTable...") is generated when
an sstable outside a tablet boundary is found on startup.

The test indeed finds the error, but then proceeds to hang in
find_backtraces(), or fail if find_backtraces() is fixed, since
it finds an unexpected (for it) crash.

Fix this by not looking for crashes if a new option expected_crash
is set. Set it for this test.
2025-12-25 20:22:17 +02:00
Avi Kivity
55c7bc746e Revert "vector_search_validator: move high availability tests from vector-store.git"
This reverts commit caa0cbe328. It is
either extremely slow or broken. I was never able to get it to
run on an r8gd.8xlarge (on the NVMe disk). Even when it passes,
it is very slow.

Test script:

```

git submodule update --recursive || exit 125

rm -rf build

d() { ./tools/toolchain/dbuild -it -- "$@"; }

d ./configure.py --mode release || exit 125
d ninja release-build || exit 125
d ./test.py --mode release
```

Ref #27858
Ref #27859
Ref #27860
2025-12-25 12:30:22 +00:00
Botond Dénes
ebb101f8ae scylla-gdb.py: scylla small-objects: make freelist traversal more robust
Traversing the span's freelist is known to generate "Cannot access
memory at address ..." errors, which is especially annoying when it
results in failed CI. Make this loop more robust: catch gdb.error coming
from it and just log a warning that some listed objects in the span may
be free ones.

Fixes: #27681

Closes scylladb/scylladb#27805
2025-12-25 13:26:09 +03:00
Alex
f769e52877 test: boost: Fix flaky test_large_file_upload_s3 by creating induvidual files for testing During CI runs, multiple instances of the same test may execute concurrently. Although the test uses a temporary directory, the downloaded bucket artifacts were written using an identical filename across all instances.
This caused concurrent writers to operate on the same file, leading to file corruption. In some cases, this manifested as test failures and intermittent std::bad_alloc exceptions.

Change Description

This change ensures that each test instance uses a unique filename for downloaded bucket files.
By isolating file writes per test execution, concurrent runs no longer interfere with each other.

Fixes: #27824

backport not required

Closes scylladb/scylladb#27843
2025-12-25 09:40:13 +02:00
Nadav Har'El
186c91233b Merge 'scylla-gdb.py: improve scylla fiber and scylla read-stats' from Botond Dénes
Improve scylla fiber's ability to traverse through coroutines.
Add --direction command-line parameter to scylla-fiber.
Fix out-of-date premit collection in scylla read-stat and improve the printout.

scylla-gdb.py improvements, no backport needed

Closes scylladb/scylladb#27766

* github.com:scylladb/scylladb:
  scylla-gdb.py: scylla read-stats: include all permit lists
  scylla-gdb.py: scylla fiber: add --direction command-line param
  scylla-gdb.py: scylla fiber: add support for traversing through coroutines backward
2025-12-24 17:49:58 +02:00
Botond Dénes
27bf65e77a db/batchlog_manager: add missing <seastar/coroutine/parallel_for_each.hh> include
Build only fails if `--disable-precompiled-header` is passed to
`configure.py`. Not sure why.

Closes scylladb/scylladb#27721
2025-12-24 16:32:12 +02:00
Botond Dénes
c66275e05c cql3/statements/batch_statement: make size error message more verbose
Mention the type of batch: Logged or Unlogged. The size (warn/fail on
too large size) error has different significance depending on the type.

Refs: #27605

Closes scylladb/scylladb#27664
2025-12-24 15:27:01 +02:00
Piotr Szymaniak
9c5b4e74c3 doc: Correct reference in dev/audit.md
Closes scylladb/scylladb#27832
2025-12-24 15:25:15 +02:00
Botond Dénes
ccc03d0026 test/pylib/runner.py: pytest_configure(): coerce repeat to int
Coerce the return value of config.getoption("--repeat") to int to avoid:

    Traceback (most recent call last):
      File "/usr/bin/pytest", line 8, in <module>
        sys.exit(console_main())
                 ~~~~~~~~~~~~^^
      File "/usr/lib/python3.14/site-packages/_pytest/config/__init__.py", line 201, in console_main
        code = main()
      File "/usr/lib/python3.14/site-packages/_pytest/config/__init__.py", line 175, in main
        ret: ExitCode | int = config.hook.pytest_cmdline_main(config=config)
                              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
      File "/usr/lib/python3.14/site-packages/pluggy/_hooks.py", line 512, in __call__
        return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
               ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/lib/python3.14/site-packages/pluggy/_manager.py", line 120, in _hookexec
        return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
               ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/lib/python3.14/site-packages/pluggy/_callers.py", line 167, in _multicall
        raise exception
      File "/usr/lib/python3.14/site-packages/pluggy/_callers.py", line 121, in _multicall
        res = hook_impl.function(*args)
      File "/usr/lib/python3.14/site-packages/_pytest/helpconfig.py", line 154, in pytest_cmdline_main
        config._do_configure()
        ~~~~~~~~~~~~~~~~~~~~^^
      File "/usr/lib/python3.14/site-packages/_pytest/config/__init__.py", line 1118, in _do_configure
        self.hook.pytest_configure.call_historic(kwargs=dict(config=self))
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/lib/python3.14/site-packages/pluggy/_hooks.py", line 534, in call_historic
        res = self._hookexec(self.name, self._hookimpls.copy(), kwargs, False)
      File "/usr/lib/python3.14/site-packages/pluggy/_manager.py", line 120, in _hookexec
          return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
               ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/lib/python3.14/site-packages/pluggy/_callers.py", line 167, in _multicall
        raise exception
      File "/usr/lib/python3.14/site-packages/pluggy/_callers.py", line 121, in _multicall
        res = hook_impl.function(*args)
      File "/home/bdenes/ScyllaDB/scylladb/scylladb/test/pylib/runner.py", line 206, in pytest_configure
        config.run_ids = tuple(range(1, config.getoption("--repeat") + 1))
                                        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~
    TypeError: can only concatenate str (not "int") to str

Closes scylladb/scylladb#27649
2025-12-24 15:13:02 +02:00
Nadav Har'El
8df5189f9c Merge 'docs: scylla-sstable.rst: extract script API to separate document' from Botond Dénes
The script API is 500+ lines long in an already too long and hard to navigate document. Extract it to a separate document, making both documents shorter and easier to navigate.

Documentation refactoring, no backport needed.

Closes scylladb/scylladb#27609

* github.com:scylladb/scylladb:
  docs: scylla-sstable-script-api.rst: add introduction and title
  docs: scylla-sstable.rst: extract script API to separate document
  docs: scylla-sstable: prepare for script API extract
2025-12-24 15:02:57 +02:00
Botond Dénes
b036a461b7 tools/scylla-sstable: dump-schema: incude UDT description in dump
If the table uses UDTs, include the description of these (CREATE TYPE
statement) in the schema dump. Without these the schema is not useful.

Closes scylladb/scylladb#27559
2025-12-24 14:46:52 +02:00
Botond Dénes
3071ccd54a Merge 'Storage-agnostic table::snapshot_on_all_shards()' from Pavel Emelyanov
The method in question knows that it writes snapshot to local filesystem and uses this actively. This PR relaxes this knowledge and splits the logic into two parts -- one that orchestrates sstables snapshot and collects the necessary metadata, and the code that writes the metadata itself.

Closes scylladb/scylladb#27762

* github.com:scylladb/scylladb:
  table: Move snapshot_file_set to table.cc
  table: Rename and move snapshot_on_all_shards() method
  table: Ditch jsondir variable
  table, sstables: Pass snapshot name to sstable::snapshot()
  table: Use snapshot_writer in write_manifest()
  table: Use snapshot_writer in write_schema_as_cql()
  table: Add snapshot_writer::sync()
  table: Add snapshot_writer::init()
  table: Introduce snapshot_writer
  table: Move final sync and rename seal_snapshot()
  table: Hide write_schema_as_cql()
  table: Hide table::seal_snapshot()
  table: Open-code finalize_snapshot()
  table: Fix indentation after previuous patch
  table: Use smp::invoke_on_all() to populate the vector with filenames
  table: Don't touch dir once more on seal_snapshot()
  table: Open-code table::take_snapshot() into caller lambda
  table: Move parts of table::take_snapshot to sstables_manager
  table: Introduce table::take_snapshot()
  table: Store the result of smp::submit_to in local variable
2025-12-24 13:46:47 +02:00
Nadav Har'El
4ae45eb367 test/alternator: remove unused imports
Remove many unused "import" statements or parts of import statement.
All of them were detected by Copilot, but I verified each one manually
and prepared this patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27676
2025-12-24 13:44:28 +02:00
Nadav Har'El
da00401b7d test/alternator: rename test with duplicate name
The file test/alternator/test_transact.py accidentally had two tests
with the same name, test_transact_get_items_projection_expression.
This means the first of the two tests was ignored and never run.

This patch renames the second of the two to a more appropriate
(and unique...) name.

I verified that after this change the number of tests in this file
grows by one, and that still all tests pass on DynamoDB and fail
(as expected by xfail) on Alternator.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27702
2025-12-24 13:43:43 +02:00
Botond Dénes
95d4c73eb1 Merge 'Make object storage config truly updateable' from Pavel Emelyanov
The db::config::object_storage_endpoints parameter is live-updateable, but when the update really happens, the new endpoints may fail to propagate to non-zero shards because of the way db::config sharding is implemented.

Refs: #7316
Fixes: #26509

Backport to 2025.3 and 2025.4, AFAIK there are set ups with object storage configs for native backup

Closes scylladb/scylladb#27689

* github.com:scylladb/scylladb:
  sstables/storage_manager: Fix configured endpoints observer
  test/object_store: Add test to validate how endpoint config update works
2025-12-24 13:42:44 +02:00
Botond Dénes
12dcf79c60 Merge 'build: support (and prefer) sccache as the compiler cache' from Avi Kivity
Currently, we support ccache as the compiler cache. Since it is transparent, nothing
much is needed to support it.

This series adds support to sccache[1] and prefers it over ccache when it is installed.

sccache brings the following benefits over ccache:
1. Integrated distributed build support similar to distcc, but with automatic toolchain packaging and a scheduler
2. Rust support
3. C++20 modules (upcoming[2])

It is the C++20 modules support that motivates the series. C++20 modules have the potential to reduce
build times, but without a compiler cache and distributed build support, they come with too large
a penalty. This removes the penalty.

The series detects that sccache is installed, selects it if so (and if not overridden
by a new option), enables it for C++ and Rust, and disables ccache transparent
caching if sccache is selected.

Note: this series doesn't add sccache to the frozen toolchain or add dbuild support. That
is left for later.

[1] https://github.com/mozilla/sccache
[2] https://github.com/mozilla/sccache/pull/2516

Toolchain improvement, won't be backported.

Closes scylladb/scylladb#27834

* github.com:scylladb/scylladb:
  build: apply sccache to rust builds too
  build: prevent double caching by compiler cache
  build: allow selecting compiler cache, including sccache
2025-12-24 13:40:02 +02:00
Nadav Har'El
74a57d2872 test/cqlpy: remove unused imports
Remove many unused "import" statements or parts of import statement.
All of them were detected by Copilot, but I verified each one manually
and prepared this patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27675
2025-12-24 13:31:41 +02:00
Andrzej Jackowski
632ff66897 doc: audit: mention double audit sink in Enabling Audit section
Configuration of both table and syslog audit is possible since
scylladb/scylladb#26613 was implemented. However, the "Enabling Audit"
section of the documentation wasn't updated, which can be misleading.

Ref: scylladb/scylladb#26613

Closes scylladb/scylladb#27790
2025-12-24 13:20:03 +02:00
Gleb Natapov
04976875cc topology coordinator: set session id for streaming at the correct time
Commit d3efb3ab6f added streaming session for rebuild, but it set
the session and request submission time. The session should be set when
request starts the execution, so this patch moved it to the correct
place.

Closes scylladb/scylladb#27757
2025-12-24 13:17:53 +02:00
Yaniv Michael Kaul
377c3ac072 scripts: benign fixes flagged by CodeQL/PyLens
Unused imports, unused variables and such.
No functional changes, just to get rid of some standard CodeQL warnings.

Benign - no need to backport.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#27801
2025-12-24 13:08:24 +02:00
Avi Kivity
d6edad4117 test: pylib: resource_gather: don't take ownership of /sys/fs/cgroup under podman
Under podman, we already own /sys/fs/cgroup. Run the chown command only
under docker where the container does not map the host user to the
container root user.

The chown process is sometimes observed to fail with EPERM (see issue).
But it's not needed, so avoid it.

Fixes #27837.

Closes scylladb/scylladb#27842
2025-12-24 10:56:24 +02:00
Marcin Maliszkiewicz
3c1e1f867d raft: auth: add semaphore to auth_cache::load_all
Auth cache loading at startup is racing between
auth service and raft code and it doesn't support
concurrency causing it to crash.

We can't easily remove any of the places as during
raft recovery snapshot is not loaded and it relies
on loading cache via auth service. Therefore we add
semaphore.

Fixes https://github.com/scylladb/scylladb/issues/27540

Closes scylladb/scylladb#27573
2025-12-24 10:56:24 +02:00
Nadav Har'El
f3a4af199f test/cqlpy/test_materialized_view.py: Fix for Commented-out code
This patch was suggested and prepared by copilot, I am writing the commit
message because the original one was worthless.

In commit cf138da, for an an unexplained reason, a loop waiting until the
expected value appears in a materialized view was replaced by a call for
wait_for_view_built(). The old loop code was left behind in a comment,
and this commented-out code is now bothering our AI. So let's delete the
commented-out code.

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27646
2025-12-24 10:56:23 +02:00
Botond Dénes
1bb897c7ca Merge 'Unify configuration of object storage endpoints' from Pavel Emelyanov
To configure S3 storage, one needs to do

```
object_storage_endpoints:
  - name: s3.us-east-1.amazonaws.com
    port: 443
    https: true
    aws_region: us-east-1
```

and for GCS it's

```
object_storage_endpoints:
  - name: https://storage.googleapis.com:433
    type: gs
    credentials_file: <gcp account credentials json file>
```

This PR updates the S3 part to look like

```
object_storage_endpoints:
  - name: https://s3.us-east-1.amazonaws.com:443
    aws_region: us-east-1
```

fixes: #26570

Not-yet released feature, no need to backport. Old configs are not accepted any longer. If it's needed, then this decision needs to be revised.

Closes scylladb/scylladb#27360

* github.com:scylladb/scylladb:
  object_storage: Temporarily handle pure endpoint addresses as endpoints
  code: Remove dangling mentions of s3::endpoint_config
  docs: Update docs according to new endpoints config option format
  object_storage: Create s3 client with "extended" endpoint name
  test: Add named constants for test_get_object_store_endpoints endpoint names
  s3/storage: Tune config updating
  sstable: Shuffle args for s3_client_wrapper
2025-12-24 06:59:02 +02:00
Botond Dénes
954f2cbd2f Merge 'config, transport: add listeners for native protocol fronted by proxy protocol v2' from Avi Kivity
For deployments fronted by a reverse proxy (haproxy or privatelink), we want to
use proxy protocol v2 so that client information in system.clients is correct and so
that the shard-aware selection protocol, which depends on the source port, works
correctly. Add proxy-protocol enabled variants of each of the existing native transport
listeners.

Tests are added to verify this works. I also manually tested with haproxy.

New feature, no backport.

Closes scylladb/scylladb#27522

* github.com:scylladb/scylladb:
  test: add proxy protocol tests
  config, transport: support proxy protocol v2 enhanced connections
2025-12-24 06:58:00 +02:00
Nadav Har'El
e75c75f8cd test/cqlpy: fix two tests that couldn't fail because of typo
As noticed by copilot, two tests in test_guardrail_compact_storage.py
could never fail, because they used `pytest.fail` instead of the
correct `pytest.fail()` to fail. Unfortunately, Python has a footgun
where if it sees a bare function name without parenthesis, instead of
complaining it evaluates the function object and then ignores it,
and absolutely nothing happens.

So let's add the missing `()`. The test still passes, but now it at
least has a chance of failing if we have a regression.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27658
2025-12-24 06:49:54 +02:00
Yaron Kaikov
d671ca9f53 fix: remove return from finally block in s3_proxy.py
during any jenkins job that trigger `test.py` we get:
```
/jenkins/workspace/releng-testing/byo/byo_build_tests_dtest/scylla/test/pylib/s3_proxy.py:152: SyntaxWarning: 'return' in a 'finally' block
```

The 'return' statement in the finally block was causing a SyntaxWarning.
Moving the return outside the finally block ensures proper exception
handling while maintaining the intended behavior.

Closes scylladb/scylladb#27823
2025-12-24 06:48:03 +02:00
Avi Kivity
fc81983d42 test: sstable_validation_test: actually test ms version
sstable_validation_test tests the `scylla sstable validate` command
by passing it intentionally corrupted sstables. It uses an sstable
cache to avoid re-creating the same sstables. However, the cache
does not consider the sstable version, so if called twice with the
same inputs for different versions, it will return an sstable with
the original version for both calls. As a results, `ms` sstables
were not tested. Fix this bug by adding the sstable version (and
the schema for good measure) to the cache key.

An additional bug, hidden by the first, was that we corrupted the
sstable by overwriting its Index.db component. But `ms` sstables
don't have an Index.db component, they have a Partitions.db component.
Adjust the corrupting code to take that into account.

With these two fixes, test_scylla_sstable_validate_mismatching_partition_large
fails on `ms` sstables. Disable it for that version. Since it was
previously practically untested, we're not losing any coverage.

Fixing this test unblocks further work on making pytest take charge
of running the tests. pytest exposed this problem, likely by running
it on different runners (and thus reducing the effectiveness of the
cache).

Fixes #27822.

Closes scylladb/scylladb#27825
2025-12-24 06:47:31 +02:00
Botond Dénes
cf70250a5c Update seastar submodule
* seastar 7ec14e83...f0298e40 (8):
  > Merge 'coroutine/try_future: call set_current_task() when resuming the coroutine' from Botond Dénes
    coroutine/try_future: call set_current_task() when resuming the coroutine
    core: move set_current_task() out-of-line
  > stop_signal: stop including reactor.hh
  > cmake: Mark hwloc headers as system includes to suppress warnings
  > build: explicitly enable vptr sanitizer
  > httpd: Add API to set tcp keepalive params
  > Merge 'Make datagram_channel::send() use temporary_buffer-s' from Pavel Emelyanov
    net: Remove no longer used to_iovec() helpers
    net,code: Update callers to use new datagram_channel::send()
    net: Introduce datagram_channel::send(span<temporary_buffer>) method
    posix-stack: Make UDP socket implementation use wrapped_iovec
    posix-stack: Introduce wrapped_iovec
  > code: Move pollable_fd_state::write_all(const char*) from API level 9
  > thread: Remove unused sched_group() helper

configure.py: added -lubsan to DEBUG sanitizer flags

Closes scylladb/scylladb#27511
2025-12-24 06:46:36 +02:00
Nadav Har'El
54f3e69fdc Fix for Statement has no effect
This problem and its fix was suggested by copilot, I'm just writing the
cover letter.
test/nodetool/test_status.py has the silly statement tokens == "?" which
has no effect. Looking around the code suggested to me (and also to
Copilot, nice) that the correct intent was assert tokens == "?" and not,
say, tokens = "?".

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27659
2025-12-24 06:43:26 +02:00
Piotr Dulikowski
9ed820cbf5 test: cluster: test for recovery after partial group0 command
Add a reproducer for scylladb/scylladb#26945. By using error injections,
the test triggers a situation where a command that removes an obsolete
CDC generation is partially applied, then the node is killed an brought
back. Thanks to the fix, restarting the node succeeds and does not
trigger any consistency checks in the group0 reload logic.
2025-12-23 20:50:43 +01:00
Piotr Dulikowski
71bc1886ee group0_state_machine: remove obsolete comment about group0 consistency
The comment is outdated. It is concerned about group0 consistency after
crash, and that re-applying committed commands may require a raft
quorum. First, 579dcf1 was introduced (long ago) which gets rid of the
need for quorum as the node persists the commit index before applying
the commands - so it knows up to which command it should re-apply on
restart. Second, the preceding commits in this PR makes use of this
mechanism for group0.

Remove the comment as the concern was fully addressed. Additionally,
remove a mention of the comment in raft_group0_client.cc - although it
claims that the comment is placed in `group0_state_machine::apply`, it
has been moved to `merge_and_apply` in 96c6e0d (both comments were
originally introduced in 6a00e79).
2025-12-23 20:44:17 +01:00
Piotr Dulikowski
b24001b5e7 group0_state_machine: don't update in-memory state machine until start
Group0 commands consist of one or more mutations and are supposed to be
atomic - i.e. the data structures that reflect the group0 tables state
are not supposed to be updated while only some mutations of a command
are applied, the logic responsible for that is not supposed to observe
an inconsistent state of group0 tables.

It turns out that this assumption can be broken if a node crashes in the
middle of applying a multi-mutation group0 command. Because these
mutations are, in general, applied separately, only some mutations might
survive a crash and a restart, so the group0 tables might be in an
inconsistent state. The current logic of group0_state_machine will
attempt to read the group0 tables' state as it was left after restart,
so it may observe inconsistent state.

This can confuse the node as it may observe a state that it was not
supposed to observe, or the state will just outright break some
invariants and trigger some sanity checks. One of those was observed in
scylladb/scylladb#26945, where a command from the CDC generation
publisher fiber was partially applied. The fiber, in addition to
publishing generations, it removes old, expired generations as well.
Removal is done by removing data that describes the generation from
cdc_generations_v3 and by removing the generation's ID from the
committed generation list in the topology table. If only the first
mutation gets through but not the other one, on reload the node will see
a committed CDC generation without data, which will trigger an
on_internal_error check.

Fix this by delaying the moment when the in memory data structures are
first loaded. In 579dcf1, a mechanism was introduced which persists the
commit index before applying commands that are considered committed.
Starting a raft server waits until commands are replayed up to that
point. The fix is to start the group0_state_machine in a mode which only
applies mutations - the aforementioned mechanism will re-apply the
commands which will, thanks to the mutation idempotency, bring the
group0 to a consistent state. After the group0 is known to be in
consistent state (so, after raft::server_impl::start) the in-memory data
structures of group0 are loaded for the first time.

There is an exception, however: schema tables. Information about schema
is actually loaded into memory earlier than the moment when group0 is
started. Applying changes to schema is done through the migration
manager module which compares the persisted state before and after the
schema mutations are applied and acts on that. Refactoring migration
manager is out of scope of this PR. However, this is not a problem
because the migration manager takes care to apply all of the mutations
given in a command in a single commitlog segment, so the initial schema
loading code should not see an inconsistent state due to the state being
partially applied.

Fixes: scylladb/scylladb#26945
2025-12-23 20:44:16 +01:00
Piotr Dulikowski
f4efdf18a5 group0_state_machine: move reloading out of std::visit
In the next commit, we will adjust the logic so that it only reloads in
memory state only when a flag is set. By moving the reload logic to one
place in `merge_and_apply`, the next commit will be able to reach its
goal by only adding a single `if`.
2025-12-23 20:44:16 +01:00
Piotr Dulikowski
6bdbd91cf7 service: raft: add state machine ref to raft_server_for_group
This reference will be used by the code that starts group0. It will
manually enable the in-memory state machine only after the group0 server
is fully started, which entails replaying the group0 commands that are,
locally, seen as committed - in order to repair any inconsistencies that
might have arisen due to some commands being applied only partially
(e.g. due to a crash).
2025-12-23 20:44:16 +01:00
Michał Hudobski
ce3320a3ff auth: add system table permissions to VECTOR_SEARCH_INDEXING
Due to the recent changes in the vector store service,
the service needs to read two of the system tables
to function correctly. This was not accounted for
when the new permission was added. This patch fixes that
by allowing these tables (group0_history and versions)
to be read with the VECTOR_SEARCH_INDEXING permission.

We also add a test that validates this behavior.

Fixes: SCYLLADB-73

Closes scylladb/scylladb#27546
2025-12-23 15:53:07 +02:00
Pawel Pery
caa0cbe328 vector_search_validator: move high availability tests from vector-store.git
Initially, tests for high availability were implemented in vector-store.git
repository. High availability is currently implemented in scylladb.git
repository so this repository should be the better place to store them. This
commit copies these tests into the scylladb.git.

The commit copies validator-vector-store/src/high_availability.rs (tests logic)
and validator-tests/src/common.rs (utils for tests) into the local crate
validator-scylla. The common.rs should be copied to be able for reviewer to see
common test code and this code most likely be frequent to change - it will be
hard to maintain one common version between two repositories.

The commit updates also README for vector_search_validator; it shortly describe
the validator modules.

The commit updates reference to the latest vector-store.git master.

As a next step on the vector-store.git high_availability.rs would be removed
and common.rs moved from validator-tests into validator-vector-store.

References: VECTOR-394

Closes scylladb/scylladb#27499
2025-12-23 15:53:07 +02:00
Yaron Kaikov
bad2fe72b6 .github/workflows: Add email validator workflow
This workflow validates that all commits in a pull request use email
addresses ending in @scylladb.com. For each commit with an author or
committer email that doesn't match this pattern, the workflow automatically
adds a comment to the pull request with a warning.

This serves two purposes:

1. Alert maintainers when external contributors submit code (which is
   acceptable, but good to be aware of)

2. Help ScyllaDB developers catch cases where they haven't configured
   their git email correctly

When a non-@scylladb.com email is detected, the workflow posts this
comment on the pull request:
```
    ⚠️ Non-@scylladb.com Email Addresses Detected

    Found commit(s) with author or committer emails that don't end with
    @scylladb.com.

    This indicates either:

    - An external contributor (acceptable, but maintainer should be aware)
    - A developer who hasn't configured their git email correctly

    For ScyllaDB developers:

    If you're a ScyllaDB employee, please configure your git email globally:

        git config --global user.email "your.name@scylladb.com"

    If only your most recent commit is invalid, you can amend it:

        git commit --amend --reset-author --no-edit
        git push --force

    If you have multiple invalid commits, you need to rewrite them all:

        git rebase -i <base-commit>
        # Mark each invalid commit as 'edit', then for each:
        git commit --amend --reset-author --no-edit
        git rebase --continue
        # Repeat for each invalid commit
        git push --force
```
Fixes: https://scylladb.atlassian.net/browse/RELENG-35

Closes scylladb/scylladb#27796
2025-12-23 15:53:06 +02:00
Pavel Emelyanov
ec15a1b602 table: Do not move foreign string when writing snapshot
The table::seal_snapshot() accepts a vector of sstables filenames and
writes them into manifest file. For that, it iterates over the vector
and moves all filenames from it into the streamer object.

The problem is that the vector contains foreign pointers on sets with
sstrings. Not only sets are foreign, sstrings in it are foreign too.
It's not correct to std::move() them to local CPU.

The fix is to make streamer object work on string_view-s and populate it
with non-owning references to the sstrings from aforementioned sets.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27755
2025-12-23 15:53:06 +02:00
Pavel Emelyanov
ecef158345 api: Use ranges library to process views in get_built_indexes()
No functional changes, just make the loop shorter and more
self-contained.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27742
2025-12-23 15:53:06 +02:00
Israel Fruchter
53abf93bd8 Update tools/cqlsh submodule
* tools/cqlsh 22401228...9e5a91d7 (7):
  > Add pip retry configuration to handle network timeouts
  > Clean up unwanted build artifacts and update .gitignore
  > test_legacy_auth: update to pytest format
  > Add support for disabling compression via CLI and cqlshrc
  > Update scylla-driver to 3.29.7 (#144)
  > Update scylla-driver version to 3.29.6
  > Revert "Migrate workflows to Blacksmith"

Closes scylladb/scylladb#27567

[avi: build optimized clang 21.1.7
      regenerate frozen toolchain with optimized clang from

      https://devpkg.scylladb.com/clang/clang-21.1.7-Fedora-43-aarch64.tar.gz
      https://devpkg.scylladb.com/clang/clang-21.1.7-Fedora-43-x86_64.tar.gz
]

Closes scylladb/scylladb#27734
2025-12-23 15:53:06 +02:00
Aleksandra Martyniuk
bbe64e0e2a test: rename duplicate tests
There are two test with name test_repair_options_hosts_tablets in
test/nodetool/test_cluster_repair.py and and two test_repair_keyspace
in test/nodetool/test_repair.py. Due to that one of each pair is ignored.

Rename the tests so that they are unique.

Fixes: https://github.com/scylladb/scylladb/issues/27701.

Closes scylladb/scylladb#27720
2025-12-23 15:53:06 +02:00
Anna Stuchlik
7198191aa9 doc: fix the license information on DockerHub
This commit removes the OSS-related information from DockerHub.
It adds the link to the Source Available license.

Fixes https://github.com/scylladb/scylladb/issues/22440

Closes scylladb/scylladb#27706
2025-12-23 15:53:06 +02:00
Calle Wilund
d5f72cd5fc test::pylib::encryption_provider: Push up setting system_key_directory to all providers
Fixes #27694

Unless set by config, the location will default to /etc/scylla, which is not a good
place to write things for tests. Push the config properly and the directory (but
_not_ creation) to all provider basetype.

Closes scylladb/scylladb#27696
2025-12-23 15:53:06 +02:00
Dawid Mędrek
afde5f668a test: Implement describing Boost tests in JSON format
The Boost.Test framework offers a way to describe tests written in it
by running them with the option `--list_content`. It can be
parametrized by either HRF (Human Readable Format) or DOT (the Graphviz
graph format) [1]. Thanks to that, we can learn the test tree structure
and collect additional information about the tests (e.g. labels [2]).

We currently emply that feature of the framework to collect and run
Boost tests in Scylla. Unfortunately, both formats have their
shortcomings:

* HRF: the format is simple to parse, but it doesn't contain all
       relevant information, e.g. labels.
* DOT: the format is designed for creating graphical visualizations,
       and it's relatively difficult to parse.

To amend those problems, we implement a custom extension of the feature.
It produces output in the JSON format and contains more than the most
basic information about the tests; at the same time, it's easy to browse
and parse.

To obtain that output, the user needs to call a Boost.Test executable
with the option `--list_json_content`. For example:

```
$ ./path/to/test/exec -- --list_json_content
```

Note that the argument should be prepended with a `--` to indicate that
it targets user code, not Boost.Test itself.

---

The structure of the new format looks like this (top-level downwards):

- File name
- Test suite(s) & free test cases
- Test cases wrapped in test suites

Note that it's different from the output the default Boost.Test formats
produce: they organize information within test suites, which can
potentially span multiple files [3]. The JSON format makes test files
the primary object of interest and test suites from different files
are always considered distinct.

Example of the output (after applying some formatting):

```
$ ./build/dev/test/boost/canonical_mutation_test -- --list_json_content
[{"file":"test/boost/canonical_mutation_test.cc", "content": {
  "suites": [],
  "tests": [
    {"name": "test_conversion_back_and_forth", "labels": ""},
    {"name": "test_reading_with_different_schemas", "labels": ""}
  ]
}}]
```

---

The implementation may be seen as a bit ugly, and it's effectively
a hack. It's based on registering a global fixture [4] and linking
that code to every Boost.Test executable.

Unfortunately, there doesn't seem to be any better way. That would
require more extensive changes in the test files (e.g. enforcing
going through the same entry point in all of them).

This implementation is a compromise between simplicity and
effectiveness. The changes are kept minimal, while the developers
writing new tests shouldn't need to remember to do anything special.
Everything should work out of the box (at least as long as there's
no non-trivial linking involved).

Fixes scylladb/scylladb#25415

---

References:
[1] https://www.boost.org/doc/libs/1_89_0/libs/test/doc/html/boost_test/utf_reference/rt_param_reference/list_content.html
[2] https://www.boost.org/doc/libs/1_89_0/libs/test/doc/html/boost_test/tests_organization/tests_grouping.html
[3] https://www.boost.org/doc/libs/1_89_0/libs/test/doc/html/boost_test/tests_organization/test_tree/test_suite.html
[4] https://www.boost.org/doc/libs/1_89_0/libs/test/doc/html/boost_test/tests_organization/fixtures/global.html

Closes scylladb/scylladb#27527
2025-12-23 15:53:06 +02:00
Asias He
140858fc22 tablet-mon.py: Add repair support
Add repair support in the tablet monitor.

Fixes #24824

Closes scylladb/scylladb#27400
2025-12-23 15:53:06 +02:00
Pavel Emelyanov
132aa753da sstables/storage_manager: Fix configured endpoints observer
On start the manager creates observer for object_storage_endpoints
config parameter. The goal is to refresh the maintained set of endpoint
parameters and client upon config change. The observer is created on
shard 0 only, and when kicked it calls manager.invoke-on-all to update
manager on all shards.

However, there's a race here. The thing is that db::config values are
implicitly "sharded" under the hood with the help of plain array. When
any code tries to read a value from db::config::something, the reading
code secretly gets the value from this inner array indexed by the
current shard id.

Next, when the config is updated, it first assigns new values to [0]
element of the hidden array, then calls broadcast_to_all_shards() helper
that copies the valaues from zeroth slot to all the others. But the
manager's observer is triggered when the new value is assigned on zero
index, and if the invoke-on-all lambda (mentioned above) happens to be
faster than broadcast_to_all_shards, the non-zero shards will read old
values from db::config's inner array.

The fix is to instantiate observer on all shards and update only local
shard, whenever this update is triggered.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:43:11 +03:00
Pavel Emelyanov
f902eb1632 test/object_store: Add test to validate how endpoint config update works
There's a test for backup with non-existing endpoint/bucket/snapshot. It
checks that API call to backup sstables properly fails in that case.
This patch adds similar test for "unconfigured endpoint", but it adds
the endpoint configuration on-the-fly and expects that backup will
proceed after config update.

Currently the test fails, as config update only affect the config
itself, the storage_manager, that's in charge of maintaining endpoint
clients, is not really updated. Next patch will fix it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:41:38 +03:00
Pavel Emelyanov
e0cddc8c99 table: Move snapshot_file_set to table.cc
It's not used anywhere else now.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:14:36 +03:00
Pavel Emelyanov
e31b72c61f table: Rename and move snapshot_on_all_shards() method
Now it's database::snapshot_table_on_all_shards(). This is symmetric to
database::truncate_table_on_all_shards().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:14:36 +03:00
Pavel Emelyanov
48b1ceefaf table: Ditch jsondir variable
Now the table::snapshot_on_all_shards() is storage-independent and can
stop maintaining the local path variable.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:14:36 +03:00
Pavel Emelyanov
a21aa5bdf6 table, sstables: Pass snapshot name to sstable::snapshot()
Currently sstable::snapshot() is called with directory name where to put
snapshots into. This patch changes it to accept snapshot name instead.
This makes the table-sstable API be unware of snapshot destination
storage type.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:14:36 +03:00
Pavel Emelyanov
d0812c951e table: Use snapshot_writer in write_manifest()
The manifest writing code is self-contained in a sense that it needs
list of sstable files and output_stream to write it too. The
snapshot_writer can provide output_stream for specific component, it can
be re-used for manifest writing code, thus making it independent from
local filesystem.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:13:47 +03:00
Pavel Emelyanov
fe8923bdc7 table: Use snapshot_writer in write_schema_as_cql()
The schema writing code is self-contained in a sense that it needs
schema description and output_stream to write it too. Teach the
snapshot_writer to provide output_stream and make write_schema_as_cql()
be independent from local filesystem.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:06:17 +03:00
Pavel Emelyanov
9fee06d3bc table: Add snapshot_writer::sync()
And move calls to sync_directory() into it too.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:06:16 +03:00
Pavel Emelyanov
7a298788c0 table: Add snapshot_writer::init()
And move the call to recursive_touch_directory into it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:06:05 +03:00
Pavel Emelyanov
db6a5aa20b table: Introduce snapshot_writer
It's an abstract class that defines how to write data and metadata with
table snapshot. Currently it just replaces the storage_options checks
done by table::snapshot_on_all_shards(), but it will soon evolve.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:00:26 +03:00
Pavel Emelyanov
8e247b06a2 table: Move final sync and rename seal_snapshot()
The seal_snapshot() syncs directory at the end. Now when the method is
table.cc-local, it doesn't need to be that careful. It looks nicer if
being renamed to write_manifest() and it's caller that syncs directory
after calling it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:26 +03:00
Pavel Emelyanov
37746ba814 table: Hide write_schema_as_cql()
The method only needs schema description from table. The caller can
pre-get it and pass it as argument. This makes it symmetric with
seal_snapshot() (that will be renamed soon) and reduces the class table
API size.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:26 +03:00
Pavel Emelyanov
976bcef5d0 table: Hide table::seal_snapshot()
The method is static and has nothing to do with table. The
snapshot_file_set needs to become public, but it will be moved to
table.cc soon.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:26 +03:00
Pavel Emelyanov
8a4daf3ef1 table: Open-code finalize_snapshot()
It makes is easier to modify this code further.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:26 +03:00
Pavel Emelyanov
6975234d1c table: Fix indentation after previuous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:26 +03:00
Pavel Emelyanov
4a88f15465 table: Use smp::invoke_on_all() to populate the vector with filenames
There's a vector of foreign pointers to sets with sstable filenames
that's populated on all shards. The code does the invoke-on-all by hand
to grow the vector with push-back-s. However, if resizing the vector in
advance, shards will be able to just populate their slots.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:25 +03:00
Pavel Emelyanov
41ed11cdbe table: Don't touch dir once more on seal_snapshot()
The directory had been created way before that, in the caller.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:25 +03:00
Pavel Emelyanov
07708acebd table: Open-code table::take_snapshot() into caller lambda
Now when the logic of take_snapshot() is split between two components
(table and sstables_manager) it's no longer useful

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:25 +03:00
Pavel Emelyanov
ef3b651e1b table: Move parts of table::take_snapshot to sstables_manager
Move the loop over vector of sstables that calls sstable->snapshot()
into sstables manager.

This makes it symmetric with sstables_manager::delete_atomically() and
allows for future changes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:25 +03:00
Pavel Emelyanov
0c45a7df00 table: Introduce table::take_snapshot()
The method returns all sstables vector with a guard that prevents this
list from being modified. Currently this is the part of another existing
table::take_snapshot() method, but the newer, smaller one, is more
atomic and self-contained, next patches will benefit from it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:25 +03:00
Pavel Emelyanov
2e2cd2aa39 table: Store the result of smp::submit_to in local variable
Tossing bits around not to make it in the next, larger, patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:25 +03:00
Botond Dénes
58b5d43538 Merge 'test: multi LWT and counters test during tablets resize and migration' from Yauheni Khatsianevich
This PR extends BaseLWTTester with optional counter-table configuration and
verification, enabling randomized LWT tests over tablets with counters.

And introduces new LWT with counters test durng tablets resize and migration
- Workload: N workers perform CAS updates
- Update counter table each time CAS was successful
- Enable balancing and increase min_tablet_count to force split,
 and lower min_tablet_count to merge.
- Run tablets migrations loop
- Stop workload and verify data consistency

Refs: https://github.com/scylladb/qa-tasks/issues/1918
Refs: https://github.com/scylladb/qa-tasks/issues/1988
Refs https://github.com/scylladb/scylladb/issues/18068

Closes scylladb/scylladb#27170

* github.com:scylladb/scylladb:
  test: new LWT with counters test during tablets migration/resize - Workload: N workers perform CAS updates - Update counter table each time CAS was successful - Enable balancing and increase min_tablet_count to force split,  and lower min_tablet_count to merge. - Run tablets migrations loop - Stop workload and verify data consistency
  test/lwt: add counter-table support to BaseLWTTester
2025-12-23 07:29:35 +02:00
Botond Dénes
bfdd4f7776 Merge 'Synchronize incremental repair and tablet split' from Raphael Raph Carvalho
Split prepare can run concurrently with repair.

Consider this:

1) split prepare starts
2) incremental repair starts
3) split prepare finishes
4) incremental repair produces unsplit sstable
5) split is not happening on sstable produced by repair
        5.1) that sstable is not marked as repaired yet
        5.2) might belong to repairing set (has compaction disabled)
6) split executes
7) repairing or repaired set has unsplit sstable

If split was acked to coordinator (meaning prepare phase finished),
repair must make sure that all sstables produced by it are split.
It's not happening today with incremental repair because it disables
split on sstables belonging to repairing group. And there's a window
where sstables produced by repair belong to that group.

To solve the problem, we want the invariant where all sealed sstables
will be split.
To achieve this, streaming consumers are patched to produce unsealed
sstable, and the new variant add_new_sstable_and_update_cache() will
take care of splitting the sstable while it's unsealed.
If no split is needed, the new sstable will be sealed and attached.

This solution was also needed to interact nicely with out of space
prevention too. If disk usage is critical, split must not happen on
restart, and the invariant aforementioned allows for it, since any
unsplit sstable left unsealed will be discarded on restart.
The streaming consumer will fail if disk usage is critical too.

The reason interposer consumer doesn't fully solve the problem is
because incremental repair can start before split, and the sstable
being produced when split decision was emitted must be split before
attached. So we need a solution which covers both scenarios.

Fixes #26041.
Fixes #27414.

Should be backported to 2025.4 that contains incremental repair

Closes scylladb/scylladb#26528

* github.com:scylladb/scylladb:
  test: Add reproducer for split vs intra-node migration race
  test: Verify split failure on behalf of repair during critical disk utilization
  test: boost: Add failure_when_adding_new_sstable_test
  test: Add reproducer for split vs incremental repair race condition
  compaction: Fail split of new sstable if manager is disabled
  replica: Don't split in do_add_sstable_and_update_cache()
  streaming: Leave sstables unsealed until attached to the table
  replica: Wire add_new_sstables_and_update_cache() into intra-node streaming
  replica: Wire add_new_sstable_and_update_cache() into file streaming consumer
  replica: Wire add_new_sstable_and_update_cache() into streaming consumer
  replica: Document old add_sstable_and_update_cache() variants
  replica: Introduce add_new_sstables_and_update_cache()
  replica: Introduce add_new_sstable_and_update_cache()
  replica: Account for sstables being added before ACKing split
  replica: Remove repair read lock from maybe_split_new_sstable()
  compaction: Preserve state of input sstable in maybe_split_new_sstable()
  Rename maybe_split_sstable() to maybe_split_new_sstable()
  sstables: Allow storage::snapshot() to leave destination sstable unsealed
  sstables: Add option to leave sstable unsealed in the stream sink
  test: Verify unsealed sstable can be compacted
  sstables: Allow unsealed sstable to be loaded
  sstables: Restore sstable_writer_config::leave_unsealed
2025-12-23 07:28:56 +02:00
Botond Dénes
bf9640457e Merge 'test: add crash detection during tests' from Cezar Moise
After tests end, an extra check is performed, looking into node logs for crashes, aborts and similar issues.
The test directory is also scanned for coredumps.
If any of the above are found, the test will fail with an error.

The following checks are made:
- Any log line matching `Assertion.*failed` or containing `AddressSanitizer` is marked as a critical error
- Lines matching `Aborting on shard` will only be marked as a critical error if the paterns in `manager.ignore_cores_log_patterns` are not found in that log
- If any critical error is found, the log is also scanned for backtraces
- Any backtraces found are decoded and saved
- If the test is marked with `@pytest.mark.check_nodes_for_errors`, the logs are checked for any `ERROR` lines
- Any pattern in `manager.ignore_log_patterns` and `manager.ignore_cores_log_patterns` will cause above check to ignore that line
- The `expected_error` value that many methods, like `manager.decommission_node`, have will be automatically appended to `manager.ignore_log_patterns`

refs: https://github.com/scylladb/qa-tasks/issues/1804

---

[Examples](https://jenkins.scylladb.com/job/scylla-staging/job/cezar/job/byo_build_tests_dtest/46/testReport/):
Following examples are run on a separate branch where changes have been made to enable these failures.

`test_unfinished_writes_during_shutdown`
- Errors are found in logs and are not ignored
```
failed on teardown with "Failed:
Server 2096: found 1 error(s) (log: scylla-2096.log)
  ERROR 2025-12-15 14:20:06,563 [shard 0: gms] raft_topology - raft_topology_cmd barrier_and_drain failed with: std::runtime_error (raft topology: command::barrier_and_drain, the version has changed, version 11, current_version 12, the topology change coordinator  had probably migrated to another node)
Server 2101: found 4 error(s) (log: scylla-2101.log)
  ERROR 2025-12-15 14:20:04,674 [shard 0:strm] repair - repair[c434c0c0-68da-472c-ba3e-ed80960ce0d5]: Repair 1 out of 4 ranges, keyspace=system_distributed, table=view_build_status, range=(minimum token,maximum token), peers=[27c027a6-603d-49d0-8766-1b085d8c7d29, b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e], live_peers=[b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e], status=failed: mandatory neighbor=27c027a6-603d-49d0-8766-1b085d8c7d29 is not alive
  ERROR 2025-12-15 14:20:04,674 [shard 1:strm] repair - repair[c434c0c0-68da-472c-ba3e-ed80960ce0d5]: Repair 1 out of 4 ranges, keyspace=system_distributed, table=view_build_status, range=(minimum token,maximum token), peers=[27c027a6-603d-49d0-8766-1b085d8c7d29, b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e], live_peers=[b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e], status=failed: mandatory neighbor=27c027a6-603d-49d0-8766-1b085d8c7d29 is not alive
  ERROR 2025-12-15 14:20:04,675 [shard 0: gms] raft_topology - raft_topology_cmd stream_ranges failed with: std::runtime_error (["shard 0: std::runtime_error (repair[c434c0c0-68da-472c-ba3e-ed80960ce0d5]: 1 out of 4 ranges failed, keyspace=system_distributed, tables=[\"view_build_status\", \"cdc_generation_timestamps\", \"service_levels\", \"cdc_streams_descriptions_v2\"], repair_reason=bootstrap, nodes_down_during_repair={27c027a6-603d-49d0-8766-1b085d8c7d29}, aborted_by_user=false, failed_because=std::runtime_error (Repair mandatory neighbor=27c027a6-603d-49d0-8766-1b085d8c7d29 is not alive, keyspace=system_distributed, mandatory_neighbors=[27c027a6-603d-49d0-8766-1b085d8c7d29, b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e]))", "shard 1: std::runtime_error (repair[c434c0c0-68da-472c-ba3e-ed80960ce0d5]: 1 out of 4 ranges failed, keyspace=system_distributed, tables=[\"view_build_status\", \"cdc_generation_timestamps\", \"service_levels\", \"cdc_streams_descriptions_v2\"], repair_reason=bootstrap, nodes_down_during_repair={27c027a6-603d-49d0-8766-1b085d8c7d29}, aborted_by_user=false, failed_because=std::runtime_error (Repair mandatory neighbor=27c027a6-603d-49d0-8766-1b085d8c7d29 is not alive, keyspace=system_distributed, mandatory_neighbors=[27c027a6-603d-49d0-8766-1b085d8c7d29, b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e]))"])
  ERROR 2025-12-15 14:20:06,812 [shard 0:main] init - Startup failed: std::runtime_error (Bootstrap failed. See earlier errors (Rolled back: Failed stream ranges: std::runtime_error (failed status returned from 9dd942aa-acec-4105-9719-9bda403e8e94)))
Server 2094: found 1 error(s) (log: scylla-2094.log)
  ERROR 2025-12-15 14:20:04,675 [shard 0: gms] raft_topology - send_raft_topology_cmd(stream_ranges) failed with exception (node state is bootstrapping): std::runtime_error (failed status returned from 9dd942aa-acec-4105-9719-9bda403e8e94)"
```

`test_kill_coordinator_during_op`
- aborts caused by injection
- `ignore_cores_log_patterns` is not set
- while there are errors in logs and `ignore_log_patterns` is not set, they are ignored automatically due to the `expected_error` parameter, such as in `await manager.decommission_node(server_id=other_nodes[-1].server_id, expected_error="Decommission failed. See earlier errors")`
```
failed on teardown with "Failed:
Server 1105: found 1 critical error(s), 1 backtrace(s) (log: scylla-1105.log)
  Aborting on shard 0, in scheduling group gossip.
  1 backtrace(s) saved in scylla-1105-backtraces.txt
Server 1106: found 1 critical error(s), 1 backtrace(s) (log: scylla-1106.log)
  Aborting on shard 0, in scheduling group gossip.
  1 backtrace(s) saved in scylla-1106-backtraces.txt
Server 1113: found 1 critical error(s), 1 backtrace(s) (log: scylla-1113.log)
  Aborting on shard 0, in scheduling group gossip.
  1 backtrace(s) saved in scylla-1113-backtraces.txt
Server 1148: found 1 critical error(s), 1 backtrace(s) (log: scylla-1148.log)
  Aborting on shard 0, in scheduling group gossip.
  1 backtrace(s) saved in scylla-1148-backtraces.txt"
```
Decoded backtrace can be found in [failed_test_logs](https://jenkins.scylladb.com/job/scylla-staging/job/cezar/job/byo_build_tests_dtest/46/artifact/testlog/x86_64/dev/failed_test/test_kill_coordinator_during_op.dev.1)

Closes scylladb/scylladb#26177

* github.com:scylladb/scylladb:
  test: add logging to crash_coordinator_before_stream injection
  test: add crash detection during tests
  test.py: add pid to ServerInfo
2025-12-23 07:27:58 +02:00
Pavel Emelyanov
cd2568ad00 test: Merge and parametrize test_backup_to_non_existent_something tests
There are three tests in cluster/object_store suite that check how
backup fails in case either of its parameters doesn't really exists. All
three greatly duplicate each other, it makes sense to merge them into
one larger parametrized test.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27695
2025-12-23 07:02:18 +02:00
Avi Kivity
7586c5ccbd Merge 'system.clients: add client_options map column' from Vladislav Zolotarov
This pull request introduces a new caching mechanism for client options in the Alternator and transport layers, refactors how client metadata is stored and accessed, and extends the `system.clients` virtual table to surface richer client information. The changes improve efficiency by deduplicating commonly used strings (like driver names/versions and client options), and ensure that client data is handled in a way that's safe for cross-shard access. Additionally, the test suite and virtual table schema are updated to reflect the new client options data.

**Caching and client metadata refactoring:**

* The largest and most repeatable items in the connection state before this PR were a `driver_name` and a `driver_version` which were stored as an `sstring` object which means that the corresponding memory consumption was 16 bytes per each such value at least (the smallest size of the `seastar`'s `sstring` object) **per-connection**. In reality the driver name is usually longer than 15 characters, e.g. "ScyllaDB Python Driver" is 23 characters and this is not the longest driver name there is. In such cases the actual memory usage of a corresponding `sstring` object jumps to 8 + 4 + 1 + (string length, 23 in our example) + 1.
So, for "ScyllaDB Python Driver" it would be 37 bytes (in reality it would be a bit more due to natural alignment of other allocations since the `contents` size is not well aligned (13 bytes), but let's ignore this for now).
* These bytes add up quickly as there are more connections and, sometimes we are talking about millions of connections per-shard.
* Using a smart pointer (`lw_shared_ptr`) referencing a corresponding cached value will effectively reduce the per-connection memory usage to be 8 bytes (a size of a pointer on 64-bit CPU platform) for each such value. While storing a corresponding `sstring` value only once.
* This will would reduce the "variable" (per-connection) memory usage by **at least 50%**. And in case of "ScyllaDB Python Driver" driver version - by 78%!
* And all this for a price of a single `loading_shared_values` object **per-shard** (implements a hash table) and a minor overhead for each value **stored** in it.

* Introduced a new cache type (`client_options_cache_type`) for deduplicating and sharing client option strings, and refactored `client_data`, `client_state`, and related classes to use `foreign_ptr<std::unique_ptr<client_data>>` and cached entry types for fields like driver name, driver version, and client options. (`client_data.hh`, `service/client_state.hh`, `alternator/server.hh`, `alternator/controller.hh`, `transport/controller.hh`, `transport/protocol_server.hh`) [[1]](diffhunk://#diff-664a3b19e905481bdf8eb3843fc4d34691067bb97ab11cfd6e652e74aac51d9fR33-R36) [[2]](diffhunk://#diff-664a3b19e905481bdf8eb3843fc4d34691067bb97ab11cfd6e652e74aac51d9fL40-R56) [[3]](diffhunk://#diff-daadce1a2de3667511e59558f3a8f077b5ee30a14bcc6a99d588db90d0fcd2bdL105-R107) [[4]](diffhunk://#diff-daadce1a2de3667511e59558f3a8f077b5ee30a14bcc6a99d588db90d0fcd2bdL154-R182) [[5]](diffhunk://#diff-5fce246edf5abffb2351bd02e2eb1e9850880f7a00607ccaa90c3eee7ef57c6bL91-R92) [[6]](diffhunk://#diff-5fce246edf5abffb2351bd02e2eb1e9850880f7a00607ccaa90c3eee7ef57c6bL110-R111) [[7]](diffhunk://#diff-31730ba8e7374f784a88dc27c1512291cf73b7f24e08768f7466a3c8cfcc7a1aL96-R96) [[8]](diffhunk://#diff-19a97c0247cc08155ee49b277e43859ca32d6ef8cbff0ed7368ec5fa19e0a11eL172-R172) [[9]](diffhunk://#diff-eea7e2db5d799a25e717a72ac8ce5842bd4adb72b694d38d8f47166d9cd926faL356-R356) [[10]](diffhunk://#diff-d0b4ec3a144bbc5dc993866cf0b940850a457ff6156064f7e2b4b10ad0a95fefL80-R80) [[11]](diffhunk://#diff-4293b94c444d9bd5ecd17ce7eda8c00685d35ecf6e07f844efc91a91bbe85be1L46-R48)

* Updated the methods for setting and getting driver name, driver version, and client options in `client_state` to be asynchronous and use the new cache. (`service/client_state.hh`, `service/client_state.cc`) [[1]](diffhunk://#diff-daadce1a2de3667511e59558f3a8f077b5ee30a14bcc6a99d588db90d0fcd2bdL154-R182) [[2]](diffhunk://#diff-99634aae22e2573f38b4e2f050ed2ac4f8173ff27f0ae8b3609d1f0cc1aeb775R347-R362)

**Virtual table and API enhancements:**

* Extended the `system.clients` virtual table schema and implementation to include a new `client_options` column (a map of option key/value pairs), and updated the table population logic to use the new cached types and foreign pointers. (`db/virtual_tables.cc`) [[1]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1R752) [[2]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1L769-R770) [[3]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1L809-R816) [[4]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1L828-R879)

**API and interface changes:**

* Changed the signatures of `get_client_data` methods throughout the codebase to return vectors of `foreign_ptr<std::unique_ptr<client_data>>` instead of plain `client_data` objects, to ensure safe cross-shard access. (`alternator/controller.hh`, `alternator/controller.cc`, `alternator/server.hh`, `alternator/server.cc`, `transport/controller.hh`, `transport/protocol_server.hh`) [[1]](diffhunk://#diff-31730ba8e7374f784a88dc27c1512291cf73b7f24e08768f7466a3c8cfcc7a1aL96-R96) [[2]](diffhunk://#diff-19a97c0247cc08155ee49b277e43859ca32d6ef8cbff0ed7368ec5fa19e0a11eL172-R172) [[3]](diffhunk://#diff-5fce246edf5abffb2351bd02e2eb1e9850880f7a00607ccaa90c3eee7ef57c6bL110-R111) [[4]](diffhunk://#diff-a7e2cda866c03a75afcf3b087de1c1dcd2e7aa996214db67f9a11ed6451e596dL988-R995) [[5]](diffhunk://#diff-eea7e2db5d799a25e717a72ac8ce5842bd4adb72b694d38d8f47166d9cd926faL356-R356) [[6]](diffhunk://#diff-d0b4ec3a144bbc5dc993866cf0b940850a457ff6156064f7e2b4b10ad0a95fefL80-R80) [[7]](diffhunk://#diff-4293b94c444d9bd5ecd17ce7eda8c00685d35ecf6e07f844efc91a91bbe85be1L46-R48)

**Testing and validation:**

* Updated the Python test for the `system.clients` table to verify the new `client_options` column and its contents, ensuring that driver name and version are present in the options map. (`test/cqlpy/test_virtual_tables.py`) [[1]](diffhunk://#diff-6dd8bd4a6a82cd642252a29dc70726f89a46ceefb991c3e63fc67e283f323f03R79) [[2]](diffhunk://#diff-6dd8bd4a6a82cd642252a29dc70726f89a46ceefb991c3e63fc67e283f323f03R88-R90)

Closes scylladb/scylladb#25746

* github.com:scylladb/scylladb:
  transport/server: declare a new "CLIENT_OPTIONS" option as supported
  service/client_state and alternator/server: use cached values for driver_name and driver_version fields
  system.clients: add a client_options column
  controller: update get_client_data to use foreign_ptr for client_data
2025-12-22 20:02:40 +02:00
Emil Maskovsky
d60b908a8e test/raft: improve reporting in the randomized_nemesis_test digest functions
The Boost ASSERTs in the digest functions of the randomized_nemesis_test
were not working well inside the state machine digest functions, leading
to unhelpful boost::execution_exception errors that terminated the apply
fiber, and didn't provide any helpful information.

Replaced by explicit checks with on_fatal_internal_error calls that
provide more context about the failure. Also added validation of the
digest value after appending or removing an element, which allows to
determine which operation resulted in causing the wrong value.

This effectively reverts the changes done in https://github.com/scylladb/scylladb/pull/19282,
but adds improved error reporting.

Refs: scylladb/scylladb#27307
Refs: scylladb/scylladb#17030

Closes scylladb/scylladb#27791
2025-12-22 20:02:40 +02:00
Nikos Dragazis
20ff2fcc18 docs: Amend limitations for keyspace RF changes
The doc about DDL statements claims that an `ALTER KEYSPACE` will fail
in the presence of an ongoing global topology operation.

This limitation was specifically referring to RF changes, which Scylla
implements as global topology requests (`keyspace_rf_change`), and it
was true when it was first introduced (1b913dd880) because there was
no global topology request queue at that time, so only one ongoing
global request was allowed in the cluster.

This limitation was lifted with the introduction of the global topology
request queue (6489308ebc), and it was re-introduced again very
recently (2e7ba1f8ce) in a slightly different form; it now applies only
to RF changes (not to any request type) and only those that affect the
same keyspace. None of these two changes were ever reflected in the doc.

Synchronize the doc with the current state.

Fixes #27776.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#27786
2025-12-22 20:02:40 +02:00
Andrei Chekun
6ffdada0ea test.py: modify JUnit report for easier rerun on CI
This will allow to add custom XML attribute to the JUnit report. In this
case there will be path to the function that can be used to run with
pytest command. Parametrized tests will have path to the function
excluding parameter.

Closes scylladb/scylladb#27707
2025-12-22 20:02:40 +02:00
Anna Stuchlik
4c247a5d08 doc: document support for i8g and i8ge instances
Fixes https://github.com/scylladb/scylladb/issues/27703

Closes scylladb/scylladb#27754
2025-12-22 20:02:40 +02:00
copilot-swe-agent[bot]
288d4b49e9 Skip backtrace in lsa-timing logs for preemptible reclaim
Preemptible reclaim is only done from the background reclaimer,
so backtrace is not useful. It's also normal that it takes a long time.
Skip the backtrace when reclaim is preemptible to reduce log noise.

Fixes the issue where background reclaim was printing unnecessary
backtraces in lsa-timing logs when operations took longer than the
stall detection threshold.

Closes: #27692

Co-authored-by: tgrabiec <283695+tgrabiec@users.noreply.github.com>
2025-12-22 20:02:40 +02:00
Pavel Emelyanov
e304d912b4 Merge 'db/view/view_building_worker: follow-ups' from Michał Jadwiszczak
This patch consists of a few smaller follow-ups to the view building worker:
- catch general execption in staging task registrator
- remove unnecessary CV broadcast
- don't pollute function context with conditionally compiled variable
- avoid creating a copy of tasks map
- fix some typos

Refs https://github.com/scylladb/scylladb/issues/25929
Refs https://github.com/scylladb/scylladb/pull/26897

This PR doesn't fix any bugs but recently we're backporting some PRs to 2025.4, so let's also backport this one to avoid painful conflicts.

Closes scylladb/scylladb#26558

* github.com:scylladb/scylladb:
  docs/dev/view-building-coordinator: fix typos
  db/view/view_building_worker: remove unnnecessary empty lines
  db/view/view_building_worker: fix typo
  db/view/view_building_worker: avoid creating a copy of tasks map
  db/view/view_building_worker: wrap conditionally compiled code in a scope
  db/view/view_building_worker: remove unnecessary CV broadcast
  db/view/view_building_worker: catch general execption in staging task registrator
2025-12-22 20:02:40 +02:00
Botond Dénes
846a6e700b Merge 'get_snapshot_details: process also staging directory' from Benny Halevy
Currently, we determine the live vs. total snapshot size by listing all files in the snapshot directory,
and for each name, look it up in the base table directory and see if it exists there, and if so, if it's the same file
as in the snapshot by looking to the fstat data for the dev id and inode number.

However, we do not look the names in the staging directory so staging sstable
would skew the results as the will falsely contribute to the live size, since they
wouldn't be found in the base directory.

This change processes both the staging directory and base table directory
and keeps the file capacity in a map, indexed by the files inode number, allowing us to easily
detect hard links and be resilient against concurrent move of files from the staging sub-directory
back into the base table directory.

Fixes #27635

* Minor issue, no backport required

Closes scylladb/scylladb#27636

* github.com:scylladb/scylladb:
  table: get_snapshot_details: add FIXME comments
  table: get_snapshot_details: lookup entries also in the staging directory
  table: get_snapshot_details: optimize using the entry number_of_links
  table: get_snapshot_details: continue loop for manifest and schema entries
  table: get_snapshot_details: use directory_lister
2025-12-22 20:02:40 +02:00
Botond Dénes
af5e73def9 Merge 'test/cqlpy: remove unused variables' from Nadav Har'El
These patches fix a bunch of variables defined in test/cqlpy tests, but not used. Besides wasting a few bytes on disk, these unused variables can add confusion for readers who see them and might think they have some use which they are missing.

All these unused variables were found by Copilot's "code quality" scanner, but I considered each of them, and fixed them manually.

Closes scylladb/scylladb#27667

* github.com:scylladb/scylladb:
  test/cqlpy: remove unused variables
  test/cqlpy: use unique partition in test
2025-12-22 20:02:39 +02:00
Avi Kivity
8e462d06be build: apply sccache to rust builds too
sccache works for rust as well as for C++; use it for rust builds
as well.
2025-12-22 15:36:15 +02:00
Avi Kivity
9ac82d63e9 build: prevent double caching by compiler cache
To avoid both sccache and ccache being used simultaneously,
detect if ccache is in $PATH and use whatever compiler would
have been called instead.
2025-12-22 15:36:14 +02:00
Anna Stuchlik
9793a45288 doc: add a Vector Search page under Features
This commit adds a page with an overview of Vector Search under the Features section.
It includes a link to the VS documentation in ScyllaDB Cloud,
as the feature is only available in ScyllaDB Cloud.

The purpose of the page is to raise awareness of the feature.

Fixes https://scylladb.atlassian.net/browse/VECTOR-215

Closes scylladb/scylladb#27787
2025-12-22 15:29:45 +02:00
Avi Kivity
afb96b6387 build: allow selecting compiler cache, including sccache
Add a new configuration option for selecting the compiler
cache. Prefer sccache if found, since it supports rust as
well as C++, has better support for distributed compilation,
and is slated to receive module support soon.

cmake is also supported.
2025-12-22 15:27:15 +02:00
Alex
033579ad6f db: api: service: Fix ClientConnectorError in test_client_routes The bug was caused by capturing local variables by reference in lambdas passed to with_retry(), which is a coroutine. When the coroutine suspends, the lambda frame exits and the referenced locals are destroyed, leading to use-after-lifetime issues. This change fixes the problem by ensuring safe ownership across suspension points and also refactors how route_keys and route_entries are passed from the caller. Previously they were passed as const lvalue references, which cannot be moved and therefore ended up being repeatedly copied across function calls and lambda invocations. The new approach avoids unnecessary copies and makes the lifetime semantics explicit and safe.
Fixes: 27792

no backport needed private link is only in master branch

Closes scylladb/scylladb#27795
2025-12-22 14:52:47 +02:00
Yaniv Michael Kaul
c1da552fa4 test/pylib/scylla_cluster.py:get_scylla_2025_1_executable() - retry curl download of 2025.1
For some reason, we might fail. Retry 10 times, and fail with an error code instead of 404 or whatnot.

Benign, I hope - no need to backport.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#27746
2025-12-22 14:45:06 +02:00
Piotr Smaron
cb3b96b8f4 raft: correct lease->least typo in a comment
Funny, when researching if our raft implementation relies on the 'lease'
mechanism, I noticed this typo.

Closes scylladb/scylladb#27803
2025-12-22 14:39:55 +02:00
Avi Kivity
b105ad8379 build: drop -fexperimental-assignment-tracking clang option
`-fexperimental-assignment-tracking` was added in fdd8b03d4b to
make coroutine debugging work.

However, since then, it became unnecessary, perhaps due to 87c0adb2fe,
or perhaps to a toolchain fix.

Drop it, so we can benefit from assignment tracking (whatever it is),
and to improve compatibility with sccache, which rejects this option.

I verified that the test added in fdd8b03d4b fails without the option
and passes with this patch; in other words we're not introducing a
regression here.

Closes scylladb/scylladb#27763
2025-12-22 14:33:48 +02:00
Karol Nowacki
addac8b3f7 vector_search: test: Fix flaky DNS resolution test
The `vector_store_client_test_dns_resolving_repeated` test had race
conditions causing it to be flaky. Two main issues were identified:

1. Race between initial refresh and manual trigger: The test assumes
    a specific resolution sequence, but timing variations between the
    initial DNS refresh (on client creation) and the first manual
    trigger (in the test loop) can cause unexpected delayed scheduling.

2. Extra triggers from resolve_hostname fiber: During the client
    refresh phase, the background DNS fiber clears the client list.
    If resolve_hostname executes in the window after clearing but
    before the update completes, pending triggers are processed,
    incrementing the resolution count unexpectedly. At count 6, the
    mock resolver returns a valid address (count % 3 == 0), causing
    the test to fail.

The fix relaxes test assertions to verify retry behavior and client
clearing on DNS address loss, rather than enforcing exact resolution
counts.

Fixes: #27074

Closes scylladb/scylladb#27685
2025-12-21 20:02:16 +02:00
Vlad Zolotarov
ea95cdaaec transport/server: declare a new "CLIENT_OPTIONS" option as supported
Declare support for a 'CLIENT_OPTIONS' startup key.
This key is meant to be used by drivers for sending client-specific
configurations like request timeouts values, retry policy configuration, etc.
The value of this key can be any string in general (according to the CQL binary protocol),
however, it's expected to be some structured format, e.g. JSON.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2025-12-20 12:26:22 -05:00
Vlad Zolotarov
28cbaef110 service/client_state and alternator/server: use cached values for driver_name and driver_version fields
Optimize memory usage changing types of driver_name and driver_version be
a reference to a cached value instead of an sstring.

These fields very often have the same value among different connections hence
it makes sense to cache these values and use references to them instead of duplicating
such strings in each connection state.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2025-12-20 12:26:22 -05:00
Vlad Zolotarov
85adf6bdb1 system.clients: add a client_options column
This new column is going to contain all OPTIONS sent in the
STARTUP frame of the corresponding CQL session.

The new column has a `frozen<map<text, text>>` type, and
we are also optimizing the amount of required memory for storing
corresponding keys and values by caching them on each shard level.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2025-12-20 12:26:15 -05:00
Vlad Zolotarov
3a54bab193 controller: update get_client_data to use foreign_ptr for client_data
get_client_data() is used to assemble `client_data` objects from each connection
on each CPU in the context of generation of the `system.clients` virtual table data.

After collected, `client_data` objects were std::moved and arranged into a
different structure to match the table's sorting requirements.

This didn't allow having not-cross-shard-movable objects as fields in the `client_data`,
e.g. lw_shared_ptr objects.

Since we are planning to add such fields to `client_data` in following patches this patch
is solving the limitation above by making get_client_data() return `foreign_ptr<std::unique_ptr<client_data>>`
objects instead of naked `client_data` ones.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2025-12-19 11:01:41 -05:00
Anna Stuchlik
f65db4e8eb doc: remove the links to the Download Center
This commit removes the remaining links to the Download Center on the website.
We no longer use it for installation, and we don't want users to infer that
something like that still exists.

Fixes https://github.com/scylladb/scylladb/issues/27753

Closes scylladb/scylladb#27756
2025-12-19 12:53:40 +01:00
Botond Dénes
df2ac0f257 Merge 'test: dtest: schema_management_test.py: migrate from dtest' from Dario Mirovic
This PR migrates schema management tests from dtest to this repository.

One reason is that there is an ongoing effort to migrate tests from dtest to here.

Test `TestLargePartitionAlterSchema.test_large_partition_with_drop_column` failed with timeout error once. The main suspect so far are infra related problems, like infra congestion. The [logs from the test execution](https://jenkins.scylladb.com/job/scylla-master/job/dtest-release/1062/testReport/junit/schema_management_test/TestLargePartitionAlterSchema/Run_Dtest_Parallel_Cloud_Machines___Dtest___full_split001___test_large_partition_with_drop_column/), linked in the issue [test_large_partition_with_drop_column failed on TimeoutError #26932](https://github.com/scylladb/scylladb/issues/26932) show the following:
- `populate` works as intended - it starts, then during populate/insert drop column happened, then an exception is raised and intentionally ignored in the test, so no `Finish populate DB` for 50 x 1490 records - expected
- drop column works as intended - interrupts `populate` and proceeds to flush
- flush **probably** works as intended - logs are consistent with what we expect and what I got in local test runs
- `read` is the only thing that visibly got stuck, all the way until timeout happened, 5 minutes after the start

Migrating the test to this repo will also give us test start and end times on CI machines, in the sql report database. It has start and end timestamp for each test executed. We will be able to see how long does it usually take when the test is successful. It can not be seen from the logs, because logs are not kept for successful tests.

Another thing this PR does is adding a log message at the end of `database::flush_all_tables`. This will let us know if a thread got stuck inside or finished successfully. This addresses the **probably** part of the flush analysis step described above. If the issue reoccurs, we will have more information.

The test `test_large_partition_with_add_column` has not been executing for ~5 years. It was never migrated to pytest. The name was left as `large_partition_with_add_column_test`, and was skipped. Now it is enabled and updated.

Both `test_large_partition_with_add_column` and `test_large_partition_with_drop_column` are improved.
Small performance improvements:
- Regex compilation extracted from the stress function to the module level, to avoid recompilation.
- Do not materialize list in `stress_object` for loop. Use a generator expression.

The tests in `TestLargePartitionAlterSchema` are `test_large_partition_with_add_column`
and `test_large_partition_with_drop_column`.

These tests need to replicate the following conditions that led to a bug before a fix from around 5 years ago.

The scenario in which the problem could have happened has to involve:
- a large partition with many rows, large enough for preemption (every 0.5ms) to happen during the scan of the partition.
- appending writes to the partition (not overwrites)
- scans of the partition
- schema alter of that table. The issue is exposed only by adding or dropping a column, such that the added/dropped
  column lands in the middle (in alphabetical order) of the old column set.

The way the test is set up is:
- fixed number of writes per populate call
- fixed number of reads

This has the following implications:
- if the machine executing the test is fast, all the writes are done before the 10 seconds sleep
- there are too many reads - most of them get executed after the test logic is done

This patch solves these issues in the following way:
- populate lazily generates write data, and stops when instructed by `stop_populating` event
- read, which is done sequentially, stops when instructed by `stop_reading` event
- number of max operations is increased significantly, but the operations are stopped 1 second
  after node flush; this makes sure there are enough operations during the test, but also that
  the test does not take unnecessary time

Test execution time has been reduced severalfold. On dev machine the time the tests take is
reduced from 110 seconds to 34 seconds.

scylla-dtest PR that removes migrated tests:
[schema_management_test.py: remove tests already ported to scylladb repo #6427](https://github.com/scylladb/scylla-dtest/pull/6427)

Fixes #26932

This is a migration of existing tests to this repository. No need for backport.

Closes scylladb/scylladb#27106

* github.com:scylladb/scylladb:
  test: dtest: schema_management_test.py: speed up `TestLargePartitionAlterSchema` tests
  test: dtest: schema_management_test.py: fix large partition add column test
  test: dtest: schema_management_test.py: add `TestSchemaManagement.prepare`
  test: dtest: schema_management_test.py: test enhancements
  test: dtest: schema_management_test.py: make the tests work
  test: dtest: migrate setup and tools from dtest
  test: dtest: copy unmodified schema_management_test.py
  replica: database: flush_all_tables log on completion
2025-12-19 12:30:00 +02:00
Botond Dénes
093e97a539 Merge 'test: increase num of requests in driver_service_level tests' from Andrzej Jackowski
`_verify_tasks_processed_metrics()` is used to check that the correct
service level is used to process requests. It takes two service levels
as arguments and executes numerous requests. After that, the number
of tasks processed by one of the service levels is expected to rise
by at least the number of executed requests. In contrast,
the second service level is expected to process fewer tasks than
the number of requests.

Unfortunately, background noise may cause some tasks to be executed
on the service level that is not supposed to process requests.
This patch increases the number of executed requests to eliminate
the chance of noise causing test failures.

Additionally, this commit extends logging to make future investigation
easier.

Fixes: https://github.com/scylladb/scylladb/issues/27715

No backport, fix for test on master.

Closes scylladb/scylladb#27735

* github.com:scylladb/scylladb:
  test: remove unused `get_processed_tasks_for_group`
  test: increase num of requests in driver_service_level tests
2025-12-19 10:54:14 +02:00
Emil Maskovsky
fa6e5d0754 test/random_failures: fix handling of banned notification
After 39cec4a node join may fail with either "init - Startup failed"
notification or occasionally because it was banned, depending on timing.

The change updates the test to handle both cases.

Fixes: scylladb/scylladb#27697

No backport: This failure is only present in master.

Closes scylladb/scylladb#27768
2025-12-19 09:55:31 +02:00
Emil Maskovsky
08518b2c12 test/raft: fix test_joining_old_node_fails flakiness
When a node without the required feature attempts to join a Raft-based
cluster with the feature enabled, there is a race between the join
rejection response ("Feature check failed") and the ban notification
("received notification of being banned"). Depending on timing, either
message may appear in the joining node's log.

This starts to happen after 39cec4a (which introduced informing the
nodes about being banned).

Updated the test to accept both error messages as valid, making the test
robust against this race condition, which is more likely in debug mode
or under slow execution.

Fixes: scylladb/scylladb#27603

No backport: This failure is only present in master.

Closes scylladb/scylladb#27760
2025-12-19 09:44:09 +02:00
Emil Maskovsky
2a75b1374e test/raft: fix race condition in failure_detector_test
The test had a sporadic failure due to a broken promise exception.
The issue was in `test_pinger::ping()` which captured the promise by
move into the subscription lambda, causing the promise to be destroyed
when the lambda was destroyed during coroutine unwinding.

Simplify `test_pinger::ping()` by replacing manual abort_source/promise
logic with `seastar::sleep_abortable()`.
This removes the risk of promise lifetime/race issues and makes the code
simpler and more robust.

Fixes: scylladb/scylladb#27136

Backport to active branches: This fixes a CI test issue, so it is
beneficial to backport the fix. As this is a test-only fix, it is a low
risk change.

Closes scylladb/scylladb#27737
2025-12-19 09:42:19 +02:00
Łukasz Paszkowski
2cb9bb8f3a test_user_writes_rejection: Disable speculative retries
This test starts a 3-node cluster and creates a large blob file so that one
node reaches critical disk utilization, triggering write rejections on that
node. The test then writes data with CL=QUORUM and validates that the data:
- did not reach the critically utilized node
- did reach the remaining two nodes

By default, tables use speculative retries to determine when coordinators may
query additional replicas.

Since the validation uses CL=ONE, it is possible that an additional request
is sent to satisfy the consistency level. As a result:
- the first check may fail if the additional request is sent to a node that
  already contains data, making it appear as if data reached the critically
  utilized node
- the second check may fail if the additional request is sent to the critically
  utilized node, making it appear as if data did not reach the healthy node

The patch fixes the flakiness by disabling the speculative retries.

Fixes https://github.com/scylladb/scylladb/issues/27212

Closes scylladb/scylladb#27488
2025-12-19 09:39:09 +02:00
Botond Dénes
c899c117c7 scylla-gdb.py: scylla read-stats: include all permit lists
The current code which collects permit stats is out-of-date (by a few
years), as it only iterates through _permit_list. There are 4 additional
lists that permits can be part of now (all intrusive). Include all of
these in the stat collection.
As a bonus, also print the semaphore pointer in the printout, so the
user can hand-examine it, should they wish to.
2025-12-18 18:26:07 +02:00
Botond Dénes
db9f9e1b7e scylla-gdb.py: scylla fiber: add --direction command-line param
Can be "forward", "backward" or "both" (default).
Allows traversing the fiber in just one direction. Useful when scylla
fiber fails to traverse through a task and the user has to locate the
next one in the chain manually. When resuming from this next item, the
user might want to skip the already seen part of the fiber, to save time
on the invokation.
2025-12-18 18:26:07 +02:00
Botond Dénes
4c6e508811 scylla-gdb.py: scylla fiber: add support for traversing through coroutines backward
Traversing through coroutines forward (finding task waiting on this
coroutine) is already supported. This patch adds support for traversing
through coroutines backwards (finding task waited on by coroutine).
Coroutines need special handling: the future<> object is most likely
allocated on the coroutine frame, so we have to search throgh that to
find it. When doing so the first two pointers on the frame have to be
skipped: these are pointers to .resume and .destroy respectively and
will halt the search algorithm if seen.
2025-12-18 18:26:06 +02:00
Dario Mirovic
f1d63d014c test: dtest: schema_management_test.py: speed up TestLargePartitionAlterSchema tests
The tests in `TestLargePartitionAlterSchema` are `test_large_partition_with_add_column`
and `test_large_partition_with_drop_column`.

These tests need to replicate the following conditions that led to a bug before a fix from around 5 years ago.

The scenario in which the problem could have happened has to involve:
- a large partition with many rows, large enough for preemption (every 0.5ms) to happen during the scan of the partition.
- appending writes to the partition (not overwrites)
- scans of the partition
- schema alter of that table. The issue is exposed only by adding or dropping a column, such that the added/dropped
  column lands in the middle (in alphabetical order) of the old column set.

The way the test is set up is:
- fixed number of writes per populate call
- fixed number of reads

This has the following implications:
- if the machine executing the test is fast, all the writes are done before the 10 seconds sleep
- there are too many reads - most of them get executed after the test logic is done

This patch solves these issues in the following way:
- populate lazily generates write data, and stops when instructed by `stop_populating` event
- read, which is done sequentially, stops when instructed by `stop_reading` event
- number of max operations is increased significantly, but the operations are stopped 1 second
  after node flush; this makes sure there are enough operations during the test, but also that
  the test does not take unnecessary time

Test execution time has been reduced severalfold. On dev machine the time the tests take is
reduced from 110 seconds to 34 seconds.

The patch also introduces a few small improvements:
- `cs_run` renamed to `run_stress` for clarity
- Stopped checking if cluster is `ScyllaCluster`, since it is the only one we use
- `case_map` removed from `test_alter_table_in_parallel_to_read_and_write`, used `mixed` param directly
- Added explanation comment on why we do `data[i].append(None)`
- Replaced `alter_table` inner function with its body, for simplicity
- Removed unnecessary `ck_rows` variable in `populate`
- Removed unnecessary `isinstance(self.cluster. ScyllaCluster)`
- Adjusted `ThreadPoolExecutor` size in several places where 5 workers are not needed
- Replaced functional programming style expressions for `new_versions` and `columns_list` with
  comprehension/generator statement python style code, improving readability

Refs #26932

fix
2025-12-18 17:07:27 +01:00
Michael Litvak
33f7bc28da docs: document restrictions of colocated tables
Currently some things are not supported for colocated tables: it's not
possible to repair a colocated table, and due to this it's also not
possible to use the tombstone_gc=repair mode on a colocated table.

Extend the documentation to explain what colocated tables are and
document these restrictions.

Fixes scylladb/scylladb#27261

Closes scylladb/scylladb#27516
2025-12-18 15:38:29 +01:00
Cezar Moise
0ef8ca4c57 test: add logging to crash_coordinator_before_stream injection
In order to have the test ignore crashes caused by the injection,
it needs to log its occurence.
2025-12-18 16:28:13 +02:00
Cezar Moise
95d0782f89 test: add crash detection during tests
After tests end, an extra check if performed, looking into node logs.
By default, it only searches for critical errors and scans for coredumps.
If the test has the fixture `check_nodes_for_errors`, it will search for all errors.
Both checks can be ignored by setting `ignore_cores_log_patterns` and `ignore_log_patterns`.
If any of the above are found, the test will fail with an error.
2025-12-18 16:28:13 +02:00
Dario Mirovic
f831ca5ab5 test: dtest: schema_management_test.py: fix large partition add column test
`large_partition_with_add_column_test` and `large_partition_with_drop_column_test`
were added on August 17th, 2020 in scylladb/scylla-dtest#1569.

Only `large_partition_with_drop_column_test` was migrated to pytest, and renamed
to `test_large_partition_with_drop_column` on March 31st, 2021 in scylladb/scylla-dtest#2051.
Since then this test has not been running.

This patch fixes it - the test is updated and renamed and the testing environment
now properly picks it up.

Refs #26932
2025-12-18 12:54:43 +01:00
Dario Mirovic
1fe0509a9b test: dtest: schema_management_test.py: add TestSchemaManagement.prepare
Extract repeated cluster initialization code in `TestSchemaManagement`
into a separate `prepare` method. It holds all the common code for
cluster preparation, with just the necessary parameters.

Refs #26932
2025-12-18 12:54:43 +01:00
Dario Mirovic
e7d76fd8f3 test: dtest: schema_management_test.py: test enhancements
Extract regex compilation from the stress functions to the module level,
to avoid unnecessary regex compilation repetition.

Add descriptions to the stress functions.

Do not materialize list in `stress_object` for loop. Use a generator expression.

Make `_set_stress_val` an object method.

Refs #26932
2025-12-18 12:54:43 +01:00
Dario Mirovic
700853740d test: dtest: schema_management_test.py: make the tests work
Remove unused function markers.
Add wait_other_notice=True to cluster start method in
TestSchemaHistory.prepare function to make the test stable.

Enable the test in suite.yaml for dev and debug modes.

Fixes #26932
2025-12-18 12:54:43 +01:00
Dario Mirovic
3c5dd5e5ae test: dtest: migrate setup and tools from dtest
Migrate several functionalities from dtest. These will be used by
the schema_management_test.py tests when they are enabled.

Refs #26932
2025-12-18 12:54:43 +01:00
Dario Mirovic
5971b2ad97 test: dtest: copy unmodified schema_management_test.py
Copy schema_management_test.py from scylla-dtest to
test/cluster/dtest/schema_management_test.py.

Add license header.

Disable it for debug, dev, and release mode.

Refs #26932
2025-12-18 12:54:42 +01:00
Dario Mirovic
f89315d02f replica: database: flush_all_tables log on completion
In database::flush_all_tables add log on completion.
This slightly improves the readability of logs when debugging an issue.

Refs #26932
2025-12-18 12:54:42 +01:00
Patryk Jędrzejczak
d5c205194b Merge 'topology: Make removenode use left_token_ring state for global barrier' from Emil Maskovsky
Make the removenode operation go through the `left_token_ring` state, similar to decommission. This ensures that when removenode completes, all nodes in the cluster are aware of the topology change through a global token metadata barrier.

Previously, removenode would skip the `left_token_ring` state and go directly from `write_both_read_new` to `left` state. This meant that when the operation completed, some nodes might not yet know about the topology change, potentially causing issues with subsequent data plane requests.

Key changes:
- Both decommission and removenode now transition to `left_token_ring` state in the `write_both_read_new` handler
- In `left_token_ring` state, only decommissioning nodes receive the shutdown RPC (removed nodes are already dead)
- Updated documentation to reflect that both operations use this state

This change improves consistency guarantees for removenode operations by ensuring cluster-wide awareness before completion.

The change is protected by "REMOVENODE_WITH_LEFT_TOKEN_RING" feature flag to also support mixed clusters during e.g. upgrade.

Fixes:  scylladb/scylladb#25530

No backport: This fixes and issue found in tests. It can theoretically happen in production too, but wasn't reported in any customer issue, so a backport is not needed.

Closes scylladb/scylladb#26931

* https://github.com/scylladb/scylladb:
  topology: make removenode use left_token_ring state for global barrier
  topology: allow removing nodes not having tokens
  features: add feature flag for removenode via left token ring
2025-12-18 09:34:38 +01:00
Andrzej Jackowski
6ad10b141a test: remove unused get_processed_tasks_for_group
The function `get_processed_tasks_for_group` was defined twice in
`test_raft_service_levels.py`. This change removes the unused
definition to avoid confusion and clean up the code.
2025-12-17 20:45:53 +01:00
Andrzej Jackowski
8cf8e6c87d test: increase num of requests in driver_service_level tests
`_verify_tasks_processed_metrics()` is used to check that the correct
service level is used to process requests. It takes two service levels
as arguments and executes numerous requests. After that, the number
of tasks processed by one of the service levels is expected to rise
by at least the number of executed requests. In contrast,
the second service level is expected to process fewer tasks than
the number of requests.

Unfortunately, background noise may cause some tasks to be executed
on the service level that is not supposed to process requests.
This patch increases the number of executed requests to eliminate
the chance of noise causing test failures.

Additionally, this commit extends logging to make future investigation
easier.

Fixes: scylladb/scylladb#27715
2025-12-17 20:45:48 +01:00
Michael Litvak
3a06c32749 schema_registry: fix learning a schema with cdc schema
When learning a schema that has a linked cdc schema, we need to learn
also the cdc schema, and at the end the schema should point to the
learned cdc schema.

This is needed because the linked cdc schema is used for generating cdc
mutations, and when we process the mutations later it is assumed in some
places that the mutation's schema has a schema registry entry.

We fix a scenario where we could end up with a schema that points to a
cdc schema that doesn't have a schema registry entry. This could happen
for example if the schema is loaded before it is learned, so when we
learn it we see that it already has an entry. In that case, we need to
set the cdc schema to the learned cdc schema as well, because it could
have been loaded previously with a cdc schema that was not learned.

Fixes scylladb/scylladb#27610

Closes scylladb/scylladb#27704
2025-12-17 20:01:00 +02:00
Michał Jadwiszczak
74ab5addd3 test/cluster/test_view_building_coordinator: fix flakiness in test_file_streaming
The test generates a staging sstable on a node and verifies whether
the view is correctly populated.

However view updates generated by a staging sstable
(`view_update_generator::generate_and_propagate_view_updates()`) aren't
awaited by sstable consumer.
It's possible that the view building coordinator may see the task as finished
(so the staging sstable was processed) but not all view updates were
writted yet.

This patch fixes the flakiness by waiting until
`scylla_database_view_update_backlog` drops down to 0 on all shards.

Fixes scylladb/scylladb#26683

Closes scylladb/scylladb#27389
2025-12-17 17:29:15 +01:00
Michael Litvak
55f4a2b754 migration_listener: fix deadlock in nested notifications
When calling a migration notification from the context of a notification
callback, this could lead to a deadlock with unregistering a listener:
A: the parent notification is called. it calls thread_for_each, where it
   acquires a read lock on the vector of listeners, and calls the
   callback function for each listener while holding the lock.
B: a listener is unregistered. it calls `remove` and tries to acquire a
   write lock on the vector of listeners. it waits because the lock is
   held.
A: the callback function calls another notification and calls
   thread_for_each which tries to acquire the read lock again. but it
   waits since there is a waiter.

Currently we have such concrete scenario when creating a table, where
the callback of `before_create_column_family` in the tablet allocator
calls `before_allocate_tablet_map`, and this could deadlock with node
shutdown where we unregister listeners.

Fix this by not acquiring the read lock again in the nested
notification. There is no need because the read lock is already held by
the parent notification while the child notification is running. We add
a function `thread_for_each_nested` that is similar to `thread_for_each`
except it assumes the read lock is already held and doesn't acquire it,
and it should be used for nested notifications instead of
`thread_for_each`.

Fixes scylladb/scylladb#27364

Closes scylladb/scylladb#27637
2025-12-17 14:00:28 +01:00
Emil Maskovsky
1642c686c2 topology: make removenode use left_token_ring state for global barrier
Make the removenode operation go through the `left_token_ring` state,
similar to decommission. This ensures that when removenode completes,
all nodes in the cluster are aware of the topology change through a
global token metadata barrier.

Previously, removenode would skip the `left_token_ring` state and go
directly from `write_both_read_new` to `left` state. This meant that
when the operation completed, some nodes might not yet know about the
topology change, potentially causing issues with subsequent data plane
requests.

Key changes:
- Both decommission and removenode now transition to `left_token_ring`
  state in the `write_both_read_new` handler
- In `left_token_ring` state, only decommissioning nodes receive the
  shutdown RPC (removed nodes may already be dead)
- Updated documentation to reflect that both operations use this state

This change improves consistency guarantees for removenode operations
by ensuring cluster-wide awareness before completion.

Fixes:  scylladb/scylladb#25530
2025-12-17 13:31:11 +01:00
Emil Maskovsky
9431826c52 topology: allow removing nodes not having tokens
For the changes to go through the left_token_ring state when
REMOVENODE_WITH_LEFT_TOKEN_RING feature is enabled, we need to allow
removing nodes to not have any tokens (similarly to decommissioning
nodes, which use the same sequence of states).

This means the tests also need to change to allow for this new behavior
- it can temporarily happen that a removing node has no tokens but is
still part of Raft group 0 (so there may be a temporary mismatch between
the token ring and group 0 membership).

Therefore, the `check_token_ring_and_group0_consistency` function is
replaced by `wait_for_token_ring_and_group0_consistency`, which waits
up to 30 seconds for consistency to be reached.
2025-12-17 13:31:11 +01:00
Emil Maskovsky
ba6fabfc88 features: add feature flag for removenode via left token ring
To improve the behavior of the removenode operation, we want to issue
a global topology barrier after the removenode has been applied.
However, this requires changing the topology state machine to add a new
state (left_token_ring) to the removenode flow, which is not supported
by older nodes.

To allow rolling upgrades, we add a feature flag
REMOVENODE_WITH_LEFT_TOKEN_RING that controls whether the new removenode
flow is used.
2025-12-17 13:31:11 +01:00
Avi Kivity
cfd91545f9 test: add proxy protocol tests
Test that the new configuration options work and that we can
connect to them. Use direct connections with an inline implementation
of the proxy protocol and the CQL native protocol, since we want
to maintain direct control over the source port number (for shard-aware
ports). Also test we land on the expected shard.
2025-12-17 14:18:04 +02:00
Avi Kivity
1382b47d45 config, transport: support proxy protocol v2 enhanced connections
We have four native transport ports: two for plain/TLS, and two
more for shard-aware (plain/TLS as well). Add four more that expect
the proxy protocol v2 header. This allows nodes behind a reverse
proxy to record the correct source address and port in system.clients,
and the shard-aware port to see the correct source port selection
made my the client.
2025-12-17 14:18:04 +02:00
Pavel Emelyanov
a6618f225c object_storage_endpoint_param: Make it formattable for real
Currently the formatter converts it to json and then tries to emit into
the output context with the "...{{}}" format string. The intent was to
have the "...{<json text>}" output. However, the double curly brace in
format string means "print a curly brace", so the output of the above
formatting is "...{}", literally.

Fix by keeping a single curly brace. The "<json text>" thing will have
its own surrounding curly braces.

Fixes #27718

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27687
2025-12-17 11:48:39 +01:00
Dani Tweig
0bfd07a268 github: synching scylladb.git new milestones with Jira releases
This action is triggered when a new milestone is created in scylladb.git
It will call the main logic which will create the same milestone as Jira
releases in the SCYLLADB and CUSTOMER Jira projects.

Fixes: PM-100

Closes scylladb/scylladb#27717
2025-12-17 10:05:06 +01:00
Tomasz Grabiec
c077283352 Merge 'service: support conversion of tablet keyspaces to rack-list using ALTER KEYSPACE' from Aleksandra Martyniuk
If a keyspace has a numeric replication factor in a DC and rf < #racks,
then the replicas of tablets in this keyspace can be distributed among
all racks in the DC (different for each tablet). With rack list, we need all
tablet replicas to be placed on the same racks. Hence, the conversion
requires tablet co-location.

After this series, the conversion can be done using ALTER KEYSPACE
statement. The statement that does this conversion in any DC is not
allowed to change a rf in any DC. So, if we have dc1 and dc2 with 3 racks
each and a keyspace ks then with a single ALTER KEYSPACE we can do:
- {dc1 : 2} -> {dc1 : [r1, r2]};
- {dc1 : 2, dc2: 2} ->  {dc1 : [r1, r2], dc2: [r2,r3]};
- {dc1 : 2, dc2: 2} -> {dc1 : [r1, r2], dc2: 2}
- {dc1 : 2} -> {dc1 : 2, dc2 : [r1]}
But we cannot do:
- {dc1 : 2} -> {dc1 : [r1, r2, r3]};
- {dc1 : 1, dc2 : [r1, r2] → dc1: [r1], dc2: [r1].

In order to do the co-locations rf change request is paused. Tablet
load balancer examines the paused rf change requests and schedules
necessary tablet migrations. During the process of co-location, no other
cross-rack migration is allowed.

Load balancer checks whether any paused rf change request is
ready to be resumed. If so, it puts the request back to global topology
request queue.

While an rf change request for a keyspace is running, any other rf change
of this keyspace will fail.

Fixes: #26398.

New feature, no backport

Closes scylladb/scylladb#27279

* github.com:scylladb/scylladb:
  test: add est_rack_list_conversion_with_two_replicas_in_rack
  test: test creating tablet_rack_list_colocation_plan
  test: add test_numeric_rf_to_rack_list_conversion test
  tasks: service: add global_topology_request_virtual_task
  cql3: statements: allow altering from numeric rf to rack list
  service: topology_coordinator: pause keyspace_rf_change request
  service: implement make_rack_list_colocation_plan
  service: add tablet_rack_list_colocation_plan
  cql3: reject concurrent alter of the same keyspace
  test: check paused rf change requests persistence
  db: service: add paused_rf_change_requests to system.topology
  service: pass topology and system_keyspace to load_balancer ctor
  service: tablet_allocator: extract load updates
  service: tablet_allocator: extract ensure_node
  tasks, system_keyspace: Introduce get_topology_request_entry_opt()
  node_ops: Drop get_pending_ids()
  node_ops: Drop redundant get_status_helper()
2025-12-17 10:05:06 +01:00
Anna Stuchlik
7061384a27 doc: replace "Scylla" with "ScyllaDB" in the Alternator docs
Fixes https://github.com/scylladb/scylladb/issues/27708

Closes scylladb/scylladb#27709
2025-12-17 10:05:06 +01:00
Tomasz Grabiec
7bc59e93b2 Fix lambda-coroutine fiasco in hint_endpoint_manager.cc
Found by copilot.

No issue was observed yet.

Fixes #27520

Closes scylladb/scylladb#27477
2025-12-16 20:16:41 +03:00
Łukasz Paszkowski
a61c221902 test/pylib/suite/python.py: Handle extra_cmdline_options correctly
runner.py defines a command-line option `--extra-scylla-cmdline-options`
with the default type=str. However, the function `merge_cmdline_options`,
which consumes this value to merge command-line options from multiple
sources, expects a list of strings.

This mismatch results in the following exception:
```
raise ValueError(f'invalid argument name {name}, all args {args}')
ValueError: invalid argument name o, all args --logger-log-level repair=debug --default-log-level=error
```
when a test is run with pytest using:
`--extra-scylla-cmdline-options='--logger-log-level repair=debug --default-log-level=error'`

Fix this by handling the option consistently and calling `.split()`.
Also change the default value from an empty list to an empty string
to avoid confusion both in runner.py and test.py.

Closes scylladb/scylladb#27523
2025-12-16 20:14:43 +03:00
Benny Halevy
798714183e table: get_snapshot_details: add FIXME comments
Ref https://github.com/scylladb/seastar/pull/3163

We can optimize the stat calls we use here by
using open_directory to open the snapshot,
base, and staging directory once, and using statat
calls for the relative name instead of the full
blown file_stat that needs to traverse the whole
path prefix for every call (the dirents are likely
to be cached, but still why waste cpu cycles on that
over and over again).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-16 18:45:56 +02:00
Benny Halevy
f5ca3657e2 table: get_snapshot_details: lookup entries also in the staging directory
Since the sstables in the snapshot may still be in the staging dir.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-16 18:42:05 +02:00
Benny Halevy
dc00461adf table: get_snapshot_details: optimize using the entry number_of_links
If the number_of_linkes equals 1, we can be sure that
the file exists only in the snapshot directory so there is no need
to look it up in the data directory.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-16 18:42:05 +02:00
Benny Halevy
be6d87648c table: get_snapshot_details: continue loop for manifest and schema entries
Now that we're using a simple loop in the coroutine just continue
the loop for files we want to ignore.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-16 18:42:05 +02:00
Benny Halevy
004c08f525 table: get_snapshot_details: use directory_lister
It is more efficient to use the coroutine generator
to list the directory.

Brewing changes in seastar would make the generator buffered
as well as adding an extended generation that would
return the file stat data for each entry, that would become
useful in the next patch that optimizes the algorithm by
considering the entry's link count.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-16 18:42:05 +02:00
David Garcia
386ec0af4e docs: fix multiversion build
Installs sphinx-scylladb-markdown = "^0.1.4" to fix the multiversion build.

Details in https://github.com/scylladb/sphinx-scylladb-theme/pull/1552

fix: update poetry

Closes scylladb/scylladb#27619
2025-12-16 19:41:03 +03:00
Pavel Emelyanov
c4496dd63c Merge 'test/cqlpy: rename tests with duplicate name' from Nadav Har'El
When translating Cassandra's unit tests, in a couple of places I accidentally used the same name for two tests, resulting in the first of each pair to never running.

Let's fix the name of the second of the each pair to be the real name it had in the original Cassandra test.

Closes scylladb/scylladb#27644

* github.com:scylladb/scylladb:
  test/cqlpy: rename test with duplicate name
  test/cqlpy: rename test with duplicate name
2025-12-16 19:32:20 +03:00
Nadav Har'El
84df5cfaf8 test/alternator: delete unnecessary "pass"
Fixing something that never bothered anyone but our automated "code quality" tool: there's an unnecessary call to "pass" in one of our tests. Just remove it.

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27645
2025-12-16 19:29:23 +03:00
Botond Dénes
f06db096bd test/boost/reader_concurrency_semaphore_test: un-flake memory limit engages test
This test was observed to fail multiple times recently in promotion,
because there were successful reads. The failure only reproduces on
arm64, it doesn't reproduce on x86.
The suspected reason is that the data set is too close to the edge,
where all reads fail due to too high memory consumption. Reduce the
number of sstables used by this test to 54 (from 64).

Fixes: #27248

Closes scylladb/scylladb#27650
2025-12-16 19:24:49 +03:00
Pavel Emelyanov
31f90c089c Merge 'test/alternator: remove unused variable assignments and statements' from Nadav Har'El
Copilot found in test/alternator a bunch of places where we unnecessarily assign a variable that we don't use, or had a duplicated statement which doesn't do anything. This patch fixes all of them. AI still doesn't know how to prepare a patch that looks anything close to reasonable, so I did this part manually, and also carefully investigated each and every change (this took **a lot** of human time).

These patches don't change anything in the functionality of any of the tests. It's all cosmetic.

Closes scylladb/scylladb#27655

* github.com:scylladb/scylladb:
  test/alternator: remove unnecessary duplicate statement
  test/alternator: remove unused variable assignments
2025-12-16 19:23:34 +03:00
Nadav Har'El
c58739de6a Fix for Unused local variable
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27665
2025-12-16 19:20:53 +03:00
Benny Halevy
9e18cfbe17 sstable: add _mutate_sem to serialize link/move with components rewrite
We currently have races, like between moving an sstable from staging
using change_state, or when taking a snapshot, to e.g.
rewrite_statistics that replaces one of the sstable component files
when called, for example, from update_repaired_at by incremental repair.

Use a semaphore as a mutex to serialize those functions.
Note that there is no need for rwlock since the operations
are rare and read-only operations like snapshot don't
need to run in parallel.

Fixes #25919

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-16 17:06:45 +02:00
Piotr Dulikowski
7900aa5319 Merge 'server: fix scheduling group update timing in system.clients' from Alex Dathskovsky
Previously, the scheduling_group column was updated during the switch_tenant function, which meant the update occurred only after the tenant change operation completed—updating rows one by one. With this change, the scheduling_group column is now updated before the switch_tenant logic runs, ensuring that the table reflects the correct scheduling groups for all rows as early as possible.

fixes: #26060
fixes: #27295

backport: not required
this is a minor bug fix. Internal logic worked but the user couldnt see the change in the table if they would read the system.clients table

Closes scylladb/scylladb#26404

* github.com:scylladb/scylladb:
  test: cqlpy: Remove test_switch_tenants and add test in cluster testing. The test needs to run twice, in two separate Scylla runs, using two different modes: gossip and raft. The cluster framework supports this setup, while cqlpy only runs against Scylla instances in raft mode. Therefore, the test was moved from cqlpy to the cluster-based framework. This commit both adds the test in cluster/ and removes the old version in cqlpy/.
  server: Refactor update_control_connection_scheduling_group functionality This refactoring moves the logic that retrieves the scheduling group for driver_service_level_name out of switch_tenant. This change is possible because the scheduling group for the driver is retrieved from a map (LOOKUP). The lookup function is fully synchronized, non-coroutine, and returns immediately. For that reason, it’s better to perform this lookup outside of the switch_tenant function.
  server: Refactor scheduling group update functionality. This change generalizes the scheduling-group update functionality and removes some copy-paste code, improving overall readability and maintainability. To achieve this, capturing lambdas were introduced. As a result, self-deducing this was added to those lambdas to avoid coroutine-related issues (“coroutine fiasco”).
  server: Fix switch_tenant problem, When running on a V2 server, service-level data comes from service level cache. Because of this, we can use synchronized function to get the schedualing group. Since we are transitioning to a Raft-based architecture where all servers will be V2, we can safely implement this fix specifically for that case. This change adds get_cached_user_scheduling_group functionality and moves its usage out of switch_tenant function in update_scheduling_group_v2 usage.
  server: Add update_service_level_scheduling_group_v1 functions to create placehholder for functionality that will introduce v2 implementation. The new functionality will allow usage of service level cache
2025-12-16 15:39:49 +01:00
Aleksandra Martyniuk
9d20f0a3d2 test: add est_rack_list_conversion_with_two_replicas_in_rack 2025-12-16 13:31:24 +01:00
Aleksandra Martyniuk
0476e8d272 test: test creating tablet_rack_list_colocation_plan 2025-12-16 13:31:24 +01:00
Aleksandra Martyniuk
e48789cf6c test: add test_numeric_rf_to_rack_list_conversion test 2025-12-16 13:31:24 +01:00
Aleksandra Martyniuk
9039dfa4a5 tasks: service: add global_topology_request_virtual_task
Add a service::topo::global_topology_request_virtual_task, which
covers the replication factor changes.

Currently, the global_topology_request_virtual_task can be aborted
only if it is paused.

The progress of the rf change isn't counted.
2025-12-16 13:31:22 +01:00
Aleksandra Martyniuk
1884e655d6 cql3: statements: allow altering from numeric rf to rack list
Allow altering from numeric replication factor to rack list. Ensure
that a single ALTER KEYSPACE statement doesn't try to both convert
to rack list and change rf.
2025-12-16 13:29:08 +01:00
Aleksandra Martyniuk
640c491388 service: topology_coordinator: pause keyspace_rf_change request
To do the conversion from numeric rf to rack list, we need to co-locate
tablets of the keyspace, so that all of them have replicas on the same racks.
Pause the keyspace_rf_change global topology request, so that the co-location
could be done before the ALTER KEYSPACE changes are applied.

The pause is needed if in any dc rf changes from numeric to rack list
and the co-location is necessary. In this case we don't finish the request.
Instead, we add the request to the paused request vector. No migrations are
started.
2025-12-16 13:29:08 +01:00
Aleksandra Martyniuk
cd83d1d4dc service: implement make_rack_list_colocation_plan
The make_rack_list_colocation_plan consists of two phases.

In the first phase (realized with find_required_rack_list_colocations),
we find the pairs of (replica to be co-located, destination dc and rack).
We skip the pairs related to the tablets that are in transition or for
which the load balancer migration is already planned. We group the pairs
by destination dc and rack. Thanks to that in the second phase we can
calculate the least loaded nodes and shards only once for each rack.

In the second phase, we calculate the load of the nodes in a cluster
based on current transition and previously scheduled migrations.
We utilize the map created in the first phase and choose the least
loaded targets in each rack. We skip the tablets for which the
co-location was already scheduled.

find_required_rack_list_colocations isn't a method of load_balancer,
because in the following changes it is going to be reused by topology
coordinator to determine whether the rf change should be paused.
2025-12-16 13:29:05 +01:00
Aleksandra Martyniuk
bbe0b01b14 service: add tablet_rack_list_colocation_plan
Add tablet_rack_list_colocation_plan. Keep it in migration_plan.
The plan includes a request that is ready to resume. There can be
more than one such request at the time, but we consider them one
by one for clarity of code. Rack list co-locations will be kept together
with normal load balancer migrations.

Consider normal load balancer migrations before rack list co-locations.
During rack list co-location, allow load balancer migrations to happen
only within a single rack. Do not create the merge co-location plan if
there is ongoing rack list co-location (if there are any rf changes paused).

Generate rf change resume based on the plan. Add _request_to_resume back
to global requests queue.

make_rack_list_colocation_plan will be implemented in the following change.
2025-12-16 13:27:50 +01:00
Aleksandra Martyniuk
2e7ba1f8ce cql3: reject concurrent alter of the same keyspace
Reject ALTER KEYSPACE request if there is unfinished (queued, pending,
or paused) alter request of the same keyspace.

This is required as in the following changes, global request queue
will contain rf change requests meant to be resumed.
2025-12-16 13:27:48 +01:00
Aleksandra Martyniuk
b3a0e4c2dc test: check paused rf change requests persistence 2025-12-16 13:25:38 +01:00
Aleksandra Martyniuk
08e5f35527 db: service: add paused_rf_change_requests to system.topology
In the following changes, we allow to alter from numeric rf to rack list.
Before the alter, two tablets of the same keyspace can have replicas
on different racks. To switch to rack list, we need to co-locate
the replicas. It will be achieved by pausing the keyspace_rf_change
and scheduling migrations.

We need to persist the ids of requests that are paused. A new column -
paused_rf_change_requests is added to system.topology table.

In this commit no data is kept in the new column.
2025-12-16 13:25:38 +01:00
Aleksandra Martyniuk
d66a36058b service: pass topology and system_keyspace to load_balancer ctor
Pass a pointer to service::topology and db::system_keyspace to load
balancer. It will be used in the following patches to create
rack_list_colocation plan.
2025-12-16 13:25:38 +01:00
Aleksandra Martyniuk
6681c0f33f service: tablet_allocator: extract load updates
Extract consider_scheduled_load and consider_planned_load so that
they can be reused.
2025-12-16 13:25:38 +01:00
Aleksandra Martyniuk
13e9ee3f6f service: tablet_allocator: extract ensure_node
Extract ensure_node method so that it can be reused.
2025-12-16 13:25:38 +01:00
Tomasz Grabiec
71e6ef90f4 tasks, system_keyspace: Introduce get_topology_request_entry_opt()
It's a cleanup. Better to return std::nullopt than faking an entry
with an id when require_entry == false.
2025-12-16 13:25:34 +01:00
Tomasz Grabiec
902803babd node_ops: Drop get_pending_ids()
Every pending request should also have an entry in
system.topology_requests so it's redundant.

And problematic, because we cannot build a full request entry from
just an id alone, so if we would return those requests, they would
have blank information, and logic which needs more information would
not work.
2025-12-16 13:25:11 +01:00
Tomasz Grabiec
4ed17c9e88 node_ops: Drop redundant get_status_helper() 2025-12-16 11:09:11 +01:00
Patryk Jędrzejczak
73db5c94de Merge 'db: api: service: introduce system.client_routes table and related API endpoints' from Andrzej Jackowski
`system.client_routes` is a system table that sets the target address and ports for each `host_id`, for one or more connection (e.g., Private Link) represented by `connection_id`. Cloud will write the table via REST, and drivers will read it via CQL to override values obtained from `system.local` and `system.peers`.

This patch series contains:
 - Introduction of `CLIENT_ROUTES` feature flag.
 - Implementation of raft-based `system.client_routes` table
 - Implementation of `v2/client-routes` POST/DELETE/GET endpoints
 - Implementation of new `CLIENT_ROUTES_CHANGE` event that is sent to drivers when `system.client_routes` is changed
 - New tests that verifies the aforementioned features

Ref: scylladb/scylla-enterprise#5699

For now, no automatic backport. However, the changes are planned to be release on `2025.4` either as a backport or a private build.

Closes scylladb/scylladb#27323

* https://github.com/scylladb/scylladb:
  docs: describe CLIENT_ROUTES_CHANGE extension
  test: add test for CLIENT_ROUTES event
  service: transport: add CLIENT_ROUTES_CHANGE event
  test: add cluster tests for client routes
  test: add API tests for client_routes endpoints
  test: add `timeout` parameter to `delete` in RESTClient
  test: allow json_body in send
  api: implement client_routes endpoints
  api: add client_routes.json
  service: main: add client_routes_service
  db: add system.client_routes table
  gms: add CLIENT_ROUTES feature
2025-12-16 10:38:27 +01:00
Botond Dénes
85f05fbe1b Revert "Merge 'Add digests for all sstable components in scylla metadata' from Taras Veretilnyk"
This reverts commit 866c96f536, reversing
changes made to 367633270a.

This change caused all longevities to fail, with a crash in parsing
scylla-metadata. The investigation is still ongoing, with no quick fix
in sight yet.

Fixes: #27496

Closes scylladb/scylladb#27518
2025-12-16 11:34:40 +02:00
Botond Dénes
83f46fa7f5 doc: add video link for TTL
Closes: #26210
2025-12-16 10:43:03 +02:00
Anna Stuchlik
ea6f2a21c6 doc: remove references to ScyllaDB versions 4.3 and 4.4
We should never refer to the no longer supported OSS versions.
This is a leftover - other mentions were removed long time ago.

Fixes https://github.com/scylladb/scylladb/issues/19569

Closes scylladb/scylladb#27656
2025-12-16 06:58:53 +02:00
Yaniv Kaul
30c4bc3f96 Fix for __iter__ method returns a non-iterator
To fix this issue, the std_list_iterator class defined within
std_list.__iter__ should implement the full iterator protocol by defining
an __iter__() method that returns self. This change ensures any instance
of std_list_iterator can be used as an iterator in Python for loops and
other iteration contexts, as required. The fix is to add a small method
definition inside the std_list_iterator class, ideally after the __init__
or in a logical place with the other dunder methods.

Only the code inside the std_list class's __iter__ function (lines around
the definition of the inner class and its methods) needs to be edited.

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27642
2025-12-16 06:57:19 +02:00
Piotr Smaron
77fa936edc doc: audit: update to present how to enable both syslog and table
Supporting both sinks have been introduced in
https://github.com/scylladb/scylladb/pull/26613, but it missed the docs
changes, so here they are.

Closes scylladb/scylladb#27607
2025-12-16 06:56:39 +02:00
Piotr Smaron
0ec485845b Clarify documentation build instructions
Closes scylladb/scylladb#27606
2025-12-16 06:56:00 +02:00
Botond Dénes
dace39fd6c Merge 'Make commitlog replay handle files with corrupt file header (non-zero) as data loss, not startup failure' from Calle Wilund
Fixes #26744

If a segment to replay is broken such that the main header is not zero, but still broken, we throw header_checksum_error. This was not handled in replayer, which grouped this into the "user error/fundamental problem" category.

However, assuming we allow for "real" disk corruption, this should really be treated same as data corruption, i.e. reported data loss, not failure to start up.

The `test_one_big_mutation_corrupted_on_startup` test accidentally sometimes provoked this issue, by doing random file wrecking, which on rare occasions provoked this, and thus failed test due to scylla not starting up, instead of losing data as expected.

Closes scylladb/scylladb#27556

* github.com:scylladb/scylladb:
  test::cluster::dtest::tools::files: Remove file
  commitlog_replay: Handle fully corrupt files same as partial corruption.
  test::pylib::suite::base: Split options.name test specifier only once
2025-12-16 06:55:42 +02:00
Calle Wilund
5f8f724d78 repair: Don't use off-strategy as repair destination with tablet tables
Fixes #17384

Bypasses enabling off-strategy storage/placement for repair streams
when table repaired is using tablets. Instead, the resulting sstable(s)
will be placed in the "normal" set of sstables, and bypass a post-repair
off-strategy compaction.

v2:
Bypass off-strat for whatever reason iff dest is tablets.

Closes scylladb/scylladb#27500
2025-12-16 06:54:07 +02:00
Michał Chojnowski
df93ea626b test/scylla_gdb: use gcore instead of signal SIGSEGV to generate a coredump on failure
The test fails in CI sometimes, and we want a coredump from a failure
to debug that. We made the test send a `signal SIGSEGV` to Scylla
on failure, but apparently that doesn't work as intended on our CI
hosts. (The CI runner seemingly can't find any coredump afterwards).

We can use gdb's `gcore` command to produce a coredump in a more
predictable way.

Refs scylladb/scylladb#22501

Closes scylladb/scylladb#27498
2025-12-16 06:53:43 +02:00
Botond Dénes
74347625f9 Merge 'test/alternator: add reproducers for more issues' from Nadav Har'El
This series adds an xfailing reproducers for two issue: #8070 and #27037:

27037 is about where even with alternator_streams_increased_compatibility set to true, if an attribute
is set to the same value it had but using a different JSON representation - a Alternator Streams
event is unduly produced.

8070 is about the ability to write malformed values into the database and then fail during read - instead of failing, as expected, during the write. This issue was known for years, but we never really had a reproducer for it - it's not possible to reproduce it using clean boto3 code and we need to build a request manually.

The first two patches are two small cleanups (including fixes #27372)  that I did while preparing the real tests - which are in the final two patches.

Closes scylladb/scylladb#27376

* github.com:scylladb/scylladb:
  test/alternator: add reproducer for bug with storing invalid values
  test/alternator: reproducer for issue 27375
  utils/rjson: fix error messages from rjson::parse()
  test/alternator: extract get_signed_request() to util.py
2025-12-16 06:53:14 +02:00
Andrzej Jackowski
f1fc5cc808 docs: describe CLIENT_ROUTES_CHANGE extension
Ref: scylladb/scylla-enterprise#5699
2025-12-15 18:19:37 +01:00
Andrzej Jackowski
61bbea51ad test: add test for CLIENT_ROUTES event
Ref: scylladb/scylla-enterprise#5699
2025-12-15 18:19:37 +01:00
Andrzej Jackowski
c2b1b10ca0 service: transport: add CLIENT_ROUTES_CHANGE event
Introduce the CLIENT_ROUTES_CHANGE event to let drivers refresh
connections when `system.client_routes` is modified. Some deployments
(e.g., Private Link) require specific address/port mappings that can
change without topology changes and drivers need to adapt promptly
to avoid connectivity issues.

This new EVENT type carries a change indicator plus the affected
`connection_ids` and `host_ids`. The only change value is
`UPDATE_NODES`, meaning one or more client routes were inserted,
updated, or deleted.

Drivers subscribe using the existing events mechanism, so no additional
`cql_protocol_extension` key is required.

Ref: scylladb/scylla-enterprise#5699
2025-12-15 18:19:37 +01:00
Andrzej Jackowski
ec87b92ba1 test: add cluster tests for client routes
Ref: scylladb/scylla-enterprise#5699
2025-12-15 18:17:15 +01:00
Andrzej Jackowski
9c9371511f test: add API tests for client_routes endpoints
Ref: scylladb/scylla-enterprise#5699
2025-12-15 17:46:14 +01:00
Andrzej Jackowski
2e80997630 test: add timeout parameter to delete in RESTClient
The parameter was missing and is needed to implement a test case
later in this patch series.
2025-12-15 17:44:48 +01:00
Andrzej Jackowski
1143acaf5b test: allow json_body in send
Needed to test `/v2/connection_metadata` endpoints that receive
JSON input.

Ref: scylladb/scylla-enterprise#5699
2025-12-15 17:44:48 +01:00
Andrzej Jackowski
e153cc434f api: implement client_routes endpoints
Ref: scylladb/scylla-enterprise#5699
2025-12-15 17:36:47 +01:00
Nadav Har'El
4e106b9820 test/cqlpy: remove unused variables
Copilot detected a few cases of cqlpy tests setting a variable which
they don't use. In all the cases in this patch, we can just remove
the variable. Although the AI found all these unused variables, I
verified each case carefully before changing it in this patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-15 18:11:04 +02:00
Nadav Har'El
64d9c370ee test/alternator: remove unnecessary duplicate statement
copilot noticed that test/alternator/test_scan.py had a duplicate
statement (call to full_scan()). It doesn't break the test, but also
adds nothing but confusion - so let's just remove it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-15 18:07:45 +02:00
Nadav Har'El
a3959fe3db test/alternator: remove unused variable assignments
copilot noticed in that in in many of Alternator tests, we have some
unnecessary assignments. For example, in a few places, we use the idiom:

     with pytest.raises(...):
         ret = ...

The "ret=" part is unnecessary, as this test expects the statement to
fail (hence the raises()), and ret is never assigned. The assignment
was only there because we copied this statement from another place in
the test, which does expect the statement to pass and wants to validate
the returned value.

So we should just drop the "ret=" from these tests.

Another common occurance is that we used the idiom

     response = table.do_something()

Without checking the response and no intention to check it (either we
know it will work, or we just want to check it doesn't throw). So we
can drop the "response=" here too.

All of the unused variables in this patch were discovered by Copilot,
but I reviewed each of them carefully myself and prepared this patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-15 18:07:05 +02:00
Nadav Har'El
4fa4f40712 test/cqlpy: use unique partition in test
It is traditional to use a unique (or random) partition key in cqlpy
tests, to allow multiple tests to share the same table and make the test
suite a bit faster. One of the tests, test_multi_column_relation_desc,
set up a unique key "k", but then forgot to use it and used partition
key 0 instead. Fix the test to use this k.

This problem was spotted by Copilot, who saw the unused variable k.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-15 17:08:51 +02:00
Yauheni Khatsianevich
07867a9a0d test: new LWT with counters test during tablets migration/resize
- Workload: N workers perform CAS updates
- Update counter table each time CAS was successful
- Enable balancing and increase min_tablet_count to force split,
 and lower min_tablet_count to merge.
- Run tablets migrations loop
- Stop workload and verify data consistency
2025-12-15 14:32:30 +01:00
Patryk Jędrzejczak
844545bb74 Merge 'treewide: fix cases of improper re-throwing of std::exception_ptr' from Emil Maskovsky
Fix multiple cases where the captured `std::exception_ptr` has been re-thrown via simple `throw eptr;`, which results in losing the original exception type and details.

Resolved at various places found by clang-tidy:

1. db::schema_applier

When applying schema changes, the previous implementation attempted to handle exceptions by catching and rethrowing them, but did so incorrectly: using `throw ex` with a `std::exception_ptr` loses the original exception type and details.

However, in this case, explicit exception handling is unnecessary. The only reason for catching was to ensure `ap.destroy()` is called before propagating the exception. This can be more cleanly and safely achieved using Seastar's `.finally()` continuation, which guarantees cleanup regardless of success or failure.

2. directories

The `std::exception_ptr()` has been captured for logging and then again re-thrown incorrectly via `throw ex;`. We could use `std::rethrow_exception()` here instead, but it seems to be simpler to just use regular `throw;` to rethrow the original exception, and only use the `std::current_exception()` for logging (which is a pattern used in other places as well).

3. storage_service

Here the exception has been re-thrown incorrectly in a coroutine. There it is best to use the `co_await coroutine::return_exception_ptr` to propagate exception more efficiently in a coroutine-friendly manner.

Fixes: SCYLLADB-94
Refs: scylladb/scylladb#27501

No backport: This fixes an error logging issue, that isn't a production problem by itself (only found in test), therefore not backporting to older branches.

Closes scylladb/scylladb#27613

* https://github.com/scylladb/scylladb:
  db: schema_applier: improve exception-safe cleanup
  directories: fix exception rethrowing
  storage_service: use coroutine-friendly exception propagation in join_node_response_handler
2025-12-15 13:56:45 +01:00
Nadav Har'El
ccacea621f test/cqlpy: fix flaky test test_view_in_system_tables
The cqlpy test test_materialized_view.py::test_view_in_system_tables
checks that the system table "system.built_views" can inform us that
a view has been built. This test was flaky, starting to fail quite
often recently, and this patch fixes the problem in the test.

For historic reasons  this test began by calling a utility function
wait_for_view_built() - which uses a different system table,
system_distributed.view_build_status, to wait until the view was built.
The test then immediately tries to verify that also system.built_views
lists this view.

But there is no real reason why we could assume - or want to assume -
that these two tables are updated in this order, or how much time
passed between the two tables being changed. The authors of this
test already acknowledged there is a problem - they included a hack
purporting to be a "read barrier" that claimed to solve this exact
problem - but it seems it doesn't, or at least no longer does after
recent changes to the view builder's implementation.

The solution is simple - just remove the call to wait_for_view_built()
and the "hack" after it. We should just wait in a loop (until a timeout)
for the system table that we really wanted to check - system.built_views.
It's as simple as that. No need for any other assumptions or hacks.

Fixes #27296

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27626
2025-12-15 15:29:08 +03:00
Nadav Har'El
f287484f4d test/cqlpy: rename test with duplicate name
When translating Cassandra's test validation/operations/CreateTest.java
I accidentally used the same name for two tests, resulting in the first
of them never being run.

Let's fix the name of the second of the two to be the real name it had
in the original Cassandra test.

After this patch pytest reports 16 tests in this file, instead of 15
before this patch. The previously-ignored test was correct, and it
now passes in both Scylla and Cassandra.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-15 14:19:24 +02:00
Benny Halevy
f4a4671ad6 table: seal_snapshot: avoid oversized allocation when dumping manifest.json
Currently, we first print the json contents into a stringstream buffer
and then we write it as a whole to the manifest.json file output stream.

This is not scalable and may cause large allocation for large enough number
of files.

Fixes #24216

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#27542
2025-12-15 15:19:24 +03:00
Andrzej Jackowski
70a0418102 api: add client_routes.json
Add the JSON definitions for the POST, GET, and DELETE endpoints used
to modify client routes. These endpoints are intended for Cloud to
update the `system.client_routes` table.

The API is implemented in `/v2/` because the endpoints process arrays
of objects. Handling of such structures was improved between
Swagger 1.2 and 2.0 versions. There are already similar
`get_metrics_config` and `set_metrics_config` endpoints that operate
on similar structures and they are also in /v2/.

The introduced JSON files start with `, ` but it's intended because
the files are concatenated to the existing (metrics) JSON files,
and they need to represent valid JSON after the concatenation.

Ref: scylladb/scylla-enterprise#5699
2025-12-15 13:13:46 +01:00
Andrzej Jackowski
6fcc1ecf94 service: main: add client_routes_service
Introduce `client_routes_service` for managing
`system.client_routes` table.

Ref: scylladb/scylla-enterprise#5699
2025-12-15 13:13:40 +01:00
Andrzej Jackowski
8dde70d04c db: add system.client_routes table
Introduce `system.client_routes`, a system table that sets the target
address and ports for each `host_id`, for one or more connections
(e.g., Private Link) represented by `connection_id`. Cloud will write
the table via REST, and drivers will read it via CQL to override
values obtained from `system.local` and `system.peers`.

The table is Raft-managed to provide consistent replication across
nodes.

Schema overview: each row is identified by `(connection_id, host_id)`
and describes where clients should connect: `address` and one or more of
`port`, `tls_port`, `alternator_port`, `alternator_https_port`.
`host_id` is a UUID (just as in ScyllaDB) but `connection_id` can be
any string to accept formats of all cloud providers. `address` is
also a regular string because it can represent either an IP address or
a domain. Ports are optional in the sense that at least one of
the four must be provided.

Ref: scylladb/scylla-enterprise#5699
2025-12-15 13:08:05 +01:00
Andrzej Jackowski
2e7070d3b7 gms: add CLIENT_ROUTES feature
The feature will be used later in this patch series:
 - To avoid unnecessary operations when the feature is not enabled
 - To guard new API endpoints from being used before the cluster is
   ready to use them.
 - To implement update tests (by disabling/enabling the feature)

Ref: scylladb/scylla-enterprise#5699
2025-12-15 13:08:04 +01:00
Pavel Emelyanov
3f7ee3ce5d Merge 'batchlog: make replay (flush) faster' from Botond Dénes
The batchlog table contains an entry for each logged batch that is processed by the local node as coordinator. These entries are typically very short lived, they are inserted when the batch is processed and deleted immediately after the batch is successfully applied.
When a table has `tombstone_gc = {'mode': 'repair'}` enabled, every repair has to flush all hints and batchlogs, so that we can be certain that there is no live data in any of these, older than the last repair. Since batches can contain member queries from any number of tables, the whole batchlog has to be flushed, even if repair-mode tombstone-gc is enabled for a single table.

Flushing the batchlog table happens by doing a batchlog replay. This involves reading the entire content of this table, and attempting to replay+delete any live entries (that are old enough to be replayed).  Under normal operating circumstances, 99%+ of the content of the batchlog table is partition tombstones.  Because of this, scanning the content of this table has to process thousands to millions of tombstones. This was observed to require up to 20 minutes to finish, causing repairs to slow down to a crawl, as the batchlog-flush has to be repeated at the end of the repair of each token-range.

When trying to address this problem, the first idea was that we should expedite the garbage-collection of these accumulated tombstones. This experiment failed, see https://github.com/scylladb/scylladb/pull/23752. The commitlog proved to be an impossible to bypass barrier, preventing quick garbage-collection of tombstones. So long as a single commit-log segment is alive, holding content from the batchlog table, all tombstones written after are blocked from GC.
The second approach, represented by this PR, is to not rely in tombstone GC to reduce the tombstone amount. Instead restructure the table such that a single higher-order tombstone can be used to shadow and allow for the eviction of the myriads of individual batchlog entry tombstones. This is realized by reorganizing the batchlog table such that individual batches are rows, not partitions.
This new schema is introduced by the new `system.batchlog_v2` table, introduced by this PR:

    CREATE TABLE system.batchlog_v2 (
        version int,
        stage int,
        shard int,
        written_at timestamp,
        id uuid,
        data blob,
        PRIMARY KEY ((version, stage, shard), written_at, id));

The new schema organization has the following goals:
1) Make post-replay batchlog cleanup possible with a simple range-tombstone. This allows dropping the individual dead batchlog entries, as they are shadowed by a higher level tombstone. This enables dropping tombstones without tombstone GC.
2) To make the above possible, introduce the stage key component: batchlog entries that fail the first replay attempt, are moved to the failed_replay stage, so the initial stage can be cleaned up safely.
3) Spread out the data among Scylla shards, via the batchlog shard column.
4) Make batchlog entries ordered by the batchlog create time (id). This allows for selecting batchlogs to replay, without post-filtering of batchlogs that are too young to be replayed.

Fixes: https://github.com/scylladb/scylladb/issues/23358

This is an improvement, normally not a backport-candidate. We might override this and backport to allow wider use of `tombstone_gc: {'mode': 'repair'}`.

Closes scylladb/scylladb#26671

* github.com:scylladb/scylladb:
  db/config: change batchlog_replay_cleanup_after_replays default to 1
  test/boost/batchlog_manager_test: add test for batchlog cleanup
  replica/mutation_dump: always set position weight for clustering positions
  service/storage_proxy: s/batch_replay_throw/storage_proxy_fail_replay_batch/
  test/lib: introduce error_injection.hh
  utils/error_injection: add debug log to disable() and disable_all()
  test/lib/cql_test_env: forward config to batchlog
  test/lib/cql_test_env: add batch type to execute_batch()
  test/lib/cql_assertions: add with_size(predicate) overload
  test/lib/cql_assertions: add source location to fail messages
  test/lib/cql_assertions: columns_assertions: add assert_for_columns_of_each_row()
  test/lib/cql_assertions: rows_assertions::assert_for_columns_of_row(): add index bound check
  test/lib/cql_assertions: columns_assertions: add T* with_typed_column() overload
  db/batchlog_manager: config: s/write_timeout/reply_timeot/
  db,service: switch to system.batchlog_v2
  db/system_keyspace: introduce system.batchlog_v2
  service,db: extract generation of batchlog delete mutation
  service,db: extract get_batchlog_mutation_for() from storage-proxy
  db/batchlog_manager: only consider propagation delay with tombstone-gc=repair
  db/batchlog_manager: don't drop entire batch if one mutations' table was dropped
  data_dictionary: table: add get_truncation_time()
  db/batchlog_manager: batch(): replace map_reduce() with simple loop
  db/batchlog_manager: finish coroutinizing replay_all_failed_batches
  db/batchlog_manager: improve replayAllFailedBatches logs
2025-12-15 15:05:19 +03:00
Nadav Har'El
a9442e6d56 test/cqlpy: rename test with duplicate name
When translating Cassandra's test validation/operations/DeleteTest.java
I accidentally used the same name for two tests, resulting in the first
of them never being run.

Let's fix the name of the second of the two to be the real name it had
in the original Cassandra test.

After this patch pytest reports 52 tests in this file, instead of 51
before this patch. The previously-ignored test was correct, and it
now passes in both Scylla and Cassandra.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-15 12:02:59 +02:00
Dawid Mędrek
1e14c08eee locator/token_metadata: Remove get_host_id()
The function is declared, but it's not defined or used anywhere.

Closes scylladb/scylladb#27374
2025-12-15 10:36:52 +01:00
Michael Litvak
b9ec1180f5 alternator: require rf_rack_valid_keyspaces when creating index
When creating an alternator table with tablets, if it has an index, LSI
or GSI, require the config option rf_rack_valid_keyspaces to be enabled.

The option is required for materialized views in tablets keyspaces to
function properly and avoid consistency issues that could happen due to
cross-rack migrations and pairing switches when RF-rack validity is not
enforced.

Currently the option is validated when creating a materialized view via
the CQL interface, but it's missing from the alternator interface. Since
alternator indexes are based on materialized views, the same check
should be added there as well.

Fixes scylladb/scylladb#27612

Closes scylladb/scylladb#27622
2025-12-15 10:36:57 +02:00
Michał Hudobski
12483d8c3c vector_search: throw an error when we restrict primary in vector search
We currently allow restrictions on single column primary key,
but we ignore the restriction and return all results.
This can confuse the users. We change it so such a restriction
will throw an error and add a test to validate it.

Fixes: VECTOR-331

Closes scylladb/scylladb#27143
2025-12-15 09:45:56 +02:00
Jenkins Promoter
d5641398f5 Update pgo profiles - aarch64 2025-12-15 05:16:31 +02:00
Alex
d21faab9dc test: cqlpy: Remove test_switch_tenants and add test in cluster testing. The test needs to run twice, in two separate Scylla runs, using two different modes: gossip and raft.
The cluster framework supports this setup, while cqlpy only runs against Scylla instances in raft mode.
Therefore, the test was moved from cqlpy to the cluster-based framework.
This commit both adds the test in cluster/ and removes the old version in cqlpy/.
2025-12-14 18:46:06 +02:00
Alex
30f6a40ae6 server: Refactor update_control_connection_scheduling_group functionality This refactoring moves the logic that retrieves the scheduling group for driver_service_level_name out of switch_tenant.
This change is possible because the scheduling group for the driver is retrieved from a map (LOOKUP).
The lookup function is fully synchronized, non-coroutine, and returns immediately.
For that reason, it’s better to perform this lookup outside of the switch_tenant function.
2025-12-14 18:46:06 +02:00
Alex
5579489c4c server: Refactor scheduling group update functionality. This change generalizes the scheduling-group update functionality and removes some copy-paste code, improving overall readability and maintainability.
To achieve this, capturing lambdas were introduced. As a result, self-deducing this was added to those lambdas to avoid coroutine-related issues (“coroutine fiasco”).
2025-12-14 18:46:05 +02:00
Alex
17c9d640fe server: Fix switch_tenant problem, When running on a V2 server, service-level data comes from service level cache. Because of this, we can use synchronized function to get the schedualing group.
Since we are transitioning to a Raft-based architecture where all servers will be V2, we can safely implement this fix specifically for that case.
This change adds get_cached_user_scheduling_group functionality and moves its usage out of switch_tenant function in update_scheduling_group_v2 usage.
2025-12-14 16:27:40 +02:00
Alex
f98af582a7 server: Add update_service_level_scheduling_group_v1 functions to create placehholder for functionality that will introduce v2 implementation.
The new functionality will allow usage of service level cache
2025-12-14 16:09:18 +02:00
Nadav Har'El
c06e63daed Merge 'auth: start using SHA 512 hashing originated from musl with added yielding' from Andrzej Jackowski
This patch series contains the following changes:
 - Incorporation of `crypt_sha512.c` from musl to out codebase
 - Conversion of `crypt_sha512.c` to C++ and coroutinization
 - Coroutinization of `auth::passwords::check`
 - Enabling use of `__crypt_sha512` orignated from `crypt_sha512.c` for
   computing SHA 512 passwords of length <=255
 - Addition of yielding in the aforementioned hashing implementation.

The alien thread was a solution for reactor stalls caused by indivisible
password‑hashing tasks (https://github.com/scylladb/scylladb/issues/24524).
However, because there is only one alien thread, overall hashing throughput was reduced
(see, e.g., https://github.com/scylladb/scylla-enterprise/issues/5711). To address this,
the alien‑thread solution is reverted, and a hashing implementation
with yielding is introduced in this patch series.

Before this patch series, ScyllaDB used SHA-512 hashing provided
by the `crypt_r` function, which in our case meant using the implementation
from the `libxcrypt` library. Adding yielding to this `libxcrypt`
implementation is problematic, both due to licensing (LGPL) and because the
implementation is split into many functions across multiple files. In
contrast, the SHA-512 implementation from `musl libc` has a more
permissive license and is concise, which makes it easier to incorporate
into the ScyllaDB codebase.

The performance of this solution was compared with the previous
implementation that used one alien thread and the implementation
after the alien thread was reverted. The results (median) of
`perf-cql-raw` with `--connection-per-request 1 --smp 10` parameters
are as follows:
 - Alien thread: 41.5 new connections/s per shard
 - Reverted alien thread: 244.1 new connections/s per shard
 - This commit (yielding in hashing): 198.4 new connections/s per shard

The roughly 20% performance deterioration compared to
the old implementation without the alien thread comes from the fact
that the new hashing algorithm implemented in `utils/crypt_sha512.cc`
performs an expensive self-verification and stack cleanup.

On the other hand, with smp=10 the current implementation achieves
roughly 5x higher throughput than the alien thread. In addition,
due to yielding added in this commit, the algorithm is expected
to provide similar protection from stalls as the alien thread did.
In a test that in parallel started a cassandra-stress workload and
created thousands of new connections using python-driver, the values of
`scylla_reactor_stalls_count` metric were as follows:
 - Alien thread: 109 stalls/shard total
 - Reverted alien thread: 13186 stalls/shard total
 - This commit (yielding in hashing): 149 stalls/shard total

Similarly, the `scylla_scheduler_time_spent_on_task_quota_violations_ms`
values were:
 - Alien thread: 1087 ms/shard total
 - Reverted alien thread: 72839 ms/shard total
 - This commit (yielding in hashing): 1623 ms/shard total

To summarize, yielding during hashing computations achieves similar
throughput to the old solution without the alien thread but also
prevents stalls similarly to the alien thread.

Fixes: scylladb/scylladb#26859
Refs: scylladb/scylla-enterprise#5711

No automatic backport. After this PR is completed, the alien thread should be rather reverted from older branches (2025.2-2025.4 because on 2025.1 it's already removed). Backporting of the other commits needs further discussion.

Closes scylladb/scylladb#26860

* github.com:scylladb/scylladb:
  test/boost: add too_long_password to auth_passwords_test
  test/boost: add same_hashes_as_crypt_r to auth_passwords_test
  auth: utils: add yielding to crypt_sha512
  auth: change return type of passwords::check to future
  auth: remove code duplication in verify_scheme
  test/boost: coroutinize auth_passwords_test
  utils: coroutinize crypt_sha512
  utils: make crypt_sha512.cc to compile
  utils: license: import crypt_sha512.c from musl to the project
  Revert "auth: move passwords::check call to alien thread"
2025-12-14 14:01:01 +02:00
David Garcia
c1c3b2c5bb docs: fix local build
prevents early exits in metrics docs generation to break the local build.

Fixes #27497

Closes scylladb/scylladb#27615
2025-12-14 11:48:48 +02:00
Raphael S. Carvalho
a0a7941eb1 test: Add reproducer for split vs intra-node migration race
This is a problem caught after removing split from
add_sstable_and_update_cache(), which was used by
intra node migration when loading new sstables
into the destination shard.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 17:01:18 -03:00
Raphael S. Carvalho
e3b9abdb30 test: Verify split failure on behalf of repair during critical disk utilization
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 17:01:18 -03:00
Raphael S. Carvalho
bc772b791d test: boost: Add failure_when_adding_new_sstable_test
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 17:01:18 -03:00
Raphael S. Carvalho
77a4f95eb8 test: Add reproducer for split vs incremental repair race condition
Refs #26041.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 17:01:16 -03:00
Raphael S. Carvalho
992bfb9f63 compaction: Fail split of new sstable if manager is disabled
If manager has been disabled due to out of space prevention, it's
important to throw an exception rather than silently not
splitting the new sstable.

Not splitting a sstable when needed can cause correctness issue
when finalizing split later.
It's better to fail the writer (e.g. repair one) which will be
retried than making caller think everything succeeded.

The new replica::table::add_new_sstable_and_update_cache() will
now unlink the new sstable on failure, so the table dir will
not be left with sstables not loaded.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:51 -03:00
Raphael S. Carvalho
ee3a743dc4 replica: Don't split in do_add_sstable_and_update_cache()
Now, only sstable loader on boot and refresh from upload uses this
procedure. The idea is that maybe_split_new_sstable() will throw
when compaction cannot run due to e.g. out of space prevention.
It could fail repair writer, but we don't want it to fail boot.

As for refresh from upload, it's not supposed to work when tablet
map at the time of backup is not the same when restoring.
Even before this, refresh would fail if split already executed,
split would only happen if split was still ongoing. We need
token range stability for local restore. The safe variant will
always be load and stream.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:51 -03:00
Raphael S. Carvalho
48d243f32f streaming: Leave sstables unsealed until attached to the table
We want the invariant that after ACK, all sealed sstables will be split.

This guarantee that on restart, no unsplit sstables will be found
sealed.

The paths that generate unsplit sstables are streaming and file
streaming consumers. It includes intra-node streaming, which
is local but can clone an unsplit sstable into destination.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:51 -03:00
Raphael S. Carvalho
d9d58780e2 replica: Wire add_new_sstables_and_update_cache() into intra-node streaming
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:51 -03:00
Raphael S. Carvalho
ddb27488fa replica: Wire add_new_sstable_and_update_cache() into file streaming consumer
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:51 -03:00
Raphael S. Carvalho
10225ee434 replica: Wire add_new_sstable_and_update_cache() into streaming consumer
After the wiring, failure to attach the new sstable in the streaming
consumer will unlink the sstable automatically.

Fixes #27414.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:51 -03:00
Raphael S. Carvalho
a72025bbf6 replica: Document old add_sstable_and_update_cache() variants
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:51 -03:00
Raphael S. Carvalho
3f8363300a replica: Introduce add_new_sstables_and_update_cache()
Piggyback on new add_new_sstable_and_update_cache(), replacing
the previous add_sstables_and_update_cache().

Will be used by intra-node migration since we want it to be
safe when loading the cloned sstables. An unsplit sstable
can be cloned into destination which already ACKed split,
so we need this variant which splits sstable if needed,
while it's unsealed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
63d1d6c39b replica: Introduce add_new_sstable_and_update_cache()
Failure to load sstable in streaming can leave sealed sstables
on disk since they're not unlinked on failure.

This can result in several problems:
1) Data resurrection: since the sstable may contain deleted data
2) Split issue: since the finalization requires all sstables to be split
3) Disk usage issue: since the sstables hold space and streaming retries
can keep accumulating these files.

This new procedure will be later wired into streaming consumers, in
order to fix those problems.

Another benefit of the interface is that if there's split when adding
the new sstable, the output sstables will be returned to the caller,
allowing them to register the actual loaded sstables into e.g.
the view builder.

Refs #27414.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
27d460758f replica: Account for sstables being added before ACKing split
We want the invariant that after ACK, all sealed sstables will be split.
If check-and-attach is not atomic, this sequence is possible:

1) no split decision set.
2) Unsplit sstable is checked, no need to split, sealed.
3) split decision is set and ACKed
4) unsplit sstable is attached

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
794e03856a replica: Remove repair read lock from maybe_split_new_sstable()
The lock is intended to serialize some maintenance compactions,
such as major, with repair. But maybe_split_new_sstable() is
restricted solely to new sstables that aren't part of the
sstable set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
2dae0a7380 compaction: Preserve state of input sstable in maybe_split_new_sstable()
This is crucial with MVs, since the splitting must preserve the state of
the original sstable. We want the sstable to be in staging dir, so it's
excluded when calculating the diff for performing pushes to view
replicas.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
1fdc410e24 Rename maybe_split_sstable() to maybe_split_new_sstable()
Since the function must only be used on new sstables, it should
be renamed to something describing its usage should be restricted.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
1a077a80f1 sstables: Allow storage::snapshot() to leave destination sstable unsealed
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
c5e840e460 sstables: Add option to leave sstable unsealed in the stream sink
That will be needed for file streaming to leave output sstable unsealed.

we want the invariant where all sealed sstables are split after split
was ACKed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
c10486a5e9 test: Verify unsealed sstable can be compacted
This is crucial for splitting before sealing the sstable produced by
repair. This way, unsplit sstables won't be left on disk sealed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
ab82428228 sstables: Allow unsealed sstable to be loaded
File streaming will have to load an unsealed sstable, so we need
to be able to parse components from temporary TOC instead.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
b1be4ba2fc sstables: Restore sstable_writer_config::leave_unsealed
This option was retired in commit 0959739216, but
it will be again needed in order to implement split before sealing.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Emil Maskovsky
5e7456936e db: schema_applier: improve exception-safe cleanup
When applying schema changes, the previous implementation attempted to
handle exceptions by catching and rethrowing them, but did so
incorrectly: using `throw ex` with a `std::exception_ptr` loses the
original exception type and details. The correct approach is to use
`std::rethrow_exception()`.

However, in this case, explicit exception handling is unnecessary. The
only reason for catching was to ensure `ap.destroy()` is called before
propagating the exception. This can be more cleanly and safely achieved
using Seastar's `.finally()` continuation, which guarantees cleanup
regardless of success or failure.

This change removes the manual try/catch/rethrow and uses `.finally()`
to ensure proper cleanup, letting exceptions propagate naturally and
preserving their type and information.

Fixes: SCYLLADB-94
Refs: scylladb/scylladb#27501
2025-12-12 18:18:31 +01:00
Emil Maskovsky
e6f5f2537e directories: fix exception rethrowing
Fix location identified by clang-tidy where `std::exception_ptr` was
incorrectly rethrown using `throw ep;`. The correct approach is to use
`std::rethrow_exception(ep)`, which preserves the original exception
type and stack trace.

But this can be even further simplified by logging the current exception
with `std::current_exception()` and rethrowing using `throw;` instead of
capturing and rethrowing a `std::exception_ptr`. This matches the
idiomatic pattern used elsewhere in the codebase and improves clarity.

This change ensures proper exception propagation and avoids type slicing
or loss of diagnostic information.
2025-12-12 18:10:20 +01:00
Emil Maskovsky
76aacc00f2 storage_service: use coroutine-friendly exception propagation in join_node_response_handler
Improve exception handling in join_node_response_handler by using
`co_await coroutine::return_exception_ptr` to propagate exceptions.
This replaces the incorrect direct throw of `std::exception_ptr` and
ensures proper coroutine-friendly exception propagation.
2025-12-12 17:59:54 +01:00
Cezar Moise
7c8ab3d3d3 test.py: add pid to ServerInfo
Adding pid info to servers allows matching coredumps with servers

Other improvements:
- When replacing just some fields of ServerInfo, use `_replace` instead of
building a new object. This way it is agnostic to changes to the Object
- When building ServerInfo from a list, the types defined for its fields are
not enforced, so ServerInfo(*list) works fine and does not need to be changed if
fields are added or removed.
2025-12-12 15:11:03 +02:00
Botond Dénes
cb7f2e4953 docs: scylla-sstable-script-api.rst: add introduction and title 2025-12-12 13:50:12 +02:00
Botond Dénes
dd5b6770c8 docs: scylla-sstable.rst: extract script API to separate document
The script API is 500+ lines long in an already too long and hard to
navigate document. Extract it to a separate document, making both
documents shorter and easier to navigate.
2025-12-12 13:44:32 +02:00
Botond Dénes
7e7e378a4b Merge 'Revert "Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy"' from null
Reverts commit 8192f45e84.

The merge exposed a critical bug where truncate operations during table drop with auto-snapshot fail, causing Raft applier fiber to stop with unhandled exceptions. This leads to schema inconsistencies across nodes and test failures with "Keyspace does not exist" errors.

**Root Cause**

Commit 19b6207f modified `truncate_table_on_all_shards` to set `use_sstable_identifier = true`:

```cpp
// Before (working)
co_await table::snapshot_on_all_shards(sharded_db, table_shards, name);

// After (broken)
auto opts = db::snapshot_options{.use_sstable_identifier = true};
co_await table::snapshot_on_all_shards(sharded_db, table_shards, name, opts);
```

This triggers exceptions during snapshot that propagate through Raft state machine, causing:
- Raft applier stops: `raft::state_machine_error` at `raft/server.cc:1369`
- Schema changes fail to propagate
- Nodes report non-existent keyspaces for valid schemas

**Changes**

Reverts 15 files (200 deletions, 74 insertions):
- Removes `use_sstable_identifier` from truncate/snapshot code paths
- Reverts `snapshot_options` struct back to simple `skip_flush` boolean
- Removes REST API and nodetool `--use-sstable-identifier` parameter
- Removes feature tests from `test/boost/database_test.cc`

No backport required - the original feature was merged to master only and never released.

<!-- START COPILOT ORIGINAL PROMPT -->

<details>

<summary>Original prompt</summary>

----

*This section details on the original issue you should resolve*

<issue_title>test_table_drop_with_auto_snapshot failed with InvalidRequest</issue_title>
<issue_description>Seen in: https://jenkins.scylladb.com/job/scylla-master/job/next/9968//testReport
Logs: [download](https://downloads.scylladb.com/unstable/scylla/master/testLogs/2025-12-08T15:05:42Z/)

Error message:
```
cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="Keyspace test does not exist"
```

Stacktrace:
```
manager = <test.pylib.manager_client.ManagerClient object at 0xffff734e6c10>

    @pytest.mark.asyncio
    async def test_table_drop_with_auto_snapshot(manager: ManagerClient):
        logger.info("Bootstrapping cluster")
        cfg = { 'auto_snapshot': True }
        servers = await manager.servers_add(3, config = cfg)

        cql = manager.get_cql()

        # Increases the chance of tablet migration concurrent with schema change
        await inject_error_on(manager, "tablet_allocator_shuffle", servers)

        for i in range(3):
            await cql.run_async("DROP KEYSPACE IF EXISTS test;")
            await cql.run_async("CREATE KEYSPACE IF NOT EXISTS test WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1} AND tablets = {'initial': 8 };")
            await cql.run_async("CREATE TABLE IF NOT EXISTS test.tbl_sample_kv (id int, value text, PRIMARY KEY (id));")
>           await cql.run_async("INSERT INTO test.tbl_sample_kv (id, value) VALUES (1, 'ala');")
E           cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="Keyspace test does not exist"

test/cluster/test_tablets2.py:173: InvalidRequest
```
</issue_description>

<agent_instructions>this issue was exposed by commit 8192f45e84, please send a pull request reverting that merge commit and mark it as fixing this github issue.</agent_instructions>

<comments>
<comment_new><author>@yaronkaikov</author><body>
@denesb is this something in your team area? if not , please feel free to delegate it or un-assign yourself :-)</body></comment_new>
<comment_new><author>@nyh</author><body>
This is very strange. Clearly the keyspace `test` does exist at this point, because we created it two lines above and also we ran `CREATE TABLE .. test.tbl_sample_kv` which would have failed if the keyspace `test` didn't exist - so it must exit, no?

In the past, we had a bug where the running `CREATE KEYSPACE IF NOT EXISTS` forgot to set the "schema modified" event in the response so it failed to wait for schema agreement, but 1. we fixed this bug (https://github.com/scylladb/scylladb/pull/18819 by @nuivall ) and 2. this bug didn't happen in this case, where CREATE TABLE deed had work to do.

But I just realized something... Our fix in https://github.com/scylladb/scylladb/pull/18819 only applies to CREATE KEYSPACE / TABLE / VIEW / TYPE statements. It wasn't applied to `DROP KEYSPACE` - and it should have been....

But I don't have a good theory how a bug like https://github.com/scylladb/scylladb/pull/18819 can explain this specific test failure. Different schema operations are already linearized, so if a `CREATE TABLE test.tbl_sample_kv` succeeded, I don't see how there could possibly be any earlier `DROP KEYSPACE test` that suddenly springs to life. Unless we have a serious bug in our raft-based schema operations.</body></comment_new>
<comment_new><author>@nyh</author><body>
Another bug we could have in theory is that the Python driver's async `cql.run_async` might have a bug where it is not waiting for the schema agreement despite being told to wait. If it doesn't wait for schema agreement, this can easily explain this bug:
1. the CREATE KEYSPACE, CREATE TABLE both are sent to node A, but
2. the last INSERT INTO is sent to node B which is not yet aware of this new keyspace and table, and fails.

Copilot claims that **execute_async() does have this bug!**

> For schema-altering statements, schema agreement (meaning all nodes agree on the new schema) is important before running follow-up operations, but this is enforced only by synchronous helpers like Session.execute(), not the asynchronous version.
> If you use execute_async() for schema operations, you are responsible for checking schema agreement yourself, using [Session.check_schema_agreement()](https://docs.datastax.com/en/developer/python-driver/latest/api/cassandra/cluster/#cassandra.cluster.Session.check_schema_agreement) or (in newer code) ResponseFuture.check_schema_agreement.
> According to [a discussion on the DataStax support forum](https://support.datastax.com/s/article/Does-the-Python-Driver-for-Cassandra-Wait-for-Schema-Agreement-after-a-Schema-Change?language=en_US) and the [driver’s source code](7f12a5e1c6/cassandra/cluster.py (L487)), schema agreement is not ch...

</details>

<!-- START COPILOT CODING AGENT SUFFIX -->

- Fixes scylladb/scylladb#27501

<!-- START COPILOT CODING AGENT TIPS -->
---

Closes scylladb/scylladb#27604

* github.com:scylladb/scylladb:
  Revert "Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy"
  Initial plan
2025-12-12 13:20:49 +02:00
Botond Dénes
3d73a9781e docs: scylla-sstable: prepare for script API extract
We are about to extract the script API to a separate document. In
preparation convert soon-to-be cross-document references, so they keep
working after the extraction.
2025-12-12 13:15:48 +02:00
copilot-swe-agent[bot]
77ee7f3417 Revert "Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy"
This reverts commit 8192f45e84.

The merge exposed a bug where truncate (via drop) fails and causes Raft
errors, leading to schema inconsistencies across nodes. This results in
test_table_drop_with_auto_snapshot failures with 'Keyspace test does not exist'
errors.

The specific problematic change was in commit 19b6207f which modified
truncate_table_on_all_shards to set use_sstable_identifier = true. This
causes exceptions during truncate that are not properly handled, leading
to Raft applier fiber stopping and nodes losing schema synchronization.
2025-12-12 03:55:13 +00:00
copilot-swe-agent[bot]
0ff89a58be Initial plan 2025-12-12 03:48:12 +00:00
Yaron Kaikov
f7ffa395a8 workflows: trigger CI automatically when conflicts label is removed
Add pull_request_target event with unlabeled type to trigger-scylla-ci
workflow. This allows automatic CI triggering when the 'conflicts' label
is removed from a PR, in addition to the existing manual trigger via
comment.

The workflow now runs when:
- A user posts a comment with '@scylladbbot trigger-ci' (existing)
- The 'conflicts' label is removed from a PR (new)

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-84

Closes scylladb/scylladb#27521
2025-12-11 16:48:06 +02:00
Piotr Smaron
3fa3b920de Update CODEOWNERS to remove redundant entries
Removing myself as I have no maintainer's permissions to review the code

Closes scylladb/scylladb#27576
2025-12-11 16:47:08 +02:00
Botond Dénes
e7ca52ee79 Merge 'api: storage_service/tablets/repair: disable incremental repair by default' from Benny Halevy
Change the default incremental_mode to `disabled` due to https://github.com/scylladb/scylladb/issues/26041 and https://github.com/scylladb/scylladb/issues/27414

** Backport to 2025.4 where 611918056a was introduced **

Closes scylladb/scylladb#27530

* github.com:scylladb/scylladb:
  api: storage_service/tablets/repair: disable incremental repair by default
  docs: nodetool-commands: cluster: repair: fix incremental-mode example
2025-12-11 15:23:09 +02:00
Botond Dénes
730eca5dac Merge 'Remove noexcept from storage_group and table functions to allow exception propagation' from null
Fixed a critical bug where `storage_group::for_each_compaction_group()` was incorrectly marked `noexcept`, causing `std::terminate` when actions threw exceptions (e.g., `utils::memory_limit_reached` during memory-constrained reader creation).

**Changes made:**
1. Removed `noexcept` from `storage_group::for_each_compaction_group()` declaration and implementation
2. Removed `noexcept` from `storage_group::compaction_groups()` overloads (they call for_each_compaction_group)
3. Removed `noexcept` from `storage_group::live_disk_space_used()` and `memtable_count()` (they call compaction_groups())
4. Kept `noexcept` on `storage_group::flush()` - it's a coroutine that automatically captures exceptions and returns them as exceptional futures
5. Removed `noexcept` from `table_load_stats()` functions in base class, table, and storage group managers

**Rationale:**

As noted by reviewers, there's no reason to kill the server if these functions throw. For coroutines returning futures, `noexcept` is appropriate because Seastar automatically captures exceptions and returns them as exceptional futures. For other functions, proper exception handling allows the system to recover gracefully instead of terminating.

Fixes #27475

Closes scylladb/scylladb#27476

* github.com:scylladb/scylladb:
  replica: Remove unnecessary noexcept
  replica: Remove noexcept from compaction_groups() functions
  replica: Remove noexcept from storage_group::for_each_compaction_group
2025-12-11 15:17:35 +02:00
Benny Halevy
c8cff94a5a api: storage_service/tablets/repair: disable incremental repair by default
Change the default incremental_mode to `disabled` due to
https://github.com/scylladb/scylladb/issues/26041 and
https://github.com/scylladb/scylladb/issues/27414

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-11 14:25:21 +02:00
Benny Halevy
5fae4cdf80 docs: nodetool-commands: cluster: repair: fix incremental-mode example
There is no 'regular' incremental mode anymore.
The example seems have meant 'disabled'.

Fixes #27587

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-11 14:25:11 +02:00
Marcin Maliszkiewicz
8bbcaacba1 auth: always catch by const reference
This is best practice.

Closes scylladb/scylladb#27525
2025-12-11 12:42:30 +01:00
Yaron Kaikov
3dfa5ebd7f Add JIRA issue validation to backport PR fixes check
Extend the Fixes validation pattern to also accept JIRA issue references
(format: [A-Z]+-\d+) in addition to GitHub issue references. This allows
backport PRs to reference JIRA issues in the format 'Fixes: PROJECT-123'.

Fixes: https://github.com/scylladb/scylladb/issues/27571

Closes scylladb/scylladb#27572
2025-12-11 12:23:16 +02:00
Avi Kivity
24264e24bb Revert "repair: Add tablet repair progress report support"
This reverts commit faad0167d7. It causes
a regression in

test_two_tablets_concurrent_repair_and_migration_repair_writer_level

in debug mode (with ~5%-10% probability).

Fixes #27510.

Closes scylladb/scylladb#27560
2025-12-11 12:18:11 +02:00
Nadav Har'El
0c64e3be9a Merge 'Unify and fix rjson string and string_view conversions' from Marcin Maliszkiewicz
This patch-set consolidates and corrects rjson string conversion handling.
It removes unnecessary string copies, ensures proper length usage and
replaces ad-hoc conversions with consistent helper functions.

Overall, the changes make rjson string handling safer, faster, and more uniform across the codebase.

Backport: no, it's a refactor

Closes scylladb/scylladb#27394

* github.com:scylladb/scylladb:
  fix rjson::value to bytes conversion with missing GetStringLength call
  alternator: change type from string to string_view in should_add_capacity
  fix rjson::value to string_view conversion with missing GetStringLength call
  use rjson::to_string_view when rjson::value gets converted using GetStringLength
  use rjson::to_sstring and rjson::to_string for various string conversions
  utils: use rjson document wrapper in instance_profile_credentials_provider::parse_creds
  utils: move rjson::to_string_view func to string related place
  utils: add to_sstring and to_string rjson helper
2025-12-11 12:05:41 +02:00
Nadav Har'El
b3b0860e7c test/alternator: add reproducer for bug with storing invalid values
This patch adds a reproducer for a long-known bug, #8070, where
Alternator can store invalid values which are just blindly stored as
JSON, and we will only see the failure when reading the item back -
and either the client will fail to parse it, or sometimes even Alternator's
own code (e.g., FilterExpression) will fail to parse it. The right
behavior is to fail the write - not the read.

The included test checks writing different kinds of invalid values using
PutItem, UpdateItem, and BatchWriteItem. The new tests pass on DynamoDB,
but fail on Alternator so marked as "xfail".

Refs #8070.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-11 11:58:22 +02:00
Nadav Har'El
db15c212a6 test/alternator: reproducer for issue 27375
This patch adds a reproducer for issue #27375, where even with
alternator_streams_increased_compatibility set to true, if an attribute
is set to the same value it had but using a different JSON
representation - a Alternator Streams event is unduly produced.
For example, if a map {'dog': 1, 'cat': 2} is changed to
{'cat': 2, 'dog': 1}, this non-change should not be reported.

The new test added in this patch passes on DynamoDB (an event
is not generated) but fails on Alternator (an event is generated),
so the new test is marked with xfail.

Refs #27375.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-11 11:34:19 +02:00
Nadav Har'El
3595941020 utils/rjson: fix error messages from rjson::parse()
rjson::parse() when parsing JSON stored in a chunked_content (a vector
of temporary buffers) failed to initialize its byte counter to 0,
resulting in garbage positions in error messages like:

  Parsing JSON failed: Missing a name for object member. at 1452254

These error messages were most noticable in Alternator, which parses
JSON requests using a chunked_content, and reports these errors back
to the user.

The fix is trivial: add the missing initialization of the counter.

The patch also adds a regression test for this bug - it sends a JSON
corrupt at position 1, and expect to see "at 1" and not some large
random number.

Fixes #27372

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-11 11:17:01 +02:00
Nadav Har'El
102516a787 test/alternator: extract get_signed_request() to util.py
get_signed_request() started in test_manual_requests.py as a way to sign
a manually-created DynamoDB-API request - for sending requests that boto3
can't.

Over time, we started to use this function in additional test files, and
it's about time to move it to util.py - which is more natural to import
from multiple files.

This patch also adds a new function, manual_request(), which combines
get_signed_request() and actually sending the request via
requests.post(). New tests should prefer it, because it's easier to use.
We'll use the new function in tests that we add in the next patches.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-11 11:16:42 +02:00
Marcin Maliszkiewicz
d5b63df46e transport: remove redundant futurize_invoke from counted data sink and source
Closes scylladb/scylladb#27526
2025-12-11 10:32:16 +03:00
Dario Mirovic
f545ed37bc test: dtest: audit_test.py: fix audit error log detection
`test_insert_failure_doesnt_report_success` test in `test/cluster/dtest/audit_test.py`
has an insert statement that is expected to fail. Dtest environment uses
`FlakyRetryPolicy`, which has `max_retries = 5`. 1 initial fail and 5 retry fails
means we expect 6 error audit logs.

The test failed because `create keyspace ks` failed once, then succeeded on retry.
It allowed the test to proceed properly, but the last part of the test that expects
exactly 6 failed queries actually had 7.

The goal of this patch is to make sure there are exactly 6 = 1 + `max_retries` failed
queries, counting only the query expected to fail. If other queries fail with
successful retry, it's fine. If other queries fail without successful retry, the test
will fail, as it should in such situations. They are not related to this expected
failed insert statement.

Fixes #27322

Closes scylladb/scylladb#27378
2025-12-11 10:17:07 +03:00
Benny Halevy
5f13880a91 utils: error_injection: wait_for_message: print injection_name and caller source_location on timeout
When waiting for the condition variable times out
we call on_internal_error, but unfortunately, the backtrace
it generates is obfuscated by
`coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume`.

To make the log more useful, print the error injection name
and the caller's source_location in the timeout error message.

Fixes #27531

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#27532
2025-12-10 23:25:54 +01:00
Calle Wilund
8c4ac457af test::cluster::dtest::tools::files: Remove file
This contained only one routine; `corrupt_file`, which is
highly problematic, and not used. If you want to "corrupt" a
file, it should be done controlled, not at random.
2025-12-10 15:37:04 +01:00
Calle Wilund
e48170ca8e commitlog_replay: Handle fully corrupt files same as partial corruption.
Fixes #26744

If a segment to replay is broken such that the main header is not zero,
but still broken, we throw header_checksum_error. This was not handled in
replayer, which grouped this into the "user error/fundamental problem"
category. However, assuming we allow for "real" disk corruption, this should
really be treated same as data corruption, i.e. reported data loss, not
failure to start up.

The `test_one_big_mutation_corrupted_on_startup` test accidentally sometimes
provoked this issue, by doing random file wrecking, which on rare occasions
provoked this, and thus failed test due to scylla not starting up, instead
of loosing data as expected.

Changed test to consistently cause this exact error instead.
2025-12-10 15:37:04 +01:00
Andrzej Jackowski
11ad32c85e test/boost: add too_long_password to auth_passwords_test
The test documents the current behavior of hashing algorithms that
fail if the passphrase has 512 bytes or more.

Moreover, it documents the behavior of the current bcrypt
implementation that compares only the first 72 bytes of the password.
Although we don't typically use bcrypt for password hashing, it is
possible to insert such a hash using
`CREATE ROLE ... WITH HASHED PASSWORD ...`.

Refs: scylladb/scylladb#26842
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
4c8c9cd548 test/boost: add same_hashes_as_crypt_r to auth_passwords_test
The test verifies that the old and new implementation of SHA-512
hashing returns exactly the same values.

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
98f431dd81 auth: utils: add yielding to crypt_sha512
This change allows yielding during hashing computations to prevent
stalls.

The performance of this solution was compared with the previous
implementation that used one alien thread and the implementation
after the alien thread was reverted. The results (median) of
`perf-cql-raw` with `--connection-per-request 1 --smp 10` parameters
are as follows:
 - Alien thread: 41.5 new connections/s per shard
 - Reverted alien thread: 244.1 new connections/s per shard
 - This commit (yielding in hashing): 198.4 new connections/s per shard

The alien thread is limited by a single-core hashing throughput,
which is roughly 400-500 hashes/s in the test environment. Therefore,
with smp=10, the throughput is below 50 hashes/s, and the difference
between the alien thread and other solutions further increases with
higer smp.

The roughly 20% performance deterioration compared to
the old implementation without the alien thread comes from the fact
that the new hashing algorithm implemented in `utils/crypt_sha512.cc`
performs an expensive self-verification and stack cleanup.

On the other hand, with smp=10 the current implementation achieves
roughly 5x higher throughput than the alien thread. In addition,
due to yielding added in this commit, the algorithm is expected
to provide similar protection from stalls as the alien thread did.
In a test that in parallel started a cassandra-stress workload and
created thousands of new connections using python-driver, the values of
`scylla_reactor_stalls_count` metric were as follows:
 - Alien thread: 109 stalls/shard total
 - Reverted alien thread: 13186 stalls/shard total
 - This commit (yielding in hashing): 149 stalls/shard total

Similarly, the `scylla_scheduler_time_spent_on_task_quota_violations_ms`
values were:
 - Alien thread: 1087 ms/shard total
 - Reverted alien thread: 72839 ms/shard total
 - This commit (yielding in hashing): 1623 ms/shard total

To summarize, yielding during hashing computations achieves similar
throughput to the old solution without the alien thread but also
prevents stalls similarly to the alien thread.

Fixes: scylladb/scylladb#26859
Refs: scylladb/scylla-enterprise#5711
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
4ffdb0721f auth: change return type of passwords::check to future
Introduce a new `passwords::hash_with_salt_async` and change the return
type of `passwords::check` to `future<bool>`. This enables yielding
during password computations later in this patch series.

The old method, `hash_with_salt`, is marked as deprecated because
new code should use the new `hash_with_salt_async` function.
We are not removing `hash_with_salt` now to reduce the regression risk
of changing the hashing implementation—at least the methods that change
persistent hashes (CREATE, ALTER) will continue to use the old hashing
method. However, in the future, `hash_with_salt` should be entirely
removed.

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
775906d749 auth: remove code duplication in verify_scheme
Refactoring: create a new function `verify_hashing_output` to reuse
code in `hash_with_salt` and `verify_scheme`. The change is introduced
to facilitate verification of hashing output when the implementation
is extended later in this patch series.

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
11eca621b0 test/boost: coroutinize auth_passwords_test
This commit prepares `auth_passwords_test` for using coroutines,
because later in this patch series `auth::passwords::check` and other
similar functions will return Seastar futures.

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
d7818b56df utils: coroutinize crypt_sha512
Change `sha512crypt` and `__crypt_sha512` to coroutines to allow
yielding during hash computations later in this patch series.

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
033fed5734 utils: make crypt_sha512.cc to compile
The purpose of this change is to allow the usage of Seastar futures in
crypt_sha512 later in this patch series.

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
c6c30b7d0a utils: license: import crypt_sha512.c from musl to the project
This patch imports the `crypt_sha512.c` file from the musl library.
We need it to incorporate yielding in the `crypt_r` function to avoid
reactor stalls during long hashing computations.

Before this patch series, ScyllaDB used SHA-512 hashing provided
by the `crypt_r` function, which in our case meant using the implementation
from the `libxcrypt` library. Adding yielding to this `libxcrypt`
implementation is problematic, both due to licensing (LGPL) and because the
implementation is split into many functions across multiple files. In
contrast, the SHA-512 implementation from `musl libc` has a more
permissive license and is concise, which makes it easier to incorporate
into the ScyllaDB codebase.

Both `crypt_sha512.c` and musl license are obtained from
git.musl-libc.org:
 - https://git.musl-libc.org/cgit/musl/tree/src/crypt/crypt_sha512.c
 - https://git.musl-libc.org/cgit/musl/tree/COPYRIGHT

Import commit:
  commit 1b76ff0767d01df72f692806ee5adee13c67ef88
  Author: Alex Rønne Petersen <alex@alexrp.com>
  Date:   Sun Oct 12 05:35:19 2025 +0200

  s390x: shuffle register usage in __tls_get_offset to avoid r0 as address

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
5afcec4a3d Revert "auth: move passwords::check call to alien thread"
The alien thread was a solution for reactor stalls caused by indivisible
password‑hashing tasks (scylladb/scylladb#24524). However, because
there is only one alien thread, overall hashing throughput was reduced
(see, e.g., scylladb/scylla-enterprise#5711). To address this,
the alien‑thread solution is reverted, and a hashing implementation
with yielding will be introduced later in this patch series.

This reverts commit 9574513ec1.
2025-12-10 15:36:09 +01:00
Calle Wilund
9b5f3d12a3 test::pylib::suite::base: Split options.name test specifier only once
For some arcane reason, we split optional the test pattern given to
test.py twice across '::' to get the file + case specifiers later given
to pytest etc. This means that for a test with a class group (such as some
migrated dtests), we cannot really specify the exact test to run
(pattern <file>::<class>::test).

Simply splitting only on first '::' fixes this. Should not affect any
other tests.
2025-12-10 15:35:12 +01:00
Tomasz Grabiec
0e51a1f812 replica: Remove unnecessary noexcept
Can potentially lead to unnecessary abort.

compaction_groups() and for_each_compaction_group() can throw.

Co-authored-by: bhalevy <20910904+bhalevy@users.noreply.github.com>
2025-12-10 14:51:35 +01:00
Tomasz Grabiec
8b807b299e replica: Remove noexcept from compaction_groups() functions
They can throw during merge, when the number of compaction groups
is higher than 3.

Callers can deal with that, so we shouldn't abort.
2025-12-10 14:48:23 +01:00
Tomasz Grabiec
07ff659849 replica: Remove noexcept from storage_group::for_each_compaction_group
They don't really have to be noexcept.

And "action" may actually throw, leading to abort.

It was observed to throw when creating memtable readers:

terminate called after throwing an instance of 'utils::memory_limit_reached'
   what():  kill limit triggered on semaphore sl:users by permit xxx
Aborting on shard 4, in scheduling group sl:users.

std::terminate() at ??:0
__clang_call_terminate at main.cc:0
replica::storage_group::for_each_compaction_group(std::function<void (seastar::lw_shared_ptr<replica::compaction_group> const&)>) const at ./replica/table.cc:920
 (inlined by) replica::table::add_memtables_to_reader_list(std::vector<mutation_reader, std::allocator<mutation_reader>>&, seastar::lw_shared_ptr<schema const> const&, reader_permit const&, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr const&, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>, std::function<void (unsigned long)>) const at ./replica/table.cc:196
 (inlined by) replica::table::make_reader_v2(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>) const at ./replica/table.cc:243
 (inlined by) replica::table::as_mutation_source() const::$_0::operator()(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>) const at ./replica/table.cc:3673
 (inlined by) mutation_reader std::__invoke_impl<mutation_reader, replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>>(std::__invoke_other, replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>&&, reader_permit&&, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr&&, seastar::bool_class<streamed_mutation::forwarding_tag>&&, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>&&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:61
 (inlined by) std::enable_if<is_invocable_r_v<mutation_reader, replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>>, mutation_reader>::type std::__invoke_r<mutation_reader, replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>>(replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>&&, reader_permit&&, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr&&, seastar::bool_class<streamed_mutation::forwarding_tag>&&, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>&&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:114
 (inlined by) std::_Function_handler<mutation_reader (seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>), replica::table::as_mutation_source() const::$_0>::_M_invoke(std::_Any_data const&, seastar::lw_shared_ptr<schema const>&&, reader_permit&&, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr&&, seastar::bool_class<streamed_mutation::forwarding_tag>&&, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>&&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:290
 (inlined by) std::function<mutation_reader (seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>)>::operator()(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>) const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:591
 (inlined by) mutation_source::make_reader_v2(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>) const at ././readers/mutation_source.hh:143
query::querier_base::querier_base(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position>, query::partition_slice, mutation_source const&, tracing::trace_state_ptr, query::querier_base::querier_config) at ././querier.hh:91
 (inlined by) query::querier::querier(mutation_source const&, seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position>, query::partition_slice, tracing::trace_state_ptr, query::querier_base::querier_config) at ././querier.hh:164
 (inlined by) replica::table::query(seastar::lw_shared_ptr<schema const>, reader_permit, query::read_command const&, query::result_options, std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>> const&, tracing::trace_state_ptr, query::result_memory_limiter&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l>>>, std::optional<query::querier>*) at ./replica/table.cc:3583
replica::database::query(seastar::lw_shared_ptr<schema const>, query::read_command const&, query::result_options, std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>> const&, tracing::trace_state_ptr, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l>>>, std::variant<std::monostate, db::per_partition_rate_limit::account_only, db::per_partition_rate_limit::account_and_enforce>)::$_0::operator()(reader_permit) const at ./replica/database.cc:1533
 (inlined by) seastar::noncopyable_function<seastar::future<void> (reader_permit)>::indirect_vtable_for<replica::database::query(seastar::lw_shared_ptr<schema const>, query::read_command const&, query::result_options, std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>> const&, tracing::trace_state_ptr, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l>>>, std::variant<std::monostate, db::per_partition_rate_limit::account_only, db::per_partition_rate_limit::account_and_enforce>)::$_0>::call(seastar::noncopyable_function<seastar::future<void> (reader_permit)> const*, reader_permit) (.llvm.13537529942037499926) at ././seastar/include/seastar/util/noncopyable_function.hh:158
seastar::noncopyable_function<seastar::future<void> (reader_permit)>::operator()(reader_permit) const at ././seastar/include/seastar/util/noncopyable_function.hh:215
 (inlined by) reader_concurrency_semaphore::execution_loop() (.resume) at ./reader_concurrency_semaphore.cc:980
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242
 (inlined by) seastar::internal::coroutine_traits_base<void>::promise_type::run_and_dispose() at ./build/release/seastar/./seastar/include/seastar/core/coroutine.hh:122
 (inlined by) seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2627
 (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3099
seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3267
seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0::operator()() const at ./build/release/seastar/./seastar/src/core/reactor.cc:4591
 (inlined by) void std::__invoke_impl<void, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&>(std::__invoke_other, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:61
 (inlined by) std::enable_if<is_invocable_r_v<void, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&>, void>::type std::__invoke_r<void, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&>(seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:111
 (inlined by) std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:290
std::function<void ()>::operator()() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:591

Fixes #27475

Co-authored-by: bhalevy <20910904+bhalevy@users.noreply.github.com>
2025-12-10 14:48:11 +01:00
Yaron Kaikov
d3e199984e auto-backport.py: modify instruction for making PR ready for review
Update the comment sent when PR has conflicts with clear instrauctions how to make the PR Ready for review

Fixes: https://scylladb.atlassian.net/browse/RELENG-152

Closes scylladb/scylladb#27547
2025-12-10 14:53:38 +02:00
Pavel Emelyanov
a63508a6f7 object_storage: Temporarily handle pure endpoint addresses as endpoints
To keep backward compatibility, support

- old configs -- where endpoint is just an address and port is separate.
  When it happens, format the "new" endpoint name

- lookup by address-only. If it happens, scan all endpoints and see if
  any one matches the provided address

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-10 15:33:47 +03:00
Pavel Emelyanov
ca4564e41c code: Remove dangling mentions of s3::endpoint_config
Collect some code that's unused after previous changes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-10 15:33:47 +03:00
Pavel Emelyanov
f93cafac51 docs: Update docs according to new endpoints config option format
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-10 15:33:47 +03:00
Pavel Emelyanov
a3ca4fccef object_storage: Create s3 client with "extended" endpoint name
For this, add the s3::client::make(endpoint, ...) overload that accepts
endpoint in proto://host:port format. Then it parses the provided url
and calls the legacy one, that accepts raw host string and config with
port, https bit, etc.

The generic object_storage_endpoint_param no longer needs to carry the
internal s3::endpoint_config, the config option parsing changes
respectively.

Tests, that generate the config files, and docs are updated.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-10 15:33:47 +03:00
Pavel Emelyanov
5953a89822 test: Add named constants for test_get_object_store_endpoints endpoint
names

Instead of hard-coded 'a' and 'b' here and there

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-10 15:33:46 +03:00
Pavel Emelyanov
932b008107 s3/storage: Tune config updating
Don't prepare s3::endpoint_config from generic code, jut pass the region
and iam_role_arn (those that can potentially change) to the callback.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-10 15:33:46 +03:00
Pavel Emelyanov
e47f0c6284 sstable: Shuffle args for s3_client_wrapper
Make it construct like gs_client_wrapper -- with generic endpoint param
reference and make the storage-specific casts/gets/whatever internally.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-10 15:33:46 +03:00
Nadav Har'El
8822c23ad4 Merge 'test: cqlpy: test_protocol_exceptions.py: increase cpp exceptions thr…' from Dario Mirovic
…eshold

The initial problem:

Some of the tests in test_protocol_exceptions.py started failing. The failure is on the condition that no more than `cpp_exception_threshold` happened.

Test logic:

These tests assert that specific code paths do not throw an exception anymore. Initial implementation ran a code path once, and asserted there were 0 exceptions. Sometimes an exception or several can occur, not directly related to the code paths the tests check, but those would fail the tests.

The solution was to run the tests multiple times. If there is a regression, there would be at least as many exceptions thrown as there are test runs. If there is no regression, a few exceptions might happen, up to 10 per 100 test runs. I have arbitrarily chosen `run_count = 100` and `cpp_exception_threshold = 10` values.

Note that the exceptions are counted per shard, not per code path.

The new problem:

The occassional exceptions thrown by some parts of the server now throw a bit more than before. Based on the logs linked on the issues, it is usually 12.

There are possibly multiple ways to resolve the issue. I have considered logging exceptions and parsing them. I would have to filter exception logs only for wanted exceptions. However, if a new, different exception is introduced, it might not be counted.

Another approach is to just increase the threshold a bit. The issue of throwing more exceptions than before in some other server modules should be addressed by a set of tests for that module, just like these tests check protocol exceptions, not caring who used protocol check code paths.

For those reasons, the solution implemented here is to increase `cpp_exception_threshold` to `20`. It will not make the tests unreliable, because, as mentioned, if there is a regression, there would be at least `run_count` exceptions per `run_count` test runs (1 exception per single test run).

Still, to make "background exceptions" occurence a bit more normalized, `run_count` too is doubled, from `100` to `200`. At the first glance this looks like nothing is changed, but actually doubling both run count and exception threshold here implies that the burst does not scale as much as run count, it is just that the "jitter" is bigger than the old threshold.

Also, this patch series enables debug logging for `exception` logger. This will allow us to inspect which exceptions happened if a protocol exceptions test fails again.

Fixes #27247
Fixes #27325

Issue observed on master and branch-2025.4. The tests, in the same form, exist on master, branch-2025.4, branch-2025.3, branch-2025.2, and branch-2025.1. Code change is simple, and no issue is expected with backport automation. Thus, backports for all the aforementioned versions is requested.

Closes scylladb/scylladb#27412

* github.com:scylladb/scylladb:
  test: cqlpy: test_protocol_exceptions.py: enable debug exception logging
  test: cqlpy: test_protocol_exceptions.py: increase cpp exceptions threshold
2025-12-10 10:53:30 +02:00
Marcin Maliszkiewicz
be9992cfb3 fix rjson::value to bytes conversion with missing GetStringLength call 2025-12-09 19:27:22 +01:00
Marcin Maliszkiewicz
daf00a7f24 alternator: change type from string to string_view in should_add_capacity
It avoids allocation.
2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz
62962f33bb fix rjson::value to string_view conversion with missing GetStringLength call
In some cases we unnecessarily convert to string which
causes a copy. In other we convert without calling
GetStringLength which causes iteration to dermine length
which is already known. In some cases we do even both.
This commit fixes that.
2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz
060c2f7c0d use rjson::to_string_view when rjson::value gets converted using GetStringLength
This commit is only cosmetics, changes calls to GetStringLength
into rjson::to_string_view with the same underlying implementation.
2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz
64149b57c3 use rjson::to_sstring and rjson::to_string for various string conversions
In some cases we ommit size checking which is wrong
as according to rapid json documentation strings may
contain \0 byte in the middle.
2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz
4b004fcdfc utils: use rjson document wrapper in instance_profile_credentials_provider::parse_creds
So that we can use our common utility functions.
2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz
5e38b3071b utils: move rjson::to_string_view func to string related place 2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz
225b3351fc utils: add to_sstring and to_string rjson helper
So that conversion code is common and it's easier
to avoid accidental type conversions. Additionally
according to rapid json library size must be checked
explicitly, this also avoids extra iteration in char*
to (s)string conversion.
2025-12-09 19:27:21 +01:00
Avi Kivity
80c6718ea8 build: update toolchain to Fedora 43 with clang 21.1.6
Rebase to Fedora 43 with clang 21.1 and libstdc++ 15.

Fedora container image registry moved to registry.fedoraproject.org as
it seems to be updated more regularly.

Added python3-devel to the dependencies as some packages scylla-cqlsh
depends on aren't yet available in the form of wheels for Python 3.14,
and so have to be built locally. In any case it's better to reduce
dependency on those wheels even if the ones currently missing appear
eventually.

Added libev-devel to the dependencies so that the python driver
builds correctly even if "wheels" are not published. This reduces
our dependency on the python driver's binary release schedule.
Without libev-devel, TLS does not work correctly.

We no long remove the clang and clang-libs packages. Doxygen
started depending on clang-libs, and removing them removes
doxygen, breaking the build when it looks for that. The build
will still pick up the optimized clang, since /usr/local/bin
is earlier in the path. We keep the clang package, since it allows
us to mess a little less with the directory structure.

Optimized clang binaries generates and stored in

  https://devpkg.scylladb.com/clang/clang-21.1.6-Fedora-43-aarch64.tar.gz
  https://devpkg.scylladb.com/clang/clang-21.1.6-Fedora-43-x86_64.tar.gz

With ./scripts/refresh-pgo-profiles.sh, the new compiler shows a small
performance improvement (instructions_per_op) in perf-simple-query:

clang 21:

259353.60 tps ( 64.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   35720 insns/op,   17427 cycles/op,        0 errors)
265940.08 tps ( 64.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   35725 insns/op,   17042 cycles/op,        0 errors)
262650.01 tps ( 64.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   35720 insns/op,   17240 cycles/op,        0 errors)
262881.22 tps ( 64.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   35675 insns/op,   17222 cycles/op,        0 errors)
264898.68 tps ( 64.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   35732 insns/op,   17070 cycles/op,        0 errors)
throughput:
	mean=   263144.72 standard-deviation=2528.69
	median= 262881.22 median-absolute-deviation=1753.96
	maximum=265940.08 minimum=259353.60
instructions_per_op:
	mean=   35714.47 standard-deviation=22.34
	median= 35720.38 median-absolute-deviation=10.20
	maximum=35732.14 minimum=35675.50
cpu_cycles_per_op:
	mean=   17200.12 standard-deviation=154.62
	median= 17221.70 median-absolute-deviation=129.77
	maximum=17427.33 minimum=17041.57

clang 20:

254431.39 tps ( 64.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   35883 insns/op,   17708 cycles/op,        0 errors)
259701.02 tps ( 64.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   35883 insns/op,   17351 cycles/op,        0 errors)
261166.92 tps ( 64.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   35912 insns/op,   17270 cycles/op,        0 errors)
260656.31 tps ( 64.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   35869 insns/op,   17289 cycles/op,        0 errors)
259628.13 tps ( 64.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   35946 insns/op,   17370 cycles/op,        0 errors)
throughput:
	mean=   259116.75 standard-deviation=2698.56
	median= 259701.02 median-absolute-deviation=1539.55
	maximum=261166.92 minimum=254431.39
instructions_per_op:
	mean=   35898.42 standard-deviation=30.69
	median= 35882.97 median-absolute-deviation=15.90
	maximum=35945.63 minimum=35869.02
cpu_cycles_per_op:
	mean=   17397.49 standard-deviation=178.35
	median= 17351.35 median-absolute-deviation=108.79
	maximum=17707.63 minimum=17269.68

Closes scylladb/scylladb#26773
2025-12-09 15:16:31 +02:00
Pavel Emelyanov
855b91ec20 scripts: Make PR merging check more granular
Currently we have 3 explicit checks, and some of them are configurable:
- Jenkins job being stable. Can be disabled with --force
- Whether submodule update is happenning. It's not allowed by default, and
  should be enabled with --allow-submodule option
- Target branch checking (recently merged #27249). Happens unconditionally

This PR unifies all checks in two ways.

First, each restriction can be lifted with --allow-foo options. The existing
--allow-submodule stays and two options are added:

- --allow-unstable to skip jenkins job check (like --force works now)
- --allow-any-branch to skip target branch check

Second, the --force option lifts all the known restrictions.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27294
2025-12-09 13:58:21 +02:00
Nadav Har'El
95e303faf3 Merge 'Refactor get_view_natural_endpoint' from Wojciech Mitros
With the introduction of rack-lists and the reliance of materialized views on them, the `get_view_natural_endpoint` function can be greatly simplified. When using tablets, instead of doing any index-matching, we can now pair base tables with views only in the same rack.
In this series we remove no longer needed code and reorganize the needed code for better clarity.
After the changes, the `get_view_natural_endpoint` function goes down from 245 lines to 85 lines, while the whole pairing-related text goes down from 346 lines to 239 lines.

Fixes https://github.com/scylladb/scylladb/issues/26313

Closes scylladb/scylladb#27383

* github.com:scylladb/scylladb:
  mv: replace the simple/complex rack-aware pairing with exact rack matching
  mv: split out vnode pairing code from get_view_natural_endpoint
  mv: unify self-pairing and rack-aware pairing into one bool
  mv: remove the workaround for left nodes when sending view updates
2025-12-09 13:19:13 +02:00
Nadav Har'El
8ba595e472 Merge 'alternator: fix batch writes during intranode tablet migrations' from Petr Gusev
Scylla implements `LWT` in the` storage_proxy::cas` method. This method expects to be called on a specific shard, represented by the `cas_shard` parameter. Clients must create this object before calling `storage_proxy::cas`, check its `this_shard()` method, and jump to `cas_shard.shard()` if it returns false.

The nuance is that by the time the request reaches the destination shard, the tablet may have already advanced in its migration state machine. For example, a client may acquire a `cas_shard` at the `streaming` tablet state, then submit a request to another shard via `smp::submit_to(cas_shard.shard())`. However, the new `cas_shard` created on that other shard might already be in the `write_both_read_new` state, and its `cas_shard.shard()` would not be equal to `this_shard_id()`. Such broken invariant results in an `on_internal_error` in `storage_proxy::cas`.

Clients of `storage_proxy::cas` are expected to check` cas_shard.this_shard()` and recursively jump to another shard if it returns false. Most calls to `storage_proxy::cas` already implement this logic. The only exception is `executor::do_batch_write`, which currently checks `cas_shard.this_shard()` only once. This can break the invariant if the tablet state changes more than once during the operation.

This PR fixes the issue by implementing recursive `cas_shard.this_shard()` checks in `executor::do_batch_write`. It also adds a test that reproduces the problem.

Fixes: scylladb/scylladb#27353

backport: need to be backported to 2025.4

Closes scylladb/scylladb#27396

* github.com:scylladb/scylladb:
  alternator/executor.cc: eliminate redundant dk copy
  alternator/executor.cc: release cas_shard on the original shard
  alternator/executor.cc: move shard check into cas_write
  alternator/executor.cc: make cas_write a private method
  alternator/executor.cc: make do_batch_write a private method
  alternator/executor.cc: fix indent
  test_alternator: add test_alternator_invalid_shard_for_lwt
2025-12-09 11:25:15 +02:00
Petr Gusev
608eee0357 alternator/executor.cc: eliminate redundant dk copy
A small refactoring/optimization.
2025-12-09 10:21:06 +01:00
Petr Gusev
0bcc2977bb alternator/executor.cc: release cas_shard on the original shard
Before this series, we kept the cas_shard on the original shard to
guard against tablet movements running in parallel with
storage_proxy::cas.

The bug addressed by this PR shows that this approach is flawed:
keeping the cas_shard on the original shard does not guarantee that
a new cas_shard acquired on the target shard won’t require another
jump.

We fixed this in the previous commit by checking cas_shard.this_shard()
on the target shard and continuing to jump to another shard if
necessary. Once cas_shard.this_shard() on the target shard returns
true, the storage_proxy::cas invariants are satisfied, and no other
cas_shard instances need to remain alive except the one passed
into storage_proxy::cas.
2025-12-09 10:21:06 +01:00
Petr Gusev
3a865fe991 alternator/executor.cc: move shard check into cas_write
This change ensures that if cas_shard points to a different shard,
the executor will continue issuing shard jumps until
cas_shard.this_shard() returns true. The commit simply moves the
this_shard() check from the parallel_for_each lambda into cas_write,
with minimal functional changes.

We enable test_alternator_invalid_shard_for_lwt since now it should
pass.

Fixes scylladb/scylladb#27353
2025-12-09 10:21:01 +01:00
Pavel Emelyanov
fb32e1c7ee Merge 'streaming: tablet_sstable_streamer::stream refactoring' from Ernest Zaslavsky
Refactor the way we decide the sstable belong to a tablet, fully or partially to simplify the flow and make it more readable. Also extract the logic and make it testable, add tests to cover changes

The change is purely aesthetic, no need to backport

Closes scylladb/scylladb#27101

* github.com:scylladb/scylladb:
  streaming: remove unnecessary lambda creating sstable token range
  streaming: simplify get_sstables_for_tablets logic
  streaming: switch to range-based for loop
  streaming: drop sstable skip microoptimization in tablet loop
  streaming: replace reverse iterators with reverse view in sstables scan
  streaming: return from get_sstables_for_tablets earlier
  streaming: add get_sstables_by_tablet_range tests
  test,sstables: add helper to set sstable first and last keys
  streaming: refactor get_sstables_for_tablets to make it accessible
  streaming: refactor get_sstables_for_tablets to make it testable
  streaming: refactor tablet_sstable_streamer::stream by extracting SST filtering logic
2025-12-09 10:53:57 +03:00
Patryk Jędrzejczak
b6895f0fa7 test: make test_broken_bootstrap faster
This change makes the test ~20 s faster. It's a forgotten follow-up:
https://github.com/scylladb/scylladb/pull/18927#discussion_r1627331946

Closes scylladb/scylladb#27445
2025-12-09 09:25:42 +02:00
Dario Mirovic
c30b326033 test: cqlpy: test_protocol_exceptions.py: enable debug exception logging
Enable debug logging for "exception" logger inside protocol exception tests.
The exceptions will be logged, and it will be possible to see which ones
occured if a protocol exceptions test fails.

Refs #27272
Refs #27325
2025-12-09 01:35:42 +01:00
Dario Mirovic
807fc68dc5 test: cqlpy: test_protocol_exceptions.py: increase cpp exceptions threshold
The initial problem:

Some of the tests in test_protocol_exceptions.py started failing. The failure is
on the condition that no more than `cpp_exception_threshold` happened.

Test logic:

These tests assert that specific code paths do not throw an exception anymore.
Initial implementation ran a code path once, and asserted there were 0 exceptions.
Sometimes an exception or several can occur, not directly related to the code paths
the tests check, but those would fail the tests.

The solution was to run the tests multiple times. If there is a regression, there
would be at least as many exceptions thrown as there are test runs. If there is no
regression, a few exceptions might happen, up to 10 per 100 test runs.
I have arbitrarily chosen `run_count = 100` and `cpp_exception_threshold = 10` values.

Note that the exceptions are counted per shard, not per code path.

The new problem:

The occassional exceptions thrown by some parts of the server now throw a bit more
than before. Based on the logs linked on the issues, it is usually 12.

There are possibly multiple ways to resolve the issue. I have considered logging
exceptions and parsing them. I would have to filter exception logs only for wanted
exceptions. However, if a new, different exception is introduced, it might not be
counted.

Another approach is to just increase the threshold a bit. The issue of throwing
more exceptions than before in some other server modules should be addressed by
a set of tests for that module, just like these tests check protocol exceptions,
not caring who used protocol check code paths.

For those reasons, the solution implemented here is to increase `cpp_exception_threshold`
to `20`. It will not make the tests unreliable, because, as mentioned, if there is a
regression, there would be at least `run_count` exceptions per `run_count` test runs
(1 exception per single test run).

Still, to make "background exceptions" occurence a bit more normalized, `run_count` too
is doubled, from `100` to `200`. At the first glance this looks like nothing is changed,
but actually doubling both run count and exception threshold here implies that the
exception burst does not scale as much as run count, it is just that the "jitter" is
bigger than the old threshold.

Fixes #27247
Fixes #27325
2025-12-09 01:34:48 +01:00
Michał Jadwiszczak
51843195f7 test/boost/view_build_test: increase number of retires
Default number of retires in `eventually()` in `test_builder_with_concurrent_drop`
sometimes is not enough to observe changes in system tables on aarch64
builds.

This patch increases the number of retries to 30.

Fixes scylladb/scylladb#27370

Closes scylladb/scylladb#27493
2025-12-08 23:14:01 +02:00
Gleb Natapov
7038b8b544 test/scylla_cluster: fix the check that a process failed to start
If the process is running returncode will be Node, otherwise it will
have some value (which can be 0 s well) and the current code treats 0
as if the process is still running.

Closes scylladb/scylladb#27490
2025-12-08 18:23:29 +01:00
Tomasz Grabiec
7df610b73d sstables: Remove host id mismatch warning for sstable streaming
Tablet migration transfers sstable files without changing origin
host-id.  As it should, becuase those sstables were not written on the
destination host, and should be ignored by commit log replay.

So it's a normal situation, and it's confusing to see this warning in
logs.

Fixes #26957

Closes scylladb/scylladb#27433
2025-12-08 18:39:22 +02:00
Piotr Dulikowski
386309d6a0 Merge 'Improve the way distributed-loader constructs storage_options for backup sstables' from Pavel Emelyanov
The distributed_loader::get_sstables_from_object_store() method accepts an endpoint parameter and internally wants to get storage type for that endpoint (s3 or gcs). This is needed to construct storage_options object to create an sstable object.

To get the type, the method scans db::config option, but there's much simpler way to get one.

Code cleanup, no need to backport

Closes scylladb/scylladb#27381

* github.com:scylladb/scylladb:
  sstables_loader: Provide endpoint type for get_sstables_from_object_store()
  storage_manager: Introduce get_endpoint_type() method
  storage_manager: Split get_endpoint_client()
2025-12-08 16:55:20 +01:00
Yauheni Khatsianevich
f12adfc292 test/lwt: add counter-table support to BaseLWTTester
Extend BaseLWTTester with optional counter-table configuration and
verification, enabling randomized LWT tests over tablets
 with counters.
2025-12-08 15:57:33 +01:00
Amnon Heiman
a213e41250 scylla-node-exporter: Add ethtool to node exporter
AWS suggests following multiple network performance metrics:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-network-performance-ena.html#network-performance-metrics

This patch enables the ethtool collector with the specific list of
metrics

Ater this patch the relevant metris looks like:

$ curl http://localhost:9100/metrics |& grep ethtool
node_ethtool_bw_in_allowance_exceeded{device="ens5"} 0
node_ethtool_bw_out_allowance_exceeded{device="ens5"} 0
node_ethtool_conntrack_allowance_available{device="ens5"} 51303
node_ethtool_conntrack_allowance_exceeded{device="ens5"} 0
node_ethtool_info{bus_info="0000:00:05.0",device="ens5",driver="ena",expansion_rom_version="",firmware_version="",version="6.14.0-1015-aws"} 1
node_ethtool_linklocal_allowance_exceeded{device="ens5"} 0
node_scrape_collector_duration_seconds{collector="ethtool"} 0.001091436
node_scrape_collector_success{collector="ethtool"} 1

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes scylladb/scylladb#27358
2025-12-08 14:27:10 +02:00
Dawid Mędrek
58dc414912 test/cluster/mv: Rewrite test_view_building_scheduling_group
We rewrite the test to avoid flakiness. Instead of looking at the
metrics, we make a trade-off and start depending on a less reliable
mechanism -- logs. We grep all relevant messages printed by Scylla
in TRACE mode and make sure that they were all printed from a context
using the streaming scheduling group.

Although it's a "less proper" way of testing, it should be much more
dependable and avoid flakiness.

Fixes scylladb/scylladb#25957

Closes scylladb/scylladb#26656
2025-12-08 14:24:25 +02:00
Ferenc Szili
d883ff2317 test: fix flakyness caused by TRUNCATE retries
The test test_truncate_during_topology_change tests TRUNCATE TABLE while
bootstrapping a new node. With tablets enabled TRUNCATE is a global
topology operation which needs to serialize with boostrap.

When TRUNCATE TABLE is issued, it first checks if there is an already
queued truncate for the same table. This can happen if a previous
TRUNCATE operation has timed out, and the client retried. The newly
issued truncate will only join the queued one if it is waiting to be
processed, and will fail immediatelly if the TRUNCATE is already being
processed.

In this test, TRUNCATE will be retried after a timeout (1 minute) due to
the default retry policy, and will be retried up to 3 times, while the
bootstrap is delayed by 2 minutes. This means that the test can validate
the result of a truncate which was started after bootstrap was
completed.

Because of the way truncate joins existing truncate operations, we can
also have the following scenario:
- TRUNCATE times out after one minute because the new node is being
  bootstrapped
- the client retries the TRUNCATE command which also times out after 1m
- the third attempt is received during TRUNCATE being processed which
  fails the test

This patch changes the retry policy of the TRUNCATE operation to
FallthroughRetryPolicy which guarantees that TRUNCATE will not be
retried on timeout. It also increases the timeout of the TRUNCATE from 1
to 4 minutes. This way the test will actually validate the performance
of the TRUNCATE operation which was issued during bootstrap, instead of
the subsequent, retried TRUNCATEs which could have been issued after the
bootstrap was complete.

Fixes: #26347

Closes scylladb/scylladb#27245
2025-12-08 14:13:26 +02:00
dependabot[bot]
1f777da863 build(deps): bump sphinx-scylladb-theme from 1.8.9 to 1.8.10 in /docs
Bumps [sphinx-scylladb-theme](https://github.com/scylladb/sphinx-scylladb-theme) from 1.8.9 to 1.8.10.
- [Release notes](https://github.com/scylladb/sphinx-scylladb-theme/releases)
- [Commits](https://github.com/scylladb/sphinx-scylladb-theme/commits)

---
updated-dependencies:
- dependency-name: sphinx-scylladb-theme
  dependency-version: 1.8.10
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Closes scylladb/scylladb#27468
2025-12-08 13:40:51 +02:00
Asias He
faad0167d7 repair: Add tablet repair progress report support
This patch adds tablet repair progress report support so that the user
could use the /task_manager/task_status API to query the progress.

In order to support this, a new system table is introduced to record the
user request related info, i.e, start of the request and end of the
request.

The progress is accurate when tablet split or merge happens in the
middle of the request, since the tokens of the tablet are recorded when
the request is started and when repair of each tablet is finished. The
original tablet repair is considered as finished when the finished
ranges cover the original tablet token ranges.

After this patch, the /task_manager/task_status API will report correct
progress_total and progress_completed.

Fixes #22564
Fixes #26896

Closes scylladb/scylladb#26924
2025-12-08 13:35:19 +02:00
Andrei Chekun
0115a21b9a test.py: fail test when timeout reached for boost test
There is a bug in current pytest's boost implementation. When timeout
reached process will be killed, but it was not correctly propagated,
that lead to a false positive result. This will fail test case when
timeout for the process is reached.
This is to prevent issues like this https://github.com/scylladb/scylladb/issues/27237

Closes scylladb/scylladb#27463
2025-12-08 11:49:46 +01:00
Ernest Zaslavsky
71834ce7dd streaming: remove unnecessary lambda creating sstable token range
The `sstable_token_range` lambda was only used once to create a token
range for an SSTable. Inline the construction directly where needed,
removing the extra lambda. This simplifies the code without changing
behavior.
2025-12-08 12:30:24 +02:00
Ernest Zaslavsky
df21112c39 streaming: simplify get_sstables_for_tablets logic
Remove the use of the `overlaps` helper and unnest nested conditionals
in get_sstables_for_tablets. Straightforward `before` and `after` checks
are sufficient to decide how each SSTable should be handled.
2025-12-08 12:30:24 +02:00
Ernest Zaslavsky
bd339cc4d8 streaming: switch to range-based for loop
Replace the explicit iterator loop with a range-based for loop. This
simplifies the code, enforces constness, and avoids the unnecessary
use of postfix increment. The behavior remains unchanged,but
readability and maintainability are improved.
2025-12-08 12:30:23 +02:00
Ernest Zaslavsky
91bf23eea1 streaming: drop sstable skip microoptimization in tablet loop
Remove the microoptimization that advanced over SSTables ending before a
tablet range. This approach is misleading since SSTables are not sorted
by their end token, and the extra logic adds complexity with little to
no benefit. The streaming path here is not performance‑critical, so the
simpler loop is preferable.
2025-12-08 12:30:23 +02:00
Ernest Zaslavsky
f925ed176b streaming: replace reverse iterators with reverse view in sstables scan
Use a reverse view over the SSTables vector instead of reverse iterators.
This avoids awkward rbegin/rend usage and the mental overhead of tracking
inverted sort order. With a view, we can use standard begin/end iteration
while preserving the intended scan direction.
2025-12-08 12:30:23 +02:00
Ernest Zaslavsky
68dcd1b1b2 streaming: return from get_sstables_for_tablets earlier
Check if tablets or sstables list is empty and if so, return immediately
2025-12-08 12:30:23 +02:00
Ernest Zaslavsky
6fd5160947 streaming: add get_sstables_by_tablet_range tests
Add a comprehensive test suite that exercises various combinations of
SSTable containment within tablet ranges. These cases cover boundary
conditions, partial overlaps, and full containment to validate all
recent changes made to `get_sstables_by_tablet_range`.
2025-12-08 12:30:23 +02:00
Ernest Zaslavsky
3fc914ca59 test,sstables: add helper to set sstable first and last keys
Introduce a utility helper to set the first and last decorated keys on
an SSTable. This is intended for testing purposes, making it easier to
construct SSTables with defined boundaries in unit tests.
2025-12-08 12:30:23 +02:00
Ernest Zaslavsky
6ef7ad9b5a streaming: refactor get_sstables_for_tablets to make it accessible
Create `get_sstables_for_tablets_for_tests` friend free function
for testing purposes. Adding this free function allows
direct testing without requiring the full streamer context.
2025-12-08 12:30:23 +02:00
Ernest Zaslavsky
581b8ace83 streaming: refactor get_sstables_for_tablets to make it testable
Make the `get_sstables_for_tablets` member function `static`. This
is a step toward improved testability, allowing the function to be
invoked directly without requiring a full instance of the streamer.
2025-12-08 12:30:23 +02:00
Pavel Emelyanov
8192f45e84 Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy
This change adds a new option to the REST api and correspondingly, to scylla nodetool: use_sstable_identifier.
When set, we use the sstable identifier, if available, to name each sstable in the snapshots directory
and the manifest.json file, rather than using the sstable generation.

This can be used by the user (e.g. Scylla Manager) for global deduplication with tablets, where an sstable
may be migrated across shards or across nodes, and in this case, its generation may change, but its
sstable identifier remains sstable.

Currently, Scylla manager uses the sstable generation to detect sstables that are already backed up to
object storage and exist in previous backed up snapshots.
Historically, the sstable generation was guaranteed to be unique only per table per node,
so the dedup code currently checks for deduplication in the node scope.

However, with tablet migration, sstables are renamed when migrated to a different shard,
i.e. their generation changes, and they may be renamed when migrated to another node,
but even if they are not, the dedup logic still assumes uniqueness only within a node.

To address both cases, we keep the sstable_id stable throughout the sstable life cycle (since 3a12ad96c7).
Given the globally unique sstable identifier, scylla manager can now detect duplicate sstables
in a wider scope.  This can be cluster-wide, but we practically need only rack-wide deduplication
or dc-wide, as tablets are migrated across racks only in rare occasions (like when converting from a
numerical replication factor to a rack list containing a subset of the available racks in a datacenter).

Fixes #27181

* New feature, no backport required

Closes scylladb/scylladb#27184

* github.com:scylladb/scylladb:
  database: truncate_table_on_all_shards: set use_sstable_identifier to true
  nodetool: snapshot: add --use-sstable-identifier option
  api: storage_service: take_snapshot: add use_sstable_identifier option
  test: database_test: add snapshot_use_sstable_identifier_works
  test: database_test: snapshot_works: add validate_manifest
  sstable: write_scylla_metadata: add random_sstable_identifier error injection
  table: snapshot_on_all_shards: take snapshot_options
  sstable: add get_format getter
  sstable: snapshot: add use_sstable_identifier option
  db: snapshot_ctl: snapshot_options: add use_sstable_identifier options
  db: snapshot_ctl: move skip_flush to struct snapshot_options
2025-12-08 12:56:12 +03:00
Petr Gusev
c6eec4eeef alternator/executor.cc: make cas_write a private method
We will need to access executor::_stats field from cas_write. We could
pass it as a paramter, but it seems simpler to just make cas_write
and instance method too.
2025-12-08 10:29:54 +01:00
Petr Gusev
9bef142328 alternator/executor.cc: make do_batch_write a private method
We will need to access executor::_stats field on other shards.
2025-12-08 10:29:54 +01:00
Petr Gusev
74bf24a4a7 alternator/executor.cc: fix indent 2025-12-08 10:29:28 +01:00
Petr Gusev
e60bcd0011 test_alternator: add test_alternator_invalid_shard_for_lwt
This test reproduces scylladb/scylladb#27353 using two injection
points. First, the test triggers an intra-node tablet migration and
suspends it at the streaming stage using the
intranode_migration_streaming_wait injection. Next, it enables the
alternator_executor_batch_write_wait injection, which suspends a
batch write after its cas_shard has already been created.
The test then issues several batch writes and waits until one of them
hits this injection on the destination shard. At this point, the
cas_shard.erm for that write is still in the streaming state,
meaning the executor would need to jump back to the source shard.
The test then resumes the suspended tablet migration, allowing it to
update the ERM on the source shard to write_both_read_new. After that,
the test releases the suspended batch write and expects it to perform
two shard jumps: first from the destination to the source shard, and
then again back to the source shard.

This commit adds the alternator_executor_batch_write_wait injection to
alternator/executor.cc. Coroutines are intentionally avoided in the
parallel_for_each lambda to prevent unnecessary coroutine-frame
allocations.
2025-12-08 10:29:28 +01:00
Michał Jadwiszczak
aa908ba99c docs/dev/view-building-coordinator: fix typos 2025-12-04 12:52:42 +01:00
Michał Jadwiszczak
529cd25c51 db/view/view_building_worker: remove unnnecessary empty lines 2025-12-04 12:52:42 +01:00
Michał Jadwiszczak
4fc5fcaec4 db/view/view_building_worker: fix typo 2025-12-04 12:52:42 +01:00
Michał Jadwiszczak
3253b05ec9 db/view/view_building_worker: avoid creating a copy of tasks map
The loop can be converted to use an iterator and avoid creating a copy
of the tasks map.
2025-12-04 12:52:41 +01:00
Michał Jadwiszczak
597a2ce5f9 db/view/view_building_worker: wrap conditionally compiled code in a scope
The code creates a local variable, so it's better to wrap it in a local
scope, to the conditionally compiled variable doesn't pollute the
external scope.
2025-12-04 12:52:41 +01:00
Michał Jadwiszczak
a5f19af050 db/view/view_building_worker: remove unnecessary CV broadcast
After scylladb/scylladb#26897 was merged, the worker doesn't use the
view building state machine CV to manage lifetime of batches, so the
broadcast is not needed.
2025-12-04 12:52:41 +01:00
Michał Jadwiszczak
b4fe565f07 db/view/view_building_worker: catch general execption in staging task registrator
In case of general exception in `view_building_worker::create_staging_sstable_tasks()`,
catch it, print it with error level and sleep 1s before retrying.
This will allow for the registrator to retry its work in case of failure
and it should be easier to detect any bugs in the method.
2025-12-04 12:52:37 +01:00
Benny Halevy
19b6207f17 database: truncate_table_on_all_shards: set use_sstable_identifier to true
To facilitate global sstable deduplication on backup.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 11:57:39 +02:00
Benny Halevy
ff52550739 nodetool: snapshot: add --use-sstable-identifier option
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 11:57:39 +02:00
Benny Halevy
e654045755 api: storage_service: take_snapshot: add use_sstable_identifier option
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 11:57:39 +02:00
Benny Halevy
07b92a1ee8 test: database_test: add snapshot_use_sstable_identifier_works
Test that taking a snapshot with the use_sstable_identifier
option (and injecting `random_sstable_identifier`) produces
different file names in the snapshot than the original
sstable names and validate te manifest.json file respectively.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 11:57:38 +02:00
Benny Halevy
7504d10d9e test: database_test: snapshot_works: add validate_manifest
Validate the manifest.json format by loading it using rjson::parse
and then validate its contents to ensure it lists exactly the
SSTables present in the snapshot directory.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 11:55:50 +02:00
Benny Halevy
28cb300d0a sstable: write_scylla_metadata: add random_sstable_identifier error injection
To be used by a unit test in the following patch for testing
the snapshot use_sstable_identifier option.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 11:55:50 +02:00
Benny Halevy
9b3fbedc8c table: snapshot_on_all_shards: take snapshot_options
And pass the use_sstable_identifier down the stack
to the sstables layer.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 11:55:50 +02:00
Benny Halevy
420fb1fd53 sstable: add get_format getter
To be used by the snapshot code in te following patch
for manufacturing a basename using the sstable_id rather
than its generation.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 11:55:50 +02:00
Benny Halevy
7c62417b54 sstable: snapshot: add use_sstable_identifier option
When set to true, use the sstable_identifier as the sstable name
in the snapshot rather than its generation.

sstable::snapshot now returns the generation it used
for the sstable in the snapshot, based on the `use_sstable_identifier`
option, to be used by the upper layer generating the manifest.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 11:53:32 +02:00
Benny Halevy
1c45ad7cee db: snapshot_ctl: snapshot_options: add use_sstable_identifier options
To be used for naming sstables in the snapshot by their
sstable identifiers rather than their generation, to
facilitate global deduplication of sstables in backup.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 09:46:35 +02:00
Benny Halevy
c18133b6cb db: snapshot_ctl: move skip_flush to struct snapshot_options
Prepare for adding another option: use_sstable_identifer.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 09:46:35 +02:00
Botond Dénes
e762027943 db/config: change batchlog_replay_cleanup_after_replays default to 1
Now that batchlog cleanup is cheap, on account of memtable flush on the
system.batchlog table garbage-collecting tombstones (previous patch), we
can afford to do cleanup on each replay, keeping the memtable size small
and more importantly -- the amount of tombstones in the memtable small.
2025-12-02 14:21:26 +02:00
Botond Dénes
8edd5b80ab test/boost/batchlog_manager_test: add test for batchlog cleanup
Add more tests covering different aspects of batchlog replay, cleanup,
replay timeout and finally v1 -> v2 migration.
2025-12-02 14:21:26 +02:00
Botond Dénes
fb84b30f88 replica/mutation_dump: always set position weight for clustering positions
SELECT * FROM MUTATION_FRAGMENTS() queries have a transformed schema
(mutation-fragment schema), which is a superset of that of the queried
table's. The mutation fragment schema represents position_in_partition
of mutation fragments expressed as clustering columns. This presents
some challenges, as some position_in_partition fields are null for some
positions. This was solved by setting these clustering keys components
to bytes{}. In the process, a mistake was made: when the clustering key
is missing in the position_in_partition, the position_weight is also set
to bytes{}. This is not correct, it is possible for some positions to
have no key but to still have a position_weight. An example is
position_in_partition::before_all_clustered_rows().
Fix this by always filling in the position_weight for positions which
have region() == clustered, instead of the earlier condition on the key
presence.

This is a minor bug affecting range tombstone changes at the two
extremes: position_in_partition::{before,after}_all_clustered_rows(). In
both cases, the position_weight can be deduced by a human looking at the
results, based on the position of the range tombstone change, relative
to other fragments.
2025-12-02 14:21:26 +02:00
Botond Dénes
8545f7eedd service/storage_proxy: s/batch_replay_throw/storage_proxy_fail_replay_batch/
Rename to make it more explicit where the error injection happens.
Also change how the error is injected, use the lambda overload instead
of is_enabled(), the former leaves better trace in logs, which helps
when debugging tests.
2025-12-02 14:21:26 +02:00
Botond Dénes
e52e1f842e test/lib: introduce error_injection.hh
Test-specific helpers for working with error injection. Provides an
RAII object to enable/disable error injection points, on all shards.
2025-12-02 14:21:26 +02:00
Botond Dénes
0a7df4b8ac utils/error_injection: add debug log to disable() and disable_all()
enable() and friends already has debug logs.
2025-12-02 14:21:26 +02:00
Botond Dénes
9bb8156f02 test/lib/cql_test_env: forward config to batchlog
Currently all batchlog config items are hardcoded. Make the two
important ones configurable: replay_timeout and
replay_cleanup_after_replays.
2025-12-02 14:21:26 +02:00
Botond Dénes
d1b796bc43 test/lib/cql_test_env: add batch type to execute_batch()
Allow executing logged batches too. Also add a trace log to the method.
2025-12-02 14:21:26 +02:00
Botond Dénes
1ad64731bc test/lib/cql_assertions: add with_size(predicate) overload
Sometimes, expected size is not a single number. Add predicate overload
to allow expressing more complicated expectations.
2025-12-02 14:21:26 +02:00
Botond Dénes
abadb8ebfb test/lib/cql_assertions: add source location to fail messages
For tests that contain multiple assert_that() invokations, identifying
the one that failed is very challenging. Add source location to fail
messages to allow convenient identification of the call-site.
2025-12-02 14:21:26 +02:00
Botond Dénes
54f16f9019 test/lib/cql_assertions: columns_assertions: add assert_for_columns_of_each_row()
Invoke the passed in function with a columns_assertions instance for the
current row, allowing for sweeping checks across columns of all rows.
2025-12-02 14:21:26 +02:00
Botond Dénes
b584e1e18e test/lib/cql_assertions: rows_assertions::assert_for_columns_of_row(): add index bound check 2025-12-02 14:21:26 +02:00
Botond Dénes
aa1d3f1170 test/lib/cql_assertions: columns_assertions: add T* with_typed_column() overload
To enable assertions on columns which are sometimes null.
One existing user of with_typed_column() needs adjustment, because the
previous version of with_typed_column() covered up silently for null
value, but after this patch this caused a failure.
2025-12-02 14:21:26 +02:00
Botond Dénes
e309b5dbe1 db/batchlog_manager: config: s/write_timeout/reply_timeot/
Although the value of this item is indeed derived from the write timeout
config, the name doesn't reflect what it is used for. Change it to
reflect it better.
2025-12-02 14:21:26 +02:00
Botond Dénes
846b656610 db,service: switch to system.batchlog_v2
New batchlogs are written to the batchlog_v2 table and replay also uses
the v2 table.
The content of system.batchlog is attempted to be migrated to
system.batchlog_v2 after each start of the batchlog_manager service.
The migration is retried on each replay if it fails. This is reduntant
but simple.

Batchlog cleanup now doesn't involve flushing memtables, the only
remaining user of replica/database.hh is gone, so the include is
dropped.
2025-12-02 14:21:26 +02:00
Botond Dénes
ee851266be db/system_keyspace: introduce system.batchlog_v2
Rearranges the system.batchlog schema as follows:

    CREATE TABLE system.batchlog_v2 (
        version int,
        stage int,
        shard int,
        written_at timestamp,
        id uuid,
        data blob,
        PRIMARY KEY ((version, stage, shard), written_at, id));

With the following goals:
1) Make post-replay batchlog cleanup possible with a simple
   range-tombstone. This allows dropping the individual dead batchlog
   entries, as they are shadowed by a higher level tombstone. This
   enables dropping tombstones without tombstone GC.
2) To make the above possible, introduce the stage key component:
   batchlog entries that fail the first replay attempt, are moved to the
   failed_replay stage, so the initial stage can be cleaned up safely.
3) Spread out the data among Scylla shards, via the batchlog shard
   column.
4) Make batchlog entries ordered by the batchlog create time (id). This
   allows for selecting batchlogs to replay, without post-filtering of
   batchlogs that are too young to be replayed.
2025-12-02 14:21:25 +02:00
Botond Dénes
9434ec2fd1 service,db: extract generation of batchlog delete mutation
Don't build batchlog delete mutations in storage-proxy code. Move this
code into db/batchlog_manager.cc, exposed via db/batchlog.hh.
This serves multiple goals:
1) Concentrates low-level batchlog related logic in
   db/batchlog_manager.cc
2) Reduce current and future code duplication.
3) Make future changes to this logic easier.
2025-12-02 14:21:25 +02:00
Botond Dénes
f54602daf0 service,db: extract get_batchlog_mutation_for() from storage-proxy
Don't build batchlog mutations in storage-proxy code. Move this code
into db/batchlog_manager.cc, exposed via db/batchlog.hh.
This serves multiple goals:
1) Concentrates low-level batchlog related logic in
   db/batchlog_manager.cc
2) Reduce current and future code duplication.
2) Make future changes to this logic easier.
2025-12-02 14:21:25 +02:00
Botond Dénes
097c2cd676 db/batchlog_manager: only consider propagation delay with tombstone-gc=repair
The propagation delay has no effect for other tombstone gc strategies,
so ignore it when tombstone-gc != repair.
2025-12-02 14:21:25 +02:00
Botond Dénes
4f30807f01 db/batchlog_manager: don't drop entire batch if one mutations' table was dropped
Just skip the mutation(s) whose tables were dropped instead.
Use the newly introduced data_dictionary::table::get_truncation_time()
to avoid looking up real table object.
2025-12-02 14:21:25 +02:00
Botond Dénes
55704908a0 data_dictionary: table: add get_truncation_time()
So the batchlog manager can avoid looking up the real table and instead
just work with data dictionary.
2025-12-02 14:21:25 +02:00
Botond Dénes
337f417b13 db/batchlog_manager: batch(): replace map_reduce() with simple loop
The map_reduce achieves no concurrency, both map and reduce are
synchronous. It only achieves two redundant lookups for the table and
hard-to-read code. Convert it into a simple loop. Preserve the
stall-protection by adding a maybe_yield() to the loop.
2025-12-02 12:05:10 +02:00
Wojciech Mitros
6221c58325 mv: replace the simple/complex rack-aware pairing with exact rack matching
When the initial version of rack-aware pairing was introduced, materialized
views with tablets were still experimental. Since then, we decided that
we'll only allow materialized views in clusters where the base table and
the view are replicated on the same racks, with one replica of each tablet
on each rack.
This allows us to remove almost all logic from our base-view pairing. The
only check for the paired view replica is now whether it's in the same
rack as the base replica sending the update.
In this patch we replace the simple and complex rack-aware pairing with
the simple check above.
Because of this, we have to remove a test case from network_topology_strategy_test
which was testing complex pairing. The tested topology is not supported
for views with tablets (or is unlikely to be supported, as it's a random test),
so there's no use keeping the test.
The test case for simple rack aware pairing was kept, but now we only test
the case where each rack has one replica, not multiple.
Additionally, we split finding of an unpaired replica to a separate function
and partially rewrite it without reusing the helper stuctures that were
present when calculating the simple and complex rack-aware pairing.
We only look for an unpaired replica if we couldn't find a paired replica
ourselves or if the number of view replicas didn't match the base replicas.
If an unpaired replica appears while these conditions pass, we won't send
an extra update, but that would be a new bug altogether, because we only
expect the unpaired replica to appear during RF changes, so when these
conditions aren't fulfilled.

Fixes https://github.com/scylladb/scylladb/issues/26313
2025-12-02 10:52:36 +01:00
Botond Dénes
705af2bc16 db/batchlog_manager: finish coroutinizing replay_all_failed_batches
It was coroutinized already but strangely, some continuations also
remained. The `batch` lambda is still left in continuation style.
2025-12-02 10:42:28 +02:00
Botond Dénes
5b5f9120d0 db/batchlog_manager: improve replayAllFailedBatches logs
Add cleanup flag value to start message and drop cpu, it is redundant as
Scylla already adds the shard number to the logs.
Add all_replayed to finish message.
2025-12-02 10:42:28 +02:00
Pavel Emelyanov
6c115c691f sstables_loader: Provide endpoint type for get_sstables_from_object_store()
Currently the method scans db::config to find one. It has some
drawbacks. First, it's not very nice. Second, it needs to handle the
case when the endpoint is missing, while it relally never is. Third, the
type in config entry is not necessarily set.

It's nicer to get the type from storage manager.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-02 11:18:32 +03:00
Pavel Emelyanov
5924c36b50 storage_manager: Introduce get_endpoint_type() method
So that other code (spoiler: see next patch) have simple API to get one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-02 11:18:27 +03:00
Pavel Emelyanov
ad6a73c29b storage_manager: Split get_endpoint_client()
To get the get_endpoint() internal helper for future use.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-02 11:18:23 +03:00
Wojciech Mitros
4ec0fa6eb5 mv: split out vnode pairing code from get_view_natural_endpoint
To avoid repeatedly checking whether we're using tablets and having
to use unnecesarily flexible code fitting both cases, we split out
the base-view pairing code for the case of vnodes to another function.
The get_view_natural_endpoint will now have only common steps,
a call to that function, and steps specific to tablets.
2025-12-02 03:32:36 +01:00
Wojciech Mitros
c313b215e4 mv: unify self-pairing and rack-aware pairing into one bool
We always use "legacy self pairing" when not using tablets, and
the "rack aware pairing" has been enabled in every version where
views with tablets isn't experimental. So in practice, instead
of checking these variables we can just look at whether the
table uses tablets.
2025-12-02 03:32:32 +01:00
Wojciech Mitros
7c612e1789 mv: remove the workaround for left nodes when sending view updates
At one point, the get_view_natural_endpoint was using IP for the
view update (and hint) destinations, but the hint code was using
host_id for the destinations. When a node left, we could no longer
have a mapping for a IP to host_id and when trying to store a hint
for this IP, we'd crash.
We worked around this issue by dropping the view update completely
if the target is in the "left" state.
Since then, we also moved to host_id's in the view update code, so
there's no longer any translation needed when storing the hints.
Additionally, we now drain hints not when entering the "left" state,
but when the node actually stops owning tokens.
Because of that, the workaround is not needed anymore, so we remove
it in this commit.
The existing test_mv_tablets_empty_ip case verifies that indeed, we
do not crash in the original problematic scenario.
2025-12-01 12:27:28 +01:00
Ernest Zaslavsky
f0e2941e34 streaming: refactor tablet_sstable_streamer::stream by extracting SST filtering logic
Extract the SST filtering logic into a dedicated member function. This
prepares the code for independent testing without requiring the entire
streamer to be initialized.
2025-11-30 18:27:15 +02:00
389 changed files with 12101 additions and 3229 deletions

8
.github/CODEOWNERS vendored
View File

@@ -1,5 +1,5 @@
# AUTH
auth/* @nuivall @ptrsmrn
auth/* @nuivall
# CACHE
row_cache* @tgrabiec
@@ -25,11 +25,11 @@ compaction/* @raphaelsc
transport/*
# CQL QUERY LANGUAGE
cql3/* @tgrabiec @nuivall @ptrsmrn
cql3/* @tgrabiec @nuivall
# COUNTERS
counters* @nuivall @ptrsmrn
tests/counter_test* @nuivall @ptrsmrn
counters* @nuivall
tests/counter_test* @nuivall
# DOCS
docs/* @annastuchlik @tzach

View File

@@ -62,7 +62,7 @@ def create_pull_request(repo, new_branch_name, base_branch_name, pr, backport_pr
if is_draft:
labels_to_add.append("conflicts")
pr_comment = f"@{pr.user.login} - This PR was marked as draft because it has conflicts\n"
pr_comment += "Please resolve them and mark this PR as ready for review"
pr_comment += "Please resolve them and remove the 'conflicts' label. The PR will be made ready for review automatically."
backport_pr.create_issue_comment(pr_comment)
# Apply all labels at once if we have any

View File

@@ -18,7 +18,7 @@ jobs:
// Regular expression pattern to check for "Fixes" prefix
// Adjusted to dynamically insert the repository full name
const pattern = `Fixes:? (?:#|${repo.replace('/', '\\/')}#|https://github\\.com/${repo.replace('/', '\\/')}/issues/)(\\d+)`;
const pattern = `Fixes:? ((?:#|${repo.replace('/', '\\/')}#|https://github\\.com/${repo.replace('/', '\\/')}/issues/)(\\d+)|([A-Z]+-\\d+))`;
const regex = new RegExp(pattern);
if (!regex.test(body)) {

View File

@@ -0,0 +1,14 @@
name: Call Jira release creation for new milestone
on:
milestone:
types: [created]
jobs:
sync-milestone-to-jira:
uses: scylladb/github-automation/.github/workflows/main_sync_milestone_to_jira_release.yml@main
with:
# Comma-separated list of Jira project keys
jira_project_keys: "SCYLLADB,CUSTOMER"
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

View File

@@ -0,0 +1,13 @@
name: validate_pr_author_email
on:
pull_request_target:
types:
- opened
- synchronize
- reopened
jobs:
validate_pr_author_email:
uses: scylladb/github-automation/.github/workflows/validate_pr_author_email.yml@main

View File

@@ -7,7 +7,7 @@ on:
- enterprise
paths:
- '**/*.cc'
- 'scripts/metrics-config.yml'
- 'scripts/metrics-config.yml'
- 'scripts/get_description.py'
- 'docs/_ext/scylladb_metrics.py'
@@ -15,20 +15,20 @@ jobs:
validate-metrics:
runs-on: ubuntu-latest
name: Check metrics documentation coverage
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
submodules: true
- name: Set up Python
uses: actions/setup-python@v6
with:
python-version: '3.10'
- name: Install dependencies
run: pip install PyYAML
- name: Validate metrics
run: python3 scripts/get_description.py --validate -c scripts/metrics-config.yml

View File

@@ -3,10 +3,13 @@ name: Trigger Scylla CI Route
on:
issue_comment:
types: [created]
pull_request_target:
types:
- unlabeled
jobs:
trigger-jenkins:
if: github.event.comment.user.login != 'scylladbbot' && contains(github.event.comment.body, '@scylladbbot') && contains(github.event.comment.body, 'trigger-ci')
if: (github.event.comment.user.login != 'scylladbbot' && contains(github.event.comment.body, '@scylladbbot') && contains(github.event.comment.body, 'trigger-ci')) || github.event.label.name == 'conflicts'
runs-on: ubuntu-latest
steps:
- name: Trigger Scylla-CI-Route Jenkins Job

View File

@@ -42,7 +42,7 @@ comparison_operator_type get_comparison_operator(const rjson::value& comparison_
if (!comparison_operator.IsString()) {
throw api_error::validation(fmt::format("Invalid comparison operator definition {}", rjson::print(comparison_operator)));
}
std::string op = comparison_operator.GetString();
std::string op = rjson::to_string(comparison_operator);
auto it = ops.find(op);
if (it == ops.end()) {
throw api_error::validation(fmt::format("Unsupported comparison operator {}", op));
@@ -377,8 +377,8 @@ bool check_compare(const rjson::value* v1, const rjson::value& v2, const Compara
return cmp(unwrap_number(*v1, cmp.diagnostic), unwrap_number(v2, cmp.diagnostic));
}
if (kv1.name == "S") {
return cmp(std::string_view(kv1.value.GetString(), kv1.value.GetStringLength()),
std::string_view(kv2.value.GetString(), kv2.value.GetStringLength()));
return cmp(rjson::to_string_view(kv1.value),
rjson::to_string_view(kv2.value));
}
if (kv1.name == "B") {
auto d_kv1 = unwrap_bytes(kv1.value, v1_from_query);
@@ -470,9 +470,9 @@ static bool check_BETWEEN(const rjson::value* v, const rjson::value& lb, const r
return check_BETWEEN(unwrap_number(*v, diag), unwrap_number(lb, diag), unwrap_number(ub, diag), bounds_from_query);
}
if (kv_v.name == "S") {
return check_BETWEEN(std::string_view(kv_v.value.GetString(), kv_v.value.GetStringLength()),
std::string_view(kv_lb.value.GetString(), kv_lb.value.GetStringLength()),
std::string_view(kv_ub.value.GetString(), kv_ub.value.GetStringLength()),
return check_BETWEEN(rjson::to_string_view(kv_v.value),
rjson::to_string_view(kv_lb.value),
rjson::to_string_view(kv_ub.value),
bounds_from_query);
}
if (kv_v.name == "B") {

View File

@@ -8,6 +8,8 @@
#include "consumed_capacity.hh"
#include "error.hh"
#include "utils/rjson.hh"
#include <fmt/format.h>
namespace alternator {
@@ -32,12 +34,12 @@ bool consumed_capacity_counter::should_add_capacity(const rjson::value& request)
if (!return_consumed->IsString()) {
throw api_error::validation("Non-string ReturnConsumedCapacity field in request");
}
std::string consumed = return_consumed->GetString();
std::string_view consumed = rjson::to_string_view(*return_consumed);
if (consumed == "INDEXES") {
throw api_error::validation("INDEXES consumed capacity is not supported");
}
if (consumed != "TOTAL") {
throw api_error::validation("Unknown consumed capacity "+ consumed);
throw api_error::validation(fmt::format("Unknown consumed capacity {}", consumed));
}
return true;
}

View File

@@ -169,7 +169,7 @@ future<> controller::request_stop_server() {
});
}
future<utils::chunked_vector<client_data>> controller::get_client_data() {
future<utils::chunked_vector<foreign_ptr<std::unique_ptr<client_data>>>> controller::get_client_data() {
return _server.local().get_client_data();
}

View File

@@ -93,7 +93,7 @@ public:
// This virtual function is called (on each shard separately) when the
// virtual table "system.clients" is read. It is expected to generate a
// list of clients connected to this server (on this shard).
virtual future<utils::chunked_vector<client_data>> get_client_data() override;
virtual future<utils::chunked_vector<foreign_ptr<std::unique_ptr<client_data>>>> get_client_data() override;
};
}

View File

@@ -419,7 +419,7 @@ static std::optional<std::string> find_table_name(const rjson::value& request) {
if (!table_name_value->IsString()) {
throw api_error::validation("Non-string TableName field in request");
}
std::string table_name = table_name_value->GetString();
std::string table_name = rjson::to_string(*table_name_value);
return table_name;
}
@@ -546,7 +546,7 @@ get_table_or_view(service::storage_proxy& proxy, const rjson::value& request) {
// does exist but the index does not (ValidationException).
if (proxy.data_dictionary().has_schema(keyspace_name, orig_table_name)) {
throw api_error::validation(
fmt::format("Requested resource not found: Index '{}' for table '{}'", index_name->GetString(), orig_table_name));
fmt::format("Requested resource not found: Index '{}' for table '{}'", rjson::to_string_view(*index_name), orig_table_name));
} else {
throw api_error::resource_not_found(
fmt::format("Requested resource not found: Table: {} not found", orig_table_name));
@@ -587,7 +587,7 @@ static std::string get_string_attribute(const rjson::value& value, std::string_v
throw api_error::validation(fmt::format("Expected string value for attribute {}, got: {}",
attribute_name, value));
}
return std::string(attribute_value->GetString(), attribute_value->GetStringLength());
return rjson::to_string(*attribute_value);
}
// Convenience function for getting the value of a boolean attribute, or a
@@ -1080,8 +1080,8 @@ static void add_column(schema_builder& builder, const std::string& name, const r
}
for (auto it = attribute_definitions.Begin(); it != attribute_definitions.End(); ++it) {
const rjson::value& attribute_info = *it;
if (attribute_info["AttributeName"].GetString() == name) {
auto type = attribute_info["AttributeType"].GetString();
if (rjson::to_string_view(attribute_info["AttributeName"]) == name) {
std::string_view type = rjson::to_string_view(attribute_info["AttributeType"]);
data_type dt = parse_key_type(type);
if (computed_column) {
// Computed column for GSI (doesn't choose a real column as-is
@@ -1116,7 +1116,7 @@ static std::pair<std::string, std::string> parse_key_schema(const rjson::value&
throw api_error::validation("First element of KeySchema must be an object");
}
const rjson::value *v = rjson::find((*key_schema)[0], "KeyType");
if (!v || !v->IsString() || v->GetString() != std::string("HASH")) {
if (!v || !v->IsString() || rjson::to_string_view(*v) != "HASH") {
throw api_error::validation("First key in KeySchema must be a HASH key");
}
v = rjson::find((*key_schema)[0], "AttributeName");
@@ -1124,14 +1124,14 @@ static std::pair<std::string, std::string> parse_key_schema(const rjson::value&
throw api_error::validation("First key in KeySchema must have string AttributeName");
}
validate_attr_name_length(supplementary_context, v->GetStringLength(), true, "HASH key in KeySchema - ");
std::string hash_key = v->GetString();
std::string hash_key = rjson::to_string(*v);
std::string range_key;
if (key_schema->Size() == 2) {
if (!(*key_schema)[1].IsObject()) {
throw api_error::validation("Second element of KeySchema must be an object");
}
v = rjson::find((*key_schema)[1], "KeyType");
if (!v || !v->IsString() || v->GetString() != std::string("RANGE")) {
if (!v || !v->IsString() || rjson::to_string_view(*v) != "RANGE") {
throw api_error::validation("Second key in KeySchema must be a RANGE key");
}
v = rjson::find((*key_schema)[1], "AttributeName");
@@ -1799,6 +1799,11 @@ static future<executor::request_return_type> create_table_on_shard0(service::cli
}
}
}
// Creating an index in tablets mode requires the rf_rack_valid_keyspaces option to be enabled.
// GSI and LSI indexes are based on materialized views which require this option to avoid consistency issues.
if (!view_builders.empty() && ksm->uses_tablets() && !sp.data_dictionary().get_config().rf_rack_valid_keyspaces()) {
co_return api_error::validation("GlobalSecondaryIndexes and LocalSecondaryIndexes with tablets require the rf_rack_valid_keyspaces option to be enabled.");
}
try {
schema_mutations = service::prepare_new_keyspace_announcement(sp.local_db(), ksm, ts);
} catch (exceptions::already_exists_exception&) {
@@ -1887,8 +1892,8 @@ future<executor::request_return_type> executor::create_table(client_state& clien
std::string def_type = type_to_string(def.type);
for (auto it = attribute_definitions.Begin(); it != attribute_definitions.End(); ++it) {
const rjson::value& attribute_info = *it;
if (attribute_info["AttributeName"].GetString() == def.name_as_text()) {
auto type = attribute_info["AttributeType"].GetString();
if (rjson::to_string_view(attribute_info["AttributeName"]) == def.name_as_text()) {
std::string_view type = rjson::to_string_view(attribute_info["AttributeType"]);
if (type != def_type) {
throw api_error::validation(fmt::format("AttributeDefinitions redefined {} to {} already a key attribute of type {} in this table", def.name_as_text(), type, def_type));
}
@@ -2019,6 +2024,10 @@ future<executor::request_return_type> executor::update_table(client_state& clien
co_return api_error::validation(fmt::format(
"LSI {} already exists in table {}, can't use same name for GSI", index_name, table_name));
}
if (p.local().local_db().find_keyspace(keyspace_name).get_replication_strategy().uses_tablets() &&
!p.local().data_dictionary().get_config().rf_rack_valid_keyspaces()) {
co_return api_error::validation("GlobalSecondaryIndexes with tablets require the rf_rack_valid_keyspaces option to be enabled.");
}
elogger.trace("Adding GSI {}", index_name);
// FIXME: read and handle "Projection" parameter. This will
@@ -2362,7 +2371,7 @@ put_or_delete_item::put_or_delete_item(const rjson::value& item, schema_ptr sche
_cells = std::vector<cell>();
_cells->reserve(item.MemberCount());
for (auto it = item.MemberBegin(); it != item.MemberEnd(); ++it) {
bytes column_name = to_bytes(it->name.GetString());
bytes column_name = to_bytes(rjson::to_string_view(it->name));
validate_value(it->value, "PutItem");
const column_definition* cdef = find_attribute(*schema, column_name);
validate_attr_name_length("", column_name.size(), cdef && cdef->is_primary_key());
@@ -2783,10 +2792,10 @@ static void verify_all_are_used(const rjson::value* field,
return;
}
for (auto it = field->MemberBegin(); it != field->MemberEnd(); ++it) {
if (!used.contains(it->name.GetString())) {
if (!used.contains(rjson::to_string(it->name))) {
throw api_error::validation(
format("{} has spurious '{}', not used in {}",
field_name, it->name.GetString(), operation));
field_name, rjson::to_string_view(it->name), operation));
}
}
}
@@ -3000,7 +3009,7 @@ future<executor::request_return_type> executor::delete_item(client_state& client
}
static schema_ptr get_table_from_batch_request(const service::storage_proxy& proxy, const rjson::value::ConstMemberIterator& batch_request) {
sstring table_name = batch_request->name.GetString(); // JSON keys are always strings
sstring table_name = rjson::to_sstring(batch_request->name); // JSON keys are always strings
try {
return proxy.data_dictionary().find_schema(sstring(executor::KEYSPACE_NAME_PREFIX) + table_name, table_name);
} catch(data_dictionary::no_such_column_family&) {
@@ -3055,17 +3064,44 @@ public:
}
};
static future<> cas_write(service::storage_proxy& proxy, schema_ptr schema, service::cas_shard cas_shard, const dht::decorated_key& dk, const std::vector<put_or_delete_item>& mutation_builders,
service::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit) {
future<> executor::cas_write(schema_ptr schema, service::cas_shard cas_shard, const dht::decorated_key& dk,
const std::vector<put_or_delete_item>& mutation_builders, service::client_state& client_state,
tracing::trace_state_ptr trace_state, service_permit permit)
{
if (!cas_shard.this_shard()) {
_stats.shard_bounce_for_lwt++;
return container().invoke_on(cas_shard.shard(), _ssg,
[cs = client_state.move_to_other_shard(),
&mb = mutation_builders,
&dk,
ks = schema->ks_name(),
cf = schema->cf_name(),
gt = tracing::global_trace_state_ptr(trace_state),
permit = std::move(permit)]
(executor& self) mutable {
return do_with(cs.get(), [&mb, &dk, ks = std::move(ks), cf = std::move(cf),
trace_state = tracing::trace_state_ptr(gt), &self]
(service::client_state& client_state) mutable {
auto schema = self._proxy.data_dictionary().find_schema(ks, cf);
service::cas_shard cas_shard(*schema, dk.token());
//FIXME: Instead of passing empty_service_permit() to the background operation,
// the current permit's lifetime should be prolonged, so that it's destructed
// only after all background operations are finished as well.
return self.cas_write(schema, std::move(cas_shard), dk, mb, client_state, std::move(trace_state), empty_service_permit());
});
});
}
auto timeout = executor::default_timeout();
auto op = std::make_unique<put_or_delete_item_cas_request>(schema, mutation_builders);
auto* op_ptr = op.get();
auto cdc_opts = cdc::per_request_options{
.alternator = true,
.alternator_streams_increased_compatibility =
schema->cdc_options().enabled() && proxy.data_dictionary().get_config().alternator_streams_increased_compatibility(),
schema->cdc_options().enabled() && _proxy.data_dictionary().get_config().alternator_streams_increased_compatibility(),
};
return proxy.cas(schema, std::move(cas_shard), *op_ptr, nullptr, to_partition_ranges(dk),
return _proxy.cas(schema, std::move(cas_shard), *op_ptr, nullptr, to_partition_ranges(dk),
{timeout, std::move(permit), client_state, trace_state},
db::consistency_level::LOCAL_SERIAL, db::consistency_level::LOCAL_QUORUM,
timeout, timeout, true, std::move(cdc_opts)).finally([op = std::move(op)]{}).discard_result();
@@ -3091,13 +3127,11 @@ struct schema_decorated_key_equal {
// FIXME: if we failed writing some of the mutations, need to return a list
// of these failed mutations rather than fail the whole write (issue #5650).
static future<> do_batch_write(service::storage_proxy& proxy,
smp_service_group ssg,
future<> executor::do_batch_write(
std::vector<std::pair<schema_ptr, put_or_delete_item>> mutation_builders,
service::client_state& client_state,
tracing::trace_state_ptr trace_state,
service_permit permit,
stats& stats) {
service_permit permit) {
if (mutation_builders.empty()) {
return make_ready_future<>();
}
@@ -3119,7 +3153,7 @@ static future<> do_batch_write(service::storage_proxy& proxy,
mutations.push_back(b.second.build(b.first, now));
any_cdc_enabled |= b.first->cdc_options().enabled();
}
return proxy.mutate(std::move(mutations),
return _proxy.mutate(std::move(mutations),
db::consistency_level::LOCAL_QUORUM,
executor::default_timeout(),
trace_state,
@@ -3128,7 +3162,7 @@ static future<> do_batch_write(service::storage_proxy& proxy,
false,
cdc::per_request_options{
.alternator = true,
.alternator_streams_increased_compatibility = any_cdc_enabled && proxy.data_dictionary().get_config().alternator_streams_increased_compatibility(),
.alternator_streams_increased_compatibility = any_cdc_enabled && _proxy.data_dictionary().get_config().alternator_streams_increased_compatibility(),
});
} else {
// Do the write via LWT:
@@ -3140,46 +3174,35 @@ static future<> do_batch_write(service::storage_proxy& proxy,
schema_decorated_key_hash,
schema_decorated_key_equal>;
auto key_builders = std::make_unique<map_type>(1, schema_decorated_key_hash{}, schema_decorated_key_equal{});
for (auto& b : mutation_builders) {
auto dk = dht::decorate_key(*b.first, b.second.pk());
auto [it, added] = key_builders->try_emplace(schema_decorated_key{b.first, dk});
for (auto&& b : std::move(mutation_builders)) {
auto [it, added] = key_builders->try_emplace(schema_decorated_key {
.schema = b.first,
.dk = dht::decorate_key(*b.first, b.second.pk())
});
it->second.push_back(std::move(b.second));
}
auto* key_builders_ptr = key_builders.get();
return parallel_for_each(*key_builders_ptr, [&proxy, &client_state, &stats, trace_state, ssg, permit = std::move(permit)] (const auto& e) {
stats.write_using_lwt++;
return parallel_for_each(*key_builders_ptr, [this, &client_state, trace_state, permit = std::move(permit)] (const auto& e) {
_stats.write_using_lwt++;
auto desired_shard = service::cas_shard(*e.first.schema, e.first.dk.token());
if (desired_shard.this_shard()) {
return cas_write(proxy, e.first.schema, std::move(desired_shard), e.first.dk, e.second, client_state, trace_state, permit);
} else {
stats.shard_bounce_for_lwt++;
return proxy.container().invoke_on(desired_shard.shard(), ssg,
[cs = client_state.move_to_other_shard(),
&mb = e.second,
&dk = e.first.dk,
ks = e.first.schema->ks_name(),
cf = e.first.schema->cf_name(),
gt = tracing::global_trace_state_ptr(trace_state),
permit = std::move(permit)]
(service::storage_proxy& proxy) mutable {
return do_with(cs.get(), [&proxy, &mb, &dk, ks = std::move(ks), cf = std::move(cf),
trace_state = tracing::trace_state_ptr(gt)]
(service::client_state& client_state) mutable {
auto schema = proxy.data_dictionary().find_schema(ks, cf);
auto s = e.first.schema;
// The desired_shard on the original shard remains alive for the duration
// of cas_write on this shard and prevents any tablet operations.
// However, we need a local instance of cas_shard on this shard
// to pass it to sp::cas, so we just create a new one.
service::cas_shard cas_shard(*schema, dk.token());
//FIXME: Instead of passing empty_service_permit() to the background operation,
// the current permit's lifetime should be prolonged, so that it's destructed
// only after all background operations are finished as well.
return cas_write(proxy, schema, std::move(cas_shard), dk, mb, client_state, std::move(trace_state), empty_service_permit());
});
}).finally([desired_shard = std::move(desired_shard)]{});
}
static const auto* injection_name = "alternator_executor_batch_write_wait";
return utils::get_local_injector().inject(injection_name, [s = std::move(s)] (auto& handler) -> future<> {
const auto ks = handler.get("keyspace");
const auto cf = handler.get("table");
const auto shard = std::atoll(handler.get("shard")->data());
if (ks == s->ks_name() && cf == s->cf_name() && shard == this_shard_id()) {
elogger.info("{}: hit", injection_name);
co_await handler.wait_for_message(std::chrono::steady_clock::now() + std::chrono::minutes{5});
elogger.info("{}: continue", injection_name);
}
}).then([&e, desired_shard = std::move(desired_shard),
&client_state, trace_state = std::move(trace_state), permit = std::move(permit), this]() mutable
{
return cas_write(e.first.schema, std::move(desired_shard), e.first.dk,
std::move(e.second), client_state, std::move(trace_state), std::move(permit));
});
}).finally([key_builders = std::move(key_builders)]{});
}
}
@@ -3327,7 +3350,7 @@ future<executor::request_return_type> executor::batch_write_item(client_state& c
_stats.wcu_total[stats::DELETE_ITEM] += wcu_delete_units;
_stats.api_operations.batch_write_item_batch_total += total_items;
_stats.api_operations.batch_write_item_histogram.add(total_items);
co_await do_batch_write(_proxy, _ssg, std::move(mutation_builders), client_state, trace_state, std::move(permit), _stats);
co_await do_batch_write(std::move(mutation_builders), client_state, trace_state, std::move(permit));
// FIXME: Issue #5650: If we failed writing some of the updates,
// need to return a list of these failed updates in UnprocessedItems
// rather than fail the whole write (issue #5650).
@@ -3372,7 +3395,7 @@ static bool hierarchy_filter(rjson::value& val, const attribute_path_map_node<T>
}
rjson::value newv = rjson::empty_object();
for (auto it = v.MemberBegin(); it != v.MemberEnd(); ++it) {
std::string attr = it->name.GetString();
std::string attr = rjson::to_string(it->name);
auto x = members.find(attr);
if (x != members.end()) {
if (x->second) {
@@ -3592,7 +3615,7 @@ static std::optional<attrs_to_get> calculate_attrs_to_get(const rjson::value& re
const rjson::value& attributes_to_get = req["AttributesToGet"];
attrs_to_get ret;
for (auto it = attributes_to_get.Begin(); it != attributes_to_get.End(); ++it) {
attribute_path_map_add("AttributesToGet", ret, it->GetString());
attribute_path_map_add("AttributesToGet", ret, rjson::to_string(*it));
validate_attr_name_length("AttributesToGet", it->GetStringLength(), false);
}
if (ret.empty()) {
@@ -4258,12 +4281,12 @@ inline void update_item_operation::apply_attribute_updates(const std::unique_ptr
attribute_collector& modified_attrs, bool& any_updates, bool& any_deletes) const {
for (auto it = _attribute_updates->MemberBegin(); it != _attribute_updates->MemberEnd(); ++it) {
// Note that it.key() is the name of the column, *it is the operation
bytes column_name = to_bytes(it->name.GetString());
bytes column_name = to_bytes(rjson::to_string_view(it->name));
const column_definition* cdef = _schema->get_column_definition(column_name);
if (cdef && cdef->is_primary_key()) {
throw api_error::validation(format("UpdateItem cannot update key column {}", it->name.GetString()));
throw api_error::validation(format("UpdateItem cannot update key column {}", rjson::to_string_view(it->name)));
}
std::string action = (it->value)["Action"].GetString();
std::string action = rjson::to_string((it->value)["Action"]);
if (action == "DELETE") {
// The DELETE operation can do two unrelated tasks. Without a
// "Value" option, it is used to delete an attribute. With a
@@ -5460,7 +5483,7 @@ calculate_bounds_conditions(schema_ptr schema, const rjson::value& conditions) {
std::vector<query::clustering_range> ck_bounds;
for (auto it = conditions.MemberBegin(); it != conditions.MemberEnd(); ++it) {
std::string key = it->name.GetString();
sstring key = rjson::to_sstring(it->name);
const rjson::value& condition = it->value;
const rjson::value& comp_definition = rjson::get(condition, "ComparisonOperator");
@@ -5468,13 +5491,13 @@ calculate_bounds_conditions(schema_ptr schema, const rjson::value& conditions) {
const column_definition& pk_cdef = schema->partition_key_columns().front();
const column_definition* ck_cdef = schema->clustering_key_size() > 0 ? &schema->clustering_key_columns().front() : nullptr;
if (sstring(key) == pk_cdef.name_as_text()) {
if (key == pk_cdef.name_as_text()) {
if (!partition_ranges.empty()) {
throw api_error::validation("Currently only a single restriction per key is allowed");
}
partition_ranges.push_back(calculate_pk_bound(schema, pk_cdef, comp_definition, attr_list));
}
if (ck_cdef && sstring(key) == ck_cdef->name_as_text()) {
if (ck_cdef && key == ck_cdef->name_as_text()) {
if (!ck_bounds.empty()) {
throw api_error::validation("Currently only a single restriction per key is allowed");
}
@@ -5875,7 +5898,7 @@ future<executor::request_return_type> executor::list_tables(client_state& client
rjson::value* exclusive_start_json = rjson::find(request, "ExclusiveStartTableName");
rjson::value* limit_json = rjson::find(request, "Limit");
std::string exclusive_start = exclusive_start_json ? exclusive_start_json->GetString() : "";
std::string exclusive_start = exclusive_start_json ? rjson::to_string(*exclusive_start_json) : "";
int limit = limit_json ? limit_json->GetInt() : 100;
if (limit < 1 || limit > 100) {
co_return api_error::validation("Limit must be greater than 0 and no greater than 100");

View File

@@ -40,6 +40,7 @@ namespace cql3::selection {
namespace service {
class storage_proxy;
class cas_shard;
}
namespace cdc {
@@ -57,6 +58,7 @@ class schema_builder;
namespace alternator {
class rmw_operation;
class put_or_delete_item;
schema_ptr get_table(service::storage_proxy& proxy, const rjson::value& request);
bool is_alternator_keyspace(const sstring& ks_name);
@@ -219,6 +221,16 @@ private:
static void describe_key_schema(rjson::value& parent, const schema&, std::unordered_map<std::string,std::string> * = nullptr, const std::map<sstring, sstring> *tags = nullptr);
future<> do_batch_write(
std::vector<std::pair<schema_ptr, put_or_delete_item>> mutation_builders,
service::client_state& client_state,
tracing::trace_state_ptr trace_state,
service_permit permit);
future<> cas_write(schema_ptr schema, service::cas_shard cas_shard, const dht::decorated_key& dk,
const std::vector<put_or_delete_item>& mutation_builders, service::client_state& client_state,
tracing::trace_state_ptr trace_state, service_permit permit);
public:
static void describe_key_schema(rjson::value& parent, const schema& schema, std::unordered_map<std::string,std::string>&, const std::map<sstring, sstring> *tags = nullptr);

View File

@@ -496,7 +496,7 @@ const std::pair<std::string, const rjson::value*> unwrap_set(const rjson::value&
return {"", nullptr};
}
auto it = v.MemberBegin();
const std::string it_key = it->name.GetString();
const std::string it_key = rjson::to_string(it->name);
if (it_key != "SS" && it_key != "BS" && it_key != "NS") {
return {std::move(it_key), nullptr};
}

View File

@@ -708,8 +708,12 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr
// As long as the system_clients_entry object is alive, this request will
// be visible in the "system.clients" virtual table. When requested, this
// entry will be formatted by server::ongoing_request::make_client_data().
auto user_agent_header = co_await _connection_options_keys_and_values.get_or_load(req->get_header("User-Agent"), [] (const client_options_cache_key_type&) {
return make_ready_future<options_cache_value_type>(options_cache_value_type{});
});
auto system_clients_entry = _ongoing_requests.emplace(
req->get_client_address(), req->get_header("User-Agent"),
req->get_client_address(), std::move(user_agent_header),
username, current_scheduling_group(),
req->get_protocol_name() == "https");
@@ -985,10 +989,10 @@ client_data server::ongoing_request::make_client_data() const {
return cd;
}
future<utils::chunked_vector<client_data>> server::get_client_data() {
utils::chunked_vector<client_data> ret;
future<utils::chunked_vector<foreign_ptr<std::unique_ptr<client_data>>>> server::get_client_data() {
utils::chunked_vector<foreign_ptr<std::unique_ptr<client_data>>> ret;
co_await _ongoing_requests.for_each_gently([&ret] (const ongoing_request& r) {
ret.emplace_back(r.make_client_data());
ret.emplace_back(make_foreign(std::make_unique<client_data>(r.make_client_data())));
});
co_return ret;
}

View File

@@ -55,6 +55,7 @@ class server : public peering_sharded_service<server> {
// though it isn't really relevant for Alternator which defines its own
// timeouts separately. We can create this object only once.
updateable_timeout_config _timeout_config;
client_options_cache_type _connection_options_keys_and_values;
alternator_callbacks_map _callbacks;
@@ -88,7 +89,7 @@ class server : public peering_sharded_service<server> {
// is called when reading the "system.clients" virtual table.
struct ongoing_request {
socket_address _client_address;
sstring _user_agent;
client_options_cache_entry_type _user_agent;
sstring _username;
scheduling_group _scheduling_group;
bool _is_https;
@@ -107,7 +108,7 @@ public:
// table "system.clients" is read. It is expected to generate a list of
// clients connected to this server (on this shard). This function is
// called by alternator::controller::get_client_data().
future<utils::chunked_vector<client_data>> get_client_data();
future<utils::chunked_vector<foreign_ptr<std::unique_ptr<client_data>>>> get_client_data();
private:
void set_routes(seastar::httpd::routes& r);
// If verification succeeds, returns the authenticated user's username

View File

@@ -93,7 +93,7 @@ future<executor::request_return_type> executor::update_time_to_live(client_state
if (v->GetStringLength() < 1 || v->GetStringLength() > 255) {
co_return api_error::validation("The length of AttributeName must be between 1 and 255");
}
sstring attribute_name(v->GetString(), v->GetStringLength());
sstring attribute_name = rjson::to_sstring(*v);
co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, schema, auth::permission::ALTER, _stats);
co_await db::modify_tags(_mm, schema->ks_name(), schema->cf_name(), [&](std::map<sstring, sstring>& tags_map) {

View File

@@ -31,6 +31,7 @@ set(swagger_files
api-doc/column_family.json
api-doc/commitlog.json
api-doc/compaction_manager.json
api-doc/client_routes.json
api-doc/config.json
api-doc/cql_server_test.json
api-doc/endpoint_snitch_info.json
@@ -68,6 +69,7 @@ target_sources(api
PRIVATE
api.cc
cache_service.cc
client_routes.cc
collectd.cc
column_family.cc
commitlog.cc

View File

@@ -0,0 +1,23 @@
, "client_routes_entry": {
"id": "client_routes_entry",
"summary": "An entry storing client routes",
"properties": {
"connection_id": {"type": "string"},
"host_id": {"type": "string", "format": "uuid"},
"address": {"type": "string"},
"port": {"type": "integer"},
"tls_port": {"type": "integer"},
"alternator_port": {"type": "integer"},
"alternator_https_port": {"type": "integer"}
},
"required": ["connection_id", "host_id", "address"]
}
, "client_routes_key": {
"id": "client_routes_key",
"summary": "A key of client_routes_entry",
"properties": {
"connection_id": {"type": "string"},
"host_id": {"type": "string", "format": "uuid"}
}
}

View File

@@ -0,0 +1,74 @@
, "/v2/client-routes":{
"get": {
"description":"List all client route entries",
"operationId":"get_client_routes",
"tags":["client_routes"],
"produces":[
"application/json"
],
"parameters":[],
"responses":{
"200":{
"schema":{
"type":"array",
"items":{ "$ref":"#/definitions/client_routes_entry" }
}
},
"default":{
"description":"unexpected error",
"schema":{"$ref":"#/definitions/ErrorModel"}
}
}
},
"post": {
"description":"Upsert one or more client route entries",
"operationId":"set_client_routes",
"tags":["client_routes"],
"parameters":[
{
"name":"body",
"in":"body",
"required":true,
"schema":{
"type":"array",
"items":{ "$ref":"#/definitions/client_routes_entry" }
}
}
],
"responses":{
"200":{ "description": "OK" },
"default":{
"description":"unexpected error",
"schema":{ "$ref":"#/definitions/ErrorModel" }
}
}
},
"delete": {
"description":"Delete one or more client route entries",
"operationId":"delete_client_routes",
"tags":["client_routes"],
"parameters":[
{
"name":"body",
"in":"body",
"required":true,
"schema":{
"type":"array",
"items":{ "$ref":"#/definitions/client_routes_key" }
}
}
],
"responses":{
"200":{
"description": "OK"
},
"default":{
"description":"unexpected error",
"schema":{
"$ref":"#/definitions/ErrorModel"
}
}
}
}
}

View File

@@ -3051,7 +3051,7 @@
},
{
"name":"incremental_mode",
"description":"Set the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to incremental mode.",
"description":"Set the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to 'disabled' mode.",
"required":false,
"allowMultiple":false,
"type":"string",

View File

@@ -37,6 +37,7 @@
#include "raft.hh"
#include "gms/gossip_address_map.hh"
#include "service_levels.hh"
#include "client_routes.hh"
logging::logger apilog("api");
@@ -67,9 +68,11 @@ future<> set_server_init(http_context& ctx) {
rb02->set_api_doc(r);
rb02->register_api_file(r, "swagger20_header");
rb02->register_api_file(r, "metrics");
rb02->register_api_file(r, "client_routes");
rb->register_function(r, "system",
"The system related API");
rb02->add_definitions_file(r, "metrics");
rb02->add_definitions_file(r, "client_routes");
set_system(ctx, r);
rb->register_function(r, "error_injection",
"The error injection API");
@@ -129,6 +132,16 @@ future<> unset_server_storage_service(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_storage_service(ctx, r); });
}
future<> set_server_client_routes(http_context& ctx, sharded<service::client_routes_service>& cr) {
return ctx.http_server.set_routes([&ctx, &cr] (routes& r) {
set_client_routes(ctx, r, cr);
});
}
future<> unset_server_client_routes(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_client_routes(ctx, r); });
}
future<> set_load_meter(http_context& ctx, service::load_meter& lm) {
return ctx.http_server.set_routes([&ctx, &lm] (routes& r) { set_load_meter(ctx, r, lm); });
}

View File

@@ -29,6 +29,7 @@ class storage_proxy;
class storage_service;
class raft_group0_client;
class raft_group_registry;
class client_routes_service;
} // namespace service
@@ -99,6 +100,8 @@ future<> set_server_snitch(http_context& ctx, sharded<locator::snitch_ptr>& snit
future<> unset_server_snitch(http_context& ctx);
future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, service::raft_group0_client&);
future<> unset_server_storage_service(http_context& ctx);
future<> set_server_client_routes(http_context& ctx, sharded<service::client_routes_service>& cr);
future<> unset_server_client_routes(http_context& ctx);
future<> set_server_sstables_loader(http_context& ctx, sharded<sstables_loader>& sst_loader);
future<> unset_server_sstables_loader(http_context& ctx);
future<> set_server_view_builder(http_context& ctx, sharded<db::view::view_builder>& vb, sharded<gms::gossiper>& g);

176
api/client_routes.cc Normal file
View File

@@ -0,0 +1,176 @@
/*
* Copyright (C) 2025-present ScyllaDB
*
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include <seastar/http/short_streams.hh>
#include "client_routes.hh"
#include "api/api.hh"
#include "service/storage_service.hh"
#include "service/client_routes.hh"
#include "utils/rjson.hh"
#include "api/api-doc/client_routes.json.hh"
using namespace seastar::httpd;
using namespace std::chrono_literals;
using namespace json;
extern logging::logger apilog;
namespace api {
static void validate_client_routes_endpoint(sharded<service::client_routes_service>& cr, sstring endpoint_name) {
if (!cr.local().get_feature_service().client_routes) {
apilog.warn("{}: called before the cluster feature was enabled", endpoint_name);
throw std::runtime_error(fmt::format("{} requires all nodes to support the CLIENT_ROUTES cluster feature", endpoint_name));
}
}
static sstring parse_string(const char* name, rapidjson::Value const& v) {
const auto it = v.FindMember(name);
if (it == v.MemberEnd()) {
throw bad_param_exception(fmt::format("Missing '{}'", name));
}
if (!it->value.IsString()) {
throw bad_param_exception(fmt::format("'{}' must be a string", name));
}
return {it->value.GetString(), it->value.GetStringLength()};
}
static std::optional<uint32_t> parse_port(const char* name, rapidjson::Value const& v) {
const auto it = v.FindMember(name);
if (it == v.MemberEnd()) {
return std::nullopt;
}
if (!it->value.IsInt()) {
throw bad_param_exception(fmt::format("'{}' must be an integer", name));
}
auto port = it->value.GetInt();
if (port < 1 || port > 65535) {
throw bad_param_exception(fmt::format("'{}' value={} is outside the allowed port range", name, port));
}
return port;
}
static std::vector<service::client_routes_service::client_route_entry> parse_set_client_array(const rapidjson::Document& root) {
if (!root.IsArray()) {
throw bad_param_exception("Body must be a JSON array");
}
std::vector<service::client_routes_service::client_route_entry> v;
v.reserve(root.GetArray().Size());
for (const auto& element : root.GetArray()) {
if (!element.IsObject()) { throw bad_param_exception("Each element must be object"); }
const auto port = parse_port("port", element);
const auto tls_port = parse_port("tls_port", element);
const auto alternator_port = parse_port("alternator_port", element);
const auto alternator_https_port = parse_port("alternator_https_port", element);
if (!port.has_value() && !tls_port.has_value() && !alternator_port.has_value() && !alternator_https_port.has_value()) {
throw bad_param_exception("At least one port field ('port', 'tls_port', 'alternator_port', 'alternator_https_port') must be specified");
}
v.emplace_back(
parse_string("connection_id", element),
utils::UUID{parse_string("host_id", element)},
parse_string("address", element),
port,
tls_port,
alternator_port,
alternator_https_port
);
}
return v;
}
static
future<json::json_return_type>
rest_set_client_routes(http_context& ctx, sharded<service::client_routes_service>& cr, std::unique_ptr<http::request> req) {
validate_client_routes_endpoint(cr, "rest_set_client_routes");
rapidjson::Document root;
auto content = co_await util::read_entire_stream_contiguous(*req->content_stream);
root.Parse(content.c_str());
co_await cr.local().set_client_routes(parse_set_client_array(root));
co_return seastar::json::json_void();
}
static std::vector<service::client_routes_service::client_route_key> parse_delete_client_array(const rapidjson::Document& root) {
if (!root.IsArray()) {
throw bad_param_exception("Body must be a JSON array");
}
std::vector<service::client_routes_service::client_route_key> v;
v.reserve(root.GetArray().Size());
for (const auto& element : root.GetArray()) {
v.emplace_back(
parse_string("connection_id", element),
utils::UUID{parse_string("host_id", element)}
);
}
return v;
}
static
future<json::json_return_type>
rest_delete_client_routes(http_context& ctx, sharded<service::client_routes_service>& cr, std::unique_ptr<http::request> req) {
validate_client_routes_endpoint(cr, "delete_client_routes");
rapidjson::Document root;
auto content = co_await util::read_entire_stream_contiguous(*req->content_stream);
root.Parse(content.c_str());
co_await cr.local().delete_client_routes(parse_delete_client_array(root));
co_return seastar::json::json_void();
}
static
future<json::json_return_type>
rest_get_client_routes(http_context& ctx, sharded<service::client_routes_service>& cr, std::unique_ptr<http::request> req) {
validate_client_routes_endpoint(cr, "get_client_routes");
co_return co_await cr.invoke_on(0, [] (service::client_routes_service& cr) -> future<json::json_return_type> {
co_return json::json_return_type(stream_range_as_array(co_await cr.get_client_routes(), [](const service::client_routes_service::client_route_entry & entry) {
seastar::httpd::client_routes_json::client_routes_entry obj;
obj.connection_id = entry.connection_id;
obj.host_id = fmt::to_string(entry.host_id);
obj.address = entry.address;
if (entry.port.has_value()) { obj.port = entry.port.value(); }
if (entry.tls_port.has_value()) { obj.tls_port = entry.tls_port.value(); }
if (entry.alternator_port.has_value()) { obj.alternator_port = entry.alternator_port.value(); }
if (entry.alternator_https_port.has_value()) { obj.alternator_https_port = entry.alternator_https_port.value(); }
return obj;
}));
});
}
void set_client_routes(http_context& ctx, routes& r, sharded<service::client_routes_service>& cr) {
seastar::httpd::client_routes_json::set_client_routes.set(r, [&ctx, &cr] (std::unique_ptr<seastar::http::request> req) {
return rest_set_client_routes(ctx, cr, std::move(req));
});
seastar::httpd::client_routes_json::delete_client_routes.set(r, [&ctx, &cr] (std::unique_ptr<seastar::http::request> req) {
return rest_delete_client_routes(ctx, cr, std::move(req));
});
seastar::httpd::client_routes_json::get_client_routes.set(r, [&ctx, &cr] (std::unique_ptr<seastar::http::request> req) {
return rest_get_client_routes(ctx, cr, std::move(req));
});
}
void unset_client_routes(http_context& ctx, routes& r) {
seastar::httpd::client_routes_json::set_client_routes.unset(r);
seastar::httpd::client_routes_json::delete_client_routes.unset(r);
seastar::httpd::client_routes_json::get_client_routes.unset(r);
}
}

20
api/client_routes.hh Normal file
View File

@@ -0,0 +1,20 @@
/*
* Copyright (C) 2025-present ScyllaDB
*
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include <seastar/core/sharded.hh>
#include <seastar/json/json_elements.hh>
#include "api/api_init.hh"
namespace api {
void set_client_routes(http_context& ctx, httpd::routes& r, sharded<service::client_routes_service>& cr);
void unset_client_routes(http_context& ctx, httpd::routes& r);
}

View File

@@ -547,17 +547,13 @@ void set_view_builder(http_context& ctx, routes& r, sharded<db::view::view_build
vp.insert(b.second);
}
}
std::vector<sstring> res;
replica::database& db = vb.local().get_db();
auto uuid = validate_table(db, ks, cf_name);
replica::column_family& cf = db.find_column_family(uuid);
res.reserve(cf.get_index_manager().list_indexes().size());
for (auto&& i : cf.get_index_manager().list_indexes()) {
if (vp.contains(secondary_index::index_table_name(i.metadata().name()))) {
res.emplace_back(i.metadata().name());
}
}
co_return res;
co_return cf.get_index_manager().list_indexes()
| std::views::transform([] (const auto& i) { return i.metadata().name(); })
| std::views::filter([&vp] (const auto& n) { return vp.contains(secondary_index::index_table_name(n)); })
| std::ranges::to<std::vector>();
});
}

View File

@@ -9,7 +9,6 @@
#include "auth/allow_all_authenticator.hh"
#include "service/migration_manager.hh"
#include "utils/alien_worker.hh"
#include "utils/class_registrator.hh"
namespace auth {
@@ -23,7 +22,6 @@ static const class_registrator<
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&,
cache&,
utils::alien_worker&> registration("org.apache.cassandra.auth.AllowAllAuthenticator");
cache&> registration("org.apache.cassandra.auth.AllowAllAuthenticator");
}

View File

@@ -14,7 +14,6 @@
#include "auth/authenticator.hh"
#include "auth/cache.hh"
#include "auth/common.hh"
#include "utils/alien_worker.hh"
namespace cql3 {
class query_processor;
@@ -30,7 +29,7 @@ extern const std::string_view allow_all_authenticator_name;
class allow_all_authenticator final : public authenticator {
public:
allow_all_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, cache&, utils::alien_worker&) {
allow_all_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, cache&) {
}
virtual future<> start() override {

View File

@@ -15,6 +15,7 @@
#include "db/system_keyspace.hh"
#include "schema/schema.hh"
#include <iterator>
#include <seastar/core/abort_source.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include <seastar/core/format.hh>
@@ -22,9 +23,11 @@ namespace auth {
logging::logger logger("auth-cache");
cache::cache(cql3::query_processor& qp) noexcept
cache::cache(cql3::query_processor& qp, abort_source& as) noexcept
: _current_version(0)
, _qp(qp) {
, _qp(qp)
, _loading_sem(1)
, _as(as) {
}
lw_shared_ptr<const cache::role_record> cache::get(const role_name_t& role) const noexcept {
@@ -116,6 +119,8 @@ future<> cache::load_all() {
co_return;
}
SCYLLA_ASSERT(this_shard_id() == 0);
auto units = co_await get_units(_loading_sem, 1, _as);
++_current_version;
logger.info("Loading all roles");
@@ -146,6 +151,9 @@ future<> cache::load_roles(std::unordered_set<role_name_t> roles) {
if (legacy_mode(_qp)) {
co_return;
}
SCYLLA_ASSERT(this_shard_id() == 0);
auto units = co_await get_units(_loading_sem, 1, _as);
for (const auto& name : roles) {
logger.info("Loading role {}", name);
auto role = co_await fetch_role(name);

View File

@@ -8,6 +8,7 @@
#pragma once
#include <seastar/core/abort_source.hh>
#include <unordered_set>
#include <unordered_map>
@@ -15,6 +16,7 @@
#include <seastar/core/future.hh>
#include <seastar/core/sharded.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/semaphore.hh>
#include <absl/container/flat_hash_map.h>
@@ -41,7 +43,7 @@ public:
version_tag_t version; // used for seamless cache reloads
};
explicit cache(cql3::query_processor& qp) noexcept;
explicit cache(cql3::query_processor& qp, abort_source& as) noexcept;
lw_shared_ptr<const role_record> get(const role_name_t& role) const noexcept;
future<> load_all();
future<> load_roles(std::unordered_set<role_name_t> roles);
@@ -52,6 +54,8 @@ private:
roles_map _roles;
version_tag_t _current_version;
cql3::query_processor& _qp;
semaphore _loading_sem;
abort_source& _as;
future<lw_shared_ptr<role_record>> fetch_role(const role_name_t& role) const;
future<> prune_all() noexcept;

View File

@@ -35,14 +35,13 @@ static const class_registrator<auth::authenticator
, cql3::query_processor&
, ::service::raft_group0_client&
, ::service::migration_manager&
, auth::cache&
, utils::alien_worker&> cert_auth_reg(CERT_AUTH_NAME);
, auth::cache&> cert_auth_reg(CERT_AUTH_NAME);
enum class auth::certificate_authenticator::query_source {
subject, altname
};
auth::certificate_authenticator::certificate_authenticator(cql3::query_processor& qp, ::service::raft_group0_client&, ::service::migration_manager&, auth::cache&, utils::alien_worker&)
auth::certificate_authenticator::certificate_authenticator(cql3::query_processor& qp, ::service::raft_group0_client&, ::service::migration_manager&, auth::cache&)
: _queries([&] {
auto& conf = qp.db().get_config();
auto queries = conf.auth_certificate_role_queries();
@@ -77,9 +76,9 @@ auth::certificate_authenticator::certificate_authenticator(cql3::query_processor
throw std::invalid_argument(fmt::format("Invalid source: {}", map.at(cfg_source_attr)));
}
continue;
} catch (std::out_of_range&) {
} catch (const std::out_of_range&) {
// just fallthrough
} catch (boost::regex_error&) {
} catch (const boost::regex_error&) {
std::throw_with_nested(std::invalid_argument(fmt::format("Invalid query expression: {}", map.at(cfg_query_attr))));
}
}

View File

@@ -10,7 +10,6 @@
#pragma once
#include "auth/authenticator.hh"
#include "utils/alien_worker.hh"
#include <boost/regex_fwd.hpp> // IWYU pragma: keep
namespace cql3 {
@@ -34,7 +33,7 @@ class certificate_authenticator : public authenticator {
enum class query_source;
std::vector<std::pair<query_source, boost::regex>> _queries;
public:
certificate_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, cache&, utils::alien_worker&);
certificate_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, cache&);
~certificate_authenticator();
future<> start() override;

View File

@@ -94,7 +94,7 @@ static future<> create_legacy_metadata_table_if_missing_impl(
try {
co_return co_await mm.announce(co_await ::service::prepare_new_column_family_announcement(qp.proxy(), table, ts),
std::move(group0_guard), format("auth: create {} metadata table", table->cf_name()));
} catch (exceptions::already_exists_exception&) {}
} catch (const exceptions::already_exists_exception&) {}
}
}

View File

@@ -256,7 +256,7 @@ future<> default_authorizer::revoke_all(std::string_view role_name, ::service::g
} else {
co_await collect_mutations(_qp, mc, query, {sstring(role_name)});
}
} catch (exceptions::request_execution_exception& e) {
} catch (const exceptions::request_execution_exception& e) {
alogger.warn("CassandraAuthorizer failed to revoke all permissions of {}: {}", role_name, e);
}
}
@@ -293,13 +293,13 @@ future<> default_authorizer::revoke_all_legacy(const resource& resource) {
[resource](auto ep) {
try {
std::rethrow_exception(ep);
} catch (exceptions::request_execution_exception& e) {
} catch (const exceptions::request_execution_exception& e) {
alogger.warn("CassandraAuthorizer failed to revoke all permissions on {}: {}", resource, e);
}
});
});
} catch (exceptions::request_execution_exception& e) {
} catch (const exceptions::request_execution_exception& e) {
alogger.warn("CassandraAuthorizer failed to revoke all permissions on {}: {}", resource, e);
return make_ready_future();
}

View File

@@ -49,8 +49,7 @@ static const class_registrator<
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&,
cache&,
utils::alien_worker&> password_auth_reg("org.apache.cassandra.auth.PasswordAuthenticator");
cache&> password_auth_reg("org.apache.cassandra.auth.PasswordAuthenticator");
static thread_local auto rng_for_salt = std::default_random_engine(std::random_device{}());
@@ -64,14 +63,13 @@ std::string password_authenticator::default_superuser(const db::config& cfg) {
password_authenticator::~password_authenticator() {
}
password_authenticator::password_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm, cache& cache, utils::alien_worker& hashing_worker)
password_authenticator::password_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm, cache& cache)
: _qp(qp)
, _group0_client(g0)
, _migration_manager(mm)
, _cache(cache)
, _stopped(make_ready_future<>())
, _superuser(default_superuser(qp.db().get_config()))
, _hashing_worker(hashing_worker)
{}
static bool has_salted_hash(const cql3::untyped_result_set_row& row) {
@@ -330,20 +328,18 @@ future<authenticated_user> password_authenticator::authenticate(
}
salted_hash = role->salted_hash;
}
const bool password_match = co_await _hashing_worker.submit<bool>([password = std::move(password), salted_hash] {
return passwords::check(password, *salted_hash);
});
const bool password_match = co_await passwords::check(password, *salted_hash);
if (!password_match) {
throw exceptions::authentication_exception("Username and/or password are incorrect");
}
co_return username;
} catch (std::system_error &) {
} catch (const std::system_error &) {
std::throw_with_nested(exceptions::authentication_exception("Could not verify password"));
} catch (exceptions::request_execution_exception& e) {
} catch (const exceptions::request_execution_exception& e) {
std::throw_with_nested(exceptions::authentication_exception(e.what()));
} catch (exceptions::authentication_exception& e) {
} catch (const exceptions::authentication_exception& e) {
std::throw_with_nested(e);
} catch (exceptions::unavailable_exception& e) {
} catch (const exceptions::unavailable_exception& e) {
std::throw_with_nested(exceptions::authentication_exception(e.get_message()));
} catch (...) {
std::throw_with_nested(exceptions::authentication_exception("authentication failed"));

View File

@@ -18,7 +18,6 @@
#include "auth/passwords.hh"
#include "auth/cache.hh"
#include "service/raft/raft_group0_client.hh"
#include "utils/alien_worker.hh"
namespace db {
class config;
@@ -49,13 +48,12 @@ class password_authenticator : public authenticator {
shared_promise<> _superuser_created_promise;
// We used to also support bcrypt, SHA-256, and MD5 (ref. scylladb#24524).
constexpr static auth::passwords::scheme _scheme = passwords::scheme::sha_512;
utils::alien_worker& _hashing_worker;
public:
static db::consistency_level consistency_for_user(std::string_view role_name);
static std::string default_superuser(const db::config&);
password_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, cache&, utils::alien_worker&);
password_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, cache&);
~password_authenticator();

View File

@@ -7,6 +7,8 @@
*/
#include "auth/passwords.hh"
#include "utils/crypt_sha512.hh"
#include <seastar/core/coroutine.hh>
#include <cerrno>
@@ -21,27 +23,48 @@ static thread_local crypt_data tlcrypt = {};
namespace detail {
void verify_hashing_output(const char * res) {
if (!res || (res[0] == '*')) {
throw std::system_error(errno, std::system_category());
}
}
void verify_scheme(scheme scheme) {
const sstring random_part_of_salt = "aaaabbbbccccdddd";
const sstring salt = sstring(prefix_for_scheme(scheme)) + random_part_of_salt;
const char* e = crypt_r("fisk", salt.c_str(), &tlcrypt);
if (e && (e[0] != '*')) {
return;
try {
verify_hashing_output(e);
} catch (const std::system_error& ex) {
throw no_supported_schemes();
}
throw no_supported_schemes();
}
sstring hash_with_salt(const sstring& pass, const sstring& salt) {
auto res = crypt_r(pass.c_str(), salt.c_str(), &tlcrypt);
if (!res || (res[0] == '*')) {
throw std::system_error(errno, std::system_category());
}
verify_hashing_output(res);
return res;
}
seastar::future<sstring> hash_with_salt_async(const sstring& pass, const sstring& salt) {
sstring res;
// Only SHA-512 hashes for passphrases shorter than 256 bytes can be computed using
// the __crypt_sha512 method. For other computations, we fall back to the
// crypt_r implementation from `<crypt.h>`, which can stall.
if (salt.starts_with(prefix_for_scheme(scheme::sha_512)) && pass.size() <= 255) {
char buf[128];
const char * output_ptr = co_await __crypt_sha512(pass.c_str(), salt.c_str(), buf);
verify_hashing_output(output_ptr);
res = output_ptr;
} else {
const char * output_ptr = crypt_r(pass.c_str(), salt.c_str(), &tlcrypt);
verify_hashing_output(output_ptr);
res = output_ptr;
}
co_return res;
}
std::string_view prefix_for_scheme(scheme c) noexcept {
switch (c) {
case scheme::bcrypt_y: return "$2y$";
@@ -58,8 +81,9 @@ no_supported_schemes::no_supported_schemes()
: std::runtime_error("No allowed hashing schemes are supported on this system") {
}
bool check(const sstring& pass, const sstring& salted_hash) {
return detail::hash_with_salt(pass, salted_hash) == salted_hash;
seastar::future<bool> check(const sstring& pass, const sstring& salted_hash) {
const auto pwd_hash = co_await detail::hash_with_salt_async(pass, salted_hash);
co_return pwd_hash == salted_hash;
}
} // namespace auth::passwords

View File

@@ -11,6 +11,7 @@
#include <random>
#include <stdexcept>
#include <seastar/core/future.hh>
#include <seastar/core/sstring.hh>
#include "seastarx.hh"
@@ -75,10 +76,19 @@ sstring generate_salt(RandomNumberEngine& g, scheme scheme) {
///
/// Hash a password combined with an implementation-specific salt string.
/// Deprecated in favor of `hash_with_salt_async`.
///
/// \throws \ref std::system_error when an unexpected implementation-specific error occurs.
///
sstring hash_with_salt(const sstring& pass, const sstring& salt);
[[deprecated("Use hash_with_salt_async instead")]] sstring hash_with_salt(const sstring& pass, const sstring& salt);
///
/// Async version of `hash_with_salt` that returns a future.
/// If possible, hashing uses `coroutine::maybe_yield` to prevent reactor stalls.
///
/// \throws \ref std::system_error when an unexpected implementation-specific error occurs.
///
seastar::future<sstring> hash_with_salt_async(const sstring& pass, const sstring& salt);
} // namespace detail
@@ -107,6 +117,6 @@ sstring hash(const sstring& pass, RandomNumberEngine& g, scheme scheme) {
///
/// \throws \ref std::system_error when an unexpected implementation-specific error occurs.
///
bool check(const sstring& pass, const sstring& salted_hash);
seastar::future<bool> check(const sstring& pass, const sstring& salted_hash);
} // namespace auth::passwords

View File

@@ -35,10 +35,9 @@ static const class_registrator<
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&,
cache&,
utils::alien_worker&> saslauthd_auth_reg("com.scylladb.auth.SaslauthdAuthenticator");
cache&> saslauthd_auth_reg("com.scylladb.auth.SaslauthdAuthenticator");
saslauthd_authenticator::saslauthd_authenticator(cql3::query_processor& qp, ::service::raft_group0_client&, ::service::migration_manager&, cache&, utils::alien_worker&)
saslauthd_authenticator::saslauthd_authenticator(cql3::query_processor& qp, ::service::raft_group0_client&, ::service::migration_manager&, cache&)
: _socket_path(qp.db().get_config().saslauthd_socket_path())
{}

View File

@@ -12,7 +12,6 @@
#include "auth/authenticator.hh"
#include "auth/cache.hh"
#include "utils/alien_worker.hh"
namespace cql3 {
class query_processor;
@@ -30,7 +29,7 @@ namespace auth {
class saslauthd_authenticator : public authenticator {
sstring _socket_path; ///< Path to the domain socket on which saslauthd is listening.
public:
saslauthd_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, cache&,utils::alien_worker&);
saslauthd_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, cache&);
future<> start() override;

View File

@@ -191,8 +191,7 @@ service::service(
::service::migration_manager& mm,
const service_config& sc,
maintenance_socket_enabled used_by_maintenance_socket,
cache& cache,
utils::alien_worker& hashing_worker)
cache& cache)
: service(
std::move(c),
cache,
@@ -200,7 +199,7 @@ service::service(
g0,
mn,
create_object<authorizer>(sc.authorizer_java_name, qp, g0, mm),
create_object<authenticator>(sc.authenticator_java_name, qp, g0, mm, cache, hashing_worker),
create_object<authenticator>(sc.authenticator_java_name, qp, g0, mm, cache),
create_object<role_manager>(sc.role_manager_java_name, qp, g0, mm, cache),
used_by_maintenance_socket) {
}
@@ -226,7 +225,7 @@ future<> service::create_legacy_keyspace_if_missing(::service::migration_manager
try {
co_return co_await mm.announce(::service::prepare_new_keyspace_announcement(db.real_database(), ksm, ts),
std::move(group0_guard), seastar::format("auth_service: create {} keyspace", meta::legacy::AUTH_KS));
} catch (::service::group0_concurrent_modification&) {
} catch (const ::service::group0_concurrent_modification&) {
log.info("Concurrent operation is detected while creating {} keyspace, retrying.", meta::legacy::AUTH_KS);
}
}

View File

@@ -27,7 +27,6 @@
#include "cql3/description.hh"
#include "seastarx.hh"
#include "service/raft/raft_group0_client.hh"
#include "utils/alien_worker.hh"
#include "utils/observable.hh"
#include "utils/serialized_action.hh"
#include "service/maintenance_mode.hh"
@@ -131,8 +130,7 @@ public:
::service::migration_manager&,
const service_config&,
maintenance_socket_enabled,
cache&,
utils::alien_worker&);
cache&);
future<> start(::service::migration_manager&, db::system_keyspace&);

View File

@@ -192,7 +192,7 @@ future<> standard_role_manager::legacy_create_default_role_if_missing() {
{_superuser},
cql3::query_processor::cache_internal::no).discard_result();
log.info("Created default superuser role '{}'.", _superuser);
} catch(const exceptions::unavailable_exception& e) {
} catch (const exceptions::unavailable_exception& e) {
log.warn("Skipped default role setup: some nodes were not ready; will retry");
throw e;
}

View File

@@ -38,8 +38,8 @@ class transitional_authenticator : public authenticator {
public:
static const sstring PASSWORD_AUTHENTICATOR_NAME;
transitional_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm, cache& cache, utils::alien_worker& hashing_worker)
: transitional_authenticator(std::make_unique<password_authenticator>(qp, g0, mm, cache, hashing_worker)) {
transitional_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm, cache& cache)
: transitional_authenticator(std::make_unique<password_authenticator>(qp, g0, mm, cache)) {
}
transitional_authenticator(std::unique_ptr<authenticator> a)
: _authenticator(std::move(a)) {
@@ -81,7 +81,7 @@ public:
}).handle_exception([](auto ep) {
try {
std::rethrow_exception(ep);
} catch (exceptions::authentication_exception&) {
} catch (const exceptions::authentication_exception&) {
// return anon user
return make_ready_future<authenticated_user>(anonymous_user());
}
@@ -126,7 +126,7 @@ public:
virtual bytes evaluate_response(bytes_view client_response) override {
try {
return _sasl->evaluate_response(client_response);
} catch (exceptions::authentication_exception&) {
} catch (const exceptions::authentication_exception&) {
_complete = true;
return {};
}
@@ -141,7 +141,7 @@ public:
return _sasl->get_authenticated_user().handle_exception([](auto ep) {
try {
std::rethrow_exception(ep);
} catch (exceptions::authentication_exception&) {
} catch (const exceptions::authentication_exception&) {
// return anon user
return make_ready_future<authenticated_user>(anonymous_user());
}
@@ -241,8 +241,7 @@ static const class_registrator<
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&,
auth::cache&,
utils::alien_worker&> transitional_authenticator_reg(auth::PACKAGE_NAME + "TransitionalAuthenticator");
auth::cache&> transitional_authenticator_reg(auth::PACKAGE_NAME + "TransitionalAuthenticator");
static const class_registrator<
auth::authorizer,

View File

@@ -10,7 +10,9 @@
#include <seastar/net/inet_address.hh>
#include <seastar/core/sstring.hh>
#include "seastarx.hh"
#include "utils/loading_shared_values.hh"
#include <list>
#include <optional>
enum class client_type {
@@ -27,6 +29,20 @@ enum class client_connection_stage {
ready,
};
// We implement a keys cache using a map-like utils::loading_shared_values container by storing empty values.
struct options_cache_value_type {};
using client_options_cache_type = utils::loading_shared_values<sstring, options_cache_value_type>;
using client_options_cache_entry_type = client_options_cache_type::entry_ptr;
using client_options_cache_key_type = client_options_cache_type::key_type;
// This struct represents a single OPTION key-value pair from the client's connection options.
// Both key and value are represented by corresponding "references" to their cached values.
// Each "reference" is effectively a lw_shared_ptr value.
struct client_option_key_value_cached_entry {
client_options_cache_entry_type key;
client_options_cache_entry_type value;
};
sstring to_string(client_connection_stage ct);
// Representation of a row in `system.clients'. std::optionals are for nullable cells.
@@ -37,8 +53,8 @@ struct client_data {
client_connection_stage connection_stage = client_connection_stage::established;
int32_t shard_id; /// ID of server-side shard which is processing the connection.
std::optional<sstring> driver_name;
std::optional<sstring> driver_version;
std::optional<client_options_cache_entry_type> driver_name;
std::optional<client_options_cache_entry_type> driver_version;
std::optional<sstring> hostname;
std::optional<int32_t> protocol_version;
std::optional<sstring> ssl_cipher_suite;
@@ -46,6 +62,7 @@ struct client_data {
std::optional<sstring> ssl_protocol;
std::optional<sstring> username;
std::optional<sstring> scheduling_group_name;
std::list<client_option_key_value_cached_entry> client_options;
sstring stage_str() const { return to_string(connection_stage); }
sstring client_type_str() const { return to_string(ct); }

View File

@@ -125,10 +125,6 @@ if(target_arch)
add_compile_options("-march=${target_arch}")
endif()
if(CMAKE_CXX_COMPILER_ID STREQUAL "Clang")
add_compile_options("SHELL:-Xclang -fexperimental-assignment-tracking=disabled")
endif()
function(maybe_limit_stack_usage_in_KB stack_usage_threshold_in_KB config)
math(EXPR _stack_usage_threshold_in_bytes "${stack_usage_threshold_in_KB} * 1024")
set(_stack_usage_threshold_flag "-Wstack-usage=${_stack_usage_threshold_in_bytes}")

View File

@@ -12,6 +12,7 @@
#include <seastar/core/condition-variable.hh>
#include "schema/schema_fwd.hh"
#include "sstables/open_info.hh"
#include "compaction_descriptor.hh"
class reader_permit;
@@ -44,7 +45,7 @@ public:
virtual compaction_strategy_state& get_compaction_strategy_state() noexcept = 0;
virtual reader_permit make_compaction_reader_permit() const = 0;
virtual sstables::sstables_manager& get_sstables_manager() noexcept = 0;
virtual sstables::shared_sstable make_sstable() const = 0;
virtual sstables::shared_sstable make_sstable(sstables::sstable_state) const = 0;
virtual sstables::sstable_writer_config configure_writer(sstring origin) const = 0;
virtual api::timestamp_type min_memtable_timestamp() const = 0;
virtual api::timestamp_type min_memtable_live_timestamp() const = 0;

View File

@@ -416,7 +416,9 @@ future<compaction_result> compaction_task_executor::compact_sstables(compaction_
descriptor.enable_garbage_collection(co_await sstable_set_for_tombstone_gc(t));
}
descriptor.creator = [&t] (shard_id) {
return t.make_sstable();
// All compaction types going through this path will work on normal input sstables only.
// Off-strategy, for example, waits until the sstables move out of staging state.
return t.make_sstable(sstables::sstable_state::normal);
};
descriptor.replacer = [this, &t, &on_replace, offstrategy] (compaction_completion_desc desc) {
t.get_compaction_strategy().notify_completion(t, desc.old_sstables, desc.new_sstables);
@@ -1847,6 +1849,10 @@ protected:
throw make_compaction_stopped_exception();
}
}, false);
if (utils::get_local_injector().is_enabled("split_sstable_force_stop_exception")) {
throw make_compaction_stopped_exception();
}
co_return co_await do_rewrite_sstable(std::move(sst));
}
};
@@ -2284,12 +2290,16 @@ future<compaction_manager::compaction_stats_opt> compaction_manager::perform_spl
}
future<std::vector<sstables::shared_sstable>>
compaction_manager::maybe_split_sstable(sstables::shared_sstable sst, compaction_group_view& t, compaction_type_options::split opt) {
compaction_manager::maybe_split_new_sstable(sstables::shared_sstable sst, compaction_group_view& t, compaction_type_options::split opt) {
if (!split_compaction_task_executor::sstable_needs_split(sst, opt)) {
co_return std::vector<sstables::shared_sstable>{sst};
}
if (!can_proceed(&t)) {
co_return std::vector<sstables::shared_sstable>{sst};
// Throw an error if split cannot be performed due to e.g. out of space prevention.
// We don't want to prevent split because compaction is temporarily disabled on a view only for synchronization,
// which is uneeded against new sstables that aren't part of any set yet, so never use can_proceed(&t) here.
if (is_disabled()) {
co_return coroutine::exception(std::make_exception_ptr(std::runtime_error(format("Cannot split {} because manager has compaction disabled, " \
"reason might be out of space prevention", sst->get_filename()))));
}
std::vector<sstables::shared_sstable> ret;
@@ -2297,8 +2307,11 @@ compaction_manager::maybe_split_sstable(sstables::shared_sstable sst, compaction
compaction_progress_monitor monitor;
compaction_data info = create_compaction_data();
compaction_descriptor desc = split_compaction_task_executor::make_descriptor(sst, opt);
desc.creator = [&t] (shard_id _) {
return t.make_sstable();
desc.creator = [&t, sst] (shard_id _) {
// NOTE: preserves the sstable state, since we want the output to be on the same state as the original.
// For example, if base table has views, it's important that sstable produced by repair will be
// in the staging state.
return t.make_sstable(sst->state());
};
desc.replacer = [&] (compaction_completion_desc d) {
std::move(d.new_sstables.begin(), d.new_sstables.end(), std::back_inserter(ret));

View File

@@ -376,7 +376,8 @@ public:
// Splits a single SSTable by segregating all its data according to the classifier.
// If SSTable doesn't need split, the same input SSTable is returned as output.
// If SSTable needs split, then output SSTables are returned and the input SSTable is deleted.
future<std::vector<sstables::shared_sstable>> maybe_split_sstable(sstables::shared_sstable sst, compaction_group_view& t, compaction_type_options::split opt);
// Exception is thrown if the input sstable cannot be split due to e.g. out of space prevention.
future<std::vector<sstables::shared_sstable>> maybe_split_new_sstable(sstables::shared_sstable sst, compaction_group_view& t, compaction_type_options::split opt);
// Run a custom job for a given table, defined by a function
// it completes when future returned by job is ready or returns immediately

View File

@@ -368,6 +368,87 @@ def find_ninja():
sys.exit(1)
def find_compiler(name):
"""
Find a compiler by name, skipping ccache wrapper directories.
This is useful when using sccache to avoid double-caching through ccache.
Args:
name: The compiler name (e.g., 'clang++', 'clang', 'gcc')
Returns:
Path to the compiler, skipping ccache directories, or None if not found.
"""
ccache_dirs = {'/usr/lib/ccache', '/usr/lib64/ccache'}
for path_dir in os.environ.get('PATH', '').split(os.pathsep):
# Skip ccache wrapper directories
if os.path.realpath(path_dir) in ccache_dirs or path_dir in ccache_dirs:
continue
candidate = os.path.join(path_dir, name)
if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
return candidate
return None
def resolve_compilers_for_sccache(args, compiler_cache):
"""
When using sccache, resolve compiler paths to avoid ccache directories.
This prevents double-caching when ccache symlinks are in PATH.
Args:
args: The argument namespace with cc and cxx attributes.
compiler_cache: Path to the compiler cache binary, or None.
"""
if not compiler_cache or 'sccache' not in compiler_cache:
return
if not os.path.isabs(args.cxx):
real_cxx = find_compiler(args.cxx)
if real_cxx:
args.cxx = real_cxx
if not os.path.isabs(args.cc):
real_cc = find_compiler(args.cc)
if real_cc:
args.cc = real_cc
def find_compiler_cache(preference):
"""
Find a compiler cache based on the preference.
Args:
preference: One of 'auto', 'sccache', 'ccache', 'none', or a path to a binary.
Returns:
Path to the compiler cache binary, or None if not found/disabled.
"""
if preference == 'none':
return None
if preference == 'auto':
# Prefer sccache over ccache
for cache in ['sccache', 'ccache']:
path = which(cache)
if path:
return path
return None
if preference in ('sccache', 'ccache'):
path = which(preference)
if path:
return path
print(f"Warning: {preference} not found on PATH, disabling compiler cache")
return None
# Assume it's a path to a binary
if os.path.isfile(preference) and os.access(preference, os.X_OK):
return preference
print(f"Warning: compiler cache '{preference}' not found or not executable, disabling compiler cache")
return None
modes = {
'debug': {
'cxxflags': '-DDEBUG -DSANITIZE -DDEBUG_LSA_SANITIZER -DSCYLLA_ENABLE_ERROR_INJECTION',
@@ -732,6 +813,8 @@ arg_parser.add_argument('--compiler', action='store', dest='cxx', default='clang
help='C++ compiler path')
arg_parser.add_argument('--c-compiler', action='store', dest='cc', default='clang',
help='C compiler path')
arg_parser.add_argument('--compiler-cache', action='store', dest='compiler_cache', default='auto',
help='Compiler cache to use: auto (default, prefers sccache), sccache, ccache, none, or a path to a binary')
add_tristate(arg_parser, name='dpdk', dest='dpdk', default=False,
help='Use dpdk (from seastar dpdk sources)')
arg_parser.add_argument('--dpdk-target', action='store', dest='dpdk_target', default='',
@@ -859,6 +942,7 @@ scylla_core = (['message/messaging_service.cc',
'utils/alien_worker.cc',
'utils/array-search.cc',
'utils/base64.cc',
'utils/crypt_sha512.cc',
'utils/logalloc.cc',
'utils/large_bitset.cc',
'utils/buffer_input_stream.cc',
@@ -1157,6 +1241,7 @@ scylla_core = (['message/messaging_service.cc',
'locator/topology.cc',
'locator/util.cc',
'service/client_state.cc',
'service/client_routes.cc',
'service/storage_service.cc',
'service/session.cc',
'service/task_manager_module.cc',
@@ -1317,6 +1402,8 @@ api = ['api/api.cc',
'api/storage_proxy.cc',
Json2Code('api/api-doc/cache_service.json'),
'api/cache_service.cc',
Json2Code('api/api-doc/client_routes.json'),
'api/client_routes.cc',
Json2Code('api/api-doc/collectd.json'),
'api/collectd.cc',
Json2Code('api/api-doc/endpoint_snitch_info.json'),
@@ -1479,7 +1566,6 @@ deps = {
pure_boost_tests = set([
'test/boost/anchorless_list_test',
'test/boost/auth_passwords_test',
'test/boost/auth_resource_test',
'test/boost/big_decimal_test',
'test/boost/caching_options_test',
@@ -1695,6 +1781,18 @@ deps['test/vector_search/vector_store_client_test'] = ['test/vector_search/vect
deps['test/vector_search/load_balancer_test'] = ['test/vector_search/load_balancer_test.cc'] + scylla_tests_dependencies
deps['test/vector_search/client_test'] = ['test/vector_search/client_test.cc'] + scylla_tests_dependencies
boost_tests_prefixes = ["test/boost/", "test/vector_search/", "test/raft/", "test/manual/", "test/ldap/"]
# We need to link these files to all Boost tests to make sure that
# we can execute `--list_json_content` on them. That will produce
# a similar result as calling `--list_content={HRF,DOT}`.
# Unfortunately, to be able to do that, we're forced to link the
# relevant code by hand.
for key in deps.keys():
for prefix in boost_tests_prefixes:
if key.startswith(prefix):
deps[key] += ["test/lib/boost_tree_lister_injector.cc", "test/lib/boost_test_tree_lister.cc"]
wasm_deps = {}
wasm_deps['wasm/return_input.wat'] = 'test/resource/wasm/rust/return_input.rs'
@@ -1999,7 +2097,7 @@ def semicolon_separated(*flags):
def real_relpath(path, start):
return os.path.relpath(os.path.realpath(path), os.path.realpath(start))
def configure_seastar(build_dir, mode, mode_config):
def configure_seastar(build_dir, mode, mode_config, compiler_cache=None):
seastar_cxx_ld_flags = mode_config['cxx_ld_flags']
# We want to "undo" coverage for seastar if we have it enabled.
if args.coverage:
@@ -2046,6 +2144,10 @@ def configure_seastar(build_dir, mode, mode_config):
'-DSeastar_IO_URING=ON',
]
if compiler_cache:
seastar_cmake_args += [f'-DCMAKE_CXX_COMPILER_LAUNCHER={compiler_cache}',
f'-DCMAKE_C_COMPILER_LAUNCHER={compiler_cache}']
if args.stack_guards is not None:
stack_guards = 'ON' if args.stack_guards else 'OFF'
seastar_cmake_args += ['-DSeastar_STACK_GUARDS={}'.format(stack_guards)]
@@ -2077,7 +2179,7 @@ def configure_seastar(build_dir, mode, mode_config):
subprocess.check_call(seastar_cmd, shell=False, cwd=cmake_dir)
def configure_abseil(build_dir, mode, mode_config):
def configure_abseil(build_dir, mode, mode_config, compiler_cache=None):
abseil_cflags = mode_config['lib_cflags']
cxx_flags = mode_config['cxxflags']
if '-DSANITIZE' in cxx_flags:
@@ -2103,6 +2205,10 @@ def configure_abseil(build_dir, mode, mode_config):
'-DABSL_PROPAGATE_CXX_STD=ON',
]
if compiler_cache:
abseil_cmake_args += [f'-DCMAKE_CXX_COMPILER_LAUNCHER={compiler_cache}',
f'-DCMAKE_C_COMPILER_LAUNCHER={compiler_cache}']
cmake_args = abseil_cmake_args[:]
abseil_build_dir = os.path.join(build_dir, mode, 'abseil')
abseil_cmd = ['cmake', '-G', 'Ninja', real_relpath('abseil', abseil_build_dir)] + cmake_args
@@ -2248,15 +2354,6 @@ def get_extra_cxxflags(mode, mode_config, cxx, debuginfo):
if debuginfo and mode_config['can_have_debug_info']:
cxxflags += ['-g', '-gz']
if 'clang' in cxx:
# Since AssignmentTracking was enabled by default in clang
# (llvm/llvm-project@de6da6ad55d3ca945195d1cb109cb8efdf40a52a)
# coroutine frame debugging info (`coro_frame_ty`) is broken.
#
# It seems that we aren't losing much by disabling AssigmentTracking,
# so for now we choose to disable it to get `coro_frame_ty` back.
cxxflags.append('-Xclang -fexperimental-assignment-tracking=disabled')
return cxxflags
@@ -2284,10 +2381,15 @@ def write_build_file(f,
scylla_product,
scylla_version,
scylla_release,
compiler_cache,
args):
use_precompiled_header = not args.disable_precompiled_header
warnings = get_warning_options(args.cxx)
rustc_target = pick_rustc_target('wasm32-wasi', 'wasm32-wasip1')
# If compiler cache is available, prefix the compiler with it
cxx_with_cache = f'{compiler_cache} {args.cxx}' if compiler_cache else args.cxx
# For Rust, sccache is used via RUSTC_WRAPPER environment variable
rustc_wrapper = f'RUSTC_WRAPPER={compiler_cache} ' if compiler_cache and 'sccache' in compiler_cache else ''
f.write(textwrap.dedent('''\
configure_args = {configure_args}
builddir = {outdir}
@@ -2350,7 +2452,7 @@ def write_build_file(f,
command = clang --target=wasm32 --no-standard-libraries -Wl,--export-all -Wl,--no-entry $in -o $out
description = C2WASM $out
rule rust2wasm
command = cargo build --target={rustc_target} --example=$example --locked --manifest-path=test/resource/wasm/rust/Cargo.toml --target-dir=$builddir/wasm/ $
command = {rustc_wrapper}cargo build --target={rustc_target} --example=$example --locked --manifest-path=test/resource/wasm/rust/Cargo.toml --target-dir=$builddir/wasm/ $
&& wasm-opt -Oz $builddir/wasm/{rustc_target}/debug/examples/$example.wasm -o $builddir/wasm/$example.wasm $
&& wasm-strip $builddir/wasm/$example.wasm
description = RUST2WASM $out
@@ -2366,7 +2468,7 @@ def write_build_file(f,
command = llvm-profdata merge $in -output=$out
''').format(configure_args=configure_args,
outdir=outdir,
cxx=args.cxx,
cxx=cxx_with_cache,
user_cflags=user_cflags,
warnings=warnings,
defines=defines,
@@ -2374,6 +2476,7 @@ def write_build_file(f,
user_ldflags=user_ldflags,
libs=libs,
rustc_target=rustc_target,
rustc_wrapper=rustc_wrapper,
link_pool_depth=link_pool_depth,
seastar_path=args.seastar_path,
ninja=ninja,
@@ -2458,10 +2561,10 @@ def write_build_file(f,
description = TEST {mode}
# This rule is unused for PGO stages. They use the rust lib from the parent mode.
rule rust_lib.{mode}
command = CARGO_BUILD_DEP_INFO_BASEDIR='.' cargo build --locked --manifest-path=rust/Cargo.toml --target-dir=$builddir/{mode} --profile=rust-{mode} $
command = CARGO_BUILD_DEP_INFO_BASEDIR='.' {rustc_wrapper}cargo build --locked --manifest-path=rust/Cargo.toml --target-dir=$builddir/{mode} --profile=rust-{mode} $
&& touch $out
description = RUST_LIB $out
''').format(mode=mode, antlr3_exec=args.antlr3_exec, fmt_lib=fmt_lib, test_repeat=args.test_repeat, test_timeout=args.test_timeout, **modeval))
''').format(mode=mode, antlr3_exec=args.antlr3_exec, fmt_lib=fmt_lib, test_repeat=args.test_repeat, test_timeout=args.test_timeout, rustc_wrapper=rustc_wrapper, **modeval))
f.write(
'build {mode}-build: phony {artifacts} {wasms} {vector_search_validator_bins}\n'.format(
mode=mode,
@@ -2525,7 +2628,7 @@ def write_build_file(f,
# In debug/sanitize modes, we compile with fsanitizers,
# so must use the same options during the link:
if '-DSANITIZE' in modes[mode]['cxxflags']:
f.write(' libs = -fsanitize=address -fsanitize=undefined\n')
f.write(' libs = -fsanitize=address -fsanitize=undefined -lubsan\n')
else:
f.write(' libs =\n')
f.write(f'build $builddir/{mode}/{binary}.stripped: strip $builddir/{mode}/{binary}\n')
@@ -2921,6 +3024,9 @@ def create_build_system(args):
os.makedirs(outdir, exist_ok=True)
compiler_cache = find_compiler_cache(args.compiler_cache)
resolve_compilers_for_sccache(args, compiler_cache)
scylla_product, scylla_version, scylla_release = generate_version(args.date_stamp)
for mode, mode_config in build_modes.items():
@@ -2937,8 +3043,8 @@ def create_build_system(args):
# {outdir}/{mode}/seastar/build.ninja, and
# {outdir}/{mode}/seastar/seastar.pc is queried for building flags
for mode, mode_config in build_modes.items():
configure_seastar(outdir, mode, mode_config)
configure_abseil(outdir, mode, mode_config)
configure_seastar(outdir, mode, mode_config, compiler_cache)
configure_abseil(outdir, mode, mode_config, compiler_cache)
user_cflags += ' -isystem abseil'
for mode, mode_config in build_modes.items():
@@ -2961,6 +3067,7 @@ def create_build_system(args):
scylla_product,
scylla_version,
scylla_release,
compiler_cache,
args)
generate_compdb('compile_commands.json', ninja, args.buildfile, selected_modes)
@@ -3003,6 +3110,10 @@ def configure_using_cmake(args):
selected_modes = args.selected_modes or default_modes
selected_configs = ';'.join(build_modes[mode].cmake_build_type for mode
in selected_modes)
compiler_cache = find_compiler_cache(args.compiler_cache)
resolve_compilers_for_sccache(args, compiler_cache)
settings = {
'CMAKE_CONFIGURATION_TYPES': selected_configs,
'CMAKE_CROSS_CONFIGS': selected_configs,
@@ -3020,6 +3131,14 @@ def configure_using_cmake(args):
'Scylla_WITH_DEBUG_INFO' : 'ON' if args.debuginfo else 'OFF',
'Scylla_USE_PRECOMPILED_HEADER': 'OFF' if args.disable_precompiled_header else 'ON',
}
if compiler_cache:
settings['CMAKE_CXX_COMPILER_LAUNCHER'] = compiler_cache
settings['CMAKE_C_COMPILER_LAUNCHER'] = compiler_cache
# For Rust, sccache is used via RUSTC_WRAPPER
if 'sccache' in compiler_cache:
settings['Scylla_RUSTC_WRAPPER'] = compiler_cache
if args.date_stamp:
settings['Scylla_DATE_STAMP'] = args.date_stamp
if args.staticboost:
@@ -3051,7 +3170,7 @@ def configure_using_cmake(args):
if not args.dist_only:
for mode in selected_modes:
configure_seastar(build_dir, build_modes[mode].cmake_build_type, modes[mode])
configure_seastar(build_dir, build_modes[mode].cmake_build_type, modes[mode], compiler_cache)
cmake_command = ['cmake']
cmake_command += [f'-D{var}={value}' for var, value in settings.items()]

View File

@@ -64,6 +64,10 @@ bool query_processor::topology_global_queue_empty() {
return remote().first.get().ss.topology_global_queue_empty();
}
future<bool> query_processor::ongoing_rf_change(const service::group0_guard& guard, sstring ks) {
return remote().first.get().ss.ongoing_rf_change(guard, std::move(ks));
}
static service::query_state query_state_for_internal_call() {
return {service::client_state::for_internal_calls(), empty_service_permit()};
}

View File

@@ -474,6 +474,7 @@ public:
void reset_cache();
bool topology_global_queue_empty();
future<bool> ongoing_rf_change(const service::group0_guard& guard, sstring ks);
query_options make_internal_options(
const statements::prepared_statement::checked_weak_ptr& p,

View File

@@ -1322,6 +1322,10 @@ const std::vector<expr::expression>& statement_restrictions::index_restrictions(
return _index_restrictions;
}
bool statement_restrictions::is_empty() const {
return !_where.has_value();
}
// Current score table:
// local and restrictions include full partition key: 2
// global: 1

View File

@@ -408,6 +408,8 @@ public:
/// Checks that the primary key restrictions don't contain null values, throws invalid_request_exception otherwise.
void validate_primary_key(const query_options& options) const;
bool is_empty() const;
};
statement_restrictions analyze_statement_restrictions(

View File

@@ -19,6 +19,7 @@
#include "locator/abstract_replication_strategy.hh"
#include "mutation/canonical_mutation.hh"
#include "prepared_statement.hh"
#include "seastar/coroutine/exception.hh"
#include "service/migration_manager.hh"
#include "service/storage_proxy.hh"
#include "service/topology_mutation.hh"
@@ -138,6 +139,7 @@ bool cql3::statements::alter_keyspace_statement::changes_tablets(query_processor
future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, cql3::cql_warnings_vec>>
cql3::statements::alter_keyspace_statement::prepare_schema_mutations(query_processor& qp, service::query_state& state, const query_options& options, service::group0_batch& mc) const {
using namespace cql_transport;
bool unknown_keyspace = false;
try {
event::schema_change::target_type target_type = event::schema_change::target_type::KEYSPACE;
auto ks = qp.db().find_keyspace(_name);
@@ -158,8 +160,12 @@ cql3::statements::alter_keyspace_statement::prepare_schema_mutations(query_proce
// when in reality nothing or only schema is being changed
if (changes_tablets(qp)) {
if (!qp.proxy().features().topology_global_request_queue && !qp.topology_global_queue_empty()) {
return make_exception_future<std::tuple<::shared_ptr<::cql_transport::event::schema_change>, cql3::cql_warnings_vec>>(
exceptions::invalid_request_exception("Another global topology request is ongoing, please retry."));
co_await coroutine::return_exception(
exceptions::invalid_request_exception("Another global topology request is ongoing, please retry."));
}
if (qp.proxy().features().rack_list_rf && co_await qp.ongoing_rf_change(mc.guard(),_name)) {
co_await coroutine::return_exception(
exceptions::invalid_request_exception(format("Another RF change for this keyspace {} ongoing, please retry.", _name)));
}
qp.db().real_database().validate_keyspace_update(*ks_md_update);
@@ -242,10 +248,15 @@ cql3::statements::alter_keyspace_statement::prepare_schema_mutations(query_proce
target_type,
keyspace());
mc.add_mutations(std::move(muts), "CQL alter keyspace");
return make_ready_future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, cql3::cql_warnings_vec>>(std::make_tuple(std::move(ret), warnings));
co_return std::make_tuple(std::move(ret), warnings);
} catch (data_dictionary::no_such_keyspace& e) {
return make_exception_future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, cql3::cql_warnings_vec>>(exceptions::invalid_request_exception("Unknown keyspace " + _name));
unknown_keyspace = true;
}
if (unknown_keyspace) {
co_await coroutine::return_exception(
exceptions::invalid_request_exception("Unknown keyspace " + _name));
}
std::unreachable();
}
std::unique_ptr<cql3::statements::prepared_statement>

View File

@@ -190,7 +190,7 @@ future<utils::chunked_vector<mutation>> batch_statement::get_mutations(query_pro
co_return vresult;
}
void batch_statement::verify_batch_size(query_processor& qp, const utils::chunked_vector<mutation>& mutations) {
void batch_statement::verify_batch_size(query_processor& qp, const utils::chunked_vector<mutation>& mutations) const {
if (mutations.size() <= 1) {
return; // We only warn for batch spanning multiple mutations
}
@@ -209,8 +209,9 @@ void batch_statement::verify_batch_size(query_processor& qp, const utils::chunke
for (auto&& m : mutations) {
ks_cf_pairs.insert(m.schema()->ks_name() + "." + m.schema()->cf_name());
}
return seastar::format("Batch modifying {:d} partitions in {} is of size {:d} bytes, exceeding specified {} threshold of {:d} by {:d}.",
mutations.size(), fmt::join(ks_cf_pairs, ", "), size, type, threshold, size - threshold);
const auto batch_type = _type == type::LOGGED ? "Logged" : "Unlogged";
return seastar::format("{} batch modifying {:d} partitions in {} is of size {:d} bytes, exceeding specified {} threshold of {:d} by {:d}.",
batch_type, mutations.size(), fmt::join(ks_cf_pairs, ", "), size, type, threshold, size - threshold);
};
if (size > fail_threshold) {
_logger.error("{}", error("FAIL", fail_threshold).c_str());

View File

@@ -116,7 +116,7 @@ public:
* Checks batch size to ensure threshold is met. If not, a warning is logged.
* @param cfs ColumnFamilies that will store the batch's mutations.
*/
static void verify_batch_size(query_processor& qp, const utils::chunked_vector<mutation>& mutations);
void verify_batch_size(query_processor& qp, const utils::chunked_vector<mutation>& mutations) const;
virtual future<shared_ptr<cql_transport::messages::result_message>> execute(
query_processor& qp, service::query_state& state, const query_options& options, std::optional<service::group0_guard> guard) const override;

View File

@@ -61,7 +61,7 @@ expand_to_racks(const locator::token_metadata& tm,
// Handle ALTER:
// ([]|0) -> numeric is allowed, there are no existing replicas
// numeric -> numeric' is not supported. User should convert RF to rack list of equal count first.
// numeric -> numeric' is not supported unless numeric == numeric'. User should convert RF to rack list of equal count first.
// rack_list -> len(rack_list) is allowed (no-op)
// rack_list -> numeric is not allowed
if (old_options.contains(dc)) {
@@ -75,6 +75,8 @@ expand_to_racks(const locator::token_metadata& tm,
"Cannot change replication factor for '{}' from {} to numeric {}, use rack list instead",
dc, old_rf_val, data.count()));
}
} else if (old_rf.count() == data.count()) {
return rf;
} else if (old_rf.count() > 0) {
throw exceptions::configuration_exception(fmt::format(
"Cannot change replication factor for '{}' from {} to {}, only rack list is allowed",
@@ -153,6 +155,8 @@ static locator::replication_strategy_config_options prepare_options(
}
// Validate options.
bool numeric_to_rack_list_transition = false;
bool rf_change = false;
for (auto&& [dc, opt] : options) {
locator::replication_factor_data rf(opt);
@@ -162,6 +166,7 @@ static locator::replication_strategy_config_options prepare_options(
old_rf = locator::replication_factor_data(i->second);
}
rf_change = rf_change || (old_rf && old_rf->count() != rf.count()) || (!old_rf && rf.count() != 0);
if (!rf.is_rack_based()) {
if (old_rf && old_rf->is_rack_based() && rf.count() != 0) {
if (old_rf->count() != rf.count()) {
@@ -187,12 +192,11 @@ static locator::replication_strategy_config_options prepare_options(
throw exceptions::configuration_exception(fmt::format(
"Rack list for '{}' contains duplicate entries", dc));
}
if (old_rf && !old_rf->is_rack_based() && old_rf->count() != 0) {
// FIXME: Allow this if replicas already conform to the given rack list.
// FIXME: Implement automatic colocation to allow transition to rack list.
throw exceptions::configuration_exception(fmt::format(
"Cannot change replication factor from numeric to rack list for '{}'", dc));
}
numeric_to_rack_list_transition = numeric_to_rack_list_transition || (old_rf && !old_rf->is_rack_based() && old_rf->count() != 0);
}
if (numeric_to_rack_list_transition && rf_change) {
throw exceptions::configuration_exception("Cannot change replication factor from numeric to rack list and rf value at the same time");
}
if (!rf && options.empty() && old_options.empty()) {
@@ -412,7 +416,7 @@ lw_shared_ptr<data_dictionary::keyspace_metadata> ks_prop_defs::as_ks_metadata(s
? std::optional<unsigned>(0) : std::nullopt;
auto initial_tablets = get_initial_tablets(default_initial_tablets, cfg.enforce_tablets());
bool uses_tablets = initial_tablets.has_value();
bool rack_list_enabled = feat.rack_list_rf;
bool rack_list_enabled = utils::get_local_injector().enter("create_with_numeric") ? false : feat.rack_list_rf;
auto options = prepare_options(sc, tm, cfg.rf_rack_valid_keyspaces(), get_replication_options(), {}, rack_list_enabled, uses_tablets);
return data_dictionary::keyspace_metadata::new_keyspace(ks_name, sc,
std::move(options), initial_tablets, get_consistency_option(), get_boolean(KW_DURABLE_WRITES, true), get_storage_options());
@@ -428,7 +432,7 @@ lw_shared_ptr<data_dictionary::keyspace_metadata> ks_prop_defs::as_ks_metadata_u
throw exceptions::invalid_request_exception("Cannot alter replication strategy vnode/tablets flavor");
}
auto sc = get_replication_strategy_class();
bool rack_list_enabled = feat.rack_list_rf;
bool rack_list_enabled = utils::get_local_injector().enter("create_with_numeric") ? false : feat.rack_list_rf;
if (sc) {
options = prepare_options(*sc, tm, cfg.rf_rack_valid_keyspaces(), get_replication_options(), old_options, rack_list_enabled, uses_tablets);
} else {

View File

@@ -1976,7 +1976,7 @@ mutation_fragments_select_statement::do_execute(query_processor& qp, service::qu
if (it == indexes.end()) {
throw exceptions::invalid_request_exception("ANN ordering by vector requires the column to be indexed using 'vector_index'");
}
if (index_opt || parameters->allow_filtering() || restrictions->need_filtering() || check_needs_allow_filtering_anyway(*restrictions)) {
if (index_opt || parameters->allow_filtering() || !(restrictions->is_empty()) || check_needs_allow_filtering_anyway(*restrictions)) {
throw exceptions::invalid_request_exception("ANN ordering by vector does not support filtering");
}
index_opt = *it;

View File

@@ -42,6 +42,11 @@ table::get_index_manager() const {
return _ops->get_index_manager(*this);
}
db_clock::time_point
table::get_truncation_time() const {
return _ops->get_truncation_time(*this);
}
lw_shared_ptr<keyspace_metadata>
keyspace::metadata() const {
return _ops->get_keyspace_metadata(*this);

View File

@@ -77,6 +77,7 @@ public:
schema_ptr schema() const;
const std::vector<view_ptr>& views() const;
const secondary_index::secondary_index_manager& get_index_manager() const;
db_clock::time_point get_truncation_time() const;
};
class keyspace {

View File

@@ -27,6 +27,7 @@ public:
virtual std::optional<table> try_find_table(database db, table_id id) const = 0;
virtual const secondary_index::secondary_index_manager& get_index_manager(table t) const = 0;
virtual schema_ptr get_table_schema(table t) const = 0;
virtual db_clock::time_point get_truncation_time(table t) const = 0;
virtual lw_shared_ptr<keyspace_metadata> get_keyspace_metadata(keyspace ks) const = 0;
virtual bool is_internal(keyspace ks) const = 0;
virtual const locator::abstract_replication_strategy& get_replication_strategy(keyspace ks) const = 0;

20
db/batchlog.hh Normal file
View File

@@ -0,0 +1,20 @@
/*
* Copyright (C) 2025-present ScyllaDB
*/
/*
* SPDX-License-Identifier: (LicenseRef-ScyllaDB-Source-Available-1.0 and Apache-2.0)
*/
#pragma once
#include "mutation/mutation.hh"
#include "utils/UUID.hh"
namespace db {
mutation get_batchlog_mutation_for(schema_ptr schema, const utils::chunked_vector<mutation>& mutations, int32_t version, db_clock::time_point now, const utils::UUID& id);
mutation get_batchlog_delete_mutation(schema_ptr schema, int32_t version, db_clock::time_point now, const utils::UUID& id);
}

View File

@@ -10,20 +10,24 @@
#include <chrono>
#include <exception>
#include <ranges>
#include <seastar/core/future-util.hh>
#include <seastar/core/do_with.hh>
#include <seastar/core/semaphore.hh>
#include <seastar/core/metrics.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/sleep.hh>
#include <seastar/coroutine/parallel_for_each.hh>
#include "batchlog_manager.hh"
#include "batchlog.hh"
#include "data_dictionary/data_dictionary.hh"
#include "mutation/canonical_mutation.hh"
#include "service/storage_proxy.hh"
#include "system_keyspace.hh"
#include "utils/rate_limiter.hh"
#include "utils/log.hh"
#include "utils/murmur_hash.hh"
#include "db_clock.hh"
#include "unimplemented.hh"
#include "idl/frozen_schema.dist.hh"
@@ -33,17 +37,94 @@
#include "cql3/untyped_result_set.hh"
#include "service_permit.hh"
#include "cql3/query_processor.hh"
#include "replica/database.hh"
static logging::logger blogger("batchlog_manager");
namespace db {
// Yields 256 batchlog shards. Even on the largest nodes we currently run on,
// this should be enough to give every core a batchlog partition.
static constexpr unsigned batchlog_shard_bits = 8;
int32_t batchlog_shard_of(db_clock::time_point written_at) {
const int64_t count = written_at.time_since_epoch().count();
std::array<uint64_t, 2> result;
utils::murmur_hash::hash3_x64_128(bytes_view(reinterpret_cast<const signed char*>(&count), sizeof(count)), 0, result);
uint64_t hash = result[0] ^ result[1];
return hash & ((1ULL << batchlog_shard_bits) - 1);
}
std::pair<partition_key, clustering_key>
get_batchlog_key(const schema& schema, int32_t version, db::batchlog_stage stage, int32_t batchlog_shard, db_clock::time_point written_at, std::optional<utils::UUID> id) {
auto pkey = partition_key::from_exploded(schema, {serialized(version), serialized(int8_t(stage)), serialized(batchlog_shard)});
std::vector<bytes> ckey_components;
ckey_components.reserve(2);
ckey_components.push_back(serialized(written_at));
if (id) {
ckey_components.push_back(serialized(*id));
}
auto ckey = clustering_key::from_exploded(schema, ckey_components);
return {std::move(pkey), std::move(ckey)};
}
std::pair<partition_key, clustering_key>
get_batchlog_key(const schema& schema, int32_t version, db::batchlog_stage stage, db_clock::time_point written_at, std::optional<utils::UUID> id) {
return get_batchlog_key(schema, version, stage, batchlog_shard_of(written_at), written_at, id);
}
mutation get_batchlog_mutation_for(schema_ptr schema, managed_bytes data, int32_t version, db::batchlog_stage stage, db_clock::time_point now, const utils::UUID& id) {
auto [key, ckey] = get_batchlog_key(*schema, version, stage, now, id);
auto timestamp = api::new_timestamp();
mutation m(schema, key);
// Avoid going through data_value and therefore `bytes`, as it can be large (#24809).
auto cdef_data = schema->get_column_definition(to_bytes("data"));
m.set_cell(ckey, *cdef_data, atomic_cell::make_live(*cdef_data->type, timestamp, std::move(data)));
return m;
}
mutation get_batchlog_mutation_for(schema_ptr schema, const utils::chunked_vector<mutation>& mutations, int32_t version, db::batchlog_stage stage, db_clock::time_point now, const utils::UUID& id) {
auto data = [&mutations] {
utils::chunked_vector<canonical_mutation> fm(mutations.begin(), mutations.end());
bytes_ostream out;
for (auto& m : fm) {
ser::serialize(out, m);
}
return std::move(out).to_managed_bytes();
}();
return get_batchlog_mutation_for(std::move(schema), std::move(data), version, stage, now, id);
}
mutation get_batchlog_mutation_for(schema_ptr schema, const utils::chunked_vector<mutation>& mutations, int32_t version, db_clock::time_point now, const utils::UUID& id) {
return get_batchlog_mutation_for(std::move(schema), mutations, version, batchlog_stage::initial, now, id);
}
mutation get_batchlog_delete_mutation(schema_ptr schema, int32_t version, db::batchlog_stage stage, db_clock::time_point now, const utils::UUID& id) {
auto [key, ckey] = get_batchlog_key(*schema, version, stage, now, id);
mutation m(schema, key);
auto timestamp = api::new_timestamp();
m.partition().apply_delete(*schema, ckey, tombstone(timestamp, gc_clock::now()));
return m;
}
mutation get_batchlog_delete_mutation(schema_ptr schema, int32_t version, db_clock::time_point now, const utils::UUID& id) {
return get_batchlog_delete_mutation(std::move(schema), version, batchlog_stage::initial, now, id);
}
} // namespace db
const std::chrono::seconds db::batchlog_manager::replay_interval;
const uint32_t db::batchlog_manager::page_size;
db::batchlog_manager::batchlog_manager(cql3::query_processor& qp, db::system_keyspace& sys_ks, batchlog_manager_config config)
: _qp(qp)
, _sys_ks(sys_ks)
, _write_request_timeout(std::chrono::duration_cast<db_clock::duration>(config.write_request_timeout))
, _replay_timeout(config.replay_timeout)
, _replay_rate(config.replay_rate)
, _delay(config.delay)
, _replay_cleanup_after_replays(config.replay_cleanup_after_replays)
@@ -152,18 +233,75 @@ future<> db::batchlog_manager::stop() {
}
future<size_t> db::batchlog_manager::count_all_batches() const {
sstring query = format("SELECT count(*) FROM {}.{} BYPASS CACHE", system_keyspace::NAME, system_keyspace::BATCHLOG);
sstring query = format("SELECT count(*) FROM {}.{} BYPASS CACHE", system_keyspace::NAME, system_keyspace::BATCHLOG_V2);
return _qp.execute_internal(query, cql3::query_processor::cache_internal::yes).then([](::shared_ptr<cql3::untyped_result_set> rs) {
return size_t(rs->one().get_as<int64_t>("count"));
});
}
db_clock::duration db::batchlog_manager::get_batch_log_timeout() const {
// enough time for the actual write + BM removal mutation
return _write_request_timeout * 2;
future<> db::batchlog_manager::maybe_migrate_v1_to_v2() {
if (_migration_done) {
return make_ready_future<>();
}
return with_gate(_gate, [this] () mutable -> future<> {
blogger.info("Migrating batchlog entries from v1 -> v2");
auto schema_v1 = _qp.db().find_schema(system_keyspace::NAME, system_keyspace::BATCHLOG);
auto schema_v2 = _qp.db().find_schema(system_keyspace::NAME, system_keyspace::BATCHLOG_V2);
auto batch = [this, schema_v1, schema_v2] (const cql3::untyped_result_set::row& row) -> future<stop_iteration> {
// check version of serialization format
if (!row.has("version")) {
blogger.warn("Not migrating logged batch because of unknown version");
co_return stop_iteration::no;
}
auto version = row.get_as<int32_t>("version");
if (version != netw::messaging_service::current_version) {
blogger.warn("Not migrating logged batch because of incorrect version");
co_return stop_iteration::no;
}
auto id = row.get_as<utils::UUID>("id");
auto written_at = row.get_as<db_clock::time_point>("written_at");
auto data = row.get_blob_fragmented("data");
auto& sp = _qp.proxy();
utils::get_local_injector().inject("batchlog_manager_fail_migration", [] { throw std::runtime_error("Error injection: failing batchlog migration"); });
auto migrate_mut = get_batchlog_mutation_for(schema_v2, std::move(data), version, batchlog_stage::failed_replay, written_at, id);
co_await sp.mutate_locally(migrate_mut, tracing::trace_state_ptr(), db::commitlog::force_sync::no);
mutation delete_mut(schema_v1, partition_key::from_single_value(*schema_v1, serialized(id)));
delete_mut.partition().apply_delete(*schema_v1, clustering_key_prefix::make_empty(), tombstone(api::new_timestamp(), gc_clock::now()));
co_await sp.mutate_locally(delete_mut, tracing::trace_state_ptr(), db::commitlog::force_sync::no);
co_return stop_iteration::no;
};
try {
co_await _qp.query_internal(
format("SELECT * FROM {}.{} BYPASS CACHE", system_keyspace::NAME, system_keyspace::BATCHLOG),
db::consistency_level::ONE,
{},
page_size,
std::move(batch));
} catch (...) {
blogger.warn("Batchlog v1 to v2 migration failed: {}; will retry", std::current_exception());
co_return;
}
co_await container().invoke_on_all([] (auto& bm) {
bm._migration_done = true;
});
blogger.info("Done migrating batchlog entries from v1 -> v2");
});
}
future<db::all_batches_replayed> db::batchlog_manager::replay_all_failed_batches(post_replay_cleanup cleanup) {
co_await maybe_migrate_v1_to_v2();
typedef db_clock::rep clock_type;
db::all_batches_replayed all_replayed = all_batches_replayed::yes;
@@ -172,21 +310,26 @@ future<db::all_batches_replayed> db::batchlog_manager::replay_all_failed_batches
auto throttle = _replay_rate / _qp.proxy().get_token_metadata_ptr()->count_normal_token_owners();
auto limiter = make_lw_shared<utils::rate_limiter>(throttle);
auto schema = _qp.db().find_schema(system_keyspace::NAME, system_keyspace::BATCHLOG);
auto delete_batch = [this, schema = std::move(schema)] (utils::UUID id) {
auto key = partition_key::from_singular(*schema, id);
mutation m(schema, key);
auto now = service::client_state(service::client_state::internal_tag()).get_timestamp();
m.partition().apply_delete(*schema, clustering_key_prefix::make_empty(), tombstone(now, gc_clock::now()));
return _qp.proxy().mutate_locally(m, tracing::trace_state_ptr(), db::commitlog::force_sync::no);
auto schema = _qp.db().find_schema(system_keyspace::NAME, system_keyspace::BATCHLOG_V2);
struct replay_stats {
std::optional<db_clock::time_point> min_too_fresh;
bool need_cleanup = false;
};
auto batch = [this, limiter, delete_batch = std::move(delete_batch), &all_replayed](const cql3::untyped_result_set::row& row) -> future<stop_iteration> {
std::unordered_map<int32_t, replay_stats> replay_stats_per_shard;
// Use a stable `now` accross all batches, so skip/replay decisions are the
// same accross a while prefix of written_at (accross all ids).
const auto now = db_clock::now();
auto batch = [this, cleanup, limiter, schema, &all_replayed, &replay_stats_per_shard, now] (const cql3::untyped_result_set::row& row) -> future<stop_iteration> {
const auto stage = static_cast<batchlog_stage>(row.get_as<int8_t>("stage"));
const auto batch_shard = row.get_as<int32_t>("shard");
auto written_at = row.get_as<db_clock::time_point>("written_at");
auto id = row.get_as<utils::UUID>("id");
// enough time for the actual write + batchlog entry mutation delivery (two separate requests).
auto now = db_clock::now();
auto timeout = get_batch_log_timeout();
auto timeout = _replay_timeout;
if (utils::get_local_injector().is_enabled("skip_batch_replay")) {
blogger.debug("Skipping batch replay due to skip_batch_replay injection");
@@ -194,52 +337,48 @@ future<db::all_batches_replayed> db::batchlog_manager::replay_all_failed_batches
co_return stop_iteration::no;
}
// check version of serialization format
if (!row.has("version")) {
blogger.warn("Skipping logged batch because of unknown version");
co_await delete_batch(id);
co_return stop_iteration::no;
}
auto version = row.get_as<int32_t>("version");
if (version != netw::messaging_service::current_version) {
blogger.warn("Skipping logged batch because of incorrect version {}; current version = {}", version, netw::messaging_service::current_version);
co_await delete_batch(id);
co_return stop_iteration::no;
}
auto data = row.get_blob_unfragmented("data");
blogger.debug("Replaying batch {}", id);
blogger.debug("Replaying batch {} from stage {} and batch shard {}", id, int32_t(stage), batch_shard);
utils::chunked_vector<mutation> mutations;
bool send_failed = false;
auto& shard_written_at = replay_stats_per_shard.try_emplace(batch_shard, replay_stats{}).first->second;
try {
auto fms = make_lw_shared<std::deque<canonical_mutation>>();
utils::chunked_vector<std::pair<canonical_mutation, schema_ptr>> fms;
auto in = ser::as_input_stream(data);
while (in.size()) {
fms->emplace_back(ser::deserialize(in, std::type_identity<canonical_mutation>()));
schema_ptr s = _qp.db().find_schema(fms->back().column_family_id());
timeout = std::min(timeout, std::chrono::duration_cast<db_clock::duration>(s->tombstone_gc_options().propagation_delay_in_seconds()));
auto fm = ser::deserialize(in, std::type_identity<canonical_mutation>());
const auto tbl = _qp.db().try_find_table(fm.column_family_id());
if (!tbl) {
continue;
}
if (written_at <= tbl->get_truncation_time()) {
continue;
}
schema_ptr s = tbl->schema();
if (s->tombstone_gc_options().mode() == tombstone_gc_mode::repair) {
timeout = std::min(timeout, std::chrono::duration_cast<db_clock::duration>(s->tombstone_gc_options().propagation_delay_in_seconds()));
}
fms.emplace_back(std::move(fm), std::move(s));
}
if (now < written_at + timeout) {
blogger.debug("Skipping replay of {}, too fresh", id);
shard_written_at.min_too_fresh = std::min(shard_written_at.min_too_fresh.value_or(written_at), written_at);
co_return stop_iteration::no;
}
auto size = data.size();
auto mutations = co_await map_reduce(*fms, [this, written_at] (canonical_mutation& fm) {
const auto& cf = _qp.proxy().local_db().find_column_family(fm.column_family_id());
return make_ready_future<canonical_mutation*>(written_at > cf.get_truncation_time() ? &fm : nullptr);
},
utils::chunked_vector<mutation>(),
[this] (utils::chunked_vector<mutation> mutations, canonical_mutation* fm) {
if (fm) {
schema_ptr s = _qp.db().find_schema(fm->column_family_id());
mutations.emplace_back(fm->to_mutation(s));
}
return mutations;
});
for (const auto& [fm, s] : fms) {
mutations.emplace_back(fm.to_mutation(s));
co_await maybe_yield();
}
if (!mutations.empty()) {
const auto ttl = [written_at]() -> clock_type {
@@ -265,7 +404,11 @@ future<db::all_batches_replayed> db::batchlog_manager::replay_all_failed_batches
co_await limiter->reserve(size);
_stats.write_attempts += mutations.size();
auto timeout = db::timeout_clock::now() + write_timeout;
co_await _qp.proxy().send_batchlog_replay_to_all_replicas(std::move(mutations), timeout);
if (cleanup) {
co_await _qp.proxy().send_batchlog_replay_to_all_replicas(mutations, timeout);
} else {
co_await _qp.proxy().send_batchlog_replay_to_all_replicas(std::move(mutations), timeout);
}
}
}
} catch (data_dictionary::no_such_keyspace& ex) {
@@ -279,31 +422,80 @@ future<db::all_batches_replayed> db::batchlog_manager::replay_all_failed_batches
// Do _not_ remove the batch, assuning we got a node write error.
// Since we don't have hints (which origin is satisfied with),
// we have to resort to keeping this batch to next lap.
co_return stop_iteration::no;
if (!cleanup || stage == batchlog_stage::failed_replay) {
co_return stop_iteration::no;
}
send_failed = true;
}
auto& sp = _qp.proxy();
if (send_failed) {
blogger.debug("Moving batch {} to stage failed_replay", id);
auto m = get_batchlog_mutation_for(schema, mutations, netw::messaging_service::current_version, batchlog_stage::failed_replay, written_at, id);
co_await sp.mutate_locally(m, tracing::trace_state_ptr(), db::commitlog::force_sync::no);
}
// delete batch
co_await delete_batch(id);
auto m = get_batchlog_delete_mutation(schema, netw::messaging_service::current_version, stage, written_at, id);
co_await _qp.proxy().mutate_locally(m, tracing::trace_state_ptr(), db::commitlog::force_sync::no);
shard_written_at.need_cleanup = true;
co_return stop_iteration::no;
};
co_await with_gate(_gate, [this, cleanup, batch = std::move(batch)] () mutable -> future<> {
blogger.debug("Started replayAllFailedBatches (cpu {})", this_shard_id());
co_await with_gate(_gate, [this, cleanup, &all_replayed, batch = std::move(batch), now, &replay_stats_per_shard] () mutable -> future<> {
blogger.debug("Started replayAllFailedBatches with cleanup: {}", cleanup);
co_await utils::get_local_injector().inject("add_delay_to_batch_replay", std::chrono::milliseconds(1000));
co_await _qp.query_internal(
format("SELECT id, data, written_at, version FROM {}.{} BYPASS CACHE", system_keyspace::NAME, system_keyspace::BATCHLOG),
db::consistency_level::ONE,
{},
page_size,
std::move(batch)).then([this, cleanup] {
if (cleanup == post_replay_cleanup::no) {
return make_ready_future<>();
auto schema = _qp.db().find_schema(system_keyspace::NAME, system_keyspace::BATCHLOG_V2);
co_await coroutine::parallel_for_each(std::views::iota(0, 16), [&] (int32_t chunk) -> future<> {
const int32_t batchlog_chunk_base = chunk * 16;
for (int32_t i = 0; i < 16; ++i) {
int32_t batchlog_shard = batchlog_chunk_base + i;
co_await _qp.query_internal(
format("SELECT * FROM {}.{} WHERE version = ? AND stage = ? AND shard = ? BYPASS CACHE", system_keyspace::NAME, system_keyspace::BATCHLOG_V2),
db::consistency_level::ONE,
{data_value(netw::messaging_service::current_version), data_value(int8_t(batchlog_stage::failed_replay)), data_value(batchlog_shard)},
page_size,
batch);
co_await _qp.query_internal(
format("SELECT * FROM {}.{} WHERE version = ? AND stage = ? AND shard = ? BYPASS CACHE", system_keyspace::NAME, system_keyspace::BATCHLOG_V2),
db::consistency_level::ONE,
{data_value(netw::messaging_service::current_version), data_value(int8_t(batchlog_stage::initial)), data_value(batchlog_shard)},
page_size,
batch);
if (cleanup != post_replay_cleanup::yes) {
continue;
}
auto it = replay_stats_per_shard.find(batchlog_shard);
if (it == replay_stats_per_shard.end() || !it->second.need_cleanup) {
// Nothing was replayed on this batchlog shard, nothing to cleanup.
continue;
}
const auto write_time = it->second.min_too_fresh.value_or(now - _replay_timeout);
const auto end_weight = it->second.min_too_fresh ? bound_weight::before_all_prefixed : bound_weight::after_all_prefixed;
auto [key, ckey] = get_batchlog_key(*schema, netw::messaging_service::current_version, batchlog_stage::initial, batchlog_shard, write_time, {});
auto end_pos = position_in_partition(partition_region::clustered, end_weight, std::move(ckey));
range_tombstone rt(position_in_partition::before_all_clustered_rows(), std::move(end_pos), tombstone(api::new_timestamp(), gc_clock::now()));
blogger.trace("Clean up batchlog shard {} with range tombstone {}", batchlog_shard, rt);
mutation m(schema, key);
m.partition().apply_row_tombstone(*schema, std::move(rt));
co_await _qp.proxy().mutate_locally(m, tracing::trace_state_ptr(), db::commitlog::force_sync::no);
}
// Replaying batches could have generated tombstones, flush to disk,
// where they can be compacted away.
return replica::database::flush_table_on_all_shards(_qp.proxy().get_db(), system_keyspace::NAME, system_keyspace::BATCHLOG);
}).then([] {
blogger.debug("Finished replayAllFailedBatches");
});
blogger.debug("Finished replayAllFailedBatches with all_replayed: {}", all_replayed);
});
co_return all_replayed;

View File

@@ -34,12 +34,17 @@ class system_keyspace;
using all_batches_replayed = bool_class<struct all_batches_replayed_tag>;
struct batchlog_manager_config {
std::chrono::duration<double> write_request_timeout;
db_clock::duration replay_timeout;
uint64_t replay_rate = std::numeric_limits<uint64_t>::max();
std::chrono::milliseconds delay = std::chrono::milliseconds(0);
unsigned replay_cleanup_after_replays;
};
enum class batchlog_stage : int8_t {
initial,
failed_replay
};
class batchlog_manager : public peering_sharded_service<batchlog_manager> {
public:
using post_replay_cleanup = bool_class<class post_replay_cleanup_tag>;
@@ -59,7 +64,7 @@ private:
cql3::query_processor& _qp;
db::system_keyspace& _sys_ks;
db_clock::duration _write_request_timeout;
db_clock::duration _replay_timeout;
uint64_t _replay_rate;
std::chrono::milliseconds _delay;
unsigned _replay_cleanup_after_replays = 100;
@@ -71,6 +76,14 @@ private:
gc_clock::time_point _last_replay;
// Was the v1 -> v2 migration already done since last restart?
// The migration is attempted once after each restart. This is redundant but
// keeps thing simple. Once no upgrade path exists from a ScyllaDB version
// which can still produce v1 entries, this migration code can be removed.
bool _migration_done = false;
future<> maybe_migrate_v1_to_v2();
future<all_batches_replayed> replay_all_failed_batches(post_replay_cleanup cleanup);
public:
// Takes a QP, not a distributes. Because this object is supposed
@@ -85,10 +98,13 @@ public:
future<all_batches_replayed> do_batch_log_replay(post_replay_cleanup cleanup);
future<size_t> count_all_batches() const;
db_clock::duration get_batch_log_timeout() const;
gc_clock::time_point get_last_replay() const {
return _last_replay;
}
const stats& stats() const {
return _stats;
}
private:
future<> batchlog_replay_loop();
};

View File

@@ -54,12 +54,14 @@ public:
uint64_t applied_mutations = 0;
uint64_t corrupt_bytes = 0;
uint64_t truncated_at = 0;
uint64_t broken_files = 0;
stats& operator+=(const stats& s) {
invalid_mutations += s.invalid_mutations;
skipped_mutations += s.skipped_mutations;
applied_mutations += s.applied_mutations;
corrupt_bytes += s.corrupt_bytes;
broken_files += s.broken_files;
return *this;
}
stats operator+(const stats& s) const {
@@ -192,6 +194,8 @@ db::commitlog_replayer::impl::recover(const commitlog::descriptor& d, const comm
s->corrupt_bytes += e.bytes();
} catch (commitlog::segment_truncation& e) {
s->truncated_at = e.position();
} catch (commitlog::header_checksum_error&) {
++s->broken_files;
} catch (...) {
throw;
}
@@ -370,6 +374,9 @@ future<> db::commitlog_replayer::recover(std::vector<sstring> files, sstring fna
if (stats.truncated_at != 0) {
rlogger.warn("Truncated file: {} at position {}.", f, stats.truncated_at);
}
if (stats.broken_files != 0) {
rlogger.warn("Corrupted file header: {}. Skipped.", f);
}
rlogger.debug("Log replay of {} complete, {} replayed mutations ({} invalid, {} skipped)"
, f
, stats.applied_mutations

View File

@@ -1105,6 +1105,14 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"Like native_transport_port, but clients-side port number (modulo smp) is used to route the connection to the specific shard.")
, native_shard_aware_transport_port_ssl(this, "native_shard_aware_transport_port_ssl", value_status::Used, 19142,
"Like native_transport_port_ssl, but clients-side port number (modulo smp) is used to route the connection to the specific shard.")
, native_transport_port_proxy_protocol(this, "native_transport_port_proxy_protocol", value_status::Used, 0,
"Port on which the CQL native transport listens for clients using proxy protocol v2. Disabled (0) by default.")
, native_transport_port_ssl_proxy_protocol(this, "native_transport_port_ssl_proxy_protocol", value_status::Used, 0,
"Port on which the CQL TLS native transport listens for clients using proxy protocol v2. Disabled (0) by default.")
, native_shard_aware_transport_port_proxy_protocol(this, "native_shard_aware_transport_port_proxy_protocol", value_status::Used, 0,
"Like native_transport_port_proxy_protocol, but clients-side port number (modulo smp) is used to route the connection to the specific shard.")
, native_shard_aware_transport_port_ssl_proxy_protocol(this, "native_shard_aware_transport_port_ssl_proxy_protocol", value_status::Used, 0,
"Like native_transport_port_ssl_proxy_protocol, but clients-side port number (modulo smp) is used to route the connection to the specific shard.")
, native_transport_max_threads(this, "native_transport_max_threads", value_status::Invalid, 128,
"The maximum number of thread handling requests. The meaning is the same as rpc_max_threads.\n"
"Default is different (128 versus unlimited).\n"
@@ -1152,7 +1160,7 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"Number of threads with which to deliver hints. In multiple data-center deployments, consider increasing this number because cross data-center handoff is generally slower.")
, batchlog_replay_throttle_in_kb(this, "batchlog_replay_throttle_in_kb", value_status::Unused, 1024,
"Total maximum throttle. Throttling is reduced proportionally to the number of nodes in the cluster.")
, batchlog_replay_cleanup_after_replays(this, "batchlog_replay_cleanup_after_replays", liveness::LiveUpdate, value_status::Used, 60,
, batchlog_replay_cleanup_after_replays(this, "batchlog_replay_cleanup_after_replays", liveness::LiveUpdate, value_status::Used, 1,
"Clean up batchlog memtable after every N replays. Replays are issued on a timer, every 60 seconds. So if batchlog_replay_cleanup_after_replays is set to 60, the batchlog memtable is flushed every 60 * 60 seconds.")
/**
* @Group Request scheduler properties

View File

@@ -324,6 +324,10 @@ public:
named_value<uint16_t> native_transport_port_ssl;
named_value<uint16_t> native_shard_aware_transport_port;
named_value<uint16_t> native_shard_aware_transport_port_ssl;
named_value<uint16_t> native_transport_port_proxy_protocol;
named_value<uint16_t> native_transport_port_ssl_proxy_protocol;
named_value<uint16_t> native_shard_aware_transport_port_proxy_protocol;
named_value<uint16_t> native_shard_aware_transport_port_ssl_proxy_protocol;
named_value<uint32_t> native_transport_max_threads;
named_value<uint32_t> native_transport_max_frame_size_in_mb;
named_value<sstring> broadcast_rpc_address;

View File

@@ -248,7 +248,7 @@ future<db::commitlog> hint_endpoint_manager::add_store() noexcept {
// which is larger than the segment ID of the RP of the last written hint.
cfg.base_segment_id = _last_written_rp.base_id();
return commitlog::create_commitlog(std::move(cfg)).then([this] (commitlog l) -> future<commitlog> {
return commitlog::create_commitlog(std::move(cfg)).then([this] (this auto, commitlog l) -> future<commitlog> {
// add_store() is triggered every time hint files are forcefully flushed to I/O (every hints_flush_period).
// When this happens we want to refill _sender's segments only if it has finished with the segments he had before.
if (_sender.have_segments()) {

View File

@@ -26,6 +26,7 @@
#include <seastar/core/smp.hh>
#include <seastar/coroutine/exception.hh>
#include <seastar/coroutine/parallel_for_each.hh>
#include <seastar/util/file.hh>
// Boost features.
@@ -899,7 +900,7 @@ future<> manager::migrate_ip_directories() {
co_await coroutine::parallel_for_each(dirs_to_remove, [] (auto& directory) -> future<> {
try {
manager_logger.warn("Removing hint directory {}", directory.native());
co_await lister::rmdir(directory);
co_await seastar::recursive_remove_directory(directory);
} catch (...) {
on_internal_error(manager_logger,
seastar::format("Removing a hint directory has failed. Reason: {}", std::current_exception()));

View File

@@ -12,8 +12,6 @@
#include <yaml-cpp/yaml.h>
#include <boost/lexical_cast.hpp>
#include "utils/s3/creds.hh"
#include "object_storage_endpoint_param.hh"
using namespace std::string_literals;
@@ -21,9 +19,6 @@ using namespace std::string_literals;
db::object_storage_endpoint_param::object_storage_endpoint_param(s3_storage s)
: _data(std::move(s))
{}
db::object_storage_endpoint_param::object_storage_endpoint_param(std::string endpoint, s3::endpoint_config config)
: object_storage_endpoint_param(s3_storage{std::move(endpoint), std::move(config)})
{}
db::object_storage_endpoint_param::object_storage_endpoint_param(gs_storage s)
: _data(std::move(s))
{}
@@ -32,8 +27,8 @@ db::object_storage_endpoint_param::object_storage_endpoint_param() = default;
db::object_storage_endpoint_param::object_storage_endpoint_param(const object_storage_endpoint_param&) = default;
std::string db::object_storage_endpoint_param::s3_storage::to_json_string() const {
return fmt::format("{{ \"port\": {}, \"use_https\": {}, \"aws_region\": \"{}\", \"iam_role_arn\": \"{}\" }}",
config.port, config.use_https, config.region, config.role_arn
return fmt::format("{{ \"type\": \"s3\", \"aws_region\": \"{}\", \"iam_role_arn\": \"{}\" }}",
region, iam_role_arn
);
}
@@ -99,8 +94,6 @@ const std::string& db::object_storage_endpoint_param::type() const {
db::object_storage_endpoint_param db::object_storage_endpoint_param::decode(const YAML::Node& node) {
auto name = node["name"];
auto aws_region = node["aws_region"];
auto iam_role_arn = node["iam_role_arn"];
auto type = node["type"];
auto get_opt = [](auto& node, const std::string& key, auto def) {
@@ -108,13 +101,19 @@ db::object_storage_endpoint_param db::object_storage_endpoint_param::decode(cons
return tmp ? tmp.template as<std::decay_t<decltype(def)>>() : def;
};
// aws s3 endpoint.
if (!type || type.as<std::string>() == s3_type || aws_region || iam_role_arn) {
if (!type || type.as<std::string>() == s3_type ) {
s3_storage ep;
ep.endpoint = name.as<std::string>();
ep.config.port = node["port"].as<unsigned>();
ep.config.use_https = node["https"].as<bool>(false);
ep.config.region = aws_region ? aws_region.as<std::string>() : std::getenv("AWS_DEFAULT_REGION");
ep.config.role_arn = iam_role_arn ? iam_role_arn.as<std::string>() : "";
auto aws_region = node["aws_region"];
ep.region = aws_region ? aws_region.as<std::string>() : std::getenv("AWS_DEFAULT_REGION");
ep.iam_role_arn = get_opt(node, "iam_role_arn", ""s);
if (maybe_legacy_endpoint_name(ep.endpoint)) {
// Support legacy config for a while
auto port = node["port"].as<unsigned>();
auto use_https = node["https"].as<bool>(false);
ep.endpoint = fmt::format("http{}://{}:{}", use_https ? "s" : "", ep.endpoint, port);
}
return object_storage_endpoint_param{std::move(ep)};
}
@@ -135,5 +134,5 @@ const std::string db::object_storage_endpoint_param::gs_type = "gs";
auto fmt::formatter<db::object_storage_endpoint_param>::format(const db::object_storage_endpoint_param& e, fmt::format_context& ctx) const
-> decltype(ctx.out()) {
return fmt::format_to(ctx.out(), "object_storage_endpoint_param{{}}", e.to_json_string());
return fmt::format_to(ctx.out(), "object_storage_endpoint_param{}", e.to_json_string());
}

View File

@@ -13,7 +13,6 @@
#include <variant>
#include <compare>
#include <fmt/core.h>
#include "utils/s3/creds.hh"
namespace YAML {
class Node;
@@ -25,7 +24,8 @@ class object_storage_endpoint_param {
public:
struct s3_storage {
std::string endpoint;
s3::endpoint_config config;
std::string region;
std::string iam_role_arn;
std::strong_ordering operator<=>(const s3_storage&) const = default;
std::string to_json_string() const;
@@ -43,7 +43,6 @@ public:
object_storage_endpoint_param();
object_storage_endpoint_param(const object_storage_endpoint_param&);
object_storage_endpoint_param(s3_storage);
object_storage_endpoint_param(std::string endpoint, s3::endpoint_config config);
object_storage_endpoint_param(gs_storage);
std::strong_ordering operator<=>(const object_storage_endpoint_param&) const;
@@ -77,3 +76,7 @@ template <>
struct fmt::formatter<db::object_storage_endpoint_param> : fmt::formatter<std::string_view> {
auto format(const db::object_storage_endpoint_param&, fmt::format_context& ctx) const -> decltype(ctx.out());
};
inline bool maybe_legacy_endpoint_name(std::string_view ep) noexcept {
return !(ep.starts_with("http://") || ep.starts_with("https://"));
}

View File

@@ -1262,16 +1262,9 @@ static future<> do_merge_schema(sharded<service::storage_proxy>& proxy, sharded
{
slogger.trace("do_merge_schema: {}", mutations);
schema_applier ap(proxy, ss, sys_ks, reload);
std::exception_ptr ex;
try {
co_await execute_do_merge_schema(proxy, ap, std::move(mutations));
} catch (...) {
ex = std::current_exception();
}
co_await ap.destroy();
if (ex) {
throw ex;
}
co_await execute_do_merge_schema(proxy, ap, std::move(mutations)).finally([&ap]() {
return ap.destroy();
});
}
/**

View File

@@ -110,6 +110,7 @@ namespace {
system_keyspace::v3::CDC_LOCAL,
system_keyspace::DICTS,
system_keyspace::VIEW_BUILDING_TASKS,
system_keyspace::CLIENT_ROUTES,
};
if (ks_name == system_keyspace::NAME && tables.contains(cf_name)) {
props.enable_schema_commitlog();
@@ -137,6 +138,7 @@ namespace {
system_keyspace::ROLE_PERMISSIONS,
system_keyspace::DICTS,
system_keyspace::VIEW_BUILDING_TASKS,
system_keyspace::CLIENT_ROUTES,
};
if (ks_name == system_keyspace::NAME && tables.contains(cf_name)) {
props.is_group0_table = true;
@@ -213,6 +215,30 @@ schema_ptr system_keyspace::batchlog() {
return batchlog;
}
schema_ptr system_keyspace::batchlog_v2() {
static thread_local auto batchlog_v2 = [] {
schema_builder builder(generate_legacy_id(NAME, BATCHLOG_V2), NAME, BATCHLOG_V2,
// partition key
{{"version", int32_type}, {"stage", byte_type}, {"shard", int32_type}},
// clustering key
{{"written_at", timestamp_type}, {"id", uuid_type}},
// regular columns
{{"data", bytes_type}},
// static columns
{},
// regular column name type
utf8_type,
// comment
"batches awaiting replay"
);
builder.set_gc_grace_seconds(0);
builder.set_caching_options(caching_options::get_disabled_caching_options());
builder.with_hash_version();
return builder.build(schema_builder::compact_storage::no);
}();
return batchlog_v2;
}
/*static*/ schema_ptr system_keyspace::paxos() {
static thread_local auto paxos = [] {
// FIXME: switch to the new schema_builder interface (with_column(...), etc)
@@ -285,6 +311,7 @@ schema_ptr system_keyspace::topology() {
.with_column("tablet_balancing_enabled", boolean_type, column_kind::static_column)
.with_column("upgrade_state", utf8_type, column_kind::static_column)
.with_column("global_requests", set_type_impl::get_instance(timeuuid_type, true), column_kind::static_column)
.with_column("paused_rf_change_requests", set_type_impl::get_instance(timeuuid_type, true), column_kind::static_column)
.set_comment("Current state of topology change machine")
.with_hash_version()
.build();
@@ -1391,6 +1418,23 @@ schema_ptr system_keyspace::view_building_tasks() {
return schema;
}
schema_ptr system_keyspace::client_routes() {
static thread_local auto schema = [] {
auto id = generate_legacy_id(NAME, CLIENT_ROUTES);
return schema_builder(NAME, CLIENT_ROUTES, std::make_optional(id))
.with_column("connection_id", utf8_type, column_kind::partition_key)
.with_column("host_id", uuid_type, column_kind::clustering_key)
.with_column("address", utf8_type)
.with_column("port", int32_type)
.with_column("tls_port", int32_type)
.with_column("alternator_port", int32_type)
.with_column("alternator_https_port", int32_type)
.with_hash_version()
.build();
}();
return schema;
}
future<system_keyspace::local_info> system_keyspace::load_local_info() {
auto msg = co_await execute_cql(format("SELECT host_id, cluster_name, data_center, rack FROM system.{} WHERE key=?", LOCAL), sstring(LOCAL));
@@ -2304,7 +2348,7 @@ std::vector<schema_ptr> system_keyspace::all_tables(const db::config& cfg) {
std::copy(schema_tables.begin(), schema_tables.end(), std::back_inserter(r));
auto auth_tables = system_keyspace::auth_tables();
std::copy(auth_tables.begin(), auth_tables.end(), std::back_inserter(r));
r.insert(r.end(), { built_indexes(), hints(), batchlog(), paxos(), local(),
r.insert(r.end(), { built_indexes(), hints(), batchlog(), batchlog_v2(), paxos(), local(),
peers(), peer_events(), range_xfers(),
compactions_in_progress(), compaction_history(),
sstable_activity(), size_estimates(), large_partitions(), large_rows(), large_cells(),
@@ -2318,7 +2362,7 @@ std::vector<schema_ptr> system_keyspace::all_tables(const db::config& cfg) {
v3::cdc_local(),
raft(), raft_snapshots(), raft_snapshot_config(), group0_history(), discovery(),
topology(), cdc_generations_v3(), topology_requests(), service_levels_v2(), view_build_status_v2(),
dicts(), view_building_tasks(), cdc_streams_state(), cdc_streams_history()
dicts(), view_building_tasks(), client_routes(), cdc_streams_state(), cdc_streams_history()
});
if (cfg.check_experimental(db::experimental_features_t::feature::BROADCAST_TABLES)) {
@@ -2335,7 +2379,9 @@ std::vector<schema_ptr> system_keyspace::all_tables(const db::config& cfg) {
}
static bool maybe_write_in_user_memory(schema_ptr s) {
return (s.get() == system_keyspace::batchlog().get()) || (s.get() == system_keyspace::paxos().get())
return (s.get() == system_keyspace::batchlog().get())
|| (s.get() == system_keyspace::batchlog_v2().get())
|| (s.get() == system_keyspace::paxos().get())
|| s == system_keyspace::v3::scylla_views_builds_in_progress();
}
@@ -3111,7 +3157,10 @@ static bool must_have_tokens(service::node_state nst) {
// A decommissioning node doesn't have tokens at the end, they are
// removed during transition to the left_token_ring state.
case service::node_state::decommissioning: return false;
case service::node_state::removing: return true;
// A removing node might or might not have tokens depending on whether
// REMOVENODE_WITH_LEFT_TOKEN_RING feature is enabled. To support both
// cases, we allow removing nodes to not have tokens.
case service::node_state::removing: return false;
case service::node_state::rebuilding: return true;
case service::node_state::normal: return true;
case service::node_state::left: return false;
@@ -3351,6 +3400,12 @@ future<service::topology> system_keyspace::load_topology_state(const std::unorde
}
}
if (some_row.has("paused_rf_change_requests")) {
for (auto&& v : deserialize_set_column(*topology(), some_row, "paused_rf_change_requests")) {
ret.paused_rf_change_requests.insert(value_cast<utils::UUID>(v));
}
}
if (some_row.has("enabled_features")) {
ret.enabled_features = decode_features(deserialize_set_column(*topology(), some_row, "enabled_features"));
}
@@ -3562,35 +3617,43 @@ system_keyspace::topology_requests_entry system_keyspace::topology_request_row_t
return entry;
}
future<system_keyspace::topology_requests_entry> system_keyspace::get_topology_request_entry(utils::UUID id, bool require_entry) {
future<system_keyspace::topology_requests_entry> system_keyspace::get_topology_request_entry(utils::UUID id) {
auto r = co_await get_topology_request_entry_opt(id);
if (!r) {
on_internal_error(slogger, format("no entry for request id {}", id));
}
co_return std::move(*r);
}
future<std::optional<system_keyspace::topology_requests_entry>> system_keyspace::get_topology_request_entry_opt(utils::UUID id) {
auto rs = co_await execute_cql(
format("SELECT * FROM system.{} WHERE id = {}", TOPOLOGY_REQUESTS, id));
if (!rs || rs->empty()) {
if (require_entry) {
on_internal_error(slogger, format("no entry for request id {}", id));
} else {
co_return topology_requests_entry{
.id = utils::null_uuid()
};
}
co_return std::nullopt;
}
const auto& row = rs->one();
co_return topology_request_row_to_entry(id, row);
}
future<system_keyspace::topology_requests_entries> system_keyspace::get_node_ops_request_entries(db_clock::time_point end_time_limit) {
future<system_keyspace::topology_requests_entries> system_keyspace::get_topology_request_entries(std::vector<std::variant<service::topology_request, service::global_topology_request>> request_types, db_clock::time_point end_time_limit) {
sstring request_types_str = "";
bool first = true;
for (const auto& rt : request_types) {
if (!std::exchange(first, false)) {
request_types_str += ", ";
}
request_types_str += std::visit([] (auto&& arg) { return fmt::format("'{}'", arg); }, rt);
}
// Running requests.
auto rs_running = co_await execute_cql(
format("SELECT * FROM system.{} WHERE done = false AND request_type IN ('{}', '{}', '{}', '{}', '{}') ALLOW FILTERING", TOPOLOGY_REQUESTS,
service::topology_request::join, service::topology_request::replace, service::topology_request::rebuild, service::topology_request::leave, service::topology_request::remove));
format("SELECT * FROM system.{} WHERE done = false AND request_type IN ({}) ALLOW FILTERING", TOPOLOGY_REQUESTS, request_types_str));
// Requests which finished after end_time_limit.
auto rs_done = co_await execute_cql(
format("SELECT * FROM system.{} WHERE end_time > {} AND request_type IN ('{}', '{}', '{}', '{}', '{}') ALLOW FILTERING", TOPOLOGY_REQUESTS, end_time_limit.time_since_epoch().count(),
service::topology_request::join, service::topology_request::replace, service::topology_request::rebuild, service::topology_request::leave, service::topology_request::remove));
format("SELECT * FROM system.{} WHERE end_time > {} AND request_type IN ({}) ALLOW FILTERING", TOPOLOGY_REQUESTS, end_time_limit.time_since_epoch().count(), request_types_str));
topology_requests_entries m;
for (const auto& row: *rs_done) {
@@ -3608,6 +3671,16 @@ future<system_keyspace::topology_requests_entries> system_keyspace::get_node_ops
co_return m;
}
future<system_keyspace::topology_requests_entries> system_keyspace::get_node_ops_request_entries(db_clock::time_point end_time_limit) {
return get_topology_request_entries({
service::topology_request::join,
service::topology_request::replace,
service::topology_request::rebuild,
service::topology_request::leave,
service::topology_request::remove
}, end_time_limit);
}
future<mutation> system_keyspace::get_insert_dict_mutation(
std::string_view name,
bytes data,

View File

@@ -163,6 +163,7 @@ public:
static constexpr auto NAME = "system";
static constexpr auto HINTS = "hints";
static constexpr auto BATCHLOG = "batchlog";
static constexpr auto BATCHLOG_V2 = "batchlog_v2";
static constexpr auto PAXOS = "paxos";
static constexpr auto BUILT_INDEXES = "IndexInfo";
static constexpr auto LOCAL = "local";
@@ -198,6 +199,8 @@ public:
static constexpr auto VIEW_BUILD_STATUS_V2 = "view_build_status_v2";
static constexpr auto DICTS = "dicts";
static constexpr auto VIEW_BUILDING_TASKS = "view_building_tasks";
static constexpr auto CLIENT_ROUTES = "client_routes";
static constexpr auto VERSIONS = "versions";
// auth
static constexpr auto ROLES = "roles";
@@ -255,6 +258,7 @@ public:
static schema_ptr hints();
static schema_ptr batchlog();
static schema_ptr batchlog_v2();
static schema_ptr paxos();
static schema_ptr built_indexes(); // TODO (from Cassandra): make private
static schema_ptr raft();
@@ -274,6 +278,7 @@ public:
static schema_ptr view_build_status_v2();
static schema_ptr dicts();
static schema_ptr view_building_tasks();
static schema_ptr client_routes();
// auth
static schema_ptr roles();
@@ -665,7 +670,9 @@ public:
future<service::topology_request_state> get_topology_request_state(utils::UUID id, bool require_entry);
topology_requests_entry topology_request_row_to_entry(utils::UUID id, const cql3::untyped_result_set_row& row);
future<topology_requests_entry> get_topology_request_entry(utils::UUID id, bool require_entry);
future<topology_requests_entry> get_topology_request_entry(utils::UUID id);
future<std::optional<topology_requests_entry>> get_topology_request_entry_opt(utils::UUID id);
future<system_keyspace::topology_requests_entries> get_topology_request_entries(std::vector<std::variant<service::topology_request, service::global_topology_request>> request_types, db_clock::time_point end_time_limit);
future<topology_requests_entries> get_node_ops_request_entries(db_clock::time_point end_time_limit);
public:

View File

@@ -1744,6 +1744,115 @@ bool should_generate_view_updates_on_this_shard(const schema_ptr& base, const lo
&& std::ranges::contains(shards, this_shard_id());
}
static endpoints_to_update get_view_natural_endpoint_vnodes(
locator::host_id me,
std::vector<std::reference_wrapper<const locator::node>> base_nodes,
std::vector<std::reference_wrapper<const locator::node>> view_nodes,
locator::endpoint_dc_rack my_location,
const locator::network_topology_strategy* network_topology,
replica::cf_stats& cf_stats) {
using node_vector = std::vector<std::reference_wrapper<const locator::node>>;
node_vector base_endpoints, view_endpoints;
auto& my_datacenter = my_location.dc;
auto process_candidate = [&] (node_vector& nodes, std::reference_wrapper<const locator::node> node) {
if (!network_topology || node.get().dc() == my_datacenter) {
nodes.emplace_back(node);
}
};
for (auto&& base_node : base_nodes) {
process_candidate(base_endpoints, base_node);
}
for (auto&& view_node : view_nodes) {
auto it = std::ranges::find(base_endpoints, view_node.get().host_id(), std::mem_fn(&locator::node::host_id));
// If this base replica is also one of the view replicas, we use
// ourselves as the view replica.
// We don't return an extra endpoint, as it's only needed when
// using tablets (so !use_legacy_self_pairing)
if (view_node.get().host_id() == me && it != base_endpoints.end()) {
return {.natural_endpoint = me};
}
// We have to remove any endpoint which is shared between the base
// and the view, as it will select itself and throw off the counts
// otherwise.
if (it != base_endpoints.end()) {
base_endpoints.erase(it);
} else if (!network_topology || view_node.get().dc() == my_datacenter) {
view_endpoints.push_back(view_node);
}
}
auto base_it = std::ranges::find(base_endpoints, me, std::mem_fn(&locator::node::host_id));
if (base_it == base_endpoints.end()) {
// This node is not a base replica of this key, so we return empty
// FIXME: This case shouldn't happen, and if it happens, a view update
// would be lost.
++cf_stats.total_view_updates_on_wrong_node;
vlogger.warn("Could not find {} in base_endpoints={}", me,
base_endpoints | std::views::transform(std::mem_fn(&locator::node::host_id)));
return {};
}
size_t idx = base_it - base_endpoints.begin();
return {.natural_endpoint = view_endpoints[idx].get().host_id()};
}
static std::optional<locator::host_id> get_unpaired_view_endpoint(
std::vector<std::reference_wrapper<const locator::node>> base_nodes,
std::vector<std::reference_wrapper<const locator::node>> view_nodes,
replica::cf_stats& cf_stats) {
std::unordered_set<locator::endpoint_dc_rack> base_dc_racks;
for (auto&& base_node : base_nodes) {
if (base_dc_racks.contains(base_node.get().dc_rack())) {
// We can't do rack-aware pairing if there are multiple replicas in the same rack.
++cf_stats.total_view_updates_failed_pairing;
vlogger.warn("Can't perform base-view pairing in this topology. There are multiple base table replicas in the same dc/rack({}/{}):",
base_node.get().dc(), base_node.get().rack());
return std::nullopt;
}
base_dc_racks.insert(base_node.get().dc_rack());
}
std::unordered_set<locator::endpoint_dc_rack> paired_view_dc_racks;
std::unordered_map<locator::endpoint_dc_rack, locator::host_id> unpaired_view_dc_rack_replicas;
for (auto&& view_node : view_nodes) {
if (paired_view_dc_racks.contains(view_node.get().dc_rack()) || unpaired_view_dc_rack_replicas.contains(view_node.get().dc_rack())) {
// We can't do rack-aware pairing if there are multiple replicas in the same rack.
++cf_stats.total_view_updates_failed_pairing;
vlogger.warn("Can't perform base-view pairing in this topology. There are multiple view table replicas in the same dc/rack({}/{}):",
view_node.get().dc(), view_node.get().rack());
return std::nullopt;
}
// Track unpaired replicas in both sets
if (base_dc_racks.contains(view_node.get().dc_rack())) {
paired_view_dc_racks.insert(view_node.get().dc_rack());
} else {
unpaired_view_dc_rack_replicas.insert({view_node.get().dc_rack(), view_node.get().host_id()});
}
}
if (unpaired_view_dc_rack_replicas.size() > 0) {
// There are view replicas that can't be paired with any base replica
// This can happen as a result of an RF change when the view replica finishes streaming
// before the base replica.
// Because of this, a view replica might not get paired with any base replica, so we need
// to send an additional update to it.
++cf_stats.total_view_updates_due_to_replica_count_mismatch;
auto extra_replica = unpaired_view_dc_rack_replicas.begin()->second;
unpaired_view_dc_rack_replicas.erase(unpaired_view_dc_rack_replicas.begin());
if (unpaired_view_dc_rack_replicas.size() > 0) {
// We only expect one extra replica to appear due to an RF change. If there's more, that's an error,
// but we'll still perform updates to the paired and last replicas to minimize degradation.
vlogger.warn("There are too many view endpoints for base-view pairing. View updates may get lost on view_endpoints={}",
unpaired_view_dc_rack_replicas | std::views::values);
}
return extra_replica;
}
return std::nullopt;
}
// Calculate the node ("natural endpoint") to which this node should send
// a view update.
//
@@ -1756,29 +1865,19 @@ bool should_generate_view_updates_on_this_shard(const schema_ptr& base, const lo
// of this function is to find, assuming that this node is one of the base
// replicas for a given partition, the paired view replica.
//
// In the past, we used an optimization called "self-pairing" that if a single
// node was both a base replica and a view replica for a write, the pairing is
// modified so that this node would send the update to itself. This self-
// pairing optimization could cause the pairing to change after view ranges
// are moved between nodes, so currently we only use it if
// use_legacy_self_pairing is set to true. When using tablets - where range
// movements are common - it is strongly recommended to set it to false.
// When using vnodes, we have an optimization called "self-pairing" - if a single
// node is both a base replica and a view replica for a write, the pairing is
// modified so that this node sends the update to itself and this node is removed
// from the lists of nodes paired by index. This self-pairing optimization can
// cause the pairing to change after view ranges are moved between nodes.
//
// If the keyspace's replication strategy is a NetworkTopologyStrategy,
// we pair only nodes in the same datacenter.
//
// When use_legacy_self_pairing is enabled, if one of the base replicas
// also happens to be a view replica, it is paired with itself
// (with the other nodes paired by order in the list
// after taking this node out).
//
// If the table uses tablets and the replication strategy is NetworkTopologyStrategy
// and the replication factor in the node's datacenter is a multiple of the number
// of racks in the datacenter, then pairing is rack-aware. In this case,
// all racks have the same number of replicas, and those are never migrated
// outside their racks. Therefore, the base replicas are naturally paired with the
// view replicas that are in the same rack, based on the ordinal position.
// Note that typically, there is a single replica per rack and pairing is trivial.
// If the table uses tablets, then pairing is rack-aware. In this case, in each
// rack where we have a base replica there is also one replica of each view tablet.
// Therefore, the base replicas are naturally paired with the view replicas that
// are in the same rack.
//
// If the assumption that the given base token belongs to this replica
// does not hold, we return an empty optional.
@@ -1806,19 +1905,12 @@ endpoints_to_update get_view_natural_endpoint(
const locator::abstract_replication_strategy& replication_strategy,
const dht::token& base_token,
const dht::token& view_token,
bool use_legacy_self_pairing,
bool use_tablets_rack_aware_view_pairing,
bool use_tablets,
replica::cf_stats& cf_stats) {
auto& topology = base_erm->get_token_metadata_ptr()->get_topology();
auto& view_topology = view_erm->get_token_metadata_ptr()->get_topology();
auto& my_location = topology.get_location(me);
auto& my_datacenter = my_location.dc;
auto* network_topology = dynamic_cast<const locator::network_topology_strategy*>(&replication_strategy);
auto rack_aware_pairing = use_tablets_rack_aware_view_pairing && network_topology;
bool simple_rack_aware_pairing = false;
using node_vector = std::vector<std::reference_wrapper<const locator::node>>;
node_vector orig_base_endpoints, orig_view_endpoints;
node_vector base_endpoints, view_endpoints;
auto resolve = [&] (const locator::topology& topology, const locator::host_id& ep, bool is_view) -> const locator::node& {
if (auto* np = topology.find_node(ep)) {
@@ -1829,6 +1921,7 @@ endpoints_to_update get_view_natural_endpoint(
// We need to use get_replicas() for pairing to be stable in case base or view tablet
// is rebuilding a replica which has left the ring. get_natural_endpoints() filters such replicas.
using node_vector = std::vector<std::reference_wrapper<const locator::node>>;
auto base_nodes = base_erm->get_replicas(base_token) | std::views::transform([&] (const locator::host_id& ep) -> const locator::node& {
return resolve(topology, ep, false);
}) | std::ranges::to<node_vector>();
@@ -1852,231 +1945,43 @@ endpoints_to_update get_view_natural_endpoint(
// note that the recursive call will not recurse again because leaving_base is in base_nodes.
auto leaving_base = it->get().host_id();
return get_view_natural_endpoint(leaving_base, base_erm, view_erm, replication_strategy, base_token,
view_token, use_legacy_self_pairing, use_tablets_rack_aware_view_pairing, cf_stats);
view_token, use_tablets, cf_stats);
}
}
}
std::function<bool(const locator::node&)> is_candidate;
if (network_topology) {
is_candidate = [&] (const locator::node& node) { return node.dc() == my_datacenter; };
} else {
is_candidate = [&] (const locator::node&) { return true; };
}
auto process_candidate = [&] (node_vector& nodes, std::reference_wrapper<const locator::node> node) {
if (is_candidate(node)) {
nodes.emplace_back(node);
}
};
for (auto&& base_node : base_nodes) {
process_candidate(base_endpoints, base_node);
if (!use_tablets) {
return get_view_natural_endpoint_vnodes(
me,
base_nodes,
view_nodes,
my_location,
network_topology,
cf_stats);
}
if (use_legacy_self_pairing) {
for (auto&& view_node : view_nodes) {
auto it = std::ranges::find(base_endpoints, view_node.get().host_id(), std::mem_fn(&locator::node::host_id));
// If this base replica is also one of the view replicas, we use
// ourselves as the view replica.
// We don't return an extra endpoint, as it's only needed when
// using tablets (so !use_legacy_self_pairing)
if (view_node.get().host_id() == me && it != base_endpoints.end()) {
return {.natural_endpoint = me};
}
// We have to remove any endpoint which is shared between the base
// and the view, as it will select itself and throw off the counts
// otherwise.
if (it != base_endpoints.end()) {
base_endpoints.erase(it);
} else if (is_candidate(view_node)) {
view_endpoints.push_back(view_node);
}
}
} else {
for (auto&& view_node : view_nodes) {
process_candidate(view_endpoints, view_node);
std::optional<locator::host_id> paired_replica;
for (auto&& view_node : view_nodes) {
if (view_node.get().dc_rack() == my_location) {
paired_replica = view_node.get().host_id();
break;
}
}
// Try optimizing for simple rack-aware pairing
// If the numbers of base and view replica differ, that means an RF change is taking place
// and we can't use simple rack-aware pairing.
if (rack_aware_pairing && base_endpoints.size() == view_endpoints.size()) {
auto dc_rf = network_topology->get_replication_factor(my_datacenter);
const auto& racks = topology.get_datacenter_rack_nodes().at(my_datacenter);
// Simple rack-aware pairing is possible when the datacenter replication factor
// is a multiple of the number of racks in the datacenter.
if (dc_rf % racks.size() == 0) {
simple_rack_aware_pairing = true;
size_t rack_rf = dc_rf / racks.size();
// If any rack doesn't have enough nodes to satisfy the per-rack rf
// simple rack-aware pairing is disabled.
for (const auto& [rack, nodes] : racks) {
if (nodes.size() < rack_rf) {
simple_rack_aware_pairing = false;
break;
}
}
}
if (dc_rf != base_endpoints.size()) {
// If the datacenter replication factor is not equal to the number of base replicas,
// we're in progress of a RF change and we can't use simple rack-aware pairing.
simple_rack_aware_pairing = false;
}
if (simple_rack_aware_pairing) {
std::erase_if(base_endpoints, [&] (const locator::node& node) { return node.dc_rack() != my_location; });
std::erase_if(view_endpoints, [&] (const locator::node& node) { return node.dc_rack() != my_location; });
}
if (paired_replica && base_nodes.size() == view_nodes.size()) {
// We don't need to find any extra replicas, so we can return early
return {.natural_endpoint = paired_replica};
}
orig_base_endpoints = base_endpoints;
orig_view_endpoints = view_endpoints;
// For the complex rack_aware_pairing case, nodes are already filtered by datacenter
// Use best-match, for the minimum number of base and view replicas in each rack,
// and ordinal match for the rest.
std::optional<std::reference_wrapper<const locator::node>> paired_replica;
if (rack_aware_pairing && !simple_rack_aware_pairing) {
struct indexed_replica {
size_t idx;
std::reference_wrapper<const locator::node> node;
};
std::unordered_map<sstring, std::vector<indexed_replica>> base_racks, view_racks;
// First, index all replicas by rack
auto index_replica_set = [] (std::unordered_map<sstring, std::vector<indexed_replica>>& racks, const node_vector& replicas) {
size_t idx = 0;
for (const auto& r: replicas) {
racks[r.get().rack()].emplace_back(idx++, r);
}
};
index_replica_set(base_racks, base_endpoints);
index_replica_set(view_racks, view_endpoints);
// Try optimistically pairing `me` first
const auto& my_base_replicas = base_racks[my_location.rack];
auto base_it = std::ranges::find(my_base_replicas, me, [] (const indexed_replica& ir) { return ir.node.get().host_id(); });
if (base_it == my_base_replicas.end()) {
return {};
}
const auto& my_view_replicas = view_racks[my_location.rack];
size_t idx = base_it - my_base_replicas.begin();
if (idx < my_view_replicas.size()) {
if (orig_view_endpoints.size() <= orig_base_endpoints.size()) {
return {.natural_endpoint = my_view_replicas[idx].node.get().host_id()};
} else {
// If the number of view replicas is larger than the number of base replicas,
// we need to find the unpaired view replica, so we can't return yet.
paired_replica = my_view_replicas[idx].node;
}
}
// Collect all unpaired base and view replicas,
// where the number of replicas in the base rack is different than the respective view rack
std::vector<indexed_replica> unpaired_base_replicas, unpaired_view_replicas;
for (const auto& [rack, base_replicas] : base_racks) {
const auto& view_replicas = view_racks[rack];
for (auto i = view_replicas.size(); i < base_replicas.size(); ++i) {
unpaired_base_replicas.emplace_back(base_replicas[i]);
}
}
for (const auto& [rack, view_replicas] : view_racks) {
const auto& base_replicas = base_racks[rack];
for (auto i = base_replicas.size(); i < view_replicas.size(); ++i) {
unpaired_view_replicas.emplace_back(view_replicas[i]);
}
}
// Sort by the original ordinality, and copy the sorted results
// back into {base,view}_endpoints, for backward compatible processing below.
std::ranges::sort(unpaired_base_replicas, std::less(), std::mem_fn(&indexed_replica::idx));
base_endpoints.clear();
std::ranges::transform(unpaired_base_replicas, std::back_inserter(base_endpoints), std::mem_fn(&indexed_replica::node));
std::ranges::sort(unpaired_view_replicas, std::less(), std::mem_fn(&indexed_replica::idx));
view_endpoints.clear();
std::ranges::transform(unpaired_view_replicas, std::back_inserter(view_endpoints), std::mem_fn(&indexed_replica::node));
}
auto base_it = std::ranges::find(base_endpoints, me, std::mem_fn(&locator::node::host_id));
if (!paired_replica && base_it == base_endpoints.end()) {
// This node is not a base replica of this key, so we return empty
// FIXME: This case shouldn't happen, and if it happens, a view update
// would be lost.
++cf_stats.total_view_updates_on_wrong_node;
vlogger.warn("Could not find {} in base_endpoints={}", me,
orig_base_endpoints | std::views::transform(std::mem_fn(&locator::node::host_id)));
return {};
}
size_t idx = base_it - base_endpoints.begin();
std::optional<std::reference_wrapper<const locator::node>> no_pairing_replica;
if (!paired_replica && idx >= view_endpoints.size()) {
// There are fewer view replicas than base replicas
// FIXME: This might still happen when reducing replication factor with tablets,
// see https://github.com/scylladb/scylladb/issues/21492
++cf_stats.total_view_updates_failed_pairing;
vlogger.warn("Could not pair {}: rack_aware={} base_endpoints={} view_endpoints={}", me,
rack_aware_pairing ? (simple_rack_aware_pairing ? "simple" : "complex") : "none",
orig_base_endpoints | std::views::transform(std::mem_fn(&locator::node::host_id)),
orig_view_endpoints | std::views::transform(std::mem_fn(&locator::node::host_id)));
return {};
} else if (base_endpoints.size() < view_endpoints.size()) {
// There are fewer base replicas than view replicas.
// This can happen as a result of an RF change when the view replica finishes streaming
// before the base replica.
// Because of this, a view replica might not get paired with any base replica, so we need
// to send an additional update to it.
++cf_stats.total_view_updates_due_to_replica_count_mismatch;
no_pairing_replica = view_endpoints.back();
if (base_endpoints.size() < view_endpoints.size() - 1) {
// We only expect one extra replica to appear due to an RF change. If there's more, that's an error,
// but we'll still perform updates to the paired and last replicas to minimize degradation.
vlogger.warn("There are too many view endpoints for base-view pairing. View updates may get lost on view_endpoints={}",
std::span(view_endpoints.begin() + base_endpoints.size(), view_endpoints.end() - 1) | std::views::transform(std::mem_fn(&locator::node::host_id)));
}
}
if (!paired_replica) {
paired_replica = view_endpoints[idx];
// We couldn't find any view replica in our rack
++cf_stats.total_view_updates_failed_pairing;
vlogger.warn("Could not find a view replica in the same rack as base replica {} for base_endpoints={} view_endpoints={}",
me,
base_nodes | std::views::transform(std::mem_fn(&locator::node::host_id)),
view_nodes | std::views::transform(std::mem_fn(&locator::node::host_id)));
}
if (!no_pairing_replica && base_nodes.size() < view_nodes.size()) {
// This can happen when the view replica with no pairing is in another DC.
// We need to send an update to it if there are no base replicas in that DC yet,
// as it won't receive updates otherwise.
std::unordered_set<sstring> dcs_with_base_replicas;
for (const auto& base_node : base_nodes) {
dcs_with_base_replicas.insert(base_node.get().dc());
}
for (const auto& view_node : view_nodes) {
if (!dcs_with_base_replicas.contains(view_node.get().dc())) {
++cf_stats.total_view_updates_due_to_replica_count_mismatch;
no_pairing_replica = view_node;
break;
}
}
}
// https://github.com/scylladb/scylladb/issues/19439
// With tablets, a node being replaced might transition to "left" state
// but still be kept as a replica.
// As of writing this hints are not prepared to handle nodes that are left
// but are still replicas. Therefore, there is no other sensible option
// right now but to give up attempt to send the update or write a hint
// to the paired, permanently down replica.
// We use the same workaround for the extra replica.
auto return_host_id_if_not_left = [] (const auto& replica) -> std::optional<locator::host_id> {
if (!replica) {
return std::nullopt;
}
const auto& node = replica->get();
if (!node.left()) {
return node.host_id();
} else {
return std::nullopt;
}
};
return {.natural_endpoint = return_host_id_if_not_left(paired_replica),
.endpoint_with_no_pairing = return_host_id_if_not_left(no_pairing_replica)};
std::optional<locator::host_id> no_pairing_replica = get_unpaired_view_endpoint(base_nodes, view_nodes, cf_stats);
return {.natural_endpoint = paired_replica,
.endpoint_with_no_pairing = no_pairing_replica};
}
static future<> apply_to_remote_endpoints(service::storage_proxy& proxy, locator::effective_replication_map_ptr ermp,
@@ -2136,12 +2041,6 @@ future<> view_update_generator::mutate_MV(
{
auto& ks = _db.find_keyspace(base->ks_name());
auto& replication = ks.get_replication_strategy();
// We set legacy self-pairing for old vnode-based tables (for backward
// compatibility), and unset it for tablets - where range movements
// are more frequent and backward compatibility is less important.
// TODO: Maybe allow users to set use_legacy_self_pairing explicitly
// on a view, like we have the synchronous_updates_flag.
bool use_legacy_self_pairing = !ks.uses_tablets();
std::unordered_map<table_id, locator::effective_replication_map_ptr> erms;
auto get_erm = [&] (table_id id) {
auto it = erms.find(id);
@@ -2154,10 +2053,6 @@ future<> view_update_generator::mutate_MV(
for (const auto& mut : view_updates) {
(void)get_erm(mut.s->id());
}
// Enable rack-aware view updates pairing for tablets
// when the cluster feature is enabled so that all replicas agree
// on the pairing algorithm.
bool use_tablets_rack_aware_view_pairing = _db.features().tablet_rack_aware_view_pairing && ks.uses_tablets();
auto me = base_ermp->get_topology().my_host_id();
static constexpr size_t max_concurrent_updates = 128;
co_await utils::get_local_injector().inject("delay_before_get_view_natural_endpoint", 8000ms);
@@ -2165,7 +2060,7 @@ future<> view_update_generator::mutate_MV(
auto view_token = dht::get_token(*mut.s, mut.fm.key());
auto view_ermp = erms.at(mut.s->id());
auto [target_endpoint, no_pairing_endpoint] = get_view_natural_endpoint(me, base_ermp, view_ermp, replication, base_token, view_token,
use_legacy_self_pairing, use_tablets_rack_aware_view_pairing, cf_stats);
ks.uses_tablets(), cf_stats);
auto remote_endpoints = view_ermp->get_pending_replicas(view_token);
auto memory_units = seastar::make_lw_shared<db::timeout_semaphore_units>(pending_view_update_memory_units.split(memory_usage_of(mut)));
if (no_pairing_endpoint) {

View File

@@ -305,8 +305,7 @@ endpoints_to_update get_view_natural_endpoint(
const locator::abstract_replication_strategy& replication_strategy,
const dht::token& base_token,
const dht::token& view_token,
bool use_legacy_self_pairing,
bool use_tablets_basic_rack_aware_view_pairing,
bool use_tablets,
replica::cf_stats& cf_stats);
/// Verify that the provided keyspace is eligible for storing materialized views.

View File

@@ -198,6 +198,7 @@ future<> view_building_worker::register_staging_sstable_tasks(std::vector<sstabl
future<> view_building_worker::run_staging_sstables_registrator() {
while (!_as.abort_requested()) {
bool sleep = false;
try {
auto lock = co_await get_units(_staging_sstables_mutex, 1, _as);
co_await create_staging_sstable_tasks();
@@ -214,6 +215,14 @@ future<> view_building_worker::run_staging_sstables_registrator() {
vbw_logger.warn("Got group0_concurrent_modification while creating staging sstable tasks");
} catch (raft::request_aborted&) {
vbw_logger.warn("Got raft::request_aborted while creating staging sstable tasks");
} catch (...) {
vbw_logger.error("Exception while creating staging sstable tasks: {}", std::current_exception());
sleep = true;
}
if (sleep) {
vbw_logger.debug("Sleeping after exception.");
co_await seastar::sleep_abortable(1s, _as).handle_exception([] (auto x) { return make_ready_future<>(); });
}
}
}
@@ -417,9 +426,12 @@ future<> view_building_worker::check_for_aborted_tasks() {
auto my_host_id = vbw._db.get_token_metadata().get_topology().my_host_id();
auto my_replica = locator::tablet_replica{my_host_id, this_shard_id()};
auto tasks_map = vbw._state._batch->tasks; // Potentially, we'll remove elements from the map, so we need a copy to iterate over it
for (auto& [id, t]: tasks_map) {
auto task_opt = building_state.get_task(t.base_id, my_replica, id);
auto it = vbw._state._batch->tasks.begin();
while (it != vbw._state._batch->tasks.end()) {
auto id = it->first;
auto task_opt = building_state.get_task(it->second.base_id, my_replica, id);
++it; // Advance the iterator before potentially removing the entry from the map.
if (!task_opt || task_opt->get().aborted) {
co_await vbw._state._batch->abort_task(id);
}
@@ -449,7 +461,7 @@ static std::unordered_set<table_id> get_ids_of_all_views(replica::database& db,
}) | std::ranges::to<std::unordered_set>();;
}
// If `state::processing_base_table` is diffrent that the `view_building_state::currently_processed_base_table`,
// If `state::processing_base_table` is different that the `view_building_state::currently_processed_base_table`,
// clear the state, save and flush new base table
future<> view_building_worker::state::update_processing_base_table(replica::database& db, const view_building_state& building_state, abort_source& as) {
if (processing_base_table != building_state.currently_processed_base_table) {
@@ -571,8 +583,6 @@ future<> view_building_worker::batch::do_work() {
break;
}
}
_vbw.local()._vb_state_machine.event.broadcast();
}
future<> view_building_worker::do_build_range(table_id base_id, std::vector<table_id> views_ids, dht::token last_token, abort_source& as) {
@@ -774,13 +784,15 @@ future<std::vector<utils::UUID>> view_building_worker::work_on_tasks(raft::term_
tasks.insert({id, *task_opt});
}
#ifdef SEASTAR_DEBUG
auto& some_task = tasks.begin()->second;
for (auto& [_, t]: tasks) {
SCYLLA_ASSERT(t.base_id == some_task.base_id);
SCYLLA_ASSERT(t.last_token == some_task.last_token);
SCYLLA_ASSERT(t.replica == some_task.replica);
SCYLLA_ASSERT(t.type == some_task.type);
SCYLLA_ASSERT(t.replica.shard == this_shard_id());
{
auto& some_task = tasks.begin()->second;
for (auto& [_, t]: tasks) {
SCYLLA_ASSERT(t.base_id == some_task.base_id);
SCYLLA_ASSERT(t.last_token == some_task.last_token);
SCYLLA_ASSERT(t.replica == some_task.replica);
SCYLLA_ASSERT(t.type == some_task.type);
SCYLLA_ASSERT(t.replica.shard == this_shard_id());
}
}
#endif
@@ -811,25 +823,6 @@ future<std::vector<utils::UUID>> view_building_worker::work_on_tasks(raft::term_
co_return collect_completed_tasks();
}
}
}

View File

@@ -605,8 +605,8 @@ public:
}
static schema_ptr build_schema() {
auto id = generate_legacy_id(system_keyspace::NAME, "versions");
return schema_builder(system_keyspace::NAME, "versions", std::make_optional(id))
auto id = generate_legacy_id(system_keyspace::NAME, system_keyspace::VERSIONS);
return schema_builder(system_keyspace::NAME, system_keyspace::VERSIONS, std::make_optional(id))
.with_column("key", utf8_type, column_kind::partition_key)
.with_column("version", utf8_type)
.with_column("build_mode", utf8_type)
@@ -749,6 +749,7 @@ class clients_table : public streaming_virtual_table {
.with_column("ssl_protocol", utf8_type)
.with_column("username", utf8_type)
.with_column("scheduling_group", utf8_type)
.with_column("client_options", map_type_impl::get_instance(utf8_type, utf8_type, false))
.with_hash_version()
.build();
}
@@ -766,7 +767,7 @@ class clients_table : public streaming_virtual_table {
future<> execute(reader_permit permit, result_collector& result, const query_restrictions& qr) override {
// Collect
using client_data_vec = utils::chunked_vector<client_data>;
using client_data_vec = utils::chunked_vector<foreign_ptr<std::unique_ptr<client_data>>>;
using shard_client_data = std::vector<client_data_vec>;
std::vector<foreign_ptr<std::unique_ptr<shard_client_data>>> cd_vec;
cd_vec.resize(smp::count);
@@ -806,13 +807,13 @@ class clients_table : public streaming_virtual_table {
for (unsigned i = 0; i < smp::count; i++) {
for (auto&& ps_cdc : *cd_vec[i]) {
for (auto&& cd : ps_cdc) {
if (cd_map.contains(cd.ip)) {
cd_map[cd.ip].emplace_back(std::move(cd));
if (cd_map.contains(cd->ip)) {
cd_map[cd->ip].emplace_back(std::move(cd));
} else {
dht::decorated_key key = make_partition_key(cd.ip);
dht::decorated_key key = make_partition_key(cd->ip);
if (this_shard_owns(key) && contains_key(qr.partition_range(), key)) {
ips.insert(decorated_ip{std::move(key), cd.ip});
cd_map[cd.ip].emplace_back(std::move(cd));
ips.insert(decorated_ip{std::move(key), cd->ip});
cd_map[cd->ip].emplace_back(std::move(cd));
}
}
co_await coroutine::maybe_yield();
@@ -825,39 +826,58 @@ class clients_table : public streaming_virtual_table {
co_await result.emit_partition_start(dip.key);
auto& clients = cd_map[dip.ip];
std::ranges::sort(clients, [] (const client_data& a, const client_data& b) {
return a.port < b.port || a.client_type_str() < b.client_type_str();
std::ranges::sort(clients, [] (const foreign_ptr<std::unique_ptr<client_data>>& a, const foreign_ptr<std::unique_ptr<client_data>>& b) {
return a->port < b->port || a->client_type_str() < b->client_type_str();
});
for (const auto& cd : clients) {
clustering_row cr(make_clustering_key(cd.port, cd.client_type_str()));
set_cell(cr.cells(), "shard_id", cd.shard_id);
set_cell(cr.cells(), "connection_stage", cd.stage_str());
if (cd.driver_name) {
set_cell(cr.cells(), "driver_name", *cd.driver_name);
clustering_row cr(make_clustering_key(cd->port, cd->client_type_str()));
set_cell(cr.cells(), "shard_id", cd->shard_id);
set_cell(cr.cells(), "connection_stage", cd->stage_str());
if (cd->driver_name) {
set_cell(cr.cells(), "driver_name", cd->driver_name->key());
}
if (cd.driver_version) {
set_cell(cr.cells(), "driver_version", *cd.driver_version);
if (cd->driver_version) {
set_cell(cr.cells(), "driver_version", cd->driver_version->key());
}
if (cd.hostname) {
set_cell(cr.cells(), "hostname", *cd.hostname);
if (cd->hostname) {
set_cell(cr.cells(), "hostname", *cd->hostname);
}
if (cd.protocol_version) {
set_cell(cr.cells(), "protocol_version", *cd.protocol_version);
if (cd->protocol_version) {
set_cell(cr.cells(), "protocol_version", *cd->protocol_version);
}
if (cd.ssl_cipher_suite) {
set_cell(cr.cells(), "ssl_cipher_suite", *cd.ssl_cipher_suite);
if (cd->ssl_cipher_suite) {
set_cell(cr.cells(), "ssl_cipher_suite", *cd->ssl_cipher_suite);
}
if (cd.ssl_enabled) {
set_cell(cr.cells(), "ssl_enabled", *cd.ssl_enabled);
if (cd->ssl_enabled) {
set_cell(cr.cells(), "ssl_enabled", *cd->ssl_enabled);
}
if (cd.ssl_protocol) {
set_cell(cr.cells(), "ssl_protocol", *cd.ssl_protocol);
if (cd->ssl_protocol) {
set_cell(cr.cells(), "ssl_protocol", *cd->ssl_protocol);
}
set_cell(cr.cells(), "username", cd.username ? *cd.username : sstring("anonymous"));
if (cd.scheduling_group_name) {
set_cell(cr.cells(), "scheduling_group", *cd.scheduling_group_name);
set_cell(cr.cells(), "username", cd->username ? *cd->username : sstring("anonymous"));
if (cd->scheduling_group_name) {
set_cell(cr.cells(), "scheduling_group", *cd->scheduling_group_name);
}
auto map_type = map_type_impl::get_instance(
utf8_type,
utf8_type,
false
);
auto prepare_client_options = [] (const auto& client_options) {
map_type_impl::native_type tmp;
for (auto& co: client_options) {
auto map_element = std::make_pair(data_value(co.key.key()), data_value(co.value.key()));
tmp.push_back(std::move(map_element));
}
return tmp;
};
set_cell(cr.cells(), "client_options",
make_map_value(map_type, prepare_client_options(cd->client_options)));
co_await result.emit_row(std::move(cr));
}
co_await result.emit_partition_end();

View File

@@ -1 +1 @@
SCYLLA_NODE_EXPORTER_ARGS="--collector.interrupts --no-collector.hwmon --no-collector.bcache --no-collector.btrfs --no-collector.fibrechannel --no-collector.infiniband --no-collector.ipvs --no-collector.nfs --no-collector.nfsd --no-collector.powersupplyclass --no-collector.rapl --no-collector.tapestats --no-collector.thermal_zone --no-collector.udp_queues --no-collector.zfs"
SCYLLA_NODE_EXPORTER_ARGS="--collector.interrupts --collector.ethtool.metrics-include='(bw_in_allowance_exceeded|bw_out_allowance_exceeded|conntrack_allowance_exceeded|conntrack_allowance_available|linklocal_allowance_exceeded)' --collector.ethtool --no-collector.hwmon --no-collector.bcache --no-collector.btrfs --no-collector.fibrechannel --no-collector.infiniband --no-collector.ipvs --no-collector.nfs --no-collector.nfsd --no-collector.powersupplyclass --no-collector.rapl --no-collector.tapestats --no-collector.thermal_zone --no-collector.udp_queues --no-collector.zfs"

View File

@@ -71,7 +71,7 @@ Use "Bash on Ubuntu on Windows" for the same tools and capabilities as on Linux
### Building the Docs
1. Run `make preview` to build the documentation.
1. Run `make preview` in the `docs/` directory to build the documentation.
1. Preview the built documentation locally at http://127.0.0.1:5500/.
### Cleanup

View File

@@ -41,6 +41,8 @@ class MetricsProcessor:
# Get metrics from the file
try:
metrics_file = metrics.get_metrics_from_file(relative_path, "scylla_", metrics_info, strict=strict)
except SystemExit:
pass
finally:
os.chdir(old_cwd)
if metrics_file:

View File

@@ -1,17 +1,17 @@
# Alternator: DynamoDB API in Scylla
# Alternator: DynamoDB API in ScyllaDB
## Introduction
Alternator is a Scylla feature adding compatibility with Amazon DynamoDB(TM).
Alternator is a ScyllaDB feature adding compatibility with Amazon DynamoDB(TM).
DynamoDB's API uses JSON-encoded requests and responses which are sent over
an HTTP or HTTPS transport. It is described in detail in Amazon's [DynamoDB
API Reference](https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/).
Our goal is that any application written to use Amazon DynamoDB could
be run, unmodified, against Scylla with Alternator enabled. Alternator's
be run, unmodified, against ScyllaDB with Alternator enabled. Alternator's
compatibility with DynamoDB is fairly complete, but users should be aware
of some differences and some unimplemented features. The extent of
Alternator's compatibility with DynamoDB is described in the
[Scylla Alternator for DynamoDB users](compatibility.md) document,
[ScyllaDB Alternator for DynamoDB users](compatibility.md) document,
which is updated as the work on Alternator progresses and compatibility
continues to improve.
@@ -19,8 +19,8 @@ Alternator also adds several features and APIs that are not available in
DynamoDB. These are described in [Alternator-specific APIs](new-apis.md).
## Running Alternator
By default, Scylla does not listen for DynamoDB API requests. To enable
this API in Scylla you must set at least two configuration options,
By default, ScyllaDB does not listen for DynamoDB API requests. To enable
this API in ScyllaDB you must set at least two configuration options,
**alternator_port** and **alternator_write_isolation**. For example in the
YAML configuration file:
```yaml
@@ -30,7 +30,7 @@ alternator_write_isolation: only_rmw_uses_lwt # or always, forbid or unsafe
or, equivalently, via command-line arguments: `--alternator-port=8000
--alternator-write-isolation=only_rmw_uses_lwt.
the **alternator_port** option determines on which port Scylla listens for
the **alternator_port** option determines on which port ScyllaDB listens for
DynamoDB API requests. By default, it listens on this port on all network
interfaces. To listen only on a specific interface, configure also the
**alternator_address** option.
@@ -41,12 +41,12 @@ Alternator has four different choices
for the implementation of writes, each with different advantages. You should
carefully consider which of the options makes more sense for your intended
use case and configure alternator_write_isolation accordingly. There is
currently no default for this option: Trying to run Scylla with an Alternator
currently no default for this option: Trying to run ScyllaDB with an Alternator
port selected but without configuring write isolation will result in an error message,
asking you to set it.
In addition to (or instead of) serving HTTP requests on alternator_port,
Scylla can accept DynamoDB API requests over HTTPS (encrypted), on the port
ScyllaDB can accept DynamoDB API requests over HTTPS (encrypted), on the port
specified by **alternator_https_port**. As usual for HTTPS servers, the
operator must specify certificate and key files. By default these should
be placed in `/etc/scylla/scylla.crt` and `/etc/scylla/scylla.key`, but
@@ -54,7 +54,7 @@ these default locations can overridden by specifying
`--alternator-encryption-options keyfile="..."` and
`--alternator-encryption-options certificate="..."`.
By default, Scylla saves a snapshot of deleted tables. But Alternator does
By default, ScyllaDB saves a snapshot of deleted tables. But Alternator does
not offer an API to restore these snapshots, so these snapshots are not useful
and waste disk space - deleting a table does not recover any disk space.
It is therefore recommended to disable this automatic-snapshotting feature
@@ -73,11 +73,11 @@ itself. Instructions, code and examples for doing this can be found in the
This section provides only a very brief introduction to Alternator's
design. A much more detailed document about the features of the DynamoDB
API and how they are, or could be, implemented in Scylla can be found in:
API and how they are, or could be, implemented in ScyllaDB can be found in:
<https://docs.google.com/document/d/1i4yjF5OSAazAY_-T8CBce9-2ykW4twx_E_Nt2zDoOVs>
Almost all of Alternator's source code (except some initialization code)
can be found in the alternator/ subdirectory of Scylla's source code.
can be found in the alternator/ subdirectory of ScyllaDB's source code.
Extensive functional tests can be found in the test/alternator
subdirectory. These tests are written in Python, and can be run against
both Alternator and Amazon's DynamoDB; This allows verifying that
@@ -85,15 +85,15 @@ Alternator's behavior matches the one observed on DynamoDB.
See test/alternator/README.md for more information about the tests and
how to run them.
With Alternator enabled on port 8000 (for example), every Scylla node
With Alternator enabled on port 8000 (for example), every ScyllaDB node
listens for DynamoDB API requests on this port. These requests, in
JSON format over HTTP, are parsed and result in calls to internal Scylla
C++ functions - there is no CQL generation or parsing involved.
In Scylla terminology, the node receiving the request acts as the
In ScyllaDB terminology, the node receiving the request acts as the
*coordinator*, and often passes the request on to one or more other nodes -
*replicas* which hold copies of the requested data.
Alternator tables are stored as Scylla tables, each in a separate keyspace.
Alternator tables are stored as ScyllaDB tables, each in a separate keyspace.
Each keyspace is initialized when the corresponding Alternator table is
created (with a CreateTable request). The replication factor (RF) for this
keyspace is chosen at that point, depending on the size of the cluster:
@@ -101,19 +101,19 @@ RF=3 is used on clusters with three or more nodes, and RF=1 is used for
smaller clusters. Such smaller clusters are, of course, only recommended
for tests because of the risk of data loss.
Each table in Alternator is stored as a Scylla table in a separate
Each table in Alternator is stored as a ScyllaDB table in a separate
keyspace. The DynamoDB key columns (hash and sort key) have known types,
and become partition and clustering key columns of the Scylla table.
and become partition and clustering key columns of the ScyllaDB table.
All other attributes may be different for each row, so are stored in one
map column in Scylla, and not as separate columns.
map column in ScyllaDB, and not as separate columns.
DynamoDB supports two consistency levels for reads, "eventual consistency"
and "strong consistency". These two modes are implemented using Scylla's CL
and "strong consistency". These two modes are implemented using ScyllaDB's CL
(consistency level) feature: All writes are done using the `LOCAL_QUORUM`
consistency level, then strongly-consistent reads are done with
`LOCAL_QUORUM`, while eventually-consistent reads are with just `LOCAL_ONE`.
In Scylla (and its inspiration, Cassandra), high write performance is
In ScyllaDB (and its inspiration, Cassandra), high write performance is
achieved by ensuring that writes do not require reads from disk.
The DynamoDB API, however, provides many types of requests that need a read
before the write (a.k.a. RMW requests - read-modify-write). For example,
@@ -121,7 +121,7 @@ a request may copy an existing attribute, increment an attribute,
be conditional on some expression involving existing values of attribute,
or request that the previous values of attributes be returned. These
read-modify-write transactions should be _isolated_ from each other, so
by default Alternator implements every write operation using Scylla's
by default Alternator implements every write operation using ScyllaDB's
LWT (lightweight transactions). This default can be overridden on a per-table
basis, by tagging the table as explained above in the "write isolation
policies" section.

View File

@@ -1,6 +1,6 @@
# ScyllaDB Alternator for DynamoDB users
Scylla supports the DynamoDB API (this feature is codenamed "Alternator").
ScyllaDB supports the DynamoDB API (this feature is codenamed "Alternator").
Our goal is to support any application written for Amazon DynamoDB.
Nevertheless, there are a few differences between DynamoDB and Scylla, and
and a few DynamoDB features that have not yet been implemented in Scylla.
@@ -8,16 +8,16 @@ The purpose of this document is to inform users of these differences.
## Provisioning
The most obvious difference between DynamoDB and Scylla is that while
DynamoDB is a shared cloud service, Scylla is a dedicated service running
The most obvious difference between DynamoDB and ScyllaDB is that while
DynamoDB is a shared cloud service, ScyllaDB is a dedicated service running
on your private cluster. Whereas DynamoDB allows you to "provision" the
number of requests per second you'll need - or at an extra cost not even
provision that - Scylla requires you to provision your cluster. You need
provision that - ScyllaDB requires you to provision your cluster. You need
to reason about the number and size of your nodes - not the throughput.
Moreover, DynamoDB's per-table provisioning (`BillingMode=PROVISIONED`) is
not yet supported by Scylla. The BillingMode and ProvisionedThroughput options
on a table need to be valid but are ignored, and Scylla behaves like DynamoDB's
on a table need to be valid but are ignored, and ScyllaDB behaves like DynamoDB's
`BillingMode=PAY_PER_REQUEST`: All requests are accepted without a per-table
throughput cap.
@@ -33,7 +33,7 @@ Instructions for doing this can be found in:
## Write isolation policies
Scylla was designed to optimize the performance of pure write operations -
ScyllaDB was designed to optimize the performance of pure write operations -
writes which do not need to read the previous value of the item.
In CQL, writes which do need the previous value of the item must explicitly
use the slower LWT ("LightWeight Transaction") feature to be correctly
@@ -79,11 +79,11 @@ a _higher_ timestamp - and this will be the "last write" that wins.
To avoid or mitigate this write reordering issue, users may consider
one or more of the following:
1. Use NTP to keep the clocks on the different Scylla nodes synchronized.
1. Use NTP to keep the clocks on the different ScyllaDB nodes synchronized.
If the delay between the two writes is longer than NTP's accuracy,
they will not be reordered.
2. If an application wants to ensure that two specific writes are not
reordered, it should send both requests to the same Scylla node.
reordered, it should send both requests to the same ScyllaDB node.
Care should be taken when using a load balancer - which might redirect
two requests to two different nodes.
3. Consider using the `always_use_lwt` write isolation policy.
@@ -210,7 +210,7 @@ CREATE SERVICE_LEVEL IF NOT EXISTS oltp WITH SHARES = 1000;
ATTACH SERVICE_LEVEL olap TO alice;
ATTACH SERVICE_LEVEL oltp TO bob;
```
Note that `alternator_enforce_authorization` has to be enabled in Scylla configuration.
Note that `alternator_enforce_authorization` has to be enabled in ScyllaDB configuration.
See [Authorization](##Authorization) section to learn more about roles and authorization.
See [Workload Prioritization](../features/workload-prioritization)
@@ -218,11 +218,11 @@ to read about Workload Prioritization in detail.
## Metrics
Scylla has an advanced and extensive monitoring framework for inspecting
and graphing hundreds of different metrics of Scylla's usage and performance.
Scylla's monitoring stack, based on Grafana and Prometheus, is described in
ScyllaDB has an advanced and extensive monitoring framework for inspecting
and graphing hundreds of different metrics of ScyllaDB's usage and performance.
ScyllaDB's monitoring stack, based on Grafana and Prometheus, is described in
<https://docs.scylladb.com/operating-scylla/monitoring/>.
This monitoring stack is different from DynamoDB's offering - but Scylla's
This monitoring stack is different from DynamoDB's offering - but ScyllaDB's
is significantly more powerful and gives the user better insights on
the internals of the database and its performance.
@@ -248,7 +248,7 @@ data in different partition order. Applications mustn't rely on that
undocumented order.
Note that inside each partition, the individual items will be sorted the same
in DynamoDB and Scylla - determined by the _sort key_ defined for that table.
in DynamoDB and ScyllaDB - determined by the _sort key_ defined for that table.
---
@@ -274,7 +274,7 @@ is different, or can be configured in Alternator:
## Experimental API features
Some DynamoDB API features are supported by Alternator, but considered
**experimental** in this release. An experimental feature in Scylla is a
**experimental** in this release. An experimental feature in ScyllaDB is a
feature whose functionality is complete, or mostly complete, but it is not
as thoroughly tested or optimized as regular features. Also, an experimental
feature's implementation is still subject to change and upgrades may not be
@@ -351,8 +351,8 @@ they should be easy to detect. Here is a list of these unimplemented features:
* The on-demand backup APIs are not supported: CreateBackup, DescribeBackup,
DeleteBackup, ListBackups, RestoreTableFromBackup.
For now, users can use Scylla's existing backup solutions such as snapshots
or Scylla Manager.
For now, users can use ScyllaDB's existing backup solutions such as snapshots
or ScyllaDB Manager.
<https://github.com/scylladb/scylla/issues/5063>
* Continuous backup (the ability to restore any point in time) is also not
@@ -370,7 +370,7 @@ they should be easy to detect. Here is a list of these unimplemented features:
<https://github.com/scylladb/scylla/issues/5068>
* DAX (DynamoDB Accelerator), an in-memory cache for DynamoDB, is not
available in for Alternator. Anyway, it should not be necessary - Scylla's
available in for Alternator. Anyway, it should not be necessary - ScyllaDB's
internal cache is already rather advanced and there is no need to place
another cache in front of the it. We wrote more about this here:
<https://www.scylladb.com/2017/07/31/database-caches-not-good/>
@@ -384,7 +384,7 @@ they should be easy to detect. Here is a list of these unimplemented features:
* The PartiQL syntax (SQL-like SELECT/UPDATE/INSERT/DELETE expressions)
and the operations ExecuteStatement, BatchExecuteStatement and
ExecuteTransaction are not yet supported.
A user that is interested in an SQL-like syntax can consider using Scylla's
A user that is interested in an SQL-like syntax can consider using ScyllaDB's
CQL protocol instead.
This feature was added to DynamoDB in November 2020.
<https://github.com/scylladb/scylla/issues/8787>
@@ -393,7 +393,7 @@ they should be easy to detect. Here is a list of these unimplemented features:
which is different from AWS's. In particular, the operations
DescribeContributorInsights, ListContributorInsights and
UpdateContributorInsights that configure Amazon's "CloudWatch Contributor
Insights" are not yet supported. Scylla has different ways to retrieve the
Insights" are not yet supported. ScyllaDB has different ways to retrieve the
same information, such as which items were accessed most often.
<https://github.com/scylladb/scylla/issues/8788>

View File

@@ -11,7 +11,7 @@ This section will guide you through the steps for setting up the cluster:
<https://hub.docker.com/r/scylladb/scylla/>, but add to every `docker run`
command a `-p 8000:8000` before the image name and
`--alternator-port=8000 --alternator-write-isolation=always` at the end.
The "alternator-port" option specifies on which port Scylla will listen for
The "alternator-port" option specifies on which port ScyllaDB will listen for
the (unencrypted) DynamoDB API, and the "alternator-write-isolation" chooses
whether or not Alternator will use LWT for every write.
For example,
@@ -24,10 +24,10 @@ This section will guide you through the steps for setting up the cluster:
By default, ScyllaDB run in this way will not have authentication or
authorization enabled, and any DynamoDB API request will be honored without
requiring them to be signed appropriately. See the
[Scylla Alternator for DynamoDB users](compatibility.md#authentication-and-authorization)
[ScyllaDB Alternator for DynamoDB users](compatibility.md#authentication-and-authorization)
document on how to configure authentication and authorization.
## Testing Scylla's DynamoDB API support:
## Testing ScyllaDB's DynamoDB API support:
### Running AWS Tic Tac Toe demo app to test the cluster:
1. Follow the instructions on the [AWS github page](https://github.com/awsdocs/amazon-dynamodb-developer-guide/blob/master/doc_source/TicTacToe.Phase1.md)
2. Enjoy your tic-tac-toe game :-)

View File

@@ -2,9 +2,9 @@
Alternator's primary goal is to be compatible with Amazon DynamoDB(TM)
and its APIs, so that any application written to use Amazon DynamoDB could
be run, unmodified, against Scylla with Alternator enabled. The extent of
be run, unmodified, against ScyllaDB with Alternator enabled. The extent of
Alternator's compatibility with DynamoDB is described in the
[Scylla Alternator for DynamoDB users](compatibility.md) document.
[ScyllaDB Alternator for DynamoDB users](compatibility.md) document.
But Alternator also adds several features and APIs that are not available in
DynamoDB. These Alternator-specific APIs are documented here.
@@ -15,7 +15,7 @@ _conditional_ update or an update based on the old value of an attribute.
The read and the write should be treated as a single transaction - protected
(_isolated_) from other parallel writes to the same item.
Alternator could do this isolation by using Scylla's LWT (lightweight
Alternator could do this isolation by using ScyllaDB's LWT (lightweight
transactions) for every write operation, but this significantly slows
down writes, and not necessary for workloads which don't use read-modify-write
(RMW) updates.
@@ -41,7 +41,7 @@ isolation policy for a specific table can be overridden by tagging the table
which need a read before the write. An attempt to use such statements
(e.g., UpdateItem with a ConditionExpression) will result in an error.
In this mode, the remaining write requests which are allowed - pure writes
without a read - are performed using standard Scylla writes, not LWT,
without a read - are performed using standard ScyllaDB writes, not LWT,
so they are significantly faster than they would have been in the
`always_use_lwt`, but their isolation is still correct.
@@ -65,19 +65,19 @@ isolation policy for a specific table can be overridden by tagging the table
read-modify-write updates. This mode is not recommended for any use case,
and will likely be removed in the future.
## Accessing system tables from Scylla
Scylla exposes lots of useful information via its internal system tables,
## Accessing system tables from ScyllaDB
ScyllaDB exposes lots of useful information via its internal system tables,
which can be found in system keyspaces: 'system', 'system\_auth', etc.
In order to access to these tables via alternator interface,
Scan and Query requests can use a special table name:
`.scylla.alternator.KEYSPACE_NAME.TABLE_NAME`
which will return results fetched from corresponding Scylla table.
which will return results fetched from corresponding ScyllaDB table.
This interface can be used only to fetch data from system tables.
Attempts to read regular tables via the virtual interface will result
in an error.
Example: in order to query the contents of Scylla's `system.large_rows`,
Example: in order to query the contents of ScyllaDB's `system.large_rows`,
pass `TableName='.scylla.alternator.system.large_rows'` to a Query/Scan
request.
@@ -113,14 +113,14 @@ connection (either active or idle), not necessarily an active request as
in Alternator.
## Service discovery
As explained in [Scylla Alternator for DynamoDB users](compatibility.md),
As explained in [ScyllaDB Alternator for DynamoDB users](compatibility.md),
Alternator requires a load-balancer or a client-side load-balancing library
to distribute requests between all Scylla nodes. This load-balancer needs
to be able to _discover_ the Scylla nodes. Alternator provides two special
to distribute requests between all ScyllaDB nodes. This load-balancer needs
to be able to _discover_ the ScyllaDB nodes. Alternator provides two special
requests, `/` and `/localnodes`, to help with this service discovery, which
we will now explain.
Some setups know exactly which Scylla nodes were brought up, so all that
Some setups know exactly which ScyllaDB nodes were brought up, so all that
remains is to periodically verify that each node is still functional. The
easiest way to do this is to make an HTTP (or HTTPS) GET request to the node,
with URL `/`. This is a trivial GET request and does **not** need to be
@@ -133,10 +133,10 @@ $ curl http://localhost:8000/
healthy: localhost:8000
```
In other setups, the load balancer might not know which Scylla nodes exist.
For example, it may be possible to add or remove Scylla nodes without a
In other setups, the load balancer might not know which ScyllaDB nodes exist.
For example, it may be possible to add or remove ScyllaDB nodes without a
client-side load balancer knowing. For these setups we have the `/localnodes`
request that can be used to discover which Scylla nodes exist: A load balancer
request that can be used to discover which ScyllaDB nodes exist: A load balancer
that already knows at least one live node can discover the rest by sending
a `/localnodes` request to the known node. It's again an unauthenticated
HTTP (or HTTPS) GET request:
@@ -160,7 +160,7 @@ list the nodes in a specific _data center_ or _rack_. These options are
useful for certain use cases:
* A `dc` option (e.g., `/localnodes?dc=dc1`) can be passed to list the
nodes in a specific Scylla data center, not the data center of the node
nodes in a specific ScyllaDB data center, not the data center of the node
being contacted. This is useful when a client knowns of _some_ Scylla
node belonging to an unknown DC, but wants to list the nodes in _its_
DC, which it knows by name.
@@ -191,7 +191,7 @@ tells them to.
If you want to influence whether a specific Alternator table is created with tablets or vnodes,
you can do this by specifying the `system:initial_tablets` tag
(in earlier versions of Scylla the tag was `experimental:initial_tablets`)
(in earlier versions of ScyllaDB the tag was `experimental:initial_tablets`)
in the CreateTable operation. The value of this tag can be:
* Any valid integer as the value of this tag enables tablets.

View File

@@ -365,7 +365,7 @@ Modifying a keyspace with tablets enabled is possible and doesn't require any sp
- The replication factor (RF) can be increased or decreased by at most 1 at a time. To reach the desired RF value, modify the RF repeatedly.
- The ``ALTER`` statement rejects the ``replication_factor`` tag. List the DCs explicitly when altering a keyspace. See :ref:`NetworkTopologyStrategy <replication-strategy>`.
- If there's any other ongoing global topology operation, executing the ``ALTER`` statement will fail (with an explicit and specific error) and needs to be repeated.
- An RF change cannot be requested while another RF change is pending for the same keyspace. Attempting to execute an ``ALTER`` statement in this scenario will fail with an explicit error. Wait for the ongoing RF change to complete before issuing another ``ALTER`` statement.
- The ``ALTER`` statement may take longer than the regular query timeout, and even if it times out, it will continue to execute in the background.
- The replication strategy cannot be modified, as keyspaces with tablets only support ``NetworkTopologyStrategy``.
- The ``ALTER`` statement will fail if it would make the keyspace :term:`RF-rack-invalid <RF-rack-valid keyspace>`.
@@ -1043,6 +1043,8 @@ The following modes are available:
* - ``immediate``
- Tombstone GC is immediately performed. There is no wait time or repair requirement. This mode is useful for a table that uses the TWCS compaction strategy with no user deletes. After data is expired after TTL, ScyllaDB can perform compaction to drop the expired data immediately.
.. warning:: The ``repair`` mode is not supported for :term:`Colocated Tables <Colocated Table>` in this version.
.. _cql-per-table-tablet-options:
Per-table tablet options

View File

@@ -102,6 +102,7 @@ Additional Information
To learn more about TTL, and see a hands-on example, check out `this lesson <https://university.scylladb.com/courses/data-modeling/lessons/advanced-data-modeling/topic/expiring-data-with-ttl-time-to-live/>`_ on ScyllaDB University.
* `Video: Managing data expiration with Time-To-Live <https://www.youtube.com/watch?v=SXkbu7mFHeA>`_
* :doc:`Apache Cassandra Query Language (CQL) Reference </cql/index>`
* :doc:`KB Article:How to Change gc_grace_seconds for a Table </kb/gc-grace-seconds/>`
* :doc:`KB Article:Time to Live (TTL) and Compaction </kb/ttl-facts/>`

View File

@@ -1,6 +1,6 @@
# Introduction
Similar to the approach described in CASSANDRA-14471, we add the
Similar to the approach described in CASSANDRA-12151, we add the
concept of an audit specification. An audit has a target (syslog or a
table) and a set of events/actions that it wants recorded. We
introduce new CQL syntax for Scylla users to describe and manipulate

View File

@@ -2,8 +2,11 @@
## What is ScyllaDB?
ScyllaDB is a high-performance NoSQL database system, fully compatible with Apache Cassandra.
ScyllaDB is released under the GNU Affero General Public License version 3 and the Apache License, ScyllaDB is free and open-source software.
ScyllaDB is a high-performance NoSQL database optimized for speed and scalability.
It is designed to efficiently handle large volumes of data with minimal latency,
making it ideal for data-intensive applications.
ScyllaDB is distributed under the [ScyllaDB Source Available License](https://github.com/scylladb/scylladb/blob/master/LICENSE-ScyllaDB-Source-Available.md).
> [ScyllaDB](http://www.scylladb.com/)

View File

@@ -20,9 +20,7 @@ command line option when launchgin scylla.
You can define endpoint details in the `scylla.yaml` file. For example:
```yaml
object_storage_endpoints:
- name: s3.us-east-1.amazonaws.com
port: 443
https: true
- name: https://s3.us-east-1.amazonaws.com:443
aws_region: us-east-1
```
@@ -78,9 +76,7 @@ The examples above are intended for development or local environments. You shoul
For the EC2 Instance Metadata Service to function correctly, no additional configuration is required. However, STS requires the IAM Role ARN to be defined in the `scylla.yaml` file, as shown below:
```yaml
object_storage_endpoints:
- name: s3.us-east-1.amazonaws.com
port: 443
https: true
- name: https://s3.us-east-1.amazonaws.com:443
aws_region: us-east-1
iam_role_arn: arn:aws:iam::123456789012:instance-profile/my-instance-instance-profile
```
@@ -100,9 +96,7 @@ in `scylla.yaml`:
```yaml
object_storage_endpoints:
- name: s3.us-east-2.amazonaws.com
port: 443
https: true
- name: https://s3.us-east-2.amazonaws.com:443
aws_region: us-east-2
```

View File

@@ -74,6 +74,8 @@ The keys and values are:
as an indicator to which shard client wants to connect. The desired shard number
is calculated as: `desired_shard_no = client_port % SCYLLA_NR_SHARDS`.
Its value is a decimal representation of type `uint16_t`, by default `19142`.
- `CLIENT_OPTIONS` is a string containing a JSON object representation that
contains CQL Driver configuration, e.g. load balancing policy, retry policy, timeouts, etc.
Currently, one `SCYLLA_SHARDING_ALGORITHM` is defined,
`biased-token-round-robin`. To apply the algorithm,
@@ -236,3 +238,26 @@ the same mechanism for other protocol versions, such as CQLv4.
The feature is identified by the `SCYLLA_USE_METADATA_ID` key, which is meant to be sent
in the SUPPORTED message.
## Sending the CLIENT_ROUTES_CHANGE event
This extension allows a driver to update its connections when the
`system.client_routes` table is modified.
In some network topologies a specific mapping of addresses and ports is required (e.g.
to support Private Link). This mapping can change dynamically even when no nodes are
added or removed. The driver must adapt to those changes; otherwise connectivity can be
lost.
The extension is implemented as a new `EVENT` type: `CLIENT_ROUTES_CHANGE`. The event
body consists of:
- [string] change
- [string list] connection_ids
- [string list] host_ids
There is only one change value: `UPDATE_NODES`, which means at least one client route
was inserted, updated, or deleted.
Events already have a subscription mechanism similar to protocol extensions (that is,
the driver only receives the events it explicitly subscribed to), so no additional
`cql_protocol_extension` key is introduced for this feature.

View File

@@ -86,6 +86,7 @@ stateDiagram-v2
de_left_token_ring --> [*]
}
state removing {
re_left_token_ring : left_token_ring
re_tablet_draining : tablet_draining
re_tablet_migration : tablet_migration
re_write_both_read_old : write_both_read_old
@@ -98,7 +99,8 @@ stateDiagram-v2
re_tablet_draining --> re_write_both_read_old
re_write_both_read_old --> re_write_both_read_new: streaming completed
re_write_both_read_old --> re_rollback_to_normal: rollback
re_write_both_read_new --> [*]
re_write_both_read_new --> re_left_token_ring
re_left_token_ring --> [*]
}
rebuilding --> normal: streaming completed
decommissioning --> left: operation succeeded
@@ -122,9 +124,10 @@ Note that these are not all states, as there are other states specific to tablet
Writes to vnodes-based tables are going to both new and old replicas (new replicas means calculated according
to modified token ring), reads are using old replicas.
- `write_both_read_new` - as above, but reads are using new replicas.
- `left_token_ring` - the decommissioning node left the token ring, but we still need to wait until other
nodes observe it and stop sending writes to this node. Then, we tell the node to shut down and remove
it from group 0. We also use this state to rollback a failed bootstrap or decommission.
- `left_token_ring` - the decommissioning or removing node left the token ring, but we still need to wait until other
nodes observe it and stop sending writes to this node. For decommission, we tell the node to shut down,
then remove it from group 0. For removenode, the node is already down, so we skip the shutdown step.
We also use this state to rollback a failed bootstrap or decommission.
- `rollback_to_normal` - the decommission or removenode operation failed. Rollback the operation by
moving the node we tried to decommission/remove back to the normal state.
- `lock` - the topology stays in this state until externally changed (to null state), preventing topology
@@ -141,7 +144,9 @@ reads that started before this point exist in the system. Finally we remove the
transitioning state.
Decommission, removenode and replace work similarly, except they don't go through
`commit_cdc_generation`.
`commit_cdc_generation`. Both decommission and removenode go through the
`left_token_ring` state to run a global barrier ensuring all nodes are aware
of the topology change before the operation completes.
The state machine may also go only through the `commit_cdc_generation` state
after getting a request from the user to create a new CDC generation if the

View File

@@ -41,12 +41,12 @@ Unless the task was aborted, the worker will eventually reply that the task was
it temporarily saves list of ids of finished tasks and removes those tasks from group0 state (pernamently marking them as finished) in 200ms intervals. (*)
This batching of removing finished tasks is done in order to reduce number of generated group0 operations.
On the other hand, view buildind tasks can can also be aborted due to 2 main reasons:
On the other hand, view building tasks can can also be aborted due to 2 main reasons:
- a keyspace/view was dropped
- tablet operations (see [tablet operations section](#tablet-operations))
In the first case we simply delete relevant view building tasks as they are no longer needed.
But if a task needs to be aborted due to tablet operation, we're firstly setting the `aborted` flag to true. We need to do this because we need the task informations
to created a new adjusted tasks (if the operation succeeded) or rollback them (if the operation failed).
But if a task needs to be aborted due to tablet operation, we're firstly setting the `aborted` flag to true. We need to do this because we need the task information
to create new adjusted tasks (if the operation succeeded) or rollback them (if the operation failed).
Once a task is aborted by setting the flag, this cannot be revoked, so rolling back a task means creating its duplicate and removing the original task.
(*) - Because there is a time gap between when the coordinator learns that a task is finished (from the RPC response) and when the task is marked as completed,

View File

@@ -29,9 +29,6 @@ A CDC generation consists of:
This is the mapping used to decide on which stream IDs to use when making writes, as explained in the :doc:`./cdc-streams` document. It is a global property of the cluster: it doesn't depend on the table you're making writes to.
.. caution::
The tables mentioned in the following sections: ``system_distributed.cdc_generation_timestamps`` and ``system_distributed.cdc_streams_descriptions_v2`` have been introduced in ScyllaDB 4.4. It is highly recommended to upgrade to 4.4 for efficient CDC usage. The last section explains how to run the below examples in ScyllaDB 4.3.
When CDC generations change
---------------------------

View File

@@ -28,7 +28,8 @@ Incremental Repair is only supported for tables that use the tablets architectur
Incremental Repair Modes
------------------------
While incremental repair is the default and recommended mode, you can control its behavior for a given repair operation using the ``incremental_mode`` parameter. This is useful for situations where you might need to force a full data validation.
Incremental is currently disabled by default. You can control its behavior for a given repair operation using the ``incremental_mode`` parameter.
This is useful for enabling incremental repair, or in situations where you might need to force a full data validation.
The available modes are:

View File

@@ -17,6 +17,7 @@ This document highlights ScyllaDB's key data modeling features.
Workload Prioritization </features/workload-prioritization>
Backup and Restore </features/backup-and-restore>
Incremental Repair </features/incremental-repair/>
Vector Search </features/vector-search/>
.. panel-box::
:title: ScyllaDB Features
@@ -43,3 +44,5 @@ This document highlights ScyllaDB's key data modeling features.
* :doc:`Incremental Repair </features/incremental-repair/>` provides a much more
efficient and lightweight approach to maintaining data consistency by
repairing only the data that has changed since the last repair.
* :doc:`Vector Search in ScyllaDB </features/vector-search/>` enables
similarity-based queries on vector embeddings.

View File

@@ -0,0 +1,55 @@
=================================
Vector Search in ScyllaDB
=================================
.. note::
This feature is currently available only in `ScyllaDB Cloud <https://cloud.docs.scylladb.com/>`_.
What Is Vector Search
-------------------------
Vector Search enables similarity-based queries over high-dimensional data,
such as text, images, audio, or user behavior. Instead of searching for exact
matches, it allows applications to find items that are semantically similar to
a given input.
To do this, Vector Search works on vector embeddings, which are numerical
representations of data that capture semantic meaning. This enables queries
such as:
* “Find documents similar to this paragraph”
* “Find products similar to what the user just viewed”
* “Find previous tickets related to this support request”
Rather than relying on exact values or keywords, Vector Search returns results
based on distance or similarity between vectors. This capability is
increasingly used in modern workloads such as AI-powered search, recommendation
systems, and retrieval-augmented generation (RAG).
Why Vector Search Matters
------------------------------------
Many applications already rely on ScyllaDB for high throughput, low and
predictable latency, and large-scale data storage.
Vector Search complements these strengths by enabling new classes of workloads,
including:
* Semantic search over text or documents
* Recommendations based on user or item similarity
* AI and ML applications, including RAG pipelines
* Anomaly and pattern detection
With Vector Search, ScyllaDB can serve as the similarity search backend for
AI-driven applications.
Availability
--------------
Vector Search is currently available only in ScyllaDB Cloud, the fully managed
ScyllaDB service.
👉 For details on using Vector Search, refer to the
`ScyllaDB Cloud documentation <https://cloud.docs.scylladb.com/stable/vector-search/index.html>`_.

View File

@@ -20,7 +20,10 @@ You can run your ScyllaDB workloads on AWS, GCE, and Azure using a ScyllaDB imag
Amazon Web Services (AWS)
-----------------------------
The recommended instance types are :ref:`i3en <system-requirements-i3en-instances>`, :ref:`i4i <system-requirements-i4i-instances>`, :ref:`i7i <system-requirements-i7i-instances>`, and :ref:`i7ie <system-requirements-i7ie-instances>`.
The recommended instance types are :ref:`i3en <system-requirements-i3en-instances>`,
:ref:`i4i <system-requirements-i4i-instances>`, :ref:`i7i <system-requirements-i7i-instances>`,
:ref:`i7ie <system-requirements-i7ie-instances>`, :ref:`i8g<system-requirements-i8g-instances>`,
and :ref:`i8ge <system-requirements-i8ge-instances>`.
.. note::
@@ -195,6 +198,118 @@ All i7i instances have the following specs:
See `Amazon EC2 I7i Instances <https://aws.amazon.com/ec2/instance-types/i7i/>`_ for details.
.. _system-requirements-i8g-instances:
i8g instances
^^^^^^^^^^^^^^
The following i8g instances are supported.
.. list-table::
:widths: 30 20 20 30
:header-rows: 1
* - Model
- vCPU
- Mem (GiB)
- Storage (GB)
* - i8g.large
- 2
- 16
- 1 x 468 GB
* - i8g.xlarge
- 4
- 32
- 1 x 937 GB
* - i8g.2xlarge
- 8
- 64
- 1 x 1,875 GB
* - i8g.4xlarge
- 16
- 128
- 1 x 3,750 GB
* - i8g.8xlarge
- 32
- 256
- 2 x 3,750 GB
* - i8g.12xlarge
- 48
- 384
- 3 x 3,750 GB
* - i8g.16xlarge
- 64
- 512
- 4 x 3,750 GB
All i8g instances have the following specs:
* Powered by AWS Graviton4 processors
* 3rd generation AWS Nitro SSD storage
* DDR5-5600 memory for improved throughput
* Up to 100 Gbps of networking bandwidth and up to 60 Gbps of bandwidth to
Amazon Elastic Block Store (EBS)
* Instance sizes offer up to 45 TB of total local NVMe instance storage
See `Amazon EC2 I8g Instances <https://aws.amazon.com/ec2/instance-types/i8g/>`_ for details.
.. _system-requirements-i8ge-instances:
i8ge instances
^^^^^^^^^^^^^^
The following i8ge instances are supported.
.. list-table::
:widths: 30 20 20 30
:header-rows: 1
* - Model
- vCPU
- Mem (GiB)
- Storage (GB)
* - i8ge.large
- 2
- 16
- 1 x 1,250 GB
* - i8ge.xlarge
- 4
- 32
- 1 x 2,500 GB
* - i8ge.2xlarge
- 8
- 64
- 2 x 2,500 GB
* - i8ge.3xlarge
- 12
- 96
- 1 x 7,500 GB
* - i8ge.6xlarge
- 24
- 192
- 2 x 7,500 GB
* - i8ge.12xlarge
- 48
- 384
- 4 x 7,500 GB
* - i8ge.18xlarge
- 72
- 576
- 6 x 7,500 GB
All i8ge instances have the following specs:
* Powered by AWS Graviton4 processors
* 3rd generation AWS Nitro SSD storage
* DDR5-5600 memory for improved throughput
* Up to 300 Gbps of networking bandwidth and up to 60 Gbps of bandwidth to
Amazon Elastic Block Store (EBS)
* Instance sizes offer up to 120 TB of total local NVMe instance storage
See `Amazon EC2 I8g Instances <https://aws.amazon.com/ec2/instance-types/i8g/>`_ for details.
Im4gn and Is4gen instances
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ScyllaDB supports Arm-based Im4gn and Is4gen instances. See `Amazon EC2 Im4gn and Is4gen instances <https://aws.amazon.com/ec2/instance-types/i4g/>`_ for specification details.

View File

@@ -25,8 +25,7 @@ Getting Started
:id: "getting-started"
:class: my-panel
* `Install ScyllaDB (Binary Packages, Docker, or EC2) <https://www.scylladb.com/download/#core>`_ - Links to the ScyllaDB Download Center
* :doc:`Install ScyllaDB </getting-started/install-scylla/index/>`
* :doc:`Configure ScyllaDB </getting-started/system-configuration/>`
* :doc:`Run ScyllaDB in a Shared Environment </getting-started/scylla-in-a-shared-environment>`
* :doc:`Create a ScyllaDB Cluster - Single Data Center (DC) </operating-scylla/procedures/cluster-management/create-cluster/>`

Some files were not shown because too many files have changed in this diff Show More