Compare commits

...

330 Commits

Author SHA1 Message Date
Pavel Emelyanov
0e6381f14d lister: Fix race between readdir and stat
Sometimes file::list_directory() returns entries without type set. In
thase case lister calls file_type() on the entry name to get it. In case
the call returns disengated type, the code assumes that some error
occurred and resolves into exception.

That's not correct. The file_type() method returns disengated type only
if the file being inspected is missing (i.e. on ENOENT errno). But this
can validly happen if a file is removed bettween readdir and stat. In
that case it's not "some error happened", but a enry should be just
skipped. In "some error happened", then file_type() would resolve into
exceptional future on its own.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26595

(cherry picked from commit d9bfbeda9a)

Closes scylladb/scylladb#26764
2025-10-29 11:34:47 +02:00
Anna Stuchlik
e154b18786 doc: add --list-active-releases to Web Installer
Fixes https://github.com/scylladb/scylladb/issues/26688

V2 of https://github.com/scylladb/scylladb/pull/26687

Closes scylladb/scylladb#26689

(cherry picked from commit bd5b966208)

Closes scylladb/scylladb#26760
2025-10-29 11:33:55 +02:00
Patryk Jędrzejczak
fd1e1d506d test: test_raft_recovery_stuck: reconnect driver after rolling restarts
It turns out that #21477 wasn't sufficient to fix the issue. The driver
may still decide to reconnect the connection after `rolling_restart`
returns. One possible explanation is that the driver sometimes handles
the DOWN notification after all nodes consider each other UP.

Reconnecting the driver after restarting nodes seems to be a reliable
workaround that many tests use. We also use it here.

Fixes #19959

Closes scylladb/scylladb#26638

(cherry picked from commit 5321720853)

Closes scylladb/scylladb#26758
2025-10-29 11:33:06 +02:00
Anna Stuchlik
0905de5668 doc: add support for Debian 12
Fixes https://github.com/scylladb/scylladb/issues/26640

Closes scylladb/scylladb#26668

(cherry picked from commit 9c0ff7c46b)

Closes scylladb/scylladb#26679
2025-10-29 11:32:21 +02:00
Pavel Emelyanov
dbf0ec460d Update seastar submodule
* seastar 26badcb14...4431d974f (1):
  > Merge '[Backport 2025.3] all commits required for enabling i7i support' from Robert Bindar
    split random io buffer size in 2 options
    Fix hang in io_queue for big write ioproperties numbers
    Fix incorrect defaults for io queue iops/bandwidth
    iotune: fix very long warm up duration on systems with high cpu count
    Add iotune --get-best-iops-with-buffer-sizes option
    Add sequential buffer size options to IOTune
    iotune: Ignore measurements during warmup period
    iotune: Fix warmup calculation bug and botched rebase

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26725
2025-10-28 13:10:12 +03:00
Robert Bindar
5e2dd4ecb3 Make scylla_io_setup detect request size for best write IOPS
We noticed during work on scylladb/seastar#2802 that on i7i family
(later proved that it's valid for i4i family as well),
the disks are reporting the physical sector sizes incorrectly
as 512bytes, whilst we proved we can render much better write IOPS with
4096bytes.

This is not the case on AWS i3en family where the reported 512bytes
physical sector size is also the size we can achieve the best write IOPS.

This patch works around this issue by changing `scylla_io_setup` to parse
the instance type out of `/sys/devices/virtual/dmi/id/product_name`
and run iotune with the correct request size based on the instance type.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#25315

(cherry picked from commit 2c74a6981b)

Closes scylladb/scylladb#26714
2025-10-27 16:53:44 +03:00
Patryk Jędrzejczak
bb99ae205c Merge '[Backport 2025.3] raft topology: fix group0 tombstone GC in the Raft-based recovery procedure' from Scylladb[bot]
Group0 tombstone GC considers only the current group 0 members
while computing the group 0 tombstone GC time. It's not enough
because in the Raft-based recovery procedure, there can be nodes
that haven't joined the current group 0 yet, but they have belonged
to a different group 0 and thus have a non-empty group 0 state ID.
The current code can cause a data resurrection in group 0 tables.

We fix this issue in this PR and add a regression test.

This issue was uncovered by `test_raft_recovery_entry_loss`, which
became flaky recently. We skipped this test for now. We will unskip
it in a following PR because it's skipped only on master, while we
want to backport this PR.

Fixes #26534

This PR contains an important bugfix, so we should backport it
to all branches with the Raft-based recovery procedure (2025.2
and newer).

- (cherry picked from commit 1d09b9c8d0)

- (cherry picked from commit 6b2e003994)

- (cherry picked from commit c57f097630)

Parent PR: #26612

Closes scylladb/scylladb#26680

* https://github.com/scylladb/scylladb:
  test: test group0 tombstone GC in the Raft-based recovery procedure
  group0_state_id_handler: remove unused group0_server_accessor
  group0_state_id_handler: consider state IDs of all non-ignored topology members
2025-10-27 10:20:16 +01:00
Patryk Jędrzejczak
ab843eb034 test: test group0 tombstone GC in the Raft-based recovery procedure
We add a regression test for the bug fixed in the previous commits.

(cherry picked from commit c57f097630)
2025-10-24 11:54:37 +02:00
Pavel Emelyanov
1ee781230b Merge '[Backport 2025.3] s3_client: handle failures which require http::request updating' from Scylladb[bot]
Apply two main changes to the s3_client error handling
1. Add a loop to s3_client's `make_request` for the case whe the retry strategy will not help since the request itself have to be updated. For example, authentication token expiration or timestamp on the request header
2. Refine the way we handle exceptions in the `chunked_download_source` background fiber, now we carry the original `exception_ptr` and also we wrap EVERY exception in `filler_exception` to prevent retry strategy trying to retry the request altogether

Fixes: https://github.com/scylladb/scylladb/issues/26483

Should be ported back to 2025.3 and 2025.4 to prevent deadlocks and failures in these versions

- (cherry picked from commit 55fb2223b6)

- (cherry picked from commit db1ca8d011)

- (cherry picked from commit 185d5cd0c6)

- (cherry picked from commit 116823a6bc)

- (cherry picked from commit 43acc0d9b9)

- (cherry picked from commit 58a1cff3db)

- (cherry picked from commit 1d34657b14)

- (cherry picked from commit 4497325cd6)

- (cherry picked from commit fdd0d66f6e)

Parent PR: #26527

This backport diverges from the original PR patch, as the 2025.3 release lacks the
required Seastar changes. Namely, there is no overload for make_request in this
version of the Seastar which accepts const& to the request argument. Thus here it's
handled by removing constness from request arguments when calling http's make_request

Closes scylladb/scylladb#26649

* https://github.com/scylladb/scylladb:
  s3_client: tune logging level
  s3_client: add logging
  s3_client: improve exception handling for chunked downloads
  s3_client: fix indentation
  s3_client: add max for client level retries
  s3_client: remove `s3_retry_strategy`
  s3_client: support high-level request retries
  s3_client: just reformat `make_request`
  s3_client: unify `make_request` implementation
2025-10-23 10:22:22 +03:00
Andrei Chekun
14583c2921 test.py: rewrite the wait_for_first_completed
Rewrite wait_for first_completed to return only first completed task guarantee
of awaiting(disappearing) all cancelled and finished tasks
Use wait_for_first_completed to avoid false pass tests in the future and issues
like #26148
Use gather_safely to await tasks and removing warning that coroutine was
not awaited

Closes scylladb/scylladb#26435

(cherry picked from commit 24d17c3ce5)

Closes scylladb/scylladb#26662
2025-10-22 21:31:47 +02:00
Patryk Jędrzejczak
ceec9b5508 group0_state_id_handler: remove unused group0_server_accessor
It became unused in the previous commit.

(cherry picked from commit 6b2e003994)
2025-10-22 17:12:57 +00:00
Patryk Jędrzejczak
9c809db181 group0_state_id_handler: consider state IDs of all non-ignored topology members
It's not enough to consider only the current group 0 members. In the
Raft-based recovery procedure, there can be nodes that haven't joined
the current group 0 yet, but they have belonged to a different group 0
and thus have a non-empty group 0 state ID.

We fix this issue in this commit by considering topology members
instead.

We don't consider ignored nodes as an optimization. When some nodes are
dead, the group 0 state ID handler won't have to wait until all these
nodes leave the cluster. It will only have to wait until all these nodes
are ignored, which happens at the beginning of the first
removenode/replace. As a result, tombstones of group 0 tables will be
purged much sooner.

We don't rename the `group0_members` variable to keep the change
minimal. There seems to be no precise and succinct name for the used set
of nodes anyway.

We use `std::ranges::join_view` in one place because:
- `std::ranges::concat` will become available in C++26,
- `boost::range::join` is not a good option, as there is an ongoing
  effort to minimize external dependencies in Scylla.

(cherry picked from commit 1d09b9c8d0)
2025-10-22 17:12:57 +00:00
Ernest Zaslavsky
c39c560bc3 s3_client: tune logging level
Change all logging related to errors in `chunked_download_source` background download fiber to `info` to make it visible right away in logs.

(cherry picked from commit fdd0d66f6e)
2025-10-22 15:24:06 +03:00
Ernest Zaslavsky
fa3f309877 s3_client: add logging
Add logging for the case when we encounter expired credentials, shouldnt happen but just in case

(cherry picked from commit 4497325cd6)
2025-10-22 15:24:06 +03:00
Ernest Zaslavsky
aca20f5ca5 s3_client: improve exception handling for chunked downloads
Refactor the wrapping exception used in `chunked_download_source` to
prevent the retry strategy from reattempting failed requests. The new
implementation preserves the original `exception_ptr`, making the root
cause clearer and easier to diagnose.

(cherry picked from commit 1d34657b14)
2025-10-22 15:24:06 +03:00
Ernest Zaslavsky
898f0ebe5e s3_client: fix indentation
Reformat `client::make_request` to fix the indentation of `if` block

(cherry picked from commit 58a1cff3db)
2025-10-22 15:24:06 +03:00
Ernest Zaslavsky
c89bed0a85 s3_client: add max for client level retries
To prevent client retrying indefinitely time skew and authentication errors add `max_attempts` to the `client::make_request`

(cherry picked from commit 43acc0d9b9)
2025-10-22 15:24:05 +03:00
Ernest Zaslavsky
779a45e2c9 s3_client: remove s3_retry_strategy
It never worked as intended, so the credentials handling is moving to the same place where we handle time skew, since we have to reauthenticate the request

(cherry picked from commit 116823a6bc)
2025-10-22 15:24:05 +03:00
Ernest Zaslavsky
85102711ba s3_client: support high-level request retries
Add an option to retry S3 requests at the highest level, including
reinitializing headers and reauthenticating. This addresses cases
where retrying the same request fails, such as when the S3 server
rejects a timestamp older than 15 minutes.

(cherry picked from commit 185d5cd0c6)
2025-10-22 15:24:05 +03:00
Asias He
ff94e2d96b repair: Fix uuid and nodes_down order in the log
Fixes #26536

Closes scylladb/scylladb#26547

(cherry picked from commit 33bc1669c4)

Closes scylladb/scylladb#26629
2025-10-22 11:30:44 +03:00
Ernest Zaslavsky
c1d53eee92 s3_client: just reformat make_request
Just reformat previously changed methods to improve readability

(cherry picked from commit db1ca8d011)
2025-10-21 12:26:15 +00:00
Ernest Zaslavsky
268e6720a8 s3_client: unify make_request implementation
Refactor `make_request` to use a single core implementation that
handles authentication and issues the HTTP request. All overloads now
delegate to this unified method.

(cherry picked from commit 55fb2223b6)
2025-10-21 12:26:15 +00:00
Botond Dénes
6dde7a9b84 Merge '[Backport 2025.3] raft topology: disable schema pulls in the Raft-based recovery procedure' from Scylladb[bot]
Schema pulls should always be disabled when group 0 is used. However,
`migration_manager::disable_schema_pulls()` is never called during
a restart with `recovery_leader` set in the Raft-based recovery
procedure, which causes schema pulls to be re-enabled on all live nodes
(excluding the nodes replacing the dead nodes). Moreover, schema pulls
remain enabled on each node until the node is restarted, which could
be a very long time.

We fix this issue and add a regression test in this PR.

Fixes #26569

This is an important bug fix, so it should be backported to all branches
with the Raft-based recovery procedure (2025.2 and newer branches).

- (cherry picked from commit ec3a35303d)

- (cherry picked from commit da8748e2b1)

- (cherry picked from commit 71de01cd41)

Parent PR: #26572

Closes scylladb/scylladb#26597

* github.com:scylladb/scylladb:
  test: test_raft_recovery_entry_loss: fix the typo in the test case name
  test: verify that schema pulls are disabled in the Raft-based recovery procedure
  raft topology: disable schema pulls in the Raft-based recovery procedure
2025-10-20 10:41:40 +03:00
Patryk Jędrzejczak
920a633a03 test: test_raft_recovery_entry_loss: fix the typo in the test case name
(cherry picked from commit 71de01cd41)
2025-10-17 10:26:59 +00:00
Patryk Jędrzejczak
5a2916dd8c test: verify that schema pulls are disabled in the Raft-based recovery procedure
We do this at the end of `test_raft_recovery_entry_loss`. It's not worth
to add a separate regression test, as tests of the recovery procedure
are complicated and have a long running time. Also, we choose
`test_raft_recovery_entry_loss` out of all tests of the recovery
procedure because it does some schema changes.

(cherry picked from commit da8748e2b1)
2025-10-17 10:26:59 +00:00
Patryk Jędrzejczak
85a67a0f9e raft topology: disable schema pulls in the Raft-based recovery procedure
Schema pulls should always be disabled when group 0 is used. However,
`migration_manager::disable_schema_pulls()` is never called during
a restart with `recovery_leader` set in the Raft-based recovery
procedure, which causes schema pulls to be re-enabled on all live nodes
(excluding the nodes replacing the dead nodes). Moreover, schema pulls
remain enabled on each node until the node is restarted, which could
be a very long time.

The old gossip-based recovery procedure doesn't have this problem
because we disable schema pulls after completing the upgrade-to-group0
procedure, which is a part of the old recovery procedure.

Fixes #26569

(cherry picked from commit ec3a35303d)
2025-10-17 10:26:59 +00:00
Jenkins Promoter
b34f11e52b Update ScyllaDB version to: 2025.3.3 2025-10-15 19:22:53 +03:00
Ernest Zaslavsky
e9bdd13d1b s3_client: track memory starvation in background filling fiber
Introduce a counter metric to monitor instances where the background
filling fiber is blocked due to insufficient memory in the S3 client.

Closes scylladb/scylladb#26466

(cherry picked from commit 413739824f)

Closes scylladb/scylladb#26553
2025-10-15 12:05:20 +02:00
Michał Chojnowski
75f671ff18 test/boost/sstable_compressor_factory_test: fix thread-unsafe usage of Boost.Test
It turns out that Boost assertions are thread-unsafe,
(and can't be used from multiple threads concurrently).
This causes the test to fail with cryptic log corruptions sometimes.
Fix that by switching to thread-safe checks.

Fixes scylladb/scylladb#24982

Closes scylladb/scylladb#26472

(cherry picked from commit 7c6e84e2ec)

Closes scylladb/scylladb#26552
2025-10-15 12:13:54 +03:00
Jenkins Promoter
8714e119c9 Update pgo profiles - aarch64 2025-10-15 05:18:53 +03:00
Jenkins Promoter
a46eb49b4d Update pgo profiles - x86_64 2025-10-15 05:04:15 +03:00
Piotr Wieczorek
445e58bbc5 alternator: Correct RCU undercount in BatchGetItem
The `describe_multi_item` function treated the last reference-captured
argument as the number of used RCU half units. The caller
`batch_get_item`, however, expected this parameter to hold an item size.
This RCU value was then passed to
`rcu_consumed_capacity_counter::get_half_units`, treating the
already-calculated RCU integer as if it were a size in bytes.

This caused a second conversion that undercounted the true RCU. During
conversion, the number of bytes is divided by `RCU_BLOCK_SIZE_LENGTH`
(=4KB), so the double conversion divided the number of bytes by 16 MB.

The fix removes the second conversion in `describe_multi_item` and
changes the API of `describe_multi_item`.

Fixes: https://github.com/scylladb/scylladb/pull/25847

Closes scylladb/scylladb#25842

(cherry picked from commit a55c5e9ec7)

Closes scylladb/scylladb#26538
2025-10-14 11:49:30 +03:00
Avi Kivity
18796c3173 dist: scylla_raid_setup: don't override XFS block size on modern kernels
In 6977064693 ("dist: scylla_raid_setup:
reduce xfs block size to 1k"), we reduced the XFS block size to 1k when
possible. This is because commitlog wants to write the smallest amount
of padding it can, and older Linux could only write a multiple of the
block size. Modern Linux [1] can O_DIRECT overwrite a range smaller than
a filesystem block.

However, this doesn't play well with some SSDs that have 512 byte
logical sector size and 4096 byte physical sector size - it causes them
to issue read-modify-writes.

To improve the situation, if we detect that the kernel is recent enough,
format the filesystem with its default block size, which should be optimal.

Note that commitlog will still issue sub-4k writes, which can translate
to RMW. There, we believe that the amplification is reduced since
sequential sub-physical-sector writes can be merged, and that the overhead
from commitlog space amplification is worse than the RMW overhead.

Tested on AWS i4i.large. fsqual report:

```
memory DMA alignment:    512
disk DMA alignment:      512
filesystem block size:   4096
context switch per write io (size-changing, append, blocksize 4096, iodepth 1): 0.0003 (GOOD)
context switch per write io (size-changing, append, blocksize 4096, iodepth 3): 0.7961 (BAD)
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 3): 0 (GOOD)
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 7): 0.0001 (GOOD)
context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.125 (BAD)
context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per write io (size-changing, append, blocksize 4096, iodepth 1): 0 (GOOD)
context switch per write io (size-changing, append, blocksize 4096, iodepth 3): 0.8006 (BAD)
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 3): 0.0001 (GOOD)
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 7): 0 (GOOD)
context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.125 (BAD)
context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per read io (size-changing, append, blocksize 512, iodepth 30): 0 (GOOD)
```

The sub-block overwrite cases are GOOD.

In comparison, the fsqual report for 1k (similar):

```
memory DMA alignment:    512
disk DMA alignment:      512
filesystem block size:   1024
context switch per write io (size-changing, append, blocksize 1024, iodepth 1): 0.0005 (GOOD)
context switch per write io (size-changing, append, blocksize 1024, iodepth 3): 0.7948 (BAD)
context switch per write io (size-unchanging, append, blocksize 1024, iodepth 3): 0.0015 (GOOD)
context switch per write io (size-unchanging, append, blocksize 1024, iodepth 7): 0.0022 (GOOD)
context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.4999 (BAD)
context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per write io (size-changing, append, blocksize 1024, iodepth 1): 0 (GOOD)
context switch per write io (size-changing, append, blocksize 1024, iodepth 3): 0.798 (BAD)
context switch per write io (size-unchanging, append, blocksize 1024, iodepth 3): 0.0012 (GOOD)
context switch per write io (size-unchanging, append, blocksize 1024, iodepth 7): 0.0019 (GOOD)
context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.5 (BAD)
context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per read io (size-changing, append, blocksize 512, iodepth 30): 0 (GOOD)
```

Fixes #25441.

[1] ed1128c2d0

Closes scylladb/scylladb#25445

(cherry picked from commit 5d1846d783)

Closes scylladb/scylladb#26532
2025-10-14 11:48:58 +03:00
Michał Chojnowski
c05efcf7c6 test_sstable_compression_dictionaries_basic: reconnect robustly after node reboots
Using `driver_connect()` after a cluster restart isn't enough to ensure
full CQL availability, but the test assumes that it is.

Fix that by making the test wait for CQL availability via `get_ready_cql()`.

Also, replace some manual usages of wait_for_cql_and_get_hosts with
`get_ready_cql()` too.

Fixes scylladb/scylladb#25362

Closes scylladb/scylladb#25366

(cherry picked from commit 85fd4d23fa)

Closes scylladb/scylladb#26514
2025-10-12 21:08:23 +03:00
Nadav Har'El
295ed0e9e1 cql: document and test permissions on materialized views and CDC
We were recently surprised (in pull request #25797) to "discover" that
Scylla does not allow granting SELECT permissions on individual
materialized views. Instead, all materialized views of a base table
are readable if the base table is readable.

In this patch we document this fact, and also add a test to verify
that it is indeed true. As usual for cqlpy tests, this test can also
be run on Cassandra - and it passes showing that Cassandra also
implemented it the same way (which isn't surprising, given that we
probably copied our initial implementation from them).

The test demonstrates that neither Scylla nor Cassandra prints an error
when attempting to GRANT permissions on a specific materialized view -
but this GRANT is simply ignored. This is not ideal, but it is the
existing behavior in both and it's not important now to change it.

Additionally, because pull request #25797 made CDC-log permissions behave
the same as materialized views - i.e., you need to make the base table
readable to allow reading from the CDC log, this patch also documents
this fact and adds a test for it also.

Fixes #25800

Closes scylladb/scylladb#25827

(cherry picked from commit 3c969e2122)

Closes scylladb/scylladb#26104
2025-10-10 10:37:25 +03:00
Ernest Zaslavsky
b2dc4647dd s3_client: fix when condition to prevent infinite locking
Refine condition variable predicate in filling fiber to avoid
indefinite waiting when `close` is invoked.

Closes scylladb/scylladb#26449

(cherry picked from commit c2bab430d7)

Closes scylladb/scylladb#26496
2025-10-10 10:30:54 +03:00
Michał Chojnowski
5ccdcb9459 docs: fix a parameter name in API calls in sstable-dictionary-compression.rst
The correct argument name is `cf`, not `table`.

Fixes scylladb/scylladb#25275

Closes scylladb/scylladb#26447

(cherry picked from commit 87e3027c81)

Closes scylladb/scylladb#26494
2025-10-10 10:30:39 +03:00
Pavel Emelyanov
8078ad7ee4 Merge '[Backport 2025.3] service/qos: set long timeout for auth queries on SL cache update' from Scylladb[bot]
pass an appropriate query state for auth queries called from service
level cache reload. we use the function qos_query_state to select a
query_state based on caller context - for internal queries, we set a
very long timeout.

the service level cache reload is called from group0 reload. we want it
to have a long timeout instead of the default 5 seconds for auth
queries, because we don't have strict latency requirement on the one
hand, and on the other hand a timeout exception is undesired in the
group0 reload logic and can break group0 on the node.

Fixes https://github.com/scylladb/scylladb/issues/25290

backport possible to improve stability

- (cherry picked from commit a1161c156f)

- (cherry picked from commit 3c3dd4cf9d)

- (cherry picked from commit ad1a5b7e42)

Parent PR: #26180

Closes scylladb/scylladb#26478

* github.com:scylladb/scylladb:
  service/qos: set long timeout for auth queries on SL cache update
  auth: add query_state parameter to query functions
  auth: refactor query_all_directly_granted
2025-10-10 10:30:15 +03:00
Patryk Jędrzejczak
0a71901c4c raft topology: make the voter handler consider only group 0 members
In the Raft-based recovery procedure, we create a new group 0 and add
live nodes to it one by one. This means that for some time there are
nodes which belong to the topology, but not to the new group 0. The
voter handler running on the recovery leader incorrectly considers these
nodes while choosing voters.

The consequences:
- misleading logs, for example, "making servers {<ID of a non-member>}
  voters", where the non-member won't become a voter anyway,
- increased chance of majority loss during the recovery procedure, for
  example, all 3 nodes that first joined the new group 0 are in the same
  dc and rack, but only one of them becomes a voter because the voter
  handler tries to make non-members in other dcs/racks voters.

Fixes #26321

Closes scylladb/scylladb#26327

(cherry picked from commit 67d48a459f)

Closes scylladb/scylladb#26427
2025-10-09 18:20:36 +02:00
Michael Litvak
e05c708327 service/qos: set long timeout for auth queries on SL cache update
pass an appropriate query state for auth queries called from service
level cache reload. we use the function qos_query_state to select a
query_state based on caller context - for internal queries, we set a
very long timeout.

the service level cache reload is called from group0 reload. we want it
to have a long timeout instead of the default 5 seconds for auth
queries, because we don't have strict latency requirement on the one
hand, and on the other hand a timeout exception is undesired in the
group0 reload logic and can break group0 on the node.

Fixes scylladb/scylladb#25290

(cherry picked from commit ad1a5b7e42)
2025-10-09 12:48:08 +00:00
Michael Litvak
192ec59196 auth: add query_state parameter to query functions
add a query_state parameter to several auth functions that execute
internal queries. currently the queries use the
internal_distributed_query_state() query state, and we maintain this as
default, but we want also to be able to pass a query state from the
caller.

in particular, the auth queries currently use a timeout of 5 seconds,
and we will want to set a different timeout when executed in some
different context.

(cherry picked from commit 3c3dd4cf9d)
2025-10-09 12:48:07 +00:00
Michael Litvak
859e306e9d auth: refactor query_all_directly_granted
rewrite query_all_directly_granted to use execute_internal instead of
query_internal in a style that is more consistent with the rest of the
module.

This will also be useful for a later change because execute_internal
accepts an additional parameter of query_state.

(cherry picked from commit a1161c156f)
2025-10-09 12:48:07 +00:00
Raphael S. Carvalho
40e8f652db replica: Fix race between drop table and merge completion handling
Consider this:
1) merge finishes, wakes up fiber to merge compaction groups
2) drop table happens, which in turn invokes truncate underneath
3) merge fiber stops old groups
4) truncate disables compaction on all groups, but the ones stopped
5) truncate performs a check that compaction has been disabled on
all groups, including the ones stopped
6) the check fails because groups being stopped didn't have compaction
explicitly disabled on them

To fix it, the check on step 6 will ignore groups that have been
stopped, since those are not eligible for having compaction explicitly
disabled on them. The compaction check is there, so ongoing compaction
will not propagate data being truncated, but here it happens in the
context of drop table which doesn't leave anything behind. Also, a
group stopped is somewhat equivalent to compaction disabled on it,
since the procedure to stop a group stops all ongoing compaction
and eventually removes its state from compaction manager.

Fixes #25551.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#25563

(cherry picked from commit 149f9d8448)

Closes scylladb/scylladb#25632
2025-10-08 06:28:09 +03:00
Botond Dénes
2e81bc14df Merge '[Backport 2025.3] tools: fix documentation links after change to source-available' from Scylladb[bot]
Some tools commands have links to online documentation in their help output. These links were left behind in the source-available change, they still point to the old opensource docs. Furthermore, the links in the scylla-sstable help output always point to the latest stable release's documentation, instead of the appropriate one for the branch the tool was built from. Fix both of these.

Fixes: scylladb/scylladb#26320

Broken documentation link fix for the  tool help output, needs backport to all live source-available versions.

- (cherry picked from commit 5a69838d06)

- (cherry picked from commit 15a4a9936b)

- (cherry picked from commit fe73c90df9)

Parent PR: #26322

Closes scylladb/scylladb#26389

* github.com:scylladb/scylladb:
  tools/scylla-sstable: fix doc links
  release: adjust doc_link() for the post source-available world
  tools/scylla-nodetool: remove trailing " from doc urls
2025-10-07 14:11:01 +03:00
Botond Dénes
524da18c05 tools/scylla-sstable: fix doc links
The doc links in scylla-sstable help output are static, so they always
point to the documentation of the latest stable release, not to the
documentation of the release the tool binary is from. On top of that,
the links point to old open-source documentation, which is now EOL.
Fix both problems: point link at the new source-available documentation
pages and make them version aware.

(cherry picked from commit fe73c90df9)
2025-10-07 10:07:08 +03:00
Botond Dénes
0b0192b9ec release: adjust doc_link() for the post source-available world
There is no more separate enterprise product and the doc urls are
slightly different.

(cherry picked from commit 15a4a9936b)
2025-10-03 14:28:44 +00:00
Botond Dénes
96a3481705 tools/scylla-nodetool: remove trailing " from doc urls
They are accidental leftover from a previous way of storing command
descriptions.

(cherry picked from commit 5a69838d06)
2025-10-03 14:28:44 +00:00
Benny Halevy
7341828f8f test_tablets_merge: test_tablet_split_merge_with_many_tables: reduce number of tables in debug mode
As the test hits timeouts in debug mode on aarch64.

Fixes #26252

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#26303

(cherry picked from commit b81c6a339b)

Closes scylladb/scylladb#26325
2025-10-01 14:07:00 +03:00
Asias He
8b4baeb487 repair: Always reset node ops progress to 100% upon completion
Always set the node ops progress to 100% when the operation finishes,
regardless of success or failure. This ensures the progress never
remains below 100%, which would otherwise indicates a pending node
operation in case of an error.

Fixes #26193

Closes scylladb/scylladb#26194

(cherry picked from commit b31e651657)

Closes scylladb/scylladb#26267
2025-10-01 14:06:29 +03:00
Botond Dénes
c4c18fdc8d Merge '[Backport 2025.3] replica: Fix split compaction when tablet boundaries change' from Scylladb[bot]
Consider the following:
1) balancer emits split decision
2) split compaction starts
3) split decision is revoked
4) emits merge decision
5) completes merge, before compaction in step 2 finishes

After last step, split compaction initiated in step 2 can fail because it works with the global tablet map, rather than the map when the compaction started. With the global state changing under its feet, on merge, the mutation splitting writer will think it's going backwards since sibling tablets are merged.

This problem was also seen when running load-and-stream, where split initiated by the sstable writer failed, split completed, and the unsplit sstable is left in the table dir, causing problems in the restart.

To fix this, let's make split compaction always work with the state when it started, not a global state.

Fixes #24153.

All 2025.* versions are vulnerable, so fix must be backported to them.

- (cherry picked from commit 0c1587473c)

- (cherry picked from commit 68f23d54d8)

Parent PR: #25690

Closes scylladb/scylladb#25935

* github.com:scylladb/scylladb:
  replica: Fix split compaction when tablet boundaries change
  replica: Futurize split_compaction_options()
2025-10-01 14:03:41 +03:00
Jenkins Promoter
ecfe6fa332 Update pgo profiles - aarch64 2025-10-01 05:17:59 +03:00
Jenkins Promoter
ce76350fc2 Update pgo profiles - x86_64 2025-10-01 04:57:07 +03:00
Avi Kivity
1bfd52b9ea Merge '[Backport 2025.3] scylla-gdb: Fix fair-queue entry printing' from Scylladb[bot]
Catching a live entry in IO queue is very rare event, so we haven't seen it so far, but the `_ticket` member had been removed ~2 years ago and had been replaced with `_capacity` which is plain 64bit integer.

Fixes #26184

The issue is present in 2025.x as well and looks cheap to backport

- (cherry picked from commit 8438c59ad3)

Parent PR: #26185

Also includes backport of #24835 which also applies to 2025.3 and is now crucial.
The scylla_io_queues.ticket() method is renamed by this backport, but without 24835 it will be problematic to fix all callers of it

Closes scylladb/scylladb#26266

* github.com:scylladb/scylladb:
  scylla-gdb: Fix fair-queue entry printing
  scylla-gdb: Don't show io_queue executing and queued resources
2025-09-30 16:32:10 +03:00
Pavel Emelyanov
c543f7f282 scylla-gdb: Fix fair-queue entry printing
Catching a live entry in IO queue is very rare event, so we haven't seen
it so far, but the `_ticket` member had been removed ~2 years ago and
had been replaced with `_capacity` which is plain 64bit integer.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26185

(cherry picked from commit 8438c59ad3)
2025-09-30 11:29:06 +03:00
Pavel Emelyanov
8cb3b964e0 scylla-gdb: Don't show io_queue executing and queued resources
These counters are no longer accounted by io-queue code and are always
zero. Even more -- accounting removal happened years ago and we don't
have Scylla versions built with seastar older than that.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24835
2025-09-30 11:29:06 +03:00
Raphael S. Carvalho
94b0f2fd48 replica: Fix split compaction when tablet boundaries change
Consider the following:
1) balancer emits split decision
2) split compaction starts
3) split decision is revoked
4) emits merge decision
5) completes merge, before compaction in step 2 finishes

After last step, split compaction initiated in step 2 can fail
because it works with the global tablet map, rather than the
map when the compaction started. With the global state changing
under its feet, on merge, the mutation splitting writer will
think it's going backwards since sibling tablets are merged.

This problem was also seen when running load-and-stream, where
split initiated by the sstable writer failed, split completed,
and the unsplit sstable is left in the table dir, causing
problems in the restart.

To fix this, let's make split compaction always work with
the state when it started, not a global state.

Fixes #24153.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 68f23d54d8)
2025-09-29 20:29:05 -03:00
Ferenc Szili
9429099d5f docs: add description of number of tablets computed by tablet allocator
This change adds the documentation section which explains the algorithm
to compute the absolute number of tablets a table has.

Fixes: #25740

Closes scylladb/scylladb#25741

(cherry picked from commit d462dc8839)

Closes scylladb/scylladb#26264
2025-09-28 20:29:05 +03:00
Aleksandra Martyniuk
28850ac613 test: fix test_two_tablets_concurrent_repair_and_migration_repair_writer_level
test_two_tablets_concurrent_repair_and_migration_repair_writer_level waits
for the first node that logs info about repair_writer using asyncio.wait.
The done group is never awaited, so we never learn about the error.

The test itself is incorrect and the log about repair_writer is never
printed. We never learn about that and tests finishes successfully
after 10 minutes timeout.

Fix the test:
- disable hinted handoff;
- repair tablets of the whole table:
  - new table is added so that concurrent migration is possible;
- use wait_for_first_completed that awaits done group;
- do some cleanups.

Remove nightly mark.

Fixes: #26148.

Closes scylladb/scylladb#26209

(cherry picked from commit 48bbe09c8b)

Closes scylladb/scylladb#26220
2025-09-26 16:42:41 +03:00
Botond Dénes
6e94299ccf Merge '[Backport 2025.3] compaction: ensure that all compaction executors are stopped' from Scylladb[bot]
Currently, while stopping the compaction_manager, we stop task_manager
compaction module and concurrently run compaction_manager::really_do_stop.
really_do_stop stops and waits for all task_executors that are kept
in compaction_manager::_tasks, but nothing ensures that no more tasks will
be added there. Due to leftover tasks, we trigger  on_fatal_internal_error.

Modify the order of compaction_manager::stop. After the change, we stop
compaction tasks in the following order:
- abort module abort source;
- close module gate in the background;
- stop_ongoing_compactions (kept in compaction_manager::_tasks);
- wait until module gate is closed.

Check module abort source before creating compaction executor and
adding it to _tasks.

Thanks to the above, we can be sure that:
- after module::stop there will be no tasks in _tasks;
- compaction_manager::stop aborts all tasks; we don't wait for any whole
  compaction to finish.

Fixes: https://github.com/scylladb/scylladb/issues/25806.

Fixes shutdown bug; Needs backports to all version

- (cherry picked from commit 17707d0e6b)

- (cherry picked from commit 97c77d7cd5)

Parent PR: #25885

Closes scylladb/scylladb#26224

* github.com:scylladb/scylladb:
  compaction: move _tasks check
  compaction: stop compaction module in really_do_stop
2025-09-26 13:20:52 +03:00
Gleb Natapov
6383b9009c storage_service: change node_ops_info::ignore_nodes to host id
It drop useless translation from id to ip during removenode through
topology coordinator.

Closes scylladb/scylladb#25958

(cherry picked from commit d3badf7406)

Closes scylladb/scylladb#26251
2025-09-26 10:53:47 +02:00
Aleksandra Martyniuk
7c847d76f6 compaction: move _tasks check
In compaction_manager::really_do_stop we check whether _tasks list
is empty after the compactions are stopped. However, a new task may
still sneak in, causing the assertion failure. Such a task won't
be there for long - module::make_task will fail as the module is
already stopped.

Move the assertion, that checks if _tasks is empty, after the
compaction_states' gates are closed.

Fixes: #25806.
(cherry picked from commit 97c77d7cd5)
2025-09-25 15:56:00 +02:00
Aleksandra Martyniuk
9fc38c01a0 compaction: stop compaction module in really_do_stop
Currently, compaction::task_manager_module is stopped in compaction_manager::stop,
concurrently to really_do_stop. We can't predict the order of the two.

Do not set _task_manager_module to nullptr at stop, because
compaction_manager::really_do_stop() may be called before the actual
shutdown, while other components still try to use it.
compaction::task_manager_module does not keep a pointer to compaction_manager,
so we won't end up with memory leak.

Stop compaction module in really_do_stop, after ongoing compactions
are stopped.

It's a preparation for further patches.

(cherry picked from commit 17707d0e6b)
2025-09-25 15:55:55 +02:00
Ferenc Szili
99b69092ef load_balancer: fix std::out_of_bounds when decommissioning with empty nodes
Consider the following:

The tablet load balancer is working on:

- node1: an empty node (no tablets) with a large disk capacity
- node2: an empty node (no tablets) with a lower disk capacity then node1
- node3: is being decommissioned and contains tablet replicas

In load_balancer::make_internode_plan() the initial destination
node/shard is selected like this:

// Pick best target shard.
auto dst = global_shard_id {target, _load_sketch->get_least_loaded_shard(target)};

load_sketch::get_least_loaded_shard(host_id) calls ensure_node() which
adds the host to load_sketch's internal hash maps in case the node was
not yet seen by load_sketch.

Let's assume dst is a shard on node1.

Later in load_balancer::make_internode_plan() we will call
pick_candidate() to try to find a better destination node than the
initial one:

// May choose a different source shard than src.shard or different destination host/shard than dst.
auto candidate = co_await pick_candidate(nodes, src_node_info, target_info, src, dst, nodes_by_load_dst,
                                            drain_skipped);
auto source_tablets = candidate.tablets;
src = candidate.src;
dst = candidate.dst;

If pick_candidate() selects some other empty destination (due to larger
capacity: node1) node, and that node has not yet been seen by
load_sketch (because it was empty), a subsequent call to
load_sketch::pick() will search for the node using
std::unordered_map::at(), and because the node is not found it will
throw a std::out_of_bounds() exception crashing the load balancer.

This problem is fixed by changing load_sketch::populate() to initialize
its internal maps with all the nodes which populate()'s arguments
filter for.

Fixes: #26203

Closes scylladb/scylladb#26207

(cherry picked from commit c6c9c316a7)

Closes scylladb/scylladb#26240
2025-09-25 09:42:47 +03:00
Dawid Mędrek
c7091b61e4 db/batchlog: Drop batch if table has been dropped
If there are pending mutations in the batchlog for a table that
has been dropped, we'll keep attempting to replay them but with
no success -- `db::no_such_column_family` exceptions will be thrown,
and we'll keep trying again and again.

To prevent that, we drop the batch in that case just like we do
in the case of a non-existing keyspace.

A reproducer test has been included in the commit. It fails without
the changes in `db/batchlog_manager.cc`, and it succeeds with them.

Fixes scylladb/scylladb#24806

Closes scylladb/scylladb#26057

(cherry picked from commit 35f7d2aec6)

Closes scylladb/scylladb#26201
2025-09-25 09:39:55 +03:00
Andrzej Jackowski
07c21c06a4 test: audit: stop using datetime.datetime.now() in syslog converter
`line_to_row` is a test function that converts `syslog` audit log to
the format of `table` audit log so tests can use the same checks
for both types of audit. Because `syslog` audit doesn't have `date`
information, the field was filled with the current date. This behavior
broke the tests running at 23:59:59 because `line_to_row` returned
different results on different days.

Fixes: scylladb/scylladb#25509

Closes scylladb/scylladb#26101

(cherry picked from commit 15e71ee083)

Closes scylladb/scylladb#26191
2025-09-25 09:37:45 +03:00
Dawid Mędrek
f4de31a316 test/perf/tablet_load_balancing.cc: Create nodes within one DC
In 789a4a1ce7, we adjusted the test file
to work with the configuration option `rf_rack_valid_keyspaces`. Part of
the commit was making the two tables used in the test replicate in
separate data centers.

Unfortunately, that destroyed the point of the test because the tables
no longer competed for resources. We fix that by enforcing the same
replication factor for both tables.

We still accept different values of replication factor when provided
manually by the user (by `--rf1` and `--rf2` commandline options). Scylla
won't allow for creating RF-rack-invalid keyspaces, but there's no reason
to take away the flexibility the user of the test already has.

Fixes scylladb/scylladb#26026

Closes scylladb/scylladb#26115

(cherry picked from commit 0d2560c07f)

Closes scylladb/scylladb#26172
2025-09-25 09:34:29 +03:00
Ferenc Szili
4a2ba1dbde docs: add capacity based balancing explanation
Capacity based balancing was introduced in 2025.1. It computes balance
based on a node's capacity: the number of tablets located on a node
should be directly proportional to that node's storage capacity.

This change adds this explanation to the docs.

Fixes: #25686

Closes scylladb/scylladb#25687

(cherry picked from commit de5dab8429)

Closes scylladb/scylladb#26107
2025-09-25 09:31:40 +03:00
Botond Dénes
6fe8f98add Merge '[Backport 2025.3] compaction/scrub: register sstables for compaction before validation' from Scylladb[bot]
compaction/scrub: register sstables for compaction before validation

When `scrub --validate` runs, it collects all candidate sstables at the
start and validates them one by one in separate compaction tasks.
However, scrub in validate mode does not register these sstables for
compaction, which allows regular compaction to pick them up and
potentially compact them away before validation begins. This leads to
scrub failures because the sstables can no longer be found.

This patch fixes the issue by first disabling compaction, collecting the
sstables, and then registering them for compaction before starting
validation. This ensures that the enqueued sstables remain available for
the entire duration of the scrub validation task.

Fixes #23363

This reported scrub failure occurs on all versions that have the
checksum/digest validation feature for uncompressed sstables.
So, backport it to older versions.

- (cherry picked from commit 84f2e99c05)

- (cherry picked from commit 7cdda510ee)

Parent PR: #26034

Closes scylladb/scylladb#26099

* github.com:scylladb/scylladb:
  compaction/scrub: register sstables for compaction before validation
  compaction/scrub: handle exceptions when moving invalid sstables to quarantine
2025-09-25 09:27:31 +03:00
Pavel Emelyanov
702eda371b s3: Add metrics to show S3 prefetch bytes
The chunked download source sends large GET requests and then consumes data
as it arrives. Sometimes it can stop reading from socket early and drop the
in-flight data. The existing read-bytes metrics show only the number of
consumed bytes, we we also want to know the number of requested bytes

Refs #25770 (accounting of read-bytes)
Fixes #25876

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#25877

(cherry picked from commit 6fb66b796a)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26070
2025-09-25 09:26:41 +03:00
Raphael S. Carvalho
ec225e08d1 replica: Futurize split_compaction_options()
Prepararation for the fix of #24153.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 0c1587473c)
2025-09-24 20:38:08 -03:00
Patryk Jędrzejczak
d653e710ba test: deflake driver reconnections in the recovery procedure tests
All three tests could hit
https://github.com/scylladb/python-driver/issues/295. We use the
standard workaround for this issue: reconnecting the driver after
the rolling restart, and before sending any requests to local tables
(that can fail if the driver closes a connection to the node that
restarted last).

All three tests perform two rolling restarts, but the latter ones
already have the workaround.

Fixes #26005

Closes scylladb/scylladb#26056

(cherry picked from commit a56115f77b)

Closes scylladb/scylladb#26199
2025-09-24 11:52:00 +02:00
Tomasz Grabiec
b3f4bef36b tablets: scheduler: Run plan-maker in maintenance scheduling group
Currently, it runs in the gossiper scheduling group, because it's
invoked by the topology coordinator. That scheduling group has the
same amount of shares as user workload. Plan-making can take
significant amount of time during rebalancing, and we don't want that
to impact user workload which happens to run on the same shard.

Reduce impact by running in the maintenance scheduling group.

Fixes #26037

Closes scylladb/scylladb#26046

(cherry picked from commit ddbcea3e2a)

Closes scylladb/scylladb#26168
2025-09-22 15:20:01 +02:00
Pavel Emelyanov
b4598031e6 s3: Fix chunked download source metrics calculations
In S3 client both read and write metrics have three counters -- number
of requests made, number of bytes processed and request latency. In most
of the cases all three counters are updated at once -- upon response
arrival.

However, in case of chunked download source this way of accounting
metrics is misleading. In this code the request is made once, and then
the obtained bytes are consumed eventually as the data arrive.

Currently, each time a new portion of data is read from the socket the
number of read requests is incremented. That's wrong, the request is
made once, and this counter should also be incremented once, not for
every data buffer that arrived in response.

Same for read request latency -- it's "added" for every data buffer that
arrives, but it's a lenghy process, the _request_ latency should be
accounted once per responce. Maybe later we'll want to have "data
latency" metrics as well, but for what we have now it's request latency.

The number of read bytes is accounted properly, so not touched here.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#25770

(cherry picked from commit 9deea3655f)

Closes scylladb/scylladb#26145
2025-09-22 07:35:03 +03:00
Asias He
04776ad19e streaming: Enclose potential throws in try block and ensure sink close before logging
- Move the initialization of log_done inside the try block to catch any
  exceptions it may throw.

- Relocate the failure warning log after sink.close() cleanup
  to guarantee sink.close() is always called before logging errors.

Refs #25497

Closes scylladb/scylladb#25591

(cherry picked from commit b12404ba52)

Closes scylladb/scylladb#25903
2025-09-21 18:11:43 +03:00
Nadav Har'El
d61bce8685 alternator: fix bug in combination of AttributeUpdates + ReturnValues
In test/alternator/test_returnvalues.py we had tests for the
ReturnValues feature on UpdateItem requests - but we only tested
UpdateItem requests with the "modern" UpdateExpression, and forgot to
test the combination of ReturnValues with the old AttributeUpdates API.

It turns out this combination is buggy: when both ReturnValues=ALL_OLD
and AttributeUpdates need the previous value of the item, we may wrongly
std::move() the value out, and the operation will fail with a strange
error:

    An error occurred (ValidationException) when calling the UpdateItem
    operation: JSON assert failed on condition 'IsObject()'

The fix in this patch is trivial - just move the std::move() to the
correct place, after both UpdateExpression and AttributeUpdates
handling is done.

This patch also includes a reproducing test, which fails before this
patch and passes with it - and of course passes on DynamoDB. This
test reproduces two cases where the bug happened, as well as one
case where it didn't (to make sure we don't regress in what already
worked).

Fixes #25894

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25900

(cherry picked from commit 3c0032deb4)

Closes scylladb/scylladb#26096
2025-09-19 19:25:15 +03:00
Lakshmi Narayanan Sreethar
6e94a73fd4 compaction/scrub: register sstables for compaction before validation
When `scrub --validate` runs, it collects all candidate sstables at the
start and validates them one by one in separate compaction tasks.
However, scrub in validate mode does not register these sstables for
compaction, which allows regular compaction to pick them up and
potentially compact them away before validation begins. This leads to
scrub failures because the sstables can no longer be found.

This patch fixes the issue by first disabling compaction, collecting the
sstables, and then registering them for compaction before starting
validation. This ensures that the enqueued sstables remain available for
the entire duration of the scrub validation task.

Fixes #23363

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 7cdda510ee)
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-09-19 18:38:54 +05:30
Lakshmi Narayanan Sreethar
20501b2ea3 compaction/scrub: handle exceptions when moving invalid sstables to quarantine
In validate mode, scrub moves invalid sstables into the quarantine
folder. If validation fails because the sstable files are missing from
disk, there is nothing to move, and the quarantine step will throw an
exception. Handle such exceptions so scrub can return a proper
compaction_result instead of propagating the exception to the caller.
This will help the testcase for #23363 to reliably determine if the
scrub has failed or not.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 84f2e99c05)
2025-09-19 18:35:31 +05:30
Szymon Malewski
cca78c6568 alternator/expressions.g: Fix antlr3 missing token leak
This patch overrides the antlr3 function that allocates the missing
tokens that would eventually leak. The override stores these tokens in
a vector, ensuring memory is freed whenever the parser is destroyed.
Solution is copied from CQL implementation.

A unit test to reproduce the issue is added - leak would be reported
by ASAN, when running this test in debug mode - the test passed but
the leak is discovered when the test file exits.

Fixes #25878

Closes scylladb/scylladb#25930

(cherry picked from commit 776f90e2f8)

Closes scylladb/scylladb#26085
2025-09-18 07:50:31 +03:00
Sergey Zolotukhin
8568a8a303 raft: disable caching for raft log.
This change disables caching for raft log table due to the following reasons:
* Immediate reason is a deficiency in handling emerging range tombstones in the cache, which causes stalls.
* Long-term reason is that sequential reads from the raft log do not benefit from the cache, making it better to bypass it to free up space and avoid stalls.

Fixes scylladb/scylladb#26027

Closes scylladb/scylladb#26031

(cherry picked from commit 2640b288c2)

Closes scylladb/scylladb#26074
2025-09-18 07:50:05 +03:00
Pavel Emelyanov
1310e61040 Merge '[Backport 2025.3] gossiper: ensure gossiper operations are executed in gossiper scheduling group' from Scylladb[bot]
Sometimes gossiper operations invoked from storage_service and other components run under a non-gossiper scheduling group. If these operations acquire gossiper locks, priority inversion can occur: higher-priority gossiper tasks may wait behind lower-priority tasks (e.g. streaming), which can cause gossiper slowness or even failures.
This patch ensures that gossiper operations requiring locks on gossiper structures are explicitly executed in the gossiper scheduling group. To help detect similar issues in the future, a warning is logged whenever a gossiper lock is acquired under a non-gossiper scheduling group.

Fixes scylladb/scylladb#25907
Refs: scylladb/scylladb#25702

Backport: this patch fixes an issue with gossiper operations scheduling group, that might affect topology operations, therefore backport is needed to 2025.1, 2025.2, 2025.3

- (cherry picked from commit 340413e797)

- (cherry picked from commit 6c2a145f6c)

Parent PR: #25981

Closes scylladb/scylladb#26073

* https://github.com/scylladb/scylladb:
  gossiper: ensure gossiper operations are executed in gossiper scheduling group
  gossiper: fix wrong gossiper instance used in `force_remove_endpoint`
2025-09-18 07:49:49 +03:00
Aleksandra Martyniuk
3f345615a5 replica: lower severity of failure log
Flush failure with seastar::named_gate_closed_exception is expected
if a respective compaction group was already stopped.

Lower the severity of a log in dirty_memory_manager::flush_one
for this exception.

Fixes: https://github.com/scylladb/scylladb/issues/25037.

Closes scylladb/scylladb#25355

(cherry picked from commit a10e241228)

Closes scylladb/scylladb#25650
2025-09-18 07:49:28 +03:00
Sergey Zolotukhin
3bf986170b gossiper: ensure gossiper operations are executed in gossiper scheduling group
Sometimes gossiper operations invoked from storage_service and other components
run under a non-gossiper scheduling group. If these operations acquire gossiper
locks, priority inversion can occur: higher-priority gossiper tasks may wait
behind lower-priority tasks (e.g. streaming), which can cause gossiper slowness
or even failures.
This patch ensures that gossiper operations requiring locks on gossiper
structures are explicitly executed in the gossiper scheduling group.
To help detect similar issues in the future, a warning is logged whenever
a gossiper lock is acquired under a non-gossiper scheduling group.

Fixes scylladb/scylladb#25907

(cherry picked from commit 6c2a145f6c)
2025-09-17 11:22:31 +00:00
Sergey Zolotukhin
d585211c4a gossiper: fix wrong gossiper instance used in force_remove_endpoint
`gossiper::force_remove_endpoint` is always executed on shard 0 using
`invoke_on`. Since each shard has its own `gossiper` instance, if
`force_remove_endpoint` is called from a shard other than shard 0,
`my_host_id()` may be invoked on the wrong `gossiper` object. This
results in undefined behavior due to unsynchronized access to resources
on another shard.

(cherry picked from commit 340413e797)
2025-09-17 11:22:31 +00:00
Wojciech Mitros
246fcb8b6a mv: delete previously undetected ghost rows in PRUNE MATERIALIZED VIEW statement
The PRUNE MATERIALIZED VIEW statement is supposed to remove ghost rows from the
view. Ghost rows are rows in the view with no corresponding row in the base table.
Before this patch, only rows whose primary key columns of the base table had
different values than any of the base rows were treated as ghost rows by the PRUNE
statement. However, view rows which have a column in their primary key that's not
in the base primary can also be ghost rows if this column has a different value
than the base row with the same values of remaining primary key columns. That's
because these rows won't be deleted unless we change value of this column in the
base table to this specific value.
In this patch we add a check for this column in the PRUNE MATERIALIZED VIEW logic.
If this column isn't the same in the base table and the view, these rows are also
deleted.

Fixes https://github.com/scylladb/scylladb/issues/25655

Closes scylladb/scylladb#25720

(cherry picked from commit 1f9be235b8)

Closes scylladb/scylladb#25956
2025-09-15 12:26:02 +02:00
Jenkins Promoter
93da39020f Update ScyllaDB version to: 2025.3.2 2025-09-15 11:12:31 +03:00
Jenkins Promoter
04b0d7b629 Update pgo profiles - aarch64 2025-09-15 05:35:35 +03:00
Jenkins Promoter
92d0b05bd0 Update pgo profiles - x86_64 2025-09-15 05:04:20 +03:00
Patryk Jędrzejczak
b5cbe0d50a Merge '[Backport 2025.3] test: cluster: deflake consistency checks after decommission' from Scylladb[bot]
In the Raft-based topology, a decommissioning node is removed from group
0 after the decommission request is considered finished (and the token
ring is updated). Therefore, `check_token_ring_and_group0_consistency`
called just after decommission might fail when the decommissioned node
is still in group 0 (as a non-voter). We deflake all tests that call
`check_token_ring_and_group0_consistency` after decommission in this PR.

Fixes #25809

This PR improves CI stability and changes only tests, so it should be
backported to all supported branches.

- (cherry picked from commit e41fc841cd)

- (cherry picked from commit bb9fb7848a)

Parent PR: #25927

Closes scylladb/scylladb#25963

* https://github.com/scylladb/scylladb:
  test: cluster: deflake consistency checks after decommission
  test: cluster: util: handle group 0 changes after token ring changes in wait_for_token_ring_and_group0_consistency
2025-09-11 13:01:54 +02:00
Patryk Jędrzejczak
2ce95c429f test: cluster: deflake consistency checks after decommission
In the Raft-based topology, a decommissioning node is removed from group
0 after the decommission request is considered finished (and the token
ring is updated). Therefore, `check_token_ring_and_group0_consistency`
called just after decommission might fail when the decommissioned node
is still in group 0 (as a non-voter). We deflake all tests that call
`check_token_ring_and_group0_consistency` after decommission in this
commit.

Fixes #25809

(cherry picked from commit bb9fb7848a)
2025-09-10 17:49:12 +00:00
Patryk Jędrzejczak
b4e64e5adf test: cluster: util: handle group 0 changes after token ring changes in wait_for_token_ring_and_group0_consistency
In the Raft-based topology, a decommissioning node is removed from group
0 after the decommission request is considered finished (and the token
ring is updated). `wait_for_token_ring_and_group0_consistency` doesn't
handle such a case; it only handles cases where the token ring is
updated later. We fix this in this commit.

We rely on the new implementation of
`wait_for_token_ring_and_group0_consistency` in the following commit to
fix flakiness of some tests.

We also update the obsolete docstring in this commit.

(cherry picked from commit e41fc841cd)
2025-09-10 17:49:12 +00:00
Dawid Mędrek
3dac49c62f test/perf: Adjust tablet_load_balancing.cc to RF-rack-validity
We modify the logic to make sure that all of the keyspaces that the test
creates are RF-rack-valid. For that, we distribute the nodes across two
DCs and as many racks as the provided replication factor.

That may have an effect on the load balancing logic, but since this is
a performance test and since tablet load balancing is still taking place,
it should be acceptable.

This commit also finishes work in adjusting perf tests to pass with
the `rf_rack_valid_keyspaces` configuration option enabled. The remaining
tests either don't attempt to create keyspaces or they already create
RF-rack-valid keyspaces.

We don't need to explicitly enable the configuration option. It's already
enabled by default by `cql_test_config`. The reason why we haven't run into
any issue because of that is that performance tests are not part of our CI.

Fixes scylladb/scylladb#25127

Closes scylladb/scylladb#25728

(cherry picked from commit 789a4a1ce7)

Closes scylladb/scylladb#25922
2025-09-10 10:30:40 +03:00
Asias He
ac88ea8152 streaming: Fix use after move in the tablet_stream_files_handler
The files object is moved before the log when stream finishes. We've
logged the files when the stream starts. Skip it in the end of
streaming.

Fixes #25830

Closes scylladb/scylladb#25835

(cherry picked from commit 451e1ec659)

Closes scylladb/scylladb#25891
2025-09-10 10:30:11 +03:00
Wojciech Mitros
055a6c2cee storage_proxy: send hints to pending replicas
Consider the following scenario:
- Current replica set is [A, B, C]
- write succeeds on [A, B], and a hint is logged for node C
- before the hint is replayed, D bootstraps and the token migrates from C to D
- hint is replayed to node C while D is pending, but it's too late, since streaming for that token is already done
- C is cleaned up, replayed data is lost, and D has a stale copy until next repair.
In the scenario we effectively fail to send the hint. This scenario is also more likely to happen with tablets,
as it can happen for every tablet migration.

This issue is particularly detrimental to materialized views. View updates use hints by default and a specific
view update may be sent to just one view replica (when a single base replica has a different row state due to
reordering or missed writes). When we lose a hint for such a view update, we can generate a persistent inconsistency
between the base and view - ghost rows can appear due to a lost tombstone and rows may be missing in the view due
to a lost row update. Such inconsistencies can't be fixed neither by repairing the view or the base table.

To handle this, in this patch we add the pending replicas to the list of targets of each hint, even if the original
target is still alive.

This will cause some updates to be redundant. These updates are probably unavoidable for now, but they shouldn't
be too common either. The scenarios for them are:
1. managing to send the hint to the source of a migrating replica before streaming that its token - the write will
arrive on the pending replica anyway in streaming
2. the hint target not being the source of the migration - if we managed to apply the original write of the hint to
the actual source of the migration, the pending replica will get it during streaming
3. sending the same hint to many targets at a similar time - while sending to each target, we'll see the same pending
replica for the hint so we'll send it multiple times
4. possible retries where even though the hint was successfully sent to the main target, we failed to send it to the
pending replica, so we need to retry the entire write

This patch handles both tablet migrations and tablet rebuilds. In the future, for tablet migrations, we can avoid
sending the hint to pending replias if the hint target is not the source fo the migration, which would allow us to
avoid the redundant writes 2 and 3. For rack-aware RF, this will be as simple as checking whether the replicas are
in the same rack.

We also add a test case reproducing the issue.

Co-Authored-By: Raphael S. Carvalho <raphaelsc@scylladb.com>

Fixes https://github.com/scylladb/scylladb/issues/19835

Closes scylladb/scylladb#25590

(cherry picked from commit 10b8e1c51c)

Closes scylladb/scylladb#25882
2025-09-10 10:29:52 +03:00
Pavel Emelyanov
81e4c65f8c Merge '[Backport 2025.3] Allow users to SELECT from CDC log tables they created.' from Scylladb[bot]
Before the patch, user with CREATE access could create a table with CDC or alter the table enabling CDC, but could not query a SELECT on the CDC table they created.
It was due to the fact, the SELECT permission was checked on the CDC log, and later it's "parent" - the keyspace, but not the base table, on which the user had SELECT permission automatically granted on CREATE.

This patch matches the behavior of querying the CDC log to the one implemented for Materialized Views:
1. No new permissions are granted on CREATE.
2. When querying SELECT, the permissions on base table SELECT are checked.

Fixes: https://github.com/scylladb/scylladb/issues/19798
Fixes: VECTOR-151

- (cherry picked from commit be54346846)

- (cherry picked from commit 5e72d71188)

Parent PR: #25797

Closes scylladb/scylladb#25870

* github.com:scylladb/scylladb:
  cqlpy/test_permissions: run the reproducer tests for #19798
  select_statement: check for access to CDC base table
2025-09-10 10:29:10 +03:00
Pavel Emelyanov
6977c5eaf1 s3: Export memory usage gauge (metrics)
The memory usage is tracked with the help of a semaphore, so just export
its "consumed" units.

One tricky place here is the need to skip metrics registration for
scylla-sstable tool. The thing is that the tools starts the storage
manager and sstables manager on start and then some of tool's operations
may want to start both managers again (via cql environment) causing
double metrics registration exception.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#25769

(cherry picked from commit b26816f80d)

Closes scylladb/scylladb#25865
2025-09-10 10:28:39 +03:00
Yaron Kaikov
bdec3b2bc5 build_docker.sh: enable debug symboles installation
Adding the latest scylla.repo location to our docker container, this
will allow installation scylla-debuginfo package in case it's needed

Fixes: https://github.com/scylladb/scylladb/issues/24271

Closes scylladb/scylladb#25646

(cherry picked from commit d57741edc2)

Closes scylladb/scylladb#25893
2025-09-09 11:41:17 +03:00
Patryk Jędrzejczak
2792fd6383 Merge '[Backport 2025.3] gossiper: fix issues in processing gossip status during the startup and when messages are delayed to avoid empty host ids' from Scylladb[bot]
Populate the local state during gossiper initialization in start_gossiping, preventing an empty state from being added to _endpoint_state_map and returned in get_endpoint_states responses, that was causing an 'empty host id issue' on the other nodes during nodes restart.

Check for a race condition in do_apply_state_locally In do_apply_state_locally, a race condition can occur if a task is suspended at a preemption point while the node entry is not locked.
During this time, the host may be removed from _endpoint_state_map. When the task resumes, this can lead to inserting an entry with an empty host ID into the map, causing various errors, including a node crash.

This change adds a check after locking the map entry: if a gossip ACK update does not contain a host ID, we verify that an entry with that host ID still exists in the gossiper’s _endpoint_state_map.

Fixes https://github.com/scylladb/scylladb/issues/25831
Fixes https://github.com/scylladb/scylladb/issues/25803
Fixes https://github.com/scylladb/scylladb/issues/25702
Fixes https://github.com/scylladb/scylladb/issues/25621

Ref https://github.com/scylladb/scylla-enterprise/issues/5613

Backport: The issue affects all current releases(2025.x), therefore this PR needs to be backported to all 2025.1-2025.3.

- (cherry picked from commit 28e0f42a83)

- (cherry picked from commit f08df7c9d7)

- (cherry picked from commit 775642ea23)

- (cherry picked from commit b34d543f30)

Parent PR: #25849

Closes scylladb/scylladb#25898

* https://github.com/scylladb/scylladb:
  gossiper: fix empty initial local node state
  gossiper: add test for a race condition in start_gossiping
  gossiper: check for a race condition in `do_apply_state_locally`
  test/gossiper: add reproducible test for race condition during node decommission
2025-09-09 10:00:30 +02:00
Sergey Zolotukhin
41dd29f5a3 gossiper: fix empty initial local node state
This change removes the addition of an empty state to `_endpoint_state_map`.
Instead, a new state is created locally and then published via replicate,
avoiding the issue of an empty state existing in `_endpoint_state_map`
before the preemption point. Since this resolves the issue tested in
`test_gossiper_empty_self_id_on_shadow_round`, the `xfail` mark has been removed.

Fixes: scylladb/scylladb#25831
(cherry picked from commit b34d543f30)
2025-09-08 21:55:16 +00:00
Sergey Zolotukhin
13f43e2872 gossiper: add test for a race condition in start_gossiping
This change adds a test for a race condition in `start_gossiping` that
can lead to an empty self state sent in `gossip_get_endpoint_states_response`.

Test for scylladb/scylladb#25831

(cherry picked from commit 775642ea23)
2025-09-08 21:55:16 +00:00
Sergey Zolotukhin
ec85ebf419 gossiper: check for a race condition in do_apply_state_locally
In do_apply_state_locally, a race condition can occur if a task is
suspended at a preemption point while the node entry is not locked.
During this time, the host may be removed from _endpoint_state_map.
When the task resumes, this can lead to inserting an entry with an
empty host ID into the map, causing various errors, including a node
crash.

This change
1. adds a check after locking the map entry: if a gossip ACK update
   does not contain a host ID, we verify that an entry with that host ID
   still exists in the gossiper’s _endpoint_state_map.
2. Removes xfail from the test_gossiper_race test since the issue is now
   fixed.
3. Adds exception handling in `do_shadow_round` to skip responses from
   nodes that sent an empty host ID.

This re-applies the commit 13392a40d4 that
was reverted in 46aa59fe49, after fixing
the issues that caused the CI to fail.

Fixes: scylladb/scylladb#25702
Fixes: scylladb/scylladb#25621

Ref: scylladb/scylla-enterprise#5613
(cherry picked from commit f08df7c9d7)
2025-09-08 21:55:16 +00:00
Emil Maskovsky
b53a5f9b3d test/gossiper: add reproducible test for race condition during node decommission
This change introduces a targeted test that simulates the gossiper race
condition observed during node decommissioning. The test delays gossip
state application and host ID lookup to reliably reproduce the scenario
where `gossiper::get_host_id()` is called on a removed endpoint,
potentially triggering an abort in `apply_new_states`.

There is a specific error injection added to widen the race window, in
order to increase the likelihood of hitting the race condition. The
error injection is designed to delay the application of gossip state
updates, for the specific node that is being decommissioned. This should
then result in the server abort in the gossiper.

This re-applies the commit 5dac4b38fb that
was reverted in dc44fca67c, but modified
to relax the check from "on_internal_error" to a just warning log. The
more strict can be re-introduced later once we are sure that all
remaining problems are resolved and it will not break the CI.

Refs: scylladb/scylladb#25621
Fixes: scylladb/scylladb#25721
(cherry picked from commit 28e0f42a83)
2025-09-08 21:55:16 +00:00
Anna Stuchlik
acd4cbbbe1 doc: add support for i7i instances
This commit adds currently supported i7i and i7ie instances
to the list of instance recommendations.

Fixes https://github.com/scylladb/scylladb/issues/25808

Closes scylladb/scylladb#25817

(cherry picked from commit f66580a28f)

Closes scylladb/scylladb#25853
2025-09-08 10:40:52 +03:00
Dawid Pawlik
4303bb7d56 cqlpy/test_permissions: run the reproducer tests for #19798
Since the previous commit fixes the issue, we can remove the xfail mark.
The tests should pass now.

(cherry picked from commit 5e72d71188)
2025-09-08 07:39:52 +00:00
Dawid Pawlik
675f74b4b7 select_statement: check for access to CDC base table
Before the patch, user with CREATE access could create a table
with CDC or alter the table enabling CDC, but could not query
a SELECT on the CDC table they created.
It was due to the fact, the SELECT permission was checked on
the CDC log, and later it's "parent" - the keyspace,
but not thebase table, on which the user had SELECT permission
automatically granted on CREATE.

This patch matches the behaviour of querying the CDC log
to the one implemented for Materialized Views:
    1. No new permissions are granted on CREATE.
    2. When querying SELECT, the permissions on base table
SELECT are checked.

Fixes: #19798
(cherry picked from commit be54346846)
2025-09-08 07:39:52 +00:00
Avi Kivity
0900a88884 Merge 'auth: move passwords::check call to alien thread' from Andrzej Jackowski
Analysis of customer stalls revealed that the function `detail::hash_with_salt` (invoked by `passwords::check`) often blocks the reactor. Internally, this function uses the external `crypt_r` function to compute password hashes, which is CPU-intensive.

This PR addresses the issue in two ways:
1) `sha-512` is now the only password hashing scheme for new passwords (it was already the common-case).
2) `passwords::check` is moved to a dedicated alien thread.

Regarding point 1: before this change, the following hashing schemes were supported by     `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However:
- The bcrypt schemes never worked properly because their prefixes lack the required round count (e.g. `$2y$` instead of `$2y$05$`). Moreover, bcrypt is slower than SHA-512, so it  not good idea to fix or use it.
- SHA-256 and SHA-512 both belong to the SHA-2 family. Libraries that support one almost always support the other, so it’s very unlikely to find SHA-256 without SHA-512.
- MD5 is no longer considered secure for password hashing.

Regarding point 2: the `passwords::check` call now runs on a shared alien thread created at database startup. An `std::mutex` synchronizes that thread with the shards. In theory this could introduce a frequent lock contention, but in practice each shard handles only a few hundred new connections per second—even during storms. There is already `_conns_cpu_concurrency_semaphore` in `generic_server` limits the number of concurrent connection handlers.

Fixes https://github.com/scylladb/scylladb/issues/24524

Backport not needed, as it is a new feature.

Closes scylladb/scylladb#24924

* github.com:scylladb/scylladb:
  main: utils: add thread names to alien workers
  auth: move passwords::check call to alien thread
  test: wait for 3 clients with given username in test_service_level_api
  auth: refactor password checking in password_authenticator
  auth: make SHA-512 the only password hashing scheme for new passwords
  auth: whitespace change in identify_best_supported_scheme()
  auth: require scheme as parameter for `generate_salt`
  auth: check password hashing scheme support on authenticator start

(cherry picked from commit c762425ea7)
2025-09-07 13:38:33 +03:00
Calle Wilund
2bbf3cf669 system_keyspace: Prune dropped tables from truncation on start/drop
Fixes #25683

Once a table drop is complete, there should be no reason to retain
truncation records for it, as any replay should skip mutations
anyway (no CF), and iff we somehow resurrect a dropped table,
this replay-resurrected data is the least problem anyway.

Adds a prune phase to the startup drop_truncation_rp_records run,
which ignores updating, and instead deletes records for non-existant
tables (which should patch any existing servers with lingering data
as well).

Also does an explicit delete of records on actual table DROP, to
ensure we don't grow this table more than needed even in long
uptime nodes.

Small unit test included.

Closes scylladb/scylladb#25699

(cherry picked from commit bc20861afb)

Closes scylladb/scylladb#25815
2025-09-05 19:02:39 +03:00
Botond Dénes
c30c1ec40a Merge '[Backport 2025.3] drop table: fix crash on drop table with concurrent cleanup' from Scylladb[bot]
Consider the following scenario:

- A tablet is migrated away from a shard
- The tablet cleanup stage closes the storage group's async_gate
- A drop table runs truncate which attempts to disable compaction on the tablet with its gate closed. This fails, because table::parallel_foreach_compaction_group() ultimately calls storage_group_manager::parallel_foreach_storage_group() which will not disable compaction if it can't hold the storage group's gate
- Truncate calls table::discard_sstables() which checks if the compaction has been disabled, and because it hasn't, it then runs on_internal_error() with "compaction not disabled on table ks.cf during TRUNCATE" which causes a crash

Fixes: #25706

This needs to be backported to all supported versions with tablets

- (cherry picked from commit a0934cf80d)

- (cherry picked from commit 1b8a44af75)

Parent PR: #25708

Closes scylladb/scylladb#25785

* github.com:scylladb/scylladb:
  test: reproducer and test for drop with concurrent cleanup
  truncate: check for closed storage group's gate in discard_sstables
2025-09-05 19:02:04 +03:00
Andrei Chekun
2ee1082561 test.py: modify run to use different junit output filenames
Currently, run will execute twice pytest without modifying the path of the
JUnit XML report. This leads that the second execution of the pytest
will override the report. This PR fixing this issue so both reports will
be stored.

Closes scylladb/scylladb#25726

(cherry picked from commit e55c8a9936)

Closes scylladb/scylladb#25778
2025-09-05 19:01:22 +03:00
Pavel Emelyanov
f1e3dedcd6 Revert "test/gossiper: add reproducible test for race condition during node decommission"
This reverts commit 4e17330a1b because
parent PR had been reverted as per #25803
2025-09-05 10:08:29 +03:00
Nadav Har'El
5d6aa6e8c2 utils, alternator: fix detection of invalid base-64
This patch fixes an error-path bug in the base-64 decoding code in
utils/base64.cc, which among other things is used in Alternator to decode
blobs in JSON requests.

The base-64 decoding code has a lookup table, which was wrongly sized 255
bytes, but needed to be 256 bytes. This meant that if the byte 255 (0xFF)
was included in an invalid base-64 string, instead of detecting that this
is an invalid byte (since the only valid bytes in a base-64 string are
A-Z,a-z,0-9,+,/ and =), the code would either think it's valid with a
nonsense 6-bit part, or even crash on an out-of-bounds read.

Besides the trivial fix, this patch also includes a reproducing test,
which tries to write a blob as a supposedly base-64 encoded string with
a 0xFF byte in it. The test fails before this patch (the write succeeds,
unexpectedly), and passes after this patch (the write fails as
expected). The test also passes on DynamoDB.

Fixes #25701

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25705

(cherry picked from commit ff91027eac)

Closes scylladb/scylladb#25767
2025-09-04 11:38:55 +03:00
Pavel Emelyanov
1c8e10231a Merge '[Backport 2025.3] service/qos: Modularize service level controller to avoid invalid access to auth::service' from Scylladb[bot]
Move management over effective service levels from `service_level_controller`
to a new dedicated type -- `auth_integration`.

Before these changes, it was possible for the service level controller to try
to access `auth::service` after it was deinitialized. For instance, it could
happen when reloading the cache. That HAS happened as described in the following
issue: scylladb/scylladb#24792.

Although the problem might have been mitigated or even resolved in
scylladb/scylladb@10214e13bd, it's not clear
how the service will be used in the future. It's best to prevent similar bugs
than trying to fix them later on.

The logic responsible for preventing to access an uninitialized `auth::service`
was also either non-existent, complex, or non-sufficient.

To prevent accessing `auth::service` by the service level controller, we extract
the relevant portion of the code to a separate entity -- `auth_integration`.
It's an internal helper type whose sole purpose is to manage effective service
levels.

Thanks to that, we were able to nest the lifetime of `auth_integration` within
the lifetime of `auth::service`. It's now impossible to attempt to dereference
it while it's uninitialized.

If a bug related to an invalid access is spotted again, though, it might also
be easier to debug it now.

There should be no visible change to the users of the interface of the service
level controller. We strived to make the patch minimal, and the only affected
part of the logic should be related to how `auth::service` is accessed.

The relevant portion of the initialization and deinitialization flow:

(a) Before the changes:

1. Initialize `service_level_controller`. Pass a reference to an uninitialized
   `auth::service` to it.
2. Initialize other services.
3. Initialize and start `auth::service`.
4. (work)
5. Stop and deinitialize `auth::service`.
6. Deinitialize other services.
7. Deinitialize `service_level_controller`.

(b) After the changes:

1. Initialize `service_level_controller`. Pass a reference to an uninitialized
   `auth::service` to it. (*)
2. Initialize other services.
3. Initialize and start `auth::service`.
4. Initialize `auth_integration`. Register it in `service_level_controller`.
5. (work)
6. Unregister `auth_integration` in `service_level_controller` and deinitialize
   it.
7. Stop and deinitialize `auth::service`.
8. Deinitialize other services.
9. Deinitialize `service_level_controller`.

(*):
    The reference to `auth::service` in `service_level_controller` is still
    necessary. We need to access the service when dropping a distributed
    service level.

    Although it would be best to cut that link between the service level
    controller and `auth::service` too, effectively separating the entities,
    it would require more work, so we leave it as-is for now.

    It shouldn't prove problematic as far as accessing an uninitialized service
    goes. Trying to drop a service level at the point when we're de-initializing
    auth should be impossible.

    For more context, see the function `drop_distributed_service_level` in
    `service_level_controller`.

A trivial test has been included in the PR. Although its value is questionable
as we only try to reload the service level cache at a specific moment, it's
probably the best we can deliver to provide a reproducer of the issue this patch
is resolving.

Fixes scylladb/scylladb#24792

Backport: The impact of the bug was minimal as it only affected the shutdown.
However, since CI is failing because of it, let's backport the change to all
supported versions.

- (cherry picked from commit 7d0086b093)

- (cherry picked from commit 34afb6cdd9)

- (cherry picked from commit e929279d74)

- (cherry picked from commit dd5a35dc67)

- (cherry picked from commit fc1c41536c)

Parent PR: #25478

Closes scylladb/scylladb#25753

* github.com:scylladb/scylladb:
  service/qos: Move effective SL cache to auth_integration
  service/qos: Add auth::service to auth_integration
  service/qos: Reload effective SL cache conditionally
  service/qos: Add gate to auth_integration
  service/qos: Introduce auth_integration
2025-09-04 11:38:17 +03:00
Pavel Emelyanov
d484837a2a Merge '[Backport 2025.3] db/hints: Improve logs' from Scylladb[bot]
Before these changes, the logs in hinted handoff often didn't provide
crucial information like the identifier of the node that hints were
being sent to. Also, some of the logs were misleading and referred to
other places in the code than the one where an exception or some other
situation really occurred.

We modify those logs, extending them by more valuable information
and fixing existing issues. What's more, all of the logs in
`hint_endpoint_manager` and `hint_sender` follow a consistent format
now:

```
<class_name>[<destination host ID>]:<function_name>: <message>
```

This way, we should always have AT LEAST the basic information.

Fixes scylladb/scylladb#25466

Backport:
There is no risk in backporting these changes. They only have
impact on the logs. On the other hand, they might prove helpful
when debugging an issue in hinted handoff.

- (cherry picked from commit 2327d4dfa3)

- (cherry picked from commit d7bc9edc6c)

- (cherry picked from commit 6f1fb7cfb5)

Parent PR: #25470

Closes scylladb/scylladb#25538

* github.com:scylladb/scylladb:
  db/hints: Add new logs
  db/hints: Adjust log levels
  db/hints: Improve logs
2025-09-04 11:36:30 +03:00
Pavel Emelyanov
ad6dbcfdc5 Merge '[Backport 2025.3] generic server: 2 step shutdown' from Scylladb[bot]
This PR implements solution proposed in scylladb/scylladb#24481

Instead of terminating connections immediately, the shutdown now proceeds in two stages: first closing the receive (input) side to stop new requests, then waiting for all active requests to complete before fully closing the connections.

The updated shutdown process is as follows:

1. Initial Shutdown Phase
   * Close the accept gate to block new incoming connections.
   * Abort all accept() calls.
   * For all active connections:
      * Close only the input side of the connection to prevent new requests.
      * Keep the output side open to allow responses to be sent.

2. Drain Phase
   * Wait for all in-progress requests to either complete or fail.

3. Final Shutdown Phase
   * Fully close all connections.

Fixes scylladb/scylladb#24481

- (cherry picked from commit 122e940872)

- (cherry picked from commit 3848d10a8d)

- (cherry picked from commit 3610cf0bfd)

- (cherry picked from commit 27b3d5b415)

- (cherry picked from commit 061089389c)

- (cherry picked from commit 7334bf36a4)

- (cherry picked from commit ea311be12b)

- (cherry picked from commit 4f63e1df58)

Parent PR: #24499

Closes scylladb/scylladb#25519

* github.com:scylladb/scylladb:
  test: Set `request_timeout_on_shutdown_in_seconds` to `request_timeout_in_ms`,  decrease request timeout.
  generic_server: Two-step connection shutdown.
  transport: consmetic change, remove extra blanks.
  transport: Handle sleep aborted exception in sleep_until_timeout_passes
  generic_server: replace empty destructor with `= default`
  generic_server: refactor connection::shutdown to use `shutdown_input` and `shutdown_output`
  generic_server: add `shutdown_input` and `shutdown_output` functions to `connection` class.
  test: Add test for query execution during CQL server shutdown
2025-09-04 11:35:55 +03:00
Ran Regev
a79cbd9a9a docs: backup and restore feature
added backup and restore as a feature
to documentation

Signed-off-by: Ran Regev <ran.regev@scylladb.com>

Closes scylladb/scylladb#25608

(cherry picked from commit 515d9f3e21)

Closes scylladb/scylladb#25748
2025-09-03 12:37:45 +03:00
Emil Maskovsky
4e17330a1b test/gossiper: add reproducible test for race condition during node decommission
This change introduces a targeted test that simulates the gossiper race
condition observed during node decommissioning. The test delays gossip
state application and host ID lookup to reliably reproduce the scenario
where `gossiper::get_host_id()` is called on a removed endpoint,
potentially triggering an abort in `apply_new_states`.

There is a specific error injection added to widen the race window, in
order to increase the likelihood of hitting the race condition. The
error injection is designed to delay the application of gossip state
updates, for the specific node that is being decommissioned. This should
then result in the server abort in the gossiper.

Refs: scylladb/scylladb#25621
Fixes: scylladb/scylladb#25721

Backport: The test is primarily for an issue found in 2025.1, so it
needs to be backported to all the 2025.x branches.

Closes scylladb/scylladb#25685

(cherry picked from commit 5dac4b38fb)

Closes scylladb/scylladb#25781
2025-09-02 08:29:27 +02:00
Ferenc Szili
6a7a5f5edc test: reproducer and test for drop with concurrent cleanup
This change adds a reproducer and test for issue #25706

(cherry picked from commit 1b8a44af75)
2025-09-02 02:18:56 +00:00
Ferenc Szili
34b403747a truncate: check for closed storage group's gate in discard_sstables
Consider the following scenario:

- A tablet is migrated away from a shard
- The tablet cleanup stage closes the storage group's async_gate
- A drop table runs truncate which attempts to disable compaction on the
  tablet with its gate closed. This fails, because
  table::parallel_foreach_compaction_group() ultimately calls
  storage_group_manager::parallel_foreach_storage_group() which will not
  disable compaction if it can't hold the storage group's gate
- Truncate calls table::discard_sstables() which checks if the compaction
  has been disabled, and because it hasn't, it then runs
  on_internal_error() with "compaction not disabled on table ks.cf during
  TRUNCATE" which causes a crash

This patch makes dicard_sstables check if the storage group's gate is
closed whend checking for disabled compaction.

(cherry picked from commit a0934cf80d)
2025-09-02 02:18:56 +00:00
Piotr Dulikowski
debc637ac1 Merge '[Backport 2025.3] system_keyspace: add peers cache to get_ip_from_peers_table' from Scylladb[bot]
The gossiper can call `storage_service::on_change` frequently (see  scylladb/scylla-enterprise#5613), which may cause high CPU load and even trigger OOMs or related issues.

This PR adds a temporary cache for `system.peers` to resolve host_id -> ip without hitting storage on every call. The cache is short-lived to handle the unlikely case where `system.peers` is updated directly via CQL.

This is a temporary fix; a more thorough solution is tracked in https://github.com/scylladb/scylladb/issues/25620.

Fixes scylladb/scylladb#25660

backport: this patch needs to be backported to all supported versions (2025.1/2/3).

- (cherry picked from commit 91c633371e)

- (cherry picked from commit de5dc4c362)

- (cherry picked from commit 4b907c7711)

Parent PR: #25658

Closes scylladb/scylladb#25766

* github.com:scylladb/scylladb:
  storage_service: move get_host_id_to_ip_map to system_keyspace
  system_keyspace: use peers cache in get_ip_from_peers_table
  storage_service: move get_ip_from_peers_table to system_keyspace
2025-09-01 21:21:26 +02:00
Taras Veretilnyk
ddb7c8ea12 keys: from_nodetool_style_string don't split single partition keys
Users with single-column partition keys that contain colon characters
were unable to use certain REST APIs and 'nodetool' commands, because the
API split key by colon regardless of the partition key schema.

Affected commands:
- 'nodetool getendpoints'
- 'nodetool getsstables'
Affected endpoints:
- '/column_family/sstables/by_key'
- '/storage_service/natural_endpoints'

Refs: #16596 - This does not fully fix the issue, as users with compound
keys will face the issue if any column of the partition key contains
a colon character.

Closes scylladb/scylladb#24829

Closes scylladb/scylladb#25565
2025-09-01 15:36:56 +03:00
Petr Gusev
c4386c2aa4 storage_service: move get_host_id_to_ip_map to system_keyspace
Reimplemented the function to use the peers cache. It could be replaced
with get_ip_from_peers_table, but that would create a coroutine frame for
each call.

(cherry picked from commit 4b907c7711)
2025-09-01 11:22:55 +02:00
Petr Gusev
7ec3e166c6 system_keyspace: use peers cache in get_ip_from_peers_table
The storage_service::on_change method can be called quite often
by the gossiper, see scylladb/scylla-enterprise#5613. In this commit
we introduce a temporal cache for system.peers so that we don't have
to go to the storage each time we need to resolve host_id -> ip.
We keep the cache only for a small amount of time to handle the
(unlikely) scenario when the user wants to update system.peers table
from CQL.

Fixes scylladb/scylladb#25660

(cherry picked from commit de5dc4c362)
2025-09-01 11:22:05 +02:00
Petr Gusev
5f8664757a storage_service: move get_ip_from_peers_table to system_keyspace
We plan to add a cache to get_ip_from_peers_table in upcoming commits.
It's more convenient to do this from system_keyspace, since the only two
methods that mutate system.peers (remove_endpoint and update_peers_info)
are already there.

(cherry picked from commit 91c633371e)
2025-09-01 11:21:55 +02:00
Calle Wilund
2e08d651a8 system_keyspace: Limit parallelism in drop_truncation_records
Fixes #25682
Refs scylla-enterprise#5580

If the truncation table is large in entries, we might create a
huge parallel execution, quite possibly consuming loads of resources
doing something quite trivial.
Limit concurrency to a small-ish number

Closes scylladb/scylladb#25678

(cherry picked from commit 2eccd17e70)

Closes scylladb/scylladb#25751
2025-09-01 09:13:44 +03:00
Emil Maskovsky
05f8b0d543 storage: pass host_id as parameter to maybe_reconnect_to_preferred_ip()
Previously, `maybe_reconnect_to_preferred_ip()` retrieved the host ID
using `gossiper::get_host_id()`. Since the host ID is already available
in the calling function, we now pass it directly as a parameter.

This change simplifies the code and eliminates a potential race condition
where `gossiper::get_host_id()` could fail, as described in scylladb/scylladb#25621.

Refs: scylladb/scylladb#25621
Fixes: scylladb/scylladb#25715

Backport: Recommended for 2025.x release branches to avoid potential issues
from unnecessary calls to `gossiper::get_host_id()` in subscribers.

(cherry picked from commit cfc87746b6)

Closes scylladb/scylladb#25718
2025-09-01 09:13:21 +03:00
kendrick-ren
d7a36c6d8e Update launch-on-gcp.rst
Add the missing '=' mark in --zone option. Otherwise the command complains.

Closes scylladb/scylladb#25471

(cherry picked from commit d6e62aeb6a)

Closes scylladb/scylladb#25645
2025-09-01 09:11:21 +03:00
Benny Halevy
2a6791d246 api: storage_service: fix token_range documentation
Note that the token_range type is used only by describe_ring.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#25609

(cherry picked from commit 45c496c276)

Closes scylladb/scylladb#25640
2025-09-01 09:11:12 +03:00
Pavel Emelyanov
95d953b7b9 Merge '[Backport 2025.3] cql3: Warn when creating RF-rack-invalid keyspace' from Scylladb[bot]
Although RF-rack-valid keyspaces are not universally enforced
yet (they're governed by the configuration option
`rf_rack_valid_keyspaces`), we'd like to encourage the user to
abide by the restriction.

To that end, we're introducing a warning when creating or
altering a keyspace. If the configuration option is disabled,
but the user is trying to create an RF-rack-invalid keyspace,
they'll receive a warning.

If the option is turned off, we will also log all of the
RF-rack-invalid keyspaces at start-up.

We provide validation tests.

Fixes scylladb/scylladb#23330

Backport: we'd like to encourage the user to abide by the restriction
even when they don't enforce it to make it easier in the future to
adjust the schema when there's no way to disable it anymore. Because
of that, we'd like to backport it to all relevant versions, starting with 2025.1.

- (cherry picked from commit 60ea22d887)

- (cherry picked from commit af8a3dd17b)

- (cherry picked from commit 837d267cbf)

Parent PR: #24785

Closes scylladb/scylladb#25635

* github.com:scylladb/scylladb:
  main: Log RF-rack-invalid keyspaces at startup
  cql3/statements: Fix indentation
  cql3: Warn when creating RF-rack-invalid keyspace
2025-09-01 09:11:01 +03:00
David Garcia
3db935f30f docs: expose alternator metrics
Renders in the docs some metrics introduced in https://github.com/scylladb/scylladb/pull/24046/files that were not being displayed in https://docs.scylladb.com/manual/stable/reference/metrics.html

Closes scylladb/scylladb#25561

(cherry picked from commit c3c70ba73f)

Closes scylladb/scylladb#25629
2025-09-01 09:10:41 +03:00
Michał Chojnowski
05ea29ee8d sstables/types.hh: fix fmt::formatter<sstables::deletion_time>
Obvious typo.

Fixes scylladb/scylladb#25556

Closes scylladb/scylladb#25557

(cherry picked from commit c1b513048c)

Closes scylladb/scylladb#25588
2025-09-01 09:10:21 +03:00
Dawid Mędrek
7f58681482 db/commitlog: Extend error messages for corrupted data
We're providing additional information in error messages when throwing
an exception related to data corruption: when a segment is truncated
and when it's content is invalid. That might prove helpful when debugging.

Closes scylladb/scylladb#25190

(cherry picked from commit 408b45fa7e)

Closes scylladb/scylladb#25461
2025-09-01 09:08:29 +03:00
Andrei Chekun
8163f4edaa test.py: use unique hostname for Minio
To avoid situation that port is occupied on localhost, use unique
hostname for Minio

(cherry picked from commit c6c3e9f492)

Closes scylladb/scylladb#24775
2025-09-01 08:59:00 +03:00
Pavel Emelyanov
0c6c507704 Merge '[Backport 2025.3] test.py: add missed parameters that should be passed from test.py to pytest' from Scylladb[bot]
Several parameters that `test.py` should pass to pytest->boost were missing. This PR adds handling these parameters: `--random-seed` and `--x-log2-compaction-groups`

Since this code affected with this issue in 2025.3 and this is only framework change, backport for that version needed.

Fixes: https://github.com/scylladb/scylladb/issues/24927

- (cherry picked from commit 71b875c932)

- (cherry picked from commit f7c7877ba6)

Parent PR: #24928

Closes scylladb/scylladb#25035

* github.com:scylladb/scylladb:
  test.py: add bypassing x_log2_compaction_groups to boost tests
  test.py: add bypassing random seed to boost tests
2025-09-01 08:58:48 +03:00
Jenkins Promoter
3da82e8572 Update pgo profiles - aarch64 2025-09-01 05:24:53 +03:00
Jenkins Promoter
16c7bd4c6e Update pgo profiles - x86_64 2025-09-01 05:01:14 +03:00
Jenkins Promoter
3c922d68f0 Update ScyllaDB version to: 2025.3.1 2025-08-31 11:05:24 +03:00
Calle Wilund
fe87af4674 commitlog: Ensure segment deletion is re-entrant
Fixes #25709

If we have large allocations, spanning more than one segment, and
the internal segment references from lead to secondary are the
only thing keeping a segment alive, the implicit drop in
discard_unused_segments and orphan_all can cause a recursive call
to discard_unused_segments, which in turn can lead to vector
corruption/crash, or even double free of segment (iterator confusion).

Need to separate the modification of the vector (_segments) from
actual releasing of objects. Using temporaries is the easiest
solution.

To further reduce recursion, we can also do an early clear of
segment dependencies in callbacks from segment release (cf release).

Closes scylladb/scylladb#25719

(cherry picked from commit cc9eb321a1)

Closes scylladb/scylladb#25756
2025-08-30 18:50:47 +03:00
Dawid Mędrek
5c947a936e service/qos: Move effective SL cache to auth_integration
Since `auth_integration` manages effective service levels, let's move
the relevant cache from `service_level_controller` to it.

(cherry picked from commit fc1c41536c)
2025-08-29 22:58:21 +00:00
Dawid Mędrek
fcdd21948d service/qos: Add auth::service to auth_integration
The new service, `auth_integration`, has taken over the responsibility
over managing effective service levels from `service_level_controller`.
However, before these changes, it still accessed `auth::service` via
the service level controller. Let's change that.

Note that we also remove a check that `auth::service` has been
initialized. It's not necessary anymore because the lifetime of
`auth_integration` is strictly nested within the lifetime of `auth::service`.

In actuality, `service_level_controller` should lose its reference to
`auth::service` completely. All of the management over effective service
levels has already been moved to `auth_integration`. However, the
referernce is still needed when dropping a distributed service level
because we need to update the corresponding attribute for relevant
roles.

That should not lead to invalid accesses, though. Dropping a service level
should not be possible when `auth::service` is not initialized.

(cherry picked from commit dd5a35dc67)
2025-08-29 22:58:21 +00:00
Dawid Mędrek
aea5805c1f service/qos: Reload effective SL cache conditionally
Since `service_level_controller` outlives `auth_integration`, it may
happen that we try to access it when it has already been deinitialized.
To prevent that, we only try to reload or clear the effective service
level cache when the object is still alive.

These changes solve an existing problem with an invalid memory access.
For more context, see issue scylladb/scylladb#24792.

We provide a reproducer test that consistently fails before these
changes but passes after them.

Fixes scylladb/scylladb#24792

(cherry picked from commit e929279d74)
2025-08-29 22:58:20 +00:00
Dawid Mędrek
753305763a service/qos: Add gate to auth_integration
We add a named gate to `auth_integration` that will aid us in synchronizing
ongoing tasks with stopping the service.

(cherry picked from commit 34afb6cdd9)
2025-08-29 22:58:20 +00:00
Dawid Mędrek
4b69c74385 service/qos: Introduce auth_integration
We introduce a new type, `auth_integration`, that will be used internally
by `service_level_controller`. Its purpose is to take over the responsibility
over managing effective service levels.

The main problem of the current implementation of service level controller
is its dependency on `auth::service` whose lifetime is strictly nested
within the lifetime of service level controller. That may and already have
led to invalid memory accesses; for an example, see issue
scylladb/scylladb#24792.

Our strategy is to split service level controller into smaller parts and
ensure that we access `auth::service` only when it's valid to do so.
This commit is the first step towards that.

We don't change anything in the logic yet, just add the new type. Further
adjustments will be made in following commits.

(cherry picked from commit 7d0086b093)
2025-08-29 22:58:20 +00:00
Jenkins Promoter
d9e492a90c Update ScyllaDB version to: 2025.3.0 2025-08-27 14:38:30 +03:00
Andrei Chekun
6ee92600e2 test.py: add bypassing x_log2_compaction_groups to boost tests
Bypassing argument to pytest->boost that was missing.

(cherry picked from commit f7c7877ba6)
2025-08-25 15:15:30 +02:00
Andrei Chekun
c55919242d test.py: add bypassing random seed to boost tests
Bypassing argument to pytest->boost that was missing.

Fixes: https://github.com/scylladb/scylladb/issues/24927
(cherry picked from commit 71b875c932)
2025-08-25 15:14:52 +02:00
Dawid Mędrek
9652a1260f main: Log RF-rack-invalid keyspaces at startup
When the configuration option `rf_rack_valid_keyspaces` is enabled and there
is an RF-rack-invalid keyspace, starting a node fails. However, when the
configuration option is disabled, but there still is a keyspace that violates
the condition, we'd like Scylla to print a warning informing the user about
the fact. That's what happens in this commit.

We provide a validation test.

(cherry picked from commit 837d267cbf)
2025-08-22 14:31:49 +00:00
Dawid Mędrek
5f13044627 cql3/statements: Fix indentation
(cherry picked from commit af8a3dd17b)
2025-08-22 14:31:49 +00:00
Dawid Mędrek
cd795170b4 cql3: Warn when creating RF-rack-invalid keyspace
Although RF-rack-valid keyspaces are not universally enforced
yet (they're governed by the configuration option
`rf_rack_valid_keyspaces`), we'd like to encourage the user to
abide by the restriction.

To that end, we're introducing a warning when creating or
altering a keyspace. If the configuration option is disabled,
but the user is trying to create an RF-rack-invalid keyspace,
they'll receive a warning.

We provide a validation test.

(cherry picked from commit 60ea22d887)
2025-08-22 14:31:49 +00:00
Ferenc Szili
acb542606e test: remove test_tombstone_gc_disabled_on_pending_replica
The test test_tombstone_gc_disabled_on_pending_replica was added when
we fixed (#20788) the potential problem with data resurrection during
file based streaming. The issue was occurring only in Enterprise, but
we added the fix in OSS to limit code divergence. This test was added
together with the fix in OSS with the idea to guard this change in OSS.
The real reproducer and test for this fix was added later, after the
fix was ported into Enterprise.
It is in: test/cluster/test_resurrection.py

Since Enterprise has been merged into OSS, there is no more need to
keep the test test_tombstone_gc_disabled_on_pending_replica. Also,
it is flaky with very low probability of failure, making it difficult
to investigate the cause of failure.

Fixes: #22182

Refs: scylladb/scylladb#25448

Closes scylladb/scylladb#25134

(cherry picked from commit 7ce96345bf)

Closes scylladb/scylladb#25573
2025-08-19 16:01:22 +03:00
Piotr Dulikowski
8bd92d4dd0 Merge '[Backport 2025.3] test: test_mv_backlog: fix to consider internal writes' from Scylladb[bot]
The PR fixes a test flakiness issue in test_mv_backlog related to reading metrics.

The first commit fixes a more general issue in the ScyllaMetrics helper class where it doesn't return the value of all matching lines when a specific shard is requested, but it breaks after the first match.

The second commit fixes a test issue where it expects exactly one write to be throttled, not taking into account other internal writes that may be executed during this time.

Fixes https://github.com/scylladb/scylladb/issues/23139

backport to improve CI stability - test only change

- (cherry picked from commit 5c28cffdb4)

- (cherry picked from commit 276a09ac6e)

Parent PR: #25279

Closes scylladb/scylladb#25475

* github.com:scylladb/scylladb:
  test: test_mv_backlog: fix to consider internal writes
  test/pylib/rest_client: fix ScyllaMetrics filtering
2025-08-19 09:48:01 +02:00
Patryk Jędrzejczak
e631d2e872 test: test_maintenance_socket: use cluster_con for driver sessions
The test creates all driver sessions by itself. As a consequence, all
sessions use the default request timeout of 10s. This can be too low for
the debug mode, as observed in scylladb/scylla-enterprise#5601.

In this commit, we change the test to use `cluster_con`, so that the
sessions have the request timeout set to 200s from now on.

Fixes scylladb/scylla-enterprise#5601

This commit changes only the test and is a CI stability improvement,
so it should be backported all the way to 2024.2. 2024.1 doesn't have
this test.

Closes scylladb/scylladb#25510

(cherry picked from commit 03cc34e3a0)

Closes scylladb/scylladb#25547
2025-08-18 16:41:03 +02:00
Dawid Mędrek
d12fdcaa75 db/hints: Add new logs
We're adding new logs in just a few places that may however prove
important when debugging issues in hinted handoff in the future.

(cherry picked from commit 6f1fb7cfb5)
2025-08-18 16:02:01 +02:00
Dawid Mędrek
325831afad db/hints: Adjust log levels
Some of the logs could be clogging Scylla's logs, so we demote their
level to a lower one.

On the other hand, some of the logs would most likely not do that,
and they could be useful when debugging -- we promote them to debug
level.

(cherry picked from commit d7bc9edc6c)
2025-08-18 16:02:00 +02:00
Dawid Mędrek
7b212edd0c db/hints: Improve logs
Before these changes, the logs in hinted handoff often didn't provide
crucial information like the identifier of the node that hints were
being sent to. Also, some of the logs were misleading and referred to
other places in the code than the one where an exception or some other
situation really occurred.

We modify those logs, extending them by more valuable information
and fixing existing issues. What's more, all of the logs in
`hint_endpoint_manager` and `hint_sender` follow a consistent format
now:

```
<class_name>[<destination host ID>]:<function_name>: <message>
```

This way, we should always have AT LEAST the basic information.

(cherry picked from commit 2327d4dfa3)
2025-08-18 16:01:57 +02:00
Sergey Zolotukhin
bad157453b test: Set request_timeout_on_shutdown_in_seconds to request_timeout_in_ms,
decrease request timeout.

In debug mode, queries may sometimes take longer than the default 30 seconds.
To address this, the timeout value `request_timeout_on_shutdown_in_seconds`
during tests is aligned with other request timeouts.
Change request timeout for tests from 180s to 90s since we must keep the request
timeout during shutdown significantly lower than the graceful shutdown timeout(2m),
or else a request timeout would cause a graceful shutdown timeout and fail a test.

(cherry picked from commit 4f63e1df58)
2025-08-18 15:47:08 +02:00
Sergey Zolotukhin
9b7886ed71 generic_server: Two-step connection shutdown.
When shutting down in `generic_server`, connections are now closed in two steps.
First, only the RX (receive) side is shut down. Then, after all ongoing requests
are completed, or a timeout happened the connections are fully closed.

Fixes scylladb/scylladb#24481

(cherry picked from commit ea311be12b)
2025-08-18 15:46:46 +02:00
Sergey Zolotukhin
e2aed2e860 transport: consmetic change, remove extra blanks.
(cherry picked from commit 7334bf36a4)
2025-08-18 14:55:16 +02:00
Anna Stuchlik
977a4a110a doc: add support for RHEL 10
This commit adds RHEL 10 to the list of supported platforms.

Fixes https://github.com/scylladb/scylladb/issues/25436

Closes scylladb/scylladb#25437

(cherry picked from commit 1322f301f6)

Closes scylladb/scylladb#25447
2025-08-18 12:20:19 +03:00
Wojciech Przytuła
666985bbe0 Fix link to ScyllaDB manual
The link would point to outdated OS docs. I fixed it to point to up-to-date Enterprise docs.

Closes scylladb/scylladb#25328

(cherry picked from commit 7600ccfb20)

Closes scylladb/scylladb#25486
2025-08-15 13:31:06 +03:00
Wojciech Mitros
bb6e681b58 test: run mv tests depending on metrics on a standalone instance
The test_base_partition_deletion_with_metrics test case (and the batch
variant) uses the metric of view updates done during its runtime to check
if we didn't perform too many of them. The test runs in the cqlpy suite,
which  runs all test cases sequentially on one Scylla instance. Because
of this, if another test case starts a process which generates view
updates and doesn't wait for it to finish before it exists, we may
observe too many view updates in test_base_partition_deletion_with_metrics
and fail the test.
In all test cases we make sure that all tables that were created
during the test are dropped at the end. However, that doesn't
stop the view building process immediately, so the issue can happen
even if we drop the view. I confirmed it by adding a test just before
test_base_partition_deletion_with_metrics which builds a big
materialized view and drops it at the end - the metrics check still failed.

The issue could be caused by any of the existing test cases where we create
a view and don't wait for it to be built. Note that even if we start adding
rows after creating the view, some of them may still be included in the view
building, as the view building process is started asynchronously. In such
a scenario, the view building also doesn't cause any issues with the data in
these tests - writes performed after view creation generate view updates
synchronously when they're local (and we're running a single Scylla server),
the corresponding view udpates generated during view building are redundant.

Because we have many test cases which could be causing this issue, instead
of waiting for the view building to finish in every single one of them, we
move the susceptible test cases to be run on separate Scylla instances, in
the "cluster" suite. There, no other test cases will influence the results.

Fixes https://github.com/scylladb/scylladb/issues/20379

Closes scylladb/scylladb#25209

(cherry picked from commit 2ece08ba43)

Closes scylladb/scylladb#25504
2025-08-15 13:30:53 +03:00
Ernest Zaslavsky
8a017834a0 s3_client: add memory fallback in chunked_download_source
Introduce fallback logic in `chunked_download_source` to handle
memory exhaustion. When memory is low, feed the `deque` with only
one uncounted buffer at a time. This allows slow but steady progress
without getting stuck on the memory semaphore.

Fixes: https://github.com/scylladb/scylladb/issues/25453
Fixes: https://github.com/scylladb/scylladb/issues/25262

Closes scylladb/scylladb#25452

(cherry picked from commit dd51e50f60)

Closes scylladb/scylladb#25511
2025-08-15 13:30:38 +03:00
Anna Stuchlik
d8d5ab1032 doc: document support for new z3 instance types
This commit adds new z3 instances we now support to the list of GCP instance types.

Fixes https://github.com/scylladb/scylladb/issues/25438

Closes scylladb/scylladb#25446

(cherry picked from commit 841ba86609)

Closes scylladb/scylladb#25512
2025-08-15 13:30:11 +03:00
Andrzej Jackowski
82ee1bf9cb test: audit: add logging of get_audit_log_list and set_of_rows_before
Without those logs, analysing some test failures is difficult.

Refs: scylladb/scylladb#25442

Closes scylladb/scylladb#25485

(cherry picked from commit bf8be01086)

Closes scylladb/scylladb#25514
2025-08-15 13:29:56 +03:00
Abhinav Jha
5e018831f8 raft: replication test: change rpc_propose_conf_change test to SEASTAR_THREAD_TEST_CASE
RAFT_TEST_CASE macro creates 2 test cases, one with random 20% packet
loss named name_drops. The framework makes hard coded assumptions about
leader which doesn't hold well in case of packet losses.

This short term fix disables the packet drop variant of the specified test.
It should be safe to re-enable it once the whole framework is re-worked to
remove these hard coded assumptions.

This PR fixes a bug. Hence we need to backport it.

Fixes: scylladb/scylladb#23816

Closes scylladb/scylladb#25489

(cherry picked from commit a0ee5e4b85)

Closes scylladb/scylladb#25528
2025-08-15 13:29:42 +03:00
Jenkins Promoter
a162e0256e Update pgo profiles - aarch64 2025-08-15 05:28:08 +03:00
Jenkins Promoter
adbbbf87c3 Update pgo profiles - x86_64 2025-08-15 05:05:35 +03:00
Sergey Zolotukhin
e2dcd559b6 transport: Handle sleep aborted exception in sleep_until_timeout_passes
In PR #23156, a new function `sleep_until_timeout_passes` was introduced
to wait until a read request times out or completes. However, the function
did not handle cases where the sleep is aborted via _abort_source, which
could result in WARN messages like "Exceptional future is ignored" during
shutdown.

This change adds proper handling for that exception, eliminating the warning.

(cherry picked from commit 061089389c)
2025-08-14 13:22:36 +00:00
Sergey Zolotukhin
665530e479 generic_server: replace empty destructor with = default
This change improves code readability by explicitly marking the destructor as defaulted.

(cherry picked from commit 27b3d5b415)
2025-08-14 13:22:36 +00:00
Sergey Zolotukhin
d729529226 generic_server: refactor connection::shutdown to use shutdown_input and shutdown_output
This change improves logging and modifies the behavior to attempt closing
the output side of a connection even if an error occurs while closing the input side.

(cherry picked from commit 3610cf0bfd)
2025-08-14 13:22:36 +00:00
Sergey Zolotukhin
2fef421534 generic_server: add shutdown_input and shutdown_output functions to
`connection` class.

The functions are just wrappers for  _fd.shutdown_input() and  _fd.shutdown_output(), with added error reporting.
Needed by later changes.

(cherry picked from commit 3848d10a8d)
2025-08-14 13:22:36 +00:00
Sergey Zolotukhin
0f99fa76de test: Add test for query execution during CQL server shutdown
This test simulates a scenario where a query is being executed while
the query coordinator begins shutting down the CQL server and client
connections. The shutdown process should wait until the query execution
is either completed or timed out.

Test for scylladb/scylladb#24481

(cherry picked from commit 122e940872)
2025-08-14 13:22:36 +00:00
Ernest Zaslavsky
c70ba8384e s3_client: make memory semaphore acquisition abortable
Add `abort_source` to the `get_units` call for the memory semaphore
in the S3 client, allowing the acquisition process to be aborted.

Fixes: https://github.com/scylladb/scylladb/issues/25454

Closes scylladb/scylladb#25469

(cherry picked from commit 380c73ca03)

Closes scylladb/scylladb#25499
2025-08-14 10:34:28 +02:00
Michael Litvak
761f722b6f test: test_mv_backlog: fix to consider internal writes
The test executes a single write, fetching metrics before and after the
write, and expects the total throttled writes count to be increased
exactly by one.

However, other internal writes (compaction for example) may be executed
during this time and be throttled, causing the metrics to be increased
by more than expected.

To address this, we filter the metrics by the scheduling group label of
the user write, to filter out the compaction writes that run in the
compaction scheduling group.

Fixes scylladb/scylladb#23139

(cherry picked from commit 276a09ac6e)
2025-08-12 14:51:10 +00:00
Michael Litvak
3a3b5bb14c test/pylib/rest_client: fix ScyllaMetrics filtering
In the ScyllaMetrics `get` function, when requesting the value for a
specific shard, it is expected to return the sum of all values of
metrics for that shard that match the labels.

However, it would return the value of the first matching line it finds
instead of summing all matching lines.

For example, if we have two lines for one shard like:
some_metric{scheduling_group_name="compaction",shard="0"} 1
some_metric{scheduling_group_name="sl:default",shard="0"} 2

The result of this call would be 1 instead of 3:
get('some_metric', shard="0")

We fix this to sum all matching lines.

The filtering of lines by labels is fixed to allow specifying only some
of the labels. Previously, for the line to match the filter, either the
filter needs to be empty, or all the labels in the metric line had to be
specified in the filter parameter and match its value, which is
unexpected, and breaks when more labels are added.

We also simplify the function signature and the implementation - instead
of having the shard as a separate parameter, it can be specified as a
label, like any other label.

(cherry picked from commit 5c28cffdb4)
2025-08-12 14:51:09 +00:00
Patryk Jędrzejczak
b999aa85b9 docs: Raft recovery procedure: recommend verifying participation in Raft recovery
This instruction adds additional safety. The faster we notice that
a node didn't restart properly, the better.

The old gossip-based recovery procedure had a similar recommendation
to verify that each restarting node entered `RECOVERY` mode.

Fixes #25375

This is a documentation improvement. We should backport it to all
branches with the new recovery procedure, so 2025.2 and 2025.3.

Closes scylladb/scylladb#25376

(cherry picked from commit 7b77c6cc4a)

Closes scylladb/scylladb#25440
2025-08-11 15:49:20 +02:00
Anna Stuchlik
a655c0e193 doc: add new and removed metrics to the 2025.3 upgrade guide
This commit adds the list of new and removed metrics to the already existing upgrade guide
from 2025.2 to 2025.3.

Fixes https://github.com/scylladb/scylladb/issues/24697

Closes scylladb/scylladb#25385

(cherry picked from commit f3d9d0c1c7)

Closes scylladb/scylladb#25416
2025-08-11 06:56:31 +03:00
Botond Dénes
9775b2768b Merge '[Backport 2025.3] GCP Key Provider: Fix authentication issues' from Scylladb[bot]
* Fix discovery of application default credentials by using fully expanded pathnames (no tildes).
* Fix grant type in token request with user credentials.

Fixes #25345.

- (cherry picked from commit 77cc6a7bad)

- (cherry picked from commit b1d5a67018)

Parent PR: #25351

Closes scylladb/scylladb#25407

* github.com:scylladb/scylladb:
  encryption: gcp: Fix the grant type for user credentials
  encryption: gcp: Expand tilde in pathnames for credentials file
2025-08-11 06:52:38 +03:00
Botond Dénes
2048ac88f1 Merge '[Backport 2025.3] test.py: native pytest repeats' from Scylladb[bot]
Previous way of execution repeat was to launch pytest for each repeat.
That was resource consuming, since each time pytest was doing discovery
of the tests. Now all repeats are done inside one pytest process.

Backport for 2025.3 is needed, since this functionality is framework only, and 2025.3 affected with this slow repeats as well.

Fixes: https://github.com/scylladb/scylladb/issues/25391

- (cherry picked from commit cc75197efd)

- (cherry picked from commit 557293995b)

- (cherry picked from commit 853bdec3ec)

- (cherry picked from commit d0e4045103)

Parent PR: #25073

Closes scylladb/scylladb#25392

* github.com:scylladb/scylladb:
  test.py: add repeats in pytest
  test.py: add directories and filename to the log files
  test.py: rename log sink file for boost tests
  test.py: better error handling in boost facade
2025-08-11 06:51:55 +03:00
Szymon Malewski
4c375b257b test/alternator: enable more relevant logs in CI.
This patch sets, for alternator test suite, all 'alternator-*' loggers and 'paxos' logger to trace level. This should significantly ease debugging of failed tests, while it has no effect on test time and increases log size only by 7%.
This affects running alternator tests only with `test.py`, not with `test/alternator/run`.

Closes #24645

Closes scylladb/scylladb#25327

(cherry picked from commit eb11485969)

Closes scylladb/scylladb#25383
2025-08-11 06:51:23 +03:00
Botond Dénes
ea6d0c880a Merge '[Backport 2025.3] test: audit: ignore cassandra user audit logs in AUTH tests' from Scylladb[bot]
Audit tests are vulnerable to noise from LOGIN queries (because AUTH
audit logs can appear at any time). Most tests already use the
`filter_out_noise` mechanism to remove this noise, but tests
focused on AUTH verification did not, leading to sporadic failures.

This change adds a filter to ignore AUTH logs generated by the default
"cassandra" user, so tests only verify logs from the user created
specifically for each test.

Additionally, this PR:
 - Adds missing `nonlocal new_rows` statement that prevented some checks from being called
 - Adds a testcase for audit logs of `cassandra` user

Fixes: https://github.com/scylladb/scylladb/issues/25069

Better backport those test changes to 2025.3. 2025.2 and earlier don't have `./cluster/dtest/audit_test.py`.

- (cherry picked from commit e634a2cb4f)

- (cherry picked from commit daf1c58e21)

- (cherry picked from commit aef6474537)

- (cherry picked from commit 21aedeeafb)

Parent PR: #25111

Closes scylladb/scylladb#25140

* github.com:scylladb/scylladb:
  test: audit: add cassandra user test case
  test: audit: ignore cassandra user audit logs in AUTH tests
  test: audit: change names of `filter_out_noise` parameters
2025-08-11 06:49:36 +03:00
Andrei Chekun
a4ea7b42c8 test.py: add repeats in pytest
Previous way of executin repeat was to launch pytest for each repeat.
That was resource consuming, since each time pytest was doing discovery
of the tests. Now all repeats are done inside one pytest process.

(cherry picked from commit d0e4045103)
2025-08-08 15:27:25 +02:00
Benny Halevy
9a34622a47 scylla-sstable: print_query_results_json: continue loop if row is disengaged
Otherwise it is accessed right when exiting the if block.
Add a unit test reproducing the issue and validating the fix.

Fixes #25325

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#25326

(cherry picked from commit 5e5e63af10)

Closes scylladb/scylladb#25379
2025-08-08 11:43:34 +03:00
Nikos Dragazis
8838d8df5f encryption: gcp: Fix the grant type for user credentials
Exchanging a refresh token for an access token requires the
"refresh_token" grant type [1].

[1] https://datatracker.ietf.org/doc/html/rfc6749#section-6

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit b1d5a67018)
2025-08-07 21:46:24 +00:00
Nikos Dragazis
a69afb0d0b encryption: gcp: Expand tilde in pathnames for credentials file
The GCP host searches for application default credentials in known
locations within the user's home directory using
`seastar::file_exists()`. However, this function does not perform tilde
expansion in pathnames.

Replace tildes with the home directory from the HOME environment
variable.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 77cc6a7bad)
2025-08-07 21:46:24 +00:00
Andrei Chekun
8766750228 test.py: add directories and filename to the log files
Currently, only test function name used for output and log files. For better
clarity adding the relative path from the test directory of the file name
without extension to these files.
Before:
test_aggregate_avg.1.log
test_aggregate_avg_stdout.1.log
After:
boost.aggregate_fcts_test.test_aggregate_avg.1.log
boost.aggregate_fcts_test.test_aggregate_avg_stdout.3.log

(cherry picked from commit 853bdec3ec)
2025-08-07 10:46:58 +00:00
Andrei Chekun
c4cefc5195 test.py: rename log sink file for boost tests
Log sink is outputted in XML format not just simple text file. Renaming to have better clarity

(cherry picked from commit 557293995b)
2025-08-07 10:46:58 +00:00
Andrei Chekun
5f8e69a5d9 test.py: better error handling in boost facade
If test was not executed for some reason, for example not known parameter passed to the test, but boost framework was able to finish correctly, log file will have data but it will be parsed to an empty list. This will raise an exception in pytest execution, rather than produce test output. This change will handle this situation.

(cherry picked from commit cc75197efd)
2025-08-07 10:46:58 +00:00
Avi Kivity
0d54b72f21 Merge '[Backport 2025.3] truncate: change check for write during truncate into a log warning' from Scylladb[bot]
TRUNCATE TABLE performs a memtable flush and then discards the sstables of the table being truncated. It collects the highest replay position for both of these. When the highest replay position of the discarded sstables is higher than the highest replay position of the flushed memtable, that means that we have had writes during truncate which have been flushed to disk independently of the truncate process. We check for this and trigger an on_internal_error() which throws an exception, informing the user that writing data concurrently with TRUNCATE TABLE is not advised.

The problem with this is that truncate is also called from DROP KEYSPACE and DROP TABLE. These are raft operations and exceptions thrown by them are caught by the (...) exception handler in the raft applier fiber, which then exits leaving the node without the ability to execute subsequent raft commands.

This commit changes the on_internal_error() into a warning log entry. It also outputs to keyspace/table names, and the offending replay positions which caused the check to fail.

This PR also adds a test which validates that TRUNCATE works correctly with concurrent writes. More specifically, it checks that:
- all data written before TRUNCATE starts is deleted
- none of the data after TRUNCATE completes is deleted

Fixes: #25173
Fixes: #25013

Backport is needed in versions which check for truncate with concurrent writes using `on_internal_error()`: 2025.3 2025.2 2025.1

- (cherry picked from commit 268ec72dc9)

- (cherry picked from commit 33488ba943)

Parent PR: #25174

Closes scylladb/scylladb#25350

* github.com:scylladb/scylladb:
  truncate: add test for truncate with concurrent writes
  truncate: change check for write during truncate into a log warning
2025-08-07 12:19:45 +03:00
Andrzej Jackowski
1be1306233 test: audit: add cassandra user test case
Audit tests use the `filter_out_noise` function to remove noise from
audit logs generated by user authentication. As a result, none of the
existing tests covered audit logs for the default `cassandra` user.
This change adds a test case for that user.

Refs: scylladb/scylladb#25069
(cherry picked from commit 21aedeeafb)
2025-08-07 10:03:27 +02:00
Patryk Jędrzejczak
1863386bc8 Merge '[Backport 2025.3] Raft-based recovery procedure: simplify rolling restart with recovery_leader' from Scylladb[bot]
The following steps are performed in sequence as part of the
Raft-based recovery procedure:
- set `recovery_leader` to the host ID of the recovery leader in
  `scylla.yaml` on all live nodes,
- send the `SIGHUP` signal to all Scylla processes to reload the config,
- perform a rolling restart (with the recovery leader being restarted
  first).

These steps are not intuitive and more complicated than they could be.

In this PR, we simplify these steps. From now on, we will be able to
simply set `recovery_leader` on each node just before restarting it.

Apart from making necessary changes in the code, we also update all
tests of the Raft-based recovery procedure and the user-facing
documentation.

Fixes scylladb/scylladb#25015

The Raft-based procedure was added in 2025.2. This PR makes the
procedure simpler and less error-prone, so it should be backported
to 2025.2 and 2025.3.

- (cherry picked from commit ec69028907)

- (cherry picked from commit 445a15ff45)

- (cherry picked from commit 23f59483b6)

- (cherry picked from commit ba5b5c7d2f)

- (cherry picked from commit 9e45e1159b)

- (cherry picked from commit f408d1fa4f)

Parent PR: #25032

Closes scylladb/scylladb#25335

* https://github.com/scylladb/scylladb:
  docs: document the option to set recovery_leader later
  test: delay setting recovery_leader in the recovery procedure tests
  gossip: add recovery_leader to gossip_digest_syn
  db: system_keyspace: peers_table_read_fixup: remove rows with null host_id
  db/config, gms/gossiper: change recovery_leader to UUID
  db/config, utils: allow using UUID as a config option
2025-08-07 09:58:13 +02:00
Taras Veretilnyk
606db56cf3 docs: Sort commands list in nodetool.rst
Fixes scylladb/scylladb#25330

Closes scylladb/scylladb#25331

(cherry picked from commit bcb90c42e4)

Closes scylladb/scylladb#25372
2025-08-06 20:49:21 +03:00
Nikos Dragazis
26174a9c67 test: kmip: Fix segfault from premature destruction of port_promise
`kmip_test_helper()` is a utility function to spawn a dedicated PyKMIP
server for a particular Boost test case. The function runs the server as
an external process and uses a thread to parse the port from the
server's logs. The thread communicates the port to the main thread via
a promise.

The current implementation has a bug where the thread may set a value
to the promise after its destruction, causing a segfault. This happens
when the server does not start within 20 seconds, in which case the port
future throws and the stack unwinding machinery destroys the port
promise before the thread that writes to it.

Fix the bug by declaring the promise before the cleanup action.

The bug has been encountered in CI runs on slow machines, where the
PyKMIP server takes too long to create its internal tables (due to slow
fdatasync calls from SQLite). This patch does not improve CI stability -
it only ensures that the error condition is properly reflected in the
test output.

This patch is not a backport. The same bug has been fixed in master as
part of a larger rewrite of the `kmip_test_helper()` (see 722e2bce96).

Refs #24747, #24842.
Fixes #24574.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#25030
2025-08-06 11:59:40 +03:00
Pavel Emelyanov
4fcf0a620c Merge '[Backport 2025.3] Simplify credential reload: remove internal expiration checks' from Scylladb[bot]
This PR introduces a refinement in how credential renewal is triggered. Previously, the system attempted to renew credentials one hour before their expiration, but the credentials provider did not recognize them as expired—resulting in a no-op renewal that returned existing credentials. This led the timer fiber to immediately retry renewal, causing a renewal storm.

To resolve this, we remove expiration (or any other checks) in `reload` method, assuming that whoever calls this method knows what he does.

Fixes: https://github.com/scylladb/scylladb/issues/25044

Should be backported to 2025.3 since we need this fix for the restore

- (cherry picked from commit 68855c90ca)

- (cherry picked from commit e4ebe6a309)

- (cherry picked from commit 837475ec6f)

Parent PR: #24961

Closes scylladb/scylladb#25347

* github.com:scylladb/scylladb:
  s3_creds: code cleanup
  s3_creds: Make `reload` unconditional
  s3_creds: Add test exposing credentials renewal issue
2025-08-06 11:33:19 +03:00
Aleksandra Martyniuk
2282a11405 tasks: change _finished_children type
Parent task keeps a vector of statuses (task_essentials) of its finished
children. When the children number is large - for example because we
have many tables and a child task is created for each table - we may hit
oversize allocation while adding a new child essentials to the vector.

Keep task_essentails of children in chunked_vector.

Fixes: #25040.

Closes scylladb/scylladb#25064

(cherry picked from commit b5026edf49)

Closes scylladb/scylladb#25319
2025-08-06 07:36:04 +03:00
Michał Jadwiszczak
c31f47026d storage_service, group0_state_machine: move SL cache update from topology_state_load() to load_snapshot()
Currently the service levels cache is unnecessarily updated in every
call of `topology_state_load()`.
But it is enough to reload it only when a snapshot is loaded.
(The cache is also already updated when there is a change to one of
`service_levels_v2`, `role_members`, `role_attributes` tables.)

Fixes scylladb/scylladb#25114
Fixes scylladb/scylladb#23065

Closes scylladb/scylladb#25116

(cherry picked from commit 10214e13bd)

Closes scylladb/scylladb#25305
2025-08-06 07:27:48 +03:00
Aleksandra Martyniuk
4f0e5bf429 api: storage_service: do not log the exception that is passed to user
The exceptions that are thrown by the tasks started with API are
propagated to users. Hence, there is no need to log it.

Remove the logs about exception in user started tasks.

Fixes: https://github.com/scylladb/scylladb/issues/16732.

Closes scylladb/scylladb#25153

(cherry picked from commit e607ef10cd)

Closes scylladb/scylladb#25298
2025-08-06 07:27:05 +03:00
Andrei Chekun
85769131a2 docs: update documentation with new way of running C++ tests
Documentation had outdated information how to run C++ test.
Additionally, some information added about gathered test metrics.

Closes scylladb/scylladb#25180

(cherry picked from commit a6a3d119e8)

Closes scylladb/scylladb#25291
2025-08-06 07:25:44 +03:00
Dawid Mędrek
c5e1e28076 test: Enable RF-rack-valid keyspaces in all Python suites
We're enabling the configuration option `rf_rack_valid_keyspaces`
in all Python test suites. All relevant tests have been adjusted
to work with it enabled.

That encompasses the following suites:

* alternator,
* broadcast_tables,
* cluster (already enabled in scylladb/scylladb@ee96f8dcfc),
* cql,
* cqlpy (already enabled in scylladb/scylladb@be0877ce69),
* nodetool,
* rest_api.

Two remaining suites that use tests written in Python, redis and scylla_gdb,
are not affected, at least not directly.

The redis suite requires creating an instance of Scylla manually, and the tests
don't do anything that could violate the restriction.

The scylla_gdb suite focuses on testing the capabilities of scylla-gdb.py, but
even then it reuses the `run` file from the cqlpy suite.

Fixes scylladb/scylladb#25126

Closes scylladb/scylladb#24617

(cherry picked from commit b41151ff1a)

Closes scylladb/scylladb#25231
2025-08-06 07:17:40 +03:00
Tomasz Grabiec
71dd30fc25 topology_coordinator: Trigger load stats refresh after replace
Otherwise, tablet rebuilt will be delayed for up to 60s, as the tablet
scheduler needs load stats for the new node (replacing) to make
decisisons.

Fixes #25163

Closes scylladb/scylladb#25181

(cherry picked from commit 55116ee660)

Closes scylladb/scylladb#25216
2025-08-06 07:16:45 +03:00
Ferenc Szili
db3777c703 truncate: add test for truncate with concurrent writes
test_validate_truncate_with_concurrent_writes checks if truncate deletes
all the data written before the truncate starts, and does not delete any
data after truncate completes.

(cherry picked from commit 33488ba943)
2025-08-06 00:52:15 +00:00
Ferenc Szili
0248f555da truncate: change check for write during truncate into a log warning
TRUNCATE TABLE performs a memtable flush and then discards the sstables
of the table being truncated. It collects the highest replay position
for both of these. When the highest replay position of the discarded
sstables is higher than the highest replay position of the flushed
memtable, that means that we have had writes during truncate which have
been flushed to disk independently of the truncate process. We check for
this and trigger an on_internal_error() which throws an exception,
informing the user that writing data concurrently with TRUNCATE TABLE is
not advised.

The problem with this is that truncate is also called from DROP KEYSPACE
and DROP TABLE. These are raft operations and exceptions thrown by them
are caught by the (...) exception handler in the raft applier fiber,
which then exits leaving the node without the ability to execute
subsequent raft commands.

This commit changes the on_internal_error() into a warning log entry. It
also outputs to keyspace/table names, the truncated_at timepoint, the
offending replay positions which caused the check to fail.

Fixes: #25173
Fixes: #25013
(cherry picked from commit 268ec72dc9)
2025-08-06 00:52:15 +00:00
Ernest Zaslavsky
88779a6884 s3_creds: code cleanup
Remove unnecessary code which is no more used

(cherry picked from commit 837475ec6f)
2025-08-06 00:50:36 +00:00
Ernest Zaslavsky
ccdc98c8f0 s3_creds: Make reload unconditional
Assume that any caller invoking `reload` intends to refresh credentials.
Remove conditional logic that checks for expiration before reloading.

(cherry picked from commit e4ebe6a309)
2025-08-06 00:50:36 +00:00
Ernest Zaslavsky
bd79ae3826 s3_creds: Add test exposing credentials renewal issue
Add a test demonstrating that renewing credentials does not update
their expiration. After requesting credentials again, the expiration
remains unchanged, indicating no actual update occurred.

(cherry picked from commit 68855c90ca)
2025-08-06 00:50:36 +00:00
Botond Dénes
f212f6af28 Merge 'repair: distribute tablet_repair_task_metas between shards' from Aleksandra Martyniuk
Currently, in repair_service::repair_tablets a shard that initiates
the repair keeps repair_tablet_metas of all tablets that have a replica
on this node (on any shard). This may lead to oversized allocations.

Modify tablet_repair_task_impl to repair only the tablets which replicas
are kept on this shard. Modify repair_service::repair_tablets to gather
repair_tablet_metas only on local shard. repair_tablets is invoked on
all shards.

Add a new legacy_tablet_repair_task_impl that covers tablet repair started with
async_repair. A user can use sequence number of this task to manage the repair
using storage_service API.

In a test that reproduced this, we have seen 11136 tablets and 5636096 bytes
allocation failure. If we had a node with 250 shards, 100 tablets each, we could
reach 12MB kept on one shard for the whole repair time.

Fixes: https://github.com/scylladb/scylladb/issues/23632

Needs backport to all live branches as they are all vulnerable to such crashes.

Closes scylladb/scylladb#24194

* github.com:scylladb/scylladb:
  repair: distribute tablet_repair_task_meta among shards
  repair: do not keep erm in tablet_repair_task_meta
2025-08-05 22:36:59 +03:00
Avi Kivity
a8193bd503 Merge '[Backport 2025.3] transport: remove throwing protocol_exception on connection start' from Dario Mirovic
`protocol_exception` is thrown in several places. This has become a performance issue, especially when starting/restarting a server. To alleviate this issue, throwing the exception has to be replaced with returning it as a result or an exceptional future.

This PR replaces throws in the `transport/server` module. This is achieved by using result_with_exception, and in some places, where suitable, just by creating and returning an exceptional future.

There are four commits in this PR. The first commit introduces tests in `test/cqlpy`. The second commit refactors transport server `handle_error` to not rethrow exceptions. The third commit refactors reusable buffer writer callbacks. The fourth commit replaces throwing `protocol_exception` to returning it.

Based on the comments on an issue linked in https://github.com/scylladb/scylladb/issues/24567, the main culprit from the side of protocol exceptions is the invalid protocol version one, so I tested that exception for performance.

In order to see if there is a measurable difference, a modified version of `test_protocol_version_mismatch` Python is used, with 100'000 runs across 10 processes (not threads, to avoid Python GIL). One test run consisted of 1 warm-up run and 5 measured runs. First test run has been executed on the current code, with throwing protocol exceptions. Second test urn has been executed on the new code, with returning protocol exceptions. The performance report is in https://github.com/scylladb/scylladb/pull/24738#issuecomment-3051611069. It shows ~10% gains in real, user, and sys time for this test.

Testing

Build: `release`

Test file: `test/cqlpy/test_protocol_exceptions.py`
Test name: `test_protocol_version_mismatch` (modified for mass connection requests)

Test arguments:
```
max_attempts=100'000
num_parallel=10
```

Throwing `protocol_exception` results:
```
real=1:26.97  user=10:00.27  sys=2:34.55  cpu=867%
real=1:26.95  user=9:57.10  sys=2:32.50  cpu=862%
real=1:26.93  user=9:56.54  sys=2:35.59  cpu=865%
real=1:26.96  user=9:54.95  sys=2:32.33  cpu=859%
real=1:26.96  user=9:53.39  sys=2:33.58  cpu=859%

real=1:26.95 user=9:56.85 sys=2:34.11 cpu=862%   # average
```

Returning `protocol_exception` as `result_with_exception` or an exceptional future:
```
real=1:18.46  user=9:12.21  sys=2:19.08  cpu=881%
real=1:18.44  user=9:04.03  sys=2:17.91  cpu=869%
real=1:18.47  user=9:12.94  sys=2:19.68  cpu=882%
real=1:18.49  user=9:13.60  sys=2:19.88  cpu=883%
real=1:18.48  user=9:11.76  sys=2:17.32  cpu=878%

real=1:18.47 user=9:10.91 sys=2:18.77 cpu=879%   # average
```

This PR replaced `transport/server` throws of `protocol_exception` with returns. There are a few other places where protocol exceptions are thrown, and there are many places where `invalid_request_exception` is thrown. That is out of scope of this single PR, so the PR just refs, and does not resolve issue #24567.

Refs: #24567
Fixes: #25271

This PR improves performance in cases when protocol exceptions happen, for example during connection storms. It will require backporting.

* (cherry picked from commit 7aaeed012e)

* (cherry picked from commit 30d424e0d3)

* (cherry picked from commit 9f4344a435)

* (cherry picked from commit 5390f92afc)

* (cherry picked from commit 4a6f71df68)

Parent PR: #24738

Closes scylladb/scylladb#25117

* github.com:scylladb/scylladb:
  test/cqlpy: add cpp exception metric test conditions
  transport/server: replace protocol_exception throws with returns
  utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception
  transport/server: avoid exception-throw overhead in handle_error
  test/cqlpy: add protocol_exception tests
2025-08-05 14:16:14 +03:00
Patryk Jędrzejczak
eb8ea703d5 docs: document the option to set recovery_leader later
In one of the previous commits, we made it possible to set
`recovery_leader` on each node just before restarting it. Here, we
update the corresponding documentation.

(cherry picked from commit f408d1fa4f)
2025-08-05 10:59:39 +00:00
Patryk Jędrzejczak
ac7945e044 test: delay setting recovery_leader in the recovery procedure tests
In the previous commit, we made it possible to set `recovery_leader`
on each node just before restarting it. Here, we change all the
tests of the Raft-based recovery procedure to use and test this option.

(cherry picked from commit 9e45e1159b)
2025-08-05 10:59:39 +00:00
Patryk Jędrzejczak
79c27454d4 gossip: add recovery_leader to gossip_digest_syn
In the new Raft-based recovery procedure, live nodes join the new
group 0 one by one during a rolling restart. There is a time window when
some of them are in the old group 0, while others are in the new group
0. This causes a group 0 mismatch in `gossiper::handle_syn_msg`. The
current solution for this problem is to ignore group 0 mismatches if
`recovery_leader` is set on the local node and to ask the administrator
to perform the rolling restart in the following way:
- set `recovery_leader` in `scylla.yaml` on all live nodes,
- send the `SIGHUP` signal to all Scylla processes to reload the config,
- proceed with the rolling restart.

This commit makes `gossiper::handle_syn_msg` ignore group 0 mismatches
when exactly one of the two gossiping nodes has `recovery_leader` set.
We achieve this by adding `recovery_leader` to `gossip_digest_syn`.
This change makes setting `recovery_leader` earlier on all nodes and
reloading the config unnecessary. From now on, the administrator can
simply restart each node with `recovery_leader` set.

However, note that nodes that join group 0 must have `recovery_leader`
set until all nodes join the new group 0. For example, assume that we
are in the middle of the rolling restart and one of the nodes in the new
group 0 crashes. It must be restarted with `recovery_leader` set, or
else it would reject `gossip_digest_syn` messages from nodes in the old
group 0. To avoid problems in such cases, we will continue to recommend
setting `recovery_leader` in `scylla.yaml` instead of passing it as
a command line argument.

(cherry picked from commit ba5b5c7d2f)
2025-08-05 10:59:39 +00:00
Patryk Jędrzejczak
4294669e72 db: system_keyspace: peers_table_read_fixup: remove rows with null host_id
Currently, `peers_table_read_fixup` removes rows with no `host_id`, but
not with null `host_id`. Null host IDs are known to appear in system
tables, for example in `system.cluster_status` after a failed bootstrap.
We better make sure we handle them properly if they ever appear in
`system.peers`.

This commit guarantees that null UUID cannot belong to
`loaded_endpoints` in `storage_service::join_cluster`, which in
particular ensures that we throw a runtime error when a user sets
`recovery_leader` to null UUID during the recovery procedure. This is
handled by the code verifying that `recovery_leader` belongs to
`loaded_endpoints`.

(cherry picked from commit 23f59483b6)
2025-08-05 10:59:39 +00:00
Patryk Jędrzejczak
74cf95a675 db/config, gms/gossiper: change recovery_leader to UUID
We change the type of the `recovery_leader` config parameter and
`gossip_config::recovery_leader` from sstring to UUID. `recovery_leader`
is supposed to store host ID, so UUID is a natural choice.

After changing the type to UUID, if the user provides an incorrect UUID,
parsing `recovery_leader` will fail early, but the start-up will
continue. Outside the recovery procedure, `recovery_leader` will then be
ignored. In the recovery procedure, the start-up will fail on:

```
throw std::runtime_error(
        "Cannot start - Raft-based topology has been enabled but persistent group 0 ID is not present. "
        "If you are trying to run the Raft-based recovery procedure, you must set recovery_leader.");
```

(cherry picked from commit 445a15ff45)
2025-08-05 10:59:39 +00:00
Patryk Jędrzejczak
d18d2fa0cf db/config, utils: allow using UUID as a config option
We change the `recovery_leader` option to UUID in the following commit.

(cherry picked from commit ec69028907)
2025-08-05 10:59:39 +00:00
Jenkins Promoter
3d4ec918ff Update ScyllaDB version to: 2025.3.0-rc3 2025-08-03 15:50:47 +03:00
Nikos Dragazis
257ebbeca9 test: Use in-memory SQLite for PyKMIP server
The PyKMIP server uses an SQLite database to store artifacts such as
encryption keys. By default, SQLite performs a full journal and data
flush to disk on every CREATE TABLE operation. Each operation triggers
three fdatasync(2) calls. If we multiply this by 16, that is the number
of tables created by the server, we get a significant number of file
syncs, which can last for several seconds on slow machines.

This behavior has led to CI stability issues from KMIP unit tests where
the server failed to complete its schema creation within the 20-second
timeout (observed on spider9 and spider11).

Fix this by configuring the server to use an in-memory SQLite.

Fixes #24842.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#24995

(cherry picked from commit 2656fca504)

Closes scylladb/scylladb#25300
2025-08-02 17:12:05 +03:00
Ran Regev
7aa7f50b3a scylla.yaml: add recommended value for stream_io_throughput_mb_per_sec
Fixes: #24758

Updated scylla.yaml and the help for
scylla --help

Closes scylladb/scylladb#24793

(cherry picked from commit db4f301f0c)

Closes scylladb/scylladb#25280
2025-08-01 15:02:01 +03:00
Piotr Dulikowski
0dc700de70 Merge '[Backport 2025.3] qos: don't populate effective service level cache until auth is migrated to raft' from Scylladb[bot]
Right now, service levels are migrated in one group0 command and auth is migrated in the next one. This has a bad effect on the group0 state reload logic - modifying service levels in group0 causes the effective service levels cache to be recalculated, and to do so we need to fetch information about all roles. If the reload happens after SL upgrade and before auth upgrade, the query for roles will be directed to the legacy auth tables in system_auth - and the query, being a potentially remote query, has a timeout. If the query times out, it will throw an exception which will break the group0 apply fiber and the node will need to be restarted to bring it back to work.

In order to solve this issue, make sure that the service level module does not start populating and using the service level cache until both service levels and auth are migrated to raft. This is achieved by adding the check both to the cache population logic and the effective service level getter - they now look at service level's accessor new method, `can_use_effective_service_level_cache` which takes a look at the auth version.

Fixes: scylladb/scylladb#24963

Should be backported to all versions which support upgrade to topology over raft - the issue described here may put the cluster into a state which is difficult to get out of (group0 apply fiber can break on multiple nodes, which necessitates their restart).

- (cherry picked from commit 2bb800c004)

- (cherry picked from commit 3a082d314c)

Parent PR: #25188

Closes scylladb/scylladb#25285

* github.com:scylladb/scylladb:
  test: sl: verify that legacy auth is not queried in sl to raft upgrade
  qos: don't populate effective service level cache until auth is migrated to raft
2025-08-01 08:49:13 +02:00
Jenkins Promoter
308400895f Update pgo profiles - aarch64 2025-08-01 05:19:18 +03:00
Jenkins Promoter
54b259bec9 Update pgo profiles - x86_64 2025-08-01 05:02:34 +03:00
Piotr Dulikowski
f27a3be62b test: sl: verify that legacy auth is not queried in sl to raft upgrade
Adjust `test_service_levels_upgrade`: right before upgrade to topology
on raft, enable an error injection which triggers when the standard role
manager is about to query the legacy auth tables in the
system_auth keyspace. The preceding commit which fixes
scylladb/scylladb#24963 makes sure that the legacy tables are not
queried during upgrade to topology on raft, so the error injection does
not trigger and does not cause a problem; without that commit, the test
fails.

(cherry picked from commit 3a082d314c)
2025-07-31 15:13:57 +00:00
Piotr Dulikowski
ba70b39486 qos: don't populate effective service level cache until auth is migrated to raft
Right now, service levels are migrated in one group0 command and auth
is migrated in the next one. This has a bad effect on the group0 state
reload logic - modifying service levels in group0 causes the effective
service levels cache to be recalculated, and to do so we need to fetch
information about all roles. If the reload happens after SL upgrade and
before auth upgrade, the query for roles will be directed to the legacy
auth tables in system_auth - and the query, being a potentially remote
query, has a timeout. If the query times out, it will throw
an exception which will break the group0 apply fiber and the node will
need to be restarted to bring it back to work.

In order to solve this issue, make sure that the service level module
does not start populating and using the service level cache until both
service levels and auth are migrated to raft. This is achieved by adding
the check both to the cache population logic and the effective service
level getter - they now look at service level's accessor new method,
`can_use_effective_service_level_cache` which takes a look at the auth
version.

Fixes: scylladb/scylladb#24963
(cherry picked from commit 2bb800c004)
2025-07-31 15:13:57 +00:00
Andrzej Jackowski
12866e8f2e test: audit: ignore cassandra user audit logs in AUTH tests
Audit tests are vulnerable to noise from LOGIN queries (because AUTH
audit logs can appear at any time). Most tests already use the
`filter_out_noise` mechanism to remove this noise, but tests
focused on AUTH verification did not, leading to sporadic failures.

This change adds a filter to ignore AUTH logs generated by the default
"cassandra" user, so tests only verify logs from the user created
specifically for each test.

Fixes: scylladb/scylladb#25069
(cherry picked from commit aef6474537)
2025-07-31 17:01:29 +02:00
Andrzej Jackowski
e77a190f1a test: audit: change names of filter_out_noise parameters
This is a refactoring commit that changes the names of the parameters
of the `filter_out_noise` function, as well as names of related
variables. The motiviation for the change is introduction of more
complex filtering logic in next commit of this patch series.

Refs: scylladb/scylladb#25069
(cherry picked from commit daf1c58e21)
2025-07-31 16:58:36 +02:00
Aleksandra Martyniuk
132e6495a3 repair: distribute tablet_repair_task_meta among shards
Currently, in repair_service::repair_tablets a shard that initiates
the repair keeps tablet_repair_task_meta of all tablets that have a replica
on this node (on any shard). This may lead to oversized allocations.

Add remote_metas class which takes care of distributing tablet_repair_task_meta
among different shards. An additional class remote_metas_builder was
added in order to ensure safety and separate writes and reads to meta
vectors.

Fixes: #23632
2025-07-31 15:56:53 +02:00
Aleksandra Martyniuk
603a2dbb10 repair: do not keep erm in tablet_repair_task_meta
Do not keep erm in tablet_repair_task_meta to avoid non-owner shared
pointer access when metas will be distributes among shards.

Pass std::chunked_vector of erms to tablet_repair_task_impl to
preserve safety.
2025-07-31 15:56:43 +02:00
Dario Mirovic
7d300367c0 test/cqlpy: add cpp exception metric test conditions
Tested code paths should not throw exceptions. `scylla_reactor_cpp_exceptions`
metric is used. This is a global metric. To address potential test flakiness,
each test runs multiple times:
- `run_count = 100`
- `cpp_exception_threshold = 10`

If a change in the code introduced an exception, expectation is that the number
of registered exceptions will be > `cpp_exception_threshold` in `run_count` runs.
In which case the test fails.

Fixes: #25271
(cherry picked from commit 4a6f71df68)
2025-07-31 11:53:00 +02:00
Anna Stuchlik
4bc531d48d doc: add the upgrade guide from 2025.2 to 2025.3
This PR adds the upgrade guide from version 2025.2 to 2025.3.
Also, it removes the upgrade guide existing for the previous version
that is irrelevant in 2025.2 (upgrade from 2025.1 to 2025.2).

Note that the new guide does not include the "Enable Consistent Topology Updates" page and note,
as users upgrading to 2025.3 have consistent topology updates already enabled.

Fixes https://github.com/scylladb/scylladb/issues/24696

Closes scylladb/scylladb#25219

(cherry picked from commit 8365219d40)

Closes scylladb/scylladb#25248
2025-07-31 12:19:33 +03:00
Anna Stuchlik
f3ca644a55 doc: add OS support for ScyllaDB 2025.3
This commit adds the information about support for platforms in ScyllaDB version 2025.3.

Fixes https://github.com/scylladb/scylladb/issues/24698

Closes scylladb/scylladb#25220

(cherry picked from commit b67bb641bc)

Closes scylladb/scylladb#25249
2025-07-31 12:17:36 +03:00
Anna Stuchlik
573bbace20 doc: add tablets support information to the Drivers table
This commit:

- Extends the Drivers support table with information on which driver supports tablets
  and since which version.
- Adds the driver support policy to the Drivers page.
- Reorganizes the Drivers page to accommodate the updates.

In addition:
- The CPP-over-Rust driver is added to the table.
- The information about Serverless (which we don't support) is removed
  and replaced with tablets to correctly describe the contents of the table.

Fixes https://github.com/scylladb/scylladb/issues/19471

Refs https://github.com/scylladb/scylladb-docs-homepage/issues/69

Closes scylladb/scylladb#24635

(cherry picked from commit 18b4d4a77c)

Closes scylladb/scylladb#25251
2025-07-31 12:17:21 +03:00
Aleksandra Martyniuk
4630a2f9c5 streaming: close sink when exception is thrown
If an exception is thrown in result_handling_cont in streaming,
then the sink does not get closed. This leads to a node crash.

Close sink in exception handler.

Fixes: https://github.com/scylladb/scylladb/issues/25165.

Closes scylladb/scylladb#25238

(cherry picked from commit 99ff08ae78)

Closes scylladb/scylladb#25268
2025-07-31 12:17:05 +03:00
Dario Mirovic
38a8318466 transport/server: replace protocol_exception throws with returns
Replace throwing protocol_exception with returning it as a result
or an exceptional future in the transport server module. This
improves performance, for example during connection storms and
server restarts, where protocol exceptions are more frequent.

In functions already returning a future, protocol exceptions are
propagated using an exceptional future. In functions not already
returning a future, result_with_exception is used.

Notable change is checking v.failed() before calling v.get() in
process_request function, to avoid throwing in case of an
exceptional future.

Refs: #24567
Fixes: #25271
(cherry picked from commit 5390f92afc)
2025-07-30 21:35:24 +02:00
Dario Mirovic
1078a1f03a utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception
Make make_bytes_ostream and make_fragmented_temporary_buffer accept
writer callbacks that return utils::result_with_exception instead of
forcing them to throw on error. This lets callers propagate failures
by returning an error result rather than throwing an exception.

Introduce buffer_writer_for, bytes_ostream_writer, and fragmented_buffer_writer
concepts to simplify and document the template requirements on writer callbacks.

This patch does not modify the actual callbacks passed, except for the syntax
changes needed for successful compilation, without changing the logic.

Refs: #24567
Fixes: #25271
(cherry picked from commit 9f4344a435)
2025-07-30 21:35:15 +02:00
Dario Mirovic
0679a7bb78 transport/server: avoid exception-throw overhead in handle_error
Previously, connection::handle_error always called f.get() inside a try/catch,
forcing every failed future to throw and immediately catch an exception just to
classify it. This change eliminates that extra throw/catch cycle by first checking
f.failed(), getting the stored std::exception_ptr via f.get_exception(), and
then dispatching on its type via utils::try_catch<T>(eptr).

The error-response logic is not changed - cassandra_exception, std::exception,
and unknown exceptions are caught and processed, and any exceptions thrown by
write_response while handling those exceptions continues to escape handle_error.

Refs: #24567
Fixes: #25271
(cherry picked from commit 30d424e0d3)
2025-07-30 21:34:56 +02:00
Dario Mirovic
918d4ab5fb test/cqlpy: add protocol_exception tests
Add a helper to fetch scylla_transport_cql_errors_total{type="protocol_error"} counter
from Scylla's metrics endpoint. These metrics are used to track protocol error
count before and after each test.

Add cql_with_protocol context manager utility for session creation with parameterized
protocol_version value. This is used for testing connection establishment with
different protocol versions, and proper disposal of successfully established sessions.

The tests cover two failure scenarios:
- Protocol version mismatch in test_protocol_version_mismatch which tests both supported
and unsupported protocol version
- Malformed frames via raw socket in _protocol_error_impl, used by several test functions,
and also test_no_protocol_exceptions test to assert that the error counters never decrease
during test execution, catching unintended metric resets

Refs: #24567
Fixes: #25271
(cherry picked from commit 7aaeed012e)
2025-07-30 21:34:31 +02:00
Patryk Jędrzejczak
7164f11b99 Merge '[Backport 2025.3] Revert 24418: main.cc: fix group0 shutdown order' from Petr Gusev
This PR reverts the changes of #24418 since they can cause use-after-free.

The `raft_group0::abort()` was called in `storage_service::do_drain` (introduced in #24418) to stop the group0 Raft server before destroying local storage. This was necessary because `raft::server` depends on storage (via `raft_sys_table_storage` and `group0_state_machine`).

However, this caused issues: services like `sstable_dict_autotrainer` and `auth::service`, which use `group0_client` but are not stopped by `storage_service`, could trigger use-after-free if `raft_group0` was destroyed too early. This can happen both during normal shutdown and when 'nodetool drain' is used.

This PR reverts two of the three commits from #24418. The commit [e456d2d](e456d2d507) is not reverted because it only affects logging and does not impact correctness.

Fixes scylladb/scylladb#25221

Backport: this PR is a backport

Closes scylladb/scylladb#25206

* https://github.com/scylladb/scylladb:
  Revert "main.cc: fix group0 shutdown order"
  Revert "storage_service: test_group0_apply_while_node_is_being_shutdown"
2025-07-30 16:18:13 +02:00
Pavel Emelyanov
99f328b7a7 Merge '[Backport 2025.3] s3_client: Enhance s3_client error handling' from Scylladb[bot]
Enhance and fix error handling in the `chunked_download_source` to prevent errors seeping from the request callback. Also stop retrying on seastar's side since it is going to break the integrity of data which maybe downloaded more than once for the same range.

Fixes: https://github.com/scylladb/scylladb/issues/25043

Should be backported to 2025.3 since we have an intention to release native backup/restore feature

- (cherry picked from commit d53095d72f)

- (cherry picked from commit b7ae6507cd)

- (cherry picked from commit ba910b29ce)

- (cherry picked from commit fc2c9dd290)

Parent PR: #24883

Closes scylladb/scylladb#25137

* github.com:scylladb/scylladb:
  s3_client: Disable Seastar-level retries in HTTP client creation
  s3_test: Validate handling of non-`aws_error` exceptions
  s3_client: Improve error handling in chunked_download_source
  aws_error: Add factory method for `aws_error` from exception
2025-07-29 14:42:45 +03:00
Pavel Emelyanov
07f46a4ad5 Merge '[Backport 2025.3] storage_service: cancel all write requests after stopping transports' from Scylladb[bot]
When a node shuts down, in storage service, after storage_proxy RPCs are stopped, some write handlers within storage_proxy may still be waiting for background writes to complete. These handlers hold appropriate ERMs to block schema changes before the write finishes. After the RPCs are stopped, these writes cannot receive the replies anymore.

If, at the same time, there are RPC commands executing `barrier_and_drain`, they may get stuck waiting for these ERM holders to finish, potentially blocking node shutdown until the writes time out.

This change introduces cancellation of all outstanding write handlers from storage_service after the storage proxy RPCs were stopped.

Fixes scylladb/scylladb#23665

Backport: since this fixes an issue that frequently causes issues in CI, backport to 2025.1, 2025.2, and 2025.3.

- (cherry picked from commit bc934827bc)

- (cherry picked from commit e0dc73f52a)

Parent PR: #24714

Closes scylladb/scylladb#25170

* github.com:scylladb/scylladb:
  storage_service: Cancel all write requests on storage_proxy shutdown
  test: Add test for unfinished writes during shutdown and topology change
2025-07-29 14:42:25 +03:00
Taras Veretilnyk
a9f5e7d18f docs: fix typo in command name enbleautocompaction -> enableautocompaction
Renamed the file and updated all references from 'enbleautocompaction' to the correct 'enableautocompaction'.

Fixes scylladb/scylladb#25172

Closes scylladb/scylladb#25175

(cherry picked from commit 6b6622e07a)

Closes scylladb/scylladb#25218
2025-07-29 14:41:50 +03:00
Petr Gusev
d8f6a497a5 Revert "main.cc: fix group0 shutdown order"
This reverts commit 6b85ab79d6.
2025-07-28 17:50:38 +02:00
Petr Gusev
c98dde92db Revert "storage_service: test_group0_apply_while_node_is_being_shutdown"
This reverts commit b1050944a3.
2025-07-28 17:49:03 +02:00
Aleksandra Martyniuk
8efee38d6f tasks: do not use binary progress for task manager tasks
Currently, progress of a parent task depends on expected_total_workload,
expected_children_number, and children progresses. Basically, if total
workload is known or all children have already been created, progresses
of children are summed up. Otherwise binary progress is returned.

As a result, two tasks of the same type may return progress in different
units. If they are children of the same task and this parent gathers the
progress - it becomes meaningless.

Drop expected_children_number as we can't assume that children are able
to show their progresses.

Modify get_progress method - progress is calculated based on children
progresses. If expected_total_workload isn't specified, the total
progress of a task may grow. If expected_total_workload isn't specified
and no children are created, empty progress (0/0) is returned.

Fixes: https://github.com/scylladb/scylladb/issues/24650.

Closes scylladb/scylladb#25113

(cherry picked from commit a7ee2bbbd8)

Closes scylladb/scylladb#25200
2025-07-28 13:11:45 +03:00
Michael Litvak
934260e9a9 storage service: drain view builder before group0
The view builder uses group0 operations to coordinate view building, so
we should drain the view builder before stopping group0.

Fixes scylladb/scylladb#25096

Closes scylladb/scylladb#25101

(cherry picked from commit 3ff388cd94)

Closes scylladb/scylladb#25198
2025-07-28 13:05:14 +03:00
Nadav Har'El
583c118ccd Merge '[Backport 2025.3] alternator: avoid oversized allocation in Query/Scan' from Scylladb[bot]
This series fixes one cause of oversized allocations - and therefore potentially stalls and increased tail latencies - in Alternator.

The first patch in the series is the main fix - the later patches are cleanups requested by reviewers but also involved other pre-existing code, so I did those cleanups as separate patches.

Alternator's Scan or Query operation return a page of results. When the number of items is not limited by a "Limit" parameter, the default is to return a 1 MB page. If items are short, a large number of them can fit in that 1MB. The test test_query.py::test_query_large_page_small_rows has 30,000 items returned in a single page.

In the response JSON, all these items are returned in a single array "Items". Before this patch, we build the full response as a RapidJSON object before sending it. The problem is that unfortunately, RapidJSON stores arrays as contiguous allocations. This results in large contiguous allocations in workloads that scan many small items, and large contiguous allocations can also cause stalls and high tail latencies. For example, before this patch, running

    test/alternator/run --runveryslow \
        test_query.py::test_query_large_page_small_rows

reports in the log:

    oversized allocation: 573440 bytes.

After this patch, this warning no longer appears.
The patch solves the problem by collecting the scanned items not in a RapidJSON array, but rather in a chunked_vector<rjson::value>, i.e, a chunked (non-contiguous) array of items (each a JSON value). After collecting this array separately from the response object, we need to print its content without actually inserting it into the object - we add a new function print_with_extra_array() to do that.

The new separate-chunked-vector technique is used when a large number (currently, >256) of items were scanned. When there is a smaller number of items in a page (this is typical when each item is longer), we just insert those items in the object and print it as before.

Beyond the original slow test that demonstrated the oversized allocation (which is now gone), this patch also includes a new test which exercises the new code with a scan of 700 (>256) items in a page - but this new test is fast enough to be permanently in our test suite and not a manual "veryslow" test as the other test.

Fixes #23535

The stalls caused by large allocations was seen by actual users, so it makes sense to backport this patch. On the other hand, the patch while not big is fairly intrusive (modifies the nomal Scan and Query path and also the later patches do some cleanup of additional code) so there is some small risk involved in the backport.

- (cherry picked from commit 2385fba4b6)

- (cherry picked from commit d8fab2a01a)

- (cherry picked from commit 13ec94107a)

- (cherry picked from commit a248336e66)

Parent PR: #24480

Closes scylladb/scylladb#25194

* github.com:scylladb/scylladb:
  alternator: clean up by co-routinizing
  alternator: avoid spamming the log when failing to write response
  alternator: clean up and simplify request_return_type
  alternator: avoid oversized allocation in Query/Scan
2025-07-27 14:12:49 +03:00
Nadav Har'El
f1c5350141 alternator: clean up by co-routinizing
Reviewers of the previous patch complained on some ugly pre-existing
code in alternator/executor.cc, where returning from an asynchronous
(future) function require lengthy verbose casts. So this patch cleans
up a few instances of these ugly casts by using co_return instead of
return.

For example, the long and verbose

    return make_ready_future<executor::request_return_type>(
        rjson::print(std::move(response)));

can be changed to the shorter and more readable

    co_return rjson::print(std::move(response));

This patch should not have any functional implications, and also not any
performance implications: I only coroutinized slow-path functions and
one function that was already "partially" coroutinized (and this was
expecially ugly and deserved being fixed).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit a248336e66)
2025-07-27 07:42:01 +00:00
Nadav Har'El
f897f38003 alternator: avoid spamming the log when failing to write response
Both make_streamed() and new make_streamed_with_extra_array() functions,
used when returning a long response in Alternator, would write an error-
level log message if it failed to write the response. This log message
is probably not helpful, and may spam the log if the application causes
repeated errors intentionally or accidentally.

So drop these log messages. The exception is still thrown as usual.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 13ec94107a)
2025-07-27 07:42:01 +00:00
Nadav Har'El
fe037663ea alternator: clean up and simplify request_return_type
The previous patch introduced a function make_streamed_with_extra_array
which was a duplicate of the existing make_streamed. Reviewers
complained how baroque the new function is (just like the old function),
having to jump through hoops to return a copyable function working
on non-copyable objects, making strange-named copies and shared pointers
of everything.

We needed to return a copyable function (std::function) just because
Alternator used Seastar's json::json_return_type in the return type
from executor function (request_return_type). This json_return_type
contained either a sstring or an std::function, but neither was ever
really appropriate:

  1. We want to return noncopyable_function, not an std::function!
  2. We want to return an std::string (which rjson::print()) returns,
     not an sstring!

So in this patch we stop using seastar::json::json_return_type
entirely in Alternator.

Alternator's request_return_type is now an std::variant of *three* types:
  1. std::string for short responses,
  2. noncopyable_function for long streamed response
  3. api_error for errors.

The ugliest parts of make_streamed() where we made copies and shared
pointers to allow for a copyable function are all gone. Even nicer, a
lot of other ugly relics of using seastar::json_return_type are gone:

1. We no longer need obscure classes and functions like make_jsonable()
   and json_string() to convert strings to response bodies - an operation
   can simply return a string directly - usually returning
   rjson::print(value) or a fixed string like "" and it just works.

2. There is no more usage of seastar::json in Alternator (except one
   minor use of seastar::json::formatter::to_json in streams.cc that
   can be removed later). Alternator uses RapidJSON for its JSON
   needs, we don't need to use random pieces from a different JSON
   library.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit d8fab2a01a)
2025-07-27 07:42:01 +00:00
Nadav Har'El
b7da50d781 alternator: avoid oversized allocation in Query/Scan
This patch fixes one cause of oversized allocations - and therefore
potentially stalls and increased tail latencies - in Alternator.

Alternator's Scan or Query operation return a page of results. When the
number of items is not limited by a "Limit" parameter, the default is
to return a 1 MB page. If items are short, a large number of them can
fit in that 1MB. The test test_query.py::test_query_large_page_small_rows
has 30,000 items returned in a single page.

In the response JSON, all these items are returned in a single array
"Items". Before this patch, we build the full response as a RapidJSON
object before sending it. The problem is that unfortunately, RapidJSON
stores arrays as contiguous allocations. This results in large
contiguous allocations in workloads that scan many small items, and
large contiguous allocations can also cause stalls and high tail
latencies. For example, before this patch, running

    test/alternator/run --runveryslow \
        test_query.py::test_query_large_page_small_rows

reports in the log:

    oversized allocation: 573440 bytes.

After this patch, this warning no longer appears.
The patch solves the problem by collecting the scanned items not in a
RapidJSON array, but rather in a chunked_vector<rjson::value>, i.e,
a chunked (non-contiguous) array of items (each a JSON value).
After collecting this array separately from the response object, we
need to print its content without actually inserting it into the object -
we add a new function print_with_extra_array() to do that.

The new separate-chunked-vector technique is used when a large number
(currently, >256) of items were scanned. When there is a smaller number
of items in a page (this is typical when each item is longer), we just
insert those items in the object and print it as before.

Beyond the original slow test that demonstrated the oversized allocation
(which is now gone), this patch also includes a new test which
exercises the new code with a scan of 700 (>256) items in a page -
but this new test is fast enough to be permanently in our test suite
and not a manual "veryslow" test as the other test.

Fixes #23535

(cherry picked from commit 2385fba4b6)
2025-07-27 07:42:01 +00:00
Pavel Emelyanov
7c04619ecf Merge '[Backport 2025.3] encryption_at_rest_test: Fix some spurious errors' from Scylladb[bot]
Fixes #24574

* Ensure we close the embedded load_cache objects on encryption shutdown, otherwise we can, in unit testing, get destruction of these while a timer is still active -> assert
* Add extra exception handling to `network_error_test_helper`, so even if test framework might exception-escape, we properly stop the network proxy to avoid use after free.

- (cherry picked from commit ee98f5d361)

- (cherry picked from commit 8d37e5e24b)

Parent PR: #24633

Closes scylladb/scylladb#24772

* github.com:scylladb/scylladb:
  encryption_at_rest_test: Add exception handler to ensure proxy stop
  encryption: Ensure stopping timers in provider cache objects
2025-07-24 16:35:53 +03:00
Pavel Emelyanov
b07f4fb26b Merge '[Backport 2025.3] streaming: Avoid deadlock by running view checks in a separate scheduling group' from Scylladb[bot]
This issue happens with removenode, when RBNO is disabled, so range
streamer is used.

The deadlock happens in a scenario like this:
1. Start 3 nodes: {A, B, C}, RF=2
2. Node A is lost
3. removenode A
4. Both B and C gain ownership of ranges.
5. Streaming sessions are started with crossed directions: B->C, C->B

Readers created by sender side exhaust streaming semaphore on B and C.
Receiver side attempts to obtain a permit indirectly by calling
check_needs_view_update_path(), which reads local tables. That read is
blocked and times-out, causing streaming to fail. The streaming writer
is already using a tracking-only permit.

Even if we didn't deadlock, and the streaming semaphore was simply exhausted
by other receiving sessions (via tracking-only permit), the query may still time-out due to starvation.

To avoid that, run the query under a different scheduling group, which
translates to the system semaphore instead of the maintenance
semaphore, to break the dependency. The gossip group was chosen
because it shouldn't be contended and this change should not interfere
with it much.

Fixes #24807
Fixes #24925

- (cherry picked from commit ee2fa58bd6)

- (cherry picked from commit dff2b01237)

Parent PR: #24929

Closes scylladb/scylladb#25058

* github.com:scylladb/scylladb:
  streaming: Avoid deadlock by running view checks in a separate scheduling group
  service: migration_manager: Run group0 barrier in gossip scheduling group
2025-07-24 16:35:24 +03:00
Ran Regev
c5f4ad3665 nodetool restore: sstable list from a file
Fixes: #25045

added the ability to supply the list of files to
restore from the a given file.
mainly required for local testing.

Signed-off-by: Ran Regev <ran.regev@scylladb.com>

Closes scylladb/scylladb#25077

(cherry picked from commit dd67d22825)

Closes scylladb/scylladb#25124
2025-07-24 16:35:04 +03:00
Ran Regev
013e0d685c docs: update nodetool restore documentation for --sstables-file-list
Fixes: #25128
A leftover from #25077

Closes scylladb/scylladb#25129

(cherry picked from commit 3d82b9485e)

Closes scylladb/scylladb#25139
2025-07-24 16:34:39 +03:00
Jakub Smolar
800f819b5b gdb: handle zero-size reads in managed_bytes
Fixes: https://github.com/scylladb/scylladb/issues/25048

Closes scylladb/scylladb#25050

(cherry picked from commit 6e0a063ce3)

Closes scylladb/scylladb#25142
2025-07-24 16:34:04 +03:00
Sergey Zolotukhin
8ac6aaadaf storage_service: Cancel all write requests on storage_proxy shutdown
During a graceful node shutdown, RPC listeners are stopped in `storage_service::drain_on_shutdown`
as one of the first steps. However, even after RPCs are shut down, some write handlers in
`storage_proxy` may still be waiting for background writes to complete. These handlers retain the ERM.
Since the RPC subsystem is no longer active, replies cannot be received, and if any RPC commands are
concurrently executing `barrier_and_drain`, they may get stuck waiting for those writes. This can block
the messaging server shutdown and delay the entire shutdown process until the write timeout occurs.

This change introduces the cancellation of all outstanding write handlers in `storage_proxy`
during shutdown to prevent unnecessary delays.

Fixes scylladb/scylladb#23665

(cherry picked from commit e0dc73f52a)
2025-07-24 13:03:32 +00:00
Sergey Zolotukhin
16a8cd9514 test: Add test for unfinished writes during shutdown and topology change
This test reproduces an issue where a topology change and an ongoing write query
during query coordinator shutdown can cause the node to get stuck.

When a node receives a write request, it creates a write handler that holds
a copy of the current table's ERM (Effective Replication Map). The ERM ensures
that no topology or schema changes occur while the request is being processed.

After the query coordinator receives the required number of replica write ACKs
to satisfy the consistency level (CL), it sends a reply to the client. However,
the write response handler remains alive until all replicas respond — the remaining
writes are handled in the background.

During shutdown, when all network connections are closed, these responses can no longer
be received. As a result, the write response handler is only destroyed once the write
timeout is reached.

This becomes problematic because the ERM held by the handler blocks topology or schema
change commands from executing. Since shutdown waits for these commands to complete,
this can lead to unnecessary delays in node shutdown and restarts, and occasional
test case failures.

Test for: scylladb/scylladb#23665

(cherry picked from commit bc934827bc)
2025-07-24 13:03:32 +00:00
Ernest Zaslavsky
e45852a595 s3_client: Disable Seastar-level retries in HTTP client creation
Prevent Seastar from retrying HTTP requests to avoid buffer double-feed
issues when an entire request is retried. This could cause data
corruption in `chunked_download_source`. The change is global for every
instance of `s3_client`, but it is still safe because:
* Seastar's `http_client` resets connections regardless of retry behavior
* `s3_client` retry logic handles all error types—exceptions, HTTP errors,
  and AWS-specific errors—via `http_retryable_client`

(cherry picked from commit fc2c9dd290)
2025-07-22 16:46:54 +00:00
Ernest Zaslavsky
fdf706a6eb s3_test: Validate handling of non-aws_error exceptions
Inject exceptions not wrapped in `aws_error` from request callback
lambda to verify they are properly caught and handled.

(cherry picked from commit ba910b29ce)
2025-07-22 16:46:53 +00:00
Ernest Zaslavsky
2bc3accf9c s3_client: Improve error handling in chunked_download_source
Create aws_error from raised exceptions when possible and respond
appropriately. Previously, non-aws_exception types leaked from the
request handler and were treated as non-retryable, causing potential
data corruption during download.

(cherry picked from commit b7ae6507cd)
2025-07-22 16:46:53 +00:00
Ernest Zaslavsky
0106d132bd aws_error: Add factory method for aws_error from exception
Move `aws_error` creation logic out of `retryable_http_client` and
into the `aws_error` class to support reuse across components.

(cherry picked from commit d53095d72f)
2025-07-22 16:46:53 +00:00
Pavel Emelyanov
53637fdf61 Merge '[Backport 2025.3] storage: add make_data_or_index_source to the storages' from Scylladb[bot]
Add `make_data_or_index_source` to the storages to utilize new S3 based data source which should improve restore performance

* Introduce the `encrypted_data_source` class that wraps an existing data source to read and decrypt data on the fly using block encryption. Also add unit tests to verify correct decryption behavior.
* Add `make_data_or_index_source` to the `storage` interface, implement it  for `filesystem_storage` storage which just creates `data_source` from a file and for the `s3_storage` create a (maybe) decrypting source from s3 make_download_source. This change should solve performance improvement for reading large objects from S3 and should not affect anything for the `filesystem_storage`

Fixes: https://github.com/scylladb/scylladb/issues/22458

- (cherry picked from commit 211daeaa40)

- (cherry picked from commit 7e5e3c5569)

- (cherry picked from commit 0de61f56a2)

- (cherry picked from commit 8ac2978239)

- (cherry picked from commit dff9a229a7)

- (cherry picked from commit 8d49bb8af2)

Parent PR: #23695

Closes scylladb/scylladb#25016

* github.com:scylladb/scylladb:
  sstables: Start using `make_data_or_index_source` in `sstable`
  sstables: refactor readers and sources to use coroutines
  sstables: coroutinize futurized readers
  sstables: add `make_data_or_index_source` to the `storage`
  encryption: refactor key retrieval
  encryption: add `encrypted_data_source` class
2025-07-21 18:05:53 +03:00
Piotr Dulikowski
fdfcd67a6e Merge '[Backport 2025.3] cdc: Forbid altering columns of CDC log tables directly' from Scylladb[bot]
The set of columns of a CDC log table should be managed automatically
by Scylla, and the user should not have the ability to manipulate them
directly. That could lead to disastrous consequences such as a
segmentation fault.

In this commit, we're restricting those operations. We also provide two
validation tests.

One of the existing tests had to be adjusted as it modified the type
of a column in a CDC log table. Since the test simply verifies that
the user has sufficient permissions to perform `ALTER TABLE` on the log
table, the test is still valid.

Fixes scylladb/scylladb#24643

Backport: we should backport the change to all affected
branches to prevent the consequences that may affect the user.

- (cherry picked from commit 20d0050f4e)

- (cherry picked from commit 59800b1d66)

Parent PR: #25008

Closes scylladb/scylladb#25108

* github.com:scylladb/scylladb:
  cdc: Forbid altering columns of inactive CDC log table
  cdc: Forbid altering columns of CDC log tables directly
2025-07-21 16:22:31 +02:00
Dawid Mędrek
dc6cb5cfad cdc: Forbid altering columns of inactive CDC log table
When CDC becomes disabled on the base table, the CDC log table
still exsits (cf. scylladb/scylladb@adda43edc7).
If it continues to exist up to the point when CDC is re-enabled
on the base table, no new log table will be created -- instead,
the old olg table will be *re-attached*.

Since we want to avoid situations when the definition of the log
table has become misaligned with the definition of the base table
due to actions of the user, we forbid modifying the set of columns
or renaming them in CDC log tables, even when they're inactive.

Validation tests are provided.

(cherry picked from commit 59800b1d66)
2025-07-21 11:43:49 +00:00
Dawid Mędrek
10a9ced4d1 cdc: Forbid altering columns of CDC log tables directly
The set of columns of a CDC log table should be managed automatically
by Scylla, and the user should not have the ability to manipulate them
directly. That could lead to disastrous consequences such as a
segmentation fault.

In this commit, we're restricting those operations. We also provide two
validation tests.

One of the existing tests had to be adjusted as it modified the type
of a column in a CDC log table. Since the test simply verifies that
the user has sufficient permissions to perform `ALTER TABLE` on the log
table, the test is still valid.

Fixes scylladb/scylladb#24643

(cherry picked from commit 20d0050f4e)
2025-07-21 11:43:49 +00:00
Ernest Zaslavsky
934359ea28 s3_client: parse multipart response XML defensively
Ensure robust handling of XML responses when initiating multipart
uploads. Check for the existence of required nodes before access,
and throw an exception if the XML is empty or malformed.

Refs: https://github.com/scylladb/scylladb/issues/24676

Closes scylladb/scylladb#24990

(cherry picked from commit 342e94261f)

Closes scylladb/scylladb#25057
2025-07-21 12:03:00 +02:00
Piotr Dulikowski
74d97711fd Merge '[Backport 2025.3] cdc: throw error if column doesn't exist' from Scylladb[bot]
in the CDC log transformer, when creating a CDC mutation based on some
base table mutation, for each value of a base column we set the value in
the CDC column with the same name.

When looking up the column in the CDC schema by name, we may get a null
pointer if a column by that name is not found. This shouldn't happen
normally because the base schema and CDC schema should be compatible,
and for each base column there should be a CDC column with the same
name.

However, there are scenarios where the base schema and CDC schema are
incompatible for a short period of time when they are being altered.
When a base column is being added or dropped, we could get a base
mutation with this column set, and then the CDC transformer picks up the
latest CDC schema which doesn't have this column.

If such thing happens, we fix the code to throw an exception instead of
crashing on null pointer dereference. Currently we don't have a safer
approach to handle this, but this might be changed in the future. The
other alternative is dropping that data silently which we prefer not to
do.

Throwing an error is acceptable because this scenario most likely
indicates this behavior by the user:
* The user adds a new column, and start writing values to the column
  before the ALTER is complete. or,
* The user drops a column, and continues writing values to the column
  while it's being dropped.

Both cases might as well fail with an error because the column is not
found in the base table.

Fixes scylladb/scylladb#/24952

backport needed - simple fix for a node crash

- (cherry picked from commit b336f282ae)

- (cherry picked from commit 86dfa6324f)

Parent PR: #24986

Closes scylladb/scylladb#25067

* github.com:scylladb/scylladb:
  test: cdc: add test_cdc_with_alter
  cdc: throw error if column doesn't exist
2025-07-21 11:18:06 +02:00
Jenkins Promoter
fc7a6b66e2 Update ScyllaDB version to: 2025.3.0-rc2 2025-07-20 15:44:21 +03:00
Michael Litvak
594ec7d66d test: cdc: add test_cdc_with_alter
Add a test that tests adding and dropping a column to a table with CDC
enabled while writing to it.

(cherry picked from commit 86dfa6324f)
2025-07-20 09:04:00 +02:00
Michael Litvak
338ff18dfe cdc: throw error if column doesn't exist
in the CDC log transformer, when creating a CDC mutation based on some
base table mutation, for each value of a base column we set the value in
the CDC column with the same name.

When looking up the column in the CDC schema by name, we may get a null
pointer if a column by that name is not found. This shouldn't happen
normally because the base schema and CDC schema should be compatible,
and for each base column there should be a CDC column with the same
name.

However, there are scenarios where the base schema and CDC schema are
incompatible for a short period of time when they are being altered.
When a base column is being added or dropped, we could get a base
mutation with this column set, and then the CDC transformer picks up the
latest CDC schema which doesn't have this column.

If such thing happens, we fix the code to throw an exception instead of
crashing on null pointer dereference. Currently we don't have a safer
approach to handle this, but this might be changed in the future. The
other alternative is dropping that data silently which we prefer not to
do.

Throwing an error is acceptable because this scenario most likely
indicates this behavior by the user:
* The user adds a new column, and start writing values to the column
  before the ALTER is complete. or,
* The user drops a column, and continues writing values to the column
  while it's being dropped.

Both cases might as well fail with an error because the column is not
found in the base table.

Fixes scylladb/scylladb#24952

(cherry picked from commit b336f282ae)
2025-07-18 10:36:44 +00:00
Tomasz Grabiec
888e92c969 streaming: Avoid deadlock by running view checks in a separate scheduling group
This issue happens with removenode, when RBNO is disabled, so range
streamer is used.

The deadlock happens in a scenario like this:
1. Start 3 nodes: {A, B, C}, RF=2
2. Node A is lost
3. removenode A
4. Both B and C gain ownership of ranges.
5. Streaming sessions are started with crossed directions: B->C, C->B

Readers created by sender side exhaust streaming semaphore on B and C.
Receiver side attempts to obtain a permit indirectly by calling
check_needs_view_update_path(), which reads local tables. That read is
blocked and times-out, causing streaming to fail. The streaming writer
is already using a tracking-only permit.

To avoid that, run the query under a different scheduling group, which
translates to the system semaphore instead of the maintenance
semaphore, to break the dependency. The gossip group was chosen
because it shouldn't be contended and this change should not interfere
with it much.

Fixes: #24807
(cherry picked from commit dff2b01237)
2025-07-17 17:25:44 +00:00
Tomasz Grabiec
f424c773a4 service: migration_manager: Run group0 barrier in gossip scheduling group
Fixes two issues.

One is potential priority inversion. The barrier will be executed
using scheduling group of the first fiber which triggers it, the rest
will block waiting on it. For example, CQL statements which need to
sync the schema on replica side can block on the barrier triggered by
streaming. That's undesirable. This is theoretical, not proved in the
field.

The second problem is blocking the error path. This barrier is called
from the streaming error handling path. If the streaming concurrency
semaphore is exhausted, and streaming fails due to timeout on
obtaining the permit in check_needs_view_update_path(), the error path
will block too because it will also attempt to obtain the permit as
part of the group0 barrier. Running it in the gossip scheduling group
prevents this.

Fixes #24925

(cherry picked from commit ee2fa58bd6)
2025-07-17 17:25:44 +00:00
Piotr Dulikowski
e49b312be9 auth: fix crash when migration code runs parallel with raft upgrade
The functions password_authenticator::start and
standard_role_manager::start have a similar structure: they spawn a
fiber which invokes a callback that performs some migration until that
migration succeeds. Both handlers set a shared promise called
_superuser_created_promise (those are actually two promises, one for the
password authenticator and the other for the role manager).

The handlers are similar in both cases. They check if auth is in legacy
mode, and behave differently depending on that. If in legacy mode, the
promise is set (if it was not set before), and some legacy migration
actions follow. In auth-on-raft mode, the superuser is attempted to be
created, and if it succeeds then the promise is _unconditionally_ set.

While it makes sense at a glance to set the promise unconditionally,
there is a non-obvious corner case during upgrade to topology on raft.
During the upgrade, auth switches from the legacy mode to auth on raft
mode. Thus, if the callback didn't succeed in legacy mode and then tries
to run in auth-on-raft mode and succeds, it will unconditionally set a
promise that was already set - this is a bug and triggers an assertion
in seastar.

Fix the issue by surrounding the `shared_promise::set_value` call with
an `if` - like it is already done for the legacy case.

Fixes: scylladb/scylladb#24975

Closes scylladb/scylladb#24976

(cherry picked from commit a14b7f71fe)

Closes scylladb/scylladb#25019
2025-07-17 13:32:35 +02:00
Ernest Zaslavsky
549d139e84 sstables: Start using make_data_or_index_source in sstable
Convert all necessary methods to be awaitable. Start using `make_data_or_index_source`
when creating data_source for data and index components.

For proper working of compressed/checksummed input streams, start passing
stream creator functors to `make_(checksummed/compressed)_file_(k_l/m)_format_input_stream`.

(cherry picked from commit 8d49bb8af2)
2025-07-16 12:45:58 +00:00
Ernest Zaslavsky
4a47262167 sstables: refactor readers and sources to use coroutines
Refactor readers and sources to support coroutine usage in
preparation for integration with `make_data_or_index_source`.
Move coroutine-based member initialization out of constructors
where applicable, and defer initialization until first use.

(cherry picked from commit dff9a229a7)
2025-07-16 12:45:58 +00:00
Ernest Zaslavsky
81d356315b sstables: coroutinize futurized readers
Coroutinize futurized readers and sources to get ready for using `make_data_or_index_source` in `sstable`

(cherry picked from commit 8ac2978239)
2025-07-16 12:45:58 +00:00
Ernest Zaslavsky
4ffd72e597 sstables: add make_data_or_index_source to the storage
Add `make_data_or_index_source` to the `storage` interface, implement it
for `filesystem_storage` storage which just creates `data_source` from a
file and for the `s3_storage` create a (maybe) decrypting source from s3
make_download_source.

This change should solve performance improvement for reading large objects
from S3 and should not affect anything for the `filesystem_storage`.

(cherry picked from commit 0de61f56a2)
2025-07-16 12:45:58 +00:00
Ernest Zaslavsky
8998f221ab encryption: refactor key retrieval
Get the encryption schema extension retrieval code out of
`wrap_file` method to make it reusable elsewhere

(cherry picked from commit 7e5e3c5569)
2025-07-16 12:45:58 +00:00
Ernest Zaslavsky
243ba1fb66 encryption: add encrypted_data_source class
Introduce the `encrypted_data_source` class that wraps an existing data
source to read and decrypt data on the fly using block encryption. Also add
unit tests to verify correct decryption behavior.
NOTE: The wrapped source MUST read from offset 0, `encrypted_data_source` assumes it is

Co-authored-by: Calle Wilund <calle@scylladb.com>
(cherry picked from commit 211daeaa40)
2025-07-16 12:45:58 +00:00
Patryk Jędrzejczak
7caacf958b test: test_zero_token_nodes_multidc: properly handle reads with CL=ONE
The test could fail with RF={DC1: 2, DC2: 0} and CL=ONE when:
- both writes succeeded with the same replica responding first,
- one of the following reads succeeded with the other replica
  responding before it applied mutations from any of the writes.

We fix the test by not expecting reads with CL=ONE to return a row.

We also harden the test by inserting different rows for every pair
(CL, coordinator), where one of the two coordinators is a normal
node from DC1, and the other one is a zero-token node from DC2.
This change makes sure that, for example, every write really
inserts a row.

Fixes scylladb/scylladb#22967

The fix addresses CI flakiness and only changes the test, so it
should be backported.

Closes scylladb/scylladb#23518

(cherry picked from commit 21edec1ace)

Closes scylladb/scylladb#24985
2025-07-15 15:47:43 +02:00
Botond Dénes
489e4fdb4e Merge '[Backport 2025.3] S3 chunked download source bug fixes' from Scylladb[bot]
- Fix missing negation in the `if` in the background downloading fiber
- Add test to catch this case
- Improve the s3 proxy to inject errors if the same resource requested more than once
- Suppress client retry since retrying the same request when each produces multiple buffers may lead to the same data appear more than once in the buffer deque
- Inject exception from the test to simulate response callback failure in the middle

No need to backport anything since this class in not used yet

- (cherry picked from commit f1d0690194)

- (cherry picked from commit e73b83e039)

- (cherry picked from commit 6d9cec558a)

- (cherry picked from commit ec59fcd5e4)

- (cherry picked from commit c75acd274c)

- (cherry picked from commit d2d69cbc8c)

- (cherry picked from commit e50f247bf1)

- (cherry picked from commit 49e8c14a86)

- (cherry picked from commit a5246bbe53)

- (cherry picked from commit acf15eba8e)

Parent PR: #24657

Closes scylladb/scylladb#24943

* github.com:scylladb/scylladb:
  s3_test: Add s3_client test for non-retryable error handling
  s3_test: Add trace logging for default_retry_strategy
  s3_client: Fix edge case when the range is exhausted
  s3_client: Fix indentation in try..catch block
  s3_client: Stop retries in chunked download source
  s3_client: Enhance test coverage for retry logic
  s3_client: Add test for Content-Range fix
  s3_client: Fix missing negation
  s3_client: Refine logging
  s3_client: Improve logging placement for current_range output
2025-07-15 15:28:48 +03:00
Michael Litvak
26738588db tablets: stop storage group on deallocation
When a tablet transitions to a post-cleanup stage on the leaving replica
we deallocate its storage group. Before the storage can be deallocated
and destroyed, we must make sure it's cleaned up and stopped properly.

Normally this happens during the tablet cleanup stage, when
table::cleanup_table is called, so by the time we transition to the next
stage the storage group is already stopped.

However, it's possible that tablet cleanup did not run in some scenario:
1. The topology coordinator runs tablet cleanup on the leaving replica.
2. The leaving replica is restarted.
3. When the leaving replica starts, still in `cleanup` stage, it
   allocates a storage group for the tablet.
4. The topology coordinator moves to the next stage.
5. The leaving replica deallocates the storage group, but it was not
   stopped.

To address this scenario, we always stop the storage group when
deallocating it. Usually it will be already stopped and complete
immediately, and otherwise it will be stopped in the background.

Fixes scylladb/scylladb#24857
Fixes scylladb/scylladb#24828

Closes scylladb/scylladb#24896

(cherry picked from commit fa24fd7cc3)

Closes scylladb/scylladb#24909
2025-07-15 13:14:35 +03:00
Aleksandra Martyniuk
f69f59afbd repair: Reduce max row buf size when small table optimization is on
If small_table_optimization is on, a repair works on a whole table
simultaneously. It may be distributed across the whole cluster and
all nodes might participate in repair.

On a repair master, row buffer is copied for each repair peer.
This means that the memory scales with the number of peers.

In large clusters, repair with small_table_optimization leads to OOM.

Divide the max_row_buf_size by the number of repair peers if
small_table_optimization is on.

Use max_row_buf_size to calculate number of units taken from mem_sem.

Fixes: https://github.com/scylladb/scylladb/issues/22244.

Closes scylladb/scylladb#24868

(cherry picked from commit 17272c2f3b)

Closes scylladb/scylladb#24907
2025-07-15 13:13:49 +03:00
Łukasz Paszkowski
e1e0c721e7 test.py: Fix test_compactionhistory_rows_merged_time_window_compaction_strategy
The test has two major problems
1. Wrongly computed time windows. Data was not spread across two 1-minute
   windows causing the test to generate even three sstables instead
   of two
2. Timestamp was not propagated to the prepared CQL statements. So
   in fact, a current time was used implicitly
3. Because of the incorrect timestamp issue, the remaining tests
   testing purged tombstones were affected as well.

Fixes https://github.com/scylladb/scylladb/issues/24532

Closes scylladb/scylladb#24609

(cherry picked from commit a22d1034af)

Closes scylladb/scylladb#24791
2025-07-15 13:12:39 +03:00
Yaron Kaikov
05a6d4da23 dist/common/scripts/scylla_sysconfig_setup: fix SyntaxWarning: invalid escape sequence
There are invalid escape sequence warnings where raw strings should be used for the regex patterns

Fixes: https://github.com/scylladb/scylladb/issues/24915

Closes scylladb/scylladb#24916

(cherry picked from commit fdcaa9a7e7)

Closes scylladb/scylladb#24970
2025-07-15 11:01:28 +02:00
Yaron Kaikov
1e1aeed3cd auto-backport.py: Avoid bot push to existing backport branches
Changed the backport logic so that the bot only pushes the backport branch if it does not already exist in the remote fork.
If the branch exists, the bot skips the push, allowing only users to update (force-push) the branch after the backport PR is open.

Fixes: https://github.com/scylladb/scylladb/issues/24953

Closes scylladb/scylladb#24954

(cherry picked from commit ed7c7784e4)

Closes scylladb/scylladb#24969
2025-07-15 10:25:30 +02:00
Jenkins Promoter
af10d6f03b Update pgo profiles - aarch64 2025-07-15 05:21:25 +03:00
Jenkins Promoter
0d3742227d Update pgo profiles - x86_64 2025-07-15 04:58:36 +03:00
Yaron Kaikov
c6987e3fed packaging: add ps command to dependancies
ScyllaDB container image doesn't have ps command installed, while this command is used by perftune.py script shipped within the same image. This breaks node and container tuning in Scylla Operator.

Fixes: #24827

Closes scylladb/scylladb#24830

(cherry picked from commit 66ff6ab6f9)

Closes scylladb/scylladb#24956
2025-07-14 14:19:17 +03:00
Ernest Zaslavsky
873c8503cd s3_test: Add s3_client test for non-retryable error handling
Introduce a test that injects a non-retryable error and verifies
that the chunked download source throws an exception as expected.

(cherry picked from commit acf15eba8e)
2025-07-13 13:17:14 +00:00
Ernest Zaslavsky
dbf4bd162e s3_test: Add trace logging for default_retry_strategy
Introduce trace-level logging for `default_retry_strategy` in
`s3_test` to improve visibility into retry logic during test
execution.

(cherry picked from commit a5246bbe53)
2025-07-13 13:17:14 +00:00
Ernest Zaslavsky
7f303bfda3 s3_client: Fix edge case when the range is exhausted
Handle case where the download loop exits after consuming all data,
but before receiving an empty buffer signaling EOF. Without this, the
next request is sent with a non-zero offset and zero length, resulting
in "Range request cannot be satisfied" errors. Now, an empty buffer is
pushed to indicate completion and exit the fiber properly.

(cherry picked from commit 49e8c14a86)
2025-07-13 13:17:14 +00:00
Ernest Zaslavsky
22739df69f s3_client: Fix indentation in try..catch block
Correct indentation in the `try..catch` block to improve code
readability and maintain consistent formatting.

(cherry picked from commit e50f247bf1)
2025-07-13 13:17:14 +00:00
Ernest Zaslavsky
54db6ca088 s3_client: Stop retries in chunked download source
Disable retries for S3 requests in the chunked download source to
prevent duplicate chunks from corrupting the buffer queue. The
response handler now throws an exception to bypass the retry
strategy, allowing the next range to be attempted cleanly.

This exception is only triggered for retryable errors; unretryable
ones immediately halt further requests.

(cherry picked from commit d2d69cbc8c)
2025-07-13 13:17:14 +00:00
Ernest Zaslavsky
c841ffe398 s3_client: Enhance test coverage for retry logic
Extend the S3 proxy to support error injection when the client
makes multiple requests to the same resource—useful for testing
retry behavior and failure handling.

(cherry picked from commit c75acd274c)
2025-07-13 13:17:14 +00:00
Ernest Zaslavsky
c748a97170 s3_client: Add test for Content-Range fix
Introduce a test that accurately verifies the Content-Range
behavior, ensuring the previous fix is properly validated.

(cherry picked from commit ec59fcd5e4)
2025-07-13 13:17:14 +00:00
Ernest Zaslavsky
00f10e7f1d s3_client: Fix missing negation
Restore a missing `not` in a conditional check that caused
incorrect behavior during S3 client execution.

(cherry picked from commit 6d9cec558a)
2025-07-13 13:17:14 +00:00
Ernest Zaslavsky
4cd1792528 s3_client: Refine logging
Fix typo in log message to improve clarity and accuracy during
S3 operations.

(cherry picked from commit e73b83e039)
2025-07-13 13:17:14 +00:00
Ernest Zaslavsky
115e8c85e4 s3_client: Improve logging placement for current_range output
Relocated logging to occur after determining the `current_range`,
ensuring more relevant output during S3 client operations.

(cherry picked from commit f1d0690194)
2025-07-13 13:17:14 +00:00
Gleb Natapov
087d3bb957 api: unregister raft_topology_get_cmd_status on shutdown
In c8ce9d1c60 we introduced
raft_topology_get_cmd_status REST api but the commit forgot to
unregister the handler during shutdown.

Fixes #24910

Closes scylladb/scylladb#24911

(cherry picked from commit 89f2edf308)

Closes scylladb/scylladb#24923
2025-07-13 15:15:52 +03:00
Avi Kivity
f3297824e3 Revert "config: decrease default large allocation warning threshold to 128k"
This reverts commit 04fb2c026d. 2025.3 got
the reduced threshold, but won't get many of the fixes the warning will
generate, leaving it very noisy. Better to avoid the noise for this release.

Fixes #24384.
2025-07-10 14:12:14 +03:00
Avi Kivity
4eb220d3ab service: tablet_allocator: avoid large contiguous vector in make_repair_plan()
make_repair_plan() allocates a temporary vector which can grow larger
than our 128k basic allocation unit. Use a chunked vector to avoid
stalls due to large allocations.

Fixes #24713.

Closes scylladb/scylladb#24801

(cherry picked from commit 0138afa63b)

Closes scylladb/scylladb#24902
2025-07-10 12:41:35 +03:00
Patryk Jędrzejczak
c9de7d68f2 Merge '[Backport 2025.3] Make it easier to debug stuck raft topology operation.' from Scylladb[bot]
The series adds more logging and provides new REST api around topology command rpc execution to allow easier debugging of stuck topology operations.

Backport since we want to have in the production as quick as possible.

Fixes #24860

- (cherry picked from commit c8ce9d1c60)

- (cherry picked from commit 4e6369f35b)

Parent PR: #24799

Closes scylladb/scylladb#24881

* https://github.com/scylladb/scylladb:
  topology coordinator: log a start and an end of topology coordinator command execution at info level
  topology coordinator: add REST endpoint to query the status of ongoing topology cmd rpc
2025-07-09 12:55:48 +02:00
Piotr Dulikowski
b535f44db2 Merge '[Backport 2025.3] batchlog_manager: abort replay of a failed batch on shutdown or node down' from Scylladb[bot]
When replaying a failed batch and sending the mutation to all replicas, make the write response handler cancellable and abort it on shutdown or if some target is marked down. also set a reasonable timeout so it gets aborted if it's stuck for some other unexpected reason.

Previously, the write response handler is not cancellable and has no timeout. This can cause a scenario where some write operation by the batchlog manager is stuck indefinitely, and node shutdown gets stuck as well because it waits for the batchlog manager to complete, without aborting the operation.

backport to relevant versions since the issue can cause node shutdown to hang

Fixes scylladb/scylladb#24599

- (cherry picked from commit 8d48b27062)

- (cherry picked from commit fc5ba4a1ea)

- (cherry picked from commit 7150632cf2)

- (cherry picked from commit 74a3fa9671)

- (cherry picked from commit a9b476e057)

- (cherry picked from commit d7af26a437)

Parent PR: #24595

Closes scylladb/scylladb#24882

* github.com:scylladb/scylladb:
  test: test_batchlog_manager: batchlog replay includes cdc
  test: test_batchlog_manager: test batch replay when a node is down
  batchlog_manager: set timeout on writes
  batchlog_manager: abort writes on shutdown
  batchlog_manager: create cancellable write response handler
  storage_proxy: add write type parameter to mutate_internal
2025-07-08 12:35:55 +02:00
Michael Litvak
ec1dd1bf31 test: test_batchlog_manager: batchlog replay includes cdc
Add a new test that verifies that when replaying batch mutations from
the batchlog, the mutations include cdc augmentation if needed.

This is done in order to verify that it works currently as expected and
doesn't break in the future.

(cherry picked from commit d7af26a437)
2025-07-08 06:25:36 +00:00
Michael Litvak
7b30f487dd test: test_batchlog_manager: test batch replay when a node is down
Add a test of the batchlog manager replay loop applying failed batches
while some replica is down.

The test reproduces an issue where the batchlog manager tries to replay
a failed batch, doesn't get a response from some replica, and becomes
stuck.

It verifies that the batchlog manager can eventually recover from this
situation and continue applying failed batches.

(cherry picked from commit a9b476e057)
2025-07-08 06:25:36 +00:00
Michael Litvak
c3c489d3d4 batchlog_manager: set timeout on writes
Set a timeout on writes of replayed batches by the batchlog manager.

We want to avoid having infinite timeout for the writes in case it gets
stuck for some unexpected reason.

The timeout is set to be high enough to allow any reasonable write to
complete.

(cherry picked from commit 74a3fa9671)
2025-07-08 06:25:36 +00:00
Michael Litvak
6fb6bb8dc7 batchlog_manager: abort writes on shutdown
On shutdown of batchlog manager, abort all writes of replayed batches
by the batchlog manager.

To achieve this we set the appropriate write_type to BATCH, and on
shutdown cancel all write handlers with this type.

(cherry picked from commit 7150632cf2)
2025-07-08 06:25:36 +00:00
Michael Litvak
02c038efa8 batchlog_manager: create cancellable write response handler
When replaying a batch mutation from the batchlog manager and sending it
to all replicas, create the write response handler as cancellable.

To achieve this we define a new wrapper type for batchlog mutations -
batchlog_replay_mutation, and this allows us to overload
create_write_response_handler for this type. This is similar to how it's
done with hint_wrapper and read_repair_mutation.

(cherry picked from commit fc5ba4a1ea)
2025-07-08 06:25:36 +00:00
Michael Litvak
d3175671b7 storage_proxy: add write type parameter to mutate_internal
Currently mutate_internal has a boolean parameter `counter_write` that
indicates whether the write is of counter type or not.

We replace it with a more general parameter that allows to indicate the
write type.

It is compatible with the previous behavior - for a counter write, the
type COUNTER is passed, and otherwise a default value will be used
as before.

(cherry picked from commit 8d48b27062)
2025-07-08 06:25:36 +00:00
Gleb Natapov
4651c44747 topology coordinator: log a start and an end of topology coordinator command execution at info level
Those calls a relatively rare and the output may help to analyze issues
in production.

(cherry picked from commit 4e6369f35b)
2025-07-08 06:24:22 +00:00
Gleb Natapov
0e67f6f6c2 topology coordinator: add REST endpoint to query the status of ongoing topology cmd rpc
The topology coordinator executes several topology cmd rpc against some nodes
during a topology change. A topology operation will not proceed unless
rpc completes (successfully or not), but sometimes it appears that it
hangs and it is hard to tell on which nodes it did not complete yet.
Introduce new REST endpoint that can help with debugging such cases.
If executed on the topology coordinator it returns currently running
topology rpc (if any) and a list of nodes that did not reply yet.

(cherry picked from commit c8ce9d1c60)
2025-07-08 06:24:21 +00:00
Avi Kivity
859d9dd3b1 Merge '[Backport 2025.3] Improve background disposal of tablet_metadata' from Scylladb[bot]
As seen in #23284, when the tablet_metadata contains many tables, even empty ones,
we're seeing a long queue of seastar tasks coming from the individual destruction of
`tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>`.

This change improves `tablet_metadata::clear_gently` to destroy the `tablet_map_ptr` objects
on their owner shard by sorting them into vectors, per- owner shard.

Also, background call to clear_gently was added to `~token_metadata`, as it is destroyed
arbitrarily when automatic token_metadata_ptr variables go out of scope, so that the
contained tablet_metadata would be cleared gently.

Finally, a unit test was added to reproduce the `Too long queue accumulated for gossip` symptom
and verify that it is gone with this change.

Fixes #24814
Refs #23284

This change is not marked as fixing the issue since we still need to verify that there is no impact on query performance, reactor stalls, or large allocations, with a large number of tablet-based tables.

* Since the issue exists in 2025.1, requesting backport to 2025.1 and upwards

- (cherry picked from commit 3acca0aa63)

- (cherry picked from commit 493a2303da)

- (cherry picked from commit e0a19b981a)

- (cherry picked from commit 2b2cfaba6e)

- (cherry picked from commit 2c0bafb934)

- (cherry picked from commit 4a3d14a031)

- (cherry picked from commit 6e4803a750)

Parent PR: #24618

Closes scylladb/scylladb#24864

* github.com:scylladb/scylladb:
  token_metadata_impl: clear_gently: release version tracker early
  test: cluster: test_tablets_merge: add test_tablet_split_merge_with_many_tables
  token_metadata: clear_and_destroy_impl when destroyed
  token_metadata: keep a reference to shared_token_metadata
  token_metadata: move make_token_metadata_ptr into shared_token_metadata class
  replica: database: get and expose a mutable locator::shared_token_metadata
  locator: tablets: tablet_metadata: clear_gently: optimize foreign ptr destruction
2025-07-07 14:02:19 +03:00
Gleb Natapov
a25bd068bf topology coordinator: do not set request_type field for truncation command if topology_global_request_queue feature is not enabled yet
Old nodes do not expect global topology request names to be in
request_type field, so set it only if a cluster is fully upgraded
already.

Closes scylladb/scylladb#24731

(cherry picked from commit ca7837550d)

Closes scylladb/scylladb#24833
2025-07-07 11:50:55 +02:00
Benny Halevy
9bc487e79e token_metadata_impl: clear_gently: release version tracker early
No need to wait for all members to be cleared gently.
We can release the version earlier since the
held version may be awaited for in barriers.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 6e4803a750)
2025-07-07 09:42:29 +03:00
Benny Halevy
41dc86ffa8 test: cluster: test_tablets_merge: add test_tablet_split_merge_with_many_tables
Reproduces #23284

Currently skipped in release mode since it requires
the `short_tablet_stats_refresh_interval` interval.
Ref #24641

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 4a3d14a031)
2025-07-07 09:42:26 +03:00
Benny Halevy
f78a352a29 token_metadata: clear_and_destroy_impl when destroyed
We have a lot of places in the code where
a token_metadata_ptr is kept in an automatic
variable and destroyed when it leaves the scope.
since it's a referenced counted lw_shared_ptr,
the token_metadata object is rarely destroyed in
those cases, but when it is, it doesn't go through
clear_gently, and in particular its tablet_metadata
is not cleared gently, leading to inefficient destruction
of potentially many foreign_ptr:s.

This patch calls clear_and_destroy_impl that gently
clears and destroys the impl object in the background
using the shared_token_metadata.

Fixes #13381

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 2c0bafb934)
2025-07-07 09:38:17 +03:00
Benny Halevy
b647dbd547 token_metadata: keep a reference to shared_token_metadata
To be used by a following patch to gently clean and destroy
the token_data_impl in the background.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 2b2cfaba6e)
2025-07-07 09:34:10 +03:00
Benny Halevy
0e7d3b4eb9 token_metadata: move make_token_metadata_ptr into shared_token_metadata class
So we can use the local shared_token_metadata instance
for safe background destroy of token_metadata_impl:s.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit e0a19b981a)
2025-07-07 09:30:01 +03:00
Benny Halevy
c8043e05c1 replica: database: get and expose a mutable locator::shared_token_metadata
Prepare for next patch, the will use this shared_token_metadata
to make mutable_token_metadata_ptr:s

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 493a2303da)
2025-07-07 09:27:06 +03:00
Benny Halevy
54fb9ed03b locator: tablets: tablet_metadata: clear_gently: optimize foreign ptr destruction
Sort all tablet_map_ptr:s by shard_id
and then destroy them on each shard to prevent
long cross-shard task queues for foreign_ptr destructions.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 3acca0aa63)
2025-07-07 09:27:01 +03:00
Avi Kivity
f60c54df77 storage_proxy: avoid large allocation when storing batch in system.batchlog
Currently, when computing the mutation to be stored in system.batchlog,
we go through data_value. In turn this goes through `bytes` type
(#24810), so it causes a large contiguous allocation if the batch is
large.

Fix by going through the more primitive, but less contiguous,
atomic_cell API.

Fixes #24809.

Closes scylladb/scylladb#24811

(cherry picked from commit 60f407bff4)

Closes scylladb/scylladb#24846
2025-07-05 00:37:09 +03:00
Patryk Jędrzejczak
f1ec51133e docs: handling-node-failures: fix typo
Replacing "from" is incorrect. The typo comes from recently
merged #24583.

Fixes #24732

Requires backport to 2025.2 since #24583 has been backported to 2025.2.

Closes scylladb/scylladb#24733

(cherry picked from commit fa982f5579)

Closes scylladb/scylladb#24832
2025-07-04 19:35:00 +02:00
Jenkins Promoter
648fe6a4e8 Update ScyllaDB version to: 2025.3.0-rc1 2025-07-03 11:35:01 +03:00
Michał Chojnowski
1bd536a228 utils/alien_worker: fix a data race in submit()
We move a `seastar::promise` on the external worker thread,
after the matching `seastar::future` was returned to the shard.

That's illegal. If the `promise` move occurs concurrently with some
operation (move, await) on the `future`, it becomes a data race
which could cause various kinds of corruption.

This patch fixes that by keeping the promise at a stable address
on the shard (inside a coroutine frame) and only passing through
the worker.

Fixes #24751

Closes scylladb/scylladb#24752

(cherry picked from commit a29724479a)

Closes scylladb/scylladb#24780
2025-07-03 10:45:51 +03:00
Avi Kivity
d5b11098e8 repair: row_level: unstall to_repair_rows_on_wire() destroying its input
to_repair_rows_on_wire() moves the contents of its input std::list
and is careful to yield after each element, but the final destruction
of the input list still deals with all of the list elements without
yielding. This is expensive as not all contents of repair_row are moved
(_dk_with_hash is of type lw_shared_ptr<const decorated_key_with_hash>).

To fix, destroy each row element as we move along. This is safe as we
own the input and don't reference row_list other than for the iteration.

Fixes #24725.

Closes scylladb/scylladb#24726

(cherry picked from commit 6aa71205d8)

Closes scylladb/scylladb#24771
2025-07-03 10:44:58 +03:00
Tomasz Grabiec
775916132e Merge '[Backport 2025.3] repair: postpone repair until topology is not busy ' from Scylladb[bot]
Currently, repair_service::repair_tablets starts repair if there
is no ongoing tablet operations. The check does not consider global
topology operations, like tablet resize finalization.

Hence, if:
- topology is in the tablet_resize_finalization state;
- repair starts (as there is no tablet transitions) and holds the erm;
- resize finalization finishes;

then the repair sees a topology state different than the actual -
it does not see that the storage groups were already split.
Repair code does not handle this case and it results with
on_internal_error.

Start repair when topology is not busy. The check isn't atomic,
as it's done on a shard 0. Thus, we compare the topology versions
to ensure that the business check is valid.

Fixes: https://github.com/scylladb/scylladb/issues/24195.

Needs backport to all branches since they are affected

- (cherry picked from commit df152d9824)

- (cherry picked from commit 83c9af9670)

Parent PR: #24202

Closes scylladb/scylladb#24783

* github.com:scylladb/scylladb:
  test: add test for repair and resize finalization
  repair: postpone repair until topology is not busy
2025-07-02 13:17:08 +02:00
Calle Wilund
46e3794bde encryption_at_rest_test: Add exception handler to ensure proxy stop
If boost test is run such that we somehow except even in a test macro
such as BOOST_REQUIRE_THROW, we could end up not stopping the net proxy
used, causing a use after free.

(cherry picked from commit 8d37e5e24b)
2025-07-02 10:13:08 +00:00
Calle Wilund
b7a82898f0 encryption: Ensure stopping timers in provider cache objects
utils::loading cache has a timer that can, if we're unlucky, be runnnig
while the encryption context/extensions referencing the various host
objects containing them are destroyed in the case of unit testing.

Add a stop phase in encryption context shutdown closing the caches.

(cherry picked from commit ee98f5d361)
2025-07-02 10:13:08 +00:00
Jenkins Promoter
76bf279e0e Update pgo profiles - aarch64 2025-07-02 13:06:18 +03:00
Jenkins Promoter
61364624e3 Update pgo profiles - x86_64 2025-07-02 12:34:58 +03:00
Botond Dénes
6e6c00dcfe docs: cql/types.rst: remove reference to frozen-only UDTs
ScyllaDB supports non-frozen UDTs since 3.2, no need to keep referencing
this limitation in the current docs. Replace the description of the
limitation with general description of frozen semantics for UDTs.

Fixes: #22929

Closes scylladb/scylladb#24763

(cherry picked from commit 37ef9efb4e)

Closes scylladb/scylladb#24784
2025-07-02 12:11:25 +03:00
Aleksandra Martyniuk
c26eb8ef14 test: add test for repair and resize finalization
Add test that checks whether repair does not start if there is an
ongoing resize finalization.

(cherry picked from commit 83c9af9670)
2025-07-01 20:26:53 +00:00
Aleksandra Martyniuk
8a1d09862e repair: postpone repair until topology is not busy
Currently, repair_service::repair_tablets starts repair if there
is no ongoing tablet operations. The check does not consider global
topology operations, like tablet resize finalization. This may cause
a data race and unexpected behavior.

Start repair when topology is not busy.

(cherry picked from commit df152d9824)
2025-07-01 20:26:53 +00:00
Yaron Kaikov
e64bb3819c Update ScyllaDB version to: 2025.3.0-rc0 2025-07-01 10:34:39 +03:00
291 changed files with 7402 additions and 2207 deletions

View File

@@ -112,10 +112,15 @@ def backport(repo, pr, version, commits, backport_base_branch, is_collaborator):
is_draft = True
repo_local.git.add(A=True)
repo_local.git.cherry_pick('--continue')
repo_local.git.push(fork_repo, new_branch_name, force=True)
create_pull_request(repo, new_branch_name, backport_base_branch, pr, backport_pr_title, commits,
is_draft, is_collaborator)
# Check if the branch already exists in the remote fork
remote_refs = repo_local.git.ls_remote('--heads', fork_repo, new_branch_name)
if not remote_refs:
# Branch does not exist, create it with a regular push
repo_local.git.push(fork_repo, new_branch_name)
create_pull_request(repo, new_branch_name, backport_base_branch, pr, backport_pr_title, commits,
is_draft, is_collaborator)
else:
logging.info(f"Remote branch {new_branch_name} already exists in fork. Skipping push.")
except GitCommandError as e:
logging.warning(f"GitCommandError: {e}")

2
.gitmodules vendored
View File

@@ -1,6 +1,6 @@
[submodule "seastar"]
path = seastar
url = ../seastar
url = ../scylla-seastar
ignore = dirty
[submodule "swagger-ui"]
path = swagger-ui

View File

@@ -78,7 +78,7 @@ fi
# Default scylla product/version tags
PRODUCT=scylla
VERSION=2025.3.0-dev
VERSION=2025.3.3
if test -f version
then

View File

@@ -38,7 +38,6 @@
#include <optional>
#include "utils/assert.hh"
#include "utils/overloaded_functor.hh"
#include <seastar/json/json_elements.hh>
#include "collection_mutation.hh"
#include "schema/schema.hh"
#include "db/tags/extension.hh"
@@ -121,47 +120,50 @@ static lw_shared_ptr<stats> get_stats_from_schema(service::storage_proxy& sp, co
}
}
make_jsonable::make_jsonable(rjson::value&& value)
: _value(std::move(value))
{}
std::string make_jsonable::to_json() const {
return rjson::print(_value);
}
json::json_return_type make_streamed(rjson::value&& value) {
// CMH. json::json_return_type uses std::function, not noncopyable_function.
// Need to make a copyable version of value. Gah.
auto rs = make_shared<rjson::value>(std::move(value));
std::function<future<>(output_stream<char>&&)> func = [rs](output_stream<char>&& os) mutable -> future<> {
// move objects to coroutine frame.
auto los = std::move(os);
auto lrs = std::move(rs);
executor::body_writer make_streamed(rjson::value&& value) {
return [value = std::move(value)](output_stream<char>&& _out) mutable -> future<> {
auto out = std::move(_out);
std::exception_ptr ex;
try {
co_await rjson::print(*lrs, los);
co_await rjson::print(value, out);
} catch (...) {
// at this point, we cannot really do anything. HTTP headers and return code are
// already written, and quite potentially a portion of the content data.
// just log + rethrow. It is probably better the HTTP server closes connection
// abruptly or something...
ex = std::current_exception();
elogger.error("Exception during streaming HTTP response: {}", ex);
}
co_await los.close();
co_await rjson::destroy_gently(std::move(*lrs));
co_await out.close();
co_await rjson::destroy_gently(std::move(value));
if (ex) {
co_await coroutine::return_exception_ptr(std::move(ex));
}
co_return;
};
return func;
}
json_string::json_string(std::string&& value)
: _value(std::move(value))
{}
std::string json_string::to_json() const {
return _value;
// make_streamed_with_extra_array() is variant of make_streamed() above, which
// builds a streaming response (a function writing to an output stream) from a
// JSON object (rjson::value) but adds to it at the end an additional array.
// The extra array is given a separate chunked_vector to avoid putting it
// inside the rjson::value - because RapidJSON does contiguous allocations for
// arrays which we want to avoid for potentially long arrays in Query/Scan
// responses (see #23535).
// If we ever fix RapidJSON to avoid contiguous allocations for arrays, or
// replace it entirely (#24458), we can remove this function and the function
// rjson::print_with_extra_array() which it calls.
executor::body_writer make_streamed_with_extra_array(rjson::value&& value,
std::string array_name, utils::chunked_vector<rjson::value>&& array) {
return [value = std::move(value), array_name = std::move(array_name), array = std::move(array)](output_stream<char>&& _out) mutable -> future<> {
auto out = std::move(_out);
std::exception_ptr ex;
try {
co_await rjson::print_with_extra_array(value, array_name, array, out);
} catch (...) {
ex = std::current_exception();
}
co_await out.close();
co_await rjson::destroy_gently(std::move(value));
// TODO: can/should we also destroy the array gently?
if (ex) {
co_await coroutine::return_exception_ptr(std::move(ex));
}
};
}
// This function throws api_error::validation if input value is not an object.
@@ -764,7 +766,7 @@ future<executor::request_return_type> executor::describe_table(client_state& cli
rjson::value response = rjson::empty_object();
rjson::add(response, "Table", std::move(table_description));
elogger.trace("returning {}", response);
co_return make_jsonable(std::move(response));
co_return rjson::print(std::move(response));
}
// Check CQL's Role-Based Access Control (RBAC) permission_to_check (MODIFY,
@@ -881,7 +883,7 @@ future<executor::request_return_type> executor::delete_table(client_state& clien
rjson::value response = rjson::empty_object();
rjson::add(response, "TableDescription", std::move(table_description));
elogger.trace("returning {}", response);
co_return make_jsonable(std::move(response));
co_return rjson::print(std::move(response));
}
static data_type parse_key_type(std::string_view type) {
@@ -1165,7 +1167,7 @@ future<executor::request_return_type> executor::tag_resource(client_state& clien
co_await db::modify_tags(_mm, schema->ks_name(), schema->cf_name(), [tags](std::map<sstring, sstring>& tags_map) {
update_tags_map(*tags, tags_map, update_tags_action::add_tags);
});
co_return json_string("");
co_return ""; // empty response
}
future<executor::request_return_type> executor::untag_resource(client_state& client_state, service_permit permit, rjson::value request) {
@@ -1186,7 +1188,7 @@ future<executor::request_return_type> executor::untag_resource(client_state& cli
co_await db::modify_tags(_mm, schema->ks_name(), schema->cf_name(), [tags](std::map<sstring, sstring>& tags_map) {
update_tags_map(*tags, tags_map, update_tags_action::delete_tags);
});
co_return json_string("");
co_return ""; // empty response
}
future<executor::request_return_type> executor::list_tags_of_resource(client_state& client_state, service_permit permit, rjson::value request) {
@@ -1212,7 +1214,7 @@ future<executor::request_return_type> executor::list_tags_of_resource(client_sta
rjson::push_back(tags, std::move(new_entry));
}
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
}
struct billing_mode_type {
@@ -1674,7 +1676,7 @@ static future<executor::request_return_type> create_table_on_shard0(service::cli
rjson::value status = rjson::empty_object();
executor::supplement_table_info(request, *schema, sp);
rjson::add(status, "TableDescription", std::move(request));
co_return make_jsonable(std::move(status));
co_return rjson::print(std::move(status));
}
future<executor::request_return_type> executor::create_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request) {
@@ -1951,7 +1953,7 @@ future<executor::request_return_type> executor::update_table(client_state& clien
rjson::value status = rjson::empty_object();
supplement_table_info(request, *schema, p.local());
rjson::add(status, "TableDescription", std::move(request));
co_return make_jsonable(std::move(status));
co_return rjson::print(std::move(status));
});
}
@@ -2417,7 +2419,7 @@ static future<executor::request_return_type> rmw_operation_return(rjson::value&&
if (!attributes.IsNull()) {
rjson::add(ret, "Attributes", std::move(attributes));
}
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
}
static future<std::unique_ptr<rjson::value>> get_previous_item(
@@ -3009,7 +3011,7 @@ future<executor::request_return_type> executor::batch_write_item(client_state& c
rjson::add(ret, "ConsumedCapacity", std::move(consumed_capacity));
}
_stats.api_operations.batch_write_item_latency.mark(std::chrono::steady_clock::now() - start_time);
co_return make_jsonable(std::move(ret));
co_return rjson::print(std::move(ret));
}
static const std::string_view get_item_type_string(const rjson::value& v) {
@@ -3407,16 +3409,16 @@ future<std::vector<rjson::value>> executor::describe_multi_item(schema_ptr schem
shared_ptr<cql3::selection::selection> selection,
foreign_ptr<lw_shared_ptr<query::result>> query_result,
shared_ptr<const std::optional<attrs_to_get>> attrs_to_get,
uint64_t& rcu_half_units) {
noncopyable_function<void(uint64_t)> item_callback) {
cql3::selection::result_set_builder builder(*selection, gc_clock::now());
query::result_view::consume(*query_result, slice, cql3::selection::result_set_builder::visitor(builder, *schema, *selection));
auto result_set = builder.build();
std::vector<rjson::value> ret;
for (auto& result_row : result_set->rows()) {
rjson::value item = rjson::empty_object();
rcu_consumed_capacity_counter consumed_capacity;
describe_single_item(*selection, result_row, *attrs_to_get, item, &consumed_capacity._total_bytes);
rcu_half_units += consumed_capacity.get_half_units();
uint64_t item_length_in_bytes = 0;
describe_single_item(*selection, result_row, *attrs_to_get, item, &item_length_in_bytes);
item_callback(item_length_in_bytes);
ret.push_back(std::move(item));
co_await coroutine::maybe_yield();
}
@@ -4007,9 +4009,6 @@ update_item_operation::apply(std::unique_ptr<rjson::value> previous_item, api::t
}
}
}
if (_returnvalues == returnvalues::ALL_OLD && previous_item) {
_return_attributes = std::move(*previous_item);
}
if (_attribute_updates) {
for (auto it = _attribute_updates->MemberBegin(); it != _attribute_updates->MemberEnd(); ++it) {
// Note that it.key() is the name of the column, *it is the operation
@@ -4119,6 +4118,9 @@ update_item_operation::apply(std::unique_ptr<rjson::value> previous_item, api::t
// don't report the new item in the returned Attributes.
_return_attributes = rjson::null_value();
}
if (_returnvalues == returnvalues::ALL_OLD && previous_item) {
_return_attributes = std::move(*previous_item);
}
// ReturnValues=UPDATED_OLD/NEW never return an empty Attributes field,
// even if a new item was created. Instead it should be missing entirely.
if (_returnvalues == returnvalues::UPDATED_OLD || _returnvalues == returnvalues::UPDATED_NEW) {
@@ -4249,18 +4251,17 @@ future<executor::request_return_type> executor::get_item(client_state& client_st
verify_all_are_used(expression_attribute_names, used_attribute_names, "ExpressionAttributeNames", "GetItem");
rcu_consumed_capacity_counter add_capacity(request, cl == db::consistency_level::LOCAL_QUORUM);
co_await verify_permission(_enforce_authorization, client_state, schema, auth::permission::SELECT);
co_return co_await _proxy.query(schema, std::move(command), std::move(partition_ranges), cl,
service::storage_proxy::coordinator_query_options(executor::default_timeout(), std::move(permit), client_state, trace_state)).then(
[per_table_stats, this, schema, partition_slice = std::move(partition_slice), selection = std::move(selection), attrs_to_get = std::move(attrs_to_get), start_time = std::move(start_time), add_capacity=std::move(add_capacity)] (service::storage_proxy::coordinator_query_result qr) mutable {
per_table_stats->api_operations.get_item_latency.mark(std::chrono::steady_clock::now() - start_time);
_stats.api_operations.get_item_latency.mark(std::chrono::steady_clock::now() - start_time);
uint64_t rcu_half_units = 0;
auto res = make_ready_future<executor::request_return_type>(make_jsonable(describe_item(schema, partition_slice, *selection, *qr.query_result, std::move(attrs_to_get), add_capacity, rcu_half_units)));
per_table_stats->rcu_half_units_total += rcu_half_units;
_stats.rcu_half_units_total += rcu_half_units;
return res;
});
service::storage_proxy::coordinator_query_result qr =
co_await _proxy.query(
schema, std::move(command), std::move(partition_ranges), cl,
service::storage_proxy::coordinator_query_options(executor::default_timeout(), std::move(permit), client_state, trace_state));
per_table_stats->api_operations.get_item_latency.mark(std::chrono::steady_clock::now() - start_time);
_stats.api_operations.get_item_latency.mark(std::chrono::steady_clock::now() - start_time);
uint64_t rcu_half_units = 0;
rjson::value res = describe_item(schema, partition_slice, *selection, *qr.query_result, std::move(attrs_to_get), add_capacity, rcu_half_units);
per_table_stats->rcu_half_units_total += rcu_half_units;
_stats.rcu_half_units_total += rcu_half_units;
co_return rjson::print(std::move(res));
}
static void check_big_object(const rjson::value& val, int& size_left);
@@ -4358,7 +4359,6 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
}
};
std::vector<table_requests> requests;
std::vector<std::vector<uint64_t>> responses_sizes;
uint batch_size = 0;
for (auto it = request_items.MemberBegin(); it != request_items.MemberEnd(); ++it) {
table_requests rs(get_table_from_batch_request(_proxy, it));
@@ -4386,11 +4386,10 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
// If we got here, all "requests" are valid, so let's start the
// requests for the different partitions all in parallel.
std::vector<future<std::vector<rjson::value>>> response_futures;
responses_sizes.resize(requests.size());
size_t responses_sizes_pos = 0;
for (const auto& rs : requests) {
responses_sizes[responses_sizes_pos].resize(rs.requests.size());
size_t pos = 0;
std::vector<uint64_t> consumed_rcu_half_units_per_table(requests.size());
for (size_t i = 0; i < requests.size(); i++) {
const table_requests& rs = requests[i];
bool is_quorum = rs.cl == db::consistency_level::LOCAL_QUORUM;
lw_shared_ptr<stats> per_table_stats = get_stats_from_schema(_proxy, *rs.schema);
per_table_stats->api_operations.batch_get_item_histogram.add(rs.requests.size());
for (const auto &r : rs.requests) {
@@ -4413,16 +4412,17 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
auto command = ::make_lw_shared<query::read_command>(rs.schema->id(), rs.schema->version(), partition_slice, _proxy.get_max_result_size(partition_slice),
query::tombstone_limit(_proxy.get_tombstone_limit()));
command->allow_limit = db::allow_per_partition_rate_limit::yes;
const auto item_callback = [is_quorum, &rcus_per_table = consumed_rcu_half_units_per_table[i]](uint64_t size) {
rcus_per_table += rcu_consumed_capacity_counter::get_half_units(size, is_quorum);
};
future<std::vector<rjson::value>> f = _proxy.query(rs.schema, std::move(command), std::move(partition_ranges), rs.cl,
service::storage_proxy::coordinator_query_options(executor::default_timeout(), permit, client_state, trace_state)).then(
[schema = rs.schema, partition_slice = std::move(partition_slice), selection = std::move(selection), attrs_to_get = rs.attrs_to_get, &response_size = responses_sizes[responses_sizes_pos][pos]] (service::storage_proxy::coordinator_query_result qr) mutable {
[schema = rs.schema, partition_slice = std::move(partition_slice), selection = std::move(selection), attrs_to_get = rs.attrs_to_get, item_callback = std::move(item_callback)] (service::storage_proxy::coordinator_query_result qr) mutable {
utils::get_local_injector().inject("alternator_batch_get_item", [] { throw std::runtime_error("batch_get_item injection"); });
return describe_multi_item(std::move(schema), std::move(partition_slice), std::move(selection), std::move(qr.query_result), std::move(attrs_to_get), response_size);
return describe_multi_item(std::move(schema), std::move(partition_slice), std::move(selection), std::move(qr.query_result), std::move(attrs_to_get), std::move(item_callback));
});
pos++;
response_futures.push_back(std::move(f));
}
responses_sizes_pos++;
}
// Wait for all requests to complete, and then return the response.
@@ -4434,14 +4434,11 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
rjson::value response = rjson::empty_object();
rjson::add(response, "Responses", rjson::empty_object());
rjson::add(response, "UnprocessedKeys", rjson::empty_object());
size_t rcu_half_units;
auto fut_it = response_futures.begin();
responses_sizes_pos = 0;
rjson::value consumed_capacity = rjson::empty_array();
for (const auto& rs : requests) {
for (size_t i = 0; i < requests.size(); i++) {
const table_requests& rs = requests[i];
std::string table = table_name(*rs.schema);
size_t pos = 0;
rcu_half_units = 0;
for (const auto &r : rs.requests) {
auto& pk = r.first;
auto& cks = r.second;
@@ -4456,7 +4453,6 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
for (rjson::value& json : results) {
rjson::push_back(response["Responses"][table], std::move(json));
}
rcu_half_units += rcu_consumed_capacity_counter::get_half_units(responses_sizes[responses_sizes_pos][pos], rs.cl == db::consistency_level::LOCAL_QUORUM);
} catch(...) {
eptr = std::current_exception();
// This read of potentially several rows in one partition,
@@ -4480,8 +4476,8 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
rjson::push_back(response["UnprocessedKeys"][table]["Keys"], std::move(*ck.second));
}
}
pos++;
}
uint64_t rcu_half_units = consumed_rcu_half_units_per_table[i];
_stats.rcu_half_units_total += rcu_half_units;
lw_shared_ptr<stats> per_table_stats = get_stats_from_schema(_proxy, *rs.schema);
per_table_stats->rcu_half_units_total += rcu_half_units;
@@ -4491,7 +4487,6 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
rjson::add(entry, "CapacityUnits", rcu_half_units*0.5);
rjson::push_back(consumed_capacity, std::move(entry));
}
responses_sizes_pos++;
}
if (should_add_rcu) {
@@ -4505,7 +4500,7 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
if (is_big(response)) {
co_return make_streamed(std::move(response));
} else {
co_return make_jsonable(std::move(response));
co_return rjson::print(std::move(response));
}
}
@@ -4649,7 +4644,11 @@ class describe_items_visitor {
const filter& _filter;
typename columns_t::const_iterator _column_it;
rjson::value _item;
rjson::value _items;
// _items is a chunked_vector<rjson::value> instead of a RapidJson array
// (rjson::value) because unfortunately RapidJson arrays are stored
// contiguously in memory, and cause large allocations when a Query/Scan
// returns a long list of short items (issue #23535).
utils::chunked_vector<rjson::value> _items;
size_t _scanned_count;
public:
@@ -4659,7 +4658,6 @@ public:
, _filter(filter)
, _column_it(columns.begin())
, _item(rjson::empty_object())
, _items(rjson::empty_array())
, _scanned_count(0)
{
// _filter.check() may need additional attributes not listed in
@@ -4738,13 +4736,13 @@ public:
rjson::remove_member(_item, attr);
}
rjson::push_back(_items, std::move(_item));
_items.push_back(std::move(_item));
}
_item = rjson::empty_object();
++_scanned_count;
}
rjson::value get_items() && {
utils::chunked_vector<rjson::value> get_items() && {
return std::move(_items);
}
@@ -4753,13 +4751,25 @@ public:
}
};
static future<std::tuple<rjson::value, size_t>> describe_items(const cql3::selection::selection& selection, std::unique_ptr<cql3::result_set> result_set, std::optional<attrs_to_get>&& attrs_to_get, filter&& filter) {
// describe_items() returns a JSON object that includes members "Count"
// and "ScannedCount", but *not* "Items" - that is returned separately
// as a chunked_vector to avoid large contiguous allocations which
// RapidJSON does of its array. The caller should add "Items" to the
// returned JSON object if needed, or print it separately.
// The returned chunked_vector (the items) is std::optional<>, because
// the user may have requested only to count items, and not return any
// items - which is different from returning an empty list of items.
static future<std::tuple<rjson::value, std::optional<utils::chunked_vector<rjson::value>>, size_t>> describe_items(
const cql3::selection::selection& selection,
std::unique_ptr<cql3::result_set> result_set,
std::optional<attrs_to_get>&& attrs_to_get,
filter&& filter) {
describe_items_visitor visitor(selection.get_columns(), attrs_to_get, filter);
co_await result_set->visit_gently(visitor);
auto scanned_count = visitor.get_scanned_count();
rjson::value items = std::move(visitor).get_items();
utils::chunked_vector<rjson::value> items = std::move(visitor).get_items();
rjson::value items_descr = rjson::empty_object();
auto size = items.Size();
auto size = items.size();
rjson::add(items_descr, "Count", rjson::value(size));
rjson::add(items_descr, "ScannedCount", rjson::value(scanned_count));
// If attrs_to_get && attrs_to_get->empty(), this means the user asked not
@@ -4769,10 +4779,11 @@ static future<std::tuple<rjson::value, size_t>> describe_items(const cql3::selec
// In that case, we currently build a list of empty items and here drop
// it. We could just count the items and not bother with the empty items.
// (However, remember that when we do have a filter, we need the items).
std::optional<utils::chunked_vector<rjson::value>> opt_items;
if (!attrs_to_get || !attrs_to_get->empty()) {
rjson::add(items_descr, "Items", std::move(items));
opt_items = std::move(items);
}
co_return std::tuple<rjson::value, size_t>{std::move(items_descr), size};
co_return std::tuple(std::move(items_descr), std::move(opt_items), size);
}
static rjson::value encode_paging_state(const schema& schema, const service::pager::paging_state& paging_state) {
@@ -4810,6 +4821,12 @@ static rjson::value encode_paging_state(const schema& schema, const service::pag
return last_evaluated_key;
}
// RapidJSON allocates arrays contiguously in memory, so we want to avoid
// returning a large number of items as a single rapidjson array, and use
// a chunked_vector instead. The following constant is an arbitrary cutoff
// point for when to switch from a rapidjson array to a chunked_vector.
static constexpr int max_items_for_rapidjson_array = 256;
static future<executor::request_return_type> do_query(service::storage_proxy& proxy,
schema_ptr table_schema,
const rjson::value* exclusive_start_key,
@@ -4882,19 +4899,35 @@ static future<executor::request_return_type> do_query(service::storage_proxy& pr
}
auto paging_state = rs->get_metadata().paging_state();
bool has_filter = filter;
auto [items, size] = co_await describe_items(*selection, std::move(rs), std::move(attrs_to_get), std::move(filter));
auto [items_descr, opt_items, size] = co_await describe_items(*selection, std::move(rs), std::move(attrs_to_get), std::move(filter));
if (paging_state) {
rjson::add(items, "LastEvaluatedKey", encode_paging_state(*table_schema, *paging_state));
rjson::add(items_descr, "LastEvaluatedKey", encode_paging_state(*table_schema, *paging_state));
}
if (has_filter){
cql_stats.filtered_rows_read_total += p->stats().rows_read_total;
// update our "filtered_row_matched_total" for all the rows matched, despited the filter
cql_stats.filtered_rows_matched_total += size;
}
if (is_big(items)) {
co_return executor::request_return_type(make_streamed(std::move(items)));
if (opt_items) {
if (opt_items->size() >= max_items_for_rapidjson_array) {
// There are many items, better print the JSON and the array of
// items (opt_items) separately to avoid RapidJSON's contiguous
// allocation of arrays.
co_return make_streamed_with_extra_array(std::move(items_descr), "Items", std::move(*opt_items));
}
// There aren't many items in the chunked vector opt_items,
// let's just insert them into the JSON object and print the
// full JSON normally.
rjson::value items_json = rjson::empty_array();
for (auto& item : *opt_items) {
rjson::push_back(items_json, std::move(item));
}
rjson::add(items_descr, "Items", std::move(items_json));
}
co_return executor::request_return_type(make_jsonable(std::move(items)));
if (is_big(items_descr)) {
co_return make_streamed(std::move(items_descr));
}
co_return rjson::print(std::move(items_descr));
}
static dht::token token_for_segment(int segment, int total_segments) {
@@ -5489,7 +5522,7 @@ future<executor::request_return_type> executor::list_tables(client_state& client
std::string exclusive_start = exclusive_start_json ? exclusive_start_json->GetString() : "";
int limit = limit_json ? limit_json->GetInt() : 100;
if (limit < 1 || limit > 100) {
return make_ready_future<request_return_type>(api_error::validation("Limit must be greater than 0 and no greater than 100"));
co_return api_error::validation("Limit must be greater than 0 and no greater than 100");
}
auto tables = _proxy.data_dictionary().get_tables(); // hold on to temporary, table_names isn't a container, it's a view
@@ -5531,7 +5564,7 @@ future<executor::request_return_type> executor::list_tables(client_state& client
rjson::add(response, "LastEvaluatedTableName", rjson::copy(last_table_name));
}
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(response)));
co_return rjson::print(std::move(response));
}
future<executor::request_return_type> executor::describe_endpoints(client_state& client_state, service_permit permit, rjson::value request, std::string host_header) {
@@ -5542,8 +5575,8 @@ future<executor::request_return_type> executor::describe_endpoints(client_state&
if (!override.empty()) {
if (override == "disabled") {
_stats.unsupported_operations++;
return make_ready_future<request_return_type>(api_error::unknown_operation(
"DescribeEndpoints disabled by configuration (alternator_describe_endpoints=disabled)"));
co_return api_error::unknown_operation(
"DescribeEndpoints disabled by configuration (alternator_describe_endpoints=disabled)");
}
host_header = std::move(override);
}
@@ -5555,13 +5588,13 @@ future<executor::request_return_type> executor::describe_endpoints(client_state&
// A "Host:" header includes both host name and port, exactly what we need
// to return.
if (host_header.empty()) {
return make_ready_future<request_return_type>(api_error::validation("DescribeEndpoints needs a 'Host:' header in request"));
co_return api_error::validation("DescribeEndpoints needs a 'Host:' header in request");
}
rjson::add(response, "Endpoints", rjson::empty_array());
rjson::push_back(response["Endpoints"], rjson::empty_object());
rjson::add(response["Endpoints"][0], "Address", rjson::from_string(host_header));
rjson::add(response["Endpoints"][0], "CachePeriodInMinutes", rjson::value(1440));
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(response)));
co_return rjson::print(std::move(response));
}
static std::map<sstring, sstring> get_network_topology_options(service::storage_proxy& sp, gms::gossiper& gossiper, int rf) {
@@ -5596,7 +5629,7 @@ future<executor::request_return_type> executor::describe_continuous_backups(clie
rjson::add(desc, "PointInTimeRecoveryDescription", std::move(pitr));
rjson::value response = rjson::empty_object();
rjson::add(response, "ContinuousBackupsDescription", std::move(desc));
co_return make_jsonable(std::move(response));
co_return rjson::print(std::move(response));
}
// Create the metadata for the keyspace in which we put the alternator

View File

@@ -10,8 +10,8 @@
#include <seastar/core/future.hh>
#include "seastarx.hh"
#include <seastar/json/json_elements.hh>
#include <seastar/core/sharded.hh>
#include <seastar/util/noncopyable_function.hh>
#include "service/migration_manager.hh"
#include "service/client_state.hh"
@@ -58,29 +58,6 @@ namespace alternator {
class rmw_operation;
struct make_jsonable : public json::jsonable {
rjson::value _value;
public:
explicit make_jsonable(rjson::value&& value);
std::string to_json() const override;
};
/**
* Make return type for serializing the object "streamed",
* i.e. direct to HTTP output stream. Note: only useful for
* (very) large objects as there are overhead issues with this
* as well, but for massive lists of return objects this can
* help avoid large allocations/many re-allocs
*/
json::json_return_type make_streamed(rjson::value&&);
struct json_string : public json::jsonable {
std::string _value;
public:
explicit json_string(std::string&& value);
std::string to_json() const override;
};
namespace parsed {
class path;
};
@@ -169,7 +146,19 @@ class executor : public peering_sharded_service<executor> {
public:
using client_state = service::client_state;
using request_return_type = std::variant<json::json_return_type, api_error>;
// request_return_type is the return type of the executor methods, which
// can be one of:
// 1. A string, which is the response body for the request.
// 2. A body_writer, an asynchronous function (returning future<>) that
// takes an output_stream and writes the response body into it.
// 3. An api_error, which is an error response that should be returned to
// the client.
// The body_writer is used for streaming responses, where the response body
// is written in chunks to the output_stream. This allows for efficient
// handling of large responses without needing to allocate a large buffer
// in memory.
using body_writer = noncopyable_function<future<>(output_stream<char>&&)>;
using request_return_type = std::variant<std::string, body_writer, api_error>;
stats _stats;
// The metric_groups object holds this stat object's metrics registered
// as long as the stats object is alive.
@@ -240,12 +229,15 @@ public:
const std::optional<attrs_to_get>&,
uint64_t* = nullptr);
// Converts a multi-row selection result to JSON compatible with DynamoDB.
// For each row, this method calls item_callback, which takes the size of
// the item as the parameter.
static future<std::vector<rjson::value>> describe_multi_item(schema_ptr schema,
const query::partition_slice&& slice,
shared_ptr<cql3::selection::selection> selection,
foreign_ptr<lw_shared_ptr<query::result>> query_result,
shared_ptr<const std::optional<attrs_to_get>> attrs_to_get,
uint64_t& rcu_half_units);
noncopyable_function<void(uint64_t)> item_callback = {});
static void describe_single_item(const cql3::selection::selection&,
const std::vector<managed_bytes_opt>&,
@@ -275,4 +267,13 @@ bool is_big(const rjson::value& val, int big_size = 100'000);
// appropriate user-readable api_error::access_denied is thrown.
future<> verify_permission(bool enforce_authorization, const service::client_state&, const schema_ptr&, auth::permission);
/**
* Make return type for serializing the object "streamed",
* i.e. direct to HTTP output stream. Note: only useful for
* (very) large objects as there are overhead issues with this
* as well, but for massive lists of return objects this can
* help avoid large allocations/many re-allocs
*/
executor::body_writer make_streamed(rjson::value&&);
}

View File

@@ -91,6 +91,18 @@ options {
throw expressions_syntax_error(format("{} at char {}", err,
ex->get_charPositionInLine()));
}
// ANTLR3 tries to recover missing tokens - it tries to finish parsing
// and create valid objects, as if the missing token was there.
// But it has a bug and leaks these tokens.
// We override offending method and handle abandoned pointers.
std::vector<std::unique_ptr<TokenType>> _missing_tokens;
TokenType* getMissingSymbol(IntStreamType* istream, ExceptionBaseType* e,
ANTLR_UINT32 expectedTokenType, BitsetListType* follow) {
auto token = BaseType::getMissingSymbol(istream, e, expectedTokenType, follow);
_missing_tokens.emplace_back(token);
return token;
}
}
@lexer::context {
void displayRecognitionError(ANTLR_UINT8** token_names, ExceptionBaseType* ex) {

View File

@@ -13,7 +13,6 @@
#include <seastar/http/function_handlers.hh>
#include <seastar/http/short_streams.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/json/json_elements.hh>
#include <seastar/util/defer.hh>
#include <seastar/util/short_streams.hh>
#include "seastarx.hh"
@@ -124,22 +123,22 @@ public:
}
auto res = resf.get();
std::visit(overloaded_functor {
[&] (const json::json_return_type& json_return_value) {
slogger.trace("api_handler success case");
if (json_return_value._body_writer) {
// Unfortunately, write_body() forces us to choose
// from a fixed and irrelevant list of "mime-types"
// at this point. But we'll override it with the
// one (application/x-amz-json-1.0) below.
rep->write_body("json", std::move(json_return_value._body_writer));
} else {
rep->_content += json_return_value._res;
}
},
[&] (const api_error& err) {
generate_error_reply(*rep, err);
}
}, res);
[&] (std::string&& str) {
// Note that despite the move, there is a copy here -
// as str is std::string and rep->_content is sstring.
rep->_content = std::move(str);
},
[&] (executor::body_writer&& body_writer) {
// Unfortunately, write_body() forces us to choose
// from a fixed and irrelevant list of "mime-types"
// at this point. But we'll override it with the
// correct one (application/x-amz-json-1.0) below.
rep->write_body("json", std::move(body_writer));
},
[&] (const api_error& err) {
generate_error_reply(*rep, err);
}
}, std::move(res));
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
});

View File

@@ -217,7 +217,7 @@ future<alternator::executor::request_return_type> alternator::executor::list_str
rjson::add(ret, "LastEvaluatedStreamArn", *last);
}
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
}
struct shard_id {
@@ -491,7 +491,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
if (!opts.enabled()) {
rjson::add(ret, "StreamDescription", std::move(stream_desc));
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
}
// TODO: label
@@ -617,7 +617,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
rjson::add(stream_desc, "Shards", std::move(shards));
rjson::add(ret, "StreamDescription", std::move(stream_desc));
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
});
}
@@ -770,7 +770,7 @@ future<executor::request_return_type> executor::get_shard_iterator(client_state&
auto ret = rjson::empty_object();
rjson::add(ret, "ShardIterator", iter);
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
}
struct event_id {
@@ -1021,7 +1021,7 @@ future<executor::request_return_type> executor::get_records(client_state& client
// will notice end end of shard and not return NextShardIterator.
rjson::add(ret, "NextShardIterator", next_iter);
_stats.api_operations.get_records_latency.mark(std::chrono::steady_clock::now() - start_time);
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
}
// ugh. figure out if we are and end-of-shard
@@ -1047,7 +1047,7 @@ future<executor::request_return_type> executor::get_records(client_state& client
if (is_big(ret)) {
return make_ready_future<executor::request_return_type>(make_streamed(std::move(ret)));
}
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
});
});
}

View File

@@ -118,7 +118,7 @@ future<executor::request_return_type> executor::update_time_to_live(client_state
// basically identical to the request's
rjson::value response = rjson::empty_object();
rjson::add(response, "TimeToLiveSpecification", std::move(*spec));
co_return make_jsonable(std::move(response));
co_return rjson::print(std::move(response));
}
future<executor::request_return_type> executor::describe_time_to_live(client_state& client_state, service_permit permit, rjson::value request) {
@@ -135,7 +135,7 @@ future<executor::request_return_type> executor::describe_time_to_live(client_sta
}
rjson::value response = rjson::empty_object();
rjson::add(response, "TimeToLiveDescription", std::move(desc));
co_return make_jsonable(std::move(response));
co_return rjson::print(std::move(response));
}
// expiration_service is a sharded service responsible for cleaning up expired

View File

@@ -3161,6 +3161,22 @@
]
}
]
},
{
"path":"/storage_service/raft_topology/cmd_rpc_status",
"operations":[
{
"method":"GET",
"summary":"Get information about currently running topology cmd rpc",
"type":"string",
"nickname":"raft_topology_get_cmd_status",
"produces":[
"application/json"
],
"parameters":[
]
}
]
}
],
"models":{
@@ -3297,11 +3313,11 @@
"properties":{
"start_token":{
"type":"string",
"description":"The range start token"
"description":"The range start token (exclusive)"
},
"end_token":{
"type":"string",
"description":"The range start token"
"description":"The range end token (inclusive)"
},
"endpoints":{
"type":"array",

View File

@@ -749,13 +749,7 @@ rest_force_compaction(http_context& ctx, std::unique_ptr<http::request> req) {
fmopt = flush_mode::skip;
}
auto task = co_await compaction_module.make_and_start_task<global_major_compaction_task_impl>({}, db, fmopt, consider_only_existing_data);
try {
co_await task->done();
} catch (...) {
apilog.error("force_compaction failed: {}", std::current_exception());
throw;
}
co_await task->done();
co_return json_void();
}
@@ -774,13 +768,7 @@ rest_force_keyspace_compaction(http_context& ctx, std::unique_ptr<http::request>
fmopt = flush_mode::skip;
}
auto task = co_await compaction_module.make_and_start_task<major_keyspace_compaction_task_impl>({}, std::move(keyspace), tasks::task_id::create_null_id(), db, table_infos, fmopt, consider_only_existing_data);
try {
co_await task->done();
} catch (...) {
apilog.error("force_keyspace_compaction: keyspace={} tables={} failed: {}", task->get_status().keyspace, table_infos, std::current_exception());
throw;
}
co_await task->done();
co_return json_void();
}
@@ -805,13 +793,7 @@ rest_force_keyspace_cleanup(http_context& ctx, sharded<service::storage_service>
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
auto task = co_await compaction_module.make_and_start_task<cleanup_keyspace_compaction_task_impl>(
{}, std::move(keyspace), db, table_infos, flush_mode::all_tables, tasks::is_user_task::yes);
try {
co_await task->done();
} catch (...) {
apilog.error("force_keyspace_cleanup: keyspace={} tables={} failed: {}", task->get_status().keyspace, table_infos, std::current_exception());
throw;
}
co_await task->done();
co_return json::json_return_type(0);
}
@@ -833,12 +815,7 @@ rest_cleanup_all(http_context& ctx, sharded<service::storage_service>& ss, std::
auto& db = ctx.db;
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
auto task = co_await compaction_module.make_and_start_task<global_cleanup_compaction_task_impl>({}, db);
try {
co_await task->done();
} catch (...) {
apilog.error("cleanup_all failed: {}", std::current_exception());
throw;
}
co_await task->done();
co_return json::json_return_type(0);
}
@@ -850,13 +827,7 @@ rest_perform_keyspace_offstrategy_compaction(http_context& ctx, std::unique_ptr<
bool res = false;
auto& compaction_module = ctx.db.local().get_compaction_manager().get_task_manager_module();
auto task = co_await compaction_module.make_and_start_task<offstrategy_keyspace_compaction_task_impl>({}, std::move(keyspace), ctx.db, table_infos, &res);
try {
co_await task->done();
} catch (...) {
apilog.error("perform_keyspace_offstrategy_compaction: keyspace={} tables={} failed: {}", task->get_status().keyspace, table_infos, std::current_exception());
throw;
}
co_await task->done();
co_return json::json_return_type(res);
}
@@ -871,13 +842,7 @@ rest_upgrade_sstables(http_context& ctx, std::unique_ptr<http::request> req) {
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
auto task = co_await compaction_module.make_and_start_task<upgrade_sstables_compaction_task_impl>({}, std::move(keyspace), db, table_infos, exclude_current_version);
try {
co_await task->done();
} catch (...) {
apilog.error("upgrade_sstables: keyspace={} tables={} failed: {}", keyspace, table_infos, std::current_exception());
throw;
}
co_await task->done();
co_return json::json_return_type(0);
}
@@ -1670,6 +1635,18 @@ rest_raft_topology_upgrade_status(sharded<service::storage_service>& ss, std::un
co_return sstring(format("{}", ustate));
}
static
future<json::json_return_type>
rest_raft_topology_get_cmd_status(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
const auto status = co_await ss.invoke_on(0, [] (auto& ss) {
return ss.get_topology_cmd_status();
});
if (status.active_dst.empty()) {
co_return sstring("none");
}
co_return sstring(fmt::format("{}[{}]: {}", status.current, status.index, fmt::join(status.active_dst, ",")));
}
static
future<json::json_return_type>
rest_move_tablet(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
@@ -1902,6 +1879,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::reload_raft_topology_state.set(r, rest_bind(rest_reload_raft_topology_state, ss, group0_client));
ss::upgrade_to_raft_topology.set(r, rest_bind(rest_upgrade_to_raft_topology, ss));
ss::raft_topology_upgrade_status.set(r, rest_bind(rest_raft_topology_upgrade_status, ss));
ss::raft_topology_get_cmd_status.set(r, rest_bind(rest_raft_topology_get_cmd_status, ss));
ss::move_tablet.set(r, rest_bind(rest_move_tablet, ctx, ss));
ss::add_tablet_replica.set(r, rest_bind(rest_add_tablet_replica, ctx, ss));
ss::del_tablet_replica.set(r, rest_bind(rest_del_tablet_replica, ctx, ss));
@@ -1983,6 +1961,7 @@ void unset_storage_service(http_context& ctx, routes& r) {
ss::reload_raft_topology_state.unset(r);
ss::upgrade_to_raft_topology.unset(r);
ss::raft_topology_upgrade_status.unset(r);
ss::raft_topology_get_cmd_status.unset(r);
ss::move_tablet.unset(r);
ss::add_tablet_replica.unset(r);
ss::del_tablet_replica.unset(r);

View File

@@ -9,6 +9,7 @@
#include "auth/allow_all_authenticator.hh"
#include "service/migration_manager.hh"
#include "utils/alien_worker.hh"
#include "utils/class_registrator.hh"
namespace auth {
@@ -21,6 +22,7 @@ static const class_registrator<
allow_all_authenticator,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&> registration("org.apache.cassandra.auth.AllowAllAuthenticator");
::service::migration_manager&,
utils::alien_worker&> registration("org.apache.cassandra.auth.AllowAllAuthenticator");
}

View File

@@ -13,6 +13,7 @@
#include "auth/authenticated_user.hh"
#include "auth/authenticator.hh"
#include "auth/common.hh"
#include "utils/alien_worker.hh"
namespace cql3 {
class query_processor;
@@ -28,7 +29,7 @@ extern const std::string_view allow_all_authenticator_name;
class allow_all_authenticator final : public authenticator {
public:
allow_all_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&) {
allow_all_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, utils::alien_worker&) {
}
virtual future<> start() override {

View File

@@ -33,13 +33,14 @@ static const class_registrator<auth::authenticator
, auth::certificate_authenticator
, cql3::query_processor&
, ::service::raft_group0_client&
, ::service::migration_manager&> cert_auth_reg(CERT_AUTH_NAME);
, ::service::migration_manager&
, utils::alien_worker&> cert_auth_reg(CERT_AUTH_NAME);
enum class auth::certificate_authenticator::query_source {
subject, altname
};
auth::certificate_authenticator::certificate_authenticator(cql3::query_processor& qp, ::service::raft_group0_client&, ::service::migration_manager&)
auth::certificate_authenticator::certificate_authenticator(cql3::query_processor& qp, ::service::raft_group0_client&, ::service::migration_manager&, utils::alien_worker&)
: _queries([&] {
auto& conf = qp.db().get_config();
auto queries = conf.auth_certificate_role_queries();

View File

@@ -10,6 +10,7 @@
#pragma once
#include "auth/authenticator.hh"
#include "utils/alien_worker.hh"
#include <boost/regex_fwd.hpp> // IWYU pragma: keep
namespace cql3 {
@@ -31,7 +32,7 @@ class certificate_authenticator : public authenticator {
enum class query_source;
std::vector<std::pair<query_source, boost::regex>> _queries;
public:
certificate_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&);
certificate_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, utils::alien_worker&);
~certificate_authenticator();
future<> start() override;

View File

@@ -233,9 +233,9 @@ future<role_set> ldap_role_manager::query_granted(std::string_view grantee_name,
}
future<role_to_directly_granted_map>
ldap_role_manager::query_all_directly_granted() {
ldap_role_manager::query_all_directly_granted(::service::query_state& qs) {
role_to_directly_granted_map result;
auto roles = co_await query_all();
auto roles = co_await query_all(qs);
for (auto& role: roles) {
auto granted_set = co_await query_granted(role, recursive_role_query::no);
for (auto& granted: granted_set) {
@@ -247,8 +247,8 @@ ldap_role_manager::query_all_directly_granted() {
co_return result;
}
future<role_set> ldap_role_manager::query_all() {
return _std_mgr.query_all();
future<role_set> ldap_role_manager::query_all(::service::query_state& qs) {
return _std_mgr.query_all(qs);
}
future<> ldap_role_manager::create_role(std::string_view role_name) {
@@ -311,12 +311,12 @@ future<bool> ldap_role_manager::can_login(std::string_view role_name) {
}
future<std::optional<sstring>> ldap_role_manager::get_attribute(
std::string_view role_name, std::string_view attribute_name) {
return _std_mgr.get_attribute(role_name, attribute_name);
std::string_view role_name, std::string_view attribute_name, ::service::query_state& qs) {
return _std_mgr.get_attribute(role_name, attribute_name, qs);
}
future<role_manager::attribute_vals> ldap_role_manager::query_attribute_for_all(std::string_view attribute_name) {
return _std_mgr.query_attribute_for_all(attribute_name);
future<role_manager::attribute_vals> ldap_role_manager::query_attribute_for_all(std::string_view attribute_name, ::service::query_state& qs) {
return _std_mgr.query_attribute_for_all(attribute_name, qs);
}
future<> ldap_role_manager::set_attribute(

View File

@@ -75,9 +75,9 @@ class ldap_role_manager : public role_manager {
future<role_set> query_granted(std::string_view, recursive_role_query) override;
future<role_to_directly_granted_map> query_all_directly_granted() override;
future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state&) override;
future<role_set> query_all() override;
future<role_set> query_all(::service::query_state&) override;
future<bool> exists(std::string_view) override;
@@ -85,9 +85,9 @@ class ldap_role_manager : public role_manager {
future<bool> can_login(std::string_view) override;
future<std::optional<sstring>> get_attribute(std::string_view, std::string_view) override;
future<std::optional<sstring>> get_attribute(std::string_view, std::string_view, ::service::query_state&) override;
future<role_manager::attribute_vals> query_attribute_for_all(std::string_view) override;
future<role_manager::attribute_vals> query_attribute_for_all(std::string_view, ::service::query_state&) override;
future<> set_attribute(std::string_view, std::string_view, std::string_view, ::service::group0_batch& mc) override;

View File

@@ -78,11 +78,11 @@ future<role_set> maintenance_socket_role_manager::query_granted(std::string_view
return operation_not_supported_exception<role_set>("QUERY GRANTED");
}
future<role_to_directly_granted_map> maintenance_socket_role_manager::query_all_directly_granted() {
future<role_to_directly_granted_map> maintenance_socket_role_manager::query_all_directly_granted(::service::query_state&) {
return operation_not_supported_exception<role_to_directly_granted_map>("QUERY ALL DIRECTLY GRANTED");
}
future<role_set> maintenance_socket_role_manager::query_all() {
future<role_set> maintenance_socket_role_manager::query_all(::service::query_state&) {
return operation_not_supported_exception<role_set>("QUERY ALL");
}
@@ -98,11 +98,11 @@ future<bool> maintenance_socket_role_manager::can_login(std::string_view role_na
return make_ready_future<bool>(true);
}
future<std::optional<sstring>> maintenance_socket_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name) {
future<std::optional<sstring>> maintenance_socket_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state&) {
return operation_not_supported_exception<std::optional<sstring>>("GET ATTRIBUTE");
}
future<role_manager::attribute_vals> maintenance_socket_role_manager::query_attribute_for_all(std::string_view attribute_name) {
future<role_manager::attribute_vals> maintenance_socket_role_manager::query_attribute_for_all(std::string_view attribute_name, ::service::query_state&) {
return operation_not_supported_exception<role_manager::attribute_vals>("QUERY ATTRIBUTE");
}

View File

@@ -53,9 +53,9 @@ public:
virtual future<role_set> query_granted(std::string_view grantee_name, recursive_role_query) override;
virtual future<role_to_directly_granted_map> query_all_directly_granted() override;
virtual future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state&) override;
virtual future<role_set> query_all() override;
virtual future<role_set> query_all(::service::query_state&) override;
virtual future<bool> exists(std::string_view role_name) override;
@@ -63,9 +63,9 @@ public:
virtual future<bool> can_login(std::string_view role_name) override;
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name) override;
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state&) override;
virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name) override;
virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name, ::service::query_state&) override;
virtual future<> set_attribute(std::string_view role_name, std::string_view attribute_name, std::string_view attribute_value, ::service::group0_batch& mc) override;

View File

@@ -48,14 +48,14 @@ static const class_registrator<
password_authenticator,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&> password_auth_reg("org.apache.cassandra.auth.PasswordAuthenticator");
::service::migration_manager&,
utils::alien_worker&> password_auth_reg("org.apache.cassandra.auth.PasswordAuthenticator");
static thread_local auto rng_for_salt = std::default_random_engine(std::random_device{}());
static std::string_view get_config_value(std::string_view value, std::string_view def) {
return value.empty() ? def : value;
}
std::string password_authenticator::default_superuser(const db::config& cfg) {
return std::string(get_config_value(cfg.auth_superuser_name(), DEFAULT_USER_NAME));
}
@@ -63,12 +63,13 @@ std::string password_authenticator::default_superuser(const db::config& cfg) {
password_authenticator::~password_authenticator() {
}
password_authenticator::password_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm)
password_authenticator::password_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm, utils::alien_worker& hashing_worker)
: _qp(qp)
, _group0_client(g0)
, _migration_manager(mm)
, _stopped(make_ready_future<>())
, _superuser(default_superuser(qp.db().get_config()))
, _hashing_worker(hashing_worker)
{}
static bool has_salted_hash(const cql3::untyped_result_set_row& row) {
@@ -125,7 +126,7 @@ future<> password_authenticator::legacy_create_default_if_missing() {
}
std::string salted_pwd(get_config_value(_qp.db().get_config().auth_superuser_salted_password(), ""));
if (salted_pwd.empty()) {
salted_pwd = passwords::hash(DEFAULT_USER_PASSWORD, rng_for_salt);
salted_pwd = passwords::hash(DEFAULT_USER_PASSWORD, rng_for_salt, _scheme);
}
const auto query = update_row_query();
co_await _qp.execute_internal(
@@ -172,7 +173,7 @@ future<> password_authenticator::maybe_create_default_password() {
// Set default superuser's password.
std::string salted_pwd(get_config_value(_qp.db().get_config().auth_superuser_salted_password(), ""));
if (salted_pwd.empty()) {
salted_pwd = passwords::hash(DEFAULT_USER_PASSWORD, rng_for_salt);
salted_pwd = passwords::hash(DEFAULT_USER_PASSWORD, rng_for_salt, _scheme);
}
const auto update_query = update_row_query();
co_await collect_mutations(_qp, batch, update_query, {salted_pwd, _superuser});
@@ -202,6 +203,10 @@ future<> password_authenticator::maybe_create_default_password_with_retries() {
future<> password_authenticator::start() {
return once_among_shards([this] {
// Verify that at least one hashing scheme is supported.
passwords::detail::verify_scheme(_scheme);
plogger.info("Using password hashing scheme: {}", passwords::detail::prefix_for_scheme(_scheme));
_stopped = do_after_system_ready(_as, [this] {
return async([this] {
if (legacy_mode(_qp)) {
@@ -227,7 +232,9 @@ future<> password_authenticator::start() {
utils::get_local_injector().inject("password_authenticator_start_pause", utils::wait_for_message(5min)).get();
if (!legacy_mode(_qp)) {
maybe_create_default_password_with_retries().get();
_superuser_created_promise.set_value();
if (!_superuser_created_promise.available()) {
_superuser_created_promise.set_value();
}
}
});
});
@@ -287,7 +294,13 @@ future<authenticated_user> password_authenticator::authenticate(
try {
const std::optional<sstring> salted_hash = co_await get_password_hash(username);
if (!salted_hash || !passwords::check(password, *salted_hash)) {
if (!salted_hash) {
throw exceptions::authentication_exception("Username and/or password are incorrect");
}
const bool password_match = co_await _hashing_worker.submit<bool>([password = std::move(password), salted_hash = std::move(salted_hash)]{
return passwords::check(password, *salted_hash);
});
if (!password_match) {
throw exceptions::authentication_exception("Username and/or password are incorrect");
}
co_return username;
@@ -311,7 +324,7 @@ future<> password_authenticator::create(std::string_view role_name, const authen
auto maybe_hash = options.credentials.transform([&] (const auto& creds) -> sstring {
return std::visit(make_visitor(
[&] (const password_option& opt) {
return passwords::hash(opt.password, rng_for_salt);
return passwords::hash(opt.password, rng_for_salt, _scheme);
},
[] (const hashed_password_option& opt) {
return opt.hashed_password;
@@ -354,11 +367,11 @@ future<> password_authenticator::alter(std::string_view role_name, const authent
query,
consistency_for_user(role_name),
internal_distributed_query_state(),
{passwords::hash(password, rng_for_salt), sstring(role_name)},
{passwords::hash(password, rng_for_salt, _scheme), sstring(role_name)},
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_await collect_mutations(_qp, mc, query,
{passwords::hash(password, rng_for_salt), sstring(role_name)});
{passwords::hash(password, rng_for_salt, _scheme), sstring(role_name)});
}
}

View File

@@ -15,7 +15,9 @@
#include "db/consistency_level_type.hh"
#include "auth/authenticator.hh"
#include "auth/passwords.hh"
#include "service/raft/raft_group0_client.hh"
#include "utils/alien_worker.hh"
namespace db {
class config;
@@ -43,12 +45,15 @@ class password_authenticator : public authenticator {
abort_source _as;
std::string _superuser; // default superuser name from the config (may or may not be present in roles table)
shared_promise<> _superuser_created_promise;
// We used to also support bcrypt, SHA-256, and MD5 (ref. scylladb#24524).
constexpr static auth::passwords::scheme _scheme = passwords::scheme::sha_512;
utils::alien_worker& _hashing_worker;
public:
static db::consistency_level consistency_for_user(std::string_view role_name);
static std::string default_superuser(const db::config&);
password_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&);
password_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, utils::alien_worker&);
~password_authenticator();

View File

@@ -21,18 +21,14 @@ static thread_local crypt_data tlcrypt = {};
namespace detail {
scheme identify_best_supported_scheme() {
const auto all_schemes = { scheme::bcrypt_y, scheme::bcrypt_a, scheme::sha_512, scheme::sha_256, scheme::md5 };
// "Random", for testing schemes.
void verify_scheme(scheme scheme) {
const sstring random_part_of_salt = "aaaabbbbccccdddd";
for (scheme c : all_schemes) {
const sstring salt = sstring(prefix_for_scheme(c)) + random_part_of_salt;
const char* e = crypt_r("fisk", salt.c_str(), &tlcrypt);
const sstring salt = sstring(prefix_for_scheme(scheme)) + random_part_of_salt;
const char* e = crypt_r("fisk", salt.c_str(), &tlcrypt);
if (e && (e[0] != '*')) {
return c;
}
if (e && (e[0] != '*')) {
return;
}
throw no_supported_schemes();

View File

@@ -21,10 +21,11 @@ class no_supported_schemes : public std::runtime_error {
public:
no_supported_schemes();
};
///
/// Apache Cassandra uses a library to provide the bcrypt scheme. Many Linux implementations do not support bcrypt, so
/// we support alternatives. The cost is loss of direct compatibility with Apache Cassandra system tables.
/// Apache Cassandra uses a library to provide the bcrypt scheme. In ScyllaDB, we use SHA-512
/// instead of bcrypt for performance and for historical reasons (see scylladb#24524).
/// Currently, SHA-512 is always chosen as the hashing scheme for new passwords, but the other
/// algorithms remain supported for CREATE ROLE WITH HASHED PASSWORD and backward compatibility.
///
enum class scheme {
bcrypt_y,
@@ -51,11 +52,11 @@ sstring generate_random_salt_bytes(RandomNumberEngine& g) {
}
///
/// Test each allowed hashing scheme and report the best supported one on the current system.
/// Test given hashing scheme on the current system.
///
/// \throws \ref no_supported_schemes when none of the known schemes is supported.
/// \throws \ref no_supported_schemes when scheme is unsupported.
///
scheme identify_best_supported_scheme();
void verify_scheme(scheme scheme);
std::string_view prefix_for_scheme(scheme) noexcept;
@@ -67,8 +68,7 @@ std::string_view prefix_for_scheme(scheme) noexcept;
/// \throws \ref no_supported_schemes when no known hashing schemes are supported on the system.
///
template <typename RandomNumberEngine>
sstring generate_salt(RandomNumberEngine& g) {
static const scheme scheme = identify_best_supported_scheme();
sstring generate_salt(RandomNumberEngine& g, scheme scheme) {
static const sstring prefix = sstring(prefix_for_scheme(scheme));
return prefix + generate_random_salt_bytes(g);
}
@@ -93,8 +93,8 @@ sstring hash_with_salt(const sstring& pass, const sstring& salt);
/// \throws \ref std::system_error when the implementation-specific implementation fails to hash the cleartext.
///
template <typename RandomNumberEngine>
sstring hash(const sstring& pass, RandomNumberEngine& g) {
return detail::hash_with_salt(pass, detail::generate_salt(g));
sstring hash(const sstring& pass, RandomNumberEngine& g, scheme scheme) {
return detail::hash_with_salt(pass, detail::generate_salt(g, scheme));
}
///

View File

@@ -17,12 +17,17 @@
#include <seastar/core/format.hh>
#include <seastar/core/sstring.hh>
#include "auth/common.hh"
#include "auth/resource.hh"
#include "cql3/description.hh"
#include "seastarx.hh"
#include "exceptions/exceptions.hh"
#include "service/raft/raft_group0_client.hh"
namespace service {
class query_state;
};
namespace auth {
struct role_config final {
@@ -167,9 +172,9 @@ public:
/// (role2, role3)
/// }
///
virtual future<role_to_directly_granted_map> query_all_directly_granted() = 0;
virtual future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state& = internal_distributed_query_state()) = 0;
virtual future<role_set> query_all() = 0;
virtual future<role_set> query_all(::service::query_state& = internal_distributed_query_state()) = 0;
virtual future<bool> exists(std::string_view role_name) = 0;
@@ -186,12 +191,12 @@ public:
///
/// \returns the value of the named attribute, if one is set.
///
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name) = 0;
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state& = internal_distributed_query_state()) = 0;
///
/// \returns a mapping of each role's value for the named attribute, if one is set for the role.
///
virtual future<attribute_vals> query_attribute_for_all(std::string_view attribute_name) = 0;
virtual future<attribute_vals> query_attribute_for_all(std::string_view attribute_name, ::service::query_state& = internal_distributed_query_state()) = 0;
/// Sets `attribute_name` with `attribute_value` for `role_name`.
/// \returns an exceptional future with nonexistant_role if the role does not exist.

View File

@@ -34,9 +34,10 @@ static const class_registrator<
saslauthd_authenticator,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&> saslauthd_auth_reg("com.scylladb.auth.SaslauthdAuthenticator");
::service::migration_manager&,
utils::alien_worker&> saslauthd_auth_reg("com.scylladb.auth.SaslauthdAuthenticator");
saslauthd_authenticator::saslauthd_authenticator(cql3::query_processor& qp, ::service::raft_group0_client&, ::service::migration_manager&)
saslauthd_authenticator::saslauthd_authenticator(cql3::query_processor& qp, ::service::raft_group0_client&, ::service::migration_manager&, utils::alien_worker&)
: _socket_path(qp.db().get_config().saslauthd_socket_path())
{}

View File

@@ -11,6 +11,7 @@
#pragma once
#include "auth/authenticator.hh"
#include "utils/alien_worker.hh"
namespace cql3 {
class query_processor;
@@ -28,7 +29,7 @@ namespace auth {
class saslauthd_authenticator : public authenticator {
sstring _socket_path; ///< Path to the domain socket on which saslauthd is listening.
public:
saslauthd_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&);
saslauthd_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, utils::alien_worker&);
future<> start() override;

View File

@@ -187,14 +187,15 @@ service::service(
::service::migration_notifier& mn,
::service::migration_manager& mm,
const service_config& sc,
maintenance_socket_enabled used_by_maintenance_socket)
maintenance_socket_enabled used_by_maintenance_socket,
utils::alien_worker& hashing_worker)
: service(
std::move(c),
qp,
g0,
mn,
create_object<authorizer>(sc.authorizer_java_name, qp, g0, mm),
create_object<authenticator>(sc.authenticator_java_name, qp, g0, mm),
create_object<authenticator>(sc.authenticator_java_name, qp, g0, mm, hashing_worker),
create_object<role_manager>(sc.role_manager_java_name, qp, g0, mm),
used_by_maintenance_socket) {
}

View File

@@ -26,6 +26,7 @@
#include "cql3/description.hh"
#include "seastarx.hh"
#include "service/raft/raft_group0_client.hh"
#include "utils/alien_worker.hh"
#include "utils/observable.hh"
#include "utils/serialized_action.hh"
#include "service/maintenance_mode.hh"
@@ -126,7 +127,8 @@ public:
::service::migration_notifier&,
::service::migration_manager&,
const service_config&,
maintenance_socket_enabled);
maintenance_socket_enabled,
utils::alien_worker&);
future<> start(::service::migration_manager&, db::system_keyspace&);

View File

@@ -9,6 +9,7 @@
#include "auth/standard_role_manager.hh"
#include <optional>
#include <stdexcept>
#include <unordered_set>
#include <vector>
@@ -28,6 +29,7 @@
#include "cql3/util.hh"
#include "db/consistency_level_type.hh"
#include "exceptions/exceptions.hh"
#include "utils/error_injection.hh"
#include "utils/log.hh"
#include <seastar/core/loop.hh>
#include <seastar/coroutine/maybe_yield.hh>
@@ -321,7 +323,9 @@ future<> standard_role_manager::start() {
}
if (!legacy) {
co_await maybe_create_default_role_with_retries();
_superuser_created_promise.set_value();
if (!_superuser_created_promise.available()) {
_superuser_created_promise.set_value();
}
}
};
@@ -648,21 +652,30 @@ future<role_set> standard_role_manager::query_granted(std::string_view grantee_n
});
}
future<role_to_directly_granted_map> standard_role_manager::query_all_directly_granted() {
future<role_to_directly_granted_map> standard_role_manager::query_all_directly_granted(::service::query_state& qs) {
const sstring query = seastar::format("SELECT * FROM {}.{}",
get_auth_ks_name(_qp),
meta::role_members_table::name);
const auto results = co_await _qp.execute_internal(
query,
db::consistency_level::ONE,
qs,
cql3::query_processor::cache_internal::yes);
role_to_directly_granted_map roles_map;
co_await _qp.query_internal(query, [&roles_map] (const cql3::untyped_result_set_row& row) -> future<stop_iteration> {
roles_map.insert({row.get_as<sstring>("member"), row.get_as<sstring>("role")});
co_return stop_iteration::no;
});
std::transform(
results->begin(),
results->end(),
std::inserter(roles_map, roles_map.begin()),
[] (const cql3::untyped_result_set_row& row) {
return std::make_pair(row.get_as<sstring>("member"), row.get_as<sstring>("role")); }
);
co_return roles_map;
}
future<role_set> standard_role_manager::query_all() {
future<role_set> standard_role_manager::query_all(::service::query_state& qs) {
const sstring query = seastar::format("SELECT {} FROM {}.{}",
meta::roles_table::role_col_name,
get_auth_ks_name(_qp),
@@ -671,10 +684,16 @@ future<role_set> standard_role_manager::query_all() {
// To avoid many copies of a view.
static const auto role_col_name_string = sstring(meta::roles_table::role_col_name);
if (utils::get_local_injector().enter("standard_role_manager_fail_legacy_query")) {
if (legacy_mode(_qp)) {
throw std::runtime_error("standard_role_manager::query_all: failed due to error injection");
}
}
const auto results = co_await _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state(),
qs,
cql3::query_processor::cache_internal::yes);
role_set roles;
@@ -706,11 +725,11 @@ future<bool> standard_role_manager::can_login(std::string_view role_name) {
});
}
future<std::optional<sstring>> standard_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name) {
future<std::optional<sstring>> standard_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state& qs) {
const sstring query = seastar::format("SELECT name, value FROM {}.{} WHERE role = ? AND name = ?",
get_auth_ks_name(_qp),
meta::role_attributes_table::name);
const auto result_set = co_await _qp.execute_internal(query, {sstring(role_name), sstring(attribute_name)}, cql3::query_processor::cache_internal::yes);
const auto result_set = co_await _qp.execute_internal(query, db::consistency_level::ONE, qs, {sstring(role_name), sstring(attribute_name)}, cql3::query_processor::cache_internal::yes);
if (!result_set->empty()) {
const cql3::untyped_result_set_row &row = result_set->one();
co_return std::optional<sstring>(row.get_as<sstring>("value"));
@@ -718,11 +737,11 @@ future<std::optional<sstring>> standard_role_manager::get_attribute(std::string_
co_return std::optional<sstring>{};
}
future<role_manager::attribute_vals> standard_role_manager::query_attribute_for_all (std::string_view attribute_name) {
return query_all().then([this, attribute_name] (role_set roles) {
return do_with(attribute_vals{}, [this, attribute_name, roles = std::move(roles)] (attribute_vals &role_to_att_val) {
return parallel_for_each(roles.begin(), roles.end(), [this, &role_to_att_val, attribute_name] (sstring role) {
return get_attribute(role, attribute_name).then([&role_to_att_val, role] (std::optional<sstring> att_val) {
future<role_manager::attribute_vals> standard_role_manager::query_attribute_for_all (std::string_view attribute_name, ::service::query_state& qs) {
return query_all(qs).then([this, attribute_name, &qs] (role_set roles) {
return do_with(attribute_vals{}, [this, attribute_name, roles = std::move(roles), &qs] (attribute_vals &role_to_att_val) {
return parallel_for_each(roles.begin(), roles.end(), [this, &role_to_att_val, attribute_name, &qs] (sstring role) {
return get_attribute(role, attribute_name, qs).then([&role_to_att_val, role] (std::optional<sstring> att_val) {
if (att_val) {
role_to_att_val.emplace(std::move(role), std::move(*att_val));
}
@@ -767,7 +786,7 @@ future<> standard_role_manager::remove_attribute(std::string_view role_name, std
future<std::vector<cql3::description>> standard_role_manager::describe_role_grants() {
std::vector<cql3::description> result{};
const auto grants = co_await query_all_directly_granted();
const auto grants = co_await query_all_directly_granted(internal_distributed_query_state());
result.reserve(grants.size());
for (const auto& [grantee_role, granted_role] : grants) {

View File

@@ -66,9 +66,9 @@ public:
virtual future<role_set> query_granted(std::string_view grantee_name, recursive_role_query) override;
virtual future<role_to_directly_granted_map> query_all_directly_granted() override;
virtual future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state&) override;
virtual future<role_set> query_all() override;
virtual future<role_set> query_all(::service::query_state&) override;
virtual future<bool> exists(std::string_view role_name) override;
@@ -76,9 +76,9 @@ public:
virtual future<bool> can_login(std::string_view role_name) override;
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name) override;
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state&) override;
virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name) override;
virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name, ::service::query_state&) override;
virtual future<> set_attribute(std::string_view role_name, std::string_view attribute_name, std::string_view attribute_value, ::service::group0_batch& mc) override;

View File

@@ -37,8 +37,8 @@ class transitional_authenticator : public authenticator {
public:
static const sstring PASSWORD_AUTHENTICATOR_NAME;
transitional_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm)
: transitional_authenticator(std::make_unique<password_authenticator>(qp, g0, mm)) {
transitional_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm, utils::alien_worker& hashing_worker)
: transitional_authenticator(std::make_unique<password_authenticator>(qp, g0, mm, hashing_worker)) {
}
transitional_authenticator(std::unique_ptr<authenticator> a)
: _authenticator(std::move(a)) {
@@ -239,7 +239,8 @@ static const class_registrator<
auth::transitional_authenticator,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&> transitional_authenticator_reg(auth::PACKAGE_NAME + "TransitionalAuthenticator");
::service::migration_manager&,
utils::alien_worker&> transitional_authenticator_reg(auth::PACKAGE_NAME + "TransitionalAuthenticator");
static const class_registrator<
auth::authorizer,

View File

@@ -960,8 +960,12 @@ public:
// Given a reference to such a column from the base schema, this function sets the corresponding column
// in the log to the given value for the given row.
void set_value(const clustering_key& log_ck, const column_definition& base_cdef, const managed_bytes_view& value) {
auto& log_cdef = *_log_schema.get_column_definition(log_data_column_name_bytes(base_cdef.name()));
_log_mut.set_cell(log_ck, log_cdef, atomic_cell::make_live(*base_cdef.type, _ts, value, _ttl));
auto log_cdef_ptr = _log_schema.get_column_definition(log_data_column_name_bytes(base_cdef.name()));
if (!log_cdef_ptr) {
throw exceptions::invalid_request_exception(format("CDC log schema for {}.{} does not have base column {}",
_log_schema.ks_name(), _log_schema.cf_name(), base_cdef.name_as_text()));
}
_log_mut.set_cell(log_ck, *log_cdef_ptr, atomic_cell::make_live(*base_cdef.type, _ts, value, _ttl));
}
// Each regular and static column in the base schema has a corresponding column in the log schema
@@ -969,7 +973,13 @@ public:
// Given a reference to such a column from the base schema, this function sets the corresponding column
// in the log to `true` for the given row. If not called, the column will be `null`.
void set_deleted(const clustering_key& log_ck, const column_definition& base_cdef) {
_log_mut.set_cell(log_ck, log_data_column_deleted_name_bytes(base_cdef.name()), data_value(true), _ts, _ttl);
auto log_cdef_ptr = _log_schema.get_column_definition(log_data_column_deleted_name_bytes(base_cdef.name()));
if (!log_cdef_ptr) {
throw exceptions::invalid_request_exception(format("CDC log schema for {}.{} does not have base column {}",
_log_schema.ks_name(), _log_schema.cf_name(), base_cdef.name_as_text()));
}
auto& log_cdef = *log_cdef_ptr;
_log_mut.set_cell(log_ck, *log_cdef_ptr, atomic_cell::make_live(*log_cdef.type, _ts, log_cdef.type->decompose(true), _ttl));
}
// Each regular and static non-atomic column in the base schema has a corresponding column in the log schema
@@ -978,7 +988,12 @@ public:
// Given a reference to such a column from the base schema, this function sets the corresponding column
// in the log to the given set of keys for the given row.
void set_deleted_elements(const clustering_key& log_ck, const column_definition& base_cdef, const managed_bytes& deleted_elements) {
auto& log_cdef = *_log_schema.get_column_definition(log_data_column_deleted_elements_name_bytes(base_cdef.name()));
auto log_cdef_ptr = _log_schema.get_column_definition(log_data_column_deleted_elements_name_bytes(base_cdef.name()));
if (!log_cdef_ptr) {
throw exceptions::invalid_request_exception(format("CDC log schema for {}.{} does not have base column {}",
_log_schema.ks_name(), _log_schema.cf_name(), base_cdef.name_as_text()));
}
auto& log_cdef = *log_cdef_ptr;
_log_mut.set_cell(log_ck, log_cdef, atomic_cell::make_live(*log_cdef.type, _ts, deleted_elements, _ttl));
}
@@ -1865,5 +1880,10 @@ bool cdc::cdc_service::needs_cdc_augmentation(const std::vector<mutation>& mutat
future<std::tuple<std::vector<mutation>, lw_shared_ptr<cdc::operation_result_tracker>>>
cdc::cdc_service::augment_mutation_call(lowres_clock::time_point timeout, std::vector<mutation>&& mutations, tracing::trace_state_ptr tr_state, db::consistency_level write_cl) {
if (utils::get_local_injector().enter("sleep_before_cdc_augmentation")) {
return seastar::sleep(std::chrono::milliseconds(100)).then([this, timeout, mutations = std::move(mutations), tr_state = std::move(tr_state), write_cl] () mutable {
return _impl->augment_mutation_call(timeout, std::move(mutations), std::move(tr_state), write_cl);
});
}
return _impl->augment_mutation_call(timeout, std::move(mutations), std::move(tr_state), write_cl);
}

View File

@@ -1933,7 +1933,11 @@ static future<compaction_result> scrub_sstables_validate_mode(sstables::compacti
using scrub = sstables::compaction_type_options::scrub;
if (validation_errors != 0 && descriptor.options.as<scrub>().quarantine_sstables == scrub::quarantine_invalid_sstables::yes) {
for (auto& sst : descriptor.sstables) {
co_await sst->change_state(sstables::sstable_state::quarantine);
try {
co_await sst->change_state(sstables::sstable_state::quarantine);
} catch (...) {
clogger.error("Moving {} to quarantine failed due to {}, continuing.", sst->get_filename(), std::current_exception());
}
}
}

View File

@@ -1155,9 +1155,6 @@ future<> compaction_manager::drain() {
future<> compaction_manager::stop() {
do_stop();
if (auto cm = std::exchange(_task_manager_module, nullptr)) {
co_await cm->stop();
}
if (_stop_future) {
co_await std::exchange(*_stop_future, make_ready_future());
}
@@ -1168,14 +1165,15 @@ future<> compaction_manager::really_do_stop() noexcept {
// Reset the metrics registry
_metrics.clear();
co_await stop_ongoing_compactions("shutdown");
if (!_tasks.empty()) {
on_fatal_internal_error(cmlog, format("{} tasks still exist after being stopped", _tasks.size()));
}
co_await _task_manager_module->stop();
co_await coroutine::parallel_for_each(_compaction_state | std::views::values, [] (compaction_state& cs) -> future<> {
if (!cs.gate.is_closed()) {
co_await cs.gate.close();
}
});
if (!_tasks.empty()) {
on_fatal_internal_error(cmlog, format("{} tasks still exist after being stopped", _tasks.size()));
}
reevaluate_postponed_compactions();
co_await std::move(_waiting_reevalution);
co_await _sys_ks.close();
@@ -1839,8 +1837,21 @@ future<compaction_manager::compaction_stats_opt> compaction_manager::perform_sst
if (!gh) {
co_return compaction_stats_opt{};
}
// All sstables must be included, even the ones being compacted, such that everything in table is validated.
auto all_sstables = get_all_sstables(t);
// Collect and register all sstables as compacting while compaction is disabled, to avoid a race condition where
// regular compaction runs in between and picks the same files.
std::vector<sstables::shared_sstable> all_sstables;
compacting_sstable_registration compacting(*this, get_compaction_state(&t));
co_await run_with_compaction_disabled(t, [&all_sstables, &compacting, &t] () -> future<> {
// All sstables must be included.
all_sstables = get_all_sstables(t);
compacting.register_compacting(all_sstables);
return make_ready_future<>();
});
if (all_sstables.empty()) {
co_return compaction_stats_opt{};
}
co_return co_await perform_compaction<validate_sstables_compaction_task_executor>(throw_if_stopping::no, info, &t, info.id, std::move(all_sstables), quarantine_sstables);
}

View File

@@ -301,6 +301,11 @@ public:
// unless it is moved back to enabled state.
future<> drain();
// Check if compaction manager is running, i.e. it was enabled or drained
bool is_running() const noexcept {
return _state == state::enabled || _state == state::disabled;
}
using compaction_history_consumer = noncopyable_function<future<>(const db::compaction_history_entry&)>;
future<> get_compaction_history(compaction_history_consumer&& f);

View File

@@ -855,3 +855,18 @@ rf_rack_valid_keyspaces: false
# Maximum number of items in single BatchWriteItem command. Default is 100.
# Note: DynamoDB has a hard-coded limit of 25.
# alternator_max_items_in_batch_write: 100
#
# io-streaming rate limiting
# When setting this value to be non-zero scylla throttles disk throughput for
# stream (network) activities such as backup, repair, tablet migration and more.
# This limit is useful for user queries so the network interface does
# not get saturated by streaming activities.
# The recommended value is 75% of network bandwidth
# E.g for i4i.8xlarge (https://github.com/scylladb/scylla-machine-image/tree/next/common/aws_net_params.json):
# network: 18.75 GiB/s --> 18750 Mib/s --> 1875 MB/s (from network bits to network bytes: divide by 10, not 8)
# Converted to disk bytes: 1875 * 1000 / 1024 = 1831 MB/s (disk wise)
# 75% of disk bytes is: 0.75 * 1831 = 1373 megabytes/s
# stream_io_throughput_mb_per_sec: 1373
#

View File

@@ -1043,7 +1043,6 @@ scylla_core = (['message/messaging_service.cc',
'utils/s3/client.cc',
'utils/s3/retryable_http_client.cc',
'utils/s3/retry_strategy.cc',
'utils/s3/s3_retry_strategy.cc',
'utils/s3/credentials_providers/aws_credentials_provider.cc',
'utils/s3/credentials_providers/environment_aws_credentials_provider.cc',
'utils/s3/credentials_providers/instance_profile_credentials_provider.cc',

View File

@@ -245,12 +245,18 @@ cql3::statements::alter_keyspace_statement::prepare_schema_mutations(query_proce
qp.db().real_database().validate_keyspace_update(*ks_md_update);
service::topology_mutation_builder builder(ts);
service::topology_request_tracking_mutation_builder rtbuilder{global_request_id, qp.proxy().features().topology_requests_type_column};
rtbuilder.set("done", false)
.set("start_time", db_clock::now());
if (!qp.proxy().features().topology_global_request_queue) {
builder.set_global_topology_request(service::global_topology_request::keyspace_rf_change);
builder.set_global_topology_request_id(global_request_id);
builder.set_new_keyspace_rf_change_data(_name, ks_options);
} else {
builder.queue_global_topology_request_id(global_request_id);
rtbuilder.set("request_type", service::global_topology_request::keyspace_rf_change)
.set_new_keyspace_rf_change_data(_name, ks_options);
};
service::topology_change change{{builder.build()}};
@@ -259,13 +265,6 @@ cql3::statements::alter_keyspace_statement::prepare_schema_mutations(query_proce
return cm.to_mutation(topo_schema);
});
service::topology_request_tracking_mutation_builder rtbuilder{global_request_id, qp.proxy().features().topology_requests_type_column};
rtbuilder.set("done", false)
.set("start_time", db_clock::now())
.set("request_type", service::global_topology_request::keyspace_rf_change);
if (qp.proxy().features().topology_global_request_queue) {
rtbuilder.set_new_keyspace_rf_change_data(_name, ks_options);
}
service::topology_change req_change{{rtbuilder.build()}};
auto topo_req_schema = qp.db().find_schema(db::system_keyspace::NAME, db::system_keyspace::TOPOLOGY_REQUESTS);
@@ -277,33 +276,44 @@ cql3::statements::alter_keyspace_statement::prepare_schema_mutations(query_proce
muts.insert(muts.begin(), schema_mutations.begin(), schema_mutations.end());
}
auto rs = locator::abstract_replication_strategy::create_replication_strategy(
ks_md_update->strategy_name(),
locator::replication_strategy_params(ks_md_update->strategy_options(), ks_md_update->initial_tablets()));
// If `rf_rack_valid_keyspaces` is enabled, it's forbidden to perform a schema change that
// would lead to an RF-rack-valid keyspace. Verify that this change does not.
// For more context, see: scylladb/scylladb#23071.
if (qp.db().get_config().rf_rack_valid_keyspaces()) {
auto rs = locator::abstract_replication_strategy::create_replication_strategy(
ks_md_update->strategy_name(),
locator::replication_strategy_params(ks_md_update->strategy_options(), ks_md_update->initial_tablets()));
try {
// There are two things to note here:
// 1. We hold a group0_guard, so it's correct to check this here.
// The topology or schema cannot change while we're performing this query.
// 2. The replication strategy we use here does NOT represent the actual state
// we will arrive at after applying the schema change. For instance, if the user
// did not specify the RF for some of the DCs, it's equal to 0 in the replication
// strategy we pass to this function, while in reality that means that the RF
// will NOT change. That is not a problem:
// - RF=0 is valid for all DCs, so it won't trigger an exception on its own,
// - the keyspace must've been RF-rack-valid before this change. We check that
// condition for all keyspaces at startup.
// The second hyphen is not really true because currently topological changes can
// disturb it (see scylladb/scylladb#23345), but we ignore that.
locator::assert_rf_rack_valid_keyspace(_name, tmptr, *rs);
} catch (const std::exception& e) {
try {
// There are two things to note here:
// 1. We hold a group0_guard, so it's correct to check this here.
// The topology or schema cannot change while we're performing this query.
// 2. The replication strategy we use here does NOT represent the actual state
// we will arrive at after applying the schema change. For instance, if the user
// did not specify the RF for some of the DCs, it's equal to 0 in the replication
// strategy we pass to this function, while in reality that means that the RF
// will NOT change. That is not a problem:
// - RF=0 is valid for all DCs, so it won't trigger an exception on its own,
// - the keyspace must've been RF-rack-valid before this change. We check that
// condition for all keyspaces at startup.
// The second hyphen is not really true because currently topological changes can
// disturb it (see scylladb/scylladb#23345), but we ignore that.
locator::assert_rf_rack_valid_keyspace(_name, tmptr, *rs);
} catch (const std::exception& e) {
if (qp.db().get_config().rf_rack_valid_keyspaces()) {
// There's no guarantee what the type of the exception will be, so we need to
// wrap it manually here in a type that can be passed to the user.
throw exceptions::invalid_request_exception(e.what());
} else {
// Even when the configuration option `rf_rack_valid_keyspaces` is set to false,
// we'd like to inform the user that the keyspace they're altering will not
// satisfy the restriction after the change--but just as a warning.
// For more context, see issue: scylladb/scylladb#23330.
warnings.push_back(seastar::format(
"Keyspace '{}' is not RF-rack-valid: the replication factor doesn't match "
"the rack count in at least one datacenter. A rack failure may reduce availability. "
"For more context, see: "
"https://docs.scylladb.com/manual/stable/reference/glossary.html#term-RF-rack-valid-keyspace.",
_name));
}
}

View File

@@ -8,6 +8,7 @@
* SPDX-License-Identifier: (LicenseRef-ScyllaDB-Source-Available-1.0 and Apache-2.0)
*/
#include "cdc/log.hh"
#include "utils/assert.hh"
#include <seastar/core/coroutine.hh>
#include "cql3/query_options.hh"
@@ -27,6 +28,7 @@
#include "db/view/view.hh"
#include "cql3/query_processor.hh"
#include "cdc/cdc_extension.hh"
#include "cdc/cdc_partitioner.hh"
namespace cql3 {
@@ -290,6 +292,53 @@ std::pair<schema_ptr, std::vector<view_ptr>> alter_table_statement::prepare_sche
throw exceptions::invalid_request_exception("Cannot use ALTER TABLE on Materialized View");
}
const bool is_cdc_log_table = cdc::is_log_for_some_table(db.real_database(), s->ks_name(), s->cf_name());
// Only a CDC log table will have this partitioner name. User tables should
// not be able to set this. Note that we perform a similar check when trying to
// re-enable CDC for a table, when the log table has been replaced by a user table.
// For better visualization of the above, consider this
//
// cqlsh> CREATE TABLE ks.t (p int PRIMARY KEY, v int) WITH cdc = {'enabled': true};
// cqlsh> INSERT INTO ks.t (p, v) VALUES (1, 2);
// cqlsh> ALTER TABLE ks.t WITH cdc = {'enabled': false};
// cqlsh> DESC TABLE ks.t_scylla_cdc_log WITH INTERNALS; # Save this output!
// cqlsh> DROP TABLE ks.t_scylla_cdc_log;
// cqlsh> [Recreate the log table using the received statement]
// cqlsh> ALTER TABLE ks.t WITH cdc = {'enabled': true};
//
// InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot create CDC log
// table for table ks.t because a table of name ks.t_scylla_cdc_log already exists"
//
// See commit adda43edc75b901b2329bca8f3eb74596698d05f for more information on THAT case.
// We reuse the same technique here.
const bool was_cdc_log_table = s->get_partitioner().name() == cdc::cdc_partitioner::classname;
if (_column_changes.size() != 0 && is_cdc_log_table) {
throw exceptions::invalid_request_exception(
"You cannot modify the set of columns of a CDC log table directly. "
"Modify the base table instead.");
}
if (_column_changes.size() != 0 && was_cdc_log_table) {
throw exceptions::invalid_request_exception(
"You cannot modify the set of columns of a CDC log table directly. "
"Although the base table has deactivated CDC, this table will continue being "
"a CDC log table until it is dropped. If you want to modify the columns in it, "
"you can only do that by reenabling CDC on the base table, which will reattach "
"this log table. Then you will be able to modify the columns in the base table, "
"and that will have effect on the log table too. Modifying the columns of a CDC "
"log table directly is never allowed.");
}
if (_renames.size() != 0 && is_cdc_log_table) {
throw exceptions::invalid_request_exception("Cannot rename a column of a CDC log table.");
}
if (_renames.size() != 0 && was_cdc_log_table) {
throw exceptions::invalid_request_exception(
"You cannot rename a column of a CDC log table. Although the base table "
"has deactivated CDC, this table will continue being a CDC log table until it "
"is dropped.");
}
auto cfm = schema_builder(s);
if (_properties->get_id()) {

View File

@@ -124,15 +124,26 @@ future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, std::vector
// If `rf_rack_valid_keyspaces` is enabled, it's forbidden to create an RF-rack-invalid keyspace.
// Verify that it's RF-rack-valid.
// For more context, see: scylladb/scylladb#23071.
if (cfg.rf_rack_valid_keyspaces()) {
try {
// We hold a group0_guard, so it's correct to check this here.
// The topology or schema cannot change while we're performing this query.
locator::assert_rf_rack_valid_keyspace(_name, tmptr, *rs);
} catch (const std::exception& e) {
try {
// We hold a group0_guard, so it's correct to check this here.
// The topology or schema cannot change while we're performing this query.
locator::assert_rf_rack_valid_keyspace(_name, tmptr, *rs);
} catch (const std::exception& e) {
if (cfg.rf_rack_valid_keyspaces()) {
// There's no guarantee what the type of the exception will be, so we need to
// wrap it manually here in a type that can be passed to the user.
throw exceptions::invalid_request_exception(e.what());
} else {
// Even when the configuration option `rf_rack_valid_keyspaces` is set to false,
// we'd like to inform the user that the keyspace they're creating does not
// satisfy the restriction--but just as a warning.
// For more context, see issue: scylladb/scylladb#23330.
warnings.push_back(seastar::format(
"Keyspace '{}' is not RF-rack-valid: the replication factor doesn't match "
"the rack count in at least one datacenter. A rack failure may reduce availability. "
"For more context, see: "
"https://docs.scylladb.com/manual/stable/reference/glossary.html#term-RF-rack-valid-keyspace.",
_name));
}
}
} catch (const exceptions::already_exists_exception& e) {

View File

@@ -233,7 +233,10 @@ future<> select_statement::check_access(query_processor& qp, const service::clie
try {
const data_dictionary::database db = qp.db();
auto&& s = db.find_schema(keyspace(), column_family());
auto& cf_name = s->is_view() ? s->view_info()->base_name() : column_family();
auto cdc = db.get_cdc_base_table(*s);
auto& cf_name = s->is_view()
? s->view_info()->base_name()
: (cdc ? cdc->cf_name() : column_family());
co_await state.has_column_family_access(keyspace(), cf_name, auth::permission::SELECT);
} catch (const data_dictionary::no_such_column_family& e) {
// Will be validated afterwards.

View File

@@ -18,6 +18,7 @@
#include <seastar/core/sleep.hh>
#include "batchlog_manager.hh"
#include "data_dictionary/data_dictionary.hh"
#include "mutation/canonical_mutation.hh"
#include "service/storage_proxy.hh"
#include "system_keyspace.hh"
@@ -36,7 +37,7 @@
static logging::logger blogger("batchlog_manager");
const uint32_t db::batchlog_manager::replay_interval;
const std::chrono::seconds db::batchlog_manager::replay_interval;
const uint32_t db::batchlog_manager::page_size;
db::batchlog_manager::batchlog_manager(cql3::query_processor& qp, db::system_keyspace& sys_ks, batchlog_manager_config config)
@@ -116,7 +117,8 @@ future<> db::batchlog_manager::batchlog_replay_loop() {
} catch (...) {
blogger.error("Exception in batch replay: {}", std::current_exception());
}
delay = std::chrono::milliseconds(replay_interval);
delay = utils::get_local_injector().is_enabled("short_batchlog_manager_replay_interval") ?
std::chrono::seconds(1) : replay_interval;
}
}
@@ -132,6 +134,8 @@ future<> db::batchlog_manager::drain() {
_sem.broken();
}
co_await _qp.proxy().abort_batch_writes();
co_await std::move(_loop_done);
blogger.info("Drained");
}
@@ -173,6 +177,11 @@ future<> db::batchlog_manager::replay_all_failed_batches(post_replay_cleanup cle
return make_ready_future<stop_iteration>(stop_iteration::no);
}
if (utils::get_local_injector().is_enabled("skip_batch_replay")) {
blogger.debug("Skipping batch replay due to skip_batch_replay injection");
return make_ready_future<stop_iteration>(stop_iteration::no);
}
// check version of serialization format
if (!row.has("version")) {
blogger.warn("Skipping logged batch because of unknown version");
@@ -242,13 +251,16 @@ future<> db::batchlog_manager::replay_all_failed_batches(post_replay_cleanup cle
// send to partially or wholly fail in actually sending stuff. Since we don't
// have hints (yet), send with CL=ALL, and hope we can re-do this soon.
// See below, we use retry on write failure.
return _qp.proxy().mutate(mutations, db::consistency_level::ALL, db::no_timeout, nullptr, empty_service_permit(), db::allow_per_partition_rate_limit::no);
auto timeout = db::timeout_clock::now() + write_timeout;
return _qp.proxy().send_batchlog_replay_to_all_replicas(std::move(mutations), timeout);
});
}).then_wrapped([this, id](future<> batch_result) {
try {
batch_result.get();
} catch (data_dictionary::no_such_keyspace& ex) {
// should probably ignore and drop the batch
} catch (const data_dictionary::no_such_column_family&) {
// As above -- we should drop the batch if the table doesn't exist anymore.
} catch (...) {
blogger.warn("Replay failed (will retry): {}", std::current_exception());
// timeout, overload etc.

View File

@@ -43,8 +43,9 @@ public:
using post_replay_cleanup = bool_class<class post_replay_cleanup_tag>;
private:
static constexpr uint32_t replay_interval = 60 * 1000; // milliseconds
static constexpr std::chrono::seconds replay_interval = std::chrono::seconds(60);
static constexpr uint32_t page_size = 128; // same as HHOM, for now, w/out using any heuristics. TODO: set based on avg batch size.
static constexpr std::chrono::seconds write_timeout = std::chrono::seconds(300);
using clock_type = lowres_clock;

View File

@@ -800,6 +800,8 @@ class db::commitlog::segment : public enable_shared_from_this<segment>, public c
void end_flush() {
_segment_manager->end_flush();
if (can_delete()) {
// #25709 - do this early if possible
_extended_segments.clear();
_segment_manager->discard_unused_segments();
}
}
@@ -875,6 +877,8 @@ public:
void release_cf_count(const cf_id_type& cf) {
mark_clean(cf, 1);
if (can_delete()) {
// #25709 - do this early if possible
_extended_segments.clear();
_segment_manager->discard_unused_segments();
}
}
@@ -2576,20 +2580,24 @@ struct fmt::formatter<db::commitlog::segment::cf_mark> {
void db::commitlog::segment_manager::discard_unused_segments() noexcept {
clogger.trace("Checking for unused segments ({} active)", _segments.size());
std::erase_if(_segments, [=](sseg_ptr s) {
if (s->can_delete()) {
clogger.debug("Segment {} is unused", *s);
return true;
}
if (s->is_still_allocating()) {
clogger.debug("Not safe to delete segment {}; still allocating.", *s);
} else if (!s->is_clean()) {
clogger.debug("Not safe to delete segment {}; dirty is {}", *s, segment::cf_mark {*s});
} else {
clogger.debug("Not safe to delete segment {}; disk ops pending", *s);
}
return false;
});
// #25709 ensure we don't free any segment until after prune.
{
auto tmp = _segments;
std::erase_if(_segments, [=](sseg_ptr s) {
if (s->can_delete()) {
clogger.debug("Segment {} is unused", *s);
return true;
}
if (s->is_still_allocating()) {
clogger.debug("Not safe to delete segment {}; still allocating.", *s);
} else if (!s->is_clean()) {
clogger.debug("Not safe to delete segment {}; dirty is {}", *s, segment::cf_mark {*s});
} else {
clogger.debug("Not safe to delete segment {}; disk ops pending", *s);
}
return false;
});
}
// launch in background, but guard with gate so this deletion is
// sure to finish in shutdown, because at least through this path,
@@ -2878,7 +2886,10 @@ future<> db::commitlog::segment_manager::do_pending_deletes() {
}
future<> db::commitlog::segment_manager::orphan_all() {
_segments.clear();
// #25709. the actual process of destroying the elements here
// might cause a call into discard_unused_segments.
// ensure the target vector is empty when we get to destructors
auto tmp = std::exchange(_segments, {});
return clear_reserve_segments();
}
@@ -3255,9 +3266,13 @@ const db::commitlog::config& db::commitlog::active_config() const {
return _segment_manager->cfg;
}
db::commitlog::segment_data_corruption_error::segment_data_corruption_error(std::string_view msg, uint64_t s)
: _msg(fmt::format("Segment data corruption: {}", msg))
, _bytes(s)
{}
db::commitlog::segment_truncation::segment_truncation(uint64_t pos)
: _msg(fmt::format("Segment truncation at {}", pos))
db::commitlog::segment_truncation::segment_truncation(std::string_view reason, uint64_t pos)
: _msg(fmt::format("Segment truncation at {}. Reason: {}", pos, reason))
, _pos(pos)
{}
@@ -3447,7 +3462,8 @@ db::commitlog::read_log_file(const replay_state& state, sstring filename, sstrin
while (rem < size) {
if (eof) {
throw segment_truncation(block_boundry);
auto reason = fmt::format("unexpected EOF, rem={}, size={}", rem, size);
throw segment_truncation(std::move(reason), block_boundry);
}
auto block_size = alignment - initial.size_bytes();
@@ -3458,7 +3474,8 @@ db::commitlog::read_log_file(const replay_state& state, sstring filename, sstrin
if (tmp.size_bytes() == 0) {
eof = true;
throw segment_truncation(block_boundry);
auto reason = fmt::format("read 0 bytes, while tried to read {}", block_size);
throw segment_truncation(std::move(reason), block_boundry);
}
crc32_nbo crc;
@@ -3493,10 +3510,12 @@ db::commitlog::read_log_file(const replay_state& state, sstring filename, sstrin
auto checksum = crc.checksum();
if (check != checksum) {
throw segment_data_corruption_error("Data corruption", alignment);
auto reason = fmt::format("checksums do not match: {:x} vs. {:x}", check, checksum);
throw segment_data_corruption_error(std::move(reason), alignment);
}
if (id != this->id) {
throw segment_truncation(pos + rem);
auto reason = fmt::format("IDs do not match: {} vs. {}", id, this->id);
throw segment_truncation(std::move(reason), pos + rem);
}
}
tmp.remove_suffix(detail::sector_overhead_size);
@@ -3771,7 +3790,8 @@ db::commitlog::read_log_file(const replay_state& state, sstring filename, sstrin
co_await read_chunk();
}
if (corrupt_size > 0) {
throw segment_data_corruption_error("Data corruption", corrupt_size);
auto reason = fmt::format("corrupted size while reading file: {}", corrupt_size);
throw segment_data_corruption_error(std::move(reason), corrupt_size);
}
} catch (...) {
p = std::current_exception();

View File

@@ -392,9 +392,7 @@ public:
class segment_data_corruption_error: public segment_error {
std::string _msg;
public:
segment_data_corruption_error(std::string msg, uint64_t s)
: _msg(std::move(msg)), _bytes(s) {
}
segment_data_corruption_error(std::string_view msg, uint64_t s);
uint64_t bytes() const {
return _bytes;
}
@@ -425,7 +423,7 @@ public:
std::string _msg;
uint64_t _pos;
public:
segment_truncation(uint64_t);
segment_truncation(std::string_view reason, uint64_t position);
uint64_t position() const;
const char* what() const noexcept override;

View File

@@ -86,6 +86,12 @@ object_storage_endpoints_to_json(const std::vector<db::object_storage_endpoint_p
return value_to_json(m);
}
static
json::json_return_type
uuid_to_json(const db::config::UUID& uuid) {
return value_to_json(format("{}", uuid));
}
// Convert a value that can be printed with fmt::format, or a vector of
// such values, to JSON. An example is enum_option<T>, because enum_option<T>
// has a specialization for fmt::formatter.
@@ -294,6 +300,12 @@ const config_type& config_type_for<std::vector<db::object_storage_endpoint_param
return ct;
}
template <>
const config_type& config_type_for<db::config::UUID>() {
static config_type ct("UUID", uuid_to_json);
return ct;
}
}
namespace YAML {
@@ -491,6 +503,22 @@ struct convert<db::object_storage_endpoint_param> {
}
};
template<>
struct convert<utils::UUID> {
static bool decode(const Node& node, utils::UUID& uuid) {
std::string uuid_string;
if (!convert<std::string>::decode(node, uuid_string)) {
return false;
}
try {
std::istringstream(uuid_string) >> uuid;
} catch (boost::program_options::invalid_option_value&) {
return false;
}
return true;
}
};
}
#if defined(DEBUG)
@@ -819,7 +847,7 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, inter_dc_stream_throughput_outbound_megabits_per_sec(this, "inter_dc_stream_throughput_outbound_megabits_per_sec", value_status::Unused, 0,
"Throttles all streaming file transfer between the data centers. This setting allows throttles streaming throughput betweens data centers in addition to throttling all network stream traffic as configured with stream_throughput_outbound_megabits_per_sec.")
, stream_io_throughput_mb_per_sec(this, "stream_io_throughput_mb_per_sec", liveness::LiveUpdate, value_status::Used, 0,
"Throttles streaming I/O to the specified total throughput (in MiBs/s) across the entire system. Streaming I/O includes the one performed by repair and both RBNO and legacy topology operations such as adding or removing a node. Setting the value to 0 disables stream throttling.")
"Throttles streaming I/O to the specified total throughput (in MiBs/s) across the entire system. Streaming I/O includes the one performed by repair and both RBNO and legacy topology operations such as adding or removing a node. Setting the value to 0 disables stream throttling. It is recommended to set the value for this parameter to be 75% of network bandwidth")
, stream_plan_ranges_fraction(this, "stream_plan_ranges_fraction", liveness::LiveUpdate, value_status::Used, 0.1,
"Specify the fraction of ranges to stream in a single stream plan. Value is between 0 and 1.")
, enable_file_stream(this, "enable_file_stream", liveness::LiveUpdate, value_status::Used, true, "Set true to use file based stream for tablet instead of mutation based stream")
@@ -942,6 +970,8 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"The default timeout for other, miscellaneous operations.\n"
"\n"
"Related information: About hinted handoff writes")
, request_timeout_on_shutdown_in_seconds(this, "request_timeout_on_shutdown_in_seconds", value_status::Used, 30,
"Timeout for CQL server requests on shutdown. After this timeout the server will shutdown all connections.")
, group0_raft_op_timeout_in_ms(this, "group0_raft_op_timeout_in_ms", liveness::LiveUpdate, value_status::Used, 60000,
"The time in milliseconds that group0 allows a Raft operation to complete.")
/**
@@ -1230,7 +1260,7 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, sstable_summary_ratio(this, "sstable_summary_ratio", value_status::Used, 0.0005, "Enforces that 1 byte of summary is written for every N (2000 by default)"
"bytes written to data file. Value must be between 0 and 1.")
, components_memory_reclaim_threshold(this, "components_memory_reclaim_threshold", liveness::LiveUpdate, value_status::Used, .2, "Ratio of available memory for all in-memory components of SSTables in a shard beyond which the memory will be reclaimed from components until it falls back under the threshold. Currently, this limit is only enforced for bloom filters.")
, large_memory_allocation_warning_threshold(this, "large_memory_allocation_warning_threshold", value_status::Used, (size_t(128) << 10) + 1, "Warn about memory allocations above this size; set to zero to disable.")
, large_memory_allocation_warning_threshold(this, "large_memory_allocation_warning_threshold", value_status::Used, size_t(1) << 20, "Warn about memory allocations above this size; set to zero to disable.")
, enable_deprecated_partitioners(this, "enable_deprecated_partitioners", value_status::Used, false, "Enable the byteordered and random partitioners. These partitioners are deprecated and will be removed in a future version.")
, enable_keyspace_column_family_metrics(this, "enable_keyspace_column_family_metrics", value_status::Used, false, "Enable per keyspace and per column family metrics reporting.")
, enable_node_aggregated_table_metrics(this, "enable_node_aggregated_table_metrics", value_status::Used, true, "Enable aggregated per node, per keyspace and per table metrics reporting, applicable if enable_keyspace_column_family_metrics is false.")
@@ -1399,7 +1429,7 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"The maximum fraction of cache memory permitted for use by index cache. Clamped to the [0.0; 1.0] range. Must be small enough to not deprive the row cache of memory, but should be big enough to fit a large fraction of the index. The default value 0.2 means that at least 80\% of cache memory is reserved for the row cache, while at most 20\% is usable by the index cache.")
, consistent_cluster_management(this, "consistent_cluster_management", value_status::Deprecated, true, "Use RAFT for cluster management and DDL.")
, force_gossip_topology_changes(this, "force_gossip_topology_changes", value_status::Used, false, "Force gossip-based topology operations in a fresh cluster. Only the first node in the cluster must use it. The rest will fall back to gossip-based operations anyway. This option should be used only for testing. Note: gossip topology changes are incompatible with tablets.")
, recovery_leader(this, "recovery_leader", liveness::LiveUpdate, value_status::Used, "", "Host ID of the node restarted first while performing the Manual Raft-based Recovery Procedure. Warning: this option disables some guardrails for the needs of the Manual Raft-based Recovery Procedure. Make sure you unset it at the end of the procedure.")
, recovery_leader(this, "recovery_leader", liveness::LiveUpdate, value_status::Used, utils::null_uuid(), "Host ID of the node restarted first while performing the Manual Raft-based Recovery Procedure. Warning: this option disables some guardrails for the needs of the Manual Raft-based Recovery Procedure. Make sure you unset it at the end of the procedure.")
, wasm_cache_memory_fraction(this, "wasm_cache_memory_fraction", value_status::Used, 0.01, "Maximum total size of all WASM instances stored in the cache as fraction of total shard memory.")
, wasm_cache_timeout_in_ms(this, "wasm_cache_timeout_in_ms", value_status::Used, 5000, "Time after which an instance is evicted from the cache.")
, wasm_cache_instance_size_limit(this, "wasm_cache_instance_size_limit", value_status::Used, 1024*1024, "Instances with size above this limit will not be stored in the cache.")

View File

@@ -207,6 +207,7 @@ public:
using seed_provider_type = db::seed_provider_type;
using hinted_handoff_enabled_type = db::hints::host_filter;
using error_injection_at_startup = db::error_injection_at_startup;
using UUID = utils::UUID;
/*
* All values and documentation taken from
@@ -322,6 +323,7 @@ public:
named_value<uint32_t> truncate_request_timeout_in_ms;
named_value<uint32_t> write_request_timeout_in_ms;
named_value<uint32_t> request_timeout_in_ms;
named_value<uint32_t> request_timeout_on_shutdown_in_seconds;
named_value<uint32_t> group0_raft_op_timeout_in_ms;
named_value<bool> cross_node_timeout;
named_value<uint32_t> internode_send_buff_size_in_bytes;
@@ -521,7 +523,7 @@ public:
named_value<bool> consistent_cluster_management;
named_value<bool> force_gossip_topology_changes;
named_value<sstring> recovery_leader;
named_value<UUID> recovery_leader;
named_value<double> wasm_cache_memory_fraction;
named_value<uint32_t> wasm_cache_timeout_in_ms;

View File

@@ -65,18 +65,18 @@ future<> hint_endpoint_manager::do_store_hint(schema_ptr s, lw_shared_ptr<const
const replay_position rp = rh.release();
if (_last_written_rp < rp) {
_last_written_rp = rp;
manager_logger.debug("[{}] Updated last written replay position to {}", end_point_key(), rp);
manager_logger.trace("hint_endpoint_manager[{}]:do_store_hint: Updated last written replay position to {}", end_point_key(), rp);
}
++shard_stats().written;
manager_logger.trace("Hint to {} was stored", end_point_key());
manager_logger.trace("hint_endpoint_manager[{}]:do_store_hint: Hint has been stored", end_point_key());
tracing::trace(tr_state, "Hint to {} was stored", end_point_key());
} catch (...) {
++shard_stats().errors;
const auto eptr = std::current_exception();
manager_logger.debug("store_hint(): got the exception when storing a hint to {}: {}", end_point_key(), eptr);
manager_logger.debug("hint_endpoint_manager[{}]:do_store_hint: Exception when storing a hint: {}", end_point_key(), eptr);
tracing::trace(tr_state, "Failed to store a hint to {}: {}", end_point_key(), eptr);
}
@@ -92,7 +92,7 @@ bool hint_endpoint_manager::store_hint(schema_ptr s, lw_shared_ptr<const frozen_
return do_store_hint(std::move(s), std::move(fm), tr_state);
});
} catch (...) {
manager_logger.trace("Failed to store a hint to {}: {}", end_point_key(), std::current_exception());
manager_logger.trace("hint_endpoint_manager[{}]:store_hint: Failed to store a hint: {}", end_point_key(), std::current_exception());
tracing::trace(tr_state, "Failed to store a hint to {}: {}", end_point_key(), std::current_exception());
++shard_stats().dropped;
@@ -109,16 +109,23 @@ future<> hint_endpoint_manager::populate_segments_to_replay() {
}
void hint_endpoint_manager::start() {
manager_logger.debug("hint_endpoint_manager[{}]:start: Starting", end_point_key());
clear_stopped();
allow_hints();
_sender.start();
manager_logger.debug("hint_endpoint_manager[{}]:start: Finished", end_point_key());
}
future<> hint_endpoint_manager::stop(drain should_drain) noexcept {
if(stopped()) {
if (stopped()) {
manager_logger.warn("hint_endpoint_manager[{}]:stop: Stop had already been called", end_point_key());
return make_exception_future<>(std::logic_error(format("ep_manager[{}]: stop() is called twice", _key).c_str()));
}
manager_logger.debug("hint_endpoint_manager[{}]:stop: Starting", end_point_key());
return seastar::async([this, should_drain] {
std::exception_ptr eptr;
@@ -139,10 +146,11 @@ future<> hint_endpoint_manager::stop(drain should_drain) noexcept {
}).handle_exception([&eptr] (auto e) { eptr = std::move(e); }).get();
if (eptr) {
manager_logger.error("ep_manager[{}]: exception: {}", _key, eptr);
manager_logger.error("hint_endpoint_manager[{}]:stop: Exception occurred: {}", _key, eptr);
}
set_stopped();
manager_logger.debug("hint_endpoint_manager[{}]:stop: Finished", end_point_key());
});
}
@@ -194,7 +202,7 @@ future<hints_store_ptr> hint_endpoint_manager::get_or_load() {
}
future<db::commitlog> hint_endpoint_manager::add_store() noexcept {
manager_logger.trace("Going to add a store to {}", _hints_dir.c_str());
manager_logger.debug("hint_endpoint_manager[{}]:add_store: Going to add a store: {}", end_point_key(), _hints_dir.native());
return futurize_invoke([this] {
return io_check([name = _hints_dir.c_str()] { return recursive_touch_directory(name); }).then([this] () {
@@ -289,6 +297,8 @@ future<db::commitlog> hint_endpoint_manager::add_store() noexcept {
_sender.add_segment(std::move(seg));
}
manager_logger.debug("hint_endpoint_manager[{}]:add_store: Finished", end_point_key());
co_return l;
});
});

View File

@@ -56,8 +56,8 @@ future<> hint_sender::flush_maybe() noexcept {
if (current_time >= _next_flush_tp) {
return _ep_manager.flush_current_hints().then([this, current_time] {
_next_flush_tp = current_time + manager::hints_flush_period;
}).handle_exception([] (auto eptr) {
manager_logger.trace("flush_maybe() failed: {}", eptr);
}).handle_exception([this] (auto eptr) {
manager_logger.debug("hint_sender[{}]:flush_maybe: Failed with {}", _ep_key, eptr);
return make_ready_future<>();
});
}
@@ -115,7 +115,7 @@ const column_mapping& hint_sender::get_column_mapping(lw_shared_ptr<send_one_fil
throw no_column_mapping(fm.schema_version());
}
manager_logger.debug("new schema version {}", fm.schema_version());
manager_logger.trace("hint_sender[{}]:get_column_mapping: new schema version {}", _ep_key, fm.schema_version());
cm_it = ctx_ptr->schema_ver_to_column_mapping.emplace(fm.schema_version(), *hr.get_column_mapping()).first;
}
@@ -175,23 +175,22 @@ future<> hint_sender::stop(drain should_drain) noexcept {
//
// The next call for send_hints_maybe() will send the last hints to the current end point and when it is
// done there is going to be no more pending hints and the corresponding hints directory may be removed.
manager_logger.trace("Draining for {}: start", end_point_key());
manager_logger.trace("hint_sender[{}]:stop: Draining starts", end_point_key());
set_draining();
send_hints_maybe();
_ep_manager.flush_current_hints().handle_exception([] (auto e) {
manager_logger.error("Failed to flush pending hints: {}. Ignoring...", e);
_ep_manager.flush_current_hints().handle_exception([this] (auto e) {
manager_logger.error("hint_sender[{}]:stop: Failed to flush pending hints: {}. Ignoring", _ep_key, e);
}).get();
send_hints_maybe();
manager_logger.trace("Draining for {}: end", end_point_key());
manager_logger.trace("hint_sender[{}]:stop: Draining finished", end_point_key());
}
// TODO: Change this log to match the class name, but first make sure no test
// relies on the old one.
manager_logger.trace("ep_manager({})::sender: exiting", end_point_key());
manager_logger.debug("hint_sender[{}]:stop: Finished", end_point_key());
});
}
void hint_sender::cancel_draining() {
manager_logger.info("Draining of {} has been marked as canceled", _ep_key);
manager_logger.info("hint_sender[{}]:cancel_draining: Marking as canceled", _ep_key);
if (_state.contains(state::draining)) {
_state.remove(state::draining);
}
@@ -222,9 +221,8 @@ void hint_sender::start() {
attr.sched_group = _hints_cpu_sched_group;
_stopped = seastar::async(std::move(attr), [this] {
// TODO: Change this log to match the class name, but first make sure no test
// relies on the old one.
manager_logger.trace("ep_manager({})::sender: started", end_point_key());
manager_logger.debug("hint_sender[{}]:start: Starting", end_point_key());
while (!stopping()) {
try {
flush_maybe().get();
@@ -237,34 +235,36 @@ void hint_sender::start() {
break;
} catch (...) {
// log and keep on spinning
// TODO: Change this log to match the class name, but first make sure no test
// relies on the old one.
manager_logger.trace("sender: got the exception: {}", std::current_exception());
manager_logger.debug("hint_sender[{}]:start: Exception in the loop: {}", _ep_key, std::current_exception());
}
}
manager_logger.debug("hint_sender[{}]:start: Exited the loop", _ep_key);
});
}
future<> hint_sender::send_one_mutation(frozen_mutation_and_schema m) {
auto ermp = _db.find_column_family(m.s).get_effective_replication_map();
auto token = dht::get_token(*m.s, m.fm.key());
host_id_vector_replica_set natural_endpoints = ermp->get_natural_replicas(std::move(token));
host_id_vector_replica_set natural_endpoints = ermp->get_natural_replicas(token);
host_id_vector_topology_change pending_endpoints = ermp->get_pending_replicas(token);
return futurize_invoke([this, m = std::move(m), ermp = std::move(ermp), &natural_endpoints] () mutable -> future<> {
return futurize_invoke([this, m = std::move(m), ermp = std::move(ermp), &natural_endpoints, &pending_endpoints] () mutable -> future<> {
// The fact that we send with CL::ALL in both cases below ensures that new hints are not going
// to be generated as a result of hints sending.
const auto& tm = ermp->get_token_metadata();
const auto dst = end_point_key();
if (std::ranges::contains(natural_endpoints, dst) && !tm.is_leaving(dst)) {
manager_logger.trace("Sending directly to {}", dst);
return _proxy.send_hint_to_endpoint(std::move(m), std::move(ermp), dst);
manager_logger.trace("hint_sender[{}]:send_one_mutation: Sending directly", dst);
// dst is not duplicated in pending_endpoints because it's in natural_endpoints
return _proxy.send_hint_to_endpoint(std::move(m), std::move(ermp), dst, std::move(pending_endpoints));
} else {
if (manager_logger.is_enabled(log_level::trace)) {
if (tm.is_leaving(end_point_key())) {
manager_logger.trace("The original target endpoint {} is leaving. Mutating from scratch...", dst);
manager_logger.trace("hint_sender[{}]:send_one_mutation: Original target is leaving. Mutating from scratch", dst);
} else {
manager_logger.trace("Endpoints set has changed and {} is no longer a replica. Mutating from scratch...", dst);
manager_logger.trace("hint_sender[{}]:send_one_mutation: Endpoint set has changed and original target is no longer a replica. Mutating from scratch", dst);
}
}
return _proxy.send_hint_to_all_replicas(std::move(m));
@@ -288,9 +288,9 @@ future<> hint_sender::send_one_hint(lw_shared_ptr<send_one_file_ctx> ctx_ptr, fr
// Files are aggregated for at most manager::hints_timer_period therefore the oldest hint there is
// (last_modification - manager::hints_timer_period) old.
if (const auto now = gc_clock::now().time_since_epoch(); now - secs_since_file_mod > gc_grace_sec - manager::hints_flush_period) {
manager_logger.debug("send_hints(): the hint is too old, skipping it, "
manager_logger.trace("hint_sender[{}]:send_hints: Hint is too old, skipping it, "
"secs since file last modification {}, gc_grace_sec {}, hints_flush_period {}",
now - secs_since_file_mod, gc_grace_sec, manager::hints_flush_period);
_ep_key, now - secs_since_file_mod, gc_grace_sec, manager::hints_flush_period);
return make_ready_future<>();
}
@@ -299,24 +299,24 @@ future<> hint_sender::send_one_hint(lw_shared_ptr<send_one_file_ctx> ctx_ptr, fr
++this->shard_stats().sent_total;
this->shard_stats().sent_hints_bytes_total += mutation_size;
}).handle_exception([this, ctx_ptr] (auto eptr) {
manager_logger.trace("send_one_hint(): failed to send to {}: {}", end_point_key(), eptr);
manager_logger.trace("hint_sender[{}]:send_one_hint: Failed to send: {}", end_point_key(), eptr);
++this->shard_stats().send_errors;
return make_exception_future<>(std::move(eptr));
});
// ignore these errors and move on - probably this hint is too old and the KS/CF has been deleted...
} catch (replica::no_such_column_family& e) {
manager_logger.debug("send_hints(): no_such_column_family: {}", e.what());
manager_logger.debug("hint_sender[{}]:send_one_hint: no_such_column_family: {}", _ep_key, e.what());
++this->shard_stats().discarded;
} catch (replica::no_such_keyspace& e) {
manager_logger.debug("send_hints(): no_such_keyspace: {}", e.what());
manager_logger.debug("hint_sender[{}]:send_one_hint: no_such_keyspace: {}", _ep_key, e.what());
++this->shard_stats().discarded;
} catch (no_column_mapping& e) {
manager_logger.debug("send_hints(): {} at {}: {}", fname, rp, e.what());
manager_logger.debug("hint_sender[{}]:send_one_hint: no_column_mapping: {} at {}: {}", _ep_key, fname, rp, e.what());
++this->shard_stats().discarded;
} catch (...) {
auto eptr = std::current_exception();
manager_logger.debug("send_hints(): unexpected error in file {} at {}: {}", fname, rp, eptr);
manager_logger.debug("hint_sender[{}]:send_one_hint: Unexpected error in file {} at {}: {}", _ep_key, fname, rp, eptr);
++this->shard_stats().send_errors;
return make_exception_future<>(std::move(eptr));
}
@@ -338,21 +338,24 @@ future<> hint_sender::send_one_hint(lw_shared_ptr<send_one_file_ctx> ctx_ptr, fr
}
f.ignore_ready_future();
});
}).handle_exception([ctx_ptr, rp] (auto eptr) {
manager_logger.trace("send_one_file(): Hmmm. Something bad had happened: {}", eptr);
}).handle_exception([this, ctx_ptr, rp] (auto eptr) {
manager_logger.trace("hint_sender[{}]:send_one_hint: Exception occurred: {}", _ep_key, eptr);
ctx_ptr->on_hint_send_failure(rp);
});
}
void hint_sender::notify_replay_waiters() noexcept {
if (!_foreign_segments_to_replay.empty()) {
manager_logger.trace("[{}] notify_replay_waiters(): not notifying because there are still {} foreign segments to replay", end_point_key(), _foreign_segments_to_replay.size());
manager_logger.trace("hint_sender[{}]:notify_replay_waiters: Not notifying because there are still {} foreign segments to replay",
end_point_key(), _foreign_segments_to_replay.size());
return;
}
manager_logger.trace("[{}] notify_replay_waiters(): replay position upper bound was updated to {}", end_point_key(), _sent_upper_bound_rp);
manager_logger.trace("hint_sender[{}]:notify_replay_waiters: Replay position upper bound was updated to {}", end_point_key(), _sent_upper_bound_rp);
while (!_replay_waiters.empty() && _replay_waiters.begin()->first < _sent_upper_bound_rp) {
manager_logger.trace("[{}] notify_replay_waiters(): notifying one ({} < {})", end_point_key(), _replay_waiters.begin()->first, _sent_upper_bound_rp);
manager_logger.trace("hint_sender[{}]:notify_replay_waiters: Notifying one ({} < {})",
end_point_key(), _replay_waiters.begin()->first, _sent_upper_bound_rp);
auto ptr = _replay_waiters.begin()->second;
(**ptr).set_value();
(*ptr) = std::nullopt; // Prevent it from being resolved by abort source subscription
@@ -362,7 +365,7 @@ void hint_sender::notify_replay_waiters() noexcept {
void hint_sender::dismiss_replay_waiters() noexcept {
for (auto& p : _replay_waiters) {
manager_logger.debug("[{}] dismiss_replay_waiters(): dismissing one", end_point_key());
manager_logger.debug("hint_sender[{}]:dismiss_replay_waiters: Dismissing one", end_point_key());
auto ptr = p.second;
(**ptr).set_exception(std::runtime_error(format("Hints manager for {} is stopping", end_point_key())));
(*ptr) = std::nullopt; // Prevent it from being resolved by abort source subscription
@@ -371,14 +374,15 @@ void hint_sender::dismiss_replay_waiters() noexcept {
}
future<> hint_sender::wait_until_hints_are_replayed_up_to(abort_source& as, db::replay_position up_to_rp) {
manager_logger.debug("[{}] wait_until_hints_are_replayed_up_to(): entering with target {}", end_point_key(), up_to_rp);
manager_logger.debug("hint_sender[{}]:wait_until_hints_are_replayed_up_to: Entering with target {}", end_point_key(), up_to_rp);
if (_foreign_segments_to_replay.empty() && up_to_rp < _sent_upper_bound_rp) {
manager_logger.debug("[{}] wait_until_hints_are_replayed_up_to(): hints were already replayed above the point ({} < {})", end_point_key(), up_to_rp, _sent_upper_bound_rp);
manager_logger.debug("hint_sender[{}]:wait_until_hints_are_replayed_up_to: Hints were already replayed above the point ({} < {})",
end_point_key(), up_to_rp, _sent_upper_bound_rp);
return make_ready_future<>();
}
if (as.abort_requested()) {
manager_logger.debug("[{}] wait_until_hints_are_replayed_up_to(): already aborted - stopping", end_point_key());
manager_logger.debug("hint_sender[{}]:wait_until_hints_are_replayed_up_to: Already aborted - stopping", end_point_key());
return make_exception_future<>(abort_requested_exception());
}
@@ -389,7 +393,7 @@ future<> hint_sender::wait_until_hints_are_replayed_up_to(abort_source& as, db::
// The promise already was resolved by `notify_replay_waiters` and removed from the map
return;
}
manager_logger.debug("[{}] wait_until_hints_are_replayed_up_to(): abort requested - stopping", end_point_key());
manager_logger.debug("hint_sender[{}]:wait_until_hints_are_replayed_up_to: Abort requested - stopping", end_point_key());
_replay_waiters.erase(it);
(**ptr).set_exception(abort_requested_exception());
});
@@ -398,7 +402,7 @@ future<> hint_sender::wait_until_hints_are_replayed_up_to(abort_source& as, db::
// therefore we cannot capture `this`
auto ep = end_point_key();
return (**ptr).get_future().finally([sub = std::move(sub), ep] {
manager_logger.debug("[{}] wait_until_hints_are_replayed_up_to(): returning after the future was satisfied", ep);
manager_logger.debug("hint_sender[{}]:wait_until_hints_are_replayed_up_to: Returning after the future was satisfied", ep);
});
}
@@ -470,7 +474,7 @@ bool hint_sender::send_one_file(const sstring& fname) {
}
if (canceled_draining()) {
manager_logger.debug("[{}] Exiting reading from commitlog because of canceled draining", _ep_key);
manager_logger.debug("hint_sender[{}]:send_one_file: Exiting reading from commitlog because of canceled draining", _ep_key);
// We need to throw an exception here to cancel reading the segment.
throw canceled_draining_exception{};
}
@@ -502,13 +506,15 @@ bool hint_sender::send_one_file(const sstring& fname) {
};
}, _last_not_complete_rp.pos, &_db.extensions()).get();
} catch (db::commitlog::segment_error& ex) {
manager_logger.error("{}: {}. Dropping...", fname, ex.what());
manager_logger.error("hint_sender[{}]:send_one_file: Segment error in {}: {}. Last not complete position={}",
_ep_key, fname, ex.what(), _last_not_complete_rp);
ctx_ptr->segment_replay_failed = false;
++this->shard_stats().corrupted_files;
} catch (const canceled_draining_exception&) {
manager_logger.debug("[{}] Loop in send_one_file finishes due to canceled draining", _ep_key);
manager_logger.debug("hint_sender[{}]:send_one_file: Loop in send_one_file finishes due to canceled draining", _ep_key);
} catch (...) {
manager_logger.trace("sending of {} failed: {}", fname, std::current_exception());
manager_logger.debug("hint_sender[{}]:send_one_file: Sending of {} failed: {}. Last not complete position={}",
_ep_key, fname, std::current_exception(), _last_not_complete_rp);
ctx_ptr->segment_replay_failed = true;
}
@@ -523,7 +529,7 @@ bool hint_sender::send_one_file(const sstring& fname) {
// If we are draining ignore failures and drop the segment even if we failed to send it.
if (draining() && ctx_ptr->segment_replay_failed) {
manager_logger.trace("send_one_file(): we are draining so we are going to delete the segment anyway");
manager_logger.debug("hint_sender[{}]:send_one_file: We are draining, so we are going to delete the segment anyway", _ep_key);
ctx_ptr->segment_replay_failed = false;
}
@@ -533,7 +539,7 @@ bool hint_sender::send_one_file(const sstring& fname) {
// If there was an error thrown by read_log_file function itself, we will retry sending from
// the last hint that was successfully sent (last_succeeded_rp).
_last_not_complete_rp = ctx_ptr->first_failed_rp.value_or(ctx_ptr->last_succeeded_rp.value_or(_last_not_complete_rp));
manager_logger.trace("send_one_file(): error while sending hints from {}, last RP is {}", fname, _last_not_complete_rp);
manager_logger.debug("hint_sender[{}]:send_one_file: Error while sending hints from {}, last RP is {}", _ep_key, fname, _last_not_complete_rp);
return false;
}
@@ -546,7 +552,7 @@ bool hint_sender::send_one_file(const sstring& fname) {
// clear the replay position - we are going to send the next segment...
_last_not_complete_rp = replay_position();
_last_schema_ver_to_column_mapping.clear();
manager_logger.trace("send_one_file(): segment {} was sent in full and deleted", fname);
manager_logger.debug("hint_sender[{}]:send_one_file: Segment {} has been sent in full and deleted", _ep_key, fname);
return true;
}
@@ -572,14 +578,15 @@ void hint_sender::pop_current_segment() {
// Runs in the seastar::async context
void hint_sender::send_hints_maybe() noexcept {
using namespace std::literals::chrono_literals;
manager_logger.trace("send_hints(): going to send hints to {}, we have {} segment to replay", end_point_key(), _segments_to_replay.size() + _foreign_segments_to_replay.size());
manager_logger.trace("hint_sender[{}]:send_hints_maybe: Going to send hints. We have {} segment to replay",
end_point_key(), _segments_to_replay.size() + _foreign_segments_to_replay.size());
int replayed_segments_count = 0;
try {
while (true) {
if (canceled_draining()) {
manager_logger.debug("[{}] Exiting loop in send_hints_maybe because of canceled draining", _ep_key);
manager_logger.debug("hint_sender[{}]:send_hints_maybe: Exiting loop in send_hints_maybe because of canceled draining", _ep_key);
break;
}
const sstring* seg_name = name_of_current_segment();
@@ -598,7 +605,7 @@ void hint_sender::send_hints_maybe() noexcept {
// Ignore exceptions, we will retry sending this file from where we left off the next time.
// Exceptions are not expected here during the regular operation, so just log them.
} catch (...) {
manager_logger.trace("send_hints(): got the exception: {}", std::current_exception());
manager_logger.debug("hint_sender[{}]:send_hints_maybe: Exception occurred while sending: {}", _ep_key, std::current_exception());
}
if (have_segments()) {
@@ -609,7 +616,7 @@ void hint_sender::send_hints_maybe() noexcept {
_next_send_retry_tp = _next_flush_tp;
}
manager_logger.trace("send_hints(): we handled {} segments", replayed_segments_count);
manager_logger.debug("hint_sender[{}]:send_hints_maybe: We handled {} segments", _ep_key, replayed_segments_count);
}
hint_stats& hint_sender::shard_stats() {

View File

@@ -505,20 +505,20 @@ bool manager::can_hint_for(endpoint_id ep) const noexcept {
// hints where N is the total number nodes in the cluster.
const auto hipf = hints_in_progress_for(ep);
if (_stats.size_of_hints_in_progress > max_size_of_hints_in_progress() && hipf > 0) {
manager_logger.trace("size_of_hints_in_progress {} hints_in_progress_for({}) {}",
manager_logger.trace("can_hint_for: size_of_hints_in_progress {} hints_in_progress_for({}) {}",
_stats.size_of_hints_in_progress, ep, hipf);
return false;
}
// Check that the destination DC is "hintable".
if (!check_dc_for(ep)) {
manager_logger.trace("{}'s DC is not hintable", ep);
manager_logger.trace("can_hint_for: {}'s DC is not hintable", ep);
return false;
}
const bool node_is_alive = local_gossiper().get_endpoint_downtime(ep) <= _max_hint_window_us;
if (!node_is_alive) {
manager_logger.trace("{} has been down for too long, not hinting", ep);
manager_logger.trace("can_hint_for: {} has been down for too long, not hinting", ep);
return false;
}

View File

@@ -11,9 +11,11 @@
#include <boost/functional/hash.hpp>
#include <boost/icl/interval_map.hpp>
#include <fmt/ranges.h>
#include <ranges>
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/parallel_for_each.hh>
#include <seastar/core/loop.hh>
#include <seastar/core/on_internal_error.hh>
#include "system_keyspace.hh"
#include "cql3/untyped_result_set.hh"
@@ -359,6 +361,7 @@ schema_ptr system_keyspace::raft() {
.set_comment("Persisted RAFT log, votes and snapshot info")
.with_hash_version()
.set_caching_options(caching_options::get_disabled_caching_options())
.build();
}();
return schema;
@@ -1694,6 +1697,12 @@ future<> system_keyspace::peers_table_read_fixup() {
continue;
}
const auto host_id = row.get_as<utils::UUID>("host_id");
if (!host_id) {
slogger.error("Peer {} has null host_id in system.{}, the record is broken, removing it",
peer, system_keyspace::PEERS);
co_await remove_endpoint(gms::inet_address{peer});
continue;
}
const auto ts = row.get_as<int64_t>("ts");
const auto it = map.find(host_id);
if (it == map.end()) {
@@ -1757,8 +1766,15 @@ future<> system_keyspace::drop_truncation_rp_records() {
auto rs = co_await execute_cql(req);
bool any = false;
co_await coroutine::parallel_for_each(*rs, [&] (const cql3::untyped_result_set_row& row) -> future<> {
std::unordered_set<table_id> to_delete;
auto db = _qp.db();
auto max_concurrency = std::min(1024u, smp::count * 8);
co_await seastar::max_concurrent_for_each(*rs, max_concurrency, [&] (const cql3::untyped_result_set_row& row) -> future<> {
auto table_uuid = table_id(row.get_as<utils::UUID>("table_uuid"));
if (!db.try_find_table(table_uuid)) {
to_delete.emplace(table_uuid);
co_return;
}
auto shard = row.get_as<int32_t>("shard");
auto segment_id = row.get_as<int64_t>("segment_id");
@@ -1768,11 +1784,26 @@ future<> system_keyspace::drop_truncation_rp_records() {
co_await execute_cql(req);
}
});
if (!to_delete.empty()) {
// IN has a limit to how many values we can put into it.
for (auto&& chunk : to_delete | std::views::transform(&table_id::to_sstring) | std::views::chunk(100)) {
auto str = std::ranges::to<std::string>(chunk | std::views::join_with(','));
auto req = fmt::format("DELETE FROM system.{} WHERE table_uuid IN ({})", TRUNCATED, str);
co_await execute_cql(req);
}
any = true;
}
if (any) {
co_await force_blocking_flush(TRUNCATED);
}
}
future<> system_keyspace::remove_truncation_records(table_id id) {
auto req = format("DELETE FROM system.{} WHERE table_uuid = {}", TRUNCATED, id);
co_await execute_cql(req);
co_await force_blocking_flush(TRUNCATED);
}
future<> system_keyspace::save_truncation_record(const replica::column_family& cf, db_clock::time_point truncated_at, db::replay_position rp) {
sstring req = format("INSERT INTO system.{} (table_uuid, shard, position, segment_id, truncated_at) VALUES(?,?,?,?,?)", TRUNCATED);
co_await _qp.execute_internal(req, {cf.schema()->id().uuid(), int32_t(rp.shard_id()), int32_t(rp.pos), int64_t(rp.base_id()), truncated_at}, cql3::query_processor::cache_internal::yes);
@@ -2155,7 +2186,59 @@ future<> system_keyspace::update_peer_info(gms::inet_address ep, locator::host_i
slogger.debug("{}: values={}", query, values);
co_await _qp.execute_internal(query, db::consistency_level::ONE, values, cql3::query_processor::cache_internal::yes);
const auto guard = co_await get_units(_peers_cache_lock, 1);
try {
co_await _qp.execute_internal(query, db::consistency_level::ONE, values, cql3::query_processor::cache_internal::yes);
if (auto* cache = get_peers_cache()) {
cache->host_id_to_inet_ip[hid] = ep;
cache->inet_ip_to_host_id[ep] = hid;
}
} catch (...) {
_peers_cache = nullptr;
throw;
}
}
system_keyspace::peers_cache* system_keyspace::get_peers_cache() {
auto* cache = _peers_cache.get();
if (cache && (lowres_clock::now() > cache->expiration_time)) {
_peers_cache = nullptr;
return nullptr;
}
return cache;
}
future<lw_shared_ptr<const system_keyspace::peers_cache>> system_keyspace::get_or_load_peers_cache() {
const auto guard = co_await get_units(_peers_cache_lock, 1);
if (auto* cache = get_peers_cache()) {
co_return cache->shared_from_this();
}
auto cache = make_lw_shared<peers_cache>();
cache->inet_ip_to_host_id = co_await load_host_ids();
cache->host_id_to_inet_ip.reserve(cache->inet_ip_to_host_id.size());
for (const auto [ip, id]: cache->inet_ip_to_host_id) {
const auto [it, inserted] = cache->host_id_to_inet_ip.insert({id, ip});
if (!inserted) {
on_internal_error(slogger, ::format("duplicate IP for host_id {}, first IP {}, second IP {}",
id, it->second, ip));
}
}
cache->expiration_time = lowres_clock::now() + std::chrono::milliseconds(200);
_peers_cache = cache;
co_return std::move(cache);
}
future<std::optional<gms::inet_address>> system_keyspace::get_ip_from_peers_table(locator::host_id id) {
const auto cache = co_await get_or_load_peers_cache();
if (const auto it = cache->host_id_to_inet_ip.find(id); it != cache->host_id_to_inet_ip.end()) {
co_return it->second;
}
co_return std::nullopt;
}
future<system_keyspace::host_id_to_ip_map_t> system_keyspace::get_host_id_to_ip_map() {
const auto cache = co_await get_or_load_peers_cache();
co_return cache->host_id_to_inet_ip;
}
template <typename T>
@@ -2205,7 +2288,22 @@ future<> system_keyspace::update_schema_version(table_schema_version version) {
future<> system_keyspace::remove_endpoint(gms::inet_address ep) {
const sstring req = format("DELETE FROM system.{} WHERE peer = ?", PEERS);
slogger.debug("DELETE FROM system.{} WHERE peer = {}", PEERS, ep);
co_await execute_cql(req, ep.addr()).discard_result();
const auto guard = co_await get_units(_peers_cache_lock, 1);
try {
co_await execute_cql(req, ep.addr()).discard_result();
if (auto* cache = get_peers_cache()) {
const auto it = cache->inet_ip_to_host_id.find(ep);
if (it != cache->inet_ip_to_host_id.end()) {
const auto id = it->second;
cache->inet_ip_to_host_id.erase(it);
cache->host_id_to_inet_ip.erase(id);
}
}
} catch (...) {
_peers_cache = nullptr;
throw;
}
}
future<> system_keyspace::update_tokens(const std::unordered_set<dht::token>& tokens) {

View File

@@ -143,6 +143,17 @@ class system_keyspace : public seastar::peering_sharded_service<system_keyspace>
// and this node crashes after adding a new IP but before removing the old one. The
// record with older timestamp is removed, the warning is written to the log.
future<> peers_table_read_fixup();
struct peers_cache: public enable_lw_shared_from_this<peers_cache> {
std::unordered_map<gms::inet_address, locator::host_id> inet_ip_to_host_id;
std::unordered_map<locator::host_id, gms::inet_address> host_id_to_inet_ip;
lowres_clock::time_point expiration_time;
};
lw_shared_ptr<peers_cache> _peers_cache;
semaphore _peers_cache_lock{1};
peers_cache* get_peers_cache();
future<lw_shared_ptr<const peers_cache>> get_or_load_peers_cache();
public:
static schema_ptr size_estimates();
public:
@@ -308,6 +319,12 @@ public:
future<> update_peer_info(gms::inet_address ep, locator::host_id hid, const peer_info& info);
// Return ip of the peers table entry with given host id
future<std::optional<gms::inet_address>> get_ip_from_peers_table(locator::host_id id);
using host_id_to_ip_map_t = std::unordered_map<locator::host_id, gms::inet_address>;
future<host_id_to_ip_map_t> get_host_id_to_ip_map();
future<> remove_endpoint(gms::inet_address ep);
// Saves the key-value pair into system.scylla_local table.
@@ -418,6 +435,7 @@ public:
future<> save_truncation_record(const replica::column_family&, db_clock::time_point truncated_at, db::replay_position);
future<replay_positions> get_truncated_positions(table_id);
future<> drop_truncation_rp_records();
future<> remove_truncation_records(table_id);
// Converts a `dht::token_range` object to the left-open integer range (x,y] form.
//

View File

@@ -3449,6 +3449,7 @@ void delete_ghost_rows_visitor::accept_new_row(const clustering_key& ck, const q
auto view_exploded_ck = ck.explode();
std::vector<bytes> base_exploded_pk(_base_schema->partition_key_size());
std::vector<bytes> base_exploded_ck(_base_schema->clustering_key_size());
std::map<const column_definition*, bytes> view_key_cols_not_in_base_key;
for (const column_definition& view_cdef : _view->all_columns()) {
const column_definition* base_cdef = _base_schema->get_column_definition(view_cdef.name());
if (base_cdef) {
@@ -3457,6 +3458,8 @@ void delete_ghost_rows_visitor::accept_new_row(const clustering_key& ck, const q
base_exploded_pk[base_cdef->id] = view_exploded_key[view_cdef.id];
} else if (base_cdef->is_clustering_key()) {
base_exploded_ck[base_cdef->id] = view_exploded_key[view_cdef.id];
} else if (!base_cdef->is_computed() && view_cdef.is_primary_key()) {
view_key_cols_not_in_base_key[base_cdef] = view_exploded_key[view_cdef.id];
}
}
}
@@ -3464,22 +3467,44 @@ void delete_ghost_rows_visitor::accept_new_row(const clustering_key& ck, const q
clustering_key base_ck = clustering_key::from_exploded(base_exploded_ck);
dht::partition_range_vector partition_ranges({dht::partition_range::make_singular(dht::decorate_key(*_base_schema, base_pk))});
auto selection = cql3::selection::selection::for_columns(_base_schema, std::vector<const column_definition*>({&_base_schema->partition_key_columns().front()}));
auto view_key_cols_not_in_base_key_cdefs = view_key_cols_not_in_base_key | std::views::keys | std::ranges::to<std::vector<const column_definition*>>();
auto selection = cql3::selection::selection::for_columns(_base_schema,
view_key_cols_not_in_base_key.empty() ? std::vector<const column_definition*>({&_base_schema->partition_key_columns().front()}) : view_key_cols_not_in_base_key_cdefs);
std::vector<query::clustering_range> bounds{query::clustering_range::make_singular(base_ck)};
query::partition_slice partition_slice(std::move(bounds), {}, {}, selection->get_query_options());
utils::small_vector<column_id, 8> view_key_col_ids;
for (const auto& [col_def, _] : view_key_cols_not_in_base_key) {
view_key_col_ids.push_back(col_def->id);
}
query::partition_slice partition_slice(std::move(bounds), {}, std::move(view_key_col_ids), selection->get_query_options());
auto command = ::make_lw_shared<query::read_command>(_base_schema->id(), _base_schema->version(), partition_slice,
_proxy.get_max_result_size(partition_slice), query::tombstone_limit(_proxy.get_tombstone_limit()));
auto timeout = db::timeout_clock::now() + _timeout_duration;
service::storage_proxy::coordinator_query_options opts{timeout, _state.get_permit(), _state.get_client_state(), _state.get_trace_state()};
auto base_qr = _proxy.query(_base_schema, command, std::move(partition_ranges), db::consistency_level::ALL, opts).get();
query::result& result = *base_qr.query_result;
if (result.row_count().value_or(0) == 0) {
auto delete_ghost_row = [&]() {
mutation m(_view, *_view_pk);
auto& row = m.partition().clustered_row(*_view, ck);
row.apply(tombstone(api::new_timestamp(), gc_clock::now()));
timeout = db::timeout_clock::now() + _timeout_duration;
_proxy.mutate({m}, db::consistency_level::ALL, timeout, _state.get_trace_state(), empty_service_permit(), db::allow_per_partition_rate_limit::no).get();
};
if (result.row_count().value_or(0) == 0) {
delete_ghost_row();
} else if (!view_key_cols_not_in_base_key.empty()) {
if (result.row_count().value_or(0) != 1) {
on_internal_error(vlogger, format("Got multiple base rows corresponding to a single view row when pruning {}.{}", _view->ks_name(), _view->cf_name()));
}
auto results = query::result_set::from_raw_result(_base_schema, partition_slice, result);
auto& base_row = results.row(0);
for (const auto& [col_def, col_val] : view_key_cols_not_in_base_key) {
const data_value* base_val = base_row.get_data_value(col_def->name_as_text());
if (!base_val || base_val->is_null() || col_val != base_val->serialize_nonnull()) {
delete_ghost_row();
break;
}
}
}
}

View File

@@ -131,6 +131,28 @@ def configure_iotune_open_fd_limit(shards_count):
logging.error(f"Required FDs count: {precalculated_fds_count}, default limit: {fd_limits}!")
sys.exit(1)
def force_random_request_size_of_4k():
"""
It is a known bug that on i4i, i7i, i8g, i8ge instances, the disk controller reports the wrong
physical sector size as 512bytes, but the actual physical sector size is 4096bytes. This function
helps us work around that issue until AWS manages to get a fix for it. It returns 4096 if it
detect it's running on one of the affected instance types, otherwise it returns None and IOTune
will use the physical sector size reported by the disk.
"""
path="/sys/devices/virtual/dmi/id/product_name"
try:
with open(path, "r") as f:
instance_type = f.read().strip()
except FileNotFoundError:
logging.warning(f"Couldn't find {path}. Falling back to IOTune using the physical sector size reported by disk.")
return
prefixes = ["i7i", "i4i", "i8g", "i8ge"]
if any(instance_type.startswith(p) for p in prefixes):
return 4096
def run_iotune():
if "SCYLLA_CONF" in os.environ:
conf_dir = os.environ["SCYLLA_CONF"]
@@ -173,6 +195,8 @@ def run_iotune():
configure_iotune_open_fd_limit(cpudata.nr_shards())
if (reqsize := force_random_request_size_of_4k()):
iotune_args += ["--random-write-io-buffer-size", f"{reqsize}"]
try:
subprocess.check_call([bindir() + "/iotune",
"--format", "envfile",

View File

@@ -17,6 +17,7 @@ import stat
import logging
import pyudev
import psutil
import platform
from pathlib import Path
from scylla_util import *
from subprocess import run, SubprocessError
@@ -102,6 +103,21 @@ def is_selinux_enabled():
return True
return False
def is_kernel_version_at_least(major, minor):
"""Check if the Linux kernel version is at least major.minor"""
try:
kernel_version = platform.release()
# Extract major.minor from version string like "5.15.0-56-generic"
version_parts = kernel_version.split('.')
if len(version_parts) >= 2:
kernel_major = int(version_parts[0])
kernel_minor = int(version_parts[1])
return (kernel_major, kernel_minor) >= (major, minor)
except (ValueError, IndexError):
# If we can't parse the version, assume older kernel for safety
pass
return False
if __name__ == '__main__':
if os.getuid() > 0:
print('Requires root permission.')
@@ -231,8 +247,17 @@ if __name__ == '__main__':
# see https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/tree/mkfs/xfs_mkfs.c .
# and it also cannot be smaller than the sector size.
block_size = max(1024, sector_size)
run('udevadm settle', shell=True, check=True)
run(f'mkfs.xfs -b size={block_size} {fsdev} -K -m rmapbt=0 -m reflink=0', shell=True, check=True)
# On Linux 5.12+, sub-block overwrites are supported well, so keep the default block
# size, which will play better with the SSD.
if is_kernel_version_at_least(5, 12):
block_size_opt = ""
else:
block_size_opt = f"-b size={block_size}"
run(f'mkfs.xfs {block_size_opt} {fsdev} -K -m rmapbt=0 -m reflink=0', shell=True, check=True)
run('udevadm settle', shell=True, check=True)
if is_debian_variant():

View File

@@ -86,9 +86,9 @@ if __name__ == '__main__':
ethpciid = ''
if network_mode == 'dpdk':
dpdk_status = out('/opt/scylladb/scripts/dpdk-devbind.py --status')
match = re.search('if={} drv=(\S+)'.format(ifname), dpdk_status, flags=re.MULTILINE)
match = re.search(r'if={} drv=(\S+)'.format(ifname), dpdk_status, flags=re.MULTILINE)
ethdrv = match.group(1)
match = re.search('^(\\S+:\\S+:\\S+\.\\S+) [^\n]+ if={} '.format(ifname), dpdk_status, flags=re.MULTILINE)
match = re.search(r'^(\S+:\S+:\S+\.\S+) [^\n]+ if={} '.format(ifname), dpdk_status, flags=re.MULTILINE)
ethpciid = match.group(1)
if args.mode:

View File

@@ -18,7 +18,7 @@ Breaks: scylla-enterprise-conf (<< 2025.1.0~)
Package: %{product}-server
Architecture: any
Depends: ${misc:Depends}, %{product}-conf (= ${binary:Version}), %{product}-python3 (= ${binary:Version})
Depends: ${misc:Depends}, %{product}-conf (= ${binary:Version}), %{product}-python3 (= ${binary:Version}), procps
Replaces: %{product}-tools (<<5.5), scylla-enterprise-tools (<< 2024.2.0~), scylla-enterprise-server (<< 2025.1.0~)
Breaks: %{product}-tools (<<5.5), scylla-enterprise-tools (<< 2024.2.0~), scylla-enterprise-server (<< 2025.1.0~)
Description: Scylla database server binaries

View File

@@ -14,6 +14,15 @@ product="$(<build/SCYLLA-PRODUCT-FILE)"
version="$(sed 's/-/~/' <build/SCYLLA-VERSION-FILE)"
release="$(<build/SCYLLA-RELEASE-FILE)"
original_version="$(<build/SCYLLA-VERSION-FILE)"
if [[ "$original_version" == *"-dev"* ]]; then
repo_file_url="https://downloads.scylladb.com/unstable/scylla/master/rpm/centos/latest/scylla.repo"
else
# Remove the last dot-separated component
repo_version="${original_version%.*}"
repo_file_url="https://downloads.scylladb.com/rpm/centos/scylla-$repo_version.repo"
fi
mode="release"
arch="$(uname -m)"
@@ -88,8 +97,8 @@ bcp LICENSE-ScyllaDB-Source-Available.md /licenses/
run microdnf clean all
run microdnf --setopt=tsflags=nodocs -y update
run microdnf --setopt=tsflags=nodocs -y install hostname python3 python3-pip kmod
run microdnf clean all
run microdnf --setopt=tsflags=nodocs -y install hostname kmod procps-ng python3 python3-pip
run curl -L --output /etc/yum.repos.d/scylla.repo ${repo_file_url}
run pip3 install --no-cache-dir --prefix /usr supervisor
run bash -ec "echo LANG=C.UTF-8 > /etc/locale.conf"
run bash -ec "rpm -ivh packages/*.rpm"

View File

@@ -76,6 +76,7 @@ Group: Applications/Databases
Summary: The Scylla database server
Requires: %{product}-conf = %{version}-%{release}
Requires: %{product}-python3 = %{version}-%{release}
Requires: procps-ng
AutoReqProv: no
Provides: %{product}-tools:%{_bindir}/nodetool
Provides: %{product}-tools:%{_sysconfigdir}/bash_completion.d/nodetool-completion

View File

@@ -1,16 +1,25 @@
{
"Linux Distributions": {
"Ubuntu": ["22.04", "24.04"],
"Debian": ["11"],
"Rocky / CentOS / RHEL": ["8", "9"],
"Debian": ["11", "12"],
"Rocky / CentOS / RHEL": ["8", "9", "10"],
"Amazon Linux": ["2023"]
},
"ScyllaDB Versions": [
{
"version": "ScyllaDB 2025.3",
"supported_OS": {
"Ubuntu": ["22.04", "24.04"],
"Debian": ["11", "12"],
"Rocky / CentOS / RHEL": ["8", "9", "10"],
"Amazon Linux": ["2023"]
}
},
{
"version": "ScyllaDB 2025.2",
"supported_OS": {
"Ubuntu": ["22.04", "24.04"],
"Debian": ["11"],
"Debian": ["11", "12"],
"Rocky / CentOS / RHEL": ["8", "9"],
"Amazon Linux": ["2023"]
}
@@ -19,7 +28,7 @@
"version": "ScyllaDB 2025.1",
"supported_OS": {
"Ubuntu": ["22.04", "24.04"],
"Debian": ["11"],
"Debian": ["11", "12"],
"Rocky / CentOS / RHEL": ["8", "9"],
"Amazon Linux": ["2023"]
}

View File

@@ -172,4 +172,7 @@
/stable/upgrade/upgrade-opensource/upgrade-guide-from-4.5-to-4.6/metric-update-4.5-to-4.6.html: /stable/upgrade/index.html
# Divide API reference to smaller files
# /stable/reference/api-reference.html: /stable/reference/api/index.html
# /stable/reference/api-reference.html: /stable/reference/api/index.html
# Fixed typo in the file name
/stable/operating-scylla/nodetool-commands/enbleautocompaction.html: /stable/operating-scylla/nodetool-commands/enableautocompaction.html

View File

@@ -42,6 +42,10 @@ the administrator. The tablet load balancer decides where to migrate
the tablets, either within the same node to balance the shards or across
the nodes to balance the global load in the cluster.
The number of tablets the load balancer maintains on a node is directly
proportional to the node's storage capacity. A node with twice
the storage will have twice the number of tablets located on it.
As a table grows, each tablet can split into two, creating a new tablet.
The load balancer can migrate the split halves independently to different nodes
or shards.
@@ -83,6 +87,53 @@ especially for data models that contain small cells.
File-based streaming is used for tablet migration in all
:ref:`keyspaces created with tablets enabled <tablets>`.
.. _absolute-number-of-tablets:
Absolute number of tablets
==========================
ScyllaDB has a background process that periodically re-evaluates the number of tablets of each table.
The computed number of tablets a table will have is based on several parameters and factors. These are:
* Keyspace tablets option ``'initial'``. This option sets the initial number of tablets on the keyspace level.
See :ref:`The tablets property <tablets>` for details.
* Table-level option ``'expected_data_size_in_gb'``. This option sets the minimal number of tablets for a table
based on the expected table size and the target tablet size. See
:ref:`Per-table tablet options <cql-per-table-tablet-options>` for details.
* Table-level option ``'min_per_shard_tablet_count'``. Using this option results in the number of tablets being
computed based on the number of shards in a DC so that each shard has at least ``'min_per_shard_tablet_count'``
tablets on average. See :ref:`Per-table tablet options <cql-per-table-tablet-options>` for details.
* Table-level option ``'min_tablet_count'``. This option sets the minimal number of tablets for the given table.
See :ref:`Per-table tablet options <cql-per-table-tablet-options>` for details.
* Config option ``'tablets_initial_scale_factor'``. This option sets the minimal number of tablets per shard
per table globally. This option can be overridden by the table-level option: ``'min_per_shard_tablet_count'``.
``'tablets_initial_scale_factor'`` is ignored if either the keyspace option ``'initial'`` or table-level
option ``'min_tablet_count'`` is set.
Another factor that determines the absolute tablet count is the amount of data the table contains. If the
amount of data in the table is such that the average tablet size is larger than double the target tablet size,
the table will be split (the number of tablets will be doubled), and if the average tablet size is smaller than
half the target tablet size, it will be merged (the number of tablets will be halved).
Each of these factors is taken into consideration, and the one producing the largest number of tablets wins, and
will be used as the number of tablets for the given table.
As the last step, in order to avoid having too many tablets per shard, which could potentially lead to overload
and performance degradation, ScyllaDB will run the following algorithm to respect the ``tablets_per_shard_goal``
config option:
* Compute average tablet count per-shard in each DC.
* Determine if per-shard goal is exceeded in that DC.
* Compute scale factor by which tablet count should be multiplied so that the goal is not exceeded in that DC.
* Take the smallest scale factor among all DCs, which ensures that no DC is overloaded.
* Each table's tablet count is aligned to the nearest power of 2 post-scaling.
Please note that because of this alignment, the scaling may not be effective and in the worst case may be
overshot by a factor of 2, and that the ``tablets_per_shard_goal`` is a soft limit and not a hard constraint.
Finally, the computed tablet count is compared with the current tablet count for each table, and if there is
a difference, a table resize (split or merge) is executed.
.. _tablets-enable-tablets:
Enabling Tablets

View File

@@ -481,7 +481,8 @@ Creating a new user-defined type is done using a ``CREATE TYPE`` statement defin
field_definition: `identifier` `cql_type`
A UDT has a name (``udt_name``), which is used to declare columns of that type and is a set of named and typed fields. The ``udt_name`` can be any
type, including collections or other UDTs. UDTs and collections inside collections must always be frozen (no matter which version of ScyllaDB you are using).
type, including collections or other UDTs.
Similar to collections, a UDT can be frozen or non-frozen. A frozen UDT is immutable and can only be updated as a whole. Nested UDTs or UDTs used in keys must always be frozen.
For example::
@@ -506,26 +507,15 @@ For example::
CREATE TABLE superheroes (
name frozen<full_name> PRIMARY KEY,
home frozen<address>
home address
);
.. note::
- Attempting to create an already existing type will result in an error unless the ``IF NOT EXISTS`` option is used. If it is used, the statement will be a no-op if the type already exists.
- A type is intrinsically bound to the keyspace in which it is created and can only be used in that keyspace. At creation, if the type name is prefixed by a keyspace name, it is created in that keyspace. Otherwise, it is created in the current keyspace.
- As of ScyllaDB Open Source 3.2, UDTs not inside collections do not have to be frozen, but in all versions prior to ScyllaDB Open Source 3.2, and in all ScyllaDB Enterprise versions, UDTs **must** be frozen.
A non-frozen UDT example with ScyllaDB Open Source 3.2 and higher::
CREATE TYPE ut (a int, b int);
CREATE TABLE cf (a int primary key, b ut);
Same UDT in versions prior::
CREATE TYPE ut (a int, b int);
CREATE TABLE cf (a int primary key, b frozen<ut>);
UDT literals
~~~~~~~~~~~~

View File

@@ -10,7 +10,7 @@ This is a manual for `test.py`.
## Installation
To run `test.py`, Python 3.7 or higher is required.
To run `test.py`, Python 3.11 or higher is required.
`./install-dependencies.sh` should install all the required Python
modules. If `install-dependencies.sh` does not support your distribution,
please manually install all Python modules it lists with `pip`.
@@ -106,6 +106,19 @@ shed more light on this.
Build artefacts, such as test output and harness output is stored
in `./testlog`. Scylla data files are stored in `/tmp`.
There are several test directories that are excluded from orchestration by `test.py`:
- test/boost
- test/raft
- test/ldap
- test/unit
This means that `test.py` will not run tests directly, but will delegate all work to `pytest`.
That's why all these directories do not have `suite.yaml` files.
Additionally, these directories do not follow abstract naming suite/testname
convention, and instead use the `pytest` naming convention, i.e. to run a test you need to provide the path to the file
and optionally the test name, e.g. `test/boost/aggregate_fcts_test.cc::test_aggregate_avg`.
## How it works
On start, `test.py` invokes `ninja` to find out configured build modes. Then
@@ -176,11 +189,11 @@ Scylla (possibly started in debugger) using `cqlsh`.
The same unit test can be run in different seastar configurations, i.e. with
different command line arguments. The custom arguments can be set in
`custom_args` key of the `suite.yaml` file.
`custom_args` key of the `test_config.yaml` file.
Tests from boost suite are divided into test-cases. These are top-level
functions wrapped by `BOOST_AUTO_TEST_CASE`, `SEASTAR_TEST_CASE` or alike.
Boost tests support `suitename/testname::casename` selection described above.
Boost tests support `path/to/file_name.cc::casename` selection described above.
### Debugging unit tests
@@ -328,6 +341,19 @@ as it was at the beginning of the test, is considered "dirty".
Such clusters are not returned to the pool, but destroyed, and
the pool is replenished with a new cluster instead.
## Test metrics
The parameter `--gather-metrics` is used to gather CPU/RAM usage during tests from the cgroup and system overall CPU/RAM
usage.
For that, SQLite database is used to store the metrics in `testlog/sqlite.db`.
The database is created in the `testlog` directory and contains the following tables:
- `tests` - contains the list of tests that were executed with information about the test name, directory, architecture,
and mode
- `test_metrics` - contains the metrics for each test, such as memory peak usage, CPU usage, and duration
- `system_resource_metrics` - contains system CPU and memory utilization in percents during the whole run
- `cgroup_memory_metrics` - contains cgroup memory usage during the test run
## Automation, CI, and Jenkins
If any of the tests fails, `test.py` returns a non-zero exit status.

View File

@@ -0,0 +1,53 @@
===========================
Backup And Restore Overview
===========================
Backup and restore are critical components of data management, ensuring that your data is safe and can be recovered in case of loss or corruption. This document provides an overview of the backup and restore process, including best practices, tools, procedures, metrics and more
Process Overview
----------------
**The Backup process** is managed by ScyllaDB Manager as a whole.
The overview given here is for a single node, ScyllaDB Manager is responsible to orchestrate backups across the cluster.
For backup, a snapshot is created, and then the data is copied to a remote location - normally an S3 bucket, Google Cloud Storage, or a similar service.
**The Restore process** is also managed by ScyllaDB manager and it involves copying the data back from the remote location to an empty ScyllaDB node.
Restoring to a live cluster is not yet supported.
Backup Process
--------------
#. **Snapshot Creation**: A snapshot of the data is created on the ScyllaDB node.
This is a point-in-time copy of the data.
#. **Upload Data**: The snapshot data is transferred to a remote storage location,
such as an S3 bucket or Google Cloud Storage. You can upload data in two ways:
* **rclone** - the tool responsible for the upload is the scylla manager agent
that runs on the node.
- It runs side by side with scylla and therefore may interfere
with ScyllaDB performance.
- It supports many cloud storage providers.
* **Native upload** - ScyllaDB itself is responsible for the upload.
- It takes into consideration ScyllaDB performance and does
not interfere with it.
- It supports only S3 compatible storage providers.
See the `ScyllaDB Manager backup documentation <https://manager.docs.scylladb.com/stable/backup/index.html>`_
for more details on how to configure the upload method.
#. **Native Configuration**:
* For `native` backup to work without interference to users' workload, it is
best to limit io-scheduling. See :ref:`stream_io_throughput_mb_per_sec <confprop_stream_io_throughput_mb_per_sec>` for details.
* For `native` backup to work, ScyllaDB node must have access to the S3 bucket.
See :ref:`Configuring Object Storage <object-storage-configuration>` for details.
Restore Process
---------------
The restore process is managed completely by ScyllaDB Manager.
No special configuration is needed.
Restore may be executed by rclone or natively, not depending on the backup method used.
See `ScyllaDB Manager restore documentation <https://manager.docs.scylladb.com/stable/restore/index.html>`_ for more details on how to restore data.

View File

@@ -15,6 +15,7 @@ This document highlights ScyllaDB's key data modeling features.
Change Data Capture </features/cdc/index>
Workload Attributes </features/workload-attributes>
Workload Prioritization </features/workload-prioritization>
Backup and Restore </features/backup-and-restore>
.. panel-box::
:title: ScyllaDB Features
@@ -36,3 +37,5 @@ This document highlights ScyllaDB's key data modeling features.
state and the history of all changes made to tables in the database.
* :doc:`Workload Attributes </features/workload-attributes>` assigned to your workloads
specify how ScyllaDB will handle requests depending on the workload.
* :doc:`Backup and Restore </features/backup-and-restore>` allows you to create
backups of your data and restore it when needed.

View File

@@ -20,8 +20,7 @@ You can run your ScyllaDB workloads on AWS, GCE, and Azure using a ScyllaDB imag
Amazon Web Services (AWS)
-----------------------------
* The recommended instance types are :ref:`i3en <system-requirements-i3en-instances>`, and :ref:`i4i <system-requirements-i4i-instances>`.
* We recommend using enhanced networking that exposes the physical network cards to the VM.
The recommended instance types are :ref:`i3en <system-requirements-i3en-instances>`, :ref:`i4i <system-requirements-i4i-instances>`, :ref:`i7i <system-requirements-i7i-instances>`, and :ref:`i7ie <system-requirements-i7ie-instances>`.
.. note::
@@ -99,6 +98,102 @@ See `Amazon EC2 I4i Instances <https://aws.amazon.com/ec2/instance-types/i4i/>`
See `ScyllaDB on the New AWS EC2 I4i Instances: Twice the Throughput & Lower Latency <https://www.scylladb.com/2022/05/09/scylladb-on-the-new-aws-ec2-i4i-instances-twice-the-throughput-lower-latency/>`_ to
learn more about using ScyllaDB with i4i instances.
.. _system-requirements-i7i-instances:
i7i instances
^^^^^^^^^^^^^^
The following i7i instances are supported. Larger i7i instances will be
supported in upcoming releases.
.. list-table::
:widths: 30 20 20 30
:header-rows: 1
* - Model
- vCPU
- Mem (GiB)
- Storage (GB)
* - i7i.large
- 2
- 16
- 1 x 468 GB
* - i7i.xlarge
- 4
- 32
- 1 x 937.5 GB
* - i7i.2xlarge
- 8
- 64
- 1 x 1875 GB
All i7i instances have the following specs:
* 3.2 GHz all-core turbo Intel® Xeon® Scalable (Emerald Rapids) processors.
* Up to 100 Gbps of networking bandwidth and 60 Gbps bandwidth to Amazon Elastic Block Store (EBS).
* Advanced Matrix Extensions (AMX) for accelerating CPU-based machine learning.
See `Amazon EC2 I7i Instances <https://aws.amazon.com/ec2/instance-types/i7i/>`_ for details.
.. _system-requirements-i7ie-instances:
i7ie instances
^^^^^^^^^^^^^^
The following i7ie instances are supported.
.. list-table::
:widths: 30 20 20 30
:header-rows: 1
* - Model
- vCPU
- Mem (GiB)
- Storage (GB)
* - i7ie.large
- 2
- 16
- 1 x 1,250 GB
* - i7ie.xlarge
- 4
- 32
- 1 x 2,500 GB
* - i7ie.2xlarge
- 8
- 64
- 2 x 2,500 GB
* - i7ie.3xlarge
- 12
- 96
- 1 x 7,500 GB
* - i7ie.6xlarge
- 24
- 192
- 2 x 7,500 GB
* - i7ie.12xlarge
- 48
- 384
- 4 x 7,500 GB
* - i7ie.18xlarge
- 72
- 576
- 6 x 7,500 GB
* - i7ie.24xlarge
- 96
- 768
- 8 x 7,500 GB
* - i7ie.48xlarge
- 192
- 1,536
- 16 x 7,500 GB
All i7i instances have the following specs:
* 3.2 GHz all-core turbo Intel® Xeon® Scalable (Emerald Rapids) processors.
* Up to 100 Gbps of networking bandwidth and 60 Gbps bandwidth to Amazon Elastic Block Store (EBS).
* Advanced Matrix Extensions (AMX) for accelerating CPU-based machine learning.
See `Amazon EC2 I7i Instances <https://aws.amazon.com/ec2/instance-types/i7i/>`_ for details.
Im4gn and Is4gen instances
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -113,8 +208,8 @@ Pick a zone where Haswell CPUs are found. Local SSD performance offers, accordin
Image with NVMe disk interface is recommended.
(`More info <https://cloud.google.com/compute/docs/disks/local-ssd>`_)
Recommended instances types are `z3-highmem-highlssd <https://cloud.google.com/compute/docs/storage-optimized-machines#z3_machine_types>`_,
`n1-highmem <https://cloud.google.com/compute/docs/general-purpose-machines#n1_machines>`_, and `n2-highmem <https://cloud.google.com/compute/docs/general-purpose-machines#n2_machines>`_
Recommended instances types are `z3-highmem-highlssd and z3-highmem-standardlssd <https://cloud.google.com/compute/docs/storage-optimized-machines#z3_machine_types>`_,
`n1-highmem <https://cloud.google.com/compute/docs/general-purpose-machines#n1_machines>`_, and `n2-highmem <https://cloud.google.com/compute/docs/general-purpose-machines#n2_machines>`_.
.. list-table::
@@ -145,6 +240,39 @@ Recommended instances types are `z3-highmem-highlssd <https://cloud.google.com/c
- 44
- 352
- 18,000
* - z3-highmem-88-highlssd
- 88
- 704
- 36,000
.. list-table::
:widths: 30 20 20 30
:header-rows: 1
* - Model
- vCPU
- Mem (GB)
- Storage (GB)
* - z3-highmem-14-standardlssd
- 14
- 112
- 3,000
* - z3-highmem-22-standardlssd
- 22
- 176
- 6,000
* - z3-highmem-44-standardlssd
- 44
- 352
- 9,000
* - z3-highmem-88-standardlssd
- 88
- 704
- 18,000
* - z3-highmem-176-standardlssd
- 176
- 1,406
- 36,000
.. list-table::
:widths: 30 20 20 30

View File

@@ -30,7 +30,7 @@ Launching ScyllaDB on GCP
.. code-block:: console
gcloud compute instances create <name of new instance> --image <ScyllaDB image name> --image-project < ScyllaDB project name> --local-ssd interface=nvme --zone <GCP zone - optional> --machine-type=<machine type>
gcloud compute instances create <name of new instance> --image <ScyllaDB image name> --image-project < ScyllaDB project name> --local-ssd interface=nvme --zone=<GCP zone - optional> --machine-type=<machine type>
For example:

View File

@@ -14,6 +14,7 @@ See :doc:`OS Support by Platform and Version </getting-started/os-support/>`.
Install ScyllaDB with Web Installer
---------------------------------------
To install ScyllaDB with Web Installer, run:
.. code:: console
@@ -27,7 +28,13 @@ You can run the command with the ``-h`` or ``--help`` flag to print information
Installing a Non-default Version
---------------------------------------
You can install a version other than the default.
You can install a version other than the default. To get the list of supported
release versions, run:
.. code:: console
curl -sSf get.scylladb.com/server | sudo bash -s -- --list-active-releases
Versions 2025.1 and Later
==============================

View File

@@ -83,7 +83,7 @@ Additional References
* `Jepsen and ScyllaDB: Putting Consistency to the Test blog post <https://www.scylladb.com/2020/12/23/jepsen-and-scylla-putting-consistency-to-the-test/>`_
* `Nauto: Achieving Consistency in an Eventually Consistent Environment blog post <https://www.scylladb.com/2020/02/20/nauto-achieving-consistency-in-an-eventually-consistent-environment/>`_
* `Consistency Levels documentation <https://docs.scylladb.com/stable/cql/consistency.html>`_
* `Consistency Levels documentation <https://docs.scylladb.com/manual/stable/cql/consistency.html>`_
* `High Availability lesson on ScyllaDB University <https://university.scylladb.com/courses/scylla-essentials-overview/lessons/high-availability/>`_
* `Lightweight Transactions lesson on ScyllaDB University <https://university.scylladb.com/courses/data-modeling/lessons/lightweight-transactions/>`_
* `Getting the Most out of Lightweight Transactions in ScyllaDB blog post <https://www.scylladb.com/2020/07/15/getting-the-most-out-of-lightweight-transactions-in-scylla/>`_

View File

@@ -26,6 +26,7 @@ Syntax
--table <table>
[--nowait]
[--scope <scope>]
[--sstables-file-list <file>]
<sstables>...
Example
@@ -51,6 +52,7 @@ Options
* ``--table`` - Name of the table to load SSTables into
* ``--nowait`` - Don't wait on the restore process
* ``--scope <scope>`` - Use specified load-and-stream scope
* ``--sstables-file-list <file>`` - restore the sstables listed in the given <file>. the list should be new-line seperated.
* ``<sstables>`` - Remainder of keys of the TOC (Table of Contents) components of SSTables to restore, relative to the specified prefix
The `scope` parameter describes the subset of cluster nodes where you want to load data:
@@ -60,6 +62,8 @@ The `scope` parameter describes the subset of cluster nodes where you want to lo
* `dc` - In the datacenter (DC) where the local node lives.
* `all` (default) - Everywhere across the cluster.
`--sstables-file-list <file>` and `<sstable>` can be combined together, `nodetool restore` will attempt to restore the combined list. duplicates are _not_ removed
To fully restore a cluster, you should combine the ``scope`` parameter with the correct list of
SStables to restore to each node.
On one extreme, one node is given all SStables with the scope ``all``; on the other extreme, all

View File

@@ -14,9 +14,9 @@ Nodetool
nodetool-commands/cleanup
nodetool-commands/clearsnapshot
nodetool-commands/cluster/index
nodetool-commands/compact
nodetool-commands/compactionhistory
nodetool-commands/compactionstats
nodetool-commands/compact
nodetool-commands/decommission
nodetool-commands/describecluster
nodetool-commands/describering
@@ -25,13 +25,15 @@ Nodetool
nodetool-commands/disablebinary
nodetool-commands/disablegossip
nodetool-commands/drain
nodetool-commands/enbleautocompaction
nodetool-commands/enableautocompaction
nodetool-commands/enablebackup
nodetool-commands/enablebinary
nodetool-commands/enablegossip
nodetool-commands/flush
nodetool-commands/getcompactionthroughput
nodetool-commands/getendpoints
nodetool-commands/getsstables
nodetool-commands/getstreamthroughput
nodetool-commands/gettraceprobability
nodetool-commands/gossipinfo
nodetool-commands/help
@@ -46,25 +48,23 @@ Nodetool
nodetool-commands/restore
nodetool-commands/ring
nodetool-commands/scrub
nodetool-commands/settraceprobability
nodetool-commands/setcompactionthroughput
nodetool-commands/setlogginglevel
nodetool-commands/setstreamthroughput
nodetool-commands/settraceprobability
nodetool-commands/snapshot
nodetool-commands/sstableinfo
nodetool-commands/status
nodetool-commands/statusbackup
nodetool-commands/statusbinary
nodetool-commands/statusgossip
nodetool-commands/status
Nodetool stop compaction <nodetool-commands/stop>
nodetool-commands/tablestats
nodetool-commands/tasks/index
nodetool-commands/toppartitions
nodetool-commands/upgradesstables
nodetool-commands/viewbuildstatus
nodetool-commands/version
nodetool-commands/getcompactionthroughput
nodetool-commands/setcompactionthroughput
nodetool-commands/getstreamthroughput
nodetool-commands/setstreamthroughput
nodetool-commands/viewbuildstatus
The ``nodetool`` utility provides a simple command-line interface to the following exposed operations and attributes.
@@ -87,9 +87,9 @@ Operations that are not listed below are currently not available.
* :doc:`cleanup </operating-scylla/nodetool-commands/cleanup/>` - Triggers the immediate cleanup of keys no longer belonging to a node.
* :doc:`clearsnapshot </operating-scylla/nodetool-commands/clearsnapshot/>` - This command removes snapshots.
* :doc:`cluster <nodetool-commands/cluster/index>` - Run a cluster operation.
* :doc:`compact </operating-scylla/nodetool-commands/compact/>`- Force a (major) compaction on one or more column families.
* :doc:`compactionhistory </operating-scylla/nodetool-commands/compactionhistory/>` - Provides the history of compactions.
* :doc:`compactionstats </operating-scylla/nodetool-commands/compactionstats/>`- Print statistics on compactions.
* :doc:`compact </operating-scylla/nodetool-commands/compact/>`- Force a (major) compaction on one or more column families.
* :doc:`decommission </operating-scylla/nodetool-commands/decommission/>` - Decommission the node.
* :doc:`describecluster </operating-scylla/nodetool-commands/describecluster/>` - Print the name, snitch, partitioner and schema version of a cluster.
* :doc:`describering </operating-scylla/nodetool-commands/describering/>` - :code:`<keyspace>`- Shows the partition ranges of a given keyspace.
@@ -98,14 +98,16 @@ Operations that are not listed below are currently not available.
* :doc:`disablebinary </operating-scylla/nodetool-commands/disablebinary/>` - Disable native transport (binary protocol).
* :doc:`disablegossip </operating-scylla/nodetool-commands/disablegossip/>` - Disable gossip (effectively marking the node down).
* :doc:`drain </operating-scylla/nodetool-commands/drain/>` - Drain the node (stop accepting writes and flush all column families).
* :doc:`enableautocompaction </operating-scylla/nodetool-commands/enbleautocompaction/>` - Enable automatic compaction of a keyspace or table.
* :doc:`enableautocompaction </operating-scylla/nodetool-commands/enableautocompaction/>` - Enable automatic compaction of a keyspace or table.
* :doc:`enablebackup </operating-scylla/nodetool-commands/enablebackup/>` - Enable incremental backup.
* :doc:`enablebinary </operating-scylla/nodetool-commands/enablebinary/>` - Re-enable native transport (binary protocol).
* :doc:`enablegossip </operating-scylla/nodetool-commands/enablegossip/>` - Re-enable gossip.
* :doc:`flush </operating-scylla/nodetool-commands/flush/>` - Flush one or more column families.
* :doc:`getcompactionthroughput </operating-scylla/nodetool-commands/getcompactionthroughput>` - Print the throughput cap for compaction in the system
* :doc:`getendpoints <nodetool-commands/getendpoints/>` :code:`<keyspace>` :code:`<table>` :code:`<key>`- Print the end points that owns the key.
* **getlogginglevels** - Get the runtime logging levels.
* :doc:`getsstables </operating-scylla/nodetool-commands/getsstables>` - Print the sstable filenames that own the key.
* :doc:`getstreamthroughput </operating-scylla/nodetool-commands/getstreamthroughput>` - Print the throughput cap for SSTables streaming in the system
* :doc:`gettraceprobability </operating-scylla/nodetool-commands/gettraceprobability>` - Displays the current trace probability value. 0 is disabled 1 is enabled.
* :doc:`gossipinfo </operating-scylla/nodetool-commands/gossipinfo/>` - Shows the gossip information for the cluster.
* :doc:`help </operating-scylla/nodetool-commands/help/>` - Display list of available nodetool commands.
@@ -118,28 +120,26 @@ Operations that are not listed below are currently not available.
* :doc:`refresh </operating-scylla/nodetool-commands/refresh/>`- Load newly placed SSTables to the system without restart
* :doc:`removenode </operating-scylla/nodetool-commands/removenode/>`- Remove node with the provided ID
* :doc:`repair <nodetool-commands/repair/>` :code:`<keyspace>` :code:`<table>` - Repair one or more vnode tables.
* :doc:`restore </operating-scylla/nodetool-commands/restore/>` - Load SSTables from a designated bucket in object store into a specified keyspace or table
* :doc:`resetlocalschema </operating-scylla/nodetool-commands/resetlocalschema/>` - Reset the node's local schema.
* :doc:`restore </operating-scylla/nodetool-commands/restore/>` - Load SSTables from a designated bucket in object store into a specified keyspace or table
* :doc:`ring <nodetool-commands/ring/>` - The nodetool ring command display the token ring information.
* :doc:`scrub </operating-scylla/nodetool-commands/scrub>` :code:`[-m mode] [--no-snapshot] <keyspace> [<table>...]` - Scrub the SSTable files in the specified keyspace or table(s)
* :doc:`setcompactionthroughput </operating-scylla/nodetool-commands/setcompactionthroughput>` - Set the throughput cap for compaction in the system
* :doc:`setlogginglevel</operating-scylla/nodetool-commands/setlogginglevel>` - sets the logging level threshold for ScyllaDB classes
* :doc:`setstreamthroughput </operating-scylla/nodetool-commands/setstreamthroughput>` - Set the throughput cap for SSTables streaming in the system
* :doc:`settraceprobability </operating-scylla/nodetool-commands/settraceprobability/>` ``<value>`` - Sets the probability for tracing a request. race probability value
* :doc:`snapshot </operating-scylla/nodetool-commands/snapshot>` :code:`[-t tag] [-cf column_family] <keyspace>` - Take a snapshot of specified keyspaces or a snapshot of the specified table.
* :doc:`sstableinfo </operating-scylla/nodetool-commands/sstableinfo>` - Get information about sstables per keyspace/table.
* :doc:`status </operating-scylla/nodetool-commands/status/>` - Print cluster information.
* :doc:`statusbackup </operating-scylla/nodetool-commands/statusbackup/>` - Status of incremental backup.
* :doc:`statusbinary </operating-scylla/nodetool-commands/statusbinary/>` - Status of native transport (binary protocol).
* :doc:`statusgossip </operating-scylla/nodetool-commands/statusgossip/>` - Status of gossip.
* :doc:`status </operating-scylla/nodetool-commands/status/>` - Print cluster information.
* :doc:`stop </operating-scylla/nodetool-commands/stop/>` - Stop compaction operation.
* **tablehistograms** see :doc:`cfhistograms <nodetool-commands/cfhistograms/>`
* :doc:`tablestats </operating-scylla/nodetool-commands/tablestats/>` - Provides in-depth diagnostics regard table.
* :doc:`tasks </operating-scylla/nodetool-commands/tasks/index>` - Manage tasks manager tasks.
* :doc:`toppartitions </operating-scylla/nodetool-commands/toppartitions/>` - Samples cluster writes and reads and reports the most active partitions in a specified table and time frame.
* :doc:`upgradesstables </operating-scylla/nodetool-commands/upgradesstables>` - Upgrades each table that is not running the latest ScyllaDB version, by rewriting SSTables.
* :doc:`viewbuildstatus </operating-scylla/nodetool-commands/viewbuildstatus/>` - Shows the progress of a materialized view build.
* :doc:`version </operating-scylla/nodetool-commands/version>` - Print the DB version.
* :doc:`getcompactionthroughput </operating-scylla/nodetool-commands/getcompactionthroughput>` - Print the throughput cap for compaction in the system
* :doc:`setcompactionthroughput </operating-scylla/nodetool-commands/setcompactionthroughput>` - Set the throughput cap for compaction in the system
* :doc:`getstreamthroughput </operating-scylla/nodetool-commands/getstreamthroughput>` - Print the throughput cap for SSTables streaming in the system
* :doc:`setstreamthroughput </operating-scylla/nodetool-commands/setstreamthroughput>` - Set the throughput cap for SSTables streaming in the system
* :doc:`viewbuildstatus </operating-scylla/nodetool-commands/viewbuildstatus/>` - Shows the progress of a materialized view build.

View File

@@ -38,14 +38,14 @@ Manual Dictionary Training
You can manually trigger dictionary training using the REST API::
curl -X POST "http://node-address:10000/storage_service/retrain_dict?keyspace=mykeyspace&table=mytable"
curl -X POST "http://node-address:10000/storage_service/retrain_dict?keyspace=mykeyspace&cf=mytable"
Estimating Compression Ratios
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To choose the best compression configuration, you can estimate compression ratios using the REST API::
curl -X GET "http://node-address:10000/storage_service/estimate_compression_ratios?keyspace=mykeyspace&table=mytable"
curl -X GET "http://node-address:10000/storage_service/estimate_compression_ratios?keyspace=mykeyspace&cf=mytable"
This will return a report with estimated compression ratios for various combinations of compression
parameters (algorithm, chunk size, zstd level, dictionary).

View File

@@ -369,6 +369,12 @@ granting ``SELECT`` on a ``KEYSPACE`` automatically grants it on all ``TABLES``
.. Likewise, granting a permission on ``ALL FUNCTIONS`` grants it on every defined function, regardless of which keyspace it is scoped in. It is also possible to grant permissions on all functions scoped to a particular keyspace.
Materialized views and CDC logs cannot be granted separate permissions -
they inherit the permissions from the base table. Specifically,
granting ``SELECT`` on a table allows to read also from all its
materialized views and CDC log. It is not possible to make only
one materialized view of several materialized views readable.
Modifications to permissions are visible to existing client sessions; that is, connections need not be re-established
following permissions changes.

View File

@@ -41,6 +41,11 @@ When creating a role, you grant it permissions and resources. The permission is
* "ALL ROLES"
* Note that An unqualified table name assumes the current keyspace
Materialized views and CDC logs cannot be granted separate permissions -
they inherit their permissions from the base table. It is not possible
to give different permissions to different materialized views of the
same base table.
.. _rbac-usecase-use-case:
Use case

View File

@@ -148,16 +148,25 @@ will leave the recovery mode and remove the obsolete internal Raft data.
cqlsh> TRUNCATE TABLE system.discovery;
cqlsh> DELETE value FROM system.scylla_local WHERE key = 'raft_group0_id';
#. Add the ``recovery_leader`` property to the ``scylla.yaml`` file and set it to the host ID of the recovery leader on
**every live node**. Make sure the change is applied on all nodes by sending the ``SIGHUP`` signal to all ScyllaDB
processes.
#. Perform a :doc:`rolling restart </operating-scylla/procedures/config-change/rolling-restart/>` of all live nodes,
however, this time **the recovery leader must be restarted first**.
but:
* **restart the recovery leader first**,
* before restarting each node, add the ``recovery_leader`` property to its ``scylla.yaml`` file and set it to the
host ID of the recovery leader,
* after restarting each node, make sure it participated in Raft recovery; look for one of the following messages
in its logs:
.. code-block:: console
storage_service - Performing Raft-based recovery procedure with recovery leader <host ID of the recovery leader>/<IP address of the recovery leader>
storage_service - Raft-based recovery procedure - found group 0 with ID <ID of the new group 0; different from the one used in other steps>
After completing this step, Raft should be fully functional.
#. Replace all dead nodes from the cluster using the
#. Replace all dead nodes in the cluster using the
:doc:`node replacement procedure </operating-scylla/procedures/cluster-management/replace-dead-node/>`.
.. note::

View File

@@ -4,7 +4,7 @@ Upgrade ScyllaDB
.. toctree::
ScyllaDB 2025.1 to ScyllaDB 2025.2 <upgrade-guide-from-2025.1-to-2025.2/index>
ScyllaDB 2025.2 to ScyllaDB 2025.3 <upgrade-guide-from-2025.2-to-2025.3/index>
ScyllaDB Image <ami-upgrade>

View File

@@ -1,13 +0,0 @@
==========================================================
Upgrade - ScyllaDB 2025.1 to ScyllaDB 2025.2
==========================================================
.. toctree::
:maxdepth: 2
:hidden:
Upgrade ScyllaDB <upgrade-guide-from-2025.1-to-2025.2>
Metrics Update <metric-update-2025.1-to-2025.2>
* :doc:`Upgrade from ScyllaDB 2025.1.x to ScyllaDB 2025.2.y <upgrade-guide-from-2025.1-to-2025.2>`
* :doc:`Metrics Update Between 2025.1 and 2025.2 <metric-update-2025.1-to-2025.2>`

View File

@@ -1,61 +0,0 @@
.. |SRC_VERSION| replace:: 2025.1
.. |NEW_VERSION| replace:: 2025.2
Metrics Update Between |SRC_VERSION| and |NEW_VERSION|
================================================================
.. toctree::
:maxdepth: 2
:hidden:
ScyllaDB |NEW_VERSION| Dashboards are available as part of the latest |mon_root|.
New Metrics
------------
The following metrics are new in ScyllaDB |NEW_VERSION| compared to |SRC_VERSION|:
.. list-table::
:widths: 25 150
:header-rows: 1
* - Metric
- Description
* - scylla_alternator_batch_item_count_histogram
- A histogram of the number of items in a batch request.
* - scylla_database_total_view_updates_failed_pairing
- Total number of view updates for which we failed base/view pairing.
* - scylla_group_name_cross_rack_collocations
- The number of co-locating migrations that move replica across racks.
* - scylla_network_bytes_received
- The number of bytes received from network sockets.
* - scylla_network_bytes_sent
- The number of bytes written to network sockets.
* - scylla_reactor_awake_time_ms_total
- Total reactor awake time (wall_clock).
* - scylla_reactor_cpu_used_time_ms
- Total reactor thread CPU time (from CLOCK_THREAD_CPUTIME).
* - scylla_reactor_sleep_time_ms_total
- Total reactor sleep time (wall clock).
* - scylla_sstable_compression_dicts_total_live_memory_bytes
- Total amount of memory consumed by SSTable compression dictionaries in RAM.
* - scylla_transport_connections_blocked
- Holds an incrementing counter with the CQL connections that were blocked
before being processed due to threshold configured via
uninitialized_connections_semaphore_cpu_concurrency.Blocks are normal
when we have multiple connections initialized at once. If connectionsare
timing out and this value is high it indicates either connections storm
or unusually slow processing.
* - scylla_transport_connections_shed
- Holds an incrementing counter with the CQL connections that were shed
due to concurrency semaphore timeout (threshold configured via
uninitialized_connections_semaphore_cpu_concurrency). This typically can
happen during connection.

View File

@@ -0,0 +1,13 @@
==========================================================
Upgrade - ScyllaDB 2025.2 to ScyllaDB 2025.3
==========================================================
.. toctree::
:maxdepth: 2
:hidden:
Upgrade ScyllaDB <upgrade-guide-from-2025.2-to-2025.3>
Metrics Update <metric-update-2025.2-to-2025.3>
* :doc:`Upgrade from ScyllaDB 2025.2.x to ScyllaDB 2025.3.y <upgrade-guide-from-2025.2-to-2025.3>`
* :doc:`Metrics Update Between 2025.2 and 2025.3 <metric-update-2025.2-to-2025.3>`

View File

@@ -0,0 +1,95 @@
.. |SRC_VERSION| replace:: 2025.2
.. |NEW_VERSION| replace:: 2025.3
================================================================
Metrics Update Between |SRC_VERSION| and |NEW_VERSION|
================================================================
.. toctree::
:maxdepth: 2
:hidden:
ScyllaDB |NEW_VERSION| Dashboards are available as part of the latest |mon_root|.
New Metrics
------------
The following metrics are new in ScyllaDB |NEW_VERSION| compared to |SRC_VERSION|.
Alternator Per-table Metrics
===================================
.. list-table::
:widths: 25 150
:header-rows: 1
* - Metric
- Description
* - scylla_alternator_table_batch_item_count
- The total number of items processed across all batches.
* - scylla_alternator_table_batch_item_count_histogram
- A histogram of the number of items in a batch request.
* - scylla_alternator_table_filtered_rows_dropped_total
- The number of rows read and dropped during filtering operations.
* - scylla_alternator_table_filtered_rows_matched_total
- The number of rows read and matched during filtering operations.
* - scylla_alternator_table_filtered_rows_read_total
- The number of rows read during filtering operations.
* - scylla_alternator_table_op_latency
- A latency histogram of an operation via Alternator API.
* - scylla_alternator_table_op_latency_summary
- A latency summary of an operation via Alternator API.
* - scylla_alternator_table_operation
- The number of operations via Alternator API.
* - scylla_alternator_table_rcu_total
- The total number of consumed read units.
* - scylla_alternator_table_reads_before_write
- The number of performed read-before-write operations.
* - scylla_alternator_table_requests_blocked_memory
- Counts the number of requests blocked due to memory pressure.
* - scylla_alternator_table_requests_shed
- Counts the number of requests shed due to overload.
* - scylla_alternator_table_shard_bounce_for_lwt
- The number of writes that had to be bounced from this shard because of LWT requirements.
* - scylla_alternator_table_total_operations
- The number of total operations via Alternator API.
* - scylla_alternator_table_unsupported_operations
- The number of unsupported operations via Alternator API.
* - scylla_alternator_table_wcu_total
- The total number of consumed write units.
* - scylla_alternator_table_write_using_lwt
- The number of writes that used LWT.
Other Metrics
===============
.. list-table::
:widths: 25 150
:header-rows: 1
* - Metric
- Description
* - scylla_batchlog_manager_total_write_replay_attempts
- Counts write operations issued in a batchlog replay flow.
A high value of this metric indicates that there is a long batch replay list.
* - scylla_corrupt_data_entries_reported
- Counts the number of corrupt data instances reported to the corrupt data handler.
A non-zero value indicates that the database suffered data corruption.
* - scylla_memory_oversized_allocs
- The total count of oversized memory allocations.
* - scylla_reactor_internal_errors
- The total number of internal errors (subset of cpp_exceptions) that usually
indicate a malfunction in the code
* - scylla_stall_detector_io_threaded_fallbacks
- The total number of io-threaded-fallbacks operations.
Removed Metrics
---------------------
The following metrics have been removed in 2025.3:
* scylla_cql_authorized_prepared_statements_cache_evictions
* scylla_lsa_large_objects_total_space_bytes
* scylla_lsa_small_objects_total_space_bytes
* scylla_lsa_small_objects_used_space_bytes

View File

@@ -1,13 +1,13 @@
.. |SCYLLA_NAME| replace:: ScyllaDB
.. |SRC_VERSION| replace:: 2025.1
.. |NEW_VERSION| replace:: 2025.2
.. |SRC_VERSION| replace:: 2025.2
.. |NEW_VERSION| replace:: 2025.3
.. |ROLLBACK| replace:: rollback
.. _ROLLBACK: ./#rollback-procedure
.. |SCYLLA_METRICS| replace:: ScyllaDB Metrics Update - ScyllaDB 2025.1 to 2025.2
.. _SCYLLA_METRICS: ../metric-update-2025.1-to-2025.2
.. |SCYLLA_METRICS| replace:: ScyllaDB Metrics Update - ScyllaDB 2025.2 to 2025.3
.. _SCYLLA_METRICS: ../metric-update-2025.2-to-2025.3
=======================================================================================
Upgrade from |SCYLLA_NAME| |SRC_VERSION| to |SCYLLA_NAME| |NEW_VERSION|
@@ -44,14 +44,6 @@ We recommend upgrading the Monitoring Stack to the latest version.
See the ScyllaDB Release Notes for the latest updates. The Release Notes are published
at the `ScyllaDB Community Forum <https://forum.scylladb.com/>`_.
.. note::
If you previously upgraded from 2024.x to 2025.1 without enabling consistent
topology updates, ensure you enable the feature before you upgrade to 2025.2.
For instructions, see
`Enable Consistent Topology Updates <https://docs.scylladb.com/manual/branch-2025.1/upgrade/upgrade-guides/upgrade-guide-from-2024.x-to-2025.1/enable-consistent-topology.html>`_
in the upgrade guide for version 2025.1.
Upgrade Procedure
=================
@@ -158,7 +150,7 @@ You should take note of the current version in case you want to |ROLLBACK|_ the
.. code-block:: console
sudo wget -O /etc/apt/sources.list.d/scylla.list https://downloads.scylladb.com/deb/debian/scylla-2025.2.list
sudo wget -O /etc/apt/sources.list.d/scylla.list https://downloads.scylladb.com/deb/debian/scylla-2025.3.list
#. Install the new ScyllaDB version:
@@ -176,7 +168,7 @@ You should take note of the current version in case you want to |ROLLBACK|_ the
.. code-block:: console
sudo curl -o /etc/yum.repos.d/scylla.repo -L https://downloads.scylladb.com/rpm/centos/scylla-2025.2.repo
sudo curl -o /etc/yum.repos.d/scylla.repo -L https://downloads.scylladb.com/rpm/centos/scylla-2025.3.repo
#. Install the new ScyllaDB version:

View File

@@ -16,42 +16,99 @@ ScyllaDB CQL Drivers
ScyllaDB Drivers
-----------------
The following ScyllaDB drivers are available:
* :doc:`Python Driver</using-scylla/drivers/cql-drivers/scylla-python-driver>`
* :doc:`Java Driver </using-scylla/drivers/cql-drivers/scylla-java-driver>`
* :doc:`Go Driver </using-scylla/drivers/cql-drivers/scylla-go-driver>`
* :doc:`Go Extension </using-scylla/drivers/cql-drivers/scylla-gocqlx-driver>`
* :doc:`C++ Driver </using-scylla/drivers/cql-drivers/scylla-cpp-driver>`
* `CPP-over-Rust Driver <https://github.com/scylladb/cpp-rust-driver>`_
* :doc:`Rust Driver </using-scylla/drivers/cql-drivers/scylla-rust-driver>`
We recommend using ScyllaDB drivers. All ScyllaDB drivers are shard-aware and provide additional
benefits over third-party drivers.
ScyllaDB supports the CQL binary protocol version 3, so any Apache Cassandra/CQL driver that implements
the same version works with ScyllaDB.
The following table lists the available ScyllaDB drivers, specifying which support
`ScyllaDB Cloud Serversless <https://cloud.docs.scylladb.com/stable/serverless/index.html>`_
or include a library for :doc:`CDC </features/cdc/cdc-intro>`.
CDC Integration with ScyllaDB Drivers
-------------------------------------------
The following table specifies which ScyllaDB drivers include a library for
:doc:`CDC </features/cdc/cdc-intro>`.
.. list-table::
:widths: 30 35 35
:widths: 40 60
:header-rows: 1
* -
- ScyllaDB Driver
* - ScyllaDB Driver
- CDC Connector
* - :doc:`Python</using-scylla/drivers/cql-drivers/scylla-python-driver>`
- |v|
* - :doc:`Python </using-scylla/drivers/cql-drivers/scylla-python-driver>`
- |x|
* - :doc:`Java </using-scylla/drivers/cql-drivers/scylla-java-driver>`
- |v|
- |v|
* - :doc:`Go </using-scylla/drivers/cql-drivers/scylla-go-driver>`
- |v|
- |v|
* - :doc:`Go Extension </using-scylla/drivers/cql-drivers/scylla-gocqlx-driver>`
- |v|
- |x|
* - :doc:`C++ </using-scylla/drivers/cql-drivers/scylla-cpp-driver>`
- |v|
- |x|
* - `CPP-over-Rust Driver <https://github.com/scylladb/cpp-rust-driver>`_
- |x|
* - :doc:`Rust </using-scylla/drivers/cql-drivers/scylla-rust-driver>`
- |v|
- |v|
Support for Tablets
-------------------------
The following table specifies which ScyllaDB drivers support
:doc:`tablets </architecture/tablets>` and since which version.
.. list-table::
:widths: 30 35 35
:header-rows: 1
* - ScyllaDB Driver
- Support for Tablets
- Since Version
* - :doc:`Python</using-scylla/drivers/cql-drivers/scylla-python-driver>`
- |v|
- 3.26.5
* - :doc:`Java </using-scylla/drivers/cql-drivers/scylla-java-driver>`
- |v|
- 4.18.0 (Java Driver 4.x)
3.11.5.2 (Java Driver 3.x)
* - :doc:`Go </using-scylla/drivers/cql-drivers/scylla-go-driver>`
- |v|
- 1.13.0
* - :doc:`Go Extension </using-scylla/drivers/cql-drivers/scylla-gocqlx-driver>`
- |x|
- N/A
* - :doc:`C++ </using-scylla/drivers/cql-drivers/scylla-cpp-driver>`
- |x|
- N/A
* - `CPP-over-Rust Driver <https://github.com/scylladb/cpp-rust-driver>`_
- |v|
- All versions
* - :doc:`Rust </using-scylla/drivers/cql-drivers/scylla-rust-driver>`
- |v|
- 0.13.0
Driver Support Policy
-------------------------------
We support the **two most recent minor releases** of our drivers.
* We test and validate the latest two minor versions.
* We typically patch only the latest minor release.
We recommend staying up to date with the latest supported versions to receive
updates and fixes.
At a minimum, upgrade your driver when upgrading to a new ScyllaDB version
to ensure compatibility between the driver and the database.
Third-party Drivers
----------------------

View File

@@ -701,5 +701,97 @@ std::unique_ptr<data_sink_impl> make_encrypted_sink(data_sink sink, shared_ptr<s
return std::make_unique<encrypted_data_sink>(std::move(sink), std::move(k));
}
class encrypted_data_source : public data_source_impl, public block_encryption_base {
input_stream<char> _input;
temporary_buffer<char> _next;
size_t _current_position = 0;
size_t _skip = 0;
public:
encrypted_data_source(data_source source, shared_ptr<symmetric_key> k)
: block_encryption_base(std::move(k))
, _input(std::move(source))
{}
future<temporary_buffer<char>> get() override {
// First, get as much as we can get now (or the remainder of previous call)
auto buf1 = _next.empty()
? co_await _input.read()
: std::exchange(_next, {})
;
// eof?
if (buf1.empty()) {
co_return buf1;
}
// now we need one page more to be able to save one for next lap
auto fill_size = align_up(buf1.size(), block_size) + block_size - buf1.size();
auto buf2 = co_await _input.read_exactly(fill_size);
temporary_buffer<char> output(buf1.size() + buf2.size());
// we copy data even for the part we will cache. this to
// fix block alignment of the resulting shared buffer.
std::copy(buf1.begin(), buf1.end(), output.get_write());
std::copy(buf2.begin(), buf2.end(), output.get_write() + buf1.size());
// we need to keep one page buffered (beyond the input stream buffer - would be neat
// to share it), to be able to detect actual eof stream size. We always need
// at least _two_ pages of data to process, to be able to handle the case where
// actual size is <aligned block size> - <less than key block size>.
// I.e. stream size = 8180, encrypted data size will be 8192, and data stream
// will be 8196. So we need to make sure buf1 == [4096-8192], and buf2 == [8192-8196].
if (is_aligned(output.size(), block_size) && output.size() >= 2*block_size) {
_next = output.share(output.size() - block_size, block_size);
output.trim(output.size() - block_size);
}
const size_t key_block_size = _key->block_size();
// decrypt all blocks we have to return. might include the last, partial block
for (size_t offset = 0; offset < output.size(); offset += block_size, _current_position += block_size) {
auto iv = iv_for(_current_position);
auto rem = std::min(block_size, output.size() - offset);
_key->transform_unpadded(mode::decrypt, output.get() + offset, align_down(rem, key_block_size), output.get_write() + offset, iv.data());
}
// now, if the output buffer is not aligned, we are at eof, and
// also need to trim result.
if (!is_aligned(output.size(), key_block_size)) {
output.trim(output.size() - std::min(output.size(), key_block_size));
}
assert(is_aligned(_current_position, block_size));
// finally trim front to handle any skip remainders
output.trim_front(std::min(std::exchange(_skip, 0), output.size()));
co_return output;
}
future<temporary_buffer<char>> skip(uint64_t n) override {
if (n >= block_size) {
// since we only give back data aligned to block_size chunks,
// a client would only ever skip from a block boundary.
auto to_skip = align_down(n, block_size);
assert(is_aligned(_next.size(), block_size));
co_await _input.skip(to_skip - _next.size());
n -= to_skip;
_current_position += to_skip;
_next = {};
}
_skip = n;
co_return temporary_buffer<char>{};
}
future<> close() override {
return _input.close();
}
};
std::unique_ptr<data_source_impl> make_encrypted_source(data_source source, shared_ptr<symmetric_key> k) {
return std::make_unique<encrypted_data_source>(std::move(source), std::move(k));
}
}

View File

@@ -25,4 +25,6 @@ shared_ptr<file_impl> make_delayed_encrypted_file(file, size_t, get_key_func);
std::unique_ptr<data_sink_impl> make_encrypted_sink(data_sink, ::shared_ptr<symmetric_key>);
std::unique_ptr<data_source_impl> make_encrypted_source(data_source source, shared_ptr<symmetric_key> k);
}

View File

@@ -472,6 +472,14 @@ public:
for (auto&& [id, h] : _per_thread_kmip_host_cache[this_shard_id()]) {
co_await h->disconnect();
}
static auto stop_all = [](auto&& cache) -> future<> {
for (auto& [k, host] : cache) {
co_await host->stop();
}
};
co_await stop_all(_per_thread_kms_host_cache[this_shard_id()]);
co_await stop_all(_per_thread_gcp_host_cache[this_shard_id()]);
_per_thread_provider_cache[this_shard_id()].clear();
_per_thread_system_key_cache[this_shard_id()].clear();
_per_thread_kmip_host_cache[this_shard_id()].clear();
@@ -676,6 +684,33 @@ public:
return res;
}
std::tuple<opt_bytes, shared_ptr<encryption_schema_extension>> get_encryption_schema_extension(const sstables::sstable& sst,
sstables::component_type type) const {
const auto& sc = sst.get_shared_components();
if (!sc.scylla_metadata) {
return {};
}
const auto* ext_attr = sc.scylla_metadata->get_extension_attributes();
if (!ext_attr) {
return {};
}
bool ok = ext_attr->map.contains(encryption_attribute_ds);
if (ok && type != sstables::component_type::Data) {
ok = (ser::deserialize_from_buffer(ext_attr->map.at(encrypted_components_attribute_ds).value, std::type_identity<uint32_t>{}, 0) & (1 << static_cast<int>(type))) > 0;
}
if (!ok) {
return {};
}
auto esx = encryption_schema_extension::create(*_ctxt, ext_attr->map.at(encryption_attribute_ds).value);
opt_bytes id;
if (ext_attr->map.contains(key_id_attribute_ds)) {
id = ext_attr->map.at(key_id_attribute_ds).value;
}
return {std::move(id), std::move(esx)};
}
future<file> wrap_file(const sstables::sstable& sst, sstables::component_type type, file f, open_flags flags) override {
switch (type) {
case sstables::component_type::Scylla:
@@ -688,44 +723,21 @@ public:
if (flags == open_flags::ro) {
// open existing. check read opts.
auto& sc = sst.get_shared_components();
if (sc.scylla_metadata) {
auto* exta = sc.scylla_metadata->get_extension_attributes();
if (exta) {
auto i = exta->map.find(encryption_attribute_ds);
// note: earlier builds of encryption extension would only encrypt data component,
// so iff we are opening old sstables we need to check if this component is actually
// encrypted. We use a bitmask attribute for this.
auto [id, esx] = get_encryption_schema_extension(sst, type);
if (esx) {
if (esx->should_delay_read(id)) {
logg.debug("Encrypted sstable component {} using delayed opening {} (id: {})", sst.component_basename(type), *esx, id);
bool ok = i != exta->map.end();
if (ok && type != sstables::component_type::Data) {
ok = exta->map.count(encrypted_components_attribute_ds) &&
(ser::deserialize_from_buffer(exta->map.at(encrypted_components_attribute_ds).value, std::type_identity<uint32_t>{}, 0) & (1 << int(type)));
}
if (ok) {
auto esx = encryption_schema_extension::create(*_ctxt, i->second.value);
opt_bytes id;
if (exta->map.count(key_id_attribute_ds)) {
id = exta->map.at(key_id_attribute_ds).value;
}
if (esx->should_delay_read(id)) {
logg.debug("Encrypted sstable component {} using delayed opening {} (id: {})", sst.component_basename(type), *esx, id);
co_return make_delayed_encrypted_file(f, esx->key_block_size(), [esx, comp = sst.component_basename(type), id = std::move(id)] {
logg.trace("Delayed component {} using {} (id: {}) resolve", comp, *esx, id);
return esx->key_for_read(id);
});
}
logg.debug("Open encrypted sstable component {} using {} (id: {})", sst.component_basename(type), *esx, id);
auto k = co_await esx->key_for_read(std::move(id));
co_return make_encrypted_file(f, std::move(k));
}
co_return make_delayed_encrypted_file(f, esx->key_block_size(), [esx, comp = sst.component_basename(type), id = std::move(id)] {
logg.trace("Delayed component {} using {} (id: {}) resolve", comp, *esx, id);
return esx->key_for_read(id);
});
}
logg.debug("Open encrypted sstable component {} using {} (id: {})", sst.component_basename(type), *esx, id);
auto k = co_await esx->key_for_read(std::move(id));
co_return make_encrypted_file(f, std::move(k));
}
} else {
if (co_await wrap_writeonly(sst, type, [&f](shared_ptr<symmetric_key> k) { f = make_encrypted_file(std::move(f), std::move(k)); })) {
@@ -823,6 +835,36 @@ public:
});
co_return sink;
}
future<data_source> wrap_source(const sstables::sstable& sst,
sstables::component_type type,
sstables::data_source_creator_fn data_source_creator,
uint64_t offset,
uint64_t len) override {
switch (type) {
case sstables::component_type::Scylla:
case sstables::component_type::TemporaryTOC:
case sstables::component_type::TOC:
co_return data_source_creator(offset, len);
case sstables::component_type::CompressionInfo:
case sstables::component_type::CRC:
case sstables::component_type::Data:
case sstables::component_type::Digest:
case sstables::component_type::Filter:
case sstables::component_type::Index:
case sstables::component_type::Statistics:
case sstables::component_type::Summary:
case sstables::component_type::TemporaryStatistics:
case sstables::component_type::Unknown:
auto [id, esx] = get_encryption_schema_extension(sst, type);
if (esx) {
auto key = co_await esx->key_for_read(std::move(id));
auto block_size = key->block_size();
co_return data_source(make_encrypted_source(data_source_creator(align_down(offset, block_size), align_up(len, block_size)), std::move(key)));
}
co_return data_source_creator(offset, len);
}
}
};
std::string encryption_provider(const sstables::sstable& sst) {

View File

@@ -97,6 +97,7 @@ public:
~impl() = default;
future<> init();
future<> stop();
const host_options& options() const {
return _options;
}
@@ -477,13 +478,14 @@ encryption::gcp_host::impl::get_default_credentials() {
}
}
{
auto home = std::getenv("HOME");
if (home) {
std::string well_known_file;
auto env_path = std::getenv("CLOUDSDK_CONFIG");
if (env_path) {
well_known_file = fmt::format("~/{}/{}", env_path, WELL_KNOWN_CREDENTIALS_FILE);
well_known_file = fmt::format("{}/{}/{}", home, env_path, WELL_KNOWN_CREDENTIALS_FILE);
} else {
well_known_file = fmt::format("~/.config/{}/{}", CLOUDSDK_CONFIG_DIRECTORY, WELL_KNOWN_CREDENTIALS_FILE);
well_known_file = fmt::format("{}/.config/{}/{}", home, CLOUDSDK_CONFIG_DIRECTORY, WELL_KNOWN_CREDENTIALS_FILE);
}
if (co_await seastar::file_exists(well_known_file)) {
@@ -671,7 +673,7 @@ encryption::gcp_host::impl::get_access_token(const google_credentials& creds, co
{ "client_id", c.client_id },
{ "client_secret", c.client_secret },
{ "refresh_token", c.refresh_token },
{ "grant_type", "grant_type" },
{ "grant_type", "refresh_token" },
}), "", httpd::operation_type::POST);
co_return access_token{ json };
@@ -827,6 +829,11 @@ future<> encryption::gcp_host::impl::init() {
_initialized = true;
}
future<> encryption::gcp_host::impl::stop() {
co_await _attr_cache.stop();
co_await _id_cache.stop();
}
std::tuple<std::string, std::string> encryption::gcp_host::impl::parse_key(std::string_view spec) {
auto i = spec.find_last_of('/');
if (i == std::string_view::npos) {
@@ -989,6 +996,10 @@ future<> encryption::gcp_host::init() {
return _impl->init();
}
future<> encryption::gcp_host::stop() {
return _impl->stop();
}
const encryption::gcp_host::host_options& encryption::gcp_host::options() const {
return _impl->options();
}

View File

@@ -65,6 +65,8 @@ public:
~gcp_host();
future<> init();
future<> stop();
const host_options& options() const;
struct option_override : public t_credentials_source<std::optional<std::string>> {

View File

@@ -724,9 +724,11 @@ future<> kmip_host::impl::connect() {
}
future<> kmip_host::impl::disconnect() {
return do_for_each(_options.hosts, [this](const sstring& host) {
co_await do_for_each(_options.hosts, [this](const sstring& host) {
return clear_connections(host);
});
co_await _attr_cache.stop();
co_await _id_cache.stop();
}
static unsigned from_str(unsigned (*f)(char*, int, int*), const sstring& s, const sstring& what) {

View File

@@ -154,6 +154,8 @@ public:
~impl() = default;
future<> init();
future<> stop();
const host_options& options() const {
return _options;
}
@@ -826,6 +828,11 @@ future<> encryption::kms_host::impl::init() {
_initialized = true;
}
future<> encryption::kms_host::impl::stop() {
co_await _attr_cache.stop();
co_await _id_cache.stop();
}
future<encryption::kms_host::impl::key_and_id_type> encryption::kms_host::impl::create_key(const attr_cache_key& k) {
auto& master_key = k.master_key;
auto& aws_assume_role_arn = k.aws_assume_role_arn;
@@ -988,6 +995,10 @@ future<> encryption::kms_host::init() {
return _impl->init();
}
future<> encryption::kms_host::stop() {
return _impl->stop();
}
const encryption::kms_host::host_options& encryption::kms_host::options() const {
return _impl->options();
}

View File

@@ -63,6 +63,8 @@ public:
~kms_host();
future<> init();
future<> stop();
const host_options& options() const;
struct option_override {

View File

@@ -8,7 +8,6 @@
#include "generic_server.hh"
#include <exception>
#include <fmt/ranges.h>
#include <seastar/core/when_all.hh>
@@ -229,11 +228,28 @@ void connection::on_connection_ready()
void connection::shutdown()
{
shutdown_input();
shutdown_output();
}
bool connection::shutdown_input() {
try {
_fd.shutdown_input();
} catch (...) {
_server._logger.warn("Error shutting down input side of connection {}->{}, exception: {}", _fd.remote_address(), _fd.local_address(), std::current_exception());
return false;
}
return true;
}
bool connection::shutdown_output() {
try {
_fd.shutdown_output();
} catch (...) {
_server._logger.warn("Error shutting down output side of connection {}->{}, exception: {}", _fd.remote_address(), _fd.local_address(), std::current_exception());
return false;
}
return true;
}
server::server(const sstring& server_name, logging::logger& logger, config cfg)
@@ -256,10 +272,7 @@ server::server(const sstring& server_name, logging::logger& logger, config cfg)
}))
, _prev_conns_cpu_concurrency(_conns_cpu_concurrency)
, _conns_cpu_concurrency_semaphore(_conns_cpu_concurrency, named_semaphore_exception_factory{"connections cpu concurrency semaphore"})
{
}
server::~server()
, _shutdown_timeout(std::chrono::seconds{cfg.shutdown_timeout_in_seconds})
{
}
@@ -272,7 +285,11 @@ future<> server::shutdown() {
if (_gate.is_closed()) {
co_return;
}
_all_connections_stopped = _gate.close();
shared_future<> connections_stopped{_gate.close()};
_all_connections_stopped = connections_stopped.get_future();
// Stop all listeners.
size_t nr = 0;
size_t nr_total = _listeners.size();
_logger.debug("abort accept nr_total={}", nr_total);
@@ -280,14 +297,35 @@ future<> server::shutdown() {
l.abort_accept();
_logger.debug("abort accept {} out of {} done", ++nr, nr_total);
}
co_await std::move(_listeners_stopped);
// Shutdown RX side of the connections, so no new requests could be received.
// Leave the TX side so the responses to ongoing requests could be sent.
_logger.debug("Shutting down RX side of {} connections", _connections_list.size());
co_await for_each_gently([](auto& connection) {
if (!connection.shutdown_input()) {
// If failed to shutdown the input side, then attempt to shutdown the output side which should do a complete shutdown of the connection.
connection.shutdown_output();
}
});
// Wait for the remaining requests to finish.
_logger.debug("Waiting for connections to stop");
try {
co_await connections_stopped.get_future(seastar::lowres_clock::now() + _shutdown_timeout);
} catch (const timed_out_error& _) {
_logger.info("Timed out waiting for connections shutdown.");
}
// Either all requests stopped or a timeout occurred, do the full shutdown of the connections.
size_t nr_conn = 0;
auto nr_conn_total = _connections_list.size();
_logger.debug("shutdown connection nr_total={}", nr_conn_total);
for (auto& c : _connections_list) {
co_await for_each_gently([&nr_conn, nr_conn_total, this](connection& c) {
c.shutdown();
_logger.debug("shutdown connection {} out of {} done", ++nr_conn, nr_conn_total);
}
co_await std::move(_listeners_stopped);
});
_abort_source.request_abort();
}

View File

@@ -39,6 +39,7 @@ class server;
// member function to perform request processing. This base class provides a
// `_read_buf` and a `_write_buf` for reading requests and writing responses.
class connection : public boost::intrusive::list_base_hook<> {
friend class server;
public:
using connection_process_loop = noncopyable_function<future<> ()>;
using execute_under_tenant_type = noncopyable_function<future<> (connection_process_loop)>;
@@ -61,6 +62,8 @@ protected:
private:
future<> process_until_tenant_switch();
bool shutdown_input();
bool shutdown_output();
public:
connection(server& server, connected_socket&& fd, named_semaphore& sem, semaphore_units<named_semaphore_exception_factory> initial_sem_units);
virtual ~connection();
@@ -82,6 +85,7 @@ public:
struct config {
utils::updateable_value<uint32_t> uninitialized_connections_semaphore_cpu_concurrency;
utils::updateable_value<uint32_t> shutdown_timeout_in_seconds;
};
// A generic TCP socket server.
@@ -126,10 +130,11 @@ private:
utils::observer<uint32_t> _conns_cpu_concurrency_observer;
uint32_t _prev_conns_cpu_concurrency;
named_semaphore _conns_cpu_concurrency_semaphore;
std::chrono::seconds _shutdown_timeout;
public:
server(const sstring& server_name, logging::logger& logger, config cfg);
virtual ~server();
virtual ~server() = default;
// Makes sure listening sockets no longer generate new connections and aborts the
// connected sockets, so that new requests are not served and existing requests don't
@@ -140,9 +145,9 @@ public:
future<> shutdown();
future<> stop();
future<> listen(socket_address addr,
std::shared_ptr<seastar::tls::credentials_builder> creds,
bool is_shard_aware, bool keepalive,
future<> listen(socket_address addr,
std::shared_ptr<seastar::tls::credentials_builder> creds,
bool is_shard_aware, bool keepalive,
std::optional<file_permissions> unix_domain_socket_permissions,
std::function<server&()> get_shard_instance = {}
);

View File

@@ -13,8 +13,8 @@
auto fmt::formatter<gms::gossip_digest_syn>::format(const gms::gossip_digest_syn& syn, fmt::format_context& ctx) const
-> decltype(ctx.out()) {
auto out = ctx.out();
out = fmt::format_to(out, "cluster_id:{},partioner:{},group0_id{},",
syn._cluster_id, syn._partioner, syn._group0_id);
out = fmt::format_to(out, "cluster_id:{},partioner:{},group0_id:{},recovery_leader:{}",
syn._cluster_id, syn._partioner, syn._group0_id, syn._recovery_leader);
out = fmt::format_to(out, "digests:{{");
for (auto& d : syn._digests) {
out = fmt::format_to(out, "{} ", d);

View File

@@ -28,15 +28,19 @@ private:
sstring _partioner;
utils::chunked_vector<gossip_digest> _digests;
utils::UUID _group0_id;
utils::UUID _recovery_leader;
public:
gossip_digest_syn() {
}
gossip_digest_syn(sstring id, sstring p, utils::chunked_vector<gossip_digest> digests, utils::UUID group0_id)
gossip_digest_syn(
sstring id, sstring p, utils::chunked_vector<gossip_digest> digests,
utils::UUID group0_id, utils::UUID recovery_leader)
: _cluster_id(std::move(id))
, _partioner(std::move(p))
, _digests(std::move(digests))
, _group0_id(std::move(group0_id)) {
, _group0_id(std::move(group0_id))
, _recovery_leader(std::move(recovery_leader)) {
}
sstring cluster_id() const {
@@ -47,6 +51,10 @@ public:
return _group0_id;
}
utils::UUID recovery_leader() const {
return _recovery_leader;
}
sstring partioner() const {
return _partioner;
}
@@ -59,6 +67,10 @@ public:
return group0_id();
}
utils::UUID get_recovery_leader() const {
return _recovery_leader;
}
sstring get_partioner() const {
return partioner();
}

View File

@@ -34,6 +34,7 @@
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/parallel_for_each.hh>
#include <seastar/coroutine/exception.hh>
#include <seastar/coroutine/switch_to.hh>
#include <chrono>
#include "locator/host_id.hh"
#include <boost/range/algorithm/set_algorithm.hpp>
@@ -76,6 +77,10 @@ const utils::UUID& gossiper::get_group0_id() const noexcept {
return _gcfg.group0_id;
}
const utils::UUID& gossiper::get_recovery_leader() const noexcept {
return _gcfg.recovery_leader;
}
const std::set<inet_address>& gossiper::get_seeds() const noexcept {
return _gcfg.seeds;
}
@@ -172,8 +177,11 @@ void gossiper::do_sort(utils::chunked_vector<gossip_digest>& g_digest_list) cons
// Depends on
// - no external dependency
future<> gossiper::handle_syn_msg(locator::host_id from, gossip_digest_syn syn_msg) {
logger.trace("handle_syn_msg():from={},cluster_name:peer={},local={},group0_id:peer={},local={},partitioner_name:peer={},local={}",
from, syn_msg.cluster_id(), get_cluster_name(), syn_msg.group0_id(), get_group0_id(), syn_msg.partioner(), get_partitioner_name());
logger.trace(
"handle_syn_msg():from={},cluster_name:peer={},local={},group0_id:peer={},local={},"
"recovery_leader:peer={},local={},partitioner_name:peer={},local={}",
from, syn_msg.cluster_id(), get_cluster_name(), syn_msg.group0_id(), get_group0_id(),
syn_msg.recovery_leader(), get_recovery_leader(), syn_msg.partioner(), get_partitioner_name());
if (!is_enabled()) {
co_return;
}
@@ -184,10 +192,20 @@ future<> gossiper::handle_syn_msg(locator::host_id from, gossip_digest_syn syn_m
co_return;
}
// Recovery leader mismatch implies an administrator's mistake during the Raft-based recovery procedure.
// Throw away the message and signal that something is wrong.
bool both_nodes_in_recovery = syn_msg.recovery_leader() && get_recovery_leader();
if (both_nodes_in_recovery && syn_msg.recovery_leader() != get_recovery_leader()) {
logger.warn("Recovery leader mismatch from {} {} != {},",
from, syn_msg.recovery_leader(), get_recovery_leader());
co_return;
}
// If the message is from a node with a different group0 id throw it away.
// A group0 id mismatch is expected during a rolling restart in the Raft-based recovery procedure.
if (_gcfg.recovery_leader().empty()
&& syn_msg.group0_id() && get_group0_id() && syn_msg.group0_id() != get_group0_id()) {
bool no_recovery = !syn_msg.recovery_leader() && !get_recovery_leader();
bool group0_ids_mismatch = syn_msg.group0_id() && get_group0_id() && syn_msg.group0_id() != get_group0_id();
if (no_recovery && group0_ids_mismatch) {
logger.warn("Group0Id mismatch from {} {} != {}", from, syn_msg.group0_id(), get_group0_id());
co_return;
}
@@ -590,10 +608,27 @@ future<> gossiper::send_gossip(gossip_digest_syn message, std::set<T> epset) {
future<> gossiper::do_apply_state_locally(locator::host_id node, endpoint_state remote_state, bool shadow_round) {
co_await utils::get_local_injector().inject("delay_gossiper_apply", [&node, &remote_state](auto& handler) -> future<> {
const auto gossip_delay_node = handler.template get<std::string_view>("delay_node");
if (gossip_delay_node && !remote_state.get_host_id() && inet_address(sstring(gossip_delay_node.value())) == remote_state.get_ip()) {
logger.debug("delay_gossiper_apply: suspend for node {}", node);
co_await handler.wait_for_message(std::chrono::steady_clock::now() + std::chrono::minutes{5});
logger.debug("delay_gossiper_apply: resume for node {}", node);
}
});
// If state does not exist just add it. If it does then add it if the remote generation is greater.
// If there is a generation tie, attempt to break it by heartbeat version.
auto permit = co_await lock_endpoint(node, null_permit_id);
auto es = get_endpoint_state_ptr(node);
// If remote state update does not contain a host id, check whether the endpoint still
// exists in the `_endpoint_state_map` since after a preemption point it could have been deleted.
if (!remote_state.get_host_id() && !es) {
throw std::runtime_error(format("Entry for host id {} does not exist in the endpoint state map.", node));
}
if (es) {
endpoint_state local_state = *es;
auto local_generation = local_state.get_heart_beat_state().get_generation();
@@ -683,11 +718,12 @@ future<> gossiper::apply_state_locally(std::map<inet_address, endpoint_state> ma
}
future<> gossiper::force_remove_endpoint(locator::host_id id, permit_id pid) {
return container().invoke_on(0, [this, pid, id] (auto& gossiper) mutable -> future<> {
co_await coroutine::switch_to(_gcfg.gossip_scheduling_group);
co_await container().invoke_on(0, [pid, id] (auto& gossiper) mutable -> future<> {
auto permit = co_await gossiper.lock_endpoint(id, pid);
pid = permit.id();
try {
if (id == my_host_id()) {
if (id == gossiper.my_host_id()) {
throw std::runtime_error(format("Can not force remove node {} itself", id));
}
if (!gossiper._endpoint_state_map.contains(id)) {
@@ -857,6 +893,10 @@ gossiper::endpoint_lock_entry::endpoint_lock_entry() noexcept
{}
future<gossiper::endpoint_permit> gossiper::lock_endpoint(locator::host_id ep, permit_id pid, seastar::compat::source_location l) {
if (current_scheduling_group() != _gcfg.gossip_scheduling_group) {
logger.warn("Incorrect scheduling group used for gossiper::lock_endpoint: {}, should be {}, backtrace {}", current_scheduling_group().name(), _gcfg.gossip_scheduling_group.name(), current_backtrace());
}
if (this_shard_id() != 0) {
on_internal_error(logger, "lock_endpoint must be called on shard 0");
}
@@ -1102,7 +1142,8 @@ void gossiper::run() {
utils::chunked_vector<gossip_digest> g_digests = make_random_gossip_digest();
if (g_digests.size() > 0) {
gossip_digest_syn message(get_cluster_name(), get_partitioner_name(), g_digests, get_group0_id());
gossip_digest_syn message(
get_cluster_name(), get_partitioner_name(), g_digests, get_group0_id(), get_recovery_leader());
if (_endpoints_to_talk_with.empty() && !_live_endpoints.empty()) {
auto live_endpoints = _live_endpoints | std::ranges::to<std::vector>();
@@ -1319,6 +1360,12 @@ utils::chunked_vector<gossip_digest> gossiper::make_random_gossip_digest() const
}
future<> gossiper::replicate(endpoint_state es, permit_id pid) {
if (!es.get_host_id()) {
// TODO (#25818): re-introduce the on_internal_error() call once all the code paths leading to this are fixed
logger.warn("attempting to add a state with empty host id for ip: {}", es.get_ip());
co_return;
}
verify_permit(es.get_host_id(), pid);
// First pass: replicate the new endpoint_state on all shards.
@@ -2045,6 +2092,7 @@ void gossiper::examine_gossiper(utils::chunked_vector<gossip_digest>& g_digest_l
}
future<> gossiper::start_gossiping(gms::generation_type generation_nbr, application_state_map preload_local_states) {
co_await coroutine::switch_to(_gcfg.gossip_scheduling_group);
auto permit = co_await lock_endpoint(my_host_id(), null_permit_id);
build_seeds_list();
@@ -2052,15 +2100,19 @@ future<> gossiper::start_gossiping(gms::generation_type generation_nbr, applicat
generation_nbr = gms::generation_type(_gcfg.force_gossip_generation());
logger.warn("Use the generation number provided by user: generation = {}", generation_nbr);
}
endpoint_state local_state = my_endpoint_state();
// Create a new local state.
endpoint_state local_state{get_broadcast_address()};
local_state.set_heart_beat_state_and_update_timestamp(heart_beat_state(generation_nbr));
for (auto& entry : preload_local_states) {
local_state.add_application_state(entry.first, entry.second);
}
co_await utils::get_local_injector().inject("gossiper_publish_local_state_pause", utils::wait_for_message(5min));
co_await replicate(local_state, permit.id());
logger.info("Gossip started with local state: {}", local_state);
logger.info("Gossip started with local state: {}", my_endpoint_state());
_enabled = true;
_nr_run = 0;
_scheduled_gossip_task.arm(INTERVAL);
@@ -2100,6 +2152,7 @@ future<> gossiper::advertise_to_nodes(generation_for_nodes advertise_to_nodes) {
}
future<> gossiper::do_shadow_round(std::unordered_set<gms::inet_address> nodes, mandatory is_mandatory) {
co_await coroutine::switch_to(_gcfg.gossip_scheduling_group);
nodes.erase(get_broadcast_address());
gossip_get_endpoint_states_request request{{
gms::application_state::STATUS,
@@ -2140,7 +2193,11 @@ future<> gossiper::do_shadow_round(std::unordered_set<gms::inet_address> nodes,
});
for (auto& response : responses) {
co_await apply_state_locally_in_shadow_round(std::move(response.endpoint_state_map));
try {
co_await apply_state_locally_in_shadow_round(std::move(response.endpoint_state_map));
} catch (const std::exception& exception) {
logger.warn("Error while applying node state {}", exception.what());
}
}
if (!nodes_talked.empty()) {
break;
@@ -2182,6 +2239,7 @@ void gossiper::build_seeds_list() {
}
future<> gossiper::add_saved_endpoint(locator::host_id host_id, gms::loaded_endpoint_state st, permit_id pid) {
co_await coroutine::switch_to(_gcfg.gossip_scheduling_group);
if (host_id == my_host_id()) {
logger.debug("Attempt to add self as saved endpoint");
co_return;
@@ -2254,7 +2312,8 @@ future<> gossiper::add_local_application_state(application_state_map states) {
co_return;
}
try {
co_await container().invoke_on(0, [&] (gossiper& gossiper) mutable -> future<> {
co_await coroutine::switch_to(_gcfg.gossip_scheduling_group);
co_await container().invoke_on(0, [&](gossiper& gossiper) mutable -> future<> {
inet_address ep_addr = gossiper.get_broadcast_address();
auto ep_id = gossiper.my_host_id();
// for symmetry with other apply, use endpoint lock for our own address.

View File

@@ -69,7 +69,7 @@ struct gossip_config {
uint32_t skip_wait_for_gossip_to_settle = -1;
utils::updateable_value<uint32_t> failure_detector_timeout_ms;
utils::updateable_value<int32_t> force_gossip_generation;
utils::updateable_value<sstring> recovery_leader;
utils::updateable_value<utils::UUID> recovery_leader;
};
struct loaded_endpoint_state {
@@ -136,6 +136,8 @@ public:
void set_group0_id(utils::UUID group0_id);
const utils::UUID& get_group0_id() const noexcept;
const utils::UUID& get_recovery_leader() const noexcept;
const sstring& get_partitioner_name() const noexcept {
return _gcfg.partitioner;
}

View File

@@ -57,6 +57,7 @@ class gossip_digest_syn {
sstring get_partioner();
utils::chunked_vector<gms::gossip_digest> get_gossip_digests();
utils::UUID get_group0_id()[[version 5.4]];
utils::UUID get_recovery_leader()[[version 2025.2.2]];
};
class gossip_digest_ack {

14
keys.cc
View File

@@ -38,12 +38,18 @@ partition_key_view::ring_order_tri_compare(const schema& s, partition_key_view k
partition_key partition_key::from_nodetool_style_string(const schema_ptr s, const sstring& key) {
std::vector<sstring> vec;
boost::split(vec, key, boost::is_any_of(":"));
if (s->partition_key_type()->types().size() == 1) {
// For a single column partition key. Don't try to split the key
// See #16596
vec.push_back(key);
} else {
boost::split(vec, key, boost::is_any_of(":"));
if (vec.size() != s->partition_key_type()->types().size()) {
throw std::invalid_argument(fmt::format("partition key '{}' has mismatch number of components: expected {}, got {}", key, s->partition_key_type()->types().size(), vec.size()));
}
}
auto it = std::begin(vec);
if (vec.size() != s->partition_key_type()->types().size()) {
throw std::invalid_argument("partition key '" + key + "' has mismatch number of components");
}
std::vector<bytes> r;
r.reserve(vec.size());
for (auto t : s->partition_key_type()->types()) {

Some files were not shown because too many files have changed in this diff Show More