Compare commits

..

1258 Commits

Author SHA1 Message Date
Jenkins Promoter
c5ad7b86b7 Update pgo profiles - x86_64 2026-02-01 04:10:42 +02:00
Jenkins Promoter
5a861abd48 Update pgo profiles - aarch64 2026-01-01 04:44:25 +02:00
Jenkins Promoter
019eee28f8 Update pgo profiles - x86_64 2026-01-01 04:06:12 +02:00
Gleb Natapov
0c5780590a raft topology: Notify that a node was removed only once
Raft topology goes over all nodes in a 'left' state and triggers 'remove
node' notification in case id/ip mapping is available (meaning the node
left recently), but the problem is that, since the mapping is not removed
immediately, when multiple nodes are removed in succession a notification
for the same node can be sent several times. Fix that by sending
notification only if the node still exists in the peers table. It will
be removed by the first notification and following notification will not
be sent.

Closes scylladb/scylladb#27743

(cherry picked from commit 4a5292e815)

Closes scylladb/scylladb#27911
2025-12-30 11:24:02 +01:00
Gleb Natapov
207e302bbf topology coordinator: set session id for streaming at the correct time
Commit d3efb3ab6f added streaming session for rebuild, but it set
the session and request submission time. The session should be set when
request starts the execution, so this patch moved it to the correct
place.

Closes scylladb/scylladb#27757

(cherry picked from commit 04976875cc)

Closes scylladb/scylladb#27865
2025-12-28 13:34:01 +02:00
Ferenc Szili
2537b9b818 test: fix flakyness caused by TRUNCATE retries
The test test_truncate_during_topology_change tests TRUNCATE TABLE while
bootstrapping a new node. With tablets enabled TRUNCATE is a global
topology operation which needs to serialize with boostrap.

When TRUNCATE TABLE is issued, it first checks if there is an already
queued truncate for the same table. This can happen if a previous
TRUNCATE operation has timed out, and the client retried. The newly
issued truncate will only join the queued one if it is waiting to be
processed, and will fail immediatelly if the TRUNCATE is already being
processed.

In this test, TRUNCATE will be retried after a timeout (1 minute) due to
the default retry policy, and will be retried up to 3 times, while the
bootstrap is delayed by 2 minutes. This means that the test can validate
the result of a truncate which was started after bootstrap was
completed.

Because of the way truncate joins existing truncate operations, we can
also have the following scenario:
- TRUNCATE times out after one minute because the new node is being
  bootstrapped
- the client retries the TRUNCATE command which also times out after 1m
- the third attempt is received during TRUNCATE being processed which
  fails the test

This patch changes the retry policy of the TRUNCATE operation to
FallthroughRetryPolicy which guarantees that TRUNCATE will not be
retried on timeout. It also increases the timeout of the TRUNCATE from 1
to 4 minutes. This way the test will actually validate the performance
of the TRUNCATE operation which was issued during bootstrap, instead of
the subsequent, retried TRUNCATEs which could have been issued after the
bootstrap was complete.

Fixes: #26347

Closes scylladb/scylladb#27245

(cherry picked from commit d883ff2317)

Closes scylladb/scylladb#27505
2025-12-23 17:08:54 +02:00
Yaron Kaikov
5ff70064e5 auto-backport.py: modify instruction for making PR ready for review
Update the comment sent when PR has conflicts with clear instrauctions how to make the PR Ready for review

Fixes: https://scylladb.atlassian.net/browse/RELENG-152

Closes scylladb/scylladb#27547

(cherry picked from commit d3e199984e)

Closes scylladb/scylladb#27563
2025-12-22 15:17:23 +02:00
Anna Stuchlik
ac4d5a0bea doc: remove the links to the Download Center
This commit removes the remaining links to the Download Center on the website.
We no longer use it for installation, and we don't want users to infer that
something like that still exists.

Fixes https://github.com/scylladb/scylladb/issues/27753

Closes scylladb/scylladb#27756

(cherry picked from commit f65db4e8eb)

Closes scylladb/scylladb#27781
2025-12-21 19:23:26 +02:00
Emil Maskovsky
bfcac7547b test/raft: fix race condition in failure_detector_test
The test had a sporadic failure due to a broken promise exception.
The issue was in `test_pinger::ping()` which captured the promise by
move into the subscription lambda, causing the promise to be destroyed
when the lambda was destroyed during coroutine unwinding.

Simplify `test_pinger::ping()` by replacing manual abort_source/promise
logic with `seastar::sleep_abortable()`.
This removes the risk of promise lifetime/race issues and makes the code
simpler and more robust.

Fixes: scylladb/scylladb#27136

Backport to active branches: This fixes a CI test issue, so it is
beneficial to backport the fix. As this is a test-only fix, it is a low
risk change.

Closes scylladb/scylladb#27737

(cherry picked from commit 2a75b1374e)

Closes scylladb/scylladb#27780
2025-12-21 14:15:09 +02:00
Patryk Jędrzejczak
0a7d71663a Merge '[Backport 2025.2] topology_coordinator: handle seastar::abort_requested_exception alongside raft::request_aborted' from Scylladb[bot]
In several exception handlers, only `raft::request_aborted` was being caught and rethrown, while `seastar::abort_requested_exception` was falling through to the generic catch(...) block. This caused the exception to be incorrectly treated as a failure that triggers rollback, instead of being recognized as an abort signal.

For example, during tablet draining, the error log showed: "tablets draining failed with seastar::abort_requested_exception (abort requested). Aborting the topology operation"

This change adds `seastar::abort_requested_exception` handling alongside `raft::request_aborted` in all places where it was missing. When rethrown, these exceptions propagate up to the main `run()` loop where `handle_topology_coordinator_error()` recognizes them as normal abort signals and allows the coordinator to exit gracefully without triggering unnecessary rollback operations.

Fixes: scylladb/scylladb#27255

No backport: The problem was only seen in tests and not reported in customer tickets, so it's enough to fix it in the main branch.

- (cherry picked from commit 37e3dacf33)

Parent PR: #27314

Closes scylladb/scylladb#27661

* https://github.com/scylladb/scylladb:
  topology_coordinator: handle seastar::abort_requested_exception alongside raft::request_aborted
  topology_coordinator: consistently rethrow `raft::request_aborted` for direct/global commands
2025-12-20 19:33:31 +01:00
Patryk Jędrzejczak
eb4523fd03 Merge '[Backport 2025.2] Make direct failure detector verb handler more efficient' from Scylladb[bot]
We saw that in large clusters direct failure detector may cause large task queues to be accumulated. The series address this issue and also moves the code into the correct scheduling group.

Fixes https://github.com/scylladb/scylladb/issues/27142

Backport to all version where 60f1053087 was backported to since it should improve performance in large clusters.

- (cherry picked from commit 82f80478b8)

- (cherry picked from commit 6a6bbbf1a6)

- (cherry picked from commit 86dde50c0d)

Parent PR: #27387

Closes scylladb/scylladb#27480

* https://github.com/scylladb/scylladb:
  direct_failure_detector: run direct failure detector in the gossiper scheduling group
  raft: drop invoke_on from the pinger verb handler
  direct_failure_detector: pass timeout to direct_fd_ping verb
  idl, message: make with_timeout and cancellable verb attributes composable
2025-12-19 16:48:26 +01:00
Emil Maskovsky
b333141924 topology_coordinator: handle seastar::abort_requested_exception alongside raft::request_aborted
In several exception handlers, only raft::request_aborted was being
caught and rethrown, while seastar::abort_requested_exception was
falling through to the generic catch(...) block. This caused the
exception to be incorrectly treated as a failure that triggers
rollback, instead of being recognized as an abort signal.

For example, during tablet draining, the error log showed:
"tablets draining failed with seastar::abort_requested_exception
(abort requested). Aborting the topology operation"

This change adds seastar::abort_requested_exception handling
alongside raft::request_aborted in all places where it was missing.
When rethrown, these exceptions propagate up to the main run() loop
where handle_topology_coordinator_error() recognizes them as normal
abort signals and allows the coordinator to exit gracefully without
triggering unnecessary rollback operations.

Fixes: scylladb/scylladb#27255

(cherry picked from commit 37e3dacf33)
2025-12-19 16:26:01 +01:00
Michael Litvak
9c0528fc9f view_builder: reduce log level for expected aborts during view creation
When draining the view builder, we abort ongoing operations using the
view builder's abort source, which may cause them to fail with
abort_requested_exception or raft::request_aborted exceptions.

Since these failures are expected during shutdown, reduce the log level
in add_new_view from 'error' to 'debug' for these specific exceptions
while keeping 'error' level for unexpected failures.

Closes scylladb/scylladb#26297

(cherry picked from commit 6bc41926e2)

Closes scylladb/scylladb#27538
2025-12-18 16:24:35 +01:00
Emil Maskovsky
789eb79a6a topology_coordinator: consistently rethrow raft::request_aborted for direct/global commands
Ensure all direct and global topology commands rethrow the
`raft::request_aborted` exception when aborted, typically due to
leadership changes. This makes abortion explicit to callers, enabling
proper handling such as retries or workflow termination.

This change completes the work started in PR scylladb/scylladb#23962,
covering all remaining cases where the exception was not rethrown.

Fixes: scylladb/scylladb#23589

(cherry picked from commit 943af1ef1c)
2025-12-17 16:20:41 +01:00
Jenkins Promoter
cc39e23be4 Update pgo profiles - aarch64 2025-12-15 04:40:18 +02:00
Jenkins Promoter
11b899680a Update pgo profiles - x86_64 2025-12-15 04:03:01 +02:00
Jenkins Promoter
f11c584264 Update ScyllaDB version to: 2025.2.6 2025-12-14 14:30:35 +02:00
Benny Halevy
f06269dfca utils: error_injection: wait_for_message: print injection_name and caller source_location on timeout
When waiting for the condition variable times out
we call on_internal_error, but unfortunately, the backtrace
it generates is obfuscated by
`coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume`.

To make the log more useful, print the error injection name
and the caller's source_location in the timeout error message.

Fixes #27531

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#27532

(cherry picked from commit 5f13880a91)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#27581
2025-12-12 14:20:30 +01:00
Anna Stuchlik
3553ae91c9 replace the Driver pages with a link to the new Drivers pages
This commit removes the now redundant driver pages from
the Scylla DB documentation. Instead, the link to the pages
where we moved the diver information is added.
Also, the links are updated across the ScyllaDB manual.

Redirections are added for all the removed pages.

Fixes https://github.com/scylladb/scylladb/issues/26871

Closes scylladb/scylladb#27277

(cherry picked from commit c5580399a8)

Closes scylladb/scylladb#27438
2025-12-12 10:30:34 +01:00
Yaron Kaikov
4c20e64f4f Add JIRA issue validation to backport PR fixes check
Extend the Fixes validation pattern to also accept JIRA issue references
(format: [A-Z]+-\d+) in addition to GitHub issue references. This allows
backport PRs to reference JIRA issues in the format 'Fixes: PROJECT-123'.

Fixes: https://github.com/scylladb/scylladb/issues/27571

Closes scylladb/scylladb#27572

(cherry picked from commit 3dfa5ebd7f)

Closes scylladb/scylladb#27597
2025-12-12 09:36:32 +02:00
Gleb Natapov
9ca1439b78 direct_failure_detector: run direct failure detector in the gossiper scheduling group
When direct failure detector was introduces the idea was that it will
run on the same connection raft group0 verbs are running, but in
60f1053087 raft verbs were moved to run on the gossiper connection
while DIRECT_FD_PING was left where it was. This patch move it to
gossiper connection as well and fix the pinger code to run in gossiper
scheduling group.

(cherry picked from commit 86dde50c0d)
2025-12-09 16:55:00 +02:00
Gleb Natapov
1ee9d1cca5 raft: drop invoke_on from the pinger verb handler
Currently raft direct pinger verb jumps to shard 0 to check if group0 is
alive before replying. The verb runs relatively often, so it is not very
efficient. The patch distributes group0 liveness information (as it
changes) to all shard instead, so that the handler itself does not need
to jump to shard 0.

(cherry picked from commit 6a6bbbf1a6)
2025-12-09 16:55:00 +02:00
Gleb Natapov
d6ec1ab5b9 direct_failure_detector: pass timeout to direct_fd_ping verb
Currently direct_fd_ping runs without timeout, but the verb is not
waited forever, the wait is canceled after a timeout, this timeout
simply is not passed to the rpc. It may create a situation where the
rpc callback can runs on a destination but it is no longer waited on.
Change the code to pass timeout to rpc as well and return earlier from
the rpc handler if the timeout is reached by the time the callback is
called. This is backwards compatible since timeout is passed as
optional.

(cherry picked from commit 82f80478b8)
2025-12-09 14:53:32 +02:00
Benny Halevy
61dd81539b idl, message: make with_timeout and cancellable verb attributes composable
And define `send_message_timeout_cancellable` in rpc_protocol_impl.hh
using the newly introduced rpc_handler entry point
in seastar that accepts both timeout and cancellable params.

Note that the interface to the user still uses abort_source
while internally the funtion allocates a seastar::rpc::cancellable
object.  It is possible to provide an interface that will accept
a rpc::cancellable& from the caller, but the existing messaging api
uses abort_source.  Changing it may be considered in the future.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 0b97806771)
2025-12-09 14:53:32 +02:00
Tomasz Grabiec
104cad1424 Merge '[Backport 2025.2] address_map: Use more efficient and reliable replication method' from Scylladb[bot]
Primary issue with the old method is that each update is a separate
cross-shard call, and all later updates queue behind it. If one of the
shards has high latency for such calls, the queue may accumulate and
system will appear unresponsive for mapping changes on non-zero shards.

This happened in the field when one of the shards was overloaded with
sstables and compaction work, which caused frequent stalls which
delayed polling for ~100ms. A queue of 3k address updates
accumulated, because we update mapping on each change of gossip
states. This made bootstrap impossible because nodes couldn't
learn about the IP mapping for the bootstrapping node and streaming
failed.

To protect against that, use a more efficient method of replication
which requires a single cross-shard call to replicate all prior
updates.

It is also more reliable, if replication fails transiently for some
reason, we don't give up and fail all later updates.

Fixes #26865

- (cherry picked from commit ed8d127457)

- (cherry picked from commit 4a85ea8eb2)

- (cherry picked from commit f83c4ffc68)

Parent PR: #26941

Closes scylladb/scylladb#27187

* github.com:scylladb/scylladb:
  address_map: Use barrier() to wait for replication
  address_map: Use more efficient and reliable replication method
  utils: Introduce helper for replicated data structures
  utils: add "fatal" version of utils::on_internal_error()
2025-12-05 20:37:14 +01:00
Tomasz Grabiec
6c942a87c7 address_map: Use barrier() to wait for replication
More efficient than 100 pings.

There was one ping in test which was done "so this shard notices the
clock advance". It's not necessary, since obsering completed SMP
call implies that local shard sees the clock advancement done within in.

(cherry picked from commit f83c4ffc68)
2025-12-05 13:25:14 +01:00
Tomasz Grabiec
db61dfb837 address_map: Use more efficient and reliable replication method
Primary issue with the old method is that each update is a separate
cross-shard call, and all later updated queue behind it. If one of the
shards has high latency for such calls, the queue may accumulate and
system will appear unresponsive for mapping changes on non-zero shards.

This happened in the field when one of the shards was overloaded with
sstables and compaction work, which caused frequent stalls which
delayed polling for ~100ms. A queue of 3k address updates
accumulated. This made bootstrap impossible, since nodes couldn't
learn about the IP mapping for the bootstrapping node and streaming
failed.

To protect against that, use a more efficient method of replication
which requires a single cross-shard call to replicate all prior
updates.

It is also more reliable, if replication fails transiently for some
reason, we don't give up and fail all later updates.

Fixes #26865
Fixes #26835

(cherry picked from commit 4a85ea8eb2)
2025-12-05 13:25:14 +01:00
Tomasz Grabiec
3cc75afbbe utils: Introduce helper for replicated data structures
Key goals:
  - efficient (batching updates)
  - reliable (no lost updates)

Will be used in data structures maintained on one designed owning
shard and replicated to other shards.

(cherry picked from commit ed8d127457)
2025-12-05 13:25:14 +01:00
Nadav Har'El
9d27db5e98 utils: add "fatal" version of utils::on_internal_error()
utils::on_internal_error() is a wrapper for Seastar's on_internal_error()
which does not require a logger parameter - because it always uses one
logger ("on_internal_error"). Not needing a unique logger is especially
important when using on_internal_error() in a header file, where we
can't define a logger.

Seastar also has a another similar function, on_fatal_internal_error(),
for which we forgot to implement a "utils" version (without a logger
parameter). This patch fixes that oversight.

In the next patch, we need to use on_fatal_internal_error() in a header
file, so the "utils" version will be useful. We will need the fatal
version because we will encounter an unexpected situation during server
destruction, and if we let the regular on_internal_error() just throw
an exception, we'll be left in an undefined state.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 33476c7b06)
2025-12-05 13:25:14 +01:00
Avi Kivity
79aa95bf76 database: fix overflow when computing data distribution over shards
We store the per-shard chunk count in a uint64_t vector
global_offset, and then convert the counts to offsets with
a prefix sum:

```c++
        // [1, 2, 3, 0] --> [0, 1, 3, 6]
        std::exclusive_scan(global_offset.begin(), global_offset.end(), global_offset.begin(), 0, std::plus());
```

However, std::exclusive_scan takes the accumulator type from the
initial value, 0, which is an int, instead of from the range being
iterated, which is of uint64_t.

As a result, the prefix sum is computed as a 32-bit integer value. If
it exceeds 0x8000'0000, it becomes negative. It is then extended to
64 bits and stored. The result is a huge 64-bit number. Later on
we try to find an sstable with this chunk and fail, crashing on
an assertion.

An example of the failure can be seen here: https://godbolt.org/z/6M8aEbo57

The fix is simple: the initial value is passed as uint64_t instead of int.

Fixes https://github.com/scylladb/scylladb/issues/27417

Closes scylladb/scylladb#27418

(cherry picked from commit 9696ee64d0)
2025-12-04 20:19:39 +02:00
Jenkins Promoter
284bd9fc01 Update ScyllaDB version to: 2025.2.5 2025-12-04 15:49:32 +02:00
Pavel Emelyanov
60a70389a3 Update seastar submodule (SIGABRT on assertion)
* seastar 450e36d5d...7c86e59c7 (1):
  > util: make SEASTAR_ASSERT() failure generate SIGABRT

Fixes #27127

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27404
2025-12-04 13:01:07 +03:00
Calle Wilund
7338a42cc1 commitlog::read_log_file: Check for eof position on all data reads
Fixes #24346

When reading, we check for each entry and each chunk, if advancing there
will hit EOF of the segment. However, IFF the last chunk being read has
the last entry _exactly_ matching the chunk size, and the chunk ending
at _exactly_ segment size (preset size, typically 32Mb), we did not check
the position, and instead complained about not being able to read.

This has literally _never_ happened in actual commitlog (that was replayed
at least), but has apparently happened more and more in hints replay.

Fix is simple, just check the file position against size when advancing
said position, i.e. when reading (skipping already does).

v2:

* Added unit test

Closes scylladb/scylladb#27236

(cherry picked from commit 59c87025d1)

Closes scylladb/scylladb#27340
2025-12-03 12:23:06 +03:00
Aleksandra Martyniuk
43963c47e6 replica: database: change type of tables_metadata::_ks_cf_to_uuid
If there is a lot of tables, a node reports oversized allocation
in _ks_cf_to_uuid of type flat_hash_map.

Change the type to std::unordered_map to prevent oversized allocations.

Fixes: https://github.com/scylladb/scylladb/issues/26787.

Closes scylladb/scylladb#27165

(cherry picked from commit 19a7d8e248)

Closes scylladb/scylladb#27195
2025-12-03 12:22:46 +03:00
Ernest Zaslavsky
7671134209 streaming:: add more logging
Start logging all missed streaming options like `scope` and `skip_reshape` flags

Fixes: https://github.com/scylladb/scylladb/issues/27299

Closes scylladb/scylladb#27311

(cherry picked from commit 1d5f60baac)

Closes scylladb/scylladb#27338
2025-12-02 12:17:48 +01:00
Jenkins Promoter
af570f75f1 Update pgo profiles - aarch64 2025-12-01 04:40:02 +02:00
Jenkins Promoter
3e6ef72872 Update pgo profiles - x86_64 2025-11-30 21:06:59 -05:00
Patryk Jędrzejczak
4a28929d40 Merge '[Backport 2025.2] locator/node: include _excluded in missing places' from Scylladb[bot]
We currently ignore the `_excluded` field in `node::clone()` and the verbose
formatter of `locator::node`. The first one is a bug that can have
unpredictable consequences on the system. The second one can be a minor
inconvenience during debugging.

We fix both places in this PR.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-72

This PR is a bugfix that should be backported to all supported branches.

- (cherry picked from commit 4160ae94c1)

- (cherry picked from commit 287c9eea65)

Parent PR: #27265

Closes scylladb/scylladb#27289

* https://github.com/scylladb/scylladb:
  locator/node: include _excluded in verbose formatter
  locator/node: preserve _excluded in clone()
2025-11-27 12:31:06 +01:00
Avi Kivity
af451b7997 Merge '[Backport 2025.2] fix notification about expiring erm held for to long' from Scylladb[bot]
Commit 6e4803a750 broke notification about expired erms held for too long since it resets the tracker without calling its destructor (where notification is triggered). Fix the assign operator to call the destructor like it should.

Fixes https://github.com/scylladb/scylladb/issues/27141

- (cherry picked from commit 9f97c376f1)

- (cherry picked from commit 5dcdaa6f66)

Parent PR: #27140

Closes scylladb/scylladb#27274

* github.com:scylladb/scylladb:
  test: test that expired erm that held for too long triggers notification
  token_metadata: fix notification about expiring erm held for to long
2025-11-27 12:24:56 +02:00
Patryk Jędrzejczak
0b2d303c00 locator/node: include _excluded in verbose formatter
It can be helpful during debugging.

(cherry picked from commit 287c9eea65)
2025-11-26 23:04:12 +00:00
Patryk Jędrzejczak
d8b476e39f locator/node: preserve _excluded in clone()
We currently ignore the `_excluded` field in `clone()`. Losing
information about exclusion can have unpredictable consequences. One
observed effect (that led to finding this issue) is that the
`/storage_service/nodes/excluded` API endpoint sometimes misses excluded
nodes.

(cherry picked from commit 4160ae94c1)
2025-11-26 23:04:12 +00:00
Gleb Natapov
18dcbc1783 test: test that expired erm that held for too long triggers notification
(cherry picked from commit 5dcdaa6f66)
2025-11-26 15:08:07 +00:00
Gleb Natapov
9ff81cabed token_metadata: fix notification about expiring erm held for to long
Commit 6e4803a750 broke notification about expired erms held for too
long since it resets the tracker without calling its destructor (where
notification is triggered). Fix assign operator to call destructor.

(cherry picked from commit 9f97c376f1)
2025-11-26 15:08:07 +00:00
Ernest Zaslavsky
c1d83f47b3 streaming: fix loop break condition in tablet_sstable_streamer::stream
Correct the loop termination logic that previously caused
certain SSTables to be prematurely excluded, resulting in
lost mutations. This change ensures all relevant SSTables
are properly streamed and their mutations preserved.

(cherry picked from commit dedc8bdf71)

Closes scylladb/scylladb#27150
Fixes: #26979

Parent PR: #26980
Unfortunatelly the pytest based test cannot be ported back because of changes made to the testing harness and scylla-tools
2025-11-25 11:58:12 +03:00
Avi Kivity
b619fe2882 tools: toolchain: prepare: replace 'reg' with 'skopeo'
The prepare scripts uses 'reg' to verify we're not going to
overwrite an existing image. The 'reg' command is not
available in Fedora 43. Use 'skopeo' instead. Skopeo
is part of the podman ecosystem so hopefully will live longer.

Fixes #27178.

Closes scylladb/scylladb#27179

(cherry picked from commit d6ef5967ef)

Closes scylladb/scylladb#27196
2025-11-24 19:57:38 +02:00
Raphael S. Carvalho
6ecee4deba replica: Fail timed-out single-key read on cleaned up tablet replica
Consider the following:
1) single-key read starts, blocks on replica e.g. waiting for memory.
2) the same replica is migrated away
3) single-key read expires, coordinator abandons it, releases erm.
4) migration advances to cleanup stage, barrier doesn't wait on
   timed-out read
5) compaction group of the replica is deallocated on cleanup
6) that single-key resumes, but doesn't find sstable set (post cleanup)
7) with abort-on-internal-error turned on, node crashes

It's fine for abandoned (= timed out) reads to fail, since the
coordinator is gone.
For active reads (non timed out), the barrier will wait for them
since their coordinator holds erm.
This solution consists of failing reads which underlying tablet
replica has been cleaned up, by just converting internal error
to plain exception.

Fixes #26229.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#27078

(cherry picked from commit 74ecedfb5c)

Closes scylladb/scylladb#27152
2025-11-21 17:46:21 +03:00
Patryk Jędrzejczak
5e237c0d0e test: test_raft_recovery_stuck: ensure mutual visibility before using driver
Not waiting for nodes to see each other as alive can cause the driver to
fail the request sent in `wait_for_upgrade_state()`.

scylladb/scylladb#19771 has already replaced concurrent restarts with
`ManagerClient.rolling_restart()`, but it has missed this single place,
probably because we do concurrent starts here.

Fixes #27055

Closes scylladb/scylladb#27075

(cherry picked from commit e35ba974ce)

Closes scylladb/scylladb#27108
2025-11-20 10:45:08 +02:00
Botond Dénes
ad3d182e90 Merge '[Backport 2025.2] Automatic cleanup improvements' from Scylladb[bot]
This series allows an operator to reset 'cleanup needed' flag if he already cleaned up the node, so that automatic cleanup will not do it again. We also change 'nodetool cleanup' back to run cleanup on one node only (and reset 'cleanup needed' flag in the end), but the new '--global' option allows to run cleanup on all nodes that needed it simultaneously.

Fixes https://github.com/scylladb/scylladb/issues/26866

Backport to all supported version since automatic cleanup behaviour  as it is now may create unexpected by the operator load during cluster resizing.

- (cherry picked from commit e872f9cb4e)

- (cherry picked from commit 0f0ab11311)

Parent PR: #26868

Closes scylladb/scylladb#27091

* github.com:scylladb/scylladb:
  cleanup: introduce "nodetool cluster cleanup" command  to run cleanup on all dirty nodes in the cluster
  cleanup: Add RESTful API to allow reset cleanup needed flag
2025-11-20 10:44:35 +02:00
Botond Dénes
a7856c3d52 Merge '[Backport 2025.2] service/qos: Fall back to default scheduling group when using maintenance socket' from Scylladb[bot]
The service level controller relies on `auth::service` to collect
information about roles and the relation between them and the service
levels (those attached to them). Unfortunately, the service level
controller is initialized way earlier than `auth::service` and so we
had to prevent potential invalid queries of user service levels
(cf. 46193f5e79).

Unfortunately, that came at a price: it made the maintenance socket
incompatible with the current implementation of the service level
controller. The maintenance socket starts early, before the
`auth::service` is fully initialized and registered, and is exposed
almost immediately. If the user attempts to connect to Scylla within
this time window, via the maintenance socket, one of the things that
will happen is choosing the right service level for the connection.
Since the `auth::service` is not registered, Scylla with fail an
assertion and crash.

A similar scenario occurs when using maintenance mode. The maintenance
socket is how the user communicates with the database, and we're not
prepared for that either.

To avoid unnecessary crashes, we add new branches if the passed user is
absent or if it corresponds to the anonymous role. Since the role
corresponding to a connection via the maintenance socket is the anonymous
role, that solves the problem.

Some accesses to `auth::service` are not affected and we do not modify
those.

Fixes scylladb/scylladb#26816

Backport: yes. This is a fix of a regression.

- (cherry picked from commit c0f7622d12)

- (cherry picked from commit 222eab45f8)

- (cherry picked from commit 394207fd69)

- (cherry picked from commit b357c8278f)

Parent PR: #26856

Closes scylladb/scylladb#27034

* github.com:scylladb/scylladb:
  test/cluster/test_maintenance_mode.py: Wait for initialization
  test: Disable maintenance mode correctly in test_maintenance_mode.py
  test: Fix keyspace in test_maintenance_mode.py
  service/qos: Do not crash Scylla if auth_integration absent
2025-11-20 10:43:18 +02:00
Gleb Natapov
86d6e759bc cleanup: introduce "nodetool cluster cleanup" command to run cleanup on all dirty nodes in the cluster
97ab3f6622 changed "nodetool cleanup" (without arguments) to run
cleanup on all dirty nodes in the cluster. This was somewhat unexpected,
so this patch changes it back to run cleanup on the target node only (and
reset "cleanup needed" flag afterwards) and it adds "nodetool cluster
cleanup" command that runs the cleanup on all dirty nodes in the
cluster.

(cherry picked from commit 0f0ab11311)
2025-11-19 10:35:39 +02:00
Gleb Natapov
80d92d68e0 cleanup: Add RESTful API to allow reset cleanup needed flag
Cleaning up a node using per keyspace/table interface does not reset cleanup
needed flag in the topology. The assumption was that running cleanup on
already clean node does nothing and completes quickly. But due to
https://github.com/scylladb/scylladb/issues/12215 (which is closed as
WONTFIX) this is not the case. This patch provides the ability to reset
the flag in the topology if operator cleaned up the node manually
already.

(cherry picked from commit e872f9cb4e)
2025-11-19 10:14:38 +02:00
Avi Kivity
6206b57008 Merge '[Backport 2025.2] Synchronize tablet split and load-and-stream' from Scylladb[bot]
Load-and-stream is broken when running concurrently to the finalization step of tablet split.

Consider this:
1) split starts
2) split finalization executes barrier and succeed
3) load-and-stream runs now, starts writing sstable (pre-split)
4) split finalization publishes changes to tablet metadata
5) load-and-stream finishes writing sstable
6) sstable cannot be loaded since it spans two tablets

two possible fixes (maybe both):

1) load-and-stream awaits for topology to quiesce
2) perform split compaction on sstable that spans both sibling tablets

This patch implements # 1. By awaiting for topology to quiesce,
we guarantee that load-and-stream only starts when there's no
chance coordinator is handling some topology operation like
split finalization.

Fixes https://github.com/scylladb/scylladb/issues/26455.

- (cherry picked from commit 3abc66da5a)

- (cherry picked from commit 4654cdc6fd)

Parent PR: #26456

Closes scylladb/scylladb#26647

* github.com:scylladb/scylladb:
  sstables_loader: Don't bypass synchronization with busy topology
  test: Add reproducer for l-a-s and split synchronization issue
  sstables_loader: Synchronize tablet split and load-and-stream
2025-11-17 17:15:37 +02:00
Łukasz Paszkowski
cb5bfeeb16 tools/scylla-nodetool: fix crash when rows_merged cells contain null
Any empty object of the json::json_list type has its internal
_set variable assigned to false which results in such objects
being skipped by the json::json_builder.

Hence, the json returned by the api GET//compaction_manager/compaction_history
does not contain the field `rows_merged` if a cell in the
system.compaction_history table is null or an empty list.

In such cases, executing the command `nodetool compactionhistory`
will result in a crash with the following error message:
`error running operation: rjson::error (JSON assert failed on condition 'false'`

The patch fixes it by checking if the json object contains the
`rows_merged` element before processing. If the element does
not exist, the nodetool will now produce an empty list.

Fixes https://github.com/scylladb/scylladb/issues/23540

Closes scylladb/scylladb#23514

(cherry picked from commit 113647550f)

Closes scylladb/scylladb#26947
2025-11-17 14:27:43 +02:00
Yaron Kaikov
bd974a4995 install-dependencies.sh: update node_exporter to 1.10.2
Update node exporter to solve CVE-2025-22871

[regenerate frozen toolchain with optimized clang from
	https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-aarch64.tar.gz
	https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-x86_64.tar.gz
]
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-5

Closes scylladb/scylladb#26916

(cherry picked from commit c601371b57)

Closes scylladb/scylladb#26951
2025-11-16 16:12:37 +02:00
Benny Halevy
cec15d4175 scylla-sstable: correctly dump sharding_metadata
This patch fixes 2 issues at one go:

First, Currently sstables::load clears the sharding metadata
(via open_data()), and so scylla-sstable always prints
an empty array for it.

Second, printing token values would generate invalid json
as they are currently printed as binary bytes, and they
should be printed simply as numbers, as we do elsewhere,
for example, for the first and last keys.

Fixes #26982

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#26991

(cherry picked from commit f9ce98384a)

Closes scylladb/scylladb#27033
2025-11-16 16:03:49 +02:00
Dawid Mędrek
a817da2cac test/cluster/test_maintenance_mode.py: Wait for initialization
If we try to perform queries too early, before the call to
`storage_service::start_maintenance_mode` has finished, we will
fail with the following error:

```
ERROR 2025-11-12 20:32:27,064 [shard 0:sl:d] token_metadata - sorted_tokens is empty in first_token_index!
```

To avoid that, we should wait until initialization is complete.

(cherry picked from commit b357c8278f)
2025-11-15 22:09:51 +00:00
Dawid Mędrek
fb6625000b test: Disable maintenance mode correctly in test_maintenance_mode.py
Although setting the value of `maintenance_mode` to the string `"false"`
disables maintenance mode, the testing framework misinterprets the value
and thinks that it's actually enabled. As a result, it might try to
connect to Scylla via the maintenance socket, which we don't want.

(cherry picked from commit 394207fd69)
2025-11-15 22:09:51 +00:00
Dawid Mędrek
aa19daaafe test: Fix keyspace in test_maintenance_mode.py
The keyspace used in the test is not necessarily called `ks`.

(cherry picked from commit 222eab45f8)
2025-11-15 22:09:51 +00:00
Dawid Mędrek
19ffad8fbd service/qos: Do not crash Scylla if auth_integration absent
If the user connects to Scylla via the maintenance socket, it may happen
that `auth_integration` has not been registered in the service level
controller yet. One example is maintenance mode when that will never
happen; another when the connection occurs before Scylla is fully
initialized.

To avoid unnecessary crashes, we add new branches if the passed user is
absent or if it corresponds to the anonymous role. Since the role
corresponding to a connection via the maintenance socket is the anonymous
role, that solves the problem.

In those cases, we completely circumvent any calls to `auth_integration`
and handle them separately. The modified methods are:

* `get_user_scheduling_group`,
* `with_user_service_level`,
* `describe_service_levels`.

For the first two, the new behavior is in line with the previous
implementation of those functions. The last behaves differently now,
but since it's a soft error, crashing the node is not necessary anyway.
We throw an exception instead, whose error message should give the user
a hint of what might be wrong.

The other uses of `auth_integration` within the service level controller
are not problematic:

* `find_effective_service_level`,
* `find_cached_effective_service_level`.

They take the name of a role as their argument. Since the anonymous role
doesn't have a name, it's not possible to call them with it.

Fixes scylladb/scylladb#26816

(cherry picked from commit c0f7622d12)
2025-11-15 22:09:51 +00:00
Jenkins Promoter
2264e2773d Update pgo profiles - aarch64 2025-11-15 04:56:18 +02:00
Jenkins Promoter
7da9a0470c Update pgo profiles - x86_64 2025-11-14 21:08:28 -05:00
Raphael S. Carvalho
316e702d45 sstables_loader: Don't bypass synchronization with busy topology
The patch c543059f86 fixed the synchronization issue between tablet
split and load-and-stream. The synchronization worked only with
raft topology, and therefore was disabled with gossip.
To do the check, storage_service::raft_topology_change_enabled()
but the topology kind is only available/set on shard 0, so it caused
the synchronization to be bypassed when load-and-stream runs on
any shard other than 0.

The reason the reproducer didn't catch it is that it was restricted
to single cpu. It will now run with multi cpu and catch the
problem observed.

Fixes #22707

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#26730

(cherry picked from commit 7f34366b9d)
(cherry picked from commit 4c466ace4f)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-11-14 10:51:32 -03:00
Botond Dénes
64ffd56f69 Merge '[Backport 2025.2] [schema] Speculative retry rounding fix' from Scylladb[bot]
This patch series re-enables support for speculative retry values `0` and `100`. These values have been supported some time ago, before [schema: fix issue 21825: add validation for PERCENTILE values in speculative_retry configuration. #21879
](https://github.com/scylladb/scylladb/pull/21879). When that PR prevented using invalid `101PERCENTILE` values, valid `100PERCENTILE` and `0PERCENTILE` value were prevented too.

Reproduction steps from [[Bug]: drop schema and all tables after apply speculative_retry = '99.99PERCENTILE' #26369](https://github.com/scylladb/scylladb/issues/26369) are unable to reproduce the issue after the fix. A test is added to make sure the inclusive border values `0` and `100` are supported.

Documentation is updated to give more information to the users. It now states that these border values are inclusive, and also that the precision, with automatic rounding, is 1 decimal digit.

Fixes #26369

This is a bug fix. If at any time a client tries to use value >= 99.5 and < 100, the raft error will happen. Backport is needed. The code which introduced inconsistency is introduced in 2025.2, so no backporting to 2025.1.

- (cherry picked from commit da2ac90bb6)

- (cherry picked from commit 5d1913a502)

- (cherry picked from commit aba4c006ba)

- (cherry picked from commit 85f059c148)

- (cherry picked from commit 7ec9e23ee3)

Parent PR: #26909

Closes scylladb/scylladb#27013

* github.com:scylladb/scylladb:
  test: cqlpy: add test case for non-numeric PERCENTILE value
  schema: speculative_retry: update exception type for sstring ops
  docs: cql: ddl.rst: update speculative-retry-options
  test: cqlpy: add test for valid speculative_retry values
  schema: speculative_retry: allow 0 and 100 PERCENTILE values
2025-11-14 10:38:18 +02:00
Ernest Zaslavsky
8ae4c66750 minio: update CLI usage, remove deprecated mc options
Replace phased-out `mc` command options with supported alternatives.
Ensures compatibility with the latest MinIO version.

Closes scylladb/scylladb#24363

(cherry picked from commit 1446f57635)

Closes scylladb/scylladb#27005
2025-11-14 10:35:57 +02:00
Dario Mirovic
30d615e76c test: cqlpy: add test case for non-numeric PERCENTILE value
Add test case for non-numeric PERCENTILE value, which raises an error
different to the out-of-range invalid values. Regex in the test
test_invalid_percentile_speculative_retry_values is expanded.

Refs #26369

(cherry picked from commit 7ec9e23ee3)
2025-11-13 19:44:10 +00:00
Dario Mirovic
d9ce2f5e43 schema: speculative_retry: update exception type for sstring ops
Change speculative_retry::to_sstring and speculative_retry::from_sstring
to throw exceptions::configuration_exception instead of std::invalid_argument.
These errors can be triggered by CQL, so appropriate CQL exception should be
used.
Reference: https://github.com/scylladb/scylladb/issues/24748#issuecomment-3025213304

Refs #26369

(cherry picked from commit 85f059c148)
2025-11-13 19:44:10 +00:00
Dario Mirovic
fb9f2112d6 docs: cql: ddl.rst: update speculative-retry-options
Clarify how the value of `XPERCENTILE` is handled:
- Values 0 and 100 are supported
- The percentile value is rounded to the nearest 0.1 (1 decimal place)

Refs #26369

(cherry picked from commit aba4c006ba)
2025-11-13 19:44:10 +00:00
Dario Mirovic
5e2ba892ee test: cqlpy: add test for valid speculative_retry values
test_valid_percentile_speculative_retry_values is introduced to test that
valid values for speculative_retry are properly accepted.

Some of the values are moved from the
test_invalid_percentile_speculative_retry_values test, because
the previous commit added support for them.

Refs #26369

(cherry picked from commit 5d1913a502)
2025-11-13 19:44:10 +00:00
Dario Mirovic
d5b45dc011 schema: speculative_retry: allow 0 and 100 PERCENTILE values
This patch allows specifying 0 and 100 PERCENTILE values in speculative_retry.
It was possible to specify these values before #21825. #21825 prevented specifying
invalid values, like -1 and 101, but also prevented using 0 and 100.

On top of that, speculative_retry::to_sstring function did rounding when
formatting the string, which introduced inconsistency.

Fixes #26369

(cherry picked from commit da2ac90bb6)
2025-11-13 19:44:10 +00:00
Raphael S. Carvalho
b280b1da66 test: Add reproducer for l-a-s and split synchronization issue
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 4654cdc6fd)
2025-11-12 21:18:32 -03:00
Raphael S. Carvalho
b2bb919528 sstables_loader: Synchronize tablet split and load-and-stream
Load-and-stream is broken when running concurrently to the
finalization step of tablet split.

Consider this:
1) split starts
2) split finalization executes barrier and succeed
3) load-and-stream runs now, starts writing sstable (pre-split)
4) split finalization publishes changes to tablet metadata
5) load-and-stream finishes writing sstable
6) sstable cannot be loaded since it spans two tablets

two possible fixes (maybe both):

1) load-and-stream awaits for topology to quiesce
2) perform split compaction on sstable that spans both sibling tablets

This patch implements #1. By awaiting for topology to quiesce,
we guarantee that load-and-stream only starts when there's no
chance coordinator is handling some topology operation like
split finalization.

Fixes #26455.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 3abc66da5a)
2025-11-12 20:52:57 -03:00
Yaron Kaikov
80ba79f00d auto-backport: Add support for JIRA issue references
- Added support for JIRA issue references in PR body and commit messages
- Supports both short format (PKG-92) and full URL format
- Maintains existing GitHub issue reference support
- JIRA pattern matches https://scylladb.atlassian.net/browse/{PROJECT-ID}
- Allows backporting for PRs that reference JIRA issues with 'fixes' keyword

Fixes: https://github.com/scylladb/scylladb/issues/26955

Closes scylladb/scylladb#26954

(cherry picked from commit 3ade3d8f5b)

Closes scylladb/scylladb#26964
2025-11-12 22:37:52 +02:00
Botond Dénes
1d8d6839e7 service/storage_proxy: send batches with CL=EACH_QUORUM
Batches that fail on the initial send are retired later, until they
succeed. These retires happen with CL=ALL, regardless of what the
original CL of the batch was. This is unnecessarily strict. We tried to
follow Cassandra here, but Cassandra has a big caveat in their use of
CL=ALL for batches. They accept saving just a hint for any/all of the
endpoints, so a batch which was just logged in hints is good enough for
them.
We do not plan on replicating this usage of hints at this time, so as a
middle ground, the CL is changed to EACH_QUORUM.

Fixes: scylladb/scylladb#25432

Closes scylladb/scylladb#26304

(cherry picked from commit d9c3772e20)

Closes scylladb/scylladb#26928
2025-11-11 10:35:10 +03:00
Ran Regev
5bce731c76 nodetool refresh primary-replica-only
Fixes: #26440

1. Added description to primary-replica-only option
2. Fixed code text to better reflect the constrained cheked in the code
   itself. namely: that both primary replica only and scope must be
applied only if load and steam is applied too, and that they are mutual
exclusive to each other.
Note: when https://github.com/scylladb/scylladb/issues/26584 is
implemented (with #26609) there will be a need to align the docs as
well - namely, primary-replica-only and scope will no longer be
mutual exclusive

Signed-off-by: Ran Regev <ran.regev@scylladb.com>

Closes scylladb/scylladb#26480

(cherry picked from commit aaf53e9c42)

Closes scylladb/scylladb#26904
2025-11-11 10:34:50 +03:00
Piotr Dulikowski
ff4ca88768 Merge '[Backport 2025.2] transport: call update_scheduling_group for non-auth connections' from Andrzej Jackowski
This is backport of fix for https://github.com/scylladb/scylladb/issues/26040 and related test (https://github.com/scylladb/scylladb/pull/26589) to 2025.2.

Before this change, unauthorized connections stayed in main
scheduling group. It is not ideal, in such case, rather sl:default
should be used, to have a consistent behavior with a scenario
where users is authenticated but there is no service level assigned
to the user.

This commit adds a call to update_scheduling_group at the end of
connection creation for an unauthenticated user, to make sure the
service level is switched to sl:default.

Fixes: https://github.com/scylladb/scylladb/issues/26040
Fixes: https://github.com/scylladb/scylladb/issues/26581

(cherry picked from commit 278019c328)
(cherry picked from commit 8642629e8e)

No backport, as it's already a backport (but similar PRs will be created for 2025.3, and 2025.4)

Closes scylladb/scylladb#26813

* github.com:scylladb/scylladb:
  test: add test_anonymous_user to test_raft_service_levels
  transport: call update_scheduling_group for non-auth connections
2025-11-09 03:06:22 +01:00
Jenkins Promoter
4d3e896eae Update pgo profiles - aarch64 2025-11-01 04:55:19 +02:00
Jenkins Promoter
abdfe5c805 Update pgo profiles - x86_64 2025-11-01 04:28:41 +02:00
Andrzej Jackowski
43dbaeebbc test: add test_anonymous_user to test_raft_service_levels
The primary goal of this test is to reproduce scylladb/scylladb#26040
so the fix (278019c328) can be backported
to older branches.

Scenario: connect via CQL as an anonymous user and verify that the
`sl:default` scheduling group is used. Before the fix for #26040
`main` scheduling group was incorrectly used instead of `sl:default`.

Control connections may legitimately use `sl:driver`, so the test
accepts those occurrences while still asserting that regular anonymous
queries use `sl:default`.

This adds explicit coverage on master. After scylladb#24411 was
implemented, some other tests started to fail when scylladb#26040
was unfixed. However, none of the tests asserted this exact behavior.

Refs: scylladb/scylladb#26040
Refs: scylladb/scylladb#26581

Closes scylladb/scylladb#26589

(cherry picked from commit 8642629)
2025-10-30 18:37:12 +01:00
Andrzej Jackowski
badf209a5e transport: call update_scheduling_group for non-auth connections
Before this change, unauthorized connections stayed in `main`
scheduling group. It is not ideal, in such case, rather `sl:default`
should be used, to have a consistent behavior with a scenario
where users is authenticated but there is no service level assigned
to the user.

This commit adds a call to `update_scheduling_group` at the end of
connection creation for an unauthenticated user, to make sure the
service level is switched to `sl:default`.

Fixes: scylladb/scylladb#26040
(cherry picked from commit 278019c)
2025-10-30 18:36:03 +01:00
Pavel Emelyanov
05c0f8ed03 lister: Fix race between readdir and stat
Sometimes file::list_directory() returns entries without type set. In
thase case lister calls file_type() on the entry name to get it. In case
the call returns disengated type, the code assumes that some error
occurred and resolves into exception.

That's not correct. The file_type() method returns disengated type only
if the file being inspected is missing (i.e. on ENOENT errno). But this
can validly happen if a file is removed bettween readdir and stat. In
that case it's not "some error happened", but a enry should be just
skipped. In "some error happened", then file_type() would resolve into
exceptional future on its own.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26595

(cherry picked from commit d9bfbeda9a)

Closes scylladb/scylladb#26759
2025-10-29 11:38:35 +02:00
Anna Stuchlik
3d56ce604e doc: add --list-active-releases to Web Installer
Fixes https://github.com/scylladb/scylladb/issues/26688

V2 of https://github.com/scylladb/scylladb/pull/26687

Closes scylladb/scylladb#26689

(cherry picked from commit bd5b966208)

Closes scylladb/scylladb#26756
2025-10-29 11:38:06 +02:00
Patryk Jędrzejczak
397a214cbd test: test_raft_recovery_stuck: reconnect driver after rolling restarts
It turns out that #21477 wasn't sufficient to fix the issue. The driver
may still decide to reconnect the connection after `rolling_restart`
returns. One possible explanation is that the driver sometimes handles
the DOWN notification after all nodes consider each other UP.

Reconnecting the driver after restarting nodes seems to be a reliable
workaround that many tests use. We also use it here.

Fixes #19959

Closes scylladb/scylladb#26638

(cherry picked from commit 5321720853)

Closes scylladb/scylladb#26755
2025-10-29 11:37:15 +02:00
Anna Stuchlik
b5a5828fff doc: add support for Debian 12
Fixes https://github.com/scylladb/scylladb/issues/26640

Closes scylladb/scylladb#26668

(cherry picked from commit 9c0ff7c46b)

Closes scylladb/scylladb#26677
2025-10-29 11:36:03 +02:00
Patryk Jędrzejczak
ffd70630e5 Merge '[Backport 2025.2] raft topology: fix group0 tombstone GC in the Raft-based recovery procedure' from Scylladb[bot]
Group0 tombstone GC considers only the current group 0 members
while computing the group 0 tombstone GC time. It's not enough
because in the Raft-based recovery procedure, there can be nodes
that haven't joined the current group 0 yet, but they have belonged
to a different group 0 and thus have a non-empty group 0 state ID.
The current code can cause a data resurrection in group 0 tables.

We fix this issue in this PR and add a regression test.

This issue was uncovered by `test_raft_recovery_entry_loss`, which
became flaky recently. We skipped this test for now. We will unskip
it in a following PR because it's skipped only on master, while we
want to backport this PR.

Fixes #26534

This PR contains an important bugfix, so we should backport it
to all branches with the Raft-based recovery procedure (2025.2
and newer).

- (cherry picked from commit 1d09b9c8d0)

- (cherry picked from commit 6b2e003994)

- (cherry picked from commit c57f097630)

Parent PR: #26612

Closes scylladb/scylladb#26678

* https://github.com/scylladb/scylladb:
  test: test group0 tombstone GC in the Raft-based recovery procedure
  group0_state_id_handler: remove unused group0_server_accessor
  group0_state_id_handler: consider state IDs of all non-ignored topology members
2025-10-27 10:22:42 +01:00
Patryk Jędrzejczak
f2535f2c5e test: test group0 tombstone GC in the Raft-based recovery procedure
We add a regression test for the bug fixed in the previous commits.

(cherry picked from commit c57f097630)
2025-10-24 13:02:27 +02:00
Patryk Jędrzejczak
42271b6c41 group0_state_id_handler: remove unused group0_server_accessor
It became unused in the previous commit.

(cherry picked from commit 6b2e003994)
2025-10-22 17:12:20 +00:00
Patryk Jędrzejczak
3a662cf68d group0_state_id_handler: consider state IDs of all non-ignored topology members
It's not enough to consider only the current group 0 members. In the
Raft-based recovery procedure, there can be nodes that haven't joined
the current group 0 yet, but they have belonged to a different group 0
and thus have a non-empty group 0 state ID.

We fix this issue in this commit by considering topology members
instead.

We don't consider ignored nodes as an optimization. When some nodes are
dead, the group 0 state ID handler won't have to wait until all these
nodes leave the cluster. It will only have to wait until all these nodes
are ignored, which happens at the beginning of the first
removenode/replace. As a result, tombstones of group 0 tables will be
purged much sooner.

We don't rename the `group0_members` variable to keep the change
minimal. There seems to be no precise and succinct name for the used set
of nodes anyway.

We use `std::ranges::join_view` in one place because:
- `std::ranges::concat` will become available in C++26,
- `boost::range::join` is not a good option, as there is an ongoing
  effort to minimize external dependencies in Scylla.

(cherry picked from commit 1d09b9c8d0)
2025-10-22 17:12:20 +00:00
Asias He
57bea9bc47 repair: Fix uuid and nodes_down order in the log
Fixes #26536

Closes scylladb/scylladb#26547

(cherry picked from commit 33bc1669c4)

Closes scylladb/scylladb#26628
2025-10-22 11:29:24 +03:00
Botond Dénes
0299d7bc46 Merge '[Backport 2025.2] db/config: Add SSTable compression options for user tables' from Scylladb[bot]
ScyllaDB offers the `compression` DDL property for configuring compression per user table (compression algorithm and chunk size). If not specified, the default compression algorithm is the LZ4Compressor with a 4KiB chunk size. The same default applies to system tables as well.

This series introduces a new configuration option to allow customizing the default for user tables. It also adds some tests for the new functionality.

Fixes #25195.

- (cherry picked from commit 1106157756)

- (cherry picked from commit ea41f652c4)

- (cherry picked from commit a7e46974d4)

- (cherry picked from commit e1d9c83406)

- (cherry picked from commit 8d5bd212ca)

- (cherry picked from commit 6ba0fa20ee)

- (cherry picked from commit 8410532fa0)

Parent PR: #26003

Closes scylladb/scylladb#26300

* github.com:scylladb/scylladb:
  test/cluster: Add tests for invalid SSTable compression options
  test/boost: Add tests for SSTable compression config options
  main: Validate SSTable compression options from config
  db/config: Add SSTable compression options for user tables
  db/config: Prepare compression_parameters for config system
  compressor: Validate presence of sstable_compression in parameters
  compressor: Add missing space in exception message
2025-10-20 10:43:08 +03:00
Botond Dénes
5880893227 Merge '[Backport 2025.2] raft topology: disable schema pulls in the Raft-based recovery procedure' from Scylladb[bot]
Schema pulls should always be disabled when group 0 is used. However,
`migration_manager::disable_schema_pulls()` is never called during
a restart with `recovery_leader` set in the Raft-based recovery
procedure, which causes schema pulls to be re-enabled on all live nodes
(excluding the nodes replacing the dead nodes). Moreover, schema pulls
remain enabled on each node until the node is restarted, which could
be a very long time.

We fix this issue and add a regression test in this PR.

Fixes #26569

This is an important bug fix, so it should be backported to all branches
with the Raft-based recovery procedure (2025.2 and newer branches).

- (cherry picked from commit ec3a35303d)

- (cherry picked from commit da8748e2b1)

- (cherry picked from commit 71de01cd41)

Parent PR: #26572

Closes scylladb/scylladb#26596

* github.com:scylladb/scylladb:
  test: test_raft_recovery_entry_loss: fix the typo in the test case name
  test: verify that schema pulls are disabled in the Raft-based recovery procedure
  raft topology: disable schema pulls in the Raft-based recovery procedure
2025-10-20 10:42:36 +03:00
Nikos Dragazis
d995abfe0b test/cluster: Add tests for invalid SSTable compression options
Complementary to the previous patch. It triggers semantic validation
checks in `compression_parameters::validate()` and expects the server to
exit. The tests examine both command line and YAML options.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 8410532fa0)
2025-10-20 00:00:03 +03:00
Nikos Dragazis
979925e822 test/boost: Add tests for SSTable compression config options
Since patch 03461d6a54, all boost unit tests depending on `cql_test_env`
are compiled into a single executable (`combined_tests`). Add the new
test in there.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 6ba0fa20ee)
2025-10-20 00:00:02 +03:00
Nikos Dragazis
2acbc62d9d main: Validate SSTable compression options from config
`compression_parameters` provides two levels of validation:

* syntactic checks - implemented in the constructor
* semantic checks - implemented by `compression_parameters::validate()`

The former are applied implicitly when parsing the options from the
command line or from scylla.yaml. The latter are currently not applied,
but they should.

In lack of a better place, apply them in main, right after joining the
cluster, to make sure that the cluster features have been negotiated.
The feature needed here is the `SSTABLE_COMPRESSION_DICTS`. Validation
will fail if the feature is disabled and a dictionary compression
algorithm has been selected.

Also, mark `validate()` as const so that it can be called from a config
object.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 8d5bd212ca)
2025-10-20 00:00:02 +03:00
Nikos Dragazis
5321da2c0b db/config: Add SSTable compression options for user tables
ScyllaDB offers the `compression` DDL property for configuring
compression per user table (compression algorithm and chunk size). If
not specified, the default compression algorithm is the LZ4Compressor
with a 4KiB chunk size (refer to the default constructor for
`compression_parameters`). The same default applies to system tables as
well.

Add a new configuration option to allow customizing the default for user
tables. Use the previously hardcoded default as the new option's default
value.

Note that the option has no effect on ALTER TABLE statements. An altered
table either inherits explicit compression options from the CQL
statement, or maintains its existing options.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit e1d9c83406)
2025-10-20 00:00:02 +03:00
Nikos Dragazis
f23d4b1f49 db/config: Prepare compression_parameters for config system
SSTable compression is currently configurable only per table, via the
`compression` property in CREATE/ALTER TABLE statements. This is
represented internally via the `compression_parameters` class. We plan
to offer the same options via the configuration as well, to make the
default compression method for user tables configurable.

This patch prepares the ground by making the `compression_parameters`
usable as a `config_file::named_value`, namely:

* Define an extraction operator (required by `boost::program_options`
  for parsing the options from command line).
* Define a formatter (required by `named_value::operator()`).
* Define a template specialization for `config_type_for` (required by
  `named_value` constructor).
* Define a yaml converter (required for parsing the options from
  scylla.yaml).

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit a7e46974d4)
2025-10-19 23:59:26 +03:00
Patryk Jędrzejczak
b6a0a1290d test: test_raft_recovery_entry_loss: fix the typo in the test case name
(cherry picked from commit 71de01cd41)
2025-10-17 10:26:22 +00:00
Patryk Jędrzejczak
26ed158cb0 test: verify that schema pulls are disabled in the Raft-based recovery procedure
We do this at the end of `test_raft_recovery_entry_loss`. It's not worth
to add a separate regression test, as tests of the recovery procedure
are complicated and have a long running time. Also, we choose
`test_raft_recovery_entry_loss` out of all tests of the recovery
procedure because it does some schema changes.

(cherry picked from commit da8748e2b1)
2025-10-17 10:26:22 +00:00
Patryk Jędrzejczak
4460b9e6fb raft topology: disable schema pulls in the Raft-based recovery procedure
Schema pulls should always be disabled when group 0 is used. However,
`migration_manager::disable_schema_pulls()` is never called during
a restart with `recovery_leader` set in the Raft-based recovery
procedure, which causes schema pulls to be re-enabled on all live nodes
(excluding the nodes replacing the dead nodes). Moreover, schema pulls
remain enabled on each node until the node is restarted, which could
be a very long time.

The old gossip-based recovery procedure doesn't have this problem
because we disable schema pulls after completing the upgrade-to-group0
procedure, which is a part of the old recovery procedure.

Fixes #26569

(cherry picked from commit ec3a35303d)
2025-10-17 10:26:22 +00:00
Michał Chojnowski
fc8cc87fc2 test/boost/sstable_compressor_factory_test: fix thread-unsafe usage of Boost.Test
It turns out that Boost assertions are thread-unsafe,
(and can't be used from multiple threads concurrently).
This causes the test to fail with cryptic log corruptions sometimes.
Fix that by switching to thread-safe checks.

Fixes scylladb/scylladb#24982

Closes scylladb/scylladb#26472

(cherry picked from commit 7c6e84e2ec)

Closes scylladb/scylladb#26550
2025-10-15 12:15:48 +03:00
Jenkins Promoter
de3c316e7d Update pgo profiles - aarch64 2025-10-15 04:50:20 +03:00
Jenkins Promoter
71e2d7ae24 Update pgo profiles - x86_64 2025-10-15 04:28:19 +03:00
Michał Chojnowski
e585d6cb3b test_sstable_compression_dictionaries_basic: reconnect robustly after node reboots
Using `driver_connect()` after a cluster restart isn't enough to ensure
full CQL availability, but the test assumes that it is.

Fix that by making the test wait for CQL availability via `get_ready_cql()`.

Also, replace some manual usages of wait_for_cql_and_get_hosts with
`get_ready_cql()` too.

Fixes scylladb/scylladb#25362

Closes scylladb/scylladb#25366

(cherry picked from commit 85fd4d23fa)

Closes scylladb/scylladb#26513
2025-10-12 21:09:18 +03:00
Michał Chojnowski
19874119e5 docs: fix a parameter name in API calls in sstable-dictionary-compression.rst
The correct argument name is `cf`, not `table`.

Fixes scylladb/scylladb#25275

Closes scylladb/scylladb#26447

(cherry picked from commit 87e3027c81)

Closes scylladb/scylladb#26493
2025-10-10 10:12:57 +03:00
Pavel Emelyanov
ce740b9e45 Merge '[Backport 2025.2] service/qos: set long timeout for auth queries on SL cache update' from Scylladb[bot]
pass an appropriate query state for auth queries called from service
level cache reload. we use the function qos_query_state to select a
query_state based on caller context - for internal queries, we set a
very long timeout.

the service level cache reload is called from group0 reload. we want it
to have a long timeout instead of the default 5 seconds for auth
queries, because we don't have strict latency requirement on the one
hand, and on the other hand a timeout exception is undesired in the
group0 reload logic and can break group0 on the node.

Fixes https://github.com/scylladb/scylladb/issues/25290

backport possible to improve stability

- (cherry picked from commit a1161c156f)

- (cherry picked from commit 3c3dd4cf9d)

- (cherry picked from commit ad1a5b7e42)

Parent PR: #26180

Closes scylladb/scylladb#26477

* github.com:scylladb/scylladb:
  service/qos: set long timeout for auth queries on SL cache update
  auth: add query_state parameter to query functions
  auth: refactor query_all_directly_granted
2025-10-10 10:12:35 +03:00
Patryk Jędrzejczak
5e2ef4f3d8 raft topology: make the voter handler consider only group 0 members
In the Raft-based recovery procedure, we create a new group 0 and add
live nodes to it one by one. This means that for some time there are
nodes which belong to the topology, but not to the new group 0. The
voter handler running on the recovery leader incorrectly considers these
nodes while choosing voters.

The consequences:
- misleading logs, for example, "making servers {<ID of a non-member>}
  voters", where the non-member won't become a voter anyway,
- increased chance of majority loss during the recovery procedure, for
  example, all 3 nodes that first joined the new group 0 are in the same
  dc and rack, but only one of them becomes a voter because the voter
  handler tries to make non-members in other dcs/racks voters.

Fixes #26321

Closes scylladb/scylladb#26327

(cherry picked from commit 67d48a459f)

Closes scylladb/scylladb#26426
2025-10-09 18:22:13 +02:00
Michael Litvak
9d9c94bf47 service/qos: set long timeout for auth queries on SL cache update
pass an appropriate query state for auth queries called from service
level cache reload. we use the function qos_query_state to select a
query_state based on caller context - for internal queries, we set a
very long timeout.

the service level cache reload is called from group0 reload. we want it
to have a long timeout instead of the default 5 seconds for auth
queries, because we don't have strict latency requirement on the one
hand, and on the other hand a timeout exception is undesired in the
group0 reload logic and can break group0 on the node.

Fixes scylladb/scylladb#25290

(cherry picked from commit ad1a5b7e42)
2025-10-09 12:47:31 +00:00
Michael Litvak
4d54e98304 auth: add query_state parameter to query functions
add a query_state parameter to several auth functions that execute
internal queries. currently the queries use the
internal_distributed_query_state() query state, and we maintain this as
default, but we want also to be able to pass a query state from the
caller.

in particular, the auth queries currently use a timeout of 5 seconds,
and we will want to set a different timeout when executed in some
different context.

(cherry picked from commit 3c3dd4cf9d)
2025-10-09 12:47:31 +00:00
Michael Litvak
28349a442f auth: refactor query_all_directly_granted
rewrite query_all_directly_granted to use execute_internal instead of
query_internal in a style that is more consistent with the rest of the
module.

This will also be useful for a later change because execute_internal
accepts an additional parameter of query_state.

(cherry picked from commit a1161c156f)
2025-10-09 12:47:30 +00:00
Raphael S. Carvalho
40bad7524f replica: Fix race between drop table and merge completion handling
Consider this:
1) merge finishes, wakes up fiber to merge compaction groups
2) drop table happens, which in turn invokes truncate underneath
3) merge fiber stops old groups
4) truncate disables compaction on all groups, but the ones stopped
5) truncate performs a check that compaction has been disabled on
all groups, including the ones stopped
6) the check fails because groups being stopped didn't have compaction
explicitly disabled on them

To fix it, the check on step 6 will ignore groups that have been
stopped, since those are not eligible for having compaction explicitly
disabled on them. The compaction check is there, so ongoing compaction
will not propagate data being truncated, but here it happens in the
context of drop table which doesn't leave anything behind. Also, a
group stopped is somewhat equivalent to compaction disabled on it,
since the procedure to stop a group stops all ongoing compaction
and eventually removes its state from compaction manager.

Fixes #25551.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#25563

(cherry picked from commit 149f9d8448)

Closes scylladb/scylladb#25631
2025-10-08 09:40:44 +03:00
Botond Dénes
e18023f814 Merge '[Backport 2025.2] tools: fix documentation links after change to source-available' from Scylladb[bot]
Some tools commands have links to online documentation in their help output. These links were left behind in the source-available change, they still point to the old opensource docs. Furthermore, the links in the scylla-sstable help output always point to the latest stable release's documentation, instead of the appropriate one for the branch the tool was built from. Fix both of these.

Fixes: scylladb/scylladb#26320

Broken documentation link fix for the  tool help output, needs backport to all live source-available versions.

- (cherry picked from commit 5a69838d06)

- (cherry picked from commit 15a4a9936b)

- (cherry picked from commit fe73c90df9)

Parent PR: #26322

Closes scylladb/scylladb#26387

* github.com:scylladb/scylladb:
  tools/scylla-sstable: fix doc links
  release: adjust doc_link() for the post source-available world
  tools/scylla-nodetool: remove trailing " from doc urls
2025-10-08 06:38:01 +03:00
Botond Dénes
e90ae84b00 tools/scylla-sstable: fix doc links
The doc links in scylla-sstable help output are static, so they always
point to the documentation of the latest stable release, not to the
documentation of the release the tool binary is from. On top of that,
the links point to old open-source documentation, which is now EOL.
Fix both problems: point link at the new source-available documentation
pages and make them version aware.

(cherry picked from commit fe73c90df9)
2025-10-07 10:16:46 +03:00
Botond Dénes
fa3f9c08ea release: adjust doc_link() for the post source-available world
There is no more separate enterprise product and the doc urls are
slightly different.

(cherry picked from commit 15a4a9936b)
2025-10-07 10:16:27 +03:00
Botond Dénes
926042f29f tools/scylla-nodetool: remove trailing " from doc urls
They are accidental leftover from a previous way of storing command
descriptions.

(cherry picked from commit 5a69838d06)
2025-10-07 10:16:27 +03:00
Jenkins Promoter
00b673ac24 Update ScyllaDB version to: 2025.2.4 2025-10-05 16:34:26 +03:00
Benny Halevy
2c6cfad7e4 test_tablets_merge: test_tablet_split_merge_with_many_tables: reduce number of tables in debug mode
As the test hits timeouts in debug mode on aarch64.

Fixes #26252

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#26303

(cherry picked from commit b81c6a339b)

Closes scylladb/scylladb#26324
2025-10-01 14:12:58 +03:00
Asias He
c575fb85fe repair: Always reset node ops progress to 100% upon completion
Always set the node ops progress to 100% when the operation finishes,
regardless of success or failure. This ensures the progress never
remains below 100%, which would otherwise indicates a pending node
operation in case of an error.

Fixes #26193

Closes scylladb/scylladb#26194

(cherry picked from commit b31e651657)

Closes scylladb/scylladb#26265
2025-10-01 14:09:47 +03:00
Botond Dénes
d6e9844241 Merge '[Backport 2025.2] scylla-gdb: Fix fair-queue entry printing' from Scylladb[bot]
Catching a live entry in IO queue is very rare event, so we haven't seen it so far, but the `_ticket` member had been removed ~2 years ago and had been replaced with `_capacity` which is plain 64bit integer.

Fixes #26184

The issue is present in 2025.x as well and looks cheap to backport

- (cherry picked from commit 8438c59ad3)

Parent PR: #26185

Also includes backport of #24835 which also applies to 2025.2 and is now crucial.
The scylla_io_queues.ticket() method is renamed by this backport, but without 24835 it will be problematic to fix all callers of it

Closes scylladb/scylladb#26263

* github.com:scylladb/scylladb:
  scylla-gdb: Fix fair-queue entry printing
  scylla-gdb: Don't show io_queue executing and queued resources
2025-10-01 14:09:21 +03:00
Botond Dénes
bbf9ac6252 Merge '[Backport 2025.2] compaction: ensure that all compaction executors are stopped' from Scylladb[bot]
Currently, while stopping the compaction_manager, we stop task_manager
compaction module and concurrently run compaction_manager::really_do_stop.
really_do_stop stops and waits for all task_executors that are kept
in compaction_manager::_tasks, but nothing ensures that no more tasks will
be added there. Due to leftover tasks, we trigger  on_fatal_internal_error.

Modify the order of compaction_manager::stop. After the change, we stop
compaction tasks in the following order:
- abort module abort source;
- close module gate in the background;
- stop_ongoing_compactions (kept in compaction_manager::_tasks);
- wait until module gate is closed.

Check module abort source before creating compaction executor and
adding it to _tasks.

Thanks to the above, we can be sure that:
- after module::stop there will be no tasks in _tasks;
- compaction_manager::stop aborts all tasks; we don't wait for any whole
  compaction to finish.

Fixes: https://github.com/scylladb/scylladb/issues/25806.

Fixes shutdown bug; Needs backports to all version

- (cherry picked from commit 17707d0e6b)

- (cherry picked from commit 97c77d7cd5)

Parent PR: #25885

Closes scylladb/scylladb#26223

* github.com:scylladb/scylladb:
  compaction: move _tasks check
  compaction: stop compaction module in really_do_stop
2025-10-01 14:08:45 +03:00
Jenkins Promoter
7d41ff11c9 Update pgo profiles - aarch64 2025-10-01 04:46:21 +03:00
Jenkins Promoter
7daded2360 Update pgo profiles - x86_64 2025-10-01 04:27:01 +03:00
Tomasz Grabiec
d3d07331b3 Merge '[Backport 2025.2] replica: Fix split compaction when tablet boundaries change' from Scylladb[bot]
Consider the following:
1) balancer emits split decision
2) split compaction starts
3) split decision is revoked
4) emits merge decision
5) completes merge, before compaction in step 2 finishes

After last step, split compaction initiated in step 2 can fail because it works with the global tablet map, rather than the map when the compaction started. With the global state changing under its feet, on merge, the mutation splitting writer will think it's going backwards since sibling tablets are merged.

This problem was also seen when running load-and-stream, where split initiated by the sstable writer failed, split completed, and the unsplit sstable is left in the table dir, causing problems in the restart.

To fix this, let's make split compaction always work with the state when it started, not a global state.

Fixes #24153.

All 2025.* versions are vulnerable, so fix must be backported to them.

- (cherry picked from commit 0c1587473c)

- (cherry picked from commit 68f23d54d8)

Parent PR: #25690

Closes scylladb/scylladb#25934

* github.com:scylladb/scylladb:
  replica: Fix split compaction when tablet boundaries change
  replica: Futurize split_compaction_options()
2025-09-30 19:50:50 +02:00
Pavel Emelyanov
a597a93d5e scylla-gdb: Fix fair-queue entry printing
Catching a live entry in IO queue is very rare event, so we haven't seen
it so far, but the `_ticket` member had been removed ~2 years ago and
had been replaced with `_capacity` which is plain 64bit integer.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26185

(cherry picked from commit 8438c59ad3)
2025-09-30 11:27:07 +03:00
Pavel Emelyanov
d2d22170b4 scylla-gdb: Don't show io_queue executing and queued resources
These counters are no longer accounted by io-queue code and are always
zero. Even more -- accounting removal happened years ago and we don't
have Scylla versions built with seastar older than that.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24835
2025-09-30 11:27:07 +03:00
Raphael S. Carvalho
65e78d6336 replica: Fix split compaction when tablet boundaries change
Consider the following:
1) balancer emits split decision
2) split compaction starts
3) split decision is revoked
4) emits merge decision
5) completes merge, before compaction in step 2 finishes

After last step, split compaction initiated in step 2 can fail
because it works with the global tablet map, rather than the
map when the compaction started. With the global state changing
under its feet, on merge, the mutation splitting writer will
think it's going backwards since sibling tablets are merged.

This problem was also seen when running load-and-stream, where
split initiated by the sstable writer failed, split completed,
and the unsplit sstable is left in the table dir, causing
problems in the restart.

To fix this, let's make split compaction always work with
the state when it started, not a global state.

Fixes #24153.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 68f23d54d8)
2025-09-29 20:26:36 -03:00
Nikos Dragazis
dc37af8205 compressor: Validate presence of sstable_compression in parameters
SSTable compression parameters should always define an algorithm via the
`sstable_compression` sub-option. Add a check in the constructor to
ensure this is always provided (unless no options are given, which is
interpreted as "no compression").

This change has no user-visible effect, since the same check is already
performed at a higher-level, while validating the CQL properties of
CREATE TABLE and ALTER TABLE statements (see `cf_prop_defs::validate()`).
However, it will become useful in later patches, when compression config
options will be introduced.

Although now redundant, keep the sanity check in
`cf_prop_defs::validate()` to maintain consistency of error messages
with Cassandra.

Note also that Cassandra uses 'class' instead of 'sstable_compression'
since version 3.11.10, but Scylla still doesn't support this, see:
https://github.com/scylladb/scylladb/issues/4200

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit ea41f652c4)
2025-09-28 20:00:56 +00:00
Nikos Dragazis
b957b4aac4 compressor: Add missing space in exception message
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 1106157756)
2025-09-28 20:00:56 +00:00
Ferenc Szili
31ac99493f docs: add description of number of tablets computed by tablet allocator
This change adds the documentation section which explains the algorithm
to compute the absolute number of tablets a table has.

Fixes: #25740

Closes scylladb/scylladb#25741

(cherry picked from commit d462dc8839)

Closes scylladb/scylladb#26261
2025-09-28 20:29:39 +03:00
Aleksandra Martyniuk
bb548aae54 test: fix test_two_tablets_concurrent_repair_and_migration_repair_writer_level
test_two_tablets_concurrent_repair_and_migration_repair_writer_level waits
for the first node that logs info about repair_writer using asyncio.wait.
The done group is never awaited, so we never learn about the error.

The test itself is incorrect and the log about repair_writer is never
printed. We never learn about that and tests finishes successfully
after 10 minutes timeout.

Fix the test:
- disable hinted handoff;
- repair tablets of the whole table:
  - new table is added so that concurrent migration is possible;
- use wait_for_first_completed that awaits done group;
- do some cleanups.

Remove nightly mark.

Fixes: #26148.

Closes scylladb/scylladb#26209

(cherry picked from commit 48bbe09c8b)

Closes scylladb/scylladb#26219
2025-09-27 17:26:16 +03:00
Gleb Natapov
fab7024fba storage_service: change node_ops_info::ignore_nodes to host id
It drop useless translation from id to ip during removenode through
topology coordinator.

Closes scylladb/scylladb#25958

(cherry picked from commit d3badf7406)

Closes scylladb/scylladb#26250
2025-09-26 10:55:34 +02:00
Aleksandra Martyniuk
163e5be3f7 compaction: move _tasks check
In compaction_manager::really_do_stop we check whether _tasks list
is empty after the compactions are stopped. However, a new task may
still sneak in, causing the assertion failure. Such a task won't
be there for long - module::make_task will fail as the module is
already stopped.

Move the assertion, that checks if _tasks is empty, after the
compaction_states' gates are closed.

Fixes: #25806.
(cherry picked from commit 97c77d7cd5)
2025-09-25 16:01:43 +02:00
Aleksandra Martyniuk
5cfd052e1b compaction: stop compaction module in really_do_stop
Currently, compaction::task_manager_module is stopped in compaction_manager::stop,
concurrently to really_do_stop. We can't predict the order of the two.

Do not set _task_manager_module to nullptr at stop, because
compaction_manager::really_do_stop() may be called before the actual
shutdown, while other components still try to use it.
compaction::task_manager_module does not keep a pointer to compaction_manager,
so we won't end up with memory leak.

Stop compaction module in really_do_stop, after ongoing compactions
are stopped.

It's a preparation for further patches.

(cherry picked from commit 17707d0e6b)
2025-09-25 16:01:39 +02:00
Ferenc Szili
9208bdfe93 load_balancer: fix std::out_of_bounds when decommissioning with empty nodes
Consider the following:

The tablet load balancer is working on:

- node1: an empty node (no tablets) with a large disk capacity
- node2: an empty node (no tablets) with a lower disk capacity then node1
- node3: is being decommissioned and contains tablet replicas

In load_balancer::make_internode_plan() the initial destination
node/shard is selected like this:

// Pick best target shard.
auto dst = global_shard_id {target, _load_sketch->get_least_loaded_shard(target)};

load_sketch::get_least_loaded_shard(host_id) calls ensure_node() which
adds the host to load_sketch's internal hash maps in case the node was
not yet seen by load_sketch.

Let's assume dst is a shard on node1.

Later in load_balancer::make_internode_plan() we will call
pick_candidate() to try to find a better destination node than the
initial one:

// May choose a different source shard than src.shard or different destination host/shard than dst.
auto candidate = co_await pick_candidate(nodes, src_node_info, target_info, src, dst, nodes_by_load_dst,
                                            drain_skipped);
auto source_tablets = candidate.tablets;
src = candidate.src;
dst = candidate.dst;

If pick_candidate() selects some other empty destination (due to larger
capacity: node1) node, and that node has not yet been seen by
load_sketch (because it was empty), a subsequent call to
load_sketch::pick() will search for the node using
std::unordered_map::at(), and because the node is not found it will
throw a std::out_of_bounds() exception crashing the load balancer.

This problem is fixed by changing load_sketch::populate() to initialize
its internal maps with all the nodes which populate()'s arguments
filter for.

Fixes: #26203

Closes scylladb/scylladb#26207

(cherry picked from commit c6c9c316a7)

Closes scylladb/scylladb#26239
2025-09-25 09:47:17 +03:00
Ferenc Szili
32c069c17d docs: add capacity based balancing explanation
Capacity based balancing was introduced in 2025.1. It computes balance
based on a node's capacity: the number of tablets located on a node
should be directly proportional to that node's storage capacity.

This change adds this explanation to the docs.

Fixes: #25686

Closes scylladb/scylladb#25687

(cherry picked from commit de5dab8429)

Closes scylladb/scylladb#26106
2025-09-25 09:46:25 +03:00
Raphael S. Carvalho
b421867fa8 replica: Futurize split_compaction_options()
Prepararation for the fix of #24153.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 0c1587473c)
2025-09-24 20:30:34 -03:00
Dawid Mędrek
96b865ab9b db/batchlog: Drop batch if table has been dropped
If there are pending mutations in the batchlog for a table that
has been dropped, we'll keep attempting to replay them but with
no success -- `db::no_such_column_family` exceptions will be thrown,
and we'll keep trying again and again.

To prevent that, we drop the batch in that case just like we do
in the case of a non-existing keyspace.

A reproducer test has been included in the commit. It fails without
the changes in `db/batchlog_manager.cc`, and it succeeds with them.

Fixes scylladb/scylladb#24806

Closes scylladb/scylladb#26057

(cherry picked from commit 35f7d2aec6)

Closes scylladb/scylladb#26200
2025-09-24 09:54:03 +03:00
Patryk Jędrzejczak
b747da8b64 test: deflake driver reconnections in the recovery procedure tests
All three tests could hit
https://github.com/scylladb/python-driver/issues/295. We use the
standard workaround for this issue: reconnecting the driver after
the rolling restart, and before sending any requests to local tables
(that can fail if the driver closes a connection to the node that
restarted last).

All three tests perform two rolling restarts, but the latter ones
already have the workaround.

Fixes #26005

Closes scylladb/scylladb#26056

(cherry picked from commit a56115f77b)

Closes scylladb/scylladb#26197
2025-09-24 09:53:45 +03:00
Dawid Mędrek
4111afa4f7 test/perf/tablet_load_balancing.cc: Create nodes within one DC
In 789a4a1ce7, we adjusted the test file
to work with the configuration option `rf_rack_valid_keyspaces`. Part of
the commit was making the two tables used in the test replicate in
separate data centers.

Unfortunately, that destroyed the point of the test because the tables
no longer competed for resources. We fix that by enforcing the same
replication factor for both tables.

We still accept different values of replication factor when provided
manually by the user (by `--rf1` and `--rf2` commandline options). Scylla
won't allow for creating RF-rack-invalid keyspaces, but there's no reason
to take away the flexibility the user of the test already has.

Fixes scylladb/scylladb#26026

Closes scylladb/scylladb#26115

(cherry picked from commit 0d2560c07f)

Closes scylladb/scylladb#26171
2025-09-24 09:53:20 +03:00
Pavel Emelyanov
43dd627118 Merge '[Backport 2025.2] compaction/scrub: register sstables for compaction before validation' from Scylladb[bot]
compaction/scrub: register sstables for compaction before validation

When `scrub --validate` runs, it collects all candidate sstables at the
start and validates them one by one in separate compaction tasks.
However, scrub in validate mode does not register these sstables for
compaction, which allows regular compaction to pick them up and
potentially compact them away before validation begins. This leads to
scrub failures because the sstables can no longer be found.

This patch fixes the issue by first disabling compaction, collecting the
sstables, and then registering them for compaction before starting
validation. This ensures that the enqueued sstables remain available for
the entire duration of the scrub validation task.

Fixes #23363

This reported scrub failure occurs on all versions that have the
checksum/digest validation feature for uncompressed sstables.
So, backport it to older versions.

- (cherry picked from commit 84f2e99c05)

- (cherry picked from commit 7cdda510ee)

Parent PR: #26034

Closes scylladb/scylladb#26098

* github.com:scylladb/scylladb:
  compaction/scrub: register sstables for compaction before validation
  compaction/scrub: handle exceptions when moving invalid sstables to quarantine
2025-09-24 09:52:57 +03:00
Tomasz Grabiec
cfab5b69d2 tablets: scheduler: Run plan-maker in maintenance scheduling group
Currently, it runs in the gossiper scheduling group, because it's
invoked by the topology coordinator. That scheduling group has the
same amount of shares as user workload. Plan-making can take
significant amount of time during rebalancing, and we don't want that
to impact user workload which happens to run on the same shard.

Reduce impact by running in the maintenance scheduling group.

Fixes #26037

Closes scylladb/scylladb#26046

(cherry picked from commit ddbcea3e2a)

Closes scylladb/scylladb#26167
2025-09-22 15:20:53 +02:00
Lakshmi Narayanan Sreethar
7eecfeb351 compaction/scrub: register sstables for compaction before validation
When `scrub --validate` runs, it collects all candidate sstables at the
start and validates them one by one in separate compaction tasks.
However, scrub in validate mode does not register these sstables for
compaction, which allows regular compaction to pick them up and
potentially compact them away before validation begins. This leads to
scrub failures because the sstables can no longer be found.

This patch fixes the issue by first disabling compaction, collecting the
sstables, and then registering them for compaction before starting
validation. This ensures that the enqueued sstables remain available for
the entire duration of the scrub validation task.

Fixes #23363

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 7cdda510ee)
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-09-19 23:43:59 +05:30
Nadav Har'El
ca037dd072 alternator: fix bug in combination of AttributeUpdates + ReturnValues
In test/alternator/test_returnvalues.py we had tests for the
ReturnValues feature on UpdateItem requests - but we only tested
UpdateItem requests with the "modern" UpdateExpression, and forgot to
test the combination of ReturnValues with the old AttributeUpdates API.

It turns out this combination is buggy: when both ReturnValues=ALL_OLD
and AttributeUpdates need the previous value of the item, we may wrongly
std::move() the value out, and the operation will fail with a strange
error:

    An error occurred (ValidationException) when calling the UpdateItem
    operation: JSON assert failed on condition 'IsObject()'

The fix in this patch is trivial - just move the std::move() to the
correct place, after both UpdateExpression and AttributeUpdates
handling is done.

This patch also includes a reproducing test, which fails before this
patch and passes with it - and of course passes on DynamoDB. This
test reproduces two cases where the bug happened, as well as one
case where it didn't (to make sure we don't regress in what already
worked).

Fixes #25894

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25900

(cherry picked from commit 3c0032deb4)

Closes scylladb/scylladb#26095
2025-09-19 19:15:17 +03:00
Lakshmi Narayanan Sreethar
95d2b043f2 compaction/scrub: handle exceptions when moving invalid sstables to quarantine
In validate mode, scrub moves invalid sstables into the quarantine
folder. If validation fails because the sstable files are missing from
disk, there is nothing to move, and the quarantine step will throw an
exception. Handle such exceptions so scrub can return a proper
compaction_result instead of propagating the exception to the caller.
This will help the testcase for #23363 to reliably determine if the
scrub has failed or not.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 84f2e99c05)
2025-09-18 18:51:47 +05:30
Pavel Emelyanov
ef47dc47ca Merge '[Backport 2025.2] gossiper: ensure gossiper operations are executed in gossiper scheduling group' from Scylladb[bot]
Sometimes gossiper operations invoked from storage_service and other components run under a non-gossiper scheduling group. If these operations acquire gossiper locks, priority inversion can occur: higher-priority gossiper tasks may wait behind lower-priority tasks (e.g. streaming), which can cause gossiper slowness or even failures.
This patch ensures that gossiper operations requiring locks on gossiper structures are explicitly executed in the gossiper scheduling group. To help detect similar issues in the future, a warning is logged whenever a gossiper lock is acquired under a non-gossiper scheduling group.

Fixes scylladb/scylladb#25907
Refs: scylladb/scylladb#25702

Backport: this patch fixes an issue with gossiper operations scheduling group, that might affect topology operations, therefore backport is needed to 2025.1, 2025.2, 2025.3

- (cherry picked from commit 340413e797)

- (cherry picked from commit 6c2a145f6c)

Parent PR: #25981

Closes scylladb/scylladb#26071

* https://github.com/scylladb/scylladb:
  gossiper: ensure gossiper operations are executed in gossiper scheduling group
  gossiper: fix wrong gossiper instance used in `force_remove_endpoint`
2025-09-18 07:47:11 +03:00
Sergey Zolotukhin
205cd2c681 raft: disable caching for raft log.
This change disables caching for raft log table due to the following reasons:
* Immediate reason is a deficiency in handling emerging range tombstones in the cache, which causes stalls.
* Long-term reason is that sequential reads from the raft log do not benefit from the cache, making it better to bypass it to free up space and avoid stalls.

Fixes scylladb/scylladb#26027

Closes scylladb/scylladb#26031

(cherry picked from commit 2640b288c2)

Closes scylladb/scylladb#26072
2025-09-18 07:46:56 +03:00
Szymon Malewski
71aa0c0daa alternator/expressions.g: Fix antlr3 missing token leak
This patch overrides the antlr3 function that allocates the missing
tokens that would eventually leak. The override stores these tokens in
a vector, ensuring memory is freed whenever the parser is destroyed.
Solution is copied from CQL implementation.

A unit test to reproduce the issue is added - leak would be reported
by ASAN, when running this test in debug mode - the test passed but
the leak is discovered when the test file exits.

Fixes #25878

Closes scylladb/scylladb#25930

(cherry picked from commit 776f90e2f8)

Closes scylladb/scylladb#26084
2025-09-18 07:46:37 +03:00
Sergey Zolotukhin
d8f965942f gossiper: ensure gossiper operations are executed in gossiper scheduling group
Sometimes gossiper operations invoked from storage_service and other components
run under a non-gossiper scheduling group. If these operations acquire gossiper
locks, priority inversion can occur: higher-priority gossiper tasks may wait
behind lower-priority tasks (e.g. streaming), which can cause gossiper slowness
or even failures.
This patch ensures that gossiper operations requiring locks on gossiper
structures are explicitly executed in the gossiper scheduling group.
To help detect similar issues in the future, a warning is logged whenever
a gossiper lock is acquired under a non-gossiper scheduling group.

Fixes scylladb/scylladb#25907

(cherry picked from commit 6c2a145f6c)
2025-09-17 11:21:56 +00:00
Sergey Zolotukhin
838fcdfcbb gossiper: fix wrong gossiper instance used in force_remove_endpoint
`gossiper::force_remove_endpoint` is always executed on shard 0 using
`invoke_on`. Since each shard has its own `gossiper` instance, if
`force_remove_endpoint` is called from a shard other than shard 0,
`my_host_id()` may be invoked on the wrong `gossiper` object. This
results in undefined behavior due to unsynchronized access to resources
on another shard.

(cherry picked from commit 340413e797)
2025-09-17 11:21:55 +00:00
Wojciech Mitros
26bb601dc0 storage_proxy: send hints to pending replicas
Consider the following scenario:
- Current replica set is [A, B, C]
- write succeeds on [A, B], and a hint is logged for node C
- before the hint is replayed, D bootstraps and the token migrates from C to D
- hint is replayed to node C while D is pending, but it's too late, since streaming for that token is already done
- C is cleaned up, replayed data is lost, and D has a stale copy until next repair.
In the scenario we effectively fail to send the hint. This scenario is also more likely to happen with tablets,
as it can happen for every tablet migration.

This issue is particularly detrimental to materialized views. View updates use hints by default and a specific
view update may be sent to just one view replica (when a single base replica has a different row state due to
reordering or missed writes). When we lose a hint for such a view update, we can generate a persistent inconsistency
between the base and view - ghost rows can appear due to a lost tombstone and rows may be missing in the view due
to a lost row update. Such inconsistencies can't be fixed neither by repairing the view or the base table.

To handle this, in this patch we add the pending replicas to the list of targets of each hint, even if the original
target is still alive.

This will cause some updates to be redundant. These updates are probably unavoidable for now, but they shouldn't
be too common either. The scenarios for them are:
1. managing to send the hint to the source of a migrating replica before streaming that its token - the write will
arrive on the pending replica anyway in streaming
2. the hint target not being the source of the migration - if we managed to apply the original write of the hint to
the actual source of the migration, the pending replica will get it during streaming
3. sending the same hint to many targets at a similar time - while sending to each target, we'll see the same pending
replica for the hint so we'll send it multiple times
4. possible retries where even though the hint was successfully sent to the main target, we failed to send it to the
pending replica, so we need to retry the entire write

This patch handles both tablet migrations and tablet rebuilds. In the future, for tablet migrations, we can avoid
sending the hint to pending replias if the hint target is not the source fo the migration, which would allow us to
avoid the redundant writes 2 and 3. For rack-aware RF, this will be as simple as checking whether the replicas are
in the same rack.

We also add a test case reproducing the issue.

Co-Authored-By: Raphael S. Carvalho <raphaelsc@scylladb.com>

Fixes https://github.com/scylladb/scylladb/issues/19835

Closes scylladb/scylladb#25590

(cherry picked from commit 10b8e1c51c)

Closes scylladb/scylladb#25881
2025-09-16 12:24:45 +02:00
Wojciech Mitros
d3a8de60aa mv: delete previously undetected ghost rows in PRUNE MATERIALIZED VIEW statement
The PRUNE MATERIALIZED VIEW statement is supposed to remove ghost rows from the
view. Ghost rows are rows in the view with no corresponding row in the base table.
Before this patch, only rows whose primary key columns of the base table had
different values than any of the base rows were treated as ghost rows by the PRUNE
statement. However, view rows which have a column in their primary key that's not
in the base primary can also be ghost rows if this column has a different value
than the base row with the same values of remaining primary key columns. That's
because these rows won't be deleted unless we change value of this column in the
base table to this specific value.
In this patch we add a check for this column in the PRUNE MATERIALIZED VIEW logic.
If this column isn't the same in the base table and the view, these rows are also
deleted.

Fixes https://github.com/scylladb/scylladb/issues/25655

Closes scylladb/scylladb#25720

(cherry picked from commit 1f9be235b8)

Closes scylladb/scylladb#25955
2025-09-16 12:19:15 +02:00
Jenkins Promoter
2cf363a3df Update pgo profiles - aarch64 2025-09-15 04:39:07 +03:00
Jenkins Promoter
666a1782b4 Update pgo profiles - x86_64 2025-09-15 04:07:08 +03:00
Patryk Jędrzejczak
b74f507570 Merge '[Backport 2025.2] test: cluster: deflake consistency checks after decommission' from Scylladb[bot]
In the Raft-based topology, a decommissioning node is removed from group
0 after the decommission request is considered finished (and the token
ring is updated). Therefore, `check_token_ring_and_group0_consistency`
called just after decommission might fail when the decommissioned node
is still in group 0 (as a non-voter). We deflake all tests that call
`check_token_ring_and_group0_consistency` after decommission in this PR.

Fixes #25809

This PR improves CI stability and changes only tests, so it should be
backported to all supported branches.

- (cherry picked from commit e41fc841cd)

- (cherry picked from commit bb9fb7848a)

Parent PR: #25927

Closes scylladb/scylladb#25962

* https://github.com/scylladb/scylladb:
  test: cluster: deflake consistency checks after decommission
  test: cluster: util: handle group 0 changes after token ring changes in wait_for_token_ring_and_group0_consistency
2025-09-11 13:04:21 +02:00
Patryk Jędrzejczak
3dfbd813ad test: cluster: deflake consistency checks after decommission
In the Raft-based topology, a decommissioning node is removed from group
0 after the decommission request is considered finished (and the token
ring is updated). Therefore, `check_token_ring_and_group0_consistency`
called just after decommission might fail when the decommissioned node
is still in group 0 (as a non-voter). We deflake all tests that call
`check_token_ring_and_group0_consistency` after decommission in this
commit.

Fixes #25809

(cherry picked from commit bb9fb7848a)
2025-09-10 17:48:35 +00:00
Patryk Jędrzejczak
e44d154470 test: cluster: util: handle group 0 changes after token ring changes in wait_for_token_ring_and_group0_consistency
In the Raft-based topology, a decommissioning node is removed from group
0 after the decommission request is considered finished (and the token
ring is updated). `wait_for_token_ring_and_group0_consistency` doesn't
handle such a case; it only handles cases where the token ring is
updated later. We fix this in this commit.

We rely on the new implementation of
`wait_for_token_ring_and_group0_consistency` in the following commit to
fix flakiness of some tests.

We also update the obsolete docstring in this commit.

(cherry picked from commit e41fc841cd)
2025-09-10 17:48:35 +00:00
Piotr Dulikowski
2a82604998 Merge '[Backport 2025.2] service/qos: Modularize service level controller to avoid invalid access to auth::service' from Scylladb[bot]
Move management over effective service levels from `service_level_controller`
to a new dedicated type -- `auth_integration`.

Before these changes, it was possible for the service level controller to try
to access `auth::service` after it was deinitialized. For instance, it could
happen when reloading the cache. That HAS happened as described in the following
issue: scylladb/scylladb#24792.

Although the problem might have been mitigated or even resolved in
scylladb/scylladb@10214e13bd, it's not clear
how the service will be used in the future. It's best to prevent similar bugs
than trying to fix them later on.

The logic responsible for preventing to access an uninitialized `auth::service`
was also either non-existent, complex, or non-sufficient.

To prevent accessing `auth::service` by the service level controller, we extract
the relevant portion of the code to a separate entity -- `auth_integration`.
It's an internal helper type whose sole purpose is to manage effective service
levels.

Thanks to that, we were able to nest the lifetime of `auth_integration` within
the lifetime of `auth::service`. It's now impossible to attempt to dereference
it while it's uninitialized.

If a bug related to an invalid access is spotted again, though, it might also
be easier to debug it now.

There should be no visible change to the users of the interface of the service
level controller. We strived to make the patch minimal, and the only affected
part of the logic should be related to how `auth::service` is accessed.

The relevant portion of the initialization and deinitialization flow:

(a) Before the changes:

1. Initialize `service_level_controller`. Pass a reference to an uninitialized
   `auth::service` to it.
2. Initialize other services.
3. Initialize and start `auth::service`.
4. (work)
5. Stop and deinitialize `auth::service`.
6. Deinitialize other services.
7. Deinitialize `service_level_controller`.

(b) After the changes:

1. Initialize `service_level_controller`. Pass a reference to an uninitialized
   `auth::service` to it. (*)
2. Initialize other services.
3. Initialize and start `auth::service`.
4. Initialize `auth_integration`. Register it in `service_level_controller`.
5. (work)
6. Unregister `auth_integration` in `service_level_controller` and deinitialize
   it.
7. Stop and deinitialize `auth::service`.
8. Deinitialize other services.
9. Deinitialize `service_level_controller`.

(*):
    The reference to `auth::service` in `service_level_controller` is still
    necessary. We need to access the service when dropping a distributed
    service level.

    Although it would be best to cut that link between the service level
    controller and `auth::service` too, effectively separating the entities,
    it would require more work, so we leave it as-is for now.

    It shouldn't prove problematic as far as accessing an uninitialized service
    goes. Trying to drop a service level at the point when we're de-initializing
    auth should be impossible.

    For more context, see the function `drop_distributed_service_level` in
    `service_level_controller`.

A trivial test has been included in the PR. Although its value is questionable
as we only try to reload the service level cache at a specific moment, it's
probably the best we can deliver to provide a reproducer of the issue this patch
is resolving.

Fixes scylladb/scylladb#24792

Backport: The impact of the bug was minimal as it only affected the shutdown.
However, since CI is failing because of it, let's backport the change to all
supported versions.

- (cherry picked from commit 7d0086b093)

- (cherry picked from commit 34afb6cdd9)

- (cherry picked from commit e929279d74)

- (cherry picked from commit dd5a35dc67)

- (cherry picked from commit fc1c41536c)

Parent PR: #25478

Closes scylladb/scylladb#25752

* github.com:scylladb/scylladb:
  service/qos: Move effective SL cache to auth_integration
  service/qos: Add auth::service to auth_integration
  service/qos: Reload effective SL cache conditionally
  service/qos: Add gate to auth_integration
  service/qos: Introduce auth_integration
2025-09-10 09:47:53 +02:00
Dawid Mędrek
8847f3996e test/perf: Adjust tablet_load_balancing.cc to RF-rack-validity
We modify the logic to make sure that all of the keyspaces that the test
creates are RF-rack-valid. For that, we distribute the nodes across two
DCs and as many racks as the provided replication factor.

That may have an effect on the load balancing logic, but since this is
a performance test and since tablet load balancing is still taking place,
it should be acceptable.

This commit also finishes work in adjusting perf tests to pass with
the `rf_rack_valid_keyspaces` configuration option enabled. The remaining
tests either don't attempt to create keyspaces or they already create
RF-rack-valid keyspaces.

We don't need to explicitly enable the configuration option. It's already
enabled by default by `cql_test_config`. The reason why we haven't run into
any issue because of that is that performance tests are not part of our CI.

Fixes scylladb/scylladb#25127

Closes scylladb/scylladb#25728

(cherry picked from commit 789a4a1ce7)

Closes scylladb/scylladb#25921
2025-09-10 10:17:08 +03:00
Asias He
2ead17830f streaming: Enclose potential throws in try block and ensure sink close before logging
- Move the initialization of log_done inside the try block to catch any
  exceptions it may throw.

- Relocate the failure warning log after sink.close() cleanup
  to guarantee sink.close() is always called before logging errors.

Refs #25497

Closes scylladb/scylladb#25591

(cherry picked from commit b12404ba52)

Closes scylladb/scylladb#25902
2025-09-10 10:16:50 +03:00
Asias He
a5c5e062af streaming: Fix use after move in the tablet_stream_files_handler
The files object is moved before the log when stream finishes. We've
logged the files when the stream starts. Skip it in the end of
streaming.

Fixes #25830

Closes scylladb/scylladb#25835

(cherry picked from commit 451e1ec659)

Closes scylladb/scylladb#25890
2025-09-10 10:16:30 +03:00
Pavel Emelyanov
df604dabb7 s3: Export memory usage gauge (metrics)
The memory usage is tracked with the help of a semaphore, so just export
its "consumed" units.

One tricky place here is the need to skip metrics registration for
scylla-sstable tool. The thing is that the tools starts the storage
manager and sstables manager on start and then some of tool's operations
may want to start both managers again (via cql environment) causing
double metrics registration exception.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#25769

(cherry picked from commit b26816f80d)

Closes scylladb/scylladb#25864
2025-09-10 10:15:56 +03:00
Patryk Jędrzejczak
e1845ef5f8 Merge '[Backport 2025.2] gossiper: fix issues in processing gossip status during the startup and when messages are delayed to avoid empty host ids' from Scylladb[bot]
Populate the local state during gossiper initialization in start_gossiping, preventing an empty state from being added to _endpoint_state_map and returned in get_endpoint_states responses, that was causing an 'empty host id issue' on the other nodes during nodes restart.

Check for a race condition in do_apply_state_locally In do_apply_state_locally, a race condition can occur if a task is suspended at a preemption point while the node entry is not locked.
During this time, the host may be removed from _endpoint_state_map. When the task resumes, this can lead to inserting an entry with an empty host ID into the map, causing various errors, including a node crash.

This change adds a check after locking the map entry: if a gossip ACK update does not contain a host ID, we verify that an entry with that host ID still exists in the gossiper’s _endpoint_state_map.

Fixes https://github.com/scylladb/scylladb/issues/25831
Fixes https://github.com/scylladb/scylladb/issues/25803
Fixes https://github.com/scylladb/scylladb/issues/25702
Fixes https://github.com/scylladb/scylladb/issues/25621

Ref https://github.com/scylladb/scylla-enterprise/issues/5613

Backport: The issue affects all current releases(2025.x), therefore this PR needs to be backported to all 2025.1-2025.3.

- (cherry picked from commit 28e0f42a83)

- (cherry picked from commit f08df7c9d7)

- (cherry picked from commit 775642ea23)

- (cherry picked from commit b34d543f30)

Parent PR: #25849

Closes scylladb/scylladb#25897

* https://github.com/scylladb/scylladb:
  gossiper: fix empty initial local node state
  gossiper: add test for a race condition in start_gossiping
  gossiper: check for a race condition in `do_apply_state_locally`
  test/gossiper: add reproducible test for race condition during node decommission
2025-09-09 12:23:56 +02:00
Yaron Kaikov
98e32ca86b build_docker.sh: enable debug symboles installation
Adding the latest scylla.repo location to our docker container, this
will allow installation scylla-debuginfo package in case it's needed

Fixes: https://github.com/scylladb/scylladb/issues/24271

Closes scylladb/scylladb#25646

(cherry picked from commit d57741edc2)

Closes scylladb/scylladb#25892
2025-09-09 11:41:55 +03:00
Sergey Zolotukhin
a7a7de9a69 gossiper: fix empty initial local node state
This change removes the addition of an empty state to `_endpoint_state_map`.
Instead, a new state is created locally and then published via replicate,
avoiding the issue of an empty state existing in `_endpoint_state_map`
before the preemption point. Since this resolves the issue tested in
`test_gossiper_empty_self_id_on_shadow_round`, the `xfail` mark has been removed.

Fixes: scylladb/scylladb#25831
(cherry picked from commit b34d543f30)
2025-09-08 21:54:43 +00:00
Sergey Zolotukhin
f44b578075 gossiper: add test for a race condition in start_gossiping
This change adds a test for a race condition in `start_gossiping` that
can lead to an empty self state sent in `gossip_get_endpoint_states_response`.

Test for scylladb/scylladb#25831

(cherry picked from commit 775642ea23)
2025-09-08 21:54:43 +00:00
Sergey Zolotukhin
e157e8577e gossiper: check for a race condition in do_apply_state_locally
In do_apply_state_locally, a race condition can occur if a task is
suspended at a preemption point while the node entry is not locked.
During this time, the host may be removed from _endpoint_state_map.
When the task resumes, this can lead to inserting an entry with an
empty host ID into the map, causing various errors, including a node
crash.

This change
1. adds a check after locking the map entry: if a gossip ACK update
   does not contain a host ID, we verify that an entry with that host ID
   still exists in the gossiper’s _endpoint_state_map.
2. Removes xfail from the test_gossiper_race test since the issue is now
   fixed.
3. Adds exception handling in `do_shadow_round` to skip responses from
   nodes that sent an empty host ID.

This re-applies the commit 13392a40d4 that
was reverted in 46aa59fe49, after fixing
the issues that caused the CI to fail.

Fixes: scylladb/scylladb#25702
Fixes: scylladb/scylladb#25621

Ref: scylladb/scylla-enterprise#5613
(cherry picked from commit f08df7c9d7)
2025-09-08 21:54:43 +00:00
Emil Maskovsky
e8b903979e test/gossiper: add reproducible test for race condition during node decommission
This change introduces a targeted test that simulates the gossiper race
condition observed during node decommissioning. The test delays gossip
state application and host ID lookup to reliably reproduce the scenario
where `gossiper::get_host_id()` is called on a removed endpoint,
potentially triggering an abort in `apply_new_states`.

There is a specific error injection added to widen the race window, in
order to increase the likelihood of hitting the race condition. The
error injection is designed to delay the application of gossip state
updates, for the specific node that is being decommissioned. This should
then result in the server abort in the gossiper.

This re-applies the commit 5dac4b38fb that
was reverted in dc44fca67c, but modified
to relax the check from "on_internal_error" to a just warning log. The
more strict can be re-introduced later once we are sure that all
remaining problems are resolved and it will not break the CI.

Refs: scylladb/scylladb#25621
Fixes: scylladb/scylladb#25721
(cherry picked from commit 28e0f42a83)
2025-09-08 21:54:43 +00:00
Dawid Mędrek
3f3ce66981 service/qos: Move effective SL cache to auth_integration
Since `auth_integration` manages effective service levels, let's move
the relevant cache from `service_level_controller` to it.

(cherry picked from commit fc1c41536c)
2025-09-08 19:00:14 +02:00
Dawid Mędrek
687956f9f4 service/qos: Add auth::service to auth_integration
The new service, `auth_integration`, has taken over the responsibility
over managing effective service levels from `service_level_controller`.
However, before these changes, it still accessed `auth::service` via
the service level controller. Let's change that.

Note that we also remove a check that `auth::service` has been
initialized. It's not necessary anymore because the lifetime of
`auth_integration` is strictly nested within the lifetime of `auth::service`.

In actuality, `service_level_controller` should lose its reference to
`auth::service` completely. All of the management over effective service
levels has already been moved to `auth_integration`. However, the
referernce is still needed when dropping a distributed service level
because we need to update the corresponding attribute for relevant
roles.

That should not lead to invalid accesses, though. Dropping a service level
should not be possible when `auth::service` is not initialized.

(cherry picked from commit dd5a35dc67)
2025-09-08 19:00:14 +02:00
Dawid Mędrek
566db1df0a service/qos: Reload effective SL cache conditionally
Since `service_level_controller` outlives `auth_integration`, it may
happen that we try to access it when it has already been deinitialized.
To prevent that, we only try to reload or clear the effective service
level cache when the object is still alive.

These changes solve an existing problem with an invalid memory access.
For more context, see issue scylladb/scylladb#24792.

We provide a reproducer test that consistently fails before these
changes but passes after them.

Fixes scylladb/scylladb#24792

(cherry picked from commit e929279d74)
2025-09-08 19:00:14 +02:00
Dawid Mędrek
6528bd47bf service/qos: Add gate to auth_integration
We add a named gate to `auth_integration` that will aid us in synchronizing
ongoing tasks with stopping the service.

(cherry picked from commit 34afb6cdd9)
2025-09-08 19:00:14 +02:00
Dawid Mędrek
65086ea76d service/qos: Introduce auth_integration
We introduce a new type, `auth_integration`, that will be used internally
by `service_level_controller`. Its purpose is to take over the responsibility
over managing effective service levels.

The main problem of the current implementation of service level controller
is its dependency on `auth::service` whose lifetime is strictly nested
within the lifetime of service level controller. That may and already have
led to invalid memory accesses; for an example, see issue
scylladb/scylladb#24792.

Our strategy is to split service level controller into smaller parts and
ensure that we access `auth::service` only when it's valid to do so.
This commit is the first step towards that.

We don't change anything in the logic yet, just add the new type. Further
adjustments will be made in following commits.

(cherry picked from commit 7d0086b093)
2025-09-08 19:00:14 +02:00
Calle Wilund
71ef2caef9 system_keyspace: Prune dropped tables from truncation on start/drop
Fixes #25683

Once a table drop is complete, there should be no reason to retain
truncation records for it, as any replay should skip mutations
anyway (no CF), and iff we somehow resurrect a dropped table,
this replay-resurrected data is the least problem anyway.

Adds a prune phase to the startup drop_truncation_rp_records run,
which ignores updating, and instead deletes records for non-existant
tables (which should patch any existing servers with lingering data
as well).

Also does an explicit delete of records on actual table DROP, to
ensure we don't grow this table more than needed even in long
uptime nodes.

Small unit test included.

Closes scylladb/scylladb#25699

(cherry picked from commit bc20861afb)

Closes scylladb/scylladb#25813
2025-09-08 16:26:04 +03:00
Avi Kivity
5de570c9ae Merge 'auth: move passwords::check call to alien thread' from Andrzej Jackowski
Analysis of customer stalls revealed that the function `detail::hash_with_salt` (invoked by `passwords::check`) often blocks the reactor. Internally, this function uses the external `crypt_r` function to compute password hashes, which is CPU-intensive.

This PR addresses the issue in two ways:
1) `sha-512` is now the only password hashing scheme for new passwords (it was already the common-case).
2) `passwords::check` is moved to a dedicated alien thread.

Regarding point 1: before this change, the following hashing schemes were supported by     `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However:
- The bcrypt schemes never worked properly because their prefixes lack the required round count (e.g. `$2y$` instead of `$2y$05$`). Moreover, bcrypt is slower than SHA-512, so it  not good idea to fix or use it.
- SHA-256 and SHA-512 both belong to the SHA-2 family. Libraries that support one almost always support the other, so it’s very unlikely to find SHA-256 without SHA-512.
- MD5 is no longer considered secure for password hashing.

Regarding point 2: the `passwords::check` call now runs on a shared alien thread created at database startup. An `std::mutex` synchronizes that thread with the shards. In theory this could introduce a frequent lock contention, but in practice each shard handles only a few hundred new connections per second—even during storms. There is already `_conns_cpu_concurrency_semaphore` in `generic_server` limits the number of concurrent connection handlers.

Fixes https://github.com/scylladb/scylladb/issues/24524

Backport not needed, as it is a new feature.

Closes scylladb/scylladb#24924

* github.com:scylladb/scylladb:
  main: utils: add thread names to alien workers
  auth: move passwords::check call to alien thread
  test: wait for 3 clients with given username in test_service_level_api
  auth: refactor password checking in password_authenticator
  auth: make SHA-512 the only password hashing scheme for new passwords
  auth: whitespace change in identify_best_supported_scheme()
  auth: require scheme as parameter for `generate_salt`
  auth: check password hashing scheme support on authenticator start

(cherry picked from commit c762425ea7)
2025-09-07 14:30:26 +03:00
Pavel Emelyanov
112827b170 Revert "test/gossiper: add reproducible test for race condition during node decommission"
This reverts commit 46f8404100 because
parent PR had been reverted as per #25803
2025-09-05 10:07:10 +03:00
Pavel Emelyanov
ed3e671564 Merge '[Backport 2025.2] drop table: fix crash on drop table with concurrent cleanup' from Scylladb[bot]
Consider the following scenario:

- A tablet is migrated away from a shard
- The tablet cleanup stage closes the storage group's async_gate
- A drop table runs truncate which attempts to disable compaction on the tablet with its gate closed. This fails, because table::parallel_foreach_compaction_group() ultimately calls storage_group_manager::parallel_foreach_storage_group() which will not disable compaction if it can't hold the storage group's gate
- Truncate calls table::discard_sstables() which checks if the compaction has been disabled, and because it hasn't, it then runs on_internal_error() with "compaction not disabled on table ks.cf during TRUNCATE" which causes a crash

Fixes: #25706

This needs to be backported to all supported versions with tablets

- (cherry picked from commit a0934cf80d)

- (cherry picked from commit 1b8a44af75)

Parent PR: #25708

Closes scylladb/scylladb#25784

* github.com:scylladb/scylladb:
  test: reproducer and test for drop with concurrent cleanup
  truncate: check for closed storage group's gate in discard_sstables
2025-09-04 08:44:29 +03:00
Emil Maskovsky
46f8404100 test/gossiper: add reproducible test for race condition during node decommission
This change introduces a targeted test that simulates the gossiper race
condition observed during node decommissioning. The test delays gossip
state application and host ID lookup to reliably reproduce the scenario
where `gossiper::get_host_id()` is called on a removed endpoint,
potentially triggering an abort in `apply_new_states`.

There is a specific error injection added to widen the race window, in
order to increase the likelihood of hitting the race condition. The
error injection is designed to delay the application of gossip state
updates, for the specific node that is being decommissioned. This should
then result in the server abort in the gossiper.

Refs: scylladb/scylladb#25621
Fixes: scylladb/scylladb#25721

Backport: The test is primarily for an issue found in 2025.1, so it
needs to be backported to all the 2025.x branches.

Closes scylladb/scylladb#25685

(cherry picked from commit 5dac4b38fb)

Closes scylladb/scylladb#25780
2025-09-02 20:54:09 +02:00
Piotr Dulikowski
d672c7c45e Merge '[Backport 2025.2] system_keyspace: add peers cache to get_ip_from_peers_table' from Scylladb[bot]
The gossiper can call `storage_service::on_change` frequently (see  scylladb/scylla-enterprise#5613), which may cause high CPU load and even trigger OOMs or related issues.

This PR adds a temporary cache for `system.peers` to resolve host_id -> ip without hitting storage on every call. The cache is short-lived to handle the unlikely case where `system.peers` is updated directly via CQL.

This is a temporary fix; a more thorough solution is tracked in https://github.com/scylladb/scylladb/issues/25620.

Fixes scylladb/scylladb#25660

backport: this patch needs to be backported to all supported versions (2025.1/2/3).

- (cherry picked from commit 91c633371e)

- (cherry picked from commit de5dc4c362)

- (cherry picked from commit 4b907c7711)

Parent PR: #25658

Closes scylladb/scylladb#25764

* github.com:scylladb/scylladb:
  storage_service: move get_host_id_to_ip_map to system_keyspace
  system_keyspace: use peers cache in get_ip_from_peers_table
  storage_service: move get_ip_from_peers_table to system_keyspace
2025-09-02 08:34:27 +02:00
Ferenc Szili
1bf298c722 test: reproducer and test for drop with concurrent cleanup
This change adds a reproducer and test for issue #25706

(cherry picked from commit 1b8a44af75)
2025-09-02 02:18:21 +00:00
Ferenc Szili
0b4b85c820 truncate: check for closed storage group's gate in discard_sstables
Consider the following scenario:

- A tablet is migrated away from a shard
- The tablet cleanup stage closes the storage group's async_gate
- A drop table runs truncate which attempts to disable compaction on the
  tablet with its gate closed. This fails, because
  table::parallel_foreach_compaction_group() ultimately calls
  storage_group_manager::parallel_foreach_storage_group() which will not
  disable compaction if it can't hold the storage group's gate
- Truncate calls table::discard_sstables() which checks if the compaction
  has been disabled, and because it hasn't, it then runs
  on_internal_error() with "compaction not disabled on table ks.cf during
  TRUNCATE" which causes a crash

This patch makes dicard_sstables check if the storage group's gate is
closed whend checking for disabled compaction.

(cherry picked from commit a0934cf80d)
2025-09-02 02:18:21 +00:00
Nadav Har'El
c04b086929 alternator: avoid oversized allocation in Query/Scan
This patch fixes one cause of oversized allocations - and therefore
potentially stalls and increased tail latencies - in Alternator.

Alternator's Scan or Query operation return a page of results. When the
number of items is not limited by a "Limit" parameter, the default is
to return a 1 MB page. If items are short, a large number of them can
fit in that 1MB. The test test_query.py::test_query_large_page_small_rows
has 30,000 items returned in a single page.

In the response JSON, all these items are returned in a single array
"Items". Before this patch, we build the full response as a RapidJSON
object before sending it. The problem is that unfortunately, RapidJSON
stores arrays as contiguous allocations. This results in large
contiguous allocations in workloads that scan many small items, and
large contiguous allocations can also cause stalls and high tail
latencies. For example, before this patch, running

    test/alternator/run --runveryslow \
        test_query.py::test_query_large_page_small_rows

reports in the log:

    oversized allocation: 573440 bytes.

After this patch, this warning no longer appears.
The patch solves the problem by collecting the scanned items not in a
RapidJSON array, but rather in a chunked_vector<rjson::value>, i.e,
a chunked (non-contiguous) array of items (each a JSON value).
After collecting this array separately from the response object, we
need to print its content without actually inserting it into the object -
we add a new function print_with_extra_array() to do that.

The new separate-chunked-vector technique is used when a large number
(currently, >256) of items were scanned. When there is a smaller number
of items in a page (this is typical when each item is longer), we just
insert those items in the object and print it as before.

Beyond the original slow test that demonstrated the oversized allocation
(which is now gone), this patch also includes a new test which
exercises the new code with a scan of 700 (>256) items in a page -
but this new test is fast enough to be permanently in our test suite
and not a manual "veryslow" test as the other test.

Fixes #23535

(cherry picked from commit 2385fba4b6)

Closes scylladb/scylladb#25654
2025-09-01 16:40:02 +03:00
Petr Gusev
8b83a3d380 storage_service: move get_host_id_to_ip_map to system_keyspace
Reimplemented the function to use the peers cache. It could be replaced
with get_ip_from_peers_table, but that would create a coroutine frame for
each call.

(cherry picked from commit 4b907c7711)
2025-09-01 11:32:44 +02:00
Petr Gusev
8f5befd726 system_keyspace: use peers cache in get_ip_from_peers_table
The storage_service::on_change method can be called quite often
by the gossiper, see scylladb/scylla-enterprise#5613. In this commit
we introduce a temporal cache for system.peers so that we don't have
to go to the storage each time we need to resolve host_id -> ip.
We keep the cache only for a small amount of time to handle the
(unlikely) scenario when the user wants to update system.peers table
from CQL.

Fixes scylladb/scylladb#25660

(cherry picked from commit de5dc4c362)
2025-09-01 11:32:13 +02:00
Petr Gusev
ece69b212d storage_service: move get_ip_from_peers_table to system_keyspace
We plan to add a cache to get_ip_from_peers_table in upcoming commits.
It's more convenient to do this from system_keyspace, since the only two
methods that mutate system.peers (remove_endpoint and update_peers_info)
are already there.

(cherry picked from commit 91c633371e)
2025-09-01 11:32:04 +02:00
Nadav Har'El
46bd9f2f27 utils, alternator: fix detection of invalid base-64
This patch fixes an error-path bug in the base-64 decoding code in
utils/base64.cc, which among other things is used in Alternator to decode
blobs in JSON requests.

The base-64 decoding code has a lookup table, which was wrongly sized 255
bytes, but needed to be 256 bytes. This meant that if the byte 255 (0xFF)
was included in an invalid base-64 string, instead of detecting that this
is an invalid byte (since the only valid bytes in a base-64 string are
A-Z,a-z,0-9,+,/ and =), the code would either think it's valid with a
nonsense 6-bit part, or even crash on an out-of-bounds read.

Besides the trivial fix, this patch also includes a reproducing test,
which tries to write a blob as a supposedly base-64 encoded string with
a 0xFF byte in it. The test fails before this patch (the write succeeds,
unexpectedly), and passes after this patch (the write fails as
expected). The test also passes on DynamoDB.

Fixes #25701

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25705

(cherry picked from commit ff91027eac)

Closes scylladb/scylladb#25765
2025-09-01 09:07:00 +03:00
Calle Wilund
378ee9fb59 system_keyspace: Limit parallelism in drop_truncation_records
Fixes #25682
Refs scylla-enterprise#5580

If the truncation table is large in entries, we might create a
huge parallel execution, quite possibly consuming loads of resources
doing something quite trivial.
Limit concurrency to a small-ish number

Closes scylladb/scylladb#25678

(cherry picked from commit 2eccd17e70)

Closes scylladb/scylladb#25749
2025-09-01 09:06:18 +03:00
Emil Maskovsky
750549d0ca storage: pass host_id as parameter to maybe_reconnect_to_preferred_ip()
Previously, `maybe_reconnect_to_preferred_ip()` retrieved the host ID
using `gossiper::get_host_id()`. Since the host ID is already available
in the calling function, we now pass it directly as a parameter.

This change simplifies the code and eliminates a potential race condition
where `gossiper::get_host_id()` could fail, as described in scylladb/scylladb#25621.

Refs: scylladb/scylladb#25621
Fixes: scylladb/scylladb#25715

Backport: Recommended for 2025.x release branches to avoid potential issues
from unnecessary calls to `gossiper::get_host_id()` in subscribers.

(cherry picked from commit cfc87746b6)

Closes scylladb/scylladb#25717
2025-09-01 09:06:07 +03:00
Jenkins Promoter
a334608836 Update pgo profiles - aarch64 2025-09-01 04:53:00 +03:00
Jenkins Promoter
d69eb514da Update pgo profiles - x86_64 2025-09-01 04:30:46 +03:00
Calle Wilund
a2bc1a7c6b commitlog: Ensure segment deletion is re-entrant
Fixes #25709

If we have large allocations, spanning more than one segment, and
the internal segment references from lead to secondary are the
only thing keeping a segment alive, the implicit drop in
discard_unused_segments and orphan_all can cause a recursive call
to discard_unused_segments, which in turn can lead to vector
corruption/crash, or even double free of segment (iterator confusion).

Need to separate the modification of the vector (_segments) from
actual releasing of objects. Using temporaries is the easiest
solution.

To further reduce recursion, we can also do an early clear of
segment dependencies in callbacks from segment release (cf release).

Closes scylladb/scylladb#25719

(cherry picked from commit cc9eb321a1)

Closes scylladb/scylladb#25755
2025-08-30 18:51:35 +03:00
Pavel Emelyanov
1ee00069e7 Merge '[Backport 2025.2] repair: distribute tablet_repair_task_metas between shards' from Aleksandra Martyniuk
Currently, in repair_service::repair_tablets a shard that initiates
the repair keeps repair_tablet_metas of all tablets that have a replica
on this node (on any shard). This may lead to oversized allocations.

Modify tablet_repair_task_impl to repair only the tablets which replicas
are kept on this shard. Modify repair_service::repair_tablets to gather
repair_tablet_metas only on local shard. repair_tablets is invoked on
all shards.

Add a new legacy_tablet_repair_task_impl that covers tablet repair started with
async_repair. A user can use sequence number of this task to manage the repair
using storage_service API.

In a test that reproduced this, we have seen 11136 tablets and 5636096 bytes
allocation failure. If we had a node with 250 shards, 100 tablets each, we could
reach 12MB kept on one shard for the whole repair time.

Fixes: https://github.com/scylladb/scylladb/issues/23632

Needs backport to all live branches as they are all vulnerable to such crashes.

Closes scylladb/scylladb#25352

* github.com:scylladb/scylladb:
  repair: distribute tablet_repair_task_meta among shards
  repair: do not keep erm in tablet_repair_task_meta
2025-08-27 10:28:03 +03:00
Avi Kivity
1ae593da2e Merge 'token_range_vector: fragment' from Avi Kivity
token_range_vector is a sequence of intervals of tokens. It is used
to describe vnodes or token ranges owned by shards.

Since tokens are bloated (16 bytes instead of 8), and intervals are bloated
(40 byte of overhead instead of 8), and since we have plenty of token ranges,
such vectors can exceed our allocation unit of 128 kB and cause allocation stalls.

This series fixes that by first generalizing some helpers and then changing
token_range_vector to use chunked_vector.

Although this touches IDL, there is no compatibility problem since the encoding
for vector and chunked_vector are identical.

There is no performance concern since token_range_vector is never used on
any hot path (hot paths always contain a partition key).

Fixes #3335.
Fixes #24115.

Fixes #24156

Closes scylladb/scylladb#25659

* github.com:scylladb/scylladb:
  dht: fragment token_range_vector
  partition_range_compat: generalize wrap/unwrap helpers
  utils: chunked_vector: add swap() method
  utils: chunked_vector: add range insert() overloads
2025-08-26 22:43:08 +03:00
Aleksandra Martyniuk
3a37d88060 replica: lower severity of failure log
Flush failure with seastar::named_gate_closed_exception is expected
if a respective compaction group was already stopped.

Lower the severity of a log in dirty_memory_manager::flush_one
for this exception.

Fixes: https://github.com/scylladb/scylladb/issues/25037.

Closes scylladb/scylladb#25355

(cherry picked from commit a10e241228)

Closes scylladb/scylladb#25649
2025-08-26 10:32:53 +03:00
Taras Veretilnyk
c272dc7746 keys: from_nodetool_style_string don't split single partition keys
Users with single-column partition keys that contain colon characters
were unable to use certain REST APIs and 'nodetool' commands, because the
API split key by colon regardless of the partition key schema.

Affected commands:
- 'nodetool getendpoints'
- 'nodetool getsstables'
Affected endpoints:
- '/column_family/sstables/by_key'
- '/storage_service/natural_endpoints'

Refs: #16596 - This does not fully fix the issue, as users with compound
keys will face the issue if any column of the partition key contains
a colon character.

Closes scylladb/scylladb#24829

Closes scylladb/scylladb#25564
2025-08-26 10:31:54 +03:00
Avi Kivity
d2b608d41a dht: fragment token_range_vector
token_range_vector is a linear vector containing intervals
of tokens. It can grow quite large in certain places
and so cause stalls.

Convert it to utils::chunked_vector, which prevents allocation
stalls.

It is not used in any hot path, as it usually describes
vnodes or similar things.

Fixes #3335.

(cherry picked from commit 844a49ed6e)
2025-08-25 12:59:20 +03:00
Avi Kivity
8e5c8008a0 partition_range_compat: generalize wrap/unwrap helpers
These helpers convert vectors of wrapped intervals to
vectors of unwrapped intervals and vice versa.

Generalize them to work on any sequence type. This is in
preparation of moving from vectors to chunked_vectors.

(cherry picked from commit 83c2a2e169)
2025-08-25 12:47:01 +03:00
Avi Kivity
6171da6fbc utils: chunked_vector: add swap() method
Following std::vector(), we implement swap(). It's a simple matter
of swapping all the contents.

A unit test is added.

(cherry picked from commit 13a75ff835)
2025-08-25 12:44:13 +03:00
Avi Kivity
faaec66be7 utils: chunked_vector: add range insert() overloads
Inserts an iterator range at some position.

Again we insert the range at the end and use std::rotate() to
move the newly inserted elements into place, forgoing possible
optimizations.

Unit tests are added.

(cherry picked from commit 24e0d17def)
2025-08-25 12:44:13 +03:00
Aleksandra Martyniuk
43a35f299a repair: distribute tablet_repair_task_meta among shards
Currently, in repair_service::repair_tablets a shard that initiates
the repair keeps tablet_repair_task_meta of all tablets that have a replica
on this node (on any shard). This may lead to oversized allocations.

Add remote_metas class which takes care of distributing tablet_repair_task_meta
among different shards. An additional class remote_metas_builder was
added in order to ensure safety and separate writes and reads to meta
vectors.

Fixes: #23632
(cherry picked from commit 132e6495a3)
2025-08-25 10:57:32 +02:00
Aleksandra Martyniuk
ca91421ed9 repair: do not keep erm in tablet_repair_task_meta
Do not keep erm in tablet_repair_task_meta to avoid non-owner shared
pointer access when metas will be distributes among shards.

Pass std::chunked_vector of erms to tablet_repair_task_impl to
preserve safety.

(cherry picked from commit 603a2dbb10)
2025-08-25 10:45:59 +02:00
kendrick-ren
9a4d92e1b8 Update launch-on-gcp.rst
Add the missing '=' mark in --zone option. Otherwise the command complains.

Closes scylladb/scylladb#25471

(cherry picked from commit d6e62aeb6a)

Closes scylladb/scylladb#25644
2025-08-25 11:05:54 +03:00
Benny Halevy
2f50cce913 api: storage_service: fix token_range documentation
Note that the token_range type is used only by describe_ring.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#25609

(cherry picked from commit 45c496c276)

Closes scylladb/scylladb#25639
2025-08-25 11:05:27 +03:00
Pavel Emelyanov
9fa2dd7788 Merge '[Backport 2025.2] cql3: Warn when creating RF-rack-invalid keyspace' from Scylladb[bot]
Although RF-rack-valid keyspaces are not universally enforced
yet (they're governed by the configuration option
`rf_rack_valid_keyspaces`), we'd like to encourage the user to
abide by the restriction.

To that end, we're introducing a warning when creating or
altering a keyspace. If the configuration option is disabled,
but the user is trying to create an RF-rack-invalid keyspace,
they'll receive a warning.

If the option is turned off, we will also log all of the
RF-rack-invalid keyspaces at start-up.

We provide validation tests.

Fixes scylladb/scylladb#23330

Backport: we'd like to encourage the user to abide by the restriction
even when they don't enforce it to make it easier in the future to
adjust the schema when there's no way to disable it anymore. Because
of that, we'd like to backport it to all relevant versions, starting with 2025.1.

- (cherry picked from commit 60ea22d887)

- (cherry picked from commit af8a3dd17b)

- (cherry picked from commit 837d267cbf)

Parent PR: #24785

Closes scylladb/scylladb#25634

* github.com:scylladb/scylladb:
  main: Log RF-rack-invalid keyspaces at startup
  cql3/statements: Fix indentation
  cql3: Warn when creating RF-rack-invalid keyspace
2025-08-25 11:04:55 +03:00
Dawid Mędrek
99b65be52e main: Log RF-rack-invalid keyspaces at startup
When the configuration option `rf_rack_valid_keyspaces` is enabled and there
is an RF-rack-invalid keyspace, starting a node fails. However, when the
configuration option is disabled, but there still is a keyspace that violates
the condition, we'd like Scylla to print a warning informing the user about
the fact. That's what happens in this commit.

We provide a validation test.

(cherry picked from commit 837d267cbf)
2025-08-22 14:31:13 +00:00
Dawid Mędrek
8ca0a9b56a cql3/statements: Fix indentation
(cherry picked from commit af8a3dd17b)
2025-08-22 14:31:13 +00:00
Dawid Mędrek
4df1d35375 cql3: Warn when creating RF-rack-invalid keyspace
Although RF-rack-valid keyspaces are not universally enforced
yet (they're governed by the configuration option
`rf_rack_valid_keyspaces`), we'd like to encourage the user to
abide by the restriction.

To that end, we're introducing a warning when creating or
altering a keyspace. If the configuration option is disabled,
but the user is trying to create an RF-rack-invalid keyspace,
they'll receive a warning.

We provide a validation test.

(cherry picked from commit 60ea22d887)
2025-08-22 14:31:13 +00:00
Michał Chojnowski
a17bc98728 sstables/types.hh: fix fmt::formatter<sstables::deletion_time>
Obvious typo.

Fixes scylladb/scylladb#25556

Closes scylladb/scylladb#25557

(cherry picked from commit c1b513048c)

Closes scylladb/scylladb#25587
2025-08-22 10:21:32 +03:00
Jenkins Promoter
a653819865 Update ScyllaDB version to: 2025.2.3 2025-08-19 22:21:55 +03:00
Pavel Emelyanov
8631054115 Merge '[Backport 2025.2] db/hints: Improve logs' from Scylladb[bot]
Before these changes, the logs in hinted handoff often didn't provide
crucial information like the identifier of the node that hints were
being sent to. Also, some of the logs were misleading and referred to
other places in the code than the one where an exception or some other
situation really occurred.

We modify those logs, extending them by more valuable information
and fixing existing issues. What's more, all of the logs in
`hint_endpoint_manager` and `hint_sender` follow a consistent format
now:

```
<class_name>[<destination host ID>]:<function_name>: <message>
```

This way, we should always have AT LEAST the basic information.

Fixes scylladb/scylladb#25466

Backport:
There is no risk in backporting these changes. They only have
impact on the logs. On the other hand, they might prove helpful
when debugging an issue in hinted handoff.

- (cherry picked from commit 2327d4dfa3)

- (cherry picked from commit d7bc9edc6c)

- (cherry picked from commit 6f1fb7cfb5)

Parent PR: #25470

Closes scylladb/scylladb#25537

* github.com:scylladb/scylladb:
  db/hints: Add new logs
  db/hints: Adjust log levels
  db/hints: Improve logs
2025-08-19 17:11:41 +03:00
Pavel Emelyanov
dee53a0107 Merge '[Backport 2025.2] generic server: 2 step shutdown' from Scylladb[bot]
This PR implements solution proposed in scylladb/scylladb#24481

Instead of terminating connections immediately, the shutdown now proceeds in two stages: first closing the receive (input) side to stop new requests, then waiting for all active requests to complete before fully closing the connections.

The updated shutdown process is as follows:

1. Initial Shutdown Phase
   * Close the accept gate to block new incoming connections.
   * Abort all accept() calls.
   * For all active connections:
      * Close only the input side of the connection to prevent new requests.
      * Keep the output side open to allow responses to be sent.

2. Drain Phase
   * Wait for all in-progress requests to either complete or fail.

3. Final Shutdown Phase
   * Fully close all connections.

Fixes scylladb/scylladb#24481

- (cherry picked from commit 122e940872)

- (cherry picked from commit 3848d10a8d)

- (cherry picked from commit 3610cf0bfd)

- (cherry picked from commit 27b3d5b415)

- (cherry picked from commit 061089389c)

- (cherry picked from commit 7334bf36a4)

- (cherry picked from commit ea311be12b)

- (cherry picked from commit 4f63e1df58)

Parent PR: #24499

Closes scylladb/scylladb#25518

* github.com:scylladb/scylladb:
  test: Set `request_timeout_on_shutdown_in_seconds` to `request_timeout_in_ms`,  decrease request timeout.
  generic_server: Two-step connection shutdown.
  transport: consmetic change, remove extra blanks.
  generic_server: replace empty destructor with `= default`
  generic_server: refactor connection::shutdown to use `shutdown_input` and `shutdown_output`
  generic_server: add `shutdown_input` and `shutdown_output` functions to `connection` class.
  test: Add test for query execution during CQL server shutdown
2025-08-19 17:11:22 +03:00
Wojciech Mitros
6f7f639f54 test: run mv tests depending on metrics on a standalone instance
The test_base_partition_deletion_with_metrics test case (and the batch
variant) uses the metric of view updates done during its runtime to check
if we didn't perform too many of them. The test runs in the cqlpy suite,
which  runs all test cases sequentially on one Scylla instance. Because
of this, if another test case starts a process which generates view
updates and doesn't wait for it to finish before it exists, we may
observe too many view updates in test_base_partition_deletion_with_metrics
and fail the test.
In all test cases we make sure that all tables that were created
during the test are dropped at the end. However, that doesn't
stop the view building process immediately, so the issue can happen
even if we drop the view. I confirmed it by adding a test just before
test_base_partition_deletion_with_metrics which builds a big
materialized view and drops it at the end - the metrics check still failed.

The issue could be caused by any of the existing test cases where we create
a view and don't wait for it to be built. Note that even if we start adding
rows after creating the view, some of them may still be included in the view
building, as the view building process is started asynchronously. In such
a scenario, the view building also doesn't cause any issues with the data in
these tests - writes performed after view creation generate view updates
synchronously when they're local (and we're running a single Scylla server),
the corresponding view udpates generated during view building are redundant.

Because we have many test cases which could be causing this issue, instead
of waiting for the view building to finish in every single one of them, we
move the susceptible test cases to be run on separate Scylla instances, in
the "cluster" suite. There, no other test cases will influence the results.

Fixes https://github.com/scylladb/scylladb/issues/20379

Closes scylladb/scylladb#25209

(cherry picked from commit 2ece08ba43)

Closes scylladb/scylladb#25503
2025-08-19 17:10:58 +03:00
Pavel Emelyanov
39a85df231 Merge '[Backport 2025.2] test: test_mv_backlog: fix to consider internal writes' from Scylladb[bot]
The PR fixes a test flakiness issue in test_mv_backlog related to reading metrics.

The first commit fixes a more general issue in the ScyllaMetrics helper class where it doesn't return the value of all matching lines when a specific shard is requested, but it breaks after the first match.

The second commit fixes a test issue where it expects exactly one write to be throttled, not taking into account other internal writes that may be executed during this time.

Fixes https://github.com/scylladb/scylladb/issues/23139

backport to improve CI stability - test only change

- (cherry picked from commit 5c28cffdb4)

- (cherry picked from commit 276a09ac6e)

Parent PR: #25279

Closes scylladb/scylladb#25474

* github.com:scylladb/scylladb:
  test: test_mv_backlog: fix to consider internal writes
  test/pylib/rest_client: fix ScyllaMetrics filtering
2025-08-19 17:10:39 +03:00
Dawid Mędrek
0f4965b8ae db/commitlog: Extend error messages for corrupted data
We're providing additional information in error messages when throwing
an exception related to data corruption: when a segment is truncated
and when it's content is invalid. That might prove helpful when debugging.

Closes scylladb/scylladb#25190

(cherry picked from commit 408b45fa7e)

Closes scylladb/scylladb#25460
2025-08-19 17:10:17 +03:00
Ferenc Szili
a0a346496e test: remove test_tombstone_gc_disabled_on_pending_replica
The test test_tombstone_gc_disabled_on_pending_replica was added when
we fixed (#20788) the potential problem with data resurrection during
file based streaming. The issue was occurring only in Enterprise, but
we added the fix in OSS to limit code divergence. This test was added
together with the fix in OSS with the idea to guard this change in OSS.
The real reproducer and test for this fix was added later, after the
fix was ported into Enterprise.
It is in: test/cluster/test_resurrection.py

Since Enterprise has been merged into OSS, there is no more need to
keep the test test_tombstone_gc_disabled_on_pending_replica. Also,
it is flaky with very low probability of failure, making it difficult
to investigate the cause of failure.

Fixes: #22182

Refs: scylladb/scylladb#25448

Closes scylladb/scylladb#25134

(cherry picked from commit 7ce96345bf)

Closes scylladb/scylladb#25572
2025-08-19 16:02:42 +03:00
Patryk Jędrzejczak
3113968380 test: test_maintenance_socket: use cluster_con for driver sessions
The test creates all driver sessions by itself. As a consequence, all
sessions use the default request timeout of 10s. This can be too low for
the debug mode, as observed in scylladb/scylla-enterprise#5601.

In this commit, we change the test to use `cluster_con`, so that the
sessions have the request timeout set to 200s from now on.

Fixes scylladb/scylla-enterprise#5601

This commit changes only the test and is a CI stability improvement,
so it should be backported all the way to 2024.2. 2024.1 doesn't have
this test.

Closes scylladb/scylladb#25510

(cherry picked from commit 03cc34e3a0)

Closes scylladb/scylladb#25546
2025-08-18 16:43:31 +02:00
Dawid Mędrek
bf7776bc3b db/hints: Add new logs
We're adding new logs in just a few places that may however prove
important when debugging issues in hinted handoff in the future.

(cherry picked from commit 6f1fb7cfb5)
2025-08-18 16:00:45 +02:00
Dawid Mędrek
d447f8eac0 db/hints: Adjust log levels
Some of the logs could be clogging Scylla's logs, so we demote their
level to a lower one.

On the other hand, some of the logs would most likely not do that,
and they could be useful when debugging -- we promote them to debug
level.

(cherry picked from commit d7bc9edc6c)
2025-08-18 16:00:45 +02:00
Dawid Mędrek
1d09be7641 db/hints: Improve logs
Before these changes, the logs in hinted handoff often didn't provide
crucial information like the identifier of the node that hints were
being sent to. Also, some of the logs were misleading and referred to
other places in the code than the one where an exception or some other
situation really occurred.

We modify those logs, extending them by more valuable information
and fixing existing issues. What's more, all of the logs in
`hint_endpoint_manager` and `hint_sender` follow a consistent format
now:

```
<class_name>[<destination host ID>]:<function_name>: <message>
```

This way, we should always have AT LEAST the basic information.

(cherry picked from commit 2327d4dfa3)
2025-08-18 16:00:42 +02:00
Sergey Zolotukhin
687439615f test: Set request_timeout_on_shutdown_in_seconds to request_timeout_in_ms,
decrease request timeout.

In debug mode, queries may sometimes take longer than the default 30 seconds.
To address this, the timeout value `request_timeout_on_shutdown_in_seconds`
during tests is aligned with other request timeouts.
Change request timeout for tests from 180s to 90s since we must keep the request
timeout during shutdown significantly lower than the graceful shutdown timeout(2m),
or else a request timeout would cause a graceful shutdown timeout and fail a test.

(cherry picked from commit 4f63e1df58)
2025-08-15 18:20:45 +02:00
Sergey Zolotukhin
44853b3bec generic_server: Two-step connection shutdown.
When shutting down in `generic_server`, connections are now closed in two steps.
First, only the RX (receive) side is shut down. Then, after all ongoing requests
are completed, or a timeout happened the connections are fully closed.

Fixes scylladb/scylladb#24481

(cherry picked from commit ea311be12b)
2025-08-15 18:20:42 +02:00
Sergey Zolotukhin
31c44e1b33 transport: consmetic change, remove extra blanks.
(cherry picked from commit 7334bf36a4)
2025-08-15 17:33:06 +02:00
Sergey Zolotukhin
4761da3aff generic_server: replace empty destructor with = default
This change improves code readability by explicitly marking the destructor as defaulted.

(cherry picked from commit 27b3d5b415)
2025-08-15 17:25:21 +02:00
Sergey Zolotukhin
7ca00444e3 generic_server: refactor connection::shutdown to use shutdown_input and shutdown_output
This change improves logging and modifies the behavior to attempt closing
the output side of a connection even if an error occurs while closing the input side.

(cherry picked from commit 3610cf0bfd)
2025-08-15 17:25:17 +02:00
Abhinav Jha
a0161ef67a raft: replication test: change rpc_propose_conf_change test to SEASTAR_THREAD_TEST_CASE
RAFT_TEST_CASE macro creates 2 test cases, one with random 20% packet
loss named name_drops. The framework makes hard coded assumptions about
leader which doesn't hold well in case of packet losses.

This short term fix disables the packet drop variant of the specified test.
It should be safe to re-enable it once the whole framework is re-worked to
remove these hard coded assumptions.

This PR fixes a bug. Hence we need to backport it.

Fixes: scylladb/scylladb#23816

Closes scylladb/scylladb#25489

(cherry picked from commit a0ee5e4b85)

Closes scylladb/scylladb#25527
2025-08-15 13:27:52 +03:00
Wojciech Przytuła
a95ad052df Fix link to ScyllaDB manual
The link would point to outdated OS docs. I fixed it to point to up-to-date Enterprise docs.

Closes scylladb/scylladb#25328

(cherry picked from commit 7600ccfb20)

Closes scylladb/scylladb#25484
2025-08-15 13:27:08 +03:00
Jenkins Promoter
5b5fbff2e6 Update pgo profiles - aarch64 2025-08-15 04:40:12 +03:00
Jenkins Promoter
981eeb275d Update pgo profiles - x86_64 2025-08-15 04:06:14 +03:00
Sergey Zolotukhin
43addab2e5 generic_server: add shutdown_input and shutdown_output functions to
`connection` class.

The functions are just wrappers for  _fd.shutdown_input() and  _fd.shutdown_output(), with added error reporting.
Needed by later changes.

(cherry picked from commit 3848d10a8d)
2025-08-14 13:22:04 +00:00
Sergey Zolotukhin
aa8913f317 test: Add test for query execution during CQL server shutdown
This test simulates a scenario where a query is being executed while
the query coordinator begins shutting down the CQL server and client
connections. The shutdown process should wait until the query execution
is either completed or timed out.

Test for scylladb/scylladb#24481

(cherry picked from commit 122e940872)
2025-08-14 13:22:04 +00:00
Anna Stuchlik
1abb9106cf doc: add the patch release upgrade procedure for version 2025.2
Fixes https://github.com/scylladb/scylladb/issues/25322

Closes scylladb/scylladb#25343
2025-08-14 10:43:16 +02:00
Michael Litvak
a114e95798 test: test_mv_backlog: fix to consider internal writes
The test executes a single write, fetching metrics before and after the
write, and expects the total throttled writes count to be increased
exactly by one.

However, other internal writes (compaction for example) may be executed
during this time and be throttled, causing the metrics to be increased
by more than expected.

To address this, we filter the metrics by the scheduling group label of
the user write, to filter out the compaction writes that run in the
compaction scheduling group.

Fixes scylladb/scylladb#23139

(cherry picked from commit 276a09ac6e)
2025-08-12 20:53:16 +03:00
Michael Litvak
5161342910 test/pylib/rest_client: fix ScyllaMetrics filtering
In the ScyllaMetrics `get` function, when requesting the value for a
specific shard, it is expected to return the sum of all values of
metrics for that shard that match the labels.

However, it would return the value of the first matching line it finds
instead of summing all matching lines.

For example, if we have two lines for one shard like:
some_metric{scheduling_group_name="compaction",shard="0"} 1
some_metric{scheduling_group_name="sl:default",shard="0"} 2

The result of this call would be 1 instead of 3:
get('some_metric', shard="0")

We fix this to sum all matching lines.

The filtering of lines by labels is fixed to allow specifying only some
of the labels. Previously, for the line to match the filter, either the
filter needs to be empty, or all the labels in the metric line had to be
specified in the filter parameter and match its value, which is
unexpected, and breaks when more labels are added.

We also simplify the function signature and the implementation - instead
of having the shard as a separate parameter, it can be specified as a
label, like any other label.

(cherry picked from commit 5c28cffdb4)
2025-08-12 20:53:16 +03:00
Patryk Jędrzejczak
0a3b93a4e4 docs: Raft recovery procedure: recommend verifying participation in Raft recovery
This instruction adds additional safety. The faster we notice that
a node didn't restart properly, the better.

The old gossip-based recovery procedure had a similar recommendation
to verify that each restarting node entered `RECOVERY` mode.

Fixes #25375

This is a documentation improvement. We should backport it to all
branches with the new recovery procedure, so 2025.2 and 2025.3.

Closes scylladb/scylladb#25376

(cherry picked from commit 7b77c6cc4a)

Closes scylladb/scylladb#25439
2025-08-11 15:52:41 +02:00
Botond Dénes
96ed160bd9 Merge '[Backport 2025.2] GCP Key Provider: Fix authentication issues' from Scylladb[bot]
* Fix discovery of application default credentials by using fully expanded pathnames (no tildes).
* Fix grant type in token request with user credentials.

Fixes #25345.

- (cherry picked from commit 77cc6a7bad)

- (cherry picked from commit b1d5a67018)

Parent PR: #25351

Closes scylladb/scylladb#25406

* github.com:scylladb/scylladb:
  encryption: gcp: Fix the grant type for user credentials
  encryption: gcp: Expand tilde in pathnames for credentials file
2025-08-11 07:00:03 +03:00
Szymon Malewski
3791a6a4c5 test/alternator: enable more relevant logs in CI.
This patch sets, for alternator test suite, all 'alternator-*' loggers and 'paxos' logger to trace level. This should significantly ease debugging of failed tests, while it has no effect on test time and increases log size only by 7%.
This affects running alternator tests only with `test.py`, not with `test/alternator/run`.

Closes #24645

Closes scylladb/scylladb#25327

(cherry picked from commit eb11485969)

Closes scylladb/scylladb#25382
2025-08-11 06:59:34 +03:00
Taras Veretilnyk
3d31b2118f docs: fix typo in command name enbleautocompaction -> enableautocompaction
Renamed the file and updated all references from 'enbleautocompaction' to the correct 'enableautocompaction'.

Fixes scylladb/scylladb#25172

Closes scylladb/scylladb#25175

(cherry picked from commit 6b6622e07a)

Closes scylladb/scylladb#25217
2025-08-11 06:58:08 +03:00
Benny Halevy
da0f9608d8 scylla-sstable: print_query_results_json: continue loop if row is disengaged
Otherwise it is accessed right when exiting the if block.
Add a unit test reproducing the issue and validating the fix.

Fixes #25325

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#25326

(cherry picked from commit 5e5e63af10)

Closes scylladb/scylladb#25378
2025-08-10 18:54:29 +03:00
Botond Dénes
d845de84aa Merge '[Backport 2025.2] truncate: change check for write during truncate into a log warning' from Scylladb[bot]
TRUNCATE TABLE performs a memtable flush and then discards the sstables of the table being truncated. It collects the highest replay position for both of these. When the highest replay position of the discarded sstables is higher than the highest replay position of the flushed memtable, that means that we have had writes during truncate which have been flushed to disk independently of the truncate process. We check for this and trigger an on_internal_error() which throws an exception, informing the user that writing data concurrently with TRUNCATE TABLE is not advised.

The problem with this is that truncate is also called from DROP KEYSPACE and DROP TABLE. These are raft operations and exceptions thrown by them are caught by the (...) exception handler in the raft applier fiber, which then exits leaving the node without the ability to execute subsequent raft commands.

This commit changes the on_internal_error() into a warning log entry. It also outputs to keyspace/table names, and the offending replay positions which caused the check to fail.

This PR also adds a test which validates that TRUNCATE works correctly with concurrent writes. More specifically, it checks that:
- all data written before TRUNCATE starts is deleted
- none of the data after TRUNCATE completes is deleted

Fixes: #25173
Fixes: #25013

Backport is needed in versions which check for truncate with concurrent writes using `on_internal_error()`: 2025.3 2025.2 2025.1

- (cherry picked from commit 268ec72dc9)

- (cherry picked from commit 33488ba943)

Parent PR: #25174

Closes scylladb/scylladb#25349

* github.com:scylladb/scylladb:
  truncate: add test for truncate with concurrent writes
  truncate: change check for write during truncate into a log warning
2025-08-08 11:44:41 +03:00
Nikos Dragazis
0e4b1196ee encryption: gcp: Fix the grant type for user credentials
Exchanging a refresh token for an access token requires the
"refresh_token" grant type [1].

[1] https://datatracker.ietf.org/doc/html/rfc6749#section-6

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit b1d5a67018)
2025-08-07 21:45:48 +00:00
Nikos Dragazis
ebac51202f encryption: gcp: Expand tilde in pathnames for credentials file
The GCP host searches for application default credentials in known
locations within the user's home directory using
`seastar::file_exists()`. However, this function does not perform tilde
expansion in pathnames.

Replace tildes with the home directory from the HOME environment
variable.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 77cc6a7bad)
2025-08-07 21:45:48 +00:00
Taras Veretilnyk
edb10b8f4d docs: Sort commands list in nodetool.rst
Fixes scylladb/scylladb#25330

Closes scylladb/scylladb#25331

(cherry picked from commit bcb90c42e4)

Closes scylladb/scylladb#25371
2025-08-07 13:13:59 +03:00
Botond Dénes
cccf726b54 Merge '[Backport 2025.2] test: introduce upgrade tests to test.py, add a SSTable dict compression upgrade test' from Michał Chojnowski
This PR adds an upgrade test for SSTable compression with shared dictionaries, and adds some bits to pylib and test.py to support that.

In the series, we:

1. Mount $XDG_CACHE_DIR into dbuild.
2. Add a pylib function which downloads and installs a released ScyllaDB package into a subdirectory of $XDG_CACHE_DIR/scylladb/test.py, and returns the path to bin/scylla.
3. Add new methods and params to the cluster manager, which let the test start nodes with historical Scylla executables, and switch executables during the test.
4. Add a test which uses the above to run an upgrade test between the released package and the current build.
5. Add --run-internet-dependent-tests to test.py which lets the user of test.py skip this test (and potentially other internet-dependent tests in the future).

(The patch modifying wait_for_cql_and_get_hosts is a part of the new test — the new test needs it to test how particular nodes in a mixed-version cluster react to some CQL queries.)

This is a follow-up to https://github.com/scylladb/scylladb/pull/23025, split into a separate PR because the potential addition of upgrade tests to test.py deserved a separate thread.

Needs backport to 2025.2, because that's where the tested feature is introduced.

Fixes https://github.com/scylladb/scylladb/issues/24110

- (cherry picked from commit 63218bb094)
- (cherry picked from commit cc7432888e)
- (cherry picked from commit 34098fbd1f)
- (cherry picked from commit 2ef0db0a6b)
- (cherry picked from commit 1ff7e09edc)
- (cherry picked from commit 5da19ff6a6)
- (cherry picked from commit d3cb873532)
- (cherry picked from commit dd878505ca)

Parent PR: https://github.com/scylladb/scylladb/pull/23538

Closes scylladb/scylladb#25158

* github.com:scylladb/scylladb:
  test: add test_sstable_compression_dictionaries_upgrade.py
  test.py: add --run-internet-dependent-tests
  pylib/manager_client: add server_switch_executable
  test/pylib: in add_server, give a way to specify the executable and version-specific config
  pylib: pass scylla_env environment variables to the topology suite
  test/pylib: add get_scylla_2025_1_executable()
  pylib/scylla_cluster: give a way to pass executable-specific options to nodes
  dbuild: mount "$XDG_CACHE_HOME/scylladb"
2025-08-07 06:26:25 +03:00
Nikos Dragazis
5ae00a3dab test: kmip: Fix segfault from premature destruction of port_promise
`kmip_test_helper()` is a utility function to spawn a dedicated PyKMIP
server for a particular Boost test case. The function runs the server as
an external process and uses a thread to parse the port from the
server's logs. The thread communicates the port to the main thread via
a promise.

The current implementation has a bug where the thread may set a value
to the promise after its destruction, causing a segfault. This happens
when the server does not start within 20 seconds, in which case the port
future throws and the stack unwinding machinery destroys the port
promise before the thread that writes to it.

Fix the bug by declaring the promise before the cleanup action.

The bug has been encountered in CI runs on slow machines, where the
PyKMIP server takes too long to create its internal tables (due to slow
fdatasync calls from SQLite). This patch does not improve CI stability -
it only ensures that the error condition is properly reflected in the
test output.

This patch is not a backport. The same bug has been fixed in master as
part of a larger rewrite of the `kmip_test_helper()` (see 722e2bce96).

Refs #24747, #24842.
Fixes #24574.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#25029
2025-08-06 11:58:56 +03:00
Raphael S. Carvalho
be94db3ace replica: Fix take_storage_snapshot() running concurrently to merge completion
Some background:
When merge happens, a background fiber wakes up to merge compaction
groups of sibling tablets into main one. It cannot happen when
rebuilding the storage group list, since token metadata update is
not preemptable. So a storage group, post merge, has the main
compaction group and two other groups to be merged into the main.
When the merge happens, those two groups are empty and will be
freed.

Consider this scenario:
1) merge happens, from 2 to 1 tablet
2) produces a single storage group, containing main and two
other compaction groups to be merged into main.
3) take_storage_snapshot(), triggered by migration post merge,
gets a list of pointer to all compaction groups.
4) t__s__s() iterates first on main group, yields.
5) background fiber wakes up, moves the data into main
and frees the two groups
6) t__s__s() advances to other groups that are now freed,
since step 5.
7) segmentation fault

In addition to memory corruption, there's also a potential for
data to escape the iteration in take_storage_snapshot(), since
data can be moved across compaction groups in background, all
belonging to the same storage group. That could result in
data loss.

Readers should all operate on storage group level since it can
provide a view on all the data owned by a tablet replica.
The movement of sstable from group A to B is atomic, but
iteration first on A, then later on B, might miss data that
was moved from B to A, before the iteration reached B.
By switching to storage group in the interface that retrieves
groups by token range, we guarantee that all data of a given
replica can be found regardless of which compaction group they
sit on.

Fixes #23162.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#24058

(cherry picked from commit 28056344ba)

Closes scylladb/scylladb#25338
2025-08-06 09:41:56 +03:00
Botond Dénes
22942c0a85 Merge '[Backport 2025.2] Raft-based recovery procedure: simplify rolling restart with recovery_leader' from Scylladb[bot]
The following steps are performed in sequence as part of the
Raft-based recovery procedure:
- set `recovery_leader` to the host ID of the recovery leader in
  `scylla.yaml` on all live nodes,
- send the `SIGHUP` signal to all Scylla processes to reload the config,
- perform a rolling restart (with the recovery leader being restarted
  first).

These steps are not intuitive and more complicated than they could be.

In this PR, we simplify these steps. From now on, we will be able to
simply set `recovery_leader` on each node just before restarting it.

Apart from making necessary changes in the code, we also update all
tests of the Raft-based recovery procedure and the user-facing
documentation.

Fixes scylladb/scylladb#25015

The Raft-based procedure was added in 2025.2. This PR makes the
procedure simpler and less error-prone, so it should be backported
to 2025.2 and 2025.3.

- (cherry picked from commit ec69028907)

- (cherry picked from commit 445a15ff45)

- (cherry picked from commit 23f59483b6)

- (cherry picked from commit ba5b5c7d2f)

- (cherry picked from commit 9e45e1159b)

- (cherry picked from commit f408d1fa4f)

Parent PR: #25032

Closes scylladb/scylladb#25334

* github.com:scylladb/scylladb:
  docs: document the option to set recovery_leader later
  test: delay setting recovery_leader in the recovery procedure tests
  gossip: add recovery_leader to gossip_digest_syn
  db: system_keyspace: peers_table_read_fixup: remove rows with null host_id
  db/config, gms/gossiper: change recovery_leader to UUID
  db/config, utils: allow using UUID as a config option
2025-08-06 09:41:17 +03:00
Michał Jadwiszczak
b58543dab7 storage_service, group0_state_machine: move SL cache update from topology_state_load() to load_snapshot()
Currently the service levels cache is unnecessarily updated in every
call of `topology_state_load()`.
But it is enough to reload it only when a snapshot is loaded.
(The cache is also already updated when there is a change to one of
`service_levels_v2`, `role_members`, `role_attributes` tables.)

Fixes scylladb/scylladb#25114
Fixes scylladb/scylladb#23065

Closes scylladb/scylladb#25116

(cherry picked from commit 10214e13bd)

Closes scylladb/scylladb#25304
2025-08-06 09:39:55 +03:00
Aleksandra Martyniuk
a468138716 api: storage_service: do not log the exception that is passed to user
The exceptions that are thrown by the tasks started with API are
propagated to users. Hence, there is no need to log it.

Remove the logs about exception in user started tasks.

Fixes: https://github.com/scylladb/scylladb/issues/16732.

Closes scylladb/scylladb#25153

(cherry picked from commit e607ef10cd)

Closes scylladb/scylladb#25296
2025-08-06 09:36:07 +03:00
Dawid Mędrek
8e96968fb7 test: Enable RF-rack-valid keyspaces in all Python suites
We're enabling the configuration option `rf_rack_valid_keyspaces`
in all Python test suites. All relevant tests have been adjusted
to work with it enabled.

That encompasses the following suites:

* alternator,
* broadcast_tables,
* cluster (already enabled in scylladb/scylladb@ee96f8dcfc),
* cql,
* cqlpy (already enabled in scylladb/scylladb@be0877ce69),
* nodetool,
* rest_api.

Two remaining suites that use tests written in Python, redis and scylla_gdb,
are not affected, at least not directly.

The redis suite requires creating an instance of Scylla manually, and the tests
don't do anything that could violate the restriction.

The scylla_gdb suite focuses on testing the capabilities of scylla-gdb.py, but
even then it reuses the `run` file from the cqlpy suite.

Fixes scylladb/scylladb#25126

Closes scylladb/scylladb#24617

(cherry picked from commit b41151ff1a)

Closes scylladb/scylladb#25230
2025-08-06 09:35:34 +03:00
Ferenc Szili
932223414b truncate: add test for truncate with concurrent writes
test_validate_truncate_with_concurrent_writes checks if truncate deletes
all the data written before the truncate starts, and does not delete any
data after truncate completes.

(cherry picked from commit 33488ba943)
2025-08-06 00:51:43 +00:00
Ferenc Szili
b50272d663 truncate: change check for write during truncate into a log warning
TRUNCATE TABLE performs a memtable flush and then discards the sstables
of the table being truncated. It collects the highest replay position
for both of these. When the highest replay position of the discarded
sstables is higher than the highest replay position of the flushed
memtable, that means that we have had writes during truncate which have
been flushed to disk independently of the truncate process. We check for
this and trigger an on_internal_error() which throws an exception,
informing the user that writing data concurrently with TRUNCATE TABLE is
not advised.

The problem with this is that truncate is also called from DROP KEYSPACE
and DROP TABLE. These are raft operations and exceptions thrown by them
are caught by the (...) exception handler in the raft applier fiber,
which then exits leaving the node without the ability to execute
subsequent raft commands.

This commit changes the on_internal_error() into a warning log entry. It
also outputs to keyspace/table names, the truncated_at timepoint, the
offending replay positions which caused the check to fail.

Fixes: #25173
Fixes: #25013
(cherry picked from commit 268ec72dc9)
2025-08-06 00:51:43 +00:00
Patryk Jędrzejczak
762b5e3ae8 docs: document the option to set recovery_leader later
In one of the previous commits, we made it possible to set
`recovery_leader` on each node just before restarting it. Here, we
update the corresponding documentation.

(cherry picked from commit f408d1fa4f)
2025-08-05 10:59:07 +00:00
Patryk Jędrzejczak
beb28a6155 test: delay setting recovery_leader in the recovery procedure tests
In the previous commit, we made it possible to set `recovery_leader`
on each node just before restarting it. Here, we change all the
tests of the Raft-based recovery procedure to use and test this option.

(cherry picked from commit 9e45e1159b)
2025-08-05 10:59:06 +00:00
Patryk Jędrzejczak
b79902baf8 gossip: add recovery_leader to gossip_digest_syn
In the new Raft-based recovery procedure, live nodes join the new
group 0 one by one during a rolling restart. There is a time window when
some of them are in the old group 0, while others are in the new group
0. This causes a group 0 mismatch in `gossiper::handle_syn_msg`. The
current solution for this problem is to ignore group 0 mismatches if
`recovery_leader` is set on the local node and to ask the administrator
to perform the rolling restart in the following way:
- set `recovery_leader` in `scylla.yaml` on all live nodes,
- send the `SIGHUP` signal to all Scylla processes to reload the config,
- proceed with the rolling restart.

This commit makes `gossiper::handle_syn_msg` ignore group 0 mismatches
when exactly one of the two gossiping nodes has `recovery_leader` set.
We achieve this by adding `recovery_leader` to `gossip_digest_syn`.
This change makes setting `recovery_leader` earlier on all nodes and
reloading the config unnecessary. From now on, the administrator can
simply restart each node with `recovery_leader` set.

However, note that nodes that join group 0 must have `recovery_leader`
set until all nodes join the new group 0. For example, assume that we
are in the middle of the rolling restart and one of the nodes in the new
group 0 crashes. It must be restarted with `recovery_leader` set, or
else it would reject `gossip_digest_syn` messages from nodes in the old
group 0. To avoid problems in such cases, we will continue to recommend
setting `recovery_leader` in `scylla.yaml` instead of passing it as
a command line argument.

(cherry picked from commit ba5b5c7d2f)
2025-08-05 10:59:06 +00:00
Patryk Jędrzejczak
1738f244d2 db: system_keyspace: peers_table_read_fixup: remove rows with null host_id
Currently, `peers_table_read_fixup` removes rows with no `host_id`, but
not with null `host_id`. Null host IDs are known to appear in system
tables, for example in `system.cluster_status` after a failed bootstrap.
We better make sure we handle them properly if they ever appear in
`system.peers`.

This commit guarantees that null UUID cannot belong to
`loaded_endpoints` in `storage_service::join_cluster`, which in
particular ensures that we throw a runtime error when a user sets
`recovery_leader` to null UUID during the recovery procedure. This is
handled by the code verifying that `recovery_leader` belongs to
`loaded_endpoints`.

(cherry picked from commit 23f59483b6)
2025-08-05 10:59:06 +00:00
Patryk Jędrzejczak
98e3b5e9b5 db/config, gms/gossiper: change recovery_leader to UUID
We change the type of the `recovery_leader` config parameter and
`gossip_config::recovery_leader` from sstring to UUID. `recovery_leader`
is supposed to store host ID, so UUID is a natural choice.

After changing the type to UUID, if the user provides an incorrect UUID,
parsing `recovery_leader` will fail early, but the start-up will
continue. Outside the recovery procedure, `recovery_leader` will then be
ignored. In the recovery procedure, the start-up will fail on:

```
throw std::runtime_error(
        "Cannot start - Raft-based topology has been enabled but persistent group 0 ID is not present. "
        "If you are trying to run the Raft-based recovery procedure, you must set recovery_leader.");
```

(cherry picked from commit 445a15ff45)
2025-08-05 10:59:06 +00:00
Patryk Jędrzejczak
1434afa588 db/config, utils: allow using UUID as a config option
We change the `recovery_leader` option to UUID in the following commit.

(cherry picked from commit ec69028907)
2025-08-05 10:59:06 +00:00
Nikos Dragazis
f53588813b test: Use in-memory SQLite for PyKMIP server
The PyKMIP server uses an SQLite database to store artifacts such as
encryption keys. By default, SQLite performs a full journal and data
flush to disk on every CREATE TABLE operation. Each operation triggers
three fdatasync(2) calls. If we multiply this by 16, that is the number
of tables created by the server, we get a significant number of file
syncs, which can last for several seconds on slow machines.

This behavior has led to CI stability issues from KMIP unit tests where
the server failed to complete its schema creation within the 20-second
timeout (observed on spider9 and spider11).

Fix this by configuring the server to use an in-memory SQLite.

Fixes #24842.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#24995

(cherry picked from commit 2656fca504)

Closes scylladb/scylladb#25299
2025-08-02 17:13:13 +03:00
Tomasz Grabiec
1dfb9d23ea topology_coordinator: Trigger load stats refresh after replace
Otherwise, tablet rebuilt will be delayed for up to 60s, as the tablet
scheduler needs load stats for the new node (replacing) to make
decisisons.

Fixes #25163

Closes scylladb/scylladb#25181

(cherry picked from commit 55116ee660)

Closes scylladb/scylladb#25214
2025-08-02 01:26:59 +02:00
Piotr Dulikowski
c4b4db62e3 Merge '[Backport 2025.2] qos: don't populate effective service level cache until auth is migrated to raft' from Scylladb[bot]
Right now, service levels are migrated in one group0 command and auth is migrated in the next one. This has a bad effect on the group0 state reload logic - modifying service levels in group0 causes the effective service levels cache to be recalculated, and to do so we need to fetch information about all roles. If the reload happens after SL upgrade and before auth upgrade, the query for roles will be directed to the legacy auth tables in system_auth - and the query, being a potentially remote query, has a timeout. If the query times out, it will throw an exception which will break the group0 apply fiber and the node will need to be restarted to bring it back to work.

In order to solve this issue, make sure that the service level module does not start populating and using the service level cache until both service levels and auth are migrated to raft. This is achieved by adding the check both to the cache population logic and the effective service level getter - they now look at service level's accessor new method, `can_use_effective_service_level_cache` which takes a look at the auth version.

Fixes: scylladb/scylladb#24963

Should be backported to all versions which support upgrade to topology over raft - the issue described here may put the cluster into a state which is difficult to get out of (group0 apply fiber can break on multiple nodes, which necessitates their restart).

- (cherry picked from commit 2bb800c004)

- (cherry picked from commit 3a082d314c)

Parent PR: #25188

Closes scylladb/scylladb#25284

* github.com:scylladb/scylladb:
  test: sl: verify that legacy auth is not queried in sl to raft upgrade
  qos: don't populate effective service level cache until auth is migrated to raft
2025-08-01 17:15:25 +02:00
Jenkins Promoter
4ff03275b5 Update ScyllaDB version to: 2025.2.2 2025-08-01 16:35:40 +03:00
Jenkins Promoter
4e6c0aaab5 Update pgo profiles - aarch64 2025-08-01 04:41:23 +03:00
Jenkins Promoter
dd95ea454a Update pgo profiles - x86_64 2025-08-01 04:29:06 +03:00
Piotr Dulikowski
d9075e6160 test: sl: verify that legacy auth is not queried in sl to raft upgrade
Adjust `test_service_levels_upgrade`: right before upgrade to topology
on raft, enable an error injection which triggers when the standard role
manager is about to query the legacy auth tables in the
system_auth keyspace. The preceding commit which fixes
scylladb/scylladb#24963 makes sure that the legacy tables are not
queried during upgrade to topology on raft, so the error injection does
not trigger and does not cause a problem; without that commit, the test
fails.

(cherry picked from commit 3a082d314c)
2025-07-31 15:13:24 +00:00
Piotr Dulikowski
618459125d qos: don't populate effective service level cache until auth is migrated to raft
Right now, service levels are migrated in one group0 command and auth
is migrated in the next one. This has a bad effect on the group0 state
reload logic - modifying service levels in group0 causes the effective
service levels cache to be recalculated, and to do so we need to fetch
information about all roles. If the reload happens after SL upgrade and
before auth upgrade, the query for roles will be directed to the legacy
auth tables in system_auth - and the query, being a potentially remote
query, has a timeout. If the query times out, it will throw
an exception which will break the group0 apply fiber and the node will
need to be restarted to bring it back to work.

In order to solve this issue, make sure that the service level module
does not start populating and using the service level cache until both
service levels and auth are migrated to raft. This is achieved by adding
the check both to the cache population logic and the effective service
level getter - they now look at service level's accessor new method,
`can_use_effective_service_level_cache` which takes a look at the auth
version.

Fixes: scylladb/scylladb#24963
(cherry picked from commit 2bb800c004)
2025-07-31 15:13:23 +00:00
Jakub Smolar
4a1de6725a gdb: handle zero-size reads in managed_bytes
Fixes: https://github.com/scylladb/scylladb/issues/25048

Closes scylladb/scylladb#25050

(cherry picked from commit 6e0a063ce3)

Closes scylladb/scylladb#25141
2025-07-31 13:06:45 +03:00
Pavel Emelyanov
45101e072e Merge '[Backport 2025.2] transport: remove throwing protocol_exception on connection start' from Dario Mirovic
Note: The simplest approach to resolving `process_request_one` merge issues, since it has been refactored, was to include the three commits from before, and then the commits that are actually being backported.

`protocol_exception` is thrown in several places. This has become a performance issue, especially when starting/restarting a server. To alleviate this issue, throwing the exception has to be replaced with returning it as a result or an exceptional future.

This PR replaces throws in the `transport/server` module. This is achieved by using result_with_exception, and in some places, where suitable, just by creating and returning an exceptional future.

There are four commits in this PR. The first commit introduces tests in `test/cqlpy`. The second commit refactors transport server `handle_error` to not rethrow exceptions. The third commit refactors reusable buffer writer callbacks. The fourth commit replaces throwing `protocol_exception` to returning it.

Based on the comments on an issue linked in https://github.com/scylladb/scylladb/issues/24567, the main culprit from the side of protocol exceptions is the invalid protocol version one, so I tested that exception for performance.

In order to see if there is a measurable difference, a modified version of `test_protocol_version_mismatch` Python is used, with 100'000 runs across 10 processes (not threads, to avoid Python GIL). One test run consisted of 1 warm-up run and 5 measured runs. First test run has been executed on the current code, with throwing protocol exceptions. Second test urn has been executed on the new code, with returning protocol exceptions. The performance report is in https://github.com/scylladb/scylladb/pull/24738#issuecomment-3051611069. It shows ~10% gains in real, user, and sys time for this test.

Testing

Build: `release`

Test file: `test/cqlpy/test_protocol_exceptions.py`
Test name: `test_protocol_version_mismatch` (modified for mass connection requests)

Test arguments:
```
max_attempts=100'000
num_parallel=10
```

Throwing `protocol_exception` results:
```
real=1:26.97  user=10:00.27  sys=2:34.55  cpu=867%
real=1:26.95  user=9:57.10  sys=2:32.50  cpu=862%
real=1:26.93  user=9:56.54  sys=2:35.59  cpu=865%
real=1:26.96  user=9:54.95  sys=2:32.33  cpu=859%
real=1:26.96  user=9:53.39  sys=2:33.58  cpu=859%

real=1:26.95 user=9:56.85 sys=2:34.11 cpu=862%   # average
```

Returning `protocol_exception` as `result_with_exception` or an exceptional future:
```
real=1:18.46  user=9:12.21  sys=2:19.08  cpu=881%
real=1:18.44  user=9:04.03  sys=2:17.91  cpu=869%
real=1:18.47  user=9:12.94  sys=2:19.68  cpu=882%
real=1:18.49  user=9:13.60  sys=2:19.88  cpu=883%
real=1:18.48  user=9:11.76  sys=2:17.32  cpu=878%

real=1:18.47 user=9:10.91 sys=2:18.77 cpu=879%   # average
```

This PR replaced `transport/server` throws of `protocol_exception` with returns. There are a few other places where protocol exceptions are thrown, and there are many places where `invalid_request_exception` is thrown. That is out of scope of this single PR, so the PR just refs, and does not resolve issue #24567.

Refs: #24567

This PR improves performance in cases when protocol exceptions happen, for example during connection storms. It will require backporting.

* (cherry picked from commit 7aaeed012e)

* (cherry picked from commit 30d424e0d3)

* (cherry picked from commit 9f4344a435)

* (cherry picked from commit 5390f92afc)

* (cherry picked from commit 4a6f71df68)

Parent PR: #24738

Closes scylladb/scylladb#25239

* github.com:scylladb/scylladb:
  test/cqlpy: add cpp exception metric test conditions
  transport/server: replace protocol_exception throws with returns
  utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception
  transport/server: avoid exception-throw overhead in handle_error
  test/cqlpy: add protocol_exception tests
  transport: remove redundant references in process_request_one
  transport: fix the indentation in process_request_one
  transport: add futures in CQL server exception handling
2025-07-31 12:18:50 +03:00
Anna Stuchlik
96a01082bb doc: add tablets support information to the Drivers table
This commit:

- Extends the Drivers support table with information on which driver supports tablets
  and since which version.
- Adds the driver support policy to the Drivers page.
- Reorganizes the Drivers page to accommodate the updates.

In addition:
- The CPP-over-Rust driver is added to the table.
- The information about Serverless (which we don't support) is removed
  and replaced with tablets to correctly describe the contents of the table.

Fixes https://github.com/scylladb/scylladb/issues/19471

Refs https://github.com/scylladb/scylladb-docs-homepage/issues/69

Closes scylladb/scylladb#24635

(cherry picked from commit 18b4d4a77c)

Closes scylladb/scylladb#25250
2025-07-31 12:18:33 +03:00
Aleksandra Martyniuk
782fb029d6 streaming: close sink when exception is thrown
If an exception is thrown in result_handling_cont in streaming,
then the sink does not get closed. This leads to a node crash.

Close sink in exception handler.

Fixes: https://github.com/scylladb/scylladb/issues/25165.

Closes scylladb/scylladb#25238

(cherry picked from commit 99ff08ae78)

Closes scylladb/scylladb#25267
2025-07-31 12:18:17 +03:00
Dario Mirovic
9708d9c4d7 test/cqlpy: add cpp exception metric test conditions
Tested code paths should not throw exceptions. `scylla_reactor_cpp_exceptions`
metric is used. This is a global metric. To address potential test flakiness,
each test runs multiple times:
- `run_count = 100`
- `cpp_exception_threshold = 10`

If a change in the code introduced an exception, expectation is that the number
of registered exceptions will be > `cpp_exception_threshold` in `run_count` runs.
In which case the test fails.

Fixes: #25272
(cherry picked from commit 4a6f71df68)
2025-07-30 21:54:47 +02:00
Dario Mirovic
dc819ebda1 transport/server: replace protocol_exception throws with returns
Replace throwing protocol_exception with returning it as a result
or an exceptional future in the transport server module. This
improves performance, for example during connection storms and
server restarts, where protocol exceptions are more frequent.

In functions already returning a future, protocol exceptions are
propagated using an exceptional future. In functions not already
returning a future, result_with_exception is used.

Notable change is checking v.failed() before calling v.get() in
process_request function, to avoid throwing in case of an
exceptional future.

Refs: #24567
Fixes: #25272
(cherry picked from commit 5390f92afc)
2025-07-30 21:54:45 +02:00
Dario Mirovic
a8d38882ff utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception
Make make_bytes_ostream and make_fragmented_temporary_buffer accept
writer callbacks that return utils::result_with_exception instead of
forcing them to throw on error. This lets callers propagate failures
by returning an error result rather than throwing an exception.

Introduce buffer_writer_for, bytes_ostream_writer, and fragmented_buffer_writer
concepts to simplify and document the template requirements on writer callbacks.

This patch does not modify the actual callbacks passed, except for the syntax
changes needed for successful compilation, without changing the logic.

Refs: #24567
Fixes: #25272
(cherry picked from commit 9f4344a435)
2025-07-30 21:54:41 +02:00
Dario Mirovic
efc1269665 transport/server: avoid exception-throw overhead in handle_error
Previously, connection::handle_error always called f.get() inside a try/catch,
forcing every failed future to throw and immediately catch an exception just to
classify it. This change eliminates that extra throw/catch cycle by first checking
f.failed(), getting the stored std::exception_ptr via f.get_exception(), and
then dispatching on its type via utils::try_catch<T>(eptr).

The error-response logic is not changed - cassandra_exception, std::exception,
and unknown exceptions are caught and processed, and any exceptions thrown by
write_response while handling those exceptions continues to escape handle_error.

Refs: #24567
Fixes: #25272
(cherry picked from commit 30d424e0d3)
2025-07-30 21:54:31 +02:00
Dario Mirovic
57a32e50d3 test/cqlpy: add protocol_exception tests
Add a helper to fetch scylla_transport_cql_errors_total{type="protocol_error"} counter
from Scylla's metrics endpoint. These metrics are used to track protocol error
count before and after each test.

Add cql_with_protocol context manager utility for session creation with parameterized
protocol_version value. This is used for testing connection establishment with
different protocol versions, and proper disposal of successfully established sessions.

The tests cover two failure scenarios:
- Protocol version mismatch in test_protocol_version_mismatch which tests both supported
and unsupported protocol version
- Malformed frames via raw socket in _protocol_error_impl, used by several test functions,
and also test_no_protocol_exceptions test to assert that the error counters never decrease
during test execution, catching unintended metric resets

Refs: #24567
Fixes: #25272
(cherry picked from commit 7aaeed012e)
2025-07-30 21:54:27 +02:00
Andrzej Jackowski
5dcda14e2e transport: remove redundant references in process_request_one
The references were added and used in previous commits to
limit the number of line changes for a reviewer convenience.

This commit removes the redundant references to make the code
more clear and concise.

(cherry picked from commit 9b1f062827)
2025-07-28 17:42:27 +02:00
Andrzej Jackowski
c56b8a3c2d transport: fix the indentation in process_request_one
Fix the indentation after the previous commit that intentionally had
a wrong indent to limit the number of changed lines

(cherry picked from commit 9c0f369cf8)
2025-07-28 17:42:11 +02:00
Andrzej Jackowski
356d73cdb0 transport: add futures in CQL server exception handling
Prepare for the next commit that will introduce a
seastar::sleep in handling of selected exception.

This commit:
 - Rewrite cql_server::connection::process_request_one to use
   seastar::futurize_invoke and try_catch<> instead of
   utils::result_try.
 - The intentation is intentionally incorrect to reduce the
   number of changed lines. Next commits fix it.

(cherry picked from commit 8a7454cf3e)
2025-07-28 17:41:31 +02:00
Aleksandra Martyniuk
c97da64e45 tasks: do not use binary progress for task manager tasks
Currently, progress of a parent task depends on expected_total_workload,
expected_children_number, and children progresses. Basically, if total
workload is known or all children have already been created, progresses
of children are summed up. Otherwise binary progress is returned.

As a result, two tasks of the same type may return progress in different
units. If they are children of the same task and this parent gathers the
progress - it becomes meaningless.

Drop expected_children_number as we can't assume that children are able
to show their progresses.

Modify get_progress method - progress is calculated based on children
progresses. If expected_total_workload isn't specified, the total
progress of a task may grow. If expected_total_workload isn't specified
and no children are created, empty progress (0/0) is returned.

Fixes: https://github.com/scylladb/scylladb/issues/24650.

Closes scylladb/scylladb#25113

(cherry picked from commit a7ee2bbbd8)

Closes scylladb/scylladb#25199
2025-07-28 09:25:39 +03:00
Ran Regev
054c658988 scylla.yaml: add recommended value for stream_io_throughput_mb_per_sec
Fixes: #24758

Updated scylla.yaml and the help for
scylla --help

Closes scylladb/scylladb#24793

(cherry picked from commit db4f301f0c)

Closes scylladb/scylladb#25196
2025-07-28 09:25:29 +03:00
Pavel Emelyanov
95b906bea9 Merge '[Backport 2025.2] storage_service: cancel all write requests after stopping transports' from Scylladb[bot]
When a node shuts down, in storage service, after storage_proxy RPCs are stopped, some write handlers within storage_proxy may still be waiting for background writes to complete. These handlers hold appropriate ERMs to block schema changes before the write finishes. After the RPCs are stopped, these writes cannot receive the replies anymore.

If, at the same time, there are RPC commands executing `barrier_and_drain`, they may get stuck waiting for these ERM holders to finish, potentially blocking node shutdown until the writes time out.

This change introduces cancellation of all outstanding write handlers from storage_service after the storage proxy RPCs were stopped.

Fixes scylladb/scylladb#23665

Backport: since this fixes an issue that frequently causes issues in CI, backport to 2025.1, 2025.2, and 2025.3.

- (cherry picked from commit bc934827bc)

- (cherry picked from commit e0dc73f52a)

Parent PR: #24714

Closes scylladb/scylladb#25169

* github.com:scylladb/scylladb:
  storage_service: Cancel all write requests on storage_proxy shutdown
  test: Add test for unfinished writes during shutdown and topology change
2025-07-28 09:25:15 +03:00
Pavel Emelyanov
8622a07bdd Merge '[Backport 2025.2] streaming: Avoid deadlock by running view checks in a separate scheduling group' from Scylladb[bot]
This issue happens with removenode, when RBNO is disabled, so range
streamer is used.

The deadlock happens in a scenario like this:
1. Start 3 nodes: {A, B, C}, RF=2
2. Node A is lost
3. removenode A
4. Both B and C gain ownership of ranges.
5. Streaming sessions are started with crossed directions: B->C, C->B

Readers created by sender side exhaust streaming semaphore on B and C.
Receiver side attempts to obtain a permit indirectly by calling
check_needs_view_update_path(), which reads local tables. That read is
blocked and times-out, causing streaming to fail. The streaming writer
is already using a tracking-only permit.

Even if we didn't deadlock, and the streaming semaphore was simply exhausted
by other receiving sessions (via tracking-only permit), the query may still time-out due to starvation.

To avoid that, run the query under a different scheduling group, which
translates to the system semaphore instead of the maintenance
semaphore, to break the dependency. The gossip group was chosen
because it shouldn't be contended and this change should not interfere
with it much.

Fixes #24807
Fixes #24925

- (cherry picked from commit ee2fa58bd6)

- (cherry picked from commit dff2b01237)

Parent PR: #24929

Closes scylladb/scylladb#25055

* github.com:scylladb/scylladb:
  streaming: Avoid deadlock by running view checks in a separate scheduling group
  service: migration_manager: Run group0 barrier in gossip scheduling group
2025-07-28 09:24:53 +03:00
Tomasz Grabiec
3991e4de28 streaming: Avoid deadlock by running view checks in a separate scheduling group
This issue happens with removenode, when RBNO is disabled, so range
streamer is used.

The deadlock happens in a scenario like this:
1. Start 3 nodes: {A, B, C}, RF=2
2. Node A is lost
3. removenode A
4. Both B and C gain ownership of ranges.
5. Streaming sessions are started with crossed directions: B->C, C->B

Readers created by sender side exhaust streaming semaphore on B and C.
Receiver side attempts to obtain a permit indirectly by calling
check_needs_view_update_path(), which reads local tables. That read is
blocked and times-out, causing streaming to fail. The streaming writer
is already using a tracking-only permit.

To avoid that, run the query under a different scheduling group, which
translates to the system semaphore instead of the maintenance
semaphore, to break the dependency. The gossip group was chosen
because it shouldn't be contended and this change should not interfere
with it much.

Fixes: #24807
(cherry picked from commit dff2b01237)
2025-07-27 22:52:56 +02:00
Sergey Zolotukhin
f15df0bcce storage_service: Cancel all write requests on storage_proxy shutdown
During a graceful node shutdown, RPC listeners are stopped in `storage_service::drain_on_shutdown`
as one of the first steps. However, even after RPCs are shut down, some write handlers in
`storage_proxy` may still be waiting for background writes to complete. These handlers retain the ERM.
Since the RPC subsystem is no longer active, replies cannot be received, and if any RPC commands are
concurrently executing `barrier_and_drain`, they may get stuck waiting for those writes. This can block
the messaging server shutdown and delay the entire shutdown process until the write timeout occurs.

This change introduces the cancellation of all outstanding write handlers in `storage_proxy`
during shutdown to prevent unnecessary delays.

Fixes scylladb/scylladb#23665

(cherry picked from commit e0dc73f52a)
2025-07-24 13:02:56 +00:00
Sergey Zolotukhin
487012e972 test: Add test for unfinished writes during shutdown and topology change
This test reproduces an issue where a topology change and an ongoing write query
during query coordinator shutdown can cause the node to get stuck.

When a node receives a write request, it creates a write handler that holds
a copy of the current table's ERM (Effective Replication Map). The ERM ensures
that no topology or schema changes occur while the request is being processed.

After the query coordinator receives the required number of replica write ACKs
to satisfy the consistency level (CL), it sends a reply to the client. However,
the write response handler remains alive until all replicas respond — the remaining
writes are handled in the background.

During shutdown, when all network connections are closed, these responses can no longer
be received. As a result, the write response handler is only destroyed once the write
timeout is reached.

This becomes problematic because the ERM held by the handler blocks topology or schema
change commands from executing. Since shutdown waits for these commands to complete,
this can lead to unnecessary delays in node shutdown and restarts, and occasional
test case failures.

Test for: scylladb/scylladb#23665

(cherry picked from commit bc934827bc)
2025-07-24 13:02:56 +00:00
Michał Chojnowski
7b5a4cadd7 test: add test_sstable_compression_dictionaries_upgrade.py
(cherry picked from commit dd878505ca)
2025-07-23 19:28:35 +02:00
Michał Chojnowski
1446b4e0ef test.py: add --run-internet-dependent-tests
Later, we will add upgrade tests, which need to download the previous
release of Scylla from the internet.

Internet access is a major dependency, so we want to make those tests
opt-in for now.

(cherry picked from commit d3cb873532)
2025-07-23 19:28:35 +02:00
Tomasz Grabiec
fa1b97f0c5 Merge '[Backport 2025.2] Improve background disposal of tablet_metadata' from Scylladb[bot]
As seen in #23284, when the tablet_metadata contains many tables, even empty ones,
we're seeing a long queue of seastar tasks coming from the individual destruction of
`tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>`.

This change improves `tablet_metadata::clear_gently` to destroy the `tablet_map_ptr` objects
on their owner shard by sorting them into vectors, per- owner shard.

Also, background call to clear_gently was added to `~token_metadata`, as it is destroyed
arbitrarily when automatic token_metadata_ptr variables go out of scope, so that the
contained tablet_metadata would be cleared gently.

Finally, a unit test was added to reproduce the `Too long queue accumulated for gossip` symptom
and verify that it is gone with this change.

Fixes #24814
Refs #23284

This change is not marked as fixing the issue since we still need to verify that there is no impact on query performance, reactor stalls, or large allocations, with a large number of tablet-based tables.

* Since the issue exists in 2025.1, requesting backport to 2025.1 and upwards

- (cherry picked from commit 3acca0aa63)

- (cherry picked from commit 493a2303da)

- (cherry picked from commit e0a19b981a)

- (cherry picked from commit 2b2cfaba6e)

- (cherry picked from commit 2c0bafb934)

- (cherry picked from commit 4a3d14a031)

- (cherry picked from commit 6e4803a750)

Parent PR: #24618

Closes scylladb/scylladb#24863

* github.com:scylladb/scylladb:
  token_metadata_impl: clear_gently: release version tracker early
  test: cluster: test_tablets_merge: add test_tablet_split_merge_with_many_tables
  token_metadata: clear_and_destroy_impl when destroyed
  token_metadata: keep a reference to shared_token_metadata
  token_metadata: move make_token_metadata_ptr into shared_token_metadata class
  replica: database: get and expose a mutable locator::shared_token_metadata
  locator: tablets: tablet_metadata: clear_gently: optimize foreign ptr destruction
2025-07-23 17:00:44 +02:00
Piotr Dulikowski
81d1790655 Merge '[Backport 2025.2] cdc: Forbid altering columns of CDC log tables directly' from Scylladb[bot]
The set of columns of a CDC log table should be managed automatically
by Scylla, and the user should not have the ability to manipulate them
directly. That could lead to disastrous consequences such as a
segmentation fault.

In this commit, we're restricting those operations. We also provide two
validation tests.

One of the existing tests had to be adjusted as it modified the type
of a column in a CDC log table. Since the test simply verifies that
the user has sufficient permissions to perform `ALTER TABLE` on the log
table, the test is still valid.

Fixes scylladb/scylladb#24643

Backport: we should backport the change to all affected
branches to prevent the consequences that may affect the user.

- (cherry picked from commit 20d0050f4e)

- (cherry picked from commit 59800b1d66)

Parent PR: #25008

Closes scylladb/scylladb#25107

* github.com:scylladb/scylladb:
  cdc: Forbid altering columns of inactive CDC log table
  cdc: Forbid altering columns of CDC log tables directly
2025-07-22 12:35:20 +02:00
Piotr Dulikowski
ea50c02a02 Merge '[Backport 2025.2] cdc: throw error if column doesn't exist' from Scylladb[bot]
in the CDC log transformer, when creating a CDC mutation based on some
base table mutation, for each value of a base column we set the value in
the CDC column with the same name.

When looking up the column in the CDC schema by name, we may get a null
pointer if a column by that name is not found. This shouldn't happen
normally because the base schema and CDC schema should be compatible,
and for each base column there should be a CDC column with the same
name.

However, there are scenarios where the base schema and CDC schema are
incompatible for a short period of time when they are being altered.
When a base column is being added or dropped, we could get a base
mutation with this column set, and then the CDC transformer picks up the
latest CDC schema which doesn't have this column.

If such thing happens, we fix the code to throw an exception instead of
crashing on null pointer dereference. Currently we don't have a safer
approach to handle this, but this might be changed in the future. The
other alternative is dropping that data silently which we prefer not to
do.

Throwing an error is acceptable because this scenario most likely
indicates this behavior by the user:
* The user adds a new column, and start writing values to the column
  before the ALTER is complete. or,
* The user drops a column, and continues writing values to the column
  while it's being dropped.

Both cases might as well fail with an error because the column is not
found in the base table.

Fixes scylladb/scylladb#24952

backport needed - simple fix for a node crash

- (cherry picked from commit b336f282ae)

- (cherry picked from commit 86dfa6324f)

Parent PR: #24986

Closes scylladb/scylladb#25066

* github.com:scylladb/scylladb:
  test: cdc: add test_cdc_with_alter
  cdc: throw error if column doesn't exist
2025-07-21 16:01:52 +02:00
Dawid Mędrek
c9735a6015 cdc: Forbid altering columns of inactive CDC log table
When CDC becomes disabled on the base table, the CDC log table
still exsits (cf. scylladb/scylladb@adda43edc7).
If it continues to exist up to the point when CDC is re-enabled
on the base table, no new log table will be created -- instead,
the old olg table will be *re-attached*.

Since we want to avoid situations when the definition of the log
table has become misaligned with the definition of the base table
due to actions of the user, we forbid modifying the set of columns
or renaming them in CDC log tables, even when they're inactive.

Validation tests are provided.

(cherry picked from commit 59800b1d66)
2025-07-21 11:43:14 +00:00
Dawid Mędrek
038ba48917 cdc: Forbid altering columns of CDC log tables directly
The set of columns of a CDC log table should be managed automatically
by Scylla, and the user should not have the ability to manipulate them
directly. That could lead to disastrous consequences such as a
segmentation fault.

In this commit, we're restricting those operations. We also provide two
validation tests.

One of the existing tests had to be adjusted as it modified the type
of a column in a CDC log table. Since the test simply verifies that
the user has sufficient permissions to perform `ALTER TABLE` on the log
table, the test is still valid.

Fixes scylladb/scylladb#24643

(cherry picked from commit 20d0050f4e)
2025-07-21 11:43:13 +00:00
Ernest Zaslavsky
6d8350b20d s3_client: parse multipart response XML defensively
Ensure robust handling of XML responses when initiating multipart
uploads. Check for the existence of required nodes before access,
and throw an exception if the XML is empty or malformed.

Refs: https://github.com/scylladb/scylladb/issues/24676

Closes scylladb/scylladb#24990

(cherry picked from commit 342e94261f)

Closes scylladb/scylladb#25054
2025-07-21 12:08:25 +02:00
Benny Halevy
edce417036 token_metadata_impl: clear_gently: release version tracker early
No need to wait for all members to be cleared gently.
We can release the version earlier since the
held version may be awaited for in barriers.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 6e4803a750)
2025-07-21 09:49:05 +03:00
Benny Halevy
179e2b3bf1 test: cluster: test_tablets_merge: add test_tablet_split_merge_with_many_tables
Reproduces #23284

Currently skipped in release mode since it requires
the `short_tablet_stats_refresh_interval` interval.
Ref #24641

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 4a3d14a031)
2025-07-21 09:49:02 +03:00
Benny Halevy
29c33cb065 token_metadata: clear_and_destroy_impl when destroyed
We have a lot of places in the code where
a token_metadata_ptr is kept in an automatic
variable and destroyed when it leaves the scope.
since it's a referenced counted lw_shared_ptr,
the token_metadata object is rarely destroyed in
those cases, but when it is, it doesn't go through
clear_gently, and in particular its tablet_metadata
is not cleared gently, leading to inefficient destruction
of potentially many foreign_ptr:s.

This patch calls clear_and_destroy_impl that gently
clears and destroys the impl object in the background
using the shared_token_metadata.

Fixes #13381

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 2c0bafb934)
2025-07-21 09:36:40 +03:00
Benny Halevy
4da9539831 token_metadata: keep a reference to shared_token_metadata
To be used by a following patch to gently clean and destroy
the token_data_impl in the background.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 2b2cfaba6e)
2025-07-21 09:36:40 +03:00
Benny Halevy
390ca79ae4 token_metadata: move make_token_metadata_ptr into shared_token_metadata class
So we can use the local shared_token_metadata instance
for safe background destroy of token_metadata_impl:s.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit e0a19b981a)
2025-07-21 09:36:40 +03:00
Benny Halevy
1113bb2580 replica: database: get and expose a mutable locator::shared_token_metadata
Prepare for next patch, the will use this shared_token_metadata
to make mutable_token_metadata_ptr:s

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 493a2303da)
2025-07-21 09:36:40 +03:00
Benny Halevy
a59a1b422f locator: tablets: tablet_metadata: clear_gently: optimize foreign ptr destruction
Sort all tablet_map_ptr:s by shard_id
and then destroy them on each shard to prevent
long cross-shard task queues for foreign_ptr destructions.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 3acca0aa63)
2025-07-21 09:36:40 +03:00
Michael Litvak
23dbe8952b test: cdc: add test_cdc_with_alter
Add a test that tests adding and dropping a column to a table with CDC
enabled while writing to it.

(cherry picked from commit 86dfa6324f)
2025-07-20 09:07:29 +02:00
Michael Litvak
f2af0c5f18 cdc: throw error if column doesn't exist
in the CDC log transformer, when creating a CDC mutation based on some
base table mutation, for each value of a base column we set the value in
the CDC column with the same name.

When looking up the column in the CDC schema by name, we may get a null
pointer if a column by that name is not found. This shouldn't happen
normally because the base schema and CDC schema should be compatible,
and for each base column there should be a CDC column with the same
name.

However, there are scenarios where the base schema and CDC schema are
incompatible for a short period of time when they are being altered.
When a base column is being added or dropped, we could get a base
mutation with this column set, and then the CDC transformer picks up the
latest CDC schema which doesn't have this column.

If such thing happens, we fix the code to throw an exception instead of
crashing on null pointer dereference. Currently we don't have a safer
approach to handle this, but this might be changed in the future. The
other alternative is dropping that data silently which we prefer not to
do.

Throwing an error is acceptable because this scenario most likely
indicates this behavior by the user:
* The user adds a new column, and start writing values to the column
  before the ALTER is complete. or,
* The user drops a column, and continues writing values to the column
  while it's being dropped.

Both cases might as well fail with an error because the column is not
found in the base table.

Fixes scylladb/scylladb#24952

(cherry picked from commit b336f282ae)
2025-07-18 10:36:07 +00:00
Calle Wilund
0d61d63e7e utils::http::dns_connection_factory: Use a shared certificate_credentials
Fixes #24447

This factory type, which is really more a data holder/connection producer
per connection instance, creates, if using https, a new certificate_credentials
on every instance. Which when used by S3 client is per client and
scheduling groups.

Which eventually means that we will do a set_system_trust + "cold" handshake
for every tls connection created this way.

This will cause both IO and cold/expensive certificate checking -> possible
stalls/wasted CPU. Since the credentials object in question is literally a
"just trust system", it could very well be shared across the shard.

This PR adds a thread local static cached credentials object and uses this
instead. Could consider moving this to seastar, but maybe this is too much.

Closes scylladb/scylladb#24448

(cherry picked from commit 80feb8b676)

Closes scylladb/scylladb#24461
2025-07-18 09:34:45 +03:00
Tomasz Grabiec
23e365fc7b service: migration_manager: Run group0 barrier in gossip scheduling group
Fixes two issues.

One is potential priority inversion. The barrier will be executed
using scheduling group of the first fiber which triggers it, the rest
will block waiting on it. For example, CQL statements which need to
sync the schema on replica side can block on the barrier triggered by
streaming. That's undesirable. This is theoretical, not proved in the
field.

The second problem is blocking the error path. This barrier is called
from the streaming error handling path. If the streaming concurrency
semaphore is exhausted, and streaming fails due to timeout on
obtaining the permit in check_needs_view_update_path(), the error path
will block too because it will also attempt to obtain the permit as
part of the group0 barrier. Running it in the gossip scheduling group
prevents this.

Fixes #24925

(cherry picked from commit ee2fa58bd6)
2025-07-17 17:25:10 +00:00
Piotr Dulikowski
97659e19b8 auth: fix crash when migration code runs parallel with raft upgrade
The functions password_authenticator::start and
standard_role_manager::start have a similar structure: they spawn a
fiber which invokes a callback that performs some migration until that
migration succeeds. Both handlers set a shared promise called
_superuser_created_promise (those are actually two promises, one for the
password authenticator and the other for the role manager).

The handlers are similar in both cases. They check if auth is in legacy
mode, and behave differently depending on that. If in legacy mode, the
promise is set (if it was not set before), and some legacy migration
actions follow. In auth-on-raft mode, the superuser is attempted to be
created, and if it succeeds then the promise is _unconditionally_ set.

While it makes sense at a glance to set the promise unconditionally,
there is a non-obvious corner case during upgrade to topology on raft.
During the upgrade, auth switches from the legacy mode to auth on raft
mode. Thus, if the callback didn't succeed in legacy mode and then tries
to run in auth-on-raft mode and succeds, it will unconditionally set a
promise that was already set - this is a bug and triggers an assertion
in seastar.

Fix the issue by surrounding the `shared_promise::set_value` call with
an `if` - like it is already done for the legacy case.

Fixes: scylladb/scylladb#24975

Closes scylladb/scylladb#24976

(cherry picked from commit a14b7f71fe)

Closes scylladb/scylladb#25018
2025-07-17 17:55:25 +02:00
Botond Dénes
4eb070b816 Merge '[Backport 2025.2] storage_service: Use utils::chunked_vector to avoid big allocation' from Scylladb[bot]
The following was seen:

```
!WARNING | scylla[6057]:  [shard 12:strm] seastar_memory - oversized allocation: 212992 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at
[Backtrace #0]
void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89
 (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:99
seastar::current_tasktrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:136
seastar::current_backtrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:169
seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:848
seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:911
operator new(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:1706
std::allocator<dht::token_range_endpoints>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/allocator.h:196
 (inlined by) std::allocator_traits<std::allocator<dht::token_range_endpoints> >::allocate(std::allocator<dht::token_range_endpoints>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/alloc_traits.h:515
 (inlined by) std::_Vector_base<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:380
 (inlined by) void std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_realloc_append<dht::token_range_endpoints const&>(dht::token_range_endpoints const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/vector.tcc:596
locator::describe_ring(replica::database const&, gms::gossiper const&, seastar::basic_sstring<char, unsigned int, 15u, true> const&, bool) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:1294
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242
 (inlined by) seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:80
seastar::reactor::do_run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:2635
std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:4684
```

Fix by using chunked_vector.

Fixes #24158

- (cherry picked from commit c5a136c3b5)

Parent PR: #24561

Closes scylladb/scylladb#24891

* github.com:scylladb/scylladb:
  storage_service: Use utils::chunked_vector to avoid big allocation
  utils: chunked_vector: implement erase() for single elements and ranges
  utils: chunked_vector: implement insert() for single-element inserts
2025-07-16 15:58:25 +03:00
Asias He
67375ecf14 storage_service: Use utils::chunked_vector to avoid big allocation
The following was seen:

```
!WARNING | scylla[6057]:  [shard 12:strm] seastar_memory - oversized allocation: 212992 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at
[Backtrace #0]
void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89
 (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:99
seastar::current_tasktrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:136
seastar::current_backtrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:169
seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:848
seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:911
operator new(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:1706
std::allocator<dht::token_range_endpoints>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/allocator.h:196
 (inlined by) std::allocator_traits<std::allocator<dht::token_range_endpoints> >::allocate(std::allocator<dht::token_range_endpoints>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/alloc_traits.h:515
 (inlined by) std::_Vector_base<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:380
 (inlined by) void std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_realloc_append<dht::token_range_endpoints const&>(dht::token_range_endpoints const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/vector.tcc:596
locator::describe_ring(replica::database const&, gms::gossiper const&, seastar::basic_sstring<char, unsigned int, 15u, true> const&, bool) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:1294
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242
 (inlined by) seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:80
seastar::reactor::do_run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:2635
std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:4684
```

Fix by using chunked_vector.

Fixes #24158

Closes scylladb/scylladb#24561

(cherry picked from commit c5a136c3b5)
2025-07-16 07:43:39 +08:00
Avi Kivity
8f65d7e63b utils: chunked_vector: implement erase() for single elements and ranges
Implement using std::rotate() and resize(). The elements to be erased
are rotated to the end, then resized out of existence.

Again we defer optimization for trivially copyable types.

Unit tests are added.

Needed for range_streamer with token_ranges using chunked_vector.

(cherry picked from commit d6eefce145)
2025-07-16 07:43:29 +08:00
Avi Kivity
c6b0bacfb1 utils: chunked_vector: implement insert() for single-element inserts
partition_range_compat's unwrap() needs insert if we are to
use it for chunked_vector (which we do).

Implement using push_back() and std::rotate().

emplace(iterator, args) is also implemented, though the benefit
is diluted (it will be moved after construction).

The implementation isn't optimal - if T is trivially copyable
then using std::memmove() will be much faster that std::rotate(),
but this complex optimization is left for later.

Unit tests are added.

(cherry picked from commit 5301f3d0b5)
2025-07-16 07:43:21 +08:00
Patryk Jędrzejczak
7bb43d812e test: test_zero_token_nodes_multidc: properly handle reads with CL=ONE
The test could fail with RF={DC1: 2, DC2: 0} and CL=ONE when:
- both writes succeeded with the same replica responding first,
- one of the following reads succeeded with the other replica
  responding before it applied mutations from any of the writes.

We fix the test by not expecting reads with CL=ONE to return a row.

We also harden the test by inserting different rows for every pair
(CL, coordinator), where one of the two coordinators is a normal
node from DC1, and the other one is a zero-token node from DC2.
This change makes sure that, for example, every write really
inserts a row.

Fixes scylladb/scylladb#22967

The fix addresses CI flakiness and only changes the test, so it
should be backported.

Closes scylladb/scylladb#23518

(cherry picked from commit 21edec1ace)

Closes scylladb/scylladb#24984
2025-07-15 15:50:21 +02:00
Botond Dénes
9482f45d13 test/cluster/test_read_repair: write 100 rows in trace test
This test asserts that a read repair really happened. To ensure this
happens it writes a single partition after enabling the database_apply
error injection point. For some reason, the write is sometimes reordered
with the error injection and the write will get replicated to both nodes
and no read repair will happen, failing the test.
To make the test less sensitive to such rare reordering, add a
clustering column to the table and write a 100 rows. The chance of *all*
100 of them being reordered with the error injection should be low
enough that it doesn't happen again (famous last words).

Fixes: #24330

Closes scylladb/scylladb#24403

(cherry picked from commit 495f607e73)

Closes scylladb/scylladb#24973
2025-07-15 13:27:31 +03:00
Aleksandra Martyniuk
5debdce91d replica: hold compaction group gate during flush
Destructor of database_sstable_write_monitor, which is created
in table::try_flush_memtable_to_sstable, tries to get the compaction
state of the processed compaction group. If at this point
the compaction group is already stopped (and the compaction state
is removed), e.g. due to concurrent tablet merge, an exception is
thrown and a node coredumps.

Add flush gate to compaction group to wait for flushes in
compaction_group::stop. Hold the gate in seal function in
table::make_memtable_list. seal function is turned into
a coroutine to ensure it won't throw.

Wait until async_gate is closed before flushing, to ensure that
all data is written into sstables. Stop ongoing compactions
beforehand.

Remove unnecessary flush in tablet_storage_group_manager::merge_completion_fiber.
Stop method already flushes the compaction group.

Fixes: #23911.

Closes scylladb/scylladb#24582

(cherry picked from commit 2ec54d4f1a)

Closes scylladb/scylladb#24951
2025-07-15 13:26:39 +03:00
Michael Litvak
15517ba529 tablets: stop storage group on deallocation
When a tablet transitions to a post-cleanup stage on the leaving replica
we deallocate its storage group. Before the storage can be deallocated
and destroyed, we must make sure it's cleaned up and stopped properly.

Normally this happens during the tablet cleanup stage, when
table::cleanup_table is called, so by the time we transition to the next
stage the storage group is already stopped.

However, it's possible that tablet cleanup did not run in some scenario:
1. The topology coordinator runs tablet cleanup on the leaving replica.
2. The leaving replica is restarted.
3. When the leaving replica starts, still in `cleanup` stage, it
   allocates a storage group for the tablet.
4. The topology coordinator moves to the next stage.
5. The leaving replica deallocates the storage group, but it was not
   stopped.

To address this scenario, we always stop the storage group when
deallocating it. Usually it will be already stopped and complete
immediately, and otherwise it will be stopped in the background.

Fixes scylladb/scylladb#24857
Fixes scylladb/scylladb#24828

Closes scylladb/scylladb#24896

(cherry picked from commit fa24fd7cc3)

Closes scylladb/scylladb#24908
2025-07-15 13:25:38 +03:00
Aleksandra Martyniuk
ccfc053dd5 repair: Reduce max row buf size when small table optimization is on
If small_table_optimization is on, a repair works on a whole table
simultaneously. It may be distributed across the whole cluster and
all nodes might participate in repair.

On a repair master, row buffer is copied for each repair peer.
This means that the memory scales with the number of peers.

In large clusters, repair with small_table_optimization leads to OOM.

Divide the max_row_buf_size by the number of repair peers if
small_table_optimization is on.

Use max_row_buf_size to calculate number of units taken from mem_sem.

Fixes: https://github.com/scylladb/scylladb/issues/22244.

Closes scylladb/scylladb#24868

(cherry picked from commit 17272c2f3b)

Closes scylladb/scylladb#24905
2025-07-15 13:24:49 +03:00
Botond Dénes
6749954b2a Merge '[Backport 2025.2] test.py: Fix start 3rd party services' from Scylladb[bot]
Move 3rd party services starting under `try` clause to avoid situation that main process is collapses without going stopping services.
Without this, if something wrong during start it will not trigger execution exit artifacts, so the process will stay forever.

This functionality in 2025.2 and can potentially affect jobs, so backport needed.

Fixes: #24773

- (cherry picked from commit 0ca539e162)

- (cherry picked from commit c6c3e9f492)

Parent PR: #24734

Closes scylladb/scylladb#24774

* github.com:scylladb/scylladb:
  test.py: use unique hostname for Minio
  test.py: Catch possible exceptions during 3rd party services start
2025-07-15 13:23:12 +03:00
Pavel Emelyanov
71e9f5e662 sstables_loader: Fix load-and-stream vs skip-cleanup check
The intention was to fail the REST API call in case --skip-cleanup is
requested for --load-and-stream loading. The corresponding if expression
is checking something else :( despite log message is correct.

Fixes: https://github.com/scylladb/scylladb/issues/24913

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Signed-off-by: Ran Regev <ran.regev@scylladb.com>
(cherry picked from commit bd3bd089e1)

Closes scylladb/scylladb#24947
2025-07-15 13:08:32 +03:00
Yaron Kaikov
f1d4266b7a dist/common/scripts/scylla_sysconfig_setup: fix SyntaxWarning: invalid escape sequence
There are invalid escape sequence warnings where raw strings should be used for the regex patterns

Fixes: https://github.com/scylladb/scylladb/issues/24915

Closes scylladb/scylladb#24916

(cherry picked from commit fdcaa9a7e7)

Closes scylladb/scylladb#24968
2025-07-15 11:06:41 +02:00
Yaron Kaikov
9c0181e813 auto-backport.py: Avoid bot push to existing backport branches
Changed the backport logic so that the bot only pushes the backport branch if it does not already exist in the remote fork.
If the branch exists, the bot skips the push, allowing only users to update (force-push) the branch after the backport PR is open.

Fixes: https://github.com/scylladb/scylladb/issues/24953

Closes scylladb/scylladb#24954

(cherry picked from commit ed7c7784e4)

Closes scylladb/scylladb#24967
2025-07-15 10:27:28 +02:00
Aleksandra Martyniuk
6f8b378e80 nodetool: repair: skip tablet keyspaces
Currently, nodetool repair command repairs both vnode and tablet keyspaces
if no keyspace is specified. We should use this command to repair
only vnode keyspaces, but this isn't easily accessible - we have to
explicitly run repair only on vnode keyspaces.

nodetool repair skips tablet keyspaces unless a tablet keyspace
is explicitely passed as an argument.

Fixes: #24040.

Closes scylladb/scylladb#24042
2025-07-15 06:36:08 +03:00
Jenkins Promoter
afadcc648d Update pgo profiles - aarch64 2025-07-15 05:39:11 +03:00
Jenkins Promoter
2397a93410 Update pgo profiles - x86_64 2025-07-15 05:23:22 +03:00
Yaron Kaikov
6e06c57fc7 packaging: add ps command to dependancies
ScyllaDB container image doesn't have ps command installed, while this command is used by perftune.py script shipped within the same image. This breaks node and container tuning in Scylla Operator.

Fixes: #24827

Closes scylladb/scylladb#24830

(cherry picked from commit 66ff6ab6f9)

Closes scylladb/scylladb#24955
2025-07-14 14:26:38 +03:00
Gleb Natapov
ece8a8b3bc api: unregister raft_topology_get_cmd_status on shutdown
In c8ce9d1c60 we introduced
raft_topology_get_cmd_status REST api but the commit forgot to
unregister the handler during shutdown.

Fixes #24910

Closes scylladb/scylladb#24911

(cherry picked from commit 89f2edf308)

Closes scylladb/scylladb#24922
2025-07-14 11:39:42 +02:00
Avi Kivity
5bed6c7a7f storage_proxy: avoid large allocation when storing batch in system.batchlog
Currently, when computing the mutation to be stored in system.batchlog,
we go through data_value. In turn this goes through `bytes` type
(#24810), so it causes a large contiguous allocation if the batch is
large.

Fix by going through the more primitive, but less contiguous,
atomic_cell API.

Fixes #24809.

Closes scylladb/scylladb#24811

(cherry picked from commit 60f407bff4)

Closes scylladb/scylladb#24845
2025-07-13 14:11:01 +03:00
Patryk Jędrzejczak
605106a9c6 Merge '[Backport 2025.2] Make it easier to debug stuck raft topology operation.' from Scylladb[bot]
The series adds more logging and provides new REST api around topology command rpc execution to allow easier debugging of stuck topology operations.

Backport since we want to have in the production as quick as possible.

Fixes #24860

- (cherry picked from commit c8ce9d1c60)

- (cherry picked from commit 4e6369f35b)

Parent PR: #24799

Closes scylladb/scylladb#24879

* https://github.com/scylladb/scylladb:
  topology coordinator: log a start and an end of topology coordinator command execution at info level
  topology coordinator: add REST endpoint to query the status of ongoing topology cmd rpc
2025-07-09 12:58:14 +02:00
Piotr Dulikowski
e4dde34f52 Merge '[Backport 2025.2] main: don't start maintenance auth service if not enabled' from Scylladb[bot]
In f96d30c2b5
we introduced the maintenance service, which is an additional
instance of auth::service. But this service has a somewhat
confusing 2-level startup mechanism: it's initialized with
sharded<Service>::start and then auth::service::start
(different method with the same name to confuse even more).

When maintenance_socket was disabled (default setting), the code
did only the first part of the startup. This registered a config
observer but didn't create a permission_cache instance.
As a result, a crash on SIGHUP when config is reloaded can occur.

Fixes: https://github.com/scylladb/scylladb/issues/24528
Backport: all not eol versions since 6.0 and 2025.1

- (cherry picked from commit 97c60b8153)

- (cherry picked from commit dd01852341)

Parent PR: #24527

Closes scylladb/scylladb#24570

* github.com:scylladb/scylladb:
  test: add test for live updates of permissions cache config
  main: don't start maintenance auth service if not enabled
2025-07-09 09:47:57 +02:00
Piotr Dulikowski
8ebd67e1c3 Merge '[Backport 2025.2] batchlog_manager: abort replay of a failed batch on shutdown or node down' from Scylladb[bot]
When replaying a failed batch and sending the mutation to all replicas, make the write response handler cancellable and abort it on shutdown or if some target is marked down. also set a reasonable timeout so it gets aborted if it's stuck for some other unexpected reason.

Previously, the write response handler is not cancellable and has no timeout. This can cause a scenario where some write operation by the batchlog manager is stuck indefinitely, and node shutdown gets stuck as well because it waits for the batchlog manager to complete, without aborting the operation.

backport to relevant versions since the issue can cause node shutdown to hang

Fixes scylladb/scylladb#24599

- (cherry picked from commit 8d48b27062)

- (cherry picked from commit fc5ba4a1ea)

- (cherry picked from commit 7150632cf2)

- (cherry picked from commit 74a3fa9671)

- (cherry picked from commit a9b476e057)

- (cherry picked from commit d7af26a437)

Parent PR: #24595

Closes scylladb/scylladb#24880

* github.com:scylladb/scylladb:
  test: test_batchlog_manager: batchlog replay includes cdc
  test: test_batchlog_manager: test batch replay when a node is down
  batchlog_manager: set timeout on writes
  batchlog_manager: abort writes on shutdown
  batchlog_manager: create cancellable write response handler
  storage_proxy: add write type parameter to mutate_internal
2025-07-08 15:39:58 +02:00
Michael Litvak
a26d8f72b6 test: test_batchlog_manager: batchlog replay includes cdc
Add a new test that verifies that when replaying batch mutations from
the batchlog, the mutations include cdc augmentation if needed.

This is done in order to verify that it works currently as expected and
doesn't break in the future.

(cherry picked from commit d7af26a437)
2025-07-08 06:25:03 +00:00
Michael Litvak
9012357a4b test: test_batchlog_manager: test batch replay when a node is down
Add a test of the batchlog manager replay loop applying failed batches
while some replica is down.

The test reproduces an issue where the batchlog manager tries to replay
a failed batch, doesn't get a response from some replica, and becomes
stuck.

It verifies that the batchlog manager can eventually recover from this
situation and continue applying failed batches.

(cherry picked from commit a9b476e057)
2025-07-08 06:25:03 +00:00
Michael Litvak
8fa2520e15 batchlog_manager: set timeout on writes
Set a timeout on writes of replayed batches by the batchlog manager.

We want to avoid having infinite timeout for the writes in case it gets
stuck for some unexpected reason.

The timeout is set to be high enough to allow any reasonable write to
complete.

(cherry picked from commit 74a3fa9671)
2025-07-08 06:25:03 +00:00
Michael Litvak
ba11e8ebdd batchlog_manager: abort writes on shutdown
On shutdown of batchlog manager, abort all writes of replayed batches
by the batchlog manager.

To achieve this we set the appropriate write_type to BATCH, and on
shutdown cancel all write handlers with this type.

(cherry picked from commit 7150632cf2)
2025-07-08 06:25:02 +00:00
Michael Litvak
4e2b587b4d batchlog_manager: create cancellable write response handler
When replaying a batch mutation from the batchlog manager and sending it
to all replicas, create the write response handler as cancellable.

To achieve this we define a new wrapper type for batchlog mutations -
batchlog_replay_mutation, and this allows us to overload
create_write_response_handler for this type. This is similar to how it's
done with hint_wrapper and read_repair_mutation.

(cherry picked from commit fc5ba4a1ea)
2025-07-08 06:25:02 +00:00
Michael Litvak
f8b4d1c1cd storage_proxy: add write type parameter to mutate_internal
Currently mutate_internal has a boolean parameter `counter_write` that
indicates whether the write is of counter type or not.

We replace it with a more general parameter that allows to indicate the
write type.

It is compatible with the previous behavior - for a counter write, the
type COUNTER is passed, and otherwise a default value will be used
as before.

(cherry picked from commit 8d48b27062)
2025-07-08 06:25:02 +00:00
Gleb Natapov
71f59e046b topology coordinator: log a start and an end of topology coordinator command execution at info level
Those calls a relatively rare and the output may help to analyze issues
in production.

(cherry picked from commit 4e6369f35b)
2025-07-08 06:23:48 +00:00
Gleb Natapov
ad91198417 topology coordinator: add REST endpoint to query the status of ongoing topology cmd rpc
The topology coordinator executes several topology cmd rpc against some nodes
during a topology change. A topology operation will not proceed unless
rpc completes (successfully or not), but sometimes it appears that it
hangs and it is hard to tell on which nodes it did not complete yet.
Introduce new REST endpoint that can help with debugging such cases.
If executed on the topology coordinator it returns currently running
topology rpc (if any) and a list of nodes that did not reply yet.

(cherry picked from commit c8ce9d1c60)
2025-07-08 06:23:48 +00:00
Marcin Maliszkiewicz
8e783cc23a test: add test for live updates of permissions cache config
(cherry picked from commit dd01852341)
2025-07-07 10:07:02 +02:00
Marcin Maliszkiewicz
b1e75eba65 main: don't start maintenance auth service if not enabled
In f96d30c2b5
we introduced the maintenance service, which is an additional
instance of auth::service. But this service has a somewhat
confusing 2-level startup mechanism: it's initialized with
sharded<Service>::start and then auth::service::start
(different method with the same name to confuse even more).

When maintenance_socket was disabled (default setting), the code
did only the first part of the startup. This registered a config
observer but didn't create a permission_cache instance.
As a result, a crash on SIGHUP when config is reloaded can occur.

(cherry picked from commit 97c60b8153)
2025-07-07 10:06:28 +02:00
Patryk Jędrzejczak
370165cca5 docs: handling-node-failures: fix typo
Replacing "from" is incorrect. The typo comes from recently
merged #24583.

Fixes #24732

Requires backport to 2025.2 since #24583 has been backported to 2025.2.

Closes scylladb/scylladb#24733

(cherry picked from commit fa982f5579)

Closes scylladb/scylladb#24831
2025-07-04 19:32:09 +02:00
Gleb Natapov
24a317460d gossiper: do not assume that id->ip mapping is available in failure_detector_loop_for_node
failure_detector_loop_for_node may be started on a shard before id->ip
mapping is available there. Currently the code treats missing mapping
as an internal error, but it uses its result for debug output only, so
lets relax the code to not assume the mapping is available.

Fixes #23407

Closes scylladb/scylladb#24614

(cherry picked from commit a221b2bfde)

Closes scylladb/scylladb#24768
2025-07-04 16:09:50 +02:00
Michał Chojnowski
08b117425e utils/alien_worker: fix a data race in submit()
We move a `seastar::promise` on the external worker thread,
after the matching `seastar::future` was returned to the shard.

That's illegal. If the `promise` move occurs concurrently with some
operation (move, await) on the `future`, it becomes a data race
which could cause various kinds of corruption.

This patch fixes that by keeping the promise at a stable address
on the shard (inside a coroutine frame) and only passing through
the worker.

Fixes #24751

Closes scylladb/scylladb#24752

(cherry picked from commit a29724479a)

Closes scylladb/scylladb#24777
2025-07-03 11:02:30 +03:00
Łukasz Paszkowski
cf36de2c9a compaction_manager: cancel submission timer on drain
The `drain` method, cancels all running compactions and moves the
compaction manager into the disabled state. To move it back to
the enabled state, the `enable` method shall be called.

This, however, throws an assertion error as the submission time is
not cancelled and re-enabling the manager tries to arm the armed timer.

Thus, cancel the timer, when calling the drain method to disable
the compaction manager.

Fixes https://github.com/scylladb/scylladb/issues/24504

All versions are affected. So it's a good candidate for a backport.

Closes scylladb/scylladb#24505

(cherry picked from commit a9a53d9178)

Closes scylladb/scylladb#24590
2025-07-03 10:19:16 +03:00
Tomasz Grabiec
440985387e Merge '[Backport 2025.2] repair: postpone repair until topology is not busy ' from Scylladb[bot]
Currently, repair_service::repair_tablets starts repair if there
is no ongoing tablet operations. The check does not consider global
topology operations, like tablet resize finalization.

Hence, if:
- topology is in the tablet_resize_finalization state;
- repair starts (as there is no tablet transitions) and holds the erm;
- resize finalization finishes;

then the repair sees a topology state different than the actual -
it does not see that the storage groups were already split.
Repair code does not handle this case and it results with
on_internal_error.

Start repair when topology is not busy. The check isn't atomic,
as it's done on a shard 0. Thus, we compare the topology versions
to ensure that the business check is valid.

Fixes: https://github.com/scylladb/scylladb/issues/24195.

Needs backport to all branches since they are affected

- (cherry picked from commit df152d9824)

- (cherry picked from commit 83c9af9670)

Parent PR: #24202

Closes scylladb/scylladb#24781

* github.com:scylladb/scylladb:
  test: add test for repair and resize finalization
  repair: postpone repair until topology is not busy
2025-07-02 13:18:22 +02:00
Andrei Chekun
4bc33c027d test.py: use unique hostname for Minio
To avoid situation that port is occupied on localhost, use unique
hostname for Minio

(cherry picked from commit c6c3e9f492)
2025-07-02 11:12:52 +02:00
Andrei Chekun
8c0798fe00 test.py: Catch possible exceptions during 3rd party services start
With this change if something will go wrong during starting services,
they are still will be shuted down on the finally clause. Without it can
hang forever

(cherry picked from commit 0ca539e162)
2025-07-02 11:12:16 +02:00
Botond Dénes
e8c9a412bc docs: cql/types.rst: remove reference to frozen-only UDTs
ScyllaDB supports non-frozen UDTs since 3.2, no need to keep referencing
this limitation in the current docs. Replace the description of the
limitation with general description of frozen semantics for UDTs.

Fixes: #22929

Closes scylladb/scylladb#24763

(cherry picked from commit 37ef9efb4e)

Closes scylladb/scylladb#24782
2025-07-02 12:10:35 +03:00
Ferenc Szili
f24d71ab8c logging: Add row count to large partition warning message
When writing large partitions, that is: partitions with size or row count
above a configurable threshold, ScyllaDB outputs a warning to the log:

WARN ... large_data - Writing large partition test/test:  (1200031 bytes) to me-3glr_0xkd_54jip2i8oqnl7hk8mu-big-Data.db

This warning contains the information about the size of the partition,
but it does not contain the number of rows written. This can lead to
confusion because in cases where the warning was written because of the
row count being larger than the threshold, but the partition size is below
the threshold, the warning will only contain the partition size in bytes,
leading the user to believe the warning was output because of the
partition size, when in reality it was the row count that triggered the
warning. See #20125

This change adds a size_desc argument to cql_table_large_data_handler::try_record(),
which will contain the description of the size of the object written.
This method is used to output warnings for large partitions, row counts,
row sizes and cell sizes. This change does not modify the warning message
for row and cell sizes, only for partition size and row count.

The warning for large partitions and row counts will now look like this:

WARN ... large_data - Writing large partition test/test:  (1200031 bytes/100001 rows) to me-3glr_0xkd_54jip2i8oqnl7hk8mu-big-Data.db

Closes scylladb/scylladb#22010

(cherry picked from commit 96267960f8)

Closes scylladb/scylladb#24685
2025-07-02 11:22:40 +03:00
Botond Dénes
3560d9ad82 Merge '[Backport 2025.2] sstables: purge SCYLLA_ASSERT from the sstable read/parse paths' from Scylladb[bot]
Introduce `sstables::parse_assert()`, to replace `SCYLLA_ASSERT()` on the read/parse path. SSTables can get corrupt for various reasons, some outside of the database's control. A bad SSTable should not bring down the database, the parsing should simply be aborted, with as much information printed as possible for the investigation of the nature of the corruption. The newly introduced `parse_assert()` uses `on_internal_error()` under the hood, which prints a backtrace and optionally allows for aborting when on the error, to generate a coredump.

Fixes https://github.com/scylladb/scylladb/issues/20845

We just hit another case of `SCYLLA_ASSERT()` triggering due to corrupt sstables bringing down nodes in the field, should be backported to all releases, so we don't hit this in the future

- (cherry picked from commit 27e26ed93f)

- (cherry picked from commit bce89c0f5e)

Parent PR: #24534

Closes scylladb/scylladb#24686

* github.com:scylladb/scylladb:
  sstables: replace SCYLLA_ASSERT() with parse_assert() on the read path
  sstables/exceptions: introduce parse_assert()
2025-07-02 11:20:48 +03:00
Botond Dénes
9dd8dc5357 Merge '[Backport 2025.2] Fix for cassandra role gets recreated after DROP ROLE' from Scylladb[bot]
This patchset fixes regression introduced by 7e749cd848 when we started re-creating default superuser role and password from the config, even if new custom superuser was created by the user.

Now we'll check, first with CL LOCAL_ONE if there is a need to create default superuser role or password, confirm
it with CL QUORUM and only then atomically create role or password.

If server is started without cluster quorum we'll skip creating role or password.

Fixes https://github.com/scylladb/scylladb/issues/24469
Backport: all versions since 2024.2

- (cherry picked from commit 68fc4c6d61)

- (cherry picked from commit c96c5bfef5)

- (cherry picked from commit 2e2ba84e94)

- (cherry picked from commit f85d73d405)

- (cherry picked from commit d9ec746c6d)

- (cherry picked from commit a3bb679f49)

- (cherry picked from commit 67a4bfc152)

- (cherry picked from commit 0ffddce636)

- (cherry picked from commit 5e7ac34822)

Parent PR: #24451

Closes scylladb/scylladb#24694

* github.com:scylladb/scylladb:
  test: auth_cluster: add test for password reset procedure
  auth: cache roles table scan during startup
  test: auth_cluster: add test for replacing default superuser
  test: pylib: add ability to specify default authenticator during server_start
  auth: split auth-v2 logic for adding default superuser password
  auth: split auth-v2 logic for adding default superuser role
  auth: ldap: fix waiting for underlying role manager
  auth: wait for default role creation before starting authorizer and authenticator
2025-07-02 10:40:56 +03:00
Botond Dénes
d4d464e21e tools/scylla-nodetool: backup: add --move-files parameter
Allow opting in for backup to move the files instead of copying them.

Fixes: https://github.com/scylladb/scylladb/issues/24372

Closes scylladb/scylladb#24503

(cherry picked from commit e715a150b9)

Closes scylladb/scylladb#24721
2025-07-02 10:40:04 +03:00
Botond Dénes
9816cdb901 Merge '[Backport 2025.2] mutation: check key of inserted rows' from Scylladb[bot]
Make sure the keys are full prefixes as it is expected to be the case for rows. At severeal occasions we have seen empty row keys make their ways into the sstables, despite the fact that they are not allowed by the CQL frontend. This means that such empty keys are possibly results of memory corruption or use-after-{free,copy} errors. The source of the corruption is impossible to pinpoint when the empty key is discovered in the sstable. So this patch adds checks for such keys to places where mutations are built: when building or unserializing mutations.

Fixes: https://github.com/scylladb/scylladb/issues/24506

Not a typical backport candidate (not a bugfix or regression fix), but we should still backport so we have the additional checks deployed to existing production clusters.

- (cherry picked from commit 8b756ea837)

- (cherry picked from commit ab96c703ff)

Parent PR: #24497

Closes scylladb/scylladb#24742

* github.com:scylladb/scylladb:
  mutation: check key of inserted rows
  compound: optimize is_full() for single-component types
2025-07-02 10:27:13 +03:00
Avi Kivity
3d3976c862 repair: row_level: unstall to_repair_rows_on_wire() destroying its input
to_repair_rows_on_wire() moves the contents of its input std::list
and is careful to yield after each element, but the final destruction
of the input list still deals with all of the list elements without
yielding. This is expensive as not all contents of repair_row are moved
(_dk_with_hash is of type lw_shared_ptr<const decorated_key_with_hash>).

To fix, destroy each row element as we move along. This is safe as we
own the input and don't reference row_list other than for the iteration.

Fixes #24725.

Closes scylladb/scylladb#24726

(cherry picked from commit 6aa71205d8)

Closes scylladb/scylladb#24769
2025-07-02 10:06:30 +03:00
Botond Dénes
7786167998 Merge '[Backport 2025.2] encryption_at_rest_test: Fix some spurious errors' from Scylladb[bot]
Fixes #24574

* Ensure we close the embedded load_cache objects on encryption shutdown, otherwise we can, in unit testing, get destruction of these while a timer is still active -> assert
* Add extra exception handling to `network_error_test_helper`, so even if test framework might exception-escape, we properly stop the network proxy to avoid use after free.

- (cherry picked from commit ee98f5d361)

- (cherry picked from commit 8d37e5e24b)

Parent PR: #24633

Closes scylladb/scylladb#24770

* github.com:scylladb/scylladb:
  encryption_at_rest_test: Add exception handler to ensure proxy stop
  encryption: Ensure stopping timers in provider cache objects
2025-07-02 10:03:52 +03:00
Aleksandra Martyniuk
cbce0ed911 test: add test for repair and resize finalization
Add test that checks whether repair does not start if there is an
ongoing resize finalization.

(cherry picked from commit 83c9af9670)
2025-07-01 20:26:21 +00:00
Aleksandra Martyniuk
eb96ef8ce7 repair: postpone repair until topology is not busy
Currently, repair_service::repair_tablets starts repair if there
is no ongoing tablet operations. The check does not consider global
topology operations, like tablet resize finalization. This may cause
a data race and unexpected behavior.

Start repair when topology is not busy.

(cherry picked from commit df152d9824)
2025-07-01 20:26:21 +00:00
Patryk Jędrzejczak
e3952fbd35 test: test_raft_recovery_user_data: disable hinted handoff
The test is currently flaky, writes can fail with "Too many in flight
hints: 10485936". See scylladb/scylladb#23565 for more details.

We suspect that scylladb/scylladb#23565 is caused by an infrastructure
issue - slow disks on some machines we run CI jobs on.

Since the test fails often and investigation doesn't seem to be easy,
we first deflake the test in this patch by disabling hinted handoff.

For replacing nodes, we provide `cfg` because there should have been
`cfg` in the first place. The test was correct anyway because:
- `tablets_mode_for_new_keyspaces` is set to `true` by default in
  test/cluster/suite.yaml,
- `endpoint_snitch` is set to `GossipingPropertyFileSnitch` by default
  if the property file is provided in `ScyllaServer.__init__`.

Ref scylladb/scylladb#23565

We should backport this patch to 2025.2 because this test is also flaky
on CI jobs using 2025.2. Older branches don't have this test.

Closes scylladb/scylladb#24364

(cherry picked from commit 8756c233e0)

Fixes #24756

Closes scylladb/scylladb#24757
2025-07-01 19:14:22 +02:00
Calle Wilund
9a60a2adce encryption_at_rest_test: Add exception handler to ensure proxy stop
If boost test is run such that we somehow except even in a test macro
such as BOOST_REQUIRE_THROW, we could end up not stopping the net proxy
used, causing a use after free.

(cherry picked from commit 8d37e5e24b)
2025-07-01 15:12:50 +00:00
Calle Wilund
3380479455 encryption: Ensure stopping timers in provider cache objects
utils::loading cache has a timer that can, if we're unlucky, be runnnig
while the encryption context/extensions referencing the various host
objects containing them are destroyed in the case of unit testing.

Add a stop phase in encryption context shutdown closing the caches.

(cherry picked from commit ee98f5d361)
2025-07-01 15:12:50 +00:00
Anna Stuchlik
0d7d983133 doc: extend 2025.2 upgrade with a note about consistent topology updates
This commit adds a note that the user should enable consistent topology updates before upgrading
to 2025.2 if they didn't do it (for some reason) when previously upgrading to version 2025.1.

Fixes https://github.com/scylladb/scylladb/issues/24467

Closes scylladb/scylladb#24468

(cherry picked from commit e2b7302183)

Closes scylladb/scylladb#24523
2025-07-01 12:33:31 +03:00
Avi Kivity
dd509b9513 Merge '[Backport 2025.2] memtable: ensure _flushed_memory doesn't grow above total_memory' from Scylladb[bot]
`dirty_memory_manager` tracks two quantities about memtable memory usage:
"real" and "unspooled" memory usage.

"real" is the total memory usage (sum of `occupancy().total_space()`)
by all memtable LSA regions, plus a upper-bound estimate of the size of
memtable data which has already moved to the cache region but isn't
evictable (merged into the cache) yet.

"unspooled" is the difference between total memory usage by all memtable
LSA regions, and the total flushed memory (sum of `_flushed_memory`)
of memtables.

`dirty_memory_manager` controls the shares of compaction and/or blocks
writes when these quantities cross various thresholds.

"Total flushed memory" isn't a well defined notion,
since the actual consumption of memory by the same data can vary over
time due to LSA compactions, and even the data present in memtable can
change over the course of the flush due to removals of outdated MVCC versions.
So `_flushed_memory` is merely an approximation computed by `flush_reader`
based on the data passing through it.

This approximation is supposed to be a conservative lower bound.
In particular, `_flushed_memory` should be not greater than
`occupancy().total_space()`. Otherwise, for example, "unspooled" memory
could become negative (and/or wrap around) and weird things could happen.
There is an assertion in `~flush_memory_accounter` which checks that
`_flushed_memory < occupancy().total_space()` at the end of flush.

But it can fail. Without additional treatment, the memtable reader sometimes emits
data which is already deleted. (In particular, it emites rows covered by
a partition tombstone in a newer MVCC version.)
This data is seen by `flush_reader` and accounted in `_flushed_memory`.
But this data can be garbage-collected by the `mutation_cleaner` later during the
flush and decrease `total_memory` below `_flushed_memory`.

There is a piece of code in `mutation_cleaner` intended to prevent that.
If `total_memory` decreases during a `mutation_cleaner` run,
`_flushed_memory` is lowered by the same amount, just to preserve the
asserted property. (This could also make `_flushed_memory` quite inaccurate,
but that's considered acceptable).

But that only works if `total_memory` is decreased during that run. It doesn't
work if the `total_memory` decrease (enabled by the new allocator holes made
by `mutation_cleaner`'s garbage collection work) happens asynchronously
(due to memory reclaim for whatever reason) after the run.

This patch fixes that by tracking the decreases of `total_memory` closer to the
source. Instead of relying on `mutation_cleaner` to notify the memtable if it
lowers `total_memory`, the memtable itself listens for notifications about
LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's
estimate of flushed memory decreased by the change in `total_memory` since the
beginning of flush (if it was positive), and it keeps the amount of "spooled"
memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`.

Fixes scylladb/scylladb#21413

Backport candidate because it fixes a crash that can happen in existing stable branches.

- (cherry picked from commit 7d551f99be)

- (cherry picked from commit 975e7e405a)

Parent PR: #21638

Closes scylladb/scylladb#24604

* github.com:scylladb/scylladb:
  memtable: ensure _flushed_memory doesn't grow above total memory usage
  replica/memtable: move region_listener handlers from dirty_memory_manager to memtable
2025-07-01 12:31:25 +03:00
Avi Kivity
9ccc96bf05 tools: optimized_clang: make it work in the presence of a scylladb profile
optimized_clang.sh trains the compiler using profile-guided optimization
(pgo). However, while doing that, it builds scylladb using its own profile
stored in pgo/profiles and decompressed into build/profile.profdata. Due
to the funky directory structure used for training the compiler, that
path is invalid during the training and the build fails.

The workaround was to build on a cloud machine instead of a workstation -
this worked because the cloud machine didn't have git-lfs installed, and
therefore did not see the stored profile, and the whole mess was averted.

To make this work on a machine that does have access to stored profiles,
disable use of the stored profile even if it exists.

Fixes #22713

Closes scylladb/scylladb#24571

(cherry picked from commit 52f11e140f)

Closes scylladb/scylladb#24621
2025-07-01 12:30:58 +03:00
Michał Chojnowski
50736e9740 test_sstable_compression_dictionaries_basic.py: fix a flaky check
test_dict_memory_limit trains new dictionaries and checks (via metrics)
that the old dictionaries are appropriately cleaned up.
The problem is that the cleanup is asynchronous (because the lifetimes
are handled by foreign_ptr, which sends the destructor call
to the owner shard asynchronously), so the metrics might be
checked a few milliseconds before the old dictionary is cleaned up.

The dict lifetimes are lazy on purpose, the right thing to do is
to just let the test retry the check.

Fixes scylladb/scylladb#24516

Closes scylladb/scylladb#24526

(cherry picked from commit cace55aaaf)

Closes scylladb/scylladb#24653
2025-07-01 12:30:25 +03:00
Avi Kivity
ee733c4d38 Merge '[Backport 2025.2] generic_server: fix connections semaphore config observer' from Scylladb[bot]
In ed3e4f33fd we introduced new connection throttling feature which is controlled by uninitialized_connections_semaphore_cpu_concurrency config. But live updating of it was broken, this patch fixes it.

When the temporary value from observer() is destroyed, it disconnects from updateable_value, so observation stops right away. We need to retain the observer.

Backport: to 2025.2 where this feature was added
Fixes: https://github.com/scylladb/scylladb/issues/24557

- (cherry picked from commit c6a25b9140)

- (cherry picked from commit 45392ac29e)

- (cherry picked from commit 68ead01397)

Parent PR: #24484

Closes scylladb/scylladb#24679

* github.com:scylladb/scylladb:
  test: add test for live updates of generic server config
  utils: don't allow do discard updateable_value observer
  generic_server: fix connections semaphore config observer
2025-07-01 12:29:53 +03:00
Szymon Malewski
3bac46a18d utils/exceptions.cc: Added check for exceptions::request_timeout_exception in is_timeout_exception function.
It solves the issue, where in some cases a timeout exceptions in CAS operations are logged incorrectly as a general failure.

Fixes #24591

Closes scylladb/scylladb#24619

(cherry picked from commit f28bab741d)

Closes scylladb/scylladb#24687
2025-07-01 12:29:27 +03:00
Lakshmi Narayanan Sreethar
adab525151 utils/big_decimal: fix scale overflow when parsing values with large exponents
The exponent of a big decimal string is parsed as an int32, adjusted for
the removed fractional part, and stored as an int32. When parsing values
like `1.23E-2147483647`, the unscaled value becomes `123`, and the scale
is adjusted to `2147483647 + 2 = 2147483649`. This exceeds the int32
limit, and since the scale is stored as an int32, it overflows and wraps
around, losing the value.

This patch fixes that the by parsing the exponent as an int64 value and
then adjusting it for the fractional part. The adjusted scale is then
checked to see if it is still within int32 limits before storing. An
exception is thrown if it is not within the int32 limits.

Note that strings with exponents that exceed the int32 range, like
`0.01E2147483650`, were previously not parseable as a big decimal. They
are now accepted if the final adjusted scale fits within int32 limits.
For the above value, unscaled_value = 1 and scale = -2147483648, so it
is now accepted. This is in line with how Java's `BigDecimal` parses
strings.

Fixes: #24581

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#24640

(cherry picked from commit 279253ffd0)

Closes scylladb/scylladb#24692
2025-07-01 12:28:55 +03:00
Botond Dénes
5f45cf1683 test/boost/memtable_test: only inject error for test table
Currently the test indiscriminately injects failures into the flushes of
any table, via the IO extension mechanism. The tests want to check that
the node correctly handles the IO error by self isolating, however the
indiscriminate IO errors can have unintended consequences when they hit
raft, leading to disorderly shutdown and failure of the tests. Testing
raft's resiliency to IO errors if of course worth doing, but it is not
the goal of this particular test, so to avoid the fallout, the IO errors
are limited to the test tables only.

Fixes: https://github.com/scylladb/scylladb/issues/24637

Closes scylladb/scylladb#24638

(cherry picked from commit ee6d7c6ad9)

Closes scylladb/scylladb#24743
2025-07-01 12:28:05 +03:00
Avi Kivity
5e4941a74b Merge '[Backport 2025.2] sstables/mx/writer: handle non-full prefix row keys' from Scylladb[bot]
Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written, consequently if the key is empty, it is omitted entirely.
When parsing sstables, the parsing code unconditionally parses a full prefix.
This mis-match results in parsing failures, as the parser parses part of the row content as a key resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions.

Introduce a new system table: `system.corrupt_data` and infrastructure similar to `large_data_handler`: `corrupt_data_handler` which abstracts how corrupt data is handled. The sstable writer now passes rows such corrupt keys to the corrupt data handler. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery.

Add a full-stack test which checks that rows with bad keys are correctly handled.

Fixes: https://github.com/scylladb/scylladb/issues/24489

The bug is present in all versions, has to be backported to all supported versions.

- (cherry picked from commit 92b5fe8983)

- (cherry picked from commit 0753643606)

- (cherry picked from commit b0d5462440)

- (cherry picked from commit 093d4f8d69)

- (cherry picked from commit 678deece88)

- (cherry picked from commit 64f8500367)

- (cherry picked from commit b931145a26)

- (cherry picked from commit 3e1c50e9a7)

- (cherry picked from commit 46ff7f9c12)

- (cherry picked from commit ebd9420687)

- (cherry picked from commit aae212a87c)

- (cherry picked from commit 592ca789e2)

- (cherry picked from commit edc2906892)

Parent PR: #24492

Closes scylladb/scylladb#24744

* github.com:scylladb/scylladb:
  test/boost/sstable_datafile_test: add test for corrupt data
  sstables/mx/writer: handler rows with empty keys
  test/lib/cql_assertions: introduce columns_assertions
  sstables: add corrupt_data_handler to sstables::sstables
  tools/scylla-sstable: make large_data_handler a local
  db: introduce corrupt_data_handler
  mutation: introduce frozen_mutation_fragment_v2
  mutation/mutation_partition_view: read_{clustering,static}_row(): return row type
  mutation/mutation_partition_view: extract de-ser of {clustering,static} row
  idl-compiler.py: generate skip() definition for enums serializers
  idl: extract full_position.idl from position_in_partition.idl
  db/system_keyspace: add apply_mutation()
  db/system_keyspace: introduce the corrupt_data table
2025-07-01 12:27:01 +03:00
Gleb Natapov
31ed717afb storage_proxy: retry paxos repair even if repair write succeeded
After paxos state is repaired in begin_and_repair_paxos we need to
re-check the state regardless if write back succeeded or not. This
is how the code worked originally but it was unintentionally changed
when co-routinized in 61b2e41a23.

Fixes #24630

Closes scylladb/scylladb#24651

(cherry picked from commit 5f953eb092)

Closes scylladb/scylladb#24703
2025-07-01 10:15:12 +02:00
Abhinav Jha
160c937efe group0: modify start_operation logic to account for synchronize phase race condition
In the present scenario, the bootstrapping node undergoes synchronize phase after
initialization of group0, then enters post_raft phase and becomes fully ready for
group0 operations. The topology coordinator is agnostic of this and issues stream
ranges command as soon as the node successfully completes `join_group0`. Although for
a node booting into an already upgraded cluster, the time duration for which, node
remains in synchronize phase is negligible but this race condition causes trouble in a
small percentage of cases, since the stream ranges operation fails and node fails to bootstrap.

This commit addresses this issue and updates the error throw logic to account for this
edge case and lets the node wait (with timeouts) for synchronize phase to get over instead of throwing
error.

A regression test is also added to confirm the working of this code change. The test adds a
wait in synchronize phase for newly joining node and releases only after the program counter
reaches the synchronize case in the `start_operation` function. Hence it indicates that in the
updated code, the start_operation will wait for the node to get done with the
synchronize phase instead of throwing error.

This PR fixes a bug. Hence we need to backport it.

Fixes: scylladb/scylladb#23536

Closes scylladb/scylladb#23829

(cherry picked from commit 5ff693eff6)

Closes scylladb/scylladb#24628
2025-07-01 10:10:55 +02:00
Jenkins Promoter
0bf8fe4778 Update pgo profiles - aarch64 2025-07-01 04:30:55 +03:00
Jenkins Promoter
a08ff869f3 Update pgo profiles - x86_64 2025-07-01 04:07:23 +03:00
Jenkins Promoter
fe22df0af2 Update ScyllaDB version to: 2025.2.1 2025-06-30 23:59:09 +03:00
Marcin Maliszkiewicz
19b1922362 test: auth_cluster: add test for password reset procedure
(cherry picked from commit aef531077b)
2025-06-30 20:50:15 +02:00
Marcin Maliszkiewicz
ad759eb141 auth: cache roles table scan during startup
It may be particularly beneficial during connection
storms on startup. In such cases, it can happen that
none of the user's read requests succeed, preventing
the cache from being populated. This, in turn, makes
it more difficult for subsequent reads to
succeed, reducing resiliency against such storms.

(cherry picked from commit 887c57098e)
2025-06-30 20:50:15 +02:00
Marcin Maliszkiewicz
5346d959ff test: auth_cluster: add test for replacing default superuser
This test demonstrates creating custom superuser guide:
https://opensource.docs.scylladb.com/stable/operating-scylla/security/create-superuser.html

(cherry picked from commit d9223b61a2)
2025-06-30 20:50:15 +02:00
Marcin Maliszkiewicz
6568065141 test: pylib: add ability to specify default authenticator during server_start
Sometimes we may not want to use default cassandra role for
control connection, especially when we test dropping default role.

(cherry picked from commit 08bf7237f066cead133bf0cac9bba215f238070a)
2025-06-30 20:50:15 +02:00
Marcin Maliszkiewicz
acb0ddaf3d auth: split auth-v2 logic for adding default superuser password
In raft mode (auth-v2) we need to do atomic write after read as
we give stricter consistency guarantees. Instead of patching
legacy logic this commit adds different path as:
- old code may be less tested now so it's best to not change it
- new code path avoids quorum selects in a typical flow (passwords set)

There may be a case when user deletes a superuser or password
right before restarting a node, in such case we may ommit
updating a password but:
- this is a trade-off between quorum reads on startup
- it's far more important to not update password when it shouldn't be
- if needed password will be updated on next node restart

If there is no quorum on startup we'll skip creating password
because we can't perform any raft operation.

Additionally this fixes a problem when password is created despite
having non default superuser in auth-v2.

(cherry picked from commit f85d73d405)
2025-06-30 20:50:15 +02:00
Marcin Maliszkiewicz
44533a0dbe auth: split auth-v2 logic for adding default superuser role
In raft mode (auth-v2) we need to do atomic write after read as
we give stricter consistency guarantees. Instead of patching
legacy logic this commit adds different path as:
  - old code may be less tested now so it's best to not change it
  - new code path avoids quorum selects in a typical flow (roles set)

This fixes a problem when superuser role is created despite
having non default superuser in auth-v2.

If there is no quorum on startup we'll skip creating role
because we can't perform any raft operation.

(cherry picked from commit 2e2ba84e94)
2025-06-30 20:50:15 +02:00
Marcin Maliszkiewicz
19748d9714 auth: ldap: fix waiting for underlying role manager
ldap_role_manager depends on standard_role_manager,
therefore it needs to wait for superuser initialization.
If this is missing, the password authenticator will start
checking the default password too early and may fail to
create the default password if there is no default
role yet.

Currently password authenticator will create password
together with the role in such case but in following
commits we want to separate those responsibilities correctly.

(cherry picked from commit c96c5bfef5)
2025-06-30 20:50:15 +02:00
Marcin Maliszkiewicz
ef1f4907bd auth: wait for default role creation before starting authorizer and authenticator
There is a hidden dependency: the creation of the default superuser role
is split between the password authenticator and the role manager.
To work correctly, they must start in the right order: role manager first,
then password authenticator.

(cherry picked from commit 68fc4c6d61)
2025-06-30 20:50:15 +02:00
Anna Stuchlik
3d8368cacb doc: remove OSS mention from the SI notes
This commit removes a confusing reference to an Open Source version
form the Local Secondary Indexes page.

Fixes https://github.com/scylladb/scylladb/issues/24668

Closes scylladb/scylladb#24673

(cherry picked from commit 2367330513)

Closes scylladb/scylladb#24723
2025-06-30 18:53:48 +03:00
Botond Dénes
236cab0f66 test/boost/sstable_datafile_test: add test for corrupt data
* create a table with random schema
* generate data: random mutations + one row with bad key
* write data to sstable
* check that only good data is written to sstable
* check that the bad data was saved to system.corrupt_data

(cherry picked from commit edc2906892)
2025-06-30 12:44:29 +00:00
Botond Dénes
cd97f4c4c3 sstables/mx/writer: handler rows with empty keys
Although valid for compact tables, non-full (or empty) clustering key
prefixes are not handled for row keys when writing sstables. Only the
present components are written, consequently if the key is empty, it is
omitted entirely.
When parsing sstables, the parsing code unconditionally parses a full
prefix. This mis-match results in parsing failures, as the parser parses
part of the row content as a key resulting in a garbage key and
subsequent mis-parsing of the row content and maybe even subsequent
partitions.

Use the recently introduced corrupt_data_handler to handle rows with
such corrupt keys. This way, we avoid corrupting the sstables beyond
parsing and the rows are also kept around in system.corrupt_data for
later inspection and possible recovery.

(cherry picked from commit 592ca789e2)
2025-06-30 12:44:29 +00:00
Botond Dénes
7654ccbef5 test/lib/cql_assertions: introduce columns_assertions
To enable targeted and optionally typed assertions against individual
columns in a row.

(cherry picked from commit aae212a87c)
2025-06-30 12:44:29 +00:00
Botond Dénes
9eb9ffe4bc sstables: add corrupt_data_handler to sstables::sstables
Similar to how large_data_handler is handled, propagate through
sstables::sstables_manager and store its owner: replica::database.
Tests and tools are also patched. Mostly mechanical changes, updating
constructors and patching callers.

(cherry picked from commit ebd9420687)
2025-06-30 12:44:29 +00:00
Botond Dénes
b0a233b2c9 tools/scylla-sstable: make large_data_handler a local
No reason for it to be a global, not even convenience.

(cherry picked from commit 46ff7f9c12)
2025-06-30 12:44:29 +00:00
Botond Dénes
53373ea9b7 db: introduce corrupt_data_handler
Similar to large_data_handler, this interface allows sstable writers to
delegate the handling of corrupt data.
Two implementations are provided:
* system_table_corrupt_data_handler - saved corrupt data in
  system.corrupt_data, with a TTL=10days (non-configurable for now)
* nop_corrupt_data_handler - drops corrupt data

(cherry picked from commit 3e1c50e9a7)
2025-06-30 12:44:29 +00:00
Botond Dénes
b952d8a88c mutation: introduce frozen_mutation_fragment_v2
Mirrors frozen_mutation_fragment and shares most of the underlying
serialization code, the only exception is replacing range_tombstone with
range_tombstone_change in the mutation fragment variant.

(cherry picked from commit b931145a26)
2025-06-30 12:44:28 +00:00
Botond Dénes
a561600e7e mutation/mutation_partition_view: read_{clustering,static}_row(): return row type
Instead of mutation_fragment, let caller convert into mutation_fragment.
Allows reuse in future callers which will want to convert to
mutation_fragment_v2.

(cherry picked from commit 64f8500367)
2025-06-30 12:44:28 +00:00
Botond Dénes
45b6cc069f mutation/mutation_partition_view: extract de-ser of {clustering,static} row
From the visitor in frozen_mutation_fragment::unfreeze(). We will want
to re-use it in the future frozen_mutation_fragment_v2::unfreeze().

Code-movement only, the code is not changed.

(cherry picked from commit 678deece88)
2025-06-30 12:44:28 +00:00
Botond Dénes
355a1b4af4 idl-compiler.py: generate skip() definition for enums serializers
Currently they only have the declaration and so far they got away with
it, looks like no users exists, but this is about to change so generate
the definition too.

(cherry picked from commit 093d4f8d69)
2025-06-30 12:44:28 +00:00
Botond Dénes
2ead6a43a5 idl: extract full_position.idl from position_in_partition.idl
A future user of position_in_partition.idl doesn't need full_position
and so doesn't want to include full_position.hh to fix compile errors
when including position_in_partition.idl.hh.
Extract it to a separate idl file: it has a single user in a
storage_proxy VERB.

(cherry picked from commit b0d5462440)
2025-06-30 12:44:28 +00:00
Botond Dénes
14595c49ae db/system_keyspace: add apply_mutation()
Allow applying writes in the form of mutations directly to the keyspace.
Allows lower-level mutation API to build writes. Advantageous if writes
can contain large cells that would otherwise possibly cause large
allocation warnings if used via the internal CQL API.

(cherry picked from commit 0753643606)
2025-06-30 12:44:28 +00:00
Botond Dénes
43eb3bcf91 db/system_keyspace: introduce the corrupt_data table
To serve as a place to store corrupt mutation fragments. These fragments
cannot be written to sstables, as they would be spread around by
compaction and/or repair. They even might make parsing the sstable
impossible. So they are stored in this special table instead, kept
around to be inspected later and possibly restored if possible.

(cherry picked from commit 92b5fe8983)
2025-06-30 12:44:28 +00:00
Botond Dénes
f4f0ffd713 mutation: check key of inserted rows
Make sure the keys are full prefixes as it is expected to be the case
for rows. At severeal occasions we have seen empty row keys make their
ways into the sstables, despite the fact that they are not allowed by
the CQL frontend. This means that such empty keys are possibly results
of memory corruption or use-after-{free,copy} errors. The source of the
corruption is impossible to pinpoint when the empty key is discovered in
the sstable. So this patch adds checks for such keys to places where
mutations are built: when building or unserializing mutations.

The test row_cache_test/test_reading_of_nonfull_keys needs adjustment to
work with the changes: it has to make the schema use compact storage,
otherwise the non-full changes used by this tests are rejected by the
new checks.

Fixes: https://github.com/scylladb/scylladb/issues/24506
(cherry picked from commit ab96c703ff)
2025-06-30 12:43:36 +00:00
Botond Dénes
b40edca418 compound: optimize is_full() for single-component types
For such compounds, unserializing the key is not necessary to determine
whether the key is full or not.

(cherry picked from commit 8b756ea837)
2025-06-30 12:43:36 +00:00
Aleksandra Martyniuk
7fd4d77fdd test: rest_api: fix test_repair_task_progress
test_repair_task_progress checks the progress of children of root
repair task. However, nothing ensures that the children are
already created.

Wait until at least one child of a root repair task is created.

Fixes: #24556.

Closes scylladb/scylladb#24560

(cherry picked from commit 0deb9209a0)

Closes scylladb/scylladb#24655
2025-06-28 09:39:06 +03:00
Marcin Maliszkiewicz
a54cc8291c test: add test for live updates of generic server config
Affected config: uninitialized_connections_semaphore_cpu_concurrency

(cherry picked from commit 68ead01397)
2025-06-27 16:01:43 +02:00
Patryk Jędrzejczak
2c89800e76 Merge '[Backport 2025.2] docs: document the new recovery procedure' from Scylladb[bot]
We replace the documentation of the old recovery procedure with the
documentation of the new recovery procedure.

The new recovery procedure requires the Raft-based topology to be
enabled, so to remove the old procedure from the documentation,
we must assume users have the Raft-based topology enabled.
We can do it in 2025.2 because the upgrade guides to 2025.1 state that
enabling the Raft-based topology is a mandatory step of the upgrade.
Another reminder is the upgrade guides to 2025.2.

Since we rely on the Raft-based topology being enabled, we remove the
obsolete parts of the documentation.

We will make the Raft-based topology mandatory in the code in the
future, hopefully in 2025.3. For this reason, we also don't touch the
dev docs in this PR.

Fixes scylladb/scylladb#24530

Requires backport to 2025.2 because 2025.2 contains the new recovery
procedure.

- (cherry picked from commit 4e256182a0)

- (cherry picked from commit 203ea5d8f9)

Parent PR: #24583

Closes scylladb/scylladb#24702

* https://github.com/scylladb/scylladb:
  docs: rely on the Raft-based topology being enabled
  docs: handling-node-failures: document the new recovery procedure
2025-06-27 11:58:36 +02:00
Patryk Jędrzejczak
b1bfa4b115 docs: rely on the Raft-based topology being enabled
In 2025.2, we don't force enabling the Raft-based topology in the code,
but we stated in the upgrade guides that it's a mandatory step of the
upgrade to 2025.1. We also remind users to enable the Raft-based
topology in the upgrade guides to 2025.2. Hence, we can rely in the
the documentation on the Raft-based topology being enabled. If it is
still disabled, we can just send the user to the upgrade guides. Hence:
- we remove all documentation related to enabling the Raft-based
  topology, enabling the Raft-based schema (enabled Raft-based topology
  implies enabled Raft-based schema), and the gossip-based topology,
- we can replace the documentation of the old manual recovery procedure
  with the documentation of the new manual recovery procedure (done in
  the previous commit).

(cherry picked from commit 203ea5d8f9)
2025-06-26 22:18:56 +00:00
Patryk Jędrzejczak
f052af6c45 docs: handling-node-failures: document the new recovery procedure
We replace the documentation of the old recovery procedure with the
documentation of the new recovery procedure.

We can get rid of the old procedure from the documentation because
we requested users to enable the Raft-based topology during upgrades to
2025.1 and 2025.2.

We leave the note that enabling the Raft-based topology is required to
use the new recovery procedure just in case, since we didn't force
enabling the Raft-based topology in the code.

(cherry picked from commit 4e256182a0)
2025-06-26 22:18:56 +00:00
Botond Dénes
62e134f423 sstables: replace SCYLLA_ASSERT() with parse_assert() on the read path
So parse errors on corrupt SSTables don't result in crashes, instead
just aborting the read in process.
There are a lot of SCYLLA_ASSERT() usages remaining in sstables/. This
patch tried to focus on those usages which are in the read path. Some
places not only used on the read path may have been converted too, where
the usage of said method is not clear.

(cherry picked from commit bce89c0f5e)
2025-06-26 14:53:08 +00:00
Botond Dénes
e7c59ce475 sstables/exceptions: introduce parse_assert()
To replace SCYLLA_ASSERT on the read/parse path. SSTables can get
corrupt for various reasons, some outside of the database's control. A
bad SSTable should not bring down the database, the parsing should
simply be aborted, with as much information printed as possible for the
investigation of the nature of the corruption.
The newly introduced parse_assert() uses on_internal_error() under the
hood, which prints a backtrace and optionally allows for aborting when
on the error, to generate a coredump.

(cherry picked from commit 27e26ed93f)
2025-06-26 14:53:08 +00:00
Marcin Maliszkiewicz
011765ced8 utils: don't allow do discard updateable_value observer
If the object returned from observe() is destructured,
it stops observing, potentially causing subtle bugs.
Typically, the observer object is retained as a class member.

(cherry picked from commit 45392ac29e)
2025-06-26 14:50:25 +00:00
Marcin Maliszkiewicz
641cfc9a09 generic_server: fix connections semaphore config observer
When temporary value returned by observer() is destructed it
disconnects from updateable_value so the code immediately stops
observing.

To fix it we need to retain the observer in the class object.

(cherry picked from commit c6a25b9140)
2025-06-26 14:50:25 +00:00
Jenkins Promoter
33e947e753 Update ScyllaDB version to: 2025.2.0 2025-06-25 15:29:15 +03:00
Michał Chojnowski
5fba228a6b memtable: ensure _flushed_memory doesn't grow above total memory usage
dirty_memory_manager tracks two quantities about memtable memory usage:
"real" and "unspooled" memory usage.

"real" is the total memory usage (sum of `occupancy().total_space()`)
by all memtable LSA regions, plus a upper-bound estimate of the size of
memtable data which has already moved to the cache region but isn't
evictable (merged into the cache) yet.

"unspooled" is the difference between total memory usage by all memtable
LSA regions, and the total flushed memory (sum of `_flushed_memory`)
of memtables.

dirty_memory_manager controls the shares of compaction and/or blocks
writes when these quantities cross various thresholds.

"Total flushed memory" isn't a well defined notion,
since the actual consumption of memory by the same data can vary over
time due to LSA compactions, and even the data present in memtable can
change over the course of the flush due to removals of outdated MVCC versions.
So `_flushed_memory` is merely an approximation computed by `flush_reader`
based on the data passing through it.

This approximation is supposed to be a conservative lower bound.
In particular, `_flushed_memory` should be not greater than
`occupancy().total_space()`. Otherwise, for example, "unspooled" memory
could become negative (and/or wrap around) and weird things could happen.
There is an assertion in ~flush_memory_accounter which checks that
`_flushed_memory < occupancy().total_space()` at the end of flush.

But it can fail. Without additional treatment, the memtable reader sometimes emits
data which is already deleted. (In particular, it emites rows covered by
a partition tombstone in a newer MVCC version.)
This data is seen `flush_reader` and accounted in `_flushed_memory`.
But this data can be garbage-collected by the mutation_cleaner later during the
flush and decrease `total_memory` below `_flushed_memory`.

There is a piece of code in mutation_cleaner intended to prevent that.
If `total_memory` decreases during a `mutation_cleaner` run,
`_flushed_memory` is lowered by the same amount, just to preserve the
asserted property. (This could also make `_flushed_memory` quite inaccurate,
but that's considered acceptable).

But that only works if `total_memory` is decreased during that run. It doesn't
work if the `total_memory` decrease (enabled by the new allocator holes made
by `mutation_cleaner`'s garbage collection work) happens asynchronously
(due to memory reclaim for whatever reason) after the run.

This patch fixes that by tracking the decreases of `total_memory` closer to the
source. Instead of relying on `mutation_cleaner` to notify the memtable if it
lowers `total_memory`, the memtable itself listens for notifications about
LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's
estimate of flushed memory decreased by the change in `total_memory` since the
beginning of flush (if it was positive), and it keeps the amount of "spooled"
memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`.

(cherry picked from commit 975e7e405a)
2025-06-24 13:06:06 +00:00
Michał Chojnowski
9b98bacaa1 replica/memtable: move region_listener handlers from dirty_memory_manager to memtable
The memtable wants to listen for changes in its `total_memory` in order
to decrease its `_flushed_memory` in case some of the freed memory has already
been accounted as flushed. (This can happen because the flush reader sees
and accounts even outdated MVCC versions, which can be deleted and freed
during the flush).

Today, the memtable doesn't listen to those changes directly. Instead,
some calls which can affect `total_memory` (in particular, the mutation cleaner)
manually check the value of `total_memory` before and after they run, and they
pass the difference to the memtable.

But that's not good enough, because `total_memory` can also change outside
of those manually-checked calls -- for example, during LSA compaction, which
can occur anytime. This makes memtable's accounting inaccurate and can lead
to unexpected states.

But we already have an interface for listening to `total_memory` changes
actively, and `dirty_memory_manager`, which also needs to know it,
does just that. So what happens e.g. when `mutation_cleaner` runs
is that `mutation_cleaner` checks the value of `total_memory` before it runs,
then it runs, causing several changes to `total_memory` which are picked up
by `dirty_memory_manager`, then `mutation_cleaner` checks the end value of
`total_memory` and passes the difference to `memtable`, which corrects
whatever was observed by `dirty_memory_manager`.

To allow memtable to modify its `_flushed_memory` correctly, we need
to make `memtable` itself a `region_listener`. Also, instead of
the situation where `dirty_memory_manager` receives `total_memory`
change notifications from `logalloc` directly, and `memtable` fixes
the manager's state later, we want to only the memtable listen
for the notifications, and pass them already modified accordingl
to the manager, so there is no intermediate wrong states.

This patch moves the `region_listener` callbacks from the
`dirty_memory_manager` to the `memtable`. It's not intended to be
a functional change, just a source code refactoring.
The next patch will be a functional change enabled by this.

(cherry picked from commit 7d551f99be)
2025-06-24 13:06:06 +00:00
Anna Stuchlik
b469158418 doc: improve the tablets limitations section
This PR improves the Limitations and Unsupported Features section
for tablets, as it has been confusing to the customers.

Refs https://github.com/scylladb/scylla-enterprise/issues/5465

Fixes https://github.com/scylladb/scylladb/issues/24562

Closes scylladb/scylladb#24563

(cherry picked from commit 17eabbe712)

Closes scylladb/scylladb#24588
2025-06-24 10:06:21 +03:00
Benny Halevy
afa2b40ac9 disk_space_monitor: add space_source_registration
Register the current space_source_fn in an RAII
object that resets monitor._space_source to the
previous function when the RAII object is destroyed.

Use space_source_registration in database_test::
mutation_dump_generated_schema_deterministic_id_version
to prevent use-after-stack-return in the test.

Fixes #24314

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#24342

(cherry picked from commit 8b387109fc)

Closes scylladb/scylladb#24392
2025-06-24 10:02:23 +03:00
Raphael S. Carvalho
fa420f8644 replica: Fix truncate assert failure
Truncate doesn't really go well with concurrent writes. The fix (#23560) exposed
a preexisting fragility which I missed.

1) truncate gets RP mark X, truncated_at = second T
2) new sstable written during snapshot or later, also at second T (difference of MS)
3) discard_sstables() get RP Y > saved RP X, since creation time of sstable
with RP Y is equal to truncated_at = second T.

So the problem is that truncate is using a clock of second granularity for
filtering out sstables written later, and after we got low mark and truncate time,
it can happen that a sstable is flushed later within the same second, but at a
different millisecond.
By switching to a millisecond clock (db_clock), we allow sstables written later
within the same second from being filtered out. It's not perfect but
extremely unlikely a new write lands and get flushed in the same
millisecond we recorded truncated_at timepoint. In practice, truncate
will not be used concurrently to writes, so this should be enough for
our tests performing such concurrent actions.
We're moving away from gc_clock which is our cheap lowres_clock, but
time is only retrieved when creating sstable objects, which frequency of
creation is low enough for not having significant consequences, and also
db_clock should be cheap enough since it's usually syscall-less.

Fixes #23771.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#24426

(cherry picked from commit 2d716f3ffe)

Closes scylladb/scylladb#24435
2025-06-24 10:02:06 +03:00
Andrzej Jackowski
60bc1c339c test: wait for normal state propagation in test_auth_v2_migration
By default, cluster tests have skip_wait_for_gossip_to_settle=0 and
ring_delay_ms=0. In tests with gossip topology, it may lead to a race,
where nodes see different state of each other.

In case of test_auth_v2_migration, there are three nodes. If the first
node already knows that the third node is NORMAL, and the second node
does not, the system_auth tables can return incomplete results.

To avoid such a race, this commit adds a check that all nodes see other
nodes as NORMAL before any writes are done.

Refs: #24163

Closes scylladb/scylladb#24185

(cherry picked from commit 555d897a15)

Closes scylladb/scylladb#24520
2025-06-24 10:01:42 +03:00
Michał Chojnowski
3eba371e09 test/boost/mutation_reader_test: fix a use-after-free in test_fast_forwarding_combined_reader_is_consistent_with_slicing
The contract in mutation_reader.hh says:

```
// pr needs to be valid until the reader is destroyed or fast_forward_to()
// is called again.
    future<> fast_forward_to(const dht::partition_range& pr) {
```

`test_fast_forwarding_combined_reader_is_consistent_with_slicing` violates
this by passing a temporary to `fast_forward_to`.

Fix that.

Fixes scylladb/scylladb#24542

Closes scylladb/scylladb#24543

(cherry picked from commit 27f66fb110)

Closes scylladb/scylladb#24548
2025-06-24 10:01:19 +03:00
Gleb Natapov
c644526bf9 api: return error from get_host_id_map if gossiper is not enabled yet.
Token metadata api is initialized before gossiper is started.
get_host_id_map REST endpoint cannot function without the fully
initialized gossiper though. The gossiper is started deep in
the join_cluster call chain, but if we move token_metadata api
initialization after the call it means that no api will be available
during bootstrap. This is not what we want.

Make a simple fix by returning an error from the api if the gossiper is
not initialized yet.

Fixes: #24479

Closes scylladb/scylladb#24575

(cherry picked from commit e364995e28)

Closes scylladb/scylladb#24587
2025-06-24 10:00:48 +03:00
Nadav Har'El
34bdbad128 Merge '[Backport 2025.2] cql, schema: Extend keyspace, table, views, indexes name length limit from 48 to 192 bytes' from Scylladb[bot]
cql, schema: Extend name length limit from 48 to 192 bytes

    This commit increases the maximum length of names for keyspaces, tables, materialized views, and indexes from 48 to 192 bytes.
    The previous 48-bytes limit was inherited from Cassandra 3 for compatibility. However, this validation was removed in Cassandra 4 and 5 (see CASSANDRA-20389)
    and some usage scenarios (such as some feature store workflows generating long table names) now depend on this relaxed constraint.
    This change brings ScyllaDB's behavior in line with modern Cassandra versions and better supports these use cases.

    The new limit of 192 bytes is derived from underlying filesystem limitations to prevent runtime errors when creating directories for table data.
    When a new table is created, ScyllaDB generates a directory for its SSTables. The directory name is constructed from the table name, a dash, and a 32-character UUID.
    For a CDC-enabled table, an associated log table is also created, which has the suffix `_scylla_cdc_log` appended to its name.
    The directory name for this log table becomes the longest possible representation.
    Additionally we reserve 15 bytes for future use, allowing for potential future extensions without breaking existing schemas.
    To guarantee that directory creation never fails due to exceeding filesystem name limits, the maximum name length is calculated as follows:
      255 bytes (common filesystem limit for a path component)
    -  32 bytes (for the 32-character UUID string)
    -   1 byte  (for the '-' separator)
    -  15 bytes (for the '_scylla_cdc_log' suffix)
    -  15 bytes (reserved for future use)
    ----------
    = 192 bytes (Maximum allowed name length)
    This calculation is similar in principle to the one proposed for Cassandra to fix related directory creation failures (see apache/cassandra/pull/4038).

    This patch also updates/adds all associated tests to validate the new 192-byte limit.
    The documentation has been updated accordingly.

Fixes #4480

Backport 2025.2: The significantly shorter maximum table name length in Scylla compared to Cassandra is becoming a more common issue for users in the latest release.

- (cherry picked from commit a41c12cd85)

- (cherry picked from commit 4577c66a04)

Parent PR: #24500

Closes scylladb/scylladb#24603

* github.com:scylladb/scylladb:
  cql, schema: Extend name length limit from 48 to 192 bytes
  replica: Remove unused keyspace::init_storage()
2025-06-23 15:48:23 +03:00
Karol Nowacki
76bd23cddd cql, schema: Extend name length limit from 48 to 192 bytes
This commit increases the maximum length of names for keyspaces, tables, materialized views, and indexes from 48 to 192 bytes.
The previous 48-bytes limit was inherited from Cassandra 3 for compatibility. However, this validation was removed in Cassandra 4 and 5 (see CASSANDRA-20389)
and some usage scenarios (such as some feature store workflows generating long table names) now depend on this relaxed constraint.
This change brings ScyllaDB's behavior in line with modern Cassandra versions and better supports these use cases.

The new limit of 192 bytes is derived from underlying filesystem limitations to prevent runtime errors when creating directories for table data.
When a new table is created, ScyllaDB generates a directory for its SSTables. The directory name is constructed from the table name, a dash, and a 32-character UUID.
For a CDC-enabled table, an associated log table is also created, which has the suffix `_scylla_cdc_log` appended to its name.
The directory name for this log table becomes the longest possible representation.
Additionally we reserve 15 bytes for future use, allowing for potential future extensions without breaking existing schemas.
To guarantee that directory creation never fails due to exceeding filesystem name limits, the maximum name length is calculated as follows:
  255 bytes (common filesystem limit for a path component)
-  32 bytes (for the 32-character UUID string)
-   1 byte  (for the '-' separator)
-  15 bytes (for the '_scylla_cdc_log' suffix)
-  15 bytes (reserved for future use)
----------
= 192 bytes (Maximum allowed name length)
This calculation is similar in principle to the one proposed for Cassandra to fix related directory creation failures (see apache/cassandra/pull/4038).

This patch also updates/adds all associated tests to validate the new 192-byte limit.
The documentation has been updated accordingly.

(cherry picked from commit 4577c66a04)
2025-06-22 17:38:30 +00:00
Karol Nowacki
87f31f79a3 replica: Remove unused keyspace::init_storage()
This function was declared but had no implementation or callers. It is being removed as minor code cleanup.

(cherry picked from commit a41c12cd85)
2025-06-22 17:38:29 +00:00
Jenkins Promoter
942b16ffe5 Update ScyllaDB version to: 2025.2.0-rc6 2025-06-22 15:01:54 +03:00
Pavel Emelyanov
66fe11a126 Update seastar submodule (no nested stall backtraces)
* seastar 9f0034a0...450e36d5 (1):
  > stall_detector: no backtrace if exception

Fixes #24464

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24541
2025-06-19 10:08:40 +03:00
Michał Chojnowski
06d6718f3b pylib/manager_client: add server_switch_executable
Add an util for switching the Scylla executable during the test.
Will be used for upgrade tests.

(cherry picked from commit 5da19ff6a6)
2025-06-18 13:50:38 +00:00
Michał Chojnowski
b5591422c6 test/pylib: in add_server, give a way to specify the executable and version-specific config
This will be used for upgrade tests.
The cluster will be started with an older executable and without configs
specific to newer versions.

(cherry picked from commit 1ff7e09edc)
2025-06-18 13:50:38 +00:00
Michał Chojnowski
043eaf099a pylib: pass scylla_env environment variables to the topology suite
I want to add an upgrade test under the topology suite.
To work, it will have to know the path to the tested Scylla
executable, so that it can switch the nodes to it.

The path could be passed by various means and I'm not sure
which what method is appropriate.

In some other places (e.g. the cql suite) we pass the path via
the `SCYLLA` environment variable and this patch follows that example.

`PythonTestSuite` (parent class of `TopologySuite`) already has that
variable set in `self.scylla_env`, and passes it around.
However, `TopologySuite` uses its own `run()`, and so it implicitly
overrides the decision to pass `self.scylla_env` down. This patch
changes that, and after the patch we apply the `self.scylla_env` to the
environment for topology tests.

This might has some unforeseen side effects for coverage measurement,
because AFAICS the (only) other variable in `self.scylla_env` is
`LLVM_PROFILE_FILE`.
But topology tests don't run Scylla executables themselves
(they only send command to the cluster manager started externally),
so I figure there should be no change.

(cherry picked from commit 2ef0db0a6b)
2025-06-18 13:50:38 +00:00
Michał Chojnowski
5e2b3be754 test/pylib: add get_scylla_2025_1_executable()
Adds a function which downloads and installs (in `~/.cache`)
the Scylla 2025.1, for upgrade tests.

Note: this introduces an internet dependency into pylib,
AFAIK the first one.

We already have some other code for downloading existing Scylla
releases, written for different purposes, in `cqlpy/fetch_scylla.py`.
I made zero effort to reuse that in any way.

Note: hardcoding the package version might be uncool,
but if we want "better" version selection (e.g. the newest patch version
in the given branch), we should have a separate library (or web service)
for that, and share it with CCM/SCT.
If we add a separate automatic version selection mechanism here,
we are going to end up with yet another half-broken Scylla version
selector, with yet different syntax and semantics than the other ones.

We never clear the downloaded and unpacked files.
This could become a problem in the future.
(At which point we can add some mechanism that deletes cached archives
downloaded more than a week ago.)

(cherry picked from commit 34098fbd1f)
2025-06-18 13:50:38 +00:00
Michał Chojnowski
d141b730fc pylib/scylla_cluster: give a way to pass executable-specific options to nodes
I'm trying to adapt pylib to multi-version tests.
(Where the Scylla cluster is upgraded to a newer Scylla version
during the test).

Before this patch, the initial config (where "config" == yaml file + CLI args)
of the nodes is hardcoded in scylla_cluster.py.
The problem is that this config might not apply to past versions,
so we need some way to give them a different config.
(For example, with the config as it is before the patch,
a Scylla 2025.1 executable would not boot up because it does not
know the `group0_voter_handler` logger).

In this patch, we create a way to attach version-specific
config to the executable passed to ScyllaServer.

(cherry picked from commit cc7432888e)
2025-06-18 13:50:37 +00:00
Michał Chojnowski
76d989cbfe dbuild: mount "$XDG_CACHE_HOME/scylladb"
We will use it to keep a cache of artifact downloads for upgrade tests,
across dbuild invocations.

(cherry picked from commit 63218bb094)
2025-06-18 13:50:37 +00:00
Piotr Dulikowski
9536949911 Merge '[Backport 2025.2] tablets: deallocate storage state on end_migration' from Scylladb[bot]
When a tablet is migrated and cleaned up, deallocate the tablet storage
group state on `end_migration` stage, instead of `cleanup` stage:

* When the stage is updated from `cleanup` to `end_migration`, the
  storage group is removed on the leaving replica.
* When the table is initialized, if the tablet stage is `end_migration`
  then we don't allocate a storage group for it. This happens for
  example if the leaving replica is restarted during tablet migration.
  If it's initialized in `cleanup` stage then we allocate a storage
  group, and it will be deallocated when transitioning to
  `end_migration`.

This guarantees that the storage group is always deallocated on the
leaving replica by `end_migration`, and that it is always allocated if
the tablet wasn't cleaned up fully yet.

It is a similar case also for the pending replica when the migration is
aborted. We deallocate the state on `revert_migration` which is the
stage following `cleanup_target`.

Previously the storage group would be allocated when the tablet is
initialized on any of the tablet replicas - also on the leaving replica,
and when the tablet stage is `cleanup` or `end_migration`, and
deallocated during `cleanup`.

This fixes the following issue:

1. A migrating tablet enters cleanup stage
2. the tablet is cleaned up successfuly
3. The leaving replica is restarted, and allocates storage group
4. tablet cleanup is not called because it's already cleaned up
5. the storage group remains allocated on the leaving replica after the
   migration is completed - it's not cleaned up properly.

Fixes https://github.com/scylladb/scylladb/issues/23481

backport to all relevant releases since it's a bug that results in a crash

- (cherry picked from commit 34f15ca871)

- (cherry picked from commit fb18fc0505)

- (cherry picked from commit bd88ca92c8)

Parent PR: #24393

Closes scylladb/scylladb#24488

* github.com:scylladb/scylladb:
  test/cluster/test_tablets: test restart during tablet cleanup
  test: tablets: add get_tablet_info helper
  tablets: deallocate storage state on end_migration
2025-06-18 10:25:32 +02:00
Anna Stuchlik
01d3b504d1 doc: add support for z3 GCP
This commit adds support for z3-highmem-highlssd instance types to
Cloud Instance Recommendations for GCP.

Fixes https://github.com/scylladb/scylladb/issues/24511

Closes scylladb/scylladb#24533

(cherry picked from commit 648d8caf27)

Closes scylladb/scylladb#24545
2025-06-17 23:40:47 +03:00
Michael Litvak
305f827888 test/cluster/test_tablets: test restart during tablet cleanup
Add a test that reproduces issue scylladb/scylladb#23481.

The test migrates a tablet from one node to another, and while the
tablet is in some stage of cleanup - either before or right after,
depending on the parameter - the leaving replica, on which the tablet is
cleaned, is restarted.

This is interesting because when the leaving replica starts and loads
its state, the tablet could be in different stages of cleanup - the
SSTables may still exist or they may have been cleaned up already, and
we want to make sure the state is loaded correctly.

(cherry picked from commit bd88ca92c8)
2025-06-17 13:59:10 +00:00
Michael Litvak
d094bc6fc9 test: tablets: add get_tablet_info helper
Add a helper for tests to get the tablet info from system.tablets for a
tablet owning a given token.

(cherry picked from commit fb18fc0505)
2025-06-17 13:59:10 +00:00
Michael Litvak
c11a2e2aaf tablets: deallocate storage state on end_migration
When a tablet is migrated and cleaned up, deallocate the tablet storage
group state on `end_migration` stage, instead of `cleanup` stage:

* When the stage is updated from `cleanup` to `end_migration`, the
  storage group is removed on the leaving replica.
* When the table is initialized, if the tablet stage is `end_migration`
  then we don't allocate a storage group for it. This happens for
  example if the leaving replica is restarted during tablet migration.
  If it's initialized in `cleanup` stage then we allocate a storage
  group, and it will be deallocated when transitioning to
  `end_migration`.

This guarantees that the storage group is always deallocated on the
leaving replica by `end_migration`, and that it is always allocated if
the tablet wasn't cleaned up fully yet.

It is a similar case also for the pending replica when the migration is
aborted. We deallocate the state on `revert_migration` which is the
stage following `cleanup_target`.

Previously the storage group would be allocated when the tablet is
initialized on any of the tablet replicas - also on the leaving replica,
and when the tablet stage is `cleanup` or `end_migration`, and
deallocated during `cleanup`.

This fixes the following issue:

1. A migrating tablet enters cleanup stage
2. the tablet is cleaned up successfuly
3. The leaving replica is restarted, and allocates storage group
4. tablet cleanup is not called because it was already cleaned up
4. the storage group remains allocated on the leaving replica after the
   migration is completed - it's not cleaned up properly.

Fixes scylladb/scylladb#23481

(cherry picked from commit 34f15ca871)
2025-06-17 13:59:10 +00:00
Botond Dénes
a63b22eec6 Merge '[Backport 2025.2] tablets: fix missing data after tablet merge ' from Scylladb[bot]
Consider the following scenario:

1) let's assume tablet 0 has range [1, 5] (pre merge)
2) tablet merge happens, tablet 0 has now range [1, 10]
3) tablet_sstable_set isn't refreshed, so holds a stale state, thinks tablet 0 still has range [1, 5]
4) during a full scan, forward service will intersect the full range with tablet ranges and consume one tablet at a time
5) replica service is asked to consume range [1, 10] of tablet 0 (post merge)

We have two possible outcomes:

With cache bypass:

1) cache reader is bypassed
2) sstable reader is created on range [1, 10]
3) unrefreshed tablet_sstable_set holds stale state, but select correctly all sstables intersecting with range [1, 10]

With cache:

1) cache reader is created
2) finds partition with token 5 is cached
3) sstable reader is created on range [1, 4] (later would fast forward to range [6, 10]; also belongs to tablet 0)
4) incremental selector consumes the pre-merge sstable spanning range [1, 5]
4.1) since the partitioned_sstable_set pre-merge contains only that sstable, EOS is reached
4.2) since EOS is reached, the fast forward to range [6, 10] is not allowed.
So with the set refreshed, sstable set is aligned with tablet ranges, and no premature EOS is signalled, otherwise preventing fast forward to from happening and all data from being properly captured in the read.

This change fixes the bug and triggers a mutation source refresh whenever the number of tablets for the table has changed, not only when we have incoming tablets.

Additionally, includes a fix for range reads that span more than one tablet, which can happen during split execution.

Fixes: https://github.com/scylladb/scylladb/issues/23313

This change needs to be backported to all supported versions which implement tablet merge.

- (cherry picked from commit d0329ca370)

- (cherry picked from commit 1f9f724441)

- (cherry picked from commit 53df911145)

Parent PR: #24287

Closes scylladb/scylladb#24339

* github.com:scylladb/scylladb:
  replica: Fix range reads spanning sibling tablets
  test: add reproducer and test for mutation source refresh after merge
  tablets: trigger mutation source refresh on tablet count change
2025-06-17 08:35:14 +03:00
Jenkins Promoter
0adf905112 Update ScyllaDB version to: 2025.2.0-rc5 2025-06-16 16:21:22 +03:00
Pavel Emelyanov
c2a9f2d9c6 Update seastar submodule
* seastar d7ff58f2...9f0034a0 (1):
  > http_client: Add ECONNRESET to retryable errors

And switch to 2025.2 branch from scylla-seastar for backports

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24446
2025-06-15 17:33:16 +03:00
Raphael S. Carvalho
79958472bc replica: Fix range reads spanning sibling tablets
We don't guarantee that coordinators will only emit range reads that
span only one tablet.

Consider this scenario:

1) split is about to be finalized, barrier is executed, completes.
2) coordinator starts a read, uses pre-split erm (split not committed to group0 yet)
3) split is committed to group0, all replicas switch storage.
4) replica-side read is executed, uses a range which spans tablets.

We could fix it with two-phase split execution. Rather than pushing the
complexity to higher levels, let's fix incremental selector which should
be able to serve all the tokens owned by a given shard. During split
execution, either of sibling tablets aren't going anywhere since it
runs with state machine locked, so a single read spanning both
sibling tablets works as long as the selector works across tablet
boundaries.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 53df911145)
2025-06-15 09:14:38 -03:00
Ferenc Szili
ba192c1a29 test: add reproducer and test for mutation source refresh after merge
This change adds a reproducer and test for the fix where the local mutation
source is not always refreshed after a tablet merge.

(cherry picked from commit 1f9f724441)
2025-06-15 09:14:37 -03:00
Jenkins Promoter
89f5374435 Update pgo profiles - aarch64 2025-06-15 04:46:00 +03:00
Jenkins Promoter
184e0716b3 Update pgo profiles - x86_64 2025-06-15 04:08:36 +03:00
Anna Stuchlik
baa2592299 doc: remove the limitation for disabling CDC
This commit removes the instruction to stop all writes before disabling CDC with ALTER.

Fixes https://github.com/scylladb/scylla-docs/issues/4020

Closes scylladb/scylladb#24406

(cherry picked from commit b0ced64c88)

Closes scylladb/scylladb#24476
2025-06-13 14:07:38 +03:00
Robert Bindar
a926cba476 Add support for nodetool refresh --skip-reshape
This patch adds the new option in nodetool, patches the
load_new_ss_tables REST request with a new parameter and
skips the reshape step in refresh if this flag is passed.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#24409
Fixes: #24365

(cherry picked from commit ca1a9c8d01)

Closes scylladb/scylladb#24472
2025-06-13 14:06:19 +03:00
Michał Chojnowski
9c28b812ca db/config: add an option that disables dict-aware sstable compressors in DDL statements
For reasons, we want to be able to disallow dictionary-aware compressors
in chosen deployments.

This patch adds a knob for that. When the knob is disabled,
dictionary-aware compressors will be rejected in the validation
stage of CREATE and ALTER statements.

Closes scylladb/scylladb#24355

(cherry picked from commit 7d26d3c7cb)

Closes scylladb/scylladb#24454
2025-06-13 14:03:32 +03:00
Michael Litvak
d792916e8e test_cdc_generation_clearing: wait for generations to propagate
In test_cdc_generation_clearing we trigger events that update CDC
generations, verify the generations are updated as expected, and verify
the system topology and CDC generations are consistent on all nodes.

Before checking that all nodes are consistent and have the same CDC
generations, we need to consider that the changes are propagated through
raft and take some time to propagate to all nodes.

Currently, we wait for the change to be applied only on the first server
which runs the CDC generation publisher fiber and read the CDC
generations from this single node. The consistency check that follows
could fail if the change was not propagated to some other node yet.

To fix that, before checking consistency with all nodes, we execute a
read barrier on all nodes so they all see the same state as the leader.

Fixes scylladb/scylladb#24407

Closes scylladb/scylladb#24433

(cherry picked from commit 8aeb404893)

Closes scylladb/scylladb#24450
2025-06-10 15:50:40 +03:00
Michał Chojnowski
a539ff6419 utils/lsa/chunked_managed_vector: fix the calculation of max_chunk_capacity()
`chunked_managed_vector` is a vector-like container which splits
its contents into multiple contiguous allocations if necessary,
in order to fit within LSA's max preferred contiguous allocation
limits.

Each limited-size chunk is stored in a `managed_vector`.
`managed_vector` is unaware of LSA's size limits.
It's up to the user of `managed_vector` to pick a size which
is small enough.

This happens in `chunked_managed_vector::max_chunk_capacity()`.
But the calculation is wrong, because it doesn't account for
the fact that `managed_vector` has to place some metadata
(the backreference pointer) inside the allocation.
In effect, the chunks allocated by `chunked_managed_vector`
are just a tiny bit larger than the limit, and the limit is violated.

Fix this by accounting for the metadata.

Also, before the patch `chunked_managed_vector::max_contiguous_allocation`,
repeats the definition of logalloc::max_managed_object_size.
This is begging for a bug if `logalloc::max_managed_object_size`
changes one day. Adjust it so that `chunked_managed_vector` looks
directly at `logalloc::max_managed_object_size`, as it means to.

Fixes scylladb/scylladb#23854

(cherry picked from commit 7f9152babc)

Closes scylladb/scylladb#24371
2025-06-10 11:25:52 +03:00
Jenkins Promoter
b295ce38ae Update ScyllaDB version to: 2025.2.0-rc4 2025-06-06 17:03:11 +03:00
Nikos Dragazis
2e50d1a357 sstables: Fix race when loading checksum component
`read_checksum()` loads the checksum component from disk and stores a
non-owning reference in the shareable components. To avoid loading the
same component twice, the function has an early return statement.
However, this does not guarantee atomicity - two fibers or threads may
load the component and update the shareable components concurrently.
This can lead to use-after-free situations when accessing the component
through the shareable components, since the reference stored there is
non-owning. This can happen when multiple compaction tasks run on the
same SSTable (e.g., regular compaction and scrub-validate).

Fix this by not updating the reference in shareable components, if a
reference is already in place. Instead, create an owning reference to
the existing component for the current fiber. This is less efficient
than using a mutex, since the component may be loaded multiple times
from disk before noticing the race, but no locks are used for any other
SSTable component either. Also, this affects uncompressed SSTables,
which are not that common.

Fixes #23728.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#23872

(cherry picked from commit eaa2ce1bb5)

Closes scylladb/scylladb#24358
2025-06-06 08:49:56 +03:00
Szymon Malewski
d65b390780 mapreduce_service: Prevent race condition
In parallelized aggregation functions super-coordinator (node performing final merging step) receives and merges each partial result in parallel coroutines (`parallel_for_each`).
Usually responses are spread over time and actual merging is atomic.
However sometimes partial results are received at the similar time and if an aggregate function (e.g. lua script) yields, two coroutines can try to overwrite the same accumulator one after another,
which leads to losing some of the results.
To prevent this, in this patch each coroutine stores merging results in its own context and overwrites accumulator atomically, only after it was fully merged.
Comparing to the previous implementation order of operands in merging function is swapped, but the order of aggregation is not guaranteed anyway.

Fixes #20662

Closes scylladb/scylladb#24106

(cherry picked from commit 5969809607)

Closes scylladb/scylladb#24389
2025-06-06 08:49:15 +03:00
Anna Stuchlik
4ebae7ae62 doc: add the upgrade guide from 2025.1 to 2025.2
This commit adds the upgrade guide from version 2025.1 to 2025.2.
Also, it removes the upgrade guides existing for the previous version
that are irrelevant in 2025.2 (upgrade from OSS 6.2 and Enterprise 2024.x).

Note that the new guide does not include the "Enable Consistent Topology Updates" page,
as users upgrading to 2025.2 have consistent topology updates already enabled.

Fixes https://github.com/scylladb/scylladb/issues/24133

Fixes https://github.com/scylladb/scylladb/issues/24265

Closes scylladb/scylladb#24266

(cherry picked from commit 8b989d7fb1)

Closes scylladb/scylladb#24391
2025-06-06 08:48:31 +03:00
Ernest Zaslavsky
4fed3a5a5a encryption_test: Catch exact exception
Apparently `test_kms_network_error` will succeed at any circumstances since most of our exceptions derive from `std::exception`, so whatever happens to the test, for whatever reason it will throw, the test will be marked as passed.

Start catching the exact exception that we expect to be thrown.

Maybe somewhat related to https://github.com/scylladb/scylladb/issues/22628

Fixes: https://github.com/scylladb/scylladb/issues/24145

reapplies reverted: https://github.com/scylladb/scylladb/pull/24065

Should be backported to 2025.2.

Closes scylladb/scylladb#24242

(cherry picked from commit a39b773d36)

Closes scylladb/scylladb#24402
2025-06-06 08:48:02 +03:00
Pavel Emelyanov
5b86b6393a Merge '[Backport 2025.2] Add ability to skip SSTables cleanup when loading them' from Scylladb[bot]
The non-streaming loading of sstables performs cleanup since recently [1]. For vnodes, unfortunately, cleanup is almost unavoidable, because of the nature of vnodes sharding, even if sstable is already clean. This leads to waste of IO and CPU for nothing. Skipping the cleanup in a smart way is possible, but requires too many changes in the code and in the on-disk data. However, the effort will not help existing SSTables and it's going to be obsoleted by tablets some time soon.

Said that, the easiest way to skip cleanup is the explicit --skip-cleanup option for nodetool and respective skip_cleanup parameter for API handler.

New feature, no backport

fixes #24136
refs #12422 [1]

- (cherry picked from commit 4ab049ac8d)

- (cherry picked from commit ed3ce0f6af)

- (cherry picked from commit 1b1f653699)

- (cherry picked from commit c0796244bb)

Parent PR: #24139

Closes scylladb/scylladb#24398

* github.com:scylladb/scylladb:
  nodetool: Add refresh --skip-cleanup option
  api: Introduce skip_cleanup query parameter
  distributed_loader: Don't create owned ranges if skip-cleanup is true
  code: Push bool skip_cleanup flag around
2025-06-06 08:47:22 +03:00
Pavel Emelyanov
024af57bd5 nodetool: Add refresh --skip-cleanup option
The option "conflicts" with load-and-stream. Tests and doc included.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit c0796244bb)
2025-06-05 17:52:13 +03:00
Pavel Emelyanov
c59327950b api: Introduce skip_cleanup query parameter
Just copy the load_and_stream and primary_replica_only logic, this new
option is the same in this sense.

Throw if it's specified with the load_and_stream one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 1b1f653699)
2025-06-05 17:48:35 +03:00
Pavel Emelyanov
a2b2e46482 distributed_loader: Don't create owned ranges if skip-cleanup is true
In order to make reshard compaction task run cleanup, the owner-ranges
pointer is passed to it. If it's nullptr, the cleanup is not performed.
So to do the skip-cleanup, the easiest (but not the most apparent) way
is not to initialize the pointer and keep it nullptr.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit ed3ce0f6af)
2025-06-05 17:44:45 +03:00
Pavel Emelyanov
4a7ddbfe07 code: Push bool skip_cleanup flag around
Just put the boolean into the callstack between API and distributed
loader to reduce the churn in the next patches. No functional changes,
flag is false and unused.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 4ab049ac8d)
2025-06-05 17:44:40 +03:00
Michał Chojnowski
484fc374c1 compress: fix a use-after-free in dictionary_holder::get_recommended_dict()
The function calls copy() on a foreign_ptr
(stored in a map) which can be destroyed
(erased from the map) before the copy() completes.
This is illegal.

One way to fix this would be to apply an rwlock
to the map. Another way is to wrap the `foreign_ptr`
in a `lw_shared_ptr` and extend its lifetime over
the `copy()` call. This patch does the latter.

Fixes scylladb/scylladb#24165
Fixes scylladb/scylladb#24174

Closes scylladb/scylladb#24175

(cherry picked from commit ea4d251ad2)

Closes scylladb/scylladb#24374
2025-06-05 12:11:22 +03:00
Botond Dénes
a5251b4d44 Merge '[Backport 2025.2] Add --scope arg to notedool refresh' from Scylladb[bot]
This PR adds the `--scope` option to `nodetool refresh`.
Like in the case of `nodetool restore`, you can pass either of:
* `node` - On the local node.
* `rack` - On the local rack.
* `dc` - In the datacenter (DC) where the local node lives.
* `all` (default) - Everywhere across the cluster.

as scope.

The feature is based on the existing load_and_stream paths, so it requires passing `--load-and-stream` to the `refresh` command, although this might change in the near future.

Fixes https://github.com/scylladb/scylladb/issues/23564

- (cherry picked from commit c570941692)

Parent PR: #23861

Closes scylladb/scylladb#24379

* github.com:scylladb/scylladb:
  Add nodetool refresh --scope option
  Refactor out code from test_restore_with_streaming_scopes
  Refactor out code from test_restore_with_streaming_scopes
  Refactor out code from test_restore_with_streaming_scopes
  Refactor out code from test_restore_with_streaming_scopes
  Refactor out code from test_restore_with_streaming_scopes
2025-06-05 11:54:17 +03:00
Avi Kivity
2afe0695cf Revert "config: decrease default large allocation warning threshold to 128k"
This reverts commit 04fb2c026d. 2025.2 got
the reduced threshold, but won't get most of the fixes the warning will
generate, leaving it very noisy. Better to avoid the noise for this release.

Fixes #24384.
2025-06-04 14:18:35 +03:00
Robert Bindar
b62264e1d9 Add nodetool refresh --scope option
This change adds the --scope option to nodetool refresh.
Like in the case of nodetool restore, you can pass either of:
* node - On the local node.
* rack - On the local rack.
* dc - In the datacenter (DC) where the local node lives.
* all (default) - Everywhere across the cluster.
as scope.

The feature is based on the existing load_and_stream paths, so it
requires passing --load-and-stream to the refresh command.
Also, it is not compatible with the --primary-replica-only option.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#23861

(cherry picked from commit c570941692)
2025-06-04 11:59:17 +03:00
Robert Bindar
36cc0f8e7e Refactor out code from test_restore_with_streaming_scopes
part 5: check_data_is_back

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
(cherry picked from commit 548a1ec20a)
2025-06-04 11:54:07 +03:00
Robert Bindar
a885c87547 Refactor out code from test_restore_with_streaming_scopes
part 4: compute_scope

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
(cherry picked from commit 29309ae533)
2025-06-04 11:54:01 +03:00
Robert Bindar
371fc05943 Refactor out code from test_restore_with_streaming_scopes
part 3: create_dataset

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
(cherry picked from commit a0f0580a9c)
2025-06-04 11:53:51 +03:00
Robert Bindar
4366cd5a81 Refactor out code from test_restore_with_streaming_scopes
part 2: take_snapshot

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
(cherry picked from commit 5171ca385a)
2025-06-04 11:53:43 +03:00
Robert Bindar
38ee119112 Refactor out code from test_restore_with_streaming_scopes
part 1: create_cluster

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
(cherry picked from commit f09bb20ac4)
2025-06-04 11:53:32 +03:00
Piotr Dulikowski
6edf92a9e3 Merge '[Backport 2025.2] test/boost: Adjust tests to RF-rack-valid keyspaces' from Scylladb[bot]
This PR adjusts existing Boost tests so they respect the invariant
introduced by enabling `rf_rack_valid_keyspaces` configuration option.
We disable it explicitly in more problematic tests. After that, we
enable the option by default in the whole test suite.

Fixes scylladb/scylladb#23958

Backport: backporting to 2025.1 to be able to test the implementation there too.

- (cherry picked from commit 6e2fb79152)

- (cherry picked from commit e4e3b9c3a1)

- (cherry picked from commit 1199c68bac)

- (cherry picked from commit cd615c3ef7)

- (cherry picked from commit fa62f68a57)

- (cherry picked from commit 22d6c7e702)

- (cherry picked from commit 237638f4d3)

- (cherry picked from commit c60035cbf6)

Parent PR: scylladb/scylladb#23802

Closes scylladb/scylladb#24368

* github.com:scylladb/scylladb:
  test/lib/cql_test_env.cc: Enable rf_rack_valid_keyspaces by default
  test/boost/tablets_test.cc: Explicitly disable rf_rack_valid_keyspaces in problematic tests
  test/boost/tablets_test.cc: Fix indentation in test_load_balancing_with_random_load
  test/boost/tablets_test.cc: Adjust test_load_balancing_with_random_load to RF-rack-validity
  test/boost/tablets_test.cc: Adjust test_load_balancing_works_with_in_progress_transitions to RF-rack-validity
  test/boost/tablets_test.cc: Adjust test_load_balancing_resize_requests to RF-rack-validity
  test/boost/tablets_test.cc: Adjust test_load_balancing_with_two_empty_nodes to RF-rack-validity
  test/boost/tablets_test.cc: Adjust test_load_balancer_shuffle_mode to RF-rack-validity
2025-06-04 10:24:35 +02:00
Nadav Har'El
609ad01bbc alternator: hide internal tags from users
The "tags" mechanism in Alternator is a convenient way to attach metadata
to Alternator tables. Recently we have started using it more and more for
internal metadata storage:

  * UpdateTimeToLive stores the attribute in a tag system:ttl_attribute
  * CreateTable stores provisioned throughput in tags
    system:provisioned_rcu and system:provisioned_wcu
  * CreateTable stores the table's creation time in a tag called
    system:table_creation_time.

We do not want any of these internal tags to be visible to a
ListTagsOfResource request, because if they are visible (as before this
patch), systems such as Terraform can get confused when they suddenly
see a tag which they didn't set - and may even attempt to delete it
(as reported in issue #24098).

Moreover, we don't want any of these internal tags to be writable
with TagResource or UntagResource: If a user wants to change the TTL
setting they should do it via UpdateTimeToLive - not by writing
directly to tags.

So in this patch we forbid read or write to *any* tag that begins
with the "system:" prefix, except one: "system:write_isolation".
That tag is deliberately intended to be writable by the user, as
a configuration mechanism, and is never created internally by
Scylla. We should have perhaps chosen a different prefix for
configurable vs. internal tags, or chosen more unique prefixes -
but let's not change these historic names now.

This patch also adds regression tests for the internal tags features,
failing before this patch and passing after:
1. internal tags, specifically system:ttl_attribute, are not visible
   in ListTagsOfResource, and cannot be modified by TagResource or
   UntagResource.
2. system:write_isolation is not internal, and be written by either
   TagResource or UntagResource, and read with ListTagsOfResource.

This patch also fixes a bug in the test where we added more checks
for system:write_isolation - test_tag_resource_write_isolation_values.
This test forgot to remove the system:write_isolation tags from
test_table when it ended, which would lead to other tests that run
later to run with a non-default write isolation - something which we
never intended.

Fixes #24098.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#24299

(cherry picked from commit 6cbcabd100)

Closes scylladb/scylladb#24377
2025-06-04 09:56:33 +03:00
Avi Kivity
10b7f2d924 pgo: drop Java configuration
Since 5e1cf90a51
("build: replace tools/java submodule with packaged cassandra-stress")
we run pre-packaged cassandra-stress. As such, we don't need to look for
a Java runtime (which is missing on the frozen toolchain) and can
rely on the cassandra-stress package finding its own Java runtime.

Fix by just dropping all the Java-finding stuff.

Note: Java 11 is in fact present on the frozen toolchain, just
not in a way that pgo.py can find it.

Fixes #24176.

Closes scylladb/scylladb#24178

(cherry picked from commit 29932a5af1)

Closes scylladb/scylladb#24254
2025-06-03 17:54:28 +03:00
Dawid Mędrek
5130ec84de test/lib/cql_test_env.cc: Enable rf_rack_valid_keyspaces by default
We've adjusted all of the Boost tests so they respect the invariant
enforced by the `rf_rack_valid_keyspaces` configuration option, or
explicitly disabled the option in those that turned out to be more
problematic and will require more attention. Thanks to that, we can
now enable it by default in the test suite.

(cherry picked from commit c60035cbf6)
2025-06-03 11:10:16 +00:00
Dawid Mędrek
9938183ace test/boost/tablets_test.cc: Explicitly disable rf_rack_valid_keyspaces in problematic tests
Some of the tests in the file verify more subtle parts of the behavior
of tablets and rely on topology layouts or using keyspaces that violate
the invariant the `rf_rack_valid_keyspaces` configuration option is
trying to enforce. Because of that, we explicitly disable the option
to be able to enable it by default in the rest of the test suite in
the following commit.

(cherry picked from commit 237638f4d3)
2025-06-03 11:10:16 +00:00
Dawid Mędrek
1271b42848 test/boost/tablets_test.cc: Fix indentation in test_load_balancing_with_random_load
(cherry picked from commit 22d6c7e702)
2025-06-03 11:10:16 +00:00
Dawid Mędrek
012e248792 test/boost/tablets_test.cc: Adjust test_load_balancing_with_random_load to RF-rack-validity
We make sure that the keyspaces created in the test are always RF-rack-valid.
To achieve that, we change how the test is performed.

Before this commit, we first created a cluster and then ran the actual test
logic multiple times. Each of those test cases created a keyspace with a random
replication factor.

That cannot work with `rf_rack_valid_keyspaces` set to true. We cannot modify
the property file of a node (see commit: eb5b52f598),
so once we set up the cluster, we cannot adjust its layout to work with another
replication factor.

To solve that issue, we also recreate the cluster in each test case. Now we choose
the replication factor at random, create a cluster distributing nodes across as many
racks as RF, and perform the rest of the logic. We perform it multiple times in
a loop so that the test behaves as before these changes.

(cherry picked from commit fa62f68a57)
2025-06-03 11:10:16 +00:00
Dawid Mędrek
1364eec694 test/boost/tablets_test.cc: Adjust test_load_balancing_works_with_in_progress_transitions to RF-rack-validity
We distribute the nodes used in the test across two racks so we can
run the test with `rf_rack_valid_keyspaces` set to true.

We want to avoid cross-rack migrations and keep the test as realistic
as possible. Since host3 is supposed to function as a new node in the
cluster, we change the layout of it: now, host1 has 2 shards and resides
in a separate rack. Most of the remaining test logic is preserved and behaves
as before this commit.

There is a slight difference in the tablet migrations. Before the commit,
we were migrating a tablet between nodes of different shard counts. Now
it's impossible because it would force us to migrate tablets between racks.
However, since the test wants to simply verify that an ongoing migration
doesn't interfere with load balancing and still leads to a perfect balance,
that still happens: we explicitly migrate ONLY 1 tablet from host2 to host3,
so to achieve the goal, one more tablet needs to be migrated, and we test
that.

(cherry picked from commit cd615c3ef7)
2025-06-03 11:10:16 +00:00
Dawid Mędrek
85fe37a8e4 test/boost/tablets_test.cc: Adjust test_load_balancing_resize_requests to RF-rack-validity
We assign the nodes created by the test to separate racks. It has no impact
on the test since the keyspace used in the test uses RF=2, so the tablet
replicas will still be the same.

(cherry picked from commit 1199c68bac)
2025-06-03 11:10:16 +00:00
Dawid Mędrek
e21bdbb9ef test/boost/tablets_test.cc: Adjust test_load_balancing_with_two_empty_nodes to RF-rack-validity
We distribute the nodes used in the test between two racks. Although
that may affect how tablets behave in general, this change will not
have any real impact on the test. The test verifies that load balancing
eventually balances tablets in the cluster, which will still happen.
Because of that, the changes in this commit are safe to apply.

(cherry picked from commit e4e3b9c3a1)
2025-06-03 11:10:16 +00:00
Dawid Mędrek
ca8762885b test/boost/tablets_test.cc: Adjust test_load_balancer_shuffle_mode to RF-rack-validity
We distribute the nodes used in the test between two racks. Although that
may have an impact on how tablets behave, it's orthogonal to what the test
verifies -- whether the topology coordinator is continuously in the tablet
migration track. Because of that, it's safe to make this change without
influencing the test.

(cherry picked from commit 6e2fb79152)
2025-06-03 11:10:15 +00:00
Michał Chojnowski
3a7a1dc4a9 test/boost/sstable_compressor_factory_test: define a test suite name
It seems that tests in test/boost/combined_tests have to define a test
suite name, otherwise they aren't picked up by test.py.

Fixes #24199

Closes scylladb/scylladb#24200

(cherry picked from commit ff8a119f26)

Closes scylladb/scylladb#24255
2025-06-03 12:01:35 +03:00
Anna Stuchlik
12596a8eca doc: add OS support for ScyllaDB 2025.2
This commit adds the information about support for platforms
in ScyllaDB version 20252.

Fixes https://github.com/scylladb/scylladb/issues/24180

Closes scylladb/scylladb#24263

(cherry picked from commit 28cb5a1e02)

Closes scylladb/scylladb#24335
2025-06-03 10:07:28 +03:00
Anna Stuchlik
be3f50b658 doc: update migration tools overview
This commit updates the migration overview page:

- It removes the info about migration from SSTable to CQL.
- It updates the link to the migrator docs.

Fixes https://github.com/scylladb/scylladb/issues/24247

Refs https://github.com/scylladb/scylladb/pull/21775

Closes scylladb/scylladb#24258

(cherry picked from commit b197d1a617)

Closes scylladb/scylladb#24282
2025-06-03 10:06:42 +03:00
Michał Chojnowski
6cd954de8d utils/stream_compressor: allocate memory for zstd compressors externally
The default and recommended way to use zstd compressors is to let
zstd allocate and free memory for compressors on its own.

That's what we did for zstd compressors used in RPC compression.
But it turns out that it generates allocation patterns we dislike.

We expected zstd not to generate allocations after the context object
is initialized, but it turns out that it tries to downsize the context
sometimes (by reallocation). We don't want that because the allocations
generated by zstd are large (1 MiB with the parameters we use),
so repeating them periodically stresses the reclaimer.

We can avoid this by using the "static context" API of zstd,
in which the memory for context is allocated manually by the user
of the library. In this mode, zstd doesn't allocate anything
on its own.

The implementation details of this patch adds a consideration for
forward compatibility: later versions of Scylla can't use a
window size greater than the one we hardcoded in this patch
when talking to the old version of the decompressor.

(This is not a problem, since those compressors are only used
for RPC compression at the moment, where cross-version communication
can be prevented by bumping COMPRESSOR_NAME. But it's something
that the developer who changes the window size must _remember_ to do).

Fixes #24160
Fixes #24183

Closes scylladb/scylladb#24161

(cherry picked from commit 185a032044)

Closes scylladb/scylladb#24281
2025-06-03 10:02:34 +03:00
Botond Dénes
9a7ea917eb mutation/mutation_compactor: cache regular/shadowable max-purgable in separate members
Max purgeable has two possible values for each partition: one for
regular tombstones and one for shadowable ones. Yet currently a single
member is used to cache the max-purgeable value for the partition, so
whichever kind of tombstone is checked first, its max-purgeable will
become sticky and apply to the other kind of tombstones too. E.g. if the
first can_gc() check is for a regular tombstone, its max-purgeable will
apply to shadowable tombstones in the partition too, meaning they might
not be purged, even though they are purgeable, as the shadowable
max-purgeable is expected to be more lenient. The other way around is
worse, as it will result in regular tombstone being incorrectly purged,
permitted by the more lenient shadowable tombstone max-purgeable.
Fix this by caching the two possible values in two separate members.
A reproducer unit test is also added.

Fixes: scylladb/scylladb#23272

Closes scylladb/scylladb#24171

(cherry picked from commit 7db956965e)

Closes scylladb/scylladb#24329
2025-06-03 09:51:52 +03:00
Ran Regev
c5cff9e14f changed the string literals into the correct ones
Fixes: #23970

use correct string literals:
KMIP_TAG_CRYPTOGRAPHIC_LENGTH_STR --> KMIP_TAGSTR_CRYPTOGRAPHIC_LENGTH
KMIP_TAG_CRYPTOGRAPHIC_USAGE_MASK_STR --> KMIP_TAGSTR_CRYPTOGRAPHIC_USAGE_MASK

From https://github.com/scylladb/scylladb/issues/23970 description of the
problem (emphasizes are mine):

When transparent data encryption at rest is enabled with KMIP as a key
provider, the observation is that before creating a new key, Scylla tries
to locate an existing key with provided specifications (key algorithm &
length), with the intention to re-use existing key, **but the attributes
sent in the request have minor spelling mistakes** which are rejected by
the KMIP server key provider, and hence scylla assumes that a key with
these specifications doesn't exist, and creates a new key in the KMIP
server. The issue here is that for every new table, ScyllaDB will create
a key in the KMIP server, which could clutter the KMS, and make key
lifecycle management difficult for DBAs.

Closes scylladb/scylladb#24057

(cherry picked from commit 37854acc92)

Closes scylladb/scylladb#24303
2025-06-02 15:11:53 +03:00
Michael Litvak
5aca2c134d test_cdc_generation_publishing: fix to read monotonically
The test test_multiple_unpublished_cdc_generations reads the CDC
generation timestamps to verify they are published in the correct order.
To do so it issues reads in a loop with a short sleep period and checks
the differences between consecutive reads, assuming they are monotonic.

However the assumption that the reads are monotonic is not valid,
because the reads are issued with consistency_level=ONE, thus we may read
timestamps {A,B} from some node, then read timestamps {A} from another
node that didn't apply the write of the new timestamp B yet. This will
trigger the assert in the test and fail.

To ensure the reads are monotonic we change the test to use consistency
level ALL for the reads.

Fixes scylladb/scylladb#24262

Closes scylladb/scylladb#24272

(cherry picked from commit 3a1be33143)

Closes scylladb/scylladb#24336
2025-06-02 14:42:57 +03:00
Anna Stuchlik
cc299e335d doc: remove copyright from Cassandra Stress
This commit removes the Apache copyright note from the Cassandra Stress page.

It's a follow up to https://github.com/scylladb/scylladb/pull/21723, which missed
that update (see https://github.com/scylladb/scylladb/pull/21723#discussion_r1944357143).

Cassandra Stress is a separate tool with separate repo with the docs, so the copyright
information on the page is incorrect.

Fixes https://github.com/scylladb/scylladb/issues/23240

Closes scylladb/scylladb#24219

(cherry picked from commit d303edbc39)

Closes scylladb/scylladb#24256
2025-06-02 14:41:34 +03:00
David Garcia
a7b34a54bc docs: fix \t (tab) is not rendered correctly
Closes scylladb/scylladb#24096

(cherry picked from commit bf9534e2b5)

Closes scylladb/scylladb#24257
2025-06-02 14:40:54 +03:00
Pavel Emelyanov
eb78d3aefb test/result_utils: Do not assume map_reduce reducing order
When map_reduce is called on a collection, one shouldn't expect that it
processes the elements of the collection in any specific order.

Current test of map-reduce over boost outcome assumes that if reduce
function is the string concatenation, then it would concatenate the
given vector of strings in the order they are listed. That requirement
should be relaxed, and the result may have reversed concatentation.

Fixes scylladb/scylladb#24321

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24325

(cherry picked from commit a65ffdd0df)

Closes scylladb/scylladb#24337
2025-06-02 14:00:07 +03:00
Jenkins Promoter
304f47f6ec Update ScyllaDB version to: 2025.2.0-rc3 2025-06-01 15:29:44 +03:00
Ferenc Szili
8ba5a1be70 tablets: trigger mutation source refresh on tablet count change
Consider the following scenario:

- let's assume tablet 0 has range [1, 5] (pre merge)
- tablet merge happens, tablet 0 has now range [1, 10]
- tablet_sstable_set isn't refreshed, so holds a stale state, thinks tablet
  0 still has range [1, 5]
- during a full scan, forward service will intersect the full range with
  tablet ranges and consume one tablet at a time
- replica service is asked to consume range [1, 10] of tablet 0 (post merge)

We have two possible outcomes:

With cache bypass:
1) cache reader is bypassed
2) sstable reader is created on range [1, 10]
3) unrefreshed tablet_sstable_set holds stale state, but select correctly
   all sstables intersecting with range [1, 10]

With cache:
1) cache reader is created
2) finds partition with token 5 is cached
3) sstable reader is created on range [1, 4] (later would fast forward to
   range [6, 10]; also belongs to tablet 0)
4) incremental selector consumes the pre-merge sstable spanning range [1, 5]
4.1) since the partitioned_sstable_set pre-merge contains only that sstable,
     EOS is reached
4.2) since EOS is reached, the fast forward to range [6, 10] is not allowed.

So with the set refreshed, sstable set is aligned with tablet ranges, and no
premature EOS is signalled, otherwise preventing fast forward to from
happening and all data from being properly captured in the read.

This change fixes the bug and triggeres a mutation source refresh whenever
the number of tablets for the table has changed, not only when we have
incoming tablets.

Fixes: #23313
(cherry picked from commit d0329ca370)
2025-05-30 17:08:45 +00:00
Anna Stuchlik
20602b6a8b doc: clarify RF increase issues for tablets vs. vnodes
This commit updates the guidelines for increasing the Replication Factor
depending on whether tablets are enabled or disabled.

To present it in a clear way, I've reorganized the page.

Fixes https://github.com/scylladb/scylladb/issues/23667

Closes scylladb/scylladb#24221

(cherry picked from commit efce03ef43)

Closes scylladb/scylladb#24284
2025-05-30 15:16:17 +03:00
Botond Dénes
19513fa47e Merge '[Backport 2025.2] raft_sys_table_storage: avoid temp buffer when deserializing log_entry' from Scylladb[bot]
The get_blob method linearizes data by copying it into a single buffer, which can cause 'oversized allocation' warnings.

In this commit we avoid copying by creating input stream on top of the original fragmened managed bytes, returned by untyped_result_set_row::get_view.

fixes scylladb/scylladb#23903

backport: no need, not a critical issue.

- (cherry picked from commit 6496ae6573)

- (cherry picked from commit f245b05022)

Parent PR: #24123

Closes scylladb/scylladb#24317

* github.com:scylladb/scylladb:
  raft_sys_table_storage: avoid temporary buffer when deserializing log_entry
  serializer_impl.hh:  add as_input_stream(managed_bytes_view) overload
2025-05-30 09:14:43 +03:00
Wojciech Mitros
dec10d348e test: actually wait for tablets to distribute across nodes
In test_tablet_mv_replica_pairing_during_replace, after we create
the tables, we want to wait for their tablets to distribute evenly
across nodes and we have a wait_for for that.
But we don't await this wait_for, so it's a no-op. This patch fixes
it by adding the missing await.

Refs scylladb/scylladb#23982
Refs scylladb/scylladb#23997

Closes scylladb/scylladb#24250

(cherry picked from commit 5074daf1b7)

Closes scylladb/scylladb#24311
2025-05-29 16:44:51 +02:00
Petr Gusev
ffea5e67c1 raft_sys_table_storage: avoid temporary buffer when deserializing log_entry
The get_blob() method linearizes data by copying it into a
single buffer, which can trigger "oversized allocation" warnings.
This commit avoids that extra copy by creating an input stream
directly over the original fragmented managed bytes returned by
untyped_result_set_row::get_view().

Fixes scylladb/scylladb#23903

(cherry picked from commit f245b05022)
2025-05-29 08:42:09 +00:00
Petr Gusev
bcbbc40026 serializer_impl.hh: add as_input_stream(managed_bytes_view) overload
It's useful to have it here so that people can find it easily.

(cherry picked from commit 6496ae6573)
2025-05-29 08:42:09 +00:00
Anna Stuchlik
70d9352cec doc: remove the redundant pages
This commit removes two redundant pages and adds the related redirections.

- The Tutorials page is a duplicate and is not maintained anymore.
  Having it in the docs hurts the SEO of the up-to-date Tutorias page.
- The Contributing page is not helpful. Contributions-related information
  should be maintained in the project README file.

Fixes https://github.com/scylladb/scylladb/issues/17279
Fixes https://github.com/scylladb/scylladb/issues/24060

Closes scylladb/scylladb#24090

(cherry picked from commit eed8373b77)

Closes scylladb/scylladb#24220
2025-05-26 10:30:03 +03:00
Pavel Emelyanov
e215350c61 Revert "encryption_test: Catch exact exception"
This reverts commit 59bf300e83.

KMS tests became flaky after it: #24218
Need to revisit.
2025-05-20 13:51:07 +03:00
Ernest Zaslavsky
59bf300e83 encryption_test: Catch exact exception
Apparently `test_kms_network_error` will succeed at any circumstances since most of our exceptions derive from `std::exception`, so whatever happens to the test, for whatever reason it will throw, the test will be marked as passed.

Start catching the exact exception that we expect to be thrown.

Closes scylladb/scylladb#24065

(cherry picked from commit 2d5c0f0cfd)

Closes scylladb/scylladb#24147
2025-05-20 08:27:56 +03:00
Aleksandra Martyniuk
6d733051de cql_test_env: main: move stream_manager initialization
Currently, stream_manager is initialized after storage_service and
so it is stopped before the storage_service is. In its stop method
storage_service accesses stream_manager which is uninitialized
at a time.

Move stream_manager initialization over the storage_service initialization.

Fixes: #23207.

Closes scylladb/scylladb#24008

(cherry picked from commit 9c03255fd2)

Closes scylladb/scylladb#24190
2025-05-20 08:27:26 +03:00
Ernest Zaslavsky
24c134992b database_test: Wait for the index to be created
Just call `wait_until_built` for the index in question

fix: https://github.com/scylladb/scylladb/issues/24059

Closes scylladb/scylladb#24117

(cherry picked from commit 4a7c847cba)

Closes scylladb/scylladb#24132
2025-05-19 12:08:41 +03:00
Wojciech Mitros
9247c9472a mv: remove queue length limit from the view update read concurrency semaphore
Each view update is correlated to a write that generates it (aside from view
building which is throttled separately). These writes are limited by a throttling
mechanism, which effectively works by performing the writes with CL=ALL if
ongoing writes exceed some memory usage limit

When writes generate view updates, they usually also need to perform a read. This read
goes through a read concurrency semaphore where it can get delayed or killed. The
semaphore allows up to 100 concurrent reads and puts all remaining reads in a queue.
If the number of queued reads exceeds a specific limit, the view update will fail on
the replica, causing inconsistencies.

This limit is not necessary. When a read gets queued on the semaphore, the write that's
causing the view update is paused, so the write takes part in the regular write throttling.
If too many writes get stuck on view update reads, they will get throttled, so their
number is limited and the number of queued reads is also limited to the same amount.

In this patch we remove the specified queue length limit for the view update read concurrency
semaphore. Instead of this limit, the queue will be now limited indirectly, by the base write
throttling mechanism. This may allow the queue grow longer than with the previous limit, but
it shouldn't ever cause issues - we only perform up to 100 actual reads at once, and the
remaining ones that get queued use a tiny amount of memory, less than the writes that generated
them and which are getting limited directly.

Fixes https://github.com/scylladb/scylladb/issues/23319

Closes scylladb/scylladb#24112

(cherry picked from commit 5920647617)

Closes scylladb/scylladb#24170
2025-05-19 12:05:48 +03:00
Anna Stuchlik
ab8d50b5e7 doc: fix the product name for version 2025.1
Starting with 2025.1, ScyllaDB versions are no longer called "Enterprise",
but the OS support page still uses that label.
This commit fixes that by replacing "Enterprise" with "ScyllaDB".

This update is required since we've removed "Enterprise" from everywhere else,
including the commands, so having it here is confusing.

Fixes https://github.com/scylladb/scylladb/issues/24179

Closes scylladb/scylladb#24181

(cherry picked from commit 2d7db0867c)

Closes scylladb/scylladb#24204
2025-05-19 12:03:35 +03:00
Dawid Mędrek
7986ef73da locator/production_snitch_base: Reduce log level when property file incomplete
We're reducing the log level in case the provided property file is incomplete.
The rationale behind this change is related to how CCM interacts with Scylla:

* The `GossipingPropertyFileSnitch` reloads the `cassandra-rackdc.properties`
  configuration every 60 seconds.
* When a new node is added to the cluster, CCM recreates the
  `cassandra-rackdc.properties` file for EVERY node.

If those two processes start happening at about the same time, it may lead
to Scylla trying to read a not-completely-recreated file, and an error will
be produced.

Although we would normally fix this issue and try to avoid the race, that
behavior will be no longer relevant as we're making the rack and DC values
immutable (cf. scylladb/scylladb#23278). What's more, trying to fix the problem
in the older versions of Scylla could bring a more serious regression. Having
that in mind, this commit is a compromise between making CI less flaky and
having minimal impact when backported.

We do the same for when the format of the file is invalid: the rationale
is the same.

We also do that for when there is a double declaration. Although it seems
impossible that this can stem from the same scenario the other two errors
can (since if the format of the file is valid, the error is justified;
if the format is invalid, it should be detected sooner than a doubled
declaration), let's stay consistent with the logging level.

Fixes scylladb/scylladb#20092

Closes scylladb/scylladb#23956

(cherry picked from commit 9ebd6df43a)

Closes scylladb/scylladb#24143
2025-05-16 11:51:23 +03:00
Wojciech Mitros
847504ad25 test_mv_tablets_replace: wait for tablet replicas to balance before working on them
In the test test_tablet_mv_replica_pairing_during_replace we stop 2 out of 4 servers while using RF=2.
Even though in the test we use exactly 4 tablets (1 for each replica of a base table and view), intially,
the tablets may not be split evenly between all nodes. Because of this, even when we chose a server that
hosts the view and a different server that hosts the base table, we sometimes stoped all replicas of the
base or the view table because the node with the base table replica may also be a view replica.

After some time, the tablets should be distributed across all nodes. When that happens, there will be
no common nodes with a base and view replica, so the test scenario will continue as planned.

In this patch, we add this waiting period after creating the base and view, and continue the test only
when all 4 tablets are on distinct nodes.

Fixes https://github.com/scylladb/scylladb/issues/23982
Fixes https://github.com/scylladb/scylladb/issues/23997

Closes scylladb/scylladb#24111

(cherry picked from commit bceb64fb5a)

Closes scylladb/scylladb#24126
2025-05-16 11:51:07 +03:00
Pavel Emelyanov
854587c10c Merge '[Backport 2025.2] test/cluster: Adjust tests to RF-rack-valid keyspaces' from Scylladb[bot]
In this PR, we're adjusting most of the cluster tests so that they pass
with the `rf_rack_valid_keyspaces` configuration option enabled. In most
cases, the changes are straightforward and require little to no additional
insight into what the tests are doing or verifying. In some, however, doing
that does require a deeper understanding of the tests we're modifying.
The justification for those changes and their correctness is included in
the commit messages corresponding to them.

Note that this PR does not cover all of the cluster tests. There are few
remaining ones, but they require a bit more effort, so we delegate that
work to a separate PR.

I tested all of the modified tests locally with `rf_rack_valid_keyspaces`
set to true, and they all passed.

Fixes scylladb/scylladb#23959

Backport: we want to backport these changes to 2025.1 since that's the version where we introduced RF-rack-valid keyspaces in. Although the tests are not, by default, run with `rf_rack_valid_keyspaces` enabled yet, that will most likely change in the near future and we'll also want to backport those changes too. The reason for this is that we want to verify that Scylla works correctly even with that constraint.

- (cherry picked from commit dbb8835fdf)

- (cherry picked from commit 9281bff0e3)

- (cherry picked from commit 5b83304b38)

- (cherry picked from commit 73b22d4f6b)

- (cherry picked from commit 2882b7e48a)

- (cherry picked from commit 4c46551c6b)

- (cherry picked from commit 92f7d5bf10)

- (cherry picked from commit 5d1bb8ebc5)

- (cherry picked from commit d3c0cd6d9d)

- (cherry picked from commit 04567c28a3)

- (cherry picked from commit c8c28dae92)

- (cherry picked from commit c4b32c38a3)

- (cherry picked from commit ee96f8dcfc)

Parent PR: #23661

Closes scylladb/scylladb#24121

* github.com:scylladb/scylladb:
  test/cluster/suite.yaml: Enable rf_rack_valid_keyspaces in suite
  test/cluster: Disable rf_rack_valid_keyspaces in problematic tests
  test/cluster/test_tablets: Divide rack into two to adjust tests to RF-rack-validity
  test/cluster/test_tablets: Adjust test_tablet_rf_change to RF-rack-validity
  test/cluster/test_tablet_repair_scheduler.py: Adjust to RF-rack-validity
  test/pylib/repair.py: Assign nodes to multiple racks in create_table_insert_data_for_repair
  test/cluster/test_zero_token_nodes_topology_ops: Adjust to RF-rack-validity
  test/cluster/test_zero_token_nodes_no_replication.py: Adjust to RF-rack-validity
  test/cluster/test_zero_token_nodes_multidc.py: Adjust to RF-rack-validity
  test/cluster/test_not_enough_token_owners.py: Adjust to RF-rack-validity
  test/cluster/test_multidc.py: Adjust to RF-rack-validity
  test/cluster/object_store/test_backup.py: Adjust to RF-rack-validity
  test/cluster: Adjust simple tests to RF-rack-validity
2025-05-16 11:50:45 +03:00
Pavel Emelyanov
9058d5658b Merge '[Backport 2025.2] logalloc_test: don't test performance in test background_reclaim' from Scylladb[bot]
The test is failing in CI sometimes due to performance reasons.

There are at least two problems:
1. The initial 500ms (wall time) sleep might be too short. If the reclaimer
   doesn't manage to evict enough memory during this time, the test will fail.
2. During the 100ms (thread CPU time) window given by the test to background
   reclaim, the `background_reclaim` scheduling group isn't actually
   guaranteed to get any CPU, regardless of shares. If the process is
   switched out inside the `background_reclaim` group, it might
   accumulate so much vruntime that it won't get any more CPU again
   for a long time.

We have seen both.

This kind of timing test can't be run reliably on overcommitted machines
without modifying the Seastar scheduler to support that (by e.g. using
thread clock instead of wall time clock in the scheduler), and that would
require an amount of effort disproportionate to the value of the test.

So for now, to unflake the test, this patch removes the performance test
part. (And the tradeoff is a weakening of the test). After the patch,
we only check that the background reclaim happens *eventually*.

Fixes https://github.com/scylladb/scylladb/issues/15677

Backporting this is optional. The test is flaky even in stable branches, but the failure is rare.

- (cherry picked from commit c47f438db3)

- (cherry picked from commit 1c1741cfbc)

Parent PR: #24030

Closes scylladb/scylladb#24094

* github.com:scylladb/scylladb:
  logalloc_test: don't test performance in test `background_reclaim`
  logalloc: make background_reclaimer::free_memory_threshold publicly visible
2025-05-16 11:50:17 +03:00
Gleb Natapov
dd9ec03323 topology coordinator: make decommissioning node non voter before completing the operation
A decommissioned node is removed from a raft config after operation is
marked as completed. This is required since otherwise the decommissioned
node will not see that decommission has completed (the status is
propagated through raft). But right after the decommission is marked as
completed a decommissioned node may terminate, so in case of a two node
cluster, the configuration change that removes it from the raft will fail,
because there will no be quorum.

The solution is to mark the decommissioning node as non voter before
reporting the operation as completed.

Fixes: #24026

Backport to 2025.2 because it fixes a potential hang. Don't backport to
branches older than 2025.2 because they don't have
8b186ab0ff, which caused this issue.

Closes scylladb/scylladb#24027

(cherry picked from commit c6e1758457)

Closes scylladb/scylladb#24093
2025-05-16 11:49:46 +03:00
Pavel Emelyanov
eef6a95e26 Merge '[Backport 2025.2] replica: Fix use-after-free with concurrent schema change and sstable set update' from Scylladb[bot]
When schema is changed, sstable set is updated according to the compaction strategy of the new schema (no changes to set are actually made, just the underlying set type is updated), but the problem is that it happens without a lock, causing a use-after-free when running concurrently to another set update.

Example:

1) A: sstable set is being updated on compaction completion
2) B: schema change updates the set (it's non deferring, so it happens in one go) and frees the set used by A.
3) when A resumes, system will likely crash since the set is freed already.

ASAN screams about it:
SUMMARY: AddressSanitizer: heap-use-after-free sstables/sstable_set.cc ...

Fix is about deferring update of the set on schema change to compaction, which is triggered after new schema is set. Only strategy state and backlog tracker are updated immediately, which is fine since strategy doesn't depend on any particular implementation of sstable set.

Fixes #22040.

- (cherry picked from commit 628bec4dbd)

- (cherry picked from commit 434c2c4649)

Parent PR: #23680

Closes scylladb/scylladb#24085

* github.com:scylladb/scylladb:
  replica: Fix use-after-free with concurrent schema change and sstable set update
  sstables: Implement sstable_set_impl::all_sstable_runs()
2025-05-16 11:49:21 +03:00
Piotr Smaron
9f2a13c8c2 cql: fix CREATE tablets KS warning msg
Materialized Views and Secondary Indexes are yet another features that
keyspaces with tablets do not support, but these were not listed in a
warning message returned to the user on CREATE KEYSPACE statement. This
commit adds the 2 missing features.

Fixes: #24006

Closes scylladb/scylladb#23902

(cherry picked from commit f740f9f0e1)

Closes scylladb/scylladb#24084
2025-05-16 11:49:00 +03:00
Piotr Dulikowski
4792a27396 topology_coordinator: silence ERROR messages on abort
When the topology coordinator is shut down while doing a long-running
operation, the current operation might throw a raft::request_aborted
exception. This is not a critical issue and should not be logged with
ERROR verbosity level.

Make sure that all the try..catch blocks in the topology coordinator
which:

- May try to acquire a new group0 guard in the `try` part
- Have a `catch (...)` block that print an ERROR-level message

...have a pass-through `catch (raft::request_aborted&)` block which does
not log the exception.

Fixes: scylladb/scylladb#22649

Closes scylladb/scylladb#23962

(cherry picked from commit 156ff8798b)

Closes scylladb/scylladb#24082
2025-05-16 11:48:43 +03:00
Aleksandra Martyniuk
f26c2b22dc test_tablet_repair_hosts_filter: change injected error
test_tablet_repair_hosts_filter checks whether the host filter
specfied for tablet repair is correctly persisted. To check this,
we need to ensure that the repair is still ongoing and its data
is kept. The test achieves that by failing the repair on replica
side - as the failed repair is going to be retried.

However, if the filter does not contain any host (included_host_count = 0),
the repair is started on no replica, so the request succeeds
and its data is deleted. The test fails if it checks the filter
after repair request data is removed.

Fail repair on topology coordinator side, so the request is ongoing
regardless of the specified hosts.

Fixes: #23986.

Closes scylladb/scylladb#24003

(cherry picked from commit 2549f5e16b)

Closes scylladb/scylladb#24080
2025-05-16 11:48:27 +03:00
Botond Dénes
163b65cec4 tools/scylla-nodetool: status: handle negative load sizes
Negative load sizes don't make sense, but we've seen a case in
production, where a negative number was returned by ScyllaDB REST API,
so be prepared to handle these too.

Fixes: scylladb/scylladb#24134

Closes scylladb/scylladb#24135

(cherry picked from commit 700a5f86ed)

Closes scylladb/scylladb#24169
2025-05-15 17:36:48 +03:00
Aleksandra Martyniuk
fcde30d2b0 streaming: use host_id in file streaming
Use host ids instead of ips in file-streaming.

Fixes: #22421.

Closes scylladb/scylladb#24055

(cherry picked from commit 2dcea5a27d)

Closes scylladb/scylladb#24119
2025-05-14 22:13:48 +02:00
Jenkins Promoter
26bd28dac9 Update ScyllaDB version to: 2025.2.0-rc2 2025-05-14 20:59:54 +03:00
Jenkins Promoter
6f1efcff31 Update ScyllaDB version to: 2025.2.0-rc1 2025-05-13 22:48:32 +03:00
Dawid Mędrek
204f9e2cc8 test/cluster/suite.yaml: Enable rf_rack_valid_keyspaces in suite
Almost all of the tests have been adjusted to be able to be run with
the `rf_rack_valid_keyspaces` configuration option enabled, while
the rest, a minority, create nodes with it disabled. Thanks to that,
we can enable it by default, so let's do that.

(cherry picked from commit ee96f8dcfc)
2025-05-12 23:11:34 +02:00
Dawid Mędrek
0c6a449a30 test/cluster: Disable rf_rack_valid_keyspaces in problematic tests
Some of the tests in the test suite have proven to be more problematic
in adjusting to RF-rack-validity. Since we'd like to run as many tests
as possible with the `rf_rack_valid_keyspaces` configuration option
enabled, let's disable it in those. In the following commit, we'll enable
it by default.

(cherry picked from commit c4b32c38a3)
2025-05-12 23:11:30 +02:00
Botond Dénes
7673a17365 Merge 'compress: fix an internal error when a specific debug log is enabled' from Michał Chojnowski
compress: fix an internal error when a specific debug log is enabled
While iterating over the recent 69684e16d8,
series I shot myself in the foot by defining `algorithm_to_name(algorithm::none)`
to be an internal error, and later calling that anyway in a debug log.

(Tests didn't catch it because there's no test which simultaneously
enables the debug log and configures some table to have no compression).

This proves that `algorithm_to_name` is too much of a footgun.
Fix it so that calling `algorithm_to_name(algorithm::none)` is legal.
In hindsight, I should have done that immediately.

Fixes #23624

Fix for recently-added code, no backporting needed.

Closes scylladb/scylladb#23625

* github.com:scylladb/scylladb:
  test_sstable_compression_dictionaries: reproduce an internal error in debug logging
  compress: fix an internal error when a specific debug log is enabled

(cherry picked from commit 746382257c)
2025-05-12 23:13:59 +03:00
Avi Kivity
ae05d62b97 Merge '[Backport 2025.2] compress: make sstable compression dictionaries NUMA-aware ' from Scylladb[bot]
compress: distribute compression dictionaries over shards
We don't want each shard to have its own copy of each dictionary.
It would unnecessary pressure on cache and memory.
Instead, we want to share dictionaries between shards.

Before this commit, all dictionaries live on shard 0.
All other shards borrow foreign shared pointers from shard 0.

There's a problem with this setup: dictionary blobs receive many random
accesses. If shard 0 is on a remote NUMA node, this could pose
a performance problem.

Therefore, for each dictionary, we would like to have one copy per NUMA node,
not one copy per the entire machine. And each shard should use the copy
belonging to its own NUMA node. This is the main goal of this patch.

There is another issue with putting all dicts on shard 0: it eats
an assymetric amount of memory from shard 0.
This commit spreads the ownership of dicts over all shards within
the NUMA group, to make the situation more symmetric.
(Dict owner is decided based on the hash of dict contents).

It should be noted that the last part isn't necessarily a good thing,
though.
While it makes the situation more symmetric within each node,
it makes it less symmetric across the cluster, if different node
sizes are present.

If dicts occupy 1% of memory on each shard of a 100-shard node,
then the same dicts would occupy 100% of memory on a 1-shard node.

So for the sake of cluster-wide symmetry, we might later want to consider
e.g. making the memory limit for dictionaries inversely proportional
to the number of shards.

New functionality, added to a feature which isn't in any stable branch yet. No backporting.

Edit: no backporting to <=2025.1, but need backporting to 2025.2, where the feature is introduced.

Fixes #24108

- (cherry picked from commit 0e4d0ded8d)

- (cherry picked from commit 8649adafa8)

- (cherry picked from commit 1bcf77951c)

- (cherry picked from commit 6b831aaf1b)

- (cherry picked from commit e952992560)

- (cherry picked from commit 66a454f61d)

- (cherry picked from commit 518f04f1c4)

- (cherry picked from commit f075674ebe)

Parent PR: #23590

Closes scylladb/scylladb#24109

* github.com:scylladb/scylladb:
  test: add test/boost/sstable_compressor_factory_test
  compress: add some test-only APIs
  compress: rename sstable_compressor_factory_impl to dictionary_holder
  compress: fix indentation
  compress: remove sstable_compressor_factory_impl::_owner_shard
  compress: distribute compression dictionaries over shards
  test: switch uses of make_sstable_compressor_factory() to a seastar::thread-dependent version
  test: remove sstables::test_env::do_with()
2025-05-12 23:11:12 +03:00
Dawid Mędrek
5c5911d874 test/cluster/test_tablets: Divide rack into two to adjust tests to RF-rack-validity
Three tests in the file use a multi-DC cluster. Unfortunately, they put
all of the nodes in a DC in the same rack and because of that, they fail
when run with the `rf_rack_valid_keyspaces` configuration option enabled.
Since the tests revolve mostly around zero-token nodes and how they
affect replication in a keyspace, this change should have zero impact on
them.

(cherry picked from commit c8c28dae92)
2025-05-12 13:10:12 +00:00
Dawid Mędrek
6a2e52d250 test/cluster/test_tablets: Adjust test_tablet_rf_change to RF-rack-validity
We reduce the number of nodes and the RF values used in the test
to make sure that the test can be run with the `rf_rack_valid_keyspaces`
configuration option. The test doesn't seem to be reliant on the
exact number of nodes, so the reduction should not make any difference.

(cherry picked from commit 04567c28a3)
2025-05-12 13:10:12 +00:00
Dawid Mędrek
f98c83b92f test/cluster/test_tablet_repair_scheduler.py: Adjust to RF-rack-validity
The change boils down to matching the number of created racks to the number
of created nodes in each DC in the auxiliary function `prepare_multi_dc_repair`.
This way, we ensure that the created keyspace will be RF-rack-valid and so
we can run the test file even with the `rf_rack_valid_keyspaces` configuration
option enabled.

The change has no impact on the tests that use the function; the distribution
of nodes across racks does not affect how repair is performed or what the
tests do and verify. Because of that, the change is correct.

(cherry picked from commit d3c0cd6d9d)
2025-05-12 13:10:12 +00:00
Dawid Mędrek
f5cf4a3893 test/pylib/repair.py: Assign nodes to multiple racks in create_table_insert_data_for_repair
We assign the newly created nodes to multiple racks. If RF <= 3,
we create as many racks as the provided RF. We disallow the case
of  RF > 3 to avoid trying to create an RF-rack-invalid keyspace;
note that no existing test calls `create_table_insert_data_for_repair`
providing a higher RF. The rationale for doing this is we want to ensure
that the tests calling the function can be run with the
`rf_rack_valid_keyspaces` configuration option enabled.

(cherry picked from commit 5d1bb8ebc5)
2025-05-12 13:10:12 +00:00
Dawid Mędrek
12f0136b26 test/cluster/test_zero_token_nodes_topology_ops: Adjust to RF-rack-validity
We assign the nodes to the same DC, but multiple racks to ensure that
the created keyspace is RF-rack-valid and we can run the test with
the `rf_rack_valid_keyspaces` configuration option enabled. The changes
do not affect what the test does and verifies.

(cherry picked from commit 92f7d5bf10)
2025-05-12 13:10:12 +00:00
Dawid Mędrek
4e45ceda21 test/cluster/test_zero_token_nodes_no_replication.py: Adjust to RF-rack-validity
We simply assign the nodes used in the test to seprate racks to
ensure that the created keyspace is RF-rack-valid to be able
to run the test with the `rf_rack_valid_keyspaces` configuration
option set to true. The change does not affect what the test
does and verifies -- it only depends on the type of nodes,
whether they are normal token owners or not -- and so the changes
are correct in that sense.

(cherry picked from commit 4c46551c6b)
2025-05-12 13:10:12 +00:00
Dawid Mędrek
2c8b5143ba test/cluster/test_zero_token_nodes_multidc.py: Adjust to RF-rack-validity
We parameterize the test so it's run with and without enforced
RF-rack-valid keyspaces. In the test itself, we introduce a branch
to make sure that we won't run into a situation where we're
attempting to create an RF-rack-invalid keyspace.

Since the `rf_rack_valid_keyspaces` option is not commonly used yet
and because its semantics will most likely change in the future, we
decide to parameterize the test rather than try to get rid of some
of the test cases that are problematic with the option enabled.

(cherry picked from commit 2882b7e48a)
2025-05-12 13:10:12 +00:00
Dawid Mędrek
474de0f048 test/cluster/test_not_enough_token_owners.py: Adjust to RF-rack-validity
We simply assign DC/rack properties to every node used in the test.
We put all of them in the same DC to make sure that the cluster behaves
as closely to how it would before these changes. However, we distribute
them over multiple racks to ensure that the keyspace used in the test
is RF-rack-valid, so we can also run it with the `rf_rack_valid_keyspaces`
configuration option set to true. The distribution of nodes between racks
has no effect on what the test does and verifies, so the changes are
correct in that sense.

(cherry picked from commit 73b22d4f6b)
2025-05-12 13:10:12 +00:00
Dawid Mędrek
5ac07a6c72 test/cluster/test_multidc.py: Adjust to RF-rack-validity
Instead of putting all of the nodes in a DC in the same rack
in `test_putget_2dc_with_rf`, we assign them to different racks.
The distribution of nodes in racks is orthogonal to what the test
is doing and verifying, so the change is correct in that sense.
At the same time, it ensures that the test never violates the
invariant of RF-rack-valid keyspaces, so we can also run it
with `rf_rack_valid_keyspaces` set to true.

(cherry picked from commit 5b83304b38)
2025-05-12 13:10:11 +00:00
Dawid Mędrek
f88d8edcaf test/cluster/object_store/test_backup.py: Adjust to RF-rack-validity
We modify the parameters of `test_restore_with_streaming_scopes`
so that it now represents a pair of values: topology layout and
the value `rf_rack_valid_keyspaces` should be set to.

Two of the already existing parameters violate RF-rack-validity
and so the test would fail when run with `rf_rack_valid_keyspaces: true`.
However, since the option isn't commonly used yet and since the
semantics of RF-rack-valid keyspaces will most likely change in
the future, let's keep those cases and just run them with the
option disabled. This way, we still test everything we can
without running into undesired failures that don't indicate anything.

(cherry picked from commit 9281bff0e3)
2025-05-12 13:10:11 +00:00
Dawid Mędrek
05c70b0820 test/cluster: Adjust simple tests to RF-rack-validity
We adjust all of the simple cases of cluster tests so they work
with `rf_rack_valid_keyspaces: true`. It boils down to assigning
nodes to multiple racks. For most of the changes, we do that by:

* Using `pytest.mark.prepare_3_racks_cluster` instead of
  `pytest.mark.prepare_3_nodes_cluster`.
* Using an additional argument -- `auto_rack_dc` -- when calling
  `ManagerClient::servers_add()`.

In some cases, we need to assign the racks manually, which may be
less obvious, but in every such situation, the tests didn't rely
on that assignment, so that doesn't affect them or what they verify.

(cherry picked from commit dbb8835fdf)
2025-05-12 13:10:11 +00:00
Michał Chojnowski
732321e3b8 test: add test/boost/sstable_compressor_factory_test
Add a basic test for NUMA awareness of `default_sstable_compressor_factory`.

(cherry picked from commit f075674ebe)
2025-05-12 09:12:05 +00:00
Michał Chojnowski
a2622e1919 compress: add some test-only APIs
Will be needed by the test added in the next patch.

(cherry picked from commit 518f04f1c4)
2025-05-12 09:12:05 +00:00
Michał Chojnowski
270bf34846 compress: rename sstable_compressor_factory_impl to dictionary_holder
Since sstable_compressor_factory_impl no longer
implements sstable_compressor_factory, the name can be
misleading. Rename it to something closer to its new role.

(cherry picked from commit 66a454f61d)
2025-05-12 09:12:05 +00:00
Michał Chojnowski
168f694c5d compress: fix indentation
Purely cosmetic.

(cherry picked from commit e952992560)
2025-05-12 09:12:05 +00:00
Michał Chojnowski
b5579be915 compress: remove sstable_compressor_factory_impl::_owner_shard
Before the series, sstable_compressor_factory_impl was directly
accessed by multiple shards. Now, it's a part of a `sharded`
data structure and is never directly from other shards,
so there's no need to check for that. Remove the leftover logic.

(cherry picked from commit 6b831aaf1b)
2025-05-12 09:12:05 +00:00
Michał Chojnowski
ad60d765f9 compress: distribute compression dictionaries over shards
We don't want each shard to have its own copy of each dictionary.
It would unnecessary pressure on cache and memory.
Instead, we want to share dictionaries between shards.

Before this commit, all dictionaries live on shard 0.
All other shards borrow foreign shared pointers from shard 0.

There's a problem with this setup: dictionary blobs receive many random
accesses. If shard 0 is on a remote NUMA node, this could pose
a performance problem.

Therefore, for each dictionary, we would like to have one copy per NUMA node,
not one copy per the entire machine. And each shard should use the copy
belonging to its own NUMA node. This is the main goal of this patch.

There is another issue with putting all dicts on shard 0: it eats
an assymetric amount of memory from shard 0.
This commit spreads the ownership of dicts over all shards within
the NUMA group, to make the situation more symmetric.
(Dict owner is decided based on the hash of dict contents).

It should be noted that the last part isn't necessarily a good thing,
though.
While it makes the situation more symmetric within each node,
it makes it less symmetric across the cluster, if different node
sizes are present.

If dicts occupy 1% of memory on each shard of a 100-shard node,
then the same dicts would occupy 100% of memory on a 1-shard node.

So for the sake of cluster-wide symmetry, we might later want to consider
e.g. making the memory limit for dictionaries inversely proportional
to the number of shards.

(cherry picked from commit 1bcf77951c)
2025-05-12 09:12:05 +00:00
Michał Chojnowski
68d2086fa5 test: switch uses of make_sstable_compressor_factory() to a seastar::thread-dependent version
In next patches, make_sstable_compressor_factory() will have to
disappear.
In preparation for that, we switch to a seastar::thread-dependent
replacement.

(cherry picked from commit 8649adafa8)
2025-05-12 09:12:05 +00:00
Michał Chojnowski
403d43093f test: remove sstables::test_env::do_with()
`sstable_manager` depends on `sstable_compressor_factory&`.
Currently, `test_env` obtains an implementation of this
interface with the synchronous `make_sstable_compressor_factory()`.

But after this patch, the only implementation of that interface
`sstable_compressor_factory&` will use `sharded<...>`,
so its construction will become asynchronous,
and the synchronous `make_sstable_compressor_factory()` must disappear.

There are several possible ways to deal with this, but I think the
easiest one is to write an asynchronous replacement for
`make_sstable_compressor_factory()`
that will keep the same signature but will be only usable
in a `seastar::thread`.

All other uses of `make_sstable_compressor_factory()` outside of
`test_env::do_with()` already are in seastar threads,
so if we just get rid of `test_env::do_with()`, then we will
be able to use that thread-dependent replacement. This is the
purpose of this commit.

We shouldn't be losing much.

(cherry picked from commit 0e4d0ded8d)
2025-05-12 09:12:04 +00:00
Patryk Jędrzejczak
2b1b4d1dfc Merge '[Backport 2025.2] Correctly skip updating node's own ip address due to oudated gossiper data ' from Scylladb[bot]
Used host id to check if the update is for the node itself. Using IP is unreliable since if a node is restarted with different IP a gossiper message with previous IP can be misinterpreted as belonging to a different node.

Fixes: #22777

Backport to 2025.1 since this fixes a crash. Older version do not have the code.

- (cherry picked from commit a2178b7c31)

- (cherry picked from commit ecd14753c0)

- (cherry picked from commit 7403de241c)

Parent PR: #24000

Closes scylladb/scylladb#24089

* https://github.com/scylladb/scylladb:
  test: add reproducer for #22777
  storage_service: Do not remove gossiper entry on address change
  storage_service: use id to check for local node
2025-05-12 09:31:20 +02:00
Michał Chojnowski
a5b513dde7 logalloc_test: don't test performance in test background_reclaim
The test is failing in CI sometimes due to performance reasons.

There are at least two problems:
1. The initial 500ms (wall time) sleep might be too short. If the reclaimer
   doesn't manage to evict enough memory during this time, the test will fail.
2. During the 100ms (thread CPU time) window given by the test to background
   reclaim, the `background_reclaim` scheduling group isn't actually
   guaranteed to get any CPU, regardless of shares. If the process is
   switched out inside the `background_reclaim` group, it might
   accumulate so much vruntime that it won't get any more CPU again
   for a long time.

We have seen both.

This kind of timing test can't be run reliably on overcommitted machines
without modifying the Seastar scheduler to support that (by e.g. using
thread clock instead of wall time clock in the scheduler), and that would
require an amount of effort disproportionate to the value of the test.

So for now, to unflake the test, this patch removes the performance test
part. (And the tradeoff is a weakening of the test).

(cherry picked from commit 1c1741cfbc)
2025-05-09 16:12:22 +00:00
Michał Chojnowski
2c431c1ea2 logalloc: make background_reclaimer::free_memory_threshold publicly visible
Wanted by the change to the background_reclaim test in the next patch.

(cherry picked from commit c47f438db3)
2025-05-09 16:12:22 +00:00
Gleb Natapov
827563902c test: add reproducer for #22777
Add sleep before starting gossiper to increase a chance of getting old
gossiper entry about yourself before updating local gossiper info with
new IP address.

(cherry picked from commit 7403de241c)
2025-05-09 12:56:15 +00:00
Gleb Natapov
ccf194bd89 storage_service: Do not remove gossiper entry on address change
When gossiper indexed entries by ip an old entry had to be removed on an
address change, but the index is id based, so even if ip was change the
entry should stay. Gossiper simply updates an ip address there.

(cherry picked from commit ecd14753c0)
2025-05-09 12:56:15 +00:00
Gleb Natapov
9b735bb4dc storage_service: use id to check for local node
IP may change and an old gossiper message with previous IP may be
processed when it shouldn't.

Fixes: #22777
(cherry picked from commit a2178b7c31)
2025-05-09 12:56:15 +00:00
Michał Chojnowski
f29b87970a test/boost/mvcc_test: fix an overly-strong assertion in test_snapshot_cursor_is_consistent_with_merging
The test checks that merging the partition versions on-the-fly using the
cursor gives the same results as merging them destructively with apply_monotonically.

In particular, it tests that the continuity of both results is equal.
However, there's a subtlety which makes this not true.
The cursor puts empty dummy rows (i.e. dummies shadowed by the partition
tombstone) in the output.
But the destructive merge is allowed (as an expection to the general
rule, for optimization reasons), to remove those dummies and thus reduce
the continuity.

So after this patch we instead check that the output of the cursor
has continuity equal to the merged continuities of version.
(Rather than to the continuity of merged versions, which can be
smaller as described above).

Refs https://github.com/scylladb/scylladb/pull/21459, a patch which did
the same in a different test.
Fixes https://github.com/scylladb/scylladb/issues/13642

Closes scylladb/scylladb#24044

(cherry picked from commit 746ec1d4e4)

Closes scylladb/scylladb#24083
2025-05-09 13:00:34 +02:00
Raphael S. Carvalho
82ca17e70d replica: Fix use-after-free with concurrent schema change and sstable set update
When schema is changed, sstable set is updated according to the compaction
strategy of the new schema (no changes to set are actually made, just
the underlying set type is updated), but the problem is that it happens
without a lock, causing a use-after-free when running concurrently to
another set update.

Example:

1) A: sstable set is being updated on compaction completion
2) B: schema change updates the set (it's non deferring, so it
happens in one go) and frees the set used by A.
3) when A resumes, system will likely crash since the set is freed
already.

ASAN screams about it:
SUMMARY: AddressSanitizer: heap-use-after-free sstables/sstable_set.cc ...

Fix is about deferring update of the set on schema change to compaction,
which is triggered after new schema is set. Only strategy state and
backlog tracker are updated immediately, which is fine since strategy
doesn't depend on any particular implementation of sstable set, since
patch "sstables: Implement sstable_set_impl::all_sstable_runs()".

Fixes #22040.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 434c2c4649)
2025-05-09 05:57:40 +00:00
Raphael S. Carvalho
ddf9d047db sstables: Implement sstable_set_impl::all_sstable_runs()
With upcoming change where table::set_compaction_strategy() might delay
update of sstable set, ICS might temporarily work with sstable set
implementations other than partitioned_sstable_set. ICS relies on
all_sstable_runs() during regular compaction, and today it triggers
bad_function_call exception if not overriden by set implementation.
To remove this strong dependency between compaction strategy and
a particular set implementation, let's provide a default implementation
of all_sstable_runs(), such that ICS will still work until the set
is updated eventually through a process that adds or remove a
sstable.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 628bec4dbd)
2025-05-09 05:57:40 +00:00
Botond Dénes
17a76b6264 Merge '[Backport 2025.2] test/cluster/test_read_repair.py: improve trace logging test (again)' from Scylladb[bot]
The test test_read_repair_with_trace_logging wants to test read repair with trace logging. Turns out that node restart + trace-level logging + debug mode is too much and even with 1 minute timeout, the read repair     times out sometimes. Refactor the test to use injection point instead of restart. To make sure the test still tests what it supposed to test, use tracing to assert that read repair did indeed happen.

Fixes: scylladb/scylladb#23968

Needs backport to 2025.1 and 6.2, both have the flaky test

- (cherry picked from commit 51025de755)

- (cherry picked from commit 29eedaa0e5)

Parent PR: #23989

Closes scylladb/scylladb#24051

* github.com:scylladb/scylladb:
  test/cluster/test_read_repair.py: improve trace logging test (again)
  test/cluster: extract execute_with_tracing() into pylib/util.py
2025-05-08 11:01:18 +03:00
Aleksandra Martyniuk
ab45df1aa1 streaming: skip dropped tables
Currently, stream_session::prepare throws when a table in requests
or summaries is dropped. However, we do not want to fail streaming
if the table is dropped.

Delete table checks from stream_session::prepare. Further streaming
steps can handle the dropped table and finish the streaming successfully.

Fixes: #15257.

Closes scylladb/scylladb#23915

(cherry picked from commit 20c2d6210e)

Closes scylladb/scylladb#24053
2025-05-08 11:00:27 +03:00
Botond Dénes
97f0f312e0 test/cluster/test_read_repair.py: improve trace logging test (again)
The test test_read_repair_with_trace_logging wants to test read repair
with trace logging. Turns out that node restart + trace-level logging
+ debug mode is too much and even with 1 minute timeout, the read repair
times out sometimes.
Refactor the test to use injection point instead of restart. To make
sure the test still tests what it supposed to test, use tracing to
assert that read repair did indeed happen.

(cherry picked from commit 29eedaa0e5)
2025-05-07 13:26:08 +00:00
Botond Dénes
4df6a17d30 test/cluster: extract execute_with_tracing() into pylib/util.py
To allow reuse in other tests.

(cherry picked from commit 51025de755)
2025-05-07 13:26:08 +00:00
Anna Mikhlin
b3dbfaf27a Update ScyllaDB version to: 2025.2.0-rc0 2025-05-07 11:41:33 +03:00
Botond Dénes
0a9ca52cfd replica/database: memtable_list: save ref to memtable_table_shared_data
This is passed by reference to the constructor, but a copy is saved into
the _table_shared_data member. A reference to this member is passed down
to all memtable readers. Because of the copy, the memtable readers save
a reference to the memtable_list's member, which goes away together with
the memtable_list when the storage_group is destroyed.
This causes use-after-free when a storage group is destroyed while a
memtable read is still ongoing. The memtable reader keeps the memtable
alive, but its reference to the memtable_table_shared_data becomes
stale.
Fix by saving a reference in the memtable_list too, so memtable readers
receive a reference pointing to the original replica::table member,
which is stable accross tablet migrations and merges.
The copy was introduced by 2a76065e3d.
There was a copy even before this commit, but in the previous vnode-only
world this was fine -- there was one memtable_list per table and it was
around until the table itself was. In the tablet world, this is no
longer given, but the above commit didn't account for this.

A test is included, which reproduces the use-after-free on memtable
migration. The test is somewhat artificial in that the use-after-free
would be prevented by holding on to an ERM, but this is done
intentionaly to keep the test simple. Migration -- unlike merge where
this use-after-free was originally observed -- is easy to trigger from
unit tests.

Fixes: #23762

Closes scylladb/scylladb#23984
2025-05-06 22:13:17 +03:00
David Garcia
b1ee0e2a6a docs: fix AttributeError with 'myst_enable_extensions' in publication workflow
Rolled back some dependencies in `poetry.lock` to previous versions while we investigate how to make the extension `sphinx_scylladb_markdown` compatible with the latest versions.

This should fix the error in https://github.com/scylladb/scylladb/actions/runs/14708656912/job/41275115239, which currently prevents publishing new versions of https://opensource.docs.scylladb.com/

Closes scylladb/scylladb#23969
2025-05-06 16:33:00 +03:00
Pavel Emelyanov
1b5bbc2433 Merge 'test.py: split boost pytest integration' from Andrei Chekun
This PR contains changes that do not add new functionality, and have small refactoring of the existing code.
The most significant change is the refactoring of resource gathering, so it will not create another cgroup to put itself in. So there will be no nested redundant 'initial' groups, e.x. `/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/initial/initial/initial.../initial`
This is part two of splitting the original PR.

This PR is an extraction of several commits from https://github.com/scylladb/scylladb/pull/22894 as reviewer https://github.com/scylladb/scylladb/pull/22894?notification_referrer_id=NT_kwDOACiLR7MxNDg0ODk2MDU1MjoyNjU3MDk1&notifications_query=reason%3Aparticipating#pullrequestreview-2778582278.

Closes scylladb/scylladb#23882

* github.com:scylladb/scylladb:
  test.py: add awareness of extra_scylla_cmdline_options
  test.py: increase timeout for C++ tests in pytest
  test.py: switch method of finding the root repo directory
  test.py: move get_combined_tests to the correct facade
  test.py: add common directory for reports
  test.py: add the possibility to provide additional env vars
  test.py: move setup cgroups to the generic method
  test.py: refactor resource_gather.py
2025-05-06 16:22:49 +03:00
Botond Dénes
3c3f6ca233 tools/scylla-sstable: scrub: use UUID sstable identifiers
Much easier to avoid sstable collisions. Makes it possible to scrub
multiple sstables, with multiple calls to scylla-sstable, reusing the
same output directory. Previously, each new call to scylla-sstable
scrub, would start from generation 0, guaranteeing collision.

Remove the unit test for generation clash -- with UUID generations, this
is no longer possible to reproduce in practice.

Refs: #21387

Closes scylladb/scylladb#23990
2025-05-06 15:09:53 +03:00
Patryk Jędrzejczak
7f843e0a5c Merge 'raft: make sure to retain the existing voters including the current leader (topology coordinator)' from Emil Maskovsky
Fix an issue in the voter calculator where existing voters were not retained across data centers and racks in certain scenarios. This occurred when voters were distributed across more data centers and racks than the maximum allowed number of voters.
Previously, the prioritization logic for data centers and racks did not consider the number of existing assigned voters. It only prioritized nodes within a single data center or rack, which could result in unnecessary reassignment of voters.
Improved the prioritization logic to account for the number of existing assigned voters in each data center and rack.

Additionally, the limited voters feature did not account for the existing topology coordinator (Raft leader) when selecting voters to be removed. As a result, the limited voters calculator could inadvertently remove the votership of the topology coordinator, triggering unnecessary Raft leader re-election.
To address this, the topology coordinator's votership status is now preserved unless absolutely necessary. When choosing between otherwise equivalent voters, the node other than the existing topology coordinator is prioritized for removal.

This change ensures a more stable voter distribution and reduces unnecessary voter reassignments.

The limited voters calculator is refactored to use a priority queue for sorting nodes by their priorities. This change simplifies the voter selection logic and makes it more extensible for future enhancements, such as supporting more complex priority calculations.

Fixes: scylladb/scylladb#23950
Fixes: scylladb/scylladb#23588
Fixes: scylladb/scylladb#23786

No backport: The limited voters feature is currently only present in master.

Closes scylladb/scylladb#23888

* https://github.com/scylladb/scylladb:
  raft: ensure topology coordinator retains votership
  raft: retain existing voters across data centers and racks
  raft: refactor limited voters calculator to prioritize nodes
  raft: replace pointer with reference for non-null output parameter
  raft: reduce code duplication in group0 voter handler
  raft: unify and optimize datacenter and rack info creation
2025-05-06 13:49:55 +02:00
Nadav Har'El
252c5b5c9d Merge 'Alternator batch_write_item wcu' from Amnon Heiman
This series adds support for WCU tracking in batch_write_item and tests it.

The patches include:

Switch the metrics (RCU and WCU) to count units vs half-units as they were, to make the metrics clearer for users.

Adding a public static get_half_units function to wcu_consumed_capacity_counter for use by batch write item, which cannot directly use the counter object.

Adding WCU calculation support to batch_write_item, based on item size for puts and a fixed 1 WCU for deletes. WCU metrics are updated, and consumed capacity is returned per table when requested.

The return handling was refactored to be coroutine-like for easier management of the consumed capacity array.

Adding tests that validate WCU calculation for batch put requests on a single table and across multiple tables, ensuring delete operations are counted correctly.

Adding a test that validates that WCU metrics are updated correctly during batch write item operations, ensuring the WCU of each item is calculated independently.

**Need backport, WCU is partially supported, and is missing from batch_write_item**

Fixes #23940

Closes scylladb/scylladb#23941

* github.com:scylladb/scylladb:
  alternator/test_metrics.py: batch_write validate WCU
  alternator/test_returnconsumedcapacity.py: Add tests for batch write WCU
  alternator/executor: add WCU for batch_write_items
  alternator/consumed_capacity: make wcu get_units public
  Alternator: Change the WCU/RCU to use units
2025-05-06 13:31:53 +03:00
Avi Kivity
fc2204cea0 Merge ' test/boost/multishard_mutation_query_test: fix test_read_with_partition_row_limits' from Botond Dénes
This test has multiple problems:
* has 3 embedded loops to run different scenarios, ignores variable from 2 of these, running with hardcoded settings instead
* initializes misses and lookups to 0 at the start of each scenario, this throws off per-page increment checks, when the previous scenario moved these metrics and they don't start from 0; this causes the test to sometimes fail
* duplicate check of drops == 0 (just cosmetic)

Fix all three problems, the second is especially important because it made the test flaky.
Additionally, ensure the test will keep using vnodes in the future, by explicitly creating a vnodes keyspace for them.

Fixes: #16794

Test fix, not a backport candidate normally, we can backport to 2025.1 if the test becomes too unstable there

Closes scylladb/scylladb#23783

* github.com:scylladb/scylladb:
  test/boost/multishard_mutation_query_test: ensure test runs with vnodes
  test/boost/multishard_mutation_query_test: fix test_read_with_partition_row_limits
2025-05-05 20:49:03 +03:00
Emil Maskovsky
24dfd2034b raft: ensure topology coordinator retains votership
The limited voters feature did not account for the existing topology
coordinator (Raft leader) when selecting voters to be removed.
As a result, the limited voters calculator could inadvertently remove
the votership of the current topology coordinator, triggering
an unnecessary Raft leader re-election.

This change ensures that the existing topology coordinator's votership
status is preserved unless absolutely necessary. When choosing between
otherwise equivalent voters, the node other than the topology coordinator
is prioritized for removal. This helps maintain stability in the cluster
by avoiding unnecessary leader re-elections.

Additionally, only the alive leader node is considered relevant for this
logic. A dead existing leader (topology coordinator) is excluded from
consideration, as it is already in the process of losing leadership.

Fixes: scylladb/scylladb#23588
Fixes: scylladb/scylladb#23786
2025-05-05 16:58:34 +02:00
Emil Maskovsky
2ae59e8a87 raft: retain existing voters across data centers and racks
Fix an issue in the voter calculator where existing voters were not
retained across data centers and racks in certain scenarios. This
occurred when voters were distributed across more data centers and racks
than the maximum allowed number of voters.

Previously, the prioritization logic for data centers and racks did not
consider the number of existing assigned voters. It only prioritized
nodes within a single data center or rack, which could result in
unnecessary reassignment of voters.

Improved the prioritization logic to account for the number of existing
voters in each data center and rack.

This change ensures a more stable voter distribution and reduces
unnecessary voter reassignments.

Fixes: scylladb/scylladb#23950
2025-05-05 16:51:48 +02:00
Emil Maskovsky
018fb63305 raft: refactor limited voters calculator to prioritize nodes
Refactor the limited voters calculator to use a priority queue for
sorting nodes by their priorities. This change simplifies the voter
selection logic and makes it more extensible for future enhancements,
such as supporting more complex priority calculations.

The priority value is determined based on the node's existing status,
including whether it is alive, a voter, or any further criteria.
2025-05-05 16:36:17 +02:00
Emil Maskovsky
26fdc7b8f8 raft: replace pointer with reference for non-null output parameter
The output parameter cannot be `null`. Previously, a pointer was used to
make it explicit that the parameter is an output parameter being
modified. However, this is unnecessary, as references are more
appropriate for parameters that cannot be `null`.

Switching to a reference improves code readability and ensures the
parameter's non-null constraint is enforced at the type level.
2025-05-05 16:12:00 +02:00
Emil Maskovsky
f0468860a3 raft: reduce code duplication in group0 voter handler
Refactor the group0 voter handler by introducing a helper lambda to
handle the common logic for adding a node. This eliminates unnecessary
code duplication.

This refactor does not introduce any functional changes but prepares
the codebase for easier future modifications.
2025-05-05 16:09:53 +02:00
Botond Dénes
855411caad test/boost/multishard_mutation_query_test: ensure test runs with vnodes
All tests in this suite use the default "ks" keyspace from cql_test_env.
This keyspace has tablet support and at any time we might decide to make
it use tablets by default. This would make all these tests use the
tablet path in multishard_mutation_query.cc. These tests were created to
test the vastly more complex vnodes code path in said file. The tablet
path is much simpler and it is only used by SELECT * FROM
MUTATION_FRAGMENTS() and which has its own correctness tests.
So explicitely create a vnodes keyspace and use it in all the tests to
restore the test functionality.
2025-05-05 09:22:54 -04:00
Botond Dénes
1175e1ed49 test/boost/multishard_mutation_query_test: fix test_read_with_partition_row_limits
This test has multiple problems:
* has 3 embedded loops to run different scenarios, ignores variable from
  2 of these, running with hardcoded settings instead
* initializes misses and lookups to 0 at the start of each scenario,
  this throws off per-page increment checks, when the previous scenario
  moved these metrics and they don't start from 0; this causes the test
  to sometimes fail
* duplicate check of drops == 0 (just cosmetic)

Fix all three problems, the second is especially important because it
made the test flaky.
2025-05-05 09:22:53 -04:00
Emil Maskovsky
2ef654149f raft: unify and optimize datacenter and rack info creation
Refactor the code to use a consistent pattern for creating the
datacenter info list and the rack info list.

Both now use a map of vectors, which improves efficiency by reducing
temporary conversions to maps/sets during node list processing.

Also ensure the node descriptor is passed by reference instead of by
copy, leveraging the guaranteed lifetime of the descriptors.
2025-05-05 15:15:17 +02:00
Pavel Emelyanov
cf1ffd6086 Merge 'sstables_loader: fix the racing between get_progress() and release_resources()' from Kefu Chai
This change addresses a critical race condition in the sstables_loader where `get_progress()` could access invalid `progress_holder` instances after `release_resources()` destroyed them.

Problem:
- Progress tracking uses two components: `_progress_state` (tracks state) and `_progress_per_shard` (sharded service with actual progress data)
- `get_progress()` first checks if `_progress_state` is initialized, then accumulates progress from `_progress_per_shard`
- As both functions are coroutines, `get_progress()` could be preempted after state check but before accessing `_progress_per_shard`
- If `release_resources()` runs during this preemption, it destroys the `progress_holder` instances in `_progress_per_shard`, causing `get_progress()` to access invalid memory.

Solution:
- Implemented shared/exclusive locking to protect access to both state and sharded progress data
- Multiple `get_progress()` calls can execute in parallel (shared access)
- `release_resources()` acquires exclusive access before modifying resources
- This prevents potential memory corruption and ensures consistent progress reporting

Fixes #23801

---

this change addresses a racing related to tracking the restore progress from S3 using scylla's native API, which is not used in production yet, hence no need to backport.

Closes scylladb/scylladb#23808

* github.com:scylladb/scylladb:
  sstables_loader: fix the indent
  sstables_loader: fix the racing between get_progress() and release_resources()
2025-05-05 15:45:15 +03:00
Avi Kivity
e688e89430 tools: toolchain: clear .cache and .cargo directories
The .cache and .cargo directories are used during pip and rust builds
when preparing the toolchain, but aren't useful afterwards. Remove them
to save a bit of space.

Closes scylladb/scylladb#23955
2025-05-05 14:43:14 +03:00
Avi Kivity
4c1f4c419c tools: toolchain: dbuild: run as root in container under podman
Running as root enables nested containers under podman without
trouble from uid remapping. Unlike docker, under podman uid 0 in
the container is remapped to the host uid for bind mounts, so writes
to the build directory do not end up owned by root on the host.

Nested containers will allow us to consume opensearch, cassandra-stress,
and minio as containers rather than embedding them into the frozen
toolchain.

Closes scylladb/scylladb#23954
2025-05-05 14:40:43 +03:00
Amnon Heiman
2ab99d7a07 alternator/test_metrics.py: batch_write validate WCU
This patch adds a test that verifies the WCU metrics are updated
correctly during a batch_write_item operation.
It ensures that the WCU of each item is calculated independently.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-05-05 13:20:24 +03:00
Amnon Heiman
14570f1bb5 alternator/test_returnconsumedcapacity.py: Add tests for batch write WCU
This patch adds two tests:
A test that validates WCU calculation for batch put requests on a single table.

A test that validates WCU calculation for batch requests across multiple
tables, including ensuring that delete operations are counted as 1 WCU.

Both tests verify that the consumed capacity is reported correctly
according to the WCU rules.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-05-05 13:20:23 +03:00
Amnon Heiman
68db77643f alternator/executor: add WCU for batch_write_items
This patch adds consumed capacity unit support to batch_write_item.

It calculates the WCU based on an item's length (for put) or a static 1
WCU (for delete), for each item on each table.

The WCU metrics are always updated. if the user requests consumed
capacity, a vector of consumed capacity is returned with an entry for
each of the tables.

For code simplicity, the return part of batch_write_item was updated to
be coroutine-like; this makes it easier to manage the life cycle of the
returned consumed_capacity array.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-05-05 13:20:14 +03:00
Amnon Heiman
f2ade71f4f alternator/consumed_capacity: make wcu get_units public
This patch adds a public static get_units function to
wcu_consumed_capacity_counter.  It will be used by the batch write item
implementation, which cannot use the wcu_consumed_capacity_counter
directly.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

consume_capacity need merge
2025-05-05 13:19:04 +03:00
Amnon Heiman
5ae11746fa Alternator: Change the WCU/RCU to use units
This patch changes the RCU/WCU Alternator metrics to use whole units
instead of half units. The change includes the following:

Change the metrics documentation. Keep the RCU counter internally in
half units, but return the actual (whole unit) value.
Change the RCU name to be rcu_half_units_total to indicates that it
counts half units.
Change the WCU to count in whole units instead of half units.

Update the tests accordingly.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-05-05 13:18:09 +03:00
Anna Stuchlik
851a433663 doc: add a link to the previous Enterprise documentation
This commit adds a link to the docs for previous Enterprise versions
at https://enterprise.docs.scylladb.com/ to the left menu.

As we still support versions 2024.1 and 2024.2, we need to ensure
easier access to those docs sets.

Fixes https://github.com/scylladb/scylladb/issues/23870

Closes scylladb/scylladb#23945
2025-05-05 12:16:47 +03:00
Avi Kivity
04fb2c026d config: decrease default large allocation warning threshold to 128k
Back in 2017 (5a2439e702), we introduced a check for large
allocations as they can stall the memory allocator. The warning
threshold was set at 1 MB. Since then many fixes for large allocations
went in and it is now time to reduce the threshold further.

We reduce it here to 128 kB, the natural allocation size for the
system. A quick run showed no warnings.

Closes scylladb/scylladb#23975
2025-05-05 12:13:48 +03:00
Pavel Emelyanov
b56d6fbb84 Merge 'sstables: Fix quadratic space complexity in partitioned_sstable_set' from Raphael Raph Carvalho
Interval map is very susceptible to quadratic space behavior when it's flooded with many entries overlapping all (or most of) intervals, since each such entry will have presence on all intervals it overlaps with.

A trigger we observed was memtable flush storm, which creates many small "L0" sstables that spans roughly the entire token range.

Since we cannot rely on insertion order, solution will be about storing sstables with such wide ranges in a vector (unleveled).

There should be no consequence for single-key reads, since upper layer applies an additional filtering based on token of key being queried.
And for range scans, there can be an increase in memory usage, but not significant because the sstables span an wide range and would have been selected in the combined reader if the range of scan overlaps with them.

Anyway, this is a protection against storm of memtable flushes and shouldn't be the common scenario.

It works both with tablets and vnodes, by adjusting the token range spanned by compaction group accordingly.

Fixes #23634.

We can backport this into 2024.2, 2025.1, but we should let this cook in master for 1 month or so.

Closes scylladb/scylladb#23806

* github.com:scylladb/scylladb:
  test: Verify partitioned set store split and unsplit correctly
  sstables: Fix quadratic space complexity in partitioned_sstable_set
  compaction: Wire table_state into make_sstable_set()
  compaction: Introduce token_range() to table_state
  dht: Add overlap_ratio() for token range
2025-05-05 11:28:38 +03:00
David Garcia
4ba7182515 docs: fix md redirections for multiversion support
This change resolves an issue where selecting a version from the multiversion dropdown on Markdown pages (e.g. https://docs.scylladb.com/manual/stable/alternator/getting-started.html) incorrectly redirected users to the main page instead of the corresponding versioned page.

The underlying cause was that the `multiversion` extension relies on `source_suffix` to identify available pages for URL mapping. Without this configuration, proper redirection fails for `.md` files.

This fix should be backported to `2025.1` to ensure correct behavior. Otherwise, the fix will only take effect in future releases.

Testing locally is non-trivial: clone the repository, apply the changes to each relevant branch, set `smv_remote_whitelist` to "", then run `make multiversionpreview`. Afterward, switch between versions in the dropdown to verify behavior. I've tested it locally, so the best next step is to merge and confirm that it works as expected in the live environment.

Closes scylladb/scylladb#23957
2025-05-05 10:39:39 +03:00
Pavel Emelyanov
7b786d9398 topology_coordinator: Use this->_feature_service directly
This dependency is already there, topology coordinator doesn't need
to use database reference to get to the features.

Previous patch of the same kind: b79137eaa4

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23777
2025-05-05 09:37:29 +02:00
Piotr Dulikowski
05c797795f Merge 'Simplify test/sstable_assertions class API' from Pavel Emelyanov
It had recently been patched to re-use the sstables::test class functionality (scylladb/scylladb#23697), now it can be put on some more strict diet.

Closes scylladb/scylladb#23815

* github.com:scylladb/scylladb:
  test: Remove sstable_assertions::get_stats_metadata()
  test: Add sstable_assertions::operator->()
2025-05-05 09:33:45 +02:00
Nadav Har'El
834107ae97 test/cqlpy,alternator: fix reporting of Scylla crash during test
The cqlpy and alternator test frameworks use a single Scylla node started
once for all tests to run on. In the distant past, we had a problem where
if one test caused Scylla to crash, the result was a confusing report of
hundreds of failed tests - all tests after the crash "failed" and it wasn't
easy to find which test really caused the crash.

Our old solution to this problem was to have an autouse fixture (called
cql_test_connection or dynamodb_test_connection) which tested the
connection at the end of each test, and if it detected Scylla has
crashed - it used pytest.exit() to report the error and have pytest
exit and therefore stop running any further tests (which would have
led to all of them testing).

This approach had two problems:

1. The pytest.exit() caused the entire cqlpy suite to report a failure,
   but but not the individual test - the individual test might have
   failed as well, but that isn't guaranteed and in any case this test's
   output is missing the informative message that Scylla crashed during
   the test. This was fine when for each cqlpy failure we had two separate
   error logs in Jenkins - the specific failed function, and the failed
   file - but when we recently got rid of the suplication by removing the
   second one, we no longer see the "Scylla crashed" messages any more.

2. Exiting pytest will be the wrong thing to do if the same pytest
   run could run tests from different test suites. We don't do this
   today, but we plan to support this approach soon.

This patch fixes both problems by replacing the pytest.exit() call by
setting a "scylla_crashed" flag and using pytest.fail(). The pytest.fail()
causes the current test - the one which caused Scylla to crash - to be
reported as an "ERROR" and the "Scylla crashed" message will correctly
appear in this test's log. The flag will cause all other tests in the
same test suite to be skip()ed. But other tests in other directories,
depending on different fixtures, might continue to run normally.

Fixes #23287

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23307
2025-05-05 10:15:56 +03:00
Nadav Har'El
3ce7e250cc alternator: fix schema "concurrent modification" errors
In ScyllaDB, schema modification operations use "optimistic locking":
A schema operation reads the current schema, decides what it wants to do
and prepares changes to the schema, and then attempts to commit those
changes - but only if the schema hasn't changed since the first read.
If the schema has already been changed by some other node - we need to
try again. In a loop.

In Alternator, there are six operations that perform schema modification:
CreateTable, DeleteTable, UpdateTable, TagResource, UntagResource and
UpdateTimeToLive. All of them were missing this loop. We knew about
this - and even had FIXME in all places. So all these operations,
when facing contention of concurrent schema modifications on different
nodes may fail one of these operations with an error like:

   Internal server error: service::group0_concurrent_modification
   (Failed to apply group 0 change due to concurrent modification).

This problem had very minor effect, if any, on real users because the
DynamoDB SDK automatically retries operations that fail with retryable
errors - like this "Internal server error" - and most likely the schema
operation will succeed upon retry. However, as shown in issue #13152
these failures were annoying in our CI, where tests - which disable
request retries - failed on these errors.

This patch fixes all six operations (the last three operations all
use one common function, db::modify_tags(), so are fixed by one
change) to add the missing loop.

The patch also includes reproducing tests for all these operations -
the new tests all fail before this patch, and pass with it.

These new tests are much more reliable reproducers than the dtests
we had that only sometimes - very rarely - reproduced the problem.
Moreover, the new tests reproduces the bug seperately for each of the
six operations, so if we forget to fix one of the six operations, one
of the tests would have continued to fail. Of course I checked this
during development.

The new tests are in the test/cluster framework, not test/alternator,
because this problem can only be reproduced in a multi-node cluster:
On a single node, it serializes its schema modifications on its own;
The collisions only happen when more than one node attempts schema
modifications at the same time.

Fixes #13152

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23827
2025-05-05 09:59:08 +03:00
Pavel Emelyanov
d40d6801b0 sstable_directory: Print ks.cf when moving unshared remove sstables
When an sstable is identified by sstable_directory as remote-unshared,
it will at some point be moved to the target shard. When it happens a
log-message appears:

    sstable_directory - Moving 1 unshared SSTables to shard 1

Processing of tables by sstable_directory often happens in parallel, and
messages from sstable_directory are intermixed. Having a message like
above is not very informative, as it tells nothing about sstables that
are being moved.

Equip the message with ks:cf pair to make it more informative.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23912
2025-05-05 09:45:44 +03:00
Pavel Emelyanov
e0f30a30a7 sstable_directory: Print unshared remote sstable when sorting
When collecting sstables, the sstable_directory may sort the collected
descriptors into one of three buckets -- unshared local and remote, and
shared ones. Unshared local and shared sstables' paths are loggerd (with
trace level) while unshared remote is silently collected for further
processing. Add log message for that case too, there's enough data to
print the sstable path as well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23913
2025-05-05 09:33:06 +03:00
Piotr Dulikowski
8ffe4b0308 utils::loading_cache: gracefully skip timer if gate closed
The loading_cache has a periodic timer which acquires the
_timer_reads_gate. The stop() method first closes the gate and then
cancels the timer - this order is necessary because the timer is
re-armed under the gate. However, the timer callback does not check
whether the gate was closed but tries to acquire it, which might result
in unhandled exception which is logged with ERROR severity.

Fix the timer callback by acquiring access to the gate at the beginning
and gracefully returning if the gate is closed. Even though the gate
used to be entered in the middle of the callback, it does not make sense
to execute the timer's logic at all if the cache is being stopped.

Fixes: scylladb/scylladb#23951

Closes scylladb/scylladb#23952
2025-04-30 16:43:22 +03:00
Aleksandra Martyniuk
1f4edd8683 test_tablet_tasks: use injection to revoke resize
Currently, test_tablet_resize_revoked tries to trigger split revoke
by deleting some rows. This method isn't deterministic and so a test
is flaky.

Use error injection to trigger resize revoke.

Fixes: #22570.

Closes scylladb/scylladb#23966
2025-04-30 07:04:57 +03:00
Michał Chojnowski
9e2343ecb0 test_sstable_compression_dictionaries_autotrain: raise the timeout
There were CI runs in which the training happened as planned,
but it was too slow to fit within the timeout.

Raise the timeout to pacify the CI.

Fixes scylladb/scylladb#23964

Closes scylladb/scylladb#23965
2025-04-29 22:09:14 +03:00
Raphael S. Carvalho
d5bee4c814 test: Verify partitioned set store split and unsplit correctly
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-04-29 15:47:33 -03:00
Raphael S. Carvalho
c77f710a0c sstables: Fix quadratic space complexity in partitioned_sstable_set
Interval map is very susceptible to quadratic space behavior when
it's flooded with many entries overlapping all (or most of)
intervals, since each such entry will have presence on all
intervals it overlaps with.

A trigger we observed was memtable flush storm, which creates many
small "L0" sstables that spans roughly the entire token range.

Since we cannot rely on insertion order, solution will be about
storing sstables with such wide ranges in a vector (unleveled).

There should be no consequence for single-key reads, since upper
layer applies an additional filtering based on token of key being
queried.
And for range scans, there can be an increase in memory usage,
but not significant because the sstables span an wide range and
would have been selected in the combined reader if the range of
scan overlaps with them.

Anyway, this is a protection against storm of memtable flushes
and shouldn't be the common scenario.

It works both with tablets and vnodes, by adjusting the token
range spanned by compaction group accordingly.

Fixes #23634.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-04-29 15:47:33 -03:00
Raphael S. Carvalho
21d1e78457 compaction: Wire table_state into make_sstable_set()
This will be useful for feeding token range owned by compaction group
into sstable set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-04-29 15:47:33 -03:00
Raphael S. Carvalho
59dad2121f compaction: Introduce token_range() to table_state
This provides a way for compaction layer to know compaction group's
token range. It will be important for sstable set impl to know
the token range of underlying group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-04-29 15:47:33 -03:00
Raphael S. Carvalho
494ed6b887 dht: Add overlap_ratio() for token range
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-04-29 15:47:33 -03:00
Patryk Jędrzejczak
0cdcf82cd0 Merge 'topology coordinator: do not proceed further on invalid boostrap tokens' from Piotr Dulikowski
In case when dht::boot_strapper::get_boostrap_tokens fail to parse the
tokens, the topology coordinator handles the exception and schedules a
rollback. However, the current code tries to continue with the topology
coordinator logic even if an exception occurs, leaving boostrap_tokens
empty. This does not make sense and can actually cause issues,
specifically in prepare_and_broadcast_cdc_generation_data which
implicitly expect that the bootstrap_tokens of the first node in the
cluster will not be empty.

Fix this by adding the missing break.

Fixes: scylladb/scylladb#23897

From the code inspection alone it looks like 2025.1 and 6.2 have this problem, so marking for backport to both of them.

Closes scylladb/scylladb#23914

* https://github.com/scylladb/scylladb:
  test: cluster: add test_bad_initial_token
  topology coordinator: do not proceed further on invalid boostrap tokens
  cdc: add sanity check for generating an empty generation
2025-04-28 12:45:33 +02:00
Botond Dénes
d582c436e5 Merge 'tasks: check whether a node is alive before rpc' from Aleksandra Martyniuk
Check whether a node is alive before making an rpc that gathers children
infos from the whole cluster in virtual_task::impl::get_children.

Fixes: https://github.com/scylladb/scylladb/issues/22514.

Needs backport to 2025.1 and 6.2 as they contain the bug.

Closes scylladb/scylladb#23787

* github.com:scylladb/scylladb:
  test: add test for getting tasks children
  tasks: check whether a node is alive before rpc
2025-04-28 09:32:45 +03:00
Nadav Har'El
262530f27c Merge 'mv: make base_info in view schemas immutable' from Wojciech Mitros
Currently, the base_info may or may not be set in view schemas.
Even when it's set, it may be modified. This necessitates extra
checks when handling view schemas, as we'll as potentially causing
errors when we forget to set it at some point.

Instead, we want to make the base info an immutable member of view
schemas (inside view_info). To achieve this, in this series we remove
all base_info members that can change due to a base schema update,
and we calculate the remaining values during view update generation,
using the most up-to-date base schema version.

To calculate the values that depend on the base schema version, we
need to iterate over the view primary key and find the corresponding
columns, which adds extra overhead for each batch of view updates.
However, this overhead should be relatively small, as when creating
a view update, we need to prepare each of its columns anyway. And
if we need to read the old value of the base row, the relative
overhead is even lower.

After this change, the base info in view schemas stays the same
for all base schema updates, so we'll no longer get issues with
base_info being incompatible with a base schema version. Additionally,
it's a step towards making the schema objects immutable, which
we sometimes incorrectly assumed in the past (they're still not
completely immutable yet, as some other fields in view_info other
than base_info are initialized lazily and may depend on the base
schema version).

Fixes https://github.com/scylladb/scylladb/issues/9059
Fixes https://github.com/scylladb/scylladb/issues/21292
Fixes https://github.com/scylladb/scylladb/issues/22194
Fixes https://github.com/scylladb/scylladb/issues/22410

Closes scylladb/scylladb#23337

* github.com:scylladb/scylladb:
  test: remove flakiness from test_schema_is_recovered_after_dying
  mv: add a test for dropping an index while it's building
  base_info: remove the lw_shared_ptr variant
  view_info: don't re-set base_info after construction
  base_info: remove base_info snapshot semantics
  base_info: remove base schema from the base_info
  schema_registry: store base info instead of base schema for view entries
  base_info: make members non-const
  view_info: move the base info to a separate header
  view_info: move computation of view pk columns not in base pk to view_updates
  view_info: move base-dependent variables into base_info
  view_info: set base info on construction
2025-04-27 19:12:12 +03:00
David Garcia
cf7d846b9e docs: update dependencies
This is a mandatory dependency update to resolve a critical Dependabot alert. For more details, see the [Dependabot alerts](https://docs.github.com/en/code-security/dependabot/dependabot-alerts/viewing-and-updating-dependabot-alerts).

Closes scylladb/scylladb#23918

Fixes #23935
2025-04-27 18:45:11 +03:00
Piotr Szymaniak
e588c8667f alternator: Limit attribute name lengths
Attribute names are now checked against DynamoDB-compatible length
limits. When exceeded, Alternator emits exception identical or similar
to the DDB one. It might be worth noting that DDB emits more than a
single kind of an exception string for some exceptions. The tests'
catch clauses handle all the observed kinds of messages from DynamoDB.
The validation differentiates between key and non-key attributes and
applies the limit accordingly.

AWS DDB raises exceptions with somewhat different contents when the
get request contains ProjectionExpression, so this case needed separate
treatment to emit the corresponding exception string. The
length-validating function was declared and defined in
expressions.hh/.cc respectively, because that's where the relevant
parsing happens.

** Tests

The following tests were validated when handling this issue:
test_limit_attribute_length_nonkey_good,
test_limit_attribute_length_nonkey_bad,
test_limit_attribute_length_key_good,
test_limit_attribute_length_key_bad,
test_limit_attribute_length_gsi_lsi_good,
test_limit_attribute_length_gsi_lsi_bad,
test_limit_attribute_length_gsi_lsi_projection_bad.

Some of the tests were expanded into being more granular. Namely, there
is a new test function
`test_limit_attribute_length_key_bad_incoherent_names`
which groups tests with too long attribute names in the case of
incorrect (incoherent) user requests.
Similarily, there is a new test function
`test_limit_attribute_length_gsi_lsi_bad_incoherent_names`
All the tests cover now each combination of the key/keys being too long.
Both the new fuctions contain tests that verify that ScyllaDB throws
length-related exceptions (instead of the coherency-related), similar
to what DynamoDB does.

The new test test_limit_gsiu_key_len_bad covers the case of too long
attribute name inside GlobalSecondaryIndexUpdates.
The new test test_limit_gsiu_key_len_bad_incoherent_names covers the
case of incorrect (incoherent) user requests containing too long
attribute names and GlobalSecondaryIndexUpdates.

test_limit_attribute_length_key_bad was found to have contaned an
illegal KeySchema structure.

Some of the tests were corrected their match clause.

All the tests are stripped of the xfail flag except
test_limit_attribute_length_key_bad, which has it changed since it
still fails due to Projection in GSI and LIS not implemented in Alternator.
The xfail now points to #5036.

Fixes scylladb/scylladb#9169

Closes scylladb/scylladb#23097
2025-04-27 18:39:20 +03:00
Piotr Dulikowski
82e1678fbe test: mv: skip test_mv_tablets_empty_ip in debug mode
This test shuts down a node and then replaces it with another one while
continuously writing to the cluster. The test has been observed to take
a lot of time in debug mode and time out on the replace operation.
Replace takes very long because rebuilding tablets on the new node is
very slow, and the slowest part is memtable flush which happens at the
beginning of streaming. The slowness seems to be specific to the debug
mode.

Turn off the test in debug mode to deflake the CI. As a follow-up, the
test is planned to be reworked into an quicker error injection test so
that the code path tested by this test will be again exercised in debug
unit tests (scylladb/scylladb#23898)

Fixes: scylladb/scylladb#20316

Closes scylladb/scylladb#23900
2025-04-27 18:06:08 +03:00
Piotr Dulikowski
670a69007e test: cluster: add test_bad_initial_token
Adds a test which checks that rollback works properly in case when a bad
value of the initial_token function is provided.
2025-04-25 12:25:15 +02:00
Piotr Dulikowski
845cedea7f topology coordinator: do not proceed further on invalid boostrap tokens
In case when dht::boot_strapper::get_boostrap_tokens fail to parse the
tokens, the topology coordinator handles the exception and schedules a
rollback. However, the current code tries to continue with the topology
coordinator logic even if an exception occurs, leaving boostrap_tokens
empty. This does not make sense and can actually cause issues,
specifically in prepare_and_broadcast_cdc_generation_data which
implicitly expect that the bootstrap_tokens of the first node in the
cluster will not be empty.

Fix this by adding the missing break.

Fixes: scylladb/scylladb#23897
2025-04-25 11:30:01 +02:00
Piotr Dulikowski
66acaa1bf8 cdc: add sanity check for generating an empty generation
It doesn't make sense to create an empty CDC generation because it does
not make sense to have a cluster with no tokens. Add a sanity check to
cdc::make_new_generation_description which fails if somebody attempts to
do that (i.e. when the set of current tokens + optionally bootstrapping
node's tokens is empty).

The function does not work correctly if it is misused, as we saw in
scylladb/scylladb#23897. While the function should not be misused in the
first place, it's better to throw an exception rather than crash -
especially that this crash could happen on the topology coordinator.
2025-04-25 11:25:07 +02:00
Aleksandra Martyniuk
76cd707b18 test: test_tablets: wait for cql
Wait for cql after rolling restart in test_two_tablets_concurrent_repair_and_migration_repair_writer_level
to prevent failing queries.

Fixes: #23620.

Closes scylladb/scylladb#23796
2025-04-24 21:25:29 +03:00
Patryk Jędrzejczak
2a8bb47cfb test: test_zero_token_nodes_topology_ops: use host IDs for ignored nodes
Providing IP of an ignored node during removenode made the test flaky.
It could happen that the address map contained mappings of two
nodes with the same IP:
1. the node being ignored,
2. the node that expectedly failed replacing earlier in the test.

So, `address_map::find_by_addr()` called in `find_raft_nodes_from_hoeps`
could return the host ID of the second node instead of the first node
and cause removenode to fail.

We fix flakiness in this patch by providing the host ID of the ignored
node instead of its IP. We would have to do it anyway sooner or later
because providing IP is deprecated.

The bug in `find_raft_nodes_from_hoeps` is tracked by
scylladb/scylladb#23846.

The test became flaky because of f0af3f261e.
That patch is not present in 2025.1, so the test isn't flaky outside
master, and hence there is no reason to backport this patch.

Fixes scylladb/scylladb#23499

Closes scylladb/scylladb#23863
2025-04-24 20:17:19 +03:00
Pavel Emelyanov
68a178eba9 Merge 'replica: skip flush of dropped table' from Aleksandra Martyniuk
Currently, flush throws no_such_column_family if a table is dropped. Skip the flush of dropped table instead.

Fixes: #16095.

Needs backport to 2025.1 and 6.2 as they contain the bug

Closes scylladb/scylladb#23876

* github.com:scylladb/scylladb:
  test: test table drop during flush
  replica: skip flush of dropped table
2025-04-24 20:02:59 +03:00
Andrei Chekun
22ef09489d test.py: add awareness of extra_scylla_cmdline_options
test_config.yaml can have field extra_scylla_cmdline_options that
previously was not added to the commandline to start Scylla. Now any
extra options will be added to commandline to start tests
2025-04-24 14:05:50 +02:00
Andrei Chekun
2758c4a08e test.py: increase timeout for C++ tests in pytest
Current timeouts it not enough. Tests failed randomly with hitting
timeout. This will allow to test finish normally. As a downside if the
process will hang we will be waiting more. This adjustments will be
changed after we will have metrics how long it takes to test to pass in
each mode.
2025-04-24 14:05:50 +02:00
Andrei Chekun
f5c88e1107 test.py: switch method of finding the root repo directory
Switching to use constant defined in __init__ filet instead of getting
the root directory from pytest's config. This is will allow to have only
one source of truth in defining the  root directory of the project to
avoid cases when root directory defined incorrectly. This change also
simplifies potential changes in future.
2025-04-24 14:05:50 +02:00
Andrei Chekun
06eca04370 test.py: move get_combined_tests to the correct facade
Since get_combined_tests method is used only for boost tests and not all C++ tests, moving it into the correct place
2025-04-24 14:05:49 +02:00
Andrei Chekun
8cc9c0a53a test.py: add common directory for reports
When test.py executing python test it executes it by mode and by file,
so it can say where the report should with mode. With new approach
pytest will execute the tests for all modes inside himself, and we can
only have one report per pytest invocation. That's why we need common
directory for reports and not under the mode directory. It can later be
used for simplification, so any report should be there.
2025-04-24 14:05:49 +02:00
Andrei Chekun
b791af1f16 test.py: add the possibility to provide additional env vars
This will allow inject any environment variable to the test, because
previosly it was taking only the environment variables from the process.
Adding injecting ASAN and UBSAN variablet to the tests
2025-04-24 14:05:49 +02:00
Andrei Chekun
3cb5838619 test.py: move setup cgroups to the generic method
This changes needed for later integration for pytest executing the C++
tests to be able to gather resource metric.
2025-04-24 14:05:49 +02:00
Andrei Chekun
ca615af407 test.py: refactor resource_gather.py
Refactor resource_gather.py to not create the initial cgroup when the process it's already in it. This will allow not going deeper, creating again and again the same cgroup with each test.py execution when the terminal isn't closed.
Add creation of own event loop in case it's not exists. This needed to be able to work with
test.py that creates loop and with pytest that not create loop.
2025-04-24 14:05:49 +02:00
Wojciech Mitros
ee5883770a test: remove flakiness from test_schema_is_recovered_after_dying
Due to the changes in creating schemas with base info the
test_schema_is_recovered_after_dying seems to be flaky when checking
that the schema is actually lost after 'grace_period'. We don't
actually guarantee that the the schema will be lost at that exact
moment so there's no reason to test this. To remove the flakiness,
we remove the check and the related sleep, which should also slightly
improve the speed of this test.
2025-04-24 01:09:35 +02:00
Wojciech Mitros
bf7bba9634 mv: add a test for dropping an index while it's building
Dropping an index is a schema change of its base table and
a schema drop of the index's materialized view. This combination
of schema changes used to cause issues during view building, because
when a view schema was dropped, it wasn't getting updated with the
new version of the base schema, and while the view building was
in progress, we would update the base schema for the base table
mutation reader and try generating updates with a view schema that
wasn't compatible with the base schema, failing on an `on_internal_error`.

In this patch we add a test for this scenario. We create an index,
halt its view building process using an injection, and drop it.
If no errors are thrown, the test succeeds.

The test was failing before https://github.com/scylladb/scylladb/pull/23337
and is passing afterwards.
2025-04-24 01:09:32 +02:00
Wojciech Mitros
d77f11d436 base_info: remove the lw_shared_ptr variant
The base_dependent_view_info is no longer needed to be shared or
modified in the view_info, so we no longer need to keep it as
a shared pointer.
2025-04-24 01:08:40 +02:00
Wojciech Mitros
d7bd86591e view_info: don't re-set base_info after construction
In the previous commits we made sure that the base info is not dependent
on the base schema version, and the info dependent on the base schema
version is calculated when it's needed. In this patch we remove the
unnecessary re-setting of the base_info.

The set_base_info method isn't removed completely, because it also has
a secondary function - zeroing the view_info fields other than base_info.
Because of this, in this patch we rename it accordingly and limit its
use to the updates caused by a base schema change.
2025-04-24 01:08:40 +02:00
Wojciech Mitros
ea462efa3d base_info: remove base_info snapshot semantics
The base info in view schemas no longer changes on base schema
updates, so saving the base info with a view schema from a specific
point in time doesn't provide any additional benefits.
In this patch we remove the code using the base_and_view snapshots
as it's no longer useful.
2025-04-24 01:08:40 +02:00
Wojciech Mitros
ad55935411 base_info: remove base schema from the base_info
The base info now only contains values which are not reliant on the
base schema version. We remove the the base schema from the base info
to make it immutable regardless of base schema version, at the point
of this patch it's also not needed anywhere - the new base info can
replace the base schema in most places, and in the few (view_updates)
where we need it, we pull the most recent base schema version from
the database.

After this change, the base info no longer changes in a view schema
after creation, so we'll no longer get errors when we try generating
view updates with a base_info that's incompatible with a specific
base schema version.

Fixes #9059
Fixes #21292
Fixes #22410
2025-04-24 01:08:39 +02:00
Wojciech Mitros
05fce91945 schema_registry: store base info instead of base schema for view entries
In the following patch we plan to remove the base schema from the base_info
to make the base_info immutable. To do that, we first prepare the schema
registry for the change; we need to be able to create view schemas from
frozen schemas there and frozen schemas have no information about the base
table. Unless we do this change, after base schemas are removed from the
base info, we'll no longer be able to load a view schema to the schema registry
without looking up the base schema in the database.

This change also required some updates to schema building:
* we add a method for unfreezing a view schema with base info instead of
a base schema
* we make it possible to use schema_builder with a base info instead of
a base schema
* we add a method for creating a view schema from mutations with a base info
instead of a base schema
* we add a view_info constructor withat base info instead of a base schema
* we update the naming in schema_registry to reflect the usage of base info
instead of base schema
2025-04-24 01:08:39 +02:00
Wojciech Mitros
6e539c2b4d base_info: make members non-const
In the following patches we'll add the base info instead of the
base schema to various places (schema building, schema registry).
There, we'll sometimes need to update the base_info fields, which
we can't do with const members. There's also a place (global_schema_ptr)
where we won't be able to use the base_info_ptr (a shared pointer to the
base_info), so we can't just use the base_info_ptr everywhere instead.

In this patch we unmark these members as const.
In the following patches we'll remove the methods for changing the
base_info in the view schema, so it will remain effectively const.
2025-04-24 01:08:39 +02:00
Wojciech Mitros
32258d8f9a view_info: move the base info to a separate header
In the following commits the base_depenedent_view_info will be needed
in many more places. To avoid including the whole db/view/view.hh
or forward declaring (where possible) the base info, we move it to
a separate header which can be included anywhere at almost no cost.
2025-04-24 01:08:39 +02:00
Wojciech Mitros
a3d2cd6b5e view_info: move computation of view pk columns not in base pk to view_updates
In preparation of making the base_info immutable, we want to get rid of
any base_dependent_view_info fields that can change when base schema
is updated.
The _base_regular_columns_in_view_pk and _base_static_columns_in_view_pk
base column_ids of corresponding base columns and they can change
(decrease) when an earlier column is dropped in the base table.
view_updates is the only location where these values are used and calculating
them is not expensive when comparing to the overall work done while performing
a view update - we iterate over all view primary key columns and look them up
in the base table.
With this in mind, we can just calculate them when creating a view_updates
object, instead of keeping them in the base_info. We do that in this patch.
2025-04-24 01:08:39 +02:00
Wojciech Mitros
a33963daef view_info: move base-dependent variables into base_info
The has_computed_column_depending_on_base_non_primary_key
and is_partition_key_permutation_of_base_partition_key variables
in the view_info depend on the base table so they should be in the
base_dependent_view_info instead of view_info.
2025-04-24 01:08:39 +02:00
Wojciech Mitros
900687c818 view_info: set base info on construction
Currently, the base_info may or may not be set in view schemas.
Even when it's set, it may be modified. This necessitates extra
checks when handling view schemas, as well as potentially causing
errors when we forget to set it at some point.

Instead, we want to make the base info an immutable member of view
schemas (inside view_info). The first step towards that is making
sure that all newly created schemas have the base info set.
We achieve that by requiring a base schema when constructing a view
schema. Unfortunately, this adds complexity each time we're making
a view schema - we need to get the base schema as well.
In most cases, the base schema is already available. The most
problematic scenario is when we create a schema from mutations:
- when parsing system tables we can get the schema from the
database, as regular tables are parsed before views
- when loading a view schema using the schema loader tool, we need
to load the base additionally to the view schema, effectively
doubling the work
- when pulling the schema from another node - in this case we can
only get the current version of the base schema from the local
database

Additionally, we need to consider the base schema version - when
we generate view updates the version of the base schema used for
reads should match the version of the base schema in view's base
info.
This is achieved by selecting the correct (old or new) schema in
`db::schema_tables::merge_tables_and_views` and using the stored
base schema in the schema_registry.
2025-04-24 01:08:39 +02:00
Benny Halevy
f279625f59 test_tablets_cql: test_alter_dropped_tablets_keyspace: extend expected error
The query may fail also on a no_such_keyspace
exception, which generates the following cql error:
```
Error from server: code=2200 [Invalid query] message="Can\'t find a keyspace test_1745198244144_qoohq"
```
Extend the pytest.raises match expression to include
this error as well.

Fixes #23812

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#23875
2025-04-23 18:54:22 +03:00
Benny Halevy
2bbdaeba1c Update seastar submodule
* seastar e44af9b0...d7ff58f2 (2):
  > rpc: client: support timeout and cancellation
  > doc/io-properties-file.md: correct a typo

Closes scylladb/scylladb#23865
2025-04-23 16:10:51 +03:00
Aleksandra Martyniuk
c1618c7de5 test: test table drop during flush 2025-04-23 14:29:28 +02:00
Aleksandra Martyniuk
91b57e79f3 replica: skip flush of dropped table 2025-04-23 14:29:28 +02:00
Kefu Chai
0d7752b010 build: cmake: generalize update_cxx_flags()
Refactor our CMake flag handling to make it more flexible and reduce
repetition:

- Rename update_cxx_flags() to update_build_flags() to better reflect
  its expanded purpose
- Generate CMake variable names internally based on configuration type
  instead of requiring callers to specify full variable names
- Follow CMake's standard naming conventions for configuration-specific
  flags, see
  https://cmake.org/cmake/help/latest/variable/CMAKE_LANG_FLAGS.html#variable:CMAKE_%3CLANG%3E_FLAGS
- Prepare groundwork for handling linker flags in addition to compiler
  flags in future changes

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23842
2025-04-23 12:06:04 +03:00
Nadav Har'El
64a5eee6b9 test/cqlpy: insert test names into Scylla logs
Both test.py and test/cqlpy/run run many test functions against the same
Scylla process. In the resulting log file, it is hard to understand which
log messages are related to which test. In this patch, we log a message
(using the "/system/log" REST API) every time a test is started or ends.

The messages look like this:

    INFO  2025-04-22 15:10:44,625 [shard 1:strm] api - /system/log:
    test/cqlpy: Starting test_lwt.py::test_lwt_missing_row_with_static
    ...
    INFO  2025-04-22 15:10:44,631 [shard 0:strm] api - /system/log:
    test/cqlpy: Ended test_lwt.py::test_lwt_missing_row_with_static

We already had a similar feature in test/alternator, added three years
ago in commit b0371b6bf8. The implementation
is similar but not identical due to different available utility functions,
and in any case it's very simple.

While at it, this patch also fixes the has_rest_api() to timeout after
one second. Without this, if the REST API is blocked in a way that
a connection attempt just hangs, the tests can hang. With the new
timeout, the test will hang for a second, realize the REST API is
not available, and remember this decision (the next tests will not
wait one second again). We had the same bug in Alternator, and fixed
it in 758f8f01d7. This one second "pause"
will only happen if the REST API port is blocked - in the more typical
case the REST API port is just not listening but not blocked, and the
failure will be noticed immediately and won't wait a whole second.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23857
2025-04-23 12:04:14 +03:00
Piotr Dulikowski
3d73c79a72 test: mv: skip test_view_building_scheduling_group in debug
The test populates a table with 50k rows, creates a view on that table
and then compares the time spent in streaming vs. gossip scheduling
groups. It only takes 10s in dev mode on my machine, but is much slower
in debug mode in CI - building the view doesn't finish within 2 minutes.

The bigger the view to build, the more accurrate the measurement;
moreover, the test scenario isn't interesting enough to be worth running
it in debug mode as this should be covered by other tests. Therefore,
just skip this test in debug mode.

Fixes: scylladb/scylladb#23862

Closes scylladb/scylladb#23866
2025-04-23 11:29:35 +03:00
Pavel Emelyanov
a6ba535c3c Merge 'test.py: refactoring before boost pytest integration' from Andrei Chekun
This PR contains changes that do not add new functionality, and have small refactoring of the existing code.
The most significant change though is switching the SQLite writer from a singleton to a thread locking mechanism that will be needed later on.

This PR is an extraction of several commits from https://github.com/scylladb/scylladb/pull/22894 as reviewer [request](https://github.com/scylladb/scylladb/pull/22894?notification_referrer_id=NT_kwDOACiLR7MxNDg0ODk2MDU1MjoyNjU3MDk1&notifications_query=reason%3Aparticipating#pullrequestreview-2778582278).

Closes scylladb/scylladb#23867

* github.com:scylladb/scylladb:
  test.py: move the readme file for LDAP tests to the correct location
  test.py: eliminate deprecation warning for xml.etree.ElementTree.Element
  test.py: align the behavior of max-failures parameter with pytest maxfail
  test.py: fix typo in toxiproxy name parameter
  test.py: add locking to the sqlite writer for resource gather
  test.py: add sqlite datetime adapter for resource gather
  test.py: change the parameter for get_modes_to_run()
2025-04-23 11:10:56 +03:00
Andrei Chekun
57b66e6b2e test.py: move the readme file for LDAP tests to the correct location
README file was created in incorrect location, now it moved to the
directory with source files where it intended to be.
2025-04-22 19:03:28 +02:00
Andrei Chekun
cf4747c151 test.py: eliminate deprecation warning for xml.etree.ElementTree.Element
Testing the truth value of an Element emits DeprecationWarning. This check is done correctly
2025-04-22 19:03:21 +02:00
Andrei Chekun
bc49cd5214 test.py: align the behavior of max-failures parameter with pytest maxfail
This will allow to just transfer the existing max-failures values to the
pytest without any modification. As a downside test.py logic of handling
these changes slightly.
2025-04-22 19:03:08 +02:00
Andrei Chekun
5c3501e4bf test.py: fix typo in toxiproxy name parameter
Fix typo in toxiproxy name parameter. No any functional changes just
cosmetic fix.
2025-04-22 19:02:12 +02:00
Andrei Chekun
2c37a793d1 test.py: add locking to the sqlite writer for resource gather
SQLite blocking the DB during writes, so it's not possible to make writes from
several thread. To be able to gather metrics in several threads, we need a
locking mechanism for threads during writes. So thread will not try to
write metrics while another thread is performing writes.
2025-04-22 19:01:30 +02:00
Andrei Chekun
800710dc2c test.py: add sqlite datetime adapter for resource gather
Add sqlite datetime adapter for resource gather since default adapters are deprecated from 3.12
2025-04-22 18:59:49 +02:00
Andrei Chekun
bf2a9e267e test.py: change the parameter for get_modes_to_run()
Change the parameter for get_modes_to_run() from session to config to
narrow the scope, and prepare it to later use in method that do not have
access to the session, but have access to the config object
2025-04-22 18:58:33 +02:00
Kefu Chai
7254c0c515 db/config.cc: correct a typo in option's description
s/incomming/incoming/

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23826
2025-04-22 16:55:04 +03:00
Pavel Emelyanov
65efd2b2f6 Merge 'Refactor and enhance s3_tests' from Ernest Zaslavsky
This PR introduces a cleanup mechanism in s3_tests to remove uploaded objects after the test completes, ensuring a clean testing environment. Additionally, the recently added test has been refactored and split into smaller, more maintainable parts, improving readability and extending its coverage to include the "proxied" case.

As these changes primarily improve code aesthetics and maintainability, backporting is not necessary.

Refs: https://github.com/scylladb/scylladb/issues/23830

Closes scylladb/scylladb#23828

* github.com:scylladb/scylladb:
  s3_tests: Improve and extend copy object test coverage
  s3_tests: Implement post-test cleanup for uploaded objects
2025-04-22 16:40:37 +03:00
Nadav Har'El
5fd2eabd48 Merge 'Generalize the diversity of parse_table_infos() callers in API' from Pavel Emelyanov
The helper in question is used in several different ways -- by handlers directly (most of the callers), as a part of wrap_ks_cf() helper and by one of its overloads that unpack the "cf" query parameter from request. This PR generalizes most of the described callers thus reducing the number differently-looking of ways API handlers parse "keyspace" and "cf" request parameters.

Continuation of #22742

Closes scylladb/scylladb#23368

* github.com:scylladb/scylladb:
  api: Squash two parse_table_infos into one
  api: Generalize keyspaces:tables parsing a little bit more
  api: Provide general pair<keyspace, vector<table>> parsing
  api: Remove ks_cf_func and related code
2025-04-22 15:40:06 +03:00
Nadav Har'El
8d1a413357 test/scylla_gdb: better error message when running on dev build mode
The test/scylla_gdb suite needs Scylla to have been built with debug
symbols - which is NOT the case for the dev build. So the script
test/scylla_gdb/run attempts to recognize when a developer runs it
on an executable with the debug symbols missing - and prints a clear error.

Unfortunately, as we noticed in #10863, and again in #23832, because
wasmtime is compiled with debug symbols and linked with Scylla,
build/dev/scylla "pretends" to have debug symbols, foiling the check
in test/scylla_gdb/run. Reviewers rejected two solutions to this problem
(pull requests #10865 and #10923), so in pull request #10937 I added
a cosmetic solution just for test/scylla_gdb: in test/scylla_gdb/conftest.py
we check that there are **really** debug symbols that interest us,
and if not, exit immediately instead of failing each test separately.

For some reason, the sys.exit() we used is no longer effective - it
no longer exits pytest, so in this patch we use pytest.exit() instead.

Fixes #23832 (sort of, we leave build/dev/scylla with the fake claim
that it has debug symbols, but test/scylla_gdb will handle this
situation more gracefully).

Closes scylladb/scylladb#23834
2025-04-22 15:02:06 +03:00
Michael Litvak
5c1d24f983 test: test_mv_topology_change: increase timeout for remove_node
The test `test_mv_write_to_dead_node` currently uses a timeout of 60
seconds for remove_node, after it was increased from 30 seconds to fix
scylladb/scylladb#22953. Apparently it is still too low, and it was
observed to fail in debug mode.

Normally remove_node uses a default timeout of TOPOLOGY_TIMEOUT = 1000
seconds, but the test requires a timeout which is shorter than 5
minutes, because it is a regression test for an issue where MV updates
hold topology changes for more than 5 minutes, and we want to verify in
the test that the topology change completes in less than 5 minutes.

To resolve the issue, we set the test to skip in debug mode, because the
remove node operation is unpredictably slow, and we increase the timeout
to 180 seconds which is hopefully enough time for remove_node in
non-debug modes, and still sufficient to satisfy the test requirements.

Fixes scylladb/scylladb#22530

Closes scylladb/scylladb#23833
2025-04-22 10:51:19 +02:00
Kefu Chai
a2b46cbf45 sstables_loader: fix the indent
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-04-22 12:05:55 +08:00
Kefu Chai
6b3ecad467 sstables_loader: fix the racing between get_progress() and release_resources()
This change addresses a critical race condition in the sstables_loader
where `get_progress()` could access invalid `progress_holder` instances
after `release_resources()` destroyed them.

Problem:
- Progress tracking uses two components: `_progress_state` (tracks state)
  and `_progress_per_shard` (sharded service with actual progress data)
- `get_progress()` first checks if `_progress_state` is initialized, then
  accumulates progress from `_progress_per_shard`
- As both functions are coroutines, `get_progress()` could be preempted
  after state check but before accessing `_progress_per_shard`
- If `release_resources()` runs during this preemption, it destroys the
  `progress_holder` instances in `_progress_per_shard`, causing
  `get_progress()` to access invalid memory.

Solution:
- Implemented shared/exclusive locking to protect access to both state
  and sharded progress data
- Multiple `get_progress()` calls can execute in parallel (shared access)
- `release_resources()` acquires exclusive access before modifying resources
- This prevents potential memory corruption and ensures consistent
  progress reporting

Fixes #23801

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-04-22 12:05:54 +08:00
Ernest Zaslavsky
edaa3f4bdd s3_tests: Improve and extend copy object test coverage
Refactored the copy object test to enhance readability and maintainability.
The test was simplified and split into smaller, more focused parts.
Additionally, a "proxied" variant of the test was introduced to expand
coverage.
2025-04-21 20:54:14 +03:00
Ernest Zaslavsky
252a0a14af s3_tests: Implement post-test cleanup for uploaded objects
Ensure cleanup after tests by deleting objects uploaded to MinIO.
This improves resource management and maintains a clean test environment.
2025-04-21 20:54:14 +03:00
Avi Kivity
2dcd2b21ae Merge 'tablets: Equalize per-table balance when allocating tablets for a new table' from Tomasz Grabiec
Fixes the following scenario:

1. Scale out adds new nodes to each rack
2. Table is created - all tablets are allocated to new nodes because they have low load
3. Rebalancing moves tablets from old nodes to new nodes - table balance for the new table is not fixed

We're wrong to try to equalize global load when allocating tablets,
and we should equalize per-table load instead, and let background load
balancing fix it in a fair way. It will add to the allocated storage
imbalance, but:

1. The table is initially empty, so doesn't impact actual storage imbalance.
2. It's more important to avoid overloading CPU on the nodes - imbalance hurts this aspect immediately.
3. If the table was created before imbalance was formed, we would end up in the same situation as in the problematic scenario after the patch.
4. It's the job of the load balancing to keep up with storage growing, and if it's not, scale out should kick in.

Before we have CPU-aware tablet allocation, and thus can prove we have
CPU capacity on the small nodes, we should respect per-table balance
as this is the way in which we achieve full CPU utilization.

Fixes #23631

Backport to 2025.1 because load imbalance is a serious problem in production.

Closes scylladb/scylladb#23708

* github.com:scylladb/scylladb:
  tablets: Equalize per-table balance when allocating tablets for a new table
  load_sketch: Tolerate missing tablet_map when selecting for a given table
  tests: tablets: Simplify tests by moving common code to topology_builder
2025-04-21 17:06:30 +03:00
Pavel Emelyanov
eb5b52f598 Merge 'main: make DC and rack immutable after bootstrap' from Piotr Dulikowski
Changing DC or rack on a node which was already bootstrapped is, in
case of vnodes, very unsafe (almost guaranteed to cause data loss or
unavailability), and is outright not supported if the cluster has
a tablet-backed keyspaces. Moreover, the possibility of doing that
makes it impossible to uphold some of the invariants promised by
the RF-rack-valid flag, which is eventually going to become
unconditionally enabled.

Get rid of the above problems by removing the possibility of changing
the DC / rack of a node. A node will now fail to start if its snitch
reports a different DC or rack than the one that was reported during the
first boot.

Fixes: scylladb/scylladb#23278
Fixes: scylladb/scylladb#22869

Marking for backport to 2025.1, as this is a necessary part of the RF-rack-valid saga

Closes scylladb/scylladb#23800

* github.com:scylladb/scylladb:
  doc: changing topology when changing snitches is no longer supported
  test: cluster: introduce test_no_dc_rack_change
  storage_service: don't update DC/rack in update_topology_with_local_metadata
  main: make dc and rack immutable after bootstrap
  test: cluster: remove test_snitch_change
2025-04-21 15:52:55 +03:00
Yaniv Michael Kaul
b374f94b15 pip installation: use --no-cache-dir
There are two reasons we may want NOT to use caching of pip deps:
1. When building a container, unless we specifically clean it up, it'll remain, even when we squash the image layers later.
2. When building a container, that cache is not useful, as we squash our containers later (so that layer is not cached really). And our CI cleans up the layers repo anyway.
3. Caching sometimes isn't great, and doesn't ensure we pick up the exact version (or latest) that we wish to...

This PR changes two locations in Scylla, both of which (also) build containers, so certainly relevant for 1, 2 above and possibly 3.
No real need to backport.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#23822
2025-04-21 13:46:57 +03:00
Avi Kivity
0ba3ce1741 test: gdb: avoid using file(1) to determine if debug information is present
The scylla_gdb tests verify, as a sanity check, that the executable
was built with debug information. They do so via file(1).

In Fedora 42, file(1) crashes on ELF files that have interpreter pathnames
larger than 128 characters[1]. This was later fixed[2], but the fix is not
in any release.

Work around the problem by using objdump instead of file.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2354970
[2] b3384a1fbf

Closes scylladb/scylladb#23823
2025-04-21 13:29:27 +03:00
Andrei Chekun
441cee8d9c test.py: fix gathering logs in case of fail
Currently log files have information about run_id twice:
cluster.object_store_test_backup.10.test_abort_restore_with_rpc_error.dev.10_cluster.log
However, sometimes the first run_id can be incorrect:
cluster.object_store_test_backup.1.test_abort_restore_with_rpc_error.dev.10_cluster.log
Removing first run_id in the name to not face this issue and because
it's actually redundant.
Removing creation empty file for scylla manager log, since it redundant
and was done as incorrect assumption on the root cause of the fail.
Add extension to the stacktrace file, so it will be opened in the
browser in Jenkins in the new tab instead of downloading it.

Fixes: https://github.com/scylladb/scylladb/issues/23731

Closes scylladb/scylladb#23797
2025-04-21 13:12:35 +03:00
Pavel Emelyanov
09caad6147 test: Remove sstable_assertions::get_stats_metadata()
It mirrors the sstable method of the same name, which is public. With ->
operator, it's just as convenient to call it directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-04-18 18:53:41 +03:00
Pavel Emelyanov
294e56207d test: Add sstable_assertions::operator->()
... and replace get_sstable() with it. It's more natural (despite having
the only user) to consider the class to be yet another "pointer" to an
sstable.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-04-18 18:52:39 +03:00
Sergey Zolotukhin
2314feeae2 test: Ignore DEBUG,TRACE,INFO level messages when checking for failed mutations.
Update the regular expression in `check_node_log_for_failed_mutations` to avoid
false test failures when DEBUG-level logging is enabled.

Fixes scylladb/scylladb#23688

Closes scylladb/scylladb#23658
2025-04-18 16:17:41 +03:00
Calle Wilund
4a44651fce encryption_at_rest_test: Make fake_proxy read/write loop noexcept
Fixes #23774

Test code falls into same when_all issue as http client did.
Avoid passing exceptions through this, and instead catch and
report in worker lambda.

Closes scylladb/scylladb#23778
2025-04-18 16:17:41 +03:00
Pavel Emelyanov
324daac156 Merge 'Add CopyObject API implementation to S3 client' from Ernest Zaslavsky
Implement the CopyObject API to directly copy S3 object from one location to another. This implementation consumes zero networking overhead on the client side since the object is copied internally by S3 machinery

Usage example: Backup of tiered SSTables - you already have SSTables on S3, CopyObject is the ideal way to go

No need to backport since we are adding new functionality for a future use

Closes scylladb/scylladb#23779

* github.com:scylladb/scylladb:
  s3_client: implement S3 copy object
  s3_client: improve exception message
  s3_client: reposition local function for future use
2025-04-18 16:17:41 +03:00
Pavel Emelyanov
cc919b08c2 Merge 'backup: Optimize S3 throughput with shard-based upload' from Ernest Zaslavsky
This PR enhances S3 throughput by leveraging every available shard to upload backup files concurrently. By distributing the load across multiple shards, we significantly improve the upload performance. Each shard retrieves an SSTable and processes its files sequentially, ensuring efficient, file-by-file uploads.

To prevent uncontrolled fiber creation and potential resource exhaustion, the backup task employs a directory semaphore from the sstables_manager. This mechanism helps regulate concurrency at the directory level, ensuring stable and predictable performance during large-scale backup operations.

Refs #22460
fixes: #22520

```
===========================================
 Release build, master, smp-16, mem-32GiB
 Bytes: 2342880184, backup time: 9.51 s
===========================================
 Release build, this PR, smp-16, mem-32GiB
 Bytes: 2342891015, backup time: 1.23 s
===========================================
```
Looks like it is faster at least x7.7

No backport needed since it (native backup) is still unused functionality

Closes scylladb/scylladb#23727

* github.com:scylladb/scylladb:
  backup: Add test for invalid endpoint
  backup_task: upload on all shards
  backup_task: integrate sharded storage manager for upload
2025-04-18 16:17:41 +03:00
Avi Kivity
6b415cfd4b Merge 'managed_bytes: in the copy constructor, respect the target preferred allocation size' from Michał Chojnowski
Commit 14bf09f447 added a single-chunk layout to `managed_bytes`, which makes the overhead of `managed_bytes` smaller in the common case of a small buffer.

But there was a bug in it. In the copy constructor of `managed_bytes`, a copy of a single-chunk `managed_bytes` is made single-chunk too.

But this is wrong, because the source of the copy and the target of the copy might have different preferred max contiguous allocation sizes.

In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB is copied from the standard allocator into LSA, the resulting `managed_bytes` is a single chunk which violates LSA's preferred allocation size. (And therefore is placed by LSA in the standard allocator).

In other words, since Scylla 6.0, cache and memtable cells between 13 kiB and 128 kiB are getting allocated in the standard allocator rather than inside LSA segments.

Consequences of the bug:

1. Effective memory consumption of an affected cell is rounded up to the nearest power of 2.

2. With a pathological-enough allocation pattern (for example, one which somehow ends up placing a single 16 kiB memtable-owned allocation in every aligned 128 kiB span), memtable flushing could theoretically deadlock, because the allocator might be too fragmented to let the memtable grow by another 128 kiB segment, while keeping the sum of all allocations small enough to avoid triggering a flush. (Such an allocation pattern probably wouldn't happen in practice though).

3. It triggers a bug in reclaim which results in spurious allocation failures despite ample evictable memory.

   There is a path in the reclaimer procedure where we check whether reclamation succeeded by checking that the number of free LSA segments grew.

   But in the presence of evictable non-LSA allocations, this is wrong because the reclaim might have met its target by evicting the non-LSA allocations, in which case memory is returned directly to the standard allocator, rather than to the pool of free segments.

   If that happens, the reclaimer wrongly returns `reclaimed_nothing` to Seastar, which fails the allocation.

Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072
Fixes https://github.com/scylladb/scylladb/issues/22941
Fixes https://github.com/scylladb/scylladb/issues/22389
Fixes https://github.com/scylladb/scylladb/issues/23781

This is a regression fix, should be backported to all affected releases.

Closes scylladb/scylladb#23782

* github.com:scylladb/scylladb:
  managed_bytes_test: add a reproducer for #23781
  managed_bytes: in the copy constructor, respect the target preferred allocation size
2025-04-17 21:14:10 +03:00
Pavel Emelyanov
ca2cc5e826 Merge 'test/cluster/test_read_repair: make incremental test work with tablets' from Botond Dénes
There are two tests which test incremental read repair: one with row the other with partition tombstones. The tests currently force vnodes, by creating the test keyspace with {'enabled': false}. Even so, the tests were found to be flaky so one of them are marked for skip. This commit does the following changes:
* Make the tests use tablets by creating the test keyspace with tablets.
* Change the way the tests write data so it works with tablets: currently the tests use scylla-sstable write + upload but this won't work with tablets since upload with tablets implies --load-and-stream which means data is streamed to all replicas (no difference created between nodes). Switch to the classic stop-node + write to other replica with CL=ONE.
* Remove the skip added to the partition-tombstone test variant.

Fixes: #21179

Test improvement, no backport required.

Closes scylladb/scylladb#23167

* github.com:scylladb/scylladb:
  wip
  test/cluster/test_read_repair: make incremental test work with tablets
2025-04-17 18:54:00 +03:00
Piotr Dulikowski
325a89638c doc: changing topology when changing snitches is no longer supported
Update the "How to Switch Snitches" document to indicate that changing
topology (i.e. changing node's DC or rack) while changing the snitch is
no longer supported.

Remove a note which said that switching snitches is not supported with
tablets. It was introduced because of the concern that switching a
snitch might change DC or rack of the node, for which our current tablet
load balancer is completely unprepated. Now that changing DC/rack is
forbidden, there doesn't seem to be anything related to snitches which
could cause trouble for tablets.
2025-04-17 16:22:58 +02:00
Piotr Dulikowski
796c8d1601 test: cluster: introduce test_no_dc_rack_change
The test makes sure that changing the DC or rack in the snitch's
configuration fails with an expected error.
2025-04-17 16:22:58 +02:00
Piotr Dulikowski
1791ae3581 storage_service: don't update DC/rack in update_topology_with_local_metadata
The DC/rack are now immutable and cannot be changed after restart, so
there is no need to update the node's system.topology entry with this
information on restart.
2025-04-17 16:22:58 +02:00
Piotr Dulikowski
ce2fab7cce main: make dc and rack immutable after bootstrap
Changing DC or rack on a node which was already bootstrapped is, in
case of vnodes, very unsafe (almost guaranteed to cause data loss or
unavailability), and is outright not supported if the cluster has
a tablet-backed keyspaces. Moreover, the possibility of doing that
makes it impossible to uphold some of the invariants promised by
the RF-rack-valid flag, which is eventually going to become
unconditionally enabled.

Get rid of the above problems by removing the possibility of changing
the DC / rack of a node. A node will now fail to start if its snitch
reports a different DC or rack than the one that was reported during the
first boot.

Fixes: scylladb/scylladb#23278
2025-04-17 16:22:26 +02:00
Tomasz Grabiec
1e407ab4d2 tablets: Equalize per-table balance when allocating tablets for a new table
Fixes the following scenario:

1. Scale out adds new nodes to each rack
2. Table is created - all tablets are allocated to new nodes because they have low load
3. Rebalancing moves tablets from old nodes to new nodes - table balance for the new table is not fixed

We're wrong to try to equalize global load when allocating tablets,
and we should equalize per-table load instead, and let background load
balancing fix it in a fair way. It will add to the allocated storage
imbalance, but:

1. The table is initially empty, so doesn't impact actual storage imbalance.
2. It's more important to avoid overloading CPU on the nodes - imbalance hurts this aspect immediately.
3. If the table was created before imbalance was formed, we would end up in the same situation in the problematic scenario after the patch.
4. It's the job of the load balancing to keep up with storage growing, and if it's not, scale out should kick in.

Before we have CPU-aware tablet allocation, and thus can prove we have
CPU capacity on the small nodes, we should respect per-table balance
as this is the way in which we achieve full CPU utilization.

Fixes #23631
2025-04-17 16:01:23 +02:00
Tomasz Grabiec
2597a7e980 load_sketch: Tolerate missing tablet_map when selecting for a given table
To simplify future usage in
network_topology_strategy::add_tablets_in_dc() which invokes
populate() for a given table, which may be both new and preexisitng.
2025-04-17 16:01:16 +02:00
Ernest Zaslavsky
b79ca5a1aa backup: Add test for invalid endpoint
* During the development phase, the backup functionality broke because we lacked a test that runs backup with an invalid endpoint. This commit adds a test to cover that scenario.
* Add checking for the expected error to be propagated from failing/aborted backup
2025-04-17 16:31:43 +03:00
Benny Halevy
b7212620f9 backup_task: upload on all shards
Use all shards to upload snapshot files to S3.
By using the sharded sstables_manager_for_table
infrastructure.

Refs #22460

Quick perf comparison
===========================================
 Release build, master, smp-16, mem-32GiB
 Bytes: 2342880184, backup time: 9.51 s
===========================================
 Release build, this PR, smp-16, mem-32GiB
 Bytes: 2342891015, backup time: 1.23 s
===========================================

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Co-authored-by: Ernest Zaslavsky <ernest.zaslavsky@scylladb.com>
2025-04-17 16:31:42 +03:00
Piotr Dulikowski
dd2e507ece test: cluster: remove test_snitch_change
This test checked that it is possible to change DC/rack of a node during
restart. This will become explicitly forbidden, so remove the test.
2025-04-17 13:51:22 +02:00
Aleksandra Martyniuk
e178bd7847 test: add test for getting tasks children
Add test that checks whether the children of a virtual task will be
properly gathered if a node is down.
2025-04-17 13:48:44 +02:00
Aleksandra Martyniuk
53e0f79947 tasks: check whether a node is alive before rpc
Check whether a node is alive before making an rpc that gathers children
infos from the whole cluster in virtual_task::impl::get_children.
2025-04-17 12:51:22 +02:00
Michał Chojnowski
6c1889f65c managed_bytes_test: add a reproducer for #23781 2025-04-17 12:51:01 +02:00
Botond Dénes
8ac7c54d8b Merge 'topology_coordinator: stop: await all background_action_holder:s' from Benny Halevy
Add missing awaits for the rebuild_repair and repair background actions.
Although the background actions hold the _async_gate
which is closed in topology_coordinator::run(),
stop() still needs to await all background action futures
and handle any errors they may have left behind.

Fixes #23755

* The issue exists since 6.2

Closes scylladb/scylladb#17712

* github.com:scylladb/scylladb:
  topology_coordinator: stop: await all background_action_holder:s
  topology_coordinator: stop: improve error messages
  topology_coordinator: stop: define stop_background_action helper
2025-04-17 12:10:29 +03:00
Kefu Chai
b0cbe86780 s3/client: define a constant for security credential resource
instead of repeating it, let's define a consstant and reuse it.
less repeatings this way.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23713
2025-04-17 11:51:15 +03:00
Kefu Chai
a33651b03e db, service: do not include unused header
these unused headers were flagged by clang-include-cleaner.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23735
2025-04-17 11:49:59 +03:00
Botond Dénes
33e383c557 scripts/pull_github_pr.sh: add argument parsing
Instead of hardcoding PR_NUM=$1 and FORCE=$2. This current setup is
not very flexible and one gets no feedback if the arguments are
incorrect or not recognized.

Add proper position-independent argument parsing using a classic while
case loop.

Closes scylladb/scylladb#23623
2025-04-17 11:49:15 +03:00
Nadav Har'El
84d4af1f0e Merge 'Alternator batch rcu' from Amnon Heiman
This series adds support for reporting consumed capacity in BatchGetItem operations in Alternator.
It includes changes to the RCU accounting logic, exposing internal functionality to support batch-specific behavior, and adds corresponding tests for both simple and complex use cases involving multiple tables and consistency modes.

Need backporting to 2025.1, as RCU and WCU are not fully supported

Fixes #23690

Closes scylladb/scylladb#23691

* github.com:scylladb/scylladb:
  test_returnconsumedcapacity.py: test RCU for batch get item
  alternator/executor: Add RCU support for batch get items
  alternator/consumed_capacity: make functionality public
2025-04-17 10:08:16 +03:00
Botond Dénes
22a28ca1db wip 2025-04-17 03:01:17 -04:00
Ernest Zaslavsky
a369dda049 s3_client: implement S3 copy object
Add support for the CopyObject API to enable direct copying of S3
objects between locations. This approach eliminates networking
overhead on the client side, as the operation is handled internally
by S3.
2025-04-17 09:47:47 +03:00
Botond Dénes
19b4f10598 test/cluster/test_read_repair: make incremental test work with tablets
There are two tests which test incremental read repair: one with row the
other with partition tombstones. The tests currently force vnodes, by
creating the test keyspace with {'enabled': false}. Even so, the tests
were found to be flaky so one of them are marked for skip.
This commit does the following changes:
* Make the tests use tablets by creating the test keyspace with tablets.
* Change the way the tests write data so it works with tablets:
  currently the tests use scylla-sstable write + upload but this won't
  work with tablets since upload with tablets implies --load-and-stream
  which means data is streamed to all replicas (no difference created
  between nodes). Switch to the classic stop-node + write to other
  replica with CL=ONE.
* Remove the skip added to the partition-tombstone test variant.

Also add tracing to the read-repair query, to make debugging the test
easier if it fails.

Fixes: #21179
2025-04-17 02:01:17 -04:00
Michał Chojnowski
4e2f62143b managed_bytes: in the copy constructor, respect the target preferred allocation size
Commit 14bf09f447 added a single-chunk
layout to `managed_bytes`, which makes the overhead of `managed_bytes`
smaller in the common case of a small buffer.

But there was a bug in it. In the copy constructor of `managed_bytes`,
a copy of a single-chunk `managed_bytes` is made single-chunk too.

But this is wrong, because the source of the copy and the target
of the copy might have different preferred max contiguous allocation
sizes.

In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB
is copied from the standard allocator into LSA, the resulting
`managed_bytes` is a single chunk which violates LSA's preferred
allocation size. (And therefore is placed by LSA in the standard
allocator).

In other words, since Scylla 6.0, cache and memtable cells
between 13 kiB and 128 kiB are getting allocated in the standard allocator
rather than inside LSA segments.

Consequences of the bug:

1. Effective memory consumption of an affected cell is rounded up to the nearest
   power of 2.

2. With a pathological-enough allocation pattern
   (for example, one which somehow ends up placing a single 16 kiB
   memtable-owned allocation in every aligned 128 kiB span),
   memtable flushing could theoretically deadlock,
   because the allocator might be too fragmented to let the memtable
   grow by another 128 kiB segment, while keeping the sum of all
   allocations small enough to avoid triggering a flush.
   (Such an allocation pattern probably wouldn't happen in practice though).

3. It triggers a bug in reclaim which results in spurious
   allocation failures despite ample evictable memory.

   There is a path in the reclaimer procedure where we check whether
   reclamation succeeded by checking that the number of free LSA
   segments grew.

   But in the presence of evictable non-LSA allocations, this is wrong
   because the reclaim might have met its target by evicting the non-LSA
   allocations, in which case memory is returned directly to the
   standard allocator, rather than to the pool of free segments.

   If that happens, the reclaimer wrongly returns `reclaimed_nothing`
   to Seastar, which fails the allocation.

Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072
Fixes https://github.com/scylladb/scylladb/issues/22941
Fixes https://github.com/scylladb/scylladb/issues/22389
Fixes https://github.com/scylladb/scylladb/issues/23781
2025-04-16 22:06:06 +02:00
Nadav Har'El
6db666a1c1 replica: fix 10-second pause during shutdown
As noticed in issue #23687, if we shut down Scylla while a paged read is
in progress - or even a paged read that the client had no intention of
ever resume it - the shutdown pauses for 10 seconds.

The problem was the stop() order - we must stop the "querier cache"
before we can close sstables - the "querier cache" is what holds paged
readers alive waiting for clients to resume those reads, and while a
reader is alive it holds on to sstables so they can't be closed. The
querier cache's querier_cache::default_entry_ttl is set to 10 seconds,
which is why the shutdown was un-paused after 10 seconds.

This fix in this patch is obvious: We need to stop the querier cache
(and have it release all the readers it was holding) before we close
the sstables.

Fixes #23687

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23770
2025-04-16 20:35:44 +03:00
Avi Kivity
0206da5232 Merge 'readers: strip "flat" and "v2" from names' from Botond Dénes
Continue the effort of normalizing reader names, stripping legacy qualifying terms like "flat" and "v2".
Flat and v2 readers are the default now, we only need to add qualifying terms to readers which are different than the normal.
One such reader remains: `make_generating_reader_v1()`.

This PR contains mostly mechanical changes, done with a sed script. Commits which only contain such mechanical renames are marked as such in the commitlog.

Code cleanup, no backport needed.

Closes scylladb/scylladb#23767

* github.com:scylladb/scylladb:
  readers: mv reversing_v2.hh reversing.hh
  readers: mv generating_v2.hh generating.hh
  tree: s/make_generating_reader_v2/make_generating_reader/
  readers: mv from_mutations_v2.hh from_mutations.hh
  tree: s/make_mutation_reader_from_mutations_v2/make_mutation_reader_from_mutations/s
  readers: mv from_fragments_v2.hh from_fragments.hh
  readers: mv forwardable_v2.hh forwardable.hh
  readers: mv empty_v2.hh empty.hh
  tree: s/make_empty_flat_reader_v2/make_empty_mutation_reader/
  readers/empty_v2.hh: replace forward declarations with include of fwd header
  readers/mutation_reader_fwd.hh: forward declare reader_permit
  readers: mv delegating_v2.hh delegating.hh
  readers/delegating_v2.hh: move reader definition to _impl.hh file
2025-04-16 20:21:51 +03:00
Ernest Zaslavsky
8929cb324e s3_client: improve exception message
Clarify that the multipart upload was aborted due to a failure in
parsing ETags.
2025-04-16 18:58:22 +03:00
Ernest Zaslavsky
993953016f s3_client: reposition local function for future use
The local function has been relocated higher in the code
to prepare for its usage in upcoming implementations.
2025-04-16 18:46:31 +03:00
Ernest Zaslavsky
428f673ca2 backup_task: integrate sharded storage manager for upload
Introduce the sharded storage manager and use it to instantiate upload
clients. Full functionality will be implemented in subsequent changes.
2025-04-16 18:18:58 +03:00
Amnon Heiman
3acde5f904 test_returnconsumedcapacity.py: test RCU for batch get item
This patch adds tests for consumed capacity in batch get item.  It tests
both the simple case and the multi-item, multi-table case that combines
consistent and non-consistent reads.
2025-04-16 17:05:32 +03:00
Pavel Emelyanov
8b2cababb6 generic_server: Don't mess with db::config
The db::config is top-level configuration of scylla, we generally try to
avoid using it even in scylla components: each uses its own config
initialized by the service creator out of the db::config itself. The
generic_server is not an exception, all the more so, it already has its
own config.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23705
2025-04-16 17:02:30 +03:00
Amnon Heiman
88095919d0 alternator/executor: Add RCU support for batch get items
This patch adds RCU support for batch get items.  With batch requests,
multiple objects are read from multiple tables. While the criterion for
adding the units is per the batch request, the units are calculated per
table—and so is the read consistency.
2025-04-16 16:53:22 +03:00
Amnon Heiman
0eabf8b388 alternator/consumed_capacity: make functionality public
The consumed_capacity_counter is not completely applicable for batch
operations.  This patch makes some of its functionality public so that
batch get item can use the components to decide if it needs to send
consumed capacity in the reply, to get the half units used by the
metrics and returned result, and to allow an empty constructor for the
RCU counter.
2025-04-16 16:49:40 +03:00
Benny Halevy
7a0f5e0a54 topology_coordinator: stop: await all background_action_holder:s
Add missing awaits for the rebuild_repair and repair background actions.
Although the background actions hold the _async_gate
which is closed in topology_coordinator::run(),
stop() still needs to await all background action futures
and handle any errors they may have left behind.

Fixes #23755

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-16 15:23:02 +03:00
Benny Halevy
6de79d0dd3 topology_coordinator: stop: improve error messages
"when cleanup" is ill-formed. Use "when XYZ"
to "during XYZ" instead.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-16 15:20:58 +03:00
Benny Halevy
d624795fda topology_coordinator: stop: define stop_background_action helper
Refactor the code to use a helper to await background_action_holder
and handle any errors by printing a warning.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-16 15:20:39 +03:00
Botond Dénes
6172ff501f readers: mv reversing_v2.hh reversing.hh
Completely mechanical change.
2025-04-16 04:46:08 -04:00
Botond Dénes
c8563b9604 readers: mv generating_v2.hh generating.hh
Completely mechanical change.
2025-04-16 04:46:08 -04:00
Botond Dénes
dfd7f03463 tree: s/make_generating_reader_v2/make_generating_reader/
Completely mechanical change.
2025-04-16 04:46:08 -04:00
Botond Dénes
c29c696780 readers: mv from_mutations_v2.hh from_mutations.hh
Completely mechanical change.
2025-04-16 04:46:08 -04:00
Botond Dénes
b104862702 tree: s/make_mutation_reader_from_mutations_v2/make_mutation_reader_from_mutations/s
Completely mechanical change.
2025-04-16 04:46:07 -04:00
Anna Stuchlik
0b4740f3d7 doc: add info about Scylla Doctor Automation to the docs
Fixes https://github.com/scylladb/scylladb/issues/23642

Closes scylladb/scylladb#23745
2025-04-16 11:44:35 +03:00
Botond Dénes
7547d0c6a9 readers: mv from_fragments_v2.hh from_fragments.hh
Completely mechanical change.
2025-04-16 04:35:00 -04:00
Botond Dénes
f1bd2553ed readers: mv forwardable_v2.hh forwardable.hh
Completely mechanical change.
2025-04-16 04:33:50 -04:00
Botond Dénes
a9d75c4f9d readers: mv empty_v2.hh empty.hh
Completely mechanical change.
2025-04-16 04:32:56 -04:00
Botond Dénes
05829f98f3 tree: s/make_empty_flat_reader_v2/make_empty_mutation_reader/
Completely mechanical change.
2025-04-16 04:32:56 -04:00
Botond Dénes
0e33f0d09e readers/empty_v2.hh: replace forward declarations with include of fwd header 2025-04-16 04:12:08 -04:00
Botond Dénes
d75936d989 readers/mutation_reader_fwd.hh: forward declare reader_permit
It is commonly used as parameter to reader factory methods.
2025-04-16 04:12:08 -04:00
Botond Dénes
7d9b91a00e readers: mv delegating_v2.hh delegating.hh
Completely mechanical change.
2025-04-16 04:11:55 -04:00
Botond Dénes
c7f68a2649 readers/delegating_v2.hh: move reader definition to _impl.hh file
The idea behind readers/ is that each reader has its minimal header with
just a factory method declaration. The delegating reader is defined in
the factory header because it has a derived class in row_cache_test.cc.
Move the definition to delegating_impl.hh so users not interested in
deriving from it don't pay the price in header include cost.
2025-04-16 03:47:57 -04:00
Pavel Emelyanov
70ac5828a8 Update seastar submodule
* seastar 099cf616...e44af9b0 (19):
  > Add assertion to `get_local_service`
  > http_client: Improve handling of server response parsing errors
  > util: include used header
  > core: Fix module linkage by using `inline constexpr` for shared constants
  > build: fix P2582R1 detection for GCC compiler compatibility
  > app-template: remove production warning
  > ioinfo: Extend printed data a bit more
  > reactor: Fix indentation after previous patch
  > reactor: Configure multiple mountpoints per disk
  > io_queue, resource, reactor: Rename dev_t -> unsigned
  > resource: Rename mountpoint to disk in resources
  > reactor: Keep queues as shared_ptr-s
  > io_queue: Drop device ID
  > io_intent: Use unsigned queue id as a key
  > io_queue: Keep unsigned queue id on an io_queue
  > file: Keep device_id on posix file impl
  > io_queue: Print mountpoint in latency goal bump message
  > io_intent: Rename qid to cid
  > reactor: Move engine()._num_io_groups assignment and check

Changes in io-queue call for scylla-gdb update as well -- now the
reactor map of device to io-queue uses seastar::shared_ptr, not
std::unique_ptr.

Closes scylladb/scylladb#23733
2025-04-16 09:44:37 +03:00
Botond Dénes
f5125ffa18 Merge 'Ensure raft group0 RPCs use the gossip scheduling group.' from Sergey Zolotukhin
Scylla operations use concurrency semaphores to limit the number of concurrent operations and prevent resource exhaustion. The semaphore is selected based on the current scheduling group.

For RAFT group operations, it is essential to use a system semaphore to avoid queuing behind user operations. This patch ensures that RAFT operations use the `gossip` scheduling group to leverage the system semaphore.

Fixes scylladb/scylladb#21637

Backport: 6.2 and 6.1

Closes scylladb/scylladb#22779

* github.com:scylladb/scylladb:
  Ensure raft group0 RPCs use the gossip scheduling group
  Move RAFT operations verbs to GOSSIP group.
2025-04-16 09:11:29 +03:00
Lakshmipathi
42ed6a87bf test: Test truncate during topology change
Add a new node, during topology change issue truncate call and
verify all nodes empty data after tablet migration.

Fixes: https://github.com/scylladb/scylla-dtest/issues/5317

Signed-off-by: Lakshmipathi Ganapathi <lakshmipathi.ganapathi@scylladb.com>

Closes scylladb/scylladb#22595
2025-04-16 09:10:22 +03:00
Tomasz Grabiec
001d3b2415 Merge 'storage_service: preserve state of busy topology when transiting tablet' from Łukasz Paszkowski
Commit 876478b84f ("storage_service: allow concurrent tablet migration in tablets/move API", 2024-02-08) introduced a code path on which the topology state machine would be busy -- in "tablet_draining" or "tablet_migration" state -- at the time of starting tablet migration. The pre-commit code would unconditionally transition the topology to "tablet_migration" state, assuming the topology had been idle previously. On the new code path, this state change would be idempotent if the topology state machine had been busy in "tablet_migration", but the state change would incorrectly overwrite the "tablet_draining" state otherwise.

Restrict the state change to when the topology state machine is idle.

In addition, add the topology update to the "updates" vector with plain push_back(). emplace_back() is not helpful here, as topology_mutation_builder::build() cannot construct in-place, and so we invoke the "canonical_mutation" move constructor once, either way.

Unit test:

Start a two node cluster. Create a single tablet on one of the nodes. Start decommissioning that node, but block decommissioning at once. In that state (i.e., in "tablet_draining"), move the tablet manually to the other node. Check that transit_tablet() leaves the topology transition state alone.

Fixes https://github.com/scylladb/scylladb/issues/20073.

Commit 876478b84f was first released in scylla-6.0.0, so we might want to backport this patch accordingly.

Closes scylladb/scylladb#23751

* github.com:scylladb/scylladb:
  storage_service: add unit test for mid-decommission transit_tablet()
  storage_service: preserve state of busy topology when transiting tablet
2025-04-16 00:19:24 +02:00
Pavel Emelyanov
b79137eaa4 storage_service: Use this->_features directly
This dependency is already there, storage service doesn't need to go
rounds via database reference to get to the features.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23739
2025-04-15 21:11:12 +03:00
Tomasz Grabiec
d493a8d736 tests: tablets: Simplify tests by moving common code to topology_builder
Reduces code duplication.
2025-04-15 16:05:41 +02:00
Laszlo Ersek
841ca652a0 storage_service: add unit test for mid-decommission transit_tablet()
Start a two node cluster. Create a single tablet on one of the nodes.
Start decommissioning that node, but block decommissioning at once. In
that state (i.e., in "tablet_draining"), move the tablet manually to the
other node. Check that transit_tablet() leaves the topology transition
state alone.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2025-04-15 15:15:25 +02:00
Michał Chojnowski
b3d951517d test/scylla_gdb: generate a coredump when coro_task fails
This test fails sometimes, but rarely and unreliably.
We want to get a coredump from it the next time it fails.
Sending a SIGSEGV should induce that.

Refs https://github.com/scylladb/scylladb/issues/22501

Closes scylladb/scylladb#23256
2025-04-15 15:16:38 +03:00
Calle Wilund
abd2d8a58b test_tools: Manual merge of local key gen tool test from enterprise
Fixes scylladb/scylla-enterprise#5358

Transposed tool test for local file generator, originally java test.
Then enterprise test. Now here.

Closes scylladb/scylladb#23726
2025-04-15 15:14:08 +03:00
Laszlo Ersek
e1186f0ae6 storage_service: preserve state of busy topology when transiting tablet
Commit 876478b84f ("storage_service: allow concurrent tablet migration
in tablets/move API", 2024-02-08) introduced a code path on which the
topology state machine would be busy -- in "tablet_draining" or
"tablet_migration" state -- at the time of starting tablet migration. The
pre-commit code would unconditionally transition the topology to
"tablet_migration" state, assuming the topology had been idle previously.
On the new code path, this state change would be idempotent if the
topology state machine had been busy in "tablet_migration", but the state
change would incorrectly overwrite the "tablet_draining" state otherwise.

Restrict the state change to when the topology state machine is idle.

In addition, add the topology update to the "updates" vector with plain
push_back(). emplace_back() is not helpful here, as
topology_mutation_builder::build() cannot construct in-place, and so we
invoke the "canonical_mutation" move constructor once, either way.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2025-04-15 13:44:45 +02:00
Piotr Dulikowski
22e3b8eccd Merge 'test/cqlpy: Adjust tests to RF-rack-valid keyspaces' from Dawid Mędrek
In this PR, we adjust tests in the cqlpy test suite so they
only use RF-rack-valid keyspaces. After that, we enable
the configuration option `rf_rack_valid_keyspaces` in the
suite by default.

Refs scylladb/scylladb#23428

Backport: backporting to 2025.1 so we can test the option there too.

Closes scylladb/scylladb#23489

* github.com:scylladb/scylladb:
  test/cqlpy: Enable rf_rack_valid_keyspaces by default
  test: Move test_alter_tablet_keyspace_rf to cluster suite
  test/cqlpy: Adjust tests to RF-rack-valid keyspaces
  test/cqlpy/cassandra_tests: Adjust to RF-rack-valid keyspaces
2025-04-15 12:43:11 +02:00
Avi Kivity
b4d4e48381 scylla-gdb: small-objects: fix for very small objects
Because of rounding and alignment, there are multiple pools for small
sizes (e.g. 4 for size 32). Because the pool selection algorithm
ignores alignment, different pools can be chosen for different object
sizes. For example, an object size of 29 will choose the first pool
of size 32, while an object size of 32 will choose the fourth pool of
size 32.

The small-objects command doesn't know about this and always considers
just the first pool for a given size. This causes it to miss out on
sister pools.

While it's possible to adjust pool selection to always choose one of the
pools, it may eat a precious cycle. So instead let's compensate in the
small-objects command. Instead of finding one pool for a given size,
find all of them, and iterate over all those pools.

Fixes #23603

Closes scylladb/scylladb#23604
2025-04-15 11:16:52 +03:00
Emil Maskovsky
3930ee8e3c raft: fix data center remaining nodes initialization
The `_remaining_nodes` attribute of the data center information was not
initialized correctly. The parameter was passed by value to the
initialization function instead of by reference or pointer.

As a result, `_remaining_nodes` was left initialized to zero, causing an
underflow when decrementing its value.

This bug did not significantly impact behavior because other safeguards,
such as capping the maximum voters per data center by the total number
of nodes, masked the issue. However, it could lead to inefficiencies, as
the remaining nodes check would not trigger correctly.

Fixes: scylladb/scylladb#23702

No backport: The bug is only present in the master branch, so no backport
is required.

Closes scylladb/scylladb#23704
2025-04-15 09:58:32 +02:00
Nadav Har'El
fbcf77d134 raft: make group0 Raft operation timeout configurable
A recent commit 370707b111 (re)introduced
a timeout for every group0 Raft operation. This timeout was set to 60
seconds, which, paraphrasing Bill Gates, "ought to be enough for anybody".

However, one of the things we do as a group0 operation is schema
changes, and we already noticed a few years ago, see commit
0b2cf21932, that in some extremely
overloaded test machines where tests run hundreds of times (!) slower
than usual, a single big schema operation - such as Alternator's
DeleteTable deleting a table and multiple of its CDC or view tables -
sometimes takes more than 60 seconds. The above fix changed the
client's timeout to wait for 300 seconds instead of 60 seconds,
but now we also need to increase our Raft timeout, or the server can
time out. We've seen this happening recently making some tests flaky
in CI (issue #23543).

So let's make this timeout configurable, as a new configuration option
group0_raft_op_timeout_in_ms. This option defaults to 60000 (i.e,
60 seconds), the same as the existing default. The test framework
overrides this default with a a higher 300 second timeout, matching
the client-side timeout.

Before this patch, this timeout was already configurable in a strange
way, using injections. But this was a misstep: We already have more
than a dozen timeouts configurable through the normal configration,
and this one should have been configured in the same way. There is
nothing "holy" about the default of 60 seconds we chose, and who
knows maybe in the future we might need to tweek it in the field,
just like we made the other timeouts tweakable. Injections cannot
be used in release mode, but configuration options can.

Fixes #23543

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23717
2025-04-15 10:57:39 +03:00
Kefu Chai
3e3f583b84 docs/dev/tombstone.md: fix a typo
s/alwas/always/

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23734
2025-04-15 10:54:42 +03:00
Avi Kivity
5e1cf90a51 build: replace tools/java submodule with packaged cassandra-stress
We no longer use tools/java (scylladb/scylla-tools-java.git) for
nodetool or cqlsh; only cassandra-stress. Since that is available
in package form install that and excise the tools/java submodule
from the source tree.

pgo/ is adjusted to use the packaged cassandra-stress (and the cqlsh
submodule).

A few jmx references are dropped as well.

Frozen toolchain regenerated.

Optimized clang from

  https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-aarch64.tar.gz
  https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-x86_64.tar.gz

Closes scylladb/scylladb#23698
2025-04-15 10:11:28 +03:00
Jenkins Promoter
9699c3ded4 Update pgo profiles - aarch64 2025-04-15 04:45:34 +03:00
Jenkins Promoter
8472aa9e53 Update pgo profiles - x86_64 2025-04-15 04:29:24 +03:00
Pavel Emelyanov
b25cb5af0c Merge 'Use named gates' from Benny Halevy
Name the gates and phased barriers we use
to make it easy to debug gate_closed_exception

Refs https://github.com/scylladb/seastar/pull/2688

* Enhancement only, no backport needed

Closes scylladb/scylladb#23329

* github.com:scylladb/scylladb:
  utils: loading_cache: use named_gate
  utils: flush_queue: use named_gate
  sstables_manager: use named gate
  sstables_loader: use named gate
  utils: phased_barrier, pluggable: use named gate
  utils: s3::client::multipart_upload: use named gate
  utils: s3::client: use named_gate
  transport: controller: use named gate
  tracing: trace_keyspace_helper: use named gate
  task_manager: module: use named gate
  topology_coordinator: use named gate
  storage_service: use named gate
  storage_proxy: wait_for_hint_sync_point: use named gate
  storage_proxy: remote: use named gate
  service: session: use named gate
  service: raft: raft_rpc: use named gate
  service: raft: raft_group0: use named gate
  service: raft: persistent_discovery: use named gate
  service: raft: group0_state_machine: use named gate
  service: migration_manager: use named gate
  replica: table: use named gate
  replica: compaction_group, storage_group: use named gate
  redis: query_processor: use named gate
  repair: repair_meta: use named gate
  reader_concurrency_semaphore: use named gate
  raft: server_impl: use named gate
  querier_cache: use named gate
  gms: gossiper: use named gate
  generic_server: use named gate
  db: sstables_format_listener: use named gate
  db: snapshot: backup_task: use named gate
  db: snapshot_ctl: use named gate
  hints: hints_sender: use named gate
  hints: manager: use named gate
  hints: hint_endpoint_manager: use named gate
  commitlog: segment_manager: use named gate
  db: batchlog_manager: use named gate
  query_processor: remote: use named gate
  compaction: compaction_state: use named gate
  alternator/server: use named_gate
2025-04-14 20:56:32 +03:00
Sergey Zolotukhin
e05c082002 Ensure raft group0 RPCs use the gossip scheduling group
Scylla operations use concurrency semaphores to limit the number
of concurrent operations and prevent resource exhaustion. The
semaphore is selected based on the current scheduling group.
For Raft group operations, it is essential to use a system semaphore to
avoid queuing behind user operations.
This commit adds a check to ensure that the raft group0 RPCs are
executed with the `gossiper` scheduling group.
2025-04-14 17:10:46 +02:00
Sergey Zolotukhin
60f1053087 Move RAFT operations verbs to GOSSIP group.
In order for RAFT operations to use the gossip system semaphore, moving RAFT
verbs to the gossip group in `do_get_rpc_client_idx`,  messaging_service.

Fixes scylladb/scylladb21637
2025-04-14 17:09:49 +02:00
Pavel Emelyanov
1bd991a111 test: Inherit sstable_assertions from sstables::test
The latter class is invented to let tests access private fields of an
sstable (mostly methods). The former is in fact an extended version of
that also does some checks. Howerver, they don't inherit from each
other, and the sstable_assertions partially duplicates some funtionality
of the test one.

Add the inheritance, remove the duplicated methods from the child class,
update the callers (the test class returns future<>s, the assertions one
"knows" it runs in seastar thread) and marm sstable::read_toc() private.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23697
2025-04-14 13:45:14 +03:00
Kefu Chai
b3f709bed7 s3: remove an extraneous space
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23714
2025-04-14 13:02:58 +03:00
Michał Chojnowski
6e2795a843 Update seastar submodule
* seastar ed8952fb...099cf616 (10):
  > reactor: Disable hot polling if wakeup granularity is too high
  > smp: add shard_to_numa_node_mapping()
  > tests/unit/httpd_test: fix the handling of NUL bytes in the parser
  > fstream: skip allocation in no write_behinds case
  > `http`: add `xml` support to `http::mime_types::mappings`
  > Print incrementally in sigsegv handler
  > reactor: use 0x for hex addresses
  > tls: Make session resume key shared across credentials builders creds
  > build: fix CMAKE_REQUIRED_FLAGS format for sanitizer detection
  > reactor: Remove sched_debug() related code

Closes scylladb/scylladb#23703
2025-04-14 12:54:19 +03:00
Andrei Chekun
8e33d7ab81 test.py: Make the testpy log files in pytest follow the same format
Fix the incorrect log file names between conftest and scylla_manager.
This regression issue, was introduced in #22960.

Currently, scylla manager will output it's logs to the file with the
next pattern:
suite_name.path_to_the_test_file_with_subfolders.run_id.function_name.mode.run_id_cluster.log
On the same time pytest will try to find this log with next name:
suite_name.file_name_without_subfolders_path.py.run_id.function_name.mode.run_id_cluster.log

This inconsistency leads to the situation when the test failed, scylla
manager log file will not be copied to the failed_test directory and
test will have exception on teardown.

Closes scylladb/scylladb#23596
2025-04-14 12:52:48 +03:00
Evgeniy Naydanov
d6b64642c5 test.py: print out path to Scylla log for Python test suites
Test suites with `type: Python` are using single Scylla node
created by test.py, but it's handy to print a path to a log
file in pytest log too to make it easier to find the file
on failures.

Closes scylladb/scylladb#23683
2025-04-14 11:15:37 +03:00
Kefu Chai
69de816b1b scylla-gdb.py: fix a typo in gdb command description
replace "runnign" with "running".

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23716
2025-04-14 10:59:21 +03:00
Benny Halevy
8d7e4d6c36 utils: loading_cache: use named_gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:47:09 +03:00
Benny Halevy
46f2a24772 utils: flush_queue: use named_gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:47:02 +03:00
Benny Halevy
d665bb4f8b sstables_manager: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:47:00 +03:00
Benny Halevy
7969293dcf sstables_loader: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:47:00 +03:00
Benny Halevy
e1fe82ed33 utils: phased_barrier, pluggable: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:47:00 +03:00
Benny Halevy
d3f498ae59 utils: s3::client::multipart_upload: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:47:00 +03:00
Benny Halevy
eea83464c7 utils: s3::client: use named_gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:46:51 +03:00
Benny Halevy
79e967e2f5 transport: controller: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:29:48 +03:00
Benny Halevy
3d87b67d0e tracing: trace_keyspace_helper: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:29:48 +03:00
Benny Halevy
bfdd8a98ca task_manager: module: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:29:48 +03:00
Benny Halevy
5e864b6277 topology_coordinator: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:29:46 +03:00
Benny Halevy
a67ed59399 storage_service: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:49 +03:00
Benny Halevy
39f1175451 storage_proxy: wait_for_hint_sync_point: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:49 +03:00
Benny Halevy
e228a112fe storage_proxy: remote: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:49 +03:00
Benny Halevy
0a1e7de6ea service: session: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:49 +03:00
Benny Halevy
747446cb25 service: raft: raft_rpc: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:49 +03:00
Benny Halevy
01bb3980fc service: raft: raft_group0: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:49 +03:00
Benny Halevy
6118150d44 service: raft: persistent_discovery: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:49 +03:00
Benny Halevy
e430df6332 service: raft: group0_state_machine: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:49 +03:00
Benny Halevy
5f8b5724e6 service: migration_manager: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:49 +03:00
Benny Halevy
7342a57cbb replica: table: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:49 +03:00
Benny Halevy
52e1ce7f0d replica: compaction_group, storage_group: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:49 +03:00
Benny Halevy
aff6017e83 redis: query_processor: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:49 +03:00
Benny Halevy
80b5089d0c repair: repair_meta: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:49 +03:00
Benny Halevy
679e73053f reader_concurrency_semaphore: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
9724d87e86 raft: server_impl: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
5780599eec querier_cache: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
cecfb6dfd7 gms: gossiper: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
bc69bc3de7 generic_server: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
5a71763d75 db: sstables_format_listener: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
da492231df db: snapshot: backup_task: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
edf497c170 db: snapshot_ctl: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
c5d7272393 hints: hints_sender: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
1c1adb3d60 hints: manager: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
4c475a1905 hints: hint_endpoint_manager: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
bdd5a61139 commitlog: segment_manager: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
0672c9da5c db: batchlog_manager: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
f8d5835cab query_processor: remote: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
747ae5e1c4 compaction: compaction_state: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Benny Halevy
879811e0d2 alternator/server: use named_gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Dawid Mędrek
be0877ce69 test/cqlpy: Enable rf_rack_valid_keyspaces by default
All of the tests in the suite have been adjusted so they only
use RF-rack-valid keyspaces, so let's start enabling the option
by default.
2025-04-11 14:55:13 +02:00
Dawid Mędrek
a59842257a test: Move test_alter_tablet_keyspace_rf to cluster suite
We move the test `test_alter_tablet_keyspace_rf` from the cqlpy to the
cluster test suite. The reason behind the change is that the test cannot
be run with `rf_rack_valid_keyspaces` turned on in the configuration.
During the test, we make the keyspace RF-rack-invalid multiple times.
Since RF-rack-validity is a very strong constraint, adjust the test
otherwise is impossible.

By moving it to the cluster test suite, we're able to change the
configuration of the node used in the test, and so the test can work
again.
2025-04-11 14:55:11 +02:00
Dawid Mędrek
958eaec056 test/cqlpy: Adjust tests to RF-rack-valid keyspaces 2025-04-11 14:55:04 +02:00
Dawid Mędrek
6bde01bb59 test/cqlpy/cassandra_tests: Adjust to RF-rack-valid keyspaces
We adjust three existing Cassandra tests so that they don't create
RF-rack-invalid keyspaces. We modify the replication factor used
in the problematic tests. The changes don't affect the tests as
the value of the RF is unrelated to what they verify. Thanks to
that, we can run them now even with enforced RF-rack-valid keyspaces.

The drawback is that the modified ALTER statements do not modify
the RF at all. However, since the tests seem to verify that the code
responsible for VALIDATING a request works as intended, that should
have little to no impact on them.
2025-04-11 14:20:14 +02:00
Dawid Mędrek
10589e966f test/cluster/mv: Adjust test to RF-rack-valid keyspaces
We adjust the test in the directory so that all of the used
keyspaces are RF-rack-valid throughout the their execution.

Refs scylladb/scylladb#23428

Closes scylladb/scylladb#23490
2025-04-11 14:03:21 +02:00
Karol Baryła
df64985a4e Docs: Describe driver issue with tablet RF increase
Current protocol extension that sends tablet info to drivers only does
that if the driver selects a non-replica coordinator for a routable
request. It works well if some node on the replica list is replaced by
other node, or if some replicas are removed from the list. Driver will
at some point send a request to stale replica, and receive new list in
response.

The issue is with extending the list with new replicas. In that case old
replicas are all still correct, so driver will not select any wrong
replica, and will not receive the new list. As far as I know that only
scenario where this could happen is RF increase.

It could be to some degree worked around in the drivers, but it would
add significant complexity (definitely more than any other invalidations
we introduced) while still not being ideal solution. This scenario
should be rare enough, and the consequences of not handling it minor
enough (new replicas not being used as coordinators) that it does not
warrant driver-side solution. Instead this commit adds info about this
to documentation, advising users to restart applications after replica
lists are extended.

It is worth noting that if new tablet feedback protocol extension is
implemented then this problem goes away. See issue #21664.

Closes scylladb/scylladb#23447
2025-04-11 13:48:40 +02:00
David Garcia
cf11d5eb69 fix: openapi not rendering in docs.scylladb.com/manual
Closes scylladb/scylladb#23686
2025-04-10 17:47:58 +03:00
Patryk Jędrzejczak
07a7a75b98 Merge 'raft: implement the limited voters feature' from Emil Maskovsky
Currently if raft is enabled all nodes are voters in group0. However it is not necessary to have all nodes to be voters - it only slows down the raft group operation (since the quorum is large) and makes deployments with asymmetrical DCs problematic (2 DCs with 5 nodes along 1 DC with 10 nodes will lose the majority if large DC is isolated).

The topology coordinator will now maintain a state where there are only limited number of voters, evenly distributed across the DCs and racks.

After each node addition or removal the voters are recalculated and rebalanced if necessary. That means:
* When a new node is added, it might become a voter depending on the current distribution of voters - either if there are still some voter "slots" available, or if the new node is a better candidate than some existing voter (in which case the existing node voter status might be revoked).
* When a voter node is removed or stopped (shut down), its voter status is revoked and another node might become a voter instead (this can also depend on other circumstances, like e.g. changing the number of DCs).
* If a node addition or removal causes a change in number of data centers (DCs) or racks, the rebalance action might become wider (as there are some special rules applying to 1 vs 2 vs more DCs, also changing the number of racks might cause similar effects in the voters distribution)

Special conditions for various number of DCs:
* 1 DC: Can have up to the maximum allowed number of voters (5 - see below)
* 2 DCs: The distribution of the voters will be asymmetric (if possible), meaning that we can tolerate a loss of the DC with the smaller number of voters (if both would have the same number of voters we'd lose majority if any of the DCs is lost). For example, if we have 2 DCs with 2 nodes each, one of them will only have 1 voter (despite the limit of 5). Also, if one of the 2 DCs has more racks than the other and the node count allows it, the DC with the more racks will have more voters.
* 3 and more DCs: The distribution of the voters will be so that every DC has strictly less than half of the total voters (so a loss of any of the DCs cannot lead to the majority loss). Again, DCs with more racks are being preferred in the voter distribution.

At the moment we will be handling the zero-token nodes in the same way as the regular nodes (i.e. the zero-token nodes will not take any priority in the voter distribution). Technically it doesn't make much sense to have a zero-token node that is not a voter (when there are regular nodes in the same DC being voters), but currently the intended purpose of zero-token nodes is to form an "arbiter DC" (in case of 2 DCs, creating a third DC with zero-token nodes only), so for that intended purpose no special handling is needed and will work out of the box. If a preference of zero token nodes will eventually be needed/requested, it will be added separately from this PR.

The maximum number of voters of 5 has been chosen as the smallest "safe" value. We can lose majority when multiple nodes (possibly in different dcs and racks) die independently in a short time span. With less than 5 voters, we would lose majority if 2 voters died, which is very unlikely to happen but not entirely impossible. With 5 voters, at least 3 voters must die to lose majority, which can be safely considered impossible in the case of independent failures.

Currently the limit will not be configurable (we might introduce configurable limits later if that would be needed/requested).

Tests added:
* boost/group0_voter_registry_test.cc: run time on CI: ~3.5s
* topology_custom/test_raft_voters.py: parametrized with 1 or 3 nodes per DC, the run time on CI: 1: ~20s. 3: ~40s, approx 1 min total

Fixes: scylladb/scylladb#18793

No backport: This is a new feature that will not be backported.

Closes scylladb/scylladb#21969

* https://github.com/scylladb/scylladb:
  raft: distribute voters by rack inside DC
  raft/test: fix lint warnings in `test_raft_no_quorum`
  raft/test: add the upgrade test for limited voters feature
  raft topology: handle on_up/on_down to add/remove node from voters
  raft: fix the indentation after the limited voters changes
  raft: implement the limited voters feature
  raft: drop the voter removal from the decommission
  raft/test: disable the `stop_before_becoming_raft_voter` test
  raft/test: stop the server less gracefully in the voters test
2025-04-10 15:29:15 +02:00
Avi Kivity
9559e53f55 Merge 'Adjust tablet-mon.py for capacity-aware load balancing' from Tomasz Grabiec
After load-balancer was made capacity-aware it no longer equalizes tablet count per shard, but rather utilization of shard's storage. This makes the old presentation mode not useful in assessing whether balance was reached, since nodes with less capacity will get fewer tablets when in balanced state. This PR adds a new default presentation mode which scales tablet size by its storage utilization so that tablets which have equal shard utilization take equal space on the graph.

To facilitate that, a new virtual table was added: system.load_per_node, which allows the tool to learn about load balancer's view on per-node capacity. It can also serve as a debugging interface to get a view of current balance according to the load-balancer.

Closes scylladb/scylladb#23584

* github.com:scylladb/scylladb:
  tablet-mon.py: Add presentation mode which scales tablet size by its storage utilization
  tablet-mon.py: Center tablet id text properly in the vertical axis
  tablet-mon.py: Show migration stage tag in table mode only when migrating
  virtual-tables: Introduce system.load_per_node
  virtual_tables: memtable_filling_virtual_table: Propagate permit to execute()
  docs: virtual-tables: Fix instructions
  service: tablets: Keep load_stats inside tablet_allocator
2025-04-10 14:59:08 +03:00
Avi Kivity
885838fc46 Merge 'scylla-gdb.py: improve scylla repairs command' from Botond Dénes
Make output more readable by:
* group follower/master repair instances separately
* split repair details into one line for repair summary, then one line for each host info
* add indentation to make the output easier to follow

Also add `-m|--memory` option to calculate memory usage of repair buffers.

Example output:

    (gdb) scylla repairs -m
    Repairs for which this node is leader:
      (repair_meta*) 0x60503ab7f7b0: {id: 19197, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 30, memory: 48208512}, same_shard: True, tablet: False}
        host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished
        host: ce4413ab-33d9-40f8-b13e-d14af8511dda, shard: 4294967295, state: repair_state::put_row_diff_with_rpc_stream_started
      (repair_meta*) 0x60503717f7b0: {id: 19211, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 28, memory: 63863265}, same_shard: True, tablet: False}
        host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished
        host: c4936a19-41da-4260-971e-651445d740fd, shard: 4294967295, state: repair_state::get_row_diff_with_rpc_stream_finished
      (repair_meta*) 0x60502ddff7b0: {id: 19231, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 0, memory: 0}, same_shard: True, tablet: False}
        host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::row_level_stop_started
        host: 039494b6-9d35-4f34-82c4-3c79c1d97175, shard: 4294967295, state: repair_state::row_level_stop_finished
      (repair_meta*) 0x60501db3f7b0: {id: 19234, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 0, memory: 0}, same_shard: True, tablet: False}
        host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_sync_boundary_started
        host: 039494b6-9d35-4f34-82c4-3c79c1d97175, shard: 4294967295, state: repair_state::get_sync_boundary_finished
      (repair_meta*) 0x60501c81f7b0: {id: 19236, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 28, memory: 42696821}, same_shard: True, tablet: False}
        host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished
        host: ce4413ab-33d9-40f8-b13e-d14af8511dda, shard: 4294967295, state: repair_state::put_row_diff_with_rpc_stream_started
      (repair_meta*) 0x60503f65f7b0: {id: 19238, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 28, memory: 47785163}, same_shard: True, tablet: False}
        host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished
        host: ce4413ab-33d9-40f8-b13e-d14af8511dda, shard: 4294967295, state: repair_state::get_row_diff_with_rpc_stream_finished
    Repairs for which this node is follower:

Closes scylladb/scylladb#23075

* github.com:scylladb/scylladb:
  scylla-gdb.py: improve scylla repairs commadn
  scylla-gdb.py: seastar_lw_shared_ptr: add __nonzero__ and __bool__
  scylla-gdb.py: introduce managed_bytes
2025-04-10 14:52:43 +03:00
Dani Tweig
e92740cc2b .github: update bug_report.yml
Perform a yaml "face lift" on the old bug report md template, making bug reporting more efficient.

- Add dedicated textarea fields for problem description and expected behavior
- Include pre-filled placeholders to guide issue reporting
- Add formatted log output section with shell syntax highlighting

Closes: #21532
2025-04-10 14:26:00 +03:00
Pavel Emelyanov
88318d3b50 topology_coordinator: Use shorter fault-injection overloads
There are few places that want to pause until a message is received from
the test. There's a convenience one-line suger to do it.

One test needs update its expectations about log message that appears
when scylle steps on it and actually starts waiting.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23390
2025-04-10 14:05:46 +03:00
Botond Dénes
d67202972a mutation/frozen_mutation: frozen_mutation_consumer_adaptor: fix end-of-partition handling
This adaptor adapts a mutation reader pausable consumer to the frozen
mutation visitor interface. The pausable consumer protocol allows the
consumer to skip the remaining parts of the partition and resume the
consumption with the next one. To do this, the consumer just has to
return stop_iteration::yes from one of the consume() overloads for
clustering elements, then return stop_iteration::no from
consume_end_of_partition(). Due to a bug in the adaptor, this sequence
leads to terminating the consumption completely -- so any remaining
partitions are also skipped.

This protocol implementation bug has user-visible effects, when the
only user of the adaptor -- read repair -- happens during a query which
has limitations on the amount of content in each partition.
There are two such queries: select distinct ... and select ... with
partition limit. When converting the repaired mutation to to query
result, these queries will trigger the skip sequence in the consumer and
due to the above described bug, will skip the remaining partitions in
the results, omitting these from the final query result.

This patch fixes the protocol bug, the return value of the underlying
consumer's consume_end_of_partition() is now respected.

A unit test is also added which reproduces the problem both with select
distinct ... and select ... per partition limit.

Follow-up work:
* frozen_mutation_consumer_adaptor::on_end_of_partition() calls the
  underlying consumer's on_end_of_stream(), so when consuming multiple
  frozen mutations, the underlying's on_end_of_stream() is called for
  each partition. This is incorrect but benign.
* Improve documentation of mutation_reader::consume_pausable().

Fixes: #20084

Closes scylladb/scylladb#23657
2025-04-10 13:19:57 +03:00
Pavel Emelyanov
4de48a9d24 encryption: Mark parts of encrypted_data_sink private
Nowadays the whole class is public, but it's not in fact such.
Remove the SUDDENLY unused private _flush_pos member to please the
compiler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23677
2025-04-10 12:42:57 +03:00
Dawid Mędrek
0ed21d9cc1 test/cluster/test_tablets.py: Fix test errorneous indentation
Some of the statements in the test are not indented properly
and, as a result, are never run. It's most likely a small mistake,
so let's fix it.

Closes scylladb/scylladb#23659
2025-04-10 11:06:01 +03:00
Nadav Har'El
258213f73b Merge 'Alternator batch count histograms' from Amnon Heiman
This series adds a histogram for get and write batch sizes.
It uses the estimated_histogram implementation which starts from 1 with 1.2 exponential factor, which works
extremely tight to 20 but still covers all the way to 100.

Histograms will be reported per node.

**Backport to 2025.1 so we'll have information about user batch size limitation**

Closes scylladb/scylladb#23379

* github.com:scylladb/scylladb:
  alternator: Add tests for the batch items histograms
  alternator: Add histogram for batch item count
2025-04-09 22:41:14 +03:00
Tomasz Grabiec
b5211cca85 Merge 'tablets: rebuild: use repair for tablet rebuild' from Aleksandra Martyniuk
Currently, when we rebuild a tablet, we stream data from all
replicas. This creates a lot of redundancy, wastes bandwidth
and CPU resources.

In this series, we split the streaming stage of tablet rebuild into
two phases: first we stream tablet's data from only one replica
and then repair the tablet.

Fixes: https://github.com/scylladb/scylladb/issues/17174.

Needs backport to 2025.1 to prevent out of space during streaming

Closes scylladb/scylladb#23187

* github.com:scylladb/scylladb:
  test: add test for rebuild with repair
  locator: service: move to rebuild_v2 transition if cluster is upgraded
  locator: service: add transition to rebuild_repair stage for rebuild_v2
  locator: service: add rebuild_repair tablet transition stage
  locator: add maybe_get_primary_replica
  locator: service: add rebuild_v2 tablet transition kind
  gms: add REPAIR_BASED_TABLET_REBUILD cluster feature
2025-04-09 21:35:37 +02:00
Avi Kivity
ed3e4f33fd Merge 'generic_server: throttle and shed incoming connections according to semaphore limit' from Marcin Maliszkiewicz
Adds new live updatable config: uninitialized_connections_semaphore_cpu_concurrency.

It should help to reduce cpu usage by limiting cpu concurrency for new connections.  As a last resort when those connections are waiting for initial processing too long (over 1m) they are shed.

New connections_shed and connections_blocked metrics are added for tracking.

Testing:
 - manually via simple program creating high number of connection and constantly re-connecting
 - added benchmark

Following are benchmark results:

Before:
```
> build/release/test/perf/perf_generic_server --smp=1
170101.41 tps ( 13.1 allocs/op,   0.0 logallocs/op,   7.0 tasks/op,    4695 insns/op,    3178 cycles/op,        0 errors)
[...]
throughput: mean=173850.06 standard-deviation=1844.48 median=174509.66 median-absolute-deviation=874.23 maximum=175087.49 minimum=170588.54
instructions_per_op: mean=4725.59 standard-deviation=13.35 median=4729.38 median-absolute-deviation=12.49 maximum=4738.61 minimum=4709.96
  cpu_cycles_per_op: mean=3135.08 standard-deviation=32.13 median=3122.68 median-absolute-deviation=22.29 maximum=3179.38 minimum=3103.15
```

After:
```
> build/release/test/perf/perf_generic_server --smp=1
167373.19 tps ( 13.1 allocs/op,   0.0 logallocs/op,   7.0 tasks/op,    4821 insns/op,    3371 cycles/op,        0 errors)
[...]
throughput:
  mean=   171199.55 standard-deviation=2484.58
  median= 171667.06 median-absolute-deviation=2087.63
  maximum=173689.11 minimum=167904.76
instructions_per_op:
  mean=   4801.90 standard-deviation=16.54
  median= 4796.78 median-absolute-deviation=9.32
  maximum=4830.71 minimum=4789.81
cpu_cycles_per_op:
  mean=   3245.26 standard-deviation=32.28
  median= 3230.44 median-absolute-deviation=16.52
  maximum=3297.39 minimum=3215.62
```

The patch adds around 67 insns/op so it's effect on performance should be negligible.

Fixes: https://github.com/scylladb/scylladb/issues/22844

Closes scylladb/scylladb#22828

* github.com:scylladb/scylladb:
  transport: move on_connection_close into connection destructor
  test: perf: make aggregated_perf_results formatting more human readable
  transport: add blocked and shed connection metrics
  generic_server: throttle and shed incoming connections according to semaphore limit
  generic_server: add data source and sink wrappers bookkeeping network IO
  generic_server: coroutinize part of server::do_accepts
  test: add benchmark for generic_server
  test: perf: add option to count multiple ops per time_parallel iteration
  generic_server: add semaphore for limiting new connections concurrency
  generic_server: add config to the constructor
  generic_server: add on_connection_ready handler
2025-04-09 21:41:38 +03:00
Tomasz Grabiec
5b5ada1743 tablet-mon.py: Add presentation mode which scales tablet size by its storage utilization
Per-node capacity is queried from system.load_per_node

Tablet height in each node is scaled so that equal height = equal node
utilization.

The nominal height is assigned to the node which has the smallest
capacity, so nodes with higher capacity will have smaller tablets than
normal.
2025-04-09 20:21:51 +02:00
Tomasz Grabiec
217184f16b tablet-mon.py: Center tablet id text properly in the vertical axis
Was too low due to not subtracting frame size from height
2025-04-09 20:21:51 +02:00
Tomasz Grabiec
20cac72056 tablet-mon.py: Show migration stage tag in table mode only when migrating
It's the gray bar at the top of the tablet. It's not showing useful
information when tablet is not migrating.
2025-04-09 20:21:51 +02:00
Tomasz Grabiec
0b9a75d7b6 virtual-tables: Introduce system.load_per_node
Can be used to query per-node stats about load as seen by the load
balancer.

In particular, node's capacity will be used by tablet-mon.py to
scale tablet columns so that equal height is equal node utilization.
2025-04-09 20:21:51 +02:00
Tomasz Grabiec
668094dc58 virtual_tables: memtable_filling_virtual_table: Propagate permit to execute()
So that population can access read's timeout and mark the permit as awaiting.
2025-04-09 20:21:51 +02:00
Tomasz Grabiec
34beaa30b5 docs: virtual-tables: Fix instructions 2025-04-09 20:21:51 +02:00
Tomasz Grabiec
76bc11c78c service: tablets: Keep load_stats inside tablet_allocator
So that virtual tables can pick them up.

It's a better place to keep them than in topology_coordinator.
2025-04-09 20:21:51 +02:00
Pavel Emelyanov
d9853efa7c Merge '[Out-of-space prevention] db: backup: prioritize sstables that were deleted from the table' from Benny Halevy
The motivation behind this change to free up disk space as early as possible.
The reason is that snapshot locks the space of all SSTables in the snapshot,
and deleting form the table, for example, by compaction, or tablet migration,
won't free-up their capacity until they are uploaded to object storage and deleted from the snapshot.

This series adds prioritization of deleted sstables in two cases:
First, after the snapshot dir is processed, the list of SSTable generation is cross-referenced with the
list of SSTables presently in the table and any generation that is not in the table is prioritized to
be uploaded earlier.
In addition, a subscription mechanism was added to sstables_manager
and it is used in backup to prioritize SSTables that get deleted from the table directory
during backup.

This is particularly important when backup happens during high disk utilization (e.g. 90%).
Without it, even if the cluster is scaled up and tablets are migrated away from the full nodes
to new nodes, tablet cleanup might not free any space if all the tablet sstables are hardlinked to the
snapshot taken for backup.

* Enhancement, no backport needed

Closes scylladb/scylladb#23241

* github.com:scylladb/scylladb:
  db: snapshot: backup_task: prioritize sstables deleted during upload
  sstables_manager: add subscriptions
  db: snapshot: backup_task: limit concurrency
  sstables: directory_semaphore: expose get_units
  db: snapshot: backup_task: add sharded sstables_manager
  database: expose get_sstables_manager(schema)
  db: snapshot: backup_task: do_backup: prioritize sstables that are already deleted from the table
  db: snapshot-ctl: pass table_id to backup_task
  db: snapshot-ctl: expose sharded db() getter
  db: snapshot: backup_task: do_backup: organize components by sstable generation
  db: snapshot: coroutinize backup_task
  db: snapshot: backup_task: refactor backup_file out of uploads_worker
  db: snapshot: backup_task: refactor uploads_worker out of do_backup
  db: snapshot: backup_task: process_snapshot_dir: initialize total progress
  utils/s3: upload_progress: init members to 0
  db: snapshot: backup_task: do_backup: refactor process_snapshot_dir
  db: snapshot: backup_task: keep expection as member
2025-04-09 15:32:11 +03:00
Marcin Maliszkiewicz
ce18909688 transport: move on_connection_close into connection destructor
To make the code more robust by ensuring closing code is always executed.
2025-04-09 13:50:19 +02:00
Pavel Emelyanov
35dfc8c782 Merge 'audit: add semaphore to audit_syslog_storage_helper' from Andrzej Jackowski
audit_syslog_storage_helper::syslog_send_helper uses Seastar's
net::datagram_channel to write to syslog device (usually /dev/log).
However, datagram_channel.send() is not fiber-safe (ref seastar#2690),
so unserialized use of send() results in packets overwriting its state.
This, in turn, causes a corruption of audit logs, as well as assertion
failures.

To workaround the problem, a new semaphore is introduced in
audit_syslog_storage_helper. As storage_helper is a member of sharded
audit service, the semaphore allows for one datagram_channel.send() on
each shard. Each audit_syslog_storage_helper stores its own
datagram_channel, therefore concurrent sends to datagram_channel are
eliminated.

This change:
 - Moved syslog_send_helper to audit_syslog_storage_helper
 - Corutinize audit_syslog_storage_helper
 - Introduce semaphore with count=1 in audit_syslog_storage_helper.

See https://github.com/scylladb/scylla-dtest/pull/5749 for releated dtest
Fixes: scylladb#22973

Backport to 2025.1 should be considered, as https://github.com/scylladb/scylladb/issues/22973 is known to cause crashes of 2025.1.

Closes scylladb/scylladb#23464

* github.com:scylladb/scylladb:
  audit: add semaphore to audit_syslog_storage_helper
  audit: corutinize audit_syslog_storage_helper
  audit: moved syslog_send_helper to audit_syslog_storage_helper
2025-04-09 12:39:06 +03:00
Marcin Maliszkiewicz
619944555f test: perf: make aggregated_perf_results formatting more human readable
Before:
throughput: mean=170728.58 standard-deviation=1921.76 median=171084.16 median-absolute-deviation=1501.58 maximum=172913.36 minimum=167288.97
instructions_per_op: mean=4685.89 standard-deviation=12.46 median=4683.92 median-absolute-deviation=9.68 maximum=4706.53 minimum=4666.70
cpu_cycles_per_op: mean=3090.94 standard-deviation=52.69 median=3103.43 median-absolute-deviation=24.55 maximum=3192.99 minimum=3003.00

After:
throughput:
	mean=   168224.81 standard-deviation=854.48
	median= 168829.02 median-absolute-deviation=604.21
	maximum=168829.02 minimum=167620.60
instructions_per_op:
	mean=   4837.02 standard-deviation=20.89
	median= 4851.79 median-absolute-deviation=14.77
	maximum=4851.79 minimum=4822.24
cpu_cycles_per_op:
	mean=   3271.42 standard-deviation=46.29
	median= 3304.16 median-absolute-deviation=32.73
	maximum=3304.16 minimum=3238.69
2025-04-09 10:49:20 +02:00
Marcin Maliszkiewicz
599f4d312b transport: add blocked and shed connection metrics
This adds some visibility into connection storm mitigations
added in following commits.
2025-04-09 10:49:18 +02:00
Marcin Maliszkiewicz
26518704ab generic_server: throttle and shed incoming connections according to semaphore limit
If we have uninitialized_connections_semaphore_cpu_concurrency (default
2) connections being processed we start delay accepting new connections.

Connections which are in network IO state are not counted towards this
limit and they can go to cpu phase without blocking. So it can happen
that we process more concurrent new connections but that's a necessary
tradeof to make progress during storm without implementing more advanced
machinery (i.e. priority queue).
2025-04-09 10:48:51 +02:00
Marcin Maliszkiewicz
9f5de2c256 generic_server: add data source and sink wrappers bookkeeping network IO
They release semaphore units when we start network IO and acquire it
when we enter cpu intensive phase. We use consume() so it doesn't block
because we don't want connections we started processing to compete with
new incomming connections. Otherwise during connection storm we wouldn't
make much progress.

There will be a simplification here as we'll treat disc IO (if there is any)
as cpu work.
2025-04-09 10:48:42 +02:00
Marcin Maliszkiewicz
c56116372e generic_server: coroutinize part of server::do_accepts 2025-04-09 10:48:42 +02:00
Marcin Maliszkiewicz
719d04d501 test: add benchmark for generic_server
Changes in configure.py are needed becuase we don't want to embed
this benchmark in scylla binary as perf_simple_query or perf_alternator,
it doesn't directly translate to Scylla performance but we want to use
aggregated_perf_results for precise cpu measurements so we need
different dependecies.
2025-04-09 10:48:42 +02:00
Marcin Maliszkiewicz
b957cedace test: perf: add option to count multiple ops per time_parallel iteration 2025-04-09 10:30:58 +02:00
Marcin Maliszkiewicz
ed82bede39 generic_server: add semaphore for limiting new connections concurrency
It will be used in following commits.
2025-04-09 10:30:58 +02:00
Marcin Maliszkiewicz
33122d3f93 generic_server: add config to the constructor 2025-04-09 10:30:58 +02:00
Marcin Maliszkiewicz
474e84199c generic_server: add on_connection_ready handler
This patch cleans the code a bit so that ready state is set in a single place.
And adds handler which will allow adding logic when connection is made
ready, this will be added in the following commits.
2025-04-09 10:30:58 +02:00
Benny Halevy
1ab3ec061b db: snapshot: backup_task: prioritize sstables deleted during upload
subscribe on each shard's sstables_manager to get
callback notifications and keep the generation numbers
of deleted sstables in a vector so they can be prioritized
first to free up their disk space as soon as possible.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:54:07 +03:00
Benny Halevy
d8b0c661e4 sstables_manager: add subscriptions
Allow other submodules to subscribe for added/deleted
notifications.  This will be used in a later to
patch to prioritize unlinked sstables for backup.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:54:07 +03:00
Benny Halevy
d3b4874ec3 db: snapshot: backup_task: limit concurrency
Otherwise, once all the background tasks are created
we have no way to reorder the queue.

Fixes #23239

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:54:07 +03:00
Benny Halevy
e60fcc58b7 sstables: directory_semaphore: expose get_units
To be used by a following patch for
backup concurrency control.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:54:07 +03:00
Benny Halevy
b7807ec165 db: snapshot: backup_task: add sharded sstables_manager
Get a reference to the table's sstables_manager
on each shard.  This will be used be later patches
to limit concurrency and to subscribe for notifications.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:54:07 +03:00
Benny Halevy
b270d552fb database: expose get_sstables_manager(schema)
Return either the system or use sstables manager.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:54:07 +03:00
Benny Halevy
9a4b4afade db: snapshot: backup_task: do_backup: prioritize sstables that are already deleted from the table
Detect SSTables that are already deleted from the table
in process_snapshot_dir when their number_of_links is equal to 1.

Note that the SSTable may be hard-linked by more than one snapshot,
so even after it is deleted from the table, its number of links
would be greater than one.  In that case, however, uploading it
earlier won't help to free-up its capacity since it is still held
by other snapshots.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:54:07 +03:00
Benny Halevy
4b8699e278 db: snapshot-ctl: pass table_id to backup_task
To be used by the following patches to get
to the table's sstables_manager for concurrency
control and for notifications (TBD).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:54:07 +03:00
Benny Halevy
d646603bfd db: snapshot-ctl: expose sharded db() getter
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:54:07 +03:00
Benny Halevy
63bc1d4626 db: snapshot: backup_task: do_backup: organize components by sstable generation
Do not rely on the snapshot directory listing order.
This will become useful for prioritizing unlinked
sstables in a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:54:06 +03:00
Benny Halevy
a731c1b33d db: snapshot: coroutinize backup_task
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:49:53 +03:00
Benny Halevy
189075b885 db: snapshot: backup_task: refactor backup_file out of uploads_worker
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:49:53 +03:00
Benny Halevy
e3ba425c2b db: snapshot: backup_task: refactor uploads_worker out of do_backup
Let do_backup deal only with the high level coordination.
A future patch will follow this structure to run
uploads_worker on each shard.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:49:53 +03:00
Benny Halevy
ff25b4c97f db: snapshot: backup_task: process_snapshot_dir: initialize total progress
Now we can calculate advance how much data we intend to upload
before we start uploading it.

This will be used also later when uploading in parallel
on all shards, so we can collect the progress from all
shards in get_progress().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:49:51 +03:00
Benny Halevy
6da215e8af utils/s3: upload_progress: init members to 0
For default construction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:44:52 +03:00
Benny Halevy
70307e8120 db: snapshot: backup_task: do_backup: refactor process_snapshot_dir
Do preliminary listing of the snapshot dir.

While at it, simplify the loop as follows:
The optional directory_entry returned by snapshot_dir_lister.get()
can be checked as part of the loop condition expression,
and with that, error handling can be simplified and moved
out of the loop body.

A followup patch will organize the component files
by their sstable generation.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

db: snapshot: backup_task: process_snapshot_dir: simplify loop

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:44:52 +03:00
Benny Halevy
8a4b6b9614 db: snapshot: backup_task: keep expection as member
As part of refactoring do_backup().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-09 08:44:52 +03:00
Botond Dénes
b65a76ab6f Merge 'nodetool: cluster repair: add a command to repair tablet keyspaces' from Aleksandra Martyniuk
Add a new nodetool cluster super-command. Add nodetool
cluster repair command to repair tablet keyspaces.
It uses the new /storage_service/tablets/repair API.

The nodetool cluster repair command allows you to specify
the keyspace and tables to be repaired. A cluster repair of many
tables will request /storage_service/tablets/repair and wait for
the result synchronously for each table.

The nodetool repair command, which was previously used to repair
keyspaces of any type, now repairs only vnode keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/22409.

Needs backport to 2025.1 that introduces the new tablet repair API

Closes scylladb/scylladb#22905

* github.com:scylladb/scylladb:
  docs: nodetool: update repair and add tablet-repair docs
  test: nodetool: add tests for cluster repair command
  nodetool: add cluster repair command
  nodetool: repair: extract getting hosts and dcs to functions
  nodetool: repair: warn about repairing tablet keyspaces
  nodetool: repair: move keyspace_uses_tablets function
2025-04-09 08:20:34 +03:00
Botond Dénes
5f697d373f test/cqlpy/test_tools.py: use AIO backend in scylla-sstable query tests
These tests seem to be hitting the io-uring bug in the kernel from
time-to-time, making CI flaky. Force the use of the AIO backend in these
tests, as a workaround until fixed kernels (>=6.8.13) are available.

Fixes: #23517
Fixes: #23546

Closes scylladb/scylladb#23648
2025-04-08 20:29:58 +03:00
Benny Halevy
dfdca2d84e locator: topology: drop unused calculate_datacenters
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#23647
2025-04-08 19:04:56 +03:00
Tomasz Grabiec
06b49bdf69 Merge 'row_cache: don't garbage-collect tombstones which cover data in memtables' from Botond Dénes
The row cache can garbage-collect tombstones in two places:
1) When populating the cache - the underlying reader pipeline has a `compacting_reader` in it;
2) During reads - reads now compact data including garbage collection;

In both cases, garbage collection has to do overlap checks against memtables, to avoid collecting tombstones which cover data in the memtables.
This PR includes fixes for (2), which were not handled at all currently.
(1) was already supposed to be fixed, see https://github.com/scylladb/scylladb/issues/20916. But the test added in this PR showed that the test is incomplete: https://github.com/scylladb/scylladb/issues/23291. A fix for this issue is also included.

Fixes: https://github.com/scylladb/scylladb/issues/23291
Fixes: https://github.com/scylladb/scylladb/issues/23252

The fix will need backport to all live release.

Closes scylladb/scylladb#23255

* github.com:scylladb/scylladb:
  test/boost/row_cache_test: add memtable overlap check tests
  replica/table: add error injection to memtable post-flush phase
  utils/error_injection: add a way to set parameters from error injection points
  test/cluster: add test_data_resurrection_in_memtable.py
  test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts
  replica/mutation_dump: don't assume cells are live
  replica/database: do_apply() add error injection point
  replica: improve memtable overlap checks for the cache
  replica/memtable: add is_merging_to_cache()
  db/row_cache: add overlap-check for cache tombstone garbage collection
  mutation/mutation_compactor: copy key passed-in to consume_new_partition()
2025-04-08 17:26:58 +02:00
Andrzej Jackowski
c12f976389 audit: add semaphore to audit_syslog_storage_helper
audit_syslog_storage_helper::syslog_send_helper uses Seastar's
net::datagram_channel to write to syslog device (usually /dev/log).
However, datagram_channel.send() is not fiber-safe (ref seastar#2690),
so unserialized use of send() results in packets overwriting its state.
This, in turn, causes a corruption of audit logs, as well as assertion
failures.

To workaround the problem, a new semaphore is introduced in
audit_syslog_storage_helper. As storage_helper is a member of sharded
audit service, the semaphore allows for one datagram_channel.send() on
each shard. Each audit_syslog_storage_helper stores its own
datagram_channel, therefore concurrent sends to datagram_channel are
eliminated.

This change:
 - Introduce semaphore with count=1 in audit_syslog_storage_helper.
 - Added 1 hour timeout to the semaphore, so semaphore stalls are
   failed just as all other syslog auditing failures.

Fixes: scylladb#22973
2025-04-08 16:24:42 +02:00
Andrzej Jackowski
889fd5bc9f audit: corutinize audit_syslog_storage_helper
This change:
 - Corutinize audit_syslog_storage_helper::syslog_send_helper
 - Corutinize audit_syslog_storage_helper::start
 - Corutinize audit_syslog_storage_helper::write
2025-04-08 16:24:42 +02:00
Andrzej Jackowski
dbd2acd2be audit: moved syslog_send_helper to audit_syslog_storage_helper
This change:
 - Make syslog_send_helper() a method of audit_syslog_storage_helper, so
   syslog_send_helper() can access private members of
   audit_syslog_storage_helper in the next commits.
 - Remove unneeded syslog_send_helper() arguments that now are class
   members.
2025-04-08 16:24:42 +02:00
Benny Halevy
f702adf6a5 main: fix typo in tablet allocator checkpoint message
Inroduced in b6705ad48b

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#23211
2025-04-08 17:19:41 +03:00
Botond Dénes
583a813d17 docs/dev/tombstone.md: fix link to ddl.html
Closes scylladb/scylladb#23622
2025-04-08 16:18:50 +03:00
Anna Stuchlik
93a7b3ac1d doc: add enabling consistent topology updates to the 2025.1 upgrade guide-from-2024
This commit adds the procedure to enable consistent topology updates for upgrades
from 2024.1 to 2025.1 (or from 2024.2 to 2025.1 if the feature wasn't enabled
after upgrading from 2024.1 to 2024.2).

Fixes https://github.com/scylladb/scylladb/issues/23650

Closes scylladb/scylladb#23651
2025-04-08 15:38:00 +03:00
Robert Bindar
4e3eb2fdac Move direct_failure_detector from root to service/
direct_failure_detector used to be used by gms/ as well,
but that's not the case anymore, so raft/ is the only user.

Fixes #23133

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#23248
2025-04-08 13:03:24 +03:00
Aleksandra Martyniuk
372b562f5e test: add test for rebuild with repair 2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk
acd32b24d3 locator: service: move to rebuild_v2 transition if cluster is upgraded
If cluster is upgraded to version containing rebuild_v2 transition
kind, move to this transition kind instead of rebuild.
2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk
eb17af6143 locator: service: add transition to rebuild_repair stage for rebuild_v2
Modify write_both_read_old and streaming stages in rebuild_v2 transition
kind: write_both_read_old moves to rebuild_repair stage and streaming stage
streams data only from one replica.
2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk
4a847df55c locator: service: add rebuild_repair tablet transition stage
Currently, in the streaming stage of rebuild tablet transition,
we stream tablet data from all replicas.
This patch series splits the streaming stage into two phases:
- repair phase, where we repair the tablet;
- streaming phase, where we stream tablet data from one replica.

rebuild_repair is a stage that will be used to perform the repair
phase. It executes the tablet repair on tablet_info::replicas.
A primary replica out of migration_streraming_info::read_from is
the repair master. If the repair succeeds, we move to streaming
tablet transition stage, and to cleanup_target - if it fails.

The repair bypasses the tablet repair scheduler and it does not update
the repair_time.

A transition to the rebuild_repair stage will be added in the following
patches.
2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk
5d6041617b locator: add maybe_get_primary_replica
Add maybe_get_primary_replica to choose a primary replica out of
custom replica set.
2025-04-08 10:42:01 +02:00
Aleksandra Martyniuk
ed7b8bb787 locator: service: add rebuild_v2 tablet transition kind
Currently, in the streaming stage of rebuild tablet transition,
we stream tablet data from all replicas.
This patch series splits the streaming stage into two phases:
- repair phase, where we repair the tablet;
- streaming phase, where we stream tablet data from one replica.

To differentiate the two streaming methods, a new tablet transition
kind - rebuild_v2 - is added.

The transtions and stages for rebuild_v2 transition kind will be
added in the following patches.
2025-04-08 10:42:01 +02:00
Aleksandra Martyniuk
b80e957a40 gms: add REPAIR_BASED_TABLET_REBUILD cluster feature 2025-04-08 10:42:01 +02:00
Aleksandra Martyniuk
9769d7a564 docs: nodetool: update repair and add tablet-repair docs 2025-04-08 09:13:14 +02:00
Aleksandra Martyniuk
02fb71da42 test: nodetool: add tests for cluster repair command 2025-04-08 09:13:14 +02:00
Aleksandra Martyniuk
8bbc5e8923 nodetool: add cluster repair command
Add a new nodetool cluster repair command that repairs tablet keyspaces.

Users may specify keyspace and tables that they want to repair.
If the keyspace and tables are not specified, all tablet keyspaces
are repaired.

The command calls the new tablet repair API /storage_service/tablets/repair.
2025-04-08 09:13:14 +02:00
Aleksandra Martyniuk
aa3973c850 nodetool: repair: extract getting hosts and dcs to functions 2025-04-08 09:13:14 +02:00
Aleksandra Martyniuk
b81c81c7f4 nodetool: repair: warn about repairing tablet keyspaces
Warn about an attempt to repair tablet keysapce with nodetool repair.

A nodetool cluster repair command to repair tablet keyspaces will
be added in the following patches.
2025-04-08 09:13:14 +02:00
Aleksandra Martyniuk
cbde835792 nodetool: repair: move keyspace_uses_tablets function 2025-04-08 09:13:14 +02:00
Yaron Kaikov
2dc7ea366b .github: Make "make-pr-ready-for-review" workflow run in base repo
in 57683c1a50 we fixed the `token` error,
but removed the checkout part which causing now the following error
```
failed to run git: fatal: not a git repository (or any of the parent directories): .git
```
Adding the repo checkout stage to avoid such error

Fixes: https://github.com/scylladb/scylladb/issues/22765

Closes scylladb/scylladb#23641
2025-04-08 09:30:18 +03:00
Raphael S. Carvalho
0f59deffaa replica: Fix truncate and drop table after tablet migration happens
When running those operations after a tablet replica is migrated away from
a shard, an assert can fail resulting in a crash.

Status quo (around the assert in truncate procedure):

1) Highest RP seen by table is saved in low_mark, and the current time in
low_mark_at.
2) Then compaction is disabled in order to not mix data written before truncate,
and data written later.
3) Then memtable is flushed in order for the data written before truncate to be
available in sstables and then removed.
4) Now, current time is saved in truncated_at, which is supposedly the time of
truncate to decide which sstables to remove.

Note: truncated_at is likely above low_mark_at due to steps 2 and 3.

The interesting part of the assert is:
    (truncated_at <= low_mark_at ? rp <= low_mark : low_mark <= rp)

Note: RP in the assert above is the highest RP among all sstables generated
before truncated_at. RP is retrieved by table::discard_sstables().

If truncated_at > low_mark_at, maybe newer data was written during steps 2 and
3, and memtable's RP becomes greater than low_mark, resulting in a SSTable with
RP > low_mark.
So assert's 2nd condition is there to defend against the scenario above.

truncated_at and low_mark_at uses millisecond granularity, so even if
truncated_at == low_mark_at, data could have been written in steps 2 and 3
(during same MS window), failing the assert. This is fragile.

Reproducer:

To reproduce the problem, truncated_at must be > low_mark_at, which can easily
happen with both drop table and truncate due to steps 2 and 3.

If a shard has 2 or more tablets, the table's highest RP refer to just one
tablet in that shard.
If the tablet with the highest RP is migrated away, then the sstables in that
shard will have lower RP than the recorded highest RP (it's a table wide state,
which makes sense since CL is shared among tablets).

So when either drop table or truncate runs, low_mark will be potentially bigger
than highest RP retrieved from sstables.

Proposed solution:

The current assert is hacked to not fail if writes sneak in, during steps 2 and
3, but it's still fragile and seems not to serve its real purpose, since it's
allowing for RP > low_mark.

We should be able to say that low_mark >= RP, as a way of asserting we're not
leaving data targeted by truncate behind (or that we're not removing the wrong
data).

But the problem is that we're saving low_mark in step 1, before preparation
steps (2 and 3). When truncated_at is recorded in step 4, it's a way of saying
all data written so far is targeted for removal. But as of today, low_mark
refers to all data written up to step 1. So low_mark is now only one set
before issuing flush, and also accounts for all potentially flushed data.

Fixes #18059.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#23560
2025-04-08 07:32:58 +03:00
Botond Dénes
0d39091df2 test/boost/row_cache_test: add memtable overlap check tests
Similar to test/cluster/test_data_resurrection_in_memtable.py but works
on a single node and uses more low-level mechanism. These tests can also
reproduce more advanced scenarios, like concurrent reads, with some
reading from flushed memtables.
2025-04-08 00:11:36 -04:00
Botond Dénes
6c1f6427b3 replica/table: add error injection to memtable post-flush phase
After the memtable was flushed to disk, but before it is merged to
cache. The injection point will only active for the table specified in
the "table_name" injection parameter.
2025-04-08 00:11:36 -04:00
Botond Dénes
f7938e3f8b utils/error_injection: add a way to set parameters from error injection points
With this, now it is possible to have two-way communication between
the error injection point and its enabler. The test can enable the error
injection point, then wait until it is hit, before proceedin.
2025-04-08 00:11:36 -04:00
Botond Dénes
34b18d7ef4 test/cluster: add test_data_resurrection_in_memtable.py
Reproducers for #23252 and #23291 -- cache garbage
collecting tombstones resurrecting data in the memtable.
2025-04-08 00:11:36 -04:00
Botond Dénes
e5afd9b5fb test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts
Such that a given index in the return hosts refers to the same
underlying Scylla instance, as the same index in the passed-in nodes
list. This is what users of this method intuitively expect, but
currently the returned hosts list is unordered (has random order).
2025-04-08 00:11:36 -04:00
Botond Dénes
df09b3f970 replica/mutation_dump: don't assume cells are live
Currently the dumper unconditionally extracts the value of atomic cells,
assuming they are live. This doesn't always hold of course and
attempting to get the value of a dead cell will lead to marshalling
errors. Fix by checking is_live() before attempting to get the cell
value. Fix for both regular and collection cells.
2025-04-08 00:11:36 -04:00
Botond Dénes
cb76cafb60 replica/database: do_apply() add error injection point
So writes (to user tables) can be failed on a replica, via error
injection. Should simplify tests which want to create differences in
what writes different replicas receive.
2025-04-08 00:11:35 -04:00
Botond Dénes
d126ea09ba replica: improve memtable overlap checks for the cache
The current memtable overlap check that is used by the cache
-- table::get_max_purgeable_fn_for_cache_underlying_reader() -- only
checks the active memtable, so memtables which are either being flushed
or are already flushed and also have active reads against them do not
participate in the overlap check.
This can result in temporary data resurrection, where a cache read can
garbage-collect a tombstone which still covers data in a flushing or
flushed memtable, which still have active read against it.

To prevent this, extend the overlap check to also consider all of the
memtable list. Furthermore, memtable_list::erase() now places the removed
(flushed) memtable in an intrusive list. These entries are alive only as
long as there are readers still keeping an `lw_shared_ptr<memtable>`
alive. This list is now also consulted on overlap checks.
2025-04-08 00:11:35 -04:00
Botond Dénes
7e600a0747 replica/memtable: add is_merging_to_cache()
And set it when the memtable is merged to cache.
2025-04-08 00:11:35 -04:00
Botond Dénes
6b5b563ef7 db/row_cache: add overlap-check for cache tombstone garbage collection
The cache should not garbage-collect tombstone which cover data in the
memtable. Add overlap checks (get_max_purgeable) to garbage collection
to detect tombstones which cover data in the memtable and to prevent
their garbage collection.
2025-04-08 00:11:35 -04:00
Botond Dénes
c2518cdf1a mutation/mutation_compactor: copy key passed-in to consume_new_partition()
This doesn't introduce additional work for single-partition queries: the
key is copied anyway on consume_end_of_stream().
Multi-partition reads and compaction are not that sensitive to
additional copy added.

This change fixes a bug in the compacting_reader: currently the reader
passes _last_uncompacted_partition_start.key() to the compactor's
consume_new_partition(). When the compactor emits enough content for this
partition, _last_uncompacted_partition_start is moved from to emit the
partition start, this makes the key reference passed to the compaction
corrupt (refer to moved-from value). This in turn means that subsequent
GC checks done by the compactor will be done with a corrupt key and
therefore can result in tombstone being garbage-collected while they
still cover data elsewhere (data resurrection).

The compacting reader is violating the API contract and normally the bug
should be fixed there. We make an exception here because doing the fix
in the mutation compactor better aligns with our future plans:
* The fix simplifies the compactor (gets rid of _last_dk).
* Prepares the way to get rid of the consume API used by the compactor.
2025-04-08 00:11:35 -04:00
Avi Kivity
8d2a41db82 Merge "Fixes for gossiper conversion to host id" from Gleb
"
The series contains fixes to gossiper conversion to host id. There are
two fixes where we could erroneously send outdated entry in a gossiper
message and a fix for force_remove_endpoint which was not converted to
work on host id and this caused it to not delete the entry in some cases
(in replace with the same ip case).
"

* 'gleb/host-id-fixes' of github.com:scylladb/scylla-dev:
  gossiper: send newest entry in a digest message
  gossiper: change make_random_gossip_digest to return value instead of modifying passed parameter
  gossiper: move force_remove_endpoint to work on host id
  gossiper: do not send outdated endpoint in gossiper round
2025-04-07 17:04:28 +03:00
dependabot[bot]
a899cae158 build(deps): bump sphinx-scylladb-theme from 1.8.5 to 1.8.6 in /docs
Bumps [sphinx-scylladb-theme](https://github.com/scylladb/sphinx-scylladb-theme) from 1.8.5 to 1.8.6.
- [Release notes](https://github.com/scylladb/sphinx-scylladb-theme/releases)
- [Commits](https://github.com/scylladb/sphinx-scylladb-theme/compare/1.8.5...1.8.6)

---
updated-dependencies:
- dependency-name: sphinx-scylladb-theme
  dependency-version: 1.8.6
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Closes scylladb/scylladb#23537
2025-04-07 13:42:19 +03:00
Emil Maskovsky
76ceaf129b raft: distribute voters by rack inside DC
Distribute the voters evenly across racks in the datacenters.

When distributing the voters across datacenters, the datacenters with
more racks will be preferred in case of a tie. Also, in case of
asymmetric voter distribution (2 DCs), the DC with more racks will have
more voters (if the node counts allow it).

In case of a single datacenter, the voters will be distributed across
racks evenly (in the similar manner as done for the whole datacenters).

The intention is that similar to losing a datacenter, we want to avoid
losing the majority if a rack goes down - so if there are multiple racks,
we want to distribute the voters across them in such a way that losing
the whole rack will not cause the majority loss (if possible).
2025-04-07 12:31:37 +02:00
Emil Maskovsky
831fae4bff raft/test: fix lint warnings in test_raft_no_quorum
Code cleanup - fixed lint warnings in `test_raft_no_quorum` test.
2025-04-07 12:31:37 +02:00
Emil Maskovsky
92f6662cd1 raft/test: add the upgrade test for limited voters feature
We test the upgrade scenario of the limited voters feature - first we
start the cluster with the limited voters feature disabled ("old code"),
then we upgrade the cluster to the version with the limited voters
feature enabled ("new code").

The nodes are being upgraded one by one and we test that the cluster
still works (doesn't e.g. lose the majority).
2025-04-07 12:31:37 +02:00
Emil Maskovsky
a740623fa1 raft topology: handle on_up/on_down to add/remove node from voters
Adding and removing the voters based on the node up/down events.

This improves the availability of the system by automatically
adjusting the number of voters in the system to use the alive nodes in
precedence.

We can then also drop the voter removal from the `write_both_read_old`
to further simplify the code - the node will be removed from the voters
when it goes down. However we only can do that in case the feature is
enabled.
2025-04-07 12:31:37 +02:00
Emil Maskovsky
dc6afd47b7 raft: fix the indentation after the limited voters changes
Fix the indentation that needs to be changed because of the added condition.

This is done separately to make it easier to review the main commit with
the functional changes.
2025-04-07 12:31:37 +02:00
Emil Maskovsky
1d06ea3a5a raft: implement the limited voters feature
Currently if raft is enabled all nodes are voters in group0. However it
is not necessary to have all nodes to be voters - it only slows down
the raft group operation (since the quorum is large) and makes
deployments with asymmetrical DCs problematic (2 DCs with 5 nodes along
1 DC with 10 nodes will lose the majority if large DC is isolated).

The topology coordinator will now maintain a state where there are only
limited number of voters, evenly distributed across the DCs and racks.

After each node addition or removal the voters are recalculated and
rebalanced if necessary. That means:
* When a new node is added, it might become a voter depending on the
  current distribution of voters - either if there are still some voter
  "slots" available, or if the new node is a better candidate than some
  existing voter (in which case the existing node voter status might be
  revoked).
* When a voter node is removed or stopped (shut down), its voter status
  is revoked and another node might become a voter instead (this can also
  depend on other circumstances, like e.g. changing the number of DCs).
* If a node addition or removal causes a change in number of datacenters
  (DCs) or racks, the rebalance action might become wider (as there are
  some special rules applying to 1 vs 2 vs more DCs, also changing the
  number of racks might cause similar effects in the voters distribution)

Special conditions for various number of DCs:
* 1 DC: Can have up to the maximum allowed number of voters (5 - see below)
* 2 DCs: The distribution of the voters will be asymmetric (if possible),
  meaning that we can tolerate a loss of the DC with the smaller number
  of voters (if both would have the same number of voters we'd lose the
  majority if any of the DCs is lost).
  For example, if we have 2 DCs with 2 nodes each, one of them will only
  have 1 voter (despite the limit of 5). Also, if one of the 2 DCs has
  more racks than the other and the node count allows it, the DC with
  the more racks will have more voters.
* 3 and more DCs: The distribution of the voters will be so that every
  DC has strictly less than half of the total voters (so a loss of any
  of the DCs cannot lead to the majority loss). Again, DCs with more
  racks are being preferred in the voter distribution.

At the moment we will be handling the zero-token nodes in the same way
as the regular nodes (i.e. the zero-token nodes will not take any
priority in the voter distribution). Technically it doesn't make much
sense to have a zero-token node that is not a voter (when there are
regular nodes in the same DC being voters), but currently the intended
purpose of zero-token nodes is to form an "arbiter DC" (in case of 2 DCs,
creating a third DC with zero-token nodes only), so for that intended
purpose no special handling is needed and will work out of the box.
If a preference of zero token nodes will eventually be needed/requested,
it will be added separately from this PR.

Currently the voter limits will not be configurable (we might introduce
configurable limits later if that would be needed/requested).

The feature is enabled by the `group0_limited_voters` feature flag
to avoid issues with cluster upgrade (the feature will be only enabled
once all nodes in the cluster are upgraded to the version supporting
the feature).

Fixes: scylladb/scylladb#18793
2025-04-07 12:31:18 +02:00
Lakshmi Narayanan Sreethar
750f4baf44 replica/table::do_apply : do not check for async gate's closure
The `table::do_apply()` method verifies if the compaction group's async
gate is open to determine if the compaction group is active. Closing
this async gate prevents any new operations but waits for existing
holders to exit, allowing their operations to complete. When holding a
gate, holders will observe the gate as closed when it is being closed,
but this is irrelevant as they are already inside the gate and are
allowed to complete. All the callers of `table::do_apply()` already
enter the gate before calling the method. So, the async gate check
inside `table::do_apply()` will erroneously throw an exception when the
compaction group is closing despite holding the gate. This commit
removes the check to prevent this from happening.

Fixes #23348

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#23579
2025-04-07 13:27:22 +03:00
Emil Maskovsky
8b186ab0ff raft: drop the voter removal from the decommission
In the particular case of node decommission, this code doesn't really
matter in production and only confuses us. Losing majority is
an extremely rare event, and for this code to help one would have
to lose majority in a very specific way (exactly half of the nodes die
in a short time window during decommission), which is unrealistic.

In addition, this code will be completely irrelevant (and would never be
executed) once we implement #23266.

Refs: scylladb/scylladb#23266
2025-04-07 12:23:25 +02:00
Emil Maskovsky
00794af94d raft/test: disable the stop_before_becoming_raft_voter test
The workflow of becoming a voter changes with the "limited voters"
feature, as the node will no longer become a voter on its own, but the
votership is being managed by the topology coordinator. This therefore
breaks the `stop_before_becoming_raft_voter` test, as that injection
relies on the old behavior.

We will disable the test for this particular case for now and address
either fixing of complete removal of the test in a follow-up task.

Refs: scylladb/scylladb#23418
2025-04-07 12:23:25 +02:00
Emil Maskovsky
57df5d013e raft/test: stop the server less gracefully in the voters test
Stopping the test gracefully might hide some issues, therefore we want
to stop it forcefully to make sure that the code can handle it.

Added a parameter to stop gracefully or less gracefully (so that we test
both cases).
2025-04-07 12:22:19 +02:00
Pavel Emelyanov
10376b5b85 db: Re-use database::snapshot_table_on_all_shards()
There are two snapshot-on-all-shards methods on the database -- the one
that snapshots a keyspace and the one that snapshots a vector of tables.
The latter snapshots a single table with a neat helper, while the former
has the helper open-coded.

Re-using the helper in keyspace snapshot is worth it, but needs to patch
the helper to work on uuid, rather than ks:cf pair of strings.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23532
2025-04-07 11:55:43 +02:00
Nadav Har'El
84fd52315f alternator: in GetRecords, enforce Limit to be <= 1000
Alternator Streams' "GetRecords" operation has a "Limit" parameter on
how many records to return. The DynamoDB documentations says that the
upper limit on this Limit parameter is 1000 - but Alternator didn't
enforce this. In this patch we begin enforcing this highest Limit, and
also add a test for verifying this enforcement. As usual, the new test
passes on DynamoDB, and after this patch - also on Alternator.

The reason why it's useful to have *some* upper limit on Limit is that
the existing executor::get_records() implementation does not really have
preemption points in all the necessary places. In particular, we have a
loop on all returned records without preemption points. We also store
the returned records in a RapidJson vector, which requires a contiguous
allocation.

Even before this patch, GetRecords had a hard limit of 1 MB of results.
But still, in some cases 1 MB of results may be a lot of results, and we
can see stalls in the aforementioned places being O(number of results).

Fixes #23534

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23547
2025-04-07 12:52:03 +03:00
Kefu Chai
55777812d4 s3/client: Optimize file streaming with zero-copy multipart uploads
When streaming files using multipart upload, switch from using
`output_stream::write(const char*, size_t)` to passing buffer objects
directly to `output_stream::write()`. This eliminates unnecessary memory
copying that occurred when the original implementation had to
defensively copy data before sending.

The buffer objects can now be safely reused by the output stream instead
of creating deep copies, which should improve performance by reducing
memory operations during S3 file uploads.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23567
2025-04-07 12:50:06 +03:00
Avi Kivity
ac3d25eb44 sstable_set: incremental_reader_selector: be more careful when filtering out already engaged sstables
The incremental reader selector maintains an unordered_set of
sstables that are already engaged, and uses std::views::filter
to filter those out. It adds the sstable under consideration to the
set, and if addition failed (because it's already in) then it
filters it out.

This breaks if the filter view is executed twice - the first pass
will add every sstable to the set, and the second will consider
every sstable already filtered. This is what happens with
libstdc++ 15 (due to the addition of vector(from_range_t) constructor),
which uses the first pass to calculate the vector size
and the second pass to insert the elements into a correctly-sized
vector.

Fix by open-coding the loop.

Closes scylladb/scylladb#23597
2025-04-07 12:49:04 +03:00
Gleb Natapov
a982db326e gossiper: send newest entry in a digest message
In cases where two entries have the same ip address send information
only for the newest one. Now we send both which make the receiver use
one of them at random and it may be outdated one (though it should only
cause more data than needed to be requested).
2025-04-06 18:39:24 +03:00
Gleb Natapov
8d534ee68e gossiper: change make_random_gossip_digest to return value instead of modifying passed parameter 2025-04-06 18:39:24 +03:00
Gleb Natapov
6f53611337 gossiper: move force_remove_endpoint to work on host id
Since the gossiper works on host ids now it is incorrect to leave this
function to work on ip. It makes it impossible to delete outdated entry
since the "gossiper.get_host_id(endpoint) != id" check will always be
false for such entries (get_host_id() always returns most up -to-date
mapping.
2025-04-06 18:39:24 +03:00
Amnon Heiman
b55f24c14d alternator: Add tests for the batch items histograms
This patch adds a test for the batch‑items histogram for both get and
write operations.

It update the check_increases_metric_exact helper function so that it
would get a list of expected value and labels (labels can be None).
This makes it easy to test multiple buckets in a histogram.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-04-06 18:22:23 +03:00
Amnon Heiman
c060c0b867 alternator: Add histogram for batch item count
This patch adds an estimated_histogram for alternator batch item count.
estimated_histogram can be used with values starting from 1 with an
exponential factor of 1.2, which nicely covers values up to 20, but with
only 22 buckets it can reach all the way to 100 (plus infinity).

Aside from the new histograms for get and write batches, a helper
function was added to return the histogram in the metric format without
changing its resolution (which is the metric’s default behaviour).

The histogram will be reported once per node rather than once per shard.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-04-06 18:22:13 +03:00
Marcin Maliszkiewicz
b94acfb37b test: remove alternator code from perf-simple-query
This kind of benchmark was superseded by perf-alternator
which has more options, workflows and most importantly
measures overhead of http server layer (including json parsing).

There is no need to maintain additional code in perf-simple-query.

Closes scylladb/scylladb#23474
2025-04-06 18:15:16 +03:00
Pavel Emelyanov
d4f3a3ee4f cql: Remove unused "initial_tablets" mention from guardrails
All tablets configuration was moved into its own "with tablets" section,
this option name cannot be met among replication factors.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23555
2025-04-06 16:52:07 +03:00
Gleb Natapov
df6cd87bcc gossiper: do not send outdated endpoint in gossiper round
Now that the gossiper map is id based there can be a situation where two
entries have the same ip, Shadow round should send the newest one in
this cased. The patch makes it so.

Fixes: #23553
2025-04-06 15:08:03 +03:00
Nadav Har'El
431de48df9 test/alternator: test for item with many attributes
A user complained that he couldn't read or write an item with more than
16 attributes (!) in Alternator. This isn't true, but I realized that we
don't have a simple test for this case - all test use just a few attributes.
So let's add such a test, doing PutItem, UpdateItem and GetItem with 400
attributes. Unsurprisingly, the test passes.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23568
2025-04-03 22:35:49 +03:00
Nadav Har'El
a9a6f9eecc test/alternator: increase timeout in Alternator RBAC test
On our testing infrastructure, tests often run a hundred times (!)
slower than usual, for various reasons that we can't always avoid.
This is why all our test frameworks drastically increase the default
timeouts.

We forgot to increase the timeout in one place - where Alternator tests
use CQL. This is needed for the Alternator role-based access control
(RBAC) tests, which is configured via CQL and therefore the Alternator
test unusually uses CQL.

So in this patch we increase the timeout of CQL driver used by
Alternator tests to the same high timeouts (60-120 seconds) used by
the regular CQL tests. As the famous saying goes, these timeouts should
be enough for anyone.

Fixes #23569.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23578
2025-04-03 22:31:08 +03:00
Benny Halevy
cdf9fe9e50 Update seastar submodule
* seastar 2f13c461...ed8952fb (24):
  > file: explain dsync check in flush method
  > gate: add named_gate
  > tests: unit: add gate_test
  > reactor: Remove global task_quota extern declaration
  > future: Move report_failed_future to internal namespace
  > update boost cooking URL
  > smp: prefault: clear memory map after threads join
  > change format to sesatar::format
  > Prevent move / copy constructor / assignment on backtrace_buffer
  > Remove unnecesary flush calls from backtrace_buffer usage points
  > Make backtrace_buffer flush on destruction
  > Add `backtrace_buffer&` param to maybe_report_kernel_trace function
  > Prevent empty kernel callstack messages
  > Make cpu_stall_detector_linux_perf_event::maybe_report_kernel_trace function protected.
  > iotune: Add cli flag to force io depth
  > smp: prefault: decouple _stop_request from join_threads
  > reactor: more info, robustness on segfault
  > net/udp: fix ipv4_udp::next_port calculation
  > map_reduce: prevent mapper or reducer exception from poisoning state
  > build: Re-enable ASan's verify_asan_link_order check
  > tests: enable/disable internet-dependent tests at runtime
  > test: tls_test: rename test_simple_x509_client variants to avoid naming conflicts
  > tests: extend test.py to accept arbitrary ctest parameters from positional args
  > tests: add a handle for building tests in "offline" mode

Closes scylladb/scylladb#23566
2025-04-03 19:45:37 +03:00
Botond Dénes
1198213000 Merge 'tablets: Make tablet allocation equalize per-shard load ' from Tomasz Grabiec
Before, it was equalizing per-node load (tablet count), which is wrong
in heterogeneous clusters. Nodes with fewer shards will end up with
overloaded shards.

Refs #23378

Closes scylladb/scylladb#23478

* github.com:scylladb/scylladb:
  tablets: Make tablet allocation equalize per-shard load
  tablets: load_balancer: Fix reporting of total load per node
2025-04-03 16:32:53 +03:00
Botond Dénes
fcdae20fd1 Merge 'Add tablet enforcing option' from Benny Halevy
This series add a new config option: `tablets_mode_for_new_keyspaces` that replaces the existing
`enable_tablets` option. It can be set to the following values:
    disabled: New keyspaces use vnodes by default, unless enabled by the tablets={'enabled':true} option
    enabled:  New keyspaces use tablets by default, unless disabled by the tablets={'disabled':true} option
    enforced: New keyspaces must use tablets. Tablets cannot be disabled using the CREATE KEYSPACE option

`tablets_mode_for_new_keyspaces=disabled` or `tablets_mode_for_new_keyspaces=enabled` control whether
tablets are disabled or enabled by default for new keyspaces, respectively.
In either cases, tablets can be opted-in or out using the `tablets={'enabled':...}`
keyspace option, when the keyspace is created.

`tablets_mode_for_new_keyspaces=enforced` enables tablets by default for new keyspaces,
like `tablets_mode_for_new_keyspaces=enabled`.
However, it does not allow to opt-out when creating
new keyspaces by setting `tablets = {'enabled': false}`

Refs scylladb/scylla-enterprise#4355

* Requires backport to 2025.1

Closes scylladb/scylladb#22273

* github.com:scylladb/scylladb:
  boost/tablets_test: verify failure to create keyspace with tablets and non network replication strategy
  tablets: enforce tablets using tablets_mode_for_new_keyspaces=enforced config option
  db/config: add tablets_mode_for_new_keyspaces option
2025-04-03 16:32:19 +03:00
Kefu Chai
3760a1c85e cql3: Remove unnecessary 'virtual' specifiers from final class methods
Remove 'virtual' specifiers from member functions in final classes where
they can never be overridden. This addresses Clang errors like:

```
/home/kefu/dev/scylladb/cql3/column_identifier.hh:85:21: error: virtual method 'to_string' is inside a 'final' class and can never be overridden [-Werror,-Wunnecessary-virtual-specifier]
   85 |     virtual sstring to_string() const;
      |                     ^
1 error generated.
```

This change improves code clarity and maintainability by eliminating
redundant modifiers that could cause confusion.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23570
2025-04-03 13:51:42 +03:00
Tomasz Grabiec
fe8187e594 Merge 'repair: release erm in repair_writer_impl::create_writer when possible' from Aleksandra Martyniuk
Currently, repair_writer_impl::create_writer keeps erm to ensure that a sharder is valid. If we repair a tablet, erm blocks the state machine and no operation on any tablet of this table might be performed.

Use auto_refreshing_sharder and topology_guard to ensure that the operation is safe and that tablet operations on the whole table aren't blocked.

Fixes: #23453.

Needs backport to 2025.1 that introduces the tablet repair scheduler.

Closes scylladb/scylladb#23455

* github.com:scylladb/scylladb:
  \test: add test to check concurrent migration and repair of two different tablets
  repair: release erm in repair_writer_impl::create_writer when possible
2025-04-03 11:15:08 +02:00
Botond Dénes
7bbfa5293f test/cluster/test_read_repair.py: increase read request timeout
This test enables trace-level logging for the mutation_data logger,
which seems to be too much in debug mode and the test read times out.
Increase timeout to 1minute to avoid this.

Fixes: #23513

Closes scylladb/scylladb#23558
2025-04-03 10:42:11 +03:00
Botond Dénes
07510c07a0 readers/mutation_readers: queue_reader_handle_v2::push_end_of_stream() raise _ex if set
Instead of raising std::runtime_error("Dangling queue_reader_handle_v2")
unconditionally. push() already raises _ex if set, best to be
consistent.
Unconditionally raising std::runtime_error can cause an error to be
logged, when aborting an operation involving a queue reader.
Although the original exception passed to
queue_reader_handle_v2::abort() is most likely handled by higher level
code (not logged), the generic std::runtime_error raised is not and
therefore is logged.

Fixes: #23550

Closes scylladb/scylladb#23554
2025-04-03 10:39:56 +03:00
Pavel Emelyanov
3bf4768205 Merge 'Unify http transport in EAR to use seastar http client' from Calle Wilund
Fixes #22925
Refs #22885

Some providers in EAR were written before seastar got its own native http connector (as it is). Thus hand-made connectivity is used there.

This PR unifies the code paths, and also extract some abstraction between providers where possible.
One big reason for this is the handling of abrupt disconnects and retries; Seastar has some handling of things like EPIPE and ECONNRESET situations, that can be safely ignored in a REST call iff data was in fact transferred etc.

This PR mainly takes the usage of seastar httpclient from gcp connector, makes a wrapper matching most of the usage of local client in kms connector, ensures common functionality and the replaces the code in the individual connectors.

Closes scylladb/scylladb#22926

* github.com:scylladb/scylladb:
  encryption::gcp: Use seastar http client wrapper
  encryption::kms: Drop local http client and use seastar wrapper
  encryption: Break out a "httpclient" wrapper for seastar httpclient
2025-04-03 10:35:14 +03:00
Kefu Chai
0cd6cf1dc5 main: Remove unused member variable _sys_ks
Fixes a Clang error by removing the unused private field
`sstable_dict_deleter::_sys_ks` that was flagged with:
[-Werror,-Wunused-private-field]
```
/home/kefu/.local/bin/clang++ -DBOOST_PROGRAM_OPTIONS_DYN_LINK -DBOOST_PROGRAM_OPTIONS_NO_LIB -DSCYLLA_BUILD_MODE=release -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"RelWithDebInfo\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/build -isystem /home/kefu/dev/scylladb/seastar/include -isystem /home/kefu/dev/scylladb/build/RelWithDebInfo/seastar/gen/include -isystem /home/kefu/dev/scylladb/abseil -isystem /home/kefu/dev/scylladb/build/rust -I/usr/include/p11-kit-1 -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++23 -flto=thin -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/= -ffile-prefix-map=/home/kefu/dev/scylladb/build=. -ffile-prefix-map=/home/kefu/dev/scylladb/build/=build -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -mllvm -inline-threshold=2500 -fno-slp-vectorize -ffat-lto-objects -std=gnu++23 -Werror=unused-result -DSEASTAR_API_LEVEL=7 -DSEASTAR_SSTRING -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_SCHEDULING_GROUPS_COUNT=19 -DSEASTAR_LOGGER_TYPE_STDOUT -DBOOST_PROGRAM_OPTIONS_NO_LIB -DBOOST_PROGRAM_OPTIONS_DYN_LINK -DBOOST_THREAD_NO_LIB -DBOOST_THREAD_DYN_LINK -DFMT_SHARED -MD -MT CMakeFiles/scylla.dir/RelWithDebInfo/main.cc.o -MF CMakeFiles/scylla.dir/RelWithDebInfo/main.cc.o.d -o CMakeFiles/scylla.dir/RelWithDebInfo/main.cc.o -c /home/kefu/dev/scylladb/main.cc
/home/kefu/dev/scylladb/main.cc:1660:38: error: private field '_sys_ks' is not used [-Werror,-Wunused-private-field]
 1660 |                 db::system_keyspace& _sys_ks;
      |                                      ^
```

The member variable is not referenced anywhere in the code,
so removing it improves maintainability without affecting
functionality.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23545
2025-04-02 20:07:39 +03:00
Evgeniy Naydanov
84a5037056 test.py: cluster/suite.yaml: update test filters
After switching to subfolders the filter `run_in_debug` for
random failures test was just copied as is, but need to include
the subfolder, actually.

Also, `test_old_ip_notification_repro` was deleted, so, we
don't need it in the `skip_in_debug` list.

Closes scylladb/scylladb#23492
2025-04-02 19:29:27 +03:00
Kefu Chai
a09ec9d60d .github: add delay before checking for required PR labels
Improve the GitHub workflow to prevent premature email notifications
about missing labels. Previously, contributors without write permissions
to the scylladb repo would receive immediate notification emails about
missing required backport labels, even if they were in the process of
adding them.

This change introduces a 1-minute grace period before checking for
required labels, giving contributors sufficient time to add necessary
labels (like backport labels) to their pull requests before any warning
notifications are sent.

The delay makes the experience more user-friendly for non-maintainer
contributors while maintaining the labeling requirements.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23539
2025-04-02 19:28:15 +03:00
Aleksandra Martyniuk
bae6711809 \test: add test to check concurrent migration and repair of two different tablets 2025-04-02 15:30:17 +02:00
Radosław Cybulski
c36614e16d alternator: add size check to BatchItemWrite
Add a size check for BatchItemWrite command - if the item count is
bigger than configuration value `alternator_maximum_batch_write_size`,
an error will be raised and no modification will happen.

This is done to synchronize with DynamoDB, where maximum size of
BatchItemWrite is 25. To avoid complaints from clients, who use
our feature of BatchWriteItem being limitless we set default value
to 100.

Fixes #5057

Closes scylladb/scylladb#23232
2025-04-02 14:48:00 +03:00
Avi Kivity
882f405eed Merge "Convert gossiper's endpoint state map to be host id based" from Gleb
"
The series makes endpoint state map in the gossiper addressable by host
id instead of ips. The transition has implication outside of the
gossiper as well. Gossiper based topology operations are affected by
this change since they assume that the mapping is ip based.

On wire protocol is not affected by the change as maps that are sent by
the gossiper protocol remain ip based. If old node sends two different
entries for the same host id the one with newer generation is applied.
If new node has two ids that are mapped to the same ip the newer one is
added to the outgoing map.

Interoperability was verified manually by running mixed cluster.

The series concludes the conversion of the system to be host id based.
"

* 'gleb/gossipper-endpoint-map-to-host-id-v2' of github.com:scylladb/scylla-dev:
  gossiper: make examine_gossiper private
  gossiper: rename get_nodes_with_host_id to get_node_ip
  treewide: drop id parameter from gossiper::for_each_endpoint_state
  treewide: move gossiper to index nodes by host id
  gossiper: drop ip from replicate function parameters
  gossiper: drop ip from apply_new_states parameters
  gossiper: drop address from handle_major_state_change parameter list
  gossiper: pass rpc::client_info to gossiper_shutdown verb handler
  gossiper: add try_get_host_id function
  gossiper: add ip to endpoint_state
  serialization: fix std::map de-serializer to not invoke value's default constructor
  gossiper: drop template from  wait_alive_helper function
  gossiper: move get_supported_features and its users to host id
  storage_service: make candidates_for_removal host id based
  gossiper: use peers table to detect address change
  storage_service: use std::views::keys instead of std::views::transform that returns a key
  gossiper: move _pending_mark_alive_endpoints to host id
  gossiper: do not allow to assassinate endpoint in raft topology mode
  gossiper: fix indentation after previous patch
  gossiper: do not allow to assassinate non existing endpoint
2025-04-02 12:30:00 +03:00
Pavel Emelyanov
832d83ae4b sstables_loader: Do not stop sharded<progress_monitor> unconditionally
The member in question is unconditionally .stop()-ed in task's
release_resources() method, however, it may happen that the thing wasn't
.start()-ed in the first place. Start happens in the middle of the
task's .run() method and there can be several reasons why it can be
skipped -- e.g. the task is aborted early, or collecting sstables from
S3 throws.

fixes: #23231

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23483
2025-04-02 12:09:02 +03:00
Kefu Chai
6da758d74c config: mark uuid_sstable_identifiers_enabled unused
the option of `uuid_sstable_identifier_enabled` was introduced in
f014ccf3 . the first version which has this change was 5.4, and
6.1 has been branched. during the discussion of backup and restore,
we realized that we've been taking efforts to address problems which
could have been addressed with the sstable with UUID-based identifier.
see also #10459 which is the issue which proposed to implement UUID-v1
based sstable identifier.

now that two major releases passed, we should have the luxury to mark
this option "unused". this option which was previously introduced to
keep the backward compatibility, and to allow user to opt-out of the
feature for some reasons.

so in this change,  mark the option unused, so that if any user still
sets this option with command line, they will get a clear error. but
we still parse and handle this setting in `scylla.yaml`, so that this
option is still respected for existing settings, and for existing tests,
which are not yet prepared for the uuid-based sstable identifiers.

Refs #10459
Fixes #20337

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20341
2025-04-01 20:21:47 +03:00
Botond Dénes
3bad46a6e2 docs/dev: add tombstone.md
An exhaustive document on the tombstone related internal logic as well
as the user-facing aspects.

Closes scylladb/scylladb#23454
2025-04-01 20:17:57 +03:00
Botond Dénes
a0d8102a1f replica/memtable: s/make_flat_reader/make_mutation_reader/
Following the recent refactoring of removing "flat" and "v2" from reader
names, replacing all the fully qualified names with simply "mutation_reader".

Closes scylladb/scylladb#23346
2025-04-01 17:58:13 +03:00
Artsiom Mishuta
032b28d793 test.py: remove pylib_test from test.py/CI run
pylib_test contains one pure Python test. This test does not test Scylla.
This test is not deleted because it can be useful to run during pre-commit,
for example, but it definitely should not be run in CI in modes with 3 repeats each.
It does not make sense. It is a Unit test for test.py framework.

Note: test still can be easily run by pytest via the command:
./tools/toolchain/dbuild pytest test/pylib_test

Closes scylladb/scylladb#23181
2025-04-01 16:43:45 +03:00
Pavel Emelyanov
2ee9cec1d3 Merge 'Remove object_storage.yaml and move the endpoints to scylla.yaml' from Robert Bindar
Move `object_storage.yaml` endpoints to `scylla.yaml`

This change also removes the `object_storage.yaml` file
altogether and adds tests for fetching the endpoints
via the `v2/config/object_storage_endpoints` REST api.

Also, `object_storage_config_file` options is moved to a deprecated state as it's no longer needed.

This PR depends on #22951, the reviewers should review patch 393e1ac0ec066475ca94094265a5f88dbbdb1a1f

Refs https://github.com/scylladb/scylladb/issues/22428

Closes scylladb/scylladb#22952

* github.com:scylladb/scylladb:
  Remove db::config::object_storage_config
  Move `object_storage.yaml` endpoints to `scylla.yaml`
2025-04-01 16:01:44 +03:00
Avi Kivity
69684e16d8 Merge 'sstables: add SSTable compression with shared dictionaries ' from Michał Chojnowski
This PR extends Scylla's SSTable compression with the ability to use compression dictionaries shared across compression chunks. This involves several changes:

- We refactor `compression_parameters` and friends (`compressor`, `sstables::local_compression`, `sstables::compression`) to prepare for making the construction of `compressor`s asynchronous, to enable sharing pieces of compressors (the dictionaries) across shards.
- We introduce the notion of "hidden compression options" which are written to `CompressionInfo.db` and used to construct decompressors, like regular options, but don't appear in the schema. (We later stuff the SSTable's dictionary into `CompressionInfo.db` using a sequence of such options).
- We add a cluster feature which guards the creation of dictionary-compressed SSTables.
- We introduce a central "compressor factory" (one instance shared by all shards), which from this point onward is used to construct all `compressor` objects (one per SSTable) used to process the SSTables. When constructing a compressor for writing, it uses the "current"/"recommended" dictionary (which is passed to the factory from the actively-observed contents of the group0-managed `system.dicts`). When constructing a compressor for reading, it uses the dictionary written in the hidden compression options in CompressionInfo.db. And it keeps dictionaries deduplicated, so that each unique live dictionary blob has only one instance in memory, shared across shards.
- We teach the relevant `lz4` and `zstd` compressor wrappers about the dictionaries.
- We add a HTTP API call which samples pieces of the given table (i.e. the Data.db files) from across the cluster, trains a dictionary on it, and publishes it via `system.dicts` as the new current dictionary for that table. (And we add some RPC verbs to support that).
- We add a HTTP API call which estimates the impact of various available compression configurations on the compression ratio.
- We add an autotrainer fiber which periodically retrains dicts for dict-aware tables and publishes them if they seem to be a significant improvement.

Known imperfections:
- The factory currently keeps one dictionary instance on the entire node, but we probably want one copy per NUMA node. I didn't do that because exposing NUMA knowledge to Scylla seems to require some changes in Seastar first.

New feature, no backporting involved.

Closes scylladb/scylladb#23025

* github.com:scylladb/scylladb:
  docs: add user-facing documentation for SSTable compression with shared dicts
  docs/dev: add sstable-compression-dicts.md
  test: add test_sstable_compression_dictionaries_autotrain.py
  test: add test_sstable_compression_dictionaries_basic.py
  test/pylib/rest_client: add `keyspace_upgrade_sstables` helper
  main: run a sstable_dict_autotrainer
  api: add the estimate_compression_ratios API call
  dict_autotrainer: introduce sstable_dict_autotrainer
  db/system_keyspace: add query_dict_timestamp
  compress: add ZstdWithDictsCompressor and LZ4WithDictsCompressor
  main: clean up sstable compression dicts after table drops
  sstables/compress: discard hidden compression options after the decompressor is created
  compress: change compressor_ptr from shared_ptr to unique_ptr
  api: add the retrain_dict API call
  storage_service: add some dict-related routines
  main: in compression_dict_updated_callback, recognize and use SSTable compression dicts
  storage_service: add do_sample_sstables()
  messaging_service: add SAMPLE_SSTABLES and ESTIMATE_SSTABLE_VOLUME verbs
  db/system_keyspace: let `system.dicts` helpers be used for dicts other than the RPC compression dict
  raft/group0_state_machine: on `system.dicts` mutations, pass the affected partitition keys to the callback
  database: add sample_data_files()
  database: add take_sstable_set_snapshot()
  compress: teach `lz4_processor` about dictionaries
  compress: teach `zstd_processor` about dictionaries
  sstables: delegate compressor creation to the compressor factory
  sstables: plug an `sstable_compressor_factory` into `sstables_manager`
  sstables: introduce sstable_compressor_factory
  utils/hashers: add get_sha256()
  gms/feature_service: add the SSTABLE_COMPRESSION_DICTS cluster feature
  compress: add hidden dictionary options
  compress: remove `compression_parameters::get_compressor()`
  sstables/compress: remove get_sstable_compressor()
  sstables/compress: move ownership of `compressor` to `sstable::compression`
  compress: remove compressor::option_names()
  compress: clean up the constructor of zstd_processor
  compress: squash zstd.cc into compress.cc
  sstables/compress: break the dependency of `compression_parameters` on `compressor`
  compress.hh: switch compressor::name() from an instance member to a virtual call
  bytes: adapt fmt_hex to std::span<const std::byte>
2025-04-01 12:47:34 +03:00
Aleksandra Martyniuk
1dc29ddc86 repair: release erm in repair_writer_impl::create_writer when possible
Currently, repair_writer_impl::create_writer keeps erm to ensure
that a sharder is valid. If we repair a tablet, erm blocks the state
machine and no operation on any tablet of this table might be performed.

Use auto_refreshing_sharder and topology_guard to ensure that the
operation is safe and that tablet operations on the whole table
aren't blocked.

Fixes: #23453.
2025-04-01 11:34:21 +02:00
Calle Wilund
c6674619b7 encryption::gcp: Use seastar http client wrapper
Refs #22925

Remove direct usage of seastar http client, and instead share this
with other connectors via the http client wrapper type.
2025-04-01 08:18:05 +00:00
Calle Wilund
491748cde3 encryption::kms: Drop local http client and use seastar wrapper
Fixes #22925

Removes the boost based http client in favour of our seastar
wrapper.
2025-04-01 08:18:05 +00:00
Calle Wilund
878f76df1f encryption: Break out a "httpclient" wrapper for seastar httpclient
Refs #22925

Adds some wrapping and helpers for the kind of REST operations we
expect to perform.

Some things like stream formatting is redundant visavi seastar,
but on that level we only have \r\n encoded writing to
output_stream and similar, which is less useful for things like
logging.
2025-04-01 08:18:05 +00:00
Piotr Smaron
370707b111 service: restore default timeout in announce_with_raft
This restored timeout seems to have been accidentally removed in
7081215552 (r2005352424).
Without it, `raft_server_with_timeouts::run_with_timeout` will get
`std::nullopt` as a value of the `timeout` parameter and perform an
operation without any timeout, whereas previously it would have waited
for the default timeout specified in
`raft_server_for_group::default_op_timeout`.

Closes scylladb/scylladb#23380
2025-04-01 10:20:16 +03:00
David Garcia
6e61fc323b docs: redirect to docs.scylladb.com/manual/
Define a custom alert to redirect users to the latest version of the docs in https://docs.scylladb.com/manual/

Closes scylladb/scylladb#22636
2025-04-01 09:22:56 +03:00
Botond Dénes
bd9f51a29c Merge 'transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing' from Vladislav Zolotarov
A default timestamp (not to confuse with the timestamp passed via 'USING TIMESTAMP' query clause) can be set using 0x20 flag and the <timestamp> field in the binary CQL frame payload of QUERY, EXECUTE and BATCH ops. It also happens to be a default of a Java CQL Driver.

However, we were only setting the corresponding info in the CQL Tracing context of a QUERY operation. For an unknown reason we were not setting this for an EXECUTE and for a BATCH traces (I guess I simply forgot to set it back then).

This patch fixes this.

Fixes #23173

The issue fixed by this PR is not critical but the fix is simple and safe enough so we should backport it to all live releases.

Closes scylladb/scylladb#23174

* github.com:scylladb/scylladb:
  CQL Tracing: set common query parameters in a single function
  transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing
2025-04-01 09:16:02 +03:00
Pavel Emelyanov
b5a124f60c sstable_directory: Move highest_generation_seen() to distributed_loader.cc
This method is only used by the loader code (and tests). Also, There's the
highest_version_seen() peer that sits in the loader code either.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23324
2025-04-01 09:15:14 +03:00
Pavel Emelyanov
eafc767cc6 sstable/filesystem: Add convenience helper to generate filename
In its operations the fs storage carefully generates full filename from
all sstable parameters -- version, format, generation, keyspace and
table names and component type or name. However, in all of the cases
format, version and keyspace:table names are inherited from the sstable
being operated on. This calls for a filename generation helper that
wraps most of the arguments thus making the lines shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23384
2025-04-01 09:14:44 +03:00
Botond Dénes
0fdf2a2090 Merge 'test/pylib: servers_add: support list of property_files' from Benny Halevy
So that a multi-dc/multi-rack cluster can be populated
in a single call.

* Enhancement, no backport required

Closes scylladb/scylladb#23341

* github.com:scylladb/scylladb:
  test/pylib: servers_add: add auto_rack_dc parameter
  test/pylib: servers_add: support list of property_files
2025-04-01 09:14:20 +03:00
Botond Dénes
94e8971308 scylla-gdb.py: improve scylla repairs commadn
Make output more readable by:
* group follower/master repair instances separately
* split repair details into one line for repair summary, then one line
  for each host info
* add indentation to make the output easier to follow

Also add -m|--memory option to calculate memory usage of repair buffers.

Example output:

    (gdb) scylla repairs -m
    Repairs for which this node is leader:
      (repair_meta*) 0x60503ab7f7b0: {id: 19197, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 30, memory: 48208512}, same_shard: True, tablet: False}
        host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished
        host: ce4413ab-33d9-40f8-b13e-d14af8511dda, shard: 4294967295, state: repair_state::put_row_diff_with_rpc_stream_started
      (repair_meta*) 0x60503717f7b0: {id: 19211, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 28, memory: 63863265}, same_shard: True, tablet: False}
        host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished
        host: c4936a19-41da-4260-971e-651445d740fd, shard: 4294967295, state: repair_state::get_row_diff_with_rpc_stream_finished
      (repair_meta*) 0x60502ddff7b0: {id: 19231, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 0, memory: 0}, same_shard: True, tablet: False}
        host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::row_level_stop_started
        host: 039494b6-9d35-4f34-82c4-3c79c1d97175, shard: 4294967295, state: repair_state::row_level_stop_finished
      (repair_meta*) 0x60501db3f7b0: {id: 19234, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 0, memory: 0}, same_shard: True, tablet: False}
        host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_sync_boundary_started
        host: 039494b6-9d35-4f34-82c4-3c79c1d97175, shard: 4294967295, state: repair_state::get_sync_boundary_finished
      (repair_meta*) 0x60501c81f7b0: {id: 19236, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 28, memory: 42696821}, same_shard: True, tablet: False}
        host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished
        host: ce4413ab-33d9-40f8-b13e-d14af8511dda, shard: 4294967295, state: repair_state::put_row_diff_with_rpc_stream_started
      (repair_meta*) 0x60503f65f7b0: {id: 19238, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 28, memory: 47785163}, same_shard: True, tablet: False}
        host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished
        host: ce4413ab-33d9-40f8-b13e-d14af8511dda, shard: 4294967295, state: repair_state::get_row_diff_with_rpc_stream_finished
    Repairs for which this node is follower:
2025-04-01 01:53:35 -04:00
Botond Dénes
47c62a4cf2 scylla-gdb.py: seastar_lw_shared_ptr: add __nonzero__ and __bool__
There is currently no easy way to null-check seastar_lw_shared_ptr.
Comparing get() against 0 doesn't work, if _p is null, get() will return
an illegal pointer. So add methods to allow for easy null-checks by
comparing _p with 0 instead.
2025-04-01 01:53:34 -04:00
Botond Dénes
f84bf43c96 scylla-gdb.py: introduce managed_bytes
Extracted from managed_bytes_printer. Make working with managed_bytes
easier. Abstracts how size and content is obtained.
2025-04-01 01:53:34 -04:00
Jenkins Promoter
6c528f5027 Update pgo profiles - aarch64 2025-04-01 04:45:44 +03:00
Jenkins Promoter
3c12029584 Update pgo profiles - x86_64 2025-04-01 04:27:11 +03:00
Michał Chojnowski
36be9d1c9b docs: add user-facing documentation for SSTable compression with shared dicts 2025-04-01 00:07:31 +02:00
Michał Chojnowski
d33ffb221b docs/dev: add sstable-compression-dicts.md 2025-04-01 00:07:31 +02:00
Michał Chojnowski
f851efd4fa test: add test_sstable_compression_dictionaries_autotrain.py
Adds a test which checks that sstable compression dict autotraining
does its job.
2025-04-01 00:07:31 +02:00
Michał Chojnowski
62da3d8363 test: add test_sstable_compression_dictionaries_basic.py
Add a basic integration test for SSTable compression with shared dictionaries.
2025-04-01 00:07:30 +02:00
Michał Chojnowski
7b0eeefd79 test/pylib/rest_client: add keyspace_upgrade_sstables helper 2025-04-01 00:07:30 +02:00
Michał Chojnowski
3f7969313f main: run a sstable_dict_autotrainer
Create an instance of `sstable_dict_autotrainer` in `scylla_main`
and run it.
2025-04-01 00:07:30 +02:00
Michał Chojnowski
a19d6d95f7 api: add the estimate_compression_ratios API call
Add an API call which estimates the effectiveness of possible
compression config changes.

This can be used to make an informed decision about whether to
change the compression method, without actually recompressing
any SSTables.
2025-04-01 00:07:30 +02:00
Michał Chojnowski
4f0d453acf dict_autotrainer: introduce sstable_dict_autotrainer
Add a fiber responsible for periodic re-training of compression dictionaries
(for tables which opted into dict-aware compression).

As of this patch, it works like this:
every `$tick_period` (15 minutes), if we are the current Raft leader,
we check for dict-aware tables which have no dict, or a dict older
than `$retrain_period`.

For those tables, if they have enough data (>1GiB) for a training,
we train a new dict and check if it's significantly better
than the current one (provides ratio smaller than 95% of current ratio),
and if so, we update the dict.
2025-04-01 00:07:30 +02:00
Michał Chojnowski
9d02e2c005 db/system_keyspace: add query_dict_timestamp
Adds a helper method which queries the creation timestamp
of a given dict in `system.dicts`.

We will later use the age of the current SSTable compression dict
to decide if another training should be done already.
2025-04-01 00:07:30 +02:00
Michał Chojnowski
cb1b291051 compress: add ZstdWithDictsCompressor and LZ4WithDictsCompressor
Add new compressor names to `sstable_compression`.
When those names are configured in the schema,
new SSTables will be compressed with dict-aware Zstd or LZ4
respectively.
2025-04-01 00:07:30 +02:00
Michał Chojnowski
bea866a46f main: clean up sstable compression dicts after table drops
When a table is dropped, its corresponding dictionary in `system.dicts`
-- if any -- should be deleted, otherwise it will remain forever as
garbage.

This commit implements such cleanup.
2025-04-01 00:07:30 +02:00
Michał Chojnowski
cee504f66f sstables/compress: discard hidden compression options after the decompressor is created
Dictionary contents are kept in the list of "compression options" in the
header of `CompressionInfo.db`, and they are loaded from disk into
memory when the `sstable::compression` object is populated.

After the decompressor for the SSTable is created based on those
dict contents, they are not needed in RAM anymore. And since
they take up a sizeable amount of memory, we would like to free them.

In this patch, we discard all "hidden compression options"
(currently: only the dictionary contents) from the
`sstable::compression` object right after the decompressor is created.
(Those options are not supposed to be used for anything else anyway).
2025-04-01 00:07:30 +02:00
Michał Chojnowski
10fa4abde7 compress: change compressor_ptr from shared_ptr to unique_ptr
Cleanup patch. After we moved the ownership of compressors
to sstables, compressor objects never have shared lifetime.
`unique_ptr` is more appropriate for them than `shared_ptr` now.
(And besides expressing the intent better, using `unique_ptr`
prevents an accidental cross-shard `shared_ptr` copy).
2025-04-01 00:07:29 +02:00
Michał Chojnowski
58ae278d10 api: add the retrain_dict API call
Add an API call which will retrain the SSTable compression dictionary
for a given table.

Currently, it needs all nodes to be alive to succeed. We can relax this later.
2025-04-01 00:07:29 +02:00
Michał Chojnowski
4115a6fece storage_service: add some dict-related routines
storage_service will be the interface between the API layer
(or the automatic training loop) and the dict machinery.
This commit implements the relevant interface for that.

It adds methods that:
1. Take SSTable samples from the cluster, using the new RPC verbs.
2. Train a dict on the sample. (The trainer will be plugged in from `main`).
3. Publishes the trained dictionary. (By adding mutations to Raft group 0).

Perhaps this should be moved to a separate "service".
But it's not like `storage_service` has a clear purpose anyway.
2025-04-01 00:07:29 +02:00
Michał Chojnowski
94d244ab49 main: in compression_dict_updated_callback, recognize and use SSTable compression dicts
Currently, there is at most one dictionary in `system.dicts`:
named "general", used by RPC compression. So the callback called
on `system.dicts` just always refreshes the RPC compression dict.

In a follow-up commit, we will publish SSTable compression dicts to
`system.dicts` rows with a name in the "sstables/{table_uuid}" format.
We want modification to such rows to be passed as new dictionary
recommendations to the SSTable compressor factory. This commit teaches
the `system.dicts` modification callback to recognize such modifications
and forward them to the compressor factory.
2025-04-01 00:07:29 +02:00
Michał Chojnowski
380f409c46 storage_service: add do_sample_sstables()
Adds a helper which uses ESTIMATE_SSTABLE_VOLUME and SAMPLE_SSTABLES
RPC calls to gather a combined sample of SSTable Data files for the given table
from the entire cluster.
2025-04-01 00:07:29 +02:00
Michał Chojnowski
94c33b6760 messaging_service: add SAMPLE_SSTABLES and ESTIMATE_SSTABLE_VOLUME verbs
Add two verbs needed to implement dictionary training for SSTable
compression.

SAMPLE_SSTABLES returns a list of randomly-selected chunks of Data files
with a given cardinality and using a given chunk size,
for the given table.

ESTIMATE_SSTABLE_VOLUME returns the total uncompressed size of all Data
files the given table.
2025-04-01 00:07:29 +02:00
Michał Chojnowski
4856f4acca db/system_keyspace: let system.dicts helpers be used for dicts other than the RPC compression dict
Extend the `system.dicts` helper for querying and modifying
`system.dicts` with an ability to use names other than "general".
We will use that in later commits to publish dictionaries for SSTable compression.
2025-04-01 00:07:29 +02:00
Michał Chojnowski
b77c611c00 raft/group0_state_machine: on system.dicts mutations, pass the affected partitition keys to the callback
Before this patch, `system.dicts` contains only one dictionary, for RPC
compression, with the fixed name "general".

In later parts of this series, we will add more dictionaries to
system.dicts, one per table, for SSTable compression.

To enable that, this patch adjusts the callback mechanism for group0's `write_mutations`
command, so that the mutation callbacks for group0-managed tables can see which
partition keys were affected. This way, the callbacks can query only the
modified partitions instead of doing a full scan. (This is necessary to
prevent quadratic behaviours.)

For now, only the `system.dicts` callback uses the partition keys.
2025-04-01 00:07:29 +02:00
Michał Chojnowski
d920ab5366 database: add sample_data_files()
Add a helper for sampling the Data files for a given table.
We will use it to take samples for dictionary training.
2025-04-01 00:07:29 +02:00
Michał Chojnowski
48c06c7e4b database: add take_sstable_set_snapshot()
We want a method that will allow us to take a stable snapshot of
SSTables, to asynchronously compute some stats on them.
But `take_storage_snapshot` is overly invasive for that, because
it flushes memtables on each call.
(If `take_storage_snapshot` was, for example, called repetitively,
it could create a ton of small memtables and lead to trouble).

This commit adds a weaker version which only takes a snapshot of
*existing SSTables*, and doesn't flush memtables by itself.

This will be useful for dictionary training, which doesn't
care about the semantics of SSTables, only their rough statistical
properties.
2025-04-01 00:07:28 +02:00
Michał Chojnowski
64f3d7e364 compress: teach lz4_processor about dictionaries
Extend `lz4_processor` with the ability to use dictionaries.
We won't use this ability yet. It will be used when new
compressor names are added.
2025-04-01 00:07:28 +02:00
Michał Chojnowski
b65101b371 compress: teach zstd_processor about dictionaries
Extend `zstd_processor` with the ability to use dictionaries.
We won't use this ability yet. It will be used when new
compressor names are added.
2025-04-01 00:07:28 +02:00
Michał Chojnowski
b18ddcb92e sstables: delegate compressor creation to the compressor factory
Remove `compressor::create()`. This enforces that compressors
are only created through the `sstable_compressor_factory`.

Unlike the synchronous `compressor::create()`, the factory will be able
to create dict-aware compressors.
2025-04-01 00:07:28 +02:00
Michał Chojnowski
30a9d471fa sstables: plug an sstable_compressor_factory into sstables_manager
Create a `sstable_compressor_factory_impl` in `scylla_main`,
and pipe it through constructors into `sstables_manager`.

In next commits, the factory available through the `sstables_manager`
will be used to create compressors for SSTable readers and writers.
2025-04-01 00:07:28 +02:00
Michał Chojnowski
ebf02913a2 sstables: introduce sstable_compressor_factory
Before this commit, `compressor` objects are synchronously
created, during the creation or opening of SSTables,
from `compression_parameters` objects.

But we want to add compression dictionaries to SSTables and we want
to share dictionary contents across shards.
To do that, we need to make the creation of `compressor` objects asynchronous,
and give it access to a global dictionary registry.

We encapsulate that in a `sstable_compression_factory`. Instead of
calling `compressor::create()` on SSTable opening or creation, we will
ask the factory, asynchronously, for a new compressor, and it will return
a compressor with a deduplicated, up-to-date dictionary.

This commit introduces such a factory. It's not used anywhere yet,
and the compressors it produces don't use the provided dictionaries yet.
2025-04-01 00:07:28 +02:00
Michał Chojnowski
2bd393849c utils/hashers: add get_sha256()
Add a helper function which computes the SHA256 for a blob.
We will use it to compute identifiers for SSTable compression
dictionaries later.
2025-04-01 00:07:28 +02:00
Michał Chojnowski
61316e29df gms/feature_service: add the SSTABLE_COMPRESSION_DICTS cluster feature
This feature will guard against writing SSTables containing compression
dictionaries before the entire cluster is able to understand them.
2025-04-01 00:07:28 +02:00
Michał Chojnowski
dd932ebb2f compress: add hidden dictionary options
Before this commit, "compression options" written into
CompressionInfo.db (and used to construct a decompressor)
have a 1:1 correspondence to "compression options" specified
in the schema.

But we want to add a new "compression option" -- the compression
dictionary -- which will be written into CompressionInfo.db
and used to construct decompressors, but won't be specified in the
schema.

To reconcile that, in this commit we introduce the notion of a "hidden
option". If an option name in `CompressionInfo.db` begins with a dot,
then this option will be used to construct decompressors, but won't
be visible for other uses. (I.e. for the `sstable_info` API call
and for recovering a fake `schema` from `CompressionInfo.db` in the
`scylla sstable` tool).

Then, we introduce the hidden `.dictionary.{0,1,2,..}` options,
which hold the contents of the dictionary blob for this SSTable.

(The dictionary is split into several parts because the SSTable
format limits the length of a single option value to 16 bits,
and dictionaries usually have a length greater than that).

This commit only introduces helpers which translate dictionary blobs
into "options" for CompressionInfo.db, and vice-versa, but it doesn't
use those helpers yet. They will be used in later commits.
2025-04-01 00:07:28 +02:00
Michał Chojnowski
11be7c0704 compress: remove compression_parameters::get_compressor()
Following up on the previous commits, we avoid constructing
compressors where not necessary,
by checking things directly on `compression_parameters` instead.
2025-04-01 00:07:28 +02:00
Michał Chojnowski
006c631642 sstables/compress: remove get_sstable_compressor()
Following up on the previous commit, we avoid constructing
a compressor in the `sstable_info` API call, and we instead
read the compression options from the `sstable::compression`.
2025-04-01 00:07:28 +02:00
Michał Chojnowski
8e611536b0 sstables/compress: move ownership of compressor to sstable::compression
SSTable readers and writers use `compressor` objects to compress and
decompress chunks of SSTable data files.

`compressor` objects are read-only, so only one of them is needed
for each SSTable. Before this commit, each reader and writer has
its own `compressor` object. This isn't necessary, but it's okay.

But later in this series it will stop being okay, because the creation
of a `compressor` will become an expensive cross-shard
operation (because it might require sharing a compression dictionary
from another shard). So we have to adjust the code so that there is
only once `compressor` per sstable, not one per reader/writer.

We stuff the ownership of this compressor into `sstable::compression`.

To make the ownership clear, we remove `compression_ptr` shared
pointers from readers and writers, and make them access the
compressor via the `sstable::compression` instead.
2025-04-01 00:07:27 +02:00
Michał Chojnowski
7bdcd5e8c1 compress: remove compressor::option_names()
It used to be used by `compression_parameters` validation logic
to ask the created `compressor` for compressor-specific option names.

Since we no longer delegate this to `compressor`, but we just
put the knowledge of those options directly into
`compressor_parameters`, it's dead code now.
2025-04-01 00:07:27 +02:00
Michał Chojnowski
3b0ab8e1ee compress: clean up the constructor of zstd_processor
Since we now parse and validate the compression level during the
construction of `compression_parameters`, we can just pass the
structured params to `zstd_processor` instead of passing
a raw string map.
2025-04-01 00:07:27 +02:00
Michał Chojnowski
6470035a74 compress: squash zstd.cc into compress.cc
Unlike all other implementations of `compressor`, `zstd_processor`
has its own special object file and its own special
late binding mechanism (via the `class_registry`).
It doesn't need either.

Let's squash it into `compress.cc`. Keeping `zstd_processor` a separate "module"
would require adding even more headers and source files later in the
series (when adding dictionaries), and there's no benefit in being
so granular. All `compressor` logic can be in `compress.cc` and it will
still be small enough.

This commit also gets rid of the pointless `class_registry` late binding
mechanism and just constructs the `zstd_processor` in
`compressor::create()` with a regular constructor call.
2025-04-01 00:07:27 +02:00
Michał Chojnowski
cfe69e057f sstables/compress: break the dependency of compression_parameters on compressor
Note: this commit is meant to be a code refactoring only and is not intended
to change the observable behaviour.

Today `schema` contains a `compression_parameters`.
`compression_parameters` contains an instance of
`compressor`, and SSTable writers just share that instance.

This is fine because `compressor` is a stateless object,
functionally dependent on the schema.

But in later parts of the series, we will break this functional
dependency by adding dictionaries to compressors. Two writers
for the same schema might have different dictionaries, so they won't
be able to just share a single instance contained in the schema.

And when that happens, having a `compressor` instance
in the `schema`/`compression_parameters` will become awkward,
since it won't be actually used. It will be only a container for options.

In addition, for performance reasons, we will want to share some pieces
of compressors across shards, which will require -- in the general case --
a construction of a compressor to be asynchronous, and therefore not
possible inside the constructor of `compression_parameters`.

This commit modifies `compression_parameters` so that it doesn't hold or
construct instances of `compressor`.

Before this patch, the `compressor` instance constructed in
`compression_parameters` has an additional role of validating and
holding compressor-specific options.
(Today the only such option is the zstd compression level).

This means that the pieces of logic responsible for compressor-specific
options have to be rewritten. That ends up being the bulk of this commit.
2025-04-01 00:07:27 +02:00
Michał Chojnowski
f4ca94d13b compress.hh: switch compressor::name() from an instance member to a virtual call
Before this patch, `compressor` is designed to be a proper abstract
class, where the creator of a compressor doesn't even know
what he's creating -- he passes a name, and it gets turned into a
`compressor` behind a scenes.

But later, when creation of compressors will involve looking up
dictionaries, this abstraction will only get in the way.
So we give up on keeping `compressor` abstract, and instead of
using "opaque" names we turn to an explicit enum of possible compressor types.

The main point of this patch is to add the `algorithm` enum and the `algorithm_to_name()`
function. The rest of the patch switches the `compressor::name()` function
to use `algorithm_to_name()` instead of the passed-by-constructor
`compressor::_name`, to keep a single source of truth for the names.
2025-04-01 00:07:27 +02:00
Michał Chojnowski
4f634de2e9 bytes: adapt fmt_hex to std::span<const std::byte>
This allows us to hexdump things other than `bytes_view`.
(That is, without reinterpret_casting them to `bytes_view`,
which -- aside from the inconvenience -- isn't quite legal.
In contrast, any span can be legally casted to `std::span<const std::byte>`).
2025-04-01 00:07:27 +02:00
Robert Bindar
b647196121 Remove db::config::object_storage_config
That map became redundant once we added
object_storage_endpoints in the config, this patch removes
it and switches all the user code to use the new option.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2025-03-31 17:15:12 +03:00
Gleb Natapov
3abe5de8bf gossiper: make examine_gossiper private 2025-03-31 16:50:50 +03:00
Gleb Natapov
afdfde8300 gossiper: rename get_nodes_with_host_id to get_node_ip
Also change it to return std::optional instead of std::set since now
there can be only on ip mapped to an id.
2025-03-31 16:50:50 +03:00
Gleb Natapov
28fb84117d treewide: drop id parameter from gossiper::for_each_endpoint_state
We have it in endpoint_state anyway, so no need to pass both.
2025-03-31 16:50:50 +03:00
Gleb Natapov
4609bbbbb2 treewide: move gossiper to index nodes by host id
This patch changes gossiper to index nodes by host ids instead of ips.
The main data structure that changes is _endpoint_state_map, but this
results in a lot of changes since everything that uses the map directly
or indirectly has to be changed. The big victim of this outside of the
gossiper itself is topology over gossiper code. It works on IPs and
assumes the gossiper does the same and both need to be changed together.
Changes to other subsystems are much smaller since they already mostly
work on host ids anyway.
2025-03-31 16:50:50 +03:00
Gleb Natapov
19ac05b0ba gossiper: drop ip from replicate function parameters
We have it in endpoint_state now, so no need to pass both.
2025-03-31 16:50:50 +03:00
Gleb Natapov
c5b8429bec gossiper: drop ip from apply_new_states parameters
We have it in endpoint_state now, so no need to pass both.
2025-03-31 16:50:50 +03:00
Gleb Natapov
6da5f541a2 gossiper: drop address from handle_major_state_change parameter list
We have it in endpoint_state now, so no need to pass both.
2025-03-31 16:50:50 +03:00
Gleb Natapov
5e06bf76e0 gossiper: pass rpc::client_info to gossiper_shutdown verb handler
It will be needed later to obtain host id of the peer.
2025-03-31 16:50:50 +03:00
Gleb Natapov
704580b197 gossiper: add try_get_host_id function
The function returns unengaged std::optional if id is not found instead
of throwing like get_host_id does.
2025-03-31 16:50:45 +03:00
Tomasz Grabiec
29d1c2adc6 Merge 'Finalize tablet splits earlier' from Lakshmi Narayanan Sreethar
Resize finalization is executed in a separate topology transition state,
`tablet_resize_finalization`, to ensure it does not overlap with tablet
transitions. The topology transitions into the
`tablet_resize_finalization` state only when no tablet migrations are
scheduled or being executed. If there is a large load-balancing backlog,
split finalization might be delayed indefinitely, leaving the tables
with large tablets.

This PR fixes the issue by updating the load balancer to no schedule any
migrations and to not make any repair plans when there a resize
finalization is pending in any table.

Also added a testcase to verify the fix.

Fixes #21762

Improvement : No need to backport.

Closes scylladb/scylladb#22148

* github.com:scylladb/scylladb:
  topology_coordinator: fix indentation in generate_migration_updates
  topology_coordinator: do not schedule migrations when there are pending resize finalizations
  load_balancer: make repair plans only when there is no pending resize finalization
2025-03-31 14:42:34 +02:00
Gleb Natapov
6999b474a1 gossiper: add ip to endpoint_state
Store endpoint's IP in the endpoint state. Currently it is stored as a key
in gossiper's endpoint map, but we are going to change that. The new filed
is not serialized when endpoint state is sent over rpc, so it is set by
the rpc handler from the value in the map that is in the rpc message. This
map will not be changed to be host id based to not break interoperability.
2025-03-31 15:42:08 +03:00
Gleb Natapov
9bb2edcae6 serialization: fix std::map de-serializer to not invoke value's default constructor 2025-03-31 15:42:07 +03:00
Gleb Natapov
e5cc3b75f8 gossiper: drop template from wait_alive_helper function
Move ip to id translation to the caller.
2025-03-31 15:42:07 +03:00
Gleb Natapov
0dd86b4f1d gossiper: move get_supported_features and its users to host id 2025-03-31 15:42:07 +03:00
Gleb Natapov
f97bb6922d storage_service: make candidates_for_removal host id based 2025-03-31 15:42:07 +03:00
Gleb Natapov
82491cec19 gossiper: use peers table to detect address change
This requires serializing entire handle_state_normal with a lock since
it both reads and updates peers table now (it only updated it before the
change). This is not a big deal since most of it is already serialized
with token metadata lock. We cannot use it to serialize peers writes
as well since the code that removes an endpoint from peers table also
removes it from gossiper which causes on_remove notification to be called
and it may take the metadata lock as well causing deadlock.
2025-03-31 15:41:44 +03:00
Tomasz Grabiec
6bff596fce tablets: Make tablet allocation equalize per-shard load
Before, it was equalizing per-node load (tablet count), which is wrong
in heterogenous clusters. Nodes with fewer shards will end up with
overloaded shards.

Refs #23378
2025-03-31 14:34:30 +02:00
Gleb Natapov
1c2a9257e9 storage_service: use std::views::keys instead of std::views::transform that returns a key 2025-03-31 15:25:39 +03:00
Gleb Natapov
a581a99dbf gossiper: move _pending_mark_alive_endpoints to host id
Index _pending_mark_alive_endpoints map by host id instead of ip
2025-03-31 15:25:39 +03:00
Gleb Natapov
555149c153 gossiper: do not allow to assassinate endpoint in raft topology mode
It does nothing but harm in raft topology mode.
2025-03-31 15:25:39 +03:00
Gleb Natapov
4cc1c10035 gossiper: fix indentation after previous patch 2025-03-31 15:25:39 +03:00
Gleb Natapov
e8b7aaa0d4 gossiper: do not allow to assassinate non existing endpoint
We assume that all endpoint states have HOST_ID set or the host id is
available locally, but the assassinate code injects a state without
HOST_ID for not existing endpoint violating this assumption.
2025-03-31 15:25:39 +03:00
Botond Dénes
90c20858ed Merge 'test/database: Remove most of take_snapshot() helper overloads and re-use them more' from Pavel Emelyanov
This helper facilitate snapshot creation by various test cases in database_test.cc. This PR generalizes all overloads into one that suits all callers and patches one more test case to use it as well.

Closes scylladb/scylladb#23482

* github.com:scylladb/scylladb:
  test/database: Re-use take_snapshot() helper once more
  test/database: Remove most of take_snapshot() helper overloads
2025-03-31 15:20:51 +03:00
Benny Halevy
5f2ce0b022 loading_cache_test: test_loading_cache_reload_during_eviction: use manual_clock
Rather than lowres_clock, as since
32b7cab917,
loading_cache_for_test uses manual_clock for timing
and relying on lowres_clock to time the test might
run out of memory on fast test machines.

Fixes #23497

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#23498
2025-03-31 14:53:06 +03:00
Robert Bindar
e3a3508960 Move object_storage.yaml endpoints to scylla.yaml
This change also removes the `object_storage.yaml` file
altogether and adds tests for fetching the endpoints
via the `v2/config/object_storage_endpoints` REST api.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2025-03-31 13:39:39 +03:00
Pavel Emelyanov
ac582efb44 test/database: Re-use take_snapshot() helper once more
There's a test case that can call the recently patched take_snapshot()
helper as well. This changes nothing, but makes further patching a bit
simpler (not in this branch).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-31 13:18:06 +03:00
Pavel Emelyanov
7e6380b6bd test/database: Remove most of take_snapshot() helper overloads
There are 3 of those that help tests (re)shuffle cql_test_env/database,
skip_flush == true/false options and keyspace/table/snapshot names.
There's little sense in having that many of those, just one overload
with default arguments suits most of the callers.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-31 13:18:06 +03:00
Botond Dénes
ea55eed037 Merge 'Snapshot several tables at once in scrub API handler' from Pavel Emelyanov
The scrub API handler may want to snapshot several tables. For that, it calls snapshot-ctl method to snapshot a single table for each table in the list. That's excessive, snapshot-ctl has a method to snapshot a bunch of tables at once, just what the scrub handler needs.

It's an improvement, so no need to backport

Closes scylladb/scylladb#23472

* github.com:scylladb/scylladb:
  snapshot-ctl: Remove unused snapshot-single-table method
  api: Snapshot all tables at once in scrub handler
2025-03-31 13:00:32 +03:00
Piotr Smaron
aff8cbc6f3 CODEOWNERS: remove expired owners
Removing krzaq, who's no longer with the company.
Removing core-frontend team members from Alternator areas, as it's no
longer the domain of this team.

Closes scylladb/scylladb#23500
2025-03-31 11:37:51 +03:00
Pavel Emelyanov
0077acd1bb api: Properly validate table in tablet add|del replica handlers
The handlers in question just go and call database.find_column_family,
in case the table in question doesn't exist, the no_such_column_family
exception would be thrown, which is not nice. Proper behavior is to
throw bad_param one and there's a helper that does it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23389
2025-03-31 10:03:17 +02:00
Andrzej Jackowski
c89d8c6566 cql3: prevent from empty option use in cf_statement::column_family()
Implementation of cf_statement::column_family() dereferences _cf_name
option without checking if the option is non-empty. On enterprise
branch, there is a safeguard that prevents from such an empty option
dereferencing. Although the current code on master seems to not call
columny_family() when _cf_name is empty, it is safer to introduce the
same workaround on master, to avoid any regression.

This change:
 - Prevent from empty option use in cf_statement::column_family()

Fixes: scylla-enterprise#5273

Closes scylladb/scylladb#23366
2025-03-31 09:43:22 +03:00
Michał Chojnowski
e23fdc0799 table: fix a race in table::take_storage_snapshot()
`safe_foreach_sstable` doesn't do its job correctly.

It iterates over an sstable set under the sstable deletion
lock in an attempt to ensure that SSTables aren't deleted during the iteration.

The thing is, it takes the deletion lock after the SSTable set is
already obtained, so SSTables might get unlinked *before* we take the lock.

Remove this function and fix its usages to obtain the set and iterate
over it under the lock.

Closes scylladb/scylladb#23397
2025-03-31 09:40:32 +03:00
Avi Kivity
2b9e1e61d0 docs: reader_concurrency_semaphore: document CPU concurrency limit
Document the CPU concurrency implemented in 3d816b7c16
and adjusted in 3d12451d1f.

Closes scylladb/scylladb#23404
2025-03-31 09:39:55 +03:00
Dawid Mędrek
b0b0c5905e test/cluster/test_multidc: Clean up RF-rack-valid keyspaces tests
There are some minor things we should fix that are a remnant
of the original changes (scylladb/scylladb@7646e14).

Closes scylladb/scylladb#23429
2025-03-31 09:38:42 +03:00
David Garcia
1a7be07b8c docs: renders os-support from json file
docs: renders os-support from json file

Closes scylladb/scylladb#23436
2025-03-31 09:36:49 +03:00
Marcin Maliszkiewicz
e3f2ebd4fb cql3: remove not needed cmd copy in indexed_table_select_statement
It's not used variable. There should be a tiny perf increase as
it saves allocation.

Closes scylladb/scylladb#23473
2025-03-31 09:34:32 +03:00
Avi Kivity
73e4a3c581 sstables: store features early in write path
sstable features indicate that an sstable has some extension, or that
some bug was fixed. They allow us to know if we can rely on certain
properties in a read sstables.

Currently, sstable features are set early in the read path (when we
read the scylla metadata file) and very late in the write path
(when we write the scylla metadata file just before sealing the sstable).

However, we happen to read features before we set them in the write path -
when we resize the bloom filter for a newly written sstable we instantiate
an index reader, and that depends on some features. As a result,
we read a disengaged optional (for the scylla metadata component) as if
it was engaged. This somehow worked so far, but fails with libstdc++
hash table implementation.

Fix it by moving storage of the features to the sstable itself, and
setting it early in the write path.

Fixes #23484

Closes scylladb/scylladb#23485
2025-03-31 09:33:56 +03:00
Pavel Emelyanov
693387bda6 Merge 'test.py: topology: allow to run tests with bare pytest command' from Evgeniy Naydanov
Add possibility to run topology tests using bare pytest command.

To achieve this goal the following changes were made:

- Add fixtures `testpy_testsuite` and `testpy_test` to `test/conftest.py`.
- To build `TestSuite` object we need to discover a corresponding `suite.xml` file.  Do this by walking up thru the fs tree starting from the current test file.
- Run ScyllaClusterManager using pytest fixture if `--manager-api` option is not provided.

And made some refactoring:

- Add path constants to `test` module and use them in different test suites instead of own dups of the same code:
  - TOP_SRC_DIR : ScyllaDB's source code root directory
  - TEST_DIR : the directory with test.py tests and libs
  - BUILD_DIR : directory with ScyllaDB's build artifacts
- Add TestSuite.log_dir attribute as a ScyllaDB's build mode subdir of a path provided using `--tmpdir` CLI argument. Don't use `tmpdir` name because it mixed up with pytest's built-in fixture and `--tmpdir` option itself.
- Change default value for `--tmdir` from `./testlog` to `TOP_SRC_DIR/testlog`
- Refactor `ResourceGather*` classes to use path from a `test` object instead of providing it separately.
- Move modes constants (`all_modes`/`ALL_MODES` and `debug_modes`/`DEBUG_MODES`) to `test` module and remove duplication.
- Move `prepare_dirs()` and `start_3rd_party_services()` from `pylib.util` to`pylib.suite.base` to avoid circular imports.
- In some places refactor to use f-strings for formatting.

Also minor changes related to running with pytest-xdist:

- When run tests in parallel we need to ensure that filenames are unique by adding xdist worker ID to them.
- Pass random seed across xdist workers using env variable.

Closes scylladb/scylladb#22960

* github.com:scylladb/scylladb:
  test.py: async_cql: remove unused event_loop fixture
  test.py: random_failures: make it play well with xdist
  test.py: add xdist worker ID to log filenames
  test.py: topology: run tests using bare pytest command
  test.py: add fixtures for current test suite and test
  test.py: refactor paths constants and options
2025-03-31 09:30:06 +03:00
Benny Halevy
a4aa4d74c1 test/pylib: servers_add: add auto_rack_dc parameter
To quickly populate nodes in a single dc,
each node in its own rack.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-30 19:23:40 +03:00
Benny Halevy
c4dbb11c87 test/pylib: servers_add: support list of property_files
So that a multi-dc/multi-rack cluster can be populated
in a single call.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-30 19:12:39 +03:00
Piotr Smaron
a2bbbc6904 auth: forbid modifying system ks by non-superusers
Before this patch, granting a user MODIFY permissions on ALL KEYSPACES allowed the user to write to system tables, where the user could also set himself to "superuser" granting him all other permissions. After this patch, MODIFY permissions on ALL KEYSPACES is limited only to non-system keyspaces.

Fixes: scylladb/scylladb#23218

Closes scylladb/scylladb#23219
2025-03-30 16:55:04 +03:00
Ferenc Szili
2c9b312b58 test: port of test and reproducer for resurrection during file based streaming
This change ports test/cluster/test_resurrection.py from enterprise to
master. Because the underlying issue deals with file based streaming,
this test was a part of the enterprise repo. It contains the test and
reproducer for the issue described below:

When tablets are migrated with file-based streaming, we can have a situation
where a tombstone is garbage collected before the data it shadows lands. For
instance, if we have a tablet replica with 3 sstables:

1 sstable containing an expired tombstone
2 sstable with additional data
3 sstable containing data which is shadowed by the expired tombstone in sstable 1

If this tablet is migrated, and the sstables are streamed in the order listed
above, the first two sstables can be compacted before the third sstable arrives.
In that case, the expired tombstone will be garbage collected, and data in the
third sstable will be resurrected after it arrives to the pending replica.

The fix for the issue was merged in b66479ea98

This patch only ports the missing test.

Closes scylladb/scylladb#23466
2025-03-30 13:39:40 +03:00
Andrzej Jackowski
b8adbcbc84 audit: fix empty query string in BATCH query
Function modification_statement::add_raw() is never called, which
makes query string in audit_info of batch queries empty. In enterprise
branch, add_raw is called in Cql.g and those changes were never merged
to master.

This changes:
 - Add missing call of add_raw() to Cql.g
 - Include other related changes (from PR#3228 in scylla-enterprise)

Fixes scylladb#23311

Closes scylladb/scylladb#23315
2025-03-30 13:37:11 +03:00
Michał Chojnowski
79a477ecb6 cmake: add the -dynamic-linker=... form to the -dynamic-linker regex
On my system (Nix), the compiler produces a `-dynamic-linker=/nix/store/...` in
the linker call scanned by get_padded_dynamic_linker_option.
But the regex can't deal with the `=` there, it requires a ` `. Fix that.

We also do the same in configure.py, and remove the Nix-specific hack
which used to disable the entire mechanism.

Closes scylladb/scylladb#22308
2025-03-30 11:58:47 +03:00
Kefu Chai
7814f6d374 github: improve seastar bad include check
for better developer experience:

- add inline annotations using problem matchers, see
  https://github.com/actions/toolkit/blob/main/docs/problem-matchers.md
- use a single step for uploading both output files, because the `path`
  setting is actually passed to
  [@actions/glob](https://github.com/actions/toolkit/tree/main/packages/glob),
  i removed the double quotes and the leading "./"
  from the paths.
- use "::error" workflow command to signify the failure, see
  https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/workflow-commands-for-github-actions#example-creating-an-annotation-for-an-error

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23310
2025-03-30 11:56:18 +03:00
Evgeniy Naydanov
1a0c14aa50 test.py: async_cql: remove unused event_loop fixture
Newer version of pytest-asyncio (0.24.0) allows to control the scope
of async loop per fixture.  Don't need this workaround anymore.
2025-03-30 03:19:30 +00:00
Evgeniy Naydanov
cac0257914 test.py: random_failures: make it play well with xdist
Pass random seed across xdist workers using env variable.
2025-03-30 03:19:30 +00:00
Evgeniy Naydanov
9bba59631f test.py: add xdist worker ID to log filenames
When run tests in parallel we need to ensure that filenames
are unique by adding xdist worker ID to them.
2025-03-30 03:19:30 +00:00
Evgeniy Naydanov
9cb0ec2b42 test.py: topology: run tests using bare pytest command
Run ScyllaClusterManager using pytest fixture if `--manager-api`
option is not provided.

On this stage we're trying to be as close to test.py as possible.
test.py runs tests file-by-file, so, effectively, scopes `session`,
`package`, and `module` are pretty same.  Also, test.py starts
ScyllaClusterManager for every test module and this is the reason
why fixture `manager_api_sock_path` has scope=`module`.  And, in
result, we need to change scope for fixture `manager_internal` too.
2025-03-30 03:19:29 +00:00
Evgeniy Naydanov
42075170d1 test.py: add fixtures for current test suite and test
Add fixtures `testpy_testsuite` and `testpy_test` to `test/conftest.py`
To build TestSuite object we need to discover a corresponding `suite.xml`
file.  Do this by walking up thru the fs tree starting from the current
test file.
2025-03-30 03:19:29 +00:00
Evgeniy Naydanov
c4ae4e247a test.py: refactor paths constants and options
Add path constants to `test` module and use them in different test suites
instead of own dups of the same code:

 - TOP_SRC_DIR : ScyllaDB's source code root directory
 - TEST_DIR : the directory with test.py tests and libs
 - BUILD_DIR : directory with ScyllaDB's build artefacts

Add TestSuite.log_dir attribute as a ScyllaDB's build mode subdir of a path
provided using `--tmpdir` CLI argument.  Don't use `tmpdir` name because it
mixed up with pytest's built-in fixture and `--tmpdir` option itself.

Change default value for `--tmdir` from `./testlog` to `TOP_SRC_DIR/testlog`

Refactor `ResourceGather*` classes to use path from a `test` object instead of
providing it separately.

Move modes constants to `test` module and remove duplications.

Move `prepare_dirs()` and `start_3rd_party_services()` from `pylib.util` to
`pylib.suite.base` to avoid circular imports (with little refactoring to
use `pathlib.Path` instead of `str` as paths.)

Also, in some places refactor to use f-strings for formatting.
2025-03-30 03:19:29 +00:00
Michał Jadwiszczak
0ee0696959 test/cqlpy/test_service_level_api: update to service levels on raft and remove flakiness
Tests in `test_service_level_api` were written before
scylladb/scylladb#16585 and they were doing 10s sleeps to wait for
service level controller to update its configuration. Now performing
a read barrier is sufficient to ensure SL configuration is up-to-date,
which significantly reduces tests time (from ~60s to ~2-3s).

Moreover, there was flakiness in the `test_switch_tenants` test.
Until now, the test waited up to 60s for the connections to update
their scheduling groups. However, it is difficult to determine
how long the process might take because a connection may be blocked
while waiting for the next request to be processed,
and the scheduling group will be updated only after a request is processed
(see `generic_server::connection::process_until_tenant_switch()`).
To address this issue, 100 simple queries are executed so that
connections on all shards process at least one request
and update their scheduling groups.

Fixes scylladb/scylladb#22768

Closes scylladb/scylladb#23381
2025-03-28 17:14:21 +03:00
Pavel Emelyanov
9aa986a49a snapshot-ctl: Remove unused snapshot-single-table method
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-28 10:45:31 +03:00
Pavel Emelyanov
5162f75d0b api: Snapshot all tables at once in scrub handler
The handler walks the list of tables and snapshots each one individually
(if needed). That's not very optimal, each such call starts a "snapshot
modification operation", which is switching to shard-0 for a lock, then
calls the snapshot of multiple tables giving it vector of a single name.
There's a method of snapshot-ctl that snapshots several tables at once,
no need to open-code it here.

One thing to care about -- the take_column_family_snapshot() throws when
the vector of table names is empty, so need an explicit skipping check.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-28 10:44:47 +03:00
Avi Kivity
6d7cb68aab test: ldap: avoid io_uring Seastar reactor backend
It tends to fail sometimes with ENOMEM:

```
ERROR 2025-03-24 01:05:22,983 [shard 0:sl:d] ldap_role_manager - error in reconnect: std::system_error (error C-Ares:4, server.that.will.never.exist.scylladb.com: Not found)
ERROR 2025-03-24 01:05:30,984 [shard 0:sl:d] ldap_role_manager - error in reconnect: std::system_error (error C-Ares:4, server.that.will.never.exist.scylladb.com: Not found)
ERROR 2025-03-24 01:05:47,123 [shard 0:main] storage_service - Shutting down communications due to I/O errors until operator intervention: Disk error: std::system_error (error system:12, Cannot allocate memory)
ERROR 2025-03-24 01:05:47,139 [shard 0:main] table - failed to write sstable /scylladir/testlog/x86_64/debug/scylla-33787f64/system_schema/view_virtual_columns-08843b6345dc3be29798a0418295cfaa/me-3got_1s5n_0lfls1y4z7vkkts07a-big-Data.db: storage_io_error (Storage I/O error: 12: Cannot allocate memory)
ERROR 2025-03-24 01:05:47,140 [shard 0:main] table - Memtable flush failed due to: storage_io_error (Storage I/O error: 12: Cannot allocate memory). Aborting, at 0x30f5605 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x4514f14 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x4514b96 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x45165b1 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x4518dcf 0x3fde842 0x35dc5c6 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36c26ed /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36cdd0c /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36d2cd2 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36d0e56 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x327f47a /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x327c8f0 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1cdd4 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1c79c /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1c69c /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1c184 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x34b2674 0x314b8b6 /lib64/libc.so.6+0x70ba7 /lib64/libc.so.6+0xf4b8b
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)> >(seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)> >(seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>
   --------
   seastar::shared_future<>::shared_state
Aborting on shard 0, in scheduling group main.
Backtrace:
  0x30f5605
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x384a0e4
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x3849db2
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x369bd84
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36d42a2
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x37a5ed9
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x37a61d5
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x37a601f
  /lib64/libc.so.6+0x1a04f
  /lib64/libc.so.6+0x72b53
  /lib64/libc.so.6+0x19f9d
  /lib64/libc.so.6+0x1941
  0x3fde8b1
  0x35dc5c6
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36c26ed
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36cdd0c
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36d2cd2
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36d0e56
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x327f47a
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x327c8f0
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1cdd4
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1c79c
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1c69c
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1c184
  /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x34b2674
  0x314b8b6
  /lib64/libc.so.6+0x70ba7
  /lib64/libc.so.6+0xf4b8b
=== TEST.PY SUMMARY START ===
Test exited with code -6

=== TEST.PY SUMMARY END ===

=== decoded ===
Backtrace:
[Backtrace #0]
__interceptor_backtrace at /mnt/clang_build/llvm-project-x86_64/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors.inc:4369
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at ./build/debug/seastar/./seastar/include/seastar/util/backtrace.hh:70
seastar::backtrace_buffer::append_backtrace() at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:805
seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:838
seastar::print_with_backtrace(char const*, bool) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:850
seastar::sigabrt_action() at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:4004
seastar::install_oneshot_signal_handler<6, (void (*)())(&seastar::sigabrt_action)>()::{lambda(int, siginfo_t*, void*)#1}::operator()(int, siginfo_t*, void*) const at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:3981
seastar::install_oneshot_signal_handler<6, (void (*)())(&seastar::sigabrt_action)>()::{lambda(int, siginfo_t*, void*)#1}::__invoke(int, siginfo_t*, void*) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:3976
/lib64/libc.so.6: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=c8c3fa52aaee3f5d73b6fd862e39e9d4c010b6ba, for GNU/Linux 3.2.0, not stripped

?? ??:0
printf_positional at ??:?
?? ??:0
?? ??:0
replica::table::seal_active_memtable(replica::compaction_group&, replica::flush_permit&&)::$_0::operator()(std::function<seastar::future<void> ()>) const at ././replica/table.cc:1512
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242
 (inlined by) seastar::internal::coroutine_traits_base<void>::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:122
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:2616
seastar::reactor::run_some_tasks() at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:3088
seastar::reactor::do_run() at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:3256
seastar::reactor::run() at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:3146
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/app-template.cc:276
seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/app-template.cc:167
seastar::testing::test_runner::start_thread(int, char**)::$_0::operator()() at ./build/debug/seastar/./build/debug/seastar/./seastar/src/testing/test_runner.cc:77
void std::__invoke_impl<void, seastar::testing::test_runner::start_thread(int, char**)::$_0&>(std::__invoke_other, seastar::testing::test_runner::start_thread(int, char**)::$_0&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:61
std::enable_if<is_invocable_r_v<void, seastar::testing::test_runner::start_thread(int, char**)::$_0&>, void>::type std::__invoke_r<void, seastar::testing::test_runner::start_thread(int, char**)::$_0&>(seastar::testing::test_runner::start_thread(int, char**)::$_0&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:111
std::_Function_handler<void (), seastar::testing::test_runner::start_thread(int, char**)::$_0>::_M_invoke(std::_Any_data const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:290
seastar::posix_thread::start_routine(void*) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/posix.cc:90
asan_thread_start(void*) at /mnt/clang_build/llvm-project-x86_64/compiler-rt/lib/asan/asan_interceptors.cpp:239
__vfscanf_internal at :?
peek_token at ??:?
```

In ce65164315, we banned io_uring from tests, but missed the ldap
tests. This extends coverage to ldap tests.

I verified that the new options indeed reach the test.

Refs #23411.

Credit to Botond for recognizing the failure reason.

Closes scylladb/scylladb#23422
2025-03-28 07:45:53 +02:00
Tomasz Grabiec
d6232a4f5f tablets: load_balancer: Fix reporting of total load per node
Load is now utilization, not count, so we should report average
per-shard load, which is equivalent to node's utilization.
2025-03-27 23:28:20 +01:00
Botond Dénes
bd8973a025 tools/scylla-nodetool: s/GetInt()/GetInt64()/
GetInt() was observed to fail when the integer JSON value overflows the
int32_t type, which `GetInt()` uses for storage. When this happens,
rapidjson will assign a distinct 64 bit integer type to the value, and
attempting to access it as 32 bit integer triggers the wrong-type error,
resulting in assert failure. This was hit on the field where invoking
nodetool netstats resulted in nodetool crashing when the streamed bytes
amounts were higher than maxint.

To avoid such bugs in the future, replace all usage of GetInt() in
nodetool of GetInt64(), just to be sure.

A reproducer is added to the nodetool netstats crash.

Fixes: scylladb/scylladb#23394

Closes scylladb/scylladb#23395
2025-03-27 14:05:39 +02:00
Botond Dénes
d57e71837f Merge 'Improve scoped restore test' from Pavel Emelyanov
This PR includes several fixes to the nowadays flaky test_restore_with_streaming_scopes test.

1. Check that backup and restore APIs don't fail. Currently, if either of them does the test cases fails anyway checking that the data is not restored back, but it's better to know what exactly failed

2. For restore API the test collects the list of sstables to restore from. Currently collecting this list races with background compaction and sometimes leads to restore API to fail which, in turn, makes the whole test to fail

3. Add a test case that validates that restore-from-missing-sstable fails nicely

refs: #23189
No backport, as it's a relatively new test

Closes scylladb/scylladb#23445

* github.com:scylladb/scylladb:
  test/backup: Validate that restoring from non-existing sstables fails
  test/backup: Collect sstables names after snapshot
  test/backup: Check that backup and restore succeed
2025-03-27 13:23:41 +02:00
Piotr Dulikowski
288216a89e Merge 'Ignore wrapped exceptions gate_closed_exception and rpc::closed_error when node shuts down.' from Sergey Zolotukhin
Normally, when a node is shutting down, `gate_closed_exception` and `rpc::closed_error`
in `send_to_live_endpoints` should be ignored. However, if these exceptions are wrapped
in a `nested_exception`, an error message is printed, causing tests to fail.

This commit adds handling for nested exceptions in this case to prevent unnecessary
error messages.

Fixes scylladb/scylladb#23325
Fixes scylladb/scylladb#23305
Fixes scylladb/scylladb#21815

Backport: looks like this is quite a frequent issue, therefore backport to 2025.1.

Closes scylladb/scylladb#23336

* github.com:scylladb/scylladb:
  database: Pass schema_ptr as const ref in `wrap_commitlog_add_error`
  database: Unify exception handling in `do_apply` and `apply_with_commitlog`
  storage_proxy: Ignore wrapped `gate_closed_exception` and `rpc::closed_error` when node shuts down.
  exceptions: Add `try_catch_nested` to universally handle nested exceptions of the same type.
2025-03-27 11:39:42 +01:00
Pavel Emelyanov
9f036d957a Merge 'test/clqpy/test_tool.py: get_sstables_for_table(): exclude non-sealed sstables' from Botond Dénes
Filter out sstables which don't have a TOC or have a temporary TOC. Such sstables are incomplete and can dissapear if the compaction which writes them is interrupted.

Fixes: #23203

This PR fixes a flaky test which is only on master, no backports required.

Closes scylladb/scylladb#23450

* github.com:scylladb/scylladb:
  test/cqlpy/test_tools.py: test_scylla_sstable_query: reduce scope of no-compaction context
  test/clqpy/test_tool.py: get_sstables_for_table(): exclude non-sealed sstables
2025-03-27 09:45:07 +03:00
Tomasz Grabiec
8e506c5a8f test: tablets: Fix flakiness due to ungraceful shutdown
The test fails sporadically with:

cassandra.ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed for test3.test2 - received 1 responses and 1 failures from 2 CL=QUORUM." info={'consistency': 'QUORUM', 'required_responses': 2, 'received_responses': 1, 'failures': 1}

That's becase a server is stopped in the middle of the workload.

The server is stopped ungracefully which will cause some requests to
time out. We should stop it gracefully to allow in-flight requests to
finish.

Fixes #20492

Closes scylladb/scylladb#23451
2025-03-27 09:44:07 +03:00
Lakshmi Narayanan Sreethar
dccce670c1 topology_coordinator: fix indentation in generate_migration_updates
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-03-27 10:16:34 +05:30
Lakshmi Narayanan Sreethar
5b47d84399 topology_coordinator: do not schedule migrations when there are pending resize finalizations
Resize finalization is executed in a separate topology transition state,
`tablet_resize_finalization`, to ensure it does not overlap with tablet
transitions. The topology transitions into the
`tablet_resize_finalization` state only when no tablet migrations are
scheduled or being executed. If there is a large load-balancing backlog,
split finalization might be delayed indefinitely, leaving the tables
with large tablets.

To fix this, do not schedule tablet migrations on any tables when there
are pending resize finalizations. This ensures that migrations from the
same table and other unrelated tables do not block resize finalization.

Also added a testcase to verify the fix.

Fixes #21762

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-03-27 10:16:34 +05:30
Lakshmi Narayanan Sreethar
8cabc66f07 load_balancer: make repair plans only when there is no pending resize finalization
Do not make repair plans if any table has pending resize finalization.
This is to ensure that the finalization doesn't get delayed by reapir
tasks.

Refs #21762

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-03-27 10:16:34 +05:30
Avi Kivity
b292b5800b Merge 'test.py: move starting LDAP service to dedicate method' from Andrei Chekun
Move starting LDAP to the method where the rest of the services are started. This will unify the way of starting the 3rd party services.
Fix LDAP tests flakiness due not possible to connect to LDAP server.
Add catching stdout and stderr of toxiproxy-cli in case of errors

Related: https://github.com/scylladb/scylladb/pull/23333

This PR is based on https://github.com/scylladb/scylladb/pull/23221, so #23221 should be merged first.

Closes scylladb/scylladb#23235

* github.com:scylladb/scylladb:
  test.py: Refactor nodetool/conftest
  test.py: Refactor test/pylib/cpp/ldap
  test.py: move starting LDAP service to dedicate method
2025-03-26 15:31:00 +02:00
Botond Dénes
801339bad9 test/cqlpy/test_tools.py: test_scylla_sstable_query: reduce scope of no-compaction context
To just system.local, the table these tests operate on. No need to
disable autocompaction for all of the system keyspace.
2025-03-26 09:19:38 -04:00
Botond Dénes
3ec863c4ce test/clqpy/test_tool.py: get_sstables_for_table(): exclude non-sealed sstables
Filter out sstables which don't have a TOC or have a temporary TOC. Such
sstables are incomplete and can dissapear if the compaction which writes
them is interrupted.
2025-03-26 09:18:34 -04:00
Pavel Emelyanov
1da889f239 Merge 'Allow abort during join_cluster' from Benny Halevy
Bootstrap or replace can take a long time, but
since feef7d3fa1,
the stop_signal is checked only in checkpoints,
and in particular, abort isn't requested during
join_cluster.

Fixes #23222

* requires backport on top of https://github.com/scylladb/scylladb/pull/23184

Closes scylladb/scylladb#23306

* github.com:scylladb/scylladb:
  main: allow abort during join_cluster
  main: add checkpoint before joining cluster
  storage_service: add start_sys_dist_ks
2025-03-26 15:48:58 +03:00
Sergey Zolotukhin
d448f3de77 database: Pass schema_ptr as const ref in wrap_commitlog_add_error 2025-03-26 11:15:26 +01:00
Sergey Zolotukhin
0d9d0fe60e database: Unify exception handling in do_apply and apply_with_commitlog
Move exception wrapping logic from `do_apply` and `apply_with_commitlog`
to `wrap_commitlog_add_error` to ensure consistent error handling.
2025-03-26 11:15:18 +01:00
Sergey Zolotukhin
b1e89246d4 storage_proxy: Ignore wrapped gate_closed_exception and rpc::closed_error when node shuts down.
Normally, when a node is shutting down, `gate_closed_exception` and `rpc::closed_error`
in `send_to_live_endpoints` should be ignored. However, if these exceptions are wrapped
in a `nested_exception`, an error message is printed, causing tests to fail.

This commit adds handling for nested exceptions in this case to prevent unnecessary
error messages.

Fixes scylladb/scylladb#23325
2025-03-26 11:15:16 +01:00
Sergey Zolotukhin
6abfed9817 exceptions: Add try_catch_nested to universally handle nested exceptions of the same type. 2025-03-26 11:15:13 +01:00
Evgeniy Naydanov
574c81eac6 test.py: random_failures: deselect topology ops for some injections
After recent changes #18640 and #19151 started to reproduce for
stop_after_sending_join_node_request and
stop_after_bootstrapping_initial_raft_configuration error injections too.

The solution is the same: deselect the tests.

Fixes #23302

Closes scylladb/scylladb#23405
2025-03-26 12:07:12 +03:00
Pavel Emelyanov
38f37763d6 test/backup: Validate that restoring from non-existing sstables fails
When restore API is called and is given a non-existing sstable (object
name) the task should complete with failed status and some meaningful
message in the error text.

refs: #23189

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-26 10:55:42 +03:00
Pavel Emelyanov
02610a9072 test/backup: Collect sstables names after snapshot
The scoped restoer test works like this

- populate table
- flush it
- collect list of sstables
- take snapshot
- backup
- restore (with the list of sstables as argument)
- check the data is back

Steps 2 and 3 are racy -- in case compaction comes in the middle, the
list of collected sstables would differ from those snapshotted (and
backuped) which will later lead to restore failure due to missing
sstable.

Fix by collecting the list of sstables after taking snapshot, and
collect those not from the datadir, but from the snapshot dir.

fixes: #23189

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-26 10:40:54 +03:00
Pavel Emelyanov
08004fe470 test/backup: Check that backup and restore succeed
The scoped-restore test calls backup and restore APIs on several nodes,
but doesn't check if any of the operations actually succeeds. Sometimes
they indeed don't and test captures this, but in a weird manner -- the
post-test checks for data presense fails, because the expected data is
not in fact in its place.

It's more debugging-friendly if we know in advance if backup or restore
fails, rather than see that some data is missing after (failed) restore.

refs: #23189

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-25 19:45:56 +03:00
Gleb Natapov
0aa4a82c83 messaging_service: do not call uninitialized _address_to_host_id_mapper std::function
During messaging_service object creation remove_rpc_client function may
be called if prefer_local snitch setting is true. The caller does not
provide host id, so _address_to_host_id_mapper is called to obtain it,
but at this point the function is not initialized yet.

The patch fixes the code to not call the function if not initialized.
This is not the problem since during messaging_service creation there
is no connection to drop.

Fixes: #23353

Message-ID: <Z-J2KbBK8NoFNYZZ@scylladb.com>
2025-03-25 18:41:16 +02:00
Wojciech Mitros
88d3fc68b5 alter_table_statement: fix renaming multiple columns in tables with views
When we rename columns in a table which has materialized views depending
on it, we need to also rename them in the materialized views' WHERE
clauses.
Currently, we do that by creating a new WHERE clause after each rename,
with the updated column. This is later converted to a mutation that
overwrites the WHERE clause. After multiple renames, we have multiple
mutations, each overwriting the WHERE clause with one column renamed.
As a result, the final WHERE clause is one of the modified clauses with
one column renamed.
Instead, we should prepare one new WHERE clause which includes all the
renamed columns. This patch accomplishes this by processing all the
column renames first, and only preparing the new view schema with the
new WHERE clause afterwards.

This patch also includes a test reproducer for this scenario.

Fixes scylladb/scylladb#22194

Closes scylladb/scylladb#23152
2025-03-25 09:58:58 +01:00
Benny Halevy
9fac0045d1 boost/tablets_test: verify failure to create keyspace with tablets and non network replication strategy
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-24 15:39:53 +02:00
Benny Halevy
62aeba759b tablets: enforce tablets using tablets_mode_for_new_keyspaces=enforced config option
`tablets_mode_for_new_keyspaces=enforced` enables tablets by default for
new keyspaces, like `tablets_mode_for_new_keyspaces=enabled`.
However, it does not allow to opt-out when creating
new keyspaces by setting `tablets = {'enabled': false}`.

Refs scylladb/scylla-enterprise#4355

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-24 15:32:16 +02:00
Benny Halevy
c62865df90 db/config: add tablets_mode_for_new_keyspaces option
The new option deprecates the existing `enable_tablets` option.
It will be extended in the next patch with a 3rd value: "enforced"
while will enable tablets by default for new keyspace but
without the posibility to opt out using the `tablets = {'enabled':
false}` keyspace schema option.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-24 14:54:45 +02:00
Michael Litvak
49b8cf2d1d storage_service: fix tablet split of materialized views
This fixes an issue where materialized view tablets are not split
because they are not registered as split candidates by the storage
service.

The code in storage_service::replicate_to_all_cores was changed in
4bfa3060d0 to handle normal tables and view tables separately, but with
that change register_tablet_split_candidate is applied only to normal
tables and not every table like before. We fix it by registering view
tables as well.

We add a test to verify that split of MV tables works.

Closes scylladb/scylladb#23335
2025-03-24 08:23:58 +01:00
Pavel Emelyanov
79b9626d16 Merge 'service: do not include unused headers ' from Kefu Chai
these unused includes were identified by clang-include-cleaner. after auditing these source files, all of the reports have been confirmed. also, updated the "iwyu.yaml" (short for include what you use) workflow to include "service" and "raft" subdirectories to prevent future regressions of including unused headers in them.

---

it's a cleanup, hence no need to backport.

Closes scylladb/scylladb#23373

* github.com:scylladb/scylladb:
  .github: add "raft" and "service" subdirectories to CLEANER_DIR
  service: do not include unused headers
2025-03-24 10:20:15 +03:00
Avi Kivity
cc5fe542ed test: ignore unused fmt::to_string() result
fmt 11.1 apparently marks to_string() as [[nodiscard]]. Here we aren't
interested in the result, so explicitly ignore it to avoid an error.

Closes scylladb/scylladb#23403
2025-03-24 10:19:09 +03:00
Avi Kivity
9d49c3254f install-dependencies.sh: disabiguate python magic package
There are in fact two python magic packages, file-magic (that binds
to libmagic and comes from the file package), magic, an independent
one. The name we use in install-depedencies.sh, python3-magic,
resolves to file-magic.

In Fedora 42, the resolution from the name python3-magic to
file-magic was removed [1], and so install-dependencies.sh now tries
to install the wrong magic package, which turns out not to coexist
with the one we want anyway.

Fix by naming python3-file-magic directly instead. Since this is what's
installed in the current frozen toolchain, there's no need to
regenerate it; we're just making the package list work in Fedora 42.

[1] 81910b7d88

Closes scylladb/scylladb#23402
2025-03-24 10:18:27 +03:00
Avi Kivity
cd04ab1a4e test: avoid spaces when defining user-defined literal operator
Clang 20 complains when it sees a user-defined literal operator
defined with a space before the underscore. Assume it's adhering
to the standard and comply.

Closes scylladb/scylladb#23401
2025-03-24 10:17:12 +03:00
Pavel Emelyanov
d436fb8045 Merge 'Fix EAR not applied on write to S3 (but on read).' from Calle Wilund
Fixes #23225
Fixes #23185

Adds a "wrap_sink" (with default implementation) to sstables::file_io_extension, and moves
extension wrapping of file and sink objects to storage level.
(Wrapping/handling on sstable level would be problematic, because for file storage we typically re-use the sstable file objects for sinks, whereas for S3 we do not).

This ensures we apply encryption on both read and write, whereas we previously only did so on read -> fail.
Adds io wrapper objects for adapting file/sink for default implementation, as well as a proper encrypted sink implementation for EAR.

Unit tests for io objects and a macro test for S3 encrypted storage included.

Closes scylladb/scylladb#23261

* github.com:scylladb/scylladb:
  encryption: Add "wrap_sink" to encryption sstable extension
  encrypted_file_impl: Add encrypted_data_sink
  sstables::storage: Move wrapping sstable components to storage provider
  sstables::file_io_extension: Add a "wrap_sink" method.
  sstables::file_io_extension: Make sstable argument to "wrap" const
  utils: Add "io-wrappers", useful IO helper types
2025-03-24 10:12:46 +03:00
Artsiom Mishuta
8bb6414037 test.py: reuse clusters in Python suite
PR https://github.com/scylladb/scylladb/pull/22274 was introduced due to
CI instability and want to mark the cluster dirty after each test for topology
But in fact, affects only Python suites that are quite stable, and CI was
Stabilized by PR https://github.com/scylladb/scylladb/pull/22252

This PR get back cluster reusage in Python test suites

Closes scylladb/scylladb#23179
2025-03-23 20:08:36 +02:00
Kefu Chai
fdc5255eb8 build: disable DPDK for all release builds
Previously, DPDK was enabled by default in standard release builds but disabled
in "release-pgo" and "release-cs-pgo" builds. This inconsistency caused linking
warnings during PGO phase 2, when trained profiles from non-DPDK builds were
used with DPDK-enabled builds:

```
[1980/1983] LINK build/release/scylla
ld.lld: warning: /home/avi/scylla-maint/build/release/seastar/libseastar.a(reactor.cc.o at 57829248): function control flow change detected (hash mismatch) _ZN7seastar7reactor14run_some_tasksEv Hash = 2095857468992035112 up to 0 count discarded
ld.lld: warning: /home/avi/scylla-maint/build/release/seastar/libseastar.a(reactor.cc.o at 57829248): function control flow change detected (hash mismatch) _ZN7seastar7reactor6do_runEv Hash = 2184396189398169723 up to 50134372 count discarded
ld.lld: warning: /home/avi/scylla-maint/build/release/seastar/libseastar.a(reactor.cc.o at 57829248): function control flow change detected (hash mismatch) _ZN7seastar18syscall_work_queue11submit_itemESt10unique_ptrINS0_9work_itemESt14default_deleteIS2_EE Hash = 1533150042646546219 up to 1979931 count discarded
```

Since DPDK is not used in production and increases build time, this
change disables DPDK across all release build types. This both silences
the warnings and improves build performance.

Fixes #23323
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23391
2025-03-23 15:26:10 +02:00
Avi Kivity
9adfb91f46 Merge 'Introduce s3 data_source_impl for optimized object streaming' from Pavel Emelyanov
Currently, to stream data from sstable component the sstables code uses file_data_source_impl. In case the component is on S3, the s3::readable_file is put into that data source. The data source is configured with 128k buffers and at most 4 read-ahead-s. With that configuration, downloading full object from S3 becomes too slow -- GET-ing file with 128k requests is not nice even with 4 parallel read-ahead-s.

Better solution for S3 downloading is to request way larger chunk with one GET and then produce smaller, 128k or alike, buffers upon data arrival. This is what the newly introduced data source impl does -- it spawns a background GET and lets the upper input stream read buffers directly from the arriving body.

This PR doesn't yet make sstable layer use the new sink, just introduces it and adds unit and perf tests.

Testing

|Test|Download speed, MB/s|
|-|-|
|file_input_stream (*), 1 socket | 4.996|
|file_input_stream (*), 2 sockets | 9.403|
|s3_data_source (**) | 93.164|

(*) The file_input_stream test renders 128k GETs and is configured to issue at most 4 read-ahead-s
(**) The s3_data_source uses at most 1 socket regardless of what perf-test configures it to

refs: #22458

Closes scylladb/scylladb#22907

* github.com:scylladb/scylladb:
  test: Extend s3-perf test with stream download one
  test/perf: Tune-up s3 test options parsing
  test: Add unit test for newly introduced download source
  s3/client: Introduce data_source_impl for object downloading
  s3/client: Detach format_range_header() helper
2025-03-23 14:22:04 +02:00
Pavel Emelyanov
ca3b604afa test: Extend s3-perf test with stream download one
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-21 12:01:07 +03:00
Pavel Emelyanov
283e8e0706 test/perf: Tune-up s3 test options parsing
Rename the `--upload bool` into `--operation string` one, so that new
tests can be added in the future. Also rename run_download() to
run_contiguous_get() because this is what the internals of this method
do -- just GET contiguous ranges sequentially.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-21 12:01:07 +03:00
Pavel Emelyanov
bd313c581f test: Add unit test for newly introduced download source
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-21 12:01:06 +03:00
Pavel Emelyanov
1f301b1c5d s3/client: Introduce data_source_impl for object downloading
The new data source implementation runs a single GET for the whole range
specified and lends the body input_stream for the upper input_stream's
get()-s. Eventually, getting the data from the body stream EOFs or
fails. In either case, the existing body is closed and a new GET is
spawn with the updater Range header so that not to include the bytes
read so far.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-21 12:01:06 +03:00
Pavel Emelyanov
d47719f70e s3/client: Detach format_range_header() helper
The get_object_contiguous() formats the 'bytes=X-Y' one for its GET
request. The very same code will be needed by next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-21 12:01:06 +03:00
Avi Kivity
7646e1448a Merge 'cql3: Introduce RF-rack-valid keyspaces' from Dawid Mędrek
This PR is an introductory step towards enforcing
RF-rack-valid keyspaces in Scylla.

The scope of changes:
* defining RF-rack-valid keyspaces,
* introducing a configuration option enforcing RF-rack-valid
  keyspaces,
* restricting the CREATE and ALTER KEYSPACE statements
  so that they never lead to RF-rack invalid keyspaces,
* during the initialization of a node, it verifies that all existing
  keyspaces are RF-rack-valid. If not, the initialization fails.

We provide tests verifying that the changes behave as intended.

---

Note that there are a number of things that still need to be implemented.
That includes, for instance, restricting topology operations too.

---

Implementation strategy (going beyond the scope of this PR):

1. Introduce the new configuration option `rf_rack_valid_keyspaces`.
2. Start enforcing RF-rack-validity in keyspaces if the option is enabled.
3. Adjust the tests: in the tree and out of it. Explicitly enable the option in all tests.
4. Once the tests have been adjusted, change the default value of the option to enabled.
5. Stop explicitly enabling the option in tests.
6. Get rid of the option.

---

Fixes scylladb/scylladb#20356
Fixes scylladb/scylladb#23276
Fixes scylladb/scylladb#23300

---

Backport: this is part of the requirements for releasing 2025.1.

Closes scylladb/scylladb#23138

* github.com:scylladb/scylladb:
  main: Refuse to start node when RF-rack-invalid keyspace exists
  cql3: Ensure that CREATE and ALTER never lead to RF-rack-invalid keyspaces
  db/config: Introduce RF-rack-valid keyspaces
2025-03-20 19:10:36 +02:00
Paweł Zakrzewski
0d14177409 audit/syslog: escape quotes and add explicit section names
Before this change we outputted CSV-like structure, that looked like the
following:
Feb 27 12:31:30 scylla-audit: "10.200.200.41:0", "AUTH", "", "", "", "", "10.200.200.41:0", "cassandra", "false"

While this is passably readable for humans, the ordering of fields is
not clear and can be confusing. Furthermore, the `"` character (double
quote) was not escaped. This is not an issue for CQL, but will be a
problem for auditing Alternator, which will require logging JSON
payloads.

The new format will consist of key=value pairs and will escape the quote
character, making it easy to parse programmatically.
Feb 28 02:21:56 scylla-audit: node="10.200.200.41:0", category="AUTH", cl="", error="false", keyspace="", query="", client_ip="10.200.200.41:0", table="", username="cassandra"

This is required for the auditing alternator feature.

Closes scylladb/scylladb#23099
2025-03-20 19:55:51 +03:00
Calle Wilund
5c6337b887 encryption: Add "wrap_sink" to encryption sstable extension
Creates a more efficient data_sink wrapper for encrypted output
stream (S3).
2025-03-20 14:54:24 +00:00
Calle Wilund
9ac9813c62 encrypted_file_impl: Add encrypted_data_sink
Adds a sibling type to encrypted file, a data_sink, that
will write a data stream in the same block format as a file
object would. Including end padding.

For making encrypted data sink writing less cumbersome.
2025-03-20 14:54:24 +00:00
Calle Wilund
e02be77af7 sstables::storage: Move wrapping sstable components to storage provider
Fixes #23225
Fixes #23185

Moved wrapping component files/sinks to storage provider. Also ensures
to wrap data_sinks as well as actual files. This ensures that we actually
write encryption if active.
2025-03-20 14:54:24 +00:00
Calle Wilund
d46dcbb769 sstables::file_io_extension: Add a "wrap_sink" method.
Similar to wrap file, should wrap a data_sink (used for
sstable writers), in obvious write-only, simple stream
mode.

Default impl will detect if we wrap files for this component,
and if so, generate a file wrapper for the input sink, wrap
this, and the wrap it in a file_data_sink_impl.

This is obviously not efficient, so extensions used in actual
non-test code should implement the method.
2025-03-20 14:54:22 +00:00
Calle Wilund
e100af5280 sstables::file_io_extension: Make sstable argument to "wrap" const
This matches the signature of call sites. Since the only "real"
extension to actually make a marker in the sstable will do so in
the scylla component, which is writable even in a const sstable,
this is ok.
2025-03-20 14:54:09 +00:00
Calle Wilund
98a6d0f79c utils: Add "io-wrappers", useful IO helper types
Mainly to add a somewhat functional file-impl wrapping
a data_sink. This can implement a rudimentary, write-only,
file based on any output sink.

For testing, and because they fit there, place memory
sink and source types there as well.
2025-03-20 14:54:09 +00:00
David Garcia
209ea2ea27 docs: update issues label
Closes scylladb/scylladb#23304
2025-03-20 17:46:58 +03:00
Kefu Chai
c37149d106 test: stop using seastar::at_exit()
seastar::at_exit() was marked deprecated recently. so let's use
the recommended approach to perform cleanups.

following tests were updated in this changes

- scylla perf-tablets: tested with
  scylla perf-tablets
- scylla perf-row-cache-update: tested with
  scylla perf-row-cache-update
- scylla perf-fast-forward: tested with
  scylla perf-fast-forward --populate --run-tests small-partition-skips \
    --smp 1
  scylla perf-fast-forward --run-tests small-partition-skips \
    --smp 1
- scylla perf-load-balancing: tested with
  scylla perf-load-balancing --nodes 3 --tablets1 16 --tablets2 16 --rf1 3 --rf2 3 --shards 16
- unit/row_cache_stress_test: tested with
  row_cache_stress_test --seconds 10
- perf/perf_cache_eviction: tested with
  ./perf_cache_eviction --seconds 1 --smp 1
- perf/perf_row_cache_reads: tested with
  ./perf_row_cache_reads

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23356
2025-03-20 17:44:57 +03:00
Ernest Zaslavsky
2fb5c7402e s3_client: Rearrange credentials providers chain
As the IAM role is not configured to assume a role at this moment, it
makes sense to move the instance metadata credentials provider up in
the chain. This avoids unnecessary network calls and prevents log
clutter caused by failure messages.

Closes scylladb/scylladb#23360
2025-03-20 17:43:04 +03:00
Pavel Emelyanov
23089e1387 Merge 'Enhance S3 client robustness' from Ernest Zaslavsky
This PR introduces several key improvements to bolster the reliability of our S3 client, particularly in handling intermittent authentication and TLS-related issues. The changes include:

1. **Automatic Credential Renewal and Request Retry**: When credentials expire, the new retry strategy now resets the credentials and set the client to the retryable state, so the client will re-authenticate, and automatically retry the request. This change prevents transient authentication failures from propagating as fatal errors.
2. **Enhanced Exception Unwrapping**: The client now extracts the embedded std::system_error from std::nested_exception instances that may be raised by the Seastar HTTP client when using TLS. This allows for more precise error reporting and handling.
3. **Expanded TLS Error Handling**: We've added support for retryable TLS error codes within the std::system_error handler. This modification enables the client to detect and recover from transient TLS issues by retrying the affected operations.

Together, these enhancements improve overall client robustness by ensuring smoother recovery from both credential and TLS-related errors.

No backport needed since it is an enhancement

Closes scylladb/scylladb#22150

* github.com:scylladb/scylladb:
  aws_error: Add GNU TLS codes
  s3_client: Handle nested std::system_error exceptions
  s3_client: Start using new retry strategy
  retry_strategy: Add custom retry strategy for S3 client
  retry_strategy: Make `should_retry` awaitable
2025-03-20 16:52:20 +03:00
Andrei Chekun
502b31d9c2 test.py: Refactor nodetool/conftest
Remove using method for finding root dir of the project and start using
the constant defined in package.
2025-03-20 11:41:30 +01:00
Andrei Chekun
1ea7b99385 test.py: Refactor test/pylib/cpp/ldap
Rename and move prepare_instance from ldap tests directory to
pylib/ldap_server.
2025-03-20 11:41:30 +01:00
Andrei Chekun
33e53565c4 test.py: move starting LDAP service to dedicate method
Move starting LDAP to the method where the rest of the services are
started. This will unify the way of starting the 3rd party services.
Fix LDAP tests flakiness due not possible to connect to LDAP server
Add catching stdout and stderr of toxiproxy-cli in case of errors
2025-03-20 11:37:04 +01:00
Pavel Emelyanov
339a849f13 transport: Remove connection::make_client_key()
It's effectively unused, there's one place where connection initializes
the client_data object using this helper, but that initialization looks
better without it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23321
2025-03-20 10:22:05 +01:00
Calle Wilund
5cc3fc4f14 cluster/test_encryption: bring test from enterprise (and enable)
Fixes scylladb/scylla-enterprise#5262

Part of the source-available code migration from scylla-enterprise.git
to scylla.git.

Original comment: topology_custom: add test_file_streaming_respects_encryption

Reproducer for issue scylladb/scylla-enterprise#4246.

Closes scylladb/scylladb#23320
2025-03-20 10:07:16 +02:00
Kefu Chai
ebf9125728 storage_proxy: Prevent integer overflow in abstract_read_executor::execute
Fix UBSan abort caused by integer overflow when calculating time difference
between read and write operations. The issue occurs when:
1. The queried partition on replicas is not purgeable (has no recorded
   modified time)
2. Digests don't match across replicas
3. The system attempts to calculate timespan using missing/negative
   last_modified timestamps

This change skips cross-DC repair optimization when write timestamp is
negative or missing, as this optimization is only relevant for reads
occurring within write_timeout of a write.

Error details:
```
service/storage_proxy.cc:5532:80: runtime error: signed integer overflow: -9223372036854775808 - 1741940132787203 cannot be represented in type 'int64_t' (aka 'long')
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior service/storage_proxy.cc:5532:80
Aborting on shard 1, in scheduling group sl:default
```

Related to previous fix 39325cf which handled negative read_timestamp cases.

Fixes #23314
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23359
2025-03-20 10:05:42 +02:00
Botond Dénes
d06bc27979 Merge 'Don't export string filenames from sstable' from Pavel Emelyanov
There are several sstring-returning methods on class sstable that return paths to files. Mostly these are used to print them into logs, sometimes are used to be put into exception messages. And there are places that use these strings as file names. Since now sstables can also be stored on S3, generic code shouldn't consider those strings as on disk file names.

Other than that, even when the methods are used to put component names into logs, in many cases these log messages come with debug or trace level, so generated strings are immediately dropped on the floor, but generating it is not extremely cheap. Code would benefit from using lazily-printed names.

This change introduces the component_name struct that wraps sstable reference and component ID (which is a numerical enum of several items). When printed, the component_name formatter calls the aforementioned filename generation, thus implementing lazy printing. And since there's no automatic conversion of component_name-s into strings, all the code that treats them as file paths, becomes explicit.

refs: #14122 (previous ugly attempt to achieve the same goal)

Closes scylladb/scylladb#23194

* github.com:scylladb/scylladb:
  sstable: Remove unused malformed_sstable_exctpion(string filename)
  sstables: Make filename() return component_name
  sstables: Make file_writer keep component_name on board
  sstables: Make get_filename() return component_name
  sstables: Make toc_filename() return component_name
  sstables: Make sstable::index_filename() return component_name
  sstables: Introduce struct component_name
  sstables: Remove unused sstable::component_filenames() method
  sstables: Do not print component filenames on load-and-stream wrap-up
  sstables: Explicitly format prefix in S3 object name making
  sstables: Don't include directory name in exception
  sstables: Use fmt::format instead of string concatenation
  sstables: Rename filename($component) calls to ${component}_filename()
  sstables: Rename local filename variable to component_name
2025-03-20 09:51:03 +02:00
Kefu Chai
fd14a23aab .github: add "raft" and "service" subdirectories to CLEANER_DIR
in order to prevent future inclusion of unused headers, let's include
"raft" and "service" subdirectories to CLEANER_DIR, so that this
workflow can identify the regressions in future.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-03-20 11:18:16 +08:00
Kefu Chai
b3e2561ed8 service: do not include unused headers
these unused includes were identified by clang-include-cleaner. after
auditing these source files, all of the reports have been confirmed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-03-20 11:18:16 +08:00
Avi Kivity
a62ab824e6 schema: deprecate schema_extension
schema_extension allows making invisible changes to system_schema
that evade upgrade rollback tests. They appear in system_schema
as an encoded blob which reduces serviceability, as they cannot
be read.

Deprecate it and point users to adding explicit columns in scylla_tables.

We could probably make use of the data structure, after we teach it
to encode its payload into proper named and typed columns instead of
using IDL.

Closes scylladb/scylladb#23151
2025-03-19 20:36:16 +02:00
Kefu Chai
8fdaaf6491 service/storage_proxy: Improve digest comparison
Previously, the code used a find_if to compare each digest to the first
one to check for any mismatches. This was less readable. This change
replaces that with `std::ranges::all_of`, which checks if all elements
in the range are equal to the first digest, improving readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23332
2025-03-19 18:21:14 +03:00
Nadav Har'El
317de64281 test/alternator: enable debugging output during Python crashes
For a long time now, we've been seeing (see #17564), once in a while,
Alternator tests crashing with the Python process getting killed on
SIGSEGV after the tests have already finished successfully and all
pytest had to do is exit. We have not been able to figure out where the
bug is. Unfortunately, we've never been able to reproduce this bug
locally - and only rarely we see it in CI runs, and when it happens
we don't any information on why it happend.

So the goal of this patch is to print more information that might
hopefully help us next time we see this problem in CI (this patch
does NOT fix the bug). This patch adds to test/alternator's conftest.py
a call to faulthandler.enable(). This traps SIGSEGV and prints a stack
trace (for each thread, if there are several) showing what Python was
trying to do while it is crashing. Hopefully we'll see in this output
some specific cleanup function belonging to boto3 or urllib or whatever,
and be able to figure out where the bug is and how to avoid it.

We could have added this faulthandler.enable() call to the top-level
conftest.py or to test.py, but since we only ever had this Python
crash in Alternator tests, I think it is more suitable that we limit
this desperate debugging attempt only to Alternator tests.

Refs #17564

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23340
2025-03-19 18:18:51 +03:00
Dawid Mędrek
0e04a6f3eb main: Refuse to start node when RF-rack-invalid keyspace exists
When a node is started with the option `rf_rack_valid_keyspaces`
enabled, the initialization will fail if there is an RF-rack-invalid
keyspace. We want to force the user to adjust their existing
keyspaces when upgrading to 2025.* so that the invariant that
every keyspace is RF-rack-valid is always satisfied.

Fixes scylladb/scylladb#23300
2025-03-19 15:13:44 +01:00
Dawid Mędrek
41f862d7ba cql3: Ensure that CREATE and ALTER never lead to RF-rack-invalid keyspaces
In this commit, we refuse to create or alter a keyspace when that operation
would make it RF-rack-invalid if the option `rf_rack_valid_keyspaces` is
enabled.

We provide two tests verifying that the changes work as intended.

Fixes scylladb/scylladb#23276
2025-03-19 14:51:47 +01:00
Dawid Mędrek
32879ec0d5 db/config: Introduce RF-rack-valid keyspaces
We introduce a new term in the glossary: RF-rack-valid keyspace.

We also highlight in our user documentation that all keyspaces
must remain RF-rack-valid throughout their lifetime, and failing
to guarantee that may result in data inconsistencies or other
issues. We base that information on our experience with materialized
views in keyspaces using tablets, even though they remain
an experimental feature.

Along with the new term, we introduce a new configuration option
called `rf_rack_valid_keyspaces`, which, when enabled, will enforce
preserving all keyspaces RF-rack-valid. That functionality will be
implemented in upcoming commits. For now, we materialize the
restriction in form of a named requirement: a function verifying
that the passed keyspace is RF-rack-valid.

The option is disabled by default. That will change once we adjust
the existing tests to the new semantics. Once that is done, the option
will first be enabled by default, and then it will be removed.

Fixes scylladb/scylladb#20356
2025-03-19 14:46:35 +01:00
Pavel Emelyanov
6e7d6b06f0 api: Squash two parse_table_infos into one
There are currently three of them:
- one that works on query parameter value
- one that works on query parameters map
- one that works on the request itself

The second one is not used any longer by anyone by the third one, so
squash them together.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 15:53:38 +03:00
Pavel Emelyanov
851bd38953 api: Generalize keyspaces:tables parsing a little bit more
Continuation of the previous patch -- there's one caller that uses "non
standard" name for the tables query parameter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 15:52:54 +03:00
Pavel Emelyanov
dc3455bc55 api: Provide general pair<keyspace, vector<table>> parsing
Lots of API handlers get "keyspace" path parameter and parse the "cf"
query one into a vector of table_infos. Generalize those places.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 15:51:57 +03:00
Pavel Emelyanov
722f282748 api: Remove ks_cf_func and related code
The type in question is used by two endpoint handlers that are called
with validated keyspace name and parsed vector of table_info-s. Both
handlers can parse what they need on their own, all the more so next
patches will make this parsing even more simpler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 15:49:55 +03:00
Pavel Emelyanov
73187a2e19 Merge 'mutation/mutation_consumer_concepts: simplify consumer hierarchy' from Botond Dénes
The reader consumer concept hierarchy is a sprawling confusing jungle of deeply nested concepts. Looking at `FlattenedConsumer[V2]` -- the subject of this PR: this consumer is defined in terms of the `StreamedMutationConsumer[V2]` which in terms is defined in terms of the `FragmentConsumer[V2]`.
This amount of nesting makes it really hard to see what a concept actually comes down to: made even more difficult by the fact that the concepts are scattered across two header files.
In theory, this nesting allows for greater flexibility: some code can use a lower lever concept directly while it can also serve as the basis for the higher lever concepts. But the fact of the matter is that none of the lower level concepts are used directly, so we pay the price in hard-to-follow code for no benefit.

This PR cuts down the complexity by folding up the entire hierarchy into the top-level `FlattenedConsumer[V2]` and `FlatteneConsumerReturning[V2]` concepts.
Doing this immediately reveals just how similar the two major consumer concepts (`FlattenedConsumer[V2]` and `MutationFragmentConsumer[V2]`) supported by `mutation_reader` are. In a follow-up PR, we will attempt to unify the two.

Refactoring, no backport needed.

Closes scylladb/scylladb#23344

* github.com:scylladb/scylladb:
  mutation: fold FragmentConsumer[V2] into FlattenedConsumer[V2]
  mutation: fold StreamedMutationConsumer[V2] into FlattenedConsumer[V2]
  test/lib/fragment_scatterer: s/StreamedMutationConsumer/FlattenedConsumer/
2025-03-19 15:43:00 +03:00
Pavel Emelyanov
a408a7abe1 sstable: Remove unused malformed_sstable_exctpion(string filename)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 13:03:29 +03:00
Pavel Emelyanov
f06cc32812 sstables: Make filename() return component_name
Similarly to toc_, index_ and data filenames, make the generic component
name getter return back not string, but a wrapper object. Most of
callers are log messages and exception generations. Other than that
there are tests, filesystem storage driver and few more places in
generic code who "know" that they work with real files, so make them use
explicit fmt::to_string().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 13:03:29 +03:00
Pavel Emelyanov
68c41f0459 sstables: Make file_writer keep component_name on board
The class in question is a wrapper around output_stream that writes,
flushes and closes the stream in async context. For logging it also
keeps the component filename on board, and now it's good time to patch
it and keep the component_filename instead.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 13:03:29 +03:00
Pavel Emelyanov
1ba91e28cb sstables: Make get_filename() return component_name
Similarly to previous patches -- mostly the result is used as log
argument. The remaining users include

- scylla sstable tool that dumps component names to json output
- API endpoint that returns component names to user
- tests

these are all good to explicitly convert component_names to strings.

There are few more places that expect strings instead of component name
objects. For now they also use fmt::to_string() explicitly, partially it
will be fixed later, mostly -- as future follow-ups.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 13:03:29 +03:00
Pavel Emelyanov
0cdeed858c sstables: Make toc_filename() return component_name
Most of the callers use the returned value as log message parameter,
some construct malformed_sstable_exception that was prepared by previous
patch.

The remaining callers explicitly use fmt::to_string(), these are

- pending deletion log creation
- filesystem storage code
- tests
- stream-blob code that re-loads sstable

All but the last one are OK to use string toc name, the last one is not
very correct in its usage of toc_filename string, but it needs more care
to be fixed properly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 13:03:29 +03:00
Pavel Emelyanov
80e0030613 sstables: Make sstable::index_filename() return component_name
Most of the method callers use it as log parameter. There are few more
places that push it to malformed_sstable_exception, which immediately
converts it to string, so this patch makes the exception be constructed
with the component_name either.

And there's one more place that passes this string to file_writer
constructor. For now, convert it to string explicitly, but next patches
will fix that place to use pure component_name too.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 13:01:23 +03:00
Pavel Emelyanov
dbb9ee15c1 sstables: Introduce struct component_name
The structure wraps const reference to sstable and component_name value
(it's an enum of several elements). It also has a formatter so that it
can be directly printed in logs (main usage) as well as converted to
strings (auxiliary and discourage usage).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 12:45:21 +03:00
Pavel Emelyanov
aba400f5d9 sstables: Remove unused sstable::component_filenames() method
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 12:45:21 +03:00
Pavel Emelyanov
24e5c30cc8 sstables: Do not print component filenames on load-and-stream wrap-up
When load-and-stream finishes it may call sstable::unlink() method to
drop the loaded (and streamed) sstable. Before calling it it prints a
log message about its intention that includes component_filenames()
vector. This log message is ugly in several ways.

First, it prints only recognized components, while unlink() method
unlinks all of them, so it's sort of misleading (it doesn't seem that
anyone ever read this message IRL though)

Next, that's the only place that is _that_ verbose about sstable
unlinking. "Common" unlinking paths don't print that much info.

Finally, the log message happen in debug level, so it's hardly ever
appears in any logs, but collecting several filenames takes time.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 12:45:21 +03:00
Pavel Emelyanov
fb2bd91009 sstables: Explicitly format prefix in S3 object name making
Sometimes a component object name looks like
s3://bucket/prefix/component. For that the path formatting code formats
bucket name with the result of sstable->filename() invocation. This
patch changes it to format bucket name, prefix itself and
sstable->component_filename().

The change is idempotent, as sstable::filename() just concatenates prefix
with sstable::component_filename(). This change will help to remove the
former method from sstable soon.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 12:45:21 +03:00
Pavel Emelyanov
f212b5efa9 sstables: Don't include directory name in exception
When filesystem storage throws an exception about failure to create
components hardlinks, it includes three paths into it -- source file
name, destination file name and the directory name. The directory name
is excessive, source file name already has it. Also, this change will
make it possible to remove one of malformed_sstable_exception
constructors soon.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 12:45:21 +03:00
Pavel Emelyanov
a8bc81eb3c sstables: Use fmt::format instead of string concatenation
There are some places that concatentate filenames with something else to
get different filename (tool does it) or message for exception
(read_toc() helper). This patch uses fmt::format() instead to facilitate
future patching.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 12:45:21 +03:00
Pavel Emelyanov
dcc9167734 sstables: Rename filename($component) calls to ${component}_filename()
There's a generic sstable::filename(component_type) method that returns
a file name for the given component. For "popular" components, namely
TOC, Data and Index there are dedicated sstable methods to get their
names. Fix existing callers of the generic method to use the former.
It's shorter, nicer and makes further patching simpler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 12:45:21 +03:00
Pavel Emelyanov
e6898a8854 sstables: Rename local filename variable to component_name
This is to be consistent with future changes and not to bloat them with
extra renames

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 12:45:20 +03:00
Kefu Chai
1ab2b7e7a0 tree: fix misspellings
these two misspellings were flagged by codespell.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23357
2025-03-19 09:13:20 +02:00
Botond Dénes
8f0d0daf53 Merge 'repair: allow concurrent repair and migration of two different tablets' from Aleksandra Martyniuk
Do not hold erm during repair of a tablet that is started with tablet
repair scheduler. This way two different tablets can be repaired
and migrated concurrently. The same tablet won't be migrated while
being repaired as it is provided by topology coordinator.

Use topology_guard to maintain safety.

Fixes: https://github.com/scylladb/scylladb/issues/22408.

Needs backport to 2025.1 that introduces the tablet repair scheduler.

Closes scylladb/scylladb#22842

* github.com:scylladb/scylladb:
  test: add test to check concurrent tablets migration and repair
  repair: do not hold erm for repair scheduled by scheduler
  repair: get total rf based on current erm
  repair: make shard_repair_task_impl::erm private
  repair: do not pass erm to put_row_diff_with_rpc_stream when unnecessary
  repair: do not pass erm to flush_rows_in_working_row_buf when unnecessary
  repair: pass session_id to repair_writer_impl::create_writer
  repair: keep materialized topology guard in shard_repair_task_impl
  repair: pass session_id to repair_meta
2025-03-19 08:55:24 +02:00
Kefu Chai
aca00118fb service: fix misspellings
these misspellings were flagged by codespell.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23334
2025-03-18 22:21:45 +02:00
Piotr Dulikowski
2ca1c0b6f9 Merge 'introduce the new Raft-based recovery procedure for group 0 majority loss' from Patryk Jędrzejczak
This PR introduces the new Raft-based recovery procedure for group 0
majority loss.

The Raft-based recovery procedure works with tablets. The old
gossip-based recovery procedure does not because we have no code
for tablet migrations after the gossip-based topology changes.

The Raft-based procedure requires the Raft-based topology to be
enabled in the cluster. If the Raft-based topology is not enabled, the
gossip-based procedure must be used.

We will be able to get rid of the gossip-based procedure when we make
the Raft-based topology mandatory (we can do both in the same version,
2025.2 is the plan). Before we do it, we will have to keep both procedures
and explain when each of them should be used.

The idea behind the new procedure is to recreate group 0 without
touching the topology structures. Once we create a new group 0, we
can remove all dead nodes using the standard `removenode` and
`replace` operations.

For the procedure to be safe, we must ensure that each member of the
new group 0 moves to the same initial group 0 state. Also, the only safe
choice for the state is the latest persistent state available among the
live nodes.

The solution to the problem above is to ensure that the leader of the new
group 0 (called the recovery leader) is one of the nodes with the latest
state available. Other members will receive the snapshot from the
recovery leader when they join the new group 0 and move to its state.

Below is the shortened description of the new recovery procedure from
the perspective of the administrator. For the full description, refer to the
design document.
1. Find the set of live nodes.
2. Kill any live node that shouldn't be a member of the new group 0.
3. Ensure the full network connectivity between live nodes.
4. Rolling restart live nodes to ensure they are healthy and ready for
recovery.
5. Check if some data could have been lost. If yes, restore it from
backup after the recovery procedure.
6. Find the recovery leader (the node with the largest `group0_state_id`).
7. Remove `raft_group_id` from `system.scylla_local` and truncate
`system.discovery` on each live node.
8. Set the new scylla.yaml parameter, `recovery_leader`, to Host ID of the
recovery leader on each live node.
9. Rolling restart all live nodes, but the recovery leader must be
restarted first.
10. Remove all dead nodes using `removenode` or `replace`.
11. Unset `recovery_leader` on all nodes.
12. Delete data of the old group 0 from  `system.raft`,
`system.raft_snaphots`, and `system.raft_snapshot_config`.

In the future, we could automate some of these steps or even introduce
a tool that will do all (or most) of them by itself. For now, we are fine with
a procedure that is reliable and simple enough.

This PR makes using 2025.1 with tablets much safer. We want to
backport it to 2025.1. We will also want to backport a few follow-ups.

Fixes scylladb/scylladb#20657

Closes scylladb/scylladb#22286

* github.com:scylladb/scylladb:
  test: mark tests with the gossip-based recovery procedure
  test: add tests for the Raft-based recovery procedure
  test: topology: util: fix the tokens consistency check for left nodes
  test: topology: util: extend start_writes
  gossip: allow group 0 ID mismatch in the Raft-based recovery procedure
  raft_group0: modify_raft_voter_status: do not add new members
  treewide: allow recreating group 0 in the Raft-based recovery procedure
2025-03-18 19:10:56 +01:00
Yaron Kaikov
b375222408 ./github/scripts/auto-backport.py: don't remove backport label when backport process has an error
Today, when the `Fixes` prefix is missing or the developer is not a collaborator with `scylladbbot` we remove the backport labels to prevent the process from starting and notifying the developers.

Developers are worried that removing these backport labels will cause us to forget we need to do these backports. @nyh suggested to add a `scylladbbot/backport_error` label instead

Applied those changes, so when a `Fixes` prefix is missing we will add a `scylladbbot/backport_error` label and stop the process

When a user doesn't accept the invite we will still open the PR but he will not be assigned and will not be able to edit the branch when we have conflicts

Fixes: https://github.com/scylladb/scylla-pkg/issues/4898
Fixes: https://github.com/scylladb/scylla-pkg/issues/4897

Closes scylladb/scylladb#23259
2025-03-18 16:19:09 +02:00
Pavel Emelyanov
420b5bee20 test/s3: Increase boost/s3_test log levels
When something goes wrong, it's impossible to find anyting out without
s3 and http logs, so increase them for boost tests.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23245
2025-03-18 15:59:05 +02:00
Botond Dénes
a2d0d7b9a0 mutation: fold FragmentConsumer[V2] into FlattenedConsumer[V2]
FragmentConsumer[V2] also has no direct users, so fold it into
FlattenedConsumer[V2] as well. With this, FlattenedConsumer[V2] has a
nice and simple definition, with a single nesting level required due to
the return-type flexibility.
2025-03-18 09:24:49 -04:00
Botond Dénes
8768e2e08e mutation: fold StreamedMutationConsumer[V2] into FlattenedConsumer[V2]
No code uses StreamedMutationConsumer[V2] directly, so let's take this
opportunity to reduce the jungle of consumer concepts.
2025-03-18 09:24:44 -04:00
Botond Dénes
969b07fdfd test/lib/fragment_scatterer: s/StreamedMutationConsumer/FlattenedConsumer/
The class actually implements the FlattenedConsumer, so fix the comment.
This eliminates the only reference to the StreamedMutationConsumer
concept.
2025-03-18 07:57:04 -04:00
Avi Kivity
9867129c7b Update seastar submodule
* seastar 412d058cf9...2f13c461bb (2):
  > smp: prefaulter: don't leave zombie worker threads
Fixes #23316
  > demos/tcp_sctp_server_demo:  Modernize with seastar::async and proper teardown

Closes scylladb/scylladb#23317
2025-03-18 13:36:05 +02:00
Botond Dénes
2795d83b32 Merge 'commitlog: Serialize file deletion and distribute replayed segments' from Calle Wilund
Fixes #23017

When deleting segments while our footprint is over the limit, mainly when recycling/deleting segments after replay (recover boot) we can cause two deletion passes to be running at the same time. This is because delete is triggered by either

a.) replay release
b.) timer check (explicit)
c.) timer initiated flush callback

where the last one is in fact not even waited for. If we are considering many files for delete/recycle, we can, due to task switch, end up considering segments ok to keep, in parallel, even though one of them should be deleted. The end result will be us keeping one more segment than should be allowed.

Now, eventually, this should be released, once we do deletion again, but this can take a while.

Solution is to simply ensure we serialize deletion. This might cause some delay in processing cycles for recycle, but in practice, this should never happen when we are in fact under pressure.

As noted in the issue above, when replaying a large commitlog from an unclean node, we can cause shard 0
db commitlog to reach footprint limit, and then remain there (because we never release segments lower than limit). This is wasteful with diskspace. But deleting segments early here is also wasteful; A better solution is
to simply give the segments to all CL shards, thus distributing the available space.

Closes scylladb/scylladb#23150

* github.com:scylladb/scylladb:
  main/commitlog: wait for file deletion and distribute recycled segments to shards
  commitlog: Serialize file deletion
2025-03-18 11:47:17 +02:00
Avi Kivity
176bb464a2 github: error if we see #include "seastar/..."
Seastar is a system library from ScyllaDB's persepective and
so should use angle brackets for #include statements.

Closes scylladb/scylladb#23308
2025-03-17 21:56:48 +02:00
Ernest Zaslavsky
08b9e4d87b aws_error: Add GNU TLS codes
Add GNU TLS error codes to std::system_error handler since we can start getting these once they seep from seastar's http client
2025-03-17 16:38:14 +02:00
Ernest Zaslavsky
012f0e6d8c s3_client: Handle nested std::system_error exceptions
Enhance error handling by detecting and processing std::system_error exceptions
nested within std::nested_exception. This improvement ensures that system-level
errors wrapped in the exception chain are properly caught and managed, leading
to more robust error reporting and recovery.
2025-03-17 16:38:14 +02:00
Ernest Zaslavsky
367140a9c5 s3_client: Start using new retry strategy
* Previously, token expiration was considered a fatal error. With this change,
the `s3_client` uses new retry strategy that is trying to renew expired
creds
* Added related test to the `s3_proxy`
2025-03-17 16:38:14 +02:00
Ernest Zaslavsky
ed09614c27 retry_strategy: Add custom retry strategy for S3 client
Introduced a new retry strategy that extends the default implementation.
The should_retry method is overridden to handle a specific case for expired credential tokens.
When an expired token error is detected, the credentials are reset so it is expected that the client will re-authenticates, and the
original request is retried.
2025-03-17 16:38:14 +02:00
Ernest Zaslavsky
26062c65e4 retry_strategy: Make should_retry awaitable 2025-03-17 16:36:26 +02:00
Avi Kivity
0e4b303339 tools: toolchain: regenerate for python3-pytest-asyncio 0.24
Fixes a bug related to load_scope="module".

python-driver fixed to version 3.28.2, as it looks like
3.29.0 regressed TLS handling [1]. In any case tools/cqlsh
fixes it to 3.28.2.

Optimized clang from

 https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-aarch64.tar.gz
 https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-x86_64.tar.gz

Ref #22960.

Fixes #23213

[1] https://github.com/scylladb/python-driver/issues/456

Closes scylladb/scylladb#23236
2025-03-17 15:41:55 +02:00
Botond Dénes
fda3486770 Merge 'Remove some excessive ks:cf -> table_id conversions in API and schema_tables' from Pavel Emelyanov
Actually, the main goal of this PR was to remove parse_tables() helpers from api/ in favor of more flexible (yet same complex) parse_table_infos(), but it turned out that it also saves some lookups in database maps.

There are several places in API and schema_tables that have table_id at hand, but at some point drop it and carry keyspace and table names over to a place that maps ks:cf back to table_id and then uses it to find the table object. This PR keeps the table_id with the help of table_info struct in those places. This change allows removing the aforementioned parse_table() helpers from api/ and also saves few lookups in database maps.

Removing the parse_tables() from api/ is the continuation of previous effort that reduces the set of helpers in api/ code that help handlers "parse" keyspaces and tables names see #22742 #21533

Closes scylladb/scylladb#23216

* github.com:scylladb/scylladb:
  api: Remove the remaining parse_tables() overload
  database: Sanitize flush_tables_on_all_shards()
  schema_tables: Remove all_table_names()
  database: Make tables flushing helper use table_info-s, not names
  api: Make keyspace flush endpoint use parse_table_infos() (and a bit more)
  schema_tables,client_state: Switch to using all_table_infos()
  schema_tables: Tune up some methods to benefit from table_infos
  schema_tables: Introduce all_table_infos()
2025-03-17 15:40:41 +02:00
Pavel Emelyanov
6217124d1d s3/client: Make "expected" reply status truly optional
Currently when a client::make_request() is called it can pass
std::optional<status> argument indicating which status it expects from
server. In case status doesn't match, the request body handler won't be
called, the request will fail with unexpected status exception.

However, disengaged expected implicitly means, that the requestor
expects the OK (200) status. This makes it impossible to make a query
which return status is not known in advance and it's up to the handler
to check it.

Lower level http client allows disengaged expected with the described
semantics -- handler will check status its own. This behavios for s3
client is needed for GET request. Server can respond with OK or partial
content status depending on the Range header. If the header is absent or
is large enough for the requested object to fit into it, the status
would be OK, if the object is "trimmed" the status is partial content.
In the end of the day, requestor cannot "guess" the returning status in
advance and should check it upon response arrival.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23243
2025-03-17 15:34:58 +02:00
Botond Dénes
afa305ffb4 Merge 'perf/perf_sstable: stop using at_exit() ' from Kefu Chai
`seastar::at_exit()` was marked deprecated recently. so let's use the recommended approach to perform cleanups.

---

it's a cleanup, hence no need to backport.

Closes scylladb/scylladb#23253

* github.com:scylladb/scylladb:
  perf/perf_sstable: fix the indent
  perf/perf_sstable: stop using at_exit()
2025-03-17 15:30:10 +02:00
Andrei Chekun
d68e54c26d test.py: Remove reuse cluster in cluster tests
Pool is not aware of the cluster configuration, so it can return cluster
to the test that is not suitable for it. Removing reuse will remove such
possibility, so there will be less flaky tests.

Closes scylladb/scylladb#23277
2025-03-17 15:27:59 +02:00
Calle Wilund
1525cb2dba main/commitlog: wait for file deletion and distribute recycled segments to shards
Refs #23017

When replaying a large commitlog from an unclean node, we can cause shard 0
db commitlog to reach footprint limit, and then remain there (because we
never release segments lower than limit). This is wasteful with diskspace.
But deleting segments early here is also wasteful; A better solution is
to simply give the segments to all CL shards, thus distributing the available
space.

v2:
* Do segement distribution using ranges. go c++23
2025-03-17 12:09:00 +00:00
Calle Wilund
4ed81e05bf commitlog: Serialize file deletion
Fixes #23017

When deleting segments while our footprint is over the limit,
mainly when recycling/deleting segments after replay (recover
boot) we can cause two deletion passes to be running at the same
time. This is because delete is triggered by either

a.) replay release
b.) timer check (explicit)
c.) timer initiated flush callback

where the last one is in fact not even waited for. If we are
considering many files for delete/recycle, we can, due to task
switch, end up considering segments ok to keep, in parallel,
even though one of them should be deleted. The end result
will be us keeping one more segment than should be allowed.
Now, eventually, this should be released, once we do deletion
again, but this can take a while.

Solution is to simply ensure we serialize deletion. This might
cause some delay in processing cycles for recycle, but in
practice, this should never happen when we are in fact under
pressure.

Small unit test included.
2025-03-17 12:09:00 +00:00
Anna Stuchlik
cd61f60549 doc: fix product names in the 2025.1 upgrage guides
This commit fixes the product names in the upgrade 2025.1 guides so that:

- 6.2 is preceded with "ScyllaDB Open Source"
- 2024.x is preceded with "ScyllaDB Enterprise"
- 2025.1 is preceded with "ScyllaDB"

Fixes https://github.com/scylladb/scylladb/issues/23154

Closes scylladb/scylladb#23223
2025-03-17 13:54:11 +03:00
Anna Stuchlik
dbbf9e19e4 doc: remove the outdated info on seeds-info
This commit removes the outdated information about seed nodes.
We no longer need it in the docs, as a) the documentation is versioned,
and b) the ScyllaDB Open Source 4.3 and ScyllaDB Enterprise 2021.1 versions
mentioned in the docs are no longer supported.

In addition, some clarification has been added to the existing sections.

Fixes https://github.com/scylladb/scylladb/issues/22400

Closes scylladb/scylladb#23282
2025-03-17 13:53:48 +03:00
Andrei Chekun
7423edb1f7 test.py: Increase verbosity of pytest
Currently, pytest truncates long objects in assertions.
This makes understanding the failure message difficult.
This will increase verbosity and pytest will stop truncating messages.

Closes scylladb/scylladb#23263
2025-03-17 12:51:41 +02:00
Aleksandra Martyniuk
20f9d7b6eb test: add test to check concurrent tablets migration and repair
Add a test to check whether a tablet can be migrated while another
tablet is repaired.
2025-03-17 10:37:03 +01:00
Aleksandra Martyniuk
5b792bdc98 repair: do not hold erm for repair scheduled by scheduler
Do not hold erm	for tablet repair scheduled by scheduler. Thanks to
that one tablet repair won't exclude migration of other tablets.

Concurrent repair and migration of the same tablet isn't possible,
since a tablet can be in one type of transition only at the time.
Hence the change is safe.

Refs: https://github.com/scylladb/scylladb/issues/22408.
2025-03-17 10:37:02 +01:00
Aleksandra Martyniuk
a1375896df repair: get total rf based on current erm
Get total rf based on erm. Currently, it does not change anything
because erm stays the same during the whole repair.
2025-03-17 10:36:18 +01:00
Aleksandra Martyniuk
34cd485553 repair: make shard_repair_task_impl::erm private
Make shard_repair_task_impl::erm private. Access it with getter.
2025-03-17 10:36:14 +01:00
Andrei Chekun
a20d848c01 test.py: Refactor test/conftest.py
Move functions responsible for preparation of the environment to the util file.
This is extracted from https://github.com/scylladb/scylladb/pull/22894 to make it easier to work together.

Closes scylladb/scylladb#23221
2025-03-17 11:31:00 +02:00
Avi Kivity
4416b0c732 treewide: use angle brackets for including seastar headers
Seastar is an external library, so we use angle brackets to
include its interfaces.

Closes scylladb/scylladb#23301
2025-03-17 10:03:06 +02:00
Andrei Chekun
1e1d213592 test.py: Remove additional report generation for python tests
Pytest is responsible for generation the report of the failed tests and
there is no need to generate it one more time

Closes scylladb/scylladb#23237
2025-03-17 09:36:08 +02:00
Kefu Chai
f8800b3f19 ent/encryption: rename "padd" to "padding"/"pad" and use structured bindings
Replace the abbreviated term "padd" with either "padding" or "pad" throughout
the encryption module. While "padd" was originally chosen to align with other
variable names ("type" and "mode"), using standard terminology improves code
readability and resolves codespell warnings.

Additionally, refactor relevant code to use C++ structured bindings for cleaner
implementation.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23251
2025-03-17 09:23:42 +02:00
Raphael S. Carvalho
e9944f0b7c service: Introduce rack-aware co-location migrations for tablet merge
Merge co-location can emit migrations across racks even when RF=#racks,
reducing availability and affecting consistency of base-view pairing.

Given replica set of sibling tablets T0 and T1 below:
[T0: (rack1,rack3,rack2)]
[T1: (rack2,rack1,rack3)]

Merge will co-locate T1:rack2 into T0:rack1, T1 will be temporarily only at
only a subset of racks, reducing availability.

This is the main problem fixed by this patch.

It also lays the ground for consistent base-view replica pairing,
which is rack-based. For tables on which views can be created we plan
to enforce the constraint that replicas don't move across racks and
that all tablets use the same set of racks (RF=#racks). This patch
avoids moving replicas across racks unless it's necessary, so if the
constraint is satisfied before merge, there will be no co-locating
migrations across racks. This constraint of RF=#racks is not enforced
yet, it requires more extensive changes.

Fixes #22994.
Refs #17265.

This patch is based on Raphael's work done in PR #23081. The main differences are:

1) Instead of sorting replicas by rack, we try to find
    replicas in sibling tablets which belong to the same rack.
    This is similar to how we match replicas within the same host.
    It reduces number of across-rack migrations even if RF!=#racks,
    which the original patch didn't handle.
    Unlike the original patch, it also avoids rack-overloaded in case
    RF!=#racks

2) We emit across-rack co-locating migrations if we have no other choice
   in order to finalize the merge

   This is ok, since views are not supported with tablets yet. Later,
   we will disallow this for tables which have views, and we will
   allow creating views in the first place only when no such migrations
   can happen (RF=#racks).

3) Added boost unit test which checks that rack overload is avoided during merge
   in case RF<#racks

4) Moved logging of across-rack migration to debug level

5) Exposed metric for across-rack co-locating migrations

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>

Closes scylladb/scylladb#23247
2025-03-16 22:45:00 +02:00
Pavel Emelyanov
95809a3ed1 Update seastar submodule
* seastar 5b95d1d7...412d058c (62):
  > fstream: Export functions for making file_data_source
  > build: Include DPDK dependency libraries in Seastar linkage
  > demos/tls_echo_server_demo: Modernize with seastar::async
  > http/client: Pass abort source by pointer
  > rpc: remove deprecated logging function support
  > github: Add Alpine Linux workflow to test builds with musl libc
  > exception_hacks: Make dl_iterate_phdr resolution manual
  > tests: relax test_file_system_space check for empty filesystems
  > demos/udp_server_demo:  Modernize with seastar::async and proper teardown
  > future: remove deprecated functions/concepts
  > util: logger: remove deprecated set_stdout_enabled and logger_ostream_type::{stdout,stderr}
  > memory: guard __GLIBC_PREREQ usage with __GLIBC__ check
  > scheduling_specific: Add noexcept wrapper for free()
  > file: Replace __gid_t with standard POSIX gid_t
  > aio_storage_context: Use reactor::do_at_exit()
  > json2code: support chunked_fifo
  > json: remove unused headers
  > httpd: test cases for streaming
  > build: use find_dependency() instead find_package() in config file
  > build: stop using a loop for finding dependencies
  > dns: Fix event processing to work safely with recent c-ares
  > tutorial: add a section about initialization and cleanup
  > reactor: deprecate at_exit()
  > httpclient: Add exception handling to connection::close
  > file: document max_length-limits for dma_read/write funcs taking vector<iovec>
  > build: fix P2582R1 detection in GCC compatibility check
  > json2code: optimize string handling using std::string_view
  > tests/unit: fix typo in test output
  > doc: Update documentation after removing build.sh
  > test: Add direct exception passing for awaits for perf test
  > github:  add Docker build verification workflow
  > docker: update LLVM debian repo for Ubuntu Orcular migration
  > tests/unit: Use http.HTTPStatus constants instead of raw status codes
  > tests/unit: Fix exception verification in json2code_test.py
  > httpd: handle streaming results in more handlers
  > json: stream_object now moves value
  > json: support for rvalue ranges
  > chunked_fifo: make copyable
  > reactor: deprecate at_destroy()
  > testing: prevent test scheduling after reactor exit
  > net: Add bytes sent/received metrics
  > net: switch rss_key_type to std::span instead of std::string_view
  > log: fixes for libc++ 19
  > sstring: fixes for lib++ 19
  > build: finalize numactl dependency removal
  > build: link DPDK against libnuma when detected during build
  > memory: remove libnuma dependency
  > treewide: replace assert with SEASTAR_ASSERT
  > future: fix typo in comment
  > http: Unwrap nested exceptions to handle retryable transport errors
  > net/ip, net: sed -i 's/to_ulong/to_uint/'
  > core: function_traits noexcept specializations
  > util/variant: seastar::visit forward value arg
  > net/tls: fix missing include
  > tls: Add a way to inspect peer certificate chain
  > websocket: Extract encode_base64() function
  > websocket: Rename wlogger to websocket_logger
  > websocket: Extract parts of server_connection usable for client
  > websocket: Rename connection to server_connection
  > websocket: Extract websocket parser to separate file
  > json2code_test: factor out query method
  > seastar-json2code: fix error handling

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23281
2025-03-16 21:57:43 +02:00
Benny Halevy
41f02c521d main: allow abort during join_cluster
Bootstrap or replace can take a long time, but
since feef7d3fa1,
the stop_signal is checked only in checkpoints,
and in particular, abort isn't requested during
join_cluster.

Fixes #23222

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-16 12:21:15 +02:00
Benny Halevy
f269480f53 main: add checkpoint before joining cluster
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-16 12:08:04 +02:00
Benny Halevy
0fc196991a storage_service: add start_sys_dist_ks
Currently, there's a call to
`supervisor::notify("starting system distributed keyspace")`
which is misleading as it is identical to a similar
message in main() when starting the sharded service.

Change that to a storage_service log messages
and be more specific that the sys_dist_ks shards are started.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-16 12:05:23 +02:00
Jenkins Promoter
d84da3dc11 Update pgo profiles - x86_64 2025-03-15 04:57:28 +02:00
Jenkins Promoter
6e8e2ae333 Update pgo profiles - aarch64 2025-03-15 04:48:49 +02:00
Pavel Emelyanov
604fdd86e9 test: Count mutation fragments verbosily in scoped restore test
Sometimes after scoped restore a key is not found in nodes' mutation
fragments. This patch makes the counting more verbose to get better
understanding of what's going on in case of test failure

refs: #23189

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23296
2025-03-14 21:31:36 +02:00
Pavel Emelyanov
bfbe802632 streaming: Relax load_sstable_for_tablet()
The method does several excessive things, that can be relaxed

1. In order to transfer a table-id to another shard, finds the table on
   source shard, gets schema and captures schema id on invoke_on()'s
   lambda. It can just capture the original table-id

2. In order to get sstable parameters (format, version, etc.) generates
   toc_filename(), then calls parse_path() to convert it into the
   entry_descriptor. The descriptor can be read from sstable directly.

3. Logging "success" includes target shard into the message, but happens
   on the source shard. The message can be just logged on target shard.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23197
2025-03-14 15:26:48 +02:00
Botond Dénes
39bcf99f8e Merge 'Apply hard limit to partition range vectors in secondary index queries' from Nikos Dragazis
Secondary index queries fetch partition keys from the index view and store them in an `std::vector`. The vector size is currently limited by the user's page size and the page memory limit (1MiB). These are not enough to prevent large contiguous allocations (which can lead to stalls).

This series introduces a hard limit to the vector size to ensure it does not exceed the allocator's preferred max contiguous allocation size (128KiB). With the size of each element being 120 bytes, this allows for 1092 partition keys. The limit was set to 1000. Any partitions above this limit are discarded.

Discarding partitions breaks the querier cache on the replicas, causing a performance regression, as can be seen from the following measurements:
```
* Cluster: 3 nodes (local Docker containers), 1 vCPU, 4GB memory, dev mode
* Schema:
  CREATE KEYSPACE ks WITH replication = {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'datacenter1': '3'} AND durable_writes = true AND tablets = {'enabled': false};
  CREATE TABLE ks.t1 (pk1 int, pk2 int, ck int, value int, PRIMARY KEY ((pk1, pk2), ck));
  CREATE INDEX t1_pk2_idx ON ks.t1(pk2);
* Query: CONSISTENCY LOCAL_QUORUM; SELECT * FROM ks.t1 where pk2 = 1;

+------------+-------------------+-------------------+
|  Page Size |      Master       |   Vector Limit    |
+============+===================+===================+
|            |   Latency (sec)   |   Latency (sec)   |
+------------+-------------------+-------------------+
|     100    |  5.80 ± 0.13      |  5.64 ± 0.10      |
+------------+-------------------+-------------------+
|    1000    |  4.77 ± 0.07      |  4.62 ± 0.06      |
+------------+-------------------+-------------------+
|    2000    |  4.67 ± 0.07      |  5.13 ± 0.03      |
+------------+-------------------+-------------------+
|    5000    |  4.82 ± 0.09      |  6.25 ± 0.06      |
+------------+-------------------+-------------------+
|   10000    |  4.89 ± 0.36      |  7.52 ± 0.13      |
+------------+-------------------+-------------------+
|     -1     |  4.90 ± 0.67      |  4.79 ± 0.33      |
+------------+-------------------+-------------------+
```
We expect this to be fixed with adaptive paging in a future PR. Until then, users can avoid regressions by adjusting their page size.

Additionally, this series changes the `untyped_result_set` to store rows in a `chunked_vector` instead of an `std::vector`, similarly to the `result_set`. Secondary index queries use an `untyped_result_set` to store the raw result from the index view before processing. With 1MiB results, the `std::vector` would cause a large allocation of this magnitude.

Finally, a unit test is added to reproduce the bug.

Fixes #18536.

The PR fixes stalls of up to 100ms, but there is an easy workaround: adjust the page size. No need to backport.

Closes scylladb/scylladb#22682

* github.com:scylladb/scylladb:
  cql3: secondary index: Limit page size for single-row partitions
  cql3: secondary index: Limit the size of partition range vectors
  cql3: untyped_result_set: Store rows in chunked_vector
  test: Reproduce bug with large allocations from secondary index
2025-03-14 15:06:07 +02:00
Botond Dénes
83ea1877ab Merge 'scylla-sstable: add native S3 support' from Ernest Zaslavsky
scylla-sstable: Enable support for S3-stored sstables

Minimal implementation of what was mentioned in this [issue](https://github.com/scylladb/scylladb/issues/20532)

This update allows Scylla to work with sstables stored on AWS S3. Users can specify the fully qualified location of the sstable using the format: `s3://bucket/prefix/sstable_name`. One should have `object_storage_config_file` referenced in the `scylla.yaml` as described in docs/operating-scylla/admin.rst

ref: https://github.com/scylladb/scylladb/issues/20532
fixes: https://github.com/scylladb/scylladb/issues/20535

No backport needed since the S3 functionality was never released

Closes scylladb/scylladb#22321

* github.com:scylladb/scylladb:
  tests: Add Tests for Scylla-SSTable S3 Functionality
  docs: Update Scylla Tools Documentation for S3 SSTable Support
  scylla-sstable: Enable Support for S3 SSTables
  s3: Implement S3 Fully Qualified Name Manipulation Functions
  object_storage: Refactor `object_storage.yaml` parsing logic
2025-03-14 15:05:52 +02:00
Patryk Jędrzejczak
ca5c223505 test: mark tests with the gossip-based recovery procedure
This patch makes it clear which Raft recovery procedure is used in
each test.

Tests with "This test uses the gossip-based recovery procedure." are
the tests that use the gossip-based topology. This tests should be
deleted once we make the Raft-based topology mandatory.

Tests with the new FIXME are the tests that use the Raft-based
topology. They should be changed to use the Raft-based recovery
procedure or removed if they don't test anything important with
the new procedure.
2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak
4fd0e93154 test: add tests for the Raft-based recovery procedure 2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak
4e055882c1 test: topology: util: fix the tokens consistency check for left nodes
When we remove a node in the Raft-based topology
(by remove/replace/decommission), we remove its
tokens from `system.topology`, but we do not
change `num_tokens`. Hence, the old check could
fail for left nodes.
2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak
d0efc77d20 test: topology: util: extend start_writes
We extend `start_writes` to allow:
- providing `ks_name` from the test,
- restarting it (by starting it again with the same `ks_name`),
- running it in the presence of shutdowns.

We use these features in a new test in one of the following patches.
2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak
9970c1fcc3 gossip: allow group 0 ID mismatch in the Raft-based recovery procedure
This patch ensures that members of the new group 0 can gossip with
members of the old group 0 during rolling restart in the Raft-based
recovery procedure. Without this change, restarted nodes (members of
the new group 0) wouldn't be marked as UP by other nodes (members of
the old group 0), which would decrease availability.
2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak
3b9765dac8 raft_group0: modify_raft_voter_status: do not add new members
In the new Raft-based recovery procedure, we create a new group 0.
Dead nodes are not members of this group 0. Also, the removenode
handler makes a node being removed a non-voter. So, with the previous
implementation of `modify_raft_voter_status`, the node being removed
would become a non-voting member of the new group 0, which is very
weird. It should not cause problems, but we better avoid it and
keep the procedure clean.

This change also makes `modify_raft_voter_status` more intuitive in
general.
2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak
fd51d7e448 treewide: allow recreating group 0 in the Raft-based recovery procedure
This patch adds support for recreating group 0 after losing
majority. This is the only part of the new Raft-based recovery
procedure that touches Scylla core.

The following steps are necessary to recreate group 0:
1. Determine the new group 0 members. These are alive nodes that
are normal or rebuilding.
2. Choose the recovery leader - the node which will become the
new group 0 leader. This must be one of the nodes with the
latest persistent group 0 state.
3. Remove `raft_group_id` from `system.scylla_local` and truncate
`system.discovery` on each live node.
4. Set the new scylla.yaml parameter - `recovery_leader` - to Host
ID of the recovery leader on each live node.
5. Rolling restart all live nodes, but the recovery leader must be
restarted first.

In the implementation, restarts in step 5 are very similar to normal
restarts with the Raft-based topology enabled. The only differences
are:
1. Steps 3-4 make the restarting node discover the new group 0
in `join_cluster`.
2. The group 0 server is started in `join_group0`, not
`setup_group0_if_exists`.
3. The restarting node joins the new group 0 in `join_topology` using
`legacy_handshaker`. There is no reason to contact the topology
coordinator since the node has already joined the topology.

Unfortunately, this patch creates another execution path for the
starting logic. `join_cluster` becomes even messier. However, there
is nothing we can do about it. Joining group 0 without joining
topology is something completely new. Having a few small changes
without touching other execution paths is the best we can do.
We will start removing the old stuff soon, after making the
Raft-based topology mandatory, and the situation will improve.
2025-03-14 13:52:57 +01:00
Nadav Har'El
de7c1d526a test/cqlpy: test DESC doesn't list an index as a view
Issue #6058 complained that "DESCRIBE TABLE" or "DESCRIBE KEYSPACE" list
a secondary index as materialized view (the view used to back the index
in Scylla's implementation of secondary indexes). This patch adds a test
to verify that this issue no longer exists in server-side describe - so we
can mark the issue as fixed.

While preparing this test, I noticed that Scylla and Cassandra behave
differently on whether DESC TABLE should list materialized views or not,
so this patch also includes a test for that as well - and I opened
issue #23014 on Scylla and CASSANDRA-20365 on Cassandra to further
discuss that new issue.

Fixes #6058
Refs #23014.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23015
2025-03-14 14:40:19 +03:00
Nadav Har'El
c0821842de alternator: document the state of tablet support in Alternator
In commit c24bc3b we decided that creating a new table in Alternator
will by default use vnodes - not tablets - because of all the missing
features in our tablets implementation that are important for
Alternator, namely - LWT, CDC and Alternator TTL.

We never documented this, or the fact that we support a tag
`experimental:initial_tablets` which allows to override this decision
and create an Alternator table using tablets. We also never documented
what exactly doesn't work when Alternator uses tablet.

This patch adds the missing documentation in docs/alternator/new-apis.md
(which is a good place for describing the `experimental:initial_tablets`
tag). The patch also adds a new test file, test_tablets.py, which
includes tests for all the statements made in the document regarding
how `experimental:initial_tablets` works and what works or doesn't
work when tablets are enabled.

Two existing tests - for TTL and Streams non-support with tablets -
are moved to the new test file.

When the tablets feature will finally be completed, both the document
and the tests will need to be modified (some of the tests should be
outright deleted). But it seems this will not happen for at least
several months, and that is too long to wait without accurate
documentation.

Fixes #21629

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#22462
2025-03-14 14:03:15 +03:00
Pavel Emelyanov
2bb455ec75 Merge 'Main: stop system_keyspace' from Benny Halevy
This series adds an async guard to system_keyspace operations
and adds a deferred action to stop the system_keyspace in main() before destroying the service.

This helps to make sure that sys_ks is unplugged from its users and that all async operations using it are drained once it's stopped.

* Enhancement, no backport needed

Closes scylladb/scylladb#23113

* github.com:scylladb/scylladb:
  main: stop system keyspace
  system_keyspace: call shutdown from stop
  system_keyspace: shutdown: allow calling more than once
  database, compaction_manager, large_data_handler: use pluggable<system_keysapce>
  utils: add class pluggable
2025-03-14 13:23:28 +03:00
Aleksandra Martyniuk
444c7eab90 repair: do not pass erm to put_row_diff_with_rpc_stream when unnecessary
When small_table_optimization isn't enabled, put_row_diff_with_rpc_stream
does not access erm. Pass small_table_optimization_params containing erm
only when small_table_optimization is enabled.

This is safe as erm is kept by shard_repair_task_impl.
2025-03-14 10:45:52 +01:00
Aleksandra Martyniuk
e56bb5b6e2 repair: do not pass erm to flush_rows_in_working_row_buf when unnecessary
When small_table_optimization isn't enabled, flush_rows_in_working_row_buf
does not access erm. Add small_table_optimization_params containing erm and
pass it only when small_table_optimization is enabled.

This is safe as erm is kept by shard_repair_task_impl.
2025-03-14 10:45:52 +01:00
Aleksandra Martyniuk
09c74aa294 repair: pass session_id to repair_writer_impl::create_writer 2025-03-14 10:45:52 +01:00
Aleksandra Martyniuk
47bb9dcf78 repair: keep materialized topology guard in shard_repair_task_impl
Keep materialized topology guard in shard_repair_task_impl and check
it in check_in_abort_or_shutdown and before each range repair.
2025-03-14 10:41:10 +01:00
Aleksandra Martyniuk
928f92c780 repair: pass session_id to repair_meta
Pass session_id of tablet repair down the stack from the repair request
to repair_meta.

The session_id will be utiziled in the following patches.
2025-03-14 10:20:12 +01:00
Nadav Har'El
a72dde2ee6 test/cqlpy: add test for long table names
Scylla inherited a 48-character limit on the length of table (and
keyspace) names from Cassandra 3. It turns out that Cassandra 4 and
5 unintentionally dropped this limit (see history lesson in
CASSANDRA-20425), and now Cassandra accepts longer table names.
Some Cassandra users are using such longer names and disappointed
that Scylla doesn't allow them.

This patch includes tests for this feature. One test tries a
48-character table name - it passes on Scylla and all versions
of Cassandra. A second test tries a 100-character table name - this
one passes on Cassandra version 4 and above (but not on 3), and
fails on Scylla so marked "xfail". A third test tries a 500-character
table name. This one fails badly on Cassandra (see CASSANDRA-20389),
but passes on Scylla today. This test is important because we need to
be sure that it continues to pass on Scylla even after the Scylla is
fixed to allow the 100-character test.

Refs #4480 - an issue we already have about supporting longer names

Note on the test implementation:
Ideally, the test for a particular table-name length shouldn't just
create the table - it should also make sure we can write table to it
and flush it, i.e., that sstables can get written correctly. But in
practice, these complications are not needed, because in modern Scylla
it is the directory name which contains the table's name, and the
individual sstable files do not contain the table's name. Just creating
the table already creates the long directory name, so that is the part
that needs to be tested. If we created this directory successfully,
later creating the short-named sstables inside it can't fail.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23229
2025-03-14 11:15:07 +03:00
Kefu Chai
a82cfbecad test: perf_sstable: close frag_stream before destoying it
the underlying reader should be closed before being destroyed. otherwise
we'd have following failure when testing the "full_scan_streaming":

```
$ scylla perf-sstable --parallelism 1 --iterations 20 --partitions 20 --testdir /tmp/sstable --mode full_scan_streaming

ERROR 2025-03-13 15:04:26,321 [shard  0:main] mutation_reader - N8sstables2mx27mx_sstable_full_scan_readerE [0x60015a36b650]: permit *.*:test: was not closed before destruction, at: 0x235931e 0x2359470 0x239deb3 0x62a1ed3 0x89fd156 0x89c3fba 0x22a6ed3 0x22a8fea 0x22aae17 0x22a9928 0x26bb7d0 0x26bbe3e 0x89bca67 0x246bd8d /lib64/libc.so.6+0x3247 /lib64/libc.so.6+0x330a 0x1657774
------
   seastar::internal::coroutine_traits_base<double>::promise_type
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23270
2025-03-14 11:12:44 +03:00
Piotr Smaron
d365d9b2ad test/ldap: assign non-busy ports to ldap
It may happen that the ports we randomly choose for LDAP are busy, and
that'd fail the test suite, so once we randomly select ports, now we'll
see if they're busy or not, and if they're busy, we'll select next ones,
until we finally have some free ports for LDAP.
Tested with: `./test.py ldap/ldap_connection_test --repeat 1000 -j 10`:
before the fix, this command fails after ~112 runs, and of course it
passes with the fix.

Fixes: scylladb/scylla-enterprise#5120
Fixes: scylladb/scylladb#23149
Fixes: scylladb/scylladb#23242

Closes scylladb/scylladb#23275
2025-03-14 11:09:19 +03:00
Botond Dénes
68b2ac541c Merge 'streaming: fix the way a reason of streaming failure is determined' from Aleksandra Martyniuk
During streaming receiving node gets and processes mutation fragments.
If this operation fails, receiver responds with -1 status code, unless
it failed due to no_such_column_family in which case streaming of this
table should be skipped.

However, when the table was dropped, an exception handler on receiver
side may get not only data_dictionary::no_such_column_family, but also
seastar::nested_exception of two no_such_column_family.

Encountered example:
```
ERROR 2025-02-12 15:20:51,508 [shard 0:strm] stream_session - [Stream #f1cd6830-e954-11ef-afd9-b022e40bf72d] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=756dd3fe-2bf0-4dcd-afbc-cfd5202669a0: seastar::nested_exception: data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14) (while cleaning up after data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14))
```

In this case, the exception does not match the try_catch<data_dictionary::no_such_column_family>
clause and gets handled the same as any other exception type.

Replace try_catch clause with table_sync_and_check that synchronizes
the schema and check if the table exists.

Fixes: https://github.com/scylladb/scylladb/issues/22834.

Needs backport to all live version, as they all contain the bug

Closes scylladb/scylladb#22868

* github.com:scylladb/scylladb:
  streaming: fix the way a reason of streaming failure is determined
  streaming: save a continuation lambda
  streaming: use streaming namespace in table_check.{cc,hh}
  repair: streaming: move table_check.{cc,hh} to streaming
2025-03-14 07:25:00 +02:00
Kefu Chai
31320399e8 test: sstable_test: use auto instead of statistics to avoid name collision
Replace explicit `statistics` type with `auto` in sstable_test to
resolve name collision. This addresses ambiguity introduced by commit
87c221cb which added `struct statistics` in
`seastar/include/seastar/net/api.hh`, conflicting with the existing
definition in `scylladb/sstables/types.hh` when the `seastar` namespace
is opened.

The `auto` keyword avoids the need to explicitly reference either type,
cleanly resolving the collision while maintaining functionality.
This change prepares for the upcoming change to bump up seastar
submodule.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23249
2025-03-13 22:51:21 +02:00
Avi Kivity
696ce4c982 Merge "convert some parts of the gossiper to host ids" from Gleb
"
This is series starts conversion of the gossiper to use host ids to
index nodes. It does not touch the main map yet, but converts a lot of
internal code to host id. There are also some unrelated cleanups that
were done while working on the series. On of which is dropping code
related to old shadow round. We replaced shadow round with explicit
GOSSIP_GET_ENDPOINT_STATES verb in cd7d64f588
which is in scylla-4.3.0, so there should be no compatibility problem.
We already dropped a lot of old shadow round code in previous patches
anyway.

I tested manually that old and new node can co-exist in the same
cluster,
"

* 'gleb/gossiper-host-id-v2' of github.com:scylladb/scylla-dev: (33 commits)
  gossiper: drop unneeded code
  gossiper: move _expire_time_endpoint_map to host_id
  gossiper: move _just_removed_endpoints to host id
  gossiper: drop unused get_msg_addr function
  messaging_service: change connection dropping notification to pass host id only
  messaging_service: pass host id to remove_rpc_client in down notification
  treewide: pass host id to endpoint_lifecycle_subscriber
  treewide: drop endpoint life cycle subscribers that do nothing
  load_meter: move to host id
  treewide: use host id directly in endpoint state change subscribers
  treewide: pass host id to endpoint state change subscribers
  gossiper: drop deprecated unsafe_assassinate_endpoint operation
  storage_service: drop unused code in handle_state_removed
  treewide: drop endpoint state change subscribers that do nothing
  gossiper: drop ip address from handle_echo_msg and simplify code since host_id is now mandatory
  gossiper: start using host ids to send messages earlier
  messaging_service: add temporary address map entry on incoming connection
  topology_coordinator: notify about IP change from sync_raft_topology_nodes as well
  treewide: move everyone to use host id based gossiper::is_alive and drop ip based one
  storage_proxy: drop unused template
  ...
2025-03-13 13:36:31 +02:00
Kefu Chai
5eba29e376 ent/encryption: correct misspellings
these misspellings were flagged by codespell.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23254
2025-03-13 13:07:34 +02:00
Kefu Chai
9f411f9962 tools/scylla-nodetool: refactor to use std::tie() for cleaner code
Replace explicit pair member access with std::tie() throughout
scylla-nodetool. This simplifies the code by eliminating repetitive
pair.first/pair.second references and makes the codebase more
maintainable and readable.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23250
2025-03-13 11:56:07 +02:00
Dawid Mędrek
0a6137218a db/hints: Cancel draining when stopping node
Draining hints may occur in one of the two scenarios:

* a node leaves the cluster and the local node drains all of the hints
  saved for that node,
* the local node is being decommissioned.

Draining may take some time and the hint manager won't stop until it
finishes. It's not a problem when decommissioning a node, especially
because we want the cluster to retain the data stored in the hints.
However, it may become a problem when the local node started draining
hints saved for another node and now it's being shut down.

There are two reasons for that:

* Generally, in situations like that, we'd like to be able to shut down
  nodes as fast as possible. The data stored in the hints won't
  disappear from the cluster yet since we can restart the local node.
* Draining hints may introduce flakiness in tests. Replaying hints doesn't
  have the highest priority and it's reflected in the scheduling groups we
  use as well as the explicitly enforced throughput. If there are a large
  number of hints to be replayed, it might affect our tests.
  It's already happened, see: scylladb/scylladb#21949.

To solve those problems, we change the semantics of draining. It will behave
as before when the local node is being decommissioned. However, when the
local node is only being stopped, we will immediately cancel all ongoing
draining processes and stop the hint manager. To amend for that, when we
start a node and it initializes a hint endpoint manager corresponding to
a node that's already left the cluster, we will begin the draining process
of that endpoint manager right away.

That should ensure all data is retained, while possibly speeding up
the shutdown process.

There's a small trade-off to it, though. If we stop a node, we can then
remove it. It won't have a chance to replay hints it might've before
these changes, but that's an edge case. We expect this commit to bring
more benefit than harm.

We also provide tests verifying that the implementation works as intended.

Fixes scylladb/scylladb#21949

Closes scylladb/scylladb#22811
2025-03-13 11:55:15 +02:00
Paweł Zakrzewski
d483051e44 cql3/select_statement: reject aggregate functions when PER PARTITION LIMIT is present
Before this patch we silently allowed and ignored PER PARTITION LIMIT.
While using aggregate functions in conjunction with PER PARTITION LIMIT
can make sense, we want to disable it until we can offer proper
implementation, see #9879 for discussion.

We want to match Cassandra, and for queries with aggregate functions it
behaves as follows:
- it silently ignores PER PARTITION LIMIT if GROUP BY is present, which
  matches our previous implementation.
- rejects PER PARTITION LIMIT when GROUP BY is *not* present.

This patch adds rejection of the second group.

Fixes #9879

Closes scylladb/scylladb#23086
2025-03-13 10:29:53 +02:00
Pavel Emelyanov
f50bcbf4d0 test/perf/s3: Don't forget to stop sharded<tester> on error
In case invoke_on_all(tester::start) throws, the sharded<tester>
instance remains non-stopped and calltrace is reported on test stop. Not
nice, fix it so that sharded<> thing is stopped in any case.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23244
2025-03-13 09:54:09 +02:00
Anna Stuchlik
562b5db5b8 doc: Remove "experimental" from ALTER KEYSPACE with Tablets
Altering a keyspace with tablets is no longer experimental.
This commit removes the "Experimental" label from the feature.

Fixes https://github.com/scylladb/scylladb/issues/23166

Closes scylladb/scylladb#23183
2025-03-12 17:41:36 +02:00
Kefu Chai
68fc067106 perf/perf_sstable: fix the indent
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-03-12 19:00:50 +08:00
Kefu Chai
4f62f79622 perf/perf_sstable: stop using at_exit()
seastar::at_exit() was marked deprecated recently. so let's use
the recommended approach to perform cleanups.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-03-12 19:00:50 +08:00
Nadav Har'El
3ca2e6ddda Merge 's3_client: Add retries to Security Token Service/EC2 instance metadata credentials providers' from Ernest Zaslavsky
Several updates and improvements to the retryable HTTP client functionality, as well as enhancements to error handling and integration with AWS services, as part of this PR. Below is a summary of the changes:

- Moved the retryable HTTP client functionality out of the S3 client to improve modularity and reusability across other services like AWS STS.

- Isolated the retryable_http_client into its own file, improving clarity and maintainability.

- Added a make_request method that introduces a response-skipping handler.

- Introduced a custom error handler constructor, providing greater flexibility in handling errors.

- Updated the STS and Instance Metadata Service credentials providers to utilize the new retryable HTTP client, enhancing their robustness and reliability.

- Extended the AWS error list to handle errors specific to the STS service, ensuring more granular and accurate error management for STS operations.

- Enhanced error handling for system errors returned by Seastar’s HTTP client, ensuring smoother operations.

- Properly closed the HTTP client in instance_profile_credentials_provider and sts_assume_role_credentials_provider to prevent resource leaks.

- Reduced the log severity in the retry strategy to avoid SCT test failures that occur when any log message is tagged as an ERROR.

No backport needed since we dont have any s3 related activity on the scylla side been released

Closes scylladb/scylladb#21933

* github.com:scylladb/scylladb:
  s3_client: Adjust Log Severity in Retry Strategy
  aws_error: Enhance error handling for AWS HTTP client
  aws_error: Add STS specific error handling
  credentials_providers: Close retryable clients in Credentials Providers
  credentials_providers: Integrate retryable_http_client with Credentials Providers
  s3_client: enhance `retryable_http_client` functionality
  s3_client: isolate `retryable_http_client`
  s3_client: Prepare for `retryable_http_client` relocation
  s3_client: Remove `is_redirect_status` function
  s3_client: Move retryable functionality out of s3 client
2025-03-12 10:19:15 +02:00
Avi Kivity
b1d9f80d85 Merge 'tablets: Make load balancing capacity-aware' from Tomasz Grabiec
Before this patch, the load balancer was equalizing tablet count per
shard, so it achieved balance assuming that:
 1) tablets have the same size
 2) shards have the same capacity

That can cause imbalance of utilization if shards have different
capacity, which can happen in heterogeneous clusters with different
instance types. One of the causes for capacity difference is that
larger instances run with fewer shards due to vCPUs being dedicated to
IRQ handling. This makes those shards have more disk capacity, and
more CPU power.

After this patch, the load balancer equalizes shard's storage
utilization, so it no longer assumes that shards have the same
capacity. It still assumes that each tablet has equal size. So it's a
middle step towards full size-aware balancing.

One consequence is that to be able to balance, the load balancer need
to know about every node's capacity, which is collected with the same
RPC which collects load_stats for average tablet size. This is not a
significant set back because migrations cannot proceed anyway if nodes
are down due to barriers. We could make intra-node migration
scheduling work without capacity information, but it's pointless due
to above, so not implemented.

Also, per-shard goal for tablet count is still the same for all nodes in the cluster,
so nodes with less capacity will be below limit and nodes with more capacity will
be slightly above limit. This shouldn't be a significant problem in practice, we could
compensate for this by increasing the limit.

Refs #23042

Closes scylladb/scylladb#23079

* github.com:scylladb/scylladb:
  tablets: Make load balancing capacity-aware
  topology_coordinator: Fix confusing log message
  topology_coordinator: Refresh load stats after adding a new node
  topology_coordinator: Allow capacity stats to be refreshed with some nodes down
  topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places
  test: boost: tablets_test: Always provide capacity in load_stats
  test: perf_load_balancing: Set node capacity
  test: perf_load_balancing: Convert to topology_builder
  config, disk_space_monitor: Allow overriding capacity via config
  storage_service, tablets: Collect per-node capacity in load_stats
2025-03-11 14:34:27 +02:00
Gleb Natapov
57f2b6d825 gossiper: drop unneeded code
host_id is already available at this point.
2025-03-11 12:09:22 +02:00
Gleb Natapov
cca228265e gossiper: move _expire_time_endpoint_map to host_id
Index _expire_time_endpoint_map map by host id instead of ip
2025-03-11 12:09:22 +02:00
Gleb Natapov
c45b50bbe6 gossiper: move _just_removed_endpoints to host id
Index _just_removed_endpoints map by host id instead of ip
2025-03-11 12:09:22 +02:00
Gleb Natapov
22739bb39a gossiper: drop unused get_msg_addr function 2025-03-11 12:09:22 +02:00
Gleb Natapov
b3720b80b6 messaging_service: change connection dropping notification to pass host id only
Only host id is needed in the callback anyway.
2025-03-11 12:09:22 +02:00
Gleb Natapov
24d30073f9 messaging_service: pass host id to remove_rpc_client in down notification
Do not iterate over all client indexed by hos id to search for those
with given IP.  Look up by host id directly since now we know it in down
notification. In cases host id is not known look it up by ip.
2025-03-11 12:09:22 +02:00
Gleb Natapov
4ca627b533 treewide: pass host id to endpoint_lifecycle_subscriber 2025-03-11 12:09:22 +02:00
Gleb Natapov
8a747fbc2a treewide: drop endpoint life cycle subscribers that do nothing
Provide default implementation for them instead. Will be easier to rework them later.
2025-03-11 12:09:22 +02:00
Gleb Natapov
525b88f877 load_meter: move to host id
Use host id indexing in load_meter and only convert to ips on api level.
2025-03-11 12:09:22 +02:00
Gleb Natapov
48a1030c91 treewide: use host id directly in endpoint state change subscribers
Now that we have host ids in endpoint state change subscribers some of
them can be simplified by using the id directly instead of locking it up
by ip.
2025-03-11 12:09:22 +02:00
Gleb Natapov
499eb4d17f treewide: pass host id to endpoint state change subscribers 2025-03-11 12:09:22 +02:00
Gleb Natapov
eb59205caf gossiper: drop deprecated unsafe_assassinate_endpoint operation
It was always deprecated.
2025-03-11 12:09:21 +02:00
Gleb Natapov
c17a8b4a76 storage_service: drop unused code in handle_state_removed 2025-03-11 12:09:21 +02:00
Gleb Natapov
696aee3adc treewide: drop endpoint state change subscribers that do nothing
Provide default implementation for them instead. Will be easier to rework them later.
2025-03-11 12:09:21 +02:00
Gleb Natapov
7dcffda6bd gossiper: drop ip address from handle_echo_msg and simplify code since host_id is now mandatory 2025-03-11 12:09:21 +02:00
Gleb Natapov
8425c26462 gossiper: start using host ids to send messages earlier
Send digest ack and ack2 by host ids as well now since the id->ip
mapping is available after receiving digest syn. It allows to convert
more code to host id here.
2025-03-11 12:09:21 +02:00
Gleb Natapov
f0af3f261e messaging_service: add temporary address map entry on incoming connection
We want to move to use host ids as soon as possible. Currently it is
possible only after the full gossiper exchange (because only at this
point gossiper state is added and with it address map entry). To make it
possible to move to host ids earlier this patch adds address map entries
on incoming communication during CLIENT_ID verb processing. The patch
also adds generation to CLIENT_ID to use it when address map is updated.
It is done so that older gossiper entries can be overwritten with newer
mapping in case of IP change.
2025-03-11 12:09:21 +02:00
Gleb Natapov
c3035caeb5 topology_coordinator: notify about IP change from sync_raft_topology_nodes as well
Currently sync_raft_topology_nodes() only send join notification if a
node is new in the topology, but sometimes a node changes IP and the
join notification should be send for the new IP as well. Usually it is
done from ip_address_updater, but topology reload can run first and then
the notification will be missed. The solution is to send notification
during topology reload as well.
2025-03-11 12:09:21 +02:00
Gleb Natapov
0e3dcb7954 treewide: move everyone to use host id based gossiper::is_alive and drop ip based one 2025-03-11 12:09:21 +02:00
Gleb Natapov
56c6e04079 storage_proxy: drop unused template
The storage_proxy::is_alive is called with host_id only.
2025-03-11 12:09:21 +02:00
Gleb Natapov
e47f251178 gossiper: move _live_endpoints and _unreachable_endpoints endpoint to host_id
Index live and dead endpoints by host id. It also allows to simplify
some code that does a translation.
2025-03-11 12:09:21 +02:00
Gleb Natapov
6f05608b5e gossiper: chunk vector using std::views::chunk instead of explicitly code it 2025-03-11 12:09:21 +02:00
Gleb Natapov
0437f558cd idl: generate ip based version of a verb only for verbs that need it
The patch adds new marker for a verb - [[ip]] that means that for this
verb ip version of the verbs needs to be generated. Most of the verbs
do not need it.
2025-03-11 12:09:21 +02:00
Gleb Natapov
3734afe8a5 gossiper: send shutdown notification by host id 2025-03-11 12:09:21 +02:00
Gleb Natapov
ee59baf6fc gossiper: drop old shadow round code
It is no longer used. It was replaced with explicit GOSSIP_GET_ENDPOINT_STATES verb in
cd7d64f588 which is in scylla-4.3.0
2025-03-11 12:09:20 +02:00
Gleb Natapov
f1a82c1d01 gossiper: drop unused get_endpoint_states function 2025-03-11 12:09:20 +02:00
Gleb Natapov
c4a0fbae16 gossiper: check id match inside force_remove_endpoint
Before calling force_remove_endpoint (which works on ip) the code checks
that the ip maps to the correct id (not not remove a new node that
inherited this ip by  mistake). Move the check to the function itself.
2025-03-11 12:09:20 +02:00
Gleb Natapov
52c9217f1b migration_manager: drop unneeded id to ip translation 2025-03-11 12:09:20 +02:00
Gleb Natapov
4420ddaf86 gossiper: move is_gossip_only_member and its users to work on host id 2025-03-11 12:09:20 +02:00
Gleb Natapov
cb2b874942 table: use host id based get_endpoint_state_ptr and skip id->ip translation 2025-03-11 12:09:20 +02:00
Gleb Natapov
2746d391af gossiper: do not ping outdated address
A node may change its IP but some other node in the cluster may still
try to ping it using an old IP because it may receive an outdated gossiper
entry with the old IP. Do not send echo message to the old IP. It will
cause a misusing UP message with old address to be printed.
2025-03-11 12:09:20 +02:00
Gleb Natapov
aaba55073d storage_service: drop outdated code that checks whether raft topology should be used
After raft_topology_change_enabled() was introduced the code does
nothing useful. The function is responsible for the decision if raft topology
is enabled or not.
2025-03-11 12:09:20 +02:00
Gleb Natapov
6952f62869 gossiper: drop unused field from loaded_endpoint_state 2025-03-11 12:09:20 +02:00
Nikos Dragazis
7a6a4f54a5 cql3: secondary index: Limit page size for single-row partitions
The size of the partition range vector was constrained in the previous
patch. Any rows beyond the vector's capacity are discarded.

In the special case of single-row partitions, we know the size of each
partition, so we can enforce this limit on the query itself via the page
size.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-03-10 12:18:49 +02:00
Nikos Dragazis
76b31a3acc cql3: secondary index: Limit the size of partition range vectors
The partition range vector is an std::vector, which means it performs
contiguous allocations. Large allocations are known to cause problems
(e.g., reactor stalls).

For paged queries, limit the vector size to 1000. If more partition keys
are available in the query result, discard them. Ideally, we should not
be fetching them at all, but this is not possible without knowing the
size of each partition.

Currently, each vector element is 120 bytes and the standard allocator's
max preferred contiguous allocation is 128KiB. Therefore, the chosen
value of 1000 satisfies the constraint (128 KiB / 120 = 1092 > 1000).
This should be good enough for most cases. Since secondary index queries
involve one base table query per partition key, these queries are slow.
A higher limit would only make them slower and increase the probability
of a timeout. For the same reason, saving a follow-up paged request from
the client would not increase the efficiency much.

For unpaged queries, do not apply any limit. This means they remain
susceptible to stalls, but unpaged queries are considered unoptimized
anyway.

Finally, update the unit test reproducer since the bug is now fixed.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-03-10 12:18:42 +02:00
Pavel Emelyanov
db70c7bbf7 api: Remove the remaining parse_tables() overload
There's only one caller of it left -- the scrub handler. It can use the
parse_table_infos() one and get table names from it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-10 13:14:10 +03:00
Pavel Emelyanov
89f3c1a91e database: Sanitize flush_tables_on_all_shards()
Previous patch left this method with few uglinesses

- the vector<table_id> argument is named table_names
- the sstring keyspace argument is unused
- the keyspace argument is captured for no use

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-10 13:13:10 +03:00
Pavel Emelyanov
0f9cc956f4 schema_tables: Remove all_table_names()
Now it's unused.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-10 13:12:56 +03:00
Pavel Emelyanov
c2d23d7948 database: Make tables flushing helper use table_info-s, not names
The database::flush_tables_on_all_shards() method accepts a keyspace
name and a vector of table names. Then it converts ks:cf pair for each
of the table name into a table-id and flushes the table with the ID.

All the callers of that method already have or can easily get the vector
of table_id-s, not just names, so make use of this.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-10 13:11:32 +03:00
Pavel Emelyanov
e94dce1725 api: Make keyspace flush endpoint use parse_table_infos() (and a bit more)
Currently the handler in question calls parse_tables() which returns
empty list of tables in the "cf" parameter is missing, or the table
names if it's present. In the former case the handler will call
flush_keyspace_on_all_shards() that just gets all table names from the
keyspace and flushes them all.

This change makes the handler use parse_table_infos() which is different
-- when the "cf" parameter is missing, it gets all tables from the
keyspace. So the handler no longer need to call the keyspace flush, it
can always call the "flush the list of tables" helper.

With that change one of the parse_tables() helpers becomes unused, so
remove it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-10 13:06:55 +03:00
Pavel Emelyanov
5a897d7368 schema_tables,client_state: Switch to using all_table_infos()
There are few more places left that can use all_table_infos() as a
replacement for all_table_names(), patch them.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-10 13:05:59 +03:00
Pavel Emelyanov
da05765746 schema_tables: Tune up some methods to benefit from table_infos
There are convert_schema_to_mutations() and calculate_schema_digest()
that collect table names and then use them to find schema and query
mutations from the table.

Both can use the newly introduced all_table_infos() and use the returned
table_id-s to do the same, thus avoiding re-lookups (which are fast
anyway, but still).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-10 13:01:50 +03:00
Pavel Emelyanov
d7bfa5a545 schema_tables: Introduce all_table_infos()
This method is like all_table_names(), but returns a vector of
table_info-s which is effectively a pair of string name and uuid id.
To be used later, and the string-returning all_table_name() will be
removed very soon too.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-10 12:59:03 +03:00
Ernest Zaslavsky
c8de7619e5 s3_client: Adjust Log Severity in Retry Strategy
* Reduced log severity in retry_strategy.
* Rationale: SCT fails tests when any message is logged as ERROR.
2025-03-10 09:01:47 +02:00
Ernest Zaslavsky
8e46929474 aws_error: Enhance error handling for AWS HTTP client
- Seastar's HTTP client is known to throw exceptions for various reasons, including network errors, TLS errors and other transient issues.
- Update error handling to correctly capture and process all exceptions from Seastar's HTTP client.
- Previously, only aws_exception was handled, causing retryable errors to be missed and `should_retry` not invoked.
- Now, all exceptions trigger the appropriate retry logic per the intended strategy.
- Add tests for the S3 proxy to ensure robustness and reliability of these enhancements.
2025-03-10 09:01:47 +02:00
Ernest Zaslavsky
92a12c96a2 aws_error: Add STS specific error handling
Updated the AWS error list to include handling for errors specific to the STS service. This enhancement ensures more comprehensive error management for STS-related operations.
2025-03-10 09:01:47 +02:00
Ernest Zaslavsky
a371d6cf62 credentials_providers: Close retryable clients in Credentials Providers
Updated `instance_profile_credentials_provider` and `sts_assume_role_credentials_provider` to close the HTTP client appropriately.
2025-03-10 09:01:47 +02:00
Ernest Zaslavsky
45a6e88954 credentials_providers: Integrate retryable_http_client with Credentials Providers
* Updated STS and Instance Metadata Service credentials providers to utilize retryable_http_client.
2025-03-10 09:01:47 +02:00
Ernest Zaslavsky
7c49ee4520 s3_client: enhance retryable_http_client functionality
Enhanced `retryable_http_client` by allowing the injection of a custom error handler through its constructor.
2025-03-10 09:01:47 +02:00
Ernest Zaslavsky
b589a882bb s3_client: isolate retryable_http_client
Relocated `retryable_http_client` into its own dedicated file for improved clarity and maintainability.
2025-03-10 09:01:47 +02:00
Ernest Zaslavsky
5eff83af95 s3_client: Prepare for retryable_http_client relocation
Expose `map_s3_client_exception` outside the S3 client class to facilitate moving `retryable_http_client` to a separate file.
2025-03-10 09:01:47 +02:00
Ernest Zaslavsky
2b3abba10a s3_client: Remove is_redirect_status function
Eliminate the `is_redirect_status` function in favor of the equivalent functionality provided by Seastar's HTTP client.
2025-03-10 09:01:47 +02:00
Ernest Zaslavsky
5b7d4a4136 s3_client: Move retryable functionality out of s3 client
This commit moves the retryable HTTP client functionality out of the S3 client implementation. Since this functionality is also required for other services, such as AWS STS, it has been separated to ensure broader applicability.
2025-03-10 09:01:47 +02:00
Ernest Zaslavsky
050c3cdbc2 tests: Add Tests for Scylla-SSTable S3 Functionality
Extended existing Scylla Tools tests to cover the new functionality of
reading SSTables from S3. This ensures that the new S3 integration is
thoroughly tested and performs as expected.
2025-03-09 10:17:48 +02:00
Ernest Zaslavsky
112b4c8764 docs: Update Scylla Tools Documentation for S3 SSTable Support
Updated the Scylla Tools documentation to include changes related to
the enhanced support for S3-stored SSTables. This update ensures that
the documentation accurately reflects the latest functionality and
improvements.
2025-03-09 09:50:37 +02:00
Ernest Zaslavsky
17e3c01f4e scylla-sstable: Enable Support for S3 SSTables
Configure the sstable manager to correctly handle storage options based
on the input type (local or S3-stored sstables). This tweak allows for
mixing both storage types within a single call, improving flexibility
and functionality.
2025-03-09 09:50:36 +02:00
Ernest Zaslavsky
88c4fa6569 s3: Implement S3 Fully Qualified Name Manipulation Functions
Added utility functions to handle S3 Fully Qualified Names (FQN). These
functions enable parsing, splitting, and identification of S3 paths,
enhancing our ability to work with S3 object storage more effectively.
2025-03-09 09:50:36 +02:00
Ernest Zaslavsky
38165fd285 object_storage: Refactor object_storage.yaml parsing logic
Refactored the parsing of `object_storage.yaml` out of Scylla's `main`
function. This change is made to facilitate reusability of the parsing
logic in other parts of the codebase.
2025-03-09 09:50:36 +02:00
Vlad Zolotarov
f7e1695068 CQL Tracing: set common query parameters in a single function
Each query-type (QUERY, EXECUTE, BATCH) CQL opcode has a number of parameters
in their payload which we always want to record in the Tracing object.
Today it's a Consistency Level, Serial Consistency Level and a Default Timestamp.

Setting each of them individually can lead to a human error when one (or more) of
them would not be set. Let's eliminate such a possibility by defining
a single function that sets them all.

This also allows an easy addition of such parameters to this function in
the future.
2025-03-06 09:30:51 -05:00
Aleksandra Martyniuk
35bc1fe276 streaming: fix the way a reason of streaming failure is determined
During streaming receiving node gets and processes mutation fragments.
If this operation fails, receiver responds with -1 status code, unless
it failed due to no_such_column_family in which case streaming of this
table should be skipped.

However, when the table was dropped, an exception handler on receiver
side may get not only data_dictionary::no_such_column_family, but also
seastar::nested_exception of two no_such_column_family.

Encountered example:
```
ERROR 2025-02-12 15:20:51,508 [shard 0:strm] stream_session - [Stream #f1cd6830-e954-11ef-afd9-b022e40bf72d] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=756dd3fe-2bf0-4dcd-afbc-cfd5202669a0: seastar::nested_exception: data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14) (while cleaning up after data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14))
```

In this case, the exception does not match the try_catch<data_dictionary::no_such_column_family>
clause and gets handled the same as any other exception type.

Replace try_catch clause with table_sync_and_check that synchronizes
the schema and check if the table exists.

Fixes: https://github.com/scylladb/scylladb/issues/22834.
2025-03-06 15:07:14 +01:00
Aleksandra Martyniuk
44748d624d streaming: save a continuation lambda
In the following patches, an additional preemption point will be
added to the coroutine lambda in register_stream_mutation_fragments.

Assign a lambda to a variable to prolong the captures lifetime.
2025-03-06 15:07:09 +01:00
Tomasz Grabiec
c4714180cc tablets: Make load balancing capacity-aware
Before this patch the load balancer was equalizing tablet count per
shard, so it achieved balance assuming that:
 1) tablets have the same size
 2) shards have the same capacity

That can cause imbalance of utilization if shards have different
capacity, which can happen in heterogenous clusters with different
instance types. One of the causes for capacity difference is that
larger instances run with fewer shards due to vCPUs being dedicated to
IRQ handling. This makes those shards have more disk capacity, and
more CPU power.

After this patch, the load balancer equalizes shard's storage
utilization, so it no longer assumes that shards have the same
capacity. It still assummes that each tablet has equal size. So it's a
middle step towards full size-aware balancing.

One consequence is that to be able to balance, the load balancer need
to know about every node's capacity, which is collected with the same
RPC which collects load_stats for average tablet size. This is not a
significant set back because migrations cannot proceed anyway if nodes
are down due to barriers. We could make intra-node migration
scheduling work without capacity information, but it's pointless due
to above, so not implemented.
2025-03-06 13:35:38 +01:00
Tomasz Grabiec
3c0b733943 topology_coordinator: Fix confusing log message
There can be other reasons the plan is empty, tablets may not actually
be balanced. For example, capacity for all the nodes may not be known,
or nodes may be down.
2025-03-06 13:35:37 +01:00
Tomasz Grabiec
40414c4985 topology_coordinator: Refresh load stats after adding a new node
Stats are refreshed every minute by default. Load balancing cannot
happen without capacity information for all normal nodes. To avoid the
delay, trigger refresh after adding a new node.
2025-03-06 13:35:37 +01:00
Tomasz Grabiec
d6f8810e66 topology_coordinator: Allow capacity stats to be refreshed with some nodes down
With capacity-aware balancing, if we're missing capacity for a normal
node, we won't be able to proceed with tablet drain. Consider the
following scenario:

1. Nodes: A, B
2. refresh stats with A and B
3. Add node C
4. Node B goes down
5. removenode B starts
6. stats refreshing fails because B is down

If we don't have capacity stats for node C, load balancer cannot make
decisions and removenode is blocked indefinitely. A reproducer is
added in this patch.

To alleviate that, we allow capacity stats to be collected for nodes
which are reachable, we just don't update the table size part.

To keep table stats monotonic, we cache previous results per node, so
even if it's unreachable now, we use its last reported sizes. It's
still more accurate than not refreshing stats at all. A node can be
down for a long period, and other replicas can grow in size. It's not
perfect, because the stale node can skew the stats in its direction,
but ignoring it completely has its pitfalls too. Better solution is
left for later.
2025-03-06 13:35:37 +01:00
Tomasz Grabiec
af3dce4c8a topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places
Use serialized_action for serialization and batching.
2025-03-06 13:35:37 +01:00
Tomasz Grabiec
69c49fb1a7 test: boost: tablets_test: Always provide capacity in load_stats
Move shared_load_stats to topology_builder.hh so that topology_builder
can maintain it. It will set capacity for all created nodes. Needed
after load balancer requires capacity to make decisions.
2025-03-06 13:35:37 +01:00
Tomasz Grabiec
dfc9101dfd test: perf_load_balancing: Set node capacity
Otherwise, load balancer will not make any plan once it becomes
capacity-aware.
2025-03-06 13:35:37 +01:00
Tomasz Grabiec
6169401dbc test: perf_load_balancing: Convert to topology_builder
The test no longer worked becuase load balancer requires proper schema
in the database now. Convert to topology_builder which builds topology
in the database and create schema in the database (which needs proper
topology).
2025-03-06 13:35:37 +01:00
Tomasz Grabiec
d01cc16d1e config, disk_space_monitor: Allow overriding capacity via config
Intended for testing, or hot-fixing out-of-space issues in production.

Tablet load balancer uses this information for determining per-shard load
so reducing capacity will cause tablets to be migrated away from the node.
2025-03-06 13:35:37 +01:00
Tomasz Grabiec
7e7f1e6f91 storage_service, tablets: Collect per-node capacity in load_stats
New RPC is introduced becuase load_stats was marked "final" in the IDL.

Will be needed by capacity-aware load balancing.
2025-03-06 12:17:32 +01:00
Vlad Zolotarov
ca6bddef35 transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing
A default timestamp (not to confuse with the timestamp passed via 'USING TIMESTAMP' query clause)
can be set using 0x20 flag and the <timestamp> field in the binary CQL frame payload of
QUERY, EXECUTE and BATCH ops. It also happens to be a default of a Java CQL Driver.

However, we were only setting the corresponding info in the CQL Tracing context of a QUERY operation.
For an unknown reason we were not setting this for an EXECUTE and for a BATCH traces (I guess I simply forgot to
set it back then).

This patch fixes this.

Fixes #23173
2025-03-05 20:37:37 -05:00
Aleksandra Martyniuk
faf3aa13db streaming: use streaming namespace in table_check.{cc,hh} 2025-03-05 11:00:03 +01:00
Aleksandra Martyniuk
876cf32e9d repair: streaming: move table_check.{cc,hh} to streaming 2025-03-05 11:00:03 +01:00
Benny Halevy
8ae8275f17 main: stop system keyspace
To prevent internal queries coming
from system_keyspace (like updating compaction history,
for example)

Refs scylladb/scylla-dtest#5581
Refs #22886
Refs #8995

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-05 08:30:23 +02:00
Benny Halevy
7a624e3df8 system_keyspace: call shutdown from stop
and use that to replace the explicit shutdown when stopped
in cql_test_env.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-05 08:30:23 +02:00
Benny Halevy
102aec64d5 system_keyspace: shutdown: allow calling more than once
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-05 08:30:22 +02:00
Benny Halevy
fba88bdd62 database, compaction_manager, large_data_handler: use pluggable<system_keysapce>
To allow safe plug and unplug of the system_keyspace.

This patch follows-up on 917fdb9e53
(more specifically - f9b57df471)
Since just keeping a shared_ptr<system_keyspace> doesn't prevent
stopping the system_keyspace shards, while using the `pluggable`
interface allows safe draining of outstanding async calls
on shutdown, before stopping the system_keyspace.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-05 08:27:23 +02:00
Benny Halevy
13a22cb6fd utils: add class pluggable
A wrapper around a shared service allowing
safe plug and unplug of the service from its user
using a phased-barrier operation permit guarding
the service while in use.

Also add a unit test for this class.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-05 08:25:50 +02:00
Nikos Dragazis
03902e5f17 cql3: untyped_result_set: Store rows in chunked_vector
The `untyped_result_set` stores rows in std::vector.

Switch to `chunked_vector` to prevent large allocations and data copies.
One such case is in secondary index queries, where we convert the result
of the internal index view query into an `untyped_result_set` for
processing. The result is bound by the page size memory limit (1MiB by
default), so it can cause large allocations of this magnitude.

This patch aligns `untyped_result_set` with `result_set`, which also
uses a `chunked_vector`.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-03-04 18:39:32 +02:00
Nikos Dragazis
892690b953 test: Reproduce bug with large allocations from secondary index
Secondary index queries which fetch partitions from the base table can
cause large allocations that can lead to reactor stalls.

Reproduce this with a unit test that runs an indexed query on a table
with thousands of single-row partitions, and checks the memory stats for
any large contiguous allocations.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-03-04 18:39:28 +02:00
933 changed files with 34710 additions and 10998 deletions

14
.github/CODEOWNERS vendored
View File

@@ -1,5 +1,5 @@
# AUTH
auth/* @nuivall @ptrsmrn @KrzaQ
auth/* @nuivall @ptrsmrn
# CACHE
row_cache* @tgrabiec
@@ -25,15 +25,15 @@ compaction/* @raphaelsc
transport/*
# CQL QUERY LANGUAGE
cql3/* @tgrabiec @nuivall @ptrsmrn @KrzaQ
cql3/* @tgrabiec @nuivall @ptrsmrn
# COUNTERS
counters* @nuivall @ptrsmrn @KrzaQ
tests/counter_test* @nuivall @ptrsmrn @KrzaQ
counters* @nuivall @ptrsmrn
tests/counter_test* @nuivall @ptrsmrn
# DOCS
docs/* @annastuchlik @tzach
docs/alternator @annastuchlik @tzach @nyh @nuivall @ptrsmrn @KrzaQ
docs/alternator @annastuchlik @tzach @nyh
# GOSSIP
gms/* @tgrabiec @asias @kbr-scylla
@@ -74,8 +74,8 @@ streaming/* @tgrabiec @asias
service/storage_service.* @tgrabiec @asias
# ALTERNATOR
alternator/* @nyh @nuivall @ptrsmrn @KrzaQ
test/alternator/* @nyh @nuivall @ptrsmrn @KrzaQ
alternator/* @nyh
test/alternator/* @nyh
# HINTED HANDOFF
db/hints/* @piodul @vladzcloudius @eliransin

View File

@@ -1,15 +1,86 @@
This is Scylla's bug tracker, to be used for reporting bugs only.
name: "Report a bug"
description: "File a bug report."
title: "[Bug]: "
type: "bug"
labels: bug
body:
- type: checkboxes
id: terms
attributes:
label: Code of Conduct
description: "This is Scylla's bug tracker, to be used for reporting bugs only.
If you have a question about Scylla, and not a bug, please ask it in
our mailing-list at scylladb-dev@googlegroups.com or in our slack channel.
our forum at https://forum.scylladb.com/ or in our slack channel https://slack.scylladb.com/ "
options:
- label: I have read the disclaimer above and am reporting a suspected malfunction in Scylla.
required: true
- [] I have read the disclaimer above, and I am reporting a suspected malfunction in Scylla.
*Installation details*
Scylla version (or git commit hash):
Cluster size:
OS (RHEL/CentOS/Ubuntu/AWS AMI):
*Hardware details (for performance issues)* Delete if unneeded
Platform (physical/VM/cloud instance type/docker):
Hardware: sockets= cores= hyperthreading= memory=
Disks: (SSD/HDD, count)
- type: input
id: product-version
attributes:
label: product version
description: Scylla version (or git commit hash)
placeholder: ex. scylla-6.1.1
validations:
required: true
- type: input
id: cluster-size
attributes:
label: Cluster Size
validations:
required: true
- type: input
id: os
attributes:
label: OS
placeholder: RHEL/CentOS/Ubuntu/AWS AMI
validations:
required: true
- type: textarea
id: additional-data
attributes:
label: Additional Environmental Data
#description:
placeholder: Add additional data
value: "Platform (physical/VM/cloud instance type/docker):\n
Hardware: sockets= cores= hyperthreading= memory=\n
Disks: (SSD/HDD, count)"
validations:
required: false
- type: textarea
id: reproducer-steps
attributes:
label: Reproduction Steps
placeholder: Describe how to reproduce the problem
value: "The steps to reproduce the problem are:"
validations:
required: true
- type: textarea
id: the-problem
attributes:
label: What is the problem?
placeholder: Describe the problem you found
value: "The problem is that"
validations:
required: true
- type: textarea
id: what-happened
attributes:
label: Expected behavior?
placeholder: Describe what should have happened
value: "I expected that "
validations:
required: true
- type: textarea
id: logs
attributes:
label: Relevant log output
description: Please copy and paste any relevant log output. This will be automatically formatted into code, so no need for backticks.
render: shell

View File

@@ -52,7 +52,7 @@ def create_pull_request(repo, new_branch_name, base_branch_name, pr, backport_pr
if is_draft:
backport_pr.add_to_labels("conflicts")
pr_comment = f"@{pr.user.login} - This PR was marked as draft because it has conflicts\n"
pr_comment += "Please resolve them and mark this PR as ready for review"
pr_comment += "Please resolve them and remove the 'conflicts' label. The PR will be made ready for review automatically."
backport_pr.create_issue_comment(pr_comment)
logging.info(f"Assigned PR to original author: {pr.user}")
return backport_pr
@@ -112,29 +112,45 @@ def backport(repo, pr, version, commits, backport_base_branch, is_collaborator):
is_draft = True
repo_local.git.add(A=True)
repo_local.git.cherry_pick('--continue')
repo_local.git.push(fork_repo, new_branch_name, force=True)
create_pull_request(repo, new_branch_name, backport_base_branch, pr, backport_pr_title, commits,
is_draft, is_collaborator)
# Check if the branch already exists in the remote fork
remote_refs = repo_local.git.ls_remote('--heads', fork_repo, new_branch_name)
if not remote_refs:
# Branch does not exist, create it with a regular push
repo_local.git.push(fork_repo, new_branch_name)
create_pull_request(repo, new_branch_name, backport_base_branch, pr, backport_pr_title, commits,
is_draft, is_collaborator)
else:
logging.info(f"Remote branch {new_branch_name} already exists in fork. Skipping push.")
except GitCommandError as e:
logging.warning(f"GitCommandError: {e}")
def with_github_keyword_prefix(repo, pr):
pattern = rf"(?:fix(?:|es|ed))\s*:?\s*(?:(?:(?:{repo.full_name})?#)|https://github\.com/{repo.full_name}/issues/)(\d+)"
match = re.findall(pattern, pr.body, re.IGNORECASE)
if not match:
for commit in pr.get_commits():
match = re.findall(pattern, commit.commit.message, re.IGNORECASE)
if match:
print(f'{pr.number} has a valid close reference in commit message {commit.sha}')
break
if not match:
print(f'No valid close reference for {pr.number}')
return False
else:
# GitHub issue pattern: #123, scylladb/scylladb#123, or full GitHub URLs
github_pattern = rf"(?:fix(?:|es|ed))\s*:?\s*(?:(?:(?:{repo.full_name})?#)|https://github\.com/{repo.full_name}/issues/)(\d+)"
# JIRA issue pattern: PKG-92 or https://scylladb.atlassian.net/browse/PKG-92
jira_pattern = r"(?:fix(?:|es|ed))\s*:?\s*(?:(?:https://scylladb\.atlassian\.net/browse/)?([A-Z]+-\d+))"
# Check PR body for GitHub issues
github_match = re.findall(github_pattern, pr.body, re.IGNORECASE)
# Check PR body for JIRA issues
jira_match = re.findall(jira_pattern, pr.body, re.IGNORECASE)
match = github_match or jira_match
if match:
return True
for commit in pr.get_commits():
github_match = re.findall(github_pattern, commit.commit.message, re.IGNORECASE)
jira_match = re.findall(jira_pattern, commit.commit.message, re.IGNORECASE)
if github_match or jira_match:
print(f'{pr.number} has a valid close reference in commit message {commit.sha}')
return True
print(f'No valid close reference for {pr.number}')
return False
def main():
args = parse_args()

16
.github/seastar-bad-include.json vendored Normal file
View File

@@ -0,0 +1,16 @@
{
"problemMatcher": [
{
"owner": "seastar-bad-include",
"severity": "error",
"pattern": [
{
"regexp": "^(.+):(\\d+):(.+)$",
"file": 1,
"line": 2,
"message": 3
}
]
}
]
}

View File

@@ -18,7 +18,7 @@ jobs:
// Regular expression pattern to check for "Fixes" prefix
// Adjusted to dynamically insert the repository full name
const pattern = `Fixes:? (?:#|${repo.replace('/', '\\/')}#|https://github\\.com/${repo.replace('/', '\\/')}/issues/)(\\d+)`;
const pattern = `Fixes:? ((?:#|${repo.replace('/', '\\/')}#|https://github\\.com/${repo.replace('/', '\\/')}/issues/)(\\d+)|([A-Z]+-\\d+))`;
const regex = new RegExp(pattern);
if (!regex.test(body)) {

View File

@@ -11,7 +11,8 @@ env:
CLEANER_OUTPUT_PATH: build/clang-include-cleaner.log
# the "idl" subdirectory does not contain C++ source code. the .hh files in it are
# supposed to be processed by idl-compiler.py, so we don't check them using the cleaner
CLEANER_DIRS: test/unit exceptions alternator api auth cdc compaction db dht gms index lang message mutation mutation_writer node_ops redis replica
CLEANER_DIRS: test/unit exceptions alternator api auth cdc compaction db dht gms index lang message mutation mutation_writer node_ops raft redis replica service
SEASTAR_BAD_INCLUDE_OUTPUT_PATH: build/seastar-bad-include.log
permissions: {}
@@ -80,7 +81,24 @@ jobs:
done
- run: |
echo "::remove-matcher owner=clang-include-cleaner::"
- run: |
echo "::add-matcher::.github/seastar-bad-include.json"
- name: check for seastar includes
run: |
git -c safe.directory="$PWD" \
grep -nE '#include +"seastar/' \
| tee "$SEASTAR_BAD_INCLUDE_OUTPUT_PATH"
- run: |
echo "::remove-matcher owner=seastar-bad-include::"
- uses: actions/upload-artifact@v4
with:
name: Logs (clang-include-cleaner)
path: "./${{ env.CLEANER_OUTPUT_PATH }}"
name: Logs
path: |
${{ env.CLEANER_OUTPUT_PATH }}
${{ env.SEASTAR_BAD_INCLUDE_OUTPUT_PATH }}
- name: fail if seastar headers are included as an internal library
run: |
if [ -s "$SEASTAR_BAD_INCLUDE_OUTPUT_PATH" ]; then
echo "::error::Found #include \"seastar/ in the source code. Use angle brackets instead."
exit 1
fi

View File

@@ -16,6 +16,13 @@ jobs:
pull-requests: write
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
repository: ${{ github.repository }}
ref: ${{ env.DEFAULT_BRANCH }}
token: ${{ secrets.AUTO_BACKPORT_TOKEN }}
fetch-depth: 1
- name: Mark pull request as ready for review
run: gh pr ready "${{ github.event.pull_request.number }}"
env:

View File

@@ -13,6 +13,8 @@ jobs:
issues: write
pull-requests: write
steps:
- name: Wait for label to be added
run: sleep 1m
- uses: mheap/github-action-required-labels@v5
with:
mode: minimum

5
.gitmodules vendored
View File

@@ -1,6 +1,6 @@
[submodule "seastar"]
path = seastar
url = ../seastar
url = ../scylla-seastar
ignore = dirty
[submodule "swagger-ui"]
path = swagger-ui
@@ -9,9 +9,6 @@
[submodule "abseil"]
path = abseil
url = ../abseil-cpp
[submodule "scylla-tools"]
path = tools/java
url = ../scylla-tools-java
[submodule "scylla-python3"]
path = tools/python3
url = ../scylla-python3

View File

@@ -163,14 +163,6 @@ file(MAKE_DIRECTORY "${scylla_gen_build_dir}")
include(add_version_library)
generate_scylla_version()
add_library(scylla-zstd STATIC
zstd.cc)
target_link_libraries(scylla-zstd
PRIVATE
db
Seastar::seastar
zstd::libzstd)
add_library(scylla-main STATIC)
target_sources(scylla-main
PRIVATE
@@ -182,7 +174,7 @@ target_sources(scylla-main
compress.cc
converting_mutation_partition_applier.cc
counters.cc
direct_failure_detector/failure_detector.cc
sstable_dict_autotrainer.cc
duration.cc
exceptions/exceptions.cc
frozen_schema.cc
@@ -204,6 +196,7 @@ target_sources(scylla-main
reader_concurrency_semaphore_group.cc
schema_mutations.cc
serializer.cc
service/direct_failure_detector/failure_detector.cc
sstables_loader.cc
table_helper.cc
tasks/task_handler.cc
@@ -214,7 +207,6 @@ target_sources(scylla-main
vint-serialization.cc)
target_link_libraries(scylla-main
PRIVATE
"$<LINK_LIBRARY:WHOLE_ARCHIVE,scylla-zstd>"
db
absl::headers
absl::btree

View File

@@ -220,28 +220,9 @@ On a development machine, one might run Scylla as
$ SCYLLA_HOME=$HOME/scylla build/release/scylla --overprovisioned --developer-mode=yes
```
To interact with scylla it is recommended to build our versions of
cqlsh and nodetool. They are available at
https://github.com/scylladb/scylla-tools-java and can be built with
```bash
$ sudo ./install-dependencies.sh
$ ant jar
```
cqlsh should work out of the box, but nodetool depends on a running
scylla-jmx (https://github.com/scylladb/scylla-jmx). It can be build
with
```bash
$ mvn package
```
and must be started with
```bash
$ ./scripts/scylla-jmx
```
To interact with scylla it is recommended to build our version of
cqlsh. It is available at
https://github.com/scylladb/scylla-cqlsh and is available as a submodule.
### Branches and tags

View File

@@ -78,7 +78,7 @@ fi
# Default scylla product/version tags
PRODUCT=scylla
VERSION=2025.2.0-dev
VERSION=2025.2.6
if test -f version
then

View File

@@ -24,7 +24,7 @@ static constexpr uint64_t KB = 1024ULL;
static constexpr uint64_t RCU_BLOCK_SIZE_LENGTH = 4*KB;
static constexpr uint64_t WCU_BLOCK_SIZE_LENGTH = 1*KB;
static bool should_add_capacity(const rjson::value& request) {
bool consumed_capacity_counter::should_add_capacity(const rjson::value& request) {
const rjson::value* return_consumed = rjson::find(request, "ReturnConsumedCapacity");
if (!return_consumed) {
return false;
@@ -62,15 +62,22 @@ static uint64_t calculate_half_units(uint64_t unit_block_size, uint64_t total_by
rcu_consumed_capacity_counter::rcu_consumed_capacity_counter(const rjson::value& request, bool is_quorum) :
consumed_capacity_counter(should_add_capacity(request)),_is_quorum(is_quorum) {
}
uint64_t rcu_consumed_capacity_counter::get_half_units(uint64_t total_bytes, bool is_quorum) noexcept {
return calculate_half_units(RCU_BLOCK_SIZE_LENGTH, total_bytes, is_quorum);
}
uint64_t rcu_consumed_capacity_counter::get_half_units() const noexcept {
return calculate_half_units(RCU_BLOCK_SIZE_LENGTH, _total_bytes, _is_quorum);
return get_half_units(_total_bytes, _is_quorum);
}
uint64_t wcu_consumed_capacity_counter::get_half_units() const noexcept {
return calculate_half_units(WCU_BLOCK_SIZE_LENGTH, _total_bytes, true);
}
uint64_t wcu_consumed_capacity_counter::get_units(uint64_t total_bytes) noexcept {
return calculate_half_units(WCU_BLOCK_SIZE_LENGTH, total_bytes, true) * HALF_UNIT_MULTIPLIER;
}
wcu_consumed_capacity_counter::wcu_consumed_capacity_counter(const rjson::value& request) :
consumed_capacity_counter(should_add_capacity(request)) {
}

View File

@@ -42,21 +42,25 @@ public:
*/
virtual uint64_t get_half_units() const noexcept = 0;
uint64_t _total_bytes = 0;
static bool should_add_capacity(const rjson::value& request);
protected:
bool _should_add_to_reponse = false;
};
class rcu_consumed_capacity_counter : public consumed_capacity_counter {
virtual uint64_t get_half_units() const noexcept;
bool _is_quorum = false;
public:
rcu_consumed_capacity_counter(const rjson::value& request, bool is_quorum);
rcu_consumed_capacity_counter(): consumed_capacity_counter(false), _is_quorum(false){}
virtual uint64_t get_half_units() const noexcept;
static uint64_t get_half_units(uint64_t total_bytes, bool is_quorum) noexcept;
};
class wcu_consumed_capacity_counter : public consumed_capacity_counter {
virtual uint64_t get_half_units() const noexcept;
public:
wcu_consumed_capacity_counter(const rjson::value& request);
static uint64_t get_units(uint64_t total_bytes) noexcept;
};
}

File diff suppressed because it is too large Load Diff

View File

@@ -241,7 +241,8 @@ public:
const query::partition_slice&& slice,
shared_ptr<cql3::selection::selection> selection,
foreign_ptr<lw_shared_ptr<query::result>> query_result,
shared_ptr<const std::optional<attrs_to_get>> attrs_to_get);
shared_ptr<const std::optional<attrs_to_get>> attrs_to_get,
uint64_t& rcu_half_units);
static void describe_single_item(const cql3::selection::selection&,
const std::vector<managed_bytes_opt>&,

View File

@@ -165,7 +165,9 @@ static std::optional<std::string> resolve_path_component(const std::string& colu
fmt::format("ExpressionAttributeNames missing entry '{}' required by expression", column_name));
}
used_attribute_names.emplace(column_name);
return std::string(rjson::to_string_view(*value));
auto result = std::string(rjson::to_string_view(*value));
validate_attr_name_length("", result.size(), false, "ExpressionAttributeNames contains invalid value: ");
return result;
}
return std::nullopt;
}
@@ -737,6 +739,26 @@ rjson::value calculate_value(const parsed::set_rhs& rhs,
return rjson::null_value();
}
void validate_attr_name_length(std::string_view supplementary_context, size_t attr_name_length, bool is_key, std::string_view error_msg_prefix) {
constexpr const size_t DYNAMODB_KEY_ATTR_NAME_SIZE_MAX = 255;
constexpr const size_t DYNAMODB_NONKEY_ATTR_NAME_SIZE_MAX = 65535;
const size_t max_length = is_key ? DYNAMODB_KEY_ATTR_NAME_SIZE_MAX : DYNAMODB_NONKEY_ATTR_NAME_SIZE_MAX;
if (attr_name_length > max_length) {
std::string error_msg;
if (!error_msg_prefix.empty()) {
error_msg += error_msg_prefix;
}
if (!supplementary_context.empty()) {
error_msg += "in ";
error_msg += supplementary_context;
error_msg += " - ";
}
error_msg += fmt::format("Attribute name is too large, must be less than {} bytes", std::to_string(max_length + 1));
throw api_error::validation(error_msg);
}
}
} // namespace alternator
auto fmt::formatter<alternator::parsed::path>::format(const alternator::parsed::path& p, fmt::format_context& ctx) const

View File

@@ -91,6 +91,18 @@ options {
throw expressions_syntax_error(format("{} at char {}", err,
ex->get_charPositionInLine()));
}
// ANTLR3 tries to recover missing tokens - it tries to finish parsing
// and create valid objects, as if the missing token was there.
// But it has a bug and leaks these tokens.
// We override offending method and handle abandoned pointers.
std::vector<std::unique_ptr<TokenType>> _missing_tokens;
TokenType* getMissingSymbol(IntStreamType* istream, ExceptionBaseType* e,
ANTLR_UINT32 expectedTokenType, BitsetListType* follow) {
auto token = BaseType::getMissingSymbol(istream, e, expectedTokenType, follow);
_missing_tokens.emplace_back(token);
return token;
}
}
@lexer::context {
void displayRecognitionError(ANTLR_UINT8** token_names, ExceptionBaseType* ex) {

View File

@@ -91,5 +91,7 @@ rjson::value calculate_value(const parsed::value& v,
rjson::value calculate_value(const parsed::set_rhs& rhs,
const rjson::value* previous_item);
void validate_attr_name_length(std::string_view supplementary_context, size_t attr_name_length, bool is_key, std::string_view error_msg_prefix = {});
} /* namespace alternator */

View File

@@ -228,9 +228,8 @@ protected:
// If the rack does not exist, we return an empty list - not an error.
sstring query_rack = req->get_query_param("rack");
for (auto& id : local_dc_nodes) {
auto ip = _gossiper.get_address_map().get(id);
if (!query_rack.empty()) {
auto rack = _gossiper.get_application_state_value(ip, gms::application_state::RACK);
auto rack = _gossiper.get_application_state_value(id, gms::application_state::RACK);
if (rack != query_rack) {
continue;
}
@@ -238,10 +237,10 @@ protected:
// Note that it's not enough for the node to be is_alive() - a
// node joining the cluster is also "alive" but not responsive to
// requests. We alive *and* normal. See #19694, #21538.
if (_gossiper.is_alive(ip) && _gossiper.is_normal(ip)) {
if (_gossiper.is_alive(id) && _gossiper.is_normal(id)) {
// Use the gossiped broadcast_rpc_address if available instead
// of the internal IP address "ip". See discussion in #18711.
rjson::push_back(results, rjson::from_string(_gossiper.get_rpc_address(ip)));
rjson::push_back(results, rjson::from_string(_gossiper.get_rpc_address(id)));
}
}
rep->set_status(reply::status_type::ok);
@@ -505,7 +504,7 @@ server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gos
, _key_cache(1024, 1min, slogger)
, _enforce_authorization(false)
, _enabled_servers{}
, _pending_requests{}
, _pending_requests("alternator::server::pending_requests")
, _timeout_config(_proxy.data_dictionary().get_config())
, _callbacks{
{"CreateTable", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {

View File

@@ -41,7 +41,7 @@ class server : public peering_sharded_service<server> {
key_cache _key_cache;
utils::updateable_value<bool> _enforce_authorization;
utils::small_vector<std::reference_wrapper<seastar::httpd::http_server>, 2> _enabled_servers;
gate _pending_requests;
named_gate _pending_requests;
// In some places we will need a CQL updateable_timeout_config object even
// though it isn't really relevant for Alternator which defines its own
// timeouts separately. We can create this object only once.

View File

@@ -14,7 +14,20 @@
namespace alternator {
const char* ALTERNATOR_METRICS = "alternator";
static seastar::metrics::histogram estimated_histogram_to_metrics(const utils::estimated_histogram& histogram) {
seastar::metrics::histogram res;
res.buckets.resize(histogram.bucket_offsets.size());
uint64_t cumulative_count = 0;
res.sample_count = histogram._count;
res.sample_sum = histogram._sample_sum;
for (size_t i = 0; i < res.buckets.size(); i++) {
auto& v = res.buckets[i];
v.upper_bound = histogram.bucket_offsets[i];
cumulative_count += histogram.buckets[i];
v.count = cumulative_count;
}
return res;
}
stats::stats() : api_operations{} {
// Register the
seastar::metrics::label op("op");
@@ -95,22 +108,26 @@ stats::stats() : api_operations{} {
seastar::metrics::description("number of rows read during filtering operations"))(alternator_label).set_skip_when_empty(),
seastar::metrics::make_total_operations("filtered_rows_matched_total", cql_stats.filtered_rows_matched_total,
seastar::metrics::description("number of rows read and matched during filtering operations")),
seastar::metrics::make_counter("rcu_total", rcu_total,
seastar::metrics::description("total number of consumed read units, counted as half units"))(alternator_label).set_skip_when_empty(),
seastar::metrics::make_counter("rcu_total", [this]{return 0.5 * rcu_half_units_total;},
seastar::metrics::description("total number of consumed read units"))(alternator_label).set_skip_when_empty(),
seastar::metrics::make_counter("wcu_total", wcu_total[wcu_types::PUT_ITEM],
seastar::metrics::description("total number of consumed write units, counted as half units"),{op("PutItem")})(alternator_label).set_skip_when_empty(),
seastar::metrics::description("total number of consumed write units"),{op("PutItem")})(alternator_label).set_skip_when_empty(),
seastar::metrics::make_counter("wcu_total", wcu_total[wcu_types::DELETE_ITEM],
seastar::metrics::description("total number of consumed write units, counted as half units"),{op("DeleteItem")})(alternator_label).set_skip_when_empty(),
seastar::metrics::description("total number of consumed write units"),{op("DeleteItem")})(alternator_label).set_skip_when_empty(),
seastar::metrics::make_counter("wcu_total", wcu_total[wcu_types::UPDATE_ITEM],
seastar::metrics::description("total number of consumed write units, counted as half units"),{op("UpdateItem")})(alternator_label).set_skip_when_empty(),
seastar::metrics::description("total number of consumed write units"),{op("UpdateItem")})(alternator_label).set_skip_when_empty(),
seastar::metrics::make_counter("wcu_total", wcu_total[wcu_types::INDEX],
seastar::metrics::description("total number of consumed write units, counted as half units"),{op("Index")})(alternator_label).set_skip_when_empty(),
seastar::metrics::description("total number of consumed write units"),{op("Index")})(alternator_label).set_skip_when_empty(),
seastar::metrics::make_total_operations("filtered_rows_dropped_total", [this] { return cql_stats.filtered_rows_read_total - cql_stats.filtered_rows_matched_total; },
seastar::metrics::description("number of rows read and dropped during filtering operations"))(alternator_label).set_skip_when_empty(),
seastar::metrics::make_counter("batch_item_count", seastar::metrics::description("The total number of items processed across all batches"),{op("BatchWriteItem")},
api_operations.batch_write_item_batch_total)(alternator_label).set_skip_when_empty(),
seastar::metrics::make_counter("batch_item_count", seastar::metrics::description("The total number of items processed across all batches"),{op("BatchGetItem")},
api_operations.batch_get_item_batch_total)(alternator_label).set_skip_when_empty(),
seastar::metrics::make_histogram("batch_item_count_histogram", seastar::metrics::description("Histogram of the number of items in a batch request"),{op("BatchGetItem")},
[this]{ return estimated_histogram_to_metrics(api_operations.batch_get_item_histogram);})(alternator_label).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("batch_item_count_histogram", seastar::metrics::description("Histogram of the number of items in a batch request"),{op("BatchWriteItem")},
[this]{ return estimated_histogram_to_metrics(api_operations.batch_write_item_histogram);})(alternator_label).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
});
}

View File

@@ -12,6 +12,7 @@
#include <seastar/core/metrics_registration.hh>
#include "utils/histogram.hh"
#include "utils/estimated_histogram.hh"
#include "cql3/stats.hh"
namespace alternator {
@@ -75,6 +76,9 @@ public:
utils::timed_rate_moving_average_summary_and_histogram batch_write_item_latency;
utils::timed_rate_moving_average_summary_and_histogram batch_get_item_latency;
utils::timed_rate_moving_average_summary_and_histogram get_records_latency;
utils::estimated_histogram batch_get_item_histogram{22}; // a histogram that covers the range 1 - 100
utils::estimated_histogram batch_write_item_histogram{22}; // a histogram that covers the range 1 - 100
} api_operations;
// Miscellaneous event counters
uint64_t total_operations = 0;
@@ -84,7 +88,7 @@ public:
uint64_t shard_bounce_for_lwt = 0;
uint64_t requests_blocked_memory = 0;
uint64_t requests_shed = 0;
uint64_t rcu_total = 0;
uint64_t rcu_half_units_total = 0;
// wcu can results from put, update, delete and index
// Index related will be done on top of the operation it comes with
enum wcu_types {

View File

@@ -808,6 +808,9 @@ future<executor::request_return_type> executor::get_records(client_state& client
if (limit < 1) {
throw api_error::validation("Limit must be 1 or more");
}
if (limit > 1000) {
throw api_error::validation("Limit must be less than or equal to 1000");
}
auto db = _proxy.data_dictionary();
schema_ptr schema, base;

View File

@@ -136,14 +136,6 @@
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"unsafe",
"description":"Set to True to perform an unsafe assassination",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}

View File

@@ -984,7 +984,7 @@
]
},
{
"path":"/storage_service/cleanup_all",
"path":"/storage_service/cleanup_all/",
"operations":[
{
"method":"POST",
@@ -994,6 +994,30 @@
"produces":[
"application/json"
],
"parameters":[
{
"name":"global",
"description":"true if cleanup of entire cluster is requested",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/mark_node_as_clean",
"operations":[
{
"method":"POST",
"summary":"Mark the node as clean. After that the node will not be considered as needing cleanup during automatic cleanup which is triggered by some topology operations",
"type":"void",
"nickname":"reset_cleanup_needed",
"produces":[
"application/json"
],
"parameters":[]
}
]
@@ -2144,6 +2168,31 @@
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"skip_cleanup",
"description":"Don't cleanup keys from loaded sstables. Invalid if load_and_stream is true",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"skip_reshape",
"description":"Don't reshape the loaded sstables. Invalid if load_and_stream is true",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"scope",
"description":"Defines the set of nodes to which mutations can be streamed",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query",
"enum": ["all", "dc", "rack", "node"]
}
]
}
@@ -3027,6 +3076,73 @@
}
]
},
{
"path":"/storage_service/retrain_dict",
"operations":[
{
"method":"POST",
"summary":"Retrain the SSTable compression dictionary for the target table.",
"type":"void",
"nickname":"retrain_dict",
"produces":[
"application/json"
],
"parameters":[
{
"name":"keyspace",
"description":"Name of the keyspace containing the target table.",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"cf",
"description":"Name of the target table.",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/estimate_compression_ratios",
"operations":[
{
"method":"GET",
"summary":"Compute an estimated compression ratio for SSTables of the given table, for various compression configurations.",
"type":"array",
"items":{
"type":"compression_config_result"
},
"nickname":"estimate_compression_ratios",
"produces":[
"application/json"
],
"parameters":[
{
"name":"keyspace",
"description":"Name of the keyspace containing the target table.",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"cf",
"description":"Name of the target table.",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/raft_topology/reload",
"operations":[
@@ -3069,6 +3185,22 @@
]
}
]
},
{
"path":"/storage_service/raft_topology/cmd_rpc_status",
"operations":[
{
"method":"GET",
"summary":"Get information about currently running topology cmd rpc",
"type":"string",
"nickname":"raft_topology_get_cmd_status",
"produces":[
"application/json"
],
"parameters":[
]
}
]
}
],
"models":{
@@ -3205,11 +3337,11 @@
"properties":{
"start_token":{
"type":"string",
"description":"The range start token"
"description":"The range start token (exclusive)"
},
"end_token":{
"type":"string",
"description":"The range start token"
"description":"The range end token (inclusive)"
},
"endpoints":{
"type":"array",
@@ -3328,6 +3460,32 @@
"type":"string"
}
}
},
"compression_config_result":{
"id":"compression_config_result",
"description":"Compression ratio estimation result for one config",
"properties":{
"level":{
"type":"long",
"description":"The used value of `compression_level`"
},
"chunk_length_in_kb":{
"type":"long",
"description":"The used value of `chunk_length_in_kb`"
},
"dict":{
"type":"string",
"description":"The used dictionary: `none`, `past` (== current), or `future`"
},
"sstable_compression":{
"type":"string",
"description":"The used compressor name (aka `sstable_compression`)"
},
"ratio":{
"type":"float",
"description":"The resulting compression ratio (estimated on a random sample of files)"
}
}
}
}
}

View File

@@ -902,17 +902,13 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
ss::enable_auto_compaction.set(r, [&ctx](std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req);
auto tables = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");
auto [keyspace, tables] = parse_table_infos(ctx, *req);
apilog.info("enable_auto_compaction: keyspace={} tables={}", keyspace, tables);
return set_tables_autocompaction(ctx, std::move(tables), true);
});
ss::disable_auto_compaction.set(r, [&ctx](std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req);
auto tables = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");
auto [keyspace, tables] = parse_table_infos(ctx, *req);
apilog.info("disable_auto_compaction: keyspace={} tables={}", keyspace, tables);
return set_tables_autocompaction(ctx, std::move(tables), false);
});
@@ -936,17 +932,13 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
});
ss::enable_tombstone_gc.set(r, [&ctx](std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req);
auto tables = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");
auto [keyspace, tables] = parse_table_infos(ctx, *req);
apilog.info("enable_tombstone_gc: keyspace={} tables={}", keyspace, tables);
return set_tables_tombstone_gc(ctx, std::move(tables), true);
});
ss::disable_tombstone_gc.set(r, [&ctx](std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req);
auto tables = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");
auto [keyspace, tables] = parse_table_infos(ctx, *req);
apilog.info("disable_tombstone_gc: keyspace={} tables={}", keyspace, tables);
return set_tables_tombstone_gc(ctx, std::move(tables), false);
});
@@ -1054,7 +1046,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
return ctx.db.map_reduce0([key, uuid] (replica::database& db) -> future<std::unordered_set<sstring>> {
auto sstables = co_await db.find_column_family(uuid).get_sstables_by_partition_key(key);
co_return sstables | std::views::transform([] (auto s) { return s->get_filename(); }) | std::ranges::to<std::unordered_set>();
co_return sstables | std::views::transform([] (auto s) -> sstring { return fmt::to_string(s->get_filename()); }) | std::ranges::to<std::unordered_set>();
}, std::unordered_set<sstring>(),
[](std::unordered_set<sstring> a, std::unordered_set<sstring>&& b) mutable {
a.merge(b);

View File

@@ -111,8 +111,7 @@ void set_compaction_manager(http_context& ctx, routes& r, sharded<compaction_man
});
cm::stop_keyspace_compaction.set(r, [&ctx] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto ks_name = validate_keyspace(ctx, req);
auto tables = parse_table_infos(ks_name, ctx, req->query_parameters, "tables");
auto [ks_name, tables] = parse_table_infos(ctx, *req, "tables");
auto type = req->get_query_param("type");
co_await ctx.db.invoke_on_all([&] (replica::database& db) {
auto& cm = db.get_compaction_manager();

View File

@@ -22,10 +22,10 @@ void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {
return g.container().invoke_on(0, [] (gms::gossiper& g) {
std::vector<fd::endpoint_state> res;
res.reserve(g.num_endpoints());
g.for_each_endpoint_state([&] (const gms::inet_address& addr, const gms::endpoint_state& eps) {
g.for_each_endpoint_state([&] (const gms::endpoint_state& eps) {
fd::endpoint_state val;
val.addrs = fmt::to_string(addr);
val.is_alive = g.is_alive(addr);
val.addrs = fmt::to_string(eps.get_ip());
val.is_alive = g.is_alive(eps.get_host_id());
val.generation = eps.get_heart_beat_state().get_generation().value();
val.version = eps.get_heart_beat_state().get_heart_beat_version().value();
val.update_time = eps.get_update_timestamp().time_since_epoch().count();
@@ -65,8 +65,8 @@ void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {
fd::get_simple_states.set(r, [&g] (std::unique_ptr<request> req) {
return g.container().invoke_on(0, [] (gms::gossiper& g) {
std::map<sstring, sstring> nodes_status;
g.for_each_endpoint_state([&] (const gms::inet_address& node, const gms::endpoint_state&) {
nodes_status.emplace(fmt::to_string(node), g.is_alive(node) ? "UP" : "DOWN");
g.for_each_endpoint_state([&] (const gms::endpoint_state& es) {
nodes_status.emplace(fmt::to_string(es.get_ip()), g.is_alive(es.get_host_id()) ? "UP" : "DOWN");
});
return make_ready_future<json::json_return_type>(map_to_key_value<fd::mapper>(nodes_status));
});
@@ -81,7 +81,7 @@ void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {
fd::get_endpoint_state.set(r, [&g] (std::unique_ptr<request> req) {
return g.container().invoke_on(0, [req = std::move(req)] (gms::gossiper& g) {
auto state = g.get_endpoint_state_ptr(gms::inet_address(req->get_path_param("addr")));
auto state = g.get_endpoint_state_ptr(g.get_host_id(gms::inet_address(req->get_path_param("addr"))));
if (!state) {
return make_ready_future<json::json_return_type>(format("unknown endpoint {}", req->get_path_param("addr")));
}

View File

@@ -35,37 +35,32 @@ void set_gossiper(http_context& ctx, routes& r, gms::gossiper& g) {
gms::inet_address ep(req->get_path_param("addr"));
// synchronize unreachable_members on all shards
co_await g.get_unreachable_members_synchronized();
co_return g.get_endpoint_downtime(ep);
co_return g.get_endpoint_downtime(g.get_host_id(ep));
});
httpd::gossiper_json::get_current_generation_number.set(r, [&g] (std::unique_ptr<http::request> req) {
gms::inet_address ep(req->get_path_param("addr"));
return g.get_current_generation_number(ep).then([] (gms::generation_type res) {
return g.get_current_generation_number(g.get_host_id(ep)).then([] (gms::generation_type res) {
return make_ready_future<json::json_return_type>(res.value());
});
});
httpd::gossiper_json::get_current_heart_beat_version.set(r, [&g] (std::unique_ptr<http::request> req) {
gms::inet_address ep(req->get_path_param("addr"));
return g.get_current_heart_beat_version(ep).then([] (gms::version_type res) {
return g.get_current_heart_beat_version(g.get_host_id(ep)).then([] (gms::version_type res) {
return make_ready_future<json::json_return_type>(res.value());
});
});
httpd::gossiper_json::assassinate_endpoint.set(r, [&g](std::unique_ptr<http::request> req) {
if (req->get_query_param("unsafe") != "True") {
return g.assassinate_endpoint(req->get_path_param("addr")).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
}
return g.unsafe_assassinate_endpoint(req->get_path_param("addr")).then([] {
return g.assassinate_endpoint(req->get_path_param("addr")).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
});
httpd::gossiper_json::force_remove_endpoint.set(r, [&g](std::unique_ptr<http::request> req) {
gms::inet_address ep(req->get_path_param("addr"));
return g.force_remove_endpoint(ep, gms::null_permit_id).then([] {
return g.force_remove_endpoint(g.get_host_id(ep), gms::null_permit_id).then([] () {
return make_ready_future<json::json_return_type>(json_void());
});
});

View File

@@ -148,7 +148,7 @@ void set_messaging_service(http_context& ctx, routes& r, sharded<netw::messaging
hf::inject_disconnect.set(r, [&ms] (std::unique_ptr<request> req) -> future<json::json_return_type> {
auto ip = msg_addr(req->get_path_param("ip"));
co_await ms.invoke_on_all([ip] (netw::messaging_service& ms) {
ms.remove_rpc_client(ip);
ms.remove_rpc_client(ip, std::nullopt);
});
co_return json::json_void();
});

View File

@@ -11,7 +11,7 @@
#include "cql3/query_processor.hh"
#include "cql3/untyped_result_set.hh"
#include "db/consistency_level_type.hh"
#include "seastar/json/json_elements.hh"
#include <seastar/json/json_elements.hh>
#include "transport/controller.hh"
#include <unordered_map>

View File

@@ -14,9 +14,13 @@
#include "api/scrub_status.hh"
#include "db/config.hh"
#include "db/schema_tables.hh"
#include "gms/feature_service.hh"
#include "schema/schema_builder.hh"
#include "sstables/sstables_manager.hh"
#include "utils/hash.hh"
#include <optional>
#include <sstream>
#include <stdexcept>
#include <time.h>
#include <algorithm>
#include <functional>
@@ -29,6 +33,7 @@
#include "service/raft/raft_group0_client.hh"
#include "service/storage_service.hh"
#include "service/load_meter.hh"
#include "gms/feature_service.hh"
#include "gms/gossiper.hh"
#include "db/system_keyspace.hh"
#include <seastar/http/exception.hh>
@@ -55,6 +60,7 @@
#include "db/view/view_builder.hh"
#include "utils/rjson.hh"
#include "utils/user_provided_param.hh"
#include "sstable_dict_autotrainer.hh"
using namespace seastar::httpd;
using namespace std::chrono_literals;
@@ -127,32 +133,6 @@ int64_t validate_int(const sstring& param) {
return std::atoll(param.c_str());
}
// splits a request parameter assumed to hold a comma-separated list of table names
// verify that the tables are found, otherwise a bad_param_exception exception is thrown
// containing the description of the respective no_such_column_family error.
static std::vector<sstring> parse_tables(const sstring& ks_name, const http_context& ctx, sstring value) {
if (value.empty()) {
return map_keys(ctx.db.local().find_keyspace(ks_name).metadata().get()->cf_meta_data());
}
std::vector<sstring> names = split(value, ",");
try {
for (const auto& table_name : names) {
ctx.db.local().find_column_family(ks_name, table_name);
}
} catch (const replica::no_such_column_family& e) {
throw bad_param_exception(e.what());
}
return names;
}
static std::vector<sstring> parse_tables(const sstring& ks_name, const http_context& ctx, const std::unordered_map<sstring, sstring>& query_params, sstring param_name) {
auto it = query_params.find(param_name);
if (it == query_params.end()) {
return {};
}
return parse_tables(ks_name, ctx, it->second);
}
std::vector<table_info> parse_table_infos(const sstring& ks_name, const http_context& ctx, sstring value) {
std::vector<table_info> res;
try {
@@ -178,9 +158,12 @@ std::vector<table_info> parse_table_infos(const sstring& ks_name, const http_con
return res;
}
std::vector<table_info> parse_table_infos(const sstring& ks_name, const http_context& ctx, const std::unordered_map<sstring, sstring>& query_params, sstring param_name) {
auto it = query_params.find(param_name);
return parse_table_infos(ks_name, ctx, it != query_params.end() ? it->second : "");
std::pair<sstring, std::vector<table_info>> parse_table_infos(const http_context& ctx, const http::request& req, sstring cf_param_name) {
auto keyspace = validate_keyspace(ctx, req);
const auto& query_params = req.query_parameters;
auto it = query_params.find(cf_param_name);
auto tis = parse_table_infos(keyspace, ctx, it != query_params.end() ? it->second : "");
return std::make_pair(std::move(keyspace), std::move(tis));
}
static ss::token_range token_range_endpoints_to_json(const dht::token_range_endpoints& d) {
@@ -201,16 +184,6 @@ static ss::token_range token_range_endpoints_to_json(const dht::token_range_endp
return r;
}
using ks_cf_func = std::function<future<json::json_return_type>(http_context&, std::unique_ptr<http::request>, sstring, std::vector<table_info>)>;
static auto wrap_ks_cf(http_context &ctx, ks_cf_func f) {
return [&ctx, f = std::move(f)](std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req);
auto table_infos = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");
return f(ctx, std::move(req), std::move(keyspace), std::move(table_infos));
};
}
seastar::future<json::json_return_type> run_toppartitions_query(db::toppartitions_query& q, http_context &ctx, bool legacy_request) {
return q.scatter().then([&q, legacy_request] {
return sleep(q.duration()).then([&q, legacy_request] {
@@ -253,7 +226,7 @@ future<scrub_info> parse_scrub_options(const http_context& ctx, sharded<db::snap
});
rp.process(*req);
info.keyspace = validate_keyspace(ctx, *rp.get("keyspace"));
info.column_families = parse_tables(info.keyspace, ctx, *rp.get("cf"));
info.column_families = parse_table_infos(info.keyspace, ctx, *rp.get("cf")) | std::views::transform([] (auto ti) { return ti.name; }) | std::ranges::to<std::vector>();
auto scrub_mode_opt = rp.get("scrub_mode");
auto scrub_mode = sstables::compaction_type_options::scrub::mode::abort;
@@ -278,11 +251,9 @@ future<scrub_info> parse_scrub_options(const http_context& ctx, sharded<db::snap
}
}
if (!req_param<bool>(*req, "disable_snapshot", false)) {
if (!req_param<bool>(*req, "disable_snapshot", false) && !info.column_families.empty()) {
auto tag = format("pre-scrub-{:d}", db_clock::now().time_since_epoch().count());
co_await coroutine::parallel_for_each(info.column_families, [&snap_ctl, keyspace = info.keyspace, tag](sstring cf) {
return snap_ctl.local().take_column_family_snapshot(keyspace, cf, tag, db::snapshot_ctl::skip_flush::no);
});
co_await snap_ctl.local().take_column_family_snapshot(info.keyspace, info.column_families, tag, db::snapshot_ctl::skip_flush::no);
}
info.opts = {
@@ -483,17 +454,26 @@ void set_sstables_loader(http_context& ctx, routes& r, sharded<sstables_loader>&
auto cf = req->get_query_param("cf");
auto stream = req->get_query_param("load_and_stream");
auto primary_replica = req->get_query_param("primary_replica_only");
auto skip_cleanup_p = req->get_query_param("skip_cleanup");
boost::algorithm::to_lower(stream);
boost::algorithm::to_lower(primary_replica);
bool load_and_stream = stream == "true" || stream == "1";
bool primary_replica_only = primary_replica == "true" || primary_replica == "1";
bool skip_cleanup = skip_cleanup_p == "true" || skip_cleanup_p == "1";
auto scope = parse_stream_scope(req->get_query_param("scope"));
auto skip_reshape_p = req->get_query_param("skip_reshape");
auto skip_reshape = skip_reshape_p == "true" || skip_reshape_p == "1";
if (scope != sstables_loader::stream_scope::all && !load_and_stream) {
throw httpd::bad_param_exception("scope takes no effect without load-and-stream");
}
// No need to add the keyspace, since all we want is to avoid always sending this to the same
// CPU. Even then I am being overzealous here. This is not something that happens all the time.
auto coordinator = std::hash<sstring>()(cf) % smp::count;
return sst_loader.invoke_on(coordinator,
[ks = std::move(ks), cf = std::move(cf),
load_and_stream, primary_replica_only] (sstables_loader& loader) {
return loader.load_new_sstables(ks, cf, load_and_stream, primary_replica_only, sstables_loader::stream_scope::all);
load_and_stream, primary_replica_only, skip_cleanup, skip_reshape, scope] (sstables_loader& loader) {
return loader.load_new_sstables(ks, cf, load_and_stream, primary_replica_only, skip_cleanup, skip_reshape, scope);
}).then_wrapped([] (auto&& f) {
if (f.failed()) {
auto msg = fmt::format("Failed to load new sstables: {}", f.get_exception());
@@ -725,7 +705,7 @@ rest_get_load(http_context& ctx, std::unique_ptr<http::request> req) {
static
future<json::json_return_type>
rest_get_current_generation_number(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
auto ep = ss.local().get_token_metadata().get_topology().my_address();
auto ep = ss.local().get_token_metadata().get_topology().my_host_id();
return ss.local().gossiper().get_current_generation_number(ep).then([](gms::generation_type res) {
return make_ready_future<json::json_return_type>(res.value());
});
@@ -768,13 +748,7 @@ rest_force_compaction(http_context& ctx, std::unique_ptr<http::request> req) {
fmopt = flush_mode::skip;
}
auto task = co_await compaction_module.make_and_start_task<global_major_compaction_task_impl>({}, db, fmopt, consider_only_existing_data);
try {
co_await task->done();
} catch (...) {
apilog.error("force_compaction failed: {}", std::current_exception());
throw;
}
co_await task->done();
co_return json_void();
}
@@ -801,13 +775,7 @@ rest_force_keyspace_compaction(http_context& ctx, std::unique_ptr<http::request>
fmopt = flush_mode::skip;
}
auto task = co_await compaction_module.make_and_start_task<major_keyspace_compaction_task_impl>({}, std::move(keyspace), tasks::task_id::create_null_id(), db, table_infos, fmopt, consider_only_existing_data);
try {
co_await task->done();
} catch (...) {
apilog.error("force_keyspace_compaction: keyspace={} tables={} failed: {}", task->get_status().keyspace, table_infos, std::current_exception());
throw;
}
co_await task->done();
co_return json_void();
}
@@ -815,8 +783,7 @@ static
future<json::json_return_type>
rest_force_keyspace_cleanup(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
auto& db = ctx.db;
auto keyspace = validate_keyspace(ctx, req);
auto table_infos = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");
auto [keyspace, table_infos] = parse_table_infos(ctx, *req);
const auto& rs = db.local().find_keyspace(keyspace).get_replication_strategy();
if (rs.get_type() == locator::replication_strategy_type::local || !rs.is_vnode_based()) {
auto reason = rs.get_type() == locator::replication_strategy_type::local ? "require" : "support";
@@ -833,21 +800,21 @@ rest_force_keyspace_cleanup(http_context& ctx, sharded<service::storage_service>
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
auto task = co_await compaction_module.make_and_start_task<cleanup_keyspace_compaction_task_impl>(
{}, std::move(keyspace), db, table_infos, flush_mode::all_tables, tasks::is_user_task::yes);
try {
co_await task->done();
} catch (...) {
apilog.error("force_keyspace_cleanup: keyspace={} tables={} failed: {}", task->get_status().keyspace, table_infos, std::current_exception());
throw;
}
co_await task->done();
co_return json::json_return_type(0);
}
static
future<json::json_return_type>
rest_cleanup_all(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
apilog.info("cleanup_all");
auto done = co_await ss.invoke_on(0, [] (service::storage_service& ss) -> future<bool> {
bool global = true;
if (auto global_param = req->get_query_param("global"); !global_param.empty()) {
global = validate_bool(global_param);
}
apilog.info("cleanup_all global={}", global);
auto done = !global ? false : co_await ss.invoke_on(0, [] (service::storage_service& ss) -> future<bool> {
if (!ss.is_topology_coordinator_enabled()) {
co_return false;
}
@@ -857,53 +824,59 @@ rest_cleanup_all(http_context& ctx, sharded<service::storage_service>& ss, std::
if (done) {
co_return json::json_return_type(0);
}
// fall back to the local global cleanup if topology coordinator is not enabled
// fall back to the local cleanup if topology coordinator is not enabled or local cleanup is requested
auto& db = ctx.db;
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
auto task = co_await compaction_module.make_and_start_task<global_cleanup_compaction_task_impl>({}, db);
try {
co_await task->done();
} catch (...) {
apilog.error("cleanup_all failed: {}", std::current_exception());
throw;
}
co_await task->done();
// Mark this node as clean
co_await ss.invoke_on(0, [] (service::storage_service& ss) -> future<> {
if (ss.is_topology_coordinator_enabled()) {
co_await ss.reset_cleanup_needed();
}
});
co_return json::json_return_type(0);
}
static
future<json::json_return_type>
rest_perform_keyspace_offstrategy_compaction(http_context& ctx, std::unique_ptr<http::request> req, sstring keyspace, std::vector<table_info> table_infos) {
rest_reset_cleanup_needed(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
apilog.info("reset_cleanup_needed");
co_await ss.invoke_on(0, [] (service::storage_service& ss) {
if (!ss.is_topology_coordinator_enabled()) {
throw std::runtime_error("mark_node_as_clean is only supported when topology over raft is enabled");
}
return ss.reset_cleanup_needed();
});
co_return json_void();
}
static
future<json::json_return_type>
rest_perform_keyspace_offstrategy_compaction(http_context& ctx, std::unique_ptr<http::request> req) {
auto [keyspace, table_infos] = parse_table_infos(ctx, *req);
apilog.info("perform_keyspace_offstrategy_compaction: keyspace={} tables={}", keyspace, table_infos);
bool res = false;
auto& compaction_module = ctx.db.local().get_compaction_manager().get_task_manager_module();
auto task = co_await compaction_module.make_and_start_task<offstrategy_keyspace_compaction_task_impl>({}, std::move(keyspace), ctx.db, table_infos, &res);
try {
co_await task->done();
} catch (...) {
apilog.error("perform_keyspace_offstrategy_compaction: keyspace={} tables={} failed: {}", task->get_status().keyspace, table_infos, std::current_exception());
throw;
}
co_await task->done();
co_return json::json_return_type(res);
}
static
future<json::json_return_type>
rest_upgrade_sstables(http_context& ctx, std::unique_ptr<http::request> req, sstring keyspace, std::vector<table_info> table_infos) {
rest_upgrade_sstables(http_context& ctx, std::unique_ptr<http::request> req) {
auto& db = ctx.db;
auto [keyspace, table_infos] = parse_table_infos(ctx, *req);
bool exclude_current_version = req_param<bool>(*req, "exclude_current_version", false);
apilog.info("upgrade_sstables: keyspace={} tables={} exclude_current_version={}", keyspace, table_infos, exclude_current_version);
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
auto task = co_await compaction_module.make_and_start_task<upgrade_sstables_compaction_task_impl>({}, std::move(keyspace), db, table_infos, exclude_current_version);
try {
co_await task->done();
} catch (...) {
apilog.error("upgrade_sstables: keyspace={} tables={} failed: {}", keyspace, table_infos, std::current_exception());
throw;
}
co_await task->done();
co_return json::json_return_type(0);
}
@@ -920,15 +893,10 @@ rest_force_flush(http_context& ctx, std::unique_ptr<http::request> req) {
static
future<json::json_return_type>
rest_force_keyspace_flush(http_context& ctx, std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req);
auto column_families = parse_tables(keyspace, ctx, req->query_parameters, "cf");
apilog.info("perform_keyspace_flush: keyspace={} tables={}", keyspace, column_families);
auto [keyspace, table_infos] = parse_table_infos(ctx, *req);
apilog.info("perform_keyspace_flush: keyspace={} tables={}", keyspace, table_infos);
auto& db = ctx.db;
if (column_families.empty()) {
co_await replica::database::flush_keyspace_on_all_shards(db, keyspace);
} else {
co_await replica::database::flush_tables_on_all_shards(db, keyspace, std::move(column_families));
}
co_await replica::database::flush_tables_on_all_shards(db, std::move(table_infos));
co_return json_void();
}
@@ -1448,6 +1416,95 @@ rest_get_effective_ownership(http_context& ctx, sharded<service::storage_service
});
}
static
future<json::json_return_type>
rest_estimate_compression_ratios(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
if (!ss.local().get_feature_service().sstable_compression_dicts) {
apilog.warn("estimate_compression_ratios: called before the cluster feature was enabled");
throw std::runtime_error("estimate_compression_ratios requires all nodes to support the SSTABLE_COMPRESSION_DICTS cluster feature");
}
auto ticket = get_units(ss.local().get_do_sample_sstables_concurrency_limiter(), 1);
auto ks = api::req_param<sstring>(*req, "keyspace", {}).value;
auto cf = api::req_param<sstring>(*req, "cf", {}).value;
apilog.debug("estimate_compression_ratios: called with ks={} cf={}", ks, cf);
auto s = ctx.db.local().find_column_family(ks, cf).schema();
auto training_sample = co_await ss.local().do_sample_sstables(s->id(), 4096, 4096);
auto validation_sample = co_await ss.local().do_sample_sstables(s->id(), 16*1024, 1024);
apilog.debug("estimate_compression_ratios: got training sample with {} blocks and validation sample with {}", training_sample.size(), validation_sample.size());
auto dict = co_await ss.local().train_dict(std::move(training_sample));
apilog.debug("estimate_compression_ratios: got dict of size {}", dict.size());
std::vector<ss::compression_config_result> res;
auto make_result = [](std::string_view name, int chunk_length_kb, std::string_view dict, int level, float ratio) -> ss::compression_config_result {
ss::compression_config_result x;
x.sstable_compression = sstring(name);
x.chunk_length_in_kb = chunk_length_kb;
x.dict = sstring(dict);
x.level = level;
x.ratio = ratio;
return x;
};
using algorithm = compression_parameters::algorithm;
for (const auto& algo : {algorithm::lz4_with_dicts, algorithm::zstd_with_dicts}) {
for (const auto& chunk_size_kb : {1, 4, 16}) {
std::vector<int> levels;
if (algo == compressor::algorithm::zstd_with_dicts) {
for (int i = 1; i <= 5; ++i) {
levels.push_back(i);
}
} else {
levels.push_back(1);
}
for (auto level : levels) {
auto algo_name = compression_parameters::algorithm_to_name(algo);
auto m = std::map<sstring, sstring>{
{compression_parameters::CHUNK_LENGTH_KB, std::to_string(chunk_size_kb)},
{compression_parameters::SSTABLE_COMPRESSION, sstring(algo_name)},
};
if (algo == compressor::algorithm::zstd_with_dicts) {
m.insert(decltype(m)::value_type{sstring("compression_level"), sstring(std::to_string(level))});
}
auto params = compression_parameters(std::move(m));
auto ratio_with_no_dict = co_await try_one_compression_config({}, s, params, validation_sample);
auto ratio_with_past_dict = co_await try_one_compression_config(ctx.db.local().get_user_sstables_manager().get_compressor_factory(), s, params, validation_sample);
auto ratio_with_future_dict = co_await try_one_compression_config(dict, s, params, validation_sample);
res.push_back(make_result(algo_name, chunk_size_kb, "none", level, ratio_with_no_dict));
res.push_back(make_result(algo_name, chunk_size_kb, "past", level, ratio_with_past_dict));
res.push_back(make_result(algo_name, chunk_size_kb, "future", level, ratio_with_future_dict));
}
}
}
co_return res;
}
static
future<json::json_return_type>
rest_retrain_dict(http_context& ctx, sharded<service::storage_service>& ss, service::raft_group0_client& group0_client, std::unique_ptr<http::request> req) {
if (!ss.local().get_feature_service().sstable_compression_dicts) {
apilog.warn("retrain_dict: called before the cluster feature was enabled");
throw std::runtime_error("retrain_dict requires all nodes to support the SSTABLE_COMPRESSION_DICTS cluster feature");
}
auto ticket = get_units(ss.local().get_do_sample_sstables_concurrency_limiter(), 1);
auto ks = api::req_param<sstring>(*req, "keyspace", {}).value;
auto cf = api::req_param<sstring>(*req, "cf", {}).value;
apilog.debug("retrain_dict: called with ks={} cf={}", ks, cf);
const auto t_id = ctx.db.local().find_column_family(ks, cf).schema()->id();
constexpr uint64_t chunk_size = 4096;
constexpr uint64_t n_chunks = 4096;
auto sample = co_await ss.local().do_sample_sstables(t_id, chunk_size, n_chunks);
apilog.debug("retrain_dict: got sample with {} blocks", sample.size());
auto dict = co_await ss.local().train_dict(std::move(sample));
apilog.debug("retrain_dict: got dict of size {}", dict.size());
co_await ss.local().publish_new_sstable_dict(t_id, dict, group0_client);
apilog.debug("retrain_dict: published new dict");
co_return json_void();
}
static
future<json::json_return_type>
rest_sstable_info(http_context& ctx, std::unique_ptr<http::request> req) {
@@ -1509,21 +1566,23 @@ rest_sstable_info(http_context& ctx, std::unique_ptr<http::request> req) {
info.version = sstable->get_version();
if (sstable->has_component(sstables::component_type::CompressionInfo)) {
auto& c = sstable->get_compression();
auto cp = sstables::get_sstable_compressor(c);
const auto& cp = sstable->get_compression().get_compressor();
ss::named_maps nm;
nm.group = "compression_parameters";
for (auto& p : cp->options()) {
for (auto& p : cp.options()) {
if (compressor::is_hidden_option_name(p.first)) {
continue;
}
ss::mapper e;
e.key = p.first;
e.value = p.second;
nm.attributes.push(std::move(e));
}
if (!cp->options().contains(compression_parameters::SSTABLE_COMPRESSION)) {
if (!cp.options().contains(compression_parameters::SSTABLE_COMPRESSION)) {
ss::mapper e;
e.key = compression_parameters::SSTABLE_COMPRESSION;
e.value = cp->name();
e.value = sstring(cp.name());
nm.attributes.push(std::move(e));
}
info.extended_properties.push(std::move(nm));
@@ -1610,6 +1669,18 @@ rest_raft_topology_upgrade_status(sharded<service::storage_service>& ss, std::un
co_return sstring(format("{}", ustate));
}
static
future<json::json_return_type>
rest_raft_topology_get_cmd_status(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
const auto status = co_await ss.invoke_on(0, [] (auto& ss) {
return ss.get_topology_cmd_status();
});
if (status.active_dst.empty()) {
co_return sstring("none");
}
co_return sstring(fmt::format("{}[{}]: {}", status.current, status.index, fmt::join(status.active_dst, ",")));
}
static
future<json::json_return_type>
rest_move_tablet(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
@@ -1640,7 +1711,7 @@ rest_add_tablet_replica(http_context& ctx, sharded<service::storage_service>& ss
auto token = dht::token::from_int64(validate_int(req->get_query_param("token")));
auto ks = req->get_query_param("ks");
auto table = req->get_query_param("table");
auto table_id = ctx.db.local().find_column_family(ks, table).schema()->id();
auto table_id = validate_table(ctx.db.local(), ks, table);
auto force_str = req->get_query_param("force");
auto force = service::loosen_constraints(force_str == "" ? false : validate_bool(force_str));
@@ -1659,7 +1730,7 @@ rest_del_tablet_replica(http_context& ctx, sharded<service::storage_service>& ss
auto token = dht::token::from_int64(validate_int(req->get_query_param("token")));
auto ks = req->get_query_param("ks");
auto table = req->get_query_param("table");
auto table_id = ctx.db.local().find_column_family(ks, table).schema()->id();
auto table_id = validate_table(ctx.db.local(), ks, table);
auto force_str = req->get_query_param("force");
auto force = service::loosen_constraints(force_str == "" ? false : validate_bool(force_str));
@@ -1767,12 +1838,6 @@ rest_bind(FuncType func, BindArgs&... args) {
return std::bind_front(func, std::ref(args)...);
}
static
seastar::httpd::future_json_function
rest_bind(ks_cf_func func, http_context& ctx) {
return wrap_ks_cf(ctx, func);
}
void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_service>& ss, service::raft_group0_client& group0_client) {
ss::get_token_endpoint.set(r, rest_bind(rest_get_token_endpoint, ctx, ss));
ss::toppartitions_generic.set(r, rest_bind(rest_toppartitions_generic, ctx));
@@ -1790,6 +1855,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::force_keyspace_compaction.set(r, rest_bind(rest_force_keyspace_compaction, ctx));
ss::force_keyspace_cleanup.set(r, rest_bind(rest_force_keyspace_cleanup, ctx, ss));
ss::cleanup_all.set(r, rest_bind(rest_cleanup_all, ctx, ss));
ss::reset_cleanup_needed.set(r, rest_bind(rest_reset_cleanup_needed, ctx, ss));
ss::perform_keyspace_offstrategy_compaction.set(r, rest_bind(rest_perform_keyspace_offstrategy_compaction, ctx));
ss::upgrade_sstables.set(r, rest_bind(rest_upgrade_sstables, ctx));
ss::force_flush.set(r, rest_bind(rest_force_flush, ctx));
@@ -1841,10 +1907,13 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::get_total_hints.set(r, rest_bind(rest_get_total_hints));
ss::get_ownership.set(r, rest_bind(rest_get_ownership, ctx, ss));
ss::get_effective_ownership.set(r, rest_bind(rest_get_effective_ownership, ctx, ss));
ss::retrain_dict.set(r, rest_bind(rest_retrain_dict, ctx, ss, group0_client));
ss::estimate_compression_ratios.set(r, rest_bind(rest_estimate_compression_ratios, ctx, ss));
ss::sstable_info.set(r, rest_bind(rest_sstable_info, ctx));
ss::reload_raft_topology_state.set(r, rest_bind(rest_reload_raft_topology_state, ss, group0_client));
ss::upgrade_to_raft_topology.set(r, rest_bind(rest_upgrade_to_raft_topology, ss));
ss::raft_topology_upgrade_status.set(r, rest_bind(rest_raft_topology_upgrade_status, ss));
ss::raft_topology_get_cmd_status.set(r, rest_bind(rest_raft_topology_get_cmd_status, ss));
ss::move_tablet.set(r, rest_bind(rest_move_tablet, ctx, ss));
ss::add_tablet_replica.set(r, rest_bind(rest_add_tablet_replica, ctx, ss));
ss::del_tablet_replica.set(r, rest_bind(rest_del_tablet_replica, ctx, ss));
@@ -1871,6 +1940,7 @@ void unset_storage_service(http_context& ctx, routes& r) {
ss::force_keyspace_compaction.unset(r);
ss::force_keyspace_cleanup.unset(r);
ss::cleanup_all.unset(r);
ss::reset_cleanup_needed.unset(r);
ss::perform_keyspace_offstrategy_compaction.unset(r);
ss::upgrade_sstables.unset(r);
ss::force_flush.unset(r);
@@ -1926,6 +1996,7 @@ void unset_storage_service(http_context& ctx, routes& r) {
ss::reload_raft_topology_state.unset(r);
ss::upgrade_to_raft_topology.unset(r);
ss::raft_topology_upgrade_status.unset(r);
ss::raft_topology_get_cmd_status.unset(r);
ss::move_tablet.unset(r);
ss::add_tablet_replica.unset(r);
ss::del_tablet_replica.unset(r);

View File

@@ -52,10 +52,11 @@ table_id validate_table(const replica::database& db, sstring ks_name, sstring ta
// containing the description of the respective no_such_column_family error.
// Returns a vector of all table infos given by the parameter, or
// if the parameter is not found or is empty, returns a list of all table infos in the keyspace.
std::vector<table_info> parse_table_infos(const sstring& ks_name, const http_context& ctx, const std::unordered_map<sstring, sstring>& query_params, sstring param_name);
std::vector<table_info> parse_table_infos(const sstring& ks_name, const http_context& ctx, sstring value);
std::pair<sstring, std::vector<table_info>> parse_table_infos(const http_context& ctx, const http::request& req, sstring cf_param_name = "cf");
struct scrub_info {
sstables::compaction_type_options::scrub opts;
sstring keyspace;

View File

@@ -31,8 +31,7 @@ using ks_cf_func = std::function<future<json::json_return_type>(http_context&, s
static auto wrap_ks_cf(http_context &ctx, ks_cf_func f) {
return [&ctx, f = std::move(f)](std::unique_ptr<http::request> req) {
auto keyspace = validate_keyspace(ctx, req);
auto table_infos = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");
auto [keyspace, table_infos] = parse_table_infos(ctx, *req);
return f(ctx, std::move(req), std::move(keyspace), std::move(table_infos));
};
}
@@ -63,8 +62,7 @@ void set_tasks_compaction_module(http_context& ctx, routes& r, sharded<service::
t::force_keyspace_cleanup_async.set(r, [&ctx, &ss](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto& db = ctx.db;
auto keyspace = validate_keyspace(ctx, req);
auto table_infos = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");
auto [keyspace, table_infos] = parse_table_infos(ctx, *req);
apilog.info("force_keyspace_cleanup_async: keyspace={} tables={}", keyspace, table_infos);
if (!co_await ss.local().is_cleanup_allowed(keyspace)) {
auto msg = "Can not perform cleanup operation when topology changes";

View File

@@ -74,6 +74,9 @@ void set_token_metadata(http_context& ctx, routes& r, sharded<locator::shared_to
});
ss::get_host_id_map.set(r, [&tm, &g](const_req req) {
if (!g.local().is_enabled()) {
throw std::runtime_error("The gossiper is not ready yet");
}
std::vector<ss::mapper> res;
auto map = tm.local().get()->get_host_ids() |
std::views::transform([&g] (locator::host_id id) { return std::make_pair(g.local().get_address_map().get(id), id); }) |

View File

@@ -33,20 +33,6 @@ namespace audit {
namespace {
future<> syslog_send_helper(net::datagram_channel& sender,
const socket_address& address,
const sstring& msg) {
return sender.send(address, net::packet{msg.data(), msg.size()}).handle_exception([address](auto&& exception_ptr) {
auto error_msg = seastar::format(
"Syslog audit backend failed (sending a message to {} resulted in {}).",
address,
exception_ptr
);
logger.error("{}", error_msg);
throw audit_exception(std::move(error_msg));
});
}
static auto syslog_address_helper(const db::config& cfg)
{
return cfg.audit_unix_socket_path.is_set()
@@ -54,11 +40,40 @@ static auto syslog_address_helper(const db::config& cfg)
: unix_domain_addr(_PATH_LOG);
}
static std::string json_escape(std::string_view str) {
std::string result;
result.reserve(str.size() * 1.2);
for (auto c : str) {
if (c == '"' || c == '\\') {
result.push_back('\\');
}
result.push_back(c);
}
return result;
}
}
future<> audit_syslog_storage_helper::syslog_send_helper(const sstring& msg) {
try {
auto lock = co_await get_units(_semaphore, 1, std::chrono::hours(1));
co_await _sender.send(_syslog_address, net::packet{msg.data(), msg.size()});
}
catch (const std::exception& e) {
auto error_msg = seastar::format(
"Syslog audit backend failed (sending a message to {} resulted in {}).",
_syslog_address,
e
);
logger.error("{}", error_msg);
throw audit_exception(std::move(error_msg));
}
}
audit_syslog_storage_helper::audit_syslog_storage_helper(cql3::query_processor& qp, service::migration_manager&) :
_syslog_address(syslog_address_helper(qp.db().get_config())),
_sender(make_unbound_datagram_channel(AF_UNIX)) {
_sender(make_unbound_datagram_channel(AF_UNIX)),
_semaphore(1) {
}
audit_syslog_storage_helper::~audit_syslog_storage_helper() {
@@ -73,10 +88,10 @@ audit_syslog_storage_helper::~audit_syslog_storage_helper() {
*/
future<> audit_syslog_storage_helper::start(const db::config& cfg) {
if (this_shard_id() != 0) {
return make_ready_future();
co_return;
}
return syslog_send_helper(_sender, _syslog_address, "Initializing syslog audit backend.");
co_await syslog_send_helper("Initializing syslog audit backend.");
}
future<> audit_syslog_storage_helper::stop() {
@@ -93,7 +108,7 @@ future<> audit_syslog_storage_helper::write(const audit_info* audit_info,
auto now = std::chrono::system_clock::to_time_t(std::chrono::system_clock::now());
tm time;
localtime_r(&now, &time);
sstring msg = seastar::format("<{}>{:%h %e %T} scylla-audit: \"{}\", \"{}\", \"{}\", \"{}\", \"{}\", \"{}\", \"{}\", \"{}\", \"{}\"",
sstring msg = seastar::format(R"(<{}>{:%h %e %T} scylla-audit: node="{}" category="{}" cl="{}" error="{}" keyspace="{}" query="{}" client_ip="{}" table="{}" username="{}")",
LOG_NOTICE | LOG_USER,
time,
node_ip,
@@ -101,12 +116,12 @@ future<> audit_syslog_storage_helper::write(const audit_info* audit_info,
cl,
(error ? "true" : "false"),
audit_info->keyspace(),
audit_info->query(),
json_escape(audit_info->query()),
client_ip,
audit_info->table(),
username);
return syslog_send_helper(_sender, _syslog_address, msg);
co_await syslog_send_helper(msg);
}
future<> audit_syslog_storage_helper::write_login(const sstring& username,
@@ -117,15 +132,15 @@ future<> audit_syslog_storage_helper::write_login(const sstring& username,
auto now = std::chrono::system_clock::to_time_t(std::chrono::system_clock::now());
tm time;
localtime_r(&now, &time);
sstring msg = seastar::format("<{}>{:%h %e %T} scylla-audit: \"{}\", \"AUTH\", \"\", \"\", \"\", \"\", \"{}\", \"{}\", \"{}\"",
sstring msg = seastar::format(R"(<{}>{:%h %e %T} scylla-audit: node="{}", category="AUTH", cl="", error="{}", keyspace="", query="", client_ip="{}", table="", username="{}")",
LOG_NOTICE | LOG_USER,
time,
node_ip,
(error ? "true" : "false"),
client_ip,
username,
(error ? "true" : "false"));
username);
co_await syslog_send_helper(_sender, _syslog_address, msg.c_str());
co_await syslog_send_helper(msg.c_str());
}
using registry = class_registrator<storage_helper, audit_syslog_storage_helper, cql3::query_processor&, service::migration_manager&>;

View File

@@ -24,6 +24,9 @@ namespace audit {
class audit_syslog_storage_helper : public storage_helper {
socket_address _syslog_address;
net::datagram_channel _sender;
seastar::semaphore _semaphore;
future<> syslog_send_helper(const sstring& msg);
public:
explicit audit_syslog_storage_helper(cql3::query_processor&, service::migration_manager&);
virtual ~audit_syslog_storage_helper();

View File

@@ -9,6 +9,7 @@
#include "auth/allow_all_authenticator.hh"
#include "service/migration_manager.hh"
#include "utils/alien_worker.hh"
#include "utils/class_registrator.hh"
namespace auth {
@@ -21,6 +22,7 @@ static const class_registrator<
allow_all_authenticator,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&> registration("org.apache.cassandra.auth.AllowAllAuthenticator");
::service::migration_manager&,
utils::alien_worker&> registration("org.apache.cassandra.auth.AllowAllAuthenticator");
}

View File

@@ -13,6 +13,7 @@
#include "auth/authenticated_user.hh"
#include "auth/authenticator.hh"
#include "auth/common.hh"
#include "utils/alien_worker.hh"
namespace cql3 {
class query_processor;
@@ -28,7 +29,7 @@ extern const std::string_view allow_all_authenticator_name;
class allow_all_authenticator final : public authenticator {
public:
allow_all_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&) {
allow_all_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, utils::alien_worker&) {
}
virtual future<> start() override {

View File

@@ -33,13 +33,14 @@ static const class_registrator<auth::authenticator
, auth::certificate_authenticator
, cql3::query_processor&
, ::service::raft_group0_client&
, ::service::migration_manager&> cert_auth_reg(CERT_AUTH_NAME);
, ::service::migration_manager&
, utils::alien_worker&> cert_auth_reg(CERT_AUTH_NAME);
enum class auth::certificate_authenticator::query_source {
subject, altname
};
auth::certificate_authenticator::certificate_authenticator(cql3::query_processor& qp, ::service::raft_group0_client&, ::service::migration_manager&)
auth::certificate_authenticator::certificate_authenticator(cql3::query_processor& qp, ::service::raft_group0_client&, ::service::migration_manager&, utils::alien_worker&)
: _queries([&] {
auto& conf = qp.db().get_config();
auto queries = conf.auth_certificate_role_queries();

View File

@@ -10,6 +10,7 @@
#pragma once
#include "auth/authenticator.hh"
#include "utils/alien_worker.hh"
#include <boost/regex_fwd.hpp> // IWYU pragma: keep
namespace cql3 {
@@ -31,7 +32,7 @@ class certificate_authenticator : public authenticator {
enum class query_source;
std::vector<std::pair<query_source, boost::regex>> _queries;
public:
certificate_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&);
certificate_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, utils::alien_worker&);
~certificate_authenticator();
future<> start() override;

View File

@@ -119,6 +119,11 @@ future<> create_legacy_metadata_table_if_missing(
return qs;
}
::service::raft_timeout get_raft_timeout() noexcept {
auto dur = internal_distributed_query_state().get_client_state().get_timeout_config().other_timeout;
return ::service::raft_timeout{.value = lowres_clock::now() + dur};
}
static future<> announce_mutations_with_guard(
::service::raft_group0_client& group0_client,
std::vector<canonical_mutation> muts,

View File

@@ -17,6 +17,7 @@
#include "types/types.hh"
#include "service/raft/raft_group0_client.hh"
#include "timeout_config.hh"
using namespace std::chrono_literals;
@@ -77,6 +78,8 @@ future<> create_legacy_metadata_table_if_missing(
///
::service::query_state& internal_distributed_query_state() noexcept;
::service::raft_timeout get_raft_timeout() noexcept;
// Execute update query via group0 mechanism, mutations will be applied on all nodes.
// Use this function when need to perform read before write on a single guard or if
// you have more than one mutation and potentially exceed single command size limit.

View File

@@ -233,9 +233,9 @@ future<role_set> ldap_role_manager::query_granted(std::string_view grantee_name,
}
future<role_to_directly_granted_map>
ldap_role_manager::query_all_directly_granted() {
ldap_role_manager::query_all_directly_granted(::service::query_state& qs) {
role_to_directly_granted_map result;
auto roles = co_await query_all();
auto roles = co_await query_all(qs);
for (auto& role: roles) {
auto granted_set = co_await query_granted(role, recursive_role_query::no);
for (auto& granted: granted_set) {
@@ -247,8 +247,8 @@ ldap_role_manager::query_all_directly_granted() {
co_return result;
}
future<role_set> ldap_role_manager::query_all() {
return _std_mgr.query_all();
future<role_set> ldap_role_manager::query_all(::service::query_state& qs) {
return _std_mgr.query_all(qs);
}
future<> ldap_role_manager::create_role(std::string_view role_name) {
@@ -311,12 +311,12 @@ future<bool> ldap_role_manager::can_login(std::string_view role_name) {
}
future<std::optional<sstring>> ldap_role_manager::get_attribute(
std::string_view role_name, std::string_view attribute_name) {
return _std_mgr.get_attribute(role_name, attribute_name);
std::string_view role_name, std::string_view attribute_name, ::service::query_state& qs) {
return _std_mgr.get_attribute(role_name, attribute_name, qs);
}
future<role_manager::attribute_vals> ldap_role_manager::query_attribute_for_all(std::string_view attribute_name) {
return _std_mgr.query_attribute_for_all(attribute_name);
future<role_manager::attribute_vals> ldap_role_manager::query_attribute_for_all(std::string_view attribute_name, ::service::query_state& qs) {
return _std_mgr.query_attribute_for_all(attribute_name, qs);
}
future<> ldap_role_manager::set_attribute(
@@ -338,8 +338,7 @@ future<std::vector<cql3::description>> ldap_role_manager::describe_role_grants()
}
future<> ldap_role_manager::ensure_superuser_is_created() {
// ldap is responsible for users
co_return;
return _std_mgr.ensure_superuser_is_created();
}
} // namespace auth

View File

@@ -75,9 +75,9 @@ class ldap_role_manager : public role_manager {
future<role_set> query_granted(std::string_view, recursive_role_query) override;
future<role_to_directly_granted_map> query_all_directly_granted() override;
future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state&) override;
future<role_set> query_all() override;
future<role_set> query_all(::service::query_state&) override;
future<bool> exists(std::string_view) override;
@@ -85,9 +85,9 @@ class ldap_role_manager : public role_manager {
future<bool> can_login(std::string_view) override;
future<std::optional<sstring>> get_attribute(std::string_view, std::string_view) override;
future<std::optional<sstring>> get_attribute(std::string_view, std::string_view, ::service::query_state&) override;
future<role_manager::attribute_vals> query_attribute_for_all(std::string_view) override;
future<role_manager::attribute_vals> query_attribute_for_all(std::string_view, ::service::query_state&) override;
future<> set_attribute(std::string_view, std::string_view, std::string_view, ::service::group0_batch& mc) override;

View File

@@ -78,11 +78,11 @@ future<role_set> maintenance_socket_role_manager::query_granted(std::string_view
return operation_not_supported_exception<role_set>("QUERY GRANTED");
}
future<role_to_directly_granted_map> maintenance_socket_role_manager::query_all_directly_granted() {
future<role_to_directly_granted_map> maintenance_socket_role_manager::query_all_directly_granted(::service::query_state&) {
return operation_not_supported_exception<role_to_directly_granted_map>("QUERY ALL DIRECTLY GRANTED");
}
future<role_set> maintenance_socket_role_manager::query_all() {
future<role_set> maintenance_socket_role_manager::query_all(::service::query_state&) {
return operation_not_supported_exception<role_set>("QUERY ALL");
}
@@ -98,11 +98,11 @@ future<bool> maintenance_socket_role_manager::can_login(std::string_view role_na
return make_ready_future<bool>(true);
}
future<std::optional<sstring>> maintenance_socket_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name) {
future<std::optional<sstring>> maintenance_socket_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state&) {
return operation_not_supported_exception<std::optional<sstring>>("GET ATTRIBUTE");
}
future<role_manager::attribute_vals> maintenance_socket_role_manager::query_attribute_for_all(std::string_view attribute_name) {
future<role_manager::attribute_vals> maintenance_socket_role_manager::query_attribute_for_all(std::string_view attribute_name, ::service::query_state&) {
return operation_not_supported_exception<role_manager::attribute_vals>("QUERY ATTRIBUTE");
}

View File

@@ -53,9 +53,9 @@ public:
virtual future<role_set> query_granted(std::string_view grantee_name, recursive_role_query) override;
virtual future<role_to_directly_granted_map> query_all_directly_granted() override;
virtual future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state&) override;
virtual future<role_set> query_all() override;
virtual future<role_set> query_all(::service::query_state&) override;
virtual future<bool> exists(std::string_view role_name) override;
@@ -63,9 +63,9 @@ public:
virtual future<bool> can_login(std::string_view role_name) override;
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name) override;
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state&) override;
virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name) override;
virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name, ::service::query_state&) override;
virtual future<> set_attribute(std::string_view role_name, std::string_view attribute_name, std::string_view attribute_value, ::service::group0_batch& mc) override;

View File

@@ -48,14 +48,14 @@ static const class_registrator<
password_authenticator,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&> password_auth_reg("org.apache.cassandra.auth.PasswordAuthenticator");
::service::migration_manager&,
utils::alien_worker&> password_auth_reg("org.apache.cassandra.auth.PasswordAuthenticator");
static thread_local auto rng_for_salt = std::default_random_engine(std::random_device{}());
static std::string_view get_config_value(std::string_view value, std::string_view def) {
return value.empty() ? def : value;
}
std::string password_authenticator::default_superuser(const db::config& cfg) {
return std::string(get_config_value(cfg.auth_superuser_name(), DEFAULT_USER_NAME));
}
@@ -63,12 +63,13 @@ std::string password_authenticator::default_superuser(const db::config& cfg) {
password_authenticator::~password_authenticator() {
}
password_authenticator::password_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm)
password_authenticator::password_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm, utils::alien_worker& hashing_worker)
: _qp(qp)
, _group0_client(g0)
, _migration_manager(mm)
, _stopped(make_ready_future<>())
, _superuser(default_superuser(qp.db().get_config()))
, _hashing_worker(hashing_worker)
{}
static bool has_salted_hash(const cql3::untyped_result_set_row& row) {
@@ -117,33 +118,95 @@ future<> password_authenticator::migrate_legacy_metadata() const {
});
}
future<> password_authenticator::create_default_if_missing() {
future<> password_authenticator::legacy_create_default_if_missing() {
SCYLLA_ASSERT(legacy_mode(_qp));
const auto exists = co_await default_role_row_satisfies(_qp, &has_salted_hash, _superuser);
if (exists) {
co_return;
}
std::string salted_pwd(get_config_value(_qp.db().get_config().auth_superuser_salted_password(), ""));
if (salted_pwd.empty()) {
salted_pwd = passwords::hash(DEFAULT_USER_PASSWORD, rng_for_salt);
salted_pwd = passwords::hash(DEFAULT_USER_PASSWORD, rng_for_salt, _scheme);
}
const auto query = update_row_query();
if (legacy_mode(_qp)) {
co_await _qp.execute_internal(
co_await _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state(),
{salted_pwd, _superuser},
cql3::query_processor::cache_internal::no);
plogger.info("Created default superuser authentication record.");
} else {
co_await announce_mutations(_qp, _group0_client, query,
{salted_pwd, _superuser}, _as, ::service::raft_timeout{});
plogger.info("Created default superuser authentication record.");
plogger.info("Created default superuser authentication record.");
}
future<> password_authenticator::maybe_create_default_password() {
auto needs_password = [this] () -> future<bool> {
const sstring query = seastar::format("SELECT * FROM {}.{} WHERE is_superuser = true ALLOW FILTERING", get_auth_ks_name(_qp), meta::roles_table::name);
auto results = co_await _qp.execute_internal(query,
db::consistency_level::LOCAL_ONE,
internal_distributed_query_state(), cql3::query_processor::cache_internal::yes);
// Don't add default password if
// - there is no default superuser
// - there is a superuser with a password.
bool has_default = false;
bool has_superuser_with_password = false;
for (auto& result : *results) {
if (result.get_as<sstring>(meta::roles_table::role_col_name) == _superuser) {
has_default = true;
}
if (has_salted_hash(result)) {
has_superuser_with_password = true;
}
}
co_return has_default && !has_superuser_with_password;
};
if (!co_await needs_password()) {
co_return;
}
// We don't want to start operation earlier to avoid quorum requirement in
// a common case.
::service::group0_batch batch(
co_await _group0_client.start_operation(_as, get_raft_timeout()));
// Check again as the state may have changed before we took the guard (batch).
if (!co_await needs_password()) {
co_return;
}
// Set default superuser's password.
std::string salted_pwd(get_config_value(_qp.db().get_config().auth_superuser_salted_password(), ""));
if (salted_pwd.empty()) {
salted_pwd = passwords::hash(DEFAULT_USER_PASSWORD, rng_for_salt, _scheme);
}
const auto update_query = update_row_query();
co_await collect_mutations(_qp, batch, update_query, {salted_pwd, _superuser});
co_await std::move(batch).commit(_group0_client, _as, get_raft_timeout());
plogger.info("Created default superuser authentication record.");
}
future<> password_authenticator::maybe_create_default_password_with_retries() {
size_t retries = _migration_manager.get_concurrent_ddl_retries();
while (true) {
try {
co_return co_await maybe_create_default_password();
} catch (const ::service::group0_concurrent_modification& ex) {
plogger.warn("Failed to execute maybe_create_default_password due to guard conflict.{}.", retries ? " Retrying" : " Number of retries exceeded, giving up");
if (retries--) {
continue;
}
// Log error but don't crash the whole node startup sequence.
plogger.error("Failed to create default superuser password due to guard conflict.");
co_return;
} catch (const ::service::raft_operation_timeout_error& ex) {
plogger.error("Failed to create default superuser password due to exception: {}", ex.what());
co_return;
}
}
}
future<> password_authenticator::start() {
return once_among_shards([this] {
// Verify that at least one hashing scheme is supported.
passwords::detail::verify_scheme(_scheme);
plogger.info("Using password hashing scheme: {}", passwords::detail::prefix_for_scheme(_scheme));
_stopped = do_after_system_ready(_as, [this] {
return async([this] {
if (legacy_mode(_qp)) {
@@ -164,11 +227,14 @@ future<> password_authenticator::start() {
migrate_legacy_metadata().get();
return;
}
legacy_create_default_if_missing().get();
}
utils::get_local_injector().inject("password_authenticator_start_pause", utils::wait_for_message(5min)).get();
create_default_if_missing().get();
if (!legacy_mode(_qp)) {
_superuser_created_promise.set_value();
maybe_create_default_password_with_retries().get();
if (!_superuser_created_promise.available()) {
_superuser_created_promise.set_value();
}
}
});
});
@@ -228,7 +294,13 @@ future<authenticated_user> password_authenticator::authenticate(
try {
const std::optional<sstring> salted_hash = co_await get_password_hash(username);
if (!salted_hash || !passwords::check(password, *salted_hash)) {
if (!salted_hash) {
throw exceptions::authentication_exception("Username and/or password are incorrect");
}
const bool password_match = co_await _hashing_worker.submit<bool>([password = std::move(password), salted_hash = std::move(salted_hash)]{
return passwords::check(password, *salted_hash);
});
if (!password_match) {
throw exceptions::authentication_exception("Username and/or password are incorrect");
}
co_return username;
@@ -252,7 +324,7 @@ future<> password_authenticator::create(std::string_view role_name, const authen
auto maybe_hash = options.credentials.transform([&] (const auto& creds) -> sstring {
return std::visit(make_visitor(
[&] (const password_option& opt) {
return passwords::hash(opt.password, rng_for_salt);
return passwords::hash(opt.password, rng_for_salt, _scheme);
},
[] (const hashed_password_option& opt) {
return opt.hashed_password;
@@ -295,11 +367,11 @@ future<> password_authenticator::alter(std::string_view role_name, const authent
query,
consistency_for_user(role_name),
internal_distributed_query_state(),
{passwords::hash(password, rng_for_salt), sstring(role_name)},
{passwords::hash(password, rng_for_salt, _scheme), sstring(role_name)},
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_await collect_mutations(_qp, mc, query,
{passwords::hash(password, rng_for_salt), sstring(role_name)});
{passwords::hash(password, rng_for_salt, _scheme), sstring(role_name)});
}
}

View File

@@ -15,7 +15,9 @@
#include "db/consistency_level_type.hh"
#include "auth/authenticator.hh"
#include "auth/passwords.hh"
#include "service/raft/raft_group0_client.hh"
#include "utils/alien_worker.hh"
namespace db {
class config;
@@ -41,14 +43,17 @@ class password_authenticator : public authenticator {
::service::migration_manager& _migration_manager;
future<> _stopped;
abort_source _as;
std::string _superuser;
std::string _superuser; // default superuser name from the config (may or may not be present in roles table)
shared_promise<> _superuser_created_promise;
// We used to also support bcrypt, SHA-256, and MD5 (ref. scylladb#24524).
constexpr static auth::passwords::scheme _scheme = passwords::scheme::sha_512;
utils::alien_worker& _hashing_worker;
public:
static db::consistency_level consistency_for_user(std::string_view role_name);
static std::string default_superuser(const db::config&);
password_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&);
password_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, utils::alien_worker&);
~password_authenticator();
@@ -89,7 +94,10 @@ private:
future<> migrate_legacy_metadata() const;
future<> create_default_if_missing();
future<> legacy_create_default_if_missing();
future<> maybe_create_default_password();
future<> maybe_create_default_password_with_retries();
sstring update_row_query() const;
};

View File

@@ -21,18 +21,14 @@ static thread_local crypt_data tlcrypt = {};
namespace detail {
scheme identify_best_supported_scheme() {
const auto all_schemes = { scheme::bcrypt_y, scheme::bcrypt_a, scheme::sha_512, scheme::sha_256, scheme::md5 };
// "Random", for testing schemes.
void verify_scheme(scheme scheme) {
const sstring random_part_of_salt = "aaaabbbbccccdddd";
for (scheme c : all_schemes) {
const sstring salt = sstring(prefix_for_scheme(c)) + random_part_of_salt;
const char* e = crypt_r("fisk", salt.c_str(), &tlcrypt);
const sstring salt = sstring(prefix_for_scheme(scheme)) + random_part_of_salt;
const char* e = crypt_r("fisk", salt.c_str(), &tlcrypt);
if (e && (e[0] != '*')) {
return c;
}
if (e && (e[0] != '*')) {
return;
}
throw no_supported_schemes();

View File

@@ -21,10 +21,11 @@ class no_supported_schemes : public std::runtime_error {
public:
no_supported_schemes();
};
///
/// Apache Cassandra uses a library to provide the bcrypt scheme. Many Linux implementations do not support bcrypt, so
/// we support alternatives. The cost is loss of direct compatibility with Apache Cassandra system tables.
/// Apache Cassandra uses a library to provide the bcrypt scheme. In ScyllaDB, we use SHA-512
/// instead of bcrypt for performance and for historical reasons (see scylladb#24524).
/// Currently, SHA-512 is always chosen as the hashing scheme for new passwords, but the other
/// algorithms remain supported for CREATE ROLE WITH HASHED PASSWORD and backward compatibility.
///
enum class scheme {
bcrypt_y,
@@ -51,11 +52,11 @@ sstring generate_random_salt_bytes(RandomNumberEngine& g) {
}
///
/// Test each allowed hashing scheme and report the best supported one on the current system.
/// Test given hashing scheme on the current system.
///
/// \throws \ref no_supported_schemes when none of the known schemes is supported.
/// \throws \ref no_supported_schemes when scheme is unsupported.
///
scheme identify_best_supported_scheme();
void verify_scheme(scheme scheme);
std::string_view prefix_for_scheme(scheme) noexcept;
@@ -67,8 +68,7 @@ std::string_view prefix_for_scheme(scheme) noexcept;
/// \throws \ref no_supported_schemes when no known hashing schemes are supported on the system.
///
template <typename RandomNumberEngine>
sstring generate_salt(RandomNumberEngine& g) {
static const scheme scheme = identify_best_supported_scheme();
sstring generate_salt(RandomNumberEngine& g, scheme scheme) {
static const sstring prefix = sstring(prefix_for_scheme(scheme));
return prefix + generate_random_salt_bytes(g);
}
@@ -93,8 +93,8 @@ sstring hash_with_salt(const sstring& pass, const sstring& salt);
/// \throws \ref std::system_error when the implementation-specific implementation fails to hash the cleartext.
///
template <typename RandomNumberEngine>
sstring hash(const sstring& pass, RandomNumberEngine& g) {
return detail::hash_with_salt(pass, detail::generate_salt(g));
sstring hash(const sstring& pass, RandomNumberEngine& g, scheme scheme) {
return detail::hash_with_salt(pass, detail::generate_salt(g, scheme));
}
///

View File

@@ -17,12 +17,17 @@
#include <seastar/core/format.hh>
#include <seastar/core/sstring.hh>
#include "auth/common.hh"
#include "auth/resource.hh"
#include "cql3/description.hh"
#include "seastarx.hh"
#include "exceptions/exceptions.hh"
#include "service/raft/raft_group0_client.hh"
namespace service {
class query_state;
};
namespace auth {
struct role_config final {
@@ -167,9 +172,9 @@ public:
/// (role2, role3)
/// }
///
virtual future<role_to_directly_granted_map> query_all_directly_granted() = 0;
virtual future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state& = internal_distributed_query_state()) = 0;
virtual future<role_set> query_all() = 0;
virtual future<role_set> query_all(::service::query_state& = internal_distributed_query_state()) = 0;
virtual future<bool> exists(std::string_view role_name) = 0;
@@ -186,12 +191,12 @@ public:
///
/// \returns the value of the named attribute, if one is set.
///
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name) = 0;
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state& = internal_distributed_query_state()) = 0;
///
/// \returns a mapping of each role's value for the named attribute, if one is set for the role.
///
virtual future<attribute_vals> query_attribute_for_all(std::string_view attribute_name) = 0;
virtual future<attribute_vals> query_attribute_for_all(std::string_view attribute_name, ::service::query_state& = internal_distributed_query_state()) = 0;
/// Sets `attribute_name` with `attribute_value` for `role_name`.
/// \returns an exceptional future with nonexistant_role if the role does not exist.

View File

@@ -34,9 +34,10 @@ static const class_registrator<
saslauthd_authenticator,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&> saslauthd_auth_reg("com.scylladb.auth.SaslauthdAuthenticator");
::service::migration_manager&,
utils::alien_worker&> saslauthd_auth_reg("com.scylladb.auth.SaslauthdAuthenticator");
saslauthd_authenticator::saslauthd_authenticator(cql3::query_processor& qp, ::service::raft_group0_client&, ::service::migration_manager&)
saslauthd_authenticator::saslauthd_authenticator(cql3::query_processor& qp, ::service::raft_group0_client&, ::service::migration_manager&, utils::alien_worker&)
: _socket_path(qp.db().get_config().saslauthd_socket_path())
{}

View File

@@ -11,6 +11,7 @@
#pragma once
#include "auth/authenticator.hh"
#include "utils/alien_worker.hh"
namespace cql3 {
class query_processor;
@@ -28,7 +29,7 @@ namespace auth {
class saslauthd_authenticator : public authenticator {
sstring _socket_path; ///< Path to the domain socket on which saslauthd is listening.
public:
saslauthd_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&);
saslauthd_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, utils::alien_worker&);
future<> start() override;

View File

@@ -187,14 +187,15 @@ service::service(
::service::migration_notifier& mn,
::service::migration_manager& mm,
const service_config& sc,
maintenance_socket_enabled used_by_maintenance_socket)
maintenance_socket_enabled used_by_maintenance_socket,
utils::alien_worker& hashing_worker)
: service(
std::move(c),
qp,
g0,
mn,
create_object<authorizer>(sc.authorizer_java_name, qp, g0, mm),
create_object<authenticator>(sc.authenticator_java_name, qp, g0, mm),
create_object<authenticator>(sc.authenticator_java_name, qp, g0, mm, hashing_worker),
create_object<role_manager>(sc.role_manager_java_name, qp, g0, mm),
used_by_maintenance_socket) {
}
@@ -240,6 +241,13 @@ future<> service::start(::service::migration_manager& mm, db::system_keyspace& s
});
}
co_await _role_manager->start();
if (this_shard_id() == 0) {
// Role manager and password authenticator have this odd startup
// mechanism where they asynchronously create the superuser role
// in the background. Correct password creation depends on role
// creation therefore we need to wait here.
co_await _role_manager->ensure_superuser_is_created();
}
co_await when_all_succeed(_authorizer->start(), _authenticator->start()).discard_result();
_permissions_cache = std::make_unique<permissions_cache>(_loading_cache_config, *this, log);
co_await once_among_shards([this] {

View File

@@ -26,6 +26,7 @@
#include "cql3/description.hh"
#include "seastarx.hh"
#include "service/raft/raft_group0_client.hh"
#include "utils/alien_worker.hh"
#include "utils/observable.hh"
#include "utils/serialized_action.hh"
#include "service/maintenance_mode.hh"
@@ -126,7 +127,8 @@ public:
::service::migration_notifier&,
::service::migration_manager&,
const service_config&,
maintenance_socket_enabled);
maintenance_socket_enabled,
utils::alien_worker&);
future<> start(::service::migration_manager&, db::system_keyspace&);

View File

@@ -9,6 +9,7 @@
#include "auth/standard_role_manager.hh"
#include <optional>
#include <stdexcept>
#include <unordered_set>
#include <vector>
@@ -28,6 +29,7 @@
#include "cql3/util.hh"
#include "db/consistency_level_type.hh"
#include "exceptions/exceptions.hh"
#include "utils/error_injection.hh"
#include "utils/log.hh"
#include <seastar/core/loop.hh>
#include <seastar/coroutine/maybe_yield.hh>
@@ -178,7 +180,8 @@ future<> standard_role_manager::create_legacy_metadata_tables_if_missing() const
_migration_manager)).discard_result();
}
future<> standard_role_manager::create_default_role_if_missing() {
future<> standard_role_manager::legacy_create_default_role_if_missing() {
SCYLLA_ASSERT(legacy_mode(_qp));
try {
const auto exists = co_await default_role_row_satisfies(_qp, &has_can_login, _superuser);
if (exists) {
@@ -188,16 +191,12 @@ future<> standard_role_manager::create_default_role_if_missing() {
get_auth_ks_name(_qp),
meta::roles_table::name,
meta::roles_table::role_col_name);
if (legacy_mode(_qp)) {
co_await _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state(),
{_superuser},
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_await announce_mutations(_qp, _group0_client, query, {_superuser}, _as, ::service::raft_timeout{});
}
co_await _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state(),
{_superuser},
cql3::query_processor::cache_internal::no).discard_result();
log.info("Created default superuser role '{}'.", _superuser);
} catch(const exceptions::unavailable_exception& e) {
log.warn("Skipped default role setup: some nodes were not ready; will retry");
@@ -205,6 +204,60 @@ future<> standard_role_manager::create_default_role_if_missing() {
}
}
future<> standard_role_manager::maybe_create_default_role() {
auto has_superuser = [this] () -> future<bool> {
const sstring query = seastar::format("SELECT * FROM {}.{} WHERE is_superuser = true ALLOW FILTERING", get_auth_ks_name(_qp), meta::roles_table::name);
auto results = co_await _qp.execute_internal(query, db::consistency_level::LOCAL_ONE,
internal_distributed_query_state(), cql3::query_processor::cache_internal::yes);
for (const auto& result : *results) {
if (has_can_login(result)) {
co_return true;
}
}
co_return false;
};
if (co_await has_superuser()) {
co_return;
}
// We don't want to start operation earlier to avoid quorum requirement in
// a common case.
::service::group0_batch batch(
co_await _group0_client.start_operation(_as, get_raft_timeout()));
// Check again as the state may have changed before we took the guard (batch).
if (co_await has_superuser()) {
co_return;
}
// There is no superuser which has can_login field - create default role.
// Note that we don't check if can_login is set to true.
const sstring insert_query = seastar::format("INSERT INTO {}.{} ({}, is_superuser, can_login) VALUES (?, true, true)",
get_auth_ks_name(_qp),
meta::roles_table::name,
meta::roles_table::role_col_name);
co_await collect_mutations(_qp, batch, insert_query, {_superuser});
co_await std::move(batch).commit(_group0_client, _as, get_raft_timeout());
log.info("Created default superuser role '{}'.", _superuser);
}
future<> standard_role_manager::maybe_create_default_role_with_retries() {
size_t retries = _migration_manager.get_concurrent_ddl_retries();
while (true) {
try {
co_return co_await maybe_create_default_role();
} catch (const ::service::group0_concurrent_modification& ex) {
log.warn("Failed to execute maybe_create_default_role due to guard conflict.{}.", retries ? " Retrying" : " Number of retries exceeded, giving up");
if (retries--) {
continue;
}
// Log error but don't crash the whole node startup sequence.
log.error("Failed to create default superuser role due to guard conflict.");
co_return;
} catch (const ::service::raft_operation_timeout_error& ex) {
log.error("Failed to create default superuser role due to exception: {}", ex.what());
co_return;
}
}
}
static const sstring legacy_table_name{"users"};
bool standard_role_manager::legacy_metadata_exists() {
@@ -266,10 +319,13 @@ future<> standard_role_manager::start() {
co_await migrate_legacy_metadata();
co_return;
}
co_await legacy_create_default_role_if_missing();
}
co_await create_default_role_if_missing();
if (!legacy) {
_superuser_created_promise.set_value();
co_await maybe_create_default_role_with_retries();
if (!_superuser_created_promise.available()) {
_superuser_created_promise.set_value();
}
}
};
@@ -596,21 +652,30 @@ future<role_set> standard_role_manager::query_granted(std::string_view grantee_n
});
}
future<role_to_directly_granted_map> standard_role_manager::query_all_directly_granted() {
future<role_to_directly_granted_map> standard_role_manager::query_all_directly_granted(::service::query_state& qs) {
const sstring query = seastar::format("SELECT * FROM {}.{}",
get_auth_ks_name(_qp),
meta::role_members_table::name);
const auto results = co_await _qp.execute_internal(
query,
db::consistency_level::ONE,
qs,
cql3::query_processor::cache_internal::yes);
role_to_directly_granted_map roles_map;
co_await _qp.query_internal(query, [&roles_map] (const cql3::untyped_result_set_row& row) -> future<stop_iteration> {
roles_map.insert({row.get_as<sstring>("member"), row.get_as<sstring>("role")});
co_return stop_iteration::no;
});
std::transform(
results->begin(),
results->end(),
std::inserter(roles_map, roles_map.begin()),
[] (const cql3::untyped_result_set_row& row) {
return std::make_pair(row.get_as<sstring>("member"), row.get_as<sstring>("role")); }
);
co_return roles_map;
}
future<role_set> standard_role_manager::query_all() {
future<role_set> standard_role_manager::query_all(::service::query_state& qs) {
const sstring query = seastar::format("SELECT {} FROM {}.{}",
meta::roles_table::role_col_name,
get_auth_ks_name(_qp),
@@ -619,10 +684,16 @@ future<role_set> standard_role_manager::query_all() {
// To avoid many copies of a view.
static const auto role_col_name_string = sstring(meta::roles_table::role_col_name);
if (utils::get_local_injector().enter("standard_role_manager_fail_legacy_query")) {
if (legacy_mode(_qp)) {
throw std::runtime_error("standard_role_manager::query_all: failed due to error injection");
}
}
const auto results = co_await _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state(),
qs,
cql3::query_processor::cache_internal::yes);
role_set roles;
@@ -654,11 +725,11 @@ future<bool> standard_role_manager::can_login(std::string_view role_name) {
});
}
future<std::optional<sstring>> standard_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name) {
future<std::optional<sstring>> standard_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state& qs) {
const sstring query = seastar::format("SELECT name, value FROM {}.{} WHERE role = ? AND name = ?",
get_auth_ks_name(_qp),
meta::role_attributes_table::name);
const auto result_set = co_await _qp.execute_internal(query, {sstring(role_name), sstring(attribute_name)}, cql3::query_processor::cache_internal::yes);
const auto result_set = co_await _qp.execute_internal(query, db::consistency_level::ONE, qs, {sstring(role_name), sstring(attribute_name)}, cql3::query_processor::cache_internal::yes);
if (!result_set->empty()) {
const cql3::untyped_result_set_row &row = result_set->one();
co_return std::optional<sstring>(row.get_as<sstring>("value"));
@@ -666,11 +737,11 @@ future<std::optional<sstring>> standard_role_manager::get_attribute(std::string_
co_return std::optional<sstring>{};
}
future<role_manager::attribute_vals> standard_role_manager::query_attribute_for_all (std::string_view attribute_name) {
return query_all().then([this, attribute_name] (role_set roles) {
return do_with(attribute_vals{}, [this, attribute_name, roles = std::move(roles)] (attribute_vals &role_to_att_val) {
return parallel_for_each(roles.begin(), roles.end(), [this, &role_to_att_val, attribute_name] (sstring role) {
return get_attribute(role, attribute_name).then([&role_to_att_val, role] (std::optional<sstring> att_val) {
future<role_manager::attribute_vals> standard_role_manager::query_attribute_for_all (std::string_view attribute_name, ::service::query_state& qs) {
return query_all(qs).then([this, attribute_name, &qs] (role_set roles) {
return do_with(attribute_vals{}, [this, attribute_name, roles = std::move(roles), &qs] (attribute_vals &role_to_att_val) {
return parallel_for_each(roles.begin(), roles.end(), [this, &role_to_att_val, attribute_name, &qs] (sstring role) {
return get_attribute(role, attribute_name, qs).then([&role_to_att_val, role] (std::optional<sstring> att_val) {
if (att_val) {
role_to_att_val.emplace(std::move(role), std::move(*att_val));
}
@@ -715,7 +786,7 @@ future<> standard_role_manager::remove_attribute(std::string_view role_name, std
future<std::vector<cql3::description>> standard_role_manager::describe_role_grants() {
std::vector<cql3::description> result{};
const auto grants = co_await query_all_directly_granted();
const auto grants = co_await query_all_directly_granted(internal_distributed_query_state());
result.reserve(grants.size());
for (const auto& [grantee_role, granted_role] : grants) {

View File

@@ -66,9 +66,9 @@ public:
virtual future<role_set> query_granted(std::string_view grantee_name, recursive_role_query) override;
virtual future<role_to_directly_granted_map> query_all_directly_granted() override;
virtual future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state&) override;
virtual future<role_set> query_all() override;
virtual future<role_set> query_all(::service::query_state&) override;
virtual future<bool> exists(std::string_view role_name) override;
@@ -76,9 +76,9 @@ public:
virtual future<bool> can_login(std::string_view role_name) override;
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name) override;
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state&) override;
virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name) override;
virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name, ::service::query_state&) override;
virtual future<> set_attribute(std::string_view role_name, std::string_view attribute_name, std::string_view attribute_value, ::service::group0_batch& mc) override;
@@ -95,7 +95,10 @@ private:
future<> migrate_legacy_metadata();
future<> create_default_role_if_missing();
future<> legacy_create_default_role_if_missing();
future<> maybe_create_default_role();
future<> maybe_create_default_role_with_retries();
future<> create_or_replace(std::string_view role_name, const role_config&, ::service::group0_batch&);

View File

@@ -37,8 +37,8 @@ class transitional_authenticator : public authenticator {
public:
static const sstring PASSWORD_AUTHENTICATOR_NAME;
transitional_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm)
: transitional_authenticator(std::make_unique<password_authenticator>(qp, g0, mm)) {
transitional_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm, utils::alien_worker& hashing_worker)
: transitional_authenticator(std::make_unique<password_authenticator>(qp, g0, mm, hashing_worker)) {
}
transitional_authenticator(std::unique_ptr<authenticator> a)
: _authenticator(std::move(a)) {
@@ -239,7 +239,8 @@ static const class_registrator<
auth::transitional_authenticator,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&> transitional_authenticator_reg(auth::PACKAGE_NAME + "TransitionalAuthenticator");
::service::migration_manager&,
utils::alien_worker&> transitional_authenticator_reg(auth::PACKAGE_NAME + "TransitionalAuthenticator");
static const class_registrator<
auth::authorizer,

View File

@@ -35,8 +35,9 @@ inline bytes_view to_bytes_view(std::string_view view) {
}
struct fmt_hex {
const bytes_view& v;
fmt_hex(const bytes_view& v) noexcept : v(v) {}
std::span<const std::byte> v;
fmt_hex(const bytes_view& v) noexcept : v(std::as_bytes(std::span(v))) {}
fmt_hex(std::span<const std::byte> v) noexcept : v(v) {}
};
bytes from_hex(std::string_view s);

View File

@@ -23,6 +23,10 @@ class cdc_extension : public schema_extension {
public:
static constexpr auto NAME = "cdc";
// cdc_extension was written before schema_extension was deprecated, so support it
// without warnings
#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Wdeprecated-declarations"
cdc_extension() = default;
cdc_extension(const options& opts) : _cdc_options(opts) {}
explicit cdc_extension(std::map<sstring, sstring> tags) : _cdc_options(std::move(tags)) {}
@@ -30,6 +34,7 @@ public:
explicit cdc_extension(const sstring& s) {
throw std::logic_error("Cannot create cdc info from string");
}
#pragma clang diagnostic pop
bytes serialize() const override {
return ser::serialize_to_buffer<bytes>(_cdc_options.to_map());
}

View File

@@ -39,12 +39,12 @@
extern logging::logger cdc_log;
static int get_shard_count(const gms::inet_address& endpoint, const gms::gossiper& g) {
static int get_shard_count(const locator::host_id& endpoint, const gms::gossiper& g) {
auto ep_state = g.get_application_state_ptr(endpoint, gms::application_state::SHARD_COUNT);
return ep_state ? std::stoi(ep_state->value()) : -1;
}
static unsigned get_sharding_ignore_msb(const gms::inet_address& endpoint, const gms::gossiper& g) {
static unsigned get_sharding_ignore_msb(const locator::host_id& endpoint, const gms::gossiper& g) {
auto ep_state = g.get_application_state_ptr(endpoint, gms::application_state::IGNORE_MSB_BITS);
return ep_state ? std::stoi(ep_state->value()) : 0;
}
@@ -198,7 +198,7 @@ static std::vector<stream_id> create_stream_ids(
}
bool should_propose_first_generation(const locator::host_id& my_host_id, const gms::gossiper& g) {
return g.for_each_endpoint_state_until([&] (const gms::inet_address&, const gms::endpoint_state& eps) {
return g.for_each_endpoint_state_until([&] (const gms::endpoint_state& eps) {
return stop_iteration(my_host_id < eps.get_host_id());
}) == stop_iteration::no;
}
@@ -365,6 +365,9 @@ cdc::topology_description make_new_generation_description(
const noncopyable_function<std::pair<size_t, uint8_t>(dht::token)>& get_sharding_info,
const locator::token_metadata_ptr tmptr) {
const auto tokens = get_tokens(bootstrap_tokens, tmptr);
if (tokens.empty()) {
on_internal_error(cdc_log, "Attempted to create a CDC generation from an empty list of tokens");
}
utils::chunked_vector<token_range_description> vnode_descriptions;
vnode_descriptions.reserve(tokens.size());
@@ -402,9 +405,8 @@ future<cdc::generation_id> generation_service::legacy_make_new_generation(const
throw std::runtime_error(
format("Can't find endpoint for token {}", end));
}
const auto ep = _gossiper.get_address_map().get(*endpoint);
auto sc = get_shard_count(ep, _gossiper);
return {sc > 0 ? sc : 1, get_sharding_ignore_msb(ep, _gossiper)};
auto sc = get_shard_count(*endpoint, _gossiper);
return {sc > 0 ? sc : 1, get_sharding_ignore_msb(*endpoint, _gossiper)};
}
};
@@ -463,7 +465,7 @@ future<cdc::generation_id> generation_service::legacy_make_new_generation(const
* but if the cluster already supports CDC, then every newly joining node will propose a new CDC generation,
* which means it will gossip the generation's timestamp.
*/
static std::optional<cdc::generation_id> get_generation_id_for(const gms::inet_address& endpoint, const gms::endpoint_state& eps) {
static std::optional<cdc::generation_id> get_generation_id_for(const locator::host_id& endpoint, const gms::endpoint_state& eps) {
const auto* gen_id_ptr = eps.get_application_state_ptr(gms::application_state::CDC_GENERATION_ID);
if (!gen_id_ptr) {
return std::nullopt;
@@ -841,18 +843,18 @@ future<> generation_service::leave_ring() {
co_await _gossiper.unregister_(shared_from_this());
}
future<> generation_service::on_join(gms::inet_address ep, gms::endpoint_state_ptr ep_state, gms::permit_id pid) {
return on_change(ep, ep_state->get_application_state_map(), pid);
future<> generation_service::on_join(gms::inet_address ep, locator::host_id id, gms::endpoint_state_ptr ep_state, gms::permit_id pid) {
return on_change(ep, id, ep_state->get_application_state_map(), pid);
}
future<> generation_service::on_change(gms::inet_address ep, const gms::application_state_map& states, gms::permit_id pid) {
future<> generation_service::on_change(gms::inet_address ep, locator::host_id id, const gms::application_state_map& states, gms::permit_id pid) {
assert_shard_zero(__PRETTY_FUNCTION__);
if (_raft_topology_change_enabled()) {
return make_ready_future<>();
}
return on_application_state_change(ep, states, gms::application_state::CDC_GENERATION_ID, pid, [this] (gms::inet_address ep, const gms::versioned_value& v, gms::permit_id) {
return on_application_state_change(ep, id, states, gms::application_state::CDC_GENERATION_ID, pid, [this] (gms::inet_address ep, locator::host_id id, const gms::versioned_value& v, gms::permit_id) {
auto gen_id = gms::versioned_value::cdc_generation_id_from_string(v.value());
cdc_log.debug("Endpoint: {}, CDC generation ID change: {}", ep, gen_id);
@@ -867,7 +869,8 @@ future<> generation_service::check_and_repair_cdc_streams() {
}
std::optional<cdc::generation_id> latest = _gen_id;
_gossiper.for_each_endpoint_state([&] (const gms::inet_address& addr, const gms::endpoint_state& state) {
_gossiper.for_each_endpoint_state([&] (const gms::endpoint_state& state) {
auto addr = state.get_host_id();
if (_gossiper.is_left(addr)) {
cdc_log.info("check_and_repair_cdc_streams ignored node {} because it is in LEFT state", addr);
return;
@@ -1066,8 +1069,8 @@ future<> generation_service::legacy_scan_cdc_generations() {
assert_shard_zero(__PRETTY_FUNCTION__);
std::optional<cdc::generation_id> latest;
_gossiper.for_each_endpoint_state([&] (const gms::inet_address& node, const gms::endpoint_state& eps) {
auto gen_id = get_generation_id_for(node, eps);
_gossiper.for_each_endpoint_state([&] (const gms::endpoint_state& eps) {
auto gen_id = get_generation_id_for(eps.get_host_id(), eps);
if (!latest || (gen_id && get_ts(*gen_id) > get_ts(*latest))) {
latest = gen_id;
}

View File

@@ -110,13 +110,8 @@ public:
return _cdc_metadata;
}
virtual future<> on_alive(gms::inet_address, gms::endpoint_state_ptr, gms::permit_id) override { return make_ready_future(); }
virtual future<> on_dead(gms::inet_address, gms::endpoint_state_ptr, gms::permit_id) override { return make_ready_future(); }
virtual future<> on_remove(gms::inet_address, gms::permit_id) override { return make_ready_future(); }
virtual future<> on_restart(gms::inet_address, gms::endpoint_state_ptr, gms::permit_id) override { return make_ready_future(); }
virtual future<> on_join(gms::inet_address, gms::endpoint_state_ptr, gms::permit_id) override;
virtual future<> on_change(gms::inet_address, const gms::application_state_map&, gms::permit_id) override;
virtual future<> on_join(gms::inet_address, locator::host_id id, gms::endpoint_state_ptr, gms::permit_id) override;
virtual future<> on_change(gms::inet_address, locator::host_id id, const gms::application_state_map&, gms::permit_id) override;
future<> check_and_repair_cdc_streams();

View File

@@ -960,8 +960,12 @@ public:
// Given a reference to such a column from the base schema, this function sets the corresponding column
// in the log to the given value for the given row.
void set_value(const clustering_key& log_ck, const column_definition& base_cdef, const managed_bytes_view& value) {
auto& log_cdef = *_log_schema.get_column_definition(log_data_column_name_bytes(base_cdef.name()));
_log_mut.set_cell(log_ck, log_cdef, atomic_cell::make_live(*base_cdef.type, _ts, value, _ttl));
auto log_cdef_ptr = _log_schema.get_column_definition(log_data_column_name_bytes(base_cdef.name()));
if (!log_cdef_ptr) {
throw exceptions::invalid_request_exception(format("CDC log schema for {}.{} does not have base column {}",
_log_schema.ks_name(), _log_schema.cf_name(), base_cdef.name_as_text()));
}
_log_mut.set_cell(log_ck, *log_cdef_ptr, atomic_cell::make_live(*base_cdef.type, _ts, value, _ttl));
}
// Each regular and static column in the base schema has a corresponding column in the log schema
@@ -969,7 +973,13 @@ public:
// Given a reference to such a column from the base schema, this function sets the corresponding column
// in the log to `true` for the given row. If not called, the column will be `null`.
void set_deleted(const clustering_key& log_ck, const column_definition& base_cdef) {
_log_mut.set_cell(log_ck, log_data_column_deleted_name_bytes(base_cdef.name()), data_value(true), _ts, _ttl);
auto log_cdef_ptr = _log_schema.get_column_definition(log_data_column_deleted_name_bytes(base_cdef.name()));
if (!log_cdef_ptr) {
throw exceptions::invalid_request_exception(format("CDC log schema for {}.{} does not have base column {}",
_log_schema.ks_name(), _log_schema.cf_name(), base_cdef.name_as_text()));
}
auto& log_cdef = *log_cdef_ptr;
_log_mut.set_cell(log_ck, *log_cdef_ptr, atomic_cell::make_live(*log_cdef.type, _ts, log_cdef.type->decompose(true), _ttl));
}
// Each regular and static non-atomic column in the base schema has a corresponding column in the log schema
@@ -978,7 +988,12 @@ public:
// Given a reference to such a column from the base schema, this function sets the corresponding column
// in the log to the given set of keys for the given row.
void set_deleted_elements(const clustering_key& log_ck, const column_definition& base_cdef, const managed_bytes& deleted_elements) {
auto& log_cdef = *_log_schema.get_column_definition(log_data_column_deleted_elements_name_bytes(base_cdef.name()));
auto log_cdef_ptr = _log_schema.get_column_definition(log_data_column_deleted_elements_name_bytes(base_cdef.name()));
if (!log_cdef_ptr) {
throw exceptions::invalid_request_exception(format("CDC log schema for {}.{} does not have base column {}",
_log_schema.ks_name(), _log_schema.cf_name(), base_cdef.name_as_text()));
}
auto& log_cdef = *log_cdef_ptr;
_log_mut.set_cell(log_ck, log_cdef, atomic_cell::make_live(*log_cdef.type, _ts, deleted_elements, _ttl));
}
@@ -1865,5 +1880,10 @@ bool cdc::cdc_service::needs_cdc_augmentation(const std::vector<mutation>& mutat
future<std::tuple<std::vector<mutation>, lw_shared_ptr<cdc::operation_result_tracker>>>
cdc::cdc_service::augment_mutation_call(lowres_clock::time_point timeout, std::vector<mutation>&& mutations, tracing::trace_state_ptr tr_state, db::consistency_level write_cl) {
if (utils::get_local_injector().enter("sleep_before_cdc_augmentation")) {
return seastar::sleep(std::chrono::milliseconds(100)).then([this, timeout, mutations = std::move(mutations), tr_state = std::move(tr_state), write_cl] () mutable {
return _impl->augment_mutation_call(timeout, std::move(mutations), std::move(tr_state), write_cl);
});
}
return _impl->augment_mutation_call(timeout, std::move(mutations), std::move(tr_state), write_cl);
}

View File

@@ -3,7 +3,7 @@ set(CMAKE_CXX_FLAGS_COVERAGE
CACHE
INTERNAL
"")
update_cxx_flags(CMAKE_CXX_FLAGS_COVERAGE
update_build_flags(Coverage
WITH_DEBUG_INFO
OPTIMIZATION_LEVEL "g")

View File

@@ -1,6 +1,6 @@
set(OptimizationLevel "g")
update_cxx_flags(CMAKE_CXX_FLAGS_DEBUG
update_build_flags(Debug
WITH_DEBUG_INFO
OPTIMIZATION_LEVEL ${OptimizationLevel})

View File

@@ -3,7 +3,7 @@ set(CMAKE_CXX_FLAGS_DEV
CACHE
INTERNAL
"")
update_cxx_flags(CMAKE_CXX_FLAGS_DEV
update_build_flags(Dev
OPTIMIZATION_LEVEL "2")
set(scylla_build_mode_Dev "dev")

View File

@@ -8,7 +8,7 @@ set(CMAKE_CXX_FLAGS_RELWITHDEBINFO
CACHE
INTERNAL
"")
update_cxx_flags(CMAKE_CXX_FLAGS_RELWITHDEBINFO
update_build_flags(RelWithDebInfo
WITH_DEBUG_INFO
OPTIMIZATION_LEVEL "3")

View File

@@ -3,7 +3,7 @@ set(CMAKE_CXX_FLAGS_SANITIZE
CACHE
INTERNAL
"")
update_cxx_flags(CMAKE_CXX_FLAGS_SANITIZE
update_build_flags(Sanitize
WITH_DEBUG_INFO
OPTIMIZATION_LEVEL "s")

View File

@@ -72,7 +72,7 @@ function(get_padded_dynamic_linker_option output length)
ERROR_VARIABLE driver_command_line
ERROR_STRIP_TRAILING_WHITESPACE)
# extract the argument for the "-dynamic-linker" option
if(driver_command_line MATCHES ".*\"?${dynamic_linker_option}\"? \"?([^ \"]*)\"? .*")
if(driver_command_line MATCHES ".*\"?${dynamic_linker_option}\"?[ =]\"?([^ \"]*)\"?[ \n].*")
set(dynamic_linker ${CMAKE_MATCH_1})
else()
message(FATAL_ERROR "Unable to find ${dynamic_linker_option} in driver-generated command: "
@@ -135,7 +135,7 @@ function(maybe_limit_stack_usage_in_KB stack_usage_threshold_in_KB config)
endif()
endfunction()
macro(update_cxx_flags flags)
macro(update_build_flags config)
cmake_parse_arguments (
parsed_args
"WITH_DEBUG_INFO"
@@ -145,11 +145,15 @@ macro(update_cxx_flags flags)
if(NOT DEFINED parsed_args_OPTIMIZATION_LEVEL)
message(FATAL_ERROR "OPTIMIZATION_LEVEL is missing")
endif()
string(APPEND ${flags}
string(TOUPPER ${config} CONFIG)
set(cxx_flags "CMAKE_CXX_FLAGS_${CONFIG}")
string(APPEND ${cxx_flags}
" -O${parsed_args_OPTIMIZATION_LEVEL}")
if(parsed_args_WITH_DEBUG_INFO)
string(APPEND ${flags} " -g -gz")
string(APPEND ${cxx_flags} " -g -gz")
endif()
unset(CONFIG)
unset(cxx_flags)
endmacro()
set(pgo_opts "")

View File

@@ -769,7 +769,7 @@ private:
}
virtual sstables::sstable_set make_sstable_set_for_input() const {
return _table_s.get_compaction_strategy().make_sstable_set(_schema);
return _table_s.get_compaction_strategy().make_sstable_set(_table_s);
}
const tombstone_gc_state& get_tombstone_gc_state() const {
@@ -1301,7 +1301,7 @@ public:
}
virtual sstables::sstable_set make_sstable_set_for_input() const override {
return sstables::make_partitioned_sstable_set(_schema, false);
return sstables::make_partitioned_sstable_set(_schema, _table_s.token_range());
}
// Unconditionally enable incremental compaction if the strategy specifies a max output size, e.g. LCS.
@@ -1910,7 +1910,11 @@ static future<compaction_result> scrub_sstables_validate_mode(sstables::compacti
using scrub = sstables::compaction_type_options::scrub;
if (validation_errors != 0 && descriptor.options.as<scrub>().quarantine_sstables == scrub::quarantine_invalid_sstables::yes) {
for (auto& sst : descriptor.sstables) {
co_await sst->change_state(sstables::sstable_state::quarantine);
try {
co_await sst->change_state(sstables::sstable_state::quarantine);
} catch (...) {
clogger.error("Moving {} to quarantine failed due to {}, continuing.", sst->get_filename(), std::current_exception());
}
}
}

View File

@@ -395,7 +395,7 @@ future<sstables::sstable_set> compaction_task_executor::sstable_set_for_tombston
auto compound_set = t.sstable_set_for_tombstone_gc();
// Compound set will be linearized into a single set, since compaction might add or remove sstables
// to it for incremental compaction to work.
auto new_set = sstables::make_partitioned_sstable_set(t.schema(), false);
auto new_set = sstables::make_partitioned_sstable_set(t.schema(), t.token_range());
co_await compound_set->for_each_sstable_gently([&] (const sstables::shared_sstable& sst) {
auto inserted = new_set.insert(sst);
if (!inserted) {
@@ -458,9 +458,7 @@ future<sstables::compaction_result> compaction_task_executor::compact_sstables(s
future<> compaction_task_executor::update_history(table_state& t, const sstables::compaction_result& res, const sstables::compaction_data& cdata) {
auto ended_at = std::chrono::duration_cast<std::chrono::milliseconds>(res.stats.ended_at.time_since_epoch());
if (_cm._sys_ks) {
auto sys_ks = _cm._sys_ks; // hold pointer on sys_ks
if (auto sys_ks = _cm._sys_ks.get_permit()) {
co_await utils::get_local_injector().inject("update_history_wait", utils::wait_for_message(120s));
std::unordered_map<int32_t, int64_t> rows_merged;
for (size_t id=0; id<res.stats.reader_statistics.rows_merged_histogram.size(); ++id) {
@@ -475,11 +473,9 @@ future<> compaction_task_executor::update_history(table_state& t, const sstables
}
future<> compaction_manager::get_compaction_history(compaction_history_consumer&& f) {
if (!_sys_ks) {
return make_ready_future<>();
if (auto sys_ks = _sys_ks.get_permit()) {
co_await sys_ks->get_compaction_history(std::move(f));
}
return _sys_ks->get_compaction_history(std::move(f)).finally([s = _sys_ks] {});
}
template<std::derived_from<compaction::compaction_task_executor> Executor>
@@ -924,6 +920,7 @@ public:
compaction_manager::compaction_manager(config cfg, abort_source& as, tasks::task_manager& tm)
: _task_manager_module(make_shared<task_manager_module>(tm))
, _sys_ks("compaction_manager::system_keyspace")
, _cfg(std::move(cfg))
, _compaction_submission_timer(compaction_sg(), compaction_submission_callback())
, _compaction_controller(make_compaction_controller(compaction_sg(), static_shares(), [this] () -> float {
@@ -960,6 +957,7 @@ compaction_manager::compaction_manager(config cfg, abort_source& as, tasks::task
compaction_manager::compaction_manager(tasks::task_manager& tm)
: _task_manager_module(make_shared<task_manager_module>(tm))
, _sys_ks("compaction_manager::system_keyspace")
, _cfg(config{ .available_memory = 1 })
, _compaction_submission_timer(compaction_sg(), compaction_submission_callback())
, _compaction_controller(make_compaction_controller(compaction_sg(), 1, [] () -> float { return 1.0; }))
@@ -1128,16 +1126,16 @@ future<> compaction_manager::drain() {
// Disable the state so that it can be enabled later if requested.
_state = state::disabled;
}
_compaction_submission_timer.cancel();
// Stop ongoing compactions, if the request has not been sent already and wait for them to stop.
co_await stop_ongoing_compactions("drain");
// Trigger a signal to properly exit from postponed_compactions_reevaluation() fiber
reevaluate_postponed_compactions();
cmlog.info("Drained");
}
future<> compaction_manager::stop() {
do_stop();
if (auto cm = std::exchange(_task_manager_module, nullptr)) {
co_await cm->stop();
}
if (_stop_future) {
co_await std::exchange(*_stop_future, make_ready_future());
}
@@ -1148,16 +1146,18 @@ future<> compaction_manager::really_do_stop() noexcept {
// Reset the metrics registry
_metrics.clear();
co_await stop_ongoing_compactions("shutdown");
if (!_tasks.empty()) {
on_fatal_internal_error(cmlog, format("{} tasks still exist after being stopped", _tasks.size()));
}
co_await _task_manager_module->stop();
co_await coroutine::parallel_for_each(_compaction_state | std::views::values, [] (compaction_state& cs) -> future<> {
if (!cs.gate.is_closed()) {
co_await cs.gate.close();
}
});
if (!_tasks.empty()) {
on_fatal_internal_error(cmlog, format("{} tasks still exist after being stopped", _tasks.size()));
}
reevaluate_postponed_compactions();
co_await std::move(_waiting_reevalution);
co_await _sys_ks.close();
_weight_tracker.clear();
_compaction_submission_timer.cancel();
co_await _compaction_controller.shutdown();
@@ -1818,8 +1818,21 @@ future<compaction_manager::compaction_stats_opt> compaction_manager::perform_sst
if (!gh) {
co_return compaction_stats_opt{};
}
// All sstables must be included, even the ones being compacted, such that everything in table is validated.
auto all_sstables = get_all_sstables(t);
// Collect and register all sstables as compacting while compaction is disabled, to avoid a race condition where
// regular compaction runs in between and picks the same files.
std::vector<sstables::shared_sstable> all_sstables;
compacting_sstable_registration compacting(*this, get_compaction_state(&t));
co_await run_with_compaction_disabled(t, [&all_sstables, &compacting, &t] () -> future<> {
// All sstables must be included.
all_sstables = get_all_sstables(t);
compacting.register_compacting(all_sstables);
return make_ready_future();
});
if (all_sstables.empty()) {
co_return compaction_stats_opt{};
}
co_return co_await perform_compaction<validate_sstables_compaction_task_executor>(throw_if_stopping::no, info, &t, info.id, std::move(all_sstables), quarantine_sstables);
}
@@ -2178,7 +2191,8 @@ future<compaction_manager::compaction_stats_opt> compaction_manager::perform_sst
}
compaction::compaction_state::compaction_state(table_state& t)
: backlog_tracker(t.get_compaction_strategy().make_backlog_tracker())
: gate(format("compaction_state for table {}.{}", t.schema()->ks_name(), t.schema()->cf_name()))
, backlog_tracker(t.get_compaction_strategy().make_backlog_tracker())
{
}
@@ -2294,11 +2308,11 @@ strategy_control& compaction_manager::get_strategy_control() const noexcept {
}
void compaction_manager::plug_system_keyspace(db::system_keyspace& sys_ks) noexcept {
_sys_ks = sys_ks.shared_from_this();
_sys_ks.plug(sys_ks.shared_from_this());
}
void compaction_manager::unplug_system_keyspace() noexcept {
_sys_ks = nullptr;
future<> compaction_manager::unplug_system_keyspace() noexcept {
co_await _sys_ks.unplug();
}
double compaction_backlog_tracker::backlog() const {

View File

@@ -32,6 +32,7 @@
#include "seastarx.hh"
#include "sstables/exceptions.hh"
#include "tombstone_gc.hh"
#include "utils/pluggable.hh"
namespace db {
class system_keyspace;
@@ -138,7 +139,7 @@ private:
// being picked more than once.
seastar::named_semaphore _off_strategy_sem = {1, named_semaphore_exception_factory{"off-strategy compaction"}};
seastar::shared_ptr<db::system_keyspace> _sys_ks;
utils::pluggable<db::system_keyspace> _sys_ks;
std::function<void()> compaction_submission_callback();
// all registered tables are reevaluated at a constant interval.
@@ -300,6 +301,11 @@ public:
// unless it is moved back to enabled state.
future<> drain();
// Check if compaction manager is running, i.e. it was enabled or drained
bool is_running() const noexcept {
return _state == state::enabled || _state == state::disabled;
}
using compaction_history_consumer = noncopyable_function<future<>(const db::compaction_history_entry&)>;
future<> get_compaction_history(compaction_history_consumer&& f);
@@ -391,7 +397,7 @@ public:
future<> run_with_compaction_disabled(compaction::table_state& t, std::function<future<> ()> func);
void plug_system_keyspace(db::system_keyspace& sys_ks) noexcept;
void unplug_system_keyspace() noexcept;
future<> unplug_system_keyspace() noexcept;
// Adds a table to the compaction manager.
// Creates a compaction_state structure that can be used for submitting

View File

@@ -22,7 +22,7 @@ namespace compaction {
struct compaction_state {
// Used both by compaction tasks that refer to the compaction_state
// and by any function running under run_with_compaction_disabled().
seastar::gate gate;
seastar::named_gate gate;
// Prevents table from running major and minor compaction at the same time.
seastar::rwlock lock;

View File

@@ -789,8 +789,8 @@ future<reshape_config> make_reshape_config(const sstables::storage& storage, res
};
}
std::unique_ptr<sstable_set_impl> incremental_compaction_strategy::make_sstable_set(schema_ptr schema) const {
return std::make_unique<partitioned_sstable_set>(std::move(schema), false);
std::unique_ptr<sstable_set_impl> incremental_compaction_strategy::make_sstable_set(const table_state& ts) const {
return std::make_unique<partitioned_sstable_set>(ts.schema(), ts.token_range());
}
}

View File

@@ -105,7 +105,7 @@ public:
return name(type());
}
sstable_set make_sstable_set(schema_ptr schema) const;
sstable_set make_sstable_set(const table_state& ts) const;
compaction_backlog_tracker make_backlog_tracker() const;

View File

@@ -56,7 +56,7 @@ public:
return true;
}
virtual int64_t estimated_pending_compactions(table_state& table_s) const = 0;
virtual std::unique_ptr<sstable_set_impl> make_sstable_set(schema_ptr schema) const;
virtual std::unique_ptr<sstable_set_impl> make_sstable_set(const table_state& ts) const;
bool use_clustering_key_filter() const {
return _use_clustering_key_filter;

View File

@@ -98,7 +98,7 @@ public:
virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const override;
virtual std::unique_ptr<sstable_set_impl> make_sstable_set(schema_ptr schema) const override;
virtual std::unique_ptr<sstable_set_impl> make_sstable_set(const table_state& ts) const override;
friend class ::incremental_backlog_tracker;
};

View File

@@ -70,7 +70,7 @@ public:
virtual compaction_strategy_type type() const override {
return compaction_strategy_type::leveled;
}
virtual std::unique_ptr<sstable_set_impl> make_sstable_set(schema_ptr schema) const override;
virtual std::unique_ptr<sstable_set_impl> make_sstable_set(const table_state& ts) const override;
virtual std::unique_ptr<compaction_backlog_tracker::impl> make_backlog_tracker() const override;

View File

@@ -33,6 +33,7 @@ namespace compaction {
class table_state {
public:
virtual ~table_state() {}
virtual dht::token_range token_range() const noexcept = 0;
virtual const schema_ptr& schema() const noexcept = 0;
// min threshold as defined by table.
virtual unsigned min_compaction_threshold() const noexcept = 0;

View File

@@ -150,7 +150,7 @@ public:
return compaction_strategy_type::time_window;
}
virtual std::unique_ptr<sstable_set_impl> make_sstable_set(schema_ptr schema) const override;
virtual std::unique_ptr<sstable_set_impl> make_sstable_set(const table_state& ts) const override;
virtual std::unique_ptr<compaction_backlog_tracker::impl> make_backlog_tracker() const override;

View File

@@ -255,6 +255,9 @@ public:
// Returns true iff given prefix has no missing components
bool is_full(managed_bytes_view v) const {
SCYLLA_ASSERT(AllowPrefixes == allow_prefixes::yes);
if (_types.size() == 0) {
return v.empty();
}
return std::distance(begin(v), end(v)) == (ssize_t)_types.size();
}
bool is_empty(managed_bytes_view v) const {

File diff suppressed because it is too large Load Diff

View File

@@ -10,17 +10,25 @@
#include <map>
#include <optional>
#include <set>
#include <seastar/core/future.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/sstring.hh>
#include <seastar/util/bool_class.hh>
#include "seastarx.hh"
class compression_parameters;
class compressor {
sstring _name;
public:
compressor(sstring);
enum class algorithm {
lz4,
lz4_with_dicts,
zstd,
zstd_with_dicts,
snappy,
deflate,
none,
};
virtual ~compressor() {}
@@ -42,44 +50,38 @@ public:
virtual size_t compress_max_size(size_t input_len) const = 0;
/**
* Returns accepted option names for this compressor
*/
virtual std::set<sstring> option_names() const;
/**
* Returns original options used in instantiating this compressor
* Returns metadata which must be written together with the compressed
* data and used to construct a corresponding decompressor.
*/
virtual std::map<sstring, sstring> options() const;
/**
* Compressor class name.
*/
const sstring& name() const {
return _name;
}
static bool is_hidden_option_name(std::string_view sv);
// to cheaply bridge sstable compression options / maps
using opt_string = std::optional<sstring>;
using opt_getter = std::function<opt_string(const sstring&)>;
using ptr_type = shared_ptr<compressor>;
std::string name() const;
static ptr_type create(const sstring& name, const opt_getter&);
static ptr_type create(const std::map<sstring, sstring>&);
virtual algorithm get_algorithm() const = 0;
static thread_local const ptr_type lz4;
static thread_local const ptr_type snappy;
static thread_local const ptr_type deflate;
virtual std::optional<unsigned> get_dict_owner_for_test() const;
static sstring make_name(std::string_view short_name);
using ptr_type = std::unique_ptr<compressor>;
};
template<typename BaseType, typename... Args>
class class_registry;
using compressor_ptr = compressor::ptr_type;
using compressor_registry = class_registry<compressor, const typename compressor::opt_getter&>;
compressor_ptr make_lz4_sstable_compressor_for_tests();
// Per-table compression options, parsed and validated.
//
// Compression options are configured through the JSON-like `compression` entry in the schema.
// The CQL layer parses the text of that entry to a `map<string, string>`.
// A `compression_parameters` object is constructed from this map.
// and the passed keys and values are parsed and validated in the constructor.
// This object can be then used to create a `compressor` objects for sstable readers and writers.
class compression_parameters {
public:
using algorithm = compressor::algorithm;
static constexpr std::string_view name_prefix = "org.apache.cassandra.io.compress.";
static constexpr int32_t DEFAULT_CHUNK_LENGTH = 4 * 1024;
static constexpr double DEFAULT_CRC_CHECK_CHANCE = 1.0;
@@ -88,26 +90,47 @@ public:
static const sstring CHUNK_LENGTH_KB_ERR;
static const sstring CRC_CHECK_CHANCE;
private:
compressor_ptr _compressor;
algorithm _algorithm;
std::optional<int> _chunk_length;
std::optional<double> _crc_check_chance;
std::optional<int> _zstd_compression_level;
public:
compression_parameters();
compression_parameters(compressor_ptr);
compression_parameters(algorithm);
compression_parameters(const std::map<sstring, sstring>& options);
~compression_parameters();
compressor_ptr get_compressor() const { return _compressor; }
int32_t chunk_length() const { return _chunk_length.value_or(int(DEFAULT_CHUNK_LENGTH)); }
double crc_check_chance() const { return _crc_check_chance.value_or(double(DEFAULT_CRC_CHECK_CHANCE)); }
algorithm get_algorithm() const { return _algorithm; }
std::optional<int> zstd_compression_level() const { return _zstd_compression_level; }
using dicts_feature_enabled = bool_class<struct dicts_feature_enabled_tag>;
using dicts_usage_allowed = bool_class<struct dicts_usage_allowed_tag>;
void validate(dicts_feature_enabled, dicts_usage_allowed) const;
void validate();
std::map<sstring, sstring> get_options() const;
bool operator==(const compression_parameters& other) const;
static compression_parameters no_compression() {
return compression_parameters(nullptr);
bool compression_enabled() const {
return _algorithm != algorithm::none;
}
static compression_parameters no_compression() {
return compression_parameters(algorithm::none);
}
bool operator==(const compression_parameters&) const = default;
static std::string_view algorithm_to_name(algorithm);
static std::string algorithm_to_qualified_name(algorithm);
private:
void validate_options(const std::map<sstring, sstring>&);
static void validate_options(const std::map<sstring, sstring>&);
static algorithm name_to_algorithm(std::string_view name);
};
// Stream operator for boost::program_options support
std::istream& operator>>(std::istream& is, compression_parameters& cp);
template <>
struct fmt::formatter<compression_parameters> : fmt::formatter<std::string_view> {
auto format(const compression_parameters& cp, fmt::format_context& ctx) const -> decltype(ctx.out()) {
return fmt::format_to(ctx.out(), "{}", cp.get_options());
}
};

View File

@@ -825,7 +825,9 @@ maintenance_socket: ignore
# Guardrail to enable the deprecated feature of CREATE TABLE WITH COMPACT STORAGE.
# enable_create_table_with_compact_storage: false
# Enable tablets for new keyspaces.
# Control tablets for new keyspaces.
# Can be set to: disabled|enabled
#
# When enabled, newly created keyspaces will have tablets enabled by default.
# That can be explicitly disabled in the CREATE KEYSPACE query
# by using the `tablets = {'enabled': false}` replication option.
@@ -834,6 +836,37 @@ maintenance_socket: ignore
# unless tablets are explicitly enabled in the CREATE KEYSPACE query
# by using the `tablets = {'enabled': true}` replication option.
#
# When set to `enforced`, newly created keyspaces will always have tablets enabled by default.
# This prevents explicitly disabling tablets in the CREATE KEYSPACE query
# using the `tablets = {'enabled': false}` replication option.
# It also mandates a replication strategy supporting tablets, like
# NetworkTopologyStrategy
#
# Note that creating keyspaces with tablets enabled or disabled is irreversible.
# The `tablets` option cannot be changed using `ALTER KEYSPACE`.
enable_tablets: true
tablets_mode_for_new_keyspaces: enabled
# Enforce RF-rack-valid keyspaces.
rf_rack_valid_keyspaces: false
#
# Alternator options
#
# Maximum number of items in single BatchWriteItem command. Default is 100.
# Note: DynamoDB has a hard-coded limit of 25.
# alternator_max_items_in_batch_write: 100
#
# io-streaming rate limiting
# When setting this value to be non-zero scylla throttles disk throughput for
# stream (network) activities such as backup, repair, tablet migration and more.
# This limit is useful for user queries so the network interface does
# not get saturated by streaming activities.
# The recommended value is 75% of network bandwidth
# E.g for i4i.8xlarge (https://github.com/scylladb/scylla-machine-image/tree/next/common/aws_net_params.json):
# network: 18.75 GiB/s --> 18750 Mib/s --> 1875 MB/s (from network bits to network bytes: divide by 10, not 8)
# Converted to disk bytes: 1875 * 1000 / 1024 = 1831 MB/s (disk wise)
# 75% of disk bytes is: 0.75 * 1831 = 1373 megabytes/s
# stream_io_throughput_mb_per_sec: 1373
#

View File

@@ -610,6 +610,10 @@ perf_tests = set([
'test/perf/perf_sort_by_proximity',
])
perf_standalone_tests = set([
'test/perf/perf_generic_server',
])
raft_tests = set([
'test/raft/replication_test',
'test/raft/randomized_nemesis_test',
@@ -644,7 +648,7 @@ lto_binaries = set([
'scylla'
])
tests = scylla_tests | perf_tests | raft_tests
tests = scylla_tests | perf_tests | perf_standalone_tests | raft_tests
other = set([
'iotune',
@@ -680,8 +684,8 @@ arg_parser.add_argument('--compiler', action='store', dest='cxx', default='clang
help='C++ compiler path')
arg_parser.add_argument('--c-compiler', action='store', dest='cc', default='clang',
help='C compiler path')
add_tristate(arg_parser, name='dpdk', dest='dpdk',
help='Use dpdk (from seastar dpdk sources) (default=True for release builds)')
add_tristate(arg_parser, name='dpdk', dest='dpdk', default=False,
help='Use dpdk (from seastar dpdk sources)')
arg_parser.add_argument('--dpdk-target', action='store', dest='dpdk_target', default='',
help='Path to DPDK SDK target location (e.g. <DPDK SDK dir>/x86_64-native-linuxapp-gcc)')
arg_parser.add_argument('--debuginfo', action='store', dest='debuginfo', type=int, default=1,
@@ -812,6 +816,7 @@ scylla_core = (['message/messaging_service.cc',
'utils/rjson.cc',
'utils/human_readable.cc',
'utils/histogram_metrics_helper.cc',
'utils/io-wrappers.cc',
'utils/on_internal_error.cc',
'utils/pretty_printers.cc',
'utils/stream_compressor.cc',
@@ -825,7 +830,7 @@ scylla_core = (['message/messaging_service.cc',
'keys.cc',
'counters.cc',
'compress.cc',
'zstd.cc',
'sstable_dict_autotrainer.cc',
'sstables/sstables.cc',
'sstables/sstables_manager.cc',
'sstables/sstable_set.cc',
@@ -976,6 +981,7 @@ scylla_core = (['message/messaging_service.cc',
'cql3/result_set.cc',
'cql3/prepare_context.cc',
'db/batchlog_manager.cc',
'db/corrupt_data_handler.cc',
'db/commitlog/commitlog.cc',
'db/commitlog/commitlog_entry.cc',
'db/commitlog/commitlog_replayer.cc',
@@ -1029,14 +1035,18 @@ scylla_core = (['message/messaging_service.cc',
'utils/multiprecision_int.cc',
'utils/gz/crc_combine.cc',
'utils/gz/crc_combine_table.cc',
'utils/http.cc',
'utils/s3/aws_error.cc',
'utils/s3/client.cc',
'utils/s3/retryable_http_client.cc',
'utils/s3/retry_strategy.cc',
'utils/s3/s3_retry_strategy.cc',
'utils/s3/credentials_providers/aws_credentials_provider.cc',
'utils/s3/credentials_providers/environment_aws_credentials_provider.cc',
'utils/s3/credentials_providers/instance_profile_credentials_provider.cc',
'utils/s3/credentials_providers/sts_assume_role_credentials_provider.cc',
'utils/s3/credentials_providers/aws_credentials_provider_chain.cc',
'utils/s3/utils/manip_s3.cc',
'utils/advanced_rpc_compressor.cc',
'gms/version_generator.cc',
'gms/versioned_value.cc',
@@ -1106,7 +1116,7 @@ scylla_core = (['message/messaging_service.cc',
'utils/lister.cc',
'repair/repair.cc',
'repair/row_level.cc',
'repair/table_check.cc',
'streaming/table_check.cc',
'exceptions/exceptions.cc',
'auth/allow_all_authenticator.cc',
'auth/allow_all_authorizer.cc',
@@ -1160,6 +1170,7 @@ scylla_core = (['message/messaging_service.cc',
'ent/encryption/kms_key_provider.cc',
'ent/encryption/gcp_host.cc',
'ent/encryption/gcp_key_provider.cc',
'ent/encryption/utils.cc',
'ent/ldap/ldap_connection.cc',
'multishard_mutation_query.cc',
'reader_concurrency_semaphore.cc',
@@ -1182,6 +1193,7 @@ scylla_core = (['message/messaging_service.cc',
'service/raft/group0_state_id_handler.cc',
'service/raft/group0_state_machine.cc',
'service/raft/group0_state_machine_merger.cc',
'service/raft/group0_voter_handler.cc',
'service/raft/raft_sys_table_storage.cc',
'serializer.cc',
'release.cc',
@@ -1189,7 +1201,7 @@ scylla_core = (['message/messaging_service.cc',
'service/raft/raft_group_registry.cc',
'service/raft/discovery.cc',
'service/raft/raft_group0.cc',
'direct_failure_detector/failure_detector.cc',
'service/direct_failure_detector/failure_detector.cc',
'service/raft/raft_group0_client.cc',
'service/broadcast_tables/experimental/lang.cc',
'tasks/task_handler.cc',
@@ -1328,6 +1340,7 @@ idls = ['idl/gossip_digest.idl.hh',
'idl/replica_exception.idl.hh',
'idl/per_partition_rate_limit_info.idl.hh',
'idl/position_in_partition.idl.hh',
'idl/full_position.idl.hh',
'idl/experimental/broadcast_tables_lang.idl.hh',
'idl/storage_service.idl.hh',
'idl/join_node.idl.hh',
@@ -1379,7 +1392,6 @@ scylla_perfs = ['test/perf/perf_alternator.cc',
'test/perf/perf_tablets.cc',
'test/perf/tablet_load_balancing.cc',
'test/perf/perf.cc',
'test/lib/alternator_test_env.cc',
'test/lib/cql_test_env.cc',
'test/lib/log.cc',
'test/lib/test_services.cc',
@@ -1470,13 +1482,16 @@ for t in sorted(scylla_tests):
else:
deps[t] += scylla_core + alternator + idls + scylla_tests_generic_dependencies
for t in sorted(perf_tests | perf_standalone_tests):
deps[t] = [t + '.cc'] + scylla_tests_dependencies
deps[t] += ['test/perf/perf.cc', 'seastar/tests/perf/linux_perf_event.cc']
perf_tests_seastar_deps = [
'seastar/tests/perf/perf_tests.cc'
]
for t in sorted(perf_tests):
deps[t] = [t + '.cc'] + scylla_tests_dependencies + perf_tests_seastar_deps
deps[t] += ['test/perf/perf.cc', 'seastar/tests/perf/linux_perf_event.cc']
deps[t] += perf_tests_seastar_deps
deps['test/boost/combined_tests'] += [
'test/boost/aggregate_fcts_test.cc',
@@ -1501,6 +1516,7 @@ deps['test/boost/combined_tests'] += [
'test/boost/filtering_test.cc',
'test/boost/group0_cmd_merge_test.cc',
'test/boost/group0_test.cc',
'test/boost/group0_voter_calculator_test.cc',
'test/boost/index_with_paging_test.cc',
'test/boost/json_cql_query_test.cc',
'test/boost/large_paging_state_test.cc',
@@ -1512,10 +1528,12 @@ deps['test/boost/combined_tests'] += [
'test/boost/mutation_writer_test.cc',
'test/boost/network_topology_strategy_test.cc',
'test/boost/per_partition_rate_limit_test.cc',
'test/boost/pluggable_test.cc',
'test/boost/querier_cache_test.cc',
'test/boost/query_processor_test.cc',
'test/boost/reader_concurrency_semaphore_test.cc',
'test/boost/repair_test.cc',
'test/boost/replicator_test.cc',
'test/boost/restrictions_test.cc',
'test/boost/role_manager_test.cc',
'test/boost/row_cache_test.cc',
@@ -1524,6 +1542,8 @@ deps['test/boost/combined_tests'] += [
'test/boost/secondary_index_test.cc',
'test/boost/sessions_test.cc',
'test/boost/sstable_compaction_test.cc',
'test/boost/sstable_compressor_factory_test.cc',
'test/boost/sstable_compression_config_test.cc',
'test/boost/sstable_directory_test.cc',
'test/boost/sstable_set_test.cc',
'test/boost/statement_restrictions_test.cc',
@@ -1586,8 +1606,8 @@ deps['test/boost/rust_test'] += ['rust/inc/src/lib.rs']
deps['test/raft/replication_test'] = ['test/raft/replication_test.cc', 'test/raft/replication.cc', 'test/raft/helpers.cc', 'test/lib/eventually.cc'] + scylla_raft_dependencies
deps['test/raft/raft_server_test'] = ['test/raft/raft_server_test.cc', 'test/raft/replication.cc', 'test/raft/helpers.cc', 'test/lib/eventually.cc'] + scylla_raft_dependencies
deps['test/raft/randomized_nemesis_test'] = ['test/raft/randomized_nemesis_test.cc', 'direct_failure_detector/failure_detector.cc', 'test/raft/helpers.cc'] + scylla_raft_dependencies
deps['test/raft/failure_detector_test'] = ['test/raft/failure_detector_test.cc', 'direct_failure_detector/failure_detector.cc', 'test/raft/helpers.cc'] + scylla_raft_dependencies
deps['test/raft/randomized_nemesis_test'] = ['test/raft/randomized_nemesis_test.cc', 'service/direct_failure_detector/failure_detector.cc', 'test/raft/helpers.cc'] + scylla_raft_dependencies
deps['test/raft/failure_detector_test'] = ['test/raft/failure_detector_test.cc', 'service/direct_failure_detector/failure_detector.cc', 'test/raft/helpers.cc'] + scylla_raft_dependencies
deps['test/raft/many_test'] = ['test/raft/many_test.cc', 'test/raft/replication.cc', 'test/raft/helpers.cc', 'test/lib/eventually.cc'] + scylla_raft_dependencies
deps['test/raft/fsm_test'] = ['test/raft/fsm_test.cc', 'test/raft/helpers.cc', 'test/lib/log.cc'] + scylla_raft_dependencies
deps['test/raft/etcd_test'] = ['test/raft/etcd_test.cc', 'test/raft/helpers.cc', 'test/lib/log.cc'] + scylla_raft_dependencies
@@ -1727,19 +1747,11 @@ def generate_version(date_stamp):
# the program headers.
def dynamic_linker_option():
gcc_linker_output = subprocess.check_output(['gcc', '-###', '/dev/null', '-o', 't'], stderr=subprocess.STDOUT).decode('utf-8')
original_dynamic_linker = re.search('-dynamic-linker ([^ ]*)', gcc_linker_output).groups()[0]
original_dynamic_linker = re.search('"?-dynamic-linker"?[ =]"?([^ "]*)"?[ \n]', gcc_linker_output).groups()[0]
employ_ld_trickery = True
# distro-specific setup
if os.environ.get('NIX_CC'):
employ_ld_trickery = False
if employ_ld_trickery:
# gdb has a SO_NAME_MAX_PATH_SIZE of 512, so limit the path size to
# that. The 512 includes the null at the end, hence the 511 below.
dynamic_linker = '/' * (511 - len(original_dynamic_linker)) + original_dynamic_linker
else:
dynamic_linker = original_dynamic_linker
# gdb has a SO_NAME_MAX_PATH_SIZE of 512, so limit the path size to
# that. The 512 includes the null at the end, hence the 511 below.
dynamic_linker = '/' * (511 - len(original_dynamic_linker)) + original_dynamic_linker
return f'--dynamic-linker={dynamic_linker}'
forced_ldflags = '-Wl,'
@@ -1875,9 +1887,8 @@ def prepare_advanced_optimizations(*, modes, build_modes, args):
submode['profile_target'] = profile_target
submode['lib_cflags'] += f" -f{it}profile-generate={os.path.realpath(outdir)}/{submode_name} {conservative_opts}"
submode['cxx_ld_flags'] += f" -f{it}profile-generate={os.path.realpath(outdir)}/{submode_name} {conservative_opts}"
# Profile collection depends on java tools because we use cassandra-stress as the load.
submode['profile_recipe'] = textwrap.dedent(f"""\
build $builddir/{submode_name}/profiles/prof.profdata: train $builddir/{submode_name}/scylla | dist-tools-tar
build $builddir/{submode_name}/profiles/prof.profdata: train $builddir/{submode_name}/scylla
build $builddir/{submode_name}/profiles/merged.profdata: merge_profdata $builddir/{submode_name}/profiles/prof.profdata {profile_target or str()}
""")
submode['is_profile'] = True
@@ -1961,8 +1972,6 @@ def configure_seastar(build_dir, mode, mode_config):
seastar_cmake_args += ['-DSeastar_STACK_GUARDS={}'.format(stack_guards)]
dpdk = args.dpdk
if dpdk is None:
dpdk = platform.machine() == 'x86_64' and mode == 'release'
if dpdk:
seastar_cmake_args += ['-DSeastar_DPDK=ON', '-DSeastar_DPDK_MACHINE=westmere']
if args.split_dwarf:
@@ -2627,11 +2636,10 @@ def write_build_file(f,
f.write(f' mode = {mode}\n')
f.write(f'build dist-server-{mode}: phony $builddir/dist/{mode}/redhat $builddir/dist/{mode}/debian\n')
f.write(f'build dist-server-debuginfo-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-debuginfo-{scylla_version}-{scylla_release}.{arch}.tar.gz\n')
f.write(f'build dist-tools-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz dist-tools-rpm dist-tools-deb\n')
f.write(f'build dist-cqlsh-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-cqlsh-{scylla_version}-{scylla_release}.{arch}.tar.gz dist-cqlsh-rpm dist-cqlsh-deb\n')
f.write(f'build dist-python3-{mode}: phony dist-python3-tar dist-python3-rpm dist-python3-deb\n')
f.write(f'build dist-unified-{mode}: phony $builddir/{mode}/dist/tar/{scylla_product}-unified-{scylla_version}-{scylla_release}.{arch}.tar.gz\n')
f.write(f'build $builddir/{mode}/dist/tar/{scylla_product}-unified-{scylla_version}-{scylla_release}.{arch}.tar.gz: unified $builddir/{mode}/dist/tar/{scylla_product}-{scylla_version}-{scylla_release}.{arch}.tar.gz $builddir/{mode}/dist/tar/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz $builddir/{mode}/dist/tar/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz $builddir/{mode}/dist/tar/{scylla_product}-cqlsh-{scylla_version}-{scylla_release}.{arch}.tar.gz | always\n')
f.write(f'build $builddir/{mode}/dist/tar/{scylla_product}-unified-{scylla_version}-{scylla_release}.{arch}.tar.gz: unified $builddir/{mode}/dist/tar/{scylla_product}-{scylla_version}-{scylla_release}.{arch}.tar.gz $builddir/{mode}/dist/tar/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz $builddir/{mode}/dist/tar/{scylla_product}-cqlsh-{scylla_version}-{scylla_release}.{arch}.tar.gz | always\n')
f.write(f' mode = {mode}\n')
f.write(f'build $builddir/{mode}/dist/tar/{scylla_product}-unified-package-{scylla_version}-{scylla_release}.tar.gz: copy $builddir/{mode}/dist/tar/{scylla_product}-unified-{scylla_version}-{scylla_release}.{arch}.tar.gz\n')
f.write(f'build $builddir/{mode}/dist/tar/{scylla_product}-unified-{arch}-package-{scylla_version}-{scylla_release}.tar.gz: copy $builddir/{mode}/dist/tar/{scylla_product}-unified-{scylla_version}-{scylla_release}.{arch}.tar.gz\n')
@@ -2672,17 +2680,6 @@ def write_build_file(f,
rule build-submodule-deb
command = cd $dir && ./reloc/build_deb.sh --reloc-pkg $artifact
build tools/java/build/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz: build-submodule-reloc | $builddir/SCYLLA-PRODUCT-FILE $builddir/SCYLLA-VERSION-FILE $builddir/SCYLLA-RELEASE-FILE
reloc_dir = tools/java
build dist-tools-rpm: build-submodule-rpm tools/java/build/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz
dir = tools/java
artifact = build/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz
build dist-tools-deb: build-submodule-deb tools/java/build/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz
dir = tools/java
artifact = build/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz
build dist-tools-tar: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz'.format(mode=mode, scylla_product=scylla_product, scylla_version=scylla_version, scylla_release=scylla_release) for mode in default_modes])}
build dist-tools: phony dist-tools-tar dist-tools-rpm dist-tools-deb
build tools/cqlsh/build/{scylla_product}-cqlsh-{scylla_version}-{scylla_release}.{arch}.tar.gz: build-submodule-reloc | $builddir/SCYLLA-PRODUCT-FILE $builddir/SCYLLA-VERSION-FILE $builddir/SCYLLA-RELEASE-FILE
reloc_dir = tools/cqlsh
build dist-cqlsh-rpm: build-submodule-rpm tools/cqlsh/build/{scylla_product}-cqlsh-{scylla_version}-{scylla_release}.{arch}.tar.gz
@@ -2705,11 +2702,11 @@ def write_build_file(f,
artifact = build/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz
build dist-python3-tar: phony {' '.join(['$builddir/{mode}/dist/tar/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz'.format(mode=mode, scylla_product=scylla_product, arch=arch, scylla_version=scylla_version, scylla_release=scylla_release) for mode in default_modes])}
build dist-python3: phony dist-python3-tar dist-python3-rpm dist-python3-deb
build dist-deb: phony dist-server-deb dist-python3-deb dist-tools-deb dist-cqlsh-deb
build dist-rpm: phony dist-server-rpm dist-python3-rpm dist-tools-rpm dist-cqlsh-rpm
build dist-tar: phony dist-unified-tar dist-server-tar dist-python3-tar dist-tools-tar dist-cqlsh-tar
build dist-deb: phony dist-server-deb dist-python3-deb dist-cqlsh-deb
build dist-rpm: phony dist-server-rpm dist-python3-rpm dist-cqlsh-rpm
build dist-tar: phony dist-unified-tar dist-server-tar dist-python3-tar dist-cqlsh-tar
build dist: phony dist-unified dist-server dist-python3 dist-tools dist-cqlsh
build dist: phony dist-unified dist-server dist-python3 dist-cqlsh
'''))
f.write(textwrap.dedent(f'''\
@@ -2722,12 +2719,10 @@ def write_build_file(f,
build $builddir/{mode}/dist/tar/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz: copy tools/python3/build/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz
build $builddir/{mode}/dist/tar/{scylla_product}-python3-package.tar.gz: copy tools/python3/build/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz
build $builddir/{mode}/dist/tar/{scylla_product}-python3-{arch}-package.tar.gz: copy tools/python3/build/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz
build $builddir/{mode}/dist/tar/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz: copy tools/java/build/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz
build $builddir/{mode}/dist/tar/{scylla_product}-tools-package.tar.gz: copy tools/java/build/{scylla_product}-tools-{scylla_version}-{scylla_release}.noarch.tar.gz
build $builddir/{mode}/dist/tar/{scylla_product}-cqlsh-{scylla_version}-{scylla_release}.{arch}.tar.gz: copy tools/cqlsh/build/{scylla_product}-cqlsh-{scylla_version}-{scylla_release}.{arch}.tar.gz
build $builddir/{mode}/dist/tar/{scylla_product}-cqlsh-package.tar.gz: copy tools/cqlsh/build/{scylla_product}-cqlsh-{scylla_version}-{scylla_release}.{arch}.tar.gz
build {mode}-dist: phony dist-server-{mode} dist-server-debuginfo-{mode} dist-python3-{mode} dist-tools-{mode} dist-unified-{mode} dist-cqlsh-{mode}
build {mode}-dist: phony dist-server-{mode} dist-server-debuginfo-{mode} dist-python3-{mode} dist-unified-{mode} dist-cqlsh-{mode}
build dist-{mode}: phony {mode}-dist
build dist-check-{mode}: dist-check
mode = {mode}

View File

@@ -709,17 +709,23 @@ batchStatement returns [std::unique_ptr<cql3::statements::raw::batch_statement>
: K_BEGIN
( K_UNLOGGED { type = btype::UNLOGGED; } | K_COUNTER { type = btype::COUNTER; } )?
K_BATCH ( usingClause[attrs] )?
( s=batchStatementObjective ';'? { statements.push_back(std::move(s)); } )*
( s=batchStatementObjective ';'?
{
auto&& stmt = *$s.statement;
stmt->add_raw(sstring{$s.text});
statements.push_back(std::move(stmt));
} )*
K_APPLY K_BATCH
{
$expr = std::make_unique<cql3::statements::raw::batch_statement>(type, std::move(attrs), std::move(statements));
}
;
batchStatementObjective returns [std::unique_ptr<cql3::statements::raw::modification_statement> statement]
: i=insertStatement { $statement = std::move(i); }
| u=updateStatement { $statement = std::move(u); }
| d=deleteStatement { $statement = std::move(d); }
batchStatementObjective returns [::lw_shared_ptr<std::unique_ptr<cql3::statements::raw::modification_statement>> statement]
@init { using original_ret_type = std::unique_ptr<cql3::statements::raw::modification_statement>; }
: i=insertStatement { $statement = make_lw_shared<original_ret_type>(std::move(i)); }
| u=updateStatement { $statement = make_lw_shared<original_ret_type>(std::move(u)); }
| d=deleteStatement { $statement = make_lw_shared<original_ret_type>(std::move(d)); }
;
dropAggregateStatement returns [std::unique_ptr<cql3::statements::drop_aggregate_statement> expr]

View File

@@ -82,7 +82,7 @@ public:
const sstring& text() const;
virtual sstring to_string() const;
sstring to_string() const;
sstring to_cql_string() const;
friend std::hash<column_identifier_raw>;

View File

@@ -48,14 +48,16 @@ const std::chrono::minutes prepared_statements_cache::entry_expiry = std::chrono
struct query_processor::remote {
remote(service::migration_manager& mm, service::mapreduce_service& fwd,
service::storage_service& ss, service::raft_group0_client& group0_client)
: mm(mm), mapreducer(fwd), ss(ss), group0_client(group0_client) {}
: mm(mm), mapreducer(fwd), ss(ss), group0_client(group0_client)
, gate("query_processor::remote")
{}
service::migration_manager& mm;
service::mapreduce_service& mapreducer;
service::storage_service& ss;
service::raft_group0_client& group0_client;
seastar::gate gate;
seastar::named_gate gate;
};
bool query_processor::topology_global_queue_empty() {

View File

@@ -13,6 +13,7 @@
#include <seastar/core/on_internal_error.hh>
#include <stdexcept>
#include "alter_keyspace_statement.hh"
#include "locator/tablets.hh"
#include "prepared_statement.hh"
#include "service/migration_manager.hh"
#include "service/storage_proxy.hh"
@@ -25,6 +26,7 @@
#include "create_keyspace_statement.hh"
#include "gms/feature_service.hh"
#include "replica/database.hh"
#include "db/config.hh"
using namespace std::string_literals;
@@ -194,9 +196,9 @@ cql3::statements::alter_keyspace_statement::prepare_schema_mutations(query_proce
event::schema_change::target_type target_type = event::schema_change::target_type::KEYSPACE;
auto ks = qp.db().find_keyspace(_name);
auto ks_md = ks.metadata();
const auto& tm = *qp.proxy().get_token_metadata_ptr();
const auto tmptr = qp.proxy().get_token_metadata_ptr();
const auto& feat = qp.proxy().features();
auto ks_md_update = _attrs->as_ks_metadata_update(ks_md, tm, feat);
auto ks_md_update = _attrs->as_ks_metadata_update(ks_md, *tmptr, feat);
std::vector<mutation> muts;
std::vector<sstring> warnings;
auto old_ks_options = get_old_options_flattened(ks);
@@ -265,6 +267,47 @@ cql3::statements::alter_keyspace_statement::prepare_schema_mutations(query_proce
muts.insert(muts.begin(), schema_mutations.begin(), schema_mutations.end());
}
auto rs = locator::abstract_replication_strategy::create_replication_strategy(
ks_md_update->strategy_name(),
locator::replication_strategy_params(ks_md_update->strategy_options(), ks_md_update->initial_tablets()));
// If `rf_rack_valid_keyspaces` is enabled, it's forbidden to perform a schema change that
// would lead to an RF-rack-valid keyspace. Verify that this change does not.
// For more context, see: scylladb/scylladb#23071.
try {
// There are two things to note here:
// 1. We hold a group0_guard, so it's correct to check this here.
// The topology or schema cannot change while we're performing this query.
// 2. The replication strategy we use here does NOT represent the actual state
// we will arrive at after applying the schema change. For instance, if the user
// did not specify the RF for some of the DCs, it's equal to 0 in the replication
// strategy we pass to this function, while in reality that means that the RF
// will NOT change. That is not a problem:
// - RF=0 is valid for all DCs, so it won't trigger an exception on its own,
// - the keyspace must've been RF-rack-valid before this change. We check that
// condition for all keyspaces at startup.
// The second hyphen is not really true because currently topological changes can
// disturb it (see scylladb/scylladb#23345), but we ignore that.
locator::assert_rf_rack_valid_keyspace(_name, tmptr, *rs);
} catch (const std::exception& e) {
if (qp.db().get_config().rf_rack_valid_keyspaces()) {
// There's no guarantee what the type of the exception will be, so we need to
// wrap it manually here in a type that can be passed to the user.
throw exceptions::invalid_request_exception(e.what());
} else {
// Even when the configuration option `rf_rack_valid_keyspaces` is set to false,
// we'd like to inform the user that the keyspace they're altering will not
// satisfy the restriction after the change--but just as a warning.
// For more context, see issue: scylladb/scylladb#23330.
warnings.push_back(seastar::format(
"Keyspace '{}' is not RF-rack-valid: the replication factor doesn't match "
"the rack count in at least one datacenter. A rack failure may reduce availability. "
"For more context, see: "
"https://docs.scylladb.com/manual/stable/reference/glossary.html#term-RF-rack-valid-keyspace.",
_name));
}
}
auto ret = ::make_shared<event::schema_change>(
event::schema_change::change_type::UPDATED,
target_type,

View File

@@ -8,6 +8,7 @@
* SPDX-License-Identifier: (LicenseRef-ScyllaDB-Source-Available-1.0 and Apache-2.0)
*/
#include "cdc/log.hh"
#include "utils/assert.hh"
#include <seastar/core/coroutine.hh>
#include "cql3/query_options.hh"
@@ -27,6 +28,7 @@
#include "db/view/view.hh"
#include "cql3/query_processor.hh"
#include "cdc/cdc_extension.hh"
#include "cdc/cdc_partitioner.hh"
namespace cql3 {
@@ -284,12 +286,59 @@ void alter_table_statement::drop_column(const query_options& options, const sche
}
}
std::pair<schema_builder, std::vector<view_ptr>> alter_table_statement::prepare_schema_update(data_dictionary::database db, const query_options& options) const {
std::pair<schema_ptr, std::vector<view_ptr>> alter_table_statement::prepare_schema_update(data_dictionary::database db, const query_options& options) const {
auto s = validation::validate_column_family(db, keyspace(), column_family());
if (s->is_view()) {
throw exceptions::invalid_request_exception("Cannot use ALTER TABLE on Materialized View");
}
const bool is_cdc_log_table = cdc::is_log_for_some_table(db.real_database(), s->ks_name(), s->cf_name());
// Only a CDC log table will have this partitioner name. User tables should
// not be able to set this. Note that we perform a similar check when trying to
// re-enable CDC for a table, when the log table has been replaced by a user table.
// For better visualization of the above, consider this
//
// cqlsh> CREATE TABLE ks.t (p int PRIMARY KEY, v int) WITH cdc = {'enabled': true};
// cqlsh> INSERT INTO ks.t (p, v) VALUES (1, 2);
// cqlsh> ALTER TABLE ks.t WITH cdc = {'enabled': false};
// cqlsh> DESC TABLE ks.t_scylla_cdc_log WITH INTERNALS; # Save this output!
// cqlsh> DROP TABLE ks.t_scylla_cdc_log;
// cqlsh> [Recreate the log table using the received statement]
// cqlsh> ALTER TABLE ks.t WITH cdc = {'enabled': true};
//
// InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot create CDC log
// table for table ks.t because a table of name ks.t_scylla_cdc_log already exists"
//
// See commit adda43edc75b901b2329bca8f3eb74596698d05f for more information on THAT case.
// We reuse the same technique here.
const bool was_cdc_log_table = s->get_partitioner().name() == cdc::cdc_partitioner::classname;
if (_column_changes.size() != 0 && is_cdc_log_table) {
throw exceptions::invalid_request_exception(
"You cannot modify the set of columns of a CDC log table directly. "
"Modify the base table instead.");
}
if (_column_changes.size() != 0 && was_cdc_log_table) {
throw exceptions::invalid_request_exception(
"You cannot modify the set of columns of a CDC log table directly. "
"Although the base table has deactivated CDC, this table will continue being "
"a CDC log table until it is dropped. If you want to modify the columns in it, "
"you can only do that by reenabling CDC on the base table, which will reattach "
"this log table. Then you will be able to modify the columns in the base table, "
"and that will have effect on the log table too. Modifying the columns of a CDC "
"log table directly is never allowed.");
}
if (_renames.size() != 0 && is_cdc_log_table) {
throw exceptions::invalid_request_exception("Cannot rename a column of a CDC log table.");
}
if (_renames.size() != 0 && was_cdc_log_table) {
throw exceptions::invalid_request_exception(
"You cannot rename a column of a CDC log table. Although the base table "
"has deactivated CDC, this table will continue being a CDC log table until it "
"is dropped.");
}
auto cfm = schema_builder(s);
if (_properties->get_id()) {
@@ -377,41 +426,45 @@ std::pair<schema_builder, std::vector<view_ptr>> alter_table_statement::prepare_
validate_column_rename(db, *s, *from, *to);
cfm.rename_column(from->name(), to->name());
// If the view includes a renamed column, it must be renamed in
// the view table and the definition.
for (auto&& view : cf.views()) {
}
// New view schemas contain the new column names, so we need to base them on the
// new base schema.
schema_ptr new_base_schema = cfm.build();
// If the view includes a renamed column, it must be renamed in
// the view table and the definition.
for (auto&& view : cf.views()) {
schema_builder builder(view);
std::vector<std::pair<::shared_ptr<column_identifier>, ::shared_ptr<column_identifier>>> view_renames;
for (auto&& entry : _renames) {
auto from = entry.first->prepare_column_identifier(*s);
if (view->get_column_definition(from->name())) {
schema_builder builder(view);
auto view_from = entry.first->prepare_column_identifier(*view);
auto view_to = entry.second->prepare_column_identifier(*view);
validate_column_rename(db, *view, *view_from, *view_to);
builder.rename_column(view_from->name(), view_to->name());
auto new_where = util::rename_column_in_where_clause(
view->view_info()->where_clause(),
column_identifier::raw(view_from->text(), true),
column_identifier::raw(view_to->text(), true),
cql3::dialect{});
builder.with_view_info(view->view_info()->base_id(), view->view_info()->base_name(),
view->view_info()->include_all_columns(), std::move(new_where));
view_updates.push_back(view_ptr(builder.build()));
view_renames.emplace_back(view_from, view_to);
}
}
if (!view_renames.empty()) {
auto new_where = util::rename_columns_in_where_clause(
view->view_info()->where_clause(),
view_renames,
cql3::dialect{});
builder.with_view_info(new_base_schema, view->view_info()->include_all_columns(), std::move(new_where));
view_updates.push_back(view_ptr(builder.build()));
}
}
break;
return make_pair(std::move(new_base_schema), std::move(view_updates));
}
return make_pair(std::move(cfm), std::move(view_updates));
return make_pair(cfm.build(), std::move(view_updates));
}
future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, std::vector<mutation>, cql3::cql_warnings_vec>>
alter_table_statement::prepare_schema_mutations(query_processor& qp, const query_options& options, api::timestamp_type ts) const {
data_dictionary::database db = qp.db();
auto [cfm, view_updates] = prepare_schema_update(db, options);
auto m = co_await service::prepare_column_family_update_announcement(qp.proxy(), cfm.build(), std::move(view_updates), ts);
auto [s, view_updates] = prepare_schema_update(db, options);
auto m = co_await service::prepare_column_family_update_announcement(qp.proxy(), std::move(s), std::move(view_updates), ts);
using namespace cql_transport;
auto ret = ::make_shared<event::schema_change>(

View File

@@ -69,7 +69,7 @@ private:
void add_column(const query_options& options, const schema& schema, data_dictionary::table cf, schema_builder& cfm, std::vector<view_ptr>& view_updates, const column_identifier& column_name, const cql3_type validator, const column_definition* def, bool is_static) const;
void alter_column(const query_options& options, const schema& schema, data_dictionary::table cf, schema_builder& cfm, std::vector<view_ptr>& view_updates, const column_identifier& column_name, const cql3_type validator, const column_definition* def, bool is_static) const;
void drop_column(const query_options& options, const schema& schema, data_dictionary::table cf, schema_builder& cfm, std::vector<view_ptr>& view_updates, const column_identifier& column_name, const cql3_type validator, const column_definition* def, bool is_static) const;
std::pair<schema_builder, std::vector<view_ptr>> prepare_schema_update(data_dictionary::database db, const query_options& options) const;
std::pair<schema_ptr, std::vector<view_ptr>> prepare_schema_update(data_dictionary::database db, const query_options& options) const;
};
class alter_table_statement::raw_statement : public raw::cf_statement {

View File

@@ -23,6 +23,7 @@
#include "db/per_partition_rate_limit_options.hh"
#include "db/tablet_options.hh"
#include "utils/bloom_calculations.hh"
#include "db/config.hh"
#include <boost/algorithm/string/predicate.hpp>
@@ -135,7 +136,9 @@ void cf_prop_defs::validate(const data_dictionary::database db, sstring ks_name,
throw exceptions::configuration_exception(sstring("Missing sub-option '") + compression_parameters::SSTABLE_COMPRESSION + "' for the '" + KW_COMPRESSION + "' option.");
}
compression_parameters cp(*compression_options);
cp.validate();
cp.validate(
compression_parameters::dicts_feature_enabled(bool(db.features().sstable_compression_dicts)),
compression_parameters::dicts_usage_allowed(db.get_config().sstable_compression_dictionaries_allow_in_ddl()));
}
auto per_partition_rate_limit_options = get_per_partition_rate_limit_options(schema_extensions);

View File

@@ -53,7 +53,8 @@ const sstring& cf_statement::keyspace() const
const sstring& cf_statement::column_family() const
{
return _cf_name->get_column_family();
thread_local static sstring empty = "";
return bool(_cf_name) ? _cf_name->get_column_family() : empty;
}
}

View File

@@ -11,6 +11,8 @@
#include <seastar/core/coroutine.hh>
#include "cql3/statements/create_keyspace_statement.hh"
#include "cql3/statements/ks_prop_defs.hh"
#include "exceptions/exceptions.hh"
#include "locator/tablets.hh"
#include "prepared_statement.hh"
#include "data_dictionary/data_dictionary.hh"
#include "data_dictionary/keyspace_metadata.hh"
@@ -90,14 +92,14 @@ void create_keyspace_statement::validate(query_processor& qp, const service::cli
future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, std::vector<mutation>, cql3::cql_warnings_vec>> create_keyspace_statement::prepare_schema_mutations(query_processor& qp, const query_options&, api::timestamp_type ts) const {
using namespace cql_transport;
const auto& tm = *qp.proxy().get_token_metadata_ptr();
const auto tmptr = qp.proxy().get_token_metadata_ptr();
const auto& feat = qp.proxy().features();
const auto& cfg = qp.db().get_config();
std::vector<mutation> m;
std::vector<sstring> warnings;
try {
auto ksm = _attrs->as_ks_metadata(_name, tm, feat, cfg);
auto ksm = _attrs->as_ks_metadata(_name, *tmptr, feat, cfg);
m = service::prepare_new_keyspace_announcement(qp.db().real_database(), ksm, ts);
// If the new keyspace uses tablets, as long as there are features
// which aren't supported by tablets we want to warn the user that
@@ -111,14 +113,39 @@ future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, std::vector
if (rs->uses_tablets()) {
warnings.push_back(
"Tables in this keyspace will be replicated using Tablets "
"and will not support CDC, LWT and counters features. "
"To use CDC, LWT or counters, drop this keyspace and re-create it "
"without tablets by adding AND TABLETS = {'enabled': false} "
"to the CREATE KEYSPACE statement.");
"and will not support Materialized Views, Secondary Indexes, CDC, LWT and counters features. "
"To use Materialized Views, Secondary Indexes, CDC, LWT or counters, drop this keyspace and re-create it "
"without tablets by adding AND TABLETS = {'enabled': false} to the CREATE KEYSPACE statement.");
if (ksm->initial_tablets().value()) {
warnings.push_back("Keyspace `initial` tablets option is deprecated. Use per-table tablet options instead.");
}
}
// If `rf_rack_valid_keyspaces` is enabled, it's forbidden to create an RF-rack-invalid keyspace.
// Verify that it's RF-rack-valid.
// For more context, see: scylladb/scylladb#23071.
try {
// We hold a group0_guard, so it's correct to check this here.
// The topology or schema cannot change while we're performing this query.
locator::assert_rf_rack_valid_keyspace(_name, tmptr, *rs);
} catch (const std::exception& e) {
if (cfg.rf_rack_valid_keyspaces()) {
// There's no guarantee what the type of the exception will be, so we need to
// wrap it manually here in a type that can be passed to the user.
throw exceptions::invalid_request_exception(e.what());
} else {
// Even when the configuration option `rf_rack_valid_keyspaces` is set to false,
// we'd like to inform the user that the keyspace they're creating does not
// satisfy the restriction--but just as a warning.
// For more context, see issue: scylladb/scylladb#23330.
warnings.push_back(seastar::format(
"Keyspace '{}' is not RF-rack-valid: the replication factor doesn't match "
"the rack count in at least one datacenter. A rack failure may reduce availability. "
"For more context, see: "
"https://docs.scylladb.com/manual/stable/reference/glossary.html#term-RF-rack-valid-keyspace.",
_name));
}
}
} catch (const exceptions::already_exists_exception& e) {
if (!_if_not_exists) {
co_return coroutine::exception(std::current_exception());
@@ -220,9 +247,6 @@ std::vector<sstring> check_against_restricted_replication_strategies(
// We ignore errors (non-number, negative number, etc.) here,
// these are checked and reported elsewhere.
for (auto opt : attrs.get_replication_options()) {
if (opt.first == sstring("initial_tablets")) {
continue;
}
try {
auto rf = std::stol(opt.second);
if (rf > 0) {

View File

@@ -31,6 +31,8 @@
#include "db/config.hh"
#include "compaction/time_window_compaction_strategy.hh"
bool is_internal_keyspace(std::string_view name);
namespace cql3 {
namespace statements {
@@ -122,6 +124,10 @@ void create_table_statement::apply_properties_to(schema_builder& builder, const
addColumnMetadataFromAliases(cfmd, Collections.singletonList(valueAlias), defaultValidator, ColumnDefinition.Kind.COMPACT_VALUE);
#endif
if (!_properties->get_compression_options() && !is_internal_keyspace(keyspace())) {
builder.set_compressor_params(db.get_config().sstable_compression_user_table_options());
}
_properties->apply_to_builder(builder, _properties->make_schema_extensions(db.extensions()), db, keyspace());
}

View File

@@ -378,7 +378,7 @@ std::pair<view_ptr, cql3::cql_warnings_vec> create_view_statement::prepare_view(
}
auto where_clause_text = util::relations_to_where_clause(_where_clause);
builder.with_view_info(schema->id(), schema->cf_name(), included.empty(), std::move(where_clause_text));
builder.with_view_info(schema, included.empty(), std::move(where_clause_text));
return std::make_pair(view_ptr(builder.build()), std::move(warnings));
}

View File

@@ -150,7 +150,7 @@ data_dictionary::storage_options ks_prop_defs::get_storage_options() const {
return opts;
}
std::optional<unsigned> ks_prop_defs::get_initial_tablets(std::optional<unsigned> default_value) const {
std::optional<unsigned> ks_prop_defs::get_initial_tablets(std::optional<unsigned> default_value, bool enforce_tablets) const {
auto tablets_options = get_map(KW_TABLETS);
if (!tablets_options) {
return default_value;
@@ -165,6 +165,9 @@ std::optional<unsigned> ks_prop_defs::get_initial_tablets(std::optional<unsigned
if (enabled == "true") {
// nothing
} else if (enabled == "false") {
if (enforce_tablets) {
throw exceptions::configuration_exception("Cannot disable tablets for keyspace since tablets are enforced using the `tablets_mode_for_new_keyspaces: enforced` config option.");
}
return std::nullopt;
} else {
throw exceptions::configuration_exception(sstring("Tablets enabled value must be true or false; found: ") + enabled);
@@ -199,8 +202,10 @@ bool ks_prop_defs::get_durable_writes() const {
lw_shared_ptr<data_dictionary::keyspace_metadata> ks_prop_defs::as_ks_metadata(sstring ks_name, const locator::token_metadata& tm, const gms::feature_service& feat, const db::config& cfg) {
auto sc = get_replication_strategy_class().value();
// if tablets options have not been specified, but tablets are globally enabled, set the value to 0 for N.T.S. only
auto enable_tablets = feat.tablets && cfg.enable_tablets();
auto initial_tablets = get_initial_tablets(enable_tablets && locator::abstract_replication_strategy::to_qualified_class_name(sc) == "org.apache.cassandra.locator.NetworkTopologyStrategy" ? std::optional<unsigned>(0) : std::nullopt);
auto enable_tablets = feat.tablets && cfg.enable_tablets_by_default();
std::optional<unsigned> default_initial_tablets = enable_tablets && locator::abstract_replication_strategy::to_qualified_class_name(sc) == "org.apache.cassandra.locator.NetworkTopologyStrategy"
? std::optional<unsigned>(0) : std::nullopt;
auto initial_tablets = get_initial_tablets(default_initial_tablets, cfg.enforce_tablets());
auto options = prepare_options(sc, tm, get_replication_options());
return data_dictionary::keyspace_metadata::new_keyspace(ks_name, sc,
std::move(options), initial_tablets, get_boolean(KW_DURABLE_WRITES, true), get_storage_options());

View File

@@ -60,7 +60,7 @@ public:
void validate();
std::map<sstring, sstring> get_replication_options() const;
std::optional<sstring> get_replication_strategy_class() const;
std::optional<unsigned> get_initial_tablets(std::optional<unsigned> default_value) const;
std::optional<unsigned> get_initial_tablets(std::optional<unsigned> default_value, bool enforce_tablets = false) const;
data_dictionary::storage_options get_storage_options() const;
bool get_durable_writes() const;
lw_shared_ptr<data_dictionary::keyspace_metadata> as_ks_metadata(sstring ks_name, const locator::token_metadata&, const gms::feature_service&, const db::config&);

Some files were not shown because too many files have changed in this diff Show More