Commit Graph

48669 Commits

Author SHA1 Message Date
Yaron Kaikov
f5a2fcab72 Add JIRA issue validation to backport PR fixes check
Extend the Fixes validation pattern to also accept JIRA issue references
(format: [A-Z]+-\d+) in addition to GitHub issue references. This allows
backport PRs to reference JIRA issues in the format 'Fixes: PROJECT-123'.
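
A hypothetical sketch of the combined check; the real validation lives in the repository's CI tooling, and the helper name and the GitHub-side pattern below are illustrative:

```c++
#include <regex>
#include <string>

// Hypothetical helper, not the actual CI code: accept a Fixes:
// reference if it is a GitHub issue reference or a JIRA issue key.
bool is_valid_fixes_reference(const std::string& ref) {
    static const std::regex github_re(R"(#\d+|https://github\.com/[^\s]+/issues/\d+)");
    static const std::regex jira_re(R"([A-Z]+-\d+)");   // e.g. 'PROJECT-123'
    return std::regex_match(ref, github_re) || std::regex_match(ref, jira_re);
}
```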

Fixes: https://github.com/scylladb/scylladb/issues/27571

Closes scylladb/scylladb#27572

(cherry picked from commit 3dfa5ebd7f)

Closes scylladb/scylladb#27599
2025-12-12 09:35:49 +02:00
Jenkins Promoter
2fe49ce031 Update ScyllaDB version to: 2025.3.6 2025-12-10 10:46:07 +02:00
Anna Stuchlik
f9d19cab8a replace the Driver pages with a link to the new Drivers pages
This commit removes the now-redundant driver pages from
the ScyllaDB documentation and adds links to the new pages
where the driver information now lives.
The links are also updated across the ScyllaDB manual.

Redirections are added for all the removed pages.

Fixes https://github.com/scylladb/scylladb/issues/26871

Closes scylladb/scylladb#27277

(cherry picked from commit c5580399a8)

Closes scylladb/scylladb#27440
2025-12-10 09:26:11 +01:00
Tomasz Grabiec
750e9da1e8 Merge '[Backport 2025.3] tablets: scheduler: Balance racks separately when rf_rack_valid_keyspaces is true' from Scylladb[bot]
Greatly improves the performance of plan making, because we no longer
consider candidates in other racks, most of which would fail to be
selected due to replication constraints (no rack overload). It also
slightly reduces the overhead of candidate evaluation, since we don't
have to evaluate rack load.

Enabled only for rf_rack_valid_keyspaces, because such setups guarantee
that we will not need to (and indeed must not) move tablets across racks,
so we don't need to execute the general algorithm for the whole DC.
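
A minimal sketch of the idea, with illustrative types and helpers (not Scylla's actual API):

```c++
#include <cstddef>
#include <vector>

// Illustrative types, not Scylla's real ones.
struct migration_plan {
    size_t size = 0;
    void merge(const migration_plan& other) { size += other.size; }
};
struct rack_state {};
struct dc_state {
    bool rf_rack_valid_keyspaces = false;
    std::vector<rack_state> racks;
};

migration_plan make_rack_plan(const rack_state&) { return {}; }   // balance one rack in isolation
migration_plan make_general_plan(const dc_state&) { return {}; }  // pre-patch DC-wide algorithm

// When every keyspace is rf-rack-valid, tablets never move across racks,
// so each rack is balanced independently and the per-rack plans are
// concatenated; candidate evaluation never has to leave a rack.
migration_plan make_dc_plan(const dc_state& dc) {
    migration_plan plan;
    if (dc.rf_rack_valid_keyspaces) {
        for (const auto& rack : dc.racks) {
            plan.merge(make_rack_plan(rack));
        }
    } else {
        plan = make_general_plan(dc);
    }
    return plan;
}
```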

Tested with perf-load-balancing, which performs a single scale-out
operation on a cluster which initially has 10 nodes with 88 shards each,
2 racks, RF=2, 70 tables, 256 tablets per table. Scale-out adds 6 new
nodes (same shard count). Time to rebalance the cluster (plan making
only, sum of all iterations, no streaming):

Before:  16 min 25 s
After:    0 min 25 s

Before, plan making cost (single incremental iteration) alternated
between fast (0.1 [s]) and slow (14.1 [s]):

  testlog - Rebalance iteration 7 took 14.156 [s]: mig=88, bad=88, first_bad=17741, eval=93874484, skiplist=0, skip: (load=0, rack=17653, node=0)
  testlog - Rebalance iteration 8 took 0.143 [s]: mig=88, bad=88, first_bad=88, eval=865407, skiplist=0, skip: (load=0, rack=0, node=0)

The slow run chose min and max nodes in different racks, hence the
fast path failed to find any candidates and we switched to exhaustive
search of candidates in other nodes.

After, all iterations are fast (0.1 [s] per rack, 0.2 [s] per plan-making). The plan is twice as large because it combines the output of two subsequent (pre-patch) plan-making calls.

Fixes #26016

- (cherry picked from commit c9f0a9d0eb)

- (cherry picked from commit 0dcaaa061e)

- (cherry picked from commit 2b03a69065)

Parent PR: #26017

Closes scylladb/scylladb#26218

* github.com:scylladb/scylladb:
  test: perf: perf-load-balancing: Add parallel-scaleout scenario
  test: perf: perf-load-balancing: Convert to tool_app_template
  tablets: scheduler: Balance racks separately when rf_rack_valid_keyspaces is true
  load_balancer: include dead nodes when calculating rack load
2025-12-09 23:54:20 +01:00
Tomasz Grabiec
ebc07c360f test: perf: perf-load-balancing: Add parallel-scaleout scenario
Simulates rebalancing during a single scale-out involving the
simultaneous addition of multiple nodes per rack.

Default parameters create a cluster with 2 racks, 70 tables, 256
tablets/table, 10 nodes, 88 shards/node.
Adds 6 nodes in parallel (3 per rack).

Current result on my laptop:

  testlog - Rebalance took 21.874 [s] after 82 iteration(s)

(cherry picked from commit 2b03a69065)
2025-12-09 14:04:19 +01:00
Tomasz Grabiec
b7db86611c test: perf: perf-load-balancing: Convert to tool_app_template
To support sub-commands for testing different scenarios.

The current scenario is given the name "rolling-add-dec".

(cherry picked from commit 0dcaaa061e)
2025-12-09 14:04:19 +01:00
Tomasz Grabiec
37824ec021 tablets: scheduler: Balance racks separately when rf_rack_valid_keyspaces is true
Greatly improves the performance of plan making, because we no longer
consider candidates in other racks, most of which would fail to be
selected due to replication constraints (no rack overload). It also
slightly reduces the overhead of candidate evaluation, since we don't
have to evaluate rack load.

Enabled only for rf_rack_valid_keyspaces, because such setups guarantee
that we will not need to (and indeed must not) move tablets across racks,
so we don't need to execute the general algorithm for the whole DC.

Tested with perf-load-balancing, which performs a single scale-out
operation on a cluster which initially has 10 nodes with 88 shards each,
2 racks, RF=2, 70 tables, 256 tablets per table. Scale-out adds 6 new
nodes (same shard count). Time to rebalance the cluster (plan making
only, sum of all iterations, no streaming):

Before: 16 min 25 s
After: 0 min 25 s

Before, plan making cost (single incremental iteration) alternated
between fast (0.1 [s]) and slow (14.1 [s]):

  Rebalance iteration 7 took 14.156 [s]: mig=88, bad=88, first_bad=17741, eval=93874484, skiplist=0, skip: (load=0, rack=17653, node=0)
  Rebalance iteration 8 took 0.143 [s]: mig=88, bad=88, first_bad=88, eval=865407, skiplist=0, skip: (load=0, rack=0, node=0)

The slow run chose min and max nodes in different racks, hence the
fast path failed to find any candidates and we switched to exhaustive
search of candidates in other nodes.

After, all iterations are fast (0.1 [s] per rack, 0.2 [s] per plan-making).
The plan is twice as large because it combines the output of two subsequent (pre-patch)
plan-making calls.

Fixes #26016

(cherry picked from commit c9f0a9d0eb)
2025-12-09 14:04:19 +01:00
Wojciech Mitros
ef250e58dd load_balancer: include dead nodes when calculating rack load
The load balancer aims to preserve a balance in rack loads when generating
tablet migrations. However, this balance might get broken when dead nodes
are present. Currently, these nodes aren't included in rack load calculations,
even if they own tablet replicas. As a result, the load balancer treats racks
with dead nodes as racks with a lower load, so it generates migrations to these
racks.

This is incorrect, because a dead node might come back alive, which would result
in having multiple tablet replicas on the same rack. It's also inefficient
even if we know that the node won't come back - when it's being replaced or
removed. In that case we know we are going to rebuild the lost tablet replicas,
so migrating tablets to this rack just doubles the work. Allowing such migrations
to happen would also require adjustments in the materialized view pairing code,
because we'd temporarily allow having multiple tablet replicas on the same rack.

So in this patch we include dead nodes when calculating rack loads in the load
balancer (see the sketch below). The dead nodes still aren't treated as potential
migration sources or destinations.

We also add a test which verifies that no migrations are performed, by doing a
node replace with an MV workload in parallel. Before the patch, we'd get pairing
errors, and after the patch, no pairing errors are detected.
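
A minimal sketch of the two rules, with illustrative types (not Scylla's real ones):

```c++
#include <cstddef>
#include <vector>

// Illustrative node state, not Scylla's real type.
struct node_state {
    bool is_alive = true;
    size_t tablet_count = 0;
};

size_t rack_load(const std::vector<node_state>& rack_nodes) {
    size_t load = 0;
    for (const auto& n : rack_nodes) {
        load += n.tablet_count;   // counted whether the node is alive or dead
    }
    return load;
}

bool is_migration_endpoint(const node_state& n) {
    return n.is_alive;            // dead nodes still can't send or receive tablets
}
```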

Fixes https://github.com/scylladb/scylladb/issues/24485

Closes scylladb/scylladb#26028
2025-12-09 14:04:19 +01:00
Tomasz Grabiec
4acf082686 Merge '[Backport 2025.3] address_map: Use more efficient and reliable replication method' from Scylladb[bot]
The primary issue with the old method is that each update is a separate
cross-shard call, and all later updates queue behind it. If one of the
shards has high latency for such calls, the queue may accumulate and
system will appear unresponsive for mapping changes on non-zero shards.

This happened in the field when one of the shards was overloaded with
sstables and compaction work, which caused frequent stalls which
delayed polling for ~100ms. A queue of 3k address updates
accumulated, because we update mapping on each change of gossip
states. This made bootstrap impossible because nodes couldn't
learn about the IP mapping for the bootstrapping node and streaming
failed.

To protect against that, use a more efficient method of replication
which requires a single cross-shard call to replicate all prior
updates.
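
A minimal sketch of the batching idea, with illustrative types (the real mechanism is the helper introduced in 'utils: Introduce helper for replicated data structures' below):

```c++
#include <seastar/core/future.hh>
#include <seastar/core/smp.hh>
#include <utility>
#include <vector>

// Illustrative update type, not the real address_map entry.
struct entry { /* host id -> IP mapping update */ };

void apply_locally(const entry&) { /* update this shard's replica */ }

// One cross-shard call replicates the whole accumulated backlog,
// instead of one submit_to() per individual update.
seastar::future<> replicate_pending(unsigned shard, std::vector<entry> batch) {
    return seastar::smp::submit_to(shard, [batch = std::move(batch)] {
        for (const auto& e : batch) {
            apply_locally(e);
        }
    });
}
```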

It is also more reliable: if replication fails transiently for some
reason, we don't give up and fail all later updates.

Fixes #26865

- (cherry picked from commit ed8d127457)

- (cherry picked from commit 4a85ea8eb2)

- (cherry picked from commit f83c4ffc68)

Parent PR: #26941

Closes scylladb/scylladb#27188

* github.com:scylladb/scylladb:
  address_map: Use barrier() to wait for replication
  address_map: Use more efficient and reliable replication method
  utils: Introduce helper for replicated data structures
  utils: add "fatal" version of utils::on_internal_error()
2025-12-05 13:23:19 +01:00
Avi Kivity
3f343d70e4 database: fix overflow when computing data distribution over shards
We store the per-shard chunk count in a uint64_t vector
global_offset, and then convert the counts to offsets with
a prefix sum:

```c++
        // [1, 2, 3, 0] --> [0, 1, 3, 6]
        std::exclusive_scan(global_offset.begin(), global_offset.end(), global_offset.begin(), 0, std::plus());
```

However, std::exclusive_scan takes the accumulator type from the
initial value, 0, which is an int, instead of from the range being
iterated, which is uint64_t.

As a result, the prefix sum is computed as a 32-bit integer value. If
it exceeds 0x8000'0000, it becomes negative. It is then extended to
64 bits and stored. The result is a huge 64-bit number. Later on
we try to find an sstable with this chunk and fail, crashing on
an assertion.

An example of the failure can be seen here: https://godbolt.org/z/6M8aEbo57

The fix is simple: the initial value is passed as uint64_t instead of int.
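
For reference, the corrected call (the exact spelling of the 64-bit initial value is an assumption):

```c++
// With a uint64_t initial value the accumulator is 64-bit, so the
// prefix sum can no longer wrap around at 0x8000'0000.
std::exclusive_scan(global_offset.begin(), global_offset.end(),
                    global_offset.begin(), uint64_t(0), std::plus());
```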

Fixes https://github.com/scylladb/scylladb/issues/27417

Closes scylladb/scylladb#27418

(cherry picked from commit 9696ee64d0)
scylla-2025.3.5-candidate-20251205050220 scylla-2025.3.5
2025-12-04 20:18:13 +02:00
Tomasz Grabiec
fed0f95626 address_map: Use barrier() to wait for replication
More efficient than 100 pings.

There was one ping in the test which was done "so this shard notices the
clock advance". It's not necessary, since observing a completed SMP
call implies that the local shard sees the clock advancement done within it.

(cherry picked from commit f83c4ffc68)
2025-12-04 14:50:31 +01:00
Tomasz Grabiec
eba97f80e0 address_map: Use more efficient and reliable replication method
The primary issue with the old method is that each update is a separate
cross-shard call, and all later updates queue behind it. If one of the
shards has high latency for such calls, the queue may accumulate and
system will appear unresponsive for mapping changes on non-zero shards.

This happened in the field when one of the shards was overloaded with
sstables and compaction work, which caused frequent stalls that
delayed polling for ~100ms. A queue of 3k address updates
accumulated. This made bootstrap impossible, since nodes couldn't
learn about the IP mapping for the bootstrapping node and streaming
failed.

To protect against that, use a more efficient method of replication
which requires a single cross-shard call to replicate all prior
updates.

It is also more reliable: if replication fails transiently for some
reason, we don't give up and fail all later updates.

Fixes #26865
Fixes #26835

(cherry picked from commit 4a85ea8eb2)
2025-12-04 14:50:31 +01:00
Tomasz Grabiec
777b54f072 utils: Introduce helper for replicated data structures
Key goals:
  - efficient (batching updates)
  - reliable (no lost updates)

Will be used in data structures maintained on one designated owning
shard and replicated to other shards.

(cherry picked from commit ed8d127457)
2025-12-04 14:50:31 +01:00
Nadav Har'El
6562e844f8 utils: add "fatal" version of utils::on_internal_error()
utils::on_internal_error() is a wrapper for Seastar's on_internal_error()
which does not require a logger parameter - because it always uses one
logger ("on_internal_error"). Not needing a unique logger is especially
important when using on_internal_error() in a header file, where we
can't define a logger.

Seastar also has another similar function, on_fatal_internal_error(),
for which we forgot to implement a "utils" version (without a logger
parameter). This patch fixes that oversight.
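
A sketch of the assumed shape of the new wrapper (the exact signature in the patch may differ):

```c++
#include <seastar/core/on_internal_error.hh>
#include <seastar/util/log.hh>
#include <string_view>

namespace utils {

// Assumed shape, mirroring the existing utils::on_internal_error():
// one shared logger, so the function is usable from header files
// that cannot define a logger of their own.
[[noreturn]] inline void on_fatal_internal_error(std::string_view msg) {
    static seastar::logger errlog("on_internal_error");
    seastar::on_fatal_internal_error(errlog, msg);
}

} // namespace utils
```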

In the next patch, we need to use on_fatal_internal_error() in a header
file, so the "utils" version will be useful. We will need the fatal
version because we will encounter an unexpected situation during server
destruction, and if we let the regular on_internal_error() just throw
an exception, we'll be left in an undefined state.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 33476c7b06)
2025-12-04 14:50:31 +01:00
Pavel Emelyanov
1d9e0c17a6 Update seastar submodule (SIGABRT on assertion)
* seastar 4431d974f...f61814a48 (1):
  > util: make SEASTAR_ASSERT() failure generate SIGABRT

Fixes #27127

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27403
2025-12-04 13:00:30 +03:00
Ernest Zaslavsky
aa8495d465 s3_client: handle additional transient network errors
Add handling for a broader set of transient network-related `std::errc` values in `aws_error::from_system_error`. Treat these conditions as retryable, since the client re-creates the socket for each request.
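
A sketch of the kind of mapping involved; the exact set of `std::errc` values handled by the patch is an assumption here:

```c++
#include <system_error>

// Illustrative predicate: transient network errors are retryable,
// since the client opens a fresh socket for every request anyway.
bool is_retryable_network_error(std::errc e) {
    switch (e) {
    case std::errc::connection_refused:
    case std::errc::connection_reset:
    case std::errc::connection_aborted:
    case std::errc::broken_pipe:
    case std::errc::timed_out:
    case std::errc::host_unreachable:
    case std::errc::network_unreachable:
        return true;
    default:
        return false;
    }
}
```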

Fixes: https://github.com/scylladb/scylladb/issues/27349

Closes scylladb/scylladb#27350

(cherry picked from commit 605f71d074)

Closes scylladb/scylladb#27390
scylla-2025.3.5-candidate-20251203022600
2025-12-03 12:25:15 +03:00
Calle Wilund
7eb51568e9 commitlog::read_log_file: Check for eof position on all data reads
Fixes #24346

When reading, we check, for each entry and each chunk, whether advancing
there will hit the EOF of the segment. However, IFF the last chunk being
read has its last entry _exactly_ matching the chunk size, and the chunk
ends _exactly_ at the segment size (preset, typically 32 MB), we did not
check the position, and instead complained about not being able to read.

This has literally _never_ happened in actual commitlog (that was replayed
at least), but has apparently happened more and more in hints replay.

The fix is simple: just check the file position against the size when
advancing said position, i.e. when reading (skipping already does).
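
A minimal sketch of the check, with assumed names:

```c++
#include <cstdint>

// Advancing the read position is now compared against the segment's
// size, so a chunk whose last entry ends exactly at the preset segment
// size is treated as a clean EOF rather than a failed read.
bool at_segment_eof(uint64_t next_read_pos, uint64_t segment_size) {
    return next_read_pos >= segment_size;   // checked on reads, not only on skips
}
```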

v2:

* Added unit test

Closes scylladb/scylladb#27236

(cherry picked from commit 59c87025d1)

Closes scylladb/scylladb#27343
2025-12-03 12:25:02 +03:00
Pavel Emelyanov
232cbe2f69 Merge '[Backport 2025.3] tablet: scheduler: Do not emit conflicting migration in merge colocation' from Scylladb[bot]
The tablet scheduler should not emit conflicting migrations for the same tablet. This was addressed initially in scylladb/scylladb#26038 but the check is missing in the merge colocation plan, so add it there as well.

Without this check, the merge colocation plan could generate a conflicting migration for a tablet that is already scheduled for migration, as the test demonstrates.

This can cause correctness problems, because if the load balancer generates two migrations for a single tablet, both will be written as mutations, and the resulting mutation could contain mixed cells from both migrations.

Fixes scylladb/scylladb#27304

Backport to existing releases: this is a bug that can affect correctness.

- (cherry picked from commit 97b7c03709)

Parent PR: #27312

Closes scylladb/scylladb#27330

* github.com:scylladb/scylladb:
  tablet: scheduler: Do not emit conflicting migration in merge colocation
  tablet: scheduler: Do not emit conflicting migrations in the plan
2025-12-03 12:24:48 +03:00
Aleksandra Martyniuk
8a3932e4d9 replica: database: change type of tables_metadata::_ks_cf_to_uuid
If there are a lot of tables, a node reports an oversized allocation
in _ks_cf_to_uuid, which is of type flat_hash_map.

Change the type to std::unordered_map to prevent oversized allocations.
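
A sketch of the change, with assumed key, hash, and value types:

```c++
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>

// Assumed key/value/hash shapes, not Scylla's real ones. A flat hash
// map keeps all slots in one contiguous buffer, which becomes an
// oversized allocation once there are many tables; std::unordered_map
// allocates per element, so no single huge buffer is needed.
using ks_cf = std::pair<std::string, std::string>;   // (keyspace, table)
struct ks_cf_hash {
    size_t operator()(const ks_cf& k) const {
        return std::hash<std::string>{}(k.first) ^ (std::hash<std::string>{}(k.second) << 1);
    }
};
struct table_id { /* a UUID in the real code */ };

// before (conceptually): absl::flat_hash_map<ks_cf, table_id, ks_cf_hash>
std::unordered_map<ks_cf, table_id, ks_cf_hash> _ks_cf_to_uuid;
```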

Fixes: https://github.com/scylladb/scylladb/issues/26787

Closes scylladb/scylladb#27165

(cherry picked from commit 19a7d8e248)

Closes scylladb/scylladb#27198
2025-12-03 12:24:29 +03:00
Ernest Zaslavsky
99a51cf695 streaming: add more logging
Start logging all previously missed streaming options, such as the `scope`, `primary_replica`, and `skip_reshape` flags.

Fixes: https://github.com/scylladb/scylladb/issues/27299

Closes scylladb/scylladb#27311

(cherry picked from commit 1d5f60baac)

Closes scylladb/scylladb#27341
2025-12-02 12:13:21 +01:00
Jenkins Promoter
a99b7020dd Update pgo profiles - aarch64 2025-12-01 05:12:52 +02:00
Jenkins Promoter
8456b9520b Update pgo profiles - x86_64 2025-12-01 04:35:55 +02:00
Michael Litvak
4ac402c5c5 tablet: scheduler: Do not emit conflicting migration in merge colocation
The tablet scheduler should not emit conflicting migrations for the same
tablet. This was addressed initially in scylladb/scylladb#26038 but the
check is missing in the merge colocation plan, so add it there as well.

Without this check, the merge colocation plan could generate a
conflicting migration for a tablet that is already scheduled for
migration, as the test demonstrates.

This can cause correctness problems, because if the load balancer
generates two migrations for a single tablet, both will be written as
mutations, and the resulting mutation could contain mixed cells from
both migrations.

Fixes scylladb/scylladb#27304

Closes scylladb/scylladb#27312

(cherry picked from commit 97b7c03709)
2025-11-30 10:13:42 +01:00
Tomasz Grabiec
951d3f50ea tablet: scheduler: Do not emit conflicting migrations in the plan
Plan-making is invoked independently for different DCs (and in the
future, racks) and then plans are merged. It could be that the same
tablets are selected for migration in different DCs. Only one
migration will prevail and be committed to group0, so it's not a
correctness problem. The next cycle will recognize that the tablet is in
transition and will not select it as a candidate. But it makes
plan-making less efficient.

It may also surprise consumers of the plan, as we saw in #25912.

So we should make the plan-maker aware of already scheduled transitions
and not consider those tablets as candidates.
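
A minimal sketch of the candidate filter, with illustrative names:

```c++
#include <unordered_set>

using tablet_id = long;   // illustrative; the real type is richer

// Tablets that already have a scheduled transition are excluded from
// the candidate set, so two independently produced plans can't both
// pick the same tablet.
bool is_candidate(const std::unordered_set<tablet_id>& in_transition, tablet_id t) {
    return !in_transition.contains(t);
}
```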

Fixes #26038

Closes scylladb/scylladb#26048

(cherry picked from commit 981592bca5)
2025-11-30 10:00:22 +01:00
Patryk Jędrzejczak
a07e0d46ae Merge '[Backport 2025.3] locator/node: include _excluded in missing places' from Scylladb[bot]
We currently ignore the `_excluded` field in `node::clone()` and in the
verbose formatter of `locator::node`. The first is a bug that can have
unpredictable consequences for the system. The second is a minor
inconvenience during debugging.

We fix both places in this PR.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-72

This PR is a bugfix that should be backported to all supported branches.

- (cherry picked from commit 4160ae94c1)

- (cherry picked from commit 287c9eea65)

Parent PR: #27265

Closes scylladb/scylladb#27290

* https://github.com/scylladb/scylladb:
  locator/node: include _excluded in verbose formatter
  locator/node: preserve _excluded in clone()
2025-11-27 12:29:19 +01:00
Avi Kivity
69871fe600 Merge '[Backport 2025.3] fix notification about expiring erm held for too long' from Scylladb[bot]
Commit 6e4803a750 broke the notification about expired erms held for too long, since it resets the tracker without calling its destructor (where the notification is triggered). Fix the assignment operator to call the destructor as it should.

Fixes https://github.com/scylladb/scylladb/issues/27141

- (cherry picked from commit 9f97c376f1)

- (cherry picked from commit 5dcdaa6f66)

Parent PR: #27140

Closes scylladb/scylladb#27275

* github.com:scylladb/scylladb:
  test: test that an expired erm held for too long triggers a notification
  token_metadata: fix notification about expiring erm held for too long
2025-11-27 12:10:38 +02:00
Patryk Jędrzejczak
2307bf891d locator/node: include _excluded in verbose formatter
It can be helpful during debugging.

(cherry picked from commit 287c9eea65)
2025-11-26 23:04:48 +00:00
Patryk Jędrzejczak
f288273ef0 locator/node: preserve _excluded in clone()
We currently ignore the `_excluded` field in `clone()`. Losing
information about exclusion can have unpredictable consequences. One
observed effect (that led to finding this issue) is that the
`/storage_service/nodes/excluded` API endpoint sometimes misses excluded
nodes.
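
A minimal sketch of the bug pattern and the fix (locator::node's real fields and clone() signature differ):

```c++
// Illustrative type: the bug pattern is a clone() that rebuilds the
// object from the fields it knows about and silently drops the rest.
struct node {
    bool _excluded = false;
    // ... other fields ...
    node clone() const {
        node n;                    // rebuilt field by field
        n._excluded = _excluded;   // previously lost, now preserved
        return n;
    }
};
```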

(cherry picked from commit 4160ae94c1)
2025-11-26 23:04:48 +00:00
Gleb Natapov
aa75444438 test: test that an expired erm held for too long triggers a notification
(cherry picked from commit 5dcdaa6f66)
2025-11-26 15:08:41 +00:00
Gleb Natapov
e2d59df166 token_metadata: fix notification about expiring erm held for too long
Commit 6e4803a750 broke the notification about expired erms held for too
long, since it resets the tracker without calling its destructor (where
the notification is triggered). Fix the assignment operator to call the destructor.
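
A sketch of the fix pattern, with an illustrative type (the real tracker differs):

```c++
#include <new>
#include <utility>

struct tracker {
    // Destructor side effect: triggers the "erm held for too long"
    // notification in the real code.
    ~tracker() { /* notify */ }
    tracker() = default;
    tracker(tracker&&) = default;

    // An assignment operator implemented via placement-new must destroy
    // the old value first, otherwise the destructor's side effects are
    // silently skipped, which is exactly what the commit fixes.
    tracker& operator=(tracker&& o) noexcept {
        if (this != &o) {
            this->~tracker();                 // run the notification logic
            new (this) tracker(std::move(o)); // then rebuild in place
        }
        return *this;
    }
};
```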

(cherry picked from commit 9f97c376f1)
2025-11-26 15:08:41 +00:00
Ernest Zaslavsky
7e6b653e5c streaming: fix loop break condition in tablet_sstable_streamer::stream
Correct the loop termination logic that previously caused
certain SSTables to be prematurely excluded, resulting in
lost mutations. This change ensures all relevant SSTables
are properly streamed and their mutations preserved.

(cherry picked from commit dedc8bdf71)

Closes scylladb/scylladb#27153
Fixes: #26979

Parent PR: #26980
Unfortunately, the pytest-based test cannot be backported because of changes made to the testing harness and scylla-tools.
2025-11-25 11:59:01 +03:00
Avi Kivity
84b7e06268 tools: toolchain: prepare: replace 'reg' with 'skopeo'
The prepare script uses 'reg' to verify that we're not going to
overwrite an existing image. The 'reg' command is not
available in Fedora 43. Use 'skopeo' instead; skopeo
is part of the podman ecosystem, so it will hopefully live longer.

Fixes #27178.

Closes scylladb/scylladb#27179

(cherry picked from commit d6ef5967ef)

Closes scylladb/scylladb#27199
2025-11-24 16:32:04 +02:00
Jenkins Promoter
812fc721cd Update ScyllaDB version to: 2025.3.5 2025-11-24 15:50:44 +02:00
Raphael S. Carvalho
867cb1e7ac replica: Fail timed-out single-key read on cleaned up tablet replica
Consider the following:
1) single-key read starts, blocks on replica e.g. waiting for memory.
2) the same replica is migrated away
3) single-key read expires, coordinator abandons it, releases erm.
4) migration advances to cleanup stage, barrier doesn't wait on
   timed-out read
5) compaction group of the replica is deallocated on cleanup
6) that single-key read resumes, but doesn't find the sstable set (post cleanup)
7) with abort-on-internal-error turned on, node crashes

It's fine for abandoned (= timed-out) reads to fail, since the
coordinator is gone.
For active (non-timed-out) reads, the barrier will wait for them,
since their coordinator holds the erm.
The solution consists of failing reads whose underlying tablet
replica has been cleaned up, by simply converting the internal error
into a plain exception.

Fixes #26229.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#27078

(cherry picked from commit 74ecedfb5c)

Closes scylladb/scylladb#27155
2025-11-21 17:48:21 +03:00
Patryk Jędrzejczak
a9fc235aee test: test_raft_recovery_stuck: ensure mutual visibility before using driver
Not waiting for nodes to see each other as alive can cause the driver to
fail the request sent in `wait_for_upgrade_state()`.

scylladb/scylladb#19771 has already replaced concurrent restarts with
`ManagerClient.rolling_restart()`, but it has missed this single place,
probably because we do concurrent starts here.

Fixes #27055

Closes scylladb/scylladb#27075

(cherry picked from commit e35ba974ce)

Closes scylladb/scylladb#27109
2025-11-20 10:41:58 +02:00
Botond Dénes
78ecb8854a Merge '[Backport 2025.3] Automatic cleanup improvements' from Scylladb[bot]
This series allows an operator to reset the 'cleanup needed' flag if they have already cleaned up the node, so that automatic cleanup will not do it again. We also change 'nodetool cleanup' back to running cleanup on one node only (and resetting the 'cleanup needed' flag at the end), while the new '--global' option allows running cleanup on all nodes that need it simultaneously.

Fixes https://github.com/scylladb/scylladb/issues/26866

Backport to all supported versions, since the current automatic cleanup behaviour may create load that the operator does not expect during cluster resizing.

- (cherry picked from commit e872f9cb4e)

- (cherry picked from commit 0f0ab11311)

Parent PR: #26868

Closes scylladb/scylladb#27093

* github.com:scylladb/scylladb:
  cleanup: introduce "nodetool cluster cleanup" command to run cleanup on all dirty nodes in the cluster
  cleanup: Add RESTful API to allow resetting the 'cleanup needed' flag
2025-11-20 10:41:04 +02:00
Botond Dénes
d2d9140029 Merge '[Backport 2025.3] encryption::kms_host: Add exponential backoff-retry for 503 errors' from Scylladb[bot]
Refs #26822
Fixes #27062

AWS says to treat 503 errors, at least in the case of the ec2 metadata query, as backoff-retry (generally, we do _not_ retry at the provider level, but delegate this to higher levels). This patch adds special treatment for 503s (service unavailable) for both the ec2 metadata and the actual endpoint, doing exponential backoff.

Note: we do _not_ retry forever.
Not tested as such, since I don't get any errors when testing (doh!). Maybe we should try to set up a mock ec2 metadata server with injected errors.

- (cherry picked from commit 190e3666cb)

- (cherry picked from commit d22e0acf0b)

Parent PR: #26934

Closes scylladb/scylladb#27063

* github.com:scylladb/scylladb:
  encryption::kms_host: Add exponential backoff-retry for 503 errors
  encryption::kms_host: Include http error code in kms_error
2025-11-20 10:40:20 +02:00
Botond Dénes
91e6efdde8 Merge '[Backport 2025.3] service/qos: Fall back to default scheduling group when using maintenance socket' from Scylladb[bot]
The service level controller relies on `auth::service` to collect
information about roles and the relation between them and the service
levels (those attached to them). Unfortunately, the service level
controller is initialized way earlier than `auth::service` and so we
had to prevent potential invalid queries of user service levels
(cf. 46193f5e79).

Unfortunately, that came at a price: it made the maintenance socket
incompatible with the current implementation of the service level
controller. The maintenance socket starts early, before the
`auth::service` is fully initialized and registered, and is exposed
almost immediately. If the user attempts to connect to Scylla within
this time window, via the maintenance socket, one of the things that
will happen is choosing the right service level for the connection.
Since the `auth::service` is not registered, Scylla with fail an
assertion and crash.

A similar scenario occurs when using maintenance mode. The maintenance
socket is how the user communicates with the database, and we're not
prepared for that either.

To avoid unnecessary crashes, we add new branches for the cases where
the passed user is absent or corresponds to the anonymous role. Since
the role corresponding to a connection via the maintenance socket is
the anonymous role, that solves the problem.

Some accesses to `auth::service` are not affected and we do not modify
those.

Fixes scylladb/scylladb#26816

Backport: yes. This is a fix of a regression.

- (cherry picked from commit c0f7622d12)

- (cherry picked from commit 222eab45f8)

- (cherry picked from commit 394207fd69)

- (cherry picked from commit b357c8278f)

Parent PR: #26856

Closes scylladb/scylladb#27039

* github.com:scylladb/scylladb:
  test/cluster/test_maintenance_mode.py: Wait for initialization
  test: Disable maintenance mode correctly in test_maintenance_mode.py
  test: Fix keyspace in test_maintenance_mode.py
  service/qos: Do not crash Scylla if auth_integration absent
2025-11-20 10:39:48 +02:00
Botond Dénes
a067723f55 Merge '[Backport 2025.3] cdc: set column drop timestamp in the future' from Scylladb[bot]
When dropping a column from a CDC log table, set the column drop
timestamp several seconds into the future.

If a value is written to a column concurrently with dropping that
column, the value's timestamp may be after the column drop timestamp. If
this value is also flushed to an SSTable, the SSTable would be
corrupted, because it considers the column missing after the drop
timestamp and doesn't allow values for it.

While this issue affects general tables, it especially impacts CDC tables
because this scenario can occur when writing to a table with CDC preimage
enabled while dropping a column from the base table. This happens even if
the base mutation doesn't write to the dropped column, because CDC log
mutations can generate values for a column even if the base mutation doesn't.
For general tables, this issue can be avoided by simply not writing to a
column while dropping it.

We fix this for the more problematic case of CDC log tables by setting
the column drop timestamp several seconds into the future, ensuring that
writes concurrent with column drops are much less likely to have
timestamps greater than the column drop timestamp.

Fixes https://github.com/scylladb/scylladb/issues/26340

The issue affects all previous releases; backport to improve stability.

- (cherry picked from commit eefae4cc4e)

- (cherry picked from commit 48298e38ab)

- (cherry picked from commit 039323d889)

- (cherry picked from commit e85051068d)

Parent PR: #26533

Closes scylladb/scylladb#27036

* github.com:scylladb/scylladb:
  test: test concurrent writes with column drop with cdc preimage
  cdc: check if recreating a column too soon
  cdc: set column drop timestamp in the future
2025-11-20 10:39:18 +02:00
Gleb Natapov
b53bf43844 cleanup: introduce "nodetool cluster cleanup" command to run cleanup on all dirty nodes in the cluster
97ab3f6622 changed "nodetool cleanup" (without arguments) to run
cleanup on all dirty nodes in the cluster. This was somewhat unexpected,
so this patch changes it back to running cleanup on the target node only
(and resetting the "cleanup needed" flag afterwards), and adds a "nodetool
cluster cleanup" command that runs cleanup on all dirty nodes in the
cluster.

(cherry picked from commit 0f0ab11311)
2025-11-19 10:53:42 +02:00
Gleb Natapov
3d60e5e825 cleanup: Add RESTful API to allow resetting the 'cleanup needed' flag
Cleaning up a node using the per-keyspace/table interface does not reset
the 'cleanup needed' flag in the topology. The assumption was that running
cleanup on an already clean node does nothing and completes quickly. But due to
https://github.com/scylladb/scylladb/issues/12215 (which is closed as
WONTFIX) this is not the case. This patch provides the ability to reset
the flag in the topology if the operator has already cleaned up the node
manually.

(cherry picked from commit e872f9cb4e)
2025-11-19 10:44:30 +02:00
Avi Kivity
e9e849c2bf Merge '[Backport 2025.3] Synchronize tablet split and load-and-stream' from Scylladb[bot]
Load-and-stream is broken when running concurrently with the finalization step of tablet split.

Consider this:
1) split starts
2) split finalization executes barrier and succeed
3) load-and-stream runs now, starts writing sstable (pre-split)
4) split finalization publishes changes to tablet metadata
5) load-and-stream finishes writing sstable
6) sstable cannot be loaded since it spans two tablets

two possible fixes (maybe both):

1) load-and-stream awaits for topology to quiesce
2) perform split compaction on sstable that spans both sibling tablets

This patch implements # 1. By awaiting for topology to quiesce,
we guarantee that load-and-stream only starts when there's no
chance coordinator is handling some topology operation like
split finalization.

Fixes https://github.com/scylladb/scylladb/issues/26455.

- (cherry picked from commit 3abc66da5a)

- (cherry picked from commit 4654cdc6fd)

Parent PR: #26456

Closes scylladb/scylladb#26648

* github.com:scylladb/scylladb:
  sstables_loader: Don't bypass synchronization with busy topology
  test: Add reproducer for l-a-s and split synchronization issue
  sstables_loader: Synchronize tablet split and load-and-stream
2025-11-17 17:14:36 +02:00
Calle Wilund
484e7aed2c encryption::kms_host: Add exponential backoff-retry for 503 errors
Refs #26822

AWS says to treat 503 errors, at least in the case of the ec2 metadata
query, as backoff-retry (generally, we do _not_ retry at the provider
level, but delegate this to higher levels). This patch adds special
treatment for 503s (service unavailable) for both the ec2 metadata and
the actual endpoint, doing exponential backoff.
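
An illustrative sketch of the retry shape (the v2 patch uses utils::exponential_backoff_retry; the types and the attempt bound below are assumptions):

```c++
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sleep.hh>
#include <chrono>

struct response { int status = 200; };
struct request {};

seastar::future<response> send(request);   // assumed HTTP send

seastar::future<response> request_with_backoff(request req) {
    constexpr int max_attempts = 5;                  // illustrative bound; we don't retry forever
    auto delay = std::chrono::milliseconds(100);
    for (int attempt = 1;; ++attempt) {
        auto resp = co_await send(req);
        if (resp.status != 503 || attempt == max_attempts) {
            co_return resp;                          // success, non-503 error, or give up
        }
        co_await seastar::sleep(delay);
        delay *= 2;                                  // exponential backoff
    }
}
```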

Note: we do _not_ retry forever.
Not tested as such, since I don't get any errors when testing (doh!).
Maybe we should try to set up a mock ec2 metadata server with injected
errors.

v2:
* Use utils::exponential_backoff_retry

(cherry picked from commit d22e0acf0b)
2025-11-17 11:48:42 +00:00
Calle Wilund
77407fd704 encryption::kms_host: Include http error code in kms_error
Keep track of the actual HTTP failure.

(cherry picked from commit 190e3666cb)
2025-11-17 11:48:41 +00:00
Benny Halevy
898f193ef6 scylla-sstable: correctly dump sharding_metadata
This patch fixes two issues in one go:

First, sstables::load currently clears the sharding metadata
(via open_data()), so scylla-sstable always prints an empty
array for it.

Second, printing token values would generate invalid JSON,
as they are currently printed as binary bytes; they should be
printed simply as numbers, as we do elsewhere, for example
for the first and last keys.

Fixes #26982

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#26991

(cherry picked from commit f9ce98384a)

Closes scylladb/scylladb#27037
scylla-2025.3.4-candidate-20251117021558 scylla-2025.3.4
2025-11-16 15:43:35 +02:00
Michael Litvak
ba40c1eba9 test: test concurrent writes with column drop with cdc preimage
Add a test that writes to a table concurrently with dropping a column,
where the table has CDC enabled with preimage.

The test reproduces issue #26340, where this results in a malformed
sstable.

(cherry picked from commit e85051068d)
2025-11-16 10:03:07 +01:00
Michael Litvak
28eaa12af9 cdc: check if recreating a column too soon
When we drop a column from a CDC log table, we set the column drop
timestamp a few seconds into the future. This can cause unexpected
problems if a user tries to recreate a CDC column too soon, before
the drop timestamp has passed.

To prevent this issue, when creating a CDC column we check its
creation timestamp against the existing drop timestamp, if any, and
fail with an informative error if the recreation attempt is too soon.
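
A minimal sketch of the guard, with illustrative names (the real code raises a CQL invalid-request error):

```c++
#include <cstdint>
#include <optional>
#include <stdexcept>

using api_timestamp = int64_t;   // microseconds since the epoch

// Re-adding a CDC log column is rejected while its recorded drop
// timestamp (set in the future by the drop) has not yet passed.
void check_column_recreation(api_timestamp creation_ts,
                             std::optional<api_timestamp> dropped_ts) {
    if (dropped_ts && creation_ts <= *dropped_ts) {
        throw std::runtime_error(
            "cannot re-add a CDC log column this soon after dropping it");
    }
}
```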

(cherry picked from commit 039323d889)
2025-11-16 10:03:07 +01:00
Michael Litvak
c37d224db6 cdc: set column drop timestamp in the future
When dropping a column from a CDC log table, set the column drop
timestamp several seconds into the future.

If a value is written to a column concurrently with dropping that
column, the value's timestamp may be after the column drop timestamp. If
this value is also flushed to an SSTable, the SSTable would be
corrupted, because it considers the column missing after the drop
timestamp and doesn't allow values for it.

While this issue affects general tables, it especially impacts CDC tables
because this scenario can occur when writing to a table with CDC preimage
enabled while dropping a column from the base table. This happens even if
the base mutation doesn't write to the dropped column, because CDC log
mutations can generate values for a column even if the base mutation doesn't.
For general tables, this issue can be avoided by simply not writing to a
column while dropping it.

We fix this for the more problematic case of CDC log tables by setting
the column drop timestamp several seconds into the future, ensuring that
writes concurrent with column drops are much less likely to have
timestamps greater than the column drop timestamp.
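
A minimal sketch of the idea, with an assumed grace period (the real constant lives in the CDC code):

```c++
#include <chrono>
#include <cstdint>

using api_timestamp = int64_t;   // microseconds, like api::timestamp_type

// The drop timestamp is pushed a few seconds past "now", so writes
// racing with the drop still carry timestamps below it.
api_timestamp cdc_column_drop_timestamp() {
    using namespace std::chrono;
    constexpr auto grace = seconds(10);   // illustrative value
    auto now_us = duration_cast<microseconds>(system_clock::now().time_since_epoch());
    return (now_us + grace).count();
}
```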

Fixes scylladb/scylladb#26340

(cherry picked from commit 48298e38ab)
2025-11-16 09:34:51 +01:00
Dawid Mędrek
7b32c277fe test/cluster/test_maintenance_mode.py: Wait for initialization
If we try to perform queries too early, before the call to
`storage_service::start_maintenance_mode` has finished, we will
fail with the following error:

```
ERROR 2025-11-12 20:32:27,064 [shard 0:sl:d] token_metadata - sorted_tokens is empty in first_token_index!
```

To avoid that, we should wait until initialization is complete.

(cherry picked from commit b357c8278f)
2025-11-15 22:10:28 +00:00
Dawid Mędrek
6d6f870a5f test: Disable maintenance mode correctly in test_maintenance_mode.py
Although setting the value of `maintenance_mode` to the string `"false"`
disables maintenance mode, the testing framework misinterprets the value
and thinks that it's actually enabled. As a result, it might try to
connect to Scylla via the maintenance socket, which we don't want.

(cherry picked from commit 394207fd69)
2025-11-15 22:10:28 +00:00