The tablet scheduler should not emit conflicting migrations for the same tablet. This was addressed initially in scylladb/scylladb#26038 but the check is missing in the merge colocation plan, so add it there as well.
Without this check, the merge colocation plan could generate a conflicting migration for a tablet that is already scheduled for migration, as the test demonstrates.
This can cause correctness problems, because if the load balancer generates two migrations for a single tablet, both will be written as mutations, and the resulting mutation could contain mixed cells from both migrations.
Fixes scylladb/scylladb#27304
backport to existing releases - this is a bug that can affect correctness
- (cherry picked from commit 97b7c03709)
Parent PR: #27312Closesscylladb/scylladb#27330
* github.com:scylladb/scylladb:
tablet: scheduler: Do not emit conflicting migration in merge colocation
tablet: scheduler: Do not emit conflicting migrations in the plan
The tablet scheduler should not emit conflicting migrations for the same
tablet. This was addressed initially in scylladb/scylladb#26038 but the
check is missing in the merge colocation plan, so add it there as well.
Without this check, the merge colocation plan could generate a
conflicting migration for a tablet that is already scheduled for
migration, as the test demonstrates.
This can cause correctness problems, because if the load balancer
generates two migrations for a single tablet, both will be written as
mutations, and the resulting mutation could contain mixed cells from
both migrations.
Fixesscylladb/scylladb#27304Closesscylladb/scylladb#27312
(cherry picked from commit 97b7c03709)
Plan-making is invoked independently for different DCs (and in the
future, racks) and then plans are merged. It could be that the same
tablets are selected for migration in different DCs. Only one
migration will prevail and be committed to group0, so it's not a
correctness problem. Next cycle will recognize that the tablet is in
transition and will not be selected by plan-maker. But it makes
plan-making less efficient.
It may also surprise consumers of the plan, like we saw in #25912.
So we should make plan-maker be aware of already scheduled transitions
and not consider those tablets as candidates.
Fixes#26038Closesscylladb/scylladb#26048
(cherry picked from commit 981592bca5)
We currently ignore the `_excluded` field in `node::clone()` and the verbose
formatter of `locator::node`. The first one is a bug that can have
unpredictable consequences on the system. The second one can be a minor
inconvenience during debugging.
We fix both places in this PR.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-72
This PR is a bugfix that should be backported to all supported branches.
- (cherry picked from commit 4160ae94c1)
- (cherry picked from commit 287c9eea65)
Parent PR: #27265Closesscylladb/scylladb#27290
* https://github.com/scylladb/scylladb:
locator/node: include _excluded in verbose formatter
locator/node: preserve _excluded in clone()
Commit 6e4803a750 broke notification about expired erms held for too long since it resets the tracker without calling its destructor (where notification is triggered). Fix the assign operator to call the destructor like it should.
Fixes https://github.com/scylladb/scylladb/issues/27141
- (cherry picked from commit 9f97c376f1)
- (cherry picked from commit 5dcdaa6f66)
Parent PR: #27140Closesscylladb/scylladb#27275
* github.com:scylladb/scylladb:
test: test that expired erm that held for too long triggers notification
token_metadata: fix notification about expiring erm held for to long
We currently ignore the `_excluded` field in `clone()`. Losing
information about exclusion can have unpredictable consequences. One
observed effect (that led to finding this issue) is that the
`/storage_service/nodes/excluded` API endpoint sometimes misses excluded
nodes.
(cherry picked from commit 4160ae94c1)
Commit 6e4803a750 broke notification about expired erms held for too
long since it resets the tracker without calling its destructor (where
notification is triggered). Fix assign operator to call destructor.
(cherry picked from commit 9f97c376f1)
Correct the loop termination logic that previously caused
certain SSTables to be prematurely excluded, resulting in
lost mutations. This change ensures all relevant SSTables
are properly streamed and their mutations preserved.
(cherry picked from commit dedc8bdf71)
Closesscylladb/scylladb#27153Fixes: #26979
Parent PR: #26980
Unfortunatelly the pytest based test cannot be ported back because of changes made to the testing harness and scylla-tools
The prepare scripts uses 'reg' to verify we're not going to
overwrite an existing image. The 'reg' command is not
available in Fedora 43. Use 'skopeo' instead. Skopeo
is part of the podman ecosystem so hopefully will live longer.
Fixes#27178.
Closesscylladb/scylladb#27179
(cherry picked from commit d6ef5967ef)
Closesscylladb/scylladb#27199
Consider the following:
1) single-key read starts, blocks on replica e.g. waiting for memory.
2) the same replica is migrated away
3) single-key read expires, coordinator abandons it, releases erm.
4) migration advances to cleanup stage, barrier doesn't wait on
timed-out read
5) compaction group of the replica is deallocated on cleanup
6) that single-key resumes, but doesn't find sstable set (post cleanup)
7) with abort-on-internal-error turned on, node crashes
It's fine for abandoned (= timed out) reads to fail, since the
coordinator is gone.
For active reads (non timed out), the barrier will wait for them
since their coordinator holds erm.
This solution consists of failing reads which underlying tablet
replica has been cleaned up, by just converting internal error
to plain exception.
Fixes#26229.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#27078
(cherry picked from commit 74ecedfb5c)
Closesscylladb/scylladb#27155
Not waiting for nodes to see each other as alive can cause the driver to
fail the request sent in `wait_for_upgrade_state()`.
scylladb/scylladb#19771 has already replaced concurrent restarts with
`ManagerClient.rolling_restart()`, but it has missed this single place,
probably because we do concurrent starts here.
Fixes#27055Closesscylladb/scylladb#27075
(cherry picked from commit e35ba974ce)
Closesscylladb/scylladb#27109
This series allows an operator to reset 'cleanup needed' flag if he already cleaned up the node, so that automatic cleanup will not do it again. We also change 'nodetool cleanup' back to run cleanup on one node only (and reset 'cleanup needed' flag in the end), but the new '--global' option allows to run cleanup on all nodes that needed it simultaneously.
Fixes https://github.com/scylladb/scylladb/issues/26866
Backport to all supported version since automatic cleanup behaviour as it is now may create unexpected by the operator load during cluster resizing.
- (cherry picked from commit e872f9cb4e)
- (cherry picked from commit 0f0ab11311)
Parent PR: #26868Closesscylladb/scylladb#27093
* github.com:scylladb/scylladb:
cleanup: introduce "nodetool cluster cleanup" command to run cleanup on all dirty nodes in the cluster
cleanup: Add RESTful API to allow reset cleanup needed flag
Refs #26822Fixes#27062
AWS says to treat 503 errors, at least in the case of ec2 metadata query, as backoff-retry (generally, we do _not_ retry on provider level, but delegate this to higher levels). This patch adds special treatment for 503:s (service unavailable) for both ec2 meta and actual endpoint, doing exponential backoff.
Note: we do _not_ retry forever.
Not tested as such, since I don't get any errors when testing (doh!). Should try to set up a mock ec2 meta with injected errors maybe.
- (cherry picked from commit 190e3666cb)
- (cherry picked from commit d22e0acf0b)
Parent PR: #26934Closesscylladb/scylladb#27063
* github.com:scylladb/scylladb:
encryption::kms_host: Add exponential backoff-retry for 503 errors
encryption::kms_host: Include http error code in kms_error
The service level controller relies on `auth::service` to collect
information about roles and the relation between them and the service
levels (those attached to them). Unfortunately, the service level
controller is initialized way earlier than `auth::service` and so we
had to prevent potential invalid queries of user service levels
(cf. 46193f5e79).
Unfortunately, that came at a price: it made the maintenance socket
incompatible with the current implementation of the service level
controller. The maintenance socket starts early, before the
`auth::service` is fully initialized and registered, and is exposed
almost immediately. If the user attempts to connect to Scylla within
this time window, via the maintenance socket, one of the things that
will happen is choosing the right service level for the connection.
Since the `auth::service` is not registered, Scylla with fail an
assertion and crash.
A similar scenario occurs when using maintenance mode. The maintenance
socket is how the user communicates with the database, and we're not
prepared for that either.
To avoid unnecessary crashes, we add new branches if the passed user is
absent or if it corresponds to the anonymous role. Since the role
corresponding to a connection via the maintenance socket is the anonymous
role, that solves the problem.
Some accesses to `auth::service` are not affected and we do not modify
those.
Fixes scylladb/scylladb#26816
Backport: yes. This is a fix of a regression.
- (cherry picked from commit c0f7622d12)
- (cherry picked from commit 222eab45f8)
- (cherry picked from commit 394207fd69)
- (cherry picked from commit b357c8278f)
Parent PR: #26856Closesscylladb/scylladb#27039
* github.com:scylladb/scylladb:
test/cluster/test_maintenance_mode.py: Wait for initialization
test: Disable maintenance mode correctly in test_maintenance_mode.py
test: Fix keyspace in test_maintenance_mode.py
service/qos: Do not crash Scylla if auth_integration absent
When dropping a column from a CDC log table, set the column drop
timestamp several seconds into the future.
If a value is written to a column concurrently with dropping that
column, the value's timestamp may be after the column drop timestamp. If
this value is also flushed to an SSTable, the SSTable would be
corrupted, because it considers the column missing after the drop
timestamp and doesn't allow values for it.
While this issue affects general tables, it especially impacts CDC tables
because this scenario can occur when writing to a table with CDC preimage
enabled while dropping a column from the base table. This happens even if
the base mutation doesn't write to the dropped column, because CDC log
mutations can generate values for a column even if the base mutation doesn't.
For general tables, this issue can be avoided by simply not writing to a
column while dropping it.
We fix this for the more problematic case of CDC log tables by setting
the column drop timestamp several seconds into the future, ensuring that
writes concurrent with column drops are much less likely to have
timestamps greater than the column drop timestamp.
Fixes https://github.com/scylladb/scylladb/issues/26340
the issue affects all previous releases, backport to improve stability
- (cherry picked from commit eefae4cc4e)
- (cherry picked from commit 48298e38ab)
- (cherry picked from commit 039323d889)
- (cherry picked from commit e85051068d)
Parent PR: #26533Closesscylladb/scylladb#27036
* github.com:scylladb/scylladb:
test: test concurrent writes with column drop with cdc preimage
cdc: check if recreating a column too soon
cdc: set column drop timestamp in the future
97ab3f6622 changed "nodetool cleanup" (without arguments) to run
cleanup on all dirty nodes in the cluster. This was somewhat unexpected,
so this patch changes it back to run cleanup on the target node only (and
reset "cleanup needed" flag afterwards) and it adds "nodetool cluster
cleanup" command that runs the cleanup on all dirty nodes in the
cluster.
(cherry picked from commit 0f0ab11311)
Cleaning up a node using per keyspace/table interface does not reset cleanup
needed flag in the topology. The assumption was that running cleanup on
already clean node does nothing and completes quickly. But due to
https://github.com/scylladb/scylladb/issues/12215 (which is closed as
WONTFIX) this is not the case. This patch provides the ability to reset
the flag in the topology if operator cleaned up the node manually
already.
(cherry picked from commit e872f9cb4e)
Load-and-stream is broken when running concurrently to the finalization step of tablet split.
Consider this:
1) split starts
2) split finalization executes barrier and succeed
3) load-and-stream runs now, starts writing sstable (pre-split)
4) split finalization publishes changes to tablet metadata
5) load-and-stream finishes writing sstable
6) sstable cannot be loaded since it spans two tablets
two possible fixes (maybe both):
1) load-and-stream awaits for topology to quiesce
2) perform split compaction on sstable that spans both sibling tablets
This patch implements # 1. By awaiting for topology to quiesce,
we guarantee that load-and-stream only starts when there's no
chance coordinator is handling some topology operation like
split finalization.
Fixes https://github.com/scylladb/scylladb/issues/26455.
- (cherry picked from commit 3abc66da5a)
- (cherry picked from commit 4654cdc6fd)
Parent PR: #26456Closesscylladb/scylladb#26648
* github.com:scylladb/scylladb:
sstables_loader: Don't bypass synchronization with busy topology
test: Add reproducer for l-a-s and split synchronization issue
sstables_loader: Synchronize tablet split and load-and-stream
Refs #26822
AWS says to treat 503 errors, at least in the case of ec2 metadata
query, as backoff-retry (generally, we do _not_ retry on provider
level, but delegate this to higher levels). This patch adds special
treatment for 503:s (service unavailable) for both ec2 meta and
actual endpoint, doing exponential backoff.
Note: we do _not_ retry forever.
Not tested as such, since I don't get any errors when testing
(doh!). Should try to set up a mock ec2 meta with injected errors
maybe.
v2:
* Use utils::exponential_backoff_retry
(cherry picked from commit d22e0acf0b)
This patch fixes 2 issues at one go:
First, Currently sstables::load clears the sharding metadata
(via open_data()), and so scylla-sstable always prints
an empty array for it.
Second, printing token values would generate invalid json
as they are currently printed as binary bytes, and they
should be printed simply as numbers, as we do elsewhere,
for example, for the first and last keys.
Fixes#26982
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#26991
(cherry picked from commit f9ce98384a)
Closesscylladb/scylladb#27037
add a test that writes to a table concurrently with dropping a column,
where the table has CDC enabled with preimage.
the test reproduces issue #26340 where this results in a malformed
sstable.
(cherry picked from commit e85051068d)
When we drop a column from a CDC log table, we set the column drop
timestamp a few seconds into the future. This can cause unexpected
problems if a user tries to recreate a CDC column too soon, before
the drop timestamp has passed.
To prevent this issue, when creating a CDC column we check its
creation timestamp against the existing drop timestamp, if any, and
fail with an informative error if the recreation attempt is too soon.
(cherry picked from commit 039323d889)
When dropping a column from a CDC log table, set the column drop
timestamp several seconds into the future.
If a value is written to a column concurrently with dropping that
column, the value's timestamp may be after the column drop timestamp. If
this value is also flushed to an SSTable, the SSTable would be
corrupted, because it considers the column missing after the drop
timestamp and doesn't allow values for it.
While this issue affects general tables, it especially impacts CDC tables
because this scenario can occur when writing to a table with CDC preimage
enabled while dropping a column from the base table. This happens even if
the base mutation doesn't write to the dropped column, because CDC log
mutations can generate values for a column even if the base mutation doesn't.
For general tables, this issue can be avoided by simply not writing to a
column while dropping it.
We fix this for the more problematic case of CDC log tables by setting
the column drop timestamp several seconds into the future, ensuring that
writes concurrent with column drops are much less likely to have
timestamps greater than the column drop timestamp.
Fixesscylladb/scylladb#26340
(cherry picked from commit 48298e38ab)
If we try to perform queries too early, before the call to
`storage_service::start_maintenance_mode` has finished, we will
fail with the following error:
```
ERROR 2025-11-12 20:32:27,064 [shard 0:sl:d] token_metadata - sorted_tokens is empty in first_token_index!
```
To avoid that, we should wait until initialization is complete.
(cherry picked from commit b357c8278f)
Although setting the value of `maintenance_mode` to the string `"false"`
disables maintenance mode, the testing framework misinterprets the value
and thinks that it's actually enabled. As a result, it might try to
connect to Scylla via the maintenance socket, which we don't want.
(cherry picked from commit 394207fd69)
If the user connects to Scylla via the maintenance socket, it may happen
that `auth_integration` has not been registered in the service level
controller yet. One example is maintenance mode when that will never
happen; another when the connection occurs before Scylla is fully
initialized.
To avoid unnecessary crashes, we add new branches if the passed user is
absent or if it corresponds to the anonymous role. Since the role
corresponding to a connection via the maintenance socket is the anonymous
role, that solves the problem.
In those cases, we completely circumvent any calls to `auth_integration`
and handle them separately. The modified methods are:
* `get_user_scheduling_group`,
* `with_user_service_level`,
* `describe_service_levels`.
For the first two, the new behavior is in line with the previous
implementation of those functions. The last behaves differently now,
but since it's a soft error, crashing the node is not necessary anyway.
We throw an exception instead, whose error message should give the user
a hint of what might be wrong.
The other uses of `auth_integration` within the service level controller
are not problematic:
* `find_effective_service_level`,
* `find_cached_effective_service_level`.
They take the name of a role as their argument. Since the anonymous role
doesn't have a name, it's not possible to call them with it.
Fixesscylladb/scylladb#26816
(cherry picked from commit c0f7622d12)
The patch c543059f86 fixed the synchronization issue between tablet
split and load-and-stream. The synchronization worked only with
raft topology, and therefore was disabled with gossip.
To do the check, storage_service::raft_topology_change_enabled()
but the topology kind is only available/set on shard 0, so it caused
the synchronization to be bypassed when load-and-stream runs on
any shard other than 0.
The reason the reproducer didn't catch it is that it was restricted
to single cpu. It will now run with multi cpu and catch the
problem observed.
Fixes#22707
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#26730
(cherry picked from commit 7f34366b9d)
(cherry picked from commit e8a74d0fb3)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This patch series re-enables support for speculative retry values `0` and `100`. These values have been supported some time ago, before [schema: fix issue 21825: add validation for PERCENTILE values in speculative_retry configuration. #21879
](https://github.com/scylladb/scylladb/pull/21879). When that PR prevented using invalid `101PERCENTILE` values, valid `100PERCENTILE` and `0PERCENTILE` value were prevented too.
Reproduction steps from [[Bug]: drop schema and all tables after apply speculative_retry = '99.99PERCENTILE' #26369](https://github.com/scylladb/scylladb/issues/26369) are unable to reproduce the issue after the fix. A test is added to make sure the inclusive border values `0` and `100` are supported.
Documentation is updated to give more information to the users. It now states that these border values are inclusive, and also that the precision, with automatic rounding, is 1 decimal digit.
Fixes#26369
This is a bug fix. If at any time a client tries to use value >= 99.5 and < 100, the raft error will happen. Backport is needed. The code which introduced inconsistency is introduced in 2025.2, so no backporting to 2025.1.
- (cherry picked from commit da2ac90bb6)
- (cherry picked from commit 5d1913a502)
- (cherry picked from commit aba4c006ba)
- (cherry picked from commit 85f059c148)
- (cherry picked from commit 7ec9e23ee3)
Parent PR: #26909Closesscylladb/scylladb#27014
* github.com:scylladb/scylladb:
test: cqlpy: add test case for non-numeric PERCENTILE value
schema: speculative_retry: update exception type for sstring ops
docs: cql: ddl.rst: update speculative-retry-options
test: cqlpy: add test for valid speculative_retry values
schema: speculative_retry: allow 0 and 100 PERCENTILE values
cql3: Fix std::bad_cast when deserializing vectors of collections
This PR fixes a bug where attempting to INSERT a vector containing collections (e.g., `vector<set<int>,1>`) would fail. On the client side, this manifested as a `ServerError: std::bad_cast`.
The cause was "type slicing" issue in the reserialize_value function. When retrieving the vector's element type, the result was being assigned by value (using auto) instead of by reference.
This "sliced" the polymorphic abstract_type object, stripping it of its actual derived type information. As a result, a subsequent dynamic_cast would fail, even if the underlying type was correct.
To prevent this entire class of bugs from happening again, I've made the polymorphic base class `abstract_type` explicitly uncopyable.
Fixes: #26704
This fix needs to be backported as these releases are affected: `2025.4` , `2025.3`.
- (cherry picked from commit 960fe3da60)
- (cherry picked from commit 77da4517d2)
Parent PR: #26740Closesscylladb/scylladb#26997
* github.com:scylladb/scylladb:
cql3: Make abstract_type explicitly noncopyable
cql3: Fix std::bad_cast when deserializing vectors of collections
Add test case for non-numeric PERCENTILE value, which raises an error
different to the out-of-range invalid values. Regex in the test
test_invalid_percentile_speculative_retry_values is expanded.
Refs #26369
(cherry picked from commit 7ec9e23ee3)
Change speculative_retry::to_sstring and speculative_retry::from_sstring
to throw exceptions::configuration_exception instead of std::invalid_argument.
These errors can be triggered by CQL, so appropriate CQL exception should be
used.
Reference: https://github.com/scylladb/scylladb/issues/24748#issuecomment-3025213304
Refs #26369
(cherry picked from commit 85f059c148)
Clarify how the value of `XPERCENTILE` is handled:
- Values 0 and 100 are supported
- The percentile value is rounded to the nearest 0.1 (1 decimal place)
Refs #26369
(cherry picked from commit aba4c006ba)
test_valid_percentile_speculative_retry_values is introduced to test that
valid values for speculative_retry are properly accepted.
Some of the values are moved from the
test_invalid_percentile_speculative_retry_values test, because
the previous commit added support for them.
Refs #26369
(cherry picked from commit 5d1913a502)
This patch allows specifying 0 and 100 PERCENTILE values in speculative_retry.
It was possible to specify these values before #21825. #21825 prevented specifying
invalid values, like -1 and 101, but also prevented using 0 and 100.
On top of that, speculative_retry::to_sstring function did rounding when
formatting the string, which introduced inconsistency.
Fixes#26369
(cherry picked from commit da2ac90bb6)
The polymorphic abstract_type class serves as an interface and should not be copied.
To prevent accidental and unsafe copies, make it explicitly uncopyable.
(cherry picked from commit 77da4517d2)
When deserializing a vector whose elements are collections (e.g., set, list),
the operation raises a `std::bad_cast` exception.
This was caused by type slicing due to an incorrect assignment of a
polymorphic type by value instead of by reference. This resulted in a
failed `dynamic_cast` even when the underlying type was correct.
(cherry picked from commit 960fe3da60)
Load-and-stream is broken when running concurrently to the
finalization step of tablet split.
Consider this:
1) split starts
2) split finalization executes barrier and succeed
3) load-and-stream runs now, starts writing sstable (pre-split)
4) split finalization publishes changes to tablet metadata
5) load-and-stream finishes writing sstable
6) sstable cannot be loaded since it spans two tablets
two possible fixes (maybe both):
1) load-and-stream awaits for topology to quiesce
2) perform split compaction on sstable that spans both sibling tablets
This patch implements #1. By awaiting for topology to quiesce,
we guarantee that load-and-stream only starts when there's no
chance coordinator is handling some topology operation like
split finalization.
Fixes#26455.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 3abc66da5a)