Commit Graph

5834 Commits

Author SHA1 Message Date
Patryk Jędrzejczak
adaa0560d9 Merge 'Automatic cleanup improvements' from Gleb Natapov
This series allows an operator to reset the 'cleanup needed' flag if they have already cleaned up the node, so that automatic cleanup will not do it again. We also change 'nodetool cleanup' back to run cleanup on one node only (and reset the 'cleanup needed' flag at the end), while the new '--global' option allows running cleanup on all nodes that need it simultaneously.

Fixes https://github.com/scylladb/scylladb/issues/26866

Backport to all supported versions, since the current automatic cleanup behaviour may create load during cluster resizing that the operator does not expect.

Closes scylladb/scylladb#26868

* https://github.com/scylladb/scylladb:
  cleanup: introduce "nodetool cluster cleanup" command to run cleanup on all dirty nodes in the cluster
  cleanup: Add RESTful API to allow reset cleanup needed flag
2025-11-18 08:17:17 +02:00
Piotr Dulikowski
f0039381d2 Merge 'db/view/view_building_worker: support staging sstables intra-node migration and tablet merge' from Michał Jadwiszczak
This PR fixes staging sstables handling by the view building coordinator in case of intra-node tablet migration or tablet merge.

To support tablet merge, the worker stores the sstables grouped only by `table_id`, instead of by the `(table_id, last_token)` pair.
There shouldn't be that many staging sstables, so selecting the relevant ones for each `process_staging` task is fine.
For the intra-node migration support, the patch adds methods to load migrated sstables on the destination shard and to clean them up on the source shard.
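
A minimal sketch of the regrouping, with stand-in types (illustrative only, not the actual worker code):

```
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using table_id = uint64_t;     // stand-in for the real table id type
using sstable = std::string;   // stand-in for the real sstable handle

// Before: keyed by (table_id, last_token); tablet merge changes last_token
// boundaries and invalidates the key. After: keyed by table_id only, and
// each process_staging task filters for the sstables relevant to it.
std::map<table_id, std::vector<sstable>> staging_sstables;
```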

The patch should be backported to 2025.4

Fixes https://github.com/scylladb/scylladb/issues/26244

Closes scylladb/scylladb#26454

* github.com:scylladb/scylladb:
  service/storage_service: migrate staging sstables in view building worker during intra-node migration
  db/view/view_building_worker: support sstables intra-node migration
  db/view_building_worker: fix indent
  db/view/view_building_worker: don't organize staging sstables by last token
2025-11-17 08:53:19 +01:00
Pavel Emelyanov
1c9c4c8c8c Merge 'service: attach storage_service to migration_manager using pluggable' from Marcin Maliszkiewicz
Migration manager depends on storage service. For instance,
it has a reload_schema_in_bg background task which calls
_ss.local(), so it expects that storage service is not stopped
before it stops.

To solve this we use a permit approach, and during storage_service
stop:
- we ignore *new* code execution in migration_manager which would use
  storage_service
- but delay storage_service shutdown until all *existing*
  executions are done
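
A plain-C++ sketch of the permit idea, using standard-library types in place of the actual Seastar primitives (names are illustrative):

```
#include <condition_variable>
#include <mutex>
#include <optional>
#include <utility>

// migration_manager takes a permit before touching storage_service;
// stop() rejects new permits and waits until existing ones are released.
class permit_source {
    std::mutex _m;
    std::condition_variable _cv;
    int _outstanding = 0;
    bool _stopping = false;
public:
    class permit {
        permit_source* _s;
    public:
        explicit permit(permit_source* s) : _s(s) {}
        permit(permit&& o) noexcept : _s(std::exchange(o._s, nullptr)) {}
        ~permit() {
            if (!_s) return;
            std::lock_guard<std::mutex> l(_s->_m);
            if (--_s->_outstanding == 0) _s->_cv.notify_all();
        }
    };
    // *New* code paths get nullopt once stop() has begun and must bail out.
    std::optional<permit> try_get() {
        std::lock_guard<std::mutex> l(_m);
        if (_stopping) return std::nullopt;
        ++_outstanding;
        return permit(this);
    }
    // storage_service::stop(): reject new permits, wait for *existing* ones.
    void stop() {
        std::unique_lock<std::mutex> l(_m);
        _stopping = true;
        _cv.wait(l, [&] { return _outstanding == 0; });
    }
};
```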

Fixes scylladb/scylladb#26734

Backport: not needed. The problem has existed for a very long time; the code restructuring in https://github.com/scylladb/scylladb/commit/389afcd (and following commits) made
it hit more often, as _ss was called earlier, but that change has not been released yet.

Closes scylladb/scylladb#26779

* github.com:scylladb/scylladb:
  service: attach storage_service to migration_manager using pluggable
  service: migration_manager: coroutinize merge_schema_from
  service: migration_manager: coroutinize reload_schema
2025-11-14 15:14:28 +03:00
Piotr Dulikowski
2ccc94c496 Merge 'topology_coordinator: include joining node in barrier' from Michael Litvak
Previously, only nodes in the 'normal' state and decommissioning nodes
were included in the set of nodes participating in barrier and
barrier_and_drain commands. Joining nodes are not included because they
don't coordinate requests, given their CQL port is closed.

However, joining nodes may receive mutations from other nodes, for which
they may generate and coordinate materialized view updates. If their
group0 state is not synchronized it could cause lost view updates.
For example:

1. On the topology coordinator, the join completes and the joining node
   becomes normal, but the joining node's state lags behind. Since it's
   not synchronized by the barrier, it could be in an old state such as
   `write_both_read_old`.
2. A normal node coordinates a write and sends it to the new node as the
   new replica.
3. The new node applies the base mutation but doesn't generate a view
   update for it, because it calculates the base-view pairing according
   to its own state and replication map, and determines that it doesn't
   participate in the base-view pairing.

Therefore, since the joining node participates as a coordinator for view
updates, it should be included in these barriers as well. This ensures
that before the join completes, the joining node's state is
`write_both_read_new`, where it does generate view updates.

Fixes https://github.com/scylladb/scylladb/issues/26976

backport to previous versions since it fixes a bug in MV with vnodes

Closes scylladb/scylladb#27008

* github.com:scylladb/scylladb:
  test: add mv write during node join test
  topology_coordinator: include joining node in barrier
2025-11-14 12:41:16 +01:00
Patryk Jędrzejczak
1141342c4f Merge 'topology: refactor excluded nodes' from Petr Gusev
This PR refactors excluded nodes handling for tablets and topology. For tablets, a dedicated variable `topology::excluded_tablet_nodes` is introduced; for topology operations, the method get_excluded_nodes() is inlined into topology_coordinator and renamed to `get_excluded_nodes_for_topology_request`.

The PR improves code readability and efficiency; there are no behavior changes.

backport: this is a refactoring/optimization, no need to backport

Closes scylladb/scylladb#26907

* https://github.com/scylladb/scylladb:
  topology_coordinator: drop unused exec_global_command overload
  topology_coordinator: rename get_excluded_nodes -> get_excluded_nodes_for_topology_request
  topology_state_machine: inline get_excluded_nodes
  messaging_service: simplify and optimize ban_host
  storage_service: topology_state_load: extract topology variable
  topology_coordinator: excluded_tablet_nodes -> ignored_nodes
  topology_state_machine: add excluded_tablet_nodes field
2025-11-14 11:52:00 +01:00
Piotr Dulikowski
833b824905 Merge 'service/qos: Fall back to default scheduling group when using maintenance socket' from Dawid Mędrek
The service level controller relies on `auth::service` to collect
information about roles and the relation between them and the service
levels (those attached to them). Unfortunately, the service level
controller is initialized way earlier than `auth::service` and so we
had to prevent potential invalid queries of user service levels
(cf. 46193f5e79).

Unfortunately, that came at a price: it made the maintenance socket
incompatible with the current implementation of the service level
controller. The maintenance socket starts early, before the
`auth::service` is fully initialized and registered, and is exposed
almost immediately. If the user attempts to connect to Scylla within
this time window via the maintenance socket, one of the things that
will happen is choosing the right service level for the connection.
Since the `auth::service` is not registered, Scylla will fail an
assertion and crash.

A similar scenario occurs when using maintenance mode. The maintenance
socket is how the user communicates with the database, and we're not
prepared for that either.

To avoid unnecessary crashes, we add new branches if the passed user is
absent or if it corresponds to the anonymous role. Since the role
corresponding to a connection via the maintenance socket is the anonymous
role, that solves the problem.

Some accesses to `auth::service` are not affected and we do not modify
those.
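
A sketch of the shape of those branches, with hypothetical stand-in types (illustrative only; the real methods live in the service level controller):

```
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for the real qos/auth types.
struct scheduling_group { std::string name; };
using user_name = std::optional<std::string>;        // nullopt == anonymous role
scheduling_group default_scheduling_group() { return {"main"}; }
bool auth_integration_registered() { return true; }  // stand-in for the real check

scheduling_group get_user_scheduling_group(const std::optional<user_name>& user) {
    // New branches: an absent user or the anonymous role must not reach
    // auth_integration, which may not be registered yet (maintenance socket).
    if (!user || !*user) {
        return default_scheduling_group();
    }
    if (!auth_integration_registered()) {
        throw std::runtime_error("service levels unavailable: auth service not ready");
    }
    // Normal path: resolve the role's service level via auth_integration.
    return {"sl:" + **user};
}
```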

Fixes scylladb/scylladb#26816

Backport: yes. This is a fix of a regression.

Closes scylladb/scylladb#26856

* github.com:scylladb/scylladb:
  test/cluster/test_maintenance_mode.py: Wait for initialization
  test: Disable maintenance mode correctly in test_maintenance_mode.py
  test: Fix keyspace in test_maintenance_mode.py
  service/qos: Do not crash Scylla if auth_integration absent
2025-11-14 11:12:28 +01:00
Marcin Maliszkiewicz
958d04c349 service: attach storage_service to migration_manager using pluggable
Migration manager depends on storage service. For instance,
it has a reload_schema_in_bg background task which calls
_ss.local(), so it expects that storage service is not stopped
before it stops.

To solve this we use a permit approach, and during storage_service
stop:
- we ignore *new* code execution in migration_manager which would use
  storage_service
- but delay storage_service shutdown until all *existing*
  executions are done

Fixes scylladb/scylladb#26734
2025-11-14 08:50:19 +01:00
Marcin Maliszkiewicz
cf9b2de18b service: migration_manager: coroutinize merge_schema_from
It's needed to easily keep alive the pluggable storage_service
permit in a following commit.
2025-11-14 08:50:19 +01:00
Marcin Maliszkiewicz
5241e9476f service: migration_manager: coroutinize reload_schema
It's needed to easily keep alive the pluggable storage_service
permit in a following commit.
2025-11-14 08:50:18 +01:00
Michael Litvak
eefae4cc4e migration_manager: pass timestamp to pre_create
Pass the write timestamp as a parameter to the
on_pre_create_column_families notification.
2025-11-13 16:59:43 +01:00
Petr Gusev
d3bd8c924d topology_coordinator: drop unused exec_global_command overload 2025-11-13 14:19:03 +01:00
Petr Gusev
45d1302066 topology_coordinator: rename get_excluded_nodes -> get_excluded_nodes_for_topology_request
This method is specific to topology requests -- node joining, replacing,
decommissioning etc, everything that goes through
topology::transition_state::write_both_read_old and
raft_topology_cmd::command::stream_ranges. It shouldn't be used in
other contexts -- to handle global topology requests
(e.g. truncate table) or for tablets. Rename the method to make this
more explicit.
2025-11-13 14:19:03 +01:00
Petr Gusev
bf8cc5358b topology_state_machine: inline get_excluded_nodes
The method is specific to topology_coordinator, which already contains
a wrapper for it, so inline the topology method into it.

Also, make the logic of the method more explicit and remove multiple
transition_nodes lookups.
2025-11-13 14:18:46 +01:00
Michael Litvak
13d94576e5 topology_coordinator: include joining node in barrier
Previously, only nodes in the 'normal' state and decommissioning nodes
were included in the set of nodes participating in barrier and
barrier_and_drain commands. Joining nodes are not included because they
don't coordinate requests, given their CQL port is closed.

However, joining nodes may receive mutations from other nodes, for which
they may generate and coordinate materialized view updates. If their
group0 state is not synchronized it could cause lost view updates.
For example:

1. On the topology coordinator, the join completes and the joining node
   becomes normal, but the joining node's state lags behind. Since it's
   not synchronized by the barrier, it could be in an old state such as
   `write_both_read_old`.
2. A normal node coordinates a write and sends it to the new node as the
   new replica.
3. The new node applies the base mutation but doesn't generate a view
   update for it, because it calculates the base-view pairing according
   to its own state and replication map, and determines that it doesn't
   participate in the base-view pairing.

Therefore, since the joining node participates as a coordinator for view
updates, it should be included in these barriers as well. This ensures
that before the join completes, the joining node's state is
`write_both_read_new`, where it does generate view updates.

Fixes scylladb/scylladb#26976
2025-11-13 12:24:31 +01:00
Piotr Dulikowski
2e5eb92f21 Merge 'cdc: use CDC schema that is compatible with the base schema' from Michael Litvak
When generating CDC log mutations for some base mutation, use a CDC schema that is compatible with the base schema.

The compatible CDC schema has for every base column a corresponding CDC column with the same name. If using a non-compatible schema, we may encounter a situation, especially during ALTER, that we have a mutation with a base column set with some value, but the CDC schema doesn't have a column by that name. This would cause the user request to fail with an error.

We add to the schema object a schema_ptr that for CDC-enabled tables points to the schema object of the CDC table that is compatible with the schema. It is set by the schema merge algorithm when creating the schema for a table that is created or altered. We use the fact that a base table and its CDC table are created and altered in the same group0 operation, and this way we can find and set the cdc schema for a base table.

When transporting the base schema as a frozen schema between shards, we transport with it the frozen cdc schema as well.

The patch starts with a series of refactoring commits that make extending the frozen schema easier and clean up some duplication in the frozen-schema code. We combine the two types `frozen_schema_with_base_info` and `view_schema_and_base_info` into a single type `extended_frozen_schema` that holds a frozen schema together with the additional data that is not part of the schema mutations but needs to be transported with it to unfreeze it: the base_info, and the frozen CDC schema, which is added in a later commit.
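
Roughly, the combined type has this shape (stub types; field names inferred from the commit titles, not the actual declarations):

```
#include <optional>
#include <vector>

// Stubs standing in for the real ScyllaDB types.
struct canonical_mutation {};
struct frozen_schema { std::vector<canonical_mutation> mutations; };
struct base_info {};  // data a view schema needs to be unfrozen

// A frozen schema plus the data that is not part of the schema mutations
// but must travel with it across shards to unfreeze it: the view's
// base_info and, for CDC-enabled base tables, the compatible CDC schema.
struct extended_frozen_schema {
    frozen_schema schema;
    std::optional<base_info> base;
    std::optional<frozen_schema> cdc;
};
```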

Fixes https://github.com/scylladb/scylladb/issues/26405

backport not needed - enhancement

Closes scylladb/scylladb#24960

* github.com:scylladb/scylladb:
  test: cdc: test cdc compatible schema
  cdc: use compatible cdc schema
  db: schema_applier: create schema with pointer to CDC schema
  db: schema_applier: extract cdc tables
  schema: add pointer to CDC schema
  schema_registry: remove base_info from global_schema_ptr
  schema_registry: use extended_frozen_schema in schema load
  schema_registry: replace frozen_schema+base_info with extended_frozen_schema
  frozen_schema: extract info from schema_ptr in the constructor
  frozen_schema: rename frozen_schema_with_base_info to extended_frozen_schema
2025-11-13 10:11:54 +01:00
Pavel Emelyanov
f47f2db710 Merge 'Support local primary-replica-only for native restore' from Robert Bindar
This PR extends the restore API so that it accepts primary_replica_only as a parameter, and it combines the concept of primary-replica-only with scoped streaming so that, with:
- `scope=all primary_replica_only=true` The restoring node will stream to the global primary replica only
- `scope=dc primary_replica_only=true` The restoring node will stream to the local primary replica only.
- `scope=rack primary_replica_only=true` The restoring node will stream only to the primary replica from within its own rack (with rf=#racks, the restoring node will stream only to itself)
- `scope=node primary_replica_only=true` is not allowed; the restoring node always streams only to itself, so the primary_replica_only parameter wouldn't make sense.

The PR also adjusts the `nodetool refresh` restriction on running restore with both primary_replica_only and scope, adds primary_replica_only to `nodetool restore`, and adds cluster tests for primary replica within scope.

Fixes #26584

Closes scylladb/scylladb#26609

* github.com:scylladb/scylladb:
  Add cluster tests for checking scoped primary_replica_only streaming
  Improve choice distribution for primary replica
  Refactor cluster/object_store/test_backup
  nodetool restore: add primary-replica-only option
  nodetool refresh: Enable scope={all,dc,rack} with primary_replica_only
  Enable scoped primary replica only streaming
  Support primary_replica_only for native restore API
2025-11-13 12:11:18 +03:00
Tomasz Grabiec
10b893dc27 Merge 'load_stats: fix bug in migrate_tablet_size()' from Ferenc Szili
`topology_coordinator::migrate_tablet_size()` was introduced in 10f07fb95a. It has a bug where the has_tablet_size() lambda always returns false because of a bad comparison of iterators after a table and tablet search:

```
if (auto table_i = tables.find(gid.table); table_i != tables.find(gid.table)) {
    if (auto size_i = table_i->second.find(trange); size_i != table_i->second.find(trange)) {
```
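
Presumably the intended comparisons are against `end()`, i.e.:

```
if (auto table_i = tables.find(gid.table); table_i != tables.end()) {
    if (auto size_i = table_i->second.find(trange); size_i != table_i->second.end()) {
```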

This change also fixes a problem where `migrate_tablet_size()` would crash with a `std::out_of_range` if the pending node was not present in load_stats.

This change fixes these two problems and moves the functionality into a separate method of `load_stats`. It also adds tests for the new method.

A version containing this bug has not been released yet, so no backport is needed.

Closes scylladb/scylladb#26946

* github.com:scylladb/scylladb:
  load_stats: add test for migrate_tablet_size()
  load_stats: fix problem with tablet size migration
2025-11-12 23:48:37 +01:00
Petr Gusev
9fed80c4be messaging_service: simplify and optimize ban_host
We do one cross-shard call for all left+ignored nodes.
2025-11-12 12:27:44 +01:00
Petr Gusev
52cccc999e storage_service: topology_state_load: extract topology variable
It's inconvenient to always write the long expression
_topology_state_machine._topology.
2025-11-12 12:27:44 +01:00
Petr Gusev
66063f202b topology_coordinator: excluded_tablet_nodes -> ignored_nodes
ignored_nodes is sufficient in these cases. excluded_tablet_nodes
also includes left_nodes_rs, which are not needed
here — global_token_metadata_barrier runs the barrier only
on normal and transition nodes, not on left nodes.
2025-11-12 12:27:44 +01:00
Petr Gusev
82da83d0e5 topology_state_machine: add excluded_tablet_nodes field
topology_coordinator::is_excluded() creates a temporary hash
map for each call. This is probably not a performance problem, since
left_nodes_rs contains only those left nodes that are referenced
from tablet replicas; this happens temporarily while e.g. a replaced
node is being rebuilt. On the other hand, why not just have a
dedicated field in the topology_state_machine, so this code wouldn't
look suspicious.
2025-11-12 12:27:43 +01:00
Gleb Natapov
e872f9cb4e cleanup: Add RESTful API to allow reset cleanup needed flag
Cleaning up a node using the per keyspace/table interface does not reset
the 'cleanup needed' flag in the topology. The assumption was that running
cleanup on an already clean node does nothing and completes quickly. But due to
https://github.com/scylladb/scylladb/issues/12215 (which is closed as
WONTFIX) this is not the case. This patch provides the ability to reset
the flag in the topology if the operator has already cleaned up the node
manually.
2025-11-12 10:56:57 +02:00
Ferenc Szili
b77ea1b8e1 load_stats: fix problem with tablet size migration
This patch fixes a bug with tablet size migration in load_stats.
The has_tablet_size() lambda in topology_coordinator::migrate_tablet_size()
was returning false in all cases due to an incorrect search iterator
comparison after a table and tablet search.

This change moves the load_stats migrate_tablet_sizes() functionality
into a separate method of load_stats.
2025-11-11 14:26:09 +01:00
Nikos Dragazis
56e5dfc14b migration_manager: Add missing validations for schema extensions
The migration manager offers some free functions to prepare mutations
for a new/updated table/view. Most of them include a validation check
for the schema extensions, but in the following ones it's missing:

* `prepare_new_column_family_announcement` (overload with vector as out parameter)
* `prepare_new_column_families_announcement`

Presumably, this was just an omission. It's also not a very important
one, since the only extension with validation logic is the
`encryption_schema_extension`, and none of these functions is connected
to user queries where encryption options can be provided in the schema.
User queries go through the other
`prepare_new_column_family_announcement` overload, which does perform a
validation check.

Add validation in the missing places.

Fixes #26470.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#26487
2025-11-11 10:08:58 +02:00
Robert Bindar
817fdadd49 Improve choice distribution for primary replica
I noticed during tests that `maybe_get_primary_replica`
would not distribute the choice of primary replica uniformly,
because `info.replicas` would be ordered one way on some shards and
differently on others, making the function choose the same node as
primary replica multiple times when it clearly could have
chosen a different node.

This patch sorts the replica set before passing it through the
scope filter.
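
In miniature (illustrative values), the fix works like this:

```
#include <algorithm>
#include <vector>

int pick_primary(std::vector<int> replicas) {
    // The same replica set may arrive in a different order on each shard;
    // sorting first makes a deterministic "pick" agree across shards, so
    // the choice of primary replica is distributed as intended.
    std::sort(replicas.begin(), replicas.end());
    return replicas.front();
}
```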

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2025-11-11 09:18:01 +02:00
Dawid Mędrek
c0f7622d12 service/qos: Do not crash Scylla if auth_integration absent
If the user connects to Scylla via the maintenance socket, it may happen
that `auth_integration` has not been registered in the service level
controller yet. One example is maintenance mode, where that will never
happen; another is when the connection occurs before Scylla is fully
initialized.

To avoid unnecessary crashes, we add new branches if the passed user is
absent or if it corresponds to the anonymous role. Since the role
corresponding to a connection via the maintenance socket is the anonymous
role, that solves the problem.

In those cases, we completely circumvent any calls to `auth_integration`
and handle them separately. The modified methods are:

* `get_user_scheduling_group`,
* `with_user_service_level`,
* `describe_service_levels`.

For the first two, the new behavior is in line with the previous
implementation of those functions. The last behaves differently now,
but since it's a soft error, crashing the node is not necessary anyway.
We throw an exception instead, whose error message should give the user
a hint of what might be wrong.

The other uses of `auth_integration` within the service level controller
are not problematic:

* `find_effective_service_level`,
* `find_cached_effective_service_level`.

They take the name of a role as their argument. Since the anonymous role
doesn't have a name, it's not possible to call them with it.

Fixes scylladb/scylladb#26816
2025-11-10 19:21:36 +01:00
Michał Jadwiszczak
9345c33d27 service/storage_service: migrate staging sstables in view building
worker during intra-node migration

Use the methods introduced in the previous commit and:
- load staging sstables into the view building worker on the target
  shard, at the end of the `streaming` stage
- clear migrated staging sstables on the source shard in the `cleanup` stage

This patch also removes the skip mark in `test_staging_sstables_with_tablet_merge`.

Fixes scylladb/scylladb#26244
2025-11-10 10:38:08 +01:00
Abhinav Jha
ab0e0eab90 raft topology: skip non-idempotent steps in decommission path to avoid problems during races
There are issues in the left_token_ring transition state
execution in the decommissioning path. Under concurrent mutation race
conditions, we enter left_token_ring more than once, and if
we enter left_token_ring a second time, we try to barrier the decommissioned
node, which at this point is no longer possible. That's what causes the errors.

This PR resolves the issue by adding a check right at the start of
left_token_ring for whether the first topology state update, which marks
the request as done, has completed. If it has, it's confirmed that the
flow is entering left_token_ring a second time and the steps preceding
the request status update should be skipped. In such cases, all the remaining
steps are skipped and the topology node status update (which threw an error in
the previous trial) is executed directly. The node's removal status from group0 is
also checked, and the remove operation is retried if it failed last time.
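
A sketch of the guard's shape, with hypothetical helper names (illustrative; the real check sits at the start of the left_token_ring handler):

```
// Hypothetical stand-ins for the coordinator's state and steps.
bool request_marked_done()         { return true;  }  // first state update applied?
bool node_removed_from_group0()    { return false; }
void remove_node_from_group0()     {}
void update_node_topology_status() {}
void run_remaining_left_token_ring_steps() {}

void handle_left_token_ring() {
    if (request_marked_done()) {
        // Second entry after a concurrent-mutation race: the decommissioned
        // node can no longer answer a barrier, so skip the non-idempotent
        // steps and redo only what may have failed in the previous trial.
        if (!node_removed_from_group0()) {
            remove_node_from_group0();
        }
        update_node_topology_status();
        return;
    }
    run_remaining_left_token_ring_steps();  // normal, first-time path
}
```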

Although these changes are made with regard to the decommission operation's
behavior in the `left_token_ring` transition state, the PR doesn't
interfere with the core logic, so it should not derail any rollback-specific
logic. The changes just prevent some non-idempotent operations from
re-occurring in case of failures. The rest of the core logic remains intact.

A test is also added to confirm that this works as intended.

Fixes: scylladb/scylladb#20865

Backport is not needed, since this is not a super critical bug fix.

Closes scylladb/scylladb#26717
2025-11-07 10:07:49 +01:00
Patryk Jędrzejczak
d6c64097ad Merge 'storage_proxy: use gates to track write handlers destruction' from Petr Gusev
In [#26408](https://github.com/scylladb/scylladb/pull/26408) a `write_handler_destroy_promise` class was introduced to wait for the destruction of `abstract_write_response_handler` instances. We strived to minimize the memory footprint of `abstract_write_response_handler`; with `write_handler_destroy_promise`-es we required only a single additional int. It turned out that in some cases a lot of write handlers can be scheduled for deletion at the same time, in which case the vector can become big and cause 'oversized allocation' seastar warnings.

Another concern with `write_handler_destroy_promise`-es [was that they were more complicated than it was worth](https://github.com/scylladb/scylladb/pull/26408#pullrequestreview-3361001103).

In this commit we replace `write_handler_destroy_promise` with simple gates. One or more gates can be attached to an `abstract_write_response_handler` to wait for its destruction. We use `utils::small_vector` to store the attached gates. The limit of 2 was chosen because we expect two gates at the same time in most cases: one is `storage_proxy::_write_handlers_gate`, which is used to wait for all handlers in `cancel_all_write_response_handlers`; another can be attached by a caller of `cancel_write_handlers`. Nothing stops several cancel_write_handlers calls from happening at the same time, but it should be rare.

`sizeof(utils::small_vector)` is 40; this is a `40.0 / 488 * 100 ~ 8%` increase in `sizeof(abstract_write_response_handler)`, which seems acceptable.
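
A plain-C++ sketch of the scheme, with a homemade gate standing in for the Seastar one and boost's small_vector for `utils::small_vector` (illustrative, not the actual ScyllaDB code):

```
#include <boost/container/small_vector.hpp>
#include <condition_variable>
#include <mutex>
#include <utility>

// Minimal gate: close() blocks until every holder has been destroyed.
struct gate {
    std::mutex m;
    std::condition_variable cv;
    int count = 0;
    struct holder {
        gate* g;
        explicit holder(gate& gg) : g(&gg) { std::lock_guard<std::mutex> l(g->m); ++g->count; }
        holder(holder&& o) noexcept : g(std::exchange(o.g, nullptr)) {}
        ~holder() {
            if (!g) return;
            std::lock_guard<std::mutex> l(g->m);
            if (--g->count == 0) g->cv.notify_all();
        }
    };
    holder hold() { return holder(*this); }
    void close() {  // wait for all attached holders to be released
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&] { return count == 0; });
    }
};

struct write_response_handler {
    // Inline capacity of 2 covers the common case (the proxy-wide gate plus
    // at most one cancel_write_handlers caller) without a heap allocation.
    boost::container::small_vector<gate::holder, 2> gate_holders;
    void attach(gate& g) { gate_holders.emplace_back(g.hold()); }
    // Destroying the handler destroys the holders, releasing the gates.
};
```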

Fixes [scylladb/scylladb#26788](https://github.com/scylladb/scylladb/issues/26788)

backport: need to backport to 2025.4 (LWT for tablets release)

Closes scylladb/scylladb#26827

* https://github.com/scylladb/scylladb:
  storage_proxy: use coroutine::maybe_yield();
  storage_proxy: use gates to track write handlers destruction
2025-11-06 10:17:04 +01:00
Petr Gusev
5bda226ff6 storage_proxy: use coroutine::maybe_yield();
This is a small "while at it" refactoring -- better to use
coroutine::maybe_yield with co_await-s.
2025-11-05 14:38:19 +01:00
Petr Gusev
4578304b76 storage_proxy: use gates to track write handlers destruction
In #26408 a write_handler_destroy_promise class was introduced to
wait for abstract_write_response_handler instances destruction. We
strived to minimize the memory footprint of
abstract_write_response_handler; with write_handler_destroy_promise-es
we required only a single additional int. It turned out that in some
cases a lot of write handlers can be scheduled for deletion
at the same time, in such cases the
vector<write_handler_destroy_promise> can become big and cause
'oversized allocation' seastar warnings.

Another concern with write_handler_destroy_promise-es was that they
were more complicated than it was worth.

In this commit we replace write_handler_destroy_promise with simple
gates. One or more gates can be attached to an
abstract_write_response_handler to wait for its destruction. We use
utils::small_vector<gate::holder, 2> to store the attached gates.
The limit 2 was chosen because we expect two gates at the same time
in most cases. One is storage_proxy::_write_handlers_gate,
which is used to wait for all handlers in
cancel_all_write_response_handlers. Another one can be attached by
a caller of cancel_write_handlers. Nothing stops several
cancel_write_handlers calls from happening at the same time, but it should be
rare.

The sizeof(utils::small_vector<gate::holder, 2>) is 40; this is a
40.0 / 488 * 100 ~ 8% increase in
sizeof(abstract_write_response_handler), which seems acceptable.

Fixes scylladb/scylladb#26788
2025-11-05 14:37:52 +01:00
Tomasz Grabiec
f8879d797d tablet_allocator: Avoid load balancer failure when replacing the last node in a rack
Introduced in 9ebdeb2

The problem is specific to node replace and rack-list RF. The
culprit is in the part of the load balancer which determines a rack's
shard count. If we're replacing the last node, the rack will contain
no normal nodes, and shards_per_rack will have no entry for the rack,
on which the table still has replicas. This throws std::out_of_range,
which fails the tablet draining stage, and the node replace fails.
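
The failure mode in miniature (illustrative):

```
#include <map>
#include <string>

int main() {
    // Replacing the last node in "rack2" leaves it with no normal nodes, so
    // it gets no entry here even though a table still has replicas in it.
    std::map<std::string, unsigned> shards_per_rack = {{"rack1", 8}};
    return shards_per_rack.at("rack2");  // throws std::out_of_range
}
```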

No backport because the problem exists only on master.

Fixes #26768

Closes scylladb/scylladb#26783
2025-11-05 15:49:51 +03:00
Wojciech Mitros
977fa91e3d view_building_coordinator: rollback tasks on the leaving tablet replica
When a tablet migration is started, we abort the corresponding view
building tasks (i.e. we change the state of those tasks to "ABORTED").
However, we don't change the host and shard of these tasks until the
migration successfully completes. When for some reason we have to
roll back the migration, that means the migration didn't finish and
the aborted task still has the host and shard of the migration
source. So when we recreate tasks that should no longer be aborted
due to a rolled-back migration, we should look at the aborted tasks
of the source (leaving) replica. But we don't do that; we look at
the aborted tasks of the target replica.
In this patch we adjust the rollback mechanism to recreate tasks
for the migration source instead of the destination. We also fix the
test that should have detected this issue - the injection that
the test was using didn't make us roll back; we simply retried
a stage of the tablet migration. By using one_shot=False and adding
a second injection, we can now guarantee that the migration will
eventually fail and we'll continue to the 'cleanup_target' and
'revert_migration' stages.

Fixes https://github.com/scylladb/scylladb/issues/26691

Closes scylladb/scylladb#26825
2025-11-05 10:44:06 +01:00
Avi Kivity
95700c5f7f Merge 'Support counters with tablets' from Michael Litvak
Support the counters feature in tablets keyspaces.

The main change is to fix the counter update during tablet intranode migration.

A counter cell is c = map<host_id, value>. A counter update is applied by doing a read-modify-write on a leader replica to retrieve the current host's counter value and transform the mutation to contain the updated value for the host, then applying the mutation and replicating it to the other hosts. The read-modify-write is protected against concurrent updates by locking the counter cell.

When the counter is migrated between two shards, it's not enough to lock the counter on the read shard, because in the write_both_read_new stage the read shard is switched, and then concurrent updates can reach either the old or the new shard. To keep the counter update exclusive, we lock both shards when in the write_both_read_new stage.
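
A sketch of the locking rule (illustrative types, not the actual storage proxy code):

```
#include <cstdint>
#include <map>
#include <mutex>

using host_id = int;
using counter_cell = std::map<host_id, int64_t>;  // c = map<host_id, value>

// During write_both_read_new either shard may serve the read, so the
// read-modify-write must hold the per-counter lock on both of them.
void apply_counter_update(counter_cell& c, host_id self, int64_t delta,
                          std::mutex& old_shard_lock, std::mutex& new_shard_lock,
                          bool write_both_read_new) {
    if (write_both_read_new) {
        std::scoped_lock lock(old_shard_lock, new_shard_lock);  // deadlock-free
        c[self] += delta;  // read this host's value, add, write back
    } else {
        std::scoped_lock lock(old_shard_lock);
        c[self] += delta;
    }
}
```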

Also, when applying the transformed mutation we need to respect write_both stages and apply the mutation on both shards. We change it to use `apply_on_shards` similarly to other methods in storage proxy.

The change applies to both tablets and vnodes; they use the same implementation. For vnodes the behavior should remain equivalent, up to some small reordering of the code, since vnodes have no intranode migration and the logic reduces to a single read shard = write shard.

Fixes https://github.com/scylladb/scylladb/issues/18180

no backport - new feature

Closes scylladb/scylladb#26636

* github.com:scylladb/scylladb:
  docs: counters now work with tablets
  pgo: enable counters with tablets
  test: enable counters tests with tablets
  test: add counters with tablets test
  cql3: remove warning when creating keyspace with tablets
  cql3: allow counters with tablets
  storage_proxy: lock all read shards for counter update
  storage_proxy: apply counter mutation on all write shards
  storage_proxy: move counter update coordination to storage proxy
  storage_proxy: refactor mutate_counter_on_leader
  replica/db: add counter update guard
  replica/db: split counter update helper functions
2025-11-03 22:28:10 +01:00
Raphael S. Carvalho
7f34366b9d sstables_loader: Don't bypass synchronization with busy topology
The patch c543059f86 fixed the synchronization issue between tablet
split and load-and-stream. The synchronization worked only with
raft topology, and therefore was disabled with gossip.
The check used storage_service::raft_topology_change_enabled(),
but the topology kind is only available/set on shard 0, which caused
the synchronization to be bypassed when load-and-stream ran on
any shard other than 0.

The reason the reproducer didn't catch it is that it was restricted
to a single CPU. It will now run with multiple CPUs and catch the
observed problem.

Fixes #22707

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#26730
2025-11-03 18:10:08 +01:00
Michael Litvak
296b116ae2 storage_proxy: lock all read shards for counter update
Previously, in a counter update we locked the read shard to protect the
counter's read-modify-write against concurrent updates.

This is not sufficient when the counter is migrated between different
shards, because there is a stage where the read shard switches from the
old shard to the new shard, and during that switch there can be
concurrent counter updates on both shards. If each shard takes only its
own lock, the operations will not be exclusive anymore, and this can
cause lost counter updates.

To fix this, we acquire the counter lock on both shards in the stage
write_both_read_new, when both shards can serve reads. This guarantees
that counter updates continue to be exclusive during intranode
migration.
2025-11-03 16:04:35 +01:00
Michael Litvak
de321218bc storage_proxy: apply counter mutation on all write shards
When applying a counter mutation, use apply_on_shards to apply the
mutation on all write shards, similarly to the way other mutations are
applied in the storage proxy. Previously the mutation was applied only
on the current shard which is the read shard.

This is needed to respect the write_both stages of intranode migration
where we need to apply the mutation on both the old and the new shards.
2025-11-03 16:03:29 +01:00
Michael Litvak
c7e7a9e120 storage_proxy: move counter update coordination to storage proxy
Refactor the counter update to split the functions and have them called
by the storage proxy to prepare for a later change.

Previously, in mutate_counter the storage proxy called the replica
function apply_counter_update, which does a few things:
1. checks that the operation can be done: check timeout, disk utilization
2. acquire counter locks
3. do read-modify-write and transform the counter mutation
4. apply the mutation in the replica

In this commit we split these functions and call them from the storage
proxy, so that the storage proxy has better control when we later change
it to work across multiple shards. For
example, we will want to acquire locks on multiple shards, transform it
on one shard, and then apply the mutation on multiple shards.

After the change it works as follows in storage proxy:
1. acquire counter locks
2. call replica prepare to check the operation and transform the mutation
3. call replica apply to apply the transformed mutation
2025-11-03 15:59:46 +01:00
Michael Litvak
579031cfc8 storage_proxy: refactor mutate_counter_on_leader
Slightly reorganize the mutate counter function to prepare it for a
later change.

Move the code that finds the read shard and invokes the rest of the
function on the read shard to the caller function. This simplifies the
function mutate_counter_on_leader_and_replicate which now runs on the
read shard and will make it easier to extend.
2025-11-03 08:43:11 +01:00
Petr Gusev
88765f627a paxos_state: get_replica_lock: remove shard check
This check is incorrect: the current shard may be looking at
the old version of the tablets map:
* an accept RPC comes to replica shard 0, which is already at write_both_read_new
* the new shard is shard 1, so paxos_state::accept is called on shard 1
* shard 1 is still at "streaming" -> shards_ready_for_reads() returns old
shard 0

Fixes scylladb/scylladb#26801

Closes scylladb/scylladb#26809
2025-10-31 21:37:39 +01:00
Avi Kivity
7a72155374 Merge 'Introduce nodetool excludenode' from Tomasz Grabiec
If a node is dead and cannot be brought back, tablet migrations are
stuck, until the node is explicitly marked as "permanently dead" /
"ignored node" / "excluded" (name differs in different contexts).

Currently, this is done during removenode and replace operations but
it should be possible to only mark the node as dead, for the purpose
of unblocking migrations or other topology operations, without doing
the actual removenode, because full removal might be currently
impossible, or not desirable due to lack of capacity or priorities.

This patch introduces this kind of API:

```
  nodetool excludenode <host-id> [ ... <host-id> ]
```

Having this kind of API is an improvement in user experience in
several cases. For example, when we lose a rack, the only viable
option for recovery is to run removenode with an extra
--ignore-dead-nodes option. This removenode will fail in the tablet
draining phase, as there is no live node in the rack to rebuild
replicas in. This is confusing to the operator. But necessary before
ALTER KEYSPACE can proceed in order to change replication options to
drop the rack from RF.

Having this API allows operators to have more unified procedures,
where "nodetool excludenode" is always the first step of recovery,
which unblocks further topology operations, both those which restore
capacity, but also auto-scaling, tablet split/merge, load balancing,
etc.

Fixes #21281

The PR also changes "nodetool status" to show excluded nodes;
they have 'X' in their status instead of 'D'.

Closes scylladb/scylladb#26659

* github.com:scylladb/scylladb:
  nodetool: status: Show excluded nodes as having status 'X'
  test: py: Test scenario involving excludenode API
  nodetool: Introduce excludenode command
2025-10-31 22:14:57 +02:00
Michael Litvak
e7dbccd59e cdc: use chunked_vector instead of vector for stream ids
Use utils::chunked_vector instead of std::vector to store CDC stream
sets for tablets.

A CDC stream set usually represents all streams for a specific table and
timestamp, and has a stream id for each tablet of the table. Each stream
id is represented by 16 bytes, so the vector could require quite large
contiguous allocations for a table that has many tablets (e.g. a table
with a million tablets would need a single ~16 MB contiguous allocation).
Change it to chunked_vector to avoid large contiguous allocations.

Fixes scylladb/scylladb#26791

Closes scylladb/scylladb#26792
2025-10-31 13:02:34 +01:00
Tomasz Grabiec
1c0d847281 Merge 'load_balancer: load_stats reconcile after tablet migration and table resize' from Ferenc Szili
This change adds the ability to move tablet sizes in load_stats after a tablet migration or table resize (split/merge). This is needed because the size-based load balancer needs tablet size data that is as accurate as possible, in order to work on a fresh tablet size distribution and issue correct tablet migrations.

This is the second part of the size based load balancing changes:

- First part for tablet size collection via load_stats: #26035
- Second part reconcile load_stats: #26152
- The third part for load_sketch changes: #26153
- The fourth part which performs tablet load balancing based on tablet size: #26254

This is a new feature and backport is not needed.

Closes scylladb/scylladb#26152

* github.com:scylladb/scylladb:
  load_balancer: load_stats reconcile after tablet migration and table resize
  load_stats: change data structure which contains tablet sizes
2025-10-31 09:58:25 +01:00
Tomasz Grabiec
55ecd92feb nodetool: Introduce excludenode command
If a node is dead and cannot be brought back, tablet migrations are
stuck, until the node is explicitly marked as "permanently dead" /
"ignored node" / "excluded" (name differs in different contexts).

Currently, this is done during removenode and replace operations but
it should be possible to only mark the node as dead, for the purpose
of unblocking migrations or other topology operations, without doing
the actual removenode, because full removal might be currently
impossible, or not desirable due to lack of capacity or priorities.

This patch introduces this kind of API:

  nodetool excludenode <host-id> [ ... <host-id> ]

Having this kind of API is an improvement in user experience in
several cases. For example, when we lose a rack, the only viable
option for recovery is to run removenode with an extra
--ignore-dead-nodes option. This removenode will fail in the tablet
draining phase, as there is no live node in the rack to rebuild
replicas in. This is confusing to the operator. But necessary before
ALTER KEYSPACE can proceed in order to change replication options to
drop the rack from RF.

Having this API allows operators to have more unified procedures,
where "nodetool excludenode" is always the first step of recovery,
which unblocks further topology operations, both those which restore
capacity, but also auto-scaling, tablet split/merge, load balancing,
etc.

Fixes #21281
2025-10-31 09:03:20 +01:00
Avi Kivity
04a289cae6 Merge 'Auto expand to rack list' from Tomasz Grabiec
We want to move towards rack-list based replication factors for tablets being the default mode, and in the future the only supported mode. This PR is a step towards that. We auto-expand numeric RF to a rack list on keyspace creation and ALTER when the rf_rack_valid_keyspaces option is enabled.

The PR is mostly about adjusting tests. The main logic change is in the last patch, which modifies option post-processing in ks_prop_defs.

Fixes #26397

Closes scylladb/scylladb#26692

* github.com:scylladb/scylladb:
  cql3: ks_prop_defs: Expand numeric RF to rack list
  locator: Move rack_list to topology.hh
  alternator: Do not set RF for zero-token DCs
  alternator: Switch keyspace creation to use ks_prop_defs
  test: alternator: Adjust for rack lists
  cql3: Move validation of invalid ALTER KEYSPACE earlier, to ks_prop_defs
  test: cqlpy: Mark tests using rack lists as scylla-only
  test: Switch to rack-list based RF
  test: Generalize tests to work with both numeric RF and rack lists
  test: cluster: test_zero_token_nodes_multidc: Adjust to rack list RF
  test: Prepare for handling errors specific to rack list path
  test: cluster: dtest: alternator: Force RF=1 in test_putitem_contention
  test: Create cluster with multiple racks in multi-dc setups
  test: boost: network_topology_strategy_test: Adjust to rack-list RF
  test: tablets: Adjust to rack list
  test: cluster: test_group0_schema_versioning: Use smaller RF to respect rf-rack-validness
  test: tablets_test: Convert test_per_shard_goal_mixed_dc_rf to be rack-valid
  test: object_store: test_backup: Adjust for rack lists
  test: cluster: tablets: Do not move tablet across racks in test_tablet_transition_sanity
  test: cluster: mv: Do not move tablets across racks
  test: cluster: util: Fix docstring for parse_replication_options()
  tablets, topology_coordinator: Skip tablet draining on replace
2025-10-30 21:54:08 +02:00
Tomasz Grabiec
28f6bdc99b cql3: ks_prop_defs: Expand numeric RF to rack list
Auto-expands numeric RF in CREATE/ALTER KEYSPACE statements for
new DCs specified in the statement.

Doesn't auto-expand existing options, as the rack choice may not be in
line with current replica placement. This requires co-locating tablet
replicas, and tracking of co-location state, which is not implemented yet.

Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>
2025-10-29 23:32:59 +01:00
Tomasz Grabiec
288e75fe22 tablets, topology_coordinator: Skip tablet draining on replace
Replace doesn't drain (rebuild) tablets during topology change. They
are rebuilt afterwards, when the replaced node is in the "left" state and
the replacing node is in the normal state. So there is no point in attempting
to drain, as nothing will be drained.

Not only that, doing so carries a risk, because the load balancer is
invoked on a transitional topology state in which we can end up with
no normal nodes in a rack. That's the case if the replaced node was
the last one in the rack. This tripped one of the algorithms, which
computes a rack's shard count for the purpose of determining the ideal
tablet count; it was not prepared to find an empty rack to which a
table is still replicated. That was fixed separately, but to avoid
this, we had better skip tablet draining here.
2025-10-29 23:32:57 +01:00
Patryk Jędrzejczak
7304afb75a Merge 'vnodes cleanup: renames and code comments fixes' from Petr Gusev
This is a follow-up for https://github.com/scylladb/scylladb/pull/26315. It addresses several review comments that were left unresolved in the original PR.

backport: not needed, this PR contains only renames and code comment fixes

Closes scylladb/scylladb#26745

* https://github.com/scylladb/scylladb:
  test_automatic_cleanup: fix comment
  storage_proxy: remove stale comment
  storage_proxy: improve run_fenceable_write comment
  topology_coordinator: rename start_cleanup_on_dirty_nodes -> start_vnodes_cleanup_on_dirty_nodes
  storage_service: rename is_cleanup_allowed -> is_vnodes_cleanup_allowed
  storage_service: rename do_cluster_cleanup -> do_clusterwide_vnodes_cleanup
2025-10-29 10:38:27 +01:00
Petr Gusev
d49be677d5 storage_proxy: remove stale comment 2025-10-28 17:55:20 +01:00
Petr Gusev
c60223f009 storage_proxy: improve run_fenceable_write comment 2025-10-28 17:55:20 +01:00