Commit Graph

5834 Commits

Author SHA1 Message Date
Patryk Jędrzejczak
adaa0560d9 Merge 'Automatic cleanup improvements' from Gleb Natapov
This series allows an operator to reset the 'cleanup needed' flag if they have already cleaned up the node, so that automatic cleanup will not do it again. We also change 'nodetool cleanup' back to run cleanup on one node only (and reset the 'cleanup needed' flag at the end), while the new '--global' option allows running cleanup on all nodes that need it simultaneously.

Fixes https://github.com/scylladb/scylladb/issues/26866

Backport to all supported versions, since the current automatic cleanup behaviour may create load during cluster resizing that the operator does not expect.

Closes scylladb/scylladb#26868

* https://github.com/scylladb/scylladb:
  cleanup: introduce "nodetool cluster cleanup" command to run cleanup on all dirty nodes in the cluster
  cleanup: Add RESTful API to allow reset cleanup needed flag
2025-11-18 08:17:17 +02:00
Piotr Dulikowski
f0039381d2 Merge 'db/view/view_building_worker: support staging sstables intra-node migration and tablet merge' from Michał Jadwiszczak
This PR fixes staging sstables handling by the view building coordinator in case of intra-node tablet migration or tablet merge.

To support tablet merge, the worker stores the sstables grouped only by `table_id`, instead of by the `(table_id, last_token)` pair.
There shouldn't be that many staging sstables, so selecting the relevant ones for each `process_staging` task is fine.
For the intra-node migration support, the patch adds methods to load migrated sstables on the destination shard and to clean them up on the source shard.
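
A minimal sketch of the regrouping, with stand-in types (illustrative only, not the actual worker code):

```
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using table_id = uint64_t;     // stand-in for the real table id type
using sstable = std::string;   // stand-in for the real sstable handle

// Before: keyed by (table_id, last_token); tablet merge changes last_token
// boundaries and invalidates the key. After: keyed by table_id only, and
// each process_staging task filters for the sstables relevant to it.
std::map<table_id, std::vector<sstable>> staging_sstables;
```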

The patch should be backported to 2025.4

Fixes https://github.com/scylladb/scylladb/issues/26244

Closes scylladb/scylladb#26454

* github.com:scylladb/scylladb:
  service/storage_service: migrate staging sstables in view building worker during intra-node migration
  db/view/view_building_worker: support sstables intra-node migration
  db/view_building_worker: fix indent
  db/view/view_building_worker: don't organize staging sstables by last token
2025-11-17 08:53:19 +01:00
Pavel Emelyanov
1c9c4c8c8c Merge 'service: attach storage_service to migration_manager using pluggable' from Marcin Maliszkiewicz
Migration manager depends on storage service. For instance,
it has a reload_schema_in_bg background task which calls
_ss.local(), so it expects that storage service is not stopped
before it stops.

To solve this we use a permit approach, and during storage_service
stop:
- we ignore *new* code execution in migration_manager which would use
  storage_service
- but delay storage_service shutdown until all *existing*
  executions are done
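
A plain-C++ sketch of the permit idea, using standard-library types in place of the actual Seastar primitives (names are illustrative):

```
#include <condition_variable>
#include <mutex>
#include <optional>
#include <utility>

// migration_manager takes a permit before touching storage_service;
// stop() rejects new permits and waits until existing ones are released.
class permit_source {
    std::mutex _m;
    std::condition_variable _cv;
    int _outstanding = 0;
    bool _stopping = false;
public:
    class permit {
        permit_source* _s;
    public:
        explicit permit(permit_source* s) : _s(s) {}
        permit(permit&& o) noexcept : _s(std::exchange(o._s, nullptr)) {}
        ~permit() {
            if (!_s) return;
            std::lock_guard<std::mutex> l(_s->_m);
            if (--_s->_outstanding == 0) _s->_cv.notify_all();
        }
    };
    // *New* code paths get nullopt once stop() has begun and must bail out.
    std::optional<permit> try_get() {
        std::lock_guard<std::mutex> l(_m);
        if (_stopping) return std::nullopt;
        ++_outstanding;
        return permit(this);
    }
    // storage_service::stop(): reject new permits, wait for *existing* ones.
    void stop() {
        std::unique_lock<std::mutex> l(_m);
        _stopping = true;
        _cv.wait(l, [&] { return _outstanding == 0; });
    }
};
```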

Fixes scylladb/scylladb#26734

Backport: not needed. The problem has existed for a very long time; the code restructuring in https://github.com/scylladb/scylladb/commit/389afcd (and following commits) made
it hit more often, as _ss was called earlier, but that change has not been released yet.

Closes scylladb/scylladb#26779

* github.com:scylladb/scylladb:
  service: attach storage_service to migration_manager using pluggable
  service: migration_manager: coroutinize merge_schema_from
  service: migration_manager: coroutinize reload_schema
2025-11-14 15:14:28 +03:00
Piotr Dulikowski
2ccc94c496 Merge 'topology_coordinator: include joining node in barrier' from Michael Litvak
Previously, only nodes in the 'normal' state and decommissioning nodes
were included in the set of nodes participating in barrier and
barrier_and_drain commands. Joining nodes are not included because they
don't coordinate requests, given their CQL port is closed.

However, joining nodes may receive mutations from other nodes, for which
they may generate and coordinate materialized view updates. If their
group0 state is not synchronized it could cause lost view updates.
For example:

1. On the topology coordinator, the join completes and the joining node
   becomes normal, but the joining node's state lags behind. Since it's
   not synchronized by the barrier, it could be in an old state such as
   `write_both_read_old`.
2. A normal node coordinates a write and sends it to the new node as the
   new replica.
3. The new node applies the base mutation but doesn't generate a view
   update for it, because it calculates the base-view pairing according
   to its own state and replication map, and determines that it doesn't
   participate in the base-view pairing.

Therefore, since the joining node participates as a coordinator for view
updates, it should be included in these barriers as well. This ensures
that before the join completes, the joining node's state is
`write_both_read_new`, where it does generate view updates.

Fixes https://github.com/scylladb/scylladb/issues/26976

backport to previous versions since it fixes a bug in MV with vnodes

Closes scylladb/scylladb#27008

* github.com:scylladb/scylladb:
  test: add mv write during node join test
  topology_coordinator: include joining node in barrier
2025-11-14 12:41:16 +01:00
Patryk Jędrzejczak
1141342c4f Merge 'topology: refactor excluded nodes' from Petr Gusev
This PR refactors excluded nodes handling for tablets and topology. For tablets, a dedicated variable `topology::excluded_tablet_nodes` is introduced; for topology operations, the method get_excluded_nodes() is inlined into topology_coordinator and renamed to `get_excluded_nodes_for_topology_request`.

The PR improves code readability and efficiency; there are no behavior changes.

backport: this is a refactoring/optimization, no need to backport

Closes scylladb/scylladb#26907

* https://github.com/scylladb/scylladb:
  topology_coordinator: drop unused exec_global_command overload
  topology_coordinator: rename get_excluded_nodes -> get_excluded_nodes_for_topology_request
  topology_state_machine: inline get_excluded_nodes
  messaging_service: simplify and optimize ban_host
  storage_service: topology_state_load: extract topology variable
  topology_coordinator: excluded_tablet_nodes -> ignored_nodes
  topology_state_machine: add excluded_tablet_nodes field
2025-11-14 11:52:00 +01:00
Piotr Dulikowski
833b824905 Merge 'service/qos: Fall back to default scheduling group when using maintenance socket' from Dawid Mędrek
The service level controller relies on `auth::service` to collect
information about roles and the relation between them and the service
levels (those attached to them). Unfortunately, the service level
controller is initialized way earlier than `auth::service` and so we
had to prevent potential invalid queries of user service levels
(cf. 46193f5e79).

Unfortunately, that came at a price: it made the maintenance socket
incompatible with the current implementation of the service level
controller. The maintenance socket starts early, before the
`auth::service` is fully initialized and registered, and is exposed
almost immediately. If the user attempts to connect to Scylla within
this time window via the maintenance socket, one of the things that
will happen is choosing the right service level for the connection.
Since the `auth::service` is not registered, Scylla will fail an
assertion and crash.

A similar scenario occurs when using maintenance mode. The maintenance
socket is how the user communicates with the database, and we're not
prepared for that either.

To avoid unnecessary crashes, we add new branches if the passed user is
absent or if it corresponds to the anonymous role. Since the role
corresponding to a connection via the maintenance socket is the anonymous
role, that solves the problem.

Some accesses to `auth::service` are not affected and we do not modify
those.
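
A sketch of the shape of those branches, with hypothetical stand-in types (illustrative only; the real methods live in the service level controller):

```
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for the real qos/auth types.
struct scheduling_group { std::string name; };
using user_name = std::optional<std::string>;        // nullopt == anonymous role
scheduling_group default_scheduling_group() { return {"main"}; }
bool auth_integration_registered() { return true; }  // stand-in for the real check

scheduling_group get_user_scheduling_group(const std::optional<user_name>& user) {
    // New branches: an absent user or the anonymous role must not reach
    // auth_integration, which may not be registered yet (maintenance socket).
    if (!user || !*user) {
        return default_scheduling_group();
    }
    if (!auth_integration_registered()) {
        throw std::runtime_error("service levels unavailable: auth service not ready");
    }
    // Normal path: resolve the role's service level via auth_integration.
    return {"sl:" + **user};
}
```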

Fixes scylladb/scylladb#26816

Backport: yes. This is a fix of a regression.

Closes scylladb/scylladb#26856

* github.com:scylladb/scylladb:
  test/cluster/test_maintenance_mode.py: Wait for initialization
  test: Disable maintenance mode correctly in test_maintenance_mode.py
  test: Fix keyspace in test_maintenance_mode.py
  service/qos: Do not crash Scylla if auth_integration absent
2025-11-14 11:12:28 +01:00
Marcin Maliszkiewicz
958d04c349 service: attach storage_service to migration_manager using pluggable
Migration manager depends on storage service. For instance,
it has a reload_schema_in_bg background task which calls
_ss.local(), so it expects that storage service is not stopped
before it stops.

To solve this we use a permit approach, and during storage_service
stop:
- we ignore *new* code execution in migration_manager which would use
  storage_service
- but delay storage_service shutdown until all *existing*
  executions are done

Fixes scylladb/scylladb#26734
2025-11-14 08:50:19 +01:00
Marcin Maliszkiewicz
cf9b2de18b service: migration_manager: coroutinize merge_schema_from
It's needed to easily keep alive the pluggable storage_service
permit in a following commit.
2025-11-14 08:50:19 +01:00
Marcin Maliszkiewicz
5241e9476f service: migration_manager: coroutinize reload_schema
It's needed to easily keep alive the pluggable storage_service
permit in a following commit.
2025-11-14 08:50:18 +01:00
Michael Litvak
eefae4cc4e migration_manager: pass timestamp to pre_create
Pass the write timestamp as a parameter to the
on_pre_create_column_families notification.
2025-11-13 16:59:43 +01:00
Petr Gusev
d3bd8c924d topology_coordinator: drop unused exec_global_command overload 2025-11-13 14:19:03 +01:00
Petr Gusev
45d1302066 topology_coordinator: rename get_excluded_nodes -> get_excluded_nodes_for_topology_request
This method is specific to topology requests -- node joining, replacing,
decommissioning etc, everything that goes through
topology::transition_state::write_both_read_old and
raft_topology_cmd::command::stream_ranges. It shouldn't be used in
other contexts -- to handle global topology requests
(e.g. truncate table) or for tablets. Rename the method to make this
more explicit.
2025-11-13 14:19:03 +01:00
Petr Gusev
bf8cc5358b topology_state_machine: inline get_excluded_nodes
The method is specific to topology_coordinator, which already contains
a wrapper for it, so inline the topology method into it.

Also, make the logic of the method more explicit and remove multiple
transition_nodes lookups.
2025-11-13 14:18:46 +01:00
Michael Litvak
13d94576e5 topology_coordinator: include joining node in barrier
Previously, only nodes in the 'normal' state and decommissioning nodes
were included in the set of nodes participating in barrier and
barrier_and_drain commands. Joining nodes are not included because they
don't coordinate requests, given their CQL port is closed.

However, joining nodes may receive mutations from other nodes, for which
they may generate and coordinate materialized view updates. If their
group0 state is not synchronized it could cause lost view updates.
For example:

1. On the topology coordinator, the join completes and the joining node
   becomes normal, but the joining node's state lags behind. Since it's
   not synchronized by the barrier, it could be in an old state such as
   `write_both_read_old`.
2. A normal node coordinates a write and sends it to the new node as the
   new replica.
3. The new node applies the base mutation but doesn't generate a view
   update for it, because it calculates the base-view pairing according
   to its own state and replication map, and determines that it doesn't
   participate in the base-view pairing.

Therefore, since the joining node participates as a coordinator for view
updates, it should be included in these barriers as well. This ensures
that before the join completes, the joining node's state is
`write_both_read_new`, where it does generate view updates.

Fixes scylladb/scylladb#26976
2025-11-13 12:24:31 +01:00
Piotr Dulikowski
2e5eb92f21 Merge 'cdc: use CDC schema that is compatible with the base schema' from Michael Litvak
When generating CDC log mutations for some base mutation, use a CDC schema that is compatible with the base schema.

The compatible CDC schema has for every base column a corresponding CDC column with the same name. If using a non-compatible schema, we may encounter a situation, especially during ALTER, that we have a mutation with a base column set with some value, but the CDC schema doesn't have a column by that name. This would cause the user request to fail with an error.

We add to the schema object a schema_ptr that for CDC-enabled tables points to the schema object of the CDC table that is compatible with the schema. It is set by the schema merge algorithm when creating the schema for a table that is created or altered. We use the fact that a base table and its CDC table are created and altered in the same group0 operation, and this way we can find and set the cdc schema for a base table.

When transporting the base schema as a frozen schema between shards, we transport with it the frozen cdc schema as well.

The patch starts with a series of refactoring commits that make extending the frozen schema easier and clean up some duplication in the frozen-schema code. We combine the two types `frozen_schema_with_base_info` and `view_schema_and_base_info` into a single type `extended_frozen_schema` that holds a frozen schema together with the additional data that is not part of the schema mutations but needs to be transported with it to unfreeze it: the base_info, and the frozen CDC schema, which is added in a later commit.
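
Roughly, the combined type has this shape (stub types; field names inferred from the commit titles, not the actual declarations):

```
#include <optional>
#include <vector>

// Stubs standing in for the real ScyllaDB types.
struct canonical_mutation {};
struct frozen_schema { std::vector<canonical_mutation> mutations; };
struct base_info {};  // data a view schema needs to be unfrozen

// A frozen schema plus the data that is not part of the schema mutations
// but must travel with it across shards to unfreeze it: the view's
// base_info and, for CDC-enabled base tables, the compatible CDC schema.
struct extended_frozen_schema {
    frozen_schema schema;
    std::optional<base_info> base;
    std::optional<frozen_schema> cdc;
};
```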

Fixes https://github.com/scylladb/scylladb/issues/26405

backport not needed - enhancement

Closes scylladb/scylladb#24960

* github.com:scylladb/scylladb:
  test: cdc: test cdc compatible schema
  cdc: use compatible cdc schema
  db: schema_applier: create schema with pointer to CDC schema
  db: schema_applier: extract cdc tables
  schema: add pointer to CDC schema
  schema_registry: remove base_info from global_schema_ptr
  schema_registry: use extended_frozen_schema in schema load
  schema_registry: replace frozen_schema+base_info with extended_frozen_schema
  frozen_schema: extract info from schema_ptr in the constructor
  frozen_schema: rename frozen_schema_with_base_info to extended_frozen_schema
2025-11-13 10:11:54 +01:00
Pavel Emelyanov
f47f2db710 Merge 'Support local primary-replica-only for native restore' from Robert Bindar
This PR extends the restore API so that it accepts primary_replica_only as a parameter, and it combines the concept of primary-replica-only with scoped streaming so that, with:
- `scope=all primary_replica_only=true` The restoring node will stream to the global primary replica only
- `scope=dc primary_replica_only=true` The restoring node will stream to the local primary replica only.
- `scope=rack primary_replica_only=true` The restoring node will stream only to the primary replica from within its own rack (with rf=#racks, the restoring node will stream only to itself)
- `scope=node primary_replica_only=true` is not allowed; the restoring node always streams only to itself, so the primary_replica_only parameter wouldn't make sense.

The PR also adjusts the `nodetool refresh` restriction on running restore with both primary_replica_only and scope, adds primary_replica_only to `nodetool restore`, and adds cluster tests for primary replica within scope.

Fixes #26584

Closes scylladb/scylladb#26609

* github.com:scylladb/scylladb:
  Add cluster tests for checking scoped primary_replica_only streaming
  Improve choice distribution for primary replica
  Refactor cluster/object_store/test_backup
  nodetool restore: add primary-replica-only option
  nodetool refresh: Enable scope={all,dc,rack} with primary_replica_only
  Enable scoped primary replica only streaming
  Support primary_replica_only for native restore API
2025-11-13 12:11:18 +03:00
Tomasz Grabiec
10b893dc27 Merge 'load_stats: fix bug in migrate_tablet_size()' from Ferenc Szili
`topology_coordinator::migrate_tablet_size()` was introduced in 10f07fb95a. It has a bug where the has_tablet_size() lambda always returns false because of a bad comparison of iterators after a table and tablet search:

```
if (auto table_i = tables.find(gid.table); table_i != tables.find(gid.table)) {
    if (auto size_i = table_i->second.find(trange); size_i != table_i->second.find(trange)) {
```
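
Presumably the intended comparisons are against `end()`, i.e.:

```
if (auto table_i = tables.find(gid.table); table_i != tables.end()) {
    if (auto size_i = table_i->second.find(trange); size_i != table_i->second.end()) {
```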

This change also fixes a problem where `migrate_tablet_size()` would crash with a `std::out_of_range` if the pending node was not present in load_stats.

This change fixes these two problems and moves the functionality into a separate method of `load_stats`. It also adds tests for the new method.

A version containing this bug has not been released yet, so no backport is needed.

Closes scylladb/scylladb#26946

* github.com:scylladb/scylladb:
  load_stats: add test for migrate_tablet_size()
  load_stats: fix problem with tablet size migration
2025-11-12 23:48:37 +01:00
Petr Gusev
9fed80c4be messaging_service: simplify and optimize ban_host
We do one cross-shard call for all left+ignored nodes.
2025-11-12 12:27:44 +01:00
Petr Gusev
52cccc999e storage_service: topology_state_load: extract topology variable
It's inconvenient to always write the long expression
_topology_state_machine._topology.
2025-11-12 12:27:44 +01:00
Petr Gusev
66063f202b topology_coordinator: excluded_tablet_nodes -> ignored_nodes
ignored_nodes is sufficient in these cases. excluded_tablet_nodes
also includes left_nodes_rs, which are not needed
here — global_token_metadata_barrier runs the barrier only
on normal and transition nodes, not on left nodes.
2025-11-12 12:27:44 +01:00
Petr Gusev
82da83d0e5 topology_state_machine: add excluded_tablet_nodes field
topology_coordinator::is_excluded() creates a temporary hash
map for each call. This is probably not a performance problem, since
left_nodes_rs contains only those left nodes that are referenced
from tablet replicas; this happens temporarily while e.g. a replaced
node is being rebuilt. On the other hand, why not just have a
dedicated field in the topology_state_machine, so this code wouldn't
look suspicious.
2025-11-12 12:27:43 +01:00
Gleb Natapov
e872f9cb4e cleanup: Add RESTful API to allow reset cleanup needed flag
Cleaning up a node using the per keyspace/table interface does not reset
the 'cleanup needed' flag in the topology. The assumption was that running
cleanup on an already clean node does nothing and completes quickly. But due to
https://github.com/scylladb/scylladb/issues/12215 (which is closed as
WONTFIX) this is not the case. This patch provides the ability to reset
the flag in the topology if the operator has already cleaned up the node
manually.
2025-11-12 10:56:57 +02:00
Ferenc Szili
b77ea1b8e1 load_stats: fix problem with tablet size migration
This patch fixes a bug with tablet size migration in load_stats.
The has_tablet_size() lambda in topology_coordinator::migrate_tablet_size()
was returning false in all cases due to an incorrect search iterator
comparison after a table and tablet search.

This change moves the load_stats migrate_tablet_sizes() functionality
into a separate method of load_stats.
2025-11-11 14:26:09 +01:00
Nikos Dragazis
56e5dfc14b migration_manager: Add missing validations for schema extensions
The migration manager offers some free functions to prepare mutations
for a new/updated table/view. Most of them include a validation check
for the schema extensions, but in the following ones it's missing:

* `prepare_new_column_family_announcement` (overload with vector as out parameter)
* `prepare_new_column_families_announcement`

Presumably, this was just an omission. It's also not a very important
one, since the only extension with validation logic is the
`encryption_schema_extension`, and none of these functions is connected
to user queries where encryption options can be provided in the schema.
User queries go through the other
`prepare_new_column_family_announcement` overload, which does perform a
validation check.

Add validation in the missing places.

Fixes #26470.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#26487
2025-11-11 10:08:58 +02:00
Robert Bindar
817fdadd49 Improve choice distribution for primary replica
I noticed during tests that `maybe_get_primary_replica`
would not distribute the choice of primary replica uniformly,
because `info.replicas` would be ordered one way on some shards and
differently on others, making the function choose the same node as
primary replica multiple times when it clearly could have
chosen a different node.

This patch sorts the replica set before passing it through the
scope filter.
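
In miniature (illustrative values), the fix works like this:

```
#include <algorithm>
#include <vector>

int pick_primary(std::vector<int> replicas) {
    // The same replica set may arrive in a different order on each shard;
    // sorting first makes a deterministic "pick" agree across shards, so
    // the choice of primary replica is distributed as intended.
    std::sort(replicas.begin(), replicas.end());
    return replicas.front();
}
```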

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2025-11-11 09:18:01 +02:00
Dawid Mędrek
c0f7622d12 service/qos: Do not crash Scylla if auth_integration absent
If the user connects to Scylla via the maintenance socket, it may happen
that `auth_integration` has not been registered in the service level
controller yet. One example is maintenance mode, where that will never
happen; another is when the connection occurs before Scylla is fully
initialized.

To avoid unnecessary crashes, we add new branches if the passed user is
absent or if it corresponds to the anonymous role. Since the role
corresponding to a connection via the maintenance socket is the anonymous
role, that solves the problem.

In those cases, we completely circumvent any calls to `auth_integration`
and handle them separately. The modified methods are:

* `get_user_scheduling_group`,
* `with_user_service_level`,
* `describe_service_levels`.

For the first two, the new behavior is in line with the previous
implementation of those functions. The last behaves differently now,
but since it's a soft error, crashing the node is not necessary anyway.
We throw an exception instead, whose error message should give the user
a hint of what might be wrong.

The other uses of `auth_integration` within the service level controller
are not problematic:

* `find_effective_service_level`,
* `find_cached_effective_service_level`.

They take the name of a role as their argument. Since the anonymous role
doesn't have a name, it's not possible to call them with it.

Fixes scylladb/scylladb#26816
2025-11-10 19:21:36 +01:00
Michał Jadwiszczak
9345c33d27 service/storage_service: migrate staging sstables in view building
worker during intra-node migration

Use the methods introduced in the previous commit and:
- load staging sstables into the view building worker on the target
  shard, at the end of the `streaming` stage
- clear migrated staging sstables on the source shard in the `cleanup` stage

This patch also removes the skip mark in `test_staging_sstables_with_tablet_merge`.

Fixes scylladb/scylladb#26244
2025-11-10 10:38:08 +01:00
Abhinav Jha
ab0e0eab90 raft topology: skip non-idempotent steps in decommission path to avoid problems during races
There are issues in the left_token_ring transition state
execution in the decommissioning path. Under concurrent mutation race
conditions, we enter left_token_ring more than once, and if
we enter left_token_ring a second time, we try to barrier the decommissioned
node, which at this point is no longer possible. That's what causes the errors.

This PR resolves the issue by adding a check right at the start of
left_token_ring for whether the first topology state update, which marks
the request as done, has completed. If it has, it's confirmed that the
flow is entering left_token_ring a second time and the steps preceding
the request status update should be skipped. In such cases, all the remaining
steps are skipped and the topology node status update (which threw an error in
the previous trial) is executed directly. The node's removal status from group0 is
also checked, and the remove operation is retried if it failed last time.
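
A sketch of the guard's shape, with hypothetical helper names (illustrative; the real check sits at the start of the left_token_ring handler):

```
// Hypothetical stand-ins for the coordinator's state and steps.
bool request_marked_done()         { return true;  }  // first state update applied?
bool node_removed_from_group0()    { return false; }
void remove_node_from_group0()     {}
void update_node_topology_status() {}
void run_remaining_left_token_ring_steps() {}

void handle_left_token_ring() {
    if (request_marked_done()) {
        // Second entry after a concurrent-mutation race: the decommissioned
        // node can no longer answer a barrier, so skip the non-idempotent
        // steps and redo only what may have failed in the previous trial.
        if (!node_removed_from_group0()) {
            remove_node_from_group0();
        }
        update_node_topology_status();
        return;
    }
    run_remaining_left_token_ring_steps();  // normal, first-time path
}
```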

Although these changes are made with regard to the decommission operation's
behavior in the `left_token_ring` transition state, the PR doesn't
interfere with the core logic, so it should not derail any rollback-specific
logic. The changes just prevent some non-idempotent operations from
re-occurring in case of failures. The rest of the core logic remains intact.

A test is also added to confirm that this works as intended.

Fixes: scylladb/scylladb#20865

Backport is not needed, since this is not a super critical bug fix.

Closes scylladb/scylladb#26717
2025-11-07 10:07:49 +01:00
Patryk Jędrzejczak
d6c64097ad Merge 'storage_proxy: use gates to track write handlers destruction' from Petr Gusev
In [#26408](https://github.com/scylladb/scylladb/pull/26408) a `write_handler_destroy_promise` class was introduced to wait for the destruction of `abstract_write_response_handler` instances. We strived to minimize the memory footprint of `abstract_write_response_handler`; with `write_handler_destroy_promise`-es we required only a single additional int. It turned out that in some cases a lot of write handlers can be scheduled for deletion at the same time, in which case the vector can become big and cause 'oversized allocation' seastar warnings.

Another concern with `write_handler_destroy_promise`-es [was that they were more complicated than it was worth](https://github.com/scylladb/scylladb/pull/26408#pullrequestreview-3361001103).

In this commit we replace `write_handler_destroy_promise` with simple gates. One or more gates can be attached to an `abstract_write_response_handler` to wait for its destruction. We use `utils::small_vector` to store the attached gates. The limit of 2 was chosen because we expect two gates at the same time in most cases: one is `storage_proxy::_write_handlers_gate`, which is used to wait for all handlers in `cancel_all_write_response_handlers`; another can be attached by a caller of `cancel_write_handlers`. Nothing stops several cancel_write_handlers calls from happening at the same time, but it should be rare.

`sizeof(utils::small_vector)` is 40; this is a `40.0 / 488 * 100 ~ 8%` increase in `sizeof(abstract_write_response_handler)`, which seems acceptable.
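
A plain-C++ sketch of the scheme, with a homemade gate standing in for the Seastar one and boost's small_vector for `utils::small_vector` (illustrative, not the actual ScyllaDB code):

```
#include <boost/container/small_vector.hpp>
#include <condition_variable>
#include <mutex>
#include <utility>

// Minimal gate: close() blocks until every holder has been destroyed.
struct gate {
    std::mutex m;
    std::condition_variable cv;
    int count = 0;
    struct holder {
        gate* g;
        explicit holder(gate& gg) : g(&gg) { std::lock_guard<std::mutex> l(g->m); ++g->count; }
        holder(holder&& o) noexcept : g(std::exchange(o.g, nullptr)) {}
        ~holder() {
            if (!g) return;
            std::lock_guard<std::mutex> l(g->m);
            if (--g->count == 0) g->cv.notify_all();
        }
    };
    holder hold() { return holder(*this); }
    void close() {  // wait for all attached holders to be released
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&] { return count == 0; });
    }
};

struct write_response_handler {
    // Inline capacity of 2 covers the common case (the proxy-wide gate plus
    // at most one cancel_write_handlers caller) without a heap allocation.
    boost::container::small_vector<gate::holder, 2> gate_holders;
    void attach(gate& g) { gate_holders.emplace_back(g.hold()); }
    // Destroying the handler destroys the holders, releasing the gates.
};
```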

Fixes [scylladb/scylladb#26788](https://github.com/scylladb/scylladb/issues/26788)

backport: need to backport to 2025.4 (LWT for tablets release)

Closes scylladb/scylladb#26827

* https://github.com/scylladb/scylladb:
  storage_proxy: use coroutine::maybe_yield();
  storage_proxy: use gates to track write handlers destruction
2025-11-06 10:17:04 +01:00
Petr Gusev
5bda226ff6 storage_proxy: use coroutine::maybe_yield();
This is a small "while at it" refactoring -- better to use
coroutine::maybe_yield with co_await-s.
2025-11-05 14:38:19 +01:00
Petr Gusev
4578304b76 storage_proxy: use gates to track write handlers destruction
In #26408 a write_handler_destroy_promise class was introduced to
wait for abstract_write_response_handler instances destruction. We
strived to minimize the memory footprint of
abstract_write_response_handler; with write_handler_destroy_promise-es
we required only a single additional int. It turned out that in some
cases a lot of write handlers can be scheduled for deletion
at the same time, in such cases the
vector<write_handler_destroy_promise> can become big and cause
'oversized allocation' seastar warnings.

Another concern with write_handler_destroy_promise-es was that they
were more complicated than it was worth.

In this commit we replace write_handler_destroy_promise with simple
gates. One or more gates can be attached to an
abstract_write_response_handler to wait for its destruction. We use
utils::small_vector<gate::holder, 2> to store the attached gates.
The limit 2 was chosen because we expect two gates at the same time
in most cases. One is storage_proxy::_write_handlers_gate,
which is used to wait for all handlers in
cancel_all_write_response_handlers. Another one can be attached by
a caller of cancel_write_handlers. Nothing stops several
cancel_write_handlers calls from happening at the same time, but it should be
rare.

The sizeof(utils::small_vector<gate::holder, 2>) is 40; this is a
40.0 / 488 * 100 ~ 8% increase in
sizeof(abstract_write_response_handler), which seems acceptable.

Fixes scylladb/scylladb#26788
2025-11-05 14:37:52 +01:00
Tomasz Grabiec
f8879d797d tablet_allocator: Avoid load balancer failure when replacing the last node in a rack
Introduced in 9ebdeb2

The problem is specific to node replace and rack-list RF. The
culprit is in the part of the load balancer which determines a rack's
shard count. If we're replacing the last node, the rack will contain
no normal nodes, and shards_per_rack will have no entry for the rack,
on which the table still has replicas. This throws std::out_of_range,
which fails the tablet draining stage, and the node replace fails.
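
The failure mode in miniature (illustrative):

```
#include <map>
#include <string>

int main() {
    // Replacing the last node in "rack2" leaves it with no normal nodes, so
    // it gets no entry here even though a table still has replicas in it.
    std::map<std::string, unsigned> shards_per_rack = {{"rack1", 8}};
    return shards_per_rack.at("rack2");  // throws std::out_of_range
}
```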

No backport because the problem exists only on master.

Fixes #26768

Closes scylladb/scylladb#26783
2025-11-05 15:49:51 +03:00
Wojciech Mitros
977fa91e3d view_building_coordinator: rollback tasks on the leaving tablet replica
When a tablet migration is started, we abort the corresponding view
building tasks (i.e. we change the state of those tasks to "ABORTED").
However, we don't change the host and shard of these tasks until the
migration successfully completes. When for some reason we have to
roll back the migration, that means the migration didn't finish and
the aborted task still has the host and shard of the migration
source. So when we recreate tasks that should no longer be aborted
due to a rolled-back migration, we should look at the aborted tasks
of the source (leaving) replica. But we don't do that; we look at
the aborted tasks of the target replica.
In this patch we adjust the rollback mechanism to recreate tasks
for the migration source instead of the destination. We also fix the
test that should have detected this issue - the injection that
the test was using didn't make us roll back; we simply retried
a stage of the tablet migration. By using one_shot=False and adding
a second injection, we can now guarantee that the migration will
eventually fail and we'll continue to the 'cleanup_target' and
'revert_migration' stages.

Fixes https://github.com/scylladb/scylladb/issues/26691

Closes scylladb/scylladb#26825
2025-11-05 10:44:06 +01:00
Avi Kivity
95700c5f7f Merge 'Support counters with tablets' from Michael Litvak
Support the counters feature in tablets keyspaces.

The main change is to fix the counter update during tablet intranode migration.

A counter cell is c = map<host_id, value>. A counter update is applied by doing a read-modify-write on a leader replica to retrieve the current host's counter value and transform the mutation to contain the updated value for the host, then applying the mutation and replicating it to the other hosts. The read-modify-write is protected against concurrent updates by locking the counter cell.

When the counter is migrated between two shards, it's not enough to lock the counter on the read shard, because in the write_both_read_new stage the read shard is switched, and then concurrent updates can reach either the old or the new shard. To keep the counter update exclusive, we lock both shards when in the write_both_read_new stage.
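
A sketch of the locking rule (illustrative types, not the actual storage proxy code):

```
#include <cstdint>
#include <map>
#include <mutex>

using host_id = int;
using counter_cell = std::map<host_id, int64_t>;  // c = map<host_id, value>

// During write_both_read_new either shard may serve the read, so the
// read-modify-write must hold the per-counter lock on both of them.
void apply_counter_update(counter_cell& c, host_id self, int64_t delta,
                          std::mutex& old_shard_lock, std::mutex& new_shard_lock,
                          bool write_both_read_new) {
    if (write_both_read_new) {
        std::scoped_lock lock(old_shard_lock, new_shard_lock);  // deadlock-free
        c[self] += delta;  // read this host's value, add, write back
    } else {
        std::scoped_lock lock(old_shard_lock);
        c[self] += delta;
    }
}
```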

Also, when applying the transformed mutation we need to respect write_both stages and apply the mutation on both shards. We change it to use `apply_on_shards` similarly to other methods in storage proxy.

The change applies to both tablets and vnodes; they use the same implementation. For vnodes the behavior should remain equivalent, up to some small reordering of the code, since vnodes have no intranode migration and the logic reduces to a single read shard = write shard.

Fixes https://github.com/scylladb/scylladb/issues/18180

no backport - new feature

Closes scylladb/scylladb#26636

* github.com:scylladb/scylladb:
  docs: counters now work with tablets
  pgo: enable counters with tablets
  test: enable counters tests with tablets
  test: add counters with tablets test
  cql3: remove warning when creating keyspace with tablets
  cql3: allow counters with tablets
  storage_proxy: lock all read shards for counter update
  storage_proxy: apply counter mutation on all write shards
  storage_proxy: move counter update coordination to storage proxy
  storage_proxy: refactor mutate_counter_on_leader
  replica/db: add counter update guard
  replica/db: split counter update helper functions
2025-11-03 22:28:10 +01:00
Raphael S. Carvalho
7f34366b9d sstables_loader: Don't bypass synchronization with busy topology
The patch c543059f86 fixed the synchronization issue between tablet
split and load-and-stream. The synchronization worked only with
raft topology, and therefore was disabled with gossip.
The check used storage_service::raft_topology_change_enabled(),
but the topology kind is only available/set on shard 0, which caused
the synchronization to be bypassed when load-and-stream ran on
any shard other than 0.

The reason the reproducer didn't catch it is that it was restricted
to a single CPU. It will now run with multiple CPUs and catch the
observed problem.

Fixes #22707

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#26730
2025-11-03 18:10:08 +01:00
Michael Litvak
296b116ae2 storage_proxy: lock all read shards for counter update
Previously, in a counter update we locked the read shard to protect the
counter's read-modify-write against concurrent updates.

This is not sufficient when the counter is migrated between different
shards, because there is a stage where the read shard switches from the
old shard to the new shard, and during that switch there can be
concurrent counter updates on both shards. If each shard takes only its
own lock, the operations will not be exclusive anymore, and this can
cause lost counter updates.

To fix this, we acquire the counter lock on both shards in the stage
write_both_read_new, when both shards can serve reads. This guarantees
that counter updates continue to be exclusive during intranode
migration.
2025-11-03 16:04:35 +01:00
Michael Litvak
de321218bc storage_proxy: apply counter mutation on all write shards
When applying a counter mutation, use apply_on_shards to apply the
mutation on all write shards, similarly to the way other mutations are
applied in the storage proxy. Previously the mutation was applied only
on the current shard which is the read shard.

This is needed to respect the write_both stages of intranode migration
where we need to apply the mutation on both the old and the new shards.
2025-11-03 16:03:29 +01:00
Michael Litvak
c7e7a9e120 storage_proxy: move counter update coordination to storage proxy
Refactor the counter update to split the functions and have them called
by the storage proxy to prepare for a later change.

Previously, in mutate_counter the storage proxy called the replica
function apply_counter_update, which does a few things:
1. checks that the operation can be done: check timeout, disk utilization
2. acquire counter locks
3. do read-modify-write and transform the counter mutation
4. apply the mutation in the replica

In this commit we split these functions and call them from the storage
proxy, so that the storage proxy has better control when we later change
it to work across multiple shards. For
example, we will want to acquire locks on multiple shards, transform it
on one shard, and then apply the mutation on multiple shards.

After the change it works as follows in storage proxy:
1. acquire counter locks
2. call replica prepare to check the operation and transform the mutation
3. call replica apply to apply the transformed mutation
2025-11-03 15:59:46 +01:00
Michael Litvak
579031cfc8 storage_proxy: refactor mutate_counter_on_leader
Slightly reorganize the mutate counter function to prepare it for a
later change.

Move the code that finds the read shard and invokes the rest of the
function on the read shard to the caller function. This simplifies the
function mutate_counter_on_leader_and_replicate which now runs on the
read shard and will make it easier to extend.
2025-11-03 08:43:11 +01:00
Petr Gusev
88765f627a paxos_state: get_replica_lock: remove shard check
This check is incorrect: the current shard may be looking at
the old version of the tablets map:
* an accept RPC comes to replica shard 0, which is already at write_both_read_new
* the new shard is shard 1, so paxos_state::accept is called on shard 1
* shard 1 is still at "streaming" -> shards_ready_for_reads() returns old
shard 0

Fixes scylladb/scylladb#26801

Closes scylladb/scylladb#26809
2025-10-31 21:37:39 +01:00
Avi Kivity
7a72155374 Merge 'Introduce nodetool excludenode' from Tomasz Grabiec
If a node is dead and cannot be brought back, tablet migrations are
stuck, until the node is explicitly marked as "permanently dead" /
"ignored node" / "excluded" (name differs in different contexts).

Currently, this is done during removenode and replace operations but
it should be possible to only mark the node as dead, for the purpose
of unblocking migrations or other topology operations, without doing
the actual removenode, because full removal might be currently
impossible, or not desirable due to lack of capacity or priorities.

This patch introduces this kind of API:

```
  nodetool excludenode <host-id> [ ... <host-id> ]
```

Having this kind of API is an improvement in user experience in
several cases. For example, when we lose a rack, the only viable
option for recovery is to run removenode with an extra
--ignore-dead-nodes option. This removenode will fail in the tablet
draining phase, as there is no live node in the rack to rebuild
replicas in. This is confusing to the operator. But necessary before
ALTER KEYSPACE can proceed in order to change replication options to
drop the rack from RF.

Having this API allows operators to have more unified procedures,
where "nodetool excludenode" is always the first step of recovery,
which unblocks further topology operations, both those which restore
capacity, but also auto-scaling, tablet split/merge, load balancing,
etc.

Fixes #21281

The PR also changes "nodetool status" to show excluded nodes;
they have 'X' in their status instead of 'D'.

Closes scylladb/scylladb#26659

* github.com:scylladb/scylladb:
  nodetool: status: Show excluded nodes as having status 'X'
  test: py: Test scenario involving excludenode API
  nodetool: Introduce excludenode command
2025-10-31 22:14:57 +02:00
Michael Litvak
e7dbccd59e cdc: use chunked_vector instead of vector for stream ids
Use utils::chunked_vector instead of std::vector to store CDC stream
sets for tablets.

A CDC stream set usually represents all streams for a specific table and
timestamp, and has a stream id for each tablet of the table. Each stream
id is represented by 16 bytes, so the vector could require quite large
contiguous allocations for a table that has many tablets (e.g. a table
with a million tablets would need a single ~16 MB contiguous allocation).
Change it to chunked_vector to avoid large contiguous allocations.

Fixes scylladb/scylladb#26791

Closes scylladb/scylladb#26792
2025-10-31 13:02:34 +01:00
Tomasz Grabiec
1c0d847281 Merge 'load_balancer: load_stats reconcile after tablet migration and table resize' from Ferenc Szili
This change adds the ability to move tablet sizes in load_stats after a tablet migration or table resize (split/merge). This is needed because the size-based load balancer needs tablet size data that is as accurate as possible, in order to work on a fresh tablet size distribution and issue correct tablet migrations.

This is the second part of the size based load balancing changes:

- First part for tablet size collection via load_stats: #26035
- Second part reconcile load_stats: #26152
- The third part for load_sketch changes: #26153
- The fourth part which performs tablet load balancing based on tablet size: #26254

This is a new feature and backport is not needed.

Closes scylladb/scylladb#26152

* github.com:scylladb/scylladb:
  load_balancer: load_stats reconcile after tablet migration and table resize
  load_stats: change data structure which contains tablet sizes
2025-10-31 09:58:25 +01:00
Tomasz Grabiec
55ecd92feb nodetool: Introduce excludenode command
If a node is dead and cannot be brought back, tablet migrations are
stuck, until the node is explicitly marked as "permanently dead" /
"ignored node" / "excluded" (name differs in different contexts).

Currently, this is done during removenode and replace operations but
it should be possible to only mark the node as dead, for the purpose
of unblocking migrations or other topology operations, without doing
the actual removenode, because full removal might be currently
impossible, or not desirable due to lack of capacity or priorities.

This patch introduces this kind of API:

  nodetool excludenode <host-id> [ ... <host-id> ]

Having this kind of API is an improvement in user experience in
several cases. For example, when we lose a rack, the only viable
option for recovery is to run removenode with an extra
--ignore-dead-nodes option. This removenode will fail in the tablet
draining phase, as there is no live node in the rack to rebuild
replicas in. This is confusing to the operator. But necessary before
ALTER KEYSPACE can proceed in order to change replication options to
drop the rack from RF.

Having this API allows operators to have more unified procedures,
where "nodetool excludenode" is always the first step of recovery,
which unblocks further topology operations, both those which restore
capacity, but also auto-scaling, tablet split/merge, load balancing,
etc.

Fixes #21281
2025-10-31 09:03:20 +01:00
Avi Kivity
04a289cae6 Merge 'Auto expand to rack list' from Tomasz Grabiec
We want to move towards rack-list based replication factors for tablets being the default mode, and in the future the only supported mode. This PR is a step towards that. We auto-expand numeric RF to a rack list on keyspace creation and ALTER when the rf_rack_valid_keyspaces option is enabled.

The PR is mostly about adjusting tests. The main logic change is in the last patch, which modifies option post-processing in ks_prop_defs.

Fixes #26397

Closes scylladb/scylladb#26692

* github.com:scylladb/scylladb:
  cql3: ks_prop_defs: Expand numeric RF to rack list
  locator: Move rack_list to topology.hh
  alternator: Do not set RF for zero-token DCs
  alternator: Switch keyspace creation to use ks_prop_defs
  test: alternator: Adjust for rack lists
  cql3: Move validation of invalid ALTER KEYSPACE earlier, to ks_prop_defs
  test: cqlpy: Mark tests using rack lists as scylla-only
  test: Switch to rack-list based RF
  test: Generalize tests to work with both numeric RF and rack lists
  test: cluster: test_zero_token_nodes_multidc: Adjust to rack list RF
  test: Prepare for handling errors specific to rack list path
  test: cluster: dtest: alternator: Force RF=1 in test_putitem_contention
  test: Create cluster with multiple racks in multi-dc setups
  test: boost: network_topology_strategy_test: Adjust to rack-list RF
  test: tablets: Adjust to rack list
  test: cluster: test_group0_schema_versioning: Use smaller RF to respect rf-rack-validness
  test: tablets_test: Convert test_per_shard_goal_mixed_dc_rf to be rack-valid
  test: object_store: test_backup: Adjust for rack lists
  test: cluster: tablets: Do not move tablet across racks in test_tablet_transition_sanity
  test: cluster: mv: Do not move tablets across racks
  test: cluster: util: Fix docstring for parse_replication_options()
  tablets, topology_coordinator: Skip tablet draining on replace
2025-10-30 21:54:08 +02:00
Tomasz Grabiec
28f6bdc99b cql3: ks_prop_defs: Expand numeric RF to rack list
Auto-expands numeric RF in CREATE/ALTER KEYSPACE statements for
new DCs specified in the statement.

Doesn't auto-expand existing options, as the rack choice may not be in
line with current replica placement. This requires co-locating tablet
replicas, and tracking of co-location state, which is not implemented yet.

Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>
2025-10-29 23:32:59 +01:00
Tomasz Grabiec
288e75fe22 tablets, topology_coordinator: Skip tablet draining on replace
Replace doesn't drain (rebuild) tablets during topology change. They
are rebuilt afterwards, when the replaced node is in the "left" state and
the replacing node is in the normal state. So there is no point in attempting
to drain, as nothing will be drained.

Not only that, doing so carries a risk, because the load balancer is
invoked on a transitional topology state in which we can end up with
no normal nodes in a rack. That's the case if the replaced node was
the last one in the rack. This tripped one of the algorithms, which
computes a rack's shard count for the purpose of determining the ideal
tablet count; it was not prepared to find an empty rack to which a
table is still replicated. That was fixed separately, but to avoid
this, we had better skip tablet draining here.
2025-10-29 23:32:57 +01:00
Patryk Jędrzejczak
7304afb75a Merge 'vnodes cleanup: renames and code comments fixes' from Petr Gusev
This is a follow-up for https://github.com/scylladb/scylladb/pull/26315. It addresses several review comments that were left unresolved in the original PR.

backport: not needed, this PR contains only renames and code comment fixes

Closes scylladb/scylladb#26745

* https://github.com/scylladb/scylladb:
  test_automatic_cleanup: fix comment
  storage_proxy: remove stale comment
  storage_proxy: improve run_fenceable_write comment
  topology_coordinator: rename start_cleanup_on_dirty_nodes -> start_vnodes_cleanup_on_dirty_nodes
  storage_service: rename is_cleanup_allowed -> is_vnodes_cleanup_allowed
  storage_service: rename do_cluster_cleanup -> do_clusterwide_vnodes_cleanup
2025-10-29 10:38:27 +01:00
Petr Gusev
d49be677d5 storage_proxy: remove stale comment 2025-10-28 17:55:20 +01:00
Petr Gusev
c60223f009 storage_proxy: improve run_fenceable_write comment 2025-10-28 17:55:20 +01:00