scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-09 16:33:35 +00:00

Author	SHA1	Message	Date
Avi Kivity	5bed6c7a7f	storage_proxy: avoid large allocation when storing batch in system.batchlog Currently, when computing the mutation to be stored in system.batchlog, we go through data_value. In turn this goes through `bytes` type (#24810), so it causes a large contiguous allocation if the batch is large. Fix by going through the more primitive, but less contiguous, atomic_cell API. Fixes #24809. Closes scylladb/scylladb#24811 (cherry picked from commit `60f407bff4`) Closes scylladb/scylladb#24845	2025-07-13 14:11:01 +03:00
Patryk Jędrzejczak	605106a9c6	Merge '[Backport 2025.2] Make it easier to debug stuck raft topology operation.' from Scylladb[bot] The series adds more logging and provides new REST api around topology command rpc execution to allow easier debugging of stuck topology operations. Backport since we want to have in the production as quick as possible. Fixes #24860 - (cherry picked from commit `c8ce9d1c60`) - (cherry picked from commit `4e6369f35b`) Parent PR: #24799 Closes scylladb/scylladb#24879 * https://github.com/scylladb/scylladb: topology coordinator: log a start and an end of topology coordinator command execution at info level topology coordinator: add REST endpoint to query the status of ongoing topology cmd rpc	2025-07-09 12:58:14 +02:00
Michael Litvak	9012357a4b	test: test_batchlog_manager: test batch replay when a node is down Add a test of the batchlog manager replay loop applying failed batches while some replica is down. The test reproduces an issue where the batchlog manager tries to replay a failed batch, doesn't get a response from some replica, and becomes stuck. It verifies that the batchlog manager can eventually recover from this situation and continue applying failed batches. (cherry picked from commit `a9b476e057`)	2025-07-08 06:25:03 +00:00
Michael Litvak	ba11e8ebdd	batchlog_manager: abort writes on shutdown On shutdown of batchlog manager, abort all writes of replayed batches by the batchlog manager. To achieve this we set the appropriate write_type to BATCH, and on shutdown cancel all write handlers with this type. (cherry picked from commit `7150632cf2`)	2025-07-08 06:25:02 +00:00
Michael Litvak	4e2b587b4d	batchlog_manager: create cancellable write response handler When replaying a batch mutation from the batchlog manager and sending it to all replicas, create the write response handler as cancellable. To achieve this we define a new wrapper type for batchlog mutations - batchlog_replay_mutation, and this allows us to overload create_write_response_handler for this type. This is similar to how it's done with hint_wrapper and read_repair_mutation. (cherry picked from commit `fc5ba4a1ea`)	2025-07-08 06:25:02 +00:00
Michael Litvak	f8b4d1c1cd	storage_proxy: add write type parameter to mutate_internal Currently mutate_internal has a boolean parameter `counter_write` that indicates whether the write is of counter type or not. We replace it with a more general parameter that allows to indicate the write type. It is compatible with the previous behavior - for a counter write, the type COUNTER is passed, and otherwise a default value will be used as before. (cherry picked from commit `8d48b27062`)	2025-07-08 06:25:02 +00:00
Gleb Natapov	71f59e046b	topology coordinator: log a start and an end of topology coordinator command execution at info level Those calls are relatively rare and the output may help to analyze issues in production. (cherry picked from commit `4e6369f35b`)	2025-07-08 06:23:48 +00:00
Gleb Natapov	ad91198417	topology coordinator: add REST endpoint to query the status of ongoing topology cmd rpc The topology coordinator executes several topology cmd rpc against some nodes during a topology change. A topology operation will not proceed unless rpc completes (successfully or not), but sometimes it appears that it hangs and it is hard to tell on which nodes it did not complete yet. Introduce new REST endpoint that can help with debugging such cases. If executed on the topology coordinator it returns currently running topology rpc (if any) and a list of nodes that did not reply yet. (cherry picked from commit `c8ce9d1c60`)	2025-07-08 06:23:48 +00:00
Aleksandra Martyniuk	cbce0ed911	test: add test for repair and resize finalization Add test that checks whether repair does not start if there is an ongoing resize finalization. (cherry picked from commit `83c9af9670`)	2025-07-01 20:26:21 +00:00
Gleb Natapov	31ed717afb	storage_proxy: retry paxos repair even if repair write succeeded After paxos state is repaired in begin_and_repair_paxos we need to re-check the state regardless if write back succeeded or not. This is how the code worked originally but it was unintentionally changed when co-routinized in `61b2e41a23`. Fixes #24630 Closes scylladb/scylladb#24651 (cherry picked from commit `5f953eb092`) Closes scylladb/scylladb#24703	2025-07-01 10:15:12 +02:00
Abhinav Jha	160c937efe	group0: modify `start_operation` logic to account for synchronize phase race condition In the present scenario, the bootstrapping node undergoes synchronize phase after initialization of group0, then enters post_raft phase and becomes fully ready for group0 operations. The topology coordinator is agnostic of this and issues stream ranges command as soon as the node successfully completes `join_group0`. Although for a node booting into an already upgraded cluster, the time duration for which, node remains in synchronize phase is negligible but this race condition causes trouble in a small percentage of cases, since the stream ranges operation fails and node fails to bootstrap. This commit addresses this issue and updates the error throw logic to account for this edge case and lets the node wait (with timeouts) for synchronize phase to get over instead of throwing error. A regression test is also added to confirm the working of this code change. The test adds a wait in synchronize phase for newly joining node and releases only after the program counter reaches the synchronize case in the `start_operation` function. Hence it indicates that in the updated code, the start_operation will wait for the node to get done with the synchronize phase instead of throwing error. This PR fixes a bug. Hence we need to backport it. Fixes: scylladb/scylladb#23536 Closes scylladb/scylladb#23829 (cherry picked from commit `5ff693eff6`) Closes scylladb/scylladb#24628	2025-07-01 10:10:55 +02:00
Raphael S. Carvalho	fa420f8644	replica: Fix truncate assert failure Truncate doesn't really go well with concurrent writes. The fix (#23560) exposed a preexisting fragility which I missed. 1) truncate gets RP mark X, truncated_at = second T 2) new sstable written during snapshot or later, also at second T (difference of MS) 3) discard_sstables() get RP Y > saved RP X, since creation time of sstable with RP Y is equal to truncated_at = second T. So the problem is that truncate is using a clock of second granularity for filtering out sstables written later, and after we got low mark and truncate time, it can happen that a sstable is flushed later within the same second, but at a different millisecond. By switching to a millisecond clock (db_clock), we allow sstables written later within the same second from being filtered out. It's not perfect but extremely unlikely a new write lands and get flushed in the same millisecond we recorded truncated_at timepoint. In practice, truncate will not be used concurrently to writes, so this should be enough for our tests performing such concurrent actions. We're moving away from gc_clock which is our cheap lowres_clock, but time is only retrieved when creating sstable objects, which frequency of creation is low enough for not having significant consequences, and also db_clock should be cheap enough since it's usually syscall-less. Fixes #23771. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#24426 (cherry picked from commit `2d716f3ffe`) Closes scylladb/scylladb#24435	2025-06-24 10:02:06 +03:00
Michael Litvak	305f827888	test/cluster/test_tablets: test restart during tablet cleanup Add a test that reproduces issue scylladb/scylladb#23481. The test migrates a tablet from one node to another, and while the tablet is in some stage of cleanup - either before or right after, depending on the parameter - the leaving replica, on which the tablet is cleaned, is restarted. This is interesting because when the leaving replica starts and loads its state, the tablet could be in different stages of cleanup - the SSTables may still exist or they may have been cleaned up already, and we want to make sure the state is loaded correctly. (cherry picked from commit `bd88ca92c8`)	2025-06-17 13:59:10 +00:00
Szymon Malewski	d65b390780	mapreduce_service: Prevent race condition In parallelized aggregation functions super-coordinator (node performing final merging step) receives and merges each partial result in parallel coroutines (`parallel_for_each`). Usually responses are spread over time and actual merging is atomic. However sometimes partial results are received at the similar time and if an aggregate function (e.g. lua script) yields, two coroutines can try to overwrite the same accumulator one after another, which leads to losing some of the results. To prevent this, in this patch each coroutine stores merging results in its own context and overwrites accumulator atomically, only after it was fully merged. Comparing to the previous implementation order of operands in merging function is swapped, but the order of aggregation is not guaranteed anyway. Fixes #20662 Closes scylladb/scylladb#24106 (cherry picked from commit `5969809607`) Closes scylladb/scylladb#24389	2025-06-06 08:49:15 +03:00
Petr Gusev	ffea5e67c1	raft_sys_table_storage: avoid temporary buffer when deserializing log_entry The get_blob() method linearizes data by copying it into a single buffer, which can trigger "oversized allocation" warnings. This commit avoids that extra copy by creating an input stream directly over the original fragmented managed bytes returned by untyped_result_set_row::get_view(). Fixes scylladb/scylladb#23903 (cherry picked from commit `f245b05022`)	2025-05-29 08:42:09 +00:00
Gleb Natapov	dd9ec03323	topology coordinator: make decommissioning node non voter before completing the operation A decommissioned node is removed from a raft config after operation is marked as completed. This is required since otherwise the decommissioned node will not see that decommission has completed (the status is propagated through raft). But right after the decommission is marked as completed a decommissioned node may terminate, so in case of a two node cluster, the configuration change that removes it from the raft will fail, because there will be no quorum. The solution is to mark the decommissioning node as non voter before reporting the operation as completed. Fixes: #24026 Backport to 2025.2 because it fixes a potential hang. Don't backport to branches older than 2025.2 because they don't have `8b186ab0ff`, which caused this issue. Closes scylladb/scylladb#24027 (cherry picked from commit `c6e1758457`) Closes scylladb/scylladb#24093	2025-05-16 11:49:46 +03:00
Piotr Dulikowski	4792a27396	topology_coordinator: silence ERROR messages on abort When the topology coordinator is shut down while doing a long-running operation, the current operation might throw a raft::request_aborted exception. This is not a critical issue and should not be logged with ERROR verbosity level. Make sure that all the try..catch blocks in the topology coordinator which: - May try to acquire a new group0 guard in the `try` part - Have a `catch (...)` block that print an ERROR-level message ...have a pass-through `catch (raft::request_aborted&)` block which does not log the exception. Fixes: scylladb/scylladb#22649 Closes scylladb/scylladb#23962 (cherry picked from commit `156ff8798b`) Closes scylladb/scylladb#24082	2025-05-16 11:48:43 +03:00
Aleksandra Martyniuk	f26c2b22dc	test_tablet_repair_hosts_filter: change injected error test_tablet_repair_hosts_filter checks whether the host filter specified for tablet repair is correctly persisted. To check this, we need to ensure that the repair is still ongoing and its data is kept. The test achieves that by failing the repair on replica side - as the failed repair is going to be retried. However, if the filter does not contain any host (included_host_count = 0), the repair is started on no replica, so the request succeeds and its data is deleted. The test fails if it checks the filter after repair request data is removed. Fail repair on topology coordinator side, so the request is ongoing regardless of the specified hosts. Fixes: #23986. Closes scylladb/scylladb#24003 (cherry picked from commit `2549f5e16b`) Closes scylladb/scylladb#24080	2025-05-16 11:48:27 +03:00
Aleksandra Martyniuk	fcde30d2b0	streaming: use host_id in file streaming Use host ids instead of ips in file-streaming. Fixes: #22421. Closes scylladb/scylladb#24055 (cherry picked from commit `2dcea5a27d`) Closes scylladb/scylladb#24119	2025-05-14 22:13:48 +02:00
Gleb Natapov	827563902c	test: add reproducer for #22777 Add sleep before starting gossiper to increase a chance of getting old gossiper entry about yourself before updating local gossiper info with new IP address. (cherry picked from commit `7403de241c`)	2025-05-09 12:56:15 +00:00
Gleb Natapov	ccf194bd89	storage_service: Do not remove gossiper entry on address change When gossiper indexed entries by ip, an old entry had to be removed on an address change, but the index is now id based, so even if the ip was changed the entry should stay. Gossiper simply updates an ip address there. (cherry picked from commit `ecd14753c0`)	2025-05-09 12:56:15 +00:00
Gleb Natapov	9b735bb4dc	storage_service: use id to check for local node IP may change and an old gossiper message with previous IP may be processed when it shouldn't. Fixes: #22777 (cherry picked from commit `a2178b7c31`)	2025-05-09 12:56:15 +00:00
Emil Maskovsky	24dfd2034b	raft: ensure topology coordinator retains votership The limited voters feature did not account for the existing topology coordinator (Raft leader) when selecting voters to be removed. As a result, the limited voters calculator could inadvertently remove the votership of the current topology coordinator, triggering an unnecessary Raft leader re-election. This change ensures that the existing topology coordinator's votership status is preserved unless absolutely necessary. When choosing between otherwise equivalent voters, the node other than the topology coordinator is prioritized for removal. This helps maintain stability in the cluster by avoiding unnecessary leader re-elections. Additionally, only the alive leader node is considered relevant for this logic. A dead existing leader (topology coordinator) is excluded from consideration, as it is already in the process of losing leadership. Fixes: scylladb/scylladb#23588 Fixes: scylladb/scylladb#23786	2025-05-05 16:58:34 +02:00
Emil Maskovsky	2ae59e8a87	raft: retain existing voters across data centers and racks Fix an issue in the voter calculator where existing voters were not retained across data centers and racks in certain scenarios. This occurred when voters were distributed across more data centers and racks than the maximum allowed number of voters. Previously, the prioritization logic for data centers and racks did not consider the number of existing assigned voters. It only prioritized nodes within a single data center or rack, which could result in unnecessary reassignment of voters. Improved the prioritization logic to account for the number of existing voters in each data center and rack. This change ensures a more stable voter distribution and reduces unnecessary voter reassignments. Fixes: scylladb/scylladb#23950	2025-05-05 16:51:48 +02:00
Emil Maskovsky	018fb63305	raft: refactor limited voters calculator to prioritize nodes Refactor the limited voters calculator to use a priority queue for sorting nodes by their priorities. This change simplifies the voter selection logic and makes it more extensible for future enhancements, such as supporting more complex priority calculations. The priority value is determined based on the node's existing status, including whether it is alive, a voter, or any further criteria.	2025-05-05 16:36:17 +02:00
Emil Maskovsky	26fdc7b8f8	raft: replace pointer with reference for non-null output parameter The output parameter cannot be `null`. Previously, a pointer was used to make it explicit that the parameter is an output parameter being modified. However, this is unnecessary, as references are more appropriate for parameters that cannot be `null`. Switching to a reference improves code readability and ensures the parameter's non-null constraint is enforced at the type level.	2025-05-05 16:12:00 +02:00
Emil Maskovsky	f0468860a3	raft: reduce code duplication in group0 voter handler Refactor the group0 voter handler by introducing a helper lambda to handle the common logic for adding a node. This eliminates unnecessary code duplication. This refactor does not introduce any functional changes but prepares the codebase for easier future modifications.	2025-05-05 16:09:53 +02:00
Emil Maskovsky	2ef654149f	raft: unify and optimize datacenter and rack info creation Refactor the code to use a consistent pattern for creating the datacenter info list and the rack info list. Both now use a map of vectors, which improves efficiency by reducing temporary conversions to maps/sets during node list processing. Also ensure the node descriptor is passed by reference instead of by copy, leveraging the guaranteed lifetime of the descriptors.	2025-05-05 15:15:17 +02:00
Pavel Emelyanov	7b786d9398	topology_coordinator: Use this->_feature_service directly This dependency is already there, topology coordinator doesn't need to use database reference to get to the features. Previous patch of the same kind: `b79137eaa4` Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23777	2025-05-05 09:37:29 +02:00
Aleksandra Martyniuk	1f4edd8683	test_tablet_tasks: use injection to revoke resize Currently, test_tablet_resize_revoked tries to trigger split revoke by deleting some rows. This method isn't deterministic and so a test is flaky. Use error injection to trigger resize revoke. Fixes: #22570. Closes scylladb/scylladb#23966	2025-04-30 07:04:57 +03:00
Patryk Jędrzejczak	0cdcf82cd0	Merge 'topology coordinator: do not proceed further on invalid boostrap tokens' from Piotr Dulikowski In case when dht::boot_strapper::get_boostrap_tokens fail to parse the tokens, the topology coordinator handles the exception and schedules a rollback. However, the current code tries to continue with the topology coordinator logic even if an exception occurs, leaving boostrap_tokens empty. This does not make sense and can actually cause issues, specifically in prepare_and_broadcast_cdc_generation_data which implicitly expect that the bootstrap_tokens of the first node in the cluster will not be empty. Fix this by adding the missing break. Fixes: scylladb/scylladb#23897 From the code inspection alone it looks like 2025.1 and 6.2 have this problem, so marking for backport to both of them. Closes scylladb/scylladb#23914 * https://github.com/scylladb/scylladb: test: cluster: add test_bad_initial_token topology coordinator: do not proceed further on invalid boostrap tokens cdc: add sanity check for generating an empty generation	2025-04-28 12:45:33 +02:00
Botond Dénes	d582c436e5	Merge 'tasks: check whether a node is alive before rpc' from Aleksandra Martyniuk Check whether a node is alive before making an rpc that gathers children infos from the whole cluster in virtual_task::impl::get_children. Fixes: https://github.com/scylladb/scylladb/issues/22514. Needs backport to 2025.1 and 6.2 as they contain the bug. Closes scylladb/scylladb#23787 * github.com:scylladb/scylladb: test: add test for getting tasks children tasks: check whether a node is alive before rpc	2025-04-28 09:32:45 +03:00
Piotr Dulikowski	845cedea7f	topology coordinator: do not proceed further on invalid boostrap tokens In case when dht::boot_strapper::get_boostrap_tokens fail to parse the tokens, the topology coordinator handles the exception and schedules a rollback. However, the current code tries to continue with the topology coordinator logic even if an exception occurs, leaving boostrap_tokens empty. This does not make sense and can actually cause issues, specifically in prepare_and_broadcast_cdc_generation_data which implicitly expect that the bootstrap_tokens of the first node in the cluster will not be empty. Fix this by adding the missing break. Fixes: scylladb/scylladb#23897	2025-04-25 11:30:01 +02:00
Wojciech Mitros	d77f11d436	base_info: remove the lw_shared_ptr variant The base_dependent_view_info is no longer needed to be shared or modified in the view_info, so we no longer need to keep it as a shared pointer.	2025-04-24 01:08:40 +02:00
Wojciech Mitros	05fce91945	schema_registry: store base info instead of base schema for view entries In the following patch we plan to remove the base schema from the base_info to make the base_info immutable. To do that, we first prepare the schema registry for the change; we need to be able to create view schemas from frozen schemas there and frozen schemas have no information about the base table. Unless we do this change, after base schemas are removed from the base info, we'll no longer be able to load a view schema to the schema registry without looking up the base schema in the database. This change also required some updates to schema building: * we add a method for unfreezing a view schema with base info instead of a base schema * we make it possible to use schema_builder with a base info instead of a base schema * we add a method for creating a view schema from mutations with a base info instead of a base schema * we add a view_info constructor with a base info instead of a base schema * we update the naming in schema_registry to reflect the usage of base info instead of base schema	2025-04-24 01:08:39 +02:00
Pavel Emelyanov	eb5b52f598	Merge 'main: make DC and rack immutable after bootstrap' from Piotr Dulikowski Changing DC or rack on a node which was already bootstrapped is, in case of vnodes, very unsafe (almost guaranteed to cause data loss or unavailability), and is outright not supported if the cluster has a tablet-backed keyspaces. Moreover, the possibility of doing that makes it impossible to uphold some of the invariants promised by the RF-rack-valid flag, which is eventually going to become unconditionally enabled. Get rid of the above problems by removing the possibility of changing the DC / rack of a node. A node will now fail to start if its snitch reports a different DC or rack than the one that was reported during the first boot. Fixes: scylladb/scylladb#23278 Fixes: scylladb/scylladb#22869 Marking for backport to 2025.1, as this is a necessary part of the RF-rack-valid saga Closes scylladb/scylladb#23800 * github.com:scylladb/scylladb: doc: changing topology when changing snitches is no longer supported test: cluster: introduce test_no_dc_rack_change storage_service: don't update DC/rack in update_topology_with_local_metadata main: make dc and rack immutable after bootstrap test: cluster: remove test_snitch_change	2025-04-21 15:52:55 +03:00
Piotr Dulikowski	1791ae3581	storage_service: don't update DC/rack in update_topology_with_local_metadata The DC/rack are now immutable and cannot be changed after restart, so there is no need to update the node's system.topology entry with this information on restart.	2025-04-17 16:22:58 +02:00
Aleksandra Martyniuk	53e0f79947	tasks: check whether a node is alive before rpc Check whether a node is alive before making an rpc that gathers children infos from the whole cluster in virtual_task::impl::get_children.	2025-04-17 12:51:22 +02:00
Botond Dénes	8ac7c54d8b	Merge 'topology_coordinator: stop: await all background_action_holder:s' from Benny Halevy Add missing awaits for the rebuild_repair and repair background actions. Although the background actions hold the _async_gate which is closed in topology_coordinator::run(), stop() still needs to await all background action futures and handle any errors they may have left behind. Fixes #23755 * The issue exists since 6.2 Closes scylladb/scylladb#17712 * github.com:scylladb/scylladb: topology_coordinator: stop: await all background_action_holder:s topology_coordinator: stop: improve error messages topology_coordinator: stop: define stop_background_action helper	2025-04-17 12:10:29 +03:00
Kefu Chai	a33651b03e	db, service: do not include unused header these unused headers were flagged by clang-include-cleaner. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23735	2025-04-17 11:49:59 +03:00
Benny Halevy	7a0f5e0a54	topology_coordinator: stop: await all background_action_holder:s Add missing awaits for the rebuild_repair and repair background actions. Although the background actions hold the _async_gate which is closed in topology_coordinator::run(), stop() still needs to await all background action futures and handle any errors they may have left behind. Fixes #23755 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-16 15:23:02 +03:00
Benny Halevy	6de79d0dd3	topology_coordinator: stop: improve error messages "when cleanup" is ill-formed. Change "when XYZ" to "during XYZ" instead. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-16 15:20:58 +03:00
Benny Halevy	d624795fda	topology_coordinator: stop: define stop_background_action helper Refactor the code to use a helper to await background_action_holder and handle any errors by printing a warning. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-16 15:20:39 +03:00
Botond Dénes	f5125ffa18	Merge 'Ensure raft group0 RPCs use the gossip scheduling group.' from Sergey Zolotukhin Scylla operations use concurrency semaphores to limit the number of concurrent operations and prevent resource exhaustion. The semaphore is selected based on the current scheduling group. For RAFT group operations, it is essential to use a system semaphore to avoid queuing behind user operations. This patch ensures that RAFT operations use the `gossip` scheduling group to leverage the system semaphore. Fixes scylladb/scylladb#21637 Backport: 6.2 and 6.1 Closes scylladb/scylladb#22779 * github.com:scylladb/scylladb: Ensure raft group0 RPCs use the gossip scheduling group Move RAFT operations verbs to GOSSIP group.	2025-04-16 09:11:29 +03:00
Tomasz Grabiec	001d3b2415	Merge 'storage_service: preserve state of busy topology when transiting tablet' from Łukasz Paszkowski Commit `876478b84f` ("storage_service: allow concurrent tablet migration in tablets/move API", 2024-02-08) introduced a code path on which the topology state machine would be busy -- in "tablet_draining" or "tablet_migration" state -- at the time of starting tablet migration. The pre-commit code would unconditionally transition the topology to "tablet_migration" state, assuming the topology had been idle previously. On the new code path, this state change would be idempotent if the topology state machine had been busy in "tablet_migration", but the state change would incorrectly overwrite the "tablet_draining" state otherwise. Restrict the state change to when the topology state machine is idle. In addition, add the topology update to the "updates" vector with plain push_back(). emplace_back() is not helpful here, as topology_mutation_builder::build() cannot construct in-place, and so we invoke the "canonical_mutation" move constructor once, either way. Unit test: Start a two node cluster. Create a single tablet on one of the nodes. Start decommissioning that node, but block decommissioning at once. In that state (i.e., in "tablet_draining"), move the tablet manually to the other node. Check that transit_tablet() leaves the topology transition state alone. Fixes https://github.com/scylladb/scylladb/issues/20073. Commit `876478b84f` was first released in scylla-6.0.0, so we might want to backport this patch accordingly. Closes scylladb/scylladb#23751 * github.com:scylladb/scylladb: storage_service: add unit test for mid-decommission transit_tablet() storage_service: preserve state of busy topology when transiting tablet	2025-04-16 00:19:24 +02:00
Pavel Emelyanov	b79137eaa4	storage_service: Use this->_features directly This dependency is already there, storage service doesn't need to go rounds via database reference to get to the features. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23739	2025-04-15 21:11:12 +03:00
Laszlo Ersek	841ca652a0	storage_service: add unit test for mid-decommission transit_tablet() Start a two node cluster. Create a single tablet on one of the nodes. Start decommissioning that node, but block decommissioning at once. In that state (i.e., in "tablet_draining"), move the tablet manually to the other node. Check that transit_tablet() leaves the topology transition state alone. Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>	2025-04-15 15:15:25 +02:00
Laszlo Ersek	e1186f0ae6	storage_service: preserve state of busy topology when transiting tablet Commit `876478b84f` ("storage_service: allow concurrent tablet migration in tablets/move API", 2024-02-08) introduced a code path on which the topology state machine would be busy -- in "tablet_draining" or "tablet_migration" state -- at the time of starting tablet migration. The pre-commit code would unconditionally transition the topology to "tablet_migration" state, assuming the topology had been idle previously. On the new code path, this state change would be idempotent if the topology state machine had been busy in "tablet_migration", but the state change would incorrectly overwrite the "tablet_draining" state otherwise. Restrict the state change to when the topology state machine is idle. In addition, add the topology update to the "updates" vector with plain push_back(). emplace_back() is not helpful here, as topology_mutation_builder::build() cannot construct in-place, and so we invoke the "canonical_mutation" move constructor once, either way. Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>	2025-04-15 13:44:45 +02:00
Emil Maskovsky	3930ee8e3c	raft: fix data center remaining nodes initialization The `_remaining_nodes` attribute of the data center information was not initialized correctly. The parameter was passed by value to the initialization function instead of by reference or pointer. As a result, `_remaining_nodes` was left initialized to zero, causing an underflow when decrementing its value. This bug did not significantly impact behavior because other safeguards, such as capping the maximum voters per data center by the total number of nodes, masked the issue. However, it could lead to inefficiencies, as the remaining nodes check would not trigger correctly. Fixes: scylladb/scylladb#23702 No backport: The bug is only present in the master branch, so no backport is required. Closes scylladb/scylladb#23704	2025-04-15 09:58:32 +02:00
Nadav Har'El	fbcf77d134	raft: make group0 Raft operation timeout configurable A recent commit `370707b111` (re)introduced a timeout for every group0 Raft operation. This timeout was set to 60 seconds, which, paraphrasing Bill Gates, "ought to be enough for anybody". However, one of the things we do as a group0 operation is schema changes, and we already noticed a few years ago, see commit `0b2cf21932`, that in some extremely overloaded test machines where tests run hundreds of times (!) slower than usual, a single big schema operation - such as Alternator's DeleteTable deleting a table and multiple of its CDC or view tables - sometimes takes more than 60 seconds. The above fix changed the client's timeout to wait for 300 seconds instead of 60 seconds, but now we also need to increase our Raft timeout, or the server can time out. We've seen this happening recently making some tests flaky in CI (issue #23543). So let's make this timeout configurable, as a new configuration option group0_raft_op_timeout_in_ms. This option defaults to 60000 (i.e., 60 seconds), the same as the existing default. The test framework overrides this default with a higher 300 second timeout, matching the client-side timeout. Before this patch, this timeout was already configurable in a strange way, using injections. But this was a misstep: We already have more than a dozen timeouts configurable through the normal configuration, and this one should have been configured in the same way. There is nothing "holy" about the default of 60 seconds we chose, and who knows maybe in the future we might need to tweak it in the field, just like we made the other timeouts tweakable. Injections cannot be used in release mode, but configuration options can. Fixes #23543 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23717	2025-04-15 10:57:39 +03:00
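
The last commit above (fbcf77d134) describes a timeout that defaults to 60000 ms and that the test framework overrides with 300 seconds. A minimal sketch of that default-plus-override behaviour follows. Only the option name group0_raft_op_timeout_in_ms and the two values come from the commit message; the helper function and the dict-based config are hypothetical illustrations, not ScyllaDB code.

```python
# Hypothetical illustration, not ScyllaDB code: resolve the group0 Raft
# operation timeout from a dict of configuration overrides.

# Default stated in the commit message (60 seconds).
DEFAULT_GROUP0_RAFT_OP_TIMEOUT_MS = 60_000


def group0_raft_op_timeout_ms(config: dict) -> int:
    """Return the configured timeout in milliseconds, or the default."""
    value = config.get(
        "group0_raft_op_timeout_in_ms", DEFAULT_GROUP0_RAFT_OP_TIMEOUT_MS
    )
    if value <= 0:
        raise ValueError("group0_raft_op_timeout_in_ms must be positive")
    return int(value)


# No override: the 60 second default applies.
production_timeout = group0_raft_op_timeout_ms({})

# Override matching the 300 second client-side timeout used by the tests.
test_timeout = group0_raft_op_timeout_ms(
    {"group0_raft_op_timeout_in_ms": 300_000}
)

print(production_timeout, test_timeout)
```

The point of the sketch is the one made in the commit message: a plain configuration option works in any build mode, whereas an injection-based override is unavailable in release mode.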


5377 Commits