scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-26 19:35:12 +00:00

Author	SHA1	Message	Date
Botond Dénes	22942c0a85	Merge '[Backport 2025.2] Raft-based recovery procedure: simplify rolling restart with recovery_leader' from Scylladb[bot] The following steps are performed in sequence as part of the Raft-based recovery procedure: - set `recovery_leader` to the host ID of the recovery leader in `scylla.yaml` on all live nodes, - send the `SIGHUP` signal to all Scylla processes to reload the config, - perform a rolling restart (with the recovery leader being restarted first). These steps are not intuitive and more complicated than they could be. In this PR, we simplify these steps. From now on, we will be able to simply set `recovery_leader` on each node just before restarting it. Apart from making necessary changes in the code, we also update all tests of the Raft-based recovery procedure and the user-facing documentation. Fixes scylladb/scylladb#25015 The Raft-based procedure was added in 2025.2. This PR makes the procedure simpler and less error-prone, so it should be backported to 2025.2 and 2025.3. - (cherry picked from commit `ec69028907`) - (cherry picked from commit `445a15ff45`) - (cherry picked from commit `23f59483b6`) - (cherry picked from commit `ba5b5c7d2f`) - (cherry picked from commit `9e45e1159b`) - (cherry picked from commit `f408d1fa4f`) Parent PR: #25032 Closes scylladb/scylladb#25334 * github.com:scylladb/scylladb: docs: document the option to set recovery_leader later test: delay setting recovery_leader in the recovery procedure tests gossip: add recovery_leader to gossip_digest_syn db: system_keyspace: peers_table_read_fixup: remove rows with null host_id db/config, gms/gossiper: change recovery_leader to UUID db/config, utils: allow using UUID as a config option	2025-08-06 09:41:17 +03:00
Michał Jadwiszczak	b58543dab7	storage_service, group0_state_machine: move SL cache update from `topology_state_load()` to `load_snapshot()` Currently the service levels cache is unnecessarily updated in every call of `topology_state_load()`. But it is enough to reload it only when a snapshot is loaded. (The cache is also already updated when there is a change to one of `service_levels_v2`, `role_members`, `role_attributes` tables.) Fixes scylladb/scylladb#25114 Fixes scylladb/scylladb#23065 Closes scylladb/scylladb#25116 (cherry picked from commit `10214e13bd`) Closes scylladb/scylladb#25304	2025-08-06 09:39:55 +03:00
Patryk Jędrzejczak	98e3b5e9b5	db/config, gms/gossiper: change recovery_leader to UUID We change the type of the `recovery_leader` config parameter and `gossip_config::recovery_leader` from sstring to UUID. `recovery_leader` is supposed to store host ID, so UUID is a natural choice. After changing the type to UUID, if the user provides an incorrect UUID, parsing `recovery_leader` will fail early, but the start-up will continue. Outside the recovery procedure, `recovery_leader` will then be ignored. In the recovery procedure, the start-up will fail on: ``` throw std::runtime_error( "Cannot start - Raft-based topology has been enabled but persistent group 0 ID is not present. " "If you are trying to run the Raft-based recovery procedure, you must set recovery_leader."); ``` (cherry picked from commit `445a15ff45`)	2025-08-05 10:59:06 +00:00
Tomasz Grabiec	1dfb9d23ea	topology_coordinator: Trigger load stats refresh after replace Otherwise, tablet rebuild will be delayed for up to 60s, as the tablet scheduler needs load stats for the new (replacing) node to make decisions. Fixes #25163 Closes scylladb/scylladb#25181 (cherry picked from commit `55116ee660`) Closes scylladb/scylladb#25214	2025-08-02 01:26:59 +02:00
Piotr Dulikowski	618459125d	qos: don't populate effective service level cache until auth is migrated to raft Right now, service levels are migrated in one group0 command and auth is migrated in the next one. This has a bad effect on the group0 state reload logic - modifying service levels in group0 causes the effective service levels cache to be recalculated, and to do so we need to fetch information about all roles. If the reload happens after SL upgrade and before auth upgrade, the query for roles will be directed to the legacy auth tables in system_auth - and the query, being a potentially remote query, has a timeout. If the query times out, it will throw an exception which will break the group0 apply fiber and the node will need to be restarted to bring it back to work. In order to solve this issue, make sure that the service level module does not start populating and using the service level cache until both service levels and auth are migrated to raft. This is achieved by adding the check both to the cache population logic and the effective service level getter - they now look at service level's accessor new method, `can_use_effective_service_level_cache` which takes a look at the auth version. Fixes: scylladb/scylladb#24963 (cherry picked from commit `2bb800c004`)	2025-07-31 15:13:23 +00:00
Pavel Emelyanov	95b906bea9	Merge '[Backport 2025.2] storage_service: cancel all write requests after stopping transports' from Scylladb[bot] When a node shuts down, in storage service, after storage_proxy RPCs are stopped, some write handlers within storage_proxy may still be waiting for background writes to complete. These handlers hold appropriate ERMs to block schema changes before the write finishes. After the RPCs are stopped, these writes cannot receive the replies anymore. If, at the same time, there are RPC commands executing `barrier_and_drain`, they may get stuck waiting for these ERM holders to finish, potentially blocking node shutdown until the writes time out. This change introduces cancellation of all outstanding write handlers from storage_service after the storage proxy RPCs were stopped. Fixes scylladb/scylladb#23665 Backport: since this fixes an issue that frequently causes issues in CI, backport to 2025.1, 2025.2, and 2025.3. - (cherry picked from commit `bc934827bc`) - (cherry picked from commit `e0dc73f52a`) Parent PR: #24714 Closes scylladb/scylladb#25169 * github.com:scylladb/scylladb: storage_service: Cancel all write requests on storage_proxy shutdown test: Add test for unfinished writes during shutdown and topology change	2025-07-28 09:25:15 +03:00
Pavel Emelyanov	8622a07bdd	Merge '[Backport 2025.2] streaming: Avoid deadlock by running view checks in a separate scheduling group' from Scylladb[bot] This issue happens with removenode, when RBNO is disabled, so range streamer is used. The deadlock happens in a scenario like this: 1. Start 3 nodes: {A, B, C}, RF=2 2. Node A is lost 3. removenode A 4. Both B and C gain ownership of ranges. 5. Streaming sessions are started with crossed directions: B->C, C->B Readers created by sender side exhaust streaming semaphore on B and C. Receiver side attempts to obtain a permit indirectly by calling check_needs_view_update_path(), which reads local tables. That read is blocked and times-out, causing streaming to fail. The streaming writer is already using a tracking-only permit. Even if we didn't deadlock, and the streaming semaphore was simply exhausted by other receiving sessions (via tracking-only permit), the query may still time-out due to starvation. To avoid that, run the query under a different scheduling group, which translates to the system semaphore instead of the maintenance semaphore, to break the dependency. The gossip group was chosen because it shouldn't be contended and this change should not interfere with it much. Fixes #24807 Fixes #24925 - (cherry picked from commit `ee2fa58bd6`) - (cherry picked from commit `dff2b01237`) Parent PR: #24929 Closes scylladb/scylladb#25055 * github.com:scylladb/scylladb: streaming: Avoid deadlock by running view checks in a separate scheduling group service: migration_manager: Run group0 barrier in gossip scheduling group	2025-07-28 09:24:53 +03:00
Sergey Zolotukhin	f15df0bcce	storage_service: Cancel all write requests on storage_proxy shutdown During a graceful node shutdown, RPC listeners are stopped in `storage_service::drain_on_shutdown` as one of the first steps. However, even after RPCs are shut down, some write handlers in `storage_proxy` may still be waiting for background writes to complete. These handlers retain the ERM. Since the RPC subsystem is no longer active, replies cannot be received, and if any RPC commands are concurrently executing `barrier_and_drain`, they may get stuck waiting for those writes. This can block the messaging server shutdown and delay the entire shutdown process until the write timeout occurs. This change introduces the cancellation of all outstanding write handlers in `storage_proxy` during shutdown to prevent unnecessary delays. Fixes scylladb/scylladb#23665 (cherry picked from commit `e0dc73f52a`)	2025-07-24 13:02:56 +00:00
Sergey Zolotukhin	487012e972	test: Add test for unfinished writes during shutdown and topology change This test reproduces an issue where a topology change and an ongoing write query during query coordinator shutdown can cause the node to get stuck. When a node receives a write request, it creates a write handler that holds a copy of the current table's ERM (Effective Replication Map). The ERM ensures that no topology or schema changes occur while the request is being processed. After the query coordinator receives the required number of replica write ACKs to satisfy the consistency level (CL), it sends a reply to the client. However, the write response handler remains alive until all replicas respond — the remaining writes are handled in the background. During shutdown, when all network connections are closed, these responses can no longer be received. As a result, the write response handler is only destroyed once the write timeout is reached. This becomes problematic because the ERM held by the handler blocks topology or schema change commands from executing. Since shutdown waits for these commands to complete, this can lead to unnecessary delays in node shutdown and restarts, and occasional test case failures. Test for: scylladb/scylladb#23665 (cherry picked from commit `bc934827bc`)	2025-07-24 13:02:56 +00:00
Benny Halevy	390ca79ae4	token_metadata: move make_token_metadata_ptr into shared_token_metadata class So we can use the local shared_token_metadata instance for safe background destroy of token_metadata_impl:s. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `e0a19b981a`)	2025-07-21 09:36:40 +03:00
Tomasz Grabiec	23e365fc7b	service: migration_manager: Run group0 barrier in gossip scheduling group Fixes two issues. One is potential priority inversion. The barrier will be executed using scheduling group of the first fiber which triggers it, the rest will block waiting on it. For example, CQL statements which need to sync the schema on replica side can block on the barrier triggered by streaming. That's undesirable. This is theoretical, not proved in the field. The second problem is blocking the error path. This barrier is called from the streaming error handling path. If the streaming concurrency semaphore is exhausted, and streaming fails due to timeout on obtaining the permit in check_needs_view_update_path(), the error path will block too because it will also attempt to obtain the permit as part of the group0 barrier. Running it in the gossip scheduling group prevents this. Fixes #24925 (cherry picked from commit `ee2fa58bd6`)	2025-07-17 17:25:10 +00:00
Asias He	67375ecf14	storage_service: Use utils::chunked_vector to avoid big allocation The following was seen: ``` !WARNING \| scylla[6057]: [shard 12:strm] seastar_memory - oversized allocation: 212992 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at [Backtrace #0] void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89 (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:99 seastar::current_tasktrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:136 seastar::current_backtrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:169 seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:848 seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:911 operator new(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:1706 std::allocator<dht::token_range_endpoints>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/allocator.h:196 (inlined by) std::allocator_traits<std::allocator<dht::token_range_endpoints> >::allocate(std::allocator<dht::token_range_endpoints>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/alloc_traits.h:515 (inlined by) std::_Vector_base<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:380 (inlined by) void std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> 
>::_M_realloc_append<dht::token_range_endpoints const&>(dht::token_range_endpoints const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/vector.tcc:596 locator::describe_ring(replica::database const&, gms::gossiper const&, seastar::basic_sstring<char, unsigned int, 15u, true> const&, bool) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:1294 std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242 (inlined by) seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:80 seastar::reactor::do_run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:2635 std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:4684 ``` Fix by using chunked_vector. Fixes #24158 Closes scylladb/scylladb#24561 (cherry picked from commit `c5a136c3b5`)	2025-07-16 07:43:39 +08:00
Michael Litvak	15517ba529	tablets: stop storage group on deallocation When a tablet transitions to a post-cleanup stage on the leaving replica we deallocate its storage group. Before the storage can be deallocated and destroyed, we must make sure it's cleaned up and stopped properly. Normally this happens during the tablet cleanup stage, when table::cleanup_table is called, so by the time we transition to the next stage the storage group is already stopped. However, it's possible that tablet cleanup did not run in some scenario: 1. The topology coordinator runs tablet cleanup on the leaving replica. 2. The leaving replica is restarted. 3. When the leaving replica starts, still in `cleanup` stage, it allocates a storage group for the tablet. 4. The topology coordinator moves to the next stage. 5. The leaving replica deallocates the storage group, but it was not stopped. To address this scenario, we always stop the storage group when deallocating it. Usually it will be already stopped and complete immediately, and otherwise it will be stopped in the background. Fixes scylladb/scylladb#24857 Fixes scylladb/scylladb#24828 Closes scylladb/scylladb#24896 (cherry picked from commit `fa24fd7cc3`) Closes scylladb/scylladb#24908	2025-07-15 13:25:38 +03:00
Avi Kivity	5bed6c7a7f	storage_proxy: avoid large allocation when storing batch in system.batchlog Currently, when computing the mutation to be stored in system.batchlog, we go through data_value. In turn this goes through `bytes` type (#24810), so it causes a large contiguous allocation if the batch is large. Fix by going through the more primitive, but less contiguous, atomic_cell API. Fixes #24809. Closes scylladb/scylladb#24811 (cherry picked from commit `60f407bff4`) Closes scylladb/scylladb#24845	2025-07-13 14:11:01 +03:00
Patryk Jędrzejczak	605106a9c6	Merge '[Backport 2025.2] Make it easier to debug stuck raft topology operation.' from Scylladb[bot] The series adds more logging and provides a new REST API around topology command RPC execution to allow easier debugging of stuck topology operations. Backport since we want to have it in production as quickly as possible. Fixes #24860 - (cherry picked from commit `c8ce9d1c60`) - (cherry picked from commit `4e6369f35b`) Parent PR: #24799 Closes scylladb/scylladb#24879 * https://github.com/scylladb/scylladb: topology coordinator: log a start and an end of topology coordinator command execution at info level topology coordinator: add REST endpoint to query the status of ongoing topology cmd rpc	2025-07-09 12:58:14 +02:00
Michael Litvak	9012357a4b	test: test_batchlog_manager: test batch replay when a node is down Add a test of the batchlog manager replay loop applying failed batches while some replica is down. The test reproduces an issue where the batchlog manager tries to replay a failed batch, doesn't get a response from some replica, and becomes stuck. It verifies that the batchlog manager can eventually recover from this situation and continue applying failed batches. (cherry picked from commit `a9b476e057`)	2025-07-08 06:25:03 +00:00
Michael Litvak	ba11e8ebdd	batchlog_manager: abort writes on shutdown On shutdown of batchlog manager, abort all writes of replayed batches by the batchlog manager. To achieve this we set the appropriate write_type to BATCH, and on shutdown cancel all write handlers with this type. (cherry picked from commit `7150632cf2`)	2025-07-08 06:25:02 +00:00
Michael Litvak	4e2b587b4d	batchlog_manager: create cancellable write response handler When replaying a batch mutation from the batchlog manager and sending it to all replicas, create the write response handler as cancellable. To achieve this we define a new wrapper type for batchlog mutations - batchlog_replay_mutation, and this allows us to overload create_write_response_handler for this type. This is similar to how it's done with hint_wrapper and read_repair_mutation. (cherry picked from commit `fc5ba4a1ea`)	2025-07-08 06:25:02 +00:00
Michael Litvak	f8b4d1c1cd	storage_proxy: add write type parameter to mutate_internal Currently mutate_internal has a boolean parameter `counter_write` that indicates whether the write is of counter type or not. We replace it with a more general parameter that allows to indicate the write type. It is compatible with the previous behavior - for a counter write, the type COUNTER is passed, and otherwise a default value will be used as before. (cherry picked from commit `8d48b27062`)	2025-07-08 06:25:02 +00:00
Gleb Natapov	71f59e046b	topology coordinator: log a start and an end of topology coordinator command execution at info level Those calls are relatively rare and the output may help to analyze issues in production. (cherry picked from commit `4e6369f35b`)	2025-07-08 06:23:48 +00:00
Gleb Natapov	ad91198417	topology coordinator: add REST endpoint to query the status of ongoing topology cmd rpc The topology coordinator executes several topology cmd rpc against some nodes during a topology change. A topology operation will not proceed unless rpc completes (successfully or not), but sometimes it appears that it hangs and it is hard to tell on which nodes it did not complete yet. Introduce new REST endpoint that can help with debugging such cases. If executed on the topology coordinator it returns currently running topology rpc (if any) and a list of nodes that did not reply yet. (cherry picked from commit `c8ce9d1c60`)	2025-07-08 06:23:48 +00:00
Aleksandra Martyniuk	cbce0ed911	test: add test for repair and resize finalization Add test that checks whether repair does not start if there is an ongoing resize finalization. (cherry picked from commit `83c9af9670`)	2025-07-01 20:26:21 +00:00
Gleb Natapov	31ed717afb	storage_proxy: retry paxos repair even if repair write succeeded After paxos state is repaired in begin_and_repair_paxos we need to re-check the state regardless if write back succeeded or not. This is how the code worked originally but it was unintentionally changed when co-routinized in `61b2e41a23`. Fixes #24630 Closes scylladb/scylladb#24651 (cherry picked from commit `5f953eb092`) Closes scylladb/scylladb#24703	2025-07-01 10:15:12 +02:00
Abhinav Jha	160c937efe	group0: modify `start_operation` logic to account for synchronize phase race condition Currently, the bootstrapping node goes through the synchronize phase after group0 is initialized, then enters the post_raft phase and becomes fully ready for group0 operations. The topology coordinator is unaware of this and issues the stream ranges command as soon as the node successfully completes `join_group0`. For a node booting into an already upgraded cluster, the time the node spends in the synchronize phase is negligible, but this race condition still causes trouble in a small percentage of cases: the stream ranges operation fails and the node fails to bootstrap. This commit addresses the issue by updating the error-throwing logic to account for this edge case, letting the node wait (with timeouts) for the synchronize phase to finish instead of throwing an error. A regression test is also added to confirm that the change works. The test adds a wait in the synchronize phase for the newly joining node and releases it only after execution reaches the synchronize case in the `start_operation` function. This shows that, in the updated code, `start_operation` waits for the node to finish the synchronize phase instead of throwing an error. This PR fixes a bug. Hence we need to backport it. Fixes: scylladb/scylladb#23536 Closes scylladb/scylladb#23829 (cherry picked from commit `5ff693eff6`) Closes scylladb/scylladb#24628	2025-07-01 10:10:55 +02:00
Raphael S. Carvalho	fa420f8644	replica: Fix truncate assert failure Truncate doesn't really go well with concurrent writes. The fix (#23560) exposed a preexisting fragility which I missed. 1) truncate gets RP mark X, truncated_at = second T 2) new sstable written during snapshot or later, also at second T (difference of MS) 3) discard_sstables() gets RP Y > saved RP X, since creation time of sstable with RP Y is equal to truncated_at = second T. So the problem is that truncate is using a clock of second granularity for filtering out sstables written later, and after we got the low mark and truncate time, it can happen that an sstable is flushed later within the same second, but at a different millisecond. By switching to a millisecond clock (db_clock), we allow sstables written later within the same second to be filtered out. It's not perfect, but it is extremely unlikely that a new write lands and gets flushed in the same millisecond in which we recorded the truncated_at timepoint. In practice, truncate will not be used concurrently with writes, so this should be enough for our tests performing such concurrent actions. We're moving away from gc_clock, which is our cheap lowres_clock, but time is only retrieved when creating sstable objects, whose creation frequency is low enough not to have significant consequences, and db_clock should also be cheap enough since it's usually syscall-less. Fixes #23771. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#24426 (cherry picked from commit `2d716f3ffe`) Closes scylladb/scylladb#24435	2025-06-24 10:02:06 +03:00
Michael Litvak	305f827888	test/cluster/test_tablets: test restart during tablet cleanup Add a test that reproduces issue scylladb/scylladb#23481. The test migrates a tablet from one node to another, and while the tablet is in some stage of cleanup - either before or right after, depending on the parameter - the leaving replica, on which the tablet is cleaned, is restarted. This is interesting because when the leaving replica starts and loads its state, the tablet could be in different stages of cleanup - the SSTables may still exist or they may have been cleaned up already, and we want to make sure the state is loaded correctly. (cherry picked from commit `bd88ca92c8`)	2025-06-17 13:59:10 +00:00
Szymon Malewski	d65b390780	mapreduce_service: Prevent race condition In parallelized aggregation functions, the super-coordinator (the node performing the final merging step) receives and merges each partial result in parallel coroutines (`parallel_for_each`). Usually responses are spread over time and the actual merging is atomic. However, sometimes partial results are received at a similar time, and if an aggregate function (e.g. a lua script) yields, two coroutines can try to overwrite the same accumulator one after another, which leads to losing some of the results. To prevent this, in this patch each coroutine stores merging results in its own context and overwrites the accumulator atomically, only after it was fully merged. Compared to the previous implementation, the order of operands in the merging function is swapped, but the order of aggregation is not guaranteed anyway. Fixes #20662 Closes scylladb/scylladb#24106 (cherry picked from commit `5969809607`) Closes scylladb/scylladb#24389	2025-06-06 08:49:15 +03:00
Petr Gusev	ffea5e67c1	raft_sys_table_storage: avoid temporary buffer when deserializing log_entry The get_blob() method linearizes data by copying it into a single buffer, which can trigger "oversized allocation" warnings. This commit avoids that extra copy by creating an input stream directly over the original fragmented managed bytes returned by untyped_result_set_row::get_view(). Fixes scylladb/scylladb#23903 (cherry picked from commit `f245b05022`)	2025-05-29 08:42:09 +00:00
Gleb Natapov	dd9ec03323	topology coordinator: make decommissioning node non voter before completing the operation A decommissioned node is removed from the raft config after the operation is marked as completed. This is required since otherwise the decommissioned node will not see that decommission has completed (the status is propagated through raft). But right after the decommission is marked as completed a decommissioned node may terminate, so in case of a two node cluster, the configuration change that removes it from raft will fail, because there will be no quorum. The solution is to mark the decommissioning node as non voter before reporting the operation as completed. Fixes: #24026 Backport to 2025.2 because it fixes a potential hang. Don't backport to branches older than 2025.2 because they don't have `8b186ab0ff`, which caused this issue. Closes scylladb/scylladb#24027 (cherry picked from commit `c6e1758457`) Closes scylladb/scylladb#24093	2025-05-16 11:49:46 +03:00
Piotr Dulikowski	4792a27396	topology_coordinator: silence ERROR messages on abort When the topology coordinator is shut down while doing a long-running operation, the current operation might throw a raft::request_aborted exception. This is not a critical issue and should not be logged with ERROR verbosity level. Make sure that all the try..catch blocks in the topology coordinator which: - May try to acquire a new group0 guard in the `try` part - Have a `catch (...)` block that print an ERROR-level message ...have a pass-through `catch (raft::request_aborted&)` block which does not log the exception. Fixes: scylladb/scylladb#22649 Closes scylladb/scylladb#23962 (cherry picked from commit `156ff8798b`) Closes scylladb/scylladb#24082	2025-05-16 11:48:43 +03:00
Aleksandra Martyniuk	f26c2b22dc	test_tablet_repair_hosts_filter: change injected error test_tablet_repair_hosts_filter checks whether the host filter specified for tablet repair is correctly persisted. To check this, we need to ensure that the repair is still ongoing and its data is kept. The test achieves that by failing the repair on replica side - as the failed repair is going to be retried. However, if the filter does not contain any host (included_host_count = 0), the repair is started on no replica, so the request succeeds and its data is deleted. The test fails if it checks the filter after repair request data is removed. Fail repair on topology coordinator side, so the request is ongoing regardless of the specified hosts. Fixes: #23986. Closes scylladb/scylladb#24003 (cherry picked from commit `2549f5e16b`) Closes scylladb/scylladb#24080	2025-05-16 11:48:27 +03:00
Aleksandra Martyniuk	fcde30d2b0	streaming: use host_id in file streaming Use host ids instead of ips in file-streaming. Fixes: #22421. Closes scylladb/scylladb#24055 (cherry picked from commit `2dcea5a27d`) Closes scylladb/scylladb#24119	2025-05-14 22:13:48 +02:00
Gleb Natapov	827563902c	test: add reproducer for #22777 Add sleep before starting gossiper to increase a chance of getting old gossiper entry about yourself before updating local gossiper info with new IP address. (cherry picked from commit `7403de241c`)	2025-05-09 12:56:15 +00:00
Gleb Natapov	ccf194bd89	storage_service: Do not remove gossiper entry on address change When gossiper indexed entries by ip, an old entry had to be removed on an address change, but the index is now id based, so even if the ip was changed the entry should stay. Gossiper simply updates the ip address there. (cherry picked from commit `ecd14753c0`)	2025-05-09 12:56:15 +00:00
Gleb Natapov	9b735bb4dc	storage_service: use id to check for local node IP may change and an old gossiper message with previous IP may be processed when it shouldn't. Fixes: #22777 (cherry picked from commit `a2178b7c31`)	2025-05-09 12:56:15 +00:00
Emil Maskovsky	24dfd2034b	raft: ensure topology coordinator retains votership The limited voters feature did not account for the existing topology coordinator (Raft leader) when selecting voters to be removed. As a result, the limited voters calculator could inadvertently remove the votership of the current topology coordinator, triggering an unnecessary Raft leader re-election. This change ensures that the existing topology coordinator's votership status is preserved unless absolutely necessary. When choosing between otherwise equivalent voters, the node other than the topology coordinator is prioritized for removal. This helps maintain stability in the cluster by avoiding unnecessary leader re-elections. Additionally, only the alive leader node is considered relevant for this logic. A dead existing leader (topology coordinator) is excluded from consideration, as it is already in the process of losing leadership. Fixes: scylladb/scylladb#23588 Fixes: scylladb/scylladb#23786	2025-05-05 16:58:34 +02:00
Emil Maskovsky	2ae59e8a87	raft: retain existing voters across data centers and racks Fix an issue in the voter calculator where existing voters were not retained across data centers and racks in certain scenarios. This occurred when voters were distributed across more data centers and racks than the maximum allowed number of voters. Previously, the prioritization logic for data centers and racks did not consider the number of existing assigned voters. It only prioritized nodes within a single data center or rack, which could result in unnecessary reassignment of voters. Improved the prioritization logic to account for the number of existing voters in each data center and rack. This change ensures a more stable voter distribution and reduces unnecessary voter reassignments. Fixes: scylladb/scylladb#23950	2025-05-05 16:51:48 +02:00
Emil Maskovsky	018fb63305	raft: refactor limited voters calculator to prioritize nodes Refactor the limited voters calculator to use a priority queue for sorting nodes by their priorities. This change simplifies the voter selection logic and makes it more extensible for future enhancements, such as supporting more complex priority calculations. The priority value is determined based on the node's existing status, including whether it is alive, a voter, or any further criteria.	2025-05-05 16:36:17 +02:00
Emil Maskovsky	26fdc7b8f8	raft: replace pointer with reference for non-null output parameter The output parameter cannot be `null`. Previously, a pointer was used to make it explicit that the parameter is an output parameter being modified. However, this is unnecessary, as references are more appropriate for parameters that cannot be `null`. Switching to a reference improves code readability and ensures the parameter's non-null constraint is enforced at the type level.	2025-05-05 16:12:00 +02:00
Emil Maskovsky	f0468860a3	raft: reduce code duplication in group0 voter handler Refactor the group0 voter handler by introducing a helper lambda to handle the common logic for adding a node. This eliminates unnecessary code duplication. This refactor does not introduce any functional changes but prepares the codebase for easier future modifications.	2025-05-05 16:09:53 +02:00
Emil Maskovsky	2ef654149f	raft: unify and optimize datacenter and rack info creation Refactor the code to use a consistent pattern for creating the datacenter info list and the rack info list. Both now use a map of vectors, which improves efficiency by reducing temporary conversions to maps/sets during node list processing. Also ensure the node descriptor is passed by reference instead of by copy, leveraging the guaranteed lifetime of the descriptors.	2025-05-05 15:15:17 +02:00
Pavel Emelyanov	7b786d9398	topology_coordinator: Use this->_feature_service directly This dependency is already there; the topology coordinator doesn't need to go through the database reference to get to the features. Previous patch of the same kind: `b79137eaa4` Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23777	2025-05-05 09:37:29 +02:00
Aleksandra Martyniuk	1f4edd8683	test_tablet_tasks: use injection to revoke resize Currently, test_tablet_resize_revoked tries to trigger split revoke by deleting some rows. This method isn't deterministic, so the test is flaky. Use error injection to trigger resize revoke. Fixes: #22570. Closes scylladb/scylladb#23966	2025-04-30 07:04:57 +03:00
Patryk Jędrzejczak	0cdcf82cd0	Merge 'topology coordinator: do not proceed further on invalid boostrap tokens' from Piotr Dulikowski When dht::boot_strapper::get_boostrap_tokens fails to parse the tokens, the topology coordinator handles the exception and schedules a rollback. However, the current code tries to continue with the topology coordinator logic even if an exception occurs, leaving boostrap_tokens empty. This does not make sense and can actually cause issues, specifically in prepare_and_broadcast_cdc_generation_data, which implicitly expects that the bootstrap_tokens of the first node in the cluster will not be empty. Fix this by adding the missing break. Fixes: scylladb/scylladb#23897 From code inspection alone, it looks like 2025.1 and 6.2 have this problem, so marking for backport to both of them. Closes scylladb/scylladb#23914 * https://github.com/scylladb/scylladb: test: cluster: add test_bad_initial_token topology coordinator: do not proceed further on invalid boostrap tokens cdc: add sanity check for generating an empty generation	2025-04-28 12:45:33 +02:00
Botond Dénes	d582c436e5	Merge 'tasks: check whether a node is alive before rpc' from Aleksandra Martyniuk Check whether a node is alive before making an rpc that gathers children infos from the whole cluster in virtual_task::impl::get_children. Fixes: https://github.com/scylladb/scylladb/issues/22514. Needs backport to 2025.1 and 6.2 as they contain the bug. Closes scylladb/scylladb#23787 * github.com:scylladb/scylladb: test: add test for getting tasks children tasks: check whether a node is alive before rpc	2025-04-28 09:32:45 +03:00
Piotr Dulikowski	845cedea7f	topology coordinator: do not proceed further on invalid boostrap tokens When dht::boot_strapper::get_boostrap_tokens fails to parse the tokens, the topology coordinator handles the exception and schedules a rollback. However, the current code tries to continue with the topology coordinator logic even if an exception occurs, leaving boostrap_tokens empty. This does not make sense and can actually cause issues, specifically in prepare_and_broadcast_cdc_generation_data, which implicitly expects that the bootstrap_tokens of the first node in the cluster will not be empty. Fix this by adding the missing break. Fixes: scylladb/scylladb#23897	2025-04-25 11:30:01 +02:00
Wojciech Mitros	d77f11d436	base_info: remove the lw_shared_ptr variant The base_dependent_view_info no longer needs to be shared or modified in the view_info, so we no longer need to keep it as a shared pointer.	2025-04-24 01:08:40 +02:00
Wojciech Mitros	05fce91945	schema_registry: store base info instead of base schema for view entries In the following patch we plan to remove the base schema from the base_info to make the base_info immutable. To do that, we first prepare the schema registry for the change; we need to be able to create view schemas from frozen schemas there, and frozen schemas have no information about the base table. Unless we make this change, after base schemas are removed from the base info, we'll no longer be able to load a view schema to the schema registry without looking up the base schema in the database. This change also required some updates to schema building: * we add a method for unfreezing a view schema with base info instead of a base schema * we make it possible to use schema_builder with a base info instead of a base schema * we add a method for creating a view schema from mutations with a base info instead of a base schema * we add a view_info constructor with a base info instead of a base schema * we update the naming in schema_registry to reflect the usage of base info instead of base schema	2025-04-24 01:08:39 +02:00
Pavel Emelyanov	eb5b52f598	Merge 'main: make DC and rack immutable after bootstrap' from Piotr Dulikowski Changing DC or rack on a node which was already bootstrapped is, in the case of vnodes, very unsafe (almost guaranteed to cause data loss or unavailability), and is outright not supported if the cluster has tablet-backed keyspaces. Moreover, the possibility of doing that makes it impossible to uphold some of the invariants promised by the RF-rack-valid flag, which is eventually going to become unconditionally enabled. Get rid of the above problems by removing the possibility of changing the DC / rack of a node. A node will now fail to start if its snitch reports a different DC or rack than the one that was reported during the first boot. Fixes: scylladb/scylladb#23278 Fixes: scylladb/scylladb#22869 Marking for backport to 2025.1, as this is a necessary part of the RF-rack-valid saga Closes scylladb/scylladb#23800 * github.com:scylladb/scylladb: doc: changing topology when changing snitches is no longer supported test: cluster: introduce test_no_dc_rack_change storage_service: don't update DC/rack in update_topology_with_local_metadata main: make dc and rack immutable after bootstrap test: cluster: remove test_snitch_change	2025-04-21 15:52:55 +03:00
Piotr Dulikowski	1791ae3581	storage_service: don't update DC/rack in update_topology_with_local_metadata The DC/rack are now immutable and cannot be changed after restart, so there is no need to update the node's system.topology entry with this information on restart.	2025-04-17 16:22:58 +02:00

5390 Commits