Commit Graph

468 Commits

Author SHA1 Message Date
Petr Gusev
ed6bec2cac storage_proxy: node_local_only: always use my_host_id
The previous implementation did not handle topology changes well:
* In node_local_only mode with CL=1, if the current node is pending,
  the CL is raised to 2, causing unavailable_exception.
* If the current tablet is in write_both_read_old and we read with
  node_local_only on the new node, the replica list is empty.

This patch changes node_local_only mode to always use my_host_id as
the replica list. An explicit check ensures the current node is a
replica for the operation; otherwise on_internal_error is called.
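
A minimal standalone sketch of the described behavior; host_id, the replica
list type, and on_internal_error below are simplified stand-ins, not the
actual ScyllaDB declarations:
```c++
#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins for the real ScyllaDB types.
using host_id = uint64_t;

[[noreturn]] void on_internal_error(const char* msg) {
    throw std::runtime_error(msg);
}

// node_local_only: the effective replica list is always just my_host_id,
// but only if the local node really is a replica for the operation.
std::vector<host_id> node_local_replicas(host_id my_host_id,
                                         const std::vector<host_id>& natural_replicas) {
    if (std::find(natural_replicas.begin(), natural_replicas.end(), my_host_id)
            == natural_replicas.end()) {
        on_internal_error("node_local_only used on a node that is not a replica");
    }
    return {my_host_id};
}
```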
2025-08-19 16:11:49 +02:00
Petr Gusev
65c7e36b7c storage_proxy: handle node_local_only in query
In this commit we add support for the node_local_only flag in the read
code path in storage_proxy.
2025-07-24 19:48:08 +02:00
Petr Gusev
2d747d97b8 storage_proxy: handle node_local_only in mutate
We add the remove_non_local_host_ids() helper, which
will be used in the next commit to support the read
path. The HostIdVector concept is introduced so that
both host_id_vector_replica_set and
host_id_vector_topology_change can be handled uniformly.
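
A sketch of how a concept can make the helper work for both containers; the
names mirror the commit message, but the definitions below are illustrative
rather than the actual ScyllaDB code:
```c++
#include <algorithm>
#include <concepts>
#include <cstdint>

using host_id = uint64_t;  // stand-in for the real host id type

// Anything that behaves like a container of host_id, e.g.
// host_id_vector_replica_set or host_id_vector_topology_change.
template <typename V>
concept HostIdVector = requires(V v) {
    { *v.begin() } -> std::convertible_to<host_id>;
    v.end();
    v.erase(v.begin(), v.end());
};

// Keep only the local node in the replica list, whatever its concrete type.
template <HostIdVector V>
void remove_non_local_host_ids(V& replicas, host_id my_host_id) {
    replicas.erase(std::remove_if(replicas.begin(), replicas.end(),
                                  [&] (const host_id& h) { return h != my_host_id; }),
                   replicas.end());
}
```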

The storage_proxy_coordinator_mutate_options class
is declared outside of storage_proxy to avoid C++
compiler complaints about default field initializers.
In particular, some storage_proxy methods use this
class for optional parameters with default values,
which is not allowed when the class is defined inside
storage_proxy.
2025-07-24 19:48:08 +02:00
Petr Gusev
7eb198f2cc storage_proxy: introduce node_local_only flag
Add a per-request flag that restricts query execution
to the local node by filtering out all non-local replicas.
Standard consistency level (CL) rules still apply:
if the local node alone cannot satisfy the
requested CL, an exception is thrown.
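
As an illustration only (this is not the actual storage_proxy code; the types
and the helper name are hypothetical), filtering replicas down to the local
node and then applying the usual CL check could look like:
```c++
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

using host_id = uint64_t;

struct unavailable_exception : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical helper: drop every replica except the local one, then let the
// standard consistency-level rules decide whether the request can proceed.
std::vector<host_id> filter_for_node_local_only(std::vector<host_id> replicas,
                                                host_id my_host_id,
                                                std::size_t required_by_cl) {
    std::erase_if(replicas, [&] (host_id h) { return h != my_host_id; });
    if (replicas.size() < required_by_cl) {
        throw unavailable_exception("cannot satisfy CL with the local node alone");
    }
    return replicas;
}
```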

This flag is required for Paxos state access, where
reads and writes must target only the local node.

As a side effect, this also enables the implementation
of scylladb/scylladb#16478, which proposes a CQL
extension to expose 'local mode' query execution to users.

Support for this flag in storage_proxy's read and write
code paths will be added in follow-up commits.
2025-07-24 19:48:08 +02:00
Petr Gusev
4c1aca3927 storage_proxy: add coordinator_mutate_options
In upcoming commits, we want to add a node_local_only flag to both read
and write paths in storage_proxy. This requires passing the flag from
query_processor to the part of storage_proxy where replica selection
decisions are made.

For reads, it's sufficient to add the flag to the existing
coordinator_query_options class. For writes, there is no such options
container, so we introduce coordinator_mutate_options in this commit.

In the future, we may move some of the many mutate() method arguments
into this container to simplify the code.
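
A hypothetical shape for such a container (the field shown is the one planned
in the follow-up commits; nothing here is copied from the actual code):
```c++
// Illustrative only: an options container passed from query_processor down to
// the replica-selection code in storage_proxy's write path.
struct coordinator_mutate_options {
    // Planned in a later commit: restrict the operation to the local node.
    bool node_local_only = false;
};

// Hypothetical use: the options travel as a single argument instead of
// growing the already long mutate() parameter list further.
// future<> mutate(..., coordinator_mutate_options options = {});
```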
2025-07-24 19:48:08 +02:00
Petr Gusev
b6ccaffd45 storage_proxy: rename create_write_response_handler -> make_write_response_handler
Most of the create_write_response_handler overloads follow the same
signature pattern to satisfy the sp::mutate_prepare call. The one which
doesn't follow it is invoked by others and is responsible for creating
a concrete handler instance. In this refactoring commit we rename
it to make_write_response_handler to reduce confusion.
2025-07-24 19:48:08 +02:00
Petr Gusev
db946edd1d storage_proxy: simplify mutate_prepare
This is a refactoring commit. We remove extra lambda parameters from
mutate_prepare since the CreateWriteHandler lambda can simply
capture them.

We can't std::move(permit) in another mutate_prepare overload,
because each handler wants its own copy of the permit.
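
A standalone sketch of both points -- the handler-creation lambda captures
what used to be extra parameters, and the permit is copied per handler rather
than moved (all types below are simplified stand-ins):
```c++
#include <vector>

struct permit_t {};
struct mutation_t {};
struct handler_t {};

template <typename CreateWriteHandler>
std::vector<handler_t> mutate_prepare(const std::vector<mutation_t>& mutations,
                                      CreateWriteHandler create_handler) {
    std::vector<handler_t> handlers;
    handlers.reserve(mutations.size());
    for (const auto& m : mutations) {
        handlers.push_back(create_handler(m));
    }
    return handlers;
}

void example(const std::vector<mutation_t>& mutations, permit_t permit) {
    // The lambda captures the permit by copy instead of taking it as a
    // parameter; it cannot be moved in, because every handler needs its
    // own copy of the permit.
    auto handlers = mutate_prepare(mutations, [permit] (const mutation_t& m) {
        return handler_t{};  // the real code builds a handler from `permit` and `m`
    });
    (void)handlers;
}
```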
2025-07-24 19:48:08 +02:00
Petr Gusev
6e87a6cdb0 paxos_state: extract state access functions into paxos_store
Introduce paxos_store abstraction to isolate Paxos state access.
Prepares for supporting either system.paxos or a co-located
table as the storage backend.
2025-07-24 16:39:50 +02:00
Patryk Jędrzejczak
f89ffe491a Merge 'storage_service: cancel all write requests after stopping transports' from Sergey Zolotukhin
When a node shuts down, in storage service, after storage_proxy RPCs are stopped, some write handlers within storage_proxy may still be waiting for background writes to complete. These handlers hold the appropriate ERMs to block schema changes before the write finishes. After the RPCs are stopped, these writes can no longer receive replies.

If, at the same time, there are RPC commands executing `barrier_and_drain`, they may get stuck waiting for these ERM holders to finish, potentially blocking node shutdown until the writes time out.

This change introduces cancellation of all outstanding write handlers from storage_service after the storage proxy RPCs were stopped.

Fixes scylladb/scylladb#23665

Backport: since this fixes an issue that frequently causes failures in CI, backport to 2025.1, 2025.2, and 2025.3.

Closes scylladb/scylladb#24714

* https://github.com/scylladb/scylladb:
  storage_service: Cancel all write requests on storage_proxy shutdown
  test: Add test for unfinished writes during shutdown and topology change
2025-07-24 09:46:42 +03:00
Sergey Zolotukhin
e0dc73f52a storage_service: Cancel all write requests on storage_proxy shutdown
During a graceful node shutdown, RPC listeners are stopped in `storage_service::drain_on_shutdown`
as one of the first steps. However, even after RPCs are shut down, some write handlers in
`storage_proxy` may still be waiting for background writes to complete. These handlers retain the ERM.
Since the RPC subsystem is no longer active, replies cannot be received, and if any RPC commands are
concurrently executing `barrier_and_drain`, they may get stuck waiting for those writes. This can block
the messaging server shutdown and delay the entire shutdown process until the write timeout occurs.

This change introduces the cancellation of all outstanding write handlers in `storage_proxy`
during shutdown to prevent unnecessary delays.
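
In outline (hypothetical types; the real handler registry in storage_proxy is
more involved), the cancellation amounts to walking the outstanding handlers
and failing them so that the ERMs they hold are released:
```c++
#include <functional>
#include <list>

// Hypothetical write handler: holds resources (e.g. an ERM pointer) until the
// write completes or is cancelled.
struct write_response_handler {
    std::function<void()> on_cancel;  // fails the write and releases held resources
};

struct handler_registry {
    std::list<write_response_handler> handlers;

    // Called after the storage_proxy RPCs are stopped: no replies can arrive
    // anymore, so fail every outstanding write now instead of waiting for
    // the write timeout.
    void cancel_all() {
        for (auto& h : handlers) {
            if (h.on_cancel) {
                h.on_cancel();
            }
        }
        handlers.clear();
    }
};
```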

Fixes scylladb/scylladb#23665
2025-07-22 15:03:30 +02:00
Benny Halevy
3feb759943 everywhere: use utils::chunked_vector for list of mutations
Currently, we use std::vector<*mutation> to keep
a list of mutations for processing.
This can lead to a large allocation, e.g. when the vector
size is a function of the number of tables.

Use a chunked vector instead to prevent oversized allocations.
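
The change itself is mechanical; as a hedged sketch (std::deque is used below
only as a self-contained stand-in for utils::chunked_vector, which is the type
the patch actually switches to):
```c++
#include <deque>
#include <memory>
#include <vector>

struct mutation {};  // stand-in for the real mutation type

// Before: one contiguous allocation whose size grows with the number of
// mutations (e.g. one per table), which can exceed the large-allocation
// threshold.
using mutation_list_before = std::vector<std::unique_ptr<mutation>>;

// After: a chunked container keeps every underlying allocation bounded.
using mutation_list_after = std::deque<std::unique_ptr<mutation>>;
```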

`perf-simple-query --smp 1` results obtained for fixed 400MHz frequency
and PGO disabled:

Before (read path):
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...

89055.97 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39417 insns/op,   18003 cycles/op,        0 errors)
103372.72 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39380 insns/op,   17300 cycles/op,        0 errors)
98942.27 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39413 insns/op,   17336 cycles/op,        0 errors)
103752.93 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39407 insns/op,   17252 cycles/op,        0 errors)
102516.77 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39403 insns/op,   17288 cycles/op,        0 errors)
throughput:
	mean=   99528.13 standard-deviation=6155.71
	median= 102516.77 median-absolute-deviation=3844.59
	maximum=103752.93 minimum=89055.97
instructions_per_op:
	mean=   39403.99 standard-deviation=14.25
	median= 39406.75 median-absolute-deviation=9.30
	maximum=39416.63 minimum=39380.39
cpu_cycles_per_op:
	mean=   17435.81 standard-deviation=318.24
	median= 17300.40 median-absolute-deviation=147.59
	maximum=18002.53 minimum=17251.75
```

After (read path)
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
59755.04 tps ( 66.2 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39466 insns/op,   22834 cycles/op,        0 errors)
71854.16 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39417 insns/op,   17883 cycles/op,        0 errors)
82149.45 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39411 insns/op,   17409 cycles/op,        0 errors)
49640.04 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.3 tasks/op,   39474 insns/op,   19975 cycles/op,        0 errors)
54963.22 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.3 tasks/op,   39474 insns/op,   18235 cycles/op,        0 errors)
throughput:
	mean=   63672.38 standard-deviation=13195.12
	median= 59755.04 median-absolute-deviation=8709.16
	maximum=82149.45 minimum=49640.04
instructions_per_op:
	mean=   39448.38 standard-deviation=31.60
	median= 39466.17 median-absolute-deviation=25.75
	maximum=39474.12 minimum=39411.42
cpu_cycles_per_op:
	mean=   19267.01 standard-deviation=2217.03
	median= 18234.80 median-absolute-deviation=1384.25
	maximum=22834.26 minimum=17408.67
```

`perf-simple-query --smp 1 --write` results obtained for fixed 400MHz frequency
and PGO disabled:

Before (write path):
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no}
Disabling auto compaction
63736.96 tps ( 59.4 allocs/op,  16.4 logallocs/op,  14.3 tasks/op,   49667 insns/op,   19924 cycles/op,        0 errors)
64109.41 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   49992 insns/op,   20084 cycles/op,        0 errors)
56950.47 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50005 insns/op,   20501 cycles/op,        0 errors)
44858.42 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50014 insns/op,   21947 cycles/op,        0 errors)
28592.87 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50027 insns/op,   27659 cycles/op,        0 errors)
throughput:
	mean=   51649.63 standard-deviation=15059.74
	median= 56950.47 median-absolute-deviation=12087.33
	maximum=64109.41 minimum=28592.87
instructions_per_op:
	mean=   49941.18 standard-deviation=153.76
	median= 50005.24 median-absolute-deviation=73.01
	maximum=50027.07 minimum=49667.05
cpu_cycles_per_op:
	mean=   22023.01 standard-deviation=3249.92
	median= 20500.74 median-absolute-deviation=1938.76
	maximum=27658.75 minimum=19924.32
```

After (write path)
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no}
Disabling auto compaction
53395.93 tps ( 59.4 allocs/op,  16.5 logallocs/op,  14.3 tasks/op,   50326 insns/op,   21252 cycles/op,        0 errors)
46527.83 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50704 insns/op,   21555 cycles/op,        0 errors)
55846.30 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50731 insns/op,   21060 cycles/op,        0 errors)
55669.30 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50735 insns/op,   21521 cycles/op,        0 errors)
52130.17 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50757 insns/op,   21334 cycles/op,        0 errors)
throughput:
	mean=   52713.91 standard-deviation=3795.38
	median= 53395.93 median-absolute-deviation=2955.40
	maximum=55846.30 minimum=46527.83
instructions_per_op:
	mean=   50650.57 standard-deviation=182.46
	median= 50731.38 median-absolute-deviation=84.09
	maximum=50756.62 minimum=50325.87
cpu_cycles_per_op:
	mean=   21344.42 standard-deviation=202.86
	median= 21334.00 median-absolute-deviation=176.37
	maximum=21554.61 minimum=21060.24
```

Fixes #24815

Improvement for rare corner cases. No backport required.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#24919
2025-07-13 19:13:11 +03:00
Michael Litvak
7150632cf2 batchlog_manager: abort writes on shutdown
On shutdown of batchlog manager, abort all writes of replayed batches
by the batchlog manager.

To achieve this we set the appropriate write_type to BATCH, and on
shutdown cancel all write handlers with this type.
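
Schematically (the registry below is a hypothetical miniature, not the real
storage_proxy code), the shutdown path only needs to select handlers by their
write type:
```c++
#include <list>

enum class write_type { SIMPLE, BATCH, COUNTER };  // reduced, illustrative set

struct write_handler {
    write_type type;
    bool cancelled = false;
};

void cancel_handlers_of_type(std::list<write_handler>& handlers, write_type t) {
    for (auto& h : handlers) {
        if (h.type == t) {
            h.cancelled = true;  // the real code fails the write and releases resources
        }
    }
}

// On batchlog_manager shutdown:
//   cancel_handlers_of_type(handlers, write_type::BATCH);
```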
2025-07-07 12:23:06 +03:00
Michael Litvak
fc5ba4a1ea batchlog_manager: create cancellable write response handler
When replaying a batch mutation from the batchlog manager and sending it
to all replicas, create the write response handler as cancellable.

To achieve this we define a new wrapper type for batchlog mutations -
batchlog_replay_mutation, and this allows us to overload
create_write_response_handler for this type. This is similar to how it's
done with hint_wrapper and read_repair_mutation.
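
The wrapper-type trick looks roughly like this (the wrapper name follows the
commit; everything else is an illustrative stand-in):
```c++
struct mutation {};  // stand-in for the real mutation type

// Thin wrapper that changes only the static type, so overload resolution can
// pick a dedicated create_write_response_handler for batchlog replays --
// the same pattern used for hint_wrapper and read_repair_mutation.
struct batchlog_replay_mutation {
    mutation m;
};

struct response_handler {
    bool cancellable = false;
};

response_handler create_write_response_handler(const mutation&) {
    return {};                       // regular handler
}

response_handler create_write_response_handler(const batchlog_replay_mutation&) {
    return { .cancellable = true };  // batch replay writes can be cancelled on shutdown
}
```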
2025-07-07 12:23:06 +03:00
Michael Litvak
8d48b27062 storage_proxy: add write type parameter to mutate_internal
Currently mutate_internal has a boolean parameter `counter_write` that
indicates whether the write is of counter type or not.

We replace it with a more general parameter that indicates the
write type.

It is compatible with the previous behavior - for a counter write, the
type COUNTER is passed, and otherwise a default value will be used
as before.
2025-07-07 12:23:06 +03:00
Nadav Har'El
e12ff4d3ab Merge 'LWT: use tablet_metadata_guard' from Petr Gusev
This PR is a step towards enabling LWT for tablet-based tables.

It pursues several goals:
* Make it explicit that the tablet can't migrate after the `cas_shard` check in `select_statement/modification_statement`. Currently, `storage_proxy::cas` expects that the client calls it on the correct shard -- the one which owns the partition key the LWT is running on. The reasons for that are explained in [this commit](f16e3b0491 (diff-1073ea9ce4c5e00bb6eb614154f523ba7962403a4fe6c8cd877d1c8b73b3f649)) message. The statements check the current shard and invoke `bounce_to_shard` if it's not the right one. However, the erm strong pointer is only captured in `storage_proxy::cas`, and until that moment there is no explicit structure in the code which would prevent ongoing migrations. In this PR we introduce such a structure -- `erm_handle`. We create it before the `cas_check` and pass it down to `storage_proxy::cas` and `paxos_response_handler`.
* Another goal of this PR is an optimization -- we don't want to hold the erm for the duration of the entire LWT unless the topology change directly affects the current tablet. There is a `tablet_metadata_guard` class which is used for long-running tablet operations. It automatically switches to a new erm if the topology change represented by the new erm doesn't affect the current tablet. We use this class in `erm_handle` if the table uses tablets. Otherwise, `erm_handle` just stores the erm directly.
* Fixes [shard bouncing issue in alternator](https://github.com/scylladb/scylladb/issues/17399)

Backport: not needed (new feature).

Closes scylladb/scylladb#24495

* github.com:scylladb/scylladb:
  LWT: make cas_shard non-optional in sp::cas
  LWT: create cas_shard in select_statement
  LWT: create cas_shard in modification and batch statements
  LWT: create cas_shard in alternator
  LWT: use cas_shard in storage_proxy::cas
  do_query_with_paxos: remove redundant cas_shard check
  storage_proxy: add cas_shard class
  sp::cas_shard: rename to get_cas_shard
  token_metadata_guard: a topology guard for a token
  tablet_metadata_guard: mark as noncopyable and nonmoveable
2025-07-01 11:33:20 +03:00
Petr Gusev
35aba76401 LWT: make cas_shard non-optional in sp::cas
We also make the sp::cas_shard function local, since it's no
longer used directly by sp clients.
2025-06-30 10:37:33 +02:00
Petr Gusev
deb7afbc87 LWT: use cas_shard in storage_proxy::cas
Take a cas_shard parameter in sp::cas and pass the token_metadata_guard down to paxos_response_handler.

We make the cas_shard parameter optional in storage_proxy methods
to make the refactoring easier. The sp::cas method constructs a new
token_metadata_guard if it's not set. All call sites pass null in
this commit; the proper implementation will be added in the next
commits.
2025-06-30 10:33:17 +02:00
Petr Gusev
43c4de8ad1 storage_proxy: add cas_shard class
The sp::cas method must be called on the correct shard,
as determined by sp::cas_shard. Additionally, there must
be no asynchronous yields between the shard check and
capturing the erm strong pointer in sp::cas. While
this condition currently holds, it's fragile and
easy to break.

To address this, future commits will move the capture of
token_metadata_guard to the call sites of sp::cas, before
performing the shard check.

As a first step, this commit introduces a cas_shard class
that wraps both the target shard and a token_metadata_guard
instance. This ensures the returned shard remains valid for
the given tablet as long as the guard is held.
In the next commits, we’ll pass a cas_shard instance
to sp::cas as a separate parameter.
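
A sketch of the described wrapper (member and accessor names are hypothetical;
token_metadata_guard stands in for the real guard type referred to above):
```c++
#include <utility>

// Stand-in for the guard that pins the topology version -- and therefore the
// tablet-to-shard mapping -- for as long as it is alive.
struct token_metadata_guard {};

// Wraps the shard chosen for a CAS operation together with the guard that
// keeps that choice valid: while a cas_shard instance is held, the tablet
// owning the partition cannot migrate away from the stored shard.
class cas_shard {
    unsigned _shard;
    token_metadata_guard _guard;
public:
    cas_shard(unsigned shard, token_metadata_guard guard)
        : _shard(shard), _guard(std::move(guard)) {}
    unsigned shard() const { return _shard; }
};
```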
2025-06-30 10:33:17 +02:00
Piotr Dulikowski
62efe6616a Merge 'mapreduce: add tablet-aware dispatching algorithm' from Andrzej Jackowski
The primary motivation for this change is to reduce the time during which the Effective Replication Map (ERM) is retained by the mapreduce service. This ensures that long aggregate queries do not block topology operations. As ScyllaDB is generally transitioning towards tablets, and using tablets simplifies work dispatching, the decision was made to design the new algorithm specifically for tablets. The goal of the algorithm is to divide the work in such a way that each `tablet_replica` (that is, a <host, shard> pair) processes two tablets at a time.

The new algorithm can be summarized as follows:
 1. Prepare a tablet_replica -> partition_range mapping where the values
    cover the entire space.
 2. For each tablet_replica, in parallel, take two partition ranges and
    dispatch them to the node hosting the replica. The ERM is released and
    re-acquired in each iteration, allowing the destination (i.e.,
    tablet_replica) to change for each partition range (in such cases, the
    partition range is assigned to the appropriate tablet_replica).

In step 1, the main difference compared to the old algorithm (dispatch_to_vnodes) is that partition ranges are assigned to a tablet_replica rather than just the host.

In step 2, the main difference is that the work is divided into smaller batches, and the ERM is released and re-acquired for each batch.

In the current implementation, each node can correctly handle every partition range, even if the mapreduce supercoordinator does not retain the ERM and the range is absent locally. This is because mapreduce_service::execute_on_this_shard creates a new pager that coordinates the partition range read, including obtaining its own ERM. However, every partition range that is absent locally is handled by shard 0. Therefore, proper routing of partition ranges is necessary to avoid shard 0 overload. This is why, in step 2, the ERM is retained during each batch processing, and the tablet_replica is refreshed for each processed range.
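
A heavily simplified, hypothetical outline of step 2 (sequential here for
brevity; the real service processes replicas in parallel and refreshes the
tablet_replica from a freshly acquired ERM on every iteration):
```c++
#include <compare>
#include <map>
#include <queue>
#include <vector>

// Hypothetical stand-ins.
struct tablet_replica {
    int host;
    unsigned shard;
    auto operator<=>(const tablet_replica&) const = default;
};
struct partition_range {};

void dispatch(const tablet_replica&, const std::vector<partition_range>&) {}

std::map<tablet_replica, std::queue<partition_range>> build_replica_to_ranges_map() {
    return {};  // step 1 in the real code: map every partition range to its tablet_replica
}

void dispatch_to_tablets() {
    auto work = build_replica_to_ranges_map();
    for (auto& [replica, ranges] : work) {       // in the real code: in parallel per replica
        while (!ranges.empty()) {
            std::vector<partition_range> batch;
            for (int i = 0; i < 2 && !ranges.empty(); ++i) {  // two ranges per iteration
                batch.push_back(ranges.front());
                ranges.pop();
            }
            // Here the ERM would be released and re-acquired, and `replica`
            // refreshed from it, so a migrated tablet is routed to its new owner.
            dispatch(replica, batch);
        }
    }
}
```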

Additionally, shard_id is added to the mapreduce request. When shard_id is set, the entire partition range is handled by the specified shard. As the new tablet-aware mapreduce algorithm balances the workload across shards, shard_id ensures that the balance is preserved, even during events such as tablet splits.

This patch series:
 - Refactors the mapreduce service a bit, to facilitate having two algorithm versions (one for vnodes and one for tablets).
 - Implements tablet-aware dispatching algorithm.
 - Adds shard_id to mapreduce request and uses the information to handle requests entirely by selected shard.
 - Adds test_long_query_timeout_erm to verify the new functionality.

Fixes: scylladb#21831

No backport, as it is a new feature rather than a bugfix.

Closes scylladb/scylladb#24383

* github.com:scylladb/scylladb:
  mapreduce: add missing comma and space in mapreduce_request operator<<
  mapreduce: add shard_id_hint to mapreduce request
  test: add test_long_query_timeout_erm
  mapreduce: add tablet-aware dispatching algorithm
  storage_proxy: make storage_proxy::is_alive public
  mapreduce: remove _shared_token_metadata from mapreduce_service
  mapreduce: move dispatching logic to dispatch_to_vnodes
  mapreduce: remove underscores from variable names
  mapreduce: move req_with_modified_pr handling to a new function
  mapreduce: change next_vnode lambda to get_next_partition_range function
2025-06-26 12:25:39 +02:00
Andrzej Jackowski
6d358cd7b2 storage_proxy: make storage_proxy::is_alive public
The motivation is to allow other components (specifically the mapreduce
service) to use the method, just like storage_proxy::get_live_endpoints.
2025-06-25 08:59:04 +02:00
Asias He
c5a136c3b5 storage_service: Use utils::chunked_vector to avoid big allocation
The following was seen:

```
!WARNING | scylla[6057]:  [shard 12:strm] seastar_memory - oversized allocation: 212992 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at
[Backtrace #0]
void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89
 (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:99
seastar::current_tasktrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:136
seastar::current_backtrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:169
seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:848
seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:911
operator new(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:1706
std::allocator<dht::token_range_endpoints>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/allocator.h:196
 (inlined by) std::allocator_traits<std::allocator<dht::token_range_endpoints> >::allocate(std::allocator<dht::token_range_endpoints>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/alloc_traits.h:515
 (inlined by) std::_Vector_base<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:380
 (inlined by) void std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_realloc_append<dht::token_range_endpoints const&>(dht::token_range_endpoints const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/vector.tcc:596
locator::describe_ring(replica::database const&, gms::gossiper const&, seastar::basic_sstring<char, unsigned int, 15u, true> const&, bool) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:1294
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242
 (inlined by) seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:80
seastar::reactor::do_run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:2635
std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:4684
```

Fix by using chunked_vector.

Fixes #24158

Closes scylladb/scylladb#24561
2025-06-19 16:51:01 +03:00
Petr Gusev
aa970bf2e4 sp::cas_shard: rename to get_cas_shard
We intend to introduce a separate cas_shard
class in the next commits. We rename the existing
function here to avoid conflicts.
2025-06-18 11:51:48 +02:00
Kefu Chai
b3e2561ed8 service: do not include unused headers
These unused includes were identified by clang-include-cleaner. After
auditing these source files, all of the reports have been confirmed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-03-20 11:18:16 +08:00
Avi Kivity
696ce4c982 Merge "convert some parts of the gossiper to host ids" from Gleb
"
This series starts the conversion of the gossiper to use host ids to
index nodes. It does not touch the main map yet, but converts a lot of
internal code to host ids. There are also some unrelated cleanups that
were done while working on the series. One of them is dropping code
related to the old shadow round. We replaced the shadow round with the explicit
GOSSIP_GET_ENDPOINT_STATES verb in cd7d64f588
which is in scylla-4.3.0, so there should be no compatibility problem.
We already dropped a lot of old shadow round code in previous patches
anyway.

I tested manually that old and new nodes can co-exist in the same
cluster,
"

* 'gleb/gossiper-host-id-v2' of github.com:scylladb/scylla-dev: (33 commits)
  gossiper: drop unneeded code
  gossiper: move _expire_time_endpoint_map to host_id
  gossiper: move _just_removed_endpoints to host id
  gossiper: drop unused get_msg_addr function
  messaging_service: change connection dropping notification to pass host id only
  messaging_service: pass host id to remove_rpc_client in down notification
  treewide: pass host id to endpoint_lifecycle_subscriber
  treewide: drop endpoint life cycle subscribers that do nothing
  load_meter: move to host id
  treewide: use host id directly in endpoint state change subscribers
  treewide: pass host id to endpoint state change subscribers
  gossiper: drop deprecated unsafe_assassinate_endpoint operation
  storage_service: drop unused code in handle_state_removed
  treewide: drop endpoint state change subscribers that do nothing
  gossiper: drop ip address from handle_echo_msg and simplify code since host_id is now mandatory
  gossiper: start using host ids to send messages earlier
  messaging_service: add temporary address map entry on incoming connection
  topology_coordinator: notify about IP change from sync_raft_topology_nodes as well
  treewide: move everyone to use host id based gossiper::is_alive and drop ip based one
  storage_proxy: drop unused template
  ...
2025-03-13 13:36:31 +02:00
Dawid Mędrek
0a6137218a db/hints: Cancel draining when stopping node
Draining hints may occur in one of two scenarios:

* a node leaves the cluster and the local node drains all of the hints
  saved for that node,
* the local node is being decommissioned.

Draining may take some time and the hint manager won't stop until it
finishes. It's not a problem when decommissioning a node, especially
because we want the cluster to retain the data stored in the hints.
However, it may become a problem when the local node started draining
hints saved for another node and now it's being shut down.

There are two reasons for that:

* Generally, in situations like that, we'd like to be able to shut down
  nodes as fast as possible. The data stored in the hints won't
  disappear from the cluster yet since we can restart the local node.
* Draining hints may introduce flakiness in tests. Replaying hints doesn't
  have the highest priority and it's reflected in the scheduling groups we
  use as well as the explicitly enforced throughput. If there are a large
  number of hints to be replayed, it might affect our tests.
  It's already happened, see: scylladb/scylladb#21949.

To solve those problems, we change the semantics of draining. It will behave
as before when the local node is being decommissioned. However, when the
local node is only being stopped, we will immediately cancel all ongoing
draining processes and stop the hint manager. To make up for that, when we
start a node and it initializes a hint endpoint manager corresponding to
a node that's already left the cluster, we will begin the draining process
of that endpoint manager right away.

That should ensure all data is retained, while possibly speeding up
the shutdown process.

There's a small trade-off to it, though. If we stop a node, we can then
remove it. It won't have had a chance to replay hints it might have replayed
before these changes, but that's an edge case. We expect this commit to bring
more benefit than harm.

We also provide tests verifying that the implementation works as intended.

Fixes scylladb/scylladb#21949

Closes scylladb/scylladb#22811
2025-03-13 11:55:15 +02:00
Gleb Natapov
4ca627b533 treewide: pass host id to endpoint_lifecycle_subscriber 2025-03-11 12:09:22 +02:00
Gleb Natapov
8a747fbc2a treewide: drop endpoint life cycle subscribers that do nothing
Provide a default implementation for them instead. It will be easier to rework them later.
2025-03-11 12:09:22 +02:00
Gleb Natapov
2ea8df2cf5 storage_proxy: drop is_alive that works on ip since it is not used any more 2025-01-16 16:37:06 +02:00
Gleb Natapov
4d7c05ad82 hints: move create_hint_sync_point function to host ids
One of its callers is in the RESTful API, which gets ips from the user, so
we convert ips to ids inside the API handler using the gossiper before
calling the function. We need to deprecate the ip-based API and move to a
host-id-based one.
2025-01-15 16:30:28 +02:00
Sergey Zolotukhin
39785c6f4e storage_proxy/read_repair: Use partition_key instead of token as the key for the mutation diff calculation hashmap

This update addresses an issue in the mutation diff calculation algorithm used during read repair.
Previously, the algorithm used `token` as the hashmap key. Since `token` is derived from
the Murmur3 hash function, it can produce the same value for different partition keys, causing
corruption in the affected rows' values.
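
A minimal, self-contained illustration of the failure mode (the real code uses
dht::token and partition keys; the toy hash below only shows why a hash value
is not a safe map key):
```c++
#include <cassert>
#include <map>
#include <string>

// Toy "token": a deliberately weak hash so the collision is easy to show.
long toy_token(const std::string& pk) {
    return static_cast<long>(pk.size());   // "aa" and "bb" collide
}

int main() {
    // Keyed by token: the second partition silently overwrites the first --
    // the kind of corruption the read-repair diff could hit.
    std::map<long, std::string> by_token;
    by_token[toy_token("aa")] = "diff for aa";
    by_token[toy_token("bb")] = "diff for bb";
    assert(by_token.size() == 1);           // collision: one entry lost

    // Keyed by the partition key itself: no collision is possible.
    std::map<std::string, std::string> by_pk;
    by_pk["aa"] = "diff for aa";
    by_pk["bb"] = "diff for bb";
    assert(by_pk.size() == 2);
    return 0;
}
```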

Fixes scylladb/scylladb#19101
2025-01-03 09:53:02 +01:00
Avi Kivity
f3eade2f62 treewide: relicense to ScyllaDB-Source-Available-1.0
Drop the AGPL license in favor of a source-available license.
See the blog post [1] for details.

[1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/
2024-12-18 17:45:13 +02:00
Ferenc Szili
7f29b7d8f6 storage_proxy: propagate group0 client and TSM dependency
This commit makes storage_proxy::remote dependent on raft_group0_client
and topology_state_machine. storage_proxy::remote gets references to these via
the call to start_remote(). These references will be needed to call
storage_service::truncate_table_with_tablets().
2024-12-04 11:30:06 +01:00
Gleb Natapov
20d1b80535 view: move view building to host id
Use host ids in view building code as well.
2024-12-02 10:31:13 +02:00
Gleb Natapov
0ca14ef8b7 hints: use host id to send hints
Drop the address translation that is no longer needed. Templates here are used
temporarily until another user of the function (MV) is converted as
well.
2024-12-02 10:31:12 +02:00
Gleb Natapov
12937aeb7f storage_proxy: move to addressing nodes by host ids instead of ips
In this rather large patch we move to addressing nodes in storage proxy by
host ids instead of ips. Some subsystems that storage proxy calls are
not yet converted to host ids, so we translate back and forth when we
interact with them.
2024-12-02 10:31:11 +02:00
Gleb Natapov
020e8010e8 storage_proxy: remove unused function 2024-11-24 11:01:39 +02:00
Pavel Emelyanov
9fd8eba3ec proxy: Don't keep truncate timeout as optional argument
Because it is never actually optional -- the only caller of truncate_blocking()
always knows the timeout it wants this method to use.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20620
2024-09-24 08:25:54 +03:00
Piotr Dulikowski
da5f4faac1 Merge 'mv: reject user requests by coordinator when a replica is overloaded by MVs' from Wojciech Mitros
Currently, when a view update backlog of one replica is full, the write is still sent by the coordinator to all replicas. Because of the backlog, the write fails on the replica, causing inconsistency that needs to be fixed by repair. To avoid these inconsistencies, this patch adds a check on the coordinator for overloaded replicas. As a result, a write may be rejected before being sent to any replicas and later retried by the user, when the replica is no longer overloaded.

This patch does not remove the replica write failures, because we still may reach a full backlog when more view updates are generated after the coordinator check is performed and before the write reaches the replica.

Fixes scylladb/scylladb#17426

Closes scylladb/scylladb#18334

* github.com:scylladb/scylladb:
  mv: test the view update behavior
  mv: add test for admission control
  storage_proxy: return overloaded_exception instead of throwing
  mv: reject user requests by coordinator when a replica is overloaded by MVs
2024-08-27 12:50:34 +02:00
Wojciech Mitros
5eaae05aaf mv: reject user requests by coordinator when a replica is overloaded by MVs
Currently, when a replica's view update backlog is full, the write is still
sent by the coordinator to all replicas. Because of the backlog, the write
fails on the replica, causing inconsistency that needs to be fixed by repair.
To avoid these inconsistencies, this patch adds a check on the coordinator
for overloaded replicas. As a result, a write may be rejected before being
sent to any replicas and later retried by the user, when the replica is no
longer overloaded.

Fixes scylladb/scylladb#17426
2024-08-02 12:12:19 +02:00
Pavel Emelyanov
a1dbaba9e1 proxy: Use remote gossiper to start hints resource manager
By the time the hints resource manager is started, the proxy already has its
remote part initialized. Remote returns a const gossiper pointer, but
after the previous change the hints code can live with it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-07-26 16:29:03 +03:00
Pavel Emelyanov
6c1e5c248f main,proxy: Drain proxy in its stop_remote
Currently proxy initialization is pretty dispersed; in particular, it's
stopped in several steps -- first drain_on_shutdown(), then
stop_remote(). In between there's nothing that needs the proxy in any
particular state, so those two steps can be merged into one.

refs: scylladb/scylladb#2737

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#19344
2024-06-27 12:26:51 +02:00
Wojciech Mitros
f70f774e40 mv: gossip the same backlog if a different backlog was sent in a response
Currently, there are 2 ways of sharing a backlog with other nodes: through
a gossip mechanism, and with responses to replica writes. In gossip, we
check each second if the backlog changed, and if it did we update other
nodes with it. However, if the backlog for this node changed on another
node via a write response, the gossiped backlog is currently not updated,
so if, after the response, the backlog goes back to the value from the previous
gossip round, it will not get sent and the other node will be left with an
outdated backlog.
This patch changes this by notifying the gossip code that the backlog changed
since the last gossip round, so a different backlog could have been sent
through the response piggyback mechanism. With that information, gossip
will send the backlog to other nodes in the following gossip round even if it appears unchanged.
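
One way to picture the mechanism (hypothetical and condensed): a flag records
that a backlog may have gone out via a write response, forcing the next gossip
round to publish the backlog even if its value looks unchanged:
```c++
#include <atomic>

struct view_backlog_publisher {
    long _last_gossiped = -1;
    std::atomic<bool> _sent_in_response{false};  // set when a response carried the backlog

    // Called once per gossip round (every second).
    bool should_gossip(long current_backlog) {
        // Publish if the value changed, or if a different value may have been
        // piggybacked on a write response since the last round.
        bool dirty = _sent_in_response.exchange(false);
        if (current_backlog != _last_gossiped || dirty) {
            _last_gossiped = current_backlog;
            return true;
        }
        return false;
    }
};
```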

Fixes: https://github.com/scylladb/scylladb/issues/18461
2024-06-06 10:45:15 +02:00
Wojciech Mitros
272e80fe0a node_update_backlog: divide adding and fetching backlogs
Currently, we only update the backlogs in node_update_backlog at the
same time when we're fetching them. This is done using storage_proxy's
method get_view_update_backlog, which is confusing because it's a getter
with side-effects. Additionally, we don't always want to update the
backlog when we're reading it (as in gossip which is only on shard 0)
and we don't always want to read it when we're updating it (when we're
not handling any writes but the backlog drops due to background work
finishing).

This patch divides node_update_backlog::add_fetch as well as
storage_proxy::get_view_update_backlog into two methods each: one
for updating and one for reading the backlog. This patch only replaces
the places where we're currently using the view backlog getter; more
situations where we should get/update the backlog should be considered
in a follow-up patch.
2024-06-06 10:45:13 +02:00
Piotr Dulikowski
68eca3778c Merge 'mv: throttle view update generation for large queries' from Wojciech Mitros
This series is a reupload of #13792 with a few modifications, namely a test is added and the conflicts with recent tablet-related changes are fixed.

See https://github.com/scylladb/scylladb/issues/12379 and https://github.com/scylladb/scylladb/pull/13583 for a detailed description of the problem and discussions.

This PR aims to extend the existing throttling mechanism to work with requests that internally generate a large amount of view updates, as suggested by @nyh.

The existing mechanism works in the following way:

* Client sends a request, we generate the view updates corresponding to the request and spawn background tasks which will send these updates to remote nodes
* Each background task consumes some units from the `view_update_concurrency_semaphore`, but doesn't wait for these units, it's just for tracking
* We keep track of the percent of consumed units on each node, this is called `view update backlog`.
* Before sending a response to the client we sleep for a short amount of time. The amount of time to sleep for is based on the fullness of this `view update backlog`. For a well behaved client with limited concurrency this will limit the amount of incoming requests to a manageable level.

This mechanism doesn't handle large DELETE queries. Deleting a partition is fast for the base table, but it requires us to generate a view update for every single deleted row. The number of deleted rows per single client request can be in the millions. Delaying response to the request doesn't help when a single request can generate millions of updates.

To deal with this we could treat the view update generator just like any other client and force it to wait a bit of time before sending the next batch of updates. The amount of time to wait for is calculated just like in the existing throttling code, it's based on the fullness of `view update backlogs`.

The new algorithm of view update generation looks something like this:
```c++
for(;;) {
    auto updates = generate_updates_batch_with_max_100_rows();
    co_await seastar::sleep(calculate_sleep_time_from_backlogs());
    spawn_background_tasks_for_updates(updates);
}
```
Fixes: https://github.com/scylladb/scylladb/issues/12379

Closes scylladb/scylladb#16819

* github.com:scylladb/scylladb:
  test: add test for bad_allocs during large mv queries
  mv: throttle view update generation for large queries
  exceptions: add read_write_timeout_exception, a subclass of request_timeout_exception
  db/view: extract view throttling delay calculation to a global function
  view_update_generator: add get_storage_proxy()
  storage_proxy: make view backlog getters public
2024-05-16 08:22:54 +02:00
Botond Dénes
155332ebf8 Merge 'Drain view_builder in generic drain (again)' from Pavel Emelyanov
Some time ago #16558 was merged, which moved the view builder drain into the generic drain. After this merge dtests started to fail from time to time, so the PR was reverted (see #18278). In #18295 the hang was found: the view builder drain was moved from "before" stopping the messaging service to "after" it, and view update write handlers in proxy hung for a hard-coded timeout of 5 minutes without being aborted. Tests don't wait for 5 minutes; they kill scylla, then complain about it and fail.

This PR brings back the original PR as well as the necessary fix that cancels view update write handlers on stop.

Closes scylladb/scylladb#18408

* github.com:scylladb/scylladb:
  Reapply "Merge 'Drain view_builder in generic drain' from ScyllaDB"
  view: Abort pending view updates when draining
2024-05-09 08:26:44 +03:00
Piotr Dulikowski
64ba620dc2 Merge 'hinted handoff: Use host IDs instead of IPs in the module' from Dawid Mędrek
This pull request introduces host ID in the Hinted Handoff module. Nodes are now identified by their host IDs instead of their IPs. The conversion occurs on the boundary between the module and `storage_proxy.hh`, but aside from that, IPs have been erased.

The changes take into consideration that there might still be old hints, still identified by IPs, on disk – at start-up, we map them to host IDs if possible so that they're not lost.

Refs scylladb/scylladb#6403
Fixes scylladb/scylladb#12278

Closes scylladb/scylladb#15567

* github.com:scylladb/scylladb:
  docs: Update Hinted Handoff documentation
  db/hints: Add endpoint_downtime_not_bigger_than()
  db/hints: Migrate hinted handoff when cluster feature is enabled
  db/hints: Handle arbitrary directories in resource manager
  db/hints: Start using hint_directory_manager
  db/hints: Enforce providing IP in get_ep_manager()
  db/hints: Introduce hint_directory_manager
  db/hints/resource_manager: Update function description
  db/hints: Coroutinize space_watchdog::scan_one_ep_dir()
  db/hints: Expose update lock of space watchdog
  db/hints: Add function for migrating hint directories to host ID
  db/hints: Take both IP and host ID when storing hints
  db/hints: Prepare initializing endpoint managers for migrating from IP to host ID
  db/hints: Migrate to locator::host_id
  db/hints: Remove noexcept in do_send_one_mutation()
  service: Add locator::host_id to on_leave_cluster
  service: Fix indentation
  db/hints: Fix indentation
2024-05-06 09:58:18 +02:00
Benny Halevy
890b890e36 storage_proxy: add mutate_locally(vector<frozen_mutation_and_schema>) method
Generalizing the ad-hoc implementation out of
group0_state_machine.write_mutations_to_database.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-02 19:42:58 +03:00
Jan Ciolek
4c5cfc7683 storage_proxy: make view backlog getters public
Storage proxy maintains information about both local
and remote view update backlogs.

This information might also be useful outside of storage_proxy,
so let's expose the functions that provide access to the backlog information.

There aren't any implementation quirks that would make
it unsafe to make the functions public; the worst that
can happen is that someone causes a lot of atomic operations
by repeatedly calling get_view_update_backlog().

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2024-05-02 10:59:55 +02:00
Pavel Emelyanov
d47053266b view: Abort pending view updates when draining
When the view builder is drained (it now happens very early, but the next
patch moves this into the regular drain), it waits for all on-going view build
steps to complete. This includes waiting for any outstanding proxy view
writes to complete as well.

View writes in proxy have a very high timeout of 5 minutes, but they are
cancellable. However, cancelling such writes happens in proxy's
drain_on_shutdown() call which, in turn, happens pretty late on
shutdown. Effectively, by the time it happens all view writes must have
completed already, so stop-time cancelling doesn't really work nowadays.

Next patch makes view builder drain happen a bit later during shutdown,
namely -- _after_ shutting down the messaging service. When it happens that
late, the non-working view write cancellation becomes critical, as the view
builder drain hangs for the aforementioned 5 minutes. This patch explicitly
cancels all view writes when view builder stops.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-02 08:16:12 +03:00
Pavel Emelyanov
5d992a4f01 proxy: Remove declaration of nonexisting view_update_write_response_handler class
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18417
2024-05-01 10:15:41 +03:00