The previous implementation did not handle topology changes well:
* In `node_local_only` mode with CL=1, if the current node is pending, the CL is increased to 2, causing
`unavailable_exception`.
* If the current tablet is in `write_both_read_old` and we try to read with `node_local_only` on the new node, the replica list will be empty.
This patch changes `node_local_only` mode to always use `my_host_id` as the replica list. An explicit check ensures the current node is a replica for the operation; otherwise `on_internal_error` is called.
backport: not needed, since `node_local_only` is only used in LWT for tablets and it hasn't been released yet.
Closes scylladb/scylladb#25508
* github.com:scylladb/scylladb:
test_tablets_lwt: add test_lwt_during_migration
storage_proxy: node_local_only: always use my_host_id
The previous implementation did not handle topology changes well:
* In node_local_only mode with CL=1, if the current node is pending,
the CL is raised to 2, causing unavailable_exception.
* If the current tablet is in write_both_read_old and we read with
node_local_only on the new node, the replica list is empty.
This patch changes node_local_only mode to always use my_host_id as
the replica list. An explicit check ensures the current node is a
replica for the operation; otherwise on_internal_error is called.
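A minimal sketch of the new behavior, with hypothetical names standing in for the real storage_proxy/locator types:

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>
#include <vector>

using host_id = std::string;

// Stand-in for Scylla's on_internal_error(): fail loudly instead of
// proceeding with a wrong replica set.
[[noreturn]] inline void on_internal_error(const std::string& msg) {
    throw std::logic_error(msg);
}

// node_local_only: the replica list is always just my_host_id, but only
// after verifying that this node really is a replica for the operation.
// The returned list is never empty, so the CL can never be bumped.
inline std::vector<host_id> node_local_replicas(
        const host_id& my_host_id,
        const std::vector<host_id>& natural_replicas) {
    if (std::find(natural_replicas.begin(), natural_replicas.end(), my_host_id)
            == natural_replicas.end()) {
        on_internal_error("node_local_only used on a non-replica node");
    }
    return {my_host_id};
}
```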
endpoint_filter() is used by batchlog to select nodes to replicate
to.
It contains an unordered_multimap data structure that maps rack names
to nodes.
It misuses std::unordered_map::bucket_count() to count the number of
racks. While values that share a key in a multimap will definitely
be in the same bucket, it's possible for values that don't share a
key to share a bucket. Therefore bucket_count() undercounts the
number of racks.
Fix this by using a more accurate data structure: a map of a set.
The patch changes validated.bucket_count() to validated.size()
and validated.size() to a new variable nr_validated.
The patch does cause an extra two allocations per rack (one for the
unordered_map node, one for the unordered_set bucket vector), but
this is only used for logged batches, so it is amortized over all
the mutations in the logged batch.
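The fix can be illustrated with a map-of-sets sketch (simplified types; the real code maps rack names to endpoints):

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>

// Rack name -> set of node ids. The number of racks is simply the number
// of distinct keys, and nodes within a rack are deduplicated by the set.
using rack_map = std::unordered_map<std::string, std::unordered_set<int>>;

inline size_t count_racks(const rack_map& validated) {
    // size() counts distinct keys exactly, unlike bucket_count(), which
    // reports hash-table buckets and is unrelated to the number of keys.
    return validated.size();
}

inline size_t count_nodes(const rack_map& validated) {
    size_t nr_validated = 0;
    for (const auto& e : validated) {
        nr_validated += e.second.size();
    }
    return nr_validated;
}
```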
Closes scylladb/scylladb#25493
To improve debuggability, we need to propagate original error messages
from Paxos verbs to the user. This change adds constructors that take
an error message directly, enabling better error reporting.
Additionally, functions such as write_timeout_to_read,
write_failure_to_read etc are updated to use these message-based
constructors. These functions are used in storage_proxy::cas to
convert between different error types, and without this change,
they could lose the original error message during conversion.
Derive both vnode_effective_replication_map
and local_effective_replication_map from
static_effective_replication_map as both are static and per-keyspace.
However, local_effective_replication_map does not need vnodes
for the mapping of all tokens to the local node.
Refs #22733
* No backport required
Closes scylladb/scylladb#25222
* github.com:scylladb/scylladb:
locator: abstract_replication_strategy: implement local_replication_strategy
locator: vnode_effective_replication_map: convert clone_data_gently to clone_gently
locator: abstract_replication_map: rename make_effective_replication_map
locator: abstract_replication_map: rename calculate_effective_replication_map
replica: database: keyspace: rename {create,update}_effective_replication_map
locator: effective_replication_map_factory: rename create_effective_replication_map
locator: abstract_replication_strategy: rename vnode_effective_replication_map_ptr et. al
locator: abstract_replication_strategy: rename global_vnode_effective_replication_map
keyspace: rename get_vnode_effective_replication_map
dht: range_streamer: use naked e_r_m pointers
storage_service: use naked e_r_m pointers
alternator: ttl: use naked e_r_m pointers
locator: abstract_replication_strategy: define is_local
Prefer specializing the local replication strategy,
local effective replication map, et al. by defining
an is_local() predicate, similar to uses_tablets().
Note that is_vnode_based() still applies to local replication
strategy.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
send_to_live_endpoints() computes sets of endpoints to
which we send mutations - remote endpoints (where we send
to each set as a whole, using forwarding), and local endpoints,
where we send directly. To make handling regular, each local
endpoint is treated as its own set. Thus, each local endpoint
and each datacenter receive one RPC call (or local call if the
coordinator is also a replica).
These sets are maintained in a std::unordered_map (for remote endpoints)
and a vector with the same value_type as the map (for local endpoints).
The key part of the vector payload is initialized to the empty string.
We simplify this by noting that the datacenter name is never used
after this computation, so the vector can hold just the replica sets,
without the fake datacenter name. The downstream variable `all` is
adjusted to point just to the replica set as well.
As a reward for our efforts, the vector's elements become nothrow
move constructible (no string), and we can convert it to a small_vector,
which reduces allocations in the common case of RF<=3.
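The grouping described above can be sketched roughly as follows (hypothetical names and simplified types):

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using endpoint = std::string;
using replica_set = std::vector<endpoint>;

// Remote endpoints are grouped per datacenter (one forwarded RPC per DC),
// while each local-DC endpoint becomes its own singleton set. The local
// sets carry only the replicas, with no fake datacenter name.
struct grouped_endpoints {
    std::unordered_map<std::string, replica_set> remote; // dc -> replicas
    std::vector<replica_set> local;                      // one set per endpoint
};

inline grouped_endpoints group_endpoints(
        const std::vector<std::pair<endpoint, std::string>>& replicas,
        const std::string& local_dc) {
    grouped_endpoints g;
    for (const auto& [ep, dc] : replicas) {
        if (dc == local_dc) {
            g.local.push_back({ep});   // each local endpoint is its own set
        } else {
            g.remote[dc].push_back(ep);
        }
    }
    return g;
}
```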
The reduction in allocations is visible in perf-simple-query --write
results:
```
before 165080.62 tps ( 60.3 allocs/op, 16.0 logallocs/op, 14.2 tasks/op, 53438 insns/op, 26705 cycles/op, 0 errors)
after 164513.83 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.2 tasks/op, 53347 insns/op, 26761 cycles/op, 0 errors)
```
The instruction count reduction is a not very impressive 70/op:
before
```
instructions_per_op:
mean= 53412.22 standard-deviation=32.12
median= 53420.53 median-absolute-deviation=20.32
maximum=53462.23 minimum=53290.06
```
after
```
instructions_per_op:
mean= 53350.32 standard-deviation=32.38
median= 53353.71 median-absolute-deviation=13.60
maximum=53415.20 minimum=53222.24
```
Perhaps the extra code from small_vector defeated some inlining,
which negated some of the gain from the reduced allocations. Perhaps
a build with full profiling will gain it back (my builds were without
pgo).
Closes scylladb/scylladb#25270
Currently, get_cas_shard uses shard_for_reads to decide which
shard to use for LWT execution—both on replicas and the coordinator.
If the coordinator is not a replica, shard_for_reads returns a default
shard (shard 0). There are at least two problems with this:
* shard 0 can become overloaded, because all LWT
coordinators-but-not-replicas are served on it.
* mismatch with replicas: the default shard doesn't match what
shard_for_reads returns on replicas. This hinders the "same shard for
client and server" RPC level optimization.
In this commit we change get_cas_shard to use a primary replica
shard if the current node is not a replica. This guarantees that all
LWT coordinators for the same tablet will be served on the same shard.
This is important for LWT coordinator locks
(paxos::paxos_state::get_cas_lock). Also, if all tablet replicas on
different nodes live on the same shard, RPC
optimization will make sure that no additional smp::submit_to will
be needed on the server side.
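A rough sketch of the shard selection, under simplified assumptions (all names here are hypothetical stand-ins for the real locator/storage_proxy types):

```cpp
#include <algorithm>
#include <vector>

// If the coordinator is itself a replica, use its own read shard;
// otherwise fall back to the primary replica's shard, so every
// non-replica coordinator for the same tablet agrees on one shard
// instead of defaulting to shard 0.
struct replica_info {
    int host;
    unsigned shard;   // what shard_for_reads would return on that replica
};

inline unsigned get_cas_shard(int this_host,
                              const std::vector<replica_info>& replicas) {
    auto it = std::find_if(replicas.begin(), replicas.end(),
                           [&](const replica_info& r) { return r.host == this_host; });
    if (it != replicas.end()) {
        return it->shard;              // coordinator is a replica
    }
    return replicas.front().shard;     // primary replica's shard, not shard 0
}
```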
Fixes scylladb/scylladb#20497
We add the remove_non_local_host_ids() helper, which
will be used in the next commit to support the read
path. HostIdVector concept is introduced to be able
to handle both host_id_vector_replica_set and
host_id_vector_topology_change uniformly.
The storage_proxy_coordinator_mutate_options class
is declared outside of storage_proxy to avoid C++
compiler complaints about default field initializers.
In particular, some storage_proxy methods use this
class for optional parameters with default values,
which is not allowed when the class is defined inside
storage_proxy.
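The workaround pattern looks roughly like this (hypothetical names; the point is that the options class must be complete before it can appear as a defaulted parameter):

```cpp
// Declared outside the proxy class, so its default member initializers
// are usable in the proxy's defaulted parameters below.
struct coordinator_mutate_options {
    bool node_local_only = false;
    int timeout_ms = 1000;
};

struct storage_proxy_like {
    // OK: the options type is complete, so `opts = {}` works. Had the
    // options class been nested inside this class, the compiler would
    // reject the default argument, complaining that the default member
    // initializers are required before the enclosing class is complete.
    int mutate(coordinator_mutate_options opts = {}) {
        return opts.node_local_only ? 0 : opts.timeout_ms;
    }
};
```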
In upcoming commits, we want to add a node_local_only flag to both read
and write paths in storage_proxy. This requires passing the flag from
query_processor to the part of storage_proxy where replica selection
decisions are made.
For reads, it's sufficient to add the flag to the existing
coordinator_query_options class. For writes, there is no such options
container, so we introduce coordinator_mutate_options in this commit.
In the future, we may move some of the many mutate() method arguments
into this container to simplify the code.
Most of the create_write_response_handler overloads follow the same
signature pattern to satisfy the sp::mutate_prepare call. The one which
doesn't follow it is invoked by others and is responsible for creating
a concrete handler instance. In this refactoring commit we rename
it to make_write_response_handler to reduce confusion.
This is a refactoring commit. We remove extra lambda parameters from
mutate_prepare since the CreateWriteHandler lambda can simply
capture them.
We can't std::move(permit) in another mutate_prepare overload,
because each handler wants its own copy of this permit.
We call paxos_store::ensure_initialized in the beginning of
storage_proxy::cas to create a paxos state table for a user table if
it doesn't exist. When the LWT coordinator sends RPCs to replicas,
some of them may not yet have the paxos schema. In
paxos_store::get_paxos_state_schema we just wait for them to appear,
or throw 'no_such_column_family' if the base table was dropped.
Introduce paxos_store abstraction to isolate Paxos state access.
Prepares for supporting either system.paxos or a co-located
table as the storage backend.
Currently it grows dynamically and triggers an oversized allocation
warning. Also, it may be hard to find a sufficiently large contiguous
memory chunk after the system runs for a while. This patch pre-allocates
enough memory for ~1M outstanding writes per shard.
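The pre-allocation idea, sketched with a plain std::vector and a stand-in element type:

```cpp
#include <vector>

// Reserving the expected capacity up front replaces many growing
// reallocations (and one eventual oversized contiguous allocation at an
// awkward time) with a single allocation at startup. The 1M figure
// follows the commit message; the element type is a stand-in.
struct outstanding_write {
    void* handler = nullptr;
};

inline std::vector<outstanding_write> make_write_slots() {
    std::vector<outstanding_write> slots;
    slots.reserve(1'000'000);   // ~1M outstanding writes per shard
    return slots;
}
```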
Fixes #24660
Fixes #24217
Closes scylladb/scylladb#25098
When a node shuts down, in storage service, after storage_proxy RPCs are stopped, some write handlers within storage_proxy may still be waiting for background writes to complete. These handlers hold appropriate ERMs to block schema changes before the write finishes. After the RPCs are stopped, these writes cannot receive the replies anymore.
If, at the same time, there are RPC commands executing `barrier_and_drain`, they may get stuck waiting for these ERM holders to finish, potentially blocking node shutdown until the writes time out.
This change introduces cancellation of all outstanding write handlers from storage_service after the storage proxy RPCs were stopped.
Fixes scylladb/scylladb#23665
Backport: since this fixes an issue that frequently causes failures in CI, backport to 2025.1, 2025.2, and 2025.3.
Closes scylladb/scylladb#24714
* https://github.com/scylladb/scylladb:
storage_service: Cancel all write requests on storage_proxy shutdown
test: Add test for unfinished writes during shutdown and topology change
During a graceful node shutdown, RPC listeners are stopped in `storage_service::drain_on_shutdown`
as one of the first steps. However, even after RPCs are shut down, some write handlers in
`storage_proxy` may still be waiting for background writes to complete. These handlers retain the ERM.
Since the RPC subsystem is no longer active, replies cannot be received, and if any RPC commands are
concurrently executing `barrier_and_drain`, they may get stuck waiting for those writes. This can block
the messaging server shutdown and delay the entire shutdown process until the write timeout occurs.
This change introduces the cancellation of all outstanding write handlers in `storage_proxy`
during shutdown to prevent unnecessary delays.
Fixes scylladb/scylladb#23665
This test reproduces an issue where a topology change and an ongoing write query
during query coordinator shutdown can cause the node to get stuck.
When a node receives a write request, it creates a write handler that holds
a copy of the current table's ERM (Effective Replication Map). The ERM ensures
that no topology or schema changes occur while the request is being processed.
After the query coordinator receives the required number of replica write ACKs
to satisfy the consistency level (CL), it sends a reply to the client. However,
the write response handler remains alive until all replicas respond — the remaining
writes are handled in the background.
During shutdown, when all network connections are closed, these responses can no longer
be received. As a result, the write response handler is only destroyed once the write
timeout is reached.
This becomes problematic because the ERM held by the handler blocks topology or schema
change commands from executing. Since shutdown waits for these commands to complete,
this can lead to unnecessary delays in node shutdown and restarts, and occasional
test case failures.
Test for: scylladb/scylladb#23665
Add a test of the batchlog manager replay loop applying failed batches
while some replica is down.
The test reproduces an issue where the batchlog manager tries to replay
a failed batch, doesn't get a response from some replica, and becomes
stuck.
It verifies that the batchlog manager can eventually recover from this
situation and continue applying failed batches.
On shutdown of batchlog manager, abort all writes of replayed batches
by the batchlog manager.
To achieve this we set the appropriate write_type to BATCH, and on
shutdown cancel all write handlers with this type.
When replaying a batch mutation from the batchlog manager and sending it
to all replicas, create the write response handler as cancellable.
To achieve this we define a new wrapper type for batchlog mutations -
batchlog_replay_mutation, and this allows us to overload
create_write_response_handler for this type. This is similar to how it's
done with hint_wrapper and read_repair_mutation.
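The wrapper-type overload pattern can be sketched as follows (simplified stand-in types; the real handlers are far richer):

```cpp
#include <string>

// Wrapping a mutation in a distinct type lets overload resolution pick a
// different create_write_response_handler without changing the call sites
// for plain mutations, analogous to hint_wrapper and read_repair_mutation.
struct mutation_t {
    std::string data;
};

struct batchlog_replay_mutation {
    mutation_t m;
};

inline std::string create_write_response_handler(const mutation_t&) {
    return "regular";       // default, non-cancellable handler
}

inline std::string create_write_response_handler(const batchlog_replay_mutation&) {
    return "cancellable";   // batch replay: cancellable on shutdown
}
```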
Currently mutate_internal has a boolean parameter `counter_write` that
indicates whether the write is of counter type or not.
We replace it with a more general parameter that allows specifying the
write type.
It is compatible with the previous behavior - for a counter write, the
type COUNTER is passed, and otherwise a default value will be used
as before.
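The bool-to-enum generalization, sketched with hypothetical names:

```cpp
#include <string>

// The boolean `counter_write` becomes a write_type parameter whose
// default preserves the old behavior. Enumerator names follow the commit
// message; the rest is a stand-in.
enum class write_type { SIMPLE, COUNTER, BATCH };

inline std::string describe_write(write_type t = write_type::SIMPLE) {
    switch (t) {
    case write_type::COUNTER: return "counter";
    case write_type::BATCH:   return "batch";
    default:                  return "simple";  // old non-counter default
    }
}
```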
Currently, when computing the mutation to be stored in system.batchlog,
we go through data_value. In turn this goes through `bytes` type
(#24810), so it causes a large contiguous allocation if the batch is
large.
Fix by going through the more primitive, but less contiguous,
atomic_cell API.
Fixes#24809.
Closes scylladb/scylladb#24811
Old nodes do not expect global topology request names to be in the
request_type field, so set it only if the cluster is already fully
upgraded.
Closes scylladb/scylladb#24731
This PR is a step towards enabling LWT for tablet-based tables.
It pursues several goals:
* Make it explicit that the tablet can't migrate after the `cas_shard` check in `select_statement`/`modification_statement`. Currently, `storage_proxy::cas` expects that the client calls it on the correct shard -- the one which owns the partition key the LWT is running on. The reasons for that are explained in [this commit](f16e3b0491 (diff-1073ea9ce4c5e00bb6eb614154f523ba7962403a4fe6c8cd877d1c8b73b3f649)) message. The statements check the current shard and invoke `bounce_to_shard` if it's not the right one. However, the erm strong pointer is only captured in `storage_proxy::cas`, and until that moment there is no explicit structure in the code which would prevent ongoing migrations. In this PR we introduce such a structure -- `erm_handle`. We create it before the `cas_check` and pass it down to `storage_proxy::cas` and `paxos_response_handler`.
* Another goal of this PR is an optimization -- we don't want to hold the erm for the duration of the entire LWT, unless the topology change directly affects the current tablet. There is a `tablet_metadata_guard` class which is used for long-running tablet operations. It automatically switches to a new erm if the topology change represented by the new erm doesn't affect the current tablet. We use this class in `erm_handle` if the table uses tablets. Otherwise, `erm_handle` just stores the erm directly.
* Fixes [shard bouncing issue in alternator](https://github.com/scylladb/scylladb/issues/17399)
Backport: not needed (new feature).
Closes scylladb/scylladb#24495
* github.com:scylladb/scylladb:
LWT: make cas_shard non-optional in sp::cas
LWT: create cas_shard in select_statement
LWT: create cas_shard in modification and batch statements
LWT: create cas_shard in alternator
LWT: use cas_shard in storage_proxy::cas
do_query_with_paxos: remove redundant cas_shard check
storage_proxy: add cas_shard class
sp::cas_shard: rename to get_cas_shard
token_metadata_guard: a topology guard for a token
tablet_metadata_guard: mark as noncopyable and nonmoveable
Take cas_shard parameter in sp::cas and pass token_metadata_guard down to paxos_response_handler.
We make cas_shard parameter optional in storage_proxy methods
to make the refactoring easier. The sp::cas method constructs a new
token_metadata_guard if it's not set. All call sites pass null
in this commit, we will add the proper implementation in the next
commits.
The sp::cas method must be called on the correct shard,
as determined by sp::cas_shard. Additionally, there must
be no asynchronous yields between the shard check and
capturing the erm strong pointer in sp::cas. While
this condition currently holds, it's fragile and
easy to break.
To address this, future commits will move the capture of
token_metadata_guard to the call sites of sp::cas, before
performing the shard check.
As a first step, this commit introduces a cas_shard class
that wraps both the target shard and a token_metadata_guard
instance. This ensures the returned shard remains valid for
the given tablet as long as the guard is held.
In the next commits, we’ll pass a cas_shard instance
to sp::cas as a separate parameter.
After paxos state is repaired in begin_and_repair_paxos we need to
re-check the state regardless of whether the write back succeeded. This
is how the code worked originally, but it was unintentionally changed
when coroutinized in 61b2e41a23.
Fixes #24630
Closes scylladb/scylladb#24651
Currently only one global topology request (such as truncate, cdc repair, cleanup and alter table) can be pending. If one is already pending others will be rejected with an error. This is not very user friendly, so this series introduces a queue of global requests which allows queuing many global topology requests simultaneously.
Fixes: #16822
No need to backport since this is a new feature.
Closes scylladb/scylladb#24293
* https://github.com/scylladb/scylladb:
topology coordinator: simplify truncate handling in case request queue feature is disabled
topology coordinator: fix indentation after the previous patch
topology coordinator: allow running multiple global commands in parallel
topology coordinator: Implement global topology request queue
topology coordinator: Do not cancel global requests in cancel_all_requests
topology coordinator: store request type for each global command
topology request: make it possible to hold global request types in request_type field
topology coordinator: move alter table global request parameters into topology_request table
topology coordinator: move cleanup global command to report completion through topology_request table
topology coordinator: no need to create updates vector explicitly
topology coordinator: use topology_request_tracking_mutation_builder::done() instead of open code it
topology coordinator: handle error during new_cdc_generation command processing
topology coordinator: remove unneeded semicolon
topology coordinator: fix indentation after the last commit
topology coordinator: move new_cdc_generation topology request to use topology_request table for completion
gms/feature_service: add TOPOLOGY_GLOBAL_REQUEST_QUEUE feature flag
The following was seen:
```
!WARNING | scylla[6057]: [shard 12:strm] seastar_memory - oversized allocation: 212992 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at
[Backtrace #0]
void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89
(inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:99
seastar::current_tasktrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:136
seastar::current_backtrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:169
seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:848
seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:911
operator new(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:1706
std::allocator<dht::token_range_endpoints>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/allocator.h:196
(inlined by) std::allocator_traits<std::allocator<dht::token_range_endpoints> >::allocate(std::allocator<dht::token_range_endpoints>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/alloc_traits.h:515
(inlined by) std::_Vector_base<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:380
(inlined by) void std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_realloc_append<dht::token_range_endpoints const&>(dht::token_range_endpoints const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/vector.tcc:596
locator::describe_ring(replica::database const&, gms::gossiper const&, seastar::basic_sstring<char, unsigned int, 15u, true> const&, bool) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:1294
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242
(inlined by) seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:80
seastar::reactor::do_run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:2635
std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:4684
```
Fix by using chunked_vector.
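A minimal sketch of the chunked-vector idea (Scylla's utils::chunked_vector is the real, full-featured version):

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Elements live in fixed-size chunks, so growth allocates one small
// chunk at a time instead of one ever-larger contiguous block -- no
// oversized allocation no matter how many elements accumulate.
template <typename T, size_t ChunkSize = 1024>
class chunked_vector_sketch {
    std::vector<std::unique_ptr<T[]>> _chunks;
    size_t _size = 0;
public:
    void push_back(const T& v) {
        if (_size % ChunkSize == 0) {
            _chunks.push_back(std::make_unique<T[]>(ChunkSize));
        }
        _chunks[_size / ChunkSize][_size % ChunkSize] = v;
        ++_size;
    }
    const T& operator[](size_t i) const {
        return _chunks[i / ChunkSize][i % ChunkSize];
    }
    size_t size() const { return _size; }
};
```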
Fixes #24158
Closes scylladb/scylladb#24561
We are about to change start() to return a proxy object rather
than a `const interval_bound<T>&`. This is generally transparent,
except in one case: `auto x = i.start()`. With the current implementation,
we'll copy the object referred to and assign it to `x`. With the planned
implementation, the proxy object will be assigned to `x`, but it
will keep referring to `i`.
To prevent such problems, rename start() to start_ref() and end()
to end_ref(). This forces us to audit all calls, and redirect calls
that will break to new start_copy() and end_copy() methods.
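The hazard can be demonstrated with a simplified proxy (names follow the commit message; the proxy itself is a stand-in):

```cpp
#include <string>

// With a proxy return type, `auto x = i.start()` deduces the proxy,
// which keeps referring to the interval instead of holding a copy. The
// explicit _ref/_copy split makes the caller's intent visible.
struct interval {
    std::string _start;

    struct start_proxy {
        interval* owner;
        operator const std::string&() const { return owner->_start; }
    };

    start_proxy start_ref() { return {this}; }        // refers back to *this
    std::string start_copy() const { return _start; } // independent copy
};
```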
After allowing multiple commands to run in parallel, the code that
handles multiple truncates to the same table can be simplified: it is
now executed only if the request queue feature is disabled, so it does
not need to handle the case where a request may be in the queue.
Now that we have a global request queue, do not check that there is a
global request before adding another one. Amend the truncation test that
expects this explicitly, and add another one that checks that two
truncates can be submitted in parallel.
Requests, together with their parameters, are added to the
topology_request tables, and the queue of active global requests is
kept in the topology state. They are processed one by one by the
topology state machine.
Fixes: #16822
gate_closed_exception likely signals that we have shutdown-order
issues. If we just swallow it, we lose information about which
exact component was shut down prematurely.
For example, we stopped local storage before group0 during shutdown
in main.cc. If a group0 command arrives, topology_state_load might
try to write something and get mutation_write_failure_exception,
which results in 'applier fiber stopped because of the error'.
There is no other information in the logs in this case, other
than 'mutation_write_failure_exception'. It's not clear what the
original problem is and what component is triggering it.
In this commit we add a warning to the logs when gate_closed_exception
is thrown from lmutate or rmutate.
Another option is to just remove the try_catch_nested line and allow
gate_closed_exception to be logged as an error below. However,
this might break some tests which check ERROR lines in the logs.
Before this change, if a read executor had just enough targets to
achieve the query's CL, and there was a connection drop (e.g. node failure),
the read executor waited for the entire request timeout to give drivers
time to execute a speculative read in the meantime. Such behavior doesn't
work well when a very long query timeout (e.g. 1800s) is set, because
the unfinished request blocks topology changes.
This change implements a mechanism to throw a new
read_failure_exception_with_timeout in the aforementioned scenario.
The exception is caught by the CQL server, which conducts the waiting
after the ERM is released. The new exception inherits from
read_failure_exception, because layers that don't catch the exception
(such as the mapreduce service) should handle it just like a regular
read_failure. However, when the CQL server catches the exception, it
returns read_timeout_exception to the client, because after the
additional waiting such an error message is more appropriate
(read_timeout_exception was also returned before this change was
introduced).
This change:
- Rewrite cql_server::connection::process_request_one to use
seastar::futurize_invoke and try_catch<> instead of utils::result_try
- Add a new read_failure_exception_with_timeout and throw it in storage_proxy
- Add sleep in CQL server when the new exception is caught
- Catch local exceptions in Mapreduce Service and convert them
to std::runtime_error.
- Add get_cql_exclusive to manager_client.py
- Add test_long_query_timeout_erm
No backport needed - minor issue fix.
Closes scylladb/scylladb#23156
* github.com:scylladb/scylladb:
test: add test_long_query_timeout_erm
test: add get_cql_exclusive to manager_client.py
mapreduce: catch local read_failure_exception_with_timeout
transport: storage_proxy: release ERM when waiting for query timeout
transport: remove redundant references in process_request_one
transport: fix the indentation in process_request_one
transport: add futures in CQL server exception handling
Before this change, if a read executor had just enough targets to
achieve the query's CL, and there was a connection drop (e.g. node failure),
the read executor waited for the entire request timeout to give drivers
time to execute a speculative read in the meantime. Such behavior doesn't
work well when a very long query timeout (e.g. 1800s) is set, because
the unfinished request blocks topology changes.
This change implements a mechanism to throw a new
read_failure_exception_with_timeout in the aforementioned scenario.
The exception is caught by the CQL server, which conducts the waiting
after the ERM is released. The new exception inherits from
read_failure_exception, because layers that don't catch the exception
(such as the mapreduce service) should handle it just like a regular
read_failure. However, when the CQL server catches the exception, it
returns read_timeout_exception to the client, because after the
additional waiting such an error message is more appropriate
(read_timeout_exception was also returned before this change was
introduced).
This change:
- Add new read_failure_exception_with_timeout exception
- Add throw of read_failure_exception_with_timeout in storage_proxy
- Add abort_source to the CQL server, as well as a to_stop() method for
correct abort handling
- Add sleep in CQL server when the new exception is caught
Refs #21831