scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-31 20:16:43 +00:00

Author	SHA1	Message	Date
Avi Kivity	0ae22a09d4	LICENSE: Update to version 1.1 Updated terms of non-commercial use (must be a never-customer).	2026-04-12 19:46:33 +03:00
Calle Wilund	fec7df7cbb	topology::snapshot: Add expiry (ttl) to RPC/topo op Not set yet, but includes it in messages so it can be properly set in calling code. Will add entry to manifest.	2026-02-23 11:37:17 +01:00
Calle Wilund	2bc633c3bd	storage_proxy: Add handler for SNAPSHOT_WITH_TABLETS	2026-02-23 10:44:42 +01:00
Petr Gusev	6d7af84fed	storage_proxy: add fencing to Paxos verbs This commit adds fencing support to all Paxos verbs: * Pass an optional (for backward compatibility) fencing_token as a parameter to the prepare, accept, learn, and prune verbs. * Call apply_fence twice — before and after accessing local data. This ensures that if the coordinator is fenced out mid-request, the replica does not return success, which would otherwise incorrectly contribute to achieving the target CL. Without this, a user might observe successful writes that become unreadable after the topology operation completes. * For prune, call apply_fence only once because it does not return a response to the LWT coordinator. Fixes scylladb/scylladb#22332	2025-09-15 11:24:53 +02:00
Benny Halevy	3feb759943	everywhere: use utils::chunked_vector for list of mutations Currently, we use std::vector<*mutation> to keep a list of mutations for processing. This can lead to large allocation, e.g. when the vector size is a function of the number of tables. Use a chunked vector instead to prevent oversized allocations. `perf-simple-query --smp 1` results obtained for fixed 400MHz frequency and PGO disabled: Before (read path): ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... 89055.97 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39417 insns/op, 18003 cycles/op, 0 errors) 103372.72 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39380 insns/op, 17300 cycles/op, 0 errors) 98942.27 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39413 insns/op, 17336 cycles/op, 0 errors) 103752.93 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39407 insns/op, 17252 cycles/op, 0 errors) 102516.77 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39403 insns/op, 17288 cycles/op, 0 errors) throughput: mean= 99528.13 standard-deviation=6155.71 median= 102516.77 median-absolute-deviation=3844.59 maximum=103752.93 minimum=89055.97 instructions_per_op: mean= 39403.99 standard-deviation=14.25 median= 39406.75 median-absolute-deviation=9.30 maximum=39416.63 minimum=39380.39 cpu_cycles_per_op: mean= 17435.81 standard-deviation=318.24 median= 17300.40 median-absolute-deviation=147.59 maximum=18002.53 minimum=17251.75 ``` After (read path) ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... 
59755.04 tps ( 66.2 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39466 insns/op, 22834 cycles/op, 0 errors) 71854.16 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39417 insns/op, 17883 cycles/op, 0 errors) 82149.45 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39411 insns/op, 17409 cycles/op, 0 errors) 49640.04 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.3 tasks/op, 39474 insns/op, 19975 cycles/op, 0 errors) 54963.22 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.3 tasks/op, 39474 insns/op, 18235 cycles/op, 0 errors) throughput: mean= 63672.38 standard-deviation=13195.12 median= 59755.04 median-absolute-deviation=8709.16 maximum=82149.45 minimum=49640.04 instructions_per_op: mean= 39448.38 standard-deviation=31.60 median= 39466.17 median-absolute-deviation=25.75 maximum=39474.12 minimum=39411.42 cpu_cycles_per_op: mean= 19267.01 standard-deviation=2217.03 median= 18234.80 median-absolute-deviation=1384.25 maximum=22834.26 minimum=17408.67 ``` `perf-simple-query --smp 1 --write` results obtained for fixed 400MHz frequency and PGO disabled: Before (write path): ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no} Disabling auto compaction 63736.96 tps ( 59.4 allocs/op, 16.4 logallocs/op, 14.3 tasks/op, 49667 insns/op, 19924 cycles/op, 0 errors) 64109.41 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 49992 insns/op, 20084 cycles/op, 0 errors) 56950.47 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50005 insns/op, 20501 cycles/op, 0 errors) 44858.42 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50014 insns/op, 21947 cycles/op, 0 errors) 28592.87 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50027 insns/op, 27659 cycles/op, 0 errors) throughput: mean= 51649.63 standard-deviation=15059.74 median= 56950.47 median-absolute-deviation=12087.33 maximum=64109.41 minimum=28592.87 instructions_per_op: mean= 49941.18 standard-deviation=153.76 median= 
50005.24 median-absolute-deviation=73.01 maximum=50027.07 minimum=49667.05 cpu_cycles_per_op: mean= 22023.01 standard-deviation=3249.92 median= 20500.74 median-absolute-deviation=1938.76 maximum=27658.75 minimum=19924.32 ``` After (write path) ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no} Disabling auto compaction 53395.93 tps ( 59.4 allocs/op, 16.5 logallocs/op, 14.3 tasks/op, 50326 insns/op, 21252 cycles/op, 0 errors) 46527.83 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50704 insns/op, 21555 cycles/op, 0 errors) 55846.30 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50731 insns/op, 21060 cycles/op, 0 errors) 55669.30 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50735 insns/op, 21521 cycles/op, 0 errors) 52130.17 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50757 insns/op, 21334 cycles/op, 0 errors) throughput: mean= 52713.91 standard-deviation=3795.38 median= 53395.93 median-absolute-deviation=2955.40 maximum=55846.30 minimum=46527.83 instructions_per_op: mean= 50650.57 standard-deviation=182.46 median= 50731.38 median-absolute-deviation=84.09 maximum=50756.62 minimum=50325.87 cpu_cycles_per_op: mean= 21344.42 standard-deviation=202.86 median= 21334.00 median-absolute-deviation=176.37 maximum=21554.61 minimum=21060.24 ``` Fixes #24815 Improvement for rare corner cases. No backport required Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#24919	2025-07-13 19:13:11 +03:00
Botond Dénes	b0d5462440	idl: extract full_position.idl from position_in_partition.idl A future user of position_in_partition.idl doesn't need full_position and so doesn't want to include full_position.hh to fix compile errors when including position_in_partition.idl.hh. Extract it to a separate idl file: it has a single user in a storage_proxy VERB.	2025-06-24 11:05:30 +03:00
Avi Kivity	f3eade2f62	treewide: relicense to ScyllaDB-Source-Available-1.0 Drop the AGPL license in favor of a source-available license. See the blog post [1] for details. [1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/	2024-12-18 17:45:13 +02:00
Ferenc Szili	36d35d2297	RPC: add truncate_with_tablets RPC with frozen_topology_guard This change introduces a new truncate_with_tablets RPC with a parameter of type service::frozen_topology_guard. This is materialized on replica nodes into a topology_guard which guarantees that truncate is performed under a global session, which, in turn, makes sure that we don't execute truncate as a result of stale RPCs. Also, this RPC does not have a timeout. Timeout will be handled on the coordinator side, and the truncate operation will not be allowed to time out.	2024-12-04 11:30:07 +01:00
Gleb Natapov	a1fdc8c847	storage_proxy: change mutation rpcs to send forward and reply addresses as host ids RPCs from old nodes will still use old format so translation will be used in this case. The change is backwards compatible thanks to RPC extensibility.	2024-12-02 10:31:12 +02:00
Petr Gusev	116444a01b	counter_mutation: add fencing As for regular mutations, we do the check twice in handle_counter_mutation, before and after applying the mutations. The last is important in case fence was moved while we were handling the request - some post-fence actions might have already happened at this time, so we can't treat the request as successful. For example, if topology change coordinator was switching to write_both_read_new, streaming might have already started and missed this update. In mutate_counters we can use a single fencing_token for all leaders, since all the erms are processed without yields and should underneath share the same token_metadata. We don't pass fencing token for replication explicitly in replicate_counter_from_leader since mutate_counter_on_leader_and_replicate doesn't capture erm and if the drain on the coordinator timed out the erm for replication might be different and we should use the corresponding (maybe the new one) topology version for outgoing write replication requests. This delayed replication is similar to any other background activity (e.g. writing hints) - it takes the current erm and the current token_metadata version for outgoing requests.	2023-07-25 12:10:03 +04:00
Petr Gusev	f2cbdc7f18	counter_mutation: add replica::exception_variant to signature We are going to add fencing for counter mutations, this means handle_counter_mutation will sometimes throw stale_topology_exception. RPC doesn't marshall exceptions transparently, exceptions thrown by server are delivered to the client as a general remote_verb_error, which is not very helpful. The common practice is to embed exceptions into handler result type. In this commit we use already existing exception_variant as an exception container. We mark exception_variant with [[version]] attribute in the idl file, this should handle the case when the old replica (without exception_variant in the signature) is replying to the new one.	2023-07-25 12:09:19 +04:00
Petr Gusev	5fb8da4181	hints: add fencing In this commit we just pass a fencing_token through hint_mutation RPC verb. The hints manager uses either storage_proxy::send_hint_to_all_replicas or storage_proxy::send_hint_to_endpoint to send a hint. Both methods capture the current erm and use the corresponding fencing token from it in the mutation or hint_mutation RPC verb. If these verbs are fenced out, the server stale_topology_exception is translated to a mutation_write_failure_exception on the client with an appropriate error message. The hint manager will attempt to resend the failed hint from the commitlog segment after a delay. However, if delivery is unsuccessful, the hint will be discarded after gc_grace_seconds. Closes #14580	2023-07-24 18:12:48 +02:00
Petr Gusev	94605e4839	storage_proxy.cc: add fencing to read RPCs On the call site we use the version captured in read_executor/erm/token_metadata. In the handlers we use apply_fence twice just like in mutation RPC. Fencing was also added to local query calls, such as query_result_local in make_data_request. This is for the case when query coordinator was isolated from topology change coordinator and didn't receive barrier_and_drain.	2023-06-15 15:52:50 +04:00
Petr Gusev	46f73fcaa6	storage_proxy: add fencing for mutation At the call site, we use the version captured in erm/token_metadata. In the handler, we use double checking: apply_fence after the local write guarantees that no mutations succeed on coordinators if the fence version has been updated on the replica during the write. Fencing was also added to mutate_locally calls on the request coordinator, for the case where this coordinator was isolated from the topology change coordinator and missed the barrier_and_drain command.	2023-06-15 15:52:49 +04:00
Petr Gusev	3a88c7769f	tracing::trace_info: pass by ref sizeof(std::optional<tracing::trace_info>) == 64 bytes, so it should be more efficient.	2023-05-30 14:32:10 +04:00
Petr Gusev	48600049fc	storage_proxy: pass inet_address_vector_replica_set by ref sizeof(inet_address_vector_replica_set) == 96 bytes and it has complex move constructor.	2023-05-30 14:04:53 +04:00
Petr Gusev	db4030f792	storage_proxy: paxos:: add [[ref]] attribute read_command, partition_key and paxos::proposal are marked with [[ref]]. partition_key contains dynamic allocations and can be big. proposal contains frozen_mutation, so it also contains dynamic allocations. The call sites are fine; they are already passed by reference.	2023-05-30 13:14:19 +04:00
Petr Gusev	f2cba20945	storage_proxy: read_XXX:: make read_command [[ref]] We had redundant copies at the call sites of these methods. Class read_command does not contain dynamic allocations, but it's quite big by itself (368 bytes).	2023-05-30 13:14:19 +04:00
Petr Gusev	ffb4e39e40	storage_proxy: hint_mutation:: make frozen_mutation [[ref]] We had a redundant copy in hint_mutation::apply_remotely. This frozen_mutation is dynamically allocated and can be arbitrarily large.	2023-05-30 13:14:19 +04:00
Petr Gusev	5adbb6cde2	storage_proxy: mutation:: make frozen_mutation [[ref]] We had a redundant copy in receive_mutation_handler forward_fn callback. This frozen_mutation is dynamically allocated and can be arbitrarily large. Fixes: #12504	2023-05-30 13:14:19 +04:00
Botond Dénes	2656968db2	service/storage_proxy: propagate last position on digest reads We want to transmit the last position as determined by the replica on both result and digest reads. Result reads already do that via the query::result, but digest reads don't yet as they don't return the full query::result structure, just the digest field from it. Add the last position to the digest read's return value and collect these in the digest resolver, along with the returned digests.	2022-08-10 06:03:37 +03:00
Benny Halevy	2b017ce285	schema, everywhere: define and use table_schema_version as a strong type Define table_schema_version as a distinct tagged_uuid class, So it can be differentiated from other uuid-class types, in particular table_id. Added reversed(table_schema_version) for convenience and uniformity since the same logic is currently open coded in several places. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:09:45 +03:00
Benny Halevy	1fda686f96	idl: make idl headers self-sufficient Add include statements to satisfy dependencies. Delete, now unneeded, include directives from the upper level source files. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:02:27 +03:00
Piotr Dulikowski	d3d9add219	storage_proxy: add per partition rate limit info to read RPC Now, the read RPC accepts the per partition rate limit info parameter. It is passed on to query_result_local(_digest) methods.	2022-06-22 20:16:49 +02:00
Piotr Dulikowski	02469e0b15	storage_proxy: add per partition rate limit info to write RPC Adds db::per_partition_rate_limit::info parameter to the write RPC. The rate limit info controls the behavior of the rate limiter on the replica.	2022-06-22 20:16:48 +02:00
Piotr Dulikowski	2162bb9f3b	storage_proxy: propagate rate_limit_exception through read RPC This commit modifies the read RPC and the storage_proxy logic so that the coordinator knows whether a read operation failed due to rate limit being exceeded, and returns `exceptions::rate_limit_exception` if that happens.	2022-06-22 20:16:48 +02:00
Piotr Dulikowski	51546b0609	storage_proxy: pass rate_limit_exception through write RPC This commit modifies the storage_proxy logic so that the coordinator knows whether a write operation failed due to rate limit being exceeded, and returns `exceptions::rate_limit_exception` when that happens.	2022-06-22 20:16:48 +02:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes were applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Gleb Natapov	1db151bd75	storage_proxy: move all verbs to the IDL Define all verbs in the IDL instead of manually coding them.	2022-01-10 14:58:28 +02:00

29 Commits
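
Several of the commits above (mutation, read, counter, and Paxos fencing) describe the same pattern: the handler calls apply_fence both before and after touching local data, so that a request whose topology version became stale mid-flight is not reported as successful. The following is a minimal Python sketch of that idea only, not ScyllaDB's actual C++ code. The `Replica` class, `handle_write`, `StaleTopologyError`, and the `during_write` hook are hypothetical names introduced for illustration; only the name `apply_fence` and the check-twice behavior come from the commit messages.

```python
from dataclasses import dataclass, field


class StaleTopologyError(Exception):
    """Hypothetical stand-in for the stale-topology error mentioned in the log."""


@dataclass
class Replica:
    # Lowest topology version this replica still accepts (assumed model).
    fence_version: int = 0
    data: dict = field(default_factory=dict)

    def apply_fence(self, token_version: int) -> None:
        # Reject requests issued under an older topology version.
        if token_version < self.fence_version:
            raise StaleTopologyError(
                f"token {token_version} < fence {self.fence_version}")

    def handle_write(self, token_version, key, value, during_write=None):
        self.apply_fence(token_version)   # first check: before local access
        self.data[key] = value            # local write
        if during_write is not None:
            during_write(self)            # test hook: simulate a concurrent fence move
        self.apply_fence(token_version)   # second check: after local access
        return "ok"


def move_fence(replica: Replica) -> None:
    # Simulates the topology coordinator advancing the fence mid-request.
    replica.fence_version = 6


replica = Replica(fence_version=5)
first = replica.handle_write(5, "k1", "v1")

try:
    second = replica.handle_write(5, "k2", "v2", during_write=move_fence)
except StaleTopologyError:
    second = "fenced"
```

In this sketch the second write is still applied locally, but the caller sees an error instead of a success, which mirrors the stated goal that a fenced-out coordinator must not count the replica toward its target consistency level.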