scylladb

Author	SHA1	Message	Date
Amnon Heiman	72414b613b	Split the timed_rate_moving_average into data and timer This patch splits the timed_rate_moving_average functionality into two: a data class, rates_moving_average, and a wrapper class, timed_rate_moving_average, that uses a timer to update the rates periodically. To make the transition as simple as possible, timed_rate_moving_average keeps the original API. A new helper class, meter_timer, was introduced to handle the timer update functionality. This change required minimal code adaptation in some other parts of the code. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2022-07-26 15:59:33 +03:00
Avi Kivity	29c28dcb0c	Merge 'Unstall get_range_to_address_map' from Benny Halevy Prevent stalls in this path as seen in performance testing. Also, add a respective rest_api test. Fixes #11114 Closes #11115 * github.com:scylladb/scylla: storage_service: reserve space in get_range_to_address_map and friends storage_service: coroutinize get_range_to_address_map and friends storage_service: pass replication map to get_range_to_address_map and friends storage_service: get_range_to_address_map: move selection of arbitrary ks to api layer test: rest_api: test range_to_endpoint_map and describe_ring	2022-07-25 18:06:28 +03:00
Piotr Sarna	c195ce1b82	query: allow merging non-empty forward_result with an empty one Merging empty results was already allowed, but in one way only: empty.merge(nonempty, r); // was permitted nonempty.merge(empty, r); // not permitted With this commit, both methods are permitted. In order to remove copying, the other result is now taken by rvalue reference, with all call sites being updated accordingly. Fixes #10446 Fixes #10174 Closes #11064	2022-07-25 18:06:28 +03:00
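The symmetric merge described in the commit above can be sketched roughly as follows. This is a simplified illustration and not the actual `forward_result` code: `partial_result`, its `values` member, and the `reducer` callback are hypothetical names, and an empty result is modeled as an empty vector.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a partial aggregation result.
struct partial_result {
    std::vector<long> values;  // an empty vector models an empty result

    // `other` is taken by rvalue reference so its storage can be moved
    // instead of copied.
    void merge(partial_result&& other,
               const std::function<long(long, long)>& reducer) {
        if (other.values.empty()) {
            return;  // nonempty.merge(empty): keep our values
        }
        if (values.empty()) {
            values = std::move(other.values);  // empty.merge(nonempty): adopt theirs
            return;
        }
        // Both sides hold data, so they must have the same number of columns.
        assert(values.size() == other.values.size());
        for (std::size_t i = 0; i < values.size(); ++i) {
            values[i] = reducer(values[i], other.values[i]);
        }
    }
};
```

With this shape, call sites pass the other result with `std::move`, and either side may be empty without special handling by the caller.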
Benny Halevy	bc5f6cf45d	storage_service: reserve space in get_range_to_address_map and friends To reduce the chance of reallocation. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-25 18:06:28 +03:00
Benny Halevy	5eb31eff64	storage_service: coroutinize get_range_to_address_map and friends And add calls to maybe_yield to prevent stalls in this path as seen in performance testing. Also, add a respective rest_api test. Fixes #11114 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-25 18:06:28 +03:00
Tomasz Grabiec	76d20aeb96	Merge 'Refactor group 0 operations (joining, leaving, removing).' from Kamil Braun A series of refactors to the `raft_group0` service. Read the commits in topological order for best experience. This PR is more or less equivalent to the second-to-last commit of PR https://github.com/scylladb/scylla/pull/10835, I split it so we could have an easier time reviewing and pushing it through. Closes #11024 * github.com:scylladb/scylla: service: storage_service: additional assertions and comments service/raft: raft_group0: additional logging, assertions, comments service/raft: raft_group0: pass seed list and `as_voter` flag to `join_group0` service/raft: raft_group0: rewrite `remove_from_group0` service/raft: raft_group0: rewrite `leave_group0` service/raft: raft_group0: split `leave_group0` from `remove_from_group0` service/raft: raft_group0: introduce `setup_group0` service/raft: raft_group0: introduce `load_my_addr` service/raft: raft_group0: make some calls abortable service/raft: raft_group0: remove some temporary variables service/raft: raft_group0: refactor `do_discover_group0`. service/raft: raft_group0: rename `create_server_for_group` to `create_server_for_group0` service/raft: raft_group0: extract `start_server_for_group0` function service/raft: raft_group0: create a private section service/raft: discovery: `seeds` may contain `self`	2022-07-25 18:06:28 +03:00
Benny Halevy	3d62a1592f	storage_service: pass replication map to get_range_to_address_map and friends Do this before they are made asynchronous in the next patch, so they work on the same coherent snapshot of the token_metadata and replication map as their caller. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-25 18:06:28 +03:00
Petr Gusev	52142bb8b3	raft_group_registry, is_alive for non-existent server_id We could yield between updating the list of servers in raft/fsm and updating the raft_address_map, e.g. in case of a set_configuration. If tick_leader happens before the raft_address_map is updated, is_alive will be called with server_id that is not in the map yet. Fix: scylladb/scylla-dtest#2753 Closes #11111	2022-07-25 18:06:28 +03:00
Benny Halevy	0b474866a3	storage_service: get_range_to_address_map: move selection of arbitrary ks to api layer It is only needed for the "storage_service/describe_ring" api and service/storage_service shouldn't bother with it. It's an api sugar coating. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-25 18:06:28 +03:00
Gleb Natapov	f1f1176963	service: raft: do not allow downgrading non expiring entry to expiring one in raft_address_map Expiring entries are added when a message is received from an unknown host. If the host is later added to the raft configuration they become non expiring. After that they can only be removed when the host is dropped from the configuration, but they should never become expiring again. Refs #10826	2022-07-21 17:40:04 +02:00
Asias He	39db15d2cb	misc_services: Fix cache hitrate update This patch avoids unnecessary CACHE_HITRATES updates through gossip. After this patch: Publish CACHE_HITRATES in case: - We haven't published it at all - The diff is bigger than 1% and we haven't published in the last 5 seconds - The diff is really big (more than 10%) Note: A peer node can know the cache hitrate through the read_data, read_mutation_data and read_digest RPC verbs, which have cache_temperature in the response. So there is no need to update CACHE_HITRATES through gossip at high frequency. We do the recalculation faster if the diff is bigger than 0.01. It is useful to do the calculation even if we do not publish the CACHE_HITRATES through gossip, since the recalculation will call the table->set_global_cache_hit_rate to set the hitrate. Fixes #5971 Closes #11079	2022-07-21 11:31:30 +03:00
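The publish rule listed in the commit above can be sketched as a small predicate. This is only an illustration under stated assumptions: `hitrate_publisher` and its members are hypothetical names, hit rates are fractions in the range 0 to 1, and the thresholds (1%, 10%, 5 seconds) are the ones quoted in the commit message.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <optional>

// Hypothetical helper; not the actual misc_services code.
struct hitrate_publisher {
    using clock = std::chrono::steady_clock;

    std::optional<float> last_published;  // empty until the first publish
    clock::time_point last_publish_time{};

    bool should_publish(float current, clock::time_point now) const {
        if (!last_published) {
            return true;  // never published at all
        }
        assert(now >= last_publish_time);
        const float diff = std::fabs(current - *last_published);
        if (diff > 0.10f) {
            return true;  // really big change: publish immediately
        }
        // Moderate change: publish only if the last publish is old enough.
        return diff > 0.01f && now - last_publish_time > std::chrono::seconds(5);
    }

    void record_publish(float current, clock::time_point now) {
        last_published = current;
        last_publish_time = now;
    }
};
```

The ordering of the checks means a large jump bypasses the 5-second throttle, while small fluctuations never reach gossip.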
Kamil Braun	4e42aeb0df	service: storage_service: additional assertions and comments	2022-07-20 19:39:29 +02:00
Kamil Braun	25bb8384af	service/raft: raft_group0: additional logging, assertions, comments Move some rare logs from TRACE to INFO level. Add some assertions. Write some more comments, including FIXMEs and TODOs. Remove unnecessary `_shutdown_gate.hold()` (this is not a background task).	2022-07-20 19:39:29 +02:00
Kamil Braun	c9f1ec1268	service/raft: raft_group0: pass seed list and `as_voter` flag to `join_group0` Group 0 discovery would internally fetch the seed list from gossiper. Gossiper would return the seed list from conf/scylla.yaml. This seed list is proper for the bootstrapping scenario - we specify the initial contact points for a node that joins a cluster. We'll have to use a different list of seeds for group 0 discovery for the upgrade scenario. Prepare for that by taking the seed list as a parameter. In the bootstrap scenario we'll pass the seed list down from `storage_service::join_cluster`. Additionally, `join_group0` now takes an `as_voter` flag, which is `false` in the bootstrap scenario (we initially join as a non-voter) but will be `true` in the upgrade scenario.	2022-07-20 19:39:29 +02:00
Kamil Braun	684d8171ca	service/raft: raft_group0: rewrite `remove_from_group0` See previous commit. `remove_from_group0` had a similar problem as `leave_group0`: it would handle the case where `raft_group0::_group0` variant was not `raft::group_id` (i.e. we haven't joined group 0), but RAFT local feature was enabled - i.e. the yet-unimplemented upgrade case - by running discovery and calling `send_group0_modify_config`. Instead, if we see that we've joined group 0 before, assume that we're still a member and simply use the Raft `modify_config` API to remove another server. If we're not a member it means we either decommissioned or were removed by someone else; then we have no business trying to remove others. There's also the unimplemented upgrade case but that will come in another pull request. Finally, add some logic for handling an edge case: suppose we joined group 0 recently and we still didn't fully update our RPC address map (it's being updated asynchronously by Raft's io_fiber). Thus we may fail to find a member of group 0 in the address map. To handle this, ensure we're up-to-date by performing a Raft read barrier. State some assumptions in a comment. Add a TODO for handling failures. Remove unnecessary `_shutdown_gate.hold()` (this is not a background task).	2022-07-20 19:39:29 +02:00
Kamil Braun	eeeef0bc50	service/raft: raft_group0: rewrite `leave_group0` One of the following cases is true: 1. RAFT local feature is disabled. Then we don't do anything related to group 0. 2. RAFT local feature is enabled and when we bootstrapped, we joined group 0. Then `raft_group0::_group0` variant holds the `raft::group_id` alternative. 3. RAFT local feature is enabled and when we bootstrapped we didn't join group 0. This means the RAFT local feature was disabled when we bootstrapped and we're in the (unimplemented yet) upgrade scenario. `raft_group0::_group0` variant holds the `std::monostate` alternative. The problem with the previous implementation was that it checked for the conditions of the third case above - that RAFT local feature is enabled but `_group0` does not hold `raft::group_id` - and if those conditions were true, it executed some logic that didn't really make sense: it ran the discovery algorithm and called `send_group0_modify_config` RPC. In this rewrite I state some assumptions that `leave_group0` makes: - we've finished the startup procedure. - we're being run during decommission - after the node entered LEFT status. In the new implementation, if `_group0` does not hold `raft::group_id` (checked by the internal `joined_group0()` helper), we simply return. This is the yet-unimplemented upgrade case left for a follow-up PR. Otherwise we fetch our Raft server ID (at this point it must be present - otherwise it's a fatal error) and simply call `modify_config` from the `raft::server` API. Remove unnecessary call to `_shutdown_gate.hold()` (this is not a background task).	2022-07-20 19:39:29 +02:00
Kamil Braun	75608bcd2f	service/raft: raft_group0: split `leave_group0` from `remove_from_group0` `leave_group0` was responsible for both removing a different node from group 0 and removing ourselves (leaving) group 0. The two scenarios are a bit different and the handling will be rewritten in following commits. Split `leave_group0` into two functions. Remove the incorrect comment about idempotency - saying that the procedure is idempotent is an oversimplification, one could argue it's incorrect since the second call simply hangs, at least in the case of leaving group 0; following commits will state what's happening more precisely. Add some additional logging and assertions where the two functions are called in `storage_service`.	2022-07-20 19:39:29 +02:00
Kamil Braun	ee0219dfe3	service/raft: raft_group0: introduce `setup_group0` Contains all logic for deciding to join (or not join) group 0. Prepare for the case where we don't want to join group 0 immediately on startup - the upgrade scenario (will be implemented in a follow-up). Move the group 0 setup step earlier in `storage_service::join_cluster`. `join_group0()` is now a private member of `raft_group0`. Some more comments were written.	2022-07-20 19:39:29 +02:00
Kamil Braun	4b0db59671	service/raft: raft_group0: introduce `load_my_addr` Compared to `load_or_create_my_addr` this function assumes that the address is already present on disk; if not, it's a fatal error. Use it in places where it would indeed be a fatal error if the address was missing.	2022-07-20 19:39:29 +02:00
Kamil Braun	f0f9aa5c7d	service/raft: raft_group0: make some calls abortable There are some calls to `modify_config` which should react to aborts (e.g. when we shutdown Scylla). There are also calls to `send_group0_modify_config` which should probably also react to aborts, but the functions don't take an abort_source parameter. This is fixable but I left TODOs for now.	2022-07-20 19:39:29 +02:00
Kamil Braun	ab8c3c6742	service/raft: raft_group0: remove some temporary variables Make the code a bit shorter.	2022-07-20 19:39:29 +02:00
Kamil Braun	b193ea8ec0	service/raft: raft_group0: refactor `do_discover_group0`. The function no longer accesses the `_group0` variant directly, instead it is made a member of `service::persistent_discovery`; the caller guarantees that `persistent_discovery` is not destroyed before the function finishes. The function is now named `run`. A short comment was written at the declaration site. Make some members of `persistent_discovery` private, as they are only used by `run`. Simplify `struct tracker`, store the discovery output separately (`struct tracker` is now responsible for a single thing). Enclose the `parallel_for_each` over requests in a common coroutine which keeps alive all the necessary things for the loop body and performs the last step which was previously inside a `then`.	2022-07-20 19:39:29 +02:00
Kamil Braun	6d9d493e2a	service/raft: raft_group0: rename `create_server_for_group` to `create_server_for_group0`	2022-07-20 19:39:28 +02:00
Kamil Braun	54d9219257	service/raft: raft_group0: extract `start_server_for_group0` function Extract part of the code from `join_group0`. Add some comments. This part will be reused.	2022-07-20 19:38:53 +02:00
Kamil Braun	dca1ce52ed	service/raft: raft_group0: create a private section Move member functions and fields used internally by the `raft_group0` class into a private section. Write some comments.	2022-07-20 19:38:53 +02:00
Kamil Braun	d28170b1a5	service/raft: discovery: `seeds` may contain `self` The set of seeds passed to the discovery algorithm may contain `self`. The implementation will filter the `self` out (it calls `step(seeds)`; `step` iterates over the given list of peers and ignores `_self`). Specify this at the `discovery` constructor declaration site. Simplify the code constructing `persistent_discovery` in `raft_group0::discover_group0` using this assumption.	2022-07-20 19:38:53 +02:00
Botond Dénes	014c5b56a3	query-result: move last_pos up to query::result query_result was the wrong place to put last position into. It is only included in data-responses, but not on digest-responses. If we want to support empty pages from replicas, both data and digest responses have to include the last position. So hoist up the last position to the parent structure: query::result. This is a breaking change inter-node ABI wise, but it is fine: the current code wasn't released yet. Closes #11072	2022-07-20 13:28:09 +03:00
Tomasz Grabiec	04f9a150be	Merge 'raft: split `can_vote` field from `server_address` to separate struct' from Kamil Braun Whether a server can vote in a Raft configuration is not part of the address. `server_address` was used in many contexts where `can_vote` is irrelevant. Split the struct: `server_address` now contains only `id` and `server_info` as it did before `can_vote` was introduced. Instead we have a `config_member` struct that contains a `server_address` and the `can_vote` field. Also remove an "unsafe" constructor from `server_address` where `id` was provided but `server_info` was not. The constructor was used for tests where `server_info` is irrelevant, but it's important not to forget about the info in production code. Replace the constructor with helper functions which specify in comments that they are supposed to be used in tests or in contexts where `info` doesn't matter (e.g. when checking presence in an `unordered_set`, where the equality operator and hash operate only on the `id`). Closes #11047 * github.com:scylladb/scylla: raft: fsm: fix `entry_size` calculation for config entries raft: split `can_vote` field from `server_address` to separate struct serializer_impl: generalize (de)serialization of `unordered_set` to_string: generalize `operator<<` for `unordered_set`	2022-07-20 12:20:52 +02:00
Asias He	482ee369d0	storage_service: Increase watchdog_interval for node ops The node operations using node_ops_cmd have the following procedure: 1) Send node_ops_cmd::replace_prepare to all nodes 2) Send node_ops_cmd::replace_heartbeat to all nodes In a large cluster 1) might take a long time to finish. As a result, when the node starts to perform 2), the heartbeat timer on the peer nodes, which is 30s, might have already timed out. This fails the whole node operation. We have patches to make 1) more efficient and faster. https://github.com/scylladb/scylla/pull/10850 https://github.com/scylladb/scylla/pull/10822 In addition to that, this patch increases the heartbeat timeout to reduce false-positive timeouts. Refs #10337 Refs #11078 Closes #11081	2022-07-20 12:56:17 +03:00
Kamil Braun	daf9c53bb8	raft: split `can_vote` field from `server_address` to separate struct Whether a server can vote in a Raft configuration is not part of the address. `server_address` was used in many contexts where `can_vote` is irrelevant. Split the struct: `server_address` now contains only `id` and `server_info` as it did before `can_vote` was introduced. Instead we have a `config_member` struct that contains a `server_address` and the `can_vote` field. Also remove an "unsafe" constructor from `server_address` where `id` was provided but `server_info` was not. The constructor was used for tests where `server_info` is irrelevant, but it's important not to forget about the info in production code. The constructor was used for two purposes: - Invoking set operations such as `contains`. To solve this we use C++20 transparent hash and comparator functions, which allow invoking `contains` and similar functions by providing a different key type (in this case `raft::server_id` in set of addresses, for example). - Constructing addresses without `info`s in tests. For this we provide helper functions in the test helpers module and use them.	2022-07-18 18:22:10 +02:00
Jadw1	29a0be75da	forward_service: support UDA and native aggregate parallelization Enables parallelization of UDA and native aggregates. The way the query is parallelized is the same as in #9209. Separate reduction type for `COUNT(*)` is left for compatibility reasons.	2022-07-18 15:25:41 +02:00
Jadw1	d13f347621	DB: Add `scylla_aggregates` system table Save information about a UDA's reduce function to the `scylla_aggregates` table and distribute it across the cluster.	2022-07-18 15:25:37 +02:00
Benny Halevy	dc93564247	storage_proxy: abstract_read_resolver: swallow gate_closed exception Like other errors triggered on shutdown, this one is triggered by #8995. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #11029	2022-07-14 09:26:34 +03:00
Avi Kivity	957bf48eb2	Merge 'Don't throw exceptions on the replica side when handling single partition reads and writes' from Piotr Dulikowski This PR gets rid of exception throws/rethrows on the replica side for writes and single-partition reads. This goal is achieved without using `boost::outcome` but rather by replacing the parts of the code which throw with appropriate seastar idioms and by introducing two helper functions: 1. `try_catch` allows inspecting the type and value behind an `std::exception_ptr`. When libstdc++ is used, this function does not need to throw the exception and avoids the very costly unwind process. This is based on the "How to catch an exception_ptr without even try-ing" proposal mentioned in https://github.com/scylladb/scylla/issues/10260. This function allows replacing the current `try..catch` chains which inspect the exception type and account it in the metrics. Example: ```c++ // Before try { std::rethrow_exception(eptr); } catch (std::runtime_error& ex) { // 1 } catch (...) { // 2 } // After if (auto* ex = try_catch<std::runtime_error>(eptr)) { // 1 } else { // 2 } ``` 2. `make_nested_exception_ptr`, which is meant to be a replacement for `std::throw_with_nested`. Unlike the original function, it does not require an exception being currently thrown and does not throw itself - instead, it takes the nested exception as an `std::exception_ptr` and produces another `std::exception_ptr` itself. Apart from the above, seastar idioms such as `make_exception_future`, `co_await as_future`, `co_return coroutine::exception()` are used to propagate exceptions without throwing. This brings the number of exception throws to zero for single partition reads and writes (tested with scylla-bench, --mode=read and --mode=write). 
Results from `perf_simple_query`: ``` Before (`719724e4df`): Writes: Normal: 127841.40 tps ( 56.2 allocs/op, 13.2 tasks/op, 50042 insns/op, 0 errors) Timeouts: 94770.81 tps ( 53.1 allocs/op, 5.1 tasks/op, 78678 insns/op, 1000000 errors) Reads: Normal: 138902.31 tps ( 65.1 allocs/op, 12.1 tasks/op, 43106 insns/op, 0 errors) Timeouts: 62447.01 tps ( 49.7 allocs/op, 12.1 tasks/op, 135984 insns/op, 936846 errors) After (d8ac4c02bfb7786dc9ed30d2db3b99df09bf448f): Writes: Normal: 127359.12 tps ( 56.2 allocs/op, 13.2 tasks/op, 49782 insns/op, 0 errors) Timeouts: 163068.38 tps ( 52.1 allocs/op, 5.1 tasks/op, 40615 insns/op, 1000000 errors) Reads: Normal: 151221.15 tps ( 65.1 allocs/op, 12.1 tasks/op, 43028 insns/op, 0 errors) Timeouts: 192094.11 tps ( 41.2 allocs/op, 12.1 tasks/op, 33403 insns/op, 960604 errors) ``` Closes #10368 * github.com:scylladb/scylla: database: avoid rethrows when handling exceptions from commitlog database: convert throw_commitlog_add_error to use make_nested_exception_ptr utils: add make_nested_exception_ptr storage_proxy: don't rethrow when inspecting replica exceptions on write path database: don't rethrow rate_limit_exception storage_proxy: don't rethrow the exception in abstract_read_resolver::error utils/exceptions.cc: don't rethrow in is_timeout_exception utils/exceptions: add try_catch utils: add abi/eh_ia64.hh storage_proxy: don't rethrow exceptions from replicas when accounting read stats message: get rid of throws in send_message{,_timeout,_abortable} database/{query,query_mutations}: don't rethrow read semaphore exceptions	2022-07-11 14:01:41 +03:00
Nadav Har'El	cc69177dcc	config: fix printing of experimental feature list Recently we noticed a regression where with certain versions of the fmt library, SELECT value FROM system.config WHERE name = 'experimental_features' returns string numbers, like "5", instead of feature names like "raft". It turns out that the fmt library keeps changing its overload resolution order when there are several ways to print something. For enum_option<T> we happen to have two conflicting ways to print it: 1. We have an explicit operator<<. 2. We have an implicit converter to the type held by T. We were hoping that the operator<< always wins. But in fmt 8.1, there is special logic that if the type is convertible to an int, this is used before operator<<()! For experimental_features_t, the type held in it was an old-style enum, so it is indeed convertible to int. The solution I used in this patch is to replace the old-style enum in experimental_features_t by the newer and more recommended "enum class", which does not have an implicit conversion to int. I could have fixed it in other ways, but it wouldn't have been much prettier. For example, dropping the implicit converter would require us to change a bunch of switch() statements over enum_option (and not just experimental_features_t, but other types of enum_option). Going forward, all uses of enum_option should use "enum class", not "enum". tri_mode_restriction_t was already using an enum class, and now so does experimental_features_t. I changed the examples in the comments to also use "enum class" instead of enum. This patch also adds to the existing experimental_features test a check that the feature names are words that are not numbers. Fixes #11003. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11004	2022-07-11 09:17:30 +02:00
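The language rule that the fix above relies on can be checked in isolation. The sketch below shows only the difference in implicit convertibility between an old-style enum and an `enum class`; it does not reproduce fmt's overload selection, and the enum names, enumerators, and `to_text` helper are hypothetical.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <type_traits>

enum old_feature { OLD_UDF, OLD_RAFT };  // old-style (unscoped) enum
enum class feature { UDF, RAFT };        // scoped enum

// An unscoped enum converts to int implicitly, so a formatter that prefers
// "convertible to int" can pick the numeric path over operator<<.
static_assert(std::is_convertible_v<old_feature, int>);
// A scoped enum has no such conversion, leaving operator<< as the only way.
static_assert(!std::is_convertible_v<feature, int>);

std::ostream& operator<<(std::ostream& os, feature f) {
    switch (f) {
    case feature::UDF:  return os << "udf";
    case feature::RAFT: return os << "raft";
    }
    return os << "unknown";
}

// Hypothetical helper that prints a feature through operator<<.
std::string to_text(feature f) {
    std::ostringstream os;
    os << f;
    return os.str();
}
```

Because the scoped enum cannot silently decay to a number, any printing path has to go through the explicit `operator<<`, which yields the feature name.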
Benny Halevy	acae3cc223	treewide: stop use of deprecated coroutine::make_exception Convert most use sites from `co_return coroutine::make_exception` to `co_await coroutine::return_exception{,_ptr}` where possible. Where this is done in a catch clause, convert to `co_return coroutine::exception`, generating an exception_ptr if needed. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10972	2022-07-07 15:02:16 +03:00
Piotr Dulikowski	2008db58c4	storage_proxy: don't rethrow when inspecting replica exceptions on write path Now, storage_proxy::send_to_live_endpoints doesn't rethrow exceptions received from the replica logic when inspecting them.	2022-07-05 16:41:09 +02:00
Piotr Dulikowski	ffb95c4840	storage_proxy: don't rethrow the exception in abstract_read_resolver::error Now, the abstract_read_resolver::error uses the utils::try_catch utility to analyse the error received from replica instead of rethrowing it.	2022-07-05 16:41:09 +02:00
Avi Kivity	74b02b9719	Merge 'storage_service: track restore_replica_count' from Benny Halevy This mini-series adds an _async_gate to storage_service that is closed on stop() and it performs restore_replica_count under this gate so it can be orderly waited on in stop() Fixes #10672 Closes #10922 * github.com:scylladb/scylla: storage_service: handle_state_removing: restore_replica_count under _async_gate storage_service: add async_gate for background work	2022-07-05 13:18:59 +03:00
Piotr Dulikowski	491cc2a8df	storage_proxy: don't rethrow exceptions from replicas when accounting read stats Now, make_{data,mutation_data,digest}_requests don't rethrow the exception received from replicas when increasing the error count metric.	2022-07-04 19:27:06 +02:00
Pavel Emelyanov	ea820e13b3	database: Move flushing logging Now it happens before calling database::drain(), but drain does more than flushing; it does lots of other things. More elaborate logging is better. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-07-04 13:42:45 +03:00
Pavel Emelyanov	b5c4553a66	storage_service: Sanitize stop_transport() It generates an ignored future, which can be avoided by forwarding to the shared_future<>'s promise. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-07-01 17:17:53 +03:00
Pavel Emelyanov	85033ea6ae	Merge 'A bunch of refactors related to Raft group 0' from Kamil Braun The commits here were extracted from PR https://github.com/scylladb/scylla/pull/10835 which implements upgrade procedure for Raft group 0. They are mostly refactors which don't affect the behavior of the system, except one: the commit `4d439a16b3` causes all schema changes to be bounced to shard 0. Previously, they would only be bounced when the local Raft feature was enabled. I do that because: 1. eventually, we want this to be the default behavior 2. in the upgrade PR I remove the `is_raft_enabled()` function - the function was basically created with the mindset "Raft is either enabled or not" - which was right when we didn't support upgrade, but will be incorrect when we introduce intermediate states (when we upgrade from non-raft-based to raft-based operations); the upgrade PR introduces another mechanism to dispatch based on the upgrade state, but for the case of bouncing to shard 0, dispatching is simply not necessary. Closes #10864 * github.com:scylladb/scylla: service/raft: raft_group_registry: add assertions when fetching servers for groups service/raft: raft_group_registry: remove `_raft_support_listener` service/raft: raft_group0: log adding/removing servers to/from group 0 RPC map service/raft: raft_group0: move group 0 RPC handlers from `storage_service` service/raft: messaging: extract raft_addr/inet_addr conversion functions service: storage_service: initialize `raft_group0` in `main` and pass a reference to `join_cluster` treewide: remove unnecessary `migration_manager::is_raft_enabled()` calls test/boost: memtable_test: perform schema operations on shard 0 test/boost: cdc_test: remove test_cdc_across_shards message: rename `send_message_abortable` to `send_message_cancellable` message: change parameter order in `send_message_oneway_timeout`	2022-06-29 16:51:54 +03:00
Benny Halevy	cb0b728ed1	storage_service: handle_state_removing: restore_replica_count under _async_gate Track the background restore_replica_count fiber so it can be awaited in stop() by closing the _async_gate. Fixes #10672 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-06-29 10:51:26 +03:00
Benny Halevy	1b1c02b243	storage_service: add async_gate for background work To be used for tracking restore_replica_count and waiting for it on stop(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-06-29 10:47:49 +03:00
Avi Kivity	3131cbea62	Merge 'query: allow replica to provide arbitrary continue position' from Botond Dénes Currently, we use the last row in the query result set as the position where the query is continued from on the next page. Since only live rows make it into query result set, this mandates the query to be stopped on a live row on the replica, lest any dead rows or tombstones processed after the live rows, would have to be re-processed on the next page (and the saved reader would have to be thrown away due to position mismatch). This requirement of having to stop on a live row is problematic with datasets which have lots of dead rows or tombstones, especially if these form a prefix. In the extreme case, a query can time out before it can process a single live row and the data-set becomes effectively unreadable until compaction gets rid of the tombstones. This series prepares the way for the solution: it allows the replica to determine what position the query should continue from on the next page. This position can be that of a dead row, if the query stopped on a dead row. For now, the replica supplies the same position that would have been obtained with looking at the last row in the result set, this series merely introduces the infrastructure for transferring a position together with the query result, and it prepares the paging logic to make use of this position. If the coordinator is not prepared for the new field, it will simply fall-back to the old way of looking at the last row in the result set. As I said for now this is still the same as the content of the new field so there is no problem in mixed clusters. Refs: https://github.com/scylladb/scylla/issues/3672 Refs: https://github.com/scylladb/scylla/issues/7689 Refs: https://github.com/scylladb/scylla/issues/7933 Tests: manual upgrade test. 
I wrote a data set with: ``` ./scylla-bench -mode=write -workload=sequential -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -clustering-row-size=8096 -partition-count=1000 ``` This creates large, 80MB partitions, which should fill many pages if read in full. Then I started a read workload: ``` ./scylla-bench -mode=read -workload=uniform -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -duration=10m -rows-per-request=9000 -page-size=100 ``` I confirmed that paging is happening as expected, then upgraded the nodes one-by-one to this PR (while the read-load was ongoing). I observed no read errors or any other errors in the logs. Closes #10829 * github.com:scylladb/scylla: query: have replica provide the last position idl/query: add last_position to query_result mutlishard_mutation_query: propagate compaction state to result builder multishard_mutation_query: defer creating result builder until needed querier: use full_position instead of ad-hoc struct querier: rely on compactor for position tracking mutation_compactor: add current_full_position() convenience accessor mutation_compactor: s/_last_clustering_pos/_last_pos/ mutation_compactor: add state accessor to compact_mutation introduce full_position idl: move position_in_partition into own header service/paging: use position_in_partition instead of clustering_key for last row alternator/serialization: extract value object parsing logic service/pagers/query_pagers.cc: fix indentation position_in_partition: add to_string(partition_region) and parse_partition_region() mutation_fragment.hh: move operator<<(partition_region) to position_in_partition.hh	2022-06-27 12:23:21 +03:00
Pavel Emelyanov' via ScyllaDB development	a78af050fd	cql: Constify select_statement restrictions It is in fact immutable (both the pointer and the object it points to), so is the pointer copy returned by get_restrictions() method, so are those propagated to filtering stuff. tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1028 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20220624083351.24970-1-xemul@scylladb.com>	2022-06-24 12:27:36 +03:00
Avi Kivity	dab56b82fa	Merge 'Per-partition rate limiting' from Piotr Dulikowski Due to its sharded and token-based architecture, Scylla works best when the user workload is more or less uniformly balanced across all nodes and shards. However, a common case when this assumption is broken is the "hot partition" - suddenly, a single partition starts getting a lot more reads and writes in comparison to other partitions. Because the shards owning the partition have only a fraction of the total cluster capacity, this quickly causes latency problems for other partitions within the same shard and vnode. This PR introduces per-partition rate limiting feature. Now, users can choose to apply per-partition limits to their tables of choice using a schema extension: ``` ALTER TABLE ks.tbl WITH per_partition_rate_limit = { 'max_writes_per_second': 100, 'max_reads_per_second': 200 }; ``` Reads and writes which are detected to go over that quota are rejected to the client using a new RATE_LIMIT_ERROR CQL error code - existing error codes didn't really fit well with the rate limit error, so a new error code is added. This code is implemented as a part of a CQL protocol extension and returned to clients only if they requested the extension - if not, the existing CONFIG_ERROR will be used instead. Limits are tracked and enforced on the replica side. If a write fails with some replicas reporting rate limit being reached, the rate limit error is propagated to the client. Additionally, the following optimization is implemented: if the coordinator shard/node is also a replica, we account the operation into the rate limit early and return an error in case of exceeding the rate limit before sending any messages to other replicas at all. The PR covers regular, non-batch writes and single-partition reads. LWT and counters are not covered here. 
Results of `perf_simple_query --smp=1 --operations-per-shard=1000000`: - Write mode: ``` `8f690fdd47` (PR base): 129644.11 tps ( 56.2 allocs/op, 13.2 tasks/op, 49785 insns/op) This PR: 125564.01 tps ( 56.2 allocs/op, 13.2 tasks/op, 49825 insns/op) ``` - Read mode: ``` `8f690fdd47` (PR base): 150026.63 tps ( 63.1 allocs/op, 12.1 tasks/op, 42806 insns/op) This PR: 151043.00 tps ( 63.1 allocs/op, 12.1 tasks/op, 43075 insns/op) ``` Manual upgrade test: - Start 3 nodes, 4 shards each, Scylla version `8f690fdd47` - Create a keyspace with scylla-bench, RF=3 - Start reading and writing with scylla-bench with CL=QUORUM - Manually upgrade nodes one by one to the version from this PR - Upgrade succeeded, apart from a small number of operations which failed when each node was being put down all reads/writes succeeded - Successfully altered the scylla-bench table to have a read and write limit and those limits were enforced as expected Fixes: #4703 Closes #9810 * github.com:scylladb/scylla: storage_proxy: metrics for per-partition rate limiting of reads storage_proxy: metrics for per-partition rate limiting of writes database: add stats for per partition rate limiting tests: add per_partition_rate_limit_test config: add add_per_partition_rate_limit_extension function for testing cf_prop_defs: guard per-partition rate limit with a feature query-request: add allow_limit flag storage_proxy: add allow rate limit flag to get_read_executor storage_proxy: resultize return type of get_read_executor storage_proxy: add per partition rate limit info to read RPC storage_proxy: add per partition rate limit info to query_result_local(_digest) storage_proxy: add allow rate limit flag to mutate/mutate_result storage_proxy: add allow rate limit flag to mutate_internal storage_proxy: add allow rate limit flag to mutate_begin storage_proxy: choose the right per partition rate limit info in write handler storage_proxy: resultize return types of write handler creation path storage_proxy: add per 
partition rate limit to mutation_holders storage_proxy: add per partition rate limit info to write RPC storage_proxy: add per partition rate limit info to mutate_locally database: apply per-partition rate limiting for reads/writes database: move and rename: classify_query -> classify_request schema: add per_partition_rate_limit schema extension db: add rate_limiter storage_proxy: propagate rate_limit_exception through read RPC gms: add TYPED_ERRORS_IN_READ_RPC cluster feature storage_proxy: pass rate_limit_exception through write RPC replica: add rate_limit_exception and a simple serialization framework docs: design doc for per-partition rate limiting transport: add rate_limit_error	2022-06-24 01:32:13 +03:00
Kamil Braun	a3d2f54806	service/raft: raft_group_registry: add assertions when fetching servers for groups Better than dereferencing null-pointers or null-opts.	2022-06-23 16:14:41 +02:00
Kamil Braun	bb58ee0b2e	service/raft: raft_group_registry: remove `_raft_support_listener` It did nothing. It will be readded in `raft_group0` and it will do something, stay tuned. With this we can remove the `feature_service` reference from `raft_group_registry`.	2022-06-23 16:14:41 +02:00

2862 Commits