Commit Graph

2862 Commits

Author SHA1 Message Date
Amnon Heiman
72414b613b Split the timed_rate_moving_average into data and timer
This patch split the timed_rate_moving_average functionality into two, a
data class: rates_moving_average, and a wrapper class
timed_rate_moving_average that uses a timer to update the rates
periodically.

To make the transition as simple as possible timed_rate_moving_average,
takes the original API.

A new helper class meter_timer was introduced to handle the timer update
functionality.

This change required minimal code adaptation in some other parts of the
code.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2022-07-26 15:59:33 +03:00
Avi Kivity
29c28dcb0c Merge 'Unstall get_range_to_address_map' from Benny Halevy
Prevent stalls in this path as seen in performance testing.

Also, add a respective rest_api test.

Fixes #11114

Closes #11115

* github.com:scylladb/scylla:
  storage_service: reserve space in get_range_to_address_map and friends
  storage_service: coroutinize get_range_to_address_map and friends
  storage_service: pass replication map to get_range_to_address_map and friends
  storage_service: get_range_to_address_map: move selection of arbitrary ks to api layer
  test: rest_api: test range_to_endpoint_map and describe_ring
2022-07-25 18:06:28 +03:00
Piotr Sarna
c195ce1b82 query: allow merging non-empty forward_result with an empty one
Merging empty results was already allowed, but in one way only:

empty.merge(nonempty, r); // was permitted
nonempty.merge(empty, r); // not permitted

With this commit, both methods are permitted.
In order to remove copying, the other result is now taken
by rvalue reference, with all call sites being updated
accordingly.

Fixes #10446
Fixes #10174

Closes #11064
2022-07-25 18:06:28 +03:00
Benny Halevy
bc5f6cf45d storage_service: reserve space in get_range_to_address_map and friends
To reduce the chance of reallocation.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-25 18:06:28 +03:00
Benny Halevy
5eb31eff64 storage_service: coroutinize get_range_to_address_map and friends
And add calls to maybe_yield to prevent stalls in this path
as seen in performance testing.

Also, add a respective rest_api test.

Fixes #11114

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-25 18:06:28 +03:00
Tomasz Grabiec
76d20aeb96 Merge 'Refactor group 0 operations (joining, leaving, removing).' from Kamil Braun
A series of refactors to the `raft_group0` service.
Read the commits in topological order for best experience.

This PR is more or less equivalent to the second-to-last commit of PR https://github.com/scylladb/scylla/pull/10835, I split it so we could have an easier time reviewing and pushing it through.

Closes #11024

* github.com:scylladb/scylla:
  service: storage_service: additional assertions and comments
  service/raft: raft_group0: additional logging, assertions, comments
  service/raft: raft_group0: pass seed list and `as_voter` flag to `join_group0`
  service/raft: raft_group0: rewrite `remove_from_group0`
  service/raft: raft_group0: rewrite `leave_group0`
  service/raft: raft_group0: split `leave_group0` from `remove_from_group0`
  service/raft: raft_group0: introduce `setup_group0`
  service/raft: raft_group0: introduce `load_my_addr`
  service/raft: raft_group0: make some calls abortable
  service/raft: raft_group0: remove some temporary variables
  service/raft: raft_group0: refactor `do_discover_group0`.
  service/raft: raft_group0: rename `create_server_for_group` to `create_server_for_group0`
  service/raft: raft_group0: extract `start_server_for_group0` function
  service/raft: raft_group0: create a private section
  service/raft: discovery: `seeds` may contain `self`
2022-07-25 18:06:28 +03:00
Benny Halevy
3d62a1592f storage_service: pass replication map to get_range_to_address_map and friends
Before they are made asynchronous in the next patch,
so they work on a coherent snapshot of the token_metadata and
replication map as their caller.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-25 18:06:28 +03:00
Petr Gusev
52142bb8b3 raft_group_registry, is_alive for non-existent server_id
We could yield between updating the list of servers in raft/fsm
and updating the raft_address_map, e.g. in case of a set_configuration.
If tick_leader happens before the raft_address_map is updated,
is_alive will be called with server_id that is not in the map yet.

Fix: scylladb/scylla-dtest#2753

Closes #11111
2022-07-25 18:06:28 +03:00
Benny Halevy
0b474866a3 storage_service: get_range_to_address_map: move selection of arbitrary ks to api layer
It is only needed for the "storage_service/describe_ring" api
and service/storage_service shouldn't bother with it.
It's an api sugar coating.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-25 18:06:28 +03:00
Gleb Natapov
f1f1176963 service: raft: do not allow downgrading non expiring entry to expiring one in raft_address_map
Expiring entries are added when a message is received from an unknown
host. If the host is later added to the raft configuration they become
non expiring. After that they can only be removed when the host is
dropped from the configuration, but they should never become expiring
again.

Refs #10826
2022-07-21 17:40:04 +02:00
Asias He
39db15d2cb misc_services: Fix cache hitrate update
This patch avoids unncessary CACHE_HITRATES updates through gossip.

After this patch:

Publish CACHE_HITRATES in case:

- We haven't published it at all
- The diff is bigger than 1% and we haven't published in the last 5 seconds
- The diff is really big 10%

Note: A peer node can know the cache hitrate through read_data
read_mutation_data and read_digest RPC verbs which have cache_temperature in
the response. So there is no need to update CACHE_HITRATES through gossip in
high frequency.

We do the recalculation faster if the diff is bigger than 0.01. It is useful to
do the calculation even if we do not publish the CACHE_HITRATES though gossip,
since the recalculation will call the table->set_global_cache_hit_rate to set
the hitrate.

Fixes #5971

Closes #11079
2022-07-21 11:31:30 +03:00
Kamil Braun
4e42aeb0df service: storage_service: additional assertions and comments 2022-07-20 19:39:29 +02:00
Kamil Braun
25bb8384af service/raft: raft_group0: additional logging, assertions, comments
Move some rare logs from TRACE to INFO level.
Add some assertions.
Write some more comments, including FIXMEs and TODOs.

Remove unnecessary `_shutdown_gate.hold()` (this is not a background
task).
2022-07-20 19:39:29 +02:00
Kamil Braun
c9f1ec1268 service/raft: raft_group0: pass seed list and as_voter flag to join_group0
Group 0 discovery would internally fetch the seed list from gossiper.
Gossiper would return the seed list from conf/scylla.yaml. This seed
list is proper for the bootstrapping scenario - we specify the initial
contact points for a node that joins a cluster.

We'll have to use a different list of seeds for group 0 discovery for
the upgrade scenario. Prepare for that by taking the seed list as a
parameter.

In the bootstrap scenario we'll pass the seed list down from
`storage_service::join_cluster`.

Additionally, `join_group0` now takes an `as_voter` flag, which is
`false` in the bootstrap scenario (we initially join as a non-voter) but
will be `true` in the upgrade scenario.
2022-07-20 19:39:29 +02:00
Kamil Braun
684d8171ca service/raft: raft_group0: rewrite remove_from_group0
See previous commit. `remove_from_group0` had a similar problem as
`leave_group0`: it would handle the case where `raft_group0::_group0`
variant was not `raft::group_id` (i.e. we haven't joined group 0), but
RAFT local feature was enabled - i.e. the yet-unimplemented upgrade case
- by running discovery and calling `send_group0_modify_config`.

Instead, if we see that we've joined group 0 before, assume that we're
still a member and simply use the Raft `modify_config` API to remove
another server. If we're not a member it means we either decommissioned
or were removed by someone else; then we have no business trying to
remove others. There's also the unimplemented upgrade case but that will
come in another pull request.

Finally, add some logic for handling an edge case: suppose we joined
group 0 recently and we still didn't fully update our RPC address map
(it's being updated asynchronously by Raft's io_fiber). Thus we may fail
to find a member of group 0 in the address map. To handle this, ensure
we're up-to-date by performing a Raft read barrier.

State some assumptions in a comment.

Add a TODO for handling failures.

Remove unnecessary `_shutdown_gate.hold()` (this is not a background
task).
2022-07-20 19:39:29 +02:00
Kamil Braun
eeeef0bc50 service/raft: raft_group0: rewrite leave_group0
One of the following cases is true:
1. RAFT local feature is disabled. Then we don't do anything related to
  group 0.
2. RAFT local feature is enabled and when we bootstrapped, we joined
  group 0. Then `raft_group0::_group0` variant holds the
  `raft::group_id` alternative.
3. RAFT local feature is enabled and when we bootstrapped we didn't join
  group 0. This means the RAFT local feature was disabled when we
  bootstrapped and we're in the (unimplemented yet) upgrade scenario.
  `raft_group0::_group0` variant holds the `std::monostate` alternative.

The problem with the previous implementation was that it checked for the
conditions of the third case above - that RAFT local feature is enabled
but `_group0` does not hold `raft::group_id` - and if those conditions
were true, it executed some logic that didn't really make sense: it ran
the discovery algorithm and called `send_group0_modify_config` RPC.

In this rewrite I state some assumptions that `leave_group0` makes:
- we've finished the startup procedure.
- we're being run during decommission - after the node entered LEFT
  status.

In the new implementation, if `_group0` does not hold `raft::group_id`
(checked by the internal `joined_group0()` helper), we simply return.
This is the yet-unimplemented upgrade case left for a follow-up PR.

Otherwise we fetch our Raft server ID (at this point it must be present
- otherwise it's a fatal error) and simply call `modify_config` from the
`raft::server` API.

Remove unnecessary call to `_shutdown_gate.hold()` (this is not a
background task).
2022-07-20 19:39:29 +02:00
Kamil Braun
75608bcd2f service/raft: raft_group0: split leave_group0 from remove_from_group0
`leave_group0` was responsible for both removing a different node from
group 0 and removing ourselves (leaving) group 0. The two scenarios are
a bit different and the handling will be rewritten in following commits.

Split `leave_group0` into two functions. Remove the incorrect comment
about idempotency - saying that the procedure is idempotent is an
oversimplification, one could argue it's incorrect since the second call
simply hangs, at least in the case of leaving group 0; following commits
will state what's happening more precisely.

Add some additional logging and assertions where the two functions are
called in `storage_service`.
2022-07-20 19:39:29 +02:00
Kamil Braun
ee0219dfe3 service/raft: raft_group0: introduce setup_group0
Contains all logic for deciding to join (or not join) group 0.

Prepare for the case where we don't want to join group 0 immediately on
startup - the upgrade scenario (will be implemented in a follow-up).

Move the group 0 setup step earlier in `storage_service::join_cluster`.

`join_group0()` is now a private member of `raft_group0`. Some more
comments were written.
2022-07-20 19:39:29 +02:00
Kamil Braun
4b0db59671 service/raft: raft_group0: introduce load_my_addr
Compared to `load_or_create_my_addr` this function assumes that
the address is already present on disk; if not, it's a fatal error.

Use it in places where it would indeed be a fatal error
if the address was missing.
2022-07-20 19:39:29 +02:00
Kamil Braun
f0f9aa5c7d service/raft: raft_group0: make some calls abortable
There are some calls to `modify_config` which should react to aborts
(e.g. when we shutdown Scylla).

There are also calls to `send_group0_modify_config` which should
probably also react to aborts, but the functions don't take
an abort_source parameter. This is fixable but I left TODOs for now.
2022-07-20 19:39:29 +02:00
Kamil Braun
ab8c3c6742 service/raft: raft_group0: remove some temporary variables
Make the code a bit shorter.
2022-07-20 19:39:29 +02:00
Kamil Braun
b193ea8ec0 service/raft: raft_group0: refactor do_discover_group0.
The function no longer accesses the `_group0` variant directly, instead
it is made a member of `service::persistent_discovery`; the caller
guarantees that `persistent_discovery` is not destroyed before the
function finishes.

The function is now named `run`. A short comment was written at the
declaration site.

Make some members of `persistent_discovery` private, as they are only
used by `run`.

Simplify `struct tracker`, store the discovery output separately
(`struct tracker` is now responsible for a single thing).

Enclose the `parallel_for_each` over requests in a common coroutine
which keeps alive all the necessary things for the loop body and
performs the last step which was previously inside a `then`.
2022-07-20 19:39:29 +02:00
Kamil Braun
6d9d493e2a service/raft: raft_group0: rename create_server_for_group to create_server_for_group0 2022-07-20 19:39:28 +02:00
Kamil Braun
54d9219257 service/raft: raft_group0: extract start_server_for_group0 function
Extract part of the code from `join_group0`. Add some comments.
This part will be reused.
2022-07-20 19:38:53 +02:00
Kamil Braun
dca1ce52ed service/raft: raft_group0: create a private section
Move member functions and fields used internally by the `raft_group0`
class into a private section.

Write some comments.
2022-07-20 19:38:53 +02:00
Kamil Braun
d28170b1a5 service/raft: discovery: seeds may contain self
The set of seeds passed to the discovery algorithm may contain `self`.
The implementation will filter the `self` out (it calls `step(seeds)`;
`step` iterates over the given list of peers and ignores `_self`).

Specify this at the `discovery` constructor declaration site.

Simplify the code constructing `persistent_discovery` in
`raft_group0::discover_group0` using this assumption.
2022-07-20 19:38:53 +02:00
Botond Dénes
014c5b56a3 query-result: move last_pos up to query::result
query_result was the wrong place to put last position into. It is only
included in data-responses, but not on digest-responses. If we want to
support empty pages from replicas, both data and digest responses have
to include the last position. So hoist up the last position to the
parent structure: query::result. This is a breaking change inter-node
ABI wise, but it is fine: the current code wasn't released yet.

Closes #11072
2022-07-20 13:28:09 +03:00
Tomasz Grabiec
04f9a150be Merge 'raft: split can_vote field form server_address to separate struct' from Kamil Braun
Whether a server can vote in a Raft configuration is not part of the
address. `server_address` was used in many context where `can_vote` is
irrelevant.

Split the struct: `server_address` now contains only `id` and
`server_info` as it did before `can_vote` was introduced. Instead we
have a `config_member` struct that contains a `server_address` and the
`can_vote` field.

Also remove an "unsafe" constructor from `server_address` where `id` was
provided but `server_info` was not. The constructor was used for tests
where `server_info` is irrelevant, but it's important not to forget
about the info in production code. Replace the constructor with helper
functions which specify in comments that they are supposed to be used in
tests or in contexts where `info` doesn't matter (e.g. when checking
presence in an `unordered_set`, where the equality operator and hash
operate only on the `id`).

Closes #11047

* github.com:scylladb/scylla:
  raft: fsm: fix `entry_size` calculation for config entries
  raft: split `can_vote` field from `server_address` to separate struct
  serializer_impl: generalize (de)serialization of `unordered_set`
  to_string: generalize `operator<<` for `unordered_set`
2022-07-20 12:20:52 +02:00
Asias He
482ee369d0 storage_service: Increase watchdog_interval for node ops
The node operations using node_ops_cmd have the following procedure:

1) Send node_ops_cmd::replace_prepare to all nodes
2) Send node_ops_cmd::replace_heartbeat to all nodes

In a large cluster 1) might take a long time to finish, as a result when
the node starts to perform 2), the heartbeat timer on the peer nodes which
is 30s might have already timed out. This fails the whole node
opeartions.

We have patches to make 1) more efficient and faster.

https://github.com/scylladb/scylla/pull/10850
https://github.com/scylladb/scylla/pull/10822

In addition to that, this patch increases the heartbeat timeout to reduce
the false positive of timeout.

Refs #10337
Refs #11078

Closes #11081
2022-07-20 12:56:17 +03:00
Kamil Braun
daf9c53bb8 raft: split can_vote field from server_address to separate struct
Whether a server can vote in a Raft configuration is not part of the
address. `server_address` was used in many context where `can_vote` is
irrelevant.

Split the struct: `server_address` now contains only `id` and
`server_info` as it did before `can_vote` was introduced. Instead we
have a `config_member` struct that contains a `server_address` and the
`can_vote` field.

Also remove an "unsafe" constructor from `server_address` where `id` was
provided but `server_info` was not. The constructor was used for tests
where `server_info` is irrelevant, but it's important not to forget
about the info in production code. The constructor was used for two
purposes:
- Invoking set operations such as `contains`. To solve this we use C++20
  transparent hash and comparator functions, which allow invoking
  `contains` and similar functions by providing a different key type (in
  this case `raft::server_id` in set of addresses, for example).
- constructing addresses without `info`s in tests. For this we provide
  helper functions in the test helpers module and use them.
2022-07-18 18:22:10 +02:00
Jadw1
29a0be75da forward_service: support UDA and native aggregate parallelization
Enables parallelization of UDA and native aggregates. The way the
query is parallelized is the same as in #9209. Separate reduction
type for `COUNT(*)` is left for compatibility reason.
2022-07-18 15:25:41 +02:00
Jadw1
d13f347621 DB: Add scylla_aggregates system table
Saving information about UDA's reduce function to `scylla_aggregates`
table and distributing it across cluster.
2022-07-18 15:25:37 +02:00
Benny Halevy
dc93564247 storage_proxy: abstract_read_resolver: swallow gate_closed exception
Like other errors triggered on shutdown,
this one is triggered by #8995.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11029
2022-07-14 09:26:34 +03:00
Avi Kivity
957bf48eb2 Merge 'Don't throw exceptions on the replica side when handling single partition reads and writes' from Piotr Dulikowski
This PR gets rid of exception throws/rethrows on the replica side for writes and single-partition reads. This goal is achieved without using `boost::outcome` but rather by replacing the parts of the code which throw with appropriate seastar idioms and by introducing two helper functions:

1.`try_catch` allows to inspect the type and value behind an `std::exception_ptr`. When libstdc++ is used, this function does not need to throw the exception and avoids the very costly unwind process. This based on the "How to catch an exception_ptr without even try-ing" proposal mentioned in https://github.com/scylladb/scylla/issues/10260.

This function allows to replace the current `try..catch` chains which inspect the exception type and account it in the metrics.

Example:

```c++
// Before
try {
    std::rethrow_exception(eptr);
} catch (std::runtime_exception& ex) {
    // 1
} catch (...) {
    // 2
}

// After
if (auto* ex = try_catch<std::runtime_exception>(eptr)) {
    // 1
} else {
    // 2
}
```

2. `make_nested_exception_ptr` which is meant to be a replacement for `std::throw_with_nested`. Unlike the original function, it does not require an exception being currently thrown and does not throw itself - instead, it takes the nested exception as an `std::exception_ptr` and produces another `std::exception_ptr` itself.

Apart from the above, seastar idioms such as `make_exception_future`, `co_await as_future`, `co_return coroutine::exception()` are used to propagate exceptions without throwing. This brings the number of exception throws to zero for single partition reads and writes (tested with scylla-bench, --mode=read and --mode=write).

Results from `perf_simple_query`:

```
Before (719724e4df):
  Writes:
    Normal:
      127841.40 tps ( 56.2 allocs/op,  13.2 tasks/op,   50042 insns/op,        0 errors)
    Timeouts:
      94770.81 tps ( 53.1 allocs/op,   5.1 tasks/op,   78678 insns/op,  1000000 errors)
  Reads:
    Normal:
      138902.31 tps ( 65.1 allocs/op,  12.1 tasks/op,   43106 insns/op,        0 errors)
    Timeouts:
      62447.01 tps ( 49.7 allocs/op,  12.1 tasks/op,  135984 insns/op,   936846 errors)

After (d8ac4c02bfb7786dc9ed30d2db3b99df09bf448f):
  Writes:
    Normal:
      127359.12 tps ( 56.2 allocs/op,  13.2 tasks/op,   49782 insns/op,        0 errors)
    Timeouts:
      163068.38 tps ( 52.1 allocs/op,   5.1 tasks/op,   40615 insns/op,  1000000 errors)
  Reads:
    Normal:
      151221.15 tps ( 65.1 allocs/op,  12.1 tasks/op,   43028 insns/op,        0 errors)
    Timeouts:
      192094.11 tps ( 41.2 allocs/op,  12.1 tasks/op,   33403 insns/op,   960604 errors)
```

Closes #10368

* github.com:scylladb/scylla:
  database: avoid rethrows when handling exceptions from commitlog
  database: convert throw_commitlog_add_error to use make_nested_exception_ptr
  utils: add make_nested_exception_ptr
  storage_proxy: don't rethrow when inspecting replica exceptions on write path
  database: don't rethrow rate_limit_exception
  storage_proxy: don't rethrow the exception in abstract_read_resolver::error
  utils/exceptions.cc: don't rethrow in is_timeout_exception
  utils/exceptions: add try_catch
  utils: add abi/eh_ia64.hh
  storage_proxy: don't rethrow exceptions from replicas when accounting read stats
  message: get rid of throws in send_message{,_timeout,_abortable}
  database/{query,query_mutations}: don't rethrow read semaphore exceptions
2022-07-11 14:01:41 +03:00
Nadav Har'El
cc69177dcc config: fix printing of experimental feature list
Recently we noticed a regression where with certain versions of the fmt
library,

   SELECT value FROM system.config WHERE name = 'experimental_features'

returns string numbers, like "5", instead of feature names like "raft".

It turns out that the fmt library keep changing their overload resolution
order when there are several ways to print something. For enum_option<T> we
happen to have to conflicting ways to print it:
  1. We have an explicit operator<<.
  2. We have an *implicit* convertor to the type held by T.

We were hoping that the operator<< always wins. But in fmt 8.1, there is
special logic that if the type is convertable to an int, this is used
before operator<<()! For experimental_features_t, the type held in it was
an old-style enum, so it is indeed convertible to int.

The solution I used in this patch is to replace the old-style enum
in experimental_features_t by the newer and more recommended "enum class",
which does not have an implicit conversion to int.

I could have fixed it in other ways, but it wouldn't have been much
prettier. For example, dropping the implicit convertor would require
us to change a bunch of switch() statements over enum_option (and
not just experimental_features_t, but other types of enum_option).

Going forward, all uses of enum_option should use "enum class", not
"enum". tri_mode_restriction_t was already using an enum class, and
now so does experimental_features_t. I changed the examples in the
comments to also use "enum class" instead of enum.

This patch also adds to the existing experimental_features test a
check that the feature names are words that are not numbers.

Fixes #11003.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11004
2022-07-11 09:17:30 +02:00
Benny Halevy
acae3cc223 treewide: stop use of deprecated coroutine::make_exception
Convert most use sites from `co_return coroutine::make_exception`
to `co_await coroutine::return_exception{,_ptr}` where possible.

In cases this is done in a catch clause, convert to
`co_return coroutine::exception`, generating an exception_ptr
if needed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10972
2022-07-07 15:02:16 +03:00
Piotr Dulikowski
2008db58c4 storage_proxy: don't rethrow when inspecting replica exceptions on write path
Now, storage_proxy::send_to_live_endpoints doesn't rethrow exceptions
received from the replica logic when inspecting them.
2022-07-05 16:41:09 +02:00
Piotr Dulikowski
ffb95c4840 storage_proxy: don't rethrow the exception in abstract_read_resolver::error
Now, the abstract_read_resolver::error uses the utils::try_catch utility
to analyse the error received from replica instead of rethrowing it.
2022-07-05 16:41:09 +02:00
Avi Kivity
74b02b9719 Merge 'storage_service: track restore_replica_count' from Benny Halevy
This mini-series adds an _async_gate to storage_service that is closed on stop()
and it performs restore_replica_count under this gate so it can be orderly waited on in stop()

Fixes #10672

Closes #10922

* github.com:scylladb/scylla:
  storage_service: handle_state_removing: restore_replica_count under _async_gate
  storage_service: add async_gate for background work
2022-07-05 13:18:59 +03:00
Piotr Dulikowski
491cc2a8df storage_proxy: don't rethrow exceptions from replicas when accounting read stats
Now, make_{data,mutation_data,digest}_requests don't rethrow the
exception received from replicas when increasing the error count metric.
2022-07-04 19:27:06 +02:00
Pavel Emelyanov
ea820e13b3 database: Move flushing logging
Now it happens before calling database::drain() but drain is not only
flushing it does lots of other things. More elaborated logging is better

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-04 13:42:45 +03:00
Pavel Emelyanov
b5c4553a66 storage_service: Sanitize stop_transport()
It generates ignored future that can be avoided if using forwarding to
shared_future<>'s promise

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-07-01 17:17:53 +03:00
Pavel Emelyanov
85033ea6ae Merge 'A bunch of refactors related to Raft group 0' from Kamil Braun
The commits here were extracted from PR https://github.com/scylladb/scylla/pull/10835 which implements upgrade procedure for Raft group 0.

They are mostly refactors which don't affect the behavior of the system, except one: the commit 4d439a16b3 causes all schema changes to be bounced to shard 0. Previously, they would only be bounced when the local Raft feature was enabled. I do that because:
1. eventually, we want this to be the default behavior
2. in the upgrade PR I remove the `is_raft_enabled()` function - the function was basically created with the mindset "Raft is either enabled or not" - which was right when we didn't support upgrade, but will be incorrect when we introduce intermediate states (when we upgrade from non-raft-based to raft-based operations); the upgrade PR introduces another mechanism to dispatch based on the upgrade state, but for the case of bouncing to shard 0, dispatching is simply not necessary.

Closes #10864

* github.com:scylladb/scylla:
  service/raft: raft_group_registry: add assertions when fetching servers for groups
  service/raft: raft_group_registry: remove `_raft_support_listener`
  service/raft: raft_group0: log adding/removing servers to/from group 0 RPC map
  service/raft: raft_group0: move group 0 RPC handlers from `storage_service`
  service/raft: messaging: extract raft_addr/inet_addr conversion functions
  service: storage_service: initialize `raft_group0` in `main` and pass a reference to `join_cluster`
  treewide: remove unnecessary `migration_manager::is_raft_enabled()` calls
  test/boost: memtable_test: perform schema operations on shard 0
  test/boost: cdc_test: remove test_cdc_across_shards
  message: rename `send_message_abortable` to `send_message_cancellable`
  message: change parameter order in `send_message_oneway_timeout`
2022-06-29 16:51:54 +03:00
Benny Halevy
cb0b728ed1 storage_service: handle_state_removing: restore_replica_count under _async_gate
Track the background restore_replica_count fiber
so it be awaited on in stop() by closing the
_async_gate.

Fixes #10672

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-29 10:51:26 +03:00
Benny Halevy
1b1c02b243 storage_service: add async_gate for background work
To be used for tracking restore_replica_count
and waiting for it on stop().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-06-29 10:47:49 +03:00
Avi Kivity
3131cbea62 Merge 'query: allow replica to provide arbitrary continue position' from Botond Dénes
Currently, we use the last row in the query result set as the position where the query is continued from on the next page. Since only live rows make it into query result set, this mandates the query to be stopped on a live row on the replica, lest any dead rows or tombstones processed after the live rows, would have to be re-processed on the next page (and the saved reader would have to be thrown away due to position mismatch). This requirement of having to stop on a live row is problematic with datasets which have lots of dead rows or tombstones, especially if these form a prefix. In the extreme case, a query can time out before it can process a single live row and the data-set becomes effectively unreadable until compaction gets rid of the tombstones.
This series prepares the way for the solution: it allows the replica to determine what position the query should continue from on the next page. This position can be that of a dead row, if the query stopped on a dead row. For now, the replica supplies the same position that would have been obtained with looking at the last row in the result set, this series merely introduces the infrastructure for transferring a position together with the query result, and it prepares the paging logic to make use of this position. If the coordinator is not prepared for the new field, it will simply fall-back to the old way of looking at the last row in the result set. As I said for now this is still the same as the content of the new field so there is no problem in mixed clusters.

Refs: https://github.com/scylladb/scylla/issues/3672
Refs: https://github.com/scylladb/scylla/issues/7689
Refs: https://github.com/scylladb/scylla/issues/7933

Tests: manual upgrade test.
I wrote a data set with:
```
./scylla-bench -mode=write -workload=sequential -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -clustering-row-size=8096 -partition-count=1000
```
This creates large, 80MB partitions, which should fill many pages if read in full. Then I started a read workload:
```
./scylla-bench -mode=read -workload=uniform -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -duration=10m -rows-per-request=9000 -page-size=100
```
I confirmed that paging is happening as expected, then upgraded the nodes one-by-one to this PR (while the read-load was ongoing). I observed no read errors or any other errors in the logs.

Closes #10829

* github.com:scylladb/scylla:
  query: have replica provide the last position
  idl/query: add last_position to query_result
  mutlishard_mutation_query: propagate compaction state to result builder
  multishard_mutation_query: defer creating result builder until needed
  querier: use full_position instead of ad-hoc struct
  querier: rely on compactor for position tracking
  mutation_compactor: add current_full_position() convenience accessor
  mutation_compactor: s/_last_clustering_pos/_last_pos/
  mutation_compactor: add state accessor to compact_mutation
  introduce full_position
  idl: move position_in_partition into own header
  service/paging: use position_in_partition instead of clustering_key for last row
  alternator/serialization: extract value object parsing logic
  service/pagers/query_pagers.cc: fix indentation
  position_in_partition: add to_string(partition_region) and parse_partition_region()
  mutation_fragment.hh: move operator<<(partition_region) to position_in_partition.hh
2022-06-27 12:23:21 +03:00
Pavel Emelyanov' via ScyllaDB development
a78af050fd cql: Constify select_statement restrictions
It is in fact immutable (both the pointer and the object it points
to), so is the pointer copy returned by get_restrictions() method,
so are those propagated to filtering stuff.

tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1028

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220624083351.24970-1-xemul@scylladb.com>
2022-06-24 12:27:36 +03:00
Avi Kivity
dab56b82fa Merge 'Per-partition rate limiting' from Piotr Dulikowski
Due to its sharded and token-based architecture, Scylla works best when the user workload is more or less uniformly balanced across all nodes and shards. However, a common case when this assumption is broken is the "hot partition" - suddenly, a single partition starts getting a lot more reads and writes in comparison to other partitions. Because the shards owning the partition have only a fraction of the total cluster capacity, this quickly causes latency problems for other partitions within the same shard and vnode.

This PR introduces per-partition rate limiting feature. Now, users can choose to apply per-partition limits to their tables of choice using a schema extension:

```
ALTER TABLE ks.tbl
WITH per_partition_rate_limit = {
	'max_writes_per_second': 100,
	'max_reads_per_second': 200
};
```

Reads and writes which are detected to go over that quota are rejected to the client using a new RATE_LIMIT_ERROR CQL error code - existing error codes didn't really fit well with the rate limit error, so a new error code is added. This code is implemented as a part of a CQL protocol extension and returned to clients only if they requested the extension - if not, the existing CONFIG_ERROR will be used instead.

Limits are tracked and enforced on the replica side. If a write fails with some replicas reporting rate limit being reached, the rate limit error is propagated to the client. Additionally, the following optimization is implemented: if the coordinator shard/node is also a replica, we account the operation into the rate limit early and return an error in case of exceeding the rate limit before sending any messages to other replicas at all.

The PR covers regular, non-batch writes and single-partition reads. LWT and counters are not covered here.

Results of `perf_simple_query --smp=1 --operations-per-shard=1000000`:

- Write mode:
  ```
  8f690fdd47 (PR base):
  129644.11 tps ( 56.2 allocs/op,  13.2 tasks/op,   49785 insns/op)
  This PR:
  125564.01 tps ( 56.2 allocs/op,  13.2 tasks/op,   49825 insns/op)
  ```
- Read mode:
  ```
  8f690fdd47 (PR base):
  150026.63 tps ( 63.1 allocs/op,  12.1 tasks/op,   42806 insns/op)
  This PR:
  151043.00 tps ( 63.1 allocs/op,  12.1 tasks/op,   43075 insns/op)
  ```

Manual upgrade test:
- Start 3 nodes, 4 shards each, Scylla version 8f690fdd47
- Create a keyspace with scylla-bench, RF=3
- Start reading and writing with scylla-bench with CL=QUORUM
- Manually upgrade nodes one by one to the version from this PR
- Upgrade succeeded, apart from a small number of operations which failed when each node was being put down all reads/writes succeeded
- Successfully altered the scylla-bench table to have a read and write limit and those limits were enforced as expected

Fixes: #4703

Closes #9810

* github.com:scylladb/scylla:
  storage_proxy: metrics for per-partition rate limiting of reads
  storage_proxy: metrics for per-partition rate limiting of writes
  database: add stats for per partition rate limiting
  tests: add per_partition_rate_limit_test
  config: add add_per_partition_rate_limit_extension function for testing
  cf_prop_defs: guard per-partition rate limit with a feature
  query-request: add allow_limit flag
  storage_proxy: add allow rate limit flag to get_read_executor
  storage_proxy: resultize return type of get_read_executor
  storage_proxy: add per partition rate limit info to read RPC
  storage_proxy: add per partition rate limit info to query_result_local(_digest)
  storage_proxy: add allow rate limit flag to mutate/mutate_result
  storage_proxy: add allow rate limit flag to mutate_internal
  storage_proxy: add allow rate limit flag to mutate_begin
  storage_proxy: choose the right per partition rate limit info in write handler
  storage_proxy: resultize return types of write handler creation path
  storage_proxy: add per partition rate limit to mutation_holders
  storage_proxy: add per partition rate limit info to write RPC
  storage_proxy: add per partition rate limit info to mutate_locally
  database: apply per-partition rate limiting for reads/writes
  database: move and rename: classify_query -> classify_request
  schema: add per_partition_rate_limit schema extension
  db: add rate_limiter
  storage_proxy: propagate rate_limit_exception through read RPC
  gms: add TYPED_ERRORS_IN_READ_RPC cluster feature
  storage_proxy: pass rate_limit_exception through write RPC
  replica: add rate_limit_exception and a simple serialization framework
  docs: design doc for per-partition rate limiting
  transport: add rate_limit_error
2022-06-24 01:32:13 +03:00
Kamil Braun
a3d2f54806 service/raft: raft_group_registry: add assertions when fetching servers for groups
Better than dereferencing null-pointers or null-opts.
2022-06-23 16:14:41 +02:00
Kamil Braun
bb58ee0b2e service/raft: raft_group_registry: remove _raft_support_listener
It did nothing.
It will be readded in `raft_group0` and it will do something, stay
tuned.

With this we can remove the `feature_service` reference from
`raft_group_registry`.
2022-06-23 16:14:41 +02:00