Commit Graph

774 Commits

Author SHA1 Message Date
Jadw1
2c46222e31 db,gms: Add SCYLLA_AGGREGATES schema features
This schema feature will be used to guard the
system_schema.scylla_aggregates schema table.
2022-07-18 14:18:48 +02:00
Jadw1
346fb08680 gms: add UDA_NATIVE_PARALLELIZED_AGGREGATION feature
Feature that indicates whether the cluster supports an optional UDA
parameter (reduction function) and parallelization of UDA and
native aggregates.
2022-07-18 14:18:48 +02:00
Nadav Har'El
cc69177dcc config: fix printing of experimental feature list
Recently we noticed a regression where with certain versions of the fmt
library,

   SELECT value FROM system.config WHERE name = 'experimental_features'

returns string numbers, like "5", instead of feature names like "raft".

It turns out that the fmt library keeps changing its overload resolution
order when there are several ways to print something. For enum_option<T> we
happen to have two conflicting ways to print it:
  1. We have an explicit operator<<.
  2. We have an *implicit* converter to the type held by T.

We were hoping that operator<< always wins. But in fmt 8.1, there is
special logic that if the type is convertible to an int, this is used
before operator<<()! For experimental_features_t, the type held in it was
an old-style enum, so it is indeed convertible to int.

The solution I used in this patch is to replace the old-style enum
in experimental_features_t by the newer and more recommended "enum class",
which does not have an implicit conversion to int.

I could have fixed it in other ways, but it wouldn't have been much
prettier. For example, dropping the implicit converter would require
us to change a bunch of switch() statements over enum_option (and
not just experimental_features_t, but other types of enum_option).

Going forward, all uses of enum_option should use "enum class", not
"enum". tri_mode_restriction_t was already using an enum class, and
now so does experimental_features_t. I changed the examples in the
comments to also use "enum class" instead of enum.
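The core issue can be sketched in a few lines (type names here are illustrative, not Scylla's actual definitions):

```cpp
#include <type_traits>

// Illustrative types only: an old-style enum converts implicitly to int,
// which lets fmt's integer formatter win overload resolution over a custom
// operator<<; an enum class blocks that implicit conversion.
enum old_style_features { old_raft = 5 };
enum class new_style_features { raft = 5 };

static_assert(std::is_convertible_v<old_style_features, int>,
              "old-style enums decay to int implicitly");
static_assert(!std::is_convertible_v<new_style_features, int>,
              "enum class has no implicit conversion to int");
```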

This patch also adds to the existing experimental_features test a
check that the feature names are words that are not numbers.

Fixes #11003.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #11004
2022-07-11 09:17:30 +02:00
Tomasz Grabiec
62df9f446c Introduce SCHEMA_COMMITLOG cluster feature 2022-07-06 22:08:56 +02:00
Asias He
a33c370f9a gossip: Speed up wait for gossip settle
In a large cluster, a node receives frequent, periodic gossip
application state updates like CACHE_HITRATES or VIEW_BACKLOG from peer
nodes. Those states are not critical and should not be counted in the
_msg_processing counter, which is used to decide if gossip has settled.

This patch fixes the long gossip settle time on every restart reported by
users.

Refs #10337

Closes #10892
2022-07-06 11:26:32 +03:00
Avi Kivity
dab56b82fa Merge 'Per-partition rate limiting' from Piotr Dulikowski
Due to its sharded and token-based architecture, Scylla works best when the user workload is more or less uniformly balanced across all nodes and shards. However, a common case when this assumption is broken is the "hot partition" - suddenly, a single partition starts getting a lot more reads and writes in comparison to other partitions. Because the shards owning the partition have only a fraction of the total cluster capacity, this quickly causes latency problems for other partitions within the same shard and vnode.

This PR introduces a per-partition rate limiting feature. Users can now choose to apply per-partition limits to their tables of choice using a schema extension:

```
ALTER TABLE ks.tbl
WITH per_partition_rate_limit = {
	'max_writes_per_second': 100,
	'max_reads_per_second': 200
};
```

Reads and writes which are detected to go over that quota are rejected with a new RATE_LIMIT_ERROR CQL error code returned to the client - existing error codes didn't fit the rate limit error well, so a new one is added. This code is implemented as part of a CQL protocol extension and is returned only to clients that requested the extension; otherwise, the existing CONFIG_ERROR is used instead.

Limits are tracked and enforced on the replica side. If a write fails with some replicas reporting rate limit being reached, the rate limit error is propagated to the client. Additionally, the following optimization is implemented: if the coordinator shard/node is also a replica, we account the operation into the rate limit early and return an error in case of exceeding the rate limit before sending any messages to other replicas at all.

The PR covers regular, non-batch writes and single-partition reads. LWT and counters are not covered here.
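The replica-side accounting idea can be illustrated with a toy sketch (this is not the actual db/rate_limiter implementation, which is more sophisticated): count operations per partition in fixed one-second windows and reject once the window's count exceeds the limit.

```cpp
#include <cstdint>
#include <unordered_map>

// Toy sketch only -- the real db/rate_limiter is more sophisticated.
// Operations are counted per partition token in one-second windows; an
// operation is rejected once the window's count exceeds the limit.
class toy_rate_limiter {
    struct bucket { uint64_t window = 0; uint32_t count = 0; };
    std::unordered_map<uint64_t, bucket> _buckets;  // keyed by partition token
public:
    // now_seconds would come from a coarse clock in a real implementation.
    bool try_account(uint64_t token, uint32_t limit, uint64_t now_seconds) {
        auto& b = _buckets[token];
        if (b.window != now_seconds) {  // new window: reset the counter
            b.window = now_seconds;
            b.count = 0;
        }
        return ++b.count <= limit;
    }
};
```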

Results of `perf_simple_query --smp=1 --operations-per-shard=1000000`:

- Write mode:
  ```
  8f690fdd47 (PR base):
  129644.11 tps ( 56.2 allocs/op,  13.2 tasks/op,   49785 insns/op)
  This PR:
  125564.01 tps ( 56.2 allocs/op,  13.2 tasks/op,   49825 insns/op)
  ```
- Read mode:
  ```
  8f690fdd47 (PR base):
  150026.63 tps ( 63.1 allocs/op,  12.1 tasks/op,   42806 insns/op)
  This PR:
  151043.00 tps ( 63.1 allocs/op,  12.1 tasks/op,   43075 insns/op)
  ```

Manual upgrade test:
- Start 3 nodes, 4 shards each, Scylla version 8f690fdd47
- Create a keyspace with scylla-bench, RF=3
- Start reading and writing with scylla-bench with CL=QUORUM
- Manually upgrade nodes one by one to the version from this PR
- The upgrade succeeded; apart from a small number of operations which failed while each node was being taken down, all reads/writes succeeded
- Successfully altered the scylla-bench table to have a read and write limit and those limits were enforced as expected

Fixes: #4703

Closes #9810

* github.com:scylladb/scylla:
  storage_proxy: metrics for per-partition rate limiting of reads
  storage_proxy: metrics for per-partition rate limiting of writes
  database: add stats for per partition rate limiting
  tests: add per_partition_rate_limit_test
  config: add add_per_partition_rate_limit_extension function for testing
  cf_prop_defs: guard per-partition rate limit with a feature
  query-request: add allow_limit flag
  storage_proxy: add allow rate limit flag to get_read_executor
  storage_proxy: resultize return type of get_read_executor
  storage_proxy: add per partition rate limit info to read RPC
  storage_proxy: add per partition rate limit info to query_result_local(_digest)
  storage_proxy: add allow rate limit flag to mutate/mutate_result
  storage_proxy: add allow rate limit flag to mutate_internal
  storage_proxy: add allow rate limit flag to mutate_begin
  storage_proxy: choose the right per partition rate limit info in write handler
  storage_proxy: resultize return types of write handler creation path
  storage_proxy: add per partition rate limit to mutation_holders
  storage_proxy: add per partition rate limit info to write RPC
  storage_proxy: add per partition rate limit info to mutate_locally
  database: apply per-partition rate limiting for reads/writes
  database: move and rename: classify_query -> classify_request
  schema: add per_partition_rate_limit schema extension
  db: add rate_limiter
  storage_proxy: propagate rate_limit_exception through read RPC
  gms: add TYPED_ERRORS_IN_READ_RPC cluster feature
  storage_proxy: pass rate_limit_exception through write RPC
  replica: add rate_limit_exception and a simple serialization framework
  docs: design doc for per-partition rate limiting
  transport: add rate_limit_error
2022-06-24 01:32:13 +03:00
Piotr Dulikowski
000f417d23 gms: add TYPED_ERRORS_IN_READ_RPC cluster feature
We would like to extend the read RPC to return an optional, second value
which indicates an exception - seastar type-erases exceptions on the RPC
handler boundary and we need to differentiate rate_limit_exception from
others. However, it may happen that a replica with an up-to-date version
of Scylla tries to return an exception in this way to a coordinator with
an old version, and the coordinator will drop the error, thinking that
the request succeeded.

In order to protect from that, we introduce the
`TYPED_ERROR_IN_READ_RPC` feature. Only after it is enabled will replicas
start returning exceptions in the new way; until then, all
exceptions are reported using seastar's type-erasure mechanism.
2022-06-22 20:16:48 +02:00
Pavel Emelyanov
820be06ac1 hints: Remove snitch dependency
After the previous patch, the hints manager class has an unused dependency
on the snitch. While removing it, it turned out that several unrelated
places were getting needed headers indirectly via the host_filter.hh ->
snitch_base.hh inclusion.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-22 11:47:26 +03:00
Geoffrey Beausire
ee9841b138 Ensure gossip is enabled on all shards before starting the failure_detector_loop
Before, a race condition was possible where the failure_detector_loop started before gossiper._enabled was set to true on every shard.
This change ensures that _enabled is set to true before moving forward.

Closes #10548
2022-06-17 14:10:45 +03:00
Avi Kivity
4b53af0bd5 treewide: replace parallel_for_each with coroutine::parallel_for_each in coroutines
coroutine::parallel_for_each avoids an allocation and is therefore preferred. The lifetime
of the function object is less ambiguous, so it is also safer. Replace all eligible
occurrences (i.e. where the caller is a coroutine).

One case (storage_service::node_ops_cmd_heartbeat_updater()) needed a little extra
attention since there was a handle_exception() continuation attached. It is converted
to a try/catch.

Closes #10699
2022-05-31 09:06:24 +03:00
Kamil Braun
4c3678e2a0 gms: gossiper: fix direct_fd_pinger::_generation_number initialization
It's an `int64_t` that needs to be explicitly initialized, otherwise the
value is undefined.

This is probably the cause of #10639, although I'm not sure - I couldn't
reproduce it (the bug is dependent on how the binary is compiled, so
that's probably it). We'll see if it reproduces with this fix, and if
it will, close the issue.
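The shape of the fix looks like this (member and type names are illustrative of the commit message, not the exact Scylla code):

```cpp
#include <cstdint>

// Sketch of the pattern fixed here: without a default member initializer,
// a default-constructed _generation_number would hold an indeterminate
// value, and any behavior depending on it would be undefined.
struct pinger_state {
    int64_t _generation_number = 0;  // explicit initialization avoids the UB
};
```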

Closes #10681
2022-05-29 13:08:09 +03:00
Gleb Natapov
083b47cecb gossiper: replace ad-hoc guard with defer()
msg_proc_guard is a guard that makes sure _msg_processing is always
decreased. We can use regular defer() to achieve the same.
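A minimal stand-in for seastar's defer() shows the idiom (simplified sketch; the real utility is richer):

```cpp
#include <utility>

// Minimal sketch of a defer() utility (seastar's real one is richer):
// runs the callable on scope exit, so a counter bumped on entry is
// always decremented, mirroring what msg_proc_guard did by hand.
template <typename Func>
class deferred_action {
    Func _f;
    bool _armed = true;
public:
    explicit deferred_action(Func f) : _f(std::move(f)) {}
    deferred_action(deferred_action&& o) noexcept
        : _f(std::move(o._f)), _armed(std::exchange(o._armed, false)) {}
    deferred_action(const deferred_action&) = delete;
    ~deferred_action() { if (_armed) _f(); }  // run on scope exit
};

template <typename Func>
deferred_action<Func> defer(Func f) {
    return deferred_action<Func>(std::move(f));
}
```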

Message-Id: <YoZTQPbTMWAdCObs@scylladb.com>
2022-05-24 19:20:25 +03:00
Avi Kivity
528ab5a502 treewide: change metric calls from make_derive to make_counter
make_derive was recently deprecated in favor of make_counter, so
make the change throughout the codebase.

Closes #10564
2022-05-14 12:53:55 +02:00
Avi Kivity
5937b1fa23 treewide: remove empty comments in top-of-files
After fcb8d040 ("treewide: use Software Package Data Exchange
(SPDX) license identifiers"), many dual-licensed files were
left with empty comments on top. Remove them to avoid visual
noise.

Closes #10562
2022-05-13 07:11:58 +02:00
Tomasz Grabiec
f703e8ded5 Merge 'New failure detector for Raft' from Kamil Braun
We introduce a new service that performs failure detection by periodically pinging
endpoints. The set of pinged endpoints can be dynamically extended and
shrunk. To learn about the liveness of endpoints, a user of the service
registers a listener and chooses a threshold - a duration of time which
has to pass since the last successful ping in order to mark an endpoint
as dead. When an endpoint responds, it is immediately marked as alive.

Endpoints are identified using abstract integer identifiers.
The method of performing a ping is a dependency of the service provided
by the user through the `pinger` interface. The implementation of `pinger` is
responsible for translating the abstract endpoint IDs to 'real'
addresses. For example, production implementation may map endpoint IDs
to IP addresses and use TCP/IP to perform the ping, while a test/simulation
implementation may use a simulated network that also operates on
abstract identifiers.

Similarly, the method of measuring time is a dependency provided by the
user using the `clock` interface. The service operates on abstract time
intervals and timepoints. So, for example, in a production
implementation time can be measured using a stopwatch, while in
test/simulation we can use a logical clock.
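The two injected dependencies described above might look roughly like this (a synchronous simplification; the real Scylla interfaces are future-based and take abort sources):

```cpp
#include <cstdint>

// Rough, synchronous sketch of the two dependency interfaces (the real
// ones return futures). Endpoints are abstract integer IDs; the pinger
// maps them to real addresses, the clock supplies abstract time.
using endpoint_id = uint64_t;

struct pinger {
    virtual bool ping(endpoint_id ep) = 0;  // true iff the endpoint responded
protected:
    ~pinger() = default;  // the failure detector does not own its pinger
};

struct fd_clock {
    using timepoint_t = int64_t;
    virtual timepoint_t now() noexcept = 0;
protected:
    ~fd_clock() = default;
};

// A test/simulation pinger can operate purely on the abstract IDs.
struct always_alive_pinger : pinger {
    bool ping(endpoint_id) override { return true; }
};
```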

The service distributes work across different shards. When an endpoint
is added to the set of detected endpoints, the service will choose a
shard with the smallest amount of workers and create a worker that is
responsible for periodically pinging this endpoint on that shard and
sending notifications to listeners.

We modify the randomized nemesis test to use the new service.
The service is sharded, but for simplicity of implementation in the test
we implement RPCs and sleeps by routing the requests to shard 0, where the
logical timers and simulated network live. RPCs use the existing simulated
network, and the clock uses the existing logical timers.

We also integrate the service with production code. There,
`pinger` is implemented using existing GOSSIP_ECHO verb. The gossip echo
message requires the node's gossip generation number. We handle this by
embedding the pinger implementation inside `gossiper`, and making
`gossiper` update the generation number (cached inside the pinger class)
periodically.

Production `clock` is a simple implementation which uses
`std::chrono::steady_clock` and `seastar::sleep_until` underneath.
Translating `steady_clock` durations to `direct_fd::clock` durations happens
by taking the number of ticks.

We connect the group 0 raft server rpc implementation to the new service,
so that when servers are added or removed from the group 0 configuration,
corresponding endpoints are added to the direct failure detector service.
Thus the set of detected endpoints will be equal to the group 0 configuration.

On each shard, we register a listener for the service.
The listener maintains a set of live addresses; on mark_alive it adds a
server to the set and on mark_dead it removes it. This set is then used
to implement the `raft::failure_detector` interface, consisting of
`is_alive()` function, which simply checks set membership.

---

v6:
- remove `_alive_start_index`. Instead, keep a map of `bool`s to track liveness of each endpoint. See the code for details (`listeners_liveness` struct and its usage in `ping_fiber()`, `notify_fiber()`, `add/remove_worker`, `add/remove_listener`). The diff is easy to read: f617aeca62..d4b225437c

v5:
- renamed `rpc` to `pinger`
- replaced `bool` with `enum class endpoint_update` (with values `added` and `removed`) in `_endpoint_updates`
- replaced `unsigned` with `shard_id`
- fixed definition of `threshold(size_t n)` (it didn't use `n`, but `_alive_start`; fortunately all uses passed `_alive_start` as `n` so the bug wouldn't affect the behavior)
- improve `_num_workers` assertions
- signal `_alive_start_changed` only when `_alive_start` indeed changed
- renamed `{_marked}_alive_start` to `{_marked}_alive_start_index`

v4:
- rearrange ping_fiber(). Remove the loop at the end of the big `while`
  which was timing out listeners (after the sleep). Instead:
    - rely on the loop before the sleep for timing out listeners
    - before calling ping(), check if there is a timed out listener,
      if so abandon the ping, immediately proceed to the timing-out-listeners
      loop, and then immediately proceed to the next iteration of the big `while`
      (without sleeping)
- inline send_mark_dead() and send_mark_alive(); each was used in
  exactly one place after the rearrangement
- when marking alive, instead of repeatedly doing `--_alive_start` and
  signalling the condition variable, just do `_alive_start = 0` and signal
  the condition variable once
- fix the condition for stopping `endpoint_worker::notify_fiber()`: before, it was
  `_as.abort_requested()`, now it is `_as.abort_requested() && _alive_start == _fd._listeners.size()`.
  Indeed, we want to wait for the stopping code (`destroy_worker()`)
  to set `_alive_start = _fd._listeners.size()` before `notify_fiber()`
  finishes so `notify_fiber()` can send the final `mark_dead`
  notifications for this endpoint. There was a race before where
  `notify_fiber()` could finish before it sent those notifications
  (because it finished as soon as it noticed `_as.abort_requested()`)
- fix some waits in the unit test; they depended on particular ordering
  of tasks by the Scylla reactor, the test could sometimes hang in debug
  mode which randomizes task order
- fix `rpc::ping()` in randomized_nemesis_test so it doesn't give an
  exceptional discarded future in some cases

v3:
- fix a race in failure_detector::stop(): we must first wait for _destroy_subscriptions fiber to finish on all shards, only then we can set _impl to nullptr on any shard
- invoke_abortable_on was moved from randomized_nemesis_test to raft/helpers
- add a unit test (second patch)

v2:
- rename `direct_fd` namespace to `direct_failure_detector`
- move gms/direct_failure_detector.{cc,hh} to direct_failure_detector/failure_detector.{cc,hh}
- cleaned license comments
- removed _mark_queue for sending notifications from ping_fiber() to notify_fiber(). Instead:
    - _listeners is now a boost::container::flat_multimap (previously it was std::multimap)
    - _alive_start is no longer an iterator to _listeners, but an index (size_t)
    - _mark_queue was replaced with a second index to _listeners, _marked_alive_start, together with a condition variable, _alive_start_changed
    - ping_fiber() signals _alive_start_changed when it changes _alive_start
    - notify_fiber() waits on _alive_start_changed. When it wakes up, it compares _marked_alive_start to _alive_start, sends notifications to listeners appropriately, and updates _marked_alive_start
- replacing _mark_queue with index + condition variable allowed some better exception specifications: send_mark_alive and send_mark_dead are now noexcept, ping_fiber() is specified to not return exceptional futures other than sleep_aborted which can only happen when we destroy the worker (previously, ping_fiber() could silently stop due to exception happening when we insert to _mark_queue - it could probably only be bad_alloc, but still)
- _shard_workers is now unordered_map<endpoint_id, endpoint_worker> instead of unordered_map<endpoint_id, unique_ptr<endpoint_worker>> (after learning how to construct map values in place - using either `emplace`+`forward_as_tuple` or `try_emplace`)
- `failure_detector::impl::add_endpoint` now gives strong exception guarantee: if an exception is thrown, no state changes
- same for `failure_detector::impl::remove_endpoint`
- `failure_detector::impl::create_worker` now uses `on_internal_error` when it detects that there is a worker for this endpoint already - thanks to the strong exception guarantees of `add_endpoint` and `remove_endpoint` this should never happen
- comment at _num_workers definition why we maintain this statistic (to pick a shard with smallest number of workers)
- remove unnecessary `if (_as.abort_requested())` in `ping_fiber()`
- in ping_fiber(), after a ping, we send notifications to listeners which we know will time-out before the next ping starts. Before, we would sleep until the threshold is actually passed by the clock. Now we send it immediately - we know ahead of time that the listener will time-out and we can notify it immediately.
- due to above, comment at `register_listener` was adjusted, with the following note added: "Note: the `mark_dead` notification may be sent earlier if we know ahead of time that `threshold` will be crossed before the next `ping()` can start."
- `register_listener` now takes a `listener&`, not `listener*`
- at `register_listener` comment why we allow different thresholds (second to last paragraph)
- at `register_listener` mention that listeners can be registered on any shard (last paragraph)
- add protected destructors to rpc, clock, listener, and mention that these objects are not owned/destroyed by `failure_detector`.
- replaced _endpoint_queue (seastar::queue<pair<endpoint_id, bool>>) with unordered_map<endpoint_id, bool> + condition variable. When user calls add/remove_endpoint, an entry is inserted to this map, or existing entry is updated, and the condition variable is signaled. update_endpoint_fiber() waits on the condition variable, performs the add/remove operation, and removes entries from this map. Compared to the previous solution:
    - the new solution has at most one entry for a given endpoint, so the number of entries is bounded by the number of different endpoints (so in the main Scylla use case, by the number of different nodes that ever exist); the previous solution could in theory have a backlog of unprocessed events, with updates for a given endpoint appearing multiple times in the queue at once
    - when the add/remove operation fails in update_endpoint_fiber(), we don't remove the entry from the map so the operation can be retried later. Previously we would always remove the entry from the queue so it doesn't grow too big in presence of failures.
    - when the add/remove operation fails in update_endpoint_fiber(), we sleep for 10*ping_period before retrying. Note that this codepath should not be reached in practice, it can basically only happen on bad_alloc
- commented that `clock::sleep_until` should signalize aborts using `sleep_aborted`
- `clock::now()` is `noexcept`
- `add/remove_endpoint` can be called after `stop()`, they just won't do anything in that case. Reason: next item
- in randomized_nemesis_test, stop failure detector before raft server (it was the other way before), so it stops using server's RPC before server is aborted. Before, the log was spammed with errors from failure detector because failure detector was getting gate_closed_exceptions from the RPC when the server was stopped. A side effect is that the raft server may continue adding/removing endpoints when the failure detector is stopped, which is fine due to above item
- randomized_nemesis_test: direct_fd_clock::sleep_until translates abort_requested_exception to sleep_aborted (so sleep_until satisfies the interface specification)
- message/rpc_protocol_impl: send_message_abortable: if abort_source::subscribe returns null, immediately throw abort_requested_exception (before we would send the message out and not react to an abort if it happened before we were called)
- rebase

Closes #10437

* github.com:scylladb/scylla:
  service: raft: remove `raft_gossip_failure_detector`
  service: raft: raft_group_registry: use direct failure detector notifications for raft server liveness
  service: raft: add/remove direct failure detector endpoints on group 0 configuration changes
  main: start direct failure detector service
  messaging_service: abortable version of `send_gossip_echo`
  message: abortable version of `send_message`
  test: raft: randomized_nemesis_test: remove old failure_detector
  test: raft: randomized_nemesis_test: use `direct_failure_detector::failure_detector`
  test: raft: randomized_nemesis_test: ping all shards on each tick
  test: unit test for new failure detector service
  direct_failure_detector: introduce new failure detector service
2022-05-11 14:46:27 +02:00
Kamil Braun
38f65e5a2e main: start direct failure detector service
We add the new direct failure detector to the list of services started
in the Scylla process.

To start the service, we need an implementation of `pinger` and `clock`.

`pinger` is implemented using existing GOSSIP_ECHO verb. The gossip echo
message requires the node's gossip generation number. We handle this by
embedding the pinger implementation inside `gossiper`, and making
`gossiper` update the generation number (cached inside the pinger class)
periodically.

`clock` is a simple implementation which uses `std::chrono::steady_clock`
and `seastar::sleep_until` underneath. Translating `steady_clock`
durations to `direct_failure_detector::clock` durations happens by taking
the number of ticks.

The service is currently not used, just initialized; no endpoints are
added and no listeners are registered yet, but the following commits
change that.
2022-05-09 13:14:42 +02:00
Pavel Emelyanov
9d364f19dc gossiper: Add underscores to new private members
The state map and guarding locks were moved to private and now should have a _ prefix

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-06 11:32:03 +03:00
Pavel Emelyanov
5ac28a29d3 gossiper, code: Relax get_up/down/all_counters() helpers
These helpers count elements in the endpoint state map. It makes sense
to keep them in the gossiper API, but it's worth removing the wrappers that
do invoke_on(0). This makes the code shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-06 10:34:48 +03:00
Pavel Emelyanov
0ef33b71ba gossiper, api: Remove get_arrival_samples()
It's empty too, but the API-side conversion probably has some value for
the future, so keep it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-06 10:34:48 +03:00
Pavel Emelyanov
37d392c772 gossiper, api: Remove get/set phi convict threshold helpers
These are empty anyway. The API caller can provide the return stubs itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-06 10:34:48 +03:00
Pavel Emelyanov
ad786d6b4d gossiper, api: Move get_simple_states() into API code
The API method in question just scans the state map. There's no need
to do invoke_on(0) or to have a separate helper method in the gossiper;
the JSON return value can be created in the API handler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-06 10:34:48 +03:00
Pavel Emelyanov
49dd6b5371 gossiper: In-line std::optional<> get_endpoint_state_for_endpoint() overload
The method helps update the endpoint state in handle_major_state_change by
returning a copy of an endpoint state that's kept while the map's entry
is being replaced with the new state. It can be replaced with shorter
code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-06 10:34:48 +03:00
Pavel Emelyanov
f278d84cfe gossiper, api: Remove get_endpoint_state() helpers
There are two of them -- one to do invoke_on(0), the other to get the
needed data. The former is not needed -- the scanned endpoint state
map is replicated across shards and is the same everywhere. The latter
is not needed because there's only one user of it -- the API -- which
can work with the existing gossiper API.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-06 10:34:48 +03:00
Pavel Emelyanov
0aea43a245 gossiper: Make state and locks maps private
Locks are not needed outside the gossiper; the state map is sometimes read
from, but there is a const getter for such cases. Both maps now deserve the
underscore prefix, but it doesn't come with this short patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-06 10:34:48 +03:00
Pavel Emelyanov
690b21aa4d gossiper: Remove dead code
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-06 10:34:48 +03:00
Avi Kivity
19ab3edd77 gms: feature_service: remove variable/helper function duplication
Each feature has a private variable and a public accessor. Since the
accessor effectively makes the variable public, avoid the intermediary
and make the variable public directly.

To ease mechanical translation, the variable name is chosen as
the function name (without the cluster_supports_ prefix).

References throughout the codebase are adjusted.
2022-05-04 18:59:56 +03:00
Avi Kivity
435b46cd52 gms: feature: make operator bool implicit
Features are usually used as booleans, so allowing them
to implicitly decay to bool is not a mistake. In fact, a bunch
of helper functions exist to cast feature variables to bool.

Prepare to reduce this boilerplate by allowing automatic conversion
to bool.
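The change can be sketched like this (illustrative member names, not the actual gms::feature definition):

```cpp
// Illustrative sketch: with a non-explicit operator bool, a feature can be
// tested directly in boolean contexts, removing the need for helper casts.
struct feature {
    bool _enabled = false;
    operator bool() const { return _enabled; }  // implicit, by design here
};

// usage: `if (f) { ... }` now works without a bool(f) helper
```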
2022-05-04 18:58:24 +03:00
Avi Kivity
81ad595f61 gms: feature_service: remove feature variable duplication in enable()
We have a list of all feature variables in enable(), but the list
is also available programmatically in _registered_features, so use
that instead.
2022-05-04 18:44:28 +03:00
Avi Kivity
f0f4759163 gms: feature_service: remove feature variable declaration/definition duplication
Feature variables are both declared and defined. Make that happen in one
place, reducing boilerplate.
2022-05-04 18:24:56 +03:00
Avi Kivity
0f95258577 gms: features: de-quadruplicate active feature names
Active feature names are present four or five times in the code:
a declaration in feature.hh, a definition and initialization (two copies)
in feature_service.cc, a use in feature_service.cc, and a possible
reference in feature_service.cc if the feature is conditionally enabled.

Switch to just one copy or two, using the "foo"sv operator (and "foo"s)
to generate a string_view (string) as before.

Note that a few features had different external and C++ names; we
preserve the external name.

This patch does cause literal strings to be present in two places,
making them vulnerable to misspellings. But since feature names
are immutable, there is little risk that one will change without
the other.
2022-05-04 18:12:53 +03:00
Avi Kivity
980b109adb gms: features: de-quadruplicate deprecated feature names
Deprecated features are unused, but are present four times in the code:
a declaration in feature.hh, a definition and initialization (two copies)
in feature_service.cc, and a use in feature_service.cc. Switch to just
one copy, using the "foo"sv operator to generate a string_view as before.

Note that a few features had different external and C++ names; we
preserve the external name.
2022-05-04 17:54:05 +03:00
Avi Kivity
ebe5ce2870 gms: feature_service: avoid duplicating feature names when listing known features
We already have the registered features in a data structure, collect them
from there instead of repeating.
2022-05-04 16:19:42 +03:00
Pavel Emelyanov
b26a3da584 gossiper: Coroutinize wait_for_gossip_to_settle()
Looks notably shorter this way

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220422093000.24407-1-xemul@scylladb.com>
2022-05-03 15:58:04 +03:00
Pavel Emelyanov
e80adbade3 code: De-globalize gossiper
No code uses the global gossiper instance, so it can be removed. The main
and cql-test-env code now have their own real local instances.

This change also requires adding the debug:: pointer and fixing
scylla-gdb.py to find the correct global location.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-03 10:57:40 +03:00
Pavel Emelyanov
7a0ca3fedc gossiper: Use container() instead of the global pointer
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-03 10:57:40 +03:00
Pavel Solodovnikov
b25c4fee01 gms: gossiper: coroutinize apply_state_locally
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-17 11:51:18 +03:00
Pavel Solodovnikov
746f1179eb gms: gossiper: coroutinize apply_state_locally_without_listener_notification
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-17 11:38:33 +03:00
Pavel Solodovnikov
b7322c3f5d gms: gossiper: coroutinize do_apply_state_locally
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-17 11:29:26 +03:00
Pavel Solodovnikov
c48dcf607a gms: gossiper: coroutinize apply_new_states
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-17 11:28:42 +03:00
Kamil Braun
41f5b7e69e Merge branch 'raft_group0_early_startup_v3' of https://github.com/ManManson/scylla into next
* 'raft_group0_early_startup_v3' of https://github.com/ManManson/scylla:
  main: allow joining raft group0 before waiting for gossiper to settle
  service: raft_group0: make `join_group0` re-entrant
  service: storage_service: add `join_group0` method
  raft_group_registry: update gossiper state only on shard 0
  raft: don't update gossiper state if raft is enabled early or not enabled at all
  gms: feature_service: add `cluster_uses_raft_mgmt` accessor method
  db: system_keyspace: add `bootstrap_needed()` method
  db: system_keyspace: mark getter methods for bootstrap state as "const"
2022-04-14 16:42:20 +02:00
Raphael S. Carvalho
8427ec056c gms: gossiper: don't duplicate knowledge of minimum time for gossip to settle
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220409022435.58070-2-raphaelsc@scylladb.com>
2022-04-11 19:19:02 +03:00
Piotr Sarna
3272b4826f db: add keyspace-storage-options experimental feature
Specifying non-standard keyspace options is experimental, so it's
going to be protected by a configuration flag.
2022-04-08 09:17:01 +02:00
Piotr Sarna
120980ac8e db,gms: add SCYLLA_KEYSPACE schema feature
This schema feature will be used to guard the upcoming
system_schema.scylla_keyspaces schema table.
2022-04-08 09:17:00 +02:00
Piotr Sarna
567c0d0368 db,gms: add KEYSPACE_STORAGE_OPTIONS feature
The feature represents the ability to store storage options
in keyspace metadata, as a map of options,
e.g. storage type, bucket, authentication details, etc.
2022-04-08 09:17:00 +02:00
Pavel Solodovnikov
ccb59ba6c7 gms: feature_service: add cluster_uses_raft_mgmt accessor method
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-07 12:30:21 +03:00
Pavel Emelyanov
05a32328fc snitch: Remove gossiper_starting()
No longer used

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-04-01 13:16:09 +03:00
Pavel Emelyanov
3da5f6ac30 gossiper: Add system keyspace dependency
The gossiper reads peer features from the system keyspace. The snitch
code also needs the system keyspace, and since it currently gets all its
dependencies from the gossiper (will be fixed some day, but not now), it
will do the same for the system keyspace. Thus it's worth having an
explicit gossiper->system_keyspace dependency.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-25 15:08:13 +03:00
Pavel Solodovnikov
011942dcce raft: move tracking SUPPORTS_RAFT_CLUSTER_MANAGEMENT feature to raft
Move the listener from feature service to the `raft_group_registry`.

Enable support for the `USES_RAFT_CLUSTER_MANAGEMENT`
feature when the former is enabled.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-03-18 09:54:25 +03:00
Pavel Solodovnikov
7ea4d44508 gms: feature_service: update system.local#supported_features when feature support changes
Also, change the signature of `support()` method to return
`future<>` since it's now a coroutine. Adjust existing call sites.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-03-18 09:54:21 +03:00
Pavel Emelyanov
6a154305d7 gossiper: Remove db::config reference from gossiper
Also const-ify the db::config reference argument and std::move
the gossip_config argument while at it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-02 18:34:55 +03:00