Commit Graph

31140 Commits

Author SHA1 Message Date
Botond Dénes
7501a075bd sstables/index_reader: push down eof() check to advance_to(index_bound&, dht::ring_position_view)
Commit e8f3d7dd13 added eof() checks to public partition-level
advance_to() methods, to ensure we do not attempt to re-read the last
page of the index when at eof(). It was noted however that this check
would be safer in advance_to(index_bound&, dht::ring_position_view)
because that is the method that all these higher-level methods end up
calling. Placing the check there would guarantee safety for all such
operations. This path does exactly that: it pushes down the check to
said method. One change needed for this to work is to check eof on the
bound that is currently advanced, instead of unconditionally checking
the lower bound.

Closes #10531
2022-05-11 14:46:30 +02:00
Asias He
77b1db475c locator: Do not enforce public ip address for broadcast_rpc_address
Reported by Felipe Cardeneti:

- Create a 2-node Scylla cluster w/ Ec2MultiRegionSnitch
- Check system.peers table

Scylla (uses public address)
```
cqlsh> select peer,data_center,host_id,preferred_ip,rack,rpc_address,schema_version from system.peers;

peer          | data_center | host_id                              | preferred_ip  | rack | rpc_address   | schema_version
---------------+-------------+--------------------------------------+---------------+------+---------------+--------------------------------------
18.216.98.219 |   us-east-2 | d9443741-a12e-4bbb-91ce-9931cece589c | 172.31.43.122 |   2c | 18.216.98.219 | 95c3fca5-c463-3aba-98c6-1c0b3fac5b58

(1 rows)
```

Cassandra (uses local address):
```
cqlsh> SELECT peer,data_center,host_id,preferred_ip,rack,rpc_address,schema_version from system.peers;

peer          | data_center | host_id                              | preferred_ip  | rack       | rpc_address   | schema_version
---------------+-------------+--------------------------------------+---------------+------------+---------------+--------------------------------------
52.15.104.255 |   us-east-2 | 42c0b717-775f-4998-a420-0388fe8b4e70 | 172.31.42.126 | us-east-2c | 172.31.42.126 | 2207c2a9-f598-3971-986b-2926e09e239d

(1 rows)
```

Config diff:
```
cassandra.yaml:rpc_address: 0.0.0.0
cassandra.yaml:broadcast_rpc_address: 172.31.42.126
/etc/scylla/scylla.yaml:broadcast_rpc_address: 172.31.42.126
/etc/scylla/scylla.yaml:rpc_address: 0.0.0.0
```

After this patch, if broadcast_rpc_address is unset, Ec2MultiRegionSnitch
will use the public ip address to set broadcast_rpc_address. If
broadcast_rpc_address is set, Ec2MultiRegionSnitch will not modify it.

Fixes #10236

Closes #10519
2022-05-11 14:46:30 +02:00
Tomasz Grabiec
f703e8ded5 Merge 'New failure detector for Raft' from Kamil Braun
We introduce a new service that performs failure detection by periodically pinging
endpoints. The set of pinged endpoints can be dynamically extended and
shrinked. To learn about liveness of endpoints, user of the service
registers a listener and chooses a threshold - a duration of time which
has to pass since the last successful ping in order to mark an endpoint
as dead. When an endpoint responds it's immediately marked as alive.

Endpoints are identified using abstract integer identifiers.
The method of performing a ping is a dependency of the service provided
by the user through the `pinger` interface. The implementation of `pinger` is
responsible for translating the abstract endpoint IDs to 'real'
addresses. For example, production implementation may map endpoint IDs
to IP addresses and use TCP/IP to perform the ping, while a test/simulation
implementation may use a simulated network that also operates on
abstract identifiers.

Similarly, the method of measuring time is a dependency provided by the
user using the `clock` interface. The service operates on abstract time
intervals and timepoints. So, for example, in a production
implementation time can be measured using a stopwatch, while in
test/simulation we can use a logical clock.

The service distributes work across different shards. When an endpoint
is added to the set of detected endpoints, the service will choose a
shard with the smallest amount of workers and create a worker that is
responsible for periodically pinging this endpoint on that shard and
sending notifications to listeners.

We modify the randomized nemesis test to use the new service.
The service is sharded, but for simplicity of implementation in the test
we implement rpcs and sleeps by routing the requests to shard 0, where
logical timers and network live. rpcs are using the existing simulated
network and clock using the existing logical timers.

We also integrate the service with production code. There,
`pinger` is implemented using existing GOSSIP_ECHO verb. The gossip echo
message requires the node's gossip generation number. We handle this by
embedding the pinger implementation inside `gossiper`, and making
`gossiper` update the generation number (cached inside the pinger class)
periodically.

Production `clock` is a simple implementation which uses
`std::chrono::steady_clock` and `seastar::sleep_until` underneath.
Translating `steady_clock` durations to `direct_fd::clock` durations happens
by taking the number of ticks.

We connect the group 0 raft server rpc implementation to the new service,
so that when servers are added or removed from the the group 0 configuration,
corresponding endpoints are added to the direct failure detector service.
Thus the set of detected endpoints will be equal to the group 0 configuration.

On each shard, we register a listener for the service.
The listener maintains a set of live addresses; on mark_alive it adds a
server to the set and on mark_dead it removes it. This set is then used
to implement the `raft::failure_detector` interface, consisting of
`is_alive()` function, which simply checks set membership.

---

v6:
- remove `_alive_start_index`. Instead, keep a map of `bool`s to track liveness of each endpoint. See the code for details (`listeners_liveness` struct and its usage in `ping_fiber()`, `notify_fiber()`, `add/remove_worker`, `add/remove_listener`). The diff is easy to read: f617aeca62..d4b225437c

v5:
- renamed `rpc` to `pinger`
- replaced `bool` with `enum class endpoint_update` (with values `added` and `removed`) in `_endpoint_updates`
- replaced `unsigned` with `shard_id`
- fixed definition of `threshold(size_t n)` (it didn't use `n`, but `_alive_start`; fortunately all uses passed `_alive_start` as `n` so the bug wouldn't affect the behavior)
- improve `_num_workers` assertions
- signal `_alive_start_changed` only when `_alive_start` indeed changed
- renamed `{_marked}_alive_start` to `{_marked}_alive_start_index`

v4:
- rearrange ping_fiber(). Remove the loop at the end of the big `while`
  which was timing out listeners (after the sleep). Instead:
    - rely on the loop before the sleep for timing out listeners
    - before calling ping(), check if there is a timed out listener,
      if so abandon the ping, immediately proceed to the timing-out-listeners
      loop, and then immediately proceed to the next iteration of the big `while`
      (without sleeping)
- inline send_mark_dead() and send_mark_alive(); each was used in
  exactly one place after the rearrangement
- when marking alive, instead of repeatedly doing `--_alive_start` and
  signalling the condition variable, just do `_alive_start = 0` and signal
  the condition variable once
- fix the condition for stopping `endpoint_worker::notify_fiber()`: before, it was
  `_as.abort_requested()`, now it is `_as.abort_requested() && _alive_start == _fd._listeners.size()`.
  Indeed, we want to wait for the stopping code (`destroy_worker()`)
  to set `_alive_start = _fd._listeners.size()` before `notify_fiber()`
  finishes so `notify_fiber()` can send the final `mark_dead`
  notifications for this endpoint. There was a race before where
  `notify_fiber()` could finish before it sent those notifications
  (because it finished as soon as it noticed `_as.abort_requested()`)
- fix some waits in the unit test; they depended on particular ordering
  of tasks by the Scylla reactor, the test could sometimes hang in debug
  mode which randomizes task order
- fix `rpc::ping()` in randomized_nemesis_test so it doesn't give an
  exceptional discarded future in some cases

v3:
- fix a race in failure_detector::stop(): we must first wait for _destroy_subscriptions fiber to finish on all shards, only then we can set _impl to nullptr on any shard
- invoke_abortable_on was moved from randomized_nemesis_test to raft/helpers
- add a unit test (second patch)

v2:
- rename `direct_fd` namespace to `direct_failure_detector`
- move gms/direct_failure_detector.{cc,hh} to direct_failure_detector/failure_detector.{cc,hh}
- cleaned license comments
- removed _mark_queue for sending notifications from ping_fiber() to notify_fiber(). Instead:
    - _listeners is now a boost::container::flat_multimap (previously it was std::multimap)
    - _alive_start is no longer an iterator to _listeners, but an index (size_t)
    - _mark_queue was replaced with a second index to _listeners, _marked_alive_start, together with a condition variable, _alive_start_changed
    - ping_fiber() signals _alive_start_changed when it changes _alive_start
    - notify_fiber() waits on _alive_start_changed. When it wakes up, it compares _marked_alive_start to _alive_start, sends notifications to listeners appropriately, and updates _marked_alive_start
- replacing _mark_queue with index + condition variable allowed some better exception specifications: send_mark_alive and send_mark_dead are now noexcept, ping_fiber() is specified to not return exceptional futures other than sleep_aborted which can only happen when we destroy the worker (previously, ping_fiber() could silently stop due to exception happening when we insert to _mark_queue - it could probably only be bad_alloc, but still)
- _shard_workers is now unordered_map<endpoint_id, endpoint_worker> instead of unordered_map<endpoint_id, unique_ptr<endpoint_worker>> (after learning how to construct map values in place - using either `emplace`+`forward_as_tuple` or `try_emplace`)
- `failure_detector::impl::add_endpoint` now gives strong exception guarantee: if an exception is thrown, no state changes
- same for `failure_detector::impl::remove_endpoint`
- `failure_detector::impl::create_worker` now uses `on_internal_error` when it detects that there is a worker for this endpoint already - thanks to the strong exception guarantees of `add_endpoint` and `remove_endpoint` this should never happen
- comment at _num_workers definition why we maintain this statistic (to pick a shard with smallest number of workers)
- remove unnecessary `if (_as.abort_requested())` in `ping_fiber()`
- in ping_fiber(), after a ping, we send notifications to listeners which we know will time-out before the next ping starts. Before, we would sleep until the threshold is actually passed by the clock. Now we send it immediately - we know ahead of time that the listener will time-out and we can notify it immediately.
- due to above, comment at `register_listener` was adjusted, with the following note added: "Note: the `mark_dead` notification may be sent earlier if we know ahead of time that `threshold` will be crossed before the next `ping()` can start."
- `register_listener` now takes a `listener&`, not `listener*`
- at `register_listener` comment why we allow different thresholds (second to last paragraph)
- at `register_listener` mention that listeners can be registered on any shard (last paragraph)
- add protected destructors to rpc, clock, listener, and mention that these objects are not owned/destroyed by `failure_detector`.
- replaced _endpoint_queue (seastar::queue<pair<endpoint_id, bool>>) with unordered_map<endpoint_id, bool> + condition variable. When user calls add/remove_endpoint, an entry is inserted to this map, or existing entry is updated, and the condition variable is signaled. update_endpoint_fiber() waits on the condition variable, performs the add/remove operation, and removes entries from this map. Compared to the previous solution:
    - the new solution has at most one entry for a given endpoint, so the number of entries is bounded by the number of different endpoints (so in the main Scylla use case, by the number of different nodes that ever exist); the previous solution could in theory have a backlog of unprocessed events, with updates for a given endpoint appearing multiple times in the queue at once
    - when the add/remove operation fails in update_endpoint_fiber(), we don't remove the entry from the map so the operation can be retried later. Previously we would always remove the entry from the queue so it doesn't grow too big in presence of failures.
    - when the add/remove operation fails in update_endpoint_fiber(), we sleep for 10*ping_period before retrying. Note that this codepath should not be reached in practice, it can basically only happen on bad_alloc
- commented that `clock::sleep_until` should signalize aborts using `sleep_aborted`
- `clock::now()` is `noexcept`
- `add/remove_endpoint` can be called after `stop()`, they just won't do anything in that case. Reason: next item
- in randomized_nemesis_test, stop failure detector before raft server (it was the other way before), so it stops using server's RPC before server is aborted. Before, the log was spammed with errors from failure detector because failure detector was getting gate_closed_exceptions from the RPC when the server was stopped. A side effect is that the raft server may continue adding/removing endpoints when the failure detector is stopped, which is fine due to above item
- randomized_nemesis_test: direct_fd_clock::sleep_until translates abort_requested_exception to sleep_aborted (so sleep_until satisfies the interface specification)
- message/rpc_protocol_impl: send_message_abortable: if abort_source::subscribe returns null, immediately throw abort_requested_exception (before we would send the message out and not react to an abort if it happened before we were called)
- rebase

Closes #10437

* github.com:scylladb/scylla:
  service: raft: remove `raft_gossip_failure_detector`
  service: raft: raft_group_registry: use direct failure detector notifications for raft server liveness
  service: raft: add/remove direct failure detector endpoints on group 0 configuration changes
  main: start direct failure detector service
  messaging_service: abortable version of `send_gossip_echo`
  message: abortable version of `send_message`
  test: raft: randomized_nemesis_test: remove old failure_detector
  test: raft: randomized_nemesis_test: use `direct_failure_detector::failure_detector`
  test: raft: randomized_nemesis_test: ping all shards on each tick
  test: unit test for new failure detector service
  direct_failure_detector: introduce new failure detector service
2022-05-11 14:46:27 +02:00
Nadav Har'El
2c39c4c284 Merge 'Handle errors during snapshot' from Benny Halevy
This series refactors `table::snapshot` and moves the responsibility
to flush the table before taking the snapshot to the caller.

`flush_on_all` and `snapshot_on_all` helpers are added to replica::database
(by making it a peering_sharded_service) and upper layers,
including api and snapshot-ctl now call it instead of calling cf.snapshot directly.

With that, error are handed in table::snapshot and propagated
back to the callers.

Failure to allocate the `snapshot_manager` object is fatal,
similar to failure to allocate a continuation, since we can't
coordinate across the shards without it.

Test: unit(dev), rest_api(debug)

Fixes #10500

Closes #10513

* github.com:scylladb/scylla:
  table: snapshot: handle errors
  table: snapshot: get rid of skip_flush param
  database: truncate: skip flush when taking snapshot
  test: rest_api: storage_service: verify_snapshot_details: add truncate
  database: snapshot_on_all: flush before snapshot if needed
  table: make snapshot method private
  database: add snapshot_on_all
  snapshot-ctl: run_snapshot_modify_operation: reject views and secondary index using the schema
  snapshot-ctl: refactor and coroutinize take_snapshot / take_column_family_snapshot
  api: storage_service: increase visibility of snapshot ops in the log
  api: storage_service: coroutinize take_snapshot and del_snapshot
  api: storage_service: take_snapshot: improve api help messages
  test: rest_api: storage_service: add test_storage_service_snapshot
  database: add flush_on_all variants
  test: rest_api: add test_storage_service_flush
2022-05-10 10:52:10 +03:00
Benny Halevy
1d39d803af table: snapshot: handle errors
Turn table::snapshot into a coroutine,
catch exceptions, and return them to the caller.

Make sure that coordination across shards
would not break even if any of the shards hits
an error, by always signaling semaphores other
shards wait on.

All errors except for failing to allocate
the snapshot_manager objects are caught
and propagated back.

Failing to allocate the snapshot_manager is fatal
similar to failing to allocate a continuation
since we can't coordinate across the shards without it,
so abort that fails.

Fixes #10500

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 10:45:14 +03:00
Benny Halevy
9e69089306 table: snapshot: get rid of skip_flush param
Now that all callers flush on their own
before calling table::snapshot.

Refs #10500

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 10:45:14 +03:00
Benny Halevy
31881273a1 database: truncate: skip flush when taking snapshot
database::truncate already flushes the table
on auto_snapshot so there is never a reason
to flush it again in table::snapshot.

Note that cf.can_flush() is false only if memtables
are empty so there nothing to flush or there is
is no seal_immediate_fn and then table::snapshot
wouldn't be able to flush either.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 10:45:14 +03:00
Benny Halevy
fc79787863 test: rest_api: storage_service: verify_snapshot_details: add truncate
Truncate the test table and verify that
the 'live' snapshot size is now non-zero.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 10:45:14 +03:00
Benny Halevy
46c950fb31 database: snapshot_on_all: flush before snapshot if needed
flush_on_all shards before taking the snapshot if !skip_flush
so we can get rid of flushing in table::snapshot.

Refs #10500

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 10:45:14 +03:00
Benny Halevy
33bd52921e table: make snapshot method private
Only callable by database.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 10:45:14 +03:00
Benny Halevy
e1d58d4422 database: add snapshot_on_all
And move the logic from snapshot-ctl down to the
replica::database layer.

A following patch will move the flush phase
from the replica::table::snapshot layer
out to the caller.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 10:45:14 +03:00
Benny Halevy
aa127a2dbb snapshot-ctl: run_snapshot_modify_operation: reject views and secondary index using the schema
Detecting a secondary index by checking for a dot
in the table name is wrong as tables generated by Alternator
may contain a dot in their name.

Instead detect bot hmaterialized view and secondary indexes
using the schema()->is_view() method.

Fixes #10526

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 10:44:52 +03:00
Benny Halevy
1fbcdbd2e8 snapshot-ctl: refactor and coroutinize take_snapshot / take_column_family_snapshot
There is no functional change in this patch.
Only refactoring of the code.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 10:16:39 +03:00
Benny Halevy
01b1e54e22 api: storage_service: increase visibility of snapshot ops in the log
snapshot operations over the api are rare
but they contain significant state on disk in the
form of sstables hard-linked to the snapshot directories.

Also, we've seen snapshot operations hang in the field,
requiring a core dump to analyse the issue,
while there were no records in the log indicating
when previous snapshot operations were last executed.

This change promotes logging to info level
when take_snapshot and del_snapshot start,
and logs errors if in case they fail.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 10:15:46 +03:00
Benny Halevy
b9d972d029 api: storage_service: coroutinize take_snapshot and del_snapshot
Before making any further changes in them.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 10:02:52 +03:00
Benny Halevy
10b86ee5bd api: storage_service: take_snapshot: improve api help messages
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 10:02:47 +03:00
Benny Halevy
e95ecbbea6 test: rest_api: storage_service: add test_storage_service_snapshot
Test the snapshot operations via the rest api.

Added test/rest_api/rest_util.py with
new_test_snapshot that creates a new test snapshot
and automagically deletes it when the `with` block
if exited, similar to new_test_keyspace and new_test_table.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 09:56:44 +03:00
Benny Halevy
5b4eb44795 database: add flush_on_all variants
Use by api layer.

Will be used in a later patch to flush
on all shards before taking a snapshot.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 09:56:44 +03:00
Benny Halevy
05c7f4b832 test: rest_api: add test_storage_service_flush
Add a basic rest_api test for keyspace_flush.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-10 09:56:44 +03:00
Nadav Har'El
1c6163d51f Merge 'cql3: expr: allow bind markers in collection literals' from Michał Sala
Allowing bind markers in collection literals is a change which causes minor differences in behavior between Scylla and Cassandra. Despite such an undesirable effect, I think allowing them is a good idea because it makes [refactoring work made by cvybhu](https://github.com/scylladb/scylla/pull/10409) easier - 469d03f8c2.

Also, making Scylla accept a superset of valid Cassandra cql expressions does not make us less compatible (maybe apart from test suit compatibility).

Closes #10457

* github.com:scylladb/scylla:
  test/boost: cql_query_test: allow bound variables in test_list_of_tuples_with_bound_var
  test/boost: cql_query_test: test bound variables in collection literals
  cql3: expr: do not allow unset values inside collections
  cql3: expr: prepare_expr: allow bind markers in collection literals
2022-05-09 19:15:22 +03:00
Botond Dénes
fd27fbfe64 Merge "Add user types carrier helper" from Pavel Emelyanov
"
There's a cql_type_parser::parse() method that needs to get user
types for a keyspace by its name. For this it uses the global
storage proxy instance as a place to get database from. This set
introduces an abstract user_types_storage helper object that's
responsible in providing the user types for the caller.

This helper, in turn, is provided to the parse() method by the
database itself or by the schema_ctxt object that needs parse()
to unfreeze schemas and doesn't have database at those times.

This removes one more get_storage_proxy() call.
"

* 'br-user-types-storage' of https://github.com/xemul/scylla:
  cql_type_parser: Require user_types_storage& in parse()
  schame_tables: Add db/ctxt args here and there
  user_types: Carry storage on database and schema_ctxt
  data_dictionary: Introduce user types storage
2022-05-09 17:38:52 +03:00
Nadav Har'El
ca700bf417 scripts/pull_github_pr.sh: clean up after failed cherry-pick
When pull_github_pr.sh uses git cherry-pick to merge a single-patch
pull request, this cherry-pick can fail. A typical example is trying
to merge a patch that has actually already been merged in the past,
so cherry-pick reports that the patch, after conflict resolution,
is empty.

When cherry-pick fails, it leaves the working directory in an annoying
mid-cherry-pick state, and today the user needs to manually call
"git cherry-pick --abort" to return to the normal state. The script
should it automatically - so this is what we do in this patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2022-05-09 17:23:34 +03:00
Pavel Emelyanov
598ce8111d repair: Handle discarded stopping future
When repair_meta stops it does so in the background and reports back
a shared future into whose shared promise peer it resolves that
background activity. There's a shorter way to forward a future result
into another, even shared, promise. And this method doesn't need to
discard a future.

tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/253

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-09 17:23:12 +03:00
Pavel Emelyanov
3b4af86ad9 proxy (and suddenly redis): Don't check latency_counter.is_start()
The lcs at those places are explicitly start()ed beforehand. The
is_start() check is necessary when using the latency_counter with a
histogram that may or may not start the counter (this is the case
in several class table methods).

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-09 17:20:41 +03:00
Raphael S. Carvalho
48e3117ebc compaction: move propagate_replacement() into private namespace
propagate_replacement() is an internal function that shouldn't be in
the public interface. No one besides an unit test for incremental
compaction needs it. In the future, I want to revisit incremental
compaction unit test to stop using it and only rely on public
interfaces

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220506171647.81063-1-raphaelsc@scylladb.com>
2022-05-09 16:49:50 +03:00
Kamil Braun
0f7a1179c8 service: raft: remove raft_gossip_failure_detector
It's no longer used, having been replaced
by the direct_failure_detector listener.
2022-05-09 15:31:19 +02:00
Kamil Braun
295aec2633 service: raft: raft_group_registry: use direct failure detector notifications for raft server liveness
On each shard, we register a listener for the new direct failure detector service.
The listener maintains a set of live addresses; on mark_alive it adds a
server to the set and on mark_dead it removes it. This set is then used
to implement the `raft::failure_detector` interface, consisting of
`is_alive()` function, which simply checks set membership.

There is some complexity in between, because we need to translate
direct_failure_detector endpoint_ids to inet_addresses and raft::server_ids
to inet_addreses, but all building blocks are already there.
2022-05-09 15:31:19 +02:00
Kamil Braun
7e4bb68061 service: raft: add/remove direct failure detector endpoints on group 0 configuration changes
We connect the group 0 raft server rpc implementation to the new direct
failure detector service, so that when servers are added or removed from
the the group 0 configuration, corresponding endpoints are added to the
direct failure detector service. Thus the set of detected endpoints will
be equal to the group 0 configuration.

This causes the failure detector service to start pinging endpoints,
but no listeners are registered yet. The following commit changes that.
2022-05-09 15:31:19 +02:00
Kamil Braun
38f65e5a2e main: start direct failure detector service
We add the new direct failure detector to the list of services started
in the Scylla process.

To start the service, we need an implementation of `pinger` and `clock`.

`pinger` is implemented using existing GOSSIP_ECHO verb. The gossip echo
message requires the node's gossip generation number. We handle this by
embedding the pinger implementation inside `gossiper`, and making
`gossiper` update the generation number (cached inside the pinger class)
periodically.

`clock` is a simple implementation which uses `std::chrono::steady_clock`
and `seastar::sleep_until` underneath. Translating `steady_clock`
durations to `direct_failure_detector::clock` durations happens by taking
the number of ticks.

The service is currently not used, just initialized; no endpoints are
added and no listeners are registered yet, but the following commits
change that.
2022-05-09 13:14:42 +02:00
Kamil Braun
9551256e81 messaging_service: abortable version of send_gossip_echo
Use the new `send_message_abortable` function to implement an abortable
version of `send_gossip_echo`.

These echo messages will be used for direct failure detection.
2022-05-09 13:14:41 +02:00
Kamil Braun
f2548fc3fa message: abortable version of send_message
I want to be able to timeout `send_message`, but not through the
existing `send_message_timeout` API which forces me to use a particular
clock/duration/timepoint type. Introduce a more general
`send_message_abortable` API which gets an `abort_source&`, subscribes
to it, and uses the `rpc::cancellable` interface to cancel the RPC on
abort.

The function is 90% copy-pasta from `send_message{_timeout}`, only the
abort part is new.
2022-05-09 13:14:41 +02:00
Kamil Braun
c15f3a9698 test: raft: randomized_nemesis_test: remove old failure_detector
No longer used.
Split from the previous commit for a better diff.
2022-05-09 13:14:41 +02:00
Kamil Braun
915d329f1f test: raft: randomized_nemesis_test: use direct_failure_detector::failure_detector
Until now the nemesis test used its own failure detector implementation
which used one-way heartbeats.

Switch it to use the new direct failure detection service, which will
also be used in production code. Integrating it does require some work
however as we need to implement the `pinger` and `clock` interfaces
for the failure detector.

The service is sharded, but for simplicity of implementation we
implement rpcs and sleeps by routing the requests to shard 0, where
logical timers and network live.
2022-05-09 13:14:41 +02:00
Kamil Braun
e5fc0681d9 test: raft: randomized_nemesis_test: ping all shards on each tick
Right now the test is running entirely on shard 0, but we want to
introduce a sharded service to the test. The initial naive attempt of
doing that failed because the test would time out (reach the tick limit)
before any work distributed to other shards could even start. The
solution in this commit solves that by synchronizing the shards on each
tick.

When the test is ran with smp=1, the behavior is as before.
2022-05-09 13:14:41 +02:00
Kamil Braun
e4f85cf425 test: unit test for new failure detector service 2022-05-09 13:14:41 +02:00
Kamil Braun
666e5a414d direct_failure_detector: introduce new failure detector service
The new service performs failure detection by periodically pinging
endpoints. The set of pinged endpoints can be dynamically extended and
shrinked. To learn about liveness of endpoints, user of the service
registers a listener and chooses a threshold - a duration of time which
has to pass since the last successful ping in order to mark an endpoint
as dead. When an endpoint responds it's immediately marked as alive.

Endpoints are identified using abstract integer identifiers.
The method of performing a ping is a dependency of the service provided
by the user through the `pinger` interface. The implementation of `pinger`
is responsible for translating the abstract endpoint IDs to 'real'
addresses. For example, production implementation may map endpoint IDs
to IP addresses and use TCP/IP to perform the ping, while a test/simulation
implementation may use a simulated network that also operates on
abstract identifiers.

Similarly, the method of measuring time is a dependency provided by the
user using the `clock` interface. The service operates on abstract time
intervals and timepoints. So, for example, in a production
implementation time can be measured using a stopwatch, while in
test/simulation we can use a logical clock.

The service distributes work across different shards. When an endpoint
is added to the set of detected endpoints, the service will choose a
shard with the smallest amount of workers and create a worker that is
responsible for periodically pinging this endpoint on that shard and
sending notifications to listeners.

Endpoints can be added or removed only through the shard 0 instance of
the service and shard 0 is responsible for coordinating the endpoint
workers. Listeners can be registered on any shard.
2022-05-09 13:14:40 +02:00
David Garcia
3e0f81180e docs: disable link checker
Closes #10434
2022-05-09 12:45:28 +02:00
Avi Kivity
81af9342f1 Merge "Simplify gossiper state map API" from Pavel E
"
There's a enpoint->state map member of the gossiper class. First
ugly thing about it is that the member is public.

Next, there's a whole bunch of helpers around that map that export
various bits of information from it. All of those helpers reshard
to shard-0 to read from the state mape ignoring the fact that the
map is replicated on all shards internally. Also, some of those
helpers effectively duplicate each other for no real gain. Finally,
most of them are specific to api/ code, and open-coding them often
makes api/ handlers shorter and simpler.

This set removes the unused, api-only or trivial state map accessors
and marks the state map itself private (underscore prefix included).

tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/233/
"

* 'br-gossiper-sanitize-api-2' of https://github.com/xemul/scylla:
  gossiper: Add underscores to new private members
  code: Indentation fix after previous patch
  gossiper, code: Relax get_up/down/all_counters() helpers
  api: Fix indentation after previous patch
  gossiper, api: Remove get_arrival_samples()
  gossiper, api: Remove get/set phi convict threshold helpers
  gossiper, api: Move get_simple_states() into API code
  gossiper: In-line std::optional<> get_endpoint_state_for_endpoint() overload
  gossiper, api: Remove get_endpoint_state() helpers
  gossiper: Make state and locks maps private
  gossiper: Remove dead code
2022-05-08 22:56:23 +03:00
Avi Kivity
94f677b790 Merge 'sstables/index_reader: short-circuit fast-forward-to when at EOF' from Botond Dénes
Attempting to call advance_to() on the index, after it is positioned at EOF, can result in an assert failure, because the operation results in an attempt to move backwards in the index-file (to read the last index page, which was already read). This only happens if the index cache entry belonging to the last index page is evicted, otherwise the advance operation just looks-up said entry and returns it. To prevent this, we add an early return conditioned on eof() to all the partition-level advance-to methods.
A regression unit test reproducing the above described crash is also added.

Fixes: #10403

Closes #10491

* github.com:scylladb/scylla:
  sstables/index_reader: short-circuit fast-forward-to when at EOF
  test/lib/random_schema: add a simpler overload for fixed partition count
2022-05-08 14:17:40 +03:00
Juliusz Stasiewicz
603dd72f9e CQL: Replace assert by exception on invalid auth opcode
One user observed this assertion fail, but it's an extremely rare event.
The root cause - interlacing of processing STARTUP and OPTIONS messages -
is still there, but now it's harmless enough to leave it as is.

Fixes #10487

Closes #10503
2022-05-08 11:33:58 +03:00
Michał Chojnowski
fb1a9e97c9 cql3: restrictions: statement_restrictions: pass arguments to std::bind_front by reference
Fix an accidental copy of query_options in range_or_slice_eq_null.

Closes #10511
2022-05-08 11:32:53 +03:00
Avi Kivity
1ecb87b7a8 Merge 'Harden table truncate' from Benny Halevy
This series fixes a few issue on the table truncate path:
- "memtable_list: safely futurize clear_and_add"
  - reinstates an async version of table::clear_and_add, just safe against #10421
- a unit test reproducing #10421 was added to make sure the new version is indeed safe.
- "table: clear: serialize with ongoing flush" fixes #10423
- a unit test reproducing #10423 was added

Fixes #10281
Fixes #10423

Test: unit(dev), database_test. test_truncate_without_snapshot_during_{writes,flushes} (debug)

Closes #10424

* github.com:scylladb/scylla:
  test: database_test: add test_truncate_without_snapshot_during_writes
  memtable_list: safely futurize clear_and_add
  table: clear: serialize with ongoing flush
2022-05-08 11:30:21 +03:00
Avi Kivity
287c01ab4d Merge ' sstables: consumer: reuse the fragmented_temporary_buffer in read_bytes()' from Michał Chojnowski
primitive_consumer::read_bytes() destroys and creates a vector for every value it reads.
This happens for every cell.
We can save a bit of work by reusing the vector.

Closes #10512

* github.com:scylladb/scylla:
  sstables: consumer: reuse the fragmented_temporary_buffer in read_bytes()
  utils: fragmented_temporary_buffer: add release()
2022-05-08 11:26:31 +03:00
Raphael S. Carvalho
8e99d3912e compaction: LCS: don't write to disengaged optional on compaction completion
Dtest triggers the problem by:
1) creating table with LCS
2) disabling regular compaction
3) writing a few sstables
4) running maintenance compaction, e.g. cleanup

Once the maintenance compaction completes, disengaged optional _last_compacted_keys
triggers an exception in notify_completion().

_last_compacted_keys is used by regular for its round-robin file picking
policy. It stores the last compacted key for each level. Meaning it's
irrelevant for any other compaction type.

Regular compaction is responsible for initializing it when it runs for
the first time to pick files. But with it disabled, notify_completion()
will find it uninitialized, therefore resulting in bad_optional_access.

To fix this, the procedure is skipped if _last_compacted_keys is
disengaged. Regular compaction, once re-enabled, will be able to
fill _last_compacted_keys by looking at metadata of the files.

compaction_test.py::TestCompaction::test_disable_autocompaction_doesnt_
block_user_initiated_compactions[CLEANUP-LeveledCompactionStrategy]
now passes.

Fixes #10378.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #10508
2022-05-08 11:23:13 +03:00
Raphael S. Carvalho
5682393693 compaction: Fix use-after-move when retrying maintenance compaction
SSTable was moved into descriptor, so on failure, it couldn't be used
without resulting in a segfault. Fix it by not moving sst, and changing
signature to make it explicit we don't want to move the content.

Fixes #10505.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #10506
2022-05-08 11:16:55 +03:00
Michał Chojnowski
ddc535a4a2 sstables: consumer: reuse the fragmented_temporary_buffer in read_bytes()
read_bytes destroys and creates a vector for every value it reads.
This happens for every cell.
We can save a bit of work by reusing the vector.
2022-05-07 13:04:16 +02:00
Michał Chojnowski
8cfbe9c9c1 utils: fragmented_temporary_buffer: add release()
Add a release() method to fragmented_temporary_buffer.
This method releases the underlying vector to allow for its reuse.
2022-05-07 13:04:16 +02:00
Pavel Emelyanov
9d364f19dc gossiper: Add underscores to new private members
The state map and guarding locks were moved to private and now should have a _ prefix

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-06 11:32:03 +03:00
Pavel Emelyanov
334d3434e7 code: Indentation fix after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-06 10:34:48 +03:00
Pavel Emelyanov
5ac28a29d3 gossiper, code: Relax get_up/down/all_counters() helpers
These helpers count elements in the endpoint state map. It makes sense
to keep them in gossiper API, but it's worth removing the wrappers that
do invoke_on(0). This makes code shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-05-06 10:34:48 +03:00