Commit Graph

375 Commits

Author SHA1 Message Date
Pavel Solodovnikov
95c8d65949 treewide: fix compilation issues with fmtlib 8.1.0+
Due to fd62fba985
scoped enums are not automatically converted to integers anymore,
this is the intended behavior, according to the fmtlib devs.

A bit nicer solution would be to use `std::to_underlying`
instead of a direct `static_cast`, but it's not available until
C++23 and some compilers are still missing the support for it.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-03-16 12:31:50 +03:00
Mikołaj Sielużycki
1d84a254c0 flat_mutation_reader: Split readers by file and remove unnecessary includes.
The flat_mutation_reader files were conflated and contained multiple
readers, which were not strictly necessary. Splitting optimizes both
iterative compilation times, as touching rarely used readers doesn't
recompile large chunks of codebase. Total compilation times are also
improved, as the size of flat_mutation_reader.hh and
flat_mutation_reader_v2.hh have been reduced and those files are
included by many file in the codebase.

With changes

real	29m14.051s
user	168m39.071s
sys	5m13.443s

Without changes

real	30m36.203s
user	175m43.354s
sys	5m26.376s

Closes #10194
2022-03-14 13:20:25 +02:00
Michał Sala
fff454761a messaging_service: add verb for count(*) request forwarding
Except for the verb addition, this commit also defines forward_request
and forward_result structures, used as an argument and result of the new
rpc. forward_request is used to forward information about select
statement that does count(*) (or other aggregating functions such as
max, min, avg in the future). Due to the inability to serialize
cql3::statements::select_statement, I chose to include
query::read_command, dht::partition_range_vector and some configuration
options in forward_request. They can be serialized and are sufficient
enough to allow creation of service::pager::query_pagers::pager.
2022-02-01 21:14:41 +01:00
Kamil Braun
cc0c54ea15 service: migration_manager: allow using MIGRATION_REQUEST verb to fetch group 0 history table
The MIGRATION_REQUEST verb is currently used to pull the contents of
schema tables (in the form of mutations) when nodes synchronize schemas.
We will (ab)use the verb to fetch additional data, such as the contents
of the group 0 history table, for purposes of group 0 snapshot transfer.

We extend `schema_pull_options` with a flag specifying that the puller
requests the additional data associated with group 0 snapshots. This
flag is `false` by default, so existing schema pulls will do what they
did before. If the flag is `true`, the migration request handler will
include the contents of group 0 history table.

Note that if a request is set with the flag set to `true`, that means
the entire cluster must have enabled the Raft feature, which also means
that the handler knows of the flag.
2022-01-24 15:20:37 +01:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes we applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Gleb Natapov
b1fea20d36 raft: move raft verbs to the IDL 2022-01-13 13:14:46 +02:00
Gleb Natapov
8a25b740df raft: split idl to rpc and storage
Storage uses only small part of the IDL, so it can include only the part
that is relevant to it.
2022-01-13 13:14:46 +02:00
Gleb Natapov
1db151bd75 storage_proxy: move all verbs to the IDL
Define all verbs in the IDL instead of manually codding them.
2022-01-10 14:58:28 +02:00
Gleb Natapov
ff6a0fffaf storage_proxy: convert more address vectors to inet_address_vector_replica_set 2022-01-10 13:48:20 +02:00
Avi Kivity
57188de09e Merge 'Make dc/rack encryption work for some cases where Nat hides ednpoint ips' from Eliran Sinvani
This is a consolidation of #9714 and #9709 PRs by @elcallio that were reviewed by @asias
The last comment on those was that they should be consolidated in order not to create a security degradation for
ec2 setups.
For some cases it is impossible to determine dc or rack association for nodes on outgoing connections.
One example is when some IPs are hidden behind Nat layer.
In some cases this creates problems where one side of the connection is aware of the rack/dc association where the
other doesn't.
The solution here is a two stage one:
1. First add a gossip reverse lookup that will help us determine the rack/dc association for a broader (hopefully all) range
 of setups and NAT situations.
2. When this fails - be more strict about downgrading a node which tries to ensure that both sides of the connection will at least
 downgrade the connection instead of just fail to start when it is not possible for one side to determine rack/dc association.

Fixes #9653
/cc @elcallio @asias

Closes #9822

* github.com:scylladb/scylla:
  messaging_service: Add reverse mapping of private ip -> public endpoint
  production_snitch_base: Do reverse lookup of endpoint for info
  messaging_service: Make dc/rack encryption check for connection more strict
2022-01-09 16:40:49 +02:00
Asias He
a8ad385ecd repair: Get rid of the gc_grace_seconds
The gc_grace_seconds is a very fragile and broken design inherited from
Cassandra. Deleted data can be resurrected if cluster wide repair is not
performed within gc_grace_seconds. This design pushes the job of making
the database consistency to the user. In practice, it is very hard to
guarantee repair is performed within gc_grace_seconds all the time. For
example, repair workload has the lowest priority in the system which can
be slowed down by the higher priority workload, so that there is no
guarantee when a repair can finish. A gc_grace_seconds value that is
used to work might not work after data volume grows in a cluster. Users
might want to avoid running repair during a specific period where
latency is the top priority for their business.

To solve this problem, an automatic mechanism to protect data
resurrection is proposed and implemented. The main idea is to remove the
tombstone only after the range that covers the tombstone is repaired.

In this patch, a new table option tombstone_gc is added. The option is
used to configure tombstone gc mode. For example:

1) GC a tombstone after gc_grace_seconds

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ;

This is the default mode. If no tombstone_gc option is specified by the
user. The old gc_grace_seconds based gc will be used.

2) Never GC a tombstone

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'};

3) GC a tombstone immediately

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'};

4) GC a tombstone after repair

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};

In addition to the 'mode' option, another option 'propagation_delay_in_seconds'
is added. It defines the max time a write could possibly delay before it
eventually arrives at a node.

A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc
option can only be used after the whole cluster supports the new
feature. A mixed cluster works with no problem.

Tests: compaction_test.py, ninja test

Fixes #3560

[avi: resolve conflicts vs data_dictionary]
2022-01-04 19:48:14 +02:00
Calle Wilund
73c4a2f42b messaging_service: Add reverse mapping of private ip -> public endpoint
For quick reverse lookup.

(cherry picked from commit c86296f2a8)
2022-01-04 15:14:58 +02:00
Calle Wilund
4778770814 messaging_service: Make dc/rack encryption check for connection more strict
Fixes #9653

When doing an outgoing connection, in a internode_encryption=dc/rack situation
we should not use endpoint/local broadcast solely to determine if we can
downgrade a connection.

If gossip/message_service determines that we will connect to a different
address than the "official" endpoint address, we should use this to determine
association of target node, and similarly, if we bind outgoing connection
to interface != bc we need to use this to decide local one.

Note: This will effectively _disable_ internode_encryption=dc/rack on ec2 etc
until such time that gossip can give accurate info on dc/rack for "internal"
ip addresses of nodes.
2021-12-20 06:20:46 +02:00
Konstantin Osipov
c22f945f11 raft: (service) manage Raft configuration during topology changes
Operations of adding or removing a node to Raft configuration
are made idempotent: they do nothing if already done, and
they are safe to resume after a failure.

However, since topology changes are not transactional, if a
bootstrap or removal procedure fails midway, Raft group 0
configuration may go out of sync with topology state as seen by
gossip.

In future we must change gossip to avoid making any persistent
changes to the cluster: all changes to persistent topology state
will be done exclusively through Raft Group 0.

Specifically, instead of persisting the tokens by advertising
them through gossip, the bootstrap will commit a change to a system
table using Raft group 0. nodetool will switch from looking at
gossip-managed tables to consulting with Raft Group 0 configuration
or Raft-managed tables.
Once this transformation is done, naturally, adding a node to Raft
configuration (perhaps as a non-voting member at first) will become the
first persistent change to ring state applied when a node joins;
removing a node from the Raft Group 0 configuration will become the last
action when removing a node.

Until this is done, do our best to avoid a cluster state when
a removed node or a node which addition failed is stuck in Raft
configuration, but the node is no longer present in gossip-managed
system tables. In other words, keep the gossip the primary source of
truth. For this purpose, carefully chose the timing when we
join and leave Raft group 0:

Join the Raft group 0 only after we've advertised our tokens, so the
cluster is aware of this node, it's visible in nodetool status,
but before node state jumps to "normal", i.e. before it accepts
queries. Since the operation is idempotent, invoke it on each
restart.

Remove the node from Group 0 *before* its tokens are removed
from gossip-managed system tables. This guarantees
that if removal from Raft group 0 fails for whatever reason,
the node stays in the ring, so nodetool removenode and
friends are re-tried.

Add tracing.
2021-11-25 12:35:42 +03:00
Konstantin Osipov
e3751068fe raft: (server) allow adding entries/modify config on a follower
Implement an RPC to forward add_entry calls from the follower
to leader. Bounce & retry in case of not_a_leader.
Do not retry in case of uncertainty - this can lead to adding
duplicate entries.

The feature is added to core Raft since it's needed by
all current clients - both topology and schema changes.

When forwarding an entry to a remote leader we may get back
a term/index pair that conflicts (has the same index, but is with
a higher term) with a local entry we're still waiting on.

This can happen, e.g. because there was a leader change and the
log was truncated, but we still haven't got the append_entries
RPC from the new leader, still haven't truncated the log locally,
still haven't aborted all the local waits for truncated entries.

Only remove the offending entry from the wait list and abort it.
There may be entries labeled with an older term to the right (with
higher commit index) of the conflicting entry. However, finding them,
would require a linear scan. If we allow it, we may end up doing this
linear scan for *every* conflicting entry during the transition
period, which brings us to N^2 complexity of this step. At the
same time, as soon as append_entries that commits a higher-term
entry with the same index reaches the follower, the waits
for the respective truncated entry will be aborted anyway (see
notify_waiters() which sets dropped_entry exception), so the scan
is unnecessary.

Similarly to being able to add entries, allow to modify
Raft group configuration on a follower. The implementation
works the same way as adding entries - forwards the command
to the leader.

Now that add_entry() or modify_config never throws not_a_leader,
it's more likely to throw timed_out_error, e.g. in case the
network is partitioned. Previously it was only possible due to a
semaphore wait timeout, and this scenario was not tested.
Handle timed_out_error on RPC level to let the existing tests
(specifically the randomized nemesis test) pass.
2021-11-25 11:50:38 +03:00
Benny Halevy
ff18c0c14c messaging_service: remove unused include of db/system_keyspace.hh
As a followup to eba20c7e5d
"messaging_service: init_local_preferred_ip_cache: get preferred ips from caller".

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211123080457.1247970-1-bhalevy@scylladb.com>
2021-11-23 11:12:36 +03:00
Benny Halevy
ce9836e2fd messaging_service: init_local_preferred_ip_cache: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211119143523.3424773-2-bhalevy@scylladb.com>
2021-11-22 13:29:21 +03:00
Benny Halevy
eba20c7e5d messaging_service: init_local_preferred_ip_cache: get preferred ips from caller
To avoid back-calling the system_keyspace from the messaging layer
let the system_keyspace get the preferred ips vector and pass it
down to the messaging_service.

This is part of the effort to deglobalize the system keyspace
and query context.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211119143523.3424773-1-bhalevy@scylladb.com>
2021-11-22 13:29:17 +03:00
Avi Kivity
0ea79559a6 Merge 'IDL: support generating boilerplate code for RPC verbs' from Pavel Solodovnikov
Introduce new syntax in IDL compiler to allow generating
registration/sending code for RPC verbs:

```
        verb [[attr1, attr2...] my_verb (args...) -> return_type;
```

`my_verb` RPC verb declaration corresponds to the
`netw::messaging_verb::MY_VERB` enumeration value to identify the
new RPC verb.

For a given `idl_module.idl.hh` file, a registrator class named
`idl_module_rpc_verbs` will be created if there are any RPC verbs
registered within the IDL module file.

These are the methods being created for each RPC verb:

```
        static void register_my_verb(netw::messaging_service* ms, std::function<return_type(args...)>&&);
        static future<> unregister_my_verb(netw::messaging_service* ms);
        static future<> send_my_verb(netw::messaging_service* ms, netw::msg_addr id, args...);
```

Each method accepts a pointer to an instance of `messaging_service`
object, which contains the underlying seastar RPC protocol
implementation, that is used to register verbs and pass messages.

There is also a method to unregister all verbs at once:

```
        static future<> unregister(netw::messaging_service* ms);
```

The following attributes are supported when declaring an RPC verb
in the IDL:
* `[[with_client_info]]` - the handler will contain a const reference to
  an `rpc::client_info` as the first argument.
* `[[with_timeout]]` - an additional `time_point` parameter is supplied
  to the handler function and `send*` method uses `send_message_*_timeout`
  variant of internal function to actually send the message.
* `[[one_way]]` - the handler function is annotated by
  `future<rpc::no_wait_type>` return type to designate that a client
  doesn't need to wait for an answer.

The `-> return_type` clause is optional for two-way messages. If omitted,
the return type is set to be `future<>`.
For one-way verbs, the use of return clause is prohibited and the
signature of `send*` function always returns `future<>`.

No existing code is affected.

Ref: #1456

Closes #9359

* github.com:scylladb/scylla:
  idl: support generating boilerplate code for RPC verbs
  idl: allow specifying multiple attributes in the grammar
  message: messaging_service: extract RPC protocol details and helpers into a separate header
2021-10-05 18:05:24 +03:00
Asias He
1657e7be14 gossiper: Send generation number with shutdown message
Consider:
- n1, n2 in the cluster
- n2 shutdown
- n2 sends gossip shutdown message to n1
- n1 delays processing of the handler of shutdown message
- n2 restarts
- n1 learns new gossip state of n2
- n1 resumes to handle the shutdown message
- n1 will mark n2 as shutdown status incorrectly until n2 restarts again

To prevent this, we can send the gossip generation number along with the
shutdown message. If the generation number does not match the local
generation number for the remote node, the shutdown message will be
ignored.

Since we use the rpc::optional to send the generation number, it works
with mixed cluster.

Fixes #8597

Closes #9381
2021-09-27 11:08:43 +03:00
Pavel Emelyanov
598841a5dd code: Expell gossiper.hh from other headers
This needs to add forward declarations of the gossiper class and
re-include some other headers here and there.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-22 13:13:06 +03:00
Pavel Emelyanov
a4118a70ee database, messaging: Delete old connection drop notification
Database no longer needs it. Since the only user of the old-style
notification is gone -- remove it as well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-15 17:49:06 +03:00
Pavel Emelyanov
dd498273dc messaging, proxy: Notify connection drops with boost signal
The messaging_service keeps track of a list of connection-drop
listeners. This list is not auto-removing and is thus not safe
on stop (fortunately there's only 1 non-stopping client of it
so far).

This patch adds a safter notification based on boost/signals.
Also storage_proxy is subscribed on it in advance to demonstrate
how it looks like altogether and make next patch shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-15 17:49:06 +03:00
Pavel Solodovnikov
7a8cadcca8 message: messaging_service: extract RPC protocol details and helpers into a separate header
Introduce a new header `message/rpc_protocol_impl.hh`, move here
the following things from `message/messaging_service.cc`:

* RPC protocol wrappers implementation
* Serialization thunks
* `register_handler` and `send_message*` functions

This code will be used later for IDL-generated RPC verbs
implementation.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-09-15 17:47:11 +03:00
Avi Kivity
705f957425 Merge "Generalize TLS creds builder configuration" from Pavel E
"
There are 4 places out there that do the same steps parsing
"client_|server_encryption_options" and configuring the
seastar::tls::creds_builder with the values (messaging, redis,
alternator and transport).

Also to make redis and transport look slimmer main() cleans
the client_encryption_options by ... parsing it too.

This set introduces a (coroutinized) helper to configure the
creds_builder with map<string, string> and removes the options
beautification from main.

tests: unit(dev), dtest.internode_ssl_test(dev)
"

* 'br-generalize-tls-creds-builder-configuration' of https://github.com/xemul/scylla:
  code: Generalize tls::credentials_builder configuration
  transport, redis: Do not assume fixed encryption options
  messaging: Move encryption options parsing to ms
  main: Open-code internode encryption misconfig warning
  main, config: Move options parsing helpers
2021-09-01 14:19:19 +03:00
Gleb Natapov
03a266d73b raft: make read_barrier work on a follower as well as on a leader
This patch implements RAFT extension that allows to perform linearisable
reads by accessing local state machine. The extension is described
in section 6.4 of the PhD. To sum it up to perform a read barrier on
a follower it needs to asks a leader the last committed index that it
knows about. The leader must make sure that it is still a leader before
answering by communicating with a quorum. When follower gets the index
back it waits for it to be applied and by that completes read_barrier
invocation.

The patch adds three new RPC: read_barrier, read_barrier_reply and
execute_read_barrier_on_leader. The last one is the one a follower uses
to ask a leader about safe index it can read. First two are used by a
leader to communicate with a quorum.
2021-08-25 08:57:13 +03:00
Pavel Emelyanov
e02b39ca3d code: Generalize tls::credentials_builder configuration
All the places in code that configure the mentioned creds builder
from client_|server_encryption_options now do it the same way.
This patch generalizes it all in the utils:: helper.

The alternator code "ignores" require_client_auth and truststore
keys, but it's easy to make the generalized helper be compatible.

Also make the new helper coroutinized from the beginning.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-20 18:05:41 +03:00
Pavel Emelyanov
2f5941ca6f messaging: Move encryption options parsing to ms
Main collects a bunch of local variables from config and passes
them as arguments to messaging service initialization helper.
This patch replaces all these args with const config reference.

The motivation is to facilitate next patching by providing the
server encryption options k:v set right in the m.s. init code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-08-20 17:56:16 +03:00
Piotr Dulikowski
0d74dee683 Revert "messaging_service: add verbs for hint sync points"
This reverts commit 82c419870a.

This commit removes the HINT_SYNC_POINT_CREATE and HINT_SYNC_POINT_CHECK
rpc verbs.

The upcoming HTTP API for waiting for hint replay will be restricted
to waiting for hints on the node handling the request, so there is no
need for new verbs.
2021-08-09 09:24:36 +02:00
Calle Wilund
b8b5f69111 messaging_service: Bind to listen address, not broadcast
Refs #8418

Broadcast can (apparently) be an address not actually on machine, but
on the other side of NAT. Thus binding local side of outgoing
connection there will fail.
Bind instead to listen_address (or broadcast, if listen_to_broadcast),
this will require routing + NAT to make the connection looking
like from broadcast from node connected to, to allow the connection
(if using partial encryption).

Note: this is somewhat verified somewhat limitedly. I would suggest
verifying various multi rack/dc setups before relying on it.

Closes #8974
2021-07-15 13:18:10 +03:00
Avi Kivity
9059514335 build, treewide: enable -Wpessimizing-move warning
This warning prevents using std::move() where it can hurt
- on an unnamed temporary or a named automatic variable being
returned from a function. In both cases the value could be
constructed directly in its final destination, but std::move()
prevents it.

Fix the handful of cases (all trivial), and enable the warning.

Closes #8992
2021-07-08 17:52:34 +03:00
Benny Halevy
51bc6c8b5a messaging_service: do_start_listen: improve info log accuracy
Make sure to log the info message when we actually
start listening.

Also, print a log message when listening on the
broadcast address.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-30 16:25:21 +03:00
Benny Halevy
df442d4d24 messaging_service: never listen on port 0
We never want to listen on port 0, even if configured so.
When the listen port is set to 0, the OS will choose the
port randomly, which makes it useless for communicating
with other nodes in the cluster, since we don't support that.

Also, it causes the listen_ports_conf_test internode_ssl_test
to fail since it expects to disable listening on storage_port
or ssl_storage_port when set to 0, as seen in
https://github.com/scylladb/scylla-dtest/issues/2174.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-30 16:24:54 +03:00
Pavel Emelyanov
3552e99ce7 scylla-gdb: Bring scylla netw back to work
The netw command tries to access the netw::_the_messaging_service that
was removed long ago. The correct place for the messaging service is
in debug:: namespace.

The scylla-gdb test checks that, but the netw command sees that the ptr
in question is not initialized, thinks it's not yet sharded::start()-ed
and exits without errors.

tests: unit(gdb)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210624135107.12375-1-xemul@scylladb.com>
2021-06-24 20:59:27 +03:00
Avi Kivity
a55b434a2b treewide: extent copyright statements to present day 2021-06-06 19:18:49 +03:00
Konstantin Osipov
ac43941f17 rpc: don't include an unused header (raft_services.hh)
Message-Id: <20210525183919.1395607-7-kostja@scylladb.com>
2021-05-26 11:07:44 +03:00
Asias He
425e3b1182 gossip: Introduce direct failure detector
Currently, gossip uses the updates of the gossip heartbeat from gossip
messages to decide if a node is up or down. This means if a node is
actually down but the gossip messages are delayed in the network, the
marking of node down can be delayed.

For example, a node sends 20 gossip messages in 20 seconds before it
is dead. Each message is delayed 15 seconds by the network for some
reason. A node receives those delayed messages one after another.
Those delayed messages will prevent this node from being marked as down.
Because heartbeat update is received just before the threshold to mark a
node down is triggered which is around 20 seconds by default.

As a result, this node will not be marked as down in 20 * 15 seconds =
300 seconds, much longer than the ~20 seconds node down detection time
in normal cases.

In this patch, a new failure detector is implemented.

- Direct detection

The existing failure detector can get gossip heartbeat updates
indirectly.  For example:

Node A can talk to Node B
Node B can talk to Node C
Node A can not talk to Node C, due to network issues

Node A will not mark Node B to be down because Node A can get heart beat
of Node C from node B indirectly.

This indirect detection is not very useful because when Node A decides
if it should send requests to Node C, the requests from Node A to C will
fail while Node A thinks it can communicate with Node C.

This patch changes the failure detection to be direct. It uses the
existing gossip echo message to detect directly. Gossip echo messages
will be sent to peer nodes periodically. A peer node will be marked as
down if a timeout threshold has been meet.

Since the failure detection is peer to peer, it avoids the delayed
message issue mentioned above.

- Parallel detection

The old failure detector uses shard zero only. This new failure detector
utilizes all the shards to perform the failure detection, each shard
handling a subset of live nodes. For example, if the cluster has 32
nodes and each node has 16 shards, each shard will handle only 2 nodes.
With a 16 nodes cluster, each node has 16 shards, each shard will handle
only one peer node.

A gossip message will be sent to peer nodes every 2 seconds. The extra
echo messages traffic produced compared to the old failure detector is
negligible.

- Deterministic detection

Users can configure the failure_detector_timeout_in_ms to set the
threshold to mark a node down. It is the maximum time between two
successful echo message before gossip marks a node down. It is easier to
understand than the old phi_convict_threshold.

- Compatible

This patch only uses the existing gossip echo message. Nodes with or without
this patch can work together.

Fixes #8488

Closes #8036
2021-05-24 10:47:06 +03:00
Avi Kivity
cea5493cb7 storage_proxy, treewide: introduce names for vectors of inet_address
storage_proxy works with vectors of inet_addresses for replica sets
and for topology changes (pending endpoints, dead nodes). This patch
introduces new names for these (without changing the underlying
type - it's still std::vector<gms::inet_address>). This is so that
the following patch, that changes those types to utils::small_vector,
will be less noisy and highlight the real changes that take place.
2021-05-05 18:36:48 +03:00
Avi Kivity
6ffd813b7b Merge 'hints: delay repair until hints are replayed' from Piotr Dulikowski
Both hinted handoff and repair are meant to improve the consistency of the cluster's data. HH does this by storing records of failed replica writes and replaying them later, while repair goes through all data on all participaring replicas and makes sure the same data is stored on all nodes. The former is generally cheaper and sometimes (but not always) can bring back full consistency on its own; repair, while being more costly, is a sure way to bring back current data to full consistency.

When hinted handoff and repair are running at the same time, some of the work can be unnecessarily duplicated. For example, if a row is repaired first, then hints towards it become unnecessary. However, repair needs to do less work if data already has good consistency, so if hints finish first, then the repair will be shorter.

This PR introduces a possibility to wait for hints to be replayed before continuing with user-issued repair. The coordinator of the repair operation asks all nodes participating in the repair operation (including itself) to mark a point at the end of all hint queues pointing towards other nodes participating in repair. Then, it waits until hint replay in all those queues reaches marked point, or configured timeout is reached.

This operation is currently opt-in and can be turned on by setting the `wait_for_hint_replay_before_repair_in_ms` config option to a positive value.

Fixes #8102

Tests:
- unit(dev)
- some manual tests:
    - shutting down repair coordinator during hints replay,
    - shutting down node participating in repair during hints replay,

Closes #8452

* github.com:scylladb/scylla:
  repair: introduce abort_source for repair abort
  repair: introduce abort_source for shutdown
  storage_proxy: add abort_source to wait_for_hints_to_be_replayed
  storage_proxy: stop waiting for hints replay when node goes down
  hints: dismiss segment waiters when hint queue can't send
  repair: plug in waiting for hints to be sent before repair
  repair: add get_hosts_participating_in_repair
  storage_proxy: coordinate waiting for hints to be sent
  config: add wait_for_hint_replay_before_repair option
  storage_proxy: implement verbs for hint sync points
  messaging_service: add verbs for hint sync points
  storage_proxy: add functions for syncing with hints queue
  db/hints: make it possible to wait until current hints are sent
  db/hints: add a metric for counting processed files
  db/hints: allow to forcefully update segment list on flush
2021-05-03 18:47:27 +03:00
Pavel Solodovnikov
4c351ff260 raft: switch group_id type from uint64_t to utils::UUID
Introduce a tagged id struct for `group_id`.

Raft code would want to generate quite a lot of unique
raft groups in the future (e.g. tablets). UUID is designed
exactly for that (e.g. larger capacity than `uint64_t`, obviously,
and also has built-in procedures to generate random ids).

Also, this is a preparation to make "raft group 0" use a random
ID instead of a literal fixed `0` as a group id.

The purpose is that every scylla cluster must have a unique ID
for "raft group 0" since we don't want the nodes from some other
cluster to disrupt the current cluster. This can happen if,
for some reason, a foreign node happens to contact a node in
our cluster.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210429170630.533596-3-pa.solodovnikov@scylladb.com>
2021-05-02 16:39:54 +03:00
Eliran Sinvani
0320110b04 messaging service: be more verbose when shutting down servers and clients
We encountered a phenomena where shutting down the messaging service
don't complete, leaving the shutdown process stuck. Since we couldn't
pinpoint where exactly the shutdown went wrong, here we add some
verbosity to the shutdown stages so we can more accurately pinpoint the
culprit.

Closes #8560
2021-04-29 12:28:17 +03:00
Piotr Dulikowski
82c419870a messaging_service: add verbs for hint sync points
Adds two verbs: HINT_SYNC_POINT_CREATE and HINT_SYNC_POINT_CHECK.
Those will make it possible to create a sync point and regularly poll
to check its existence.
2021-04-27 15:06:39 +02:00
Avi Kivity
0af7a22c21 repair: remove partition_checksum and related code
80ebedd242 made row-level repair mandatory, so there remain no
callers to partition_checksum. Remove it.

Closes #8537
2021-04-22 18:56:53 +03:00
Asias He
9ea57dff21 gossip: Relax failure detector update
We currently only update the failure detector for a node when a higher
version of application state is received. Since gossip syn messages do
not contain application state, so this means we do not update the
failure detector upon receiving gossip syn messages, even if a message
from peer node is received which implies the peer node is alive.

This patch relaxes the failure detector update rule to update the
failure detector for the sender of gossip messages directly.

Refs #8296

Closes #8476
2021-04-14 13:16:00 +02:00
Avi Kivity
82c76832df treewide: don't include "db/system_distributed_keyspace.hh" from headers
This just causes unneeded and slower recompliations. Instead replace
with forward declarations, or includes of smaller headers that were
incidentally brought in by the one removed. The .cc files that really
need it gain the include, but they are few.

Ref #1.

Closes #8403
2021-04-04 14:00:26 +03:00
Avi Kivity
40b60e8f09 Merge 'repair: Switch to use NODE_OPS_CMD for replace operation' from Asias He
In commit c82250e0cf (gossip: Allow deferring
advertise of local node to be up), the replacing node is changed to postpone
the responding of gossip echo message to avoid other nodes sending read
requests to the replacing node. It works as following:

1) replacing node does not respond echo message to avoid other nodes to
   mark replacing node as alive

2) replacing node advertises hibernate state so other nodes knows
   replacing node is replacing

3) replacing node responds echo message so other nodes can mark
   replacing node as alive

This is problematic because after step 2, the existing nodes in the
cluster will start to send writes to the replacing node, but at this
time it is possible that existing nodes haven't marked the replacing
node as alive, thus failing the write request unnecessarily.

For instance, we saw the following errors in issue #8013 (Cassandra
stress fails to achieve consistency when only one of the nodes is down)

```
scylla:
[shard 1] consistency - Live nodes 2 do not satisfy ConsistencyLevel (2
required, 1 pending, live_endpoints={127.0.0.2, 127.0.0.1},
pending_endpoints={127.0.0.3}) [shard 0] gossip - Fail to send
EchoMessage to 127.0.0.3: std::runtime_error (Not ready to respond
gossip echo message)

c-s:
java.io.IOException: Operation x10 on key(s) [4c4f4d37324c35304c30]:
Error executing: (UnavailableException): Not enough replicas available
for query at consistency QUORUM (2 required but only 1 alive
```

To solve this problem, we can do the replacing operation in multiple stages.

One solution is to introduce a new gossip status state as proposed
here: gossip: Introduce STATUS_PREPARE_REPLACE #7416

1) replacing node does not respond echo message

2) replacing node advertises prepare_replace state (Remove replacing
   node from natural endpoint, but do not put in pending list yet)

3) replacing node responds echo message

4) replacing node advertises hibernate state (Put replacing node in
   pending list)

Since we now have the node ops verb introduced in
829b4c1438 (repair: Make removenode safe
by default), we can do the multiple stage without introducing a new
gossip status state.

This patch uses the NODE_OPS_CMD infrastructure to implement replace
operation.

Improvements:

1) It solves the race between marking replacing node alive and sending
   writes to replacing node

2) The cluster reverts to a state before the replace operation
   automatically in case of error. As a result, it solves when the
   replacing node fails in the middle of the operation, the repacing
   node will be in HIBERNATE status forever issue.

3) The gossip status of the node to be replaced is not changed until the
   replace operation is successful. HIBERNATE gossip status is not used
   anymore.

4) Users can now pass a list of dead nodes to ignore explicitly.

Fixes #8013

Closes #8330

* github.com:scylladb/scylla:
  repair: Switch to use NODE_OPS_CMD for replace operation
  gossip: Add advertise_to_nodes
  gossip: Add helper to wait for a node to be up
  gossip: Add is_normal_ring_member helper
2021-04-04 12:54:09 +03:00
Asias He
bdb95233e8 gossip: Add advertise_to_nodes
gossiper::advertise_to_nodes() is added to allow respond to gossip echo
message with specified nodes and the current gossip generation number
for the nodes.

This is helpful to avoid the restarted node to be marked as alive during
a pending replace operation.

After this patch, when a node sends a echo message, the gossip
generation number is sent in the echo message. Since the generation
number changes after a restart, the receiver of the echo message can
compare the generation number to tell if the node has restarted.

Refs #8013
2021-04-01 09:38:54 +08:00
Gleb Natapov
9d6bf7f351 raft: introduce leader stepdown procedure
Section 3.10 of the PhD describes two cases for which the extension can
be helpful:

1. Sometimes the leader must step down. For example, it may need to reboot
 for maintenance, or it may be removed from the cluster. When it steps
 down, the cluster will be idle for an election timeout until another
 server times out and wins an election. This brief unavailability can be
 avoided by having the leader transfer its leadership to another server
 before it steps down.

2. In some cases, one or more servers may be more suitable to lead the
 cluster than others. For example, a server with high load would not make
 a good leader, or in a WAN deployment, servers in a primary datacenter
 may be preferred in order to minimize the latency between clients and
 the leader. Other consensus algorithms may be able to accommodate these
 preferences during leader election, but Raft needs a server with a
 sufficiently up-to-date log to become leader, which might not be the
 most preferred one. Instead, a leader in Raft can periodically check
 to see whether one of its available followers would be more suitable,
 and if so, transfer its leadership to that server. (If only human leaders
 were so graceful.)

The patch here implements the extension and employs it automatically
when a leader removes itself from a cluster.
2021-03-22 10:28:43 +02:00
Konstantin Osipov
4afa662d62 raft: respond with snapshot_reply to send_snapshot RPC
Raft send_snapshot RPC is actually two-way, the follower
responds with snapshot_reply message. This message until now
was, however, muted by RPC.

Do not mute snapshot_reply any more:
- to make it obvious the RPC is two way
- to feed the follower response directly into leader's FSM and
  thus ensure that FSM testing results produced when using a test
  transport are representative of the real world uses of
  raft::rpc.
2021-03-18 16:56:42 +03:00
Calle Wilund
a0745f9498 messaging_service: Enforce dc/rack membership iff required for non-tls connections
When internode_encryption is "rack" or "dc", we should enforce incoming
connections are from the appropriate address spaces iff answering on
non-tls socket.

This is implemented by having two protocol handlers. One for tls/full notls,
and one for mixed (needs checking) connections. The latter will ask
snitch if remote address is kosher, and refuse the connection otherwise.

Note: requires seastar patches:
"rpc: Make is possible for rpc server instance to refuse connection"
"RPC: (client) retain local address and use on stream creation"

Note that ip-level checks are not exhaustive. If a user is also using
"require_client_auth" with dc/rack tls setting we should warn him that
there is a possibility that someone could spoof himself pass the
authentication.

Closes #8051
2021-03-17 09:59:22 +02:00