Due to fd62fba985
scoped enums are not automatically converted to integers anymore,
this is the intended behavior, according to the fmtlib devs.
A bit nicer solution would be to use `std::to_underlying`
instead of a direct `static_cast`, but it's not available until
C++23 and some compilers are still missing the support for it.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The flat_mutation_reader files were conflated and contained multiple
readers, which were not strictly necessary. Splitting optimizes both
iterative compilation times, as touching rarely used readers doesn't
recompile large chunks of codebase. Total compilation times are also
improved, as the size of flat_mutation_reader.hh and
flat_mutation_reader_v2.hh have been reduced and those files are
included by many file in the codebase.
With changes
real 29m14.051s
user 168m39.071s
sys 5m13.443s
Without changes
real 30m36.203s
user 175m43.354s
sys 5m26.376s
Closes#10194
Except for the verb addition, this commit also defines forward_request
and forward_result structures, used as an argument and result of the new
rpc. forward_request is used to forward information about select
statement that does count(*) (or other aggregating functions such as
max, min, avg in the future). Due to the inability to serialize
cql3::statements::select_statement, I chose to include
query::read_command, dht::partition_range_vector and some configuration
options in forward_request. They can be serialized and are sufficient
enough to allow creation of service::pager::query_pagers::pager.
The MIGRATION_REQUEST verb is currently used to pull the contents of
schema tables (in the form of mutations) when nodes synchronize schemas.
We will (ab)use the verb to fetch additional data, such as the contents
of the group 0 history table, for purposes of group 0 snapshot transfer.
We extend `schema_pull_options` with a flag specifying that the puller
requests the additional data associated with group 0 snapshots. This
flag is `false` by default, so existing schema pulls will do what they
did before. If the flag is `true`, the migration request handler will
include the contents of group 0 history table.
Note that if a request is set with the flag set to `true`, that means
the entire cluster must have enabled the Raft feature, which also means
that the handler knows of the flag.
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes we applied mechanically with a script, except to
licenses/README.md.
Closes#9937
This is a consolidation of #9714 and #9709 PRs by @elcallio that were reviewed by @asias
The last comment on those was that they should be consolidated in order not to create a security degradation for
ec2 setups.
For some cases it is impossible to determine dc or rack association for nodes on outgoing connections.
One example is when some IPs are hidden behind Nat layer.
In some cases this creates problems where one side of the connection is aware of the rack/dc association where the
other doesn't.
The solution here is a two stage one:
1. First add a gossip reverse lookup that will help us determine the rack/dc association for a broader (hopefully all) range
of setups and NAT situations.
2. When this fails - be more strict about downgrading a node which tries to ensure that both sides of the connection will at least
downgrade the connection instead of just fail to start when it is not possible for one side to determine rack/dc association.
Fixes#9653
/cc @elcallio @asias
Closes#9822
* github.com:scylladb/scylla:
messaging_service: Add reverse mapping of private ip -> public endpoint
production_snitch_base: Do reverse lookup of endpoint for info
messaging_service: Make dc/rack encryption check for connection more strict
The gc_grace_seconds is a very fragile and broken design inherited from
Cassandra. Deleted data can be resurrected if cluster wide repair is not
performed within gc_grace_seconds. This design pushes the job of making
the database consistency to the user. In practice, it is very hard to
guarantee repair is performed within gc_grace_seconds all the time. For
example, repair workload has the lowest priority in the system which can
be slowed down by the higher priority workload, so that there is no
guarantee when a repair can finish. A gc_grace_seconds value that is
used to work might not work after data volume grows in a cluster. Users
might want to avoid running repair during a specific period where
latency is the top priority for their business.
To solve this problem, an automatic mechanism to protect data
resurrection is proposed and implemented. The main idea is to remove the
tombstone only after the range that covers the tombstone is repaired.
In this patch, a new table option tombstone_gc is added. The option is
used to configure tombstone gc mode. For example:
1) GC a tombstone after gc_grace_seconds
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ;
This is the default mode. If no tombstone_gc option is specified by the
user. The old gc_grace_seconds based gc will be used.
2) Never GC a tombstone
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'};
3) GC a tombstone immediately
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'};
4) GC a tombstone after repair
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};
In addition to the 'mode' option, another option 'propagation_delay_in_seconds'
is added. It defines the max time a write could possibly delay before it
eventually arrives at a node.
A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc
option can only be used after the whole cluster supports the new
feature. A mixed cluster works with no problem.
Tests: compaction_test.py, ninja test
Fixes#3560
[avi: resolve conflicts vs data_dictionary]
Fixes#9653
When doing an outgoing connection, in a internode_encryption=dc/rack situation
we should not use endpoint/local broadcast solely to determine if we can
downgrade a connection.
If gossip/message_service determines that we will connect to a different
address than the "official" endpoint address, we should use this to determine
association of target node, and similarly, if we bind outgoing connection
to interface != bc we need to use this to decide local one.
Note: This will effectively _disable_ internode_encryption=dc/rack on ec2 etc
until such time that gossip can give accurate info on dc/rack for "internal"
ip addresses of nodes.
Operations of adding or removing a node to Raft configuration
are made idempotent: they do nothing if already done, and
they are safe to resume after a failure.
However, since topology changes are not transactional, if a
bootstrap or removal procedure fails midway, Raft group 0
configuration may go out of sync with topology state as seen by
gossip.
In future we must change gossip to avoid making any persistent
changes to the cluster: all changes to persistent topology state
will be done exclusively through Raft Group 0.
Specifically, instead of persisting the tokens by advertising
them through gossip, the bootstrap will commit a change to a system
table using Raft group 0. nodetool will switch from looking at
gossip-managed tables to consulting with Raft Group 0 configuration
or Raft-managed tables.
Once this transformation is done, naturally, adding a node to Raft
configuration (perhaps as a non-voting member at first) will become the
first persistent change to ring state applied when a node joins;
removing a node from the Raft Group 0 configuration will become the last
action when removing a node.
Until this is done, do our best to avoid a cluster state when
a removed node or a node which addition failed is stuck in Raft
configuration, but the node is no longer present in gossip-managed
system tables. In other words, keep the gossip the primary source of
truth. For this purpose, carefully chose the timing when we
join and leave Raft group 0:
Join the Raft group 0 only after we've advertised our tokens, so the
cluster is aware of this node, it's visible in nodetool status,
but before node state jumps to "normal", i.e. before it accepts
queries. Since the operation is idempotent, invoke it on each
restart.
Remove the node from Group 0 *before* its tokens are removed
from gossip-managed system tables. This guarantees
that if removal from Raft group 0 fails for whatever reason,
the node stays in the ring, so nodetool removenode and
friends are re-tried.
Add tracing.
Implement an RPC to forward add_entry calls from the follower
to leader. Bounce & retry in case of not_a_leader.
Do not retry in case of uncertainty - this can lead to adding
duplicate entries.
The feature is added to core Raft since it's needed by
all current clients - both topology and schema changes.
When forwarding an entry to a remote leader we may get back
a term/index pair that conflicts (has the same index, but is with
a higher term) with a local entry we're still waiting on.
This can happen, e.g. because there was a leader change and the
log was truncated, but we still haven't got the append_entries
RPC from the new leader, still haven't truncated the log locally,
still haven't aborted all the local waits for truncated entries.
Only remove the offending entry from the wait list and abort it.
There may be entries labeled with an older term to the right (with
higher commit index) of the conflicting entry. However, finding them,
would require a linear scan. If we allow it, we may end up doing this
linear scan for *every* conflicting entry during the transition
period, which brings us to N^2 complexity of this step. At the
same time, as soon as append_entries that commits a higher-term
entry with the same index reaches the follower, the waits
for the respective truncated entry will be aborted anyway (see
notify_waiters() which sets dropped_entry exception), so the scan
is unnecessary.
Similarly to being able to add entries, allow to modify
Raft group configuration on a follower. The implementation
works the same way as adding entries - forwards the command
to the leader.
Now that add_entry() or modify_config never throws not_a_leader,
it's more likely to throw timed_out_error, e.g. in case the
network is partitioned. Previously it was only possible due to a
semaphore wait timeout, and this scenario was not tested.
Handle timed_out_error on RPC level to let the existing tests
(specifically the randomized nemesis test) pass.
To avoid back-calling the system_keyspace from the messaging layer
let the system_keyspace get the preferred ips vector and pass it
down to the messaging_service.
This is part of the effort to deglobalize the system keyspace
and query context.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211119143523.3424773-1-bhalevy@scylladb.com>
Introduce new syntax in IDL compiler to allow generating
registration/sending code for RPC verbs:
```
verb [[attr1, attr2...] my_verb (args...) -> return_type;
```
`my_verb` RPC verb declaration corresponds to the
`netw::messaging_verb::MY_VERB` enumeration value to identify the
new RPC verb.
For a given `idl_module.idl.hh` file, a registrator class named
`idl_module_rpc_verbs` will be created if there are any RPC verbs
registered within the IDL module file.
These are the methods being created for each RPC verb:
```
static void register_my_verb(netw::messaging_service* ms, std::function<return_type(args...)>&&);
static future<> unregister_my_verb(netw::messaging_service* ms);
static future<> send_my_verb(netw::messaging_service* ms, netw::msg_addr id, args...);
```
Each method accepts a pointer to an instance of `messaging_service`
object, which contains the underlying seastar RPC protocol
implementation, that is used to register verbs and pass messages.
There is also a method to unregister all verbs at once:
```
static future<> unregister(netw::messaging_service* ms);
```
The following attributes are supported when declaring an RPC verb
in the IDL:
* `[[with_client_info]]` - the handler will contain a const reference to
an `rpc::client_info` as the first argument.
* `[[with_timeout]]` - an additional `time_point` parameter is supplied
to the handler function and `send*` method uses `send_message_*_timeout`
variant of internal function to actually send the message.
* `[[one_way]]` - the handler function is annotated by
`future<rpc::no_wait_type>` return type to designate that a client
doesn't need to wait for an answer.
The `-> return_type` clause is optional for two-way messages. If omitted,
the return type is set to be `future<>`.
For one-way verbs, the use of return clause is prohibited and the
signature of `send*` function always returns `future<>`.
No existing code is affected.
Ref: #1456Closes#9359
* github.com:scylladb/scylla:
idl: support generating boilerplate code for RPC verbs
idl: allow specifying multiple attributes in the grammar
message: messaging_service: extract RPC protocol details and helpers into a separate header
Consider:
- n1, n2 in the cluster
- n2 shutdown
- n2 sends gossip shutdown message to n1
- n1 delays processing of the handler of shutdown message
- n2 restarts
- n1 learns new gossip state of n2
- n1 resumes to handle the shutdown message
- n1 will mark n2 as shutdown status incorrectly until n2 restarts again
To prevent this, we can send the gossip generation number along with the
shutdown message. If the generation number does not match the local
generation number for the remote node, the shutdown message will be
ignored.
Since we use the rpc::optional to send the generation number, it works
with mixed cluster.
Fixes#8597Closes#9381
This needs to add forward declarations of the gossiper class and
re-include some other headers here and there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Database no longer needs it. Since the only user of the old-style
notification is gone -- remove it as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The messaging_service keeps track of a list of connection-drop
listeners. This list is not auto-removing and is thus not safe
on stop (fortunately there's only 1 non-stopping client of it
so far).
This patch adds a safter notification based on boost/signals.
Also storage_proxy is subscribed on it in advance to demonstrate
how it looks like altogether and make next patch shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Introduce a new header `message/rpc_protocol_impl.hh`, move here
the following things from `message/messaging_service.cc`:
* RPC protocol wrappers implementation
* Serialization thunks
* `register_handler` and `send_message*` functions
This code will be used later for IDL-generated RPC verbs
implementation.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
"
There are 4 places out there that do the same steps parsing
"client_|server_encryption_options" and configuring the
seastar::tls::creds_builder with the values (messaging, redis,
alternator and transport).
Also to make redis and transport look slimmer main() cleans
the client_encryption_options by ... parsing it too.
This set introduces a (coroutinized) helper to configure the
creds_builder with map<string, string> and removes the options
beautification from main.
tests: unit(dev), dtest.internode_ssl_test(dev)
"
* 'br-generalize-tls-creds-builder-configuration' of https://github.com/xemul/scylla:
code: Generalize tls::credentials_builder configuration
transport, redis: Do not assume fixed encryption options
messaging: Move encryption options parsing to ms
main: Open-code internode encryption misconfig warning
main, config: Move options parsing helpers
This patch implements RAFT extension that allows to perform linearisable
reads by accessing local state machine. The extension is described
in section 6.4 of the PhD. To sum it up to perform a read barrier on
a follower it needs to asks a leader the last committed index that it
knows about. The leader must make sure that it is still a leader before
answering by communicating with a quorum. When follower gets the index
back it waits for it to be applied and by that completes read_barrier
invocation.
The patch adds three new RPC: read_barrier, read_barrier_reply and
execute_read_barrier_on_leader. The last one is the one a follower uses
to ask a leader about safe index it can read. First two are used by a
leader to communicate with a quorum.
All the places in code that configure the mentioned creds builder
from client_|server_encryption_options now do it the same way.
This patch generalizes it all in the utils:: helper.
The alternator code "ignores" require_client_auth and truststore
keys, but it's easy to make the generalized helper be compatible.
Also make the new helper coroutinized from the beginning.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Main collects a bunch of local variables from config and passes
them as arguments to messaging service initialization helper.
This patch replaces all these args with const config reference.
The motivation is to facilitate next patching by providing the
server encryption options k:v set right in the m.s. init code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This reverts commit 82c419870a.
This commit removes the HINT_SYNC_POINT_CREATE and HINT_SYNC_POINT_CHECK
rpc verbs.
The upcoming HTTP API for waiting for hint replay will be restricted
to waiting for hints on the node handling the request, so there is no
need for new verbs.
Refs #8418
Broadcast can (apparently) be an address not actually on machine, but
on the other side of NAT. Thus binding local side of outgoing
connection there will fail.
Bind instead to listen_address (or broadcast, if listen_to_broadcast),
this will require routing + NAT to make the connection looking
like from broadcast from node connected to, to allow the connection
(if using partial encryption).
Note: this is somewhat verified somewhat limitedly. I would suggest
verifying various multi rack/dc setups before relying on it.
Closes#8974
This warning prevents using std::move() where it can hurt
- on an unnamed temporary or a named automatic variable being
returned from a function. In both cases the value could be
constructed directly in its final destination, but std::move()
prevents it.
Fix the handful of cases (all trivial), and enable the warning.
Closes#8992
Make sure to log the info message when we actually
start listening.
Also, print a log message when listening on the
broadcast address.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We never want to listen on port 0, even if configured so.
When the listen port is set to 0, the OS will choose the
port randomly, which makes it useless for communicating
with other nodes in the cluster, since we don't support that.
Also, it causes the listen_ports_conf_test internode_ssl_test
to fail since it expects to disable listening on storage_port
or ssl_storage_port when set to 0, as seen in
https://github.com/scylladb/scylla-dtest/issues/2174.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The netw command tries to access the netw::_the_messaging_service that
was removed long ago. The correct place for the messaging service is
in debug:: namespace.
The scylla-gdb test checks that, but the netw command sees that the ptr
in question is not initialized, thinks it's not yet sharded::start()-ed
and exits without errors.
tests: unit(gdb)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210624135107.12375-1-xemul@scylladb.com>
Currently, gossip uses the updates of the gossip heartbeat from gossip
messages to decide if a node is up or down. This means if a node is
actually down but the gossip messages are delayed in the network, the
marking of node down can be delayed.
For example, a node sends 20 gossip messages in 20 seconds before it
is dead. Each message is delayed 15 seconds by the network for some
reason. A node receives those delayed messages one after another.
Those delayed messages will prevent this node from being marked as down.
Because heartbeat update is received just before the threshold to mark a
node down is triggered which is around 20 seconds by default.
As a result, this node will not be marked as down in 20 * 15 seconds =
300 seconds, much longer than the ~20 seconds node down detection time
in normal cases.
In this patch, a new failure detector is implemented.
- Direct detection
The existing failure detector can get gossip heartbeat updates
indirectly. For example:
Node A can talk to Node B
Node B can talk to Node C
Node A can not talk to Node C, due to network issues
Node A will not mark Node B to be down because Node A can get heart beat
of Node C from node B indirectly.
This indirect detection is not very useful because when Node A decides
if it should send requests to Node C, the requests from Node A to C will
fail while Node A thinks it can communicate with Node C.
This patch changes the failure detection to be direct. It uses the
existing gossip echo message to detect directly. Gossip echo messages
will be sent to peer nodes periodically. A peer node will be marked as
down if a timeout threshold has been meet.
Since the failure detection is peer to peer, it avoids the delayed
message issue mentioned above.
- Parallel detection
The old failure detector uses shard zero only. This new failure detector
utilizes all the shards to perform the failure detection, each shard
handling a subset of live nodes. For example, if the cluster has 32
nodes and each node has 16 shards, each shard will handle only 2 nodes.
With a 16 nodes cluster, each node has 16 shards, each shard will handle
only one peer node.
A gossip message will be sent to peer nodes every 2 seconds. The extra
echo messages traffic produced compared to the old failure detector is
negligible.
- Deterministic detection
Users can configure the failure_detector_timeout_in_ms to set the
threshold to mark a node down. It is the maximum time between two
successful echo message before gossip marks a node down. It is easier to
understand than the old phi_convict_threshold.
- Compatible
This patch only uses the existing gossip echo message. Nodes with or without
this patch can work together.
Fixes#8488Closes#8036
storage_proxy works with vectors of inet_addresses for replica sets
and for topology changes (pending endpoints, dead nodes). This patch
introduces new names for these (without changing the underlying
type - it's still std::vector<gms::inet_address>). This is so that
the following patch, that changes those types to utils::small_vector,
will be less noisy and highlight the real changes that take place.
Both hinted handoff and repair are meant to improve the consistency of the cluster's data. HH does this by storing records of failed replica writes and replaying them later, while repair goes through all data on all participaring replicas and makes sure the same data is stored on all nodes. The former is generally cheaper and sometimes (but not always) can bring back full consistency on its own; repair, while being more costly, is a sure way to bring back current data to full consistency.
When hinted handoff and repair are running at the same time, some of the work can be unnecessarily duplicated. For example, if a row is repaired first, then hints towards it become unnecessary. However, repair needs to do less work if data already has good consistency, so if hints finish first, then the repair will be shorter.
This PR introduces a possibility to wait for hints to be replayed before continuing with user-issued repair. The coordinator of the repair operation asks all nodes participating in the repair operation (including itself) to mark a point at the end of all hint queues pointing towards other nodes participating in repair. Then, it waits until hint replay in all those queues reaches marked point, or configured timeout is reached.
This operation is currently opt-in and can be turned on by setting the `wait_for_hint_replay_before_repair_in_ms` config option to a positive value.
Fixes#8102
Tests:
- unit(dev)
- some manual tests:
- shutting down repair coordinator during hints replay,
- shutting down node participating in repair during hints replay,
Closes#8452
* github.com:scylladb/scylla:
repair: introduce abort_source for repair abort
repair: introduce abort_source for shutdown
storage_proxy: add abort_source to wait_for_hints_to_be_replayed
storage_proxy: stop waiting for hints replay when node goes down
hints: dismiss segment waiters when hint queue can't send
repair: plug in waiting for hints to be sent before repair
repair: add get_hosts_participating_in_repair
storage_proxy: coordinate waiting for hints to be sent
config: add wait_for_hint_replay_before_repair option
storage_proxy: implement verbs for hint sync points
messaging_service: add verbs for hint sync points
storage_proxy: add functions for syncing with hints queue
db/hints: make it possible to wait until current hints are sent
db/hints: add a metric for counting processed files
db/hints: allow to forcefully update segment list on flush
Introduce a tagged id struct for `group_id`.
Raft code would want to generate quite a lot of unique
raft groups in the future (e.g. tablets). UUID is designed
exactly for that (e.g. larger capacity than `uint64_t`, obviously,
and also has built-in procedures to generate random ids).
Also, this is a preparation to make "raft group 0" use a random
ID instead of a literal fixed `0` as a group id.
The purpose is that every scylla cluster must have a unique ID
for "raft group 0" since we don't want the nodes from some other
cluster to disrupt the current cluster. This can happen if,
for some reason, a foreign node happens to contact a node in
our cluster.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210429170630.533596-3-pa.solodovnikov@scylladb.com>
We encountered a phenomena where shutting down the messaging service
don't complete, leaving the shutdown process stuck. Since we couldn't
pinpoint where exactly the shutdown went wrong, here we add some
verbosity to the shutdown stages so we can more accurately pinpoint the
culprit.
Closes#8560
Adds two verbs: HINT_SYNC_POINT_CREATE and HINT_SYNC_POINT_CHECK.
Those will make it possible to create a sync point and regularly poll
to check its existence.
We currently only update the failure detector for a node when a higher
version of application state is received. Since gossip syn messages do
not contain application state, so this means we do not update the
failure detector upon receiving gossip syn messages, even if a message
from peer node is received which implies the peer node is alive.
This patch relaxes the failure detector update rule to update the
failure detector for the sender of gossip messages directly.
Refs #8296Closes#8476
This just causes unneeded and slower recompliations. Instead replace
with forward declarations, or includes of smaller headers that were
incidentally brought in by the one removed. The .cc files that really
need it gain the include, but they are few.
Ref #1.
Closes#8403
In commit c82250e0cf (gossip: Allow deferring
advertise of local node to be up), the replacing node is changed to postpone
the responding of gossip echo message to avoid other nodes sending read
requests to the replacing node. It works as following:
1) replacing node does not respond echo message to avoid other nodes to
mark replacing node as alive
2) replacing node advertises hibernate state so other nodes knows
replacing node is replacing
3) replacing node responds echo message so other nodes can mark
replacing node as alive
This is problematic because after step 2, the existing nodes in the
cluster will start to send writes to the replacing node, but at this
time it is possible that existing nodes haven't marked the replacing
node as alive, thus failing the write request unnecessarily.
For instance, we saw the following errors in issue #8013 (Cassandra
stress fails to achieve consistency when only one of the nodes is down)
```
scylla:
[shard 1] consistency - Live nodes 2 do not satisfy ConsistencyLevel (2
required, 1 pending, live_endpoints={127.0.0.2, 127.0.0.1},
pending_endpoints={127.0.0.3}) [shard 0] gossip - Fail to send
EchoMessage to 127.0.0.3: std::runtime_error (Not ready to respond
gossip echo message)
c-s:
java.io.IOException: Operation x10 on key(s) [4c4f4d37324c35304c30]:
Error executing: (UnavailableException): Not enough replicas available
for query at consistency QUORUM (2 required but only 1 alive
```
To solve this problem, we can do the replacing operation in multiple stages.
One solution is to introduce a new gossip status state as proposed
here: gossip: Introduce STATUS_PREPARE_REPLACE #7416
1) replacing node does not respond echo message
2) replacing node advertises prepare_replace state (Remove replacing
node from natural endpoint, but do not put in pending list yet)
3) replacing node responds echo message
4) replacing node advertises hibernate state (Put replacing node in
pending list)
Since we now have the node ops verb introduced in
829b4c1438 (repair: Make removenode safe
by default), we can do the multiple stage without introducing a new
gossip status state.
This patch uses the NODE_OPS_CMD infrastructure to implement replace
operation.
Improvements:
1) It solves the race between marking replacing node alive and sending
writes to replacing node
2) The cluster reverts to a state before the replace operation
automatically in case of error. As a result, it solves when the
replacing node fails in the middle of the operation, the repacing
node will be in HIBERNATE status forever issue.
3) The gossip status of the node to be replaced is not changed until the
replace operation is successful. HIBERNATE gossip status is not used
anymore.
4) Users can now pass a list of dead nodes to ignore explicitly.
Fixes#8013Closes#8330
* github.com:scylladb/scylla:
repair: Switch to use NODE_OPS_CMD for replace operation
gossip: Add advertise_to_nodes
gossip: Add helper to wait for a node to be up
gossip: Add is_normal_ring_member helper
gossiper::advertise_to_nodes() is added to allow respond to gossip echo
message with specified nodes and the current gossip generation number
for the nodes.
This is helpful to avoid the restarted node to be marked as alive during
a pending replace operation.
After this patch, when a node sends a echo message, the gossip
generation number is sent in the echo message. Since the generation
number changes after a restart, the receiver of the echo message can
compare the generation number to tell if the node has restarted.
Refs #8013
Section 3.10 of the PhD describes two cases for which the extension can
be helpful:
1. Sometimes the leader must step down. For example, it may need to reboot
for maintenance, or it may be removed from the cluster. When it steps
down, the cluster will be idle for an election timeout until another
server times out and wins an election. This brief unavailability can be
avoided by having the leader transfer its leadership to another server
before it steps down.
2. In some cases, one or more servers may be more suitable to lead the
cluster than others. For example, a server with high load would not make
a good leader, or in a WAN deployment, servers in a primary datacenter
may be preferred in order to minimize the latency between clients and
the leader. Other consensus algorithms may be able to accommodate these
preferences during leader election, but Raft needs a server with a
sufficiently up-to-date log to become leader, which might not be the
most preferred one. Instead, a leader in Raft can periodically check
to see whether one of its available followers would be more suitable,
and if so, transfer its leadership to that server. (If only human leaders
were so graceful.)
The patch here implements the extension and employs it automatically
when a leader removes itself from a cluster.
Raft send_snapshot RPC is actually two-way, the follower
responds with snapshot_reply message. This message until now
was, however, muted by RPC.
Do not mute snapshot_reply any more:
- to make it obvious the RPC is two way
- to feed the follower response directly into leader's FSM and
thus ensure that FSM testing results produced when using a test
transport are representative of the real world uses of
raft::rpc.
When internode_encryption is "rack" or "dc", we should enforce incoming
connections are from the appropriate address spaces iff answering on
non-tls socket.
This is implemented by having two protocol handlers. One for tls/full notls,
and one for mixed (needs checking) connections. The latter will ask
snitch if remote address is kosher, and refuse the connection otherwise.
Note: requires seastar patches:
"rpc: Make is possible for rpc server instance to refuse connection"
"RPC: (client) retain local address and use on stream creation"
Note that ip-level checks are not exhaustive. If a user is also using
"require_client_auth" with dc/rack tls setting we should warn him that
there is a possibility that someone could spoof himself pass the
authentication.
Closes#8051