This will be used for re-enabling previously enabled cluster
features, which will be introduced in later patches.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Save each feature enabled through the feature_service
instance in the `system.scylla_local` table under the
'enabled_features' key.
The features are persisted only if the underlying
query context used by `db::system_keyspace` is initialized.
Since the `system.scylla_local` table is essentially a
string->string map, use an ad-hoc method for serializing
the set of enabled features: the same one the gossiper uses
for passing the set of supported features via gossip.
The entry should be saved before we enable the feature so
that crash-after-enable is safe.
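A minimal sketch of such an ad-hoc string serialization, assuming the set is stored as a comma-separated list. The helper names `serialize_features` and `deserialize_features` and the ',' separator are assumptions of this sketch, not the actual Scylla code.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: join feature names with ',' (assumed separator).
std::string serialize_features(const std::set<std::string>& features) {
    std::string out;
    for (const auto& name : features) {
        // A name containing the separator could not be parsed back.
        assert(name.find(',') == std::string::npos);
        if (!out.empty()) {
            out += ',';
        }
        out += name;
    }
    return out;
}

// Hypothetical helper: split the stored string back into a set,
// skipping empty tokens so that "" maps to an empty set.
std::set<std::string> deserialize_features(const std::string& value) {
    std::set<std::string> features;
    std::istringstream in(value);
    std::string name;
    while (std::getline(in, name, ',')) {
        if (!name.empty()) {
            features.insert(name);
        }
    }
    return features;
}
```

Writing the serialized value before enabling the feature in memory means a crash between the two steps leaves at worst a persisted feature that is re-enabled on restart.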
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This utility will also be used for de-serialization of
persisted enabled features, which will be introduced in a
later patch.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
There's nothing in this function that actually requires
the batchlog manager instance.
It uses a random number engine that's moved along with it
to class gossiper.
This resolves a circular dependency between the
batchlog_manager and storage_proxy.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
On shutdown, the gossiper may still have some messages being processed
in the background. This brings two problems.
First, the gossiper itself is about to disappear soon and messages
might step on the freed instance (however, this is not a real problem
now: the gossiper is not actually freed, only ::stop() is called).
Second, message processing may notify other subsystems which, in
turn, do not expect this after the gossiper is shut down.
The common solution to this is to run background code through a gate
that gets closed at some point, which is ::shutdown() in the gossiper's case.
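The gate idea can be sketched as follows. This is a single-threaded stand-in written for illustration, not Seastar's seastar::gate; the names `simple_gate` and `handle_in_background` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Minimal stand-in for a gate: work may enter only while it is open.
class simple_gate {
    std::size_t _pending = 0;
    bool _closed = false;
public:
    // Returns false once the gate is closed; the caller must drop the work.
    bool try_enter() {
        if (_closed) {
            return false;
        }
        ++_pending;
        return true;
    }
    void leave() {
        assert(_pending > 0);
        --_pending;
    }
    void close() { _closed = true; }
    std::size_t pending() const { return _pending; }
};

// Hypothetical message handler: runs `fn` only while the gate is open.
template <typename Func>
bool handle_in_background(simple_gate& gate, Func fn) {
    if (!gate.try_enter()) {
        return false;   // the service is shutting down, ignore the message
    }
    fn();
    gate.leave();
    return true;
}
```

A real asynchronous gate additionally makes close() wait until all entered work has left; this sketch only shows the "refuse new work after close" part.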
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some messages are processed by gossiper on shard0 in a no-wait
manner. Add a generic helper for that to simplify the next patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Consider:
- n1, n2 in the cluster
- n2 shutdown
- n2 sends gossip shutdown message to n1
- n1 delays processing of the handler of shutdown message
- n2 restarts
- n1 learns new gossip state of n2
- n1 resumes handling the shutdown message
- n1 incorrectly marks n2 as shut down until n2 restarts again
To prevent this, we can send the gossip generation number along with the
shutdown message. If the generation number does not match the local
generation number for the remote node, the shutdown message will be
ignored.
Since we use rpc::optional to send the generation number, it works
in a mixed cluster.
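The check described above can be sketched like this. std::optional stands in for rpc::optional, the function name `should_apply_shutdown` is hypothetical, and accepting a message that carries no generation (from an older node) is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Decide whether a received shutdown message should be applied.
// msg_generation:   generation sent along with the shutdown message,
//                   absent when the sender is an older node.
// local_generation: generation this node currently knows for the sender.
bool should_apply_shutdown(std::optional<std::int64_t> msg_generation,
                           std::int64_t local_generation) {
    if (!msg_generation) {
        return true;   // assumed fallback: keep the old behaviour
    }
    // A mismatch means the sender restarted after sending the message.
    return *msg_generation == local_generation;
}
```

In the scenario above, n1 already knows the new generation of n2 when it resumes the handler, so the stale message is ignored.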
Fixes #8597
Closes #9381
"
The main challenge here is to move the messaging_service.start_listen()
call out of the gossiper and into main. Other changes are pretty minor
compared to that and include
- patch gossiper API towards a standard start-shutdown-stop form
- gossiping "sharder info" in initial state
- configure cluster name and seeds via gossip_config
tests: unit(dev)
dtest.bootstrap_test.start_stop_test_node(dev)
manual(dev): start+stop, nodetool enable-/disablegossip
refs: #2737
refs: #2795
refs: #5489
"
* 'br-gossiper-dont-start-messaging-listen-2' of https://github.com/xemul/scylla:
code: Expell gossiper.hh from other headers
storage_service: Gossip "sharder" in initial states
gossiper: Relax set_seeds()
gossiper, main: Turn init_gossiper into get_seeds_from_config
storage_service: Eliminate the do-bind argument from everywhere
gossiper: Drop ms-registered manipulations
messaging, main, gossiper: Move listening start into main
gossiper: Do handlers reg/unreg from start/stop
gossiper: Split (un)init_messaging_handler()
gossiper: Relocate stop_gossiping() into .stop()
gossiper: Introduce .shutdown() and use where appropriate
gossiper: Set cluster_name via gossip_config
gossiper, main: Straighten start/stop
tests/cql_test_env: Open-code tst_init_ms_fd_gossiper
tests/cql_test_env: De-global most of gossiper
gossiper: Merge start_gossiping() overloads into one
gossiper: Use is_... helpers
gossiper: Fix do_shadow_round comment
gossiper: Dispose dead code
It's much shorter and simpler to pass the seeds, obtained from the
config, into gossiper via gossip_config rather than with the help
of a special call.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The same as in the previous patch -- the gossiper doesn't need to know
if it should call messaging.start_listen() or not, and neither should
the storage_service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Before preparing the cluster join process the messaging should be
put into listening state. Right now it's done "on-demand" by the
call to the do_shadow_round(), also there's a safety call in the
start_gossiping(). Tests, however, should not start listening, so
the do_bind boolean exists and is passed all the way around.
Make the main() code explicitly call the messaging.start_listen()
and leave tests without it. This change makes messaging start
listening a bit earlier, but in between these old and new places
there's nothing that needs messaging to stay deaf.
As the do_bind becomes useless, the wait_for_gossip_to_settle() is
also moved into main.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
On start, handlers can be registered at any time before the messaging
starts to listen. On stop, handlers can remain registered for as long as
needed, since the messaging service stops early in drain_on_shutdown().
One tricky place is API start_/stop_gossiping(). The latter calls
gossiper::stop() thus unregistering the handlers. So to make the
start_gossiping() work it must call gossiper::start() in advance.
Overall the gossiper start/stop becomes this:
    gossiper.start()
    `- registers handlers
    gossiper.start_gossiping()
    `- // starts gossiping
    gossiper.shutdown()
    `- // stops gossiping
    gossiper.stop()
    `- calls shutdown() // re-entrant
    `- unregisters handlers
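The sequence above can be modelled with a toy class. This is only an illustration of the ordering and of the re-entrant shutdown(); `toy_gossiper` and its members are hypothetical and carry none of the real gossiper logic.

```cpp
#include <cassert>

// Toy model of the start / start_gossiping / shutdown / stop ordering.
class toy_gossiper {
    bool _handlers_registered = false;
    bool _gossiping = false;
public:
    void start() { _handlers_registered = true; }
    void start_gossiping() {
        // Handlers must be registered before gossiping begins.
        assert(_handlers_registered);
        _gossiping = true;
    }
    // Safe to call more than once: a second call is a no-op.
    void shutdown() { _gossiping = false; }
    void stop() {
        shutdown();
        _handlers_registered = false;
    }
    bool handlers_registered() const { return _handlers_registered; }
    bool gossiping() const { return _gossiping; }
};
```

The API case follows from the same model: after stop() the handlers are gone, so start() has to be called again before start_gossiping().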
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The helper in question is called in two places:
1. In main() as a fuse against early exception before creating the
drain_on_shutdown() defer
2. In the stop_gossiping() API call
Both can be replaced with the stop_gossiping() call from the .stop()
method, here's why:
1. In main the gossiper::stop() call is already deferred right after
the gossiper is started. So this change moves it above. It may
happen that an exception pops up before the old fuse was deferred,
but that's OK -- stop_gossiping() is safe against early and
repeated calls
2. The stop_gossiping() change is effectively a rename -- it calls
stop_gossiping() as it did before, but with the help of the .stop()
method
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The start/stop sequence we're moving towards assumes a shutdown (or
drain) method that will be called early on stop to notify the service
that the system is going down so it could prepare.
For gossiper it already means calling stop_gossiping() on the shard-0
instance. So by and large this patch renames a few stop_gossiping()
calls into .shutdown() ones.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's taken purely from the db::config and thus can be set up early.
Right now the empty name is converted into "Test Cluster" one, but
remains empty in the config and is later used by the system_keyspace
code. This logic remains intact.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Turn the gossiper start/stop sequence into the canonical form
    gossiper.start(std::ref(dependencies)...).get();
    auto stop_gossiper = defer([&] {
        gossiper.invoke_on_all(&gossiper::stop).get();
    });
    gossiper.invoke_on_all(&gossiper::start).get();
The deferred call should be gossiper.stop(), but for now keep
the instances' memory alive.
This trick is safe at this point, because .start() and .stop()
methods are both empty (still).
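For readers unfamiliar with the defer idiom used in the canonical form, here is a minimal standard-C++ stand-in. It is not Seastar's defer(); `deferred_action` and `run_lifecycle` are hypothetical names used only to show that the deferred stop runs last, when the scope is left.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Runs the stored callable when the object goes out of scope.
template <typename Func>
class deferred_action {
    Func _fn;
    bool _armed = true;
public:
    explicit deferred_action(Func fn) : _fn(std::move(fn)) {}
    deferred_action(deferred_action&& other)
        : _fn(std::move(other._fn)), _armed(other._armed) {
        other._armed = false;   // the moved-from guard must not fire
    }
    deferred_action(const deferred_action&) = delete;
    deferred_action& operator=(const deferred_action&) = delete;
    ~deferred_action() {
        if (_armed) {
            _fn();
        }
    }
};

template <typename Func>
deferred_action<Func> defer(Func fn) {
    return deferred_action<Func>(std::move(fn));
}

// Records the order of events for a start / work / stop sequence.
std::vector<std::string> run_lifecycle() {
    std::vector<std::string> log;
    {
        auto stop = defer([&log] { log.push_back("stop"); });
        log.push_back("start");
        log.push_back("work");
    }   // `stop` fires here, also if an exception had been thrown
    return log;
}
```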
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two of them and one is only called from the API with the
do_bind always set to "yes". This fact makes it possible to remove
it by adding relevant defaults for the other.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are several state booleans on the service and some helpers to
manipulate/check those. Make the code consistent by always using these
helpers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The debug_show() is unused, as well as the advertise_myself().
The _features_condvar used to be listened on before f32f08c9,
now it's signal-only.
Feature friendship with gossiper is not required.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch adds stubs for the UpdateTimeToLive and DescribeTimeToLive
operations to Alternator. These operations can enable, disable, or inquire
about, the chosen expiration-time attribute.
Currently, the information about the chosen attribute is only saved, with
no actual expiration of any items taking place.
Some of the tests for the TTL feature start to pass, so their xfail tag
is removed.
Because this new feature is incomplete, it is not enabled unless
the "alternator-ttl" experimental feature is enabled. Moreover, for
these operations to be allowed, the entire cluster needs to support
this experimental feature, because all nodes need to participate in the
data expiration - if some old nodes don't support Alternator TTL, some
of the data they hold won't get expired... So we don't allow enabling
TTL until all the nodes in the cluster support this feature.
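The "all nodes must support it" rule can be sketched as a simple predicate. The function name `cluster_supports`, the use of plain strings for feature names, and treating an empty node list as "not supported" are assumptions of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Returns true only if every node advertises the given feature.
// supported_by_node holds one set of supported feature names per node.
bool cluster_supports(const std::string& feature,
                      const std::vector<std::set<std::string>>& supported_by_node) {
    if (supported_by_node.empty()) {
        return false;   // assumed: no known nodes means no cluster-wide support
    }
    return std::all_of(supported_by_node.begin(), supported_by_node.end(),
                       [&feature](const std::set<std::string>& supported) {
                           return supported.count(feature) > 0;
                       });
}
```

A single old node in the list is enough to keep the operation disallowed, which matches the reasoning about unexpired data on old nodes.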
The implementation is in a new source file, alternator/ttl.cc. This
source file will continue to grow as we implement the expiration feature.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Some state accessors called get_local_gossiper(); this is removed
and replaced with a parameter. Some callers (redis, alternator)
now have the gossiper passed as a parameter during initialization
so they can use the adjusted API.
Have the callers pass it instead, and they all have a reference
already except for cql_test_env (which will be fixed later).
The checks for initialization it does are likely unnecessary, but
we'll only be able to prove it when get_gossiper() is completely
removed.
Initialization happens in the gossiper itself, so we can capture
'this'. If we need to move to shard 0, use sharded::invoke_on() to
get the local instance.
These features have been around for over 2 years and every reasonable
deployment should have them enabled.
The only case when those features could be not enabled is when the user
has used enable_sstables_mc_format config flag to disable MC sstable
format. This case has been eliminated by removing
enable_sstables_mc_format config flag.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
1) Start node n1, n2, n3
2) Bootstrap n4 and kill n4 in the middle of bootstrap
3) Wipe data on n4 and start n4 again
After step 2, n1, n2 and n3 will remove n4 from gossip after
fat_client_timeout and put n4 in quarantine for quarantine_delay().
If n4 bootstraps again in step 3 before the quarantine finishes, n1, n2
and n3 will ignore gossip updates from n4, and n4 will not learn gossip
updates from the cluster.
After PR #8896, the bootstrap will be rejected.
This patch promotes the gossip quarantine over log to info level, so
that dtest can wait for the log to bootstrap the node again.
Refs #8889
Refs #8890
Closes #8905
In commit 11a8912093 (gossiper:
get_gossip_status: return string_view and make noexcept)
get_gossip_status returns a pointer to an endpoint_state in
endpoint_state_map.
After commit 425e3b1182 (gossip: Introduce
direct failure detector), gossiper::mark_dead and gossiper::real_mark_alive
can yield in the middle of the function. It is possible that
endpoint_state can be removed, causing use-after-free to access it.
To fix, make a copy before we yield.
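The "copy before we yield" fix can be illustrated with plain containers. The types `endpoint_state_sketch` and `state_map` and the function `get_status_copy` are hypothetical; the point is only that an owning copy stays valid after the map entry is erased, while a pointer or view into the entry would not.

```cpp
#include <cassert>
#include <map>
#include <string>

struct endpoint_state_sketch {
    std::string status;
};

using state_map = std::map<std::string, endpoint_state_sketch>;

// Returns an owning copy of the status, or an empty string if the
// endpoint is unknown. The result does not refer into the map.
std::string get_status_copy(const state_map& states, const std::string& endpoint) {
    auto it = states.find(endpoint);
    return it == states.end() ? std::string() : it->second.status;
}
```

Taking the copy before any point where the entry may be removed is what makes later use of the value safe.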
Fixes #8859
Closes #8862
When a new node bootstraps to join the cluster, it will be set in
bootstrap gossip status. If the node is gone in the middle, the node
will be removed by gossip after the new node fails to update gossip
after fat_client_timeout, which reverts the new node as pending node.
However, the new node may be slow to update gossip and finish
bootstrapping only after the existing nodes have removed it following
fat_client_timeout. In that case, in the handle_state_normal handler,
the existing nodes fail to find the host id for the new node and throw,
which in turn terminates the scylla process.
To mitigate the problem, we clamp fat_client_timeout, which is half of
quarantine_delay, to a minimum value if users set a small ring_delay
value.
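The clamping can be written as a one-liner. The function signature and the concrete minimum passed in by the caller are assumptions of this sketch; only the "half of quarantine_delay, but not below a minimum" relation comes from the text above.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

using ms = std::chrono::milliseconds;

// fat_client_timeout is half of quarantine_delay, but never below `minimum`.
// The value of `minimum` is chosen by the caller (hypothetical parameter).
ms fat_client_timeout(ms quarantine_delay, ms minimum) {
    return std::max(quarantine_delay / 2, minimum);
}
```

For example, with a hypothetical minimum of 10 seconds, a quarantine delay of 2 seconds yields 10 seconds instead of 1 second.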
Refs #8702
Refs #8859
Closes #8860
When a node is removed from the _live_endpoints list directly, e.g., a
node being decommissioned, it is possible the node might not be marked
as down in gossiper::failure_detector_loop_for_node loop before the loop
exits. When the gossiper::failure_detector_loop loop starts again, the
node will not be considered because it is not present in _live_endpoints
list any more. As a result, the node will not be marked as down through the
gossiper::failure_detector_loop_for_node loop.
To fix, we mark the nodes that are removed from _live_endpoints
lists as down in the gossiper::failure_detector_loop loop.
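One way to find the nodes that disappeared from the live list between two iterations is a set difference. This is a sketch with plain strings as node addresses; `removed_since` is a hypothetical helper, not the function used in the gossiper.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>

// Returns the nodes present in previous_live but missing from current_live.
// These are the nodes the outer loop would have to mark as down itself.
std::set<std::string> removed_since(const std::set<std::string>& previous_live,
                                    const std::set<std::string>& current_live) {
    std::set<std::string> removed;
    std::set_difference(previous_live.begin(), previous_live.end(),
                        current_live.begin(), current_live.end(),
                        std::inserter(removed, removed.end()));
    return removed;
}
```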
Fixes #8712
Closes #8770
In commit 425e3b1182 (gossip: Introduce
direct failure detector), the call to notify_failure_detector inside ack
and ack2 msg handler was removed since there is no need to update the
old failure detector anymore. However, the timestamp for endpoint_state
is also updated inside notify_failure_detector. With the new failure
detector we still need the timestamp for endpoint_state. Otherwise, nodes
might be removed from gossip wrongly.
For example, as we saw in issue #8702:
INFO 2021-05-24 22:45:24,713 [shard 0] gossip - FatClient 127.0.60.2
has been silent for 5000ms, removing from gossip
To fix, update the timestamp in the ack and ack2 msg handlers as we did
before.
Fixes #8702
Closes #8777
When a node joins this feature (which it does immediately when upgrading
to a version that has this commit), it says: "I understand the new
generation storage format and the new identifier format". Thus, when the
feature becomes enabled - after all nodes have joined it - it means that
it's safe to create new generations using these new storage/ID formats.
This is a new type of CDC generation identifier. Compared to old IDs,
it contains a UUID in addition to the timestamp.
These new identifiers will allow a safer and more efficient algorithm of
introducing new generations into a cluster (introduced in a later commit).
For now, nodes keep using the old identifier format when creating new
generations and whenever they learn about a new CDC generation from gossip
they assume that it also is stored in the v1 format. But they do know how
to (de)serialize the second format and how to persist new identifiers in
local tables.
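A sketch of how two identifier formats can coexist behind one type. The names `generation_id_v1`, `generation_id_v2`, `uuid_bytes` and `is_v2` are assumptions of this sketch, std::chrono::system_clock stands in for db_clock, and a 16-byte array stands in for a UUID type.

```cpp
#include <array>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <variant>

using time_point = std::chrono::system_clock::time_point;   // stand-in clock
using uuid_bytes = std::array<std::uint8_t, 16>;            // stand-in UUID

// Old format: the timestamp alone identifies the generation.
struct generation_id_v1 {
    time_point ts;
};

// New format: the timestamp plus a UUID.
struct generation_id_v2 {
    time_point ts;
    uuid_bytes id;
};

using generation_id = std::variant<generation_id_v1, generation_id_v2>;

// Both formats carry a timestamp, so it can be read uniformly.
time_point get_ts(const generation_id& gid) {
    return std::visit([](const auto& g) { return g.ts; }, gid);
}

bool is_v2(const generation_id& gid) {
    return std::holds_alternative<generation_id_v2>(gid);
}
```

Code that only passes the ID around does not need to know which alternative it holds.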
"
The patch set is an assorted collection of header cleanups, e.g:
* Reduce number of boost includes in header files
* Switch to forward declarations in some places
A quick measurement was performed to see if these changes
provide any improvement in build times (ccache cleaned and
existing build products wiped out).
The results are posted below (`/usr/bin/time -v ninja dev-build`)
for 24 cores/48 threads CPU setup (AMD Threadripper 2970WX).
Before:
Command being timed: "ninja dev-build"
User time (seconds): 28262.47
System time (seconds): 824.85
Percent of CPU this job got: 3979%
Elapsed (wall clock) time (h:mm:ss or m:ss): 12:10.97
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2129888
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1402838
Minor (reclaiming a frame) page faults: 124265412
Voluntary context switches: 1879279
Involuntary context switches: 1159999
Swaps: 0
File system inputs: 0
File system outputs: 11806272
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
After:
Command being timed: "ninja dev-build"
User time (seconds): 26270.81
System time (seconds): 767.01
Percent of CPU this job got: 3905%
Elapsed (wall clock) time (h:mm:ss or m:ss): 11:32.36
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2117608
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1400189
Minor (reclaiming a frame) page faults: 117570335
Voluntary context switches: 1870631
Involuntary context switches: 1154535
Swaps: 0
File system inputs: 0
File system outputs: 11777280
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
The observed improvement is about 5% of total wall clock time
for `dev-build` target.
Also, all commits make sure that headers stay self-sufficient,
which would help to further improve the situation in the future.
"
* 'feature/header_cleanups_v1' of https://github.com/ManManson/scylla:
transport: remove extraneous `qos/service_level_controller` includes from headers
treewide: remove evidently unneded storage_proxy includes from some places
service_level_controller: remove extraneous `service/storage_service.hh` include
sstables/writer: remove extraneous `service/storage_service.hh` include
treewide: remove extraneous database.hh includes from headers
treewide: reduce boost headers usage in scylla header files
cql3: remove extraneous includes from some headers
cql3: various forward declaration cleanups
utils: add missing <limits> header in `extremum_tracking.hh`
Currently, gossip uses the updates of the gossip heartbeat from gossip
messages to decide if a node is up or down. This means if a node is
actually down but the gossip messages are delayed in the network, the
marking of node down can be delayed.
For example, a node sends 20 gossip messages in 20 seconds before it
is dead. Each message is delayed 15 seconds by the network for some
reason. A node receives those delayed messages one after another.
Those delayed messages will prevent the dead node from being marked as down,
because each heartbeat update is received just before the threshold to mark a
node down, which is around 20 seconds by default, is reached.
As a result, the node will not be marked as down for 20 * 15 seconds =
300 seconds, much longer than the ~20 seconds node down detection time
in normal cases.
In this patch, a new failure detector is implemented.
- Direct detection
The existing failure detector can get gossip heartbeat updates
indirectly. For example:
Node A can talk to Node B
Node B can talk to Node C
Node A can not talk to Node C, due to network issues
Node A will not mark Node C as down because Node A can get the heartbeat
of Node C from Node B indirectly.
This indirect detection is not very useful because when Node A decides
if it should send requests to Node C, the requests from Node A to C will
fail while Node A thinks it can communicate with Node C.
This patch changes the failure detection to be direct. It uses the
existing gossip echo message to detect directly. Gossip echo messages
will be sent to peer nodes periodically. A peer node will be marked as
down if a timeout threshold has been met.
Since the failure detection is peer to peer, it avoids the delayed
message issue mentioned above.
- Parallel detection
The old failure detector uses shard zero only. This new failure detector
utilizes all the shards to perform the failure detection, each shard
handling a subset of live nodes. For example, if the cluster has 32
nodes and each node has 16 shards, each shard will handle only 2 nodes.
In a 16-node cluster where each node has 16 shards, each shard will handle
only one peer node.
A gossip message will be sent to peer nodes every 2 seconds. The extra
echo message traffic produced compared to the old failure detector is
negligible.
- Deterministic detection
Users can configure the failure_detector_timeout_in_ms to set the
threshold to mark a node down. It is the maximum time between two
successful echo messages before gossip marks a node down. It is easier to
understand than the old phi_convict_threshold.
- Compatible
This patch only uses the existing gossip echo message. Nodes with or without
this patch can work together.
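The direct, timeout-based detection and the spreading of peers over shards can be sketched as follows. `echo_tracker` and `shard_for_peer` are hypothetical names, and the round-robin assignment of peers to shards is an assumption of this sketch rather than the exact scheme used by the patch.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

using clock_type = std::chrono::steady_clock;

// Assumed round-robin assignment of a peer (by index) to a shard.
std::size_t shard_for_peer(std::size_t peer_index, std::size_t shard_count) {
    assert(shard_count > 0);
    return peer_index % shard_count;
}

// Tracks the last successful echo to one peer. The peer counts as down
// once more than `timeout` has passed since that last success.
class echo_tracker {
    clock_type::time_point _last_success;
    std::chrono::milliseconds _timeout;
public:
    echo_tracker(clock_type::time_point now, std::chrono::milliseconds timeout)
        : _last_success(now), _timeout(timeout) {}

    void on_echo_success(clock_type::time_point now) { _last_success = now; }

    bool is_down(clock_type::time_point now) const {
        return now - _last_success > _timeout;
    }
};
```

Because only a direct reply from the peer refreshes the timestamp, delayed or relayed heartbeats cannot postpone the decision.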
Fixes #8488
Closes #8036
We remove an error-level log message for a condition that is
subsequently thrown as an exception, caught a few lines below, and
then printed as a warning.
Fixes #8616
Closes #8617
We currently only update the failure detector for a node when a higher
version of application state is received. Since gossip syn messages do
not contain application state, we do not update the
failure detector upon receiving gossip syn messages, even though a message
received from a peer node implies that the peer node is alive.
This patch relaxes the failure detector update rule to update the
failure detector for the sender of gossip messages directly.
Refs #8296
Closes #8476
alternator/expressions.g had both AGPL and proprietary licensing. The
proprietary one is removed.
gms/inet_address_serializer.hh had only a proprietary license; it is
replaced by the AGPL.
Fixes #8465.
Closes #8466
This is a follow-up to the previous commit.
Each CDC generation has a timestamp which denotes a logical point in time
when this generation starts operating. That same timestamp is
used to identify the CDC generation. We use this identification scheme
to exchange CDC generations around the cluster.
However, the fact that a generation's timestamp is used as an ID for
this generation is an implementation detail of the currently used method
of managing CDC generations.
Places in the code that deal with the timestamp, e.g. functions which
take it as an argument (such as handle_cdc_generation) are often
interested in the ID aspect, not the "when does the generation start
operating" aspect. They don't care that the ID is a `db_clock::time_point`.
They may sometimes want to retrieve the time point given the ID (such as
do_handle_cdc_generation when it calls `cdc::metadata::insert`),
but they don't care about the fact that the time point actually IS the ID.
In the future we may actually change the specific type of the ID if we
modify the generation management algorithms.
This commit is an intermediate step that will ease the transition in the
future. It introduces a new type, `cdc::generation_id`. Inside it contains
the timestamp, so:
1. if a piece of code doesn't care about the timestamp, it just passes
the ID around
2. if it does care, it can simply access it using the `get_ts` function.
The fact that `get_ts` simply accesses the ID's only field is an
implementation detail.
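A minimal sketch of the wrapper type described above. std::chrono::system_clock stands in for db_clock, and the helper `same_generation` is hypothetical; it shows a caller that cares only about identity and never looks at the timestamp.

```cpp
#include <cassert>
#include <chrono>

using time_point = std::chrono::system_clock::time_point;   // stand-in for db_clock::time_point

// The ID wraps the timestamp; callers treat it as an opaque value.
struct generation_id {
    time_point ts;
};

inline bool operator==(const generation_id& a, const generation_id& b) {
    return a.ts == b.ts;
}

// Code that does care about the start time asks for it explicitly.
inline time_point get_ts(const generation_id& id) {
    return id.ts;
}

// Hypothetical caller interested only in the identity aspect.
inline bool same_generation(const generation_id& known, const generation_id& received) {
    return known == received;
}
```

If the ID later gains more fields, only the struct, the comparison and get_ts need to change, not the callers.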
Using the occasion, we change the `do_handle_cdc_generation_intercept...`
function to be a standard function, not a coroutine. It turns out that -
depending on the shape of the passed-in argument - the function would
sometimes miscompile (the compiled code would not copy the argument to the
coroutine frame).
Each CDC generation always has a timestamp, but the fact that the
timestamp identifies the generation is an implementation detail.
We abstract away from this detail by using a more generic naming scheme:
a generation "identifier" (whatever that is - a timestamp or something
else).
It's possible that a CDC generation will be identified by more than a
timestamp in the (near) future.
The actual string gossiped by nodes in their application state is left
as "CDC_STREAMS_TIMESTAMP" for backward compatibility.
Some stale comments have been updated.
In commit c82250e0cf (gossip: Allow deferring
advertise of local node to be up), the replacing node was changed to postpone
responding to gossip echo messages, to avoid other nodes sending read
requests to the replacing node. It works as follows:
1) replacing node does not respond to echo messages, to avoid other nodes
marking the replacing node as alive
2) replacing node advertises hibernate state so other nodes know the
replacing node is replacing
3) replacing node responds to echo messages so other nodes can mark the
replacing node as alive
This is problematic because after step 2, the existing nodes in the
cluster will start to send writes to the replacing node, but at this
time it is possible that existing nodes haven't marked the replacing
node as alive, thus failing the write request unnecessarily.
For instance, we saw the following errors in issue #8013 (Cassandra
stress fails to achieve consistency when only one of the nodes is down)
```
scylla:
[shard 1] consistency - Live nodes 2 do not satisfy ConsistencyLevel (2
required, 1 pending, live_endpoints={127.0.0.2, 127.0.0.1},
pending_endpoints={127.0.0.3}) [shard 0] gossip - Fail to send
EchoMessage to 127.0.0.3: std::runtime_error (Not ready to respond
gossip echo message)
c-s:
java.io.IOException: Operation x10 on key(s) [4c4f4d37324c35304c30]:
Error executing: (UnavailableException): Not enough replicas available
for query at consistency QUORUM (2 required but only 1 alive
```
To solve this problem, we can do the replacing operation in multiple stages.
One solution is to introduce a new gossip status state as proposed
here: gossip: Introduce STATUS_PREPARE_REPLACE #7416
1) replacing node does not respond to echo messages
2) replacing node advertises prepare_replace state (Remove replacing
node from natural endpoints, but do not put it in the pending list yet)
3) replacing node responds to echo messages
4) replacing node advertises hibernate state (Put replacing node in
the pending list)
Since we now have the node ops verb introduced in
829b4c1438 (repair: Make removenode safe
by default), we can implement the multiple stages without introducing a new
gossip status state.
This patch uses the NODE_OPS_CMD infrastructure to implement the replace
operation.
Improvements:
1) It solves the race between marking replacing node alive and sending
writes to replacing node
2) The cluster automatically reverts to the state before the replace
operation in case of error. This fixes the issue where, if the
replacing node fails in the middle of the operation, the replacing
node stays in HIBERNATE status forever.
3) The gossip status of the node to be replaced is not changed until the
replace operation is successful. HIBERNATE gossip status is not used
anymore.
4) Users can now pass a list of dead nodes to ignore explicitly.
Fixes #8013
Closes #8330
* github.com:scylladb/scylla:
repair: Switch to use NODE_OPS_CMD for replace operation
gossip: Add advertise_to_nodes
gossip: Add helper to wait for a node to be up
gossip: Add is_normal_ring_member helper
gossiper::advertise_to_nodes() is added to allow responding to gossip echo
messages only for the specified nodes and the current gossip generation
number of those nodes.
This helps avoid a restarted node being marked as alive during
a pending replace operation.
After this patch, when a node sends an echo message, the gossip
generation number is sent in the echo message. Since the generation
number changes after a restart, the receiver of the echo message can
compare the generation number to tell if the node has restarted.
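The generation comparison can be sketched as a small policy object. The class `echo_policy`, its `should_respond` method, and the use of strings for node addresses are assumptions of this sketch; the signature of advertise_to_nodes() here is illustrative and not taken from the gossiper.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Decides whether to answer an echo message, based on the set of nodes
// (and their generations) we have agreed to advertise ourselves to.
class echo_policy {
    // node address -> generation for which replying is allowed
    std::unordered_map<std::string, std::int64_t> _advertised;
public:
    void advertise_to_nodes(std::unordered_map<std::string, std::int64_t> nodes) {
        _advertised = std::move(nodes);
    }

    // Reply only to a listed node whose echo carries the expected
    // generation; a different generation means the sender restarted.
    bool should_respond(const std::string& from, std::int64_t generation) const {
        auto it = _advertised.find(from);
        return it != _advertised.end() && it->second == generation;
    }
};
```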
Refs #8013