Also, pass `node_ops_cmd` by value to get rid of lifetime issues
when converting to coroutine.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
* 'raft_group0_early_startup_v3' of https://github.com/ManManson/scylla:
main: allow joining raft group0 before waiting for gossiper to settle
service: raft_group0: make `join_group0` re-entrant
service: storage_service: add `join_group0` method
raft_group_registry: update gossiper state only on shard 0
raft: don't update gossiper state if raft is enabled early or not enabled at all
gms: feature_service: add `cluster_uses_raft_mgmt` accessor method
db: system_keyspace: add `bootstrap_needed()` method
db: system_keyspace: mark getter methods for bootstrap state as "const"
Just delegates work to `service::raft_group0::join_group0()`
so that it can be used in `main` to activate raft group0
early in some cases (before waiting for gossiper to settle).
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The update_table() helper template too. And the update_peer_info as
well. It can stop using global qctx and cache after that
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The set_mode() tries to combine mode switching and extended logging,
but there are no places left that do need this flexibility. It's
simpler and nicer to make set_mode() _just_ switch the mode and
log some generic "entering ... mode" message.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This boolean protects do_stop_ms from re-entrability. However, this
method is only called from stop_transport() which handles re-entring
itself, so the _ms_stopped can be just removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This "state" is the sub-state of the STARTING mode that's activated
when the storage_service::bootstrap() is called. Instead of the
separate boolean the new mode can be used. To stop it from reverting
the BOOTSTRAP mode back to JOINING some calls to set_mode() should
be converted into regular logging.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This bit is hairy. First, it indicates that the storage service
entered the init_server() method. But, once the node is up and
running it also indicates whether the gossiper is enabled or not
via the APi call.
To rely on the operation mode, first, the NONE mode is introduced
at which the server starts. Then in init_server() is switches to
STARTING.
Second change is to stop using the bit in enable/disable gossiper
API call, instead -- check the gossiper.is_enabled() itself.
To keep the is_initialized API call compatible, when the operation
mode is NORMAL it would return true/false according to the status
of the gossiper. This change is simple because storage service API
handlers already have the gossiper instance hanging around.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The is_joined() status can be get with get_operation_mode(). Since
it indicates that the operation mode is JOINING, NORMAL or anything
above, the operation mode the enum class should be shuffled to get
the simple >= comparison.
Another needed change is to set mode few steps earlier than it
happens now to cover the non-bootstrap startup case.
And the third change is to partially revert the d49aa7ab that made
the .is_joined() method be future-less. Nowadays the is_joined() is
called only from the API which is happy with being future-full in
all other storage service state checks.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is trivial change, since the only user is in API and the
get_operation_mode + mode values are at hand.
One thing to pay attention to -- the new method checks the mode to
be <= STARTING, not for equality. Now this is equivalent change,
but next patch will introduce NONE mode that should be reported
as is_starting() too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now it reports back formatted mode. For future convenience it's
needed to return the raw value, all the more so the mode enum class
is already public.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
And coroutinize while moving. No other changes.
While the code in question runs in a thread context and can enjoy
synchronous .get() calls, it's still better if it doesn't make any
assumptions about its environment. The ring joining code is changing
and new intermediate helpers should better be on the safe side from
the very beginning.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's milliseconds and is converted back and forth in join_token_ring().
Having a chrono type for it makes things (mostly code reading) simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Both helpers (natuarally) used to be storage-service methods, but then
were moved to databse because bootstrapper code wanted to know this info.
Now the bootstraper is equipped with necessary arguments.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes we applied mechanically with a script, except to
licenses/README.md.
Closes#9937
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.
References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.
scylla-gdb.py is adjusted to look for both the new and old names.
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.
As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
Consider
1) n1, n2, n3, n4, n5
2) n2 and n3 are both down
3) start n6 to replace n2
4) start n7 to replace n3
We want to replace the dead nodes n2 and n3 to fix the cluster to have 5
running nodes.
Replace operation in step 3 will fail because n3 is down.
We would see errors like below:
replace[25edeec0-57d4-11ec-be6b-7085c2409b2d]: Nodes={127.0.0.3} needed
for replace operation are down. It is highly recommended to fix the down
nodes and try again.
In the above example, currently, there is no way to replace any of the
dead nodes.
Users can either fix one of the dead nodes and run replace or run
removenode operation to remove one of the dead nodes then run replace
and run bootstrap to add another node.
Fixing dead nodes is always the best solution but it might not be
possible. Running removenode operation is not better than running
replace operation (with best effort by ignoring the other dead node) in
terms of data consistency. In addition, users have to run bootstrap
operation to add back the removed node. So, allowing replacing in such
case is a clear win.
This patch adds the --ignore-dead-nodes-for-replace option to allow run
replace operation with best effort mode. Please note, use this option
only if the dead nodes are completely broken and down, and there is no
way to fix the node and bring it back. This also means the user has to
make sure the ignored dead nodes specified are really down to avoid any
data consistency issue.
Fixes#9757Closes#9758
Currently storage service acts as a glue between database schema value
and the migration manager "passive_announce" call. This interposing is
not required, migration manager can do all the management itself, and
the linkage can be done in main.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Operations of adding or removing a node to Raft configuration
are made idempotent: they do nothing if already done, and
they are safe to resume after a failure.
However, since topology changes are not transactional, if a
bootstrap or removal procedure fails midway, Raft group 0
configuration may go out of sync with topology state as seen by
gossip.
In future we must change gossip to avoid making any persistent
changes to the cluster: all changes to persistent topology state
will be done exclusively through Raft Group 0.
Specifically, instead of persisting the tokens by advertising
them through gossip, the bootstrap will commit a change to a system
table using Raft group 0. nodetool will switch from looking at
gossip-managed tables to consulting with Raft Group 0 configuration
or Raft-managed tables.
Once this transformation is done, naturally, adding a node to Raft
configuration (perhaps as a non-voting member at first) will become the
first persistent change to ring state applied when a node joins;
removing a node from the Raft Group 0 configuration will become the last
action when removing a node.
Until this is done, do our best to avoid a cluster state when
a removed node or a node which addition failed is stuck in Raft
configuration, but the node is no longer present in gossip-managed
system tables. In other words, keep the gossip the primary source of
truth. For this purpose, carefully chose the timing when we
join and leave Raft group 0:
Join the Raft group 0 only after we've advertised our tokens, so the
cluster is aware of this node, it's visible in nodetool status,
but before node state jumps to "normal", i.e. before it accepts
queries. Since the operation is idempotent, invoke it on each
restart.
Remove the node from Group 0 *before* its tokens are removed
from gossip-managed system tables. This guarantees
that if removal from Raft group 0 fails for whatever reason,
the node stays in the ring, so nodetool removenode and
friends are re-tried.
Add tracing.
Use local reference and don't use 'is_stopped' boolean as the
whole stop_transport is guarded with its own lock.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The manager is drained() on drain/decommission/isolate. Since now
it's storage_service who orchestrates all of the above, it needs
and explicit reference on the target.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>