530 Commits

Author SHA1 Message Date
Pavel Solodovnikov
4af27ca653 service: storage_service: coroutinize node_ops_cmd_heartbeat_updater()
Also, pass `node_ops_cmd` by value to get rid of lifetime issues
when converting to coroutine.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-05-01 12:07:36 +03:00
Pavel Solodovnikov
b27c989e62 service: storage_service: coroutinize node_ops_abort()
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-25 09:11:14 +03:00
Pavel Solodovnikov
f7e84c6138 service: storage_service: coroutinize node_ops_done()
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-25 09:11:08 +03:00
Pavel Solodovnikov
6936dbea49 service: storage_service: coroutinize node_ops_update_heartbeat()
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-25 09:11:04 +03:00
Pavel Solodovnikov
0a3a7534d6 service: storage_service: coroutinize start_sys_dist_ks()
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-25 09:10:49 +03:00
Pavel Solodovnikov
15ea74e41f service: storage_service: coroutinize prepare_to_join()
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-25 09:10:43 +03:00
Pavel Solodovnikov
c739fad5d6 service: storage_service: coroutinize removenode_add_ranges()
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-25 09:10:05 +03:00
Pavel Solodovnikov
e392fdda96 service: storage_service: coroutinize unbootstrap()
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-25 09:09:56 +03:00
Pavel Solodovnikov
8fa7f47a74 service: storage_service: coroutinize get_changed_ranges_for_leaving()
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-25 09:09:04 +03:00
Kamil Braun
41f5b7e69e Merge branch 'raft_group0_early_startup_v3' of https://github.com/ManManson/scylla into next
* 'raft_group0_early_startup_v3' of https://github.com/ManManson/scylla:
  main: allow joining raft group0 before waiting for gossiper to settle
  service: raft_group0: make `join_group0` re-entrant
  service: storage_service: add `join_group0` method
  raft_group_registry: update gossiper state only on shard 0
  raft: don't update gossiper state if raft is enabled early or not enabled at all
  gms: feature_service: add `cluster_uses_raft_mgmt` accessor method
  db: system_keyspace: add `bootstrap_needed()` method
  db: system_keyspace: mark getter methods for bootstrap state as "const"
2022-04-14 16:42:20 +02:00
Pavel Solodovnikov
057a12e213 service: storage_service: add join_group0 method
Just delegates work to `service::raft_group0::join_group0()`
so that it can be used in `main` to activate raft group0
early in some cases (before waiting for gossiper to settle).

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-04-07 12:36:33 +03:00
Gleb Natapov
7bf557332f storage_service: remove maybe from maybe_start_sys_dist_ks
There is nothing "maybe" about it now.

Message-Id: <Ykv/bj8MvKh0UU23@scylladb.com>
2022-04-05 12:49:56 +03:00
Pavel Emelyanov
7d0d5642c0 system_keyspace: Make update_cached_values non-static
The update_table() helper template too. And the update_peer_info as
well. It can stop using global qctx and cache after that

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-16 14:24:40 +03:00
Pavel Emelyanov
f18a80852e storage_service: Keep sharded<system_keyspace> reference
Storage service uses system keyspace on boot heavily

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-16 14:24:40 +03:00
Pavel Emelyanov
190385551c storage_service: Relax operation modes switch
The set_mode() tries to combine mode switching and extended logging,
but there are no places left that do need this flexibility. It's
simpler and nicer to make set_mode() _just_ switch the mode and
log some generic "entering ... mode" message.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-07 13:29:47 +03:00
Pavel Emelyanov
0941098b39 storage_service: Remove _ms_stopped
This boolean protects do_stop_ms from re-entrancy. However, this
method is only called from stop_transport(), which handles re-entry
itself, so _ms_stopped can simply be removed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-07 13:29:47 +03:00
Pavel Emelyanov
74212286f8 storage_service: Remove _is_bootstrap_mode
This "state" is the sub-state of the STARTING mode that's activated
when the storage_service::bootstrap() is called. Instead of the
separate boolean the new mode can be used. To stop it from reverting
the BOOTSTRAP mode back to JOINING some calls to set_mode() should
be converted into regular logging.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-07 13:29:47 +03:00
Pavel Emelyanov
dbaca825ec storage_service: Remove _initialized and is_initialized()
This bit is hairy. First, it indicates that the storage service
entered the init_server() method. But, once the node is up and
running it also indicates whether the gossiper is enabled or not
via the API call.

To rely on the operation mode, first, the NONE mode is introduced
at which the server starts. Then in init_server() it switches to
STARTING.

The second change is to stop using the bit in the enable/disable gossiper
API call and instead check gossiper.is_enabled() itself.

To keep the is_initialized API call compatible, when the operation
mode is NORMAL it would return true/false according to the status
of the gossiper. This change is simple because storage service API
handlers already have the gossiper instance hanging around.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-07 13:29:47 +03:00
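
The compatibility rule described in dbaca825ec might look roughly like the sketch below. The enumerator list and its order are assumptions based on the mode names mentioned in these messages, and the `m != mode::NONE` fallback for non-NORMAL modes is an inference from "NONE is the mode before init_server()", not something the message states outright.

```cpp
#include <cassert>

// Hypothetical subset of the operation modes, in an assumed order.
enum class mode { NONE, STARTING, JOINING, BOOTSTRAP, NORMAL };

// Sketch of the rule for the is_initialized API call.
constexpr bool is_initialized(mode m, bool gossiper_enabled) noexcept {
    if (m == mode::NORMAL) {
        return gossiper_enabled;  // up and running: follow the gossiper status
    }
    return m != mode::NONE;       // assumption: past NONE means init_server() was entered
}

static_assert(!is_initialized(mode::NONE, true));
static_assert(is_initialized(mode::NORMAL, true));
```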
Pavel Emelyanov
ffbfa3b542 storage_service: Remove _joined and is_joined()
The is_joined() status can be obtained with get_operation_mode(). Since
it indicates that the operation mode is JOINING, NORMAL or anything
above, the operation mode enum class should be reordered to allow
a simple >= comparison.

Another needed change is to set the mode a few steps earlier than it
happens now, to cover the non-bootstrap startup case.

And the third change is to partially revert d49aa7ab, which made
the .is_joined() method future-less. Nowadays is_joined() is
called only from the API, which already deals with futures in
all other storage service state checks.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-07 13:29:47 +03:00
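
The ordering trick from ffbfa3b542 can be sketched as follows. The enumerators and their order are assumptions pieced together from the mode names in these messages; the real enum in storage_service may differ. The same idea covers the `<= STARTING` check described in ca03fd3145.

```cpp
#include <cassert>

// Hypothetical ordering; later enumerators mean "further along" in the lifecycle.
enum class mode { NONE, STARTING, JOINING, BOOTSTRAP, NORMAL, LEAVING };

// "Joined" is JOINING or anything later in the enum order.
constexpr bool is_joined(mode m) noexcept {
    return m >= mode::JOINING;
}

// "Starting" covers STARTING and everything before it (here: NONE).
constexpr bool is_starting(mode m) noexcept {
    return m <= mode::STARTING;
}

static_assert(is_starting(mode::NONE));
static_assert(!is_joined(mode::STARTING));
static_assert(is_joined(mode::NORMAL));
```

Scoped enums compare by their underlying values, so the check stays a single comparison as long as new modes are inserted at the right position in the list.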
Pavel Emelyanov
ca03fd3145 storage_service: Replace is_starting() with get_operation_mode()
This is a trivial change, since the only user is in the API and the
get_operation_mode + mode values are at hand.

One thing to pay attention to -- the new method checks the mode to
be <= STARTING, not for equality. For now this is an equivalent change,
but the next patch will introduce the NONE mode, which should be
reported as is_starting() too.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-07 13:29:47 +03:00
Pavel Emelyanov
c385fe7d79 storage_service: Make get_operation_mode() return mode itself
Currently it reports back the formatted mode. For future convenience
it needs to return the raw value, all the more so since the mode enum
class is already public.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-07 13:29:47 +03:00
Pavel Emelyanov
d5b75a24a5 storage_service: Out-line schema waiting code
And coroutinize while moving. No other changes.

While the code in question runs in a thread context and can enjoy
synchronous .get() calls, it's still better if it doesn't make any
assumptions about its environment. The ring joining code is changing
and new intermediate helpers should better be on the safe side from
the very beginning.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-02 11:53:22 +03:00
Pavel Emelyanov
3ea7539d27 storage_service: Make int delay be std::chrono::milliseconds
It's milliseconds and is converted back and forth in join_token_ring().
Having a chrono type for it makes things (mostly code reading) simpler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-03-02 11:51:47 +03:00
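
As a rough illustration of why a chrono type reads better than a bare int (3ea7539d27), consider the sketch below. The helper `remaining_delay` is hypothetical and is not taken from join_token_ring().

```cpp
#include <cassert>
#include <chrono>

using namespace std::chrono_literals;

// Hypothetical helper: with a chrono parameter the unit is part of the
// signature, and callers may pass seconds without manual conversion.
constexpr std::chrono::milliseconds remaining_delay(std::chrono::milliseconds delay,
                                                    std::chrono::milliseconds elapsed) {
    return elapsed >= delay ? 0ms : delay - elapsed;
}

static_assert(remaining_delay(30s, 10s) == 20000ms);
static_assert(remaining_delay(500ms, 2s) == 0ms);
```

The conversion from seconds to milliseconds is implicit because it is lossless; going the other way would require an explicit `std::chrono::duration_cast`.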
Pavel Emelyanov
66b9a53808 database: Move is_replacing() and get_replace_address() (back) into storage_service
Both helpers (naturally) used to be storage-service methods, but then
were moved to database because the bootstrapper code wanted to know this
info. Now the bootstrapper is equipped with the necessary arguments.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-02-07 12:43:08 +03:00
Benny Halevy
71a9524175 storage_service: no need to include utils/serialized_action.hh
2022-02-02 14:42:05 +02:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes were applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Pavel Solodovnikov
6aeccbb3b8 service: storage_service: coroutinize leave_ring
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:37:47 +03:00
Pavel Solodovnikov
648c79347a service: storage_service: coroutinize handle_state_left
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:37:47 +03:00
Pavel Solodovnikov
b23c19bfb6 service: storage_service: coroutinize handle_state_leaving
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:37:47 +03:00
Pavel Solodovnikov
99195d637d service: storage_service: coroutinize handle_state_removing
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:37:47 +03:00
Pavel Solodovnikov
1593507f32 service: storage_service: coroutinize shutdown_protocol_servers
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
0bee6976e3 service: storage_service: coroutinize excise
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
c7d2a09424 service: storage_service: coroutinize remove_endpoint
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
210c482c4f service: storage_service: coroutinize handle_state_replacing
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
adfc8f8346 service: storage_service: coroutinize handle_state_normal
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
ba113439de service: storage_service: coroutinize update_peer_info
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
b46ebd4fe5 service: storage_service: coroutinize do_update_system_peers_table
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
f8dbaa3722 service: storage_service: coroutinize handle_state_bootstrap
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
f0f4a74817 service: storage_service: futurize notify_* functions
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
9edf2182ab service: storage_service: coroutinize handle_state_replacing_update_pending_ranges
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Pavel Solodovnikov
5dcfb94d5a gms: i_endpoint_state_change_subscriber: make callbacks to return futures
Coroutinize a few simple callbacks in the process.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-01-11 09:29:12 +03:00
Avi Kivity
bbad8f4677 replica: move ::database, ::keyspace, and ::table to replica namespace
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.

References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.

scylla-gdb.py is adjusted to look for both the new and old names.
2022-01-07 12:04:38 +02:00
Avi Kivity
ae3a360725 database: Move database, keyspace, table classes to replica/ directory
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.

As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
2022-01-06 17:07:30 +02:00
Asias He
eba4a4fba4 repair: Allow ignoring dead nodes for replace operation
Consider

1) n1, n2, n3, n4, n5
2) n2 and n3 are both down
3) start n6 to replace n2
4) start n7 to replace n3

We want to replace the dead nodes n2 and n3 to fix the cluster to have 5
running nodes.

Replace operation in step 3 will fail because n3 is down.
We would see errors like below:

replace[25edeec0-57d4-11ec-be6b-7085c2409b2d]: Nodes={127.0.0.3} needed
for replace operation are down. It is highly recommended to fix the down
nodes and try again.

In the above example, currently, there is no way to replace any of the
dead nodes.

Users can either fix one of the dead nodes and run replace or run
removenode operation to remove one of the dead nodes then run replace
and run bootstrap to add another node.

Fixing dead nodes is always the best solution but it might not be
possible. Running removenode operation is not better than running
replace operation (with best effort by ignoring the other dead node) in
terms of data consistency. In addition, users have to run bootstrap
operation to add back the removed node. So, allowing replacing in such
case is a clear win.

This patch adds the --ignore-dead-nodes-for-replace option to allow running
the replace operation in best-effort mode. Please note: use this option
only if the dead nodes are completely broken and down, and there is no
way to fix them and bring them back. This also means the user has to
make sure the ignored dead nodes specified are really down to avoid any
data consistency issues.

Fixes #9757

Closes #9758
2021-12-20 00:49:03 +02:00
Gleb Natapov
f25424edcd storage_service: remove unused function.
is_auto_bootstrap() function is no longer used.

Message-Id: <YbCVXPI4hE8wgT4T@scylladb.com>
2021-12-08 13:55:32 +02:00
Pavel Emelyanov
e4f35e2139 migration_manager: Eliminate storage service from passive announcing
Currently storage service acts as a glue between database schema value
and the migration manager "passive_announce" call. This interposing is
not required, migration manager can do all the management itself, and
the linkage can be done in main.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-02 19:43:30 +02:00
Konstantin Osipov
c22f945f11 raft: (service) manage Raft configuration during topology changes
Operations of adding or removing a node to Raft configuration
are made idempotent: they do nothing if already done, and
they are safe to resume after a failure.

However, since topology changes are not transactional, if a
bootstrap or removal procedure fails midway, Raft group 0
configuration may go out of sync with topology state as seen by
gossip.

In future we must change gossip to avoid making any persistent
changes to the cluster: all changes to persistent topology state
will be done exclusively through Raft Group 0.

Specifically, instead of persisting the tokens by advertising
them through gossip, the bootstrap will commit a change to a system
table using Raft group 0. nodetool will switch from looking at
gossip-managed tables to consulting with Raft Group 0 configuration
or Raft-managed tables.
Once this transformation is done, naturally, adding a node to Raft
configuration (perhaps as a non-voting member at first) will become the
first persistent change to ring state applied when a node joins;
removing a node from the Raft Group 0 configuration will become the last
action when removing a node.

Until this is done, do our best to avoid a cluster state when
a removed node or a node whose addition failed is stuck in Raft
configuration, but the node is no longer present in gossip-managed
system tables. In other words, keep the gossip the primary source of
truth. For this purpose, carefully choose the timing when we
join and leave Raft group 0:

Join the Raft group 0 only after we've advertised our tokens, so the
cluster is aware of this node, it's visible in nodetool status,
but before node state jumps to "normal", i.e. before it accepts
queries. Since the operation is idempotent, invoke it on each
restart.

Remove the node from Group 0 *before* its tokens are removed
from gossip-managed system tables. This guarantees
that if removal from Raft group 0 fails for whatever reason,
the node stays in the ring, so nodetool removenode and
friends are re-tried.

Add tracing.
2021-11-25 12:35:42 +03:00
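
The idempotency property that c22f945f11 relies on can be sketched with a toy configuration object. `group0_config` and its methods are hypothetical stand-ins and not the Raft service API; the point is only that repeating an add or a remove leaves the state unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical stand-in for the Raft group 0 configuration.
class group0_config {
    std::set<std::string> _members;
public:
    // Adding a member that is already present is a no-op, so the call
    // can be repeated on every restart or after a partial failure.
    void add(const std::string& node) { _members.insert(node); }

    // Removing an absent member is likewise a no-op.
    void remove(const std::string& node) { _members.erase(node); }

    bool contains(const std::string& node) const { return _members.count(node) != 0; }
    std::size_t size() const { return _members.size(); }
};
```

Because both operations are safe to repeat, a join can be invoked on each restart and a failed removal can simply be retried, which matches the ordering described in the message above.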
Pavel Emelyanov
390a971bd8 storage_service: Sanitize streaming shutdown
Use local reference and don't use 'is_stopped' boolean as the
whole stop_transport is guarded with its own lock.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:17:37 +03:00
Pavel Emelyanov
aaa58b7b89 storage_service: Keep streaming_manager reference
The manager is drained() on drain/decommission/isolate. Since now
it's storage_service who orchestrates all of the above, it needs
an explicit reference on the target.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:17:35 +03:00
Benny Halevy
9cde52c58f storage_service: keep a reference to the batchlog_manager
Rather than accessing the global batchlog_manager.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-23 08:27:30 +02:00