Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key. If the same mutations were compacted
prior to the query, because the tombstones expire, we won't get any
mutation at all and won't feed the partition key. So schema digest
will change once an empty partition of some schema table is compacted
away.
Tombstones expire 7 days after schema change which introduces them. If
one of the nodes is restarted after that, it will compute a different
table schema digest on boot. This may cause performance problems. When
sending a request from coordinator to replica, the replica needs
schema_ptr of exact schema version request by the coordinator. If it
doesn't know that version, it will request it from the coordinator and
perform a full schema merge. This adds latency to every such request.
Schema versions which are not referenced are currently kept in cache
for only 1 second, so if request flow has low-enough rate, this
situation results in perpetual schema pulls.
After ae8d2a550d, it is more liekly to
run into this situation, because table creation generates tombstones
for all schema tables relevant to the table, even the ones which
will be otherwise empty for the new table (e.g. computed_columns).
This change inroduces a cluster feature which when enabled will change
digest calculation to be insensitive to expiry by ignoring empty
partitions in digest calculation. When the feature is enabled,
schema_ptrs are reloaded so that the window of discrepancy during
transition is short and no rolling restart is required.
A similar problem was fixed for per-node digest calculation in
18f484cc753d17d1e3658bcb5c73ed8f319d32e8. Per-table digest calculation
was not fixed at that time because we didn't persist enabled features
and they were not enabled early-enough on boot for us to depend on
them in digest calculation. Now they are enabled before non-system
tables are loaded so digest calculation can rely on cluster features.
Fixes#4485.
Will recreate schema_ptr's from schema tables like during table
alter. Will be needed when digest calculation changes in reaction to
cluster feature at run time.
When we upgrade a cluster to use Raft, or perform manual Raft recovery
procedure (which also creates a fresh group 0 cluster, using the same
algorithm as during upgrade), we start with a non-empty group 0 state
machine; in particular, the schema tables are non-empty.
In this case we need to ensure that nodes which join group 0 receive the
group 0 state. Right now this is not the case. In previous releases,
where group 0 consisted only of schema, and schema pulls were also done
outside Raft, those nodes received schema through this outside
mechanism. In 91f609d065 we disabled
schema pulls outside Raft; we're also extending group 0 with other
things, like topology-specific state.
To solve this, we force snapshot transfers by setting the initial
snapshot index on the first group 0 server to `1` instead of `0`. During
replication, Raft will see that the joining servers are behind,
triggering snapshot transfer and forcing them to pull group 0 state.
It's unnecessary to do this for cluster which bootstraps with Raft
enabled right away but it also doesn't hurt, so we keep the logic simple
and don't introduce branches based on that.
Extend Raft upgrade tests with a node bootstrap step at the end to
prevent regressions (without this patch, the step would hang - node
would never join, waiting for schema).
Fixes: #14066Closes#14336
This series aims at hardening schema merges and preventing inconsistencies across shards by
updating the database shards before calling the notification callback.
As seen in #13137, we don't want to call the notifications on all shards in parallel while the database shards are in flux.
In addition, any error to update the keyspace will cause abort so not to leave the database shards in an inconsistent state .
Other changes optimize this path by:
- updating shard 0 first, to seed the effective_replication_map.
- executing `storage_service::keyspace_changed` only once, on shard 0 to prevent quadratic update of the token_metadata and e_r_m on every keyspace change.
Fixes#13137Closes#14158
* github.com:scylladb/scylladb:
migration_manager: propagate listener notification exceptions
storage_service: keyspace_changed: execute only on shard 0
database: modify_keyspace_on_all_shards: execute func first on shard 0
database: modify_keyspace_on_all_shards: call notifiers only after applying func on all shards
database: add modify_keyspace_on_all_shards
schema_tables: merge_keyspaces: extract_scylla_specific_keyspace_info for update_keyspace
database: create_keyspace_on_all_shards
database: update_keyspace_on_all_shards
database: drop_keyspace_on_all_shards
`handle_state_normal` may drop connections to the handled node. This
causes spurious failures if there's an ongoing concurrent operation.
This problem was already solved twice in the past in different contexts:
first in 53636167ca, then in
79ee38181c.
Time to fix it for the third time. Now we do this right after enabling
gossiping, so hopefully it's the last time.
This time it's causing snapshot transfer failures in group 0. Although
the transfer is retried and eventually succeeds, the failed transfer is
wasted work and causes an annoying ERROR message in the log which
dtests, SCT, and I don't like.
The fix is done by moving the `wait_for_normal_state_handled_on_boot()`
call before `setup_group0()`. But for the wait to work correctly we must
first ensure that gossiper sees an alive node, so we precede it with
`wait_for_live_node_to_show_up()` (before this commit, the call site of
`wait_for_normal_state_handled_on_boot` was already after this wait).
There is another problem: the bootstrap procedure is racing with gossiper
marking nodes as UP, and waiting for other nodes to be NORMAL doesn't guarantee
that they are also UP. If gossiper is quick enough, everything will be fine.
If not, problems may arise such as streaming or repair failing due to nodes
still being marked as DOWN, or the CDC generation write failing.
In general, we need all NORMAL nodes to be up for bootstrap to proceed.
One exception is replace where we ignore the replaced node. The
`sync_nodes` set constructed for `wait_for_normal_state_handled_on_boot`
takes this into account, so we also use it to wait for nodes to be UP.
As explained in commit messages and comments, we only do these
waits outside raft-based-topology mode.
This should improve CI stability.
Fixes: #12972
Refs: #14042Closes#14354
* github.com:scylladb/scylladb:
messaging_service: print which connections are dropped due to missing topology info
storage_service: wait for nodes to be UP on bootstrap
storage_service: wait for NORMAL state handler before `setup_group0()`
storage_service: extract `gossiper::wait_for_live_nodes_to_show_up()`
Since most group0 commands are just mutations it is easy to combine them
before passing them to a subsystem they destined to since it is more
efficient. The logic that handles those mutations in a subsystem will
run once for each batch of commands instead of for each individual
command. This is especially useful when a node catches up to a leader and
gets a lot of commands together.
The patch here does exactly that. It combines commands into a single
command if possible, but it preserves an order between commands, so each
time it encounters a command to a different subsystem it flushes already
combined batch and starts a new one. This extra safety assumes that
there are dependencies between subsystems managed by group0, so the order
matters. It may be not the case now, but we prefer to be on a safe side.
Broadcast table commands are not mutations, so they are never combined.
* 'raft-merge-cmds' of https://github.com/gleb-cloudius/scylla:
test: add test for group0 raft command merging
service: raft: respect max mutation size limit when persisting raft entries
group0_state_machine: merge commands before applying them whenever possible
The bootstrap procedure is racing with gossiper marking nodes as UP.
If gossiper is quick enough, everything will be fine.
If not, problems may arise such as streaming or repair failing due to
nodes still being marked as DOWN, or the CDC generation write failing.
In general, we need all NORMAL nodes to be up for bootstrap to proceed.
One exception is replace where we ignore the replaced node. The
`sync_nodes` set constructed for `wait_for_normal_state_handled_on_boot`
takes this into account, so we use it.
Refs: #14042
This doesn't completely fix#14042 yet becasue it's specific to
gossiper-based topology mode only. For Raft-based topology, the node
joining procedure will be coordinated by the topology coordinator right
from the start and it will be the coordinator who issues the 'wait for
node to see other live nodes'.
`handle_state_normal` may drop connections to the handled node. This
causes spurious failures if there's an ongoing concurrent operation.
This problem was already solved twice in the past in different contexts:
first in 53636167ca, then in
79ee38181c.
Time to fix it for the third time. Now we do this right after enabling
gossiping, so hopefully it's the last time.
This time it's causing snapshot transfer failures in group 0. Although
the transfer is retried and eventually succeeds, the failed transfer is
wasted work and causes an annoying ERROR message in the log which
dtests, SCT, and I don't like.
The fix is done by moving the `wait_for_normal_state_handled_on_boot()`
call before `setup_group0()`. But for the wait to work correctly we must
first ensure that gossiper sees an alive node, so we precede it with
`wait_for_live_node_to_show_up()` (before this commit, the call site of
`wait_for_normal_state_handled_on_boot` was already after this wait).
We do it only in non-raft-topology mode, because with Raft-based
topology, node state changes are propagated to the cluster through
explicit global barriers and we plan to remove node statuses from
gossiper altogether.
Fixes: #12972
This piece of `storage_service::wait_for_ring_to_settle()` will be
performed earlier in the boot procedure in follow-up commits.
Make it more generic, to be able to wait for `n` nodes to show up. Here
we wait for `2` nodes - ourselves and at least one other.
Add a test that submits 3 large commands each one a little bit larger
than 1/3 of maximum mutation size. Check that in the end 2 command were
executed (first 2 were merged and third was executed separately).
The code that preserves raft entries builds one batch statement to store
all of them, but the butch's statement execute() merges all of the
statements into one mutation and passes it to the database. The mutation
can be larger than max mutation size limit and the write will fail. Fix
it by splitting the write to multiple batch statements if needed.
Since most group0 commands are just mutations it is easy to combine them
before passing them to a subsystem they destined to since it is more
efficient. The logic that handles those mutations in a subsystem will
run once for each batch of commands instead of for each individual
command. This is especially useful when a node catches up to a leader and
gets a lot of commands together.
The patch here does exactly that. It combines commands into a single
command if possible, but it preserves an order between commands, so each
time it encounters a command to a different subsystem it flushes already
combined batch and starts a new one. This extra safety assumes that
there are dependencies between subsystems managed by group0, so the order
matters. It may be not the case now, but we prefer to be on a safe side.
Broadcast table commands are not mutations, so they are never combined.
Fixes: #12581
1e29b07e40 claimed
to make event notification exception safe,
but swallawing the exceptions isn't safe at all,
as this might leave the node in an inconsistent state
if e.g. storage_service::keyspace_changed fails on any of the
shards. Propagating the exception here will cause abort,
but it is better than leaving the node up, but in an
inconsistent state.
We keep notifying other listeners even if any of them failed
Based on 1e29b07e40:
```
If one of the listeners throws an exception, we must ensure that other
listeners are still notified.
```
The decision about swallowing exceptions can't be
made in such a generic layer.
Specific notification listeners that may ignore exceptions,
like in transport/evenet_notifier, may decide to swallow their
local exceptions on their own (as done in this patch).
Refs #3389
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Previously all shards called `update_topology_change_info`
which in turn calls `mutate_token_metadata`, ending up
in quadratic complexity.
Now that the notifications are called after
all database shards are updated, we can apply
the changes on token metadata / effective replication map
only on shard 0 and count on replicate_to_all_cores to
propagate those changes to all other shards.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The current Seastar RPC infrastructure lacks support
for null values in tuples in handler responses.
In this commit we add the make_default_rpc_tuple function,
which solves the problem by returning pointers to
default-constructed values for smart pointer types
rather than nulls.
The problem was introduced in this commit
2d791a5ed4. The
function `encode_replica_exception_for_rpc` used
`default_tuple_maker` callback to create tuples
containing exceptions. Callers returned pointers
to default-constructed values in this callback,
e.g. `foreign_ptr(make_lw_shared<reconcilable_result>())`.
The commit changed this to just `SourceTuple{}`,
which means nullptr for pointer types.
Fixes: #14282Closes#14352
The series contains mostly cleanups for query processor and no functional
change. The last patch is a small cleanup for the storage_proxy.
* 'qp-cleanup' of https://github.com/gleb-cloudius/scylla:
storage_proxy: remove unused variable
client_state: co-routinise has_column_family_access function
query_processor: get rid of internal_state and create individual query_satate for each request
cql3: move validation::validate_column_family from client_state::has_column_family_access
client_state: drop unneeded argument from has.*access functions
cql3: move check for dropping cdc tables from auth to the drop statement code itself
query_processor: co-routinise execute_prepared_without_checking_exception_message function
query_processor: co-routinize execute_direct_without_checking_exception_message function
cql3: remove empty statement::validate functions
cql3: remove empty function validate_cluster_support
cql3/statements: fix indentation and spurious white spaces
query_processor: move statement::validate call into execute_with_params function
query_processor: co-routinise execute_with_params function
query_processor: execute statement::validate before each execution of internal query instead of only during prepare
query_processor: get rid of shared internal_query_state
query_processor: co-routinize execute_paged_internal function
query_processor: co_routinize execute_batch_without_checking_exception_message function
query_processor: co-routinize process_authorized_statement function
Checking keyspace/table presence should not be part of authorization code
and it is not done consistently today. For instance keyspace presence
is not checked in "alter keyspace" during authorization, but during
statement execution. Make it consistent.
Checking if a table is CDC log and cannot be dropped should not be done
as part of authentication (this has nothing to do with auth), but in the
drop statement itself. Throwing unauthorized_exception is wrong as well,
but unfortunately it is enshrined with a test. Not sure if it is a good
idea to change it now.
There was a bug that caused aggregates to fail when used on column-sensitive columns.
For example:
```cql
SELECT SUM("SomeColumn") FROM ks.table;
```
would fail, with a message saying that there is no column "somecolumn".
This is because the case-sensitivity got lost on the way.
For non case-sensitive column names we convert them to lowercase, but for case sensitive names we have to preserve the name as originally written.
The problem was in `forward_service` - we took a column name and created a non case-sensitive `column_identifier` out of it.
This converted the name to lowercase, and later such column couldn't be found.
To fix it, let's make the `column_identifier` case-sensitive.
It will preserve the name, without converting it to lowercase.
Fixes: https://github.com/scylladb/scylladb/issues/14307Closes#14340
* github.com:scylladb/scylladb:
service/forward_service.cc: make case-sensitivity explicit
cql-pytest/test_aggregate: test case-sensitive column name in aggregate
forward_service: fix forgetting case-sensitivity in aggregates
Use the new Seastar functionality for storing references to connections to implement banning hosts that have left the cluster (either decommissioned or using removenode) in raft-topology mode. Any attempts at communication from those nodes will be rejected.
This works not only for nodes that restart, but also for nodes that were running behind a network partition and we removed them. Even when the partition resolves, the existing nodes will effectively put a firewall from that node.
Some changes to the decommission algorithm had to be introduced for it to work with node banning. As a side effect a pre-existing problem with decommission was fixed. Read the "introduce `left_token_ring` state" and "prepare decommission path for node banning" commits for details.
Closes#13850
* github.com:scylladb/scylladb:
test: pylib: increase checking period for `get_alive_endpoints`
test: add node banning test
test: pylib: manager_client: `get_cql()` helper
test: pylib: ScyllaCluster: server pause/unpause API
raft topology: ban left nodes
raft topology: skip `left_token_ring` state during `removenode`
raft topology: prepare decommission path for node banning
raft topology: introduce `left_token_ring` state
raft topology: `raft_topology_cmd` implicit constructor
messaging_service: implement host banning
messaging_service: exchange host IDs and map them to connections
messaging_service: store the node's host ID
messaging_service: don't use parameter defaults in constructor
main: move messaging_service init after system_keyspace init
Make it explicit that the boolean argument determines case-sensitivity. It emphasizes its importance.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
There was a bug that caused aggregates to fail when
used on column-sensitive columns.
For example:
```
SELECT SUM("SomeColumn") FROM ks.table;
```
would fail, with a message saying that there
is no column "somecolumn".
This is because the case-sensitivity got lost on the way.
For non case-sensitive column names we convert them to lowercase,
but for case sensitive names we have to preserve the name
as originally written.
The problem was in `forward_service` - we took a column name
and created a non case-sensitive `column_identifier` out of it.
This converted the name to lowercase, and later such column
couldn't be found.
To fix it, let's make the `column_identifier` case-sensitive.
It will preserve the name, without converting it to lowercase.
Fixes: https://github.com/scylladb/scylladb/issues/14307
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
This PR changes the system to respect shard assignment to tablets in tablet metadata (system.tablets):
1. The tablet allocator is changed to distribute tablets evenly across shards taking into account currently allocated tablets in the system. Each tablet has equal weight. vnode load is ignored.
2. CDC subsystem was not adjusted (not supported yet)
3. sstable sharding metadata reflects tablet boundaries
5. resharding is NOT supported yet (the node will abort on boot if there is a need to reshard tablet-based tables)
6. The system is NOT prepared to handle tablet migration / topology changes in a safe way.
7. Sstable cleanup is not wired properly yet
After this PR, dht::shard_of() and schema::get_sharder() are deprecated. One should use table::shard_of() and effective_replication_map::get_sharder() instead.
To make the life easier, support was added to obtain table pointer from the schema pointer:
```
schema_ptr s;
s->table().shard_of(...)
```
Closes#13939
* github.com:scylladb/scylladb:
locator: network_topology_startegy: Allocate shards to tablets
locator: Store node shard count in topology
service: topology: Extract topology updating to a lambda
test: Move test_tablets under topology_experimental
sstables: Add trace-level logging related to shard calculation
schema: Catch incorrect uses of schema::get_sharder()
dht: Rename dht::shard_of() to dht::static_shard_of()
treewide: Replace dht::shard_of() uses with table::shard_of() / erm::shard_of()
storage_proxy: Avoid multishard reader for tablets
storage_proxy: Obtain shard from erm in the read path
db, storage_proxy: Drop mutation/frozen_mutation ::shard_of()
forward_service: Use table sharder
alternator: Use table sharder
db: multishard: Obtain sharder from erm
sstable_directory: Improve trace-level logging
db: table: Introduce shard_of() helper
db: Use table sharder in compaction
sstables: Compute sstable shards using sharder from erm when loading
sstables: Generate sharding metadata using sharder from erm when writing
test: partitioner: Test split_range_to_single_shard() on tablet-like sharder
dht: Make split_range_to_single_shard() prepared for tablet sharder
sstables: Move compute_shards_for_this_sstable() to load()
dht: Take sharder externally in splitting functions
locator: Make sharder accessible through effective_replication_map
dht: sharder: Document guarantees about mapping stability
tablets: Implement tablet sharder
tablets: Include pending replica in get_shard()
dht: sharder: Introduce next_shard()
db: token_ring_table: Filter out tablet-based keyspaces
db: schema: Attach table pointer to schema
schema_registry: Fix SIGSEGV in learn() when concurrent with get_or_load()
schema_registry: Make learn(schema_ptr) attach entry to the target schema
test: lib: cql_test_env: Expose feature_service
test: Extract throttle object to separate header
Fixes#11017
When doing writes, storage proxy creates types deriving from abstract_write_response_handler.
These are created in the various scheduling groups executing the write inducing code. They
pick up a group-local reference to the various metrics used by SP. Normally all code
using (and esp. modifying) these metrics are executed in the same scheduling group.
However, if gossip sees a node go down, it will notify listeners, which eventually
calls get_ep_stat and register_metrics.
This code (before this patch) uses _active_ scheduling group to eventually add
metrics, using a local dict as guard against double regs. If, as described above,
we're called in a different sched group than the original one however, this
can cause double registrations.
Fixed here by keeping a reference to creating scheduling group and using this, not
active one, when/if creating new metrics.
Closes#14294
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
Currently, the coordinator splits the partition range at vnode (or
tablet) boundaries and then tries to merge adjacent ranges which
target the same replica. This is an optimization which makes less
sense with tablets, which are supposed to be of substantial size. If
we don't merge the ranges, then with tablets we can avoid using the
multishard reader on the replica side, since each tablet lives on a
single shard.
The main reason to avoid a multishard reader is avoiding its
complexity, and avoiding adapting it to work with tablet
sharding. Currently, the multishard reader implementation makes
several assumptions about shard assignment which do not hold with
tablets. It assumes that shards are assigned in a round-robin fashion.
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
schema::get_sharder() does not return the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
This PR implements the storage part of the cluster features on raft functionality, as described in the "Cluster features on raft v2" doc. These changes will be useful for later PRs that will implement the remaining parts of the feature.
Two new columns are added to `system.topology`:
- `supported_features set<text>` is a new clustering column which holds the features that given node advertises as supported. It will be first initialized when the node joins the cluster, and then updated every time the node reboots and its supported features set changes.
- `enabled_features set<text>` is a new static column which holds the features that are considered enabled by the cluster. Unlike in the current gossip-based implementation the features will not be enabled implicitly when all nodes support a feature, but rather via an explicit action of the topology coordinator.
These columns are reflected in the `topology_state_machine` structure and are populated when the topology state is loaded. Appropriate methods are added to the `topology_mutation_builder` and `topology_node_mutation_builder` in order to allow setting/modifying those columns.
During startup, nodes update their corresponding `supported_features` column to reflect their current feature set. For now it is done unconditionally, but in the future appropriate checks will be added which will prevent nodes from joining / starting their server for group 0 if they can't guarantee that they support all enabled features.
Closes#14232
* github.com:scylladb/scylladb:
storage_service: update supported cluster features in group0 on start
storage_service: add methods for features to topology mutation builder
storage_service: use explicit ::set overload instead of a template
storage_service: reimplement mutation builder setters
storage_service: introduce topology_mutation_builder_base
topology_state_machine: include information about features
system_keyspace: introduce deserialize_set_column
db/system_keyspace: add storage for cluster features managed in group 0
Now, when a node starts, it will update its `supported_features` row in
`system.topology` via `update_topology_with_local_metadata`.
At this point, the functionality behind cluster features on raft is
mostly incomplete and the state of the `supported_features` column does
not influence anything so it's safe to update this column
unconditionally. In the future, the node will only join / start group0
server if it is sure that it supports all enabled features and it can
safely update the `supported_features` parameter.
The newly added `supported_features` and `enabled_features` columns can
now be modified via topology mutation builders:
- `supported_features` can now be overwritten via a new overload of
`topology_node_mutation_builder::set`.
- `enabled_features` can now be extended (i.e. more elements can be
added to it) via `topology_mutation_builder::add_enabled_features`. As
the set of enabled features only grows, this should be sufficient.
The `topology_node_mutation_builder::set` function has an overload which
accepts any type which can be converted to string via `::format`. Its
presence can lead to easy mistakes which can only be detected at runtime
rather at compile time. A concrete example: I wrote a function that
accepts an std::set<S> where S is convertible to sstring; it turns out
that std::string_view is not std::convertible_to sstring and overload
resolution falled back to the catch-all overload.
This commit gets rid of the catch-all overload and replaces it with
explicit ones. Fortunately, it was used for only two enums, so it wasn't
much work.
As promised in the previous commit which introduced
topology_mutation_builder_base, this commit adjusts existing setters of
topology mutation builder and topology node mutation builder to use
helper methods defined in the base class.
Note that the `::set` method for the unordered set of tokens now does
not delete the column in case an empty value is set, instead it just
writes an empty set. This semantic is arguably more clear given that we
have an explicit `::del` method and it shouldn't affect the existing
implementation - we never intentionally insert an empty set of tokens.
Introduces `topology_mutation_builder_base` which will be a base class
for both topology mutation builder and topology node mutation builder.
Its purpose is to abstract away some detail about setting/deleting/etc.
column in the mutation, the actual topology (node) mutation builder will
only have to care about converting types and/or allowing only particular
columns to be set. The class is using CRTP: derived classes provide
access to the row being modified, schema and the timestamp.
For the sake of commit diff readability, this commt only introduces this
class and changes the builders to derive from it but no setter
implementations are modified - this will be done in the next commit.
The "tell the node to shut down" RPC would fail every time in the
removenode path (since the node is dead), which is kind of awkward.
Besides, for removenode we don't really need the `left_token_ring`
state, we don't need to coordinate with the node - writes destined for
it are failing anyway (since it's dead) and we can ban the node
immediately.
Remove the node from group 0 while in `write_both_read_new` transition
state (even when we implement abort, in this state it's too late to
abort, we're committed to removing the node - so it's fine to remove it
from group 0 at this point).
Currently the decommissioned node waits until it observes that it was
moved to the `left` state, then proceeds to leave group 0 and shut down.
Unfortunately, this strategy won't work once we introduce banning nodes
that are in `left` state - there is no guarantee that the
decommissioning node will observe that it entered `left` state. The
replication of Raft commands races with the ban propagating through the
cluster.
We also can't make the node leave as soon as it observes the
`left_token_ring` state, which would defeat the purpose of
`left_token_ring` - allowing all nodes to observe that the node has left
the token ring before it shuts down.
We could introduce yet another state between `left_token_ring` and
`left`, which the node waits for before shutting down; the coordinator
would request a barrier from the node before moving to `left` state.
The alternative - which we chose here - is to have the coordinator
explicitly tell the node to shutdown while we're in `left_token_ring`
through a direct RPC. We introduce
`raft_topology_cmd::command::shutdown` and send it to the node while in
`left_token_ring` state, after we requested a cluster barrier.
We don't require the RPC to succeed; we need to allow it to fail to
preserve availability. This is because an earlier incarnation of the
coordinator may have requested the node to shut down already, so the
new coordinator will fail the RPC as the node is already dead. This also
improves availability in general - if the node dies while we're in
`left_token_ring`, we can proceed.
We don't lose safety from that, since we'll ban the node (later commit).
We only lose a bit of user experience if there's a failure at this
decommission step - the decommissioning node may hang, never receiving
the RPC (it will be necessary to shut it down manually).
Another complication arising from banning the node is that it won't be
able to leave group 0 on its own; by the time it tries that, it may have
already been banned by the cluster (the coordinator moves the node to
`left` state after telling it to shut down). So we get rid of the
`leave_group0` step from `raft_decommission()` (which simplifies the
function too), putting a `remove_from_raft_config` inside the
coordinator code instead - after we told the node to shut down.
(Removing the node from configuration is also another reason why we need
to allow the above RPC to fail; the node won't be able to handle the
request once it's outside the configuration, because it handles all
coordinator requests by starting a read barrier.)
Finally, a complication arises when the coordinator is the
decommissioning node. The node would shut down in the middle of handling
the `left_token_ring` state, leading to harmless but awkward errors even
though there were no node/network failures (the original coordinator
would fail the `left_token_ring` state logic; a new coordinator would take
over and do it again, this time succeeding). We fix that by checking if
we're the decommissioning node at the beginning of `left_token_ring`
state handler, and if so, stepping down from leadership by becoming a
nonvoter first.
We want for the decommissioning node to wait before shutting down until
every node learns that it left the token ring. Otherwise some nodes may
still try coordinating writes to that nodes after it already shut down,
leading to unnecessary failures on the data path(e.g. for CL=ALL writes).
Before this change, a node would shut down immediately after observing
that it was in `left` state; some other nodes may still see it in
`decommissioning` state and the topology transition state as
`write_both_read_new`, so they'd try to write to that node.
After this change, the node first enters the `left_token_ring` state
before entering `left`, while the topology transition state is removed
(so we've finished the token ring change - the node no longer has tokens
in the ring, but it's still part of the topology). There we perform a
read barrier, allowing all nodes to observe that the decommissioning
node has indeed left the token ring. Only after that barrier succeeds we
allow the node to shut down.
Saves some redundant typing when passing `raft_topology_cmd` parameters,
so we can change this:
```
raft_topology_cmd{raft_topology_cmd::command::fence_old_reads}
```
into this:
```
raft_topology_cmd::command::fence_old_reads
```
This PR fixes some problems found after the PR was merged:
* missed `node_to_work_on` assignment in `handle_topology_transition`;
* change error reporting in `update_fence_version` from `on_internal_error` to regular exceptions, since that exceptions can happen during normal operation.
* `update_fence_version` has beed moved after `group0_service.setup_group0_if_exist` in `main.cc`, otherwise we use uninitialized `token_metadata::version` and get an error.
Fixes: #14303Closes#14292
* github.com:scylladb/scylladb:
main.cc: move update_fence_version after group0_service.setup_group0_if_exist
shared_token_metadata: update_fence_version: on_internal_error -> throw
storage_service: handle_topology_transition: fix missed node assignment
`query_partition_range_concurrent` implements an optimization when
querying a token range that intersects multiple vnodes. Instead of
sending a query for each vnode separately, it sometimes sends a single
query to cover multiple vnodes - if the intersection of replica sets for
those vnodes is large enough to satisfy the CL and good enough in terms
of the heat metric. To check the latter condition, the code would take
the smallest heat metric of the intersected replica set and compare them
to smallest heat metrics of replica sets calculated separately for each
vnode.
Unfortunately, there was an edge case that the code didn't handle: the
intersected replica set might be empty and the code would access an
empty range.
This was catched by an assertion added in
8db1d75c6c by the dtest
`test_query_dc_with_rf_0_does_not_crash_db`.
The fix is simple: check if the intersected set is empty - if so, don't
calculate the heat metrics because we can decide early that the
optimization doesn't apply.
Also change the `assert` to `on_internal_error`.
Fixes#14284Closes#14300
Another node can stop after it joined the group0 but before it
advertised itself in gossip. `get_inet_addrs` will try to resolve all
IPs and `wait_for_peers_to_enter_synchronize_state` will loop
indefinitely.
But `wait_for_peers_to_enter_synchronize_state` can return early if one
of the nodes confirms that the upgrade procedure has finished. For that,
it doesn't need the IPs of all group 0 members - only the IP of some
nodes which can do the confirmation.
This pr restructures the code so that IPs of nodes are resolved inside
the `max_concurrent_for_each` that
`wait_for_peers_to_enter_synchronize_state` performs. Then, even if some
IPs won't be resolved, but one of the nodes confirms a successful
upgrade, we can continue.
Fixes#13543Closes#14046
* github.com:scylladb/scylladb:
raft topology: test: check if aborted node replacing blocks bootstrap
raft topology: `wait_for_peers_to_enter_synchronize_state` doesn't need to resolve all IPs