Add a new destination_not_alive_error, thrown from two-way RPCs in case
when the RPC is not issued because the destination is not reported as
alive by the failure detector.
In snapshot transfer code, lower the verbosity of the message printed in
case it fails on the new error. This is done to prevent flakiness in the
CI - in case of slow runs, nodes might get spuriously marked as dead if
they are busy, and a message with the "error" verbosity can cause some
tests to fail.
Add a method which reports whether given raft server is running.
In following commits, the information about whether the local raft
group 0 is running or not will be included in the response to the
failure detector ping, and the is_alive method will be used there.
to have feature parity with `configure.py`. we won't need this
once we migrate to C++20 modules. but before that day comes, we
need to stick with C++ headers.
we generate a rule for each .hh files to create a corresponding
.cc and then compile it, in order to verify the self-containness of
that header. so the number of rule is quite large, to avoid the
unnecessary overhead. the check-header target is enabled only if
`Scylla_CHECK_HEADERS` option is enabled.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#15913
Hold a gate around all operations that are forwarded to a leader to be
able to wait for them during server::abort() otherwise the abort() may
complete while those operations are still running which may cause use
after free.
The handler for join_node_request will need to know which node is
considered the group 0 leader right now by the local node.
If the topology coordinator crashes and a new node immediately wants to
replace it with the same IP, the node that handles join_node_request
will attempt to perform a read barrier. If this happens quickly enough,
due to the IP reuse the RPC will be sent to the new node instead of the
(now crashed) topology coordinator; the RPC will get an error and will
fail the barrier.
If we detect that the new node wants to replace the current topology
coordinator, the upcoming join_node_request_handler will wait until
there is a leader change.
server_impl::apply_snapshot() assumes that it cannot receive a snapshots
from the same host until the previous one is handled and usually this is
true since a leader will not send another snapshot until it gets
response to a previous one. But it may happens that snapshot sending
RPC fails after the snapshot was sent, but before reply is received
because of connection disconnect. In this case the leader may send
another snapshot and there is no guaranty that the previous one was
already handled, so the assumption may break.
Drop the assert that verifies the assumption and return an error in this
case instead.
Fixes: #15222
Message-ID: <ZO9JoEiHg+nIdavS@scylladb.com>
`get_cdc_generation_mutations` splits data to mutations of maximal size
`mutation_size_treshold`. Before this commit it was hardcoded to 2 MB.
Calculate `mutation_size_threshold` to leave space for cdc generation
data and not exceed `max_command_size`.
When we upgrade a cluster to use Raft, or perform manual Raft recovery
procedure (which also creates a fresh group 0 cluster, using the same
algorithm as during upgrade), we start with a non-empty group 0 state
machine; in particular, the schema tables are non-empty.
In this case we need to ensure that nodes which join group 0 receive the
group 0 state. Right now this is not the case. In previous releases,
where group 0 consisted only of schema, and schema pulls were also done
outside Raft, those nodes received schema through this outside
mechanism. In 91f609d065 we disabled
schema pulls outside Raft; we're also extending group 0 with other
things, like topology-specific state.
To solve this, we force snapshot transfers by setting the initial
snapshot index on the first group 0 server to `1` instead of `0`. During
replication, Raft will see that the joining servers are behind,
triggering snapshot transfer and forcing them to pull group 0 state.
It's unnecessary to do this for cluster which bootstraps with Raft
enabled right away but it also doesn't hurt, so we keep the logic simple
and don't introduce branches based on that.
Extend Raft upgrade tests with a node bootstrap step at the end to
prevent regressions (without this patch, the step would hang - node
would never join, waiting for schema).
Fixes: #14066Closes#14336
Add a test that submits 3 large commands each one a little bit larger
than 1/3 of maximum mutation size. Check that in the end 2 command were
executed (first 2 were merged and third was executed separately).
There are some cases where we can deduce that the entry was committed,
but we were throwing `commit_status_unknown`. Handle one more such case.
The added comment explains it in detail.
Also add a FIXME for another case where we throw `commit_status_unknown`
but we could do better.
Fixes: #14029
CWG 2631 (https://cplusplus.github.io/CWG/issues/2631.html) reports
an issue on how the default argument is evaluated. this problem is
more obvious when it comes to how `std::source_location::current()`
is evaluated as a default argument. but not all compilers have the
same behavior, see https://godbolt.org/z/PK865KdG4.
notebaly, clang-15 evaluates the default argument at the callee
site. so we need to check the capability of compiler and fall back
to the one defined by util/source_location-compat.hh if the compiler
suffers from CWG 2631. and clang-16 implemented CWG2631 in
https://reviews.llvm.org/D136554. But unfortunately, this change
was not backported to clang-15.
before switching over to clang-16, for using std::source_location::current()
as the default parameter and expect the behavior defined by CWG2631,
we have to use the compatible layer provided by Seastar. otherwise
we always end up having the source_location at the callee side, which
is not interesting under most circumstances.
so in this change, all places using the idiom of passing
std::source_location::current() as the default parameter are changed
to use seastar::compat::source_location::current(). despite that
we have `#include "seastarx.h"` for opening the seastar namespace,
to disambiguate the "namespace compat" defined somewhere in scylladb,
the fully qualified name of
`seastar::compat::source_location::current()` is used.
see also 09a3c63345, where we used
std::source_location as an alias of std::experimental::source_location
if it was available. but this does not apply to the settings of our
current toolchain, where we have GCC-12 and Clang-15.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14086
this is a part of a series to migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print
- raft::server_address
- raft::config_member
- raft::configuration
without the help of `operator<<`.
the corresponding `operator<<()` is removed in this change, as all its
callers are now using fmtlib for formatting now.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13976
Change f5f566bdd8 introduced
tagged_integer and replaced raft::internal::tagged_uint64
with utils::tagged_integer.
However, the idl type for raft::internal::tagged_uint64
was not marked as final, but utils::tagged_integer is, breaking
the on-the-wire compatibility.
This change defines the different raft tagged_uint64
types in idl/raft_storage.idl.hh as non-final
to restore the way they were serialized prior to
f5f566bdd8Fixes#13752
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
otherwise GCC 13 complains that
```
/home/kefu/dev/scylladb/raft/server.cc:42:15: error: declaration of ‘seastar::promise<void> raft::awaited_index::promise’ changes meaning of ‘promise’ [-Wchanges-meaning]
42 | promise<> promise;
| ^~~~~~~
/home/kefu/dev/scylladb/raft/server.cc:42:5: note: used here to mean ‘class seastar::promise<void>’
42 | promise<> promise;
| ^~~~~~~~~
```
see also cd4af0c722
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Updated the empty() function in the struct fsm_output to include the
max_read_id_with_quorum field when checking whether the fsm output is
empty or not. The change was made in order maintain consistency with the
codebase and adding completeness to the empty check. This change has no
impact on other parts of the codebase.
Closes#13656
This PR introduces an experimental feature called "tablets". Tablets are
a way to distribute data in the cluster, which is an alternative to the
current vnode-based replication. Vnode-based replication strategy tries
to evenly distribute the global token space shared by all tables among
nodes and shards. With tablets, the aim is to start from a different
side. Divide resources of replica-shard into tablets, with a goal of
having a fixed target tablet size, and then assign those tablets to
serve fragments of tables (also called tablets). This will allow us to
balance the load in a more flexible manner, by moving individual tablets
around. Also, unlike with vnode ranges, tablet replicas live on a
particular shard on a given node, which will allow us to bind raft
groups to tablets. Those goals are not yet achieved with this PR, but it
lays the ground for this.
Things achieved in this PR:
- You can start a cluster and create a keyspace whose tables will use
tablet-based replication. This is done by setting `initial_tablets`
option:
```
CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy',
'replication_factor': 3,
'initial_tablets': 8};
```
All tables created in such a keyspace will be tablet-based.
Tablet-based replication is a trait, not a separate replication
strategy. Tablets don't change the spirit of replication strategy, it
just alters the way in which data ownership is managed. In theory, we
could use it for other strategies as well like
EverywhereReplicationStrategy. Currently, only NetworkTopologyStrategy
is augmented to support tablets.
- You can create and drop tablet-based tables (no DDL language changes)
- DML / DQL work with tablet-based tables
Replicas for tablet-based tables are chosen from tablet metadata
instead of token metadata
Things which are not yet implemented:
- handling of views, indexes, CDC created on tablet-based tables
- sharding is done using the old method, it ignores the shard allocated in tablet metadata
- node operations (topology changes, repair, rebuild) are not handling tablet-based tables
- not integrated with compaction groups
- tablet allocator piggy-backs on tokens to choose replicas.
Eventually we want to allocate based on current load, not statically
Closes#13387
* github.com:scylladb/scylladb:
test: topology: Introduce test_tablets.py
raft: Introduce 'raft_server_force_snapshot' error injection
locator: network_topology_strategy: Support tablet replication
service: Introduce tablet_allocator
locator: Introduce tablet_aware_replication_strategy
locator: Extract maybe_remove_node_being_replaced()
dht: token_metadata: Introduce get_my_id()
migration_manager: Send tablet metadata as part of schema pull
storage_service: Load tablet metadata when reloading topology state
storage_service: Load tablet metadata on boot and from group0 changes
db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata()
migration_notifier: Introduce before_drop_keyspace()
migration_manager: Make prepare_keyspace_drop_announcement() return a future<>
test: perf: Introduce perf-tablets
test: Introduce tablets_test
test: lib: Do not override table id in create_table()
utils, tablets: Introduce external_memory_usage()
db: tablets: Add printers
db: tablets: Add persistence layer
dht: Use last_token_of_compaction_group() in split_token_range_msb()
locator: Introduce tablet_metadata
dht: Introduce first_token()
dht: Introduce next_token()
storage_proxy: Improve trace-level logging
locator: token_metadata: Fix confusing comment on ring_range()
dht, storage_proxy: Abstract token space splitting
Revert "query_ranges_to_vnodes_generator: fix for exclusive boundaries"
db: Exclude keyspace with per-table replication in get_non_local_strategy_keyspaces_erms()
db: Introduce get_non_local_vnode_based_strategy_keyspaces()
service: storage_proxy: Avoid copying keyspace name in write handler
locator: Introduce per-table replication strategy
treewide: Use replication_strategy_ptr as a shorter name for abstract_replication_strategy::ptr_type
locator: Introduce effective_replication_map
locator: Rename effective_replication_map to vnode_effective_replication_map
locator: effective_replication_map: Abstract get_pending_endpoints()
db: Propagate feature_service to abstract_replication_strategy::validate_options()
db: config: Introduce experimental "TABLETS" feature
db: Log replication strategy for debugging purposes
db: Log full exception on error in do_parse_schema_tables()
db: keyspace: Remove non-const replication strategy getter
config: Reformat
A generic template for defining strongly typed
integer types.
Use it here to replace raft::internal::tagged_uint64.
Will be used for defining gms generation and version
as strong and distinguishable types in following patches.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
at least, we need to access the declarations of exceptions, like
`not_a_leader` and `dropped_entry`, so, instead of relying on
other header to do this job for us, we should include the header
which include the declaration. so, in this chance "raft.h" is
include explicitly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is a part of a series migrating from `operator<<(ostream&, ..)` based formatting to fmtlib based formatting. the goal here is to enable fmtlib to print UUID without using ostream<<. also, this change re-implements some formatting helpers using fmtlib for better performance and less dependencies on operator<<(), but we cannot drop it at this moment, as quite a few caller sites are still using operator<<(ostream&, const UUID&) and operator<<(ostream&, tagged_uuid<T>&). we will address them separately.
* add `fmt::formatter<UUID>`
* add `fmt::formatter<tagged_uuid<T>>`
* implement `UUID::to_string()` using `fmt::to_string()`
* implement `operator<<(std::ostream&, const UUID&)` with `fmt::print()`, this should help to improve the performance when printing uuid, as `fmt::print()` does not materialize a string when printing the uuid.
* treewide: use fmtlib when printing UUID
Refs #13245Closes#13246
* github.com:scylladb/scylladb:
treewide: use fmtlib when printing UUID
utils: UUID: specialize fmt::formatter for UUID and tagged_uuid<>
Add a function that allows waiting for a state change of a raft server.
It is useful for a user that wants to know when a node becomes/stops
being a leader.
Message-Id: <20230316112801.1004602-4-gleb@scylladb.com>
Make tick() and is_leader() part of the API. First is used externally
already and another will be used in following patches.
Message-Id: <20230316112801.1004602-3-gleb@scylladb.com>
this change tries to reduce the number of callers using operator<<()
for printing UUID. they are found by compiling the tree after commenting
out `operator<<(std::ostream& out, const UUID& uuid)`. but this change
alone is not enough to drop all callers, as some callers are using
`operator<<(ostream&, const unordered_map&)` and other overloads to
print ranges whose elements contain UUID. so in order to limit the
scope of the change, we are not changing them here.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The code for compare_endpoints originates at the dawn of time (bc034aeaec)
and is called on the fast path from storage_proxy via `sort_by_proximity`.
This series considerably reduces the function's footprint by:
1. carefully coding the many comparisons in the function so to reduce the number of conditional banches (apparently the compiler isn't doing a good enough job at optimizing it in this case)
2. avoid sstring copy in topology::get_{datacenter,rack}
Closes#12761
* github.com:scylladb/scylladb:
topology: optimize compare_endpoints
to_string: add print operators for std::{weak,partial}_ordering
utils: to_sstring: deinline std::strong_ordering print operator
move to_string.hh to utils/
test: network_topology: add test_topology_compare_endpoints
Clang complains that the captured `this` is not used, like
```
/home/kefu/dev/scylladb/raft/fsm.hh:644:21: error: lambda capture 'this' is not used [-Werror,-Wunused-lambda-capture]
auto visitor = [this, from, msg = std::move(msg)](const auto& state) mutable {
^
/home/kefu/dev/scylladb/raft/server.cc:738:11: note: in instantiation of function template specialization 'raft::fsm::step<raft::append_request>' requested here
_fsm->step(from, std::move(append_request));
^
```
but `step(..)` is a non-static member function of `fsm`, so `this`
is actually used. to silence Clang's warning, let's just reference it
explicitly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
After 5badf20c7a applier fiber does not
stop after it gets abort error from a state machine which may trigger an
assertion because previous batch is not applied. Fix it.
Fixes#12863
In add_entry_on_leader after wait_for_memory_permit() resolves but before
the fiber continue to run the node may stop becoming the leader and then
become a leader again which will cause currently hold units outdated.
Detect this case by checking the term after the preemption.
To trigger snapshot limit behavior provide an error injection to set
with one-shot.
Note this effectively changes it and there is no revert.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
The docs mention that method, but it doesn't exist. Instead, the
state_machine interface defines plain .apply() one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12541
raft_group0 used to register RPC verbs only on shard 0. This worked on
clusters with the same --smp setting on all nodes, since RPCs in this
case are processed on the same shard as the calling code, and
raft_group0 methods only run on shard 0.
A new test test_nodes_with_different_smp was added to identify the
problem. Since --smp can only be specified via the command line, a
corresponding parameter was added to the ManagerClient.server_add
method. It allows to override the default parameters set by the
SCYLLA_CMDLINE_OPTIONS variable by changing, adding or deleting
individual items.
Fixes: #12252Closes#12374
* github.com:scylladb/scylladb:
raft: raft_group0, register RPC verbs on all shards
raft: raft_append_entries, copy entries to the target shard
test.py, allow to specify the node's command line in test
If append_entries RPC was received on a non-zero shard, we may
need to pass it to a zero (or, potentially, some other) shard.
The problem is that raft::append_request contains entries in the form
of raft::log_entry_ptr == lw_shared_ptr<log_entry>, which doesn't
support cross-shard reference counting. In debug mode it contains
a special ref-counting facility debug_shared_ptr_counter_type,
which resorts to on_internal_error if it detects such a case.
To solve this, we just copy log entries to the target shard if it
isn't equal to the current one. In most cases, if --smp setting
is the same on all nodes, RPC will be handled on zero shard,
so there will be no overhead.
We recently (commit 6a5d9ff261) started
to use std::source_location instead of std::experimental::source_location.
However, this does not work on clang 14, because libc++ 12's
<source_location> only works if __builtin_source_location, and that is
not available on clang 14.
clang 15 is just three months old, and several relatively-recent
distributions still carry clang 14 so it would be nice to support it
as well.
So this patch adds a trivial compatibility header file, which, when
included and compiled with clang 14, it aliases the functional
std::experimental::source_location to std::source_location.
It turns out it's enough to include the new header file from three
headers that included <source_location> - I guess all other uses
of source_location depend on those header files directly or indirectly.
We may later need to include the compatibility header file in additional
places, bug for now we don't.
Refs #12259
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#12265
Currently if a node that is outside of the config tries to add an entry
or modify config transient error is returned and this causes the node
to retry. But the error is not transient. If a node tries to do one of
the operations above it means it was part of the cluster at some point,
but since a node with the same id should not be added back to a cluster
if it is not in the cluster now it will never be.
Return a new error not_a_member to a caller instead.
Message-Id: <Y42mTOx8bNNrHqpd@scylladb.com>
Pass a change diff into the notification callback,
rather than add or remove servers one by one, so that
if we need to persist the state, we can do it once per
configuration change, not for every added or removed server.
For now still pass added and removed entries in two separate calls
per a single configuration change. This is done mainly to fulfill the
library contract that it never sends messages to servers
outside the current configuration. The group0 RPC
implementation doesn't need the two calls, since it simply
marks the removed servers as expired: they are not removed immediately
anyway, and messages can still be delivered to them.
However, there may be test/mock implementations of RPC which
could benefit from this contract, so we decided to keep it.
When `modify_config` or `add_entry` is forwarded to the leader, it may
reach the node at "inappropriate" time and result in an exception. There
are two reasons for it - the leader is changing and, in case of
`modify_config`, other `modify_config` is currently in progress. In both
cases the command is retried, but before this patch there was no delay
before retrying, which could led to a tight loop.
The patch adds a new exception type `transient_error`. When the client
receives it, it is obliged to retry the request after some delay.
Previously leader-side exceptions were converted to `not_a_leader`,
which is strange, especially for `conf_change_in_progress`.
Fixes: #11564Closes#11769
* github.com:scylladb/scylladb:
raft: rafactor: remove duplicate code on retries delays
raft: use wait_for_next_tick in read_barrier
raft: wait for the next tick before retrying
Introduce a templated function do_on_leader_with_retries,
use it in add_entries/modify_config/read_barrier. The
function implements the basic logic of retries with aborts
and leader changes handling, adds a delay between
iterations to protect against tight loops.
Replaced the yield on transport_error
with wait_for_next_tick. Added delays for retries, similar
to add_entry/modify_config: we postpone the next
call attempt if we haven't received new information
about the current leader.
When modify_config or add_entry is forwarded
to the leader, it may reach the node at
"inappropriate" time and result in an exception.
There are two reasons for it - the leader is
changing and, in case of modify_config, other
modify_config is currently in progress. In
both cases the command is retried, but before
this patch there was no delay before retrying,
which could led to a tight loop.
The patch adds a new exception type transient_error.
When the client node receives it, it is obliged to retry
the request, possibly after some delay. Previously, leader-side
exceptions were converted to not_a_leader exception,
which is strange, especially for conf_change_in_progress.
We add a delay before retrying in modify_config
and add_entry if the client hasn't received any new
information about the leader since the last attempt.
This can happen if the server
responds with a transient_error with an empty leader
and the current node has not yet learned the new leader.
We neglect an excessive delay if the newly elected leader
is the same as the previous one, this supposed to be a rare.
Fixes: #11564
The add_entry and modify_config methods sometimes do an rpc to
execute the request on the current leader. If the tcp connection
was broken, a seastar::rpc::closed_error would be thrown to the client.
This exception was not documented in the method comments and the
client could have missed handling it. For example, this exception
was not handled when calling modify_config in raft_group0,
which sometimes broke the removenode command.
An intermittent_connection_error exception was added earlier to
solve a similar problem with the read_barrier method. In this patch it
is renamed to transport_error, as it seems to better describe the
situation, and an explicit specification for this exception
was added - the rpc implementation can throw it if it is not known
whether the call reached the target node and whether any
actions were performed on it.
In case of read_barrier it does not matter and we just retry. In case
of add_entry and modify_config we cannot retry
because the rpc calls are not idempotent, so we convert this
exception to commit_status_unknown, which the client has to handle.
Explicit comments have also been added to raft::server methods
describing all possible exceptions.
We don't want to keep memory we don't use, shrink_to_fit guarantees that.
In fact, boost::deque frees up memory when items are deleted, so this change has little effect at the moment, but it may pay off if we change the container in the future.