When the local entry with `read_idx` belongs to the current term, it's
safe to update the local `commit_idx` to `read_idx`.
The motivation for this change is to speed up read barriers. `wait_for_apply`
executed at the end of `read_barrier` is delayed until the follower learns
that the entry with `read_idx` is committed. It usually happens quickly in
the `read_quorum` message. However, non-voters don't receive this message,
so they have to wait for `append_entries`. If no new entries are being
added, `append_entries` can come only from `fsm::tick_leader()`. For group0,
this happens once every 100ms.
The issue above significantly slows down cluster setups in tests. Nodes
join group0 as non-voters, and then they are met with several read barriers
just after a write to group0. One example is `global_token_metadata_barrier`
in `write_both_read_new` performed just after `update_topology_state` in
`write_both_read_old`.
I tested the performance impact of this change with the following test:
```python
for _ in range(10):
await manager.servers_add(3)
```
It consistently takes 44-45s with the change and 50-51s without the change
in dev mode.
No backport:
- non-critical performance improvement mostly relevant in tests,
- the change requires some soak time in master.
Closesscylladb/scylladb#28891
* github.com:scylladb/scylladb:
raft: server: fix the repeating typo
raft: clarify the comment about read_barrier_reply
raft: read_barrier: update local commit_idx to read_idx when it's safe
raft: log: clarify the specification of term_for
When the local entry with `read_idx` belongs to the current term, it's
safe to update the local `commit_idx` to `read_idx`. The argument for
safety is in the new comment above `maybe_update_commit_idx_for_read`.
The motivation for this change is to speed up read barriers. `wait_for_apply`
executed at the end of `read_barrier` is delayed until the follower learns
that the entry with `read_idx` is committed. It usually happens quickly in
the `read_quorum` message. However, non-voters don't receive this message,
so they have to wait for `append_entries`. If no new entries are being
added, `append_entries` can come only from `fsm::tick_leader()`. For group0,
this happens once every 100ms.
The issue above significantly slows down cluster setups in tests. Nodes
join group0 as non-voters, and then they are met with several read barriers
just after a write to group0. One example is `global_token_metadata_barrier`
in `write_both_read_new` performed just after `update_topology_state` in
`write_both_read_old`.
Writing a test for this change would be difficult, so we trust the nemesis
tests to do the job. They have already found consistency issues in read
barriers. See #10578.
After the previous changes in `raft::server::{add_entry, modify_config}`
(cf. SCYLLADB-841), we also go through other methods of `raft::server`
and verify that they handle the aborted state properly.
I found two methods that do not:
(A) `wait_for_leader`
(B) `wait_for_state_change`
What happened before these changes?
In case (A), the dangerous scenario occurred when `_leader_promise` was
empty on entering the function. In that case, we would construct the
promise and wait on the corresponding future. However, if the server
had been already aborted before the call, the future would never
resolve and we'd be effectively stuck.
Case (B) is fully analogous: instead of `_leader_promise`, we'd work
with `_stte_change_promise`.
There's probably a more proper solution to this problem, but since I'm
not familiar with the internal code of Raft, I fix it this way. We can
improve it further in the future.
We provide two simple validation tests. They verify that after aborting
a `raft::server`, the calls:
* do not hang (the tests would time out otherwise),
* throw raft::stopped_error.
Fixes SCYLLADB-841
Before the change, calling `add_entry` or `modify_config` on an already
aborted Raft server could result in an error `not_a_leader` containing
a null server ID. It was possible precisely when forwarding was
disabled in the server configuration.
`not_a_leader` is supposed to return the ID of the current leader,
so that was wrong. Furthermore, the description of the function
specified that if a server is aborted, then it should throw
`stopped_error`.
We fix that issue. A few small reproducer tests were provided to verify
that the functions behave correctly with and without forwarding enabled.
Refs SCYLLADB-841
The `try-catch` expression is pretty much useless in its current form.
If we return the future, the awaiting will only be performed by the
caller, completely circumventing the exception handling.
As a result, instead of handling `raft::request_aborted` with a proper
error message, the user will face `seastar::abort_requested_exception`
whose message is cryptic at best. It doesn't even point to the root
of the problem.
Fixes SCYLLADB-665
When a strongly consistent request arrives at a node, we
need to know which replica is the leader, since such requests
are generally executed only on the leader. If a leader has
not yet been elected, we must wait. This commit exposes
wait_for_leader() so it can be used for that purpose.
We cannot rely solely on wait_for_state_change(), because it does not
trigger when some other node becomes a leader.
When a node is bootstrapped and joined a cluster as a non-voter and changes it's role to a voter, errors can occur while committing a new Raft record, for instance, if the Raft leader changes during this time. These errors are not critical and should not cause a node crash, as the action can be retried.
Fixesscylladb/scylladb#20814
Backport: This issue occurs frequently and disrupts the CI workflow to some extent. Backports are needed for versions 6.1 and 6.2.
Closesscylladb/scylladb#22253
* github.com:scylladb/scylladb:
raft: refactor `remove_from_raft_config` to use a timed `modify_config` call.
raft: Refactor functions using `modify_config` to use a common wrapper for retrying.
raft: Handle non-critical config update errors in when changing status to voter.
test: Add test to check that a node does not fail on unknown commit status error when starting up.
raft: Add run_op_with_retry in raft_group0.
error when starting up.
Test that a node is starting successfully if while joining a cluster and becoming a voter, it
receives an unknown commit status error.
Test for scylladb/scylladb#20814
Replace boost with a standard facility; this reduces dependencies
as lexical_cast depends on boost ranges.
As a side effect the exception error message is improved.
New logs allow us to easily distinguish two cases in which
waiting for apply times out:
- the node didn't receive the entry it was waiting for,
- the node received the entry but didn't apply it in time.
Distinguishing these cases simplifies reasoning about failures.
The first case indicates that something went wrong on the leader.
The second case indicates that something went wrong on the node
on which waiting for apply timed out.
As it turns out, many different bugs result in the `read_barrier`
(which calls `wait_for_apply`) timeout. This change should help
us in debugging bugs like these.
We want to backport this change to all supported branches so that
it helps us in all tests.
Closesscylladb/scylladb#21855
Modernize the codebase by replacing Boost range adaptors with C++23 standard library views,
reducing external dependencies and leveraging modern C++ language features.
Key Changes:
- Replace `boost::adaptors::filtered` with `std::views::filter`
- Remove `#include <boost/range/adaptor/filtered.hpp>`
- Utilize standard library range views
Motivation:
- Reduce project's external dependency footprint
- Leverage standard library's range and view capabilities
- Improve long-term code maintainability
- Align with modern C++ best practices
Implementation Challenges and Considerations:
1. Range Conversion and Move Semantics
- `std::ranges::to` adaptor requires rvalue references
- Necessitated updates to variable and parameter constness
- Example: `cql3/restrictions/statement_restrictions.cc` modified to remove `const`
from `common` to enable efficient range conversion
2. Range Iteration and Mutation
- Range views may mutate internal state during iteration
- Cannot pass ranges by const reference in some scenarios
- Solution: Pass ranges by rvalue reference to explicitly indicate
state invalidation
Limitations:
- One instance of `boost::adaptors::filtered` temporarily preserved
due to lack of a C++23 alternative for `boost::join()`
- A comprehensive replacement will be addressed in a follow-up change
This change is part of our ongoing effort to modernize the codebase,
reducing external dependencies and adopting modern C++ practices.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21648
Metrics families (e.g., all metrics with the same name but with different labels) should have the same description.
The metric layer does not enforce that. Instead, it will use the first description provided.
It's a minor issue but the results are different than what you expect.
No need to backport.
Closesscylladb/scylladb#19947
* github.com:scylladb/scylladb:
service/storage_proxy.cc All metric groups should have the same description
raft/server.cc: All metric groups should have the same description
When the server_impl::applier_fiber is paused by a co_await at line raft/server.cc:1375:
```
co_await override_snapshot_thresholds();
```
a new snapshot may be applied, which updates the actual values of the log's last applied
and snapshot indexes. As a result, the new snapshot index could become higher than the
old value stored in _applied_idx at line raft/server.cc:1365, leading to an assertion
failure in log::last_conf_for().
Since error injection is disabled in release builds, this issue does not affect production releases.
This issue was introduced in the following commit
9dfa041fe1,
when error injection was added to override the log snapshot configuration parameters.
How to reproduce:
1. Build debug version of randomized_nemesis_test
```
ninja-build build/debug/test/raft/randomized_nemesis_test
```
2. Run
```
parallel --halt now,fail=1 -j20 'build/debug/test/raft/randomized_nemesis_test \
--run_test=test_frequent_snapshotting -- -c2 -m2G --overprovisioned --unsafe-bypass-fsync 1 \
--kernel-page-cache 1 --blocked-reactor-notify-ms 2000000 --default-log-level \
trace > tmp/logs/eraseme_{}.log 2>&1 && rm tmp/logs/eraseme_{}.log' ::: {1..1000}
```
Fixesscylladb/scylladb#20363Closesscylladb/scylladb#20555
before this change, we rely on `using namespace seastar` to use
`seastar::format()` without qualifying the `format()` with its
namespace. this works fine until we changed the parameter type
of format string `seastar::format()` from `const char*` to
`fmt::format_string<...>`. this change practically invited
`seastar::format()` to the club of `std::format()` and `fmt::format()`,
where all members accept a templated parameter as its `fmt`
parameter. and `seastar::format()` is not the best candidate anymore.
despite that argument-dependent lookup (ADT for short) favors the
function which is in the same namespace as its parameter, but
`using namespace` makes `seastar::format()` more competitive,
so both `std::format()` and `seastar::format()` are considered
as the condidates.
that is what is happening scylladb in quite a few caller sites of
`format()`, hence ADT is not able to tell which function the winner
in the name lookup:
```
/__w/scylladb/scylladb/mutation/mutation_fragment_stream_validator.cc:265:12: error: call to 'format' is ambiguous
265 | return format("{} ({}.{} {})", _name_view, s.ks_name(), s.cf_name(), s.id());
| ^~~~~~
/usr/bin/../lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/format:4290:5: note: candidate function [with _Args = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
4290 | format(format_string<_Args...> __fmt, _Args&&... __args)
| ^
/__w/scylladb/scylladb/seastar/include/seastar/core/print.hh:143:1: note: candidate function [with A = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
143 | format(fmt::format_string<A...> fmt, A&&... a) {
| ^
```
in this change, we
change all `format()` to either `fmt::format()` or `seastar::format()`
with following rules:
- if the caller expects an `sstring` or `std::string_view`, change to
`seastar::format()`
- if the caller expects an `std::string`, change to `fmt::format()`.
because, `sstring::operator std::basic_string` would incur a deep
copy.
we will need another change to enable scylladb to compile with the
latest seastar. namely, to pass the format string as a templated
parameter down to helper functions which format their parameters.
to miminize the scope of this change, let's include that change when
bumping up the seastar submodule. as that change will depend on
the seastar change.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
- raft_sys_table_storage::store_snapshot_descriptor now receives a number of
preserved items in the log, rather than _config.snapshot_trailing value;
- Incorrect check for truncated number of items in store_snapshot_descriptor
was removed.
Fixesscylladb/scylladb#16817Fixesscylladb/scylladb#20080
Replace raft_server_snapshot_reduce_threshold with raft_server_set_snapshot_thresholds in tests
as raft_server_set_snapshot_thresholds fully covers the functionality of raft_server_snapshot_reduce_threshold.
With implicit conversion of tagged integers to untagged ones going away,
explicitly tag (or untag, as necessary) the operands of the following
operations, in "raft/server.cc":
- addition of tagged and untagged (both should be tagged)
- subscripting an array by tagged (should be untagged)
- comparing a size-like threshold against tagged (should be untagged)
- exposing tagged via gauges (should be untagged)
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
assert() is traditionally disabled in release builds, but not in
scylladb. This hasn't caused problems so far, but the latest abseil
release includes a commit [1] that causes a 1000 insn/op regression when
NDEBUG is not defined.
Clearly, we must move towards a build system where NDEBUG is defined in
release builds. But we can't just define it blindly without vetting
all the assert() calls, as some were written with the expectation that
they are enabled in release mode.
To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT()
macro in utils/assert.hh. This macro is always defined and is not conditional
on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release
mode.
[1] 66ef711d68Closesscylladb/scylladb#20006
1. Fixed a single typo (send -> sent)
2. Rephrase 'How many' to 'Number of' and use less passive tense.
3. Be more specific in the description of the different metrics insteda of the more generic descriptions.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Closesscylladb/scylladb#19067
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, instead of printing the `unique_ptr` instance, we
print the pointee of it. since `server_impl` uses pimpl paradigm,
`_fsm` is always valid after `server_impl::start()`, we can always
deference it without checking for null.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#17953
This allows the user of `raft::server` to ask it to create a snapshot
and truncate the Raft log. In a later commit we'll add a REST endpoint
to Scylla to trigger group 0 snapshots.
One use case for this API is to create group 0 snapshots in Scylla
deployments which upgraded to Raft in version 5.2 and started with an
empty Raft log with no snapshot at the beginning. This causes problems,
e.g. when a new node bootstraps to the cluster, it will not receive a
snapshot that would contain both schema and group 0 history, which would
then lead to inconsistent schema state and trigger assertion failures as
observed in scylladb/scylladb#16683.
In 5.4 the logic of initial group 0 setup was changed to start the Raft
log with a snapshot at index 1 (ff386e7a44)
but a problem remains with these existing deployments coming from 5.2,
we need a way to trigger a snapshot in them (other than performing 1000
arbitrary schema changes).
Another potential use case in the future would be to trigger snapshots
based on external memory pressure in tablet Raft groups (for strongly
consistent tables).
`server` was the only user of this function and it can now be
implemented using `fsm`'s public interface.
In later commits we'll extend the logic of `io_fiber` to also subscribe
to other events, triggered by `server` API calls, not only to outputs
from `fsm`.
In later commits we will use it to wake up `io_fiber` directly from
`raft::server` based on events generated by `raft::server` itself -- not
only from events generated by `raft::fsm`.
`raft::fsm` still obtains a reference to the condition variable so it
can keep signaling it.
This parameter says how many entries at most should be left trailing
before the snapshot index. There are multiple places where this
decision is made:
- in `applier_fiber` when the server locally decides to take a snapshot
due to log size pressure; this applies to the in-memory log
- in `fsm::step` when the server received an `install_snapshot` message
from the leader; this also applies to the in-memory log
- and in `io_fiber` when calling `store_snapshot_descriptor`; this
applies to the on-disk log.
The logic of how many entries should be left trailing is calculated
twice:
- first, in `applier_fiber` or in `fsm::step` when truncating the
in-memory log
- and then again as the snapshot descriptor is being persisted.
The logic is to take `_config.snapshot_trailing` for locally generated
snapshots (coming from `applier_fiber`) and `0` for remote snapshots
(from `fsm::step`).
But there is already an error injection that changes the behavior of
`applier_fiber` to leave `0` trailing entries. However, this doesn't
affect the following `store_snapshot_descriptor` call which still uses
`_config.snapshot_trailing`. So if the server got restarted, the entries
which were truncated in-memory would get "revived" from disk.
Fortunately, this is test-only code.
However in future commits we'd like to change the logic of
`applier_fiber` even further. So instead of having a separate
calculation of trailing entries inside `io_fiber`, it's better for it to
use the number that was already calculated once. This number is passed to
`fsm::apply_snapshot` (by `applier_fiber` or `fsm::step`) and can then
be received by `io_fiber` from `fsm_output` to use it inside
`store_snapshot_descriptor`.
This looks like a minor oversight, in `server_impl::abort` there are
multiple calls to `set_exception` on the different promises, only one of
them would not receive `*_aborted`.
The previous commit has fixed 5 bugs of the same type - incorrectly
passing the default nullptr to one of the changed functions. At
least some of these bugs wouldn't appear if there was no default
value. It's much harder to make this kind of a bug if you have to
write "nullptr". It's also much easier to detect it in review.
Moreover, these default values are rarely used outside tests.
Keeping them is just not worth the time spent on debugging.
Fixes some typos as found by codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.
Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Add a new destination_not_alive_error, thrown from two-way RPCs in case
when the RPC is not issued because the destination is not reported as
alive by the failure detector.
In snapshot transfer code, lower the verbosity of the message printed in
case it fails on the new error. This is done to prevent flakiness in the
CI - in case of slow runs, nodes might get spuriously marked as dead if
they are busy, and a message with the "error" verbosity can cause some
tests to fail.
Add a method which reports whether given raft server is running.
In following commits, the information about whether the local raft
group 0 is running or not will be included in the response to the
failure detector ping, and the is_alive method will be used there.
Hold a gate around all operations that are forwarded to a leader to be
able to wait for them during server::abort() otherwise the abort() may
complete while those operations are still running which may cause use
after free.
The handler for join_node_request will need to know which node is
considered the group 0 leader right now by the local node.
If the topology coordinator crashes and a new node immediately wants to
replace it with the same IP, the node that handles join_node_request
will attempt to perform a read barrier. If this happens quickly enough,
due to the IP reuse the RPC will be sent to the new node instead of the
(now crashed) topology coordinator; the RPC will get an error and will
fail the barrier.
If we detect that the new node wants to replace the current topology
coordinator, the upcoming join_node_request_handler will wait until
there is a leader change.
server_impl::apply_snapshot() assumes that it cannot receive a snapshots
from the same host until the previous one is handled and usually this is
true since a leader will not send another snapshot until it gets
response to a previous one. But it may happens that snapshot sending
RPC fails after the snapshot was sent, but before reply is received
because of connection disconnect. In this case the leader may send
another snapshot and there is no guaranty that the previous one was
already handled, so the assumption may break.
Drop the assert that verifies the assumption and return an error in this
case instead.
Fixes: #15222
Message-ID: <ZO9JoEiHg+nIdavS@scylladb.com>
`get_cdc_generation_mutations` splits data to mutations of maximal size
`mutation_size_treshold`. Before this commit it was hardcoded to 2 MB.
Calculate `mutation_size_threshold` to leave space for cdc generation
data and not exceed `max_command_size`.
There are some cases where we can deduce that the entry was committed,
but we were throwing `commit_status_unknown`. Handle one more such case.
The added comment explains it in detail.
Also add a FIXME for another case where we throw `commit_status_unknown`
but we could do better.
Fixes: #14029
otherwise GCC 13 complains that
```
/home/kefu/dev/scylladb/raft/server.cc:42:15: error: declaration of ‘seastar::promise<void> raft::awaited_index::promise’ changes meaning of ‘promise’ [-Wchanges-meaning]
42 | promise<> promise;
| ^~~~~~~
/home/kefu/dev/scylladb/raft/server.cc:42:5: note: used here to mean ‘class seastar::promise<void>’
42 | promise<> promise;
| ^~~~~~~~~
```
see also cd4af0c722
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>