Sometimes we initialize a loop variable like
auto i = 0;
or
int i = 0;
But since the type of `0` is `int`, what we get is a variable of
`int` type, which is later compared with an unsigned number. If we
compile the source code with the `-Werror=sign-compare` option, the
compiler warns on seeing this. In general this is a false alarm, as
we are unlikely to get a wrong comparison result here. But in order
to prevent issues due to integer promotion in comparisons elsewhere,
and to prepare for enabling `-Werror=sign-compare`, let's use
unsigned to silence this warning.
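For instance, a minimal sketch of the change (the container and loop body are illustrative, not taken from the actual patch):
```
#include <vector>

void iterate(const std::vector<int>& v) {
    // `0u` makes `i` unsigned, so comparing it with the unsigned
    // v.size() no longer triggers -Werror=sign-compare.
    for (auto i = 0u; i < v.size(); ++i) {
        // ... use v[i] ...
    }
}
```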
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Chained comparison is not supported by C++ and does not yield the
expected result: `0 <= d < magic` parses as `(0 <= d) < magic`, and
`0 <= d` evaluates to a bool, which is always less than `magic`.
So let's avoid using it.
```
/home/kefu/dev/scylladb/test/raft/randomized_nemesis_test.cc:2908:23: error: result of comparison of constant 54313 with expression of type 'bool' is always true [-Werror,-Wtautological-constant-out-of-range-compare]
2908 | assert(0 <= d < magic);
| ~~~~~~ ^ ~~~~~
```
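The fix is to spell out both comparisons explicitly:
```
// assert(0 <= d < magic);   // parses as (0 <= d) < magic
assert(0 <= d && d < magic); // what was actually intended
```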
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes #14695
As in the previous commit, if a command gets doubly applied due to
`commit_status_unknown`, this could lead to hard-to-debug failures;
one of them was the test hanging because we would never call
`_done.set_value()` in `state_machine::apply` due to `_seen`
overshooting `_apply_entries`.
Fix the problem and print a warning if we apply too many commands.
Fixes: #14072
`commit_status_unknown` may lead to double application and then a
hard-to-debug failure. But some tests actually rely on retrying it, so
print a warning and leave a FIXME for maybe a better future solution.
Ref: #14029
A generic template for defining strongly typed
integer types.
Use it here to replace raft::internal::tagged_uint64.
Will be used for defining gms generation and version
as strong and distinguishable types in following patches.
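A minimal sketch of such a tag-based template (the names here are illustrative, not the actual Scylla API):
```
#include <compare>
#include <cstdint>

// Two instantiations with different Tag types are distinct,
// non-interconvertible types.
template <typename Tag, typename Value = uint64_t>
class strong_int {
    Value _value = 0;
public:
    strong_int() = default;
    explicit constexpr strong_int(Value v) : _value(v) {}
    constexpr Value value() const { return _value; }
    constexpr auto operator<=>(const strong_int&) const = default;
};

struct generation_tag {};
struct version_tag {};
using generation_type = strong_int<generation_tag>;
using version_type = strong_int<version_tag>;

// generation_type g = version_type{1}; // error: distinct types
```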
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
gcc thinks the constructor call is ambiguous since "{}" can match
the default constructor. Fix by making the parameter type explicit.
Use "{}" for the constructor call to avoid the most-vexing-parse
problem.
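For reference, the classic shape of the most-vexing-parse problem (a generic sketch, not the exact code touched here):
```
struct config {};
struct server {
    explicit server(config) {}
};

// server s(config());  // most vexing parse: declares a function `s`
//                      // taking a pointer to a function returning config
server s{config{}};     // unambiguously constructs a server object
```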
Add a function that allows waiting for a state change of a raft server.
It is useful for a user who wants to know when a node becomes/stops
being a leader.
Message-Id: <20230316112801.1004602-4-gleb@scylladb.com>
The code for compare_endpoints originates at the dawn of time (bc034aeaec)
and is called on the fast path from storage_proxy via `sort_by_proximity`.
This series considerably reduces the function's footprint by:
1. carefully coding the many comparisons in the function so as to reduce the number of conditional branches (apparently the compiler isn't doing a good enough job of optimizing it in this case)
2. avoiding sstring copies in topology::get_{datacenter,rack}
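As an illustration of point 1, branches in locality comparisons can be traded for straight-line arithmetic on booleans (a hypothetical sketch; these names are not the actual compare_endpoints code):
```
#include <compare>

// Encode (same-dc, same-rack) locality as a small score; higher is
// closer, so the compiler can emit conditional moves instead of jumps.
std::strong_ordering compare_proximity(bool a_same_dc, bool a_same_rack,
                                       bool b_same_dc, bool b_same_rack) {
    int a = (int(a_same_dc) << 1) | int(a_same_rack);
    int b = (int(b_same_dc) << 1) | int(b_same_rack);
    return b <=> a; // closer endpoint sorts first
}
```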
Closes #12761
* github.com:scylladb/scylladb:
topology: optimize compare_endpoints
to_string: add print operators for std::{weak,partial}_ordering
utils: to_sstring: deinline std::strong_ordering print operator
move to_string.hh to utils/
test: network_topology: add test_topology_compare_endpoints
These warnings are found by Clang-17 after removing
`-Wno-unused-lambda-capture` and `-Wno-unused-variable` from
the list of disabled warnings in `configure.py`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Currently, if a node that is outside of the config tries to add an entry
or modify the config, a transient error is returned and this causes the
node to retry. But the error is not transient. If a node tries to do one
of the operations above, it means it was part of the cluster at some
point; and since a node with the same id should not be added back to a
cluster, if it is not in the cluster now, it never will be.
Return a new error, not_a_member, to the caller instead.
Message-Id: <Y42mTOx8bNNrHqpd@scylladb.com>
1) make address map API flexible
Before this patch:
- having a mapping without an actual IP address was an
internal error
- not having a mapping for an IP address was an internal
error
- re-mapping to a new IP address wasn't allowed
After this patch:
- the address map may contain a mapping
without an actual IP address, and the caller must be prepared for it:
find() will return a nullopt. This happens when we first add an entry
to Raft configuration and only later learn its IP address, e.g. via
gossip.
- it is allowed to re-map an existing entry to a new address;
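A caller-side sketch of the new contract (the names here are illustrative, not the actual address map API):
```
#include <optional>
#include <string>
#include <unordered_map>

using server_id = unsigned long;   // stand-in for raft::server_id
using inet_address = std::string;  // stand-in for gms::inet_address

struct address_map {
    // An entry may exist with no IP yet: the id was added to the Raft
    // configuration before gossip delivered its address.
    std::unordered_map<server_id, std::optional<inet_address>> entries;

    std::optional<inet_address> find(server_id id) const {
        auto it = entries.find(id);
        return it == entries.end() ? std::nullopt : it->second;
    }
};

// if (auto addr = map.find(id)) { send(*addr, msg); }
// else { /* IP unknown yet: defer or drop */ }
```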
2) subscribe to gossip notifications
Learning IP addresses from gossip allows us to adjust
the address map whenever a node IP address changes.
Gossiper is also the only valid source of re-mapping, other sources
(RPC) should not re-map, since otherwise a packet from a removed
server can remap the id to a wrong address and impact liveness of a Raft
cluster.
3) prime address map state with app state
Initialize the raft address map with initial
gossip application state, specifically IPs of members
of the cluster. With this, we no longer need to store
these IPs in Raft configuration (and update them when they change).
The obvious drawback of this approach is that a node
may join Raft config before it propagates its IP address
to the cluster via gossip - so the boot process has to
wait until it happens.
Gossip also doesn't tell us which IPs are members of Raft configuration,
so we subscribe to Group0 configuration changes to mark the
members of Raft config "non-expiring" in the address translation
map.
Thanks to the changes above, Raft configuration no longer
stores IP addresses.
We still keep the 'server_info' column in the raft_config system table,
in case we change our mind or decide to store something else in there.
Pass a change diff into the notification callback,
rather than adding or removing servers one by one, so that
if we need to persist the state, we can do it once per
configuration change, not for every added or removed server.
For now, still pass added and removed entries in two separate calls
per configuration change. This is done mainly to fulfill the
library contract that it never sends messages to servers
outside the current configuration. The group0 RPC
implementation doesn't need the two calls, since it simply
marks the removed servers as expired: they are not removed immediately
anyway, and messages can still be delivered to them.
However, there may be test/mock implementations of RPC which
could benefit from this contract, so we decided to keep it.
The direct failure detector operates on abstract `endpoint_id`s for
pinging. The `pinger` interface is responsible for translating these IDs
to 'real' addresses.
Earlier we used two types of addresses: IP addresses in 'production'
code (`gms::gossiper::direct_fd_pinger`) and `raft::server_id`s in test
code (in `randomized_nemesis_test`). For each of these use cases we
would maintain mappings between `endpoint_id`s and the address type.
In recent commits we switched the 'production' code to also operate on
Raft server IDs, which are UUIDs underneath.
In this commit we switch `endpoint_id`s from `unsigned` type to
`utils::UUID`. Because each use case operates on Raft server IDs, we can
perform a simple translation: `raft_id.uuid()` to get an `endpoint_id`
from a Raft ID, `raft::server_id{ep_id}` to obtain a Raft ID from
an `endpoint_id`. We no longer have to maintain complex sharded data
structures to store the mappings.
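The translation in both directions is then trivial (a self-contained illustration with stand-in types, not the real utils::UUID/raft::server_id):
```
#include <cassert>

struct UUID {                        // stand-in for utils::UUID
    unsigned long long hi = 0, lo = 0;
    bool operator==(const UUID&) const = default;
};
struct server_id {                   // stand-in for raft::server_id
    UUID id;
    UUID uuid() const { return id; }
};
using endpoint_id = UUID;            // after this commit, a UUID

int main() {
    server_id raft_id{{1, 2}};
    endpoint_id ep = raft_id.uuid(); // Raft ID -> endpoint_id
    server_id back{ep};              // endpoint_id -> Raft ID
    assert(back.uuid() == ep);
}
```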
We plan to remove IP information from Raft addresses.
raft::server_address is used in Raft configuration and
also in discovery, which is a separate algorithm, as a handy data
structure, to avoid introducing new entities in RPC.
Since we plan to remove IP addresses from Raft configuration,
using raft::server_address in discovery while still storing
IPs in it would create ambiguity: in some uses raft::server_address
would store an IP, and in others it would not.
So switch to a dedicated data structure for the purposes of discovery,
discovery_peer, which contains an IP address and a Raft server id.
Note to reviewers: ideally we should switch to URIs
in discovery_peer right away. Otherwise we may have to
deal with incompatible changes in discovery when adding URI
support to Scylla.
There is a flaw in how the raft rpc endpoints are
currently managed. The io_fiber in raft::server
is supposed to first add new servers to rpc, then
send all the messages and then remove the servers
which have been excluded from the configuration.
The problem is that the send_messages function
isn't synchronous, it schedules send_append_entries
to run after all the current requests to the
target server, which can happen
after we have already removed the server from address_map.
In this patch the remove_server function is changed to mark
the server_id as expiring rather than synchronously dropping it.
This means all currently scheduled requests to
that server will still be able to resolve
the ip address for that server_id.
Fixes: #11228
Closes #11748
Before this patch we could get an OOM if we
received several big commands. The number of
commands was small, but their total size
in bytes was large.
snapshot_trailing_size is needed to guarantee
progress. Without this limit the fsm could
get stuck if the size of the next item is
greater than max_log_size - (size of trailing entries).
applier_fiber could create multiple snapshots between
io_fiber runs. The fsm_output.snp variable was
overwritten by applier_fiber and io_fiber didn't drop
the previous snapshot.
In this patch we introduce the variable
fsm_output.snps_to_drop, store in it
the current snapshot id before applying
a new one, and then sequentially drop them in
io_fiber after storing the last snapshot_descriptor.
_sm_events.signal() is added to fsm::apply_snapshot,
since this method mutates the _output and thus gives a
reason to run io_fiber.
The new test test_frequent_snapshotting demonstrates
the problem by causing frequent snapshots and
setting the applier queue size to one.
Closes #11530
Changing configuration involves two entries in the log: a 'joint
configuration entry' and a 'non-joint configuration entry'. We use
`wait_for_entry` to wait on the joint one. To wait on the non-joint one,
we use a separate promise field in `server`. This promise wasn't
connected to the `abort_source` passed into `set_configuration`.
The call could get stuck if the server got removed from the
configuration and lost leadership after committing the joint entry but
before committing the non-joint one, waiting on the promise. Aborting
wouldn't help. Fix this by subscribing to the `abort_source` and
resolving the promise exceptionally on abort.
Furthermore, make sure that two `set_configuration` calls don't step on
each other's toes by one setting the other's promise. To do that, reset
the promise field at the end of `set_configuration` and check that it's
not engaged at the beginning.
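Schematically, the abort handling looks like this (a rough sketch, not the actual server.cc code; it assumes the older Seastar subscribe() callback form `void() noexcept`, and uses a generic exception where the real code uses raft's abort error):
```
#include <seastar/core/abort_source.hh>
#include <seastar/core/future.hh>
#include <exception>
#include <stdexcept>

seastar::future<> wait_for_non_joint(seastar::promise<>& committed,
                                     seastar::abort_source& as) {
    auto sub = as.subscribe([&committed] () noexcept {
        committed.set_exception(
                std::make_exception_ptr(std::runtime_error("aborted")));
    });
    if (!sub) {
        // The abort fired before we managed to subscribe.
        return seastar::make_exception_future<>(
                std::runtime_error("aborted"));
    }
    // Keep the subscription alive until the wait completes.
    return committed.get_future().finally([sub = std::move(sub)] {});
}
```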
Fixes #11288.
Closes #11325
* github.com:scylladb/scylladb:
test: raft: randomized_nemesis_test: additional logging
raft: server: handle aborts when waiting for config entry to commit
It could happen that we accessed the failure detector service after it
was stopped, if a reconfiguration happened at the 'right' moment. This
would result in an assertion failure. Fix this.
Closes #11326
Add some more logging to `randomized_nemesis_test` such as logging the
start and end of a reconfiguration operation in a way that makes it easy
to find one given the other in the logs.
Improve the randomness of this test, making it a bit easier to
reproduce the scenarios that the test aims to catch.
Increase timeouts a bit to account for this additional randomness.
Expiring entries are added when a message is received from an unknown
host. If the host is later added to the raft configuration, they become
non-expiring. After that they can only be removed when the host is
dropped from the configuration, but they should never become expiring
again.
Refs #10826
Whether a server can vote in a Raft configuration is not part of the
address. `server_address` was used in many contexts where `can_vote` is
irrelevant.
Split the struct: `server_address` now contains only `id` and
`server_info` as it did before `can_vote` was introduced. Instead we
have a `config_member` struct that contains a `server_address` and the
`can_vote` field.
Also remove an "unsafe" constructor from `server_address` where `id` was
provided but `server_info` was not. The constructor was used for tests
where `server_info` is irrelevant, but it's important not to forget
about the info in production code. The constructor was used for two
purposes:
- Invoking set operations such as `contains`. To solve this we use C++20
transparent hash and comparator functions, which allow invoking
`contains` and similar functions by providing a different key type (in
this case `raft::server_id` in a set of addresses, for example); see
the sketch after this list.
- constructing addresses without `info`s in tests. For this we provide
helper functions in the test helpers module and use them.
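A minimal sketch of the transparent-lookup technique (with std::string standing in for raft::server_id and server_info; not the actual Scylla types):
```
#include <cstddef>
#include <string>
#include <unordered_set>

struct server_address {
    std::string id;    // stand-in for raft::server_id
    std::string info;  // stand-in for server_info
};

// `is_transparent` lets C++20 unordered containers look up elements
// by a different key type, here by id alone.
struct addr_hash {
    using is_transparent = void;
    std::size_t operator()(const server_address& a) const {
        return std::hash<std::string>{}(a.id);
    }
    std::size_t operator()(const std::string& id) const {
        return std::hash<std::string>{}(id);
    }
};
struct addr_eq {
    using is_transparent = void;
    bool operator()(const server_address& a, const server_address& b) const {
        return a.id == b.id;
    }
    bool operator()(const server_address& a, const std::string& id) const {
        return a.id == id;
    }
    bool operator()(const std::string& id, const server_address& a) const {
        return a.id == id;
    }
};

using address_set = std::unordered_set<server_address, addr_hash, addr_eq>;
// address_set s;
// s.contains(id); // no full server_address needs to be constructed
```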
A leader which ceases to be a leader as a result of
execute_modify_config cannot wait for a dummy record to be
committed, because io_fiber aborts current waiters as soon as it
detects a loss of leadership.
This commit excludes dummy entries from the configuration change
procedure. A special promise is set on io_fiber when it gets a
non-joint configuration, and set_configuration just waits for
the corresponding future instead of a dummy record.
Fixes: #10010
Closes #10905
coroutine::parallel_for_each avoids an allocation and is therefore preferred. The lifetime
of the function object is less ambiguous, and so it is safer. Replace all eligible
occurrences (i.e. where the caller is a coroutine).
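A sketch of the pattern being converted to (assuming Seastar's coroutine::parallel_for_each; do_stop and the element type are hypothetical):
```
#include <seastar/core/future.hh>
#include <seastar/core/sleep.hh>
#include <seastar/coroutine/parallel_for_each.hh>
#include <chrono>
#include <vector>

// Hypothetical per-element async work.
static seastar::future<> do_stop(int) {
    co_await seastar::sleep(std::chrono::milliseconds(1));
}

// When the caller is itself a coroutine, coroutine::parallel_for_each
// avoids the function-object allocation that seastar::parallel_for_each
// needs, and the lambda provably lives until the co_await completes.
seastar::future<> stop_all(std::vector<int>& ids) {
    co_await seastar::coroutine::parallel_for_each(
            ids, [] (int id) -> seastar::future<> {
        co_await do_stop(id);
    });
}
```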
One case (storage_service::node_ops_cmd_heartbeat_updater()) needed a little extra
attention since there was a handle_exception() continuation attached. It is converted
to a try/catch.
Closes #10699
We modify the `reconfigure` and `modify_config` APIs to take a vector of
<server_id, bool> pairs (instead of just a vector of server_ids), where
the bool indicates whether the server is a voter in the modified config.
The `reconfiguration` operation would previously shuffle the set of
servers and split it into two parts: members and non-members. Now it
partitions it into three parts: voters, non-voters, and non-members.
The test would perform `read_barrier`s but not check the correctness
of the reads: whether the state observed by a read is consistent with
the model and recent enough (in short, check linearizability).
This commit adds the correctness checks.
Introduce a new operation, `raft_read`, which calls `read_barrier`
on a server, reads the state of the server's state machine, and returns
that state.
Extend the generator in `basic_generator_test` to generate `raft_read`s.
Only do it if forwarding is enabled (although it may make sense to test
read barriers in non-forwarding scenarios as well - we may think about it
and do it in a follow-up).
For now, we don't check the consistency of the results of the reads.
They do return the observed state, but we don't compare it yet with the
model. For now we simply issue the reads concurrently with other
operations to introduce some more chaos to the cluster and check
liveness and consistency of existing operations.
Extend the reconfiguration nemesis to send `modify_config` requests as
well as `reconfigure` requests. It chooses one or the other with
probability 1/2.
When `rpc` wants to perform a two-way RPC call it sends a message
containing a `reply_id`. The other side will send the `reply_id` back
when answering, so the original side can match the response to the promise
corresponding to the future being waited on by the RPC caller.
Previously each instance of `rpc` generated reply IDs independently as
increasing integers starting from 0. The network delivers messages
based on Raft server IDs. A response message may thus be delivered not
to the original instance which invoked the RPC, but to a new instance
which uses the same Raft server ID (after we simulated a server
crash/stop and restart, creating a new server with the same ID that
reuses the previous instance's `persistence` instance but has a new `rpc`).
The new instance could have started a new RPC call using the same
`reply_id` as one currently in flight that was started by the
previous instance. The new instance could then receive and handle a
response that was intended for the previous instance, leading to weird
bugs.
Fix this by replacing the local reply ID counters by a global counter so
that every two-way RPC call gets a unique reply ID.
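A minimal sketch of the fix (the name is illustrative; the test runs on a single shard, so a plain static counter is enough):
```
#include <cstdint>

// One process-wide counter instead of a per-`rpc`-instance counter,
// so reply ids never collide across restarted instances.
uint64_t next_reply_id() {
    static uint64_t counter = 0;
    return counter++;
}
```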
A server executing a `modify_config` call, even if it initially was a
leader and accepted the request, may end up throwing a `not_a_leader`
error, rerouting the caller to a new leader - but this new leader may be
that same server. This happens because `execute_modify_config`
translates certain errors that it considers transient (such as
`conf_change_in_progress`) into `not_a_leader{last_known_leader}`,
in an attempt to notify the caller that they should retry the request; but
when this translation happens, the `last_known_leader` may be that same
server (it could have even lost leadership and then regained it back
while the request was being handled).
This is not strictly an error, and it should be safe for the client to
retry the request by sending it to the same server. The nemesis test
assumed that a server never returns `not_a_leader{itself}`; this commit
drops the assumption.
An alternative solution would be to extend the error types that are now
translated to `not_a_leader` so they include information about the last
known leader. This way the client does not lose information about the
original error and still gets a potential contact point for retry.
Until now the nemesis test used its own failure detector implementation
which used one-way heartbeats.
Switch it to use the new direct failure detection service, which will
also be used in production code. Integrating it does require some work
however as we need to implement the `pinger` and `clock` interfaces
for the failure detector.
The service is sharded, but for simplicity of implementation we
implement rpcs and sleeps by routing the requests to shard 0, where
logical timers and network live.
Right now the test is running entirely on shard 0, but we want to
introduce a sharded service to the test. The initial naive attempt at
doing that failed because the test would time out (reach the tick limit)
before any work distributed to other shards could even start. The
solution in this commit solves that by synchronizing the shards on each
tick.
When the test is run with smp=1, the behavior is as before.
For a follower to forward requests to a leader, the leader must be known.
But there may be a situation where a follower does not learn about
a leader for a while. This may happen when a node becomes a follower while its
log is up-to-date and there are no new entries submitted to raft. In such
a case the leader will send nothing to the follower, and the only way to
learn about the current leader is to get a message from it. Until a new
entry is added to raft's log, a follower that does not know who the
leader is will not be able to add entries. Kind of a deadlock. Note that
the problem is specific to our implementation, where failure detection is
done by an outside module. In vanilla raft a leader sends messages to
all followers periodically, so essentially it is never idle.
The patch solves this by broadcasting a specially crafted append reject to all
nodes in the cluster on a tick in case the leader is not known. The leader
responds to this message with an empty append request, which causes the
node to learn about the leader. As an optimisation, the patch
sends the broadcast only in case there is actually an operation that
waits for the leader to be known.
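A self-contained sketch of the tick-side logic (all names here are illustrative stand-ins, not the real fsm API):
```
#include <optional>
#include <vector>

struct server_id { int v; };
struct append_reply { bool rejected; };

struct follower_state {
    std::optional<server_id> current_leader;
    bool has_waiters_for_leader = false;
    std::vector<server_id> members;

    // Called from tick(): if someone is waiting for a leader and we do
    // not know one, broadcast a crafted reject; the leader answers with
    // an empty append request, revealing itself.
    template <typename Send>
    void maybe_ping_for_leader(Send send) {
        if (!current_leader && has_waiters_for_leader) {
            for (auto& node : members) {
                send(node, append_reply{.rejected = true});
            }
        }
    }
};
```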
Fixes #10379