Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the last case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes were applied mechanically with a script, except to
licenses/README.md.
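For illustration, the single-line headers could look like the sketch
below. The comment layout and the pairing of identifiers with the
AGPL-only and Apache-only cases are assumptions; only the dual-license
expression is taken verbatim from this message.

```cpp
// AGPL-only file (identifier assumed from the dual-license expression):
// SPDX-License-Identifier: AGPL-3.0-or-later

// Apache-only file (identifier assumed likewise):
// SPDX-License-Identifier: Apache-2.0

// Dual-licensed file (expression as chosen in this change):
// SPDX-License-Identifier: (AGPL-3.0-or-later and Apache-2.0)
```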
Closes #9937
There can be a situation where a leader will send to a follower entries
that the latter has already snapshotted. Currently a follower considers
those to be outdated appends and rejects them, but this may cause the
follower's progress to get stuck:
- A is a leader, B is a follower, there are other followers which A used to commit entries
- A remembers that the last matched entry for B is 10, so the next entry to send is 11. A managed to commit entry 11 using the other followers
- A sends entry 11 to B
- B receives it, accepts, and updates its commit index to 11. It sends a success reply to A, but it never reaches A due to a network partition
- B takes a snapshot at index 11
- A sends entry 11 to B again
- B rejects it since it is inside the snapshot
- A receives the reject and retries from the same entry
- The same thing happens again
We should not reject such outdated entries: if they fall inside a
snapshot, they match (according to the log matching property).
Accepting them restores liveness in the case above.
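The retry loop from the scenario above can be modelled with a toy
sketch. All names here (follower, receive_append,
rounds_until_progress) are hypothetical and not the real raft::fsm
API; the follower is reduced to its snapshot index, and anything past
the snapshot is assumed to append successfully.

```cpp
#include <cstdint>

using index_t = std::uint64_t;

// Hypothetical follower model: only the snapshot index matters here.
struct follower {
    index_t snapshot_idx;     // last index covered by the snapshot
    bool accept_snapshotted;  // false = old behaviour, true = new behaviour
};

struct append_reply {
    bool accepted;
    index_t match_idx;        // meaningful only when accepted
};

// Handle an append request carrying a single entry at entry_idx.
append_reply receive_append(const follower& f, index_t entry_idx) {
    if (entry_idx <= f.snapshot_idx && !f.accept_snapshotted) {
        // Old behaviour: treat the entry as an outdated append.
        return {false, 0};
    }
    // New behaviour: an entry inside the snapshot matches by the log
    // matching property. Entries past the snapshot are assumed to be
    // appended normally in this sketch.
    return {true, entry_idx};
}

// The leader resends from next_idx after every reject. Returns the round
// in which the follower accepted, or -1 if it never did within max_rounds.
int rounds_until_progress(const follower& f, index_t next_idx, int max_rounds) {
    for (int round = 1; round <= max_rounds; ++round) {
        if (receive_append(f, next_idx).accepted) {
            return round;
        }
    }
    return -1;
}
```

With snapshot_idx = 11 and next_idx = 11, the old behaviour never makes
progress, while the new one succeeds on the first round.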
Fixes #9552
The log maintains _last_conf_idx and _prev_conf_idx indexes into the log
to point to where the latest and previous configuration can be found.
If they are zero it means that the latest config is in the snapshot.
When a snapshot with trailing entries is applied, we can safely reset
those indexes that are smaller than the snapshot index to zero because
the snapshot will have the latest config anyway. This simplifies
maintenance of those indexes since their value will not depend on the
user-configured snapshot_trailing parameter.
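A minimal sketch of the reset rule, assuming a hypothetical
conf_indexes struct in place of the real log class. It reads "smaller
than the snapshot one" literally; how an index equal to the snapshot
index is treated is not stated here.

```cpp
#include <cstdint>

using index_t = std::uint64_t;

// Hypothetical stand-in for the two members of the log class.
struct conf_indexes {
    index_t last_conf_idx;  // 0 means "latest config is in the snapshot"
    index_t prev_conf_idx;  // 0 means "no previous config in the log"
};

// Reset every configuration index that lies below the snapshot index,
// regardless of how many trailing entries are kept.
void reset_on_snapshot(conf_indexes& c, index_t snapshot_idx) {
    if (c.prev_conf_idx < snapshot_idx) {
        c.prev_conf_idx = 0;
    }
    if (c.last_conf_idx < snapshot_idx) {
        c.last_conf_idx = 0;
    }
}
```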
There are situations where a node outside the current configuration is
the only node that can become a leader. We become candidates in such
cases. But there is an easy check for when we don't need to; a comment was
added explaining that.
We add a function `log_last_conf_before(index_t)` to `fsm` which, given
an index greater than the last snapshot index, returns the configuration
at this index, i.e. the configuration of the last configuration entry
before this index.
This function is then used in `applier_fiber` to obtain the correct
configuration to be stored in a snapshot.
In order to ensure that the configuration can be obtained, i.e. the
index we're looking at is not smaller than the last snapshot index, we
strengthen the conditions required for taking a snapshot: we check that
`_fsm` has not yet applied a snapshot at a larger index (which it may
have due to a remote snapshot install request). This also causes fewer
unnecessary snapshots to be taken in general.
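The lookup can be sketched as a backward scan. This is a sketch under
assumptions, not the real fsm code: log_entry and the string-typed
configuration are hypothetical, and "before this index" is taken to
include the index itself, since the function is described as returning
the configuration at that index.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

using index_t = std::uint64_t;

// Hypothetical log entry: conf is set only for configuration entries.
struct log_entry {
    index_t idx;
    std::optional<std::string> conf;
};

// Return the configuration in effect at idx: the newest configuration
// entry at or before idx, or the snapshot's configuration if the log
// holds no such entry. idx is assumed to be greater than the last
// snapshot index.
std::string last_conf_before(const std::vector<log_entry>& log,
                             const std::string& snapshot_conf,
                             index_t idx) {
    for (auto it = log.rbegin(); it != log.rend(); ++it) {
        if (it->idx <= idx && it->conf) {
            return *it->conf;
        }
    }
    return snapshot_conf;
}
```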
Replace it with a private _first_idx, which is maintained
along with the rest of class log state.
_first_idx is a name consistent with its counterpart last_idx().
Do not use a function since going forward we may want
to remove the Raft index from struct log_entry, so we should rely
less on it.
This fixes a bug where _last_conf_idx was not reset
after apply_snapshot() because start_idx() was pointing
to a non-existent entry.
If the log is empty, we must use the snapshot's term,
since the log may be empty right after taking a snapshot
when no trailing entries were kept.
This fixes a rare possible bug where the log matching
rule could be violated during elections by a follower
with a log which was just truncated after a snapshot.
A separate unit test for the issue will follow.
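The rule can be sketched as follows. The types and the free function
last_term are hypothetical simplifications of the log class, shown only
to make the fallback explicit.

```cpp
#include <cstdint>
#include <vector>

using term_t = std::uint64_t;

// Hypothetical, stripped-down entry and snapshot descriptors.
struct log_entry {
    term_t term;
};

struct snapshot {
    term_t term;
};

// Term of the last entry known to this node. With an empty log (for
// example right after a snapshot that kept no trailing entries) the
// snapshot's term is the only correct answer.
term_t last_term(const std::vector<log_entry>& entries, const snapshot& snp) {
    return entries.empty() ? snp.term : entries.back().term;
}
```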
raft::log::start_idx() is currently not meaningful
in case the log is empty.
Avoid using it in fsm::replicate_to() and avoid manual search for
previous log term, instead encapsulate the search in log::term_for().
As a side effect we now return the correct term (0)
when the log matching rule is exercised for an empty log
and the very first snapshot with term 0. Update raft_etcd_test.cc
accordingly.
This change happens to reduce the overall line count.
While at it, improve the comments in fsm::replicate_to().
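A sketch of what such an encapsulated lookup might look like. mini_log
and its members are hypothetical; the real log::term_for() may differ in
signature and details.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

using index_t = std::uint64_t;
using term_t = std::uint64_t;

// Hypothetical minimal log: terms[i] is the term of entry first_idx + i.
struct mini_log {
    index_t snapshot_idx;
    term_t snapshot_term;
    index_t first_idx;
    std::vector<term_t> terms;

    // Term of the entry at idx, if this node can still tell:
    // - from the in-memory entries when idx is within them,
    // - from the snapshot when idx is exactly the snapshot index,
    // - otherwise unknown.
    std::optional<term_t> term_for(index_t idx) const {
        if (!terms.empty() && idx >= first_idx &&
            idx - first_idx < terms.size()) {
            return terms[idx - first_idx];
        }
        if (idx == snapshot_idx) {
            return snapshot_term;
        }
        return std::nullopt;
    }
};
```

For an empty log whose first snapshot has index 0 and term 0,
term_for(0) yields 0 without consulting any start index.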
When initializing the log, find the most recent configuration
change index, if present.
Maintain the most recent configuration change index when
the log is truncated or entries are appended to it.
The last configuration change index will be used by FSM when it enters
candidate or leader state to fetch the current configuration.
We never truncate beyond a single in-progress configuration
change, so storing the previous value of last_conf_idx
avoids a backward scan of the log on truncation in all cases.
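The bookkeeping can be sketched with a small tracker. conf_tracker and
its methods are hypothetical and keep only the two indexes, not the
entries themselves.

```cpp
#include <cstdint>

using index_t = std::uint64_t;

// Hypothetical tracker for the two configuration indexes.
class conf_tracker {
    index_t _last_conf_idx = 0;
    index_t _prev_conf_idx = 0;
public:
    // Called for every appended entry.
    void on_append(index_t idx, bool is_configuration) {
        if (is_configuration) {
            _prev_conf_idx = _last_conf_idx;
            _last_conf_idx = idx;
        }
    }

    // Called when all entries with index >= idx are dropped. At most one
    // configuration change can be truncated, so falling back to the
    // previous index is enough and no backward scan is needed.
    void on_truncate(index_t idx) {
        if (_last_conf_idx >= idx) {
            _last_conf_idx = _prev_conf_idx;
            _prev_conf_idx = 0;
        }
    }

    index_t last_conf_idx() const { return _last_conf_idx; }
    index_t prev_conf_idx() const { return _prev_conf_idx; }
};
```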
Remove all unused log constructors.
Combine structs for append request send and receive into a single
struct.
Author: Gleb Natapov <gleb@scylladb.com>
Date: Mon Nov 23 14:33:14 2020 +0200
To prevent the log from taking too much memory, introduce a mechanism
that limits the log to a certain size. If the size is reached, no new
log entries can be submitted until previous entries are committed and
snapshotted.
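A toy sketch of the idea, assuming the limit is counted in entries and
that a rejected submission is simply reported to the caller. log_limiter
is hypothetical; the actual mechanism may block the submitter instead of
returning false, and may count bytes rather than entries.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical limiter: tracks how many entries the log currently holds.
class log_limiter {
    std::size_t _max_entries;
    std::size_t _in_use = 0;
public:
    explicit log_limiter(std::size_t max_entries) : _max_entries(max_entries) {}

    // Try to reserve room for one new entry. Fails when the log is full.
    bool try_submit() {
        if (_in_use >= _max_entries) {
            return false;
        }
        ++_in_use;
        return true;
    }

    // Called after a snapshot dropped `dropped` committed entries.
    void on_snapshot(std::size_t dropped) {
        _in_use -= std::min(dropped, _in_use);
    }
};
```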
This patch allows leaving snapshot_trailing entries in the log
when a state machine is snapshotted and raft log entries are dropped.
Those entries can be used to catch up nodes that are slow without
requiring snapshot transfer. The value is part of the configuration
and can be changed.
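A sketch of the truncation arithmetic, assuming the log is a deque of
consecutive entry indexes starting at 1. truncate_after_snapshot is a
hypothetical helper, not the real log API.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

using index_t = std::uint64_t;

// Drop entries covered by the snapshot, but keep the most recent
// snapshot_trailing of them so that slow followers can still be caught
// up from the log.
void truncate_after_snapshot(std::deque<index_t>& log,
                             index_t snapshot_idx,
                             std::size_t snapshot_trailing) {
    // First index that must stay in the log (indexes start at 1).
    index_t keep_from = snapshot_idx >= snapshot_trailing
        ? snapshot_idx - snapshot_trailing + 1
        : 1;
    while (!log.empty() && log.front() < keep_from) {
        log.pop_front();
    }
}
```

With entries 1..12, a snapshot at index 10 and snapshot_trailing = 3,
entries 8..12 remain; with snapshot_trailing = 0 only 11 and 12 remain.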
This patch introduces a partial Raft implementation. It has only log
replication and leader election support. Snapshotting and configuration
change along with other, smaller features are not yet implemented.
The approach taken by this implementation is to have a deterministic
state machine coded in raft::fsm. What makes the FSM deterministic is
that it does not do any IO by itself. It only takes an input (which may
be a network message, a time tick or a new append message), changes its
state and produces an output. The output contains the state that has
to be persisted, messages that need to be sent and entries that may
be applied (in that order). The input and output of the FSM are handled
by the raft::server class. It uses the raft::rpc interface to send and
receive messages and the raft::storage interface to implement
persistence.
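The split between a pure state machine and an IO driver can be sketched
with a toy example. toy_fsm, fsm_output and drive are hypothetical and
far simpler than raft::fsm and raft::server; the sketch only shows that
the FSM performs no IO and that the driver handles the output in the
stated order (persist, send, apply).

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Everything the driver must act on after a batch of inputs.
struct fsm_output {
    std::vector<std::string> to_persist;
    std::vector<std::string> to_send;
    std::vector<std::string> to_apply;
};

// Toy FSM: inputs only mutate in-memory state and accumulate output.
class toy_fsm {
    std::vector<std::string> _log;
    std::size_t _committed = 0;
    fsm_output _out;
public:
    // Input: a new entry submitted locally.
    void add_entry(const std::string& cmd) {
        _log.push_back(cmd);
        _out.to_persist.push_back(cmd);
        _out.to_send.push_back("append:" + cmd);
    }

    // Input: a message acknowledging the first `count` entries.
    void on_ack(std::size_t count) {
        while (_committed < count && _committed < _log.size()) {
            _out.to_apply.push_back(_log[_committed++]);
        }
    }

    // Hand the accumulated output to the driver and start a fresh batch.
    fsm_output get_output() { return std::exchange(_out, {}); }
};

// Driver playing the role of the server: all IO would happen here.
// It returns a trace instead of touching real storage or the network.
std::vector<std::string> drive(toy_fsm& fsm) {
    std::vector<std::string> trace;
    fsm_output out = fsm.get_output();
    for (const auto& e : out.to_persist) { trace.push_back("persist " + e); }
    for (const auto& m : out.to_send) { trace.push_back("send " + m); }
    for (const auto& e : out.to_apply) { trace.push_back("apply " + e); }
    return trace;
}
```

Because the FSM never performs IO itself, the same sequence of inputs
always produces the same trace, which makes it straightforward to test.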