Commit Graph

29 Commits

Author SHA1 Message Date
Avi Kivity
f3eade2f62 treewide: relicense to ScyllaDB-Source-Available-1.0
Drop the AGPL license in favor of a source-available license.
See the blog post [1] for details.

[1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/
2024-12-18 17:45:13 +02:00
Sergey Zolotukhin
c3e52ab942 raft: Invoke store_snapshot_descriptor with actually preserved items.
- raft_sys_table_storage::store_snapshot_descriptor now receives a number of
preserved items in the log, rather than _config.snapshot_trailing value;
- Incorrect check for truncated number of items in store_snapshot_descriptor
was removed.

Fixes scylladb/scylladb#16817
Fixes scylladb/scylladb#20080
2024-08-20 15:22:49 +02:00
Laszlo Ersek
5b9a4428c6 raft/log: clean up index_t usage
With implicit conversion of tagged integers to untagged ones going away,
explicitly tag (or untag, as necessary) the operands of the following
operations, in raft/log.{cc,h}:

- addition of tagged and untagged (both should be tagged)

- comparison (relop) between tagged and untagged (both should be tagged)

- subscripting an array, or offsetting an iterator, by tagged (should be
  untagged)

- comparing an array bound against tagged (should be untagged)

- subtracting tagged from an array bound (should be untagged)

Note: these files mix uniform initialization syntax (index_t{...}) with
constructor call syntax (index_t()), with the former being more frequent.
Stick with the former here too, for consistency.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-14 13:35:08 +02:00
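The tagging rules listed in 5b9a4428c6 can be illustrated with a small sketch. `log_index` and `toy_log` below are hypothetical stand-ins, not the actual `raft::index_t` or `raft::log`; the only assumption carried over from the message is that conversions between tagged and raw integers must be spelled out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical tagged integer (stand-in for something like raft::index_t).
// The constructor is explicit and there is no implicit conversion back to
// the raw integer, so every tag/untag step is visible at the call site.
struct log_index {
    uint64_t v = 0;
    log_index() = default;
    explicit log_index(uint64_t x) : v(x) {}
    uint64_t value() const { return v; }  // explicit "untag"
    friend log_index operator+(log_index a, log_index b) { return log_index{a.v + b.v}; }
    friend bool operator<(log_index a, log_index b) { return a.v < b.v; }
    friend bool operator==(log_index a, log_index b) { return a.v == b.v; }
};

// Toy log whose first stored entry has index `first`.
struct toy_log {
    log_index first{1};
    std::vector<int> entries;

    // Addition: the untagged size is tagged before it is added.
    log_index next_idx() const {
        return first + log_index{static_cast<uint64_t>(entries.size())};
    }

    // Comparison: both sides tagged. Subscripting: the offset is untagged.
    int at(log_index idx) const {
        assert(!(idx < first) && idx < next_idx());
        return entries[static_cast<std::size_t>(idx.value() - first.value())];
    }
};
```

With this shape, an expression such as `entries[idx]` or `first + entries.size()` fails to compile, which is the point of removing the implicit conversion.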
Avi Kivity
aa1270a00c treewide: change assert() to SCYLLA_ASSERT()
assert() is traditionally disabled in release builds, but not in
scylladb. This hasn't caused problems so far, but the latest abseil
release includes a commit [1] that causes a 1000 insn/op regression when
NDEBUG is not defined.

Clearly, we must move towards a build system where NDEBUG is defined in
release builds. But we can't just define it blindly without vetting
all the assert() calls, as some were written with the expectation that
they are enabled in release mode.

To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT()
macro in utils/assert.hh. This macro is always defined and is not conditional
on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release
mode.

[1] 66ef711d68

Closes scylladb/scylladb#20006
2024-08-05 08:23:35 +03:00
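As a rough illustration of what an assertion that is "not conditional on NDEBUG" looks like, here is a minimal sketch. The names `ALWAYS_ASSERT` and `always_assert_fail` are hypothetical; this is not the content of utils/assert.hh, only one plausible way to build such a macro.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical always-on assertion; NOT the actual SCYLLA_ASSERT definition.
// Unlike assert() from <cassert>, it never looks at NDEBUG, so the condition
// is evaluated and checked in every build mode.
[[noreturn]] inline void always_assert_fail(const char* expr, const char* file, int line) {
    std::fprintf(stderr, "%s:%d: assertion failed: %s\n", file, line, expr);
    std::abort();
}

#define ALWAYS_ASSERT(cond)                                     \
    do {                                                        \
        if (!(cond)) {                                          \
            always_assert_fail(#cond, __FILE__, __LINE__);      \
        }                                                       \
    } while (0)
```

Because the condition is always evaluated, code that relies on side effects inside the asserted expression keeps working after NDEBUG is defined for the rest of the build.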
Kefu Chai
c555af3cd8 raft: add formatter for raft::log
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for `raft::log`, and drop its
operator<<.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17301
2024-02-13 17:17:57 +02:00
Avi Kivity
fb6804e7a4 raft: don't compare signed and unsigned types
gcc warns it can lead to undefined behavior, though 2G entries
in a list of mutations are unlikely. Use the correct type for iteration.
2022-11-28 21:58:30 +02:00
Petr Gusev
cbfe033786 raft server, shrink_to_fit on log truncation
We don't want to keep memory we don't use, shrink_to_fit guarantees that.

In fact, boost::deque frees up memory when items are deleted, so this change has little effect at the moment, but it may pay off if we change the container in the future.
2022-09-27 12:02:36 +04:00
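A minimal sketch of the idea in cbfe033786, using `std::deque` as a stand-in for the container actually used by the log (the message mentions boost::deque). The function name `truncate_prefix` is hypothetical. Note that for the standard containers `shrink_to_fit()` is a non-binding request, so how much memory is returned depends on the implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Drop the first `n` entries, then ask the container to release any
// capacity it no longer needs.
inline void truncate_prefix(std::deque<int>& log, std::size_t n) {
    assert(n <= log.size());
    log.erase(log.begin(), log.begin() + static_cast<std::ptrdiff_t>(n));
    log.shrink_to_fit();
}
```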
Petr Gusev
27e60ecbf4 raft server, log size limit in bytes
Before this patch we could get an OOM if we
received several big commands. The number of
commands was small, but their total size
in bytes was large.

snapshot_trailing_size is needed to guarantee
progress. Without this limit the fsm could
get stuck if the size of the next item is
greater than max_log_size - (size of trailing entries).
2022-09-26 13:10:10 +04:00
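The interaction between the byte limit and `snapshot_trailing_size` in 27e60ecbf4 can be sketched as follows. `byte_limited_log` is a hypothetical toy, and it simplifies by assuming every stored entry is already covered by the snapshot when `on_snapshot()` runs; only the two limit names come from the message.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Toy log that tracks only entry sizes (in bytes).
struct byte_limited_log {
    std::size_t max_log_size;            // upper bound on bytes held in the log
    std::size_t snapshot_trailing_size;  // max bytes kept after a snapshot
    std::deque<std::size_t> entry_sizes; // one element per entry
    std::size_t bytes = 0;               // sum of entry_sizes

    byte_limited_log(std::size_t max_size, std::size_t trailing_size)
        : max_log_size(max_size), snapshot_trailing_size(trailing_size) {
        assert(trailing_size <= max_size);
    }

    bool can_accept(std::size_t entry_size) const {
        return bytes + entry_size <= max_log_size;
    }

    bool append(std::size_t entry_size) {
        if (!can_accept(entry_size)) {
            return false;  // caller has to wait for a snapshot
        }
        entry_sizes.push_back(entry_size);
        bytes += entry_size;
        return true;
    }

    // Drop the oldest entries until the tail fits in snapshot_trailing_size.
    void on_snapshot() {
        while (bytes > snapshot_trailing_size) {
            bytes -= entry_sizes.front();
            entry_sizes.pop_front();
        }
    }
};
```

In this sketch, after a snapshot the log holds at most `snapshot_trailing_size` bytes, so any entry no larger than `max_log_size - snapshot_trailing_size` is guaranteed to fit. Without the trailing byte limit, the retained tail could leave too little room for the next entry, which is the stuck state the message describes.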
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes were applied mechanically with a script, except for
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Gleb Natapov
a59779155f raft: fix matching of a snapshotted log on a follower
There can be a situation where a leader will send to a follower entries
that the latter already snapshotted. Currently a follower considers those
to be outdated appends and it rejects them, but it may cause the
follower progress to be stuck:

- A is a leader, B is a follower, there are other followers which A used to commit entries
- A remembers that the last matched entry for B is 10, so the next entry to send is 11. A managed to commit entry 11 using other followers
- A sends entry 11 to B
- B receives it, accepts, and updates its commit index to 11. It sends a success reply to A, but it never reaches A due to a network partition
- B takes a snapshot at index 11
- A sends entry 11 to B again
- B rejects it since it is inside the snapshot
- A receives the reject and retries from the same entry
- The same thing happens again

We should not reject such outdated entries since if they fall inside a
snapshot it means they match (according to log matching property).
Accepting them restores liveness in the case above.

Fixes #9552
2021-11-09 14:51:40 +02:00
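The matching rule argued for in a59779155f can be sketched like this. `follower_log` and `matches` are hypothetical names, and the toy assumes the log starts right after the snapshot (no trailing entries); the only rule taken from the message is that a position inside the snapshot counts as a match.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct follower_log {
    uint64_t snapshot_idx = 0;    // last index covered by the snapshot
    uint64_t snapshot_term = 0;   // term of that index
    std::vector<uint64_t> terms;  // terms of entries snapshot_idx + 1, + 2, ...

    uint64_t last_idx() const { return snapshot_idx + terms.size(); }

    // Does this log agree with the leader at (prev_idx, prev_term)?
    bool matches(uint64_t prev_idx, uint64_t prev_term) const {
        if (prev_idx < snapshot_idx) {
            // Inside the snapshot: snapshotted entries are committed, so by
            // the log matching property they agree with the leader.
            return true;
        }
        if (prev_idx == snapshot_idx) {
            return prev_term == snapshot_term;
        }
        if (prev_idx > last_idx()) {
            return false;  // we do not have that entry yet
        }
        return terms[static_cast<std::size_t>(prev_idx - snapshot_idx - 1)] == prev_term;
    }
};
```

In the scenario above, B has a snapshot at index 11 and A resends entry 11 with a previous index of 10. `matches(10, ...)` returns true, so B can acknowledge and A can advance instead of retrying forever.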
Gleb Natapov
a04eb2d51f raft: do not maintain _last_conf_idx and _prev_conf_idx past snapshot index
The log maintains _last_conf_idx and _prev_conf_idx indexes into the log
to point to where the latest and previous configuration can be found.
If they are zero it means that the latest config is in the snapshot.
When a snapshot with trailing entries is applied, we can safely reset those
indexes that are smaller than the snapshot one to zero because the
snapshot will have the latest config anyway. This simplifies maintenance
of those indexes since their value will not depend on user configured
snapshot_trailing parameter.
2021-11-09 14:03:36 +02:00
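A minimal sketch of the reset described in a04eb2d51f, under the stated convention that zero means "the configuration lives in the snapshot". The struct `conf_indexes` is hypothetical; the field names mirror `_last_conf_idx` and `_prev_conf_idx` from the message, and the strict "smaller than" comparison follows its wording.

```cpp
#include <cassert>
#include <cstdint>

struct conf_indexes {
    uint64_t last_conf_idx = 0;  // 0: latest configuration is in the snapshot
    uint64_t prev_conf_idx = 0;  // 0: previous configuration is in the snapshot

    // Called when a snapshot at snapshot_idx is applied. The number of
    // trailing entries kept in the log plays no role here.
    void on_snapshot(uint64_t snapshot_idx) {
        assert(prev_conf_idx <= last_conf_idx);
        if (last_conf_idx < snapshot_idx) {
            last_conf_idx = 0;
        }
        if (prev_conf_idx < snapshot_idx) {
            prev_conf_idx = 0;
        }
    }
};
```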
Gleb Natapov
ce40b01b07 raft: rename snapshot into snapshot_descriptor
The snapshot structure does not contain the snapshot itself but only
refers to it through its id. Rename it to snapshot_descriptor for clarity.
2021-08-29 12:53:03 +03:00
Kamil Braun
7533c84e62 raft: sometimes become a candidate even if outside the configuration
There are situations where a node outside the current configuration is
the only node that can become a leader. We become candidates in such
cases. But there is an easy check for when we don't need to; a comment was
added explaining that.
2021-08-06 13:18:32 +02:00
Kamil Braun
c6563220b0 raft: store cluster configuration when taking snapshots
We add a function `log_last_conf_before(index_t)` to `fsm` which, given
an index greater than the last snapshot index, returns the configuration
at this index, i.e. the configuration of the last configuration entry
before this index.

This function is then used in `applier_fiber` to obtain the correct
configuration to be stored in a snapshot.

In order to ensure that the configuration can be obtained, i.e. the
index we're looking at is not smaller than the last snapshot index, we
strengthen the conditions required for taking a snapshot: we check that
`_fsm` has not yet applied a snapshot at a larger index (which it may
have due to a remote snapshot install request). This also causes fewer
unnecessary snapshots to be taken in general.
2021-08-06 12:00:32 +02:00
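The lookup that c6563220b0 describes can be sketched as a forward scan that falls back to the snapshot's configuration. `conf_entry`, `conf_log` and `last_conf_before` are hypothetical names, and a configuration is modelled as a plain string; the real `log_last_conf_before(index_t)` may be implemented quite differently.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

struct conf_entry {
    uint64_t idx;
    std::optional<std::string> conf;  // set only for configuration entries
};

struct conf_log {
    uint64_t snapshot_idx;
    std::string snapshot_conf;        // configuration stored with the snapshot
    std::vector<conf_entry> entries;  // ascending idx, all > snapshot_idx

    // Configuration in effect at `idx`: the last configuration entry at or
    // before idx, or the snapshot's configuration if there is none.
    const std::string& last_conf_before(uint64_t idx) const {
        assert(idx >= snapshot_idx);  // older history is gone
        const std::string* result = &snapshot_conf;
        for (const auto& e : entries) {
            if (e.idx > idx) {
                break;
            }
            if (e.conf) {
                result = &*e.conf;
            }
        }
        return *result;
    }
};
```

The precondition in the sketch corresponds to the strengthened snapshot condition in the message: if a snapshot at a larger index was already applied, the history needed to answer the query no longer exists.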
Avi Kivity
a55b434a2b treewide: extent copyright statements to present day
2021-06-06 19:18:49 +03:00
Konstantin Osipov
0295163f6f raft: always return a non-zero configuration index from the log
Return snapshot index for last configuration index if there
is no configuration in the log.
2021-03-24 14:05:55 +03:00
Konstantin Osipov
4083026b65 raft: consistently use configuration from the log
2021-02-18 16:04:44 +03:00
Konstantin Osipov
ed65a8635e raft: update raft::log::apply_snapshot() assert
apply_snapshot() doesn't support applying the same snapshot
twice. The caller must check the current snapshot before
applying.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
cfe407b402 raft: inline raft::log::truncate_tail()
It's the core of apply_snapshot() work and is only used in it.

Now that truncate_tail is inline, rename truncate_head()
to truncate_uncommitted().
2021-02-18 16:04:44 +03:00
Konstantin Osipov
e0011c6e4d raft: ignore AppendEntries RPC with a very old term
Do not assert on an outdated message.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
805d52eb16 raft: remove log::start_idx()
Replace it with a private _first_idx, which is maintained
along with the rest of class log state.
_first_idx is a name consistent with counterpart last_idx().

Do not use a function since going forward we may want
to remove Raft index from struct log_entry, so should rely
less on it.

This fixes a bug when _last_conf_idx was not reset
after apply_snapshot() because start_idx() was pointing
to a non-existent entry.
2021-02-18 16:04:44 +03:00
Konstantin Osipov
af8770da63 raft: return a correct last term on an empty log
If the log is empty, we must use snapshot's term,
since the log could be right after taking a snapshot
when no trailing entries were kept.

This fixes a rare possible bug when a log matching
rule could be violated during elections by a follower
with a log which was just truncated after a snapshot.

A separate unit test for the issue will follow.
2021-02-18 16:04:43 +03:00
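The fallback in af8770da63 amounts to a one-line rule, sketched here with a hypothetical `term_log` type that stores only entry terms.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct term_log {
    uint64_t snapshot_term = 0;   // term of the last snapshotted entry
    std::vector<uint64_t> terms;  // terms of entries still in the log

    void append(uint64_t term) {
        assert(term >= last_term());  // terms never decrease along the log
        terms.push_back(term);
    }

    // With no entries left (for example right after a snapshot that kept no
    // trailing entries), the last term is the snapshot's term, not zero.
    uint64_t last_term() const {
        return terms.empty() ? snapshot_term : terms.back();
    }
};
```

Reporting zero here would make a freshly truncated follower look less up to date than it is when terms are compared during an election.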
Konstantin Osipov
cb035a7c8d raft: do not use raft::log::start_idx() outside raft::log()
raft::log::start_idx() is currently not meaningful
in case the log is empty.

Avoid using it in fsm::replicate_to() and avoid manual search for
previous log term, instead encapsulate the search in log::term_for().

As a side effect we currently return a correct term (0)
when log matching rule is exercised for an empty log
and the very first snapshot with term 0. Update raft_etcd_test.cc
accordingly.

This change happens to reduce the overall line count.

While at it, improve the comments in raft::replicate_to().
2021-02-18 16:04:43 +03:00
Konstantin Osipov
b29181875c raft: joint consensus, keep track of the last confchange index in the log
When initializing the log, find the most recent configuration
change index, if present.
Maintain the most recent configuration change index when
the log is truncated or entries are appended to it.
The last configuration change index will be used by FSM when it enters
candidate or leader state to fetch the current configuration.

We never truncate beyond a single in-progress configuration
change, so storing the previous value of last_conf_idx
helps avoid log backward scan on truncation in 100% of cases.

Remove all unused log constructors.
2021-01-29 22:07:07 +03:00
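The bookkeeping described in b29181875c can be sketched as below. `tracked_entry` and `conf_tracking_log` are hypothetical; the sketch relies on the stated invariant that truncation never goes beyond a single in-progress configuration change, which is why one saved previous index is enough.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct tracked_entry {
    bool is_conf;  // true for configuration change entries
};

struct conf_tracking_log {
    uint64_t first_idx = 1;
    std::vector<tracked_entry> entries;
    uint64_t last_conf_idx = 0;  // most recent configuration entry, 0 if none
    uint64_t prev_conf_idx = 0;  // the one before it, 0 if none

    uint64_t last_idx() const { return first_idx + entries.size() - 1; }

    void append(tracked_entry e) {
        entries.push_back(e);
        if (e.is_conf) {
            prev_conf_idx = last_conf_idx;
            last_conf_idx = last_idx();
        }
    }

    // Remove entry `idx` and everything after it.
    void truncate_uncommitted(uint64_t idx) {
        assert(idx >= first_idx && idx <= last_idx());
        entries.resize(static_cast<std::size_t>(idx - first_idx));
        if (last_conf_idx >= idx) {
            // At most one uncommitted configuration change can be cut off,
            // so the saved previous index is still inside the log.
            assert(prev_conf_idx < idx);
            last_conf_idx = prev_conf_idx;
        }
    }
};
```

No backward scan over the remaining entries is needed after truncation, which matches the "100% of cases" claim under that invariant.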
Gleb Natapov
6d47a535b9 raft: combine append_request _receive and _send
Combine structs for append request send and receive into a single
struct.

Author:    Gleb Natapov <gleb@scylladb.com>
Date:      Mon Nov 23 14:33:14 2020 +0200
2021-01-18 12:24:13 -04:00
Gleb Natapov
851e3000c4 raft: Ignore outdated snapshots
Do not try to install snapshots that are older than current one.
2020-11-25 12:36:41 +02:00
Gleb Natapov
8d9b6f588e raft: stop accepting requests on a leader after the log reaches the limit
To prevent the log from taking too much memory, introduce a mechanism that
limits the log to a certain size. Once the limit is reached, no new log
entries can be submitted until previous entries are committed and
snapshotted.
2020-11-18 19:14:37 +01:00
Gleb Natapov
7fdfa32dbd raft: preserve trailing raft log entries during snapshotting
This patch allows keeping snapshot_trailing entries in the raft log
when a state machine is snapshotted and older log entries are dropped.
Those entries can be used to catch up nodes that are slow without
requiring snapshot transfer. The value is part of the configuration
and can be changed.
2020-10-15 11:50:27 +03:00
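A minimal sketch of trailing-entry preservation as described in 7fdfa32dbd. `trailing_log` and the exact `apply_snapshot` signature are hypothetical; only the meaning of `snapshot_trailing` (how many already snapshotted entries stay in the log) is taken from the message.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

struct trailing_log {
    uint64_t first_idx = 1;   // index of entries.front()
    std::deque<int> entries;  // payloads; the values are irrelevant here

    uint64_t last_idx() const { return first_idx + entries.size() - 1; }

    // Apply a local snapshot at snapshot_idx, keeping up to `trailing`
    // entries at or below it. Returns the number of entries dropped.
    uint64_t apply_snapshot(uint64_t snapshot_idx, uint64_t trailing) {
        assert(snapshot_idx + 1 >= first_idx && snapshot_idx <= last_idx());
        const uint64_t covered = snapshot_idx + 1 - first_idx;  // entries the snapshot covers
        const uint64_t keep = std::min(trailing, covered);      // entries actually preserved
        const uint64_t drop = covered - keep;
        entries.erase(entries.begin(),
                      entries.begin() + static_cast<std::ptrdiff_t>(drop));
        first_idx += drop;
        return drop;
    }
};
```

Note that the number of entries actually preserved (`keep`) can be smaller than the configured value when the log is short, so the two should not be treated as interchangeable.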
Gleb Natapov
e1ac1a61c9 raft: Implement log replication and leader election
This patch introduces a partial Raft implementation. It has only log
replication and leader election support. Snapshotting and configuration
change along with other, smaller features are not yet implemented.

The approach taken by this implementation is to have a deterministic
state machine coded in raft::fsm. What makes the FSM deterministic is
that it does not do any IO by itself. It only takes an input (which may
be a networking message, time tick or new append message), changes its
state and produces an output. The output contains the state that has
to be persisted, messages that need to be sent and entries that may
be applied (in that order). The input and output of the FSM are handled
by raft::server class. It uses raft::rpc interface to send and receive
messages and raft::storage interface to implement persistence.
2020-10-01 14:30:59 +03:00
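The split between a deterministic state machine and an IO-performing driver, as described in e1ac1a61c9, can be sketched as follows. `toy_fsm` and `fsm_output` are hypothetical and model only the shape of the interaction (inputs in, one batched output out); they implement none of the Raft protocol and do not reflect the real `raft::fsm` interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// What the driver must do, in this order: persist, then send, then apply.
struct fsm_output {
    std::vector<std::string> to_persist;
    std::vector<std::string> to_send;
    std::vector<std::string> to_apply;
};

// Performs no IO: inputs only mutate in-memory state.
class toy_fsm {
    std::vector<std::string> _log;
    std::vector<std::string> _outbox;
    std::size_t _stable = 0;   // entries already handed out for persistence
    std::size_t _commit = 0;   // entries known to be committed
    std::size_t _applied = 0;  // entries already handed out for applying
public:
    void add_entry(std::string cmd) {
        _outbox.push_back("append:" + cmd);
        _log.push_back(std::move(cmd));
    }
    void tick() { _outbox.push_back("heartbeat"); }
    void on_commit(std::size_t count) {
        assert(count <= _log.size());
        _commit = std::max(_commit, count);
    }

    // Collect everything that changed since the previous call.
    fsm_output get_output() {
        fsm_output out;
        out.to_persist.assign(_log.begin() + static_cast<std::ptrdiff_t>(_stable), _log.end());
        _stable = _log.size();
        out.to_send.swap(_outbox);
        out.to_apply.assign(_log.begin() + static_cast<std::ptrdiff_t>(_applied),
                            _log.begin() + static_cast<std::ptrdiff_t>(_commit));
        _applied = _commit;
        return out;
    }
};
```

Because the state machine never touches the network or disk, the same sequence of inputs always yields the same outputs, which makes it straightforward to unit test; in the design above, the driver role is played by `raft::server` using the `raft::rpc` and `raft::storage` interfaces.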