Commit Graph

53 Commits

Author SHA1 Message Date
Avi Kivity
f3eade2f62 treewide: relicense to ScyllaDB-Source-Available-1.0
Drop the AGPL license in favor of a source-available license.
See the blog post [1] for details.

[1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/
2024-12-18 17:45:13 +02:00
Kamil Braun
0eda7a2619 raft: server: add trigger_snapshot API
This allows the user of `raft::server` to ask it to create a snapshot
and truncate the Raft log. In a later commit we'll add a REST endpoint
to Scylla to trigger group 0 snapshots.

One use case for this API is to create group 0 snapshots in Scylla
deployments which upgraded to Raft in version 5.2 and started with an
empty Raft log with no snapshot at the beginning. This causes problems,
e.g. when a new node bootstraps to the cluster, it will not receive a
snapshot that would contain both schema and group 0 history, which would
then lead to inconsistent schema state and trigger assertion failures as
observed in scylladb/scylladb#16683.

In 5.4 the logic of initial group 0 setup was changed to start the Raft
log with a snapshot at index 1 (ff386e7a44)
but a problem remains with these existing deployments coming from 5.2,
we need a way to trigger a snapshot in them (other than performing 1000
arbitrary schema changes).

Another potential use case in the future would be to trigger snapshots
based on external memory pressure in tablet Raft groups (for strongly
consistent tables).
2024-01-23 16:48:28 +01:00
Patryk Jędrzejczak
df2034ebd7 server, raft_group0_client: remove the default nullptr values
The previous commit has fixed 5 bugs of the same type - incorrectly
passing the default nullptr to one of the changed functions. At
least some of these bugs wouldn't appear if there was no default
value. It's much harder to make this kind of a bug if you have to
write "nullptr". It's also much easier to detect it in review.

Moreover, these default values are rarely used outside tests.
Keeping them is just not worth the time spent on debugging.
2024-01-05 18:45:50 +01:00
Yaniv Kaul
c658bdb150 Typos: fix typos in comments
Fixes some typos as found by codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.

Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2023-12-02 22:37:22 +02:00
Piotr Dulikowski
a1ebfcf006 raft: add server::is_alive
Add a method which reports whether given raft server is running.

In following commits, the information about whether the local raft
group 0 is running or not will be included in the response to the
failure detector ping, and the is_alive method will be used there.
2023-11-23 00:34:22 +01:00
Piotr Dulikowski
64668e325e raft: expose current_leader in raft::server
The handler for join_node_request will need to know which node is
considered the group 0 leader right now by the local node.

If the topology coordinator crashes and a new node immediately wants to
replace it with the same IP, the node that handles join_node_request
will attempt to perform a read barrier. If this happens quickly enough,
due to the IP reuse the RPC will be sent to the new node instead of the
(now crashed) topology coordinator; the RPC will get an error and will
fail the barrier.

If we detect that the new node wants to replace the current topology
coordinator, the upcoming join_node_request_handler will wait until
there is a leader change.
2023-09-26 15:56:52 +02:00
Mikołaj Grzebieluch
dc6017b71b raft topology: make mutation_size_threshold depends on max_command_size
`get_cdc_generation_mutations` splits data to mutations of maximal size
`mutation_size_treshold`. Before this commit it was hardcoded to 2 MB.

Calculate `mutation_size_threshold` to leave space for cdc generation
data and not exceed `max_command_size`.
2023-07-07 13:11:52 +02:00
Gleb Natapov
2fc8e13dd8 raft: add server::wait_for_state_change() function
Add a function that allows waiting for a state change of a raft server.
It is useful for a user that wants to know when a node becomes/stops
being a leader.

Message-Id: <20230316112801.1004602-4-gleb@scylladb.com>
2023-03-20 11:31:55 +01:00
Gleb Natapov
59f7aeb79b raft: move some functions out of ad-hoc section
Make tick() and is_leader() part of the API. First is used externally
already and another will be used in following patches.

Message-Id: <20230316112801.1004602-3-gleb@scylladb.com>
2023-03-20 11:25:19 +01:00
Konstantin Osipov
990c7a209f raft: change the API of conf change notifications
Pass a change diff into the notification callback,
rather than add or remove servers one by one, so that
if we need to persist the state, we can do it once per
configuration change, not for every added or removed server.

For now still pass added and removed entries in two separate calls
per a single configuration change. This is done mainly to fulfill the
library contract that it never sends messages to servers
outside the current configuration. The group0 RPC
implementation doesn't need the two calls, since it simply
marks the removed servers as expired: they are not removed immediately
anyway, and messages can still be delivered to them.
However, there may be test/mock implementations of RPC which
could benefit from this contract, so we decided to keep it.
2022-11-17 12:07:31 +03:00
Petr Gusev
d79fbab682 raft: convert raft::transport_error to raft::commit_status_unknown
The add_entry and modify_config methods sometimes do an rpc to
execute the request on the current leader. If the tcp connection
was broken, a seastar::rpc::closed_error would be thrown to the client.
This exception was not documented in the method comments and the
client could have missed handling it. For example, this exception
was not handled when calling modify_config in raft_group0,
which sometimes broke the removenode command.

An intermittent_connection_error exception was added earlier to
solve a similar problem with the read_barrier method. In this patch it
is renamed to transport_error, as it seems to better describe the
situation, and an explicit specification for this exception
was added - the rpc implementation can throw it if it is not known
whether the call reached the target node and whether any
actions were performed on it.

In case of read_barrier it does not matter and we just retry. In case
of add_entry and modify_config we cannot retry
because the rpc calls are not idempotent, so we convert this
exception to commit_status_unknown, which the client has to handle.

Explicit comments have also been added to raft::server methods
describing all possible exceptions.
2022-10-07 13:34:16 +04:00
Petr Gusev
27e60ecbf4 raft server, log size limit in bytes
Before this patch we could get an OOM if we
received several big commands. The number of
commands was small, but their total size
in bytes was large.

snapshot_trailing_size is needed to guarantee
progress. Without this limit the fsm could
get stuck if the size of the next item is
greater than max_log_size - (size of trailing entries).
2022-09-26 13:10:10 +04:00
Petr Gusev
210d9dd026 raft: fix snapshots leak
applier_fiber could create multiple snapshots between
io_fiber run. The fsm_output.snp variable was
overwritten by applier_fiber and io_fiber didn't drop
the previous snapshot.

In this patch we introduce the variable
fsm_output.snps_to_drop, store in it
the current snapshot id before applying
a new one, and then sequentially drop them in
io_fiber after storing the last snapshot_descriptor.

_sm_events.signal() is added to fsm::apply_snapshot,
since this method mutates the _output and thus gives a
reason to run io_fiber.

The new test test_frequent_snapshotting demonstrates
the problem by causing frequent snapshots and
setting the applier queue size to one.

Closes #11530
2022-09-21 12:46:26 +02:00
Petr Gusev
e92dc9c15b raft server, provide a callback to handle background errors
Fix: #11352
2022-09-12 10:16:43 +04:00
Petr Gusev
c57238d3d6 raft server, check aborted state on public server public api's
Fix: #11352
2022-09-12 10:16:40 +04:00
Petr Gusev
eedfd7ad9b raft, limit for command size
Adds max_command_size to the raft configuration and
restricts commands to this limit.
2022-08-18 13:35:49 +04:00
Kamil Braun
daf9c53bb8 raft: split can_vote field from server_address to separate struct
Whether a server can vote in a Raft configuration is not part of the
address. `server_address` was used in many context where `can_vote` is
irrelevant.

Split the struct: `server_address` now contains only `id` and
`server_info` as it did before `can_vote` was introduced. Instead we
have a `config_member` struct that contains a `server_address` and the
`can_vote` field.

Also remove an "unsafe" constructor from `server_address` where `id` was
provided but `server_info` was not. The constructor was used for tests
where `server_info` is irrelevant, but it's important not to forget
about the info in production code. The constructor was used for two
purposes:
- Invoking set operations such as `contains`. To solve this we use C++20
  transparent hash and comparator functions, which allow invoking
  `contains` and similar functions by providing a different key type (in
  this case `raft::server_id` in set of addresses, for example).
- constructing addresses without `info`s in tests. For this we provide
  helper functions in the test helpers module and use them.
2022-07-18 18:22:10 +02:00
Kamil Braun
d128f65354 raft: server: if add_entry with wait_type::applied successfully returns, ensure state_machine::apply is called for this entry
Previously it could happen that `add_entry` returned successfully but
`state_machine::apply` was never called by the server for this entry,
even though `wait_type::applied` was used, if the server loaded
a snapshot that contained this entry in just the right moment. Some
clients may find this behavior surprising, even though we may argue that
it's not technically incorrect.

For example, the nemesis test assumed that if `add_entry` returned
successfully (with `wait_type::applied`), the local state machine
applied the entry; the test uses `apply` to obtain an output - the
result of the command - from the state machine.

It's not a problem to give a stronger guarantee, so we do it in this
commit. In the scenario where a snapshot causes Raft to skip over the
entry, `add_entry` will finish exceptionally with
`commit_status_unknown`.
2022-05-27 12:06:18 +02:00
Kamil Braun
ad3141d3e0 raft: server: translate abort_requested_exception to raft::request_aborted
The `wait_for_leader` function would throw a low-level
`abort_requested_aborted` exception from seastar::shared_promise.
Translate it to the high-level raft::request_aborted so we can reduce
the number of different exception types which cross the Raft API
boundary.

Also, add comments on Raft API functions about the exception thrown when
requests are aborted.
2022-04-05 19:18:53 +02:00
Gleb Natapov
a1604aa388 raft: make raft requests abortable
This patch adds an ability to pass abort_source to raft request APIs (
add_entry, modify_config) to make them abortable. A request issuer not
always want to wait for a request to complete. For instance because a
client disconnected or because it no longer interested in waiting
because of a timeout. After this patch it can now abort waiting for such
requests through an abort source. Note that aborting a request only
aborts the wait for it to complete, it does not mean that the request
will not be eventually executed.

Message-Id: <YjHivLfIB9Xj5F4g@scylladb.com>
2022-03-16 18:38:01 +01:00
Kamil Braun
28b5792481 raft: server: don't create local waiter in modify_config
When forwarding a reconfiguration request from follower to a leader in
`modify_config`, there is no reason to wait for the follower's commit
index to be updated. The only useful information is that the leader
committed the configuration change - so `modify_config` should return as
soon as we know that.

There is a reason *not* to wait for the follower's commit index to be
updated: if the configuration change removes the follower, the follower
will never learn about it, so a local waiter will never be resolved.

`execute_modify_config` - the part of `modify_config` executed on the
leader - is thus modified to finish when the configuration change is
fully complete (including the dummy entry appended at the end), and
`modify_config` - which does the forwarding - no longer creates a local
waiter, but returns as soon as the RPC call to the leader confirms that
the entry was committed on the leader.

We still return an `entry_id` from `execute_modify_config` but that's
just an artifact of the implementation.

Fixes #9981.
2022-01-27 17:49:40 +01:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes we applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Kamil Braun
485c0b1819 raft: server: don't register metrics in start()
Instead, expose `register_metrics()` at the `server` interface
(previously it was a private method of `server_impl`).

Metrics are global so `register_metrics()` cannot be called on
two servers that have the same ID, which is useful e.g. in tests when we
want to simulate server stops and restarts.
2021-12-07 11:23:33 +01:00
Konstantin Osipov
6d28927550 raft: make forwarding optional
In absence of abort_source or timeouts in Raft API, automatic
bouncing can create too much noise during testing, especially
during network failures. Add an option to disable follower
bouncing feature, since randomized_nemesis_test has its own
bouncing which handles timeouts correctly.

Optionally disable forwarding in basic_generator_test.
2021-11-25 12:35:43 +03:00
Konstantin Osipov
e3751068fe raft: (server) allow adding entries/modify config on a follower
Implement an RPC to forward add_entry calls from the follower
to leader. Bounce & retry in case of not_a_leader.
Do not retry in case of uncertainty - this can lead to adding
duplicate entries.

The feature is added to core Raft since it's needed by
all current clients - both topology and schema changes.

When forwarding an entry to a remote leader we may get back
a term/index pair that conflicts (has the same index, but is with
a higher term) with a local entry we're still waiting on.

This can happen, e.g. because there was a leader change and the
log was truncated, but we still haven't got the append_entries
RPC from the new leader, still haven't truncated the log locally,
still haven't aborted all the local waits for truncated entries.

Only remove the offending entry from the wait list and abort it.
There may be entries labeled with an older term to the right (with
higher commit index) of the conflicting entry. However, finding them,
would require a linear scan. If we allow it, we may end up doing this
linear scan for *every* conflicting entry during the transition
period, which brings us to N^2 complexity of this step. At the
same time, as soon as append_entries that commits a higher-term
entry with the same index reaches the follower, the waits
for the respective truncated entry will be aborted anyway (see
notify_waiters() which sets dropped_entry exception), so the scan
is unnecessary.

Similarly to being able to add entries, allow to modify
Raft group configuration on a follower. The implementation
works the same way as adding entries - forwards the command
to the leader.

Now that add_entry() or modify_config never throws not_a_leader,
it's more likely to throw timed_out_error, e.g. in case the
network is partitioned. Previously it was only possible due to a
semaphore wait timeout, and this scenario was not tested.
Handle timed_out_error on RPC level to let the existing tests
(specifically the randomized nemesis test) pass.
2021-11-25 11:50:38 +03:00
Konstantin Osipov
9cde1cdf71 raft: (server) implement id() helper
There is no easy way to get server id otherwise.
2021-11-25 11:50:38 +03:00
Gleb Natapov
03a266d73b raft: make read_barrier work on a follower as well as on a leader
This patch implements RAFT extension that allows to perform linearisable
reads by accessing local state machine. The extension is described
in section 6.4 of the PhD. To sum it up to perform a read barrier on
a follower it needs to asks a leader the last committed index that it
knows about. The leader must make sure that it is still a leader before
answering by communicating with a quorum. When follower gets the index
back it waits for it to be applied and by that completes read_barrier
invocation.

The patch adds three new RPC: read_barrier, read_barrier_reply and
execute_read_barrier_on_leader. The last one is the one a follower uses
to ask a leader about safe index it can read. First two are used by a
leader to communicate with a quorum.
2021-08-25 08:57:13 +03:00
Gleb Natapov
ed49d29473 raft: allow to initiate leader stepdown process
Sometimes an ability to force a leader change is needed. For instance
if a node that is currently serving as a leader needs to be brought
down for maintenance. If it will be shutdown without leadership
transfer the cluster will be unavailable for leader election timeout at
least.

We already have a mechanism to transfer the leadership in case an active
leader is removed. The patch exposes it as an external interface with a
timeout period. If a node is still a leader after the timeout the
operation will fail.
2021-06-22 14:36:42 +03:00
Konstantin Osipov
c67c77ed03 raft: (server) wait for configuration transition to complete
By default, wait for the server to leave the joint configuration
when making a configuration change.

When assembling a fresh cluster Scylla may run a series of
configuration changes. These changes would all go through the same
leader and serialize in the critical section around server::cas().

Unless this critical section protects the complete transition from
C_old configuration to C_new, after the first configuration
is committed, the second may fail with exception that a configuration
change is in progress. The topology changes layer should handle
this exception, however, this may introduce either unpleasant
delays into cluster assembly (i.e. if we sleep before retry), or
a busy-wait/thundering herd situation, when all nodes are
retrying their configuration changes.

So let's be nice and wait for a full transition in
server::set_configuration().
2021-06-16 16:52:43 +03:00
Konstantin Osipov
631c89e1a6 raft: (server) implement raft::server::get_configuration()
raft::server::set_configuration() is useless on
application level if we can't query the previous configuration.
2021-06-16 16:52:43 +03:00
Tomasz Grabiec
50d64646cd Merge "raft: replication test fixes and OOP refactor" from Alejo
Feature requests, fixes, and OOP refactor of replication_test.

Note: all known bugs and hangs are now fixed.

A new helper class "raft_cluster" is created.
Each move of a helper function to the class has its own commit.
New helpers are provided

To simplify code, for now only a single apply function can be set per
raft_cluster. No tests were using in any other way. In the future,
there could be custom apply functions per server dynamically assigned,
if this becomes needed.

* alejo/raft-tests-replication-02-v3-30: (66 commits)
  raft: replication test: wait for log for both index and term
  raft: replication test: reset network at construction
  raft: replication test: use lambda visitor for updates
  raft: replication test: move structs into class
  raft: replication test: move data structures to cluster class
  raft: replication test: remove shared pointers
  raft: replication test: move get_states() to raft_cluster
  raft: replication test: test_server inside raft_cluster
  raft: replication test: rpc declarative tests
  raft: replication test: add wait_log
  raft: replication test: add stop and reset server
  raft: replication test: disconnect 2 support
  raft: replication test: explicit node_id naming
  raft: replication test: move definitions up
  raft: replication test: no append entries support
  raft: replication test: fix helper parameter
  raft: replication test: stop servers out of config
  raft: replication test: wait log when removing leader from configuration
  raft: replication test: only manipulate servers in configuration
  raft: replication test: only cancel rearm ticker for removed server
  ...
2021-06-06 19:18:49 +03:00
Avi Kivity
a55b434a2b treewide: extent copyright statements to present day 2021-06-06 19:18:49 +03:00
Alejo Sanchez
3e91a8ca0d raft: replication test: wait for log for both index and term
Waiting on index alone does not guarantee leader correct leader log
propagation. This patch add checking also the term of the leader's last
log entry.

This was exposed with occasional problems with packet drops.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-06-04 08:38:19 -04:00
Gleb Natapov
aa7ea333da raft: document that add entry my throw commit_status_unknown 2021-05-06 11:59:36 +03:00
Alejo Sanchez
0a5c605713 raft: replication test: fix custom election
Use the new specific connectivity to manage old leader disconnection
more specifically.

This fixes having elections where the vote of the old leader is required
for quorum. For example {A,B} and we want to switch leader.  For B to
become candidate it has to see A as down. Then A has to see B's request
for vote, and vote for A.

So to make the general case old leader needs to be first disconnected
from all nodes, make the desired node candidate, then have the old
leader connected only to the desired candidate (else, other nodes would
see the new candidate as disrupting a live leader).

Also, there might be stray messages from the former leader. These could
revert the candidate to follower. To handle this this patch retries
the process until the desired node becomes leader.

The helper function elect_me_leader() is split and renamed to
wait_until_candidate() and wait_election_done(). The former ticks until
the node is a candidate and the later waits until a candidate either
becomes a leader or reverts to follower

The existing etcd test workaround of incrementing from n=2 to n=3 nodes
is corrected back to original n=2.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Gleb Natapov
1f868d516e raft: implement prevoting stage in leader election
This is how PhD explain the need for prevoting stage:

  One downside of Raft's leader election algorithm is that a server that
  has been partitioned from the cluster is likely to cause a disruption
  when it regains connectivity. When a server is partitioned, it will
  not receive heartbeats. It will soon increment its term to start
  an election, although it won't be able to collect enough votes to
  become leader. When the server regains connectivity sometime later, its
  larger term number will propagate to the rest of the cluster (either
  through the server's RequestVote requests or through its AppendEntries
  response). This will force the cluster leader to step down, and a new
  election will have to take place to select a new leader.

  Prevoting stage is addressing that. In the Prevote algorithm, a
  candidate only increments its term if it first learns from a majority of
  the cluster that they would be willing to grant the candidate their votes
  (if the candidate's log is sufficiently up-to-date, and the voters have
  not received heartbeats from a valid leader for at least a baseline
  election timeout).

  The Prevote algorithm solves the issue of a partitioned server disrupting
  the cluster when it rejoins. While a server is partitioned, it won't
  be able to increment its term, since it can't receive permission
  from a majority of the cluster. Then, when it rejoins the cluster, it
  still won't be able to increment its term, since the other servers
  will have been receiving regular heartbeats from the leader. Once the
  server receives a heartbeat from the leader itself, it will return to
  the follower state(in the same term).

In our implementation we have "stable leader" extension that prevents
spurious RequestVote to dispose an active leader, but AppendEntries with
higher term will still do that, so prevoting extension is also required.
2021-03-12 11:09:21 +02:00
Konstantin Osipov
95ee8e1b90 raft: fix spelling
Fix spelling of a few comments.
2021-02-19 22:56:26 +03:00
Konstantin Osipov
51c968bcb4 raft: rename log::non_snapshoted_length() to log::in_memory_size()
The old name was incorrect, in case apply_snapshot() was called with
non-zero trailing entries, the total log length is greater than the
length of the part that is not stored in a snapshot.

Fix spelling in related comments.

Rename fsm::wait() to fsm::wait_max_log_size(), it's a more
specific name. Rename max_log_length to max_log_size to use
'size' rather than 'length' consistently for log size.
2021-02-18 16:04:44 +03:00
Alejo Sanchez
b41a6822e8 raft: drop ticker from raft
Remove ticker callbacks from raft::server.
External code should periodically call raft::server::tick().

Update replication_test accordingly.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-02-14 09:41:42 -04:00
Alejo Sanchez
97338ab53f raft: replication test: fix debug mode hangs
For certain situations where barely enough nodes to elect a new leader
are connected a disruptive candidate can occassionally block the
election.

For example having servers A B C D E and only A B C are active in a
partition. If the test wants to elect A, it has to first make all 3
servers reach election timeout threshold (to make B and C receptive).
Then A is ticked till it becomes a candidate and has to send vote
requests to the other servers.

But all servers have a timer (_ticker) calling their periodic tick()
functions. If one of the other servers, say B, gets its timer tick
before A sends vote requests, B becomes a (disruptive) candidate and
will refuse to vote for A. In our case of only having 3 out of 5 servers
connected a single missing vote can hang the election.

This patch disables timer ticks for all servers when running custom
elections and partitioning.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-02-11 11:42:31 -04:00
Konstantin Osipov
c7b5a60320 raft: joint consensus, wire up configuration changes in the API
Now that we've implemented joint consensus based configuration changes,
replace add_server()/remove_server() with a more general set_configuration().
2021-01-29 22:07:08 +03:00
Tomasz Grabiec
f08a3e3fd8 Merge "raft: test fixes, etcd tests, simplification" from Alejo
This patch set adds etcd unit tests for raft.

It also includes a fix for replication test in debug mode and a
simplification for append_request.

Tests: unit ({dev}), unit ({debug}), unit ({release})

*  https://github.com/alecco/scylla/tree/raft-ale-tests-09b:
  raft: etcd unit tests: test log replication
  raft: boost test etcd: test fsm can vote from any state
  raft: boost test etcd: port TestLeaderElectionOverwriteNewerLogs
  raft: replication test: add etcd test for cycling leaders
  raft: testing: provide primitives to wait for log propagation
  raft: etcd unit tests: initial boost tests
  raft: combine append_request _receive and _send
2021-01-21 10:41:33 +02:00
Pavel Solodovnikov
041072b59f raft: rename storage to persistence
The new naming scheme more clearly communicates to the client of
the raft library that the `persistence` interface implements
persistency layer of the fsm that is powering the raft
protocol itself rather than the client-side workflow and
user-provided `state_machine`.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20201126135114.7933-1-pa.solodovnikov@scylladb.com>
2021-01-20 10:23:43 +02:00
Alejo Sanchez
f627972186 raft: testing: provide primitives to wait for log propagation
For tests to be able to transition in a consistent state, in some cases
it's needed to allow the followers to catch up with the leader.

This prevents occasional hangs in debug mode for incoming tests.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-01-18 12:33:37 -04:00
Alejo Sanchez
d610d5a7b8 raft: expose fsm tick() to server for testing
For tests to advance servers they need to invoke tick().

This is needed to advance free elections.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-11-22 10:32:34 -04:00
Alejo Sanchez
9e7e14fc50 raft: expose is_leader() for testing
Expose fsm leader check to allow tests to find out the leader after an
election.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-11-22 10:32:34 -04:00
Gleb Natapov
8d9b6f588e raft: stop accepting requests on a leader after the log reaches the limit
To prevent the log to take too much memory introduce a mechanism that
limits the log to a certain size. If the size is reached no new log
entries can be submitted until previous entries are committed and
snapshotted.
2020-11-18 19:14:37 +01:00
Gleb Natapov
8bab38c6fa raft: use correct type for node info in add_server() 2020-11-06 17:06:07 +03:00
Gleb Natapov
7fdfa32dbd raft: preserve trailing raft log entries during snapshotting
This patch allows to leave snapshot_trailing amount of entries
when a state machine is snapshotted and raft log entries are dropped.
Those entries can be used to catch up nodes that are slow without
requiring snapshot transfer. The value is part of the configuration
and can be changed.
2020-10-15 11:50:27 +03:00
Gleb Natapov
7c1187b7f5 raft: implement periodic snapshotting of a state machine
The patch implements periodic taking of a snapshot and trimming of
the raft log.

In raft the only way the log of already committed entries can be shorten
is by taking a snapshot of the state machine and dropping log entries
included in the snapshot from the raft log. To not let log to grow too
large the patch takes the snapshot periodically after applying N number
of entries where N can be configured by setting snapshot_threshold
value in raft's configuration.
2020-10-15 11:48:44 +03:00