The Raft PhD presents the following scenario.
When we remove a server from the cluster configuration, it does not
receive the configuration entry which removes it (because the leader
appending this entry uses that entry's configuration to decide to which
servers to send the entry to, and the entry does not contain the removed
server). Therefore the server keeps believing it is a member but does
not receive heartbeats from leaders in the new configuration. Therefore
it will keep becoming a candidate, causing existing leaders to step
down, harming availability. With many such candidates the cluster may
even stop being able to proceed at all. We call such servers
"disruptive".
More concretely, consider the following example, adapted from the PhD for
joint configuration changes (the original PhD considered a different
algorithm which can only add/remove one server at once):
Let C_old = {A, B, C, D}, C_new = {B, C, D}, and C_joint be the joint
configuration (C_old, C_new). D is the leader. D managed to append
C_joint to every server and commit it. D appends C_new. At this point, D
stops sending heartbeats to A because C_new does not contain A, but A's
last entry is still C_joint, so it still has the ability to become a
candidate. A can now become a candidate and cause D, or any other leader
in C_new, to step down. Even if D manages to commit C_new, A can keep
disrupting the cluster until it is shut down.
Prevoting changes the situation, which the authors admit. The "even if"
above no longer applies: if D manages to commit C_new, or just append it
to a majority of C_new, then A won't be able to succeed in the prevote
phase because a majority of servers in C_new has a longer log than A
(and A must obtain a prevote from a majority of servers in C_new because
A is in C_joint which contains C_new). But the authors continue to argue
that disruptions can still occur during the small period where C_new is
only appended on D but not yet on a majority of C_new. As they say:
"we also did not want to assume that a leader will reliably replicate
entries fast enough to move past the scenario (...) quickly; that might
have worked in practice, but it depends on stronger assumptions that we
prefer to avoid about the performance (...) of replicating log entries".
One could probably try debunking this by saying that if entries take
longer to replicate than the election timeout we're in much bigger
trouble, but nevermind.
In any case, the authors propose a solution which we call "sticky
leadership". A server will not grant a vote to a candidate if it has
recently received a heartbeat from the currently known leader, even if
the candidate's term is higher. In the above example, servers in C_new
would not grant votes to A as long as D keeps sending them heartbeats,
thus A is no longer disruptive.
In our case the situation is a bit
different: in original Raft, "heartbeats" have a very specific meaning
- they are append_entries requests (possibly empty) sent by leaders.
Thus if a node stops being a leader it stops sending heartbeats;
similarly, if a node leaves the configuration, it stops receiving
heartbeats from others still in the configuration. We instead use a
"shared failure detector" interface, where nodes may still consider
other nodes alive regardless of their configuration/leadership
situation, as part of the general "MultiRaft" framework.
This pretty much invalidates the original argument, as seen on
the above example: A will still consider D alive, thus it won't become
a candidate.
Shared failure detector combined with sticky leadership actually makes
the situation worse - it may cause cluster unavailability in certain
scenarios (fortunately not a permanent one, it can be solved with server
restarts, for example). Randomized nemesis testing with reconfigurations
found the following scenario:
Let C1 = {A, B, C}, C2 = {A}, C3 = {B, C}. We start from configuration
C1, B is the leader. B commits joint (C1, C2), then new C2
configuration. Note that C does not learn about the last entry
(since it's not part of C2) but it keeps believing that B is alive,
so it keeps believing that B is the leader.
We then partition {A} from {B, C}. A appends (C2, C3) joint
configuration to its log. It's not able to append it to B or C due to
the partition. The partition holds long enough for A to revert to
candidate state (or we may restart A at this point). Eventually the
partition resolves. The only node which can become a candidate now is A:
C does not become a candidate because it keeps believeing that B is the
leader, and B does not become a candidate because it saw the C2
non-joint entry being committed. However, A won't become a leader
because C won't grant it a vote due to the sticky leadership rule.
The cluster will remain unavailable until e.g. C is restarted.
Note that this scenario requires allowing configuration changes which
remove and then readd the same servers to the configuration. One may
wonder if such reconfigurations should be allowed, but there doesn't
seem to be any example of them breaking safety of Raft (and the PhD
doesn't seem to mention them at all; perhaps it implicitly accepts
them). It is unknown whether a similar scenario may be produced without
such reconfigurations.
In any case, disabling sticky leadership resolves the problem, and it is
the last currently known availability problem found in randomized
nemesis testing. There is no reason to keep this extension, both because
the original Raft authors' argument does not apply for shared failure
detector, and because one may even argue with the authors in vanilla
Raft given that prevoting is enabled (see end of third paragraph of this
commit message).
Message-Id: <20210921153741.65084-1-kbraun@scylladb.com>
The AppendReg state machine stores a sequence of integers. It supports
`append` inputs which append a single integer to the sequence and return
the previous state (before appending).
The implementation uses the `append_seq` data structure
representing an immutable sequence that uses a vector underneath
which may be shared by multiple instances of `append_seq`.
Appending to the sequence appends to the underlying vector,
but there is no observable effect on the other instances since
they use only the prefix of the sequence that wasn't changed.
If two instances sharing the same vector try to append,
the later one must perform a copy.
This allows efficient appends if only one instance is appending, which
is useful in the following context:
- a Raft server stores a copy in the underlying state machine replica
and appends to it,
- clients send append operations to the server; the server returns the
state of the sequence before it was appended to,
- thanks to the sharing, we don't need to copy all elements when
returning the sequence to the client, and only one instance (the
server) is appending to the shared vector,
- summarizing, all operations have amortized O(1) complexity.
We use AppendReg instead of ExReg in `basic_generator_test`
with a generator which generates a sequence of append operations with
unique integers.
This implies that the result of every operation uniquely identifies the
operation (since it contains the appended integer, and different
operations use different integers) and all operations that must have
happened before it (since it contains the previous state of the append
register), which allows us to reconstruct the "current state" of the
register according to the results of operations coming from Raft calls,
giving us an on-line serializability checker with O(1) amortized
complexity on each operation completion.
We also enforce linearizability by checking that every
completed operation was previously invoked.
We also perform a simple liveness check at the end of the test by
ensuring that a leader becomes eventually elected and that we can
successfully execute a call.
* kbr/linearizability-v2:
test: raft: randomized_nemesis_test: check consistency and liveness in basic_generator_test
test: raft: randomized_nemesis_test: introduce append register
"This series removes layer violation in compaction, and also
simplifies compaction manager and how it interacts with compaction
procedure."
* 'compaction_manager_layer_violation_fix/v3' of github.com:raphaelsc/scylla:
compaction: split compaction info and data for control
compaction_manager: use task when stopping a given compaction type
compaction: remove start_size and end_size from compaction_info
compaction_manager: introduce helpers for task
compaction_manager: introduce explicit ctor for task
compaction: kill sstables field in compaction_info
compaction: kill table pointer in compaction_info
compaction: simplify procedure to stop ongoing compactions
compaction: move management of compaction_info to compaction_manager
compaction: move output run id from compaction_info into task
compaction_info must only contain info data to be exported to the
outside world, whereas compaction_data will contain data for
controlling compaction behavior and stats which change as
compaction progresses.
This separation makes the interface clearer, also allowing for
future improvements like removing direct references to table
in compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
compaction_info will eventually only be used for exporting data about
ongoing compactions, so task must be used instead.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
those stats aren't used in compaction stats API and therefore they
can be removed. end_size is added to compaction_result (needed for
updating history) and start_size can be calculated in advance.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Today, compactions are tracked by both _compactions and _tasks,
where _compactions refer to actual ongoing compaction tasks,
whereas _tasks refer to manager tasks which is responsible for
spawning new compactions, retry them on failure, etc.
As each task can only have one ongoing compaction at a time,
let's move compaction into task, such that manager won't have to
look at both when deciding to do something like stopping a task.
So stopping a task becomes simpler, and duplication is naturally
gone.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Today, compaction is calling compaction manager to register / deregister
the compaction_info created by it.
This is a layer violation because manager sits one layer above
compaction, so manager should be responsible for managing compaction
info.
From now on, compaction_info will be created and managed by
compaction_manager. compaction will only have a reference to info,
which it can use to update the world about compaction progress.
This will allow compaction_manager to be simplified as info can be
coupled with its respective task, allowing duplication to be removed
and layer violation to be fixed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
this run id is used to track partial runs that are being written to.
let's move it from info into task, as this is not an external info,
but rather one that belongs to compaction_manager.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The previous commit already relaxed the condition for test_fib,
but the same should be done for test_fib_called_on_null
for an identical reason - more than 1 error can be expected
in the case of calling heavily recursive function, and either
fuel exhaustion, or hitting the stack limit, or any other
InvalidRequest exception should be accepted.
Closes#9363
Add new struct to the `expression` variant:
```c++
// A value serialized with the internal (latest) cql_serialization_format
struct constant {
cql3::raw_value value;
data_type type; // Never nullptr, for NULL and UNSET might be empty_type
};
```
and use it where possible instead of `terminal`.
This struct will eventually replace all classes deriving from
`terminal`, but for now `terminal` can't be removed completely.
We can't get rid of terminal yet, because sometimes `terminal` is
converted back to `term`, which `constant` can't do. This won't be a
problem once we replace term with expression.
`bool` is removed from `expression`, now `constant` is used instead.
This is a redesign of PR #9203, there is some discussion about the
chosen representation there.
Closes#9371
* github.com:scylladb/scylla:
cql3: term: Remove get_elements and multi_item_terminal from terminals
cql3: Replace most uses of terminal with expr::constant
cql3: expr: Remove repetition from expr::get_elements
cql3: expr: Add expr::get_elements(constant)
cql3: term: remove term::bind_and_get
cql3: Replace all uses of bind_and_get with evaluate_to_raw_view
cql3: expr: Add evaluate_IN_list
cql3: tuples: Implement tuples::in_value::get
cql3: Move data_type to terminal, make get_value_type non-virtual
cql3: user_types: Implement get_value_type in user_types.hh
cql3: tuples: Implement get_value_type in tuples.hh
cql3: maps: Implement get_value_type in maps.hh
cql3: sets: Implement get_value_type in sets.hh
cql3: lists: Implement get_value_type in lists.hh
cql3: constants: Implement get_value_type in constants.hh
cql3: expr: Add expr::evaluate
cql3: Make collection term get() use the internal serialization format
cql3: values: Add unset value to raw_value_view::make_temporary
cql3: expr: Add constant to expression
"
The main challenge here is to move messaging_service.start_listen()
call from out of gossiper into main. Other changes are pretty minor
compared to that and include
- patch gossiper API towards a standard start-shutdown-stop form
- gossiping "sharder info" in initial state
- configure cluster name and seeds via gossip_config
tests: unit(dev)
dtest.bootstrap_test.start_stop_test_node(dev)
manual(dev): start+stop, nodetool enable-/disablegossip
refs: #2737
refs: #2795
refs: #5489
"
* 'br-gossiper-dont-start-messaging-listen-2' of https://github.com/xemul/scylla:
code: Expell gossiper.hh from other headers
storage_service: Gossip "sharder" in initial states
gossiper: Relax set_seeds()
gossiper, main: Turn init_gossiper into get_seeds_from_config
storage_service: Eliminate the do-bind argument from everywhere
gossiper: Drop ms-registered manipulations
messaging, main, gossiper: Move listening start into main
gossiper: Do handlers reg/unreg from start/stop
gossiper: Split (un)init_messaging_handler()
gossiper: Relocate stop_gossiping() into .stop()
gossiper: Introduce .shutdown() and use where appropriate
gossiper: Set cluster_name via gossip_config
gossiper, main: Straighten start/stop
tests/cql_test_env: Open-code tst_init_ms_fd_gossiper
tests/cql_test_env: De-global most of gossiper
gossiper: Merge start_gossiping() overloads into one
gossiper: Use is_... helpers
gossiper: Fix do_shadow_round comment
gossiper: Dispose dead code
Use AppendReg instead of ExReg for the state machine.
Use a generator which generates a sequence of append operations with
unique integers.
This implies that the result of every operation uniquely identifies the
operation (since it contains the appended integer, and different
operations use different integers) and all operations that must have
happened before it (since it contains the previous state of the append
register), which allows us to reconstruct the "current state" of the
register according to the results of operations coming from Raft calls,
giving us an on-line linearizability checker with O(1) amortized
complexity on each operation completion.
We also perform a simple liveness check at the end of the test by
ensuring that a leader becomes eventually elected and that we can
successfully execute a call.
* seastar c04a12edbd...e6db0cd587 (13):
> Merge "Add kernel stack trace reporting for stalls" from Avi
Ref #8828
> Merge "Keep XFS' dioattr cached" from Pavel E
> coroutines: de-template maybe_yield()
> sharded: Add const versions of map_reduce's
> apps/io_tester: remove unused lambda capture
> doc: exclude seastar::coroutine::internal namespace
> deprecate unaligned_cast<> from unaligned.hh
> reactor: adjust max_networking_aio_io_control_blocks to lower size when fs.aio-max-nr is small
> build: clarify choice of C++ dialect, and change default to C++20
> coding_style: update concepts style to snake_case
> Merge "Teach io_tester to submit requests-per-second flow" from Pavel E
> cmake: find and link against Boost::filesystem
> coroutine: add maybe_yield
We know (verified by existing tests) that null keys are not allowed -
neither as partition keys nor clustering keys.
In issue #9352 a question was raised of whether an *empty string* is
allowed as as a key on a base table (not a materialized view or index).
The following tests confirm that the current situation is as follows:
1. An empty string is perfectly legal as a clustering key.
2. An empty string is NOT ALLOWED as a partition key - the error
"Key may not be empty" is reported if this is attempted.
3. If the partition key is compound (multiple partition-key columns)
then any or all of them may be empty strings.
These tests pass the same on both Cassandra and Scylla, showing that
this bizarre (and undocumented) behavior is identical in both.
Refs #9352.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210922131310.293846-1-nyh@scylladb.com>
"
There's a whole lot of places that create an sstable for tests
like this
auto sst = env.make_sstable(...);
sst->write_components(...);
sst->load();
Some of them are already generalized with the make_sstable_easy
helper, but there are several instances of them.
Found while hunting down the places that use default IO sched
class behind the scenes.
tests: unit(dev)
"
* 'br-sst-tests-make-sstable-easy' of https://github.com/xemul/scylla:
test: Generalize make_sstable() and make_sstable_easy()
test: Use now existing helpers elsewhere
test: Generalize all make_sstable_easy()-s
test: Set test change estimation to 1
test: Generalize make_sstable_easy in mutation tests
test: Generalize make_sstable_easy in set tests
test: Reuse make_sstable_easy in datafile tests
test: Relax make_sstable_easy in compaction tests
When a string column is indexed with a secondary index, the empty value
for this column (an empty string '') is perfectly legal, and should be
indexed as well. This is not the same as an unset (null) value which
isn't indexed.
The following test demonstrates that this case works in Cassandra, but
does not in Scylla (so the test is marked "xfail"). In Scylla, a query
that returns the expected results with ALLOW FILTERING suddenly returns
a different (and wrong) result when an index is added on the table.
This test reproduces issue #9364.
Refs #9364.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210922121510.291826-1-nyh@scylladb.com>
"
Backlog tracker isn't updated correctly when facing a schema change, and
may leak a SSTable if compaction strategy is changed, which causes
backlog to be computed incorrectly. Most of these problems happen because
sstable set and tracker are updated independently, so it could happen
that tracker lose track (pun intended) of changes applied to set.
The first patch will fix the leak when strategy is changed, and the third
patch will make sure that tracker is updated atomically with sstable set,
so these kind of problems will not happen anymore.
Fixes#9157
test: mode(debug)
"
* 'fixes_to_backlog_tracker_v3' of https://github.com/raphaelsc/scylla:
compaction: Update backlog tracker correctly when schema is updated
compaction: Don't leak backlog of input sstable when compaction strategy is changed
compaction: introduce compaction_read_monitor_generator::remove_exhausted_sstables()
compaction: simplify removal of monitors
Scylla and Cassandra do not allow an empty string as a partition key,
but a materialized view might "convert" a regular string column into a
partition key, and an empty string is a perfectly valid value for this
column. This can result in a view row which has an empty string as a
partition key. This case works in Cassandra, but doesn't in Scylla (the
row with the empty string as a partition key doesn't appear). The
following test demonstrates this difference between Scylla and Cassandra
(it passes on Cassandra, fails on Scylla, and accordingly marked
"xfail").
Refs #9375.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210922115000.290387-1-nyh@scylladb.com>
Recently we had to say goodbye to our dear friend Pekka.
He orphaned a few subsystems that can't call for his help in code
reviews anymore.
This patch makes sure no one will bother Pekka in his afterlife.
It also cleanups HACKING.md a little bit by removing Pekka and Duarte
from the maintainer/reviewer lists.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <98ba1aed9ee8a87b9037b5032b82abc5bfddbd66.1632301309.git.piotr@scylladb.com>
A variant of make_reversed() which goes through the schema registry,
teaching the schema to the registry if necessary. This effectively
caches the result of the reversing and as an added bonus double
reversing yields the very same schema C++ object that was the starting
point.
Closes#9365
The AppendReg state machine stores a sequence of integers. It supports
`append` inputs which append a single integer to the sequence and return
the previous state (before appending).
The implementation uses the `append_seq` data structure
representing an immutable sequence that uses a vector underneath
which may be shared by multiple instances of `append_seq`.
Appending to the sequence appends to the underlying vector,
but there is no observable effect on the other instances since
they use only the prefix of the sequence that wasn't changed.
If two instances sharing the same vector try to append,
the later one must perform a copy.
This allows efficient appends if only one instance is appending, which
is useful in the following context:
- a Raft server stores a copy in the underlying state machine replica
and appends to it,
- clients send append operations to the server; the server returns the
state of the sequence before it was appended to,
- thanks to the sharing, we don't need to copy all elements when
returning the sequence to the client, and only one instance (the
server) is appending to the shared vector,
- summarizing, all operations have amortized O(1) complexity.
Add the content of the compaction stats introduced in the previous patch
to the tracing data. This will help diagnose query performance related
problems caused by tombstones.
Add the content of the compaction stats introduced in the previous patch
to the tracing data. This will help diagnose query performance related
problems caused by tombstones.
This needs to add forward declarations of the gossiper class and
re-include some other headers here and there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Right now the number of shards and ignore-msb-bits are gossiped
with a separate call. It's simpler to include this data into
the initial gossiping state.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's much shorter and simpler to pass the seeds, obtained from the
config, into gossiper via gossip_config rahter than with the help
of a special call.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Looking into init_gossiper() helper makes it clear that what it does
is gets seeds, provider and listen_address from config and generates
a set of seeds for the gossiper. Then calls gossiper.set_seeds().
This patch renames the helper into get_seeds_from_config(), removes
all but db::config& argunebts from it and moves the call to set_seed()
into main.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The same as in previous patch -- the gossiper doesn't need to know
if it should call messaging.start_listen() or not, neither should
do the storage_service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Before preparing the cluster join process the messaging should be
put into listening state. Right now it's done "on-demand" by the
call to the do_shadow_round(), also there's a safety call in the
start_gossiping(). Tests, however, should not start listening, so
the do_bind boolean exists and is passed all the way around.
Make the main() code explicitly call the messaging.start_listen()
and leave tests without it. This change makes messaging start
listening a bit earlier, but in between these old and new places
there's nothing that needs messaging to stay deaf.
As the do_bind becomes useless, the wait_for_gossip_to_settle() is
also moved into main.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
On start handlers can be registered any time before the messaging
starts to listen. On stop handlers can remain registered any long,
since the messaging service stops early in drain_on_shutdown().
One tricky place is API start_/stop_gossiping(). The latter calls
gossiper::stop() thus unregistering the handlers. So to make the
start_gossiping() work it must call gossiper::start() in advance.
Overall the gossiper start/stop becomes this:
gossiper.start()
`- registers handlers
gossiper.start_gossiping()
`- // starts gossiping
gossiper.shutdown()
`- // stops gossiping
gossiper.stop()
`- calls shutdown() // re-entrable
`- unregisters handlers
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The helper in question is called in two places:
1. In main() as a fuse against early exception before creating the
drain_on_shutdown() defer
2. In the stop_gossiping() API call
Both can be replaced with the stop_gossiping() call from the .stop()
method, here's why:
1. In main the gossiper::stop() call is already deferred right after
the gossiper is started. So this change moves it above. It may
happen that an exception pops up before the old fuse was deferred,
but that's OK -- the stop_gossiping() is safe against early- and
re- entrances
2. The stop_gossiping() change is effectlvey a rename -- it calls the
stop_gossiping() as it did before, but with the help of the .stop()
method
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The start/stop sequence we're moving towards assumes a shutdown (or
drain) method that will be called early on stop to notify the service
that the system is going down so it could prepare.
For gossiper it already means calling stop_gossiping() on the shard-0
instance. So by and large this patch renames a few stop_gossiping()
calls into .shutdown() ones.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>