25698 Commits

Author SHA1 Message Date
Pavel Solodovnikov
2d9e94f050 raft: update README.md with info on RPC server address mappings
Describe the high-level scheme of managing RPC mappings and
also expand on the introduction of "expirable" RPC mappings concept
and why these are needed.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 22:47:13 +03:00
Pavel Solodovnikov
f61206e483 raft: wire up rpc::add_server and rpc::remove_server for configuration changes
A Raft instance needs to update the RPC subsystem on changes in
configuration, so that RPC can deliver messages to the new nodes
in the configuration, as well as dispose of the old nodes,
i.e. the nodes which are no longer part of the most recent
configuration.

The effective scope of RPC mappings is limited by the piece of
code which sends messages to both the "new" nodes (which
are added to the cluster with the most recent configuration
change) and the "old" nodes which are removed from the cluster.

Until the messages are successfully delivered to at least
the majority of "old" nodes and we have heard back from them,
the mappings should be kept intact. After that point the RPC
mappings for the removed nodes are no longer of interest
and thus can be immediately disposed.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 22:47:09 +03:00
Pavel Solodovnikov
16d9e8e9af raft/fsm: add optional rpc_configuration field to fsm_output
The field is set in `fsm.get_output` whenever
`_log.last_conf_idx()` or the term changes.

Also, add `_last_conf_idx` and `_last_term` to
`fsm::last_observed_state`; they are used
in the condition that decides whether to evaluate the current rpc
configuration in `fsm.get_output()`.
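
The change-detection condition described above can be sketched as follows. This is a simplified illustration, not the actual `fsm` code: `fsm_sketch`, `poll_rpc_configuration` and the integer-based types are hypothetical stand-ins.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <set>

// Hypothetical, simplified stand-ins for the raft types named above.
using server_id = std::uint64_t;
using rpc_config = std::set<server_id>;

struct last_observed_state {
    std::uint64_t last_conf_idx = 0;
    std::uint64_t last_term = 0;
};

struct fsm_sketch {
    std::uint64_t conf_idx = 0;   // stands in for _log.last_conf_idx()
    std::uint64_t term = 0;       // current term
    rpc_config members;           // servers RPC should be able to reach
    last_observed_state observed;

    // Returns the RPC configuration only if the configuration index or the
    // term changed since the previous call; otherwise returns std::nullopt.
    std::optional<rpc_config> poll_rpc_configuration() {
        assert(term >= observed.last_term); // Raft terms never decrease
        if (conf_idx == observed.last_conf_idx && term == observed.last_term) {
            return std::nullopt;
        }
        observed = last_observed_state{conf_idx, term};
        return members;
    }
};
```

A caller would treat an empty result as "nothing to update" and only touch the RPC address map when a value is present.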

This will be used later to update rpc config state
stored in `server_impl` and maintain rpc address map.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 22:47:05 +03:00
Pavel Solodovnikov
19cc85b3b6 raft: maintain current rpc context in server_impl
Introduce rpc server_address that represents the
last observed state of address mappings
for RPC module.

It does not correspond to any kind of configuration
in the raft sense, just an artificial construct
corresponding to the largest set of server
addresses coming from both previous and current
raft configurations (to be able to contact both
joining and leaving servers).
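
As a rough illustration of "the largest set of server addresses", the sketch below takes the union of the previous and current configurations. The `server_address` alias and the function name are assumptions made for the example, not the types used in the patch.

```cpp
#include <set>
#include <string>

// Hypothetical stand-in for a server address; the real type is not shown here.
using server_address = std::string;
using address_set = std::set<server_address>;

// Builds the set of addresses RPC must keep reachable during a configuration
// change: every server in the previous or the current configuration.
address_set rpc_reachable_set(const address_set& previous, const address_set& current) {
    address_set result = previous;
    result.insert(current.begin(), current.end());
    return result;
}
```

For a change from {A, B, C} to {C, D, E} this yields {A, B, C, D, E}, so both leaving and joining servers stay addressable.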

This will be used later to update rpc module mappings
when cluster configuration changes.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 20:22:44 +03:00
Pavel Solodovnikov
8799ccbab0 raft: use .contains instead of .count for std::set in raft::configuration::diff
`std::unordered_set::contains` is introduced in C++20 and provides
clearer semantics to check existence of a given element in a set.
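
A small sketch of the difference in intent. Both `std::set` and `std::unordered_set` gained `contains` in C++20; the helper name `is_member` and the pre-C++20 fallback are only for illustration.

```cpp
#include <string>
#include <unordered_set>

// Membership test written both ways. contains() (C++20) states the intent
// directly; count() returns a size that has to be compared against zero.
bool is_member(const std::unordered_set<std::string>& s, const std::string& id) {
#if __cplusplus >= 202002L
    return s.contains(id);
#else
    return s.count(id) != 0; // fallback for pre-C++20 compilers
#endif
}
```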

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 20:22:44 +03:00
Pavel Solodovnikov
7c229998e8 raft: unit-tests for raft_address_map
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 20:22:44 +03:00
Pavel Solodovnikov
3c4d46728d raft: support expiring server address mappings for rpc module
This patch introduces `raft_address_map` class to abstract
the notion of expirable address mappings for a raft rpc module.

In Raft an instance may need to communicate with a peer outside
its current configuration. This may happen, e.g., when a follower
falls out of sync with the majority and then a configuration is
changed and a leader not present in the old configuration is elected.

The solution is to introduce the concept of "expirable" updates to
the RPC subsystem.

When RPC receives a message from an unknown peer, it also adds the
return address of the peer to the address map with a TTL. Should
we need to respond to the peer, its address will be known.

An outgoing communication to an unconfigured peer is impossible.
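
The idea of expirable mappings can be sketched with a small map that stores an optional expiry time per entry. This is not the `raft_address_map` class from the patch: the class name, the method names, the explicit `now` parameter and the rule that an expiring update never overrides a permanent mapping are all assumptions made for the example.

```cpp
#include <chrono>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical sketch: time is passed in explicitly so the behaviour is
// deterministic and easy to test.
using server_id = std::uint64_t;
using time_point = std::chrono::steady_clock::time_point;

class expiring_address_map {
    struct entry {
        std::string address;
        std::optional<time_point> expires_at; // nullopt: never expires
    };
    std::unordered_map<server_id, entry> _map;
    std::chrono::seconds _ttl;
public:
    explicit expiring_address_map(std::chrono::seconds ttl) : _ttl(ttl) {}

    // Mapping for a server in the current configuration; kept until replaced.
    void set_permanent(server_id id, std::string address) {
        _map[id] = entry{std::move(address), std::nullopt};
    }

    // Mapping learned from an incoming message; it expires after the TTL and
    // never overrides a permanent mapping (assumed policy).
    void set_expiring(server_id id, std::string address, time_point now) {
        auto it = _map.find(id);
        if (it != _map.end() && !it->second.expires_at) {
            return;
        }
        _map[id] = entry{std::move(address), now + _ttl};
    }

    // Returns the address if it is known and not yet expired at `now`.
    std::optional<std::string> find(server_id id, time_point now) {
        auto it = _map.find(id);
        if (it == _map.end()) {
            return std::nullopt;
        }
        if (it->second.expires_at && *it->second.expires_at <= now) {
            _map.erase(it); // drop the stale entry lazily
            return std::nullopt;
        }
        return it->second.address;
    }
};
```

With a 10 second TTL, an address learned from an incoming message can be used for replies for up to 10 seconds, after which `find` reports the peer as unknown again.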

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-03-26 20:22:44 +03:00
Tomasz Grabiec
ef06a939c4 Merge "raft: seven etcd unit tests ported" from Alejo
Seven etcd unit tests as boost tests.

* alejo/raft-tests-etcd-08-v4-communicate-v5:
  raft: etcd unit tests: test proposal handling scenarios
  raft: etcd unit tests: test old messages ignored
  raft: etcd unit tests: test single node precandidate
  raft: etcd unit tests: test dueling precandidates
  raft: etcd unit tests: test dueling candidates
  raft: etcd unit tests: test cannot commit without new term
  raft: etcd unit tests: test single node commit
  raft: etcd unit tests: update test_leader_election_overwrite_newer_logs
  raft: etcd unit tests: fix test_progress_leader
  raft: testing: log comparison helper functions
  raft: testing: helper to make fsm candidate
  raft: testing: expose log for test verification
  raft: testing: use server_address_set
  raft: testing: add prevote configuration
  raft: testing: make become_follower() available for tests
2021-03-25 20:27:07 +01:00
Alejo Sanchez
ace0ee514f raft: etcd unit tests: test proposal handling scenarios
TestProposal
For multiple scenarios, check proposal handling.

Note, instead of expecting an explicit result for each specified case,
the test automatically checks for expected behavior when quorum is
reached or not.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
77163ea76a raft: etcd unit tests: test old messages ignored
TestOldMessages
Checks an append request from a leader from a previous term is ignored.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
bf65b19803 raft: etcd unit tests: test single node precandidate
TestSingleNodePreCandidate
Checks that a single node configuration with precandidate enabled
automatically elects the node.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
de7051467b raft: etcd unit tests: test dueling precandidates
TestDuelingPreCandidates
In a configuration of 3 nodes, two nodes don't see each other and they
compete for leadership. Loser (3) should revert to follower when prevote
is rejected and revert to term 1.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
aa7d23f86b raft: etcd unit tests: test dueling candidates
TestDuelingCandidates
In a configuration of 3 nodes, two nodes don't see each other and they
compete for leadership. Once reconnected, loser should not disrupt.

But note it will remain candidate with current algorithm without
prevoting and other fsms will not bump term.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
1eac94e7d6 raft: etcd unit tests: test cannot commit without new term
TestCannotCommitWithoutNewTermEntry tests that entries cannot be
committed when the leader changes, no new proposal comes in and the
ChangeTerm proposal is filtered.

NOTE: this doesn't check committed but it's implicit for next round;
      this could also use communicate() providing committed output map

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
b421fe3605 raft: etcd unit tests: test single node commit
Port etcd TestSingleNodeCommit

In a single node configuration elect the node, add 2 entries and check
number of committed entries.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
9b4538476b raft: etcd unit tests: update test_leader_election_overwrite_newer_logs
Make test_leader_election_overwrite_newer_logs use newer communicate()
and other new helpers.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:29 -04:00
Alejo Sanchez
368eec1190 raft: etcd unit tests: fix test_progress_leader
Make the implementation follow the original test more closely.
Use newer boost test helpers.

NOTE: in etcd it seems a leader's self progress is in PIPELINE state.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:28 -04:00
Alejo Sanchez
ba29970e29 raft: testing: log comparison helper functions
Two helper functions to compare logs. For now only index, term, and data
type are used. Data content comparison does not seem to be necessary for now.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:28 -04:00
Alejo Sanchez
aeab4cf4a9 raft: testing: helper to make fsm candidate
Current election_timeout() helper might bump the term twice.
It's convenient and less error prone to have a more fine grained helper
that stops right when candidate state is reached.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:04:19 -04:00
Alejo Sanchez
7a6616f1cb raft: testing: expose log for test verification
Let derived classes access the log to verify its contents.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:03:46 -04:00
Alejo Sanchez
05b1f57e67 raft: testing: use server_address_set
Use server_address_set in local namespace for brevity.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:01:12 -04:00
Alejo Sanchez
9d0a7d8ccf raft: testing: add prevote configuration
Provide a generic prevote configuration for tests.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-25 15:00:28 -04:00
Dejan Mircevski
b2a04985f7 cql-pytest: Drop needless INSERT in test_null
One INSERT statement was unnecessary for the test, so delete it.
Another was necessary, so explain it.

Tests: cql-pytest/test_null on both Scylla and Cassandra

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>

Closes #8304
2021-03-25 16:37:00 +01:00
Tomasz Grabiec
7b30d31d77 Merge "raft: test configuration changes" from Kostja
Test raft configuration changes:
a node with empty configuration, transitioning
to an entirely different cluster, transitioning
in presence of down nodes, leader change during
configuration change, stray replies, etc.

* scylla-dev/raft-empty-confchange-v5: (21 commits)
  raft: (testing) stray replies from removed followers
  raft: always return a non-zero configuration index from the log
  raft: (testing) leader change during configuration change
  raft: (testing) test confchange {ABCDE} -> {ABCDEFG}
  raft: (testing) test confchange {ABCDEF} -> {ABCGH}
  raft: (testing) test confchange {ABC} -> {CDE}
  raft: (testing) test confchange {AB} -> {CD}
  raft: (testing) test confchange {A} -> {B}
  raft: (testing) test a server with empty configuration
  raft: (testing) introduce testing utilities
  raft: (testing) simplify id allocation in test
  raft: (testing) add select_leader() helper
  raft: (testing) introduce communicate() helper
  raft: (testing) style cleanup in raft_fsm_test
  raft: (testing) fix bug in election_threshold
  raft: minor style changes & comments
  raft: do not assert when transitioning to empty config
  raft: assert we never apply a snapshot over uncommitted entries (leader)
  raft: improve tracing
  raft: add fsm_output::empty() helper to aid testing
  ...
2021-03-25 14:01:09 +01:00
Avi Kivity
46185d7d82 Update tools/jmx submodule
* tools/jmx 9c687b5...440313e (1):
  > storage_service: Add a generic toppartitions endpoint
2021-03-25 12:36:10 +02:00
Alejo Sanchez
7e6807e8fc raft: testing: make become_follower() available for tests
Some etcd tests need to force a follower with a specific leader.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-24 19:11:09 -04:00
Piotr Wojtczak
c1daf2bb24 column_family: Make toppartitions queries more generic
Right now toppartitions can only be invoked on one column family at a time.
This change introduces a natural extension to this functionality,
allowing a list of column families to be specified.

We provide three ways for filtering in the query parameter "name_list":
    1. A specific column family to include in the form "ks:cf"
    2. A keyspace, telling the server to include all column families in it.
       Specified by omitting the cf name, i.e. "ks:"
    3. All column families, which is represented by an empty list
The list can include any number of entries of forms 1. and 2., in any combination.
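
The three forms above could be parsed roughly as follows. This is a sketch only: `table_filter`, `parse_name_list` and the decision to reject entries without a ':' or with an empty keyspace are assumptions, not the code from this change.

```cpp
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical representation of one "name_list" entry.
struct table_filter {
    std::string keyspace;
    std::optional<std::string> column_family; // nullopt: whole keyspace
};

// Parses entries of the form "ks:cf" or "ks:". An empty input list yields an
// empty filter list, which the caller would treat as "all column families".
// Entries without a ':' or with an empty keyspace are rejected (assumed policy).
std::optional<std::vector<table_filter>> parse_name_list(const std::vector<std::string>& names) {
    std::vector<table_filter> filters;
    for (const auto& name : names) {
        auto colon = name.find(':');
        if (colon == std::string::npos || colon == 0) {
            return std::nullopt;
        }
        table_filter f{name.substr(0, colon), std::nullopt};
        if (colon + 1 < name.size()) {
            f.column_family = name.substr(colon + 1);
        }
        filters.push_back(std::move(f));
    }
    return filters;
}
```

For example, the list {"ks1:cf1", "ks2:"} selects one table in ks1 plus every table in ks2.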

Fixes #4520

Closes #7864
2021-03-24 17:54:05 +02:00
Raphael S. Carvalho
bcbb39999b LCS: Fix terrible write amplification when reshaping level 0
LCS reshape is basically 'major compacting' level 0 until it contains fewer than
N sstables.

That produces terrible write amplification, because any given byte will be
compacted (initial # of sstables / max_threshold (32)) times. So if L0 initially
contained 256 ssts, there would be a WA of about 8.
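
The estimate above is a simple ratio, sketched here for concreteness. The function name is hypothetical and the formula is the rough approximation given in this message, not a measurement.

```cpp
#include <cassert>

// Rough estimate from the commit message: each byte is rewritten about
// (initial sstable count / max_threshold) times while reshaping level 0.
double reshape_write_amplification(unsigned initial_sstables, unsigned max_threshold) {
    assert(max_threshold > 0);
    return static_cast<double>(initial_sstables) / max_threshold;
}
```

With 256 initial sstables and max_threshold of 32, the function returns 256 / 32 = 8.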

This terrible write amplification can be reduced by performing STCS instead on
L0, which will leave L0 in a good shape without hurting WA as it happens
now.

Fixes #8345.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210322150655.27011-1-raphaelsc@scylladb.com>
2021-03-24 17:48:50 +02:00
Piotr Sarna
24a43681b4 thrift: handle gate closed exception on retry
During the retry mechanism, it's possible to encounter a gate
closed exception, which should simply be ignored, because
it indicates that the server is shutting down.

Closes #8337
2021-03-24 17:41:58 +02:00
Konstantin Osipov
1a1d7ab662 raft: (testing) stray replies from removed followers
2021-03-24 14:05:55 +03:00
Konstantin Osipov
0295163f6f raft: always return a non-zero configuration index from the log
Return snapshot index for last configuration index if there
is no configuration in the log.
2021-03-24 14:05:55 +03:00
Konstantin Osipov
cec59e53ef raft: (testing) leader change during configuration change
2021-03-24 14:05:36 +03:00
Pavel Emelyanov
37bec6fb76 commitlog: Open files with append_is_unlikely
This open option tells seastar that the file in question
will be truncated to the needed size at once and all
the subsequent writes will happen within this size. This
hint turns off the append optimization in seastar, which is not
that cheap, and helps to save a few CPU cycles.

The option was introduced in seastar by 8bec57bc.

tests: unit(dev), dtest(commitlog:
                        test_batch_commitlog,
                        test_periodic_commitlog,
                        test_commitlog_replay_on_startup)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210323115409.31215-1-xemul@scylladb.com>
2021-03-24 13:05:33 +02:00
Konstantin Osipov
a203c8833f raft: (testing) test confchange {ABCDE} -> {ABCDEFG}
2021-03-24 14:04:18 +03:00
Konstantin Osipov
40e117d36e raft: (testing) test confchange {ABCDEF} -> {ABCGH}
2021-03-24 14:04:18 +03:00
Konstantin Osipov
14b2d5d308 raft: (testing) test confchange {ABC} -> {CDE}
Test leader change during configuration change.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
3c718a175e raft: (testing) test confchange {AB} -> {CD}
2021-03-24 14:04:18 +03:00
Konstantin Osipov
2e30c8540e raft: (testing) test confchange {A} -> {B}
Test non-restart and leader restart scenario.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
e23da06fef raft: (testing) test a server with empty configuration
Try becoming a candidate for such server, or adding it
to an existing configuration.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
b18599c630 raft: (testing) introduce testing utilities
Add a discrete_failure_detector, to be able
to mark a single server dead.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
8d26d24370 raft: (testing) simplify id allocation in test
2021-03-24 14:04:18 +03:00
Konstantin Osipov
322a15ec33 raft: (testing) add select_leader() helper
With leader stepdown extension, leadership transfer can happen
to any follower with long enough log. Add a helper to select that
follower from a list.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
4a00da276d raft: (testing) introduce communicate() helper
Allow communication between an arbitrary number of FSMs. Drop
messages to FSMs which are not in the argument list.
Stop communication upon predicate.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
7182323ac0 raft: (testing) style cleanup in raft_fsm_test
1) Avoid memory violations on test failure
2) Print better diagnostics on failure (BOOST_CHECK_EQUAL vs
   BOOST_CHECK)
2021-03-24 14:04:18 +03:00
Konstantin Osipov
f0f25bf7fb raft: (testing) fix bug in election_threshold
election_threshold was ticking one extra tick,
causing the follower to become candidate in some cases.
This was rendering tests unstable.
2021-03-24 14:04:18 +03:00
Konstantin Osipov
00d7379bc9 raft: minor style changes & comments
Add comments explaining the rationale from transfer_leadership()
(more PhD quotes), encapsulate stable leader check in tick()
into a lambda and add more detailed comments to it.
2021-03-24 14:04:18 +03:00
Piotr Sarna
06131e21a3 configure.py: add customizing clang inline threshold
Until clang figures things out with the now infamous
`-mllvm -inline-threshold X` parameter, let's allow customizing
it to make the compilation of release builds less tiresome.
For instance, scylla's row_level.o object file currently does not compile
for me until I decrease the inline threshold to a low value (e.g. 50).

Message-Id: <54113db9438e3c3371410996f49b7fbe9a1b7257.1616422536.git.sarna@scylladb.com>
2021-03-24 12:09:26 +02:00
Tomasz Grabiec
9272e74e8c sstable: writer: ka/la: Write row marker cell after row tombstone
Row marker has a cell name which sorts after the row tombstone's start
bound. The old code was writing the marker first, then the row
tombstone, which is incorrect.

This was harmless to our sstable reader, which recognized both as
belonging to the current clustering row fragment, and collects both
fine.

However, if both atoms trigger creation of promoted index blocks, the
writer will create a promoted index with entries which violate the cell
name ordering. This is very unlikely to happen in practice: to
trigger promoted index entries for both atoms, the clustering key
would have to be so large that the size of the marker cell exceeds the
desired promoted index block size, which is 64KB by default (but
user-controlled via column_index_size_in_kb option). 64KB is also the
limit on clustering key size accepted by the system.

This was caught by one of our unit tests:

  sstable_conforms_to_mutation_source_test

...which runs a battery of mutation reader tests with various
desired promoted index block sizes, including the target size of 1
byte, which triggers an entry for every atom.

The test started to fail for some random seeds after commit ecb6abe
inside the
test_streamed_mutation_forwarding_is_consistent_with_slicing test
case, reporting a mutation mismatch in the following line:

    assert_that(*sliced_m).is_equal_to(*fwd_m, slice_with_ranges.row_ranges(*m.schema(), m.key()));

It compares mutations read from the same sstable using different
methods: slicing using clustering key restrictions, and fast
forwarding. The reported mismatch was that fwd_m contained the row
marker, but sliced_m did not. The sstable does contain the marker, so
both reads should return it.

After reverting the commit which introduced dynamic adjustments, the
test passes, but both mutations are missing the marker, both are
wrong!

They are wrong because the promoted index contains entries whose
starting positions violate the ordering, so binary search gets confused
and selects the row tombstone's position, which is emitted after the
marker, thus skipping over the row marker.
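
A toy model of this failure, using plain integers as stand-ins for block start positions (the real promoted index stores clustering positions, and `first_block_to_read` is a hypothetical name). The search picks the last block whose start is not greater than the target and reading proceeds from that block onward.

```cpp
#include <cstddef>
#include <vector>

// Returns the index of the last block whose start position is <= target,
// assuming `starts` is sorted. Reading then begins at that block.
std::size_t first_block_to_read(const std::vector<int>& starts, int target) {
    std::size_t lo = 0;
    std::size_t hi = starts.size();
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (starts[mid] <= target) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return lo == 0 ? 0 : lo - 1;
}
```

Take blocks in file order as previous row (5), row tombstone (10), row marker (20), next row (30) and search for 10: the result is index 1, so both the tombstone and the marker are read. If the writer emits the marker first, the starts become 5, 20, 10, 30; the same search returns index 2 (the tombstone), and the marker at index 1 is skipped.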

The explanation for why the test started to fail after dynamic
adjustments is the following. The promoted index cursor works by
incrementally parsing buffers fed by the file input stream. It first
parses the whole block and then does a binary search within the parsed
array. The entries which the cursor touches during binary search depend on
the size of the block read from the file. The commit which enabled
dynamic adjustments causes the block size to be different for
subsequent reads, which allows one of the reads to walk over the
corrupted entries and read the correct data by selecting the entry
corresponding to the row marker.

Fixes #8324
Message-Id: <20210322235812.1042137-1-tgrabiec@scylladb.com>
2021-03-23 16:13:47 +01:00
Tomasz Grabiec
235154cca5 Merge "Teach scylla-gdb new trees in row cache" from Pavel Emelyanov
Clustering rows are now stored in intrusive btree, cells are
now stored in radix tree, but scylla-gdb tries to walk the
intrusive_set and vector/set union respectively.

For the former case -- the btree wrapper is introduced.

For the latter -- the compiler optimizes away too many important
bits, and walking the tree turns into a bunch of hard-coded
hacks and reinterpret-casts. Until a better solution is found,
just print the address of the tree root.

* xemul/br-gdb-btree-rows:
  gdb: Show address of the row::_cells tree (or "empty" mark)
  gdb: Add support for intrusive B tree
  gdb: Use helper to get rows from mutation_partition
2021-03-23 12:50:17 +01:00
Pavel Emelyanov
1cd9ec952f gdb: Show address of the row::_cells tree (or "empty" mark)
Currently clang optimizes out lots of critical stuff from the
compact radix tree. Until we find a way to walk the
tree in gdb, it's better to at least show where it is in
memory.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-03-23 13:29:40 +03:00