Commit Graph

1860 Commits

Gleb Natapov
09528b8671 raft: test: test leadership transfer timeout
Test that if leadership transfer cannot be done in the configured time
frame, the fsm cancels the leadership transfer process. Also check that
the timeout_now message is resent on each tick while leadership transfer
is in progress.
2021-06-22 14:42:50 +03:00
Nadav Har'El
a9b383f423 cql-pytest: improve test for SSL/TLS versions
The existing test_ssl.py, which tests Scylla's support of various TLS
and SSL versions, used a deprecated and misleading Python API for
choosing the protocol version. In particular, the protocol version
ssl.PROTOCOL_SSLv23 is *not*, despite its name, SSL versions 2 or 3,
or SSL at all - it is in fact an alias for the latest TLS version :-(
This misunderstanding led us to open the incorrect issue #8837.

So in this patch, we avoid the old Python APIs for choosing protocols,
which were gradually deprecated, and switch to the new API introduced
in Python 3.7 and OpenSSL 1.1.0g - supplying the minimum and maximum
desired protocol version.

With this new API, we can correctly connect with various versions of
the SSL and TLS protocol - from SSLv3 through TLSv1.3. With the
fixed test, we confirm that Scylla does *not* allow SSLv3 - as desired -
so issue #8837 is a non-issue.

Moreover, now that issue #8827 has been fixed, this test passes,
so the "xfail" mark is removed.

Refs #8837.
Refs #8827.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210617134305.173034-1-nyh@scylladb.com>
2021-06-17 17:06:31 +03:00
Tomasz Grabiec
6d8440fe70 Merge "raft: (testing) leadership transfer tests" from Pavel Solodovnikov
The patch set introduces a few leadership transfer tests, some of which
are adaptations of corresponding etcd tests (e.g.
`test_leader_transfer_ignore_proposal` and `test_transfer_non_member`).

Others test different scenarios, ensuring that a pending leadership
transfer doesn't prevent the rest of the cluster from making progress:

Lost `timeout_now` messages (`test_leader_transfer_lost_timeout_now` and
`test_leader_transferee_dies_upon_receiving_timeout_now`) as well as a
lost `vote_request(force)` from the new candidate
(`test_leader_transfer_lost_force_vote_request`) don't impact the
subsequent election process, and the leader is elected as normal.

* manmanson/leadership_transfer_tests_v3:
  raft: etcd_test: test_transfer_non_member
  raft: etcd_test: test_leader_transfer_ignore_proposal
  raft: fsm_test: test_leader_transfer_lost_force_vote_request
  raft: fsm_test: test_leader_transfer_lost_timeout_now
  raft: fsm_test: test_leader_transferee_dies_upon_receiving_timeout_now
2021-06-17 13:58:31 +02:00
Piotr Sarna
8cca68de75 cql3: add USING TIMEOUT support for deletes
Turns out the DELETE statement already supports attributes
like timestamp, so it's ridiculously easy to add USING TIMEOUT
support - it's just a matter of accepting it in the grammar.
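
For example, with this change a statement such as
`DELETE FROM ks.tbl USING TIMEOUT 50ms WHERE pk = 1;` is accepted,
mirroring the existing `USING TIMESTAMP` attribute (the keyspace,
table, column and values here are only illustrative).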

Fixes #8855

Closes #8876
2021-06-17 14:21:01 +03:00
Avi Kivity
00ff3c1366 Merge 'treewide: add support for snapshot skip-flush option' from Benny Halevy
The option is provided by nodetool snapshot
https://docs.scylladb.com/operating-scylla/nodetool-commands/snapshot/
```
nodetool [(-h <host> | --host <host>)] [(-p <port> | --port <port>)]
         [(-pp | --print-port)] [(-pw <password> | --password <password>)]
         [(-pwf <passwordFilePath> | --password-file <passwordFilePath>)]
         [(-u <username> | --username <username>)] snapshot
         [(-cf <table> | --column-family <table> | --table <table>)]
         [(-kc <kclist> | --kc.list <kclist>)]
         [(-sf | --skip-flush)] [(-t <tag> | --tag <tag>)] [--] [<keyspaces...>]

-sf / --skip-flush    Do not flush memtables before snapshotting (snapshot will not contain unflushed data)
```

But it is currently ignored by scylla-jmx (scylladb/scylla-jmx#167)
and not supported at the API level.

This patch adds support for the option in advance of that,
from the API service level down via snapshot_ctl
to the table class and snapshot implementation.

In addition, a corresponding unit test was added to verify
that taking a snapshot with `skip_flush` does not flush the memtable
(at the table::snapshot level).

Refs #8725

Closes #8726

* github.com:scylladb/scylla:
  test: database_test: add snapshot_skip_flush_works
  api: storage_service/snapshots: support skip-flush option
  snapshot: support skip_flush option
  table: snapshot: add skip_flush option
  api: storage_service/snapshots: add sf (skip_flush) option
2021-06-17 13:32:23 +03:00
Nadav Har'El
7fd7e90213 cql-pytest: translate Cassandra's tests for static columns
This is a translation of Cassandra's CQL unit test source file
validation/entities/StaticColumnsTest.java into our cql-pytest framework.

This test file checks various features of static columns. All these tests
pass on Cassandra, and all but one pass on Scylla. The xfailing test,
testStaticColumnsWithSecondaryIndex, exposes a query that Cassandra
allows but we don't. The new issue opened about that is #8869.

Refs #8869.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210616141633.114325-1-nyh@scylladb.com>
2021-06-17 11:08:28 +02:00
Tomasz Grabiec
6bdf8c4c46 Merge "raft: second series of preparatory patches for group 0 discovery" from Kostja
Miscellaneous preparatory patches for group 0 discovery.

* scylla-dev/raft-group-0-part-2-v4:
  raft: (service) servers map is gid -> server, not sid -> server
  system_keyspace: raft.group_id and raft_snapshots.group_id are TIMEUUID
  raft: (server) wait for configuration transition to complete
  raft: (server) implement raft::server::get_configuration()
  raft: (service) don't throw from schema state machine
  raft: (service) permit some scylla.raft cells to be empty
  raft: (service) properly handle failure to add a server
  raft: implement is_transient_error()
2021-06-17 00:15:40 +02:00
Konstantin Osipov
18e3fcdbf1 raft: (service) servers map is gid -> server, not sid -> server
Raft Group registry should map Raft Group Id to Raft Server,
not Raft Server ID (which is identical for all groups) to Raft server.

Raft Group 0 ID works as a cluster identifier, so it is generated when a
new cluster is created and is shared by all nodes of the same cluster.

Implement a helper to get raft::server by group id.

Consistently throw a new raft_group_not_found exception
if there is no server or rpc for the specified group id.
2021-06-16 19:05:50 +03:00
Avi Kivity
f05ddf0967 Merge "Improve LSA descriptor encoding" from Pavel
"
The LSA small objects allocation latency is greatly affected by
the way this allocator encodes the object descriptor in front of
each allocated slot.

Nowadays it's one of the VLE (variable-length encoding) variants,
implemented with the help of a loop. Re-implementing this piece with
fewer instructions and without a loop greatly reduces the allocation
latency.

The speed-up mostly comes from loop-less code that doesn't confuse
the branch predictor. Also, the express encoder seems to benefit from
writing 8 bytes of the encoded value in one go, rather than
byte-by-byte.
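
To illustrate the difference between the two approaches, here is a
hedged, plain-C++ sketch of a loop-based variable-length encoder versus
a branch-free "express" one that writes 8 bytes in one go; the layout
below is made up for illustration and is not the actual Scylla uleb64
format or code:

```
#include <bit>
#include <cstdint>
#include <cstring>

// Loop-based VLE (LEB128-style): one byte per 7 bits of payload,
// with a data-dependent loop and branches.
inline char* encode_loop(char* out, uint64_t v) {
    do {
        uint8_t b = v & 0x7f;
        v >>= 7;
        if (v) {
            b |= 0x80;                       // continuation bit
        }
        *out++ = static_cast<char>(b);
    } while (v);
    return out;
}

// "Express"-style sketch: fold a 3-bit length tag into the low bits and emit
// everything with a single 8-byte store (the buffer needs 8 bytes of slack).
// Assumes the value fits in 61 bits; again, not the actual uleb64 layout.
inline char* encode_express(char* out, uint64_t v) {
    unsigned bits = 64 - std::countl_zero(v | 1);
    unsigned len = (bits + 3 + 7) / 8;       // bytes needed, incl. the tag
    uint64_t word = (v << 3) | len;          // length tag in the low 3 bits
    std::memcpy(out, &word, sizeof(word));   // one wide store, no loop
    return out + len;
}
```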

Perf measurements:

1. (new) logallog test shows ~40% smaller times

2. perf_mutation in release mode shows ~2% increase in tps

3. the encoder itself is 2 - 4 times faster on x86_64 and
   1.05 - 3 times faster on aarch64. The speed-up depends on
   the 'encoded length': the old encoder takes linear time,
   while the new one is constant

tests: unit(dev), perf(release), just encoder on Aarch64
"

* 'br-lsa-alloc-latency-4' of https://github.com/xemul/scylla:
  lsa: Use express encoder
  uleb64: Add express encoding
  lsa: Extract uleb64 code into header
  test: LSA allocation perf test
2021-06-16 18:07:13 +03:00
Avi Kivity
0948908502 Merge "mutation_reader: multishard_combining_reader clean-up close path" from Botond
"
The close path of the multishard combining reader is riddled with
workarounds for the fact that the flat mutation reader couldn't wait on
futures when destroyed. Now that we have a close() method that can do
just that, all these workarounds can be removed.
Even more workarounds can be found in tests, where resources like the
reader concurrency semaphore are created separately for each tested
multishard reader and then destroyed once the reader no longer needs
them, so we had to come up with all sorts of creative and ugly
workarounds to keep these alive until background cleanup is finished.
This series fixes all this. Now, after calling close on the multishard
reader, all resources it used, including the life-cycle policy and the
semaphores created by it, can be safely destroyed. This greatly
simplifies the handling of the multishard reader, and makes it much
easier to reason about life-cycle dependencies.

Tests: unit(dev, release:v2, debug:v2,
    mutation_reader_test:debug -t test_multishard,
    multishard_mutation_query_test:debug,
    multishard_combining_reader_as_mutation_source:debug)
"

* 'multishard-combining-reader-close-cleanup/v3' of https://github.com/denesb/scylla:
  mutation_reader: reader_lifecycle_policy: remove convenience methods
  mutation_reader: multishard_combining_reader: store shard_reader via unique ptr
  test/lib/reader_lifecycle_policy: destroy_reader: cleanup context
  test/lib/reader_lifecycle_policy: get rid of lifecycle workarounds
  test/lib/reader_lifecycle_policy: destroy_reader(): stop the semaphore
  test/lib/reader_lifecycle_policy: use a more robust eviction mechanism
  reader_concurrency_semaphore: wait for all permits to be destroyed in stop()
  test/lib/reader_lifcecycle_policy: fix indentation
  mutation_reader: reader_lifecycle_policy::destroy_reader(): require to be called on native shard
  reader_lifecycle_policy implementations: fix indentation
  mutation_reader: reader_lifecycle_policy::destroy_reader(): de-futurize reader parameter
  mutation_reader: shard_reader::close(): wait on the remote reader
  multishard_mutation_query: destroy remote parts in the foreground
  mutation_reader: shard_reader::close(): close _reader
  mutation_reader: reader_lifcecycle_policy::destroy_reader(): remove out-of-date comment
2021-06-16 17:25:50 +03:00
Konstantin Osipov
9c93d77e74 system_keyspace: raft.group_id and raft_snapshots.group_id are TIMEUUID
Fix a bug in the definitions of system.raft and system.raft_snapshots:
group_id is TIMEUUID, not long.
2021-06-16 16:52:43 +03:00
Pavel Emelyanov
1e67361267 test: LSA allocation perf test
The test measures the time it takes to allocate a bunch
of small objects on LSA inside a single segment.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-16 13:40:44 +03:00
Botond Dénes
b4e69cf63d test/lib/test_utils: require(): also log failed conditions
Currently `require()` throws an exception when the condition fails. The
problem with this is that the error is only printed at the end of the
test, with no trace in the logs of where exactly it happened relative
to other logged events. This patch adds an error-level log line to
address this.
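
A minimal sketch of the resulting behaviour (plain C++ with a stderr
"logger"; the real helper lives in test/lib/test_utils and uses the test
logger, so names and signatures here are only illustrative):

```
#include <iostream>
#include <source_location>
#include <sstream>
#include <stdexcept>
#include <string>

// Throws on failure (as before), but also emits an error-level log line
// first, so the failure shows up in-place among the other logged events.
inline void require(bool condition, const std::string& msg,
        const std::source_location loc = std::source_location::current()) {
    if (condition) {
        return;
    }
    std::ostringstream os;
    os << loc.file_name() << ":" << loc.line() << ": requirement failed: " << msg;
    std::cerr << "ERROR " << os.str() << "\n";   // immediate, ordered trace
    throw std::runtime_error(os.str());          // still fails the test
}
```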

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210616065711.46224-1-bdenes@scylladb.com>
2021-06-16 12:05:25 +03:00
Botond Dénes
a69db31b5c test/lib/reader_lifecycle_policy: destroy_reader: cleanup context
Now that we don't rely on any external machinery to keep the relevant
parts of the context alive until needed (its life-cycle is effectively
enclosed in that of the life-cycle policy itself), we can clean up the
context in `destroy_reader()` itself, avoiding a background trip back to
this shard.
2021-06-16 11:29:36 +03:00
Botond Dénes
d2ddaced4e test/lib/reader_lifecycle_policy: get rid of lifecycle workarounds
The lifecycle of the reader lifecycle policy and all the resources the
reads use is now enclosed in that of the multishard reader, thanks to its
close() method. We can now remove all the workarounds we had in place to
keep various resources alive until background reader cleanup finishes.
2021-06-16 11:29:36 +03:00
Botond Dénes
5a271e42a5 test/lib/reader_lifecycle_policy: destroy_reader(): stop the semaphore
So that when this method returns, the semaphore is safe to destroy. This
in turn will enable us to get rid of all the machinery we have in place
to deal with the semaphore having to outlive the lifecycle policy
without a clear point in time at which it is safe to destroy.
2021-06-16 11:29:36 +03:00
Botond Dénes
c09c62a0fb test/lib/reader_lifecycle_policy: use a more robust eviction mechanism
The test reader lifecycle policy has a mode in which it wants to ensure
all inactive readers are evicted, so tests can stress reader recreation
logic. For this it currently employs a trick of creating a waiter on the
semaphore. I don't even know how this works (or if it even does),
but it sure complicates the lifecycle policy code a lot.
So switch to the much more reliable and simple method of creating the
semaphore with a single count and no memory. This ensures that all
inactive reads are immediately evicted, while still allowing a single
read to be admitted at any time.
2021-06-16 11:29:36 +03:00
Botond Dénes
578a092e4a reader_concurrency_semaphore: wait for all permits to be destroyed in stop()
To prevent use-after-free resulting from any permit outliving the
semaphore.
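
A standalone sketch of the pattern in plain, blocking C++ (the real
reader_concurrency_semaphore is Seastar future-based, so this is only an
analogy, not its actual API): stop() returns only once every outstanding
permit has been destroyed.

```
#include <condition_variable>
#include <mutex>

class semaphore {
    std::mutex _m;
    std::condition_variable _cv;
    int _outstanding = 0;
public:
    class permit {
        semaphore* _s;
    public:
        explicit permit(semaphore* s) : _s(s) {}
        permit(const permit&) = delete;
        permit& operator=(const permit&) = delete;
        ~permit() {
            std::lock_guard<std::mutex> g(_s->_m);
            if (--_s->_outstanding == 0) {
                _s->_cv.notify_all();    // last permit gone, unblock stop()
            }
        }
    };
    permit obtain() {
        std::lock_guard<std::mutex> g(_m);
        ++_outstanding;
        return permit(this);             // guaranteed copy elision (C++17)
    }
    void stop() {
        // Wait until every permit has been destroyed, so none can outlive us.
        std::unique_lock<std::mutex> l(_m);
        _cv.wait(l, [this] { return _outstanding == 0; });
    }
};
```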
2021-06-16 11:29:36 +03:00
Botond Dénes
a10a6e253e test/lib/reader_lifcecycle_policy: fix indentation
Left broken from the previous patch.
2021-06-16 11:29:36 +03:00
Botond Dénes
8c7447effd mutation_reader: reader_lifecycle_policy::destroy_reader(): require to be called on native shard
Currently shard_reader::close() (its caller) goes to the remote shard,
copies back all fragments left there to the local shard, then calls
`destroy_reader()`, which in the case of the multishard mutation query
copies it all back to the native shard. This was required before because
`shard_reader::stop()` (`close()`'s) predecessor) couldn't wait on
`smp::submit_to()`. But close can, so we can get rid of all this
back-and-forth and just call `destroy_reader()` on the shard the reader
lives on, just like we do with `create_reader()`.
2021-06-16 11:29:35 +03:00
Botond Dénes
4ecf061c90 reader_lifecycle_policy implementations: fix indentation
Left broken from the previous patch.
2021-06-16 11:21:38 +03:00
Botond Dénes
a7e59d3e2c mutation_reader: reader_lifecycle_policy::destroy_reader(): de-futurize reader parameter
The shard reader is now able to wait on the stopped reader and pass the
already stopped reader to `destroy_reader()`, so we can de-futurize the
reader parameter of said method. The shard reader was already patched to
pass a ready future so adjusting the call-site is trivial.
The most prominent implementation, the multishard mutation query, can
now also drop its `_dismantling_gate` which was put in place so it could
wait on the background stopping of readers.

A consequence of this move is that errors that might happen during the
stopping of the reader are now handled in the shard reader, rather than
in all lifecycle policy implementations.
2021-06-16 11:21:38 +03:00
Avi Kivity
fce124bd90 Merge "Introduce flat_mutation_reader_v2" from Tomasz
"
This series introduces a new version of the mutation fragment stream (called v2)
which aims at improving range tombstone handling in the system.

When compacting a mutation fragment stream (e.g. for sstable compaction, data query, repair),
the compactor needs to accumulate range tombstones which are relevant for the yet-to-be-processed range.
See range_tombstone_accumulator. One problem is that it has an unbounded memory footprint because the
accumulator needs to keep track of all the tombstoned ranges which are still active.

Another, although more benign, problem is computational complexity needed to maintain that data structure.

The fix is to get rid of the overlap of range tombstones in the mutation fragment stream. In v2 of the
stream, there is no longer a range_tombstone fragment. Deletions of ranges of rows within a given
partition are represented with range_tombstone_change fragments. At any point in the stream there
is a single active clustered tombstone. It is initially equal to the neutral tombstone when the
stream of each partition starts. The range_tombstone_change fragment type signifies changes of the
active clustered tombstone. All fragments emitted while a given clustered tombstone is active are
affected by that tombstone. Like with the old range_tombstone fragments, the clustered tombstone
is independent from the partition tombstone carried in partition_start.
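
As a schematic illustration (the names are illustrative rather than the
exact API), a deletion of the clustering range [k1, k3) with tombstone t,
covering a row at k2, could look like this in the two stream versions:

```
v1 stream:
  partition_start(pk, partition_tombstone)
  range_tombstone([k1, k3), t)              -- may overlap/split with others
  clustering_row(k2)
  partition_end()

v2 stream:
  partition_start(pk, partition_tombstone)
  range_tombstone_change(before(k1), t)     -- active clustered tombstone becomes t
  clustering_row(k2)                        -- affected by the active tombstone
  range_tombstone_change(before(k3), {})    -- back to the neutral tombstone
  partition_end()
```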

The memory needed to compact a stream is now constant, because the compactor needs to only track the
current tombstone. Also, there is no need to expire ranges on each fragment because the stream emits
a fragment when the range ends.

This series doesn't convert all readers to v2. It introduces adaptors which can convert
between v1 and v2 streams. Each mutation source can be constructed with either v1 or v2 stream factory,
but it can be asked any version, performing conversion under the hood if necessary.

In order to guarantee that v1 to v2 conversion produces a well-formed stream, this series needs to
impose a constraint on v1 streams to trim range tombstones to clustering restrictions. Otherwise,
the v1->v2 converter could produce range tombstone changes which lie outside query restrictions, making
the stream non-canonical.

The v2 stream is strict about range tombstone trimming. It emits range tombstone changes which reflect
range tombstones trimmed to query restrictions, and fast-forwarding ranges. This makes the stream
more canonical, meaning that for a given set of writes, querying the database should produce the
same stream of fragments for given restrictions. There is less ambiguity in how the writes
are represented in the fragment stream. This wasn't the case with v1. For example, a given set
of deletions could be produced either as one range_tombstone, or as many, split and/or deoverlapped
with other fragments. Making the stream canonical makes diff calculation easier.

The mc sstable reader was converted to v2 because it seemed like a comparable effort to do that
versus implementing range tombstone trimming in v1.

The classes related to mutation fragment streams were cloned:
flat_mutation_reader_v2, mutation_fragment_v2, related concepts.

Refs #8625. To fully fix #8625 we need to finish the transition and get rid of the converters.
Converters accumulate range tombstones.

Tests:

 - unit [dev]
"

* tag 'flat_mutation_reader_range_tombstone_split-v3.2' of github.com:tgrabiec/scylla: (26 commits)
  tests: mutation_source_test: Run tests with conversions inserted in the middle
  tests: mutation_source_tests: Unroll run_flat_mutation_reader_tests()
  tests: Add tests for flat_mutation_reader_v2
  flat_mutation_reader: Update the doc to reflect range tombstone trimming
  sstables: Switch the mx reader to flat_mutation_reader_v2
  row_cache: Emit range tombstone adjacent to upper bound of population range
  tests: sstables: Fix test assertions to not expect more than they should
  flat_mutation_reader: Trim range tombstones in make_flat_mutation_reader_from_fragments()
  clustering_ranges_walker: Emit range tombstone changes while walking
  tests: flat_mutation_reader_assertions_v2: Adapt to the v2 stream
  Clone flat_reader_assertions into flat_reader_assertions_v2
  test: lib: simple_schema: Reuse new_tombstone()
  test: lib: simple_schema: Accept tombstone in delete_range()
  mutation_source: Introduce make_reader_v2()
  partition_snapshot_flat_reader: Trim range tombstones to query ranges
  mutation_partition: Trim range tombstones to query ranges
  sstables: reader: Inline specialization of sstable_mutation_reader
  sstables: k_l: reader: Trim range tombstones to query ranges
  clustering_ranges_walker: Introduce split_tombstone()
  position_range: Introduce contains() check for ranges
  ...
2021-06-16 11:10:54 +03:00
Tomasz Grabiec
605a6e0166 Merge "Remove int_or_strong_ordering concept" from Pavel
The concept was added to smoothly switch tri-comparing stuff from int
to strong-ordering. As of today only tests still need it, and the
conversion is pretty simple, plus an operator<<(ostream&) for the
std::strong_ordering type.
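
A minimal sketch of such a formatter (illustrative; not necessarily the
exact one added by the series):

```
#include <compare>
#include <iostream>

// Make std::strong_ordering printable, e.g. in test failure messages.
inline std::ostream& operator<<(std::ostream& os, std::strong_ordering o) {
    return os << (o < 0 ? "lt" : o > 0 ? "gt" : "eq");
}

int main() {
    std::cout << (1 <=> 2) << "\n";   // prints "lt"
}
```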

* xemul/br-remove-int-or-strong-ordering-2:
  util: Drop int_or_strong_ordering concept
  tests: Switch total-order-check onto strong_ordering
  to_string: Add formatter for strong_ordering
  tests: Return strong-ordering from tri-comparators
2021-06-16 09:34:49 +02:00
Tomasz Grabiec
3fcd1f43ba tests: mutation_source_test: Run tests with conversions inserted in the middle 2021-06-16 00:23:49 +02:00
Tomasz Grabiec
cddcba27de tests: mutation_source_tests: Unroll run_flat_mutation_reader_tests()
All readers are now flat so there is no need for this grouping.

This will be needed for the next patch, which needs a single function
with all test cases.
2021-06-16 00:23:49 +02:00
Tomasz Grabiec
ffb616fef6 tests: Add tests for flat_mutation_reader_v2 2021-06-16 00:23:49 +02:00
Tomasz Grabiec
a4275cf8bc sstables: Switch the mx reader to flat_mutation_reader_v2
The main difficulty was in making sure that emitted range tombstone
changes reflect range tombstones trimmed to clustering restrictions.
This is handled by mutation_fragment_filter and
clustering_ranges_walker. They return a list of range_tombstone_change
fragments to emit for each hop as the reader walks over the clustering
domain.

Tests which were using a normalizing reader expected range tombstones
to be split around rows. Drop this and adjust the tests accordingly. No
reader splits range tombstones around rows now.
2021-06-16 00:23:49 +02:00
Tomasz Grabiec
cf958b0ad0 row_cache: Emit range tombstone adjacent to upper bound of population range
Cache populating reader was emitting the row entry which stands for
the upper bound of the population range, but did not emit range
tombstones for the clustering range corresponding to:

  [ before(key), after(key) ).

This surfaces after sstable readers are changed to trim emitted range
tombstones to the fast-forwarding range. Before, it didn't cause
problems, because that range tombstone part would be emitted as part
of the sstable read.

The fix is to drop the optimization which pushes the row after
population is done, and let the regular handling for
copy_from_cache_to_buffer() take care of emitting the row and
tombstones for the remaining range.

A unit test is added which covers population from all sstable
versions.
2021-06-16 00:23:49 +02:00
Tomasz Grabiec
5b182ff29a tests: sstables: Fix test assertions to not expect more than they should
Before this patch, the tests expected readers to emit range tombstones
which are outside clustering restrictions. Readers do not have to emit
range tombstones outside clustering restrictions, so fix tests to only
expect the part which overlaps with query ranges.

This is a preparatory patch before changing readers to trim range
tombstones to clustering ranges.
2021-06-16 00:23:49 +02:00
Tomasz Grabiec
ed055db63e tests: flat_mutation_reader_assertions_v2: Adapt to the v2 stream 2021-06-16 00:23:49 +02:00
Tomasz Grabiec
276c68c867 Clone flat_reader_assertions into flat_reader_assertions_v2 2021-06-16 00:23:49 +02:00
Tomasz Grabiec
a13e7b30b7 test: lib: simple_schema: Reuse new_tombstone() 2021-06-16 00:23:49 +02:00
Tomasz Grabiec
7e01679c99 test: lib: simple_schema: Accept tombstone in delete_range() 2021-06-16 00:23:49 +02:00
Pavel Solodovnikov
e9258f43cd raft: etcd_test: test_transfer_non_member
Test that a node outside the configuration that receives a `timeout_now`
message doesn't disrupt the operation of the rest of the cluster.

That is, `timeout_now` has no effect and the outsider stays in
the follower state without promoting to candidate.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-15 19:44:21 +03:00
Pavel Solodovnikov
2b6d73de98 raft: etcd_test: test_leader_transfer_ignore_proposal
Test that a leader which has entered leader stepdown mode rejects
new append requests.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-15 19:44:21 +03:00
Pavel Solodovnikov
ab6b0e3d62 raft: fsm_test: test_leader_transfer_lost_force_vote_request
3-node cluster (A, B, C). A is initially elected leader.
The leader adds a new configuration entry that removes it from the
cluster, leaving (B, C).

Wait until the former leader commits the new configuration and starts
the leader transfer procedure, sending out the `timeout_now` message to
one of the remaining nodes. At that point the target hasn't received it yet.

Deliver the `timeout_now` message to the target but lose all the
`vote_request(force)` messages it attempts to send.
This should halt the election process.
Then wait for the election timeout so that the candidate node starts
another, normal election (without the `force` flag for vote requests).

Check that this candidate then makes progress and is elected
leader.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-15 19:44:21 +03:00
Pavel Solodovnikov
97fe6f9d49 raft: fsm_test: test_leader_transfer_lost_timeout_now
3-node cluster (A, B, C). A is initially elected leader.
The leader adds a new configuration entry that removes it from the
cluster, leaving (B, C).

Wait until the former leader commits the new configuration and starts
the leader transfer procedure, sending out the `timeout_now` message to
one of the remaining nodes. At that point the target hasn't received it yet.

Lose this message and verify that the rest of the cluster (B, C)
can make progress and elect a new leader.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-15 19:44:21 +03:00
Pavel Solodovnikov
c32497b798 raft: fsm_test: test_leader_transferee_dies_upon_receiving_timeout_now
4-node cluster (A, B, C, D). A is initially elected leader.
The leader adds a new configuration entry that removes it from the
cluster, leaving (B, C, D).
Let the cluster communicate up to the point where A starts to resign
its leadership (calls `transfer_leadership()`).
At this point, A should send a `timeout_now` message to one of
the remaining nodes (B, C or D) and the new configuration should be
committed. But no node has actually received the `timeout_now` message
yet.

Determine on which node the message should arrive, accept the
`timeout_now` message and disconnect the target from the rest of the
group.

Check that after that the cluster, which has only two live members,
can make progress and elect a new leader through a normal election process.

tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-15 19:44:19 +03:00
Alejo Sanchez
9a22a30554 raft: replication test: split elect_new_leader for prevote
Branch URL: https://github.com/alecco/scylla/tree/raft-fixes-02-v3-01

Tests: unit ({dev}), unit ({debug}), unit ({release})

This fixes current election hangs in next.

Message-Id: <20210610143558.131685-1-alejo.sanchez@scylladb.com>
2021-06-15 11:53:24 +02:00
Tomasz Grabiec
9d49a26e79 Merge "raft: randomized_nemesis_test: tick servers less often than the network in basic_test" from Kamil
Previously `ticker` would use a single function, `on_tick`, which it
called in a loop with yields in-between. In `basic_test` we would use
this to tick every object in synchrony.

However, to closely simulate a production environment, we want the
tick ratios to be different. For example Raft servers should be ticked
rarely compared to the network.

We may also want to give the Seastar reactor more space between the
function calls (e.g. if they cause a bunch of work to be created for the
reactor that needs more than one tick to complete).

To support these use cases we first generalize `ticker` to take a set of
functions with associated numbers. These numbers are the call periods of
their corresponding functions: given {n, f}, `f` will be called each
`n`th tick.
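
A hedged sketch of the generalized ticker interface described above
(plain C++, not the actual randomized_nemesis_test code):

```
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Each registered function has a period n and runs on every n-th tick.
class ticker {
    struct entry {
        uint64_t period;
        std::function<void()> fn;
    };
    std::vector<entry> _entries;
    uint64_t _tick = 0;
public:
    void add(uint64_t period, std::function<void()> fn) {
        _entries.push_back({period, std::move(fn)});
    }
    void tick() {
        ++_tick;
        for (auto& e : _entries) {
            if (_tick % e.period == 0) {
                e.fn();
            }
        }
    }
};

// Usage resembling the described setup: deliver network messages on every
// tick, but tick the Raft servers only once per 10 network ticks.
//
//   ticker t;
//   t.add(1,  [&] { net.deliver(); });
//   t.add(10, [&] { for (auto& srv : servers) srv.tick(); });
```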

We use this new functionality to tick Raft servers less often than the
network in basic_test.

This patchset effectively reverts 01b6a2eb38
which caused the ticker to call `on_tick` only when the Seastar reactor
had no work to do. The approach taken here is unfortunately incompatible with the
approach taken there. We *do* want the ticker to race with other work,
potentially producing more work while already scheduled work is executing,
and we want to see in tests what happens when we adjust the ticking ratios
of different subsystems.

The previous approach also had a problem where if there was an infinite task
loop executing, the ticker wouldn't ever tick.

The previous fix was introduced since the ticker caused too much work to
be produced (so the reactor couldn't keep up) due to ticking the Raft
servers too often (after each yield). This commit deals with the problem
in a different way, by ticking the servers rarely, which also resembles
"real-life" scenarios better.

* kbr/tick-network-often-v4:
  raft: randomized_nemesis_test: generalize `ticker` to take a set of functions
  raft: randomized_nemesis_test: split `environment::tick` into two functions
  raft: randomized_nemesis_test: fix potential use-after-free in basic_test
2021-06-15 01:54:57 +02:00
Kamil Braun
8f1caa6a90 raft: randomized_nemesis_test: generalize ticker to take a set of functions
... with associated calling periods and use the new API in `basic_test`.

Previously `ticker` would use a single function, `on_tick`, which it
called in a loop with yields in-between. In `basic_test` we would use
this to tick every object in synchrony.

However, to closely simulate a production environment, we may want the
tick ratios to be different. For example Raft servers should be ticked
rarely compared to the network.

We may also want to give the Seastar reactor more space between the
function calls (e.g. if they cause a bunch of work to be created for the
reactor that needs more than one tick to complete).

To support these use cases we generalize `ticker` to take a set of
functions with associated numbers. These numbers are the call periods of
their corresponding functions: given {n, f}, `f` will be called each
`n`th tick.

We also modify `basic_test` to use this new approach: we tick Raft
servers once per 10 network ticks (in particular, once per 10 reactor
yields).

This commit effectively reverts 01b6a2eb38
which caused the ticker to call `on_tick` only when the Seastar reactor
had no work to do. The approach taken here is unfortunately incompatible with the
approach taken there. We *do* want the ticker to race with other work,
potentially producing more work while already scheduled work is executing,
and we want to see in tests what happens when we adjust the ticking ratios
of different subsystems.

The previous approach also had a problem where if there was an infinite task
loop executing, the ticker wouldn't ever tick.

The previous fix was introduced since the ticker caused too much work to
be produced (so the reactor couldn't keep up) due to ticking the Raft
servers too often (after each yield). This commit deals with the problem
in a different way, by ticking the servers rarely, which also resembles
"real-life" scenarios better.

With this change we must also wait a bit longer for the first node to
elect itself as a leader at the beginning of the test.
2021-06-14 16:54:38 +02:00
Kamil Braun
c0b80f1f8a raft: randomized_nemesis_test: split environment::tick into two functions
One for ticking the network and one for ticking the servers.
2021-06-14 16:54:38 +02:00
Kamil Braun
f42776aded raft: randomized_nemesis_test: fix potential use-after-free in basic_test
The test starts by waiting a certain number of ticks for the first node
to elect itself as a leader.

If this wait times out - i.e. the number of ticks passes before the node
manages to elect itself - the future associated with the task which checks
for the leader condition becomes discarded (it is passed to
`with_timeout`) and the task may keep using the `environment` (which it
has a reference to) even after the `environment` is destroyed.

Furthermore, the aforementioned task is a coroutine which uses lambda
captures in its body. Leaving `with_timeout` destroys the lambda object,
causing the coroutine to refer to no-longer-existing captures.

We fix the problems by:
- making `environment` `weakly_referencable` and checking if it's alive
  before it's used inside the task,
- not capturing anything in the lambda but passing whatever's needed as
  function arguments (so these things get allocated inside the coroutine
  frame).
2021-06-14 16:54:38 +02:00
Nadav Har'El
3645c7104b Merge: Wrap alternator start-stop into controller
Merged patch series by Pavel Emelyanov:

Alternator start and stop code is sitting inside main()
and it's a big piece of code out there. Having it all in main
complicates rework of start-stop sequences; it's much
handier to have it in alternator/.

This set puts the mentioned code into a transport- and thrift-
like controller model. While doing so, one more call to the global
storage service goes away.

* 'br-alternator-clientize' of https://github.com/xemul/scylla:
  alternator: Move start-stop code into controller
  alternator: Move the whole starting code into a sched group
  alternator: Dont capture db, use cfg
  alternator: Controller skeleton
  alternator: Controller basement
  alternator: Drop storage service from executor
2021-06-14 15:44:10 +03:00
Alejo Sanchez
5c8092cf42 raft: fix election with disruptive candidate
This patch also fixes rare hangs in debug mode for drops_04 without
prevote.

Branch URL: https://github.com/alecco/scylla/tree/raft-fixes-05-v2-dueling

Tests: unit ({dev}), unit ({debug}), unit ({release})

Changes in v2:
    - Fixed commit message                               @kostja

Without prevote, a node disconnected for long enough becomes a candidate.
While disconnected, (A) keeps increasing its term.
When it rejoins, it disrupts the current leader (C), which steps down due
to the higher term in (A)'s append_entries_reply, and (C) also increases
its term.

Meanwhile followers (B) and (D) don't know (C) stepped down but see it
alive according to the current failure detector implementation, and
also (A) has a shorter log than them.
So they reject (A)'s vote requests (Raft 4.2.3 Disruptive servers).

Then (C) rejects voting for (A) because it has a shorter log.
(C) then becomes a candidate, but even though (A) votes for (C), the
previous followers (B) and (D) ignore a vote request while leader (C) is
still alive and the election timeout has not passed.

(A) and (C) alone can't reach a quorum (2/4), so elections never succeed.

This patch addresses this problem by making followers not ignore vote
requests from whoever they think is the current leader, even though
the election timeout was not reached.

As @kostja noted, if the failure detector considered a leader alive only
as long as it sends heartbeats (append requests), this patch would no
longer be needed.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20210611172734.254757-1-alejo.sanchez@scylladb.com>
2021-06-14 11:07:38 +02:00
Raphael S. Carvalho
846f0bd16e sstables: Fix incremental selection with compound sstable set
Incremental selection may not work properly for LCS and ICS due to a
use-after-free bug in the partitioned set which came into existence after
the compound set was introduced.

The use-after-free happens because the partitioned set wasn't taking into
account that the next position can become the current position in the
next iteration, which will be used by all selectors managed by the
compound set. So if the next position is freed while it is being used
as the current position, subsequent selectors would find the current
position freed, making them produce incorrect results.

Fix this by moving ownership of next pos from incremental_selector_impl
to incremental_selector, which makes it more robust as the latter knows
better when the selection is done with the next pos. incremental_selector
will still return ring_position_view to avoid copies.

Fixes #8802.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210611130957.156712-1-raphaelsc@scylladb.com>
2021-06-13 16:45:07 +03:00
Tomasz Grabiec
7521301b72 Merge "raft: add tests for non-voters and fix related bugs" from Kostja
Add test coverage inspired by etcd for non-voter servers,
and fix issues discovered when testing.

* scylla-dev/raft-learner-test-v4:
  raft: (testing) test non-voter can vote
  raft: (testing) test receiving a confchange in a snapshot
  raft: (testing) test voter-non-voter config change loop
  raft: (testing) test non-voter doesn't start election on election timeout
  raft: (testing) test what happens when a learner gets TimeoutNow
  raft: (testing) implement a test for a leader becoming non-voter
  raft: style fix
  raft: step down as a leader if converted to a non-voter
  raft: improve configuration consistency checks
  raft: (testing) test that non-voter stays in PIPELINE mode
  raft: (testing) always return fsm_debug in create_follower()
2021-06-12 21:36:47 +03:00
Nadav Har'El
9774c146cc cql-pytest: add test for connecting with different SSL/TLS versions
This is a reproducer for issue #8827: it checks that a client which
tries to connect to Scylla with an unsupported version of SSL or TLS
gets the expected error alert - not some sort of unexpected EOF.

Issue #8827 is still open, so this test is still xfailing. However,
I verified that with a fix for this issue, the test passes.

The test also prints which protocol versions worked - so it also helps
check issue #8837 (about the ancient SSL protocol being allowed).

Refs #8837
Refs #8827

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210610151714.1746330-1-nyh@scylladb.com>
2021-06-12 21:36:47 +03:00
Michael Livshin
2bbc293e22 tests: improve error reporting of test_env::reusable_sst()
Distinguish the "no such sstable" case from any reading errors.

While at it, coroutinize the function.

Refs #8785.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20210610113304.264922-1-michael.livshin@scylladb.com>
2021-06-11 19:06:43 +02:00