Commit Graph

26327 Commits

Author SHA1 Message Date
Gleb Natapov
d0ebd79deb raft: test: return error from rpc module if nodes are disconnected
Returning an error when nodes are disconnected more closely resembles
what will happen in real networking.
2021-05-06 11:34:31 +03:00
Gleb Natapov
c4d87d7a23 raft: fix a typo in a variable name 2021-05-06 11:33:47 +03:00
Gleb Natapov
745f63991f raft: test: fix c&p error in a test
Message-Id: <YJKBOwBX8hqHLxsB@scylladb.com>
2021-05-05 17:18:49 +02:00
Avi Kivity
ddb1f0e6ca Merge "Choose the user max-result-size for service levels" from Botond
"
Choosing the max-result-size for unlimited queries is broken for unknown
scheduling groups. In this case the system limit (unlimited) will be
chosen. A prime example of this breakdown is when service levels are
used.

This series fixes this in the same spirit as the similar semaphore
selection issue (#8508) was fixed: use the user limit as the fall-back
in case of unknown scheduling groups.
To ensure future fixes automatically apply to both query-classification
related configurations, selecting the max result size for unlimited
queries is now delegated to the database, sharing the query
classification logic with the semaphore selection.

Fixes: #8591

Tests: unit(dev)
"

* 'query-max-size-service-level-fix/v2' of https://github.com/denesb/scylla:
  service/storage_proxy: get_max_result_size() defer to db for unlimited queries
  database: add get_unlimited_query_max_result_size()
  query_class_config: add operator== for max_result_size
  database: get_reader_concurrency_semaphore(): extract query classification logic
2021-05-05 18:11:10 +03:00
Lauro Ramos Venancio
15f72f7c9e TWCS: initialize _highest_window_seen
The timestamp_type is an int64_t, so it has to be explicitly
initialized before use.

The missing initialization prevented the major compaction
from happening when a time window finishes, as described in #8569.

Fixes #8569

Signed-off-by: Lauro Ramos Venancio <lauro.venancio@incognia.com>

Closes #8590
2021-05-05 17:31:05 +03:00
Avi Kivity
1ed3f54f4a Merge "size_tiered_compaction_strategy: get_buckets improvements" from Benny
"
This patchset contains 3 main improvements to STCS get_buckets
implementation and algorithm:

1. Consider only current bucket for each sstable.
   No need to scan all buckets using a map
   since the inserted sstables are sorted by size.
2. Use double precision for keeping bucket average size.
   Prevent rounding error accumulation.
3. Don't let the bucket average drift too high.
   As we insert increasingly larger sstables into a bucket,
   its average size drifts up and eventually this may break
   the bucket invariant that all sstables in the bucket should
   be within the (bucket_low, bucket_high) range relative
   to the bucket average.

Test: unit(dev)
DTest: compaction_test.py:TestCompaction_with_SizeTieredCompactionStrategy,
    compaction_additional_test.py:CompactionAdditionalStrategyTests_with_SizeTieredCompactionStrategy

Fixes #8584
"

* tag 'stcs-buckets-v3' of github.com:bhalevy/scylla:
  compaction: size_tiered_compaction_strategy: get_buckets: fixup indentation
  compaction: size_tiered_compaction_strategy: get_buckets: don't let the bucket average drift too high
  compaction: size_tiered_compaction_strategy: get_buckets: keep bucket average size as double precision floating point number
  compaction: size_tiered_compaction_strategy: get_buckets: rename old_average_size to bucket_average_size
  compaction: size_tiered_compaction_strategy: get_buckets: consider only current bucket for each sstable
2021-05-05 16:25:12 +03:00
Avi Kivity
6977064693 dist: scylla_raid_setup: reduce xfs block size to 1k
Since Linux 5.12 [1], XFS is able to asynchronously overwrite
sub-block ranges without stalling. However, we want good performance
on older Linux versions, so this patch reduces the block size to the
minimum possible.

That turns out to be 1024 for crc-protected filesystems (which we want),
and it also cannot be smaller than the sector size. So we fetch the
sector size and set the block size to that if it is larger than 512.
Most SSDs have a sector size of 512, so this isn't a problem.
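The resulting rule is small; here is a sketch in Python (the function name is ours, for illustration only):

```python
def choose_xfs_block_size(sector_size: int) -> int:
    """Pick the smallest valid XFS block size.

    CRC-protected filesystems require at least 1024 bytes, and the
    block size can never be smaller than the device sector size.
    """
    MIN_CRC_BLOCK_SIZE = 1024
    return max(MIN_CRC_BLOCK_SIZE, sector_size)
```

For the common 512-byte-sector SSDs this yields 1024; larger-sector devices get their sector size.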

Tested on AWS i3.large.

Fixes #8156.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ed1128c2d0c87e5ff49c40f5529f06bc35f4251b

Closes #8585
2021-05-05 16:07:50 +03:00
Nadav Har'El
64a4e5e059 cross-tree: reduce dependency on db/config.hh and database.hh
Every time db/config.hh is modified (e.g., to add a new configuration
option), 110 source files need to be recompiled. Many of those 110 didn't
really care about configuration options, and just got the dependency
accidentally by including some other header file.

In this patch, I remove the include of "db/config.hh" from all header
files. It is only needed in source files - and header files only
need forward declarations. In some cases, source files were missing
certain includes which they got incidentally from db/config.hh, so I
had to add these includes explicitly.

After this patch, the number of source files that get recompiled after a
change to db/config.hh goes down from 110 to 45.
It also means that 65 source files now compile faster because they don't
include db/config.hh and whatever it included.

Additionally, this patch also eliminates a few unnecessary inclusions
of database.hh in other header files, which can use a forward declaration
or database_fwd.hh. Some of the source files including one of those
header files relied on one of the many header files brought in by
database.hh, so we need to include those explicitly.
In view_update_generator.hh something interesting happened - it *needs*
database.hh because of code in the header file, but only included
database_fwd.hh, and the only reason this worked was that the files
including view_update_generator.hh already happened to unnecessarily
include database.hh. So we fix that too.

Refs #1

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210505121830.964529-1-nyh@scylladb.com>
2021-05-05 15:24:25 +03:00
Nadav Har'El
5fbd78ed96 CONTRIBUTING.md: add the requirement for self-contained headers
As far as I can tell, we never documented the requirement for
self-contained headers in our coding style. So let's do it now, and
explain the "ninja dev-headers" command and how to use it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210505120908.963388-1-nyh@scylladb.com>
2021-05-05 15:10:46 +03:00
Benny Halevy
ead96e21c3 compaction: size_tiered_compaction_strategy: get_buckets: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-05-05 14:26:37 +03:00
Benny Halevy
c1681cb9ea compaction: size_tiered_compaction_strategy: get_buckets: don't let the bucket average drift too high
SSTables are added in increasing size order so the bucket's
average might drift upwards.
Don't let it drift too high, to a point where the smallest
SSTable might fall out of range.

For example, here's a simulation run of the algorithm for these sstable sizes:
    [21, 123, 252, 363, 379, 394, 407, 428, 463, 467, 470, 523, 752, 774]

the simulated compaction strategy options are:
min_sstable_size = 4
bucket_low = 0.66667
bucket_high = 1.5

For each bucket, the following is printed: (avg * bucket_low) avg (avg * bucket_high)

UNCHANGED:
buckets={
    (  14.0)   21.0 (  31.5): [21]
    (  82.0)  123.0 ( 184.5): [123]
    ( 276.4)  414.6 ( 621.9): [252, 363, 379, 394, 407, 428, 463, 467, 470, 523]
    ( 508.7)  763.0 (1144.5): [752, 774]
}

IMPROVED:
buckets={
    (  14.0)   21.0 (  31.5): [21]
    (  82.0)  123.0 ( 184.5): [123]
    ( 247.0)  370.5 ( 555.8): [252, 363, 379, 394, 407, 428]
    ( 320.5)  480.8 ( 721.1): [463, 467, 470, 523]
    ( 508.7)  763.0 (1144.5): [752, 774]
}
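A minimal Python re-implementation of the simulated algorithm (an illustration of the commit's logic, not the actual C++ code) reproduces the IMPROVED bucketing above:

```python
def get_buckets(sizes, min_sstable_size=4, bucket_low=0.66667, bucket_high=1.5):
    """Single-pass STCS bucketing sketch; `sizes` must be sorted ascending.

    Only the most recently created bucket is considered for each sstable,
    and the bucket average is not allowed to drift so high that the
    smallest sstable in the bucket would fall below bucket_low * avg.
    """
    buckets = []  # list of (sizes-in-bucket, running average as float)
    for size in sizes:
        if buckets:
            bucket, avg = buckets[-1]
            in_range = (bucket_low * avg < size < bucket_high * avg or
                        (size < min_sstable_size and avg < min_sstable_size))
            if in_range:
                new_avg = (avg * len(bucket) + size) / (len(bucket) + 1)
                # drift check: keep the smallest member within range
                if bucket[0] >= bucket_low * new_avg:
                    bucket.append(size)
                    buckets[-1] = (bucket, new_avg)
                    continue
        buckets.append(([size], float(size)))
    return [b for b, _ in buckets]
```

Running it on the sizes above yields exactly the five IMPROVED buckets: the drift check rejects 463 from the [252..428] bucket because adding it would push bucket_low * avg above 252.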

Fixes #8584

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-05-05 14:26:28 +03:00
Benny Halevy
d3aa5265ab compaction: size_tiered_compaction_strategy: get_buckets: keep bucket average size as double precision floating point number
Using integer division loses accuracy by rounding down the result.
Each time we calculate:
```
    auto total_size = bucket.size() * old_average_size;
    auto new_average_size = (total_size + size) / (bucket.size() + 1);
```

We accumulate the rounding error: total_size might be too small, since
old_average_size was previously rounded down, and then new_average_size
is rounded down again.

Rather than trying to compensate for the rounding errors
by e.g. adding size / 2 to the dividend, simply keep the average
as a double precision number.

Note that we multiply old_average_size by options.bucket_{low,high},
that are double precision too so the size comparisons
are already using FP instructions implicitly.
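A quick illustration of the accumulated error, using toy sizes (not from the patch):

```python
def running_average(sizes, as_int: bool):
    """Recompute the running average the way the loop above does,
    either with integer division (old code) or in floating point."""
    avg = sizes[0]
    for n, size in enumerate(sizes[1:], start=1):
        total = n * avg
        avg = (total + size) // (n + 1) if as_int else (total + size) / (n + 1)
    return avg
```

For sizes [1, 2, 2, 2] the integer version stays stuck at 1 (every division rounds down and the error feeds back through total_size), while the floating-point version gives the true mean 1.75.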

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-05-05 14:26:25 +03:00
Benny Halevy
44b094f9a5 compaction: size_tiered_compaction_strategy: get_buckets: rename old_average_size to bucket_average_size
It has now become a reference used to update the bucket's average size
after a new sstable is inserted into it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-05-05 14:26:20 +03:00
Benny Halevy
336a4dc0fd compaction: size_tiered_compaction_strategy: get_buckets: consider only current bucket for each sstable
Since the sstables are sorted in increasing size order
there is no need to consider all buckets to find a matching one.

Instead, just consider the most recently inserted bucket.

Once we see an sstable size outside the allowed range for this bucket,
create a new bucket and consider this one for the next sstable.

Note, `old_average_size` should be renamed, since this change
turns it into a reference that is assigned the new average_size.
This patch keeps the old name to reduce the churn. The following
patch will do only the rename.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-05-05 14:26:05 +03:00
Botond Dénes
9d5e958331 service/storage_proxy: get_max_result_size() defer to db for unlimited queries
Defer picking the appropriate max result size for unlimited queries to
the database, which is already the place where we make query classifying
decisions. This move means that all these decisions are now centralized
in the database, not scattered in different places and fixing one fixes
all users.
2021-05-05 13:30:50 +03:00
Botond Dénes
992819b188 database: add get_unlimited_query_max_result_size()
Similar to the already existing get_reader_concurrency_semaphore(),
this method determines the appropriate max result size for the query
class, which is deduced from the current scheduling group. This method
shares its scheduling group -> query class association mechanism with
the above mentioned semaphore getter.
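The shape of the centralized classification can be sketched as follows; the names and limit values are ours for illustration, not Scylla's actual API:

```python
USER_MAX_RESULT_SIZE = 10 * 1024 * 1024  # illustrative user limit
SYSTEM_MAX_RESULT_SIZE = float("inf")    # effectively unlimited

def classify_query(scheduling_group, system_groups):
    """Map the current scheduling group to a query class, falling back
    to the user class for unknown groups (e.g. service-level groups),
    in the same spirit as the semaphore-selection fix (#8508)."""
    return "system" if scheduling_group in system_groups else "user"

def get_unlimited_query_max_result_size(scheduling_group, system_groups):
    # Shared classification: fixing classify_query() fixes all users.
    cls = classify_query(scheduling_group, system_groups)
    return SYSTEM_MAX_RESULT_SIZE if cls == "system" else USER_MAX_RESULT_SIZE
```

The point of the design is that both the semaphore getter and this getter consult the same classifier, so an unknown service-level group now gets the user limit instead of the unlimited system one.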
2021-05-05 13:30:42 +03:00
Nadav Har'El
58e275e362 cross-tree: reduce dependency on db/config.hh and database.hh
Every time db/config.hh is modified (e.g., to add a new configuration
option), 110 source files need to be recompiled. Many of those 110 didn't
really care about configuration options, and just got the dependency
accidentally by including some other header file.

In this patch, I remove the include of "db/config.hh" from all header
files. It is only needed in source files - and header files only
need forward declarations. In some cases, source files were missing
certain includes which they got incidentally from db/config.hh, so I
had to add these includes explicitly.

After this patch, the number of source files that get recompiled after a
change to db/config.hh goes down from 110 to 45.
It also means that 65 source files now compile faster because they don't
include db/config.hh and whatever it included.

Additionally, this patch also eliminates a few unnecessary inclusions
of database.hh in other header files, which can use a forward declaration
or database_fwd.hh. Some of the source files including one of those
header files relied on one of the many header files brought in by
database.hh, so we need to include those explicitly.
In view_update_generator.hh something interesting happened - it *needs*
database.hh because of code in the header file, but only included
database_fwd.hh, and the only reason this worked was that the files
including view_update_generator.hh already happened to unnecessarily
include database.hh. So we fix that too.

Refs #1

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210505102111.955470-1-nyh@scylladb.com>
2021-05-05 13:23:00 +03:00
Avi Kivity
83a826a4de Merge 'Azure Ls v2 local disk setup' from Lubos Kosco
fixes #8325

The iotune tests were run on CentOS 8.2, with both the stock and the
elrepo kernel, using Scylla 4.3 rc3.

Results are in https://docs.google.com/spreadsheets/d/1_uYq8UxY47XF5jreetrpleykLPqNGjfPXIirvTPh6rk/edit#gid=1101336711

Closes #7807

* github.com:scylladb/scylla:
  scylla_io_setup: add disk properties for L Azure instances
  scylla_util.py: add new class for Azure cloud support
2021-05-05 12:39:00 +03:00
Botond Dénes
e84c31fab8 query_class_config: add operator== for max_result_size 2021-05-05 11:20:22 +03:00
Botond Dénes
9313acb304 database: get_reader_concurrency_semaphore(): extract query classification logic
Into a local function. In the next patch we want to add another method
which needs to classify queries based on the current scheduling group,
so prepare for sharing this logic.
2021-05-05 10:41:04 +03:00
Tomasz Grabiec
121eb32679 Merge 'test: perf: report instructions retired per operations' from Avi Kivity
Instructions retired per op is a much more stable metric than time per
op (inverse throughput), since it isn't much affected by changes in
CPU frequency or other load on the test system (it's still somewhat
affected, since a slower system will run more reactor polls per op).
It's also less indicative of real performance, since it's possible for
fewer instructions to execute in more time than more instructions,
but that isn't an issue for comparative tests.

This allows incremental changes to the code base to be compared with
more confidence.

Current results are around 55k instructions per read, and 52k for writes.
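The stability argument can be shown with a toy snapshot-delta computation (the numbers are illustrative, only the ~55k instructions/read figure comes from the text above):

```python
def per_op_metrics(instructions, nanoseconds, ops):
    """Compute both per-op metrics from a stats-snapshot delta."""
    return instructions / ops, nanoseconds / ops

# Same workload at two CPU frequencies: time per op doubles on the
# slower machine, instructions per op stays put.
fast = per_op_metrics(instructions=55_000_000, nanoseconds=10_000_000, ops=1000)
slow = per_op_metrics(instructions=55_000_000, nanoseconds=20_000_000, ops=1000)
```

This is why comparing instructions/op across incremental code changes needs far fewer runs than comparing time/op.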

Closes #8563

* github.com:scylladb/scylla:
  test: perf: tidy up executor_stats snapshot computation
  test: perf: report instructions retired per operations
  test: perf: add RAII wrapper around Linux perf_event_open()
  test: perf: make executor_stats_snapshot() a member function of executor
2021-05-05 00:54:08 +02:00
Tomasz Grabiec
b8665c459d Merge "raft: replication test updates" from Alejo
Cleanups, fixes, and configuration change support for replication tests.

* alejo/raft-tests-replication-01-fixes-v13:
  raft: replication test: remove obsolete helper
  raft: replication test: add_entry with retries
  raft: replication test: support config change
  raft: replication test: add dummy command support
  raft: replication test: test both with and without prevote
  raft: replication test: make initial leader just default
  raft: replication test: create command helper
  raft: replication test: free elections as helper
  raft: replication test: fix election connectivity
  raft: replication test: fix custom election
  raft: replication test: add helpers for threshold and election
  raft: replication test: connectivity improvement
  raft: replication test: helper for server_address
  raft: replication test: use wait_log()
  raft: replication test: cycle leader more
  raft: replication test: fix a test description
  raft: replication test: remove multiple state machines
  raft: replication test: remove checksum
  raft: replication test: remove unused class param
2021-05-04 18:52:47 +02:00
Alejo Sanchez
27ad2a0f28 raft: replication test: remove obsolete helper
As we are now serially adding commands with consecutive integers, there
is no need to build vectors of commands. Remove the helper.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-04 11:01:07 -04:00
Alejo Sanchez
0a54fd848b raft: replication test: add_entry with retries
The current leader might have stepped down. Try again and learn if
there's a new leader.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-04 11:00:46 -04:00
Nadav Har'El
df65d09e08 Merge ' cdc: log: fill cdc$deleted_ columns in pre-images ' from Piotr Grabowski
Before this change, `cdc$deleted_` columns were all `NULL` in pre-images. Lack of such information made it hard to correctly interpret the pre-image rows, for example:

```
INSERT INTO tbl(pk, ck, v, v2) VALUES (1, 1, null, 1);
INSERT INTO tbl(pk, ck, v2) VALUES (1, 1, 1);
```

For this example, pre-image generated for the second operation would look like this (in both `true` and `full` pre-image mode):

```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```

`v=NULL` has two meanings:
1. If pre-image was in `true` mode, `v=NULL` describes that v was not affected (affected columns: pk, ck, v2).
2. If pre-image was in `full` mode, `v=NULL` describes that v was equal to `NULL` in the pre-image.

Therefore, to properly decode pre-images you would need to know which pre-image mode was configured on the CDC-enabled table at the moment this CDC log row was inserted. There is no way to determine such information (you can only check the current pre-image mode).

A solution to this problem is to fill in the `cdc$deleted_` columns for pre-images. After this PR, for the `INSERT` described above, CDC now generates the following log row:

If in pre-image 'true' mode:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```

If in pre-image 'full' mode:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=true, v2=1
```

A client library can now properly decode a pre-image row. If it sees a `NULL` value, it can check the `cdc$deleted_` column to determine whether this `NULL` was part of the pre-image or was omitted because the column was not affected by the delta operation.

No such change is necessary for the post-image rows, as those images are always generated in the `full` mode.

An additional example of the trouble decoding pre-images before this change.
tbl2 - `true` pre-image mode, tbl3 - `full` pre-image mode:

```
INSERT INTO tbl2(pk, ck, v, v2) VALUES (1, 1, 5, 1);
INSERT INTO tbl3(pk, ck, v, v2) VALUES (1, 1, null, 1);
```

```
INSERT INTO tbl2(pk, ck, v2) VALUES (1, 1, 1);
```
generated pre-image:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```

```
INSERT INTO tbl3(pk, ck, v2) VALUES (1, 1, 1);
```

generated pre-image:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```

Both pre-images look the same, but:
1. `v=NULL` in tbl2 describes v being omitted from the pre-image.
2. `v=NULL` in tbl3 describes v being `NULL` in the pre-image.
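The decoding rule a client library can apply after this change can be sketched like this (a hypothetical helper, not part of any driver API):

```python
def decode_preimage_value(value, deleted):
    """Interpret one column of a pre-image row after this change.

    `value` is the CDC column value, `deleted` the matching
    cdc$deleted_ column.
    """
    if value is not None:
        return value                # column present with this value
    if deleted:
        return "NULL in pre-image"  # v was NULL before the operation
    return "not in pre-image"       # column wasn't affected by the delta
```

With the `cdc$deleted_` columns filled in, the two previously identical rows from tbl2 and tbl3 now decode differently.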

Closes #8568

* github.com:scylladb/scylla:
  cdc: log: assert post_image is always in full mode
  cdc: tests: check cdc$deleted_ columns in images
  cdc: log: fill cdc$deleted_ columns in pre-images
2021-05-04 14:45:27 +03:00
Lubos Kosco
c26bcf29f9 scylla_io_setup: add disk properties for L Azure instances 2021-05-04 13:13:05 +02:00
Lubos Kosco
f627fcbb0c scylla_util.py: add new class for Azure cloud support 2021-05-04 13:12:42 +02:00
Piotr Grabowski
cd6154e8bf cdc: log: assert post_image is always in full mode
Add an assertion that checks that post_image can never be in non-full
mode.
2021-05-04 12:33:15 +02:00
Piotr Grabowski
778fbb144f cdc: tests: check cdc$deleted_ columns in images
Add a test that checks whether the cdc$deleted_ columns are properly
filled in the pre/post-image rows.

This test checks tables with only atomic columns, tables with frozen
collections and non-frozen collections. The test is performed with
both 'true' pre-image mode and 'full' pre-image mode.
2021-05-04 12:33:15 +02:00
Calle Wilund
7e345e37e8 cql/cdc_batch_delete_postimage_test - rename test files + fix result
The tests, when added, were not named with the *_test suffix, which the
runner requires in order to pick them up (instead of the more
sensible *.cql).

Thus, the test was never run beyond its initial creation, and also
bit-rotted slightly during behaviour changes.

Renamed and re-resulted.

Closes #8581
2021-05-04 12:47:33 +03:00
Avi Kivity
ef2313325b Merge "Teach sstables streams new streams API" from Pavel E
"
Recent changes in seastar added the ability for data sinks to
advertise the buffer size up to the stream level. This change was
needed to make the output stack honor the io-queue's max request
length. There are two more places left to patch.

The first is the sstables checksumming writer. This is the sink
implementation that has another sink inside. So this one is patched
to report up (to the output stream) the buffer size from the lower
sink (which is a file data sink that already "knows" the maximum IO
lengths).

The second one is the compress sink, but this sink embeds an output
stream inside, so even if it's working with larger buffers, that
inner stream will split them properly. So this place is patched just
to stop using the deprecated output stream constructor.

tests: unit(dev)
"

* 'br-streams-napi' of https://github.com/xemul/scylla:
  sstables: Make checksum sink report buffer size from lower sink
  sstables: Report buffer size from compressed file sink
2021-05-04 12:22:38 +03:00
Pavel Emelyanov
13b07a3c58 sstables: Make checksum sink report buffer size from lower sink
The checksum sink carries another sink on board and forwards
the put buffers lower, so there's no point in making these
two have different buffer sizes. This is what really happens
now, but this change makes this more explicit and makes the
checksumming code conform to the new output stream API.
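The delegation pattern is simple; a Python sketch (class and sizes are illustrative, not the sstables code):

```python
class FileSink:
    """Stand-in for the lower file data sink."""
    def buffer_size(self):
        return 128 * 1024  # e.g. the io-queue's max request length
    def put(self, buf):
        pass

class ChecksummingSink:
    """Forwards writes to an inner sink and reports the inner sink's
    preferred buffer size up the stack."""
    def __init__(self, inner):
        self.inner = inner
        self.checksum = 0

    def buffer_size(self):
        # No point in using a different size than the sink below,
        # which already knows the maximum IO length.
        return self.inner.buffer_size()

    def put(self, buf: bytes):
        self.checksum = (self.checksum + sum(buf)) & 0xFFFFFFFF
        self.inner.put(buf)

sink = ChecksummingSink(FileSink())
sink.put(b"\x01\x02")
```

The output stream above the checksumming sink then sizes its buffers to what the file sink can actually absorb.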

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-04 12:01:30 +03:00
Pavel Emelyanov
01b979beca sstables: Report buffer size from compressed file sink
This change just moves the place from which the output_stream
knows the compression::uncompressed_chunk_length() value.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-04 12:01:27 +03:00
Pekka Enberg
6583a04e5d Update seastar submodule
* seastar f1b6b95b...847fccaf (1):
  > perftune.py: fix parsing of 'write_back_cache' YAML option
2021-05-04 09:12:49 +03:00
Avi Kivity
6ffd813b7b Merge 'hints: delay repair until hints are replayed' from Piotr Dulikowski
Both hinted handoff and repair are meant to improve the consistency of the cluster's data. HH does this by storing records of failed replica writes and replaying them later, while repair goes through all data on all participating replicas and makes sure the same data is stored on all nodes. The former is generally cheaper and sometimes (but not always) can bring back full consistency on its own; repair, while being more costly, is a sure way to bring back current data to full consistency.

When hinted handoff and repair are running at the same time, some of the work can be unnecessarily duplicated. For example, if a row is repaired first, then hints towards it become unnecessary. However, repair needs to do less work if data already has good consistency, so if hints finish first, then the repair will be shorter.

This PR introduces the possibility to wait for hints to be replayed before continuing with a user-issued repair. The coordinator of the repair operation asks all nodes participating in the repair (including itself) to mark a point at the end of all hint queues pointing towards other nodes participating in the repair. Then, it waits until hint replay in all those queues reaches the marked point, or the configured timeout is reached.

This operation is currently opt-in and can be turned on by setting the `wait_for_hint_replay_before_repair_in_ms` config option to a positive value.
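The sync-point idea reduces to marking a queue position and waiting for the replay counter to pass it; a toy sketch (names are ours, not Scylla's):

```python
class HintQueue:
    def __init__(self):
        self.appended = 0   # hints ever enqueued towards a node
        self.replayed = 0   # hints already sent

    def mark_sync_point(self):
        # The sync point is simply the current end of the queue.
        return self.appended

    def reached(self, sync_point):
        return self.replayed >= sync_point

q = HintQueue()
q.appended = 5
sync_point = q.mark_sync_point()
q.replayed = 3
before = q.reached(sync_point)
q.replayed = 5
q.appended = 7   # hints arriving after the mark don't block the wait
after = q.reached(sync_point)
```

The repair coordinator would collect such marks from every participant's queues and poll them until all are reached or the timeout expires.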

Fixes #8102

Tests:
- unit(dev)
- some manual tests:
    - shutting down repair coordinator during hints replay,
    - shutting down node participating in repair during hints replay,

Closes #8452

* github.com:scylladb/scylla:
  repair: introduce abort_source for repair abort
  repair: introduce abort_source for shutdown
  storage_proxy: add abort_source to wait_for_hints_to_be_replayed
  storage_proxy: stop waiting for hints replay when node goes down
  hints: dismiss segment waiters when hint queue can't send
  repair: plug in waiting for hints to be sent before repair
  repair: add get_hosts_participating_in_repair
  storage_proxy: coordinate waiting for hints to be sent
  config: add wait_for_hint_replay_before_repair option
  storage_proxy: implement verbs for hint sync points
  messaging_service: add verbs for hint sync points
  storage_proxy: add functions for syncing with hints queue
  db/hints: make it possible to wait until current hints are sent
  db/hints: add a metric for counting processed files
  db/hints: allow to forcefully update segment list on flush
2021-05-03 18:47:27 +03:00
Alejo Sanchez
56e977ae69 raft: replication test: support config change
Add support for configuration change on leader.

Keep track of servers in config in test.

Add a dummy entry to confirm the configuration changed. If the add
fails because the old leader was not in the new config and stepped down,
the config is considered changed, too.

Add a test with some configuration changes.
Add a test cycling every scenario for 1 of 4 nodes removed.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
8d8af92cbb raft: replication test: add dummy command support
Use a special value as dummy entry to be ignored when seen in state
machine input.

Ignore dummy entries for count.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
4aa52be7e5 raft: replication test: test both with and without prevote
Before this change the default was prevote enabled.
With this change each test is run both with and without prevote,
which doubles the number of test cases.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
e759e492c7 raft: replication test: make initial leader just default
The test suite requires an initial leader, and at the moment it's always
just node 0. Make it the default and simplify the code.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
eb5bbcdec7 raft: replication test: create command helper
Factor out repeated code and make it available for other uses.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
eb94dd26dc raft: replication test: free elections as helper
Add a helper to run free elections and use it in partitioning.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
cb297a57df raft: replication test: fix election connectivity
If a leader was already disconnected, the election of a new leader could
re-connect it. Save the original connectivity and restore it when done
electing the new leader.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
0a5c605713 raft: replication test: fix custom election
Use the new fine-grained connectivity tracking to manage the old
leader's disconnection more precisely.

This fixes elections where the vote of the old leader is required
for quorum. For example {A,B} and we want to switch leader. For B to
become candidate it has to see A as down. Then A has to see B's request
for vote, and vote for B.

So, to handle the general case, the old leader needs to be first
disconnected from all nodes; then make the desired node a candidate;
then have the old leader connected only to the desired candidate
(otherwise, other nodes would see the new candidate as disrupting a
live leader).

Also, there might be stray messages from the former leader. These could
revert the candidate to follower. To handle this, the patch retries
the process until the desired node becomes leader.

The helper function elect_me_leader() is split and renamed to
wait_until_candidate() and wait_election_done(). The former ticks until
the node is a candidate, and the latter waits until a candidate either
becomes a leader or reverts to follower.

The existing etcd test workaround of incrementing from n=2 to n=3 nodes
is corrected back to original n=2.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
9909983e38 raft: replication test: add helpers for threshold and election
Add 2 helper functions for making nodes reach timeout threshold and to
elect a specific node.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
38526d7a2f raft: replication test: connectivity improvement
Replace simple full disconnect of a node with specific from -> to
disconnection tracking.

This will help electing new leaders.

Say there are {A,B,C} with A leader and we want to elect B.
Before this patch, we would disconnect A, run an election with just
{B,C}, and then re-connect A.

If we have {A,B} and want to elect B, this won't work, as B needs 2/2+1
votes and A is disconnected, even if we made A step down. This patch
corrects this shortcoming. (@gleb-cloudius)

With this patch, we can specify that other followers (not the previous
or next leader) don't see the old leader, while the new and old leaders
see each other just fine. In the example {A,B,C} above we can cut A<->B
specifically.

Also, this is closer to etcd testing and should help porting cases.

NOTE: in the current test implementation failure_detector reports
node.is_alive(other_node) if there is a connection both ways.
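The directional tracking with the both-ways liveness rule can be sketched as (illustrative names, not the test's actual code):

```python
class Connectivity:
    """Directed from -> to disconnection tracking."""
    def __init__(self):
        self.cut = set()   # directed (src, dst) pairs that are down

    def disconnect(self, src, dst):
        self.cut.add((src, dst))

    def connect(self, src, dst):
        self.cut.discard((src, dst))

    def is_alive(self, a, b):
        # The test failure detector sees a node as alive only if the
        # connection works both ways.
        return (a, b) not in self.cut and (b, a) not in self.cut

net = Connectivity()
net.disconnect("A", "B")   # cut A<->B specifically; A<->C stays up
```

This lets a test cut exactly one link (say A<->B) while leaving the rest of the mesh intact, which full-node disconnection could not express.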

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
f53dea432c raft: replication test: helper for server_address
A helper function to convert from local 0-based id to raft 1-based
server_address.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
294e16cf8b raft: replication test: use wait_log()
Use wait_log() helper in leftover election code.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
355c8a052f raft: replication test: cycle leader more
For ported etcd test cycle leader, cycle some more.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
5b2c9a6c94 raft: replication test: fix a test description
Fix replace_log_leaders_log_empty description comment.

Reported by @kbraun

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00
Alejo Sanchez
bbb56e2265 raft: replication test: remove multiple state machines
Checksum was removed, so undo the support for multiple versions added in:

    test: add support for different state machines
    43dc5e7dc2

NOTE: as there is a test with custom total_values, expected value cannot
      be static const anymore. (line 630)

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-05-03 07:53:35 -04:00