"
Choosing the max-result-size for unlimited queries is broken for unknown
scheduling groups. In this case the system limit (unlimited) will be
chosen. A prime example of this break-down is when service levels are
used.
This series fixes this in the same spirit as the similar semaphore
selection issue (#8508) was fixed: use the user limit as the fall-back
in case of unknown scheduling groups.
To ensure future fixes automatically apply to both query-classification
related configurations, selecting the max result size for unlimited
queries is now delegated to the database, sharing the query
classification logic with the semaphore selection.
Fixes: #8591
Tests: unit(dev)
"
* 'query-max-size-service-level-fix/v2' of https://github.com/denesb/scylla:
service/storage_proxy: get_max_result_size() defer to db for unlimited queries
database: add get_unlimited_query_max_result_size()
query_class_config: add operator== for max_result_size
database: get_reader_concurrency_semaphore(): extract query classification logic
The timestamp_type is an int64_t, so it has to be explicitly
initialized before use.
This missing initialization prevented the major compaction
from happening when a time window finishes, as described in #8569.
Fixes #8569
Signed-off-by: Lauro Ramos Venancio <lauro.venancio@incognia.com>
Closes #8590
"
This patchset contains 3 main improvements to STCS get_buckets
implementation and algorithm:
1. Consider only current bucket for each sstable.
No need to scan all buckets using a map
since the inserted sstables are sorted by size.
2. Use double precision for keeping bucket average size.
Prevent rounding error accumulation.
3. Don't let the bucket average drift too high.
As we insert increasingly larger sstables into a bucket,
its average size drifts up and eventually this may break
the bucket invariant that all sstables in the bucket should
be within the (bucket_low, bucket_high) range relative
to the bucket average.
Test: unit(dev)
DTest: compaction_test.py:TestCompaction_with_SizeTieredCompactionStrategy,
compaction_additional_test.py:CompactionAdditionalStrategyTests_with_SizeTieredCompactionStrategy
Fixes #8584
"
* tag 'stcs-buckets-v3' of github.com:bhalevy/scylla:
compaction: size_tiered_compaction_strategy: get_buckets: fixup indentation
compaction: size_tiered_compaction_strategy: get_buckets: don't let the bucket average drift too high
compaction: size_tiered_compaction_strategy: get_buckets: keep bucket average size as double precision floating point number
compaction: size_tiered_compaction_strategy: get_buckets: rename old_average_size to bucket_average_size
compaction: size_tiered_compaction_strategy: get_buckets: consider only current bucket for each sstable
Since Linux 5.12 [1], XFS is able to asynchronously overwrite
sub-block ranges without stalling. However, we want good performance
on older Linux versions, so this patch reduces the block size to the
minimum possible.
That turns out to be 1024 for crc-protected filesystems (which we want)
and it can also not be smaller than the sector size. So we fetch the
sector size and set the block size to that if it is larger than 512.
Most SSDs have a sector size of 512, so this isn't a problem.
Tested on AWS i3.large.
Fixes#8156.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ed1128c2d0c87e5ff49c40f5529f06bc35f4251b
Closes #8585
Every time db/config.hh is modified (e.g., to add a new configuration
option), 110 source files need to be recompiled. Many of those 110 didn't
really care about configuration options, and just got the dependency
accidentally by including some other header file.
In this patch, I remove the include of "db/config.hh" from all header
files. It is only needed in source files - and header files only
need forward declarations. In some cases, source files were missing
certain includes which they got incidentally from db/config.hh, so I
had to add these includes explicitly.
After this patch, the number of source files that get recompiled after a
change to db/config.hh goes down from 110 to 45.
It also means that 65 source files now compile faster because they don't
include db/config.hh and whatever it included.
Additionally, this patch also eliminates a few unnecessary inclusions
of database.hh in other header files, which can use a forward declaration
or database_fwd.hh. Some of the source files including one of those
header files relied on one of the many header files brought in by
database.hh, so we need to include those explicitly.
In view_update_generator.hh something interesting happened - it *needs*
database.hh because of code in the header file, but only included
database_fwd.hh, and the only reason this worked was that the files
including view_update_generator.hh already happened to unnecessarily
include database.hh. So we fix that too.
Refs #1
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210505121830.964529-1-nyh@scylladb.com>
As far as I can tell, we never documented the requirement for self-contained
headers in our coding style. So let's do it now, and explain the
"ninja dev-headers" command and how to use it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210505120908.963388-1-nyh@scylladb.com>
Using integer division loses accuracy by rounding down the result.
Each time we calculate:
```
auto total_size = bucket.size() * old_average_size;
auto new_average_size = (total_size + size) / (bucket.size() + 1);
```
We accumulate the rounding error.
total_size might be too small since old_average_size was previously
rounded down, and then new_average_size is rounded down again.
Rather than trying to compensate for the rounding errors
by e.g. adding size / 2 to the dividend, simply keep the average
as a double precision number.
Note that we multiply old_average_size by options.bucket_{low,high},
that are double precision too so the size comparisons
are already using FP instructions implicitly.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Since now it became a reference used to update the bucket's average size
after a new sstable is inserted into it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Since the sstables are sorted in increasing size order
there is no need to consider all buckets to find a matching one.
Instead, just consider the most recently inserted bucket.
Once we see an sstable size outside the allowed range for this bucket,
create a new bucket and consider this one for the next sstable.
Note, `old_average_size` should be renamed since this change
turns it into a reference and it's assigned with the new average_size.
This patch keeps the old name to reduce the churn. The following
patch will do only the rename.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Defer picking the appropriate max result size for unlimited queries to
the database, which is already the place where we make query classifying
decisions. This move means that all these decisions are now centralized
in the database, not scattered in different places and fixing one fixes
all users.
Similar to the already existing get_reader_concurrency_semaphore(),
this method determines the appropriate max result size for the query
class, which is deduced from the current scheduling group. This method
shares its scheduling group -> query class association mechanism with
the above mentioned semaphore getter.
Every time db/config.hh is modified (e.g., to add a new configuration
option), 110 source files need to be recompiled. Many of those 110 didn't
really care about configuration options, and just got the dependency
accidentally by including some other header file.
In this patch, I remove the include of "db/config.hh" from all header
files. It is only needed in source files - and header files only
need forward declarations. In some cases, source files were missing
certain includes which they got incidentally from db/config.hh, so I
had to add these includes explicitly.
After this patch, the number of source files that get recompiled after a
change to db/config.hh goes down from 110 to 45.
It also means that 65 source files now compile faster because they don't
include db/config.hh and whatever it included.
Additionally, this patch also eliminates a few unnecessary inclusions
of database.hh in other header files, which can use a forward declaration
or database_fwd.hh. Some of the source files including one of those
header files relied on one of the many header files brought in by
database.hh, so we need to include those explicitly.
In view_update_generator.hh something interesting happened - it *needs*
database.hh because of code in the header file, but only included
database_fwd.hh, and the only reason this worked was that the files
including view_update_generator.hh already happened to unnecessarily
include database.hh. So we fix that too.
Refs #1
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210505102111.955470-1-nyh@scylladb.com>
Into a local function. In the next patch we want to add another method
which needs to classify queries based on the current scheduling group,
so prepare for sharing this logic.
Instructions retired per op is a much more stable metric than time per op
(inverse throughput) since it isn't much affected by changes in
CPU frequency or other load on the test system (it's still somewhat
affected since a slower system will run more reactor polls per op).
It's also less indicative of real performance, since it's possible for
fewer instructions to execute in more time than more instructions,
but that isn't an issue for comparative tests.
This allows incremental changes to the code base to be compared with
more confidence.
Current results are around 55k instructions per read, and 52k for writes.
Closes #8563
* github.com:scylladb/scylla:
test: perf: tidy up executor_stats snapshot computation
test: perf: report instructions retired per operations
test: perf: add RAII wrapper around Linux perf_event_open()
test: perf: make executor_stats_snapshot() a member function of executor
As we are now serially adding commands with consecutive integers, there
is no need to build vectors of commands. Remove helper.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Before this change, `cdc$deleted_` columns were all `NULL` in pre-images. Lack of such information made it hard to correctly interpret the pre-image rows, for example:
```
INSERT INTO tbl(pk, ck, v, v2) VALUES (1, 1, null, 1);
INSERT INTO tbl(pk, ck, v2) VALUES (1, 1, 1);
```
For this example, pre-image generated for the second operation would look like this (in both `true` and `full` pre-image mode):
```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```
`v=NULL` has two meanings:
1. If pre-image was in `true` mode, `v=NULL` describes that v was not affected (affected columns: pk, ck, v2).
2. If pre-image was in `full` mode, `v=NULL` describes that v was equal to `NULL` in the pre-image.
Therefore, to properly decode pre-images you would need to know which pre-image mode was configured on the CDC-enabled table at the moment this CDC log row was inserted. There is no way to determine such information (you can only check the current mode of pre-image).
A solution to this problem is to fill in the `cdc$deleted_` columns for pre-images. After this PR, for the `INSERT` described above, CDC now generates the following log row:
If in pre-image 'true' mode:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```
If in pre-image 'full' mode:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=true, v2=1
```
A client library can now properly decode a pre-image row. If it sees a `NULL` value, it can check the `cdc$deleted_` column to determine whether this `NULL` value was part of the pre-image or was omitted due to not being an affected column in the delta operation.
No such change is necessary for the post-image rows, as those images are always generated in the `full` mode.
Additional example:
Additional example of trouble decoding pre-images before this change.
tbl2 - `true` pre-image mode, tbl3 - `full` pre-image mode:
```
INSERT INTO tbl2(pk, ck, v, v2) VALUES (1, 1, 5, 1);
INSERT INTO tbl3(pk, ck, v, v2) VALUES (1, 1, null, 1);
```
```
INSERT INTO tbl2(pk, ck, v2) VALUES (1, 1, 1);
```
generated pre-image:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```
```
INSERT INTO tbl3(pk, ck, v2) VALUES (1, 1, 1);
```
generated pre-image:
```
pk=1, ck=1, v=NULL, cdc$deleted_v=NULL, v2=1
```
Both pre-images look the same, but:
1. `v=NULL` in tbl2 describes v being omitted from the pre-image.
2. `v=NULL` in tbl3 describes v being `NULL` in the pre-image.
Closes #8568
* github.com:scylladb/scylla:
cdc: log: assert post_image is always in full mode
cdc: tests: check cdc$deleted_ columns in images
cdc: log: fill cdc$deleted_ columns in pre-images
Add a test that checks whether the cdc$deleted_ columns are properly
filled in the pre/post-image rows.
This test checks tables with only atomic columns, tables with frozen
collections and non-frozen collections. The test is performed with
both 'true' pre-image mode and 'full' pre-image mode.
The tests, when added, were not named kosher (*_test), which the
runner apparently, quaintly, requires in order to pick them up (instead
of the more sensible *.cql).
Thus, the tests were never run beyond initial creation, and also
bit-rotted slightly during behaviour changes.
Renamed and re-resulted.
Closes #8581
"
Recent changes in seastar added the ability for data sinks to
advertise the buffer size up to the stream level. This change was
needed to make the output stack honor the io-queue's max request
length. There are two more places left to patch.
The first is the sstables checksumming writer. This is the sink
implementation that has another sink inside. So this one is patched
to report up (to the output stream) the buffer size from the lower
sink (which is a file data sink that already "knows" the maximum IO
lengths).
The second one is the compress sink, but this sink embeds an output
stream inside, so even if it's working with larger buffers, that
inner stream will split them properly. So this place is patched just
to stop using the deprecated output stream constructor.
tests: unit(dev)
"
* 'br-streams-napi' of https://github.com/xemul/scylla:
sstables: Make checksum sink report buffer size from lower sink
sstables: Report buffer size from compressed file sink
The checksum sink carries another sink on board and forwards
the put buffers lower, so there's no point in making these
two have different buffer sizes. This is what really happens
now, but this change makes this more explicit and makes the
checksumming code conform to the new output stream API.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This change just moves the place from which the output_stream
knows the compression::uncompressed_chunk_length() value.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Both hinted handoff and repair are meant to improve the consistency of the cluster's data. HH does this by storing records of failed replica writes and replaying them later, while repair goes through all data on all participating replicas and makes sure the same data is stored on all nodes. The former is generally cheaper and sometimes (but not always) can bring back full consistency on its own; repair, while being more costly, is a sure way to bring back current data to full consistency.
When hinted handoff and repair are running at the same time, some of the work can be unnecessarily duplicated. For example, if a row is repaired first, then hints towards it become unnecessary. However, repair needs to do less work if data already has good consistency, so if hints finish first, then the repair will be shorter.
This PR introduces a possibility to wait for hints to be replayed before continuing with user-issued repair. The coordinator of the repair operation asks all nodes participating in the repair operation (including itself) to mark a point at the end of all hint queues pointing towards other nodes participating in repair. Then, it waits until hint replay in all those queues reaches the marked point, or a configured timeout expires.
This operation is currently opt-in and can be turned on by setting the `wait_for_hint_replay_before_repair_in_ms` config option to a positive value.
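For example, a scylla.yaml fragment enabling a two-minute wait (the option name is taken from the description above; the value is just an illustration):

```yaml
# A positive value enables waiting for hints to be replayed before a
# user-issued repair starts; the value is the timeout in milliseconds.
wait_for_hint_replay_before_repair_in_ms: 120000
```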
Fixes #8102
Tests:
- unit(dev)
- some manual tests:
- shutting down repair coordinator during hints replay,
- shutting down node participating in repair during hints replay,
Closes #8452
* github.com:scylladb/scylla:
repair: introduce abort_source for repair abort
repair: introduce abort_source for shutdown
storage_proxy: add abort_source to wait_for_hints_to_be_replayed
storage_proxy: stop waiting for hints replay when node goes down
hints: dismiss segment waiters when hint queue can't send
repair: plug in waiting for hints to be sent before repair
repair: add get_hosts_participating_in_repair
storage_proxy: coordinate waiting for hints to be sent
config: add wait_for_hint_replay_before_repair option
storage_proxy: implement verbs for hint sync points
messaging_service: add verbs for hint sync points
storage_proxy: add functions for syncing with hints queue
db/hints: make it possible to wait until current hints are sent
db/hints: add a metric for counting processed files
db/hints: allow to forcefully update segment list on flush
Add support for configuration change on leader.
Keep track of servers in config in test.
Add a dummy entry to confirm configuration changed. If the add fails,
because the old leader was not in the new config and stepped down, the
config is considered changed, too.
Add a test with some configuration changes.
Add a test cycling every scenario for 1 of 4 nodes removed.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Use a special value as dummy entry to be ignored when seen in state
machine input.
Ignore dummy entries for count.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Before this change the default was prevote enabled.
With this change each test is run with and without prevote.
This doubles the number of test cases.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
The test suite requires an initial leader and at the moment it's always
just 0. Make it default and simplify code.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
If a leader was already disconnected, the election of a new leader could
re-connect it. Save the original connectivity and restore it when done
electing the new leader.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Use the new specific connectivity to manage old leader disconnection
more precisely.
This fixes having elections where the vote of the old leader is required
for quorum. For example {A,B} and we want to switch leader. For B to
become candidate it has to see A as down. Then A has to see B's request
for vote, and vote for B.
So to handle the general case, the old leader needs to first be
disconnected from all nodes to make the desired node a candidate, and
then be connected only to the desired candidate (else, other nodes would
see the new candidate as disrupting a live leader).
Also, there might be stray messages from the former leader. These could
revert the candidate to follower. To handle this, this patch retries
the process until the desired node becomes leader.
The helper function elect_me_leader() is split and renamed to
wait_until_candidate() and wait_election_done(). The former ticks until
the node is a candidate and the latter waits until a candidate either
becomes a leader or reverts to follower.
The existing etcd test workaround of incrementing from n=2 to n=3 nodes
is corrected back to original n=2.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Add 2 helper functions for making nodes reach the timeout threshold and
for electing a specific node.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Replace simple full disconnect of a node with specific from -> to
disconnection tracking.
This will help electing new leaders.
Say there are {A,B,C} with A leader and we want to elect B.
Before this patch, we would disconnect A, run an election with just
{B,C}, and then re-connect A.
If we have {A,B} and want to elect B, this won't work as B needs 2/2+1
votes and A is disconnected, even if we made A step down. This patch
corrects this shortcoming. (@gleb-cloudius)
With this patch, we can specify other followers (not the previous or
next leader) to not see the old leader, but the new and old leaders see
each other just fine. In the example {A,B,C} above we can cut A<->B
specifically.
Also, this is closer to etcd testing and should help porting cases.
NOTE: in the current test implementation failure_detector reports
node.is_alive(other_node) if there is a connection both ways.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Checksum was removed so undo support for multiple versions added in:
test: add support for different state machines
43dc5e7dc2
NOTE: as there is a test with custom total_values, expected value cannot
be static const anymore. (line 630)
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>