scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-30 19:46:48 +00:00

Author	SHA1	Message	Date
Pavel Solodovnikov	2d9e94f050	raft: update README.md with info on RPC server address mappings Describe the high-level scheme of managing RPC mappings and also expand on the introduction of "expirable" RPC mappings concept and why these are needed. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-26 22:47:13 +03:00
Pavel Solodovnikov	f61206e483	raft: wire up `rpc::add_server` and `rpc::remove_server` for configuration changes Raft instance needs to update RPC subsystem on changes in configuration, so that RPC can deliver messages to the new nodes in configuration, as well as dispose of the old nodes. I.e. the nodes which are not the part of the most recent configuration anymore. The effective scope of RPC mappings is limited by the piece of code which sends messages to both the "new" nodes (which are added to the cluster with the most recent configuration change) and the "old" nodes which are removed from the cluster. Until the messages are successfully delivered to at least the majority of "old" nodes and we have heard back from them, the mappings should be kept intact. After that point the RPC mappings for the removed nodes are no longer of interest and thus can be immediately disposed. Tests: unit(dev) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-26 22:47:09 +03:00
Pavel Solodovnikov	16d9e8e9af	raft/fsm: add optional `rpc_configuration` field to fsm_output The field is set in `fsm.get_output` whenever `_log.last_conf_idx()` or the term changes. Also, add `_last_conf_idx` and `_last_term` to `fsm::last_observed_state`, they are utilized in the condition to evaluate current rpc configuration in `fsm.get_output()`. This will be used later to update rpc config state stored in `server_impl` and maintain rpc address map. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-26 22:47:05 +03:00
Pavel Solodovnikov	19cc85b3b6	raft: maintain current rpc context in `server_impl` Introduce rpc server_address that represents the last observed state of address mappings for RPC module. It does not correspond to any kind of configuration in the raft sense, just an artificial construct corresponding to the largest set of server addresses coming from both previous and current raft configurations (to be able to contact both joining and leaving servers). This will be used later to update rpc module mappings when cluster configuration changes. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-26 20:22:44 +03:00
Pavel Solodovnikov	8799ccbab0	raft: use `.contains` instead of `.count` for std::set in `raft::configuration::diff` `std::unordered_set::contains` is introduced in C++20 and provides clearer semantics to check existence of a given element in a set. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-26 20:22:44 +03:00
Pavel Solodovnikov	7c229998e8	raft: unit-tests for `raft_address_map` Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-26 20:22:44 +03:00
Pavel Solodovnikov	3c4d46728d	raft: support expiring server address mappings for rpc module This patch introduces `raft_address_map` class to abstract the notion of expirable address mappings for a raft rpc module. In Raft an instance may need to communicate with a peer outside its current configuration. This may happen, e.g., when a follower falls out of sync with the majority and then a configuration is changed and a leader not present in the old configuration is elected. The solution is to introduce the concept of "expirable" updates to the RPC subsystem. When RPC receives a message from an unknown peer, it also adds the return address of the peer to the address map with a TTL. Should we need to respond to the peer, its address will be known. An outgoing communication to an unconfigured peer is impossible. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-03-26 20:22:44 +03:00
Tomasz Grabiec	ef06a939c4	Merge "raft: seven etcd unit tests ported" from Alejo Seven etcd unit tests as boost tests. * alejo/raft-tests-etcd-08-v4-communicate-v5: raft: etcd unit tests: test proposal handling scenarios raft: etcd unit tests: test old messages ignored raft: etcd unit tests: test single node precandidate raft: etcd unit tests: test dueling precandidates raft: etcd unit tests: test dueling candidates raft: etcd unit tests: test cannot commit without new term raft: etcd unit tests: test single node commit raft: etcd unit tests: update test_leader_election_overwrite_newer_logs raft: etcd unit tests: fix test_progress_leader raft: testing: log comparison helper functions raft: testing: helper to make fsm candidate raft: testing: expose log for test verification raft: testing: use server_address_set raft: testing: add prevote configuration raft: testing: make become_follower() available for tests	2021-03-25 20:27:07 +01:00
Alejo Sanchez	ace0ee514f	raft: etcd unit tests: test proposal handling scenarios TestProposal For multiple scenarios, check proposal handling. Note, instead of expecting an explicit result for each specified case, the test automatically checks for expected behavior when quorum is reached or not. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-25 15:04:29 -04:00
Alejo Sanchez	77163ea76a	raft: etcd unit tests: test old messages ignored TestOldMessages Checks an append request from a leader from a previous term is ignored. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-25 15:04:29 -04:00
Alejo Sanchez	bf65b19803	raft: etcd unit tests: test single node precandidate TestSingleNodePreCandidate Checks a single node configuration with precandidate on works to automatically elect the node. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-25 15:04:29 -04:00
Alejo Sanchez	de7051467b	raft: etcd unit tests: test dueling precandidates TestDuelingPreCandidates In a configuration of 3 nodes, two nodes don't see each other and they compete for leadership. Loser (3) should revert to follower when prevote is rejected and revert to term 1. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-25 15:04:29 -04:00
Alejo Sanchez	aa7d23f86b	raft: etcd unit tests: test dueling candidates TestDuelingCandidates In a configuration of 3 nodes, two nodes don't see each other and they compete for leadership. Once reconnected, loser should not disrupt. But note it will remain candidate with current algorithm without prevoting and other fsms will not bump term. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-25 15:04:29 -04:00
Alejo Sanchez	1eac94e7d6	raft: etcd unit tests: test cannot commit without new term TestCannotCommitWithoutNewTermEntry tests the entries cannot be committed when leader changes, no new proposal comes in and ChangeTerm proposal is filtered. NOTE: this doesn't check committed but it's implicit for next round; this could also use communicate() providing committed output map Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-25 15:04:29 -04:00
Alejo Sanchez	b421fe3605	raft: etcd unit tests: test single node commit Port etcd TestSingleNodeCommit In a single node configuration elect the node, add 2 entries and check number of committed entries. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-25 15:04:29 -04:00
Alejo Sanchez	9b4538476b	raft: etcd unit tests: update test_leader_election_overwrite_newer_logs Make test_leader_election_overwrite_newer_logs use newer communicate() and other new helpers. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-25 15:04:29 -04:00
Alejo Sanchez	368eec1190	raft: etcd unit tests: fix test_progress_leader Make implementation follow closer to original test. Use newer boost test helpers. NOTE: in etcd it seems a leader's self progress is in PIPELINE state. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-25 15:04:28 -04:00
Alejo Sanchez	ba29970e29	raft: testing: log comparison helper functions Two helper functions to compare logs. For now only index, term, and data type are used. Data content comparison does not seem to be necessary for now. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-25 15:04:28 -04:00
Alejo Sanchez	aeab4cf4a9	raft: testing: helper to make fsm candidate Current election_timeout() helper might bump the term twice. It's convenient and less error prone to have a more fine grained helper that stops right when candidate state is reached. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-25 15:04:19 -04:00
Alejo Sanchez	7a6616f1cb	raft: testing: expose log for test verification Let derived classes access the log to verify its contents. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-25 15:03:46 -04:00
Alejo Sanchez	05b1f57e67	raft: testing: use server_address_set Use server_address_set in local namespace for brevity. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-25 15:01:12 -04:00
Alejo Sanchez	9d0a7d8ccf	raft: testing: add prevote configuration Provide a generic prevote configuration for tests. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-25 15:00:28 -04:00
Dejan Mircevski	b2a04985f7	cql-pytest: Drop needless INSERT in test_null One INSERT statement was unnecessary for the test, so delete it. Another was necessary, so explain it. Tests: cql-pytest/test_null on both Scylla and Cassandra Signed-off-by: Dejan Mircevski <dejan@scylladb.com> Closes #8304	2021-03-25 16:37:00 +01:00
Tomasz Grabiec	7b30d31d77	Merge "raft: test configuration changes" from Kostja Test raft configuration changes: a node with empty configuration, transitioning to an entirely different cluster, transitioning in presence of down nodes, leader change during configuration change, stray replies, etc. * scylla-dev/raft-empty-confchange-v5: (21 commits) raft: (testing) stray replies from removed followers raft: always return a non-zero configuration index from the log raft: (testing) leader change during configuration change raft: (testing) test confchange {ABCDE} -> {ABCDEFG} raft: (testing) test confchange {ABCDEF} -> {ABCGH} raft: (testing) test confchange {ABC} -> {CDE} raft: (testing) test confchange {AB} -> {CD} raft: (testing) test confchange {A} -> {B} raft: (testing) test a server with empty configuration raft: (testing) introduce testing utilities raft: (testing) simplify id allocation in test raft: (testing) add select_leader() helper raft: (testing) introduce communicate() helper raft: (testing) style cleanup in raft_fsm_test raft: (testing) fix bug in election_threshold raft: minor style changes & comments raft: do not assert when transitioning to empty config raft: assert we never apply a snapshot over uncommitted entries (leader) raft: improve tracing raft: add fsm_output::empty() helper to aid testing ...	2021-03-25 14:01:09 +01:00
Avi Kivity	46185d7d82	Update tools/jmx submodule * tools/jmx 9c687b5...440313e (1): > storage_service: Add a generic toppartitions endpoint	2021-03-25 12:36:10 +02:00
Alejo Sanchez	7e6807e8fc	raft: testing: make become_follower() available for tests Some etcd tests need to force a follower with a specific leader. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2021-03-24 19:11:09 -04:00
Piotr Wojtczak	c1daf2bb24	column_family: Make toppartitions queries more generic Right now toppartitions can only be invoked on one column family at a time. This change introduces a natural extension to this functionality, allowing to specify a list of families. We provide three ways for filtering in the query parameter "name_list": 1. A specific column family to include in the form "ks:cf" 2. A keyspace, telling the server to include all column families in it. Specified by omitting the cf name, i.e. "ks:" 3. All column families, which is represented by an empty list The list can include any amount of one or both of the 1. and 2. option. Fixes #4520 Closes #7864	2021-03-24 17:54:05 +02:00
Raphael S. Carvalho	bcbb39999b	LCS: Fix terrible write amplification when reshaping level 0 LCS reshape is basically 'major compacting' level 0 until it contains less than N sstables. That produces terrible write amplification, because any given byte will be compacted (initial # of sstables / max_threshold (32)) times. So if L0 initially contained 256 ssts, there would be a WA of about 8. This terrible write amplification can be reduced by performing STCS instead on L0, which will leave L0 in a good shape without hurting WA as it happens now. Fixes #8345. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210322150655.27011-1-raphaelsc@scylladb.com>	2021-03-24 17:48:50 +02:00
Piotr Sarna	24a43681b4	thrift: handle gate closed exception on retry During the retry mechanism, it's possible to encounter a gate closed exception, which should simply be ignored, because it indicates that the server is shutting down. Closes #8337	2021-03-24 17:41:58 +02:00
Konstantin Osipov	1a1d7ab662	raft: (testing) stray replies from removed followers	2021-03-24 14:05:55 +03:00
Konstantin Osipov	0295163f6f	raft: always return a non-zero configuration index from the log Return snapshot index for last configuration index if there is no configuration in the log.	2021-03-24 14:05:55 +03:00
Konstantin Osipov	cec59e53ef	raft: (testing) leader change during configuration change	2021-03-24 14:05:36 +03:00
Pavel Emelyanov	37bec6fb76	commitlog: Open files with append_is_unlikely This open option tells seastar that the file in question will be truncated to the needed size right at once and all the subsequent writes will happen within this size. This hint turns off append optimization in seastar that's not that cheap and helps to save a few cpu cycles. The option was introduced in seastar by 8bec57bc. tests: unit(dev), dtest(commitlog: test_batch_commitlog, test_periodic_commitlog, test_commitlog_replay_on_startup) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210323115409.31215-1-xemul@scylladb.com>	2021-03-24 13:05:33 +02:00
Konstantin Osipov	a203c8833f	raft: (testing) test confchange {ABCDE} -> {ABCDEFG}	2021-03-24 14:04:18 +03:00
Konstantin Osipov	40e117d36e	raft: (testing) test confchange {ABCDEF} -> {ABCGH}	2021-03-24 14:04:18 +03:00
Konstantin Osipov	14b2d5d308	raft: (testing) test confchange {ABC} -> {CDE} Test leader change during configuration change.	2021-03-24 14:04:18 +03:00
Konstantin Osipov	3c718a175e	raft: (testing) test confchange {AB} -> {CD}	2021-03-24 14:04:18 +03:00
Konstantin Osipov	2e30c8540e	raft: (testing) test confchange {A} -> {B} Test non-restart and leader restart scenario.	2021-03-24 14:04:18 +03:00
Konstantin Osipov	e23da06fef	raft: (testing) test a server with empty configuration Try becoming a candidate for such server, or adding it to an existing configuration.	2021-03-24 14:04:18 +03:00
Konstantin Osipov	b18599c630	raft: (testing) introduce testing utilities Add a discrete_failure_detector, to be able to mark a single server dead.	2021-03-24 14:04:18 +03:00
Konstantin Osipov	8d26d24370	raft: (testing) simplify id allocation in test	2021-03-24 14:04:18 +03:00
Konstantin Osipov	322a15ec33	raft: (testing) add select_leader() helper With leader stepdown extension, leadership transfer can happen to any follower with long enough log. Add a helper to select that follower from a list.	2021-03-24 14:04:18 +03:00
Konstantin Osipov	4a00da276d	raft: (testing) introduce communicate() helper Allow to communicate between arbitrary number of FSMs. Drop messages to FSMs which are not in the argument list. Stop communication upon predicate.	2021-03-24 14:04:18 +03:00
Konstantin Osipov	7182323ac0	raft: (testing) style cleanup in raft_fsm_test 1) Avoid memory violations on test failure 2) Print better diagnostics on failure (BOOST_CHECK_EQUAL vs BOOST_CHECK)	2021-03-24 14:04:18 +03:00
Konstantin Osipov	f0f25bf7fb	raft: (testing) fix bug in election_threshold election_threshold was ticking one extra tick, causing the follower to become candidate in some cases. This was rendering tests unstable.	2021-03-24 14:04:18 +03:00
Konstantin Osipov	00d7379bc9	raft: minor style changes & comments Add comments explaining the rationale from transfer_leadership() (more PhD quotes), encapsulate stable leader check in tick() into a lambda and add more detailed comments to it.	2021-03-24 14:04:18 +03:00
Piotr Sarna	06131e21a3	configure.py: add customizing clang inline threshold Until clang figures things out with the now infamous `-llvm -inline-threshold X` parameter, let's allow customizing it to make the compilation of release builds less tiresome. For instance, scylla's row_level.o object file currently does not compile for me until I decrease the inline threshold to a low value (e.g. 50). Message-Id: <54113db9438e3c3371410996f49b7fbe9a1b7257.1616422536.git.sarna@scylladb.com>	2021-03-24 12:09:26 +02:00
Tomasz Grabiec	9272e74e8c	sstable: writer: ka/la: Write row marker cell after row tombstone Row marker has a cell name which sorts after the row tombstone's start bound. The old code was writing the marker first, then the row tombstone, which is incorrect. This was harmless to our sstable reader, which recognized both as belonging to the current clustering row fragment, and collects both fine. However, if both atoms trigger creation of promoted index blocks, the writer will create a promoted index with entries which violate the cell name ordering. It's very unlikely to run into in practice, since to trigger promoted index entries for both atoms, the clustering key would be so large so that the size of the marker cell exceeds the desired promoted index block size, which is 64KB by default (but user-controlled via column_index_size_in_kb option). 64KB is also the limit on clustering key size accepted by the system. This was caught by one of our unit tests: sstable_conforms_to_mutation_source_test ...which runs a battery of mutation reader tests with various desired promoted index block sizes, including the target size of 1 byte, which triggers an entry for every atom. The test started to fail for some random seeds after commit `ecb6abe` inside the test_streamed_mutation_forwarding_is_consistent_with_slicing test case, reporting a mutation mismatch in the following line: assert_that(sliced_m).is_equal_to(fwd_m, slice_with_ranges.row_ranges(*m.schema(), m.key())); It compares mutations read from the same sstable using different methods, slicing using clustering key restrictions, and fast forwarding. The reported mismatch was that fwd_m contained the row marker, but sliced_m did not. The sstable does contain the marker, so both reads should return it. After reverting the commit which introduced dynamic adjustments, the test passes, but both mutations are missing the marker, both are wrong! 
They are wrong because the promoted index contains entries whose starting positions violate the ordering, so binary search gets confused and selects the row tombstone's position, which is emitted after the marker, thus skipping over the row marker. The explanation for why the test started to fail after dynamic adjustments is the following. The promoted index cursor works by incrementally parsing buffers fed by the file input stream. It first parses the whole block and then does a binary search within the parsed array. The entries which cursor touches during binary search depend on the size of the block read from the file. The commit which enabled dynamic adjustments causes the block size to be different for subsequent reads, which allows one of the reads to walk over the corrupted entries and read the correct data by selecting the entry corresponding to the row marker. Fixes #8324 Message-Id: <20210322235812.1042137-1-tgrabiec@scylladb.com>	2021-03-23 16:13:47 +01:00
Tomasz Grabiec	235154cca5	Merge "Teach scylla-gdb new trees in row cache" from Pavel Emelyanov Clustering rows are now stored in intrusive btree, cells are now stored in radix tree, but scylla-gdb tries to walk the intrusive_set and vector/set union respectively. For the former case -- the btree wrapper is introduced. For the latter -- compiler optimizes-away too many important bits and walking the tree turns into a bunch of hard-coded hacks and reinterpret-casts. Until better solution is found, just print the address of the tree root. * xemul/br-gdb-btree-rows: gdb: Show address of the row::_cells tree (or "empty" mark) gdb: Add support for intrusive B tree gdb: Use helper to get rows from mutation_partition	2021-03-23 12:50:17 +01:00
Pavel Emelyanov	1cd9ec952f	gdb: Show address of the row::_cells tree (or "empty" mark) Currently clang optimizes-out lots of critical stuff from compact radix tree. Until we find out the way to walk the tree in gdb, it's better to at least show where it is in memory. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-03-23 13:29:40 +03:00
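
Commit 3c4d46728d above describes "expirable" address mappings: a configured peer gets a permanent mapping, while the return address of an unknown sender is remembered only for a TTL so that a reply can still be routed. The following is a minimal sketch of that idea, not the actual `raft_address_map` from the tree. The class name `expiring_address_map`, its method names, the `server_id` alias, the string addresses, and the explicit `now` parameter are all assumptions made for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical types: the real code uses its own id and address types.
using server_id = std::uint64_t;
using time_point = std::chrono::steady_clock::time_point;

// Sketch of an address map with permanent and expiring entries.
class expiring_address_map {
    struct entry {
        std::string addr;
        std::optional<time_point> expires_at; // nullopt means permanent
    };
    std::unordered_map<server_id, entry> _map;
    std::chrono::seconds _ttl;

public:
    explicit expiring_address_map(std::chrono::seconds ttl) : _ttl(ttl) {}

    // Mapping for a peer in the current configuration; kept until erased.
    void set_permanent(server_id id, std::string addr) {
        _map[id] = entry{std::move(addr), std::nullopt};
    }

    // Mapping learned from a message sent by an unknown peer.
    // An existing permanent mapping is left untouched.
    void set_expiring(server_id id, std::string addr, time_point now) {
        auto it = _map.find(id);
        if (it != _map.end() && !it->second.expires_at) {
            return;
        }
        _map[id] = entry{std::move(addr), now + _ttl};
    }

    // Look up an address; an expired entry is dropped and reported as absent.
    std::optional<std::string> find(server_id id, time_point now) {
        auto it = _map.find(id);
        if (it == _map.end()) {
            return std::nullopt;
        }
        if (it->second.expires_at && *it->second.expires_at <= now) {
            _map.erase(it);
            return std::nullopt;
        }
        return it->second.addr;
    }

    // Dispose of a mapping, e.g. for a server removed from the configuration.
    void erase(server_id id) { _map.erase(id); }
};
```

Passing `now` explicitly keeps the sketch deterministic for testing; a real implementation would more likely read a clock or use timers. Expiry here is lazy (checked on lookup), which is a simplification chosen for brevity and not a claim about how the commit implements it.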

25698 Commits