otherwise GCC 13 complains that
```
/home/kefu/dev/scylladb/raft/server.cc:42:15: error: declaration of ‘seastar::promise<void> raft::awaited_index::promise’ changes meaning of ‘promise’ [-Wchanges-meaning]
42 | promise<> promise;
| ^~~~~~~
/home/kefu/dev/scylladb/raft/server.cc:42:5: note: used here to mean ‘class seastar::promise<void>’
42 | promise<> promise;
| ^~~~~~~~~
```
see also cd4af0c722
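For illustration, a self-contained reproduction of the diagnostic and the
usual way around it; the `seastarish` namespace below is a stand-in, not the
real seastar code, and the actual commit may resolve it differently (e.g. by
renaming the member):
```
// Stand-in for the real seastar namespace; only the name collision matters here.
namespace seastarish {
    template <typename T = void> struct promise {};
}
using namespace seastarish;

struct awaited_index {
    // GCC 13 warns on the commented line: the member name `promise` changes
    // the meaning of the unqualified name `promise` used in its own declaration.
    //   promise<> promise;
    //
    // Qualifying the type (or renaming the member) removes the ambiguity:
    seastarish::promise<> promise;
};
```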
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Updated the empty() function in the struct fsm_output to include the
max_read_id_with_quorum field when checking whether the fsm output is
empty or not. The change was made in order to maintain consistency with
the codebase and to make the empty check complete. This change has no
impact on other parts of the codebase.
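A minimal sketch of the idea; apart from max_read_id_with_quorum, the field
names below are illustrative and do not reflect the real fsm_output layout:
```
#include <optional>
#include <vector>

struct fsm_output {
    std::vector<int> log_entries;                    // placeholder for pending output
    std::optional<unsigned> max_read_id_with_quorum; // read id confirmed by a quorum

    bool empty() const {
        // empty() only holds when every piece of pending output is absent,
        // including max_read_id_with_quorum.
        return log_entries.empty() && !max_read_id_with_quorum;
    }
};
```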
Closes #13656
This PR introduces an experimental feature called "tablets". Tablets are
a way to distribute data in the cluster, which is an alternative to the
current vnode-based replication. Vnode-based replication strategy tries
to evenly distribute the global token space shared by all tables among
nodes and shards. With tablets, the aim is to start from a different
side. Divide resources of replica-shard into tablets, with a goal of
having a fixed target tablet size, and then assign those tablets to
serve fragments of tables (also called tablets). This will allow us to
balance the load in a more flexible manner, by moving individual tablets
around. Also, unlike with vnode ranges, tablet replicas live on a
particular shard on a given node, which will allow us to bind raft
groups to tablets. Those goals are not yet achieved with this PR, but it
lays the ground for this.
Things achieved in this PR:
- You can start a cluster and create a keyspace whose tables will use
tablet-based replication. This is done by setting `initial_tablets`
option:
```
CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy',
'replication_factor': 3,
'initial_tablets': 8};
```
All tables created in such a keyspace will be tablet-based.
Tablet-based replication is a trait, not a separate replication
strategy. Tablets don't change the spirit of the replication strategy;
they just alter the way in which data ownership is managed. In theory, we
could use it for other strategies as well, such as
EverywhereReplicationStrategy. Currently, only NetworkTopologyStrategy
is augmented to support tablets.
- You can create and drop tablet-based tables (no DDL language changes)
- DML / DQL work with tablet-based tables
- Replicas for tablet-based tables are chosen from tablet metadata
instead of token metadata
Things which are not yet implemented:
- handling of views, indexes, CDC created on tablet-based tables
- sharding is done using the old method; it ignores the shard allocated in tablet metadata
- node operations (topology changes, repair, rebuild) do not handle tablet-based tables
- not integrated with compaction groups
- tablet allocator piggy-backs on tokens to choose replicas.
Eventually we want to allocate based on current load, not statically
Closes #13387
* github.com:scylladb/scylladb:
test: topology: Introduce test_tablets.py
raft: Introduce 'raft_server_force_snapshot' error injection
locator: network_topology_strategy: Support tablet replication
service: Introduce tablet_allocator
locator: Introduce tablet_aware_replication_strategy
locator: Extract maybe_remove_node_being_replaced()
dht: token_metadata: Introduce get_my_id()
migration_manager: Send tablet metadata as part of schema pull
storage_service: Load tablet metadata when reloading topology state
storage_service: Load tablet metadata on boot and from group0 changes
db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata()
migration_notifier: Introduce before_drop_keyspace()
migration_manager: Make prepare_keyspace_drop_announcement() return a future<>
test: perf: Introduce perf-tablets
test: Introduce tablets_test
test: lib: Do not override table id in create_table()
utils, tablets: Introduce external_memory_usage()
db: tablets: Add printers
db: tablets: Add persistence layer
dht: Use last_token_of_compaction_group() in split_token_range_msb()
locator: Introduce tablet_metadata
dht: Introduce first_token()
dht: Introduce next_token()
storage_proxy: Improve trace-level logging
locator: token_metadata: Fix confusing comment on ring_range()
dht, storage_proxy: Abstract token space splitting
Revert "query_ranges_to_vnodes_generator: fix for exclusive boundaries"
db: Exclude keyspace with per-table replication in get_non_local_strategy_keyspaces_erms()
db: Introduce get_non_local_vnode_based_strategy_keyspaces()
service: storage_proxy: Avoid copying keyspace name in write handler
locator: Introduce per-table replication strategy
treewide: Use replication_strategy_ptr as a shorter name for abstract_replication_strategy::ptr_type
locator: Introduce effective_replication_map
locator: Rename effective_replication_map to vnode_effective_replication_map
locator: effective_replication_map: Abstract get_pending_endpoints()
db: Propagate feature_service to abstract_replication_strategy::validate_options()
db: config: Introduce experimental "TABLETS" feature
db: Log replication strategy for debugging purposes
db: Log full exception on error in do_parse_schema_tables()
db: keyspace: Remove non-const replication strategy getter
config: Reformat
A generic template for defining strongly typed
integer types.
Use it here to replace raft::internal::tagged_uint64.
Will be used for defining gms generation and version
as strong and distinguishable types in following patches.
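A minimal sketch of what such a template can look like (names are
illustrative, not the actual ScyllaDB code): a tag parameter makes
otherwise-identical integer wrappers distinct, mutually unassignable types.
```
#include <compare>
#include <cstdint>

template <typename Tag>
class tagged_uint {
    uint64_t _value = 0;
public:
    tagged_uint() = default;
    explicit tagged_uint(uint64_t v) : _value(v) {}
    uint64_t value() const { return _value; }
    auto operator<=>(const tagged_uint&) const = default;
};

// Distinct types: a generation cannot be accidentally passed as a version.
struct generation_tag {};
struct version_tag {};
using generation = tagged_uint<generation_tag>;
using version    = tagged_uint<version_tag>;
```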
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
at least, we need to access the declarations of exceptions, like
`not_a_leader` and `dropped_entry`, so, instead of relying on
another header to do this job for us, we should include the header
which includes the declarations. so, with this change, "raft.h" is
included explicitly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this is a part of a series migrating from `operator<<(ostream&, ..)` based formatting to fmtlib based formatting. the goal here is to enable fmtlib to print UUID without using ostream<<. also, this change re-implements some formatting helpers using fmtlib for better performance and fewer dependencies on operator<<(). but we cannot drop operator<< at this moment, as quite a few call sites are still using operator<<(ostream&, const UUID&) and operator<<(ostream&, tagged_uuid<T>&). we will address them separately.
* add `fmt::formatter<UUID>`
* add `fmt::formatter<tagged_uuid<T>>`
* implement `UUID::to_string()` using `fmt::to_string()`
* implement `operator<<(std::ostream&, const UUID&)` with `fmt::print()`; this should help improve performance when printing a UUID, as `fmt::print()` does not materialize an intermediate string.
* treewide: use fmtlib when printing UUID
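A minimal sketch of the `fmt::formatter` specialization mentioned in the
first bullet above, written for an illustrative id type rather than the real
utils::UUID; it writes straight into the output iterator, so no intermediate
string is materialized:
```
#include <cstdint>
#include <fmt/core.h>

struct uuid_like {
    uint64_t msb = 0, lsb = 0;   // placeholder for the two 64-bit halves
};

template <>
struct fmt::formatter<uuid_like> {
    constexpr auto parse(fmt::format_parse_context& ctx) { return ctx.begin(); }

    template <typename FormatContext>
    auto format(const uuid_like& id, FormatContext& ctx) const {
        // format_to appends directly to the sink of the enclosing format call.
        return fmt::format_to(ctx.out(), "{:016x}-{:016x}", id.msb, id.lsb);
    }
};

// usage: fmt::print("{}\n", uuid_like{1, 2});
```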
Refs #13245
Closes #13246
* github.com:scylladb/scylladb:
treewide: use fmtlib when printing UUID
utils: UUID: specialize fmt::formatter for UUID and tagged_uuid<>
Add a function that allows waiting for a state change of a raft server.
It is useful for a user that wants to know when a node becomes/stops
being a leader.
Message-Id: <20230316112801.1004602-4-gleb@scylladb.com>
Make tick() and is_leader() part of the API. The first is used externally
already and the other will be used in following patches.
Message-Id: <20230316112801.1004602-3-gleb@scylladb.com>
this change tries to reduce the number of callers using operator<<()
for printing UUID. they are found by compiling the tree after commenting
out `operator<<(std::ostream& out, const UUID& uuid)`. but this change
alone is not enough to drop all callers, as some callers are using
`operator<<(ostream&, const unordered_map&)` and other overloads to
print ranges whose elements contain UUID. so in order to limit the
scope of the change, we are not changing them here.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The code for compare_endpoints originates at the dawn of time (bc034aeaec)
and is called on the fast path from storage_proxy via `sort_by_proximity`.
This series considerably reduces the function's footprint by:
1. carefully coding the many comparisons in the function so as to reduce the number of conditional branches (apparently the compiler isn't doing a good enough job at optimizing it in this case)
2. avoiding an sstring copy in topology::get_{datacenter,rack}
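An illustrative sketch of both ideas with simplified types (this is not the
real locator::topology API):
```
#include <string>

struct node_location {
    std::string dc, rack;
    // Returning by const reference avoids copying a string on every call
    // on the hot path.
    const std::string& get_datacenter() const { return dc; }
    const std::string& get_rack() const { return rack; }
};

// <0 : a1 is closer to `me`, >0 : a2 is closer, 0 : equally close.
// Computing a small score per endpoint and comparing the scores once keeps
// the comparison cascade short.
int compare_endpoints(const node_location& me,
                      const node_location& a1,
                      const node_location& a2) {
    auto score = [&me](const node_location& n) {
        int same_dc = n.get_datacenter() == me.get_datacenter();
        int same_rack = same_dc && n.get_rack() == me.get_rack();
        return same_dc + same_rack;   // 0, 1 or 2
    };
    return score(a2) - score(a1);
}
```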
Closes #12761
* github.com:scylladb/scylladb:
topology: optimize compare_endpoints
to_string: add print operators for std::{weak,partial}_ordering
utils: to_sstring: deinline std::strong_ordering print operator
move to_string.hh to utils/
test: network_topology: add test_topology_compare_endpoints
Clang complains that the captured `this` is not used, like
```
/home/kefu/dev/scylladb/raft/fsm.hh:644:21: error: lambda capture 'this' is not used [-Werror,-Wunused-lambda-capture]
auto visitor = [this, from, msg = std::move(msg)](const auto& state) mutable {
^
/home/kefu/dev/scylladb/raft/server.cc:738:11: note: in instantiation of function template specialization 'raft::fsm::step<raft::append_request>' requested here
_fsm->step(from, std::move(append_request));
^
```
but `step(..)` is a non-static member function of `fsm`, so `this`
is actually used. to silence Clang's warning, let's just reference it
explicitly.
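A simplified illustration of the situation and of the explicit-reference
workaround; the real fsm code is more involved and the exact fix in the
commit may differ:
```
#include <type_traits>

struct append_request {};
struct vote_request {};

struct fsm {
    int _append_count = 0;

    template <typename Message>
    void step(Message msg) {
        auto visitor = [this](const auto& /* state */) {
            if constexpr (std::is_same_v<Message, append_request>) {
                ++_append_count;   // implicit use of the captured `this`
            }
            // In instantiations where the branch above is discarded, the body
            // never touches a member, so Clang may report the captured `this`
            // as unused; referencing it explicitly keeps every instantiation
            // warning-free.
            (void)this;
        };
        visitor(msg);
    }
};

int main() {
    fsm f;
    f.step(append_request{});
    f.step(vote_request{});
}
```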
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
After 5badf20c7a the applier fiber does not
stop after it gets an abort error from a state machine, which may trigger
an assertion because the previous batch is not applied. Fix it.
Fixes #12863
In add_entry_on_leader, after wait_for_memory_permit() resolves but before
the fiber continues to run, the node may stop being the leader and then
become a leader again, which makes the currently held units outdated.
Detect this case by checking the term after the preemption point.
To trigger snapshot limit behavior, provide an error injection to set it
as one-shot.
Note this effectively changes it and there is no revert.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
The docs mention that method, but it doesn't exist. Instead, the
state_machine interface defines a plain .apply() one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #12541
raft_group0 used to register RPC verbs only on shard 0. This worked on
clusters with the same --smp setting on all nodes, since RPCs in this
case are processed on the same shard as the calling code, and
raft_group0 methods only run on shard 0.
A new test test_nodes_with_different_smp was added to identify the
problem. Since --smp can only be specified via the command line, a
corresponding parameter was added to the ManagerClient.server_add
method. It allows overriding the default parameters set by the
SCYLLA_CMDLINE_OPTIONS variable by changing, adding or deleting
individual items.
Fixes: #12252
Closes #12374
* github.com:scylladb/scylladb:
raft: raft_group0, register RPC verbs on all shards
raft: raft_append_entries, copy entries to the target shard
test.py, allow to specify the node's command line in test
If append_entries RPC was received on a non-zero shard, we may
need to pass it to shard zero (or, potentially, some other shard).
The problem is that raft::append_request contains entries in the form
of raft::log_entry_ptr == lw_shared_ptr<log_entry>, which doesn't
support cross-shard reference counting. In debug mode it contains
a special ref-counting facility debug_shared_ptr_counter_type,
which resorts to on_internal_error if it detects such a case.
To solve this, we just copy log entries to the target shard if it
isn't equal to the current one. In most cases, if the --smp setting
is the same on all nodes, the RPC will be handled on shard zero,
so there will be no overhead.
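A sketch of the copying idea under simplified, illustrative types and a
hypothetical local handler; it assumes the Seastar smp::submit_to /
lw_shared_ptr APIs and is not the actual verb handler:
```
#include <cstdint>
#include <utility>
#include <vector>
#include <seastar/core/future.hh>
#include <seastar/core/shared_ptr.hh>   // seastar::lw_shared_ptr
#include <seastar/core/smp.hh>          // seastar::smp::submit_to, this_shard_id

struct log_entry { uint64_t term; uint64_t idx; };
using log_entry_ptr = seastar::lw_shared_ptr<log_entry>;

struct append_request_lite {
    std::vector<log_entry_ptr> entries;
};

// Hypothetical handler running on the shard that owns the raft server.
seastar::future<> handle_locally(append_request_lite req);

// Forward the request to `target`, deep-copying the entries so that no
// lw_shared_ptr is ever ref-counted across shards.
seastar::future<> forward_append(unsigned target, append_request_lite req) {
    if (target == seastar::this_shard_id()) {
        return handle_locally(std::move(req));
    }
    std::vector<log_entry> copies;               // plain values cross the boundary
    copies.reserve(req.entries.size());
    for (const auto& e : req.entries) {
        copies.push_back(*e);
    }
    return seastar::smp::submit_to(target, [copies = std::move(copies)] () mutable {
        append_request_lite local;
        local.entries.reserve(copies.size());
        for (const auto& e : copies) {
            local.entries.push_back(seastar::make_lw_shared<log_entry>(e));
        }
        return handle_locally(std::move(local));
    });
}
```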
We recently (commit 6a5d9ff261) started
to use std::source_location instead of std::experimental::source_location.
However, this does not work on clang 14, because libc++ 12's
<source_location> only works if __builtin_source_location is available,
and that builtin is not available in clang 14.
clang 15 is just three months old, and several relatively-recent
distributions still carry clang 14 so it would be nice to support it
as well.
So this patch adds a trivial compatibility header file which, when
included and compiled with clang 14, aliases the functional
std::experimental::source_location to std::source_location.
It turns out it's enough to include the new header file from three
headers that included <source_location> - I guess all other uses
of source_location depend on those header files directly or indirectly.
We may later need to include the compatibility header file in additional
places, but for now we don't.
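A sketch of what such a compatibility header can look like; the exact
condition and mechanism are illustrative and may differ from the real header:
```
#pragma once

// libc++'s <source_location> relies on the __builtin_source_location builtin,
// which clang 14 does not provide, so fall back to the experimental header
// and alias it (this is the aliasing described above, shown schematically).
#if defined(__clang__) && !__has_builtin(__builtin_source_location)
#include <experimental/source_location>
namespace std {
    using source_location = std::experimental::source_location;
}
#else
#include <source_location>
#endif
```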
Refs #12259
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #12265
Currently, if a node that is outside of the config tries to add an entry
or modify the config, a transient error is returned, and this causes the
node to retry. But the error is not transient. If a node tries to do one
of the operations above, it means it was part of the cluster at some
point; but since a node with the same id should not be added back to a
cluster, if it is not in the cluster now it never will be.
Return a new error, not_a_member, to the caller instead.
Message-Id: <Y42mTOx8bNNrHqpd@scylladb.com>
Pass a change diff into the notification callback,
rather than adding or removing servers one by one, so that
if we need to persist the state, we can do it once per
configuration change, not for every added or removed server.
For now still pass added and removed entries in two separate calls
per single configuration change. This is done mainly to fulfill the
library contract that it never sends messages to servers
outside the current configuration. The group0 RPC
implementation doesn't need the two calls, since it simply
marks the removed servers as expired: they are not removed immediately
anyway, and messages can still be delivered to them.
However, there may be test/mock implementations of RPC which
could benefit from this contract, so we decided to keep it.
When `modify_config` or `add_entry` is forwarded to the leader, it may
reach the node at an "inappropriate" time and result in an exception. There
are two reasons for it - the leader is changing and, in the case of
`modify_config`, another `modify_config` is currently in progress. In both
cases the command is retried, but before this patch there was no delay
before retrying, which could lead to a tight loop.
The patch adds a new exception type `transient_error`. When the client
receives it, it is obliged to retry the request after some delay.
Previously leader-side exceptions were converted to `not_a_leader`,
which is strange, especially for `conf_change_in_progress`.
Fixes: #11564
Closes #11769
* github.com:scylladb/scylladb:
raft: rafactor: remove duplicate code on retries delays
raft: use wait_for_next_tick in read_barrier
raft: wait for the next tick before retrying
Introduce a templated function do_on_leader_with_retries and
use it in add_entries/modify_config/read_barrier. The
function implements the basic retry logic with abort
and leader-change handling, and adds a delay between
iterations to protect against tight loops.
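The retry shape, sketched with plain threads and illustrative names rather
than the real seastar-future-based implementation:
```
#include <chrono>
#include <functional>
#include <stdexcept>
#include <thread>

struct transient_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

template <typename Op>
void do_on_leader_with_retries(Op op,
                               std::function<bool()> aborted,
                               std::chrono::milliseconds delay) {
    while (!aborted()) {
        try {
            op();              // e.g. add_entry / modify_config / read_barrier
            return;
        } catch (const transient_error&) {
            // The leader is changing or another config change is in flight:
            // back off before the next attempt instead of spinning.
            std::this_thread::sleep_for(delay);
        }
    }
    throw std::runtime_error("operation aborted");
}
```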
Replaced the yield on transport_error
with wait_for_next_tick. Added delays for retries, similar
to add_entry/modify_config: we postpone the next
call attempt if we haven't received new information
about the current leader.
When modify_config or add_entry is forwarded
to the leader, it may reach the node at an
"inappropriate" time and result in an exception.
There are two reasons for it - the leader is
changing and, in the case of modify_config, another
modify_config is currently in progress. In
both cases the command is retried, but before
this patch there was no delay before retrying,
which could lead to a tight loop.
The patch adds a new exception type transient_error.
When the client node receives it, it is obliged to retry
the request, possibly after some delay. Previously, leader-side
exceptions were converted to not_a_leader exception,
which is strange, especially for conf_change_in_progress.
We add a delay before retrying in modify_config
and add_entry if the client hasn't received any new
information about the leader since the last attempt.
This can happen if the server
responds with a transient_error with an empty leader
and the current node has not yet learned the new leader.
We don't try to avoid an excessive delay if the newly elected leader
is the same as the previous one; this is supposed to be rare.
Fixes: #11564
The add_entry and modify_config methods sometimes do an rpc to
execute the request on the current leader. If the tcp connection
was broken, a seastar::rpc::closed_error would be thrown to the client.
This exception was not documented in the method comments and the
client could have missed handling it. For example, this exception
was not handled when calling modify_config in raft_group0,
which sometimes broke the removenode command.
An intermittent_connection_error exception was added earlier to
solve a similar problem with the read_barrier method. In this patch it
is renamed to transport_error, as it seems to better describe the
situation, and an explicit specification for this exception
was added - the rpc implementation can throw it if it is not known
whether the call reached the target node and whether any
actions were performed on it.
In case of read_barrier it does not matter and we just retry. In case
of add_entry and modify_config we cannot retry
because the rpc calls are not idempotent, so we convert this
exception to commit_status_unknown, which the client has to handle.
Explicit comments have also been added to raft::server methods
describing all possible exceptions.
We don't want to keep memory we don't use; shrink_to_fit guarantees that.
In fact, boost::deque frees up memory when items are deleted, so this change has little effect at the moment, but it may pay off if we change the container in the future.
Before this patch we could get an OOM if we
received several big commands. The number of
commands was small, but their total size
in bytes was large.
snapshot_trailing_size is needed to guarantee
progress. Without this limit the fsm could
get stuck if the size of the next item is
greater than max_log_size - (size of trailing entries).
applier_fiber could create multiple snapshots between
io_fiber runs. The fsm_output.snp variable was
overwritten by applier_fiber and io_fiber didn't drop
the previous snapshot.
In this patch we introduce the variable
fsm_output.snps_to_drop, store in it
the current snapshot id before applying
a new one, and then sequentially drop them in
io_fiber after storing the last snapshot_descriptor.
_sm_events.signal() is added to fsm::apply_snapshot,
since this method mutates the _output and thus gives a
reason to run io_fiber.
The new test test_frequent_snapshotting demonstrates
the problem by causing frequent snapshots and
setting the applier queue size to one.
Closes #11530
Commitlog imposes a limit on the size of mutations
and throws an exception if it's exceeded. In case of
schema changes before raft this exception was delivered
to the client. Now it happens while saving the raft
command in io_fiber in persistence->store_log_entries
and what the client gets is just a timeout exception,
which doesn't say much about the cause of the problem.
This patch introduces an explicit command size limit
and provides a clear error message in this case.
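Schematically, the check amounts to something like the following; the
exception name and signature are illustrative, not necessarily the real ones:
```
#include <cstddef>
#include <stdexcept>
#include <string>

struct command_is_too_big_error : std::runtime_error {
    command_is_too_big_error(size_t size, size_t limit)
        : std::runtime_error("raft command size " + std::to_string(size) +
                             " exceeds the maximum command size " + std::to_string(limit)) {}
};

// Reject an oversized command up front with a clear message, instead of
// letting the commitlog reject it later and surface as a generic timeout.
void check_command_size(size_t serialized_size, size_t max_command_size) {
    if (serialized_size > max_command_size) {
        throw command_is_too_big_error(serialized_size, max_command_size);
    }
}
```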
Closes #11318
* github.com:scylladb/scylladb:
raft, use max_command_size to satisfy commitlog limit
raft, limit for command size
Changing configuration involves two entries in the log: a 'joint
configuration entry' and a 'non-joint configuration entry'. We use
`wait_for_entry` to wait on the joint one. To wait on the non-joint one,
we use a separate promise field in `server`. This promise wasn't
connected to the `abort_source` passed into `set_configuration`.
The call could get stuck if the server got removed from the
configuration and lost leadership after committing the joint entry but
before committing the non-joint one, waiting on the promise. Aborting
wouldn't help. Fix this by subscribing to the `abort_source` and
resolving the promise exceptionally on abort.
Furthermore, make sure that two `set_configuration` calls don't step on
each other's toes by one setting the other's promise. To do that, reset
the promise field at the end of `set_configuration` and check that it's
not engaged at the beginning.
Fixes #11288.
Closes #11325
* github.com:scylladb/scylladb:
test: raft: randomized_nemesis_test: additional logging
raft: server: handle aborts when waiting for config entry to commit
When `io_fiber` fetched a batch with a configuration that does not
contain this node, it would send the entries committed in this batch to
`applier_fiber` and then proceed to drop the waiters for any remaining
entries (if the node was no longer a leader).
If there were waiters for entries committed in this batch, it could
either happen that `applier_fiber` received and processed those entries
first, notifying the waiters that the entries were committed and/or
applied, or it could happen that `io_fiber` reaches the dropping waiters
code first, causing the waiters to be resolved with
`commit_status_unknown`.
The second scenario is undesirable. For example, when a follower tries
to remove the current leader from the configuration using
`modify_config`, if the second scenario happens, the follower will get
`commit_status_unknown` - this can happen even though there are no node
or network failures. In particular, this caused
`randomized_nemesis_test.remove_leader_with_forwarding_finishes` to fail
from time to time.
Fix it by serializing the notifying and dropping of waiters in a single
fiber - `applier_fiber`. We decided to move all management of waiters
into `applier_fiber`, because most of that management was already there
(there was already one `drop_waiters` call, and two `notify_waiters`
calls). Now, when `io_fiber` observes that we've been removed from the
config and are no longer a leader, instead of dropping waiters, it sends a
message to `applier_fiber`. `applier_fiber` will drop waiters when
receiving that message.
Improve an existing test to reproduce this scenario more frequently.
Fixes #11235.
Closes #11308
* github.com:scylladb/scylladb:
test: raft: randomized_nemesis_test: more chaos in `remove_leader_with_forwarding_finishes`
raft: server: drop waiters in `applier_fiber` instead of `io_fiber`
raft: server: use `visit` instead of `holds_alternative`+`get`
In the `std::holds_alternative`+`std::get` version, the `get` performs a
redundant check. Also `std::visit` gives a compile-time exhaustiveness
check (whether we handled all possible cases of the `variant`).
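For illustration, the difference on a toy variant:
```
#include <iostream>
#include <variant>

using message = std::variant<int, double>;

void handle_checked(const message& m) {
    if (std::holds_alternative<int>(m)) {
        std::cout << "int: " << std::get<int>(m) << '\n';       // get re-checks the alternative
    } else if (std::holds_alternative<double>(m)) {
        std::cout << "double: " << std::get<double>(m) << '\n';
    }
    // A newly added alternative would be silently ignored here.
}

void handle_visited(const message& m) {
    std::visit([](const auto& v) {
        std::cout << v << '\n';   // single dispatch, exhaustive by construction
    }, m);
}
```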
Dtest fails if it sees unknown errors in the logs. This series
reduces the severity of some errors (since they are actually expected during
shutdown) and removes some others that duplicate already existing errors
that dtest knows how to deal with. Also fixes one case of an unhandled
exception in schema management code.
* 'dtest-fixes-v1' of github.com:gleb-cloudius/scylla:
raft: getting abort_requested_exception exception from a sm::apply is not a critical error
schema_registry: fix abandoned feature warning
service: raft: silence rpc::closed_errors in raft_rpc