scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-12 19:02:12 +00:00

Author	SHA1	Message	Date
Piotr Dulikowski	c58ff554d8	raft: rpc: introduce destination_not_alive_error Add a new destination_not_alive_error, thrown from two-way RPCs in case when the RPC is not issued because the destination is not reported as alive by the failure detector. In snapshot transfer code, lower the verbosity of the message printed in case it fails on the new error. This is done to prevent flakiness in the CI - in case of slow runs, nodes might get spuriously marked as dead if they are busy, and a message with the "error" verbosity can cause some tests to fail.	2023-11-23 11:14:28 +01:00
Piotr Dulikowski	a1ebfcf006	raft: add server::is_alive Add a method which reports whether given raft server is running. In following commits, the information about whether the local raft group 0 is running or not will be included in the response to the failure detector ping, and the is_alive method will be used there.	2023-11-23 00:34:22 +01:00
Kefu Chai	efd65aebb2	build: cmake: add check-header target to have feature parity with `configure.py`. we won't need this once we migrate to C++20 modules. but before that day comes, we need to stick with C++ headers. we generate a rule for each .hh file to create a corresponding .cc and then compile it, in order to verify that the header is self-contained. so the number of rules is quite large; to avoid the unnecessary overhead, the check-header target is enabled only if the `Scylla_CHECK_HEADERS` option is enabled. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#15913	2023-11-13 10:27:06 +02:00
Gleb Natapov	9f6e93c144	raft: make sure that all operations forwarded to a leader are completed before destroying raft server Hold a gate around all operations that are forwarded to a leader to be able to wait for them during server::abort(); otherwise abort() may complete while those operations are still running, which may cause a use-after-free.	2023-10-25 13:29:36 +03:00
Piotr Dulikowski	64668e325e	raft: expose current_leader in raft::server The handler for join_node_request will need to know which node is considered the group 0 leader right now by the local node. If the topology coordinator crashes and a new node immediately wants to replace it with the same IP, the node that handles join_node_request will attempt to perform a read barrier. If this happens quickly enough, due to the IP reuse the RPC will be sent to the new node instead of the (now crashed) topology coordinator; the RPC will get an error and will fail the barrier. If we detect that the new node wants to replace the current topology coordinator, the upcoming join_node_request_handler will wait until there is a leader change.	2023-09-26 15:56:52 +02:00
Gleb Natapov	55f047f33f	raft: drop assert in server_impl::apply_snapshot for a condition that may happen server_impl::apply_snapshot() assumes that it cannot receive a snapshot from the same host until the previous one is handled, and usually this is true since a leader will not send another snapshot until it gets a response to the previous one. But it may happen that the snapshot sending RPC fails after the snapshot was sent, but before the reply is received, because of a connection disconnect. In this case the leader may send another snapshot and there is no guarantee that the previous one was already handled, so the assumption may break. Drop the assert that verifies the assumption and return an error in this case instead. Fixes: #15222 Message-ID: <ZO9JoEiHg+nIdavS@scylladb.com>	2023-09-01 07:17:49 +03:00
Mikołaj Grzebieluch	dc6017b71b	raft topology: make `mutation_size_threshold` depend on `max_command_size` `get_cdc_generation_mutations` splits data into mutations of maximal size `mutation_size_threshold`. Before this commit it was hardcoded to 2 MB. Calculate `mutation_size_threshold` to leave space for cdc generation data and not exceed `max_command_size`.	2023-07-07 13:11:52 +02:00
Kamil Braun	ff386e7a44	service: raft: force initial snapshot transfer in new cluster When we upgrade a cluster to use Raft, or perform manual Raft recovery procedure (which also creates a fresh group 0 cluster, using the same algorithm as during upgrade), we start with a non-empty group 0 state machine; in particular, the schema tables are non-empty. In this case we need to ensure that nodes which join group 0 receive the group 0 state. Right now this is not the case. In previous releases, where group 0 consisted only of schema, and schema pulls were also done outside Raft, those nodes received schema through this outside mechanism. In `91f609d065` we disabled schema pulls outside Raft; we're also extending group 0 with other things, like topology-specific state. To solve this, we force snapshot transfers by setting the initial snapshot index on the first group 0 server to `1` instead of `0`. During replication, Raft will see that the joining servers are behind, triggering snapshot transfer and forcing them to pull group 0 state. It's unnecessary to do this for cluster which bootstraps with Raft enabled right away but it also doesn't hurt, so we keep the logic simple and don't introduce branches based on that. Extend Raft upgrade tests with a node bootstrap step at the end to prevent regressions (without this patch, the step would hang - node would never join, waiting for schema). Fixes: #14066 Closes #14336	2023-06-29 22:46:42 +02:00
Gleb Natapov	945f476363	test: add test for group0 raft command merging Add a test that submits 3 large commands, each a little larger than 1/3 of the maximum mutation size. Check that in the end 2 commands were executed (the first 2 were merged and the third was executed separately).	2023-06-27 14:59:55 +03:00
Kamil Braun	5504da3745	raft: server: throw fewer `commit_status_unknown`s from `wait_for_entry` There are some cases where we can deduce that the entry was committed, but we were throwing `commit_status_unknown`. Handle one more such case. The added comment explains it in detail. Also add a FIXME for another case where we throw `commit_status_unknown` but we could do better. Fixes: #14029	2023-06-07 14:17:23 +02:00
Kefu Chai	82cac8e7cf	treewide: s/std::source_location/seastar::compat::source_location/ CWG 2631 (https://cplusplus.github.io/CWG/issues/2631.html) reports an issue on how the default argument is evaluated. this problem is more obvious when it comes to how `std::source_location::current()` is evaluated as a default argument. but not all compilers have the same behavior, see https://godbolt.org/z/PK865KdG4. notably, clang-15 evaluates the default argument at the callee site. so we need to check the capability of the compiler and fall back to the one defined by util/source_location-compat.hh if the compiler suffers from CWG 2631. clang-16 implemented CWG2631 in https://reviews.llvm.org/D136554, but unfortunately this change was not backported to clang-15. before switching over to clang-16, to use std::source_location::current() as the default parameter and get the behavior defined by CWG2631, we have to use the compatibility layer provided by Seastar. otherwise we always end up having the source_location at the callee side, which is not interesting under most circumstances. so in this change, all places using the idiom of passing std::source_location::current() as the default parameter are changed to use seastar::compat::source_location::current(). although we have `#include "seastarx.h"` for opening the seastar namespace, to disambiguate from the "namespace compat" defined somewhere in scylladb, the fully qualified name of `seastar::compat::source_location::current()` is used. see also `09a3c63345`, where we used std::source_location as an alias of std::experimental::source_location if it was available. but this does not apply to the settings of our current toolchain, where we have GCC-12 and Clang-15. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #14086	2023-05-30 15:10:12 +03:00
Kefu Chai	cb22492379	raft: specialize fmt::formatter<raft::server_address&> and friends this is a part of a series to migrating from `operator<<(ostream&, ..)` based formatting to fmtlib based formatting. the goal here is to enable fmtlib to print - raft::server_address - raft::config_member - raft::configuration without the help of `operator<<`. the corresponding `operator<<()` is removed in this change, as all its callers are now using fmtlib for formatting now. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #13976	2023-05-24 09:11:55 +03:00
Benny Halevy	adfb79ba3e	raft, idl: restore internal::tagged_uint64 type Change `f5f566bdd8` introduced tagged_integer and replaced raft::internal::tagged_uint64 with utils::tagged_integer. However, the idl type for raft::internal::tagged_uint64 was not marked as final, but utils::tagged_integer is, breaking the on-the-wire compatibility. This change defines the different raft tagged_uint64 types in idl/raft_storage.idl.hh as non-final to restore the way they were serialized prior to `f5f566bdd8` Fixes #13752 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-05-09 12:38:20 +03:00
Benny Halevy	531ac63a8d	raft: define term_t as a tagged uint64_t It was defined as a tagged (signed) int64_t by mistake in `f5f566bdd8`. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-05-09 06:51:26 +03:00
Kefu Chai	f80f638bb9	raft: disambiguate promise name in raft::awaited_conf_changes otherwise GCC 13 complains that ``` /home/kefu/dev/scylladb/raft/server.cc:42:15: error: declaration of ‘seastar::promise<void> raft::awaited_index::promise’ changes meaning of ‘promise’ [-Wchanges-meaning] 42 \| promise<> promise; \| ^~~~~~~ /home/kefu/dev/scylladb/raft/server.cc:42:5: note: used here to mean ‘class seastar::promise<void>’ 42 \| promise<> promise; \| ^~~~~~~~~ ``` see also `cd4af0c722` Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-04-29 17:02:25 +08:00
Harsh Soni	84ea2f5066	raft: fsm: add empty check for `max_read_id_with_quorum` Updated the empty() function in the struct fsm_output to include the max_read_id_with_quorum field when checking whether the fsm output is empty or not. The change was made in order to maintain consistency with the codebase and to make the empty check complete. This change has no impact on other parts of the codebase. Closes #13656	2023-04-27 16:04:58 +02:00
Kamil Braun	30cc07b40d	Merge 'Introduce tablets' from Tomasz Grabiec This PR introduces an experimental feature called "tablets". Tablets are a way to distribute data in the cluster, which is an alternative to the current vnode-based replication. Vnode-based replication strategy tries to evenly distribute the global token space shared by all tables among nodes and shards. With tablets, the aim is to start from a different side. Divide resources of replica-shard into tablets, with a goal of having a fixed target tablet size, and then assign those tablets to serve fragments of tables (also called tablets). This will allow us to balance the load in a more flexible manner, by moving individual tablets around. Also, unlike with vnode ranges, tablet replicas live on a particular shard on a given node, which will allow us to bind raft groups to tablets. Those goals are not yet achieved with this PR, but it lays the ground for this. Things achieved in this PR: - You can start a cluster and create a keyspace whose tables will use tablet-based replication. This is done by setting `initial_tablets` option: ``` CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3, 'initial_tablets': 8}; ``` All tables created in such a keyspace will be tablet-based. Tablet-based replication is a trait, not a separate replication strategy. Tablets don't change the spirit of replication strategy, it just alters the way in which data ownership is managed. In theory, we could use it for other strategies as well like EverywhereReplicationStrategy. Currently, only NetworkTopologyStrategy is augmented to support tablets. 
- You can create and drop tablet-based tables (no DDL language changes) - DML / DQL work with tablet-based tables Replicas for tablet-based tables are chosen from tablet metadata instead of token metadata Things which are not yet implemented: - handling of views, indexes, CDC created on tablet-based tables - sharding is done using the old method, it ignores the shard allocated in tablet metadata - node operations (topology changes, repair, rebuild) are not handling tablet-based tables - not integrated with compaction groups - tablet allocator piggy-backs on tokens to choose replicas. Eventually we want to allocate based on current load, not statically Closes #13387 * github.com:scylladb/scylladb: test: topology: Introduce test_tablets.py raft: Introduce 'raft_server_force_snapshot' error injection locator: network_topology_strategy: Support tablet replication service: Introduce tablet_allocator locator: Introduce tablet_aware_replication_strategy locator: Extract maybe_remove_node_being_replaced() dht: token_metadata: Introduce get_my_id() migration_manager: Send tablet metadata as part of schema pull storage_service: Load tablet metadata when reloading topology state storage_service: Load tablet metadata on boot and from group0 changes db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata() migration_notifier: Introduce before_drop_keyspace() migration_manager: Make prepare_keyspace_drop_announcement() return a future<> test: perf: Introduce perf-tablets test: Introduce tablets_test test: lib: Do not override table id in create_table() utils, tablets: Introduce external_memory_usage() db: tablets: Add printers db: tablets: Add persistence layer dht: Use last_token_of_compaction_group() in split_token_range_msb() locator: Introduce tablet_metadata dht: Introduce first_token() dht: Introduce next_token() storage_proxy: Improve trace-level logging locator: token_metadata: Fix confusing comment on ring_range() 
dht, storage_proxy: Abstract token space splitting Revert "query_ranges_to_vnodes_generator: fix for exclusive boundaries" db: Exclude keyspace with per-table replication in get_non_local_strategy_keyspaces_erms() db: Introduce get_non_local_vnode_based_strategy_keyspaces() service: storage_proxy: Avoid copying keyspace name in write handler locator: Introduce per-table replication strategy treewide: Use replication_strategy_ptr as a shorter name for abstract_replication_strategy::ptr_type locator: Introduce effective_replication_map locator: Rename effective_replication_map to vnode_effective_replication_map locator: effective_replication_map: Abstract get_pending_endpoints() db: Propagate feature_service to abstract_replication_strategy::validate_options() db: config: Introduce experimental "TABLETS" feature db: Log replication strategy for debugging purposes db: Log full exception on error in do_parse_schema_tables() db: keyspace: Remove non-const replication strategy getter config: Reformat	2023-04-27 09:40:18 +02:00
Tomasz Grabiec	c1fdbe79b7	raft: Introduce 'raft_server_force_snapshot' error injection Will be used by tests to force followers to catch up from the snapshot.	2023-04-24 10:49:37 +02:00
Benny Halevy	f5f566bdd8	utils: add tagged_integer A generic template for defining strongly typed integer types. Use it here to replace raft::internal::tagged_uint64. Will be used for defining gms generation and version as strong and distinguishable types in following patches. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-04-23 08:37:32 +03:00
Kefu Chai	3425184b2a	raft: include boost header using <path/to/header> not "path/to/header" for more consistency with the rest of the source tree. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-03-26 14:07:50 +08:00
Kefu Chai	0421d6d12f	raft: include used header at least, we need to access the declarations of exceptions, like `not_a_leader` and `dropped_entry`, so, instead of relying on another header to do this job for us, we should include the header which contains the declarations. so, in this change "raft.h" is included explicitly. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-03-26 14:07:50 +08:00
Botond Dénes	19560419d2	Merge 'treewide: improve compatibility with gcc 13' from Avi Kivity An assortment of patches that reduce our incompatibilities with the upcoming gcc 13. Closes #13243 * github.com:scylladb/scylladb: transport: correctly format unknown opcode treewide: catch by reference test: raft: avoid confusing string compare utils, types, test: extract lexicographical compare utilities test: raft: fsm_test: disambiguate raft::configuration construction test: reader_concurrency_semaphore_test: handle all enum values repair: fix signed/unsigned compare repair: fix incorrect signed/unsigned compare treewide: avoid unused variables in if statements keys: disambiguate construction from initializer_list<bytes> cql3: expr: fix serialize_listlike() reference-to-temporary with gcc compaction: error on invalid scrub type treewide: prevent redefining names api: task_manager: fix signed/unsigned compare alternator: streams: fix signed/unsigned comparison test: fix some mismatched signed/unsigned comparisons	2023-03-24 15:16:05 +02:00
Avi Kivity	a806024e1d	treewide: avoid unused variables in if statements gcc warns about unused variables declared in if statements. Just drop them.	2023-03-21 13:42:49 +02:00
Botond Dénes	bf8b746bca	Merge 'utils: UUID: specialize fmt::formatter for UUID and tagged_uuid<>' from Kefu Chai this is a part of a series migrating from `operator<<(ostream&, ..)` based formatting to fmtlib based formatting. the goal here is to enable fmtlib to print UUID without using ostream<<. also, this change re-implements some formatting helpers using fmtlib for better performance and less dependencies on operator<<(), but we cannot drop it at this moment, as quite a few caller sites are still using operator<<(ostream&, const UUID&) and operator<<(ostream&, tagged_uuid<T>&). we will address them separately. * add `fmt::formatter<UUID>` * add `fmt::formatter<tagged_uuid<T>>` * implement `UUID::to_string()` using `fmt::to_string()` * implement `operator<<(std::ostream&, const UUID&)` with `fmt::print()`, this should help to improve the performance when printing uuid, as `fmt::print()` does not materialize a string when printing the uuid. * treewide: use fmtlib when printing UUID Refs #13245 Closes #13246 * github.com:scylladb/scylladb: treewide: use fmtlib when printing UUID utils: UUID: specialize fmt::formatter for UUID and tagged_uuid<>	2023-03-20 14:26:11 +02:00
Gleb Natapov	2fc8e13dd8	raft: add server::wait_for_state_change() function Add a function that allows waiting for a state change of a raft server. It is useful for a user that wants to know when a node becomes/stops being a leader. Message-Id: <20230316112801.1004602-4-gleb@scylladb.com>	2023-03-20 11:31:55 +01:00
Gleb Natapov	59f7aeb79b	raft: move some functions out of ad-hoc section Make tick() and is_leader() part of the API. First is used externally already and another will be used in following patches. Message-Id: <20230316112801.1004602-3-gleb@scylladb.com>	2023-03-20 11:25:19 +01:00
Kefu Chai	94c6df0a08	treewide: use fmtlib when printing UUID this change tries to reduce the number of callers using operator<<() for printing UUID. they are found by compiling the tree after commenting out `operator<<(std::ostream& out, const UUID& uuid)`. but this change alone is not enough to drop all callers, as some callers are using `operator<<(ostream&, const unordered_map&)` and other overloads to print ranges whose elements contain UUID. so in order to limit the scope of the change, we are not changing them here. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-03-20 15:38:45 +08:00
Avi Kivity	6aa91c13c5	Merge 'Optimize topology::compare_endpoints' from Benny Halevy The code for compare_endpoints originates at the dawn of time (`bc034aeaec`) and is called on the fast path from storage_proxy via `sort_by_proximity`. This series considerably reduces the function's footprint by: 1. carefully coding the many comparisons in the function so as to reduce the number of conditional branches (apparently the compiler isn't doing a good enough job at optimizing it in this case) 2. avoiding sstring copy in topology::get_{datacenter,rack} Closes #12761 * github.com:scylladb/scylladb: topology: optimize compare_endpoints to_string: add print operators for std::{weak,partial}_ordering utils: to_sstring: deinline std::strong_ordering print operator move to_string.hh to utils/ test: network_topology: add test_topology_compare_endpoints	2023-03-07 15:17:19 +02:00
Kefu Chai	563fbb2d11	build: cmake: extract more subsystem out into its own CMakeLists.txt namely, cdc, compaction, dht, gms, lang, locator, mutation_writer, raft, readers, replica, service, tools, tracing and transport. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-03-02 10:15:25 +08:00
Kefu Chai	b926105eae	raft: reference this explicitly Clang complains that the captured `this` is not used, like ``` /home/kefu/dev/scylladb/raft/fsm.hh:644:21: error: lambda capture 'this' is not used [-Werror,-Wunused-lambda-capture] auto visitor = [this, from, msg = std::move(msg)](const auto& state) mutable { ^ /home/kefu/dev/scylladb/raft/server.cc:738:11: note: in instantiation of function template specialization 'raft::fsm::step<raft::append_request>' requested here _fsm->step(from, std::move(append_request)); ^ ``` but `step(..)` is a non-static member function of `fsm`, so `this` is actually used. to silence Clang's warning, let's just reference it explicitly. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-02-28 21:56:55 +08:00
Gleb Natapov	9bdef9158e	raft: abort applier fiber when a state machine aborts After `5badf20c7a` applier fiber does not stop after it gets abort error from a state machine which may trigger an assertion because previous batch is not applied. Fix it. Fixes #12863	2023-02-15 15:54:19 +02:00
Gleb Natapov	dfcd56736b	raft: fix race in add_entry_on_leader that may cause incorrect log length accounting In add_entry_on_leader, after wait_for_memory_permit() resolves but before the fiber continues to run, the node may stop being the leader and then become the leader again, which will make the currently held units outdated. Detect this case by checking the term after the preemption.	2023-02-15 15:51:59 +02:00
Benny Halevy	25ebc63b82	move to_string.hh to utils/ Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-02-15 11:09:04 +02:00
Alejo Sanchez	346d02b477	raft conf error injection for snapshot To trigger snapshot limit behavior provide an error injection to set with one-shot. Note this effectively changes it and there is no revert. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2023-02-03 22:33:33 +01:00
Pavel Emelyanov	4f415413d2	raft: Fix non-existing state_machine::apply_entry in docs The docs mention that method, but it doesn't exist. Instead, the state_machine interface defines plain .apply() one. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #12541	2023-01-17 12:53:05 +01:00
Kamil Braun	4268b1bbc2	Merge 'raft: raft_group0, register RPC verbs on all shards' from Gusev Petr raft_group0 used to register RPC verbs only on shard 0. This worked on clusters with the same --smp setting on all nodes, since RPCs in this case are processed on the same shard as the calling code, and raft_group0 methods only run on shard 0. A new test test_nodes_with_different_smp was added to identify the problem. Since --smp can only be specified via the command line, a corresponding parameter was added to the ManagerClient.server_add method. It allows to override the default parameters set by the SCYLLA_CMDLINE_OPTIONS variable by changing, adding or deleting individual items. Fixes: #12252 Closes #12374 * github.com:scylladb/scylladb: raft: raft_group0, register RPC verbs on all shards raft: raft_append_entries, copy entries to the target shard test.py, allow to specify the node's command line in test	2023-01-04 11:11:21 +01:00
Petr Gusev	7725e03a09	raft: raft_append_entries, copy entries to the target shard If append_entries RPC was received on a non-zero shard, we may need to pass it to a zero (or, potentially, some other) shard. The problem is that raft::append_request contains entries in the form of raft::log_entry_ptr == lw_shared_ptr<log_entry>, which doesn't support cross-shard reference counting. In debug mode it contains a special ref-counting facility debug_shared_ptr_counter_type, which resorts to on_internal_error if it detects such a case. To solve this, we just copy log entries to the target shard if it isn't equal to the current one. In most cases, if --smp setting is the same on all nodes, RPC will be handled on zero shard, so there will be no overhead.	2023-01-03 15:25:00 +03:00
Gleb Natapov	229cef136d	raft: add trace logging to raft::server::start Allows to see initial state of the server during start. Message-Id: <20221228144944.3299711-15-gleb@scylladb.com>	2023-01-02 11:57:53 +02:00
Gleb Natapov	5182543df2	raft: fix typo in read_barrier logging The log logs applied index not append one. Message-Id: <20221228144944.3299711-3-gleb@scylladb.com>	2023-01-02 11:38:47 +02:00
Nadav Har'El	09a3c63345	cross-tree: allow std::source_location in clang 14 We recently (commit `6a5d9ff261`) started to use std::source_location instead of std::experimental::source_location. However, this does not work on clang 14, because libc++ 12's <source_location> only works if __builtin_source_location is available, and that is not available on clang 14. clang 15 is just three months old, and several relatively-recent distributions still carry clang 14 so it would be nice to support it as well. So this patch adds a trivial compatibility header file, which, when included and compiled with clang 14, aliases the functional std::experimental::source_location to std::source_location. It turns out it's enough to include the new header file from three headers that included <source_location> - I guess all other uses of source_location depend on those header files directly or indirectly. We may later need to include the compatibility header file in additional places, but for now we don't. Refs #12259 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #12265	2022-12-11 20:28:49 +02:00
Gleb Natapov	022a825b33	raft: introduce not_a_member error and return it when non member tries to do add/modify_config Currently if a node that is outside of the config tries to add an entry or modify config transient error is returned and this causes the node to retry. But the error is not transient. If a node tries to do one of the operations above it means it was part of the cluster at some point, but since a node with the same id should not be added back to a cluster if it is not in the cluster now it will never be. Return a new error not_a_member to a caller instead. Message-Id: <Y42mTOx8bNNrHqpd@scylladb.com>	2022-12-05 17:11:04 +01:00
Avi Kivity	6a5d9ff261	treewide: use non-experimental std::source_location Now that we use libstdc++ 12, we can use the standardized source_location. Closes #12137	2022-11-30 11:06:43 +02:00
Avi Kivity	fb6804e7a4	raft: don't compare signed and unsigned types gcc warns it can lead to undefined behavior, though 2G entries in a list of mutations are unlikely. Use the correct type for iteration.	2022-11-28 21:58:30 +02:00
Konstantin Osipov	990c7a209f	raft: change the API of conf change notifications Pass a change diff into the notification callback, rather than add or remove servers one by one, so that if we need to persist the state, we can do it once per configuration change, not for every added or removed server. For now still pass added and removed entries in two separate calls per a single configuration change. This is done mainly to fulfill the library contract that it never sends messages to servers outside the current configuration. The group0 RPC implementation doesn't need the two calls, since it simply marks the removed servers as expired: they are not removed immediately anyway, and messages can still be delivered to them. However, there may be test/mock implementations of RPC which could benefit from this contract, so we decided to keep it.	2022-11-17 12:07:31 +03:00
Kamil Braun	0c9cb5c5bf	Merge 'raft: wait for the next tick before retrying' from Gusev Petr When `modify_config` or `add_entry` is forwarded to the leader, it may reach the node at an "inappropriate" time and result in an exception. There are two reasons for it - the leader is changing and, in case of `modify_config`, another `modify_config` is currently in progress. In both cases the command is retried, but before this patch there was no delay before retrying, which could lead to a tight loop. The patch adds a new exception type `transient_error`. When the client receives it, it is obliged to retry the request after some delay. Previously leader-side exceptions were converted to `not_a_leader`, which is strange, especially for `conf_change_in_progress`. Fixes: #11564 Closes #11769 * github.com:scylladb/scylladb: raft: rafactor: remove duplicate code on retries delays raft: use wait_for_next_tick in read_barrier raft: wait for the next tick before retrying	2022-11-16 18:20:54 +01:00
Petr Gusev	ae3e0e3627	raft: rafactor: remove duplicate code on retries delays Introduce a templated function do_on_leader_with_retries, use it in add_entries/modify_config/read_barrier. The function implements the basic logic of retries with aborts and leader changes handling, adds a delay between iterations to protect against tight loops.	2022-11-15 13:18:53 +04:00
Petr Gusev	15cc1667d0	raft: use wait_for_next_tick in read_barrier Replaced the yield on transport_error with wait_for_next_tick. Added delays for retries, similar to add_entry/modify_config: we postpone the next call attempt if we haven't received new information about the current leader.	2022-11-15 12:31:49 +04:00
Petr Gusev	5e15c3c9bd	raft: wait for the next tick before retrying When modify_config or add_entry is forwarded to the leader, it may reach the node at an "inappropriate" time and result in an exception. There are two reasons for it - the leader is changing and, in case of modify_config, another modify_config is currently in progress. In both cases the command is retried, but before this patch there was no delay before retrying, which could lead to a tight loop. The patch adds a new exception type transient_error. When the client node receives it, it is obliged to retry the request, possibly after some delay. Previously, leader-side exceptions were converted to the not_a_leader exception, which is strange, especially for conf_change_in_progress. We add a delay before retrying in modify_config and add_entry if the client hasn't received any new information about the leader since the last attempt. This can happen if the server responds with a transient_error with an empty leader and the current node has not yet learned the new leader. We accept an excessive delay if the newly elected leader is the same as the previous one, since this is supposed to be rare. Fixes: #11564	2022-11-15 11:49:26 +04:00
Petr Gusev	d79fbab682	raft: convert raft::transport_error to raft::commit_status_unknown The add_entry and modify_config methods sometimes do an rpc to execute the request on the current leader. If the tcp connection was broken, a seastar::rpc::closed_error would be thrown to the client. This exception was not documented in the method comments and the client could have missed handling it. For example, this exception was not handled when calling modify_config in raft_group0, which sometimes broke the removenode command. An intermittent_connection_error exception was added earlier to solve a similar problem with the read_barrier method. In this patch it is renamed to transport_error, as it seems to better describe the situation, and an explicit specification for this exception was added - the rpc implementation can throw it if it is not known whether the call reached the target node and whether any actions were performed on it. In case of read_barrier it does not matter and we just retry. In case of add_entry and modify_config we cannot retry because the rpc calls are not idempotent, so we convert this exception to commit_status_unknown, which the client has to handle. Explicit comments have also been added to raft::server methods describing all possible exceptions.	2022-10-07 13:34:16 +04:00
Petr Gusev	cbfe033786	raft server, shrink_to_fit on log truncation We don't want to keep memory we don't use, shrink_to_fit guarantees that. In fact, boost::deque frees up memory when items are deleted, so this change has little effect at the moment, but it may pay off if we change the container in the future.	2022-09-27 12:02:36 +04:00


313 Commits