Some of the calls inside the `raft_group0_client::start_operation()`
method were missing the abort source parameter. This caused the repair
test to get stuck in the shutdown phase - the abort source had been
triggered, but the operations were not checking it.
This was in particular the case for operations that try to take
ownership of the raft group semaphore (`get_units(semaphore)`) - these
waits should be cancelled when the abort source is triggered.
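The cancellable wait can be modelled outside Seastar with a plain condition variable. A minimal sketch, assuming a hypothetical `abortable_semaphore` (the names `get_unit`, `signal`, `request_abort` are illustrative, not Scylla's or Seastar's API): a unit acquisition that also wakes up when the abort flag is raised, instead of blocking forever.

```cpp
#include <condition_variable>
#include <mutex>

// Illustrative analogue (not Seastar): a semaphore-style wait that
// observes an abort flag in its wake-up predicate.
struct abortable_semaphore {
    std::mutex m;
    std::condition_variable cv;
    int units = 0;
    bool aborted = false;

    // Returns true if a unit was acquired, false if the wait was aborted.
    bool get_unit() {
        std::unique_lock lk(m);
        // The predicate checks both conditions, so triggering the abort
        // wakes waiters instead of leaving them stuck at shutdown.
        cv.wait(lk, [&] { return units > 0 || aborted; });
        if (aborted) {
            return false;
        }
        --units;
        return true;
    }

    void signal() {
        { std::lock_guard lk(m); ++units; }
        cv.notify_all();
    }

    void request_abort() {
        { std::lock_guard lk(m); aborted = true; }
        cv.notify_all();
    }
};
```

The bug described above corresponds to a wait whose predicate ignores `aborted`: the abort is requested, but the waiter never re-checks it.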
This should fix the following tests that were failing in some percentage
of dtest runs (about 1-3 of 100):
* TestRepairAdditional::test_repair_kill_1
* TestRepairAdditional::test_repair_kill_3
Fixes scylladb/scylladb#19223
(cherry picked from commit 5dfc50d354)
Most callers of the raft group0 client interface are passing a real
abort source instance, so we can use an abort source reference in the
client interface. This change makes the code simpler and more consistent.
(cherry picked from commit 2dbe9ef2f2)
Currently they both run in the streaming group, which may become busy
during repair/MV building and affect group0 functionality. Move it to
the gossiper group, where it should have more time to run.
Fixes #18863
(cherry picked from commit a74fbab99a)
Closes scylladb/scylladb#19175
A separate keyspace that also behaves as a system keyspace brings
little benefit while creating compatibility problems
such as schema digest mismatches during rollback. So we decided
to move the auth tables into the system keyspace.
Fixes https://github.com/scylladb/scylladb/issues/18098
Closes scylladb/scylladb#18769
(cherry picked from commit 2ab143fb40)
[avi: adjust test/alternator/suite.yaml to reflect new keyspace]
If we initiate the shutdown while starting the group 0 server,
we could catch `abort_requested_exception` in `start_server_for_group`
and call `on_internal_error`. Then, Scylla aborts with a coredump.
It causes problems in tests that shut down bootstrapping nodes.
The `abort_requested_exception` can be thrown from
`gossiper::lock_endpoint` called in
`storage_service::topology_state_load`. So, the issue is new and
applies only to the raft-based topology. Hence, there is no need
to backport the patch.
Fixes scylladb/scylladb#17794
Fixes scylladb/scylladb#18197
Closes scylladb/scylladb#18569
The direct failure detector design is simplistic. It sends pings
sequentially and times out listeners that reached the threshold (i.e.
didn't hear from a given endpoint for too long) in-between pings.
Given the sequential nature, the previous ping must finish so the next
ping can start. We time out pings that take too long. The timeout was
hardcoded and set to 300ms. This is too low for wide-area setups --
latencies across the Earth can indeed go up to 300ms. Three subsequent
timed-out pings to a given node were sufficient for the Raft listener to
"mark server as down" (the listener used a threshold of 1s).
Increase the ping timeout to 600ms, which should be enough even for
pinging the opposite side of the Earth, and make it tunable.
Increase the Raft listener threshold from 1s to 2s. Without the
increased threshold, one timed out ping would be enough to mark the
server as down. Increasing it to 2s requires 3 timed out pings which
makes it more robust in presence of transient network hiccups.
In the future we'll most likely want to decrease the Raft listener
threshold again, if we use Raft for data path -- so leader elections
start quickly after leader failures. (Faster than 2s). To do that we'll
have to improve the design of the direct failure detector.
Ref: scylladb/scylladb#16410
Fixes: scylladb/scylladb#16607
---
I tested the change manually using `tc qdisc ... netem delay`, setting
network delay on local setup to ~300ms with jitter. Without the change,
the result is as observed in scylladb/scylladb#16410: interleaving
```
raft_group_registry - marking Raft server ... as dead for Raft groups
raft_group_registry - marking Raft server ... as alive for Raft groups
```
happening once every few seconds. The "marking as dead" happens whenever
we get 3 subsequent failed pings, which happens with certain (high)
probability depending on the latency jitter. Then, as soon as we get a
successful ping, we mark the server as alive again.
With the change, the phenomenon no longer appears.
Closes scylladb/scylladb#18443
write_mutations_to_database might need to handle
large mutations from system tables, so to prevent
reactor stalls, freeze the mutations gently
and call proxy.mutate_locally in parallel on
the individual frozen mutations, rather than
calling the vector<mutation> based entry point
that eventually freezes each mutation synchronously.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
clang-tidy warns like:
```
[628/713] Building CXX object service/CMakeFiles/service.dir/raft/raft_group_registry.cc.o
Warning: /home/runner/work/scylladb/scylladb/service/raft/raft_group_registry.cc:543:66: warning: 'id' used after it was moved [bugprone-use-after-move]
543 | auto& rate_limit = _rate_limits.try_get_recent_entry(id, std::chrono::minutes(5));
| ^
/home/runner/work/scylladb/scylladb/service/raft/raft_group_registry.cc:539:19: note: move occurred here
539 | auto dst_id = raft::server_id{std::move(id)};
| ^
```
this is a false alarm, as the type of `id` is actually `utils::UUID`,
which is a struct enclosing two `int64_t` variables, and we don't
define a move constructor for `utils::UUID`. so the value of `id`
is intact after being moved away. but it is still confusing at
first glance, as we are indeed referencing a moved-away variable.
so in order to reduce the confusion and to silence the warning, let's
just not `std::move(id)`.
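the effect can be demonstrated with a self-contained stand-in (the `uuid_like` type here is illustrative, not the real `utils::UUID`): with no user-defined move constructor, a "move" of two plain integers degenerates to a copy, so the moved-from object keeps its value.

```cpp
#include <cstdint>
#include <utility>

// Illustrative stand-in for utils::UUID: two int64_t fields and no
// user-defined move constructor, so std::move is a copy in disguise.
struct uuid_like {
    int64_t msb = 0;
    int64_t lsb = 0;
};

// Mirrors the flagged pattern: construct from std::move(id), then
// keep using id afterwards.
inline bool value_survives_move() {
    uuid_like id{42, 7};
    uuid_like dst_id{std::move(id)};  // copies, despite the std::move
    // clang-tidy flags this use of `id` (bugprone-use-after-move),
    // but the value is intact; dropping std::move silences the warning
    // without changing behavior.
    return id.msb == 42 && id.lsb == 7 && dst_id.msb == 42;
}
```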
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#18449
since we do not rely on FMT_DEPRECATED_OSTREAM to define the
fmt::formatter for us anymore, let's stop defining `FMT_DEPRECATED_OSTREAM`.
in this change,
* utils: drop the range formatters in to_string.hh and to_string.cc, as
we don't use them anymore. and the tests for them in
test/boost/string_format_test.cc are removed accordingly.
* utils: use fmt to print chunk_vector and small_vector. as
we are not able to print the elements using operator<< anymore
after switching to {fmt} formatters.
* test/boost: specialize fmt::details::is_std_string_like<bytes>
due to a bug in {fmt} v9, {fmt} fails to format a range whose
element type is `basic_sstring<uint8_t>`, as it considers it
as a string-like type, but `basic_sstring<uint8_t>`'s char type
is signed char, not char. this issue does not exist in {fmt} v10,
so, in this change, we add a workaround to explicitly specialize
the type trait to assure that {fmt} format this type using its
`fmt::formatter` specialization instead of trying to format it
as a string. also, {fmt}'s generic ranges formatter calls the
pair formatter's `set_brackets()` and `set_separator()` methods
when printing the range, but the operator<< based formatter does not
provide these methods, so we have to include this change in the change
switching to {fmt}, otherwise the change specializing
`fmt::details::is_std_string_like<bytes>` won't compile.
* test/boost: in tests, we use `BOOST_REQUIRE_EQUAL()` and its friends
for comparing values. but without the operator<< based formatters,
Boost.Test would not be able to print them. after removing
the homebrew formatters, we need to use the generic
`boost_test_print_type()` helper to do this job. so we are
including `test_utils.hh` in tests so that we can print
the formattable types.
* treewide: add "#include "utils/to_string.hh" where
`fmt::formatter<optional<>>` is used.
* configure.py: do not define FMT_DEPRECATED_OSTREAM
* cmake: do not define FMT_DEPRECATED_OSTREAM
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we include `fmt/ranges.h` and/or `fmt/std.h`
for formatting the container types, like vector, map
optional and variant using {fmt} instead of the homebrew
formatter based on operator<<.
with this change, together with the changes adding fmt::formatter and
the changes using the ostream formatter explicitly, we are
allowed to drop the `FMT_DEPRECATED_OSTREAM` macro.
Refs scylladb#13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* 'gleb/raft_snapshot_rpc-v3' of github.com:scylladb/scylla-dev:
raft topology: drop RAFT_PULL_TOPOLOGY_SNAPSHOT RPC
Use correct limit for raft commands throughout the code.
Raft uses schema commitlog, so all its limits should be derived from
this commitlog segment size, but many places used regular commitlog size
to calculate the limits and did not do what they were really supposed
to be doing.
This patch introduces raft-based service levels.
The difference to the current method of working is:
- service levels are stored in `system.service_levels_v2`
- reads are executed with `LOCAL_ONE`
- writes are done via raft group0 operation
Service levels are migrated to v2 in topology upgrade.
After the service levels are migrated, `key: service_level_v2_status; value: data_migrated` is written to the `system.scylla_local` table. If this row is present, the raft data accessor is created from the beginning, and it handles the recovery mode procedure (service levels will then be read from the v2 table even if consistent topology is disabled).
Fixes #17926
Closes scylladb/scylladb#16585
* github.com:scylladb/scylladb:
test: test service levels v2 works in recovery mode
test: add test for service levels migration
test: add test for service levels snapshot
test:topology: extract `trigger_snapshot` to utils
main: create raft dda if sl data was migrated
service:qos: store information about sl data migration
service:qos: service levels migration
main: assign standard service level DDA before starting group0
service:qos: fix `is_v2()` method
service:qos: add a method to upgrade data accessor
test: add unit_test_raft_service_levels_accessor
service:storage_service: add support for service levels raft snapshot
service:qos: add abort_source for group0 operations
service:qos: raft service level distributed data accessor
service:qos: use group0_guard in data accessor
cql3:statements: run service level statements on shard0 with raft guard
test: fix overrides in unit_test_service_levels_accessor
service:qos: fix indentation
service:qos: coroutinize some of the methods
db:system_keyspace: add `SERVICE_LEVELS_V2` table
service:qos: extract common service levels' table functions
In this PR we add timeouts support to raft groups registry. We introduce
the `raft_server_with_timeouts` class, which wraps the `raft::server`
and exposes its interface with an additional `raft_timeout` parameter. If
it's set, the wrapper cancels the `abort_source` after a certain amount
of time. The value of the timeout can be specified either in the
`raft_timeout` parameter, or the default value can be set in the
`raft_server_with_timeouts` class constructor.
The `raft_group_registry` interface is extended with
`group0_with_timeouts()` method. It returns an instance of
`raft_server_with_timeouts` for group0 raft server. The timeout value
for it is configured in `create_server_for_group0`. It's one minute by
default and can be overridden for tests with
`group0-raft-op-timeout-in-ms` parameter.
The new api allows the client to decide whether to use timeouts or not.
In this PR we review all the group0 call sites and add
`raft_timeout` where it makes sense. The general principle is that if the
code is handling a client request and the client expects a potential
error, we use timeouts. We don't use timeouts for background fibers
(such as topology coordinator), since they wouldn't add much value. The
only thing the background fiber can do with a timeout is to retry, and
this will have the same end effect as not having a timeout at all.
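The timeout-selection logic described above can be sketched in a few lines. This is a hypothetical minimal model, not Scylla's implementation: the names `raft_timeout` and `raft_server_with_timeouts` follow the PR text, but the bodies are illustrative.

```cpp
#include <chrono>
#include <optional>

// Per-call timeout request: empty means "use the server's default".
struct raft_timeout {
    std::optional<std::chrono::milliseconds> value;
};

// Wrapper holding the default timeout configured at construction
// (one minute for group0 in the PR, overridable for tests).
class raft_server_with_timeouts {
    std::chrono::milliseconds _default_timeout;
public:
    explicit raft_server_with_timeouts(std::chrono::milliseconds def)
        : _default_timeout(def) {}

    // The per-call value, if given, wins over the constructor default.
    std::chrono::milliseconds effective_timeout(raft_timeout t) const {
        return t.value.value_or(_default_timeout);
    }
};
```

In the real wrapper, the effective timeout would arm a timer that triggers the operation's `abort_source` when it expires; here only the value selection is shown.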
Fixes scylladb/scylladb#16604
Closes scylladb/scylladb#17590
* github.com:scylladb/scylladb:
migration_manager: use raft_timeout{}
storage_service::join_node_response_handler: use raft_timeout{}
storage_service::start_upgrade_to_raft_topology: use raft_timeout{}
storage_service::set_tablet_balancing_enabled: use raft_timeout{}
storage_service::move_tablet: use raft_timeout{}
raft_check_and_repair_cdc_streams: use raft_timeout{}
raft_timeout: test that node operations fail properly
raft_rebuild: use raft_timeout{}
do_cluster_cleanup: use raft_timeout{}
raft_initialize_discovery_leader: use raft_timeout{}
update_topology_with_local_metadata: use with_timeout{}
raft_decommission: use raft_timeout{}
raft_removenode: use raft_timeout{}
join_node_request_handler: add raft_timeout to make_nonvoters and add_entry
raft_group0: make_raft_config_nonvoter: add raft_timeout parameter
raft_group0: make_raft_config_nonvoter: add abort_source parameter
manager_client: server_add with start=false shouldn't call driver_connect
scylla_cluster: add seeds parameter to the add_server and servers_add
raft_server_with_timeouts: report the lost quorum
join_node_request_handler: add raft_timeout{} for start_operation
skip_mode: add platform_key
auth: use raft_timeout{}
raft_group0_client: add raft_timeout parameter
raft_group_registry: add group0_with_timeouts
utils: add composite_abort_source.hh
error_injection: move api registration to set_server_init
error_injection: add inject_parameter method
error_injection: move injection_name string into injection_shared_data
error_injection: pass injection parameters at startup
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter. fortunately, fmt v10 brings a builtin
formatter for classes derived from `std::exception`. but before
switching to {fmt} v10, and after dropping the `FMT_DEPRECATED_OSTREAM`
macro, we need to print out `std::runtime_error`. so far, we don't
have a shared place for a formatter for `std::runtime_error`, so we
are addressing the need on a case-by-case basis.
in this change, we just print it using `e.what()`. its behavior
is identical to what we have now.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#17954
In this commit we extend the timeout error message with
additional context - if we see that there is no quorum of
available nodes, we report this as the most likely
cause of the error.
We adjust the test by adding this new information to the
expected_error. We need the raft-group-registry-fd-threshold-in-ms
parameter to make the `_direct_fd` threshold less than
group0-raft-op-timeout-in-ms.
In this commit we add the raft_timeout parameter to the
start_operation and add_entry methods.
We fix compilation in default_authorizer.cc:
bind_front doesn't account for default parameter
values. We should use raft_timeout{} here, but this
is for another commit.
In this commit we add timeouts support to the raft groups
registry. We introduce the raft_server_with_timeouts
class, which wraps the raft::server and exposes its
interface with an additional raft_timeout parameter.
If it's set, the wrapper cancels the abort_source
after a certain amount of time. The value of the timeout
can be specified in the raft_timeout parameter,
or the default value can be set in the raft_server_with_timeouts
class constructor.
The raft_group_registry interface is extended with
get_server_with_timeouts(group_id) and group0_with_timeouts()
methods. They return an instance of raft_server_with_timeouts for
a specified group id or for group0. The timeout value for it is configured in
create_server_for_group0. It's one minute by default, can be overridden
for tests with group0-raft-op-timeout-in-ms parameter.
The new api allows the client to decide whether to use timeouts or not.
In subsequent commits we are going to review all group0 call sites
and add raft_timeout if that makes sense. The general principle is that
if the code is handling a client request and the client expects
a potential error, we use timeouts. We don't use timeouts for
background fibers (such as topology coordinator), since they won't
add much value. The only thing the background fiber can do
with a timeout is to retry, and this will have the same effect
as not having a timeout at all.
group0 operations are valid on shard 0 only. Assert that. We already do
that in the version of the function that takes an abort source.
Message-ID: <ZeCti70vrd7UFNim@scylladb.com>
This patch series makes all auth writes serialized via raft. Reads stay
eventually consistent for performance reasons. To make the transition to
the new code easier, data is stored in a newly created keyspace: system_auth_v2.
Internally the difference is that instead of executing CQL directly for
writes we generate mutations and then announce them via raft group0. Per
commit descriptions provide more implementation details.
Refs https://github.com/scylladb/scylladb/issues/16970
Fixes https://github.com/scylladb/scylladb/issues/11157
Closes scylladb/scylladb#16578
* github.com:scylladb/scylladb:
test: extend auth-v2 migration test to catch stale static
test: add auth-v2 migration test
test: add auth-v2 snapshot transfer test
test: auth: add tests for lost quorum and command splitting
test: pylib: disconnect driver before re-connection
test: adjust tests for auth-v2
auth: implement auth-v2 migration
auth: remove static from queries on auth-v2 path
auth: coroutinize functions in password_authenticator
auth: coroutinize functions in standard_role_manager
auth: coroutinize functions in default_authorizer
storage_service: add support for auth-v2 raft snapshots
storage_service: extract getting mutations in raft snapshot to a common function
auth: service: capture string_view by value
alternator: add support for auth-v2
auth: add auth-v2 write paths
auth: add raft_group0_client as dependency
cql3: auth: add a way to create mutations without executing
cql3: run auth DML writes on shard 0 and with raft guard
service: don't lose service_level_controller when bouncing client_state
auth: put system_auth and users consts in legacy namespace
cql3: parametrize keyspace name in auth related statements
auth: parametrize keyspace name in roles metadata helpers
auth: parametrize keyspace name in password_authenticator
auth: parametrize keyspace name in standard_role_manager
auth: remove redundant consts auth::meta::*::qualified_name
auth: parametrize keyspace name in default_authorizer
db: make all system_auth_v2 tables use schema commitlog
db: add system_auth_v2 tables
db: add system_auth_v2 keyspace
To make table modifications go via raft we need to publish
mutations. Currently many system tables (especially auth) use
CQL to generate table modifications. The added function is a missing
link that will allow a seamless transition of certain
system tables to raft.
When a node changes IP address we need to remove its old IP from `system.peers` and gossiper.
We do this in `sync_raft_topology_nodes` when the new IP is saved into `system.peers` to avoid losing the mapping if the node crashes between deleting and saving the new IP. We also handle the possible duplicates in this case by dropping them on the read path when the node is restarted.
The PR also fixes the problem with old IPs getting resurrected when a node changes its IP address.
The following scenario is possible: a node `A` changes its IP from `ip1` to `ip2` with a restart; other nodes are not yet aware of `ip2`, so they keep gossiping `ip1`. After the restart, `A` receives `ip1` in a gossip message and calls `handle_major_state_change`, since it considers it a new node. Then the `on_join` event is called on the gossiper notification handlers; we receive this event in `raft_ip_address_updater` and revert the IP of node A back to ip1.
To fix this we ensure that the new gossiper generation number is used when a node registers its IP address in `raft_address_map` at startup.
The `test_change_ip` is adjusted to ensure that the old IPs are properly removed in all cases, even if the node crashes.
Fixes #16886
Fixes #16691
Fixes #17199
Closes scylladb/scylladb#17162
* github.com:scylladb/scylladb:
test_change_ip: improve the test
raft_ip_address_updater: remove stale IPs from gossiper
raft_address_map: add my ip with the new generation
system_keyspace::update_peer_info: check ep and host_id are not empty
system_keyspace::update_peer_info: make host_id an explicit parameter
system_keyspace::update_peer_info: remove any_set flag optimisation
system_keyspace: remove duplicate ips for host_id
system_keyspace: peers table: use coroutines
storage_service::raft_ip_address_updater: log gossiper event name
raft topology: ip change: purge old IP
on_endpoint_change: coroutinize the lambda around sync_raft_topology_nodes
The following scenario is possible: a node A changes its IP
from ip1 to ip2 with a restart; other nodes are not yet aware of ip2,
so they keep gossiping ip1. After the restart, A receives
ip1 in a gossip message and calls handle_major_state_change,
since it considers it a new node. Then the on_join event is
called on the gossiper notification handlers; we receive
this event in raft_ip_address_updater and revert the IP
of node A back to ip1.
The essence of the problem is that we don't pass the proper
generation when we add ip2 as a local IP during initialization
when node A restarts, so the zero generation is used
in raft_address_map::add_or_update_entry and the gossiper
message overwrites ip2 with ip1.
In this commit we fix this problem by passing the new generation.
To do that we move the increment_and_get_generation call
from join_token_ring to scylla_main, so that we have a new generation
value before init_address_map is called.
Also, we remove the load_initial_raft_address_map function from
raft_group0, since it's redundant. The comment above its call site
says that it's needed to not miss gossiper updates, but
the function storage_service::init_address_map, where raft_address_map
is now initialized, is called before the gossiper is started. This
function does both - it loads the previously persisted host_id<->IP
mappings from system.local and subscribes to gossiper notifications,
so there is no room for races.
Note that this problem is less likely to reproduce with the
'raft topology: ip change: purge old IP' commit - other
nodes remove the old IP before it's sent back to the
just restarted node. This is also the reason why this
problem doesn't occur in gossiper mode.
Fixes scylladb/scylladb#17199
Pulling a snapshot of the raft topology is done via new rpc verb
(RAFT_PULL_TOPOLOGY_SNAPSHOT). If the recipient runs an older version of
scylla and does not understand the verb, sending it will result in an
error. We usually use cluster features to avoid such situations, but in
the case when a node joins the cluster, it doesn't have access to
features yet. Therefore, we need to enable pulling snapshots in two
situations:
- when the SUPPORTS_CONSISTENT_TOPOLOGY_CHANGES feature becomes enabled,
- in case when starting group 0 server when joining a cluster that uses
raft-based topology.
It tells whether the current node operates in recovery mode or
not. It will be vital for storage_service in determining which topology
operations to use at startup.
The `discover_group0` function returns only after it either finds a node
that belongs to some group 0, or learns that the current node is
supposed to create a new one. It will be very helpful to storage_service
in determining which topology mode to use.
This was previously done by `setup_group0`, which always was an
(indirect) caller of `discover_group0`. As we want to make
`discover_group0` public, it's more convenient for the callers if the
called method takes care of sanitizing the argument.
The goal is to make `discover_group0` public. The `my_id` argument was
always set to `this->load_my_id()`, so we can get rid of it and it will
make it more convenient to call `discover_group0` from the outside.
In a cluster with group 0 with snapshot at index 0 (such group 0 might
be established in a 5.2 cluster, then preserved once it upgrades to 5.4
or later), no snapshot transfer will be triggered when a node is
bootstrapped. This way the new node might not obtain the full schema, or
might obtain an incorrect schema, like in scylladb/scylladb#16683.
Simulate this scenario in a test case using the RECOVERY mode and error
injections. Check that the newly added logic for creating a new snapshot
if such situation is detected helps in this case.
The persisted snapshot index may be 0 if the snapshot was created in an
older version of Scylla, which means a snapshot transfer won't be
triggered to a bootstrapping node. Commands present in the log may not
cover all schema changes --- group 0 might have been created through the
upgrade procedure, on a cluster with existing schema. So a
deployment with an index=0 snapshot is broken and we need to fix it. We
can use the new `raft::server::trigger_snapshot` API for that.
Fixes scylladb/scylladb#16683
In maintenance mode, the node doesn't communicate with other nodes, so it doesn't
start or apply group0 operations. Users can still try to start one, e.g. by
changing the schema, and the node can't allow it.
Init _upgrade_state with recovery in the maintenance mode.
Throw an error if a group0 operation is started in maintenance mode.
We are going to remove the IP waiting loop from topology_state_load
in subsequent commits. An IP for a given host_id may change
after this function has been called by raft. This means we need
to subscribe to the gossiper notifications and call it later
with a new id<->ip mapping.
In this preparatory commit we move the existing address_map
update logic into storage_service so that in later commits
we can enhance it with topology_state_load call.