Commit Graph

376 Commits

Author SHA1 Message Date
Emil Maskovsky
b99d87863d raft: fix the shutdown phase being stuck
Some of the calls inside the `raft_group0_client::start_operation()`
method were missing the abort source parameter. This caused the repair
test to get stuck in the shutdown phase: the abort source had been
triggered, but the operations were not checking it.

This was in particular the case for operations that try to take
ownership of the raft group semaphore (`get_units(semaphore)`) - these
waits should be cancelled when the abort source is triggered.

This should fix the following tests that were failing in some percentage
of dtest runs (about 1-3 of 100):
* TestRepairAdditional::test_repair_kill_1
* TestRepairAdditional::test_repair_kill_3

Fixes scylladb/scylladb#19223

(cherry picked from commit 5dfc50d354)
2024-08-01 19:37:02 +02:00
Emil Maskovsky
0770069dda raft: use the abort source reference in raft group0 client interface
Most callers of the raft group0 client interface pass a real abort
source instance, so we can use an abort source reference in the client
interface. This change makes the code simpler and more consistent.

(cherry picked from commit 2dbe9ef2f2)
2024-08-01 19:36:00 +02:00
Gleb Natapov
45ff4d2c41 group0, topology coordinator: run group0 and the topology coordinator in gossiper scheduling group
Currently they both run in the streaming scheduling group, which may
become busy during repair/MV building and affect group0 functionality.
Move them to the gossiper group, where they should have more time to run.

Fixes #18863

(cherry picked from commit a74fbab99a)

Closes scylladb/scylladb#19175
2024-06-10 10:34:29 +02:00
Marcin Maliszkiewicz
cbf47319c1 db: auth: move auth tables to system keyspace
A separate keyspace that also behaves like a system keyspace brings
little benefit while creating compatibility problems such as schema
digest mismatches during rollback, so we decided to move the auth
tables into the system keyspace.

Fixes https://github.com/scylladb/scylladb/issues/18098

Closes scylladb/scylladb#18769

(cherry picked from commit 2ab143fb40)

[avi: adjust test/alternator/suite.yaml to reflect new keyspace]
2024-06-02 21:41:14 +03:00
Piotr Smaron
bd4b781dc8 New raft cmd for both schema & topo changes
Allows executing combined topology & schema mutations under a single RAFT command
2024-05-30 08:33:15 +03:00
Patryk Jędrzejczak
332bd8ea98 raft: raft_group_registry: start_server_for_group: catch and rethrow abort_requested_exception
If we initiate shutdown while starting the group 0 server,
we could catch `abort_requested_exception` in `start_server_for_group`
and call `on_internal_error`, making Scylla abort with a coredump.
This causes problems in tests that shut down bootstrapping nodes.

The `abort_requested_exception` can be thrown from
`gossiper::lock_endpoint` called in
`storage_service::topology_state_load`. So, the issue is new and
applies only to the raft-based topology. Hence, there is no need
to backport the patch.

Fixes scylladb/scylladb#17794
Fixes scylladb/scylladb#18197

Closes scylladb/scylladb#18569
2024-05-09 14:55:11 +02:00
Kamil Braun
03818c4aa9 direct_failure_detector: increase ping timeout and make it tunable
The direct failure detector design is simplistic. It sends pings
sequentially and times out listeners that reached the threshold (i.e.
didn't hear from a given endpoint for too long) in-between pings.

Given the sequential nature, the previous ping must finish so the next
ping can start. We timeout pings that take too long. The timeout was
hardcoded and set to 300ms. This is too low for wide-area setups --
latencies across the Earth can indeed go up to 300ms. Three consecutive
timed-out pings to a given node were sufficient for the Raft listener to
"mark server as down" (the listener used a threshold of 1s).

Increase the ping timeout to 600ms which should be enough even for
pinging the opposite side of Earth, and make it tunable.

Increase the Raft listener threshold from 1s to 2s. Without the
increased threshold, one timed-out ping would be enough to mark the
server as down. Increasing it to 2s requires 3 timed-out pings, which
makes it more robust in the presence of transient network hiccups.

In the future we'll most likely want to decrease the Raft listener
threshold again, if we use Raft for data path -- so leader elections
start quickly after leader failures. (Faster than 2s). To do that we'll
have to improve the design of the direct failure detector.

Ref: scylladb/scylladb#16410
Fixes: scylladb/scylladb#16607

---

I tested the change manually using `tc qdisc ... netem delay`, setting
network delay on local setup to ~300ms with jitter. Without the change,
the result is as observed in scylladb/scylladb#16410: interleaving
```
raft_group_registry - marking Raft server ... as dead for Raft groups
raft_group_registry - marking Raft server ... as alive for Raft groups
```
happening once every few seconds. The "marking as dead" happens whenever
we get 3 consecutive failed pings, which happens with a certain (high)
probability depending on the latency jitter. Then, as soon as we get a
successful ping, we mark the server back as alive.

With the change, the phenomenon no longer appears.

Closes scylladb/scylladb#18443
2024-05-07 23:40:23 +02:00
Benny Halevy
ebff5f5d70 everywhere: include seastar headers using angle brackets
seastar is an external library, therefore its headers
should be included using the system-include syntax.

Closes scylladb/scylladb#18513
2024-05-06 10:00:31 +03:00
Benny Halevy
890b890e36 storage_proxy: add mutate_locally(vector<frozen_mutation_and_schema>) method
Generalizing the ad-hoc implementation out of
group0_state_machine.write_mutations_to_database.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-02 19:42:58 +03:00
Benny Halevy
4ae5bbb058 raft: group0_state_machine: write_mutations_to_database: freeze mutations gently
write_mutations_to_database might need to handle
large mutations from system tables, so to prevent
reactor stalls, freeze the mutations gently
and call proxy.mutate_locally in parallel on
the individual frozen mutations, rather than
calling the vector<mutation> based entry point
that eventually freezes each mutation synchronously.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-02 19:37:06 +03:00
Benny Halevy
504a9ab897 raft: group0_state_machine: write_mutations_to_database: use to_mutation_gently
Prevent stalls coming from writing large mutations
like the ones seen with the test_add_many_nodes_under_load
dtest:
```
  ++[1#11/11 6%] addr=0x15408f6 total=33 count=1 avg=33:
  |              managed_bytes::managed_bytes at ././utils/managed_bytes.hh:284
  |              (inlined by) atomic_cell_or_collection::atomic_cell_or_collection at ./mutation/atomic_cell_or_collection.hh:25
  |              (inlined by) cell_and_hash::cell_and_hash at ./mutation/mutation_partition.hh:73
  |              (inlined by) compact_radix_tree::tree<cell_and_hash, unsigned int>::emplace<atomic_cell_or_collection, seastar::optimized_optional<cell_hash> > at ././utils/compact-radix-tree.hh:1809
  ++           - addr=0x1518bae:
  |              row::append_cell at ./mutation/mutation_partition.cc:1344
  ++           - addr=0x14acb23:
  |              partition_builder::accept_row_cell at ././partition_builder.hh:70
  ++           - addr=0x157a6a6:
  |              mutation_partition_view::do_accept<partition_builder>(column_mapping const&, partition_builder&) const::cell_visitor::accept_atomic_cell at ./mutation/mutation_partition_view.cc:218
  |              (inlined by) (anonymous namespace)::read_and_visit_row<mutation_partition_view::do_accept<partition_builder>(column_mapping const&, partition_builder&) const::cell_visitor>(ser::row_view, column_mapping const&, column_kind, partition_builder&&)::atomic_cell_or_collection_visitor::operator() at ./mutation/mutation_partition_view.cc:138
  |              (inlined by) boost::detail::variant::invoke_visitor<(anonymous namespace)::read_and_visit_row<mutation_partition_view::do_accept<partition_builder>(column_mapping const&, partition_builder&) const::cell_visitor>(ser::row_view, column_mapping const&, column_kind, partition_builder&&)::atomic_cell_or_collection_visitor const, false>::internal_visit<boost::variant<ser::live_cell_view, ser::expiring_cell_view, ser::dead_cell_view, ser::counter_cell_view, ser::unknown_variant_type>&> at /usr/include/boost/variant/variant.hpp:1028
  |              (inlined by) boost::detail::variant::visitation_impl_invoke_impl<boost::detail::variant::invoke_visitor<(anonymous namespace)::read_and_visit_row<mutation_partition_view::do_accept<partition_builder>(column_mapping const&, partition_builder&) const::cell_visitor>(ser::row_view, column_mapping const&, column_kind, partition_builder&&)::atomic_cell_or_collection_visitor const, false>, void*, boost::variant<ser::live_cell_view, ser::expiring_cell_view, ser::dead_cell_view, ser::counter_cell_view, ser::unknown_variant_type> > at /usr/include/boost/variant/detail/visitation_impl.hpp:117
  |              (inlined by) boost::detail::variant::visitation_impl_invoke<boost::detail::variant::invoke_visitor<(anonymous namespace)::read_and_visit_row<mutation_partition_view::do_accept<partition_builder>(column_mapping const&, partition_builder&) const::cell_visitor>(ser::row_view, column_mapping const&, column_kind, partition_builder&&)::atomic_cell_or_collection_visitor const, false>, void*, boost::variant<ser::live_cell_view, ser::expiring_cell_view, ser::dead_cell_view, ser::counter_cell_view, ser::unknown_variant_type>, boost::variant<boost::variant<ser::live_cell_view, ser::expiring_cell_view, ser::dead_cell_view, ser::counter_cell_view, ser::unknown_variant_type>, ser::collection_cell_view, ser::unknown_variant_type>::has_fallback_type_> at /usr/include/boost/variant/detail/visitation_impl.hpp:157
  |              (inlined by) boost::detail::variant::visitation_impl<mpl_::int_<0>, boost::detail::variant::visitation_impl_step<boost::mpl::l_iter<boost::mpl::l_item<mpl_::long_<3l>, boost::variant<ser::live_cell_view, ser::expiring_cell_view, ser::dead_cell_view, ser::counter_cell_view, ser::unknown_variant_type>, boost::mpl::l_item<mpl_::long_<2l>, ser::collection_cell_view, boost::mpl::l_item<mpl_::long_<1l>, ser::unknown_variant_type, boost::mpl::l_end> > > >, boost::mpl::l_iter<boost::mpl::l_end> >, boost::detail::variant::invoke_visitor<(anonymous namespace)::read_and_visit_row<mutation_partition_view::do_accept<partition_builder>(column_mapping const&, partition_builder&) const::cell_visitor>(ser::row_view, column_mapping const&, column_kind, partition_builder&&)::atomic_cell_or_collection_visitor const, false>, void*, boost::variant<boost::variant<ser::live_cell_view, ser::expiring_cell_view, ser::dead_cell_view, ser::counter_cell_view, ser::unknown_variant_type>, ser::collection_cell_view, ser::unknown_variant_type>::has_fallback_type_> at /usr/include/boost/variant/detail/visitation_impl.hpp:238
  |              (inlined by) boost::variant<boost::variant<ser::live_cell_view, ser::expiring_cell_view, ser::dead_cell_view, ser::counter_cell_view, ser::unknown_variant_type>, ser::collection_cell_view, ser::unknown_variant_type>::internal_apply_visitor_impl<boost::detail::variant::invoke_visitor<(anonymous namespace)::read_and_visit_row<mutation_partition_view::do_accept<partition_builder>(column_mapping const&, partition_builder&) const::cell_visitor>(ser::row_view, column_mapping const&, column_kind, partition_builder&&)::atomic_cell_or_collection_visitor const, false>, void*> at /usr/include/boost/variant/variant.hpp:2337
  |              (inlined by) boost::variant<boost::variant<ser::live_cell_view, ser::expiring_cell_view, ser::dead_cell_view, ser::counter_cell_view, ser::unknown_variant_type>, ser::collection_cell_view, ser::unknown_variant_type>::internal_apply_visitor<boost::detail::variant::invoke_visitor<(anonymous namespace)::read_and_visit_row<mutation_partition_view::do_accept<partition_builder>(column_mapping const&, partition_builder&) const::cell_visitor>(ser::row_view, column_mapping const&, column_kind, partition_builder&&)::atomic_cell_or_collection_visitor const, false> > at /usr/include/boost/variant/variant.hpp:2349
  |              (inlined by) boost::variant<boost::variant<ser::live_cell_view, ser::expiring_cell_view, ser::dead_cell_view, ser::counter_cell_view, ser::unknown_variant_type>, ser::collection_cell_view, ser::unknown_variant_type>::apply_visitor<(anonymous namespace)::read_and_visit_row<mutation_partition_view::do_accept<partition_builder>(column_mapping const&, partition_builder&) const::cell_visitor>(ser::row_view, column_mapping const&, column_kind, partition_builder&&)::atomic_cell_or_collection_visitor const> at /usr/include/boost/variant/variant.hpp:2393
  |              (inlined by) boost::apply_visitor<(anonymous namespace)::read_and_visit_row<mutation_partition_view::do_accept<partition_builder>(column_mapping const&, partition_builder&) const::cell_visitor>(ser::row_view, column_mapping const&, column_kind, partition_builder&&)::atomic_cell_or_collection_visitor, boost::variant<boost::variant<ser::live_cell_view, ser::expiring_cell_view, ser::dead_cell_view, ser::counter_cell_view, ser::unknown_variant_type>, ser::collection_cell_view, ser::unknown_variant_type>&> at /usr/include/boost/variant/detail/apply_visitor_unary.hpp:68
  |              (inlined by) (anonymous namespace)::read_and_visit_row<mutation_partition_view::do_accept<partition_builder>(column_mapping const&, partition_builder&) const::cell_visitor> at ./mutation/mutation_partition_view.cc:158
  |              (inlined by) mutation_partition_view::do_accept<partition_builder> at ./mutation/mutation_partition_view.cc:224
  ++           - addr=0x151234a:
  |              mutation_partition::apply at ./mutation/mutation_partition.cc:476
  ++           - addr=0x14e1103:
  |              canonical_mutation::to_mutation at ./mutation/canonical_mutation.cc:76
  ++           - addr=0x283f9ee:
  |              service::write_mutations_to_database at ./service/raft/group0_state_machine.cc:124
  ++           - addr=0x283f36c:
  |              service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_2::operator() at ./service/raft/group0_state_machine.cc:165
  ++           - addr=0x28395e3:
  |              std::__invoke_impl<seastar::future<void>, seastar::internal::variant_visitor<service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_0, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_1, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_2, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_3>, service::topology_change&> at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/invoke.h:61
  |              (inlined by) std::__invoke<seastar::internal::variant_visitor<service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_0, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_1, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_2, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_3>, service::topology_change&> at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/invoke.h:96
  |              (inlined by) std::__detail::__variant::__gen_vtable_impl<std::__detail::__variant::_Multi_array<std::__detail::__variant::__deduce_visit_result<seastar::future<void> > (*)(seastar::internal::variant_visitor<service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_0, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_1, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_2, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_3>&&, std::variant<service::schema_change, service::broadcast_table_query, service::topology_change, service::write_mutations>&)>, std::integer_sequence<unsigned long, 2ul> >::__visit_invoke at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/variant:1032
  |              (inlined by) std::__do_visit<std::__detail::__variant::__deduce_visit_result<seastar::future<void> >, seastar::internal::variant_visitor<service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_0, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_1, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_2, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_3>, std::variant<service::schema_change, service::broadcast_table_query, service::topology_change, service::write_mutations>&> at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/variant:1793
  |              (inlined by) std::visit<seastar::internal::variant_visitor<service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_0, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_1, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_2, service::group0_state_machine::merge_and_apply(service::group0_state_machine_merger&)::$_3>, std::variant<service::schema_change, service::broadcast_table_query, service::topology_change, service::write_mutations>&> at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/variant:1854
  |              (inlined by) service::group0_state_machine::merge_and_apply at ./service/raft/group0_state_machine.cc:156
  ++           - addr=0x284781e:
  |              service::group0_state_machine::apply at ./service/raft/group0_state_machine.cc:220
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-05-02 19:27:56 +03:00
Kefu Chai
8168f02550 raft_group_registry: do not use moved variable
clang-tidy warns like:
```
[628/713] Building CXX object service/CMakeFiles/service.dir/raft/raft_group_registry.cc.o
Warning: /home/runner/work/scylladb/scylladb/service/raft/raft_group_registry.cc:543:66: warning: 'id' used after it was moved [bugprone-use-after-move]
  543 |             auto& rate_limit = _rate_limits.try_get_recent_entry(id, std::chrono::minutes(5));
      |                                                                  ^
/home/runner/work/scylladb/scylladb/service/raft/raft_group_registry.cc:539:19: note: move occurred here
  539 |     auto dst_id = raft::server_id{std::move(id)};
      |                   ^
```

This is a false alarm: the type of `id` is actually `utils::UUID`,
which is a struct enclosing two `int64_t` variables, and we don't
define a move constructor for `utils::UUID`, so the value of `id`
is intact after being moved away. But it is still confusing at
first glance, as we are indeed referencing a moved-from variable.

So, to reduce the confusion and to silence the warning, let's
just not `std::move(id)`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18449
2024-05-01 09:45:12 +03:00
Kefu Chai
372a4d1b79 treewide: do not define FMT_DEPRECATED_OSTREAM
Since we no longer rely on FMT_DEPRECATED_OSTREAM to define the
fmt::formatter for us, let's stop defining `FMT_DEPRECATED_OSTREAM`.

in this change,

* utils: drop the range formatters in to_string.hh and to_string.cc, as
  we don't use them anymore; the tests for them in
  test/boost/string_format_test.cc are removed accordingly.
* utils: use fmt to print chunk_vector and small_vector, as
  we are no longer able to print the elements using operator<<
  after switching to {fmt} formatters.
* test/boost: specialize fmt::detail::is_std_string_like<bytes>.
  Due to a bug in {fmt} v9, {fmt} fails to format a range whose
  element type is `basic_sstring<uint8_t>`: it considers it
  a string-like type, but `basic_sstring<uint8_t>`'s char type
  is signed char, not char. This issue does not exist in {fmt} v10,
  so in this change we add a workaround to explicitly specialize
  the type trait and ensure that {fmt} formats this type using its
  `fmt::formatter` specialization instead of trying to format it
  as a string. Also, {fmt}'s generic range formatter calls the
  pair formatter's `set_brackets()` and `set_separator()` methods
  when printing the range, but the operator<<-based formatter does
  not provide these methods, so we have to include this change in
  the change switching to {fmt}; otherwise the change specializing
  `fmt::detail::is_std_string_like<bytes>` won't compile.
* test/boost: in tests, we use `BOOST_REQUIRE_EQUAL()` and its friends
  for comparing values, but without the operator<<-based formatters
  Boost.Test would not be able to print them. After removing
  the homebrew formatters, we need the generic
  `boost_test_print_type()` helper to do this job, so we are
  including `test_utils.hh` in tests so that we can print
  the formattable types.
* treewide: add `#include "utils/to_string.hh"` where
  `fmt::formatter<optional<>>` is used.
* configure.py: do not define FMT_DEPRECATED_OSTREAM
* cmake: do not define FMT_DEPRECATED_OSTREAM

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-04-19 22:57:36 +08:00
Kefu Chai
a439ebcfce treewide: include fmt/ranges.h and/or fmt/std.h
Before this change, we relied on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

In this change, we include `fmt/ranges.h` and/or `fmt/std.h`
for formatting container types like vector, map,
optional, and variant using {fmt} instead of the homebrew
operator<<-based formatter.
Together with the changes adding fmt::formatter specializations
and the changes using the ostream formatter explicitly, this
allows us to drop the `FMT_DEPRECATED_OSTREAM` macro.

Refs scylladb#13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-04-19 22:56:16 +08:00
Kamil Braun
33751f8f4e Merge 'raft topology: drop RAFT_PULL_TOPOLOGY_SNAPSHOT RPC' from Gleb
* 'gleb/raft_snapshot_rpc-v3' of github.com:scylladb/scylla-dev:
  raft topology: drop RAFT_PULL_TOPOLOGY_SNAPSHOT RPC
  Use correct limit for raft commands throughout the code.
2024-03-28 14:25:58 +01:00
Gleb Natapov
6e6aefc9ab raft topology: drop RAFT_PULL_TOPOLOGY_SNAPSHOT RPC
We now have a new, more generic RPC to pull group0 mutations: RAFT_PULL_SNAPSHOT.
Use it instead of the more specific RAFT_PULL_TOPOLOGY_SNAPSHOT.
2024-03-27 19:18:45 +02:00
Gleb Natapov
c1dcf0fae7 Use correct limit for raft commands throughout the code.
Raft uses the schema commitlog, so all its limits should be derived from
that commitlog's segment size, but many places used the regular commitlog
size to calculate the limits and did not do what they were really
supposed to be doing.
2024-03-27 19:16:09 +02:00
Gleb Natapov
6ab78e13c6 topology coordinator: propagate initial_token option to the coordinator
The patch propagates the initial_token option to the topology coordinator,
where it is added to the join request parameters.
2024-03-26 18:43:16 +02:00
Piotr Dulikowski
f23f8f81bf Merge 'Raft-based service levels' from Michał Jadwiszczak
This patch introduces raft-based service levels.

The differences from the current approach are:
- service levels are stored in `system.service_levels_v2`
- reads are executed with `LOCAL_ONE`
- writes are done via raft group0 operation

Service levels are migrated to v2 during the topology upgrade.
After the service levels are migrated, `key: service_level_v2_status; value: data_migrated` is written to the `system.scylla_local` table. If this row is present, the raft data accessor is created from the beginning, and it handles the recovery mode procedure (service levels will then be read from the v2 table even if consistent topology is disabled).

Fixes #17926

Closes scylladb/scylladb#16585

* github.com:scylladb/scylladb:
  test: test service levels v2 works in recovery mode
  test: add test for service levels migration
  test: add test for service levels snapshot
  test:topology: extract `trigger_snapshot` to utils
  main: create raft dda if sl data was migrated
  service:qos: store information about sl data migration
  service:qos: service levels migration
  main: assign standard service level DDA before starting group0
  service:qos: fix `is_v2()` method
  service:qos: add a method to upgrade data accessor
  test: add unit_test_raft_service_levels_accessor
  service:storage_service: add support for service levels raft snapshot
  service:qos: add abort_source for group0 operations
  service:qos: raft service level distributed data accessor
  service:qos: use group0_guard in data accessor
  cql3:statements: run service level statements on shard0 with raft guard
  test: fix overrides in unit_test_service_levels_accessor
  service:qos: fix indentation
  service:qos: coroutinize some of the methods
  db:system_keyspace: add `SERVICE_LEVELS_V2` table
  service:qos: extract common service levels' table functions
2024-03-22 11:51:53 +01:00
Kamil Braun
4359a1b460 Merge 'raft timeouts: better handling of lost quorum' from Petr Gusev
In this PR we add timeouts support to the raft groups registry. We
introduce the `raft_server_with_timeouts` class, which wraps the
`raft::server` and exposes its interface with an additional
`raft_timeout` parameter. If it's set, the wrapper cancels the
`abort_source` after a certain amount of time. The value of the timeout
can be specified either in the `raft_timeout` parameter, or a default
value can be set in the `raft_server_with_timeouts` class constructor.

The `raft_group_registry` interface is extended with
`group0_with_timeouts()` method. It returns an instance of
`raft_server_with_timeouts` for group0 raft server. The timeout value
for it is configured in `create_server_for_group0`. It's one minute by
default and can be overridden for tests with
`group0-raft-op-timeout-in-ms` parameter.

The new API allows the client to decide whether to use timeouts or not.
In this PR we review all the group0 call sites and add
`raft_timeout` where it makes sense. The general principle is that if the
code is handling a client request and the client expects a potential
error, we use timeouts. We don't use timeouts for background fibers
(such as topology coordinator), since they wouldn't add much value. The
only thing the background fiber can do with a timeout is to retry, and
this will have the same end effect as not having a timeout at all.

Fixes scylladb/scylladb#16604

Closes scylladb/scylladb#17590

* github.com:scylladb/scylladb:
  migration_manager: use raft_timeout{}
  storage_service::join_node_response_handler: use raft_timeout{}
  storage_service::start_upgrade_to_raft_topology: use raft_timeout{}
  storage_service::set_tablet_balancing_enabled: use raft_timeout{}
  storage_service::move_tablet: use raft_timeout{}
  raft_check_and_repair_cdc_streams: use raft_timeout{}
  raft_timeout: test that node operations fail properly
  raft_rebuild: use raft_timeout{}
  do_cluster_cleanup: use raft_timeout{}
  raft_initialize_discovery_leader: use raft_timeout{}
  update_topology_with_local_metadata: use with_timeout{}
  raft_decommission: use raft_timeout{}
  raft_removenode: use raft_timeout{}
  join_node_request_handler: add raft_timeout to make_nonvoters and add_entry
  raft_group0: make_raft_config_nonvoter: add raft_timeout parameter
  raft_group0: make_raft_config_nonvoter: add abort_source parameter
  manager_client: server_add with start=false shouldn't call driver_connect
  scylla_cluster: add seeds parameter to the add_server and servers_add
  raft_server_with_timeouts: report the lost quorum
  join_node_request_handler: add raft_timeout{} for start_operation
  skip_mode: add platform_key
  auth: use raft_timeout{}
  raft_group0_client: add raft_timeout parameter
  raft_group_registry: add group0_with_timeouts
  utils: add composite_abort_source.hh
  error_injection: move api registration to set_server_init
  error_injection: add inject_parameter method
  error_injection: move injection_name string into injection_shared_data
  error_injection: pass injection parameters at startup
2024-03-22 10:45:33 +01:00
Michał Jadwiszczak
8bbeea0169 service:storage_service: add support for service levels raft snapshot
Include mutations from `system.service_levels_v2` in `raft_snapshot`.
2024-03-21 23:14:57 +01:00
Kefu Chai
900b56b117 raft_group0: print runtime_error by printing e.what()
Before this change, we relied on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter. Fortunately, fmt v10 brings a builtin
formatter for classes derived from `std::exception`, but before
switching to {fmt} v10, and after dropping the `FMT_DEPRECATED_OSTREAM`
macro, we need to print out `std::runtime_error` ourselves. So far, we
don't have a shared place for a `std::runtime_error` formatter, so we
are addressing the need on a case-by-case basis.

In this change, we just print it using `e.what()`; its behavior
is identical to what we have now.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17954
2024-03-21 19:43:52 +02:00
Petr Gusev
0ad852e323 raft_group0: make_raft_config_nonvoter: add raft_timeout parameter
We'll use this parameter in subsequent commits.
2024-03-21 16:35:48 +04:00
Petr Gusev
ce7fb39750 raft_group0: make_raft_config_nonvoter: add abort_source parameter 2024-03-21 16:35:48 +04:00
Petr Gusev
99419d5964 raft_server_with_timeouts: report the lost quorum
In this commit we extend the timeout error message with
additional context - if we see that there is no quorum of
available nodes, we report this as the most likely
cause of the error.

We adjust the test by adding this new information to the
expected_error. We need `raft-group-registry-fd-threshold-in-ms`
to make the `_direct_fd` threshold less than
`group0-raft-op-timeout-in-ms`.
2024-03-21 16:35:48 +04:00
Petr Gusev
cebf87bf59 raft_group0_client: add raft_timeout parameter
In this commit we add the raft_timeout parameter to
the start_operation and add_entry methods.

We fix compilation in default_authorizer.cc:
`bind_front` doesn't account for default parameter
values. We should use raft_timeout{} here, but that
is for another commit.
2024-03-21 16:12:51 +04:00
Petr Gusev
3d1b94475f raft_group_registry: add group0_with_timeouts
In this commit we add timeouts support to the raft groups
registry. We introduce the raft_server_with_timeouts
class, which wraps the raft::server and exposes its
interface with an additional raft_timeout parameter.
If it's set, the wrapper cancels the abort_source
after a certain amount of time. The value of the timeout
can be specified in the raft_timeout parameter,
or a default value can be set in the raft_server_with_timeouts
class constructor.

The raft_group_registry interface is extended with
get_server_with_timeouts(group_id) and group0_with_timeouts()
methods. They return an instance of raft_server_with_timeouts for
a specified group id or for group0. The timeout value for it is configured in
create_server_for_group0. It's one minute by default, can be overridden
for tests with group0-raft-op-timeout-in-ms parameter.

The new api allows the client to decide whether to use timeouts or not.
In subsequent commits we are going to review all group0 call sites
and add raft_timeout if that makes sense. The general principle is that
if the code is handling a client request and the client expects
a potential error, we use timeouts. We don't use timeouts for
background fibers (such as topology coordinator), since they won't
add much value. The only thing the background fiber can do
with a timeout is to retry, and this will have the same effect
as not having a timeout at all.
2024-03-21 16:12:51 +04:00
Gleb Natapov
2b11842cb4 test: add test to check that address cannot expire between join request placement and its processing 2024-03-20 11:05:31 +02:00
Gleb Natapov
9651ae875f raft_group0: add modifiable_address_map() function
Provide access to the non-const address_map. We will need it later.
2024-03-19 13:34:41 +02:00
Gleb Natapov
af218d0063 raft_group0_client: assert that hold_read_apply_mutex is called on shard 0
group0 operations are valid on shard 0 only. Assert that. We already do
that in the version of the function that takes an abort source.

Message-ID: <ZeCti70vrd7UFNim@scylladb.com>
2024-03-18 16:20:41 +01:00
Kamil Braun
19b816bb68 Merge 'Migrate system_auth to raft group0' from Marcin Maliszkiewicz
This patch series makes all auth writes serialized via raft. Reads stay
eventually consistent for performance reasons. To make the transition to the
new code easier, data is stored in a newly created keyspace: system_auth_v2.

Internally the difference is that instead of executing CQL directly for
writes we generate mutations and then announce them via raft group0. The
per-commit descriptions provide more implementation details.

Refs https://github.com/scylladb/scylladb/issues/16970
Fixes https://github.com/scylladb/scylladb/issues/11157

Closes scylladb/scylladb#16578

* github.com:scylladb/scylladb:
  test: extend auth-v2 migration test to catch stale static
  test: add auth-v2 migration test
  test: add auth-v2 snapshot transfer test
  test: auth: add tests for lost quorum and command splitting
  test: pylib: disconnect driver before re-connection
  test: adjust tests for auth-v2
  auth: implement auth-v2 migration
  auth: remove static from queries on auth-v2 path
  auth: coroutinize functions in password_authenticator
  auth: coroutinize functions in standard_role_manager
  auth: coroutinize functions in default_authorizer
  storage_service: add support for auth-v2 raft snapshots
  storage_service: extract getting mutations in raft snapshot to a common function
  auth: service: capture string_view by value
  alternator: add support for auth-v2
  auth: add auth-v2 write paths
  auth: add raft_group0_client as dependency
  cql3: auth: add a way to create mutations without executing
  cql3: run auth DML writes on shard 0 and with raft guard
  service: don't loose service_level_controller when bouncing client_state
  auth: put system_auth and users consts in legacy namespace
  cql3: parametrize keyspace name in auth related statements
  auth: parametrize keyspace name in roles metadata helpers
  auth: parametrize keyspace name in password_authenticator
  auth: parametrize keyspace name in standard_role_manager
  auth: remove redundant consts auth::meta::*::qualified_name
  auth: parametrize keyspace name in default_authorizer
  db: make all system_auth_v2 tables use schema commitlog
  db: add system_auth_v2 tables
  db: add system_auth_v2 keyspace
2024-03-06 10:11:33 +01:00
Marcin Maliszkiewicz
5a6d4dbc37 storage_service: add support for auth-v2 raft snapshots
This patch adds a new RPC for pulling a snapshot of the auth tables.
2024-03-01 16:25:14 +01:00
Marcin Maliszkiewicz
bd444ed6f1 cql3: auth: add a way to create mutations without executing
To make table modifications go via raft we need to publish
mutations. Currently many system tables (especially auth) use
CQL to generate table modifications. The added function is the missing
link that will allow a seamless transition of certain
system tables to raft.
2024-03-01 16:25:14 +01:00
Gleb Natapov
9847e272f9 raft_group0_client: assert that hold_read_apply_mutex is called on shard 0
group0 operations are valid on shard 0 only. Assert that.
2024-02-29 12:39:48 +02:00
Kamil Braun
50ebce8acc Merge 'Purge old ip on change' from Petr Gusev
When a node changes IP address we need to remove its old IP from `system.peers` and gossiper.

We do this in `sync_raft_topology_nodes` when the new IP is saved into `system.peers` to avoid losing the mapping if the node crashes between deleting and saving the new IP. We also handle the possible duplicates in this case by dropping them on the read path when the node is restarted.

The PR also fixes the problem with old IPs getting resurrected when a node changes its IP address.
The following scenario is possible: a node `A` changes its IP from `ip1` to `ip2` with a restart; other nodes are not yet aware of `ip2`, so they keep gossiping `ip1`. After the restart, `A` receives `ip1` in a gossip message and calls `handle_major_state_change`, since it considers it a new node. Then the `on_join` event is called on the gossiper notification handlers; we receive this event in `raft_ip_address_updater`, which reverts the IP of node `A` back to `ip1`.

To fix this we ensure that the new gossiper generation number is used when a node registers its IP address in `raft_address_map` at startup.

The `test_change_ip` is adjusted to ensure that the old IPs are properly removed in all cases, even if the node crashes.

Fixes #16886
Fixes #16691
Fixes #17199

Closes scylladb/scylladb#17162

* github.com:scylladb/scylladb:
  test_change_ip: improve the test
  raft_ip_address_updater: remove stale IPs from gossiper
  raft_address_map: add my ip with the new generation
  system_keyspace::update_peer_info: check ep and host_id are not empty
  system_keyspace::update_peer_info: make host_id an explicit parameter
  system_keyspace::update_peer_info: remove any_set flag optimisation
  system_keyspace: remove duplicate ips for host_id
  system_keyspace: peers table: use coroutines
  storage_service::raft_ip_address_updater: log gossiper event name
  raft topology: ip change: purge old IP
  on_endpoint_change: coroutinize the lambda around sync_raft_topology_nodes
2024-02-15 17:40:29 +01:00
Petr Gusev
4b33ba2894 raft_address_map: add my ip with the new generation
The following scenario is possible: a node A changes its IP
from ip1 to ip2 with a restart; other nodes are not yet aware of ip2,
so they keep gossiping ip1. After the restart, A receives
ip1 in a gossip message and calls handle_major_state_change,
since it considers it a new node. Then the on_join event is
called on the gossiper notification handlers; we receive
this event in raft_ip_address_updater, which reverts the IP
of node A back to ip1.

The essence of the problem is that we don't pass the proper
generation when we add ip2 as a local IP during initialization
when node A restarts, so the zero generation is used
in raft_address_map::add_or_update_entry and the gossiper
message overwrites ip2 with ip1.

In this commit we fix this problem by passing the new generation.
To do that we move the increment_and_get_generation call
from join_token_ring to scylla_main, so that we have a new generation
value before init_address_map is called.

Also we remove the load_initial_raft_address_map function from
raft_group0 since it's redundant. The comment above its call site
says that it's needed to not miss gossiper updates, but
the function storage_service::init_address_map, where raft_address_map
is now initialized, is called before the gossiper is started. This
function does both: it loads the previously persisted host_id<->IP
mappings from system.local and subscribes to gossiper notifications,
so there is no room for races.

Note that this problem is less likely to reproduce with the
'raft topology: ip change: purge old IP' commit - other
nodes remove the old IP before it's sent back to the
just restarted node. This is also the reason why this
problem doesn't occur in gossiper mode.

fixes scylladb/scylladb#17199
2024-02-15 13:21:04 +04:00
Gleb Natapov
f21a3b4ca5 raft_group0: add make_nonvoters function that can make multiple node non voters simultaneously 2024-02-13 16:15:35 +02:00
Piotr Dulikowski
07aba3abc4 group0_state_machine: pull snapshot after raft topology feature enabled
Pulling a snapshot of the raft topology is done via a new RPC verb
(RAFT_PULL_TOPOLOGY_SNAPSHOT). If the recipient runs an older version of
scylla and does not understand the verb, sending it will result in an
error. We usually use cluster features to avoid such situations, but in
the case when a node joins the cluster, it doesn't have access to
features yet. Therefore, we need to enable pulling snapshots in two
situations:

- when the SUPPORTS_CONSISTENT_TOPOLOGY_CHANGES feature becomes enabled,
- when starting the group 0 server while joining a cluster that already
  uses raft-based topology.
2024-02-08 19:12:28 +01:00
Piotr Dulikowski
5392bac85b raft_group0: expose link to the upgrade doc in the header
So that it can be referenced from other files.
2024-02-08 19:12:28 +01:00
Piotr Dulikowski
a55797fd41 topology_coordinator: implement core upgrade logic
Implement topology coordinator's logic responsible for building the
group 0 state related to topology.
2024-02-08 19:12:28 +01:00
Piotr Dulikowski
f6b303d589 raft_group0_client: add in_recovery method
It tells whether the current node currently operates in recovery mode or
not. It will be vital for storage_service in determining which topology
operations to use at startup.
2024-02-07 10:02:01 +01:00
Piotr Dulikowski
7601f40bf8 storage_service: introduce join_node_query verb
When a node joins an existing cluster, it will ask a node that already
belongs to the cluster about which topology operations to use when
joining.
2024-02-07 10:02:00 +01:00
Piotr Dulikowski
bab5d3bbe5 raft_group0: make discover_group0 public
The `discover_group0` function returns only after it either finds a node
that belongs to some group 0, or learns that the current node is
supposed to create a new one. It will be very helpful to storage_service
in determining which topology mode to use.
2024-02-07 10:00:16 +01:00
Piotr Dulikowski
367df7322e raft_group0: filter current node's IP in discover_group0
This was previously done by `setup_group0`, which was always an
(indirect) caller of `discover_group0`. As we want to make
`discover_group0` public, it's more convenient for the callers if the
called method takes care of sanitizing the argument.
2024-02-07 10:00:16 +01:00
Piotr Dulikowski
86e4a59d5b raft_group0: remove my_id arg from discover_group0
The goal is to make `discover_group0` public. The `my_id` argument was
always set to `this->load_my_id()`, so we can get rid of it and it will
make it more convenient to call `discover_group0` from the outside.
2024-02-07 10:00:16 +01:00
Kamil Braun
57d5aa5a68 test: add test for fixing a broken group 0 snapshot
In a cluster whose group 0 has a snapshot at index 0 (such a group 0 might
be established in a 5.2 cluster, then preserved once it upgrades to 5.4
or later), no snapshot transfer will be triggered when a node is
bootstrapped. This way the new node might not obtain the full schema, or
might obtain an incorrect schema, like in scylladb/scylladb#16683.

Simulate this scenario in a test case using the RECOVERY mode and error
injections. Check that the newly added logic for creating a new snapshot
if such situation is detected helps in this case.
2024-01-30 16:44:01 +01:00
Kamil Braun
98d75c65af raft_group0: trigger snapshot if existing snapshot index is 0
The persisted snapshot index may be 0 if the snapshot was created in
older version of Scylla, which means snapshot transfer won't be
triggered to a bootstrapping node. Commands present in the log may not
cover all schema changes --- group 0 might have been created through the
upgrade procedure, on a cluster with existing schema. So a
deployment with index=0 snapshot is broken and we need to fix it. We can
use the new `raft::server::trigger_snapshot` API for that.

Fixes scylladb/scylladb#16683
2024-01-30 16:35:54 +01:00
Mikołaj Grzebieluch
c08266cfe5 raft_group0_client: disable group0 operations in the maintenance mode
In maintenance mode, the node doesn't communicate with other nodes, so it
neither starts nor applies group0 operations. Users can still try to start
one, e.g. to change the schema, and the node must reject that.

Init _upgrade_state with recovery in maintenance mode.
Throw an error if a group0 operation is started in maintenance mode.
2024-01-25 15:27:53 +01:00
Gleb Natapov
1c18476385 storage_service: topology coordinator: update topology_requests table with requests progress
Make the topology coordinator update a request's status in the topology_requests table as it changes.
2024-01-16 15:35:18 +02:00
Petr Gusev
15b8e565ed address_map: move gossiper subscription logic into storage_service
We are going to remove the IP waiting loop from topology_state_load
in subsequent commits. An IP for a given host_id may change
after this function has been called by raft. This means we need
to subscribe to the gossiper notifications and call it later
with a new id<->ip mapping.

In this preparatory commit we move the existing address_map
update logic into storage_service so that in later commits
we can enhance it with topology_state_load call.
2024-01-12 15:37:50 +04:00