Commit Graph

809 Commits

Author SHA1 Message Date
Kamil Braun
a4d1065628 api: move reload_raft_topology_state implementation inside storage_service
In later commit we'll want to access more `storage_service` internals
in the API's implementation (namely, `_abort_source`)

Also moving the implementation there allows making
`service::topology_transition()` private again (it was made public in
992f1327d3 only for this API
implementation)
2024-09-03 15:52:03 +02:00
Raphael S. Carvalho
26facd807e storage_service: avoid processing same table unnecessarily in split monitor
If there's a token metadata for a given table, and it is in split mode,
it will be registered such that split monitor can look at it, for
example, to start split work, or do nothing if table completed it.

during topology change, e.g. drain, split is stalled since it cannot
take over the state machine.
It was noticed that the log is being spammed with a message saying the
table completed split work, since every tablet metadata update, means
waking up the monitor on behalf of a table. So it makes sense to
demote the logging level to debug. That persists until drain completes
and split can finally complete.

Another thing that was noticed is that during drain, a table can be
submitted for processing faster than the monitor can handle, so the
candidate queue may end up with multiple duplicated entries for same
table, which means unnecessary work. That is fixed by using a
sequenced set, which keeps the current FIFO behavior.

Fixes #20339.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#20029
2024-08-29 19:38:43 +03:00
Patryk Jędrzejczak
c25eefe217 storage_service: rename join_token_ring to join_topology
After introducing zero-token nodes that call join_token_ring but do
not join the ring, the join_token_ring name does not make much sense.
2024-08-29 10:37:07 +02:00
Benny Halevy
686a8f2939 abstract_replication_strategy: make get_ranges async
To prevent stalls due to large number of tokens.
For example, large cluster with say 70 nodes can have
more than 16K tokens.

Fixes #19757

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-08-25 10:57:34 +03:00
Benny Halevy
8b1877f3ca Add and use utils::optional_param to pass source_dc
Clearly indicate if a source_dc is provided,
and if so, was it explicitly given by the user,
or was implicitly selected by scylla.

This will become useful in the next patches
that will use that to either reject the operation
if it's unsafe to use the source_dc and the dc was
explicitly given by the user, or whether
to fallback to using all nodes otherwise.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-08-19 17:13:54 +03:00
Botond Dénes
5bff422b54 service/storage_service: load_tablet_metadata(): add hint parameter
Allowing for reloading only those parts of the tablet metadata that were
actually changed.
2024-08-11 09:53:19 -04:00
Botond Dénes
2cec0d8dd1 service/migration_listener: update_tablet_metadata(): add hint parameter
The hint contains information related to what exactly changed, allowing
listeners to do partial updates, instead of reloading all metadata on
each notification.
2024-08-11 09:53:19 -04:00
Botond Dénes
806ec3244a service/storage_service: topology_state_load(): allow providing change hint
So that when reloading state from disk, only changed parts are reloaded
instead of all. For now, only tablets have hints implemented.
2024-08-11 09:53:18 -04:00
Michał Jadwiszczak
5f8132c13c service/raft/group0_state_machine: update effective service levels cache
Updates to `system.role_members` and `system.role_attributes` affect
effective service levels cache, so applying mutations to those tables
should reload the effective SL cache.
2024-08-08 10:42:09 +02:00
Piotr Dulikowski
44f327675d Merge 'Remove gossiper argument from storage_service::join_cluster()' from Pavel Emelyanov
It's only needed to start hints via proxy, but proxy can do it without gossiper argument

Closes scylladb/scylladb#19894

* github.com:scylladb/scylladb:
  storage_service: Remote gossiper argument from join_cluster()
  proxy: Use remote gossiper to start hints resource manager
  hints: Const-ify gossiper references and anchor pointers
2024-08-01 10:18:14 +02:00
Pavel Emelyanov
9214aecbe7 storage_service: Remove orphan forward declaration of a method
The start_sys_dist_ks() itself was removed by bc051387c5

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#19928
2024-07-30 16:17:49 +03:00
Pavel Emelyanov
aaad2bbeaf storage_service: Remote gossiper argument from join_cluster()
This pointer was only needed to pull all the way down the hints resource
manager start() method. It's no longer needed for that.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-07-26 16:29:58 +03:00
Aleksandra Martyniuk
8e56913fdf service: node_ops: keep node ops module in storage service
Keep task manager node ops module in storage service. It will be
used to create and manage tasks related to topology changes.

The module is created and registered in storage service constructor.
In storage_service::stop() the module is stopped and so all the remaining
tasks would be unregistered immediately after they are finished.
2024-07-23 13:35:01 +02:00
Aleksandra Martyniuk
b97a348361 node_ops: implement node_ops_virtual_task methods 2024-07-23 13:35:01 +02:00
Aleksandra Martyniuk
94282b5214 db: service: modify methods to get topology_requests data
Modify get_topology_request_state (and wait_for_topology_request_completion),
so that it doesn't call on_internal_error when request_id isn't
in the topology_requests table if require_entry == false.

Add other methods to get topology request entry.
2024-07-23 13:35:01 +02:00
Aleksandra Martyniuk
91fbfbf98a node_ops: add task manager module and node_ops_virtual_task
Add task manager node ops module and node_ops_virtual_task.

Some methods will be implemented in later patches.
2024-07-23 13:35:01 +02:00
Calle Wilund
de728958d1 storage_service: Make isolate guard non-static and add test accessor
Makes storage service isolate repeatable in same process and more testable.
Note, since the test var now is shard-local we need to check twice: once
on error, once on reaching shard zero for actual shutdown.
2024-07-17 09:36:28 +00:00
Piotr Dulikowski
188b4ac0fc Merge 'service_level_controller: update configuration on raft change' from Michał Jadwiszczak
This patch is a follow-up to scylladb/scylladb#16585.

Once we have service levels on raft, we can get rid of update loop, which updates the configuration in a configured interval (default is 10s).
Instead, this PR introduces methods to `group0_state_machine` which look through table ids in mutations in `write_mutation` and update submodules based on that ids.

Fixes: scylladb/scylladb#18060

Closes scylladb/scylladb#18758

* github.com:scylladb/scylladb:
  test: remove `sleep()`s which were required to reload service levels configuration
  test/cql_test_env: remove unit test service levels data accessors
  service/storage_service: reload SL cache on topology_state_load()
  service/qos/service_level_controller: move semaphore breaking to stop
  service/qos/service_level_controller: maybe start and stop legacy update loop
  service/qos/service_level_controller: make update loop legacy
  raft/group0_state_machine: update submodules based on table_id
  service/storage_service: add a proxy method to reload sl cache
2024-07-11 16:18:48 +02:00
Michał Jadwiszczak
b61047a3f8 service/storage_service: add a proxy method to reload sl cache
In this series of patches, we want to reload service levels cache
when any changes to SL table are applied.

Firstly we need to have a way to trigger reload of the cache from
`group0_state_machines`.
To not introduce another dependency, we can use `storage_service` (which
has access to SL controller) and add a proxy method to it.
2024-07-10 10:23:04 +02:00
Kefu Chai
79ffde063a service: move storage_service::~storage_service() into .cc
as repair/repair.cc has the complete definition of node_ops_meta_data, the
compiler can default-generate the destructor of `storage_service` with the necessary
information. otherwise, clang-19 would fail to build, like:

```
FAILED: repair/CMakeFiles/repair.dir/Dev/repair.cc.o
/home/kefu/.local/bin/clang++ -DDEVEL -DFMT_SHARED -DSCYLLA_BUILD_MODE=dev -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SSTRING -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Dev\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -isystem /home/kefu/dev/scylladb/abseil -O2 -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere -U_FORTIFY_SOURCE -Werror=unused-result -fstack-clash-protection -MD -MT repair/CMakeFiles/repair.dir/Dev/repair.cc.o -MF repair/CMakeFiles/repair.dir/Dev/repair.cc.o.d -o repair/CMakeFiles/repair.dir/Dev/repair.cc.o -c /home/kefu/dev/scylladb/repair/repair.cc
In file included from /home/kefu/dev/scylladb/repair/repair.cc:9:
In file included from /home/kefu/dev/scylladb/repair/repair.hh:11:
In file included from /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/unordered_map:41:
In file included from /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/unordered_map.h:33:
In file included from /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/hashtable.h:35:
In file included from /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/hashtable_policy.h:34:
In file included from /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/tuple:38:
/usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_pair.h:291:11: error: field has incomplete type 'service::node_ops_meta_data'
  291 |       _T2 second;                ///< The second member
      |           ^
/usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/ext/aligned_buffer.h:93:28: note: in instantiation of template class 'std::pair<const utils::tagged_uuid<node_ops_id_tag>, service::node_ops_meta_data>' requested here
   93 |     : std::aligned_storage<sizeof(_Tp), __alignof__(_Tp)>
      |                            ^
/usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/hashtable_policy.h:334:43: note: in instantiation of template class '__gnu_cxx::__aligned_buffer<std::pair<const utils::tagged_uuid<node_ops_id_tag>, service::node_ops_meta_data>>' requested here
  334 |       __gnu_cxx::__aligned_buffer<_Value> _M_storage;
      |                                           ^
/usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/hashtable_policy.h:373:7: note: in instantiation of template class 'std::__detail::_Hash_node_value_base<std::pair<const utils::tagged_uuid<node_ops_id_tag>, service::node_ops_meta_data>>' requested here
  373 |     : _Hash_node_value_base<_Value>
      |       ^
/usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/hashtable.h:1662:21: note: in instantiation of template class 'std::__detail::_Hash_node_value<std::pair<const utils::tagged_uuid<node_ops_id_tag>, service::node_ops_meta_data>, false>' requested here
 1662 |                         ._M_bucket_index(declval<const __node_value_type&>(),
      |                                          ^
/usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/unordered_map.h:109:11: note: in instantiation of member function 'std::_Hashtable<utils::tagged_uuid<node_ops_id_tag>, std::pair<const utils::tagged_uuid<node_ops_id_tag>, service::node_ops_meta_data>, std::allocator<std::pair<const utils::tagged_uuid<node_ops_id_tag>, service::node_ops_meta_data>>, std::__detail::_Select1st, std::equal_to<utils::tagged_uuid<node_ops_id_tag>>, std::hash<utils::tagged_uuid<node_ops_id_tag>>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true>>::~_Hashtable' requested here
  109 |     class unordered_map
      |           ^
/home/kefu/dev/scylladb/service/storage_service.hh:109:7: note: forward declaration of 'service::node_ops_meta_data'
  109 | class node_ops_meta_data;
      |       ^
In file included from /home/kefu/dev/scylladb/repair/repair.cc:9:
In file included from /home/kefu/dev/scylladb/repair/repair.hh:11:
In file included from /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/unordered_map:41:
In file included from /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/unordered_map.h:33:
In file included from /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/hashtable.h:35:
In file included from /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/hashtable_policy.h:34:
In file included from /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/tuple:38:
In file included from /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_pair.h:60:
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-07-10 12:52:51 +08:00
Kamil Braun
57e810c852 Merge 'Serialize repair with tablet migration' from Tomasz Grabiec
We want to exclude repair with tablet migrations to avoid races
between repair reads and writes with replica movement. Repair is not
prepared to handle topology transitions in the middle.

One reason why it's not safe is that repair may successfully write to
a leaving replica post streaming phase and consider all replicas to be
repaired, but in fact they are not, the new replica would not be
repaired.

Other kinds of races could result in repair failures. If repair writes
to a leaving replica which was already cleaned up, such writes will
fail, causing repair to fail.

Excluding works by keeping effective_replication_map_ptr in a version
which doesn't have table's tablets in transitions. That prevents later
transitions from starting because topology coordinator's barrier will
wait for that erm before moving to a stage later than
allow_write_both_read_old, so before any requests start using the new
topology. Also, if transitions are already running, repair waits for
them to finish.

A blocked tablet migration (e.g. due to down node) will block repair,
whereas before it would fail. Once admin resolves the cause of blocked migration,
repair will continue.

Fixes #17658.
Fixes #18561.

Closes scylladb/scylladb#18641

* github.com:scylladb/scylladb:
  test: pylib: Do not block async reactor while removing directories
  repair: Exclude tablet migrations with tablet repair
  repair_service: Propagate topology_state_machine to repair_service
  main, storage_service: Move topology_state_machine outside storage_service
  storage_srvice, toplogy: Extract topology_state_machine::await_quiesced()
  tablet_scheduler: Make disabling of balancing interrupt shuffle mode
  tablet_scheduler: Log whether balancing is considered as enabled
2024-06-06 11:27:03 +02:00
Tomasz Grabiec
c45ce41330 main, storage_service: Move topology_state_machine outside storage_service
It will be propagated to repair_service to avoid cyclic dependency:

storage_service <-> repair_service
2024-06-05 16:11:22 +02:00
Pavel Emelyanov
9292d326b7 storage_service: Make register_protocol_server() start the server
After a protocol server is registered, it can be instantly started by
the main code. It makes sense to generalize this sequence by teaching
register_protocol_server() start it.

For now it's a no-op change, as "start_instantly" is false by default,
but next patches will make use of it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-06-03 12:12:03 +03:00
Pavel Emelyanov
2aab9f6340 storage_service: Turn register_protocol_server() async method
To make the next patch shorter

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-06-03 12:12:03 +03:00
Pavel Emelyanov
eb033e3c5f storage_service: Outline register_protocol_server()
To make next patch shorter

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-06-03 12:12:03 +03:00
Piotr Smaron
39181c4bf2 Return response only when tablets are reallocated
Up until now we waited until mutations are in place and then returned
directly to the caller of the ALTER statement, but that doesn't imply
that tablets were deleted/created, so we must wait until the whole
processing is done and return only then.
2024-05-28 13:56:46 +02:00
Piotr Smaron
fbd75c5c06 Implement ALTER tablets KEYSPACE statement support
This commit adds support for executing ALTER KS for keyspaces with
tablets and utilizes all the previous commits.
The ALTER KS is handled in alter_keyspace_statement, where a global
topology request in generated with data attached to system.topology
table. Then, once topology state machine is ready, it starts to handle
this global topology event, which results in producing mutations
required to change the schema of the keyspace, delete the
system.topology's global req, produce tablets mutations and additional
mutations for a table tracking the lifetime of the whole req. Tracking
the lifetime is necessary to not return the control to the user too
early, so the query processor only returns the response while the
mutations are sent.
2024-05-28 13:56:42 +02:00
Piotr Smaron
6fd0a49b63 Allow query_processor to check if global topo queue is empty
With current implementation only 1 global topo req can be executed at a
time, so when ALTER KS is executed, we'll have to check if any other
global topo req is ongoing and fail the req if that's the case.
2024-05-28 13:55:11 +02:00
Avi Kivity
52fe351c31 Merge 'Balance tablets within nodes (intra-node migration)' from Tomasz Grabiec
This is needed to avoid severe imbalance between shards which can
happen when some table grows and is split. The inter-node balance can
be equal, so inter-node migration cannot fix the imbalance. Also, if RF=N
then there is not even a possibility of moving tablets around to fix the imbalance.
The only way to bring the system to balance is to move tablets within the nodes.

The system is not prepared for intra-node migration currently. Request coordination
is host-based, while for intra-node migration it should be (also) shard-based.
The solution employed here is to keep the coordination between nodes as-is,
and for intra-node migration storage_proxy-level coordinator is not aware of
the migration (no pending host). The replica-side request handler will be a
second-level coordinator which routes requests to shards, similar to how
the first-level coordinator routes them to hosts.

Tablet sharder is adjusted to handle intra-migration where a tablet
can have two replicas on the same host. For reads, sharder uses the
read selector to resolve the conflict. For writes, the write selector
is used.

The old shard_of() API is kept to represent shard for reads, and new
method is introduced to query the shards for writing:
shard_for_writes(). All writers should be switched to that API, which
is not done in this patch yet.

The request handler on replica side acts as a second-level
coordinator, using sharder to determine routing to shards. A given
sharder has a scope of a single topology version, a single
effective_replication_map_ptr, which should be kept alive during
writes.

perf-simple-query test results show no signs of regression:

Command: perf-simple-query -c1 -m1G --write --tablets --duration=10

Before:

> 83294.81 tps ( 59.5 allocs/op,  14.3 tasks/op,   53725 insns/op,        0 errors)
> 87756.72 tps ( 59.5 allocs/op,  14.3 tasks/op,   54049 insns/op,        0 errors)
> 86428.47 tps ( 59.6 allocs/op,  14.3 tasks/op,   54208 insns/op,        0 errors)
> 86211.38 tps ( 59.7 allocs/op,  14.3 tasks/op,   54219 insns/op,        0 errors)
> 86559.89 tps ( 59.6 allocs/op,  14.3 tasks/op,   54188 insns/op,        0 errors)
> 86609.39 tps ( 59.6 allocs/op,  14.3 tasks/op,   54117 insns/op,        0 errors)
> 87464.06 tps ( 59.5 allocs/op,  14.3 tasks/op,   54039 insns/op,        0 errors)
> 86185.43 tps ( 59.6 allocs/op,  14.3 tasks/op,   54169 insns/op,        0 errors)
> 86254.71 tps ( 59.6 allocs/op,  14.3 tasks/op,   54139 insns/op,        0 errors)
> 83395.35 tps ( 60.2 allocs/op,  14.4 tasks/op,   54693 insns/op,        0 errors)
>
> median 86428.47 tps ( 59.6 allocs/op,  14.3 tasks/op,   54208 insns/op,        0 errors)
> median absolute deviation: 243.04
> maximum: 87756.72
> minimum: 83294.81
>

After:

> 85523.06 tps ( 59.5 allocs/op,  14.3 tasks/op,   53872 insns/op,        0 errors)
> 89362.47 tps ( 59.6 allocs/op,  14.3 tasks/op,   54226 insns/op,        0 errors)
> 88167.55 tps ( 59.7 allocs/op,  14.3 tasks/op,   54400 insns/op,        0 errors)
> 87044.40 tps ( 59.7 allocs/op,  14.3 tasks/op,   54310 insns/op,        0 errors)
> 88344.50 tps ( 59.6 allocs/op,  14.3 tasks/op,   54289 insns/op,        0 errors)
> 88355.06 tps ( 59.6 allocs/op,  14.3 tasks/op,   54242 insns/op,        0 errors)
> 88725.46 tps ( 59.6 allocs/op,  14.3 tasks/op,   54230 insns/op,        0 errors)
> 88640.08 tps ( 59.6 allocs/op,  14.3 tasks/op,   54210 insns/op,        0 errors)
> 90306.31 tps ( 59.4 allocs/op,  14.3 tasks/op,   54043 insns/op,        0 errors)
> 87343.62 tps ( 59.8 allocs/op,  14.3 tasks/op,   54496 insns/op,        0 errors)
>
> median 88355.06 tps ( 59.6 allocs/op,  14.3 tasks/op,   54242 insns/op,        0 errors)
> median absolute deviation: 1007.41
> maximum: 90306.31
> minimum: 85523.06

Command (reads): perf-simple-query -c1 -m1G  --tablets --duration=10

Before:

> 95860.18 tps ( 63.1 allocs/op,  14.1 tasks/op,   42476 insns/op,        0 errors)
> 97537.69 tps ( 63.1 allocs/op,  14.1 tasks/op,   42454 insns/op,        0 errors)
> 97549.23 tps ( 63.1 allocs/op,  14.1 tasks/op,   42470 insns/op,        0 errors)
> 97511.29 tps ( 63.1 allocs/op,  14.1 tasks/op,   42470 insns/op,        0 errors)
> 97227.32 tps ( 63.1 allocs/op,  14.1 tasks/op,   42471 insns/op,        0 errors)
> 94031.94 tps ( 63.1 allocs/op,  14.1 tasks/op,   42441 insns/op,        0 errors)
> 96978.04 tps ( 63.1 allocs/op,  14.1 tasks/op,   42462 insns/op,        0 errors)
> 96401.70 tps ( 63.1 allocs/op,  14.1 tasks/op,   42473 insns/op,        0 errors)
> 96573.77 tps ( 63.1 allocs/op,  14.1 tasks/op,   42440 insns/op,        0 errors)
> 96340.54 tps ( 63.1 allocs/op,  14.1 tasks/op,   42468 insns/op,        0 errors)
>
> median 96978.04 tps ( 63.1 allocs/op,  14.1 tasks/op,   42462 insns/op,        0 errors)
> median absolute deviation: 571.20
> maximum: 97549.23
> minimum: 94031.94
>

After:

> 99794.67 tps ( 63.1 allocs/op,  14.1 tasks/op,   42471 insns/op,        0 errors)
> 101244.99 tps ( 63.1 allocs/op,  14.1 tasks/op,   42472 insns/op,        0 errors)
> 101128.37 tps ( 63.1 allocs/op,  14.1 tasks/op,   42485 insns/op,        0 errors)
> 101065.27 tps ( 63.1 allocs/op,  14.1 tasks/op,   42465 insns/op,        0 errors)
> 101212.98 tps ( 63.1 allocs/op,  14.1 tasks/op,   42456 insns/op,        0 errors)
> 101413.31 tps ( 63.1 allocs/op,  14.1 tasks/op,   42463 insns/op,        0 errors)
> 101464.92 tps ( 63.1 allocs/op,  14.1 tasks/op,   42466 insns/op,        0 errors)
> 101086.74 tps ( 63.1 allocs/op,  14.1 tasks/op,   42488 insns/op,        0 errors)
> 101559.09 tps ( 63.1 allocs/op,  14.1 tasks/op,   42468 insns/op,        0 errors)
> 100742.58 tps ( 63.1 allocs/op,  14.1 tasks/op,   42491 insns/op,        0 errors)
>
> median 101212.98 tps ( 63.1 allocs/op,  14.1 tasks/op,   42456 insns/op,        0 errors)
> median absolute deviation: 200.33
> maximum: 101559.09
> minimum: 99794.67
>

Fixes #16594

Closes scylladb/scylladb#18026

* github.com:scylladb/scylladb:
  Implement fast streaming for intra-node migration
  test: tablets_test: Test sharding during intra-node migration
  test: tablets_test: Check sharding also on the pending host
  test: py: tablets: Test writes concurrent with migration
  test: py: tablets: Test crash during intra-node migration
  api, storage_service: Introduce API to wait for topology to quiesce
  dht, replica: Remove deprecated sharder APIs
  test: Avoid using deprecated sharded API
  db: do_apply_many() avoid deprecated sharded API
  replica: mutation_dump: Avoid deprecated sharder API
  repair: Avoid deprecated sharder API
  table: Remove optimization which returns empty reader when key is not owned by the shard
  dht: is_single_shard: Avoid deprecated sharder API
  dht: split_range_to_single_shard: Work with static_sharder only
  dht: ring_position_range_sharder: Avoid deprecated sharder APIs
  dht: token: Avoid use of deprecated sharder API by switching to static_sharder
  selective_token_sharder: Avoid use of deprecated sharder API
  docs: Document tablet sharding vs tablet replica placement
  readers/multishard.cc: use shard_for_reads() instead of shard_of()
  multishard_mutation_query.cc: use shard_for_reads() instead of shard_of()
  storage_proxy: Extract common code to apply mutations on many shards according to sharder
  storage_proxy: Prepare per-partition rate-limiting for intra-node migration
  storage_proxy: Avoid shard_of() use in mutate_counter_on_leader_and_replicate()
  storage_proxy: Prepare mutate_hint() for intra-node tablet migration
  commitlog_replayer: Avoid deprecated sharder::shard_of()
  lwt: Avoid deprecated sharder::shard_of()
  compaction: Avoid deprecated sharder::shard_of()
  dht: Extract dht::static_sharder
  replica: Deprecate table::shard_of()
  locator: Deprecate effective_replication_map::shard_of()
  dht: Deprecate old sharder API: shard_of/next_shard/token_for_next_shard
  tests: tablets: py: Add intra-node migration test
  tests: tablets: Test that drained nodes are not balanced internally
  tests: tablets: Add checks of replica set validity to test_load_balancing_with_random_load
  tests: tablets: Verify that disabling balancing results in no intra-node migrations
  tests: tablets: Check that nodes are internally balanced
  tests: tablets: Improve debuggability by showing which rows are missing
  tablets, storage_service: Support intra-node migration in move_tablet() API
  tablet_allocator: Generate intra-node migration plan
  tablet_allocator: Extract make_internode_plan()
  tablet_allocator: Maintain candidate list and shard tablet count for target nodes
  tablet_allocator: Lift apply_load/can_accept_load lambdas to member functions
  tablets, streaming: Implement tablet streaming for intra-node migration
  dht, auto_refreshing_sharder: Allow overriding write selector
  multishard_writer: Handle intra-node migration
  storage_proxy: Handle intra-node tablet migration for writes
  tablets: Get rid of tablet_map::get_shard()
  tablets: Avoid tablet_map::get_shard in cleanup
  tablets: test: Use sharder instead of tablet_map::get_shard()
  tablets: tablet_sharder: Allow working with non-local host
  sharding: Prepare for intra-node-migration
  docs: Document sharder use for tablets
  tablets: Introduce tablet transition kind for intra-node migration
  tests: tablets: Fix use-after-move of skiplist in rebalance_tablets()
  sstables, gdb: Track readers in a linked list
  raft topology: Fix global token metadata barrier to not fence ahead of what is drained
2024-05-20 16:13:01 +03:00
Raphael S. Carvalho
715ae689c0 Implement fast streaming for intra-node migration
With intra-node migration, all the movement is local, so we can make
streaming faster by just cloning the sstable set of leaving replica
and loading it into the pending one.

This cloning is underlying storage specific, but s3 doesn't support
snapshot() yet (th sstables::storage procedure which clone is built
upon). It's only supported by file system, with help of hard links.
A new generation is picked for new cloned sstable, and it will
live in the same directory as the original.

A challenge I bumped into was to understand why table refused to
load the sstable at pending replica, as it considered them foreign.
Later I realized that sharder (for reads) at this stage of migration
will point only to leaving replica. It didn't fail with mutation
based streaming, because the sstable writer considers the shard --
that the sstable was written into -- as its owner, regardless of what
sharder says. That was fixed by mimicking this behavior during
loading at pending.

test:
./test.py --mode=dev intranode --repeat=100 passes.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
7956a2991e api, storage_service: Introduce API to wait for topology to quiesce 2024-05-16 00:28:47 +02:00
Botond Dénes
8690dbf8ad service/storage_service: introduce get_tablet_to_endpoint_map()
The tablet variant of the existing get_token_to_endpoint_map(), which
returns a list of tablet tokens and the primary replica for each.
2024-05-13 06:57:13 -04:00
Botond Dénes
155332ebf8 Merge 'Drain view_builder in generic drain (again)' from Pavel Emelyanov
Some time ago #16558 was merged that moved view builder drain into generic drain. After this merge dtests started to fail from time to time, so the PR was reverted (see #18278). In #18295 the hang was found. View builder drain was moved from "before stopping messaging service to "after" it, and view update write handlers in proxy hanged for hard-coded timeout of 5 minutes without being aborted. Tests don't wait for 5 minutes and kill scylla, then complain about it and fail.

This PR brings back the original PR as well as the necessary fix that cancels view update write handlers on stop.

Closes scylladb/scylladb#18408

* github.com:scylladb/scylladb:
  Reapply "Merge 'Drain view_builder in generic drain' from ScyllaDB"
  view: Abort pending view updates when draining
2024-05-09 08:26:44 +03:00
Piotr Dulikowski
180cb7a2b9 storage_service: notify lifecycle subs only after token metadata update
Currently, in raft mode, when raft topology is reloaded from disk or a
notification is received from gossip about an endpoint change, token
metadata is updated accordingly. While updating token metadata we detect
whether some nodes are joining or are leaving and we notify endpoint
lifecycle subscribers if such an event occurs. These notifications are
fired _before_ we finish updating token metadata and before the updated
version is globally available.

This behavior, for "node leaving" notifications specifically, was not
present in legacy topology mode. Hinted handoff depends on token
metadata being updated before it is notified about a leaving node (we
had a similar issue before: scylladb/scylladb#5087, and we fixed it by
enforcing this property). Because this is not true right now for raft
mode, this causes the hint draining logic not to work properly - when a
node leaves the cluster, there should be an attempt to send out hints
for that node, but instead hints are not sent out and are kept on disk.

In order to fix the issue with hints, postpone notifying endpoint
lifecycle subscribers about joined and left nodes only after the final
token metadata is computed and replicated to all shards.

Fixes: scylladb/scylladb#17023

Closes scylladb/scylladb#18377
2024-05-08 09:40:44 +02:00
Piotr Dulikowski
64ba620dc2 Merge 'hinted handoff: Use host IDs instead of IPs in the module' from Dawid Mędrek
This pull request introduces host ID in the Hinted Handoff module. Nodes are now identified by their host IDs instead of their IPs. The conversion occurs on the boundary between the module and `storage_proxy.hh`, but aside from that, IPs have been erased.

The changes take into considerations that there might still be old hints, still identified by IPs, on disk – at start-up, we map them to host IDs if it's possible so that they're not lost.

Refs scylladb/scylladb#6403
Fixes scylladb/scylladb#12278

Closes scylladb/scylladb#15567

* github.com:scylladb/scylladb:
  docs: Update Hinted Handoff documentation
  db/hints: Add endpoint_downtime_not_bigger_than()
  db/hints: Migrate hinted handoff when cluster feature is enabled
  db/hints: Handle arbitrary directories in resource manager
  db/hints: Start using hint_directory_manager
  db/hints: Enforce providing IP in get_ep_manager()
  db/hints: Introduce hint_directory_manager
  db/hints/resource_manager: Update function description
  db/hints: Coroutinize space_watchdog::scan_one_ep_dir()
  db/hints: Expose update lock of space watchdog
  db/hints: Add function for migrating hint directories to host ID
  db/hints: Take both IP and host ID when storing hints
  db/hints: Prepare initializing endpoint managers for migrating from IP to host ID
  db/hints: Migrate to locator::host_id
  db/hints: Remove noexcept in do_send_one_mutation()
  service: Add locator::host_id to on_leave_cluster
  service: Fix indentation
  db/hints: Fix indentation
2024-05-06 09:58:18 +02:00
Pavel Emelyanov
67736b5cd3 Reapply "Merge 'Drain view_builder in generic drain' from ScyllaDB"
This reverts commit 9c2a836607.
2024-05-02 08:16:14 +03:00
Gleb Natapov
f2b0a5e9e1 storage_service: do not take API lock for removenode operation if topology coordinator is enabled
Topology coordinator serialize operations internally, so there is no
need to have an external lock.

Fixes: scylladb/scylladb#17681
2024-04-30 15:13:50 +03:00
Dawid Medrek
54ae9797b9 service: Add locator::host_id to on_leave_cluster
We extend the function
endpoint_lifecycle_subscriber::on_leave_cluster
by another argument -- locator::host_id.
It's more convenient to have a consistent
pair of IP and host ID.
2024-04-26 22:44:03 +02:00
Patryk Jędrzejczak
3a34bb18cd db: config: make consistent-topology-changes unused
We make the `consistent-topology-changes` experimental feature
unused and assumed to be true in 6.0. We remove code branches that
executed if `consistent-topology-changes` was disabled.
2024-04-25 14:33:21 +02:00
Gleb Natapov
06e6ed09ed gossiper: disable status check for endpoints in raft mode
Gossiper automatically removes endpoints that do not have tokens in
normal state and either do not send gossiper updates or are dead for a
long time. We do not need this with topology coordinator mode since in
this mode the coordinator is responsible to manage the set of nodes in
the cluster. In addition the patch disables quarantined endpoint
maintenance in gossiper in raft mode and uses left node list from the
topology coordinator to ignore updates for nodes that are no longer part
of the topology.
2024-04-21 16:36:07 +03:00
Gleb Natapov
0e3f92fa49 storage_service: introduce a setter for topology_change_kind
In the next patch we will extend it to have other side affects.
2024-04-21 16:36:07 +03:00
Kefu Chai
168ade72f8 treewide: replace formatter<std::string_view> with formatter<string_view>
in in {fmt} before v10, it provides the specialization of `fmt::formatter<..>`
for `std::string_view` as well as the specialization of `fmt::formatter<..>`
for `fmt::string_view` which is an implementation builtin in {fmt} for
compatibility of pre-C++17. and this type is used even if the code is
compiled with C++ stadandard greater or equal to C++17. also, before v10,
the `fmt::formatter<std::string_view>::format()` is defined so it accepts
`std::string_view`. after v10, `fmt::formatter<std::string_view>` still
exists, but it is now defined using `format_as()` machinery, so it's
`format()` method does not actually accept `std::string_view`, it
accepts `fmt::string_view`, as the former can be converted to
`fmt::string_view`.

this is why we can inherit from `fmt::formatter<std::string_view>` and
use `formatter<std::string_view>::format(foo, ctx);` to implement the
`format()` method with {fmt} v9, but we cannot do this with {fmt} v10,
and we would have following compilation failure:

```
FAILED: service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o
/home/kefu/.local/bin/clang++ -DFMT_DEPRECATED_OSTREAM -DFMT_SHARED -DSCYLLA_BUILD_MODE=release -DSEASTAR_API_LEVEL=7 -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SSTRING -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"RelWithDebInfo\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++20 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere -mllvm -inline-threshold=2500 -fno-slp-vectorize -U_FORTIFY_SOURCE -Werror=unused-result -MD -MT service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o -MF service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o.d -o service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o -c /home/kefu/dev/scylladb/service/topology_state_machine.cc
/home/kefu/dev/scylladb/service/topology_state_machine.cc:254:41: error: no matching member function for call to 'format'
  254 |     return formatter<std::string_view>::format(it->second, ctx);
      |            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
/usr/include/fmt/core.h:2759:22: note: candidate function template not viable: no known conversion from 'seastar::basic_sstring<char, unsigned int, 15>' to 'const fmt::basic_string_view<char>' for 1st argument
 2759 |   FMT_CONSTEXPR auto format(const T& val, FormatContext& ctx) const
      |                      ^      ~~~~~~~~~~~~
```

because the inherited `format()` method actually comes from
`fmt::formatter<fmt::string_view>`. to reduce the confusion, in this
change, we just inherit from `fmt::format<string_view>`, where
`string_view` is actually `fmt::string_view`. this follows
the document at
https://fmt.dev/latest/api.html#formatting-user-defined-types,
and since there is less indirection under the hood -- we do not
use the specialization created by `FMT_FORMAT_AS` which inherit
from `formatter<fmt::string_view>`, hopefully this can improve
the compilation speed a little bit. also, this change addresses
the build failure with {fmt} v10.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18299
2024-04-19 07:44:07 +03:00
Tomasz Grabiec
393cb54c01 Merge 'Generalize tablet transition API calls' from Pavel Emelyanov
Recently there had been added add_tablet_replica and del_tablet_replica API calls that copy big portion of the existing move_tablet API call's logic. This PR generalizes the common parts

Closes scylladb/scylladb#18272

* github.com:scylladb/scylladb:
  tablets: Generalize transition mutations preparation
  tablets: Generalize tablet-already-in-transition check
  tablets: Generalize raft communications for tablet transition API calls
  tablets: Drop src vs dst equality check from move_tablet()
2024-04-18 14:42:10 +02:00
Kamil Braun
9c2a836607 Revert "Merge 'Drain view_builder in generic drain' from ScyllaDB"
This reverts commit 298a7fcbf2, reversing
changes made to 5cf53e670d.

The change made CI flaky.

Fixes: scylladb/scylladb#18278
2024-04-18 11:50:41 +02:00
Pavel Emelyanov
3beccb8165 tablets: Generalize tablet-already-in-transition check
Continuation of the previous patch -- there's a common sanity check of
tablet transition API handlers, namely that this tablet is not in
transition already.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-17 12:01:02 +03:00
Pavel Emelyanov
14923812ad tablets: Generalize raft communications for tablet transition API calls
There are three transition calls -- move, add replica and del replica --
and all three work similarly. In a loop they try to get guard for raft
operation, then perform sanity checks on topology state, then prepare
mutations and then try to apply them to raft. After the loop finishes
all three wait for transition for the given tablet to complete.

This patch generalizes the raft kicking loop and the transition
completion waiting code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-17 11:59:03 +03:00
Kamil Braun
eb9ba914a3 Merge 'Set dc and rack in gossiper when loaded from system.peers and load the ignored nodes state for replace' from Benny Halevy
The problem this series solves is correctly ignoring DOWN nodes state
when replacing a node.

When a node is replaced and there are other nodes that are down, the
replacing node is told to ignore those DOWN nodes using the
`ignore_dead_nodes_for_replace` option.

Since the replacing node is bootstrapping it starts with an empty
system.peers table so it has no notion about any node state and it
learns about all other nodes via gossip shadow round done in
`storage_service::prepare_replacement_info`.

Normally, since the DOWN nodes to ignore already joined the ring, the
remaining node will have their endpoint state already in gossip, but if
the whole cluster was restarted while those DOWN nodes did not start,
the remaining nodes will only have a partial endpoint state from them,
which is loaded from system.peers.

Currently, the partial endpoint state contains only `HOST_ID` and
`TOKENS`, and in particular it lacks `STATUS`, `DC`, and `RACK`.

The first part of this series loads also `DC` and `RACK` from
system.peers to make them available to the replacing node as they are
crucial for building a correct replication map with network topology
replication strategy.

But still, without a `STATUS` those nodes are not considered as normal
token owners yet, and they do not go through handle_state_normal which
adds them to the topology and token_metadata.

The second part of this series uses the endpoint state retrieved in the
gossip shadow round to explicitly add the ignored nodes' state to
topology (including dc and rack) and token_metadata (tokens) in
`prepare_replacement_info`.  If there are more DOWN nodes that are not
explicitly ignored replace will fail (as it should).

Fixes scylladb/scylladb#15787

Closes scylladb/scylladb#15788

* github.com:scylladb/scylladb:
  storage_service: join_token_ring: load ignored nodes state if replacing
  storage_service: replacement_info: return ignore_nodes state
  locator: host_id_or_endpoint: keep value as variant
  gms: endpoint_state: add getters for host_id, dc_rack, and tokens
  storage_service: topology_state_load: set local STATUS state using add_saved_endpoint
  gossiper: add_saved_endpoint: set dc and rack
  gossiper: add_saved_endpoint: fixup indentation
  gossiper: add_saved_endpoint: make host_id mandatory
  gossiper: add load_endpoint_state
  gossiper: start_gossiping: log local state
2024-04-16 10:27:36 +02:00
Botond Dénes
298a7fcbf2 Merge 'Drain view_builder in generic drain' from ScyllaDB
For view builder draining there's dedicated deferred action in main while all other services that need to be drained do it via storage_service. The latter is to unify shutdown for services and to make `nodetool drain` drain everything, not just some part of those. This PR makes view builder drain look the same. As a side effect it also moves `mark_existing_views_as_built` from storage service to view builder and generalizes this marking code inside view builder itself.

refs: #2737
refs: #2795

Closes scylladb/scylladb#16558

* github.com:scylladb/scylladb:
  storage_service: Drain view builder on drain too
  view_builder: Generalize mark_as_built(view_ptr) method
  view_builder: Move mark_existing_views_as_built from storage service
  storage_service: Add view_builder& reference
  main,cql_test_env: Move view_builder start up (and make unconditional)
2024-04-16 07:21:42 +03:00
Pavel Emelyanov
8bad828208 api: Add method to delete replica from tablet
Copied from the add_replica counterpart

TODO: Generalize common parts of move_tablet and add_|del_tablet_replica

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-15 16:31:07 +03:00
Benny Halevy
e4c3c07510 storage_service: replacement_info: return ignore_nodes state
Instead of `parse_node_list` resolving host ids to inet_address
let `prepare_replacement_info` get host_id_or_endpoint from
parse_node_list and prepare `loaded_endpoint_state` for
the ignored nodes so it can be used later by the callers.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-04-14 15:43:19 +03:00