scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-23 10:00:35 +00:00

Author	SHA1	Message	Date
Tomasz Grabiec	ded9aca6ee	repair: Exclude tablet migrations with tablet repair We want to exclude repair with tablet migrations to avoid races between repair reads and writes with replica movement. Repair is not prepared to handle topology transitions in the middle. One reason why it's not safe is that repair may successfully write to a leaving replica post streaming phase and consider all replicas to be repaired, but in fact they are not, the new replica would not be repaired. Other kinds of races could result in repair failures. If repair writes to a leaving replica which was already cleaned up, such writes will fail, causing repair to fail. Excluding works by keeping effective_replication_map_ptr in a version which doesn't have table's tablets in transitions. That prevents later transitions from starting because topology coordinator's barrier will wait for that erm before moving to a stage later than allow_write_both_read_old, so before any requets start using the new topology. Also, if transitions are already running, repair waits for them to finish. Fixes #17658. Fixes #18561. (cherry picked from commit `98323be296`)	2024-06-08 16:31:18 +02:00
Benny Halevy	bdf3e71f62	locator: tablet_map: add get_primary_replica_within_dc Will be needed by repair in a following patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `84761acc31`)	2024-06-03 19:50:40 +00:00
Benny Halevy	21f87c9cfa	locator: tablet_map: get_primary_replica: return tablet_replica This is required by repair when it will start using get_primary_replica in a following patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `c52f70f92c`)	2024-06-03 19:50:39 +00:00
Avi Kivity	52fe351c31	Merge 'Balance tablets within nodes (intra-node migration)' from Tomasz Grabiec This is needed to avoid severe imbalance between shards which can happen when some table grows and is split. The inter-node balance can be equal, so inter-node migration cannot fix the imbalance. Also, if RF=N then there is not even a possibility of moving tablets around to fix the imbalance. The only way to bring the system to balance is to move tablets within the nodes. The system is not prepared for intra-node migration currently. Request coordination is host-based, while for intra-node migration it should be (also) shard-based. The solution employed here is to keep the coordination between nodes as-is, and for intra-node migration storage_proxy-level coordinator is not aware of the migration (no pending host). The replica-side request handler will be a second-level coordinator which routes requests to shards, similar to how the first-level coordinator routes them to hosts. Tablet sharder is adjusted to handle intra-migration where a tablet can have two replicas on the same host. For reads, sharder uses the read selector to resolve the conflict. For writes, the write selector is used. The old shard_of() API is kept to represent shard for reads, and new method is introduced to query the shards for writing: shard_for_writes(). All writers should be switched to that API, which is not done in this patch yet. The request handler on replica side acts as a second-level coordinator, using sharder to determine routing to shards. A given sharder has a scope of a single topology version, a single effective_replication_map_ptr, which should be kept alive during writes. perf-simple-query test results show no signs of regression: Command: perf-simple-query -c1 -m1G --write --tablets --duration=10 Before: > 83294.81 tps ( 59.5 allocs/op, 14.3 tasks/op, 53725 insns/op, 0 errors) > 87756.72 tps ( 59.5 allocs/op, 14.3 tasks/op, 54049 insns/op, 0 errors) > 86428.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54208 insns/op, 0 errors) > 86211.38 tps ( 59.7 allocs/op, 14.3 tasks/op, 54219 insns/op, 0 errors) > 86559.89 tps ( 59.6 allocs/op, 14.3 tasks/op, 54188 insns/op, 0 errors) > 86609.39 tps ( 59.6 allocs/op, 14.3 tasks/op, 54117 insns/op, 0 errors) > 87464.06 tps ( 59.5 allocs/op, 14.3 tasks/op, 54039 insns/op, 0 errors) > 86185.43 tps ( 59.6 allocs/op, 14.3 tasks/op, 54169 insns/op, 0 errors) > 86254.71 tps ( 59.6 allocs/op, 14.3 tasks/op, 54139 insns/op, 0 errors) > 83395.35 tps ( 60.2 allocs/op, 14.4 tasks/op, 54693 insns/op, 0 errors) > > median 86428.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54208 insns/op, 0 errors) > median absolute deviation: 243.04 > maximum: 87756.72 > minimum: 83294.81 > After: > 85523.06 tps ( 59.5 allocs/op, 14.3 tasks/op, 53872 insns/op, 0 errors) > 89362.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54226 insns/op, 0 errors) > 88167.55 tps ( 59.7 allocs/op, 14.3 tasks/op, 54400 insns/op, 0 errors) > 87044.40 tps ( 59.7 allocs/op, 14.3 tasks/op, 54310 insns/op, 0 errors) > 88344.50 tps ( 59.6 allocs/op, 14.3 tasks/op, 54289 insns/op, 0 errors) > 88355.06 tps ( 59.6 allocs/op, 14.3 tasks/op, 54242 insns/op, 0 errors) > 88725.46 tps ( 59.6 allocs/op, 14.3 tasks/op, 54230 insns/op, 0 errors) > 88640.08 tps ( 59.6 allocs/op, 14.3 tasks/op, 54210 insns/op, 0 errors) > 90306.31 tps ( 59.4 allocs/op, 14.3 tasks/op, 54043 insns/op, 0 errors) > 87343.62 tps ( 59.8 allocs/op, 14.3 tasks/op, 54496 insns/op, 0 errors) > > median 88355.06 tps ( 59.6 allocs/op, 14.3 tasks/op, 54242 insns/op, 0 errors) > median absolute deviation: 1007.41 > maximum: 90306.31 > minimum: 85523.06 Command (reads): perf-simple-query -c1 -m1G --tablets --duration=10 Before: > 95860.18 tps ( 63.1 allocs/op, 14.1 tasks/op, 42476 insns/op, 0 errors) > 97537.69 tps ( 63.1 allocs/op, 14.1 tasks/op, 42454 insns/op, 0 errors) > 97549.23 tps ( 63.1 allocs/op, 14.1 tasks/op, 42470 insns/op, 0 errors) > 97511.29 tps ( 63.1 allocs/op, 14.1 tasks/op, 42470 insns/op, 0 errors) > 97227.32 tps ( 63.1 allocs/op, 14.1 tasks/op, 42471 insns/op, 0 errors) > 94031.94 tps ( 63.1 allocs/op, 14.1 tasks/op, 42441 insns/op, 0 errors) > 96978.04 tps ( 63.1 allocs/op, 14.1 tasks/op, 42462 insns/op, 0 errors) > 96401.70 tps ( 63.1 allocs/op, 14.1 tasks/op, 42473 insns/op, 0 errors) > 96573.77 tps ( 63.1 allocs/op, 14.1 tasks/op, 42440 insns/op, 0 errors) > 96340.54 tps ( 63.1 allocs/op, 14.1 tasks/op, 42468 insns/op, 0 errors) > > median 96978.04 tps ( 63.1 allocs/op, 14.1 tasks/op, 42462 insns/op, 0 errors) > median absolute deviation: 571.20 > maximum: 97549.23 > minimum: 94031.94 > After: > 99794.67 tps ( 63.1 allocs/op, 14.1 tasks/op, 42471 insns/op, 0 errors) > 101244.99 tps ( 63.1 allocs/op, 14.1 tasks/op, 42472 insns/op, 0 errors) > 101128.37 tps ( 63.1 allocs/op, 14.1 tasks/op, 42485 insns/op, 0 errors) > 101065.27 tps ( 63.1 allocs/op, 14.1 tasks/op, 42465 insns/op, 0 errors) > 101212.98 tps ( 63.1 allocs/op, 14.1 tasks/op, 42456 insns/op, 0 errors) > 101413.31 tps ( 63.1 allocs/op, 14.1 tasks/op, 42463 insns/op, 0 errors) > 101464.92 tps ( 63.1 allocs/op, 14.1 tasks/op, 42466 insns/op, 0 errors) > 101086.74 tps ( 63.1 allocs/op, 14.1 tasks/op, 42488 insns/op, 0 errors) > 101559.09 tps ( 63.1 allocs/op, 14.1 tasks/op, 42468 insns/op, 0 errors) > 100742.58 tps ( 63.1 allocs/op, 14.1 tasks/op, 42491 insns/op, 0 errors) > > median 101212.98 tps ( 63.1 allocs/op, 14.1 tasks/op, 42456 insns/op, 0 errors) > median absolute deviation: 200.33 > maximum: 101559.09 > minimum: 99794.67 > Fixes #16594 Closes scylladb/scylladb#18026 * github.com:scylladb/scylladb: Implement fast streaming for intra-node migration test: tablets_test: Test sharding during intra-node migration test: tablets_test: Check sharding also on the pending host test: py: tablets: Test writes concurrent with migration test: py: tablets: Test crash during intra-node migration api, storage_service: Introduce API to wait for topology to quiesce dht, replica: Remove deprecated sharder APIs test: Avoid using deprecated sharded API db: do_apply_many() avoid deprecated sharded API replica: mutation_dump: Avoid deprecated sharder API repair: Avoid deprecated sharder API table: Remove optimization which returns empty reader when key is not owned by the shard dht: is_single_shard: Avoid deprecated sharder API dht: split_range_to_single_shard: Work with static_sharder only dht: ring_position_range_sharder: Avoid deprecated sharder APIs dht: token: Avoid use of deprecated sharder API by switching to static_sharder selective_token_sharder: Avoid use of deprecated sharder API docs: Document tablet sharding vs tablet replica placement readers/multishard.cc: use shard_for_reads() instead of shard_of() multishard_mutation_query.cc: use shard_for_reads() instead of shard_of() storage_proxy: Extract common code to apply mutations on many shards according to sharder storage_proxy: Prepare per-partition rate-limiting for intra-node migration storage_proxy: Avoid shard_of() use in mutate_counter_on_leader_and_replicate() storage_proxy: Prepare mutate_hint() for intra-node tablet migration commitlog_replayer: Avoid deprecated sharder::shard_of() lwt: Avoid deprecated sharder::shard_of() compaction: Avoid deprecated sharder::shard_of() dht: Extract dht::static_sharder replica: Deprecate table::shard_of() locator: Deprecate effective_replication_map::shard_of() dht: Deprecate old sharder API: shard_of/next_shard/token_for_next_shard tests: tablets: py: Add intra-node migration test tests: tablets: Test that drained nodes are not balanced internally tests: tablets: Add checks of replica set validity to test_load_balancing_with_random_load tests: tablets: Verify that disabling balancing results in no intra-node migrations tests: tablets: Check that nodes are internally balanced tests: tablets: Improve debuggability by showing which rows are missing tablets, storage_service: Support intra-node migration in move_tablet() API tablet_allocator: Generate intra-node migration plan tablet_allocator: Extract make_internode_plan() tablet_allocator: Maintain candidate list and shard tablet count for target nodes tablet_allocator: Lift apply_load/can_accept_load lambdas to member functions tablets, streaming: Implement tablet streaming for intra-node migration dht, auto_refreshing_sharder: Allow overriding write selector multishard_writer: Handle intra-node migration storage_proxy: Handle intra-node tablet migration for writes tablets: Get rid of tablet_map::get_shard() tablets: Avoid tablet_map::get_shard in cleanup tablets: test: Use sharder instead of tablet_map::get_shard() tablets: tablet_sharder: Allow working with non-local host sharding: Prepare for intra-node-migration docs: Document sharder use for tablets tablets: Introduce tablet transition kind for intra-node migration tests: tablets: Fix use-after-move of skiplist in rebalance_tablets() sstables, gdb: Track readers in a linked list raft topology: Fix global token metadata barrier to not fence ahead of what is drained	2024-05-20 16:13:01 +03:00
Tomasz Grabiec	aafeacc8d9	dht, auto_refreshing_sharder: Allow overriding write selector During streaming for intra-node migration we want to write only to the new shard. To achieve that, allow altering write selector in sharder::shard_for_writes() and per-instance of auto_refreshing_sharder.	2024-05-16 00:28:46 +02:00
Tomasz Grabiec	6c6ce2d928	tablets: Get rid of tablet_map::get_shard() Its semantics do not fit well with intra-node migration which allow two owning shards. Replace uses with the new has_replica() API.	2024-05-16 00:28:46 +02:00
Tomasz Grabiec	82b34d34d8	tablets: Introduce tablet transition kind for intra-node migration We need a separate transition kind for intra node migration so that we don't have to recover this information from replica set in an expensive way. This information is needed in the hot path - in effective_replicaiton_map, to not return the pending tablet replica to the coordinator. From its perspective, replica set is not transitional. The transition will also be used to alter the behavior of the sharder. When not in intra-node migration, the sharder should advertise the shard which is either in the previous or next replica set. During intra-node migration, that's not possible as there may be two such shards. So it will return the shard according to the current read selector.	2024-05-16 00:28:46 +02:00
Botond Dénes	32a0867b38	locator/tablets: introduce the primary replica concept The primary replica is an arbitrary replica of the tablet's, which is considered to tbe the "main" owner of the tablet, similar to how replicas own tokens in the vnode world. To avoid aliasing the primary replicas with a certain DC or rack, primary replicas are rotated among the tablet's replicas, selecting tablet_id % replica_count as the primary replica.	2024-05-13 01:35:05 -04:00
Kefu Chai	168ade72f8	treewide: replace formatter<std::string_view> with formatter<string_view> in in {fmt} before v10, it provides the specialization of `fmt::formatter<..>` for `std::string_view` as well as the specialization of `fmt::formatter<..>` for `fmt::string_view` which is an implementation builtin in {fmt} for compatibility of pre-C++17. and this type is used even if the code is compiled with C++ stadandard greater or equal to C++17. also, before v10, the `fmt::formatter<std::string_view>::format()` is defined so it accepts `std::string_view`. after v10, `fmt::formatter<std::string_view>` still exists, but it is now defined using `format_as()` machinery, so it's `format()` method does not actually accept `std::string_view`, it accepts `fmt::string_view`, as the former can be converted to `fmt::string_view`. this is why we can inherit from `fmt::formatter<std::string_view>` and use `formatter<std::string_view>::format(foo, ctx);` to implement the `format()` method with {fmt} v9, but we cannot do this with {fmt} v10, and we would have following compilation failure: ``` FAILED: service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o /home/kefu/.local/bin/clang++ -DFMT_DEPRECATED_OSTREAM -DFMT_SHARED -DSCYLLA_BUILD_MODE=release -DSEASTAR_API_LEVEL=7 -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SSTRING -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"RelWithDebInfo\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++20 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere -mllvm -inline-threshold=2500 -fno-slp-vectorize -U_FORTIFY_SOURCE -Werror=unused-result -MD -MT service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o -MF service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o.d -o service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o -c /home/kefu/dev/scylladb/service/topology_state_machine.cc /home/kefu/dev/scylladb/service/topology_state_machine.cc:254:41: error: no matching member function for call to 'format' 254 \| return formatter<std::string_view>::format(it->second, ctx); \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~ /usr/include/fmt/core.h:2759:22: note: candidate function template not viable: no known conversion from 'seastar::basic_sstring<char, unsigned int, 15>' to 'const fmt::basic_string_view<char>' for 1st argument 2759 \| FMT_CONSTEXPR auto format(const T& val, FormatContext& ctx) const \| ^ ~~~~~~~~~~~~ ``` because the inherited `format()` method actually comes from `fmt::formatter<fmt::string_view>`. to reduce the confusion, in this change, we just inherit from `fmt::format<string_view>`, where `string_view` is actually `fmt::string_view`. this follows the document at https://fmt.dev/latest/api.html#formatting-user-defined-types, and since there is less indirection under the hood -- we do not use the specialization created by `FMT_FORMAT_AS` which inherit from `formatter<fmt::string_view>`, hopefully this can improve the compilation speed a little bit. also, this change addresses the build failure with {fmt} v10. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#18299	2024-04-19 07:44:07 +03:00
Pavel Emelyanov	725b2863d2	tablet: Make pending replica optional Just like leaving replica could be optional when adding replica to tablet, the pending replica can be optional too if we're removing a replica from tablet Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-15 16:31:07 +03:00
Pavel Emelyanov	b0cba57e29	tablet: Make leaving replica optional When getting leaving replica from from tablet info and transition info, the getter code assumes that this replica always exists. It's not going to be the case soon, so make the return value be optional. There are four places that mess with leaving replica: - stream tablet handler: this place checks that the leaving replica is _not_ current host. If leaving replica is missing, the check should pass - cleanup tablet handler: this place checks that the leaving replica _is_ current host. If leaving replica is missing, the check should fail as well - topology coordinator: it gets leaving replica to call cleanup on. If leaving replica is missing, the cleanup call is short-circuited to succeed immediately - load-stats calculator: it checks if the leaving replica is self. This check is not patched as it's automatically satisfied by std::optional comparison operator overload for wrapped type Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-04 09:03:36 +03:00
Kefu Chai	2e2c3a5fea	locator: fix a typo in comment s/Substracts/Subtracts/ Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#18048	2024-03-27 10:15:18 +02:00
Pavel Emelyanov	04370dc8a4	tablets: Introduce substract_sets() There are several places in code that calculate replica sets associated with specific tablet transision. Having a helper to substract two sets improves code readability. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#18033	2024-03-26 23:33:06 +02:00
Raphael S. Carvalho	6bdb456fad	sstables_loader: Fix loader when write selector is previous during tablet migration The loader is writing to pending replica even when write selector is set to previous. If migration is reverted, then the writes won't be rolled back as it assumes pending replicas weren't written to yet. That can cause data resurrection if tablet is later migrated back into the same replica. NOTE: write selector is handled correctly when set to next, because get_natural_endpoints() will return the next replica set, and none of the replicas will be considered leaving. And of course, selector set to both is also handled correctly. Fixes #17892. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#17902	2024-03-24 01:20:50 +01:00
Tomasz Grabiec	1c71f44e63	tablets, raft topology: Rebuild tablets after replacing node is normal This fixes a problem with replacing a node with tablets when RF=N. Currently, this will fail because new tablet replica allocation will not be able to find a viable destination, as the replacing node is not considered a candidate. It cannot be a candidate because replace rolls back on failure and we cannot roll back after tablets were migrated. The solution taken here is to not drain tablet replicas from replaced node during topology request but leave it to happen later after the replaced node is left and replacing node is normal. The replacing node waits for this draining to be complete on boot before the node is considered booted. Fixes #17025	2024-03-15 13:20:08 +01:00
Patryk Wrobel	3fff6bd407	locator/tablets: add tablet_map::get_sorted_tokens() This change introudces a new member function that returns a vector of sorted tokens where each pair of adjacent elements depicts a range of tokens that belong to tablet. It will be used to produce the equivalent of sorted_tokens() of vnodes when trying to use dht::describe_ownership() for tablets. Refs: scylladb#17342 Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>	2024-03-11 09:50:20 +01:00
Kefu Chai	64e14d21db	locator/tablets: add fmt::formatter for tablet_* before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define formatters for * tablet_id * tablet_replica * tablet_metadata * tablet_map their operator<<:s are dropped Refs scylladb/scylladb#13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17504	2024-03-07 09:00:49 +03:00
Kefu Chai	643c01fd80	locator: fix typo in comment -- s/slecting/selecting/ fix a typo Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17470	2024-02-22 13:28:18 +02:00
Botond Dénes	7bdd0c2cae	locator: introduce tablet_range_spliter Given a list of partition-ranges, yields the intersection of this range-list, with that of that tablet-ranges, for tablets located on the given host. This will be used in multishard_mutation_query.cc, to obtain the ranges to read from the local node: given the read ranges, obtain the ranges belonging to tablets who have replicas on the local node.	2024-02-21 02:08:48 -05:00
Pavel Emelyanov	72f3b1d5fe	topology.tablets_migration: Add cleanup_target transition stage The new stage will be used to revert migration that fails at some stages. The goal is to cleanup the pending replica, which may already received some writes by doing the cleanup RPC to the pending replica, then jumping to "revert_migration" stage introduced earlier. If pending node is dead, the call to cleanup RPC is skipped. Coordinators use old replicas. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-02-20 08:59:06 +03:00
Pavel Emelyanov	ced5bf56eb	topology.tablets_migration: Add revert_migration transition stage It's like end_migration, but old replicas intact just removing the transition (including new replicas). Coordinators use old replicas. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-02-20 08:53:36 +03:00
Asias He	904bafd069	tablets: Convert to use the new version of for_each_tablet It is more gently than the old one.	2024-02-05 18:45:40 +08:00
Asias He	fab0d33d08	tablets: Add for_each_tablet_gently In this version, the callback returns a future<>, so it can yield itself to avoid stalls in func itself.	2024-02-05 13:42:08 +08:00
Avi Kivity	c8397f0287	Merge 'Implement tablet splitting' from Raphael "Raph" Carvalho The motivation for tablet resizing is that we want to keep the average tablet size reasonable, such that load rebalancing can remain efficient. Too large tablet makes migration inefficient, therefore slowing down the balancer. If the avg size grows beyond the upper bound (split threshold), then balancer decides to split. Split spans all tablets of a table, due to power-of-two constraint. Likewise, if the avg size decreases below the lower bound (merge threshold), then merge takes place in order to grow the avg size. Merge is not implemented yet, although this series lays foundation for it to be impĺemented later on. A resize decision can be revoked if the avg size changes and the decision is no longer needed. For example, let's say table is being split and avg size drops below the target size (which is 50% of split threshold and 100% of merge one). That means after split, the avg size would drop below the merge threshold, causing a merge after split, which is wasteful, so it's better to just cancel the split. Tablet metadata gains 2 new fields for managing this: resize_type: resize decision type, can be either of "merge", "split", or "none". resize_seq_number: a sequence number that works as the global identifier of the decision (monotonically increasing, increased by 1 on every new decision emitted by the coordinator). A new RPC was implemented to pull stats from each table replica, such that load balancer can calculate the avg tablet size and know the "split status", for a given table. Avg size is aggregated carefully while taking RF of each DC into account (which might differ). When a table is done splitting its storage, it loads (mirror) the resize_seq_number from tablet metadata into its local state (in another words, my split status is ready). If a table is split ready, coordinator will see that table's seq number is the same as the one in tablet metadata. Helps to distinguish stale decisions from the latest one (in case decisions are revoked and re-emited later on). Also, it's aggregated carefully, by taking the minimum among all replicas, so coordinator will only update topology when all replicas are ready. When load balancer emits split decision, replicas will listen to need to split with a "split monitor" that is awakened once a table has replication metadata updated and detects the need for split (i.e. resize_type field is "split"). The split monitor will start splitting of compaction groups (using mechanism introduced here: `081f30d149`) for the table. And once splitting work is completed, the table updates its local state as having completed split. When coordinator pulls the split status of all replicas for a table via RPC, the balancer can see whether that table is ready for "finalizing" the decision, which is about updating tablet metadata to split each tablet into two. Once table replicas have their replication metadata updated with the new tablet count, they can update appropriately their set of compaction groups (that were previously split in the preparation step). Fixes #16536. Closes scylladb/scylladb#16580 * github.com:scylladb/scylladb: test/topology_experimental_raft: Add tablet split test replica: Bypass reshape on boot with tablets temporarily replica: Fix table::compaction_group_for_sstable() for tablet streaming test/topology_experimental_raft: Disable load balancer in test fencing replica: Remap compaction groups when tablet split is finalized service: Split tablet map when split request is finalized replica: Update table split status if completed split compaction work storage_service: Implement split monitor topology_cordinator: Generate updates for resize decisions made by balancer load_balancer: Introduce metrics for resize decisions db: Make target tablet size a live-updateable config option load_balancer: Implement resize decisions service: Wire table_resize_plan into migration_plan service: Introduce table_resize_plan tablet_mutation_builder: Add set_resize_decision() topology_coordinator: Wire load stats into load balancer storage_service: Allow tablet split and migration to happen concurrently topology_coordinator: Periodically retrieve table_load_stats locator: Introduce topology::get_datacenter_nodes() storage_service: Implement table_load_stats RPC replica: Expose table_load_stats in table replica: Introduce storage_group::live_disk_space_used() locator: Introduce table_load_stats tablets: Add resize decision metadata to tablet metadata locator: Introduce resize_decision	2024-01-31 13:59:56 +02:00
Botond Dénes	95b6aeebae	locator: tablets: add check_tablet_replica_shards() Checks that all tablets with a replica on the this node, have a valid replica shard (< smp::count). Will be used to check whether the node can start-up with the current shard-count.	2024-01-29 07:04:33 -05:00
Raphael S. Carvalho	e0de3dd844	topology_cordinator: Generate updates for resize decisions made by balancer Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-01-25 18:58:40 -03:00
Raphael S. Carvalho	2209c7440c	topology_coordinator: Periodically retrieve table_load_stats This implements the fiber that aggregates per-table stats that will be feeded into load balancer to make resize decisions (split, merge, or revoke ongoing ones). Initially, the stats will be refreshed every 60s, but the idea is that eventually we make the frequency table based, where the size of each table is taken into account. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-01-25 18:36:08 -03:00
Raphael S. Carvalho	6c74fc4b82	locator: Introduce table_load_stats This is per table stats that will be aggregated from all nodes, by the coordinator, in order to help load balancer make resize decisions. size_in_bytes is the total aggregated table size, so coordinator becomes responsible for taking into account RF of each DC and also tablet count, for computing an accurate average size. split_ready_seq_number is the minimum sequence number among all replicas. If coordinator sees all replicas store the seq number of current split, then it knows all replicas are ready for the next stage in the split process. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-01-25 18:36:08 -03:00
Raphael S. Carvalho	0d5ba1ee4b	tablets: Add resize decision metadata to tablet metadata The new metadata describes the ongoing resize operation (can be either of merge, split or none) that spans tablets of a given table. That's managed by group0, so down nodes will be able to see the decision when they come back up and see the changes to the metadata. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-01-25 18:36:06 -03:00
Raphael S. Carvalho	57582ac9c4	locator: Introduce resize_decision resize_decision is the metadata the says whether tablets of a table needs split, merge, or none. That will be recorded in tablet metadata, and therefore stored in group0. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-01-25 18:31:12 -03:00
Tomasz Grabiec	e5dcf03b88	tablets: Add support for removenode and replace handling New tablet replicas are allocated synchronously with node operations. They are safely rebuilt from all existing replicas. The list of ignored nodes passed to node operations is respected. Tablet scheduler is responsible for scheduling tablet transition which changes the replicas set. The infrastructure for handling decommission in tablet scheduler is reused for this. Scheduling is done incrementally, respecting per-shard load limits. Rebuilding transitions are recognized by load calculation to affect all tablet replicas. New kind of tablet transition is introduced called "rebuild" which adds new tablet replica and rebuilds it from existing replicas. Other than that, the transition goes through the same stages as regular migration to ensure safe synchronization with request coordinators. In this PR we simply stream from all tablet replicas. Later we should switch to calling repair to avoid sending excessive amounts of data. Fixes #16690.	2024-01-23 01:19:42 +01:00
Tomasz Grabiec	649ca0e46c	tablets: Introduce get_migration_streaming_info() which works on migration request Will be used by tablet load balancer to compute impact on load of planned migrations. Currently, the logic is hard coded in the load balancer and may get out of sync with the logic we have in get_migration_streaming_info() for already running tablet transitions. The logic will become more complex for rebuild transition, so use shared code to compute it.	2024-01-23 01:12:57 +01:00
Tomasz Grabiec	6dc56fd80b	tablets: Move migration_to_transition_info() to tablets.hh	2024-01-23 01:12:57 +01:00
Tomasz Grabiec	1df256221c	tablets: Extract get_new_replicas() which works on migraiton request Now we have a single place which translates tablet migration request to new replicas. Will be reused in other places.	2024-01-23 01:12:57 +01:00
Tomasz Grabiec	ae382196f1	tablets: Move tablet_migration_info to tablets.hh Will add methods which operate on it to tablets.hh where they belong.	2024-01-23 01:12:57 +01:00
Tomasz Grabiec	4a06ffb43c	tablets: Store transition kind per tablet Will be used to distinguish regular migration from rebuild, repair and RF change.	2024-01-23 01:12:57 +01:00
Kefu Chai	2c394e3f6f	tablets: remove unused #includes the removed #include headers are not used, so let's drop their `#include`s. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16619	2024-01-03 15:30:40 +01:00
Raphael S. Carvalho	bcbba9a5e3	locator: Introduce tablet_map::get_tablet_id_and_range_side(token) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-12-17 11:26:32 -03:00
Tomasz Grabiec	d1c1b59236	storage_service, api: Add API to disable tablet balancing Load balancing needs to be disabled before making a series of manual migrations so that we don't fight with the load balancer. Also will be used in tests to ensure tablets stick to expected locations.	2023-12-06 18:36:17 +01:00
Tomasz Grabiec	1f57d1ea28	storage_service, api: Add API to migrate a tablet Will be used in tests, or for hot fixes in production.	2023-12-06 18:36:17 +01:00
Tomasz Grabiec	5381792401	tablets: Add per-tablet session id field to tablet metadata range_streamer will pick it up when creating topology_guard. It's materialized in memory only for migrating tablets in tablet_transition_info.	2023-12-06 18:36:17 +01:00
sylwiaszunejko	954d51389c	locator: add function to check locality	2023-11-22 09:23:43 +01:00
sylwiaszunejko	a0c8531875	locator: add function to check if host is local	2023-11-21 15:15:20 +01:00
Tomasz Grabiec	d5539e080d	tablets: Implement cleanup step This change adds a stub for tablet cleanup on the replica side and wires it into the tablet migration process. The handling on replica side is incomplete because it doesn't remove the actual data yet. It only flushes the memtables, so that all data is in sstables and none requires a memtable flush. This patch is necessary to make decommission work. Otherwise, a memtable flush would happen when the decommissioned node is put in the drained state (as in nodetool drain) and it would fail on missing host id mapping (node is no longer in topology), which is examined by the tablet sharder when producing sstable sharding metadata. Leading to abort due to failed memtable flush.	2023-09-14 12:45:10 +02:00
Tomasz Grabiec	8fdbc42e71	tests: tablets: Add test for load balancing with active migrations	2023-07-31 01:45:23 +02:00
Tomasz Grabiec	fe181b3bac	tablets: Balance tablets concurrently with active migrations After this change, the load balancer can make progress with active migrations. If the algorithm is called with active tablet migrations in tablet metadata, those are treated by load balancer as if they were already completed. This allows the algorithm to incrementally make decision which when executed with active migrations will produce the desired result. Overload of shards is limited by the fact that the algorithm tracks streaming concurrency on both source and target shards of active migrations and takes concurrency limit into account when producing new migrations. The coordinator executes the load balancer on edges of tablet state machine stransitions. This allows new migrations to be started as soon as tablets finish streaming. The load balancer is also continuously invoked as long as it produces a non-empty plan. This is in order to saturate the cluster with streaming. A single make_plan() call is still not saturating, due to the way algorithm is implemented.	2023-07-31 01:45:23 +02:00
Tomasz Grabiec	fbc6076e6a	storage_service, tablets: Move get_leaving_replica() to tablets.cc For better encapsulation of tablet-specific code.	2023-07-31 01:45:23 +02:00
Tomasz Grabiec	18a59ab5ff	locator: tablets: Move std::hash definition earlier Will be needed in order to define a struct which has unordered_set<tablet_replica> as a field.	2023-07-31 01:45:23 +02:00
Tomasz Grabiec	6f4a35f9ae	service: tablet_allocator: Introduce tablet load balancer Will be invoked by the topology coordinator later to decide which tablets to migrate.	2023-07-25 21:08:51 +02:00
Tomasz Grabiec	d59b8d316c	tablets: Introduce tablet_map::for_each_tablet()	2023-07-25 21:08:51 +02:00

1 2

59 Commits