scylladb

Author	SHA1	Message	Date
Tomasz Grabiec	7479167af2	tablets: Filter-out left nodes in get_natural_endpoints() The API already promises this, the comment on effective_replication_map says: "Excludes replicas which are in the left state". Tablet replicas on the replaced node are rebuilt after the node already left. We may no longer have the IP mapping for the left node so we should not include that node in the replica set. Otherwise, storage_proxy may try to use the empty IP and fail: storage_proxy - No mapping for :: in the passed effective replication map It's fine to not include it, because storage proxy uses keyspace RF and not replica list size to determine quorum. The node is not coming up, so no one should need to contact it. Users who need replica list stability should use the host_id-based API. Fixes #18843 (cherry picked from commit `0d596a425c`)	2024-06-11 12:18:17 +02:00
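The reasoning in the commit above (dropping left nodes is safe because quorum comes from the keyspace RF, not from the replica list size) can be illustrated with a minimal sketch. The `replica` struct and both helpers are hypothetical stand-ins, not ScyllaDB's actual types or functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified replica model; not ScyllaDB's real types.
struct replica {
    int host;   // stand-in for a host_id
    bool left;  // true if the node is in the "left" state
};

// Drop replicas whose node has already left the cluster.
std::vector<replica> filter_out_left(const std::vector<replica>& replicas) {
    std::vector<replica> live;
    for (const auto& r : replicas) {
        if (!r.left) {
            live.push_back(r);
        }
    }
    return live;
}

// Quorum is derived from the replication factor only, so shrinking the
// replica list does not change how many responses are required.
constexpr std::size_t quorum_for(std::size_t replication_factor) {
    return replication_factor / 2 + 1;
}
```

With RF=3 and one left node, two replicas remain in the list while the quorum stays at two.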
Benny Halevy	bdf3e71f62	locator: tablet_map: add get_primary_replica_within_dc Will be needed by repair in a following patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `84761acc31`)	2024-06-03 19:50:40 +00:00
Benny Halevy	ec30bdc483	locator: tablet_map: get_primary_replica: do not copy tablet info Currently, the function needlessly copies the tablet_info (all tablet replicas in particular) to a local variable. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `2de79c39dc`)	2024-06-03 19:50:40 +00:00
Benny Halevy	21f87c9cfa	locator: tablet_map: get_primary_replica: return tablet_replica This is required by repair when it will start using get_primary_replica in a following patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `c52f70f92c`)	2024-06-03 19:50:39 +00:00
Avi Kivity	52fe351c31	Merge 'Balance tablets within nodes (intra-node migration)' from Tomasz Grabiec This is needed to avoid severe imbalance between shards which can happen when some table grows and is split. The inter-node balance can be equal, so inter-node migration cannot fix the imbalance. Also, if RF=N then there is not even a possibility of moving tablets around to fix the imbalance. The only way to bring the system to balance is to move tablets within the nodes. The system is not prepared for intra-node migration currently. Request coordination is host-based, while for intra-node migration it should be (also) shard-based. The solution employed here is to keep the coordination between nodes as-is, and for intra-node migration storage_proxy-level coordinator is not aware of the migration (no pending host). The replica-side request handler will be a second-level coordinator which routes requests to shards, similar to how the first-level coordinator routes them to hosts. Tablet sharder is adjusted to handle intra-migration where a tablet can have two replicas on the same host. For reads, sharder uses the read selector to resolve the conflict. For writes, the write selector is used. The old shard_of() API is kept to represent shard for reads, and new method is introduced to query the shards for writing: shard_for_writes(). All writers should be switched to that API, which is not done in this patch yet. The request handler on replica side acts as a second-level coordinator, using sharder to determine routing to shards. A given sharder has a scope of a single topology version, a single effective_replication_map_ptr, which should be kept alive during writes. 
perf-simple-query test results show no signs of regression: Command: perf-simple-query -c1 -m1G --write --tablets --duration=10 Before: > 83294.81 tps ( 59.5 allocs/op, 14.3 tasks/op, 53725 insns/op, 0 errors) > 87756.72 tps ( 59.5 allocs/op, 14.3 tasks/op, 54049 insns/op, 0 errors) > 86428.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54208 insns/op, 0 errors) > 86211.38 tps ( 59.7 allocs/op, 14.3 tasks/op, 54219 insns/op, 0 errors) > 86559.89 tps ( 59.6 allocs/op, 14.3 tasks/op, 54188 insns/op, 0 errors) > 86609.39 tps ( 59.6 allocs/op, 14.3 tasks/op, 54117 insns/op, 0 errors) > 87464.06 tps ( 59.5 allocs/op, 14.3 tasks/op, 54039 insns/op, 0 errors) > 86185.43 tps ( 59.6 allocs/op, 14.3 tasks/op, 54169 insns/op, 0 errors) > 86254.71 tps ( 59.6 allocs/op, 14.3 tasks/op, 54139 insns/op, 0 errors) > 83395.35 tps ( 60.2 allocs/op, 14.4 tasks/op, 54693 insns/op, 0 errors) > > median 86428.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54208 insns/op, 0 errors) > median absolute deviation: 243.04 > maximum: 87756.72 > minimum: 83294.81 > After: > 85523.06 tps ( 59.5 allocs/op, 14.3 tasks/op, 53872 insns/op, 0 errors) > 89362.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54226 insns/op, 0 errors) > 88167.55 tps ( 59.7 allocs/op, 14.3 tasks/op, 54400 insns/op, 0 errors) > 87044.40 tps ( 59.7 allocs/op, 14.3 tasks/op, 54310 insns/op, 0 errors) > 88344.50 tps ( 59.6 allocs/op, 14.3 tasks/op, 54289 insns/op, 0 errors) > 88355.06 tps ( 59.6 allocs/op, 14.3 tasks/op, 54242 insns/op, 0 errors) > 88725.46 tps ( 59.6 allocs/op, 14.3 tasks/op, 54230 insns/op, 0 errors) > 88640.08 tps ( 59.6 allocs/op, 14.3 tasks/op, 54210 insns/op, 0 errors) > 90306.31 tps ( 59.4 allocs/op, 14.3 tasks/op, 54043 insns/op, 0 errors) > 87343.62 tps ( 59.8 allocs/op, 14.3 tasks/op, 54496 insns/op, 0 errors) > > median 88355.06 tps ( 59.6 allocs/op, 14.3 tasks/op, 54242 insns/op, 0 errors) > median absolute deviation: 1007.41 > maximum: 90306.31 > minimum: 85523.06 Command (reads): perf-simple-query -c1 -m1G --tablets 
--duration=10 Before: > 95860.18 tps ( 63.1 allocs/op, 14.1 tasks/op, 42476 insns/op, 0 errors) > 97537.69 tps ( 63.1 allocs/op, 14.1 tasks/op, 42454 insns/op, 0 errors) > 97549.23 tps ( 63.1 allocs/op, 14.1 tasks/op, 42470 insns/op, 0 errors) > 97511.29 tps ( 63.1 allocs/op, 14.1 tasks/op, 42470 insns/op, 0 errors) > 97227.32 tps ( 63.1 allocs/op, 14.1 tasks/op, 42471 insns/op, 0 errors) > 94031.94 tps ( 63.1 allocs/op, 14.1 tasks/op, 42441 insns/op, 0 errors) > 96978.04 tps ( 63.1 allocs/op, 14.1 tasks/op, 42462 insns/op, 0 errors) > 96401.70 tps ( 63.1 allocs/op, 14.1 tasks/op, 42473 insns/op, 0 errors) > 96573.77 tps ( 63.1 allocs/op, 14.1 tasks/op, 42440 insns/op, 0 errors) > 96340.54 tps ( 63.1 allocs/op, 14.1 tasks/op, 42468 insns/op, 0 errors) > > median 96978.04 tps ( 63.1 allocs/op, 14.1 tasks/op, 42462 insns/op, 0 errors) > median absolute deviation: 571.20 > maximum: 97549.23 > minimum: 94031.94 > After: > 99794.67 tps ( 63.1 allocs/op, 14.1 tasks/op, 42471 insns/op, 0 errors) > 101244.99 tps ( 63.1 allocs/op, 14.1 tasks/op, 42472 insns/op, 0 errors) > 101128.37 tps ( 63.1 allocs/op, 14.1 tasks/op, 42485 insns/op, 0 errors) > 101065.27 tps ( 63.1 allocs/op, 14.1 tasks/op, 42465 insns/op, 0 errors) > 101212.98 tps ( 63.1 allocs/op, 14.1 tasks/op, 42456 insns/op, 0 errors) > 101413.31 tps ( 63.1 allocs/op, 14.1 tasks/op, 42463 insns/op, 0 errors) > 101464.92 tps ( 63.1 allocs/op, 14.1 tasks/op, 42466 insns/op, 0 errors) > 101086.74 tps ( 63.1 allocs/op, 14.1 tasks/op, 42488 insns/op, 0 errors) > 101559.09 tps ( 63.1 allocs/op, 14.1 tasks/op, 42468 insns/op, 0 errors) > 100742.58 tps ( 63.1 allocs/op, 14.1 tasks/op, 42491 insns/op, 0 errors) > > median 101212.98 tps ( 63.1 allocs/op, 14.1 tasks/op, 42456 insns/op, 0 errors) > median absolute deviation: 200.33 > maximum: 101559.09 > minimum: 99794.67 > Fixes #16594 Closes scylladb/scylladb#18026 * github.com:scylladb/scylladb: Implement fast streaming for intra-node migration test: tablets_test: Test 
sharding during intra-node migration test: tablets_test: Check sharding also on the pending host test: py: tablets: Test writes concurrent with migration test: py: tablets: Test crash during intra-node migration api, storage_service: Introduce API to wait for topology to quiesce dht, replica: Remove deprecated sharder APIs test: Avoid using deprecated sharded API db: do_apply_many() avoid deprecated sharded API replica: mutation_dump: Avoid deprecated sharder API repair: Avoid deprecated sharder API table: Remove optimization which returns empty reader when key is not owned by the shard dht: is_single_shard: Avoid deprecated sharder API dht: split_range_to_single_shard: Work with static_sharder only dht: ring_position_range_sharder: Avoid deprecated sharder APIs dht: token: Avoid use of deprecated sharder API by switching to static_sharder selective_token_sharder: Avoid use of deprecated sharder API docs: Document tablet sharding vs tablet replica placement readers/multishard.cc: use shard_for_reads() instead of shard_of() multishard_mutation_query.cc: use shard_for_reads() instead of shard_of() storage_proxy: Extract common code to apply mutations on many shards according to sharder storage_proxy: Prepare per-partition rate-limiting for intra-node migration storage_proxy: Avoid shard_of() use in mutate_counter_on_leader_and_replicate() storage_proxy: Prepare mutate_hint() for intra-node tablet migration commitlog_replayer: Avoid deprecated sharder::shard_of() lwt: Avoid deprecated sharder::shard_of() compaction: Avoid deprecated sharder::shard_of() dht: Extract dht::static_sharder replica: Deprecate table::shard_of() locator: Deprecate effective_replication_map::shard_of() dht: Deprecate old sharder API: shard_of/next_shard/token_for_next_shard tests: tablets: py: Add intra-node migration test tests: tablets: Test that drained nodes are not balanced internally tests: tablets: Add checks of replica set validity to test_load_balancing_with_random_load tests: 
tablets: Verify that disabling balancing results in no intra-node migrations tests: tablets: Check that nodes are internally balanced tests: tablets: Improve debuggability by showing which rows are missing tablets, storage_service: Support intra-node migration in move_tablet() API tablet_allocator: Generate intra-node migration plan tablet_allocator: Extract make_internode_plan() tablet_allocator: Maintain candidate list and shard tablet count for target nodes tablet_allocator: Lift apply_load/can_accept_load lambdas to member functions tablets, streaming: Implement tablet streaming for intra-node migration dht, auto_refreshing_sharder: Allow overriding write selector multishard_writer: Handle intra-node migration storage_proxy: Handle intra-node tablet migration for writes tablets: Get rid of tablet_map::get_shard() tablets: Avoid tablet_map::get_shard in cleanup tablets: test: Use sharder instead of tablet_map::get_shard() tablets: tablet_sharder: Allow working with non-local host sharding: Prepare for intra-node-migration docs: Document sharder use for tablets tablets: Introduce tablet transition kind for intra-node migration tests: tablets: Fix use-after-move of skiplist in rebalance_tablets() sstables, gdb: Track readers in a linked list raft topology: Fix global token metadata barrier to not fence ahead of what is drained	2024-05-20 16:13:01 +03:00
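The read/write selector behaviour described in the merge message above can be sketched as follows. This is a simplified model under stated assumptions: the enum and struct names are hypothetical, and only the function names `shard_for_reads()` and `shard_for_writes()` come from the commit text.

```cpp
#include <cassert>
#include <vector>

// Hypothetical selectors; the commit text mentions "previous", "next" and "both".
enum class read_selector { previous, next };
enum class write_selector { previous, both, next };

// Hypothetical description of a tablet moving between two shards of one host.
struct intra_node_transition {
    unsigned src_shard;
    unsigned dst_shard;
    read_selector reads;
    write_selector writes;
};

// Reads resolve the two-shard conflict to exactly one shard.
unsigned shard_for_reads(const intra_node_transition& t) {
    return t.reads == read_selector::previous ? t.src_shard : t.dst_shard;
}

// Writes may have to reach both shards while the migration is in flight.
std::vector<unsigned> shard_for_writes(const intra_node_transition& t) {
    switch (t.writes) {
    case write_selector::previous: return {t.src_shard};
    case write_selector::both:     return {t.src_shard, t.dst_shard};
    case write_selector::next:     return {t.dst_shard};
    }
    return {};
}
```

Returning a list for writes is the design point: a single-shard answer, as the old `shard_of()` gives, cannot express the phase in which both shards must receive the mutation.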
Tomasz Grabiec	6c6ce2d928	tablets: Get rid of tablet_map::get_shard() Its semantics do not fit well with intra-node migration, which allows two owning shards. Replace uses with the new has_replica() API.	2024-05-16 00:28:46 +02:00
Tomasz Grabiec	82b34d34d8	tablets: Introduce tablet transition kind for intra-node migration We need a separate transition kind for intra-node migration so that we don't have to recover this information from replica set in an expensive way. This information is needed in the hot path - in effective_replication_map, to not return the pending tablet replica to the coordinator. From its perspective, replica set is not transitional. The transition will also be used to alter the behavior of the sharder. When not in intra-node migration, the sharder should advertise the shard which is either in the previous or next replica set. During intra-node migration, that's not possible as there may be two such shards. So it will return the shard according to the current read selector.	2024-05-16 00:28:46 +02:00
Botond Dénes	32a0867b38	locator/tablets: introduce the primary replica concept The primary replica is an arbitrary replica of the tablet that is considered to be the "main" owner of the tablet, similar to how replicas own tokens in the vnode world. To avoid aliasing the primary replicas with a certain DC or rack, primary replicas are rotated among the tablet's replicas, selecting tablet_id % replica_count as the primary replica.	2024-05-13 01:35:05 -04:00
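The rotation rule stated above (`tablet_id % replica_count`) can be sketched in a few lines. The `tablet_replica` struct and the free function here are simplified, hypothetical stand-ins for the real types.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified replica: a host plus a shard on that host.
struct tablet_replica {
    int host;
    unsigned shard;
};

// Pick the primary replica by rotating over the replica list with the
// tablet id, so consecutive tablets do not all favour the same replica.
const tablet_replica& primary_replica(std::size_t tablet_id,
                                      const std::vector<tablet_replica>& replicas) {
    assert(!replicas.empty());
    return replicas[tablet_id % replicas.size()];
}
```

With three replicas, tablets 0, 1, 2, 3 map to replica indexes 0, 1, 2, 0.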
Kefu Chai	a439ebcfce	treewide: include fmt/ranges.h and/or fmt/std.h before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we include `fmt/ranges.h` and/or `fmt/std.h` for formatting the container types, like vector, map optional and variant using {fmt} instead of the homebrew formatter based on operator<<. with this change, the changes adding fmt::formatter and the changes using ostream formatter explicitly, we are allowed to drop `FMT_DEPRECATED_OSTREAM` macro. Refs scylladb#13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-04-19 22:56:16 +08:00
Pavel Emelyanov	725b2863d2	tablet: Make pending replica optional Just like leaving replica could be optional when adding replica to tablet, the pending replica can be optional too if we're removing a replica from tablet Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-15 16:31:07 +03:00
Pavel Emelyanov	b0cba57e29	tablet: Make leaving replica optional When getting leaving replica from tablet info and transition info, the getter code assumes that this replica always exists. It's not going to be the case soon, so make the return value be optional. There are four places that mess with leaving replica: - stream tablet handler: this place checks that the leaving replica is _not_ current host. If leaving replica is missing, the check should pass - cleanup tablet handler: this place checks that the leaving replica _is_ current host. If leaving replica is missing, the check should fail as well - topology coordinator: it gets leaving replica to call cleanup on. If leaving replica is missing, the cleanup call is short-circuited to succeed immediately - load-stats calculator: it checks if the leaving replica is self. This check is not patched as it's automatically satisfied by std::optional comparison operator overload for wrapped type Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-04 09:03:36 +03:00
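The checks listed in the commit above can be sketched with `std::optional`. The function names and the `tablet_replica` struct are hypothetical; only the described behaviour for a missing leaving replica is taken from the commit message.

```cpp
#include <cassert>
#include <optional>

// Hypothetical, simplified replica type.
struct tablet_replica {
    int host;
    unsigned shard;
};

bool operator==(const tablet_replica& a, const tablet_replica& b) {
    return a.host == b.host && a.shard == b.shard;
}

// Stream handler: the leaving replica must NOT be this host.
// A missing leaving replica passes the check.
bool may_stream(const std::optional<tablet_replica>& leaving, int self_host) {
    return !leaving || leaving->host != self_host;
}

// Cleanup handler: the leaving replica MUST be this host.
// A missing leaving replica fails the check.
bool may_cleanup(const std::optional<tablet_replica>& leaving, int self_host) {
    return leaving && leaving->host == self_host;
}

// Load-stats style check: std::optional<T> compared with T is false when
// the optional is empty, so no explicit has_value() test is needed.
bool is_leaving_self(const std::optional<tablet_replica>& leaving,
                     const tablet_replica& self) {
    return leaving == self;
}
```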
Raphael S. Carvalho	12714a4123	locator: Avoid tablet map lookup on every write for getting replicas We can cache tablet map in erm, to avoid looking it up on every write for getting write replicas. We do that in tablet_sharder, but not in tablet erm. Tablet map is immutable in the context of a given erm, so the address of the map is stable during erm lifetime. This caught my attention when looking at perf diff output (comparing tablet and vnode modes). It also helps when erm is called again on write completion for checking locality, used for forwarding info to the driver if needed. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#18158	2024-04-03 10:28:04 +02:00
Pavel Emelyanov	04370dc8a4	tablets: Introduce substract_sets() There are several places in code that calculate replica sets associated with specific tablet transition. Having a helper to subtract two sets improves code readability. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#18033	2024-03-26 23:33:06 +02:00
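A set-difference helper of the kind described above might look like the sketch below. The signature of the real `substract_sets()` is not shown in the commit message, so the name `subtract`, the `replica_set` alias, and the integer host ids are assumptions made for illustration.

```cpp
#include <cassert>
#include <unordered_set>

// Hypothetical: replicas are modelled as plain integer host ids.
using replica_set = std::unordered_set<int>;

// Return the elements of `a` that are not present in `b`.
replica_set subtract(const replica_set& a, const replica_set& b) {
    replica_set result;
    for (int r : a) {
        if (b.count(r) == 0) {
            result.insert(r);
        }
    }
    return result;
}
```

Given the old and new replica sets of a transition, `subtract(old, next)` yields the leaving replicas and `subtract(next, old)` yields the pending ones.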
Avi Kivity	4ddf82e58b	treewide: don't #include "gms/feature_service.hh" from other headers feature_service.hh is a high-level header that integrates much of the system functionality, so including it in lower-level headers causes unnecessary rebuilds. Specifically, when retiring features. Fix by removing feature_service.hh from headers, and supply forward declarations and includes in .cc where needed. Closes scylladb/scylladb#18005	2024-03-26 15:31:18 +02:00
Raphael S. Carvalho	6bdb456fad	sstables_loader: Fix loader when write selector is previous during tablet migration The loader is writing to pending replica even when write selector is set to previous. If migration is reverted, then the writes won't be rolled back as it assumes pending replicas weren't written to yet. That can cause data resurrection if tablet is later migrated back into the same replica. NOTE: write selector is handled correctly when set to next, because get_natural_endpoints() will return the next replica set, and none of the replicas will be considered leaving. And of course, selector set to both is also handled correctly. Fixes #17892. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#17902	2024-03-24 01:20:50 +01:00
Tomasz Grabiec	1c71f44e63	tablets, raft topology: Rebuild tablets after replacing node is normal This fixes a problem with replacing a node with tablets when RF=N. Currently, this will fail because new tablet replica allocation will not be able to find a viable destination, as the replacing node is not considered a candidate. It cannot be a candidate because replace rolls back on failure and we cannot roll back after tablets were migrated. The solution taken here is to not drain tablet replicas from replaced node during topology request but leave it to happen later after the replaced node is left and replacing node is normal. The replacing node waits for this draining to be complete on boot before the node is considered booted. Fixes #17025	2024-03-15 13:20:08 +01:00
Tomasz Grabiec	888dc41d66	effective_replication_map: Introduce host_id-based get_replicas()	2024-03-15 11:05:29 +01:00
Patryk Wrobel	75aadeb32f	locator/effective_replication_map: make 'get_ranges(inet_address ep)' virtual Before this patch, the mentioned function was a specific member of vnode_effective_replication_strategy class. To allow its usage also when tablets are enabled it was shifted to the base class - effective_replication_strategy and made pure virtual to force the derived classes to implement it. It is used by 'storage_service::get_ranges_for_endpoint()' that is used in calculation of effective ownership. Such calculation needs to be performed also when tablets are enabled. Refs: scylladb#17342 Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>	2024-03-11 09:50:20 +01:00
Patryk Wrobel	3fff6bd407	locator/tablets: add tablet_map::get_sorted_tokens() This change introduces a new member function that returns a vector of sorted tokens where each pair of adjacent elements depicts a range of tokens that belong to a tablet. It will be used to produce the equivalent of sorted_tokens() of vnodes when trying to use dht::describe_ownership() for tablets. Refs: scylladb#17342 Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>	2024-03-11 09:50:20 +01:00
Kefu Chai	64e14d21db	locator/tablets: add fmt::formatter for tablet_* before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define formatters for * tablet_id * tablet_replica * tablet_metadata * tablet_map their operator<<:s are dropped Refs scylladb/scylladb#13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17504	2024-03-07 09:00:49 +03:00
Botond Dénes	7bdd0c2cae	locator: introduce tablet_range_spliter Given a list of partition ranges, yields the intersection of this range list with the tablet ranges of the tablets located on the given host. This will be used in multishard_mutation_query.cc, to obtain the ranges to read from the local node: given the read ranges, obtain the ranges belonging to tablets that have replicas on the local node.	2024-02-21 02:08:48 -05:00
Pavel Emelyanov	72f3b1d5fe	topology.tablets_migration: Add cleanup_target transition stage The new stage will be used to revert migration that fails at some stages. The goal is to clean up the pending replica, which may have already received some writes, by doing the cleanup RPC to the pending replica, then jumping to "revert_migration" stage introduced earlier. If pending node is dead, the call to cleanup RPC is skipped. Coordinators use old replicas. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-02-20 08:59:06 +03:00
Pavel Emelyanov	ced5bf56eb	topology.tablets_migration: Add revert_migration transition stage It's like end_migration, but it leaves the old replicas intact and just removes the transition (including new replicas). Coordinators use old replicas. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-02-20 08:53:36 +03:00
Botond Dénes	3f2d7e8b25	tree: remove unnecessary yields around for_each_tablet() Commit `904bafd069` consolidated the two existing for_each_tablet() overloads, to the one which has a future<> returning callback. It also added yields to the bodies of said callbacks. This is unnecessary, the loop in for_each_tablet() already has a yield per tablet, which should be enough to prevent stalls. This patch is a follow-up to #17118 Closes scylladb/scylladb#17284	2024-02-12 17:10:25 +01:00
Asias He	904bafd069	tablets: Convert to use the new version of for_each_tablet It is gentler than the old one.	2024-02-05 18:45:40 +08:00
Asias He	fab0d33d08	tablets: Add for_each_tablet_gently In this version, the callback returns a future<>, so it can yield itself to avoid stalls in func itself.	2024-02-05 13:42:08 +08:00
Avi Kivity	c8397f0287	Merge 'Implement tablet splitting' from Raphael "Raph" Carvalho The motivation for tablet resizing is that we want to keep the average tablet size reasonable, such that load rebalancing can remain efficient. A tablet that is too large makes migration inefficient, therefore slowing down the balancer. If the avg size grows beyond the upper bound (split threshold), then balancer decides to split. Split spans all tablets of a table, due to power-of-two constraint. Likewise, if the avg size decreases below the lower bound (merge threshold), then merge takes place in order to grow the avg size. Merge is not implemented yet, although this series lays the foundation for it to be implemented later on. A resize decision can be revoked if the avg size changes and the decision is no longer needed. For example, let's say table is being split and avg size drops below the target size (which is 50% of split threshold and 100% of merge one). That means after split, the avg size would drop below the merge threshold, causing a merge after split, which is wasteful, so it's better to just cancel the split. Tablet metadata gains 2 new fields for managing this: resize_type: resize decision type, can be either of "merge", "split", or "none". resize_seq_number: a sequence number that works as the global identifier of the decision (monotonically increasing, increased by 1 on every new decision emitted by the coordinator). A new RPC was implemented to pull stats from each table replica, such that load balancer can calculate the avg tablet size and know the "split status", for a given table. Avg size is aggregated carefully while taking RF of each DC into account (which might differ). When a table is done splitting its storage, it loads (mirrors) the resize_seq_number from tablet metadata into its local state (in other words, my split status is ready). If a table is split ready, coordinator will see that table's seq number is the same as the one in tablet metadata. 
This helps to distinguish stale decisions from the latest one (in case decisions are revoked and re-emitted later on). Also, it's aggregated carefully, by taking the minimum among all replicas, so coordinator will only update topology when all replicas are ready. When load balancer emits split decision, replicas detect the need to split through a "split monitor" that is awakened once a table has replication metadata updated and detects the need for split (i.e. resize_type field is "split"). The split monitor will start splitting of compaction groups (using mechanism introduced here: `081f30d149`) for the table. And once splitting work is completed, the table updates its local state as having completed split. When coordinator pulls the split status of all replicas for a table via RPC, the balancer can see whether that table is ready for "finalizing" the decision, which is about updating tablet metadata to split each tablet into two. Once table replicas have their replication metadata updated with the new tablet count, they can update appropriately their set of compaction groups (that were previously split in the preparation step). Fixes #16536. 
Closes scylladb/scylladb#16580 * github.com:scylladb/scylladb: test/topology_experimental_raft: Add tablet split test replica: Bypass reshape on boot with tablets temporarily replica: Fix table::compaction_group_for_sstable() for tablet streaming test/topology_experimental_raft: Disable load balancer in test fencing replica: Remap compaction groups when tablet split is finalized service: Split tablet map when split request is finalized replica: Update table split status if completed split compaction work storage_service: Implement split monitor topology_cordinator: Generate updates for resize decisions made by balancer load_balancer: Introduce metrics for resize decisions db: Make target tablet size a live-updateable config option load_balancer: Implement resize decisions service: Wire table_resize_plan into migration_plan service: Introduce table_resize_plan tablet_mutation_builder: Add set_resize_decision() topology_coordinator: Wire load stats into load balancer storage_service: Allow tablet split and migration to happen concurrently topology_coordinator: Periodically retrieve table_load_stats locator: Introduce topology::get_datacenter_nodes() storage_service: Implement table_load_stats RPC replica: Expose table_load_stats in table replica: Introduce storage_group::live_disk_space_used() locator: Introduce table_load_stats tablets: Add resize decision metadata to tablet metadata locator: Introduce resize_decision	2024-01-31 13:59:56 +02:00
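The threshold relationships described in the tablet-splitting merge above (target size is 50% of the split threshold and equal to the merge threshold) can be sketched as below. All names are hypothetical, and the merge branch only illustrates the thresholds, since the message states that merge is not implemented in that series.

```cpp
#include <cassert>
#include <cstdint>

enum class resize_type { none, split, merge };

// Hypothetical threshold model derived from a single split threshold.
struct resize_thresholds {
    std::uint64_t split_threshold;

    // Target size is 50% of the split threshold.
    std::uint64_t target_size() const { return split_threshold / 2; }
    // Merge threshold equals the target size.
    std::uint64_t merge_threshold() const { return target_size(); }
};

// Decide on a resize from the average tablet size of a table.
resize_type decide_resize(std::uint64_t avg_size, const resize_thresholds& t) {
    if (avg_size > t.split_threshold) {
        return resize_type::split;
    }
    if (avg_size < t.merge_threshold()) {
        return resize_type::merge;
    }
    return resize_type::none;
}

// An ongoing split is worth cancelling when the average size has dropped
// below the target size, because splitting would then lead straight to a merge.
bool should_revoke_split(std::uint64_t avg_size, const resize_thresholds& t) {
    return avg_size < t.target_size();
}
```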
Tomasz Grabiec	36f218c83b	Merge 'main: refuse startup when tablet resharding is required' from Botond Dénes We do not support tablet resharding yet. All tablet-related code assumes that the (host_id, shard) tablet replica is always valid. Violating this leads to undefined behaviour: errors in the tablet load balancer and potential crashes. Avoid this by refusing to start if the need for resharding is detected. Be as lenient as possible: check all tablets with a replica on this node, and only refuse startup if at least one tablet has an invalid replica shard. Startup will fail as: ERROR 2024-01-26 07:03:06,931 [shard 0:main] init - Startup failed: std::runtime_error (Detected a tablet with invalid replica shard, reducing shard count with tablet-enabled tables is not yet supported. Replace the node instead.) Refs: #16739 Fixes: #16843 Closes scylladb/scylladb#17008 * github.com:scylladb/scylladb: test/topolgy_experimental_raft: test_tablets.py: add test for resharding test/pylib: manager[_client]: add update_cmdline() main: refuse startup when tablet resharding is required locator: tablets: add check_tablet_replica_shards()	2024-01-29 23:39:41 +01:00
Botond Dénes	95b6aeebae	locator: tablets: add check_tablet_replica_shards() Checks that all tablets with a replica on this node have a valid replica shard (< smp::count). Will be used to check whether the node can start up with the current shard count.	2024-01-29 07:04:33 -05:00
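The validity condition above (replica shard `< smp::count` for every tablet replica on this node) can be sketched as a plain loop. This is a hypothetical, simplified model: `shard_count` stands in for `smp::count`, and the struct and function name are not the real ScyllaDB API.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified replica type.
struct tablet_replica {
    int host;
    unsigned shard;
};

// Returns true if every replica located on `this_host` refers to a shard
// that exists with the current shard count. Replicas on other hosts are
// ignored, which keeps the check as lenient as possible.
bool replica_shards_valid(const std::vector<tablet_replica>& replicas,
                          int this_host, unsigned shard_count) {
    for (const auto& r : replicas) {
        if (r.host == this_host && r.shard >= shard_count) {
            return false;
        }
    }
    return true;
}
```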
Pavel Emelyanov	3abdb3c7ee	tablets: Remove tablet_aware_replication_strategy::parse_initial_tablets It's now unused, the string with initial tablets is parsed elsewhere Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#17010	2024-01-29 10:03:38 +02:00
Raphael S. Carvalho	e0de3dd844	topology_cordinator: Generate updates for resize decisions made by balancer Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-01-25 18:58:40 -03:00
Raphael S. Carvalho	6c74fc4b82	locator: Introduce table_load_stats This is per table stats that will be aggregated from all nodes, by the coordinator, in order to help load balancer make resize decisions. size_in_bytes is the total aggregated table size, so coordinator becomes responsible for taking into account RF of each DC and also tablet count, for computing an accurate average size. split_ready_seq_number is the minimum sequence number among all replicas. If coordinator sees all replicas store the seq number of current split, then it knows all replicas are ready for the next stage in the split process. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-01-25 18:36:08 -03:00
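The aggregation rule for `split_ready_seq_number` described above (take the minimum among all replicas, then compare with the current decision) can be sketched as follows. The function names and the use of a plain vector of per-replica values are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Aggregate per-replica sequence numbers by taking the minimum, so the
// result reflects the least advanced replica. Requires a non-empty input.
std::int64_t aggregate_split_ready_seq(const std::vector<std::int64_t>& per_replica) {
    assert(!per_replica.empty());
    return *std::min_element(per_replica.begin(), per_replica.end());
}

// All replicas are ready for the next split stage only when even the least
// advanced one has recorded the sequence number of the current decision.
bool all_replicas_split_ready(const std::vector<std::int64_t>& per_replica,
                              std::int64_t current_seq) {
    return !per_replica.empty() &&
           aggregate_split_ready_seq(per_replica) == current_seq;
}
```

Taking the minimum means a single lagging replica is enough to hold back finalization, which matches the stated goal of updating topology only when all replicas are ready.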
Raphael S. Carvalho	0d5ba1ee4b	tablets: Add resize decision metadata to tablet metadata The new metadata describes the ongoing resize operation (can be either of merge, split or none) that spans tablets of a given table. That's managed by group0, so down nodes will be able to see the decision when they come back up and see the changes to the metadata. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-01-25 18:36:06 -03:00
Raphael S. Carvalho	57582ac9c4	locator: Introduce resize_decision resize_decision is the metadata that says whether the tablets of a table need a split, a merge, or neither. That will be recorded in tablet metadata, and therefore stored in group0. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-01-25 18:31:12 -03:00
Tomasz Grabiec	e5dcf03b88	tablets: Add support for removenode and replace handling New tablet replicas are allocated synchronously with node operations. They are safely rebuilt from all existing replicas. The list of ignored nodes passed to node operations is respected. Tablet scheduler is responsible for scheduling tablet transition which changes the replicas set. The infrastructure for handling decommission in tablet scheduler is reused for this. Scheduling is done incrementally, respecting per-shard load limits. Rebuilding transitions are recognized by load calculation to affect all tablet replicas. New kind of tablet transition is introduced called "rebuild" which adds new tablet replica and rebuilds it from existing replicas. Other than that, the transition goes through the same stages as regular migration to ensure safe synchronization with request coordinators. In this PR we simply stream from all tablet replicas. Later we should switch to calling repair to avoid sending excessive amounts of data. Fixes #16690.	2024-01-23 01:19:42 +01:00
Tomasz Grabiec	649ca0e46c	tablets: Introduce get_migration_streaming_info() which works on migration request Will be used by tablet load balancer to compute impact on load of planned migrations. Currently, the logic is hard coded in the load balancer and may get out of sync with the logic we have in get_migration_streaming_info() for already running tablet transitions. The logic will become more complex for rebuild transition, so use shared code to compute it.	2024-01-23 01:12:57 +01:00
Tomasz Grabiec	6dc56fd80b	tablets: Move migration_to_transition_info() to tablets.hh	2024-01-23 01:12:57 +01:00
Tomasz Grabiec	1df256221c	tablets: Extract get_new_replicas() which works on migration request Now we have a single place which translates tablet migration request to new replicas. Will be reused in other places.	2024-01-23 01:12:57 +01:00
Tomasz Grabiec	4a06ffb43c	tablets: Store transition kind per tablet Will be used to distinguish regular migration from rebuild, repair and RF change.	2024-01-23 01:12:57 +01:00
Pavel Emelyanov	941f6d8fca	cql: Move initial_tablets from REPLICATION to TABLETS in DDL This patch changes the syntax of enabling tablets from CREATE KEYSPACE ... WITH REPLICATION = { ..., 'initial_tablets': <int> } to be CREATE KEYSPACE ... WITH TABLETS = { 'initial': <int> } and updates all tests accordingly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-01-15 13:04:48 +03:00
Petr Gusev	1928dc73a8	erm: has_pending_ranges: switch to host_id In the next patches we are going to change erm data structures (replication_map and ring_mapping) from IP to host_id. Having locator::host_id instead of IP in has_pending_ranges arguments makes this transition easier.	2024-01-12 12:23:19 +04:00
Kefu Chai	2c394e3f6f	tablets: remove unused #includes the removed #include headers are not used, so let's drop their `#include`s. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16619	2024-01-03 15:30:40 +01:00
Pavel Emelyanov	c43501d973	locator,schema: Move initial tablets from r.s. options to params The option is kept in DDL, but is _not_ stored in system_schema.keyspaces. Instead, it's removed from the provided options and kept in scylla_keyspaces table in its own column. All the places that had optional initial_tablets disengaged now set this value up the way they find appropriate. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-12-25 16:07:10 +03:00
Pavel Emelyanov	45f4276de6	locator: Pass abstract_replication_strategy& into validate_tablet_options() It will need to check if the r.s. in question had been marked as per-table one in next patches. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-12-25 15:56:49 +03:00
Pavel Emelyanov	bf824d79d9	locator: Carry r.s. params into process_tablet_options() The latter method is the one that will need extended params in next patches. It's called from network_topology_strategy() constructor which already has params at hand. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-12-25 15:56:02 +03:00
Raphael S. Carvalho	bcbba9a5e3	locator: Introduce tablet_map::get_tablet_id_and_range_side(token) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-12-17 11:26:32 -03:00
Petr Gusev	7b55ccbd8e	token_metadata: drop the template Replace token_metadata2 -> token_metadata, make token_metadata non-template again. No behavior changes, just compilation fixes.	2023-12-12 23:19:54 +04:00
Petr Gusev	11cc21d0a9	erm: switch to the new token_metadata In this commit we replace token_metadata with token_metadata2 in the erm interface and field types. To accommodate the change some of strategy-related methods are also updated. All the boost and topology tests pass with this change.	2023-12-12 23:19:53 +04:00
Petr Gusev	d9283bd025	tablets: switch to token_metadata2 locator_topology_test, network_topology_strategy_test and tablets_test are fully switched to the host_id-based token_metadata, meaning they no longer populate the old token_metadata. All the boost and topology tests pass with this change.	2023-12-12 23:19:53 +04:00
Petr Gusev	5a1418fdba	token_metadata: get_endpoint_for_host_id -> get_endpoint_for_host_id_if_known This commit fixes an inconsistency in method names: get_host_id and get_host_id_if_known are (internal_error, returns null), but there was only one method for the opposite conversion - get_endpoint_for_host_id, and it returns null. In this commit we change it to on_internal_error if it can't find the argument and add another method get_endpoint_for_host_id_if_known which returns null in this case. We can't use get_endpoint_for_host_id/get_host_id in host_id_or_endpoint::resolve since it's called from storage_service::parse_node_list -> token_metadata::parse_host_id_and_endpoint, and exceptions are caught and handled in `storage_service::parse_node_list`.	2023-12-11 12:51:34 +04:00

73 Commits