When getting the leaving replica from tablet info and transition info,
the getter code assumes that this replica always exists. That will soon
no longer be the case, so make the return value optional.
There are four places that deal with the leaving replica:
- stream tablet handler: this place checks that the leaving replica is
_not_ the current host. If the leaving replica is missing, the check
should pass
- cleanup tablet handler: this place checks that the leaving replica
_is_ the current host. If the leaving replica is missing, the check
should fail
- topology coordinator: it gets the leaving replica to call cleanup on.
If the leaving replica is missing, the cleanup call is short-circuited
to succeed immediately
- load-stats calculator: it checks if the leaving replica is self. This
check is not patched, as it's automatically satisfied by the
std::optional comparison operator overload for the wrapped type (see
the sketch below)
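For illustration, a minimal self-contained sketch of the std::optional
behavior relied on above, using a hypothetical host_id stand-in for the
real replica type:

    #include <cassert>
    #include <optional>

    // Hypothetical stand-in for the real replica/host type.
    struct host_id {
        int id;
        bool operator==(const host_id&) const = default;
    };

    int main() {
        host_id self{7};
        std::optional<host_id> leaving;  // no leaving replica
        // std::optional's mixed comparison: an empty optional never
        // equals a value, so "leaving replica is self" is false...
        assert(!(leaving == self));
        // ...without any explicit has_value() check at the call site.
        leaving = host_id{7};
        assert(leaving == self);
    }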
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We can cache the tablet map in the erm, to avoid looking it up on every
write when getting write replicas. We already do that in tablet_sharder,
but not in the tablet erm. The tablet map is immutable in the context of
a given erm, so the address of the map is stable during the erm's
lifetime.
This caught my attention when looking at perf diff output
(comparing tablet and vnode modes).
It also helps when the erm is called again on write completion to check
locality, which is used for forwarding info to the driver if needed.
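A minimal sketch of the caching idea, using hypothetical simplified
shapes for tablet_metadata and tablet_map rather than the real erm
interfaces:

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct tablet_map {
        std::vector<int> replicas_for(uint64_t tablet) const { return {}; }
    };

    struct tablet_metadata {
        std::unordered_map<uint64_t, tablet_map> per_table;  // keyed by table id
    };

    class tablet_erm {
        const tablet_metadata& _tmeta;
        // Resolved once: the map's address is stable for the erm's
        // lifetime because the metadata is immutable in its context.
        const tablet_map* _tmap;
    public:
        tablet_erm(const tablet_metadata& tmeta, uint64_t table)
            : _tmeta(tmeta), _tmap(&tmeta.per_table.at(table)) {}

        std::vector<int> get_replicas_for_write(uint64_t tablet) const {
            return _tmap->replicas_for(tablet);  // no hash lookup per write
        }
    };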
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#18158
This series provides a reallocate_tablets function, that's initially called by allocate_tablets_for_new_table.
The new allocation implementation is independent of vnodes/token ownership.
Rather than using the natural_endpoints_tracker, it implements its own tracking
based on dc/rack load (== number of replicas in rack), with the additional benefit
that tablet allocation will balance the allocation across racks, using a heap structure,
similar to the one we use to balance tablet allocation across shards in each node.
reallocate_tablets may also be called with an optional parameter pointing to the current tablet_map.
In this case the function either allocates more tablet replicas in datacenters for which the replication factor was increased,
or deallocates tablet replicas from datacenters for which the replication factor was decreased.
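For illustration, a minimal sketch of heap-based rack balancing, with
hypothetical types (the real code tracks dc/rack load on locator types):

    #include <queue>
    #include <string>
    #include <vector>

    struct rack_load {
        std::string rack;
        size_t replicas;  // replicas already allocated in this rack
    };

    // Pick rf racks, always taking the least-loaded rack next.
    std::vector<std::string> pick_racks(std::vector<rack_load> racks, size_t rf) {
        auto cmp = [](const rack_load& a, const rack_load& b) {
            return a.replicas > b.replicas;  // min-heap by load
        };
        std::priority_queue<rack_load, std::vector<rack_load>, decltype(cmp)>
            heap(cmp, std::move(racks));
        std::vector<std::string> chosen;
        while (chosen.size() < rf && !heap.empty()) {
            auto r = heap.top();
            heap.pop();
            chosen.push_back(r.rack);
            r.replicas++;   // account for the new replica and re-heap,
            heap.push(r);   // so racks fill up evenly
        }
        return chosen;
    }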
The NetworkTopologyStrategy_tablets_test unit test was extended to cover replication factor changes.
Closes scylladb/scylladb#17846
* github.com:scylladb/scylladb:
network_topology_strategy: reallocate_tablets: consider new_racks before existing racks
network_topology_strategy_test: add NetworkTopologyStrategy_tablet_allocation_balancing_test
network_topology_strategy: reallocate_tablets: support deallocation via rf change
network_topology_strategy_test: tablets_test: randomize cases
network_topology_strategy: allocate_tablets_for_new_table: do not rely on token ownership
network_topology_strategy_test: add NetworkTopologyStrategy_tablets_negative_test
network_topology_strategy_test: endpoints_check: use particular BOOST_CHECK_* functions
network_topology_strategy_test: endpoints_check: verify that replicas are placed on unique nodes
network_topology_strategy_test: endpoints_check: strictly check rf for tablets
network_topology_strategy_test: full_ring_check for tablets: drop unused options param
Allocate first from new (unpopulated) racks before
allocating from racks that are already populated
with replicas.
Still, rotate both new and existing racks by tablet id
to ensure fairness.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add support for deallocating tablet replicas when the
datacenter replication factor is decreased.
We deallocate replicas in back-to-front order to maintain
replica pairing between the base table and
its materialized views.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Base initial tablet allocation for a new table
on the dc/rack topology, rather than on the token ring,
to remove the dependency on token ownership.
We keep the rack ordinal order in each dc
to facilitate in-rack pairing of base/view
replicas, and we apply load-balancing
principles by sorting the nodes in each rack
by their load (number of tablets allocated to
the node), and attempting to fill least-loaded
nodes first, as in the sketch below.
This method is more efficient than circling
the token ring and attempting to insert the endpoints
into the natural_endpoint_tracker until the replication
factor per dc is fulfilled, and it allows an easier
way to incrementally allocate more replicas after
rf is increased.
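For illustration, a minimal sketch of the per-rack node selection, with
hypothetical shapes:

    #include <algorithm>
    #include <string>
    #include <vector>

    struct node_load {
        std::string node;
        size_t tablets;  // tablets already allocated to the node
    };

    // Fill the least-loaded nodes in the rack first.
    std::vector<std::string> pick_nodes(std::vector<node_load> rack, size_t count) {
        std::sort(rack.begin(), rack.end(),
                  [](const node_load& a, const node_load& b) {
                      return a.tablets < b.tablets;
                  });
        std::vector<std::string> chosen;
        for (size_t i = 0; i < count && i < rack.size(); ++i) {
            chosen.push_back(rack[i].node);
        }
        return chosen;
    }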
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
There are several places in the code that calculate replica sets
associated with a specific tablet transition. Having a helper to
subtract two sets improves code readability.
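For illustration, a minimal sketch of such a helper, assuming
(hypothetically) that replica sets are vectors of host ids:

    #include <algorithm>
    #include <iterator>
    #include <vector>

    using replica_set = std::vector<int>;  // hypothetical host-id vector

    // Returns replicas present in a but not in b, e.g. the pending
    // replicas of a transition: next - current.
    replica_set subtract_sets(replica_set a, replica_set b) {
        std::sort(a.begin(), a.end());
        std::sort(b.begin(), b.end());
        replica_set diff;
        std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                            std::back_inserter(diff));
        return diff;
    }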
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#18033
feature_service.hh is a high-level header that integrates much
of the system functionality, so including it in lower-level headers
causes unnecessary rebuilds, specifically when retiring features.
Fix by removing feature_service.hh from headers, and supplying forward
declarations and includes in .cc files where needed.
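The pattern, sketched on a hypothetical header (some_component is made
up; only the forward declaration of gms::feature_service is needed
there):

    // some_component.hh -- only a forward declaration, no heavy include
    namespace gms { class feature_service; }

    class some_component {
        gms::feature_service& _features;  // references/pointers need no definition
    public:
        explicit some_component(gms::feature_service& features);
        bool tablets_enabled() const;
    };

    // some_component.cc -- the full definition is pulled in here only
    // #include "gms/feature_service.hh"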
Closes scylladb/scylladb#18005
The loader writes to the pending replica even when the write selector
is set to previous. If the migration is reverted, the writes won't be
rolled back, as rollback assumes pending replicas weren't written to
yet. That can cause data resurrection if the tablet is later migrated
back into the same replica.
NOTE: write selector is handled correctly when set to next, because
get_natural_endpoints() will return the next replica set, and none
of the replicas will be considered leaving. And of course, selector
set to both is also handled correctly.
Fixes #17892.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#17902
This fixes a problem with replacing a node with tablets when
RF=N. Currently, this will fail because new tablet replica allocation
will not be able to find a viable destination, as the replacing node
is not considered a candidate. It cannot be a candidate because
replace rolls back on failure and we cannot roll back after tablets
were migrated.
The solution taken here is to not drain tablet replicas from the
replaced node during the topology request, but to leave it to happen
later, after the replaced node is left and the replacing node is
normal.
The replacing node waits for this draining to be complete on boot
before the node is considered booted.
Fixes #17025
Those nodes will be kept in tablet replica sets for a while after node
replace is done, until the new replica is rebuilt. So we need to know
those nodes' location (dc, rack) for two reasons:
1) algorithms which work with replica sets filter nodes based on
their location. For example, the materialized views code, which pairs
base replicas with view replicas, filters by datacenter first.
2) tablet scheduler needs to identify each node's location in order
to make decisions about new replica placement.
It's ok to not know the IP, and we don't keep it. Those nodes will not
be present in the IP-based replica sets, e.g. those returned by
get_natural_endpoints(), only in host_id-based replica
sets. storage_proxy request coordination is not affected.
Nodes in the left state are still not present in the token ring, and
not considered to be members of the ring (datacenter endpoints exclude
them).
In the future we could make the change even more transparent by only
loading locator::node* for those nodes and keeping node* in tablet
replica sets.
We load topology information only for left nodes which are actually
referenced by some tablet. To achieve that, the topology loading code
queries system.tablets for the set of hosts. This set is then passed to
the system.topology loading method, which decides whether to load
replica_state for a left node or not.
Before this patch, the mentioned function was a specific
member of the vnode_effective_replication_strategy class.
To allow its usage also when tablets are enabled, it was
shifted to the base class, effective_replication_strategy,
and made pure virtual to force the derived classes to
implement it.
It is used by 'storage_service::get_ranges_for_endpoint()',
which is used in the calculation of effective ownership. Such
a calculation needs to be performed also when tablets are
enabled.
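A sketch of the class-hierarchy change, using the class names above
with simplified stand-in types:

    #include <utility>
    #include <vector>

    using token_range = std::pair<long, long>;  // hypothetical stand-in
    using endpoint = int;

    struct effective_replication_strategy {
        virtual ~effective_replication_strategy() = default;
        // Previously a member of the vnode class only; now pure virtual,
        // so both vnode and tablet derived classes must implement it.
        virtual std::vector<token_range> get_ranges(endpoint ep) const = 0;
    };

    struct vnode_effective_replication_strategy : effective_replication_strategy {
        std::vector<token_range> get_ranges(endpoint) const override { return {}; }
    };

    struct tablet_effective_replication_strategy : effective_replication_strategy {
        std::vector<token_range> get_ranges(endpoint) const override { return {}; }
    };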
Refs: scylladb#17342
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
This change introduces a new member function that
returns a vector of sorted tokens where each pair of adjacent
elements depicts a range of tokens that belong to a tablet.
It will be used to produce the equivalent of sorted_tokens() for
vnodes when trying to use dht::describe_ownership() for tablets.
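A minimal sketch of the idea, assuming (hypothetically) that tablets
evenly partition the 64-bit token ring; the real tablet_map code may
differ:

    #include <cstdint>
    #include <vector>

    std::vector<int64_t> tablet_sorted_tokens(uint64_t tablet_count) {
        std::vector<int64_t> tokens;
        tokens.reserve(tablet_count);
        const unsigned __int128 ring = (unsigned __int128)1 << 64;
        for (uint64_t i = 1; i <= tablet_count; ++i) {
            // Last ring position owned by tablet i-1; each pair of
            // adjacent elements bounds one tablet's token range.
            uint64_t pos = (uint64_t)(ring * i / tablet_count - 1);
            // Map the unsigned ring position to a signed token,
            // preserving order.
            tokens.push_back((int64_t)(pos ^ ((uint64_t)1 << 63)));
        }
        return tokens;
    }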
Refs: scylladb#17342
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for
* tablet_id
* tablet_replica
* tablet_metadata
* tablet_map
their operator<< overloads are dropped
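for illustration, a minimal sketch of one such formatter, using a
simplified stand-in tablet_id (not the real definition):

    #include <cstdint>
    #include <fmt/core.h>

    struct tablet_id { uint64_t value; };  // simplified stand-in

    template <>
    struct fmt::formatter<tablet_id> : fmt::formatter<uint64_t> {
        auto format(tablet_id id, fmt::format_context& ctx) const {
            // Delegate to the built-in integer formatter.
            return fmt::formatter<uint64_t>::format(id.value, ctx);
        }
    };

    // usage: fmt::print("{}", tablet_id{42});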
Refs scylladb/scylladb#13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#17504
When the topology barrier is blocked for longer than the configured
threshold (2s), stale versions are marked as stalled, and when they get
released they report a backtrace to the logs. This should help to
identify what was holding the token metadata pointer for too long.
Example log:
token_metadata - topology version 30 held for 299.159 [s] past expiry, released at: 0x2397ae1 0x23a36b6 ...
Closes scylladb/scylladb#17427
Given a list of partition ranges, yields the intersection of this
range list with the tablet ranges, for tablets located on the
given host.
This will be used in multishard_mutation_query.cc, to obtain the ranges
to read from the local node: given the read ranges, obtain the ranges
belonging to tablets which have replicas on the local node.
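For illustration, a minimal sketch of the interval intersection on
hypothetical closed integer ranges (the real code operates on partition
ranges):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct rng { int64_t first, last; };  // closed range [first, last]

    // Both inputs sorted and non-overlapping; output is their intersection.
    std::vector<rng> intersect(const std::vector<rng>& a, const std::vector<rng>& b) {
        std::vector<rng> out;
        size_t i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int64_t lo = std::max(a[i].first, b[j].first);
            int64_t hi = std::min(a[i].last, b[j].last);
            if (lo <= hi) {
                out.push_back({lo, hi});  // overlapping part belongs to both
            }
            // advance whichever range ends first
            (a[i].last < b[j].last) ? ++i : ++j;
        }
        return out;
    }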
range.hh was deprecated in bd794629f9 (2020) since its names
conflict with the C++ library concept of an iterator range. The name
::range also mapped to the dangerous wrapping_interval rather than
nonwrapping_interval.
Complete the deprecation by removing range.hh and replacing all the
aliases by the names they point to from the interval library. Note
this now exposes uses of wrapping intervals as they are now explicit.
The unit tests are renamed and range.hh is deleted.
Closes scylladb/scylladb#17428
It can happen that a node is lost during a tablet migration involving that node. The migration will be stuck, blocking the topology state machine. To recover from this, the current procedure is for the admin to execute nodetool removenode or to replace the node. This marks the node as "ignored", and the tablet state machine can pick this up and abort the migration.
This PR implements the handling for the streaming stage only and adds a test for it. Checking other stages needs more work with failure injection to inject failures into a specific barrier.
To handle streaming failure, two new stages are introduced -- cleanup_target and revert_migration. The former cleans up the pending replica that could have received some data by the time streaming stopped working; the latter is like end_migration, but doesn't commit the new_replicas into the replicas field.
refs: #16527
Closes scylladb/scylladb#17360
* github.com:scylladb/scylladb:
test/topology: Add checking error paths for failed migration
topology.tablets_migration: Handle failed streaming
topology.tablets_migration: Add cleanup_target transition stage
topology.tablets_migration: Add revert_migration transition stage
storage_service: Rewrap cleanup stage checking in cleanup_tablet()
test/topology: Move helpers to get tablet replicas to pylib
in da53854b66, we added a formatter for printing a `node*`, and switched
to this formatter when printing `node*`. but we failed to update some
call sites when migrating to the new formatter, where a
`unique_ptr<node>` was printed instead. this is not the behavior before
the change, and is not expected.
so, in this change, we explicitly instantiate `node_printer` instances
with the pointer held by `unique_ptr<node>`, to restore the behavior
before da53854b66.
this issue was identified when compiling the tree using {fmt} v10 with
compile-time format-string checks enabled, which is not yet upstreamed
to Seastar.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#17418
The new stage will be used to revert a migration that fails at some
stage. The goal is to clean up the pending replica, which may have
already received some writes, by issuing the cleanup RPC to the pending
replica, then jumping to the "revert_migration" stage introduced
earlier.
If the pending node is dead, the call to the cleanup RPC is skipped.
Coordinators use old replicas.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's like end_migration, but keeps the old replicas intact, just
removing the transition (including new replicas).
Coordinators use old replicas.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a bunch of debug- and trace-level logging of locator::node-s that also includes current_backtrace(). Printing a node is done via the debug_format() helper that generates and returns an sstring to print. Backtrace printing is not lightweight on its own, because of the backtrace collecting. To avoid slowing things down at info log level, which is the default, all such prints are wrapped with explicit if-s checking whether the log level is enabled.
This PR removes those level checks by introducing a lazy_backtrace() helper and by providing a formatter for nodes that also results in lazy node format string calculation.
Closes scylladb/scylladb#17235
* github.com:scylladb/scylladb:
topology: Restore indentation after previous patch
topology: Drop if_enabled checks for logging
topology: Add lazy_backtrace() helper
topology: Add printer wrapper for node* and formatter for it
topology: Expand formatter<locator::node>
This commit renames keyspace::get_effective_replication_map()
to keyspace::get_vnode_effective_replication_map(). This change
is required to ease the analysis of the usage of this function.
When tablets are enabled, this function shall not be used;
instead of the per-keyspace map, the per-table replication map
should be used.
The rename was performed to distinguish between those two calls.
The next step will be an audit of usages of
keyspace::get_vnode_effective_replication_map().
Refs: scylladb#16626
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
Closes scylladb/scylladb#17314
Commit 904bafd069 consolidated the two
existing for_each_tablet() overloads into the one which takes a
future<>-returning callback. It also added yields to the bodies of said
callbacks. This is unnecessary: the loop in for_each_tablet() already
has a yield per tablet, which should be enough to prevent stalls.
This patch is a follow-up to #17118.
Closes scylladb/scylladb#17284
Now all the logged arguments are lazily evaluated (node* format string
and backtrace), so the preliminary log-level checks are not needed.
indentation is deliberately left broken
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This helper returns a lazy_eval-ed current_backtrace(), so the
backtrace will be generated and printed only if the logger is really
going to emit it at its current log level.
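A sketch of the helper under that description, assuming
seastar::value_of() from seastar/util/lazy.hh as the lazy wrapper:

    #include <seastar/util/backtrace.hh>
    #include <seastar/util/lazy.hh>

    // The backtrace is collected only if the logger actually formats
    // its arguments at the current log level.
    static auto lazy_backtrace() {
        return seastar::value_of([] { return seastar::current_backtrace(); });
    }

    // usage: logger.debug("node changed at {}", lazy_backtrace());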
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently, to print node information there's a debug_format(node*)
helper function that returns an sstring object. Here's a formatter
that's more flexible and convenient, and a node_printer wrapper, since
formatters cannot format non-void pointers.
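A sketch of the wrapper pattern, with a simplified stand-in node (the
real formatter prints locator::node state):

    #include <fmt/core.h>

    struct node { int id; };  // simplified stand-in for locator::node

    // fmt cannot format non-void pointers directly, so wrap the pointer.
    struct node_printer {
        const node* v;
        explicit node_printer(const node* n) : v(n) {}
    };

    template <>
    struct fmt::formatter<node_printer> {
        constexpr auto parse(fmt::format_parse_context& ctx) { return ctx.begin(); }
        auto format(const node_printer& np, fmt::format_context& ctx) const {
            if (!np.v) {
                return fmt::format_to(ctx.out(), "(null node)");
            }
            return fmt::format_to(ctx.out(), "node(id={})", np.v->id);
        }
    };

    // usage: fmt::print("{}", node_printer(n));  // n is a node*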
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Equip it with a :v specifier that turns verbose mode on and prints much
more data about the node. The main user will appear in the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The table query param is added to get the describe_ring result for a
given table.
Both vnode tables and tablet tables can use this table param, so it is
easier for users to use.
If the table param is not provided by the user and the keyspace
contains a tablet table, the request will be rejected.
E.g.,
curl "http://127.0.0.1:10000/storage_service/describe_ring/system_auth?table=roles"
curl "http://127.0.0.1:10000/storage_service/describe_ring/ks1?table=standard1"
Refs #16509
Closes scylladb/scylladb#17118
* github.com:scylladb/scylladb:
tablets: Convert to use the new version of for_each_tablet
storage_service: Add describe_ring support for tablet table
storage_service: Mark host2ip as const
tablets: Add for_each_tablet_gently
When creating a keyspace, scylla allows setting an RF value smaller than the number of nodes in the DC. With vnodes, when new nodes are bootstrapped, new tokens are inserted, thus catching up with the RF. With tablets, this is not the case, as the replica set remains unchanged.
With tablets, it's a good chance not to mimic the vnodes behavior and to require as many nodes to be up and running as the requested RF. This patch implements this in a lazy manner -- when creating a keyspace the RF can be anything, but when a new table is created the topology should meet the RF requirements. If not met, the user can bootstrap new nodes or ALTER KEYSPACE.
closes: #16529
Closes scylladb/scylladb#17079
* github.com:scylladb/scylladb:
tablets: Make sure topology has enough endpoints for RF
cql-pytest: Disable tablets when RF > nodes-in-DC
test: Remove test that configures RF larger than the number of nodes
keyspace_metadata: Include tablets property in DESCRIBE
RF values appear as strings, and strategy classes convert them to integers. This PR removes some duplication of effort in the conversion code.
Closes scylladb/scylladb#17132
* github.com:scylladb/scylladb:
network_topology_strategy: Do not walk list of datacenters twice
replication_strategy: Do not convert string RF into int twice
abstract_replication_strategy: Make validate_replication_factor return value
When creating a keyspace, scylla allows setting an RF value smaller than
the number of nodes in the DC. With vnodes, when new nodes are
bootstrapped, new tokens are inserted, thus catching up with the RF.
With tablets, this is not the case, as the replica set remains
unchanged.
With tablets, it's a good chance not to mimic the vnodes behavior and
to require as many nodes to be up and running as the requested RF. This
patch implements this in a lazy manner -- when creating a keyspace the
RF can be anything, but when a new table is created the topology should
meet the RF requirements. If not met, the user can bootstrap new nodes
or ALTER KEYSPACE.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
get0() dates back to the days when Seastar futures carried tuples, and get0() was a way to get the first (and usually only) element. Now it's a distraction, and Seastar is likely to deprecate and remove it.
Replace it with seastar::future::get(), which does the same thing.
Closes scylladb/scylladb#17130
* github.com:scylladb/scylladb:
treewide: replace seastar::future::get0() with seastar::future::get()
sstable: capture return value of get0() using auto
utils: result_loop: define result_type with decayed type
[avi: add another one that snuck in while this was cooking]
get0() dates back to the days when Seastar futures carried tuples, and
get0() was a way to get the first (and usually only) element. Now
it's a distraction, and Seastar is likely to deprecate and remove it.
Replace it with seastar::future::get(), which does the same thing.
The constructor of that class walks the provided options to get per-DC
replication factors. It does this twice -- first to populate the dc:rf
map, second to calculate the sum of the provided RF values. The latter
loop can be optimized away.
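For illustration, a sketch of the single-pass version, with
hypothetical already-parsed options:

    #include <map>
    #include <string>

    // Populate the dc -> rf map and compute the total in one walk.
    static size_t load_dc_rfs(const std::map<std::string, int>& opts,
                              std::map<std::string, int>& dc_rf) {
        size_t total = 0;
        for (const auto& [dc, rf] : opts) {
            dc_rf.emplace(dc, rf);
            total += rf;  // accumulated here; no second loop needed
        }
        return total;
    }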
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two replication strategy classes that validate the string RF
and then convert it into an integer. Since the validation helper
returns the parsed value, it can just be used, avoiding the second
conversion.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The helper in question checks that the string RF is indeed an integer.
Make this helper return the "checked" integer value, since it does the
conversion anyway, and rename it to parse_... to reflect what it now
does. The next patches will make use of this change.
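A sketch of the reworked helper (hypothetical name and signature):

    #include <cstdlib>
    #include <stdexcept>
    #include <string>

    // Validates that the string RF is an integer and returns the parsed
    // value, so callers don't have to convert it a second time.
    static long parse_replication_factor(const std::string& rf) {
        char* end = nullptr;
        long value = std::strtol(rf.c_str(), &end, 10);
        if (end == rf.c_str() || *end != '\0' || value < 0) {
            throw std::invalid_argument(
                "replication factor must be a non-negative integer: " + rf);
        }
        return value;
    }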
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
this comment has already served its purpose when rewriting
C* in C++. since we've re-implemented it, there is no need to keep it
around.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#17120
The motivation for tablet resizing is that we want to keep the average tablet size reasonable, such that load rebalancing can remain efficient. Too large a tablet makes migration inefficient, therefore slowing down the balancer.
If the avg size grows beyond the upper bound (split threshold), then the balancer decides to split. A split spans all tablets of a table, due to the power-of-two constraint.
Likewise, if the avg size decreases below the lower bound (merge threshold), then a merge takes place in order to grow the avg size. Merge is not implemented yet, although this series lays the foundation for it to be implemented later on.
A resize decision can be revoked if the avg size changes and the decision is no longer needed. For example, let's say a table is being split and the avg size drops below the target size (which is 50% of the split threshold and 100% of the merge one). That means that after the split, the avg size would drop below the merge threshold, causing a merge right after the split, which is wasteful, so it's better to just cancel the split.
Tablet metadata gains two new fields for managing this:
- resize_type: the resize decision type; can be "merge", "split", or "none".
- resize_seq_number: a sequence number that works as the global identifier of the decision (monotonically increasing, increased by 1 on every new decision emitted by the coordinator).
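For illustration, a sketch of the decision rule described above, with
hypothetical threshold parameters (target size being 50% of the split
threshold):

    #include <cstdint>

    enum class resize_type { none, split, merge };

    struct resize_decision {
        resize_type type = resize_type::none;
        uint64_t seq_number = 0;  // global decision id
    };

    resize_decision decide(uint64_t avg_tablet_size,
                           uint64_t split_threshold,
                           uint64_t merge_threshold,
                           uint64_t last_seq_number) {
        resize_decision d;
        if (avg_tablet_size > split_threshold) {
            d.type = resize_type::split;  // halve the avg size
        } else if (avg_tablet_size < merge_threshold) {
            d.type = resize_type::merge;  // grow the avg size (future work)
        }
        if (d.type != resize_type::none) {
            // Monotonically increasing: +1 on every new decision.
            d.seq_number = last_seq_number + 1;
        }
        return d;
    }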
A new RPC was implemented to pull stats from each table replica, such that the load balancer can calculate the avg tablet size and know the "split status" for a given table. The avg size is aggregated carefully, taking into account the RF of each DC (which might differ).
When a table is done splitting its storage, it loads (mirrors) the resize_seq_number from tablet metadata into its local state (in other words, "my split status is ready"). If a table is split-ready, the coordinator will see that the table's seq number is the same as the one in tablet metadata. This helps to distinguish stale decisions from the latest one (in case decisions are revoked and re-emitted later on). The status is also aggregated carefully, by taking the minimum seq number among all replicas, so the coordinator will only update topology when all replicas are ready.
When the load balancer emits a split decision, replicas listen for the need to split with a "split monitor" that is awakened once a table's replication metadata is updated, and detects the need for a split (i.e. the resize_type field is "split").
The split monitor will start the splitting of compaction groups (using the mechanism introduced in 081f30d149) for the table. And once the splitting work is completed, the table updates its local state as having completed the split.
When the coordinator pulls the split status of all replicas for a table via RPC, the balancer can see whether that table is ready for "finalizing" the decision, which is about updating tablet metadata to split each tablet into two. Once table replicas have their replication metadata updated with the new tablet count, they can appropriately update their set of compaction groups (that were previously split in the preparation step).
Fixes #16536.
Closes scylladb/scylladb#16580
* github.com:scylladb/scylladb:
test/topology_experimental_raft: Add tablet split test
replica: Bypass reshape on boot with tablets temporarily
replica: Fix table::compaction_group_for_sstable() for tablet streaming
test/topology_experimental_raft: Disable load balancer in test fencing
replica: Remap compaction groups when tablet split is finalized
service: Split tablet map when split request is finalized
replica: Update table split status if completed split compaction work
storage_service: Implement split monitor
topology_coordinator: Generate updates for resize decisions made by balancer
load_balancer: Introduce metrics for resize decisions
db: Make target tablet size a live-updateable config option
load_balancer: Implement resize decisions
service: Wire table_resize_plan into migration_plan
service: Introduce table_resize_plan
tablet_mutation_builder: Add set_resize_decision()
topology_coordinator: Wire load stats into load balancer
storage_service: Allow tablet split and migration to happen concurrently
topology_coordinator: Periodically retrieve table_load_stats
locator: Introduce topology::get_datacenter_nodes()
storage_service: Implement table_load_stats RPC
replica: Expose table_load_stats in table
replica: Introduce storage_group::live_disk_space_used()
locator: Introduce table_load_stats
tablets: Add resize decision metadata to tablet metadata
locator: Introduce resize_decision
We do not support tablet resharding yet. All tablet-related code assumes that the (host_id, shard) tablet replica is always valid. Violating this leads to undefined behaviour: errors in the tablet load balancer and potential crashes.
Avoid this by refusing to start if the need for resharding is detected. Be as lenient as possible: check all tablets with a replica on this node, and only refuse startup if at least one tablet has an invalid replica shard.
Startup will fail as:
ERROR 2024-01-26 07:03:06,931 [shard 0:main] init - Startup failed: std::runtime_error (Detected a tablet with invalid replica shard, reducing shard count with tablet-enabled tables is not yet supported. Replace the node instead.)
Refs: #16739
Fixes: #16843
Closes scylladb/scylladb#17008
* github.com:scylladb/scylladb:
test/topology_experimental_raft: test_tablets.py: add test for resharding
test/pylib: manager[_client]: add update_cmdline()
main: refuse startup when tablet resharding is required
locator: tablets: add check_tablet_replica_shards()
Checks that all tablets with a replica on this node have a valid
replica shard (< smp::count).
Will be used to check whether the node can start up with the current
shard count.
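A sketch of the check, with a hypothetical simplified replica type
(seastar::smp::count being the node's shard count):

    #include <seastar/core/smp.hh>
    #include <vector>

    struct tablet_replica {
        unsigned shard;  // simplified: only the shard matters here
    };

    // A replica shard is valid only if it is below the current shard count.
    bool check_tablet_replica_shards(const std::vector<tablet_replica>& local_replicas) {
        for (const auto& r : local_replicas) {
            if (r.shard >= seastar::smp::count) {
                return false;  // resharding would be required; refuse startup
            }
        }
        return true;
    }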