scylladb

Author	SHA1	Message	Date
Pavel Emelyanov	78f3fc8890	tablet_allocator: Put more info into failed-to-drain exception When balancer fails to find a node to balance drained tablets into, it throws an exception with tablet id and node id, but it's also good to know more details about the balancing state that lead to failure refs: #19504 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> (cherry picked from commit `c3d9831c5f`) Closes scylladb/scylladb#19619	2024-07-05 11:17:37 +03:00
Tomasz Grabiec	d5ebfea1ff	tablet_scheduler: Make disabling of balancing interrupt shuffle mode Tests will rely on that, they will run in shuffle mode, and disable balancing around section which otherwise would be infinitely blocked by ongoing shuffling (like repair). (cherry picked from commit `1513d6f0b0`)	2024-06-06 13:01:18 +00:00
Tomasz Grabiec	3fec9e1344	tablet_scheduler: Log whether balancing is considered as enabled (cherry picked from commit `6c64cf33df`)	2024-06-06 13:01:18 +00:00
Botond Dénes	a38d5463ef	Merge '[Backport 6.0] tablets: load balancer: Use random selection of candidates when moving tablets' from ScyllaDB In order to avoid per-table tablet load imbalance from forming in the cluster after adding nodes, the load balancer now picks the candidate tablet at random. This should keep the per-table distribution on the target node similar to the distribution on the source nodes. Currently, candidate selection picks the first tablet in the unordered_set, so the distribution depends on hashing in the unordered set. Due to the way hash is calculated, table id dominates the hash and a single table can be chosen more often for migration away. This can result in imbalance of tablets for any given table after bootstrapping a new node. For example, consider the following results of a simulation which starts with a 6-node cluster and does a sequence of node bootstraps and decommissions. One table has 4096 tablets and RF=1, and the other has 256 tablets and RF=2. Before the patch, the smaller table has node overcommit of 2.34 in the worst topology state, while after the patch it has overcommit of 1.65. 
Overcommit is calculated as max load (tablet count per node) divided by perfect average load (all tablets / nodes):
Run #861, params: {iterations=6, nodes=6, tablets1=4096 (10.7/sh), tablets2=256 (1.3/sh), rf1=1, rf2=2, shards=64}
Overcommit : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}}
Overcommit : worst: {table1={shard=1.23, node=1.10}, table2={shard=9.85, node=1.65}}
Overcommit (old) : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}}
Overcommit (old) : worst: {table1={shard=1.31, node=1.12}, table2={shard=64.00, node=2.34}}
The worst state before the patch had the following distribution of tablets for the smaller table:
Load on host ba7f866d...: total=171, min=1, max=7, spread=6, avg=2.67, overcommit=2.62
Load on host 4049ae8d...: total=102, min=0, max=6, spread=6, avg=1.59, overcommit=3.76
Load on host 3b499995...: total=89, min=0, max=4, spread=4, avg=1.39, overcommit=2.88
Load on host ad33bede...: total=63, min=0, max=3, spread=3, avg=0.98, overcommit=3.05
Load on host 0c2e65dc...: total=57, min=0, max=3, spread=3, avg=0.89, overcommit=3.37
Load on host 3f2d32d4...: total=27, min=0, max=2, spread=2, avg=0.42, overcommit=4.74
Load on host 9de9f71b...: total=3, min=0, max=1, spread=1, avg=0.05, overcommit=21.33
One node has as many as 171 tablets of that table and another one has as few as 3. 
After the patch, the worst distribution looks like this:
Load on host 94a02049...: total=121, min=1, max=6, spread=5, avg=1.89, overcommit=3.17
Load on host 65ac6145...: total=87, min=0, max=5, spread=5, avg=1.36, overcommit=3.68
Load on host 856a66d1...: total=80, min=0, max=5, spread=5, avg=1.25, overcommit=4.00
Load on host e3ac4a41...: total=77, min=0, max=4, spread=4, avg=1.20, overcommit=3.32
Load on host 81af623f...: total=66, min=0, max=4, spread=4, avg=1.03, overcommit=3.88
Load on host 4a038569...: total=47, min=0, max=2, spread=2, avg=0.73, overcommit=2.72
Load on host c6ab3fe9...: total=34, min=0, max=3, spread=3, avg=0.53, overcommit=5.65
The most-loaded node has 121 tablets and the least-loaded node has 34 tablets. It's still not good, a better distribution is possible, but it's an improvement. Refs #16824 (cherry picked from commit `3be6120e3b`) (cherry picked from commit `c9bcb5e400`) (cherry picked from commit `7b1eea794b`) (cherry picked from commit `603abddca9`) Refs #18885 Closes scylladb/scylladb#19036 * github.com:scylladb/scylladb: tablets: load balancer: Use random selection of candidates when moving tablets test: perf: Add test for tablet load balancer effectiveness load_sketch: Extract get_shard_minmax() load_sketch: Allow populating only for a given table	2024-06-03 12:25:05 +03:00
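The difference between the old and new candidate selection described in the message above can be illustrated with a minimal C++ sketch. The helper names `pick_first_candidate` and `pick_random_candidate` are hypothetical and are not taken from the ScyllaDB source; the sketch only assumes that candidates live in a `std::unordered_set`, as the message states.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <random>
#include <unordered_set>

// Hypothetical helper: the old behaviour as described in the commit message.
// The result depends on the hash order of the set, so some keys can be
// favoured systematically.
template <typename T>
const T& pick_first_candidate(const std::unordered_set<T>& candidates) {
    assert(!candidates.empty());
    return *candidates.begin();
}

// Hypothetical helper: pick a uniformly random element instead.
// Walking the iterator is O(n); a real implementation may well use a
// different container to make this cheaper.
template <typename T, typename Rng>
const T& pick_random_candidate(const std::unordered_set<T>& candidates, Rng& rng) {
    assert(!candidates.empty());
    std::uniform_int_distribution<std::size_t> dist(0, candidates.size() - 1);
    auto it = candidates.begin();
    std::advance(it, dist(rng));
    return *it;
}
```

With a seeded engine such as `std::mt19937`, repeated calls to `pick_random_candidate` spread the choice over all elements, whereas `pick_first_candidate` keeps returning whatever element the hash order places first.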
Pavel Emelyanov	62a23fd86a	config: Remove experimental TABLETS feature ... and replace it with boolean enable_tablets option. All the places in the code are patched to check the latter option instead of the former feature. The option is OFF by default, but the default scylla.yaml file sets this to true, so that newly installed clusters turn tablets ON. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> (cherry picked from commit `83d491af02`) Closes scylladb/scylladb#19012	2024-06-03 12:16:41 +03:00
Tomasz Grabiec	b9c88fdf4b	tablets: load balancer: Use random selection of candidates when moving tablets In order to avoid per-table tablet load imbalance from forming in the cluster after adding nodes, the load balancer now picks the candidate tablet at random. This should keep the per-table distribution on the target node similar to the distribution on the source nodes. Currently, candidate selection picks the first tablet in the unordered_set, so the distribution depends on hashing in the unordered set. Due to the way hash is calculated, table id dominates the hash and a single table can be chosen more often for migration away. This can result in imbalance of tablets for any given table after bootstrapping a new node. For example, consider the following results of a simulation which starts with a 6-node cluster and does a sequence of node bootstraps and decommissions. One table has 4096 tablets and RF=1, and the other has 256 tablets and RF=2. Before the patch, the smaller table has node overcommit of 2.34 in the worst topology state, while after the patch it has overcommit of 1.65. 
Overcommit is calculated as max load (tablet count per node) divided by perfect average load (all tablets / nodes):
Run #861, params: {iterations=6, nodes=6, tablets1=4096 (10.7/sh), tablets2=256 (1.3/sh), rf1=1, rf2=2, shards=64}
Overcommit : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}}
Overcommit : worst: {table1={shard=1.23, node=1.10}, table2={shard=9.85, node=1.65}}
Overcommit (old) : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}}
Overcommit (old) : worst: {table1={shard=1.31, node=1.12}, table2={shard=64.00, node=2.34}}
The worst state before the patch had the following distribution of tablets for the smaller table:
Load on host ba7f866d...: total=171, min=1, max=7, spread=6, avg=2.67, overcommit=2.62
Load on host 4049ae8d...: total=102, min=0, max=6, spread=6, avg=1.59, overcommit=3.76
Load on host 3b499995...: total=89, min=0, max=4, spread=4, avg=1.39, overcommit=2.88
Load on host ad33bede...: total=63, min=0, max=3, spread=3, avg=0.98, overcommit=3.05
Load on host 0c2e65dc...: total=57, min=0, max=3, spread=3, avg=0.89, overcommit=3.37
Load on host 3f2d32d4...: total=27, min=0, max=2, spread=2, avg=0.42, overcommit=4.74
Load on host 9de9f71b...: total=3, min=0, max=1, spread=1, avg=0.05, overcommit=21.33
One node has as many as 171 tablets of that table and another one has as few as 3. 
After the patch, the worst distribution looks like this:
Load on host 94a02049...: total=121, min=1, max=6, spread=5, avg=1.89, overcommit=3.17
Load on host 65ac6145...: total=87, min=0, max=5, spread=5, avg=1.36, overcommit=3.68
Load on host 856a66d1...: total=80, min=0, max=5, spread=5, avg=1.25, overcommit=4.00
Load on host e3ac4a41...: total=77, min=0, max=4, spread=4, avg=1.20, overcommit=3.32
Load on host 81af623f...: total=66, min=0, max=4, spread=4, avg=1.03, overcommit=3.88
Load on host 4a038569...: total=47, min=0, max=2, spread=2, avg=0.73, overcommit=2.72
Load on host c6ab3fe9...: total=34, min=0, max=3, spread=3, avg=0.53, overcommit=5.65
The most-loaded node has 121 tablets and the least-loaded node has 34 tablets. It's still not good, a better distribution is possible, but it's an improvement. Refs #16824 (cherry picked from commit `603abddca9`)	2024-06-02 22:40:46 +00:00
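The node-level overcommit figure quoted in the message (max load divided by perfect average load) can be reproduced from the per-host totals. The following is a minimal sketch of that arithmetic only; the function name `node_overcommit` is an assumption for illustration and does not come from the ScyllaDB test code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: overcommit = max per-node tablet count divided by the
// perfect average (all tablet replicas / number of nodes).
double node_overcommit(const std::vector<std::size_t>& tablets_per_node) {
    assert(!tablets_per_node.empty());
    const std::size_t total = std::accumulate(
        tablets_per_node.begin(), tablets_per_node.end(), std::size_t{0});
    assert(total > 0);
    const double perfect_avg =
        static_cast<double>(total) / static_cast<double>(tablets_per_node.size());
    const std::size_t max_load =
        *std::max_element(tablets_per_node.begin(), tablets_per_node.end());
    return static_cast<double>(max_load) / perfect_avg;
}
```

Feeding in the seven per-host totals from the "before" listing (171, 102, 89, 63, 57, 27, 3; 512 replicas in all, matching 256 tablets at RF=2) gives 171 / (512 / 7) ≈ 2.34, and the "after" listing (121, 87, 80, 77, 66, 47, 34) gives 121 / (512 / 7) ≈ 1.65, which agrees with the node overcommit values reported for table2.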
Paweł Zakrzewski	cedb47d843	tablet_allocator: make load_balancer_stats_manager configurable by name This is needed, because the same name cannot be used for 2 separate entities, because we're getting double-metrics-registration error, thus the names have to be configurable, not hardcoded.	2024-05-30 08:33:15 +03:00
Tomasz Grabiec	db9d3f0128	tablet_allocator: Generate intra-node migration plan Intra-node migrations are scheduled for each node independently with the aim to equalize per-shard tablet count on each node. This is needed to avoid severe imbalance between shards which can happen when some table grows and is split. The inter-node balance can be equal, so inter-node migration cannot fix the imbalance. Also, if RF=N then there is not even a possibility of moving tablets around to fix the imbalance. The only way to bring the system to balance is to move tablets within the nodes. After scheduling inter-node migrations, the algorithm schedules intra-node migrations. This means that across-node migrations can proceed in parallel with intra-node migrations if there is free capacity to carry them out, but across-node migrations have higher priority. Fixes #16594	2024-05-16 00:28:47 +02:00
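The goal stated above, equalizing per-shard tablet count within one node, can be sketched with a simple greedy loop. This is an illustrative simplification under stated assumptions, not the algorithm in `tablet_allocator`: the type `intra_node_move` and the function `plan_intra_node` are hypothetical names, and the sketch ignores per-shard load limits and the priority of inter-node migrations mentioned in the message.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical type: one planned move of a single tablet between two shards
// of the same node.
struct intra_node_move {
    std::size_t src_shard;
    std::size_t dst_shard;
};

// Hypothetical greedy planner: repeatedly move one tablet from the most
// loaded shard to the least loaded one until the counts differ by at most 1.
std::vector<intra_node_move> plan_intra_node(std::vector<std::size_t> shard_load) {
    std::vector<intra_node_move> plan;
    if (shard_load.size() < 2) {
        return plan;
    }
    while (true) {
        auto [min_it, max_it] =
            std::minmax_element(shard_load.begin(), shard_load.end());
        if (*max_it - *min_it <= 1) {
            break; // already as balanced as integer counts allow
        }
        --*max_it;
        ++*min_it;
        plan.push_back({static_cast<std::size_t>(max_it - shard_load.begin()),
                        static_cast<std::size_t>(min_it - shard_load.begin())});
    }
    return plan;
}
```

For a node whose three shards hold 7, 0 and 1 tablets, this sketch plans four moves and ends with counts of 3, 3 and 2.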
Tomasz Grabiec	793af3d6e1	tablet_allocator: Extract make_internode_plan() Currently the load balancer is only generating an inter-node plan, and the algorithm is embedded in make_plan(). The method will become even harder to follow once we add more kinds of plan generating steps, e.g. intra-node plan. Extract the inter-node plan to make it easier to add other plans and see the grand flow.	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	f95a0f0182	tablet_allocator: Maintain candidate list and shard tablet count for target nodes The node_load datastructure was not updated to reflect migration decisions on the target node. This is not needed for inter-node migration because target nodes are not considered as sources. But we want it to reflect migration decisions so that later inter-node migration sees an accurate picture with earlier migrations reflected in node_load.	2024-05-16 00:28:46 +02:00
Tomasz Grabiec	c86f659421	tablet_allocator: Lift apply_load/can_accept_load lambdas to member functions Will be needed by member methods which generate migration plans.	2024-05-16 00:28:46 +02:00
Raphael S. Carvalho	62b1cfa89c	topology_coordinator: Fix synchronization of tablet split with other concurrent ops Finalization of tablet split was only synchronizing with migrations, but that's not enough as we want to make sure that all processes like repair completes first as they might hold erm and therefore will be working with a "stale" version of token metadata. For synchronization to work properly, handling of tablet split finalize will now take over the state machine, when possible, and execute a global token metadata barrier to guarantee that update in topology by split won't cause problems. Repair for example could be writing a sstable with stale metadata, and therefore, could generate a sstable that spans multiple tablets. We don't want that to happen, therefore we need the barrier. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#18380	2024-04-30 19:23:28 +02:00
Kefu Chai	a439ebcfce	treewide: include fmt/ranges.h and/or fmt/std.h before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we include `fmt/ranges.h` and/or `fmt/std.h` for formatting the container types, like vector, map optional and variant using {fmt} instead of the homebrew formatter based on operator<<. with this change, the changes adding fmt::formatter and the changes using ostream formatter explicitly, we are allowed to drop `FMT_DEPRECATED_OSTREAM` macro. Refs scylladb#13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-04-19 22:56:16 +08:00
Avi Kivity	72bbe75d5b	Merge 'Fix node replace with tablets for RF=N' from Tomasz Grabiec This PR fixes a problem with replacing a node with tablets when RF=N. Currently, this will fail because tablet replica allocation for rebuild will not be able to find a viable destination, as the replacing node is not considered to be a candidate. It cannot be a candidate because replace rolls back on failure and we cannot roll back after tablets were migrated. The solution taken here is to not drain tablet replicas from replaced node during topology request but leave it to happen later after the replaced node is in left state and replacing node is in normal state. The replacing node waits for this draining to be complete on boot before the node is considered booted. Fixes https://github.com/scylladb/scylladb/issues/17025 Nodes in the left state will be kept in tablet replica sets for a while after node replace is done, until the new replica is rebuilt. So we need to know about those node's location (dc, rack) for two reasons: 1) algorithms which work with replica sets filter nodes based on their location. For example materialized views code which pairs base replicas with view replicas filters by datacenter first. 2) tablet scheduler needs to identify each node's location in order to make decisions about new replica placement. It's ok to not know the IP, and we don't keep it. Those nodes will not be present in the IP-based replica sets, e.g. those returned by get_natural_endpoints(), only in host_id-based replica sets. storage_proxy request coordination is not affected. Nodes in the left state are still not present in token ring, and not considered to be members of the ring (datacanter endpoints excludes them). In the future we could make the change even more transparent by only loading locator::node* for those nodes and keeping node* in tablet replica sets. Currently left nodes are never removed from topology, so will accumulate in memory. 
We could garbage-collect them from topology coordinator if a left node is absent in any replica set. That means we need a new state - left_for_real. Closes scylladb/scylladb#17388 * github.com:scylladb/scylladb: test: py: Add test for view replica pairing after replace raft, api: Add RESTful API to query current leader of a raft group test: test_tablets_removenode: Verify replacing when there is no spare node doc: topology-on-raft: Document replace behavior with tablets tablets, raft topology: Rebuild tablets after replacing node is normal tablets: load_balancer: Access node attributes via node struct tablets: load_balancer: Extract ensure_node() mv: Switch to using host_id-based replica set effective_replication_map: Introduce host_id-based get_replicas() raft topology: Keep nodes in the left state to topology tablets: Introduce read_required_hosts()	2024-03-18 16:16:08 +02:00
Tomasz Grabiec	1c71f44e63	tablets, raft topology: Rebuild tablets after replacing node is normal This fixes a problem with replacing a node with tablets when RF=N. Currently, this will fail because new tablet replica allocation will not be able to find a viable destination, as the replacing node is not considered a candidate. It cannot be a candidate because replace rolls back on failure and we cannot roll back after tablets were migrated. The solution taken here is to not drain tablet replicas from replaced node during topology request but leave it to happen later after the replaced node is left and replacing node is normal. The replacing node waits for this draining to be complete on boot before the node is considered booted. Fixes #17025	2024-03-15 13:20:08 +01:00
Tomasz Grabiec	b2418fab39	tablets: load_balancer: Access node attributes via node struct Reduces lookups into topology and decouples the algorithm more from the topology object.	2024-03-15 11:22:34 +01:00
Tomasz Grabiec	9090050244	tablets: load_balancer: Extract ensure_node() Will be called in another loop to populate the "nodes" map with left node.	2024-03-15 11:22:32 +01:00
Tomasz Grabiec	8c5d088928	Merge 'Drop tablets of dropped views and indices' from Benny Halevy This series adds notification before dropping views and indices so that the tablet_allocator can generate mutations to respectively drop all tablets associated with them from system.tablets. Additional unit tests were added for these cases. Note that one case is not yet tested: where a table is allowed to be dropped while having views that depend on it, when it is dropped from the alternator path. This is tested indirectly by testing dropping a table with live secondary index as it follows the same notification path as views in this series. Fixes #17627 Closes scylladb/scylladb#17773 * github.com:scylladb/scylladb: migration_manager: notify before_drop_column_family when dropping indices schema_tables: make_update_indices_mutations: use find_schema to lookup the view of dropped indices migration_manager: notify before_drop_column_family before dropping views cql-pytest: test_tablets: add test_tablets_are_dropped_when_dropping_table tablet_allocator: on_before_drop_column_family: remove unused result variable	2024-03-14 22:52:29 +01:00
Benny Halevy	358e92e645	migration_manager: notify before_drop_column_family before dropping views Call the before_drop_column_family notifications before dropping the views to allow the tablet_allocator to delete the view's tablets. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-03-14 20:14:56 +02:00
Pavel Emelyanov	b4dd732dab	tablet_allocator: Add skiplist to load_balancer Currently load balancer skips nodes only based on its "administrative" state, i.e. whether it's drained/decommissioned/removed/etc. There's no way to exclude any node from balancing decision based on anything else. This patch adds this ability by adding skiplist argument to balance_tablets() method. When a node is in it, it will not be considered, as if it was removenode-d. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-03-14 10:47:31 +03:00
Benny Halevy	b73aaee5e4	tablet_allocator: on_before_drop_column_family: remove unused result variable Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-03-14 08:34:02 +02:00
Pavel Emelyanov	72f3b1d5fe	topology.tablets_migration: Add cleanup_target transition stage The new stage will be used to revert migration that fails at some stages. The goal is to cleanup the pending replica, which may have already received some writes, by doing the cleanup RPC to the pending replica, then jumping to "revert_migration" stage introduced earlier. If pending node is dead, the call to cleanup RPC is skipped. Coordinators use old replicas. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-02-20 08:59:06 +03:00
Pavel Emelyanov	ced5bf56eb	topology.tablets_migration: Add revert_migration transition stage It's like end_migration, but keeps old replicas intact, just removing the transition (including new replicas). Coordinators use old replicas. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-02-20 08:53:36 +03:00
Botond Dénes	3f2d7e8b25	tree: remove unnecessary yields around for_each_tablet() Commit `904bafd069` consolidated the two existing for_each_tablet() overloads, to the one which has a future<> returning callback. It also added yields to the bodies of said callbacks. This is unnecessary, the loop in for_each_tablet() already has a yield per tablet, which should be enough to prevent stalls. This patch is a follow-up to #17118 Closes scylladb/scylladb#17284	2024-02-12 17:10:25 +01:00
Botond Dénes	35da9551fb	Merge 'storage_service: Add describe_ring support for tablet table' from Asias He The table query param is added to get the describe_ring result for a given table. Both vnode table and tablet table can use this table param, so it is easier for users to use. If the table param is not provided by user and the keyspace contains tablet table, the request will be rejected. E.g., curl "http://127.0.0.1:10000/storage_service/describe_ring/system_auth?table=roles" curl "http://127.0.0.1:10000/storage_service/describe_ring/ks1?table=standard1" Refs #16509 Closes scylladb/scylladb#17118 * github.com:scylladb/scylladb: tablets: Convert to use the new version of for_each_tablet storage_service: Add describe_ring support for tablet table storage_service: Mark host2ip as const tablets: Add for_each_tablet_gently	2024-02-07 10:41:36 +02:00
Asias He	904bafd069	tablets: Convert to use the new version of for_each_tablet It is gentler than the old one.	2024-02-05 18:45:40 +08:00
Avi Kivity	7cb1c10fed	treewide: replace seastar::future::get0() with seastar::future::get() get0() dates back from the days where Seastar futures carried tuples, and get0() was a way to get the first (and usually only) element. Now it's a distraction, and Seastar is likely to deprecate and remove it. Replace with seastar::future::get(), which does the same thing.	2024-02-02 22:12:57 +08:00
Kefu Chai	110d2e52be	tablet_allocator: do not compare signed and unsigned `available_shards` could be negative when `resize_plan` is empty, and the loop to build `resize_plan` stops at the next iteration after `available_shards` is assigned with a negative number. so, instead of making it an `unsigned`, let's just compare it using `std::cmp_less()`. this change should silence following warning: ``` /home/kefu/.local/bin/clang++ -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_DEPRECATED_OSTREAM -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_DEBUG -DSEASTAR_DEBUG_PROMISE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_SSTRING -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Debug\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -g -O0 -g -gz -std=gnu++20 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wignored-qualifiers -Wno-c++11-narrowing -Wno-mismatched-tags -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -Wno-missing-field-initializers -Wno-deprecated-copy -Wno-ignored-qualifiers -ffile-prefix-map=/home/kefu/dev/scylladb=. 
-march=westmere -U_FORTIFY_SOURCE -Werror=unused-result "-Wno-error=#warnings" -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -MD -MT service/CMakeFiles/service.dir/Debug/tablet_allocator.cc.o -MF service/CMakeFiles/service.dir/Debug/tablet_allocator.cc.o.d -o service/CMakeFiles/service.dir/Debug/tablet_allocator.cc.o -c /home/kefu/dev/scylladb/service/tablet_allocator.cc /home/kefu/dev/scylladb/service/tablet_allocator.cc:529:60: error: comparison of integers of different signs: 'long' and 'const size_t' (aka 'const unsigned long') [-Werror,-Wsign-compare] 529 \| if (resize_plan.size() > 0 && available_shards < size_desc.shard_count) { \| ~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~ ``` Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-02-01 11:01:19 +08:00
Raphael S. Carvalho	bf6f692f60	service: Split tablet map when split request is finalized When load balancer emits finalize request, the coordinator will now react to it by splitting each tablet in the current tablet map and then committing the new map. There can be no active migration while we do it. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-01-25 18:58:43 -03:00
Raphael S. Carvalho	3ef792c4e8	load_balancer: Introduce metrics for resize decisions Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-01-25 18:36:08 -03:00
Raphael S. Carvalho	638e6e30cb	db: Make target tablet size a live-updateable config option Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-01-25 18:36:08 -03:00
Raphael S. Carvalho	7ed5b44d52	load_balancer: Implement resize decisions This implements the ability in load balancer to emit split or merge requests, cancel ongoing ones if they're no longer needed, and also finalize those that are ready for the topology changes. That's all based on average tablet size, collected by coordinator from all nodes, and split and merge thresholds. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-01-25 18:36:08 -03:00
Raphael S. Carvalho	8f7f74c490	service: Wire table_resize_plan into migration_plan Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-01-25 18:36:08 -03:00
Raphael S. Carvalho	490d109055	topology_coordinator: Wire load stats into load balancer Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-01-25 18:36:08 -03:00
Avi Kivity	69d597075a	Merge 'tablets: Add support for removenode and replace handling' from Tomasz Grabiec New tablet replicas are allocated and rebuilt synchronously with node operations. They are safely rebuilt from all existing replicas. The list of ignored nodes passed to node operations is respected. Tablet scheduler is responsible for scheduling tablet rebuilding transition which changes the replicas set. The infrastructure for handling decommission in tablet scheduler is reused for this. Scheduling is done incrementally, respecting per-shard load limits. Rebuilding transitions are recognized by load calculation to affect all tablet replicas. New kind of tablet transition is introduced called "rebuild" which adds new tablet replica and rebuilds it from existing replicas. Other than that, the transition goes through the same stages as regular migration to ensure safe synchronization with request coordinators. In this PR we simply stream from all tablet replicas. Later we should switch to calling repair to avoid sending excessive amounts of data. Fixes https://github.com/scylladb/scylladb/issues/16690. Closes scylladb/scylladb#16894 * github.com:scylladb/scylladb: tests: tablets: Add tests for removenode and replace tablets: Add support for removenode and replace handling topology_coordinator: tablets: Do not fail in a tight loop topology_coordinator: tablets: Avoid warnings about ignored failured future storage_service, topology: Track excluded state in locator::topology raft topology: Introduce param-less topology::get_excluded_nodes() raft topology: Move get_excluded_nodes() to topology tablets: load_balancer: Generalize load tracking tablets: Introduce get_migration_streaming_info() which works on migration request tablets: Move migration_to_transition_info() to tablets.hh tablets: Extract get_new_replicas() which works on migraiton request tablets: Move tablet_migration_info to tablets.hh tablets: Store transition kind per tablet	2024-01-25 14:49:43 +02:00
Botond Dénes	26d814d8be	Merge 'Configure initial tablets count scaling' from Pavel Emelyanov There are currently two options how to "request" the number of initial tablets for a table 1. specify it explicitly when creating a keyspace 2. let scylla calculate it on its own Both are not very nice. The former doesn't take cluster layout into consideration. The latter does, but starts with one tablet per shard, which can be too low if the amount of data grows rapidly. Here's a (maybe temporary) proposal to facilitate at least perf tests -- the --tablets-initial-scale-factor option that enhances the option number two above by multiplying the calculated number of tablets by the configured number. This is what we currently do to run perf tests by patching scylla, with the option it is going to be more convenient. Closes scylladb/scylladb#16919 * github.com:scylladb/scylladb: config: Add --tablets-initial-scale-factor tablet_allocator: Add initial tablets scale to config tablet_allocator: Add config	2024-01-23 13:25:12 +02:00
Tomasz Grabiec	e5dcf03b88	tablets: Add support for removenode and replace handling New tablet replicas are allocated synchronously with node operations. They are safely rebuilt from all existing replicas. The list of ignored nodes passed to node operations is respected. Tablet scheduler is responsible for scheduling tablet transition which changes the replicas set. The infrastructure for handling decommission in tablet scheduler is reused for this. Scheduling is done incrementally, respecting per-shard load limits. Rebuilding transitions are recognized by load calculation to affect all tablet replicas. New kind of tablet transition is introduced called "rebuild" which adds new tablet replica and rebuilds it from existing replicas. Other than that, the transition goes through the same stages as regular migration to ensure safe synchronization with request coordinators. In this PR we simply stream from all tablet replicas. Later we should switch to calling repair to avoid sending excessive amounts of data. Fixes #16690.	2024-01-23 01:19:42 +01:00
Tomasz Grabiec	92f01674f2	tablets: load_balancer: Generalize load tracking This patch removes some duplication of logic and implicit assumptions by creating clear algebra for load impact calculation and its application to state of the load balancer. Will make adding new kinds of tablet transitions with different impact on load much easier.	2024-01-23 01:12:57 +01:00
Tomasz Grabiec	4a06ffb43c	tablets: Store transition kind per tablet Will be used to distinguish regular migration from rebuild, repair and RF change.	2024-01-23 01:12:57 +01:00
Pavel Emelyanov	eb3b237e05	tablet_allocator: Add initial tablets scale to config When allocating tablets for table for the first time their initial count is calculated so that each shard in a cluster gets one tablet. It may happen that more than one initial tablet per shard is better, e.g. perf tests typically rely on that. It's possible to specify the initial tablets count when creating a keyspace, but this number doesn't take the cluster topology into consideration and may also be not very nice. As a temporary solution (e.g. for perf tests) we may add a configurable that scales the initial number of calculated tablets by some factor. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-01-22 19:14:45 +03:00
Pavel Emelyanov	f57b194db0	tablet_allocator: Add config Tablet allocator is a sharded service, that starts in main, it's worth equipping it with a config. Next patches will fill it with some payload Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-01-22 19:13:58 +03:00
Botond Dénes	a48881801a	replica/tablets: drop keyspace_name from system.tablets partition-key The name of the keyspace being part of the partition key is not useful, the table_id already uniquely identifies the table. The keyspace name being part of the key, means that code wanting to interact with this table, often has to resolve the table id, just to be able to provide the keyspace name. This is counter productive, so make the keyspace_name just a static column instead, just like table_name already is. Fixes: #16377 Closes scylladb/scylladb#16881	2024-01-22 13:12:02 +01:00
Kefu Chai	ece2bd2f6e	service: do not include unused headers these unused includes were identified by clangd. see https://clangd.llvm.org/guides/include-cleaner#unused-include-warning for more details on the "Unused include" warning. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16764	2024-01-15 13:29:33 +02:00
Pavel Emelyanov	562fcf0c19	locator: Keep optional initial_tablets on r.s. params Now all the callers have it at hands (spoiler: not yet initialized, but still) so the params can also have it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-12-25 16:02:41 +03:00
Pavel Emelyanov	a943bd927b	locator: Call create_replication_strategy() with r.s. params Previous patch added params to r.s. classes' constructors, but callers don't construct those directly, instead they use the create_r.s.() wrapper. This patch adds params to the wrapper too. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-12-25 15:54:59 +03:00
Tomasz Grabiec	d1c1b59236	storage_service, api: Add API to disable tablet balancing Load balancing needs to be disabled before making a series of manual migrations so that we don't fight with the load balancer. Also will be used in tests to ensure tablets stick to expected locations.	2023-12-06 18:36:17 +01:00
Patryk Jędrzejczak	5027c5f1e5	tablet_allocator: update on_before_create_column_family After adding the keyspace_metadata parameter to migration_listener::on_before_create_column_family, tablet_allocator doesn't need to load it from the database. This change is necessary before merging migration_manager::announce calls in the following commit.	2023-10-31 12:08:03 +01:00
Patryk Jędrzejczak	a762179972	migration_listener: add parameter to on_before_create_column_family After adding the new prepare_new_column_family_announcement that doesn't assume the existence of a keyspace, we also need to get rid of the same assumption in all on_before_create_column_family calls. After all, they may be initiated before creating the keyspace. However, some listeners require keyspace_metadata, so we pass it as a new parameter.	2023-10-31 12:08:03 +01:00
Avi Kivity	d450a145ce	Revert "Merge 'reduce announcements of the automatic schema changes ' from Patryk Jędrzejczak" This reverts commit `4b80130b0b`, reversing changes made to `a5519c7c1f`. It's suspected of causing dtest failures due to a bug in coroutine::parallel_for_each.	2023-10-29 18:32:06 +02:00
Patryk Jędrzejczak	449b4c79c2	tablet_allocator: update on_before_create_column_family After adding the keyspace_metadata parameter to migration_listener::on_before_create_column_family, tablet_allocator doesn't need to load it from the database. This change is necessary before merging migration_manager::announce calls in the following commit.	2023-10-16 14:59:53 +02:00


66 Commits