scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-23 00:02:37 +00:00

Author	SHA1	Message	Date
Tomasz Grabiec	66439bb753	Merge 'load_balancer: apply balance threshold to intranode shard balancing' from Ferenc Szili - Fix intranode shard balancing to respect the size-based balance threshold, preventing unnecessary migrations when load difference between shards is negligible - Add a regression test that verifies the threshold is respected for intranode balancing The intranode shard balancing loop only stopped when the algorithm exhausted the migration candidates or when a migration would go against convergence (it would increase imbalance instead of decrease it). This caused unnecessary tablet migrations for negligible imbalances (e.g., 0.78% difference between shards). The inter-node balancer already uses `is_balanced()` to stop when the relative load difference is within the configured `size_based_balance_threshold`, but this check was missing from the intranode path. Apply the same `is_balanced()` threshold check that is already used for inter-node balancing to the intranode convergence loop. When the relative load difference between the most-loaded and least-loaded shards on a node is within the threshold, the balancer now stops without issuing further migrations. The test creates a single node with 2 shards and 512 tablets: 1. Balanced scenario (257 vs 255 tablets, same size): relative diff = 0.78% < 1% threshold → verifies no intranode migration is emitted 2. Unbalanced scenario (307 vs 205 tablets, same size): relative diff = 33% >> 1% threshold → verifies intranode migration IS emitted Fixes: SCYLLADB-1775 This is a performance improvement which reduces the number of intranode migrations issued, and needs to be backported to versions with size-based load balancing: 2026.1 and 2026.2 Closes scylladb/scylladb#29756 * github.com:scylladb/scylladb: test: add test for intranode balance threshold in size-based mode tablet_allocator: apply balance threshold to intranode shard balancing	2026-05-13 13:09:52 +02:00
Avi Kivity	ddb1181103	Merge 'load_balance: fix drain with forced capacity-based balancing' from Ferenc Szili When `force_capacity_based_balancing` is enabled and a node is being drained/excluded, the tablet allocator incorrectly aborts balancing due to incomplete tablet stats - even though capacity-based balancing doesn't depend on tablet sizes. The tablet allocator normally waits for complete load stats before balancing. An exception exists for drained+excluded nodes (they're unreachable and won't return stats). However, when forced capacity-based balancing is active, this exception was not being applied, causing the balancer to reject the drain plan. Adjust the condition in `tablet_allocator.cc` so that the "ignore missing data for drained nodes" logic applies regardless of whether capacity-based balancing is forced. Added a Boost unit test that forces capacity-based balancing and verifies a drained/excluded node gets its tablets migrated even when tablet size stats are missing. This bug was introduced in 2026.1, so this needs to be backported to 2026.1 and 2026.2 Fixes: SCYLLADB-1803 Closes scylladb/scylladb#29791 * github.com:scylladb/scylladb: test: boost: add drain test for forced capacity-based balancing service: allow draining with forced capacity-based balancing	2026-05-12 12:38:25 +03:00
Ferenc Szili	6856f51097	test: add test for intranode balance threshold in size-based mode Verify that the load balancer does not issue intranode migrations when the load difference between shards is within the size_based_balance_threshold, and that it does issue migrations when the difference exceeds the threshold.	2026-05-12 10:34:25 +02:00
Nadav Har'El	d4aa528834	Merge 'load_balancer: fix tablet allocator dropped table' from Ferenc Szili - Handle dropped tables gracefully in the tablet load balancer's `get_schema_and_rs()` instead of aborting with `on_internal_error` - The load balancer operates on a token metadata snapshot but accesses the live schema for table lookups. A DROP TABLE applied by another fiber between coroutine yield points can remove a table from the live schema while it still exists in the snapshot, causing an abort. `get_schema_and_rs()` now returns `std::optional` and logs a warning in debug log level instead of aborting when a table is missing. All callers skip dropped tables: - `make_sizing_plan`: skips to next table - `make_resize_plan`: skips to next table (merge suppression is moot) - `check_constraints`: returns `skip_info{}` with empty viable targets - `get_rs`: returns `nullptr`, checked by `check_constraints` The call chain is: `make_plan` → `make_internode_plan` → `check_constraints` → `get_rs` → `get_schema_and_rs`. The `make_internode_plan` coroutine has multiple `co_await` yield points (`maybe_yield`, `pick_candidate`) between building the candidate tablet list and checking replication constraints. A DROP TABLE schema mutation applied during any of these yields removes the table from `_db.get_tables_metadata()` while the candidate list still references it. Added `test_load_balancing_with_dropped_table` which simulates the race by capturing a token metadata snapshot, dropping the table, then calling `balance_tablets` with the stale snapshot. Fixes: SCYLLADB-1664 This fix needs to be backported to versions: 2025.4, 2026.1 Closes scylladb/scylladb#29585 * github.com:scylladb/scylladb: test: verify load balancer handles dropped tables gracefully tablet_allocator: handle dropped tables gracefully in get_schema_and_rs	2026-05-10 22:07:51 +03:00
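The "return `std::optional` and let callers skip" pattern described in the commit above can be illustrated with a minimal, self-contained sketch. All names here (`table_id`, `schema_info`, `find_schema`, `plan_tables`) are hypothetical stand-ins, not ScyllaDB's actual types or the real `get_schema_and_rs()` signature; the sketch only assumes that the planner iterates over a stale snapshot while looking tables up in a live registry.

```cpp
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for illustration only.
using table_id = int;
struct schema_info { std::string name; };
using live_schema = std::unordered_map<table_id, schema_info>;

// Returns std::nullopt instead of aborting when the table is gone
// from the live schema (e.g. dropped concurrently).
std::optional<schema_info> find_schema(const live_schema& live, table_id id) {
    auto it = live.find(id);
    if (it == live.end()) {
        return std::nullopt;
    }
    return it->second;
}

// Iterates over table ids taken from a (possibly stale) snapshot and
// skips any table that no longer exists in the live schema.
std::vector<std::string> plan_tables(const std::vector<table_id>& snapshot,
                                     const live_schema& live) {
    std::vector<std::string> planned;
    for (table_id id : snapshot) {
        auto schema = find_schema(live, id);
        if (!schema) {
            continue; // table was dropped after the snapshot was taken
        }
        planned.push_back(schema->name);
    }
    return planned;
}
```

With a snapshot listing tables 1, 2 and 3 and a live schema that holds only 1 and 3, `plan_tables` returns the names of tables 1 and 3 and silently skips 2.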
Ferenc Szili	f7bc8f5fa7	test: boost: add drain test for forced capacity-based balancing Add a Boost unit test that forces capacity-based balancing through configuration and verifies that a drained and excluded node will be drained of its tablets when tablet size stats are missing. The test covers the regression where the allocator rejected the plan due to incomplete tablet stats, even though forced capacity-based balancing does not depend on tablet sizes.	2026-05-07 13:56:36 +02:00
Ferenc Szili	6b3e18c4a9	test: verify load balancer handles dropped tables gracefully Add test_load_balancing_with_dropped_table that simulates the race between DROP TABLE and the load balancer by capturing a token metadata snapshot before dropping the table, then passing the stale snapshot to balance_tablets(). Verifies it completes without aborting and produces no migrations for the dropped table.	2026-04-27 10:33:56 +02:00
Aleksandra Martyniuk	bcdab2e012	service: extend tablet_migration_info to handle rebuilds Make tablet_migration_info::{src,dst} optional, so that it can be reused by rebuild, for respectively leaving and pending replica.	2026-04-17 09:58:07 +02:00
Tomasz Grabiec	84361194c2	test: boost: tablets: Add test for merge with arbitrary tablet count	2026-04-15 10:40:56 +02:00
Tomasz Grabiec	7af9f5366d	tablets, database: Advertise 'arbitrary' layout in snapshot manifest Currently, the manifest advertises "powof2", which is wrong for arbitrary count and boundaries. Introduce a new kind of layout called "arbitrary", and produce it if the tablet map doesn't conform to "powof2" layout. We should also produce tablet boundaries in this case, but that's worked on in a different PR: https://github.com/scylladb/scylladb/pull/28525	2026-04-15 10:40:56 +02:00
Tomasz Grabiec	b6a7023f68	tablets: Prepare for non-power-of-two tablet count This is a step towards more flexibility in managing tablets. A prerequisite before we can split individual tablets, isolating hot partitions, and evening-out tablet sizes by shifting boundaries. After this patch, the system can handle tables with arbitrary tablet count. Tablet allocator is still rounding up desired tablet count to the nearest power of two when allocating tablets for a new table, so unless the tablet map is allocated in some other way, the counts will be still a power of two. We plan to utilize arbitrary count when migrating from vnodes to tablets, by creating a tablet map which matches vnode boundaries. One of the reasons we don't give up on power-of-two by default yet is that it creates an issue with merges. If tablet count is odd, one of the tablets doesn't have a sibling and will not be merged. That can obviously cause imbalance of token space and tablet sizes between tablets. To limit the impact, this patch dynamically chooses which tablet to isolate when initiating a merge. The largest tablet is chosen, as that will minimize imbalance. Otherwise, if we always chose the last tablet to isolate, its size would remain the same while other tablets double in size with each odd-count merge, leading to imbalance. The imbalance will still be there, but the difference in tablet sizes is limited to 2x. Example (3 tablets): [0] owns 1/3 of tokens [1] owns 1/3 of tokens [2] owns 1/3 of tokens After merge: [0] owns 2/3 of tokens [1] owns 1/3 of tokens What we would like instead: Step 1 (split [1]): [0] owns 1/3 of tokens [1] old 1.left, owns 1/6 of tokens [2] old 1.right, owns 1/6 of tokens [3] owns 1/3 of tokens Step 2 (merge): [0] owns 1/2 of tokens [1] owns 1/2 of tokens To do that, we need to be able to split individual tablets, but we're not there yet.	2026-04-15 10:40:55 +02:00
Tomasz Grabiec	66fc7967b8	tablets: Prepare resize_decision to hold data in decisions merge decision will carry a plan - which replica to isolate. So construction from a string will no longer do.	2026-04-15 10:40:55 +02:00
Tomasz Grabiec	01fb97ee78	locator: tablets: Support arbitrary tablet boundaries There are several reasons we want to do that. One is that it will give us more flexibility in distributing the load. We can subdivide tablets at any points, and achieve more evenly-sized tablets. In particular, we can isolate large partitions into separate tablets. Another reason is vnode-to-tablet migration. We could construct a tablet map which matches exactly the vnode boundaries, so migration can happen transparently from the CQL-coordinator's point of view. Implementation details: We store a vector of tokens which represent tablet boundaries in the tablet_id_map. tablet_id keeps its meaning, it's an index into vector of tablets. To avoid logarithmic lookup of tablet_id from the token, we introduce a lookup structure with power-of-two aligned buckets, and store the tablet_id of the tablet which owns the first token in the bucket. This way, lookup needs to consider tablet id range which overlaps with one bucket. If boundaries are more or less aligned, there are around 1-2 tablets overlapping with a bucket, and the lookup is still O(1). Amount of memory used increased, but not significantly relative to old size (because tablet_info is currently fat): For 131'072 tablets: Before: Size of tablet_metadata in memory: 57456 KiB After: Size of tablet_metadata in memory: 59504 KiB	2026-04-15 01:25:14 +02:00
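The bucketed lookup described in the commit above might look roughly like the following sketch. It is not the actual `tablet_id_map` implementation: the class name `boundary_lookup`, the convention that each tablet is identified by the last token it owns, and the requirement that the final boundary equals `INT64_MAX` are all assumptions made for illustration. Each power-of-two aligned bucket remembers the tablet that owns its first token, so a lookup only scans the few tablets overlapping that bucket.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch; names and layout are illustrative only.
class boundary_lookup {
    std::vector<int64_t> _last_token;  // _last_token[i]: last token owned by tablet i (sorted; back() must be INT64_MAX)
    std::vector<size_t> _bucket_first; // tablet owning the first token of each bucket
    unsigned _shift;                   // 64 - log2(number of buckets)

    // Order-preserving map from signed tokens to unsigned bucket space.
    static uint64_t bias(int64_t t) {
        return static_cast<uint64_t>(t) ^ (uint64_t(1) << 63);
    }

    // Linear scan forward from tablet `id` to the tablet owning `token`.
    size_t scan(size_t id, int64_t token) const {
        while (_last_token[id] < token) {
            ++id;
        }
        return id;
    }

public:
    // log2_buckets is assumed to be in the range [1, 63].
    boundary_lookup(std::vector<int64_t> last_tokens, unsigned log2_buckets)
        : _last_token(std::move(last_tokens)), _shift(64 - log2_buckets) {
        size_t id = 0;
        for (uint64_t b = 0; b < (uint64_t(1) << log2_buckets); ++b) {
            // First token covered by bucket b, converted back to signed.
            auto first = static_cast<int64_t>((b << _shift) ^ (uint64_t(1) << 63));
            id = scan(id, first);
            _bucket_first.push_back(id);
        }
    }

    // Expected O(1) when boundaries are roughly aligned with buckets.
    size_t tablet_of(int64_t token) const {
        return scan(_bucket_first[bias(token) >> _shift], token);
    }
};
```

For example, with boundaries `{-100, 0, 50, INT64_MAX}` and four buckets, token `1` falls into the third bucket, whose first token is owned by tablet 1, and a one-step scan lands on tablet 2. The scan length grows only when many tablets crowd into a single bucket, which is the trade-off the commit message describes.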
Tomasz Grabiec	82acdae74b	locator: tablets: Introduce tablet_map::get_split_token() And reimplement existing split-related methods around it. This way we avoid calling dht::compaction_group_of(), and assuming anything about tablet boundaries or tablet count being a power of two. This will make later refactoring easier.	2026-04-15 01:24:48 +02:00
Tomasz Grabiec	2e1d41c206	dht: Introduce get_uniform_tokens()	2026-04-15 01:24:48 +02:00
Avi Kivity	0ae22a09d4	LICENSE: Update to version 1.1 Updated terms of non-commercial use (must be a never-customer).	2026-04-12 19:46:33 +03:00
Pavel Emelyanov	9a2e583f29	storage_service: Make describe_ring_for_table() take table_id All callers already have it. It makes no difference for the method itself with which table identifier to work, but will help to simplify the flow in API handler (next patch) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-20 19:49:24 +03:00
Tomasz Grabiec	1256a9faa7	tablets: Fix deadlock in background storage group merge fiber When it deadlocks, groups stop merging and the compaction group merge backlog will run away. Also, graceful shutdown will be blocked on it. Found by flaky unit test test_merge_chooses_best_replica_with_odd_count, which timed out in 1 in 100 runs. Reason for deadlock: When storage groups are merged, the main compaction group of the new storage group takes a compaction lock, which is appended to _compaction_reenablers_for_merging, and released when the merge completion fiber is done with the whole batch. If we accumulate more than 1 merge cycle for the fiber, deadlock occurs. The lock order is as follows. Initial state: cg0: main cg1: main cg2: main cg3: main After 1st merge: cg0': main [locked], merging_groups=[cg0.main, cg1.main] cg1': main [locked], merging_groups=[cg2.main, cg3.main] After 2nd merge: cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main] The merge completion fiber will try to stop cg0'.main, which will be blocked on the compaction lock, which is held by the reenabler in _compaction_reenablers_for_merging, hence deadlock. The fix is to wait for background merge to finish before we start the next merge. It's achieved by holding old erm in the background merge, and doing a topology barrier from the merge finalizing transition. Background merge is supposed to be a relatively quick operation: it only stops compaction groups, so it may wait for active requests. It shouldn't prolong the barrier indefinitely. Tablet boost unit tests which trigger merge need to be adjusted to call the barrier, otherwise they will be vulnerable to the deadlock. Two cluster tests were removed because they assumed that merge happens in the background. 
Now that it happens as part of merge finalization, and blocks topology state machine, those tests deadlock because they are unable to make topology changes (node bootstrap) while background merge is blocked. The test "test_tablets_merge_waits_for_lwt" needed to be adjusted. It assumed that merge finalization doesn't wait for the erm held by the LWT operation, and triggered tablet movement afterwards, and assumed that this migration will issue a barrier which will block on the LWT operation. After this commit, it's the barrier in merge finalization which is blocked. The test was adjusted to use an earlier log mark when waiting for "Got raft_topology_cmd::barrier_and_drain", which will catch the barrier in merge finalization. Fixes SCYLLADB-928	2026-03-12 22:45:01 +01:00
Tomasz Grabiec	582a4abeb6	test: boost: tablets_test: Save tablet metadata when ACKing split resize decision Needs to be ordered before split finalization, because storage_group must be in split mode already at finalization time. There must be split-ready compaction groups, otherwise finalization fails with this error: Found 0 split ready compaction groups, but expected 2 instead. Exposed by increased split activity in tests.	2026-03-12 22:45:01 +01:00
Dimitrios Symonidis	80b74d7df2	tablet options: Add max_tablet_count tablet option to enforce tablet count upper bounds Introduced a new max_tablet_count tablet option that caps the maximum number of tablets a table can have. This feature is designed primarily for backup and restore workflows. During backup, when load balancing is disabled for snapshot consistency, the current tablet count is recorded in the backup manifest. During restore, max_tablet_count is set to this recorded value, ensuring the restored table's tablet count never exceeds the original snapshot's tablet distribution. This guarantee enables efficient file-based SSTable streaming during restore, as each SSTable remains fully contained within a single tablet boundary. Closes scylladb/scylladb#28450	2026-03-03 11:19:24 +03:00
Nadav Har'El	e463d528fe	test: add unit test for tablet_map::get_secondary_replica() This patch adds a unit test for tablet_map::get_secondary_replica(). It was never officially defined how the "primary" and "secondary" replicas were chosen, and their implementation changed over time, but the one invariant that this test verifies is that the secondary replica and the primary replica must be on different nodes. This test reproduces issue SCYLLADB-777, where we discovered that get_primary_replica() changed without a corresponding change to get_secondary_replica(). So before the previous patch, this test failed, and after the previous patch it passes. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-02-23 16:19:43 +02:00
Asias He	1be80c9e86	repair: Skip auto repair for tables using RF one There is no point running repair for tables using RF one. Row level repair will skip it but the auto repair scheduler will keep scheduling such repairs since repair_time could not be updated. Skip such repairs at the scheduler level for auto repair. If the request is issued by user, we will have to schedule such repair otherwise the user request will never be finished. Fixes SCYLLADB-561 Closes scylladb/scylladb#28640	2026-02-18 14:32:50 +02:00
Tomasz Grabiec	478b8f09df	test: tablets: Check that there are no migrations scheduled on draining nodes In case of decommission, it's not desirable because it's less urgent. In case of removenode, it leads to failure of removenode operation because scheduled co-locating migration will fail if the destination is on the excluded node, and this failure will be interpreted as drain failure and coordinator will cancel the request. Not a problem before "parallel decommission" because this failure is only a streaming failure, not a barrier failure, so exception doesn't escape into the catch clause in transition stage handler, and the migration is simply rolled back. Once draining happens in the tablet migration track, streaming failure will be interpreted as drain failure and cancel the request.	2026-01-18 15:36:07 +01:00
Avi Kivity	c6dfae5661	treewide: #include Seastar headers with angle brackets Seastar is an external library from the point of view of ScyllaDB, so should be included with angle brackets. Closes scylladb/scylladb#27947	2026-01-13 14:56:15 +02:00
Łukasz Paszkowski	62313a6264	load_sketch: Allow populating load_sketch with normalized current load Currently, tablet allocation intentionally ignores current load ( introduced by the commit #1e407ab) which could cause identical shard selection when allocating a small number of tablets in the same topology. When a tablet allocator is asked to allocate N tablets (where N is smaller than the number of shards on a node), it selects the first N lowest shards. If multiple such tables are created, each allocator run picks the same shards, leading to tablet imbalance across shards. This change initializes the load sketch with the current shard load, scaled into the [0,1] range, ensuring allocation still remains even while starting from globally least-loaded shards. Fixes https://github.com/scylladb/scylladb/issues/27620 Closes scylladb/scylladb#27802	2026-01-07 11:49:01 +01:00
Ferenc Szili	6d3c720a08	test, load balancing: add test for table balance This change adds a boost test which validates the resulting table balance of size-based load balancing. The threshold was set to a conservative 1.5 overcommit to avoid flakiness.	2025-12-27 11:39:08 +01:00
Ferenc Szili	10eb364821	load_balancer: implement size-based load balancing This change introduces tablet-size-based load balancing. It is an extension of capacity-based balancing with the addition of actual tablet sizes. It computes the difference between the most and least loaded nodes in the DC and stops further balancing if this difference is below the config option size_based_balance_threshold_percentage. This config option does not apply to the absolute load, but instead to the percentage by which the most loaded node is more loaded than the least loaded node: delta = (most_loaded - least_loaded) / most_loaded If this delta is smaller than the config threshold, the balancer will consider the nodes balanced.	2025-12-27 11:20:20 +01:00
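As a worked illustration of the delta formula in the commit above, here is a minimal sketch. The function names `relative_delta` and `is_balanced_sketch` are hypothetical and are not the load balancer's real `is_balanced()`; the sketch also assumes the threshold is expressed as a percentage and that an idle pair of nodes (zero load) counts as balanced.

```cpp
// Hypothetical helpers for illustration; not ScyllaDB's is_balanced().

// delta = (most_loaded - least_loaded) / most_loaded
double relative_delta(double most_loaded, double least_loaded) {
    if (most_loaded <= 0.0) {
        return 0.0; // assumption: nothing is loaded, so treat as balanced
    }
    return (most_loaded - least_loaded) / most_loaded;
}

// True when the relative difference, as a percentage, is below the threshold.
bool is_balanced_sketch(double most_loaded, double least_loaded,
                        double threshold_percent) {
    return relative_delta(most_loaded, least_loaded) * 100.0 < threshold_percent;
}
```

Using the figures from the intranode threshold test listed earlier in this log, loads of 257 and 255 give a delta of 2/257, about 0.78%, which is under a 1% threshold, while loads of 307 and 205 give 102/307, about 33%, which is well over it.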
Ferenc Szili	621cb19045	load_sketch: use tablet sizes in load computation This commit changes load_sketch so that it computes node and shard load based on tablet sizes instead of tablet count.	2025-12-27 10:37:23 +01:00
Aleksandra Martyniuk	9d20f0a3d2	test: add est_rack_list_conversion_with_two_replicas_in_rack	2025-12-16 13:31:24 +01:00
Aleksandra Martyniuk	0476e8d272	test: test creating tablet_rack_list_colocation_plan	2025-12-16 13:31:24 +01:00
Aleksandra Martyniuk	b3a0e4c2dc	test: check paused rf change requests persistence	2025-12-16 13:25:38 +01:00
Aleksandra Martyniuk	d66a36058b	service: pass topology and system_keyspace to load_balancer ctor Pass a pointer to service::topology and db::system_keyspace to load balancer. It will be used in the following patches to create rack_list_colocation plan.	2025-12-16 13:25:38 +01:00
Michael Litvak	97b7c03709	tablet: scheduler: Do not emit conflicting migration in merge colocation The tablet scheduler should not emit conflicting migrations for the same tablet. This was addressed initially in scylladb/scylladb#26038 but the check is missing in the merge colocation plan, so add it there as well. Without this check, the merge colocation plan could generate a conflicting migration for a tablet that is already scheduled for migration, as the test demonstrates. This can cause correctness problems, because if the load balancer generates two migrations for a single tablet, both will be written as mutations, and the resulting mutation could contain mixed cells from both migrations. Fixes scylladb/scylladb#27304 Closes scylladb/scylladb#27312	2025-11-28 11:17:12 +01:00
Aleksandra Martyniuk	76174d1f7a	cql3: reject ALTER KEYSPACE if rf of datacenter with tablets is omitted In ALTER KEYSPACE, when a datacenter name is omitted, its replication factor is implicitly set to zero with vnodes, while with tablets, it remains unchanged. ALTER KEYSPACE should behave the same way for tablets as it does for vnodes. However, this can be dangerous as we may mistakenly drop the whole datacenter. Reject ALTER KEYSPACE if it changes replication factor, but omits a datacenter that currently contains tablet replicas. Fixes: https://github.com/scylladb/scylladb/issues/25549. Closes scylladb/scylladb#25731	2025-11-24 06:36:51 +02:00
Ferenc Szili	fcbc239413	load_stats: add test for migrate_tablet_size() This change adds tests which validate the functionality of load_stats::migrate_tablet_size()	2025-11-11 14:28:31 +01:00
Tomasz Grabiec	f8879d797d	tablet_allocator: Avoid load balancer failure when replacing the last node in a rack Introduced in `9ebdeb2` The problem is specific to node replacing and rack-list RF. The culprit is in the part of the load balancer which determines rack's shard count. If we're replacing the last node, the rack will contain no normal nodes, and shards_per_rack will have no entry for the rack, on which the table still has replicas. This throws std::out_of_range and fails the tablet draining stage, and node replace is failed. No backport because the problem exists only on master. Fixes #26768 Closes scylladb/scylladb#26783	2025-11-05 15:49:51 +03:00
Tomasz Grabiec	1c0d847281	Merge 'load_balancer: load_stats reconcile after tablet migration and table resize' from Ferenc Szili This change adds the ability to move tablets sizes in load_stats after a tablet migration or table resize (split/merge). This is needed because the size based load balancer needs to have tablet size data which is as accurate as possible, in order to work on fresh tablet size distribution and issue correct tablet migrations. This is the second part of the size based load balancing changes: - First part for tablet size collection via load_stats: #26035 - Second part reconcile load_stats: #26152 - The third part for load_sketch changes: #26153 - The fourth part which performs tablet load balancing based on tablet size: #26254 This is a new feature and backport is not needed. Closes scylladb/scylladb#26152 * github.com:scylladb/scylladb: load_balancer: load_stats reconcile after tablet migration and table resize load_stats: change data structure which contains tablet sizes	2025-10-31 09:58:25 +01:00
Tomasz Grabiec	28f6bdc99b	cql3: ks_prop_defs: Expand numeric RF to rack list Auto-expands numeric RF in CREATE/ALTER KEYSPACE statements for new DCs specified in the statement. Doesn't auto-expand existing options, as the rack choice may not be in line with current replica placement. This requires co-locating tablet replicas, and tracking of co-location state, which is not implemented yet. Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>	2025-10-29 23:32:59 +01:00
Tomasz Grabiec	19d0beff38	test: tablets: Adjust to rack list test_decommission_rack_load_failure expects some tablets to land in the rack which only has the decommissioning node. Since the table uses RF=1, auto-expansion may choose the other rack and put all tablets there, and the expected failure will not happen. Force placement by using rack-list RF.	2025-10-29 23:32:57 +01:00
Tomasz Grabiec	0f38f7185c	test: tablets_test: Convert test_per_shard_goal_mixed_dc_rf to be rack-valid	2025-10-29 23:32:57 +01:00
Ferenc Szili	10f07fb95a	load_balancer: load_stats reconcile after tablet migration and table resize This change adds the ability to move tablets sizes in load_stats after a tablet migration or table resize (split/merge). This is needed because the size based load balancer needs to have tablet size data which is as accurate as possible, in order to issue migrations which improve load balance.	2025-10-28 12:12:09 +01:00
Gleb Natapov	c255740989	schema: Allow configuring consistency setting for a keyspace We want to add strongly consistent tables as an option. We will have two kinds of strongly consistent tables: globally consistent and locally consistent. The former means that requests from all DCs will be globally linearisable, while the latter means that only requests to the same DC will be linearisable. To allow configuring all the possibilities, the patch adds a new parameter to a keyspace definition, "consistency", that can be configured to be `eventual`, `global` or `local`. Non-eventual settings are supported for tablets-enabled keyspaces only. Since we want to start with implementing local consistency, configuring global consistency will result in an error for now.	2025-10-16 13:34:49 +03:00
Tomasz Grabiec	9ebdeb261f	tablets: load_balancer: Recognize that tablets are confined to racks when computing desired tablet count The old logic assumes that replicas are spread across whole DC when determining how many tablets we need to have at least 10 tablets per shard. If replicas are actually confined to a subset of racks, that will come up with a too high count and overshoot actual per-shard count in this rack. Similar problem happens for scaling-down of tablet count, when we try to keep per-shard tablet count below the goal. It should be tracked per-rack rather than per-DC, since racks can differ in how loaded they are by RF if it's a rack-list.	2025-10-02 19:45:00 +02:00
Tomasz Grabiec	85ddb832b4	test: tablets: Add test for replica allocation on rack list changes	2025-10-02 19:45:00 +02:00
Tomasz Grabiec	726548b835	locator: Abstract obtaining the number of replicas from replication_strategy_config_option It will become more complex when options will contain rack lists. It's a good change regardless, as it reduces duplication and makes parsing uniform. We already diverged to use stoi / stol / stoul. The change in create_keyspace_statement.cc to add a catch clause is needed because get_replication_factor() now throws configuration_exception on parsing errors instead of std::invalid_argument, so the existing catch clause in the outer scope is not effective. That loop is trying to interpret all options as RF to run some validations. Not all options are RF, and those are supposed to be ignored.	2025-10-01 16:06:52 +02:00
Tomasz Grabiec	91e51a5dd1	cql3, locator: Use type aliases for option maps In preparation for changing their structure. 1) std::map<sstring, sstring> -> replication_strategy_config_options Parsed options. Values will become std::variant<sstring, rack_list> 2) std::map<sstring, sstring> -> property_definitions::map_type Flattened map of options, as stored system tables.	2025-10-01 16:06:51 +02:00
Benny Halevy	da6e2fdb1b	locator: Pass topology to replication strategy constructor	2025-10-01 16:06:28 +02:00
Benny Halevy	aaddff5211	tablets: tablet_map_to_mutations: accept process_func Prepare for generating several mutations for the tablet_map by calling process_func for each generated mutation. This allows the caller to directly freeze those mutations one at a time into a vector of frozen mutations or similarly convert them into canonical mutations. Next patch will split large tablet mutations to prevent stalls. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-09-30 17:15:38 +03:00
Petr Gusev	8adbb6c4dd	tablets: disallow chains of colocated tables	2025-09-26 16:52:43 +02:00
Ferenc Szili	c6c9c316a7	load_balancer: fix std::out_of_bounds when decommissioning with empty nodes Consider the following: The tablet load balancer is working on: - node1: an empty node (no tablets) with a large disk capacity - node2: an empty node (no tablets) with a lower disk capacity than node1 - node3: is being decommissioned and contains tablet replicas In load_balancer::make_internode_plan() the initial destination node/shard is selected like this: // Pick best target shard. auto dst = global_shard_id {target, _load_sketch->get_least_loaded_shard(target)}; load_sketch::get_least_loaded_shard(host_id) calls ensure_node() which adds the host to load_sketch's internal hash maps in case the node was not yet seen by load_sketch. Let's assume dst is a shard on node1. Later in load_balancer::make_internode_plan() we will call pick_candidate() to try to find a better destination node than the initial one: // May choose a different source shard than src.shard or different destination host/shard than dst. auto candidate = co_await pick_candidate(nodes, src_node_info, target_info, src, dst, nodes_by_load_dst, drain_skipped); auto source_tablets = candidate.tablets; src = candidate.src; dst = candidate.dst; If pick_candidate() selects some other empty destination (due to larger capacity: node1) node, and that node has not yet been seen by load_sketch (because it was empty), a subsequent call to load_sketch::pick() will search for the node using std::unordered_map::at(), and because the node is not found it will throw a std::out_of_range exception, crashing the load balancer. This problem is fixed by changing load_sketch::populate() to initialize its internal maps with all the nodes which populate()'s arguments filter for. Fixes: #26203 Closes scylladb/scylladb#26207	2025-09-24 15:27:19 +02:00
Tomasz Grabiec	981592bca5	tablet: scheduler: Do not emit conflicting migrations in the plan Plan-making is invoked independently for different DCs (and in the future, racks) and then plans are merged. It could be that the same tablets are selected for migration in different DCs. Only one migration will prevail and be committed to group0, so it's not a correctness problem. Next cycle will recognize that the tablet is in transition and will not be selected by plan-maker. But it makes plan-making less efficient. It may also surprise consumers of the plan, like we saw in #25912. So we should make plan-maker be aware of already scheduled transitions and not consider those tablets as candidates. Fixes #26038 Closes scylladb/scylladb#26048	2025-09-23 22:40:08 +03:00


209 Commits