scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-22 15:52:13 +00:00

Author	SHA1	Message	Date
Pavel Emelyanov	17384d42e3	tablets: Implement tablet-aware cluster-wide restore This patch adds - Changes in sstables_loader::restore_tablets() method It populates the system_distributed_keyspace.snapshot_sstables table with the information read from the manifest - Implementation of tablet_restore_task_impl::run() method It emplaces a bunch of tablet migrations with "restore" kind - Topology coordinator handling of tablet_transition_stage::restore When seen, the coordinator calls RESTORE_TABLET RPC against all tablet replicas Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-05-12 10:40:23 +03:00
Pavel Emelyanov	cf21471391	tablets: Add restore_config to tablet_transition_info When doing cluster-wide restore using topology coordinator, the coordinator will need to serve a bunch of new tablet transition kinds -- the restore one. For that, it will need to receive information about from where to perform the restore -- the endpoint and bucket pair. This data can be grabbed from nowhere but the tablet transition itself, so add the "restore_config" member with this data. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-05-12 10:40:23 +03:00
Aleksandra Martyniuk	bcdab2e012	service: extend tablet_migration_info to handle rebuilds Make tablet_migration_info::{src,dst} optional, so that it can be reused by rebuild, for respectively leaving and pending replica.	2026-04-17 09:58:07 +02:00
Tomasz Grabiec	7af9f5366d	tablets, database: Advertise 'arbitrary' layout in snapshot manifest Currently, the manifest advertises "powof2", which is wrong for arbitrary count and boundaries. Introduce a new kind of layout called "arbitrary", and produce it if the tablet map doesn't conform to "powof2" layout. We should also produce tablet boundaries in this case, but that's worked on in a different PR: https://github.com/scylladb/scylladb/pull/28525	2026-04-15 10:40:56 +02:00
Tomasz Grabiec	b6a7023f68	tablets: Prepare for non-power-of-two tablet count This is a step towards more flexibility in managing tablets. A prerequisite before we can split individual tablets, isolating hot partitions, and evening-out tablet sizes by shifting boundaries. After this patch, the system can handle tables with arbitrary tablet count. Tablet allocator is still rounding up desired tablet count to the nearest power of two when allocating tablets for a new table, so unless the tablet map is allocated in some other way, the counts will be still a power of two. We plan to utilize arbitrary count when migrating from vnodes to tablets, by creating a tablet map which matches vnode boundaries. One of the reasons we don't give up on power-of-two by default yet is that it creates an issue with merges. If tablet count is odd, one of the tablets doesn't have a sibling and will not be merged. That can obviously cause imbalance of token space and tablet sizes between tablets. To limit the impact, this patch dynamically chooses which tablet to isolate when initiating a merge. The largest tablet is chosen, as that will minimize imbalance. Otherwise, if we always chose the last tablet to isolate, its size would remain the same while other tablets double in size with each odd-count merge, leading to imbalance. The imbalance will still be there, but the difference in tablet sizes is limited to 2x. Example (3 tablets): [0] owns 1/3 of tokens [1] owns 1/3 of tokens [2] owns 1/3 of tokens After merge: [0] owns 2/3 of tokens [1] owns 1/3 of tokens What we would like instead: Step 1 (split [1]): [0] owns 1/3 of tokens [1] old 1.left, owns 1/6 of tokens [2] old 1.right, owns 1/6 of tokens [3] owns 1/3 of tokens Step 2 (merge): [0] owns 1/2 of tokens [1] owns 1/2 of tokens To do that, we need to be able to split individual tablets, but we're not there yet.	2026-04-15 10:40:55 +02:00
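The merge planning described in b6a7023f68 can be illustrated with a minimal sketch. This is not the ScyllaDB implementation: `plan_merge` is a hypothetical helper, tablet sizes are plain integers, and the sketch assumes that only a tablet at an even index can be isolated, so that the tablets on either side of it still pair up with an adjacent sibling.

```python
def plan_merge(sizes):
    """Group tablet indices for one merge step (illustrative only).

    sizes[i] is the size of tablet i, in token order. With an odd count,
    one tablet is left unmerged; following the commit message, the
    largest eligible tablet is the one isolated.
    """
    n = len(sizes)
    isolated = None
    if n % 2 == 1:
        # Assumption of this sketch: isolating an even index leaves an
        # even number of tablets on both sides, so neighbours can pair.
        isolated = max(range(0, n, 2), key=lambda i: sizes[i])
    groups = []
    i = 0
    while i < n:
        if i == isolated:
            groups.append((i,))         # stays as it is
            i += 1
        else:
            groups.append((i, i + 1))   # merged with its right neighbour
            i += 2
    return groups


print(plan_merge([10, 10, 50]))  # the large last tablet is isolated
print(plan_merge([50, 10, 10]))  # the large first tablet is isolated
```

Isolating the largest tablet keeps it from doubling again, which is how the commit message bounds the size difference between tablets.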
Tomasz Grabiec	66fc7967b8	tablets: Prepare resize_decision to hold data in decisions merge decision will carry a plan - which replica to isolate. So construction from a string will no longer do.	2026-04-15 10:40:55 +02:00
Tomasz Grabiec	01fb97ee78	locator: tablets: Support arbitrary tablet boundaries There are several reasons we want to do that. One is that it will give us more flexibility in distributing the load. We can subdivide tablets at any points, and achieve more evenly-sized tablets. In particular, we can isolate large partitions into separate tablets. Another reason is vnode-to-tablet migration. We could construct a tablet map which matches exactly the vnode boundaries, so migration can happen transparently from the CQL-coordinator's point of view. Implementation details: We store a vector of tokens which represent tablet boundaries in the tablet_id_map. tablet_id keeps its meaning, it's an index into vector of tablets. To avoid logarithmic lookup of tablet_id from the token, we introduce a lookup structure with power-of-two aligned buckets, and store the tablet_id of the tablet which owns the first token in the bucket. This way, lookup needs to consider tablet id range which overlaps with one bucket. If boundaries are more or less aligned, there are around 1-2 tablets overlapping with a bucket, and the lookup is still O(1). Amount of memory used increased, but not significantly relative to old size (because tablet_info is currently fat): For 131'072 tablets: Before: Size of tablet_metadata in memory: 57456 KiB After: Size of tablet_metadata in memory: 59504 KiB	2026-04-15 01:25:14 +02:00
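The bucketed lookup described in 01fb97ee78 can be sketched as follows. This is a simplified model, not the actual `tablet_id_map`: the class name `TabletIdMap`, the 8-bit unsigned token space, and the convention that each boundary is the last token owned by a tablet are all assumptions made for illustration.

```python
from bisect import bisect_left


class TabletIdMap:
    """Token -> tablet id lookup with power-of-two aligned buckets."""

    def __init__(self, last_tokens, token_bits=8, bucket_bits=2):
        # last_tokens[i] is the last token (inclusive) owned by tablet i.
        # The list is sorted and its final entry is the maximum token.
        self.last_tokens = list(last_tokens)
        self.shift = token_bits - bucket_bits
        # For every bucket, remember which tablet owns its first token.
        # The binary search happens once, at construction time.
        self.bucket_start = [
            bisect_left(self.last_tokens, bucket << self.shift)
            for bucket in range(1 << bucket_bits)
        ]

    def tablet_of(self, token):
        # Jump to the first tablet overlapping the bucket, then walk
        # forward. With roughly aligned boundaries this is a step or two.
        tid = self.bucket_start[token >> self.shift]
        while self.last_tokens[tid] < token:
            tid += 1
        return tid


m = TabletIdMap([9, 100, 130, 255])
print(m.tablet_of(10), m.tablet_of(131))
```

The design trades a small per-bucket array for avoiding a logarithmic search on every lookup; the walk is short only while each bucket overlaps few tablets.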
Tomasz Grabiec	82acdae74b	locator: tablets: Introduce tablet_map::get_split_token() And reimplement existing split-related methods around it. This way we avoid calling dht::compaction_group_of(), and assuming anything about tablet boundaries or tablet count being a power of two. This will make later refactoring easier.	2026-04-15 01:24:48 +02:00
Avi Kivity	0ae22a09d4	LICENSE: Update to version 1.1 Updated terms of non-commercial use (must be a never-customer).	2026-04-12 19:46:33 +03:00
Łukasz Paszkowski	3f70611504	locator/tablets: add tablet_map::get_tablet_range_side() Add `tablet_map::get_tablet_range_side(token)` to compute the post-split range side without computing the tablet id. Pure addition, no behavior change.	2026-03-09 17:59:36 +01:00
Nadav Har'El	9ab3d5b946	locator: fix get_secondary_replica() to match get_primary_replica() The function tablet_map::get_secondary_replica() is used by Alternator TTL to choose a node different from get_primary_replica(). Unfortunately, recently (commits `817fdad` and d88037d) the implementation of the latter function changed, without changing the former. So this patch changes the former to match. The next two patches will have two tests that fail before this patch, and pass with it: 1. A unit test that checks that get_secondary_replica() returns a different node than get_primary_replica(). 2. An Alternator TTL test that checks that when a node is down, expirations still happen because the secondary replica takes over the primary replica's work. Fixes SCYLLADB-777 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-02-23 16:19:30 +02:00
Tomasz Grabiec	df949dc506	Merge 'topology_coordinator: make cleanup reliable on barrier failures' from Łukasz Paszkowski Fix a subtle but damaging failure mode in the tablet migration state machine: when a barrier fails, the follow-up barrier is triggered asynchronously, and cleanup can get skipped for that iteration. On the next loop, the original failure may no longer be visible (because the failing node got excluded), so the tablet can incorrectly move forward instead of entering `cleanup_target`. To make cleanup reliable this PR: Adds an additional “fallback cleanup” stage - `write_both_read_old_fallback_cleanup` that does not modify read/write selectors. This stage is safe to enter immediately after a barrier failure, and it funnels the tablet into cleanup with the required barriers. Avoids changing both read and write selectors in a single step transitioning from `write_both_read_new` to `cleanup_target`. The fallback path updates selectors in a safe order: read first, then write. Allows a direct no-barrier transition from `allow_write_both_read_old` to `cleanup_target` after failure, because in that specific case `cleanup_target` doesn’t change selectors and the hop is safe. No need for backport. It's an improvement. Currently, tablets transition to `cleanup_target` eventually via failed streaming. Closes scylladb/scylladb#28169 * github.com:scylladb/scylladb: topology_coordinator: add write_both_read_old_fallback_cleanup state topology_coordinator: allow cleanup_target transition from streaming/rebuild_repair without barrier topology_coordinator: allow cleanup_target transition without barrier after failure in write_both_read_old topology_coordinator: allow cleanup_target transition without barrier after failure in allow_write_both_read_old	2026-01-28 13:33:39 +01:00
Łukasz Paszkowski	f06094aa95	topology_coordinator: add write_both_read_old_fallback_cleanup state Yet another barrier-failure scenario exists in the `write_both_read_new` state. When the barrier fails, the tablet is expected to transition to `cleanup_target`, but because barrier execution is asynchronous, the cleanup transition can be skipped entirely and the tablet may continue forward instead. Both `write_both_read_new` and `cleanup_target` modify read and write selectors. In this situation, a barrier is required, and transitioning directly between these states without one is unsafe. Introduce an intermediate `write_both_read_old_fallback_cleanup` state that modifies only a read selector and can be entered without a barrier (there is no need to wait for all nodes to start using the "new" read selector). From there, the tablet can proceed to `cleanup_target`, where the required barriers are enforced. This also avoids changing both selectors in a single step. A direct transition from `write_both_read_new` to `cleanup_target` updates both selectors at once, which can leave coordinators using the old selector for writes and the new selector for reads, causing reads to miss preceding writes. By routing through the fallback state, selectors are updated in order—read first, then write—preserving read-after-write correctness.	2026-01-26 13:14:37 +01:00
Piotr Dulikowski	fe9237fdc9	Merge 'alternator: don't require rf_rack flag for indexes, validate instead' from Michael Litvak In `8df61f6d99` we changed the requirements for creating materialized views and MV-based indexes - instead of requiring the rf_rack_valid_keyspaces flag to be set, we now require the keyspace to be RF-rack-valid at the time of creation, and it is enforced to remain RF-rack-valid while the MV exists. This validation is done in the cql create view/index statements. The same should be done also for alternator - when creating a table with GSI or LSI, or when adding a GSI to an existing table, previously we required the flag rf_rack_valid_keyspaces to be set. Now we change it to instead check if the keyspace is RF-rack-valid, and if not the operation fails with an appropriate error. Fixes https://github.com/scylladb/scylladb/issues/28214 backport to 2025.4 to add RF-rack-valid enforcements in alternator Closes scylladb/scylladb#28154 * github.com:scylladb/scylladb: locator: document the exception type of assert_rf_rack_valid_keyspace alternator: don't require rf_rack flag for indexes, validate instead	2026-01-23 11:49:02 +01:00
Patryk Jędrzejczak	4e984139b2	Merge 'strongly consistent tables: basic implementation' from Petr Gusev In this PR we add a basic implementation of the strongly-consistent tables: * generate raft group id when a strongly-consistent table is created * persist it into system.tables table * start raft groups on replicas when a strongly-consistent tablet_map reaches them * add strongly-consistent version of the storage_proxy, with the `query` and `mutate` methods * the `mutate` method submits a command to the tablets raft group, the query method reads the data with `raft.read_barrier()` * strongly-consistent versions of the `select_statement` and `modification_statement` are added * a basic `test_strong_consistency.py/test_basic_write_read` is added to check that we can write and read data in a strongly consistent fashion. Limitations: * for now the strongly consistent tables can have tablets only on shard zero. This is because we (ab/re) use the existing raft system tables which live only on shard0. In the next PRs we'll create separate tables for the new tablets raft groups. * No Scylla-side proxying - the test has to figure out who is the leader and submit the command to the right node. This will be fixed separately. * No tablet balancing -- migration/split/merges require separate complicated code. The new behavior is hidden behind `STRONGLY_CONSISTENT_TABLES` feature, which is enabled when the `STRONGLY_CONSISTENT_TABLES` experimental feature flag is set. Requirements, specs and general overview of the feature can be found [here](https://scylladb.atlassian.net/wiki/spaces/RND/pages/91422722/Strong+Consistency). Short term implementation plan is [here](https://docs.google.com/document/d/1afKeeHaCkKxER7IThHkaAQlh2JWpbqhFLIQ3CzmiXhI/edit?tab=t.0#heading=h.thkorgfek290) One can check the strongly consistent writes and reads locally via cqlsh: scylla.yaml: ``` experimental_features: - strongly-consistent-tables ``` cqlsh: ``` CREATE KEYSPACE IF NOT EXISTS my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1} AND tablets = {'initial': 1} AND consistency = 'local'; CREATE TABLE my_ks.test (pk int PRIMARY KEY, c int); INSERT INTO my_ks.test (pk, c) VALUES (10, 20); SELECT * FROM my_ks.test WHERE pk = 10; ``` Fixes SCYLLADB-34 Fixes SCYLLADB-32 Fixes SCYLLADB-31 Fixes SCYLLADB-33 Fixes SCYLLADB-56 backport: no need Closes scylladb/scylladb#27614 * https://github.com/scylladb/scylladb: test_encryption: capture stderr test/cluster: add test_strong_consistency.py raft_group_registry: disable metrics for non-0 groups strong consistency: implement select_statement::do_execute() cql: add select_statement.cc strong consistency: implement coordinator::query() cql: add modification_statement cql: add statement_helpers strong consistency: implement coordinator::mutate() raft.hh: make server::wait_for_leader() public strong_consistency: add coordinator modification_statement: make get_timeout public strong_consistency: add groups_manager strong_consistency: add state_machine and raft_command table: add get_max_timestamp_for_tablet tablets: generate raft group_id-s for new table tablet_replication_strategy: add consistency field tablets: add raft_group_id modification_statement: remove virtual where it's not needed modification_statement: inline prepare_statement() system_keyspace: disable tablet_balancing for strongly_consistent_tables cql: rename strongly_consistent statements to broadcast statements	2026-01-23 09:52:33 +01:00
Michael Litvak	d5009882c6	locator: document the exception type of assert_rf_rack_valid_keyspace The function assert_rf_rack_valid_keyspace uses the exception type std::invalid_argument when the RF-rack validation fails. Document it and change all callers to catch this specific exception type when checking for RF-rack validation failures, so that other exception types can be propagated properly.	2026-01-22 16:11:35 +01:00
Petr Gusev	53f93eb830	tablets: add raft_group_id Add a `raft_group_id` column to `system.tablets` and to the `tablet_map` class. The column is populated only when the `strongly_consistent_tables` feature is enabled. This feature is currently disabled by default and is enabled only when the user sets the `STRONGLY_CONSISTENT_TABLES` experimental flag. The `raft_group_id` column is added to `system.tablets` only when this flag is set. This allows the schema to evolve freely while the feature is experimental, without requiring complex migrations.	2026-01-21 14:56:00 +01:00
Raphael S. Carvalho	d16f9c821d	Revert "api: storage_service/tablets/repair: disable incremental repair by default" This reverts commit `c8cff94a5a`. Re-enabling incremental repair on master with "Aborting on shard 0 during scaleout + repair #26041" and "Failure to attach sstables in streaming consumer leaves sealed sstables on disk #27414" fixed. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#28120	2026-01-21 08:50:13 +02:00
Botond Dénes	04b8f72946	Merge 'repair: Implement auto repair for tablet repair' from Asias He repair: Implement auto repair for tablet repair This patch implements the basic auto repair support for tablet repair. It was decided to add no per table configuration for the initial implementation, so two scylla yaml config options are introduced to set the default auto repair configs for all the tablet tables. - auto_repair_enabled_default Set true to enable auto repair for tablet tables by default. The value will be overridden by the per keyspace or per table configuration which is not implemented yet. - auto_repair_threshold_default_in_seconds Set the default time in seconds for the auto repair threshold for tablet tables. If the time since last repair is bigger than the configured time, the tablet is eligible for auto repair. The value will be overridden by the per keyspace or per table configuration which is not implemented yet. The following metrics are added: - auto_repair_needs_repair_nr The number of tablets with auto repair enabled that need repair - auto_repair_enabled_nr The number of tablets with auto repair enabled The metrics are useful to tell if auto repair is falling behind. In the future, more auto repair scheduling will be added, e.g., scheduling based on the repaired and unrepaired sstable set size, tombstone ratio and so on, in addition to the time based scheduling. Fixes SCYLLADB-99 New feature. No backport. Closes scylladb/scylladb#27534 * github.com:scylladb/scylladb: topology_coordinator: Add metrics for tablet repair repair: Implement auto repair for tablet repair	2026-01-12 14:16:01 +02:00
Asias He	7ba7b25bdd	repair: Implement auto repair for tablet repair This patch implements the basic auto repair support for tablet repair. It was decided to add no per table configuration for the initial implementation, so two scylla yaml config options are introduced to set the default auto repair configs for all the tablet tables. - auto_repair_enabled_default Set true to enable auto repair for tablet tables by default. The value will be overridden by the per keyspace or per table configuration which is not implemented yet. - auto_repair_threshold_default_in_seconds Set the default time in seconds for the auto repair threshold for tablet tables. If the time since last repair is bigger than the configured time, the tablet is eligible for auto repair. The value will be overridden by the per keyspace or per table configuration which is not implemented yet. The following metrics are added: - auto_repair_needs_repair_nr The number of tablets with auto repair enabled that need repair - auto_repair_enabled_nr The number of tablets with auto repair enabled The metrics are useful to tell if auto repair is falling behind. In the future, more auto repair scheduling will be added, e.g., scheduling based on the repaired and unrepaired sstable set size, tombstone ratio and so on, in addition to the time based scheduling. Fixes SCYLLADB-99	2026-01-09 16:11:39 +08:00
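The time-based eligibility rule from 7ba7b25bdd can be sketched in a few lines. The function `auto_repair_metrics` and its arguments are hypothetical; only the two option names and the two metric names come from the commit message, and how ScyllaDB actually computes the metrics may differ.

```python
def auto_repair_metrics(last_repair_times, now, enabled, threshold_seconds):
    """Count tablets covered by auto repair and tablets that are due.

    last_repair_times: one timestamp (seconds) per tablet.
    enabled: stands in for auto_repair_enabled_default.
    threshold_seconds: stands in for
        auto_repair_threshold_default_in_seconds.
    """
    if not enabled:
        return {"auto_repair_enabled_nr": 0, "auto_repair_needs_repair_nr": 0}
    # A tablet is eligible once the time since its last repair is
    # strictly bigger than the configured threshold.
    due = sum(1 for t in last_repair_times if now - t > threshold_seconds)
    return {
        "auto_repair_enabled_nr": len(last_repair_times),
        "auto_repair_needs_repair_nr": due,
    }


stats = auto_repair_metrics([0, 500, 950], now=1000, enabled=True,
                            threshold_seconds=300)
print(stats)
```

A persistently high ratio of the second metric to the first is the "falling behind" signal the commit message refers to.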
Botond Dénes	60570d7114	Merge 'topology coordinator: restrict node join/remove to preserve RF-rack validity' from Michael Litvak Allow creating materialized views and secondary indexes in a tablets keyspace only if it's RF-rack-valid, and enforce RF-rack-validity while the keyspace has views by restricting some operations: * Altering a keyspace's RF if it would make the keyspace RF-rack-invalid * Adding a node in a new rack * Removing / Decommissioning the last node in a rack Previously the config option `rf_rack_valid_keyspaces` was required for creating views. We now remove this restriction - it's not needed because we always maintain RF-rack-validity for keyspaces with views. The restrictions are relevant only for keyspaces with numerical RF. Keyspaces with rack-list-based RF are always RF-rack-valid. Fixes scylladb/scylladb#23345 Fixes https://github.com/scylladb/scylladb/issues/26820 backport to relevant versions for materialized views with tablets since it depends on rf-rack validity Closes scylladb/scylladb#26354 * github.com:scylladb/scylladb: docs: update RF-rack restrictions cql3: don't apply RF-rack restrictions on vector indexes cql3: add warning when creating mv/index with tablets about rf-rack service/tablet_allocator: always allow tablet merge of tables with views locator: extend rf-rack validation for rack lists test: test rf-rack validity when creating keyspace during node ops locator: fix rf-rack validation during node join/remove test: test topology restrictions for views with tablets test: add test_topology_ops_with_rf_rack_valid topology coordinator: restrict node join/remove to preserve RF-rack validity topology coordinator: add validation to node remove locator: extend rf-rack validation functions view: change validate_view_keyspace to allow MVs if RF=Racks db: enforce rf-rack-validity for keyspaces with views replica/db: add enforce_rf_rack_validity_for_keyspace helper db: remove enforce parameter from check_rf_rack_validity test: adjust test to not break rf-rack validity	2026-01-09 10:01:23 +02:00
Ferenc Szili	1c9ec9a76d	load_stats: add get_tablet_size_in_transition() This patch adds a method to load_stats which searches for the tablet size during tablet transition. In case of tablet migration, the tablet will be searched on the leaving replica, and during rebuild we will return the average tablet size of the pending replicas.	2025-12-27 10:37:23 +01:00
Michael Litvak	9e1f78d162	locator: extend rf-rack validation functions Extend the locator function assert_rf_rack_valid_keyspace to accept arbitrary topology dc-rack maps and nodes instead of using the current token metadata. This allows us to add a new variant of the function that checks rf-rack validity given a topology change that we want to apply. We will use it to check that rf-rack validity will be maintained before applying the topology change. The possible topology changes for the check are node add and node remove / decommission. These operations can change the number of normal racks - if a new node is added to a new rack, or the last node is removed from a rack.	2025-12-22 09:14:29 +01:00
Benny Halevy	c8cff94a5a	api: storage_service/tablets/repair: disable incremental repair by default Change the default incremental_mode to `disabled` due to https://github.com/scylladb/scylladb/issues/26041 and https://github.com/scylladb/scylladb/issues/27414 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-11 14:25:21 +02:00
Asias He	d51b1fea94	tablets: Allow tablet merge when repair tasks exist Currently we do not allow tablet merge if either of the tablets contain a tablet repair request. This could block the tablet merge for a very long time if the repair requests could not be scheduled and executed. We can actually merge the repair tasks in most of the cases. This is because most of the time all tablets are requested to be repaired by a single API request, so they share the same task_id, request_type and other parameters. We can merge the repair task info and execute the repair after the merge. If they do not share the task info, we could not merge and have to wait for the repair before merge, which is both rare and ok. Another case is that one of the tablets has a repair task info (t1) while the other tablet (t2) does not have one; it is possible that t2 has finished repair by the same repair request or t2 is not requested to be repaired at all. We allow merge in this case too to avoid blocking the tablet merge, with the price of repairing a bit more. Fixes #26844 Closes scylladb/scylladb#26922	2025-11-20 16:01:23 +01:00
Pavel Emelyanov	f47f2db710	Merge 'Support local primary-replica-only for native restore' from Robert Bindar This PR extends the restore API so that it accepts primary_replica_only as parameter and it combines the concepts of primary-replica-only with scoped streaming so that with: - `scope=all primary_replica_only=true` The restoring node will stream to the global primary replica only - `scope=dc primary_replica_only=true` The restoring node will stream to the local primary replica only. - `scope=rack primary_replica_only=true` The restoring node will stream only to the primary replica from within its own rack (with rf=#racks, the restoring node will stream only to itself) - `scope=node primary_replica_only=true` is not allowed, the restoring node will always stream only to itself so the primary_replica_only parameter wouldn't make sense. The PR also adjusts the `nodetool refresh` restriction on running restore with both primary_replica_only and scope, it adds primary_replica_only to `nodetool restore` and it adds cluster tests for primary replica within scope. Fixes #26584 Closes scylladb/scylladb#26609 * github.com:scylladb/scylladb: Add cluster tests for checking scoped primary_replica_only streaming Improve choice distribution for primary replica Refactor cluster/object_store/test_backup nodetool restore: add primary-replica-only option nodetool refresh: Enable scope={all,dc,rack} with primary_replica_only Enable scoped primary replica only streaming Support primary_replica_only for native restore API	2025-11-13 12:11:18 +03:00
Ferenc Szili	b77ea1b8e1	load_stats: fix problem with tablet size migration This patch fixes a bug with tablet size migration in load_stats. has_tablet_size() lambda in topology_coordinator::migrate_tablet_size() was returning false in all cases due to incorrect search iterator comparison after a table and tablet search. This change moves load_stats migrate_tablet_sizes() functionality into a separate method of load_stats.	2025-11-11 14:26:09 +01:00
Robert Bindar	817fdadd49	Improve choice distribution for primary replica I noticed during tests that `maybe_get_primary_replica` would not distribute uniformly the choice of primary replica because `info.replicas` on some shards would have an order whilst on others it'd be ordered differently, thus making the function choose a node as primary replica multiple times when it clearly could've chosen a different node. This patch sorts the replica set before passing it through the scope filter. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-11-11 09:18:01 +02:00
Tomasz Grabiec	1c0d847281	Merge 'load_balancer: load_stats reconcile after tablet migration and table resize' from Ferenc Szili This change adds the ability to move tablet sizes in load_stats after a tablet migration or table resize (split/merge). This is needed because the size based load balancer needs to have tablet size data which is as accurate as possible, in order to work on fresh tablet size distribution and issue correct tablet migrations. This is the second part of the size based load balancing changes: - First part for tablet size collection via load_stats: #26035 - Second part reconcile load_stats: #26152 - The third part for load_sketch changes: #26153 - The fourth part which performs tablet load balancing based on tablet size: #26254 This is a new feature and backport is not needed. Closes scylladb/scylladb#26152 * github.com:scylladb/scylladb: load_balancer: load_stats reconcile after tablet migration and table resize load_stats: change data structure which contains tablet sizes	2025-10-31 09:58:25 +01:00
Tomasz Grabiec	28f6bdc99b	cql3: ks_prop_defs: Expand numeric RF to rack list Auto-expands numeric RF in CREATE/ALTER KEYSPACE statements for new DCs specified in the statement. Doesn't auto-expand existing options, as the rack choice may not be in line with current replica placement. This requires co-locating tablet replicas, and tracking of co-location state, which is not implemented yet. Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>	2025-10-29 23:32:59 +01:00
Ferenc Szili	10f07fb95a	load_balancer: load_stats reconcile after tablet migration and table resize This change adds the ability to move tablet sizes in load_stats after a tablet migration or table resize (split/merge). This is needed because the size based load balancer needs to have tablet size data which is as accurate as possible, in order to issue migrations which improve load balance.	2025-10-28 12:12:09 +01:00
Aleksandra Martyniuk	910cd0918b	locator: use get_primary_replica for get_primary_endpoints Currently, tablet_sstable_streamer::get_primary_endpoints is out of sync with tablet_map::get_primary_replica. The get_primary_replica optimizes the choice of the replica so that the work is fairly distributed among nodes. Meanwhile, get_primary_endpoints always chooses the first replica. Use get_primary_replica for get_primary_endpoints. Fixes: https://github.com/scylladb/scylladb/issues/21883. Closes scylladb/scylladb#26385	2025-10-28 09:56:08 +02:00
Ferenc Szili	b4ca12b39a	load_stats: change data structure which contains tablet sizes This patch changes the tablet size map in load_stats. Previously, this data structure was: std::unordered_map<range_based_tablet_id, uint64_t> tablet_sizes; and is changed into: std::unordered_map<table_id, std::unordered_map<dht::token_range, uint64_t>> tablet_sizes; This allows for improved performance of tablet size reconciliation.	2025-10-24 14:37:00 +02:00
Asias He	13dd88b010	repair: Rename incremental mode name Using the name regular as the incremental mode could be confusing, since regular might be interpreted as the non-incremental repair. It is better to use incremental directly. Before: - regular (standard incremental repair) - full (full incremental repair) - disabled (incremental repair disabled) After: - incremental (standard incremental repair) - full (full incremental repair) - disabled (incremental repair disabled) Fixes #26503 Closes scylladb/scylladb#26504	2025-10-10 15:21:54 +03:00
Ferenc Szili	20aeed1607	load balancing: extend locator::load_stats to collect tablet sizes This commit extends the TABLE_LOAD_STATS RPC with data about the tablet replica sizes and effective disk capacity. Effective disk capacity of a node is computed as a sum of the sizes of all tablet replicas on a node and available disk space. This is the first change in the size based load balancing series. Closes scylladb/scylladb#26035	2025-10-03 13:37:22 +02:00
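The effective disk capacity defined in 20aeed1607 is a simple sum, shown here as a worked example. The function name `effective_capacity` and the byte values are illustrative assumptions, not names or numbers from ScyllaDB.

```python
def effective_capacity(tablet_replica_sizes, available_bytes):
    """Effective disk capacity of one node, per the commit message:
    the sizes of all tablet replicas on the node plus the available
    disk space."""
    return sum(tablet_replica_sizes) + available_bytes


GiB = 1024 ** 3
# Three tablet replicas (10, 20 and 30 GiB) and 40 GiB of free space.
capacity = effective_capacity([10 * GiB, 20 * GiB, 30 * GiB], 40 * GiB)
print(capacity // GiB)
```

Defined this way, the figure counts only space that tablets use or could use, so disk space taken by anything else on the node is left out of it.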
Avi Kivity	5237a20993	Merge 'replica: Fix split compaction when tablet boundaries change' from Raphael Raph Carvalho Consider the following: 1) balancer emits split decision 2) split compaction starts 3) split decision is revoked 4) emits merge decision 5) completes merge, before compaction in step 2 finishes After last step, split compaction initiated in step 2 can fail because it works with the global tablet map, rather than the map when the compaction started. With the global state changing under its feet, on merge, the mutation splitting writer will think it's going backwards since sibling tablets are merged. This problem was also seen when running load-and-stream, where split initiated by the sstable writer failed, split completed, and the unsplit sstable is left in the table dir, causing problems in the restart. To fix this, let's make split compaction always work with the state when it started, not a global state. Fixes #24153. All 2025.* versions are vulnerable, so fix must be backported to them. Closes scylladb/scylladb#25690 * github.com:scylladb/scylladb: replica: Fix split compaction when tablet boundaries change replica: Futurize split_compaction_options()	2025-09-09 17:05:32 +03:00
Asias He	cb7db47ae1	repair: Add incremental_mode option for tablet repair This patch introduces a new `incremental_mode` parameter to the tablet repair REST API, providing more fine-grained control over the incremental repair process. Previously, incremental repair was on and could not be turned off. This change allows users to select from three distinct modes: - `regular`: This is the default mode. It performs a standard incremental repair, processing only unrepaired sstables and skipping those that are already repaired. The repair state (`repaired_at`, `sstables_repaired_at`) is updated. - `full`: This mode forces the repair to process all sstables, including those that have been previously repaired. This is useful when a full data validation is needed without disabling the incremental repair feature. The repair state is updated. - `disabled`: This mode completely disables the incremental repair logic for the current repair operation. It behaves like a classic (pre-incremental) repair, and it does not update any incremental repair state (`repaired_at` in sstables or `sstables_repaired_at` in the system.tablets table). The implementation includes: - Adding the `incremental_mode` parameter to the `/storage_service/repair/tablet` API endpoint. - Updating the internal repair logic to handle the different modes. - Adding a new test case to verify the behavior of each mode. - Updating the API documentation and developer documentation. Fixes #25605 Closes scylladb/scylladb#25693	2025-09-09 06:50:21 +03:00
Raphael S. Carvalho	68f23d54d8	replica: Fix split compaction when tablet boundaries change Consider the following: 1) balancer emits split decision 2) split compaction starts 3) split decision is revoked 4) emits merge decision 5) completes merge, before compaction in step 2 finishes After last step, split compaction initiated in step 2 can fail because it works with the global tablet map, rather than the map when the compaction started. With the global state changing under its feet, on merge, the mutation splitting writer will think it's going backwards since sibling tablets are merged. This problem was also seen when running load-and-stream, where split initiated by the sstable writer failed, split completed, and the unsplit sstable is left in the table dir, causing problems in the restart. To fix this, let's make split compaction always work with the state when it started, not a global state. Fixes #24153. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-09-07 05:20:23 -03:00
Łukasz Paszkowski	54201960e6	storage_service: extend locator::load_stats to collect per-node critical disk utilization flag This commit extends the TABLE_LOAD_STATS RPC with information on whether a node operates in the critical disk utilization mode. This information will be needed to distinguish between the causes why a table migration/repair was interrupted.	2025-08-29 14:56:13 +02:00
Asias He	f9021777d8	compaction: Add tablet incremental repair support This patch adds incremental_repair support in compaction. - The sstables are split into repaired and unrepaired sets. - Repaired and unrepaired sets are compacted separately. - The repaired_at from sstable and sstables_repaired_at from system.tablets table are used to decide if an sstable is repaired or not. - Different compaction tasks, e.g., minor, major, scrub, split, are serialized with tablet repair.	2025-08-18 11:01:21 +08:00
Avi Kivity	8180cbcf48	Merge 'tablets: prevent accidental copy of tablets_map' from Benny Halevy As they are wasteful in many cases, it is better to move the tablet_map if possible, or clone it gently in an async fiber. Add clone() and clone_gently() methods to allow explicit copies. * minor optimization, no backport needed Closes scylladb/scylladb#24978 * github.com:scylladb/scylladb: tablets: prevent accidental copy of tablets_map locator: tablets: get rid of synchronous mutate_tablet_map	2025-07-27 16:48:27 +03:00
Lakshmi Narayanan Sreethar	0c5fa8e154	locator/token_metadata.cc: use chunked_vector to store _sorted_tokens The `token_metadata_impl` stores the sorted tokens in an `std::vector`. With a large number of nodes, the size of this vector can grow quickly, and updating it might lead to oversized allocations. This commit changes `_sorted_tokens` to a `chunked_vector` to avoid such issues. It also updates all related code to use `chunked_vector` instead of `std::vector`. Fixes #24876 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#25027	2025-07-27 11:29:22 +03:00
Aleksandra Martyniuk	1767eb9529	repair: remove unused code	2025-07-24 11:11:12 +02:00
Benny Halevy	fce6c4b41d	tablets: prevent accidental copy of tablets_map As they are wasteful in many cases, it is better to move the tablet_map if possible, or clone it gently in an async fiber. Add clone() and clone_gently() methods to allow explicit copies. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-07-22 15:07:26 +03:00
Benny Halevy	dee0d7ffbf	locator: tablets: get rid of synchronous mutate_tablet_map It is currently used only by tests that could very well do with mutate_tablet_map_async. This will simplify the following patch to prevent accidental copy of the tablet_map, providing explicit clone/clone_gently methods. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-07-22 15:03:02 +03:00
Michael Litvak	ddf02c9489	tablets: replace all_tables method The method all_tables in tablet_metadata is used for iterating over all tables in the tablet metadata with their tablet maps. Now that we have co-located tables we need to make the distinction on which tables we want to iterate over. In some cases we want to iterate over each group of co-located tables, treating them as one unit, and in other cases we want to iterate over all tables, regardless of whether they are part of a co-located group and have a base table. We replace all_tables with new methods that can be used for each of the cases.	2025-07-01 13:20:18 +03:00
Michael Litvak	ddfe5dfb6b	tablets: represent co-located tables in tablet metadata Modify tablet_metadata to be able to represent co-located tables. The new method set_colocated_table adds to tablet_metadata a table which is co-located with another table. A co-located table shares the tablet map object with the base table, so we just create a copy of the shared tablet map pointer and store it as the co-located table's tablet map. Whenever a tablet map is modified we update the pointer for all the co-located tables accordingly, so the tablet map remains shared. We add some data structures to tablet_metadata to be able to work with co-located table groups efficiently: * `_table_groups` maps every base table to all tables in its co-location group. This is convenient for iterating over all table groups, or finding all tables in some group. * `_base_table` maps a co-located table to its base table.	2025-07-01 10:29:59 +03:00
Michael Litvak	34f15ca871	tablets: deallocate storage state on end_migration When a tablet is migrated and cleaned up, deallocate the tablet storage group state on `end_migration` stage, instead of `cleanup` stage: * When the stage is updated from `cleanup` to `end_migration`, the storage group is removed on the leaving replica. * When the table is initialized, if the tablet stage is `end_migration` then we don't allocate a storage group for it. This happens for example if the leaving replica is restarted during tablet migration. If it's initialized in `cleanup` stage then we allocate a storage group, and it will be deallocated when transitioning to `end_migration`. This guarantees that the storage group is always deallocated on the leaving replica by `end_migration`, and that it is always allocated if the tablet wasn't cleaned up fully yet. It is a similar case also for the pending replica when the migration is aborted. We deallocate the state on `revert_migration` which is the stage following `cleanup_target`. Previously the storage group would be allocated when the tablet is initialized on any of the tablet replicas - also on the leaving replica, and when the tablet stage is `cleanup` or `end_migration`, and deallocated during `cleanup`. This fixes the following issue: 1. A migrating tablet enters cleanup stage 2. The tablet is cleaned up successfully 3. The leaving replica is restarted, and allocates storage group 4. Tablet cleanup is not called because it was already cleaned up 5. The storage group remains allocated on the leaving replica after the migration is completed - it's not cleaned up properly. Fixes scylladb/scylladb#23481	2025-06-09 16:58:38 +03:00
Piotr Szymaniak	de96c28625	alternator: Add support for TTL when using tablets Support for TTL-based data removal when using tablets. The essence of this commit is a separate code path for finding token ranges owned by the current shard for the cases when tablets are used and not vnodes. At the same time, the vnodes case is not touched, so as not to cause any regressions. The TTL-caused data removal is normally performed by the primary replica (both when using vnodes and tablets). For the tablets case, the already-existing method tablet_map::get_primary_replica(tablet_id) is used to know if a shard executing the TTL-related data removal is the primary replica for each tablet. A new method tablet_map::get_secondary_replica(tablet_id) has been added. It is needed by the data invalidation procedure to remove data when the primary replica node is down - the data is then removed by the secondary replica node. The mechanism is the same as in the vnodes case. Since alternator now supports TTL, the test `test_ttl_enable_error_with_tablets` has been removed. Also, tests in the test_ttl.py have been made to run twice, once with vnodes and once with tablets. When run with tablets, due to the lack of support for LWT with tablets (#18068), tests use 'system:write_isolation' of 'unsafe_rmw'. This approach allows early regression testing with tablets and is meant only as a tentative solution. Fixes scylladb/scylladb#16567 Closes scylladb/scylladb#23662	2025-06-05 17:39:29 +03:00
Tomasz Grabiec	2597a7e980	load_sketch: Tolerate missing tablet_map when selecting for a given table To simplify future usage in network_topology_strategy::add_tablets_in_dc() which invokes populate() for a given table, which may be both new and preexisting.	2025-04-17 16:01:16 +02:00


149 Commits