scylladb

Author	SHA1	Message	Date
Asias He	d51b1fea94	tablets: Allow tablet merge when repair tasks exist Currently we do not allow tablet merge if either of the tablets contain a tablet repair request. This could block the tablet merge for a very long time if the repair requests could not be scheduled and executed. We can actually merge the repair tasks in most of the cases. This is because most of the time all tablets are requested to be repaired by a single API request, so they share the same task_id, request_type and other parameters. We can merge the repair task info and executes the repair after the merge. If they do not share the task info, we could not merge and have to wait for the repair before merge, which is both rare and ok. Another case is that one of the tablet has a repair task info (t1) while the other tablet (t2) does not have, it is possible the t2 has finished repair by the same repair request or t2 is not requested to be repaired at all. We allow merge in this case too to avoid blocking the tablet merge, with the price of reparing a bit more. Fixes #26844 Closes scylladb/scylladb#26922	2025-11-20 16:01:23 +01:00
Pavel Emelyanov	f47f2db710	Merge 'Support local primary-replica-only for native restore' from Robert Bindar This PR extends the restore API so that it accepts primary_replica_only as parameter and it combines the concepts of primary-replica-only with scoped streaming so that with: - `scope=all primary_replica_only=true` The restoring node will stream to the global primary replica only - `scope=dc primary_replica_only=true` The restoring node will stream to the local primary replica only. - `scope=rack primary_replica_only=true` The restoring node will stream only to the primary replica from within its own rack (with rf=#racks, the restoring node will stream only to itself) - `scope=node primary_replica_only=true` is not allowed, the restoring node will always stream only to itself so the primary_replica_only parameter wouldn't make sense. The PR also adjusts the `nodetool refresh` restriction on running restore with both primary_replica_only and scope, it adds primary_replica_only to `nodetool restore` and it adds cluster tests for primary replica within scope. Fixes #26584 Closes scylladb/scylladb#26609 * github.com:scylladb/scylladb: Add cluster tests for checking scoped primary_replica_only streaming Improve choice distribution for primary replica Refactor cluster/object_store/test_backup nodetool restore: add primary-replica-only option nodetool refresh: Enable scope={all,dc,rack} with primary_replica_only Enable scoped primary replica only streaming Support primary_replica_only for native restore API	2025-11-13 12:11:18 +03:00
Ferenc Szili	b77ea1b8e1	load_stats: fix problem with tablet size migration This patch fixes a bug with tablet size migration in load_stats. has_tablet_size() lambda in topology_coordinator::migrate_tablet_size() was returning false in all cases due to incorrect search iterator comparison after a table and tablet saeach. This change moves load_stats migrate_tablet_sizes() functionaility into a separate method of load_stats.	2025-11-11 14:26:09 +01:00
Robert Bindar	817fdadd49	Improve choice distribution for primary replica I noticed during tests that `maybe_get_primary_replica` would not distribute uniformly the choice of primary replica because `info.replicas` on some shards would have an order whilst on others it'd be ordered differently, thus making the function choose a node as primary replica multiple times when it clearly could've chosen a different nodes. This patch sorts the replica set before passing it through the scope filter. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-11-11 09:18:01 +02:00
Tomasz Grabiec	1c0d847281	Merge 'load_balancer: load_stats reconcile after tablet migration and table resize' from Ferenc Szili This change adds the ability to move tablets sizes in load_stats after a tablet migration or table resize (split/merge). This is needed because the size based load balancer needs to have tablet size data which is as accurate as possible, in order to work on fresh tablet size distribution and issue correct tablet migrations. This is the second part of the size based load balancing changes: - First part for tablet size collection via load_stats: #26035 - Second part reconcile load_stats: #26152 - The third part for load_sketch changes: #26153 - The fourth part which performs tablet load balancing based on tablet size: #26254 This is a new feature and backport is not needed. Closes scylladb/scylladb#26152 * github.com:scylladb/scylladb: load_balancer: load_stats reconcile after tablet migration and table resize load_stats: change data structure which contains tablet sizes	2025-10-31 09:58:25 +01:00
Tomasz Grabiec	28f6bdc99b	cql3: ks_prop_defs: Expand numeric RF to rack list Auto-exands numeric RF in CREATE/ALTER KEYSPACE statements for new DCs specified in the statement. Doesn't auto-expand existing options, as the rack choice may not be in line with current replica placement. This requires co-locating tablet replicas, and tracking of co-location state, which is not implemented yet. Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>	2025-10-29 23:32:59 +01:00
Ferenc Szili	10f07fb95a	load_balancer: load_stats reconcile after tablet migration and table resize This change adds the ability to move tablets sizes in load_stats after a tablet migration or table resize (split/merge). This is needed because the size based load balancer needs to have tablet size data which is as accurate as possible, in order to issue migrations which improve load balance.	2025-10-28 12:12:09 +01:00
Aleksandra Martyniuk	910cd0918b	locator: use get_primary_replica for get_primary_endpoints Currently, tablet_sstable_streamer::get_primary_endpoints is out of sync with tablet_map::get_primary_replica. The get_primary_replica optimizes the choice of the replica so that the work is fairly distributes among nodes. Meanwhile, get_primary_endpoints always chooses the first replica. Use get_primary_replica for get_primary_endpoints. Fixes: https://github.com/scylladb/scylladb/issues/21883. Closes scylladb/scylladb#26385	2025-10-28 09:56:08 +02:00
Ferenc Szili	b4ca12b39a	load_stats: change data structure which contains tablet sizes This patch changes the tablet size map in load_stats. Previously, this data structure was: std::unordered_map<range_based_tablet_id, uint64_t> tablet_sizes; and is changed into: std::unordered_map<table_id, std::unordered_map<dht::token_range, uint64_t>> tablet_sizes; This allows for improved performance of tablet tablet size reconciliation.	2025-10-24 14:37:00 +02:00
Petr Gusev	b23f2a2425	tablet_metadata_guard: fix split/merge handling The guard should stop refreshing the ERM when the number of tablets changes. Tablet splits or merges invalidate the tablet_id field (_tablet), which means the guard can no longer correctly protect ongoing operations from tablet migrations. Fixes scylladb/scylladb#26437	2025-10-22 11:32:37 +02:00
Petr Gusev	ec6fba35aa	tablet_metadata_guard: add debug logs	2025-10-22 11:32:37 +02:00
Asias He	13dd88b010	repair: Rename incremental mode name Using the name regular as the incremental mode could be confusing, since regular might be interpreted as the non-incremental repair. It is better to use incremental directly. Before: - regular (standard incremental repair) - full (full incremental repair) - disabled (incremental repair disabled) After: - incremental (standard incremental repair) - full (full incremental repair) - disabled (incremental repair disabled) Fixes #26503 Closes scylladb/scylladb#26504	2025-10-10 15:21:54 +03:00
Piotr Dulikowski	380f243986	Merge ' Support replication factor rack list for tablet-based keyspaces' from Tomasz Grabiec This change extends the CQL replication options syntax so the replication factor can be stated as a list of rack names. For example: { 'mydatacenter': [ 'myrack1', 'myrack2', 'myrack4' ] } Rack-list based RF can coexist with the old numerical RF, even in the same keyspace for different DCs. Specifying the rack list also allows to add replicas on the specified racks (increasing the replication factor), or decommissioning certain racks from their replicas (by omitting them from the current datacenter rack-list). This will allow us to keep the keyspace rf-rack-valid, maintaining guarantees, while allowing adding/removing racks. In particular, this will allow us to add a new DC, which happens by incrementally increasing RF in that DC to cover existing racks. Migration from numerical RF to rack-list is not supported yet. Migration from rack-list to numerical RF is not planned to be supported. New feature, no backport required. Co-authored with @bhalevy Fixes https://github.com/scylladb/scylladb/issues/25269 Fixes https://github.com/scylladb/scylladb/issues/23525 Closes scylladb/scylladb#26358 * github.com:scylladb/scylladb: tablets: load_balancer: Recognize that tablets are confined to racks when computing desired tablet count locator: Make hasher for endpoint_dc_rack globally accessible test: tablets: Add test for replica allocation on rack list changes test: lib: topology_builder: generate unique rack names test: Add tests for rack list RF doc: Document rack-list replication factor topology_coordinator: Restore formatting topology_coordinator: Cancel keyspace alter on broader set of errors topology_coordinator: Make keyspace alter process options through as_ks_metadata_update() cql3: ks_prop_defs: Preserve old options cql3: ks_prop_defs: Introduce flattened() locator: Recognize rack list RF as valid in assert_rf_rack_valid_keyspace() tablet_allocator: Respect binding replicas to racks locator: network_topology_strategy: Respect rack list when reallocating tablets cql3: ks_prop_defs: Fail with more information when options are not in expected format locator, cql3: Support rack lists in replication options cql3: Fail early on vnode/tablet flavor alter cql3: Extract convert_property_map() out of Cql.g schema: Use definition from the header instead of open-coding it locator: Abstract obtaining the number of replicas from replication_strategy_config_option cql3, locator: Use type aliases for option maps locator: Add debug logging locator: Pass topology to replication strategy constructor abstract_replication_strategy, network_topology_strategy: add replication_factor_data class	2025-10-06 14:14:09 +02:00
Ferenc Szili	20aeed1607	load balancing: extend locator::load_stats to collect tablet sizes This commit extend the TABLE_LOAD_STATS RPC with data about the tablet replica sizes and effective disk capacity. Effective disk capacity of a node is computed as a sum of the sizes of all tablet replicas on a node and available disk space. This is the first change in the size based load balancing series. Closes scylladb/scylladb#26035	2025-10-03 13:37:22 +02:00
Tomasz Grabiec	6b7b0cb628	locator: Recognize rack list RF as valid in assert_rf_rack_valid_keyspace()	2025-10-02 19:42:39 +02:00
Avi Kivity	72609b5f69	Merge 'mv: generate view updates on pending replica' from Michael Litvak Generate view updates from a pending base replica if it's a reading replica, i.e. it's in the last stage of transition write_both_read_new before becoming the new base replica. Previously we didn't generate view updates on a pending replica. The problem with that is that when a base token is migrated from one replica B1 to another B2, at one stage we generate view updates only from B1, then at the next stage we generate view updates only from B2. During this transition, it can happen that for some write neither B1 nor B2 generate view update, because each one sees the other as the base replica. We fix this by generating view updates from both base replicas in the phase before the transition. We can generate view updates on the pending replica in this case, even if it requires read-before-write, because it's in a stage where it contains all data and serves reads. Fixes https://github.com/scylladb/scylladb/issues/24292 backport not needed - the issue mostly affects MV with tablets which is still experimental Closes scylladb/scylladb#25904 * github.com:scylladb/scylladb: test: mv: test view update during topology operations mv: generate view updates on both shards in intranode migration mv: generate view updates on pending replica	2025-09-30 13:17:16 +03:00
Michael Litvak	c9237bf5f6	mv: generate view updates on both shards in intranode migration Similarly to the issue of tokens migrating from one host to another, where we need to generate view updates on both replicas before transitioning in order to not lose view updates, we need to do the same in case of intranode migration. In intranode migration we migrate tokens from one shard to another. Previously we checked shard_for_reads in order to generate view updates only on the single shard that is selected for reads, and not on a pending shard that is not ready yet. The problem is that shard_for_reads switches from the source shard to the destination shard in a single transition, and during that switch we can lose view updates because neither shard sees itself as the shard for reads. We fix this by having a phase before the transition when both shards are ready for reads and both will generate view updates.	2025-09-29 13:44:04 +02:00
Petr Gusev	8adbb6c4dd	tablets: disallow chains of colocated tables	2025-09-26 16:52:43 +02:00
Avi Kivity	5237a20993	Merge 'replica: Fix split compaction when tablet boundaries change' from Raphael Raph Carvalho Consider the following: 1) balancer emits split decision 2) split compaction starts 3) split decision is revoked 4) emits merge decision 5) completes merge, before compaction in step 2 finishes After last step, split compaction initiated in step 2 can fail because it works with the global tablet map, rather than the map when the compaction started. With the global state changing under its feet, on merge, the mutation splitting writer will think it's going backwards since sibling tablets are merged. This problem was also seen when running load-and-stream, where split initiated by the sstable writer failed, split completed, and the unsplit sstable is left in the table dir, causing problems in the restart. To fix this, let's make split compaction always work with the state when it started, not a global state. Fixes #24153. All 2025.* versions are vulnerable, so fix must be backported to them. Closes scylladb/scylladb#25690 * github.com:scylladb/scylladb: replica: Fix split compaction when tablet boundaries change replica: Futurize split_compaction_options()	2025-09-09 17:05:32 +03:00
Asias He	cb7db47ae1	repair: Add incremental_mode option for tablet repair This patch introduces a new `incremental_mode` parameter to the tablet repair REST API, providing more fine-grained control over the incremental repair process. Previously, incremental repair was on and could not be turned off. This change allows users to select from three distinct modes: - `regular`: This is the default mode. It performs a standard incremental repair, processing only unrepaired sstables and skipping those that are already repaired. The repair state (`repaired_at`, `sstables_repaired_at`) is updated. - `full`: This mode forces the repair to process all sstables, including those that have been previously repaired. This is useful when a full data validation is needed without disabling the incremental repair feature. The repair state is updated. - `disabled`: This mode completely disables the incremental repair logic for the current repair operation. It behaves like a classic (pre-incremental) repair, and it does not update any incremental repair state (`repaired_at` in sstables or `sstables_repaired_at` in the system.tablets table). The implementation includes: - Adding the `incremental_mode` parameter to the `/storage_service/repair/tablet` API endpoint. - Updating the internal repair logic to handle the different modes. - Adding a new test case to verify the behavior of each mode. - Updating the API documentation and developer documentation. Fixes #25605 Closes scylladb/scylladb#25693	2025-09-09 06:50:21 +03:00
Raphael S. Carvalho	68f23d54d8	replica: Fix split compaction when tablet boundaries change Consider the following: 1) balancer emits split decision 2) split compaction starts 3) split decision is revoked 4) emits merge decision 5) completes merge, before compaction in step 2 finishes After last step, split compaction initiated in step 2 can fail because it works with the global tablet map, rather than the map when the compaction started. With the global state changing under its feet, on merge, the mutation splitting writer will think it's going backwards since sibling tablets are merged. This problem was also seen when running load-and-stream, where split initiated by the sstable writer failed, split completed, and the unsplit sstable is left in the table dir, causing problems in the restart. To fix this, let's make split compaction always work with the state when it started, not a global state. Fixes #24153. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-09-07 05:20:23 -03:00
Łukasz Paszkowski	54201960e6	storage_service: extend locator::load_stats to collect per-node critical disk utilization flag This commit extends the TABLE_LOAD_STATS RPC with information whether a node operates in the critical disk utilization mode. This information will be needed to distict between the causes why a table migration/repair was interrupted.	2025-08-29 14:56:13 +02:00
Avi Kivity	611918056a	Merge 'repair: Add tablet incremental repair support' from Asias He The central idea of incremental repair is to allow repair participants to select and repair only a portion of the dataset to speed up the repair process. All repair participants must utilize an identical selection method to repair and synchronize the same selected dataset. There are two primary selection methods: time-based and file-based. The time-based method selects data within a specified time frame. It is versatile but it is less efficient because it requires reading all of the dataset and omitting data beyond the time frame. The file-based method selects data from unrepaired SSTables and is more efficient because it allows the entire SSTable to be omitted. This document patch implements the file-based selection method. Incremental repair will only be supported for tablet tables; it will not be supported for vnode tables. On one hand, the legacy vnode is less important to support. On the other hand, the incremental repair for vnode is much harder to implement. With vnodes, a SSTalbe could contain data for multiple vnode ranges. When a given vnode range is repaired, only a portion of the SSTable is repaired. This complicates the manipulation of SSTables significantly during both repair and compaction. With tablets, an entire tablet is repaired so that a sstable is either fully repaired or not repaired which is a huge simplification. This patch uses the repaired_at from sstables::statistics component to mark a sstable as repaired. It uses a virtual clock as the repair timestamp, i.e., using a monotonically increasing number for the repaired_at field of a SSTable and sstables_repaired_at column in system.tablets table. Notice that when a sstable is not repaired, the repaired_at field will be set to the default value 0 by default. The being_repaired in memory field of a SSTable is used to explicitly mark that a SSTable is being selected. The following variables are used for incremental repair: The repaired_at on disk field of a SSTable is used. - A 64-bit number increases sequentially The sstables_repaired_at is added to the system.tablets table. - repaired_at <= sstables_repaired_at means the sstable is repaired The being_repaired in memory field of a SSTable is added. - A repair UUID tells which sstable has participated in the repair Initial test results: 1) Medium dataset results Node amount: 3 Instance type: i4i.2xlarge Disk usage per node: ~500GB Cluster pre-populated with ~500GB of data before starting repairs job. Results for Repair Timings: The regular repair run took 210 mins. Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s The speedup is: 183 mins / 48s = 228X 2) Small dataset results Node amount: 3 Instance type: i4i.2xlarge Disk usage per node: ~167GB Cluster pre-populated with ~167GB of data before starting the repairs job. Regular repair 1st run took 110s, 2nd and 3rd runs took 110s. Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds. The speedup is: 110s / 1.5s = 73X 3) Large dataset results Node amount: 6 Instance type: i4i.2xlarge, 3 racks 50% of base load, 50% read/write Dataset == Sum of data on each node Dataset Non-incremental repair (minutes) 1.3 TiB 31:07 3.5 TiB 25:10 5.0 TiB 19:03 6.3 TiB 31:42 Dataset Incremental repair (minutes) 1.3 TiB 24:32 3.0 TiB 13:06 4.0 TiB 5:23 4.8 TiB 7:14 5.6 TiB 3:58 6.3 TiB 7:33 7.0 TiB 6:55 Fixes #22472 Closes scylladb/scylladb#24291 * github.com:scylladb/scylladb: replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair compaction: Move compaction_reenabler to compaction_reenabler.hh topology_coordinator: Make rpc::remote_verb_error to warning level repair: Add metrics for sstable bytes read and skipped from sstables test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair test.py: Add tests for tablet incremental repair repair: Add tablet incremental repair support compaction: Add tablet incremental repair support feature_service: Add TABLET_INCREMENTAL_REPAIR feature tablet_allocator: Add tablet_force_tablet_count_increase and decrease repair: Add incremental helpers sstable: Add being_repaired to sstable sstables: Add set_repaired_at to metadata_collector mutation_compactor: Introduce add operator to compaction_stats tablet: Add sstables_repaired_at to system.tablets table test: Fix drain api in task_manager_client.py	2025-08-19 13:13:22 +03:00
Asias He	f9021777d8	compaction: Add tablet incremental repair support This patch addes incremental_repair support in compaction. - The sstables are split into repaired and unrepaired set. - Repaired and unrepaired set compact sperately. - The repaired_at from sstable and sstables_repaired_at from system.tablets table are used to decide if a sstable is repaired or not. - Different compactions tasks, e.g., minor, major, scrub, split, are serialized with tablet repair.	2025-08-18 11:01:21 +08:00
Benny Halevy	4d646636f2	locator: effective_replication_map: get_natural_replicas: get is_vnode param Some callers, like `construct_range_to_endpoint_map` for describe_ring, or `get_secondary_ranges` for alternator ttl pass vnode tokens (the vnodes' start token), and therefore can benefit from the fast lookup path in `vnode_effective_replication_map::do_get_replicas`. Otherwise the vnode token is binary-searched in sorted_tokens using token_metadata::first_token(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-13 12:41:00 +03:00
Avi Kivity	8180cbcf48	Merge 'tablets: prevent accidental copy of tablets_map' from Benny Halevy As they are wasteful in many cases, it is better to move the tablet_map if possible, or clone it gently in an async fiber. Add clone() and clone_gently() methods to allow explicit copies. * minor optimization, no backport needed Closes scylladb/scylladb#24978 * github.com:scylladb/scylladb: tablets: prevent accidental copy of tablets_map locator: tablets: get rid of synchronous mutate_tablet_map	2025-07-27 16:48:27 +03:00
Lakshmi Narayanan Sreethar	0c5fa8e154	locator/token_metadata.cc: use chunked_vector to store _sorted_tokens The `token_metadata_impl` stores the sorted tokens in an `std::vector`. With a large number of nodes, the size of this vector can grow quickly, and updating it might lead to oversized allocations. This commit changes `_sorted_tokens` to a `chunked_vector` to avoid such issues. It also updates all related code to use `chunked_vector` instead of `std::vector`. Fixes #24876 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#25027	2025-07-27 11:29:22 +03:00
Aleksandra Martyniuk	1767eb9529	repair: remove unused code	2025-07-24 11:11:12 +02:00
Benny Halevy	fce6c4b41d	tablets: prevent accidental copy of tablets_map As they are wasteful in many cases, it is better to move the tablet_map if possible, or clone it gently in an async fiber. Add clone() and clone_gently() methods to allow explicit copies. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-07-22 15:07:26 +03:00
Benny Halevy	dee0d7ffbf	locator: tablets: get rid of synchronous mutate_tablet_map It is currently used only by tests that could very well do with mutate_tablet_map_async. This will simplify the following patch to prevent accidental copy of the tablet_map, provding explicit clone/clone_gently methods. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-07-22 15:03:02 +03:00
Benny Halevy	3acca0aa63	locator: tablets: tablet_metadata: clear_gently: optimize foreign ptr destruction Sort all tablet_map_ptr:s by shard_id and then destroy them on each shard to prevent long cross-shard task queues for foreign_ptr destructions. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-07-06 14:20:46 +03:00
Tomasz Grabiec	97679002ee	Merge 'Co-locate tablets of different tables' from Michael Litvak Add the option to co-locate tablets of different tables. For example, a base table and its CDC table, or a local index. main changes and ideas: * "table group" - a set of one or more tables that should be co-located. (Example: base table and CDC table). A group consists of one base table and zero or more children tables. * new column `base_table` in `system.tablets`: when creating a new table, it can be set to point to a base table, which the new table's tablets will be co-located with. when it's set, the tablet map information should be retrieved from the base table map. the child map doesn't contain per-tablet information. * co-located tables always have the same tablet count and the same tablet replicas. each tablet operation - migration, resize, repair - is applied on all tablets in a synchronized manner by the topology coordinator. * resize decision for a group is made by combining the per-table hints and comparing the average tablet size (over all tablets in the group) with the target tablet size. * the tablets load balancer works with the base table as a representative of the group. it represents a single migration unit with some `group_size` that is taken into account. * view tablets are co-located with base tablets when the partition keys match. Fixes https://github.com/scylladb/scylladb/issues/17043 backport is not needed. this is preliminary work for support of MVs and CDC with tablets. Closes scylladb/scylladb#22906 * github.com:scylladb/scylladb: tablets: validate no clustering row mutations on co-located tables raft_group0_client: extend validate_change to mixed_change type docs: topology-over-raft: document co-located tables tablet-mon.py: visual indication for co-located tablets tablet-mon.py: handle co-located tablets test/boost/view_schema_test.cc: fix race in wait_until_built boost/tablets_test: test load balancing and resize of co-located tablets test/tablets: test tablets colocation tablets: co-locate view tablets with base when the partition keys match test/pylib/tablets: common get_tablet_count api test_mv_tablets: use get_tablet_replicas from common tablets api test/pylib/tablets: fix test api to read tablet replicas from base table tablets: allocator: create co-located tables in a single operation alternator: prepare all new tables in a single announcement migration_manager: add notification for creating multiple tables tablets: read_tablet_transition_stage: read from base table storage service: allow repair request only on base tables tablets: keyspace_rf_change: apply on base table storage service: generate tablet migration updates on base tables tablets: replace all_tables method tablets: split when all co-located tablets are ready tablets: load balancer: sizing plan for table groups tablets: load balancer: handle co-located tablets tablets: allocate co-located tablets tablets: handle migration of co-located tablets storage service: add repair colocated tablets rpc tablets: save and read tablet metadata of co-located tables tablets: represent co-located tables in tablet metadata tablets: add base_table column to system.tablets docs: update system.tablets schema	2025-07-01 16:02:30 +02:00
Michael Litvak	ddf02c9489	tablets: replace all_tables method The method all_tables in tablet_metadata is used for iterating over all tables in the tablet metadata with their tablet maps. Now that we have co-located tables we need to make the distinction on which tables we want to iterate over. In some cases we want to iterate over each group of co-located tables, treating them as one unit, and in other cases we want to iterate over all tables, doesn't matter if they are part of a co-located group and have a base table. We replace all_tables with new methods that can be used for each of the cases.	2025-07-01 13:20:18 +03:00
Nadav Har'El	e12ff4d3ab	Merge 'LWT: use tablet_metadata_guard' from Petr Gusev This PR is a step towards enabling LWT for tablet-based tables. It pursues several goals: * Make it explicit that the tablet can't migrate after the `cas_shard` check in `selec_statement/modification_statement`. Currently, `storage_proxy::cas` expects that the client calls it on a correct shard -- the one which owns the partition key the LWT is running on. There reasons for that are explained in [this commit](`f16e3b0491 (diff-1073ea9ce4c5e00bb6eb614154f523ba7962403a4fe6c8cd877d1c8b73b3f649)`) message. The statements check the current shard and invokes `bounce_to_shard` if it's not the right one. However , the erm strong pointer is only captured in `storage_proxy::cas` and until that moment there is no explicit structure in the code which would prevent the ongoing migrations. In this PR we introduce such stucture -- `erm_handle`. We create it before the `cas_check` and pass it down to `storage_proxy::cas` and `paxos_response_handler`. * Another goal of this PR is an optimization -- we don't want to hold erm for the duration of entire LWT, unless it directly affects the current tablet. The is a `tablet_metadata_guard` class which is used for long running tablet operations. It automatically switches to a new erm if the topology change represented by the new erm doesn't affect the current tablet. We use this class in `erm_handle` if the table uses tablets. Otherwise, `erm_handle` just stores erm directly. * Fixes [shard bouncing issue in alternator](https://github.com/scylladb/scylladb/issues/17399) Backport: not needed (new feature). Closes scylladb/scylladb#24495 * github.com:scylladb/scylladb: LWT: make cas_shard non-optional in sp::cas LWT: create cas_shard in select_statement LWT: create cas_shard in modification and batch statements LWT: create cas_shard in alternator LWT: use cas_shard in storage_proxy::cas do_query_with_paxos: remove redundant cas_shard check storage_proxy: add cas_shard class sp::cas_shard: rename to get_cas_shard token_metadata_guard: a topology guard for a token tablet_metadata_guard: mark as noncopyable and nonmoveable	2025-07-01 11:33:20 +03:00
Michael Litvak	ddfe5dfb6b	tablets: represent co-located tables in tablet metadata Modify tablet_metadata to be able to represent co-located tables. The new method set_colocated_table adds to tablet_metadata a table which is co-located with another table. A co-located table shares the tablet map object with the base table, so we just create a copy of the shared tablet map pointer and store it as the co-located table's tablet map. Whenever a tablet map is modified we update the pointer for all the co-located tables accordingly, so the tablet map remains shared. We add some data structures to tablet_metadata to be able to work with co-located table groups efficiently: * `_table_groups` maps every base table to all tables in its co-location group. This is convenient for iterating over all table groups, or finding all tables in some group. * `_base_table` maps a co-located table to its base table.	2025-07-01 10:29:59 +03:00
Petr Gusev	85eac7c34e	token_metadata_guard: a topology guard for a token Data-plane requests typically hold a strong pointer to the effective_replication_map (ERM) to protect against tablet migrations and other topology operations. This works because major steps in the topology coordinator use global barriers. These barriers install a new token_metadata version on each shard and wait for all references to the old one to be dropped. Since the ERM holds a strong pointer to token_metadata, it effectively blocks these operations until it's released. For LWT, we usually deal with a single token within a single tablet. In such cases, it's enough to block topology changes for just that one tablet. The existing tablet_metadata_guard class already supports this: it tracks tablet-specific changes and updates the ERM pointer automatically, unless the change affects the guarded tablet. However, this only works for tablet-aware tables. To support LWT with vnodes (i.e., non-tablet-aware tables), this commit introduces a new token_metadata_guard class. It wraps tablet_metadata_guard when the table uses tablets, and falls back to holding a plain strong ERM pointer otherwise. In the next commits, we’ll migrate LWT to use token_metadata_guard in paxos_response_handler instead of erm.	2025-06-18 11:51:48 +02:00
Petr Gusev	73221aa7b1	tablet_metadata_guard: mark as noncopyable and nonmoveable tablet_metadata_guard passes a raw pointer to get_validity_abort_source, so it can't be easily copied or moved. In this commit we make this explicit. We define destructor in cpp -- the autogenerated one complains on lw_shared_ptr<replica::table> as replica::table is only forward-declared in the headers.	2025-06-18 11:50:46 +02:00
Michael Litvak	34f15ca871	tablets: deallocate storage state on end_migration When a tablet is migrated and cleaned up, deallocate the tablet storage group state on `end_migration` stage, instead of `cleanup` stage: * When the stage is updated from `cleanup` to `end_migration`, the storage group is removed on the leaving replica. * When the table is initialized, if the tablet stage is `end_migration` then we don't allocate a storage group for it. This happens for example if the leaving replica is restarted during tablet migration. If it's initialized in `cleanup` stage then we allocate a storage group, and it will be deallocated when transitioning to `end_migration`. This guarantees that the storage group is always deallocated on the leaving replica by `end_migration`, and that it is always allocated if the tablet wasn't cleaned up fully yet. It is a similar case also for the pending replica when the migration is aborted. We deallocate the state on `revert_migration` which is the stage following `cleanup_target`. Previously the storage group would be allocated when the tablet is initialized on any of the tablet replicas - also on the leaving replica, and when the tablet stage is `cleanup` or `end_migration`, and deallocated during `cleanup`. This fixes the following issue: 1. A migrating tablet enters cleanup stage 2. the tablet is cleaned up successfuly 3. The leaving replica is restarted, and allocates storage group 4. tablet cleanup is not called because it was already cleaned up 4. the storage group remains allocated on the leaving replica after the migration is completed - it's not cleaned up properly. Fixes scylladb/scylladb#23481	2025-06-09 16:58:38 +03:00
Piotr Szymaniak	de96c28625	alternator: Add support for TTL when using tablets Support for TTL-based data removal when using tablets. The essence of this commit is a separate code path for finding token ranges owned by the current shard for the cases when tablets are used and not vnodes. At the same time, the vnodes-case is not touched not to cause any regressions. The TTL-caused data removal is normally performed by the primary replica (both when using vnodes and tablets). For the tablets case, the already-existing method tablet_map::get_primary_replica(tablet_id) is used to know if a shard execuring the TTL-related data removal is the primary replica for each tablet. A new method tablet_map::get_secondary_replica(tablet_id) has been added. It is needed by the data invalidation procedure to remove data when the primary replica node is down - the data is then removed by the secondary replica node. The mechanism is the same as in the vnodes case. Since alternator now supports TTL, the test `test_ttl_enable_error_with_tablets` has been removed. Also, tests in the test_ttl.py have been made to run twice, once with vnodes and once with tablets. When run with tablets, the due to lack of support for LWT with tablets (#18068), tests use 'system:write_isolation' of 'unsafe_rmw'. This approach allows early regression testing with tablets and is meant only as a tentative solution. Fixes scylladb/scylladb#16567 Closes scylladb/scylladb#23662	2025-06-05 17:39:29 +03:00
Tomasz Grabiec	2597a7e980	load_sketch: Tolerate missing tablet_map when selecting for a given table To simplify future usage in network_topology_strategy::add_tablets_in_dc() which invokes populate() for a given table, which may be both new and preexisitng.	2025-04-17 16:01:16 +02:00
Aleksandra Martyniuk	acd32b24d3	locator: service: move to rebuild_v2 transition if cluster is upgraded If cluster is upgraded to version containing rebuild_v2 transition kind, move to this transition kind instead of rebuild.	2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk	4a847df55c	locator: service: add rebuild_repair tablet transition stage Currently, in the streaming stage of rebuild tablet transition, we stream tablet data from all replicas. This patch series splits the streaming stage into two phases: - repair phase, where we repair the tablet; - streaming phase, where we stream tablet data from one replica. rebuild_repair is a stage that will be used to perform the repair phase. It executes the tablet repair on tablet_info::replicas. A primary replica out of migration_streraming_info::read_from is the repair master. If the repair succeeds, we move to streaming tablet transition stage, and to cleanup_target - if it fails. The repair bypasses the tablet repair scheduler and it does not update the repair_time. A transition to the rebuild_repair stage will be added in the following patches.	2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk	5d6041617b	locator: add maybe_get_primary_replica Add maybe_get_primary_replica to choose a primary replica out of custom replica set.	2025-04-08 10:42:01 +02:00
Aleksandra Martyniuk	ed7b8bb787	locator: service: add rebuild_v2 tablet transition kind Currently, in the streaming stage of rebuild tablet transition, we stream tablet data from all replicas. This patch series splits the streaming stage into two phases: - repair phase, where we repair the tablet; - streaming phase, where we stream tablet data from one replica. To differentiate the two streaming methods, a new tablet transition kind - rebuild_v2 - is added. The transtions and stages for rebuild_v2 transition kind will be added in the following patches.	2025-04-08 10:42:01 +02:00
Dawid Mędrek	32879ec0d5	db/config: Introduce RF-rack-valid keyspaces We introduce a new term in the glossary: RF-rack-valid keyspace. We also highlight in our user documentation that all keyspaces must remain RF-rack-valid throughout their lifetime, and failing to guarantee that may result in data inconsistencies or other issues. We base that information on our experience with materialized views in keyspaces using tablets, even though they remain an experimental feature. Along with the new term, we introduce a new configuration option called `rf_rack_valid_keyspaces`, which, when enabled, will enforce preserving all keyspaces RF-rack-valid. That functionality will be implemented in upcoming commits. For now, we materialize the restriction in form of a named requirement: a function verifying that the passed keyspace is RF-rack-valid. The option is disabled by default. That will change once we adjust the existing tests to the new semantics. Once that is done, the option will first be enabled by default, and then it will be removed. Fixes scylladb/scylladb#20356	2025-03-19 14:46:35 +01:00
Tomasz Grabiec	7e7f1e6f91	storage_service, tablets: Collect per-node capacity in load_stats New RPC is introduced becuase load_stats was marked "final" in the IDL. Will be needed by capacity-aware load balancing.	2025-03-06 12:17:32 +01:00
Aleksandra Martyniuk	2b538d228c	locator: add round-robin selection of filtered replicas	2025-02-28 12:32:55 +01:00
Aleksandra Martyniuk	fe4e99d7b3	locator: add tablet_task_info::selected_by_filters Extract dcs and hosts filters check to a method.	2025-02-28 12:02:21 +01:00
Avi Kivity	d99df7af6c	Merge 'Respect per-shard tablet goal and 10x default per-shard tablet count' from Tomasz Grabiec This series achieves two things: 1) changes default number of tablet replicas per shard to be 10 in order to reduce load imbalance between shards This will result in new tables having at least 10 tablet replicas per shard by default. We want this to reduce tablet load imbalance due to differences in tablet count per shard, where some shards have 1 tablet and some shards have 2 tablets. With higher tablet count per shard, this difference-by-one is less relevant. Fixes https://github.com/scylladb/scylladb/issues/21967 2) introduces a global goal for tablet replica count per shard and adds logic to tablet scheduler to respect it by controlling per-table tablet count The per-shard goal is enforced by controlling average per-shard tablet replica count in a given DC, which is controlled by per-table tablet count. This is effective in respecting the limit on individual shards as long as tablet replicas are distributed evenly between shards. There is no attempt to move tablets around in order to enforce limits on individual shards in case of imbalance between shards. If the average per-shard tablet count exceeds the limit, all tables which contribute to it (have replicas in the DC) are scaled down by the same factor. Due to rounding up to the nearest power of 2, we may overshoot the per-shard goal by at most a factor of 2. The scaling is applied after computing desired tablet count due to all other factors: per-table tablet count hints, defaults, average tablet size. If different DCs want different scale factors of a given table, the lowest scale factor is chosen for a given table. When creating a new table, its tablet count is determined by tablet scheduler using the scheduler logic, as if the table was already created. So any scaling due to per-shard tablet count goal is reflected immediately when creating a table. It may however still take some time for the system to shrink existing tables. We don't reject requests to create new tables. Fixes #21458 Closes scylladb/scylladb#22522 * github.com:scylladb/scylladb: config, tablets: Allow tablets_initial_scale_factor to be a fraction test: tablets_test: Test scaling when creating lots of tables test: tablets_test: Test tablet count changes on per-table option and config changes test: tablets_test: Add support for auto-split mode test: cql_test_env: Expose db config config: Make tablets_initial_scale_factor live-updateable tablets: load_balancer: Pick initial_scale_factor from config tablets, load_balancer: Fix and improve logging of resize decisions tablets, load_balancer: Log reason for target tablet count tablets: load_balancer: Move hints processing to tablet scheduler tablets: load_balancer: Scale down tablet count to respect per-shard tablet count goal tablets: Use scheduler's make_sizing_plan() to decide about tablet count of a new table tablets: load_balancer: Determine desired count from size separately from count from options tablets: load_balancer: Determine resize decision from target tablet count tablets: load_balancer: Allow splits even if table stats not available tablets: load_balancer: Extract make_sizing_plan() tablets: Add formatter for resize_decision::way_type tablets: load_balancer: Simplify resize_urgency_cmp() tablets: load_balancer: Keep config items as instance members locator: network_topology_strategy: Simplify calculate_initial_tablets_from_topology() tablets: Change the meaning of initial_scale to mean min-avg-tablets-per-shard tablets: Set default initial tablet count scale to 10 tablets: network_topology_stragy: Coroutinize calculate_initial_tablets_from_topology() tablets: load_balancer: Extract get_schema_and_rs() tablets: load_balancer: Drop test_mode	2025-02-24 17:59:26 +02:00
Raphael S. Carvalho	4d8a333a7f	storage_service: Don't retry split when table is dropped The split monitor wasn't handling the scenario where the table being split is dropped. The monitor would be unable to find the tablet map of such a table, and the error would be treated as a retryable one causing the monitor to fall into an endless retry loop, with sleeps in between. And that would block further splits, since the monitor would be busy with the retries. The fix is about detecting table was dropped and skipping to the next candidate, if any. Fixes #21859. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#22933	2025-02-20 10:13:55 +01:00

1 2 3 4

160 Commits