scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-21 17:10:35 +00:00

Author	SHA1	Message	Date
Benny Halevy	6129411a5e	locator: utils: get_all_ranges, construct_range_to_endpoint_map: use end-bound ranges Commit `60d2cc886a` changed get_all_ranges to return start-bound ranges and pre-calculate the wrapping range, and then construct_range_to_endpoint_map to pass r.start() (that is now always engaged) as the vnode token. However, as can be seen in token_metadata_impl::first_token the token ranges (a.k.a. vnodes) end with the sorted tokens, not start with them, so an arbitrary token t belongs to a vnode in some range `sorted_tokens[i-1] < t <= sorted_tokens[i]` Fixes #25541 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#25580	2025-08-20 15:15:40 +02:00
Avi Kivity	611918056a	Merge 'repair: Add tablet incremental repair support' from Asias He The central idea of incremental repair is to allow repair participants to select and repair only a portion of the dataset to speed up the repair process. All repair participants must utilize an identical selection method to repair and synchronize the same selected dataset. There are two primary selection methods: time-based and file-based. The time-based method selects data within a specified time frame. It is versatile but it is less efficient because it requires reading all of the dataset and omitting data beyond the time frame. The file-based method selects data from unrepaired SSTables and is more efficient because it allows the entire SSTable to be omitted. This document patch implements the file-based selection method. Incremental repair will only be supported for tablet tables; it will not be supported for vnode tables. On one hand, the legacy vnode is less important to support. On the other hand, the incremental repair for vnode is much harder to implement. With vnodes, a SSTalbe could contain data for multiple vnode ranges. When a given vnode range is repaired, only a portion of the SSTable is repaired. This complicates the manipulation of SSTables significantly during both repair and compaction. With tablets, an entire tablet is repaired so that a sstable is either fully repaired or not repaired which is a huge simplification. This patch uses the repaired_at from sstables::statistics component to mark a sstable as repaired. It uses a virtual clock as the repair timestamp, i.e., using a monotonically increasing number for the repaired_at field of a SSTable and sstables_repaired_at column in system.tablets table. Notice that when a sstable is not repaired, the repaired_at field will be set to the default value 0 by default. The being_repaired in memory field of a SSTable is used to explicitly mark that a SSTable is being selected. The following variables are used for incremental repair: The repaired_at on disk field of a SSTable is used. - A 64-bit number increases sequentially The sstables_repaired_at is added to the system.tablets table. - repaired_at <= sstables_repaired_at means the sstable is repaired The being_repaired in memory field of a SSTable is added. - A repair UUID tells which sstable has participated in the repair Initial test results: 1) Medium dataset results Node amount: 3 Instance type: i4i.2xlarge Disk usage per node: ~500GB Cluster pre-populated with ~500GB of data before starting repairs job. Results for Repair Timings: The regular repair run took 210 mins. Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s The speedup is: 183 mins / 48s = 228X 2) Small dataset results Node amount: 3 Instance type: i4i.2xlarge Disk usage per node: ~167GB Cluster pre-populated with ~167GB of data before starting the repairs job. Regular repair 1st run took 110s, 2nd and 3rd runs took 110s. Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds. The speedup is: 110s / 1.5s = 73X 3) Large dataset results Node amount: 6 Instance type: i4i.2xlarge, 3 racks 50% of base load, 50% read/write Dataset == Sum of data on each node Dataset Non-incremental repair (minutes) 1.3 TiB 31:07 3.5 TiB 25:10 5.0 TiB 19:03 6.3 TiB 31:42 Dataset Incremental repair (minutes) 1.3 TiB 24:32 3.0 TiB 13:06 4.0 TiB 5:23 4.8 TiB 7:14 5.6 TiB 3:58 6.3 TiB 7:33 7.0 TiB 6:55 Fixes #22472 Closes scylladb/scylladb#24291 * github.com:scylladb/scylladb: replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair compaction: Move compaction_reenabler to compaction_reenabler.hh topology_coordinator: Make rpc::remote_verb_error to warning level repair: Add metrics for sstable bytes read and skipped from sstables test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair test.py: Add tests for tablet incremental repair repair: Add tablet incremental repair support compaction: Add tablet incremental repair support feature_service: Add TABLET_INCREMENTAL_REPAIR feature tablet_allocator: Add tablet_force_tablet_count_increase and decrease repair: Add incremental helpers sstable: Add being_repaired to sstable sstables: Add set_repaired_at to metadata_collector mutation_compactor: Introduce add operator to compaction_stats tablet: Add sstables_repaired_at to system.tablets table test: Fix drain api in task_manager_client.py	2025-08-19 13:13:22 +03:00
Asias He	f9021777d8	compaction: Add tablet incremental repair support This patch addes incremental_repair support in compaction. - The sstables are split into repaired and unrepaired set. - Repaired and unrepaired set compact sperately. - The repaired_at from sstable and sstables_repaired_at from system.tablets table are used to decide if a sstable is repaired or not. - Different compactions tasks, e.g., minor, major, scrub, split, are serialized with tablet repair.	2025-08-18 11:01:21 +08:00
Benny Halevy	50abeb1270	locator: util: optimize describe_ring This change includes basic optimizations to locator::describe_ring, mainly caching the per-endpoint information in an unordered_map instead of looking them up in every inner-loop. This yields an improvement of 20% in cpu time. With 45 nodes organized as 3 dcs, 3 racks per dc, 5 nodes per rack, 256 tokens per node, yielding 11520 ranges and 9 replicas per range, describe_ring took Before: 30 milliseconds (2.6 microseconds per range) After: 24 milliseconds (2.1 microseconds per range) Add respective unit test of describe_ring for tablets. A unit test for vnodes already exists in test/nodetool/test_describering.py Fixes #24887 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-13 12:42:25 +03:00
Benny Halevy	60d2cc886a	locator: util: construct_range_to_endpoint_map: pass is_vnode=true to get_natural_replicas First, let get_all_ranges return all vnode ranges with a corrected wrapping range covering the [last token, first token) range, such that all ranges start tokens are vndoe tokens and must be in the vnode replication map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-13 12:42:23 +03:00
Benny Halevy	195d02d64e	vnode_effective_replication_map: do_get_replicas: throw internal error if token not found in map Prevent a crash, especially in the is_vnode=true case, if the key_token is not found in the map. Rather than the undefined behavior when dereferencing the end() iterator, throw an internal error with additional logging about the search logic and parameters. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-13 12:41:03 +03:00
Benny Halevy	4d646636f2	locator: effective_replication_map: get_natural_replicas: get is_vnode param Some callers, like `construct_range_to_endpoint_map` for describe_ring, or `get_secondary_ranges` for alternator ttl pass vnode tokens (the vnodes' start token), and therefore can benefit from the fast lookup path in `vnode_effective_replication_map::do_get_replicas`. Otherwise the vnode token is binary-searched in sorted_tokens using token_metadata::first_token(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-13 12:41:00 +03:00
Benny Halevy	6dbbb80aae	locator: abstract_replication_strategy: implement local_replication_strategy Derive both vnode_effective_replication_map and local_effective_replication_map from static_effective_replication_map as both are static and per-keyspace. However, local_effective_replication_map does not need vnodes for the mapping of all tokens to the local node. Note that everywhere_replication_strategy is not abstracted in a similar way, although it could, since the plan is to get rid of it once all system keyspaces areconverted to local or tablets replication (and propagated everywhere if needed using raft group0) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:05:11 +03:00
Benny Halevy	8bde507232	locator: vnode_effective_replication_map: convert clone_data_gently to clone_gently create_effective_replication_map need not know about the internals of vnode_effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:03:53 +03:00
Benny Halevy	8d4ac97435	locator: abstract_replication_map: rename make_effective_replication_map to make_vnode_effective_replication_map_ptr since it is specific to vnode_effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:03:53 +03:00
Benny Halevy	babb4a41a8	locator: abstract_replication_map: rename calculate_effective_replication_map to calculate_vnode_effective_replication_map since it is specific to vnode-based range calculations. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:03:53 +03:00
Benny Halevy	688bd4fd43	locator: effective_replication_map_factory: rename create_effective_replication_map to create_static_effective_replication_map, in preparation for separating local_effective_replication_map from vnode_effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:03:53 +03:00
Benny Halevy	cbad497859	locator: abstract_replication_strategy: rename vnode_effective_replication_map_ptr et. al to static_effective_replication_map_ptr, in preparation for separating local_effective_replication_map from vnode_effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:03:53 +03:00
Benny Halevy	2ab44e871b	locator: abstract_replication_strategy: rename global_vnode_effective_replication_map to global_static_effective_replication_map, in preparation for separating local_effective_replication_map from vnode_effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:03:49 +03:00
Benny Halevy	bd62421c05	keyspace: rename get_vnode_effective_replication_map to get_static_effective_replication_map, in preparation for separating local_effective_replication_map from vnode_effective_replication_map (both are per-keyspace). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 13:40:43 +03:00
Benny Halevy	ec85678de1	locator: abstract_replication_strategy: define is_local Prefer for specializing the local replication strategy, local effective replication map, et. al byt defining an is_local() predicate, similar to uses_tablets(). Note that is_vnode_based() still applies to local replication strategy. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 13:34:23 +03:00
Petr Gusev	801bf42ea2	sharder: add try_get_shard_for_reads method Currently, we use storage_proxy/get_cas_shard -> sharder.shard_for_reads to decide which shard to use for LWT code execution on both replicas and the coordinator. If the coordinator is not a replica, shard_for_reads returns 0 — the 'default' shard. This behavior has at least two problems: * Shard 0 may become overloaded, because all LWT coordinators that are not replicas will be served on it. * The zero shard does not match shard_for_reads on replicas, which hinders the "same shard for client and server" RPC-level optimization. To fix this, we need to know whether the current node hosts a replica for the tablet corresponding to the given token. Currently, there is no API we could use for this. For historical reasons, sharder::shard_for_reads returns 0 when the node does not host the shard, which leads to ambiguity. This commit introduces try_get_shard_for_reads, which returns a disengaged std::optional when the tablet is not present on the local node. We leave shard_for_reads method in the base sharder class, it calls try_get_shard_for_reads and returns zero by default. We need to rename tablet_sharder private methods shard_for_reads and shard_for_writes so that they don't conflict with the sharder::shard_for_reads.	2025-07-29 11:35:54 +02:00
Nadav Har'El	b4fc3578fc	Merge 'LWT: enable for tablet-based tables' from Petr Gusev This PR enables LWT (Lightweight Transactions) support for tablet-based tables by leveraging colocated tables. Currently, storing Paxos state in system tables causes two major issues: * Loss of Paxos state during tablet migration or base table rebuilds * When a tablet is migrated or the base table is rebuilt, system tables don't retain Paxos state. * This breaks LWT correctness in certain scenarios. * Failing test cases demonstrating this: * test_lwt_state_is_preserved_on_tablet_migration * test_lwt_state_is_preserved_on_rebuild * Shard misalignment and performance overhead * Tablets may be placed on arbitrary shards by the tablet balancer. * Accessing Paxos state in system tables could require a shard jump, degrading performance. We move Paxos state into a dedicated Paxos table, colocated with the base table: * Each base table gets its own Paxos state table. * This table is lazily created on the first LWT operation. * Its tablets are colocated with those of the base table, ensuring: * Co-migration during tablet movement * Co-rebuilding with the base table * Shard alignment for local access to Paxos state Some reasoning for why this is sufficient to preserve LWT correctness is discussed in [2]. This PR addresses two issues from the "Why doesn't it work for tablets" section in [1]: * Tablet migration vs LWT correctness * Paxos table sharding Other issues ("bounce to shard" and "locking for intranode_migration") have already been resolved in previous PRs. References [1] - [LWT over tablets design](https://docs.google.com/document/d/1CPm0N9XFUcZ8zILpTkfP5O4EtlwGsXg_TU4-1m7dTuM/edit?tab=t.0#heading=h.goufx7gx24yu) [2] - [LWT: Paxos state and tablet balancer](https://docs.google.com/document/d/1-xubDo612GGgguc0khCj5ukmMGgLGCLWLIeG6GtHTY4/edit?tab=t.0) [3] - [Colocated tables PR](https://github.com/scylladb/scylladb/pull/22906#issuecomment-3027123886) [4] - [Possible LWT consistency violations after a topology change](https://github.com/scylladb/scylladb/issues/5251) Backport: not needed because this is a new feature. Closes scylladb/scylladb#24819 * github.com:scylladb/scylladb: create_keyspace: fix warning for tablets docs: fix lwt.rst docs: fix tablets.rst alternator: enable LWT random_failures: enable execute_lwt_transaction test_tablets_lwt: add test_paxos_state_table_permissions test_tablets_lwt: add test_lwt_for_tablets_is_not_supported_without_raft test_tablets_lwt: test timeout creating paxos state table test_tablets_lwt: add test_lwt_concurrent_base_table_recreation test_tablets_lwt: add test_lwt_state_is_preserved_on_rebuild test_tablets_lwt: migrate test_lwt_support_with_tablets test_tablets_lwt: add test_lwt_state_is_preserved_on_tablet_migration test_tablets_lwt: add simple test for LWT check_internal_table_permissions: handle Paxos state tables client_state: extract check_internal_table_permissions paxos_store: handle base table removal database: get_base_table_for_tablet_colocation: handle paxos state table paxos_state: use node_local_only mode to access paxos state query_options: add node_local_only mode storage_proxy: handle node_local_only in query storage_proxy: handle node_local_only in mutate storage_proxy: introduce node_local_only flag abstract_replication_strategy: remove unused using storage_proxy: add coordinator_mutate_options storage_proxy: rename create_write_response_handler -> make_write_response_handler storage_proxy: simplify mutate_prepare paxos_state: lazily create paxos state table migration_manager: add timeout to start_group0_operation and announce paxos_store: use non-internal queries qp: make make_internal_options public paxos_store: conditional cf_id filter paxos_store: coroutinize feature_service: add LWT_WITH_TABLETS feature paxos_state: inline system_keyspace functions into paxos_store paxos_state: extract state access functions into paxos_store	2025-07-28 13:19:23 +03:00
Avi Kivity	8180cbcf48	Merge 'tablets: prevent accidental copy of tablets_map' from Benny Halevy As they are wasteful in many cases, it is better to move the tablet_map if possible, or clone it gently in an async fiber. Add clone() and clone_gently() methods to allow explicit copies. * minor optimization, no backport needed Closes scylladb/scylladb#24978 * github.com:scylladb/scylladb: tablets: prevent accidental copy of tablets_map locator: tablets: get rid of synchronous mutate_tablet_map	2025-07-27 16:48:27 +03:00
Lakshmi Narayanan Sreethar	0c5fa8e154	locator/token_metadata.cc: use chunked_vector to store _sorted_tokens The `token_metadata_impl` stores the sorted tokens in an `std::vector`. With a large number of nodes, the size of this vector can grow quickly, and updating it might lead to oversized allocations. This commit changes `_sorted_tokens` to a `chunked_vector` to avoid such issues. It also updates all related code to use `chunked_vector` instead of `std::vector`. Fixes #24876 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#25027	2025-07-27 11:29:22 +03:00
Petr Gusev	8e745137de	abstract_replication_strategy: remove unused using	2025-07-24 19:48:08 +02:00
Aleksandra Martyniuk	1767eb9529	repair: remove unused code	2025-07-24 11:11:12 +02:00
Benny Halevy	fce6c4b41d	tablets: prevent accidental copy of tablets_map As they are wasteful in many cases, it is better to move the tablet_map if possible, or clone it gently in an async fiber. Add clone() and clone_gently() methods to allow explicit copies. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-07-22 15:07:26 +03:00
Benny Halevy	dee0d7ffbf	locator: tablets: get rid of synchronous mutate_tablet_map It is currently used only by tests that could very well do with mutate_tablet_map_async. This will simplify the following patch to prevent accidental copy of the tablet_map, provding explicit clone/clone_gently methods. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-07-22 15:03:02 +03:00
Ernest Zaslavsky	408aa289fe	treewide: Move misc files to `utils` directory As requested in #22114, moved the files and fixed other includes and build system. Moved files: - interval.hh - Map_difference.hh Fixes: #22114 This is a cleanup, no need to backport Closes scylladb/scylladb#25095	2025-07-21 11:56:40 +03:00
Benny Halevy	6e4803a750	token_metadata_impl: clear_gently: release version tracker early No need to wait for all members to be cleared gently. We can release the version earlier since the held version may be awaited for in barriers. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-07-06 15:07:31 +03:00
Benny Halevy	2c0bafb934	token_metadata: clear_and_destroy_impl when destroyed We have a lot of places in the code where a token_metadata_ptr is kept in an automatic variable and destroyed when it leaves the scope. since it's a referenced counted lw_shared_ptr, the token_metadata object is rarely destroyed in those cases, but when it is, it doesn't go through clear_gently, and in particular its tablet_metadata is not cleared gently, leading to inefficient destruction of potentially many foreign_ptr:s. This patch calls clear_and_destroy_impl that gently clears and destroys the impl object in the background using the shared_token_metadata. Fixes #13381 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-07-06 15:07:31 +03:00
Benny Halevy	2b2cfaba6e	token_metadata: keep a reference to shared_token_metadata To be used by a following patch to gently clean and destroy the token_data_impl in the background. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-07-06 15:07:31 +03:00
Benny Halevy	e0a19b981a	token_metadata: move make_token_metadata_ptr into shared_token_metadata class So we can use the local shared_token_metadata instance for safe background destroy of token_metadata_impl:s. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-07-06 14:22:20 +03:00
Benny Halevy	3acca0aa63	locator: tablets: tablet_metadata: clear_gently: optimize foreign ptr destruction Sort all tablet_map_ptr:s by shard_id and then destroy them on each shard to prevent long cross-shard task queues for foreign_ptr destructions. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-07-06 14:20:46 +03:00
Tomasz Grabiec	97679002ee	Merge 'Co-locate tablets of different tables' from Michael Litvak Add the option to co-locate tablets of different tables. For example, a base table and its CDC table, or a local index. main changes and ideas: * "table group" - a set of one or more tables that should be co-located. (Example: base table and CDC table). A group consists of one base table and zero or more children tables. * new column `base_table` in `system.tablets`: when creating a new table, it can be set to point to a base table, which the new table's tablets will be co-located with. when it's set, the tablet map information should be retrieved from the base table map. the child map doesn't contain per-tablet information. * co-located tables always have the same tablet count and the same tablet replicas. each tablet operation - migration, resize, repair - is applied on all tablets in a synchronized manner by the topology coordinator. * resize decision for a group is made by combining the per-table hints and comparing the average tablet size (over all tablets in the group) with the target tablet size. * the tablets load balancer works with the base table as a representative of the group. it represents a single migration unit with some `group_size` that is taken into account. * view tablets are co-located with base tablets when the partition keys match. Fixes https://github.com/scylladb/scylladb/issues/17043 backport is not needed. this is preliminary work for support of MVs and CDC with tablets. Closes scylladb/scylladb#22906 * github.com:scylladb/scylladb: tablets: validate no clustering row mutations on co-located tables raft_group0_client: extend validate_change to mixed_change type docs: topology-over-raft: document co-located tables tablet-mon.py: visual indication for co-located tablets tablet-mon.py: handle co-located tablets test/boost/view_schema_test.cc: fix race in wait_until_built boost/tablets_test: test load balancing and resize of co-located tablets test/tablets: test tablets colocation tablets: co-locate view tablets with base when the partition keys match test/pylib/tablets: common get_tablet_count api test_mv_tablets: use get_tablet_replicas from common tablets api test/pylib/tablets: fix test api to read tablet replicas from base table tablets: allocator: create co-located tables in a single operation alternator: prepare all new tables in a single announcement migration_manager: add notification for creating multiple tables tablets: read_tablet_transition_stage: read from base table storage service: allow repair request only on base tables tablets: keyspace_rf_change: apply on base table storage service: generate tablet migration updates on base tables tablets: replace all_tables method tablets: split when all co-located tablets are ready tablets: load balancer: sizing plan for table groups tablets: load balancer: handle co-located tablets tablets: allocate co-located tablets tablets: handle migration of co-located tablets storage service: add repair colocated tablets rpc tablets: save and read tablet metadata of co-located tables tablets: represent co-located tables in tablet metadata tablets: add base_table column to system.tablets docs: update system.tablets schema	2025-07-01 16:02:30 +02:00
Michael Litvak	ddf02c9489	tablets: replace all_tables method The method all_tables in tablet_metadata is used for iterating over all tables in the tablet metadata with their tablet maps. Now that we have co-located tables we need to make the distinction on which tables we want to iterate over. In some cases we want to iterate over each group of co-located tables, treating them as one unit, and in other cases we want to iterate over all tables, doesn't matter if they are part of a co-located group and have a base table. We replace all_tables with new methods that can be used for each of the cases.	2025-07-01 13:20:18 +03:00
Nadav Har'El	e12ff4d3ab	Merge 'LWT: use tablet_metadata_guard' from Petr Gusev This PR is a step towards enabling LWT for tablet-based tables. It pursues several goals: * Make it explicit that the tablet can't migrate after the `cas_shard` check in `selec_statement/modification_statement`. Currently, `storage_proxy::cas` expects that the client calls it on a correct shard -- the one which owns the partition key the LWT is running on. There reasons for that are explained in [this commit](`f16e3b0491 (diff-1073ea9ce4c5e00bb6eb614154f523ba7962403a4fe6c8cd877d1c8b73b3f649)`) message. The statements check the current shard and invokes `bounce_to_shard` if it's not the right one. However , the erm strong pointer is only captured in `storage_proxy::cas` and until that moment there is no explicit structure in the code which would prevent the ongoing migrations. In this PR we introduce such stucture -- `erm_handle`. We create it before the `cas_check` and pass it down to `storage_proxy::cas` and `paxos_response_handler`. * Another goal of this PR is an optimization -- we don't want to hold erm for the duration of entire LWT, unless it directly affects the current tablet. The is a `tablet_metadata_guard` class which is used for long running tablet operations. It automatically switches to a new erm if the topology change represented by the new erm doesn't affect the current tablet. We use this class in `erm_handle` if the table uses tablets. Otherwise, `erm_handle` just stores erm directly. * Fixes [shard bouncing issue in alternator](https://github.com/scylladb/scylladb/issues/17399) Backport: not needed (new feature). Closes scylladb/scylladb#24495 * github.com:scylladb/scylladb: LWT: make cas_shard non-optional in sp::cas LWT: create cas_shard in select_statement LWT: create cas_shard in modification and batch statements LWT: create cas_shard in alternator LWT: use cas_shard in storage_proxy::cas do_query_with_paxos: remove redundant cas_shard check storage_proxy: add cas_shard class sp::cas_shard: rename to get_cas_shard token_metadata_guard: a topology guard for a token tablet_metadata_guard: mark as noncopyable and nonmoveable	2025-07-01 11:33:20 +03:00
Michael Litvak	ddfe5dfb6b	tablets: represent co-located tables in tablet metadata Modify tablet_metadata to be able to represent co-located tables. The new method set_colocated_table adds to tablet_metadata a table which is co-located with another table. A co-located table shares the tablet map object with the base table, so we just create a copy of the shared tablet map pointer and store it as the co-located table's tablet map. Whenever a tablet map is modified we update the pointer for all the co-located tables accordingly, so the tablet map remains shared. We add some data structures to tablet_metadata to be able to work with co-located table groups efficiently: * `_table_groups` maps every base table to all tables in its co-location group. This is convenient for iterating over all table groups, or finding all tables in some group. * `_base_table` maps a co-located table to its base table.	2025-07-01 10:29:59 +03:00
Asias He	c5a136c3b5	storage_service: Use utils::chunked_vector to avoid big allocation The following was seen: ``` !WARNING \| scylla[6057]: [shard 12:strm] seastar_memory - oversized allocation: 212992 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at [Backtrace #0] void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89 (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:99 seastar::current_tasktrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:136 seastar::current_backtrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:169 seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:848 seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:911 operator new(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:1706 std::allocator<dht::token_range_endpoints>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/allocator.h:196 (inlined by) std::allocator_traits<std::allocator<dht::token_range_endpoints> >::allocate(std::allocator<dht::token_range_endpoints>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/alloc_traits.h:515 (inlined by) std::_Vector_base<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:380 (inlined by) void std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_realloc_append<dht::token_range_endpoints const&>(dht::token_range_endpoints const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/vector.tcc:596 locator::describe_ring(replica::database const&, gms::gossiper const&, seastar::basic_sstring<char, unsigned int, 15u, true> const&, bool) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:1294 std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242 (inlined by) seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:80 seastar::reactor::do_run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:2635 std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:4684 ``` Fix by using chunked_vector. Fixes #24158 Closes scylladb/scylladb#24561	2025-06-19 16:51:01 +03:00
Petr Gusev	85eac7c34e	token_metadata_guard: a topology guard for a token Data-plane requests typically hold a strong pointer to the effective_replication_map (ERM) to protect against tablet migrations and other topology operations. This works because major steps in the topology coordinator use global barriers. These barriers install a new token_metadata version on each shard and wait for all references to the old one to be dropped. Since the ERM holds a strong pointer to token_metadata, it effectively blocks these operations until it's released. For LWT, we usually deal with a single token within a single tablet. In such cases, it's enough to block topology changes for just that one tablet. The existing tablet_metadata_guard class already supports this: it tracks tablet-specific changes and updates the ERM pointer automatically, unless the change affects the guarded tablet. However, this only works for tablet-aware tables. To support LWT with vnodes (i.e., non-tablet-aware tables), this commit introduces a new token_metadata_guard class. It wraps tablet_metadata_guard when the table uses tablets, and falls back to holding a plain strong ERM pointer otherwise. In the next commits, we’ll migrate LWT to use token_metadata_guard in paxos_response_handler instead of erm.	2025-06-18 11:51:48 +02:00
Petr Gusev	73221aa7b1	tablet_metadata_guard: mark as noncopyable and nonmoveable tablet_metadata_guard passes a raw pointer to get_validity_abort_source, so it can't be easily copied or moved. In this commit we make this explicit. We define destructor in cpp -- the autogenerated one complains on lw_shared_ptr<replica::table> as replica::table is only forward-declared in the headers.	2025-06-18 11:50:46 +02:00
Michael Litvak	34f15ca871	tablets: deallocate storage state on end_migration When a tablet is migrated and cleaned up, deallocate the tablet storage group state on `end_migration` stage, instead of `cleanup` stage: * When the stage is updated from `cleanup` to `end_migration`, the storage group is removed on the leaving replica. * When the table is initialized, if the tablet stage is `end_migration` then we don't allocate a storage group for it. This happens for example if the leaving replica is restarted during tablet migration. If it's initialized in `cleanup` stage then we allocate a storage group, and it will be deallocated when transitioning to `end_migration`. This guarantees that the storage group is always deallocated on the leaving replica by `end_migration`, and that it is always allocated if the tablet wasn't cleaned up fully yet. It is a similar case also for the pending replica when the migration is aborted. We deallocate the state on `revert_migration` which is the stage following `cleanup_target`. Previously the storage group would be allocated when the tablet is initialized on any of the tablet replicas - also on the leaving replica, and when the tablet stage is `cleanup` or `end_migration`, and deallocated during `cleanup`. This fixes the following issue: 1. A migrating tablet enters cleanup stage 2. the tablet is cleaned up successfuly 3. The leaving replica is restarted, and allocates storage group 4. tablet cleanup is not called because it was already cleaned up 4. the storage group remains allocated on the leaving replica after the migration is completed - it's not cleaned up properly. Fixes scylladb/scylladb#23481	2025-06-09 16:58:38 +03:00
Piotr Szymaniak	de96c28625	alternator: Add support for TTL when using tablets Support for TTL-based data removal when using tablets. The essence of this commit is a separate code path for finding token ranges owned by the current shard for the cases when tablets are used and not vnodes. At the same time, the vnodes-case is not touched not to cause any regressions. The TTL-caused data removal is normally performed by the primary replica (both when using vnodes and tablets). For the tablets case, the already-existing method tablet_map::get_primary_replica(tablet_id) is used to know if a shard execuring the TTL-related data removal is the primary replica for each tablet. A new method tablet_map::get_secondary_replica(tablet_id) has been added. It is needed by the data invalidation procedure to remove data when the primary replica node is down - the data is then removed by the secondary replica node. The mechanism is the same as in the vnodes case. Since alternator now supports TTL, the test `test_ttl_enable_error_with_tablets` has been removed. Also, tests in the test_ttl.py have been made to run twice, once with vnodes and once with tablets. When run with tablets, the due to lack of support for LWT with tablets (#18068), tests use 'system:write_isolation' of 'unsafe_rmw'. This approach allows early regression testing with tablets and is meant only as a tentative solution. Fixes scylladb/scylladb#16567 Closes scylladb/scylladb#23662	2025-06-05 17:39:29 +03:00
Dawid Mędrek	9ebd6df43a	locator/production_snitch_base: Reduce log level when property file incomplete We're reducing the log level in case the provided property file is incomplete. The rationale behind this change is related to how CCM interacts with Scylla: * The `GossipingPropertyFileSnitch` reloads the `cassandra-rackdc.properties` configuration every 60 seconds. * When a new node is added to the cluster, CCM recreates the `cassandra-rackdc.properties` file for EVERY node. If those two processes start happening at about the same time, it may lead to Scylla trying to read a not-completely-recreated file, and an error will be produced. Although we would normally fix this issue and try to avoid the race, that behavior will be no longer relevant as we're making the rack and DC values immutable (cf. scylladb/scylladb#23278). What's more, trying to fix the problem in the older versions of Scylla could bring a more serious regression. Having that in mind, this commit is a compromise between making CI less flaky and having minimal impact when backported. We do the same for when the format of the file is invalid: the rationale is the same. We also do that for when there is a double declaration. Although it seems impossible that this can stem from the same scenario the other two errors can (since if the format of the file is valid, the error is justified; if the format is invalid, it should be detected sooner than a doubled declaration), let's stay consistent with the logging level. Fixes scylladb/scylladb#20092 Closes scylladb/scylladb#23956	2025-05-13 13:59:39 +03:00
Tomasz Grabiec	1e407ab4d2	tablets: Equalize per-table balance when allocating tablets for a new table Fixes the following scenario: 1. Scale out adds new nodes to each rack 2. Table is created - all tablets are allocated to new nodes because they have low load 3. Rebalancing moves tablets from old nodes to new nodes - table balance for the new table is not fixed We're wrong to try to equalize global load when allocating tablets, and we should equalize per-table load instead, and let background load balancing fix it in a fair way. It will add to the allocated storage imbalance, but: 1. The table is initially empty, so doesn't impact actual storage imbalance. 2. It's more important to avoid overloading CPU on the nodes - imbalance hurts this aspect immediately. 3. If the table was created before imbalance was formed, we would end up in the same situation in the problematic scenario after the patch. 4. It's the job of the load balancing to keep up with storage growing, and if it's not, scale out should kick in. Before we have CPU-aware tablet allocation, and thus can prove we have CPU capacity on the small nodes, we should respect per-table balance as this is the way in which we achieve full CPU utilization. Fixes #23631	2025-04-17 16:01:23 +02:00
Tomasz Grabiec	2597a7e980	load_sketch: Tolerate missing tablet_map when selecting for a given table To simplify future usage in network_topology_strategy::add_tablets_in_dc() which invokes populate() for a given table, which may be both new and preexisitng.	2025-04-17 16:01:16 +02:00
Benny Halevy	e1fe82ed33	utils: phased_barrier, pluggable: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:47:00 +03:00
Tomasz Grabiec	b5211cca85	Merge 'tablets: rebuild: use repair for tablet rebuild' from Aleksandra Martyniuk Currently, when we rebuild a tablet, we stream data from all replicas. This creates a lot of redundancy, wastes bandwidth and CPU resources. In this series, we split the streaming stage of tablet rebuild into two phases: first we stream tablet's data from only one replica and then repair the tablet. Fixes: https://github.com/scylladb/scylladb/issues/17174. Needs backport to 2025.1 to prevent out of space during streaming Closes scylladb/scylladb#23187 * github.com:scylladb/scylladb: test: add test for rebuild with repair locator: service: move to rebuild_v2 transition if cluster is upgraded locator: service: add transition to rebuild_repair stage for rebuild_v2 locator: service: add rebuild_repair tablet transition stage locator: add maybe_get_primary_replica locator: service: add rebuild_v2 tablet transition kind gms: add REPAIR_BASED_TABLET_REBUILD cluster feature	2025-04-09 21:35:37 +02:00
Benny Halevy	dfdca2d84e	locator: topology: drop unused calculate_datacenters Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#23647	2025-04-08 19:04:56 +03:00
Aleksandra Martyniuk	acd32b24d3	locator: service: move to rebuild_v2 transition if cluster is upgraded If cluster is upgraded to version containing rebuild_v2 transition kind, move to this transition kind instead of rebuild.	2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk	4a847df55c	locator: service: add rebuild_repair tablet transition stage Currently, in the streaming stage of rebuild tablet transition, we stream tablet data from all replicas. This patch series splits the streaming stage into two phases: - repair phase, where we repair the tablet; - streaming phase, where we stream tablet data from one replica. rebuild_repair is a stage that will be used to perform the repair phase. It executes the tablet repair on tablet_info::replicas. A primary replica out of migration_streraming_info::read_from is the repair master. If the repair succeeds, we move to streaming tablet transition stage, and to cleanup_target - if it fails. The repair bypasses the tablet repair scheduler and it does not update the repair_time. A transition to the rebuild_repair stage will be added in the following patches.	2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk	5d6041617b	locator: add maybe_get_primary_replica Add maybe_get_primary_replica to choose a primary replica out of custom replica set.	2025-04-08 10:42:01 +02:00
Aleksandra Martyniuk	ed7b8bb787	locator: service: add rebuild_v2 tablet transition kind Currently, in the streaming stage of rebuild tablet transition, we stream tablet data from all replicas. This patch series splits the streaming stage into two phases: - repair phase, where we repair the tablet; - streaming phase, where we stream tablet data from one replica. To differentiate the two streaming methods, a new tablet transition kind - rebuild_v2 - is added. The transtions and stages for rebuild_v2 transition kind will be added in the following patches.	2025-04-08 10:42:01 +02:00
Botond Dénes	1198213000	Merge 'tablets: Make tablet allocation equalize per-shard load ' from Tomasz Grabiec Before, it was equalizing per-node load (tablet count), which is wrong in heterogeneous clusters. Nodes with fewer shards will end up with overloaded shards. Refs #23378 Closes scylladb/scylladb#23478 * github.com:scylladb/scylladb: tablets: Make tablet allocation equalize per-shard load tablets: load_balancer: Fix reporting of total load per node	2025-04-03 16:32:53 +03:00

1 2 3 4 5 ...

1029 Commits