Commit 60d2cc886a changed
get_all_ranges to return start-bound ranges and pre-calculate
the wrapping range, and changed construct_range_to_endpoint_map
to pass r.start() (which is now always engaged) as the vnode token.
However, as can be seen in token_metadata_impl::first_token,
the token ranges (a.k.a. vnodes) **end** with the sorted tokens,
not start with them, so an arbitrary token t belongs to a
vnode in some range `sorted_tokens[i-1] < t <= sorted_tokens[i]`.
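For illustration, a minimal standalone sketch (not the actual token_metadata_impl code) of that end-bound lookup: the first sorted token >= t closes the vnode that contains t, and past the last token we wrap around to the first one.
```
#include <algorithm>
#include <cstdint>
#include <vector>

using token = int64_t;  // stand-in for dht::token

token first_token(const std::vector<token>& sorted_tokens, token t) {
    // assumes sorted_tokens is non-empty and sorted
    auto it = std::lower_bound(sorted_tokens.begin(), sorted_tokens.end(), t);
    if (it == sorted_tokens.end()) {
        // t falls into the wrapping range, which ends at the first token
        return sorted_tokens.front();
    }
    return *it;
}
```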
Fixes #25541
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes scylladb/scylladb#25580
The central idea of incremental repair is to allow repair participants
to select and repair only a portion of the dataset to speed up the
repair process. All repair participants must utilize an identical
selection method to repair and synchronize the same selected dataset.
There are two primary selection methods: time-based and file-based. The
time-based method selects data within a specified time frame. It is
versatile but it is less efficient because it requires reading all of
the dataset and omitting data beyond the time frame. The file-based
method selects data from unrepaired SSTables and is more efficient
because it allows the entire SSTable to be omitted. This patch
implements the file-based selection method.
Incremental repair will only be supported for tablet tables; it will not
be supported for vnode tables. On one hand, the legacy vnode mode is less
important to support. On the other hand, incremental repair for
vnodes is much harder to implement. With vnodes, an SSTable could contain
data for multiple vnode ranges. When a given vnode range is repaired,
only a portion of the SSTable is repaired. This complicates the
manipulation of SSTables significantly during both repair and
compaction. With tablets, an entire tablet is repaired, so an
SSTable is either fully repaired or not repaired at all, which is a huge
simplification.
This patch uses the repaired_at field from the sstables::statistics component
to mark an SSTable as repaired. It uses a virtual clock as the repair
timestamp, i.e., a monotonically increasing number, for the
repaired_at field of an SSTable and for the sstables_repaired_at column in the
system.tablets table. Note that when an SSTable is not repaired, its
repaired_at field keeps the default value 0. The in-memory
being_repaired field of an SSTable is used to explicitly mark
that the SSTable has been selected for repair. The following variables are
used for incremental repair:
- The on-disk repaired_at field of an SSTable: a 64-bit number that increases sequentially.
- The sstables_repaired_at column added to the system.tablets table: repaired_at <= sstables_repaired_at means the SSTable is repaired.
- The in-memory being_repaired field added to an SSTable: a repair UUID that tells which SSTable has participated in the repair.
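A rough standalone sketch of how these values could combine (the exact rule in the real code may differ; the non-zero check is an assumption reflecting the 0-means-unrepaired default stated above):
```
#include <cstdint>

// On-disk state of an SSTable (stand-in for the real sstables metadata).
struct sstable_state {
    uint64_t repaired_at = 0;   // virtual-clock repair timestamp; 0 = never repaired
};

// sstables_repaired_at is the value stored in the tablet's row in system.tablets.
bool is_repaired(const sstable_state& sst, uint64_t sstables_repaired_at) {
    // repaired_at <= sstables_repaired_at means repaired; the extra non-zero
    // check reflects the 0-means-unrepaired default described above.
    return sst.repaired_at != 0 && sst.repaired_at <= sstables_repaired_at;
}
```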
Initial test results:
1) Medium dataset results
Node amount: 3
Instance type: i4i.2xlarge
Disk usage per node: ~500GB
Cluster pre-populated with ~500GB of data before starting the repair job.
Results for Repair Timings:
The regular repair run took 210 mins.
Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s
The speedup is: 183 mins / 48s = 228X
2) Small dataset results
Node amount: 3
Instance type: i4i.2xlarge
Disk usage per node: ~167GB
Cluster pre-populated with ~167GB of data before starting the repair job.
Regular repair 1st run took 110s, 2nd and 3rd runs took 110s.
Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds.
The speedup is: 110s / 1.5s = 73X
3) Large dataset results
Node amount: 6
Instance type: i4i.2xlarge, 3 racks
50% of base load, 50% read/write
Dataset == Sum of data on each node
Dataset   Non-incremental repair (minutes)
1.3 TiB   31:07
3.5 TiB   25:10
5.0 TiB   19:03
6.3 TiB   31:42

Dataset   Incremental repair (minutes)
1.3 TiB   24:32
3.0 TiB   13:06
4.0 TiB    5:23
4.8 TiB    7:14
5.6 TiB    3:58
6.3 TiB    7:33
7.0 TiB    6:55
Fixes #22472
Closes scylladb/scylladb#24291
* github.com:scylladb/scylladb:
replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair
compaction: Move compaction_reenabler to compaction_reenabler.hh
topology_coordinator: Make rpc::remote_verb_error to warning level
repair: Add metrics for sstable bytes read and skipped from sstables
test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair
test.py: Add tests for tablet incremental repair
repair: Add tablet incremental repair support
compaction: Add tablet incremental repair support
feature_service: Add TABLET_INCREMENTAL_REPAIR feature
tablet_allocator: Add tablet_force_tablet_count_increase and decrease
repair: Add incremental helpers
sstable: Add being_repaired to sstable
sstables: Add set_repaired_at to metadata_collector
mutation_compactor: Introduce add operator to compaction_stats
tablet: Add sstables_repaired_at to system.tablets table
test: Fix drain api in task_manager_client.py
This patch adds incremental repair support in compaction.
- The SSTables are split into repaired and unrepaired sets, as sketched below.
- The repaired and unrepaired sets compact separately.
- The repaired_at field from the SSTable and the sstables_repaired_at column
from the system.tablets table are used to decide if an SSTable is repaired
or not.
- Different compaction tasks, e.g., minor, major, scrub, and split, are
serialized with tablet repair.
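A minimal standalone sketch of the split (assumed names and structures, not the actual compaction code):
```
#include <cstdint>
#include <vector>

struct sstable_state {
    uint64_t repaired_at = 0;   // 0 = never repaired
};

struct compaction_input {
    std::vector<sstable_state> repaired;
    std::vector<sstable_state> unrepaired;
};

// Split the candidate SSTables by repair status so each set can be compacted
// separately; the predicate mirrors the rule sketched earlier.
compaction_input split_for_compaction(const std::vector<sstable_state>& candidates,
                                      uint64_t sstables_repaired_at) {
    compaction_input out;
    for (const auto& sst : candidates) {
        bool repaired = sst.repaired_at != 0 && sst.repaired_at <= sstables_repaired_at;
        (repaired ? out.repaired : out.unrepaired).push_back(sst);
    }
    return out;
}
```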
This change includes basic optimizations to
locator::describe_ring, mainly caching the per-endpoint
information in an unordered_map instead of looking
it up in every inner loop.
This yields a 20% improvement in CPU time.
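Illustrative sketch only (names are assumptions, not the actual describe_ring code) of the caching idea:
```
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for the per-endpoint details that describe_ring reports.
struct endpoint_details {
    std::string host;
    std::string dc;
    std::string rack;
};

// Build the cache once per call, then do one hash lookup per replica in the
// inner loop instead of re-querying the topology each time.
std::vector<endpoint_details> details_for_replicas(
        const std::vector<std::string>& replicas,
        const std::unordered_map<std::string, endpoint_details>& cache) {
    std::vector<endpoint_details> out;
    out.reserve(replicas.size());
    for (const auto& ep : replicas) {
        out.push_back(cache.at(ep));
    }
    return out;
}
```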
With 45 nodes organized as 3 dcs, 3 racks per dc, 5 nodes per rack, 256 tokens per
node, yielding 11520 ranges and 9 replicas per range, describe_ring took
Before: 30 milliseconds (2.6 microseconds per range)
After: 24 milliseconds (2.1 microseconds per range)
Add a corresponding unit test of describe_ring for tablets.
A unit test for vnodes already exists in
test/nodetool/test_describering.py
Fixes #24887
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
First, let get_all_ranges return all vnode ranges
with a corrected wrapping range covering the [last token, first token)
range, such that all range start tokens are vnode tokens
and must be in the vnode replication map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Prevent a crash, especially in the is_vnode=true case,
if the key_token is not found in the map.
Rather than the undefined behavior when dereferencing the
end() iterator, throw an internal error with additional
logging about the search logic and parameters.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Some callers, like `construct_range_to_endpoint_map` for describe_ring,
or `get_secondary_ranges` for alternator ttl pass vnode tokens (the
vnodes' start token), and therefore can benefit from the fast lookup
path in `vnode_effective_replication_map::do_get_replicas`.
Otherwise the vnode token is binary-searched in sorted_tokens using
token_metadata::first_token().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Derive both vnode_effective_replication_map
and local_effective_replication_map from
static_effective_replication_map as both are static and per-keyspace.
However, local_effective_replication_map does not need vnodes
for the mapping of all tokens to the local node.
Note that everywhere_replication_strategy is not abstracted in a similar
way, although it could be, since the plan is to get rid of it
once all system keyspaces are converted to local or tablets replication
(and propagated everywhere if needed using raft group0).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
create_effective_replication_map need not know about the internals of
vnode_effective_replication_map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
to make_vnode_effective_replication_map_ptr since
it is specific to vnode_effective_replication_map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
to calculate_vnode_effective_replication_map since
it is specific to vnode-based range calculations.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
to create_static_effective_replication_map, in preparation
for separating local_effective_replication_map from
vnode_effective_replication_map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
to static_effective_replication_map_ptr, in preparation
for separating local_effective_replication_map from
vnode_effective_replication_map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
to global_static_effective_replication_map, in preparation
for separating local_effective_replication_map from
vnode_effective_replication_map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
to get_static_effective_replication_map, in preparation
for separating local_effective_replication_map from
vnode_effective_replication_map (both are per-keyspace).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Prefer specializing the local replication strategy,
local effective replication map, et al. by defining
an is_local() predicate, similar to uses_tablets().
Note that is_vnode_based() still applies to local replication
strategy.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, we use storage_proxy/get_cas_shard ->
sharder.shard_for_reads to decide which shard to use for LWT code
execution on both replicas and the coordinator.
If the coordinator is not a replica, shard_for_reads returns 0 —
the 'default' shard. This behavior has at least two problems:
* Shard 0 may become overloaded, because all LWT coordinators that are
not replicas will be served on it.
* The zero shard does not match shard_for_reads on replicas, which
hinders the "same shard for client and server" RPC-level optimization.
To fix this, we need to know whether the current node hosts a replica
for the tablet corresponding to the given token. Currently, there is
no API we could use for this. For historical reasons,
sharder::shard_for_reads returns 0 when the node does not host the
shard, which leads to ambiguity.
This commit introduces try_get_shard_for_reads, which returns a
disengaged std::optional when the tablet is not present on
the local node.
We leave the shard_for_reads method in the base sharder class; it calls
try_get_shard_for_reads and returns zero by default. We need to rename
the tablet_sharder private methods shard_for_reads and shard_for_writes
so that they don't conflict with sharder::shard_for_reads.
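A minimal sketch (assumed signatures) of the resulting API shape:
```
#include <optional>

using shard_id = unsigned;
using dht_token = long;   // stand-in for dht::token

class sharder {
public:
    virtual ~sharder() = default;

    // Disengaged when the local node hosts no replica for the token's tablet.
    virtual std::optional<shard_id> try_get_shard_for_reads(dht_token t) const = 0;

    // Legacy entry point: keeps the old behavior of falling back to shard 0.
    shard_id shard_for_reads(dht_token t) const {
        return try_get_shard_for_reads(t).value_or(0);
    }
};
```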
This PR enables **LWT (Lightweight Transactions)** support for tablet-based tables by leveraging **colocated tables**.
Currently, storing Paxos state in system tables causes two major issues:
* **Loss of Paxos state during tablet migration or base table rebuilds**
* When a tablet is migrated or the base table is rebuilt, system tables don't retain Paxos state.
* This breaks LWT correctness in certain scenarios.
* Failing test cases demonstrating this:
* test_lwt_state_is_preserved_on_tablet_migration
* test_lwt_state_is_preserved_on_rebuild
* **Shard misalignment and performance overhead**
* Tablets may be placed on arbitrary shards by the tablet balancer.
* Accessing Paxos state in system tables could require a shard jump, degrading performance.
We move Paxos state into a dedicated Paxos table, colocated with the base table:
* Each base table gets its own Paxos state table.
* This table is lazily created on the first LWT operation.
* Its tablets are colocated with those of the base table, ensuring:
* Co-migration during tablet movement
* Co-rebuilding with the base table
* Shard alignment for local access to Paxos state
Some reasoning for why this is sufficient to preserve LWT correctness is discussed in [2].
This PR addresses two issues from the "Why doesn't it work for tablets" section in [1]:
* Tablet migration vs LWT correctness
* Paxos table sharding
Other issues ("bounce to shard" and "locking for intranode_migration") have already been resolved in previous PRs.
References
[1] - [LWT over tablets design](https://docs.google.com/document/d/1CPm0N9XFUcZ8zILpTkfP5O4EtlwGsXg_TU4-1m7dTuM/edit?tab=t.0#heading=h.goufx7gx24yu)
[2] - [LWT: Paxos state and tablet balancer](https://docs.google.com/document/d/1-xubDo612GGgguc0khCj5ukmMGgLGCLWLIeG6GtHTY4/edit?tab=t.0)
[3] - [Colocated tables PR](https://github.com/scylladb/scylladb/pull/22906#issuecomment-3027123886)
[4] - [Possible LWT consistency violations after a topology change](https://github.com/scylladb/scylladb/issues/5251)
Backport: not needed because this is a new feature.
Closes scylladb/scylladb#24819
* github.com:scylladb/scylladb:
create_keyspace: fix warning for tablets
docs: fix lwt.rst
docs: fix tablets.rst
alternator: enable LWT
random_failures: enable execute_lwt_transaction
test_tablets_lwt: add test_paxos_state_table_permissions
test_tablets_lwt: add test_lwt_for_tablets_is_not_supported_without_raft
test_tablets_lwt: test timeout creating paxos state table
test_tablets_lwt: add test_lwt_concurrent_base_table_recreation
test_tablets_lwt: add test_lwt_state_is_preserved_on_rebuild
test_tablets_lwt: migrate test_lwt_support_with_tablets
test_tablets_lwt: add test_lwt_state_is_preserved_on_tablet_migration
test_tablets_lwt: add simple test for LWT
check_internal_table_permissions: handle Paxos state tables
client_state: extract check_internal_table_permissions
paxos_store: handle base table removal
database: get_base_table_for_tablet_colocation: handle paxos state table
paxos_state: use node_local_only mode to access paxos state
query_options: add node_local_only mode
storage_proxy: handle node_local_only in query
storage_proxy: handle node_local_only in mutate
storage_proxy: introduce node_local_only flag
abstract_replication_strategy: remove unused using
storage_proxy: add coordinator_mutate_options
storage_proxy: rename create_write_response_handler -> make_write_response_handler
storage_proxy: simplify mutate_prepare
paxos_state: lazily create paxos state table
migration_manager: add timeout to start_group0_operation and announce
paxos_store: use non-internal queries
qp: make make_internal_options public
paxos_store: conditional cf_id filter
paxos_store: coroutinize
feature_service: add LWT_WITH_TABLETS feature
paxos_state: inline system_keyspace functions into paxos_store
paxos_state: extract state access functions into paxos_store
As copies of the tablet_map are wasteful in many cases, it is better
to move the tablet_map if possible, or clone
it gently in an async fiber.
Add clone() and clone_gently() methods to
allow explicit copies.
* minor optimization, no backport needed
Closes scylladb/scylladb#24978
* github.com:scylladb/scylladb:
tablets: prevent accidental copy of tablets_map
locator: tablets: get rid of synchronous mutate_tablet_map
The `token_metadata_impl` stores the sorted tokens in an `std::vector`.
With a large number of nodes, the size of this vector can grow quickly,
and updating it might lead to oversized allocations.
This commit changes `_sorted_tokens` to a `chunked_vector` to avoid such
issues. It also updates all related code to use `chunked_vector` instead
of `std::vector`.
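A simplified sketch of the idea behind a chunked vector (not the actual utils::chunked_vector implementation):
```
#include <cstddef>
#include <memory>
#include <vector>

// Elements live in fixed-size chunks, so growth allocates one small chunk at
// a time instead of one ever-larger contiguous buffer.
template <typename T, std::size_t ChunkElems = 1024>
class chunked_vector_sketch {
    std::vector<std::unique_ptr<T[]>> _chunks;
    std::size_t _size = 0;
public:
    void push_back(const T& v) {
        if (_size % ChunkElems == 0) {
            _chunks.push_back(std::make_unique<T[]>(ChunkElems));
        }
        _chunks[_size / ChunkElems][_size % ChunkElems] = v;
        ++_size;
    }
    const T& operator[](std::size_t i) const {
        return _chunks[i / ChunkElems][i % ChunkElems];
    }
    std::size_t size() const { return _size; }
};
```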
Fixes #24876
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closes scylladb/scylladb#25027
As copies of the tablet_map are wasteful in many cases, it is better
to move the tablet_map if possible, or clone
it gently in an async fiber.
Add clone() and clone_gently() methods to
allow explicit copies.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is currently used only by tests that could very well
do with mutate_tablet_map_async.
This will simplify the following patch to prevent
accidental copy of the tablet_map, providing explicit
clone/clone_gently methods.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
As requested in #22114, moved the files and fixed other includes and build system.
Moved files:
- interval.hh
- Map_difference.hh
Fixes: #22114
This is a cleanup, no need to backport
Closes scylladb/scylladb#25095
No need to wait for all members to be cleared gently.
We can release the version earlier since the
held version may be waited for in barriers.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We have a lot of places in the code where
a token_metadata_ptr is kept in an automatic
variable and destroyed when it leaves the scope.
Since it's a reference-counted lw_shared_ptr,
the token_metadata object is rarely destroyed in
those cases, but when it is, it doesn't go through
clear_gently, and in particular its tablet_metadata
is not cleared gently, leading to inefficient destruction
of potentially many foreign_ptr:s.
This patch calls clear_and_destroy_impl, which gently
clears and destroys the impl object in the background
using the shared_token_metadata.
Fixes #13381
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
So we can use the local shared_token_metadata instance
for safe background destroy of token_metadata_impl:s.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Sort all tablet_map_ptr:s by shard_id
and then destroy them on each shard to prevent
long cross-shard task queues for foreign_ptr destructions.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add the option to co-locate tablets of different tables. For example, a base table and its CDC table, or a local index.
main changes and ideas:
* "table group" - a set of one or more tables that should be co-located. (Example: base table and CDC table). A group consists of one base table and zero or more children tables.
* new column `base_table` in `system.tablets`: when creating a new table, it can be set to point to a base table, which the new table's tablets will be co-located with. When it's set, the tablet map information should be retrieved from the base table's map; the child map doesn't contain per-tablet information.
* co-located tables always have the same tablet count and the same tablet replicas. each tablet operation - migration, resize, repair - is applied on all tablets in a synchronized manner by the topology coordinator.
* resize decision for a group is made by combining the per-table hints and comparing the average tablet size (over all tablets in the group) with the target tablet size.
* the tablets load balancer works with the base table as a representative of the group. it represents a single migration unit with some `group_size` that is taken into account.
* view tablets are co-located with base tablets when the partition keys match.
Fixes https://github.com/scylladb/scylladb/issues/17043
Backport is not needed; this is preliminary work for supporting MVs and CDC with tablets.
Closes scylladb/scylladb#22906
* github.com:scylladb/scylladb:
tablets: validate no clustering row mutations on co-located tables
raft_group0_client: extend validate_change to mixed_change type
docs: topology-over-raft: document co-located tables
tablet-mon.py: visual indication for co-located tablets
tablet-mon.py: handle co-located tablets
test/boost/view_schema_test.cc: fix race in wait_until_built
boost/tablets_test: test load balancing and resize of co-located tablets
test/tablets: test tablets colocation
tablets: co-locate view tablets with base when the partition keys match
test/pylib/tablets: common get_tablet_count api
test_mv_tablets: use get_tablet_replicas from common tablets api
test/pylib/tablets: fix test api to read tablet replicas from base table
tablets: allocator: create co-located tables in a single operation
alternator: prepare all new tables in a single announcement
migration_manager: add notification for creating multiple tables
tablets: read_tablet_transition_stage: read from base table
storage service: allow repair request only on base tables
tablets: keyspace_rf_change: apply on base table
storage service: generate tablet migration updates on base tables
tablets: replace all_tables method
tablets: split when all co-located tablets are ready
tablets: load balancer: sizing plan for table groups
tablets: load balancer: handle co-located tablets
tablets: allocate co-located tablets
tablets: handle migration of co-located tablets
storage service: add repair colocated tablets rpc
tablets: save and read tablet metadata of co-located tables
tablets: represent co-located tables in tablet metadata
tablets: add base_table column to system.tablets
docs: update system.tablets schema
The method all_tables in tablet_metadata is used for iterating over all
tables in the tablet metadata with their tablet maps.
Now that we have co-located tables, we need to distinguish
which tables we want to iterate over. In some cases we want to iterate
over each group of co-located tables, treating them as one unit, and in
other cases we want to iterate over all tables, regardless of whether they
are part of a co-located group and have a base table.
We replace all_tables with new methods that can be used for each of the
cases.
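A rough sketch, with hypothetical method names, of the two iteration modes:
```
#include <functional>
#include <unordered_map>
#include <unordered_set>

using table_id = unsigned long;
struct tablet_map {};

struct tablet_metadata_view {
    // base table -> the group's shared tablet map
    std::unordered_map<table_id, tablet_map> maps;
    // base table -> all tables in its co-location group (including the base)
    std::unordered_map<table_id, std::unordered_set<table_id>> groups;

    // One callback per co-location group, keyed by its base table.
    void for_each_table_group(const std::function<void(table_id, const tablet_map&)>& f) const {
        for (const auto& [base, tmap] : maps) {
            f(base, tmap);
        }
    }

    // One callback per table; co-located tables reuse the base table's map.
    void for_each_table(const std::function<void(table_id, const tablet_map&)>& f) const {
        for (const auto& [base, tables] : groups) {
            for (auto t : tables) {
                f(t, maps.at(base));
            }
        }
    }
};
```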
This PR is a step towards enabling LWT for tablet-based tables.
It pursues several goals:
* Make it explicit that the tablet can't migrate after the `cas_shard` check in `select_statement/modification_statement`. Currently, `storage_proxy::cas` expects that the client calls it on the correct shard -- the one which owns the partition key the LWT is running on. The reasons for that are explained in [this commit](f16e3b0491 (diff-1073ea9ce4c5e00bb6eb614154f523ba7962403a4fe6c8cd877d1c8b73b3f649)) message. The statements check the current shard and invoke `bounce_to_shard` if it's not the right one. However, the erm strong pointer is only captured in `storage_proxy::cas`, and until that moment there is no explicit structure in the code which would prevent the ongoing migrations. In this PR we introduce such a structure -- `erm_handle`. We create it before the `cas_check` and pass it down to `storage_proxy::cas` and `paxos_response_handler`.
* Another goal of this PR is an optimization -- we don't want to hold the erm for the duration of the entire LWT, unless it directly affects the current tablet. There is a `tablet_metadata_guard` class which is used for long-running tablet operations. It automatically switches to a new erm if the topology change represented by the new erm doesn't affect the current tablet. We use this class in `erm_handle` if the table uses tablets. Otherwise, `erm_handle` just stores the erm directly.
* Fixes [shard bouncing issue in alternator](https://github.com/scylladb/scylladb/issues/17399)
Backport: not needed (new feature).
Closes scylladb/scylladb#24495
* github.com:scylladb/scylladb:
LWT: make cas_shard non-optional in sp::cas
LWT: create cas_shard in select_statement
LWT: create cas_shard in modification and batch statements
LWT: create cas_shard in alternator
LWT: use cas_shard in storage_proxy::cas
do_query_with_paxos: remove redundant cas_shard check
storage_proxy: add cas_shard class
sp::cas_shard: rename to get_cas_shard
token_metadata_guard: a topology guard for a token
tablet_metadata_guard: mark as noncopyable and nonmoveable
Modify tablet_metadata to be able to represent co-located tables.
The new method set_colocated_table adds to tablet_metadata a table which
is co-located with another table. A co-located table shares the tablet
map object with the base table, so we just create a copy of the shared
tablet map pointer and store it as the co-located table's tablet map.
Whenever a tablet map is modified we update the pointer for all the
co-located tables accordingly, so the tablet map remains shared.
We add some data structures to tablet_metadata to be able to work with
co-located table groups efficiently:
* `_table_groups` maps every base table to all tables in its
co-location group. This is convenient for iterating over all table
groups, or finding all tables in some group.
* `_base_table` maps a co-located table to its base table.
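A simplified sketch (assumed member names and pointer type) of how the shared tablet map could be maintained:
```
#include <memory>
#include <unordered_map>
#include <unordered_set>

using table_id = unsigned long;
struct tablet_map {};
using tablet_map_ptr = std::shared_ptr<const tablet_map>;

struct tablet_metadata_sketch {
    std::unordered_map<table_id, tablet_map_ptr> maps;                       // per-table pointer
    std::unordered_map<table_id, std::unordered_set<table_id>> table_groups; // base -> children
    std::unordered_map<table_id, table_id> base_table;                       // child -> base

    void set_colocated_table(table_id child, table_id base) {
        base_table[child] = base;
        table_groups[base].insert(child);
        maps[child] = maps.at(base);   // share the base table's tablet map
    }

    void set_tablet_map(table_id base, tablet_map_ptr new_map) {
        maps[base] = new_map;
        for (auto child : table_groups[base]) {
            maps[child] = new_map;     // keep the whole group pointing at one map
        }
    }
};
```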
The following was seen:
```
!WARNING | scylla[6057]: [shard 12:strm] seastar_memory - oversized allocation: 212992 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at
[Backtrace #0]
void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89
(inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:99
seastar::current_tasktrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:136
seastar::current_backtrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:169
seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:848
seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:911
operator new(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:1706
std::allocator<dht::token_range_endpoints>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/allocator.h:196
(inlined by) std::allocator_traits<std::allocator<dht::token_range_endpoints> >::allocate(std::allocator<dht::token_range_endpoints>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/alloc_traits.h:515
(inlined by) std::_Vector_base<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:380
(inlined by) void std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_realloc_append<dht::token_range_endpoints const&>(dht::token_range_endpoints const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/vector.tcc:596
locator::describe_ring(replica::database const&, gms::gossiper const&, seastar::basic_sstring<char, unsigned int, 15u, true> const&, bool) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:1294
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242
(inlined by) seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:80
seastar::reactor::do_run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:2635
std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:4684
```
Fix by using chunked_vector.
Fixes #24158
Closes scylladb/scylladb#24561
Data-plane requests typically hold a strong pointer to the
effective_replication_map (ERM) to protect against tablet
migrations and other topology operations. This works because
major steps in the topology coordinator use global barriers.
These barriers install a new token_metadata version on
each shard and wait for all references to the old one to
be dropped. Since the ERM holds a strong pointer to
token_metadata, it effectively blocks these operations
until it's released.
For LWT, we usually deal with a single token within a
single tablet. In such cases, it's enough to block
topology changes for just that one tablet. The existing
tablet_metadata_guard class already supports this: it tracks
tablet-specific changes and updates the ERM pointer
automatically, unless the change affects the guarded
tablet. However, this only works for tablet-aware tables.
To support LWT with vnodes (i.e., non-tablet-aware tables),
this commit introduces a new token_metadata_guard class.
It wraps tablet_metadata_guard when the table uses tablets,
and falls back to holding a plain strong ERM pointer otherwise.
In the next commits, we’ll migrate LWT to use token_metadata_guard
in paxos_response_handler instead of erm.
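A minimal sketch (assumed types and names, not the actual ScyllaDB classes) of the wrapping described above:
```
#include <memory>
#include <utility>
#include <variant>

struct effective_replication_map {
    bool uses_tablets = false;
};
using erm_ptr = std::shared_ptr<const effective_replication_map>;

// Placeholder for the real class that tracks a single tablet and swaps in
// new erm versions when an unrelated topology change happens.
struct tablet_metadata_guard {
    erm_ptr erm;
};

using token_metadata_guard = std::variant<tablet_metadata_guard, erm_ptr>;

token_metadata_guard make_token_metadata_guard(erm_ptr erm) {
    if (erm->uses_tablets) {
        // Tablet table: only block topology changes affecting this tablet.
        return tablet_metadata_guard{std::move(erm)};
    }
    // Vnode table: hold the erm strongly for the whole operation.
    return erm;
}
```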
tablet_metadata_guard passes a raw pointer to get_validity_abort_source,
so it can't be easily copied or moved. In this commit we make this
explicit.
We define the destructor in the .cc file -- the autogenerated one complains
about lw_shared_ptr<replica::table>, as replica::table is only
forward-declared in the headers.
When a tablet is migrated and cleaned up, deallocate the tablet storage
group state on `end_migration` stage, instead of `cleanup` stage:
* When the stage is updated from `cleanup` to `end_migration`, the
storage group is removed on the leaving replica.
* When the table is initialized, if the tablet stage is `end_migration`
then we don't allocate a storage group for it. This happens for
example if the leaving replica is restarted during tablet migration.
If it's initialized in `cleanup` stage then we allocate a storage
group, and it will be deallocated when transitioning to
`end_migration`.
This guarantees that the storage group is always deallocated on the
leaving replica by `end_migration`, and that it is always allocated if
the tablet wasn't cleaned up fully yet.
It is a similar case also for the pending replica when the migration is
aborted. We deallocate the state on `revert_migration` which is the
stage following `cleanup_target`.
Previously the storage group would be allocated when the tablet is
initialized on any of the tablet replicas - also on the leaving replica,
and when the tablet stage is `cleanup` or `end_migration`, and
deallocated during `cleanup`.
This fixes the following issue:
1. A migrating tablet enters the cleanup stage.
2. The tablet is cleaned up successfully.
3. The leaving replica is restarted and allocates a storage group.
4. Tablet cleanup is not called because it was already cleaned up.
5. The storage group remains allocated on the leaving replica after the
migration is completed - it's not cleaned up properly.
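A minimal sketch (assumed enum and function names) of the allocation rule on the leaving replica described above:
```
enum class tablet_transition_stage { none, cleanup, end_migration, cleanup_target, revert_migration };

bool allocate_storage_group_on_leaving_replica(tablet_transition_stage stage) {
    // By end_migration the storage group has already been removed (or, after a
    // restart, must not be recreated); during cleanup it still exists and will
    // be deallocated when the stage moves to end_migration.
    return stage != tablet_transition_stage::end_migration;
}
```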
Fixes scylladb/scylladb#23481
Support for TTL-based data removal when using tablets.
The essence of this commit is a separate code path for finding token
ranges owned by the current shard for the cases when tablets are used
and not vnodes. At the same time, the vnodes-case is not touched not to
cause any regressions.
The TTL-caused data removal is normally performed by the primary
replica (both when using vnodes and tablets). For the tablets case,
the already-existing method tablet_map::get_primary_replica(tablet_id)
is used to know if a shard executing the TTL-related data removal is
the primary replica for each tablet.
A new method tablet_map::get_secondary_replica(tablet_id) has been
added. It is needed by the data invalidation procedure to remove data
when the primary replica node is down - the data is then removed by the
secondary replica node. The mechanism is the same as in the vnodes case.
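A hedged sketch (assumed types; the actual code uses tablet_map::get_primary_replica / get_secondary_replica) of the replica-selection check:
```
struct tablet_replica {
    unsigned host;
    unsigned shard;
};

bool should_expire_here(const tablet_replica& primary,
                        const tablet_replica& secondary,
                        const tablet_replica& self,
                        bool primary_node_is_down) {
    // The primary replica normally performs the removal; the secondary takes
    // over only when the primary node is down.
    const tablet_replica& owner = primary_node_is_down ? secondary : primary;
    return owner.host == self.host && owner.shard == self.shard;
}
```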
Since alternator now supports TTL, the test
`test_ttl_enable_error_with_tablets` has been removed.
Also, the tests in test_ttl.py have been made to run twice, once with
vnodes and once with tablets. When run with tablets, due to the lack of
support for LWT with tablets (#18068), the tests use
'system:write_isolation' of 'unsafe_rmw'. This approach allows early
regression testing with tablets and is meant only as a tentative
solution.
Fixes scylladb/scylladb#16567
Closes scylladb/scylladb#23662
We're reducing the log level in case the provided property file is incomplete.
The rationale behind this change is related to how CCM interacts with Scylla:
* The `GossipingPropertyFileSnitch` reloads the `cassandra-rackdc.properties`
configuration every 60 seconds.
* When a new node is added to the cluster, CCM recreates the
`cassandra-rackdc.properties` file for EVERY node.
If those two processes start happening at about the same time, it may lead
to Scylla trying to read a not-completely-recreated file, and an error will
be produced.
Although we would normally fix this issue and try to avoid the race, that
behavior will no longer be relevant as we're making the rack and DC values
immutable (cf. scylladb/scylladb#23278). What's more, trying to fix the problem
in the older versions of Scylla could bring a more serious regression. Having
that in mind, this commit is a compromise between making CI less flaky and
having minimal impact when backported.
We do the same for when the format of the file is invalid: the rationale
is the same.
We also do that for when there is a double declaration. Although it seems
impossible that this can stem from the same scenario the other two errors
can (since if the format of the file is valid, the error is justified;
if the format is invalid, it should be detected sooner than a doubled
declaration), let's stay consistent with the logging level.
Fixes scylladb/scylladb#20092
Closes scylladb/scylladb#23956
Fixes the following scenario:
1. Scale out adds new nodes to each rack
2. Table is created - all tablets are allocated to new nodes because they have low load
3. Rebalancing moves tablets from old nodes to new nodes - table balance for the new table is not fixed
We're wrong to try to equalize global load when allocating tablets,
and we should equalize per-table load instead, and let background load
balancing fix it in a fair way. It will add to the allocated storage
imbalance, but:
1. The table is initially empty, so doesn't impact actual storage imbalance.
2. It's more important to avoid overloading CPU on the nodes - imbalance hurts this aspect immediately.
3. If the table was created before imbalance was formed, we would end up in the same situation in the problematic scenario after the patch.
4. It's the job of the load balancing to keep up with storage growing, and if it's not, scale out should kick in.
Before we have CPU-aware tablet allocation, and thus can prove we have
CPU capacity on the small nodes, we should respect per-table balance
as this is the way in which we achieve full CPU utilization.
Fixes #23631
To simplify future usage in
network_topology_strategy::add_tablets_in_dc(), which invokes
populate() for a given table, which may be either new or preexisting.
Currently, when we rebuild a tablet, we stream data from all
replicas. This creates a lot of redundancy, wastes bandwidth
and CPU resources.
In this series, we split the streaming stage of tablet rebuild into
two phases: first we stream tablet's data from only one replica
and then repair the tablet.
Fixes: https://github.com/scylladb/scylladb/issues/17174.
Needs backport to 2025.1 to prevent running out of space during streaming
Closes scylladb/scylladb#23187
* github.com:scylladb/scylladb:
test: add test for rebuild with repair
locator: service: move to rebuild_v2 transition if cluster is upgraded
locator: service: add transition to rebuild_repair stage for rebuild_v2
locator: service: add rebuild_repair tablet transition stage
locator: add maybe_get_primary_replica
locator: service: add rebuild_v2 tablet transition kind
gms: add REPAIR_BASED_TABLET_REBUILD cluster feature
Currently, in the streaming stage of rebuild tablet transition,
we stream tablet data from all replicas.
This patch series splits the streaming stage into two phases:
- repair phase, where we repair the tablet;
- streaming phase, where we stream tablet data from one replica.
rebuild_repair is a stage that will be used to perform the repair
phase. It executes the tablet repair on tablet_info::replicas.
A primary replica out of migration_streaming_info::read_from is
the repair master. If the repair succeeds, we move to the streaming
tablet transition stage, and to cleanup_target if it fails.
The repair bypasses the tablet repair scheduler and it does not update
the repair_time.
A transition to the rebuild_repair stage will be added in the following
patches.
Currently, in the streaming stage of rebuild tablet transition,
we stream tablet data from all replicas.
This patch series splits the streaming stage into two phases:
- repair phase, where we repair the tablet;
- streaming phase, where we stream tablet data from one replica.
To differentiate the two streaming methods, a new tablet transition
kind - rebuild_v2 - is added.
The transitions and stages for the rebuild_v2 transition kind will be
added in the following patches.
Before, it was equalizing per-node load (tablet count), which is wrong
in heterogeneous clusters. Nodes with fewer shards will end up with
overloaded shards.
Refs #23378
Closes scylladb/scylladb#23478
* github.com:scylladb/scylladb:
tablets: Make tablet allocation equalize per-shard load
tablets: load_balancer: Fix reporting of total load per node