This patch adds
- Changes in sstables_loader::restore_tablets() method
It populates the system_distributed_keyspace.snapshot_sstables table
with the information read from the manifest
- Implementation of tablet_restore_task_impl::run() method
It emplaces a bunch of tablet migrations with "restore" kind
- Topology coordinator handling of tablet_transition_stage::restore
When seen, the coordinator calls RESTORE_TABLET RPC against all tablet
replicas
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When doing cluster-wide restore using topology coordinator, the
coordinator will need to serve a bunch of new tablet transition kinds --
the restore one. For that, it will need to receive information about
from where to perform the restore -- the endpoint and bucket pair. This
data can be grabbed from nowhere but the tablet transition itself, so
add the "restore_config" member with this data.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently, the manifest advertises "powof2", which is wrong for
arbitrary count and boundaries.
Introduce a new kind of layout called "arbitrary", and produce it if
the tablet map doesn't conform to "powof2" layout.
We should also produce tablet boundaries in this case, but that's
worked on in a different PR: https://github.com/scylladb/scylladb/pull/28525
This is a step towards more flexibility in managing tablets. A
prerequisite before we can split individual tablets, isolating hot
partitions, and evening-out tablet sizes by shifting boundaries.
After this patch, the system can handle tables with arbitrary tablet
count. Tablet allocator is still rounding up desired tablet count to
the nearest power of two when allocating tablets for a new table, so
unless the tablet map is allocated in some other way, the counts will
be still a power of two.
We plan to utilize arbitrary count when migrating from vnodes to
tablets, by creating a tablet map which matches vnode boundaries.
One of the reasons we don't give up on power-of-two by default yet is
that it creates an issue with merges. If tablet count is odd, one of
the tablets doesn't have a sibling and will not be merged. That can
obviously cause imbalance of token space and tablet sizes between
tablets. To limit the impact, this patch dynamically chooses which
tablet to isolate when initiating a merge. The largest tablet is
chosen, as that will minimize imbalance. Otherwise, if we always chose
the last tablet to isolate, its size would remain the same while other
tablets double in size with each odd-count merge, leading to
imbalance. The imbalance will still be there, but the difference in
tablet sizes is limited to 2x.
Example (3 tablets):
[0] owns 1/3 of tokens
[1] owns 1/3 of tokens
[2] owns 1/3 of tokens
After merge:
[0] owns 2/3 of tokens
[1] owns 1/3 of tokens
What we would like instead:
Step 1 (split [1]):
[0] owns 1/3 of tokens
[1] old 1.left, owns 1/6 of tokens
[2] old 1.right, owns 1/6 of tokens
[3] owns 1/3 of tokens
Step 2 (merge):
[0] owns 1/2 of tokens
[1] owns 1/2 of tokens
To do that, we need to be able to split individual tablets, but we're
not there yet.
There are several reasons we want to do that.
One is that it will give us more flexibility in distributing the
load. We can subdivide tablets at any points, and achieve more
evenly-sized tablets. In particular, we can isolate large partitions
into separate tablets.
Another reason is vnode-to-tablet migration. We could construct a
tablet map which matches exactly the vnode boundaries, so migration
can happen transparently from the CQL-coordinator's point of view.
Implementation details:
We store a vector of tokens which represent tablet boundaries in the
tablet_id_map. tablet_id keeps its meaning, it's an index into vector
of tablets. To avoid logarithmic lookup of tablet_id from the token,
we introduce a lookup structure with power-of-two aligned buckets, and
store the tablet_id of the tablet which owns the first token in the
bucket. This way, lookup needs to consider tablet id range which
overlaps with one bucket. If boundaries are more or less aligned,
there are around 1-2 tablets overlapping with a bucket, and the lookup
is still O(1).
Amount of memory used increased, but not significantly relative to old
size (because tablet_info is currently fat):
For 131'072 tablets:
Before:
Size of tablet_metadata in memory: 57456 KiB
After:
Size of tablet_metadata in memory: 59504 KiB
And reimplement existing split-related methods around it.
This way we avoid calling dht::compaction_group_of(), and
assuming anything about tablet boundaries or tablet count
being a power of two.
This will make later refactoring easier.
Add `tablet_map::get_tablet_range_side(token)` to compute the
post-split range side without computing the tablet id.
Pure addition, no behavior change.
The function tablet_map::get_secondary_replica() is used by Alternator
TTL to choose a node different from get_primary_replica(). Unfortunately,
recently (commits 817fdad and d88037d) the implementation of the latter
function changed, without changing the former. So this patch changes
the former to match.
The next two patches will have two tests that fail before this patch,
and pass with it:
1. A unit test that checks that get_secondary_replica() returns a
different node than get_primary_replica().
2. An Alternator TTL test that checks that when a node is down,
expirations still happen because the secondary replica takes over
the primary replica's work.
Fixes SCYLLADB-777
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Fix a subtle but damaging failure mode in the tablet migration state machine: when a barrier fails, the follow-up barrier is triggered asynchronously, and cleanup can get skipped for that iteration. On the next loop, the original failure may no longer be visible (because the failing node got excluded), so the tablet can incorrectly move forward instead of entering `cleanup_target`.
To make cleanup reliable this PR:
Adds an additional “fallback cleanup” stage
- `write_both_read_old_fallback_cleanup`
that does not modify read/write selectors. This stage is safe to enter immediately after a barrier failure, and it funnels the tablet into cleanup with the required barriers.
Avoids changing both read and write selectors in a single step transitioning from `write_both_read_new` to `cleanup_target`. The fallback path updates selectors in a safe order: read first, then write.
Allows a direct no-barrier transition from `allow_write_both_read_old` to `cleanup_target` after failure, because in that specific case `cleanup_target` doesn’t change selectors and the hop is safe.
No need for backport. It's an improvement. Currently, tablets transition to `cleanup_target` eventually via failed streaming.
Closesscylladb/scylladb#28169
* github.com:scylladb/scylladb:
topology_coordinator: add write_both_read_old_fallback_cleanup state
topology_coordinator: allow cleanup_target transition from streaming/rebuild_repair without barrier
topology_coordinator: allow cleanup_target transition without barrier after failure in write_both_read_old
topology_coordinator: allow cleanup_target transition without barrier after failure in allow_write_both_read_old
Yet another barrier-failure scenario exists in the `write_both_read_new`
state. When the barrier fails, the tablet is expected to transition
to `cleanup_target`, but because barrier execution is asynchronous,
the cleanup transition can be skipped entirely and the tablet may
continue forward instead.
Both `write_both_read_new` and `cleanup_target` modify read and write
selectors. In this situation, a barrier is required, and transitioning
directly between these states without one is unsafe.
Introduce an intermediate `write_both_read_old_fallback_cleanup`
state that modifies only a read selector and can be entered without
a barrier (there is no need to wait for all nodes to start using the
"new" read selector). From there, the tablet can proceed to `cleanup_target`,
where the required barriers are enforced.
This also avoids changing both selectors in a single step. A direct
transition from `write_both_read_new` to `cleanup_target` updates
both selectors at once, which can leave coordinators using the old
selector for writes and the new selector for reads, causing reads to
miss preceding writes.
By routing through the fallback state, selectors are updated in
order—read first, then write—preserving read-after-write correctness.
In 8df61f6d99 we changed the requirements for creating materialized
views and MV-based indexes - instead of requiring the
rf_rack_valid_keyspaces flag to be set, we now require the keyspace to
be RF-rack-valid at the time of creation, and it is enforced to remain
RF-rack-valid while the MV exists. This validation is done in the cql
create view/index statements.
The same should be done also for alternator - when creating a table with
GSI or LSI, or when adding a GSI to an existing table, previously we
required the flag rf_rack_valid_keyspaces to be set. Now we change it to
instead check if the keyspace is RF-rack-valid, and if not the operation
fails with an appropriate error.
Fixes https://github.com/scylladb/scylladb/issues/28214
backport to 2025.4 to add RF-rack-valid enforcements in alternator
Closesscylladb/scylladb#28154
* github.com:scylladb/scylladb:
locator: document the exception type of assert_rf_rack_valid_keyspace
alternator: don't require rf_rack flag for indexes, validate instead
In this PR we add a basic implementation of the strongly-consistent tables:
* generate raft group id when a strongly-consistent table is created
* persist it into system.tables table
* start raft groups on replicas when a strongly-consistent tablet_map reaches them
* add strongly-consistent version of the storage_proxy, with the `query` and `mutate` methods
* the `mutate` method submits a command to the tablets raft group, the query method reads the data with `raft.read_barrier()`
* strongly-consistent versions of the `select_statement` and `modification_statement` are added
* a basic `test_strong_consistency.py/test_basic_write_read` is added which to check that we can write and read data in a strongly consistent fashion.
Limitations:
* for now the strongly consistent tables can have tablets only on shard zero. This is because we (ab/re) use the existing raft system tables which live only on shard0. In the next PRs we'll create separate tables for the new tablets raft groups.
* No Scylla-side proxying - the test has to figure out who is the leader and submit the command to the right node. This will be fixed separately.
* No tablet balancing -- migration/split/merges require separate complicated code.
The new behavior is hidden behind `STRONGLY_CONSISTENT_TABLES` feature, which is enabled when the `STRONGLY_CONSISTENT_TABLES` experimental feature flag is set.
Requirements, specs and general overview of the feature can be found [here](https://scylladb.atlassian.net/wiki/spaces/RND/pages/91422722/Strong+Consistency). Short term implementation plan is [here](https://docs.google.com/document/d/1afKeeHaCkKxER7IThHkaAQlh2JWpbqhFLIQ3CzmiXhI/edit?tab=t.0#heading=h.thkorgfek290)
One can check the strongly consistent writes and reads locally via cqlsh:
scylla.yaml:
```
experimental_features:
- strongly-consistent-tables
```
cqlsh:
```
CREATE KEYSPACE IF NOT EXISTS my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1} AND tablets = {'initial': 1} AND consistency = 'local';
CREATE TABLE my_ks.test (pk int PRIMARY KEY, c int);
INSERT INTO my_ks.test (pk, c) VALUES (10, 20);
SELECT * FROM my_ks.test WHERE pk = 10;
```
Fixes SCYLLADB-34
Fixes SCYLLADB-32
Fixes SCYLLADB-31
Fixes SCYLLADB-33
Fixes SCYLLADB-56
backport: no need
Closesscylladb/scylladb#27614
* https://github.com/scylladb/scylladb:
test_encryption: capture stderr
test/cluster: add test_strong_consistency.py
raft_group_registry: disable metrics for non-0 groups
strong consistency: implement select_statement::do_execute()
cql: add select_statement.cc
strong consistency: implement coordinator::query()
cql: add modification_statement
cql: add statement_helpers
strong consistency: implement coordinator::mutate()
raft.hh: make server::wait_for_leader() public
strong_consistency: add coordinator
modification_statement: make get_timeout public
strong_consistency: add groups_manager
strong_consistency: add state_machine and raft_command
table: add get_max_timestamp_for_tablet
tablets: generate raft group_id-s for new table
tablet_replication_strategy: add consistency field
tablets: add raft_group_id
modification_statement: remove virtual where it's not needed
modification_statement: inline prepare_statement()
system_keyspace: disable tablet_balancing for strongly_consistent_tables
cql: rename strongly_consistent statements to broadcast statements
The function assert_rf_rack_valid_keyspace uses the exception type
std::invalid_argument when the RF-rack validation fails. Document it and
change all callers to catch this specific exception type when checking
for RF-rack validation failures, so that other exception types can be
propagated properly.
Add a `raft_group_id` column to `system.tablets` and to the `tablet_map`
class. The column is populated only when the
`strongly_consistent_tables` feature is enabled.
This feature is currently disabled by default and is enabled only when
the user sets the `STRONGLY_CONSISTENT_TABLES` experimental flag.
The `raft_group_id` column is added to `system.tablets` only when this
flag is set. This allows the schema to evolve freely while the feature
is experimental, without requiring complex migrations.
This reverts commit c8cff94a5a.
Re-enabling incremental repair on master with "Aborting on shard 0 during
scaleout + repair #26041" and "Failure to attach sstables in streaming consumer
leaves sealed sstables on disk #27414" fixed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#28120
repair: Implement auto repair for tablet repair
This patch implements the basic auto repair support for tablet repair.
It was decided to add no per table configuration for the initial
implementation, so two scylla yaml config options are introduced to set
the default auto repair configs for all the tablet tables.
- auto_repair_enabled_default
Set true to enable auto repair for tablet tables by default. The value
will be overridden by the per keyspace or per table configuration which
is not implemented yet.
- auto_repair_threshold_default_in_seconds
Set the default time in seconds for the auto repair threshold for tablet
tables. If the time since last repair is bigger than the configured
time, the tablet is eligible for auto repair. The value will be
overridden by the per keyspace or per table configuration which is not
implemented yet.
The following metrcis are added:
- auto_repair_needs_repair_nr
The number of tablets with auto repair enabled that needs repair
- auto_repair_enabled_nr
The number of tablets with auto repair enabled
The metrics are useful to tell if auto repair is falling behind.
In the future, more auto repair scheduling will be added, e.g.,
scheduling based on the repaired and unrepaired sstable set size,
tombstone ratio and so on, in addition to the time based scheduling.
Fixes SCYLLADB-99
New feature. No backport.
Closesscylladb/scylladb#27534
* github.com:scylladb/scylladb:
topology_coordinator: Add metrics for tablet repair
repair: Implement auto repair for tablet repair
This patch implements the basic auto repair support for tablet repair.
It was decided to add no per table configuration for the initial
implementation, so two scylla yaml config options are introduced to set
the default auto repair configs for all the tablet tables.
- auto_repair_enabled_default
Set true to enable auto repair for tablet tables by default. The value
will be overridden by the per keyspace or per table configuration which
is not implemented yet.
- auto_repair_threshold_default_in_seconds
Set the default time in seconds for the auto repair threshold for tablet
tables. If the time since last repair is bigger than the configured
time, the tablet is eligible for auto repair. The value will be
overridden by the per keyspace or per table configuration which is not
implemented yet.
The following metrcis are added:
- auto_repair_needs_repair_nr
The number of tablets with auto repair enabled that needs repair
- auto_repair_enabled_nr
The number of tablets with auto repair enabled
The metrics are useful to tell if auto repair is falling behind.
In the future, more auto repair scheduling will be added, e.g.,
scheduling based on the repaired and unrepaired sstable set size,
tombstone ratio and so on, in addition to the time based scheduling.
Fixes SCYLLADB-99
Allow creating materialized views and secondary indexes in a tablets keyspace only if it's RF-rack-valid, and enforce RF-rack-validity while the keyspace has views by restricting some operations:
* Altering a keyspace's RF if it would make the keyspace RF-rack-invalid
* Adding a node in a new rack
* Removing / Decommissioning the last node in a rack
Previously the config option `rf_rack_valid_keyspaces` was required for creating views. We now remove this restriction - it's not needed because we always maintain RF-rack-validity for keyspaces with views.
The restrictions are relevant only for keyspaces with numerical RF. Keyspace with rack-list-based RF are always RF-rack-valid.
Fixesscylladb/scylladb#23345
Fixes https://github.com/scylladb/scylladb/issues/26820
backport to relevant versions for materialized views with tablets since it depends on rf-rack validity
Closesscylladb/scylladb#26354
* github.com:scylladb/scylladb:
docs: update RF-rack restrictions
cql3: don't apply RF-rack restrictions on vector indexes
cql3: add warning when creating mv/index with tablets about rf-rack
service/tablet_allocator: always allow tablet merge of tables with views
locator: extend rf-rack validation for rack lists
test: test rf-rack validity when creating keyspace during node ops
locator: fix rf-rack validation during node join/remove
test: test topology restrictions for views with tablets
test: add test_topology_ops_with_rf_rack_valid
topology coordinator: restrict node join/remove to preserve RF-rack validity
topology coordinator: add validation to node remove
locator: extend rf-rack validation functions
view: change validate_view_keyspace to allow MVs if RF=Racks
db: enforce rf-rack-validity for keyspaces with views
replica/db: add enforce_rf_rack_validity_for_keyspace helper
db: remove enforce parameter from check_rf_rack_validity
test: adjust test to not break rf-rack validity
This patch adds a method to load_stats which searches for the tablet
size during tablet transition. In case of tablet migration, the tablet
will be searched on the leaving replica, and during rebuild we will
return the average tablet size of the pending replicas.
Extend the locator function assert_rf_rack_valid_keyspace to accept
arbitrary topology dc-rack maps and nodes instead of using the current
token metadata.
This allows us to add a new variant of the function that checks rf-rack
validity given a topology change that we want to apply. we will use it
to check that rf-rack validity will be maintained before applying the
topology change.
The possible topology changes for the check are node add and node remove
/ decommission. These operations can change the number of normal racks -
if a new node is added to a new rack, or the last node is removed from a
rack.
Currently we do not allow tablet merge if either of the tablets contain
a tablet repair request. This could block the tablet merge for a very
long time if the repair requests could not be scheduled and executed.
We can actually merge the repair tasks in most of the cases. This is
because most of the time all tablets are requested to be repaired by a
single API request, so they share the same task_id, request_type and
other parameters. We can merge the repair task info and executes the
repair after the merge. If they do not share the task info, we could
not merge and have to wait for the repair before merge, which is both
rare and ok.
Another case is that one of the tablet has a repair task info (t1) while
the other tablet (t2) does not have, it is possible the t2 has finished
repair by the same repair request or t2 is not requested to be repaired
at all. We allow merge in this case too to avoid blocking the tablet
merge, with the price of reparing a bit more.
Fixes#26844Closesscylladb/scylladb#26922
This PR extends the restore API so that it accepts primary_replica_only as parameter and it combines the concepts of primary-replica-only with scoped streaming so that with:
- `scope=all primary_replica_only=true` The restoring node will stream to the global primary replica only
- `scope=dc primary_replica_only=true` The restoring node will stream to the local primary replica only.
- `scope=rack primary_replica_only=true` The restoring node will stream only to the primary replica from within its own rack (with rf=#racks, the restoring node will stream only to itself)
- `scope=node primary_replica_only=true` is not allowed, the restoring node will always stream only to itself so the primary_replica_only parameter wouldn't make sense.
The PR also adjusts the `nodetool refresh` restriction on running restore with both primary_replica_only and scope, it adds primary_replica_only to `nodetool restore` and it adds cluster tests for primary replica within scope.
Fixes#26584Closesscylladb/scylladb#26609
* github.com:scylladb/scylladb:
Add cluster tests for checking scoped primary_replica_only streaming
Improve choice distribution for primary replica
Refactor cluster/object_store/test_backup
nodetool restore: add primary-replica-only option
nodetool refresh: Enable scope={all,dc,rack} with primary_replica_only
Enable scoped primary replica only streaming
Support primary_replica_only for native restore API
This patch fixes a bug with tablet size migration in load_stats.
has_tablet_size() lambda in topology_coordinator::migrate_tablet_size()
was returning false in all cases due to incorrect search iterator
comparison after a table and tablet saeach.
This change moves load_stats migrate_tablet_sizes() functionaility
into a separate method of load_stats.
I noticed during tests that `maybe_get_primary_replica`
would not distribute uniformly the choice of primary replica
because `info.replicas` on some shards would have an order whilst
on others it'd be ordered differently, thus making the function choose
a node as primary replica multiple times when it clearly could've
chosen a different nodes.
This patch sorts the replica set before passing it through the
scope filter.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
This change adds the ability to move tablets sizes in load_stats after a tablet migration or table resize (split/merge). This is needed because the size based load balancer needs to have tablet size data which is as accurate as possible, in order to work on fresh tablet size distribution and issue correct tablet migrations.
This is the second part of the size based load balancing changes:
- First part for tablet size collection via load_stats: #26035
- Second part reconcile load_stats: #26152
- The third part for load_sketch changes: #26153
- The fourth part which performs tablet load balancing based on tablet size: #26254
This is a new feature and backport is not needed.
Closesscylladb/scylladb#26152
* github.com:scylladb/scylladb:
load_balancer: load_stats reconcile after tablet migration and table resize
load_stats: change data structure which contains tablet sizes
Auto-exands numeric RF in CREATE/ALTER KEYSPACE statements for
new DCs specified in the statement.
Doesn't auto-expand existing options, as the rack choice may not be in
line with current replica placement. This requires co-locating tablet
replicas, and tracking of co-location state, which is not implemented yet.
Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>
This change adds the ability to move tablets sizes in load_stats after a
tablet migration or table resize (split/merge). This is needed because
the size based load balancer needs to have tablet size data which is as
accurate as possible, in order to issue migrations which improve
load balance.
Currently, tablet_sstable_streamer::get_primary_endpoints is out of
sync with tablet_map::get_primary_replica. The get_primary_replica
optimizes the choice of the replica so that the work is fairly
distributes among nodes. Meanwhile, get_primary_endpoints always
chooses the first replica.
Use get_primary_replica for get_primary_endpoints.
Fixes: https://github.com/scylladb/scylladb/issues/21883.
Closesscylladb/scylladb#26385
This patch changes the tablet size map in load_stats. Previously, this
data structure was:
std::unordered_map<range_based_tablet_id, uint64_t> tablet_sizes;
and is changed into:
std::unordered_map<table_id, std::unordered_map<dht::token_range, uint64_t>> tablet_sizes;
This allows for improved performance of tablet tablet size reconciliation.
Using the name regular as the incremental mode could be confusing, since
regular might be interpreted as the non-incremental repair. It is better
to use incremental directly.
Before:
- regular (standard incremental repair)
- full (full incremental repair)
- disabled (incremental repair disabled)
After:
- incremental (standard incremental repair)
- full (full incremental repair)
- disabled (incremental repair disabled)
Fixes#26503Closesscylladb/scylladb#26504
This commit extend the TABLE_LOAD_STATS RPC with data about the tablet
replica sizes and effective disk capacity.
Effective disk capacity of a node is computed as a sum of the sizes of
all tablet replicas on a node and available disk space.
This is the first change in the size based load balancing series.
Closesscylladb/scylladb#26035
Consider the following:
1) balancer emits split decision
2) split compaction starts
3) split decision is revoked
4) emits merge decision
5) completes merge, before compaction in step 2 finishes
After last step, split compaction initiated in step 2 can fail because it works with the global tablet map, rather than the map when the compaction started. With the global state changing under its feet, on merge, the mutation splitting writer will think it's going backwards since sibling tablets are merged.
This problem was also seen when running load-and-stream, where split initiated by the sstable writer failed, split completed, and the unsplit sstable is left in the table dir, causing problems in the restart.
To fix this, let's make split compaction always work with the state when it started, not a global state.
Fixes#24153.
All 2025.* versions are vulnerable, so fix must be backported to them.
Closesscylladb/scylladb#25690
* github.com:scylladb/scylladb:
replica: Fix split compaction when tablet boundaries change
replica: Futurize split_compaction_options()
This patch introduces a new `incremental_mode` parameter to the tablet
repair REST API, providing more fine-grained control over the
incremental repair process.
Previously, incremental repair was on and could not be turned off. This
change allows users to select from three distinct modes:
- `regular`: This is the default mode. It performs a standard
incremental repair, processing only unrepaired sstables and skipping
those that are already repaired. The repair state (`repaired_at`,
`sstables_repaired_at`) is updated.
- `full`: This mode forces the repair to process all sstables, including
those that have been previously repaired. This is useful when a full
data validation is needed without disabling the incremental repair
feature. The repair state is updated.
- `disabled`: This mode completely disables the incremental repair logic
for the current repair operation. It behaves like a classic
(pre-incremental) repair, and it does not update any incremental
repair state (`repaired_at` in sstables or `sstables_repaired_at` in
the system.tablets table).
The implementation includes:
- Adding the `incremental_mode` parameter to the
`/storage_service/repair/tablet` API endpoint.
- Updating the internal repair logic to handle the different modes.
- Adding a new test case to verify the behavior of each mode.
- Updating the API documentation and developer documentation.
Fixes#25605Closesscylladb/scylladb#25693
Consider the following:
1) balancer emits split decision
2) split compaction starts
3) split decision is revoked
4) emits merge decision
5) completes merge, before compaction in step 2 finishes
After last step, split compaction initiated in step 2 can fail
because it works with the global tablet map, rather than the
map when the compaction started. With the global state changing
under its feet, on merge, the mutation splitting writer will
think it's going backwards since sibling tablets are merged.
This problem was also seen when running load-and-stream, where
split initiated by the sstable writer failed, split completed,
and the unsplit sstable is left in the table dir, causing
problems in the restart.
To fix this, let's make split compaction always work with
the state when it started, not a global state.
Fixes#24153.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This commit extends the TABLE_LOAD_STATS RPC with information whether
a node operates in the critical disk utilization mode.
This information will be needed to distict between the causes why
a table migration/repair was interrupted.
This patch addes incremental_repair support in compaction.
- The sstables are split into repaired and unrepaired set.
- Repaired and unrepaired set compact sperately.
- The repaired_at from sstable and sstables_repaired_at from
system.tablets table are used to decide if a sstable is repaired or
not.
- Different compactions tasks, e.g., minor, major, scrub, split, are
serialized with tablet repair.
As they are wasteful in many cases, it is better
to move the tablet_map if possible, or clone
it gently in an async fiber.
Add clone() and clone_gently() methods to
allow explicit copies.
* minor optimization, no backport needed
Closesscylladb/scylladb#24978
* github.com:scylladb/scylladb:
tablets: prevent accidental copy of tablets_map
locator: tablets: get rid of synchronous mutate_tablet_map
The `token_metadata_impl` stores the sorted tokens in an `std::vector`.
With a large number of nodes, the size of this vector can grow quickly,
and updating it might lead to oversized allocations.
This commit changes `_sorted_tokens` to a `chunked_vector` to avoid such
issues. It also updates all related code to use `chunked_vector` instead
of `std::vector`.
Fixes#24876
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#25027
As they are wasteful in many cases, it is better
to move the tablet_map if possible, or clone
it gently in an async fiber.
Add clone() and clone_gently() methods to
allow explicit copies.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is currently used only by tests that could very well
do with mutate_tablet_map_async.
This will simplify the following patch to prevent
accidental copy of the tablet_map, provding explicit
clone/clone_gently methods.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The method all_tables in tablet_metadata is used for iterating over all
tables in the tablet metadata with their tablet maps.
Now that we have co-located tables we need to make the distinction on
which tables we want to iterate over. In some cases we want to iterate
over each group of co-located tables, treating them as one unit, and in
other cases we want to iterate over all tables, doesn't matter if they
are part of a co-located group and have a base table.
We replace all_tables with new methods that can be used for each of the
cases.
Modify tablet_metadata to be able to represent co-located tables.
The new method set_colocated_table adds to tablet_metadata a table which
is co-located with another table. A co-located table shares the tablet
map object with the base table, so we just create a copy of the shared
tablet map pointer and store it as the co-located table's tablet map.
Whenever a tablet map is modified we update the pointer for all the
co-located tables accordingly, so the tablet map remains shared.
We add some data structures to tablet_metadata to be able to work with
co-located table groups efficiently:
* `_table_groups` maps every base table to all tables in its
co-location group. This is convenient for iterating over all table
groups, or finding all tables in some group.
* `_base_table` maps a co-located table to its base table.
When a tablet is migrated and cleaned up, deallocate the tablet storage
group state on `end_migration` stage, instead of `cleanup` stage:
* When the stage is updated from `cleanup` to `end_migration`, the
storage group is removed on the leaving replica.
* When the table is initialized, if the tablet stage is `end_migration`
then we don't allocate a storage group for it. This happens for
example if the leaving replica is restarted during tablet migration.
If it's initialized in `cleanup` stage then we allocate a storage
group, and it will be deallocated when transitioning to
`end_migration`.
This guarantees that the storage group is always deallocated on the
leaving replica by `end_migration`, and that it is always allocated if
the tablet wasn't cleaned up fully yet.
It is a similar case also for the pending replica when the migration is
aborted. We deallocate the state on `revert_migration` which is the
stage following `cleanup_target`.
Previously the storage group would be allocated when the tablet is
initialized on any of the tablet replicas - also on the leaving replica,
and when the tablet stage is `cleanup` or `end_migration`, and
deallocated during `cleanup`.
This fixes the following issue:
1. A migrating tablet enters cleanup stage
2. the tablet is cleaned up successfuly
3. The leaving replica is restarted, and allocates storage group
4. tablet cleanup is not called because it was already cleaned up
4. the storage group remains allocated on the leaving replica after the
migration is completed - it's not cleaned up properly.
Fixesscylladb/scylladb#23481
Support for TTL-based data removal when using tablets.
The essence of this commit is a separate code path for finding token
ranges owned by the current shard for the cases when tablets are used
and not vnodes. At the same time, the vnodes-case is not touched not to
cause any regressions.
The TTL-caused data removal is normally performed by the primary
replica (both when using vnodes and tablets). For the tablets case,
the already-existing method tablet_map::get_primary_replica(tablet_id)
is used to know if a shard execuring the TTL-related data removal is
the primary replica for each tablet.
A new method tablet_map::get_secondary_replica(tablet_id) has been
added. It is needed by the data invalidation procedure to remove data
when the primary replica node is down - the data is then removed by the
secondary replica node. The mechanism is the same as in the vnodes case.
Since alternator now supports TTL, the test
`test_ttl_enable_error_with_tablets` has been removed.
Also, tests in the test_ttl.py have been made to run twice, once with
vnodes and once with tablets. When run with tablets, the due to lack of
support for LWT with tablets (#18068), tests use
'system:write_isolation' of 'unsafe_rmw'. This approach allows early
regression testing with tablets and is meant only as a tentative
solution.
Fixesscylladb/scylladb#16567Closesscylladb/scylladb#23662
To simplify future usage in
network_topology_strategy::add_tablets_in_dc() which invokes
populate() for a given table, which may be both new and preexisitng.