The PRUNE MATERALIZED VIEW statement is performed as follows:
1. Perform a range scan of the view table from the view replicas based
on the ranges specified in the statement.
2. While reading the paged scan above, for each view row perform a read
from all base replicas at the corresponding primary key. If a discrepancy
is detected, delete the row in the view table.
When reading multiple rows, this is very slow because for each view row
we need to performe a single row query on multiple replicas.
In this patch we add an option to speed this up by performing many of the
single base row reads concurrently, at the concurrency specified in the
USING CONCURRENCY clause.
Aside from the unit test, I checked manually on a 3-node cluster with 10M rows, using vnodes. There were actually no ghost rows in the test, but we still had to iterate over all view rows and read the corresponding base rows. And actual ghost rows, if there are any, should be a tiny fraction of all rows. I compared concurrencies 1,2,10,100 and the results were:
* Pruning with concurrency 1 took total 1416 seconds
* Pruning with concurrency 2 took total 731 seconds
* Pruning with concurrency 10 took total 234 seconds
* Pruning with concurrency 100 took total 171 seconds
So after a concurrency of 10 or so we're hitting diminishing returns (at least in this setup). At that point we may be no longer bottlenecked by the reads, but by CPU on the shard that's handling the PRUNE
Fixes https://github.com/scylladb/scylladb/issues/27070Closesscylladb/scylladb#27097
* github.com:scylladb/scylladb:
mv: allow setting concurrency in PRUNE MATERIALIZED VIEW
cql: add CONCURRENCY to the USING clause
Consider the following scenario:
1. A table has RF=3 and writes use CL=QUORUM
2. One node is down
3. There is a pending tablet migration from the unavailable node
that is reverted
During the revert, there can be a time window where the pending replica
being cleaned up still accepts writes. This leads to write failures,
as only two nodes (out of four) are able to acknowledge writes.
This patch fixes the issue by adding a barrier to the cleanup_target
tablet transition state, ensuring that the coordinator switches back to
the previous replica set before cleanup is triggered.
Fixes https://github.com/scylladb/scylladb/issues/26512
It's a pre existing issue. Backport is required to all recent 2025.x versions.
Closesscylladb/scylladb#27413
* github.com:scylladb/scylladb:
topology_coordinator: Fix the indentation for the cleanup_target case
topology_coordinator: Add barrier to cleanup_target
test_node_failure_during_tablet_migration: Increase RF from 2 to 3
Consider the following scenario:
1. A table has RF=3 and writes use CL=QUORUM
2. One node is down
3. There is a pending tablet migration from the unavailable node
that is reverted
During the revert, there can be a time window where the pending replica
being cleaned up still accepts writes. This leads to write failures,
as only two nodes (out of four) are able to acknowledge writes.
This patch fixes the issue by adding a barrier to the cleanup_target
tablet transition state, ensuring that the coordinator switches back to
the previous replica set before cleanup is triggered.
Fixes https://github.com/scylladb/scylladb/issues/26512
Consider this:
1) n1 is the topology coordinator
2) n1 schedules and executes a tablet repair with session id s1 for a
tablet on n3 an n4.
3) n3 and n4 take and store the in _rs._repair_compaction_locks[s1]
4) n1 steps down before it executes
locator::tablet_transition_stage::end_repair
5) n2 becomes the new topology coordinator
6) n2 runs locator::tablet_transition_stage::repair again
7) n3 and n4 try to take the lock again and hangs since the lock is
already taken.
To avoid the deadlock, we can throw in step 7 so that n2 will
proceed to end_repair stage and release the lock. After that, the
scheduler could schedule the tablet repair request again.
Fixes#26346Closesscylladb/scylladb#27163
In several exception handlers, only raft::request_aborted was being
caught and rethrown, while seastar::abort_requested_exception was
falling through to the generic catch(...) block. This caused the
exception to be incorrectly treated as a failure that triggers
rollback, instead of being recognized as an abort signal.
For example, during tablet draining, the error log showed:
"tablets draining failed with seastar::abort_requested_exception
(abort requested). Aborting the topology operation"
This change adds seastar::abort_requested_exception handling
alongside raft::request_aborted in all places where it was missing.
When rethrown, these exceptions propagate up to the main run() loop
where handle_topology_coordinator_error() recognizes them as normal
abort signals and allows the coordinator to exit gracefully without
triggering unnecessary rollback operations.
Fixes: scylladb/scylladb#27255
No backport: The problem was only seen in tests and not reported in
customer tickets, so it's enough to fix it in the main branch.
Closesscylladb/scylladb#27314
The tablet scheduler should not emit conflicting migrations for the same
tablet. This was addressed initially in scylladb/scylladb#26038 but the
check is missing in the merge colocation plan, so add it there as well.
Without this check, the merge colocation plan could generate a
conflicting migration for a tablet that is already scheduled for
migration, as the test demonstrates.
This can cause correctness problems, because if the load balancer
generates two migrations for a single tablet, both will be written as
mutations, and the resulting mutation could contain mixed cells from
both migrations.
Fixesscylladb/scylladb#27304Closesscylladb/scylladb#27312
And switch to std::source_location.
Upcoming seastar update will deprecate its compatibility layer.
The patch is
for f in $(git grep -l 'seastar::compat::source_location'); do
sed -e 's/seastar::compat::source_location/std::source_location/g' -i $f;
done
and removal of few header includes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#27309
The PRUNE MATERALIZED VIEW statement is performed as follows:
1. Perform a range scan of the view table from the view replicas based
on the ranges specified in the statement.
2. While reading the paged scan above, for each view row perform a read
from all base replicas at the corresponding primary key. If a discrepancy
is detected, delete the row in the view table.
When reading multiple rows, this is very slow because for each view row
we need to performe a single row query on multiple replicas.
In this patch we add an option to speed this up by performing many of the
single base row reads concurrently, at the concurrency specified in the
USING CONCURRENCY clause.
Fixes https://github.com/scylladb/scylladb/issues/27070
Otherwise, in a mixed cluster, the handle_tablet_resize_finalization
would fail because of the unknown rpc verb.
Fixes#26309Closesscylladb/scylladb#27218
When applying group0_command we now inspect
whether any auth internal tables were modified,
and reload affected role entries in the cache.
Since one auth DML may change multiple tables,
when iterating over mutations we deduplicate
affected roles across those tables.
Receiving snaphot is a rare event so as a simplification
we'll be reloading the whole cache instead of trying to merge
states, especially that expected size is small, below 100 records.
Reloading is non-disruptive operation, old entries are removed
only after all entries are loaded. If entry is updated, shared
pointer will be atomically replaced in a cache map.
Prepare for use in a subsequent commit in group0_state_machine,
where the auth cache will be integrated. This follows the same
pattern as updates to the service-level cache, view-building
state, and CDC streams.
Previously, the view building coordinator relied on setting each task's state to STARTED and then explicitly removing these state entries once tasks finished, before scheduling new ones. This approach induced a significant number of group0 commits, particularly in large clusters with many nodes and tablets, negatively impacting performance and scalability.
With the update, the coordinator and worker logic has been restructured to operate without maintaining per-task states. Instead, tasks are simply tracked with an aborted boolean flag, which is still essential for certain tablet operations. This change removes much of the coordination complexity, simplifies the view building code, and reduces operational overhead.
In addition, the coordinator now batches reports of finished tasks before making commits. Rather than committing task completions individually, it aggregates them and reports in groups, significantly minimizing the frequency of group0 commits. This new approach is expected to improve efficiency and scalability during materialized view construction, especially in large deployments.
Fixes https://github.com/scylladb/scylladb/issues/26311
This patch needs to be backported to 2025.4.
Closesscylladb/scylladb#26897
* github.com:scylladb/scylladb:
docs/dev/view-building-coordinator: update the docs after recent changes
db/view/view_building: send coordinator's term in the RPC
db/view/view_building_state: replace task's state with `aborted` flag
db/view/view_building_coordinator: batch finished tasks reporting
db/view/view_building_worker: change internal implementation
db/view/view_building_coordinator: change `work_on_tasks` RPC return type
After previous commits, we can drop entire task's state and replace it
with single boolean flag, which determines if a task was aborted.
Once a task was aborted, it cannot get resurrected to a normal state.
This reverts commit 11f045bb7c.
The rpc was added together with colocated tablets in 2025.4 to support a
"shared repair" operation of a group of colocated tablets that repairs
all of them and allows also for special behavior as opposed to repairing
a single specific tablet.
It is not used anymore because we decided to not repair all colocated
tablets in a single shared operation, but to repair only the base table,
and in a later release support repairing colocated tables individually.
We can remove the rpc in 2025.4 because it is introduced in the same
version.
With the introduction of colocated tables, all the tablet transitions
now operate on groups of colocated tablets instead of individual
tablets. such is tablet migration, and also tablet repair.
The tablet repair currently doesn't work on individual tablets due to
the limitations in the tablet map being shared. The way it was
implemented to work on a group of colocated tablets is by repairing all
the colocated tablets together, using a dedicated rpc, and setting a
shared repair_time in the shared tablet map. It was implemented this
way because we wanted to have some way to repair the tablets of a
colocated table.
However, we want to change this in the next release so that it will be
possible to repair the tablets of a colocated table individually. In
order to simplify and prepare for the future change, we prefer until
then to not repair colocated tables at all. otherwise, we will need to
support both the shared repair and individual repair together for a long
time, and the upgrade will be more complicated.
We change the handling of the tablet 'repair' transition to repair only
the base table's tablets. It means it will not be possible to request
tablet repair for a non-base colocated table such as local MV, CDC and
paxos table. This restriction will be temporary until a later release
where we will suuport repairing colocated tablets.
This is a reasonable restriction because repair for these kind of tables
is not required or as important as for normal tables.
Fixesscylladb/scylladb#27119
Currently if a banned node tries to connect to a cluster it fails to
create connections, but has no idea why, so from inside the node it
looks like it has communication problems. This patch adds new rpc
NOTIFY_BANNED which is sent back to the node when its connection is
dropped. On receiving the rpc the node isolates itself and print an
informative message about why it did so.
Closesscylladb/scylladb#26943
When migrating tablet size during the end_migration tablet transition stage, we need the pending and leaving replica hosts. The leaving and pending replicas are gathered in objects of type std::optional<tablet_replica> and are not checked if they contain a value before dereferencing which could cause an exception in the topology coordinator.
This patch adds a check for leaving and pending replicas, and only performs the tablet size migration if neither are empty.
This bug was introduced in 10f07fb95a
This change also adds the ability to create a tablet size in load_stats during end_migration stage of a tablet rebuild. We compute the new tablet size from by averaging the tablet sizes of the existing replicas.
This change also adds the virtual table tablet_sizes which contains tablet sizes of all the replicas of all the tablets in the cluster.
A version containing this bug has not yet been released, so a backport is not needed.
Closesscylladb/scylladb#27118
* github.com:scylladb/scylladb:
test: add tests for tablet size migration during end_migration
virtual_table: add tablet_sizes virtual table
load_stats: update tablet sizes after migration or rebuild
Primary issue with the old method is that each update is a separate
cross-shard call, and all later updates queue behind it. If one of the
shards has high latency for such calls, the queue may accumulate and
system will appear unresponsive for mapping changes on non-zero shards.
This happened in the field when one of the shards was overloaded with
sstables and compaction work, which caused frequent stalls which
delayed polling for ~100ms. A queue of 3k address updates
accumulated, because we update mapping on each change of gossip
states. This made bootstrap impossible because nodes couldn't
learn about the IP mapping for the bootstrapping node and streaming
failed.
To protect against that, use a more efficient method of replication
which requires a single cross-shard call to replicate all prior
updates.
It is also more reliable, if replication fails transiently for some
reason, we don't give up and fail all later updates.
Fixes#26865Closesscylladb/scylladb#26941
* github.com:scylladb/scylladb:
address_map: Use barrier() to wait for replication
address_map: Use more efficient and reliable replication method
utils: Introduce helper for replicated data structures
When migrating tablet size during the end_migration tablet transition
stage, we need the pending and leaving replica hosts. The leaving and
pending replicas are gathered in objects of type
std::optional<tablet_replica> and are not checked if they contain a
value before dereferencing which could cause an exception in the
topology coordinator.
This patch adds a check for leaving and pending replicas, and only
perfoms the tablet size migration if neither are empty.
This bug was introduced in 10f07fb95a
This change also adds the functionality to add the tablet size to
load_stats after a tablet rebuild. We compute the average tablet size
from the existing replicas, and add the new size to the pending replica.
Add precompiled header support to CMakeLists.txt and configure.py -
it improves compilation time by approximately 10%.
New header `stdafx.hh` is added, don't include it manually -
the compiler will include it for you. The header contains includes from
external libraries used by Scylla - seastar, standard library,
linux headers and zlib.
The feature is enabled by default, use CMake option `Scylla_USE_PRECOMPILED_HEADER`
or configure.py --disable-precompiled-header to disable.
The feature should be disabled, when trying to check headers - otherwise
you might get false negatives on missing includes from seastar / abseil and so on.
Note: following configuration needs to be added to ccache.conf:
sloppiness = pch_defines,time_macros,include_file_mtime,include_file_ctime
Closesscylladb/scylladb#26617
This change makes the code simpler and less vulnerable to regressions.
There is no functional impact because:
- we already include a decommissioning/bootstrapping/replacing node for
`barrier` and `barrier_and_drain`,
- we never execute global commands in the presence of a rebuilding node,
- removing node always belongs to `exclude_nodes`, so it's filtered out
anyway,
- we execute global `stream_ranges` only for removenode,
- we execute global `wait_for_ip` only for new nodes when there are no
transitioning nodes.
Fixes#20272Fixes#27066Closesscylladb/scylladb#27102
This patch adds a metric for pre-compression size of sstable files.
This patch adds a per-table metric
`scylla_column_family_total_disk_space_before_compression`,
which measures the hypothetical total size of sstables on disk,
if Data.db was replaced with an uncompressed equivalent.
As for the implementation:
Before the patch, tables and sstable sets are already tracking their total physical file size.
Whenever sstables are added or removed, the size delta is propagated from the sstable up through sstable sets into table_stats.
To implement the new metric, we turn the size delta that is getting passed around from a one-dimensional to a two-dimensional value, which includes both the physical and the pre-compression size.
New functionality, no backport needed.
Closesscylladb/scylladb#26996
* github.com:scylladb/scylladb:
replica/table: add a metric for hypothetical total file size without compression
replica/table: keep track of total pre-compression file size
More efficient than 100 pings.
There was one ping in test which was done "so this shard notices the
clock advance". It's not necessary, since obsering completed SMP
call implies that local shard sees the clock advancement done within in.
Primary issue with the old method is that each update is a separate
cross-shard call, and all later updated queue behind it. If one of the
shards has high latency for such calls, the queue may accumulate and
system will appear unresponsive for mapping changes on non-zero shards.
This happened in the field when one of the shards was overloaded with
sstables and compaction work, which caused frequent stalls which
delayed polling for ~100ms. A queue of 3k address updates
accumulated. This made bootstrap impossible, since nodes couldn't
learn about the IP mapping for the bootstrapping node and streaming
failed.
To protect against that, use a more efficient method of replication
which requires a single cross-shard call to replicate all prior
updates.
It is also more reliable, if replication fails transiently for some
reason, we don't give up and fail all later updates.
Fixes#26865Fixes#26835
This series allows an operator to reset 'cleanup needed' flag if he already cleaned up the node, so that automatic cleanup will not do it again. We also change 'nodetool cleanup' back to run cleanup on one node only (and reset 'cleanup needed' flag in the end), but the new '--global' option allows to run cleanup on all nodes that needed it simultaneously.
Fixes https://github.com/scylladb/scylladb/issues/26866
Backport to all supported version since automatic cleanup behaviour as it is now may create unexpected by the operator load during cluster resizing.
Closesscylladb/scylladb#26868
* https://github.com/scylladb/scylladb:
cleanup: introduce "nodetool cluster cleanup" command to run cleanup on all dirty nodes in the cluster
cleanup: Add RESTful API to allow reset cleanup needed flag
This PR fixes staging stables handling by view building coordinator in case of intra-node tablet migration or tablet merge.
To support tablet merge, the worker stores the sstables grouped only be `table_id`, instead of `(table_id, last_token)` pair.
There shouldn't be that many staging sstables, so selecting relevant for each `process_staging` task is fine.
For the intra-node migration support, the patch adds methods to load migrated sstables on the destination shard and to cleanup them on source shard.
The patch should be backported to 2025.4
Fixes https://github.com/scylladb/scylladb/issues/26244Closesscylladb/scylladb#26454
* github.com:scylladb/scylladb:
service/storage_service: migrate staging sstables in view building worker during intra-node migration
db/view/view_building_worker: support sstables intra-node migration
db/view_building_worker: fix indent
db/view/view_building_worker: don't organize staging sstables by last token
Migration manager depends on storage service. For instance,
it has a reload_schema_in_bg background task which calls
_ss.local() so it expects that storage service is not stopped
before it stops.
To solve this we use permit approach, and during storage_service
stop:
- we ignore *new* code execution in migration_manager which'd use
storage_service
- but wait with storage_service shutdown until all *existing*
executions are done
Fixesscylladb/scylladb#26734
Backport: no need, problem existed since very long time, code restructure in https://github.com/scylladb/scylladb/commit/389afcd (and following commits) made
it hitting more often, as _ss was called earlier, but it's not released yet.
Closesscylladb/scylladb#26779
* github.com:scylladb/scylladb:
service: attach storage_service to migration_manager using pluggabe
service: migration_manager: corutinize merge_schema_from
service: migration_manager: corutinize reload_schema
Previously, only nodes in the 'normal' state and decommissioning nodes
were included in the set of nodes participating in barrier and
barrier_and_drain commands. Joining nodes are not included because they
don't coordinate requests, given their cql port is closed.
However, joining nodes may receive mutations from other nodes, for which
they may generate and coordinate materialized view updates. If their
group0 state is not synchronized it could cause lost view updates.
For example:
1. On the topology coordinator, the join completes and the joining node
becomes normal, but the joining node's state lags behind. Since it's
not synchronized by the barrier, it could be in an old state such as
`write_both_read_old`.
2. A normal node coordinates a write and sends it to the new node as the
new replica.
3. The new node applies the base mutation but doesn't generate a view
update for it, because it calculates the base-view pairing according
to its own state and replication map, and determines that it doesn't
participate in the base-view pairing.
Therefore, since the joining node participates as a coordinator for view
updates, it should be included in these barriers as well. This ensures
that before the join completes, the joining node's state is
`write_both_read_new`, where it does generate view updates.
Fixes https://github.com/scylladb/scylladb/issues/26976
backport to previous versions since it fixes a bug in MV with vnodes
Closesscylladb/scylladb#27008
* github.com:scylladb/scylladb:
test: add mv write during node join test
topology_coordinator: include joining node in barrier
This PR refactors excluded nodes handling for tablets and topology. For tablets a dedicated variable `topology::excluded_tablet_nodes` is introduced, for topology operations a method get_excluded_nodes() is inlined into topology_coordinator and renamed to `get_excluded_nodes_for_topology_request`.
The PR improves codes readability and efficiency, no behavior changes.
backport: this is a refactoring/optimization, no need to backport
Closesscylladb/scylladb#26907
* https://github.com/scylladb/scylladb:
topology_coordinator: drop unused exec_global_command overload
topology_coordinator: rename get_excluded_nodes -> get_excluded_nodes_for_topology_request
topology_state_machine: inline get_excluded_nodes
messaging_service: simplify and optimize ban_host
storage_service: topology_state_load: extract topology variable
topology_coordinator: excluded_tablet_nodes -> ignored_nodes
topology_state_machine: add excluded_tablet_nodes field
The service level controller relies on `auth::service` to collect
information about roles and the relation between them and the service
levels (those attached to them). Unfortunately, the service level
controller is initialized way earlier than `auth::service` and so we
had to prevent potential invalid queries of user service levels
(cf. 46193f5e79).
Unfortunately, that came at a price: it made the maintenance socket
incompatible with the current implementation of the service level
controller. The maintenance socket starts early, before the
`auth::service` is fully initialized and registered, and is exposed
almost immediately. If the user attempts to connect to Scylla within
this time window, via the maintenance socket, one of the things that
will happen is choosing the right service level for the connection.
Since the `auth::service` is not registered, Scylla with fail an
assertion and crash.
A similar scenario occurs when using maintenance mode. The maintenance
socket is how the user communicates with the database, and we're not
prepared for that either.
To avoid unnecessary crashes, we add new branches if the passed user is
absent or if it corresponds to the anonymous role. Since the role
corresponding to a connection via the maintenance socket is the anonymous
role, that solves the problem.
Some accesses to `auth::service` are not affected and we do not modify
those.
Fixesscylladb/scylladb#26816
Backport: yes. This is a fix of a regression.
Closesscylladb/scylladb#26856
* github.com:scylladb/scylladb:
test/cluster/test_maintenance_mode.py: Wait for initialization
test: Disable maintenance mode correctly in test_maintenance_mode.py
test: Fix keyspace in test_maintenance_mode.py
service/qos: Do not crash Scylla if auth_integration absent
Migration manager depends on storage service. For instance,
it has a reload_schema_in_bg background task which calls
_ss.local() so it expects that storage service is not stopped
before it stops.
To solve this we use permit approach, and during storage_service
stop:
- we ignore *new* code execution in migration_manager which'd use
storage_service
- but wait with storage_service shutdown until all *existing*
executions are done
Fixesscylladb/scylladb#26734
This method is specific to topology requests -- node joining, replacing,
decommissioning etc, everything that goes through
topology::transition_state::write_both_read_old and
raft_topology_cmd::command::stream_ranges. It shouldn't be used in
other contexts -- to handle global topology requests
(e.g. truncate table) or for tablets. Rename the method to make this
more explicit.
The method is specific to topology_coordinator, which already contains
a wrapper for it, so inline the topology method into it.
Also, make the logic of the method more explicit and remove multiple
transition_nodes lookups.
Previously, only nodes in the 'normal' state and decommissioning nodes
were included in the set of nodes participating in barrier and
barrier_and_drain commands. Joining nodes are not included because they
don't coordinate requests, given their cql port is closed.
However, joining nodes may receive mutations from other nodes, for which
they may generate and coordinate materialized view updates. If their
group0 state is not synchronized it could cause lost view updates.
For example:
1. On the topology coordinator, the join completes and the joining node
becomes normal, but the joining node's state lags behind. Since it's
not synchronized by the barrier, it could be in an old state such as
`write_both_read_old`.
2. A normal node coordinates a write and sends it to the new node as the
new replica.
3. The new node applies the base mutation but doesn't generate a view
update for it, because it calculates the base-view pairing according
to its own state and replication map, and determines that it doesn't
participate in the base-view pairing.
Therefore, since the joining node participates as a coordinator for view
updates, it should be included in these barriers as well. This ensures
that before the join completes, the joining node's state is
`write_both_read_new`, where it does generate view updates.
Fixesscylladb/scylladb#26976
When generating CDC log mutations for some base mutation, use a CDC schema that is compatible with the base schema.
The compatible CDC schema has for every base column a corresponding CDC column with the same name. If using a non-compatible schema, we may encounter a situation, especially during ALTER, that we have a mutation with a base column set with some value, but the CDC schema doesn't have a column by that name. This would cause the user request to fail with an error.
We add to the schema object a schema_ptr that for CDC-enabled tables points to the schema object of the CDC table that is compatible with the schema. It is set by the schema merge algorithm when creating the schema for a table that is created or altered. We use the fact that a base table and its CDC table are created and altered in the same group0 operation, and this way we can find and set the cdc schema for a base table.
When transporting the base schema as a frozen schema between shards, we transport with it the frozen cdc schema as well.
The patch starts with a series of refactoring commits that make extending the frozen schema easier and cleans up some duplication in the code about the frozen schema. We combine the two types `frozen_schema_with_base_info` and `view_schema_and_base_info` to a single type `extended_frozen_schema` that holds a frozen schema with additional data that is not part of the schema mutations but needs to be transported with it to unfreeze it - base_info, and the frozen cdc schema which is added in a later commit.
Fixes https://github.com/scylladb/scylladb/issues/26405
backport not needed - enhancement
Closesscylladb/scylladb#24960
* github.com:scylladb/scylladb:
test: cdc: test cdc compatible schema
cdc: use compatiable cdc schema
db: schema_applier: create schema with pointer to CDC schema
db: schema_applier: extract cdc tables
schema: add pointer to CDC schema
schema_registry: remove base_info from global_schema_ptr
schema_registry: use extended_frozen_schema in schema load
schema_registry: replace frozen_schema+base_info with extended_frozen_schema
frozen_schema: extract info from schema_ptr in the constructor
frozen_schema: rename frozen_schema_with_base_info to extended_frozen_schema
This PR extends the restore API so that it accepts primary_replica_only as parameter and it combines the concepts of primary-replica-only with scoped streaming so that with:
- `scope=all primary_replica_only=true` The restoring node will stream to the global primary replica only
- `scope=dc primary_replica_only=true` The restoring node will stream to the local primary replica only.
- `scope=rack primary_replica_only=true` The restoring node will stream only to the primary replica from within its own rack (with rf=#racks, the restoring node will stream only to itself)
- `scope=node primary_replica_only=true` is not allowed, the restoring node will always stream only to itself so the primary_replica_only parameter wouldn't make sense.
The PR also adjusts the `nodetool refresh` restriction on running restore with both primary_replica_only and scope, it adds primary_replica_only to `nodetool restore` and it adds cluster tests for primary replica within scope.
Fixes#26584Closesscylladb/scylladb#26609
* github.com:scylladb/scylladb:
Add cluster tests for checking scoped primary_replica_only streaming
Improve choice distribution for primary replica
Refactor cluster/object_store/test_backup
nodetool restore: add primary-replica-only option
nodetool refresh: Enable scope={all,dc,rack} with primary_replica_only
Enable scoped primary replica only streaming
Support primary_replica_only for native restore API
Every table and sstable set keeps track of the total file size
of contained sstables.
Due to a feature request, we also want to keep track of the hypothetical
file size if Data files were uncompressed, to add a metric that
shows the compression ratio of sstables.
We achieve this by replacing the relevant `uint_64 bytes_on_disk`
counters everywhere with a struct that contains both the actual
(post-compression) size and the hypothetical pre-compression size.
This patch isn't supposed to change any observable behavior.
In the next patch, we will use these changes to add a new metric.
`topology_cooridinator::migrate_tablet_size()` was introduced in 10f07fb95a. It has a bug where the has_tablet_size() lambda always returns false because of bad comparison of iterators after a table and tablet search:
```
if (auto table_i = tables.find(gid.table); table_i != tables.find(gid.table)) {
if (auto size_i = table_i->second.find(trange); size_i != table_i->second.find(trange)) {
```
This change also fixes a problem where the `migrate_tablet_size()` would crash with a `std::out_of_range` if the pending node was not present in load_stats.
This change fixes these two problems and moves the functionality into a separate method of `load_stats`. It also adds tests for the new method.
A version containing this bug has not been released yet, so no backport is needed.
Closesscylladb/scylladb#26946
* github.com:scylladb/scylladb:
load_stats: add test for migrate_tablet_size()
load_stats: fix problem with tablet size migration
ignored_nodes is sufficient in these cases. excluded_tablet_nodes
also includes left_nodes_rs, which are not needed
here — global_token_metadata_barrier runs the barrier only
on normal and transition nodes, not on left nodes.