There are two issues with it.
First, a small RPC helper struct carries database reference on board just to get feature state from it.
Second, sstable_streamer uses database as proxy to feature service.
This PR improves both.
Services dependencies improvement, not need to backport
Closesscylladb/scylladb#26989
* github.com:scylladb/scylladb:
sstables_loader: Get LOAD_AND_STREAM_ABORT_RPC_MESSAGE from messaging
sstables_loader: Keep bool on send_meta, not database reference
Correct the loop termination logic that previously caused
certain SSTables to be prematurely excluded, resulting in
lost mutations. This change ensures all relevant SSTables
are properly streamed and their mutations preserved.
The feature in question is about the way streaming sink-and-source
operate. Since sink-and-source itself are obtained from messaging
service, the feature is better be fetched from it too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The send_meta helper needs database reference to get feature_service
from it (to check some feature state). That's too much, features are
"immutable" throug the loader lifetime, it's enough to keep the boolean
member on send_meta.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
I noticed during tests that `maybe_get_primary_replica`
would not distribute uniformly the choice of primary replica
because `info.replicas` on some shards would have an order whilst
on others it'd be ordered differently, thus making the function choose
a node as primary replica multiple times when it clearly could've
chosen a different nodes.
This patch sorts the replica set before passing it through the
scope filter.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
This patch removes the restriction for streaming
to primary replica only within a scope.
Node scope streaming to primary replica is dissallowed.
Fixes#26584
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Current native restore does not support primary_replica_only, it is
hard-coded disabled and this may lead to data amplification issues.
This patch extends the restore REST API to accept a
primary_replica_only parameter and propagates it to
sstables_loader so it gets correctly passed to
load_and_stream.
Fixes#26584
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
The patch c543059f86 fixed the synchronization issue between tablet
split and load-and-stream. The synchronization worked only with
raft topology, and therefore was disabled with gossip.
To do the check, storage_service::raft_topology_change_enabled()
but the topology kind is only available/set on shard 0, so it caused
the synchronization to be bypassed when load-and-stream runs on
any shard other than 0.
The reason the reproducer didn't catch it is that it was restricted
to single cpu. It will now run with multi cpu and catch the
problem observed.
Fixes#22707
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#26730
When streaming SSTables across tablets, a single SSTable may be streamed to multiple tablets. The previous implementation unlinked SSTables immediately after streaming them for the first tablet, potentially making them partially unavailable for subsequent tablets. This patch replaces unlink() with mark_for_deletion() deferring actual unlinking till sstable::close_files.
test_tablets2::test_tablet_load_and_stream was enhanced to also verify that SSTables are removed after being streamed.
Fixes#26606
Backport is not required, although it is a bug fix, but it isn't visible. This is more of a preparatory fix for https://github.com/scylladb/scylladb/pull/26444.
Closesscylladb/scylladb#26622
* github.com:scylladb/scylladb:
test_tablets2: verify SSTable cleanup after tablet load and stream
tablet_sstable_streamer: replace unlink() call with mark_for_deletion()
Currently, tablet_sstable_streamer::get_primary_endpoints is out of
sync with tablet_map::get_primary_replica. The get_primary_replica
optimizes the choice of the replica so that the work is fairly
distributes among nodes. Meanwhile, get_primary_endpoints always
chooses the first replica.
Use get_primary_replica for get_primary_endpoints.
Fixes: https://github.com/scylladb/scylladb/issues/21883.
Closesscylladb/scylladb#26385
When streaming SSTables across tablets, a single SSTable may be streamed to multiple tablets.
The previous implementation unlinked SSTables immediately after streaming them for the first tablet,
potentially making them partially unavailable for subsequent tablets.
This patches replaces unlink() call with mark_for_deletion()
Load-and-stream is broken when running concurrently to the finalization step of tablet split.
Consider this:
1) split starts
2) split finalization executes barrier and succeed
3) load-and-stream runs now, starts writing sstable (pre-split)
4) split finalization publishes changes to tablet metadata
5) load-and-stream finishes writing sstable
6) sstable cannot be loaded since it spans two tablets
two possible fixes (maybe both):
1) load-and-stream awaits for topology to quiesce
2) perform split compaction on sstable that spans both sibling tablets
This patch implements # 1. By awaiting for topology to quiesce,
we guarantee that load-and-stream only starts when there's no
chance coordinator is handling some topology operation like
split finalization.
Fixes https://github.com/scylladb/scylladb/issues/26455.
Closesscylladb/scylladb#26456
* github.com:scylladb/scylladb:
test: Add reproducer for l-a-s and split synchronization issue
sstables_loader: Synchronize tablet split and load-and-stream
Load-and-stream is broken when running concurrently to the
finalization step of tablet split.
Consider this:
1) split starts
2) split finalization executes barrier and succeed
3) load-and-stream runs now, starts writing sstable (pre-split)
4) split finalization publishes changes to tablet metadata
5) load-and-stream finishes writing sstable
6) sstable cannot be loaded since it spans two tablets
two possible fixes (maybe both):
1) load-and-stream awaits for topology to quiesce
2) perform split compaction on sstable that spans both sibling tablets
This patch implements #1. By awaiting for topology to quiesce,
we guarantee that load-and-stream only starts when there's no
chance coordinator is handling some topology operation like
split finalization.
Fixes#26455.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently, `sstable::estimated_keys_for_range` works by
checking what fraction of Summary is covered by the given
range, and multiplying this fraction to the number of all keys.
Since computing things on Summary doesn't involve I/O (because Summary
is always kept in RAM), this is synchronous.
In a later patch, we will modify `sstable::estimated_keys_for_range`
so that it can deal with sstables that don't have a Summary
(because they use BTI indexes instead of BIG indexes).
In that case, the function is going to compute the relevant fraction
by using the index instead of Summary. This will require making
the function asynchronous. This is what we do in this patch.
(The actual change to the logic of `sstable::estimated_keys_for_range`
will come in the next patch. In this one, we only make it asynchronous).
Change return type of `check_needs_view_update_path()`. Instead of
retrning bool which tells whether to use staging directory (and register
to `view_update_generator`) or use normal directory.
Now the function returns enum with possible values:
- `normal_directory` - use normal directory for the sstable
- `staging_directly_to_generator` - use staging directory and register
to `view_update_generator`
- `staging_managed_by_vbc` - use staging directory but don't register it
to `view_update_generator` but create view building tasks for
later
The third option is new, it's used when the table has any view which is
in building process currrently. In this case, registering it to `view_update_generator`
prematurely may lead to base-view inconsistency
(for example when a replica is in a pending state).
This patch adds the new option in nodetool, patches the
load_new_ss_tables REST request with a new parameter and
skips the reshape step in refresh if this flag is passed.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Closesscylladb/scylladb#24409Fixes: #24365
When starting, the loader prints all its arguments into logs. Recently
added skip-cleanup one is not included, but it's good to have one too.
refs: #24139
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#24206
* The new abort command explicitly represents the abortion flow in
mutation streaming, clearly identifying operations that are
intentionally aborted. This reduces ambiguity around failures in
streaming operations.
* In the error-handling section, aborted operations are now
explicitly marked as the cause of the streaming failure. This allows
us to differentiate them from genuine errors and appropriately adjust
log severity to reduce unnecessary alarm caused by aborted streaming
failures.
* To avoid alarming users with excessive error logs, log severity for
streaming failures caused by aborted operations has been downgraded.
This helps keep logs cleaner and prevents unnecessary concerns.
* A new feature has been added to ensure mixed clusters during updates
do not receive unsupported RPC messages, improving compatibility and
stability.
fixes: https://github.com/scylladb/scylladb/issues/23076Closesscylladb/scylladb#23214
The intention was to fail the REST API call in case --skip-cleanup is
requested for --load-and-stream loading. The corresponding if expression
is checking something else :( despite log message is correct.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#24208
Just copy the load_and_stream and primary_replica_only logic, this new
option is the same in this sense.
Throw if it's specified with the load_and_stream one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Just put the boolean into the callstack between API and distributed
loader to reduce the churn in the next patches. No functional changes,
flag is false and unused.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This change addresses a critical race condition in the sstables_loader where `get_progress()` could access invalid `progress_holder` instances after `release_resources()` destroyed them.
Problem:
- Progress tracking uses two components: `_progress_state` (tracks state) and `_progress_per_shard` (sharded service with actual progress data)
- `get_progress()` first checks if `_progress_state` is initialized, then accumulates progress from `_progress_per_shard`
- As both functions are coroutines, `get_progress()` could be preempted after state check but before accessing `_progress_per_shard`
- If `release_resources()` runs during this preemption, it destroys the `progress_holder` instances in `_progress_per_shard`, causing `get_progress()` to access invalid memory.
Solution:
- Implemented shared/exclusive locking to protect access to both state and sharded progress data
- Multiple `get_progress()` calls can execute in parallel (shared access)
- `release_resources()` acquires exclusive access before modifying resources
- This prevents potential memory corruption and ensures consistent progress reporting
Fixes#23801
---
this change addresses a racing related to tracking the restore progress from S3 using scylla's native API, which is not used in production yet, hence no need to backport.
Closesscylladb/scylladb#23808
* github.com:scylladb/scylladb:
sstables_loader: fix the indent
sstables_loader: fix the racing between get_progress() and release_resources()
Interval map is very susceptible to quadratic space behavior when
it's flooded with many entries overlapping all (or most of)
intervals, since each such entry will have presence on all
intervals it overlaps with.
A trigger we observed was memtable flush storm, which creates many
small "L0" sstables that spans roughly the entire token range.
Since we cannot rely on insertion order, solution will be about
storing sstables with such wide ranges in a vector (unleveled).
There should be no consequence for single-key reads, since upper
layer applies an additional filtering based on token of key being
queried.
And for range scans, there can be an increase in memory usage,
but not significant because the sstables span an wide range and
would have been selected in the combined reader if the range of
scan overlaps with them.
Anyway, this is a protection against storm of memtable flushes
and shouldn't be the common scenario.
It works both with tablets and vnodes, by adjusting the token
range spanned by compaction group accordingly.
Fixes#23634.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This change addresses a critical race condition in the sstables_loader
where `get_progress()` could access invalid `progress_holder` instances
after `release_resources()` destroyed them.
Problem:
- Progress tracking uses two components: `_progress_state` (tracks state)
and `_progress_per_shard` (sharded service with actual progress data)
- `get_progress()` first checks if `_progress_state` is initialized, then
accumulates progress from `_progress_per_shard`
- As both functions are coroutines, `get_progress()` could be preempted
after state check but before accessing `_progress_per_shard`
- If `release_resources()` runs during this preemption, it destroys the
`progress_holder` instances in `_progress_per_shard`, causing
`get_progress()` to access invalid memory.
Solution:
- Implemented shared/exclusive locking to protect access to both state
and sharded progress data
- Multiple `get_progress()` calls can execute in parallel (shared access)
- `release_resources()` acquires exclusive access before modifying resources
- This prevents potential memory corruption and ensures consistent
progress reporting
Fixes#23801
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The member in question is unconditionally .stop()-ed in task's
release_resources() method, however, it may happen that the thing wasn't
.start()-ed in the first place. Start happens in the middle of the
task's .run() method and there can be several reasons why it can be
skipped -- e.g. the task is aborted early, or collecting sstables from
S3 throws.
fixes: #23231
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#23483
When load-and-stream finishes it may call sstable::unlink() method to
drop the loaded (and streamed) sstable. Before calling it it prints a
log message about its intention that includes component_filenames()
vector. This log message is ugly in several ways.
First, it prints only recognized components, while unlink() method
unlinks all of them, so it's sort of misleading (it doesn't seem that
anyone ever read this message IRL though)
Next, that's the only place that is _that_ verbose about sstable
unlinking. "Common" unlinking paths don't print that much info.
Finally, the log message happen in debug level, so it's hardly ever
appears in any logs, but collecting several filenames takes time.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Previously, download_task_impl's destructor would destroy per-shard progress
elements on whatever shard the task was destroyed on. In multi-shard
environments, this caused "shared_ptr accessed on non-owner cpu" errors when
attempting to free memory allocated on a different shard.
Fix by:
- Convert progress_per_shard into a sharded service
- Stop the service on owner shards during cleanup using coroutines
- Add operator+= to stream_progress to leverage seastar's built-in adder
instead of a custom adder struct
Alternative approaches considered:
1. Using foreign_ptr: Rejected as it would require interface changes
that complicate stream delegation. foreign_ptr manages the underlying
pointee with another smart pointer but does not expose the smart
pointer instance in its APIs, making it impossible to use
shared_ptr<stream_progress> in the interface.
2. Using vector<stream_progress>: Rejected for similar interface
compatibility reasons.
This solution maintains the existing interfaces while ensuring proper
cross-shard cleanup.
Fixesscylladb/scylladb#22759
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Replace remaining uses of boost::adaptors::transformed with std::views::transform
to reduce Boost dependencies, following the migration pattern established in
bab12e3a. This change addresses recently merged code that reintroduced Boost
header dependencies through boost::adaptors::transformed usage.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22365
We restore a snapshot of table by streaming the sstables of
the given snapshot of the table using
`sstable_streamer::stream_sstable_mutations()` in batches. This function
reads mutations from a set of sstables, and streams them to the target
nodes. Due to the limit of this function, we are not able to track the
progress in bytes.
Previously, progress tracking used individual sstables as units, which caused
inaccuracies with tablet-distributed tables, where:
- An sstable spanning multiple tablets could be counted multiple times
- Progress reporting could become misleading (e.g., showing "40" progress
for a table with 10 sstables)
This change introduces a more robust progress tracking method:
- Use "batch" as the unit of progress instead of individual sstables.
Each batch represents a tablet when restoring a table snapshot if
the tablet being restored is distributed with tablets. When it comes
to tables distributed with vnode, each batch represents an sstable.
- Stream sstables for each tablet separately, handling both partially and
fully contained sstables
- Calculate progress based on the total number of sstables being streamed
- Skip tablet IDs with no owned tokens
For vnode-distributed tables, the number of "batches" directly corresponds
to the number of sstables, ensuring:
- Consistent progress reporting across different table distribution models
- Simplified implementation
- Accurate representation of restore progress
The new approach provides a more reliable and uniform method of tracking
restoration progress across different table distribution strategies.
Also, Corrected the use of `_sstables.size()` in
`sstable_streamer::stream_sstables()`. It addressed a review comment
from Pavel that was inadvertently overlooked during previous rebasing
the commit of 5ab4932f34.
Fixesscylladb/scylladb#21816
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21841
Semi-mechanical change that adds newly introduced "scope" parameter to
all the functions between API methods and the low-level streamer object.
No real functional changes. API methods set it to "all" to keep existing
behavior.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Loading and streaming tablets has pre-filtering loop that walks the
tablet map sorts sstables into three lists:
- fully contained in one of map ranges
- partially overlapping with the map
- not intersecting with the map
Sstables from the 3rd list is immediately dropped from the process and
for the remaining two core load-and-stream happens.
This filtering deserves more care from the newly introduced scope. When
a tablet replica set doesn't get in the scope, the whole entry can be
disregarded, because load-and-stream will only do its "load" part anyway
and all mutations from it will be ignored.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's been some discussions of how primary replica only streaming
schould interact with the scope. There are two options how to consider
this combination:
- find where the primary replica is and handle it if it's within the
requested sope
- within the requested scope find the primary replica for that subset of
nodes, then handle it
There's also some itermediate solution: suppoer "primary replica in DC"
and reject all other combinations.
Until decided which way is correct, let's disable this configuration.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently load-and-stream sends mutations to whatever node is considered
to be a "replica" for it. One exception is the "primary-replica-only"
flag that can be requested by the user.
This patch introduces a "scope" parameter that limits streaming part in
where it can stream the data to with 4 options:
- all -- current way of doing things, stream to wherever needed
- dc -- only stream to nodes that live in the same datacenter
- rack -- only stream to nodes that live in the same rack
- node -- only "stream" to current node
It's not yet configurable and streamer object initializes itself with
"all" mode. Will be changed later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Preparational patch. Next will add more code to get_endpoints() that
will need to work for both if/else branches, this change helps having
less churn later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fixes#20717
Enables abortable interface and propagates abort_source to all s3 objects used for reading the restore data.
Note: because restore is done on each shard, we have to maintain a per-shard abort source proxy for each, and do a background per-shard abort on abort call. This is synced at the end of "run()".
Abort source is added as an optional parameter to s3 storage and the s3 path in distributed loader.
There is no attempt to "clean up" an aborted restore. As we read on a mutation level from remote sstables, we should not cause incomplete sstables as such, even though we might end up of course with partial data restored.
Closesscylladb/scylladb#21567
* github.com:scylladb/scylladb:
test_backup: Add restore abort test case
sstables_loader: Make restore task abortable
distributed_loader: Add optional abort_source to get_sstables_from_object_store
s3_storage: Add optional abort_source to params/object
s3::client: Make "readable_file" abortable
"
This series converts repair, streaming and node_ops (and some parts of
alternator) to work on host ids instead of ips. This allows to remove
a lot of (but not all) functions that work on ips from effective
replication map.
CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/13830/
Refs: scylladb/scylladb#21777
"
* 'gleb/move-to-host-id-more' of github.com:scylladb/scylla-dev:
locator: topology: remove no longer use get_all_ips()
gossiper: change get_unreachable_nodes to host ids
locator: drop no longer used ip based functions from effective replication map and friends
test: move network_topology_strategy_test and token_metadata_test to use host id based APIs
replica/database: drop usage of ip in favor of host id in get_keyspace_local_ranges
replica/mutation_dump: use host ids instead of ips
alternator: move ttl to work with host ids instead of ips
storage_service: move node_ops code to use host ids instead of host ips
streaming: move streaming code to use host ids instead of host ips
repair: move repair code to use host ids instead of host ips
gossiper: add get_unreachable_host_ids() function
locator: topology: add more function that return host ids to effective replication map
locator: add more function that return host ids to effective replication map
Standardize on one range library to reduce dependency load.
Unfortunately, std::views::concat (the replacement for boost::join),
is C++26 only. We use two separate inserts to the result vector to
compensate, and rationalize it by saying that boost::join() is likely
slow due to the need for type-erasure.
Closesscylladb/scylladb#21834
Fixes#20717
Enables abortable interface and propagates abort_source to
all s3 objects used for reading the restore data.
Note: because restore is done on each shard, we have to maintain
a per-shard abort source proxy for each, and do a background
per-shard abort on abort call. This is synced at the end of "run()"
v2:
* Simplify abortability by using a function-local gate instead.
these unused includes are identified by clang-include-cleaner. after
auditing the source files, all of the reports have been confirmed.
please note, because `mutation/mutation.hh` does not include
`seastar/coroutine/maybe_yield.hh` anymore, and quite a few source
files were relying on this header to bring in the declaration of
`maybe_yield()`, we have to include this header in the places where
this symbol is used. the same applies to `seastar/core/when_all.hh`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Previously, the progress of download_task_impl launched by the "restore" API
was not tracked. Since restore operations can involve large data transfers,
this makes it difficult for users to monitor progress.
The restore process happens in two sequential steps:
1. Open specified SSTables from object storage
2. Download and stream mutation fragments from the opened SSTables to
mapped destinations
While both steps contribute to overall progress, they use different units
of measurement, making a unified progress metric challenging. Because
the load-and-stream step (step 2) is the largest time-consuming part of the
restore. This change implements progress tracking for this step as an
initial improvement to provide users with partial visibility into the
restore operation.
Fixesscylladb/scylladb#21427
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Replace manual vector iteration with ranges library to preserve batch size
information. When streaming SSTable mutations, we need to track progress
across batches. The previous implementation used a loop to move elements
from the vector's end, but this approach lost the batch size information
since the SSTable set was moved away during streaming.
Now use std::ranges to take elements from the vector's end instead of
manual iteration. This preserves the original batch size, enabling accurate
progress tracking which will be implemented in a follow-up commit.
Technical changes:
- Replace manual vector iteration with ranges::take_view
- Preserve batch size information for progress tracking
- Maintain existing batch processing behavior
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
After d1db17d490, the processed SSTable counter remained at 0 while
streaming progress was still being displayed. This fix properly tracks
and displays streaming progress by:
- Moving SSTable counter (`nr_sst_current`) to `sstable_streamer::stream_sstables()`
- Generating UUID at the streaming initialization
- Relocating progress reporting to `stream_sstables()` for accurate tracking
This ensures the progress indicator correctly reflects the actual number
of processed SSTables during streaming operations.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
the only user of `sstable_streamer::stream_sstable_mutations()` is
`sstable_st6reamer::stream_sstables()`, so mark this member function as
private.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Fix indentation regression from d1db17d490 where the function body of
`sstable_streamer::stream_sstable_mutations()` was left incorrectly
indented after the function was extracted to decouple streaming from
sstable selection.
Pure style fix, no functional changes.
in this change, we correct the indent.
Refs d1db17d490
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>