This change addresses a critical race condition in the sstables_loader where `get_progress()` could access invalid `progress_holder` instances after `release_resources()` destroyed them.
Problem:
- Progress tracking uses two components: `_progress_state` (tracks state) and `_progress_per_shard` (sharded service with actual progress data)
- `get_progress()` first checks if `_progress_state` is initialized, then accumulates progress from `_progress_per_shard`
- As both functions are coroutines, `get_progress()` could be preempted after state check but before accessing `_progress_per_shard`
- If `release_resources()` runs during this preemption, it destroys the `progress_holder` instances in `_progress_per_shard`, causing `get_progress()` to access invalid memory.
Solution:
- Implemented shared/exclusive locking to protect access to both state and sharded progress data
- Multiple `get_progress()` calls can execute in parallel (shared access)
- `release_resources()` acquires exclusive access before modifying resources
- This prevents potential memory corruption and ensures consistent progress reporting
Fixes#23801
---
this change addresses a racing related to tracking the restore progress from S3 using scylla's native API, which is not used in production yet, hence no need to backport.
Closesscylladb/scylladb#23808
* github.com:scylladb/scylladb:
sstables_loader: fix the indent
sstables_loader: fix the racing between get_progress() and release_resources()
Interval map is very susceptible to quadratic space behavior when
it's flooded with many entries overlapping all (or most of)
intervals, since each such entry will have presence on all
intervals it overlaps with.
A trigger we observed was memtable flush storm, which creates many
small "L0" sstables that spans roughly the entire token range.
Since we cannot rely on insertion order, solution will be about
storing sstables with such wide ranges in a vector (unleveled).
There should be no consequence for single-key reads, since upper
layer applies an additional filtering based on token of key being
queried.
And for range scans, there can be an increase in memory usage,
but not significant because the sstables span an wide range and
would have been selected in the combined reader if the range of
scan overlaps with them.
Anyway, this is a protection against storm of memtable flushes
and shouldn't be the common scenario.
It works both with tablets and vnodes, by adjusting the token
range spanned by compaction group accordingly.
Fixes#23634.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This change addresses a critical race condition in the sstables_loader
where `get_progress()` could access invalid `progress_holder` instances
after `release_resources()` destroyed them.
Problem:
- Progress tracking uses two components: `_progress_state` (tracks state)
and `_progress_per_shard` (sharded service with actual progress data)
- `get_progress()` first checks if `_progress_state` is initialized, then
accumulates progress from `_progress_per_shard`
- As both functions are coroutines, `get_progress()` could be preempted
after state check but before accessing `_progress_per_shard`
- If `release_resources()` runs during this preemption, it destroys the
`progress_holder` instances in `_progress_per_shard`, causing
`get_progress()` to access invalid memory.
Solution:
- Implemented shared/exclusive locking to protect access to both state
and sharded progress data
- Multiple `get_progress()` calls can execute in parallel (shared access)
- `release_resources()` acquires exclusive access before modifying resources
- This prevents potential memory corruption and ensures consistent
progress reporting
Fixes#23801
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The member in question is unconditionally .stop()-ed in task's
release_resources() method, however, it may happen that the thing wasn't
.start()-ed in the first place. Start happens in the middle of the
task's .run() method and there can be several reasons why it can be
skipped -- e.g. the task is aborted early, or collecting sstables from
S3 throws.
fixes: #23231
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#23483
When load-and-stream finishes it may call sstable::unlink() method to
drop the loaded (and streamed) sstable. Before calling it it prints a
log message about its intention that includes component_filenames()
vector. This log message is ugly in several ways.
First, it prints only recognized components, while unlink() method
unlinks all of them, so it's sort of misleading (it doesn't seem that
anyone ever read this message IRL though)
Next, that's the only place that is _that_ verbose about sstable
unlinking. "Common" unlinking paths don't print that much info.
Finally, the log message happen in debug level, so it's hardly ever
appears in any logs, but collecting several filenames takes time.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Previously, download_task_impl's destructor would destroy per-shard progress
elements on whatever shard the task was destroyed on. In multi-shard
environments, this caused "shared_ptr accessed on non-owner cpu" errors when
attempting to free memory allocated on a different shard.
Fix by:
- Convert progress_per_shard into a sharded service
- Stop the service on owner shards during cleanup using coroutines
- Add operator+= to stream_progress to leverage seastar's built-in adder
instead of a custom adder struct
Alternative approaches considered:
1. Using foreign_ptr: Rejected as it would require interface changes
that complicate stream delegation. foreign_ptr manages the underlying
pointee with another smart pointer but does not expose the smart
pointer instance in its APIs, making it impossible to use
shared_ptr<stream_progress> in the interface.
2. Using vector<stream_progress>: Rejected for similar interface
compatibility reasons.
This solution maintains the existing interfaces while ensuring proper
cross-shard cleanup.
Fixesscylladb/scylladb#22759
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Replace remaining uses of boost::adaptors::transformed with std::views::transform
to reduce Boost dependencies, following the migration pattern established in
bab12e3a. This change addresses recently merged code that reintroduced Boost
header dependencies through boost::adaptors::transformed usage.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22365
We restore a snapshot of table by streaming the sstables of
the given snapshot of the table using
`sstable_streamer::stream_sstable_mutations()` in batches. This function
reads mutations from a set of sstables, and streams them to the target
nodes. Due to the limit of this function, we are not able to track the
progress in bytes.
Previously, progress tracking used individual sstables as units, which caused
inaccuracies with tablet-distributed tables, where:
- An sstable spanning multiple tablets could be counted multiple times
- Progress reporting could become misleading (e.g., showing "40" progress
for a table with 10 sstables)
This change introduces a more robust progress tracking method:
- Use "batch" as the unit of progress instead of individual sstables.
Each batch represents a tablet when restoring a table snapshot if
the tablet being restored is distributed with tablets. When it comes
to tables distributed with vnode, each batch represents an sstable.
- Stream sstables for each tablet separately, handling both partially and
fully contained sstables
- Calculate progress based on the total number of sstables being streamed
- Skip tablet IDs with no owned tokens
For vnode-distributed tables, the number of "batches" directly corresponds
to the number of sstables, ensuring:
- Consistent progress reporting across different table distribution models
- Simplified implementation
- Accurate representation of restore progress
The new approach provides a more reliable and uniform method of tracking
restoration progress across different table distribution strategies.
Also, Corrected the use of `_sstables.size()` in
`sstable_streamer::stream_sstables()`. It addressed a review comment
from Pavel that was inadvertently overlooked during previous rebasing
the commit of 5ab4932f34.
Fixesscylladb/scylladb#21816
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21841
Semi-mechanical change that adds newly introduced "scope" parameter to
all the functions between API methods and the low-level streamer object.
No real functional changes. API methods set it to "all" to keep existing
behavior.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Loading and streaming tablets has pre-filtering loop that walks the
tablet map sorts sstables into three lists:
- fully contained in one of map ranges
- partially overlapping with the map
- not intersecting with the map
Sstables from the 3rd list is immediately dropped from the process and
for the remaining two core load-and-stream happens.
This filtering deserves more care from the newly introduced scope. When
a tablet replica set doesn't get in the scope, the whole entry can be
disregarded, because load-and-stream will only do its "load" part anyway
and all mutations from it will be ignored.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's been some discussions of how primary replica only streaming
schould interact with the scope. There are two options how to consider
this combination:
- find where the primary replica is and handle it if it's within the
requested sope
- within the requested scope find the primary replica for that subset of
nodes, then handle it
There's also some itermediate solution: suppoer "primary replica in DC"
and reject all other combinations.
Until decided which way is correct, let's disable this configuration.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently load-and-stream sends mutations to whatever node is considered
to be a "replica" for it. One exception is the "primary-replica-only"
flag that can be requested by the user.
This patch introduces a "scope" parameter that limits streaming part in
where it can stream the data to with 4 options:
- all -- current way of doing things, stream to wherever needed
- dc -- only stream to nodes that live in the same datacenter
- rack -- only stream to nodes that live in the same rack
- node -- only "stream" to current node
It's not yet configurable and streamer object initializes itself with
"all" mode. Will be changed later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Preparational patch. Next will add more code to get_endpoints() that
will need to work for both if/else branches, this change helps having
less churn later.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fixes#20717
Enables abortable interface and propagates abort_source to all s3 objects used for reading the restore data.
Note: because restore is done on each shard, we have to maintain a per-shard abort source proxy for each, and do a background per-shard abort on abort call. This is synced at the end of "run()".
Abort source is added as an optional parameter to s3 storage and the s3 path in distributed loader.
There is no attempt to "clean up" an aborted restore. As we read on a mutation level from remote sstables, we should not cause incomplete sstables as such, even though we might end up of course with partial data restored.
Closesscylladb/scylladb#21567
* github.com:scylladb/scylladb:
test_backup: Add restore abort test case
sstables_loader: Make restore task abortable
distributed_loader: Add optional abort_source to get_sstables_from_object_store
s3_storage: Add optional abort_source to params/object
s3::client: Make "readable_file" abortable
"
This series converts repair, streaming and node_ops (and some parts of
alternator) to work on host ids instead of ips. This allows to remove
a lot of (but not all) functions that work on ips from effective
replication map.
CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/13830/
Refs: scylladb/scylladb#21777
"
* 'gleb/move-to-host-id-more' of github.com:scylladb/scylla-dev:
locator: topology: remove no longer use get_all_ips()
gossiper: change get_unreachable_nodes to host ids
locator: drop no longer used ip based functions from effective replication map and friends
test: move network_topology_strategy_test and token_metadata_test to use host id based APIs
replica/database: drop usage of ip in favor of host id in get_keyspace_local_ranges
replica/mutation_dump: use host ids instead of ips
alternator: move ttl to work with host ids instead of ips
storage_service: move node_ops code to use host ids instead of host ips
streaming: move streaming code to use host ids instead of host ips
repair: move repair code to use host ids instead of host ips
gossiper: add get_unreachable_host_ids() function
locator: topology: add more function that return host ids to effective replication map
locator: add more function that return host ids to effective replication map
Standardize on one range library to reduce dependency load.
Unfortunately, std::views::concat (the replacement for boost::join),
is C++26 only. We use two separate inserts to the result vector to
compensate, and rationalize it by saying that boost::join() is likely
slow due to the need for type-erasure.
Closesscylladb/scylladb#21834
Fixes#20717
Enables abortable interface and propagates abort_source to
all s3 objects used for reading the restore data.
Note: because restore is done on each shard, we have to maintain
a per-shard abort source proxy for each, and do a background
per-shard abort on abort call. This is synced at the end of "run()"
v2:
* Simplify abortability by using a function-local gate instead.
these unused includes are identified by clang-include-cleaner. after
auditing the source files, all of the reports have been confirmed.
please note, because `mutation/mutation.hh` does not include
`seastar/coroutine/maybe_yield.hh` anymore, and quite a few source
files were relying on this header to bring in the declaration of
`maybe_yield()`, we have to include this header in the places where
this symbol is used. the same applies to `seastar/core/when_all.hh`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Previously, the progress of download_task_impl launched by the "restore" API
was not tracked. Since restore operations can involve large data transfers,
this makes it difficult for users to monitor progress.
The restore process happens in two sequential steps:
1. Open specified SSTables from object storage
2. Download and stream mutation fragments from the opened SSTables to
mapped destinations
While both steps contribute to overall progress, they use different units
of measurement, making a unified progress metric challenging. Because
the load-and-stream step (step 2) is the largest time-consuming part of the
restore. This change implements progress tracking for this step as an
initial improvement to provide users with partial visibility into the
restore operation.
Fixesscylladb/scylladb#21427
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Replace manual vector iteration with ranges library to preserve batch size
information. When streaming SSTable mutations, we need to track progress
across batches. The previous implementation used a loop to move elements
from the vector's end, but this approach lost the batch size information
since the SSTable set was moved away during streaming.
Now use std::ranges to take elements from the vector's end instead of
manual iteration. This preserves the original batch size, enabling accurate
progress tracking which will be implemented in a follow-up commit.
Technical changes:
- Replace manual vector iteration with ranges::take_view
- Preserve batch size information for progress tracking
- Maintain existing batch processing behavior
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
After d1db17d490, the processed SSTable counter remained at 0 while
streaming progress was still being displayed. This fix properly tracks
and displays streaming progress by:
- Moving SSTable counter (`nr_sst_current`) to `sstable_streamer::stream_sstables()`
- Generating UUID at the streaming initialization
- Relocating progress reporting to `stream_sstables()` for accurate tracking
This ensures the progress indicator correctly reflects the actual number
of processed SSTables during streaming operations.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
the only user of `sstable_streamer::stream_sstable_mutations()` is
`sstable_st6reamer::stream_sstables()`, so mark this member function as
private.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Fix indentation regression from d1db17d490 where the function body of
`sstable_streamer::stream_sstable_mutations()` was left incorrectly
indented after the function was extracted to decouple streaming from
sstable selection.
Pure style fix, no functional changes.
in this change, we correct the indent.
Refs d1db17d490
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
in 787ea4b1, we introduced `_prefix` and `_sstables` member variables
to `sstables_loader::download_task_impl`, replacing the functionality
of `_snapshot_name`. However, we overlooked removing the now-obsolete
`_snapshot_name` variable.
this commit removes the unused `_snapshot_name` member variable to
improve code cleanliness and prevent potential confusion.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#20969
The restore-from-s3 task uses load-and-stream internally which, in turn, unlinks loaded sstables on success. That's not what user expects when it restores from backup, objects should remain in bucket afterwards.
Closesscylladb/scylladb#20947
* github.com:scylladb/scylladb:
test: Add check that restored-from objects are not removed
sstables_loader: Dont unlink sstables when restoring from S3
sstables_loader: Make primary_replica_only bool_class RAII field
When load_and_stream() completes, all sstables that were loaded (and
streamed) are unlinked. This is wrong for the restore-from-s3 task, as
removing objects from backup storage is not what user expects.
Fix it by adding a boolean to streamer class, and set it to false (well,
bool_class<>::no) for restore task.
fixes: #20938
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This boolean is currently passed all the way around as pure bool
argument. And it's only needed in a single get_endpoints() method that
calculates the target endpoints.
This patch places this bool on class streamer, so that the call chain
arguments are not polluted, and converts it to bool_class.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
before this change, we enumerate the sstables tracked by the
system.sstables table, and restore them when serving
requests to "storage_service/restore" API. this works fine with
"storage_service/backup" API. but this "restore" API cannot be
used as a drop-in replacement of the rclone based API currently
used by scylla-manager.
in order to fill the gap, in this change:
* add the "prefix" parameter for specifying the shared prefix of
sstables
* add the "sstables" parameter for specifying the list of TOC
components of sstables
* remove the "snapshot" parameter, as we don't encode the prefix
on scylla's end anymore.
* make the "table" parameter mandatory.
Fixesscylladb/scylladb#20461
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The method starts a task that uses sstables_loader load-and-stream
functionality to bring new sstables into the cluster. The existing
load-and-stream picks up sstables from upload/ directory, the newly
introduced task collects them from S3 bucket and given prefix (that
correspond to the path where backup API method put them).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The storage_manager maintains set of clients to configured object
storage(s). The sstables loader is going to spawn tasks that will talk
to to those storages, thus it needs the storage manager to get the
clients clients from.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This service is going to start tasks managed by task manager. For that,
it should have its module set up and registered.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now the loading code has two different paths, and only one of them
switches sched group. It's cleaner and more natural to switch the sched
group in the loader itself, so that all code paths run in it and don't
care switching.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we include `fmt/ranges.h` and/or `fmt/std.h`
for formatting the container types, like vector, map
optional and variant using {fmt} instead of the homebrew
formatter based on operator<<.
with this change, the changes adding fmt::formatter and
the changes using ostream formatter explicitly, we are
allowed to drop `FMT_DEPRECATED_OSTREAM` macro.
Refs scylladb#13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Loader wants to print set of sstables' names. For that it collects names
into a dedicated vector, then prints it using fmt/ranges facility.
There's a way to achieve the same goal without allocating extra vector
with names -- use fmt::format() and pass it a range converting sstables
into their names.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18159
The loader is writing to pending replica even when write selector is set
to previous. If migration is reverted, then the writes won't be rolled
back as it assumes pending replicas weren't written to yet. That can
cause data resurrection if tablet is later migrated back into the same
replica.
NOTE: write selector is handled correctly when set to next, because
get_natural_endpoints() will return the next replica set, and none
of the replicas will be considered leaving. And of course, selector
set to both is also handled correctly.
Fixes#17892.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#17902
Affects load-and-stream for tablets only.
The intention is that only this loop is responsible for detecting
exhausted sstables and then discarding them for next iterations:
while (sstable_it != _sstables.rend() && exhausted(*sstable_it)) {
sstable_it++;
}
But the loop which consumes non exhausted sstables, on behalf of
each tablet, was incorrectly advancing the iterator, despite the
sstable wasn't considered exhausted.
Fixes#17733.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#17899
Even though taking erm blocks migration, it cannot prevent the
load-and-stream to start while a migration is going on, erm
only prevents migration from advancing.
With tablets, new data will be streamed to pending replica too if
the write replica selector, in transition metadata, is set to both.
If migration is at a later stage where only new replica is written
to, then data is streamed only to new replica as selector is set
to next (== new replica set).
primary_replica_only flag is handled by only streaming to pending
if the primary replica is the one leaving through migration.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Similar treatment to repair is given to load-and-stream.
Jumps into a new streaming session for every tablet, so we guarantee
data will be segregated into tablets co-habiting the same shard.
Fixes#17315.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>