Use coroutine::as_future() to avoid exceptions taking flight and
triggering expensive stack-unwinding.
Especially bad for common exceptions like timeouts.
Not using coroutine::try_future(), because on the error path, the
querier has to be closed.
Table needs flush if not all its memtable lists are empty.
To be used in the next patch for a unit test.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
With the additional file_stat overload introduced in
3e9b071838, use the opened
directory for more efficient, relative-path based stat.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The functor is called both on the data directory as well
as on the staging directory, so the warning printed if the
found file is not the same inode should print the given path,
not datadir / name (as was copy and pasted).
Refs #27635
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
staging is based off of datadir, not snapshot_dir.
the issue was introduced in f5ca3657e2.
Refs #27635
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
* seastar f0298e40...4dcd4df5 (29):
> file: provide a default implementation for file_impl::statat
> util: Genralize memory_data_sink
> defer: Replace static_assert() with concept
> treewide: drop the support of fmtlib < 9.0.0
> test: Improve resilience of netsed scheduling fairness test
> Merge 'file: Use query_device_alignment_info in blkdev_alignments ' from Kefu Chai
file: Put alignment helpers in anonymous namespace
file: Use query_device_alignment_info in blkdev_alignments
> Merge 'file: Query physical block size and minimum I/O size' from Kefu Chai
file: Apply physical_block_size override to filesystem files
file: Use designated initializers in xfs_alignments
iotune: Add physical block size detection
disk_params: Add support for physical_block_size overrides from io_properties.yaml
block_device: Query alignment requirements separately for memory and I/O
> Merge 'json: formatter: fix formatting of std:string_view' from Benny Halevy
json: formatter: fix formatting of std:string_view
json: formatter: make sure std::string_view conforms to is_string_like
Fixes#27887
> demos:improve the output of demo_with_io_intent() in file_demo
> test: Add accept() vs accept_abort() socket test
> file: Refine posix_file_impl alignments initialization
> Add file::statat and a corresponding file_stat overload
> cmake: don't compile memcached app for API < 9
> Merge 'Revert to ~old lifetime semantics for lvalues passed to then()-alikes' from Travis Downs
future: adjust lifetime for lvalue continuations
future: fix value class operator()
> pollable_fd: Unfriend everything
> Merge 'file: experimental_list_directory: use buffered generator' from Benny Halevy
file: experimental_list_directory: use buffered generator
file: define list_directory_generator_type
> Merge 'Make datagram API use temporary_buffer<>-s' from Pavel Emelyanov
net: Deprecate datagram::get_data() returning packet
memcache: Fix indentation after previous patch
memcache: Use new datagram::get_buffers() API
dns: Use new datagram::get_buffers() API
tests: Use new datagram::get_buffers() API
demo: Use new datagram::get_buffers() API
udp: Make datagram implementations return span of temporary_buffer-s
> Merge 'Remove callback from timer_set::complete()' from Pavel Emelyanov
reactor: Fix indentation after previous patch
timers: Remove enabling callback from timer_set::complete()
> treewide: avoid 'static sstring' in favor of 'constexpr string_view'
> resource: Hide hwloc from public interface
> Merge 'Fix handle_exception_type for lvalues' from Travis Downs
futures_test: compile-time tests
function_traits: handle reference_wrapper
> posix_data_sink_impl: Assert to guard put UB
> treewide: fix build with `SEASTAR_SSTRING` undefined
> avoid deprecation warnings for json_exception
> `util/variant_utils`: correct type deduction for `seastar::visit`
> net/dns: fixed socket concurrent access
> treewide: add missing headers
> Merge 'Remove posix file helper file_read_state class' from Pavel Emelyanov
file: Remove file_read_state
test: Add a test for posix_file_impl::do_dma_read_bulk()
> membarrier: simplify locking
Adjust scylla to the following changes in scylla:
- file_stat became polymorphic
- needs explicit inference in table::snapshot_exists, table::get_snapshot_details
- file::experimental_list_directory now returns list_directory_generator_type
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#27916
Irritated by prevailing spellchecker comments attached to every PR, I aim to fix them all.
No need to backport, just cosmetic changes.
Closesscylladb/scylladb#27897
* github.com:scylladb/scylladb:
treewide: fix some spelling errors
codespell: ignore `iif` and `tread`
- table, storage_group: add compaction_group_count
- And use to reserve vector capacity before adding an item per compaction_group
- table: reduce allocations by using for_each_compaction_group rather than compaction_groups()
- compaction_groups() may allocate memory, but when called from a synchronous call site, the caller can use for_each_compaction_group instead.
* Improvement, no backport needed
Closesscylladb/scylladb#27479
* github.com:scylladb/scylladb:
table: reduce allocations by using for_each_compaction_group rather than compaction_groups()
replica: storage_group: rename compaction_groups to compaction_groups_immediate
compaction_groups_immediate() may allocate memory, but when called from a
synchronous call site, the caller can use for_each_compaction_group
instead to iterate over the compaction groups with no extra allocations.
Calling compaction_groups_immediate() is still required from an async
context when we want to "sample" the compaction groups
so we can safely iterate over them and yield in the inner loop.
Also, some performance insensitive call sites using
compaction_groups_immediate had been left as they are
to keep them simple.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The method in question knows that it writes snapshot to local filesystem and uses this actively. This PR relaxes this knowledge and splits the logic into two parts -- one that orchestrates sstables snapshot and collects the necessary metadata, and the code that writes the metadata itself.
Closesscylladb/scylladb#27762
* github.com:scylladb/scylladb:
table: Move snapshot_file_set to table.cc
table: Rename and move snapshot_on_all_shards() method
table: Ditch jsondir variable
table, sstables: Pass snapshot name to sstable::snapshot()
table: Use snapshot_writer in write_manifest()
table: Use snapshot_writer in write_schema_as_cql()
table: Add snapshot_writer::sync()
table: Add snapshot_writer::init()
table: Introduce snapshot_writer
table: Move final sync and rename seal_snapshot()
table: Hide write_schema_as_cql()
table: Hide table::seal_snapshot()
table: Open-code finalize_snapshot()
table: Fix indentation after previuous patch
table: Use smp::invoke_on_all() to populate the vector with filenames
table: Don't touch dir once more on seal_snapshot()
table: Open-code table::take_snapshot() into caller lambda
table: Move parts of table::take_snapshot to sstables_manager
table: Introduce table::take_snapshot()
table: Store the result of smp::submit_to in local variable
The table::seal_snapshot() accepts a vector of sstables filenames and
writes them into manifest file. For that, it iterates over the vector
and moves all filenames from it into the streamer object.
The problem is that the vector contains foreign pointers on sets with
sstrings. Not only sets are foreign, sstrings in it are foreign too.
It's not correct to std::move() them to local CPU.
The fix is to make streamer object work on string_view-s and populate it
with non-owning references to the sstrings from aforementioned sets.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#27755
Now it's database::snapshot_table_on_all_shards(). This is symmetric to
database::truncate_table_on_all_shards().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now the table::snapshot_on_all_shards() is storage-independent and can
stop maintaining the local path variable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently sstable::snapshot() is called with directory name where to put
snapshots into. This patch changes it to accept snapshot name instead.
This makes the table-sstable API be unware of snapshot destination
storage type.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The manifest writing code is self-contained in a sense that it needs
list of sstable files and output_stream to write it too. The
snapshot_writer can provide output_stream for specific component, it can
be re-used for manifest writing code, thus making it independent from
local filesystem.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The schema writing code is self-contained in a sense that it needs
schema description and output_stream to write it too. Teach the
snapshot_writer to provide output_stream and make write_schema_as_cql()
be independent from local filesystem.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's an abstract class that defines how to write data and metadata with
table snapshot. Currently it just replaces the storage_options checks
done by table::snapshot_on_all_shards(), but it will soon evolve.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The seal_snapshot() syncs directory at the end. Now when the method is
table.cc-local, it doesn't need to be that careful. It looks nicer if
being renamed to write_manifest() and it's caller that syncs directory
after calling it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method only needs schema description from table. The caller can
pre-get it and pass it as argument. This makes it symmetric with
seal_snapshot() (that will be renamed soon) and reduces the class table
API size.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method is static and has nothing to do with table. The
snapshot_file_set needs to become public, but it will be moved to
table.cc soon.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a vector of foreign pointers to sets with sstable filenames
that's populated on all shards. The code does the invoke-on-all by hand
to grow the vector with push-back-s. However, if resizing the vector in
advance, shards will be able to just populate their slots.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now when the logic of take_snapshot() is split between two components
(table and sstables_manager) it's no longer useful
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Move the loop over vector of sstables that calls sstable->snapshot()
into sstables manager.
This makes it symmetric with sstables_manager::delete_atomically() and
allows for future changes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method returns all sstables vector with a guard that prevents this
list from being modified. Currently this is the part of another existing
table::take_snapshot() method, but the newer, smaller one, is more
atomic and self-contained, next patches will benefit from it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Split prepare can run concurrently with repair.
Consider this:
1) split prepare starts
2) incremental repair starts
3) split prepare finishes
4) incremental repair produces unsplit sstable
5) split is not happening on sstable produced by repair
5.1) that sstable is not marked as repaired yet
5.2) might belong to repairing set (has compaction disabled)
6) split executes
7) repairing or repaired set has unsplit sstable
If split was acked to coordinator (meaning prepare phase finished),
repair must make sure that all sstables produced by it are split.
It's not happening today with incremental repair because it disables
split on sstables belonging to repairing group. And there's a window
where sstables produced by repair belong to that group.
To solve the problem, we want the invariant where all sealed sstables
will be split.
To achieve this, streaming consumers are patched to produce unsealed
sstable, and the new variant add_new_sstable_and_update_cache() will
take care of splitting the sstable while it's unsealed.
If no split is needed, the new sstable will be sealed and attached.
This solution was also needed to interact nicely with out of space
prevention too. If disk usage is critical, split must not happen on
restart, and the invariant aforementioned allows for it, since any
unsplit sstable left unsealed will be discarded on restart.
The streaming consumer will fail if disk usage is critical too.
The reason interposer consumer doesn't fully solve the problem is
because incremental repair can start before split, and the sstable
being produced when split decision was emitted must be split before
attached. So we need a solution which covers both scenarios.
Fixes#26041.
Fixes#27414.
Should be backported to 2025.4 that contains incremental repair
Closesscylladb/scylladb#26528
* github.com:scylladb/scylladb:
test: Add reproducer for split vs intra-node migration race
test: Verify split failure on behalf of repair during critical disk utilization
test: boost: Add failure_when_adding_new_sstable_test
test: Add reproducer for split vs incremental repair race condition
compaction: Fail split of new sstable if manager is disabled
replica: Don't split in do_add_sstable_and_update_cache()
streaming: Leave sstables unsealed until attached to the table
replica: Wire add_new_sstables_and_update_cache() into intra-node streaming
replica: Wire add_new_sstable_and_update_cache() into file streaming consumer
replica: Wire add_new_sstable_and_update_cache() into streaming consumer
replica: Document old add_sstable_and_update_cache() variants
replica: Introduce add_new_sstables_and_update_cache()
replica: Introduce add_new_sstable_and_update_cache()
replica: Account for sstables being added before ACKing split
replica: Remove repair read lock from maybe_split_new_sstable()
compaction: Preserve state of input sstable in maybe_split_new_sstable()
Rename maybe_split_sstable() to maybe_split_new_sstable()
sstables: Allow storage::snapshot() to leave destination sstable unsealed
sstables: Add option to leave sstable unsealed in the stream sink
test: Verify unsealed sstable can be compacted
sstables: Allow unsealed sstable to be loaded
sstables: Restore sstable_writer_config::leave_unsealed
Ref https://github.com/scylladb/seastar/pull/3163
We can optimize the stat calls we use here by
using open_directory to open the snapshot,
base, and staging directory once, and using statat
calls for the relative name instead of the full
blown file_stat that needs to traverse the whole
path prefix for every call (the dirents are likely
to be cached, but still why waste cpu cycles on that
over and over again).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
If the number_of_linkes equals 1, we can be sure that
the file exists only in the snapshot directory so there is no need
to look it up in the data directory.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that we're using a simple loop in the coroutine just continue
the loop for files we want to ignore.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is more efficient to use the coroutine generator
to list the directory.
Brewing changes in seastar would make the generator buffered
as well as adding an extended generation that would
return the file stat data for each entry, that would become
useful in the next patch that optimizes the algorithm by
considering the entry's link count.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, we first print the json contents into a stringstream buffer
and then we write it as a whole to the manifest.json file output stream.
This is not scalable and may cause large allocation for large enough number
of files.
Fixes#24216
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#27542
Now, only sstable loader on boot and refresh from upload uses this
procedure. The idea is that maybe_split_new_sstable() will throw
when compaction cannot run due to e.g. out of space prevention.
It could fail repair writer, but we don't want it to fail boot.
As for refresh from upload, it's not supposed to work when tablet
map at the time of backup is not the same when restoring.
Even before this, refresh would fail if split already executed,
split would only happen if split was still ongoing. We need
token range stability for local restore. The safe variant will
always be load and stream.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We want the invariant that after ACK, all sealed sstables will be split.
This guarantee that on restart, no unsplit sstables will be found
sealed.
The paths that generate unsplit sstables are streaming and file
streaming consumers. It includes intra-node streaming, which
is local but can clone an unsplit sstable into destination.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Piggyback on new add_new_sstable_and_update_cache(), replacing
the previous add_sstables_and_update_cache().
Will be used by intra-node migration since we want it to be
safe when loading the cloned sstables. An unsplit sstable
can be cloned into destination which already ACKed split,
so we need this variant which splits sstable if needed,
while it's unsealed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Failure to load sstable in streaming can leave sealed sstables
on disk since they're not unlinked on failure.
This can result in several problems:
1) Data resurrection: since the sstable may contain deleted data
2) Split issue: since the finalization requires all sstables to be split
3) Disk usage issue: since the sstables hold space and streaming retries
can keep accumulating these files.
This new procedure will be later wired into streaming consumers, in
order to fix those problems.
Another benefit of the interface is that if there's split when adding
the new sstable, the output sstables will be returned to the caller,
allowing them to register the actual loaded sstables into e.g.
the view builder.
Refs #27414.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We want the invariant that after ACK, all sealed sstables will be split.
If check-and-attach is not atomic, this sequence is possible:
1) no split decision set.
2) Unsplit sstable is checked, no need to split, sealed.
3) split decision is set and ACKed
4) unsplit sstable is attached
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The lock is intended to serialize some maintenance compactions,
such as major, with repair. But maybe_split_new_sstable() is
restricted solely to new sstables that aren't part of the
sstable set.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This is crucial with MVs, since the splitting must preserve the state of
the original sstable. We want the sstable to be in staging dir, so it's
excluded when calculating the diff for performing pushes to view
replicas.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Since the function must only be used on new sstables, it should
be renamed to something describing its usage should be restricted.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>