Commit Graph

3188 Commits

Author SHA1 Message Date
Nadav Har'El
edfb89ef65 sstables: stop warning when auto-snapshot leaves non-empty directory
When a table is dropped, we delete its sstables, and finally try to delete
the table's top-level directory with the rmdir system call. When the
auto-snapshot feature is enabled (this is still Scylla's default),
the snapshot will remain in that directory so it won't be empty and will
cannot be removed. Today, this results in a long, ugly and scary warning
in the log:

```
WARN  2023-07-06 20:48:04,995 [shard 0] sstable - Could not remove table directory "/tmp/scylla-test-198265/data/alternator_alternator_Test_1688665684546/alternator_Test_1688665684546-4238f2201c2511eeb15859c589d9be4d/snapshots": std::filesystem::__cxx11::filesystem_error (error system:39, filesystem error: remove failed: Directory not empty [/tmp/scylla-test-198265/data/alternator_alternator_Test_1688665684546/alternator_Test_1688665684546-4238f2201c2511eeb15859c589d9be4d/snapshots]). Ignored.
```

It is bad to log as a warning something which is completely normal - it
happens every time a table is dropped with the perfectly valid (and even
default) auto-snapshot mode. We should only log a warning if the deletion
failed because of some unexpected reason.

And in fact, this is exactly what the code **tried** to do - it does
not log a warning if the rmdir failed with EEXIST. It even had a comment
saying why it was doing this. But the problem is that in Linux, deleting
a non-empty directory does not return EEXIST, it returns ENOTEMPTY...
Posix actually allows both. So we need to check both, and this is the
only change in this patch.

To confirm this that this patch works, edit test/cql-pytest/run.py and
change auto-snapshot from 0 to 1, run test/alternator/run (for example)
and see many "Directory not empty" warnings as above. With this patch,
none of these warnings appear.

Fixes #13538

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14557
2023-07-07 11:08:10 +02:00
Kefu Chai
04434c02b3 sstables: print generation without {:d}
the formatter for sstables::generation_type does not support "d"
specifier, so we should not use "{:d}" for printing it. this works
before d7c90b5239, but after that
change, generation_type is not an alias of int64_t anymore.
and its formatter does not support "d", so we should either
specialize fmt::formatter<generation_type> to support it or just
drop the specifier.

since seastar::format() is using
```c++
fmt::format_to(fmt::appender(out), fmt::runtime(fmt), std::forward<A>(a)...);
```
to print the arguments with given fmt string, we cannot identify
these kind of error at compile time.

at runtime, if we have issues like this, {fmt} would throw exception
like:
```
terminate called after throwing an instance of 'fmt::v9::format_error'
  what():  invalid format specifier
```
when constructing the `std::runtime_error` instance.

so, in this change, "d" is removed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14427
2023-07-03 13:53:13 +03:00
Raphael S. Carvalho
1d8cb32a5d table: Optimize creation of reader excluding staging for view building
View building from staging creates a reader from scratch (memtable
+ sstables - staging) for every partition, in order to calculate
the diff between new staging data and data in base sstable set,
and then pushes the result into the view replicas.

perf shows that the reader creation is very expensive:
+   12.15%    10.75%  reactor-3        scylla             [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes
+   10.01%     9.99%  reactor-3        scylla             [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    8.95%     8.94%  reactor-3        scylla             [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()
+    7.29%     7.28%  reactor-3        scylla             [.] dht::ring_position_tri_compare
+    6.28%     6.27%  reactor-3        scylla             [.] dht::tri_compare
+    4.11%     3.52%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+    4.09%     4.07%  reactor-3        scylla             [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state
+    3.46%     0.93%  reactor-3        scylla             [.] sstables::sstable_run::will_introduce_overlapping
+    2.53%     2.53%  reactor-3        libstdc++.so.6     [.] std::_Rb_tree_increment
+    2.45%     2.45%  reactor-3        scylla             [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    2.14%     2.13%  reactor-3        scylla             [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> >
+    2.07%     2.07%  reactor-3        scylla             [.] logalloc::region_impl::free
+    2.06%     1.91%  reactor-3        scylla             [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()#1}::operator()() const::{lambda()#1}::operator()
+    2.04%     2.04%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+    1.87%     0.00%  reactor-3        [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
+    1.86%     0.00%  reactor-3        [kernel.kallsyms]  [k] do_syscall_64
+    1.39%     1.38%  reactor-3        libc.so.6          [.] __memcmp_avx2_movbe
+    1.37%     0.92%  reactor-3        scylla             [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::
+    1.34%     1.33%  reactor-3        scylla             [.] logalloc::region_impl::alloc_small
+    1.33%     1.33%  reactor-3        scylla             [.] seastar::memory::small_pool::add_more_objects
+    1.30%     0.35%  reactor-3        scylla             [.] seastar::reactor::do_run
+    1.29%     1.29%  reactor-3        scylla             [.] seastar::memory::allocate
+    1.19%     0.05%  reactor-3        libc.so.6          [.] syscall
+    1.16%     1.04%  reactor-3        scylla             [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst
+    1.07%     0.79%  reactor-3        scylla             [.] sstables::partitioned_sstable_set::insert

That shows some significant amount of work for inserting sstables
into the interval map and maintaining the sstable run (which sorts
fragments by first key and checks for overlapping).

The interval map is known for having issues with L0 sstables, as
it will have to be replicated almost to every single interval
stored by the map, causing terrible space and time complexity.
With enough L0 sstables, it can fall into quadratic behavior.

This overhead is fixed by not building a new fresh sstable set
when recreating the reader, but rather supplying a predicate
to sstable set that will filter out staging sstables when
creating either a single-key or range scan reader.

This could have another benefit over today's approach which
may incorrectly consider a staging sstable as non-staging, if
the staging sst wasn't included in the current batch for view
building.

With this improvement, view building was measured to be 3x faster.

from
INFO  2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s

to
INFO  2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s

Refs #14089.
Fixes #14244.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-06-26 22:30:39 -03:00
Kefu Chai
f014ccf369 Revert "Revert "Merge 'treewide: add uuid_sstable_identifier_enabled support' from Kefu Chai""
This reverts commit 562087beff.

The regressions introduced by the reverted change have been fixed.
So let's revert this revert to resurrect the
uuid_sstable_identifier_enabled support.

Fixes #10459
2023-06-21 13:02:40 +03:00
Tomasz Grabiec
34f28aa0cb sstables: Add trace-level logging related to shard calculation 2023-06-21 00:58:24 +02:00
Tomasz Grabiec
18f567385c sstable_directory: Improve trace-level logging 2023-06-21 00:58:24 +02:00
Tomasz Grabiec
ad983ac23d sstables: Compute sstable shards using sharder from erm when loading
schema::get_sharder() does not use the correct sharder for
tablet-based tables.  Code which is supposed to work with all kinds of
tables should obtain the sharder from erm::get_sharder().
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
17d6163548 sstables: Generate sharding metadata using sharder from erm when writing
We need to keep sharding metadata consistent with tablet mapping to
shards in order for node restart to detect that those sstables belong
to a single shard and that resharding is not necessary. Resharding of
sstables based on tablet metadata is not implemented yet and will
abort after this series.

Keeping sharding metadata accurate for tablets is only necessary until
compaction group integration is finished. After that, we can use the
sstable token range to determine the owning tablet and thus the owning
shard. Before that, we can't, because a single sstable may contain
keys from different tablets, and the whole key range may overlap with
keys which belong to other shards.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
fe7922d65c sstables: Move compute_shards_for_this_sstable() to load()
Soon, compute_shards_for_this_sstable() will need to take a sharder object.

open_data() is called indirectly from sstable::load() and directly
after writing an sstable from various paths. The latter don't really
need to compute shards, since the field is already set by the writer. In
order to reduce code churn, move compute_shards_for_this_sstable() to
the load() path only so that only load() needs to take the sharder.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
390bcf3fae dht: Take sharder externally in splitting functions
We need those functions to work with tablet sharder, which is not
accessible through schema::get_sharder(). In order to propagate the
right sharder, those functions need to take it externally rather from
the schema object. The sharder will come from the
effective_replication_map attached to the table object.

Those splitting functions are used when generating sharding metadata
of an sstable. We need to keep this sharding metadata consistent with
tablet mapping to shards in order for node restart to detect that
those sstables belong to a single shard and that resharding is not
necessary. Resharding of sstables based on tablet metadata is not
implemented yet and will abort after this series.

Keeping sharding metadata accurate for tablets is only necessary until
compaction group integration is finished. After that, we can use the
sstable token range to determine the owning tablet and thus the owning
shard. Before that, we can't, because a single sstable may contain
keys from different tablets, and the whole key range may overlap with
keys which belong to other shards.
2023-06-21 00:58:24 +02:00
Botond Dénes
562087beff Revert "Merge 'treewide: add uuid_sstable_identifier_enabled support' from Kefu Chai"
This reverts commit d1dc579062, reversing
changes made to 3a73048bc9.

Said commit caused regressions in dtests. We need to investigate and fix
those, but in the meanwhile let's revert this to reduce the disruption
to our workflows.

Refs: #14283
2023-06-19 08:49:27 +03:00
Kefu Chai
2d265e860d replica,sstable: introduce invalid generation id
the invalid sstable id is the NULL of a sstable identifier. with
this concept, it would be a lot simpler to find/track the greatest
generation. the complexity is hidden in the generation_type, which
compares the a) integer-based identifiers b) uuid-based identifiers
c) invalid identitifer in different ways.

so, in this change

* the default constructor generation_type is
  now public.
* we don't check for empty generation anymore when loading
  SSTables or enumerating them.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-06-15 17:54:59 +08:00
Kefu Chai
939fa087cc sstables, replica: pass uuid_sstable_identifiers to generation generator
before this change, we assume that generation is always integer based.
in order to enable the UUID-based generation identifier if the related
option is set, we should populate this option down to generation generator.

because we don't have access to the cluster features in some places where
a new generation is created, a new accessor exposing feature_service from
sstable manager is added.

Fixes #10459
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-06-15 17:54:59 +08:00
Kefu Chai
15543464ce sstables, replica: support UUID in generation_type
this change generalize the value of generation_type so it also
supports UUID based identifier.

* sstables/generation_type.h:
  - add formatter and parse for UUID. please note, Cassandra uses
    a different format for formatting the SSTable identifier. and
    this formatter suits our needs as it uses underscore "_" as the
    delimiter, as the file name of components uses dash "-" as the
    delimiter. instead of reinventing the formatting or just use
    another delimiter in the stringified UUID, we choose to use the
    Cassandra's formatting.
  - add accessors for accessing the type and value of generation_type
  - add constructor for constructing generation_type with UUID and
    string.
  - use hash for placing sstables with uuid identifiers into shards
    for more uniformed distrbution of tables in shards.
* replica/table.cc:
  - only update the generator if the given generation contains an
    integer
* test/boost:
  - add a simple test to verify the generation_type is able to
    parse and format

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-06-15 17:54:59 +08:00
Pavel Emelyanov
d1de796f6b sstable: Move XFS renamer hack into fs storage
The method sits on sstable, but is called only from fs storage and it's
the only place that really needs it

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14230
2023-06-14 12:35:04 +03:00
Botond Dénes
3479adc85f Merge 'Prepare sstable_directory lister to garbage_collect() s3 stuff' from Pavel Emelyanov
When scylla starts it collects dangling sstables from the datadir. It includes temporary sstable directories and pending-deletion log. S3-backed sstables cannot be garbage-collected like that, instead "garbage" entries from the ownership table should be processed. Currently the g.c. code is unaware of storage and scans datadir for whatever sstable it's called for.

This PR prepares the garbage_collect() call to become virtual, but no-op for ownership-table lister. Proper S3 garbage-collecting is not yet here, it needs an extra patch to seastar http client.

refs: #13024

Closes #14023

* github.com:scylladb/scylladb:
  sstable_directory: Do not collect filesystem garbage for S3-backed sstables
  sstable_directory: Deduplicate .process() location argument
  sstable_directory: Keep directory lister on stack
  sstable_directory: Use directory_lister API directly
2023-06-14 12:06:37 +03:00
Pavel Emelyanov
c68c154fb6 code: Reduce tracing/*hh fanout
There are some headers that include tracing/*.hh ones despite all they
need is forward-declared trace_state_ptr

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14155
2023-06-07 19:19:22 +03:00
Pavel Emelyanov
66e43912d6 code: Switch to seastar API level 7
In that level no io_priority_class-es exist. Instead, all the IO happens
in the context of current sched-group. File API no longer accepts prio
class argument (and makes io_intent arg mandatory to impls).

So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritants to updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command

The first change is huge and was made semi-autimatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields

Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass intent to lower file (if applicatble)

Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile

The scylla-gdb.py update is a bit hairry -- it needs to use task queues
list for IO classes names and shares, but to detect it should it checks
for the "commitlog" group is present.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13963
2023-06-06 13:29:16 +03:00
Benny Halevy
c685ef9e71 partitioned_sstable_set: insert: return early if sst is already in the set
Currently, partitioned_sstable_set::insert may erase a sstable
from the set inadvertently, if an exception is thrown while
(re-)inserting it.

To prevent that, simply return early after detecting that
insertion didn't took place, based on the unordered_set::insert
result.

This issue is theoretical, as there are no known case
of re-inserting sstables into the partitioned sstable set.

Fixes #14060

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #14061
2023-05-29 23:03:25 +03:00
Benny Halevy
26705ba6af partitioned_sstable_set: erase empty runs
When erasing a sstable first check if its run_id
exists in _all_runs, otherwise do nothing with
that respect, and then if the run becomes empty
when erasing the last sstable (and it could have been
a single-sstable run from get go), erase the run
from `_all_runs`.

Fixes #14052

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #14054
2023-05-29 23:03:24 +03:00
Botond Dénes
5a14c3311a Merge 'Break S3 upload 50Gb file limit' from Pavel Emelyanov
Current S3 uploading sink has implicit limit for the final file size that comes from two places. First, S3 protocol declares that uploading parts count from 1 to 10000 (inclusive). Second, uploading sink sends out parts once they grow above S3 minimal part size which is 5Mb. Since sstables puts data in 128kb (or smaller) portions, parts are almost exactly 5Mb in size, so the total uploading size cannot grow above ~50Gb. That's too low.

To break the limit the new sink (called jumbo sink) uses the UploadPartCopy S3 call that helps splicing several objects into one right on the server. Jumbo sink starts uploading parts into an intermediate temporary object called a piece and named ${original_object}_${piece_number}. When the number of parts in current piece grows above the configured limit the piece is finalized and upload-copied into the object as its next part, then deleted. This happens in the background, meanwhile the new piece is created and subsequent data is put into it. When the sink is flushed the current piece is flushed as is and also squashed into the object.

The new jumbo sink is capable of uploading ~500Tb of data, which looks enough.

fixes: #13019

Closes #13577

* github.com:scylladb/scylladb:
  sstables: Switch data and index sink to use jumbo uploader
  s3/test: Tune-up multipart upload test alignment
  s3/test: Add jumbo upload test
  s3/client: Wait for background upload fiber on close-abort
  c3/client: Implement jumbo upload sink
  s3/client: Move memory buffers to upload_sink from base
  s3/client: Move last part upload out of finalize_upload()
  s3/client: Merge do_flush() with upload_part()
  s3/client: Rename upload_sink -> upload_sink_base
2023-05-25 11:44:06 +03:00
Pavel Emelyanov
e435ec1b5e sstable_directory: Do not collect filesystem garbage for S3-backed sstables
The sstable_directory::garbage_collect() scans /var/lib/scylla for
whatever sstable it's called for. S3-backed ones don't have anything
there, so the g.c. run is no-op. Make this call be lister virtual
method, so that only filesystem lister does this scan and the ownership
table lister becomes the real no-op. Later it will be filled with code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-24 17:45:50 +03:00
Pavel Emelyanov
16d66f2fe9 sstable_directory: Deduplicate .process() location argument
When sstable directory calls lister it passes the _sstable_dir as an
argument. However, the very same _sstable_dir was used to construct the
lister, and by now all the lister implementations keep this value
aboard.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-24 17:43:36 +03:00
Pavel Emelyanov
d6b5e18cb3 sstable_directory: Keep directory lister on stack
The directory_lister _lister exists as class member, but is only used
once -- when the .process() is called -- and then is closed forever.
It's simpler to keep the lister on the .process() stack.

This change also makes filesystem lister keep the copy of directory as
class member, it will be useful for the next patch as well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-24 17:42:08 +03:00
Pavel Emelyanov
524614087a sstable_directory: Use directory_lister API directly
The filesystem components lister has private wrappers on top of
directory lister it uses internally. These are lefrovers from making the
sstable directory storage-aware, now they can be removed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-24 17:40:38 +03:00
Botond Dénes
2526b232f1 Merge 'Remove explicit default_priority_class() usage from sstable aux methods' from Pavel Emelyanov
There are few places in sstables/ code that require caller to specify priority class to pass it along to file stream options. All these callers use default class, so it makes little sense to keep it. This change makes the sched classes unification mega patch a bit smaller.

ref: #13963

Closes #13996

* github.com:scylladb/scylladb:
  sstables: Remove default prio class from rewrite_statistics()
  sstables: Remove prio class from validate_checksums subs
  sstables: Remove always default io-prio from validate_checksums()
2023-05-24 09:23:24 +03:00
Pavel Emelyanov
6c453df9d7 sstables: Remove default prio class from rewrite_statistics()
The method is called with explicitly default pririty class and puts one
into the fstream options. This whole chain can be avoided

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 13:54:31 +03:00
Pavel Emelyanov
438132ad4b sstables: Remove prio class from validate_checksums subs
The sstable.read_checksum() and .read_digest() accept prio class
argument from validate_checsums(), but it's always the "default" one.
Remove the arg and remove stream options initializations as they'll pick
up default prio class on their default constructing.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 13:54:31 +03:00
Pavel Emelyanov
7396d9d291 sstables: Remove always default io-prio from validate_checksums()
All calls to sstables::validate_checksums() happen with explicitly
default priority class. Just hard-code it as such in the method

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 13:54:31 +03:00
Pavel Emelyanov
2bb024c948 index_reader: Introduce and use default arguments to constructor
Most of creators of index_reader construct it with default prio class,
null trace pointer and use_caching::yes. Assigning implicit defaults to
constructor arguments keeps the code shorter and easier to read.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 11:29:04 +03:00
Pavel Emelyanov
3fd5d3cc2b index_reader: Use _pc field in get_file_input_stream_options() directly
No need to pass this-> field into this-> call

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 11:18:14 +03:00
Pavel Emelyanov
21d24e8ea3 index_reader: Move index_reader::get_file_input_stream_options to private: block
A "while at it" cleanup. When pathing the method (next patch) it turned
out that there are no other callers other than local class, so it _is_
private.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 11:18:14 +03:00
Botond Dénes
3b424e391b Merge 'perform_cleanup: wait until all candidates are cleaned up' from Benny Halevy
cleanup_compaction should resolve only after all
sstables that require cleanup are cleaned up.

Since it is possible that some of them are in staging
and therefore cannot be cleaned up, retry once a second
until they become eligible.

Timeout if there is no progress within 5 minutes
to prevent hanging due to view building bug.

Fixes #9559

Closes #13812

* github.com:scylladb/scylladb:
  table: signal compaction_manager when staging sstables become eligible for cleanup
  compaction_manager: perform_cleanup: wait until all candidates are cleaned up
  compaction_manager: perform_cleanup: perform_offstrategy if needed
  compaction_manager: perform_cleanup: update_sstables_cleanup_state in advance
  sstable_set: add for_each_sstable_gently* helpers
2023-05-19 12:35:59 +03:00
Botond Dénes
c2aee26278 Merge 'Keep sstables garbage collection in sstable_directory' from Pavel Emelyanov
Currently temporary directories with incomplete sstables and pending deletion log are processed by distributed loader on start. That's not nice, because for s3 backed sstables this code makes no sense (and is currently a no-op because of incomplete implementation). This garbage collecting should be kept in sstable_directory where it can off-load this work onto lister component that is storage-aware.

Once g.c. code moved, it allows to clean the class sstable list of static helpers a bit.

refs: #13024
refs: #13020
refs: #12707

Closes #13767

* github.com:scylladb/scylladb:
  sstable: Toss tempdir extension usage
  sstable: Drop pending_delete_dir_basename()
  sstable: Drop is_pending_delete_dir() helper
  sstable_directory: Make garbage_collect() non-static
  sstable_directory: Move deletion log exists check
  distributed_loader: Move garbage collecting into sstable_directory
  distributed_loader: Collect garbace collecting in one call
  sstable: Coroutinize remove_temp_dir()
  sstable: Coroutinize touch_temp_dir()
  sstable: Use storage::temp_dir instead of hand-crafted path
2023-05-19 08:50:13 +03:00
Kefu Chai
03be1f438c sstables: move get_components_lister() into sstable_directory
sstables_manager::get_component_lister() is used by sstable_directory.
and almost all the "ingredients" used to create a component lister
are located in sstable_directory. among the other things, the two
implementations of `components_lister` are located right in
`sstable_directory`. there is no need to outsource this to
sstables_manager just for accessing the system_keyspace, which is
already exposed as a public function of `sstables_manager`. so let's
move this helper into sstable_directory as a member function.

with this change, we can even go further by moving the
`components_lister` implementations into the same .cc file.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13853
2023-05-18 08:43:35 +03:00
Kefu Chai
8bcbc9a90d sstables: add an maybe_owned_by_this_shard() helper
instead of encoding the fact that we are using generation identifier
as a hint where the SSTable with this generation should be processed
at the caller sites of `as_int()`, just provide an accessor on
sstable_generation_generator's side. this helps to encapsulate the
underlying type of generation in `generation_type` instead of exposing
it to its users.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13846
2023-05-18 08:41:02 +03:00
Pavel Emelyanov
ed50fda1fe sstable: Toss tempdir extension usage
The tempdir for filesystem-based sstables is {generation}.sstable one.
There are two places that need to know the ".sstable" extention -- the
tempdir creating code and the tempdir garbage-collecting code.

This patch simplifies the sstable class by patching the aforementioned
functions to use newly introduced tempdir_extension string directly,
without the help of static one-line helpers.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-17 15:19:38 +03:00
Pavel Emelyanov
e8c0ae28b5 sstable: Drop pending_delete_dir_basename()
The helper is used to return const char* value of the pending delete
dir. Callers can use it directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-17 15:17:33 +03:00
Pavel Emelyanov
7792479865 sstable: Drop is_pending_delete_dir() helper
It's only used by the sstable_directory::replay_pending_delete_log()
method. The latter is only called by the sstable_directory itself with
the path being pending-delete dir for sure. So the method can be made
private and the is_pending_delete_dir() can be removed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-17 15:17:32 +03:00
Pavel Emelyanov
7429205632 sstable_directory: Make garbage_collect() non-static
When non static the call can use sstable_directory::_sstable_dir path,
not the provided argument. The main benefit is that the method can later
be moved onto lister so that filesystem and ownership-table listers can
process dangling bits differently.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-17 15:16:23 +03:00
Pavel Emelyanov
45adf61490 sstable_directory: Move deletion log exists check
Check if the deletion log exists in the handling helper, not outside of
it. This makes next patch shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-17 15:16:23 +03:00
Pavel Emelyanov
3d7122d2fe distributed_loader: Move garbage collecting into sstable_directory
It's the directory that owns the components lister and can reason about
the way to pick up dangling bits, be it local directories or entries
from the ownership table.

First thing to do is to move the g.c. code into sstable_directory. While
at it -- convert ssting dir into fs::path dir and switch logger.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-17 15:16:23 +03:00
Pavel Emelyanov
22299a31c8 sstable: Coroutinize remove_temp_dir()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-17 15:16:23 +03:00
Pavel Emelyanov
9db5e9f77f sstable: Coroutinize touch_temp_dir()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-17 15:15:38 +03:00
Pavel Emelyanov
7e506354fd sstable: Use storage::temp_dir instead of hand-crafted path
When opening an sstable on filesystem it's first created in a temporary
directory whose path is saved in storage::temp_dir variable. However,
the opening method constructs the path by hand. Fix that.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-17 15:14:04 +03:00
Benny Halevy
ff7c9c661d sstable_set: add for_each_sstable_gently* helpers
Currently callers of `for_each_sstable` need to
use a seastar thread to allow preemption
in the for_each_sstable loop.

Provide for_each_sstable_gently and
for_each_sstable_gently_until to make using this
facility from a coroutine easier, without requiring
a seastar thread.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-05-17 11:31:07 +03:00
Pavel Emelyanov
b58ad040d2 sstables: Switch data and index sink to use jumbo uploader
These two can grow large. Non-jumbo sink is effectively limited with
10000 parts, since each is ~5Mb the maximum uploadable data/index
happens to be 50Gb which is too small.

Other components shouldn't grow that big and continue using simple and a
bit faster uploading sink.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-16 12:23:18 +03:00
Botond Dénes
20ff122a84 Merge 'Delete S3 sstables without the help of deletion log' from Pavel Emelyanov
There are two layers of stables deletion -- delete-atomically and wipe. The former is in fact the "API" method, it's called by table code when the specific sstable(s) are no longer needed. It's called "atomically" because it's expected to fail in the middle in a safe manner so that subsequent boot would pick the dangling parts and proceed. The latter is a low-level removal function that can fail in the middle, but it's not of _its_ care.

Currently the atomic deletion is implemented with the help of sstable_directory::delete_atomically() method that commits sstables files names into deletion log, then calls wipe (indirectly), then drops the deletion log. On boot all found deletion logs are replayed. The described functionality is used regardless of the sstable storage type, even for S3, though deletion log is an overkill for S3, it's better be implemented with the help of ownership table. In fact, S3 storage already implements atomic deletion in its wipe method thus being overly careful.

So this PR
- makes atomic deletion be storage-specific
- makes S3 wipe non-atomic

fixes: #13016
note: Replaying sstables deletion from ownership table on boot is not here, see #13024

Closes #13562

* github.com:scylladb/scylladb:
  sstables: Implement atomic deleter for s3 storage
  sstables: Get atomic deleter from underlying storage
  sstables: Move delete_atomically to manager and rename
2023-05-15 08:57:47 +03:00
Avi Kivity
5d6f31df8e Merge 'Coroutinize sstable::read_toc()' from Pavel Emelyanov
It consists of two parts -- call for do_read_simple() with lambda and handling of its results. PR coroutinizes it in two steps for review simplicity -- first the lambda, then the outer caller. Then restores indentation.

Closes #13862

* github.com:scylladb/scylladb:
  sstables: Restore indentation after previous patches
  sstables: Coroutinuze read_toc() outer part
  sstables: Coroutinuze read_toc() inner part
2023-05-14 14:14:23 +03:00
Avi Kivity
0a78995e2b Merge 'Share s3 clients between sstables' from Pavel Emelyanov
Currently s3::client is created for each sstable::storage. It's later shared between sstable's files and upload sink(s). Also foreign_sstable_open_info can produce a file from a handle making a new standalone client. Coupled with the seastar's http client spawning connections on demand, this makes it impossible to control the amount of opened connections to object storage server.

In order to put some policy on top of that (as well as apply workload prioritization) s3 clients should be collected in one place and then shared by users. Since s3::client uses seastar::http::client under the hood which, in turn, can generate many connections on demand, it's enough to produce a single s3::client per configured endpoint one each shard and then share it between all the sstables, files and sinks.

There's one difficulty however, solving which is most of what this PR does. The file handle, that's used to transfer sstable's file across shards, should keep aboard all it needs to re-create a file on another shard. Since there's a single s3::client per shard, creation of a file out of a handle should grab that shard's client somehow. The meaningful shard-local object that can help is the sstables_manager and there are three ways to make use of it. All deal with the fact that sstables_manager-s are not sharded<> services, but are owner by the database independently on each shard.

1. walk the client -> sst.manager -> database -> container -> database -> sst.manager -> client chain by keeping its first half on the handle and unrolling the second half to produce a file
2. keep sharded peering service referenced by the sstables_manager that's initialized in main and passed though the database constructor down to sstables_manager(s)
3. equip file_handle::to_file with the "context" argument and teach sstables foreign info opener to push sstables_manager down to s3 file ... somehow

This PR chooses the 2nd way and introduces the sstables::storage_manager main-local sharded peering service that maintains all the s3::clients. "While at it" the new manager gets the object_storage_config updating facilities from the database (it's overloaded even without it already). Later the manager will also be in charge of collecting and exporting S3 metrics. In order to limit the number of S3 connections it also needs a patch seastar http::client, there's PR already doing that, once (if) merged there'll come one more fix on top.

refs: #13458
refs: #13369
refs: scylladb/seastar#1652

Closes #13859

* github.com:scylladb/scylladb:
  s3: Pick client from manager via handle
  s3: Generalize s3 file handle
  s3: Live-update clients' configs
  sstables: Keep clients shared across sstables
  storage_manager: Rewrap config map
  sstables, database: Move object storage config maintenance onto storage_manager
  sstables: Introduce sharded<storage_manager>
2023-05-14 14:14:23 +03:00