That will be used in turn to restrict reshape to 10% of available space
in underlying storage.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 51c7ee889e)
With intra-node migration, all the movement is local, so we can make
streaming faster by just cloning the sstable set of leaving replica
and loading it into the pending one.
This cloning is underlying storage specific, but s3 doesn't support
snapshot() yet (the sstables::storage procedure which clone is built
upon). It's only supported by file system, with help of hard links.
A new generation is picked for new cloned sstable, and it will
live in the same directory as the original.
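A minimal sketch of the hard-link based clone described above, using only
std::filesystem. The "<generation>-<component>" file naming and the helper
names are assumptions made for illustration, not the real sstable layout
or the sstables::storage API.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Demo helper: create a small file standing in for an sstable component.
void write_component(const fs::path& p, const std::string& data) {
    std::ofstream out(p);
    out << data;
}

// Hypothetical clone: hard-link every "<old_gen>-*" file in dir to
// "<new_gen>-*" in the same directory. No data is copied.
std::vector<fs::path> clone_components(const fs::path& dir,
                                       const std::string& old_gen,
                                       const std::string& new_gen) {
    assert(old_gen != new_gen);
    // Collect the sources first so the directory is not modified while
    // it is being iterated.
    std::vector<fs::path> sources;
    for (const auto& entry : fs::directory_iterator(dir)) {
        const std::string name = entry.path().filename().string();
        if (name.rfind(old_gen + "-", 0) == 0) {
            sources.push_back(entry.path());
        }
    }
    std::vector<fs::path> cloned;
    for (const auto& src : sources) {
        const std::string name = src.filename().string();
        fs::path dst = dir / (new_gen + name.substr(old_gen.size()));
        fs::create_hard_link(src, dst);  // both names now share one inode
        cloned.push_back(dst);
    }
    return cloned;
}
```

Because the clone is only a second name for the same inode, it stays in
the original directory and costs no extra data space.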
A challenge I bumped into was to understand why table refused to
load the sstable at pending replica, as it considered them foreign.
Later I realized that sharder (for reads) at this stage of migration
will point only to leaving replica. It didn't fail with mutation
based streaming, because the sstable writer considers the shard --
that the sstable was written into -- as its owner, regardless of what
sharder says. That was fixed by mimicking this behavior during
loading at pending.
test:
./test.py --mode=dev intranode --repeat=100 passes.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
```
sstables/storage.cc:152:21: warning: 'file_path' used after it was moved [bugprone-use-after-move]
remove_file(file_path).get();
^
sstables/storage.cc:145:64: note: move occurred here
auto w = file_writer(output_stream<char>(std::move(sink)), std::move(file_path));
```
It's a regression when TOC is found for a new sstable, and we try to delete temporary TOC.
courtesy of clang-tidy.
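For illustration, the pattern behind the warning and one way to avoid it
can be sketched as follows. The `writer` type and the function name are
hypothetical stand-ins, not the code in sstables/storage.cc.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for a writer that takes ownership of a file name.
struct writer {
    std::string name;
    explicit writer(std::string n) : name(std::move(n)) {}
};

// Returns the name that a later cleanup step would use. The writer gets a
// copy, so file_path is still valid when the cleanup needs it.
std::string open_then_cleanup_name(std::string file_path) {
    writer w(file_path);          // copy, not std::move(file_path)
    assert(w.name == file_path);  // both still hold the same value
    return file_path;
}
```

Had the writer been built from std::move(file_path), the returned string
would be in a valid but unspecified state, which is what clang-tidy flags.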
Fixes #18323.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#18367
since we do not rely on FMT_DEPRECATED_OSTREAM to define the
fmt::formatter for us anymore, let's stop defining `FMT_DEPRECATED_OSTREAM`.
in this change,
* utils: drop the range formatters in to_string.hh and to_string.c, as
we don't use them anymore. and the tests for them in
test/boost/string_format_test.cc are removed accordingly.
* utils: use fmt to print chunk_vector and small_vector. as
we are not able to print the elements using operator<< anymore
after switching to {fmt} formatters.
* test/boost: specialize fmt::detail::is_std_string_like<bytes>
due to a bug in {fmt} v9, {fmt} fails to format a range whose
element type is `basic_sstring<uint8_t>`, as it considers it
as a string-like type, but `basic_sstring<uint8_t>`'s char type
is signed char, not char. this issue does not exist in {fmt} v10,
so, in this change, we add a workaround to explicitly specialize
the type trait to ensure that {fmt} formats this type using its
`fmt::formatter` specialization instead of trying to format it
as a string. also, {fmt}'s generic ranges formatter calls the
pair formatter's `set_brackets()` and `set_separator()` methods
when printing the range, but the operator<< based formatter does not
provide these methods, so we have to include this change in the change
switching to {fmt}, otherwise the change specializing
`fmt::detail::is_std_string_like<bytes>` won't compile.
* test/boost: in tests, we use `BOOST_REQUIRE_EQUAL()` and its friends
for comparing values. but without the operator<< based formatters,
Boost.Test would not be able to print them. after removing
the homebrew formatters, we need to use the generic
`boost_test_print_type()` helper to do this job. so we are
including `test_utils.hh` in tests so that we can print
the formattable types.
* treewide: add `#include "utils/to_string.hh"` where
`fmt::formatter<optional<>>` is used.
* configure.py: do not define FMT_DEPRECATED_OSTREAM
* cmake: do not define FMT_DEPRECATED_OSTREAM
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we include `fmt/ranges.h` and/or `fmt/std.h`
for formatting the container types, like vector, map,
optional and variant using {fmt} instead of the homebrew
formatter based on operator<<.
with this change, the changes adding fmt::formatter and
the changes using ostream formatter explicitly, we are
allowed to drop `FMT_DEPRECATED_OSTREAM` macro.
Refs scylladb#13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
sstables_manager now depends on system_keyspace for access to the
system.sstables table, needed by object storage. This violates
modularity, since sstables_manager is a relatively low-level leaf
module while system_keyspace integrates large parts of the system
(including, indirectly, sstables_manager).
One area where this is grating is sstables::test_env, which has
to include the much higher level cql_test_env to accommodate it.
Fix this by having sstables_manager expose its dependency on
system_keyspace as an interface, sstables_registry, and have
system_keyspace implement the glue logic in
system_keyspace_sstables_manager.
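The dependency inversion described here can be sketched roughly as below.
The class and method names are illustrative only and do not reproduce the
real sstables_registry interface; the in-memory map stands in for the
system table.

```cpp
#include <cassert>
#include <map>
#include <string>

// Low-level side: the manager only knows this abstract interface.
class registry {
public:
    virtual ~registry() = default;
    virtual void create_entry(const std::string& generation,
                              const std::string& status) = 0;
    virtual bool has_entry(const std::string& generation) const = 0;
};

class manager {
    registry& _registry;
public:
    explicit manager(registry& r) : _registry(r) {}
    void register_sstable(const std::string& generation) {
        assert(!generation.empty());
        _registry.create_entry(generation, "creating");
    }
};

// High-level side: the glue that would forward to the system table. Here
// it is backed by a map so the sketch stays self-contained, which is also
// what a lightweight test environment could plug in.
class in_memory_registry final : public registry {
    std::map<std::string, std::string> _entries;
public:
    void create_entry(const std::string& generation,
                      const std::string& status) override {
        _entries[generation] = status;
    }
    bool has_entry(const std::string& generation) const override {
        return _entries.count(generation) != 0;
    }
};
```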
Closes scylladb/scylladb#17868
this change is a follow-up of 637dd730. the goal is to use
std::filesystem::path for manipulating paths, and to avoid
converting back and forth between sstring and fs::path.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#17257
this change is a follow-up of 637dd730. the goal is to use
std::filesystem::path for manipulating paths, and to avoid
converting back and forth between sstring and fs::path.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#17214
get0() dates back to the days when Seastar futures carried tuples, and
get0() was a way to get the first (and usually only) element. Now
it's a distraction, and Seastar is likely to deprecate and remove it.
Replace with seastar::future::get(), which does the same thing.
they are directories, and we are concatenating strings to build the paths
to the sstable components. so it would be more elegant to use fs::path
for manipulating paths.
this change was inspired by the discussion on passing the relative
path to sstable to `scylla sstables`, where we use the
`path::parent_path()` as the dir of sstable, and then concatenate
it with the filename component. but if the `parent_path()` method
returns an empty string, we end up with a path like
"/me-42-big-TOC.txt", which is not reachable. what we should be
reading is "me-42-big-TOC.txt". so we would be better off either
using `fs::path` or enforcing the absolute path.
since we are already using "/" as separator, and concatenating strings,
this is an opportunity to switch over to `fs::path` to address
the problem and to avoid the string concatenation.
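The difference between the two ways of joining can be shown with plain
std::filesystem; the helper names below are made up for this sketch.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// String concatenation: an empty directory still gets a leading "/", so
// the result points at the filesystem root.
std::string join_by_concat(const std::string& dir, const std::string& file) {
    return dir + "/" + file;
}

// fs::path::operator/ adds a separator only when the left side is
// non-empty, so a relative file name stays relative.
fs::path join_by_path(const fs::path& dir, const std::string& file) {
    assert(!file.empty());
    return dir / file;
}
```

With an empty `parent_path()`, join_by_concat() yields "/me-42-big-TOC.txt"
while join_by_path() yields "me-42-big-TOC.txt".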
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#16982
Right now the atomic deletion is called on manager, but it gets the
actual deletion function from storage and off-loads the deletion to it.
This patch makes the manager fully responsible for the deletion by
implementing the sequence of
    auto ctx = storage.prepare()
    for sst in sstables:
        sst.unlink()
    storage.complete(ctx)
Storage implementations provide the prepare/complete methods. The
filesystem storage does it via deletion log and the s3 storage is still
not atomic :(
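A toy, in-memory sketch of that sequence follows. All type and member
names are assumptions for illustration; the real storage classes are
asynchronous and considerably more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy storage: a set of names stands in for files on disk, and a vector
// stands in for the deletion log.
struct toy_storage {
    std::set<std::string> files;
    std::vector<std::string> pending_log;

    // The opaque context handed from prepare() to complete().
    struct deletion_ctx { std::size_t count; };

    // Record the intent to delete before anything is removed.
    deletion_ctx prepare(const std::vector<std::string>& names) {
        assert(pending_log.empty());
        pending_log = names;
        return deletion_ctx{names.size()};
    }
    void unlink(const std::string& name) { files.erase(name); }
    // All unlinks are done, so the recorded intent can be dropped.
    void complete(deletion_ctx) { pending_log.clear(); }
};

// The manager owns the sequence; the storage supplies prepare/complete.
void delete_atomically(toy_storage& st, const std::vector<std::string>& names) {
    auto ctx = st.prepare(names);
    for (const auto& name : names) {
        st.unlink(name);
    }
    st.complete(ctx);
}
```

If the process died between prepare() and complete(), the non-empty
pending_log is what a restart would replay.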
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The atomic deletion is going to look like
    auto ctx = storage.prepare()
    for sst in sstables:
        sst.unlink()
    storage.complete(ctx)
and this patch prepares the class storage for that by extending it with
prepare and complete methods. The opaque ctx object is also here.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The helper in question complicates the logic of sstable_directory::process() by performing garbage collection differently for sstables deleted "atomically" and those deleted "one-by-one". Also, the code that deletes sstables one-by-one and uses remove_by_toc_name() reads the TOC file needlessly, because there is an sstable object at hand and it has all_components() ready for use.
Surprisingly, there was no test for the deletion-log functionality. This PR adds one. The test passes before the g.c. and regular unlink fix, and (of course) continues passing after it.
Closes scylladb/scylladb#16240
* github.com:scylladb/scylladb:
sstables: Drop remove_by_name()
sstables/fs_storage: Wipe by recognized+unrecognized components
sstable_directory: Enlight deletion log replay
sstables: Split remove_by_toc_name()
test: Add test case to validate deletion log work
sstable_directory: Close dir on exception
sstable_directory: Fix indentation after previous patch
sstable_directory: Coroutinize delete_with_pending_deletion_log()
test: Sstable on_delete() is not necessarily in a thread
sstable_directory: Split delete_with_pending_deletion_log()
Fixes some typos as found by codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.
Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Currently wiping fs-backed sstable happens via reading and parsing its
TOC file back. Then the three-step process goes:
- move TOC -> TOC.tmp
- remove components (obtained from TOC.tmp)
- remove TOC.tmp
However, wiping sstable happens in one of two cases -- the sstable was
loaded from the TOC file _or_ sstable had evaluated the needed
components and generated TOC file. With that, the 2nd step can be made
without reading the TOC file, just by looking at all components sitting
on the sstable.
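The three steps, driven by a component list that is already known in
memory, might look like the sketch below. The file names "TOC.txt" and
"TOC.tmp" and both helpers are simplifications assumed for illustration.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Write the component files plus a TOC listing them, mimicking an sstable
// that generated its own TOC and therefore knows its components.
void write_sstable(const fs::path& dir, const std::vector<std::string>& components) {
    std::ofstream toc(dir / "TOC.txt");
    for (const auto& c : components) {
        std::ofstream out(dir / c);
        out << "data";
        toc << c << '\n';
    }
}

// Wipe using the in-memory component list, so TOC.tmp is never read back.
void wipe(const fs::path& dir, const std::vector<std::string>& components) {
    assert(fs::exists(dir / "TOC.txt"));
    fs::rename(dir / "TOC.txt", dir / "TOC.tmp");  // step 1: mark as being removed
    for (const auto& c : components) {
        fs::remove(dir / c);                       // step 2: remove components
    }
    fs::remove(dir / "TOC.tmp");                   // step 3: remove the marker
}
```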
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are several places where TOC file is parsed into a vector of
components -- sstable::read_toc(), remove_by_toc_name() and
remove_by_registry_entry(). All three deserve some generalization.
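A single shared parser could be as small as the following sketch. It
assumes the TOC is plain text with one component name per line and is a
simplified stand-in, not the function this patch introduces.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Parse TOC text into a vector of component names. Blank lines and
// surrounding whitespace (including a trailing '\r') are skipped.
std::vector<std::string> parse_toc(const std::string& text) {
    std::vector<std::string> components;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        const auto first = line.find_first_not_of(" \t\r");
        if (first == std::string::npos) {
            continue;  // empty or whitespace-only line
        }
        const auto last = line.find_last_not_of(" \t\r");
        assert(last >= first);
        components.push_back(line.substr(first, last - first + 1));
    }
    return components;
}
```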
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Nowadays if memtable gets flushed into misconfigured S3 storage, the flush fails and aborts the whole scylla process. That's not very elegant. First, because upon restart garbage collecting non-sealed sstables would fail again. Second, because re-configuring an endpoint can be done at runtime: scylla re-reads this config upon a HUP signal.
Flushing memtable restarts when seeing ENOSPC/EDQUOT errors from on-disk sstables. This PR extends this to handle misconfigured S3 endpoints as well.
fixes: #13745
Closes scylladb/scylladb#15635
* github.com:scylladb/scylladb:
test: Add object_store test to validate config reloading works
test: Add config update facility to test cluster
test: Make S3_Server export config file as pathlib.Path
config: Make object storage config updateable_value_source
memtable: Extend list of checking codes
sstables/storage/s3: Fix missing TOC status check
s3/client: Map http exceptions into storage_io_error
exceptions: Extend storage_io_error construction options
When sealing an sstable on local storage the storage driver performs
several flushes on a file that is a directory opened via checked-file.
Flush calls are wrapped with sstable_write_io_check, but that's
excessive, the checked file will wrap flushes with io-checks on its own.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#16173
When the TOC file is missing while garbage collecting, the S3 server would
resolve with storage_io_error(ENOENT) nowadays.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's only local storage type that needs directories touch/remove, S3
storage initialization is for now a no-op, maybe some day soon it will
appear.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's the manager that knows about storages and it should init/destroy
it. Also the "upload" and "staging" paths are about to be hidden in
sstables/ code, this code move also facilitates that.
The indentation in storage.cc is deliberately broken to make next patch
look nicer (spoiler: it won't have to shift those lines right).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now when the system.sstables has the state field, it can be changed
(UPDATEd). However, when changing the state AND generation, this still
won't work, because generation is the clustering key of the table in
question and cannot be just changed. This, nonetheless, is OK, as
generation changes with state only when moving an sstable from upload
dir into normal/staging and this is separate issue for S3 (#13018). For
now changing state only is OK.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The state is one of <empty>(normal)/staging/quarantine. Currently when
sstable is moved to non-normal state the s3 backend state_change() call
throws, so such sstables do not appear. Next patches are going to
change that and the new field in the system.sstables is needed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
since we use the sstable.generation() for the remote prefix of
the key of the object for storing the sstable component, there is
no need to set remote_prefix beforehand.
since `s3_storage::ensure_remote_prefix()` and
`system_keyspace::sstables_registry_lookup_entry()` are not used
anymore, they are removed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we create a new UUID for a new sstable managed
by the s3_storage, and we use the string representation of UUID
defined by RFC4122 like "0aa490de-7a85-46e2-8f90-38b8f496d53b" for
naming the objects stored on s3_storage. but this representation is
not what we are using for storing sstables on local filesystem when
the option of "uuid_sstable_identifiers_enabled" is enabled. instead,
we are using a base36-based representation which is shorter.
to be consistent with the naming of the sstables created for local
filesystem, and more importantly, to simplify the interaction between
the local copy of sstables and those stored on object storage, we should
use the same string representation of the sstable identifier.
so, in this change:
1. instead of creating a new UUID, just reuse the generation of the
sstable for the object's key.
2. do not store the uuid in the sstable_registry system table. As
we already have the generation of the sstable for the same purpose.
3. switch the sstable identifier representation from the one defined
by the RFC4122 (implemented by fmt::formatter<utils::UUID>) to the
base36-based one (implemented by
fmt::formatter<sstables::generation_type>)
4. enable the `uuid_sstable_identifiers` cluster feature if it is
enabled in the `test_env_config`, so that the sstable manager
can use a uuid-based identifier when creating a new identifier
for an sstable.
5. throw if the generation of sstable is not UUID-based when
accessing / manipulating an sstable with the S3 storage backend,
as the S3 storage backend now relies on this option. otherwise
we'd have sstables with keys like s3://bucket/number/basename, which
cannot serve as a unique id for an sstable if the bucket is
shared across multiple tables.
Fixes #14175
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Now no code uses those strings. Even worse -- there are some places that
need to provide some strings but don't have real values at hand, so just
hard-code the empty strings there (because they are really not used).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When booting there can be dangling entries in sstables registry as well
as objects on the storage itself. This patch makes the S3 lister list
those entries and then kick the s3_storage to remove the corresponding
objects. At the end the dangling entries are removed from the registry
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
before this change, filesystem_storage::open() reuses
`sstable::make_component_file_writer()` to create the
temporary toc, it will rename the temporary toc to the
real TOC when sealing the sstable.
but this prevents us from reusing filesystem_storage in
yet another storage backend. as the
1. create temporary
2. rename temporary to toc
dance only applies to filesystem_storage. when
filesystem_storage calls into sstable, it calls `sst.make_component_file_writer()`,
which in turn calls the `_storage->make_component_sink()`.
but at this moment, `_storage` is not necessarily `filesystem_storage`
anymore. it could be a wrapper around `filesystem_storage`,
which is not aware of the create-rename dance. and could do
a lot more than create a temporary file when asked to
"make_component_sink()".
if we really want to go this way by reusing sstable's API
in `filesystem_storage` to create a temporary toc, we will
have to rename the whatever temporary toc component created
by the wrapper backend to the toc with the seal() func. but
again, this rename op is only implemented in the
filesystem_storage backend. to mirror this operation in
the wrapper backend does not make sense at all -- it
does not have to be aware of the filesystem_storage's internals.
so in this change, instead of reusing the
`sstable::make_component_file_writer()`, we just inline
its implementation in filesystem_storage to avoid this
problem. this is also an improvement from the design
perspective, as the storage should not call into
the higher abstraction above it -- sstable.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14443
The filesystem storage driver uses different paths depending on sstable
state. It's possible to keep only table directory _and_ state on it and
construct this path on demand when needed, but it's faster to keep full
path onboard. All the more so it's only exported outside via .prefix()
call which is for logs only, but still
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Pretty cosmetic change, but it will allow S3 to finally support moving
sstables between states (after this patch it still doesn't)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's only used by fs storage driver that can do dir/file concatenation
on its own. Moreover, this method is not welcome to be used even
internally
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When deleting multiple sstables with the same prefix
the deletion atomicity is ensured by the pending_delete_log file,
so if scylla crashes in the middle, deletions will be replayed on
restart.
Therefore, we don't have to ensure atomicity of each individual
`unlink`. We just need to sync the directory once, before
removing the pending_delete_log file.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#14967
the formatter for sstables::generation_type does not support "d"
specifier, so we should not use "{:d}" for printing it. this worked
before d7c90b5239, but after that
change, generation_type is not an alias of int64_t anymore.
and its formatter does not support "d", so we should either
specialize fmt::formatter<generation_type> to support it or just
drop the specifier.
since seastar::format() is using
```c++
fmt::format_to(fmt::appender(out), fmt::runtime(fmt), std::forward<A>(a)...);
```
to print the arguments with given fmt string, we cannot identify
this kind of error at compile time.
at runtime, if we have issues like this, {fmt} would throw exception
like:
```
terminate called after throwing an instance of 'fmt::v9::format_error'
what(): invalid format specifier
```
when constructing the `std::runtime_error` instance.
so, in this change, "d" is removed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#14427
The method sits on sstable, but is called only from fs storage and it's
the only place that really needs it
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14230
In that level no io_priority_class-es exist. Instead, all the IO happens
in the context of current sched-group. File API no longer accepts prio
class argument (and makes io_intent arg mandatory to impls).
So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritors to the updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command
The first change is huge and was made semi-automatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields
Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass intent to lower file (if applicable)
Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile
The scylla-gdb.py update is a bit hairy -- it needs to use the task queues
list for IO class names and shares, but to detect whether it should, it
checks that the "commitlog" group is present.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13963
Current S3 uploading sink has an implicit limit on the final file size that comes from two places. First, the S3 protocol declares that uploading parts count from 1 to 10000 (inclusive). Second, the uploading sink sends out parts once they grow above the S3 minimal part size, which is 5MB. Since sstables put data in 128KB (or smaller) portions, parts are almost exactly 5MB in size, so the total uploading size cannot grow above ~50GB. That's too low.
To break the limit the new sink (called jumbo sink) uses the UploadPartCopy S3 call that helps splicing several objects into one right on the server. Jumbo sink starts uploading parts into an intermediate temporary object called a piece and named ${original_object}_${piece_number}. When the number of parts in current piece grows above the configured limit the piece is finalized and upload-copied into the object as its next part, then deleted. This happens in the background, meanwhile the new piece is created and subsequent data is put into it. When the sink is flushed the current piece is flushed as is and also squashed into the object.
The new jumbo sink is capable of uploading ~500TB of data, which looks enough.
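The arithmetic behind both limits can be checked with a few constants.
The part size is rounded to 5 * 10^6 bytes, and the value of 10000 parts
per piece used to reach the larger figure is an assumption of this
sketch, since the text only says the per-piece limit is configured.

```cpp
#include <cassert>
#include <cstdint>

// Figures quoted in the text: at most 10000 parts, each about 5 MB.
constexpr std::uint64_t max_parts = 10000;
constexpr std::uint64_t part_size = 5ull * 1000 * 1000;

// Plain multipart upload: number of parts times part size.
constexpr std::uint64_t simple_sink_limit() {
    return max_parts * part_size;
}

// Jumbo upload: each of the 10000 final parts is itself a piece built
// from up to parts_per_piece uploaded parts.
constexpr std::uint64_t jumbo_sink_limit(std::uint64_t parts_per_piece) {
    return max_parts * parts_per_piece * part_size;
}

static_assert(simple_sink_limit() == 50ull * 1000 * 1000 * 1000, "about 50 GB");
```

Under that assumption, jumbo_sink_limit(10000) evaluates to 5 * 10^14
bytes, which matches the quoted figure of roughly 500 TB.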
fixes: #13019
Closes #13577
* github.com:scylladb/scylladb:
sstables: Switch data and index sink to use jumbo uploader
s3/test: Tune-up multipart upload test alignment
s3/test: Add jumbo upload test
s3/client: Wait for background upload fiber on close-abort
s3/client: Implement jumbo upload sink
s3/client: Move memory buffers to upload_sink from base
s3/client: Move last part upload out of finalize_upload()
s3/client: Merge do_flush() with upload_part()
s3/client: Rename upload_sink -> upload_sink_base
Currently temporary directories with incomplete sstables and pending deletion log are processed by distributed loader on start. That's not nice, because for s3 backed sstables this code makes no sense (and is currently a no-op because of incomplete implementation). This garbage collecting should be kept in sstable_directory where it can off-load this work onto lister component that is storage-aware.
Once the g.c. code is moved, the sstable class's list of static helpers can be cleaned up a bit.
refs: #13024
refs: #13020
refs: #12707
Closes #13767
* github.com:scylladb/scylladb:
sstable: Toss tempdir extension usage
sstable: Drop pending_delete_dir_basename()
sstable: Drop is_pending_delete_dir() helper
sstable_directory: Make garbage_collect() non-static
sstable_directory: Move deletion log exists check
distributed_loader: Move garbage collecting into sstable_directory
distributed_loader: Collect garbage collecting in one call
sstable: Coroutinize remove_temp_dir()
sstable: Coroutinize touch_temp_dir()
sstable: Use storage::temp_dir instead of hand-crafted path
sstables_manager::get_component_lister() is used by sstable_directory.
and almost all the "ingredients" used to create a component lister
are located in sstable_directory. among the other things, the two
implementations of `components_lister` are located right in
`sstable_directory`. there is no need to outsource this to
sstables_manager just for accessing the system_keyspace, which is
already exposed as a public function of `sstables_manager`. so let's
move this helper into sstable_directory as a member function.
with this change, we can even go further by moving the
`components_lister` implementations into the same .cc file.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13853
The tempdir for filesystem-based sstables is {generation}.sstable one.
There are two places that need to know the ".sstable" extension -- the
tempdir creating code and the tempdir garbage-collecting code.
This patch simplifies the sstable class by patching the aforementioned
functions to use newly introduced tempdir_extension string directly,
without the help of static one-line helpers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When opening an sstable on filesystem it's first created in a temporary
directory whose path is saved in storage::temp_dir variable. However,
the opening method constructs the path by hand. Fix that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
These two can grow large. Non-jumbo sink is effectively limited to
10000 parts; since each is ~5MB, the maximum uploadable data/index
happens to be 50GB, which is too small.
Other components shouldn't grow that big and continue using simple and a
bit faster uploading sink.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two layers of sstable deletion -- delete-atomically and wipe. The former is in fact the "API" method, it's called by table code when the specific sstable(s) are no longer needed. It's called "atomically" because it's expected to fail in the middle in a safe manner so that subsequent boot would pick the dangling parts and proceed. The latter is a low-level removal function that can fail in the middle, but handling that is not _its_ concern.
Currently the atomic deletion is implemented with the help of sstable_directory::delete_atomically() method that commits sstables files names into deletion log, then calls wipe (indirectly), then drops the deletion log. On boot all found deletion logs are replayed. The described functionality is used regardless of the sstable storage type, even for S3, though a deletion log is overkill for S3, where it would be better implemented with the help of the ownership table. In fact, S3 storage already implements atomic deletion in its wipe method, thus being overly careful.
So this PR
- makes atomic deletion be storage-specific
- makes S3 wipe non-atomic
fixes: #13016
note: Replaying sstables deletion from ownership table on boot is not here, see #13024
Closes #13562
* github.com:scylladb/scylladb:
sstables: Implement atomic deleter for s3 storage
sstables: Get atomic deleter from underlying storage
sstables: Move delete_atomically to manager and rename
The existing storage::wipe() method of s3 is in fact atomic deleter --
it commits "deleting" status into ownership table, deletes the objects
from server, then removes the entry from ownership table. So the atomic
deleter does the same and the .wipe() just removes the objects, because
it's not supposed to be atomic.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
While the driver isn't known without the sstable itself, we have a
vector of them and can get it from the front element. This is not very
generic, but fortunately all sstables here belong to the same table and,
respectively, to the same storage and even prefix. The latter is also
assert-checked by the sstable_directory atomic deleter code.
For now S3 storage returns the same directory-based deleter, but next
patch will change that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>