Loads just the metadata components. No validation.
Split off from load(), to allow scylla-sstable to partially load an
sstable.
(cherry picked from commit 435c01d1e6)
With intra-node migration, all the movement is local, so we can make
streaming faster by just cloning the sstable set of leaving replica
and loading it into the pending one.
This cloning is underlying storage specific, but s3 doesn't support
snapshot() yet (th sstables::storage procedure which clone is built
upon). It's only supported by file system, with help of hard links.
A new generation is picked for new cloned sstable, and it will
live in the same directory as the original.
A challenge I bumped into was to understand why table refused to
load the sstable at pending replica, as it considered them foreign.
Later I realized that sharder (for reads) at this stage of migration
will point only to leaving replica. It didn't fail with mutation
based streaming, because the sstable writer considers the shard --
that the sstable was written into -- as its owner, regardless of what
sharder says. That was fixed by mimicking this behavior during
loading at pending.
test:
./test.py --mode=dev intranode --repeat=100 passes.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Added support to reload components from which memory was previously
reclaimed as the total memory of reclaimable components crossed a
threshold. The implementation is kept simple as only the bloom filters
are considered reclaimable for now.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Renamed the intrusive list link type to differentiate it from the set
link type that will be added in an upcoming patch.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Added a member variable _total_memory_reclaimed to the sstable class
that tracks the total memory reclaimed from a sstable.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
in in {fmt} before v10, it provides the specialization of `fmt::formatter<..>`
for `std::string_view` as well as the specialization of `fmt::formatter<..>`
for `fmt::string_view` which is an implementation builtin in {fmt} for
compatibility of pre-C++17. and this type is used even if the code is
compiled with C++ stadandard greater or equal to C++17. also, before v10,
the `fmt::formatter<std::string_view>::format()` is defined so it accepts
`std::string_view`. after v10, `fmt::formatter<std::string_view>` still
exists, but it is now defined using `format_as()` machinery, so it's
`format()` method does not actually accept `std::string_view`, it
accepts `fmt::string_view`, as the former can be converted to
`fmt::string_view`.
this is why we can inherit from `fmt::formatter<std::string_view>` and
use `formatter<std::string_view>::format(foo, ctx);` to implement the
`format()` method with {fmt} v9, but we cannot do this with {fmt} v10,
and we would have following compilation failure:
```
FAILED: service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o
/home/kefu/.local/bin/clang++ -DFMT_DEPRECATED_OSTREAM -DFMT_SHARED -DSCYLLA_BUILD_MODE=release -DSEASTAR_API_LEVEL=7 -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SSTRING -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"RelWithDebInfo\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++20 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere -mllvm -inline-threshold=2500 -fno-slp-vectorize -U_FORTIFY_SOURCE -Werror=unused-result -MD -MT service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o -MF service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o.d -o service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o -c /home/kefu/dev/scylladb/service/topology_state_machine.cc
/home/kefu/dev/scylladb/service/topology_state_machine.cc:254:41: error: no matching member function for call to 'format'
254 | return formatter<std::string_view>::format(it->second, ctx);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
/usr/include/fmt/core.h:2759:22: note: candidate function template not viable: no known conversion from 'seastar::basic_sstring<char, unsigned int, 15>' to 'const fmt::basic_string_view<char>' for 1st argument
2759 | FMT_CONSTEXPR auto format(const T& val, FormatContext& ctx) const
| ^ ~~~~~~~~~~~~
```
because the inherited `format()` method actually comes from
`fmt::formatter<fmt::string_view>`. to reduce the confusion, in this
change, we just inherit from `fmt::format<string_view>`, where
`string_view` is actually `fmt::string_view`. this follows
the document at
https://fmt.dev/latest/api.html#formatting-user-defined-types,
and since there is less indirection under the hood -- we do not
use the specialization created by `FMT_FORMAT_AS` which inherit
from `formatter<fmt::string_view>`, hopefully this can improve
the compilation speed a little bit. also, this change addresses
the build failure with {fmt} v10.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18299
Added support to track total memory from components that are reclaimable
and to reclaim memory from them if and when required. Right now only the
bloom filters are considered as reclaimable components but this can be
extended to any component in the future.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
sstables_manager now depends on system_keyspace for access to the
system.sstables table, needed by object storage. This violates
modularity, since sstables_manager is a relatively low-level leaf
module while system_keyspace integrates large parts of the system
(including, indirectly, sstables_manager).
One area where this is grating is sstables::test_env, which has
to include the much higher level cql_test_env to accommodate it.
Fix this by having sstables_manager expose its dependency on
system_keyspace as an interface, sstables_registry, and have
system_keyspace implement the glue logic in
system_keyspace_sstables_manager.
Closesscylladb/scylladb#17868
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for `sstables::sstable_state`,
drop its operator<<.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change is a follow-up of 637dd730. the goal is to use
std::filesystem::path for manipulating paths, and to avoid the
converting between sstring and fs::path back and forth.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#17214
state_to_dir(sstable_state) translate the enum to the corresponding
directory component. and it returns a `seastar::sstring`. not all
the callers of this function expect a full-blown sstring instance,
on the contrary, quite a few of them just want a string-alike object
which represents the directory component, so they can use it, for
instance to compose a path, or just format the given `state` enum.
so to avoid the overhead of creating/destroying the `seastar::sstring`
instance, let's switch to `std::string_view`. with this change, we
will be able to implement the fmt::formatter for `sstable_state`
without the help of the formatter of sstring.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#17213
Currently, we pass an effective_replication_map_ptr to sstable_writer,
so that we can get a stable dht::sharder for writing the sharding metadata.
This is needed because with tablets, the sharder can change dynamically.
However, this is both bad and unnecessary:
- bad: holding on to an effective_replication_map_ptr is a barrier
for topology operations, preventing tablet migrations (etc) while
an sstable is being written
- unnecessary: tablets don't require sharding metadata at all, since
two tablets cannot overlap (unlike two sstables from different shards in
the same node). So the first/last key is sufficient to determine the
shard/tablet ownership.
Given that, just pass the sharder for vnode sstables, and don't generate
sharding metadata for tablet sstables.
The sstables replay_position in stats_metadata is
valid only on the originating node and shard.
Therefore, validate the originating host and shard
before using it in compaction or table truncate.
Fixes#10080
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#16550
Consider this:
1) file streaming takes storage snapshot = list of sstables
2) concurrent compaction unlink some of those sstables from file system
3) file streaming tries to send unlinked sstables, but files other
than data and index cannot be read as only data and index have file
descriptors opened
To fix it, the snapshot now returns a set of files, one per sstable
component, for each sstable.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#16476
The interface is fragile because the user may incorrectly use the
wrong "gc before". Given that sstable knows how to properly calculate
"gc before", let's do it in estimate__d__t__r(), leaving no room
for mistakes.
sstable_run's variant was also changed to conform to new interface,
allowing ICS to properly estimate droppable ratio, using GC before
that is calculated using each sstable's range. That's important for
upcoming tablets, as we want to query only the range that belongs
to a particular tablet in the repair history table.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#15931
The helper in question complicates the logic of sstable_directory::process() by making garbage collection differently for sstables deleted "atomically" and deleted "one-by-one". Also, the code that deletes sstables one-by-one and uses remove_by_toc_name() renders excessive TOC file reading, because there's sstable object at hand and it had all_components() ready for use.
Surprisingly, there was no test for the deletion-log functionality. This PR adds one. The test passes before the g.c. and regular unlink fix, and (of course) continues passing after it.
Closesscylladb/scylladb#16240
* github.com:scylladb/scylladb:
sstables: Drop remove_by_name()
sstables/fs_storage: Wipe by recognized+unrecognized components
sstable_directory: Enlight deletion log replay
sstables: Split remove_by_toc_name()
test: Add test case to validate deletion log work
sstable_directory: Close dir on exception
sstable_directory: Fix indentation after previous patch
sstable_directory: Coroutinize delete_with_pending_deletion_log()
test: Sstable on_delete() is not necessarily in a thread
sstable_directory: Split delete_with_pending_deletion_log()
Fixes some typos as found by codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.
Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
The helper consists of three phases:
- move TOC -> TOC.tmp
- remove components listed in TOC
- remove TOC.tmp
The first step is needed separately by the next patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are several places where TOC file is parsed into a vector of
components -- sstable::read_toc(), remove_by_toc_name() and
remove_by_registry_entry(). All three deserve some generalization.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
this change is a cleanup.
to mark a return value without value semantics has no effect. these
`const` specifier useless. so let's drop them.
and, if we compile the tree with `-Wignore-qualifiers`, the compiler
would warn like:
```
/home/kefu/dev/scylladb/schema/schema.hh:245:5: error: 'const' type qualifier on return type has no effect [-Werror,-Wignored-qualifiers]
245 | const index_metadata_kind kind() const;
| ^~~~~
```
so this change also silences the above warnings.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
to have feature parity with `configure.py`. we won't need this
once we migrate to C++20 modules. but before that day comes, we
need to stick with C++ headers.
we generate a rule for each .hh files to create a corresponding
.cc and then compile it, in order to verify the self-containness of
that header. so the number of rule is quite large, to avoid the
unnecessary overhead. the check-header target is enabled only if
`Scylla_CHECK_HEADERS` option is enabled.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#15913
There's the backward converter already out there. Next code will need to
convert string representation of the state back to the internal type.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After "repair: Get rid of the gc_grace_seconds", the sstable's schema (mode,
gc period if applicable, etc) is used to estimate the amount of droppable
data (or determine full expiration = max_deletion_time < gc_before).
It could happen that the user switched from timeout to repair mode, but
sstables will still use the old mode, despite the user asked for a new one.
Another example is when you play with value of grace period, to prevent
data resurrection if repair won't be able to run in a timely manner.
The problem persists until all sstables using old GC settings are recompacted
or node is restarted.
To fix this, we have to feed latest schema into sstable procedures used
for expiration purposes.
Fixes#15643.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#15746
Validation scrub bypasses the usual compaction machinery, though it
still needs to be tracked with compaction_progress_monitor so that
we could reach its progress from compaction task executor.
Track sstable scrub in validate mode with read monitors.
When loading unshared remote sstable, sstable_directory needs to make a
descriptor out of a real sstable. For that it parses the sstable's Data
component path which is pretty weird. It's simpler to make descriptor
out of the ssatble itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When populating from a particular directory, populator code converts
state to subdir name, then prints the path. The conversion is pretty
much artificial, it's better to provide printer for state and print
state explicitly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This means -- keep state on sstable, change it with change_state() call
and (!) fix the is_<state>() helpers not to check storage->prefix()
nit: mark requires_view_building() noexcept while at it
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Pretty cosmetic change, but it will allow S3 to finally support moving
sstables between states (after this patch it still doesn't)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This just moves make_path() call from outside of sstable::sstable()
inside it. Later it will be moved even further. Also, now sstable can
know its state and keep it (next patch)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are several states between which an sstable can migrate. Nowadays
the state is encoded into sstable directory, which is not nice. Also S3
backed sstables don't support states only keeping sstables in "normal".
This patch adds enum state in order to replace the path-encoded one
eventually. The new sstables_manager::make_sstable() method is added
that accepts table directory (without quarantine/ or staging/ component)
and the desired initial state (optional). Next patches will make use of
this maker and the existing one will be removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's only used by fs storage driver that can do dir/file concatenation
on its own. Moreover, this method is not welcome to be used even
internally
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When deleting multiple sstables with the same prefix
the deletion atomicity is ensured by the pending_delete_log file,
so if scylla crashes in the middle, deletions will be replyed on
restart.
Therefore, we don't have to ensure atomicity of each individual
`unlink`. We just need to sync the directory once, before
removing the pending_delete_log file.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#14967
schema::get_sharder() does not use the correct sharder for
tablet-based tables. Code which is supposed to work with all kinds of
tables should obtain the sharder from erm::get_sharder().
We need to keep sharding metadata consistent with tablet mapping to
shards in order for node restart to detect that those sstables belong
to a single shard and that resharding is not necessary. Resharding of
sstables based on tablet metadata is not implemented yet and will
abort after this series.
Keeping sharding metadata accurate for tablets is only necessary until
compaction group integration is finished. After that, we can use the
sstable token range to determine the owning tablet and thus the owning
shard. Before that, we can't, because a single sstable may contain
keys from different tablets, and the whole key range may overlap with
keys which belong to other shards.
The method sits on sstable, but is called only from fs storage and it's
the only place that really needs it
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14230
In that level no io_priority_class-es exist. Instead, all the IO happens
in the context of current sched-group. File API no longer accepts prio
class argument (and makes io_intent arg mandatory to impls).
So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritants to updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command
The first change is huge and was made semi-autimatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields
Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass intent to lower file (if applicatble)
Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile
The scylla-gdb.py update is a bit hairry -- it needs to use task queues
list for IO classes names and shares, but to detect it should it checks
for the "commitlog" group is present.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13963
The method is called with explicitly default pririty class and puts one
into the fstream options. This whole chain can be avoided
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sstable.read_checksum() and .read_digest() accept prio class
argument from validate_checsums(), but it's always the "default" one.
Remove the arg and remove stream options initializations as they'll pick
up default prio class on their default constructing.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All calls to sstables::validate_checksums() happen with explicitly
default priority class. Just hard-code it as such in the method
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The tempdir for filesystem-based sstables is {generation}.sstable one.
There are two places that need to know the ".sstable" extention -- the
tempdir creating code and the tempdir garbage-collecting code.
This patch simplifies the sstable class by patching the aforementioned
functions to use newly introduced tempdir_extension string directly,
without the help of static one-line helpers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The helper is used to return const char* value of the pending delete
dir. Callers can use it directly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's only used by the sstable_directory::replay_pending_delete_log()
method. The latter is only called by the sstable_directory itself with
the path being pending-delete dir for sure. So the method can be made
private and the is_pending_delete_dir() can be removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
this change extracts the storage class and its derived classes
out into their own source files. for couple reasons:
- for better readability. the sstables.hh is over 1005 lines.
and sstables.cc 3602 lines. it's a little bit difficult to figure
out how the different parts in these sources interact with each
other. for instance, with this change, it's clear some of helper
functions are only used by file_system_storage.
- probably less inter-source dependency. by extracting the sources
files out, they can be compiled individually, so changing one .cc
file does not impact others. this could speed up the compilation
time.
Closes#13785
* github.com:scylladb/scylladb:
sstables: storage: coroutinize idempotent_link_file()
sstables: extract storage out
this change extracts the storage class and its derived classes
out into storage.cc and storage.hh. for couple reasons:
- for better readability. the sstables.hh is over 1005 lines.
and sstables.cc 3602 lines. it's a little bit difficult to figure
out how the different parts in these sources interact with each
other. for instance, with this change, it's clear some of helper
functions are only used by file_system_storage.
- probably less inter-source dependency. by extracting the sources
files out, they can be compiled individually, so changing one .cc
file does not impact others. this could speed up the compilation
time.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
In addition to the data file itself. Currently validation avoids the
index altogether, using the crawling reader which only relies on the
data file and ignores the index+summary. This is because a corrupt
sstable usually has a corrupt index too and using both at the same time
might hide the corruption. This patch adds targeted validation of the
index, independent of and in addition to the already existing data
validation: it validates the order of index entries as well as whether
the entry points to a complete partition in the data file.
This will usually result in duplicate errors for out-of-order
partitions: one for the data file and one for the index file.
Fixes: #9611Closes#11405
* github.com:scylladb/scylladb:
test/cql-pytest: add test_sstable_validation.py
test/cql-pytest: extract scylla_path,temp_workdir fixtures to conftest.py
tools/scylla-sstables: write validation result to stdout
sstables/sstable: validate(): delegate to mx validator for mx sstables
sstables/mx/reader: add mx specific validator
mutation/mutation_fragment_stream_validator: add validator() accessor to validating filter
sstables/mx/reader: template data_consume_rows_context_m on the consumer
sstables/mx/reader: move row_processing_result to namespace scope
sstables/mx/reader: use data_consumer::proceed directly
sstables/mx/reader.cc: extend namespace to end-of-file (cosmetic)
compaction/compaction: remove now unused scrub_validate_mode_validate_reader()
compaction/compaction: move away from scrub_validate_mode_validate_reader()
tools/scylla-sstable: move away from scrub_validate_mode_validate_reader()
test/boost/sstable_compaction_test: move away from scrub_validate_mode_validate_reader()
sstables/sstable: add validate() method
compaction/compaction: scrub_sstables_validate_mode(): validate sstables one-by-one
compaction: scrub: use error messages from validator
mutation_fragment_stream_validator: produce error messages in low-level validator
To replace the validate code currently in compaction/compaction.cc (not
in this commit). We want to push down this logic to the sstable layer,
so that:
* Non compaction code that wishes to validate sstables (tests, tools)
doesn't have to go through compaction.
* We can abstract how sstables are validated, in particular we want to
add a new more low-level validation method that only the more recent
sstable versions (mx) will support.
The s3_storage leaks client when sstable gets destoryed. So far this
came unnoticed, but debug-mode unit test ran over minio captured it. So
here's the fix.
When sstable is destroyed it also kicks the storage to do whatever
cleanup is needed. In case of s3 storage the cleanup is in closing the
on-boarded client. Until #13458 is fixed each sstable has its own
private version of the client and there's no other place where it can be
close()d in co_await-able mannter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>