There are two schemas associated with a sstable writer:
the sstable's schema (i.e. the schema of the table at the time when the
sstable object was created), and the writer's schema (equal to the schema
of the reader which is feeding into the writer).
It's easy to mix up the two and break something as a result.
The writer's schema is needed to correctly interpret and serialize the data
passing through the writer, and to populate the on-disk metadata about the
on-disk schema.
The sstables's schema is used to configure some parameters for newly created
sstable, such as bloom filter false positive ratio, or compression.
This series fixes the known mixups between the two — when setting up compression,
and when setting up the bloom filters.
Fixesscylladb/scylladb#16065
The bug is present in all supported versions, so the patch has to be backported to all of them.
(cherry picked from commit a1834efd82)
(cherry picked from commit d10b38ba5b)
(cherry picked from commit 1a8ee69a43)
Refs scylladb/scylladb#19695Closesscylladb/scylladb#19877
* github.com:scylladb/scylladb:
sstables/mx/writer: when creating local_compression, use the sstables's schema, not the writer's
sstables/mx/writer: when creating filter, use the sstables's schema, not the writer's
sstables: for i_filter downcasts, use dynamic_cast instead of static_cast
As of this patch, those static_casts are actually invalid in some cases
(they cast to the wrong type) because of an oversight.
A later patch will fix that. But to even write a reliable reproducer
for the problem, we must force the invalid casts to manifest as a crash
(instead of weird results).
This patch both allows writing a reproducer for the bug and serves
as a bit of defensive programming for the future.
(cherry picked from commit a1834efd82)
# Conflicts:
# sstables/sstables.cc
The SSTable is removed from the reclaimed memory tracking logic only
when its object is deleted. However, there is a risk that the Bloom
filter reloader may attempt to reload the SSTable after it has been
unlinked but before the SSTable object is destroyed. Prevent this by
removing the SSTable from the reclaimed list maintained by the manager
as soon as it is unlinked.
The original logic that updated the memory tracking in
`sstables_manager::deactivate()` is left in place as (a) the variables
have to be updated only when the SSTable object is actually deleted, as
the memory used by the filter is not freed as long as the SSTable is
alive, and (b) the `_reclaimed.erase(*sst)` is still useful during
shutdown, for example, when the SSTable is not unlinked but just
destroyed.
Fixes https://github.com/scylladb/scylladb/issues/19722
Closes scylladb/scylladb#19717
* github.com:scylladb/scylladb:
boost/bloom_filter_test: add testcase to verify unlinked sstables are not reloaded
sstables: do not reload components of unlinked sstables
sstables/sstables_manager: introduce on_unlink method
(cherry picked from commit 591876b44e)
Backported from #19717 to 6.0
Closesscylladb/scylladb#19830
Loads just the metadata components. No validation.
Split off from load(), to allow scylla-sstable to partially load an
sstable.
(cherry picked from commit 435c01d1e6)
With intra-node migration, all the movement is local, so we can make
streaming faster by just cloning the sstable set of leaving replica
and loading it into the pending one.
This cloning is underlying storage specific, but s3 doesn't support
snapshot() yet (th sstables::storage procedure which clone is built
upon). It's only supported by file system, with help of hard links.
A new generation is picked for new cloned sstable, and it will
live in the same directory as the original.
A challenge I bumped into was to understand why table refused to
load the sstable at pending replica, as it considered them foreign.
Later I realized that sharder (for reads) at this stage of migration
will point only to leaving replica. It didn't fail with mutation
based streaming, because the sstable writer considers the shard --
that the sstable was written into -- as its owner, regardless of what
sharder says. That was fixed by mimicking this behavior during
loading at pending.
test:
./test.py --mode=dev intranode --repeat=100 passes.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
In preparation for intra-node tablet migration, to avoid
using deprecated sharder APIs.
This function is used for generating sstable sharding metadata.
For tablets, it is not invoked, so we can safely work with the
static sharder. The call site already passes static_sharder only.
Added support to reload components from which memory was previously
reclaimed as the total memory of reclaimable components crossed a
threshold. The implementation is kept simple as only the bloom filters
are considered reclaimable for now.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Added a member variable _total_memory_reclaimed to the sstable class
that tracks the total memory reclaimed from a sstable.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Renamed local variable in sstable::total_reclaimable_memory_size in
preparation for the next patch which adds a new member variable
_total_memory_reclaimed to the sstable class.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
It's completely unused, likely in favor of recently added formatter
for the type in question.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18502
When reclaiming memory from bloom filters, do not remove them from
_recognised_components, as that leads to the on-disk filter component
being left back on disk when the SSTable is deleted.
Fixes#18398
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#18400
since we do not rely on FMT_DEPRECATED_OSTREAM to define the
fmt::formatter for us anymore, let's stop defining `FMT_DEPRECATED_OSTREAM`.
in this change,
* utils: drop the range formatters in to_string.hh and to_string.c, as
we don't use them anymore. and the tests for them in
test/boost/string_format_test.cc are removed accordingly.
* utils: use fmt to print chunk_vector and small_vector. as
we are not able to print the elements using operator<< anymore
after switching to {fmt} formatters.
* test/boost: specialize fmt::details::is_std_string_like<bytes>
due to a bug in {fmt} v9, {fmt} fails to format a range whose
element type is `basic_sstring<uint8_t>`, as it considers it
as a string-like type, but `basic_sstring<uint8_t>`'s char type
is signed char, not char. this issue does not exist in {fmt} v10,
so, in this change, we add a workaround to explicitly specialize
the type trait to assure that {fmt} format this type using its
`fmt::formatter` specialization instead of trying to format it
as a string. also, {fmt}'s generic ranges formatter calls the
pair formatter's `set_brackets()` and `set_separator()` methods
when printing the range, but operator<< based formatter does not
provide these method, we have to include this change in the change
switching to {fmt}, otherwise the change specializing
`fmt::details::is_std_string_like<bytes>` won't compile.
* test/boost: in tests, we use `BOOST_REQUIRE_EQUAL()` and its friends
for comparing values. but without the operator<< based formatters,
Boost.Test would not be able to print them. after removing
the homebrew formatters, we need to use the generic
`boost_test_print_type()` helper to do this job. so we are
including `test_utils.hh` in tests so that we can print
the formattable types.
* treewide: add "#include "utils/to_string.hh" where
`fmt::formatter<optional<>>` is used.
* configure.py: do not define FMT_DEPRECATED_OSTREAM
* cmake: do not define FMT_DEPRECATED_OSTREAM
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we include `fmt/ranges.h` and/or `fmt/std.h`
for formatting the container types, like vector, map
optional and variant using {fmt} instead of the homebrew
formatter based on operator<<.
with this change, the changes adding fmt::formatter and
the changes using ostream formatter explicitly, we are
allowed to drop `FMT_DEPRECATED_OSTREAM` macro.
Refs scylladb#13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Reclaim memory from the SSTable that has the most reclaimable memory if
the total reclaimable memory has crossed the threshold. Only the bloom
filter memory is considered reclaimable for now.
Fixes#17747
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
sstables_manager::_total_reclaimable_memory variable tracks the total
memory that is reclaimable from all the SSTables managed by it.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Added support to track total memory from components that are reclaimable
and to reclaim memory from them if and when required. Right now only the
bloom filters are considered as reclaimable components but this can be
extended to any component in the future.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Loader was changed to quickly determine ownership after consuming
sharding metadata only. If it's not available, it falls back to
reading first and last keys from summary. The fallback is only there
for backward compatibility and it costs a lot more as we don't
skip to the end where keys are located in summary.
With tablets, sharding metadata is only first and last keys so
we can do it without sharder. So loader will be able to use it
instead of looking up keys in summary.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#17805
If index_reader isn't closed before it is destroyed, then ongoing
sstables reads won't be awaited and assertion will be triggered.
Close index_reader in has_partition_key before destroying it.
both of its callers are passing parent_path() and filename() to
it. so let the callee to do this. simpler this way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#17225
despite that "welp" is more emotional expressive, it is considered
a misspelling of "whelp" by codespell. that's why this comment stands
out. but from a non-native speaker's point of view, probably we can
use more descriptive words to explain what "welp" is for in plain
English.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#17183
get0() dates back from the days where Seastar futures carried tuples, and
get0() was a way to get the first (and usually only) element. Now
it's a distraction, and Seastar is likely to deprecate and remove it.
Replace with seastar::future::get(), which does the same thing.
instead of capturing the return value of `get0()` with a reference
type, use a plain type. as `get0()` returns a plain `T` while `get0()`
returns a `T&&`, to avoid the value referenced by `T&&` gets destroyed
after the expression, let's use a plain `auto` instead of `auto&&`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Currently, we pass an effective_replication_map_ptr to sstable_writer,
so that we can get a stable dht::sharder for writing the sharding metadata.
This is needed because with tablets, the sharder can change dynamically.
However, this is both bad and unnecessary:
- bad: holding on to an effective_replication_map_ptr is a barrier
for topology operations, preventing tablet migrations (etc) while
an sstable is being written
- unnecessary: tablets don't require sharding metadata at all, since
two tablets cannot overlap (unlike two sstables from different shards in
the same node). So the first/last key is sufficient to determine the
shard/tablet ownership.
Given that, just pass the sharder for vnode sstables, and don't generate
sharding metadata for tablet sstables.
The sstables replay_position in stats_metadata is
valid only on the originating node and shard.
Therefore, validate the originating host and shard
before using it in compaction or table truncate.
Fixes#10080
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#16550
Store schema_ptr in reader permit instead of storing a const pointer to
schema to ensure that the schema doesn't get changed elsewhere when the
permit is holding on to it. Also update the constructors and all the
relevant callers to pass down schema_ptr instead of a raw pointer.
Fixes#16180
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#16658
Consider this:
1) file streaming takes storage snapshot = list of sstables
2) concurrent compaction unlink some of those sstables from file system
3) file streaming tries to send unlinked sstables, but files other
than data and index cannot be read as only data and index have file
descriptors opened
To fix it, the snapshot now returns a set of files, one per sstable
component, for each sstable.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#16476
The interface is fragile because the user may incorrectly use the
wrong "gc before". Given that sstable knows how to properly calculate
"gc before", let's do it in estimate__d__t__r(), leaving no room
for mistakes.
sstable_run's variant was also changed to conform to new interface,
allowing ICS to properly estimate droppable ratio, using GC before
that is calculated using each sstable's range. That's important for
upcoming tablets, as we want to query only the range that belongs
to a particular tablet in the repair history table.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#15931
To be used in the next patch to control whether the semaphore registers
and exports metrics or not. We want to move metric registration to the
semaphore but we don't want all semaphores to export metrics. The
decision on whether a semaphore should or shouldn't export metrics
should be made on a case-by-case basis so this new parameter has no
default value (except for the for_tests constructor).
The helper in question complicates the logic of sstable_directory::process() by making garbage collection differently for sstables deleted "atomically" and deleted "one-by-one". Also, the code that deletes sstables one-by-one and uses remove_by_toc_name() renders excessive TOC file reading, because there's sstable object at hand and it had all_components() ready for use.
Surprisingly, there was no test for the deletion-log functionality. This PR adds one. The test passes before the g.c. and regular unlink fix, and (of course) continues passing after it.
Closesscylladb/scylladb#16240
* github.com:scylladb/scylladb:
sstables: Drop remove_by_name()
sstables/fs_storage: Wipe by recognized+unrecognized components
sstable_directory: Enlight deletion log replay
sstables: Split remove_by_toc_name()
test: Add test case to validate deletion log work
sstable_directory: Close dir on exception
sstable_directory: Fix indentation after previous patch
sstable_directory: Coroutinize delete_with_pending_deletion_log()
test: Sstable on_delete() is not necessarily in a thread
sstable_directory: Split delete_with_pending_deletion_log()
Fixes some typos as found by codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.
Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
The helper consists of three phases:
- move TOC -> TOC.tmp
- remove components listed in TOC
- remove TOC.tmp
The first step is needed separately by the next patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are several places where TOC file is parsed into a vector of
components -- sstable::read_toc(), remove_by_toc_name() and
remove_by_registry_entry(). All three deserve some generalization.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
this change is a cleanup.
to mark a return value without value semantics has no effect. these
`const` specifier useless. so let's drop them.
and, if we compile the tree with `-Wignore-qualifiers`, the compiler
would warn like:
```
/home/kefu/dev/scylladb/schema/schema.hh:245:5: error: 'const' type qualifier on return type has no effect [-Werror,-Wignored-qualifiers]
245 | const index_metadata_kind kind() const;
| ^~~~~
```
so this change also silences the above warnings.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
After "repair: Get rid of the gc_grace_seconds", the sstable's schema (mode,
gc period if applicable, etc) is used to estimate the amount of droppable
data (or determine full expiration = max_deletion_time < gc_before).
It could happen that the user switched from timeout to repair mode, but
sstables will still use the old mode, despite the user asked for a new one.
Another example is when you play with value of grace period, to prevent
data resurrection if repair won't be able to run in a timely manner.
The problem persists until all sstables using old GC settings are recompacted
or node is restarted.
To fix this, we have to feed latest schema into sstable procedures used
for expiration purposes.
Fixes#15643.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#15746
`deletion_time` is a part of the `partition_header`, which is in turn
a part of `partition`. and `data_file` is a sequence of `partition`.
`data_file` represents *-Data.db component of an SSTable.
see docs/architecture/sstable3/sstables-3-data-file-format.rst.
we always parse the data component via `flat_mutation_reader_v2`, which is in turn
implemented with mx/reader.cc or kl/reader.cc depending on
the version of SSTable to be read.
in other words, we decode `deletion_time` in mx/reader.cc or
kl/reader.cc, not in sstable.cc. so let's drop the overload
parse() for deletion_time. it's not necessary and more importantly,
confusing.
Refs #15116
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#15756
compaction_read_monitor_generator is an existing mechanism
for monitoring progress of sstables reading during compaction.
In this change information gathered by compaction_read_monitor_generator
is utilized by task manager compaction tasks of the lowest level,
i.e. compaction executors, to calculate task progress.
compaction_read_monitor_generator has a flag, which decides whether
monitored changes will be registered by compaction_backlog_tracker.
This allows us to pass the generator to all compaction readers without
impacting the backlog.
Task executors have access to compaction_read_monitor_generator_wrapper,
which protects the internals of compaction_read_monitor_generator
and provides only the necessary functionality.
Closesscylladb/scylladb#14878
* github.com:scylladb/scylladb:
compaction: add get_progress method to compaction_task_impl
compaction: find total compaction size
compaction: sstables: monitor validation scrub with compaction_read_generator
compaction: keep compaction_progress_monitor in compaction_task_executor
compaction: use read monitor generator for all compactions
compaction: add compaction_progress_monitor
compaction: add flag to compaction_read_monitor_generator
before this change, they reads
> Was local deletion time capped at ...
and
> Was partition tombstone deletion time capped at ...
the "Was" part is confusing. and the first description is not
accurate enough. so let's improve them a little bit.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#15108
Validation scrub bypasses the usual compaction machinery, though it
still needs to be tracked with compaction_progress_monitor so that
we could reach its progress from compaction task executor.
Track sstable scrub in validate mode with read monitors.
When loading unshared remote sstable, sstable_directory needs to make a
descriptor out of a real sstable. For that it parses the sstable's Data
component path which is pretty weird. It's simpler to make descriptor
out of the ssatble itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now no code uses those strings. Even worse -- there are some places that
need to provide some strings but don't have real values at hand, so just
hard-code the empty strings there (because they are really not used).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This just moves the std::make_tuple() call into internal static path
parsing helper to make the next patch smaller and nicer.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two path parsers. One of them accepts keyspace and table names
and the other one doesn't. The latter is then supposed to parse the
ks.cf pair from path and put it on the descriptor. This patch makes this
method return ks.cf so that later it will be possible to remove these
strings from the desctiptor itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method really parses provided path, so the existing name is pretty
confusing. It's extra confusing in the table::get_snapshot_details()
where it's just called and the return value is simply ignored.
Named "parse_..." makes it clear what the method is for.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
before this change, when running into a zero chunk_len, scylla
crashes with `assert(chunk_size != 0)`. but we can do better than
printing a backtrace like:
```
scylla: sstables/compress.cc:158: void
sstables::compression::segmented_offsets::init(uint32_t): Assertion `chunk_size != 0' failed.
```
so, in this change, a `malformed_sstable_exception` is throw in place
of an `assert()`, which is supposed to verify the programming
invariants, not for identifying corrupted data file.
Fixes#15265
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#15264
Index caching was disabled by default because it caused performance regressions
for some small-partition workloads. See https://github.com/scylladb/scylladb/issues/11202.
However, it also means that there are workloads which could benefit from the
index cache, but (by default) don't.
As a compromise, we can set a default limit on the memory usage of index cache,
which should be small enough to avoid catastrophic regressions in
small-partition workloads, but big enough to accommodate workloads where
index cache is obviously beneficial.
This series adds such a configurable limit, sets it to to 0.2 of total cache memory by default,
and re-enables index caching by default.
Fixes#15118Closes#14994
* github.com:scylladb/scylladb:
test: boost/cache_algorithm_test: add cache_algorithm_test
sstables: partition_index_cache: deglobalize stats
utils: cached_file: deglobalize cached_file metrics
db: config: enable index caching by default
config: add index_cache_fraction
utils: lru: add move semantics to list links