before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for
`cached_promoted_index::promoted_index_block`, and drop its
operator<<.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#17415
this change is a follow-up of 637dd730. the goal is to use
std::filesystem::path for manipulating paths, and to avoid the
converting between sstring and fs::path back and forth.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#17257
managed_bytes is implemented as chain of blob_storage objects.
Each blob_storage contains 24 bytes of metadata. But in the most
common case -- when there is only a single element in the chain --
16 bytes of this metadata is trivial/unused.
This is regrettable waste because managed_bytes is used for every
database cell in the memtables and cache. It means that every value
of size >= 7 bytes (smaller ones fit in the inline storage of
managed_bytes) receives 16 bytes of useless overhead.
To correct that, this series adds to managed_bytes an alternative storage
layout -- used for buffers small enough to fit in one fragment -- which only
stores the necessary minimum of metadata. (That is: a pointer to the parent,
to facilitate moving the storage during memory defragmentation).
This saves 16 bytes on every cell greater than 15 bytes. Which includes e.g.
every live cell with value bigger than 6 bytes, which likely applies to most cells.
Before:
```
$ build/release/scylla perf-simple-query --duration 10
median 218692.88 tps ( 61.1 allocs/op, 13.1 tasks/op, 41762 insns/op, 0 errors)
$ build/release/scylla perf-simple-query --duration 10 --write
median 173511.46 tps ( 58.3 allocs/op, 13.2 tasks/op, 53258 insns/op, 0 errors)
$ build/release/test/perf/mutation_footprint_test -c1 --row-count=20 --partition-count=100 --data-size=8 --column-count=16
- in cache: 2580222
- in memtable: 2549852
```
After:
```
$ build/release/scylla perf-simple-query --duration 10
median 218780.89 tps ( 61.1 allocs/op, 13.1 tasks/op, 41763 insns/op, 0 errors)
$ build/release/scylla perf-simple-query --duration 10 --write
median 173105.78 tps ( 58.3 allocs/op, 13.2 tasks/op, 52913 insns/op, 0 errors)
$ build/release/test/perf/mutation_footprint_test -c1 --row-count=20 --partition-count=100 --data-size=8 --column-count=16
- in cache: 2068238
- in memtable: 2037696
```
Closesscylladb/scylladb#14263
* github.com:scylladb/scylladb:
utils: managed_bytes: optimize memory usage for small buffers
utils: managed_bytes: rewrite managed_bytes methods in terms of managed_bytes_view
managed_bytes is implemented as chain of blob_storage objects.
Each blob_storage contains 24 bytes of metadata. But in the most
common case -- when there is only a single element in the chain --
16 bytes of this metadata is trivial/unused.
This is regrettable waste because managed_bytes is used for every
database cell in the memtables and cache. It means that every value
of size >= 7 bytes (smaller ones fit in the inline storage of
managed_bytes) receives 16 bytes of useless overhead.
To correct that, this patch adds to managed_bytes an alternative storage
layout -- used for buffers small enough to fit in one contiguous
fragment -- which only stores the necessary minimum of metadata.
(That is: a pointer to the parent, to facilitate moving the storage during
memory defragmentation).
both of its callers are passing parent_path() and filename() to
it. so let the callee to do this. simpler this way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#17225
this change is a follow-up of 637dd730. the goal is to use
std::filesystem::path for manipulating paths, and to avoid the
converting between sstring and fs::path back and forth.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#17214
state_to_dir(sstable_state) translate the enum to the corresponding
directory component. and it returns a `seastar::sstring`. not all
the callers of this function expect a full-blown sstring instance,
on the contrary, quite a few of them just want a string-alike object
which represents the directory component, so they can use it, for
instance to compose a path, or just format the given `state` enum.
so to avoid the overhead of creating/destroying the `seastar::sstring`
instance, let's switch to `std::string_view`. with this change, we
will be able to implement the fmt::formatter for `sstable_state`
without the help of the formatter of sstring.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#17213
despite that "welp" is more emotional expressive, it is considered
a misspelling of "whelp" by codespell. that's why this comment stands
out. but from a non-native speaker's point of view, probably we can
use more descriptive words to explain what "welp" is for in plain
English.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#17183
get0() dates back from the days where Seastar futures carried tuples, and
get0() was a way to get the first (and usually only) element. Now
it's a distraction, and Seastar is likely to deprecate and remove it.
Replace with seastar::future::get(), which does the same thing.
instead of capturing the return value of `get0()` with a reference
type, use a plain type. as `get0()` returns a plain `T` while `get0()`
returns a `T&&`, to avoid the value referenced by `T&&` gets destroyed
after the expression, let's use a plain `auto` instead of `auto&&`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
they are directories, and we are concating strings to build the paths
to the sstable components. so it would be more elegant to use fs::path
for manipulating paths.
this change was inspired by the discussion on passing the relative
path to sstable to `scylla sstables`, where we use the
`path::parent_path()` as the dir of sstable, and then concatenate
it with the filename component. but if the `parent_path()` method
returns an empty string, we end up with a path like
"/me-42-big-TOC.txt", which is not reachable. what we should be
reading is "me-42-big-TOC.txt". so, we should better off either
using `fs::path` or enforcing the absolute path.
since we already using "/" as separator, and concatenating strings,
this is an opportunity to switch over to `fs::path` to address
the problem and to avoid the string concatenating.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16982
Currently, we pass an effective_replication_map_ptr to sstable_writer,
so that we can get a stable dht::sharder for writing the sharding metadata.
This is needed because with tablets, the sharder can change dynamically.
However, this is both bad and unnecessary:
- bad: holding on to an effective_replication_map_ptr is a barrier
for topology operations, preventing tablet migrations (etc) while
an sstable is being written
- unnecessary: tablets don't require sharding metadata at all, since
two tablets cannot overlap (unlike two sstables from different shards in
the same node). So the first/last key is sufficient to determine the
shard/tablet ownership.
Given that, just pass the sharder for vnode sstables, and don't generate
sharding metadata for tablet sstables.
before this change, we always reference the return value of
`make_reader()`, and the return value's type `flat_mutation_reader_v2`
is movable, so we can just pass it by moving away from it.
in this change, instead of using a lambda, let's just have the
return value of it. simpler this way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16835
The sstables replay_position in stats_metadata is
valid only on the originating node and shard.
Therefore, validate the originating host and shard
before using it in compaction or table truncate.
Fixes#10080
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#16550
before this change, we defaults to use "mc" sstable format, and
switch to "md" if the cluster agrees on using it, and to
"me" if the cluster agrees on using this. the cluster feature
is used to get the consensus across the members in the cluster,
if any of the existing nodes in the cluster has its `sstable_format`
configured to, for instance, "mc", then the cluster is stuck with
"mc".
but we disabled "mc" sstable format back in 3d345609, the first LTS
release including that change was scylla v5.2.0. which means, the
cluster of the last major version Scylla should be using "md" or
"me". per our document on upgrade, see docs/upgrade/index.rst,
> You should perform the upgrades consecutively - to each
> successive X.Y version, without skipping any major or minor version.
>
> Before you upgrade to the next version, the whole cluster (each
> node) must be upgraded to the previous version.
we can assume that, a 6.x node will only join a cluster
with 5.x or 6.x nodes. (joining a 7.x cluster should work, but
this is not relevant to this change). in both cases, since
5.x and up scylla can only configured with "md" `sstable_format`,
there is no need to switch from "mc" to "md" anymore. so we can
ditch the code supporting it.
Refs #16551
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Store schema_ptr in reader permit instead of storing a const pointer to
schema to ensure that the schema doesn't get changed elsewhere when the
permit is holding on to it. Also update the constructors and all the
relevant callers to pass down schema_ptr instead of a raw pointer.
Fixes#16180
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#16658
Consider this:
1) file streaming takes storage snapshot = list of sstables
2) concurrent compaction unlink some of those sstables from file system
3) file streaming tries to send unlinked sstables, but files other
than data and index cannot be read as only data and index have file
descriptors opened
To fix it, the snapshot now returns a set of files, one per sstable
component, for each sstable.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#16476
The interface is fragile because the user may incorrectly use the
wrong "gc before". Given that sstable knows how to properly calculate
"gc before", let's do it in estimate__d__t__r(), leaving no room
for mistakes.
sstable_run's variant was also changed to conform to new interface,
allowing ICS to properly estimate droppable ratio, using GC before
that is calculated using each sstable's range. That's important for
upcoming tablets, as we want to query only the range that belongs
to a particular tablet in the repair history table.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#15931
To be used in the next patch to control whether the semaphore registers
and exports metrics or not. We want to move metric registration to the
semaphore but we don't want all semaphores to export metrics. The
decision on whether a semaphore should or shouldn't export metrics
should be made on a case-by-case basis so this new parameter has no
default value (except for the for_tests constructor).
Soon, the reader_concurrency_semaphore will require a unique
and meaningful name in order to label its metrics. To prepare
for that, name sstable_manager instances. This will be used
to generate a name for sstable_manager's reader_concurrency_semaphore.
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, in order to enable the code in the header to
access the formatter without being moved down after the full specialization's
definition, we
* move the enum definition out of the class and before the
class,
* rename the enum's name from state to index_consume_entry_context_state
* define a formatter for index_consume_entry_context_state
* remove its operator<<().
as fmt v10 is able to use `format_as()` as a fallback, the formatter
full specialization is guarded with `#if FMT_VERSION < 10'00'00`. we
will remove it after we start build with fmt v10.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16204
Right now the atomic deletion is called on manager, but it gets the
actual deletion function from storage and off-loads the deletion to it.
This patch makes the manager fully responsible for the delition by
implemeting the sequence of
auto ctx = storage.prepare()
for sst in sstables:
sst.unlink()
storage.complate(ctx)
Storage implementations provide the prepare/complete methods. The
filesystem storage does it via deletion log and the s3 storage is still
not atomic :(
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The atomic deletion is going to look like
auto ctx = storage.prepare()
for sst in sstables:
sst.unlink()
storage.complate(ctx)
and this patch prepares the class storage for that by extending it with
prepare and complete methods. The opaque ctx object is also here
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The helper in question complicates the logic of sstable_directory::process() by making garbage collection differently for sstables deleted "atomically" and deleted "one-by-one". Also, the code that deletes sstables one-by-one and uses remove_by_toc_name() renders excessive TOC file reading, because there's sstable object at hand and it had all_components() ready for use.
Surprisingly, there was no test for the deletion-log functionality. This PR adds one. The test passes before the g.c. and regular unlink fix, and (of course) continues passing after it.
Closesscylladb/scylladb#16240
* github.com:scylladb/scylladb:
sstables: Drop remove_by_name()
sstables/fs_storage: Wipe by recognized+unrecognized components
sstable_directory: Enlight deletion log replay
sstables: Split remove_by_toc_name()
test: Add test case to validate deletion log work
sstable_directory: Close dir on exception
sstable_directory: Fix indentation after previous patch
sstable_directory: Coroutinize delete_with_pending_deletion_log()
test: Sstable on_delete() is not necessarily in a thread
sstable_directory: Split delete_with_pending_deletion_log()
Fixes some typos as found by codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.
Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Currently wiping fs-backed sstable happens via reading and parsing its
TOC file back. Then the three-step process goes:
- move TOC -> TOC.tmp
- remove components (obtained from TOC.tmp)
- remove TOC.tmp
However, wiping sstable happens in one of two cases -- the sstable was
loaded from the TOC file _or_ sstable had evaluated the needed
components and generated TOC file. With that, the 2nd step can be made
without reading the TOC file, just by looking at all components sitting
on the sstable
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Garbage collection of sstables is scattered between two strages -- g.c.
per-se and the regular processing.
The former stage collects deletion logs and for each log found goes
ahead and deletes the full sstable with the standard sequence:
- move TOC -> TOC.tmp
- remove components
- remove TOC.tmp
The latter stage picks up partially unlinked sstables that didn't go via
atomic deletion with the log. This comes as
- collect all components
- keep TOC's and TOC.tmp's in separate lists
- attach other components to TOC/TOC.tmp by generation value
- for all TOC.tmp's get all attached components and remove them
- continue loading TOC's with attached components
Said that, replaying deletion log can be as light as just the first step
out of the above sequence -- just move TOC to TOC.tmp. After that the
regular processing would pick the remaining components and clean them
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The helper consists of three phases:
- move TOC -> TOC.tmp
- remove components listed in TOC
- remove TOC.tmp
The first step is needed separately by the next patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When committing the deletion log creation its containing directory is
sync-ed via opened file. This place is not exception safe and directory
can be left unclosed
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The helper consists of three parts -- prepare the deletion log, unlink
sstables and drop the deletion log. For testing the first part is needed
as a separate step, so here's this split.
It renders two nested async contexts, but it will change soon.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
under most circumstances, we don't care the ordering of the sstable
identifiers, as they are just identifiers. so, as long as they can be
compared, we are good. but we have tests with expect that the sstables
can be ordered by the time they are created. for instance,
sstable_run_based_compaction_test has this expectaion.
before this change, we compare two UUID-based generations by its
(MSB, LSB) lexicographically. but UUID v1 put the lower bits of
the timestamp at the higher bits of MSB, so the ordering of the
"time" in timeuuid is not preserved when comparing the UUID-based
generations. this breaks the test of sstable_run_based_compaction_test,
which feeds the sstables to be compacted in a set, and the set is
ordered with the generation of the sstables.
after this change, we consider the UUID-based generation as
a timeuuid when comparing them.
Fixes#16215
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16238
There are several places where TOC file is parsed into a vector of
components -- sstable::read_toc(), remove_by_toc_name() and
remove_by_registry_entry(). All three deserve some generalization.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Nowadays if memtable gets flushed into misconfigured S3 storage, the flush fails and aborts the whole scylla process. That's not very elegant. First, because upon restart garbage collecting non-sealed sstables would fail again. Second, because re-configuring an endpoint can be done runtime, scylla re-reads this config upon HUP signal.
Flushing memtable restarts when seeing ENOSPC/EDQUOT errors from on-disk sstables. This PR extends this to handle misconfigured S3 endpoints as well.
fixes: #13745Closesscylladb/scylladb#15635
* github.com:scylladb/scylladb:
test: Add object_store test to validate config reloading works
test: Add config update facility to test cluster
test: Make S3_Server export config file as pathlib.Path
config: Make object storage config updateable_value_source
memtable: Extend list of checking codes
sstables/storage/s3: Fix missing TOC status check
s3/client: Map http exceptions into storage_io_error
exceptions: Extend storage_io_error construction options
When sealing an sstable on local storage the storage driver performs
several flushes on a file that is directory open via checked-file.
Flush calls are wrapped with sstable_write_io_check, but that's
excessive, the checked file will wrap flushes with io-checks on its own
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#16173
before we support incremental repair, these is no point have the
code path setting / getting it. and even worse, it incurs confusion.
so, in this change, we
* just set the field to 0,
* drop the corresponding field in metadata_collector, as we never
update it.
* add a comment to explain why this variable is initialized to 0
Fixes#16098
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16169
When TOC file is missing while garbage collecting the S3 server would
resolve with storage_io_error(ENOENT) nowadays
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When a keyspace is created it initiaizes the storage for it and
initialization of S3 storage is the good place to check if the endpoint
for the storage is configured at all.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the cql statement .validate() callback is responsible for
checking if the non-local storage options are allowed with the
respective feature. Next patch will need to extend this check to also
validate the details of the provided storage options, but doing it at
cql level doesn't seem correct -- it's "too far" from query processor
down to sstables manager.
Good news is that there's a lower-level validation of the new keyspace,
namely the database::validate_new_keyspace() call. Move the storage
options validation into sstables manager, while at it, reimplement it
as a visitor to facilitate further extentions and plug the new
validation to the aforementioned database::validate_new_keyspace().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's the get_endpoint_client() peer that only checks the client
presense. To be used by next patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
this change is a cleanup.
to mark a return value without value semantics has no effect. these
`const` specifier useless. so let's drop them.
and, if we compile the tree with `-Wignore-qualifiers`, the compiler
would warn like:
```
/home/kefu/dev/scylladb/schema/schema.hh:245:5: error: 'const' type qualifier on return type has no effect [-Werror,-Wignored-qualifiers]
245 | const index_metadata_kind kind() const;
| ^~~~~
```
so this change also silences the above warnings.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>