It's only local storage type that needs directores touch/remove, S3
storage initialization is for now a no-op, maybe some day soon it will
appear.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's the manager that knows about storages and it should init/destroy
it. Also the "upload" and "staging" paths are about to be hidden in
sstables/ code, this code move also facilitates that.
The indentation in storage.cc is deliberately broken to make next patch
look nicer (spoiler: it won't have to shift those lines right).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The sstable currently can move between normal, staging and quarantine state runtime. For S3-backed sstables the state change means maintaining the state itself in the ownership table and updating it accordingly.
There's also the upload facility that's implemented as state change too, but this PR doesn't support this part.
fixes: #13017Closesscylladb/scylladb#15829
* github.com:scylladb/scylladb:
test: Make test_sstables_excluding_staging_correctness run over s3 too
sstables,s3: Support state change (without generation change)
system_keyspace: Add state field to system.sstables
sstable_directory: Tune up sstables entries processing comment
system_keyspace: Tune up status change trace message
sstables: Add state string to state enum class convert
Now when the system.sstables has the state field, it can be changed
(UPDATEd). However, when changing the state AND generation, this still
won't work, because generation is the clustering key of the table in
question and cannot be just changed. This, nonetheless, is OK, as
generation changes with state only when moving an sstable from upload
dir into normal/staging and this is separate issue for S3 (#13018). For
now changing state only is OK.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The state is one of <empty>(normal)/staging/quarantine. Currently when
sstable is moved to non-normal state the s3 backend state_change() call
throws thus such sstables do not appear. Next patches are going to
change that and the new field in the system.sstables is needed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In fact, this FIXME had been fixed by 2c9ec6bc (sstable_directory:
Garbage collect S3 sstables on reboot) and is no longer valid. However,
it's still good to know if GC failed or misbehaved, so replace the
comment with a warning.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's the backward converter already out there. Next code will need to
convert string representation of the state back to the internal type.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
since we use the sstable.generation() for the remote prefix of
the key of the object for storing the sstable component, there is
no need to set remote_prefix beforehand.
since `s3_storage::ensure_remote_prefix()` and
`system_kesypace::sstables_registry_lookup_entry()` are not used
anymore, they are removed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we create a new UUID for a new sstable managed
by the s3_storage, and we use the string representation of UUID
defined by RFC4122 like "0aa490de-7a85-46e2-8f90-38b8f496d53b" for
naming the objects stored on s3_storage. but this representation is
not what we are using for storing sstables on local filesystem when
the option of "uuid_sstable_identifiers_enabled" is enabled. instead,
we are using a base36-based representation which is shorter.
to be consistent with the naming of the sstables created for local
filesystem, and more importantly, to simplify the interaction between
the local copy of sstables and those stored on object storage, we should
use the same string representation of the sstable identifier.
so, in this change:
1. instead of creating a new UUID, just reuse the generation of the
sstable for the object's key.
2. do not store the uuid in the sstable_registry system table. As
we already have the generation of the sstable for the same purpose.
3. switch the sstable identifier representation from the one defined
by the RFC4122 (implemented by fmt::formatter<utils::UUID>) to the
base36-based one (implemented by
fmt::formatter<sstables::generation_type>)
4. enable the `uuid_sstable_identifers` cluster feature if it is
enabled in the `test_env_config`, so that it the sstable manager
can enable the uuid-based uuid when creating a new uuid for
sstable.
5. throw if the generation of sstable is not UUID-based when
accessing / manipulating an sstable with S3 storage backend. as
the S3 storage backend now relies on this option. as, otherwise
we'd have sstables with key like s3://bucket/number/basename, which
is just unable to serve as a unique id for sstable if the bucket is
shared across multiple tables.
Fixes#14175
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
After "repair: Get rid of the gc_grace_seconds", the sstable's schema (mode,
gc period if applicable, etc) is used to estimate the amount of droppable
data (or determine full expiration = max_deletion_time < gc_before).
It could happen that the user switched from timeout to repair mode, but
sstables will still use the old mode, despite the user asked for a new one.
Another example is when you play with value of grace period, to prevent
data resurrection if repair won't be able to run in a timely manner.
The problem persists until all sstables using old GC settings are recompacted
or node is restarted.
To fix this, we have to feed latest schema into sstable procedures used
for expiration purposes.
Fixes#15643.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#15746
`deletion_time` is a part of the `partition_header`, which is in turn
a part of `partition`. and `data_file` is a sequence of `partition`.
`data_file` represents *-Data.db component of an SSTable.
see docs/architecture/sstable3/sstables-3-data-file-format.rst.
we always parse the data component via `flat_mutation_reader_v2`, which is in turn
implemented with mx/reader.cc or kl/reader.cc depending on
the version of SSTable to be read.
in other words, we decode `deletion_time` in mx/reader.cc or
kl/reader.cc, not in sstable.cc. so let's drop the overload
parse() for deletion_time. it's not necessary and more importantly,
confusing.
Refs #15116
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#15756
Currently distributed_loader starts sharded<sstable_directory> with four sharded parameters. That's quite bulky and can be made much shorter.
Closesscylladb/scylladb#15653
* github.com:scylladb/scylladb:
distributed_loader: Remove explicit sharded<erms>
distributed_loader: Brush up start_subdir()
sstable_directory: Add enlightened construction
table: Add global_table_ptr::as_sharded_parameter()
compaction_read_monitor_generator is an existing mechanism
for monitoring progress of sstables reading during compaction.
In this change information gathered by compaction_read_monitor_generator
is utilized by task manager compaction tasks of the lowest level,
i.e. compaction executors, to calculate task progress.
compaction_read_monitor_generator has a flag, which decides whether
monitored changes will be registered by compaction_backlog_tracker.
This allows us to pass the generator to all compaction readers without
impacting the backlog.
Task executors have access to compaction_read_monitor_generator_wrapper,
which protects the internals of compaction_read_monitor_generator
and provides only the necessary functionality.
Closesscylladb/scylladb#14878
* github.com:scylladb/scylladb:
compaction: add get_progress method to compaction_task_impl
compaction: find total compaction size
compaction: sstables: monitor validation scrub with compaction_read_generator
compaction: keep compaction_progress_monitor in compaction_task_executor
compaction: use read monitor generator for all compactions
compaction: add compaction_progress_monitor
compaction: add flag to compaction_read_monitor_generator
When populating system keyspace the sstable_directory forgets to create upload/ subdir in the tables' datadir because of the way it's invoked from distributed loader. For non-system keyspaces directories are created in table::init_storage() which is self-contained and just creates the whole layout regardless of what.
This PR makes system keyspace's tables use table::init_storage() as well so that the datadir layout is the same for all on-disk tables.
Test included.
fixes: #15708closes: scylladb/scylla-manager#3603Closesscylladb/scylladb#15723
* github.com:scylladb/scylladb:
test: Add test for datadir/ layout
sstable_directory: Indentation fix after previous patch
db,sstables: Move storage init for system keyspace to table creation
before this change, they reads
> Was local deletion time capped at ...
and
> Was partition tombstone deletion time capped at ...
the "Was" part is confusing. and the first description is not
accurate enough. so let's improve them a little bit.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#15108
User and system keyspaces are created and populated slightly
differently.
System keyspace is created via system_keyspace::make() which eventually
calls calls add_column_family(). Then it's populated via
init_system_keyspace() which calls sstable_directory::prepare() which,
in turn, optionally creates directories in datadir/ or checks the
directory permissions if it exists
User keyspaces are created with the help of
add_column_family_and_make_directory() call which calls the
add_column_family() mentioned above _and_ calls table::init_storage() to
create directories. When it's populated with init_non_system_keyspaces()
it also calls sstable_directory::prepare() which notices that the
directory exists and then checks the permissions.
As a result, sstable_directory::prepare() initializes storage for system
keyspace only and there's a BUG (#15708) that the upload/ subdir is not
created.
This patch makes the directories creation for _all_ keyspaces with the
table::init_storage(). The change only touches system keyspace by moving
the creation of directories from sstable_directory::prepare() into
system_keyspace::make().
Indentation is deliberately left broken.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Validation scrub bypasses the usual compaction machinery, though it
still needs to be tracked with compaction_progress_monitor so that
we could reach its progress from compaction task executor.
Track sstable scrub in validate mode with read monitors.
The existing constructor is pretty heavyweight for the distributed
loader to use -- it needs to pass it 4 sharded parameters which looks
pretty bulky in the text editor. However, 5 constructor arguments are
obtained directly from the table, so the dist. loader code with global
table pointer at hand can pass _it_ as sharded parameter and let the
sstable directory extract what it needs.
Sad news is that sstable_directory cannot be switched to just use table
reference. Tools code doesn't have table at hand, but needs the
facilities sstable_directory provides
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The descriptor in question is used to parse sstable's file path and return back the result. Parser, among "relevant" info, also parses sstable directory and keyspace+table names. However, there are no code (almost) that needs those strings. And the need to construct descriptor with those makes some places obscurely use empty strings.
The PR removes sstable's directory, keyspace and table names from descriptor and, while at it, relaxes the sstable directory code that makes descriptor out of a real sstable object by (!) parsing its Data file path back.
Closesscylladb/scylladb#15617
* github.com:scylladb/scylladb:
sstables: Make descriptor from sstable without parsing
sstables: Do not keep directory, keyspace and table names on descriptor
sstables: Make tuple inside helper parser method
sstables: Do not use ks.cf pair from descriptor
sstables: Return tuple from parse_path() without ks.cf hints
sstables: Rename make_descriptor() to parse_path()
When loading unshared remote sstable, sstable_directory needs to make a
descriptor out of a real sstable. For that it parses the sstable's Data
component path which is pretty weird. It's simpler to make descriptor
out of the ssatble itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now no code uses those strings. Even worse -- there are some places that
need to provide some strings but don't have real values at hand, so just
hard-code the empty strings there (because they are really not used).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This just moves the std::make_tuple() call into internal static path
parsing helper to make the next patch smaller and nicer.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two path parsers. One of them accepts keyspace and table names
and the other one doesn't. The latter is then supposed to parse the
ks.cf pair from path and put it on the descriptor. This patch makes this
method return ks.cf so that later it will be possible to remove these
strings from the desctiptor itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method really parses provided path, so the existing name is pretty
confusing. It's extra confusing in the table::get_snapshot_details()
where it's just called and the return value is simply ignored.
Named "parse_..." makes it clear what the method is for.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Define sstable_set_impl::selector_and_schema_t type as a tuple that
contains both a newly created selector and a schema that the selector
is using.
This will allow removal of _schema field from sstable_set class as
the only place it was used was make_incremental_selector.
Signed-off-by: Piotr Jastrzebski <haaawk@gmail.com>
Before this commit the primary key was hashed for bloom filter check
for each sstable.
This commit makes the key be hashed once per sstable set and reused
for bloom filter lookups in all sstables in the set.
I tested this change using perf_simple_query with the following modifications:
1. Create more than one sstable to have sstable set of more than one elements
2. Try to prevent compactions (I wasn't 100% successful)
3. Use a key that's not present to avoid reading from disk
```
diff --git a/test/perf/perf_simple_query.cc b/test/perf/perf_simple_query.cc
index 26dbf1e99..6bd460df2 100644
--- a/test/perf/perf_simple_query.cc
+++ b/test/perf/perf_simple_query.cc
@@ -105,6 +105,8 @@ std::ostream& operator<<(std::ostream& os, const test_config& cfg) {
static void create_partitions(cql_test_env& env, test_config& cfg) {
std::cout << "Creating " << cfg.partitions << " partitions..." << std::endl;
+ // Create 10 sstables each with all the data
+ for (unsigned count = 0; count < 10; ++count) {
for (unsigned sequence = 0; sequence < cfg.partitions; ++sequence) {
if (cfg.counters) {
execute_counter_update_for_key(env, make_key(sequence));
@@ -117,6 +119,7 @@ static void create_partitions(cql_test_env& env, test_config& cfg) {
std::cout << "Flushing partitions..." << std::endl;
env.db().invoke_on_all(&replica::database::flush_all_memtables).get();
}
+ }
}
static int64_t make_random_seq(test_config& cfg) {
@@ -137,8 +140,18 @@ static std::vector<perf_result> test_read(cql_test_env& env, test_config& cfg) {
query += " using timeout " + cfg.timeout;
}
auto id = env.prepare(query).get0();
- return time_parallel([&env, &cfg, id] {
- bytes key = make_random_key(cfg);
+ // Always use the same key that is not present
+ // to make sure we don't read from disk and make
+ // the benchmark CPU bounded.
+ int64_t key_value = 6;
+ bytes key(bytes::initialized_later(), 5*sizeof(key_value));
+ auto i = key.begin();
+ write<uint64_t>(i, key_value);
+ write<uint64_t>(i, key_value);
+ write<uint64_t>(i, key_value);
+ write<uint64_t>(i, key_value);
+ write<uint64_t>(i, key_value);
+ return time_parallel([&env, id, key] {
return env.execute_prepared(id, {{cql3::raw_value::make_value(std::move(key))}}).discard_result();
}, cfg.concurrency, cfg.duration_in_seconds, cfg.operations_per_shard, cfg.stop_on_error);
}
@@ -423,6 +436,10 @@ static std::vector<perf_result> do_cql_test(cql_test_env& env, test_config& cfg)
.with_column("C2", bytes_type)
.with_column("C3", bytes_type)
.with_column("C4", bytes_type)
+ // Try to prevent compaction
+ // to keep the number of sstables high
+ .set_compaction_enabled(false)
+ .set_min_compaction_threshold(2000000000)
.build();
}).get();
@@ -539,6 +556,11 @@ int scylla_simple_query_main(int argc, char** argv) {
const auto enable_cache = app.configuration()["enable-cache"].as<bool>();
std::cout << "enable-cache=" << enable_cache << '\n';
db_cfg->enable_cache(enable_cache);
+ // Try to prevent compaction
+ // to keep the number of sstables high
+ db_cfg->concurrent_compactors(1);
+ db_cfg->compaction_enforce_min_threshold(true);
+ db_cfg->compaction_throughput_mb_per_sec(1);
cql_test_config cfg(db_cfg);
return do_with_cql_env_thread([&app] (auto&& env) {
```
The following command showed 2-3% improvement on my machine but this
depends on the lenght of the key and the number of sstables in the set.
```
./build/release/scylla perf-simple-query --bypass-cache --flush -c 1
--random-seed=2068087418 --enable-cache false
```
Signed-off-by: Piotr Jastrzebski <haaawk@gmail.com>
Closesscylladb/scylladb#15538
SSTable runs work hard to keep the disjointness invariant, therefore they're
expensive to build from scratch.
For every insertion, it keeps the elements sorted by their first key in
order to reject insertion of element that would introduce overlapping.
Additionally, a sstable run can grow to dozens of elements (or hundreds)
therefore, we can also make interaction with compaction strategies more
efficient by not copying them when building a list of candidates in compaction
manager. And less fragile by filtering out any sstable runs that are not
completely eligible for compaction.
Previously, ICS had to give up on using runs managed by sstable set due to
fragility of the interface (meaning runs are being built from scratch
on every call to the strategy, which is very inefficient, but that had to
be done for correctness), but now we can restore that.
Closesscylladb/scylladb#15440
* github.com:scylladb/scylladb:
compaction: Switch to strategy_control::candidates() for regular compaction
tests: Prepare sstable_compaction_test for change in compaction_strategy interface
compaction: Allow strategy to retrieve candidates either as sstables or runs
compaction: Make get_candidates() work with frozen_sstable_run too
sstables: add sstable_run::run_identifier()
sstables: tag sstable_run::insert() with nodiscard
sstables: Make all_sstable_runs() more efficient by exposing frozen shared runs
sstables: Simplify sstable_set interface to retrieve runs
sstable_run may reject insertion of a sstable if it's going
to break the disjoint invariant of the run, but it's important
that the caller is aware of it, so it can act on it like
generating a new run id for the sstable so it can be inserted
in another run. the tag is important to avoid unknown
problems in this area.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Users of all_sstable_runs() don't want to mutate the runs, but rather
work with their content. So let's avoid copy and make the intention
explicit with the new frozen_sstable_run used as return type
for the interface.
This will guarantee that ICS will be able to fetch uncompacting
runs efficiently.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This interface selects all runs that store at least one of the
sstables in the vector.
But that's very fragile, to the point that even ICS had to
stop using it. A better interface is to return all runs
managed by the set and allow compaction manager to do its
filtering.
We want to use it in ICS to avoid the overhead of rebuilding
sstable runs which may be expensive as sorting is performed
to guarantee the disjoint invariant.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When FS lister gets prepared it
- checks if the directory exists
- creates if it it doesn't or bais out if it's quarantine one
- goes and checks the directory's owner and mode
The last step is excessive if the directory didn't exist on entry and
was created.
Indentation is deliberately left broken.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is continuation of the previous patch -- when populating a table,
creating directories should be (optionally) performed by the lister
backend, not by the generic loader.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The loader code still "knows" that tables' sstables live in directories
on datadir filesystem, but that's not always so. So whether or not the
directory with sstables exists should be checked by sstable directory's
component lister, not the loader.
After this change potentially missing quarantine directory will be
processed with the sstable directory with empty result, but that's OK,
empty directories should be already handled correctly, so even if the
directory lister doesn't produce any sstables because it found no files,
or because it just skipped scanning doesn't make any difference.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Current sstable_directory::prepare() code checks the sstable directory
existance which only makes sense for filesystem-backed sstables.
S3-backed don't (well -- won't) have any directories in datadir, so the
check should be moved into component lister.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The S3 uploading sink needs to collect buffers internally before sending them out, because the minimal upload-able part size is 5Mb. When the necessary amount of bytes is accumulated, the part uploading fibers starts in the background. On flush the sink waits for all the fibers to complete and handles failure of any.
Uploading parallelism is nowadays limited by the means of the http client max-connections parameter. However, when a part uploading fibers waits for it connection it keeps the 5Mb+ buffers on the request's body, so even though the number of uploading parts is limited, the number of _waiting_ parts is effectively not.
This PR adds a shard-wide limiter on the number of background buffers S3 clients (and theirs http clients) may use.
Closesscylladb/scylladb#15497
* github.com:scylladb/scylladb:
s3::client: Track memory in client uploads
code: Configure s3 clients' memory usage
s3::client: Construct client with shared semaphore
sstables::storage_manager: Introduce config
Sstables in transitional states are marked with the respective 'status' in the registry. Currently there are two of such -- 'creating' and 'removing'. And the 'sealed' status for sstables in use.
On boot the distributed loader tries to garbage collect the dangling sstables. For filesystem storage it's done with the help of temorary sstables' dirs and pending deletion logs. For s3-backed sstables, the garbage collection means fetching all non-sealed entries and removing the corresponding objects from the storage.
Test included (last patch)
fixes#13024Closesscylladb/scylladb#15318
* github.com:scylladb/scylladb:
test: Extend object_store test to validate GC works
sstable_directory: Garbage collect S3 sstables on reboot
sstable_directory: Pass storage to garbage_collect()
sstable_directory: Create storage instance too
This sets the real limits on the memory semaphore.
- scylla sets it to 1% of total memory, 10Mb min, 100Mb max
- tests set it to 16Mb
- perf test sets it to all available memory
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The semaphore will be used to cap memory consumption by client. This
patch makes sure the reference to a semaphore exists as an argument to
client's constructor, not more than that.
In scylla binary, the semaphore sits on storage_manager. In tests the
semaphore is some local object. For now the semaphore is unused and is
initialized locked as this patch just pushes the needed argument all the
way around, next patches will make use of it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Just an empty config that's fed to storage_manager when constructed as a
preparation for further heavier patching
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some time ago populating of tables from sstables was reworked to use sstable states instead of full paths (#12707). Since then few places in the populator was left that still operate on the state-based subdirectory name. This PR collects most of those dangling ends
refs: #13020Closesscylladb/scylladb#15421
* github.com:scylladb/scylladb:
distributed_loader: Print sstable state explicitly
distributed_loader: Move check for the missing dir upper
distributed_loader: Use state as _sstable_directories key
column_stats::update_local_deletion_time() is not used anywhere,
what is being used is
`column_stats::update_local_deletion_time_and_tombstone_histogram(time_point)`.
while `update_local_deletion_time_and_tombstone_histogram(int32_t)`
is only used internally by a single caller.
neither is `column_stats::update(const deletion_time&)` used.
so let's drop them. and merge
`update_local_deletion_time_and_tombstone_histogram(int32_t)`
into its caller.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#15189
When populating from a particular directory, populator code converts
state to subdir name, then prints the path. The conversion is pretty
much artificial, it's better to provide printer for state and print
state explicitly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In this refactoring commit we remove the db::config::host_id
field, as it's hacky and duplicates token_metadata::get_my_id.
Some tests want specific host_id, we add it to cql_test_config
and use in cql_test_env.
We can't pass host_id to sstables_manager by value since it's
initialized in database constructor and host_id is not loaded yet.
We also prefer not to make a dependency on shared_token_metadata
since in this case we would have to create artificial
shared_token_metadata in many tools and tests where sstables_manager
is used. So we pass a function that returns host_id to
sstables_manager constructor.
When booting there can be dangling entries in sstables registry as well
as objects on the storage itself. This patch makes the S3 lister list
those entries and then kick the s3_storage to remove the corresponding
objects. At the end the dangling entries are removed from the registry
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>