scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-25 19:10:42 +00:00

Author	SHA1	Message	Date
Benny Halevy	b08f2ac4c6	sstable: add on_delete observer Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-08-08 08:15:00 +03:00
Benny Halevy	6f037549ac	sstables: delete_with_pending_deletion_log: batch sync_directory When deleting multiple sstables with the same prefix the deletion atomicity is ensured by the pending_delete_log file, so if scylla crashes in the middle, deletions will be replayed on restart. Therefore, we don't have to ensure atomicity of each individual `unlink`. We just need to sync the directory once, before removing the pending_delete_log file. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #14967	2023-08-06 18:52:13 +03:00
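The batching idea in the commit above can be sketched with plain POSIX calls: unlink every file first, then issue a single fsync() on the parent directory. This is only an illustration under stated assumptions; `unlink_all_then_sync` is a hypothetical name, and the real code uses Seastar's asynchronous file API together with the pending_delete_log file, neither of which is modeled here.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <string>
#include <system_error>
#include <unistd.h>
#include <vector>

// Hypothetical helper (not Scylla's API): remove `names` from directory `dir`,
// then make all removals durable with ONE fsync() on the directory,
// instead of syncing after each individual unlink().
void unlink_all_then_sync(const std::string& dir, const std::vector<std::string>& names) {
    for (const auto& name : names) {
        const std::string path = dir + "/" + name;
        // A missing file is tolerated so that replaying the same deletion
        // list after a crash is idempotent.
        if (::unlink(path.c_str()) != 0 && errno != ENOENT) {
            throw std::system_error(errno, std::generic_category(), "unlink " + path);
        }
    }
    const int fd = ::open(dir.c_str(), O_RDONLY | O_DIRECTORY);
    if (fd < 0) {
        throw std::system_error(errno, std::generic_category(), "open " + dir);
    }
    const int rc = ::fsync(fd);   // single directory sync for the whole batch
    const int saved_errno = errno;
    ::close(fd);
    if (rc != 0) {
        throw std::system_error(saved_errno, std::generic_category(), "fsync " + dir);
    }
}
```

In this sketch the caller would remove the deletion log only after the function returns, which mirrors the ordering the commit message describes.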
Raphael S. Carvalho	8829ff02c5	Revert "sstables: Close SSTable reader if index exhaustion is detected in fast forward call" This reverts commit `1fefe597e6`. Can be reverted after auto-closed reader. Refs #12998.	2023-07-12 10:48:28 -03:00
Raphael S. Carvalho	ca8705bd82	sstables: Automatically close exhausted SSTable readers in cleanup Add a reader that will automatically close the underlying sstable reader if fast forward is called with a range past the range spanned by the SSTable. This is only to be used in the context of fast forward calls in cleanup, as combined reader in full scans can proactively close the readers that returned EOS. Regular reads that go through cache enable fast forwarding to position range, therefore won't enable auto-closed reader. Compactions don't enable any kind of forward, and they won't have it enabled either. The overhead is minimal, with cleanup being able to reach the same 38MB/s as before this patch. Refs #12998. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-07-12 10:48:14 -03:00
Avi Kivity	1545ae2d3b	Merge 'Make SSTable cleanup more efficient by fast forwarding to next owned range' from Raphael "Raph" Carvalho Today, SSTable cleanup skips to the next partition, one at a time, when it finds that the current partition is no longer owned by this node. That's very inefficient because when a cluster is growing in size, existing nodes lose multiple sequential tokens in its owned ranges. Another inefficiency comes from fetching index pages spanning all unowned tokens, which was described in https://github.com/scylladb/scylladb/issues/14317. To solve both problems, cleanup will now use multi range reader, to guarantee that it will only process the owned data and as a result skip unowned data. This results in cleanup scanning an owned range and then fast forwarding to the next one, until it's done with them all. This reduces significantly the amount of data in the index caching, as index will only be invoked at each range boundary instead. Without further ado, before: `INFO 2023-07-01 07:10:26,281 [shard 0] compaction - [Cleanup keyspace2.standard1 701af580-17f7-11ee-8b85-a479a1a77573] Cleaned 1 sstables to [./tmp/1/keyspace2/standard1-b490ee20179f11ee9134afb16b3e10fd/me-3g7a_0s8o_06uww24drzrroaodpv-big-Data.db:level=0]. 2GB to 1GB (~50% of original) in 26248ms = 81MB/s. ~9443072 total partitions merged to 4750028.` after: `INFO 2023-07-01 07:07:52,354 [shard 0] compaction - [Cleanup keyspace2.standard1 199dff90-17f7-11ee-b592-b4f5d81717b9] Cleaned 1 sstables to [./tmp/1/keyspace2/standard1-b490ee20179f11ee9134afb16b3e10fd/me-3g7a_0s4m_5hehd2rejj8w15d2nt-big-Data.db:level=0]. 2GB to 1GB (~50% of original) in 17424ms = 123MB/s. ~9443072 total partitions merged to 4750028.` Fixes #12998. Fixes #14317. 
Closes #14469 * github.com:scylladb/scylladb: test: Extend cleanup correctness test to cover more cases compaction: Make SSTable cleanup more efficient by fast forwarding to next owned range sstables: Close SSTable reader if index exhaustion is detected in fast forward call sstables: Simplify sstable reader initialization compaction: Extend make_sstable_reader() interface to work with mutation_source test: Extend sstable partition skipping test to cover fast forward using token	2023-07-11 23:28:15 +03:00
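The skipping strategy described in the merge above can be illustrated on plain sorted integers. The sketch below is not the multi range reader itself: `token`, `token_range` and `owned_tokens` are hypothetical names, ranges are assumed to be sorted, disjoint and inclusive, and a binary search stands in for the index lookup done at each range boundary.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using token = std::int64_t;
using token_range = std::pair<token, token>;  // inclusive [first, second]

// Hypothetical sketch: return the tokens that fall into the owned ranges.
// Instead of testing every token one at a time, jump ("fast forward")
// straight to the start of each owned range.
std::vector<token> owned_tokens(const std::vector<token>& sorted_tokens,
                                const std::vector<token_range>& owned_ranges) {
    std::vector<token> result;
    auto it = sorted_tokens.begin();
    for (const auto& [start, end] : owned_ranges) {
        // Skip all unowned tokens before this range in one step.
        it = std::lower_bound(it, sorted_tokens.end(), start);
        // Scan only the owned data.
        while (it != sorted_tokens.end() && *it <= end) {
            result.push_back(*it);
            ++it;
        }
        if (it == sorted_tokens.end()) {
            break;  // input exhausted, nothing left for later ranges
        }
    }
    return result;
}
```

The early `break` loosely corresponds to the later commits in this log, where an exhausted reader is closed as soon as a fast forward finds nothing left to read.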
Raphael S. Carvalho	1fefe597e6	sstables: Close SSTable reader if index exhaustion is detected in fast forward call When wiring multi range reader with cleanup, I found that cleanup wouldn't be able to release disk space of input SSTables earlier. The reason is that multi range reader fast forwards to the next range, therefore it enables mutation_reader::forwarding, and as a result, combined reader cannot release readers proactively as it cannot tell for sure that the underlying reader is exhausted. It may have reached EOS for the current range, but it may have data for the next one. The concept of EOS actually only applies to the current range being read. A reader that returned EOS will actually get out of this state once the combined reader fast forwards to the next range. Therefore, only the underlying reader, i.e. the sstable reader, can for certain know that the data source is completely exhausted, given that tokens are read in monotonically increasing order. For reversed reads, that's not true but fast forward to range is not actually supported yet for it. Today, the SSTable reader already knows that the underlying SSTable was exhausted in fast_forward_to(), after it calls index_reader's advance_to(partition_range), therefore it disables subsequent reads. We can take a step further and also check that the index was exhausted, i.e. reached EOF. So if the index is exhausted, and there's no partition to read after the fast_forward_to() call, we know that there's nothing left to do in this reader, and therefore the reader can be closed proactively, allowing the disk space of SSTable to be reclaimed if it was already deleted. 
We can see that the combined reader, under multi range reader, will incrementally find a set of disjoint SSTable exhausted, as it fast forwards to owned ranges 1: INFO 2023-07-05 10:51:09,570 [shard 0] mutation_reader - flat_multi_range_mutation_reader(): fast forwarding to range [{-4525396453480898112, start},{-4525396453480898112, end}] INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-1-big-Data.db, start == end, eof ? true INFO 2023-07-05 10:51:09,570 [shard 0] sstable - closing reader 0x60100029d800 for /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-1-big-Data.db INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-3-big-Data.db, start == end, eof ? false INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-4-big-Data.db, start == end, eof ? false INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-5-big-Data.db, start == end, eof ? false INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-6-big-Data.db, start == end, eof ? false INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-7-big-Data.db, start == end, eof ? false INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-8-big-Data.db, start == end, eof ? false INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-9-big-Data.db, start == end, eof ? false INFO 2023-07-05 10:51:09,570 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-10-big-Data.db, start == end, eof ? 
false 2: INFO 2023-07-05 10:51:09,572 [shard 0] mutation_reader - flat_multi_range_mutation_reader(): fast forwarding to range [{-2253424581619911583, start},{-2253424581619911583, end}] INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-2-big-Data.db, start == end, eof ? true INFO 2023-07-05 10:51:09,572 [shard 0] sstable - closing reader 0x60100029d400 for /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-2-big-Data.db INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-4-big-Data.db, start == end, eof ? false INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-5-big-Data.db, start == end, eof ? false INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-6-big-Data.db, start == end, eof ? false INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-7-big-Data.db, start == end, eof ? false INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-8-big-Data.db, start == end, eof ? false INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-9-big-Data.db, start == end, eof ? false INFO 2023-07-05 10:51:09,572 [shard 0] sstable - sstable /tmp/scylla-9831a31a-66f3-4541-8681-000ac8e21bbb/me-10-big-Data.db, start == end, eof ? false And so on. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-07-11 13:56:24 -03:00
Raphael S. Carvalho	f08a4eaacb	sstables: Simplify sstable reader initialization It's odd that we see things like: if (!is_initialized()) { return initialize().then([this] { if (!is_initialized()) { and return ensure_initialized().then([this, &pr] { if (!is_initialized()) { One might think initialize will actually initialize the reader by setting up context, and ensure_initialized() will even have stronger guarantees, meaning that the reader must be initialized by it. But none are true. In the context of single-partition read, it can happen initialize() will not set up context, meaning is_initialized() returns false, which is why initialization must be checked even after we call ensure_initialized(). Let's merge ensure_initialized() and initialize() into a maybe_initialize() which returns a boolean saying if the reader is initialized. It makes the code initializing the reader easier to understand.	2023-07-11 13:56:23 -03:00
Kefu Chai	25f4a7c400	sstables: format using format string instead of concatenating strings, let's format using the builtin support of `log::debug()`. for two reasons: 1. better performance, after this change, we don't need to materialize the concatenated string, if the "debug" level logging is not enabled. seastar::log only formats when a certain log level is enabled. 2. better readability. with the format string, it is clear what is the fixed part, and which arguments are to be formatted. this also helps us to move to compile-time formatting check, as fmtlib requires the caller to be explicit when it wants to use runtime format string. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #14627	2023-07-11 15:31:20 +03:00
Botond Dénes	37dd2503ff	Merge 'replica,sstable: do not assign a value to a shared_ptr' from Kefu Chai instead of using the operator=(T&&) to assign an instance of `T` to a shared_ptr, assign a new instance of shared_ptr to it. unlike std::shared_ptr, seastar::shared_ptr allows us to move a value into the existing value pointed by shared_ptr with operator=(). the corresponding change in seastar is `319ae0b530`. but this is a little bit confusing, as the behavior of a shared_ptr should look like a pointer instead of the value pointed by it. and this could be error-prone, because user could use something like ```c++ p = std::string(); ``` by accident, and expect that the value pointed by `p` is cleared. and all copies of this shared_ptr are updated accordingly. what he/she really wants is: ```c++ *p = std::string(); ``` and the code compiles, while the outcome of the statement is that the pointee of `p` is destructed, and `p` now points to a new instance of string with a new address. the copies of this instance of shared_ptr still hold the old value. this behavior is not expected. so before deprecating and removing this operator, let's stop using it. in this change, we update two caller sites of the `lw_shared_ptr::operator=(T&&)`. instead of creating a new instance pointee of the pointer in-place, a new instance of lw_shared_ptr is created, and is assigned to the existing shared_ptr. Closes #14470 github.com:scylladb/scylladb: sstables: use try_emplace() when appropriate replica,sstable: do not assign a value to a shared_ptr	2023-07-11 09:19:48 +03:00
Avi Kivity	0cabf4eeb9	build: disable implicit fallthrough Prevent switch case statements from falling through without annotation ([[fallthrough]]) proving that this was intended. Existing intended cases were annotated. Closes #14607	2023-07-10 19:36:06 +02:00
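For readers unfamiliar with the annotation mentioned above, here is a small self-contained example of an intentional, annotated fall-through. The function and its stages are made up for illustration. GCC and Clang can diagnose unannotated cases via `-Wimplicit-fallthrough`; which exact flag the build change uses is not stated in the commit message, so treat that as an assumption.

```cpp
#include <cassert>

// Made-up example: count how many steps remain when starting at `stage`.
// Each case deliberately continues into the next one, and says so.
int steps_from(int stage) {
    int steps = 0;
    switch (stage) {
    case 0:
        ++steps;
        [[fallthrough]];  // intended: stage 0 also performs stage 1's work
    case 1:
        ++steps;
        [[fallthrough]];  // intended: stage 1 also performs stage 2's work
    case 2:
        ++steps;
        break;
    default:
        break;            // unknown stage: nothing to do
    }
    return steps;
}
```

Removing either annotation leaves the behavior unchanged, but with implicit fall-through treated as an error the code would no longer build.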
Nadav Har'El	edfb89ef65	sstables: stop warning when auto-snapshot leaves non-empty directory When a table is dropped, we delete its sstables, and finally try to delete the table's top-level directory with the rmdir system call. When the auto-snapshot feature is enabled (this is still Scylla's default), the snapshot will remain in that directory so it won't be empty and cannot be removed. Today, this results in a long, ugly and scary warning in the log: ``` WARN 2023-07-06 20:48:04,995 [shard 0] sstable - Could not remove table directory "/tmp/scylla-test-198265/data/alternator_alternator_Test_1688665684546/alternator_Test_1688665684546-4238f2201c2511eeb15859c589d9be4d/snapshots": std::filesystem::__cxx11::filesystem_error (error system:39, filesystem error: remove failed: Directory not empty [/tmp/scylla-test-198265/data/alternator_alternator_Test_1688665684546/alternator_Test_1688665684546-4238f2201c2511eeb15859c589d9be4d/snapshots]). Ignored. ``` It is bad to log as a warning something which is completely normal - it happens every time a table is dropped with the perfectly valid (and even default) auto-snapshot mode. We should only log a warning if the deletion failed because of some unexpected reason. And in fact, this is exactly what the code tried to do - it does not log a warning if the rmdir failed with EEXIST. It even had a comment saying why it was doing this. But the problem is that in Linux, deleting a non-empty directory does not return EEXIST, it returns ENOTEMPTY... Posix actually allows both. So we need to check both, and this is the only change in this patch. To confirm that this patch works, edit test/cql-pytest/run.py and change auto-snapshot from 0 to 1, run test/alternator/run (for example) and see many "Directory not empty" warnings as above. With this patch, none of these warnings appear. Fixes #13538 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #14557	2023-07-07 11:08:10 +02:00
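The check described above, accepting both error codes that POSIX allows for a non-empty directory, can be sketched as a small predicate over `std::error_code`. The name `is_directory_not_empty` is hypothetical and the actual patch may be structured differently.

```cpp
#include <cassert>
#include <cerrno>
#include <system_error>

// Hypothetical predicate: does `ec` mean "the directory still has entries"?
// POSIX lets rmdir() report this as either ENOTEMPTY or EEXIST, and Linux
// uses ENOTEMPTY, so both must be treated as the expected, non-alarming case.
bool is_directory_not_empty(const std::error_code& ec) {
    return ec == std::errc::directory_not_empty  // ENOTEMPTY
        || ec == std::errc::file_exists;         // EEXIST
}
```

A caller would log a warning only when the removal failed and this predicate returns false.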
Kefu Chai	949bb719cd	sstables: use try_emplace() when appropriate so we don't have to search in the unordered_map twice. and it's more readable, as we don't need to compare an iterator with the sentry. also, take the opportunity to simplify the code by using the temporary `s3_cfg` when possible instead of `it->second.cfg` which is less readable. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-07-04 15:40:10 +08:00
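As an illustration of the single-lookup pattern mentioned above, the sketch below uses `std::unordered_map::try_emplace`, which inserts only when the key is absent and reports whether it did. `endpoint_config` and `register_endpoint` are hypothetical stand-ins, not the types touched by the commit.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

struct endpoint_config {  // hypothetical value type
    int port = 0;
};

// Hypothetical sketch: one hash lookup instead of find() followed by insert().
// Returns true if a new entry was created, false if `name` already existed
// (in which case the stored value is left untouched).
bool register_endpoint(std::unordered_map<std::string, endpoint_config>& endpoints,
                       const std::string& name, int port) {
    auto [it, inserted] = endpoints.try_emplace(name, endpoint_config{port});
    (void)it;  // `it` points at the new or the pre-existing entry
    return inserted;
}
```

There is no comparison against `end()` here, which is the readability gain the commit message refers to.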
Kefu Chai	dcfbc85485	replica,sstable: do not assign a value to a shared_ptr instead of using the operator=(T&&) to assign an instance of `T` to a shared_ptr, assign a new instance of shared_ptr to it. unlike std::shared_ptr, seastar::shared_ptr allows us to move a value into the existing value pointed by shared_ptr with operator=(). the corresponding change in seastar is `319ae0b530`. but this is a little bit confusing, as the behavior of a shared_ptr should look like a pointer instead of the value pointed by it. and this could be error-prone, because user could use something like ```c++ p = std::string(); ``` by accident, and expect that the value pointed by `p` is cleared. and all copies of this shared_ptr are updated accordingly. what he/she really wants is: ```c++ *p = std::string(); ``` and the code compiles, while the outcome of the statement is that the pointee of `p` is destructed, and `p` now points to a new instance of string with a new address. the copies of this instance of shared_ptr still hold the old value. this behavior is not expected. so before deprecating and removing this operator, let's stop using it. in this change, we update two caller sites of the `lw_shared_ptr::operator=(T&&)`. instead of creating a new instance pointee of the pointer in-place, a new instance of lw_shared_ptr is created, and is assigned to the existing shared_ptr. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-07-04 15:39:52 +08:00
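The difference between assigning through the pointer and rebinding the pointer can be demonstrated with `std::shared_ptr`, used here as a stand-in for `seastar::lw_shared_ptr`. The standard pointer has no `operator=(T&&)`, so only the two unambiguous forms appear; the helper names are hypothetical.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Assign through the pointer: the shared pointee changes,
// so every copy of the pointer observes the new value.
void clear_pointee(std::shared_ptr<std::string>& p) {
    *p = std::string();
}

// Rebind the pointer: only this handle now refers to a fresh object;
// other copies keep pointing at the old value.
void rebind(std::shared_ptr<std::string>& p) {
    p = std::make_shared<std::string>();
}
```

Spelling out which of the two is meant at each call site is what removes the ambiguity the commit message describes.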
Kefu Chai	04434c02b3	sstables: print generation without {:d} the formatter for sstables::generation_type does not support "d" specifier, so we should not use "{:d}" for printing it. this worked before `d7c90b5239`, but after that change, generation_type is not an alias of int64_t anymore. and its formatter does not support "d", so we should either specialize fmt::formatter<generation_type> to support it or just drop the specifier. since seastar::format() is using ```c++ fmt::format_to(fmt::appender(out), fmt::runtime(fmt), std::forward<A>(a)...); ``` to print the arguments with given fmt string, we cannot identify this kind of error at compile time. at runtime, if we have issues like this, {fmt} would throw exception like: ``` terminate called after throwing an instance of 'fmt::v9::format_error' what(): invalid format specifier ``` when constructing the `std::runtime_error` instance. so, in this change, "d" is removed. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #14427	2023-07-03 13:53:13 +03:00
Raphael S. Carvalho	1d8cb32a5d	table: Optimize creation of reader excluding staging for view building View building from staging creates a reader from scratch (memtable + sstables - staging) for every partition, in order to calculate the diff between new staging data and data in base sstable set, and then pushes the result into the view replicas. perf shows that the reader creation is very expensive: + 12.15% 10.75% reactor-3 scylla [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes + 10.01% 9.99% reactor-3 scylla [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> > + 8.95% 8.94% reactor-3 scylla [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator() + 7.29% 7.28% reactor-3 scylla [.] dht::ring_position_tri_compare + 6.28% 6.27% reactor-3 scylla [.] dht::tri_compare + 4.11% 3.52% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 4.09% 4.07% reactor-3 scylla [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state + 3.46% 0.93% reactor-3 scylla [.] sstables::sstable_run::will_introduce_overlapping + 2.53% 2.53% reactor-3 libstdc++.so.6 [.] std::_Rb_tree_increment + 2.45% 2.45% reactor-3 scylla [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> > + 2.14% 2.13% reactor-3 scylla [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> > + 2.07% 2.07% reactor-3 scylla [.] 
logalloc::region_impl::free + 2.06% 1.91% reactor-3 scylla [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()#1}::operator()() const::{lambda()#1}::operator() + 2.04% 2.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 1.87% 0.00% reactor-3 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe + 1.86% 0.00% reactor-3 [kernel.kallsyms] [k] do_syscall_64 + 1.39% 1.38% reactor-3 libc.so.6 [.] __memcmp_avx2_movbe + 1.37% 0.92% reactor-3 scylla [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables:: + 1.34% 1.33% reactor-3 scylla [.] logalloc::region_impl::alloc_small + 1.33% 1.33% reactor-3 scylla [.] seastar::memory::small_pool::add_more_objects + 1.30% 0.35% reactor-3 scylla [.] seastar::reactor::do_run + 1.29% 1.29% reactor-3 scylla [.] seastar::memory::allocate + 1.19% 0.05% reactor-3 libc.so.6 [.] syscall + 1.16% 1.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst + 1.07% 0.79% reactor-3 scylla [.] sstables::partitioned_sstable_set::insert That shows some significant amount of work for inserting sstables into the interval map and maintaining the sstable run (which sorts fragments by first key and checks for overlapping). 
The interval map is known for having issues with L0 sstables, as it will have to be replicated almost to every single interval stored by the map, causing terrible space and time complexity. With enough L0 sstables, it can fall into quadratic behavior. This overhead is fixed by not building a new fresh sstable set when recreating the reader, but rather supplying a predicate to sstable set that will filter out staging sstables when creating either a single-key or range scan reader. This could have another benefit over today's approach which may incorrectly consider a staging sstable as non-staging, if the staging sst wasn't included in the current batch for view building. With this improvement, view building was measured to be 3x faster. from INFO 2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s to INFO 2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s Refs #14089. Fixes #14244. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-06-26 22:30:39 -03:00
Kefu Chai	f014ccf369	Revert "Revert "Merge 'treewide: add uuid_sstable_identifier_enabled support' from Kefu Chai"" This reverts commit `562087beff`. The regressions introduced by the reverted change have been fixed. So let's revert this revert to resurrect the uuid_sstable_identifier_enabled support. Fixes #10459	2023-06-21 13:02:40 +03:00
Tomasz Grabiec	34f28aa0cb	sstables: Add trace-level logging related to shard calculation	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	18f567385c	sstable_directory: Improve trace-level logging	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	ad983ac23d	sstables: Compute sstable shards using sharder from erm when loading schema::get_sharder() does not use the correct sharder for tablet-based tables. Code which is supposed to work with all kinds of tables should obtain the sharder from erm::get_sharder().	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	17d6163548	sstables: Generate sharding metadata using sharder from erm when writing We need to keep sharding metadata consistent with tablet mapping to shards in order for node restart to detect that those sstables belong to a single shard and that resharding is not necessary. Resharding of sstables based on tablet metadata is not implemented yet and will abort after this series. Keeping sharding metadata accurate for tablets is only necessary until compaction group integration is finished. After that, we can use the sstable token range to determine the owning tablet and thus the owning shard. Before that, we can't, because a single sstable may contain keys from different tablets, and the whole key range may overlap with keys which belong to other shards.	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	fe7922d65c	sstables: Move compute_shards_for_this_sstable() to load() Soon, compute_shards_for_this_sstable() will need to take a sharder object. open_data() is called indirectly from sstable::load() and directly after writing an sstable from various paths. The latter don't really need to compute shards, since the field is already set by the writer. In order to reduce code churn, move compute_shards_for_this_sstable() to the load() path only so that only load() needs to take the sharder.	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	390bcf3fae	dht: Take sharder externally in splitting functions We need those functions to work with tablet sharder, which is not accessible through schema::get_sharder(). In order to propagate the right sharder, those functions need to take it externally rather from the schema object. The sharder will come from the effective_replication_map attached to the table object. Those splitting functions are used when generating sharding metadata of an sstable. We need to keep this sharding metadata consistent with tablet mapping to shards in order for node restart to detect that those sstables belong to a single shard and that resharding is not necessary. Resharding of sstables based on tablet metadata is not implemented yet and will abort after this series. Keeping sharding metadata accurate for tablets is only necessary until compaction group integration is finished. After that, we can use the sstable token range to determine the owning tablet and thus the owning shard. Before that, we can't, because a single sstable may contain keys from different tablets, and the whole key range may overlap with keys which belong to other shards.	2023-06-21 00:58:24 +02:00
Botond Dénes	562087beff	Revert "Merge 'treewide: add uuid_sstable_identifier_enabled support' from Kefu Chai" This reverts commit `d1dc579062`, reversing changes made to `3a73048bc9`. Said commit caused regressions in dtests. We need to investigate and fix those, but in the meanwhile let's revert this to reduce the disruption to our workflows. Refs: #14283	2023-06-19 08:49:27 +03:00
Kefu Chai	2d265e860d	replica,sstable: introduce invalid generation id the invalid sstable id is the NULL of a sstable identifier. with this concept, it would be a lot simpler to find/track the greatest generation. the complexity is hidden in the generation_type, which compares a) integer-based identifiers b) uuid-based identifiers c) the invalid identifier in different ways. so, in this change * the default constructor of generation_type is now public. * we don't check for empty generation anymore when loading SSTables or enumerating them. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-06-15 17:54:59 +08:00
Kefu Chai	939fa087cc	sstables, replica: pass uuid_sstable_identifiers to generation generator before this change, we assume that generation is always integer based. in order to enable the UUID-based generation identifier if the related option is set, we should populate this option down to generation generator. because we don't have access to the cluster features in some places where a new generation is created, a new accessor exposing feature_service from sstable manager is added. Fixes #10459 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-06-15 17:54:59 +08:00
Kefu Chai	15543464ce	sstables, replica: support UUID in generation_type this change generalizes the value of generation_type so it also supports UUID based identifier. * sstables/generation_type.h: - add formatter and parse for UUID. please note, Cassandra uses a different format for formatting the SSTable identifier. and this formatter suits our needs as it uses underscore "_" as the delimiter, as the file name of components uses dash "-" as the delimiter. instead of reinventing the formatting or just use another delimiter in the stringified UUID, we choose to use the Cassandra's formatting. - add accessors for accessing the type and value of generation_type - add constructor for constructing generation_type with UUID and string. - use hash for placing sstables with uuid identifiers into shards for more uniform distribution of tables in shards. * replica/table.cc: - only update the generator if the given generation contains an integer * test/boost: - add a simple test to verify the generation_type is able to parse and format Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-06-15 17:54:59 +08:00
Pavel Emelyanov	d1de796f6b	sstable: Move XFS renamer hack into fs storage The method sits on sstable, but is called only from fs storage and it's the only place that really needs it Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #14230	2023-06-14 12:35:04 +03:00
Botond Dénes	3479adc85f	Merge 'Prepare sstable_directory lister to garbage_collect() s3 stuff' from Pavel Emelyanov When scylla starts it collects dangling sstables from the datadir. It includes temporary sstable directories and pending-deletion log. S3-backed sstables cannot be garbage-collected like that, instead "garbage" entries from the ownership table should be processed. Currently the g.c. code is unaware of storage and scans datadir for whatever sstable it's called for. This PR prepares the garbage_collect() call to become virtual, but no-op for ownership-table lister. Proper S3 garbage-collecting is not yet here, it needs an extra patch to seastar http client. refs: #13024 Closes #14023 * github.com:scylladb/scylladb: sstable_directory: Do not collect filesystem garbage for S3-backed sstables sstable_directory: Deduplicate .process() location argument sstable_directory: Keep directory lister on stack sstable_directory: Use directory_lister API directly	2023-06-14 12:06:37 +03:00
Pavel Emelyanov	c68c154fb6	code: Reduce tracing/hh fanout There are some headers that include tracing/.hh ones despite all they need is forward-declared trace_state_ptr Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #14155	2023-06-07 19:19:22 +03:00
Pavel Emelyanov	66e43912d6	code: Switch to seastar API level 7 In that level no io_priority_class-es exist. Instead, all the IO happens in the context of current sched-group. File API no longer accepts prio class argument (and makes io_intent arg mandatory to impls). So the change consists of - removing all usage of io_priority_class - patching file_impl's inheritors to updated API - priority manager goes away altogether - IO bandwidth update is performed on respective sched group - tune-up scylla-gdb.py io_queues command The first change is huge and was made semi-automatically by: - grep io_priority_class \| default_priority_class - remove all calls, found methods' args and class' fields Patching file_impl-s is smaller, but also mechanical: - replace io_priority_class& argument with io_intent* one - pass intent to lower file (if applicable) Dropping the priority manager is: - git-rm .cc and .hh - sed out all the #include-s - fix configure.py and cmakefile The scylla-gdb.py update is a bit hairy -- it needs to use task queues list for IO classes names and shares, but to detect whether it should, it checks that the "commitlog" group is present. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #13963	2023-06-06 13:29:16 +03:00
Benny Halevy	c685ef9e71	partitioned_sstable_set: insert: return early if sst is already in the set Currently, partitioned_sstable_set::insert may erase a sstable from the set inadvertently, if an exception is thrown while (re-)inserting it. To prevent that, simply return early after detecting that insertion didn't take place, based on the unordered_set::insert result. This issue is theoretical, as there are no known cases of re-inserting sstables into the partitioned sstable set. Fixes #14060 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #14061	2023-05-29 23:03:25 +03:00
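A simplified model of the early return described above: the rollback path should only undo an insertion that this call actually performed. The struct below is a hypothetical stand-in that uses integers for sstables and an injected failure flag; it does not reproduce the real partitioned_sstable_set.

```cpp
#include <cassert>
#include <stdexcept>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified stand-in for a set with a secondary index.
struct sstable_set_sketch {
    std::unordered_set<int> all;      // primary membership
    std::vector<int> by_range;        // secondary structure that may fail to update
    bool fail_secondary = false;      // test hook: simulate an exception

    void insert(int sst) {
        auto [it, inserted] = all.insert(sst);
        if (!inserted) {
            // Already present: return before any code that could throw,
            // so the rollback below can never erase a pre-existing entry.
            return;
        }
        try {
            if (fail_secondary) {
                throw std::runtime_error("injected failure");
            }
            by_range.push_back(sst);
        } catch (...) {
            all.erase(it);  // undo only the insertion made by this call
            throw;
        }
    }
};
```

Without the early return, a failing re-insert of an existing element would reach the rollback and remove an entry that was in the set before the call.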
Benny Halevy	26705ba6af	partitioned_sstable_set: erase empty runs When erasing a sstable first check if its run_id exists in _all_runs, otherwise do nothing with that respect, and then if the run becomes empty when erasing the last sstable (and it could have been a single-sstable run from get go), erase the run from `_all_runs`. Fixes #14052 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #14054	2023-05-29 23:03:24 +03:00
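The bookkeeping rule in the commit above (look up the run first, then drop it once its last sstable is gone) can be sketched with standard containers. `run_id`, `sstable_id`, `run_map` and `erase_from_run` are hypothetical simplifications; the real `_all_runs` holds richer objects.

```cpp
#include <cassert>
#include <unordered_map>
#include <unordered_set>

using run_id = int;       // hypothetical simplification
using sstable_id = int;   // hypothetical simplification
using run_map = std::unordered_map<run_id, std::unordered_set<sstable_id>>;

// Remove `sst` from its run. Unknown runs are ignored, and a run that
// becomes empty is erased so no empty entries accumulate in the map.
void erase_from_run(run_map& all_runs, run_id run, sstable_id sst) {
    auto it = all_runs.find(run);
    if (it == all_runs.end()) {
        return;  // run is not tracked: nothing to do
    }
    it->second.erase(sst);
    if (it->second.empty()) {
        all_runs.erase(it);  // covers single-sstable runs as well
    }
}
```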
Botond Dénes	5a14c3311a	Merge 'Break S3 upload 50Gb file limit' from Pavel Emelyanov Current S3 uploading sink has implicit limit for the final file size that comes from two places. First, S3 protocol declares that uploading parts count from 1 to 10000 (inclusive). Second, uploading sink sends out parts once they grow above S3 minimal part size which is 5Mb. Since sstables put data in 128kb (or smaller) portions, parts are almost exactly 5Mb in size, so the total uploading size cannot grow above ~50Gb. That's too low. To break the limit the new sink (called jumbo sink) uses the UploadPartCopy S3 call that helps splicing several objects into one right on the server. Jumbo sink starts uploading parts into an intermediate temporary object called a piece and named ${original_object}_${piece_number}. When the number of parts in current piece grows above the configured limit the piece is finalized and upload-copied into the object as its next part, then deleted. This happens in the background, meanwhile the new piece is created and subsequent data is put into it. When the sink is flushed the current piece is flushed as is and also squashed into the object. The new jumbo sink is capable of uploading ~500Tb of data, which looks enough. fixes: #13019 Closes #13577 * github.com:scylladb/scylladb: sstables: Switch data and index sink to use jumbo uploader s3/test: Tune-up multipart upload test alignment s3/test: Add jumbo upload test s3/client: Wait for background upload fiber on close-abort s3/client: Implement jumbo upload sink s3/client: Move memory buffers to upload_sink from base s3/client: Move last part upload out of finalize_upload() s3/client: Merge do_flush() with upload_part() s3/client: Rename upload_sink -> upload_sink_base	2023-05-25 11:44:06 +03:00
Pavel Emelyanov	e435ec1b5e	sstable_directory: Do not collect filesystem garbage for S3-backed sstables The sstable_directory::garbage_collect() scans /var/lib/scylla for whatever sstable it's called for. S3-backed ones don't have anything there, so the g.c. run is no-op. Make this call be lister virtual method, so that only filesystem lister does this scan and the ownership table lister becomes the real no-op. Later it will be filled with code. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-05-24 17:45:50 +03:00
Pavel Emelyanov	16d66f2fe9	sstable_directory: Deduplicate .process() location argument When sstable directory calls lister it passes the _sstable_dir as an argument. However, the very same _sstable_dir was used to construct the lister, and by now all the lister implementations keep this value aboard. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-05-24 17:43:36 +03:00
Pavel Emelyanov	d6b5e18cb3	sstable_directory: Keep directory lister on stack The directory_lister _lister exists as class member, but is only used once -- when the .process() is called -- and then is closed forever. It's simpler to keep the lister on the .process() stack. This change also makes filesystem lister keep the copy of directory as class member, it will be useful for the next patch as well. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-05-24 17:42:08 +03:00
Pavel Emelyanov	524614087a	sstable_directory: Use directory_lister API directly The filesystem components lister has private wrappers on top of directory lister it uses internally. These are leftovers from making the sstable directory storage-aware, now they can be removed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-05-24 17:40:38 +03:00
Botond Dénes	2526b232f1	Merge 'Remove explicit default_priority_class() usage from sstable aux methods' from Pavel Emelyanov There are few places in sstables/ code that require caller to specify priority class to pass it along to file stream options. All these callers use default class, so it makes little sense to keep it. This change makes the sched classes unification mega patch a bit smaller. ref: #13963 Closes #13996 * github.com:scylladb/scylladb: sstables: Remove default prio class from rewrite_statistics() sstables: Remove prio class from validate_checksums subs sstables: Remove always default io-prio from validate_checksums()	2023-05-24 09:23:24 +03:00
Pavel Emelyanov	6c453df9d7	sstables: Remove default prio class from rewrite_statistics() The method is called with explicitly default priority class and puts one into the fstream options. This whole chain can be avoided. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-05-23 13:54:31 +03:00
Pavel Emelyanov	438132ad4b	sstables: Remove prio class from validate_checksums subs The sstable.read_checksum() and .read_digest() accept prio class argument from validate_checksums(), but it's always the "default" one. Remove the arg and remove stream options initializations as they'll pick up default prio class on their default constructing. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-05-23 13:54:31 +03:00
Pavel Emelyanov	7396d9d291	sstables: Remove always default io-prio from validate_checksums() All calls to sstables::validate_checksums() happen with explicitly default priority class. Just hard-code it as such in the method Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-05-23 13:54:31 +03:00
Pavel Emelyanov	2bb024c948	index_reader: Introduce and use default arguments to constructor Most of creators of index_reader construct it with default prio class, null trace pointer and use_caching::yes. Assigning implicit defaults to constructor arguments keeps the code shorter and easier to read. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-05-23 11:29:04 +03:00
Pavel Emelyanov	3fd5d3cc2b	index_reader: Use _pc field in get_file_input_stream_options() directly No need to pass this-> field into this-> call Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-05-23 11:18:14 +03:00
Pavel Emelyanov	21d24e8ea3	index_reader: Move index_reader::get_file_input_stream_options to private: block A "while at it" cleanup. When patching the method (next patch) it turned out that there are no other callers other than local class, so it _is_ private. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-05-23 11:18:14 +03:00
Botond Dénes	3b424e391b	Merge 'perform_cleanup: wait until all candidates are cleaned up' from Benny Halevy cleanup_compaction should resolve only after all sstables that require cleanup are cleaned up. Since it is possible that some of them are in staging and therefore cannot be cleaned up, retry once a second until they become eligible. Timeout if there is no progress within 5 minutes to prevent hanging due to view building bug. Fixes #9559 Closes #13812 * github.com:scylladb/scylladb: table: signal compaction_manager when staging sstables become eligible for cleanup compaction_manager: perform_cleanup: wait until all candidates are cleaned up compaction_manager: perform_cleanup: perform_offstrategy if needed compaction_manager: perform_cleanup: update_sstables_cleanup_state in advance sstable_set: add for_each_sstable_gently* helpers	2023-05-19 12:35:59 +03:00
Botond Dénes	c2aee26278	Merge 'Keep sstables garbage collection in sstable_directory' from Pavel Emelyanov Currently temporary directories with incomplete sstables and pending deletion log are processed by distributed loader on start. That's not nice, because for s3 backed sstables this code makes no sense (and is currently a no-op because of incomplete implementation). This garbage collecting should be kept in sstable_directory where it can off-load this work onto lister component that is storage-aware. Once g.c. code moved, it allows to clean the class sstable list of static helpers a bit. refs: #13024 refs: #13020 refs: #12707 Closes #13767 * github.com:scylladb/scylladb: sstable: Toss tempdir extension usage sstable: Drop pending_delete_dir_basename() sstable: Drop is_pending_delete_dir() helper sstable_directory: Make garbage_collect() non-static sstable_directory: Move deletion log exists check distributed_loader: Move garbage collecting into sstable_directory distributed_loader: Collect garbage collecting in one call sstable: Coroutinize remove_temp_dir() sstable: Coroutinize touch_temp_dir() sstable: Use storage::temp_dir instead of hand-crafted path	2023-05-19 08:50:13 +03:00
Kefu Chai	03be1f438c	sstables: move get_components_lister() into sstable_directory sstables_manager::get_component_lister() is used by sstable_directory. and almost all the "ingredients" used to create a component lister are located in sstable_directory. among the other things, the two implementations of `components_lister` are located right in `sstable_directory`. there is no need to outsource this to sstables_manager just for accessing the system_keyspace, which is already exposed as a public function of `sstables_manager`. so let's move this helper into sstable_directory as a member function. with this change, we can even go further by moving the `components_lister` implementations into the same .cc file. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #13853	2023-05-18 08:43:35 +03:00
Kefu Chai	8bcbc9a90d	sstables: add a maybe_owned_by_this_shard() helper instead of encoding the fact that we are using generation identifier as a hint where the SSTable with this generation should be processed at the caller sites of `as_int()`, just provide an accessor on sstable_generation_generator's side. this helps to encapsulate the underlying type of generation in `generation_type` instead of exposing it to its users. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #13846	2023-05-18 08:41:02 +03:00
Pavel Emelyanov	ed50fda1fe	sstable: Toss tempdir extension usage The tempdir for filesystem-based sstables is {generation}.sstable one. There are two places that need to know the ".sstable" extension -- the tempdir creating code and the tempdir garbage-collecting code. This patch simplifies the sstable class by patching the aforementioned functions to use newly introduced tempdir_extension string directly, without the help of static one-line helpers. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-05-17 15:19:38 +03:00
Pavel Emelyanov	e8c0ae28b5	sstable: Drop pending_delete_dir_basename() The helper is used to return const char* value of the pending delete dir. Callers can use it directly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-05-17 15:17:33 +03:00
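
Note on 5a14c3311a ('Break S3 upload 50Gb file limit'): the size limits quoted in that message follow from simple arithmetic, which the sketch below reproduces. The helper names are hypothetical and are not taken from the s3/client code. The per-piece part limit is described only as "configured" in the message, so the value 10000 used in the usage note is an assumption chosen because it reproduces the quoted ~500Tb figure. Any other limits a storage service may impose are ignored.

```cpp
#include <cstdint>

// Figures quoted in the commit message, treated as exact for illustration.
constexpr std::uint64_t max_parts_per_object = 10'000;       // part numbers run 1..10000
constexpr std::uint64_t min_part_size = 5ull * 1024 * 1024;  // 5 MiB minimal part size

// Hypothetical helper: largest object a sink can produce when every part is
// sent as soon as it reaches the minimal part size.
constexpr std::uint64_t plain_sink_limit() {
    return max_parts_per_object * min_part_size;
}

// Hypothetical helper: with a jumbo-style sink, each part of the final object
// is itself a piece built from up to `parts_per_piece` minimal-size parts.
constexpr std::uint64_t jumbo_sink_limit(std::uint64_t parts_per_piece) {
    return max_parts_per_object * parts_per_piece * min_part_size;
}
```

Under these assumptions `plain_sink_limit()` is about 52 GB, matching the ~50Gb ceiling in the message, and `jumbo_sink_limit(10'000)` is about 524 TB, in line with the ~500Tb claim.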
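
Note on c685ef9e71 and 26705ba6af (partitioned_sstable_set): the two fixes describe a common container pattern, sketched below in a minimal, self-contained form. `run_tracking_set` is a hypothetical stand-in that uses plain integers for sstables and run ids. It is not ScyllaDB's real class and only mirrors the behaviour the commit messages describe: return early when `unordered_set::insert` reports that nothing was inserted, and drop a run once its last sstable is erased.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <unordered_set>

// Hypothetical, heavily simplified set that tracks sstables both
// individually and grouped by run id.
class run_tracking_set {
    std::unordered_set<int> _all;                                // sstable ids
    std::map<std::uint64_t, std::unordered_set<int>> _all_runs;  // run id -> sstable ids
public:
    // Returns false and changes nothing if the sstable is already present.
    bool insert(int sst, std::uint64_t run_id) {
        if (!_all.insert(sst).second) {
            return false;  // early return: no later step can undo an existing entry
        }
        _all_runs[run_id].insert(sst);
        return true;
    }

    void erase(int sst, std::uint64_t run_id) {
        _all.erase(sst);
        auto run = _all_runs.find(run_id);
        if (run == _all_runs.end()) {
            return;  // unknown run id: nothing to do for run bookkeeping
        }
        run->second.erase(sst);
        if (run->second.empty()) {
            _all_runs.erase(run);  // drop runs that became empty
        }
    }

    std::size_t size() const { return _all.size(); }
    std::size_t run_count() const { return _all_runs.size(); }
};
```

In this sketch a second `insert(1, 7)` leaves the set unchanged, and erasing the last sstable of run 7 also removes the run entry, so `run_count()` drops back to zero.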

3200 Commits