Remove `compressor::create()`. This enforces that compressors
are only created through the `sstable_compressor_factory`.
Unlike the synchronous `compressor::create()`, the factory will be able
to create dict-aware compressors.
SSTable readers and writers use `compressor` objects to compress and
decompress chunks of SSTable data files.
`compressor` objects are read-only, so only one of them is needed
for each SSTable. Before this commit, each reader and writer has
its own `compressor` object. This isn't necessary, but it's okay.
But later in this series it will stop being okay, because creating
a `compressor` will become an expensive cross-shard
operation (it might require sharing a compression dictionary
from another shard). So we have to adjust the code so that there is
only one `compressor` per sstable, not one per reader/writer.
We stuff the ownership of this compressor into `sstable::compression`.
To make the ownership clear, we remove `compression_ptr` shared
pointers from readers and writers, and make them access the
compressor via the `sstable::compression` instead.
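A minimal sketch of the ownership shape this moves towards; the names (`get_compressor()`, `data_reader`) are illustrative stand-ins, not the actual ScyllaDB declarations:
```cpp
// Sketch only: names are illustrative, not the actual ScyllaDB declarations.
#include <memory>

class compressor {
    // read-only (de)compression state shared by all users of one sstable
};

struct compression {
    // existing compression metadata (chunk offsets, parameters, ...)
    std::unique_ptr<compressor> _compressor;   // single owner, created via the factory
};

class sstable {
    compression _compression;
public:
    const compressor& get_compressor() const { return *_compression._compressor; }
};

// A reader/writer no longer holds its own compression_ptr; it keeps a
// reference to the sstable and borrows the shared compressor from it.
class data_reader {
    const sstable& _sst;
public:
    explicit data_reader(const sstable& sst) : _sst(sst) {}
    // decompress_chunk(...) would use _sst.get_compressor()
};
```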
sstable features indicate that an sstable has some extension, or that
some bug was fixed. They allow us to know whether we can rely on certain
properties of an sstable we read.
Currently, sstable features are set early in the read path (when we
read the scylla metadata file) and very late in the write path
(when we write the scylla metadata file just before sealing the sstable).
However, we happen to read features before we set them in the write path -
when we resize the bloom filter for a newly written sstable we instantiate
an index reader, and that depends on some features. As a result,
we read a disengaged optional (for the scylla metadata component) as if
it were engaged. This somehow worked so far, but fails with the libstdc++
hash table implementation.
Fix it by moving storage of the features to the sstable itself, and
setting it early in the write path.
Fixes #23484
Closes scylladb/scylladb#23485
The class in question is a wrapper around output_stream that writes,
flushes and closes the stream in async context. For logging it also
keeps the component file name on board, and now is a good time to patch
it to keep the component_name instead.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similarly to previous patches -- mostly the result is used as a log
argument. The remaining users include
- the scylla sstable tool that dumps component names to json output
- the API endpoint that returns component names to the user
- tests
These are all places where it's fine to explicitly convert component_names
to strings.
There are a few more places that expect strings instead of component name
objects. For now they also use fmt::to_string() explicitly; some of this
will be fixed later in this series, most of it as future follow-ups.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Most of the method's callers use it as a log parameter. There are a few more
places that pass it to malformed_sstable_exception, which immediately
converts it to a string, so this patch makes the exception be constructed
with the component_name as well.
And there's one more place that passes this string to the file_writer
constructor. For now, convert it to a string explicitly; the next patches
will fix that place to use a pure component_name too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a generic sstable::filename(component_type) method that returns
a file name for the given component. For "popular" components, namely
TOC, Data and Index, there are dedicated sstable methods to get their
names. Fix existing callers of the generic method to use the dedicated ones.
It's shorter, nicer and makes further patching simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Drop it from files that obviously don't need it. Also kill some forward
declarations while at it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#22979
now that we are allowed to use C++23, we have the luxury of using
`std::ranges::stable_partition`.
in this change, we:
- replace `boost::range::stable_partition()` with
  `std::ranges::stable_partition()`
- since `std::ranges::stable_partition()` returns a subrange instead of
  an iterator, rename the variables which previously held the return
  value of `boost::range::stable_partition()` accordingly for better
  readability (a small sketch follows this list).
- remove unused `#include` of boost headers
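for illustration, a minimal standalone example of the return-type difference that motivates the renames; it is not code from this change:
```cpp
// Illustrative only: shows the shape of the boost -> std::ranges change.
#include <algorithm>
#include <vector>

int main() {
    std::vector<int> v{1, 2, 3, 4, 5, 6};
    auto is_even = [](int x) { return x % 2 == 0; };

    // boost::range::stable_partition(v, is_even) returned an iterator to the
    // first element of the second group:
    //   auto first_odd = boost::range::stable_partition(v, is_even);

    // std::ranges::stable_partition returns a subrange covering the second
    // group, so the variable is named after the range rather than a position:
    auto odd_numbers = std::ranges::stable_partition(v, is_even);
    // v == {2, 4, 6, 1, 3, 5}; odd_numbers spans {1, 3, 5}
    return odd_numbers.size() == 3 ? 0 : 1;
}
```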
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#21911
Reads which need the sstable index were computing
column_values_fixed_lengths each time. This showed up in the perf profile
for an sstable-read-heavy workload, and amounted to about 1-2% of time.
Computing it involves type name parsing.
Avoid this by using a cached per-sstable mapping. There is already
sstable::_column_translation which can be used for this. It caches the
mapping for the most recently used schema. Since the cursor uses the
mapping only for primary key columns, which are stable, any schema
will do, so we can use the last _column_translation. We only need to
make sure that it's always armed, so sstable loading is augmented with
arming it with the sstable's schema.
Also, fixes a potential use-after-free on schema in column_translation.
Closes scylladb/scylladb#21347
* github.com:scylladb/scylladb:
sstables: Fix potential use-after-free on column_translation::column_info::name
sstables: Avoid computing column_values_fixed_lengths on each read
Replace manual subrange advancement with the more concise and readable
`subrange.advance()` method. This change:
- Eliminates unnecessary subrange instance creation
- Improves code readability
- Reduces potential for unnecessary object allocation
- Leverages the built-in `advance()` method for cleaner iterator handling
The modification simplifies the iteration logic while maintaining the
same functional behavior.
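A small standalone illustration of the pattern, with made-up data rather than the actual call sites:
```cpp
// Illustrative only: the pattern this commit replaces.
#include <ranges>
#include <vector>

int main() {
    std::vector<int> v{10, 20, 30, 40};
    auto r = std::ranges::subrange(v.begin(), v.end());

    // Before: build a brand new subrange just to drop the first element.
    //   r = std::ranges::subrange(std::next(r.begin()), r.end());

    // After: advance the existing subrange in place.
    r.advance(1);
    return r.front() == 20 ? 0 : 1;
}
```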
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#21865
Replace boost::make_iterator_range() with std::ranges::subrange.
This change improves code modernization and reduces external dependencies:
- Replace boost::make_iterator_range() with std::ranges::subrange
- Remove boost/range/iterator_range.hpp include
- Improve iterator type detection in interval.hh using std::ranges::const_iterator_t<Range>
This is part of ongoing efforts to modernize our codebase and minimize
external dependencies.
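For illustration, a standalone sketch of the one-to-one replacement (not code from the change):
```cpp
// Illustrative only: the replacement described above.
#include <ranges>
#include <vector>

int main() {
    std::vector<int> v{1, 2, 3, 4};

    // Before:
    //   auto middle = boost::make_iterator_range(v.begin() + 1, v.end() - 1);

    // After: std::ranges::subrange models the same "pair of iterators" range.
    auto middle = std::ranges::subrange(v.begin() + 1, v.end() - 1);

    int sum = 0;
    for (int x : middle) {
        sum += x;                  // iterates over {2, 3}
    }
    return sum == 5 ? 0 : 1;
}
```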
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#21787
Reads which need the clustering index cursor were computing
column_values_fixed_lengths each time. This showed up in the perf profile
for an sstable-read-heavy workload, and amounted to about 1% of time.
Avoid this by using a cached per-sstable mapping. There is already
sstable::_column_translation which can be used for this. It caches the
mapping for the most recently used schema. Since the cursor uses the
mapping only for primary key columns, which are stable, any schema
will do, so we can use the last _column_translation. We only need to
make sure that it's always armed, so sstable loading is augmented with
arming with sstable's schema.
now that we are allowed to use C++23, we have the luxury of using
`std::views::transform`.
in this change, we:
- replace `boost::adaptors::transformed` with `std::views::transform`
- use `fmt::join()` when appropriate, where `boost::algorithm::join()`
  is not applicable to a range view returned by `std::views::transform`
  (a sketch of a couple of these replacements follows below).
- use `std::ranges::fold_left()` to accumulate the range returned by
  `std::views::transform`
- use `std::ranges::fold_left()` to get the maximum element in the
  range returned by `std::views::transform`
- use `std::ranges::min()` to get the minimal element in the range
  returned by `std::views::transform`
- use `std::ranges::equal()` to compare the range views returned
  by `std::views::transform`
- remove unused `#include <boost/range/adaptor/transformed.hpp>`
- use `std::ranges::subrange()` instead of `boost::make_iterator_range()`,
  to feed `std::views::transform()` a view range.
to reduce the dependency on boost for better maintainability, and to
leverage standard library features for better long-term support.
this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.
limitations:
there are still a couple of places where we are still using
`boost::adaptors::transformed`, due to the lack of a C++23 alternative
for `boost::join()` and `boost::adaptors::uniqued`.
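as a rough standalone illustration of a couple of these replacements (not code from this change), assuming fmt's ranges support is available:
```cpp
// Illustrative only: fmt::join() over a lazy view, and fold_left() instead of
// boost::accumulate().
#include <algorithm>
#include <functional>
#include <ranges>
#include <string>
#include <vector>
#include <fmt/format.h>
#include <fmt/ranges.h>

int main() {
    std::vector<int> sizes{1, 2, 3};
    auto doubled = sizes | std::views::transform([](int x) { return x * 2; });

    // boost::algorithm::join() wants a range of strings, so for a lazy view
    // we switch to fmt::join():
    std::string joined = fmt::format("{}", fmt::join(doubled, ", "));  // "2, 4, 6"

    // boost::accumulate() -> std::ranges::fold_left():
    int total = std::ranges::fold_left(doubled, 0, std::plus<>{});     // 12

    return (joined == "2, 4, 6" && total == 12) ? 0 : 1;
}
```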
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#21700
For historic reasons, we have (in bytes.hh) a type sstring_view which
is an alias for std::string_view - since the same standard type can hold
a pointer into both a seastar::sstring and a std::string.
This alias is unnecessary and misleading to new developers (who might
assume it is somehow different from std::string_view). This patch doesn't
yet remove all occurrences of sstring_view (the request in #4062), but
begins to do so by renaming one commonly-used function, to_sstring_view(bytes),
to to_string_view(), and of course changes all its uses to the new name.
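A simplified sketch of what the rename amounts to; `bytes` is stood in by a plain vector here and the helper body is illustrative, not the real definition:
```cpp
// Sketch only, with simplified types.
#include <cstdint>
#include <string_view>
#include <vector>

using bytes = std::vector<int8_t>;   // stand-in for the real bytes type

// Old: an alias and a conversion helper named after seastar::sstring.
// using sstring_view = std::string_view;
// sstring_view to_sstring_view(const bytes& b);

// New: same conversion, renamed to make clear it returns a std::string_view.
inline std::string_view to_string_view(const bytes& b) {
    return {reinterpret_cast<const char*>(b.data()), b.size()};
}
```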
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This PR enables compaction tasks to verify the integrity of the input data through checksum and digest checks. The mechanism for integrity checking was introduced in previous PRs (#20207, #20720) as a built-in functionality of the input streams. This PR integrates this mechanism with compaction. The change applies to all compaction types and covers both compressed and uncompressed SSTables adhering to the 3.x format. If a compaction task reads only part of an SSTable, then only the per-chunk checksums are verified, not the digest.
The PR consists of:
* Changes to mx readers to support integrity checking. The kl readers, considered as compatibility-only, were left unchanged. Also, integrity checking on single-partition reversed reads (`data_consume_reversed_partition()`) remains unsupported by mx readers as this is not used in compaction.
* Changes to `sstable` and `sstable_set` APIs to allow toggling integrity checks for mx readers.
* Activation of integrity checking for all compaction types.
* Tests for all compaction types with corrupted SSTables.
Integrity checks come at a cost. For uncompressed SSTables, the cost is the loading of the CRC and Digest components from disk, and the calculation of checksums and digest from the actual data. For compressed SSTables, checksums are stored in-place and are already checked on all reads, so the only extra cost is the loading and calculation of the digest. The measurements show a ~5% regression in compaction performance for uncompressed SSTables, and a negligible regression for compressed SSTables.
Command: `perf-sstable --smp=1 --cpuset=1 --poll-mode --mode=compaction --iterations=1000 --partitions 10000 --sstables=1 --key_size=4096 --num_columns=15 --column_size={32, 1024, 3500, 7000, 14500}`
Uncompressed SSTables:
```
+--------------+-----------------------+----------------------+------------+
| SSTable Size | No Integrity (p/sec) | Integrity (p/sec) | Regression |
+--------------+-----------------------+----------------------+------------+
| 50 MiB | 65175.59 +- 80.82 | 61814.63 +- 72.88 | 5.16% |
| 200 MiB | 41795.10 +- 60.39 | 39686.28 +- 45.05 | 5.05% |
| 500 MiB | 21087.41 +- 30.72 | 20092.93 +- 25.05 | 4.72% |
| 1 GiB | 12781.64 +- 21.77 | 12233.94 +- 21.71 | 4.29% |
| 2 GiB | 6629.99 +- 9.40 | 6377.13 +- 8.28 | 3.81% |
+--------------+-----------------------+----------------------+------------+
```
Compressed SSTables:
```
+--------------+-----------------------+----------------------+------------+
| SSTable Size | No Integrity (p/sec) | Integrity (p/sec) | Regression |
+--------------+-----------------------+----------------------+------------+
| 50 MiB | 53975.05 +- 63.18 | 53825.93 +- 62.28 | 0.28% |
| 200 MiB | 28687.94 +- 26.58 | 28689.41 +- 26.91 | 0% |
| 500 MiB | 13865.35 +- 15.50 | 13790.41 +- 14.88 | 0.54% |
| 1 GiB | 7858.10 +- 7.71 | 7829.75 +- 9.66 | 0.36% |
| 2 GiB | 4023.11 +- 2.43 | 4010.54 +- 2.55 | 0.31% |
+--------------+-----------------------+----------------------+------------+
(p/sec = partitions/sec)
```
Refs #19071.
New feature, no backport is needed.
Closes scylladb/scylladb#21153
* github.com:scylladb/scylladb:
test: Add test for compaction with corrupted SSTables
compaction: Enable integrity checks for all compaction types
sstables: Add integrity option to factories for sstable_set readers
sstables: Add integrity option to sstable::make_reader()
sstables: Add integrity option to mx::make_reader()
sstables: Load checksums and digests in mx full-scan reader
sstables: Add integrity option to data_consume_single_partition()
sstables: Disengage integrity_check from sstable class
sstables: Allow data sources to disable digest check
data_consume_rows_context_m has a _column_value buffer it uses to read
key and column values into, preparing for parsing and consuming them.
This buffer is reset (released) in a few different cases:
* When using it for a key - after consuming its content
* When using it for a column value - when a column has no value
However, the buffer is not released when used for a column value and the
column is consumed. This means that if a large column is read from the
sstable, this buffer can potentially linger and keep consuming memory
until either one of the other release scenarios is hit, or the reader is
destroyed.
Add a third release scenario, releasing the buffer after the row end was
consumed. This allows the buffer to be re-used between columns of the
same row, at the same time ensuring that a large buffer will not linger.
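A hypothetical sketch of the third release point; the class and method names below are illustrative stand-ins, not the actual consumer code:
```cpp
// Hypothetical sketch of the fix described above.
#include <seastar/core/temporary_buffer.hh>

class row_consumer_sketch {
    seastar::temporary_buffer<char> _column_value;  // reused across values

public:
    void consume_column_value() {
        // ... hand _column_value to the consumer ...
        // (before the fix, the buffer stayed allocated here)
    }

    void consume_row_end() {
        // ... emit the row to the consumer ...
        // New, third release point: drop the buffer once the row is done, so
        // a large value cannot linger for the lifetime of the reader.
        _column_value = {};
    }
};
```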
This patch can almost halve the memory consumption of reads in certain
circumstances. Case in point: the test
test_reader_concurrency_semaphore_memory_limit_engages starts to fail
after this fix, because the read doesn't trigger the OOM limit anymore
and needs doubling of the concurrency to keep passing.
This issue was found in a dtest
(`test_ics_refresh_with_big_sstable_files`), which writes some large
cells of up to 7MiB. After reading the row containing this large cell,
the reader holds on to the 7MiB buffer causing the semaphore's OOM
protection to kick in down the line.
Fixes: https://github.com/scylladb/scylladb/issues/21160
Closes scylladb/scylladb#21132
In a previous patch we added support for integrity checking in the mx
full-scan reader.
Do the same for the mx reader, which is the one used by all compaction
types except for scrub compaction. The mx reader should now support
integrity checking for single-partition and multi-partition reads.
Single-partition reversed reads were excluded from this patch because
they are not used in compaction.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
In 716fc487fd we introduced integrity checking in the mx crawling reader
(later renamed to full-scan reader in 6250ff18eb).
When integrity checking is enabled, the full-scan reader expects that
the checksum and digest components have been loaded from disk by the
caller. This is true for the validation path, in which
`sstable::validate()` loads the components before creating the full-scan
reader, but it doesn't hold if a full-scan reader is created directly by
a higher-level function through `sstable::make_full_scan_reader()`.
As part of the effort to enable integrity checking for compaction, this
becomes a blocker for scrub compaction, which relies solely on full-scan
readers.
Solve this by allowing the mx full-scan reader to load the checksum and
digest components internally. The loading is an asynchronous operation,
so it has to be deferred until the first buffer fill.
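A hypothetical sketch of the "defer the async load until the first fill" pattern; the class and member names are illustrative, not the actual reader:
```cpp
// Hypothetical sketch of deferring an async load to the first buffer fill.
#include <seastar/core/future.hh>

class full_scan_reader_sketch {
    bool _integrity_check;
    bool _components_loaded = false;

    seastar::future<> maybe_load_integrity_components() {
        if (!_integrity_check || _components_loaded) {
            return seastar::make_ready_future<>();
        }
        // load the Checksum and Digest components from disk (async), then:
        _components_loaded = true;
        return seastar::make_ready_future<>();
    }

public:
    explicit full_scan_reader_sketch(bool integrity_check)
        : _integrity_check(integrity_check) {}

    seastar::future<> fill_buffer() {
        // The constructor cannot wait, so the async load is deferred to the
        // first fill_buffer() call.
        return maybe_load_integrity_components().then([this] {
            // ... actually fill the buffer from the data stream ...
            return seastar::make_ready_future<>();
        });
    }
};
```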
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The `integrity_check` flag was first introduced as a parameter in
`sstable::data_stream()` to support creating input streams with
integrity checking. As such, it was defined in the sstable class.
However, we also use this flag in the kl/mx full-scan readers, and, in
a later patch, we will use it in `class sstable_set` as well.
Move the definition into `types_fwd.hh` since it is no longer bound to
the sstable class.
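As a sketch, such a free-standing flag could be a seastar bool_class tag type; the exact definition in types_fwd.hh may differ:
```cpp
// Sketch of what a free-standing flag type could look like; illustrative only.
#pragma once

#include <seastar/util/bool_class.hh>

namespace sstables {

// A strongly-typed yes/no flag, usable by sstable, the full-scan readers and
// sstable_set alike, without depending on the sstable class definition.
using integrity_check = seastar::bool_class<struct integrity_check_tag>;

}  // namespace sstables
```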
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
To reduce the dependency load, replace use of boost ranges
with the std equivalent.
Files that lost the indirect boost dependency have it added as a
direct dependency.
Single-row reads from large partition issue 64 KiB reads to the data file,
which is equal to the default span of the promoted index block in the data file.
If users would want to increase selectivity of the index to speed up single-row reads,
this won't be effective. The reason is that the reader uses promoted index
to look up the start position in the data file of the read, but end position
will in practice extend to the next partition, and amount of I/O will be
determined by the underlying file input stream implementation and its
read-ahead heuristics. By default, that results in at least 2 IOs 32KB each.
There is already infrastructure to look up the end position based on the upper
bound of the read, in anticipation of sharing the promoted index cache,
but it's not effective because it's a non-populating lookup and the upper
bound cursor has its own private cached_promoted_index, which is cold
when positions are computed. It's non-populating on purpose, to avoid
extra index file IO to read upper bound. In case upper bound is far-enough
from the lower bound, this will only increase the cost of the read.
The solution employed here is to warm up the lower bound cursor's
cache before positions are computed, and use that cursor for
non-populating lookup of the upper bound.
We use the lower bound cursor and the slice's lower bound so that we
read the same blocks as later lower-bound slicing would, so that we
don't incur extra IO for cases where looking up upper bound is not
worth it, that is when upper bound is far from the lower bound. If
upper bound is near lower bound, then warming up using lower bound
will populate cached_promoted_index with blocks which will allow us to
locate the upper bound block accurately. This is especially important
for single-row reads, where the bounds are around the same key. In
this case we want to read the data file range which belongs to a
single promoted index block. It doesn't matter that the upper bound
is not exactly the same. They both will likely lie in the same block,
and if not, binary search will bring adjacent blocks into cache. Even
if upper bound is not near, the binary search will populate the cache
with blocks which can be used to narrow down the data file range
somewhat.
Fixes #10030.
The change was tested with perf-fast-forward.
I populated the data set with `column_index_size_in_kb` set to 1:
```
scylla perf-fast-forward --populate --run-tests=large-partition-slicing --column-index-size-in-kb=1
```
Test run:
```
build/release/scylla perf-fast-forward --run-tests=large-partition-select-few-rows -c1 --keep-cache-across-test-cases --test-case-duration=0
```
This test issues two reads of subsequent keys from the middle of a large partition (1M rows in total). The first read will miss in the index file page cache, the second read will hit.
Notice that before the change, the second read issued 2 aio requests worth of 64KiB in total.
After the change, the second read issued 1 aio request worth of 2 KiB. That's because the promoted index block is larger than 1 KiB.
I verified using logging that the data file range matches a single promoted index block.
Also, the first read which misses in cache is still faster after the change.
Before:
```
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride rows time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
500000 1 0.009802 1 1 102 0 102 102 21.0 21 196 2 1 0 1 1 0 0 0 568 269 4716050 53.4%
500001 1 0.000321 1 1 3113 0 3113 3113 2.0 2 64 1 0 1 0 0 0 0 0 116 26 555110 45.0%
```
After:
```
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride rows time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
500000 1 0.009609 1 1 104 0 104 104 20.0 20 137 2 1 0 1 1 0 0 0 561 268 4633407 43.1%
500001 1 0.000217 1 1 4602 0 4602 4602 1.0 1 2 1 0 1 0 0 0 0 0 110 26 313882 64.1%
```
Backports: none, not a regression
Closes scylladb/scylladb#20522
* github.com:scylladb/scylladb:
perf: perf_fast_forward: Add test case for querying missing rows
perf-fast-forward: Allow overriding promoted index block size
perf-fast-forward: Test subsequent key reads from the middle in test_large_partition_select_few_rows
perf-fast-forward: Allow adding key offset in test_large_partition_select_few_rows
perf-fast-forward: Use single-partition reads in test_large_partition_select_few_rows
sstables: bsearch_clustered_cursor: Add more tracing points
sstables: reader: Log data file range
sstables: bsearch_clustered_cursor: Unify skip_info logging
sstables: bsearch_clustered_cursor: Narrow down range using "end" position of the block
sstables: bsearch_clustered_cursor: Skip even to the first block
test: sstables: sstable_3_x_test: Improve failure message
sstables: mx: writer: Never include partition_end marker in promoted index block width
sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions
sstables: clustered_cursor: Track current block
To reduce dependency load, use std ranges instead of boost ranges.
The std::ranges::{lower,upper}_bound don't support heterogeneous lookup,
but a more natural solution is to use a projection to search for the name,
so we use that and the custom comparator is removed.
Many callers are converted as well due to poor interoperability between
boost ranges and std ranges.
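For illustration, a standalone sketch of the projection-based lookup (the types and data here are made up, not the actual code):
```cpp
// Illustrative only: comparator-vs-projection difference described above.
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

struct column_def {
    std::string name;
    int id;
};

int main() {
    std::vector<column_def> columns{{"a", 1}, {"b", 2}, {"d", 4}};  // sorted by name

    // Before (boost / std with a custom comparator mixing element and key types):
    //   auto it = boost::range::lower_bound(columns, "c", name_less{});

    // After: search on the projected member, no custom comparator type needed.
    auto it = std::ranges::lower_bound(columns, std::string("c"), std::less<>{},
                                       &column_def::name);
    return (it != columns.end() && it->name == "d") ? 0 : 1;
}
```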
This is an optimization.
Example:
block0: start=aaa, end=aaA
block1: start=bbb, end=bbB
block2: whatever
Before the patch, advance_to("aAA") would skip to block0, and upper
bound probe would skip to block1. This way, the reader would read the
range of block0 from the data file.
After the patch, "end" position is taken into account, so
advance_to("aAA") will notice that block0 doesn't contain the position
and will skip to block1. This is especially important for dense
indexes, as it allows us to skip accessing data file if the search key
is missing.
It also solves an edge case related to the fact that single-row
reads use a range whose positions are not equal
to the key, but are before(key) and after(key) for the lower bound and
upper bound respectively. Before the patch, advance_to(before("bbb"))
would skip to block0, because the position is before block1's
start. And the upper bound probe for after("bbb") would point to
block2. This way the read would scan block0 needlessly. After the
patch, advance_to(before("bbb")) will skip to block1 because we notice
based on "end" that block0 doesn't contain the position.
This change also ensures that the start position of the upper bound
entry of the after_key(pos), where pos is the last advance_to()
position, is warm in cache. This is needed to optimize single-row
reads with a dense index so that they always read exactly one promoted
index block. For this to work, probe_upper_bound() for the
after_key(row) always needs to find the upper bound block in
cache.
It was unnecessary to emit a skip info for the first block since it
immediately follows the partition start, but it is relevant to the
optimization of avoiding data reads for missing keys. This
optimization relies on the fact that lower bound position equals upper
bound position. If the reader's key is before the first key in the
partition and we don't arm the skip info for the first block, lower
bound would be equal to the partition start, and upper bound would be
equal to the first row's position, which are not equal.
Currently, it may happen that the last promoted index block includes
the partition_end marker. That's because we first write the partition
end marker and then emit the unclosed block. This behavior matches
Cassandra (checked in 3.x and 5.0.1).
This is problematic for ruling out data file reads based on index.
The width field is currently unused, but it will be used later where
the width of the last block is used to compute the skip position past
the last block for lookups which land after all keys in the
partition. If width includes the marker then such a skip would land in
the next partition, which is incorrect, as the reader context expects
a cell element. Even if that was recognized, it's wrong - if this is
not a single partition read (so upper bound is not at the next
partition too), then we would read from the wrong (next) partition.
We want to be able to make such skips in order to avoid unnecessary
data file IO for reads of missing rows. Currently, we would always
read the last block even if the key is past its "end" position.
Another way to solve this would be to propagate the "past the last
block" condition from the index cursor to the reader and let it deal
with it, but the logic for that would be complicated. With this fix,
there is no special logic required.
Single-row reads from large partition issue 64 KiB reads to the data file,
which is equal to the default span of the promoted index block in the data file.
If users would want to increase selectivity of the index to speed up single-row reads,
this won't be effective. The reason is that the reader uses promoted index
to look up the start position in the data file of the read, but end position
will in practice extend to the next partition, and amount of I/O will be
determined by the underlying file input stream implementation and its
read-ahead heuristics. By default, that results in at least 2 IOs 32KB each.
There is already infrastructure to look up the end position based on the upper
bound of the read, but it's not effective because it's a
non-populating lookup and the upper bound cursor has its own private
cached_promoted_index, which is cold when positions are computed. It's
non-populating on purpose, to avoid extra index file IO to read upper
bound. In case upper bound is far-enough from the lower bound, this
will only increase the cost of the read.
The solution employed here is to warm up the lower bound cursor's
cache before positions are computed, and use that cursor for
non-populating lookup of the upper bound.
We use the lower bound cursor and the slice's lower bound so that we
read the same blocks as later lower-bound slicing would, so that we
don't incur extra IO for cases where looking up upper bound is not
worth it, that is when upper bound is far from the lower bound. If
upper bound is near lower bound, then warming up using lower bound
will populate cached_promoted_index with blocks which will allow us to
locate the upper bound block accurately. This is especially important
for single-row reads, where the bounds are around the same key. In
this case we want to read the data file range which belongs to a
single promoted index block. It doesn't matter that the upper bound
is not exactly the same. They both will likely lie in the same block,
and if not, binary search will bring adjacent blocks into cache. Even
if upper bound is not near, the binary search will populate the cache
with blocks which can be used to narrow down the data file range
somewhat.
Fixes #10030.
The change was tested with perf-fast-forward.
I populated the data set with `column_index_size_in_kb` set to 1:
```
scylla perf-fast-forward --populate --run-tests=large-partition-slicing --column-index-size-in-kb=1
```
Test run:
```
build/release/scylla perf-fast-forward --run-tests=large-partition-select-few-rows -c1 --keep-cache-across-test-cases --test-case-duration=0
```
This test reads two rows with subsequent keys from the middle of a large
partition (1M rows in total). The first read will miss in the index file
page cache, the second read will hit.
Notice that before the change, the second read issued 2 aio requests worth of 64KiB in total.
After the change, the second read issued 1 aio request worth of 2 KiB. That's because the promoted index block is larger than 1 KiB.
I verified using logging that the data file range matches a single promoted index block.
Also, the first read which misses in cache is still faster after the change.
Before:
```
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride rows time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
500000 1 0.009802 1 1 102 0 102 102 21.0 21 196 2 1 0 1 1 0 0 0 568 269 4716050 53.4%
500001 1 0.000321 1 1 3113 0 3113 3113 2.0 2 64 1 0 1 0 0 0 0 0 116 26 555110 45.0%
```
After:
```
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride rows time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
500000 1 0.009609 1 1 104 0 104 104 20.0 20 137 2 1 0 1 1 0 0 0 561 268 4633407 43.1%
500001 1 0.000217 1 1 4602 0 4602 4602 1.0 1 2 1 0 1 0 0 0 0 0 110 26 313882 64.1%
```
(cherry picked from commit dfb339376aff1ed961b26c4759b1604f7df35e54)
Will be needed by the reader to jump to the current block even if we
already advanced to it before, when setting up the reader context.
We want to advance to the lower bound earlier, before the parser skips to
the lower bound. We want that in order to set the input stream's data file
range based on the index. If we didn't have access to the current block
and used the result from advance_to(), the parser would think we're
already in the block which has the lower_bound when it attempts to skip,
and would not skip, falling back to scanning.
In order to later use the formatter for the inner class
promoted_index_block, which is defined out of line after
cached_promoted_index class definition.
This fixes a use-after-free bug when parsing clustering key across
pages.
Clustering key index lookup is based on the index file page cache. We
do a binary search within the index, which involves parsing index
blocks touched by the algorithm. Index file pages are 4 KB chunks
which are stored in LSA.
To parse the first key of the block, we reuse clustering_parser, which
is also used when parsing the data file. The parser is stateful and
accepts consecutive chunks as temporary_buffers. The parser is
supposed to keep its state across chunks.
In b1b5bda, the parser was changed to keep shared fragments of the
buffer passed to the parser in its internal state (across pages)
rather than copy the fragments into a new buffer. This is problematic
when buffers come from the page cache because LSA buffers may be moved
around or evicted. So the temporary_buffer, which is a view on the LSA
buffer, is valid only for the duration of a single consume() call to
the parser.
If the blob which is parsed (e.g. variable-length clustering key
component) spans pages, the fragments stored in the parser may be
invalidated before the component is fully parsed. As a result, the
parsed clustering key may have incorrect component values. This never
causes parsing errors because the "length" field is always parsed from
the current buffer, which is valid, and component parsing will end at
the right place in the next (valid) buffer.
The problematic path for clustering_key parsing is the one which calls
primitive_consumer::read_bytes(), which is called for example for text
components. Fixed-size components are not parsed like this, they store
the intermediate state by copying data.
This may cause incorrect clustering keys to be parsed when doing
binary search in the index, diverting the search to an incorrect
block.
The solution is to use page_view instead of temporary_buffer, which
can be safely shared via share() and stored across allocating
sections. The page_view maintains its hold on the LSA buffer even
across allocating sections.
Fixes #20766
When reset() is done due to an allocating section retry, it can
theoretically happen at an arbitrary point. So we should not assume that
the previous parsing finished and already reset the state. We should
reset all the fields.
Parser's state was not reset when allocating section was retried.
This doesn't cause problems in practice, because reserves are enough
to cover allocation demands of parsing clustering keys, which are at
most 64K in size. But it's still potentially unsafe and needs fixing.
"crawling" is a little bit obscure in this context. so let's rename this
class to reflect the fact that this reader only reads the entire content
of the sstable.
both crawling reader for kl and mx formats are renamed. also, in order
to be consistent, all "crawling reader" in variable names are updated
as well.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
When purging a regular tombstone, consult the min_live_timestamp, if available.
This is safe since we don't need to protect dead data from resurrection, as it is already dead.
For shadowable tombstones, consult the min_memtable_live_row_marker_timestamp,
if available, otherwise fall back to the min_live_timestamp.
If we see in a view table a shadowable tombstone with time T, then in any row where the row marker's timestamp is higher than T the shadowable tombstone is completely ignored and it doesn't hide any data in any column, so the shadowable tombstone can be safely purged without any effect or risk resurrecting any deleted data.
In other words, rows which might cause problems for purging a shadowable tombstone with time T are rows with row markers older or equal T. So to know if a whole sstable can cause problems for shadowable tombstone of time T, we need to check if the sstable's oldest row marker (and not oldest column) is older or equal T. And the same check applies similarly to the memtable.
If both extended timestamp statistics are missing, fall back to the legacy (and inaccurate) min_timestamp.
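A hypothetical sketch of the purging rule above; the struct and function names are illustrative, not the actual compaction code:
```cpp
// Hypothetical sketch of the rule described above.
#include <cstdint>
#include <optional>

using api_timestamp = int64_t;

struct timestamp_stats {
    std::optional<api_timestamp> min_live_timestamp;
    std::optional<api_timestamp> min_live_row_marker_timestamp;
    api_timestamp min_timestamp;   // legacy stat, may under-estimate
};

// An sstable (or memtable) can block purging a shadowable tombstone with
// timestamp T only if it may contain a row marker with timestamp <= T.
inline bool may_block_shadowable_purge(const timestamp_stats& s, api_timestamp t) {
    if (s.min_live_row_marker_timestamp) {
        return *s.min_live_row_marker_timestamp <= t;
    }
    if (s.min_live_timestamp) {
        return *s.min_live_timestamp <= t;
    }
    return s.min_timestamp <= t;   // fall back to the legacy, inaccurate stat
}
```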
Fixes scylladb/scylladb#20423
Fixes scylladb/scylladb#20424
> [!NOTE]
> no backport needed at this time
> We may consider backport later on after given some soak time in master/enterprise
> since we do see tombstone accumulation in the field under some materialized views workloads
Closes scylladb/scylladb#20446
* github.com:scylladb/scylladb:
cql-pytest: add test_compaction_tombstone_gc
sstable_compaction_test: add mv_tombstone_purge_test
sstable_compaction_test: tombstone_purge_test: test that old deleted data do not inhibit tombstone garbage collection
sstable_compaction_test: tombstone_purge_test: add testlog debugging
sstable_compaction_test: tombstone_purge_test: make_expiring: use next_timestamp
sstable, compaction: add debug logging for extended min timestamp stats
compaction: get_max_purgeable_timestamp: use memtable and sstable extended timestamp stats
compaction: define max_purgeable_fn
tombstone: can_gc_fn: move declaration to compaction_garbage_collector.hh
sstables: scylla_metadata: add ext_timestamp_stats
compaction_group, storage_group, table_state: add extended timestamp stats getters
sstables, memtable: track live timestamps
memtable_encoding_stats_collector: update row_marker: do nothing if missing
before this change, we relied on `using namespace seastar` to use
`seastar::format()` without qualifying `format()` with its
namespace. this worked fine until we changed the type of the format
string parameter of `seastar::format()` from `const char*` to
`fmt::format_string<...>`. this change practically invited
`seastar::format()` into the club of `std::format()` and `fmt::format()`,
where all members accept a templated `fmt` parameter, and
`seastar::format()` is no longer the obvious best candidate.
although argument-dependent lookup (ADL for short) favors the
function which is in the same namespace as its parameters,
`using namespace` makes `seastar::format()` an equally good match,
so both `std::format()` and `seastar::format()` are considered
as candidates.
that is what happens in scylladb at quite a few call sites of
`format()`, hence ADL is not able to tell which function wins
the name lookup:
```
/__w/scylladb/scylladb/mutation/mutation_fragment_stream_validator.cc:265:12: error: call to 'format' is ambiguous
265 | return format("{} ({}.{} {})", _name_view, s.ks_name(), s.cf_name(), s.id());
| ^~~~~~
/usr/bin/../lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/format:4290:5: note: candidate function [with _Args = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
4290 | format(format_string<_Args...> __fmt, _Args&&... __args)
| ^
/__w/scylladb/scylladb/seastar/include/seastar/core/print.hh:143:1: note: candidate function [with A = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
143 | format(fmt::format_string<A...> fmt, A&&... a) {
| ^
```
in this change, we change all `format()` calls to either `fmt::format()`
or `seastar::format()` with the following rules:
- if the caller expects an `sstring` or `std::string_view`, change to
  `seastar::format()`
- if the caller expects an `std::string`, change to `fmt::format()`,
  because `sstring::operator std::basic_string()` would incur a deep
  copy.
we will need another change to enable scylladb to compile with the
latest seastar, namely to pass the format string as a templated
parameter down to helper functions which format their parameters.
to minimize the scope of this change, let's include that change when
bumping up the seastar submodule, as that change will depend on
the seastar change.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Extend the `sstable::validate()` to validate the checksums of
uncompressed SSTables. Given that this is already supported for
compressed SSTables, this allows us to provide consistent behavior
across any type of SSTable, be it either compressed or uncompressed.
The most prominent use case for this is scrub/validate, which is now
able to detect file-level corruption in uncompressed SSTables as
well.
Note that this change will not affect normal user reads which skip
checksum validation altogether.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>