scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-22 17:40:34 +00:00

Author	SHA1	Message	Date
Taras Veretilnyk	bc2e83bc1f	sstables: store digest of all sstable components in scylla metadata This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream.	2025-12-04 21:00:09 +01:00
Taras Veretilnyk	a191503ddf	sstables: Extract file writer closing logic into separate methods Refactor the consume_end_of_stream() method by extracting the inline file writer closing logic into dedicated methods: - close_index_writer() - close_partitions_writer() - close_rows_writer()	2025-12-02 13:07:41 +01:00
Michał Chojnowski	3cf51cb9e8	sstables: fix some typos in comments I added those typos recently, and spellcheckers complain. Closes scylladb/scylladb#26376	2025-10-07 13:20:06 +03:00
Michał Chojnowski	cee4011e7a	sstables/trie/bti_partition_index_writer: in add(), get the key hash from the caller Partitions.db internally uses a piece of the partition key murmur hash (the same hash which is used to compute the token and the relevant bits in the bloom filter). Before this patch, the Partitions.db writer computes the hash internally from the `sstables::partition_key`. That's a waste, because this hash is also computed for bloom filter purposes just before that, in the owning sstable writer. So in this patch we let the caller pass that hash here instead.	2025-09-29 13:01:21 +02:00
Michał Chojnowski	f8e3d5e7c2	sstables/mx: make Index and Summary components optional In previous patches we (hopefully) modified all users of Index and Summary components so that they don't longer need those components to exist. (And can use Partitions and Rows components instead).	2025-09-29 13:01:21 +02:00
Michał Chojnowski	b1984d6798	sstables: implement an alternative way to rebuild bloom filters for sstables without Index For efficiency, the cardinality of the bloom filter (i.e. the number of partition keys which will be written into the sstable) has to be known before elements are inserted into the filter. In some cases (e.g. memtables flush) this number is known exactly. But in others (e.g. repair) it can only be estimated, and the estimation might be very wrong, leading to an oversized filter. Because of that, some time ago we added a piece of logic (ran after the sstable is written, but before it's sealed) which looks at the actual number of written partitions, compares it to the initial estimate (on which the size of the bloom filter was based on), and if the difference is unacceptably large, it rewrites the bloom filter from partition keys contained in Index.db. But the idea to rebuild the bloom filters from index files isn't going to work with BTI indexes, because they don't store whole partition keys. If we want sstables which don't have Index.db files, we need some other way to deal with oversized filters. Partition keys can be recovered from Data.db, but that would often be way too expensive. This patch adds another way. We introduce a new component file, TemporaryHashes. This component, if written at all, contains the 16-byte murmur hash for every partition key, in order, and can be used in place of Index to reconstruct the bloom filter. (Our bloom filters are actually built from the set of murmur hashes of partition keys. The first step of inserting a partition key into a filter is hashing the key. Remembering the hashes is sufficient to build the filter later, without looking at partition keys again.) As of this patch, if the Index component is not being written, we don't allocate and populate a bloom filter during the Data.db write. Instead, we write the murmur hashes to TemporaryHashes, and only later, after the Data write finishes, we allocate the optimal-size, bloom filter, we read the hashes back from TemporaryHashes, and we populate the filter with them. That is suboptimal. Writing the hashes to disk (or worse, to S3) and reading them back is more expensive than building the bloom filter during the main Data pass. So ideally it should be avoided in cases where we know in advance that the partition key count estimate is good enough. (Which should be the case in flushes and compactions). But we defer that to a future patch. (Such a change would involve passing some flag to the sstable writer if the cardinality estimate is trustworthy, and not creating TemporaryHashes if the estimate is trustworthy).	2025-09-29 13:01:21 +02:00
Michał Chojnowski	e0fda9ae6f	sstables/mx/writer: populate BTI index files In the previous patch we added code responsible for creating and opening Partitions.db and Rows.db, but we left those files empty. In this patch, we populate the files using `trie::bti_row_index_writer` and `trie::bti_partition_index_writer`. Note: for the row index, we insert the same clustering blocks to both indexes. The logic for choosing the size of the blocks hasn't been changed in any way. Much of this patch has to do with propagating the current range tombstone down to all places which can start a new clustering block. The reason we need that is that, for each clustering block, BIG indexes store the range tombstone succeeding the block (i.e. the range tombstone in between the given block and its successor) BTI indexes store the range tombstone preceding the block, (i.e. the range tombstone in between the given block and its predecessor). So before the patch there's no code which looks at the current tombstone when starting the block, only when ending the block. This patch adds an extra copy for each `decorated_key`. This is mostly unavoidable -- the BTI partition writer just has to remember the key until its successor appears, to find the common prefix. (We could avoid the key copy if the BTI isn't used, though. We don't do that in this patch, we just let the copy happen).	2025-09-29 13:01:21 +02:00
Michał Chojnowski	cdcf34b3a0	sstables: create and open BTI index files, when enabled This patch adds code responsible for creation and opening of BTI index components (Rows.db, Partitions.db) when BTI index writing is enabled. (It is enabled if the cluster feature is enabled and the relevant config entry permits it). The files are empty for now, and are never read. We will populate and use them in following patches.	2025-09-29 13:01:21 +02:00
Michał Chojnowski	e04ee6d5f6	sstables/mx/writer: make `_pi_write_m.partition_tombstone` a `sstables::deletion_time` There's a test (boost/sstable_compaction_test.cc::tombstone_purge_test) which tests the value of `_stats.capped_tombstone_deletion_time`. Before this patch, for "ms" sstables, `to_deletion_time` would have be called twice for each written partition tombstone, which would fail the test. Since `_pi_write_m.partition_tombstone` always ends up being converted from `tombstone` to `sstables::deletion_time` anyway, let's just make it a `sstables::deletion_time` to begin with. This will ensure that `to_deletion_time` will be able to be only called once per partition tombstone.	2025-09-29 13:01:20 +02:00
Michał Chojnowski	416f5d64d4	sstables/mx/writer: rename _pi_write_m.tomb to partition_tombstone With `tomb` it's not obvious enough what tombstone this field is about.	2025-09-17 12:22:40 +02:00
Michał Chojnowski	94088f0a41	sstables/writer: add an accessor for the current write position in Data.db It will be used by index tests to know the ground truth for where each partition and row are written, so that this can be checked against the index.	2025-09-07 00:32:02 +02:00
Michał Chojnowski	30dad06c9a	sstables/mx: move clustering_info from writer.cc to types.hh We will use this type as the input to the BTI row index writer. Since it will be implemented in other translation units, the definition of the type has to be moved to a header.	2025-08-14 02:06:33 +02:00
Botond Dénes	592ca789e2	sstables/mx/writer: handler rows with empty keys Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written, consequently if the key is empty, it is omitted entirely. When parsing sstables, the parsing code unconditionally parses a full prefix. This mis-match results in parsing failures, as the parser parses part of the row content as a key resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions. Use the recently introduced corrupt_data_handler to handle rows with such corrupt keys. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery.	2025-06-25 08:41:29 +03:00
Michał Chojnowski	b18ddcb92e	sstables: delegate compressor creation to the compressor factory Remove `compressor::create()`. This enforces that compressors are only created through the `sstable_compressor_factory`. Unlike the synchronous `compressor::create()`, the factory will be able to create dict-aware compressors.	2025-04-01 00:07:28 +02:00
Michał Chojnowski	8e611536b0	sstables/compress: move ownership of `compressor` to `sstable::compression` SSTable readers and writers use `compressor` objects to compress and decompress chunks of SSTable data files. `compressor` objects are read-only, so only one of them is needed for each SSTable. Before this commit, each reader and writer has its own `compressor` object. This isn't necessary, but it's okay. But later in this series it will stop being okay, because the creation of a `compressor` will become an expensive cross-shard operation (because it might require sharing a compression dictionary from another shard). So we have to adjust the code so that there is only once `compressor` per sstable, not one per reader/writer. We stuff the ownership of this compressor into `sstable::compression`. To make the ownership clear, we remove `compression_ptr` shared pointers from readers and writers, and make them access the compressor via the `sstable::compression` instead.	2025-04-01 00:07:27 +02:00
Avi Kivity	73e4a3c581	sstables: store features early in write path sstable features indicate that an sstable has some extension, or that some bug was fixed. They allow us to know if we can rely on certain properties in a read sstables. Currently, sstable features are set early in the read path (when we read the scylla metadata file) and very late in the write path (when we write the scylla metadata file just before sealing the sstable). However, we happen to read features before we set them in the write path - when we resize the bloom filter for a newly written sstable we instantiate an index reader, and that depends on some features. As a result, we read a disengaged optional (for the scylla metadata component) as if it was engaged. This somehow worked so far, but fails with libstdc++ hash table implementation. Fix it by moving storage of the features to the sstable itself, and setting it early in the write path. Fixes #23484 Closes scylladb/scylladb#23485	2025-03-31 09:33:56 +03:00
Pavel Emelyanov	68c41f0459	sstables: Make file_writer keep component_name on board The class in question is a wrapper around output_stream that writes, flushes and closes the stream in async context. For logging it also keeps the component filename on board, and now it's good time to patch it and keep the component_filename instead. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 13:03:29 +03:00
Pavel Emelyanov	1ba91e28cb	sstables: Make get_filename() return component_name Similarly to previous patches -- mostly the result is used as log argument. The remaining users include - scylla sstable tool that dumps component names to json output - API endpoint that returns component names to user - tests these are all good to explicitly convert component_names to strings. There are few more places that expect strings instead of component name objects. For now they also use fmt::to_string() explicitly, partially it will be fixed later, mostly -- as future follow-ups. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 13:03:29 +03:00
Pavel Emelyanov	80e0030613	sstables: Make sstable::index_filename() return component_name Most of the method callers use it as log parameter. There are few more places that push it to malformed_sstable_exception, which immediately converts it to string, so this patch makes the exception be constructed with the component_name either. And there's one more place that passes this string to file_writer constructor. For now, convert it to string explicitly, but next patches will fix that place to use pure component_name too. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 13:01:23 +03:00
Pavel Emelyanov	dcc9167734	sstables: Rename filename($component) calls to ${component}_filename() There's a generic sstable::filename(component_type) method that returns a file name for the given component. For "popular" components, namely TOC, Data and Index there are dedicated sstable methods to get their names. Fix existing callers of the generic method to use the former. It's shorter, nicer and makes further patching simpler. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 12:45:21 +03:00
Pavel Emelyanov	eff61b167c	treewide: Reduce db/config.hh header fanout Drop it from files that obviously don't need it. Also kill some forward declarations while at it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#22979	2025-02-25 15:16:40 +01:00
Kefu Chai	0237913337	sstables: Migrate from boost::adaptors::indexed to std::views::enumerate This change modernizes the codebase by: - Replacing Boost's indexed adaptor with C++20's std::views::enumerate - Removing unnecessary Boost header inclusion With this change, we can: - Reduce external dependencies - Leverage standard library features - Improve long-term code maintainability Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22469	2025-01-26 20:51:14 +02:00
Kefu Chai	93be8f3a0c	db,sstables: migate boost::range::stable_partition to std library now that we are allowed to use C++23. we now have the luxury of using `std::ranges::stable_partition`. in this change, we: - replace `boost::range::stable_parition()` to `std::ranges::stable_parition()` - since `std::ranges::stable_parition()` returns a subrange instead of an iterator, change the names of variables which were previously used for holding the return value of `boost::range::stable_partition()` accordingly for better readability. - remove unused `#include` of boost headers Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21911	2024-12-19 14:56:07 +02:00
Avi Kivity	f3eade2f62	treewide: relicense to ScyllaDB-Source-Available-1.0 Drop the AGPL license in favor of a source-available license. See the blog post [1] for details. [1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/	2024-12-18 17:45:13 +02:00
Avi Kivity	94c21e5c05	Merge 'sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions' from Tomasz Grabiec Single-row reads from large partition issue 64 KiB reads to the data file, which is equal to the default span of the promoted index block in the data file. If users would want to increase selectivity of the index to speed up single-row reads, this won't be effective. The reason is that the reader uses promoted index to look up the start position in the data file of the read, but end position will in practice extend to the next partition, and amount of I/O will be determined by the underlying file input stream implementation and its read-ahead heuristics. By default, that results in at least 2 IOs 32KB each. There is already infrastructure to lookup end position based on upper bound of the read, in anticipation for sharing the promoted index cache, but it's not effective becasue it's a non-populating lookup and the upper bound cursor has its own private cached_promoted_index, which is cold when positions are computed. It's non-populating on purpose, to avoid extra index file IO to read upper bound. In case upper bound is far-enough from the lower bound, this will only increase the cost of the read. The solution employed here is to warm up the lower bound cursor's cache before positions are computed, and use that cursor for non-populating lookup of the upper bound. We use the lower bound cursor and the slice's lower bound so that we read the same blocks as later lower-bound slicing would, so that we don't incur extra IO for cases where looking up upper bound is not worth it, that is when upper bound is far from the lower bound. If upper bound is near lower bound, then warming up using lower bound will populate cached_promoted_index with blocks which will allow us to locate the upper bound block accurately. This is especially important for single-row reads, where the bounds are around the same key. In this case we want to read the data file range which belongs to a single promoted index block. It doesn't matter that the upper bound is not exactly the same. They both will likely lie in the same block, and if not, binary search will bring adjacent blocks into cache. Even if upper bound is not near, the binary search will populate the cache with blocks which can be used to narrow down the data file range somewhat. Fixes #10030. The change was tested with perf-fast-forward. I populated the data set with `column_index_size_in_kb` set to 1 scylla perf-fast-forward --populate --run-tests=large-partition-slicing --column-index-size-in-kb=1 Test run: build/release/scylla perf-fast-forward --run-tests=large-partition-select-few-rows -c1 --keep-cache-across-test-cases --test-case-duration=0 This test issues two reads of subsequent keys from the middle of a large partition (1M rows in total). The first read will miss in the index file page cache, the second read will hit. Notice that before the change, the second read issued 2 aio requests worth of 64KiB in total. After the change, the second read issued 1 aio worth of 2 KiB. That's because promoted index block is larger than 1 KiB. I verified using logging that the data file range matches a single promoted index block. Also, the first read which misses in cache is still faster after the change. Before: ``` running: large-partition-select-few-rows on dataset large-part-ds1 Testing selecting few rows from a large partition: stride rows time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu 500000 1 0.009802 1 1 102 0 102 102 21.0 21 196 2 1 0 1 1 0 0 0 568 269 4716050 53.4% 500001 1 0.000321 1 1 3113 0 3113 3113 2.0 2 64 1 0 1 0 0 0 0 0 116 26 555110 45.0% ``` After: ``` running: large-partition-select-few-rows on dataset large-part-ds1 Testing selecting few rows from a large partition: stride rows time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu 500000 1 0.009609 1 1 104 0 104 104 20.0 20 137 2 1 0 1 1 0 0 0 561 268 4633407 43.1% 500001 1 0.000217 1 1 4602 0 4602 4602 1.0 1 2 1 0 1 0 0 0 0 0 110 26 313882 64.1% ``` Backports: none, not a regression Closes scylladb/scylladb#20522 * github.com:scylladb/scylladb: perf: perf_fast_forward: Add test case for querying missing rows perf-fast-forward: Allow overriding promoted index block size perf-fast-forward: Test subsequent key reads from the middle in test_large_partition_select_few_rows perf-fast-forward: Allow adding key offset in test_large_partition_select_few_rows perf-fast-forward: Use single-partition reads in test_large_partition_select_few_rows sstables: bsearch_clustered_cursor: Add more tracing points sstables: reader: Log data file range sstables: bsearch_clustered_cursor: Unify skip_info logging sstables: bsearch_clustered_cursor: Narrow down range using "end" position of the block sstables: bsearch_clustered_cursor: Skip even to the first block test: sstables: sstable_3_x_test: Improve failure message sstables: mx: writer: Never include partition_end marker in promoted index block width sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions sstables: clustered_cursor: Track current block	2024-10-28 21:13:23 +02:00
Avi Kivity	820509026f	schema: replace boost ranges with std ranges To reduce dependency load, use std ranges instead of boost ranges. The std::ranges::{lower,upper}_bound don't support heterogeneous lookup, but a more natural solution is to use a projection to search for the name, so we use that and the custom comparator is removed. Many callers are converted as well due to poor interoperability between boost ranges and std ranges.	2024-10-15 16:42:54 +03:00
Tomasz Grabiec	7f077893ed	sstables: mx: writer: Never include partition_end marker in promoted index block width Currently, it may happen that the last promoted index block includes the partition_end marker. That's because we first write the partition end marker and then emit the unclosed block. This behavior matches Cassandra (checked in 3.x and 5.0.1). This is problematic for ruling out data file reads based on index. The width field is currently unused, but it will be used later where the width of the last block is used to compute the skip position past the last block for lookups which land after all keys in the partition. If width includes the marker then such a skip would land in the next partition, which is incorrect, as the reader context expects a cell element. Even if that was recognized, it's wrong - if this is not a single partition read (so upper bound is not at the next partition too), then we would read from the wrong (next) partition. We want to be able to make such skips in order to avoid unnecessary data file IO for reads of missing rows. Currently, we would always read the last block even if the key is past its "end" position. Another way to solve this would be to propagate the "past the last block" condition from the index cursor to the reader and let it deal with it, but the logic for that would be complicated. With this fix, there is no special logic required.	2024-10-03 14:09:57 +02:00
Botond Dénes	c7c5817808	Merge 'Improve timestamp heuristics for tombstone garbage collection' from Benny Halevy When purging regular tombstone consult the min_live_timestamp, if available. This is safe since we don't need to protect dead data from resurrection, as it is already dead. For shadowable_tombstones, consult the min_memtable_live_row_marker_timestamp, if available, otherwise fallback to the min_live_timestamp. If we see in a view table a shadowable tombstone with time T, then in any row where the row marker's timestamp is higher than T the shadowable tombstone is completely ignored and it doesn't hide any data in any column, so the shadowable tombstone can be safely purged without any effect or risk resurrecting any deleted data. In other words, rows which might cause problems for purging a shadowable tombstone with time T are rows with row markers older or equal T. So to know if a whole sstable can cause problems for shadowable tombstone of time T, we need to check if the sstable's oldest row marker (and not oldest column) is older or equal T. And the same check applies similarly to the memtable. If both extended timestamp statistics are missing, fallback to the legacy (and inaccurate) min_timestamp. Fixes scylladb/scylladb#20423 Fixes scylladb/scylladb#20424 > [!NOTE] > no backport needed at this time > We may consider backport later on after given some soak time in master/enterprise > since we do see tombstone accumulation in the field under some materialized views workloads Closes scylladb/scylladb#20446 * github.com:scylladb/scylladb: cql-pytest: add test_compaction_tombstone_gc sstable_compaction_test: add mv_tombstone_purge_test sstable_compaction_test: tombstone_purge_test: test that old deleted data do not inhibit tombstone garbage collection sstable_compaction_test: tombstone_purge_test: add testlog debugging sstable_compaction_test: tombstone_purge_test: make_expiring: use next_timestamp sstable, compaction: add debug logging for extended min timestamp stats compaction: get_max_purgeable_timestamp: use memtable and sstable extended timestamp stats compaction: define max_purgeable_fn tombstone: can_gc_fn: move declaration to compaction_garbage_collector.hh sstables: scylla_metadata: add ext_timestamp_stats compaction_group, storage_group, table_state: add extended timestamp stats getters sstables, memtable: track live timestamps memtable_encoding_stats_collector: update row_marker: do nothing if missing	2024-09-13 08:56:51 +03:00
Kefu Chai	3e84d43f93	treewide: use seastar::format() or fmt::format() explicitly before this change, we rely on `using namespace seastar` to use `seastar::format()` without qualifying the `format()` with its namespace. this works fine until we changed the parameter type of format string `seastar::format()` from `const char*` to `fmt::format_string<...>`. this change practically invited `seastar::format()` to the club of `std::format()` and `fmt::format()`, where all members accept a templated parameter as its `fmt` parameter. and `seastar::format()` is not the best candidate anymore. despite that argument-dependent lookup (ADT for short) favors the function which is in the same namespace as its parameter, but `using namespace` makes `seastar::format()` more competitive, so both `std::format()` and `seastar::format()` are considered as the condidates. that is what is happening scylladb in quite a few caller sites of `format()`, hence ADT is not able to tell which function the winner in the name lookup: ``` /__w/scylladb/scylladb/mutation/mutation_fragment_stream_validator.cc:265:12: error: call to 'format' is ambiguous 265 \| return format("{} ({}.{} {})", _name_view, s.ks_name(), s.cf_name(), s.id()); \| ^~~~~~ /usr/bin/../lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/format:4290:5: note: candidate function [with _Args = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>] 4290 \| format(format_string<_Args...> __fmt, _Args&&... __args) \| ^ /__w/scylladb/scylladb/seastar/include/seastar/core/print.hh:143:1: note: candidate function [with A = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>] 143 \| format(fmt::format_string<A...> fmt, A&&... a) { \| ^ ``` in this change, we change all `format()` to either `fmt::format()` or `seastar::format()` with following rules: - if the caller expects an `sstring` or `std::string_view`, change to `seastar::format()` - if the caller expects an `std::string`, change to `fmt::format()`. because, `sstring::operator std::basic_string` would incur a deep copy. we will need another change to enable scylladb to compile with the latest seastar. namely, to pass the format string as a templated parameter down to helper functions which format their parameters. to miminize the scope of this change, let's include that change when bumping up the seastar submodule. as that change will depend on the seastar change. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-09-11 23:21:40 +03:00
Benny Halevy	4de4af954f	sstables: scylla_metadata: add ext_timestamp_stats Store and retrieve the optional extended timestamp statistics (min_live_timestamp and min_live_row_marker_timestamp) in the scylla_metadata component. Note that there is no need for a cluster feature to store those attributes since the scylla_metadata on-disk format is extensible so that old sstables can be read by new versions, seeing the extra stats is missing, and new sstables can be read by old versions that ignore unknown scylla metadata section types. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-09-10 19:05:57 +03:00
Benny Halevy	14d86a3a12	sstables, memtable: track live timestamps When garbage collecting tombstones, we care only about shadowing of live data. However, currently we track min/max timestamp of both live and dead data, but there is no problem with purging tombstones that shadow dead data (expired or shdowed by other tombstones in the sstable/memtable). Also, for shadowable tombstones, we track live row marker timestamps separately since, if the live row marker timestamp is greater than a shadowable tombstone timestamp, then the row marker would shadow the shadowable tombstone thus exposing the cells in that row, even if their timestasmp may be smaller than the shadow tombstone's. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-09-10 19:05:49 +03:00
Avi Kivity	aa1270a00c	treewide: change assert() to SCYLLA_ASSERT() assert() is traditionally disabled in release builds, but not in scylladb. This hasn't caused problems so far, but the latest abseil release includes a commit [1] that causes a 1000 insn/op regression when NDEBUG is not defined. Clearly, we must move towards a build system where NDEBUG is defined in release builds. But we can't just define it blindly without vetting all the assert() calls, as some were written with the expectation that they are enabled in release mode. To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT() macro in utils/assert.hh. This macro is always defined and is not conditional on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release mode. [1] `66ef711d68` Closes scylladb/scylladb#20006	2024-08-05 08:23:35 +03:00
Lakshmi Narayanan Sreethar	7b58fa2534	sstables: use _origin in write path Now that the origin is available inside the sstable object, no need to pass it to the methods called in the write path. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2024-07-16 20:44:28 +05:30
Lakshmi Narayanan Sreethar	b762a09dcd	sstable::open_sstable: pass and store origin Pass origin when opening the sstable from the writer and store it in the sstable object. This will make the origin available for the entire write path. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2024-07-16 20:43:30 +05:30
Michał Chojnowski	1a8ee69a43	sstables/mx/writer: when creating local_compression, use the sstables's schema, not the writer's There are two schema's associated with a sstable writer: the sstable's schema (i.e. the schema of the table at the time when the sstable object was created), and the writer's schema (equal to the schema of the reader which is feeding into the writer). It's easy to mix up the two and break something as a result. The writer's schema is needed to correctly interpret and serialize the data passing through the writer, and to populate the on-disk metadata about the on-disk schema. The sstables's schema is used to configure some parameters for newly created sstable, such as bloom filter false positive ratio, or compression. The problem fixed by this patch is that the writer was wrongly creating the compressor objects based on its own schema, but using them based based on the sstable's schema the sstable's schema. This patch forces the writer to use the sstable's schema for both.	2024-07-11 12:53:54 +02:00
Michał Chojnowski	d10b38ba5b	sstables/mx/writer: when creating filter, use the sstables's schema, not the writer's There are two schema's associated with a sstable writer: the sstable's schema (i.e. the schema of the table at the time when the sstable object was created), and the writer's schema (equal to the schema of the reader which is feeding into the writer). It's easy to mix up the two and break something as a result. The writer's schema is needed to correctly interpret and serialize the data passing through the writer, and to populate the on-disk metadata about the on-disk schema. The sstables's schema is used to configure some parameters for newly created sstable, such as bloom filter false positive ratio, or compression. The problem fixed by this patch is that the writer was wrongly creating the filter based on its own schema, while the layer outside the writer was interpreting it as if it was created with the sstable's schema. This patch forces the writer to pick the filter's parameters based on the sstable's schema instead.	2024-07-11 12:53:54 +02:00
Lakshmi Narayanan Sreethar	c80df8504c	sstables::maybe_rebuild_filter_from_index: log sstable origin Log the sstable origin when its bloom filter is being rebuilt. The origin has to be passed to the method by the caller as it is not available in the sstable object when the filter is rebuilt. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#19601	2024-07-04 10:01:23 +03:00
Lakshmi Narayanan Sreethar	21e463b108	sstables/mx/writer: rebuild bloom filters with bad partition estimates The bloom filters are built with partition estimates, as the actual partition count might not be available in all the cases. If the estimate was bad, the bloom filters might end up too large or too small than their optimal sizes. Rebuild such bloom filters with actual partition count before the filter is written to disk and the sstable is sealed. Fixes #19049 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2024-06-24 12:06:02 +05:30
Lakshmi Narayanan Sreethar	afc90657d6	sstables/mx/writer: add variable to track number of partitions consumed Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2024-06-24 12:06:02 +05:30
Ferenc Szili	b06af5b2b9	sstable: write dead_rows count to system.large_partitions	2024-05-02 11:49:10 +02:00
Ferenc Szili	63e724c974	sstable: added counter for dead rows	2024-05-02 11:49:10 +02:00
Ferenc Szili	98bec4e02a	sstable: large data handler needs to count range tombstones as rows When issuing warnings about partitions with the number of rows above a configured threshold, the large partitions handler does not take into consideration the number of range tombstone markers in the total rows count. This fix adds the number of range tombstone markers to the total number of rows and saves this total in system.large_partitions.rows (if it is above the threshold). It also adds a new column range_tombstones to the system.large_partitions table which only contains the number of range tombstone markers for the given partition. This PR fixes the first part of issue #13968 It does not cover distinguishing between live and dead rows. A subsequent PR will handle that.	2024-04-22 15:24:18 +02:00
Avi Kivity	7cb1c10fed	treewide: replace seastar::future::get0() with seastar::future::get() get0() dates back from the days where Seastar futures carried tuples, and get0() was a way to get the first (and usually only) element. Now it's a distraction, and Seastar is likely to deprecate and remove it. Replace with seastar::future::get(), which does the same thing.	2024-02-02 22:12:57 +08:00
Avi Kivity	8ee75ae8f4	sstables: writer: don't require effective_replication_map for sharding metadata Currently, we pass an effective_replication_map_ptr to sstable_writer, so that we can get a stable dht::sharder for writing the sharding metadata. This is needed because with tablets, the sharder can change dynamically. However, this is both bad and unnecessary: - bad: holding on to an effective_replication_map_ptr is a barrier for topology operations, preventing tablet migrations (etc) while an sstable is being written - unnecessary: tablets don't require sharding metadata at all, since two tablets cannot overlap (unlike two sstables from different shards in the same node). So the first/last key is sufficient to determine the shard/tablet ownership. Given that, just pass the sharder for vnode sstables, and don't generate sharding metadata for tablet sstables.	2024-01-23 22:23:08 +02:00
Yaniv Kaul	c658bdb150	Typos: fix typos in comments Fixes some typos as found by codespell run on the code. In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc. Follow-up commits will take care of them. Refs: https://github.com/scylladb/scylladb/issues/16255 Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>	2023-12-02 22:37:22 +02:00
Kefu Chai	9c24be05c3	sstable/writer: log sstable name and pk when capping ldt when the local_deletion_time is too large and beyond the epoch time of INT32_MAX, we cap it to INT32_MAX - 1. this is a signal of bad configuration or a bug in scylla. so let's add more information in the logging message to help track back to the source of the problem. Fixes #15015 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-08-21 19:25:32 +08:00
Tomasz Grabiec	17d6163548	sstables: Generate sharding metadata using sharder from erm when writing We need to keep sharding metadata consistent with tablet mapping to shards in order for node restart to detect that those sstables belong to a single shard and that resharding is not necessary. Resharding of sstables based on tablet metadata is not implemented yet and will abort after this series. Keeping sharding metadata accurate for tablets is only necessary until compaction group integration is finished. After that, we can use the sstable token range to determine the owning tablet and thus the owning shard. Before that, we can't, because a single sstable may contain keys from different tablets, and the whole key range may overlap with keys which belong to other shards.	2023-06-21 00:58:24 +02:00
Pavel Emelyanov	66e43912d6	code: Switch to seastar API level 7 In that level no io_priority_class-es exist. Instead, all the IO happens in the context of current sched-group. File API no longer accepts prio class argument (and makes io_intent arg mandatory to impls). So the change consists of - removing all usage of io_priority_class - patching file_impl's inheritants to updated API - priority manager goes away altogether - IO bandwidth update is performed on respective sched group - tune-up scylla-gdb.py io_queues command The first change is huge and was made semi-autimatically by: - grep io_priority_class \| default_priority_class - remove all calls, found methods' args and class' fields Patching file_impl-s is smaller, but also mechanical: - replace io_priority_class& argument with io_intent* one - pass intent to lower file (if applicatble) Dropping the priority manager is: - git-rm .cc and .hh - sed out all the #include-s - fix configure.py and cmakefile The scylla-gdb.py update is a bit hairry -- it needs to use task queues list for IO classes names and shares, but to detect it should it checks for the "commitlog" group is present. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #13963	2023-06-06 13:29:16 +03:00
Pavel Emelyanov	ac1e56c9d9	sstable, storage: Virtualize data sink making for Data and Index Add the make_data_or_index_sink() virtual method and its implementation for filesystem_storage. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-04-10 16:43:01 +03:00
Pavel Emelyanov	1d4fcce5dd	sstable/writer: Shuffle writer::init_file_writers() The method needs to create two data sinks -- for Data and for Index files -- and then wrap it with more stuff (compression, checksums, streams, etc.). With S3 backend using file-output-stream won't work, becase S3 storage cannot provide writable file API (it has data_sink instead). This patch extracts file_data_sink creation so that it could be virtualized with storage API later. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-04-10 16:43:01 +03:00

1 2 3

107 Commits