scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-26 03:20:37 +00:00

Author	SHA1	Message	Date
Avi Kivity	1930f3e67f	Merge 'sstables/mx/reader: accommodate inexact partition indexes' from Michał Chojnowski Unlike the currently-used sstable index files, BTI indexes don't store the entire partition keys. They only store prefixes of decorated keys, up to the minimum length needed to differentiate a key from its neighbours in the sstable. This saves space. However, it means that a BTI index query might be off by one partition (on each end of the queried partition range) with respect to the optimal Data position. For example, if the index stores prefixes `a`, `b`, `c`, the index has no way to know if the first index entry after key `bb` is `b` (which might correspond to `ba` as well as `bc`), or `c`. So the index reader conservatively has to pick the wider Data range, and the Data reader must ignore the superfluous partitions. (And there's no way around that.) Before this patch, the sstable reader expects the index query to return an exact (optimal) Data range. This patch adjusts the logic of the sstable reader to allow for inexact ranges. Note: the patch is more complicated than it looks. The logic of the sstable reader was already fairly hard to follow and this adds even more flags, more weird special states and more edge cases. I think I managed to write a decent test and it did find three or four edge cases I wouldn't have noticed otherwise. I think it should cover all the added logic, but I didn't verify code coverage. (Do our scripts for that even work nowadays?) Simplification ideas are welcome. Preparation for new functionality, no backporting needed.
Closes scylladb/scylladb#25093 * github.com:scylladb/scylladb: sstables/index_reader: weaken some exactness guarantees in abstract_index_reader test/boost: add a test for inexact index lookups sstables/mx/reader: allow passing a custom index reader to the constructor sstables/index_reader: remove advance_to sstables/mx/reader: handle inexact lookups in `advance_context()` sstables/mx/reader: handle inexact lookups in `advance_to_next_partition()` sstables/index_reader: make the return value of `get_partition_key` optional sstables/mx/reader: handle "backward jumps" in forward_to sstables/mx/reader: filter out partitions outside the queried range sstables/mx/reader: update _pr after `fast_forward_to`	2025-07-27 19:39:36 +03:00
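The `a`, `b`, `c` example in the message above can be sketched as a small model. This is only an illustration, not ScyllaDB code: the function names `index_lower_bound` and `read_range` are hypothetical, and plain strings stand in for decorated keys. It shows why the index has to include the entry whose prefix is also a prefix of the queried key, and how the reader then drops the partitions that turn out to lie outside the range.

```python
from bisect import bisect_left


def index_lower_bound(prefixes, key):
    """Return the position of the first entry that might hold a key >= `key`.

    Each stored prefix stands for some full key that starts with it. If `key`
    itself starts with a stored prefix, the full key behind that prefix may
    sort on either side of `key`, so the entry must be included.
    """
    i = bisect_left(prefixes, key)
    if i > 0 and key.startswith(prefixes[i - 1]):
        i -= 1  # conservative choice: pick the wider range
    return i


def read_range(partitions, prefixes, start_key):
    """Return full keys >= start_key, skipping what the index over-included."""
    first = index_lower_bound(prefixes, start_key)
    return [pk for pk in partitions[first:] if pk >= start_key]


prefixes = ["a", "b", "c"]
# The same prefixes can describe different sstables.
print(read_range(["apple", "ba", "cat"], prefixes, "bb"))
print(read_range(["apple", "bc", "cat"], prefixes, "bb"))
```

In both calls the index answer is the same (start at entry `b`); only reading the full key from the data lets the reader decide whether that partition belongs to the range.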
Michał Chojnowski	a0c29055e5	sstables/mx/reader: handle "backward jumps" in forward_to A bunch of code assumes that the Data.db stream can only go forward. But with BTI indexes, if we perform an advance_to, the index can point to a position which the data reader has already passed, since the index is inexact. The logic of the data reader ensures that it has stopped within the last partition range, or just immediately after it, after reading the next partition key and noticing that it doesn't belong to the range. But forward_to can only be used with increasing ranges. The start of the next range must be greater than or equal to the end of the previous range. This means that the exact start of the next partition range must be no earlier than: 1. The position just before the partition key just read by the data reader, if the data reader is positioned immediately after a partition key. 2. The start of the first partition after the current data reader position, if the data reader isn't positioned immediately after a partition key. So, if the index returns a position smaller than the current data reader position, then: 1. If the reader is immediately after a partition key, we have to reuse this partition key (since we can't go back in the stream to read it again), and keep reading from the current position. 2. Otherwise we can safely walk the index to the first partition that lies no earlier than the current position.	2025-07-25 10:49:58 +02:00
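The two cases described in the message above can be sketched as a small decision function. This is a simplified model under stated assumptions, not the reader's actual code: `ReaderState`, `resolve_forward_to` and the `partition_starts` list are hypothetical names, and byte offsets are plain integers.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReaderState:
    data_pos: int                      # current offset in the Data stream
    pending_key: Optional[str] = None  # partition key already read, not yet consumed


def resolve_forward_to(state, index_pos, partition_starts):
    """Decide where to resume after the index suggests `index_pos`.

    Returns (position, reuse_pending_key). A position of None means there is
    no partition at or after the current position.
    """
    if index_pos >= state.data_pos:
        return index_pos, False  # ordinary forward skip
    if state.pending_key is not None:
        # Case 1: the key was already pulled from the stream and cannot be
        # re-read, so keep it and continue from the current position.
        return state.data_pos, True
    # Case 2: walk forward to the first partition that starts no earlier
    # than the current position.
    i = bisect_left(partition_starts, state.data_pos)
    if i == len(partition_starts):
        return None, False
    return partition_starts[i], False


starts = [0, 60, 120, 200]
print(resolve_forward_to(ReaderState(100), 150, starts))
print(resolve_forward_to(ReaderState(100, pending_key="k"), 60, starts))
print(resolve_forward_to(ReaderState(100), 60, starts))
```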
Ernest Zaslavsky	d2c5765a6b	treewide: Move keys related files to a new keys directory As requested in #22102, #22103 and #22105, moved the files and fixed the affected includes and the build system. Moved files: - clustering_bounds_comparator.hh - keys.cc - keys.hh - clustering_interval_set.hh - clustering_key_filter.hh - clustering_ranges_walker.hh - compound_compat.hh - compound.hh - full_position.hh Fixes: #22102 Fixes: #22103 Fixes: #22105 Closes scylladb/scylladb#25082	2025-07-25 10:45:32 +03:00
Botond Dénes	20693edb27	Merge 'sstables: put index_reader behind a virtual interface' from Michał Chojnowski This is a refactoring patch in preparation for BTI indexes. It contains no functional changes (or at least it's not intended to). In this patch, we modify the sstable readers to use index readers through a new virtual `abstract_index_reader` interface. Later, we will add BTI indexes which will also implement this interface. This interface contains the methods of `index_reader` which are needed by sstable readers, and leaves out all other methods, such as `current_clustered_cursor`. Not all methods of this interface will be implementable by a trie-based index later. For example, a trie-based index can't provide a reliable `get_partition_key()`, because, unlike the current index, it only stores partition keys for partitions which have a row index. So the interface will have to be further restricted later. We don't do that in this patch because that will require changes to sstable reader logic, and this patch is supposed to only include cosmetic changes. No backports needed, this is a preparation for new functionality. Closes scylladb/scylladb#25000 * github.com:scylladb/scylladb: sstables: add sstable::make_index_reader() and use where appropriate sstables/mx: in readers, use abstract_index_reader instead of index_reader sstables: in validate(), use abstract_index_reader instead of index_reader where possible test/lib/index_reader_assertions: accept abstract_index_reader instead of index_reader sstables/index_reader: introduce abstract_index_reader sstables/index_reader: extract a prefetch_lower_bound() method	2025-07-17 14:32:08 +03:00
Michał Chojnowski	1c4065e7dd	sstables/mx: in readers, use abstract_index_reader instead of index_reader This makes clear which methods of index_reader are available for use by sstable readers, and which aren't.	2025-07-17 10:32:57 +02:00
Ernest Zaslavsky	dff9a229a7	sstables: refactor readers and sources to use coroutines Refactor readers and sources to support coroutine usage in preparation for integration with `make_data_or_index_source`. Move coroutine-based member initialization out of constructors where applicable, and defer initialization until first use.	2025-07-15 10:10:23 +03:00
Avi Kivity	f3eade2f62	treewide: relicense to ScyllaDB-Source-Available-1.0 Drop the AGPL license in favor of a source-available license. See the blog post [1] for details. [1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/	2024-12-18 17:45:13 +02:00
Nikos Dragazis	609b16307e	sstables: Add integrity option to data_consume_single_partition() Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2024-11-11 20:26:27 +02:00
Nikos Dragazis	5b896cdbb7	sstables: Disengage integrity_check from sstable class The `integrity_check` flag was first introduced as a parameter in `sstable::data_stream()` to support creating input streams with integrity checking. As such, it was defined in the sstable class. However, we also use this flag in the kl/mx full-scan readers, and, in a later patch, we will use it in `class sstable_set` as well. Move the definition into `types_fwd.hh` since it is no longer bound to the sstable class. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2024-11-11 20:26:27 +02:00
Tomasz Grabiec	a29501ed67	sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions Single-row reads from a large partition issue 64 KiB reads to the data file, which is equal to the default span of the promoted index block in the data file. If users would want to reduce selectivity of the index to speed up single-row reads, this won't be effective. The reason is that the reader uses promoted index to look up the start position in the data file of the read, but end position will in practice extend to the next partition, and amount of I/O will be determined by the underlying file input stream implementation and its read-ahead heuristics. By default, that results in at least 2 IOs of 32 KiB each. There is already infrastructure to look up the end position based on upper bound of the read, but it's not effective because it's a non-populating lookup and the upper bound cursor has its own private cached_promoted_index, which is cold when positions are computed. It's non-populating on purpose, to avoid extra index file IO to read upper bound. In case upper bound is far-enough from the lower bound, this will only increase the cost of the read. The solution employed here is to warm up the lower bound cursor's cache before positions are computed, and use that cursor for non-populating lookup of the upper bound. We use the lower bound cursor and the slice's lower bound so that we read the same blocks as later lower-bound slicing would, so that we don't incur extra IO for cases where looking up upper bound is not worth it, that is when upper bound is far from the lower bound. If upper bound is near lower bound, then warming up using lower bound will populate cached_promoted_index with blocks which will allow us to locate the upper bound block accurately. This is especially important for single-row reads, where the bounds are around the same key. In this case we want to read the data file range which belongs to a single promoted index block.
It doesn't matter that the upper bound is not exactly the same. They both will likely lie in the same block, and if not, binary search will bring adjacent blocks into cache. Even if upper bound is not near, the binary search will populate the cache with blocks which can be used to narrow down the data file range somewhat. Fixes #10030. The change was tested with perf-fast-forward. I populated the data set with `column_index_size_in_kb` set to 1 scylla perf-fast-forward --populate --run-tests=large-partition-slicing --column-index-size-in-kb=1 Test run: build/release/scylla perf-fast-forward --run-tests=large-partition-select-few-rows -c1 --keep-cache-across-test-cases --test-case-duration=0 This test reads two rows from the middle of a large partition (1M rows), of subsequent keys. The first read will miss in the index file page cache, the second read will hit. Notice that before the change, the second read issued 2 aio requests worth of 64KiB in total. After the change, the second read issued 1 aio worth of 2 KiB. That's because promoted index block is larger than 1 KiB. I verified using logging that the data file range matches a single promoted index block. Also, the first read which misses in cache is still faster after the change. 
Before: running: large-partition-select-few-rows on dataset large-part-ds1 Testing selecting few rows from a large partition: stride rows time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu 500000 1 0.009802 1 1 102 0 102 102 21.0 21 196 2 1 0 1 1 0 0 0 568 269 4716050 53.4% 500001 1 0.000321 1 1 3113 0 3113 3113 2.0 2 64 1 0 1 0 0 0 0 0 116 26 555110 45.0% After: running: large-partition-select-few-rows on dataset large-part-ds1 Testing selecting few rows from a large partition: stride rows time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu 500000 1 0.009609 1 1 104 0 104 104 20.0 20 137 2 1 0 1 1 0 0 0 561 268 4633407 43.1% 500001 1 0.000217 1 1 4602 0 4602 4602 1.0 1 2 1 0 1 0 0 0 0 0 110 26 313882 64.1% (cherry picked from commit dfb339376aff1ed961b26c4759b1604f7df35e54)	2024-10-01 18:40:34 +02:00
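As a rough illustration of why locating the upper bound's block matters, the sketch below maps a clustering range onto a data file byte range using a list of promoted index blocks. It is a simplified model and does not reproduce the caching behaviour (`cached_promoted_index`, populating versus non-populating lookups) that the commit actually changes. The function name `data_range`, the integer clustering keys and the 2 KiB block layout are all assumptions made for the example.

```python
from bisect import bisect_right


def data_range(block_first_keys, block_offsets, partition_end, lower, upper=None):
    """Return the [start, end) byte range that can hold rows in [lower, upper].

    block_first_keys[i] is the first clustering key of block i and
    block_offsets[i] is where that block starts in the data file.
    With upper=None the end position is unknown and the read extends to the
    end of the partition.
    """
    # Last block whose first key is <= lower (or the first block).
    lo = max(bisect_right(block_first_keys, lower) - 1, 0)
    start = block_offsets[lo]
    if upper is None:
        return start, partition_end
    # First block whose first key is > upper; the read can stop where it starts.
    hi = bisect_right(block_first_keys, upper)
    end = block_offsets[hi] if hi < len(block_offsets) else partition_end
    return start, end


keys = [0, 100, 200, 300]
offsets = [0, 2048, 4096, 6144]
print(data_range(keys, offsets, 8192, 150))       # end unknown: read to partition end
print(data_range(keys, offsets, 8192, 150, 150))  # single-row read: one block
```

With the upper bound resolved, a single-row read covers exactly one block, which matches the intent stated in the message: read only the data file range that belongs to a single promoted index block.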
Nikos Dragazis	1d2dc9f2e1	sstables: Expose integrity option via data_consume_rows() Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2024-09-11 12:28:59 +03:00
Patryk Wrobel	a89e3d10af	code-cleanup: add missing header guards The following command had been executed to get the list of headers that did not contain '#pragma once': 'grep -rnw . -e "#pragma once" --include *.hh -L' This change adds missing include guard to headers that did not contain any guard. Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com> Closes scylladb/scylladb#19626	2024-07-09 18:31:35 +03:00
Tomasz Grabiec	4d84451cf1	sstables, gdb: Track readers in a linked list For the purpose of scylla-gdb.py command "scylla active-sstables". Before the patch, readers were located by scanning the heap for live objects with vtable pointers corresponding to readers. It was observed that the test scylla_gdb/test_misc.py::test_active_sstables started failing like this: gdb.error: Error occurred in Python: Cannot access memory at address 0x300000000000000 This could be explained by there being a live object on the heap which used to be a reader but now is a different object, and the _sst field contains some other data which is not a pointer. To fix, track readers explicitly in a linked list so that the gdb script can reliably walk readers. Fixes #18618.	2024-05-16 00:28:46 +02:00
Kefu Chai	a6152cb87b	sstables: do not include unused headers these unused includes were identified by clangd. see https://clangd.llvm.org/guides/include-cleaner#unused-include-warning for more details on the "Unused include" warning. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16666	2024-01-09 11:45:44 +02:00
Yaniv Kaul	c658bdb150	Typos: fix typos in comments Fixes some typos as found by codespell run on the code. In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc. Follow-up commits will take care of them. Refs: https://github.com/scylladb/scylladb/issues/16255 Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>	2023-12-02 22:37:22 +02:00
Benny Halevy	a1acf6854b	everywhere: reduce dependencies on i_partitioner.hh Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-11-05 20:47:44 +02:00
Pavel Emelyanov	66e43912d6	code: Switch to seastar API level 7 In that level no io_priority_class-es exist. Instead, all the IO happens in the context of current sched-group. File API no longer accepts prio class argument (and makes io_intent arg mandatory to impls). So the change consists of - removing all usage of io_priority_class - patching file_impl's inheritors to updated API - priority manager goes away altogether - IO bandwidth update is performed on respective sched group - tune-up scylla-gdb.py io_queues command The first change is huge and was made semi-automatically by: - grep io_priority_class \| default_priority_class - remove all calls, found methods' args and class' fields Patching file_impl-s is smaller, but also mechanical: - replace io_priority_class& argument with io_intent* one - pass intent to lower file (if applicable) Dropping the priority manager is: - git-rm .cc and .hh - sed out all the #include-s - fix configure.py and cmakefile The scylla-gdb.py update is a bit hairy -- it needs to use task queues list for IO classes names and shares, but to detect whether it should, it checks if the "commitlog" group is present. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #13963	2023-06-06 13:29:16 +03:00
Kefu Chai	df63e2ba27	types: move types.{cc,hh} into types they are part of the CQL type system, and are "closer" to types. let's move them into "types" directory. the building systems are updated accordingly. the source files referencing `types.hh` were updated using following command: ``` find . -name "*.{cc,hh}" -exec sed -i 's/\"types.hh\"/\"types\/types.hh\"/' {} + ``` the source files under sstables include "types.hh", which is indeed the one located under "sstables", so include "sstables/types.hh" instead, so it's more explicit. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #12926	2023-02-19 21:05:45 +02:00
Avi Kivity	c5e4bf51bd	Introduce mutation/ module Move mutation-related files to a new mutation/ directory. The names are kept in the global namespace to reduce churn; the names are unambiguous in any case. mutation_reader remains in the readers/ module. mutation_partition_v2.cc was missing from CMakeLists.txt; it's added in this patch. This is a step forward towards librarization or modularization of the source base. Closes #12788	2023-02-14 11:19:03 +02:00
Pavel Emelyanov	9bdea110a6	code: Reduce fanout of sstables(_manager)?.hh over headers This change removes sstables.hh from some other headers replacing it with version.hh and shared_sstable.hh. Also this drops sstables_manager.hh from some more headers, because this header propagates sstables.hh via self. That change is pretty straightforward, but has a ricochet in database.hh that needs disk-error-handler.hh. Without the patch touch sstables/sstable.hh results in 409 targets recompilation, with the patch -- 299 targets. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #12222	2022-12-07 14:34:19 +02:00
Botond Dénes	4b222e7f37	sstables: move mp_row_consumer_reader_k_l to kl/reader.cc Its only user is in said file, so that is a better place for it.	2022-04-28 14:12:24 +03:00
Botond Dénes	b029bd3db7	tree: remove mutation_reader.hh include In most files it was unused. We should move these to the patch which moved out the last interesting reader from mutation_reader.hh (and added the corresponding new header include) but it's probably not worth the effort. Some other files still relied on mutation_reader.hh to provide reader concurrency semaphore and some other misc reader related definitions.	2022-03-30 15:42:51 +03:00
Mikołaj Sielużycki	1d84a254c0	flat_mutation_reader: Split readers by file and remove unnecessary includes. The flat_mutation_reader files were conflated and contained multiple readers, which were not strictly necessary. Splitting optimizes both iterative compilation times, as touching rarely used readers doesn't recompile large chunks of codebase. Total compilation times are also improved, as the size of flat_mutation_reader.hh and flat_mutation_reader_v2.hh have been reduced and those files are included by many files in the codebase. With changes real 29m14.051s user 168m39.071s sys 5m13.443s Without changes real 30m36.203s user 175m43.354s sys 5m26.376s Closes #10194	2022-03-14 13:20:25 +02:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes were applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Wojciech Mitros	64e703bb54	sstables: mx: introduce partition_reversing_data_source This patch adds an implementation of a data source that wraps an sstable data file and returns data buffers with contents of one partition in the sstable as if the rows of the partition were present in a reversed order. In other words, to the user of the source the partition appears to be reversed. We shall call this an 'intermediary' data source. As part of the interface of the intermediary source the user is also given read access to the source's current position over the data file, and the constructor of the source takes a reference to `index_reader`. This is necessary because the index operates directly on data file offsets and we want the user to be able to use the index to skip sequences of rows. In order to ask the source to skip a sequence of rows - e.g. when jumping between clustering ranges - the user must advance the index' upper bound in reverse (to an earlier position). The source will then notice that the end position of the index has changed and take appropriate action. An alternative would be to translate the data positions of `index_reader` to 'reversed positions' of the intermediary and then use `skip_to` for skipping, as we do for forward reads. However, this solution would introduce more complexity to `index_reader` and the intermediary source. One reason for the complexity in the input stream is that we would have two kinds of skips: a single row skip, and a skip to a clustering range. We know the offset of the next row, so we could check that to differentiate them. We would also need to add information about the position of the first clustering row and the end of the last one to the index_reader. Skipping by checking the index seems to be overall simpler. For simplicity, the intermediary stream always starts with parsing the partition header and (if present) the static row, and returning the corresponding bytes as a result of the first read.
After partition header and static row we must find the last row entry of the requested range. If the range ends before the partition end (i.e. there are more row entries after the range) we can use the 'previous unfiltered size' of the row following the range; otherwise we must scan the last promoted index block and take its last row. After finding the data range of the last row, we parse rows consecutively in reversed order. We must parse the rows partially to learn their lengths and the positions of previous rows. We're using similar constructs as in the sstable parser, but it only contains a small part of the parsing coroutine and doesn't perform any correctness checks. The parser for rows still turned out rather big mostly because we can't always deduce the size of the clustering blocks without reading the block header. The parser allows reading rows while skipping their bodies also in non-reversed order, which we are making use of while reading the last promoted index block. The intermediary data source has one more utility: reversing range tombstones. When we read a tombstone bound/boundary, we modify the data buffer so that the resulting bound/boundary has the reversed kind (so we don't read ends before starts) and the boundaries have their before/after timestamps swapped.	2021-10-04 15:24:12 +02:00
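The range tombstone reversal described at the end of the message above can be sketched as follows. This is an illustrative model only: the `Marker` class, the string kind names and `reverse_markers` are hypothetical, and the real source rewrites serialized bytes in a data buffer rather than Python objects. The idea is that each bound swaps its start/end role while keeping its inclusiveness, and each boundary swaps its two sides together with their timestamps.

```python
from dataclasses import dataclass
from typing import Tuple

# How each bound or boundary kind reads when the partition is traversed backwards.
REVERSED_KIND = {
    "incl_start": "incl_end",
    "excl_start": "excl_end",
    "incl_end": "incl_start",
    "excl_end": "excl_start",
    "excl_end_incl_start": "incl_end_excl_start",
    "incl_end_excl_start": "excl_end_incl_start",
}


@dataclass(frozen=True)
class Marker:
    key: str
    kind: str
    # One timestamp for a bound; (ending tombstone, starting tombstone) for a boundary.
    timestamps: Tuple[int, ...]


def reverse_markers(markers):
    """Return the markers in reversed clustering order with reversed kinds."""
    return [
        Marker(m.key, REVERSED_KIND[m.kind], tuple(reversed(m.timestamps)))
        for m in reversed(markers)
    ]


forward = [
    Marker("a", "incl_start", (5,)),
    Marker("c", "excl_end_incl_start", (5, 9)),
    Marker("f", "incl_end", (9,)),
]
for m in reverse_markers(forward):
    print(m)
```

Read backwards, the tombstone with timestamp 9 now opens at `f`, and at `c` it ends inclusively while the tombstone with timestamp 5 starts exclusively, so a consumer never sees an end before its start.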
Benny Halevy	4476800493	flat_mutation_reader: get rid of timeout parameter Now that the timeout is taken from the reader_permit. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-08-24 16:30:51 +03:00
Tomasz Grabiec	a4275cf8bc	sstables: Switch the mx reader to flat_mutation_reader_v2 The main difficulty was in making sure that emitted range tombstone changes reflect range tombstones trimmed to clustering restrictions. This is handled by mutation_fragment_filter and clustering_ranges_walker. They return a list of range_tombstone_change fragments to emit for each hop as the reader walks over the clustering domain. Tests which were using a normalizing reader expected range tombstones to be split around rows. Drop this and adjust the tests accordingly. No reader splits range tombstones around rows now.	2021-06-16 00:23:49 +02:00
Tomasz Grabiec	8784ffe07f	sstables: reader: Inline specialization of sstable_mutation_reader Needed before converting the mx reader to flat_mutation_reader_v2 because now it and the k_l reader cannot share the reader implementation. They derive from different reader impl bases and push different fragment types.	2021-06-16 00:23:49 +02:00
Avi Kivity	a55b434a2b	treewide: extend copyright statements to present day	2021-06-06 19:18:49 +03:00
Benny Halevy	8c585ccb5c	sstables: sstable_mutation_reader: implement close Close both the _index_reader and _context, if they are engaged. Warn and ignore any errors from close as it may be called either from the destructor or from f_m_r close. Call close() for closing in the background if needed when destroyed and warn about it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:16:10 +03:00
Wojciech Mitros	599cfe586f	sstables: add parsing of cell values into fragmented buffers The entire sstable cell value is currently stored in a single temporary_buffer. Cells may be very large, so to avoid large contiguous allocations, the buffer is changed to a fragmented_temporary_buffer. Fixes #7457 Fixes #6376 Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-04-01 15:36:58 +02:00
Botond Dénes	361ba473c7	sstables: get rid of mp_row_consumer.{hh,cc} Move stuff contained therein to `sstable_mutation_reader.{hh,cc}` which will serve as the collection point of utility stuff needed by all reader implementations.	2021-03-11 12:17:13 +02:00
Botond Dénes	3ba782bddd	sstables: get rid of row.hh Move stuff contained therein to `sstable_mutation_reader.{hh,cc}` which will serve as the collection point of utility stuff needed by all reader implementations.	2021-03-11 12:17:13 +02:00
Botond Dénes	4e3ae9d913	sstables: move kl specific context and consumer to kl/reader.cc Move all the kl format specific context and consumer code to kl/reader* and add a factory function `kl::make_reader()` which takes over the job of instantiating the `sstable_mutation_reader` with the kl specific context and consumer. Code which is used by test is moved to kl/reader_impl.hh, while code that can be hidden is moved to kl/reader.cc. Users who just want to create a reader only have to include kl/reader.hh.	2021-03-11 12:17:13 +02:00
Botond Dénes	0ec040921d	sstables: mv partition.cc sstable_mutation_reader.hh The sstable reader currently knows the definition of all the different consumers and contexts. But it doesn't really need to, as it is a template. Exploit this and prepare for an organization scheme where the consumers and contexts live hidden in a cc file which includes and instantiates the sstable reader template. As a first step expose `sstable_mutation_reader` in a header.	2021-03-11 12:17:13 +02:00

35 Commits