Instead of lengthy blurbs, switch to single-line, machine-readable,
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the last case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes were applied mechanically with a script, except for
licenses/README.md.
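For illustration, a typical header then shrinks to one comment line
(a sketch; the exact comment syntax varies per file type):

    // SPDX-License-Identifier: (AGPL-3.0-or-later and Apache-2.0)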
Closes #9937
In the sstable reader, we iterate over clustering ranges using the
index_reader, which normally only accepts advancing to increasing
positions. In this patch we add methods for advancing the index
reader in reverse.
To simplify our job we restrict our attention to a single implementation
of the promoted index block cursor: `bsearch_clustered_cursor`. The
`index_reader` methods for advancing in reverse will thus assume that
this implementation is used. The assumption is correct given that we're
working only with sstables of versions >= mc, which is indeed the
intended use case. We add some documentation in appropriate places to
make this obvious.
We extend `bsearch_clustered_cursor` with two methods:
`advance_past(pos)`, which advances the cursor to the first block after
`pos` (or to the end if there is no such block), and
`last_block_offset()`, which returns the data file offset of the first
row from the last promoted index block.
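A minimal sketch of the extended cursor interface, with simplified,
synchronous signatures (the real methods return seastar futures; the
stand-in types here are assumptions):

    #include <cstdint>

    struct position_in_partition_view {};  // stand-in for the real type

    class bsearch_clustered_cursor_sketch {
    public:
        // Advance to the first promoted index block after pos, or to
        // the end if there is no such block.
        void advance_past(position_in_partition_view pos);

        // Data file offset of the first row of the last promoted index
        // block.
        uint64_t last_block_offset();
    };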
To efficiently find the position in the data file of the last row
of the partition (which we need when performing a reversed query)
the sstable reader may need to read the span of the entire last promoted
index block in the data file. To learn where the block starts it can use
`index_reader::last_block_offset()`, which is implemented in terms of
`bsearch_clustered_cursor::last_block_offset()`.
When performing a single partition read in forward order, the reader
asks the index to position its lower bound at the start of the partition
and its upper bound after the end of the slice. It starts by reading the
first range. After exhausting a range it jumps to the next one by asking
the index to advance the lower bound.
For reverse single partition reads we'll take a similar approach: the
initial bound positions are as in the forward case. However, we start
with the last range and after exhausting a range we want to jump to a
previous one; we will do it by advancing the upper bound in reverse
(i.e. moving it closer to the beginning of the partition). For this
we introduce the `index_reader::advance_reverse` function.
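The resulting control flow, sketched with illustrative helper names
(advance_reverse is the new call; everything else is hypothetical):

    // forward: the lower bound walks toward the end of the slice
    while (auto range = next_range()) {
        index.advance_lower_bound_to(range->start);  // hypothetical name
        read(*range);
    }

    // reverse: the upper bound walks toward the start of the partition
    while (auto range = prev_range()) {
        index.advance_reverse(range->start);  // the new method
        read(*range);
    }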
The index cursor for reads which bypass the cache will use a private,
temporary instance of the partition index cache.
The promoted index scanner (ka/la format) will not go through the page cache.
As part of this change, the container for partition index pages was
changed from utils::loading_shared_values to intrusive_btree, to avoid
the reactor stalls which the former induces with a large number of
elements (pages): it uses a hashtable under the hood, which
reallocates contiguous storage.
In preparation for caching index objects, manage them under LSA.
Implementation notes:
key_view was changed to be a view on managed_bytes_view instead of
bytes, so it can now be fragmented. Old users of key_view now have to
linearize it. Actual linearization should be rare since partition
keys are typically small.
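A minimal illustration of the linearization cost, using std types
rather than the real managed_bytes_view (assumption: a fragmented view
is a sequence of chunks):

    #include <string>
    #include <string_view>
    #include <vector>

    // Stand-in for a fragmented view: a sequence of separate chunks.
    using fragmented_view = std::vector<std::string_view>;

    // Callers which need contiguous bytes copy the fragments out;
    // partition keys are small, so this copy is rare and cheap.
    std::string linearize(const fragmented_view& v) {
        std::string out;
        for (auto frag : v) {
            out.append(frag);
        }
        return out;
    }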
The index parser no longer constructs the index_entry directly;
instead, it produces value objects which live in the standard
allocator space:
class parsed_promoted_index_entry;
class parsed_partition_index_entry;
This change was needed to support consumers which don't populate the
partition index cache and don't use LSA,
e.g. sstable::generate_summary(). It is now the consumer's
responsibility to allocate an index_entry out of a
parsed_partition_index_entry.
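The shape of the two-step flow, sketched with simplified types (the
real definitions live in the sstables code):

    #include <cstdint>
    #include <string>

    // Plain-allocator value object produced by the parser.
    struct parsed_partition_index_entry_sketch {
        std::string key;    // raw key bytes
        uint64_t position;  // position in the data file
        // ... promoted index information, if present
    };

    // A cache-populating consumer converts this into an LSA-managed
    // index_entry; a one-shot consumer such as
    // sstable::generate_summary() can consume the parsed object directly.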
The file object which is currently stored there has per-request
tracing wrappers (permit, trace_state) attached to it. It doesn't make
sense once the entry is cached and shared. Annotate when the cursor is
created instead.
Index reads and promoted index reads are both using the same
cached_file now, so there's no need to pass the buffers between the
index parser and promoted index reader.
Makes the promoted_index structure easier to move to LSA.
We'd like that to simplify the error handling path of the
soon-to-be-introduced sstable_mutation_reader::close.
close_index_list can be marked noexcept since parallel_for_each is,
and with that index_reader::close can be marked noexcept too.
Note that since closing a reader cannot fail, both the lower and upper
bounds are closed (closing lower_bound cannot fail, so we always
proceed to close upper_bound).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
index_entry.hh (the current home of `promoted_index_blocks_reader`) is
included in `sstables.hh` and thus in half our code-base. All that code
really doesn't need the definition of the promoted index blocks reader
which also pulls in the sstables parser mechanism. Move it into its own
header and only include it where it is actually needed: the promoted
index cursor implementations.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210317093654.34196-1-bdenes@scylladb.com>
Currently, the partition index page parser will create and store
promoted index cursors for each entry. The assumption is that
partition index pages are not shared by readers so each promoted index
cursor will be used by a single index_reader (the top-level cursor).
In order to be able to share partition index entries we must make the
entries immutable and thus move the cursor outside. The promoted index
cursor is now created and owned by each index_reader. There is at most
one such active cursor per index_reader bound (lower/upper).
promoted_index_blocks_reader has a data member called "state" and a
type member also called "state". Somehow gcc manages to disambiguate
the two when they are used, but clang doesn't. I believe clang is
correct here: one name should hide the other.
Change the type member to have a different name to disambiguate the two.
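A reduced illustration of the collision and of the fix (identifiers
invented for the example):

    struct blocks_reader {
        // Before: a nested type `state` and a data member `state`
        // coexisted; gcc happened to resolve uses, clang rejected them.
        // After: the type member gets a distinct name.
        enum class consuming_state { next_block, done };
        consuming_state state = consuming_state::next_block;  // data member
    };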
entry_info holds views, which may get invalidated when the containing
index blocks are removed. Current implementations of next_entry() keep
the blocks in memory as long as the cursor is alive, but that will
change in new implementations of the cursor.
Adjust the tests' assumptions accordingly.
In preparation for supporting more than one algorithm for lookups in
the promoted index, extract relevant logic out of the index_reader
(which is a partition index cursor).
The clustered index cursor implementation is now hidden behind an
abstract interface called clustered_index_cursor.
The current implementation is moved into
scanning_clustered_index_cursor. It's mostly code movement with minor
adjustments.
In order to encapsulate iteration over promoted index entries,
clustered_index_cursor::next_entry() was introduced.
No change in behavior intended in this patch.
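The shape of the abstraction, sketched with simplified signatures (the
real interface is asynchronous and richer):

    #include <cstdint>
    #include <optional>

    struct entry_info_sketch {        // stand-in for the real entry type
        uint64_t data_file_offset;
    };

    // Abstract interface hiding how promoted index lookups are done.
    struct clustered_index_cursor_sketch {
        virtual ~clustered_index_cursor_sketch() = default;
        // Encapsulates iteration over promoted index entries.
        virtual std::optional<entry_info_sketch> next_entry() = 0;
    };

    // The pre-existing linear scan becomes one implementation, e.g.:
    // struct scanning_cursor_sketch : clustered_index_cursor_sketch { ... };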
Based on heap profiling, buffers used for storing half-parsed fields are
a major contributor to the overall memory consumption of reads. This
memory was completely "under the radar" before. Track it by using
tracked `temporary_buffer` instances everywhere in
`continuous_data_consumer`. As `continuous_data_consumer` is the basis
for parsing all index and data files, adding the tracking here
automatically covers all data, index and promoted index parsing.
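What "tracked" means here, as a generic sketch: the buffer charges its
size to a memory accountant on construction and releases it on
destruction (the accountant is a stand-in for the permit):

    #include <cstddef>

    struct memory_accountant {       // stand-in for the permit
        size_t consumed = 0;
        void add(size_t n) { consumed += n; }
        void release(size_t n) { consumed -= n; }
    };

    class tracked_buffer {           // stand-in for tracked temporary_buffer
        memory_accountant& _acct;
        size_t _size;
    public:
        tracked_buffer(memory_accountant& a, size_t n) : _acct(a), _size(n) {
            _acct.add(n);            // the allocation becomes visible
        }
        tracked_buffer(const tracked_buffer&) = delete;
        ~tracked_buffer() {
            _acct.release(_size);    // ... and is accounted as freed
        }
    };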
I'm almost convinced that there is a better place to store the `permit`
than the three places it lives in now, but so far I was unable to
completely decipher our data/index file parsing class hierarchy.
This moves the predicate functions to the start of the file, renames
is_in_bound_kind to is_bound_kind for consistency with to_bound_kind,
and defines all predicates in a similar fashion.
It also uses the predicates to reduce code duplication.
Tests: unit (release)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
_ck_blocks_header is a 64-bit variable, so the mask should be 64 bits too.
Otherwise, a shift in the range 32-63 will produce wrong results.
Fix by using a 64-bit mask.
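The failure mode in miniature (illustrative helper):

    #include <cstdint>

    uint64_t set_bit(uint64_t header, unsigned bit) {
        // Buggy: 1u is only 32 bits wide, so for bit in 32..63 the
        // shift is undefined behavior and yields a wrong mask:
        // return header | (1u << bit);
        return header | (uint64_t(1) << bit);  // fixed: 64-bit mask
    }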
Found by Fedora 29's ubsan.
Fixes #3973.
Message-Id: <20181209120549.21371-1-avi@scylladb.com>
These loops have the structure:

    while (true) {
        switch (state) {
        case state1:
            ...
            break;
        case state2:
            if (...) { ... break; } else { ... continue; }
            ...
        }
        break;
    }
There are a couple of things I find a bit odd about that structure:
* The break refers to the switch, the continue to the loop.
* A while (true) loop always hits a break or a continue.
This patch uses early returns to simplify the logic to:

    while (true) {
        switch (state) {
        case state1:
            ...
            return;
        case state2:
            if (...) { ... return; }
            ...
        }
    }
Now there are no breaks or continues.
Tests: unit (release)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20181126171726.84629-1-espindola@scylladb.com>
These switches are fully covered. We can be sure they will stay this
way because of -Werror and gcc's -Wswitch warning.
We can also be sure that we never have an invalid enum value since the
state machine values are not read from disk.
The patch also removes a superfluous ';'.
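For example (illustrative enum; the guarantees above are what make the
missing default safe):

    enum class parse_state { header, body };

    const char* name(parse_state s) {
        switch (s) {  // fully covered: with -Werror, -Wswitch fails
                      // the build if a new enumerator goes unhandled
        case parse_state::header: return "header";
        case parse_state::body:   return "body";
        }
        __builtin_unreachable();  // values never come from disk, so no
                                  // out-of-range value reaches this point
    }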
Message-Id: <20181124020128.111083-1-espindola@scylladb.com>
When we skip through a wide partition using the promoted index, we may
land at a position that lies in the middle of a range tombstone, so we
need to be aware of it. For this, we check if the previous promoted
block has an end open marker and either set the range tombstone start
using it or reset it if the marker is missing.
Note several things about the implementation.
Firstly, we have to peek back at the previous promoted index block for
the end open marker, so we always preserve one more promoted index
block when we read the next batch, so that we can still access it.
Secondly, we use the previous promoted block's end position to build
the position in partition for the range tombstone start.
Lastly, we don't have a notion of an end open marker in older consumers
that work with SSTables of ka/la formats, so we only call the
corresponding methods if the consumer supports them.
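In outline, with types invented for the sketch:

    #include <optional>

    struct position {};                 // stand-ins for the sketch
    struct pi_block {
        bool has_end_open_marker;
        position end_pos;
    };

    // After skipping to blocks[i], peek at blocks[i - 1] (this is why
    // one extra block is preserved when reading the next batch): if it
    // ended inside a range tombstone, rebuild the tombstone start from
    // its end position; otherwise reset any tracked tombstone start.
    std::optional<position> tombstone_start_after_skip(const pi_block* prev) {
        if (prev && prev->has_end_open_marker) {
            return prev->end_pos;
        }
        return std::nullopt;
    }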
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
We should not access the internal object stored in the std::optional
when passing the end_open_marker, especially since it can be
disengaged.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Reduces size of index_entry from 384 bytes to 64 bytes by using
indirection for the optional promoted index instead of embedding it.
Improves query time from 9ms to 4ms in a micro benchmark with a very
large index page.
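The pattern, reduced to its essence (sizes and fields illustrative):

    #include <memory>
    #include <optional>

    struct promoted_index_sketch {  // large parsed state
        char blocks[320];
    };

    struct entry_embedded {         // before: every entry pays the full
        std::optional<promoted_index_sketch> pi;  // payload size inline
    };

    struct entry_indirect {         // after: one pointer per entry;
        std::unique_ptr<promoted_index_sketch> pi;  // nullptr == absent
    };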
Message-Id: <1531406354-10089-1-git-send-email-tgrabiec@scylladb.com>
It is only being used by index_reader internally and never exposed, so
it should not be listed in commonly used types.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
An index entry may or may not have a promoted index. All the optional
fields are better scoped under the same class, which avoids lots of
separate optional fields and gives a better representation.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Promoted index input streams must be explicitly closed when closing the
index_reader in order to ensure all the pending read-aheads are
completed.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
With the changes introduced in #2981, it is no longer safe to share
index_entries among multiple sstable_mutation_readers.
The original intent behind sharing index_entries among index_readers was
to avoid re-reading same pages twice as we have two index readers -
lower and upper bound - for every sstable_mutation_reader. In fact, the
shared entries were held at the sstable object level so index_readers
from different sstable_mutation_readers could have accessed them.
Now, with calls to index_reader::advance_to(pos)/index_reader::advance_past(pos),
index_entry can be accessed in a way that modifies its state if we need
to read more promoted index blocks. It is safe to keep sharing them
between two index_readers within the same sstable_mutation_reader, as
the invariant is maintained that readers can only be moved forward.
We cannot safely assume, however, that this invariant holds for multiple
sstable_mutation_readers as it may happen that one of them has read and
thrown away some promoted index blocks that another one needs. So we
restrict sharing to per-sstable_mutation_reader level.
Fixes #3189.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <83957d007621fe4c62af49aebf1838bb2f32ee55.1518226793.git.vladimir@scylladb.com>
Remove the redundant input parameter, as continuous_data_consumer
derivatives only ever use themselves as the context. So take it
internally and make the function regular (non-template), with no
parameters.
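Schematically (hypothetical signatures):

    // Before: every derivative passed itself as the context.
    //   template <typename Context> void process(Context& ctx);
    //   consumer.process(consumer);
    // After: the consumer is the context, so the parameter goes away.
    //   void process();  // regular member function, uses *this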
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Now the promoted index is converted into an input_stream and skipped
over instead of being consumed immediately and stored as a single
buffer.
The only part that is read right away is the deletion time, as it is
likely to be in the already-read buffer; reading it should be cheap,
and it prevents reading the whole promoted index when only the
deletion time mark is needed.
When accessed, the promoted index is parsed in chunks, buffer by
buffer, to limit memory consumption.
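A sketch of the consumption pattern, with the stream API simplified to
a synchronous stand-in:

    #include <cstddef>
    #include <string_view>

    struct input_stream_sketch {                // stand-in for the stream
        std::string_view read_up_to(size_t n);  // next buffer, empty at EOF
    };

    // Parse the promoted index chunk by chunk: at most one buffer of it
    // is resident at a time, instead of the whole index in one allocation.
    void parse_promoted_index(input_stream_sketch& in) {
        while (true) {
            std::string_view buf = in.read_up_to(4096);
            if (buf.empty()) {
                break;
            }
            // feed buf to the block parser; blocks may span buffers
        }
    }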
Fixes #2981
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>