scylladb

Author	SHA1	Message	Date
Wojciech Mitros	7f590a3686	sstables: index_reader: optimize single partition reads All entries from a single partition can be found in a single summary page. Because of that, in cases when we know we want to read only one partition, we can limit the underlying file input_stream to the range of the page. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2022-02-22 02:16:52 +01:00
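The idea of limiting the stream to one summary page can be sketched as follows. This is only an illustration under assumptions: `summary_entry`, `page_range_for`, and the use of a plain integer token are hypothetical simplifications and not the actual `index_reader` interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical, simplified summary entry: the first token covered by a page
// and the offset in the index file at which that page starts.
struct summary_entry {
    int64_t first_token;
    uint64_t index_offset;
};

// Returns the [begin, end) byte range of the index file covered by the single
// summary page that may contain `token`.
// `entries` must be non-empty and sorted by first_token.
std::pair<uint64_t, uint64_t> page_range_for(const std::vector<summary_entry>& entries,
                                             uint64_t index_file_size,
                                             int64_t token) {
    assert(!entries.empty());
    // Find the first page that starts strictly after `token`.
    auto next = std::upper_bound(entries.begin(), entries.end(), token,
        [](int64_t t, const summary_entry& e) { return t < e.first_token; });
    if (next == entries.begin()) {
        // The token precedes every page; the first page is the only candidate.
        ++next;
    }
    auto page = std::prev(next);
    // The page ends where the next one starts, or at the end of the file.
    uint64_t end = (next == entries.end()) ? index_file_size : next->index_offset;
    return {page->index_offset, end};
}
```

A reader that knows it needs only one partition could then open the input stream over just that range rather than to the end of the file.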
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes were applied mechanically with a script, except for licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Kamil Braun	8722e0d23c	sstables: mx: enable position fast-forwarding in reverse mode Most of the machinery was already implemented since it was used when jumping between clustering ranges of a query slice. We need only perform one additional thing when performing an index skip during fast-forwarding: reset the stored range tombstone in the consumer (which may only be stored in fast-forwarding mode, so it didn't matter that it wasn't reset earlier). Comments were added to explain the details.	2021-11-29 11:10:49 +01:00
Tomasz Grabiec	cc56a971e8	database, treewide: Introduce partition_slice::is_reversed() Cleanup, reduces noise. Message-Id: <20211014093001.81479-1-tgrabiec@scylladb.com>	2021-10-14 12:39:16 +03:00
Kamil Braun	27238eaa0f	sstables: mx: implement reversed single-partition reads We use partition_reversing_data_source and the new `index_reader` methods to implement single-partition reads in `mx_sstable_mutation_reader`. The parsing logic does not need to change: the buffers returned by the source already contain rows in reversed clustering order. Some changes were required in `mp_row_consumer_m` which processes the parsed rows and emits appropriate mutation fragments. The consumer uses `mutation_fragment_filter` underneath to decide whether a fragment should be ignored or not (e.g. the parsed fragment may come from outside the requested clustering range), among other things. Previously `mutation_fragment_filter` was provided a `partition_slice`. If the slice was reversed, the filter would use `clustering_key_filter_ranges::get_ranges` to obtain the clustering ranges from the slice in unreversed order (they were reversed in the slice) since we didn't perform any reversing in the reader. Now the reader provides the ranges directly instead of the slice; furthermore, the ranges are provided in native-reversed format (the order of ranges is reversed and the ranges themselves are also reversed), and the schema provided to the filter is also reversed. Thus to the filter everything appears as if it was used during a non-reversed query but on a table with reversed schema, which works correctly given the fact that the reader is feeding parsed rows into the consumer in reversed order. During reversed queries the reader uses alternative logic for skipping to a later range (or, speaking in non-reversed terms, to an earlier range), which happens in `advance_context`. It asks the index to advance its upper bound in reverse so that the reversing_data_source notices the change of the index end position and returns following buffers with rows from the new range. There is a slight difference in behavior of the reader from `mp_row_consumer_m`'s point of view. 
For non-reversed reads, after the consumer obtains the beginning of a row (`consume_row_start`) - which contains the row's position but not the columns - and tells the reader that the row won't be emitted because we need to skip to a later range, the reader would tell the data source (the 'context') immediately to skip to a later range by calling `skip_to`. This caused the source not to return the rest of the row, and the rest of the row would not be fed to the consumer (`consume_row_end`). However, for reversed reads, the data source performs skipping 'on its own', after it notices that the index end position has changed. This may happen 'too late', causing the rest of the row to be returned anyway. We are prepared for this situation inside `mp_row_consumer` by consulting the mutation fragment filter again when the rest of the row arrives. Fast forwarding is not supported at this point, which is fine given that the cache is disabled for reversed queries for now (and the cache is the only user of fast forwarding). The `partition_slice` provided by callers is provided in 'half-reversed' format for reversed queries, where the order of clustering ranges is reversed, but the ranges themselves are not. This means we need to modify the slice sometimes: for non-single-partition queries the mx reader must use a non-reversed slice, and for single-partition queries the mx reader must use a native-reversed slice (where the clustering ranges themselves are reversed as well). The modified slice must be stored somewhere; we store it inside the mx reader itself so we don't need to allocate more intermediate readers at the call sites. This causes the interface of `mx::make_reader` to be a bit weird: for non-single-partition queries where the provided slice is reversed the reader will actually return a non-reversed stream of fragments, telling the user to reverse the stream on their own. The interface has been documented in detail with appropriate comments.	2021-10-04 15:24:12 +02:00
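The three slice formats mentioned in the commit above (non-reversed, 'half-reversed', and 'native-reversed') can be illustrated with a small sketch. The `range` type over integers and the function names are hypothetical stand-ins; the real code operates on clustering ranges inside a `partition_slice`.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for a clustering range with integer bounds.
struct range {
    int start;
    int end;
    bool operator==(const range& o) const { return start == o.start && end == o.end; }
};

// 'Half-reversed': the order of the ranges is reversed,
// but each range keeps its own start/end orientation.
std::vector<range> to_half_reversed(std::vector<range> forward) {
    std::reverse(forward.begin(), forward.end());
    return forward;
}

// 'Native-reversed': on top of the reversed order, every range has its
// bounds swapped, as if the table had a reversed schema.
std::vector<range> half_reversed_to_native_reversed(std::vector<range> half) {
    for (auto& r : half) {
        std::swap(r.start, r.end);
    }
    return half;
}

// Non-reversed: undo the reordering of a half-reversed list of ranges.
std::vector<range> half_reversed_to_forward(std::vector<range> half) {
    std::reverse(half.begin(), half.end());
    return half;
}
```

In terms of this sketch, a single-partition reversed read would use the native-reversed form, while a non-single-partition read would convert back to the forward form and leave reversing of the stream to the caller.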
Wojciech Mitros	64e703bb54	sstables: mx: introduce partition_reversing_data_source This patch adds an implementation of a data source that wraps an sstable data file and returns data buffers with contents of one partition in the sstable as if the rows of the partition were present in a reversed order. In other words, to the user of the source the partition appears to be reversed. We shall call this an 'intermediary' data source. As part of the interface of the intermediary source the user is also given read access to the source's current position over the data file, and the constructor of the source takes a reference to `index_reader`. This is necessary because the index operates directly on data file offsets and we want the user to be able to use the index to skip sequences of rows. In order to ask the source to skip a sequence of rows - e.g. when jumping between clustering ranges - the user must advance the index' upper bound in reverse (to an earlier position). The source will then notice that the end position of the index has changed and take appropriate action. An alternative would be to translate the data positions of `index_reader` to 'reversed positions' of the intermediary and then use `skip_to` for skipping, as we do for forward reads. However this solution would introduce more complexity to `index_reader` and the intermediary source. One reason for the complexity in the input stream is that we would have two kinds of skips: a single row skip, and a skip to a clustering range. We know the offset of the next row, so we could check that to differentiate them. We would also need to add an information about the position of first clustering row and end of the last one in the index_reader. Skipping by checking the index seems to be overall simpler. For simplicity, the intermediary stream always starts with parsing the partition header and (if present) the static row, and returning the corresponding bytes as a result of the first read. 
After partition header and static row we must find the last row entry of the requested range. If the range ends before the partition end (i.e. there are more row entries after the range) we can use the 'previous unfiltered size' of the row following the range; otherwise we must scan the last promoted index block and take its last row. After finding the data range of the last row, we parse rows consecutively in reversed order. We must parse the rows partially to learn their lengths and the positions of previous rows. We're using similar constructs as in the sstable parser, but it only contains a small part of the parsing coroutine and doesn't perform any correctness checks. The parser for rows still turned out rather big mostly because we can't always deduce the size of the clustering blocks without reading the block header. The parser allows reading rows while skipping their bodies also in non-reversed order, which we are making use of while reading the last promoted index block. The intermediary data source has one more utility: reversing range tombstones. When we read a tombstone bound/boundary, we modify the data buffer so that the resulting bound/boundary has the reversed kind (so we don't read ends before starts) and the boundaries have their before/after timestamps swapped.	2021-10-04 15:24:12 +02:00
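The range tombstone reversal described at the end of the commit above can be sketched like this. The enum and struct are simplified, hypothetical models (the real source rewrites bytes in the data buffer); only the mapping of kinds and the timestamp swap for boundaries are taken from the description.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Simplified model of range tombstone bound/boundary kinds.
enum class bound_kind {
    incl_start,
    excl_start,
    incl_end,
    excl_end,
    excl_end_incl_start,  // boundary: one tombstone ends, the next one starts
    incl_end_excl_start,  // boundary: one tombstone ends, the next one starts
};

// In reversed order a start becomes an end and vice versa.
bound_kind reverse_kind(bound_kind k) {
    switch (k) {
    case bound_kind::incl_start: return bound_kind::incl_end;
    case bound_kind::excl_start: return bound_kind::excl_end;
    case bound_kind::incl_end: return bound_kind::incl_start;
    case bound_kind::excl_end: return bound_kind::excl_start;
    case bound_kind::excl_end_incl_start: return bound_kind::incl_end_excl_start;
    case bound_kind::incl_end_excl_start: return bound_kind::excl_end_incl_start;
    }
    return k;  // not reached for valid enum values
}

bool is_boundary(bound_kind k) {
    return k == bound_kind::excl_end_incl_start || k == bound_kind::incl_end_excl_start;
}

// Hypothetical marker: `timestamp` belongs to the bound itself or, for a
// boundary, to the tombstone being closed; `next_timestamp` is used only by
// boundaries and belongs to the tombstone being opened.
struct tombstone_marker {
    bound_kind kind;
    int64_t timestamp;
    int64_t next_timestamp;
};

tombstone_marker reverse_marker(tombstone_marker m) {
    if (is_boundary(m.kind)) {
        // The tombstone that was opened here is now the one being closed.
        std::swap(m.timestamp, m.next_timestamp);
    }
    m.kind = reverse_kind(m.kind);
    return m;
}
```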
Botond Dénes	9548200e85	sstables: mx/reader: add crawling reader A special-purpose reader which doesn't use the index at all and hence doesn't support skipping at all. It is designed to be used in conditions in which the index is not reliable (scrub compaction).	2021-09-01 08:44:13 +03:00
Benny Halevy	4476800493	flat_mutation_reader: get rid of timeout parameter Now that the timeout is taken from the reader_permit. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-08-24 16:30:51 +03:00
Benny Halevy	f25aabf1b2	flat_mutation_reader: maybe_timed_out: use permit timeout Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-08-24 14:29:44 +03:00
Michael Livshin	f07306d75c	sstables: make sstable::make_reader() return flat_mutation_reader_v2 Rename the old version to `sstables::make_reader_v1()`, to have a nicely searchable eradication target. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-09 19:20:48 +03:00
Michael Livshin	5f9695c1b2	sstables: count read row tombstones Refs #7749. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-01 19:41:11 +03:00
Avi Kivity	42e1f318d7	Merge "Respect "bypass cache" in sstable index caching" from Tomasz " This series changes the behavior of the system when executing reads annotated with "bypass cache" clause in CQL. Such reads will not use nor populate the sstable partition index cache and sstable index page cache. " * 'bypass-cache-in-sstable-index-reads' of github.com:tgrabiec/scylla: sstables: Do not populate page cache when searching in promoted index for "bypass cache" reads sstables: Do not populate partition index cache for "bypass cache" reads	2021-07-28 18:45:39 +03:00
Wojciech Mitros	fc17c48bc9	sstables: merge consumer_m into mp_row_consumer_m The consumer_m interface has only one implementation: mp_row_consumer_m; and we're not planning other ones, so to reduce the number of inheritances, and the number of lines in the sstable reader, these classes may be combined. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-07-21 17:36:10 +02:00
Wojciech Mitros	fbb56e930c	sstables: move mp_row_consumer_m To make next patch combining consumer_m and mp_row_consumer_m more readable, move mp_row_consumer_m next to consumer_m. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-07-21 17:36:04 +02:00
Tomasz Grabiec	f4227c303b	sstables: Do not populate partition index cache for "bypass cache" reads Index cursor for reads which bypass cache will use a private temporary instance of the partition index cache. Promoted index scanner (ka/la format) will not go through the page cache.	2021-07-15 12:13:20 +02:00
Avi Kivity	1643549d08	Merge 'Coroutinize the sstable reader' from Wojciech Mitros This patch applies the same changes to both kl and mx sstable readers, but because the kl reader is old, we'll focus on the newer one. This patch makes the main sstable reader process a coroutine, allowing to simplify it, by: - using the state saved in the coroutine instead of most of the states saved in the _state variable - removing the switch statement and moving the code of former switch cases, resulting in reduced number of jumps in code - removing repetitive ifs for read statuses, by adding them to the coroutine implementation The coroutine is saved in a new class ```processing_result_generator```, which works like a generator: using its ```generate()``` method, one can order the coroutine to continue until it yields a data_consumer::processing_result value, which was achieved previously by calling the function that is now the coroutine(```do_process_state()```). Before the patch, the main processing method had 558 lines. The patch reduces this number to 345 lines. However, usage of c++ coroutines has a non-negligible effect on the performance of the sstable reader. In the test cases from ```perf_fast_forward``` the new sstable reader performs up to 2% more instructions (per fragment) than the former implementation, and this loss is achieved for cases where we're reading many subsequent rows, without any skips. Thanks to finding an optimization during the development of the patch, the loss is mitigated when we do skip rows, and for some cases, we can even observe an improvement. 
You can see the full results in attached files: [old_results.txt](https://github.com/scylladb/scylla/files/6793139/old_results.txt), [new_results.txt](https://github.com/scylladb/scylla/files/6793140/new_results.txt) Test: unit(dev) Refs: #7952 Closes #9002 * github.com:scylladb/scylla: mx sstable reader: reduce code blocks mx sstable reader: make ifs consistent sstable readers: make awaiter for read status mx sstable reader: don't yield if the data buffer is not empty mx sstable reader: combine FLAGS and FLAGS_2 states mx sstable reader: reduce placeholder state usage mx sstable reader: replace non_consuming states with a bool mx sstable reader: reduce placeholder state usage mx sstable reader: replace unnecessary states with a placeholder mx sstable reader: remove false if case mx sstable reader: remove row_body_missing_columns_label mx sstable reader: remove row_body_deletion_label mx sstable reader: remove column_end_label mx sstable reader: remove column_cell_path_label mx sstable reader: remove column_ttl_label mx sstable reader: remove column_deletion_time_label mx sstable reader: remove complex_column_2_label mx sstable reader: remove row_body_missing_columns_read_columns_label mx sstable reader: remove row_body_marker_label mx sstable reader: remove row_body_shadowable_deletion_label mx sstable reader: remove row_body_prev_size_label mx sstable reader: remove ck_block_label mx sstable reader: remove ck_block2_label mx sstable reader: remove clustering_row_label and complex_column_label mx sstable reader: remove labels with only one goto mx sstable reader: replace the switch cases with gotos and a new label mx sstable reader: remove states only reached consecutively or from goto mx sstable reader: remove switch breaks for consecutive states mx sstable reader: convert readers main method into a coroutine kl sstable reader: replace states for ending with one state, simplify non_consuming kl sstable reader: remove unnecessary states kl sstable reader: remove 
unnecessary yield kl sstable reader: remove unnecessary blocks kl sstable reader: fix indentation kl sstable reader: replace switch with standard flow control kl sstable reader: remove state::CELL case kl sstable reader: move states code only reachable from one place kl sstable reader: remove states only reached consecutively kl sstable reader: remove switch breaks for consecutive states kl sstable reader: remove unreachable case kl sstable reader: move testing hack for fragmented buffers outside the coroutine kl sstable reader: convert readers main method into a coroutine sstable readers: create a generator class for coroutines	2021-07-15 12:06:14 +03:00
Wojciech Mitros	45058776c2	mx sstable reader: reduce code blocks Some blocks of code were surrounded by curly braces, because a variable was declared inside a switch case. After changes, some of the variable declarations are in if/else/while cases, and no longer need to be in separate code blocks, while other blocks can be extended to entire labels for simplicity.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	9b333908e4	mx sstable reader: make ifs consistent In several places we're checking the return value of our consumers' consume_* calls. Because the behaviour in all cases is the same, let us use the same notation as well.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	dc38605f75	sstable readers: make awaiter for read status After each read* call of the primitive_consumer we need to check if the entire primitive was in our current buffer. We can check it in the proceed_generator object by yielding the returned read status: if the yielded status is ready, the yield_value method returns a structure whose await_ready() method returns true. Otherwise it returns false. The returned structure is co_awaited by the coroutine (due to co_yield), and if await_ready() returns true, the coroutine isn't stopped; conversely, if it returns false (technical: and because its await_suspend method returns void), the coroutine stops, and a proceed::yes value is saved, indicating that we need more buffers.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	09a0cd7c05	mx sstable reader: don't yield if the data buffer is not empty The skip() method returns a skip_bytes object if we want to skip the entire buffer, otherwise it returns a proceed::yes and trims the buffer. If the buffer is only trimmed we don't need to interrupt the coroutine, we simply continue instead.	2021-07-14 20:50:30 +02:00
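A minimal sketch of the `skip()` behaviour described in the commit above, written against its wording rather than the actual source. `std::string_view` stands in for the data buffer, and `must_yield` is a hypothetical helper showing when the coroutine would have to be interrupted.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <variant>

enum class proceed { no, yes };

// Number of bytes that still have to be skipped beyond the current buffer.
struct skip_bytes {
    std::size_t count;
};

using processing_result = std::variant<proceed, skip_bytes>;

// If the skip covers the whole buffer, drop the buffer and report how many
// bytes remain to be skipped; otherwise only trim the front of the buffer.
processing_result skip(std::string_view& buffer, std::size_t len) {
    if (len < buffer.size()) {
        buffer.remove_prefix(len);
        return proceed::yes;
    }
    std::size_t left = len - buffer.size();
    buffer = {};
    return skip_bytes{left};
}

// Only a skip past the current buffer requires leaving the parsing loop;
// after a plain trim there is still data to parse, so we can continue.
bool must_yield(const processing_result& r) {
    return std::holds_alternative<skip_bytes>(r);
}
```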
Wojciech Mitros	5dc64532bd	mx sstable reader: combine FLAGS and FLAGS_2 states We don't differentiate between FLAGS and FLAGS_2 in verify_end_state(), so we can merge them into one state.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	ab1e6f4211	mx sstable reader: reduce placeholder state usage After the changes to non_consuming states, we can remove some state::OTHER assignments again.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	c904ab12c8	mx sstable reader: replace non_consuming states with a bool The non_consuming() method is only used after assuring that primitive_consumer::active() (in continuous_data_consumer::process()) so we don't need states where primitive_consumer::active(), which is most of them. We still need to make sure that the states change when they need to, so we replace all the concerned states with the placeholder state, and for the few states from the non_consuming() OR, where the primitive_consumer::active() returns true, we set the value of _consuming to false, changing it back when the state is no longer non_consuming.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	b05d3eefed	mx sstable reader: reduce placeholder state usage We can remove state assignments that we know are changing a state to itself. Similarly, if a state is changed in the same way in an if and an else, it can be changed before the if/else instead.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	b2e3fbffd0	mx sstable reader: replace unnecessary states with a placeholder After removing the switch, the state is only used for verify_end_state() and non_consuming(), so we can replace states that are not used there with a single one, so that the state still stops being one of the appearing states when it needs to.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	9a7a8fa86c	mx sstable reader: remove false if case consume_row_marker_and_tombstone does not return proceed::no in the mp_row_consumer_m implementation, and even if it did, we would most likely want to yield proceed::no in that case as well.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	2262aac11a	mx sstable reader: remove row_body_missing_columns_label row_body_missing_columns_label is only reached from one goto, or consecutively, so the code omitted by goto can be omitted by an if instead (or else).	2021-07-14 20:50:30 +02:00
Wojciech Mitros	99b5a332db	mx sstable reader: remove row_body_deletion_label row_body_deletion_label is only reached from one goto, or consecutively, so the code omitted by goto can be omitted by an if instead (or else).	2021-07-14 20:50:30 +02:00
Wojciech Mitros	cbce22a88b	mx sstable reader: remove column_end_label column_end_label is only reached from one goto, or consecutively, so the code omitted by goto can be omitted by an if instead (or else).	2021-07-14 20:50:30 +02:00
Wojciech Mitros	925d921cb4	mx sstable reader: remove column_cell_path_label column_cell_path_label is only reached from two gotos, both at the end of an if/else block, or consecutively, so the code after the if/else block can be omitted by an if instead (or else).	2021-07-14 20:50:30 +02:00
Wojciech Mitros	e85987a439	mx sstable reader: remove column_ttl_label column_ttl_label is only reached from two gotos, both at the end of an if/else block, or consecutively, so the code after the if/else block can be omitted by an if instead (or else).	2021-07-14 20:50:30 +02:00
Wojciech Mitros	4b3607e97b	mx sstable reader: remove column_deletion_time_label column_deletion_time_label is only reached from one goto, or consecutively, so the code omitted by goto can be omitted by an if instead (or else).	2021-07-14 20:50:30 +02:00
Wojciech Mitros	8cf23c3b01	mx sstable reader: remove complex_column_2_label complex_column_2_label is only reached from one goto, or consecutively, so the code omitted by goto can be omitted by an if instead (or else).	2021-07-14 20:50:30 +02:00
Wojciech Mitros	fbe28d18f3	mx sstable reader: remove row_body_missing_columns_read_columns_label row_body_missing_columns_read_columns_label is only reached consecutively, or from a goto after the label. This is changed to a while loop starting at the label and ending at the goto. The code executed in the only case we do not reach the goto (so when exiting the loop) is moved after the while.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	3b512ea2c2	mx sstable reader: remove row_body_marker_label row_body_marker_label is only reached from one goto inside an else case, or consecutively, so the code omitted by goto can be moved inside the corresponding if case.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	0bcde69319	mx sstable reader: remove row_body_shadowable_deletion_label row_body_shadowable_deletion_label is only reached from one goto, or consecutively, so the code omitted by goto can be omitted by an if instead (or else).	2021-07-14 20:50:30 +02:00
Wojciech Mitros	3d0fdf9f3b	mx sstable reader: remove row_body_prev_size_label row_body_prev_size_label is only reached consecutively, or from a goto not far after the label. This is changed to a while loop starting at the label and ending at the goto.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	b27166c36f	mx sstable reader: remove ck_block_label ck_block_label is only reached consecutively, or from a few gotos not far after the label. This is changed to a while loop with gotos replaced with continue's.	2021-07-14 20:50:30 +02:00
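The label-to-loop rewrite described in the commit above follows a common pattern, sketched here on a toy function. Both functions and the "sum of non-empty blocks" task are hypothetical; the point is only that a label reached consecutively or by backward gotos maps onto a `while` loop with `continue`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Before: a label reached by falling into it and by gotos placed after it.
int sum_nonempty_goto(const std::vector<int>& blocks) {
    std::size_t i = 0;
    int total = 0;
block_label:
    if (i == blocks.size()) {
        return total;
    }
    if (blocks[i] == 0) {
        ++i;
        goto block_label;  // empty block: nothing to add, go to the next one
    }
    total += blocks[i];
    ++i;
    goto block_label;
}

// After: the label becomes the loop head, each backward goto a `continue`
// (or simply the end of the loop body).
int sum_nonempty_loop(const std::vector<int>& blocks) {
    std::size_t i = 0;
    int total = 0;
    while (i != blocks.size()) {
        if (blocks[i] == 0) {
            ++i;
            continue;
        }
        total += blocks[i];
        ++i;
    }
    return total;
}
```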
Wojciech Mitros	ec6c2f0e07	mx sstable reader: remove ck_block2_label ck_block2_label is only reached from one goto, or consecutively, so the code omitted by goto can be omitted by an if instead (or else).	2021-07-14 20:50:30 +02:00
Wojciech Mitros	1e59e249ec	mx sstable reader: remove clustering_row_label and complex_column_label clustering_row_label is only reached from one goto, or consecutively, so the code omitted by goto can be omitted by an if instead (or else). Also remove complex_column_label because it is next to its only goto.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	440aba61a9	mx sstable reader: remove labels with only one goto If a case is reached only after jumping with a single goto, that goto may be replaced with the target code.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	65f7eb5ada	mx sstable reader: replace the switch cases with gotos and a new label Because the number of remaining cases is moderately low, and after finishing a case we always enter another one, the switch is removed completely, and the last remaining cases are handled by 3 additional gotos and 1 new label.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	0398c68797	mx sstable reader: remove states only reached consecutively or from goto If a state is never reached from the top of the switch, but only by continuing from the previous case, we don't need to have a case: for it. Similarly, if there is a label that we goto, we don't need the switch case.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	f87b27b9e4	mx sstable reader: remove switch breaks for consecutive states If _state at the end of a switch case has the same value as the next case, instead of breaking the switch, we can just fall through.	2021-07-14 20:50:30 +02:00
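The effect of replacing a `break` with a fall-through can be seen on a toy state machine. The `parser` type and its states are hypothetical; `steps` counts how many times the switch is entered, which is the jump that the commit above avoids.

```cpp
#include <cassert>
#include <string>

enum class state { HEADER, BODY, DONE };

struct parser {
    state st = state::HEADER;
    int steps = 0;    // how many times the switch was entered
    std::string out;  // trace of the states that were processed

    // Every case breaks, so the loop re-enters the switch for each state.
    void process_with_breaks() {
        while (st != state::DONE) {
            ++steps;
            switch (st) {
            case state::HEADER:
                out += "h";
                st = state::BODY;
                break;
            case state::BODY:
                out += "b";
                st = state::DONE;
                break;
            case state::DONE:
                break;
            }
        }
    }

    // The state at the end of each case equals the next case label,
    // so falling through gives the same result with fewer jumps.
    void process_with_fallthrough() {
        while (st != state::DONE) {
            ++steps;
            switch (st) {
            case state::HEADER:
                out += "h";
                st = state::BODY;
                [[fallthrough]];
            case state::BODY:
                out += "b";
                st = state::DONE;
                [[fallthrough]];
            case state::DONE:
                break;
            }
        }
    }
};
```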
Wojciech Mitros	32b996aca5	mx sstable reader: convert readers main method into a coroutine (same as in kl sstable reader) The function is converted to a coroutine simply by adding an infinite loop around the switch, and starting another iteration after yielding a value, instead of returning. Because the coroutine resume() function does not take any arguments, a new member is introduced to remember the "data" buffer, that was previously an argument to the method.	2021-07-14 20:50:30 +02:00
Avi Kivity	99d5355007	Merge "Cache sstable indexes in memory" from Tomasz " The main goal of this series is to improve efficiency of reads from large partitions by reducing amount of I/O needed to read the sstable index. This is achieved by caching index file pages and partition index entries in memory. Currently, the pages are cached by individual reads only for the duration of the read. This was done to facilitate binary search in the promoted index (intra-partition index). After this series, all reads share the index file page cache, which stays around even after reads stop. The page cache is subject to eviction. It uses the same region as the current row cache and shares the LRU with row cache entries. This means that LRU objects need to be virtualized. This series takes an easy approach and does this by introducing a virtual base class. This adds an overhead to row cache entry to store the vtable pointer. SStable indexes have a hierarchy. There is a summary, which is a sparse partition key index into the full partition index. This one is already kept in memory. The partition index is divided by the summary into pages. Each entry in the partition index contains promoted index, which is a sparse index into atoms identified by the clustering key (rows, tombstones). In order to read the promoted index, the reader needs to read the partition index entry first. To speed this up, this series also adds caching of partition index entries. This cache survives reads and is subject to eviction, just like the index file page cache. The unit of caching is the partition index page. Without this cache, each access to promoted index would have to be preceded with the parsing of the partition index page containing the partition key. Performance testing results follow. 
1) scylla-bench large partition reads Populated with: perf_fast_forward --run-tests=large-partition-skips --datasets=sb-large-part-ds1 \ -c1 -m1G --populate --value-size=1024 --rows=10000000 Single partition, 9G data file, 4MB index file Test execution: build/release/scylla -c1 -m4G scylla-bench -workload uniform -mode read -limit 1 -concurrency 100 -partition-count 1 \ -clustering-row-count 10000000 -duration 60m TL;DR: after: 2x throughput, 0.5 median latency Before (`c1daf2bb24`): Results Time (avg): 5m21.033180213s Total ops: 966951 Total rows: 966951 Operations/s: 3011.997048812112 Rows/s: 3011.997048812112 Latency: max: 74.055679ms 99.9th: 63.569919ms 99th: 41.320447ms 95th: 38.076415ms 90th: 37.158911ms median: 34.537471ms mean: 33.195994ms After: Results Time (avg): 5m14.706669345s Total ops: 2042831 Total rows: 2042831 Operations/s: 6491.22243800942 Rows/s: 6491.22243800942 Latency: max: 60.096511ms 99.9th: 35.520511ms 99th: 27.000831ms 95th: 23.986175ms 90th: 21.659647ms median: 15.040511ms mean: 15.402076ms 2) scylla-bench small partitions I tested several scenarios with a varying data set size, e.g. data fully fitting in memory, half fitting, and being much larger. The improvement varied a bit but in all cases the "after" code performed slightly better. Below is a representative run over data set which does not fit in memory. 
scylla -c1 -m4G scylla-bench -workload uniform -mode read -concurrency 400 -partition-count 10000000 \ -clustering-row-count 1 -duration 60m -no-lower-bound Before: Time (avg): 51.072411913s Total ops: 3165885 Total rows: 3165885 Operations/s: 61988.164024260645 Rows/s: 61988.164024260645 Latency: max: 34.045951ms 99.9th: 25.985023ms 99th: 23.298047ms 95th: 19.070975ms 90th: 17.530879ms median: 3.899391ms mean: 6.450616ms After: Time (avg): 50.232410679s Total ops: 3778863 Total rows: 3778863 Operations/s: 75227.58014424688 Rows/s: 75227.58014424688 Latency: max: 37.027839ms 99.9th: 24.805375ms 99th: 18.219007ms 95th: 14.090239ms 90th: 12.124159ms median: 4.030463ms mean: 5.315111ms The results include the warmup phase which populates the partition index cache, so the hot-cache effect is dampened in the statistics. See the 99th percentile. Latency gets better after the cache warms up which moves it lower. 3) perf_fast_forward --run-tests=large-partition-skips Caching is not used here, included to show there are no regressions for the cold cache case. 
TL;DR: No significant change perf_fast_forward --run-tests=large-partition-skips --datasets=large-part-ds1 -c1 -m1G Config: rows: 10000000, value size: 2000 Before: read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 1 0 36.429822 4 10000000 274500 62 274521 274429 153889.2 153883 19696986 153853 0 0 0 0 0 0 0 22.5% 1 1 36.856236 4 5000000 135662 7 135670 135650 155652.0 155652 19704117 139326 1 0 1 1 0 0 0 38.1% 1 8 36.347667 4 1111112 30569 0 30570 30569 155652.0 155652 19704117 139071 1 0 1 1 0 0 0 19.5% 1 16 36.278866 4 588236 16214 1 16215 16213 155652.0 155652 19704117 139073 1 0 1 1 0 0 0 16.6% 1 32 36.174784 4 303031 8377 0 8377 8376 155652.0 155652 19704117 139056 1 0 1 1 0 0 0 12.3% 1 64 36.147104 4 153847 4256 0 4256 4256 155652.0 155652 19704117 139109 1 0 1 1 0 0 0 11.1% 1 256 9.895288 4 38911 3932 1 3933 3930 100869.2 100868 3178298 59944 38912 0 1 1 0 0 0 14.3% 1 1024 2.599921 4 9757 3753 0 3753 3753 26604.0 26604 801850 15071 9758 0 1 1 0 0 0 14.6% 1 4096 0.784568 4 2441 3111 1 3111 3109 7982.0 7982 205946 3772 2442 0 1 1 0 0 0 13.8% 64 1 36.553975 4 9846154 269359 10 269369 269337 155663.8 155652 19704117 139230 1 0 1 1 0 0 0 28.2% 64 8 36.509694 4 8888896 243467 8 243475 243449 155652.0 155652 19704117 139120 1 0 1 1 0 0 0 26.5% 64 16 36.466282 4 8000000 219381 4 219385 219374 155652.0 155652 19704117 139232 1 0 1 1 0 0 0 24.8% 64 32 36.395926 4 6666688 183171 6 183180 183165 155652.0 155652 19704117 139158 1 0 1 1 0 0 0 21.8% 64 64 36.296856 4 5000000 137753 4 137757 137737 155652.0 155652 19704117 139105 1 0 1 1 0 0 0 17.7% 64 256 20.590392 4 2000000 97133 18 97151 94996 135248.8 131395 7877402 98335 31282 0 1 1 0 0 0 15.7% 64 1024 6.225773 4 588288 94492 1436 95434 88748 46066.5 41321 2324378 30360 9193 0 1 1 0 0 0 15.8% 64 4096 1.856069 4 153856 82893 54 82948 82721 16115.0 16043 583674 11574 2675 0 1 1 0 0 0 16.3% After: read skip time (s) 
iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 1 0 36.429240 4 10000000 274505 38 274515 274417 153887.8 153883 19696986 153849 0 0 0 0 0 0 0 22.4% 1 1 36.933806 4 5000000 135377 15 135385 135354 155658.0 155658 19704085 139398 1 0 1 1 0 0 0 40.0% 1 8 36.419187 4 1111112 30509 2 30510 30507 155658.0 155658 19704085 139233 1 0 1 1 0 0 0 22.0% 1 16 36.353475 4 588236 16181 0 16182 16181 155658.0 155658 19704085 139183 1 0 1 1 0 0 0 19.2% 1 32 36.251356 4 303031 8359 0 8359 8359 155658.0 155658 19704085 139120 1 0 1 1 0 0 0 14.8% 1 64 36.203692 4 153847 4249 0 4250 4249 155658.0 155658 19704085 139071 1 0 1 1 0 0 0 13.0% 1 256 9.965876 4 38911 3904 0 3906 3904 100875.2 100874 3178266 60108 38912 0 1 1 0 0 0 17.9% 1 1024 2.637501 4 9757 3699 1 3700 3697 26610.0 26610 801818 15071 9758 0 1 1 0 0 0 19.5% 1 4096 0.806745 4 2441 3026 1 3027 3024 7988.0 7988 205914 3773 2442 0 1 1 0 0 0 18.3% 64 1 36.611243 4 9846154 268938 5 268942 268921 155669.8 155705 19704085 139330 2 0 1 1 0 0 0 29.9% 64 8 36.559471 4 8888896 243135 11 243156 243124 155658.0 155658 19704085 139261 1 0 1 1 0 0 0 28.1% 64 16 36.510319 4 8000000 219116 15 219126 219101 155658.0 155658 19704085 139173 1 0 1 1 0 0 0 26.3% 64 32 36.439069 4 6666688 182954 9 182964 182943 155658.0 155658 19704085 139274 1 0 1 1 0 0 0 23.2% 64 64 36.334808 4 5000000 137609 11 137612 137596 155658.0 155658 19704085 139258 2 0 1 1 0 0 0 19.1% 64 256 20.624759 4 2000000 96971 88 97059 92717 138296.0 131401 7877370 98332 31282 0 1 1 0 0 0 17.2% 64 1024 6.260598 4 588288 93967 1429 94905 88051 45939.5 41327 2324346 30361 9193 0 1 1 0 0 0 17.8% 64 4096 1.881338 4 153856 81780 140 81920 81520 16109.8 16092 582714 11617 2678 0 1 1 0 0 0 18.2% 4) perf_fast_forward --run-tests=large-partition-slicing Caching enabled, each line shows the median run from many iterations TL;DR: We can observe reduction in IO which translates to reduction in execution time, 
especially for slicing in the middle of partition.

perf_fast_forward --run-tests=large-partition-slicing --datasets=large-part-ds1 -c1 -m1G --keep-cache-across-test-cases

Config: rows: 10000000, value size: 2000

Before:

offset   read  time (s)  iterations  frags  frag/s  mad f/s  max f/s  min f/s  avg aio  aio  (KiB)  blocked  dropped  idx hit  idx miss  idx blk  c hit  c miss  c blk  allocs  tasks   insns/f    cpu
0           1  0.000491         127      1    2037       24     2109      127      4.0    4    128        2        2        0         1        1      0       0      0     157     80   3058208  15.0%
0          32  0.000561        1740     32   56995      410    60031    47208      5.0    5    160        3        2        0         1        1      0       0      0     386    111    113353  17.5%
0         256  0.002052         488    256  124736     7111   144762    89053     16.6   17    672       14        2        0         1        1      0       0      0    2113    446     52669  18.6%
0        4096  0.016437          61   4096  249199      692   252389   244995     69.4   69   8640       57        5        0         1        1      0       0      0   26638   1717     23321  22.4%
5000000     1  0.002171         221      1     461        2      466      221     25.0   25    268        3        3        0         1        1      0       0      0     638    376  14311524  10.2%
5000000    32  0.002392         404     32   13376       48    13528    13015     27.0   27    332        5        3        0         1        1      0       0      0     931    432    489691  11.9%
5000000   256  0.003659         279    256   69967      764    73130    52563     39.5   41    780       19        3        0         1        1      0       0      0    2689    825     93756  15.8%
5000000  4096  0.018592          55   4096  220313      433   234214   218803     94.2   94   9484       62        9        0         1        1      0       0      0   27349   2213     26562  21.0%

After:

offset   read  time (s)  iterations  frags  frag/s  mad f/s  max f/s  min f/s  avg aio  aio  (KiB)  blocked  dropped  idx hit  idx miss  idx blk  c hit  c miss  c blk  allocs  tasks   insns/f    cpu
0           1  0.000229         115      1    4371       85     4585      115      2.1    2     64        1        1        1         0        0      0       0      0      90     31   1314749  22.2%
0          32  0.000277        2174     32  115674     1015   128109    14144      3.0    3     96        2        1        1         0        0      0       0      0     319     62     52508  26.1%
0         256  0.001786         576    256  143298     5534   179142   113715     14.7   17    544       15        1        1         0        0      0       0      0    2110    453     45419  21.4%
0        4096  0.015498          61   4096  264289     2006   268850   259342     67.4   67   8576       59        4        1         0        0      0       0      0   26657   1738     22897  23.7%
5000000     1  0.000415         233      1    2411       15     2456      234      4.1    4    128        2        2        1         0        0      0       0      0     199     72   2644719  16.8%
5000000    32  0.000635        1413     32   50398      349    51149    46439      6.0    6    192        4        2        1         0        0      0       0      0     458    128    125893  18.6%
5000000   256  0.002028         486    256  126228     3024   146327    82559     17.8   18   1024       13        4        1         0        0      0       0      0    2123    385     51787  19.6%
5000000  4096  0.016836          61   4096  243294      814   263434   241660     73.0   73   9344       62        8        1         0        0      0       0      0   26922   1920     24389  22.4%

Future work:
- Check the impact on non-uniform workloads. Caching sstable indexes takes space away from the row cache which may reduce the hit ratio.
- Reduce memory footprint of partition index cache. Currently, about 8x bloat over the on-disk size.
- Disable cache population for "bypass cache" reads
- Add a switch to disable sstable index caching, per-node, maybe per-table
- Better sstable index format. Current format leads to inefficiency in caching since only some elements of the cached page can be hot. A B-tree index would be more efficient. Same applies to the partition index. Only some elements in the partition index page can be hot.
- Add heuristic for reducing index file IO size when large partitions are anticipated. If we're bound by disk's bandwidth it's wasteful to read the front of promoted index using 32K IO, better use 4K which should cover the partition entry and then let binary search read the rest.

In V2:
- Fixed perf_fast_forward regression in the number of IOs used to read partition index page
  The reader uses 32K reads, which were split by page cache into 4K reads
  Fix by propagating IO size hints to page cache and using single IO to populate it.
  New patch: "cached_file: Issue single I/O for the whole read range on miss"
- Avoid large allocations to store partition index page entries (due to managed_vector storage). There is a unit test which detects this and fails. Fixed by implementing chunked_managed_vector, based on chunked_vector.
- fixed bug in cached_file::evict_gently() where the wrong allocation strategy was used to free btree chunks
- Simplify region_impl::free_buf() according to Avi's suggestions
- Fit segment_kind in segment_descriptor::_free_space and lift requirement that _buf_pointers emptiness determines the kind
- Workaround sigsegv which was most likely due to coroutine miscompilation. Worked around by manipulating local object scope.
- Wire up system/drop_sstable_caches RESTful API
- Fix use-after-move on permit for the old scanning ka/la index reader
- Fixed more cases of double open_data() in tests leading to assert failure
- Adjusted cached_file class doc to account for changes in behavior.
- Rebased

Fixes #7079.
Refs #363.
"

* tag 'sstable-index-caching-v2' of github.com:tgrabiec/scylla: (39 commits)
  api: Drop sstable index caches on system/drop_sstable_caches
  cached_file: Issue single I/O for the whole read range on miss
  row_cache: cache_tracker: Do not register metrics when constructed for tests
  sstables, cached_file: Evict cache gently when sstable is destroyed
  sstables: Hide partition_index_cache implementation away from sstables.hh
  sstables: Drop shared_index_lists alias
  sstables: Destroy partition index cache gently
  sstables: Cache partition index pages in LSA and link to LRU
  utils: Introduce lsa::weak_ptr<>
  sstables: Rename index_list to partition_index_page and shared_index_lists to partition_index_cache
  sstables, cached_file: Avoid copying buffers from cache when parsing promoted index
  cached_file: Introduce get_page_units()
  sstables: read: Document that primitive_consumer::read_32() is alloc-free
  sstables: read: Count partition index page evictions
  sstables: Drop the _use_binary_search flag from index entries
  sstables: index_reader: Keep index objects under LSA
  lsa: chunked_managed_vector: Adapt more to managed_vector
  utils: lsa: chunked_managed_vector: Make LSA-aware
  test: chunked_managed_vector_test: Make exception_safe_class standard layout
  lsa: Copy chunked_vector to chunked_managed_vector
  ...	2021-07-07 18:17:10 +03:00
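
The V2 note on `cached_file` says that 32K reads were being split by the page cache into 4K reads, and that the fix issues a single I/O for the whole read range on a miss. The toy C++ sketch below illustrates that idea only. `counting_file` and `toy_page_cache` are hypothetical names invented for this illustration, not ScyllaDB classes, and fetching each run of missing pages with one request is an assumption about how such a cache could behave.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

constexpr std::size_t page_size = 4096;  // cache granularity (4K pages)

// Hypothetical stand-in for a file; it only counts the I/O requests it gets.
struct counting_file {
    std::size_t io_count = 0;
    std::vector<char> read(std::size_t /*offset*/, std::size_t len) {
        ++io_count;
        return std::vector<char>(len, 0);
    }
};

// Hypothetical page cache keyed by page index.
class toy_page_cache {
    counting_file& _file;
    std::map<std::size_t, std::vector<char>> _pages;
public:
    explicit toy_page_cache(counting_file& f) : _file(f) {}

    // Makes sure every page covering [offset, offset + len) is cached.
    // Each maximal run of missing pages is fetched with ONE read,
    // instead of one read per 4K page.
    void populate(std::size_t offset, std::size_t len) {
        if (len == 0) {
            return;
        }
        const std::size_t first = offset / page_size;
        const std::size_t last = (offset + len - 1) / page_size;
        std::size_t p = first;
        while (p <= last) {
            if (_pages.count(p)) {  // already cached, nothing to read
                ++p;
                continue;
            }
            std::size_t run_end = p;  // extend over consecutive missing pages
            while (run_end < last && !_pages.count(run_end + 1)) {
                ++run_end;
            }
            auto buf = _file.read(p * page_size, (run_end - p + 1) * page_size);
            for (std::size_t i = p; i <= run_end; ++i) {
                auto begin = buf.begin() + static_cast<std::ptrdiff_t>((i - p) * page_size);
                _pages.emplace(i, std::vector<char>(begin, begin + page_size));
            }
            p = run_end + 1;
        }
    }

    std::size_t cached_pages() const { return _pages.size(); }
};
```

In this model, populating a cold 32K range costs one request rather than eight, and a later overlapping read only pays for the pages that are still missing.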
Tomasz Grabiec	2b673478aa	sstables: index_reader: Do not expose index_entry references index_entry will be an LSA-managed object. Those have to be accessed with care, with the LSA region locked. This patch hides most of direct index_entry accesses inside the index_reader so that users are safe.	2021-07-02 19:02:13 +02:00
Raphael S. Carvalho	ef76cdb2c7	sstables: Attach sstable name to exception triggered in sstable mutation reader When compaction fails due to a failure that comes from a specific sstable, like on data corruption, the log isn't telling which sstable contributed to that. Let's always attach the sstable name to the exception triggered in sstable mutation reader. Exceptions in la and mx consumer attached sst name, but now only sst mutation reader will do it so as to avoid duplicating the sst name. Now: ERROR 2021-06-11 16:07:34,489 [shard 0] compaction_manager - compaction failed: sstables::malformed_sstable_exception (Failed to read partition from SSTable /home/.../md-74-big-Data.db due to compressed chunk of size 3735 at file offset 406491 failed checksum, expected=0, actual=1422312584): retrying Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-06-28 12:54:24 -03:00
Benny Halevy	9bbe7b1482	sstables: mx_sstable_mutation_reader: enforce timeout Check if the timeout has expired before issuing I/O. Note that the sstable reader input_stream is not closed when the timeout is detected. The reader must be closed anyhow after the error bubbles up the chain of readers and before the reader is destroyed. This might already happen if the reader times out while waiting for reader_concurrency_semaphore admission. Test: unit(dev), auth_test.test_alter_with_timeouts(debug) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210624073232.551735-1-bhalevy@scylladb.com>	2021-06-24 12:26:57 +02:00
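
The timeout commit above checks the deadline before issuing I/O and leaves closing the reader to the caller. A minimal standard-C++ sketch of that pattern, under stated assumptions, could look like this. `toy_reader` and `timed_out_error` are hypothetical names, and the real reader is asynchronous (Seastar-based), which this synchronous sketch does not model.

```cpp
#include <cassert>
#include <chrono>
#include <stdexcept>

using clock_type = std::chrono::steady_clock;

// Hypothetical error type used only in this sketch.
struct timed_out_error : std::runtime_error {
    timed_out_error() : std::runtime_error("read timed out") {}
};

// Hypothetical reader that refuses to start new I/O past its deadline.
class toy_reader {
    clock_type::time_point _deadline;
    int _ios_issued = 0;
public:
    explicit toy_reader(clock_type::time_point deadline) : _deadline(deadline) {}

    // The deadline is checked BEFORE any I/O is issued. On timeout nothing
    // is closed here; the caller is still responsible for closing the reader.
    void read_next(clock_type::time_point now = clock_type::now()) {
        if (now >= _deadline) {
            throw timed_out_error();
        }
        ++_ios_issued;  // stands in for issuing the actual read
    }

    int ios_issued() const { return _ios_issued; }
};
```

Passing `now` as a parameter is a choice made for this sketch so the behavior can be exercised deterministically without sleeping.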
Tomasz Grabiec	a4275cf8bc	sstables: Switch the mx reader to flat_mutation_reader_v2 The main difficulty was in making sure that emitted range tombstone changes reflect range tombstones trimmed to clustering restrictions. This is handled by mutation_fragment_filter and clustering_ranges_walker. They return a list of range_tombstone_change fragments to emit for each hop as the reader walks over the clustering domain. Tests which were using a normalizing reader expected range tombstones to be split around rows. Drop this and adjust the tests accordingly. No reader splits range tombstones around rows now.	2021-06-16 00:23:49 +02:00


58 Commits