scylladb

Author	SHA1	Message	Date
Avi Kivity	585c0841c3	Merge 'sstables: enable read ahead for the partition index reader' from Wojciech Mitros
Currently, when advancing one of `index_reader`'s bounds, we're creating a new `index_consume_entry_context` with a new underlying file `input_stream` for each new page. For either bound, the streams can be reused, because the indexes of pages that we are reading are never decreasing.
This patch adds an `index_consume_entry_context` to each of `index_reader`'s bounds, so that for each new page, the same file `input_stream` is used. As a result, when reading consecutive pages, the reads that follow the first one can be satisfied by the `input_stream`'s read aheads, decreasing the number of blocking reads and increasing the throughput of the `index_reader`.
Additionally, we're reusing the `index_consumer` for all pages, calling `index_consumer::prepare` when we need to increase the size of the `_entries` `chunked_managed_vector`.
A big difference can be seen when we're reading the entire table while frequently skipping a few rows, which we can test using perf_fast_forward:
Before:
```
running: small-partition-skips on dataset small-part
Testing scanning small partitions with skips. Reads whole range interleaving reads with skips according to read-skip pattern:
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
-> 1 0 0.899447 4 1000000 1111794 12284 1113248 1096537 975.5 972 124356 1 0 0 0 0 0 0 0 12032202 29103 8967 100.0%
-> 1 1 1.805811 4 500000 276884 907 278214 275977 3655.8 3654 135084 2688 0 3161 4548 5935 0 0 0 7225100 140466 27010 75.6%
-> 1 8 0.927339 4 111112 119818 357 120465 119461 3654.0 3654 135084 2685 0 2133 4548 6963 0 0 0 1749663 107922 57502 50.2%
-> 1 16 0.790630 4 58824 74401 782 74617 73497 3654.0 3654 135084 2695 0 1975 4548 7121 0 0 0 1019189 109349 90832 42.7%
-> 1 32 0.717235 4 30304 42251 243 42266 41975 3654.0 3654 135084 2689 0 1871 4548 7225 0 0 0 619876 109199 156751 37.3%
-> 1 64 0.681624 4 15385 22571 244 22815 22286 3654.0 3654 135084 2685 0 1870 4548 7226 0 0 0 407671 105798 285688 34.0%
-> 1 256 0.630439 4 3892 6173 24 6214 6150 3549.0 3549 135116 2581 0 1313 3927 6505 0 0 0 232541 100803 1022454 29.1%
-> 1 1024 0.313303 4 976 3115 219 3126 2766 1956.0 1956 130608 986 0 0 987 1962 0 0 0 81165 41385 1724979 29.1%
-> 1 4096 0.083688 4 245 2928 85 3012 2134 738.8 737 17212 492 244 0 247 491 0 0 0 30500 19406 1999263 24.6%
-> 64 1 1.509011 4 984616 652491 2746 660930 649745 3673.5 3654 135084 2687 0 4507 4548 4589 0 0 0 11075882 117074 13157 68.9%
-> 64 8 1.424147 4 888896 624160 4446 625675 617713 3654.0 3654 135084 2691 0 4248 4548 4848 0 0 0 10019098 117383 13700 66.5%
-> 64 16 1.343276 4 800000 595559 5834 605880 589725 3654.0 3654 135084 2698 0 3989 4548 5107 0 0 0 9043830 124022 14206 64.9%
-> 64 32 1.249721 4 666688 533469 5056 536638 526212 3654.0 3654 135084 2688 0 3616 4548 5480 0 0 0 7570848 123043 15377 60.9%
-> 64 64 1.154549 4 500032 433097 10215 443312 415001 3654.0 3654 135084 2703 0 3161 4548 5935 0 0 0 5718758 110657 17787 53.2%
-> 64 256 1.005309 4 200000 198944 1179 199338 196989 3935.0 3935 137216 2966 0 690 4048 5592 0 0 0 2398359 110510 27855 51.3%
-> 64 1024 0.441913 4 58880 133239 8094 135471 120467 2161.0 2161 131820 1190 0 0 1192 1848 0 0 0 725092 45449 33740 59.7%
-> 64 4096 0.124826 4 15424 123564 5958 126814 95101 795.5 794 17400 553 240 0 312 482 0 0 0 199943 20869 46621 41.9%
```
After:
```
running: small-partition-skips on dataset small-part
Testing scanning small partitions with skips. Reads whole range interleaving reads with skips according to read-skip pattern:
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
-> 1 0 0.917468 4 1000000 1089956 1422 1091378 1073112 975.5 972 124356 1 0 0 0 0 0 0 0 12032761 29721 8972 100.0%
-> 1 1 1.311446 4 500000 381259 3212 384470 377238 1087.0 1083 138420 2 0 4445 4548 4651 0 0 0 7096216 55681 20869 100.0%
-> 1 8 0.467975 4 111112 237432 1446 239372 235985 1121.2 1119 143124 9 0 4344 4548 4752 0 0 0 1619944 23502 28844 98.7%
-> 1 16 0.337085 4 58824 174508 3410 178451 171099 1117.5 1120 143276 11 0 4319 4548 4777 0 0 0 883692 19152 37460 96.8%
-> 1 32 0.262798 4 30304 115313 1222 116535 112400 1070.2 1066 135620 166 26 4354 4548 4742 0 0 0 483185 18856 54275 94.9%
-> 1 64 0.283954 4 15385 54181 531 56177 53650 2022.5 2040 137036 319 19 4351 4548 4745 0 0 0 292766 32998 102276 84.9%
-> 1 256 0.207020 4 3892 18800 575 19105 17520 1315.5 1334 136072 418 24 3703 3927 4115 0 0 0 118400 27427 292146 82.1%
-> 1 1024 0.164396 4 976 5937 57 5993 5842 1208.2 1195 135384 568 14 932 987 1030 0 0 0 62999 27554 503559 70.0%
-> 1 4096 0.085079 4 245 2880 108 2987 2714 635.8 634 26468 248 246 233 247 258 0 0 0 31264 12872 1546404 37.4%
-> 64 1 1.073331 4 984616 917346 7614 923983 909314 1812.2 1824 136792 11 20 4544 4548 4552 0 0 0 10971661 54538 9919 99.6%
-> 64 8 1.024389 4 888896 867733 6327 870429 845215 3027.2 3072 138212 31 0 4523 4548 4573 0 0 0 9933078 68059 10050 99.5%
-> 64 16 0.978754 4 800000 817366 7802 827665 809564 3012.2 3008 139884 39 0 4486 4548 4610 0 0 0 8947041 64050 10302 98.1%
-> 64 32 0.837266 4 666688 796267 10312 806579 785370 2275.8 2266 139672 29 0 4465 4548 4631 0 0 0 7458644 50754 10564 97.8%
-> 64 64 0.645627 4 500032 774490 4713 779203 768432 1136.8 1137 145428 8 0 4438 4548 4658 0 0 0 5593168 29982 10938 98.4%
-> 64 256 0.386192 4 200000 517877 22509 544067 495368 1134.8 1136 145300 109 0 2135 4048 4147 0 0 0 2270291 22840 13682 94.5%
-> 64 1024 0.238617 4 58880 246755 55856 305110 190899 1176.0 1118 135324 451 13 625 1192 1223 0 0 0 701262 24418 17323 71.1%
-> 64 4096 0.133340 4 15424 115674 14837 117978 99072 974.0 961 27132 366 347 99 312 383 0 0 0 209595 20657 43096 50.4%
```
For single partition reads, the index_reader is modified to behave in practically the same way as before the change (not reading ahead past the page with the partition). For example, a single partition read from a table with 10 rows per partition performs a single 6KB read from the index file, and the same read is performed before the change (as can be seen in the traces below). If we enabled read aheads in that case, we would perform 2 16KB reads.
Relevant traces:
Before:
```
./tmp/data/ks/t2-75ebed30eb0211eb837a8f4cd3d1cf62/md-1-big-Index.db: scheduling bulk DMA read of size 6478 at offset 0 [shard 0] | 2021-07-23 15:22:25.847362 | 127.0.0.1 | 148 | 127.0.0.1
./tmp/data/ks/t2-75ebed30eb0211eb837a8f4cd3d1cf62/md-1-big-Index.db: finished bulk DMA read of size 6478 at offset 0, successfully read 6478 bytes [shard 0] | 2021-07-23 15:22:25.900996 | 127.0.0.1 | 53782 | 127.0.0.1
```
After:
```
./tmp/data/ks/t2-75ebed30eb0211eb837a8f4cd3d1cf62/md-1-big-Index.db: scheduling bulk DMA read of size 6478 at offset 0 [shard 0] | 2021-07-23 15:19:37.380033 | 127.0.0.1 | 149 | 127.0.0.1
./tmp/data/ks/t2-75ebed30eb0211eb837a8f4cd3d1cf62/md-1-big-Index.db: finished bulk DMA read of size 6478 at offset 0, successfully read 6478 bytes [shard 0] | 2021-07-23 15:19:37.433662 | 127.0.0.1 | 53777 | 127.0.0.1
```
Tests: unit(dev)
Closes #9063
* github.com:scylladb/scylla:
  sstables: index_reader: optimize single partition reads
  sstables: use read-aheads in the index reader
  sstables: index_reader: remove unused members from index reader context	2022-03-21 13:47:28 +02:00
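The effect of reusing one stream per bound can be illustrated with a toy model. This is only a sketch under stated assumptions: `toy_stream` is a hypothetical stand-in and not Seastar's `input_stream`, pages are treated as equally sized units, and a blocking read is assumed to also make the next `read_ahead` pages available.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical toy model (NOT Seastar's input_stream): a blocking read of
// page p is assumed to also prefetch the following `read_ahead` pages.
class toy_stream {
    std::size_t _read_ahead;
    std::size_t _buffered_begin = 0;  // first buffered page
    std::size_t _buffered_end = 0;    // one past the last buffered page
    std::size_t _blocking_reads = 0;
public:
    explicit toy_stream(std::size_t read_ahead) : _read_ahead(read_ahead) {}

    void read_page(std::size_t page) {
        if (page >= _buffered_begin && page < _buffered_end) {
            return;  // satisfied by an earlier read-ahead, no blocking read
        }
        ++_blocking_reads;
        _buffered_begin = page;
        _buffered_end = page + 1 + _read_ahead;
    }

    std::size_t blocking_reads() const { return _blocking_reads; }
};

// Old behaviour in this model: a fresh stream for every page.
std::size_t reads_with_new_stream_per_page(const std::vector<std::size_t>& pages,
                                           std::size_t read_ahead) {
    std::size_t total = 0;
    for (auto p : pages) {
        toy_stream s(read_ahead);
        s.read_page(p);
        total += s.blocking_reads();
    }
    return total;
}

// New behaviour in this model: one stream reused for non-decreasing pages.
std::size_t reads_with_reused_stream(const std::vector<std::size_t>& pages,
                                     std::size_t read_ahead) {
    toy_stream s(read_ahead);
    for (auto p : pages) {
        s.read_page(p);
    }
    return s.blocking_reads();
}
```

In this model, reading eight consecutive pages with a read-ahead of three pages blocks eight times with a new stream per page and twice with a reused stream. The real gain depends on buffer sizes and skip patterns, as the perf_fast_forward results in the merge message show.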
Mikołaj Sielużycki	1d84a254c0	flat_mutation_reader: Split readers by file and remove unnecessary includes. The flat_mutation_reader files were conflated and contained multiple readers, which were not strictly necessary. Splitting improves iterative compilation times, as touching rarely used readers no longer recompiles large chunks of the codebase. Total compilation times also improve, as the sizes of flat_mutation_reader.hh and flat_mutation_reader_v2.hh have been reduced and those files are included by many files in the codebase. With changes: real 29m14.051s user 168m39.071s sys 5m13.443s. Without changes: real 30m36.203s user 175m43.354s sys 5m26.376s. Closes #10194	2022-03-14 13:20:25 +02:00
Botond Dénes	105bf8888a	sstables: convert mx writer to v2 The sstables::sstable class has two methods for writing sstables: 1) sstable_writer get_writer(...); 2) future<> write_components(flat_mutation_reader, ...); (1) directly exposes the writer type, so we have to update all users of it (there are not that many) in this same patch. We defer updating users of (2) to follow-up commits.	2022-03-10 07:03:49 +02:00
Botond Dénes	11adb404c6	sstables/metadata_collector: use position_in_partition for min/max keys Instead of naked clustering keys. Working with the latter is dangerous because it cannot accurately represent the entire clustering domain: it cannot represent positions between (before/after) keys. For this reason the metadata collector had a separate update_min_max_components() overload for range tombstones because the positions of these cannot be represented by clustering keys alone. Moving to position_in_partition solves this problem and it is now enough to have a single overload with position_in_partition_view. This is also more future proof as it will work with range tombstone changes without any additional changes.	2022-03-10 07:03:49 +02:00
Wojciech Mitros	7f590a3686	sstables: index_reader: optimize single partition reads All entries from a single partition can be found in a single summary page. Because of that, in cases when we know we want to read only one partition, we can limit the underlying file input_stream to the range of the page. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2022-02-22 02:16:52 +01:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes were applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Tomasz Grabiec	3226c5bf9d	Merge 'sstables: mx: enable position fast-forwarding in reverse mode' from Kamil Braun
Most of the machinery was already implemented since it was used when jumping between clustering ranges of a query slice. We need only perform one additional thing when performing an index skip during fast-forwarding: reset the stored range tombstone in the consumer (which may only be stored in fast-forwarding mode, so it didn't matter that it wasn't reset earlier). Comments were added to explain the details.
As a preparation for the change, we extend the sstable reversing reader random schema test with a fast-forwarding test and include some minor fixes.
Fixes #9427.
Closes #9484
* github.com:scylladb/scylla:
  query-request: add comment about clustering ranges with non-full prefix key bounds
  sstables: mx: enable position fast-forwarding in reverse mode
  test: sstable_conforms_to_mutation_source_test: extend `test_sstable_reversing_reader_random_schema` with fast-forwarding
  test: sstable_conforms_to_mutation_source_test: fix `vector::erase` call
  test: mutation_source_test: extract `forwardable_reader_to_mutation` function
  test: random_schema: fix clustering column printing in `random_schema::cql`	2021-11-29 16:01:53 +01:00
Kamil Braun	8722e0d23c	sstables: mx: enable position fast-forwarding in reverse mode Most of the machinery was already implemented since it was used when jumping between clustering ranges of a query slice. We need only perform one additional thing when performing an index skip during fast-forwarding: reset the stored range tombstone in the consumer (which may only be stored in fast-forwarding mode, so it didn't matter that it wasn't reset earlier). Comments were added to explain the details.	2021-11-29 11:10:49 +01:00
Avi Kivity	4d7a013e94	sstables: mx: writer: make large partition stats accounting branch-free It is bad form to introduce branches just for statistics, since branches can be expensive (even when perfectly predictable, they consume branch history resources). Switch to simple addition instead; this should not cause any cache misses since we already touch other statistics earlier. The inputs are already boolean, but cast them to boolean just so it is clear we're adding 0/1, not a count. Closes #9626	2021-11-15 11:28:48 +02:00
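The branch-free accounting described above can be sketched as follows. The struct and field names are hypothetical illustrations and are not taken from the Scylla writer; only the technique (adding a boolean as 0 or 1 instead of branching on it) comes from the commit message.

```cpp
#include <cstdint>

// Hypothetical statistics holder; names are illustrative only.
struct large_data_stats {
    uint64_t partitions_too_large = 0;
    uint64_t partitions_with_too_many_rows = 0;
};

// Branching version: one conditional per statistic.
inline void account_branchy(large_data_stats& s, bool too_large, bool too_many_rows) {
    if (too_large) {
        ++s.partitions_too_large;
    }
    if (too_many_rows) {
        ++s.partitions_with_too_many_rows;
    }
}

// Branch-free version: add the flag itself. The explicit bool cast documents
// that the addend is 0 or 1, not a count.
inline void account_branch_free(large_data_stats& s, bool too_large, bool too_many_rows) {
    s.partitions_too_large += uint64_t(bool(too_large));
    s.partitions_with_too_many_rows += uint64_t(bool(too_many_rows));
}
```

Both functions produce the same counters. The second always performs the additions, which is cheap when the counters are already in cache.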
Michael Livshin	a7511cf600	system keyspace: record partitions with too many rows Add "rows" field to system.large_partitions. Add partitions to the table when they are too large or have too many rows. Fixes #9506 Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Closes #9577	2021-11-14 14:25:18 +02:00
Tomasz Grabiec	cc56a971e8	database, treewide: Introduce partition_slice::is_reversed() Cleanup, reduces noise. Message-Id: <20211014093001.81479-1-tgrabiec@scylladb.com>	2021-10-14 12:39:16 +03:00
Kamil Braun	339b9bc38a	sstables: mx: partition_reversing_data_source: close internal data consumers `partition_reversing_data_source` uses `continuous_data_consumer`s internally (`partition_header_context`, `row_body_skipping_context`) which hold `input_stream`s opened to sstable data files. These `input_stream`s must be closed before destruction. Right now they would sometimes cause "Assertion `_reads_in_progress == 0' failed" on destruction. Close the `continuous_data_consumer`s before they are destroyed so they can close their `input_stream`s. Fixes #9444. Closes #9451	2021-10-11 12:35:54 +02:00
Kamil Braun	27238eaa0f	sstables: mx: implement reversed single-partition reads We use partition_reversing_data_source and the new `index_reader` methods to implement single-partition reads in `mx_sstable_mutation_reader`. The parsing logic does not need to change: the buffers returned by the source already contain rows in reversed clustering order. Some changes were required in `mp_row_consumer_m` which processes the parsed rows and emits appropriate mutation fragments. The consumer uses `mutation_fragment_filter` underneath to decide whether a fragment should be ignored or not (e.g. the parsed fragment may come from outside the requested clustering range), among other things. Previously `mutation_fragment_filter` was provided a `partition_slice`. If the slice was reversed, the filter would use `clustering_key_filter_ranges::get_ranges` to obtain the clustering ranges from the slice in unreversed order (they were reversed in the slice) since we didn't perform any reversing in the reader. Now the reader provides the ranges directly instead of the slice; furthermore, the ranges are provided in native-reversed format (the order of ranges is reversed and the ranges themselves are also reversed), and the schema provided to the filter is also reversed. Thus to the filter everything appears as if it was used during a non-reversed query but on a table with reversed schema, which works correctly given the fact that the reader is feeding parsed rows into the consumer in reversed order. During reversed queries the reader uses alternative logic for skipping to a later range (or, speaking in non-reversed terms, to an earlier range), which happens in `advance_context`. It asks the index to advance its upper bound in reverse so that the reversing_data_source notices the change of the index end position and returns following buffers with rows from the new range. There is a slight difference in behavior of the reader from `mp_row_consumer_m`'s point of view. 
For non-reversed reads, after the consumer obtains the beginning of a row (`consume_row_start`) - which contains the row's position but not the columns - and tells the reader that the row won't be emitted because we need to skip to a later range, the reader would tell the data source (the 'context') immediately to skip to a later range by calling `skip_to`. This caused the source not to return the rest of the row, and the rest of the row would not be fed to the consumer (`consume_row_end`). However, for reversed reads, the data source performs skipping 'on its own', after it notices that the index end position has changed. This may happen 'too late', causing the rest of the row to be returned anyway. We are prepared for this situation inside `mp_row_consumer` by consulting the mutation fragment filter again when the rest of the row arrives. Fast forwarding is not supported at this point, which is fine given that the cache is disabled for reversed queries for now (and the cache is the only user of fast forwarding). The `partition_slice` provided by callers is provided in 'half-reversed' format for reversed queries, where the order of clustering ranges is reversed, but the ranges themselves are not. This means we need to modify the slice sometimes: for non-single-partition queries the mx reader must use a non-reversed slice, and for single-partition queries the mx reader must use a native-reversed slice (where the clustering ranges themselves are reversed as well). The modified slice must be stored somewhere; we store it inside the mx reader itself so we don't need to allocate more intermediate readers at the call sites. This causes the interface of `mx::make_reader` to be a bit weird: for non-single-partition queries where the provided slice is reversed the reader will actually return a non-reversed stream of fragments, telling the user to reverse the stream on their own. The interface has been documented in detail with appropriate comments.	
2021-10-04 15:24:12 +02:00
Wojciech Mitros	64e703bb54	sstables: mx: introduce partition_reversing_data_source This patch adds an implementation of a data source that wraps an sstable data file and returns data buffers with contents of one partition in the sstable as if the rows of the partition were present in a reversed order. In other words, to the user of the source the partition appears to be reversed. We shall call this an 'intermediary' data source. As part of the interface of the intermediary source the user is also given read access to the source's current position over the data file, and the constructor of the source takes a reference to `index_reader`. This is necessary because the index operates directly on data file offsets and we want the user to be able to use the index to skip sequences of rows. In order to ask the source to skip a sequence of rows - e.g. when jumping between clustering ranges - the user must advance the index' upper bound in reverse (to an earlier position). The source will then notice that the end position of the index has changed and take appropriate action. An alternative would be to translate the data positions of `index_reader` to 'reversed positions' of the intermediary and then use `skip_to` for skipping, as we do for forward reads. However this solution would introduce more complexity to `index_reader` and the intermediary source. One reason for the complexity in the input stream is that we would have two kinds of skips: a single row skip, and a skip to a clustering range. We know the offset of the next row, so we could check that to differentiate them. We would also need to add an information about the position of first clustering row and end of the last one in the index_reader. Skipping by checking the index seems to be overall simpler. For simplicity, the intermediary stream always starts with parsing the partition header and (if present) the static row, and returning the corresponding bytes as a result of the first read. 
After partition header and static row we must find the last row entry of the requested range. If the range ends before the partition end (i.e. there are more row entries after the range) we can use the 'previous unfiltered size' of the row following the range; otherwise we must scan the last promoted index block and take its last row. After finding the data range of the last row, we parse rows consecutively in reversed order. We must parse the rows partially to learn their lengths and the positions of previous rows. We're using similar constructs as in the sstable parser, but it only contains a small part of the parsing coroutine and doesn't perform any correctness checks. The parser for rows still turned out rather big mostly because we can't always deduce the size of the clustering blocks without reading the block header. The parser allows reading rows while skipping their bodies also in non-reversed order, which we are making use of while reading the last promoted index block. The intermediary data source has one more utility: reversing range tombstones. When we read a tombstone bound/boundary, we modify the data buffer so that the resulting bound/boundary has the reversed kind (so we don't read ends before starts) and the boundaries have their before/after timestamps swapped.	2021-10-04 15:24:12 +02:00
Wojciech Mitros	8385f3eb21	sstables: index_reader: add support for iterating over clustering ranges in reverse In the sstable reader, we iterate over clustering ranges using the index_reader, which normally only accepts advancing to increasing positions. In this patch we add methods for advancing the index reader in reverse. To simplify our job we restrict our attention to a single implementation of the promoted index block cursor: `bsearch_clustered_cursor`. The `index_reader` methods for advancing in reverse will thus assume that this implementation is used. The assumption is correct given that we're working only with sstables of versions >= mc, which is indeed the intended use case. We add some documentation in appropriate places to make this obvious. We extend `bsearch_clustered_cursor` with two methods: `advance_past(pos)`, which advances the cursor to the first block after `pos` (or to the end if there is no such block), and `last_block_offset()`, which returns the data file offset of the first row from the last promoted index block. To efficiently find the position in the data file of the last row of the partition (which we need when performing a reversed query) the sstable reader may need to read the span of the entire last promoted index block in the data file. To learn where the block starts it can use `index_reader::last_block_offset()`, which is implemented in terms of `bsearch_clustered_cursor::last_block_offset()`. When performing a single partition read in forward order, the reader asks the index to position its lower bound at the start of the partition and its upper bound after the end of the slice. It starts by reading the first range. After exhausting a range it jumps to the next one by asking the index to advance the lower bound. For reverse single partition reads we'll take a similar approach: the initial bound positions are as in the forward case. 
However, we start with the last range and after exhausting a range we want to jump to a previous one; we will do it by advancing the upper bound in reverse (i.e. moving it closer to the beginning of the partition). For this we introduce the `index_reader::advance_reverse` function.	2021-10-04 15:24:12 +02:00
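The two cursor operations described in this commit can be modelled with a binary search over block start positions. This is a simplified sketch and not `bsearch_clustered_cursor`: positions are plain integers instead of clustering positions, the type `toy_cursor` and the method name `first_block_after` are hypothetical, and the lookup returns an offset rather than mutating cursor state. "First block after `pos`" is assumed here to mean the first block whose start is strictly greater than `pos`.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical promoted index block: first position and data file offset.
struct block {
    int first_pos;
    uint64_t data_offset;
};

class toy_cursor {
    std::vector<block> _blocks;  // assumed sorted by first_pos
public:
    explicit toy_cursor(std::vector<block> blocks) : _blocks(std::move(blocks)) {}

    // Models advance_past(pos): data offset of the first block starting
    // strictly after pos, or nullopt if there is no such block.
    std::optional<uint64_t> first_block_after(int pos) const {
        auto it = std::upper_bound(_blocks.begin(), _blocks.end(), pos,
            [](int p, const block& b) { return p < b.first_pos; });
        if (it == _blocks.end()) {
            return std::nullopt;
        }
        return it->data_offset;
    }

    // Models last_block_offset(): data offset of the first row of the last block.
    std::optional<uint64_t> last_block_offset() const {
        if (_blocks.empty()) {
            return std::nullopt;
        }
        return _blocks.back().data_offset;
    }
};
```

In a reversed read, moving the upper bound to an earlier position corresponds to calling the lookup with a smaller `pos`, which yields an offset closer to the beginning of the partition.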
Avi Kivity	daf028210b	build: enable -Winconsistent-missing-override warning This warning can catch a virtual function that thinks it overrides another, but doesn't, because the two functions have different signatures. This isn't very likely since most of our virtual functions override pure virtuals, but it's still worth having. Enable the warning and fix numerous violations. Closes #9347	2021-09-15 12:55:54 +03:00
Botond Dénes	9548200e85	sstables: mx/reader: add crawling reader A special-purpose reader which doesn't use the index at all and hence doesn't support skipping at all. It is designed to be used in conditions in which the index is not reliable (scrub compaction).	2021-09-01 08:44:13 +03:00
Benny Halevy	4476800493	flat_mutation_reader: get rid of timeout parameter Now that the timeout is taken from the reader_permit. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-08-24 16:30:51 +03:00
Benny Halevy	f25aabf1b2	flat_mutation_reader: maybe_timed_out: use permit timeout Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-08-24 14:29:44 +03:00
Benny Halevy	fe479aca1d	reader_permit: add timeout member To replace the timeout parameter passed to flat_mutation_reader methods. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-08-24 14:29:44 +03:00
Michael Livshin	f07306d75c	sstables: make sstable::make_reader() return flat_mutation_reader_v2 Rename the old version to `sstables::make_reader_v1()`, to have a nicely searchable eradication target. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-09 19:20:48 +03:00
Michael Livshin	5f9695c1b2	sstables: count read row tombstones Refs #7749. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-01 19:41:11 +03:00
Avi Kivity	42e1f318d7	Merge "Respect "bypass cache" in sstable index caching" from Tomasz
"
This series changes the behavior of the system when executing reads annotated with the "bypass cache" clause in CQL. Such reads will neither use nor populate the sstable partition index cache or the sstable index page cache.
"
* 'bypass-cache-in-sstable-index-reads' of github.com:tgrabiec/scylla:
  sstables: Do not populate page cache when searching in promoted index for "bypass cache" reads
  sstables: Do not populate partition index cache for "bypass cache" reads	2021-07-28 18:45:39 +03:00
Wojciech Mitros	fc17c48bc9	sstables: merge consumer_m into mp_row_consumer_m The consumer_m interface has only one implementation: mp_row_consumer_m; and we're not planning other ones, so to reduce the number of inheritances, and the number of lines in the sstable reader, these classes may be combined. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-07-21 17:36:10 +02:00
Wojciech Mitros	fbb56e930c	sstables: move mp_row_consumer_m To make the next patch, which combines consumer_m and mp_row_consumer_m, more readable, move mp_row_consumer_m next to consumer_m. Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>	2021-07-21 17:36:04 +02:00
Botond Dénes	5aa733f933	sstables/mx/writer: initialize _range_tombstones at the end of the ctor We need a permit to initialize said object, which marks the semaphore as used and hence triggers an error if an exception is thrown in the constructor. Move the initialization to the end of the constructor to prevent this. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210719040449.9202-1-bdenes@scylladb.com>	2021-07-19 11:43:00 +03:00
Tomasz Grabiec	21f1a7be8b	sstables: Do not populate page cache when searching in promoted index for "bypass cache" reads Reads which bypass cache will use a private temporary instance of cached_file which dies together with the index cursor. The cursor still needs a cached_file with a caching layer. Binary searching needs caching for performance, since some of the pages will be reused. Another reason to still use cached_file is to work with a common interface, and reusing it requires minimal changes.	2021-07-15 12:14:28 +02:00
Tomasz Grabiec	f4227c303b	sstables: Do not populate partition index cache for "bypass cache" reads Index cursor for reads which bypass cache will use a private temporary instance of the partition index cache. Promoted index scanner (ka/la format) will not go through the page cache.	2021-07-15 12:13:20 +02:00
Avi Kivity	1643549d08	Merge 'Coroutinize the sstable reader' from Wojciech Mitros
This patch applies the same changes to both kl and mx sstable readers, but because the kl reader is old, we'll focus on the newer one.
This patch makes the main sstable reader process a coroutine, which allows simplifying it by:
- using the state saved in the coroutine instead of most of the states saved in the _state variable
- removing the switch statement and moving the code of former switch cases, resulting in a reduced number of jumps in the code
- removing repetitive ifs for read statuses, by adding them to the coroutine implementation
The coroutine is saved in a new class ```processing_result_generator```, which works like a generator: using its ```generate()``` method, one can order the coroutine to continue until it yields a data_consumer::processing_result value, which was achieved previously by calling the function that is now the coroutine (```do_process_state()```).
Before the patch, the main processing method had 558 lines. The patch reduces this number to 345 lines.
However, usage of C++ coroutines has a non-negligible effect on the performance of the sstable reader. In the test cases from ```perf_fast_forward``` the new sstable reader performs up to 2% more instructions (per fragment) than the former implementation, and this loss occurs in cases where we're reading many subsequent rows, without any skips. Thanks to an optimization found during the development of the patch, the loss is mitigated when we do skip rows, and for some cases, we can even observe an improvement.
You can see the full results in attached files: [old_results.txt](https://github.com/scylladb/scylla/files/6793139/old_results.txt), [new_results.txt](https://github.com/scylladb/scylla/files/6793140/new_results.txt)
Test: unit(dev)
Refs: #7952
Closes #9002
* github.com:scylladb/scylla:
  mx sstable reader: reduce code blocks
  mx sstable reader: make ifs consistent
  sstable readers: make awaiter for read status
  mx sstable reader: don't yield if the data buffer is not empty
  mx sstable reader: combine FLAGS and FLAGS_2 states
  mx sstable reader: reduce placeholder state usage
  mx sstable reader: replace non_consuming states with a bool
  mx sstable reader: reduce placeholder state usage
  mx sstable reader: replace unnecessary states with a placeholder
  mx sstable reader: remove false if case
  mx sstable reader: remove row_body_missing_columns_label
  mx sstable reader: remove row_body_deletion_label
  mx sstable reader: remove column_end_label
  mx sstable reader: remove column_cell_path_label
  mx sstable reader: remove column_ttl_label
  mx sstable reader: remove column_deletion_time_label
  mx sstable reader: remove complex_column_2_label
  mx sstable reader: remove row_body_missing_columns_read_columns_label
  mx sstable reader: remove row_body_marker_label
  mx sstable reader: remove row_body_shadowable_deletion_label
  mx sstable reader: remove row_body_prev_size_label
  mx sstable reader: remove ck_block_label
  mx sstable reader: remove ck_block2_label
  mx sstable reader: remove clustering_row_label and complex_column_label
  mx sstable reader: remove labels with only one goto
  mx sstable reader: replace the switch cases with gotos and a new label
  mx sstable reader: remove states only reached consecutively or from goto
  mx sstable reader: remove switch breaks for consecutive states
  mx sstable reader: convert readers main method into a coroutine
  kl sstable reader: replace states for ending with one state, simplify non_consuming
  kl sstable reader: remove unnecessary states
  kl sstable reader: remove unnecessary yield
  kl sstable reader: remove unnecessary blocks
  kl sstable reader: fix indentation
  kl sstable reader: replace switch with standard flow control
  kl sstable reader: remove state::CELL case
  kl sstable reader: move states code only reachable from one place
  kl sstable reader: remove states only reached consecutively
  kl sstable reader: remove switch breaks for consecutive states
  kl sstable reader: remove unreachable case
  kl sstable reader: move testing hack for fragmented buffers outside the coroutine
  kl sstable reader: convert readers main method into a coroutine
  sstable readers: create a generator class for coroutines	2021-07-15 12:06:14 +03:00
Wojciech Mitros	45058776c2	mx sstable reader: reduce code blocks Some blocks of code were surrounded by curly braces, because a variable was declared inside a switch case. After changes, some of the variable declarations are in if/else/while cases, and no longer need to be in separate code blocks, while other blocks can be extended to entire labels for simplicity.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	9b333908e4	mx sstable reader: make ifs consistent In several places we're checking the return value of our consumers' consume_* calls. Because the behaviour in all cases is the same, let us use the same notation as well.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	dc38605f75	sstable readers: make awaiter for read status After each read* call of the primitive_consumer we need to check if the entire primitive was in our current buffer. We can check it in the proceed_generator object by yielding the returned read status: if the yielded status is ready, the yield_value method returns a structure whose await_ready() method returns true; otherwise, await_ready() returns false. The returned structure is co_awaited by the coroutine (due to co_yield). If await_ready() returns true, the coroutine isn't stopped; conversely, if it returns false (and, technically, because its await_suspend method returns void), the coroutine stops, and a proceed::yes value is saved, indicating that we need more buffers.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	09a0cd7c05	mx sstable reader: don't yield if the data buffer is not empty The skip() method returns a skip_bytes object if we want to skip the entire buffer, otherwise it returns a proceed::yes and trims the buffer. If the buffer is only trimmed we don't need to interrupt the coroutine, we simply continue instead.	2021-07-14 20:50:30 +02:00
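A rough sketch of the two outcomes of skip() described above, using C++17 standard types. The `skip_result` alias, the `remaining` field, the use of `std::string_view` as the buffer, and the `can_continue_after_skip` helper are assumptions made for illustration and are not Scylla's actual signatures.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <variant>

// Hypothetical stand-ins for the reader's types.
struct skip_bytes { std::size_t remaining; };  // bytes to skip beyond this buffer
enum class proceed { no, yes };
using skip_result = std::variant<proceed, skip_bytes>;

// Skips n bytes. If that consumes the whole buffer, report skip_bytes so the
// caller can skip the rest in the underlying stream. Otherwise trim the
// buffer in place and report proceed::yes.
skip_result skip(std::string_view& buf, std::size_t n) {
    if (n >= buf.size()) {
        std::size_t rest = n - buf.size();
        buf = std::string_view{};
        return skip_bytes{rest};
    }
    buf.remove_prefix(n);
    return proceed::yes;
}

// True when the buffer was only trimmed, so parsing can simply continue
// without interrupting the coroutine.
bool can_continue_after_skip(std::string_view& buf, std::size_t n) {
    return std::holds_alternative<proceed>(skip(buf, n));
}
```

In this sketch only the `skip_bytes` outcome would require handing control back to the caller; the trimmed case leaves data in the buffer and keeps going.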
Wojciech Mitros	5dc64532bd	mx sstable reader: combine FLAGS and FLAGS_2 states We don't differentiate between FLAGS and FLAGS_2 in verify_end_state(), so we can merge them into one state.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	ab1e6f4211	mx sstable reader: reduce placeholder state usage After the changes to non_consuming states, we can remove some state::OTHER assignments again.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	c904ab12c8	mx sstable reader: replace non_consuming states with a bool The non_consuming() method is only used after checking primitive_consumer::active() (in continuous_data_consumer::process()), so we don't need the states where primitive_consumer::active() holds, which is most of them. We still need to make sure that the states change when they need to, so we replace all the concerned states with the placeholder state, and for the few states from the non_consuming() OR, where the primitive_consumer::active() returns true, we set the value of _consuming to false, changing it back when the state is no longer non_consuming.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	b05d3eefed	mx sstable reader: reduce placeholder state usage We can remove state assignments that we know are changing a state to itself. Similarly, if a state is changed in the same way in an if and an else, it can be changed before the if/else instead.	2021-07-14 20:50:30 +02:00
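The second simplification mentioned in this commit (hoisting an identical assignment out of both branches) might look like the following minimal sketch. The `state` enum values reuse names from the log, while the two functions and the `work` counter are hypothetical.

```cpp
#include <cassert>

enum class state { FLAGS, OTHER };

// Before: both branches assign the same state.
state assign_in_branches(bool cond, int& work) {
    state s = state::FLAGS;
    if (cond) {
        s = state::OTHER;
        work += 1;
    } else {
        s = state::OTHER;
        work += 2;
    }
    return s;
}

// After: the shared assignment is done once, before the if/else.
state assign_before_branch(bool cond, int& work) {
    state s = state::FLAGS;
    s = state::OTHER;
    if (cond) {
        work += 1;
    } else {
        work += 2;
    }
    return s;
}
```

Both versions end in the same state and do the same work. Note that the hoisted form is only equivalent when nothing in the branches reads the state before it is assigned.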
Wojciech Mitros	b2e3fbffd0	mx sstable reader: replace unnecessary states with a placeholder After removing the switch, the state is only used for verify_end_state() and non_consuming(), so we can replace states that are not used there with a single one, so that the state still stops being one of the appearing states when it needs to.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	9a7a8fa86c	mx sstable reader: remove false if case consume_row_marker_and_tombstone does not return proceed::no in the mp_row_consumer_m implementation, and even if it did, we would most likely want to yield proceed::no in that case as well.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	2262aac11a	mx sstable reader: remove row_body_missing_columns_label row_body_missing_columns_label is only reached from one goto, or consecutively, so the code omitted by the goto can be omitted by an if (or else) instead.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	99b5a332db	mx sstable reader: remove row_body_deletion_label row_body_deletion_label is only reached from one goto, or consecutively, so the code omitted by the goto can be omitted by an if (or else) instead.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	cbce22a88b	mx sstable reader: remove column_end_label column_end_label is only reached from one goto, or consecutively, so the code omitted by the goto can be omitted by an if (or else) instead.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	925d921cb4	mx sstable reader: remove column_cell_path_label column_cell_path_label is only reached from two gotos, both at the end of an if/else block, or consecutively, so the code after the if/else block can be omitted by an if (or else) instead.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	e85987a439	mx sstable reader: remove column_ttl_label column_ttl_label is only reached from two gotos, both at the end of an if/else block, or consecutively, so the code after the if/else block can be omitted by an if (or else) instead.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	4b3607e97b	mx sstable reader: remove column_deletion_time_label column_deletion_time_label is only reached from one goto, or consecutively, so the code omitted by the goto can be omitted by an if (or else) instead.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	8cf23c3b01	mx sstable reader: remove complex_column_2_label complex_column_2_label is only reached from one goto, or consecutively, so the code omitted by the goto can be omitted by an if (or else) instead.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	fbe28d18f3	mx sstable reader: remove row_body_missing_columns_read_columns_label row_body_missing_columns_read_columns_label is only reached consecutively, or from a goto after the label. This is changed to a while loop starting at the label and ending at the goto. The code executed in the only case we do not reach the goto (so when exiting the loop) is moved after the while.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	3b512ea2c2	mx sstable reader: remove row_body_marker_label row_body_marker_label is only reached from one goto inside an else case, or consecutively, so the code omitted by goto can be moved inside the corresponding if case.	2021-07-14 20:50:30 +02:00
Wojciech Mitros	0bcde69319	mx sstable reader: remove row_body_shadowable_deletion_label row_body_shadowable_deletion_label is only reached from one goto, or consecutively, so the code omitted by the goto can be omitted by an if (or else) instead.	2021-07-14 20:50:30 +02:00
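Several of the label-removal commits above apply the same pattern: a label reached only by fall-through or by a single forward goto is replaced with an if around the skipped code. The sketch below shows that pattern on a toy function. The function names, the `has_deletion` flag, and the trace strings are invented for illustration and are not taken from the reader.

```cpp
#include <cassert>
#include <string>

// Goto form: the label is reached either by falling through or from one goto.
std::string parse_with_goto(bool has_deletion) {
    std::string trace;
    trace += "flags;";
    if (!has_deletion) {
        goto after_deletion_label;  // jump over the deletion step
    }
    trace += "deletion;";
after_deletion_label:
    trace += "columns;";
    return trace;
}

// If form: the code the goto used to jump over is guarded by an if instead.
std::string parse_with_if(bool has_deletion) {
    std::string trace;
    trace += "flags;";
    if (has_deletion) {
        trace += "deletion;";
    }
    trace += "columns;";
    return trace;
}
```

The two functions produce identical traces for both inputs, which is the property each of these commits needs to preserve.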
Wojciech Mitros	3d0fdf9f3b	mx sstable reader: remove row_body_prev_size_label row_body_prev_size_label is only reached consecutively, or from a goto not far after the label. This is changed to a while loop starting at the label and ending at the goto.	2021-07-14 20:50:30 +02:00
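The backward-goto case, used for row_body_prev_size_label and row_body_missing_columns_read_columns_label, can be sketched the same way: a label that is reached by fall-through and by a goto placed after it becomes a while loop, and the code that ran when the goto was not taken moves after the loop. The functions and the `blocks` input below are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Goto form: the label is reached consecutively and from a goto after it.
int sum_with_goto(const std::vector<int>& blocks) {
    std::size_t i = 0;
    int total = 0;
read_block_label:
    if (i < blocks.size()) {
        total += blocks[i++];
        goto read_block_label;  // jump back to read the next block
    }
    total += 100;  // runs only on the path that does not reach the goto
    return total;
}

// While form: the loop spans from the old label to the old goto, and the
// exit-path code is placed after the loop.
int sum_with_while(const std::vector<int>& blocks) {
    std::size_t i = 0;
    int total = 0;
    while (i < blocks.size()) {
        total += blocks[i++];
    }
    total += 100;
    return total;
}
```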


108 Commits