Commit Graph

75 Commits

Author SHA1 Message Date
Botond Dénes
62e134f423 sstables: replace SCYLLA_ASSERT() with parse_assert() on the read path
So parse errors on corrupt SSTables don't result in crashes, instead
just aborting the read in process.
There are a lot of SCYLLA_ASSERT() usages remaining in sstables/. This
patch tried to focus on those usages which are in the read path. Some
places not only used on the read path may have been converted too, where
the usage of said method is not clear.

(cherry picked from commit bce89c0f5e)
2025-06-26 14:53:08 +00:00
Kefu Chai
a3ac7c3d33 remove redundant std::move() from position_in_partition::key()
Fix GCC warning about moving from a const reference in mp_row_consumer_k_l::flush_if_needed.
Since position_in_partition::key() returns a const reference, std::move has no effect.

Considered adding an rvalue reference overload (clustering_key_prefix&& key() &&) but
since the "me" sstable format is mandatory since 63b266e9, this approach offers no benefit.

This change simply removes the redundant std::move() call to silence the warning and
improve code clarity.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23085
2025-03-03 12:51:40 +03:00
Avi Kivity
f3eade2f62 treewide: relicense to ScyllaDB-Source-Available-1.0
Drop the AGPL license in favor of a source-available license.
See the blog post [1] for details.

[1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/
2024-12-18 17:45:13 +02:00
Nikos Dragazis
609b16307e sstables: Add integrity option to data_consume_single_partition()
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-11-11 20:26:27 +02:00
Nikos Dragazis
5b896cdbb7 sstables: Disengage integrity_check from sstable class
The `integrity_check` flag was first introduced as a parameter in
`sstable::data_stream()` to support creating input streams with
integrity checking. As such, it was defined in the sstable class.

However, we also use this flag in the kl/mx full-scan readers, and, in
a later patch, we will use it in `class sstable_set` as well.

Move the definition into `types_fwd.hh` since it is no longer bound to
the sstable class.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-11-11 20:26:27 +02:00
Tomasz Grabiec
a29501ed67 sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions
Single-row reads from large partition issue 64 KiB reads to the data file,
which is equal to the default span of the promoted index block in the data file.
If users would want to reduce selectivity of the index to speed up single-row reads,
this won't be effective. The reason is that the reader uses promoted index
to look up the start position in the data file of the read, but end position
will in practice extend to the next partition, and amount of I/O will be
determined by the underlying file input stream implementation and its
read-ahead heuristics. By default, that results in at least 2 IOs 32KB each.

There is already infrastructure to lookup end position based on upper
bound of the read, but it's not effective becasue it's a
non-populating lookup and the upper bound cursor has its own private
cached_promoted_index, which is cold when positions are computed. It's
non-populating on purpose, to avoid extra index file IO to read upper
bound. In case upper bound is far-enough from the lower bound, this
will only increase the cost of the read.

The solution employed here is to warm up the lower bound cursor's
cache before positions are computed, and use that cursor for
non-populating lookup of the upper bound.

We use the lower bound cursor and the slice's lower bound so that we
read the same blocks as later lower-bound slicing would, so that we
don't incur extra IO for cases where looking up upper bound is not
worth it, that is when upper bound is far from the lower bound. If
upper bound is near lower bound, then warming up using lower bound
will populate cached_promoted_index with blocks which will allow us to
locate the upper bound block accurately.  This is especially important
for single-row reads, where the bounds are around the same key.  In
this case we want to read the data file range which belongs to a
single promoted index block.  It doesn't matter that the upper bound
is not exactly the same. They both will likely lie in the same block,
and if not, binary search will bring adjacent blocks into cache.  Even
if upper bound is not near, the binary search will populate the cache
with blocks which can be used to narrow down the data file range
somewhat.

Fixes #10030.

The change was tested with perf-fast-forward.

I populated the data set with `column_index_size_in_kb` set to 1

  scylla perf-fast-forward --populate --run-tests=large-partition-slicing --column-index-size-in-kb=1

Test run:

  build/release/scylla perf-fast-forward --run-tests=large-partition-select-few-rows -c1 --keep-cache-across-test-cases --test-case-duration=0

This test reads two rows from the middle of a large partition (1M
rows), of subsequent keys. The first read will miss in the index file
page cache, the second read will hit.

Notice that before the change, the second read issued 2 aio requests worth of 64KiB in total.
After the change, the second read issued 1 aio worth of 2 KiB. That's because promoted index block is larger than 1 KiB.
I verified using logging that the data file range matches a single promoted index block.

Also, the first read which misses in cache is still faster after the change.

Before:

running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride  rows      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
500000  1         0.009802            1         1        102          0        102        102       21.0     21        196       2       1        0        1        1        0        0        0       568     269 4716050  53.4%
500001  1         0.000321            1         1       3113          0       3113       3113        2.0      2         64       1       0        1        0        0        0        0        0       116      26  555110  45.0%

After:

running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride  rows      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
500000  1         0.009609            1         1        104          0        104        104       20.0     20        137       2       1        0        1        1        0        0        0       561     268 4633407  43.1%
500001  1         0.000217            1         1       4602          0       4602       4602        1.0      1          2       1       0        1        0        0        0        0        0       110      26  313882  64.1%

(cherry picked from commit dfb339376aff1ed961b26c4759b1604f7df35e54)
2024-10-01 18:40:34 +02:00
Tomasz Grabiec
c0fa49bab5 sstables, utils: Allow parsers to work with different buffer types
Currently, parsers work with temporary_buffer<char>. This is unsafe
when invoked by bsearch_clustered_cursor, which reuses some of the
parsers, and passes temporary_buffer<char> which is a view onto LSA
buffer which comes from the index file page cache. This view is stable
only around consume(). If parsing requires more than one page, it will
continue with a different input buffer. The old buffer will be
invalid, and it's unsafe for the parser to store and access
it. Unfortunetly, the temporary_buffer API allows sharing the buffer
via the share() method, which shares the underlying memory area. This
is not correct when the underlying is managed by LSA, because storage
may move. Parser uses this sharing when parsing blobs, e.g. clustering
key components. When parsing resumes in the next page, parser will try
to access the stored shared buffers pointing to the previous page,
which may result in use-after-free on the memory area.

In prearation for fixing the problem, parametrize parsers to work with
different kinds of buffers. This will allow us to instantiate them
with a buffer kind which supports sharing of LSA buffers properly in a
safe way.

It's not purely mechanical work. Some parts of the parsing state
machine still works with temporary_buffer<char>, and allocate buffers
internally, when reading into linearized destination buffer. They used
to store this destination in _read_bytes vector, same field which is
used to store the shared buffers. Now it's not possible, since shared
buffer type may be different than temporary_buffer<char>. So those
paths were changed to use a new field: _read_bytes_buf.
2024-09-27 01:24:54 +02:00
Kefu Chai
df7f332a58 sstable: s/crawling_sstable_mutation_reader/sstable_full_scan_reader
"crawling" is a little bit obscure in this context. so let's rename this
class to reflect the fact that this reader only reads the entire content
of the sstable.

both crawling reader for kl and mx formats are renamed. also, in order
to be consistent, all "crawling reader" in variable names are updated
as well.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-09-17 10:39:37 +08:00
Nikos Dragazis
716fc487fd sstables: Expose integrity option via crawling mutation readers
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-09-11 12:28:59 +03:00
Nikos Dragazis
1d2dc9f2e1 sstables: Expose integrity option via data_consume_rows()
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-09-11 12:28:59 +03:00
Łukasz Paszkowski
7b201e9165 kl::reader::make_reader: Unify interface with mx::reader::make_reader
Ensure both readers have the same interfaces to avoid mistakes as
both readers are used in sstable::make_reader. Less error prone.
2024-08-13 10:02:43 +02:00
Avi Kivity
aa1270a00c treewide: change assert() to SCYLLA_ASSERT()
assert() is traditionally disabled in release builds, but not in
scylladb. This hasn't caused problems so far, but the latest abseil
release includes a commit [1] that causes a 1000 insn/op regression when
NDEBUG is not defined.

Clearly, we must move towards a build system where NDEBUG is defined in
release builds. But we can't just define it blindly without vetting
all the assert() calls, as some were written with the expectation that
they are enabled in release mode.

To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT()
macro in utils/assert.hh. This macro is always defined and is not conditional
on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release
mode.

[1] 66ef711d68

Closes scylladb/scylladb#20006
2024-08-05 08:23:35 +03:00
Avi Kivity
fdc1449392 treewide: rename flat_mutation_reader_v2 to mutation_reader
flat_mutation_reader_v2 was introduced in a pair of commits in 2021:

  e3309322c3 "Clone flat_mutation_reader related classes into v2 variants"
  08b5773c12 "Adapt flat_mutation_reader_v2 to the new version of the API"

as a replacement for flat_mutation_reader, using range_tombstone_change
instead of range_tombstone to represent represent range tombstones. See
those commits for more information.

The transition was incremental; the last use of the original
flat_mutation_reader was removed in 2022 in commit

  026f8cc1e7 "db: Use mutation_partition_v2 in mvcc"

In turn, flat_mutation_reader was introduced in 2017 in commit

  748205ca75 "Introduce flat_mutation_reader"

To transition from a mutation_reader that nested rows within
a partition in a separate stream, to a flat reader that streamed
partitions and rows in the same stream.

Here, we reclaim the original name and rename the awkward
flat_mutation_reader_v2 to mutation_reader.

Note that mutation_fragment_v2 remains since we still use the original
for compatibilty, sometimes.

Some notes about the transition:

 - files were also renamed. In one case (flat_mutation_reader_test.cc), the
   rename target already existed, so we rename to
    mutation_reader_another_test.cc.

 - a namespace 'mutation_reader' with two definitions existed (in
   mutation_reader_fwd.hh). Its contents was folded into the mutation_reader
   class. As a result, a few #includes had to be adjusted.

Closes scylladb/scylladb#19356
2024-06-21 07:12:06 +03:00
Kefu Chai
372a4d1b79 treewide: do not define FMT_DEPRECATED_OSTREAM
since we do not rely on FMT_DEPRECATED_OSTREAM to define the
fmt::formatter for us anymore, let's stop defining `FMT_DEPRECATED_OSTREAM`.

in this change,

* utils: drop the range formatters in to_string.hh and to_string.c, as
  we don't use them anymore. and the tests for them in
  test/boost/string_format_test.cc are removed accordingly.
* utils: use fmt to print chunk_vector and small_vector. as
  we are not able to print the elements using operator<< anymore
  after switching to {fmt} formatters.
* test/boost: specialize fmt::details::is_std_string_like<bytes>
  due to a bug in {fmt} v9, {fmt} fails to format a range whose
  element type is `basic_sstring<uint8_t>`, as it considers it
  as a string-like type, but `basic_sstring<uint8_t>`'s char type
  is signed char, not char. this issue does not exist in {fmt} v10,
  so, in this change, we add a workaround to explicitly specialize
  the type trait to assure that {fmt} format this type using its
  `fmt::formatter` specialization instead of trying to format it
  as a string. also, {fmt}'s generic ranges formatter calls the
  pair formatter's `set_brackets()` and `set_separator()` methods
  when printing the range, but operator<< based formatter does not
  provide these method, we have to include this change in the change
  switching to {fmt}, otherwise the change specializing
  `fmt::details::is_std_string_like<bytes>` won't compile.
* test/boost: in tests, we use `BOOST_REQUIRE_EQUAL()` and its friends
  for comparing values. but without the operator<< based formatters,
  Boost.Test would not be able to print them. after removing
  the homebrew formatters, we need to use the generic
  `boost_test_print_type()` helper to do this job. so we are
  including `test_utils.hh` in tests so that we can print
  the formattable types.
* treewide: add "#include "utils/to_string.hh" where
  `fmt::formatter<optional<>>` is used.
* configure.py: do not define FMT_DEPRECATED_OSTREAM
* cmake: do not define FMT_DEPRECATED_OSTREAM

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-04-19 22:57:36 +08:00
Michał Chojnowski
f9e97fa632 sstables: fix a use-after-free in key_view::explode()
key_view::explode() contains a blatant use-after-free:
unless the input is already linearized, it returns a view to a local temporary buffer.

This is rare, because partition keys are usually not large enough to be fragmented.
But for a sufficiently large key, this bug causes a corrupted partition_key down
the line.

Fixes #17625

Closes scylladb/scylladb#17626
2024-03-07 09:07:07 +02:00
Kefu Chai
a6152cb87b sstables: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16666
2024-01-09 11:45:44 +02:00
Pavel Emelyanov
66e43912d6 code: Switch to seastar API level 7
In that level no io_priority_class-es exist. Instead, all the IO happens
in the context of current sched-group. File API no longer accepts prio
class argument (and makes io_intent arg mandatory to impls).

So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritants to updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command

The first change is huge and was made semi-autimatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields

Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass intent to lower file (if applicatble)

Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile

The scylla-gdb.py update is a bit hairry -- it needs to use task queues
list for IO classes names and shares, but to detect it should it checks
for the "commitlog" group is present.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13963
2023-06-06 13:29:16 +03:00
Petr Gusev
64427b9164 flat_mutation_reader_v2: drop forward_buffer_to
This is just a strange method I came across.
It effectively does nothing but clear_buffer().
2023-02-28 23:00:02 +04:00
Kefu Chai
0cb842797a treewide: do not define/capture unused variables
these warnings are found by Clang-17 after removing
`-Wno-unused-lambda-capture` and '-Wno-unused-variable' from
the list of disabled warnings in `configure.py`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-15 22:57:18 +02:00
Botond Dénes
2acfa950d7 sstables: wire in the reader_permit's sstable read count tracking
Hook in the relevant methods when creating and destroying sstable
readers.
2023-01-03 09:37:29 -05:00
Botond Dénes
0bcfc9d522 treewide: use ::for_partition_end() instead of ::end_of_partition_tag_t{}
We just added a convenience static factory method for partition end,
change the present users of the clunky constructor+tag to use it
instead.
2022-11-11 09:58:18 +02:00
Botond Dénes
f1a039fc2b treewide: use ::for_partition_start() instead of ::partition_start_tag_t{}
We just added a convenience static factory method for partition start,
change the present users of the clunky constructor+tag to use it
instead.
2022-11-11 09:58:18 +02:00
Pavel Emelyanov
2c1ef0d2b7 sstables.hh: Remove unused headers
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11709
2022-10-04 23:37:07 +02:00
Michał Chojnowski
cdb3e71045 sstables: add a flag for disabling long-term index caching
Long-term index caching in the global cache, as introduced in 4.6, is a major
pessimization for workloads where accesses to the index are (spacially) sparse.
We want to have a way to disable it for the affected workloads.

There is already infrastructure in place for disabling it for BYPASS CACHE
queries. One way of solving the issue is hijacking that infrastructure.

This patch adds a global flag (and a corresponding CLI option) which controls
index caching. Setting the flag to `false` causes all index reads to behave
like they would in BYPASS CACHE queries.

Consequences of this choice:

- The per-SSTable partition_index_cache is unused. Every index_reader has
  its own, and they die together. Independent reads can no longer reuse the
  work of other reads which hit the same index pages. This is not crucial,
  since partition accesses have no (natural) spatial locality. Note that
  the original reason for partition_index_cache -- the ability to share
  reads for the lower and upper bound of the query -- is unaffected.
- The per-SSTable cached_file is unused. Every index_reader has its own
  (uncached) input stream from the index file, and every
  bsearch_clustered_cursor has its own cached_file, which dies together with
  the cursor. Note that the cursor still can perform its binary search with
  caching. However, it won't be able to reuse the file pages read by
  index_reader. In particular, if the promoted index is small, and fits inside
  the same file page as its index_entry, that page will be re-read.
  It can also happen that index_reader will read the same index file page
  multiple times. When the summary is so dense that multiple index pages fit in
  one index file page, advancing the upper bound, which reads the next index
  page, will read the same index file page. Since summary:disk ratio is 1:2000,
  this is expected to happen for partitions with size greater than 2000
  partition keys.

Fixes #11202
2022-09-15 17:16:26 +03:00
Botond Dénes
70d019116f sstables/kl: make reader impl v2 native
The conversion is shallow: the meat of the logic remains v1, fragments
are converted to v2 right before being pushed into the buffer. This
approach is simple, surgical and is still better then a full
upgrade_to_v2().
2022-04-28 14:12:24 +03:00
Botond Dénes
a22b02c801 sstables/kl: return v2 reader from factory methods
This just moves the upgrade_to_v2() calls to the other side of said
factory methods, preparing the ground for converting the kl reader impl
to a native v2 one.
2022-04-28 14:12:24 +03:00
Botond Dénes
4b222e7f37 sstables: move mp_row_consumer_reader_k_l to kl/reader.cc
Its only user is in said file, so that is a better place for it.
2022-04-28 14:12:24 +03:00
Avi Kivity
585c0841c3 Merge 'sstables: enable read ahead for the partition index reader' from Wojciech Mitros
Currently, when advancing one of `index_reader`'s bounds, we're creating a new `index_consume_entry_context` with a new underlying file `input_stream` for each new page.

For either bound, the streams can be reused, because the indexes of pages that we are reading are never decreasing.

This patch adds a `index_consume_entry_context` to each of `index_reader`'s bounds, so that for each new page, the same file `input_stream` is used.
As a result, when reading consecutive pages, the reads that follow the first one can be satisfied by the `input_stream`'s read aheads, decreasing the number of blocking reads and increasing the throughput of the `index_reader`.

Additionally, we're reusing the `index_consumer` for all pages, calling `index_consumer::prepare` when we need to increase the size of  the `_entries` `chunked_managed_vector`.

A big difference can be seen when we're reading the entire table, frequently skipping a few rows; which we can test using perf_fast_forward:

Before:
```
running: small-partition-skips on dataset small-part
Testing scanning small partitions with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
   read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
-> 1       0         0.899447            4   1000000    1111794      12284    1113248    1096537      975.5    972     124356       1       0        0        0        0        0        0        0  12032202   29103    8967 100.0%
-> 1       1         1.805811            4    500000     276884        907     278214     275977     3655.8   3654     135084    2688       0     3161     4548     5935        0        0        0   7225100  140466   27010  75.6%
-> 1       8         0.927339            4    111112     119818        357     120465     119461     3654.0   3654     135084    2685       0     2133     4548     6963        0        0        0   1749663  107922   57502  50.2%
-> 1       16        0.790630            4     58824      74401        782      74617      73497     3654.0   3654     135084    2695       0     1975     4548     7121        0        0        0   1019189  109349   90832  42.7%
-> 1       32        0.717235            4     30304      42251        243      42266      41975     3654.0   3654     135084    2689       0     1871     4548     7225        0        0        0    619876  109199  156751  37.3%
-> 1       64        0.681624            4     15385      22571        244      22815      22286     3654.0   3654     135084    2685       0     1870     4548     7226        0        0        0    407671  105798  285688  34.0%
-> 1       256       0.630439            4      3892       6173         24       6214       6150     3549.0   3549     135116    2581       0     1313     3927     6505        0        0        0    232541  100803 1022454  29.1%
-> 1       1024      0.313303            4       976       3115        219       3126       2766     1956.0   1956     130608     986       0        0      987     1962        0        0        0     81165   41385 1724979  29.1%
-> 1       4096      0.083688            4       245       2928         85       3012       2134      738.8    737      17212     492     244        0      247      491        0        0        0     30500   19406 1999263  24.6%
-> 64      1         1.509011            4    984616     652491       2746     660930     649745     3673.5   3654     135084    2687       0     4507     4548     4589        0        0        0  11075882  117074   13157  68.9%
-> 64      8         1.424147            4    888896     624160       4446     625675     617713     3654.0   3654     135084    2691       0     4248     4548     4848        0        0        0  10019098  117383   13700  66.5%
-> 64      16        1.343276            4    800000     595559       5834     605880     589725     3654.0   3654     135084    2698       0     3989     4548     5107        0        0        0   9043830  124022   14206  64.9%
-> 64      32        1.249721            4    666688     533469       5056     536638     526212     3654.0   3654     135084    2688       0     3616     4548     5480        0        0        0   7570848  123043   15377  60.9%
-> 64      64        1.154549            4    500032     433097      10215     443312     415001     3654.0   3654     135084    2703       0     3161     4548     5935        0        0        0   5718758  110657   17787  53.2%
-> 64      256       1.005309            4    200000     198944       1179     199338     196989     3935.0   3935     137216    2966       0      690     4048     5592        0        0        0   2398359  110510   27855  51.3%
-> 64      1024      0.441913            4     58880     133239       8094     135471     120467     2161.0   2161     131820    1190       0        0     1192     1848        0        0        0    725092   45449   33740  59.7%
-> 64      4096      0.124826            4     15424     123564       5958     126814      95101      795.5    794      17400     553     240        0      312      482        0        0        0    199943   20869   46621  41.9%
```
After:
```
running: small-partition-skips on dataset small-part
Testing scanning small partitions with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
   read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
-> 1       0         0.917468            4   1000000    1089956       1422    1091378    1073112      975.5    972     124356       1       0        0        0        0        0        0        0  12032761   29721    8972 100.0%
-> 1       1         1.311446            4    500000     381259       3212     384470     377238     1087.0   1083     138420       2       0     4445     4548     4651        0        0        0   7096216   55681   20869 100.0%
-> 1       8         0.467975            4    111112     237432       1446     239372     235985     1121.2   1119     143124       9       0     4344     4548     4752        0        0        0   1619944   23502   28844  98.7%
-> 1       16        0.337085            4     58824     174508       3410     178451     171099     1117.5   1120     143276      11       0     4319     4548     4777        0        0        0    883692   19152   37460  96.8%
-> 1       32        0.262798            4     30304     115313       1222     116535     112400     1070.2   1066     135620     166      26     4354     4548     4742        0        0        0    483185   18856   54275  94.9%
-> 1       64        0.283954            4     15385      54181        531      56177      53650     2022.5   2040     137036     319      19     4351     4548     4745        0        0        0    292766   32998  102276  84.9%
-> 1       256       0.207020            4      3892      18800        575      19105      17520     1315.5   1334     136072     418      24     3703     3927     4115        0        0        0    118400   27427  292146  82.1%
-> 1       1024      0.164396            4       976       5937         57       5993       5842     1208.2   1195     135384     568      14      932      987     1030        0        0        0     62999   27554  503559  70.0%
-> 1       4096      0.085079            4       245       2880        108       2987       2714      635.8    634      26468     248     246      233      247      258        0        0        0     31264   12872 1546404  37.4%
-> 64      1         1.073331            4    984616     917346       7614     923983     909314     1812.2   1824     136792      11      20     4544     4548     4552        0        0        0  10971661   54538    9919  99.6%
-> 64      8         1.024389            4    888896     867733       6327     870429     845215     3027.2   3072     138212      31       0     4523     4548     4573        0        0        0   9933078   68059   10050  99.5%
-> 64      16        0.978754            4    800000     817366       7802     827665     809564     3012.2   3008     139884      39       0     4486     4548     4610        0        0        0   8947041   64050   10302  98.1%
-> 64      32        0.837266            4    666688     796267      10312     806579     785370     2275.8   2266     139672      29       0     4465     4548     4631        0        0        0   7458644   50754   10564  97.8%
-> 64      64        0.645627            4    500032     774490       4713     779203     768432     1136.8   1137     145428       8       0     4438     4548     4658        0        0        0   5593168   29982   10938  98.4%
-> 64      256       0.386192            4    200000     517877      22509     544067     495368     1134.8   1136     145300     109       0     2135     4048     4147        0        0        0   2270291   22840   13682  94.5%
-> 64      1024      0.238617            4     58880     246755      55856     305110     190899     1176.0   1118     135324     451      13      625     1192     1223        0        0        0    701262   24418   17323  71.1%
-> 64      4096      0.133340            4     15424     115674      14837     117978      99072      974.0    961      27132     366     347       99      312      383        0        0        0    209595   20657   43096  50.4%
```
For single partition reads, the index_reader is modified to behave in practically the same way, as before the change (not reading ahead past the page with the partition).
For example, a single partition read from a table with 10 rows per partition performs a single 6KB read from the index file, and the same read is performed before the change (as can be seen in traces below). If we enabled read aheads in that case, we would perform 2 16KB reads.
Relevant traces:
Before:
```
./tmp/data/ks/t2-75ebed30eb0211eb837a8f4cd3d1cf62/md-1-big-Index.db: scheduling bulk DMA read of size 6478 at offset 0 [shard 0] | 2021-07-23 15:22:25.847362 | 127.0.0.1 |            148 | 127.0.0.1
./tmp/data/ks/t2-75ebed30eb0211eb837a8f4cd3d1cf62/md-1-big-Index.db: finished bulk DMA read of size 6478 at offset 0, successfully read 6478 bytes [shard 0] | 2021-07-23 15:22:25.900996 | 127.0.0.1 |          53782 | 127.0.0.1
```
After:
```
./tmp/data/ks/t2-75ebed30eb0211eb837a8f4cd3d1cf62/md-1-big-Index.db: scheduling bulk DMA read of size 6478 at offset 0 [shard 0] | 2021-07-23 15:19:37.380033 | 127.0.0.1 |            149 | 127.0.0.1
./tmp/data/ks/t2-75ebed30eb0211eb837a8f4cd3d1cf62/md-1-big-Index.db: finished bulk DMA read of size 6478 at offset 0, successfully read 6478 bytes [shard 0] | 2021-07-23 15:19:37.433662 | 127.0.0.1 |          53777 | 127.0.0.1
```
Tests: unit(dev)

Closes #9063

* github.com:scylladb/scylla:
  sstables: index_reader: optimize single partition reads
  sstables: use read-aheads in the index reader
  sstables: index_reader: remove unused members from index reader context
2022-03-21 13:47:28 +02:00
Mikołaj Sielużycki
1d84a254c0 flat_mutation_reader: Split readers by file and remove unnecessary includes.
The flat_mutation_reader files were conflated and contained multiple
readers, which were not strictly necessary. Splitting optimizes both
iterative compilation times, as touching rarely used readers doesn't
recompile large chunks of codebase. Total compilation times are also
improved, as the size of flat_mutation_reader.hh and
flat_mutation_reader_v2.hh have been reduced and those files are
included by many file in the codebase.

With changes

real	29m14.051s
user	168m39.071s
sys	5m13.443s

Without changes

real	30m36.203s
user	175m43.354s
sys	5m26.376s

Closes #10194
2022-03-14 13:20:25 +02:00
Wojciech Mitros
7f590a3686 sstables: index_reader: optimize single partition reads
All entries from a single partition can be found in a
single summary page.
Because of that, in cases when we know we want to read
only one partition, we can limit the underyling file
input_stream to the range of the page.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2022-02-22 02:16:52 +01:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes we applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Botond Dénes
4421929b25 sstables: kl/reader: add crawling reader
A special-purpose reader which doesn't use the index at all and hence
doesn't support skipping at all. It is designed to be used in conditions
in which the index is not reliable (scrub compaction).
2021-09-01 08:42:10 +03:00
Benny Halevy
4476800493 flat_mutation_reader: get rid of timeout parameter
Now that the timeout is taken from the reader_permit.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-24 16:30:51 +03:00
Michael Livshin
5f9695c1b2 sstables: count read row tombstones
Refs #7749.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-08-01 19:41:11 +03:00
Avi Kivity
42e1f318d7 Merge "Respect "bypass cache" in sstable index caching" from Tomasz
"
This series changes the behavior of the system when executing reads
annotated with "bypass cache" clause in CQL. Such reads will not
use nor populate the sstable partition index cache and sstable index page cache.
"

* 'bypass-cache-in-sstable-index-reads' of github.com:tgrabiec/scylla:
  sstables: Do not populate page cache when searching in promoted index for "bypass cache" reads
  sstables: Do not populate partition index cache for "bypass cache" reads
2021-07-28 18:45:39 +03:00
Wojciech Mitros
7f41af0916 sstables: merge row_consumer into mp_row_consumer_k_l
The row_consumer interface has only one implementation:
mp_row_consumer_k_l; and we're not planning other ones,
so to reduce the number of inheritances, and the number
of lines in the sstable reader, these classes may be
combined.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-07-21 18:19:49 +02:00
Wojciech Mitros
1ff72ca0a6 sstables: move kl row_consumer
In preparation for the next patch combining row_consumer and
mp_row_consumer_k_l, move row_consumer next to row_consumer.

Because row_consumer is going to be removed, we retire some
old tests for different implementations of the row_consumer
interface; as a result, we don't need to expose internal
types of kl sstable reader for tests, so all classes from
reader_impl.hh are moved to reader.cc, and the reader_impl.hh
file is deleted, and the reader.cc file has an analogous
structure to the reader.cc file in sstables/mx directory.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2021-07-21 18:04:22 +02:00
Tomasz Grabiec
f4227c303b sstables: Do not populate partition index cache for "bypass cache" reads
Index cursor for reads which bypass cache will use a private temporary
instance of the partition index cache.

Promoted index scanner (ka/la format) will not go through the page cache.
2021-07-15 12:13:20 +02:00
Avi Kivity
1643549d08 Merge 'Coroutinize the sstable reader' from Wojciech Mitros
This patch applies the same changes to both kl and mx sstable readers, but because the kl reader is old, we'll focus on the newer one.

This patch makes the main sstable reader process a coroutine,
allowing to simplify it, by:

- using the state saved in the coroutine instead of most of the states saved in the _state variable
- removing the switch statement and moving the code of former switch cases, resulting in reduced number of jumps in code
- removing repetitive ifs for read statuses, by adding them to the coroutine implementation

The coroutine is saved in a new class ```processing_result_generator```, which works like a generator: using its ```generate()``` method, one can order the coroutine to continue until it yields a data_consumer::processing_result value, which was achieved previously by calling the function that is now the coroutine(```do_process_state()```).

Before the patch, the main processing method had 558 lines. The patch reduces this number to 345 lines.

However, usage of c++ coroutines has a non-negligible effect on the performance of the sstable reader.
In the test cases from ```perf_fast_forward``` the new sstable reader performs up to 2% more instructions (per fragment) than the former implementation, and this loss is achieved for cases where we're reading many subsequent rows, without any skips.
Thanks to finding an optimization during the development of the patch, the loss is mitigated when we do skip rows, and for some cases, we can even observe an improvement.
You can see the full results in attached files: [old_results.txt](https://github.com/scylladb/scylla/files/6793139/old_results.txt), [new_results.txt](https://github.com/scylladb/scylla/files/6793140/new_results.txt)

Test: unit(dev)
Refs: #7952

Closes #9002

* github.com:scylladb/scylla:
  mx sstable reader: reduce code blocks
  mx sstable reader: make ifs consistent
  sstable readers: make awaiter for read status
  mx sstable reader: don't yield if the data buffer is not empty
  mx sstable reader: combine FLAGS and FLAGS_2 states
  mx sstable reader: reduce placeholder state usage
  mx sstable reader: replace non_consuming states with a bool
  mx sstable reader: reduce placeholder state usage
  mx sstable reader: replace unnecessary states with a placeholder
  mx sstable reader: remove false if case
  mx sstable reader: remove row_body_missing_columns_label
  mx sstable reader: remove row_body_deletion_label
  mx sstable reader: remove column_end_label
  mx sstable reader: remove column_cell_path_label
  mx sstable reader: remove column_ttl_label
  mx sstable reader: remove column_deletion_time_label
  mx sstable reader: remove complex_column_2_label
  mx sstable reader: remove row_body_missing_columns_read_columns_label
  mx sstable reader: remove row_body_marker_label
  mx sstable reader: remove row_body_shadowable_deletion_label
  mx sstable reader: remove row_body_prev_size_label
  mx sstable reader: remove ck_block_label
  mx sstable reader: remove ck_block2_label
  mx sstable reader: remove clustering_row_label and complex_column_label
  mx sstable reader: remove labels with only one goto
  mx sstable reader: replace the switch cases with gotos and a new label
  mx sstable reader: remove states only reached consecutively or from goto
  mx sstable reader: remove switch breaks for consecutive states
  mx sstable reader: convert readers main method into a coroutine
  kl sstable reader: replace states for ending with one state, simplify non_consuming
  kl sstable reader: remove unnecessary states
  kl sstable reader: remove unnecessary yield
  kl sstable reader: remove unnecessary blocks
  kl sstable reader: fix indentation
  kl sstable reader: replace switch with standard flow control
  kl sstable reader: remove state::CELL case
  kl sstable reader: move states code only reachable from one place
  kl sstable reader: remove states only reached consecutively
  kl sstable reader: remove switch breaks for consecutive states
  kl sstable reader: remove unreachable case
  kl sstable reader: move testing hack for fragmented buffers outside the coroutine
  kl sstable reader: convert readers main method into a coroutine
  sstable readers: create a generator class for coroutines
2021-07-15 12:06:14 +03:00
Wojciech Mitros
dc38605f75 sstable readers: make awaiter for read status
After each read* call of the primitive_consumer we need to check
if the entire primitive was in our current buffer. We can check it
in the proceed_generator object by yielding the returned read status:
if the yielded status is ready, the yield_value method returns
a structure whose await_ready() method returns true. Otherwise it
returns false.
The returned structure is co_awaited by the coroutine (due to co_yield),
and if await_ready() returns true, the coroutine isn't stopped,
conversely, if it returns false, (technical: and because its await_suspend
methods returns void) the coroutine stops, and a proceed::yes value
is saved, indicating that we need more buffers.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
4816e8120b kl sstable reader: replace states for ending with one state, simplify non_consuming
After removing the switch, the only use for states in the sstable reader
are methods non_consuming() and verify_end_state().

The non_consuming() method is only used after assuring that
!primitive_consumer::active() (in continuous_data_consumer::process())
so we don't need states where primitive_consumer::active() for this
method, and is actually all of them.

We don't differentiate between ATOM_START and ATOM_START_2 in
verify_end_state(), so we can just merge them into one.

While we need tho remember times when we enter states used in verify_end_state(),
we also need to remember when we exit them. For that reason we introduce a new
state "NOT_CLOSING", that fails all comparisons in verify_end_state(), and
replaces all states that aren't used in verify_end_state()
2021-07-14 20:50:30 +02:00
Wojciech Mitros
0c284a8b5e kl sstable reader: remove unnecessary states
After removing the switch, the state is only used for
verify_end_state() and non_consuming(), so we can
remove states that are not used there (and which do
not change them).
2021-07-14 20:50:30 +02:00
Wojciech Mitros
35c30e6178 kl sstable reader: remove unnecessary yield
We don't need to yield row_consumer::proceed::yes if we are
not parsing a primitive using primitive_consumer, we can just
continue execution.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
97c7b5fe76 kl sstable reader: remove unnecessary blocks
Some blocks of code were surrounded by curly braces, because
a variable was declared inside a switch case. With standard
flow control, it's no longer needed.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
914e4f27e9 kl sstable reader: fix indentation
To simplify review, the code moved in previous commits
didn't change its indentation. This commit fixes it.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
7a6729159f kl sstable reader: replace switch with standard flow control
We get rid of the switch by using the infinite loop around the
switch for jumping to the first case, adding an infinite loop
around the second case (one break from the switch with the
state of the first case becomes a break of the new while),
and adding an if around the first case (because we never break
in the first case).
2021-07-14 20:50:30 +02:00
Wojciech Mitros
cfe6a46a60 kl sstable reader: remove state::CELL case
The CELL state is only set in the if/else block immediately
before the CELL case, so we don't need to have a case for it.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
c41f49d2e5 kl sstable reader: move states code only reachable from one place
If a case is reached only after exiting a certain other case (or goto)
its code may as well be moved to that place.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
5f27413c1f kl sstable reader: remove states only reached consecutively
If a state is never reached from the top of the switch, but only
by continuing from the previous case, we don't need to have a case:
for it.
2021-07-14 20:50:30 +02:00
Wojciech Mitros
e226fc12c9 kl sstable reader: remove switch breaks for consecutive states
If _state at the end of a switch case has the same value as the
next case, instead of breaking the switch, we can just fall through.
2021-07-14 20:50:30 +02:00