Commit Graph

158 Commits

Author SHA1 Message Date
Botond Dénes
62e134f423 sstables: replace SCYLLA_ASSERT() with parse_assert() on the read path
So parse errors on corrupt SSTables don't result in crashes, instead
just aborting the read in process.
There are a lot of SCYLLA_ASSERT() usages remaining in sstables/. This
patch tried to focus on those usages which are in the read path. Some
places not only used on the read path may have been converted too, where
the usage of said method is not clear.

(cherry picked from commit bce89c0f5e)
2025-06-26 14:53:08 +00:00
Kefu Chai
7215d4bfe9 utils: do not include unused headers
these unused includes were identifier by clang-include-cleaner. after
auditing these source files, all of the reports have been confirmed.

please note, because quite a few source files relied on
`utils/to_string.hh` to pull in the specialization of
`fmt::formatter<std::optional<T>>`, after removing
`#include <fmt/std.h>` from `utils/to_string.hh`, we have to
include `fmt/std.h` directly.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-01-14 07:56:39 -05:00
Avi Kivity
f3eade2f62 treewide: relicense to ScyllaDB-Source-Available-1.0
Drop the AGPL license in favor of a source-available license.
See the blog post [1] for details.

[1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/
2024-12-18 17:45:13 +02:00
Botond Dénes
05246e123d Merge 'sstables: Avoid computing column_values_fixed_lengths on each read' from Tomasz Grabiec
Reads which need sstable index were computing
column_values_fixed_lengths each time. This showed up in perf profile
for a sstable-read heavy workload, and amounted to about 1-2% of time.

Computing it involves type name parsing.

Avoid by using cached per-sstable mapping. There is already
sstable::_column_translation which can be used for this. It caches the
mapping for the least-recently used schema. Since the cursor uses the
mapping only for primary key columns, which are stable, any schema
will do, so we can use the last _column_translation. We only need to
make sure that it's always armed, so sstable loading is augmented with
arming with sstable's schema.

Also, fixes a potential use-after-free on schema in column_translation.

Closes scylladb/scylladb#21347

* github.com:scylladb/scylladb:
  sstables: Fix potential use-after-free on column_translation::column_info::name
  sstables: Avoid computing column_values_fixed_lengths on each read
2024-12-12 12:22:32 +02:00
Kefu Chai
48c8d24345 treewide: drop support for fmt < v10
since fedora 38 is EOL. and fedora 39 comes with fmt v10.0.0, also,
we've switched to the build image based on fedora 40, which ships
fmt-devel v10.2.1, there is no need to support fmt < 10.

in this change, we drop the support fmt < 10.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21847
2024-12-09 20:42:38 +02:00
Tomasz Grabiec
b0a5bf8b4a sstables: Avoid computing column_values_fixed_lengths on each read
Reads which need clustering index cursor were computing
column_values_fixed_lengths each time. This showed up in perf profile
for a sstable-read heavy workload, and amounted to about 1%.

Avoid by using cached per-sstable mapping. There is already
sstable::_column_translation which can be used for this. It caches the
mapping for the most recently used schema. Since the cursor uses the
mapping only for primary key columns, which are stable, any schema
will do, so we can use the last _column_translation. We only need to
make sure that it's always armed, so sstable loading is augmented with
arming with sstable's schema.
2024-12-09 14:05:37 +01:00
Avi Kivity
b706e3e9e4 Merge 'sstables/index_reader: avoid unnecessary index page reads in single-partition reads' from Michał Chojnowski
Terminology note: in the context of this series, "index page" means an contiguous segment of the index file starting (inclusive) at a key corresponding to a summary entry and ending (exclusive) before the key corresponding to the next summary entry. "Index pages" are not related to filesystem pages.

---

In a single-partition read, if the searched partition key is the first key in its index page, we start scanning the index for that key starting at the previous index page (inclusive), even though we could start directly from the key's page. Similarly, if the searched partition key is absent from the sstable and lies after all other keys in its appropriate page, we additionally scan the next page, even though it's known from the summary that it can't possibly contain the key.

Those cases are wasteful. It's worse than it might seem at first glance. When partitions are small, only a small fraction of search keys fulfills those conditions (i.e. "first key in its page" or "an absent key greater than the last key in its page"), so the waste doesn't matter much. But when partitions are big enough, every index page contains only one partition key (and a promoted index for that partition), which directly means that *all* search keys fulfill the conditions, which means that total index reading work is two times bigger than what it should be.

In addition, there is a secondary performance bug which, when the aforementioned conditions are fulfilled, causes *additional* I/O to happen *past* the index reads which are actually parsed and used. In effect, the index I/O in single-partition reads might be not just doubled, but even tripled (that's for IOPS — throughput might be multiplied even more), all because of a slight inaccuracy in the edge cases.

This series fixes those inefficiencies by tightening the edge cases and ensuring that single-partition reads always read only a single index page.

Here's an example where we query the first row (i.e. `LIMIT 1`) of a certain partition key, in a table with large (1 MB) promoted indexes. Before the patch, the lookup of the lower bound involves 3 serialized disk reads (as described above) to subsequent index pages, and even the lookup of the upper bound involves 2 disk reads:

```
                                                                                                                                                                                     Execute CQL3 query
                                                                                                                                                                          Parsing a statement [shard 0]
                                                                                                                                     Processing a statement for authenticated user: anonymous [shard 0]
                                                                                                                                                        Executing read query (reversed false) [shard 0]
                                                                   Creating read executor for token -1297921881139976049 with all: [127.11.11.1] targets: [127.11.11.1] repair decision: NONE [shard 0]
                                                                    Creating never_speculating_read_executor - speculative retry is disabled or there are no extra replicas to speculate with [shard 0]
                                                                                                                                                                  read_data: querying locally [shard 0]
                                                                                                                         Start querying singular range {{-1297921881139976049, pk{00023130}}} [shard 0]
                                                                                                                                     [reader concurrency semaphore user] admitted immediately [shard 0]
                                                                                                                                           [reader concurrency semaphore user] executing read [shard 0]
                            Reading key {-1297921881139976049, pk{00023130}} from sstable ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 38359040 [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 38391808 [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 38359040, successfully read 32768 bytes [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 38391808, successfully read 32768 bytes [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 39370752 [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 39403520 [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 39370752, successfully read 32768 bytes [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 39403520, successfully read 32768 bytes [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40378368 [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40411136 [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40378368, successfully read 32768 bytes [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40411136, successfully read 32768 bytes [shard 0]
                                                                                                                      upper_bound_cache_only({position: clustered, ckp{}, 1}): no upper bound [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40378368 [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40411136 [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40378368, successfully read 32768 bytes [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40411136, successfully read 32768 bytes [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 41390080 [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 41422848 [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 41390080, successfully read 32768 bytes [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 41422848, successfully read 32768 bytes [shard 0]
                                 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db: scheduling bulk DMA read of size 21926 at offset 819200 [shard 0]
    ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db: finished bulk DMA read of size 21926 at offset 819200, successfully read 24576 bytes [shard 0]
                                      Page stats: 1 partition(s), 0 static row(s) (0 live, 0 dead), 1 clustering row(s) (1 live, 0 dead), 0 range tombstone(s) and 0 cell(s) (0 live, 0 dead) [shard 0]
                                                                                                                                                                             Querying is done [shard 0]
                                                                                                                                                         Done processing - preparing a result [shard 0]
                                                                                                                                                                                       Request complete
```

After the patch, the lookup of each bound involves 1 read:
```

                                                                                                                                                                                     Execute CQL3 query
                                                                                                                                                                          Parsing a statement [shard 0]
                                                                                                                                     Processing a statement for authenticated user: anonymous [shard 0]
                                                                                                                                                        Executing read query (reversed false) [shard 0]
                                                                   Creating read executor for token -1297921881139976049 with all: [127.11.11.1] targets: [127.11.11.1] repair decision: NONE [shard 0]
                                                                    Creating never_speculating_read_executor - speculative retry is disabled or there are no extra replicas to speculate with [shard 0]
                                                                                                                                                                  read_data: querying locally [shard 0]
                                                                                                                         Start querying singular range {{-1297921881139976049, pk{00023130}}} [shard 0]
                                                                                                                                     [reader concurrency semaphore user] admitted immediately [shard 0]
                                                                                                                                           [reader concurrency semaphore user] executing read [shard 0]
                            Reading key {-1297921881139976049, pk{00023130}} from sstable ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 39370752 [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 39403520 [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 39370752, successfully read 32768 bytes [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 39403520, successfully read 32768 bytes [shard 0]
                                                                                                                      upper_bound_cache_only({position: clustered, ckp{}, 1}): no upper bound [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40378368 [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40411136 [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40378368, successfully read 32768 bytes [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40411136, successfully read 32768 bytes [shard 0]
                                 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db: scheduling bulk DMA read of size 21926 at offset 819200 [shard 0]
    ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db: finished bulk DMA read of size 21926 at offset 819200, successfully read 24576 bytes [shard 0]
                                      Page stats: 1 partition(s), 0 static row(s) (0 live, 0 dead), 1 clustering row(s) (1 live, 0 dead), 0 range tombstone(s) and 0 cell(s) (0 live, 0 dead) [shard 0]
                                                                                                                                                                             Querying is done [shard 0]
                                                                                                                                                         Done processing - preparing a result [shard 0]
                                                                                                                                                                                       Request complete
```

Doesn't have to be backported, since the problem only affects performance, not correctness, and it has been present since forever.

Closes scylladb/scylladb#20897

* github.com:scylladb/scylladb:
  index_reader: remove a piece of misguided code involved in single-partition reads
  index_reader: in single-partition reads, don't read more than one page
  index_reader: fix unnecessary reads of preceding index pages
2024-11-04 14:28:27 +02:00
Tomasz Grabiec
1b82d5117a sstables: bsearch_clustered_cursor: Narrow down range using "end" position of the block
This is optimization.

Example:

  block0: start=aaa, end=aaA
  block1: start=bbb, end=bbB
  block2: whatever

Before the patch, advance_to("aAA") would skip to block0, and upper
bound probe would skip to block1. This way, the reader would read the
range of block0 from the data file.

After the patch, "end" position is taken into account, so
advance_to("aAA") will notice that block0 doesn't contain the position
and will skip to block1. This is especially important for dense
indexes, as it allows us to skip accessing data file if the search key
is missing.

It also solves the edge case problem related to the fact that single
row reads are using a range which with positions which are not equal
to the key, but are before(key) and after(key) for the lower bound and
upper bound respectively. Before the patch, advance_to(before("bbb"))
would skip to block0, before the position is before the block1's
start. And upper bound probe for after("bbb") would point to
block2. This way the read would scan block0 needlessly. After the
patch, advance_to(before("bbb")) will skip to block1 because we notice
based on "end" that block0 doesn't contain the position.

This change also ensures that the start position of the upper bound
entry of the after_key(pos), where pos is the last advance_to()
position, is warm in cache. This is needed to optimize single-row
reads with a dense index so that they always read exactly one promoted
index block. For this to work, probe_upper_bound() for the
after_key(row) always needs to find the upper bound block in
cache.
2024-10-03 14:16:05 +02:00
Tomasz Grabiec
a29501ed67 sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions
Single-row reads from large partition issue 64 KiB reads to the data file,
which is equal to the default span of the promoted index block in the data file.
If users would want to reduce selectivity of the index to speed up single-row reads,
this won't be effective. The reason is that the reader uses promoted index
to look up the start position in the data file of the read, but end position
will in practice extend to the next partition, and amount of I/O will be
determined by the underlying file input stream implementation and its
read-ahead heuristics. By default, that results in at least 2 IOs 32KB each.

There is already infrastructure to lookup end position based on upper
bound of the read, but it's not effective becasue it's a
non-populating lookup and the upper bound cursor has its own private
cached_promoted_index, which is cold when positions are computed. It's
non-populating on purpose, to avoid extra index file IO to read upper
bound. In case upper bound is far-enough from the lower bound, this
will only increase the cost of the read.

The solution employed here is to warm up the lower bound cursor's
cache before positions are computed, and use that cursor for
non-populating lookup of the upper bound.

We use the lower bound cursor and the slice's lower bound so that we
read the same blocks as later lower-bound slicing would, so that we
don't incur extra IO for cases where looking up upper bound is not
worth it, that is when upper bound is far from the lower bound. If
upper bound is near lower bound, then warming up using lower bound
will populate cached_promoted_index with blocks which will allow us to
locate the upper bound block accurately.  This is especially important
for single-row reads, where the bounds are around the same key.  In
this case we want to read the data file range which belongs to a
single promoted index block.  It doesn't matter that the upper bound
is not exactly the same. They both will likely lie in the same block,
and if not, binary search will bring adjacent blocks into cache.  Even
if upper bound is not near, the binary search will populate the cache
with blocks which can be used to narrow down the data file range
somewhat.

Fixes #10030.

The change was tested with perf-fast-forward.

I populated the data set with `column_index_size_in_kb` set to 1

  scylla perf-fast-forward --populate --run-tests=large-partition-slicing --column-index-size-in-kb=1

Test run:

  build/release/scylla perf-fast-forward --run-tests=large-partition-select-few-rows -c1 --keep-cache-across-test-cases --test-case-duration=0

This test reads two rows from the middle of a large partition (1M
rows), of subsequent keys. The first read will miss in the index file
page cache, the second read will hit.

Notice that before the change, the second read issued 2 aio requests worth of 64KiB in total.
After the change, the second read issued 1 aio worth of 2 KiB. That's because promoted index block is larger than 1 KiB.
I verified using logging that the data file range matches a single promoted index block.

Also, the first read which misses in cache is still faster after the change.

Before:

running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride  rows      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
500000  1         0.009802            1         1        102          0        102        102       21.0     21        196       2       1        0        1        1        0        0        0       568     269 4716050  53.4%
500001  1         0.000321            1         1       3113          0       3113       3113        2.0      2         64       1       0        1        0        0        0        0        0       116      26  555110  45.0%

After:

running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride  rows      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
500000  1         0.009609            1         1        104          0        104        104       20.0     20        137       2       1        0        1        1        0        0        0       561     268 4633407  43.1%
500001  1         0.000217            1         1       4602          0       4602       4602        1.0      1          2       1       0        1        0        0        0        0        0       110      26  313882  64.1%

(cherry picked from commit dfb339376aff1ed961b26c4759b1604f7df35e54)
2024-10-01 18:40:34 +02:00
Michał Chojnowski
c77d00fd8d index_reader: remove a piece of misguided code involved in single-partition reads
This patch removes a piece of code which, according to the comment,
allows for forwarding the index reader even if it was created as a
single-partition reader.

For single-partition reads, the input_stream
used by the reader is limited to the single index page containing the partition,
since reading the index file past that point would be a waste.
Because of this limit, such an index reader can't be forwarded/advanced.

The dubious piece of code gets around that by unsetting the stream and ensuring
it will be re-created, this time without the limit, if the index is advanced.

But there is no use for this. The idea of a "single-partition reader"
exist as an optimization. It's illegal to forward single-partition readers,
and it doesn't make sense to attempt that. (If there's a need for forwarding,
just don't create a single-partition reader).

I suspect this piece of code was written due to a misunderstanding.
Before the previous patch in this series, when the searched partition key
was the first key in its page, the index reader would scan the preceding
page first, realize it made a mistake, and advance to the next, correct page.
I suspect this piece of code was written to make this work.

But this is, in fact, undesirable. The fact that the index reader was working
like this was a performance bug. In the single-partition case there's
never an inherent reason to start with the wrong page. The index logic can be
corrected to always start with the right page, and that's what the previous
patch in this series does. And with that, there is no need to support advancing
anymore, and the dubious piece of code can be erased.

We also add an assert to emphasize that advancing a single-partition reader is
illegal.
2024-09-30 23:43:36 +02:00
Michał Chojnowski
bc30509523 index_reader: in single-partition reads, don't read more than one page
When looking for a partition key in the index, we scan the index from the
first index page which can possibly contain the key.

In a single-partition read, there is never a reason to read beyond that
page. After the previous patch in this series, it's guaranteed that
the first key in the next page is strictly greater than the searched key.
So if the searched key is greater than the last key in the first page,
then it is neither in the first nor the second page -- it must be absent
from the sstable.

But with the current logic, we read the second index page anyway, and
the realization that the key is absent happens higher in the call chain.

This patch optimizes that inefficiency by immediately returning EOF
if a single-partition read doesn't find the key in the first page.

Returning "end of file" even though we didn't actually go beyond the
end of file is hacky, but I don't see any other non-invasive way
of communicating to the caller that the partition is absent.

Some caller of the index could possibly assume that returning EOF
proves that the searched key is greater than all keys in the sstable.
I don't think any such caller exists today, but it's a possible
place for confusion.

Together with the previous patch in this series, this patch guarantees
that a single-partition read only accesses a single index page.
This fixes a weird secondary performance bug.
Due to some misunderstanding in the logic, when during a single-partition read
we scan two index pages, the second index page is scanned via an input_stream
created without an upper I/O limit, which means that we additionally read
a full read-ahead (currently: 64 kiB) past the second index page for no reason
whatsoever.
After this and the previous patch, a single-partition read always reads exactly one
index page, so the above problem cannot occur.
2024-09-30 23:43:36 +02:00
Michał Chojnowski
6b8b7d962c index_reader: fix unnecessary reads of preceding index pages
When setting the index to position X, we first look for the first summary
entry N such that N >= X. Then we load the index page preceding N and
scan it for the first partition key P such that P >= X.
If there is no such key in this page, then we scan the next page
(starting with N) for such key. (In this case it's always the first key).

For example, assume we have:
summary:  A   C   E
index:    A B C D E F

If we look up "B" in the index, then we first locate summary entry "C",
then we scan the index for B, starting from "A". This is all fine.

But when we look for "C" in the index, then we do the exactly the same
-- we scan the index for "C" starting from "A". This is wasteful, because
we can start scanning from "C".

To avoid this inefficiency, we should be looking for N > X, not N >= X.
This patch fixes that.

In addition, this fixes a second, weirder performance bug.
Due to some misunderstanding in the logic, when during a single-partition read
we scan two index pages, the second index page is scanned via an input_stream
created without an upper I/O limit, which means that we additionally read
a full read-ahead (currently: 64 kiB) past the second index page for no reason
whatsoever. After this patch, a single-partition read always reads exactly one
index page, so the above problem cannot occur.
2024-09-30 23:43:36 +02:00
Avi Kivity
aa1270a00c treewide: change assert() to SCYLLA_ASSERT()
assert() is traditionally disabled in release builds, but not in
scylladb. This hasn't caused problems so far, but the latest abseil
release includes a commit [1] that causes a 1000 insn/op regression when
NDEBUG is not defined.

Clearly, we must move towards a build system where NDEBUG is defined in
release builds. But we can't just define it blindly without vetting
all the assert() calls, as some were written with the expectation that
they are enabled in release mode.

To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT()
macro in utils/assert.hh. This macro is always defined and is not conditional
on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release
mode.

[1] 66ef711d68

Closes scylladb/scylladb#20006
2024-08-05 08:23:35 +03:00
Lakshmi Narayanan Sreethar
64dadd5ec2 sstables/index_reader: stop consuming index when abort has been requested
When an abort is requested, stop further reading of the index file and
throw and exception from index_consume_entry_context::process_state().

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-07-16 20:42:50 +05:30
Lakshmi Narayanan Sreethar
c2524337a2 sstables::index_consume_entry_context: store abort_source
Store abort source inside sstables::index_consume_entry_context, so that
the next patch can implement cancelling the index read when abort is
requested.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-07-16 20:42:50 +05:30
Aleksandra Martyniuk
4530be9e5b test: add test to check if reader is closed
Add test to check if reader is closed in sstable::has_partition_key.
2024-02-22 14:53:14 +01:00
Michał Chojnowski
5a3e4a1cc0 utils: managed_bytes: optimize memory usage for small buffers
managed_bytes is implemented as chain of blob_storage objects.
Each blob_storage contains 24 bytes of metadata. But in the most
common case -- when there is only a single element in the chain --
16 bytes of this metadata is trivial/unused.

This is regrettable waste because managed_bytes is used for every
database cell in the memtables and cache. It means that every value
of size >= 7 bytes (smaller ones fit in the inline storage of
managed_bytes) receives 16 bytes of useless overhead.

To correct that, this patch adds to managed_bytes an alternative storage
layout -- used for buffers small enough to fit in one contiguous
fragment -- which only stores the necessary minimum of metadata.
(That is: a pointer to the parent, to facilitate moving the storage during
memory defragmentation).
2024-02-09 20:56:20 +01:00
Kefu Chai
a6152cb87b sstables: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16666
2024-01-09 11:45:44 +02:00
Kefu Chai
893f319004 sstables: add formatter for index_consume_entry_context_state
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, in order to enable the code in the header to
access the formatter without being moved down after the full specialization's
definition, we

* move the enum definition out of the class and before the
  class,
* rename the enum's name from state to index_consume_entry_context_state
* define a formatter for index_consume_entry_context_state
* remove its operator<<().

as fmt v10 is able to use `format_as()` as a fallback, the formatter
full specialization is guarded with `#if FMT_VERSION < 10'00'00`. we
will remove it after we start build with fmt v10.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16204
2023-12-08 12:45:38 +02:00
Michał Chojnowski
f00bed9429 sstables: partition_index_cache: deglobalize stats
Move partition_index_cache stats from a thread_local variable
to cache_tracker. After the change, partition_index_cache
receives a reference to the stats via constructor, instead of
referencing a global.

This is needed so that cache_tracker can know the memory usage
of index caches (for cache eviction purposes) without relying on
globals.

But it also makes sense even without that motive.
2023-09-01 22:34:41 +02:00
Michał Chojnowski
c7d9d35030 utils: cached_file: deglobalize cached_file metrics
Move cached_file metrics from a thread_local variable
to cache_tracker.

This is needed so that cache_tracker can know
the memory usage of index caches (for purposes
of cache eviction) without relying on globals.

But it also makes sense even without that motive.
2023-09-01 22:34:41 +02:00
Avi Kivity
0cabf4eeb9 build: disable implicit fallthrough
Prevent switch case statements from falling through without annotation
([[fallthrough]]) proving that this was intended.

Existing intended cases were annotated.

Closes #14607
2023-07-10 19:36:06 +02:00
Pavel Emelyanov
66e43912d6 code: Switch to seastar API level 7
In that level no io_priority_class-es exist. Instead, all the IO happens
in the context of current sched-group. File API no longer accepts prio
class argument (and makes io_intent arg mandatory to impls).

So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritants to updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command

The first change is huge and was made semi-autimatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields

Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass intent to lower file (if applicatble)

Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile

The scylla-gdb.py update is a bit hairry -- it needs to use task queues
list for IO classes names and shares, but to detect it should it checks
for the "commitlog" group is present.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13963
2023-06-06 13:29:16 +03:00
Pavel Emelyanov
2bb024c948 index_reader: Introduce and use default arguments to constructor
Most of creators of index_reader construct it with default prio class,
null trace pointer and use_caching::yes. Assigning implicit defaults to
constructor arguments keeps the code shorter and easier to read.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 11:29:04 +03:00
Pavel Emelyanov
3fd5d3cc2b index_reader: Use _pc field in get_file_input_stream_options() directly
No need to pass this-> field into this-> call

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 11:18:14 +03:00
Pavel Emelyanov
21d24e8ea3 index_reader: Move index_reader::get_file_input_stream_options to private: block
A "while at it" cleanup. When pathing the method (next patch) it turned
out that there are no other callers other than local class, so it _is_
private.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 11:18:14 +03:00
Kefu Chai
0cb842797a treewide: do not define/capture unused variables
these warnings are found by Clang-17 after removing
`-Wno-unused-lambda-capture` and '-Wno-unused-variable' from
the list of disabled warnings in `configure.py`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-15 22:57:18 +02:00
Benny Halevy
63c2cdafe8 sstables: index_reader: close(index_bound&) reset current_list
When closing _lower_bound and *_upper_bound
in the final close() call, they are currently left with
an engaged current_list member.

If the index_reader uses a _local_index_cache,
it is evicted with evict_gently which will, rightfully,
see the respective pages as referenced, and they won't be
evicted gently (only later when the index_reader is destroyed).

Reset index_bound.current_list on close(index_bound&)
to free up the reference.

Ref #12271

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #12370
2023-01-02 16:42:33 +01:00
Michał Chojnowski
d9269abf5b sstables: index_reader: always evict the local cache gently
Due to an oversight, the local index cache isn't evicted gently
when _upper_bound existed. This is a source of reactor stalls.
Fix that.

Fixes #12271

Closes #12364
2022-12-20 18:23:27 +02:00
Pavel Emelyanov
5f579eb405 sstable: Introduce index_filename()
Currently the sstable::filename(Index) is used in several places that
get the filename as a printable or throwable string and don't treat is
as a real location of any file.

For those, add the index_filename() helper symmetrical to toc_filename()
and (in some sense) the get_filename() one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:14:49 +03:00
Michał Chojnowski
0c54e7c5c7 sstables: index_reader: remove a stray vsprintf call from the hot path
sstable::get_filename() constructs the filename from components, which
takes some work. It happens to be called on every
index_reader::index_reader() call even though it's only used for TRACE
logs. That's 1700 instructions (~1% of a full query) wasted on every
SSTable read. Fix that.

Closes #11485
2022-09-08 14:29:23 +03:00
Botond Dénes
7501a075bd sstables/index_reader: push down eof() check to advance_to(index_bound&, dht::ring_position_view)
Commit e8f3d7dd13 added eof() checks to public partition-level
advance_to() methods, to ensure we do not attempt to re-read the last
page of the index when at eof(). It was noted however that this check
would be safer in advance_to(index_bound&, dht::ring_position_view)
because that is the method that all these higher-level methods end up
calling. Placing the check there would guarantee safety for all such
operations. This path does exactly that: it pushes down the check to
said method. One change needed for this to work is to check eof on the
bound that is currently advanced, instead of unconditionally checking
the lower bound.

Closes #10531
2022-05-11 14:46:30 +02:00
Botond Dénes
e8f3d7dd13 sstables/index_reader: short-circuit fast-forward-to when at EOF
Attempting to call advance_to() on the index, after it is positioned at
EOF, can result in an assert failure, because the operation results in
an attempt to move backwards in the index-file (to read the last index
page, which was already read). This only happens if the index cache
entry belonging to the last index page is evicted, otherwise the advance
operation just looks-up said entry and returns it.
To prevent this, we add an early return conditioned on eof() to all the
partition-level advance-to methods.
A regression unit test reproducing the above described crash is also
added.
2022-05-05 14:42:37 +03:00
Avi Kivity
585c0841c3 Merge 'sstables: enable read ahead for the partition index reader' from Wojciech Mitros
Currently, when advancing one of `index_reader`'s bounds, we're creating a new `index_consume_entry_context` with a new underlying file `input_stream` for each new page.

For either bound, the streams can be reused, because the indexes of pages that we are reading are never decreasing.

This patch adds a `index_consume_entry_context` to each of `index_reader`'s bounds, so that for each new page, the same file `input_stream` is used.
As a result, when reading consecutive pages, the reads that follow the first one can be satisfied by the `input_stream`'s read aheads, decreasing the number of blocking reads and increasing the throughput of the `index_reader`.

Additionally, we're reusing the `index_consumer` for all pages, calling `index_consumer::prepare` when we need to increase the size of  the `_entries` `chunked_managed_vector`.

A big difference can be seen when we're reading the entire table, frequently skipping a few rows; which we can test using perf_fast_forward:

Before:
```
running: small-partition-skips on dataset small-part
Testing scanning small partitions with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
   read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
-> 1       0         0.899447            4   1000000    1111794      12284    1113248    1096537      975.5    972     124356       1       0        0        0        0        0        0        0  12032202   29103    8967 100.0%
-> 1       1         1.805811            4    500000     276884        907     278214     275977     3655.8   3654     135084    2688       0     3161     4548     5935        0        0        0   7225100  140466   27010  75.6%
-> 1       8         0.927339            4    111112     119818        357     120465     119461     3654.0   3654     135084    2685       0     2133     4548     6963        0        0        0   1749663  107922   57502  50.2%
-> 1       16        0.790630            4     58824      74401        782      74617      73497     3654.0   3654     135084    2695       0     1975     4548     7121        0        0        0   1019189  109349   90832  42.7%
-> 1       32        0.717235            4     30304      42251        243      42266      41975     3654.0   3654     135084    2689       0     1871     4548     7225        0        0        0    619876  109199  156751  37.3%
-> 1       64        0.681624            4     15385      22571        244      22815      22286     3654.0   3654     135084    2685       0     1870     4548     7226        0        0        0    407671  105798  285688  34.0%
-> 1       256       0.630439            4      3892       6173         24       6214       6150     3549.0   3549     135116    2581       0     1313     3927     6505        0        0        0    232541  100803 1022454  29.1%
-> 1       1024      0.313303            4       976       3115        219       3126       2766     1956.0   1956     130608     986       0        0      987     1962        0        0        0     81165   41385 1724979  29.1%
-> 1       4096      0.083688            4       245       2928         85       3012       2134      738.8    737      17212     492     244        0      247      491        0        0        0     30500   19406 1999263  24.6%
-> 64      1         1.509011            4    984616     652491       2746     660930     649745     3673.5   3654     135084    2687       0     4507     4548     4589        0        0        0  11075882  117074   13157  68.9%
-> 64      8         1.424147            4    888896     624160       4446     625675     617713     3654.0   3654     135084    2691       0     4248     4548     4848        0        0        0  10019098  117383   13700  66.5%
-> 64      16        1.343276            4    800000     595559       5834     605880     589725     3654.0   3654     135084    2698       0     3989     4548     5107        0        0        0   9043830  124022   14206  64.9%
-> 64      32        1.249721            4    666688     533469       5056     536638     526212     3654.0   3654     135084    2688       0     3616     4548     5480        0        0        0   7570848  123043   15377  60.9%
-> 64      64        1.154549            4    500032     433097      10215     443312     415001     3654.0   3654     135084    2703       0     3161     4548     5935        0        0        0   5718758  110657   17787  53.2%
-> 64      256       1.005309            4    200000     198944       1179     199338     196989     3935.0   3935     137216    2966       0      690     4048     5592        0        0        0   2398359  110510   27855  51.3%
-> 64      1024      0.441913            4     58880     133239       8094     135471     120467     2161.0   2161     131820    1190       0        0     1192     1848        0        0        0    725092   45449   33740  59.7%
-> 64      4096      0.124826            4     15424     123564       5958     126814      95101      795.5    794      17400     553     240        0      312      482        0        0        0    199943   20869   46621  41.9%
```
After:
```
running: small-partition-skips on dataset small-part
Testing scanning small partitions with skips.
Reads whole range interleaving reads with skips according to read-skip pattern:
   read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
-> 1       0         0.917468            4   1000000    1089956       1422    1091378    1073112      975.5    972     124356       1       0        0        0        0        0        0        0  12032761   29721    8972 100.0%
-> 1       1         1.311446            4    500000     381259       3212     384470     377238     1087.0   1083     138420       2       0     4445     4548     4651        0        0        0   7096216   55681   20869 100.0%
-> 1       8         0.467975            4    111112     237432       1446     239372     235985     1121.2   1119     143124       9       0     4344     4548     4752        0        0        0   1619944   23502   28844  98.7%
-> 1       16        0.337085            4     58824     174508       3410     178451     171099     1117.5   1120     143276      11       0     4319     4548     4777        0        0        0    883692   19152   37460  96.8%
-> 1       32        0.262798            4     30304     115313       1222     116535     112400     1070.2   1066     135620     166      26     4354     4548     4742        0        0        0    483185   18856   54275  94.9%
-> 1       64        0.283954            4     15385      54181        531      56177      53650     2022.5   2040     137036     319      19     4351     4548     4745        0        0        0    292766   32998  102276  84.9%
-> 1       256       0.207020            4      3892      18800        575      19105      17520     1315.5   1334     136072     418      24     3703     3927     4115        0        0        0    118400   27427  292146  82.1%
-> 1       1024      0.164396            4       976       5937         57       5993       5842     1208.2   1195     135384     568      14      932      987     1030        0        0        0     62999   27554  503559  70.0%
-> 1       4096      0.085079            4       245       2880        108       2987       2714      635.8    634      26468     248     246      233      247      258        0        0        0     31264   12872 1546404  37.4%
-> 64      1         1.073331            4    984616     917346       7614     923983     909314     1812.2   1824     136792      11      20     4544     4548     4552        0        0        0  10971661   54538    9919  99.6%
-> 64      8         1.024389            4    888896     867733       6327     870429     845215     3027.2   3072     138212      31       0     4523     4548     4573        0        0        0   9933078   68059   10050  99.5%
-> 64      16        0.978754            4    800000     817366       7802     827665     809564     3012.2   3008     139884      39       0     4486     4548     4610        0        0        0   8947041   64050   10302  98.1%
-> 64      32        0.837266            4    666688     796267      10312     806579     785370     2275.8   2266     139672      29       0     4465     4548     4631        0        0        0   7458644   50754   10564  97.8%
-> 64      64        0.645627            4    500032     774490       4713     779203     768432     1136.8   1137     145428       8       0     4438     4548     4658        0        0        0   5593168   29982   10938  98.4%
-> 64      256       0.386192            4    200000     517877      22509     544067     495368     1134.8   1136     145300     109       0     2135     4048     4147        0        0        0   2270291   22840   13682  94.5%
-> 64      1024      0.238617            4     58880     246755      55856     305110     190899     1176.0   1118     135324     451      13      625     1192     1223        0        0        0    701262   24418   17323  71.1%
-> 64      4096      0.133340            4     15424     115674      14837     117978      99072      974.0    961      27132     366     347       99      312      383        0        0        0    209595   20657   43096  50.4%
```
For single partition reads, the index_reader is modified to behave in practically the same way, as before the change (not reading ahead past the page with the partition).
For example, a single partition read from a table with 10 rows per partition performs a single 6KB read from the index file, and the same read is performed before the change (as can be seen in traces below). If we enabled read aheads in that case, we would perform 2 16KB reads.
Relevant traces:
Before:
```
./tmp/data/ks/t2-75ebed30eb0211eb837a8f4cd3d1cf62/md-1-big-Index.db: scheduling bulk DMA read of size 6478 at offset 0 [shard 0] | 2021-07-23 15:22:25.847362 | 127.0.0.1 |            148 | 127.0.0.1
./tmp/data/ks/t2-75ebed30eb0211eb837a8f4cd3d1cf62/md-1-big-Index.db: finished bulk DMA read of size 6478 at offset 0, successfully read 6478 bytes [shard 0] | 2021-07-23 15:22:25.900996 | 127.0.0.1 |          53782 | 127.0.0.1
```
After:
```
./tmp/data/ks/t2-75ebed30eb0211eb837a8f4cd3d1cf62/md-1-big-Index.db: scheduling bulk DMA read of size 6478 at offset 0 [shard 0] | 2021-07-23 15:19:37.380033 | 127.0.0.1 |            149 | 127.0.0.1
./tmp/data/ks/t2-75ebed30eb0211eb837a8f4cd3d1cf62/md-1-big-Index.db: finished bulk DMA read of size 6478 at offset 0, successfully read 6478 bytes [shard 0] | 2021-07-23 15:19:37.433662 | 127.0.0.1 |          53777 | 127.0.0.1
```
Tests: unit(dev)

Closes #9063

* github.com:scylladb/scylla:
  sstables: index_reader: optimize single partition reads
  sstables: use read-aheads in the index reader
  sstables: index_reader: remove unused members from index reader context
2022-03-21 13:47:28 +02:00
Pavel Solodovnikov
95c8d65949 treewide: fix compilation issues with fmtlib 8.1.0+
Due to fd62fba985
scoped enums are not automatically converted to integers anymore,
this is the intended behavior, according to the fmtlib devs.

A bit nicer solution would be to use `std::to_underlying`
instead of a direct `static_cast`, but it's not available until
C++23 and some compilers are still missing the support for it.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2022-03-16 12:31:50 +03:00
Wojciech Mitros
7f590a3686 sstables: index_reader: optimize single partition reads
All entries from a single partition can be found in a
single summary page.
Because of that, in cases when we know we want to read
only one partition, we can limit the underyling file
input_stream to the range of the page.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2022-02-22 02:16:52 +01:00
Wojciech Mitros
c81992c665 sstables: use read-aheads in the index reader
Currently, when advancing one of index_reader's bounds,
we're creating a new index_consume_entry_context with a new
underlying file input_stream for each new page.

For either bound, the streams can be reused, because
the indexes of pages that we are reading are never
decreasing.

This patch adds a index_consume_entry_context to each of
index_reader's bounds, so that for each new page, the same
file input_stream is used.
As a result, when reading consecutive pages, the reads that
follow the first one can be satisfied by the input_stream's
read aheads, decreasing the number of blocking reads and
increasing the throughput of the index_reader.

Fixes #2388

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2022-02-22 01:51:33 +01:00
Wojciech Mitros
0a1500acd2 sstables: index_reader: remove unused members from index reader context
The _file_name and _index_file fields in index_consume_entry_context
are no longer used anywhere in the class (_file_name isn't even set,
and _index_file was previously used when creating a promoted_index,
which doesn't store the file object anymore)
2022-02-20 16:24:27 +01:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes we applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Botond Dénes
940874f3ff sstables/index_reader: process_state(): add additional information to trace logging
The amount of data available for parsing at the start of each entry, and
the parsed key size.
2022-01-18 10:38:11 +02:00
Botond Dénes
afb14508c4 sstables/index_reader: verify_end_state(): add check for premature EOS
Add a check which ensures that parsing ended in a valid state and not in
the middle of a half-parsed entry.
2022-01-18 10:38:11 +02:00
Botond Dénes
36c0fe904e sstables/index_reader: convert exception in verify_end_state() to malformed sstable exception
Errors during parsing are usually reported via malformed sstable
exception to signify their gravity of potentially being caused by
corrupt sstables. This patch converts the exception thrown in
`index_consume_entry_context::verify_end_state()`.
While at it the error message is improved as well. It currently suggests
that parsing was ended prematurely because data ran out, while in fact
the condition under which this error is thrown is the opposite: parsing
ended but there is unconsumed data left. The current state is also added
to the error message.
2022-01-18 10:38:11 +02:00
Botond Dénes
7508b4fd22 sstables/index_reader: add const sstable& to index_consume_entry_context
To be used by the next patches to throw malformed sstable exception.
2022-01-18 10:38:11 +02:00
Botond Dénes
9f3e5ae801 sstables/index_reader: remove unused members from index_consume_entry_context
The unused members are: _s and _file_name.
2022-01-18 10:38:11 +02:00
Botond Dénes
259649c779 sstables/index_reader: improved diagnostics on missing index entry
Add the summary index and the bound's address to the error message, so
it can be correlated with other trace level logging when investigating a
problem.

Refs: #9446

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211202124955.542293-2-bdenes@scylladb.com>
2021-12-02 19:43:30 +02:00
Wojciech Mitros
8385f3eb21 sstables: index_reader: add support for iterating over clustering ranges in reverse
In the sstable reader, we iterate over clustering ranges using the
index_reader, which normally only accepts advancing to increasing
positions. In this patch we add methods for advancing the index
reader in reverse.

To simplify our job we restrict our attention to a single implementation
of the promoted index block cursor: `bsearch_clustered_cursor`. The
`index_reader` methods for advancing in reverse will thus assume that
this implementation is used. The assumption is correct given that we're
working only with sstables of versions >= mc, which is indeed the
intended use case. We add some documentation in appropriate places to
make this obvious.

We extend `bsearch_clustered_cursor` with two methods:
`advance_past(pos)`, which advances the cursor to the first block after
`pos` (or to the end if there is no such block), and
`last_block_offset()`, which returns the data file offset of the first
row from the last promoted index block.

To efficiently find the position in the data file of the last row
of the partition (which we need when performing a reversed query)
the sstable reader may need to read the span of the entire last promoted
index block in the data file. To learn where the block starts it can use
`index_reader::last_block_offset()`, which is implemented in terms of
`bsearch_clustered_cursor::last_block_offset()`.

When performing a single partition read in forward order, the reader
asks the index to position its lower bound at the start of the partition
and its upper bound after the end of the slice. It starts by reading the
first range. After exhausting a range it jumps to the next one by asking
the index to advance the lower bound.

For reverse single partition reads we'll take a similar approach: the
initial bound positions are as in the forward case. However, we start
with the last range and after exhausting a range we want to jump to a
previous one; we will do it by advancing the upper bound in reverse
(i.e. moving it closer to the beginning of the partition).  For this
we introduce the `index_reader::advance_reverse` function.
2021-10-04 15:24:12 +02:00
Kamil Braun
e3f1667744 sstables: remove use_binary_search_in_promoted_index
This was a global variable that was potentially modified from a
performance benchmark. It would modify the behavior of `index_reader`
in certain scenarios.

Remove the variable so we can specify the behavior of `index_reader`
functions without relying on anything other than what's passed into the
constructor and the function parameters.
2021-09-19 13:59:25 +03:00
Tomasz Grabiec
21f1a7be8b sstables: Do not populate page cache when searching in promoted index for "bypass cache" reads
Reads which bypass cache will use a private temporary instance of cached_file
which dies together with the index cursor.

The cursor still needs a cached_file with cachig layer. Binary searching needs
caching for performance, some of the pages will be reused. Another reason to still
use cached_file is to work with a common interface, and reusing it requires minimal changes.
2021-07-15 12:14:28 +02:00
Tomasz Grabiec
f4227c303b sstables: Do not populate partition index cache for "bypass cache" reads
Index cursor for reads which bypass cache will use a private temporary
instance of the partition index cache.

Promoted index scanner (ka/la format) will not go through the page cache.
2021-07-15 12:13:20 +02:00
Tomasz Grabiec
f14576f4be sstables: Hide partition_index_cache implementation away from sstables.hh
Reduces scope of the header to index_reader.hh which reduces
recompilation time.
2021-07-02 19:02:14 +02:00