Files

Avi Kivity 94c21e5c05 Merge 'sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions' from Tomasz Grabiec

Single-row reads from a large partition issue 64 KiB of reads to the data file,
which is equal to the default span of a promoted index block in the data file.
If users want to increase the selectivity of the index to speed up single-row reads,
this won't be effective. The reason is that the reader uses the promoted index
to look up the start position of the read in the data file, but the end position
will in practice extend to the next partition, so the amount of I/O is
determined by the underlying file input stream implementation and its
read-ahead heuristics. By default, that results in at least 2 I/Os of 32 KiB each.

There is already infrastructure to look up the end position based on the upper
bound of the read, in anticipation of sharing the promoted index cache,
but it's not effective because it's a non-populating lookup and the upper
bound cursor has its own private cached_promoted_index, which is cold
when positions are computed. It's non-populating on purpose, to avoid
extra index file I/O to read the upper bound. If the upper bound is far enough
from the lower bound, that extra I/O would only increase the cost of the read.

The solution employed here is to warm up the lower bound cursor's
cache before positions are computed, and use that cursor for
non-populating lookup of the upper bound.

We use the lower bound cursor and the slice's lower bound so that we
read the same blocks as the later lower-bound slicing would. That way we
don't incur extra I/O in cases where looking up the upper bound is not
worth it, that is, when the upper bound is far from the lower bound. If
the upper bound is near the lower bound, then warming up using the lower
bound will populate cached_promoted_index with blocks which allow us to
locate the upper bound block accurately. This is especially important
for single-row reads, where the bounds are around the same key. In
this case we want to read the data file range which belongs to a
single promoted index block. It doesn't matter that the upper bound
is not exactly the same. Both bounds will likely lie in the same block,
and if not, binary search will bring adjacent blocks into the cache. Even
if the upper bound is not near, the binary search will populate the cache
with blocks which can be used to narrow down the data file range
somewhat.
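
The idea above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the actual sstable reader code: `Block`, `ToyPromotedIndex`, `warm_up_lower_bound` and `upper_bound_end` are hypothetical names, clustering keys are plain integers, and every block is assumed to span 1 KiB of the data file. It shows why a non-populating upper-bound lookup is useless on a cold cache and accurate once the lower-bound binary search has warmed the same cache.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    first_key: int  # first clustering key covered by the block
    last_key: int   # last clustering key covered by the block
    start: int      # data file offset where the block begins
    end: int        # data file offset just past the block


class ToyPromotedIndex:
    """Hypothetical stand-in for a cursor with a private block cache."""

    def __init__(self, blocks):
        self._on_disk = blocks  # stands in for the index file
        self.cache = {}         # block number -> Block; starts cold
        self.index_reads = 0    # number of simulated index file reads

    def _load(self, i):
        # Populating access: a cache miss costs one index read.
        if i not in self.cache:
            self.index_reads += 1
            self.cache[i] = self._on_disk[i]
        return self.cache[i]

    def warm_up_lower_bound(self, key):
        # Populating binary search for the first block whose last_key >= key.
        lo, hi = 0, len(self._on_disk)
        while lo < hi:
            mid = (lo + hi) // 2
            if self._load(mid).last_key < key:
                lo = mid + 1
            else:
                hi = mid
        return self._load(lo).start if lo < len(self._on_disk) else None

    def upper_bound_end(self, key, partition_end):
        # Non-populating lookup: consult cached blocks only and fall back
        # to the end of the partition when nothing useful is cached.
        ends = [b.end for b in self.cache.values() if b.last_key >= key]
        return min(ends, default=partition_end)


# 1000 blocks, 10 keys and 1 KiB of data each (assumed layout).
blocks = [Block(10 * i, 10 * i + 9, 1024 * i, 1024 * (i + 1)) for i in range(1000)]
partition_end = blocks[-1].end

cold = ToyPromotedIndex(blocks)
cold_end = cold.upper_bound_end(5000, partition_end)  # cold cache

warm = ToyPromotedIndex(blocks)
start = warm.warm_up_lower_bound(5000)                # warms the cache
warm_end = warm.upper_bound_end(5000, partition_end)  # reuses it

print("cold range:", cold_end - start, "bytes")
print("warm range:", warm_end - start, "bytes")
```

In this toy run the cold lookup falls back to the end of the partition, while the warmed lookup narrows the range to the single block that contains the key, without any index read beyond those the lower-bound search needed anyway.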

Fixes #10030.

The change was tested with perf-fast-forward.

I populated the data set with `column_index_size_in_kb` set to 1:

  scylla perf-fast-forward --populate --run-tests=large-partition-slicing --column-index-size-in-kb=1

Test run:

  build/release/scylla perf-fast-forward --run-tests=large-partition-select-few-rows -c1 --keep-cache-across-test-cases --test-case-duration=0

This test issues two reads of subsequent keys from the middle of a large partition (1M rows in total). The first read will miss in the index file page cache, the second read will hit.

Notice that before the change, the second read issued 2 aio requests totalling 64 KiB.
After the change, the second read issued 1 aio request of 2 KiB. That's because a promoted index block is larger than 1 KiB.
I verified using logging that the data file range matches a single promoted index block.

Also, the first read, which misses in the cache, is still faster after the change.

Before:

```
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride  rows      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
500000  1         0.009802            1         1        102          0        102        102       21.0     21        196       2       1        0        1        1        0        0        0       568     269 4716050  53.4%
500001  1         0.000321            1         1       3113          0       3113       3113        2.0      2         64       1       0        1        0        0        0        0        0       116      26  555110  45.0%
```

After:

```
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride  rows      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
500000  1         0.009609            1         1        104          0        104        104       20.0     20        137       2       1        0        1        1        0        0        0       561     268 4633407  43.1%
500001  1         0.000217            1         1       4602          0       4602       4602        1.0      1          2       1       0        1        0        0        0        0        0       110      26  313882  64.1%
```
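
As a rough aid for reading the two tables, the snippet below computes before/after ratios for the second (cache-hitting) read. The numbers are copied from the stride 500001 rows above; the dictionary keys are made-up labels for this illustration, not perf-fast-forward identifiers.

```python
# Values copied from the stride=500001 rows of the "Before" and "After" tables.
before = {"time_s": 0.000321, "aio": 2, "kib": 64, "insns_per_frag": 555110}
after = {"time_s": 0.000217, "aio": 1, "kib": 2, "insns_per_frag": 313882}

# Ratio > 1 means the metric dropped after the change.
ratios = {name: before[name] / after[name] for name in before}

for name, ratio in ratios.items():
    print(f"{name}: {before[name]} -> {after[name]} ({ratio:.2f}x lower)")
```

These are single-iteration measurements, so the time ratio in particular should be read as indicative only.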

Backports: none, not a regression

Closes scylladb/scylladb#20522

* github.com:scylladb/scylladb:
  perf: perf_fast_forward: Add test case for querying missing rows
  perf-fast-forward: Allow overriding promoted index block size
  perf-fast-forward: Test subsequent key reads from the middle in test_large_partition_select_few_rows
  perf-fast-forward: Allow adding key offset in test_large_partition_select_few_rows
  perf-fast-forward: Use single-partition reads in test_large_partition_select_few_rows
  sstables: bsearch_clustered_cursor: Add more tracing points
  sstables: reader: Log data file range
  sstables: bsearch_clustered_cursor: Unify skip_info logging
  sstables: bsearch_clustered_cursor: Narrow down range using "end" position of the block
  sstables: bsearch_clustered_cursor: Skip even to the first block
  test: sstables: sstable_3_x_test: Improve failure message
  sstables: mx: writer: Never include partition_end marker in promoted index block width
  sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions
  sstables: clustered_cursor: Track current block

2024-10-28 21:13:23 +02:00

api_v2.md

docs: dev: correct a typo

2023-03-31 17:19:08 +03:00

backport.md

Typos: fix typos in documentation

2023-12-07 11:10:17 +02:00

building.md

Merge 'Fixes for docs/dev/building.md' from Kamil Braun

2023-02-26 19:27:33 +02:00

cdc.md

raft topology: store committed CDC generations' IDs in the topology

2024-02-20 12:35:16 +01:00

code-coverage.md

Add code coverage documentation

2024-01-18 11:11:34 +02:00

commitlog-file-format.md

docs: Add entry on commitlog file format v4

2024-09-03 16:38:28 +00:00

compaction_controller.md

docs: dev: write mathematical expressions in LaTeX

2023-03-29 15:07:14 +03:00

compilation-time-analysis.md

doc/dev: add document about analyzing build time

2023-09-01 11:33:36 +03:00

cql3-type-mapping.md

…

cql-extensions-internal.md

…

debugging.md

docs: Extend debugging with info about exploring ELF notes

2024-08-05 09:49:52 +03:00

describe_schema.md

docs/dev: Add documentation for DESC SCHEMA

2024-09-24 14:18:01 +02:00

docker-hub.md

doc: remove outdated JMX references

2024-10-07 13:55:15 +03:00

hinted_handoff_design.md

docs: Update Hinted Handoff documentation

2024-04-28 01:22:59 +02:00

IDL.md

Typos: more/less then -> more/less than

2024-02-13 17:16:15 +02:00

isolation.md

docs: isolation.md: add section on RPC call isolation

2024-05-21 03:12:22 -04:00

logging.md

…

lua-type-mapping.md

…

maintainer.md

docs/dev/maintainer.md: clarify "Updating submodule references"

2024-09-05 13:57:32 +03:00

metrics.md

…

migrating-from-users-to-roles.md

…

modules.md

forward_service: rename to mapreduce_service

2024-07-03 19:29:47 +03:00

mvcc.md

doc: Introduce docs/dev/mvcc.md

2023-01-27 19:15:39 +01:00

object_storage.md

docs: promote object storage configuration to user-facing documentation

2024-10-22 18:26:19 +08:00

paged-queries.md

treewide: rename flat_mutation_reader_v2 to mutation_reader

2024-06-21 07:12:06 +03:00

parallel_aggregations.md

…

per-partition-rate-limit.md

Typos: fix typos in documentation

2023-12-07 11:10:17 +02:00

protocol-extensions.md

docs: fix misspellings

2024-01-26 13:14:21 +02:00

protocols.md

treewide: drop thrift support

2024-06-07 06:44:59 +08:00

raft-in-scylla.md

Merge 'Don't calculate hashes for schema versions in Raft mode' from Kamil Braun

2023-12-11 12:17:57 +01:00

reader-concurrency-semaphore.md

docs/dev/reader-concurrency-semaphore.md: update the documentation on diagnostics dumps

2024-09-12 08:31:25 -04:00

README.md

docs: fix misspellings

2024-01-26 13:14:21 +02:00

redis.md

docs: fix typos in dev documents

2024-05-27 12:28:34 +03:00

repair_based_node_ops.md

docs/dev/repair_based_node_ops: better formatting

2023-05-25 08:31:43 +03:00

reverse-reads.md

reverse-reads.md: Drop legacy reverse format information

2024-08-13 10:07:12 +02:00

review-checklist.md

…

row_cache.md

doc: Introduce docs/dev/mvcc.md

2023-01-27 19:15:39 +01:00

row_level_repair.md

…

rust.md

rust: use depfile and Cargo.lock to avoid building rust when unnecessary

2023-01-12 14:44:11 +02:00

secondary_index.md

docs/design-notes/secondary_index: add VALUES to index target list

2022-08-14 10:29:52 +03:00

service_levels.md

docs/dev/service_levels: replace unspecified workload type with NULL

2024-09-24 11:43:29 +03:00

sstable-scylla-format.md

Merge 'sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions' from Tomasz Grabiec

2024-10-28 21:13:23 +02:00

sstables-directory-structure.md

…

system_keyspace.md

treewide: drop thrift support

2024-06-07 06:44:59 +08:00

system_schema_keyspace.md

…

task_manager.md

api: task_manager: add operation to get ttl

2024-08-29 13:53:39 +02:00

testing.md

Add documentation how to use allure reporting

2024-07-01 16:21:50 +02:00

timestamp-conflict-resolution.md

docs: use less slangy language

2024-03-13 13:33:37 +02:00

topology-over-raft.md

Extend system.topology with 3 new columns to store data required to process alter ks global topo req

2024-05-28 13:55:11 +02:00

tracing.md

…

virtual-tables.md

…

README.md

Scylla developer documentation

This folder contains developer-oriented documentation concerning the ScyllaDB codebase. We also have a wiki, which contains additional developer-oriented documentation. There is currently no clear definition of what goes where, so when looking for something be sure to check both.

Seastar documentation can be found here.

User documentation can be found on docs.scylladb.com

For information on how to build Scylla and how to contribute, see HACKING.md and CONTRIBUTING.md.

Index

Module list and dependencies