scylladb/docs/dev/sstable-scylla-format.md
Avi Kivity 94c21e5c05 Merge 'sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions' from Tomasz Grabiec
Single-row reads from a large partition issue 64 KiB of reads to the data file,
which is equal to the default span of a promoted index block in the data file.
If users want to increase the selectivity of the index to speed up single-row reads,
this won't be effective. The reason is that the reader uses the promoted index
to look up the start position of the read in the data file, but the end position
will in practice extend to the next partition, and the amount of I/O will be
determined by the underlying file input stream implementation and its
read-ahead heuristics. By default, that results in at least 2 I/Os of 32 KiB each.

There is already infrastructure to look up the end position based on the upper
bound of the read, in anticipation of sharing the promoted index cache,
but it's not effective because it's a non-populating lookup and the upper
bound cursor has its own private cached_promoted_index, which is cold
when positions are computed. It's non-populating on purpose, to avoid
extra index file I/O to read the upper bound. If the upper bound is far enough
from the lower bound, that extra I/O would only increase the cost of the read.

The solution employed here is to warm up the lower bound cursor's
cache before positions are computed, and use that cursor for
non-populating lookup of the upper bound.

We use the lower bound cursor and the slice's lower bound so that we
read the same blocks as later lower-bound slicing would, so that we
don't incur extra IO for cases where looking up upper bound is not
worth it, that is when upper bound is far from the lower bound. If
upper bound is near lower bound, then warming up using lower bound
will populate cached_promoted_index with blocks which will allow us to
locate the upper bound block accurately.  This is especially important
for single-row reads, where the bounds are around the same key.  In
this case we want to read the data file range which belongs to a
single promoted index block.  It doesn't matter that the upper bound
is not exactly the same. They both will likely lie in the same block,
and if not, binary search will bring adjacent blocks into cache.  Even
if upper bound is not near, the binary search will populate the cache
with blocks which can be used to narrow down the data file range
somewhat.

Fixes #10030.

The change was tested with perf-fast-forward.

I populated the data set with `column_index_size_in_kb` set to 1

  scylla perf-fast-forward --populate --run-tests=large-partition-slicing --column-index-size-in-kb=1

Test run:

  build/release/scylla perf-fast-forward --run-tests=large-partition-select-few-rows -c1 --keep-cache-across-test-cases --test-case-duration=0

This test issues two reads of subsequent keys from the middle of a large partition (1M rows in total). The first read will miss in the index file page cache, the second read will hit.

Notice that before the change, the second read issued 2 aio requests worth of 64KiB in total.
After the change, the second read issued 1 aio worth of 2 KiB. That's because promoted index block is larger than 1 KiB.
I verified using logging that the data file range matches a single promoted index block.

Also, the first read which misses in cache is still faster after the change.

Before:

```
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride  rows      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
500000  1         0.009802            1         1        102          0        102        102       21.0     21        196       2       1        0        1        1        0        0        0       568     269 4716050  53.4%
500001  1         0.000321            1         1       3113          0       3113       3113        2.0      2         64       1       0        1        0        0        0        0        0       116      26  555110  45.0%
```

After:

```
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride  rows      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
500000  1         0.009609            1         1        104          0        104        104       20.0     20        137       2       1        0        1        1        0        0        0       561     268 4633407  43.1%
500001  1         0.000217            1         1       4602          0       4602       4602        1.0      1          2       1       0        1        0        0        0        0        0       110      26  313882  64.1%
```

Backports: none, not a regression

Closes scylladb/scylladb#20522

* github.com:scylladb/scylladb:
  perf: perf_fast_forward: Add test case for querying missing rows
  perf-fast-forward: Allow overriding promoted index block size
  perf-fast-forward: Test subsequent key reads from the middle in test_large_partition_select_few_rows
  perf-fast-forward: Allow adding key offset in test_large_partition_select_few_rows
  perf-fast-forward: Use single-partition reads in test_large_partition_select_few_rows
  sstables: bsearch_clustered_cursor: Add more tracing points
  sstables: reader: Log data file range
  sstables: bsearch_clustered_cursor: Unify skip_info logging
  sstables: bsearch_clustered_cursor: Narrow down range using "end" position of the block
  sstables: bsearch_clustered_cursor: Skip even to the first block
  test: sstables: sstable_3_x_test: Improve failure message
  sstables: mx: writer: Never include partition_end marker in promoted index block width
  sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions
  sstables: clustered_cursor: Track current block
2024-10-28 21:13:23 +02:00


File format of the Scylla.db sstable component

The Scylla.db component (present in a file named like mc-223-big-Scylla.db) contains assorted Scylla-only metadata. Its presence indicates the sstable was created by Scylla (or some Scylla-aware creator). Non-Scylla consumers will ignore it.

The file is small and intended to be processed in-memory.

Main structure

The main structure is that of an unordered set of subcomponents. Each subcomponent is prefixed with a be32 tag that indicates its type, followed by its serialized size (so unknown subcomponents can be skipped).

scylla_db = subcomponent_count (tag serialized_size subcomponent)*
subcomponent_count = be32
serialized_size = be32
tag = be32
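
As an illustration of the grammar above, here is a minimal Python sketch that splits an in-memory Scylla.db image into its subcomponents. The function name `read_subcomponents` and the hand-built sample image are assumptions made for this example; they are not part of Scylla. The sketch relies only on the layout stated in the grammar (a be32 count, then a be32 tag, a be32 size, and the payload for each subcomponent).

```python
import struct

def read_subcomponents(buf: bytes) -> dict:
    """Return {tag: raw payload bytes} for every subcomponent in buf."""
    (count,) = struct.unpack_from(">I", buf, 0)   # subcomponent_count = be32
    pos = 4
    out = {}
    for _ in range(count):
        tag, size = struct.unpack_from(">II", buf, pos)  # tag, serialized_size
        pos += 8
        if pos + size > len(buf):
            raise ValueError("truncated subcomponent")
        # The payload is kept opaque here, so unknown tags are skipped safely.
        out[tag] = buf[pos:pos + size]
        pos += size
    return out

# Hand-built sample: a features payload (tag 2) and a made-up unknown tag 99.
features_payload = struct.pack(">Q", 0b101)
image = (
    struct.pack(">I", 2)
    + struct.pack(">II", 2, len(features_payload)) + features_payload
    + struct.pack(">II", 99, 3) + b"abc"
)
parsed = read_subcomponents(image)
```

Because every payload is length-prefixed, a reader that does not recognize tag 99 can still step over it and continue with the next subcomponent.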

Subcomponents and tag values

The following subcomponents are recognized. They are described in more detail in individual sections.

subcomponent = sharding_metadata
    | features
    | extension_attributes
    | run_identifier
    | large_data_stats
    | sstable_origin
    | scylla_build_id
    | scylla_version
    | ext_timestamp_stats
    | sstable_identifier

sharding_metadata (tag 1): describes what token sub-ranges are included in this sstable. This is used, when loading the sstable, to determine which shard(s) it occupies.

features (tag 2): a set of boolean flags that describe the sstable

extension_attributes (tag 3): a map<string, string> with additional attributes

run_identifier (tag 4): a uuid that is the same for all sstables in the same run (and different for sstables in different runs).

large_data_stats (tag 5): a map<large_data_type, large_data_stats_entry> with statistics about large data entities in the sstable.

sstable_origin (tag 6): a string describing the origin of the sstable ("memtable" for memtable flush, "garbage collection" for compaction, etc.).

scylla_build_id (tag 7): a string containing the build id of the Scylla executable that created the sstable.

scylla_version (tag 8): a string containing the version of the Scylla executable that created the sstable.

ext_timestamp_stats (tag 9): a map<ext_timestamp_stats_type, int64_t> with statistics about timestamps in the sstable, such as min_live_timestamp and min_live_row_marker_timestamp.

sstable_identifier (tag 10): a uuid identifying the sstable for its whole lifetime. It is derived from the sstable uuid generation upon creation (or uniquely generated if the sstable has a numerical generation). Yet, unlike the sstable generation, which may change if the sstable is migrated to a different shard or node, the sstable identifier is stable and copied with the rest of the scylla metadata.

The scylla sstable dump-scylla-metadata tool can be used to dump the scylla metadata in JSON format.

sharding_metadata subcomponent

sharding_metadata = token_range_count token_range*
token_range_count = be32
token_range = left_token_bound right_token_bound
left_token_bound = token_bound
right_token_bound = token_bound
token_bound = exclusive_flag token
exclusive_flag = byte          // 0=inclusive, 1=exclusive
token = token_size byte*
token_size = be16

Sharding metadata is a sorted list of disjoint token ranges. Each token range consists of a left bound and a right bound; either bound may be inclusive or exclusive. The tokens are interpreted according to the partitioner.

The sstable contains no partitions whose token is outside the ranges described by sharding_metadata.
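
The sharding_metadata grammar can be read with a few lines of Python. This is a sketch under the layout given above only; `parse_sharding_metadata`, `encode_bound`, and the 8-byte sample tokens are hypothetical names and values chosen for the example, and the token bytes are left uninterpreted because their meaning depends on the partitioner.

```python
import struct

def parse_sharding_metadata(payload: bytes) -> list:
    """Return a list of ((left_exclusive, left_token), (right_exclusive, right_token))."""
    (count,) = struct.unpack_from(">I", payload, 0)  # token_range_count = be32
    pos = 4

    def read_bound(pos):
        exclusive = payload[pos] != 0                     # 0=inclusive, 1=exclusive
        (size,) = struct.unpack_from(">H", payload, pos + 1)  # token_size = be16
        start = pos + 3
        return (exclusive, payload[start:start + size]), start + size

    ranges = []
    for _ in range(count):
        left, pos = read_bound(pos)
        right, pos = read_bound(pos)
        ranges.append((left, right))
    return ranges

def encode_bound(exclusive: bool, token: bytes) -> bytes:
    """Helper for building sample input; mirrors the token_bound rule."""
    return bytes([int(exclusive)]) + struct.pack(">H", len(token)) + token

low = b"\x00" * 8
high = b"\x7f" + b"\xff" * 7
sample = struct.pack(">I", 1) + encode_bound(False, low) + encode_bound(True, high)
ranges = parse_sharding_metadata(sample)
```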

features subcomponent

features = be64      // interpreted as a set of bits

bit 0: NonCompoundPIEntries (if set, indicates the sstable was generated by Scylla with issue #2993 fixed)

bit 1: NonCompoundRangeTombstones (if set, indicates the sstable was generated by Scylla with issue #2986 fixed)

bit 2: ShadowableTombstones (if set, indicates the sstable was generated by Scylla with issue #3885 fixed)

bit 3: CorrectStaticCompact (if set, indicates the sstable was generated by Scylla with issue #4139 fixed)

bit 4: CorrectEmptyCounters (if set, indicates the sstable was generated by Scylla with issue #4363 fixed)

bit 5: CorrectUDTsInCollections (if set, indicates that the sstable was generated by Scylla with issue #6130 fixed)

bit 6: CorrectLastPiBlockWidth (if set, indicates that the width of the last promoted index block never includes the partition end marker)
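
A reader can turn the be64 value into a set of feature names with a simple bit test. The sketch below uses the bit assignments listed above; the function name `decode_features` and the `unknown_bit_N` labels for unlisted bits are assumptions for this example.

```python
import struct

# Bit positions as listed in this section.
FEATURE_BITS = {
    0: "NonCompoundPIEntries",
    1: "NonCompoundRangeTombstones",
    2: "ShadowableTombstones",
    3: "CorrectStaticCompact",
    4: "CorrectEmptyCounters",
    5: "CorrectUDTsInCollections",
    6: "CorrectLastPiBlockWidth",
}

def decode_features(payload: bytes) -> set:
    """Decode the features subcomponent (a single be64) into feature names."""
    (mask,) = struct.unpack(">Q", payload)
    names = set()
    for bit in range(64):
        if mask & (1 << bit):
            # Bits not listed in this document get a placeholder label.
            names.add(FEATURE_BITS.get(bit, "unknown_bit_%d" % bit))
    return names

enabled = decode_features(struct.pack(">Q", (1 << 0) | (1 << 6)))
```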

extension_attributes subcomponent

extension_attributes = extension_attribute_count extension_attribute*
extension_attribute_count = be32
extension_attribute = extension_attribute_key extension_attribute_value
extension_attribute_key = string32
extension_attribute_value = string32
string32 = string32_size byte*
string32_size = be32

There are currently no defined attributes.
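
For illustration, the map layout above can be decoded as follows. This is a sketch that assumes only the grammar in this section; `parse_extension_attributes` and the key `example_key` are hypothetical, since no attributes are currently defined. Keys and values are kept as raw bytes because the grammar does not state an encoding.

```python
import struct

def parse_extension_attributes(payload: bytes) -> dict:
    """Decode extension_attributes into a {key bytes: value bytes} dict."""
    (count,) = struct.unpack_from(">I", payload, 0)  # extension_attribute_count
    pos = 4

    def read_string32(pos):
        (size,) = struct.unpack_from(">I", payload, pos)  # string32_size = be32
        start = pos + 4
        return payload[start:start + size], start + size

    attrs = {}
    for _ in range(count):
        key, pos = read_string32(pos)
        value, pos = read_string32(pos)
        attrs[key] = value
    return attrs

def encode_string32(data: bytes) -> bytes:
    """Helper for building sample input; mirrors the string32 rule."""
    return struct.pack(">I", len(data)) + data

sample = struct.pack(">I", 1) + encode_string32(b"example_key") + encode_string32(b"v1")
attrs = parse_extension_attributes(sample)
```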

run_identifier subcomponent

run_identifier = uuid
uuid = uuid_high_bits uuid_low_bits
uuid_high_bits = be64
uuid_low_bits = be64

If the run_identifier subcomponent is present, the sstable is part of a run. All sstables with the same run_identifier belong to the same run. They are guaranteed to be disjoint (non-overlapping) in their partition keys.
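
The two be64 halves can be combined into a standard UUID value. The sketch below assumes that uuid_high_bits holds the most significant 64 bits of the UUID; `parse_run_identifier` and the sample UUID are made up for this example.

```python
import struct
import uuid

def parse_run_identifier(payload: bytes) -> uuid.UUID:
    """Combine uuid_high_bits and uuid_low_bits (both be64) into a UUID."""
    high, low = struct.unpack(">QQ", payload)
    # Assumption: high bits are the most significant half of the 128-bit value.
    return uuid.UUID(int=(high << 64) | low)

original = uuid.UUID("12345678-1234-5678-1234-567812345678")
payload = struct.pack(">QQ", original.int >> 64, original.int & (2**64 - 1))
run_id = parse_run_identifier(payload)
```

Two sstables whose parsed identifiers compare equal would then be treated as members of the same run.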

large_data_stats subcomponent

large_data_stats = large_data_count large_data_pair*
large_data_count = be32
large_data_pair = large_data_type large_data_stats_entry
large_data_type = partition_size | row_size | cell_size | rows_in_partition | elements_in_collection
    partition_size = be32(1)    // partition size, in bytes
    row_size = be32(2)          // row size, in bytes
    cell_size = be32(3)         // cell size, in bytes
    rows_in_partition = be32(4) // number of rows in a partition
    elements_in_collection = be32(5) // number of elements in a collection
large_data_stats_entry = max_value threshold above_threshold
    max_value = be64
    threshold = be64
    above_threshold = be32

The large_data_stats subcomponent holds statistics about partition, row, and cell sizes, about the number of rows in a partition, and about the number of elements in a collection. For each entry type, it keeps the largest value seen, the respective large_data threshold, and the number of entities that are above the threshold.
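
A minimal Python sketch of a reader for this layout is shown below. It follows the grammar in this section only; `parse_large_data_stats`, the dictionary field names, and the sample numbers are assumptions for the example, not values from a real sstable.

```python
import struct

# Type codes as listed in the grammar for large_data_type.
LARGE_DATA_TYPES = {
    1: "partition_size",
    2: "row_size",
    3: "cell_size",
    4: "rows_in_partition",
    5: "elements_in_collection",
}

# large_data_type (be32), max_value (be64), threshold (be64), above_threshold (be32)
ENTRY = struct.Struct(">IQQI")

def parse_large_data_stats(payload: bytes) -> dict:
    """Decode large_data_stats into {type name: stats dict}."""
    (count,) = struct.unpack_from(">I", payload, 0)  # large_data_count = be32
    pos = 4
    stats = {}
    for _ in range(count):
        type_id, max_value, threshold, above = ENTRY.unpack_from(payload, pos)
        pos += ENTRY.size
        name = LARGE_DATA_TYPES.get(type_id, "unknown_type_%d" % type_id)
        stats[name] = {
            "max_value": max_value,
            "threshold": threshold,
            "above_threshold": above,
        }
    return stats

sample = struct.pack(">I", 1) + ENTRY.pack(1, 2_000_000, 1_000_000, 3)
stats = parse_large_data_stats(sample)
```

In this made-up sample, the largest partition is 2,000,000 bytes against a 1,000,000-byte threshold, and three partitions exceed the threshold.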