Files
scylladb/utils
Avi Kivity 9322c07c71 Merge "Use binary search in sstable promoted index" from Tomasz
"
The "promoted index" is how the sstable format calls the clustering key index within a given partition.
Large partitions with many rows have it. It's embedded in the partition index entry.

Currently, lookups in the promoted index are done by scanning the index linearly so the lookup
is O(N). For large partitions that's inefficient. It consumes both a lot of CPU and I/O.

We could do better and use binary search in the index. This patch series switches the mc-format
index reader to do that. Other formats use the old way.

The "mc" format promoted index has an extra structure at the end of the index called "offset map".
It's a vector of offsets of consecutive promoted index entries. This allows us to access random
entries in the index without reading the whole index.

The location of the offset entry for a given promoted index entry can be derived by knowing where
the offset vector ends in the index file, so the offset map also doesn't have to be read completely
into the memory.

The most tricky part is caching. We need to cache blocks read from the index file to amortize the
cost of binary search:

  - if the promoted index fits in the 32 KiB which was read from the index when looking for
    the partition entry, we don't want to issue any additional I/O to search the promoted index.

  - with large promoted indexes, the last few bisections will fall into the same I/O block and we
    want to reuse that block.

  - we don't want the cache to grow too big, we don't want to cache the whole promoted index
    as the read progresses over the index. Scanning reads may skip multiple times.

This series implements a rather simple approach which meets all the
above requirements and is not worse than the current state of affairs:

   - Each index cursor has its own cache of the index file area which corresponds to promoted index
     This is managed by the cached_file class.

   - Each index cursor has its own cache of parsed blocks. This allows the upper bound estimation to
     reuse information obtained during lower bound lookup. This estimation is used to limit
     read-aheads in the data file.

   - Each cursor drops entries that it walked past so that memory footprint stays O(log N)

   - Cached buffers are accounted to read's reader_permit.

Later, we could have a single cache shared by many readers. For that, we need to come up with eviction
policy.

Fixes #4007.

TESTING RESULTS

 * Point reads, large promoted index:

  Config: rows: 10000000, value size: 2000
  Partition size: 20 GB
  Index size: 7 MB

  Notes:

    - Slicing read into the middle of partition (offset=5000000, read=1) is a clear win for the binary search:

      time: 1.9ms vs 22.9ms
      CPU utilization: 8.9% vs 92.3%
      I/O: 21 reqs / 172 KiB vs 29 reqs / 3'520 KiB

      It's 12x faster, CPU utilization is 10x times smaller, disk utilization is 20x smaller.

    - Slicing at the front (offset=0) is a mixed bag.

      time is similar: 1.8ms
      CPU utilization is 6.7x smaller for bsearch: 8.5% vs 57.7%
      disk bandwidth utilization is smaller for bsearch but uses more IOs: 4 reqs / 320 KiB (scan) vs 17 reqs / 188 KiB (bsearch)

      bsearch uses less bandwidth because the series reduces buffer size used for index file I/O.

      scan is issuing:

         2 * 128 KB (index page)
         2 * 32 KB (data file)

      bsearch is issuing:

         1 * 64 KB (index page)
         15 * 4 KB (promoted index)
         1 * 64 KB (data file)

      The 1 * 64 KB is chosen dynamically by seastar. Sometimes it chooses 2 * 32 KB (with read-ahead).
      32 KB is the minimum I/O currently.

      Disk utilization could be further improved by changing the way seastar's dynamic I/O adjustments work
      so that it uses 1 * 4 KB when it suffices. This is left for the follow-up.

  Command:

        perf_fast_forward --datasets=large-part-ds1 \
         --run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1

  Before:

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.001836          172         1        545          9        563        175        4.0      4        320       2       2        0        1        1        0        0        0  57.7%      0
    0       32        0.001858          502        32      17220        126      17776      11526        3.2      3        324       2       1        0        1        1        0        0        0  56.4%      0
    0       256       0.002833          339       256      90374        427      91757      85931        7.0      7        776       3       1        0        1        1        0        0        0  41.1%      0
    0       4096      0.017211           58      4096     237984       2011     241802     233870       66.1     66       8376      59       2        0        1        1        0        0        0  21.4%      0
    5000000 1         0.022952           42         1         44          1         45         41       29.2     29       3520      22       2        0        1        1        0        0        0  92.3%      0
    5000000 32        0.023052           43        32       1388         14       1414       1331       31.1     32       3588      26       2        0        1        1        0        0        0  91.7%      0
    5000000 256       0.024795           41       256      10325        129      10721       9993       43.1     39       4544      29       2        0        1        1        0        0        0  86.4%      0
    5000000 4096      0.038856           27      4096     105414        398     106918     103162       95.2     95      12160      78       5        0        1        1        0        0        0  61.4%      0

 After (v2):

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.001831          248         1        546         21        581        252       17.6     17        188       2       0        0        1        1        0        0        0   8.5%      0
    0       32        0.001910          535        32      16751        626      17770      13896       17.9     19        160       3       0        0        1        1        0        0        0   8.8%      0
    0       256       0.003545          266       256      72207       2333      89076      62852       26.9     24        764       7       0        0        1        1        0        0        0   9.7%      0
    0       4096      0.016800           56      4096     243812        524     245430     239736       83.6     83       8700      64       0        0        1        1        0        0        0  16.6%      0
    5000000 1         0.001968          351         1        508         19        538        380       21.3     21        172       2       0        0        1        1        0        0        0   8.9%      0
    5000000 32        0.002273          431        32      14077        436      15503      11551       22.7     22        268       3       0        0        1        1        0        0        0   8.9%      0
    5000000 256       0.003889          257       256      65824       2197      81833      57813       34.0     37        652      18       0        0        1        1        0        0        0  11.2%      0
    5000000 4096      0.017115           54      4096     239324        834     241310     231993       88.3     88       8844      65       0        0        1        1        0        0        0  16.8%      0

 After (v1):

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.001886          259         1        530          4        545        261       18.0     18        376       2       2        0        1        1        0        0        0   9.1%      0
    0       32        0.001954          513        32      16381         93      16844      15618       19.0     19        408       3       2        0        1        1        0        0        0   9.3%      0
    0       256       0.003266          318       256      78393       1820      81567      61663       30.8     26       1272       7       2        0        1        1        0        0        0  10.4%      0
    0       4096      0.017991           57      4096     227666        855     231915     225781       83.1     83       8888      55       5        0        1        1        0        0        0  15.5%      0
    5000000 1         0.002353          232         1        425          2        432        232       23.0     23        396       2       2        0        1        1        0        0        0   8.7%      0
    5000000 32        0.002573          384        32      12437         47      12571        429       25.0     25        460       4       2        0        1        1        0        0        0   8.5%      0
    5000000 256       0.003994          259       256      64101       2904      67924      51427       37.0     35       1484      11       2        0        1        1        0        0        0  10.6%      0
    5000000 4096      0.018567           56      4096     220609        448     227395     219029       89.8     89       9036      59       5        0        1        1        0        0        0  15.1%      0

 * Point reads, small promoted index (two blocks):

  Config: rows: 400, value size: 200
  Partition size: 84 KiB
  Index size: 65 B

  Notes:
     - No significant difference in time
     - the same disk utilization
     - similar CPU utilization

  Command:

      perf_fast_forward --datasets=large-part-ds1 \
         --run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1

  Before:

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.000279          470         1       3587         31       3829        478        3.0      3         68       2       1        0        1        1        0        0        0  21.1%      0
    0       32        0.000276         3498        32     116038        811     122756     104033        3.0      3         68       2       1        0        1        1        0        0        0  24.0%      0
    0       256       0.000412         2554       256     621044       1778     732150     559221        2.0      2         72       2       0        0        1        1        0        0        0  32.6%      0
    0       4096      0.000510         1901       400     783883       4078     819058     665616        2.0      2         88       2       0        0        1        1        0        0        0  36.4%      0
    200     1         0.000339         2712         1       2951          8       3001       2569        2.0      2         72       2       0        0        1        1        0        0        0  17.8%      0
    200     32        0.000352         2586        32      91019        266      92427      83411        2.0      2         72       2       0        0        1        1        0        0        0  20.8%      0
    200     256       0.000458         2073       200     436503       1618     453945     385501        2.0      2         88       2       0        0        1        1        0        0        0  29.4%      0
    200     4096      0.000458         2097       200     436475       1676     458349     381558        2.0      2         88       2       0        0        1        1        0        0        0  29.0%      0

  After (v1):

    Testing slicing of large partition using clustering keys:
    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.000278          492         1       3598         30       3831        500        3.0      3         68       2       1        0        1        1        0        0        0  19.4%      0
    0       32        0.000275         3433        32     116153        753     122915      92559        3.0      3         68       2       1        0        1        1        0        0        0  22.5%      0
    0       256       0.000458         2576       256     559437       2978     728075     504375        2.1      2         88       2       0        0        1        1        0        0        0  29.0%      0
    0       4096      0.000506         1888       400     790064       3306     822360     623109        2.0      2         88       2       0        0        1        1        0        0        0  36.6%      0
    200     1         0.000382         2493         1       2619         10       2675       2268        2.0      2         88       2       0        0        1        1        0        0        0  16.3%      0
    200     32        0.000398         2393        32      80422        333      84759      22281        2.0      2         88       2       0        0        1        1        0        0        0  19.0%      0
    200     256       0.000459         2096       200     435943       1608     453989     380749        2.0      2         88       2       0        0        1        1        0        0        0  30.5%      0
    200     4096      0.000458         2097       200     436410       1651     455779     382485        2.0      2         88       2       0        0        1        1        0        0        0  29.2%      0

 * Scan with skips, large index:

  Config: rows: 10000000, value size: 2000
  Partition size: 20 GB
  Index size: 7 MB

  Notes:

    - Similar time, slightly worse for binary search: 36.1 s (scan) vs 36.4 (bsearch)

    - Slightly more I/O for bsearch: 153'932 reqs / 19'703'260 KiB (scan) vs 155'651 reqs / 19'704'088 KiB (bsearch)

      Binary search reads more by 828 KB and by 1719 IOs.
      It does more I/O to read the the promoted index offset map.

    - similar (low) memory footprint. The danger here is that by caching index blocks which we touch as we scan
      we would end up caching the whole index. But this is protected against by eviction as demonstrated by the
      last "mem" column.

  Command:

    perf_fast_forward --datasets=large-part-ds1 \
       --run-tests=large-partition-skips -c1 --test-case-duration=1

  Before:

      read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
      1       1        36.103451            4   5000000     138491         38     138601     138453   153932.0 153932   19703260  153561       1        0        1        1        0        0        0  31.5% 502690

  After (v2):

    read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    1       1        37.000145            4   5000000     135135          6     135146     135128   155651.0 155651   19704088  138968       0        0        1        1        0        0        0  34.2%      0

  After (v1):

    read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    1       1        36.965520            4   5000000     135261         30     135311     135231   155628.0 155628   19704216  139133       1        0        1        1        0        0        0  33.9% 248738

Also in:

  git@github.com:tgrabiec/scylla.git sstable-use-index-offset-map-v2

Tests:

  - unit (all modes)
  - manual using perf_fast_forward
"

* tag 'sstable-use-index-offset-map-v2' of github.com:tgrabiec/scylla:
  sstables: Add promoted index cache metrics
  position_in_partition: Introduce external_memory_usage()
  cached_file, sstables: Add tracing to index binary search and page cache
  sstables: Dynamically adjust I/O size for index reads
  sstables, tests: Allow disabling binary search in promoted index from perf tests
  sstables: mc: Use binary search over the promoted index
  utils: Introduce cached_file
  sstables: clustered_index: Relax scope of validity of entry_info
  sstables: index_entry: Introduce owning promoted_index_block_position
  compound_compat: Allow constructing composite from a view
  sstables: index_entry: Rename promoted_index_block_position to promoted_index_block_position_view
  sstables: mc: Extract parser for promoted index block
  sstables: mc: Extract parser for clustering out of the promoted index block parser
  sstables: consumer: Extract primitive_consumer
  sstables: Abstract the clustering index cursor behavior
  sstables: index_reader: Rearrange to reduce branching and optionals
2020-06-18 12:09:39 +03:00
..
2020-01-30 14:10:18 +02:00
2020-03-23 11:59:30 +02:00
2020-03-20 19:37:48 +01:00
2020-05-31 12:58:59 +03:00
2020-03-23 11:59:30 +02:00
2019-02-20 08:03:46 -08:00
2020-01-26 18:54:36 +02:00