Commit Graph

997 Commits

Author SHA1 Message Date
Avi Kivity
99d5355007 Merge "Cache sstable indexes in memory" from Tomasz
"
The main goal of this series is to improve the efficiency of reads from large partitions by
reducing the amount of I/O needed to read the sstable index. This is achieved by caching
index file pages and partition index entries in memory.

Currently, the pages are cached by individual reads only for the duration of the read.
This was done to facilitate binary search in the promoted index (intra-partition index).
After this series, all reads share the index file page cache, which stays around even after reads stop.

The page cache is subject to eviction. It uses the same region as the current row cache and shares
the LRU with row cache entries. This means that LRU objects need to be virtualized. This series takes
an easy approach and does this by introducing a virtual base class. This adds overhead to each
row cache entry, to store the vtable pointer.

SSTable indexes form a hierarchy. There is a summary, which is a sparse partition key index into the
full partition index. This one is already kept in memory. The partition index is divided by the summary
into pages. Each entry in the partition index contains a promoted index, which is a sparse index into atoms
identified by the clustering key (rows, tombstones).
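A rough structural sketch of that hierarchy (types and field names here are
illustrative stand-ins, not the actual sstables code):

    #include <cstdint>
    #include <string>
    #include <vector>

    using key_bytes = std::string;  // stand-in for serialized keys

    struct promoted_index_entry {   // sparse intra-partition index entry
        key_bytes clustering_start; // first clustering position covered
        uint64_t data_file_offset;  // where the covered atoms begin
    };

    struct partition_index_entry {  // one entry per partition
        key_bytes partition_key;
        uint64_t data_file_position;
        std::vector<promoted_index_entry> promoted_index;
    };

    struct summary_entry {          // in-memory, sparse: one per index page
        key_bytes first_partition_key;
        uint64_t index_file_offset; // start of the partition index page
    };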

In order to read the promoted index, the reader needs to read the partition index entry first.
To speed this up, this series also adds caching of partition index entries. This cache survives
reads and is subject to eviction, just like the index file page cache. The unit of caching is
the partition index page. Without this cache, each access to the promoted index would have to be
preceded by parsing the partition index page containing the partition key.

Performance testing results follow.

1) scylla-bench large partition reads

  Populated with:

        perf_fast_forward --run-tests=large-partition-skips --datasets=sb-large-part-ds1 \
            -c1 -m1G --populate --value-size=1024 --rows=10000000

  Single partition, 9G data file, 4MB index file

  Test execution:

    build/release/scylla -c1 -m4G
    scylla-bench -workload uniform -mode read -limit 1 -concurrency 100 -partition-count 1 \
       -clustering-row-count 10000000 -duration 60m

  TL;DR: after: 2x throughput, 0.5x median latency

    Before (c1daf2bb24):

    Results
    Time (avg):	 5m21.033180213s
    Total ops:	 966951
    Total rows:	 966951
    Operations/s:	 3011.997048812112
    Rows/s:		 3011.997048812112
    Latency:
      max:		 74.055679ms
      99.9th:	 63.569919ms
      99th:		 41.320447ms
      95th:		 38.076415ms
      90th:		 37.158911ms
      median:	 34.537471ms
      mean:		 33.195994ms

    After:

    Results
    Time (avg):	 5m14.706669345s
    Total ops:	 2042831
    Total rows:	 2042831
    Operations/s:	 6491.22243800942
    Rows/s:		 6491.22243800942
    Latency:
      max:		 60.096511ms
      99.9th:	 35.520511ms
      99th:		 27.000831ms
      95th:		 23.986175ms
      90th:		 21.659647ms
      median:	 15.040511ms
      mean:		 15.402076ms

2) scylla-bench small partitions

  I tested several scenarios with a varying data set size, e.g. data fully fitting in memory,
  half fitting, and being much larger. The improvement varied a bit but in all cases the "after"
  code performed slightly better.

  Below is a representative run over a data set which does not fit in memory.

  scylla -c1 -m4G
  scylla-bench -workload uniform -mode read  -concurrency 400 -partition-count 10000000 \
      -clustering-row-count 1 -duration 60m -no-lower-bound

  Before:

    Time (avg):	 51.072411913s
    Total ops:	 3165885
    Total rows:	 3165885
    Operations/s:	 61988.164024260645
    Rows/s:		 61988.164024260645
    Latency:
      max:		 34.045951ms
      99.9th:	 25.985023ms
      99th:		 23.298047ms
      95th:		 19.070975ms
      90th:		 17.530879ms
      median:	 3.899391ms
      mean:		 6.450616ms

  After:

    Time (avg):	 50.232410679s
    Total ops:	 3778863
    Total rows:	 3778863
    Operations/s:	 75227.58014424688
    Rows/s:		 75227.58014424688
    Latency:
      max:		 37.027839ms
      99.9th:	 24.805375ms
      99th:		 18.219007ms
      95th:		 14.090239ms
      90th:		 12.124159ms
      median:	 4.030463ms
      mean:		 5.315111ms

  The results include the warmup phase, which populates the partition index cache, so the hot-cache
  effect is dampened in the statistics; see the 99th percentile. Latency improves after the cache
  warms up, which pulls the percentiles lower.

3) perf_fast_forward --run-tests=large-partition-skips

    Caching is not used here; this test is included to show there are no regressions in the cold-cache case.

    TL;DR: No significant change

    perf_fast_forward --run-tests=large-partition-skips --datasets=large-part-ds1 -c1 -m1G

    Config: rows: 10000000, value size: 2000

    Before:

    read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
    1       0        36.429822            4  10000000     274500         62     274521     274429   153889.2 153883   19696986  153853       0        0        0        0        0        0        0  22.5%
    1       1        36.856236            4   5000000     135662          7     135670     135650   155652.0 155652   19704117  139326       1        0        1        1        0        0        0  38.1%
    1       8        36.347667            4   1111112      30569          0      30570      30569   155652.0 155652   19704117  139071       1        0        1        1        0        0        0  19.5%
    1       16       36.278866            4    588236      16214          1      16215      16213   155652.0 155652   19704117  139073       1        0        1        1        0        0        0  16.6%
    1       32       36.174784            4    303031       8377          0       8377       8376   155652.0 155652   19704117  139056       1        0        1        1        0        0        0  12.3%
    1       64       36.147104            4    153847       4256          0       4256       4256   155652.0 155652   19704117  139109       1        0        1        1        0        0        0  11.1%
    1       256       9.895288            4     38911       3932          1       3933       3930   100869.2 100868    3178298   59944   38912        0        1        1        0        0        0  14.3%
    1       1024      2.599921            4      9757       3753          0       3753       3753    26604.0  26604     801850   15071    9758        0        1        1        0        0        0  14.6%
    1       4096      0.784568            4      2441       3111          1       3111       3109     7982.0   7982     205946    3772    2442        0        1        1        0        0        0  13.8%

    64      1        36.553975            4   9846154     269359         10     269369     269337   155663.8 155652   19704117  139230       1        0        1        1        0        0        0  28.2%
    64      8        36.509694            4   8888896     243467          8     243475     243449   155652.0 155652   19704117  139120       1        0        1        1        0        0        0  26.5%
    64      16       36.466282            4   8000000     219381          4     219385     219374   155652.0 155652   19704117  139232       1        0        1        1        0        0        0  24.8%
    64      32       36.395926            4   6666688     183171          6     183180     183165   155652.0 155652   19704117  139158       1        0        1        1        0        0        0  21.8%
    64      64       36.296856            4   5000000     137753          4     137757     137737   155652.0 155652   19704117  139105       1        0        1        1        0        0        0  17.7%
    64      256      20.590392            4   2000000      97133         18      97151      94996   135248.8 131395    7877402   98335   31282        0        1        1        0        0        0  15.7%
    64      1024      6.225773            4    588288      94492       1436      95434      88748    46066.5  41321    2324378   30360    9193        0        1        1        0        0        0  15.8%
    64      4096      1.856069            4    153856      82893         54      82948      82721    16115.0  16043     583674   11574    2675        0        1        1        0        0        0  16.3%

    After:

    read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
    1       0        36.429240            4  10000000     274505         38     274515     274417   153887.8 153883   19696986  153849       0        0        0        0        0        0        0  22.4%
    1       1        36.933806            4   5000000     135377         15     135385     135354   155658.0 155658   19704085  139398       1        0        1        1        0        0        0  40.0%
    1       8        36.419187            4   1111112      30509          2      30510      30507   155658.0 155658   19704085  139233       1        0        1        1        0        0        0  22.0%
    1       16       36.353475            4    588236      16181          0      16182      16181   155658.0 155658   19704085  139183       1        0        1        1        0        0        0  19.2%
    1       32       36.251356            4    303031       8359          0       8359       8359   155658.0 155658   19704085  139120       1        0        1        1        0        0        0  14.8%
    1       64       36.203692            4    153847       4249          0       4250       4249   155658.0 155658   19704085  139071       1        0        1        1        0        0        0  13.0%
    1       256       9.965876            4     38911       3904          0       3906       3904   100875.2 100874    3178266   60108   38912        0        1        1        0        0        0  17.9%
    1       1024      2.637501            4      9757       3699          1       3700       3697    26610.0  26610     801818   15071    9758        0        1        1        0        0        0  19.5%
    1       4096      0.806745            4      2441       3026          1       3027       3024     7988.0   7988     205914    3773    2442        0        1        1        0        0        0  18.3%

    64      1        36.611243            4   9846154     268938          5     268942     268921   155669.8 155705   19704085  139330       2        0        1        1        0        0        0  29.9%
    64      8        36.559471            4   8888896     243135         11     243156     243124   155658.0 155658   19704085  139261       1        0        1        1        0        0        0  28.1%
    64      16       36.510319            4   8000000     219116         15     219126     219101   155658.0 155658   19704085  139173       1        0        1        1        0        0        0  26.3%
    64      32       36.439069            4   6666688     182954          9     182964     182943   155658.0 155658   19704085  139274       1        0        1        1        0        0        0  23.2%
    64      64       36.334808            4   5000000     137609         11     137612     137596   155658.0 155658   19704085  139258       2        0        1        1        0        0        0  19.1%
    64      256      20.624759            4   2000000      96971         88      97059      92717   138296.0 131401    7877370   98332   31282        0        1        1        0        0        0  17.2%
    64      1024      6.260598            4    588288      93967       1429      94905      88051    45939.5  41327    2324346   30361    9193        0        1        1        0        0        0  17.8%
    64      4096      1.881338            4    153856      81780        140      81920      81520    16109.8  16092     582714   11617    2678        0        1        1        0        0        0  18.2%

4) perf_fast_forward --run-tests=large-partition-slicing

    Caching enabled; each line shows the median run from many iterations.

    TL;DR: We can observe a reduction in I/O which translates into a reduction in execution time,
           especially for slicing in the middle of the partition.

    perf_fast_forward --run-tests=large-partition-slicing --datasets=large-part-ds1 -c1 -m1G --keep-cache-across-test-cases

    Config: rows: 10000000, value size: 2000

    Before:

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
    0       1         0.000491          127         1       2037         24       2109        127        4.0      4        128       2       2        0        1        1        0        0        0       157      80 3058208  15.0%
    0       32        0.000561         1740        32      56995        410      60031      47208        5.0      5        160       3       2        0        1        1        0        0        0       386     111  113353  17.5%
    0       256       0.002052          488       256     124736       7111     144762      89053       16.6     17        672      14       2        0        1        1        0        0        0      2113     446   52669  18.6%
    0       4096      0.016437           61      4096     249199        692     252389     244995       69.4     69       8640      57       5        0        1        1        0        0        0     26638    1717   23321  22.4%
    5000000 1         0.002171          221         1        461          2        466        221       25.0     25        268       3       3        0        1        1        0        0        0       638     376 14311524  10.2%
    5000000 32        0.002392          404        32      13376         48      13528      13015       27.0     27        332       5       3        0        1        1        0        0        0       931     432  489691  11.9%
    5000000 256       0.003659          279       256      69967        764      73130      52563       39.5     41        780      19       3        0        1        1        0        0        0      2689     825   93756  15.8%
    5000000 4096      0.018592           55      4096     220313        433     234214     218803       94.2     94       9484      62       9        0        1        1        0        0        0     27349    2213   26562  21.0%

    After:

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
    0       1         0.000229          115         1       4371         85       4585        115        2.1      2         64       1       1        1        0        0        0        0        0        90      31 1314749  22.2%
    0       32        0.000277         2174        32     115674       1015     128109      14144        3.0      3         96       2       1        1        0        0        0        0        0       319      62   52508  26.1%
    0       256       0.001786          576       256     143298       5534     179142     113715       14.7     17        544      15       1        1        0        0        0        0        0      2110     453   45419  21.4%
    0       4096      0.015498           61      4096     264289       2006     268850     259342       67.4     67       8576      59       4        1        0        0        0        0        0     26657    1738   22897  23.7%
    5000000 1         0.000415          233         1       2411         15       2456        234        4.1      4        128       2       2        1        0        0        0        0        0       199      72 2644719  16.8%
    5000000 32        0.000635         1413        32      50398        349      51149      46439        6.0      6        192       4       2        1        0        0        0        0        0       458     128  125893  18.6%
    5000000 256       0.002028          486       256     126228       3024     146327      82559       17.8     18       1024      13       4        1        0        0        0        0        0      2123     385   51787  19.6%
    5000000 4096      0.016836           61      4096     243294        814     263434     241660       73.0     73       9344      62       8        1        0        0        0        0        0     26922    1920   24389  22.4%

Future work:

 - Check the impact on non-uniform workloads. Caching sstable indexes takes space away from the row cache,
   which may reduce its hit ratio.

 - Reduce the memory footprint of the partition index cache. Currently there is about 8x bloat over the on-disk size.

 - Disable cache population for "bypass cache" reads

 - Add a switch to disable sstable index caching, per-node, maybe per-table

 - Better sstable index format. Current format leads to inefficiency in caching since only some elements of the cached
   page can be hot. A B-tree index would be more efficient. Same applies to the partition index. Only some elements in
   the partition index page can be hot.

 - Add a heuristic for reducing index file I/O size when large partitions are anticipated. If we're bound by the
   disk's bandwidth, it's wasteful to read the front of the promoted index using a 32K I/O; better to use 4K, which
   should cover the partition entry, and then let binary search read the rest.

In V2:

 - Fixed a perf_fast_forward regression in the number of I/Os used to read a partition index page.
   The reader uses 32K reads, which were split by the page cache into 4K reads.
   Fixed by propagating I/O size hints to the page cache and using a single I/O to populate it.
   New patch: "cached_file: Issue single I/O for the whole read range on miss"

 - Avoid large allocations to store partition index page entries (due to managed_vector storage).
   There is a unit test which detects this and fails.
   Fixed by implementing chunked_managed_vector, based on chunked_vector.

 - Fixed a bug in cached_file::evict_gently() where the wrong allocation strategy was used to free btree chunks

 - Simplify region_impl::free_buf() according to Avi's suggestions

 - Fit segment_kind in segment_descriptor::_free_space and lift requirement that _buf_pointers emptiness determines the kind

 - Worked around a SIGSEGV which was most likely due to coroutine miscompilation, by manipulating local object scope.

 - Wire up system/drop_sstable_caches RESTful API

 - Fix use-after-move on permit for the old scanning ka/la index reader

 - Fixed more cases of double open_data() in tests leading to assert failure

 - Adjusted cached_file class doc to account for changes in behavior.

 - Rebased

Fixes #7079.
Refs #363.
"

* tag 'sstable-index-caching-v2' of github.com:tgrabiec/scylla: (39 commits)
  api: Drop sstable index caches on system/drop_sstable_caches
  cached_file: Issue single I/O for the whole read range on miss
  row_cache: cache_tracker: Do not register metrics when constructed for tests
  sstables, cached_file: Evict cache gently when sstable is destroyed
  sstables: Hide partition_index_cache implementation away from sstables.hh
  sstables: Drop shared_index_lists alias
  sstables: Destroy partition index cache gently
  sstables: Cache partition index pages in LSA and link to LRU
  utils: Introduce lsa::weak_ptr<>
  sstables: Rename index_list to partition_index_page and shared_index_lists to partition_index_cache
  sstables, cached_file: Avoid copying buffers from cache when parsing promoted index
  cached_file: Introduce get_page_units()
  sstables: read: Document that primitive_consumer::read_32() is alloc-free
  sstables: read: Count partition index page evictions
  sstables: Drop the _use_binary_search flag from index entries
  sstables: index_reader: Keep index objects under LSA
  lsa: chunked_managed_vector: Adapt more to managed_vector
  utils: lsa: chunked_managed_vector: Make LSA-aware
  test: chunked_managed_vector_test: Make exception_safe_class standard layout
  lsa: Copy chunked_vector to chunked_managed_vector
  ...
2021-07-07 18:17:10 +03:00
Avi Kivity
4c01a88c9d logalloc: do not capture backtraces by default in debug mode
logalloc has a handy leak/double-free sanitizer, with the nice
feature of capturing backtraces to make error reports easy to
track down. But capturing backtraces is itself very expensive.

This patch makes backtrace capture optional, reducing database_test
runtime from 30 minutes to 20 minutes on my machine.

Closes #8978
2021-07-06 00:18:22 +02:00
Tomasz Grabiec
f553db69f7 cached_file: Issue single I/O for the whole read range on miss
Currently, reading a page range would issue an I/O for each missing
page. This is inefficient; it is better to issue a single I/O for the
whole range and populate the cache from that.

As an optimization, issue a single I/O if the first page is missing.

This is important for index reads, which optimistically try to read
32KB of the index file to load the partition index page.
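A minimal sketch of that miss path, using a plain map as the page store
(the real cached_file API differs; the helper shapes here are assumptions):

    #include <seastar/core/file.hh>
    #include <seastar/core/future.hh>
    #include <seastar/core/temporary_buffer.hh>
    #include <algorithm>
    #include <map>

    constexpr uint64_t page_size = 4096;
    using page_map = std::map<uint64_t, seastar::temporary_buffer<char>>;

    // If the first page of the range is absent, read the whole range with
    // one I/O and populate every page from that single buffer.
    seastar::future<> populate_range(seastar::file f, page_map& cache,
                                     uint64_t offset, uint64_t len) {
        if (cache.contains(offset / page_size)) {
            return seastar::make_ready_future<>(); // first page cached
        }
        return f.dma_read_exactly<char>(offset, len).then(
            [&cache, offset] (seastar::temporary_buffer<char> buf) {
                for (uint64_t pos = 0; pos < buf.size(); pos += page_size) {
                    auto n = std::min<uint64_t>(page_size, buf.size() - pos);
                    cache[(offset + pos) / page_size] = buf.share(pos, n);
                }
            });
    }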
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
1f74863bf8 sstables, cached_file: Evict cache gently when sstable is destroyed
We must evict before the _cached_index_file associated with the
sstable goes away. Better to do it gently to avoid stalls.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
9f957f1cf9 sstables: Cache partition index pages in LSA and link to LRU
As part of this change, the container for partition index pages was
changed from utils::loading_shared_values to intrusive_btree. This is
to avoid reactor stalls which the former induces with a large number
of elements (pages) due to its use of a hashtable under the hood,
which reallocates contiguous storage.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
b3728f7d9b utils: Introduce lsa::weak_ptr<>
Simplifies managing non-owning references to LSA-managed objects. The
lsa::weak_ptr is a smart pointer which is not invalidated by LSA and
can be used safely in any allocator context. Dereferencing it always
gives a valid reference.

This can be used as a building block for implementing cursors into
LSA-based caches.

A simple usage example:

     // LSA-managed
     struct X : public lsa::weakly_referencable<X> {
         int value;
     };

     lsa::weak_ptr<X> x_ptr = with_allocator(region(), [] {
           X* x = current_allocator().construct<X>();
           return x->weak_from_this();
     });

     std::cout << x_ptr->value;
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
934824394a sstables, cached_file: Avoid copying buffers from cache when parsing promoted index 2021-07-02 19:02:14 +02:00
Tomasz Grabiec
7b6f18b4ed cached_file: Introduce get_page_units()
Will be needed later for reading a page view which cannot use
make_tracked_temporary_buffer(). Standardize on get_page_units(),
converting existing code to wrap the units in a deleter.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
20ef54e9ed lsa: chunked_managed_vector: Adapt more to managed_vector
For a seamless transition.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
78e5b9fd85 utils: lsa: chunked_managed_vector: Make LSA-aware
The max chunk size is set to be 10% of segment size.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
c87ea09535 lsa: Copy chunked_vector to chunked_managed_vector
In preparation for adapting it to LSA. Split into two steps to make
review easier.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
1523a7d367 utils: managed_vector: Make clear_and_release() public
Will be needed by the index reader to ensure that the destructor doesn't
invoke the allocator, so that everything is destroyed in the desired
allocation context before the object itself is destroyed.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
484e06d69b cached_file: Always start at offset 0
All current uses start at offset 0, so simplify the code by assuming it.
2021-07-02 19:02:13 +02:00
Tomasz Grabiec
078a6e422b sstables: Cache all index file reads
After this patch, there is a single index file page cache per
sstable, shared by index readers. The cache survives reads,
which reduces the amount of I/O on subsequent reads.

As part of this, cached_file needed to be adjusted in the following ways.

The page cache may occupy a significant portion of memory. Keeping the
pages in the standard allocator could cause memory fragmentation
problems. To avoid them, cached_file is changed to keep buffers in LSA
using the lsa_buffer allocation method.

When a page is needed by the seastar I/O layer, it needs to be copied
to a temporary_buffer, which is stable and so must be allocated in
standard allocator space. We copy the page on demand. Concurrent
requests for the same page will share the temporary_buffer. When a
page is not in use, it lives only in the LSA space.

In subsequent patches, cached_file::stream will be adjusted to also
support access via cached_page::ptr_type directly, to avoid
materializing a temporary_buffer.

While a page is in use, it is not linked in the LRU, so it is not
freed. This ensures that actively consumed storage remains stable,
either via the temporary_buffer (kept alive by its deleter) or via
cached_page::ptr_type directly.
2021-07-02 19:02:13 +02:00
Tomasz Grabiec
b5ca0eb2a2 lsa: Introduce lsa_buffer
lsa_buffer is similar in spirit to std::unique_ptr<char[]>. It owns
buffers allocated inside LSA segments. It uses an alternative
allocation method which differs from regular LSA allocations in the
following ways:

  1) LSA segments only hold buffers, they don't hold metadata. They
     also don't mix with standard allocations. So a 128K segment can
     hold 32 4K buffers.

  2) objects' lifetime is managed by lsa_buffer, an owning smart
     pointer, which is automatically updated when buffers are migrated
     to another segment. This makes LSA allocations easier to use and
     off-loads metadata management to the client (which can keep the
     lsa_buffer wherever it wants).

The metadata is kept inside segment_descriptor, in a vector. Each
allocated buffer will have an entangled object there (8 bytes), which
is paired with an entangled object inside lsa_buffer.

The reason to have an alternative allocation method is to efficiently
pack buffers inside LSA segments.
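A hedged usage sketch; the exact entry points (region.alloc_buf(),
buf.get()) are assumptions based on the description above:

    // The buffer's bytes live inside an LSA segment; the owning
    // lsa_buffer is updated automatically if LSA migrates them.
    logalloc::region region;
    lsa_buffer buf = region.alloc_buf(4096); // assumed allocation entry point
    std::memset(buf.get(), 0, 4096);         // pointer stays valid across
                                             // segment compactions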
2021-07-02 19:02:13 +02:00
Tomasz Grabiec
a23f27034f lsa: Introduce entangled helper
Will be useful in building higher-level LSA tools.
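A self-contained sketch of the idea (the real helper may differ): two
paired objects that keep pointing at each other even when either side
is moved in memory, e.g. by LSA compaction.

    class entangled {
        entangled* _peer = nullptr;
    public:
        entangled() = default;
        entangled(entangled&& o) noexcept : _peer(o._peer) {
            o._peer = nullptr;
            if (_peer) {
                _peer->_peer = this; // re-aim the peer at our new address
            }
        }
        ~entangled() {
            if (_peer) {
                _peer->_peer = nullptr;
            }
        }
        static void pair(entangled& a, entangled& b) {
            a._peer = &b;
            b._peer = &a;
        }
        entangled* peer() const { return _peer; }
    };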
2021-07-02 19:02:13 +02:00
Tomasz Grabiec
056f14063e lsa: Encapsulate segment_descriptor::_free_space access
Prepares for reusing some of its bits for storing segment kind.
2021-07-02 19:02:13 +02:00
Tomasz Grabiec
019956739d cached_file: Switch to bplus::tree
In order to be able to move it to LSA later.
2021-07-02 10:25:58 +02:00
Tomasz Grabiec
8fbea0b5b7 utils: cached_file: Introduce file wrapper
It's an adaptor between seastar::file and cached_file. It gives a
seastar::file which will serve reads using a given cached_file as a
read-through cache.
2021-07-02 10:25:58 +02:00
Tomasz Grabiec
8e2118069b sstables: cached_file: Account buffers returned by cached_file under read_permit
We want buffers to be accounted only when they are used outside
cached_file. Cached pages should not be accounted because, after
subsequent commits, they will stay around for longer than the read.
2021-07-02 10:25:58 +02:00
Tomasz Grabiec
7fa4e10aa0 row_cache: Use generic LRU for eviction
In preparation for tracking different kinds of objects, not just
rows_entry, in the LRU, switch to the LRU implementation from
utils/lru.hh which can hold arbitrary element type.
2021-07-02 10:25:58 +02:00
Tomasz Grabiec
6b59c8cfb1 utils: Introduce general-purpose LRU
The LRU can link objects of different types, which is achieved by
having a virtual base class called "evictable" from which the linked
objects should inherit. When an object is removed from the LRU,
evictable::on_evicted() is called.

The container is non-owning.
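A sketch of that shape (not the exact utils/lru.hh), using an intrusive
list so the container stays non-owning:

    #include <boost/intrusive/list.hpp>
    namespace bi = boost::intrusive;

    class evictable {
    public:
        bi::list_member_hook<> lru_link;
        virtual ~evictable() = default;
        virtual void on_evicted() noexcept = 0; // object frees/unlinks itself
    };

    using lru_list = bi::list<evictable,
        bi::member_hook<evictable, bi::list_member_hook<>, &evictable::lru_link>>;

    void evict_one(lru_list& lru) {
        if (!lru.empty()) {
            evictable& victim = lru.back(); // cold end (assumed) holds the LRU item
            lru.pop_back();
            victim.on_evicted();            // may destroy the object
        }
    }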
2021-07-02 10:25:58 +02:00
Benny Halevy
9963d15613 utils: stall_free: implement clear_gently of foreign_ptr
clear_gently of the foreign_ptr needs to run on the owning
shard, so provide a specialization for the SmartPointer
implementation.
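A minimal sketch of that specialization (assuming a clear_gently()
overload for the pointee exists, per the previous patch):

    #include <seastar/core/future.hh>
    #include <seastar/core/sharded.hh> // foreign_ptr
    #include <seastar/core/smp.hh>

    template <typename Pointer>
    seastar::future<> clear_gently(seastar::foreign_ptr<Pointer>& fptr) {
        if (!fptr) {
            return seastar::make_ready_future<>();
        }
        return seastar::smp::submit_to(fptr.get_owner_shard(), [&fptr] {
            return clear_gently(*fptr); // runs on the shard owning the data
        });
    }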

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-07-01 19:16:11 +03:00
Benny Halevy
eca9f45c59 utils: stall_free: define generic clear_gently methods
Define a bunch of clear_gently methods that asynchronously
clear the contents of containers and allow yielding.

This replaces the clear_gently(std::list<T>&) used by row-level
repair with a more generic template implementation.

Note that we do not use coroutines in this patch
to facilitate backporting to releases that do not support coroutines
and since a miscompilation bug was hit with clang++ 11 when attempting
to coroutinize this patch (see
https://bugs.llvm.org/show_bug.cgi?id=50345).
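A minimal non-coroutine sketch of one such method, in the spirit of the
patch (the real series covers more container kinds and also clears the
elements themselves gently; seastar::maybe_yield() is assumed available):

    #include <seastar/core/future.hh>
    #include <seastar/core/loop.hh>  // do_until
    #include <seastar/util/later.hh> // maybe_yield
    #include <deque>

    template <typename T>
    seastar::future<> clear_gently(std::deque<T>& c) {
        return seastar::do_until([&c] { return c.empty(); }, [&c] {
            c.pop_back();                  // destroy one element per step
            return seastar::maybe_yield(); // yield if the task quota is up
        });
    }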

Test: stall_free_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-07-01 19:00:49 +03:00
Avi Kivity
6676ceabde Merge 'Prevent reactor stall in utils::merge_to_gently' from Benny Halevy
std::copy_if runs without yielding.

See https://github.com/scylladb/scylla/issues/8897#issuecomment-867522480

Also, eliminate extraneous loop on merge

first1 will point to the inserted value which is a copy of *first2.
Since list2 is sorted in ascending order, the next item from list2
will never be less than the one we've just inserted,
so we waste an iteration to merely increment first1 again.

Fixes #8897

Test: unit(dev), stall_free_test(debug)
DTest: repair_additional_test.py:RepairAdditionalTest.{repair_same_row_diff_value_3nodes_diff_shard_count_test,repair_disjoint_row_3nodes_diff_shard_count_test} (dev)

Closes #8925

* github.com:scylladb/scylla:
  utils: merge_to_gently: eliminate extraneous loop on merge
  utils: merge_to_gently: prevent stall in std::copy_if
2021-06-27 13:56:32 +03:00
Benny Halevy
b96eeaefe4 utils: merge_to_gently: eliminate extraneous loop on merge
first1 will point to the inserted value which is a copy of *first2.
Since list2 is sorted in ascending order, the next item from list2
will never be less than the one we've just inserted,
so we waste an iteration to merely increment first1 again.

Note that the standard states that no iterators or references are invalidated
on insert so we can safely keep looking at `first1` after inserting a copy of
`*first2` before it.
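A hedged sketch of the merging loop with both fixes applied (yield
between steps per the companion patch, and skip re-examining the
element just inserted, per this one):

    #include <seastar/core/coroutine.hh>
    #include <seastar/util/later.hh> // maybe_yield
    #include <list>

    template <typename T, typename Less>
    seastar::future<> merge_to_gently(std::list<T>& list1, std::list<T>& list2, Less less) {
        auto first1 = list1.begin();
        for (auto first2 = list2.begin(); first2 != list2.end(); ) {
            if (first1 == list1.end() || less(*first2, *first1)) {
                // insert() invalidates no iterators; first1 now sits after
                // the inserted copy, so the copy is never re-compared
                list1.insert(first1, *first2);
                ++first2;
            } else {
                ++first1;
            }
            co_await seastar::maybe_yield(); // prevent a reactor stall
        }
    }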

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-24 14:58:12 +03:00
Benny Halevy
453e7c8795 utils: merge_to_gently: prevent stall in std::copy_if
std::copy_if runs without yielding.

See https://github.com/scylladb/scylla/issues/8897#issuecomment-867522480

Note that the standard states that no iterators or references are invalidated
on insert so we can keep inserting before last1 when merging the
remainder of list2 at the tail of list1.

Fixes #8897

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-24 14:47:25 +03:00
Pavel Emelyanov
a61afe8421 btree: Improve unlink_leftmost_without_rebalance()
The helper is used to walk the tree key-by-key, destroying it
in the meantime. The current implementation of this method just
uses the "regular" erasing code, which actually rebalances the
tree despite the name.

The biggest problem with removing the rebalancing is that at
some point a non-balanced tree may have the left-most key on an
inner node, so to make a 100% rebalance-less unlink, every other
method of the tree would have to be prepared for that. However,
there's an option to do a "light rebalance" (as it's called in
this patch) that only maintains this crucial property of the
tree -- that the left-most key is on a leaf.

Some more technical details. The current rebalancer starts when the
node population falls below 1/2 of its capacity and tries to
- grab a key from one of the siblings if it's balanced
- merge two siblings together if they are small enough

The light rebalance is lighter in two ways. First, it leaves
the node unbalanced until it becomes empty. And then it goes
ahead and replaces it with the next sibling.

This change removes ~60% of the key movements in a random test.
Keys still move when the sibling replacement happens, because in
this case the separation key needs to be placed at position 0 of
the right sibling, which means shifting all its keys right.

tests: unit(debug)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210623083836.27491-1-xemul@scylladb.com>
2021-06-24 12:03:26 +03:00
Benny Halevy
02917c79b6 logalloc: get rid of unused _descendant_blocked_requests
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210620064204.1709957-1-bhalevy@scylladb.com>
2021-06-22 15:58:56 +02:00
Avi Kivity
f05ddf0967 Merge "Improve LSA descriptor encoding" from Pavel
"
The LSA small objects allocation latency is greatly affected by
the way this allocator encodes the object descriptor in front of
each allocated slot.

Nowadays it's one of the VLE variants, implemented with the help of
a loop. Re-implementing this piece with fewer instructions and
without a loop greatly reduces the allocation latency.

The speed-up mostly comes from the loop-less code that doesn't
confuse the branch predictor. The express encoder also seems to
benefit from writing 8 bytes of the encoded value in one go, rather
than byte-by-byte.

Perf measurements:

1. (new) logalloc test shows ~40% smaller times

2. perf_mutation in release mode shows ~2% increase in tps

3. the encoder itself is 2-4 times faster on x86_64 and
   1.05-3 times faster on aarch64. The speed-up depends on
   the 'encoded length': the old encoder's time is linear in it,
   while the new one's is constant

tests: unit(dev), perf(release), just encoder on Aarch64
"

* 'br-lsa-alloc-latency-4' of https://github.com/xemul/scylla:
  lsa: Use express encoder
  uleb64: Add express encoding
  lsa: Extract uleb64 code into header
  test: LSA allocation perf test
2021-06-16 18:07:13 +03:00
Pavel Emelyanov
8d0780fb92 lsa: Use express encoder
To make it possible to use the express encoder, lsa needs to
make sure that the value is below the express encoding's maximum
supported value and to provide the size of the gap after the
encoded value.

Both requirements can be satisfied when encoding the migrator
index on object allocation.

On free, the encoded value can be larger; an extended express
encoder would need more instructions and would not be as
efficient, so the old encoder is used there.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-16 17:47:12 +03:00
Pavel Emelyanov
1782b0c6b9 uleb64: Add express encoding
Standard encoding is compiled into a loop that puts values
into memory byte-by-byte. This works slowly, but reliably.
When allocating an object, LSA uses the uleb64 encoder with 2
features that allow optimizing the encoder:

1. the value is migrator.index(), which is small enough
   to fit in 2 bytes when encoded
2. after the descriptor there usually comes an object
   which is 8+ bytes in size

Feature #1 makes it possible to encode the value with just
a few instructions. At -O3, clang compiles it to something like

  mov    %esi,%ecx
  and    $0xfc0,%ecx
  and    $0x3f,%esi
  lea    (%rsi,%rcx,4),%ecx
  add    $0x40,%ecx

Next, the encoder needs to put the value into a gap whose
size depends on the alignment of the previous and current objects,
so the classical algorithm loops through this size. Feature #2
makes it possible to put the encoded value and the needed
amount of zeros using two 64-bit movs. In this case the write
runs past the needed size and overwrites some memory after the
gap. That's OK, as the overwritten memory is where the
allocated object _will_ be, so its contents are of no interest.
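A hedged C++ rendering of that store (the bit layout matches the asm
above; assumes a little-endian target, as the uleb64 name implies):

    #include <cstdint>
    #include <cstring>

    inline void encode_express(char* pos, uint32_t v) {
        // encoded = (v & 0x3f) + (v & 0xfc0) * 4 + 0x40, as in the asm
        uint64_t word = (v & 0x3fu) | ((v & 0xfc0u) << 2) | 0x40u;
        // One 8-byte store writes the value and zero-fills the gap; the
        // bytes spilling past the descriptor land where the object will
        // be constructed, so clobbering them is harmless.
        std::memcpy(pos, &word, sizeof(word));
    }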

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-16 17:47:10 +03:00
Pavel Emelyanov
d8dea48248 lsa: Extract uleb64 code into header
The LSA code encodes an object descriptor before the object
itself. The descriptor is a 32-bit value, and to store it
efficiently it's encoded as an unsigned little-endian
base-64 sequence.

The encoding code is going to be optimized, so put it into a
dedicated header in advance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-16 17:46:44 +03:00
Tomasz Grabiec
605a6e0166 Merge "Remove int_or_strong_ordering concept" from Pavel
It was added to smoothly switch tri-comparing code from int
to strong_ordering. As of today only tests still need it, and the
conversion is pretty simple, plus an operator<<(ostream&) for the
std::strong_ordering type (sketched below).

* xemul/br-remove-int-or-strong-ordering-2:
  util: Drop int_or_strong_ordering concept
  tests: Switch total-order-check onto strong_ordering
  to_string: Add formatter for strong_ordering
  tests: Return strong-ordering from tri-comparators
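A sketch of such a formatter (the exact strings printed are an
assumption; std::strong_ordering only compares against the literal 0):

    #include <compare>
    #include <ostream>

    std::ostream& operator<<(std::ostream& os, std::strong_ordering o) {
        return os << (o < 0 ? "lt" : o > 0 ? "gt" : "eq");
    }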
2021-06-16 09:34:49 +02:00
Nadav Har'El
afacffc556 utils/enum_option.hh: add implicit converter to the underlying enum
Add an implicit converter of the enum_option to the underlying enum
it is holding. This is needed for using switch() on an enum_option.
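A hedged sketch of the shape of the change (the real enum_option
differs in details):

    template <typename Mapper>
    struct enum_option {
        using enum_type = typename Mapper::type; // the held enum
        enum_type value;

        // The added implicit converter: switch() needs an integral or
        // enumeration type, so switch(opt) now compiles.
        operator enum_type() const { return value; }
    };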

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-06-13 13:18:49 +03:00
Pavel Emelyanov
4ad4208426 util: Drop int_or_strong_ordering concept
Nobody uses it now. All tri-comparing stuff is strong_ordering now.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-08 11:40:55 +03:00
Pavel Solodovnikov
76bea23174 treewide: reduce header interdependencies
Use forward declarations wherever possible.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>

Closes #8813
2021-06-07 15:58:35 +03:00
Avi Kivity
a55b434a2b treewide: extend copyright statements to present day 2021-06-06 19:18:49 +03:00
Avi Kivity
e9e5663731 build, utils/bptree.hh: drop -Wno-gnu-designator warning
Drop the warning about old-style GNU designated initializers and
convert two violations in bptree.hh to the standard C++20 syntax.

Closes #8743
2021-05-31 18:51:49 +03:00
Botond Dénes
cd6bbd37a4 utils/utf8.c: move includes outside of namespaces
Including in the middle of a namespace is not a good practice.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210528142502.962947-1-bdenes@scylladb.com>
2021-05-30 23:23:20 +03:00
Benny Halevy
d46958d3ce phased_barrier: advance_and_await: abort on allocation failure
Currently, advance_and_wait() allocates a new gate,
which might fail. Rather than returning this failure
as an exceptional future - which would require its callers
to handle that failure - keep the function noexcept and
let an exception from make_lw_shared<gate>() terminate the program.

This makes the function "fail-free" to its callers,
in particular, when called from the table::stop() path where
we can't do much about these failures and we require close/stop
functions to always succeed.

The alternative of making the allocation of a new gate optional
and recovering from the failure in start() is possible, but was
deemed not worth it as it would add complexity and cost to start(),
which is called on the common, hot path.
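A hedged sketch of the resulting pattern (not the real phased_barrier):

    #include <seastar/core/gate.hh>
    #include <seastar/core/shared_ptr.hh>
    #include <utility>

    class barrier_sketch {
        seastar::lw_shared_ptr<seastar::gate> _gate =
            seastar::make_lw_shared<seastar::gate>();
    public:
        // noexcept: if make_lw_shared<gate>() throws on allocation
        // failure, the program terminates instead of returning an
        // exceptional future that stop paths would have to handle.
        seastar::future<> advance_and_await() noexcept {
            auto old = std::exchange(_gate, seastar::make_lw_shared<seastar::gate>());
            return old->close().finally([old] {}); // keep the old gate alive
        }
    };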

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210525055336.1190029-1-bhalevy@scylladb.com>
2021-05-25 12:50:59 +02:00
Avi Kivity
948e2c0b36 utils: config_file: delete unneeded template instantiation of operator<<()
config_file.cc instantiates std::istream& std::operator>>(std::istream&,
std::unordered_map<seastar::sstring, seastar::sstring>&), but that
instantiation is ignored since config_file_impl.hh specializes
that signature. -Winstantiation-after-specialization warns about it,
so re-enable it now that the code base is clean.

Also remove the matching "extern template" declaration, which has no
definition any more.

Closes #8696
2021-05-24 18:34:45 +03:00
Avi Kivity
50f3bbc359 Merge "treewide: various header cleanups" from Pavel S
"
The patch set is an assorted collection of header cleanups, e.g.:
* Reduce number of boost includes in header files
* Switch to forward declarations in some places

A quick measurement was performed to see if these changes
provide any improvement in build times (ccache cleaned and
existing build products wiped out).

The results are posted below (`/usr/bin/time -v ninja dev-build`)
for 24 cores/48 threads CPU setup (AMD Threadripper 2970WX).

Before:

	Command being timed: "ninja dev-build"
	User time (seconds): 28262.47
	System time (seconds): 824.85
	Percent of CPU this job got: 3979%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 12:10.97
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 2129888
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1402838
	Minor (reclaiming a frame) page faults: 124265412
	Voluntary context switches: 1879279
	Involuntary context switches: 1159999
	Swaps: 0
	File system inputs: 0
	File system outputs: 11806272
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

After:

	Command being timed: "ninja dev-build"
	User time (seconds): 26270.81
	System time (seconds): 767.01
	Percent of CPU this job got: 3905%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 11:32.36
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 2117608
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1400189
	Minor (reclaiming a frame) page faults: 117570335
	Voluntary context switches: 1870631
	Involuntary context switches: 1154535
	Swaps: 0
	File system inputs: 0
	File system outputs: 11777280
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

The observed improvement is about 5% of total wall clock time
for `dev-build` target.

Also, all commits make sure that headers stay self-sufficient,
which would help to further improve the situation in the future.
"

* 'feature/header_cleanups_v1' of https://github.com/ManManson/scylla:
  transport: remove extraneous `qos/service_level_controller` includes from headers
  treewide: remove evidently unneeded storage_proxy includes from some places
  service_level_controller: remove extraneous `service/storage_service.hh` include
  sstables/writer: remove extraneous `service/storage_service.hh` include
  treewide: remove extraneous database.hh includes from headers
  treewide: reduce boost headers usage in scylla header files
  cql3: remove extraneous includes from some headers
  cql3: various forward declaration cleanups
  utils: add missing <limits> header in `extremum_tracking.hh`
2021-05-24 14:24:20 +03:00
Michał Chojnowski
03faf139c8 collection_mutation: don't linearize collection values
Yet another patch preventing potentially large allocations.
Currently, collection_mutation{_view,}_description linearize each collection
value during deserialization. It's not unthinkable that a user adds a
large element to a list or a map, so let's avoid that.

This patch removes the dependency on linearizing_input_stream, which does not
provide a way to read fragmented subbuffers, and replaces it with a new
helper, which does. (Extending linearizing_input_stream is not viable without
rewriting it completely).

Only linearization of collection values is corrected in this patch.
Collection keys are still linearized. Storing them in managed_bytes is likely
to be more harmful than helpful, because large map keys are extremely unlikely,
and UUIDs, which are used as keys in lists, do not fit into manages_bytes's
small value optimization, so this would incure an extra allocation for every
list element.

Note: this patch leaves utils/linearizing_input_stream.hh unused.

Refs: #8120

Closes #8690
2021-05-23 12:16:56 +03:00
Michał Chojnowski
d72b91053b logalloc: fix quadratic behaviour of reclaim_from_evictable
As an optimization for optimistic cases, reclaim_from_evictable first evicts
the requested amount of memory before attempting to reclaim segments through
compactions. However, due to an oversight, it does this before every compaction
instead of once before all compactions.

Usually reclaim_from_evictable is called with small targets, or is preemptible,
and in those cases this issue is not visible. However, when the target is bigger
than one segment and the reclaim is not preemptible, which is the case when it's
called from allocating_section, this results in a quadratic explosion of
evictions, which can evict several hundred MiB to reclaim a few MiB.

Fix that by calculating the target of memory eviction only once, instead of
recalculating it after every compaction.
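A loose sketch of the before/after shape (helper names hypothetical):

    size_t evict_up_to(size_t bytes);  // hypothetical: returns bytes evicted
    size_t compact_one_segment();      // hypothetical: returns bytes reclaimed

    size_t reclaim_from_evictable(size_t target) {
        // Before the fix, evict_up_to(target) ran inside the loop below,
        // re-evicting the full target before every single compaction.
        size_t reclaimed = evict_up_to(target); // now: once, up front
        while (reclaimed < target) {
            size_t got = compact_one_segment();
            if (!got) {
                break; // nothing left to reclaim
            }
            reclaimed += got;
        }
        return reclaimed;
    }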

Fixes #8542.

Closes #8611
2021-05-22 20:49:00 +02:00
Pavel Solodovnikov
fff7ef1fc2 treewide: reduce boost headers usage in scylla header files
`dev-headers` target is also ensured to build successfully.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-20 01:33:18 +03:00
Piotr Sarna
223a59c09c test: make rjson allocator test work in sanitize mode
Following Nadav's advice, instead of ignoring the test
in sanitize/debug modes, the allocator simply has a special path
that fails sufficiently large allocation requests.
With that, a problem with the address sanitizer is bypassed
and other debug mode sanitizers can inspect and check
if there are no more problems related to wrapping the original
rapidjson allocator.

Closes #8539
2021-05-20 00:42:47 +03:00
Pavel Solodovnikov
94b5c6333f utils: add missing <limits> header in extremum_tracking.hh
This makes all headers in scylla self-sufficient as of
this point.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-20 00:04:51 +03:00
Pavel Emelyanov
0c4ba56594 bptree: Require dynamic object for nodes reconstruct
The B+ tree is not intrusive and supports both kinds of objects --
dynamic (in the sense of the previous patch) and fixed-size.
Respectively, the nodes provide a .storage_size() method and compute
the embedded objects' storage size themselves. Thus, if a dynamic
object is used with the tree but misses the .storage_size() itself,
this would go unnoticed. Fortunately, dynamic objects use the
.reconstruct() method, so that method should be equipped with the
DynamicObject concept.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-19 09:23:49 +03:00
Pavel Emelyanov
9216a5bc08 allocation_strategy, code: Conceptualize dynamic objects
Usually lsa allocation is performed with the construct() helper that
allocates a sizeof(T) slot and constructs it in place. Some rare
objects have dynamic size, so they are created by alloc()ating a
slot of some specific size and (!) must provide the correct overload
of size_for_allocation_strategy that reports back the relevant
storage size.

This "must provide" is not enforced; if it is missed, a default sizer
would be instantiated, but won't work properly. This patch makes all
users of alloc() conform to the DynamicObject concept, which requires
the presence of a .storage_size() method to tell that size.
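The concept's shape is presumably along these lines (the exact
definition may differ):

    #include <concepts>
    #include <cstddef>

    template <typename T>
    concept DynamicObject = requires (const T& obj) {
        { obj.storage_size() } -> std::convertible_to<std::size_t>;
    };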

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-05-19 09:23:49 +03:00