Files
scylladb/utils/logalloc.hh
Avi Kivity 99d5355007 Merge "Cache sstable indexes in memory" from Tomasz
"
The main goal of this series is to improve efficiency of reads from large partitions by
reducing amount of I/O needed to read the sstable index. This is achieved by caching
index file pages and partition index entries in memory.

Currently, the pages are cached by individual reads only for the duration of the read.
This was done to facilitate binary search in the promoted index (intra-partition index).
After this series, all reads share the index file page cache, which stays around even after reads stop.

The page cache is subject to eviction. It uses the same region as the current row cache and shares
the LRU with row cache entries. This means that LRU objects need to be virtualized. This series takes
an easy approach and does this by introducing a virtual base class. This adds an overhead to row cache
entry to store the vtable pointer.

SStable indexes have a hierarchy. There is a summary, which is a sparse partition key index into the
full partition index. This one is already kept in memory. The partition index is divided by the summary
into pages. Each entry in the partition index contains promoted index, which is a sparse index into atoms
identified by the clustering key (rows, tombstones).

In order to read the promoted index, the reader needs to read the partition index entry first.
To speed this up, this series also adds caching of partition index entries. This cache survives
reads and is subject to eviction, just like the index file page cache. The unit of caching is
the partition index page. Without this cache, each access to promoted index would have to be
preceded with the parsing of the partition index page containing the partition key.

Performance testing results follow.

1) scylla-bench large partition reads

  Populated with:

        perf_fast_forward --run-tests=large-partition-skips --datasets=sb-large-part-ds1 \
            -c1 -m1G --populate --value-size=1024 --rows=10000000

  Single partition, 9G data file, 4MB index file

  Test execution:

    build/release/scylla -c1 -m4G
    scylla-bench -workload uniform -mode read -limit 1 -concurrency 100 -partition-count 1 \
       -clustering-row-count 10000000 -duration 60m

  TL;DR: after: 2x throughput, 0.5 median latency

    Before (c1daf2bb24):

    Results
    Time (avg):	 5m21.033180213s
    Total ops:	 966951
    Total rows:	 966951
    Operations/s:	 3011.997048812112
    Rows/s:		 3011.997048812112
    Latency:
      max:		 74.055679ms
      99.9th:	 63.569919ms
      99th:		 41.320447ms
      95th:		 38.076415ms
      90th:		 37.158911ms
      median:	 34.537471ms
      mean:		 33.195994ms

    After:

    Results
    Time (avg):	 5m14.706669345s
    Total ops:	 2042831
    Total rows:	 2042831
    Operations/s:	 6491.22243800942
    Rows/s:		 6491.22243800942
    Latency:
      max:		 60.096511ms
      99.9th:	 35.520511ms
      99th:		 27.000831ms
      95th:		 23.986175ms
      90th:		 21.659647ms
      median:	 15.040511ms
      mean:		 15.402076ms

2) scylla-bench small partitions

  I tested several scenarios with a varying data set size, e.g. data fully fitting in memory,
  half fitting, and being much larger. The improvement varied a bit but in all cases the "after"
  code performed slightly better.

  Below is a representative run over data set which does not fit in memory.

  scylla -c1 -m4G
  scylla-bench -workload uniform -mode read  -concurrency 400 -partition-count 10000000 \
      -clustering-row-count 1 -duration 60m -no-lower-bound

  Before:

    Time (avg):	 51.072411913s
    Total ops:	 3165885
    Total rows:	 3165885
    Operations/s:	 61988.164024260645
    Rows/s:		 61988.164024260645
    Latency:
      max:		 34.045951ms
      99.9th:	 25.985023ms
      99th:		 23.298047ms
      95th:		 19.070975ms
      90th:		 17.530879ms
      median:	 3.899391ms
      mean:		 6.450616ms

  After:

    Time (avg):	 50.232410679s
    Total ops:	 3778863
    Total rows:	 3778863
    Operations/s:	 75227.58014424688
    Rows/s:		 75227.58014424688
    Latency:
      max:		 37.027839ms
      99.9th:	 24.805375ms
      99th:		 18.219007ms
      95th:		 14.090239ms
      90th:		 12.124159ms
      median:	 4.030463ms
      mean:		 5.315111ms

  The results include the warmup phase which populates the partition index cache, so the hot-cache effect
  is dampened in the statistics. See the 99th percentile. Latency gets better after the cache warms up which
  moves it lower.

3) perf_fast_forward --run-tests=large-partition-skips

    Caching is not used here, included to show there are no regressions for the cold cache case.

    TL;DR: No significant change

    perf_fast_forward --run-tests=large-partition-skips --datasets=large-part-ds1 -c1 -m1G

    Config: rows: 10000000, value size: 2000

    Before:

    read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
    1       0        36.429822            4  10000000     274500         62     274521     274429   153889.2 153883   19696986  153853       0        0        0        0        0        0        0  22.5%
    1       1        36.856236            4   5000000     135662          7     135670     135650   155652.0 155652   19704117  139326       1        0        1        1        0        0        0  38.1%
    1       8        36.347667            4   1111112      30569          0      30570      30569   155652.0 155652   19704117  139071       1        0        1        1        0        0        0  19.5%
    1       16       36.278866            4    588236      16214          1      16215      16213   155652.0 155652   19704117  139073       1        0        1        1        0        0        0  16.6%
    1       32       36.174784            4    303031       8377          0       8377       8376   155652.0 155652   19704117  139056       1        0        1        1        0        0        0  12.3%
    1       64       36.147104            4    153847       4256          0       4256       4256   155652.0 155652   19704117  139109       1        0        1        1        0        0        0  11.1%
    1       256       9.895288            4     38911       3932          1       3933       3930   100869.2 100868    3178298   59944   38912        0        1        1        0        0        0  14.3%
    1       1024      2.599921            4      9757       3753          0       3753       3753    26604.0  26604     801850   15071    9758        0        1        1        0        0        0  14.6%
    1       4096      0.784568            4      2441       3111          1       3111       3109     7982.0   7982     205946    3772    2442        0        1        1        0        0        0  13.8%

    64      1        36.553975            4   9846154     269359         10     269369     269337   155663.8 155652   19704117  139230       1        0        1        1        0        0        0  28.2%
    64      8        36.509694            4   8888896     243467          8     243475     243449   155652.0 155652   19704117  139120       1        0        1        1        0        0        0  26.5%
    64      16       36.466282            4   8000000     219381          4     219385     219374   155652.0 155652   19704117  139232       1        0        1        1        0        0        0  24.8%
    64      32       36.395926            4   6666688     183171          6     183180     183165   155652.0 155652   19704117  139158       1        0        1        1        0        0        0  21.8%
    64      64       36.296856            4   5000000     137753          4     137757     137737   155652.0 155652   19704117  139105       1        0        1        1        0        0        0  17.7%
    64      256      20.590392            4   2000000      97133         18      97151      94996   135248.8 131395    7877402   98335   31282        0        1        1        0        0        0  15.7%
    64      1024      6.225773            4    588288      94492       1436      95434      88748    46066.5  41321    2324378   30360    9193        0        1        1        0        0        0  15.8%
    64      4096      1.856069            4    153856      82893         54      82948      82721    16115.0  16043     583674   11574    2675        0        1        1        0        0        0  16.3%

    After:

    read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
    1       0        36.429240            4  10000000     274505         38     274515     274417   153887.8 153883   19696986  153849       0        0        0        0        0        0        0  22.4%
    1       1        36.933806            4   5000000     135377         15     135385     135354   155658.0 155658   19704085  139398       1        0        1        1        0        0        0  40.0%
    1       8        36.419187            4   1111112      30509          2      30510      30507   155658.0 155658   19704085  139233       1        0        1        1        0        0        0  22.0%
    1       16       36.353475            4    588236      16181          0      16182      16181   155658.0 155658   19704085  139183       1        0        1        1        0        0        0  19.2%
    1       32       36.251356            4    303031       8359          0       8359       8359   155658.0 155658   19704085  139120       1        0        1        1        0        0        0  14.8%
    1       64       36.203692            4    153847       4249          0       4250       4249   155658.0 155658   19704085  139071       1        0        1        1        0        0        0  13.0%
    1       256       9.965876            4     38911       3904          0       3906       3904   100875.2 100874    3178266   60108   38912        0        1        1        0        0        0  17.9%
    1       1024      2.637501            4      9757       3699          1       3700       3697    26610.0  26610     801818   15071    9758        0        1        1        0        0        0  19.5%
    1       4096      0.806745            4      2441       3026          1       3027       3024     7988.0   7988     205914    3773    2442        0        1        1        0        0        0  18.3%

    64      1        36.611243            4   9846154     268938          5     268942     268921   155669.8 155705   19704085  139330       2        0        1        1        0        0        0  29.9%
    64      8        36.559471            4   8888896     243135         11     243156     243124   155658.0 155658   19704085  139261       1        0        1        1        0        0        0  28.1%
    64      16       36.510319            4   8000000     219116         15     219126     219101   155658.0 155658   19704085  139173       1        0        1        1        0        0        0  26.3%
    64      32       36.439069            4   6666688     182954          9     182964     182943   155658.0 155658   19704085  139274       1        0        1        1        0        0        0  23.2%
    64      64       36.334808            4   5000000     137609         11     137612     137596   155658.0 155658   19704085  139258       2        0        1        1        0        0        0  19.1%
    64      256      20.624759            4   2000000      96971         88      97059      92717   138296.0 131401    7877370   98332   31282        0        1        1        0        0        0  17.2%
    64      1024      6.260598            4    588288      93967       1429      94905      88051    45939.5  41327    2324346   30361    9193        0        1        1        0        0        0  17.8%
    64      4096      1.881338            4    153856      81780        140      81920      81520    16109.8  16092     582714   11617    2678        0        1        1        0        0        0  18.2%

4) perf_fast_forward --run-tests=large-partition-slicing

    Caching enabled, each line shows the median run from many iterations

    TL;DR: We can observe reduction in IO which translates to reduction in execution time,
           especially for slicing in the middle of partition.

    perf_fast_forward --run-tests=large-partition-slicing --datasets=large-part-ds1 -c1 -m1G --keep-cache-across-test-cases

    Config: rows: 10000000, value size: 2000

    Before:

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
    0       1         0.000491          127         1       2037         24       2109        127        4.0      4        128       2       2        0        1        1        0        0        0       157      80 3058208  15.0%
    0       32        0.000561         1740        32      56995        410      60031      47208        5.0      5        160       3       2        0        1        1        0        0        0       386     111  113353  17.5%
    0       256       0.002052          488       256     124736       7111     144762      89053       16.6     17        672      14       2        0        1        1        0        0        0      2113     446   52669  18.6%
    0       4096      0.016437           61      4096     249199        692     252389     244995       69.4     69       8640      57       5        0        1        1        0        0        0     26638    1717   23321  22.4%
    5000000 1         0.002171          221         1        461          2        466        221       25.0     25        268       3       3        0        1        1        0        0        0       638     376 14311524  10.2%
    5000000 32        0.002392          404        32      13376         48      13528      13015       27.0     27        332       5       3        0        1        1        0        0        0       931     432  489691  11.9%
    5000000 256       0.003659          279       256      69967        764      73130      52563       39.5     41        780      19       3        0        1        1        0        0        0      2689     825   93756  15.8%
    5000000 4096      0.018592           55      4096     220313        433     234214     218803       94.2     94       9484      62       9        0        1        1        0        0        0     27349    2213   26562  21.0%

    After:

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
    0       1         0.000229          115         1       4371         85       4585        115        2.1      2         64       1       1        1        0        0        0        0        0        90      31 1314749  22.2%
    0       32        0.000277         2174        32     115674       1015     128109      14144        3.0      3         96       2       1        1        0        0        0        0        0       319      62   52508  26.1%
    0       256       0.001786          576       256     143298       5534     179142     113715       14.7     17        544      15       1        1        0        0        0        0        0      2110     453   45419  21.4%
    0       4096      0.015498           61      4096     264289       2006     268850     259342       67.4     67       8576      59       4        1        0        0        0        0        0     26657    1738   22897  23.7%
    5000000 1         0.000415          233         1       2411         15       2456        234        4.1      4        128       2       2        1        0        0        0        0        0       199      72 2644719  16.8%
    5000000 32        0.000635         1413        32      50398        349      51149      46439        6.0      6        192       4       2        1        0        0        0        0        0       458     128  125893  18.6%
    5000000 256       0.002028          486       256     126228       3024     146327      82559       17.8     18       1024      13       4        1        0        0        0        0        0      2123     385   51787  19.6%
    5000000 4096      0.016836           61      4096     243294        814     263434     241660       73.0     73       9344      62       8        1        0        0        0        0        0     26922    1920   24389  22.4%

Future work:

 - Check the impact on non-uniform workloads. Caching sstable indexes takes space away from the row cache
   which may reduce the hit ratio.

 - Reduce memory footprint of partition index cache. Currently, about 8x bloat over the on-disk size.

 - Disable cache population for "bypass cache" reads

 - Add a switch to disable sstable index caching, per-node, maybe per-table

 - Better sstable index format. Current format leads to inefficiency in caching since only some elements of the cached
   page can be hot. A B-tree index would be more efficient. Same applies to the partition index. Only some elements in
   the partition index page can be hot.

 - Add heuristic for reducing index file IO size when large partitions are anticipated. If we're bound by disk's
   bandwidth it's wasteful to read the front of promoted index using 32K IO, better use 4K which should cover the
   partition entry and then let binary search read the rest.

In V2:

 - Fixed perf_fast_forward regression in the number of IOs used to read partition index page
   The reader uses 32K reads, which were split by page cache into 4K reads
   Fix by propagating IO size hints to page cache and using single IO to populate it.
   New patch: "cached_file: Issue single I/O for the whole read range on miss"

 - Avoid large allocations to store partition index page entries (due to managed_vector storage).
   There is a unit test which detects this and fails.
   Fixed by implementing chunked_managed_vector, based on chunked_vector.

 - fixed bug in cached_file::evict_gently() where the wrong allocation strategy was used to free btree chunks

 - Simplify region_impl::free_buf() according to Avi's suggestions

 - Fit segment_kind in segment_descriptor::_free_space and lift requirement that _buf_pointers emptiness determines the kind

 - Workaround sigsegv which was most likely due to coroutine miscompilation. Worked around by manipulating local object scope.

 - Wire up system/drop_sstable_caches RESTful API

 - Fix use-after-move on permit for the old scanning ka/la index reader

 - Fixed more cases of double open_data() in tests leading to assert failure

 - Adjusted cached_file class doc to account for changes in behavior.

 - Rebased

Fixes #7079.
Refs #363.
"

* tag 'sstable-index-caching-v2' of github.com:tgrabiec/scylla: (39 commits)
  api: Drop sstable index caches on system/drop_sstable_caches
  cached_file: Issue single I/O for the whole read range on miss
  row_cache: cache_tracker: Do not register metrics when constructed for tests
  sstables, cached_file: Evict cache gently when sstable is destroyed
  sstables: Hide partition_index_cache implementation away from sstables.hh
  sstables: Drop shared_index_lists alias
  sstables: Destroy partition index cache gently
  sstables: Cache partition index pages in LSA and link to LRU
  utils: Introduce lsa::weak_ptr<>
  sstables: Rename index_list to partition_index_page and shared_index_lists to partition_index_cache
  sstables, cached_file: Avoid copying buffers from cache when parsing promoted index
  cached_file: Introduce get_page_units()
  sstables: read: Document that primitive_consumer::read_32() is alloc-free
  sstables: read: Count partition index page evictions
  sstables: Drop the _use_binary_search flag from index entries
  sstables: index_reader: Keep index objects under LSA
  lsa: chunked_managed_vector: Adapt more to managed_vector
  utils: lsa: chunked_managed_vector: Make LSA-aware
  test: chunked_managed_vector_test: Make exception_safe_class standard layout
  lsa: Copy chunked_vector to chunked_managed_vector
  ...
2021-07-07 18:17:10 +03:00

845 lines
30 KiB
C++

/*
* Copyright (C) 2015-present ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#pragma once
#include <memory>
#include <seastar/core/memory.hh>
#include <seastar/core/condition-variable.hh>
#include <seastar/core/smp.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/shared_future.hh>
#include <seastar/core/expiring_fifo.hh>
#include "allocation_strategy.hh"
#include <boost/heap/binomial_heap.hpp>
#include "seastarx.hh"
#include "db/timeout_clock.hh"
#include "utils/entangled.hh"
namespace logalloc {
struct occupancy_stats;
class region;
class region_impl;
class allocating_section;
constexpr int segment_size_shift = 17; // 128K; see #151, #152
constexpr size_t segment_size = 1 << segment_size_shift;
constexpr size_t max_zone_segments = 256;
//
// Frees some amount of objects from the region to which it's attached.
//
// This should eventually stop given no new objects are added:
//
// while (eviction_fn() == memory::reclaiming_result::reclaimed_something) ;
//
using eviction_fn = std::function<memory::reclaiming_result()>;
//
// Users of a region_group can pass an instance of the class region_group_reclaimer, and specialize
// its methods start_reclaiming() and stop_reclaiming(). Those methods will be called when the LSA
// see relevant changes in the memory pressure conditions for this region_group. By specializing
// those methods - which are a nop by default - the callers can take action to aid the LSA in
// alleviating pressure.
class region_group_reclaimer {
protected:
size_t _threshold;
size_t _soft_limit;
bool _under_pressure = false;
bool _under_soft_pressure = false;
// The following restrictions apply to implementations of start_reclaiming() and stop_reclaiming():
//
// - must not use any region or region_group objects, because they're invoked synchronously
// with operations on those.
//
// - must be noexcept, because they're called on the free path.
//
// - the implementation may be called synchronously with any operation
// which allocates memory, because these are called by memory reclaimer.
// In particular, the implementation should not depend on memory allocation
// because that may fail when in reclaiming context.
//
virtual void start_reclaiming() noexcept {}
virtual void stop_reclaiming() noexcept {}
public:
bool under_pressure() const {
return _under_pressure;
}
bool over_soft_limit() const {
return _under_soft_pressure;
}
void notify_soft_pressure() noexcept {
if (!_under_soft_pressure) {
_under_soft_pressure = true;
start_reclaiming();
}
}
void notify_soft_relief() noexcept {
if (_under_soft_pressure) {
_under_soft_pressure = false;
stop_reclaiming();
}
}
void notify_pressure() noexcept {
_under_pressure = true;
}
void notify_relief() noexcept {
_under_pressure = false;
}
region_group_reclaimer()
: _threshold(std::numeric_limits<size_t>::max()), _soft_limit(std::numeric_limits<size_t>::max()) {}
region_group_reclaimer(size_t threshold)
: _threshold(threshold), _soft_limit(threshold) {}
region_group_reclaimer(size_t threshold, size_t soft)
: _threshold(threshold), _soft_limit(soft) {
assert(_soft_limit <= _threshold);
}
virtual ~region_group_reclaimer() {}
size_t throttle_threshold() const {
return _threshold;
}
size_t soft_limit_threshold() const {
return _soft_limit;
}
};
// Groups regions for the purpose of statistics. Can be nested.
class region_group {
static region_group_reclaimer no_reclaimer;
struct region_evictable_occupancy_ascending_less_comparator {
bool operator()(region_impl* r1, region_impl* r2) const;
};
// We want to sort the subgroups so that we can easily find the one that holds the biggest
// region for freeing purposes. Please note that this is not the biggest of the region groups,
// since a big region group can have a big collection of very small regions, and freeing them
// won't achieve anything. An example of such scenario is a ScyllaDB region with a lot of very
// small memtables that add up, versus one with a very big memtable. The small memtables are
// likely still growing, and freeing the big memtable will guarantee that the most memory is
// freed up, while maximizing disk throughput.
//
// As asynchronous reclaim will likely involve disk operation, and those tend to be more
// efficient when bulk done, this behavior is not ScyllaDB memtable specific.
//
// The maximal score is recursively defined as:
//
// max(our_biggest_region, our_subtree_biggest_region)
struct subgroup_maximal_region_ascending_less_comparator {
bool operator()(region_group* rg1, region_group* rg2) const {
return rg1->maximal_score() < rg2->maximal_score();
}
};
friend struct subgroup_maximal_region_ascending_less_comparator;
using region_heap = boost::heap::binomial_heap<region_impl*,
boost::heap::compare<region_evictable_occupancy_ascending_less_comparator>,
boost::heap::allocator<std::allocator<region_impl*>>,
//constant_time_size<true> causes corruption with boost < 1.60
boost::heap::constant_time_size<false>>;
using subgroup_heap = boost::heap::binomial_heap<region_group*,
boost::heap::compare<subgroup_maximal_region_ascending_less_comparator>,
boost::heap::allocator<std::allocator<region_group*>>,
//constant_time_size<true> causes corruption with boost < 1.60
boost::heap::constant_time_size<false>>;
region_group* _parent = nullptr;
size_t _total_memory = 0;
region_group_reclaimer& _reclaimer;
subgroup_heap _subgroups;
subgroup_heap::handle_type _subgroup_heap_handle;
region_heap _regions;
region_group* _maximal_rg = nullptr;
// We need to store the score separately, otherwise we'd have to have an extra pass
// before we update the region occupancy.
size_t _maximal_score = 0;
struct allocating_function {
virtual ~allocating_function() = default;
virtual void allocate() = 0;
virtual void fail(std::exception_ptr) = 0;
};
template <typename Func>
struct concrete_allocating_function : public allocating_function {
using futurator = futurize<std::result_of_t<Func()>>;
typename futurator::promise_type pr;
Func func;
public:
void allocate() override {
futurator::invoke(func).forward_to(std::move(pr));
}
void fail(std::exception_ptr e) override {
pr.set_exception(e);
}
concrete_allocating_function(Func&& func) : func(std::forward<Func>(func)) {}
typename futurator::type get_future() {
return pr.get_future();
}
};
class on_request_expiry {
class blocked_requests_timed_out_error : public timed_out_error {
const sstring _msg;
public:
explicit blocked_requests_timed_out_error(sstring name)
: _msg(std::move(name) + ": timed out") {}
virtual const char* what() const noexcept override {
return _msg.c_str();
}
};
sstring _name;
public:
explicit on_request_expiry(sstring name) : _name(std::move(name)) {}
void operator()(std::unique_ptr<allocating_function>&) noexcept;
};
// It is a more common idiom to just hold the promises in the circular buffer and make them
// ready. However, in the time between the promise being made ready and the function execution,
// it could be that our memory usage went up again. To protect against that, we have to recheck
// if memory is still available after the future resolves.
//
// But we can greatly simplify it if we store the function itself in the circular_buffer, and
// execute it synchronously in release_requests() when we are sure memory is available.
//
// This allows us to easily provide strong execution guarantees while keeping all re-check
// complication in release_requests and keep the main request execution path simpler.
expiring_fifo<std::unique_ptr<allocating_function>, on_request_expiry, db::timeout_clock> _blocked_requests;
uint64_t _blocked_requests_counter = 0;
condition_variable _relief;
future<> _releaser;
bool _shutdown_requested = false;
bool reclaimer_can_block() const;
future<> start_releaser(scheduling_group deferered_work_sg);
void notify_relief();
friend void region_group_binomial_group_sanity_check(const region_group::region_heap& bh);
public:
// When creating a region_group, one can specify an optional throttle_threshold parameter. This
// parameter won't affect normal allocations, but an API is provided, through the region_group's
// method run_when_memory_available(), to make sure that a given function is only executed when
// the total memory for the region group (and all of its parents) is lower or equal to the
// region_group's throttle_treshold (and respectively for its parents).
//
// The deferred_work_sg parameter specifies a scheduling group in which to run allocations
// (given to run_when_memory_available()) when they must be deferred due to lack of memory
// at the time the call to run_when_memory_available() was made.
region_group(sstring name = "(unnamed region_group)",
region_group_reclaimer& reclaimer = no_reclaimer,
scheduling_group deferred_work_sg = default_scheduling_group())
: region_group(name, nullptr, reclaimer, deferred_work_sg) {}
region_group(sstring name, region_group* parent, region_group_reclaimer& reclaimer = no_reclaimer,
scheduling_group deferred_work_sg = default_scheduling_group());
region_group(region_group&& o) = delete;
region_group(const region_group&) = delete;
~region_group() {
// If we set a throttle threshold, we'd be postponing many operations. So shutdown must be
// called.
if (reclaimer_can_block()) {
assert(_shutdown_requested);
}
if (_parent) {
_parent->del(this);
}
}
region_group& operator=(const region_group&) = delete;
region_group& operator=(region_group&&) = delete;
size_t memory_used() const {
return _total_memory;
}
void update(ssize_t delta);
// It would be easier to call update, but it is unfortunately broken in boost versions up to at
// least 1.59.
//
// One possibility would be to just test for delta sigdness, but we adopt an explicit call for
// two reasons:
//
// 1) it save us a branch
// 2) some callers would like to pass delta = 0. For instance, when we are making a region
// evictable / non-evictable. Because the evictable occupancy changes, we would like to call
// the full update cycle even then.
void increase_usage(region_heap::handle_type& r_handle, ssize_t delta) {
_regions.increase(r_handle);
update(delta);
}
void decrease_evictable_usage(region_heap::handle_type& r_handle) {
_regions.decrease(r_handle);
}
void decrease_usage(region_heap::handle_type& r_handle, ssize_t delta) {
decrease_evictable_usage(r_handle);
update(delta);
}
//
// Make sure that the function specified by the parameter func only runs when this region_group,
// as well as each of its ancestors have a memory_used() amount of memory that is lesser or
// equal the throttle_threshold, as specified in the region_group's constructor.
//
// region_groups that did not specify a throttle_threshold will always allow for execution.
//
// In case current memory_used() is over the threshold, a non-ready future is returned and it
// will be made ready at some point in the future, at which memory usage in the offending
// region_group (either this or an ancestor) falls below the threshold.
//
// Requests that are not allowed for execution are queued and released in FIFO order within the
// same region_group, but no guarantees are made regarding release ordering across different
// region_groups.
//
// When timeout is reached first, the returned future is resolved with timed_out_error exception.
template <typename Func>
futurize_t<std::result_of_t<Func()>> run_when_memory_available(Func&& func, db::timeout_clock::time_point timeout) {
// We disallow future-returning functions here, because otherwise memory may be available
// when we start executing it, but no longer available in the middle of the execution.
static_assert(!is_future<std::result_of_t<Func()>>::value, "future-returning functions are not permitted.");
auto blocked_at = do_for_each_parent(this, [] (auto rg) {
return (rg->_blocked_requests.empty() && !rg->under_pressure()) ? stop_iteration::no : stop_iteration::yes;
});
if (!blocked_at) {
return futurize_invoke(func);
}
auto fn = std::make_unique<concrete_allocating_function<Func>>(std::forward<Func>(func));
auto fut = fn->get_future();
_blocked_requests.push_back(std::move(fn), timeout);
++_blocked_requests_counter;
return fut;
}
// returns a pointer to the largest region (in terms of memory usage) that sits below this
// region group. This includes the regions owned by this region group as well as all of its
// children.
region* get_largest_region();
// Shutdown is mandatory for every user who has set a threshold
// Can be called at most once.
future<> shutdown() {
_shutdown_requested = true;
_relief.signal();
return std::move(_releaser);
}
size_t blocked_requests() {
return _blocked_requests.size();
}
uint64_t blocked_requests_counter() const {
return _blocked_requests_counter;
}
private:
// Returns true if and only if constraints of this group are not violated.
// That's taking into account any constraints imposed by enclosing (parent) groups.
bool execution_permitted() noexcept;
// Executes the function func for each region_group upwards in the hierarchy, starting with the
// parameter node. The function func may return stop_iteration::no, in which case it proceeds to
// the next ancestor in the hierarchy, or stop_iteration::yes, in which case it stops at this
// level.
//
// This method returns a pointer to the region_group that was processed last, or nullptr if the
// root was reached.
template <typename Func>
static region_group* do_for_each_parent(region_group *node, Func&& func) {
auto rg = node;
while (rg) {
if (func(rg) == stop_iteration::yes) {
return rg;
}
rg = rg->_parent;
}
return nullptr;
}
inline bool under_pressure() const {
return _reclaimer.under_pressure();
}
uint64_t top_region_evictable_space() const;
uint64_t maximal_score() const {
return _maximal_score;
}
void update_maximal_rg() {
auto my_score = top_region_evictable_space();
auto children_score = _subgroups.empty() ? 0 : _subgroups.top()->maximal_score();
auto old_maximal_score = _maximal_score;
if (children_score > my_score) {
_maximal_rg = _subgroups.top()->_maximal_rg;
} else {
_maximal_rg = this;
}
_maximal_score = _maximal_rg->top_region_evictable_space();
if (_parent) {
// binomial heap update boost bug.
if (_maximal_score > old_maximal_score) {
_parent->_subgroups.increase(_subgroup_heap_handle);
} else if (_maximal_score < old_maximal_score) {
_parent->_subgroups.decrease(_subgroup_heap_handle);
}
}
}
void add(region_group* child);
void del(region_group* child);
void add(region_impl* child);
void del(region_impl* child);
friend class region_impl;
};
// Controller for all LSA regions. There's one per shard.
class tracker {
public:
class impl;
struct config {
bool defragment_on_idle;
bool abort_on_lsa_bad_alloc;
bool sanitizer_report_backtrace = false; // Better reports but slower
size_t lsa_reclamation_step;
scheduling_group background_reclaim_sched_group;
};
void configure(const config& cfg);
future<> stop();
private:
std::unique_ptr<impl> _impl;
memory::reclaimer _reclaimer;
friend class region;
friend class region_impl;
memory::reclaiming_result reclaim(seastar::memory::reclaimer::request);
public:
tracker();
~tracker();
//
// Tries to reclaim given amount of bytes in total using all compactible
// and evictable regions. Returns the number of bytes actually reclaimed.
// That value may be smaller than requested when evictable pools are empty
// and compactible pools can't compact any more.
//
// Invalidates references to objects in all compactible and evictable regions.
//
size_t reclaim(size_t bytes);
// Compacts as much as possible. Very expensive, mainly for testing.
// Guarantees that every live object from reclaimable regions will be moved.
// Invalidates references to objects in all compactible and evictable regions.
void full_compaction();
void reclaim_all_free_segments();
// Returns aggregate statistics for all pools.
occupancy_stats region_occupancy();
// Returns statistics for all segments allocated by LSA on this shard.
occupancy_stats occupancy();
// Returns amount of allocated memory not managed by LSA
size_t non_lsa_used_space() const;
impl& get_impl() { return *_impl; }
// Returns the minimum number of segments reclaimed during single reclamation cycle.
size_t reclamation_step() const;
bool should_abort_on_bad_alloc();
};
tracker& shard_tracker();
class segment_descriptor;
/// A unique pointer to a chunk of memory allocated inside an LSA region.
///
/// The pointer can be in disengaged state in which case it doesn't point at any buffer (nullptr state).
/// When the pointer points at some buffer, it is said to be engaged.
///
/// The pointer owns the object.
/// When the pointer is destroyed or it transitions from engaged to disengaged state, the buffer is freed.
/// The buffer is never leaked when operating by the API of lsa_buffer.
/// The pointer object can be safely destroyed in any allocator context.
///
/// The pointer object is never invalidated.
/// The pointed-to buffer can be moved around by LSA, so the pointer returned by get() can be
/// invalidated, but the pointer object itself is updated automatically and get() always returns
/// a pointer which is valid at the time of the call.
///
/// Must not outlive the region.
class lsa_buffer {
friend class region_impl;
entangled _link; // Paired with segment_descriptor::_buf_pointers[...]
segment_descriptor* _desc; // Valid only when engaged
char* _buf = nullptr; // Valid only when engaged
size_t _size = 0;
public:
using char_type = char;
lsa_buffer() = default;
lsa_buffer(lsa_buffer&&) noexcept = default;
~lsa_buffer();
/// Makes this instance point to the buffer pointed to by the other pointer.
/// If this pointer was engaged before, the owned buffer is freed.
/// The other pointer will be in disengaged state after this.
lsa_buffer& operator=(lsa_buffer&& other) noexcept {
if (this != &other) {
this->~lsa_buffer();
new (this) lsa_buffer(std::move(other));
}
return *this;
}
/// Disengages the pointer.
/// If the pointer was engaged before, the owned buffer is freed.
/// Postcondition: !bool(*this)
lsa_buffer& operator=(std::nullptr_t) noexcept {
this->~lsa_buffer();
return *this;
}
/// Returns a pointer to the first element of the buffer.
/// Valid only when engaged.
char_type* get() { return _buf; }
const char_type* get() const { return _buf; }
/// Returns the number of bytes in the buffer.
size_t size() const { return _size; }
/// Returns true iff the pointer is engaged.
explicit operator bool() const noexcept { return bool(_link); }
};
// Monoid representing pool occupancy statistics.
// Naturally ordered so that sparser pools come fist.
// All sizes in bytes.
class occupancy_stats {
size_t _free_space;
size_t _total_space;
public:
occupancy_stats() : _free_space(0), _total_space(0) {}
occupancy_stats(size_t free_space, size_t total_space)
: _free_space(free_space), _total_space(total_space) { }
bool operator<(const occupancy_stats& other) const {
return used_fraction() < other.used_fraction();
}
friend occupancy_stats operator+(const occupancy_stats& s1, const occupancy_stats& s2) {
occupancy_stats result(s1);
result += s2;
return result;
}
friend occupancy_stats operator-(const occupancy_stats& s1, const occupancy_stats& s2) {
occupancy_stats result(s1);
result -= s2;
return result;
}
occupancy_stats& operator+=(const occupancy_stats& other) {
_total_space += other._total_space;
_free_space += other._free_space;
return *this;
}
occupancy_stats& operator-=(const occupancy_stats& other) {
_total_space -= other._total_space;
_free_space -= other._free_space;
return *this;
}
size_t used_space() const {
return _total_space - _free_space;
}
size_t free_space() const {
return _free_space;
}
size_t total_space() const {
return _total_space;
}
float used_fraction() const {
return _total_space ? float(used_space()) / total_space() : 0;
}
explicit operator bool() const {
return _total_space > 0;
}
friend std::ostream& operator<<(std::ostream&, const occupancy_stats&);
};
class basic_region_impl : public allocation_strategy {
protected:
bool _reclaiming_enabled = true;
seastar::shard_id _cpu = this_shard_id();
public:
void set_reclaiming_enabled(bool enabled) {
assert(this_shard_id() == _cpu);
_reclaiming_enabled = enabled;
}
bool reclaiming_enabled() const {
return _reclaiming_enabled;
}
};
//
// Log-structured allocator region.
//
// Objects allocated using this region are said to be owned by this region.
// Objects must be freed only using the region which owns them. Ownership can
// be transferred across regions using the merge() method. Region must be live
// as long as it owns any objects.
//
// Each region has separate memory accounting and can be compacted
// independently from other regions. To reclaim memory from all regions use
// shard_tracker().
//
// Region is automatically added to the set of
// compactible regions when constructed.
//
class region {
public:
using impl = region_impl;
private:
shared_ptr<basic_region_impl> _impl;
private:
region_impl& get_impl();
const region_impl& get_impl() const;
public:
region();
explicit region(region_group& group);
~region();
region(region&& other);
region& operator=(region&& other);
region(const region& other) = delete;
occupancy_stats occupancy() const;
allocation_strategy& allocator() noexcept {
return *_impl;
}
const allocation_strategy& allocator() const noexcept {
return *_impl;
}
region_group* group();
// Allocates a buffer of a given size.
// The buffer's pointer will be aligned to 4KB.
// Note: it is wasteful to allocate buffers of sizes which are not a multiple of the alignment.
lsa_buffer alloc_buf(size_t buffer_size);
// Merges another region into this region. The other region is left empty.
// Doesn't invalidate references to allocated objects.
void merge(region& other) noexcept;
// Compacts everything. Mainly for testing.
// Invalidates references to allocated objects.
void full_compaction();
// Runs eviction function once. Mainly for testing.
memory::reclaiming_result evict_some();
// Changes the reclaimability state of this region. When region is not
// reclaimable, it won't be considered by tracker::reclaim(). By default region is
// reclaimable after construction.
void set_reclaiming_enabled(bool e) { _impl->set_reclaiming_enabled(e); }
// Returns the reclaimability state of this region.
bool reclaiming_enabled() const { return _impl->reclaiming_enabled(); }
// Returns a value which is increased when this region is either compacted or
// evicted from, which invalidates references into the region.
// When the value returned by this method doesn't change, references remain valid.
uint64_t reclaim_counter() const {
return allocator().invalidate_counter();
}
// Will cause subsequent calls to evictable_occupancy() to report empty occupancy.
void ground_evictable_occupancy();
// Follows region's occupancy in the parent region group. Less fine-grained than occupancy().
// After ground_evictable_occupancy() is called returns 0.
occupancy_stats evictable_occupancy();
// Makes this region an evictable region. Supplied function will be called
// when data from this region needs to be evicted in order to reclaim space.
// The function should free some space from this region.
void make_evictable(eviction_fn);
const eviction_fn& evictor() const;
friend class region_group;
friend class allocating_section;
};
// Forces references into the region to remain valid as long as this guard is
// live by disabling compaction and eviction.
// Can be nested.
struct reclaim_lock {
region& _region;
bool _prev;
reclaim_lock(region& r)
: _region(r)
, _prev(r.reclaiming_enabled())
{
_region.set_reclaiming_enabled(false);
}
~reclaim_lock() {
_region.set_reclaiming_enabled(_prev);
}
};
// Utility for running critical sections which need to lock some region and
// also allocate LSA memory. The object learns from failures how much it
// should reserve up front in order to not cause allocation failures.
class allocating_section {
// Do not decay below these minimal values
static constexpr size_t s_min_lsa_reserve = 1;
static constexpr size_t s_min_std_reserve = 1024;
static constexpr uint64_t s_bytes_per_decay = 10'000'000'000;
static constexpr unsigned s_segments_per_decay = 100'000;
size_t _lsa_reserve = s_min_lsa_reserve; // in segments
size_t _std_reserve = s_min_std_reserve; // in bytes
size_t _minimum_lsa_emergency_reserve = 0;
int64_t _remaining_std_bytes_until_decay = s_bytes_per_decay;
int _remaining_lsa_segments_until_decay = s_segments_per_decay;
private:
struct guard {
size_t _prev;
guard();
~guard();
};
void reserve();
void maybe_decay_reserve();
void on_alloc_failure(logalloc::region&);
public:
void set_lsa_reserve(size_t);
void set_std_reserve(size_t);
//
// Reserves standard allocator and LSA memory for subsequent operations that
// have to be performed with memory reclamation disabled.
//
// Throws std::bad_alloc when reserves can't be increased to a sufficient level.
//
template<typename Func>
decltype(auto) with_reserve(Func&& fn) {
auto prev_lsa_reserve = _lsa_reserve;
auto prev_std_reserve = _std_reserve;
try {
guard g;
_minimum_lsa_emergency_reserve = g._prev;
reserve();
return fn();
} catch (const std::bad_alloc&) {
// roll-back limits to protect against pathological requests
// preventing future requests from succeeding.
_lsa_reserve = prev_lsa_reserve;
_std_reserve = prev_std_reserve;
throw;
}
}
//
// Invokes func with reclaim_lock on region r. If LSA allocation fails
// inside func it is retried after increasing LSA segment reserve. The
// memory reserves are increased with region lock off allowing for memory
// reclamation to take place in the region.
//
// References in the region are invalidated when allocating section is re-entered
// on allocation failure.
//
// Throws std::bad_alloc when reserves can't be increased to a sufficient level.
//
template<typename Func>
decltype(auto) with_reclaiming_disabled(logalloc::region& r, Func&& fn) {
assert(r.reclaiming_enabled());
maybe_decay_reserve();
while (true) {
try {
logalloc::reclaim_lock _(r);
memory::disable_abort_on_alloc_failure_temporarily dfg;
return fn();
} catch (const std::bad_alloc&) {
on_alloc_failure(r);
}
}
}
//
// Reserves standard allocator and LSA memory and
// invokes func with reclaim_lock on region r. If LSA allocation fails
// inside func it is retried after increasing LSA segment reserve. The
// memory reserves are increased with region lock off allowing for memory
// reclamation to take place in the region.
//
// References in the region are invalidated when allocating section is re-entered
// on allocation failure.
//
// Throws std::bad_alloc when reserves can't be increased to a sufficient level.
//
template<typename Func>
decltype(auto) operator()(logalloc::region& r, Func&& func) {
return with_reserve([this, &r, &func] {
return with_reclaiming_disabled(r, func);
});
}
};
future<> prime_segment_pool(size_t available_memory, size_t min_free_memory);
uint64_t memory_allocated();
uint64_t memory_compacted();
occupancy_stats lsa_global_occupancy_stats();
}