"
The main goal of this series is to make reads from large partitions more efficient by
reducing the amount of I/O needed to read the sstable index. This is achieved by caching
index file pages and partition index entries in memory.
Currently, index file pages are cached by each read individually, and only for the duration of that read.
This was done to facilitate binary search in the promoted index (the intra-partition index).
After this series, all reads share the index file page cache, which stays around even after reads stop.
The page cache is subject to eviction. It uses the same region as the current row cache and shares
the LRU with row cache entries. This means that objects linked into the LRU need to be virtualized. This series
takes the easy approach and introduces a virtual base class. The cost is one vtable pointer of overhead
in each row cache entry.
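
As an illustration of what virtualizing the LRU means, here is a minimal sketch. All names (evictable, shared_lru, row_entry_sketch, index_page_sketch) are hypothetical and are not the classes used by the series; a real implementation would use an intrusive list rather than std::list.

```cpp
#include <cassert>
#include <cstddef>
#include <list>

// Hypothetical base class for anything linked into the shared LRU.
// The virtual function is what costs every entry a vtable pointer.
class evictable {
public:
    virtual ~evictable() = default;
    // Called by the LRU when this entry is picked as the eviction victim.
    virtual void on_evicted() noexcept = 0;
};

// One LRU shared by heterogeneous entry types. Front = least recently used.
class shared_lru {
    std::list<evictable*> _entries;
public:
    void add(evictable& e) { _entries.push_back(&e); }

    // Marks the entry as most recently used (O(n) here, for brevity).
    void touch(evictable& e) {
        _entries.remove(&e);
        _entries.push_back(&e);
    }

    // Evicts the least recently used entry, whatever its concrete type is.
    bool evict_one() noexcept {
        if (_entries.empty()) {
            return false;
        }
        evictable* victim = _entries.front();
        _entries.pop_front();
        victim->on_evicted();
        return true;
    }

    std::size_t size() const noexcept { return _entries.size(); }
};

// Stand-in for a row cache entry.
struct row_entry_sketch final : evictable {
    bool evicted = false;
    void on_evicted() noexcept override { evicted = true; }
};

// Stand-in for a cached index file page.
struct index_page_sketch final : evictable {
    bool evicted = false;
    void on_evicted() noexcept override { evicted = true; }
};
```

The LRU only sees the base class, so row cache entries and index pages compete for memory under a single eviction order.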
SSTable indexes form a hierarchy. At the top is the summary, which is a sparse partition key index into the
full partition index. The summary is already kept in memory. The partition index is divided by the summary
into pages. Each entry in the partition index contains a promoted index, which is a sparse index into atoms
identified by the clustering key (rows, tombstones).
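
The three levels can be sketched as follows. This is a simplified model under loud assumptions: keys are plain integers, every type and function name is hypothetical, and nothing here mirrors the on-disk format.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <optional>
#include <vector>

// Hypothetical promoted index block: sparse pointer into a partition's atoms.
struct promoted_index_block {
    int first_clustering_key;
    std::uint64_t data_file_offset;
};

// Hypothetical partition index entry; promoted_index is sorted by key.
struct partition_index_entry {
    int partition_key;
    std::vector<promoted_index_block> promoted_index;
};

// A partition index page: entries sorted by partition key.
using partition_index_page = std::vector<partition_index_entry>;

// Hypothetical summary entry: first key of a page and the page's number.
struct summary_entry {
    int first_partition_key;
    std::size_t page;
};

// Level 1: the in-memory summary selects the partition index page.
inline std::optional<std::size_t> find_page(const std::vector<summary_entry>& summary, int pk) {
    auto it = std::upper_bound(summary.begin(), summary.end(), pk,
        [] (int key, const summary_entry& e) { return key < e.first_partition_key; });
    if (it == summary.begin()) {
        return std::nullopt; // key sorts before the first page
    }
    return std::prev(it)->page;
}

// Level 2: locate the partition's entry inside the page.
inline const partition_index_entry* find_entry(const partition_index_page& page, int pk) {
    auto it = std::lower_bound(page.begin(), page.end(), pk,
        [] (const partition_index_entry& e, int key) { return e.partition_key < key; });
    return (it != page.end() && it->partition_key == pk) ? &*it : nullptr;
}

// Level 3: binary search in the promoted index for the block covering ck.
inline std::optional<std::uint64_t> find_block_offset(const partition_index_entry& e, int ck) {
    auto it = std::upper_bound(e.promoted_index.begin(), e.promoted_index.end(), ck,
        [] (int key, const promoted_index_block& b) { return key < b.first_clustering_key; });
    if (it == e.promoted_index.begin()) {
        return std::nullopt; // key sorts before the first block
    }
    return std::prev(it)->data_file_offset;
}
```

Only level 1 is always in memory; levels 2 and 3 are the parts whose I/O and parsing this series caches.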
In order to read the promoted index, the reader needs to read the partition index entry first.
To speed this up, this series also adds caching of partition index entries. This cache survives
reads and is subject to eviction, just like the index file page cache. The unit of caching is
the partition index page. Without this cache, each access to the promoted index would have to be
preceded by parsing the partition index page that contains the partition key.
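
A minimal sketch of the idea, assuming a hypothetical page_cache_sketch keyed by page number and a caller-supplied parse function; eviction is reduced to an explicit evict() call, whereas the series links pages into the shared LRU.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical parsed form of one partition index page.
struct parsed_page {
    std::vector<int> partition_keys;
};

// Caches parsed pages so that repeated lookups skip the parsing step.
class page_cache_sketch {
    using loader = std::function<parsed_page(std::size_t)>;
    loader _parse; // reads and parses a page; the expensive part
    std::unordered_map<std::size_t, std::shared_ptr<const parsed_page>> _pages;
    std::size_t _hits = 0;
    std::size_t _misses = 0;
public:
    explicit page_cache_sketch(loader parse) : _parse(std::move(parse)) {}

    // Returns the cached page, parsing it only on a miss.
    std::shared_ptr<const parsed_page> get(std::size_t page_idx) {
        auto it = _pages.find(page_idx);
        if (it != _pages.end()) {
            ++_hits;
            return it->second;
        }
        ++_misses;
        auto page = std::make_shared<const parsed_page>(_parse(page_idx));
        _pages.emplace(page_idx, page);
        return page;
    }

    // Drops a page; readers still holding the shared_ptr keep it alive.
    void evict(std::size_t page_idx) { _pages.erase(page_idx); }

    std::size_t hits() const { return _hits; }
    std::size_t misses() const { return _misses; }
};
```

Because the cache outlives individual reads, a second read of the same page costs a hash lookup instead of I/O plus parsing.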
Performance testing results follow.
1) scylla-bench large partition reads
Populated with:
perf_fast_forward --run-tests=large-partition-skips --datasets=sb-large-part-ds1 \
-c1 -m1G --populate --value-size=1024 --rows=10000000
Single partition, 9G data file, 4MB index file
Test execution:
build/release/scylla -c1 -m4G
scylla-bench -workload uniform -mode read -limit 1 -concurrency 100 -partition-count 1 \
-clustering-row-count 10000000 -duration 60m
TL;DR: after: about 2.2x throughput, about 0.44x median latency
Before (c1daf2bb24):
Results
Time (avg): 5m21.033180213s
Total ops: 966951
Total rows: 966951
Operations/s: 3011.997048812112
Rows/s: 3011.997048812112
Latency:
max: 74.055679ms
99.9th: 63.569919ms
99th: 41.320447ms
95th: 38.076415ms
90th: 37.158911ms
median: 34.537471ms
mean: 33.195994ms
After:
Results
Time (avg): 5m14.706669345s
Total ops: 2042831
Total rows: 2042831
Operations/s: 6491.22243800942
Rows/s: 6491.22243800942
Latency:
max: 60.096511ms
99.9th: 35.520511ms
99th: 27.000831ms
95th: 23.986175ms
90th: 21.659647ms
median: 15.040511ms
mean: 15.402076ms
2) scylla-bench small partitions
I tested several scenarios with varying data set sizes: data fully fitting in memory,
half fitting, and much larger than memory. The improvement varied a bit, but in all cases the "after"
code performed slightly better.
Below is a representative run over a data set which does not fit in memory.
scylla -c1 -m4G
scylla-bench -workload uniform -mode read -concurrency 400 -partition-count 10000000 \
-clustering-row-count 1 -duration 60m -no-lower-bound
Before:
Time (avg): 51.072411913s
Total ops: 3165885
Total rows: 3165885
Operations/s: 61988.164024260645
Rows/s: 61988.164024260645
Latency:
max: 34.045951ms
99.9th: 25.985023ms
99th: 23.298047ms
95th: 19.070975ms
90th: 17.530879ms
median: 3.899391ms
mean: 6.450616ms
After:
Time (avg): 50.232410679s
Total ops: 3778863
Total rows: 3778863
Operations/s: 75227.58014424688
Rows/s: 75227.58014424688
Latency:
max: 37.027839ms
99.9th: 24.805375ms
99th: 18.219007ms
95th: 14.090239ms
90th: 12.124159ms
median: 4.030463ms
mean: 5.315111ms
The results include the warmup phase, which populates the partition index cache, so the hot-cache effect
is dampened in the statistics. See the 99th percentile (23.3ms before, 18.2ms after): latency improves once
the cache warms up, which moves that percentile lower.
3) perf_fast_forward --run-tests=large-partition-skips
Caching is not used here; the test is included to show that there are no regressions in the cold-cache case.
TL;DR: No significant change
perf_fast_forward --run-tests=large-partition-skips --datasets=large-part-ds1 -c1 -m1G
Config: rows: 10000000, value size: 2000
Before:
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
1 0 36.429822 4 10000000 274500 62 274521 274429 153889.2 153883 19696986 153853 0 0 0 0 0 0 0 22.5%
1 1 36.856236 4 5000000 135662 7 135670 135650 155652.0 155652 19704117 139326 1 0 1 1 0 0 0 38.1%
1 8 36.347667 4 1111112 30569 0 30570 30569 155652.0 155652 19704117 139071 1 0 1 1 0 0 0 19.5%
1 16 36.278866 4 588236 16214 1 16215 16213 155652.0 155652 19704117 139073 1 0 1 1 0 0 0 16.6%
1 32 36.174784 4 303031 8377 0 8377 8376 155652.0 155652 19704117 139056 1 0 1 1 0 0 0 12.3%
1 64 36.147104 4 153847 4256 0 4256 4256 155652.0 155652 19704117 139109 1 0 1 1 0 0 0 11.1%
1 256 9.895288 4 38911 3932 1 3933 3930 100869.2 100868 3178298 59944 38912 0 1 1 0 0 0 14.3%
1 1024 2.599921 4 9757 3753 0 3753 3753 26604.0 26604 801850 15071 9758 0 1 1 0 0 0 14.6%
1 4096 0.784568 4 2441 3111 1 3111 3109 7982.0 7982 205946 3772 2442 0 1 1 0 0 0 13.8%
64 1 36.553975 4 9846154 269359 10 269369 269337 155663.8 155652 19704117 139230 1 0 1 1 0 0 0 28.2%
64 8 36.509694 4 8888896 243467 8 243475 243449 155652.0 155652 19704117 139120 1 0 1 1 0 0 0 26.5%
64 16 36.466282 4 8000000 219381 4 219385 219374 155652.0 155652 19704117 139232 1 0 1 1 0 0 0 24.8%
64 32 36.395926 4 6666688 183171 6 183180 183165 155652.0 155652 19704117 139158 1 0 1 1 0 0 0 21.8%
64 64 36.296856 4 5000000 137753 4 137757 137737 155652.0 155652 19704117 139105 1 0 1 1 0 0 0 17.7%
64 256 20.590392 4 2000000 97133 18 97151 94996 135248.8 131395 7877402 98335 31282 0 1 1 0 0 0 15.7%
64 1024 6.225773 4 588288 94492 1436 95434 88748 46066.5 41321 2324378 30360 9193 0 1 1 0 0 0 15.8%
64 4096 1.856069 4 153856 82893 54 82948 82721 16115.0 16043 583674 11574 2675 0 1 1 0 0 0 16.3%
After:
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
1 0 36.429240 4 10000000 274505 38 274515 274417 153887.8 153883 19696986 153849 0 0 0 0 0 0 0 22.4%
1 1 36.933806 4 5000000 135377 15 135385 135354 155658.0 155658 19704085 139398 1 0 1 1 0 0 0 40.0%
1 8 36.419187 4 1111112 30509 2 30510 30507 155658.0 155658 19704085 139233 1 0 1 1 0 0 0 22.0%
1 16 36.353475 4 588236 16181 0 16182 16181 155658.0 155658 19704085 139183 1 0 1 1 0 0 0 19.2%
1 32 36.251356 4 303031 8359 0 8359 8359 155658.0 155658 19704085 139120 1 0 1 1 0 0 0 14.8%
1 64 36.203692 4 153847 4249 0 4250 4249 155658.0 155658 19704085 139071 1 0 1 1 0 0 0 13.0%
1 256 9.965876 4 38911 3904 0 3906 3904 100875.2 100874 3178266 60108 38912 0 1 1 0 0 0 17.9%
1 1024 2.637501 4 9757 3699 1 3700 3697 26610.0 26610 801818 15071 9758 0 1 1 0 0 0 19.5%
1 4096 0.806745 4 2441 3026 1 3027 3024 7988.0 7988 205914 3773 2442 0 1 1 0 0 0 18.3%
64 1 36.611243 4 9846154 268938 5 268942 268921 155669.8 155705 19704085 139330 2 0 1 1 0 0 0 29.9%
64 8 36.559471 4 8888896 243135 11 243156 243124 155658.0 155658 19704085 139261 1 0 1 1 0 0 0 28.1%
64 16 36.510319 4 8000000 219116 15 219126 219101 155658.0 155658 19704085 139173 1 0 1 1 0 0 0 26.3%
64 32 36.439069 4 6666688 182954 9 182964 182943 155658.0 155658 19704085 139274 1 0 1 1 0 0 0 23.2%
64 64 36.334808 4 5000000 137609 11 137612 137596 155658.0 155658 19704085 139258 2 0 1 1 0 0 0 19.1%
64 256 20.624759 4 2000000 96971 88 97059 92717 138296.0 131401 7877370 98332 31282 0 1 1 0 0 0 17.2%
64 1024 6.260598 4 588288 93967 1429 94905 88051 45939.5 41327 2324346 30361 9193 0 1 1 0 0 0 17.8%
64 4096 1.881338 4 153856 81780 140 81920 81520 16109.8 16092 582714 11617 2678 0 1 1 0 0 0 18.2%
4) perf_fast_forward --run-tests=large-partition-slicing
Caching is enabled; each line shows the median run from many iterations.
TL;DR: We can observe a reduction in I/O, which translates to a reduction in execution time,
especially for slicing in the middle of the partition.
perf_fast_forward --run-tests=large-partition-slicing --datasets=large-part-ds1 -c1 -m1G --keep-cache-across-test-cases
Config: rows: 10000000, value size: 2000
Before:
offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
0 1 0.000491 127 1 2037 24 2109 127 4.0 4 128 2 2 0 1 1 0 0 0 157 80 3058208 15.0%
0 32 0.000561 1740 32 56995 410 60031 47208 5.0 5 160 3 2 0 1 1 0 0 0 386 111 113353 17.5%
0 256 0.002052 488 256 124736 7111 144762 89053 16.6 17 672 14 2 0 1 1 0 0 0 2113 446 52669 18.6%
0 4096 0.016437 61 4096 249199 692 252389 244995 69.4 69 8640 57 5 0 1 1 0 0 0 26638 1717 23321 22.4%
5000000 1 0.002171 221 1 461 2 466 221 25.0 25 268 3 3 0 1 1 0 0 0 638 376 14311524 10.2%
5000000 32 0.002392 404 32 13376 48 13528 13015 27.0 27 332 5 3 0 1 1 0 0 0 931 432 489691 11.9%
5000000 256 0.003659 279 256 69967 764 73130 52563 39.5 41 780 19 3 0 1 1 0 0 0 2689 825 93756 15.8%
5000000 4096 0.018592 55 4096 220313 433 234214 218803 94.2 94 9484 62 9 0 1 1 0 0 0 27349 2213 26562 21.0%
After:
offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
0 1 0.000229 115 1 4371 85 4585 115 2.1 2 64 1 1 1 0 0 0 0 0 90 31 1314749 22.2%
0 32 0.000277 2174 32 115674 1015 128109 14144 3.0 3 96 2 1 1 0 0 0 0 0 319 62 52508 26.1%
0 256 0.001786 576 256 143298 5534 179142 113715 14.7 17 544 15 1 1 0 0 0 0 0 2110 453 45419 21.4%
0 4096 0.015498 61 4096 264289 2006 268850 259342 67.4 67 8576 59 4 1 0 0 0 0 0 26657 1738 22897 23.7%
5000000 1 0.000415 233 1 2411 15 2456 234 4.1 4 128 2 2 1 0 0 0 0 0 199 72 2644719 16.8%
5000000 32 0.000635 1413 32 50398 349 51149 46439 6.0 6 192 4 2 1 0 0 0 0 0 458 128 125893 18.6%
5000000 256 0.002028 486 256 126228 3024 146327 82559 17.8 18 1024 13 4 1 0 0 0 0 0 2123 385 51787 19.6%
5000000 4096 0.016836 61 4096 243294 814 263434 241660 73.0 73 9344 62 8 1 0 0 0 0 0 26922 1920 24389 22.4%
Future work:
- Check the impact on non-uniform workloads. Caching sstable indexes takes space away from the row cache,
which may reduce the hit ratio.
- Reduce the memory footprint of the partition index cache. Currently, it is bloated about 8x relative to the on-disk size.
- Disable cache population for "bypass cache" reads.
- Add a switch to disable sstable index caching, per-node, maybe per-table.
- Better sstable index format. The current format leads to inefficient caching, since only some elements of a cached
page may be hot. A B-tree index would be more efficient. The same applies to the partition index: only some elements in
a partition index page may be hot.
- Add a heuristic for reducing the index file I/O size when large partitions are anticipated. If we're bound by the disk's
bandwidth, it's wasteful to read the front of the promoted index using 32K I/O; it is better to use 4K, which should cover the
partition entry, and then let binary search read the rest.
In V2:
- Fixed perf_fast_forward regression in the number of IOs used to read a partition index page.
The reader uses 32K reads, which were split by the page cache into 4K reads.
Fixed by propagating IO size hints to the page cache and using a single IO to populate it.
New patch: "cached_file: Issue single I/O for the whole read range on miss"
- Avoided large allocations to store partition index page entries (due to managed_vector storage).
There is a unit test which detects this and fails.
Fixed by implementing chunked_managed_vector, based on chunked_vector.
- Fixed a bug in cached_file::evict_gently() where the wrong allocation strategy was used to free btree chunks
- Simplified region_impl::free_buf() according to Avi's suggestions
- Fit segment_kind in segment_descriptor::_free_space and lifted the requirement that _buf_pointers emptiness determines the kind
- Worked around a SIGSEGV which was most likely due to coroutine miscompilation, by manipulating local object scope.
- Wired up system/drop_sstable_caches RESTful API
- Fixed use-after-move on permit for the old scanning ka/la index reader
- Fixed more cases of double open_data() in tests leading to assert failure
- Adjusted cached_file class doc to account for changes in behavior.
- Rebased
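
The single-I/O population from the first V2 item can be sketched as below. This is an illustration only: cached_file_sketch, its in-memory "file", and the size_hint parameter are hypothetical stand-ins and do not reproduce the cached_file interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

constexpr std::size_t page_size = 4096; // cache granularity (4K pages)

class cached_file_sketch {
    std::string _file;                           // stands in for the index file
    std::map<std::uint64_t, std::string> _pages; // page number -> contents
    std::size_t _io_count = 0;                   // number of "disk" reads issued
public:
    explicit cached_file_sketch(std::string file) : _file(std::move(file)) {}

    // Returns page idx, which must lie within the file. On a miss, one read
    // covers the whole hinted range and populates every page it overlaps,
    // instead of issuing one 4K read per page.
    const std::string& get_page(std::uint64_t idx, std::size_t size_hint = page_size) {
        auto it = _pages.find(idx);
        if (it != _pages.end()) {
            return it->second;
        }
        const std::size_t start = idx * page_size;
        assert(start < _file.size());
        const std::size_t len = std::min(std::max(size_hint, page_size), _file.size() - start);
        ++_io_count; // a single I/O for the whole range
        const std::string buf = _file.substr(start, len);
        for (std::size_t off = 0; off < buf.size(); off += page_size) {
            // emplace() keeps pages that were already cached.
            _pages.emplace(idx + off / page_size, buf.substr(off, page_size));
        }
        return _pages.at(idx);
    }

    std::size_t io_count() const { return _io_count; }
};
```

With a 32K hint, the first miss fills eight 4K pages with one read, so later accesses to those pages are cache hits.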
Fixes #7079.
Refs #363.
"
* tag 'sstable-index-caching-v2' of github.com:tgrabiec/scylla: (39 commits)
api: Drop sstable index caches on system/drop_sstable_caches
cached_file: Issue single I/O for the whole read range on miss
row_cache: cache_tracker: Do not register metrics when constructed for tests
sstables, cached_file: Evict cache gently when sstable is destroyed
sstables: Hide partition_index_cache implementation away from sstables.hh
sstables: Drop shared_index_lists alias
sstables: Destroy partition index cache gently
sstables: Cache partition index pages in LSA and link to LRU
utils: Introduce lsa::weak_ptr<>
sstables: Rename index_list to partition_index_page and shared_index_lists to partition_index_cache
sstables, cached_file: Avoid copying buffers from cache when parsing promoted index
cached_file: Introduce get_page_units()
sstables: read: Document that primitive_consumer::read_32() is alloc-free
sstables: read: Count partition index page evictions
sstables: Drop the _use_binary_search flag from index entries
sstables: index_reader: Keep index objects under LSA
lsa: chunked_managed_vector: Adapt more to managed_vector
utils: lsa: chunked_managed_vector: Make LSA-aware
test: chunked_managed_vector_test: Make exception_safe_class standard layout
lsa: Copy chunked_vector to chunked_managed_vector
...
845 lines
30 KiB
C++
/*
 * Copyright (C) 2015-present ScyllaDB
 */

/*
 * This file is part of Scylla.
 *
 * Scylla is free software: you can redistribute it and/or modify
 * it under the terms of the GNU Affero General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * Scylla is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
 */

#pragma once

#include <cassert>
#include <functional>
#include <limits>
#include <memory>
#include <seastar/core/memory.hh>
#include <seastar/core/condition-variable.hh>
#include <seastar/core/smp.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/shared_future.hh>
#include <seastar/core/expiring_fifo.hh>
#include "allocation_strategy.hh"
#include <boost/heap/binomial_heap.hpp>
#include "seastarx.hh"
#include "db/timeout_clock.hh"
#include "utils/entangled.hh"

namespace logalloc {

struct occupancy_stats;
class region;
class region_impl;
class allocating_section;

constexpr int segment_size_shift = 17; // 128K; see #151, #152
constexpr size_t segment_size = 1 << segment_size_shift;
constexpr size_t max_zone_segments = 256;

//
// Frees some amount of objects from the region to which it's attached.
//
// This should eventually stop given no new objects are added:
//
//     while (eviction_fn() == memory::reclaiming_result::reclaimed_something) ;
//
using eviction_fn = std::function<memory::reclaiming_result()>;

//
// Users of a region_group can pass an instance of the class region_group_reclaimer, and specialize
// its methods start_reclaiming() and stop_reclaiming(). Those methods will be called when the LSA
// sees relevant changes in the memory pressure conditions for this region_group. By specializing
// those methods - which are a nop by default - the callers can take action to aid the LSA in
// alleviating pressure.
class region_group_reclaimer {
protected:
    size_t _threshold;
    size_t _soft_limit;
    bool _under_pressure = false;
    bool _under_soft_pressure = false;
    // The following restrictions apply to implementations of start_reclaiming() and stop_reclaiming():
    //
    //  - must not use any region or region_group objects, because they're invoked synchronously
    //    with operations on those.
    //
    //  - must be noexcept, because they're called on the free path.
    //
    //  - the implementation may be called synchronously with any operation
    //    which allocates memory, because these are called by memory reclaimer.
    //    In particular, the implementation should not depend on memory allocation
    //    because that may fail when in reclaiming context.
    //
    virtual void start_reclaiming() noexcept {}
    virtual void stop_reclaiming() noexcept {}
public:
    bool under_pressure() const {
        return _under_pressure;
    }

    bool over_soft_limit() const {
        return _under_soft_pressure;
    }

    void notify_soft_pressure() noexcept {
        if (!_under_soft_pressure) {
            _under_soft_pressure = true;
            start_reclaiming();
        }
    }

    void notify_soft_relief() noexcept {
        if (_under_soft_pressure) {
            _under_soft_pressure = false;
            stop_reclaiming();
        }
    }

    void notify_pressure() noexcept {
        _under_pressure = true;
    }

    void notify_relief() noexcept {
        _under_pressure = false;
    }

    region_group_reclaimer()
        : _threshold(std::numeric_limits<size_t>::max()), _soft_limit(std::numeric_limits<size_t>::max()) {}
    region_group_reclaimer(size_t threshold)
        : _threshold(threshold), _soft_limit(threshold) {}
    region_group_reclaimer(size_t threshold, size_t soft)
        : _threshold(threshold), _soft_limit(soft) {
        assert(_soft_limit <= _threshold);
    }

    virtual ~region_group_reclaimer() {}

    size_t throttle_threshold() const {
        return _threshold;
    }
    size_t soft_limit_threshold() const {
        return _soft_limit;
    }
};

// Groups regions for the purpose of statistics. Can be nested.
class region_group {
    static region_group_reclaimer no_reclaimer;

    struct region_evictable_occupancy_ascending_less_comparator {
        bool operator()(region_impl* r1, region_impl* r2) const;
    };

    // We want to sort the subgroups so that we can easily find the one that holds the biggest
    // region for freeing purposes. Please note that this is not the biggest of the region groups,
    // since a big region group can have a big collection of very small regions, and freeing them
    // won't achieve anything. An example of such scenario is a ScyllaDB region with a lot of very
    // small memtables that add up, versus one with a very big memtable. The small memtables are
    // likely still growing, and freeing the big memtable will guarantee that the most memory is
    // freed up, while maximizing disk throughput.
    //
    // As asynchronous reclaim will likely involve disk operation, and those tend to be more
    // efficient when bulk done, this behavior is not ScyllaDB memtable specific.
    //
    // The maximal score is recursively defined as:
    //
    //      max(our_biggest_region, our_subtree_biggest_region)
    struct subgroup_maximal_region_ascending_less_comparator {
        bool operator()(region_group* rg1, region_group* rg2) const {
            return rg1->maximal_score() < rg2->maximal_score();
        }
    };
    friend struct subgroup_maximal_region_ascending_less_comparator;

    using region_heap = boost::heap::binomial_heap<region_impl*,
          boost::heap::compare<region_evictable_occupancy_ascending_less_comparator>,
          boost::heap::allocator<std::allocator<region_impl*>>,
          //constant_time_size<true> causes corruption with boost < 1.60
          boost::heap::constant_time_size<false>>;

    using subgroup_heap = boost::heap::binomial_heap<region_group*,
          boost::heap::compare<subgroup_maximal_region_ascending_less_comparator>,
          boost::heap::allocator<std::allocator<region_group*>>,
          //constant_time_size<true> causes corruption with boost < 1.60
          boost::heap::constant_time_size<false>>;

    region_group* _parent = nullptr;
    size_t _total_memory = 0;
    region_group_reclaimer& _reclaimer;

    subgroup_heap _subgroups;
    subgroup_heap::handle_type _subgroup_heap_handle;
    region_heap _regions;
    region_group* _maximal_rg = nullptr;
    // We need to store the score separately, otherwise we'd have to have an extra pass
    // before we update the region occupancy.
    size_t _maximal_score = 0;

    struct allocating_function {
        virtual ~allocating_function() = default;
        virtual void allocate() = 0;
        virtual void fail(std::exception_ptr) = 0;
    };

    template <typename Func>
    struct concrete_allocating_function : public allocating_function {
        using futurator = futurize<std::result_of_t<Func()>>;
        typename futurator::promise_type pr;
        Func func;
    public:
        void allocate() override {
            futurator::invoke(func).forward_to(std::move(pr));
        }
        void fail(std::exception_ptr e) override {
            pr.set_exception(e);
        }
        concrete_allocating_function(Func&& func) : func(std::forward<Func>(func)) {}
        typename futurator::type get_future() {
            return pr.get_future();
        }
    };

    class on_request_expiry {
        class blocked_requests_timed_out_error : public timed_out_error {
            const sstring _msg;
        public:
            explicit blocked_requests_timed_out_error(sstring name)
                : _msg(std::move(name) + ": timed out") {}
            virtual const char* what() const noexcept override {
                return _msg.c_str();
            }
        };

        sstring _name;
    public:
        explicit on_request_expiry(sstring name) : _name(std::move(name)) {}
        void operator()(std::unique_ptr<allocating_function>&) noexcept;
    };

    // It is a more common idiom to just hold the promises in the circular buffer and make them
    // ready. However, in the time between the promise being made ready and the function execution,
    // it could be that our memory usage went up again. To protect against that, we have to recheck
    // if memory is still available after the future resolves.
    //
    // But we can greatly simplify it if we store the function itself in the circular_buffer, and
    // execute it synchronously in release_requests() when we are sure memory is available.
    //
    // This allows us to easily provide strong execution guarantees while keeping all re-check
    // complication in release_requests and keep the main request execution path simpler.
    expiring_fifo<std::unique_ptr<allocating_function>, on_request_expiry, db::timeout_clock> _blocked_requests;

    uint64_t _blocked_requests_counter = 0;

    condition_variable _relief;
    future<> _releaser;
    bool _shutdown_requested = false;

    bool reclaimer_can_block() const;
    future<> start_releaser(scheduling_group deferered_work_sg);
    void notify_relief();
    friend void region_group_binomial_group_sanity_check(const region_group::region_heap& bh);
public:
    // When creating a region_group, one can specify an optional throttle_threshold parameter. This
    // parameter won't affect normal allocations, but an API is provided, through the region_group's
    // method run_when_memory_available(), to make sure that a given function is only executed when
    // the total memory for the region group (and all of its parents) is lower or equal to the
    // region_group's throttle_threshold (and respectively for its parents).
    //
    // The deferred_work_sg parameter specifies a scheduling group in which to run allocations
    // (given to run_when_memory_available()) when they must be deferred due to lack of memory
    // at the time the call to run_when_memory_available() was made.
    region_group(sstring name = "(unnamed region_group)",
            region_group_reclaimer& reclaimer = no_reclaimer,
            scheduling_group deferred_work_sg = default_scheduling_group())
        : region_group(name, nullptr, reclaimer, deferred_work_sg) {}
    region_group(sstring name, region_group* parent, region_group_reclaimer& reclaimer = no_reclaimer,
            scheduling_group deferred_work_sg = default_scheduling_group());
    region_group(region_group&& o) = delete;
    region_group(const region_group&) = delete;
    ~region_group() {
        // If we set a throttle threshold, we'd be postponing many operations. So shutdown must be
        // called.
        if (reclaimer_can_block()) {
            assert(_shutdown_requested);
        }
        if (_parent) {
            _parent->del(this);
        }
    }
    region_group& operator=(const region_group&) = delete;
    region_group& operator=(region_group&&) = delete;
    size_t memory_used() const {
        return _total_memory;
    }
    void update(ssize_t delta);

    // It would be easier to call update, but it is unfortunately broken in boost versions up to at
    // least 1.59.
    //
    // One possibility would be to just test for the sign of delta, but we adopt an explicit call for
    // two reasons:
    //
    // 1) it saves us a branch
    // 2) some callers would like to pass delta = 0. For instance, when we are making a region
    //    evictable / non-evictable. Because the evictable occupancy changes, we would like to call
    //    the full update cycle even then.
    void increase_usage(region_heap::handle_type& r_handle, ssize_t delta) {
        _regions.increase(r_handle);
        update(delta);
    }

    void decrease_evictable_usage(region_heap::handle_type& r_handle) {
        _regions.decrease(r_handle);
    }

    void decrease_usage(region_heap::handle_type& r_handle, ssize_t delta) {
        decrease_evictable_usage(r_handle);
        update(delta);
    }

    //
    // Make sure that the function specified by the parameter func only runs when this region_group,
    // as well as each of its ancestors have a memory_used() amount of memory that is lesser or
    // equal the throttle_threshold, as specified in the region_group's constructor.
    //
    // region_groups that did not specify a throttle_threshold will always allow for execution.
    //
    // In case current memory_used() is over the threshold, a non-ready future is returned and it
    // will be made ready at some point in the future, at which memory usage in the offending
    // region_group (either this or an ancestor) falls below the threshold.
    //
    // Requests that are not allowed for execution are queued and released in FIFO order within the
    // same region_group, but no guarantees are made regarding release ordering across different
    // region_groups.
    //
    // When timeout is reached first, the returned future is resolved with timed_out_error exception.
    template <typename Func>
    futurize_t<std::result_of_t<Func()>> run_when_memory_available(Func&& func, db::timeout_clock::time_point timeout) {
        // We disallow future-returning functions here, because otherwise memory may be available
        // when we start executing it, but no longer available in the middle of the execution.
        static_assert(!is_future<std::result_of_t<Func()>>::value, "future-returning functions are not permitted.");

        auto blocked_at = do_for_each_parent(this, [] (auto rg) {
            return (rg->_blocked_requests.empty() && !rg->under_pressure()) ? stop_iteration::no : stop_iteration::yes;
        });

        if (!blocked_at) {
            return futurize_invoke(func);
        }

        auto fn = std::make_unique<concrete_allocating_function<Func>>(std::forward<Func>(func));
        auto fut = fn->get_future();
        _blocked_requests.push_back(std::move(fn), timeout);
        ++_blocked_requests_counter;

        return fut;
    }

    // returns a pointer to the largest region (in terms of memory usage) that sits below this
    // region group. This includes the regions owned by this region group as well as all of its
    // children.
    region* get_largest_region();

    // Shutdown is mandatory for every user who has set a threshold
    // Can be called at most once.
    future<> shutdown() {
        _shutdown_requested = true;
        _relief.signal();
        return std::move(_releaser);
    }

    size_t blocked_requests() {
        return _blocked_requests.size();
    }

    uint64_t blocked_requests_counter() const {
        return _blocked_requests_counter;
    }
private:
    // Returns true if and only if constraints of this group are not violated.
    // That's taking into account any constraints imposed by enclosing (parent) groups.
    bool execution_permitted() noexcept;

    // Executes the function func for each region_group upwards in the hierarchy, starting with the
    // parameter node. The function func may return stop_iteration::no, in which case it proceeds to
    // the next ancestor in the hierarchy, or stop_iteration::yes, in which case it stops at this
    // level.
    //
    // This method returns a pointer to the region_group that was processed last, or nullptr if the
    // root was reached.
    template <typename Func>
    static region_group* do_for_each_parent(region_group *node, Func&& func) {
        auto rg = node;
        while (rg) {
            if (func(rg) == stop_iteration::yes) {
                return rg;
            }
            rg = rg->_parent;
        }
        return nullptr;
    }

    inline bool under_pressure() const {
        return _reclaimer.under_pressure();
    }

    uint64_t top_region_evictable_space() const;

    uint64_t maximal_score() const {
        return _maximal_score;
    }

    void update_maximal_rg() {
        auto my_score = top_region_evictable_space();
        auto children_score = _subgroups.empty() ? 0 : _subgroups.top()->maximal_score();
        auto old_maximal_score = _maximal_score;
        if (children_score > my_score) {
            _maximal_rg = _subgroups.top()->_maximal_rg;
        } else {
            _maximal_rg = this;
        }

        _maximal_score = _maximal_rg->top_region_evictable_space();
        if (_parent) {
            // binomial heap update boost bug.
            if (_maximal_score > old_maximal_score) {
                _parent->_subgroups.increase(_subgroup_heap_handle);
            } else if (_maximal_score < old_maximal_score) {
                _parent->_subgroups.decrease(_subgroup_heap_handle);
            }
        }
    }

    void add(region_group* child);
    void del(region_group* child);
    void add(region_impl* child);
    void del(region_impl* child);
    friend class region_impl;
};

// Controller for all LSA regions. There's one per shard.
class tracker {
public:
    class impl;

    struct config {
        bool defragment_on_idle;
        bool abort_on_lsa_bad_alloc;
        bool sanitizer_report_backtrace = false; // Better reports but slower
        size_t lsa_reclamation_step;
        scheduling_group background_reclaim_sched_group;
    };

    void configure(const config& cfg);
    future<> stop();

private:
    std::unique_ptr<impl> _impl;
    memory::reclaimer _reclaimer;
    friend class region;
    friend class region_impl;
    memory::reclaiming_result reclaim(seastar::memory::reclaimer::request);

public:
    tracker();
    ~tracker();

    //
    // Tries to reclaim given amount of bytes in total using all compactible
    // and evictable regions. Returns the number of bytes actually reclaimed.
    // That value may be smaller than requested when evictable pools are empty
    // and compactible pools can't compact any more.
    //
    // Invalidates references to objects in all compactible and evictable regions.
    //
    size_t reclaim(size_t bytes);

    // Compacts as much as possible. Very expensive, mainly for testing.
    // Guarantees that every live object from reclaimable regions will be moved.
    // Invalidates references to objects in all compactible and evictable regions.
    void full_compaction();

    void reclaim_all_free_segments();

    // Returns aggregate statistics for all pools.
    occupancy_stats region_occupancy();

    // Returns statistics for all segments allocated by LSA on this shard.
    occupancy_stats occupancy();

    // Returns amount of allocated memory not managed by LSA
    size_t non_lsa_used_space() const;

    impl& get_impl() { return *_impl; }

    // Returns the minimum number of segments reclaimed during single reclamation cycle.
    size_t reclamation_step() const;

    bool should_abort_on_bad_alloc();
};

tracker& shard_tracker();

class segment_descriptor;

/// A unique pointer to a chunk of memory allocated inside an LSA region.
///
/// The pointer can be in disengaged state in which case it doesn't point at any buffer (nullptr state).
/// When the pointer points at some buffer, it is said to be engaged.
///
/// The pointer owns the object.
/// When the pointer is destroyed or it transitions from engaged to disengaged state, the buffer is freed.
/// The buffer is never leaked when operating by the API of lsa_buffer.
/// The pointer object can be safely destroyed in any allocator context.
///
/// The pointer object is never invalidated.
/// The pointed-to buffer can be moved around by LSA, so the pointer returned by get() can be
/// invalidated, but the pointer object itself is updated automatically and get() always returns
/// a pointer which is valid at the time of the call.
///
/// Must not outlive the region.
class lsa_buffer {
    friend class region_impl;
    entangled _link; // Paired with segment_descriptor::_buf_pointers[...]
    segment_descriptor* _desc; // Valid only when engaged
    char* _buf = nullptr; // Valid only when engaged
    size_t _size = 0;
public:
    using char_type = char;

    lsa_buffer() = default;
    lsa_buffer(lsa_buffer&&) noexcept = default;
    ~lsa_buffer();

    /// Makes this instance point to the buffer pointed to by the other pointer.
    /// If this pointer was engaged before, the owned buffer is freed.
    /// The other pointer will be in disengaged state after this.
    lsa_buffer& operator=(lsa_buffer&& other) noexcept {
        if (this != &other) {
            this->~lsa_buffer();
            new (this) lsa_buffer(std::move(other));
        }
        return *this;
    }

    /// Disengages the pointer.
    /// If the pointer was engaged before, the owned buffer is freed.
    /// Postcondition: !bool(*this)
    lsa_buffer& operator=(std::nullptr_t) noexcept {
        this->~lsa_buffer();
        return *this;
    }

    /// Returns a pointer to the first element of the buffer.
    /// Valid only when engaged.
    char_type* get() { return _buf; }
    const char_type* get() const { return _buf; }

    /// Returns the number of bytes in the buffer.
    size_t size() const { return _size; }

    /// Returns true iff the pointer is engaged.
    explicit operator bool() const noexcept { return bool(_link); }
};

// Monoid representing pool occupancy statistics.
// Naturally ordered so that sparser pools come first.
// All sizes in bytes.
class occupancy_stats {
    size_t _free_space;
    size_t _total_space;
public:
    occupancy_stats() : _free_space(0), _total_space(0) {}

    occupancy_stats(size_t free_space, size_t total_space)
        : _free_space(free_space), _total_space(total_space) { }

    bool operator<(const occupancy_stats& other) const {
        return used_fraction() < other.used_fraction();
    }

    friend occupancy_stats operator+(const occupancy_stats& s1, const occupancy_stats& s2) {
        occupancy_stats result(s1);
        result += s2;
        return result;
    }

    friend occupancy_stats operator-(const occupancy_stats& s1, const occupancy_stats& s2) {
        occupancy_stats result(s1);
        result -= s2;
        return result;
    }

    occupancy_stats& operator+=(const occupancy_stats& other) {
        _total_space += other._total_space;
        _free_space += other._free_space;
        return *this;
    }

    occupancy_stats& operator-=(const occupancy_stats& other) {
        _total_space -= other._total_space;
        _free_space -= other._free_space;
        return *this;
    }

    size_t used_space() const {
        return _total_space - _free_space;
    }

    size_t free_space() const {
        return _free_space;
    }

    size_t total_space() const {
        return _total_space;
    }

    float used_fraction() const {
        return _total_space ? float(used_space()) / total_space() : 0;
    }

    explicit operator bool() const {
        return _total_space > 0;
    }

    friend std::ostream& operator<<(std::ostream&, const occupancy_stats&);
};

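// Illustrative sketch, not part of this header: a standalone look at how the
// ordering by used fraction and the += operation of occupancy_stats interact.
// `stats_sketch`, `aggregate` and `sparsest` are hypothetical names introduced
// only for this example; sizes are in bytes, as above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, trimmed-down stand-in for occupancy_stats.
struct stats_sketch {
    std::size_t free_space = 0;
    std::size_t total_space = 0;

    stats_sketch() = default;
    stats_sketch(std::size_t free_bytes, std::size_t total_bytes)
        : free_space(free_bytes), total_space(total_bytes) {}

    std::size_t used_space() const { return total_space - free_space; }

    // An empty pool reports a used fraction of 0 instead of dividing by zero.
    float used_fraction() const {
        return total_space ? float(used_space()) / total_space : 0;
    }

    // Sparser pools (lower used fraction) order first.
    bool operator<(const stats_sketch& other) const {
        return used_fraction() < other.used_fraction();
    }

    // Monoid operation: component-wise sum. The default-constructed value
    // is the identity element.
    stats_sketch& operator+=(const stats_sketch& other) {
        total_space += other.total_space;
        free_space += other.free_space;
        return *this;
    }
};

// Sums per-pool statistics into one aggregate, starting from the identity.
inline stats_sketch aggregate(const std::vector<stats_sketch>& parts) {
    stats_sketch sum;
    for (const auto& p : parts) {
        sum += p;
    }
    return sum;
}

// Returns the entry that orders first, i.e. the sparsest one.
inline stats_sketch sparsest(const std::vector<stats_sketch>& parts) {
    assert(!parts.empty());
    return *std::min_element(parts.begin(), parts.end());
}
```

// For two pools with 768 and 128 free bytes out of 1024 each, aggregate()
// yields 896 free bytes out of 2048, and sparsest() picks the pool with
// 768 free bytes (used fraction 0.25).
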
class basic_region_impl : public allocation_strategy {
protected:
    bool _reclaiming_enabled = true;
    seastar::shard_id _cpu = this_shard_id();
public:
    void set_reclaiming_enabled(bool enabled) {
        assert(this_shard_id() == _cpu);
        _reclaiming_enabled = enabled;
    }

    bool reclaiming_enabled() const {
        return _reclaiming_enabled;
    }
};

//
// Log-structured allocator region.
//
// Objects allocated using this region are said to be owned by this region.
// Objects must be freed only using the region which owns them. Ownership can
// be transferred across regions using the merge() method. Region must be live
// as long as it owns any objects.
//
// Each region has separate memory accounting and can be compacted
// independently from other regions. To reclaim memory from all regions use
// shard_tracker().
//
// Region is automatically added to the set of
// compactible regions when constructed.
//
class region {
public:
    using impl = region_impl;
private:
    shared_ptr<basic_region_impl> _impl;
private:
    region_impl& get_impl();
    const region_impl& get_impl() const;
public:
    region();
    explicit region(region_group& group);
    ~region();
    region(region&& other);
    region& operator=(region&& other);
    region(const region& other) = delete;

    occupancy_stats occupancy() const;

    allocation_strategy& allocator() noexcept {
        return *_impl;
    }
    const allocation_strategy& allocator() const noexcept {
        return *_impl;
    }

    region_group* group();

    // Allocates a buffer of a given size.
    // The buffer's pointer will be aligned to 4KB.
    // Note: it is wasteful to allocate buffers of sizes which are not a multiple of the alignment.
    lsa_buffer alloc_buf(size_t buffer_size);

    // Merges another region into this region. The other region is left empty.
    // Doesn't invalidate references to allocated objects.
    void merge(region& other) noexcept;

    // Compacts everything. Mainly for testing.
    // Invalidates references to allocated objects.
    void full_compaction();

    // Runs eviction function once. Mainly for testing.
    memory::reclaiming_result evict_some();

    // Changes the reclaimability state of this region. When region is not
    // reclaimable, it won't be considered by tracker::reclaim(). By default region is
    // reclaimable after construction.
    void set_reclaiming_enabled(bool e) { _impl->set_reclaiming_enabled(e); }

    // Returns the reclaimability state of this region.
    bool reclaiming_enabled() const { return _impl->reclaiming_enabled(); }

    // Returns a value which is increased when this region is either compacted or
    // evicted from, which invalidates references into the region.
    // When the value returned by this method doesn't change, references remain valid.
    uint64_t reclaim_counter() const {
        return allocator().invalidate_counter();
    }

    // Will cause subsequent calls to evictable_occupancy() to report empty occupancy.
    void ground_evictable_occupancy();

    // Follows region's occupancy in the parent region group. Less fine-grained than occupancy().
    // After ground_evictable_occupancy() is called returns 0.
    occupancy_stats evictable_occupancy();

    // Makes this region an evictable region. Supplied function will be called
    // when data from this region needs to be evicted in order to reclaim space.
    // The function should free some space from this region.
    void make_evictable(eviction_fn);

    const eviction_fn& evictor() const;

    friend class region_group;
    friend class allocating_section;
};

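// Illustrative sketch, not part of this header: one way a caller might use a
// counter like region::reclaim_counter() to decide when a cached raw pointer
// has to be looked up again. `counted_region_sketch` and `cached_ref_sketch`
// are hypothetical names, and compaction is only simulated here by copying a
// std::vector to new storage.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical region: any operation that moves objects bumps the counter.
struct counted_region_sketch {
    std::vector<int> data;
    std::uint64_t counter = 0;

    std::uint64_t reclaim_counter() const { return counter; }

    // Simulates compaction: elements move to new storage and the counter grows.
    void compact() {
        std::vector<int> moved(data);
        data.swap(moved);
        ++counter;
    }
};

// Caches a raw pointer into the region and re-resolves it only when the
// counter observed at the last lookup differs from the current one.
class cached_ref_sketch {
    counted_region_sketch& _region;
    std::size_t _index;
    const int* _ptr;
    std::uint64_t _seen;
    unsigned _refreshes = 0;
public:
    cached_ref_sketch(counted_region_sketch& r, std::size_t index)
        : _region(r)
        , _index(index)
        , _ptr(&r.data[index])
        , _seen(r.reclaim_counter())
    { }

    int get() {
        if (_seen != _region.reclaim_counter()) {
            // The region moved its objects since the last lookup.
            _ptr = &_region.data[_index];
            _seen = _region.reclaim_counter();
            ++_refreshes;
        }
        return *_ptr;
    }

    unsigned refreshes() const { return _refreshes; }
};
```

// While the counter is unchanged, get() reuses the cached pointer. After
// compact() it performs exactly one fresh lookup and then caches again.
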
// Forces references into the region to remain valid as long as this guard is
// live by disabling compaction and eviction.
// Can be nested.
struct reclaim_lock {
    region& _region;
    bool _prev;
    reclaim_lock(region& r)
        : _region(r)
        , _prev(r.reclaiming_enabled())
    {
        _region.set_reclaiming_enabled(false);
    }
    ~reclaim_lock() {
        _region.set_reclaiming_enabled(_prev);
    }
};

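// Illustrative sketch, not part of this header: why saving the previous state
// in the guard makes nesting safe. `region_sketch` and `reclaim_lock_sketch`
// are hypothetical stand-ins that keep only the reclaiming flag.

```cpp
#include <cassert>

// Hypothetical stand-in for logalloc::region, reduced to the reclaiming flag.
struct region_sketch {
    bool reclaiming = true;
    void set_reclaiming_enabled(bool e) { reclaiming = e; }
    bool reclaiming_enabled() const { return reclaiming; }
};

// Same save-and-restore shape as reclaim_lock: remember the previous state,
// disable reclaiming, and restore the remembered state on destruction.
struct reclaim_lock_sketch {
    region_sketch& _region;
    bool _prev;

    explicit reclaim_lock_sketch(region_sketch& r)
        : _region(r)
        , _prev(r.reclaiming_enabled())
    {
        _region.set_reclaiming_enabled(false);
    }
    ~reclaim_lock_sketch() {
        _region.set_reclaiming_enabled(_prev);
    }
    reclaim_lock_sketch(const reclaim_lock_sketch&) = delete;
    reclaim_lock_sketch& operator=(const reclaim_lock_sketch&) = delete;
};

// Returns true if reclaiming is still disabled after the inner guard is
// released while the outer guard is alive.
inline bool stays_disabled_after_inner_guard(region_sketch& r) {
    reclaim_lock_sketch outer(r);
    {
        reclaim_lock_sketch inner(r); // saves "false", restores "false"
    }
    return !r.reclaiming_enabled();
}
```

// The inner guard restores the state it observed (disabled), so only the
// outermost guard re-enables reclaiming when it goes out of scope.
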
// Utility for running critical sections which need to lock some region and
// also allocate LSA memory. The object learns from failures how much it
// should reserve up front in order to not cause allocation failures.
class allocating_section {
    // Do not decay below these minimal values
    static constexpr size_t s_min_lsa_reserve = 1;
    static constexpr size_t s_min_std_reserve = 1024;
    static constexpr uint64_t s_bytes_per_decay = 10'000'000'000;
    static constexpr unsigned s_segments_per_decay = 100'000;
    size_t _lsa_reserve = s_min_lsa_reserve; // in segments
    size_t _std_reserve = s_min_std_reserve; // in bytes
    size_t _minimum_lsa_emergency_reserve = 0;
    int64_t _remaining_std_bytes_until_decay = s_bytes_per_decay;
    int _remaining_lsa_segments_until_decay = s_segments_per_decay;
private:
    struct guard {
        size_t _prev;
        guard();
        ~guard();
    };
    void reserve();
    void maybe_decay_reserve();
    void on_alloc_failure(logalloc::region&);
public:

    void set_lsa_reserve(size_t);
    void set_std_reserve(size_t);

    //
    // Reserves standard allocator and LSA memory for subsequent operations that
    // have to be performed with memory reclamation disabled.
    //
    // Throws std::bad_alloc when reserves can't be increased to a sufficient level.
    //
    template<typename Func>
    decltype(auto) with_reserve(Func&& fn) {
        auto prev_lsa_reserve = _lsa_reserve;
        auto prev_std_reserve = _std_reserve;
        try {
            guard g;
            _minimum_lsa_emergency_reserve = g._prev;
            reserve();
            return fn();
        } catch (const std::bad_alloc&) {
            // roll-back limits to protect against pathological requests
            // preventing future requests from succeeding.
            _lsa_reserve = prev_lsa_reserve;
            _std_reserve = prev_std_reserve;
            throw;
        }
    }

    //
    // Invokes func with reclaim_lock on region r. If LSA allocation fails
    // inside func it is retried after increasing LSA segment reserve. The
    // memory reserves are increased with region lock off allowing for memory
    // reclamation to take place in the region.
    //
    // References in the region are invalidated when allocating section is re-entered
    // on allocation failure.
    //
    // Throws std::bad_alloc when reserves can't be increased to a sufficient level.
    //
    template<typename Func>
    decltype(auto) with_reclaiming_disabled(logalloc::region& r, Func&& fn) {
        assert(r.reclaiming_enabled());
        maybe_decay_reserve();
        while (true) {
            try {
                logalloc::reclaim_lock _(r);
                memory::disable_abort_on_alloc_failure_temporarily dfg;
                return fn();
            } catch (const std::bad_alloc&) {
                on_alloc_failure(r);
            }
        }
    }

    //
    // Reserves standard allocator and LSA memory and
    // invokes func with reclaim_lock on region r. If LSA allocation fails
    // inside func it is retried after increasing LSA segment reserve. The
    // memory reserves are increased with region lock off allowing for memory
    // reclamation to take place in the region.
    //
    // References in the region are invalidated when allocating section is re-entered
    // on allocation failure.
    //
    // Throws std::bad_alloc when reserves can't be increased to a sufficient level.
    //
    template<typename Func>
    decltype(auto) operator()(logalloc::region& r, Func&& func) {
        return with_reserve([this, &r, &func] {
            return with_reclaiming_disabled(r, func);
        });
    }
};

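// Illustrative sketch, not part of this header: the retry-and-learn shape of
// allocating_section, with the allocator replaced by a simulated failure.
// `section_sketch`, `run_with_need`, the doubling policy and `max_reserve` are
// assumptions made for this example; the header does not show how
// on_alloc_failure() actually grows the reserve.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical stand-in that remembers how much reserve the last run needed.
class section_sketch {
    std::size_t _reserve = 1; // in segments
public:
    std::size_t reserve() const { return _reserve; }

    // Runs fn with the current reserve. On std::bad_alloc the reserve is
    // doubled and fn is retried. Once doubling would exceed max_reserve,
    // the exception is rethrown to the caller.
    template <typename Func>
    decltype(auto) operator()(std::size_t max_reserve, Func&& fn) {
        while (true) {
            try {
                return fn(_reserve);
            } catch (const std::bad_alloc&) {
                if (_reserve * 2 > max_reserve) {
                    throw;
                }
                _reserve *= 2;
            }
        }
    }
};

// Simulated critical section that fails unless at least `needed` segments
// were reserved up front. Returns how many attempts it took.
inline int run_with_need(section_sketch& s, std::size_t needed) {
    int attempts = 0;
    s(64, [&](std::size_t reserve) {
        ++attempts;
        if (reserve < needed) {
            throw std::bad_alloc();
        }
    });
    return attempts;
}
```

// Because the section object keeps its grown reserve, a second run with the
// same requirement succeeds on the first attempt. The critical section itself
// may run several times, so it has to tolerate being re-entered.
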
future<> prime_segment_pool(size_t available_memory, size_t min_free_memory);

uint64_t memory_allocated();
uint64_t memory_compacted();

occupancy_stats lsa_global_occupancy_stats();

}