The constructors of these global variables can allocate memory. Since
the variables are thread_local, they are initialized at first use.
There is nothing we can do if these allocations fail, so use
disable_failure_guard.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200729184901.205646-1-espindola@scylladb.com>
Now that there are no ad-hoc aliases needing to overwrite the name and
description parameter of this method, we can drop these and have each
config item just use `name()` and `desc()` to access these.
Allow configuration items to also have an alias, besides the name.
This allows easy replacement of configuration items, with newer names,
while still supporting the old name for backward compatibility.
The alias mechanism takes care of registering both the name and the
alias as command line arguments, as well as parsing them from YAML.
The command line documentation of the alias will just refer to the name
for documentation.
"
The set's goal is to reduce the indirect fanout of 3 headers only,
but likely affects more. The measured improvement rates are
flat_mutation_reader.hh: -80%
mutation.hh : -70%
mutation_partition.hh : -20%
tests: dev-build, 'checkheaders' for changed headers (the tree-wide
fails on master)
"
* 'br-debloat-mutation-headers' of https://github.com/xemul/scylla:
headers:: Remove flat_mutation_reader.hh from several other headers
migration_manager: Remove db/schema_tables.hh inclustion into header
storage_proxy: Remove frozen_mutation.hh inclustion
storage_proxy: Move paxos/*.hh inclusions from .hh to .cc
storage_proxy: Move hint_wrapper from .hh to .cc
headers: Remove mutation.hh from trace_state.hh
The schema_tables.hh -> migration_manager.hh couple seems to work as one
of "single header for everyhing" creating big blot for many seemingly
unrelated .hh's.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If external is true, _u.ptr is not null. An empty managed_bytes uses
the internal representation.
The current code looks scary, since it seems possible that backref
would still point to the old location, which would invite corruption
when the reclaimer runs.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200716233124.521796-1-espindola@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/6694
by Calle Wilund:
Implementation of DynamoDB streams using Scylla CDC.
Fixes#5065
Initial, naive implementation insofar that it uses 1:1 mapping CDC stream to
DynamoDB shard. I.e. there are a lot of shards.
Includes tests verified against both local DynamoDB server and actual AWS
remote one.
Note:
Because of how data put is implemented in alternator, currently we do not
get "proper" INSERT labels for first write of data, because to CDC it looks
like an update. The test compensates for this, but actual users might not
like it.
To allow immediate json value conversion for types we
have TypeHelper<...>:s for.
Typed opt-get to get both automatic type conversion, _and_
find functionality in one call.
The change is the same as with row-cache -- use B+ with int64_t token
as key and array of memtable_entry-s inside it.
The changes are:
Similar to those for row_cache:
- compare() goes away, new collection uses ring_position_comparator
- insertion and removal happens with the help of double_decker, most
of the places are about slightly changed semantics of it
- flags are added to memtable_entry, this makes its size larger than
it could be, but still smaller than it was before
Memtable-specific:
- when the new entry is inserted into tree iterators _might_ get
invalidated by double-decker inner array. This is easy to check
when it happens, so the invalidation is avoided when possible
- the size_in_allocator_without_rows() is now not very precise. This
is because after the patch memtable_entries are not allocated
individually as they used to. They can be squashed together with
those having token conflict and asking allocator for the occupied
memory slot is not possible. As the closest (lower) estimate the
size of enclosing B+ data node is used
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The collection is K:V store
bplus::tree<Key = K, Value = array_trusted_bounds<V>>
It will be used as partitions cache. The outer tree is used to
quickly map token to cache_entry, the inner array -- to resolve
(expected to be rare) hash collisions.
It also must be equipped with two comparators -- less one for
keys and full one for values. The latter is not kept on-board,
but it required on all calls.
The core API consists of just 2 calls
- Heterogenuous lower_bound(search_key) -> iterator : finds the
element that's greater or equal to the provided search key
Other than the iterator the call returns a "hint" object
that helps the next call.
- emplace_before(iterator, key, hint, ...) : the call construct
the element right before the given iterator. The key and hint
are needed for more optimal algo, but strictly speaking not
required.
Adding an entry to the double_decker may result in growing the
node's array. Here to B+ iterator's .reconstruct() method
comes into play. The new array is created, old elements are
moved onto it, then the fresh node replaces the old one.
// TODO: Ideally this should be turned into the
// template <typename OuterCollection, typename InnerCollection>
// but for now the double_decker still has some intimate knowledge
// about what outer and inner collections are.
Insertion into this collection _may_ invalidate iterators, but
may leave intact. Invalidation only happens in case of hashing
conflict, which can be clearly seen from the hint object, so
there's a good room for improvement.
The main usage by row_cache (the find_or_create_entry) looks like
cache_entry find_or_create_entry() {
bound_hint hint;
it = lower_bound(decorated_key, &hint);
if (!hint.found) {
it = emplace_before(it, decorated_key.token(), hint,
<constructor args>)
}
return *it;
}
Now the hint. It contains 3 booleans, that are
- match: set to true when the "greater or equal" condition
evaluated to "equal". This frees the caller from the need
to manually check whether the entry returned matches the
search key or the new one should be inserted.
This is the "!found" check from the above snippet.
To explain the next 2 bools, here's a small example. Consider
the tree containing two elements {token, partition key}:
{ 3, "a" }, { 5, "z" }
As the collection is sorted they go in the order shown. Next,
this is what the lower_bound would return for some cases:
{ 3, "z" } -> { 5, "z" }
{ 4, "a" } -> { 5, "z" }
{ 5, "a" } -> { 5, "z" }
Apparently, the lower bound for those 3 elements are the same,
but the code-flows of emplacing them before one differ drastically.
{ 3, "z" } : need to get previous element from the tree and
push the element to it's vector's back
{ 4, "a" } : need to create new element in the tree and populate
its empty vector with the single element
{ 5, "a" } : need to put the new element in the found tree
element right before the found vector position
To make one of the above decisions the .emplace_before would need
to perform another set of comparisons of keys and elements.
Fortunately, the needed information was already known inside the
lower_bound call and can be reported via the hint.
Said that,
- key_match: set to true if tree.lower_bound() found the element
for the Key (which is token). For above examples this will be
true for cases 3z and 5a.
- key_tail: set to true if the tree element was found, but when
comparing values from array the bounding element turned out
to belong to the next tree element and the iterator was ++-ed.
For above examples this would be true for case 3z only.
And the last, but not least -- the "erase self" feature. Which is
given only the cache_entry pointer at hands remove it from the
collection. To make this happen we need to make two steps:
1. get the array the entry sits in
2. get the b+ tree node the vectors sits in
Both methods are provided by array_trusted_bounds and bplus::tree.
So, when we need to get iterator from the given T pointer, the algo
looks like
- Walk back the T array untill hitting the head element
- Call array_trusted_bounds::from_element() getting the array
- Construct b+ iterator from obtained array
- Construct the double_decker iterator from b+ iterator and from
the number of "steps back" from above
- Call double_decker::iterator.erase()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A plain array of elements that grows and shrinks by
constructing the new instance from an existing one and
moving the elements from it.
Behaves similarly to vector's external array, but has
0-bytes overhead. The array bounds (0-th and N-th
elemements) are determined by checking the flags on the
elements themselves. For this the type must support
getters and setters for the flags.
To remove an element from array there's also a nothrow
option that drops the requested element from array,
shifts the righter ones left and keeps the trailing
unused memory (so called "train") until reconstruction
or destruction.
Also comes with lower_bound() helper that helps keeping
the elements sotred and the from_element() one that
returns back reference to the array in which the element
sits.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
// The story is at
// https://groups.google.com/forum/#!msg/scylladb-dev/sxqTHM9rSDQ/WqwF1AQDAQAJ
This is the B+ version which satisfies several specific requirements
to be suitable for row-cache usage.
1. Insert/Remove doesn't invalidate iterators
2. Elements should be LSA-compactable
3. Low overhead of data nodes (1 pointer)
4. External less-only comparator
5. As little actions on insert/delete as possible
6. Iterator walks the sorted keys
The design, briefly is:
There are 3 types of nodes: inner, leaf and data, inner and leaf
keep build-in array of N keys and N(+1) nodes. Leaf nodes sit in
a doubly linked list. Data nodes live separately from the leaf ones
and keep pointers on them. Tree handler keeps pointers on root and
left-most and right-most leaves. Nodes do _not_ keep pointers or
references on the tree (except 3 of them, see below).
changes in v9:
- explicitly marked keys/kids indices with type aliases
- marked the whole erase/clear stuff noexcept
- disposers now accept object pointer instead of reference
- clear tree in destructor
- added more comments
- style/readability review comments fixed
Prior changes
**
- Add noexcepts where possible
- Restrict Less-comparator constraint -- it must be noexcept
- Generalized node_id
- Packed code for beging()/cbegin()
**
- Unsigned indices everywhere
- Cosmetics changes
**
- Const iterators
- C++20 concepts
**
- The index_for() implmenetation is templatized the other way
to make it possible for AVX key search specialization (further
patching)
**
- Insertion tries to push kids to siblings before split
Before this change insertion into full node resulted into this
node being split into two equal parts. This behaviour for random
keys stress gives a tree with ~2/3 of nodes half-filled.
With this change before splitting the full node try to push one
element to each of the siblings (if they exist and not full).
This slows the insertion a bit (but it's still way faster than
the std::set), but gives 15% less total number of nodes.
- Iterator method to reconstruct the data at the given position
The helper creates a new data node, emplaces data into it and
replaces the iterator's one with it. Needed to keep arrays of
data in tree.
- Milli-optimize erase()
- Return back an iterator that will likely be not re-validated
- Do not try to update ancestors separation key for leftmost kid
This caused the clear()-like workload work poorly as compared to
std:set. In particular the row_cache::invalidate() method does
exactly this and this change improves its timing.
- Perf test to measure drain speed
- Helper call to collect tree counters
**
- Fix corner case of iterator.emplace_before()
- Clean heterogenous lookup API
- Handle exceptions from nodes allocations
- Explicitly mark places where the key is copied (for future)
- Extend the tree.lower_bound() API to report back whether
the bound hit the key or not
- Addressed style/cleanness review comments
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
This series converts an API to use std::string_view and then converts
a few sstring variables to be constexpr std::string_view. This has the
advantage that a constexpr variables cannot be part of any
initialization order problem.
"
* 'espindola/convert-to-constexpr' of https://github.com/espindola/scylla:
auth: Convert sstring variables in common.hh to constexpr std::string_view
auth: Convert sstring variables in default_authorizer to constexpr std::string_view
cql_test_env: Make ks_name a constexpr std::string_view
class_registry: Use std::string_view in (un)?qualified_name
Existing infrastructure relies on being able to parse a JSON string
straight into a map of strings. In order to make rjson a drop-in
replacement(tm) for libjsoncpp, a similar helper function is provided.
It's redundant to provide function overloads for both string_view
and const string&, since both of them can be implicitly created from
const char*. Thus, only string_view overloads are kept.
Example code which was ambiguous before the patch, but compiles fine
after it:
rjson::from_string("hello");
Without the patch, one had to explicitly state the type, e.g.:
rjson::from_string(std::string_view("hello"));
which is excessive.
A negative scale was being passed an a positive value to
boost::multiprecision::pow, which would never finish.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
This patch aim to make the implementation and usage of the
approx_exponential_histogram clearer.
The approx_exponential_histogram Uses a combination of Min, Max,
Precision and number of buckets where the user needs to pick 3.
Most of the changes in the patch are about documenting the class and
method, but following the review there are two functionality changes:
1. The user would pick: Min, Max and Precision and the number of buckets
will be calculated from these values.
2. The template restrictions are now state in a requires so voiolation
will be stop at compile time.
"
The "promoted index" is how the sstable format calls the clustering key index within a given partition.
Large partitions with many rows have it. It's embedded in the partition index entry.
Currently, lookups in the promoted index are done by scanning the index linearly so the lookup
is O(N). For large partitions that's inefficient. It consumes both a lot of CPU and I/O.
We could do better and use binary search in the index. This patch series switches the mc-format
index reader to do that. Other formats use the old way.
The "mc" format promoted index has an extra structure at the end of the index called "offset map".
It's a vector of offsets of consecutive promoted index entries. This allows us to access random
entries in the index without reading the whole index.
The location of the offset entry for a given promoted index entry can be derived by knowing where
the offset vector ends in the index file, so the offset map also doesn't have to be read completely
into the memory.
The most tricky part is caching. We need to cache blocks read from the index file to amortize the
cost of binary search:
- if the promoted index fits in the 32 KiB which was read from the index when looking for
the partition entry, we don't want to issue any additional I/O to search the promoted index.
- with large promoted indexes, the last few bisections will fall into the same I/O block and we
want to reuse that block.
- we don't want the cache to grow too big, we don't want to cache the whole promoted index
as the read progresses over the index. Scanning reads may skip multiple times.
This series implements a rather simple approach which meets all the
above requirements and is not worse than the current state of affairs:
- Each index cursor has its own cache of the index file area which corresponds to promoted index
This is managed by the cached_file class.
- Each index cursor has its own cache of parsed blocks. This allows the upper bound estimation to
reuse information obtained during lower bound lookup. This estimation is used to limit
read-aheads in the data file.
- Each cursor drops entries that it walked past so that memory footprint stays O(log N)
- Cached buffers are accounted to read's reader_permit.
Later, we could have a single cache shared by many readers. For that, we need to come up with eviction
policy.
Fixes#4007.
TESTING RESULTS
* Point reads, large promoted index:
Config: rows: 10000000, value size: 2000
Partition size: 20 GB
Index size: 7 MB
Notes:
- Slicing read into the middle of partition (offset=5000000, read=1) is a clear win for the binary search:
time: 1.9ms vs 22.9ms
CPU utilization: 8.9% vs 92.3%
I/O: 21 reqs / 172 KiB vs 29 reqs / 3'520 KiB
It's 12x faster, CPU utilization is 10x times smaller, disk utilization is 20x smaller.
- Slicing at the front (offset=0) is a mixed bag.
time is similar: 1.8ms
CPU utilization is 6.7x smaller for bsearch: 8.5% vs 57.7%
disk bandwidth utilization is smaller for bsearch but uses more IOs: 4 reqs / 320 KiB (scan) vs 17 reqs / 188 KiB (bsearch)
bsearch uses less bandwidth because the series reduces buffer size used for index file I/O.
scan is issuing:
2 * 128 KB (index page)
2 * 32 KB (data file)
bsearch is issuing:
1 * 64 KB (index page)
15 * 4 KB (promoted index)
1 * 64 KB (data file)
The 1 * 64 KB is chosen dynamically by seastar. Sometimes it chooses 2 * 32 KB (with read-ahead).
32 KB is the minimum I/O currently.
Disk utilization could be further improved by changing the way seastar's dynamic I/O adjustments work
so that it uses 1 * 4 KB when it suffices. This is left for the follow-up.
Command:
perf_fast_forward --datasets=large-part-ds1 \
--run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1
Before:
offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem
0 1 0.001836 172 1 545 9 563 175 4.0 4 320 2 2 0 1 1 0 0 0 57.7% 0
0 32 0.001858 502 32 17220 126 17776 11526 3.2 3 324 2 1 0 1 1 0 0 0 56.4% 0
0 256 0.002833 339 256 90374 427 91757 85931 7.0 7 776 3 1 0 1 1 0 0 0 41.1% 0
0 4096 0.017211 58 4096 237984 2011 241802 233870 66.1 66 8376 59 2 0 1 1 0 0 0 21.4% 0
5000000 1 0.022952 42 1 44 1 45 41 29.2 29 3520 22 2 0 1 1 0 0 0 92.3% 0
5000000 32 0.023052 43 32 1388 14 1414 1331 31.1 32 3588 26 2 0 1 1 0 0 0 91.7% 0
5000000 256 0.024795 41 256 10325 129 10721 9993 43.1 39 4544 29 2 0 1 1 0 0 0 86.4% 0
5000000 4096 0.038856 27 4096 105414 398 106918 103162 95.2 95 12160 78 5 0 1 1 0 0 0 61.4% 0
After (v2):
offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem
0 1 0.001831 248 1 546 21 581 252 17.6 17 188 2 0 0 1 1 0 0 0 8.5% 0
0 32 0.001910 535 32 16751 626 17770 13896 17.9 19 160 3 0 0 1 1 0 0 0 8.8% 0
0 256 0.003545 266 256 72207 2333 89076 62852 26.9 24 764 7 0 0 1 1 0 0 0 9.7% 0
0 4096 0.016800 56 4096 243812 524 245430 239736 83.6 83 8700 64 0 0 1 1 0 0 0 16.6% 0
5000000 1 0.001968 351 1 508 19 538 380 21.3 21 172 2 0 0 1 1 0 0 0 8.9% 0
5000000 32 0.002273 431 32 14077 436 15503 11551 22.7 22 268 3 0 0 1 1 0 0 0 8.9% 0
5000000 256 0.003889 257 256 65824 2197 81833 57813 34.0 37 652 18 0 0 1 1 0 0 0 11.2% 0
5000000 4096 0.017115 54 4096 239324 834 241310 231993 88.3 88 8844 65 0 0 1 1 0 0 0 16.8% 0
After (v1):
offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem
0 1 0.001886 259 1 530 4 545 261 18.0 18 376 2 2 0 1 1 0 0 0 9.1% 0
0 32 0.001954 513 32 16381 93 16844 15618 19.0 19 408 3 2 0 1 1 0 0 0 9.3% 0
0 256 0.003266 318 256 78393 1820 81567 61663 30.8 26 1272 7 2 0 1 1 0 0 0 10.4% 0
0 4096 0.017991 57 4096 227666 855 231915 225781 83.1 83 8888 55 5 0 1 1 0 0 0 15.5% 0
5000000 1 0.002353 232 1 425 2 432 232 23.0 23 396 2 2 0 1 1 0 0 0 8.7% 0
5000000 32 0.002573 384 32 12437 47 12571 429 25.0 25 460 4 2 0 1 1 0 0 0 8.5% 0
5000000 256 0.003994 259 256 64101 2904 67924 51427 37.0 35 1484 11 2 0 1 1 0 0 0 10.6% 0
5000000 4096 0.018567 56 4096 220609 448 227395 219029 89.8 89 9036 59 5 0 1 1 0 0 0 15.1% 0
* Point reads, small promoted index (two blocks):
Config: rows: 400, value size: 200
Partition size: 84 KiB
Index size: 65 B
Notes:
- No significant difference in time
- the same disk utilization
- similar CPU utilization
Command:
perf_fast_forward --datasets=large-part-ds1 \
--run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1
Before:
offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem
0 1 0.000279 470 1 3587 31 3829 478 3.0 3 68 2 1 0 1 1 0 0 0 21.1% 0
0 32 0.000276 3498 32 116038 811 122756 104033 3.0 3 68 2 1 0 1 1 0 0 0 24.0% 0
0 256 0.000412 2554 256 621044 1778 732150 559221 2.0 2 72 2 0 0 1 1 0 0 0 32.6% 0
0 4096 0.000510 1901 400 783883 4078 819058 665616 2.0 2 88 2 0 0 1 1 0 0 0 36.4% 0
200 1 0.000339 2712 1 2951 8 3001 2569 2.0 2 72 2 0 0 1 1 0 0 0 17.8% 0
200 32 0.000352 2586 32 91019 266 92427 83411 2.0 2 72 2 0 0 1 1 0 0 0 20.8% 0
200 256 0.000458 2073 200 436503 1618 453945 385501 2.0 2 88 2 0 0 1 1 0 0 0 29.4% 0
200 4096 0.000458 2097 200 436475 1676 458349 381558 2.0 2 88 2 0 0 1 1 0 0 0 29.0% 0
After (v1):
Testing slicing of large partition using clustering keys:
offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem
0 1 0.000278 492 1 3598 30 3831 500 3.0 3 68 2 1 0 1 1 0 0 0 19.4% 0
0 32 0.000275 3433 32 116153 753 122915 92559 3.0 3 68 2 1 0 1 1 0 0 0 22.5% 0
0 256 0.000458 2576 256 559437 2978 728075 504375 2.1 2 88 2 0 0 1 1 0 0 0 29.0% 0
0 4096 0.000506 1888 400 790064 3306 822360 623109 2.0 2 88 2 0 0 1 1 0 0 0 36.6% 0
200 1 0.000382 2493 1 2619 10 2675 2268 2.0 2 88 2 0 0 1 1 0 0 0 16.3% 0
200 32 0.000398 2393 32 80422 333 84759 22281 2.0 2 88 2 0 0 1 1 0 0 0 19.0% 0
200 256 0.000459 2096 200 435943 1608 453989 380749 2.0 2 88 2 0 0 1 1 0 0 0 30.5% 0
200 4096 0.000458 2097 200 436410 1651 455779 382485 2.0 2 88 2 0 0 1 1 0 0 0 29.2% 0
* Scan with skips, large index:
Config: rows: 10000000, value size: 2000
Partition size: 20 GB
Index size: 7 MB
Notes:
- Similar time, slightly worse for binary search: 36.1 s (scan) vs 36.4 (bsearch)
- Slightly more I/O for bsearch: 153'932 reqs / 19'703'260 KiB (scan) vs 155'651 reqs / 19'704'088 KiB (bsearch)
Binary search reads more by 828 KB and by 1719 IOs.
It does more I/O to read the the promoted index offset map.
- similar (low) memory footprint. The danger here is that by caching index blocks which we touch as we scan
we would end up caching the whole index. But this is protected against by eviction as demonstrated by the
last "mem" column.
Command:
perf_fast_forward --datasets=large-part-ds1 \
--run-tests=large-partition-skips -c1 --test-case-duration=1
Before:
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem
1 1 36.103451 4 5000000 138491 38 138601 138453 153932.0 153932 19703260 153561 1 0 1 1 0 0 0 31.5% 502690
After (v2):
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem
1 1 37.000145 4 5000000 135135 6 135146 135128 155651.0 155651 19704088 138968 0 0 1 1 0 0 0 34.2% 0
After (v1):
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem
1 1 36.965520 4 5000000 135261 30 135311 135231 155628.0 155628 19704216 139133 1 0 1 1 0 0 0 33.9% 248738
Also in:
git@github.com:tgrabiec/scylla.git sstable-use-index-offset-map-v2
Tests:
- unit (all modes)
- manual using perf_fast_forward
"
* tag 'sstable-use-index-offset-map-v2' of github.com:tgrabiec/scylla:
sstables: Add promoted index cache metrics
position_in_partition: Introduce external_memory_usage()
cached_file, sstables: Add tracing to index binary search and page cache
sstables: Dynamically adjust I/O size for index reads
sstables, tests: Allow disabling binary search in promoted index from perf tests
sstables: mc: Use binary search over the promoted index
utils: Introduce cached_file
sstables: clustered_index: Relax scope of validity of entry_info
sstables: index_entry: Introduce owning promoted_index_block_position
compound_compat: Allow constructing composite from a view
sstables: index_entry: Rename promoted_index_block_position to promoted_index_block_position_view
sstables: mc: Extract parser for promoted index block
sstables: mc: Extract parser for clustering out of the promoted index block parser
sstables: consumer: Extract primitive_consumer
sstables: Abstract the clustering index cursor behavior
sstables: index_reader: Rearrange to reduce branching and optionals
It is a read-through cache of a file.
Will be used to cache contents of the promoted index area from the
index file.
Currently, cached pages are evicted manually using the invalidate_*()
method family, or when the object is destroyed.
The cached_file represents a subset of the file. The reason for this
is to satisfy two requirements. One is that we have a page-aligned
caching, where pages are aligned relative to the start of the
underlying file. This matches requirements of the seastar I/O engine
on I/O requests. Another requirement is to have an effective way to
populate the cache using an unaligned buffer which starts in the
middle of the file when we know that we won't need to access bytes
located before the buffer's position. See populate_front(). If we
couldn't assume that, we wouldn't be able to insert an unaligned
buffer into the cache.
But not compaction.
When reclaiming segments to seastar non-empty segments are copied
as-is to some other place. Instead of doing this reclaimer can copy
only allocated objects and leave the freed holes behing, i.e. -- do
the regular compaction. This would be the same or better from the
timing perspective, and will help to avoid yet another compaction
pass over the same set of objects in the future.
Current migration code checks for the free segments reserve to be
above minimum to proceed with migration, so does the code after this
patch, thus the segment compaction is called with non-empty free
segments set and thus it's guaranteed not to fail the new segment
allocation (if it will be required at all).
Plus some bikeshedding patches for the run-up.
tests: unit(dev)
* https://github.com/xemul/scylla/tree/br-logalloc-compact-on-reclaim-2:
logalloc: Compact segments on reclaim instead of migration
logallog: Introduce RAII allocation lock
logalloc: Shuffle code around region::impl::compact
logalloc: Do not lock reclaimer twice
logalloc: Do not calculate object size twice
logalloc: Do not convert obj_desc to migrator back and forth
"
This series Adds a pseudo-floating-point histogram implementation.
The histogram is used for time_estimated_histogram a histogram for latency tracking and then used in storage_proxy as a more efficient with a higher resolution histogram.
Follow up series would use the new histogram in other places in the system and will add an implementation that supports lower values.
Fixes#5815Fixes#4746
"
* amnonh-quicker_estimated_histogram:
storage_proxy: use time_estimated_histogram for latencies
test/boost/estimated_histogram_test
utils/histogram_metrics_helper Adding histogram converter
utils/estimated_histogram: Adding approx_exponential_histogram
This patch adds a helper converter function to convert from a
approx_exponential_histogram histogram to a seastar::metrics::histogram
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds an efficient histogram implementation.
The implementation chooses efficiency over flexibility.
That is why templates are used.
How the approx_exponential_histogram pseudo floating point histogram
works: It split the range [MIN, MAX] into log2(MAX/MIN) ranges it then
split each of that ranges linearly according to a given resolution.
For example, using resolution of 4, would be similar to using an
exponentially growing histogram with a coefficient of 1.2.
All values are uint64. To prevent handling of corner cases, it is not
allowed to set the MIN to be lower than the resolution.
The approx_exponential_histogram will probably not be used directly,
the first used is by time_estimated_histogram. A histogram for durations.
It should be compared to the estimated_histogram.
Performance comparison:
Comparison was done by inserting 2^20 values into
time_estimated_histogram and estimated_histogram.
In debug mode on a local machine insert operation took an average of
26.0 nanoseconds vs 342.2 nanoseconds.
In release mode insert operation took an average of 1.90 vs 8.28 nanoseconds
Fixes#5815
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Given that the key is a std::pair, we have to explicitly mark the hash
and eq types as transparent for heterogeneous lookup to work.
With that, pass std::string_view to a few functions that just check if
a value is in the map.
This increases the .text section by 11 KiB (0.03%).
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
When reclaiming segments to the seastar the code tries to free the segments
sequentially. For this it walks the segments from left to right and frees
them, but every time a non-empty segment is met it gets migrated to another
segment, that's allocated from the right end of the list.
This is waste of cycles sometimes. The destination segment inherits the
holes from the source one, and thus it will be compacted some time in the
future. Why not compact it right at the reclamation time? It will take the
same time or less, but will result in better compaction.
To acheive this, the segment to be reclaimed is compacted with the existing
compact_segment_locked() code with some special care around it.
1. The allocation of new segments from seastar is locked
2. The reclaiming of segments with evict-and-compact is locked as well
3. The emergency pool is opened (the compaction is called with non-empty
reserve to avoid bad_alloc exception throw in the middle of compaction)
4. The segment is forcibly removed from the histogram and the closed_occupancy
is updated just like it is with general compaction
The segments-migration auxiliary code can be removed after this.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The lock disables the segment_pool to call for more segments from
the underlying allocator.
To be used in next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This includes 3 small changes to facilitate next patching:
- rename region::impl::compact into compact_segment_locked
- merging former compact with compact_single_segment_locked
- moving log print and stats update into compact_segment_locked
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The tracker::impl::reclaim is already in reclaim-locked
section, no need for yet another nested lock.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When calling alloc_small the migrator is passed just to get the
object descriptor, but during compaction the descriptor is already
at hands, so no need to re-get it again.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add Paxos error injections before/after save promise, proposal, decision,
paxos_response_handler, delete decision.
Adds a method to inject an error providing a lambda while avoiding to add
a continuation when the error injection is disabled.
For this provide error exception and enter() to allow flow control (i.e. return)
on simple error injections without lambdas.
Also includes Pavel's patch for CQL API for error injections, updated to
current error injection API and added one_shot support. Also added some
basic CQL API boost tests.
For CQL API there's a limitation of the current grammar not supporting
f(<terminal>) so values have to be inserted in a table until this is
resolved. See #5411
* https://github.com/alecco/scylla/tree/error_injection_v11:
paxos: fix indentation
paxos: add error injections
utils: add timeout error injection with lambda
utils: error injection add enter() for control flow
utils: error injections provide error exceptions
failure_injector: implement CQL API for failure injector class
lwt: fix disabled error injection templates
Even though calling then() on a ready future does not allocate a
continuation, calling then on the result of it will allocate.
This error injection only adds a continuation in the dependency
chain if error injections are enabled at compile timeand this particular
error injection is enabled.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
For control flow (i.e. return) and simplicity add enter() method.
For disabled injections, this method is const returning false,
therefore it has no overhead.
Add boost test.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>