Currently, reading a page range would issue I/O for each missing
page. This is inefficient, better to issue a single I/O for the whole
range and populate cache from that.
As an optimization, issue a single I/O if the first page is missing.
This is important for index reads which optimistically try to read
32KB of index file to read the partition index page.
As part of this change, the container for partition index pages was
changed from utils::loading_shared_values to intrusive_btree. This is
to avoid reactor stalls which the former induces with a large number
of elements (pages) due to its use of a hashtable under the hood,
which reallocates contiguous storage.
Simplifies managing non-owning references to LSA-managed objects. The
lsa::weak_ptr is a smart pointer which is not invalidated by LSA and
can be used safely in any allocator context. Dereferenced will always
give a valid reference.
This can be used as a building block for implementing cursors into
LSA-based caches.
Example simple use:
// LSA-managed
struct X : public lsa::weakly_referencable<X> {
int value;
};
lsa::weak_ptr<X> x_ptr = with_allocator(region(), [] {
X* x = current_allocator().construct<X>();
return x->weak_from_this();
});
std::cout << x_ptr->value;
Will be needed later for reading a page view which cannot use
make_tracked_temporary_buffer(). Standardize on get_page_units(),
converting existing code to wrap the units in a deleter.
Will be needed by index reader to ensure that destructor doesn't
invoke the allocator so that all is destroyed in the desried
allocation context before the object is destroyed.
After this patch, there is a singe index file page cache per
sstable, shared by index readers. The cache survives reads,
which reduces amount of I/O on subsequent reads.
As part of this, cached_file needed to be adjusted in the following ways.
The page cache may occupy a significant portion of memory. Keeping the
pages in the standard allocator could cause memory fragmentation
problems. To avoid them, the cache_file is changed to keep buffers in LSA
using lsa_buffer allocation method.
When a page is needed by the seastar I/O layer, it needs to be copied
to a temporary_buffer which is stable, so must be allocated in the
standard allocator space. We copy the page on-demand. Concurrent
requests for the same page will share the temporary_buffer. When page
is not used, it only lives in the LSA space.
In the subsequent patches cached_file::stream will be adjusted to also support
access via cached_page::ptr_type directly, to avoid materializating a
temporary_buffer.
While a page is used, it is not linked in the LRU so that it is not
freed. This ensures that the storage which is actively consumed
remains stable, either via temporary_buffer (kept alive by its
deleter), or by cached_page::ptr_type directly.
lsa_buffer is similar in spirit to std::unique_ptr<char[]>. It owns
buffers allocated inside LSA segments. It uses an alternative
allocation method which differs from regular LSA allocations in the
following ways:
1) LSA segments only hold buffers, they don't hold metadata. They
also don't mix with standard allocations. So a 128K segment can
hold 32 4K buffers.
2) objects' life time is managed by lsa_buffer, an owning smart
pointer, which is automatically updated when buffers are migrated
to another segment. This makes LSA allocations easier to use and
off-loads metadata management to the client (which can keep the
lsa_buffer wherever he wants).
The metadata is kept inside segment_descriptor, in a vector. Each
allocated buffer will have an entangled object there (8 bytes), which
is paired with an entabled object inside lsa_buffer.
The reason to have an alternative allocation method is to efficiently
pack buffers inside LSA segments.
It's an adpator between seastar::file and cached_file. It gives a
seastar::file which will serve reads using a given cached_file as a
read-through cache.
We want buffers to be accounted only when they are used outside
cached_file. Cached pages should not be accounted because they will
stay around for longer than the read after subsequent commits.
In preparation for tracking different kinds of objects, not just
rows_entry, in the LRU, switch to the LRU implementation form
utils/lru.hh which can hold arbitrary element type.
The LRU can link objects of different types, which is achieved by
having a virtual base class called "evictable" from which the linked
objects should inherit. Whe the object is removed from the LRU,
evictable::on_evicted() is called.
The container is non-owning.
std::copy_if runs without yielding.
See https://github.com/scylladb/scylla/issues/8897#issuecomment-867522480
Also, eliminate extraneous loop on merge
first1 will point to the inserted value which is a copy of *first2.
Since list2 is sorted in ascending order, the next item from list2
will never be less than the one we've just inserted,
so we waste an iteration to merely increment first1 again.
Fixes#8897
Test: unit(dev), stall_free_test(debug)
DTest: repair_additional_test.py:RepairAdditionalTest.{repair_same_row_diff_value_3nodes_diff_shard_count_test,repair_disjoint_row_3nodes_diff_shard_count_test} (dev)
Closes#8925
* github.com:scylladb/scylla:
utils: merge_to_gently: eliminate extraneous loop on merge
utils: merge_to_gently: prevent stall in std::copy_if
first1 will point to the inserted value which is a copy of *first2.
Since list2 is sorted in ascending order, the next item from list2
will never be less than the one we've just inserted,
so we waste an iteration to merely increment first1 again.
Note that the standard states that no iterators or references are invalidated
on insert so we can safely keep looking at `first1` after inserting a copy of
`*first2` before it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The helper is used to walk the tree key-by-key destroying it
in the mean time. Current implementation of this method just
uses the "regular" erasing code which actually rebalances the
tree despite the name.
The biggest problem with removing the rebalancing is that at
some point non-balanced tree may have the left-most key on an
inner node, so to make 100% rebalance-less unlink every other
method of the tree would have to be prepared for that. However,
there's an option to make "light rebalance" (as it's called in
this patch) that only maintains this crucial property of the
tree -- the left-most key is on the leaf.
Some more tech details. Current rebalancer starts when the
node population falls below 1/2 of its capacity and tries to
- grab a key from one of the siblings if it's balanced
- merge two siblings together if they are small enough
The light rebalance is lighter in two ways. First, it leaves
the node unbalanced until it becomes empty. And then it goes
ahead and replaces it with the next sibling.
This change removes ~60% of the keys movements on random test.
Keys still move when the sibling replace happens because in
this case the separation key needs to be placed at the right
sibling 0 position which means shifting all its keys right.
tests: unit(debug)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210623083836.27491-1-xemul@scylladb.com>
"
The LSA small objects allocation latency is greatly affected by
the way this allocator encodes the object descriptor in front of
each allocated slot.
Nowadays it's one of VLE variants implemented with the help of a
loop. Re-implementing this piece with less instructions and without
a loop allows greatly reducing the allocation latency.
The speed-up mostly comes from loop-less code that doesn't confuse
branch predictor. Also the express encoder seems to benefit from
writing 8 bytes of the encoded value in one go, rather than byte-
-by-byte.
Perf measurements:
1. (new) logallog test shows ~40% smaller times
2. perf_mutation in release mode shows ~2% increase in tps
3. the encoder itself is 2 - 4 times faster on x86_64 and
1.05 - 3 times faster on aarch64. The speed-up depends on
the 'encoded length', old encoder has linear time, the
new one is constant
tests: unit(dev), perf(release), just encoder on Aarch64
"
* 'br-lsa-alloc-latency-4' of https://github.com/xemul/scylla:
lsa: Use express encoder
uleb64: Add express encoding
lsa: Extract uleb64 code into header
test: LSA allocation perf test
To make it possible to use the express encoder, lsa needs to
make sure that the value is below express supreme value and
provide the size of the gap after the encoded value.
Both requirements can be satisfied when encoding the migrator
index on object allocation.
On free the encoded value can be larger, so the extended
express encoder will need more instructions and will not be
that efficient, so the old encoder is used there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Standard encoding is compiled into a loop that puts values
into memory byte-by-byte. This works slowly, but reliably.
When allocating an object LSA uses ubel64 encoder with 2
features that allow to optimize the encoder:
1. the value is migrator.index() which is small enough
to fit 2 bytes when encoded
2. After the descriptor there usually comes an object
which is of 8+ bytes in size
Feature #1 makes it possible to encode the value with just
a few instructions. In O3 level clang makes it like
mov %esi,%ecx
and $0xfc0,%ecx
and $0x3f,%esi
lea (%rsi,%rcx,4),%ecx
add $0x40,%ecx
Next, the encoder needs to put the value into a gap whose
size depends on the alignment of previous and current objects,
so the classical algo loops through this size. Feature #2
makes it possible to put the encoded value and the needed
amount of zeros by using 2 64-bit movs. In this case the
encoded value gets off the needed size and overwrites some
memory after. That's OK, as this overwritten memory is where
the allocated object _will_ be, so the contents there is not
of any interest.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The LSA code encodes an object descriptor before the object
itself. The descriptor is 32-bit value and to put it in an
efficient manner it's encoded into unsigned little-endian
base-64 sequence.
The encoding code is going to be optimized, so put it into a
dedicated header in advance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The one was added to smothly switch tri-comparing stuff from int
to strong-ordering. As for today only tests still need it and the
conversion is pretty simple, plus operator<<(ostream&) for the
std::strong_ordering type.
* xemul/br-remove-int-or-strong-ordering-2:
util: Drop int_or_strong_ordering concept
tests: Switch total-order-check onto strong_ordering
to_string: Add formatter for strong_ordering
tests: Return strong-ordering from tri-comparators
Add an implicit converter of the enum_option to the underyling enum
it is holding. This is needed for using switch() on an enum_option.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Currently, advance_and_wait() allocates a new gate
which might fail. Rather than returning this failure
as an exceptional future - which will require its callers
to handle that failure, keep the function as noexcept and
let an exception from make_lw_shared<gate>() terminate the program.
This makes the function "fail-free" to its callers,
in particular, when called from the table::stop() path where
we can't do much about these failures and we require close/stop
functions to always succeed.
The alternative of make the allocation of a new gate optional
and covering from it in start() is possible but was deemed not
worth it as it will add complexity and cost to start() that's
called on the common, hot, path.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210525055336.1190029-1-bhalevy@scylladb.com>
config_file.cc instantiates std::istream& std::operator>>(std::istream&,
std::unordered_map<seastar::sstring, seastar::sstring>&), but that
instantiation is ignored since config_file_impl.hh specializes
that signature. -Winstantiation-after-specialization warns about it,
so re-enable it now that the code base is clean.
Also remove the matching "extern template" declaration, which has no
definition any more.
Closes#8696
"
The patch set is an assorted collection of header cleanups, e.g:
* Reduce number of boost includes in header files
* Switch to forward declarations in some places
A quick measurement was performed to see if these changes
provide any improvement in build times (ccache cleaned and
existing build products wiped out).
The results are posted below (`/usr/bin/time -v ninja dev-build`)
for 24 cores/48 threads CPU setup (AMD Threadripper 2970WX).
Before:
Command being timed: "ninja dev-build"
User time (seconds): 28262.47
System time (seconds): 824.85
Percent of CPU this job got: 3979%
Elapsed (wall clock) time (h:mm:ss or m:ss): 12:10.97
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2129888
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1402838
Minor (reclaiming a frame) page faults: 124265412
Voluntary context switches: 1879279
Involuntary context switches: 1159999
Swaps: 0
File system inputs: 0
File system outputs: 11806272
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
After:
Command being timed: "ninja dev-build"
User time (seconds): 26270.81
System time (seconds): 767.01
Percent of CPU this job got: 3905%
Elapsed (wall clock) time (h:mm:ss or m:ss): 11:32.36
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2117608
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1400189
Minor (reclaiming a frame) page faults: 117570335
Voluntary context switches: 1870631
Involuntary context switches: 1154535
Swaps: 0
File system inputs: 0
File system outputs: 11777280
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
The observed improvement is about 5% of total wall clock time
for `dev-build` target.
Also, all commits make sure that headers stay self-sufficient,
which would help to further improve the situation in the future.
"
* 'feature/header_cleanups_v1' of https://github.com/ManManson/scylla:
transport: remove extraneous `qos/service_level_controller` includes from headers
treewide: remove evidently unneded storage_proxy includes from some places
service_level_controller: remove extraneous `service/storage_service.hh` include
sstables/writer: remove extraneous `service/storage_service.hh` include
treewide: remove extraneous database.hh includes from headers
treewide: reduce boost headers usage in scylla header files
cql3: remove extraneous includes from some headers
cql3: various forward declaration cleanups
utils: add missing <limits> header in `extremum_tracking.hh`
Yet another patch preventing potentially large allocations.
Currently, collection_mutation{_view,}_description linearize each collection
value during deserialization. It's not unthinkable that a user adds a
large element to a list or a map, so let's avoid that.
This patch removes the dependency on linearizing_input_stream, which does not
provide a way to read fragmented subbuffers, and replaces it with a new
helper, which does. (Extending linearizing_input_stream is not viable without
rewriting it completely).
Only linearization of collection values is corrected in this patch.
Collection keys are still linearized. Storing them in managed_bytes is likely
to be more harmful than helpful, because large map keys are extremely unlikely,
and UUIDs, which are used as keys in lists, do not fit into manages_bytes's
small value optimization, so this would incure an extra allocation for every
list element.
Note: this patch leaves utils/linearizing_input_stream.hh unused.
Refs: #8120Closes#8690
As an optimization for optimistic cases, reclaim_from_evictable first evicts
the requested amount of memory before attempting to reclaim segments through
compactions. However, due to an oversight, it does this before every compaction
instead of once before all compactions.
Usually reclaim_from_evictable is called with small targets, or is preemptible,
and in those cases this issue is not visible. However, when the target is bigger
than one segment and the reclaim is not preemptible, which is he case when it's
called from allocating_section, this results in a quadratic explosion of
evictions, which can evict several hundred MiB to reclaim a few MiB.
Fix that by calculating the target of memory eviction only once, instead of
recalculating it after every compaction.
Fixes#8542.
Closes#8611
Following Nadav's advice, instead of ignoring the test
in sanitize/debug modes, the allocator simply has a special path
of failing sufficiently large allocation requests.
With that, a problem with the address sanitizer is bypassed
and other debug mode sanitizers can inspect and check
if there are no more problems related to wrapping the original
rapidjson allocator.
Closes#8539
The B+ tree is not intrusive and supports both kinds of objects --
dynamic (in sense of previous patch) and fixed-size. Respectively,
the nodes provide .storage_size() method and get the embedded object
storage size themselves. Thus, if a dynamic object is used with the
tree but it misses the .storage_size() itself this would come
unnoticed. Fortunately, dynamic objects use the .reconstruct()
method, so the method should be equipeed with the DybnamicObject
concept.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Usually lsa allocation is performed with the construct() helper that
allocates a sizeof(T) slot and constructs it in place. Some rare
objects have dynamic size, so they are created by alloc()ating a
slot of some specific size and (!) must provide the correct overload
of size_for_allocation_strategy that reports back the relevant
storage size.
This "must provide" is not enforced, if missed a default sizer would
be instantiated, but won't work properly. This patch makes all users
of alloc() conform to DynamicObject concept which requires the
presense of .storage_size() method to tell that size.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After IMR rework the only lsa-migrating functionality is standard one
that calls move constructors on lsa slots. Hide the whole thing inside
allocation-strategy.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Todays alloc() accepts migrate-fn, size and alignment. All the callers
don't really need to provide anything special for the migrate-fn and
are just happy with default alignof() for alignment. The simplification
is in providing alloc() that only accepts size arg and does the rest
itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Refs #251.
Closes#8630
* github.com:scylladb/scylla:
statistics: add global bloom filter memory gauge
statistics: add some sstable management metrics
sstables: make the `_open` field more useful
sstables: stats: noexcept all accessors