In preparation for tracking different kinds of objects in the LRU,
not just rows_entry, switch to the LRU implementation from
utils/lru.hh, which can hold arbitrary element types.
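A minimal sketch of the idea, assuming an intrusive-list LRU similar
in spirit to utils/lru.hh (the names and interface here are
illustrative, not the actual header's):

    #include <boost/intrusive/list.hpp>
    namespace bi = boost::intrusive;

    // Any element type can participate in the LRU by deriving from
    // evictable; the LRU itself no longer knows about rows_entry.
    class evictable {
        friend class lru;
        bi::list_member_hook<bi::link_mode<bi::auto_unlink>> _lru_link;
    public:
        virtual void on_evicted() = 0;
    protected:
        ~evictable() = default;
    };

    class lru {
        using list_type = bi::list<evictable,
            bi::member_hook<evictable, decltype(evictable::_lru_link),
                            &evictable::_lru_link>,
            bi::constant_time_size<false>>;
        list_type _list; // most recently used element at the front
    public:
        void add(evictable& e) { _list.push_front(e); }
        void touch(evictable& e) {
            e._lru_link.unlink();
            _list.push_front(e);
        }
        void evict_one() {
            if (!_list.empty()) {
                evictable& e = _list.back();
                e._lru_link.unlink();
                e.on_evicted();
            }
        }
    };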
The holder that is responsible for closing the
read_context before destroying it now holds it uniquely.
cache_flat_mutation_reader may be constructed either
with a read_context&, in which case it knows that the read_context
is owned externally by the caller, or
with a std::unique_ptr<read_context>, in
which case it assumes ownership of the read_context
and becomes responsible for closing it.
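A minimal sketch of the two ownership modes, assuming a stub
read_context with a close() returning future<> (names simplified,
not the actual class layout):

    #include <memory>
    #include <seastar/core/future.hh>

    struct read_context {
        seastar::future<> close() { return seastar::make_ready_future<>(); }
    };

    class cache_flat_mutation_reader {
        read_context& _read_context;              // always valid to use
        std::unique_ptr<read_context> _owned_ctx; // engaged only when owning
    public:
        // Caller keeps ownership and closes the context itself.
        explicit cache_flat_mutation_reader(read_context& ctx)
            : _read_context(ctx) { }
        // Reader takes ownership and becomes responsible for closing.
        explicit cache_flat_mutation_reader(std::unique_ptr<read_context> ctx)
            : _read_context(*ctx), _owned_ctx(std::move(ctx)) { }
        seastar::future<> close() {
            return _owned_ctx ? _owned_ctx->close()
                              : seastar::make_ready_future<>();
        }
    };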
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This is a revival of #7490.
Quoting #7490:
The managed_bytes class now uses implicit linearization: outside LSA, data is never fragmented, and within LSA, data is linearized on-demand, as long as the code is running within a with_linearized_managed_bytes() scope.
We would like to stop linearizing managed_bytes and keep it fragmented at all times, since linearization can require large contiguous chunks. Large contiguous allocations are hard to satisfy and cause latency spikes.
As a first step towards that, we remove all implicitly linearizing accessors and replace them with an explicit linearization accessor, with_linearized().
Some of the linearization happens long before use, by creating a bytes_view of the managed_bytes object and passing it onwards, perhaps storing it for later use. This does not work with with_linearized(), which creates a temporary linearized view, and does not work towards the longer term goal of never linearizing. As a substitute, a managed_bytes_view class is introduced that acts as a view over managed_bytes (for interoperability it can also be a view over bytes and is compatible with bytes_view).
By the end of the series, all linearizations are temporary, within the scope of a with_linearized() call and can be converted to fragmented consumption of the data at leisure.
This has limited practical value directly, as current uses of managed_bytes are limited to keys (which are limited to 64k). However, it enables converting the atomic_cell layer back to managed_bytes (so we can remove IMR) and the CQL layer to managed_bytes/managed_bytes_view, removing contiguous allocations from the coordinator.
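To illustrate the direction with a self-contained stand-in (this is
not Scylla's managed_bytes API; the names and shapes below are
assumptions):

    #include <cstdint>
    #include <span>
    #include <vector>

    using byte_view = std::span<const uint8_t>;

    class fragmented_bytes {
        std::vector<std::vector<uint8_t>> _fragments;
    public:
        explicit fragmented_bytes(std::vector<std::vector<uint8_t>> frags)
            : _fragments(std::move(frags)) { }
        // Explicit, temporary linearization: the contiguous buffer only
        // lives for the duration of the call, so nothing can retain it.
        template <typename Func>
        auto with_linearized(Func&& func) const {
            std::vector<uint8_t> linear;
            for (const auto& f : _fragments) {
                linear.insert(linear.end(), f.begin(), f.end());
            }
            return func(byte_view(linear));
        }
        // Fragmented consumption: iterate fragments, no big allocation.
        const std::vector<std::vector<uint8_t>>& fragments() const {
            return _fragments;
        }
    };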
Closes #7820
* github.com:scylladb/scylla:
test: add hashers_test
memtable: fix accounting of managed_bytes in partition_snapshot_accounter
test: add managed_bytes_test
utils: fragment_range: add a fragment iterator for FragmentedView
keys: update comments after changes and remove an unused method
mutation_test: use the correct preferred_max_contiguous_allocation in measuring_allocator
row_cache: more indentation fixes
utils: remove unused linearization facilities in `managed_bytes` class
misc: fix indentation
treewide: remove remaining `with_linearized_managed_bytes` uses
memtable, row_cache: remove `with_linearized_managed_bytes` uses
utils: managed_bytes: remove linearizing accessors
keys, compound: switch from bytes_view to managed_bytes_view
sstables: writer: add write_* helpers for managed_bytes_view
compound_compat: transition legacy_compound_view from bytes_view to managed_bytes_view
types: change equal() to accept managed_bytes_view
types: add parallel interfaces for managed_bytes_view
types: add to_managed_bytes(const sstring&)
serializer_impl: handle managed_bytes without linearizing
utils: managed_bytes: add managed_bytes_view::operator[]
utils: managed_bytes: introduce managed_bytes_view
utils: fragment_range: add serialization helpers for FragmentedMutableView
bytes: implement std::hash using appending_hash
utils: mutable_view: add substr()
utils: fragment_range: add compare_unsigned
utils: managed_bytes: make the constructors from bytes and bytes_view explicit
utils: managed_bytes: introduce with_linearized()
utils: managed_bytes: constrain with_linearized_managed_bytes()
utils: managed_bytes: avoid internal uses of managed_bytes::data()
utils: managed_bytes: extract do_linearize_pure()
thrift: do not depend on implicit conversion of keys to bytes_view
clustering_bounds_comparator: do not depend on implicit conversion of keys to bytes_view
cql3: expression: linearize get_value_from_mutation() earlier
bytes: add to_bytes(bytes)
cql3: expression: mark do_get_value() as static
Since `managed_bytes::data()` is deleted, as well as the other public
APIs of `managed_bytes` that would linearize stored values (except
for the explicit `with_linearized`), there is no point in
invoking the `with_linearized_managed_bytes` hack, which would trigger
automatic linearization under the hood of managed_bytes.
Remove the now-useless `with_linearized_managed_bytes` wrappers from
the memtable and row_cache code.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
An external updater may do some preparatory work, like constructing a new sstable list,
and at the end atomically replace the old list with the new one.
Decoupling the preparation from the execution gives us the following benefits:
- the preparation step can now yield if needed to avoid reactor stalls, as it's
been futurized.
- the execution step can now provide strong exception guarantees, as
it's decoupled from the preparation step, which can be non-exception-safe.
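A sketch of the resulting split, with hypothetical names (the real
updater interface may differ):

    #include <seastar/core/future.hh>

    struct sstable_list_updater {
        virtual ~sstable_list_updater() = default;
        // Preparation: futurized, may yield and may throw; builds the
        // new sstable list off to the side without touching shared state.
        virtual seastar::future<> prepare() = 0;
        // Execution: atomically swaps the old list for the new one;
        // noexcept, so the overall update keeps strong exception guarantees.
        virtual void execute() noexcept = 0;
    };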
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The classes touch each other's private data for no real
reason. Putting the interaction behind an API makes it easier
to track the usage.
* xemul/br-unfriends-in-row-cache-2:
row cache: Unfriend classes from each other
rows_entry: Move container/hooks types declarations
rows_entry: Simplify LRU unlink
mutation_partition: Define .replace_with method for rows_entry
mutation_partition: Use rows_entry::apply_monotonically
Define container types near the containing elements' hook
members, so that they can be private without the need
to friend classes with each other.
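For example, a sketch of the pattern with assumed names:

    #include <boost/intrusive/set.hpp>
    namespace bi = boost::intrusive;

    class rows_entry_like {
        // Hook and its container type live side by side; the hook stays
        // private and no other class needs friendship to name the container.
        bi::set_member_hook<> _link;
        int _key = 0;
    public:
        explicit rows_entry_like(int key) : _key(key) { }
        bool operator<(const rows_entry_like& o) const { return _key < o._key; }
        using container_type = bi::set<rows_entry_like,
            bi::member_hook<rows_entry_like, bi::set_member_hook<>,
                            &rows_entry_like::_link>>;
    };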
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The cache_tracker tries to access a private member of the
rows_entry to unlink it, but the lru_type hook is auto_unlink,
so the entry can unlink itself.
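A minimal, self-contained demonstration of the auto_unlink behavior
(illustrative only, not Scylla code):

    #include <boost/intrusive/list.hpp>
    namespace bi = boost::intrusive;

    struct node : bi::list_base_hook<bi::link_mode<bi::auto_unlink>> {
        int v = 0;
    };

    void demo() {
        node a, b;
        // auto_unlink hooks require constant_time_size<false>.
        bi::list<node, bi::constant_time_size<false>> lst;
        lst.push_back(a);
        lst.push_back(b);
        // The element detaches itself; neither the list nor anyone's
        // private members need to be touched by the caller.
        a.unlink();
    }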
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The row_cache::find_or_create is only used to put (or touch) an entry in cache
when the partition_start mutation is at hand. Thus, there's no point in carrying
a key reference and a tombstone value through the calls; the partition_start
reference is enough.
Since the new cache entry is created incomplete, rename the creation method
to reflect this.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The only caller of find_or_create() in tests works on an already existing (.populate()-d) entry,
so patch this place for explicitness and for the sake of the next patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If the key type is int64_t and the less-comparator is "natural" (i.e. it's
literally 'a < b'), we may use SIMD instructions to search for the key
on a node. Before doing so, the maybe_key and the searcher should be prepared
for that; in particular:
1. maybe_key should set unused keys to the minimal value
2. the searcher for this case should call the gt() helper with
primitive types -- an int64_t search key and an array of int64_t values
To tell the B+ code that the key/less-comparator pair is such, the less-comparator
should define the simplify_key() method converting search keys to int64_t.
This searcher is selected automatically; if any mismatch happens, it silently
falls back to the default one. Thus, also add a static assertion to the row cache
to mitigate this.
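A hypothetical comparator shape that opts in; only simplify_key() is
the mechanism described above, everything else is assumed:

    #include <cstdint>

    struct token_less {
        // "Natural" less on the primitive key: literally a < b.
        bool operator()(int64_t a, int64_t b) const noexcept {
            return a < b;
        }
        // Opt-in hook: convert any accepted search-key type to int64_t
        // so the node search can run the SIMD gt() over a plain array.
        static int64_t simplify_key(int64_t token) noexcept {
            return token;
        }
    };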
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All of them can live with a forward declaration of the f._m._r. plus a
seastar header in the commitlog code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The change is the same as with the row cache -- use a B+ tree with the
int64_t token as the key and an array of memtable_entry-s inside it.
The changes are:
Similar to those for row_cache:
- compare() goes away, new collection uses ring_position_comparator
- insertion and removal happens with the help of double_decker, most
of the places are about slightly changed semantics of it
- flags are added to memtable_entry, this makes its size larger than
it could be, but still smaller than it was before
Memtable-specific:
- when a new entry is inserted into the tree, iterators _might_ get
invalidated by the double-decker inner array. It is easy to check
when this happens, so the invalidation is avoided when possible
- the size_in_allocator_without_rows() is now not very precise. This
is because after the patch memtable_entries are not allocated
individually as they used to be. They can be squashed together with
entries having a token conflict, and asking the allocator for the
occupied memory slot is not possible. The size of the enclosing B+
data node is used as the closest (lower) estimate
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The row_cache::partitions_type is changed from boost::intrusive::set
to bplus::tree<Key = int64_t, T = array_trusted_bounds<cache_entry>>,
where the token is used to quickly locate the partition and
the internal array resolves hashing conflicts.
Summary of changes in cache_entry:
- compare() goes away, as the new collection needs a tri-compare one,
which is provided by ring_position_comparator
- on initialization, the dummy entry is added with the "after_all_keys"
kind, not "before_all_keys" as it was by default. This keeps tree
entries sorted by token
- insertion and removal of cache_entries happens inside double_decker;
most of the changes in row_cache.cc are about passing constructor args
from current_allocator.construct into double_decker.emplace_before()
- the _flags is extended to keep the array head/tail bits. There's room
for it, so sizeof(cache_entry) remains unchanged
The rest fits smoothly into the double_decker API.
Also, as noted in the previous patch, insertion and removal _may_
invalidate iterators, but may also leave them intact. However, currently
this doesn't seem to be a problem, as cache_tracker::insert() and
::on_partition_erase() invalidate iterators unconditionally.
Later this can be optimized: iterators are invalidated by double-decker
only in case of a hash conflict; otherwise it doesn't change the arrays
and the B+ tree doesn't invalidate its iterators.
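An illustrative model of the resulting two-level lookup (token
less-compare to find the bucket, key tri-compare inside it); the real
code uses bplus::tree and double_decker, not std::map and std::vector:

    #include <compare>
    #include <cstdint>
    #include <map>
    #include <vector>

    struct entry {
        int64_t token;
        int key; // stands in for the full ring position / key
    };

    using partitions_model = std::map<int64_t, std::vector<entry>>;

    const entry* find(const partitions_model& parts, int64_t token, int key) {
        auto bucket = parts.find(token);        // less-compare on the token
        if (bucket == parts.end()) {
            return nullptr;
        }
        for (const entry& e : bucket->second) { // tri-compare resolves conflicts
            auto cmp = e.key <=> key;
            if (cmp == 0) {
                return &e;
            }
            if (cmp > 0) {
                break;                          // bucket is kept sorted by key
            }
        }
        return nullptr;
    }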
tests: unit(dev), perf(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The row cache (and memtable) code uses its own comparators built on top
of the ring_position_comparator for collections of partitions. These
collections will be switched from the key less-compare to the pair
of token less-compare + key tri-compare.
Prepare for the switch by generalizing the ring_position_comparator
and by patching all the non-collection usage of less-compare to use
it.
The memtable code doesn't use it outside of collections, but patch it
anyway as part of the preparations.
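For example, a less-comparator can be derived from the tri-comparator
so non-collection call sites share a single comparison routine (a
generic sketch, not the actual Scylla helper):

    // Wraps a tri-comparator (returning <0, 0, >0) as a less-comparator.
    template <typename TriCompare>
    struct less_from_tri {
        TriCompare tri;
        template <typename A, typename B>
        bool operator()(const A& a, const B& b) const {
            return tri(a, b) < 0;
        }
    };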
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All but a few are trivially such.
The clear_continuity() calls cache_entry::set_continuous(), which became
noexcept a patch ago.
The allocator() calls region.allocator(), which was marked noexcept a few
patches back.
The on_partition_erase() calls allocator().invalidate_references(); both
were marked noexcept a few patches back.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All but one are trivially such; position() calls is_dummy_entry(),
which has just become noexcept.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All readers are soon going to require a valid permit, so make sure we
have a valid permit that we can pass to the underlying reader when
creating it. This means `row_cache::make_reader()` now also requires
a permit to be passed to it.
This patch has two goals -- speed up the total partitions
calculation (walking databases is faster than walking tables),
and get rid of the row_cache._partitions.size() call, which will
not be available with the new _partitions collection implementation.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200423133900.27818-1-xemul@scylladb.com>
Adds per-table metrics for counting partition and row reuse
in memtables. New metrics are as follows:
- memtable_partition_writes - number of write operations performed
on partitions in memtables,
- memtable_partition_hits - number of write operations performed
on partitions that previously existed in a memtable,
- memtable_row_writes - number of row write operations performed
in memtables,
- memtable_row_hits - number of row write operations that overwrote
rows previously present in a memtable.
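Registering such counters typically looks like this with the seastar
metrics API (the stats struct, group name and descriptions are
assumptions):

    #include <cstdint>
    #include <seastar/core/metrics.hh>

    namespace sm = seastar::metrics;

    struct memtable_stats {
        uint64_t partition_writes = 0;
        uint64_t partition_hits = 0;
        uint64_t row_writes = 0;
        uint64_t row_hits = 0;
    };

    void register_metrics(sm::metric_groups& mg, memtable_stats& st) {
        mg.add_group("memtable", {
            sm::make_counter("partition_writes", st.partition_writes,
                    sm::description("Write operations on memtable partitions")),
            sm::make_counter("partition_hits", st.partition_hits,
                    sm::description("Writes that hit an existing partition")),
            sm::make_counter("row_writes", st.row_writes,
                    sm::description("Row write operations in memtables")),
            sm::make_counter("row_hits", st.row_hits,
                    sm::description("Row writes that overwrote an existing row")),
        });
    }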
Tests: unit(release)
invalidate_unwrapped() calls cache_entry::evict(), which cannot be
called concurrently with cache update. invalidate() serializes it
properly by calling do_update(), but evict() doesn't. The purpose of
evict() is to stress eviction in tests, which can happen concurrently
with cache update. Switch it to use the memory reclaimer, so that it's
both correct and more realistic.
evict() is used only in tests.
dht::ring_position cannot represent all ring_position_view instances,
in particular those obtained from
dht::ring_position_view::for_range_start(). To allow using the latter,
switch to views.
Replace stdx::optional and stdx::string_view with the C++ std
counterparts.
Some instances of boost::variant were also replaced with std::variant,
namely those that called seastar::visit.
Scylla now requires GCC 8 to compile.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20190108111141.5369-1-duarte@scylladb.com>
* seastar d59fcef...b924495 (2):
> build: Fix protobuf generation rules
> Merge "Restructure files" from Jesse
Includes fixup patch from Jesse:
"
Update Seastar `#include`s to reflect restructure
All Seastar header files are now prefixed with "seastar" and the
configure script reflects the new locations of files.
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com>
"
The worker is responsible for merging MVCC snapshots, which is similar
to merging sstables, but in memory. The new scheduling group will
therefore be called "memory compaction".
We should run it in a separate scheduling group instead of
main/memtables, so that it doesn't disrupt writes and other system
activities. It's also nice for monitoring how much CPU time we spend
on this.
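Creating such a group with seastar's API looks roughly like this (the
shares value is an arbitrary assumption):

    #include <seastar/core/future.hh>
    #include <seastar/core/scheduling.hh>

    seastar::future<seastar::scheduling_group> make_memory_compaction_group() {
        // A dedicated group lets the scheduler isolate and account for
        // the CPU time spent merging MVCC snapshots.
        return seastar::create_scheduling_group("memory_compaction", 1000);
    }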
The row cache tracker has numerous implicit dependencies on other objects
(e.g. LSA migrators for data held by mutation_cleaner). The fact that
both the cache tracker and some of those dependencies are thread-local
objects makes it hard to guarantee a correct destruction order.
Let's deglobalise the cache tracker and put it in the database class.
Now all snapshots will have a mutation_cleaner which they will use to
gently destroy freed partition_version objects.
Destruction of memtable entries during cache update is also using the
gentle cleaner now. We need to have a separate cleaner for memtable
objects even though they're owned by the cache's region, because memtable
versions must be cleared without a cache_tracker.
Each memtable will have its own cleaner, which will be merged with the
cache's cleaner when memtable is merged into cache.
Fixes some sources of reactor stalls on cache update when there are
large partition entries in memtables.
Instead of destroying whole partition_versions at once, we will do that
gently using mutation_cleaner to avoid reactor stalls.
Large deletions could happen when a large partition gets invalidated,
upgraded to a new schema, or when it's abandoned by a detached snapshot.
Refs #3289.
Partitions can get very large. Destroying them all at once can stall
the reactor for a significant amount of time. We want to avoid that by
doing the destruction incrementally, deferring in between. A new API is
added for that at various levels:
stop_iteration clear_gently() noexcept;
It returns stop_iteration::yes when the object is fully cleared and
can now be destroyed quickly. So a deferring destruction can look like
this:
return repeat([this] { return clear_gently(); });
The reason why clear_gently() doesn't return a future<> itself is that some
contexts cannot defer, like memory reclamation.
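For example, an incremental clear for a container of heavy elements
might look like this (a self-contained sketch; the real code uses
seastar's stop_iteration):

    #include <deque>
    #include <vector>

    enum class stop_iteration { no, yes }; // stand-in for seastar::stop_iteration

    struct heavy_partition {
        std::vector<char> data;
    };

    struct partition_list {
        std::deque<heavy_partition> _parts;

        // Destroy at most one element per call; the caller defers (or a
        // reclaimer retries) between steps instead of stalling the reactor.
        stop_iteration clear_gently() noexcept {
            if (_parts.empty()) {
                return stop_iteration::yes; // fully cleared, cheap to destroy
            }
            _parts.pop_front();
            return stop_iteration::no;
        }
    };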
Will be used in row_cache_alloc_stress to unlink partitions which we
don't want to get evicted, instead of repeatedly calling touch() on
them after each subsequent population. After switching to row-level
LRU, doing so greatly increases run time of the test due to quadratic
behavior.
Instead of evicting whole partitions, evict whole rows.
As part of this, invalidation of partition entries was changed to not
evict from snapshots right away, but unlink them and let them be
evicted by the reclaimer.
We will need to propagate a cache_tracker reference to evict(). Instead
of evicting from destructor, do so before cache_entry gets unlinked
from the tree. Entries which are not linked don't need to be
explicitly evicted.