This will prevent accumulation of unnecessary dummy entries.
A single-partition populating scan with clustering key restrictions
will insert dummy entries positioned at the boundaries of the
clustering query range to mark the newly populated range as
continuous.
Those dummy entries may accumulate with time, increasing the cost of
the scan, which needs to walk over them.
In some workloads we could prevent this. If a populating query
overlaps with dummy entries, we could erase the old dummy entry, since
it will no longer be needed: it will fall inside a broader continuous
range. This is the case for time series workloads which scan with
a decreasing (newest) lower bound.
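A minimal illustrative sketch of the idea, with toy types rather than the real row_cache structures (clustering positions as plain ints, a std::map as the partition): once a populating scan covers [lo, hi], any pre-existing dummy inside that range is redundant and can be erased instead of accumulating.
```
#include <cassert>
#include <map>

enum class kind { dummy, row };
using partition = std::map<int, kind>;   // toy: clustering position -> entry kind

// Hypothetical helper: mark [lo, hi] as populated and continuous.
void populate_range(partition& p, int lo, int hi) {
    // Dummies inside the newly continuous range carry no information anymore.
    for (auto it = p.lower_bound(lo); it != p.end() && it->first <= hi; ) {
        if (it->second == kind::dummy) {
            it = p.erase(it);
        } else {
            ++it;
        }
    }
    p.emplace(hi, kind::dummy);          // single boundary marker for the range
}

int main() {
    partition p;
    populate_range(p, 50, 100);          // first scan leaves a dummy at 100
    populate_range(p, 0, 100);           // broader scan erases it and re-adds one
    assert(p.size() == 1);               // dummies do not pile up
}
```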
Refs #8153.
_last_row is now updated atomically with _next_row. Before, _last_row
was moved first. If an exception was thrown and the section was retried,
this could cause the wrong entry to be removed (new next instead of
old last) by the new algorithm. I don't think this was causing
problems before this patch.
The problem is not solved in all cases. After this patch, we
remove dummies only when there is a single MVCC version. We could
patch apply_monotonically() to also do it, so that dummies which are
inside continuous ranges are eventually removed, but this is left for
later.
perf_row_cache_reads output after that patch shows that the second
scan touches no dummies:
```
$ build/release/test/perf/perf_row_cache_reads_g -c1 -m200M
Rows in cache: 0
Populating with dummy rows
Rows in cache: 265320
Scanning
read: 142.621613 [ms], preemption: {count: 639, 99%: 0.545791 [ms], max: 0.526929 [ms]}, cache: 0/0 [MB]
read: 0.023197 [ms], preemption: {count: 1, 99%: 0.035425 [ms], max: 0.032736 [ms]}, cache: 0/0 [MB]
```
Message-Id: <20210226172801.800264-1-tgrabiec@scylladb.com>
fill_buffer() will keep scanning until _lower_bound_changed is true,
even if preemption is signaled, so that the reader makes forward
progress.
Before the patch, we did not update _lower_bound on touching a dummy
entry, so the read would not respect preemption until it hit a non-dummy
row. If there are a lot of dummy rows, that can cause reactor stalls.
Fix that by updating _lower_bound on dummy entries as well.
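A toy sketch of the loop shape this describes (simplified types, not the actual cache_flat_mutation_reader): once the lower bound advances on dummy entries too, the "scan until the lower bound changed" condition is satisfied quickly and the preemption check can take effect.
```
#include <cstddef>
#include <cstdio>
#include <vector>

struct entry { int position; bool dummy; };

struct toy_reader {
    std::vector<entry> entries;          // stands in for the cached rows
    size_t idx = 0;
    int lower_bound = -1;
    bool lower_bound_changed = false;

    // fake scheduler signal standing in for seastar::need_preempt()
    bool need_preempt() const { return idx != 0 && idx % 4 == 0; }

    void fill_buffer() {
        lower_bound_changed = false;
        while (idx < entries.size() && !(lower_bound_changed && need_preempt())) {
            const entry& e = entries[idx++];
            // after the fix: the lower bound advances on dummies as well,
            // so the loop observes forward progress and can stop to yield
            lower_bound = e.position;
            lower_bound_changed = true;
            if (!e.dummy) {
                std::printf("emit row at %d\n", e.position);
            }
        }
    }
};

int main() {
    toy_reader r{{{0, true}, {1, true}, {2, true}, {3, true}, {4, false}, {5, true}}};
    while (r.idx < r.entries.size()) {
        r.fill_buffer();                 // each call now does a bounded amount of work
    }
}
```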
Refs #8153.
Tested with perf_row_cache_reads:
```
$ build/release/test/perf/perf_row_cache_reads -c1 -m200M
Rows in cache: 0
Populating with dummy rows
Rows in cache: 373929
Scanning
read: 183.658966 [ms], preemption: {count: 848, 99%: 0.545791 [ms], max: 0.519343 [ms]}, cache: 99/100 [MB]
read: 120.951515 [ms], preemption: {count: 257, 99%: 0.545791 [ms], max: 0.518795 [ms]}, cache: 99/100 [MB]
```
Notice that max preemption latency is low in the second "read:" line.
Closes #8167
* github.com:scylladb/scylla:
row_cache: Make fill_buffer() preemptable when cursor leads with dummy rows
tests: perf: Introduce perf_row_cache_reads
row_cache: Add metric for dummy row hits
Instead of resetting _reader in scanning_and_populating_reader::fill_buffer
in the `reader_finished` case, use a gentler `_read_next_partition` flag,
based on which `read_next_partition` will be called in the next iteration.
Then, read_next_partition can close _reader only just before overwriting it
with a new reader. Otherwise, if _reader is always closed in the
`reader_finished` case, we end up hitting a premature end_of_stream.
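A toy illustration of the ordering (hypothetical simplified types, not the real scanning_and_populating_reader): reader_finished only raises a flag, and the old reader is closed lazily in read_next_partition(), immediately before being replaced, so the code in between never observes a missing reader and a premature end_of_stream.
```
#include <cstdio>
#include <optional>

struct toy_reader {
    int partition;
    void close() { std::printf("closing reader for partition %d\n", partition); }
};

struct scanning_reader {
    std::optional<toy_reader> reader;
    bool read_next_partition_flag = false;   // mirrors the new _read_next_partition flag
    int next_partition = 0;

    void on_reader_finished() {
        // before the patch: reader.reset() here, which later read as end_of_stream
        read_next_partition_flag = true;
    }

    void read_next_partition() {
        if (reader) {
            reader->close();                 // close the old reader only now...
        }
        reader.emplace(toy_reader{next_partition++});  // ...just before replacing it
        read_next_partition_flag = false;
    }

    void fill_buffer() {
        if (read_next_partition_flag || !reader) {
            read_next_partition();
        }
        std::printf("reading from partition %d\n", reader->partition);
        on_reader_finished();                // this partition is done
    }
};

int main() {
    scanning_reader r;
    r.fill_buffer();
    r.fill_buffer();
    r.fill_buffer();
}
```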
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210215101254.480228-30-bhalevy@scylladb.com>
The switch is pretty straightforward, and consists of
- changing less-compare into tri-compare (a small sketch of the difference
follows this list)
- renaming insert/insert_check into insert_before_hint
- using tree::key_grabber in mutation_partition::apply_monotonically to
exception-safely transfer a row from one tree to another
- explicitly erasing the row from the tree in rows_entry::on_evicted; there's
an O(1) tree::iterator method for this
- rewriting the rows_entry -> cache_entry transformation in on_evicted to
fit the B-tree API
- including the B-tree's external memory usage in the stats
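For illustration, the less-compare vs tri-compare difference, on a made-up memcmp-able key (the real comparator is the clustering key one): a three-way result answers less/equal/greater in a single call, which is what insert_before_hint-style lookups want.
```
#include <algorithm>
#include <cassert>
#include <compare>
#include <cstddef>
#include <cstring>

struct key { const char* data; std::size_t size; };   // toy key, memcmp-able bytes

// tri-compare: one call decides <, == and >, unlike a bool less() predicate
std::strong_ordering tri_compare(const key& a, const key& b) {
    int r = std::memcmp(a.data, b.data, std::min(a.size, b.size));
    if (r != 0) {
        return r < 0 ? std::strong_ordering::less : std::strong_ordering::greater;
    }
    return a.size <=> b.size;
}

int main() {
    key a{"abc", 3}, b{"abd", 3};
    assert(tri_compare(a, b) == std::strong_ordering::less);
    assert(tri_compare(a, a) == std::strong_ordering::equal);
}
```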
That's it. The number of keys per node is set to 12 with linear search
and a linear extension of 20 because
- experimenting with the tree shows that 8 through 10 keys with linear
search give the best performance on stress tests for insert/find-s of
keys that are memcmp-able arrays of bytes (which is an approximation of
the current clustering key compare). More keys work slower, but still better
than any bigger value with any type of search up to 64 keys per node
- having 12 keys per node is the threshold at which the memory footprint
of the B-tree becomes smaller than that of boost::intrusive::set for partitions
with 32+ keys
- 20 keys for the linear root eats the first-split peak and still performs
well with linear search
As a result, the footprint of the B-tree is bigger than that of the BST only for
trees filled with 21...32 keys, and only by 0.1...0.7 bytes per key.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The main motivation for this patchset is to prepare
for adding an async close() method to flat_mutation_reader.
In order to close the reader before destroying it
in all paths we need to make next_partition asynchronous
so it can asynchronously close a current reader before
destroying it, e.g. by reassignment of flat_mutation_reader_opt,
as done in scanning_reader::next_partition.
Test: unit(release, debug)
* git@github.com:bhalevy/scylla.git futurize-next-partition-v1:
flat_mutation_reader: return future from next_partition
multishard_mutation_query: read_context: save_reader: destroy reader_meta from the calling shard
mutation_reader: filtering_reader: fill_buffer: futurize inner loop
flat_mutation_reader::impl: consumer_adapter: futurize handle_result
flat_mutation_reader: consume_pausable/in_thread: futurize_invoke consumer
flat_mutation_reader: FlatMutationReaderConsumer: support also async consumer
flat_mutation_reader:impl: get rid of _consume_done member
This is a revival of #7490.
Quoting #7490:
The managed_bytes class now uses implicit linearization: outside LSA, data is never fragmented, and within LSA, data is linearized on-demand, as long as the code is running within with_linearized_managed_bytes() scope.
We would like to stop linearizing managed_bytes and keep it fragmented at all times, since linearization can require large contiguous chunks. Large contiguous allocations are hard to satisfy and cause latency spikes.
As a first step towards that, we remove all implicitly linearizing accessors and replace them with an explicit linearization accessor, with_linearized().
Some of the linearization happens long before use, by creating a bytes_view of the managed_bytes object and passing it onwards, perhaps storing it for later use. This does not work with with_linearized(), which creates a temporary linearized view, and does not work towards the longer term goal of never linearizing. As a substitute a managed_bytes_view class is introduced that acts as a view for managed_bytes (for interoperability it can also be a view for bytes and is compatible with bytes_view).
By the end of the series, all linearizations are temporary, within the scope of a with_linearized() call and can be converted to fragmented consumption of the data at leisure.
This has limited practical value directly, as current uses of managed_bytes are limited to keys (which are limited to 64k). However, it enables converting the atomic_cell layer back to managed_bytes (so we can remove IMR) and the CQL layer to managed_bytes/managed_bytes_view, removing contiguous allocations from the coordinator.
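The difference is easy to see with a toy fragmented value (plain vectors here, not the actual managed_bytes/managed_bytes_view classes): consuming fragment by fragment never needs a large contiguous buffer, while the linearizing equivalent must materialize one first.
```
#include <cassert>
#include <cstdint>
#include <vector>

using fragment = std::vector<std::uint8_t>;

// fragmented consumption: no large contiguous allocation is ever needed
std::uint64_t checksum_fragmented(const std::vector<fragment>& fragments) {
    std::uint64_t sum = 0;
    for (const fragment& f : fragments) {
        for (std::uint8_t b : f) {
            sum = sum * 31 + b;
        }
    }
    return sum;
}

// the linearizing equivalent must build one big contiguous buffer first
std::uint64_t checksum_linearized(const std::vector<fragment>& fragments) {
    fragment flat;
    for (const fragment& f : fragments) {
        flat.insert(flat.end(), f.begin(), f.end());   // potentially large allocation
    }
    return checksum_fragmented({flat});
}

int main() {
    std::vector<fragment> value = {{1, 2, 3}, {4, 5}, {6}};
    assert(checksum_fragmented(value) == checksum_linearized(value));
}
```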
Closes #7820
* github.com:scylladb/scylla:
test: add hashers_test
memtable: fix accounting of managed_bytes in partition_snapshot_accounter
test: add managed_bytes_test
utils: fragment_range: add a fragment iterator for FragmentedView
keys: update comments after changes and remove an unused method
mutation_test: use the correct preferred_max_contiguous_allocation in measuring_allocator
row_cache: more indentation fixes
utils: remove unused linearization facilities in `managed_bytes` class
misc: fix indentation
treewide: remove remaining `with_linearized_managed_bytes` uses
memtable, row_cache: remove `with_linearized_managed_bytes` uses
utils: managed_bytes: remove linearizing accessors
keys, compound: switch from bytes_view to managed_bytes_view
sstables: writer: add write_* helpers for managed_bytes_view
compound_compat: transition legacy_compound_view from bytes_view to managed_bytes_view
types: change equal() to accept managed_bytes_view
types: add parallel interfaces for managed_bytes_view
types: add to_managed_bytes(const sstring&)
serializer_impl: handle managed_bytes without linearizing
utils: managed_bytes: add managed_bytes_view::operator[]
utils: managed_bytes: introduce managed_bytes_view
utils: fragment_range: add serialization helpers for FragmentedMutableView
bytes: implement std::hash using appending_hash
utils: mutable_view: add substr()
utils: fragment_range: add compare_unsigned
utils: managed_bytes: make the constructors from bytes and bytes_view explicit
utils: managed_bytes: introduce with_linearized()
utils: managed_bytes: constrain with_linearized_managed_bytes()
utils: managed_bytes: avoid internal uses of managed_bytes::data()
utils: managed_bytes: extract do_linearize_pure()
thrift: do not depend on implicit conversion of keys to bytes_view
clustering_bounds_comparator: do not depend on implicit conversion of keys to bytes_view
cql3: expression: linearize get_value_from_mutation() earlier
bytes: add to_bytes(bytes)
cql3: expression: mark do_get_value() as static
The patch fixes indentation issues introduced in previous patches
related to removing `with_linearized_managed_bytes` uses from the
code tree.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Since `managed_bytes::data()` is deleted, as well as the other public
APIs of `managed_bytes` which would linearize stored values (except
for the explicit `with_linearized`), there is no point in
invoking the `with_linearized_managed_bytes` hack, which would trigger
automatic linearization under the hood of managed_bytes.
Remove the now-useless `with_linearized_managed_bytes` wrapper from
the memtable and row_cache code.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
An external updater may do some preparatory work, like constructing a new sstable list,
and at the end atomically replace the old list with the new one.
Decoupling the preparation from the execution gives us the following benefits
(a small prepare/execute sketch follows this list):
- the preparation step can now yield if needed to avoid reactor stalls, as it has
been futurized.
- the execution step can now provide strong exception guarantees, as
it is decoupled from the preparation step, which can be non-exception-safe.
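A minimal sketch of the split, with hypothetical names rather than the real updater API: prepare() does the potentially slow, throwing work off to the side, and execute() is a noexcept swap of the visible list, giving the strong exception guarantee.
```
#include <memory>
#include <string>
#include <utility>
#include <vector>

using sstable_list = std::vector<std::string>;   // toy stand-in for the sstable set

struct prepared_update {
    std::shared_ptr<sstable_list> new_list;
};

// may allocate, may throw, and in the real code may also yield between steps
prepared_update prepare(const sstable_list& old_list, std::string added) {
    auto next = std::make_shared<sstable_list>(old_list);
    next->push_back(std::move(added));
    return {std::move(next)};
}

// nothing can fail here: the visible list is replaced atomically
void execute(std::shared_ptr<sstable_list>& current, prepared_update&& update) noexcept {
    current = std::move(update.new_list);
}

int main() {
    auto current = std::make_shared<sstable_list>(sstable_list{"a", "b"});
    auto update = prepare(*current, "c");    // preparation is decoupled...
    execute(current, std::move(update));     // ...from the atomic replacement
}
```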
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The main user of this method, the one which required this method to
return the collective buffer size of the entire reader tree, is now
gone. The remaining two users just use it to check the size of the
reader instance they are working with.
So de-virtualize this method and reduce its responsibility to just
returning the buffer size of the current reader instance.
Not used yet, this patch does all the churn of propagating a permit
to each impl.
In the next patch we will use it to track the memory
consumption of `_buffer`.
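A toy sketch of that intent (hypothetical names, not the real reader_permit): the permit charges a shared counter for the bytes sitting in the reader's buffer and credits it back when they are consumed.
```
#include <cassert>
#include <cstddef>

struct toy_permit {
    size_t* total;                           // shared counter owned by a "semaphore"
    void consume(size_t n) { *total += n; }
    void release(size_t n) { *total -= n; }
};

struct toy_reader {
    toy_permit permit;
    size_t buffer_bytes = 0;
    void push(size_t fragment_size) {
        buffer_bytes += fragment_size;
        permit.consume(fragment_size);       // charge buffered bytes to the permit
    }
    void pop(size_t fragment_size) {
        buffer_bytes -= fragment_size;
        permit.release(fragment_size);       // credit them back once consumed
    }
};

int main() {
    size_t tracked = 0;
    toy_reader reader{{&tracked}};
    reader.push(128);
    reader.push(64);
    assert(tracked == 192);
    reader.pop(128);
    assert(tracked == 64);
}
```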
The classes touch each other's private data for no real
reason. Putting the interaction behind an API makes it easier
to track the usage.
* xemul/br-unfriends-in-row-cache-2:
row cache: Unfriend classes from each other
rows_entry: Move container/hooks types declarations
rows_entry: Simplify LRU unlink
mutation_partition: Define .replace_with method for rows_entry
mutation_partition: Use rows_entry::apply_monotonically
The cache_tracker tries to access a private member of the
rows_entry to unlink it, but the lru_type is auto_unlink
and can unlink itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The partitions_type::lower_bound() method can return a hint that saves
info about the "lower-ness of the bound"; in particular, when the search
key is found, this can be deduced from the hint without an extra comparison.
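A sketch of the idea on top of std::map (the real hint type belongs to the B+ tree, where the match is remembered from the descent): the lookup returns, along with the iterator, whether the search key matched exactly, so the caller can skip re-comparing.
```
#include <cassert>
#include <map>
#include <utility>

struct lb_hint { bool match; };              // toy stand-in for the tree's hint type

template <typename Map, typename Key>
std::pair<typename Map::iterator, lb_hint> lower_bound_with_hint(Map& m, const Key& k) {
    auto it = m.lower_bound(k);
    // the real B+ tree remembers this from the descent instead of re-comparing
    bool match = it != m.end() && !(k < it->first);
    return {it, lb_hint{match}};
}

int main() {
    std::map<int, int> m{{1, 10}, {3, 30}};
    auto [it1, hint1] = lower_bound_with_hint(m, 3);
    assert(hint1.match && it1->second == 30);
    auto [it2, hint2] = lower_bound_with_hint(m, 2);
    assert(!hint2.match && it2->first == 3);
}
```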
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The row_cache::find_or_create is only used to put (or touch) an entry in the cache
while having the partition_start mutation at hand. Thus, there's no point in carrying
the key reference and tombstone value through the calls; the partition_start
reference alone is enough.
Since the new cache entry is created incomplete, rename the creation method
to reflect this.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The only caller of find_or_create() in tests works on an already existing (.populate()-d) entry,
so patch this place for explicitness and for the sake of the following patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now that the key for the new partition is copied inside do_find_or_create_entry, we may call
this function without the allocator set, as it sets the allocator internally.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the missing partition is created in the cache, the decorated key is copied from
the ring position view too early -- to do the lookup. However, the read context
has already entered the partition and already has the decorated key on board,
so for the lookup we can use the reference.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The change is the same as with the row cache -- use a B+ tree with the int64_t token
as the key and an array of memtable_entry-s inside it.
The changes are:
Similar to those for row_cache:
- compare() goes away, new collection uses ring_position_comparator
- insertion and removal happens with the help of double_decker, most
of the places are about slightly changed semantics of it
- flags are added to memtable_entry, this makes its size larger than
it could be, but still smaller than it was before
Memtable-specific:
- when a new entry is inserted into the tree, iterators _might_ get
invalidated by the double-decker inner array. It is easy to check
when this happens, so the invalidation is avoided when possible
- size_in_allocator_without_rows() is now not very precise. This
is because after the patch memtable_entries are not allocated
individually as they used to be. They can be squashed together with
entries having a token conflict, and asking the allocator for the occupied
memory slot is not possible. As the closest (lower) estimate, the
size of the enclosing B+ data node is used
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The row_cache::partitions_type is changed from boost::intrusive::set
to bplus::tree<Key = int64_t, T = array_trusted_bounds<cache_entry>>,
where the token is used to quickly locate the partition and
the internal array resolves hashing conflicts.
Summary of changes in cache_entry:
- compare() goes away, as the new collection needs a tri-compare one, which
is provided by ring_position_comparator
- when initialized the dummy entry is added with "after_all_keys" kind,
not "before_all_keys" as it was by default. This is to make tree
entries sorted by token
- insertion and removal of cache_entries happens inside double_decker;
most of the changes in row_cache.cc are about passing constructor args
from current_allocator.construct into double_decker.emplace_before()
- _flags is extended to keep the array head/tail bits. There's room
for them; sizeof(cache_entry) remains unchanged
The rest fits smoothly into the double_decker API.
Also, as noted in the previous patch, insertion and removal _may_
invalidate iterators, but may also leave them intact. However, currently
this doesn't seem to be a problem, as the cache_tracker ::insert() and
::on_partition_erase do invalidate iterators unconditionally.
Later this can be optimized, as iterators are invalidated by double-decker
only in case of a hash conflict; otherwise it doesn't change arrays and
the B+ tree doesn't invalidate its iterators.
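A toy model of that layout (hypothetical names; the real double_decker/bplus code is intrusive and LSA-managed): the outer tree is keyed by the int64_t token alone, and each slot holds a small sorted array of entries whose keys collide on that token.
```
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct toy_entry {
    std::string key;    // full partition key, only consulted on token collisions
    int value;
};

// token -> collision array (the real inner container is the double-decker array)
using toy_tree = std::map<std::int64_t, std::vector<toy_entry>>;

void insert(toy_tree& t, std::int64_t token, toy_entry e) {
    auto& arr = t[token];
    auto it = std::lower_bound(arr.begin(), arr.end(), e.key,
            [](const toy_entry& x, const std::string& k) { return x.key < k; });
    arr.insert(it, std::move(e));     // note: may shift (== invalidate) array positions
}

const toy_entry* find(const toy_tree& t, std::int64_t token, const std::string& key) {
    auto it = t.find(token);          // fast path: locate by token alone
    if (it == t.end()) {
        return nullptr;
    }
    for (const toy_entry& e : it->second) {   // collisions are rare, the array is tiny
        if (e.key == key) {
            return &e;
        }
    }
    return nullptr;
}

int main() {
    toy_tree t;
    insert(t, 42, {"pk1", 1});
    insert(t, 42, {"pk2", 2});        // same token, different key
    assert(find(t, 42, "pk2")->value == 2);
    assert(find(t, 7, "pk1") == nullptr);
}
```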
tests: unit(dev), perf(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The row cache (and memtable) code uses its own comparators, built on top
of the ring_position_comparator, for collections of partitions. These
collections will be switched from the key less-compare to the pair
of token less-compare + key tri-compare.
Prepare for the switch by generalizing the ring_position_comparator
and by patching all the non-collection usages of less-compare to use
it.
The memtable code doesn't use it outside of collections, but patch it
anyway as a part of preparations.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All but a few are trivially such.
clear_continuity() calls cache_entry::set_continuous(), which became noexcept
a patch ago.
allocator() calls region.allocator(), which was marked noexcept a few patches
back.
on_partition_erase() calls allocator().invalidate_references(); both were
marked noexcept a few patches back.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is relevant only when using partition or clustering keys whose
in-memory representation is larger than 12.8 KB (10% of the
LSA segment size).
There are several places in the code (cache, background garbage
collection) which may need to linearize keys in order to perform key
comparison, but it's not done safely:
1) the code does not run with the LSA region locked, so pointers may
get invalidated on linearization if it needs to reclaim memory. This
is fixed by running the code inside an allocating section.
2) the LSA region is locked, but the scope of
with_linearized_managed_bytes() encloses the allocating section. If
the allocating section needs to reclaim, the linearization context will
contain invalidated pointers. The fix is to reorder the scopes so
that the linearization context lives within the allocating section.
Example of 1 can be found in
range_populating_reader::handle_end_of_stream() where it performs a
lookup:
```
auto prev = std::prev(it);
if (prev->key().equal(*_cache._schema, *_last_key->_key)) {
    it->set_continuous(true);
```
but handle_end_of_stream() is not invoked under an allocating section.
Example of 2 can be found in mutation_cleaner_impl::merge_some() where
it does:
```
return with_linearized_managed_bytes([&] {
    ...
    return _worker_state->alloc_section(region, [&] {
```
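For illustration only (not a verbatim patch), the reordering described above makes the linearization context live inside the allocating section:
```
return _worker_state->alloc_section(region, [&] {
    return with_linearized_managed_bytes([&] {
        ...
    });
});
```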
Fixes #6637.
Refs #6108.
Tests:
- unit (all)
Message-Id: <1592218544-9435-1-git-send-email-tgrabiec@scylladb.com>
All readers are soon going to require a valid permit, so make sure we
have a valid permit which we can pass to the underlying reader when
creating it. This means `row_cache::make_reader()` now also requires
a permit to be passed to it.
Replace it with std::tuple and introduce the range_populating_reader::read_result
type alias for fewer keystrokes.
This makes row_cache.o compilation warn-less.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200518160511.26984-1-xemul@scylladb.com>
We typically use `std::bad_function_call` to throw from
mandatory-to-implement virtual functions that cannot have a meaningful
implementation in the derived class. The problem with
`std::bad_function_call` is that it carries absolutely no information
about where it was thrown from.
I originally wanted to replace `std::bad_function_call` in our codebase
with a custom exception type that would allow passing in the name of the
function it is thrown from to be included in the exception message.
However after I ended up also including a backtrace, Benny Halevy
pointed out that I might as well just throw `std::bad_function_call` with
a backtrace instead. So this is what this patch does.
All users are various unimplemented methods of the
`flat_mutation_reader::impl` interface.
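A minimal sketch of the resulting pattern, assuming Seastar's throw_with_backtrace helper from seastar/util/backtrace.hh (the method name below is just an example, not the exact call sites of the patch): the unimplemented virtual still throws std::bad_function_call, but the exception now carries a backtrace.
```
#include <functional>                    // std::bad_function_call
#include <seastar/util/backtrace.hh>     // seastar::throw_with_backtrace

struct reader_impl {
    virtual ~reader_impl() = default;
    // mandatory-to-implement method with no meaningful default implementation
    virtual void fast_forward_to() {
        // throws std::bad_function_call decorated with the current backtrace
        seastar::throw_with_backtrace<std::bad_function_call>();
    }
};
```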
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200408075801.701416-1-bdenes@scylladb.com>
The former was never really more than a reader_permit with one
additional method. Currently using it doesn't even save one from any
includes. Now that readers will be using reader_permit we would have to
pass down both to mutation_source. Instead get rid of
reader_resource_tracker and just use reader_permit. Instead of making it
the last, optional parameter that is easy to ignore, make it a
first-class parameter, right after the schema, to signify that permits are
now a prominent part of the reader API.
This -- mostly mechanical -- patch essentially refactors mutation_source
to ask for the reader_permit instead of reader_resource_tracking and
updates all usage sites.
Since 90d6c0b, the cache will abort when trying to detach partition
entries while they're being updated. This should never happen. It can happen,
though, when the update fails on bad_alloc, because the cleanup guard
invalidates the cache before it releases partition snapshots (held by
the "update" coroutine).
Fix by destroying the coroutine first.
Fixes #5327.
Tests:
- row_cache_test (dev)
Message-Id: <1574360259-10132-1-git-send-email-tgrabiec@scylladb.com>
Adds per-table metrics for counting partition and row reuse
in memtables. New metrics are as follows:
- memtable_partition_writes - number of write operations performed
on partitions in memtables,
- memtable_partition_hits - number of write operations performed
on partitions that previously existed in a memtable,
- memtable_row_writes - number of row write operations performed
in memtables,
- memtable_row_hits - number of row write operations that overwrote
rows previously present in a memtable.
Tests: unit(release)
Cache update may defer in the middle of moving a partition entry
from a flushed memtable to the cache. If the schema was changed since
the entry was written, it upgrades the schema of the partition_entry
first but doesn't update the schema_ptr in memtable_entry. The entry
is removed from the memtable afterward. If a memtable reader
encounters such an entry, it will try to upgrade it assuming it's
still at the old schema.
That is undefined behavior in general, which may include:
- read failures due to bad_alloc, if fixed-size cells are interpreted
as variable-sized cells, and we misinterpret a value for a huge
size
- wrong read results
- node crash
This doesn't result in a permanent corruption, restarting the node
should help.
The more rows there are in a partition, the more likely it is to
happen. It's unlikely to happen with single-row partitions.
Introduced in 70c7277.
Fixes #5128.
If the whole partition entry is evicted while being updated from the
memtable, a subsequent read may populate the partition using the old
version of data if it attempts to do it before cache update advances
past that partition. Partial eviction is not affected because
populating reads will notice that there is a newer snapshot
corresponding to the updater.
This can happen only in OOM situations where the whole cache gets evicted.
Affects only tables with multi-row partitions, which are the only ones
that can experience the update of partition entry being preempted.
Introduced in 70c7277.
Fixes #5134.
invalidate_unwrapped() calls cache_entry::evict(), which cannot be
called concurrently with cache update. invalidate() serializes it
properly by calling do_update(), but evict() doesn't. The purpose of
evict() is to stress eviction in tests, which can happen concurrently
with cache update. Switch it to use memory reclaimer, so that it's
both correct and more realistic.
evict() is used only in tests.
When a read enters a partition entry in the cache, it first upgrades
it to the current schema of the cache. The same happens when an entry
is updated after a memtable flush. Upgrading the entry is currently
performed by squashing all versions and replacing them with a single
upgraded version. That has a side effect of detaching all snapshots
from the partition entry. Partition entry update on memtable flush is
writing into a snapshot. If that snapshot is detached by a schema
upgrade, the entry will be missing writes from the memtable which fall
into continuous ranges in that entry which have not yet been updated.
This can happen only if the update of the entry is preempted and the
schema was altered during that, and a read hit that partition before
the update went past it.
Affects only tables with multi-row partitions, which are the only ones
that can experience the update of partition entry being preempted.
The problem is fixed by locking updated entries and not upgrading
the schema of locked entries. cache_entry::read() is prepared for this,
and will upgrade on-the-fly to the cache's schema.
Fixes #5135
This change inserts preemption points between removals of partitions.
The main complication is in maintaining consistency in the face of
concurrent population or eviction. We use the same mechanism which is
used by memtable updates. _prev_snapshot_pos is the ring position
which partitions the ring into the part which is already updated in
cache and the one which is yet to be updated. That position should be
set accordingly on preemption.
In case of invalidation, updating means removing all entries in the
range and marking the range as discontinuous. When resuming
invalidation of a range we continue from _prev_snapshot_pos as the
lower bound.
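A toy model of the invalidation loop (a hypothetical simplification; the real code works on cache partitions under the LSA): after each preemption point, the position separating the already-invalidated part of the range from the rest is published, so concurrent populations know which side they may trust.
```
#include <cstdio>
#include <map>
#include <optional>

struct toy_cache {
    std::map<int, int> partitions;                 // key -> partition payload
    std::optional<int> prev_snapshot_pos;          // mirrors _prev_snapshot_pos

    void invalidate_range(int lo, int hi, int chunk) {
        prev_snapshot_pos = lo;
        auto it = partitions.lower_bound(lo);
        int done = 0;
        while (it != partitions.end() && it->first <= hi) {
            it = partitions.erase(it);
            if (++done % chunk == 0) {
                // preemption point: record where to resume, then yield
                prev_snapshot_pos = (it == partitions.end()) ? hi : it->first;
                std::printf("yield at %d\n", *prev_snapshot_pos);
            }
        }
        prev_snapshot_pos.reset();                 // invalidation of the range finished
    }
};

int main() {
    toy_cache c;
    for (int i = 0; i < 10; i++) {
        c.partitions.emplace(i, i);
    }
    c.invalidate_range(0, 9, 3);
}
```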
This affects high-level operations like nodetool refresh, table
truncation, repair and streaming.
Fixes #2683
The improvement on stalls was measured using tests/perf_row_cache_update:
Before:
```
Small partitions, no overwrites:
invalidation: 339.420624 [ms], preemption: {count: 2, 99%: 0.008239 [ms], max: 339.422144 [ms]}
Small partition with a few rows:
invalidation: 191.855331 [ms], preemption: {count: 2, 99%: 0.008239 [ms], max: 191.856816 [ms]}
Large partition, lots of small rows:
invalidation: 0.959328 [ms], preemption: {count: 2, 99%: 0.008239 [ms], max: 0.961453 [ms]}
```
After:
```
Small partitions, no overwrites:
invalidation: 400.505554 [ms], preemption: {count: 843, 99%: 0.545791 [ms], max: 0.502340 [ms]}
Small partition with a few rows:
invalidation: 306.352600 [ms], preemption: {count: 644, 99%: 0.545791 [ms], max: 0.506464 [ms]}
Large partition, lots of small rows:
invalidation: 0.963660 [ms], preemption: {count: 2, 99%: 0.009887 [ms], max: 0.963264 [ms]}
```
The maximum scheduling latency went down from 339 ms to 0.5 ms (task quota).
dht::ring_position cannot represent all ring_position_view instances,
in particular those obtained from
dht::ring_position_view::for_range_start(). To allow using the latter,
switch to views.