Memtable entries should be cleaned using the memtable cleaner, which,
unlike the cache's cleaner, is not associated with the cache
tracker. It's an error to clean a snapshot using a tracker which doesn't
own the entries; doing so will corrupt the cache tracker's row counter.
Fixes failure of test_exception_safety_of_update_from_memtable from
row_cache.cc in debug mode and with allocation failure injection
enabled.
Introduced in "cache: Defer during partition merging"
(70c72773be).
Message-Id: <1528988256-20578-1-git-send-email-tgrabiec@scylladb.com>
Incremental merging will be implemented by means of resumable
functions, which return stop_iteration::no when not yet
finished. We're not using futures, so that the caller can do work
around preemption points as well.
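A minimal sketch of the resumable-function shape described above (the names here are illustrative, not Scylla's actual API; only stop_iteration mirrors seastar::stop_iteration):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Mirrors seastar::stop_iteration.
enum class stop_iteration { no, yes };

// A resumable merge step: each call moves a bounded batch of entries and
// returns stop_iteration::no until the whole input has been consumed, so
// the caller can preempt and do other work between calls.
struct incremental_merger {
    std::vector<int> src;
    std::vector<int>& dst;
    std::size_t pos = 0;

    stop_iteration merge_some(std::size_t batch = 2) {
        std::size_t end = std::min(pos + batch, src.size());
        for (; pos < end; ++pos) {
            dst.push_back(src[pos]);
        }
        return pos == src.size() ? stop_iteration::yes : stop_iteration::no;
    }
};
```

A caller would loop on merge_some(), yielding to the reactor whenever it returns stop_iteration::no, which is the work-around-preemption-points property futures would not give us.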
Currently the tests hold a single LSA region lock around construction of
managed objects, their manipulation, and access. This way we avoid the
complexity of dealing with allocating sections. That will not be
possible once apply_to_incomplete() is changed to enter an allocating
section itself, because this requires the region to be unlocked on
entry. The tests will have to take more fine-grained locks. That is
somewhat tricky and would add a lot of noise to the tests. This patch
makes things easier by abstracting LSA management, among other things,
inside the mvcc_container and mvcc_partition classes.
Now all snapshots will have a mutation_cleaner which they will use to
gently destroy freed partition_version objects.
Destruction of memtable entries during cache update now also uses the
gentle cleaner. We need to have a separate cleaner for memtable
objects even though they're owned by cache's region, because memtable
versions must be cleared without a cache_tracker.
Each memtable will have its own cleaner, which will be merged with the
cache's cleaner when the memtable is merged into the cache.
Fixes some sources of reactor stalls on cache update when there are
large partition entries in memtables.
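As a rough illustration of the cleaner scheme described above (all names here are hypothetical, not the actual mutation_cleaner interface):

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <memory>

struct version_obj {}; // stand-in for partition_version

// Sketch of a cleaner that destroys freed objects gradually instead of
// all at once, so large backlogs don't stall the reactor.
struct cleaner_sketch {
    std::deque<std::unique_ptr<version_obj>> backlog;

    // Takes ownership; the object is freed later, in small steps.
    void destroy_gently(std::unique_ptr<version_obj> v) {
        backlog.push_back(std::move(v));
    }

    // Merges another cleaner's backlog into this one, e.g. when a
    // memtable is merged into the cache.
    void merge(cleaner_sketch&& other) {
        while (!other.backlog.empty()) {
            backlog.push_back(std::move(other.backlog.front()));
            other.backlog.pop_front();
        }
    }

    // Destroys a bounded number of objects per call to avoid stalls.
    // Returns true when the backlog is fully drained.
    bool clean_some(std::size_t batch = 1) {
        while (batch-- && !backlog.empty()) {
            backlog.pop_front();
        }
        return backlog.empty();
    }
};
```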
Instead of evicting whole partitions, we now evict individual rows.
As part of this, invalidation of partition entries was changed to not
evict from snapshots right away, but unlink them and let them be
evicted by the reclaimer.
This change is a preparation for introducing row-level eviction, such that entries
can be evicted from older versions without having to touch other versions.
Currently continuity flags on entries are interpreted relative to the
combined view merged from all entries. For example:
v2: <key=2, cont=1>
v1: <key=1, cont=1>
In v2, the flag on entry key=2 marks the range (1, 2) as
continuous. This is problematic because if the old version is evicted, continuity
will change in an incorrect way:
v2: <key=2, cont=1>
Here, the range (-inf, 1) would be marked as continuous, which is not true.
To solve this problem, we change the rules for continuity
interpretation in MVCC. Each version will have its own continuity,
fully specified in that version, independent of continuity of other
versions. Continuity of the snapshot will be a union of continuous
ranges in each version.
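Under the new rules, the snapshot's continuity could be computed as the union of the per-version intervals, sketched here with keys simplified to integers (names are illustrative, not Scylla's code):

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

using interval = std::pair<int, int>; // [start, end] over simplified int keys

// Each version fully specifies its own continuity; the snapshot's
// continuity is the union of continuous intervals over all versions.
std::vector<interval> snapshot_continuity(std::vector<std::vector<interval>> versions) {
    std::vector<interval> all;
    for (auto& v : versions) {
        all.insert(all.end(), v.begin(), v.end());
    }
    std::sort(all.begin(), all.end());
    std::vector<interval> merged;
    for (auto& i : all) {
        // Overlapping or touching intervals coalesce into one.
        if (!merged.empty() && i.first <= merged.back().second) {
            merged.back().second = std::max(merged.back().second, i.second);
        } else {
            merged.push_back(i);
        }
    }
    return merged;
}
```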
It is assumed that continuous intervals in different versions are non-
overlapping, except for points corresponding to complete rows, in
which case a later version may overlap with an older version
(overwrite). We make use of this assumption in
mutation_partition::apply_monotonically() to make the calculation of
the union of intervals on merging easier.
MVCC population of incomplete entries already almost maintains the
non-overlapping invariant, because population intervals correspond to
intervals which are incomplete in the old snapshot. The only change
needed is to ensure that both population bounds will have entries in
the latest version. Population from memtables doesn't mark any
intervals as continuous, so it also conforms. The only change needed
there is to not inherit continuity flags from the old snapshot,
effectively making the new version internally discontinuous except for
row points.
The example from the beginning will become:
v2: <key=1, cont=0> <key=2, cont=1>
v1: <key=1, cont=1>
When marking a range as continuous with some rows present only in
older versions, we need to insert entries in the latest version, so
that we can mark the range as continuous. The easiest solution is to
copy the entry from the old version. Another option would be to add
support for incomplete rows and insert such instead. This way we would
avoid duplicating row contents. This optimization is deferred.
Simply copying mutations which are not fully continuous may violate
MVCC invariants, like the one about non-overlapping continuity which
will be added later. Use apply_to_incomplete() instead.
This unfortunately reduces the strength of the test, since the continuity
of the entry is now completely determined by the first version. We should
use populate() instead, but it doesn't exist yet. It could be extracted
from cache_streamed_mutation, but that's not an easy change.
This is alleviated by adding a similar test to row_cache_test_g, in a
later patch.
Commit 6ccd317 introduced a bug in partition_entry::evict() where a
partition entry may be partially evicted if there are non-evictable
snapshots in it. Partially evicting some of the versions may violate
consistency of a snapshot which includes evicted versions. For one,
continuity flags are interpreted relative to the merged view, not
within a version, so evicting from some of the versions may mark
ranges as continuous that were previously discontinuous. Also, range
tombstones of the snapshot are taken from all versions, so we can't
partially evict some of them without marking all affected ranges as
discontinuous.
The fix is to revert to full eviction, and avoid moving
non-evictable snapshots to cache. When moving a whole partition entry to
cache, we first create a neutral empty partition entry and then merge
the memtable entry into it just like we would if the entry already
existed.
Fixes #3215.
Tests: unit (release)
Message-Id: <1518710592-21925-2-git-send-email-tgrabiec@scylladb.com>
When a digest is requested, pre-calculate the cell's hash. We handle
both the case when the cell is already in the cache, and the case when
it is added by the underlying reader.
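The caching idea can be sketched as a lazily computed, memoized hash on the cell (a hypothetical shape; the real cell type and hash function differ):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <string>

// Illustrative cell with a memoized hash: the hash is computed once,
// on the first digest request, and reused afterwards.
struct cell_sketch {
    std::string value;
    mutable std::optional<std::size_t> cached_hash;

    std::size_t hash() const {
        if (!cached_hash) {
            cached_hash = std::hash<std::string>{}(value);
        }
        return *cached_hash;
    }
};
```

Whether the cell was already cached or was just populated by the underlying reader, the first digest computes and stores the hash, and later digests reuse it.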
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"Changes merging in MVCC to apply newer version to older instead of older to
newer.
Before (v0 = oldest):
(((v3 + v2) + v1) + v0)
After:
(v0 + (v1 + (v2 + v3)))
or:
(((v0 + v1) + v2) + v3)
There are several reasons to do this:
1) When continuity merging semantics change to support eviction
from older versions, it will be easier to implement apply() if we
can assume that we merge newer into older instead of older into
newer, since the newer version may have entries falling into a
continuous interval in the older one, but not the other way around. If we
didn't reverse the order, apply() would have to keep track of the
lower bound of a continuous interval in the right-hand side
argument (the older version) as it is applied, and update continuity
flags in the left-hand side by scanning all entries overlapping
with it. If the order is reversed, merging only needs to deal with
the current entry. Also, if we kept the old order, we
could not simply move entries from the left-hand side as we merge,
because we need to keep track of the lower bound of a continuous
interval, and we need to provide monotonic exception
guarantees. So merging would be both more complicated and slower.
2) With large partitions older versions are typically larger than
newer versions, and since merging is O(N_right*(1 + log(N_left))),
it's better to merge newer into older.
This fixes latency spikes seen in perf_cache_eviction.
Fixes #2715."
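A toy model of the reversed direction, using std::map to stand in for a version (hypothetical code; it only shows why the cost is one lookup-and-insert into the older side per newer entry, i.e. O(N_right*(1 + log(N_left)))):

```cpp
#include <cassert>
#include <map>
#include <utility>

// Moves all entries of `newer` into `older`; on key collision the newer
// value wins (overwrite), mirroring "apply newer version to older".
void apply_newer_to_older(std::map<int, int>& older, std::map<int, int>&& newer) {
    for (auto it = newer.begin(); it != newer.end(); ) {
        // Extract the node so entries are moved, not copied; this keeps
        // the merge monotonic: moved entries stay moved on failure.
        auto node = newer.extract(it++);
        auto res = older.insert(std::move(node));
        if (!res.inserted) {
            res.position->second = res.node.mapped(); // overwrite with newer
        }
    }
}
```

Each iteration costs one O(log N_left) insert into the older (left) side, so it is cheaper to iterate over the typically smaller newer side.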
* tag 'tgrabiec/reverse-order-of-mvcc-version-merging-v1' of github.com:scylladb/seastar-dev:
mvcc: Reverse order of version merging
anchorless_list: Introduce last()
mvcc: Implement partition_entry::upgrade() using squashed()
mvcc: Extract version merging functions
mutation_partition: Add rows_entry::set_dummy()
position_in_partition: Introduce after_key()
It is currently updated only when iterators are invalidated. It's better
not to assume that, because it's not really needed, and
maintaining this would complicate maybe_refresh() after continuity
merging rules change later.
"This simplifies implementation of mutation_partition merging by relaxing
exception guarantees it needs to provide. This allows reverters to be dropped.
Direct motivation for this is to make it easier to implement new semantics
for merging of clustering range continuity.
Implementation details:
We only need strong exception guarantees when applying to the memtable, which is
using MVCC. Instead of calling apply() with strong exception guarantees on the latest
version, we will move the incoming mutation to a new partition_version and then
use monotonic apply() to merge them. If that merging fails, we attach the version with
the remainder, which cannot fail. This way apply() always succeeds if the allocation
of partition_version object succeeds.
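The scheme can be sketched like this, with a std::map standing in for a version's data (all names are illustrative, not the partition_entry API):

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <map>

using partition_data = std::map<int, int>; // keys/values simplified to ints

struct partition_entry_sketch {
    std::list<partition_data> versions; // front = newest

    void apply(partition_data incoming) {
        // Allocating the new version is the only step that may fail;
        // past this point, apply() as a whole always succeeds.
        versions.push_front(std::move(incoming));
        if (versions.size() == 1) {
            return; // first version, nothing to merge
        }
        // Monotonic merge: entries already moved stay moved. If the
        // merge were interrupted part-way, the remainder would simply
        // stay attached as a version and the snapshot stays consistent.
        auto& newer = versions.front();
        auto& older = *std::next(versions.begin());
        while (!newer.empty()) {
            older.insert_or_assign(newer.begin()->first, newer.begin()->second);
            newer.erase(newer.begin());
        }
        versions.pop_front(); // fully merged; drop the now-empty version
    }
};
```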
Results of `perf_simple_query_g -c1 -m1G --write` (high overwrite rate):
Before:
101011.13 tps
102498.07 tps
103174.68 tps
102879.55 tps
103524.48 tps
102794.56 tps
103565.11 tps
103018.51 tps
103494.37 tps
102375.81 tps
103361.65 tps
After:
101785.37 tps
101366.19 tps
103532.26 tps
100834.83 tps
100552.11 tps
100891.31 tps
101752.06 tps
101532.00 tps
100612.06 tps
102750.62 tps
100889.16 tps
Fixes #2012."
* tag 'tgrabiec/drop-reversible-apply-v1' of github.com:scylladb/seastar-dev:
mutation_partition: Drop apply_reversibly()
mutation_partition: Relax exception guarantees of apply()
mutation_partition: Introduce apply_weak()
tests: mvcc: Add test for atomicity of partition_entry::apply()
tests: Move failure_injecting_allocation_strategy to a header
tests: mutation_partition: Test exception guarantees of apply_monotonically()
mvcc: Use apply_monotonically() where sufficient
mvcc: partition_version: Use apply_monotonically() to provide atomicity
mvcc: Extract partition_entry::add_version()
mutation_partition: Introduce apply_monotonically()
mutation_partition: Introduce row::consume_with()