Memtable entries should be cleaned using memtable cleaner, which
unlike the cache' cleaner is not associated with the cache
tracker. It's an error to clean a snapshot using tracker which doesn't
own the entries. This will corrupt cache tracker's row counter.
Fixes failure of test_exception_safety_of_update_from_memtable from
row_cache.cc in debug mode and with allocation failure injection
enabled.
Introduce in "cache: Defer during partition merging"
(70c72773be).
Message-Id: <1528988256-20578-1-git-send-email-tgrabiec@scylladb.com>
After 70c72773be it's possible that
open_version() is called with a phase which is smaller than the phase
of the latest version, because latest version belongs to the
in-progress cache update. In such case we must return the existing
non-latest snapshot and not create a new version on top of the
in-progress update. Not doing this violates several invariants, and
may lead to inconsistencies, including violation of write atomicity or
temporary loss of writes.
partition_entry::read() was already adjusted by the aforementioned
commit. Do a similar adjustement for open_version().
Fixes sporadic failures of row_cache_test.cc::test_concurrent_reads_and_eviction
Message-Id: <1528211847-22825-1-git-send-email-tgrabiec@scylladb.com>
Incremental merging will be implemented by the means of resumable
functions, which return stop_iteration::no when not yet
finished. We're not using futures, so that the caller can do work
around preemption points as well.
Now all snapshots will have a mutation_cleaner which they will use to
gently destroy freed partition_version objects.
Destruction of memtable entries during cache update is also using the
gentle cleaner now. We need to have a separate cleaner for memtable
objects even though they're owned by cache's region, because memtable
versions must be cleared without a cache_tracker.
Each memtable will have its own cleaner, which will be merged with the
cache's cleaner when memtable is merged into cache.
Fixes some sources of reactor stalls on cache update when there are
large partition entries in memtables.
Partitions can get very large. Destroying them all at once can stall
the reactor for significant amount of time. We want to avoid that by
doing destruction incrementally, deferring in between. A new API is
added for that at various levels:
stop_iteration clear_gently() noexcept;
It returns stop_iteration::yes when the object is fully cleared and
can be now destroyed quickly. So a deferring destruction can look like
this:
return repeat([this] { return clear_gently(); });
The reason why clear_gently() doesn't return a future<> itself is that some
contexts cannot defer, like memory reclamation.
Previously it was also possible to pass a frozen_mutation to it.
Now we de-serialize frozen mutations at the calling side.
This is a pre-requisite for collecting memtable statistics needed for
writing into the SSTables 3.0 format.
For #1969.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Instead of evicting whole partitions, evicts whole rows.
As part of this, invalidation of partition entries was changed to not
evict from snapshots right away, but unlink them and let them be
evicted by the reclaimer.
Every evictable version will have a dummy entry at the end so that it can be
tracked in the LRU.
It is also needed to allow old versions to stay around (with
tombstones and static rows) after all rows are evicted. Such versions
must be fully discontinuous, and we need some entry to mark that.
This change is a preparation for introducing row-level eviction, such that entries
can be evicted from older versions without having to touch other versions.
Currently continuity flags on entries are interpreted relative to the
combined view merged from all entries. For example:
v2: <key=2, cont=1>
v1: <key=1, cont=1>
In v2, the flag on entry key=2 marks the range (1, 2) as
continuous. This is problematic because if the old version is evicted, continuity
will change in an incorrect way:
v2: <key=2, cont=1>
Here, the range (-inf, 1) would be marked as continuous, which is not true.
To solve this problem, we change the rules for continuity
interpretation in MVCC. Each version will have its own continuity,
fully specified in that version, independent of continuity of other
versions. Continuity of the snapshot will be a union of continuous
ranges in each version.
It is assumed that continuous intervals in different versions are non-
overlapping, except for points corresponding to complete rows, in
which case a later version may overlap with an older version
(overwrite). We make use of this assumption to make calculation of the
union of intervals on merging easier. I make use of the above
assumption in mutation_partition::apply_monotonically().
MVCC population of incomplete entries already almost maintains the
non-overlapping invariant, because population intervals correspond to
intervals which are incomplete in the old snapshot. The only change
needed is to ensure that both population bounds will have entries in
the latest version. Population from memtables doesn't mark any
intervals as continuous, so also conforms. The only change needed
there is to not inherit continuity flags from the old snapshot,
effectively making the new version internally discontinuous except for
row points.
The example from the beginning will become:
v2: <key=1, cont=0> <key=2, cont=1>
v1: <key=1, cont=1>
When marking a range as continuous with some rows present only in
older versions, we need to insert entries in the latest version, so
that we can mark the range as continuous. The easiest solution is to
copy the entry from the old version. Another option would be to add
support for incomplete rows and insert such instead. This way we would
avoid duplicating row contents. This optimization is deferred.
Commit 6ccd317 introduced a bug in partition_entry::evict() where a
partition entry may be partially evicted if there are non-evictable
snapshots in it. Partially evicting some of the versions may violate
consistency of a snapshot which includes evicted versions. For one,
continuity flags are interpreted realtive to the merged view, not
within a version, so evicting from some of the versions may mark
reanges as continuous when before they were discontinuous. Also, range
tombtsones of the snapshot are taken from all versions, so we can't
partially evict some of them without marking all affected ranges as
discontinuous.
The fix is to revert back to full eviciton, and avoid moving
non-evictable snapshots to cache. When moving whole partition entry to
cache, we first create a neutral empty partition entry and then merge
the memtable entry into it just like we would if the entry already
existed.
Fixes#3215.
Tests: unit (release)
Message-Id: <1518710592-21925-2-git-send-email-tgrabiec@scylladb.com>
cache_entry constructor was marked noexcept, yet make_evictable() may
fail in rare cases due to allocation in add_version(). Lift the
annotation and make sure that construction has strong exception
guarantees for the moved-in state so that it can be retried without
data loss inside allocating section.
When moving whole partition entries from memtable to cache, we move
snapshots as well. It is incorrect to evict from such snapshots
though, because associated readers would miss data.
Solution is to record evictability of partition version references (snapshots)
and avoiding eviction from non-evictable snapshots.
Could affect scanning reads, if the reader uses partition entry from
memtable, and the partition is too large to fit in reader's buffer,
and that entry gets moved to cache (was absent in cache), and then
gets evicted (memory pressure). The reader will not see the remainder
of that entry.
Introduced in ca8e3c4, so affects 2.1+
Fixes#3186.
Internal invariants of MVCC are better preserved by partition_entry
methods, so move construction of partition entries out of cache_entry
constructors.
When digest is requested, pre-calculate the cell's hash. We consider
the case when the cell is already in the cache, and the case when it
added by the underlying reader.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
When digest is requested, pre-calculate the cell's hash. A downside of
this approach is that more work will be done when there are multiple
versions of a row that contain values for the same cell, but we expect
these cases to be rare and the upside of caching a cell's hash to
compensate for the extra work.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch drops the use of apply_reversibly(). We move the mutation
to be applied into a new version and then use apply_monotonically() to
merge it (if no snapshot) with the current version. This guarantees
that apply() is atomic even if apply_monotonically() throws.
Fixes#2012.
Once that is added, also add a method to a memtable entry to calculate
the entire size of a memtable entry. Right now we only have one method
to calculate the size minus rows.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
[tgrabiec:
- extracted from a larger commit
- fix heap comparator in apply_incomplete_target to order versions properly
- extracted partition_version detaching into
partition_entry::with_detached_versions()
- dropped unnecessary rows_iterator::_version field
- dropped unnecessary allocation of rows_entry and key copies
in rows_iterator
- dropped row_pointer
- replaced apply_reversibly() with weaker and faster apply()
- added handling of dummy entries at any position
- fixed exception safety issue in apply_to_incomplete() which may
result in data loss. We cannot move data out of applied versions
into a new synthetic row and then apply it, because if exception
happens in the middle, the data which was moved from the source
will be lost. To fix that, row_iterator::consume_row() is
introduced which allows in-place consumption of data without
construction of temporary deletable_row.
]