called autoupdating_underlying_flat_reader. It will be modified
in the next patch to use flat reader to underlying.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Right now, once a region is moved to the cache it is no longer visible to
the dirty memory system, neither as real dirty nor as virtual dirty.
The problem is that until a particular partition is moved to the cache
it is not evictable. As a result we can OOM the system if we have a lot
of pending cache updates as the writes will not be throttled and memory
won't be made available.
This patch pins the memory used by the region as real dirty before the
cache update starts, and unpins it when it is over. In the meantime it
gradually releases memory of the partitions that are being moved to
cache.
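The pin/release/unpin sequence described above can be sketched roughly as
follows. This is a minimal illustration only: dirty_memory_tracker,
pinned_region and simulate_update are hypothetical names invented for the
example and are not the types used by the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the dirty memory accounting (not Scylla's API).
struct dirty_memory_tracker {
    std::size_t real_dirty = 0;
    void pin(std::size_t bytes) { real_dirty += bytes; }
    void unpin(std::size_t bytes) {
        assert(bytes <= real_dirty);
        real_dirty -= bytes;
    }
};

// RAII guard: pins the whole region as real dirty up front, lets the caller
// release memory partition by partition, and unpins the rest on destruction.
class pinned_region {
    dirty_memory_tracker& _tracker;
    std::size_t _remaining;
public:
    pinned_region(dirty_memory_tracker& t, std::size_t region_bytes)
        : _tracker(t), _remaining(region_bytes) {
        _tracker.pin(region_bytes);
    }
    pinned_region(const pinned_region&) = delete;
    pinned_region& operator=(const pinned_region&) = delete;
    ~pinned_region() { _tracker.unpin(_remaining); }

    // Called after a partition has been moved to cache and became evictable.
    void release(std::size_t bytes) {
        if (bytes > _remaining) {
            bytes = _remaining;
        }
        _tracker.unpin(bytes);
        _remaining -= bytes;
    }
};

// Simulates a cache update and records real_dirty after each step.
std::vector<std::size_t> simulate_update(dirty_memory_tracker& tracker,
                                         const std::vector<std::size_t>& partition_sizes) {
    std::size_t total = 0;
    for (auto s : partition_sizes) {
        total += s;
    }
    std::vector<std::size_t> observed;
    {
        pinned_region pin(tracker, total);
        observed.push_back(tracker.real_dirty);   // everything pinned
        for (auto s : partition_sizes) {
            pin.release(s);                        // partition moved to cache
            observed.push_back(tracker.real_dirty);
        }
    }                                              // update over: unpin the rest
    observed.push_back(tracker.real_dirty);
    return observed;
}
```

With partitions of 100, 50 and 25 bytes the tracked amount starts at 175 and
drops to 75, 25 and 0 as each partition is released, so writes stay throttled
for exactly as long as the memory is not yet evictable.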
I have verified in a couple of workloads that the amount of memory
accounted through this is the same amount of memory accounted through
the memtable flush procedure.
Fixes #1942
Signed-off-by: Glauber Costa <glauber@scylladb.com>
query::full_slice doesn't select any regular or static columns, which
is at odds with the expectations of its users. This patch replaces it
with the schema::full_slice() version.
Refs #2885
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1507732800-9448-2-git-send-email-duarte@scylladb.com>
If snapshots are not evicted, they may pin an unbounded amount of memory
for a long time in cache, which may lead to OOM. Evict snapshots
together with the entry.
Fixes #2775.
Fixes #2730.
modification_count is currently only used to detect invalidation of
references, intended to be incremented on erasure.
Insertion into intrusive set doesn't invalidate references, so no
need to increment the counter.
Cache imposes requirements on how updates to the on-disk mutation source
are made:
1) each change to the on-disk mutation source must be followed
by cache synchronization reflecting that change
2) the two must be serialized with other synchronizations
3) the combined operation must have strong failure guarantees (atomicity)
Because of that, sstable list update and cache synchronization must be
done under a lock, and cache synchronization cannot fail to synchronize.
Normally cache synchronization achieves this by wiping the cache
(which is noexcept) in case a failure is detected. There are some
setup steps, however, which cannot be skipped, e.g. taking a lock
followed by switching cache to use the new snapshot. That truly cannot
fail. The lock inside cache synchronizers is redundant, since the
user needs to take it anyway around the combined operation.
In order to make ensuring strong exception guarantees easier, and
making the cache interface easier to use correctly, this patch moves
the control of the combined update into the cache. This is done by
having cache::update() et al accept a callback (external_updater)
which is supposed to perform modification of the underlying mutation
source when invoked.
This is in-line with the layering. Cache is layered on top of the
on-disk mutation source (it wraps it) and reading has to go through
cache. After the patch, modification also goes through cache. This way
more of cache's requirements can be confined to its implementation.
The failure semantics of update() and other synchronizers needed to
change due to strong exception guarantees. Now if it fails, it means
the update was not performed, neither to the cache nor to the
underlying mutation source.
The database::_cache_update_sem goes away; serialization is now done
internally by the cache.
The external_updater needs to have strong exception guarantees. This
requirement is not new. It is however currently violated in some
places. This patch marks those callbacks as noexcept and leaves a
FIXME. Those should be fixed, but that's not in the scope of this
patch. Aborting is still better than corrupting the state.
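As a rough illustration of the shape described above, the sketch below models
an update() that takes the lock itself, does all fallible work first, and only
then invokes the callback. toy_cache, update_result, run_update and the
fail_setup flag are hypothetical and exist only for this example; the real
cache::update() signature is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <new>
#include <vector>

// Hypothetical, heavily simplified model of a cache that owns the combined
// "modify underlying source + synchronize cache" operation.
struct toy_cache {
    std::mutex _update_lock;      // serialization is internal to the cache
    std::vector<int> _cached;     // stand-in for cached data
    int _snapshot_version = 0;

    using external_updater = std::function<void()>;

    // fail_setup simulates a failure (e.g. allocation) during preparation.
    void update(external_updater updater, std::vector<int> new_data,
                bool fail_setup = false) {
        std::lock_guard<std::mutex> guard(_update_lock);

        // 1. Fallible preparation. Nothing has been modified yet, so throwing
        //    here leaves both the cache and the underlying source untouched.
        if (fail_setup) {
            throw std::bad_alloc();
        }
        std::vector<int> merged = _cached;
        merged.insert(merged.end(), new_data.begin(), new_data.end());

        // 2. Point of no return. The updater must have strong exception
        //    guarantees; the steps after it cannot fail.
        updater();
        ++_snapshot_version;
        _cached.swap(merged);
    }
};

struct update_result {
    bool threw;
    std::size_t sstables;
    std::size_t cached;
    int version;
};

// Runs one update against a fresh cache and reports what changed.
update_result run_update(bool fail_setup) {
    toy_cache cache;
    std::vector<int> sstables;    // stand-in for the sstable list
    bool threw = false;
    try {
        // Marked noexcept: if the callback cannot meet its guarantee,
        // aborting is preferred over corrupting the state.
        cache.update([&]() noexcept { sstables.push_back(1); },
                     {10, 20}, fail_setup);
    } catch (const std::bad_alloc&) {
        threw = true;
    }
    return {threw, sstables.size(), cache._cached.size(), cache._snapshot_version};
}
```

In this model a failed update leaves the sstable list, the cached data and the
snapshot version all unchanged, which is the "update was not performed" outcome
described above.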
Fixes #2754.
Also fixes the following test failure:
tests/row_cache_test.cc(949): fatal error: in "test_update_failure": critical check it->second.equal(*s, mopt->partition()) has failed
which started to trigger after commit 318423d50b. Thread stack
allocation may fail, in which case we did not do the necessary
invalidation.
Underlying data source in row cache holds a reference to sstable set
prior to compaction which isn't released until a memtable flush, which
means file descriptors of deleted sstables remain open, wasting
disk space.
The fix is to refresh underlying data source in row cache.
Fixes #2570.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
[tgrabiec:
- extracted from a larger commit
- removed coupling with how cache_streamed_mutation is created (the
code went out of sync), used more stable make_reader(). it's simpler too.
- replaced false/true literals with is_continuous/is_dummy where appropriate
- dropped tests for cache::underlying (class is gone)
- reused streamed_mutation_assertions, it has better error messages
- fixed the tests to not create tombstones with missing timestamps
- relaxed range tombstone assertions to only check information relevant for the query range
- print cache on failure for improved debuggability
]
This streamed mutation populates cache with
the rows requested by the read. It takes whatever
it can find in the cache and fetches the remainder
from the underlying source.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
[tgrabiec:
- fixed maybe_add_to_cache_and_update_continuity() leaking entries if
the key already exists in the snapshot
- fixed a problem where population race could result in a read
missing some rows, because cache_streamed_mutation was advancing
the cursor, then deferring, and then checking continuity. We
should check continuity atomically with advancing.
- fixed rows_handle.maybe_refresh() being accessed outside of update
section in read_from_underlying() (undefined behavior)
- fixed a problem in start_reading_from_underlying() where we would
use incorrect start if lower_bound ended with a range tombstone
starting before a key.
- range tombstone trimming in add_to_buffer() could create a
tombstone which has too low start bound if last_rt.end was a
prefix and had inclusive end. invert_kind(end_kind) should be used
instead of unconditional inc_start.
- range tombstone trimming incorrectly assumed it is fine to trim
the tombstone from underlying to the previous fragment's end and
emit such tombstone. That would mean the stream can't emit any
fragments which start before previous tombstone's end. Solve with
range_tombstone_stream.
- split add_to_buffer() into overloads for clustering_row, and
range_tombstone. Better than wrapping into mutation_fragment
before the call and having add_to_buffer() rediscover the
information.
- changed maybe_add_to_cache_and_update_continuity() to not set
continuity to false for existing entries, it's not necessary
- moved range tombstone trimming to range_tombstone class
- moved range tombstone slicing code to range_tombstone_list and partition_snapshot
- can_populate::can_use_cache was unused, dropped
- dropped assumption that dummy entries are only at the end
- renamed maybe_add_to_cache_and_update_continuity() to maybe_add_to_cache()
- dropped no longer needed lower_bound class
- extracted row_handle to a separate patch
- made the copy-from-cache loop preemptable
- split maybe_add_next_to_buffer_and_update_continuity(bool)
- dropped cache_populator
- replaced "underlying" class with use of read_context
- replaced can_populate class with a function
- simplified lsa_manager methods to avoid moves
]
Currently readers are always using the latest snapshot. This is fine
for respecting write atomicity if partitions are fully continuous in
cache (now), but will break write atomicity once partial population is
allowed.
Consider the following case:
    flush write(ck=1), write(ck=2) -> snapshot_1
    cache reader 1 reads and inserts ck=1 @snapshot_1
    flush write(ck=1), write(ck=2) -> snapshot_2
    cache reader 2 reads and inserts ck=2 @snapshot_2
Because cache update is not atomic, it can happen that reader 2 will
complete while the partition hasn't been updated yet for snapshot_2.
In such case, after read 2 the partition would contain ck=1 from
snapshot_1 and ck=2 from snapshot_2. It will match neither of the
snapshots, and this could violate write atomicity.
To solve this problem we conceptually assign each partition key in the
ring to its current snapshot which it reflects. The update process
gradually converts entries in ring order to the new snapshot. Reads
will not be using the latest snapshot, but rather the current snapshot
for the position in the ring they are at.
There is a race between the update process and populating reads. Since
after the update all entries must reflect the new snapshot, reads
using the old snapshot cannot be allowed to insert data which can no
longer be reached by the update process. Before this patch this race
was prevented by the use of a phased_barrier, where readers would keep
phased_barrier::operation alive between starting a read of a partition
and inserting it into cache. Cache update was waiting for all prior
operations before starting the update. Any later read which was not
waited for would use the latest snapshot for reads, so the update
process didn't have to fix anything up for such reads.
After this change, later reads cannot always use the latest snapshot,
they have to use the snapshot corresponding to given entry. So it's
not enough for update() to wait for prior reads in order to prevent
stale populations. The (simple) solution implemented in this patch is
to detect the conflict and abandon population of given sub-range. In
general, reads are allowed to populate given range only if it belongs
to a single snapshot.
Note that the range here is not the whole query range. For population
of continuity, it is the range starting after the previous key and
ending after the key being inserted. When populating a partition
entry, the range is a singular range containing only the partition
key. Readers switch to new snapshots automatically as they move across
the ring. It's possible that the insertion of the partition doesn't
conflict, but continuity does. In such case the entry will be inserted
but continuity will not be set.
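The conflict check can be pictured with the following sketch. It assumes
integer partition keys, an update front that moves through the ring in key
order, and snapshots numbered by phase; toy_ring and all of its members are
hypothetical names, not the data structures used by the patch.

```cpp
#include <cassert>
#include <limits>

// Hypothetical model of the ring during a gradual cache update.
struct toy_ring {
    int phase = 0;           // number of the latest snapshot
    bool updating = false;   // is an update in progress?
    int update_front = 0;    // keys < update_front already reflect `phase`

    void start_update() {
        ++phase;
        updating = true;
        update_front = std::numeric_limits<int>::min();  // nothing converted yet
    }
    void advance_update_to(int key) { update_front = key; }
    void finish_update() { updating = false; }

    // Snapshot which the given position in the ring currently reflects.
    int phase_of(int key) const {
        if (!updating || key < update_front) {
            return phase;
        }
        return phase - 1;
    }

    // Partition entry: a singular range containing only the key.
    bool can_insert_entry(int reader_phase, int key) const {
        return phase_of(key) == reader_phase;
    }

    // Continuity: the range from the previous key to the inserted key.
    // phase_of() is monotonic in the key, so checking both ends is enough
    // to know that the whole range belongs to a single snapshot.
    bool can_set_continuity(int reader_phase, int prev_key, int key) const {
        return phase_of(prev_key) == reader_phase
            && phase_of(key) == reader_phase;
    }
};
```

For a reader at phase 0 populating the range between keys 3 and 7 while the
update front sits at 5, can_insert_entry(0, 7) holds but
can_set_continuity(0, 3, 7) does not: the entry is inserted and continuity is
left unset.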
Currently every time cache needs to create reader for missing data it
obtains a reader which is most up to date. That reader includes writes
from later populate phases, for which update() was not yet
called. This will be problematic once we allow partitions to be
partially populated, because different parts of the partition could be
partially populated using readers using different sets of writes, and break
write atomicity.
The solution will be to always populate given partition using the same
set of writes, using reader created from the current snapshot. The
snapshot changes only on update(), with update() gradually converting
each partition to the new snapshot.
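A minimal sketch of the idea, assuming a snapshot can be modelled as an
immutable map from clustering key to value. toy_source and
toy_snapshot_cache are hypothetical, and the gradual per-partition conversion
is collapsed into a single step for brevity.

```cpp
#include <cassert>
#include <map>
#include <memory>

using snapshot = std::map<int, int>;   // clustering key -> value (illustrative)

// Hypothetical underlying source which keeps accepting writes.
struct toy_source {
    snapshot data;
    std::shared_ptr<const snapshot> make_snapshot() const {
        return std::make_shared<const snapshot>(data);
    }
};

// Hypothetical cache which populates only from its current snapshot.
struct toy_snapshot_cache {
    std::shared_ptr<const snapshot> _current;

    explicit toy_snapshot_cache(const toy_source& s)
        : _current(s.make_snapshot()) {}

    // Readers for missing data see the current snapshot, not the latest writes.
    std::shared_ptr<const snapshot> make_populating_reader() const {
        return _current;
    }

    // The snapshot changes only here.
    void update(const toy_source& s) { _current = s.make_snapshot(); }
};
```

If the source receives new writes for ck=1 and ck=2 after the cache was
created, a populating reader still sees the old values for both keys until
update() is called, so a partition is never assembled from two different sets
of writes.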
1) Reduce duplication by delegating to more general overloads
2) Improve documentation to not mention effects in terms of
population (detail) but rather write visibility
3) Rename clear() to invalidate() and merge with the range variant,
it has the same semantics
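One way to picture point 3 is a single invalidate() whose range defaults to the
full ring, so the former clear() becomes the no-argument call. key_range and
toy_invalidating_cache are hypothetical names for this sketch; the actual
interface may differ.

```cpp
#include <cassert>
#include <limits>
#include <set>

// Hypothetical inclusive key range; full() covers the whole ring.
struct key_range {
    int start;
    int end;
    static key_range full() {
        return {std::numeric_limits<int>::min(), std::numeric_limits<int>::max()};
    }
};

struct toy_invalidating_cache {
    std::set<int> entries;

    // Drops all entries in the range; with no argument this behaves like the
    // old clear(), which had the same semantics.
    void invalidate(const key_range& r = key_range::full()) {
        entries.erase(entries.lower_bound(r.start), entries.upper_bound(r.end));
    }
};
```

Calling invalidate({2, 3}) removes only keys 2 and 3, while invalidate() with
no argument empties the cache.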