Right now, once a region is moved to the cache is no longer visible to
the dirty memory system. Not as real dirty nor virtual dirty.
The problem is that until a particular partition is moved to the cache
it is not evictable. As a result we can OOM the system if we have a lot
of pending cache updates as the writes will not be throttled and memory
won't be made available.
This patch pins the memory used by the region as real dirty before the
cache update starts, and unpins it when it is over. In the mean time it
gradually releases memory of the partitions that are being moved to
cache.
I have verified in a couple of workloads that the amount of memory
accounted through this is the same amount of memory accounted through
the memtable flush procedure.
Fixes#1942
Signed-off-by: Glauber Costa <glauber@scylladb.com>
query::full_slice doesn't select any regular or static columns, which
is at odds with the expectations of its users. This patch replaces it
with the schema::full_slice() version.
Refs #2885
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1507732800-9448-2-git-send-email-duarte@scylladb.com>
If snapshots are not evicted, they may pin unbouned amount of memory
for a long time in cache, which may lead to OOM. Evict snapshots
together with the entry.
Fixes#2775.
Fixes#2730.
modification_count is currently only used to detect invalidation of
references, intended to be incremented on erasure.
Insertion into intrusive set doesn't invalidate references, so no
need to increment the counter.
Cache imposes requirements on how updates to the on-disk mutation source
are made:
1) each change to the on-disk muation source must be followed
by cache synchronization reflecting that change
2) The two must be serialized with other synchronizations
3) must have strong failure guarantees (atomicity)
Because of that, sstable list update and cache synchronization must be
done under a lock, and cache synchronization cannot fail to synchronize.
Normally cache synchronization achieves no-failure thing by wiping the
cache (which is noexcept) in case failure is detect. There are some
setup steps hoever which cannot be skipped, e.g. taking a lock
followed by switching cache to use the new snapshot. That truly cannot
fail. The lock inside cache synchronizers is redundant, since the
user needs to take it anyway around the combined operation.
In order to make ensuring strong exception guarantees easier, and
making the cache interface easier to use correctly, this patch moves
the control of the combined update into the cache. This is done by
having cache::update() et al accept a callback (external_updater)
which is supposed to perform modiciation of the underlying mutation
source when invoked.
This is in-line with the layering. Cache is layered on top of the
on-disk mutation source (it wraps it) and reading has to go through
cache. After the patch, modification also goes through cache. This way
more of cache's requirements can be confined to its implementation.
The failure semantics of update() and other synchronizers needed to
change due to strong exception guaratnees. Now if it fails, it means
the update was not performed, neither to the cache nor to the
underlying mutation source.
The database::_cache_update_sem goes away, serialization is done
internally by the cache.
The external_updater needs to have strong exception guarantees. This
requirement is not new. It is however currently violated in some
places. This patch marks those callbacks as noexcept and leaves a
FIXME. Those should be fixed, but that's not in the scope of this
patch. Aborting is still better than corrupting the state.
Fixes#2754.
Also fixes the following test failure:
tests/row_cache_test.cc(949): fatal error: in "test_update_failure": critical check it->second.equal(*s, mopt->partition()) has failed
which started to trigger after commit 318423d50b. Thread stack
allocation may fail, in which case we did not do the necessary
invalidation.
Underlying data source in row cache holds a reference to sstable set
prior to compaction which isn't released until a memtable flush, which
means file descriptors of deleted sstables remains opened, wasting
disk space.
The fix is to refresh underlying data source in row cache.
Fixes#2570.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
[tgrabiec:
- extracted from a larger commit
- removed coupling with how cache_streamed_mutation is created (the
code went out of sync), used more stable make_reader(). it's simpler too.
- replaced false/true literals with is_continuous/is_dummy where appropraite
- dropped tests for cache::underlying (class is gone)
- reused streamed_mutation_assertions, it has better error messages
- fixed the tests to not create tombstones with missing timestamps
- relaxed range tombstone assertions to only check information relevant for the query range
- print cache on failure for improved debuggability
]
This streamed mutation populates cache with
the rows requested by the read. It takes whatever
it can find in the cache and fetches the remainings
from underlying source.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
[tgrabiec:
- fixed maybe_add_to_cache_and_update_continuity() leaking entries if
the key already exists in the snapshot
- fixed a problem where population race could result in a read
missing some rows, because cache_streamed_mutation was advancing
the cursor, then deferring, and then checking continuity. We
should check continuity atomically with advancing.
- fixed rows_handle.maybe_refresh() being accessed outside of update
section in read_from_underlying() (undefined behavior)
- fixed a problem in start_reading_from_underlying() where we would
use incorrect start if lower_bound ended with a range tombstone
starting before a key.
- range tombstone trimming in add_to_buffer() could create a
tombstone which has too low start bound if last_rt.end was a
prefix and had inclusive end. invert_kind(end_kind) should be used
instead of unconditional inc_start.
- range tombstone trimming incorrectly assumed it is fine to trim
the tombstone from underlying to the previous fragment's end and
emit such tombstone. That would mean the stream can't emit any
fragments which start before previous tombstone's end. Solve with
range_tombstone_stream.
- split add_to_buffer() into overloads for clustering_row, and
range_tombstone. Better than wrapping into mutation_fragment
before the call and having add_to_buffer() rediscover the
information.
- changed maybe_add_to_cache_and_update_continuity() to not set
continuity to false for existing entries, it's not necessary
- moved range tombstone trimming to range_tombstone class
- moved range tombstone slicing code to range_tombstone_list and partition_snapshot
- can_populate::can_use_cache was unused, dropped
- dropped assumption that dummy entries are only at the end
- renamed maybe_add_to_cache_and_update_continuity() to maybe_add_to_cache()
- dropped no longer needed lower_bound class
- extracted row_handle to a seaparate patch
- made the copy-from-cache loop preemptable
- split maybe_add_next_to_buffer_and_update_continuity(bool)
- dropped cache_populator
- replaced "underlying" class with use of read_context
- replaced can_populate class with a function
- simplified lsa_manager methods to avoid moves
]
Currently readers are always using the latest snapshot. This is fine
for respecting write atomicity if partitions are fully continuous in
cache (now), but will break write atomicity once partial population is
allowed.
Consider the following case:
flush write(ck=1), write(ck=2) -> snapshot_1
cache reader 1 reads and inserts ck=1 @snapshot_1
flush write(ck=1), write(ck=2) -> snapshot_2
cache reader 2 reads and inserts ck=2 @snapshot_2
Because cache update is not atomic, it can happen that reader 2 will
complete while the partition hasn't been updated yet for snapshot_2.
In such case, after read 2 the partition would contain ck=1 from
snapshot_1 and ck=2 from snapshot_2. It will match neither of the
snapshots, and this could violate write atomicity.
To solve this problem we conceptually assign each partition key in the
ring to its current snapshot which it reflects. The update process
gradually converts entries in ring order to the new snapshot. Reads
will not be using the latest snapshot, but rather the current snapshot
for the position in the ring they are at.
There is a race between the update process and populating reads. Since
after the update all entries must reflect the new snapshot, reads
using the old snapshot cannot be allowed to insert data which can no
longer be reached by the update process. Before this patch this race
was prevented by the use of a phased_barrier, where readers would keep
phased_barrier::operation alive between starting a read of a partition
and inserting it into cache. Cache update was waiting for all prior
operations before starting the update. Any later read which was not
waited for would use the latest snapshot for reads, so the update
process didn't have to fix anything up for such reads.
After this change, later reads cannot always use the latest snapshot,
they have to use the snapshot corresponding to given entry. So it's
not enough for update() to wait for prior reads in order to prevent
stale populations. The (simple) solution implemented in this patch is
to detect the conflict and abandon population of given sub-range. In
general, reads are allowed to populate given range only if it belongs
to a single snapshot.
Note that the range here is not the whole query range. For population
of continuity, it is the range starting after the previous key and
ending after the key being inserted. When populating a partition
entry, the range is a singular range containing only the partition
key. Readers switch to new snapshots automatically as they move across
the ring. It's possible that the insertion of the partition doesn't
conflict, but continuity does. In such case the entry will be inserted
but continuity will not be set.
Currently every time cache needs to create reader for missing data it
obtains a reader which is most up to date. That reader includes writes
from later populate phases, for which update() was not yet
called. This will be problematic once we allow partitions to be
partially populated, because different parts of the partition could be
partially populated using readers using different sets of writes, and break
write atomicity.
The solution will be to always populate given partition using the same
set of writes, using reader created from the current snapshot. The
snapshot changes only on update(), with update() gradually converting
each partition to the new snapshot.
1) Reduce duplication by delegating to more general overloads
2) Improve documentation to not mention effects in terms of
population (detail) but rather write visibiliy
3) Rename clear() to invalidate() and merge with the range variant,
it has the same semantics
This object stores all read relevant context required all
over the place. This leads to a cleaner code.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
[tgrabiec:
- made read_context shareable to allow storing shared
mutable state later
- added range and cache getters
]
This is an abstraction that represents a reader
to the underlying source and auto updates itself
to make sure the reader reflects the latest state
of the underlying source.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
[tgrabiec: Add range getter to avoid friendships]
In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.
As result of this, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).
This patch has two levels:
1. In the lower level, sstable::data_consume_rows(), which reads all
partitions in a given disk byte range, now gets another byte position,
"last_end". That can be the range's end, the end of the file, or anything
in between the two. It opens the disk stream until last_end, which means
1. we will never read-ahead beyond last_end, and 2. fast_fordward_to() is
not allowed beyond last_end.
2. In the upper level, we add to the various layers of sstable readers,
mutation readers, etc., a boolean flag mutation_reader::forwarding, which
says whether fast_forward_to() is allowed on the stream of mutations to
move the stream to a different partition range.
Note that this flag is separate from the existing boolean flag
streamed_mutation::fowarding - that one talks about skipping inside a
single partition, while the flag we are adding is about switching the
partition range being read. Most of the functions that previously
accepted streamed_mutation::forwarding now accept *also* the option
mutation_reader::forwarding. The exception are functions which are known
to read only a single partition, and not support fast_forward_to() a
different partition range.
We note that if mutation_reader::forwarding::no is requested, and
fast_forward_to() is forbidden, there is no point in reading anything
beyond the range's end, so data_consume_rows() is called with last_end as
the range's end. But if forwarding::yes is requested, we use the end of the
file as last_end, exactly like the code before this patch did.
Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.
In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement will not
improve of anything if the overall range is always very large, in which
case not over-reading at its end will not improve performance.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170619152629.11703-1-nyh@scylladb.com>
This reverts commit 317d7fc253 (and also the
related 2c57ab84b2). It causes crashes
during range scans, reported by Gleb:
"To reproduce I run SELECT * FROM keyspace1.standard1; on typical c-s
dataset and 3 node cluster.
Backtrace:
at /home/gleb/work/seastar/seastar/core/apply.hh:36
rvalue=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x54cf307, DIE 0x55ebf2a>) at /home/gleb/work/seastar/seastar/core/do_with.hh:57
range=std::vector of length 6, capacity 8 = {...}) at /home/gleb/work/seastar/seastar/core/future-util.hh:142
at ./seastar/core/future.hh:890
at /home/gleb/work/seastar/seastar/core/future-util.hh:119
at /home/gleb/work/seastar/seastar/core/future-util.hh:142
In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.
As result of this, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).
This patch has two levels:
1. In the lower level, sstable::data_consume_rows(), which reads all
partitions in a given disk byte range, now gets another byte position,
"last_end". That can be the range's end, the end of the file, or anything
in between the two. It opens the disk stream until last_end, which means
1. we will never read-ahead beyond last_end, and 2. fast_fordward_to() is
not allowed beyond last_end.
2. In the upper level, we add to the various layers of sstable readers,
mutation readers, etc., a boolean flag mutation_reader::forwarding, which
says whether fast_forward_to() is allowed on the stream of mutations to
move the stream to a different partition range.
Note that this flag is separate from the existing boolean flag
streamed_mutation::fowarding - that one talks about skipping inside a
single partition, while the flag we are adding is about switching the
partition range being read. Most of the functions that previously
accepted streamed_mutation::forwarding now accept *also* the option
mutation_reader::forwarding. The exception are functions which are known
to read only a single partition, and not support fast_forward_to() a
different partition range.
We note that if mutation_reader::forwarding::no is requested, and
fast_forward_to() is forbidden, there is no point in reading anything
beyond the range's end, so data_consume_rows() is called with last_end as
the range's end. But if forwarding::yes is requested, we use the end of the
file as last_end, exactly like the code before this patch did.
Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.
In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement will not
improve of anything if the overall range is always very large, in which
case not over-reading at its end will not improve performance.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170614072122.13473-1-nyh@scylladb.com>