Commit Graph

184 Commits

Tomasz Grabiec
1971332195 row_cache: Fix exception safety of cache_entry::read()
When we fail, we need to return streamed_mutation back, so that
the operation can be retried.

Otherwise, on bad_alloc, the retry dereferences a nullptr and causes SIGSEGV.
2017-11-13 20:55:14 +01:00
Tomasz Grabiec
11a195c403 row_cache: scanning_and_populating_reader: Fix exception unsafety causing read to skip data
If assignment to _lower_bound in the "_secondary_in_progress = true;"
case in do_read_from_primary() throws due to allocation failure, the
update section will be retried and we will take the not_moved path,
skipping the range which was discontinuous and was supposed to be read
from underlying.

Fix by redoing lookup using _lower_bound in case the section is
retried. When we retry, _primary.valid() will be false. We need to
ensure now that _lower_bound is always valid.

Fixes #2944.
2017-11-13 20:55:14 +01:00
Tomasz Grabiec
5dc1ee41e4 row_cache: partition_range_cursor: Extract valid() and advance_to() from refresh() 2017-11-13 20:55:14 +01:00
Tomasz Grabiec
09c49b2db3 cache_streamed_mutation: Add trace-level logging to cache_streamed_mutation 2017-11-13 20:55:14 +01:00
Glauber Costa
1d7617723d row cache: pin real dirty during cache updates.
Right now, once a region is moved to the cache it is no longer visible to
the dirty memory system, neither as real dirty nor as virtual dirty.

The problem is that until a particular partition is moved to the cache
it is not evictable. As a result we can OOM the system if we have a lot
of pending cache updates as the writes will not be throttled and memory
won't be made available.

This patch pins the memory used by the region as real dirty before the
cache update starts, and unpins it when it is over. In the meantime it
gradually releases memory of the partitions that are being moved to
cache.

I have verified in a couple of workloads that the amount of memory
accounted through this is the same amount of memory accounted through
the memtable flush procedure.

Fixes #1942

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-11-08 19:46:36 -05:00
Glauber Costa
b836005555 row_cache: modernize use of seastar threads
For a while now we have an async() function, that simplifies the code by not
needing to issue an explicit join. This patch converts the row cache to use
async() as well, which most of our code already does. Doing so will make
it easier to make changes to update_cache.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-11-08 16:21:44 -05:00
Tomasz Grabiec
083b9cddef row_cache: Fix handling of concurrent partition population
This fixes a regression introduced in 27a3b4bca9 (master only).

partition_range_cursor assumes that as long as references are valid,
_end is valid as well. But if new entries are inserted before _end and
they fall after the query range, _end may no longer be valid. This may
result in reads returning partitions from outside the query range.
Message-Id: <1507815478-20269-1-git-send-email-tgrabiec@scylladb.com>
2017-10-12 15:55:20 +01:00
Piotr Jastrzebski
6069bab755 Cache single queries to non-existing partitions
This way we don't need to query sstables again
when the query is repeated.

Fixes #1533

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <8f8559ff19c534dbbb7c9ef6c28271cec607ba20.1506521461.git.piotr@scylladb.com>
2017-09-27 16:15:18 +02:00
Tomasz Grabiec
0911fbbdef row_cache: Fix row_cache::update_invalidating()
evict() doesn't guarantee that the whole partition is discontinuous.
In particular, partition tombstone cannot be marked as discontinuous.
The parts which are still continuous must be updated.

Broken after c78047fa5b.

Message-Id: <1505375684-28574-1-git-send-email-tgrabiec@scylladb.com>
2017-09-14 10:58:25 +03:00
Tomasz Grabiec
c78047fa5b row_cache: Evict partition snapshots
If snapshots are not evicted, they may pin an unbounded amount of memory
for a long time in cache, which may lead to OOM. Evict snapshots
together with the entry.

Fixes #2775.
Fixes #2730.
2017-09-13 17:47:03 +02:00
Tomasz Grabiec
adb159d51b row_cache: Reuse allocation_strategy::invalidate_references()
The modification count in the tracker is redundant; we can rely on the
allocator's invalidation counter.
2017-09-13 17:38:08 +02:00
Tomasz Grabiec
27a3b4bca9 row_cache: Don't invalidate references on insertion
modification_count is currently only used to detect invalidation of
references, intended to be incremented on erasure.

Insertion into intrusive set doesn't invalidate references, so no
need to increment the counter.
2017-09-13 17:38:08 +02:00
Tomasz Grabiec
2df6f356b1 mvcc: Store LSA region reference in partition_snapshot
Will be useful for improving encapsulation.
2017-09-13 17:38:08 +02:00
Tomasz Grabiec
d22fdf4261 row_cache: Improve safety of cache updates
Cache imposes requirements on how updates to the on-disk mutation source
are made:
  1) each change to the on-disk mutation source must be followed
     by cache synchronization reflecting that change
  2) the two must be serialized with other synchronizations
  3) cache synchronization must have strong failure guarantees (atomicity)

Because of that, sstable list update and cache synchronization must be
done under a lock, and cache synchronization cannot fail to synchronize.

Normally cache synchronization achieves this no-failure guarantee by
wiping the cache (which is noexcept) when a failure is detected. There
are, however, some setup steps which cannot be skipped, e.g. taking a
lock followed by switching the cache to use the new snapshot. Those
truly cannot fail. The lock inside cache synchronizers is redundant,
since the user needs to take it anyway around the combined operation.

In order to make ensuring strong exception guarantees easier, and
making the cache interface easier to use correctly, this patch moves
the control of the combined update into the cache. This is done by
having cache::update() et al accept a callback (external_updater)
which is supposed to perform modification of the underlying mutation
source when invoked.

This is in-line with the layering. Cache is layered on top of the
on-disk mutation source (it wraps it) and reading has to go through
cache. After the patch, modification also goes through cache. This way
more of cache's requirements can be confined to its implementation.

The failure semantics of update() and other synchronizers needed to
change due to the strong exception guarantees. Now if it fails, it means
the update was not performed, neither to the cache nor to the
underlying mutation source.

The database::_cache_update_sem goes away, serialization is done
internally by the cache.

The external_updater needs to have strong exception guarantees. This
requirement is not new. It is however currently violated in some
places. This patch marks those callbacks as noexcept and leaves a
FIXME. Those should be fixed, but that's not in the scope of this
patch. Aborting is still better than corrupting the state.

Fixes #2754.

Also fixes the following test failure:

  tests/row_cache_test.cc(949): fatal error: in "test_update_failure": critical check it->second.equal(*s, mopt->partition()) has failed

which started to trigger after commit 318423d50b. Thread stack
allocation may fail, in which case we did not do the necessary
invalidation.
2017-09-04 10:04:29 +02:00
Tomasz Grabiec
b0f3efa577 row_cache: Extract invalidate_sync() 2017-09-04 10:04:29 +02:00
Tomasz Grabiec
56e3ce05db row_cache: Don't require presence checker to be supplied externally
The API is simpler and safer this way.
2017-09-04 10:04:29 +02:00
Tomasz Grabiec
1a2f17d42c row_cache: Make populate() preserve continuity 2017-09-04 10:04:29 +02:00
Tomasz Grabiec
bc3112a187 row_cache: Allow marking as fully continuous on construction
Will be needed in tests.
2017-09-04 10:04:29 +02:00
Avi Kivity
48b9e47f7d Revert "row_cache: Add missing handling for failures happening outside the updating thread"
This reverts commit f9feb310ab (requested by author).
2017-08-29 19:26:02 +03:00
Tomasz Grabiec
f9feb310ab row_cache: Add missing handling for failures happening outside the updating thread
Thread stack allocation may fail, in which case we did not do the
necessary invalidation. Fix by hoisting the scope of the cleanup function.

Also fixes the following test failure:

  tests/row_cache_test.cc(949): fatal error: in "test_update_failure": critical check it->second.equal(*s, mopt->partition()) has failed

which started to trigger after commit 318423d50b.

Message-Id: <1504023113-30374-2-git-send-email-tgrabiec@scylladb.com>
2017-08-29 19:17:22 +03:00
Raphael S. Carvalho
637f3bfa50 db: refresh row cache's underlying data source after compaction
The underlying data source in row cache holds a reference to the
sstable set from before compaction which isn't released until a memtable
flush, which means file descriptors of deleted sstables remain open,
wasting disk space.
The fix is to refresh the underlying data source in row cache.

Fixes #2570.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-24 15:49:11 -03:00
Piotr Jastrzebski
a4b6cfe8f0 row_cache: use continuity info in single partition queries
If a query requests a single partition that is inside
a range that has already been queried, use the continuity info
and don't go to disk when it's not needed.

Fixes #2244.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <15bb3b5b03225e7402e3862da53b5e06d3f4fa74.1499345295.git.piotr@scylladb.com>
2017-07-07 10:29:19 +02:00
Tomasz Grabiec
37d2b6b3c6 row_cache: Switch _stats.hits/misses to row granularity
Those are exported by the RESTful APIs called
"get_row_hits/get_row_misses" and reported by nodetool.
2017-07-04 13:55:06 +02:00
Tomasz Grabiec
60c2a86192 row_cache: Track mispopulations also at row level 2017-07-04 13:55:06 +02:00
Tomasz Grabiec
94547db620 row_cache: Track row insertions 2017-07-04 13:55:06 +02:00
Tomasz Grabiec
a58f2c8640 row_cache: Track row hits and misses 2017-07-04 13:55:06 +02:00
Tomasz Grabiec
77b2a92ece row_cache: Make mispopulation counter also apply for continuity information 2017-07-04 13:55:06 +02:00
Tomasz Grabiec
a5fdff2ac2 row_cache: Add partition_ prefix to current counters
In preparation for adding per-row counters.
2017-07-04 13:55:06 +02:00
Tomasz Grabiec
6a22cbceaf row_cache: Add metrics for operations on underlying reader 2017-07-04 13:55:06 +02:00
Tomasz Grabiec
5c7b6fc164 row_cache: Add reader-related metrics 2017-07-04 13:55:06 +02:00
Tomasz Grabiec
e720b317c9 row_cache: Restore update of concurrent_misses_same_key
It was lost in action in 6f6575f456.

Message-Id: <1499168837-5072-1-git-send-email-tgrabiec@scylladb.com>
2017-07-04 14:51:05 +03:00
Tomasz Grabiec
1d6fec0755 row_cache: Drop not very useful prefixes from metric names
This drops the "total_operations_" and "objects_" prefixes. There is no
convention of adding them in other parts of the system, and they don't
add much value.

Fixes scylladb/scylla-grafana-monitoring#169.

Message-Id: <1499160342-25865-1-git-send-email-tgrabiec@scylladb.com>
2017-07-04 13:37:12 +03:00
Tomasz Grabiec
97005825bf row_cache: Fix compilation errors with gcc 5
Message-Id: <1498741526-27055-1-git-send-email-tgrabiec@scylladb.com>
2017-06-29 16:34:46 +03:00
Tomasz Grabiec
786e75dbf7 row_cache: Use continuity information to decide whether to populate
If the cache is missing a given key but the range is marked as continuous,
it means sstables don't have that entry and we can insert it without
asking the presence checker (bloom filter based). The latter is more
expensive and gives false positives. So this improves update
performance and hit ratio.

Another positive effect is that we don't have to clear continuity now.

Fixes #1999.

Message-Id: <1498643043-21117-1-git-send-email-tgrabiec@scylladb.com>
2017-06-28 13:32:48 +03:00
Tomasz Grabiec
b56232b216 row_cache: Introduce evict() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
6f6575f456 row_cache: Enable partial partition population 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
e792220c3a row_cache: Introduce update_invalidating() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
c29878f49f row_cache: Extract memtable walking logic from update() into do_update()
So that it can be reused in update_invalidating().
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
509a0d8a83 row_cache: Allow reading from underlying through read_context
The interaction will be as follows:

  - Before creating cache_streamed_mutation for given partition, cache
    mutation reader sets up read_context for current partition (in one
    of two ways) so that the matching underlying streamed_mutation can
    be accessed at any time by cache_streamed_mutation.

  - cache_streamed_mutation assumes that read_context is set up for
    current partition and invokes fast_forward_to() and
    get_next_fragment() to access the underlying
    streamed_mutation.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
a1d3e0318c row_cache: Store autoupdating_underlying_reader in read_context
Will be reused for reading of incomplete partition entries.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
3f2320c377 row_cache: Store information whether query is a range query in read_context
We will need to use this information later in yet another place, when
creating a reader for incomplete cache entry. This refactors the code
so that there is a single place which determines this fact.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
a2207ee9a6 row_cache: Move autoupdating_underlying_reader to read_context.hh 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
ca920bd0ef row_cache: Keep only one streamed_mutation in scanning_and_populating_reader
Currently scanning_and_populating_reader asks
just_cache_scanning_reader for the next partition from cache, together
with information if the range is continuous. If it's not, it saves the
partition it got from it and moves on to reading from the underlying
reader up to that partition. When that's done, it emits the stored
partition.

This approach won't work well with upcoming changes for storing
partial partitions. We won't have whole partitions any more, so
streamed_mutation returned for the entry needs to be prepared for
reading from the underlying mutation source. We want to reuse the same
underlying reader as much as possible, so all streamed_mutations for
given read (read_context) will share the state of the underlying
reader. Construction of a streamed_mutation will depend on the fact
that the shared state is set up for it, so we cannot have two
streamed_mutations prepared at the same time (one for entry from
primary, and one for the earlier entry being populated). This change
defers the creation of a streamed_mutation for the entry present in
cache until the whole reader reaches it to avoid this problem.

This will also have another potentially beneficial effect. Since we
defer the decision about which snapshot to use until we reach the
entry, there is a higher chance that the current snapshot of the entry
will match the one used last by the populating read, and that we will
be able to reuse the reader.

It's implemented by utilizing a stable partition cursor which tracks
its current position so that it's possible to revisit the cache entry
(if it's still there) after population ends. The functionality of
just_cache_scanning_reader was inlined into
scanning_and_populating_reader.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
045888d5f3 row_cache: Introduce partition_range_cursor 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
c3905bf235 row_cache: Print position instead of key of cache_entry 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
5bfecaad99 row_cache: Switch invalidate_unwrapped() to use ring_position_view ranges
It's needed before switching cache_entry ordering to rely solely on
cache_entry::position() so that invalidate_unwrapped() never removes
the dummy entry at the end. Currently, if the range has an upper bound
like this:

  { ring_position::max(), inclusive=true }

the code which selects entries for removal would include the dummy row
at the end. It uses upper_bound() to get the end iterator, and the
dummy entry has a position which is equal to the position in the
bound.

ring_position_view ranges are end-exclusive, so it's impossible to
create a partition range which would include a dummy entry.

The code is also simpler.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
64626b32b0 row_cache: Make printable 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
54b3da1910 row_cache: Introduce find_or_create() helper 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
f2d2c221d4 row_cache: Return cache_entry reference from do_find_or_create_entry
Will be useful when additional action needs to be done on the entry
after it was created or constructed.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
bbfa52822e row_cache: Switch readers to use per-entry snapshots
Currently readers are always using the latest snapshot. This is fine
for respecting write atomicity if partitions are fully continuous in
cache (now), but will break write atomicity once partial population is
allowed.

Consider the following case:

  flush write(ck=1), write(ck=2) -> snapshot_1
  cache reader 1 reads and inserts ck=1 @snapshot_1
  flush write(ck=1), write(ck=2) -> snapshot_2
  cache reader 2 reads and inserts ck=2 @snapshot_2

Because cache update is not atomic, it can happen that reader 2 will
complete while the partition hasn't been updated yet for snapshot_2.
In such case, after read 2 the partition would contain ck=1 from
snapshot_1 and ck=2 from snapshot_2. It will match neither of the
snapshots, and this could violate write atomicity.

To solve this problem we conceptually assign each partition key in the
ring to its current snapshot which it reflects. The update process
gradually converts entries in ring order to the new snapshot. Reads
will not be using the latest snapshot, but rather the current snapshot
for the position in the ring they are at.

There is a race between the update process and populating reads. Since
after the update all entries must reflect the new snapshot, reads
using the old snapshot cannot be allowed to insert data which can no
longer be reached by the update process. Before this patch this race
was prevented by the use of a phased_barrier, where readers would keep
phased_barrier::operation alive between starting a read of a partition
and inserting it into cache. Cache update was waiting for all prior
operations before starting the update. Any later read which was not
waited for would use the latest snapshot for reads, so the update
process didn't have to fix anything up for such reads.

After this change, later reads cannot always use the latest snapshot,
they have to use the snapshot corresponding to given entry. So it's
not enough for update() to wait for prior reads in order to prevent
stale populations. The (simple) solution implemented in this patch is
to detect the conflict and abandon population of given sub-range. In
general, reads are allowed to populate given range only if it belongs
to a single snapshot.

Note that the range here is not the whole query range. For population
of continuity, it is the range starting after the previous key and
ending after the key being inserted. When populating a partition
entry, the range is a singular range containing only the partition
key. Readers switch to new snapshots automatically as they move across
the ring. It's possible that the insertion of the partition doesn't
conflict, but continuity does. In such case the entry will be inserted
but continuity will not be set.
2017-06-24 18:06:11 +02:00