Commit Graph

23 Commits

Author SHA1 Message Date
Tomasz Grabiec
9975135110 row_cache: Make sure reader makes forward progress after each fill_buffer()
If reader's buffer is small enough, or preemption happens often
enough, fill_buffer() may not make enough progress to advance
_lower_bound. If also iteartors are constantly invalidated across
fill_buffer() calls, the reader will not be able to make progress.

See row_cache_test.cc::test_reading_progress_with_small_buffer_and_invalidation()
for an examplary scenario.

Also reproduced in debug-mode row_cache_test.cc::test_concurrent_reads_and_eviction

Message-Id: <1528283957-16696-1-git-send-email-tgrabiec@scylladb.com>
2018-06-06 16:01:52 +03:00
Paweł Dziepak
27014a23d7 treewide: require type info for copying atomic_cell_or_collection 2018-05-31 15:51:11 +01:00
Tomasz Grabiec
7c34cd04e2 mvcc: Propagate information if insertion happened from ensure_entry_if_complete()
It's needed by users to update statistics, different ones depending on
if the row already existed or not.
2018-03-07 16:50:55 +01:00
Tomasz Grabiec
381bf02f55 cache: Evict with row granularity
Instead of evicting whole partitions, evicts whole rows.

As part of this, invalidation of partition entries was changed to not
evict from snapshots right away, but unlink them and let them be
evicted by the reclaimer.
2018-03-06 11:50:29 +01:00
Tomasz Grabiec
dce9185fc9 cache: Track static row insertions separately from regular rows
So that row eviction counter, which doesn't look at the static row,
can be in sync with row insertion counter.
2018-03-06 11:50:28 +01:00
Tomasz Grabiec
ab407d99cc mvcc: Store complete rows in each version in evictable entries
For row-level eviction we need to ensure that each version has
complete rows so that eviction from older versions doesn't affect the
value of the row in newer snapshots.

This is achieved by copying the row from an older version before
applying the increment in the new version.

Only affects evictable entries, memtables are not affected.
2018-03-06 11:50:28 +01:00
Tomasz Grabiec
29d167bf01 mvcc: Introduce partition_snapshot_row_cursor::ensure_entry_in_latest()
To avoid duplication of logic between cache reader and
ensure_entry_if_complete().
2018-03-06 11:50:28 +01:00
Tomasz Grabiec
9893e8e5f7 mvcc: Make each version have independent continuity
This change is a preparation for introducing row-level eviction, such that entries
can be evicted from older versions without having to touch other versions.

Currently continuity flags on entries are interpreted relative to the
combined view merged from all entries. For example:

 v2:                  <key=2, cont=1>
 v1: <key=1, cont=1>

In v2, the flag on entry key=2 marks the range (1, 2) as
continuous. This is problematic because if the old version is evicted, continuity
will change in an incorrect way:

   v2:                  <key=2, cont=1>

Here, the range (-inf, 1) would be marked as continuous, which is not true.

To solve this problem, we change the rules for continuity
interpretation in MVCC. Each version will have its own continuity,
fully specified in that version, independent of continuity of other
versions. Continuity of the snapshot will be a union of continuous
ranges in each version.

It is assumed that continuous intervals in different versions are non-
overlapping, except for points corresponding to complete rows, in
which case a later version may overlap with an older version
(overwrite). We make use of this assumption to make calculation of the
union of intervals on merging easier. I make use of the above
assumption in mutation_partition::apply_monotonically().

MVCC population of incomplete entries already almost maintains the
non-overlapping invariant, because population intervals correspond to
intervals which are incomplete in the old snapshot. The only change
needed is to ensure that both population bounds will have entries in
the latest version. Population from memtables doesn't mark any
intervals as continuous, so also conforms. The only change needed
there is to not inherit continuity flags from the old snapshot,
effectively making the new version internally discontinuous except for
row points.

The example from the beginning will become:

 v2: <key=1, cont=0>  <key=2, cont=1>
 v1: <key=1, cont=1>

When marking a range as continuous with some rows present only in
older versions, we need to insert entries in the latest version, so
that we can mark the range as continuous. The easiest solution is to
copy the entry from the old version. Another option would be to add
support for incomplete rows and insert such instead. This way we would
avoid duplicating row contents. This optimization is deferred.
2018-03-06 11:50:25 +01:00
Tomasz Grabiec
d0e1a3c63e mvcc: partition_snapshot_row_weakref: Introduce is_in_latest_version() 2018-03-06 11:32:09 +01:00
Tomasz Grabiec
313f2c2bb0 cache: Document intent of maybe_update_continuity() 2018-03-06 11:32:09 +01:00
Tomasz Grabiec
3214883a25 cache: Extract cache_streamed_mutation::ensure_population_lower_bound() 2018-03-06 11:32:09 +01:00
Duarte Nunes
712c051de6 cache_flat_mutation_reader: Pre-calculate cell hash
When digest is requested, pre-calculate the cell's hash. We consider
the case when the cell is already in the cache, and the case when it
added by the underlying reader.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:50 +00:00
Piotr Jastrzebski
96c97ad1db Rename streamed_mutation* files to mutation_fragment*
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Piotr Jastrzebski
19e1f7c285 cache_flat_mutation_reader: fix tombstones handling with small buffer
Before when the buffer was so small that it could fit only a single
range_tombstone, cache_flat_mutation_reader would keep returning
the same tombstone over and over again.

The fix is to set _lower_bound to the next fragment we want to return.

Fixes #3139

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:09:11 +01:00
Tomasz Grabiec
6654fa6df7 row_cache: Drop unnecessary assignment to _lower_bound on exception
We no longer drain cached tombstones since commit
41ede08a1d, so this adjustment of
lower_bound is not needed.

Message-Id: <1516796248-11290-1-git-send-email-tgrabiec@scylladb.com>
2018-01-24 16:39:34 +02:00
Glauber Costa
5140aaea00 add a timeout to fast forward to
In the last patch, we enabled per-request timeouts, we enable timeouts
in fill_buffer. There are many places, though, in which we
fast_forward_to before we fill_buffer, so in order to make that
effective we need to propagate the timeouts to fast_forward_to as well.

In the same way as fill_buffer, we make the argument optional wherever
possible in the high level callers, making them mandatory in the
implementations.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-12 07:43:19 -05:00
Glauber Costa
d965af42b0 add a timeout to fill_buffer
As part of the work to enable per-request timeouts, we enable timeouts
in fill_buffer.

The argument is made optional at the main classes, but mandatory in all
the ::impl versions. This way we'll make sure we didn't forget anything.

At this point we're still mostly passing that information around and
don't have any entity that will act on those timeouts. In the next patch
we will wire that up.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-11 12:07:41 -05:00
Duarte Nunes
16c975edcc partition_version: Return static_row fragment from static_row()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180109162815.5811-1-duarte@scylladb.com>
2018-01-09 19:17:02 +01:00
Tomasz Grabiec
41ede08a1d mutation_reader: Allow range tombstones with same position in the fragment stream
When we get two range tombstones with the same lower bound from
different data sources (e.g. two sstable), which need to be combined
into a single stream, they need to be de-overlapped, because each
mutation fragment in the stream must have a different position. If we
have range tombstones [1, 10) and [1, 20), the result of that
de-overlapping will be [1, 10) and [10, 20]. The problem is that if
the stream corresponds to a clustering slice with upper bound greater
than 1, but lower than 10, the second range tombstone would appear as
being out of the query range. This is currently violating assumptions
made by some consumers, like cache populator.

One effect of this may be that a reader will miss rows which are in
the range (1, 10) (after the start of the first range tombstone, and
before the start of the second range tombstone), if the second range
tombstone happens to be the last fragment which was read for a
discontinuous range in cache and we stopped reading at that point
because of a full buffer and cache was evicted before we resumed
reading, so we went to reading from the sstable reader again. There
could be more cases in which this violation may resurface.

There is also a related bug in mutation_fragment_merger. If the reader
is in forwarding mode, and the current range is [1, 5], the reader
would still emit range_tombstone([10, 20]). If that reader is later
fast forwarded to another range, say [6, 8], it may produce fragments
with smaller positions which were emitted before, violating
monotonicity of fragment positions in the stream.

A similar bug was also present in partition_snapshot_flat_reader.

Possible solutions:

 1) relax the assumption (in cache) that streams contain only relevant
 range tombstones, and only require that they contain at least all
 relevant tombstones

 2) allow subsequent range tombstones in a stream to share the same
 starting position (position is weakly monotonic), then we don't need
 to de-overlap the tombstones in readers.

 3) teach combining readers about query restrictions so that they can drop
fragments which fall outside the range

 4) force leaf readers to trim all range tombstones to query restrictions

This patch implements solution no 2. It simplifies combining readers,
which don't need to accumulate and trim range tombstones.

I don't like solution 3, because it makes combining readers more
complicated, slower, and harder to properly construct (currently
combining readers don't need to know restrictions of the leaf
streams).

Solution 4 is confined to implementations of leaf readers, but also
has disadvantage of making those more complicated and slower.

Fixes #3093.
2017-12-22 11:06:20 +01:00
Piotr Jastrzebski
b976872c1a Rename all *_underlying_flat methods in read_context to *_underlying.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:57 +01:00
Piotr Jastrzebski
a322268416 Turn cache_flat_mutation_reader into a flat reader.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:33 +01:00
Piotr Jastrzebski
f467e84424 Rename cache_streamed_mutation to cache_flat_mutation_reader
in cache_flat_mutation_reader.hh

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:33 +01:00
Piotr Jastrzebski
3075780097 Make copy of cache_streamed_mutation.hh
and call it cache_flat_mutation_reader.hh.
It will be turned into a flat mutation reader.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:33 +01:00