Commit Graph

251 Commits

Author SHA1 Message Date
Tomasz Grabiec
cd7c7ac40f mutation_partition: Make do_compact() respect range tombstone merging rules
It compares only timestamps, but it should use intrinsic ordering of
the tombstone, which takes deletio ntime into consideration as well.
If we have two range tombstones with the same timestamp but different
deletion time (odd case, but still), then the one with the higher
deletion time should win. That's what all other parts of the system
use to resolve merges, in particular range_tombstone_list and
compact_mutation_state (the fragment stream compactor).

Not respecting this ordering violates the following equality:

  do_compact(do_compact(m1) + m2) == do_compact(m1 + m2)

which may results in some clustered rows being missing in the
right-hand side, but not in the left-hand side, due to differences in
range tombstones.

This impacts only tests currently.
Message-Id: <1528705602-7218-1-git-send-email-tgrabiec@scylladb.com>
2018-06-11 10:05:52 +01:00
Paweł Dziepak
a040d37cd5 atomic_cell: switch to new IMR-based cell reperesentation
This patch changes the implementation of atomic_cell and
atomic_cell_or_collection to use the data::cell implementation which is
based on the new in-memory representation infrastructure.
2018-05-31 15:51:11 +01:00
Paweł Dziepak
9bb1f10bb6 treewide: require type for comparing cells 2018-05-31 15:51:11 +01:00
Paweł Dziepak
aa25f0844f atomic_cell: introduce fragmented buffer value interface
As a prepratation for the switch to the new cell representation this
patch changes the type returned by atomic_cell_view::value() to one that
requires explicit linearisation of the cell value. Even though the value
is still implicitly linearised (and only when managed by the LSA) the
new interface is the same as the target one so that no more changes to
its users will be needed.
2018-05-31 15:51:11 +01:00
Paweł Dziepak
ec9d166a4f treewide: require type to compute cell memory usage 2018-05-31 15:51:11 +01:00
Paweł Dziepak
418c159057 treewide: require type to copy atomic_cell 2018-05-31 15:51:11 +01:00
Paweł Dziepak
27014a23d7 treewide: require type info for copying atomic_cell_or_collection 2018-05-31 15:51:11 +01:00
Paweł Dziepak
93130e80fb atomic_cell: require column_definition for creating atomic_cell views 2018-05-31 15:51:11 +01:00
Tomasz Grabiec
82e8217ba0 mutation_partition: Reduce row lookups in apply_monotonically()
This change speeds up merging of partition versions with many rows in
case the merged version has many rows which fall between existing rows
in the target version. This is often the case for time-series
workloads, which insert rows at the front. Lookup can be avoided for
all but the first row in the stride because we already have a
reference to the successor in the target tree, we only need to check
that the current entry in the target tree is still the successor.

This change greatly reduces amount of lookups per row during version
merging of large partitions in time-series workloads.
2018-05-30 14:41:41 +02:00
Tomasz Grabiec
81d231f35b mvcc: Remove rows from tracker gently
Some parititons may have a lot of rows. Better to iterate over them
incrementally as part of clear_gently() to avoid stalls.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
e0803ff71e Introduce mutation_cleaner
Used for collecting unsued partition_version objects and freeing them
incrementally. Will be used for both cache and memtables.
2018-05-30 14:41:39 +02:00
Tomasz Grabiec
40cc766cf2 database: Add API for incremental clearing of partition entries
Partitions can get very large. Destroying them all at once can stall
the reactor for significant amount of time. We want to avoid that by
doing destruction incrementally, deferring in between. A new API is
added for that at various levels:

  stop_iteration clear_gently() noexcept;

It returns stop_iteration::yes when the object is fully cleared and
can be now destroyed quickly. So a deferring destruction can look like
this:

  return repeat([this] { return clear_gently(); });

The reason why clear_gently() doesn't return a future<> itself is that some
contexts cannot defer, like memory reclamation.
2018-05-30 12:18:56 +02:00
Duarte Nunes
eed09dfdf9 mutation_partition: Throw std::out_of_range with backtrace on cell_at
Makes it easier to investigate bugs.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180521133753.16375-1-duarte@scylladb.com>
2018-05-23 13:51:54 +03:00
Paweł Dziepak
05c94bc98d mutation_partition: do not dereference null in find_cell()
row::find_cell() may be called for cells that do not exist in that row.
In such case nullptr shall be returned, this patch makes sure that
it is not dereferenced.
Message-Id: <20180522091726.24396-1-pdziepak@scylladb.com>
2018-05-22 10:31:09 +01:00
Paweł Dziepak
33dffd5fb6 row: add clear_hash()
Needed to measure the performance of hashing a cell.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
00509913fc mutation_partition: enable ADL for cell swap
Calling fully qualified std::swap() prohibits the cell objects from
using their own swap implementations. This patch invokes std::swap in
the usual ADL-friendly way.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
a2b5779714 counters: drop revertability of apply()
Since 4cfcd8055e 'Merge "Drop reversible
apply() from mutation_partition" from Tomasz' it is no longer required
for apply() to be revertable.
2018-05-09 16:52:26 +01:00
Vladimir Krivopalov
ed62b9a667 Add mutation_partition::apply_insert() overload that accepts TTL and expiry for row marker.
For #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-26 13:27:42 -07:00
Duarte Nunes
c8baba4e3a mutation_partition: Clarify comment about emptiness
empty() doesn't distinguish between live and dead data, so clarify
that in its comment.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:03 +01:00
Duarte Nunes
67dac67c46 mutation_partition: Regular base column in view determines row liveness
When views contain a primary key column that is not part of the base
table primary key, that column determines whether the row is live or
not. We need to ensure that when that cell is dead, and thus the
derived row marker, either by normal deletion of by TTL, so is the
rest of the row.

This patch introduces the idea of shawdowing row marker. We map the
status of the regular base column in the view's PK to the view row's
marker. If this marker is dead, so is that cell in the base table, and
so should the view row become. To enforce that, a view row's dead
marker shadows the whole row if that view includes a base regular
column in its PK.

Fixes #3360

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Duarte Nunes
b0cb5480d5 mutation_fragment: Allow querying if row is live
For clustering_row and static_row, allow querying whether they are
live or not.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-04-23 09:32:02 +01:00
Glauber Costa
9188059427 database: group statements in their own scheduling group
When we introduced the CPU scheduler, we have also introduced a group
for commitlog - but never used it. There is also doubtful value in
separating reads from writes, since they are often part of the same
workload.

To accomodate for that, let's rename the query group to "statement"
(query is not incorrect, just confusing), and move the write path,
currently ungrouped, inside it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-20 16:58:36 -04:00
Botond Dénes
ff808d9ce6 Save and restore queriers in mutation_query() and data_query()
Use the querier_cache (represented by the passed-in
querier_cache_context) object to lookup saved queriers at the start of
the page and save them at the end of it if it is likely that there will
be more page requests.
2018-03-13 10:34:34 +02:00
Tomasz Grabiec
da901b93fc cache: Track number of rows and row invalidations 2018-03-06 11:50:29 +01:00
Tomasz Grabiec
381bf02f55 cache: Evict with row granularity
Instead of evicting whole partitions, evicts whole rows.

As part of this, invalidation of partition entries was changed to not
evict from snapshots right away, but unlink them and let them be
evicted by the reclaimer.
2018-03-06 11:50:29 +01:00
Tomasz Grabiec
ab407d99cc mvcc: Store complete rows in each version in evictable entries
For row-level eviction we need to ensure that each version has
complete rows so that eviction from older versions doesn't affect the
value of the row in newer snapshots.

This is achieved by copying the row from an older version before
applying the increment in the new version.

Only affects evictable entries, memtables are not affected.
2018-03-06 11:50:28 +01:00
Tomasz Grabiec
bee875fa7d cache: Ensure all evictable partition_versions have a dummy after all rows
Every evictable version will have a dummy entry at the end so that it can be
tracked in the LRU.

It is also needed to allow old versions to stay around (with
tombstones and static rows) after all rows are evicted. Such versions
must be fully discontinuous, and we need some entry to mark that.
2018-03-06 11:50:27 +01:00
Tomasz Grabiec
5320705300 cache: Propagate cache_tracker to places manipulating evictable entries
cache_tracker reference will be needed to link/unlink row entries.

No change of behavior in this patch.
2018-03-06 11:50:27 +01:00
Tomasz Grabiec
3dc9000c51 mutation_partition: Introduce rows_entry::is_last_dummy()
Will be needed by row evictor, which needs to treat last dummies
specially (not evict them).
2018-03-06 11:50:26 +01:00
Tomasz Grabiec
9893e8e5f7 mvcc: Make each version have independent continuity
This change is a preparation for introducing row-level eviction, such that entries
can be evicted from older versions without having to touch other versions.

Currently continuity flags on entries are interpreted relative to the
combined view merged from all entries. For example:

 v2:                  <key=2, cont=1>
 v1: <key=1, cont=1>

In v2, the flag on entry key=2 marks the range (1, 2) as
continuous. This is problematic because if the old version is evicted, continuity
will change in an incorrect way:

   v2:                  <key=2, cont=1>

Here, the range (-inf, 1) would be marked as continuous, which is not true.

To solve this problem, we change the rules for continuity
interpretation in MVCC. Each version will have its own continuity,
fully specified in that version, independent of continuity of other
versions. Continuity of the snapshot will be a union of continuous
ranges in each version.

It is assumed that continuous intervals in different versions are non-
overlapping, except for points corresponding to complete rows, in
which case a later version may overlap with an older version
(overwrite). We make use of this assumption to make calculation of the
union of intervals on merging easier. I make use of the above
assumption in mutation_partition::apply_monotonically().

MVCC population of incomplete entries already almost maintains the
non-overlapping invariant, because population intervals correspond to
intervals which are incomplete in the old snapshot. The only change
needed is to ensure that both population bounds will have entries in
the latest version. Population from memtables doesn't mark any
intervals as continuous, so also conforms. The only change needed
there is to not inherit continuity flags from the old snapshot,
effectively making the new version internally discontinuous except for
row points.

The example from the beginning will become:

 v2: <key=1, cont=0>  <key=2, cont=1>
 v1: <key=1, cont=1>

When marking a range as continuous with some rows present only in
older versions, we need to insert entries in the latest version, so
that we can mark the range as continuous. The easiest solution is to
copy the entry from the old version. Another option would be to add
support for incomplete rows and insert such instead. This way we would
avoid duplicating row contents. This optimization is deferred.
2018-03-06 11:50:25 +01:00
Duarte Nunes
42f407ad9e row: Use cached hash for hash calculation
This entails doing the cell hash calculation slightly differently,
where the cell is hashed individually, the resulting hash being added
to the running one.

Instead of propagating a flag all through the call chain, we detect
whether we are in the new mode by the employed hash algorithm.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:49 +00:00
Duarte Nunes
d773e4b9d4 mutation_partition: Replace hash_row_slice with appending_hash
This enables us to only branch once per row on the actual hash
algorithm, instead of once per row data item.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:49 +00:00
Duarte Nunes
99a3e3aa76 mutation_partition: Allow caching cell hashes
We add storage to a row to hold the cached hashes of each individual
cell. We don't store the hash in each cell because that would a)
change the cell equality function, and b) require us to change a cell
in a potentially fragmented buffer.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:47 +00:00
Duarte Nunes
b2e1a91f4d query-result: Use digester instead of md5_hasher
Use the digester class instead of md5_hasher to encapsulate the
decision of which hash algorithm to use.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:50 +00:00
Piotr Jastrzebski
96c97ad1db Rename streamed_mutation* files to mutation_fragment*
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Avi Kivity
c743d1258d Merge "Reverse order of version merging in MVCC" from Tomasz
"Changes merging in MVCC to apply newer version to older instead of older to
newer.

Before (v0 = oldest):

  (((v3 + v2) + v1) + v0)

After:

  (v0 + (v1 + (v2 + v3)))

or:

  (((v0 + v1) + v2) + v3)

There are several reasons to do this:

  1) When continuity merging will change semantics to support eviction
     from older versions, it will be easier to implement apply() if we
     can assume that we merge newer to older instead of older to
     newer, since newer version may have entries falling into a
     continuous interval in older, but not the other way around. If we
     didn't revert the order, apply() would have to keep track of
     lower bound of a continuous interval in the right-hand side
     argument (older version) as it is applied and update continuity
     flags in the left hand side by scanning all entries overlapping
     with it. If order is reversed, merging only needs to deal with
     the current entry. Also, if we were to keep the old order, we
     cannot simply move entries from the left hand side as we merge
     because we need to keep track of the lower bound of a continuous
     interval, and we need to provide monotonic exception
     guarantees. So merging would be both more complicated and slower.

  2) With large partitions older versions are typically larger than
     newer versions, and since merging is O(N_right*(1 + log(N_left))),
     it's better to merge newer into older.
     This fixes latency spikes seen in perf_cache_eviction.

Fixes #2715."

* tag 'tgrabiec/reverse-order-of-mvcc-version-merging-v1' of github.com:scylladb/seastar-dev:
  mvcc: Reverse order of version merging
  anchorless_list: Introduce last()
  mvcc: Implement partition_entry::upgrade() using squashed()
  mvcc: Extract version merging functions
  mutation_partition: Add rows_entry::set_dummy()
  position_in_partition: Introduce after_key()
2018-01-21 13:56:57 +02:00
José Guilherme Vanz
380bc0aa0d Swap arguments order of mutation constructor
Swap arguments in the mutation constructor keeping the same standard
from the constructor variants. Refs #3084

Signed-off-by: José Guilherme Vanz <guilherme.sft@gmail.com>
Message-Id: <20180120000154.3823-1-guilherme.sft@gmail.com>
2018-01-21 12:58:42 +02:00
Piotr Jastrzebski
d266eaa01e mutation_source: rename make_flat_mutation_reader to make_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 09:30:12 +01:00
Tomasz Grabiec
60d3c25c02 mvcc: Reverse order of version merging
Change merging to apply newer version to older instead of older to
newer.

Before:

  (((v3 + v2) + v1) + v0)

After:

  (v0 + (v1 + (v2 + v3)))

or equivalent:

  (((v0 + v1) + v2) + v3)

There are several reasons to do this:

  1) When continuity merging will change semantics to support eviction
     from older versions, it will be easier to implement apply() if we
     can assume that we merge newer to older instead of older to
     newer, since newer version may have entries falling into a
     continuous interval in older, but not the other way around. If we
     didn't revert the order, apply() would have to keep track of
     lower bound of a continuous interval in the right-hand side
     argument (older version) as it is applied and update continuity
     flags in the left hand side by scanning all entries overlapping
     with it. If order is reversed, merging only needs to deal with
     the current entry. Also, if we were to keep the old order, we
     cannot simply move entries from the left hand side as we merge
     because we need to keep track of the lower bound of a continuous
     interval, and we need to provide monotonic exception
     guarantees. So merging would be both more complicated and slower.

  2) With large partitions older versions are typically larger than
     newer versions, and since merging is O(N_right*(1 + log(N_left))),
     it's better to merge newer into older.

Fixes #2715.
2018-01-18 13:52:08 +01:00
Duarte Nunes
83e983d4d0 mutation_partition: Remove unused operator==()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180115013546.67260-1-duarte@scylladb.com>
2018-01-15 11:16:35 +02:00
Duarte Nunes
9d1d9883ff mutation_partition: Remove unused for_each_cell() overload
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180115013618.67351-1-duarte@scylladb.com>
2018-01-15 11:16:34 +02:00
Glauber Costa
54d3ebde4e flat_mutation_reader: pass timeout down to consume()
We pass the timeout that we received from data_query/mutation_query
down to consume, which is responsible for actually reading the data.

To make those timeouts actionable, though, we'll have to patch
fill_buffer(). This will happen in the next patch.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-11 12:07:41 -05:00
Glauber Costa
8433702c90 mutation_query: add a timeout to the mutation query path
data_query and mutation_query are patched so that they start accepting a
per-query timeout. We will default to no timeout, and then no callers
will be changed yet.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-11 12:07:41 -05:00
Tomasz Grabiec
8e8ece5dec mutation_partition: Introduce deletable_row::apply() from a clustering_row fragment 2017-12-08 17:50:47 +01:00
Tomasz Grabiec
b3709047b0 mutation_partition: Extract sliced() from mutation into mutation_partition
So that we can call it on mutation_partition.
2017-12-08 17:50:47 +01:00
Tomasz Grabiec
5541c9fd63 mutation_partition: Define equal_continuity() using get_continuity()
This fixes the problem of equal_continuity() being prone to false
positives due to redundant information (extra dummy rows) present in
one of the partitions. get_continuity() is minified, so is not prone
to this.
2017-12-08 12:01:27 +01:00
Tomasz Grabiec
bde050835f mutation_partition: Make check_continuity() const-qualified 2017-12-08 12:01:27 +01:00
Tomasz Grabiec
865bd8a594 mutation_partition: Introduce mutation_partition::get_continuity()
Intended to be used in tests.
2017-12-08 12:01:27 +01:00
Tomasz Grabiec
22138554e6 mutation_partition: Leave moved-from row in an empty state
Needed by apply_monotonically(). Fixes SIGSEGV in mutation_test_g.
2017-12-08 12:01:27 +01:00
Tomasz Grabiec
a305a28574 mutation_partition: Fix upgrade() not preserving static row continuity
We do not rely on this yet, but will.
2017-12-08 12:01:27 +01:00