When views contain a primary key column that is not part of the base
table primary key, that column determines whether the row is live or
not. We need to ensure that when that cell is dead, and thus the
derived row marker, either by normal deletion of by TTL, so is the
rest of the row.
This patch introduces the idea of shawdowing row marker. We map the
status of the regular base column in the view's PK to the view row's
marker. If this marker is dead, so is that cell in the base table, and
so should the view row become. To enforce that, a view row's dead
marker shadows the whole row if that view includes a base regular
column in its PK.
Fixes#3360
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Instead of evicting whole partitions, evicts whole rows.
As part of this, invalidation of partition entries was changed to not
evict from snapshots right away, but unlink them and let them be
evicted by the reclaimer.
This change is a preparation for introducing row-level eviction, such that entries
can be evicted from older versions without having to touch other versions.
Currently continuity flags on entries are interpreted relative to the
combined view merged from all entries. For example:
v2: <key=2, cont=1>
v1: <key=1, cont=1>
In v2, the flag on entry key=2 marks the range (1, 2) as
continuous. This is problematic because if the old version is evicted, continuity
will change in an incorrect way:
v2: <key=2, cont=1>
Here, the range (-inf, 1) would be marked as continuous, which is not true.
To solve this problem, we change the rules for continuity
interpretation in MVCC. Each version will have its own continuity,
fully specified in that version, independent of continuity of other
versions. Continuity of the snapshot will be a union of continuous
ranges in each version.
It is assumed that continuous intervals in different versions are non-
overlapping, except for points corresponding to complete rows, in
which case a later version may overlap with an older version
(overwrite). We make use of this assumption to make calculation of the
union of intervals on merging easier. I make use of the above
assumption in mutation_partition::apply_monotonically().
MVCC population of incomplete entries already almost maintains the
non-overlapping invariant, because population intervals correspond to
intervals which are incomplete in the old snapshot. The only change
needed is to ensure that both population bounds will have entries in
the latest version. Population from memtables doesn't mark any
intervals as continuous, so also conforms. The only change needed
there is to not inherit continuity flags from the old snapshot,
effectively making the new version internally discontinuous except for
row points.
The example from the beginning will become:
v2: <key=1, cont=0> <key=2, cont=1>
v1: <key=1, cont=1>
When marking a range as continuous with some rows present only in
older versions, we need to insert entries in the latest version, so
that we can mark the range as continuous. The easiest solution is to
copy the entry from the old version. Another option would be to add
support for incomplete rows and insert such instead. This way we would
avoid duplicating row contents. This optimization is deferred.
We add storage to a row to hold the cached hashes of each individual
cell. We don't store the hash in each cell because that would a)
change the cell equality function, and b) require us to change a cell
in a potentially fragmented buffer.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch forces the size of vector_storage's internal storage to 5,
meaning that the underlying managed_vector will ensure it doesn't need
to externally allocate a buffer to hold the row, if only its first 5
cells are set.
We define this size explicitly so we can change the vector's value
type in upcoming patches without affecting the optimization.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The uses which needed strong or weak exception guarantees were
switched to a solution involving apply_monotonically(). All remaining
uses don't need any exception guarantees.
Has weaker exception guarantees than apply(), which allows for simpler
implementation. Intended to replace the apply() with strong exception
guarantees.
In the memtable flusher, we account for the size of a partition as we
read them. However, there are other points in the architecture where we
would like to calculate the size of a partition in a point in which we
are not reading it. One such example is the cache update process.
This patch enhances the mutation_partition adding a method that returns
the total size for this partition.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This will allow expressing lack of information about certain ranges of
rows (including the static row), which will be used in cache to
determine if information in cache is complete or not.
Continuity is represented internally using flags on row entries. The
key range between two consecutive entries is continuous iff
rows_entry::continuous() is true for the later entry. The range
starting after the last entry is assumed to be continuous. The range
corresponding to the key of the entry is continuous iff
rows_entry::dummy() is false.
[tgrabiec:
- based on the following commits:
4a5bf75 - Piotr Jastrzebski : mutation_partition: introduce dummy rows_entry
773070e - Piotr Jastrzebski : mutation_partition: add continuity flag to rows_entry
- documented that partition tombstone is always complete
- require specifying the partition tombstone when creating an incomplete entry
- replaced rows_entry(dummy_tag, ...) constructor with more general
rows_entry(position_in_partition, ...)
- documented continuity semantics on mutation_partition
- fixed _static_row_cached being lost by mutation_partition copy constructors
- fixed conversion to streamed_mutation to ignore dummy entries
- fixed mutation_partition serializer to drop dummy entries
- documented semantics of continuity on mutation_partition level
- dropped assumptions that dummy entries can be only at the last position
- changed equality to ignore continuity completely, rather than
partially (it was not ignoring dummy entries, but ignoring
continuity flag)
- added printout of continuity information in mutation_partition
- fixed handling of empty entries in apply_reversibly() with regards
to continuity; we no longer can remove empty entries before
merging, since that may affect continuity of the right-hand
mutation. Added _erased flag.
- fixed mutation_partition::clustered_row() with dummy==true to not ignore the key
- fixed partition_builder to not ignore continuity
- renamed dummy_tag_t to dummy_tag. _t suffix is reserved.
- standardized all APIs on is_dummy and is_continuous bool_class:es
- replaced add_dummy_entry() with ensure_last_dummy() with safer semantics
- dropped unused remove_dummy_entry()
- simplified and inlined cache_entry::add_dummy_entry()
- fixed mutation_partition(incomplete_tag) constructor to mark all row ranges as discontinuous
]
key() will not be valid for dummy entries, but position() is always
valid.
[tgrabiec: Extracted from other commits]
[tgrabiec: Added missing change to range_tombstone_stream::get_next]
Change the clustering key argument in mutation::set_cell from
exploded_clustering_prefix to clustering_key_prefix, which allows for
some overall code simplification and fewer copies. This mostly affects
the cql3 layer.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch ensures the different mutation_partition::apply_delete()
overloads behave similarly, so that, for example, an empty clustering
key is treated the same way as an empty
exploded_clustering_key_prefix.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch replaces the current row tombstone representation by a
row_tombstone.
The intent of the patch is thus to reify the idea of shadowable
tombstones, that up until now we considered all materialized view row
tombstones to be.
We need to distinguish shadowable from non-shadowable row tombstones
to support scenarios such as, when inserting to a table with a
materialzied view:
1. insert into base (p, v1, v2) values (3, 1, 3) using timestamp 1
2. delete from base using timestamp 2 where p = 3
3. insert into base (p, v1) values (3, 1) using timestamp 3
These should yield a view row where v2 is definitely null, but with
the current implementation, v2 will pop back with its value v2=3@TS=1,
even though its dead in the base row. This is because the row
tombstone inserted at 2) is a shadowable one.
This patch only addresses the memory representation of such
row_tombstones.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch introduces the row_tombstone class, which represents a
tombstone made up of a regular tombstone and a shadowable one.
The rules for row_tombstones are as follows:
- The shadowable tombstone is always >= than the regular one;
- The regular tombstone works as expected;
- The shadowable tombstone doesn't erase or compact away the regular
row tombstone, nor dead cells;
- The shadowable tombstone can erase live cells, but only provided
they can be recovered (e.g., by including all cells in a MV update,
both updated cells and pre-existing ones);
- The shadowable tombstone can be erased or compacted away by a newer
row marker.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
A shadowable tombstone is a tombstone that can be replaced by a
smaller one if provided a row_marker with a bigger timestamp than the
shadowable tombstone.
In the context of a row, it is only valid as long as no newer insert
is done (thus setting a live row marker; note that if the row
timestamp set is lower than the tombstone's, then the tombstone
remains in effect as usual). If a row has a shadowable tombstone with
timestamp Ti and that row is updated with a timestamp Tj, such that
Tj > Ti (and that update sets the row marker), then the shadowable
tombstone is shadowed by that update. A concrete consequence is that
if the update has cells with timestamp lower than Ti, then those cells
are preserved (since the deletion is removed), and this is contrary to
a regular, non-shadowable row tombstone where the tombstone is
preserved and such cells are removed.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Counter write path involves read-modify-write. That read is guaranteed
to query only a single partition, does not care about dead cells and
expects to receive an unserialized mutation as a result.
Standard mutation queries can are able to produce results fit for
counter updates, but the logic involved is much more general (i.e.
slower), hence the addition of new, counter-specific kind of query.
This patch introduces shadowable row tombstones. A shadowable row
tombstone is valid only if the row has no live marker. In other words,
the row tombstone is only valid as long as no newer insert is done
(thus setting a live row marker; note that if the row timestamp set
is lower than the tombstone's, then the tombstone remains in effect
as usual).
If a row has a shadowable tombstone with timestamp Ti and that row
is updated with a timestamp Tj, such that Tj > Ti (and that update
sets the row marker), then the shadowable tombstone is shadowed by
that update. A concrete consequence is that if the update has cells
with timestamp lower than Ti, then those cells are preserved (since
the deletion is removed), and this is contrary to a regular,
non-shadowable row tombstone where the tombstone is preserved and
such cells are removed.
Currently, only Materialized Views require shadowable row tombstones,
which solve a problem with view row deletions. Consider a base row with
columns p, v1, v2, PRIMARY KEY (p) denormalized into a view row consisting
of columns p, v1, v2 PRIMARY KEY (p, v1), and the following operations:
1) INSERT INTO base (p, v1, v2) VALUES (0, 0, 1) USING TIMESTAMP 0;
2) UPDATE base SET v1 = 1 USING TIMESTAMP 1 WHERE p = 0;
3) UPDATE base SET v1 = 0 USING TIMESTAMP 2 WHERE p = 0;
Without shadowable tombstones, the view contains:
At 1), pk = (0, 0), row_marker@T0, v2=1@T0
At 2), pk = (0, 0), row_marker@T0, row_tombstone@T1, v2=1@T0
pk = (0, 1), row_marker@T1, v2=1@T0
At 3), pk = (0, 0), row_marker@T2, row_tombstone@T1, v2=1@T0
pk = (0, 1), row_marker@T1, row_tombstone@T2, v2=1@T0
Notice how, if we read row (0, 0), the value of v2 will be shadowed by
the row tombstone we previously inserted.
With a view's row tombstone becoming shadowable, at 3) the row (0, 0)
will look like pk = (0, 0), row_marker@T2, shadowable_tombstone@T1, v2=1@T0,
which is equivalent to pk = (0, 0), row_marker@T2, v2=1@T0.
Since the shadowable tombstone is shadowed by the new row marker (T0 <
T2), now v2 would be taken into account.
Finally, note that this patch doesn't generalize the idea of
shadowable tombstone, instead taking advantage of the fact that they
are only needed by Materialized Views. This saves changing the
tombstone representation to account for an extra flag, the bits such
representation would require, and also avoids changes to the storage
format.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>