Commit Graph

139 Commits

Author SHA1 Message Date
Glauber Costa
d49ecae201 mutation_partition: estimate size of partition
In the memtable flusher, we account for the size of a partition as we
read them. However, there are other points in the architecture where we
would like to calculate the size of a partition in a point in which we
are not reading it. One such example is the cache update process.

This patch enhances the mutation_partition adding a method that returns
the total size for this partition.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-11-08 16:21:44 -05:00
Tomasz Grabiec
749f5770df mutation: Introduce apply(mutation_fragment) 2017-11-02 12:16:17 +01:00
Tomasz Grabiec
72028bb048 mutation_partition: Allow creating rows_entry at any clustered position_in_partition
In preparation for supporting setting continuity of arbitrary clustering range.
2017-11-02 11:05:19 +01:00
Tomasz Grabiec
409adc045a mutation_partition: Remove delegating_compare()
It can't work with rows_entry at any position_in_partition,
so we need to drop it.
2017-11-02 11:05:19 +01:00
Tomasz Grabiec
455a1b0d24 mutation_partition: Introduce range continuity checking methods 2017-09-13 17:47:04 +02:00
Tomasz Grabiec
abc489e99d mutation_partition: Enable rows_entry::compare() on position_in_partition_views
For full symmetry with existing overloads.
2017-09-13 17:47:04 +02:00
Tomasz Grabiec
b6ae5783cd mvcc: Introduce partition_entry::evict()
The operation frees as much memory as possible, marking affected
mutation elements as discontinuous.
2017-09-13 17:47:03 +02:00
Paweł Dziepak
43cce6c2f4 rows_entry: make position() inlineable 2017-07-26 14:38:27 +01:00
Tomasz Grabiec
0770845a23 mutation_partition: Introduce r-value accepting deletable_row::apply() 2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
efc75b0bc3 mutation_partition: Add rows_entry constructor which accepts full contents
[tgrabiec: Extracted from different patch]
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
dce293e11c tests: row_cache: Apply only fully continuous mutations to underlying mutation source
Cache currently assumes that mutations coming from outside are fully
continuous.
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
05b56fcfb0 mutation_partition: Add support for specifying continuity
This will allow expressing lack of information about certain ranges of
rows (including the static row), which will be used in cache to
determine if information in cache is complete or not.

Continuity is represented internally using flags on row entries. The
key range between two consecutive entries is continuous iff
rows_entry::continuous() is true for the later entry. The range
starting after the last entry is assumed to be continuous. The range
corresponding to the key of the entry is continuous iff
rows_entry::dummy() is false.

[tgrabiec:
  - based on the following commits:
     4a5bf75 - Piotr Jastrzebski : mutation_partition: introduce dummy rows_entry
     773070e - Piotr Jastrzebski : mutation_partition: add continuity flag to rows_entry
  - documented that partition tombstone is always complete
  - require specifying the partition tombstone when creating an incomplete entry
  - replaced rows_entry(dummy_tag, ...) constructor with more general
    rows_entry(position_in_partition, ...)
  - documented continuity semantics on mutation_partition
  - fixed _static_row_cached being lost by mutation_partition copy constructors
  - fixed conversion to streamed_mutation to ignore dummy entries
  - fixed mutation_partition serializer to drop dummy entries
  - documented semantics of continuity on mutation_partition level
  - dropped assumptions that dummy entries can be only at the last position
  - changed equality to ignore continuity completely, rather than
    partially (it was not ignoring dummy entries, but ignoring
    continuity flag)
  - added printout of continuity information in mutation_partition
  - fixed handling of empty entries in apply_reversibly() with regards
    to continuity; we no longer can remove empty entries before
    merging, since that may affect continuity of the right-hand
    mutation. Added _erased flag.
  - fixed mutation_partition::clustered_row() with dummy==true to not ignore the key
  - fixed partition_builder to not ignore continuity
  - renamed dummy_tag_t to dummy_tag. _t suffix is reserved.
  - standardized all APIs on is_dummy and is_continuous bool_class:es
  - replaced add_dummy_entry() with ensure_last_dummy() with safer semantics
  - dropped unused remove_dummy_entry()
  - simplified and inlined cache_entry::add_dummy_entry()
  - fixed mutation_partition(incomplete_tag) constructor to mark all row ranges as discontinuous
  ]
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
a77734952d mutation_partition: Make rows_entry comparable with position_in_partition 2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
65b3123516 mutation_partition: Use rows_entry::position() in comparators
key() will not be valid for dummy entries, but position() is always
valid.

[tgrabiec: Extracted from other commits]
[tgrabiec: Added missing change to range_tombstone_stream::get_next]
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
660f3127a6 mutation_partition: Introduce rows_entry::position()
In preparation for enabling dummy entries with postion past all
clustering rows.
2017-06-24 18:06:11 +02:00
Gleb Natapov
f5679e0416 database: remove remnants of no longer existing db::serializer.
Message-Id: <20170604100552.GD8248@scylladb.com>
2017-06-04 13:07:17 +03:00
Duarte Nunes
9e88b60ef5 mutation: Set cell using clustering_key_prefix
Change the clustering key argument in mutation::set_cell from
exploded_clustering_prefix to clustering_key_prefix, which allows for
some overall code simplification and fewer copies. This mostly affects
the cql3 layer.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-04 15:59:50 +02:00
Duarte Nunes
db63ffdbb4 mutation_partition: Harmonize apply_delete overloads
This patch ensures the different mutation_partition::apply_delete()
overloads behave similarly, so that, for example, an empty clustering
key is treated the same way as an empty
exploded_clustering_key_prefix.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-04 15:59:50 +02:00
Duarte Nunes
4e693383f7 mutation_partion: Use row_tombstone
This patch replaces the current row tombstone representation by a
row_tombstone.

The intent of the patch is thus to reify the idea of shadowable
tombstones, that up until now we considered all materialized view row
tombstones to be.

We need to distinguish shadowable from non-shadowable row tombstones
to support scenarios such as, when inserting to a table with a
materialzied view:

1. insert into base (p, v1, v2) values (3, 1, 3) using timestamp 1
2. delete from base using timestamp 2 where p = 3
3. insert into base (p, v1) values (3, 1) using timestamp 3

These should yield a view row where v2 is definitely null, but with
the current implementation, v2 will pop back with its value v2=3@TS=1,
even though its dead in the base row. This is because the row
tombstone inserted at 2) is a shadowable one.

This patch only addresses the memory representation of such
row_tombstones.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-04-25 11:46:33 +02:00
Duarte Nunes
6a2bccd4ae mutation_partion: Introduce row_tombstone
This patch introduces the row_tombstone class, which represents a
tombstone made up of a regular tombstone and a shadowable one.

The rules for row_tombstones are as follows:

 - The shadowable tombstone is always >= than the regular one;
 - The regular tombstone works as expected;
 - The shadowable tombstone doesn't erase or compact away the regular
   row tombstone, nor dead cells;
 - The shadowable tombstone can erase live cells, but only provided
   they can be recovered (e.g., by including all cells in a MV update,
   both updated cells and pre-existing ones);
 - The shadowable tombstone can be erased or compacted away by a newer
   row marker.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-04-25 11:46:28 +02:00
Duarte Nunes
3d49c1da01 mutation_partition: Introduce shadowable tombstones
A shadowable tombstone is a tombstone that can be replaced by a
smaller one if provided a row_marker with a bigger timestamp than the
shadowable tombstone.

In the context of a row, it is only valid as long as no newer insert
is done (thus setting a live row marker; note that if the row
timestamp set is lower than the tombstone's, then the tombstone
remains in effect as usual). If a row has a shadowable tombstone with
timestamp Ti and that row is updated with a timestamp Tj, such that
Tj > Ti (and that update sets the row marker), then the shadowable
tombstone is shadowed by that update. A concrete consequence is that
if the update has cells with timestamp lower than Ti, then those cells
are preserved (since the deletion is removed), and this is contrary to
a regular, non-shadowable row tombstone where the tombstone is
preserved and such cells are removed.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-04-25 11:46:22 +02:00
Duarte Nunes
392403b5b3 row_marker: Mark constructors explicit
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-04-25 11:43:04 +02:00
Paweł Dziepak
582d397c41 introduce counter_write_query()
Counter write path involves read-modify-write. That read is guaranteed
to query only a single partition, does not care about dead cells and
expects to receive an unserialized mutation as a result.

Standard mutation queries can are able to produce results fit for
counter updates, but the logic involved is much more general (i.e.
slower), hence the addition of new, counter-specific kind of query.
2017-03-01 16:33:36 +00:00
Duarte Nunes
7e150a18eb mutation_partition: Introduce shadowable tombstone
This patch introduces shadowable row tombstones. A shadowable row
tombstone is valid only if the row has no live marker. In other words,
the row tombstone is only valid as long as no newer insert is done
(thus setting a live row marker; note that if the row timestamp set
is lower than the tombstone's, then the tombstone remains in effect
as usual).

If a row has a shadowable tombstone with timestamp Ti and that row
is updated with a timestamp Tj, such that Tj > Ti (and that update
sets the row marker), then the shadowable tombstone is shadowed by
that update. A concrete consequence is that if the update has cells
with timestamp lower than Ti, then those cells are preserved (since
the deletion is removed), and this is contrary to a regular,
non-shadowable row tombstone where the tombstone is preserved and
such cells are removed.

Currently, only Materialized Views require shadowable row tombstones,
which solve a problem with view row deletions. Consider a base row with
columns p, v1, v2, PRIMARY KEY (p) denormalized into a view row consisting
of columns p, v1, v2 PRIMARY KEY (p, v1), and the following operations:

1) INSERT INTO base (p, v1, v2) VALUES (0, 0, 1) USING TIMESTAMP 0;
2) UPDATE base SET v1 = 1 USING TIMESTAMP 1 WHERE p = 0;
3) UPDATE base SET v1 = 0 USING TIMESTAMP 2 WHERE p = 0;

Without shadowable tombstones, the view contains:

At 1), pk = (0, 0), row_marker@T0, v2=1@T0
At 2), pk = (0, 0), row_marker@T0, row_tombstone@T1, v2=1@T0
       pk = (0, 1), row_marker@T1, v2=1@T0
At 3), pk = (0, 0), row_marker@T2, row_tombstone@T1, v2=1@T0
       pk = (0, 1), row_marker@T1, row_tombstone@T2, v2=1@T0

Notice how, if we read row (0, 0), the value of v2 will be shadowed by
the row tombstone we previously inserted.

With a view's row tombstone becoming shadowable, at 3) the row (0, 0)
will look like pk = (0, 0), row_marker@T2, shadowable_tombstone@T1, v2=1@T0,
which is equivalent to pk = (0, 0), row_marker@T2, v2=1@T0.

Since the shadowable tombstone is shadowed by the new row marker (T0 <
T2), now v2 would be taken into account.

Finally, note that this patch doesn't generalize the idea of
shadowable tombstone, instead taking advantage of the fact that they
are only needed by Materialized Views. This saves changing the
tombstone representation to account for an extra flag, the bits such
representation would require, and also avoids changes to the storage
format.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:36:45 +01:00
Paweł Dziepak
b6564651e4 mutation_partition: make for_each_cell() accessible outside source file
for_each_cell() const already can be used from any place in the code,
allow the same with non-const version.
2017-02-02 10:35:14 +00:00
Piotr Jastrzebski
15cc8460bd mutation_partition: make rows_entry constructors explicit
All converting constructors should be explicit otherwise they
can create a confusion. I got myself in such a situation when
clustering key got implicitly converted into rows_entry when
I was not expecting it.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <c3f19719760f6dc7cf5e858b9c452506faedf521.1485950529.git.piotr@scylladb.com>
2017-02-01 17:57:50 +01:00
Tomasz Grabiec
ddfee57c97 Replace iostream include with iosfwd in headers
Message-Id: <1484656119-8386-4-git-send-email-tgrabiec@scylladb.com>
2017-01-17 14:52:44 +02:00
Piotr Jastrzebski
041b0a65ac Implement intrusive set using rbtree_algorithms
This new implementation takes less memory because it
does not store comparator.

It also uses tree nodes optimized for size. This means
that instead of storing an enum field |color| they embed
this information inside pointer to parent.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-01-05 11:46:58 +01:00
Piotr Jastrzebski
4bbe05dd47 mutation_partition: take schema in find_row and clustered_row
This will allow intrusive set implementation that does not
store schema.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-01-05 11:26:03 +01:00
Piotr Jastrzebski
fe3c91db90 mutation_partition: Extract intrusive set logic to a class.
It will make it easier to change the implementation
of the intrusive set.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-01-05 11:26:03 +01:00
Avi Kivity
1d9ee358f1 Revert "Merge "Reduce the size of mutation_partition" from Piotr"
This reverts commit aa392810ff, reversing
changes made to a24ff47c637e6a5fd158099b8a65f1191fc2d023; it uses
boost::intrusive::detail directly, which it must not, and doesn't compile on
all boost versions as a consequence.
2016-12-25 16:07:48 +02:00
Piotr Jastrzebski
671affc36c Implement intrusive set using rbtree_algorithms
This new implementation takes less memory because it
does not store comparator.

It also uses tree nodes optimized for size. This means
that instead of storing an enum field |color| they embed
this information inside pointer to parent.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-12-23 11:32:13 +01:00
Piotr Jastrzebski
2af6ff68d9 mutation_partition: take schema in find_row and clustered_row
This will allow intrusive set implementation that does not
store schema.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-12-23 11:29:07 +01:00
Piotr Jastrzebski
b3b924dec9 mutation_partition: Extract intrusive set logic to a class.
It will make it easier to change the implementation
of the intrusive set.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-12-23 11:29:07 +01:00
Paweł Dziepak
ef57b9a26f rename memory_usage() to external_memory_usage() where applicable
Renaming the function to external_memory_usage() makes it clear that
sizeof(T) is not included, something that was a source of confusion in
the past.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-11-18 11:25:36 +00:00
Piotr Jastrzebski
b05b90b3a5 Introduce clustering_key_filter_ranges.
This fixes the problem of multiple concurrent get_ranges calls.
Previously each call was invalidating the result of the previous
call. Now they don't step on each other foot.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-08-30 19:46:38 +02:00
Duarte Nunes
5161ea283f query: query::clustering_range can't wrap around
This patch changes the type of query::clustering_range to express that
ranges that wrap around are not allowed, and ranges that have the
start bound after the end bound are considered empty.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-08-15 14:50:20 +00:00
Paweł Dziepak
27fea7bf2c mutation_partition: add non-cons rows and tombstones accessors
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-07-13 09:50:07 +01:00
Tomasz Grabiec
8c4b5e4283 db: Avoiding checking bloom filters during compaction
Checking bloom filters of sstables to compute max purgeable timestamp
for compaction is expensive in terms of CPU time. We can avoid
calculating it if we're not about to GC any tombstone.

This patch changes compacting functions to accept a function instead
of ready value for max_purgeable.

I verified that bloom filter operations no longer appear on flame
graphs during compaction-heavy workload (without tombstones).

Refs #1322.
2016-07-10 09:54:20 +02:00
Paweł Dziepak
23d0bfd065 mutation_partition: add row::memory_usage()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-07-07 12:17:25 +01:00
Paweł Dziepak
f95c5542dc mutation_partition: allow slicing moved mutation_partition
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:29:51 +01:00
Paweł Dziepak
22160ae6d5 mutation_partition: make rows_type public
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:29:49 +01:00
Paweł Dziepak
847bf878ec mutation_partition: add more row::apply() overloads
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:29:48 +01:00
Duarte Nunes
70083efee2 sstables: Read and write range tombstone bounds
This patch uses the composite_marker to add inclusiveness information
to the prefixes of a range tombstone.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-06-02 16:21:59 +02:00
Duarte Nunes
7628e403a3 sstables: Drop code for tombstone merging
Since Scylla now supports proper range tombstones, the code for
reading ranges from sstables and converting them to overlapping
tombstones is no longer necessary, and is, in fact, wasteful as
the internal representation converts overlapping tombstones back to
ranges.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-06-02 16:21:59 +02:00
Duarte Nunes
91aac30f12 mutations: Row tombstones are now a set of ranges
This patch changes the type of the mutation partition's row_tombstones
to be a range_tombstone_list, so that they are now represented as a
set of disjoint ranges. All of its usages are updated accordingly.

Fixes #1155

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-06-02 16:21:59 +02:00
Piotr Jastrzebski
23c23abe53 Make memtable mutation_reader slice using clustering ranges.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-05-16 11:46:41 +02:00
Tomasz Grabiec
a1539fed95 mutation_partition: Fix reversed trim_rows()
The first erase_and_dispose(), which removes rows between last
position and beginning of the next range, can invalidate end()
iterator of the range. Fix by looking up end after erasing.

mutation_partition::range() was split into lower_bound() and
upper_bound() to allow for that.

This affects for example queries with descending order where the
selected clustering range is empty and falls before all rows.

Exposed by f15c380a4f, which is now
calling do_compact() during query.

Reproduced by dtest paging_test.py:TestPagingData.static_columns_paging_test
2016-04-08 20:53:33 +02:00
Avi Kivity
db03295c8a Merge "Fix query digest mismatch" from Tomasz
"Currently data query digest includes cells and tombstones which may have
expired or be covered by higher-level tombstones. This causes digest
mismatch between replicas if some elements are compacted on one of the
nodes and not on others. This mismatch triggers read-repair which doesn't
resolve because mutations received by mutation queries are not differing,
they are compacted already.

The fix adds compacting step before writing and digesting query results by
reusing the algorithm used by mutation query. This is not the most optimal
way to fix this. The compaction step could be folded with the query writing,
there is redundancy in both steps. However such change carries more risk,
and thus was postponed.

perf_simple_query test (cassandra-stress-like partitions) shows regression
from 83k to 77k (7%) ops/s.

Fixes #1165."
2016-04-08 12:13:29 +03:00
Pekka Enberg
38a54df863 Fix pre-ScyllaDB copyright statements
People keep tripping over the old copyrights and copy-pasting them to
new files. Search and replace "Cloudius Systems" with "ScyllaDB".

Message-Id: <1460013664-25966-1-git-send-email-penberg@scylladb.com>
2016-04-08 08:12:47 +03:00