Commit Graph

186 Commits

Author SHA1 Message Date
Tomasz Grabiec
65ca8eebb8 mutation_partition: Print rows_entry's position instead of key
For dummy rows, _key doesn't reflect the right position.

Message-Id: <1505317040-6783-1-git-send-email-tgrabiec@scylladb.com>
2017-09-13 20:49:28 +03:00
Tomasz Grabiec
455a1b0d24 mutation_partition: Introduce range continuity checking methods 2017-09-13 17:47:04 +02:00
Tomasz Grabiec
b6ae5783cd mvcc: Introduce partition_entry::evict()
The operation frees as much memory as possible, marking affected
mutation elements as discontinuous.
2017-09-13 17:47:03 +02:00
Duarte Nunes
c7aa3ea069 mutation_partition: Remove obsolete short read detection
When compacting a partition for querying we would read an extra row,
to include any tombstones between that one and the previous row.

This is no longer needed since we have a general mechanism to detect
short reads in the storage_proxy.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170811103031.22866-1-duarte@scylladb.com>
2017-08-15 12:01:55 +01:00
Paweł Dziepak
43cce6c2f4 rows_entry: make position() inlineable 2017-07-26 14:38:27 +01:00
Tomasz Grabiec
136d205855 mutation_partition: Always mark static row as continuous when no static columns
To avoid unnecessary cache misses after static columns are added.

Message-Id: <1500650057-26036-1-git-send-email-tgrabiec@scylladb.com>
2017-07-24 10:23:35 +03:00
Tomasz Grabiec
0770845a23 mutation_partition: Introduce r-value accepting deletable_row::apply() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
dce293e11c tests: row_cache: Apply only fully continuous mutations to underlying mutation source
Cache currently assumes that mutations coming from outside are fully
continuous.
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
05b56fcfb0 mutation_partition: Add support for specifying continuity
This will allow expressing lack of information about certain ranges of
rows (including the static row), which will be used in cache to
determine if information in cache is complete or not.

Continuity is represented internally using flags on row entries. The
key range between two consecutive entries is continuous iff
rows_entry::continuous() is true for the later entry. The range
starting after the last entry is assumed to be continuous. The range
corresponding to the key of the entry is continuous iff
rows_entry::dummy() is false.

[tgrabiec:
  - based on the following commits:
     4a5bf75 - Piotr Jastrzebski : mutation_partition: introduce dummy rows_entry
     773070e - Piotr Jastrzebski : mutation_partition: add continuity flag to rows_entry
  - documented that partition tombstone is always complete
  - require specifying the partition tombstone when creating an incomplete entry
  - replaced rows_entry(dummy_tag, ...) constructor with more general
    rows_entry(position_in_partition, ...)
  - documented continuity semantics on mutation_partition
  - fixed _static_row_cached being lost by mutation_partition copy constructors
  - fixed conversion to streamed_mutation to ignore dummy entries
  - fixed mutation_partition serializer to drop dummy entries
  - documented semantics of continuity on mutation_partition level
  - dropped assumptions that dummy entries can be only at the last position
  - changed equality to ignore continuity completely, rather than
    partially (it was not ignoring dummy entries, but ignoring
    continuity flag)
  - added printout of continuity information in mutation_partition
  - fixed handling of empty entries in apply_reversibly() with regards
    to continuity; we no longer can remove empty entries before
    merging, since that may affect continuity of the right-hand
    mutation. Added _erased flag.
  - fixed mutation_partition::clustered_row() with dummy==true to not ignore the key
  - fixed partition_builder to not ignore continuity
  - renamed dummy_tag_t to dummy_tag. _t suffix is reserved.
  - standardized all APIs on is_dummy and is_continuous bool_class:es
  - replaced add_dummy_entry() with ensure_last_dummy() with safer semantics
  - dropped unused remove_dummy_entry()
  - simplified and inlined cache_entry::add_dummy_entry()
  - fixed mutation_partition(incomplete_tag) constructor to mark all row ranges as discontinuous
  ]
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
65b3123516 mutation_partition: Use rows_entry::position() in comparators
key() will not be valid for dummy entries, but position() is always
valid.

[tgrabiec: Extracted from other commits]
[tgrabiec: Added missing change to range_tombstone_stream::get_next]
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
660f3127a6 mutation_partition: Introduce rows_entry::position()
In preparation for enabling dummy entries with postion past all
clustering rows.
2017-06-24 18:06:11 +02:00
Paweł Dziepak
b2b78158f6 mutation_partition: restore formatting
No functional change.

Message-Id: <20170526104119.22075-2-pdziepak@scylladb.com>
2017-06-06 11:20:57 +03:00
Paweł Dziepak
d9dd798c4f counter_write_query: avoid use-after-free on partition range
Message-Id: <20170526104119.22075-1-pdziepak@scylladb.com>
2017-05-28 11:41:30 +03:00
Tomasz Grabiec
804f46f684 mutation: Make compare_*_for_merge() consistent with equals()
equals() considers expiring cells to be different form non-expiring cells,
but compare_row_marker_for_merge() considers them equal. Fix the latter to
pick expiring cells. The choice was arbitrary.
2017-05-23 13:35:03 +02:00
Duarte Nunes
9e88b60ef5 mutation: Set cell using clustering_key_prefix
Change the clustering key argument in mutation::set_cell from
exploded_clustering_prefix to clustering_key_prefix, which allows for
some overall code simplification and fewer copies. This mostly affects
the cql3 layer.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-04 15:59:50 +02:00
Duarte Nunes
db63ffdbb4 mutation_partition: Harmonize apply_delete overloads
This patch ensures the different mutation_partition::apply_delete()
overloads behave similarly, so that, for example, an empty clustering
key is treated the same way as an empty
exploded_clustering_key_prefix.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-04 15:59:50 +02:00
Duarte Nunes
4e693383f7 mutation_partion: Use row_tombstone
This patch replaces the current row tombstone representation by a
row_tombstone.

The intent of the patch is thus to reify the idea of shadowable
tombstones, that up until now we considered all materialized view row
tombstones to be.

We need to distinguish shadowable from non-shadowable row tombstones
to support scenarios such as, when inserting to a table with a
materialzied view:

1. insert into base (p, v1, v2) values (3, 1, 3) using timestamp 1
2. delete from base using timestamp 2 where p = 3
3. insert into base (p, v1) values (3, 1) using timestamp 3

These should yield a view row where v2 is definitely null, but with
the current implementation, v2 will pop back with its value v2=3@TS=1,
even though its dead in the base row. This is because the row
tombstone inserted at 2) is a shadowable one.

This patch only addresses the memory representation of such
row_tombstones.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-04-25 11:46:33 +02:00
Duarte Nunes
392403b5b3 row_marker: Mark constructors explicit
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-04-25 11:43:04 +02:00
Raphael S. Carvalho
a6f8f4fe24 compaction: do not write expired cell as dead cell if it can be purged right away
When compacting a fully expired sstable, we're not allowing that sstable
to be purged because expired cell is *unconditionally* converted into a
dead cell. Why not check if the expired cell can be purged instead using
gc before and max purgeable timestamp?

Currently, we need two compactions to get rid of a fully expired sstable
which cells could have always been purged.

look at this sstable with expired cell:
  {
    "partition" : {
      "key" : [ "2" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 120,
        "liveness_info" : { "tstamp" : "2017-04-09T17:07:12.702597Z",
"ttl" : 20, "expires_at" : "2017-04-09T17:07:32Z", "expired" : true },
        "cells" : [
          { "name" : "country", "value" : "1" },
        ]

now this sstable data after first compaction:
[shard 0] compaction - Compacted 1 sstables to [...]. 120 bytes to 79
(~65% of original) in 229ms = 0.000328997MB/s.

  {
    ...
    "rows" : [
      {
        "type" : "row",
        "position" : 79,
        "cells" : [
          { "name" : "country", "deletion_info" :
{ "local_delete_time" : "2017-04-09T17:07:12Z" },
            "tstamp" : "2017-04-09T17:07:12.702597Z"
          },
        ]

now another compaction will actually get rid of data:
compaction - Compacted 1 sstables to []. 79 bytes to 0 (~0% of original)
in 1ms = 0MB/s. ~2 total partitions merged to 0

NOTE:
It's a waste of time to wait for second compaction because the expired
cell could have been purged at first compaction because it satisfied
gc_before and max purgeable timestamp.

Fixes #2249, #2253

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170413001049.9663-1-raphaelsc@scylladb.com>
2017-04-13 10:59:19 +03:00
Avi Kivity
27c42359bc Merge seastar upstream
* seastar 6b21197...2ebe842 (6):
  > Merge "Various improvements to execution stages" from Paweł
  > app-template: allow apps to specify a name for help message
  > bool_class: avoid initializing object of incomplete type
  > app-template: make sure we can still get help with required options
  > prometheus: Http handler that returns prometheus 0.4 protobuf or text format
  > Update DPDK to 17.02

Includes patch from Pawel to adjust to updated execution_stage interface.
2017-03-26 10:50:21 +03:00
Paweł Dziepak
a78501c206 mutation_query: add an execution stage 2017-03-09 09:27:43 +00:00
Avi Kivity
439b38f5ab Merge "Improvements to counter implementation" from Paweł
"This series adds various optimisations to counter implementation
(nothing extreme, mostly just avoiding unnecessary operations) as well
as some missing features such as tracing and dropping timed out queries.

Performance was tested using:
perf-simple-query -c4 --counters --duration 60

The following results are medians.
          before       after      diff
write   18640.41    33156.81    +77.9%
read    58002.32    62733.93     +8.2%"

* tag 'pdziepak/optimise-counters/v3' of github.com:cloudius-systems/seastar-dev: (30 commits)
  cell_locker: add metrics for lock acquisition
  storage_proxy: count counter updates for which the node was a leader
  storage_proxy: use counter-specific timeout for writes
  storage_proxy: transform counter timeouts to mutation_write_timeout_exception
  db: avoid allocations in do_apply_counter_update()
  tests/counters: add test for apply reversability
  counters: attempt to apply in place
  atomic_cell: add COUNTER_IN_PLACE_REVERT flag
  counters: add equality operators
  counters: implement decrement operators for shard_iterator
  counters: allow using both views and mutable_views
  atomic_cell: introduce atomic_cell_mutable_view
  managed_bytes: add cast to mutable_view
  bytes: add bytes_mutable_view
  utils: introduce mutable_view
  db: add more tracing events for counter writes
  db: propagate tracing state for counter writes
  tests/cell_locker: add test for timing out lock acquisition
  counter_cell_locker: allow setting timeouts
  db: propagate timeout for counter writes
  ...
2017-03-07 11:48:13 +02:00
Tomasz Grabiec
4b6e77e97e db: Fix overflow of gc_clock time point
If query_time is time_point::min(), which is used by
to_data_query_result(), the result of subtraction of
gc_grace_seconds() from query_time will overflow.

I don't think this bug would currently have user-perceivable
effects. This affects which tombstones are dropped, but in case of
to_data_query_result() uses, tombstones are not present in the final
data query result, and mutation_partition::do_compact() takes
tombstones into consideration while compacting before expiring them.

Fixes the following UBSAN report:

  /usr/include/c++/5.3.1/chrono:399:55: runtime error: signed integer overflow: -2147483648 - 604800 cannot be represented in type 'int'

Message-Id: <1488385429-14276-1-git-send-email-tgrabiec@scylladb.com>
2017-03-01 18:49:56 +02:00
Paweł Dziepak
582d397c41 introduce counter_write_query()
Counter write path involves read-modify-write. That read is guaranteed
to query only a single partition, does not care about dead cells and
expects to receive an unserialized mutation as a result.

Standard mutation queries can are able to produce results fit for
counter updates, but the logic involved is much more general (i.e.
slower), hence the addition of new, counter-specific kind of query.
2017-03-01 16:33:36 +00:00
Tomasz Grabiec
f46ae8128d database: Fix mutation_source created by as_mutation_source() to not ignore trace_state_ptr
It was using the state passed via as_mutation_source() instead. Let's
respect mutation_source contract instead, and use the state passed via
mutation_source invocation.

Technically just a cleanup. Alse prerequisite for more cleanup.
2017-02-23 18:23:52 +01:00
Tomasz Grabiec
2489a0f82e mutation_partition: Drop unneeded range tombstones
Fixes #1254.
2017-02-13 16:12:16 +01:00
Tomasz Grabiec
884858078a mutation_partition: Simplify row removal 2017-02-13 16:12:15 +01:00
Duarte Nunes
7e150a18eb mutation_partition: Introduce shadowable tombstone
This patch introduces shadowable row tombstones. A shadowable row
tombstone is valid only if the row has no live marker. In other words,
the row tombstone is only valid as long as no newer insert is done
(thus setting a live row marker; note that if the row timestamp set
is lower than the tombstone's, then the tombstone remains in effect
as usual).

If a row has a shadowable tombstone with timestamp Ti and that row
is updated with a timestamp Tj, such that Tj > Ti (and that update
sets the row marker), then the shadowable tombstone is shadowed by
that update. A concrete consequence is that if the update has cells
with timestamp lower than Ti, then those cells are preserved (since
the deletion is removed), and this is contrary to a regular,
non-shadowable row tombstone where the tombstone is preserved and
such cells are removed.

Currently, only Materialized Views require shadowable row tombstones,
which solve a problem with view row deletions. Consider a base row with
columns p, v1, v2, PRIMARY KEY (p) denormalized into a view row consisting
of columns p, v1, v2 PRIMARY KEY (p, v1), and the following operations:

1) INSERT INTO base (p, v1, v2) VALUES (0, 0, 1) USING TIMESTAMP 0;
2) UPDATE base SET v1 = 1 USING TIMESTAMP 1 WHERE p = 0;
3) UPDATE base SET v1 = 0 USING TIMESTAMP 2 WHERE p = 0;

Without shadowable tombstones, the view contains:

At 1), pk = (0, 0), row_marker@T0, v2=1@T0
At 2), pk = (0, 0), row_marker@T0, row_tombstone@T1, v2=1@T0
       pk = (0, 1), row_marker@T1, v2=1@T0
At 3), pk = (0, 0), row_marker@T2, row_tombstone@T1, v2=1@T0
       pk = (0, 1), row_marker@T1, row_tombstone@T2, v2=1@T0

Notice how, if we read row (0, 0), the value of v2 will be shadowed by
the row tombstone we previously inserted.

With a view's row tombstone becoming shadowable, at 3) the row (0, 0)
will look like pk = (0, 0), row_marker@T2, shadowable_tombstone@T1, v2=1@T0,
which is equivalent to pk = (0, 0), row_marker@T2, v2=1@T0.

Since the shadowable tombstone is shadowed by the new row marker (T0 <
T2), now v2 would be taken into account.

Finally, note that this patch doesn't generalize the idea of
shadowable tombstone, instead taking advantage of the fact that they
are only needed by Materialized Views. This saves changing the
tombstone representation to account for an extra flag, the bits such
representation would require, and also avoids changes to the storage
format.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-02-06 13:36:45 +01:00
Paweł Dziepak
b6564651e4 mutation_partition: make for_each_cell() accessible outside source file
for_each_cell() const already can be used from any place in the code,
allow the same with non-const version.
2017-02-02 10:35:14 +00:00
Paweł Dziepak
0c93d01232 atomic_cell: make sure upper level tombstones cover counters
Support for deletion of counters is limited in a way that once deleted
they cannot be used again (i.e. tombstone always wins, regardless of the
timestamp). Logic responsible for merging two counter cells already
makes sure that tombstones are handled properly, but it is also
necessary to ensure that higher level tombstones always cover counters.
2017-02-02 10:35:14 +00:00
Paweł Dziepak
47d14906e6 mutation_partition: support querying counter cells 2017-02-02 10:35:14 +00:00
Paweł Dziepak
63f25eb12c mutation_hasher: handle counter cells properly 2017-02-02 10:35:14 +00:00
Paweł Dziepak
a57e86cc37 mutation_partition: compute counter difference 2017-02-02 10:35:13 +00:00
Paweł Dziepak
2725a4945d mutation_partition: apply counter cells properly 2017-02-02 10:35:13 +00:00
Piotr Jastrzebski
041b0a65ac Implement intrusive set using rbtree_algorithms
This new implementation takes less memory because it
does not store comparator.

It also uses tree nodes optimized for size. This means
that instead of storing an enum field |color| they embed
this information inside pointer to parent.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-01-05 11:46:58 +01:00
Piotr Jastrzebski
a0c20f5c49 mutation_partition: make apply_reversibly_intrusive_set nongeneric
apply_reversibly_intrusive_set is used only in one place
and always with rows_type. There's no need for it to be generic.
This will allow changing intrusive set implementation.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-01-05 11:26:03 +01:00
Piotr Jastrzebski
4bbe05dd47 mutation_partition: take schema in find_row and clustered_row
This will allow intrusive set implementation that does not
store schema.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-01-05 11:26:03 +01:00
Piotr Jastrzebski
fe3c91db90 mutation_partition: Extract intrusive set logic to a class.
It will make it easier to change the implementation
of the intrusive set.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-01-05 11:26:03 +01:00
Piotr Jastrzebski
da67ac7ae4 mutation_partition: Replace value_comp with key_comp calls
This will reduce the size of bi::set API being used.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-01-05 11:26:03 +01:00
Paweł Dziepak
40176ca2f8 mutation_partition: use result limiter for digest reads
Even if we are performing a digest query we should do proper result
memory accounting so that the result ends exactly in the same place that
it would if it was a data query. This is to avoid digest mismatches
between replicas.
2016-12-22 17:16:23 +01:00
Paweł Dziepak
38ee69dee0 idl: allow writers to use any output stream
Original IDL generated code was hardcoded to always use bytes_ostream.
This patch makes the output stream a template parameter so that any
valid output stream can be used.
Unfortunately, making IDL writers generic requires updates in the code
that uses them, this is fixed in C++17 which would be able to deduce the
parameter in most cases.
2016-12-22 13:35:04 +01:00
Paweł Dziepak
1c7cade559 mutation_partition: honour allowed_short_read for static rows 2016-12-22 13:35:04 +01:00
Piotr Jastrzebski
3e502de153 mutation_partition: don't use unique_ptr to manage LSA objects
Unique_ptr won't destruct them correctly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <5b49bb25a962432a178fe75554dd010c3cdea41d.1482261888.git.piotr@scylladb.com>
2016-12-21 09:40:15 +01:00
Asias He
e5485f3ea6 Get rid of query::partition_range
Use dht::partition_range instead
2016-12-19 08:09:25 +08:00
Duarte Nunes
781cd82cb8 column_family: Use counters in query::result::builder
This patch changes column_family::query() to use the counters in the
builder to determine how many partitions and rows to ask for and also
to implement the stop condition. This saves a continuation to do the
bookkeeping, and allows us to remove data_query_result.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 10:27:46 +00:00
Duarte Nunes
05b2ef4fa2 query_result_builder: Use the underlying counters
This patch changes the query_result_builder to use the counters
provided by the query::result::builder. It also ensures they are kept
current.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 10:27:46 +00:00
Duarte Nunes
f5cf7f7921 mutation_partition: Count partitions in query_compacted
This patch changes mutation_partition::query_compacted() to count the
number of partitions written to the underlying writer.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 10:27:46 +00:00
Duarte Nunes
f21dfb8217 mutation_partition: Remove tabs in query_compacted
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-12-15 10:27:46 +00:00
Paweł Dziepak
ba51e7e8db data_query: limit result size
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00
Paweł Dziepak
f1b9f49f2b mutation_query: limit result size
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-12-14 14:10:02 +00:00