Commit Graph

212 Commits

Author SHA1 Message Date
Duarte Nunes
83e983d4d0 mutation_partition: Remove unused operator==()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180115013546.67260-1-duarte@scylladb.com>
2018-01-15 11:16:35 +02:00
Duarte Nunes
9d1d9883ff mutation_partition: Remove unused for_each_cell() overload
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180115013618.67351-1-duarte@scylladb.com>
2018-01-15 11:16:34 +02:00
Glauber Costa
54d3ebde4e flat_mutation_reader: pass timeout down to consume()
We pass the timeout that we received from data_query/mutation_query
down to consume, which is responsible for actually reading the data.

To make those timeouts actionable, though, we'll have to patch
fill_buffer(). This will happen in the next patch.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-11 12:07:41 -05:00
Glauber Costa
8433702c90 mutation_query: add a timeout to the mutation query path
data_query and mutation_query are patched so that they start accepting a
per-query timeout. We will default to no timeout, and then no callers
will be changed yet.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-11 12:07:41 -05:00
Tomasz Grabiec
8e8ece5dec mutation_partition: Introduce deletable_row::apply() from a clustering_row fragment 2017-12-08 17:50:47 +01:00
Tomasz Grabiec
b3709047b0 mutation_partition: Extract sliced() from mutation into mutation_partition
So that we can call it on mutation_partition.
2017-12-08 17:50:47 +01:00
Tomasz Grabiec
5541c9fd63 mutation_partition: Define equal_continuity() using get_continuity()
This fixes the problem of equal_continuity() being prone to false
positives due to redundant information (extra dummy rows) present in
one of the partitions. get_continuity() is minified, so is not prone
to this.
2017-12-08 12:01:27 +01:00
Tomasz Grabiec
bde050835f mutation_partition: Make check_continuity() const-qualified 2017-12-08 12:01:27 +01:00
Tomasz Grabiec
865bd8a594 mutation_partition: Introduce mutation_partition::get_continuity()
Intended to be used in tests.
2017-12-08 12:01:27 +01:00
Tomasz Grabiec
22138554e6 mutation_partition: Leave moved-from row in an empty state
Needed by apply_monotonically(). Fixes SIGSEGV in mutation_test_g.
2017-12-08 12:01:27 +01:00
Tomasz Grabiec
a305a28574 mutation_partition: Fix upgrade() not preserving static row continuity
We do not rely on this yet, but will.
2017-12-08 12:01:27 +01:00
Tomasz Grabiec
05a6c67804 mutation_partition: Don't print absent elements
Makes printout shorter and thus easier to parse.
2017-12-01 10:52:37 +01:00
Tomasz Grabiec
d8b54a57aa mutation_partition: Make row_marker printout similar to other partition elements 2017-12-01 10:52:37 +01:00
Tomasz Grabiec
fd7ab5fe99 database: Move operator<<() overloads to appropriate source files 2017-12-01 10:52:37 +01:00
Tomasz Grabiec
7bde3090b4 mutation_partition: Use multi-line printout
Convert to a multi line output, which is easier to read for a human.

After:

{ks.cf key {key: pk{000c706b30303030303030303030}, token:-2018791535786252460} data {mutation_partition: {tombstone: none},
 range_tombstones: {},
 static: cont=1 {row: },
 clustered: {
    {rows_entry: cont=true dummy=false {position: clustered,ckp{000c636b30303030303030303030},0} {deletable_row: {row: }}},
    {rows_entry: cont=true dummy=true {position: clustered,ckp{000c636b30303030303030303031},0} {deletable_row: {row: }}}}}}
2017-12-01 10:52:37 +01:00
Tomasz Grabiec
70e14f78a7 mutation_partition: Drop apply_reversibly() 2017-11-28 13:03:06 +01:00
Tomasz Grabiec
091e10fc70 mutation_partition: Relax exception guarantees of apply()
The uses which needed strong or weak exception guarantees were
switched to a solution involving apply_monotonically(). All remaining
uses don't need any exception guarantees.
2017-11-28 13:03:06 +01:00
Tomasz Grabiec
988d3c67b4 mutation_partition: Introduce apply_weak()
Intended to be used by code which doesn't need any exception
guarantees.  Currently just delegates to apply_monotonically().
2017-11-28 13:03:03 +01:00
Tomasz Grabiec
97ebf51d3a mutation_partition: Introduce apply_monotonically()
Has weaker exception guarantees than apply(), which allows for simpler
implementation. Intended to replace the apply() with strong exception
guarantees.
2017-11-28 12:28:51 +01:00
Tomasz Grabiec
978b874065 mutation_partition: Introduce row::consume_with() 2017-11-28 11:20:03 +01:00
Paweł Dziepak
48c3db54c9 mutation_partition: convert queries to flat_mutation_readers 2017-11-21 11:37:04 +00:00
Glauber Costa
d49ecae201 mutation_partition: estimate size of partition
In the memtable flusher, we account for the size of a partition as we
read them. However, there are other points in the architecture where we
would like to calculate the size of a partition in a point in which we
are not reading it. One such example is the cache update process.

This patch enhances the mutation_partition adding a method that returns
the total size for this partition.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-11-08 16:21:44 -05:00
Tomasz Grabiec
ca3e72266f mutation_partition: Fix abort in case range tombstone copying fails
If exception is thrown from _row_tombstones.apply(), _rows will be
left uncleared. This will trigger assertion in bi::set_member_hook
destructor, which assrts that the hook is not linked.

Always clear _rows.
2017-11-07 15:33:24 +01:00
Tomasz Grabiec
749f5770df mutation: Introduce apply(mutation_fragment) 2017-11-02 12:16:17 +01:00
Tomasz Grabiec
72028bb048 mutation_partition: Allow creating rows_entry at any clustered position_in_partition
In preparation for supporting setting continuity of arbitrary clustering range.
2017-11-02 11:05:19 +01:00
Tomasz Grabiec
409adc045a mutation_partition: Remove delegating_compare()
It can't work with rows_entry at any position_in_partition,
so we need to drop it.
2017-11-02 11:05:19 +01:00
Tomasz Grabiec
65ca8eebb8 mutation_partition: Print rows_entry's position instead of key
For dummy rows, _key doesn't reflect the right position.

Message-Id: <1505317040-6783-1-git-send-email-tgrabiec@scylladb.com>
2017-09-13 20:49:28 +03:00
Tomasz Grabiec
455a1b0d24 mutation_partition: Introduce range continuity checking methods 2017-09-13 17:47:04 +02:00
Tomasz Grabiec
b6ae5783cd mvcc: Introduce partition_entry::evict()
The operation frees as much memory as possible, marking affected
mutation elements as discontinuous.
2017-09-13 17:47:03 +02:00
Duarte Nunes
c7aa3ea069 mutation_partition: Remove obsolete short read detection
When compacting a partition for querying we would read an extra row,
to include any tombstones between that one and the previous row.

This is no longer needed since we have a general mechanism to detect
short reads in the storage_proxy.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170811103031.22866-1-duarte@scylladb.com>
2017-08-15 12:01:55 +01:00
Paweł Dziepak
43cce6c2f4 rows_entry: make position() inlineable 2017-07-26 14:38:27 +01:00
Tomasz Grabiec
136d205855 mutation_partition: Always mark static row as continuous when no static columns
To avoid unnecessary cache misses after static columns are added.

Message-Id: <1500650057-26036-1-git-send-email-tgrabiec@scylladb.com>
2017-07-24 10:23:35 +03:00
Tomasz Grabiec
0770845a23 mutation_partition: Introduce r-value accepting deletable_row::apply() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
dce293e11c tests: row_cache: Apply only fully continuous mutations to underlying mutation source
Cache currently assumes that mutations coming from outside are fully
continuous.
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
05b56fcfb0 mutation_partition: Add support for specifying continuity
This will allow expressing lack of information about certain ranges of
rows (including the static row), which will be used in cache to
determine if information in cache is complete or not.

Continuity is represented internally using flags on row entries. The
key range between two consecutive entries is continuous iff
rows_entry::continuous() is true for the later entry. The range
starting after the last entry is assumed to be continuous. The range
corresponding to the key of the entry is continuous iff
rows_entry::dummy() is false.

[tgrabiec:
  - based on the following commits:
     4a5bf75 - Piotr Jastrzebski : mutation_partition: introduce dummy rows_entry
     773070e - Piotr Jastrzebski : mutation_partition: add continuity flag to rows_entry
  - documented that partition tombstone is always complete
  - require specifying the partition tombstone when creating an incomplete entry
  - replaced rows_entry(dummy_tag, ...) constructor with more general
    rows_entry(position_in_partition, ...)
  - documented continuity semantics on mutation_partition
  - fixed _static_row_cached being lost by mutation_partition copy constructors
  - fixed conversion to streamed_mutation to ignore dummy entries
  - fixed mutation_partition serializer to drop dummy entries
  - documented semantics of continuity on mutation_partition level
  - dropped assumptions that dummy entries can be only at the last position
  - changed equality to ignore continuity completely, rather than
    partially (it was not ignoring dummy entries, but ignoring
    continuity flag)
  - added printout of continuity information in mutation_partition
  - fixed handling of empty entries in apply_reversibly() with regards
    to continuity; we no longer can remove empty entries before
    merging, since that may affect continuity of the right-hand
    mutation. Added _erased flag.
  - fixed mutation_partition::clustered_row() with dummy==true to not ignore the key
  - fixed partition_builder to not ignore continuity
  - renamed dummy_tag_t to dummy_tag. _t suffix is reserved.
  - standardized all APIs on is_dummy and is_continuous bool_class:es
  - replaced add_dummy_entry() with ensure_last_dummy() with safer semantics
  - dropped unused remove_dummy_entry()
  - simplified and inlined cache_entry::add_dummy_entry()
  - fixed mutation_partition(incomplete_tag) constructor to mark all row ranges as discontinuous
  ]
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
65b3123516 mutation_partition: Use rows_entry::position() in comparators
key() will not be valid for dummy entries, but position() is always
valid.

[tgrabiec: Extracted from other commits]
[tgrabiec: Added missing change to range_tombstone_stream::get_next]
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
660f3127a6 mutation_partition: Introduce rows_entry::position()
In preparation for enabling dummy entries with postion past all
clustering rows.
2017-06-24 18:06:11 +02:00
Paweł Dziepak
b2b78158f6 mutation_partition: restore formatting
No functional change.

Message-Id: <20170526104119.22075-2-pdziepak@scylladb.com>
2017-06-06 11:20:57 +03:00
Paweł Dziepak
d9dd798c4f counter_write_query: avoid use-after-free on partition range
Message-Id: <20170526104119.22075-1-pdziepak@scylladb.com>
2017-05-28 11:41:30 +03:00
Tomasz Grabiec
804f46f684 mutation: Make compare_*_for_merge() consistent with equals()
equals() considers expiring cells to be different form non-expiring cells,
but compare_row_marker_for_merge() considers them equal. Fix the latter to
pick expiring cells. The choice was arbitrary.
2017-05-23 13:35:03 +02:00
Duarte Nunes
9e88b60ef5 mutation: Set cell using clustering_key_prefix
Change the clustering key argument in mutation::set_cell from
exploded_clustering_prefix to clustering_key_prefix, which allows for
some overall code simplification and fewer copies. This mostly affects
the cql3 layer.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-04 15:59:50 +02:00
Duarte Nunes
db63ffdbb4 mutation_partition: Harmonize apply_delete overloads
This patch ensures the different mutation_partition::apply_delete()
overloads behave similarly, so that, for example, an empty clustering
key is treated the same way as an empty
exploded_clustering_key_prefix.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-05-04 15:59:50 +02:00
Duarte Nunes
4e693383f7 mutation_partion: Use row_tombstone
This patch replaces the current row tombstone representation by a
row_tombstone.

The intent of the patch is thus to reify the idea of shadowable
tombstones, that up until now we considered all materialized view row
tombstones to be.

We need to distinguish shadowable from non-shadowable row tombstones
to support scenarios such as, when inserting to a table with a
materialzied view:

1. insert into base (p, v1, v2) values (3, 1, 3) using timestamp 1
2. delete from base using timestamp 2 where p = 3
3. insert into base (p, v1) values (3, 1) using timestamp 3

These should yield a view row where v2 is definitely null, but with
the current implementation, v2 will pop back with its value v2=3@TS=1,
even though its dead in the base row. This is because the row
tombstone inserted at 2) is a shadowable one.

This patch only addresses the memory representation of such
row_tombstones.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-04-25 11:46:33 +02:00
Duarte Nunes
392403b5b3 row_marker: Mark constructors explicit
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-04-25 11:43:04 +02:00
Raphael S. Carvalho
a6f8f4fe24 compaction: do not write expired cell as dead cell if it can be purged right away
When compacting a fully expired sstable, we're not allowing that sstable
to be purged because expired cell is *unconditionally* converted into a
dead cell. Why not check if the expired cell can be purged instead using
gc before and max purgeable timestamp?

Currently, we need two compactions to get rid of a fully expired sstable
which cells could have always been purged.

look at this sstable with expired cell:
  {
    "partition" : {
      "key" : [ "2" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 120,
        "liveness_info" : { "tstamp" : "2017-04-09T17:07:12.702597Z",
"ttl" : 20, "expires_at" : "2017-04-09T17:07:32Z", "expired" : true },
        "cells" : [
          { "name" : "country", "value" : "1" },
        ]

now this sstable data after first compaction:
[shard 0] compaction - Compacted 1 sstables to [...]. 120 bytes to 79
(~65% of original) in 229ms = 0.000328997MB/s.

  {
    ...
    "rows" : [
      {
        "type" : "row",
        "position" : 79,
        "cells" : [
          { "name" : "country", "deletion_info" :
{ "local_delete_time" : "2017-04-09T17:07:12Z" },
            "tstamp" : "2017-04-09T17:07:12.702597Z"
          },
        ]

now another compaction will actually get rid of data:
compaction - Compacted 1 sstables to []. 79 bytes to 0 (~0% of original)
in 1ms = 0MB/s. ~2 total partitions merged to 0

NOTE:
It's a waste of time to wait for second compaction because the expired
cell could have been purged at first compaction because it satisfied
gc_before and max purgeable timestamp.

Fixes #2249, #2253

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170413001049.9663-1-raphaelsc@scylladb.com>
2017-04-13 10:59:19 +03:00
Avi Kivity
27c42359bc Merge seastar upstream
* seastar 6b21197...2ebe842 (6):
  > Merge "Various improvements to execution stages" from Paweł
  > app-template: allow apps to specify a name for help message
  > bool_class: avoid initializing object of incomplete type
  > app-template: make sure we can still get help with required options
  > prometheus: Http handler that returns prometheus 0.4 protobuf or text format
  > Update DPDK to 17.02

Includes patch from Pawel to adjust to updated execution_stage interface.
2017-03-26 10:50:21 +03:00
Paweł Dziepak
a78501c206 mutation_query: add an execution stage 2017-03-09 09:27:43 +00:00
Avi Kivity
439b38f5ab Merge "Improvements to counter implementation" from Paweł
"This series adds various optimisations to counter implementation
(nothing extreme, mostly just avoiding unnecessary operations) as well
as some missing features such as tracing and dropping timed out queries.

Performance was tested using:
perf-simple-query -c4 --counters --duration 60

The following results are medians.
          before       after      diff
write   18640.41    33156.81    +77.9%
read    58002.32    62733.93     +8.2%"

* tag 'pdziepak/optimise-counters/v3' of github.com:cloudius-systems/seastar-dev: (30 commits)
  cell_locker: add metrics for lock acquisition
  storage_proxy: count counter updates for which the node was a leader
  storage_proxy: use counter-specific timeout for writes
  storage_proxy: transform counter timeouts to mutation_write_timeout_exception
  db: avoid allocations in do_apply_counter_update()
  tests/counters: add test for apply reversability
  counters: attempt to apply in place
  atomic_cell: add COUNTER_IN_PLACE_REVERT flag
  counters: add equality operators
  counters: implement decrement operators for shard_iterator
  counters: allow using both views and mutable_views
  atomic_cell: introduce atomic_cell_mutable_view
  managed_bytes: add cast to mutable_view
  bytes: add bytes_mutable_view
  utils: introduce mutable_view
  db: add more tracing events for counter writes
  db: propagate tracing state for counter writes
  tests/cell_locker: add test for timing out lock acquisition
  counter_cell_locker: allow setting timeouts
  db: propagate timeout for counter writes
  ...
2017-03-07 11:48:13 +02:00
Tomasz Grabiec
4b6e77e97e db: Fix overflow of gc_clock time point
If query_time is time_point::min(), which is used by
to_data_query_result(), the result of subtraction of
gc_grace_seconds() from query_time will overflow.

I don't think this bug would currently have user-perceivable
effects. This affects which tombstones are dropped, but in case of
to_data_query_result() uses, tombstones are not present in the final
data query result, and mutation_partition::do_compact() takes
tombstones into consideration while compacting before expiring them.

Fixes the following UBSAN report:

  /usr/include/c++/5.3.1/chrono:399:55: runtime error: signed integer overflow: -2147483648 - 604800 cannot be represented in type 'int'

Message-Id: <1488385429-14276-1-git-send-email-tgrabiec@scylladb.com>
2017-03-01 18:49:56 +02:00
Paweł Dziepak
582d397c41 introduce counter_write_query()
Counter write path involves read-modify-write. That read is guaranteed
to query only a single partition, does not care about dead cells and
expects to receive an unserialized mutation as a result.

Standard mutation queries can are able to produce results fit for
counter updates, but the logic involved is much more general (i.e.
slower), hence the addition of new, counter-specific kind of query.
2017-03-01 16:33:36 +00:00