Commit Graph

13877 Commits

Author SHA1 Message Date
Tomasz Grabiec
b26ce36d4b mvcc: Introduce partition_snapshot::static_row_continuous() 2017-12-08 17:50:47 +01:00
Tomasz Grabiec
c283744fcb mvcc: Introduce partition_snapshot::range_tombstones() for full range 2017-12-08 17:50:47 +01:00
Tomasz Grabiec
df964c70f8 mvcc: Don't require external schema in parition_snapshot::range_tombstones() 2017-12-08 17:50:47 +01:00
Tomasz Grabiec
5541c9fd63 mutation_partition: Define equal_continuity() using get_continuity()
This fixes the problem of equal_continuity() being prone to false
positives due to redundant information (extra dummy rows) present in
one of the partitions. get_continuity() is minified, so is not prone
to this.
2017-12-08 12:01:27 +01:00
Tomasz Grabiec
bde050835f mutation_partition: Make check_continuity() const-qualified 2017-12-08 12:01:27 +01:00
Tomasz Grabiec
f9257886cb mutation_partition: Make check_continuity() public 2017-12-08 12:01:27 +01:00
Tomasz Grabiec
865bd8a594 mutation_partition: Introduce mutation_partition::get_continuity()
Intended to be used in tests.
2017-12-08 12:01:27 +01:00
Tomasz Grabiec
7e5d243a95 Introduce clustering_interval_set
Will make it easy to represent and manipulate continuity in tests.

Could also replace clustering_row_ranges in the future, which is
currently a naked vector<> with no semantic methods.
2017-12-08 12:01:27 +01:00
Tomasz Grabiec
22138554e6 mutation_partition: Leave moved-from row in an empty state
Needed by apply_monotonically(). Fixes SIGSEGV in mutation_test_g.
2017-12-08 12:01:27 +01:00
Tomasz Grabiec
a305a28574 mutation_partition: Fix upgrade() not preserving static row continuity
We do not rely on this yet, but will.
2017-12-08 12:01:27 +01:00
Paweł Dziepak
051cbbc9af Merge "Fix range tombstone emitting which led to skipping over data" from Tomasz
"Fixes cache reader to not skip over data in some cases involving overlapping
range tombstones in different partition versions and discontinuous cache.

Introduced in 2.0

Fixes #3053."

* tag 'tgrabiec/fix-range-tombstone-slicing-v2' of github.com:scylladb/seastar-dev:
  tests: row_cache: Add reproducer for issue #3053
  tests: mvcc: Add test for partition_snapshot::range_tombstones()
  mvcc: Optimize partition_snapshot::range_tombstones() for single version case
  mvcc: Fix partition_snapshot::range_tombstones()
  tests: random_mutation_generator: Do not emit dummy entries at clustering row positions
2017-12-08 10:27:17 +00:00
Tomasz Grabiec
4cc4c661f3 tests: row_cache: Add reproducer for issue #3053
The issue is that partition_snapshot::range_tombstones() is
deoverlapping tombstones coming from different versions, and it may
happen that due to range tombstone splitting that function will return
a tombstone which starts after the requested range. This breaks
assumptions made by the cache reader. It keeps track of the maximum
fragment position, and if cache reader will then need to read from
sstables due to a miss, it would do so starting from the position
marked by that out of range tombstone, possibly skipping over some
rows.
2017-12-08 10:15:58 +01:00
Tomasz Grabiec
b6f4637aec tests: mvcc: Add test for partition_snapshot::range_tombstones() 2017-12-08 10:15:58 +01:00
Tomasz Grabiec
183554cbc4 mvcc: Optimize partition_snapshot::range_tombstones() for single version case 2017-12-08 10:15:58 +01:00
Tomasz Grabiec
1303320377 mvcc: Fix partition_snapshot::range_tombstones()
partition_snapshot::range_tombstones() is deoverlapping tombstones
coming from different versions and it may happen that due to range
tombstone splitting the method will return a tombstone which starts
after the requested range. This would cause it to return a tombstone
which doesn't overlap with the requested range.

This breaks assumptions made by cache reader. It keeps track of the
maximum fragment position, and if cache reader will then need to read
from sstables due to a miss, it would do so starting from the position
marked by that out of range tombstone, possibly skipping over some
rows.

Exposed by a change in row_cache_test.cc::test_mvcc() which fills the
buffer of sm5 reader after it is created.

Fixes #3053.
2017-12-08 10:15:58 +01:00
Tomasz Grabiec
89e3b734ed tests: random_mutation_generator: Do not emit dummy entries at clustering row positions
It is assumed that dummy entries are only at !is_clustering_row() positions.
Causes cache_streamed_mutation to assert when trying to trim a range tombstone.
2017-12-07 20:20:37 +01:00
Avi Kivity
d934ca55a7 Merge "SSTable resharding fixes" from Raphael
"Didn't affect any release. Regression introduced in 301358e.

Fixes #3041"

* 'resharding_fix_v4' of github.com:raphaelsc/scylla:
  tests: add sstable resharding test to test.py
  tests: fix sstable resharding test
  sstables: Fix resharding by not filtering out mutation that belongs to other shard
  db: introduce make_range_sstable_reader
  rename make_range_sstable_reader to make_local_shard_sstable_reader
  db: extract sstable reader creation from incremental_reader_selector
  db: reuse make_range_sstable_reader in make_sstable_reader
2017-12-07 16:42:48 +02:00
Amos Kong
8fd5d27508 dist/debian: add scylla-tools-core to depends list
Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <db39cbda0e08e501633556ab238d816e357ad327.1512646123.git.amos@scylladb.com>
2017-12-07 13:40:10 +02:00
Amos Kong
eb3b138ee2 dist/redhat: add scylla-tools-core to requires list
Fixes #3051

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <f7013a4fbc241bb4429d855671fee4b845b255cd.1512646123.git.amos@scylladb.com>
2017-12-07 13:40:08 +02:00
Gleb Natapov
8f104bab5d storage_proxy: send negative write replies only when entire cluster supports the feature
Message-Id: <20171207102934.GM1885@scylladb.com>
2017-12-07 12:31:35 +02:00
Botond Dénes
1ff65f41fd mutation_reader_merger: don't query the kind of moved-from fragment
Call mutation_fragment_kind() on the fragment *before* it's moved as
there are not guarantees for the state of a moved-from object (apart
from that it's in a valid one).

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <c47b1e22877bb9499f1fbb9d513093c29ef1901b.1512635422.git.bdenes@scylladb.com>
2017-12-07 10:40:31 +02:00
Avi Kivity
060e5d3354 Merge "Improve time-series performance by not actually compacting fully expired tables" from Raphael
"In time-series, it's common for tables in a given time window to be eventually
fully expired. The deletion of such tables is done by compaction, but there's
*no* need to *actually* compact such fully expired sstables *iff* their full
deletion will not cause older data to be ressurected. In other words, a fully
expired table can be actually skipped (but deleted in the end) by compaction
*iff* it doesn't contain newer data than its overlapping counterparts. So there
may be false negatives, but never false positives.
All that said, the goal behind this patchset is to save read bandwidth of disk
in such scenarios. Given that fully expired sstables will not be read by
compaction process anymore, read amplification will be greatly reduced too.

Fixes #2620."

* 'time_series_performance_improvement_v2_2' of github.com:raphaelsc/scylla:
  tests: check sstable auto correct bad max deletion time
  tests: add test for compaction with fully expired table
  sstables/compaction: do not actually compact fully expired sstables
  sstables: make sstable auto correct max_local_deletion_time
  sstables: switch to const ref wherever possible
  sstables: use gc_clock::time_point for gc_before
  gc_clock: introduce operator<<(ostream&, gc_clock::time_point)
  sstables: introduce sstable::get_max_local_deletion_time
  sstables: remove unnecessary copy in time series strategies
  sstables: change return value type of get_fully_expired_sstables
  dtcs: make code to extract non expired tables faster
  sstables: add has_correct_max_deletion_time to sstable
2017-12-07 10:29:31 +02:00
Avi Kivity
908daa67bd Merge "Generalize data_resource" from Jesse
"Soon we will have resources beyond just keyspaces and table names. There
will be resources for roles, for user-defined functions (UDFs), and
possible resources for REST end-points. This change generalizes the
implementation of a `data_resource` to many different kinds of
resources, though there is still only one kind (`data`).

The most important patch is 2/5 ("auth/resource: Generalize to different
kinds"), which re-writes `auth::data_resource`. The patch message should
sufficiently explain the design decisions involved.

The other patches rename files and identifiers based on the expanded
role of this class, except for 5/5 ("auth/resource.hh: Rename
`resource_ids`"): this patch gives a more appropriate name to a type
alias.

Fixes #3027."

* 'jhk/generalize_resource/v3' of https://github.com/hakuch/scylla:
  auth/resource.hh: Rename `resource_ids`
  auth: Rename `data_resource` files
  cql3/authorization_statement: Fix typo
  auth/resource: Generalize to different kinds
  auth: Rename `data_resource` to `resource`
2017-12-07 10:25:58 +02:00
Botond Dénes
9fce51f8a0 Add streamed mutation fast-forwarding unit test for the flat combined-reader
Test for the bug fixed by 9661769.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <fc917bae8e9c99f026bf7b366e6e9d39faf466af.1512630741.git.bdenes@scylladb.com>
2017-12-07 09:45:12 +02:00
Raphael S. Carvalho
39f7404436 tests: add sstable resharding test to test.py
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-07 03:15:27 -02:00
Raphael S. Carvalho
fc193c29cf tests: fix sstable resharding test
wrong sstable was used when checking for content, and storage service
for test was missing.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-07 03:15:27 -02:00
Raphael S. Carvalho
bad21ba444 sstables: Fix resharding by not filtering out mutation that belongs to other shard
After 301358e, sstable resharding stopped work because shared sstables would
use a filtering reader, which excludes mutation that belong to other shards.
That completely breaks which relies on compaction of mutations that belong
to different shards. The fix is about using recently introduced non local
shard reader.

Fixes #3041.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-07 03:15:26 -02:00
Raphael S. Carvalho
f1b65a115a db: introduce make_range_sstable_reader
introduce reader variant that will allow its caller to read a range
in a given table without any filter applied.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-07 03:15:26 -02:00
Raphael S. Carvalho
d1b146baa6 rename make_range_sstable_reader to make_local_shard_sstable_reader
Tomek says:
"I think that the least surprising behavior for a function named like this
is to read the sstables unfiltered (it just reads them), and the filtering
should be indicated specially in the name or by accepting a parameter."

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-07 03:15:25 -02:00
Raphael S. Carvalho
3d725d6823 db: extract sstable reader creation from incremental_reader_selector
step closer to divorcing incremental_selector from sstables

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-07 01:53:16 -02:00
Raphael S. Carvalho
ab82bacddd db: reuse make_range_sstable_reader in make_sstable_reader
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-07 01:53:14 -02:00
Raphael S. Carvalho
5eef7371b3 tests: check sstable auto correct bad max deletion time
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 19:52:33 -02:00
Raphael S. Carvalho
a86ee38638 tests: add test for compaction with fully expired table
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 19:52:33 -02:00
Raphael S. Carvalho
809b30c4a2 sstables/compaction: do not actually compact fully expired sstables
There's no need to actually compact a sstable which is fully expired
and which deletion of all its data will not ressurect older data.
For that, a sstable will only be considered fully expired if it
doesn't contain data newer than its overlapping counterparts.
That way, there could be a false negative, but never a false positive.
Currently, a fully expired sstable would unnecessarily waste read
bandwidth of disk. This will help a lot time series workloads in
which data for a given time window is all deleted at once using TTL.

Fixes #2620.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 19:52:33 -02:00
Raphael S. Carvalho
810e2ec3d9 sstables: make sstable auto correct max_local_deletion_time
sstables created prior to cc6c383 can contain bad max deletion time stat,
which would make get_fully_expired_sstables return sstables that aren't
actually fully expired. Let's make sstable invalidate the stat if it
is potentially incorrect.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 19:52:33 -02:00
Raphael S. Carvalho
d2ab154f12 sstables: switch to const ref wherever possible
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 19:52:33 -02:00
Raphael S. Carvalho
d916c8cdad sstables: use gc_clock::time_point for gc_before
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 19:52:33 -02:00
Raphael S. Carvalho
1d0e6496ec gc_clock: introduce operator<<(ostream&, gc_clock::time_point)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 19:52:32 -02:00
Raphael S. Carvalho
fcdce38e7f sstables: introduce sstable::get_max_local_deletion_time
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 18:47:05 -02:00
Raphael S. Carvalho
18bdf496fe sstables: remove unnecessary copy in time series strategies
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 18:46:46 -02:00
Raphael S. Carvalho
45c11865fa sstables: change return value type of get_fully_expired_sstables
unordered_set will allow us to quickly extract fully expired tables
from a set of compacting sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 18:45:55 -02:00
Raphael S. Carvalho
4fe6fea758 dtcs: make code to extract non expired tables faster
since it's O(n) and not O(n log n).

change also needed for change in interface of function to retrieve
fully expired tables, or sort lambda would need to be parametrized.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 18:40:16 -02:00
Raphael S. Carvalho
11176324bd sstables: add has_correct_max_deletion_time to sstable
Commit cc6c38324 fixes the stat. It was only updated for range
tombstone prior to fix, so a sstable that had a regular cell with
no expiration time could be considered fully expired which can
lead to bad decisions in compaction for time series workloads.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-12-06 18:40:05 -02:00
Jesse Haber-Kucharsky
aea262cdc4 auth/resource.hh: Rename resource_ids 2017-12-06 14:39:40 -05:00
Jesse Haber-Kucharsky
3cad18631d auth: Rename data_resource files
Now that there can be many kinds of resources, the old name doesn't fit.
2017-12-06 14:39:40 -05:00
Jesse Haber-Kucharsky
3665261a90 cql3/authorization_statement: Fix typo 2017-12-06 14:39:40 -05:00
Jesse Haber-Kucharsky
1bb22bb190 auth/resource: Generalize to different kinds
This change generalizes the implementation of a `resource` to many
different kinds of resources, though there is still only one
kind (`data`). In the future, we also expect resource kinds for roles,
user-defined functions (UDFs), and possibly on particular REST
end-points.

I considered several approaches to generalizing to different kinds of
resources.

One approach is to have a base class that is inherited from by different
resource kinds. The common functionality would be accessed through
virtual member functions and kind-specific functions would exist in
sub-classes. I rejected this approach because dealing with different
kinds of resources uniformly requires storage and life-time management
through something like `std::unique_ptr<auth::resource>`, which means
that we lose value semantics (including comparison) and must deal with
complications around ownership.

Another option was to use `boost::variant` (or, in future,
`std::variant`). This is closer to what we want, since there a static
set of resource kinds that we support. I rejected this approach for two
reasons. The first is that all resource kinds share the same data (a
list of segments and a root identifier), which would be duplicated in
each type that composed the variant. The second is that the complexity
and source-code overhead of `boost::variant` didn't seem warranted.

The solution I ended up with is home-grown variant. All resources are
described in the same `final` class: `auth::resource`. This class has
value semantics, supports equality comparison, and has a strict
ordering. All resources have in common a tag ("kind") and a list of
parts. Most operations on resources don't care about the kind of
resource (like getting its name, parsing a name, querying for the
parent, etc). These are just member functions of the class.

When we care about a kind-specific interpretation of a resource, we can
produce a "view" of the resource. For example, `data_resource_view`
allows for accessing the (optional) keyspace and table names.

I anticipate in the future to add functions for creating role
resources (`auth::resource::role`) and also `role_resource_view`.

The functional behaviour of the system should be unchanged with this
patch.

I've added new unit tests in `auth_resource_test.cc` and removed the old
test from `auth_test.cc`.

Fixes #3027.
2017-12-06 14:37:56 -05:00
Jesse Haber-Kucharsky
8fe53ecf78 auth: Rename data_resource to resource
The implementation and interface of `auth::resource` will change soon to
support different kinds of resources beyond just data (keyspaces and
tables).
2017-12-06 10:18:05 -05:00
Gleb Natapov
ddf117535a storage_proxy: add counters for speculative reads
Fixes #3030

Message-Id: <20171206143611.8756-1-gleb@scylladb.com>
2017-12-06 16:38:16 +02:00
Avi Kivity
ccc315bcfe Merge "storage_proxy: allow fail request earlier if CL cannot be reached due to errors" from Gleb
"This is CASSANDRA-7886 and CASSANDRA-8592. The patch series detects
that CL of a request can no longer be reached due to errors and fails
the request earlier. New type of errors are reported: read/write failure
which were introduced in cql v4 protocol. For compatibility if older
protocol is used the error is translated to timeout error."

* 'gleb/request-failure_v2' of github.com:scylladb/seastar-dev:
  storage_proxy: fail read/write requests early if it cannot be completed due to errors
  storage_service: add WRITE_FAILURE_REPLY_FEATURE feature
  gossiper: add node_has_feature() function
  cql: add read/write failure exceptions
  storage_proxy: fix data presence reporting in read timeout error during
  storage_proxy: remove inheritance from enable_shared_from_this for abstract_write_response_handler
  storage_proxy: remove unneeded field in abstract_write_response_handler
  storage_proxy: fix pending endpoint accounting for EACH_QUORUM
  consistency_level: constify quorum_for() and local_quorum_for()
2017-12-06 16:17:19 +02:00