Commit Graph

1148 Commits

Author SHA1 Message Date
Avi Kivity
dc6be68852 Merge "promoted index for reading partial partitions" from Nadav
"The goal of this patch series is to support reading and writing of a
"promoted index" - the Cassandra 2.* SSTable feature which allows reading
only a part of the partition without needing to read an entire partition
when it is very long. To make a long story short, a "promoted index" is
a sample of each partition's column names, written to the SSTable Index
file with that partition's entry. See a longer explanation of the index
file format, and the promoted index, here:

     https://github.com/scylladb/scylla/wiki/SSTables-Index-File

There are two main features in this series - first enabling reading of
parts of partitions (using the promoted index stored in an sstable),
and then enabling writing promoted indexes to new sstables. These two
features are broken up into smaller stand-alone pieces to facilitate the
review.

Three features are still missing from this series and are planned to be
developed later:

1. When we fail to parse a partition's promoted index, we silently fall back
   to reading the entire partition. We should log (with rate limiting) and
   count these errors, to help in debugging sstable problems.

2. The current code only uses the promoted index when looking for a single
   contiguous clustering-key range. If the ck range is non-contiguous, we
   fall back to reading the entire partition. We should use the promoted
   index in that case too.

3. The current code only uses the promoted index when reading a single
   partition, via sstable::read_row(). When scanning through all or a
   range of partitions (read_rows() or read_range_rows()), we do not yet
   use the promoted index; we read contiguously from the data file (we do not
   even read from the index file, so unsurprisingly we can't use it)."

(cherry picked from commit 700feda0db)
2016-08-09 17:54:15 +03:00
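
As a rough illustration of the lookup a promoted index makes possible, the sketch below binary-searches a per-partition list of index blocks for the first block that may contain a given column name. The `index_block` layout, the `find_block` helper and the use of plain strings as column names are assumptions made for illustration only; they are not the structures used in this series.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified index block: one sample of a partition's column
// names plus the location of the matching data in the data file.
struct index_block {
    std::string first_name;  // first column name covered by the block
    std::string last_name;   // last column name covered by the block
    std::uint64_t offset;    // offset of the block within the partition
    std::uint64_t width;     // size of the block in bytes
};

// Returns the first block whose last_name is >= name, i.e. the earliest block
// that can contain `name`. Assumes `blocks` is sorted by column name.
// An empty result means `name` sorts after every block in the partition.
std::optional<index_block> find_block(const std::vector<index_block>& blocks,
                                      const std::string& name) {
    auto it = std::lower_bound(
        blocks.begin(), blocks.end(), name,
        [](const index_block& b, const std::string& n) { return b.last_name < n; });
    if (it == blocks.end()) {
        return std::nullopt;
    }
    return *it;
}
```

A reader could then seek to `offset` and read `width` bytes instead of the whole partition; when the lookup cannot be used, falling back to a full partition read matches the behaviour described in item 1 above.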
Paweł Dziepak
e95f4eaee4 Merge "partition_limit: Don't count dead partitions" from Duarte
"This patch series ensures we don't count dead partitions (i.e.,
partitions with no live rows) towards the partition_limit. We also
enforce the partition limit at the storage_proxy level, so that
limits with smp > 1 work correctly."

(cherry picked from commit 5f11a727c9)
2016-08-03 12:44:32 +03:00
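
The counting rule described above can be sketched as follows. This is a minimal illustration, not the code from the series: `partition_sketch` and `apply_partition_limit` are hypothetical names, and a partition is treated as dead when it has zero live rows.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one partition of a query result.
struct partition_sketch {
    int key;
    std::size_t live_rows;
};

// Returns the keys of at most `partition_limit` live partitions, in order.
// Dead partitions (no live rows) are skipped and do not use up the limit.
std::vector<int> apply_partition_limit(const std::vector<partition_sketch>& partitions,
                                       std::size_t partition_limit) {
    std::vector<int> result;
    for (const auto& p : partitions) {
        if (result.size() == partition_limit) {
            break;
        }
        if (p.live_rows == 0) {
            continue;  // dead partition: not counted
        }
        result.push_back(p.key);
    }
    return result;
}
```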
Tomasz Grabiec
b224ff6ede Merge 'pdziepak/row-cache-wide-entries/v4' from seastar-dev.git
This series adds the ability for the partition cache to record whether a
partition's size makes it uncacheable. During reads, these entries save
us IO operations, since we already know that the partition is too big
to be put in the cache.

First part of the patchset makes all mutation_readers allow the
streamed_mutations they produce to outlive them, which is a guarantee
used later by the code handling reading large partitions.

(cherry picked from commit d2ed75c9ff)
2016-08-02 20:24:29 +02:00
Piotr Jastrzebski
6960fce9b2 Use continuity flag correctly with concurrent invalidations
Invalidations can happen between reading a cache entry and actually
using it, so we have to check whether the flag was cleared in the
meantime. If it was, we need to read the entry again.

Fixes #1464.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <7856b0ded45e42774ccd6f402b5ee42175bd73cf.1469701026.git.piotr@scylladb.com>
(cherry picked from commit fdfd1af694)
2016-08-02 20:24:22 +02:00
Duarte Nunes
d11b0cac3b sstable_mutation_test: Test non-compound cell name
This patch adds a test case for reading non-compound cell names,
validating that such a cell is not incorrectly marked as static.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469616205-4550-5-git-send-email-duarte@scylladb.com>
2016-07-28 12:11:37 +02:00
Tomasz Grabiec
7d73599acd tests: lsa_async_eviction_test: Use chunked_fifo<>
To protect against large reallocations during push() which are done
under reclaim lock and may fail.
2016-07-28 09:43:51 +02:00
Piotr Jastrzebski
bf27379583 Add tests for wide partition handling in cache.
They shouldn't be cached.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit 7d29cdf81f)
2016-07-27 14:09:45 +03:00
Paweł Dziepak
4e43cb84ff tests/sstables: test reading sstable with duplicated range tombstones
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
(cherry picked from commit b405ff8ad2)
2016-07-27 14:09:02 +03:00
Paweł Dziepak
a39bec0e24 tests: extract streamed_mutation assertions
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
(cherry picked from commit 50469e5ef3)
2016-07-27 14:05:43 +03:00
Raphael S. Carvalho
66ebef7d10 tests: add new test for date tiered strategy
This test sets the time window to 1 hour and checks that the strategy
works accordingly.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit cf54af9e58)
2016-07-21 12:00:26 +03:00
Raphael S. Carvalho
7b9cf528ad tests: fix occasional failure in date tiered test
That was a bug in the test itself. It could happen that an sstable would
incorrectly belong to the next time window if the current minute is
approaching its end. The fix is to give all sstables that we want in the
same time window the same min/max timestamp.

Fixes #1448.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <ee25d49e7ed12b4cf7d018a08163404c3d122e56.1468782787.git.raphaelsc@scylladb.com>
2016-07-18 15:18:29 +02:00
Duarte Nunes
9792a77266 range: Add deoverlap function
This patch adds the deoverlap function to range.hh, which takes in a
vector of possibly overlapping ranges and returns a vector of
non-overlapping ranges covering the same values.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-07-14 18:20:41 +02:00
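
The idea behind deoverlapping can be sketched with a simplified stand-in. The code below is an assumption-laden illustration, not the function added to range.hh: it uses a hypothetical `closed_range` over `int` with inclusive bounds, whereas the real range type supports open and unbounded ends.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical simplified range: both bounds are inclusive.
struct closed_range {
    int start;
    int end;
};

// Sorts the ranges by start and merges every range that overlaps the
// previous one, so the result covers the same values without overlap.
std::vector<closed_range> deoverlap_sketch(std::vector<closed_range> ranges) {
    std::sort(ranges.begin(), ranges.end(),
              [](const closed_range& a, const closed_range& b) { return a.start < b.start; });
    std::vector<closed_range> result;
    for (const auto& r : ranges) {
        if (!result.empty() && r.start <= result.back().end) {
            // Overlaps the last emitted range: extend it instead of adding a new one.
            result.back().end = std::max(result.back().end, r.end);
        } else {
            result.push_back(r);
        }
    }
    return result;
}
```

With this sketch, the input `{5,8}, {1,3}, {2,4}` yields `{1,4}, {5,8}`.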
Tomasz Grabiec
7227c537ce Merge branch 'pdziepak/streamed-mutations-hashing/v5' from seastar-dev.git
From Paweł:

This is another episode in the "convert X to streamed mutations" series.
Hashing mutations (mainly for repair) is converted so that it doesn't
need to rebuild whole mutation.

The first part of the series changes the way streamed mutations deal
with range tombstones. Since it is not necessary to make sure we write
disjoint tombstones to sstables there is no need anymore for streamed
mutations to produce disjoint tombstones and, consequently, no need for
range tombstones to be split into range_tombstone_begin and
range_tombstone_end.

The second part is the actual hashing implementation. However, to ensure
that the hash depends only on the contents of the mutation and not the
way it is stored in different data sources, range tombstones have to be
made disjoint before they are hashed.

This series also ensures that any changes caused by streamed mutations
to hashing and streaming do not break repair during upgrade.
2016-07-13 11:24:00 +02:00
Duarte Nunes
674afc52bc compound_test: Test singular composite_view::explode()
This patch adds a test case for composite_view::explode() called on a
non-compound composite.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1468353393-3074-1-git-send-email-duarte@scylladb.com>
2016-07-13 11:23:24 +02:00
Paweł Dziepak
c5662919df tests/streamed_mutation: test hashing
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-07-13 09:51:23 +01:00
Paweł Dziepak
eb1dcf08e7 tests/streamed_mutation: add test for range_tombstones_stream
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-07-13 09:51:23 +01:00
Paweł Dziepak
93cc4454a6 streamed_mutation: emit range_tombstones directly
Originally, streamed_mutations guaranteed that emitted tombstones are
disjoint. In order to achieve that two separate objects were produced
for each range tombstone: range_tombstone_begin and range_tombstone_end.

Unfortunately, this forced sstable writer to accumulate all clustering
rows between range_tombstone_begin and range_tombstone_end.

However, since there is no need to write disjoint tombstones to sstables
(see #1153 "Write range tombstones to sstables like Cassandra does") it
is also not necessary for streamed_mutations to produce disjoint range
tombstones.

This patch changes that by making streamed_mutation produce
range_tombstone objects directly.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-07-13 09:51:18 +01:00
Duarte Nunes
0b87d16699 composite: Add unit tests
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-07-11 16:55:11 +02:00
Tomasz Grabiec
8c4b5e4283 db: Avoid checking bloom filters during compaction
Checking bloom filters of sstables to compute max purgeable timestamp
for compaction is expensive in terms of CPU time. We can avoid
calculating it if we're not about to GC any tombstone.

This patch changes compacting functions to accept a function instead
of a ready value for max_purgeable.

I verified that bloom filter operations no longer appear on flame
graphs during compaction-heavy workload (without tombstones).

Refs #1322.
2016-07-10 09:54:20 +02:00
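
Passing a function instead of a ready value defers the expensive computation until it is needed. The sketch below illustrates that pattern under stated assumptions: `cell_sketch`, `compact_sketch` and the purge condition (`timestamp < max_purgeable`) are hypothetical and only stand in for the real compaction code.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

using timestamp_type = std::int64_t;

// Hypothetical simplified cell: a timestamp and a tombstone marker.
struct cell_sketch {
    timestamp_type timestamp;
    bool is_tombstone;
};

// Drops tombstones older than the max purgeable timestamp. The timestamp is
// obtained through `get_max_purgeable`, which is called at most once and
// only if a tombstone is actually encountered.
std::vector<cell_sketch> compact_sketch(const std::vector<cell_sketch>& cells,
                                        const std::function<timestamp_type()>& get_max_purgeable) {
    std::optional<timestamp_type> max_purgeable;  // computed lazily
    std::vector<cell_sketch> kept;
    for (const auto& c : cells) {
        if (c.is_tombstone) {
            if (!max_purgeable) {
                max_purgeable = get_max_purgeable();  // the expensive step
            }
            if (c.timestamp < *max_purgeable) {
                continue;  // tombstone can be garbage-collected
            }
        }
        kept.push_back(c);
    }
    return kept;
}
```

For input without tombstones the callback is never invoked, which corresponds to the bloom filter checks disappearing from the profile in the workload described above.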
Paweł Dziepak
cba996a3ea Merge "Implement missing functions for byte_ordered_partitioner" from Asias
2016-07-08 10:49:25 +01:00
Asias He
9c27b5c46e byte_ordered_partitioner: Implement missing describe_ownership and midpoint
In order to support ByteOrderedPartitioner, we need to implement the
missing describe_ownership and midpoint functions in the
byte_ordered_partitioner class.

As a start, this patch uses a simple method based on node token distance
to calculate ownership. C* uses a complicated method based on key samples.
We can switch to what C* does later.

Tests are added to tests/partitioner_test.cc.

Fixes #1378
2016-07-08 17:44:55 +08:00
Paweł Dziepak
a7b6c1110f sstables: do not require seal_sstable() to be run in thread
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-07-07 12:18:35 +01:00
Paweł Dziepak
4e34bd4e8a tests/streamed_mutation: test fragment_and_freeze()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-07-07 12:18:35 +01:00
Raphael S. Carvalho
b5ec4d46c6 tests: add test for date tiered compaction strategy
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-07-06 02:11:47 -03:00
Raphael S. Carvalho
cab2892866 tests: add test for sstables::get_fully_expired_sstables
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-07-06 02:11:47 -03:00
Raphael S. Carvalho
69b3860662 tests: add test for leveled_manifest::overlapping
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-07-06 02:11:45 -03:00
Raphael S. Carvalho
1118cfc51a tests: test that sstable max_local_deletion_time is properly updated
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-07-06 01:13:34 -03:00
Avi Kivity
2a46410f4a Change sstable_list from a map to a set
sstable_list is now a map<generation, sstable>; change it to a set
in preparation for replacing it with sstable_set.  The change simplifies
a lot of code; the only casualty is the code that computes the highest
generation number.
2016-07-03 10:26:57 +03:00
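
With a map keyed by generation, the highest generation is simply the last key; with an unordered set it has to be computed by scanning the elements. A minimal sketch of that scan, assuming hypothetical `sstable_sketch` and `sstable_list_sketch` types rather than the real ones:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <memory>
#include <unordered_set>

// Hypothetical simplified sstable: only the generation number matters here.
struct sstable_sketch {
    std::int64_t generation;
};

// Hypothetical set-based list, standing in for the new sstable_list.
using sstable_list_sketch = std::unordered_set<std::shared_ptr<sstable_sketch>>;

// Returns the highest generation in the set, or 0 if the set is empty.
std::int64_t highest_generation(const sstable_list_sketch& sstables) {
    std::int64_t highest = 0;
    for (const auto& sst : sstables) {
        highest = std::max(highest, sst->generation);
    }
    return highest;
}
```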
Avi Kivity
1b448877d7 Merge "thrift: Implement CQL over thrift" from Duarte
"This patchset implements the CQL over thrift verbs. Only CQL3 is supported,
and the CQL2 verbs are disabled."
2016-06-28 13:36:12 +03:00
Piotr Jastrzebski
68e5a199e9 Clean continuous flag of cache entry
preceding invalidated decorated key even
when it's not found.

Add test.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <c7b8f4df37256363bf304e0396f84b5f37921b81.1467059472.git.piotr@scylladb.com>
2016-06-28 10:26:02 +02:00
Duarte Nunes
c8afb4cc46 query_processor: Support thrift prepared statements
This patch adds support for thrift prepared statements. It specializes
the result_message::prepared into two types:
result_message::prepared::cql and result_message::prepared::thrift, as
their identifiers have different types.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-06-27 15:39:02 +02:00
Duarte Nunes
1ffae6e6ee database_test: Add test case for row limit
This patch introduces database_test and adds a test case to ensure
the row limit is respected when querying multiple partition ranges.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20160623111723.17523-1-duarte@scylladb.com>
2016-06-23 14:20:34 +02:00
Duarte Nunes
aacc7193f2 schema: Replace keyspace's schema_ptr on CF update
This patch ensures we replace the schema_ptr held by its respective
keyspace object when a column family is being updated.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20160623085710.26168-1-duarte@scylladb.com>
2016-06-23 11:11:52 +02:00
Piotr Jastrzebski
9b011bff18 row_cache: add contiguity flag to cache entry to reduce disk IO during scans
Add contiguity flag to cache entry and set it in scanning reader.
Partitions fetched during scanning are continuous
and we know there's nothing between them.

Clear contiguity flag on cache entries
when the succeeding entry is removed.

Use continuous flag in range queries.
Don't go to disk if we know that there's nothing
between two entries we have in cache. We know that
when continuous flag of the first one is set to true.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <72bae432717037e95d1ac9465deaccfa7c7da707.1466627603.git.piotr@scylladb.com>
2016-06-23 09:43:15 +03:00
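
The flag handling described above can be sketched with an ordered map standing in for the cache. Everything here is an illustrative assumption: integer keys replace decorated keys, and `cache_sketch`, `range_fully_cached` and `remove_entry` are hypothetical names, not the row_cache interface.

```cpp
#include <cassert>
#include <iterator>
#include <map>

// Hypothetical cache entry: `continuous` means there is known to be no
// partition between this entry and the next one in the cache.
struct cache_entry_sketch {
    bool continuous = false;
};

using cache_sketch = std::map<int, cache_entry_sketch>;

// Returns true if the range [first, last] can be served from the cache alone:
// both ends are cached and every entry before `last` has its flag set.
bool range_fully_cached(const cache_sketch& cache, int first, int last) {
    if (first > last) {
        return false;
    }
    auto it = cache.find(first);
    auto end = cache.find(last);
    if (it == cache.end() || end == cache.end()) {
        return false;
    }
    for (; it != end; ++it) {
        if (!it->second.continuous) {
            return false;  // something may exist on disk after this entry
        }
    }
    return true;
}

// Removes an entry and clears the flag on its predecessor, since the gap
// after the predecessor is no longer known to be empty.
void remove_entry(cache_sketch& cache, int key) {
    auto it = cache.find(key);
    if (it == cache.end()) {
        return;
    }
    if (it != cache.begin()) {
        std::prev(it)->second.continuous = false;
    }
    cache.erase(it);
}
```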
Duarte Nunes
69798df95e query: Limit number of partitions returned
This is required to implement a thrift verb.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-06-22 09:48:13 +02:00
Tomasz Grabiec
597cbbdedc Merge branch 'pdziepak/streamed-mutations/v5' from seastar-dev.git
From Paweł:

This series introduces streamed_mutations which allow mutations to be
streamed between the producers and the consumers as a series of
mutation_fragments. Because of that the mutation streaming interface
works well with partitions larger than available memory provided that
actual producer and consumer implementations can support this as well.

mutation_fragments are the basic objects that are emitted by
streamed_mutations. They can represent a static row, a clustering row,
or the beginning or the end of a range tombstone. They are ordered by their
clustering keys (with static rows being always the first emitted mutation
fragment). The beginning of range tombstone is emitted before any
clustering row affected by that tombstone and the end of range tombstone
is emitted after the last clustering row affected by it. Range tombstones
are disjoint.

In this series all producers are converted to fully support the new
interface, that includes cache, memtables and sstables. Mutation queries
and data queries are the only consumers converted so far.

To minimize the per-mutation_fragment overhead streamed_mutations use
batching. The actual producer implementation fills a buffer until
it is full (currently, buffer size is 16, the limit should, however,
be changed to depend on the actual size in memory of the stored elements)
or end of stream is reached.

In order to guarantee isolation of writes, reads from cache and memtable
use MVCC. When a reader is created it takes a snapshot of the particular
cache or memtable entry. The snapshot is immutable and if there happen
to be any incoming writes while the read is active a new version of
partition is created. When the snapshot is destroyed partition versions
are merged together as much as possible.

Performance results with perf_simple_query (median of results with
duration 15):

         before        after          diff
write    618652.70     618047.58      -0.10%
read     661712.44     608070.49      -8.11%
2016-06-21 12:15:21 +02:00
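
The batching described in the merge message above (fill a buffer until it is full or the stream ends) can be sketched as below. This is a simplified, synchronous illustration: `mutation_fragment_sketch`, `fill_buffer` and the `produce` callback are hypothetical, and only the buffer limit of 16 comes from the message.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical stand-in for a mutation fragment.
struct mutation_fragment_sketch {
    int position;
};

// Buffer limit taken from the merge message; counted in elements, not bytes.
constexpr std::size_t max_buffer_size = 16;

// Appends fragments from `produce` to `buffer` until the buffer is full or
// the producer signals end of stream by returning an empty optional.
// Returns true if end of stream was reached.
bool fill_buffer(std::vector<mutation_fragment_sketch>& buffer,
                 const std::function<std::optional<mutation_fragment_sketch>()>& produce) {
    while (buffer.size() < max_buffer_size) {
        auto fragment = produce();
        if (!fragment) {
            return true;
        }
        buffer.push_back(*fragment);
    }
    return false;
}
```

A consumer would drain the buffer and call `fill_buffer` again until it returns true, so the per-fragment cost of crossing the producer/consumer boundary is paid once per batch.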
Tomasz Grabiec
e783b58e3b Merge branch 'glommer/LSA-throttler-v6' from git@github.com:glommer/scylla.gi
From Glauber:

This is my new take at the "Move throttler to the LSA" series, except
this one doesn't actually move anything anywhere: I am leaving all
memtable conversion out, and instead I am sending just the LSA bits +
LSA active reclaim. This should help us see where we are going, and
then we can discuss all memtable changes in a series on its own,
logically separated (and hopefully already integrated with virtual
dirty).

[tgrabiec: trivial merge conflicts in logalloc.cc]
2016-06-21 10:22:26 +02:00
Glauber Costa
7f29cb8aba tests: add logalloc tests for pressure notification
Tests to make sure various scenarios of pressure notification for active
asynchronous reclaim work.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-06-20 18:58:39 -04:00
Glauber Costa
8f5047fc5f tests: add tests to new region_group throttle interface
Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-06-20 18:51:00 -04:00
Paweł Dziepak
a3423bac38 tests/streamed_mutation: test freezing streamed_mutations
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:29:52 +01:00
Paweł Dziepak
494c6fa9c1 tests/mutation_query_test: make sure mutations are sliced properly
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:29:52 +01:00
Paweł Dziepak
983321f194 tests/mutation: do not create memtable on stack
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:29:51 +01:00
Paweł Dziepak
4a5a9148e3 tests/row_cache: test slicing mutation reader
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:29:51 +01:00
Paweł Dziepak
e1a8d94542 tests/row_cache: test mvcc
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:29:51 +01:00
Paweł Dziepak
e4ae7894d4 tests/mutation: test slicing mutations
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:29:51 +01:00
Paweł Dziepak
4992ea9949 tests: add test for anchorless_list
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:29:51 +01:00
Paweł Dziepak
f991a2deb5 tests/row_cache_alloc_stress: use another memtable for underlying storage
It is incorrect to update row_cache with a memtable that is also its
underlying storage. The reason is that after the memtable is merged
into row_cache they share an lsa region. Then, when there is a cache miss,
the cache asks the underlying storage for data. This results in the memtable
reader running under the row_cache allocation section. Since the memtable
reader also uses an allocation section, the result is an assertion failure,
because allocation sections from the same lsa region cannot be nested.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:29:51 +01:00
Paweł Dziepak
5a5c519fa0 tests/row_cache_alloc_stress: use large cells instead of many rows
With streamed_mutations a partition with many small rows doesn't stress
the cache as much as the test expects. Use large clustering rows instead.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:29:50 +01:00
Paweł Dziepak
71e961427a test/sstables: test reading sstables with incorrect ordering
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:29:50 +01:00
Paweł Dziepak
b6f78a8e2f sstable: make sstable reads return streamed_mutation
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:29:50 +01:00