Commit Graph

778 Commits

Author SHA1 Message Date
Avi Kivity
299d1fad0b Merge "reduce bloom filter overhead in compaction" from Raphael
"Function to calculate maximum purgeable timestamp is made 10 times faster when
compacting sstables overlap with 10% of all sstables.
That's possible with an incremental selector that will incrementally select
sstables based on key being compacted.
Currently, we iterate through all non-compacting sstables and consult their
bloom filter to determine max purgeable timestamp, and that will be very
expensive for compactions that are frequently deciding whether or not to purge
tombstones."

* 'filter_overhead_fix_v4' of github.com:raphaelsc/scylla:
  compaction: reduce bloom filter overhead with incremental selector
  tests: add test for sstable set's incremental selector
  sstable_set: introduce incremental selector
  compatible_ring_position: add function to return token
2016-12-11 09:46:58 +02:00
Glauber Costa
5803957ab5 compaction: fix build
Commit 732ee275 moved tracking of one statistics value inside a lambda
without capturing this in that lambda. Compilation fails as a result.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <68860640f4533dd43e43f341f1620e25464b700b.1481313455.git.glauber@scylladb.com>
2016-12-10 09:00:20 +02:00
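The build failure described above is the usual one for a lambda that touches a member without capturing `this`. A minimal sketch, using a hypothetical `compaction_stats` class that is not taken from the Scylla source:

```cpp
#include <cstdint>
#include <functional>

// Hypothetical stand-in for an object that tracks one statistics value.
class compaction_stats {
    uint64_t _completed = 0;
public:
    // Returns a callback that bumps the counter when a job finishes.
    std::function<void()> make_on_done() {
        // Writing `[] { ++_completed; }` here does not compile: a lambda
        // must capture `this` to reach members of the enclosing object.
        return [this] { ++_completed; };
    }
    uint64_t completed() const { return _completed; }
};
```

The returned callback must not outlive the object it captured.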
Raphael S. Carvalho
fcfc84e836 compaction: reduce bloom filter overhead with incremental selector
The procedure to calculate max purgeable timestamp is optimized
by only visiting sstables that overlap with key being currently
compacted. That's done using incremental sstable selector.

Function to calculate maximum purgeable timestamp is made 10 times
faster when compacting sstables overlap with 10% of all sstables.

Fixes #1322.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-12-09 16:17:17 -02:00
Raphael S. Carvalho
02541e15c1 sstable_set: introduce incremental selector
Incrementally select sstables from sstable set using token
in ascending order.
For leveled strategy, it returns all sstables that belong
to current interval. For other strategies, it just returns
all sstables from the set.
Useful for compaction which needs all sstables that overlap
with key being currently compacted to calculate maximum
purgeable timestamp.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-12-09 16:17:16 -02:00
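The idea of an incremental selector can be sketched as follows. This is an illustration only: tokens are plain integers, `sstable_info` and `incremental_selector` are hypothetical names, and a rescan here is still a linear pass, which a real implementation would presumably avoid. What it shows is that consecutive tokens falling in the same interval reuse the previous selection.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical, simplified sstable descriptor: an id plus the inclusive
// token range it covers. Real tokens are not plain integers.
struct sstable_info {
    int id;
    int64_t first_token;
    int64_t last_token;
};

class incremental_selector {
    const std::vector<sstable_info>& _all;  // must outlive the selector
    std::vector<int> _current;              // ids overlapping the cached interval
    int64_t _valid_from = 0;
    int64_t _valid_until = -1;              // empty interval forces the first scan
    int _rescans = 0;
public:
    explicit incremental_selector(const std::vector<sstable_info>& all) : _all(all) {}

    // Returns the ids of sstables whose range contains `token`.
    // Intended to be called with tokens in ascending order.
    const std::vector<int>& select(int64_t token) {
        if (token >= _valid_from && token <= _valid_until) {
            return _current;  // same interval as last time, nothing to recompute
        }
        ++_rescans;
        _current.clear();
        _valid_from = std::numeric_limits<int64_t>::min();
        _valid_until = std::numeric_limits<int64_t>::max();
        for (const auto& s : _all) {
            if (s.first_token <= token && token <= s.last_token) {
                _current.push_back(s.id);
                _valid_from = std::max(_valid_from, s.first_token);
                _valid_until = std::min(_valid_until, s.last_token);
            } else if (s.first_token > token) {
                // The selection changes once this sstable begins.
                _valid_until = std::min(_valid_until, s.first_token - 1);
            } else {
                // This sstable ended before `token`.
                _valid_from = std::max(_valid_from, s.last_token + 1);
            }
        }
        return _current;
    }

    int rescans() const { return _rescans; }
};
```

A caller computing the maximum purgeable timestamp would then consult only the sstables returned by `select()` for the current key, rather than every non-compacting sstable.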
Raphael S. Carvalho
732ee275f8 compaction: fix running compaction counter when splitting sstables
The counter was being increased before taking the semaphore, so
every pending split would count as a running compaction which
misleads the user as a result.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <f2050cc3599cee7af29d4579368a154708b37731.1481248048.git.raphaelsc@scylladb.com>
2016-12-09 15:01:43 +02:00
Avi Kivity
872b5ef5f0 sstables: fix probe with Unknown component
Commit 53b7b7def3 ("sstables: handle unrecognized sstable component")
ignores unrecognized components, but misses one code path during probe_file().

Ignore unrecognized components there too.

Fixes #1922.
Message-Id: <20161208131027.28939-1-avi@scylladb.com>
2016-12-08 15:24:25 +01:00
Avi Kivity
5530a61975 sstables: fix build with older boost (boost::variant::get<T&>)
Older boost doesn't support boost::variant::get<T&> (where the type
parameter is reference qualified); remove (unneeded anyway).
2016-12-08 10:56:05 +02:00
Avi Kivity
3c3a18f222 sstables: move sharding metadata from Statistics component to a new Scylla component
The Cassandra derived sstable tools (and likely Cassandra itself) object to
a new sub-component in the Statistics component; create a new Scylla
component instead to host this data.
2016-12-07 15:20:13 +02:00
Avi Kivity
24140ec8c6 sstables: add support for sets of discriminated union types
Allow declaring discriminated unions (with an enum type as the
discriminant and any sstable serializable type as a value) and sets
of these unions, with the discriminant as the key.  Parsers and writers
are auto-generated.
2016-12-07 13:27:52 +02:00
Raphael S. Carvalho
b30a2cb21a lcs: generate info that preserves token distribution in higher levels
The information (last compacted keys) is lost after node is restarted
or schema is updated, which causes strategy to be rebuilt.
We need it for strategy to guarantee uniform distribution of token
range across sstables, or we could end up with 1 sstable of level L
overlapping with lots of sstables of level L+1, and that results in
a compaction of undesired length.
That information can be generated from scratch by getting last key
of newest sstable in each level > 0.

Fixes #1906.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <35ebd15977d5a8418239febb160c796cdc0e98fa.1480533805.git.raphaelsc@scylladb.com>
2016-12-01 11:19:58 +02:00
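Regenerating the last compacted keys from scratch might look roughly like this. The sketch assumes that "newest" means the largest write time, and `sstable_desc` and `rebuild_last_compacted_keys` are hypothetical names, not the ones used in the patch.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified view of an sstable for this illustration.
struct sstable_desc {
    int level;
    uint64_t write_time;   // assumed to identify the newest sstable
    std::string last_key;  // last partition key stored in the sstable
};

// For every level > 0, return the last key of the newest sstable in it.
std::map<int, std::string> rebuild_last_compacted_keys(
        const std::vector<sstable_desc>& sstables) {
    std::map<int, const sstable_desc*> newest;
    for (const auto& s : sstables) {
        if (s.level <= 0) {
            continue;  // level 0 has no last compacted key
        }
        auto it = newest.find(s.level);
        if (it == newest.end() || s.write_time > it->second->write_time) {
            newest[s.level] = &s;
        }
    }
    std::map<int, std::string> keys;
    for (const auto& entry : newest) {
        keys[entry.first] = entry.second->last_key;
    }
    return keys;
}
```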
Raphael S. Carvalho
38743c1948 sstables: provide write time of data component
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <59686148149f2159990329775e0cd8780bc54254.1480533805.git.raphaelsc@scylladb.com>
2016-12-01 11:19:57 +02:00
Raphael S. Carvalho
a16425833c size_tiered: do not recreate bucket when it goes beyond max threshold
Problem will cause size tiered to return small jobs when there are
more than max_threshold sstables of similar size. For example, if
max_threshold is 32, and there are 36 sstables of similar size,
strategy will only return 4 sstables to be compacted. That's because
we incorrectly create a new bucket when it meets the max threshold.
What we should do is to allow buckets to grow beyond max threshold
and trim them when selecting the most suitable one for compaction.

Important to mention that estimation for size tiered will now
work better when there are more than max_threshold sstables of
similar size.

Fixes #1901.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <080bad70d6cb86eaf52ac1bdd6765ac47aab5b03.1478316140.git.raphaelsc@scylladb.com>
2016-11-29 16:56:02 +02:00
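The trimming step can be sketched as below. Which sstables survive the trim is an assumption of this example (it keeps the smallest ones), and `trim_bucket` is a hypothetical helper, not the function from the patch.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// A bucket holds the sizes of similar-sized sstables and is allowed to grow
// past max_threshold; it is cut down only when chosen for compaction.
std::vector<uint64_t> trim_bucket(std::vector<uint64_t> bucket,
                                  std::size_t max_threshold) {
    // Assumption for this sketch: prefer the smallest sstables.
    std::sort(bucket.begin(), bucket.end());
    if (bucket.size() > max_threshold) {
        bucket.resize(max_threshold);
    }
    return bucket;
}
```

With 36 similar-sized sstables and a max threshold of 32, this yields a job of 32 sstables instead of the 4 left over in a freshly created bucket.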
Raphael S. Carvalho
4781b6eb71 sstables: use nonwrapping_range::make to avoid compilation issues
GCC 5.3.1 was unable to convert bound to optional<bound>.

sstables/sstables.cc:2494:123: error: no matching function for call to
‘nonwrapping_range<dht::ring_position>::nonwrapping_range(dht::ring_position,
dht::ring_position)’
(dtr.right.exclusive ? dht::ring_position::starting_at :
dht::ring_position::ending_at)(std::move(t2)));

In file included from ./dht/i_partitioner.hh:52:0,
                 from ./query-request.hh:28,
                 from ./clustering_key_filter.hh:27,
                 from sstables/sstables.hh:35,
                 from sstables/sstables.cc:38:
./range.hh:441:14: note: candidate: nonwrapping_range<T>::nonwrapping_range(
const wrapping_range<U>&) [with T = dht::ring_position]
     explicit nonwrapping_range(const wrapping_range<T>& r)

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <95bbf984cd73a61739c8da99cf6cd5e94f1d1457.1479954360.git.raphaelsc@scylladb.com>
2016-11-24 11:26:16 +02:00
Duarte Nunes
cc3f26c993 lz4: Conditionally use LZ4_compress_default()
Since not all distributions have a version of LZ4 with
LZ4_compress_default(), we use it conditionally.

This is especially important beginning with version 1.7.3 of LZ4,
which deprecates the LZ4_compress() function in favour of
LZ4_compress_default() and thus prevents Scylla from compiling
due to the deprecation warning.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161124092339.23017-1-duarte@scylladb.com>
2016-11-24 11:25:03 +02:00
Avi Kivity
98a4544e1c sstables: add method to get sstable owning shards from an unloaded sstable
When we load an sstable, we don't know beforehand which shards it belongs
to; we don't want to open it until we do.  Add a method that allows us
to read just the sharding data, without opening anything else.
2016-11-22 21:52:23 +02:00
Avi Kivity
bdd11648ac sstables: add intra-node sharding metadata
Add a metadata component that describes token ranges that are spanned by
this sstable.  With the current sharding algorithm, where each shard owns
a single token range, the first/last partition key is sufficient to
describe sharding information, but for multi-range algorithms, this
is not sufficient.
2016-11-22 21:44:25 +02:00
Avi Kivity
316ef1d70a sstables: automate writing statistics components
Add a virtual function to metadata_base so we can loop over statistics
components when writing them.
2016-11-22 21:05:06 +02:00
Avi Kivity
d05b22e502 sstables: automatically calculate offsets in statistics
Instead of calculating the offset for each statistic component manually,
use a loop to iterate over all components, accumulating the offset as we
go along.
2016-11-22 20:35:24 +02:00
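Accumulating offsets in a loop amounts to a running sum over the component sizes. A minimal sketch, where `component_offsets` and the `header_size` parameter are hypothetical and the real Statistics layout is not reproduced:

```cpp
#include <cstdint>
#include <vector>

// Given the serialized size of each component, in file order, return the
// offset at which each component starts. `header_size` stands for whatever
// precedes the first component.
std::vector<uint32_t> component_offsets(uint32_t header_size,
                                        const std::vector<uint32_t>& sizes) {
    std::vector<uint32_t> offsets;
    offsets.reserve(sizes.size());
    uint32_t offset = header_size;
    for (uint32_t size : sizes) {
        offsets.push_back(offset);  // this component starts here
        offset += size;             // the next one starts right after it
    }
    return offsets;
}
```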
Avi Kivity
7c5e6525ef sstables: switch statistics components to generic serialized_size() implementation
2016-11-22 20:20:38 +02:00
Avi Kivity
096ae59a5b sstables: introduce generic serialized_size()
Introduce a new function that reuses the file_writer code to compute
the serialized size of an sstable object, by serializing it into memory
and discarding the result.
2016-11-22 20:06:23 +02:00
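One way to picture a generic serialized_size() is to run the normal write path against a writer that throws the bytes away. The sketch below counts bytes instead of buffering them in memory, which differs from the commit's description, and every type in it (`counting_writer`, `example_component`) is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical writer that discards its input and only counts bytes.
class counting_writer {
    std::size_t _written = 0;
public:
    void write(const char*, std::size_t n) { _written += n; }
    std::size_t written() const { return _written; }
};

// Hypothetical serializable object.
struct example_component {
    uint32_t count;
    std::string name;
};

template <typename Writer>
void write(Writer& out, uint32_t v) {
    out.write(reinterpret_cast<const char*>(&v), sizeof(v));
}

template <typename Writer>
void write(Writer& out, const std::string& s) {
    write(out, static_cast<uint32_t>(s.size()));  // length prefix
    out.write(s.data(), s.size());
}

template <typename Writer>
void write(Writer& out, const example_component& c) {
    write(out, c.count);
    write(out, c.name);
}

// Reuses the write() overloads, so the size can never drift from the
// actual on-disk format. Takes its input by const reference.
template <typename T>
std::size_t serialized_size(const T& object) {
    counting_writer w;
    write(w, object);
    return w.written();
}
```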
Avi Kivity
3c06ffac9d sstables: const correctness for the write(file_writer&, T&) functions
write() doesn't need to change its input; so change it to const.

The only snag is that describe_type() isn't and can't be made const-correct,
so cheat when it is called and const_cast the input.

This helps in writing a generic serialized_size() that is const correct,
in the next patch.
2016-11-22 20:04:27 +02:00
Raphael S. Carvalho
3dc9294023 db: do not leak deleted sstable when deletion triggers an exception
The leakage results in deleted sstables being opened until shutdown, and disk
space isn't released. That's because column_family::rebuild_sstable_list()
will not remove reference to deleted sstables if an exception was triggered in
sstables::delete_atomically(). A sstable only has its files closed when its
object is destructed.

The exception happens when a major compaction is issued in parallel to a
regular one, and one of them will be unable to delete a sstable already deleted
by the other. That results in remove_by_toc_name() triggering boost::filesystem
::filesystem_error because TOC and temporary TOC don't exist.

We wouldn't have seen this problem if major compaction were going through
compaction manager, but remove_by_toc_name() and rebuild_sstable_list() should
be made resilient.

Fixes #1840.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <d43b2e78f9658e2c3c5bbb7f813756f18874bf92.1479390842.git.raphaelsc@scylladb.com>
2016-11-17 17:46:36 +02:00
Gleb Natapov
c052a1bc4f sstable: use schema's min_index_interval config when generating missing summary
Message-Id: <20161116181937.GA25303@scylladb.com>
2016-11-17 15:24:03 +02:00
Gleb Natapov
ae0a2935b4 sstables: fix ad-hoc summary creation
If sstable Summary is not present Scylla does not refuse to boot but
instead creates summary information on the fly. There is a bug in this
code though. The Summary file is a map between keys and offsets into Index
file, but the code creates map between keys and Data file offsets
instead. Fix it by keeping offset of an index entry in index_entry
structure and use it during Summary file creation.

Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20161116165421.GA22296@scylladb.com>
2016-11-17 11:05:23 +02:00
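The distinction between the two offsets can be sketched as follows. The structures are simplified and hypothetical (`summary_entry` and `build_summary` are not Scylla names), and sampling every `min_index_interval`-th entry is an assumption about how the summary is thinned out.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct index_entry {
    std::string key;
    uint64_t data_offset;   // where the partition lives in the Data file
    uint64_t index_offset;  // where this entry lives in the Index file
};

struct summary_entry {
    std::string key;
    uint64_t offset;  // must point into the Index file
};

// Samples every min_index_interval-th index entry into the summary.
std::vector<summary_entry> build_summary(const std::vector<index_entry>& index,
                                         std::size_t min_index_interval) {
    std::vector<summary_entry> summary;
    if (min_index_interval == 0) {
        return summary;
    }
    for (std::size_t i = 0; i < index.size(); i += min_index_interval) {
        // The bug described above corresponds to storing data_offset here.
        summary.push_back({index[i].key, index[i].index_offset});
    }
    return summary;
}
```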
Raphael S. Carvalho
e86de40b49 compaction_manager: inform about compaction cancelled by shutdown
After some changes in compaction manager, user no longer is informed
that compaction was cancelled in event of shutdown. That's because
we only ignore ready future when compaction manager was asked to
stop.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <02ca29b5a93fe3a558896598f325b0dce069e82c.1478277317.git.raphaelsc@scylladb.com>
2016-11-14 16:37:33 +02:00
Piotr Jastrzebski
4fe989d58e Cleanup sstables::mutation_reader::impl
Pointer to sstable seems unnecessary.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <a45e8853af2b5f896ec44144fbc26d3325a5ec0c.1479123740.git.piotr@scylladb.com>
2016-11-14 11:52:52 +00:00
Avi Kivity
f10b9906d8 sstables: move atomic deletion code to its own files
This will simplify unit testing.  We move generic code that
depends only on seastar, so compile time should not increase too much.
2016-11-04 15:47:35 +02:00
Avi Kivity
9e85653c33 sstables: make atomic_deletion_manager more abstract
Make the shard count and method of deleting sstables abstract, in order
not to require all that machinery for unit tests.
2016-11-04 15:44:09 +02:00
Avi Kivity
e527da1e3c sstables: wrap atomic deletion code in a class
This makes it easier to abstract and unit-test.
2016-11-04 15:44:07 +02:00
Avi Kivity
a05837936a sstables: remove quadratic behavior from atomic sstable deletions
In order to ensure exception safety, the atomic sstable deletion code
creates a copy of the list of sstables pending deletion, modifies that
copy, and then replaces the original data with the copy.  This guarantees
that any exception does not change the data, since the assignment does
not require allocation.

However, it does result in quadratic behavior.  During startup, all
sstables are loaded on each shard, and each shard deletes sstables that
do not have any partitions served by that shard; this results in
almost all sstables being deleted from all shards, with all that work
going to shard 0; the list grows to O(nr sstables), and there are
O((nr sstables) * (nr shards)) operations to perform.

Fix by replacing the copy-modify-assign method with an in-place update,
but one that is designed to only commit changes after all allocations
have been made; in addition, instead of using a list, use a hash table,
removing another source of quadratic behavior.

Fixes #1812 (the quadratic behavior part).
2016-11-04 15:42:44 +02:00
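An in-place update that commits only after all allocations have succeeded can be approximated in standard C++17 with a hash set and node splicing. This is a sketch of the general technique under that assumption, not the code from the patch; `pending_deletions` is a hypothetical class.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

class pending_deletions {
    std::unordered_set<std::string> _pending;
public:
    // Adds `names` without copying the existing set. The work is
    // proportional to the new names, not to everything already pending.
    void add(const std::vector<std::string>& names) {
        // Step 1: allocate nodes for the new names in a scratch set.
        // If this throws, _pending has not been touched.
        std::unordered_set<std::string> staged(names.begin(), names.end());
        // Step 2: grow the bucket array up front. If this throws,
        // the contents of _pending are unchanged.
        _pending.reserve(_pending.size() + staged.size());
        // Step 3: commit by splicing the already-allocated nodes.
        // Duplicates simply stay behind in `staged`.
        _pending.merge(staged);
    }

    bool contains(const std::string& name) const {
        return _pending.count(name) != 0;
    }
    std::size_t size() const { return _pending.size(); }
};
```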
Avi Kivity
a35136533d Convert ring_position and token ranges to be nonwrapping
Wrapping ranges are a pain, so we are moving wrap handling to the edges.

Since cql can't generate wrapping ranges, this means thrift and the ring
maintenance code; also range->ring transformations need to merge the first
and last ranges.

Message-Id: <1478105905-31613-1-git-send-email-avi@scylladb.com>
2016-11-02 21:04:11 +02:00
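Handling wrap-around at the edges means converting a wrapping range into nonwrapping pieces before it reaches the core code. A minimal sketch with integer tokens and inclusive bounds; `token_range` and `unwrap` here are hypothetical and do not mirror Scylla's range classes:

```cpp
#include <cstdint>
#include <vector>

// Inclusive range over integer tokens. It wraps when start > end.
struct token_range {
    int64_t start;
    int64_t end;
};

// Splits a possibly wrapping range into nonwrapping ranges,
// returned in ascending token order.
std::vector<token_range> unwrap(token_range r, int64_t min_token,
                                int64_t max_token) {
    if (r.start <= r.end) {
        return {r};  // already nonwrapping
    }
    // The wrapping range covers the end of the ring and then its start.
    return {{min_token, r.end}, {r.start, max_token}};
}
```

The opposite direction, turning a list of ranges back into ring form, is where the first and last ranges have to be merged.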
Raphael S. Carvalho
53b7b7def3 sstables: handle unrecognized sstable component
As in C*, unrecognized sstable components should be ignored when
loading a sstable. At the moment, Scylla fails to do so and will
not boot as a result. In addition, unknown components should be
remembered when moving a sstable or changing its generation.

Fixes #1780.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <b7af0c28e5b574fd577a7a1d28fb006ac197aa0a.1478025930.git.raphaelsc@scylladb.com>
2016-11-02 12:44:53 +02:00
Raphael S. Carvalho
a3e065da9b db: make it possible to use custom error handler with io checker
By default, io checker will cause Scylla to shut down if it finds
specific system errors. Right now, io checker isn't flexible
enough to allow a specialized handler. For example, we don't want
Scylla to shut down if there's a permission problem when
uploading new files from upload dir. This desired flexibility is
made possible here by allowing a handler parameter to io check
functions and also changing existing code to take advantage of it.
That's a step towards fixing #1709.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-10-27 15:54:21 -02:00
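Passing the error handler as a parameter can be sketched like this. The signature is an assumption made for illustration, and `io_error_handler` and this `io_check` are hypothetical rather than the actual Scylla declarations.

```cpp
#include <functional>
#include <system_error>

// Decides what to do with a system error: shut down, log, or ignore.
using io_error_handler = std::function<void(const std::system_error&)>;

// Runs `func`; on a system error, lets the handler inspect it
// and then rethrows so the caller still sees the failure.
template <typename Func>
auto io_check(const io_error_handler& handler, Func&& func) -> decltype(func()) {
    try {
        return func();
    } catch (const std::system_error& e) {
        handler(e);
        throw;
    }
}
```

A caller loading files from the upload directory could pass a handler that only logs, while the default handler keeps the shut-down behavior.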
Raphael S. Carvalho
bc2d351c25 sstables: remove duplicated declaration of remove_by_toc_name
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-10-26 11:21:27 -02:00
Piotr Jastrzebski
27726cecff Clean up position_in_partition.
Introduce position_in_partition_view and use it in
position() method in mutation_fragment, range_tombstone,
static_row and clustering_row.
Clean up comparators in position_in_partition.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <c65293c71a6aa23cf930ed317fb63df1fdc34fd1.1477399763.git.piotr@scylladb.com>
2016-10-25 15:13:20 +01:00
Raphael S. Carvalho
975ce62dbc sstables: do not swallow exception when reading TOC
That caused problem when refreshing a sstable with bad permissions.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <48e5322c53234209e55da05c64c99b8ec4e190a3.1477372974.git.raphaelsc@scylladb.com>
2016-10-25 12:21:32 +03:00
Paweł Dziepak
ab0eeae82d sstables: keep separate stream history for single and range reads
Single partition and partition range reads are expected to behave
considerably differently, so it is worth having them use separate file
stream history. This also makes reads use different history for each
sstable which is also a good thing.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
20bfa1fa52 sstables: drop sstable::{lower, upper}_bound()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
6755a679f6 drop key readers
key_readers weren't used since introduction of continuity flag to cache
entries.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
c63e88d556 sstables: implement mutation_reader::impl::fast_forward_to()
This patch allows sstable readers to be fast forwarded without making it
necessary to recreate the reader (and dropping all buffers in the
process). It is built on top of index_reader and ability of
data_consume_context to be fast forwarded.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
a530762277 sstables: introduce index_reader
index_reader is a helper that implements index lookups. Its goal is to
avoid dropping read buffers if they still may be needed (for example to
get end bound of the range or after fast forwarding the reader).

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
f49a9e0d64 sstables: drop unused read_range_rows() overload
That overload was used only by unit test and violated guarantee that
partition range lives until mutation reader is done.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
0bc873ace5 sstables: add fast_forward_to() to continuous_data_consumer
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
25b91c51e2 sstables: add data_consume_rows_context::reset()
reset() is going to be used to restore valid state after fast forwarding
the reader.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
2124d08b88 sstables: add skip() to compressed_file_data_source
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Duarte Nunes
c36dbaf0f1 sstables: Add function to get key samples
This patch implements the get_key_samples() function, on which a
future patch will base an implementation of the describe_splits()
thrift verb closer to Cassandra's.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 19:50:14 +02:00
Duarte Nunes
fc07b66678 sstables/key: Add to_partition_key function
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 19:50:11 +02:00
Duarte Nunes
ceed09b23e sstables: Get estimates for a particular range
This patch adds the estimated_keys_for_range() function, which
estimates the number of keys present within the specified range.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 17:52:15 +02:00
Duarte Nunes
8c223b31c8 sstables/key: Make key::kind public
Needed to create synthetic keys without any value but with ordering
properties.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-10-10 17:47:24 +02:00
Glauber Costa
7146776d7c fix sstable tests by not using the flush_reader if no region_group
The latest virtual dirty patches broke the SSTable tests. The reason for
this is that those tests will flush synthetic memtables that do not have
a region_group attached to them.

Normally in cases like this we would just give the flush_reader an empty
region group. However, the memtable class constructor takes a
region_group pointer and that can be null according to the interface.
So we must conditionally test it.

If there isn't a region_group involved, the virtual dirty accounting
should be disabled: after all, we won't even have the baseline memory
to begin with.

One of the approaches to fix this could be to just provide null
accounter classes to be used as a surrogate for the accounting classes
in this case. However, since this is mostly used for tests, a much
simpler way is to just revert back to the scanning reader in that case.

The scanning reader is similar enough to the flush_reader, except that
it can handle partial ranges, slices, and delegate accesses to an
sstable post-flush. We don't need any of that, but as argued above,
there is no need to remove it either.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
Message-Id: <1475667271-60806-1-git-send-email-glommer@scylladb.com>
2016-10-05 12:44:21 +01:00