1041 Commits

Author SHA1 Message Date
Raphael S. Carvalho
050a7019b8 sstables/index_reader: fix index reader for summary entry spanning lots of keys
quantity prevents index_reader from reading all index entries of a summary
entry that span more than min_index_interval entries. That can happen after
introduction of size-based sampling, and consequently, sstable will not be
able to return a key whose logical position in summary entry is beyond
min_index_interval. It's ok to not use quantity because index_reader will
read all indexes until either next summary entry or end of file is reached.

Fixes test_sstable_conforms_to_mutation_source

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170812045821.25269-1-raphaelsc@scylladb.com>
2017-08-12 09:44:16 +03:00
Raphael S. Carvalho
872412d31a db/config: introduce sstable_summary_ratio option
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-08-11 01:36:21 -03:00
Raphael S. Carvalho
8726ee937d sstables: introduce size-based sampling for sstable summary
Currently, a summary entry is added after min_index_interval index
entries were written. Not taking into account size of index entries
becomes a problem with large partitions which may create big index
entries due to promoted indexes. Read performance is affected as a
consequence because index entries spanned by summary are all read
from disk to serve request.

What we want to do is to also add a summary entry after index reaches
a boundary. To deal with oversampling, we want to write 1 byte to
summary for every 2000 bytes written to data file (this will be
eventually made into an option in the config file).
Both conditions must be met to avoid under or oversampling.
That way, the amount of data needed from index file to satisfy the
request is drastically reduced.

Fixes #1842.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-08-11 00:30:12 -03:00
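One possible reading of the two conditions above, as a minimal sketch. The names `summary_sampler`, `index_boundary` and `maybe_sample` are hypothetical and not taken from the patch; only the ratio of 1 summary byte per 2000 data bytes comes from the commit message. An entry is emitted only when the index has advanced past a boundary and the data file has grown enough to pay for the entry.

```cpp
#include <cstdint>

// Hypothetical sketch, not the code from this commit.
struct summary_sampler {
    uint64_t index_boundary;                      // assumed: index bytes between samples
    uint64_t data_bytes_per_summary_byte = 2000;  // ratio quoted in the commit message
    uint64_t next_index_offset = 0;               // earliest index offset for the next sample
    uint64_t next_data_offset = 0;                // earliest data offset for the next sample

    // Returns true when a summary entry of entry_size bytes should be written.
    bool maybe_sample(uint64_t index_offset, uint64_t data_offset, uint64_t entry_size) {
        // Both conditions must hold, otherwise we would under- or oversample.
        if (index_offset < next_index_offset || data_offset < next_data_offset) {
            return false;
        }
        next_index_offset = index_offset + index_boundary;
        // Each summary byte has to be backed by 2000 bytes of data.
        next_data_offset = data_offset + entry_size * data_bytes_per_summary_byte;
        return true;
    }
};
```

With this shape, the summary can never grow past roughly 1/2000 of the data file, regardless of how large individual index entries are.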
Raphael S. Carvalho
da7489720b sstables: make components_writer::offset const qualified and uint64_t
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-08-10 21:48:11 -03:00
Raphael S. Carvalho
881c479be8 sstables: make writer::offset const qualified and uint64_t
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-08-10 21:46:39 -03:00
Botond Dénes
94fc550e68 sstable_set::incremental_selector: select() now returns a selection
A selection contains - in addition to the list of sstables - a next_token
which is a hint as to what is the next best token to call select() with.
This should be the smallest token such that at the next call to
select() the least number of new sstables will be returned, without
skipping any.
2017-08-09 16:27:33 +03:00
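A rough sketch of what such a selection could look like, using plain integers as tokens. `sstable_range`, `selection` and `select_for` are illustrative names, and computing next_token as the smallest first token above the current one is an assumption about the hint, not the patch's actual logic.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical stand-ins: an sstable is reduced to an id and a closed token range.
struct sstable_range {
    int id;
    int64_t first;
    int64_t last;
};

struct selection {
    std::vector<int> sstables;  // ids of sstables covering the requested token
    int64_t next_token;         // hint: next token at which a new sstable appears
};

selection select_for(const std::vector<sstable_range>& all, int64_t token) {
    // max() means "no further sstable starts after this token".
    selection s{{}, std::numeric_limits<int64_t>::max()};
    for (const auto& r : all) {
        if (r.first <= token && token <= r.last) {
            s.sstables.push_back(r.id);
        } else if (r.first > token) {
            // Calling again at the smallest such token skips no sstable.
            s.next_token = std::min(s.next_token, r.first);
        }
    }
    return s;
}
```

A caller would loop, passing the returned next_token to the following call, until the hint reaches the maximum value.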
Raphael S. Carvalho
dddbd34b52 sstables: close index file when sstable writer fails
index's file output stream uses write behind but it's not closed
when sstable write fails and that may lead to crash.
It happened before for data file (which is obviously easier to
reproduce for it) and was fixed by 0977f4fdf8.

Fixes #2673.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170807171146.10243-1-raphaelsc@scylladb.com>
2017-08-08 09:53:14 +03:00
Duarte Nunes
569bbf2edd sstables/sstables: Use per-cpu noop_write_monitor
We employ a thread-per-core architecture, so don't go about sharing
seastar::shared_ptrs across cpus.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170801144153.17354-1-duarte@scylladb.com>
2017-08-01 18:10:49 +03:00
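The idea can be illustrated without Seastar. In the sketch below the callback names and `default_write_monitor()` are hypothetical; the point is only that a `thread_local` default gives each thread (and so each core, in a thread-per-core design) its own no-op instance, so no reference-counted pointer is ever shared across cpus.

```cpp
// Hypothetical observer interface; the real callback names may differ.
struct write_monitor {
    virtual ~write_monitor() = default;
    virtual void on_write_completed() = 0;
    virtual void on_flush_completed() = 0;
};

// Default observer that ignores every notification.
struct noop_write_monitor final : write_monitor {
    void on_write_completed() override {}
    void on_flush_completed() override {}
};

// One instance per thread; with one thread per core this is one per cpu.
inline write_monitor& default_write_monitor() {
    static thread_local noop_write_monitor monitor;
    return monitor;
}
```

A writer that takes a `write_monitor&` can default to `default_write_monitor()` and remain safe on any shard.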
Avi Kivity
db7329b1cb Merge "Ensure correct EOC for PI block cell names" from Duarte
"This series ensures we always write correct cell names to promoted
index cell blocks, taking into account the eoc of range tombstones.

Fixes #2333"

* 'pi-cell-name/v1' of github.com:duarten/scylla:
  tests/sstable_mutation_test: Test promoted index blocks are monotonic
  sstables: Consider eoc when flushing pi block
  sstables: Extract out converting bound_kind to eoc
2017-08-01 18:09:07 +03:00
Avi Kivity
1e8bb972b6 compaction: fix iteration in leveled compaction droppable tombstones loop
Since get_level_count() is unsigned, it will never be negative, and
the loop may never terminate.

Message-Id: <20170719133502.13316-1-avi@scylladb.com>
2017-08-01 13:40:36 +03:00
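The general pitfall looks like this. The function below is an illustrative sketch and not the loop from the compaction code: it shows one idiomatic way to count down over an unsigned range without relying on a value going negative.

```cpp
#include <cstddef>
#include <vector>

// Buggy shape (for reference only):
//   for (std::size_t i = level_count - 1; i >= 0; --i) { ... }
// i >= 0 is always true for an unsigned type; decrementing 0 wraps around
// to a huge value instead of ending the loop.

// Hypothetical helper: visits levels from highest to lowest and terminates.
std::vector<std::size_t> levels_high_to_low(std::size_t level_count) {
    std::vector<std::size_t> order;
    // Test-then-decrement: the body sees level_count-1 down to 0.
    for (std::size_t i = level_count; i-- > 0; ) {
        order.push_back(i);
    }
    return order;
}
```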
Avi Kivity
ba2e170e4b compaction: fix return in leveled compaction droppable tombstones loop
If the loop ever terminates, we need to return something.

Message-Id: <20170719133508.13374-1-avi@scylladb.com>
2017-08-01 13:33:02 +03:00
Duarte Nunes
1a33cc6847 sstables: Release the flush permit before fsyncing
This allows a queued flush to start while we fsync the current
sstable, which helps reduce the overall time new writes are blocked on
dirty memory.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
Duarte Nunes
784a078e72 sstables: Introduce write_monitor
The write_monitor provides callbacks to inform an observer of the
state of the ongoing sstable write.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
Avi Kivity
e855a28fae Revert "Merge "memtable flush: Fixes and improvements" from Duarte"
This reverts commit 733a64a1df, reversing
changes made to e11e66723a.

Breaks sstable_test and perf_fast_forward.
2017-07-31 12:44:28 +03:00
Duarte Nunes
5e64839e85 sstables: Release the flush permit before fsyncing
This allows a queued flush to start while we fsync the current
sstable, which helps reduce the overall time new writes are blocked on
dirty memory.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
a737577881 sstables: Introduce write_monitor
The write_monitor provides callbacks to inform an observer of the
state of the ongoing sstable write.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
06728bdfe9 sstables: Consider eoc when flushing pi block
When flushing a promoted index block using a range tombstone cell name
as a bound, use the right eoc value instead of always writing
composite::eoc::none.

Fixes #2333

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 18:23:58 +02:00
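A sketch of the kind of conversion involved, under the assumption that an inclusive start and an exclusive end sort before every key sharing the prefix, while an exclusive start and an inclusive end sort after them. The enums are local stand-ins and the function names are illustrative; check the actual source before relying on the mapping.

```cpp
// Local stand-ins for the real types.
enum class bound_kind { excl_end, incl_start, incl_end, excl_start };
enum class eoc : signed char { start = -1, none = 0, end = 1 };

// Marker for the start bound of a range tombstone.
inline eoc start_marker(bound_kind k) {
    // An exclusive start lies after all keys with the given prefix.
    return k == bound_kind::excl_start ? eoc::end : eoc::start;
}

// Marker for the end bound of a range tombstone.
inline eoc end_marker(bound_kind k) {
    // An exclusive end lies before all keys with the given prefix.
    return k == bound_kind::excl_end ? eoc::start : eoc::end;
}
```

Writing eoc::none unconditionally loses this ordering information, which is the problem the commit describes.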
Duarte Nunes
718517ed91 sstables: Extract out converting bound_kind to eoc
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 18:23:58 +02:00
Paweł Dziepak
7b0f75c0d1 sstables: avoid indirect calls to abstract_type::is_multi_cell()
2017-07-26 14:38:27 +01:00
Paweł Dziepak
28c105e4a7 sstables: avoid copying key components
2017-07-26 14:38:27 +01:00
Paweł Dziepak
960a140880 index_reader: advance_and_check_if_present() use index_comparator
2017-07-26 14:36:37 +01:00
Paweł Dziepak
dc7bad9a50 sstables: cache token in index entries
When a sstable reader is fast forwarded some index entries may be read
(and compared) multiple times. This patch makes sure that once a token
is computed we keep it around and reuse if the entry is accessed again.
2017-07-26 14:36:37 +01:00
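The caching pattern described here is ordinary memoization. Below is a minimal sketch with a hypothetical `index_entry_sketch` class; `std::hash` stands in for the partitioner's token function, and the call counter exists only to make the effect observable.

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for an index entry holding a partition key.
class index_entry_sketch {
    std::string _key;
    mutable std::optional<uint64_t> _token;  // filled in on first access
    mutable int _hash_calls = 0;             // instrumentation for the example
public:
    explicit index_entry_sketch(std::string key) : _key(std::move(key)) {}

    // Computes the token once and reuses it on later comparisons.
    uint64_t token() const {
        if (!_token) {
            ++_hash_calls;
            _token = std::hash<std::string>{}(_key);  // placeholder for the real hash
        }
        return *_token;
    }

    int hash_calls() const { return _hash_calls; }
};
```

For summary entries, which stay in memory for the lifetime of the sstable, the same value could instead be computed eagerly at load time.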
Paweł Dziepak
bfb7b56c74 sstable: keep a pre-computed token in summary_entry
Each sstable index lookup involves a binary search in the summary and
each time a partition key of summary entry is compared with anything its
token needs to be calculated.
Since we keep summary in the memory all the time it is better to also
keep the tokens around.
2017-07-26 14:36:36 +01:00
Paweł Dziepak
31d7cfdefb sstables: introduce decorated_key_view
2017-07-26 14:36:36 +01:00
Paweł Dziepak
e0a04cb7fe sstables: make sure that fill_buffer() actually fills buffer
streamed_mutation::impl::fill_buffer() is supposed to either push
mutation fragments to the buffer or set EOS flag. However, it was
possible that mp_row_consumer would return proceed::no if a skip was
needed without satisfying any of these conditions.
2017-07-26 14:36:36 +01:00
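The contract can be sketched as a loop that keeps driving the consumer until one of the two conditions holds. Everything below (`reader_sketch`, `step_result`, the scripted outcomes) is a hypothetical model and not the streamed_mutation code.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

enum class step_result { pushed, skipped, end_of_stream };

// Hypothetical model: `script` stands in for what the parser would produce.
struct reader_sketch {
    std::vector<step_result> script;
    std::size_t pos = 0;
    std::deque<int> buffer;  // stands in for the mutation fragment buffer
    bool eos = false;

    // One consumer step; a skip returns without pushing anything.
    step_result step() {
        if (pos >= script.size()) {
            return step_result::end_of_stream;
        }
        step_result r = script[pos++];
        if (r == step_result::pushed) {
            buffer.push_back(static_cast<int>(pos));
        }
        return r;
    }

    // Postcondition: the buffer is non-empty or eos is set.
    void fill_buffer() {
        while (buffer.empty() && !eos) {
            if (step() == step_result::end_of_stream) {
                eos = true;
            }
        }
    }
};
```

Returning after a single skipped step would leave the caller with an empty buffer and no EOS flag, which is the situation the commit rules out.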
Avi Kivity
c5ee62a6a4 Merge "restrict background writers with scheduling groups" from Glauber
"This patchset restricts background writers - such as compactions,
streaming flushes and memtable flushes to a maximum amount of CPU usage
through a seastar::thread_scheduling_group.

The said maximum is recommended to be set to 50 % - it is default
disabled, but can be adjusted through a configuration option until we
are able to auto-tune this.

The second patch in this series provides a preview of what such auto-tuning
would look like. By implementing a simple controller we automatically
adjust the quota for the memtable writer processes, so that the rate at
which bytes come in is equal to the rates at which bytes are flushed.

Tail latencies are greatly reduced by this series, and heavy spikes that
previously appeared on CPU-bound workloads are no more."

* 'memtable-controller-v5' of https://github.com/glommer/scylla:
  simple controller for memtable/streaming writer shares.
  restrict background writers to 50 % of CPU.
2017-07-20 10:58:53 +03:00
Tomasz Grabiec
a9237c1666 schema: Revert back to the 1.7 layout of static compact tables in memory
We are using C* 3.x compatible layout in schema tables but want to
keep using the 1.7 layout in memory for compatibility during rolling
upgrade. This patch switches the schema and schema_builder classes
back to the old layout. Translation of layout happens when converting
to/from schema mutations.

Notable changes:

 1) Includes a revert of commit 6260f31e08
    "thrift: Update CQL mapping of static CFs".

 2) Brings back the "default_validation_class" schema attribute. In v3
    it can be derived from column definitions, but in v2 it can't, so
    we have to store it.

 3) legacy_schema_migrator and schema_builder don't have to do
    conversions to v3, this is now handled by the v3_columns
    class. schema_builder works with the same layout as schema, that
    is v2.

 4) Includes a revert of commit 66991a7ccb
    "v3 schema test fixes"

Fixes #2555.
2017-07-19 09:52:15 +02:00
Raphael S. Carvalho
7ecedac222 compaction: wire up time window compaction strategy
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-19 02:58:37 -03:00
Raphael S. Carvalho
01886c23a8 compaction/twcs: override default values with options in schema
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-19 02:58:37 -03:00
Raphael S. Carvalho
206d30c52a sstables: implement time window compaction strategy
For more details, https://issues.apache.org/jira/browse/CASSANDRA-9666

Fixes #1432.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-19 02:58:35 -03:00
Glauber Costa
4f01ec0910 restrict background writers to 50 % of CPU.
In scylla, we have foreground processes, which are latency sensitive and
need to be responded to as fast as possible in order to maintain good
latency profiles, and background processes, which are less so.

The most important background processes we have during normal write
workload operations are memtable writes and sstable compactions. Those
processes are quite CPU-intensive, and left unchecked will easily
dominate the CPU. Lower values of task-quota usually help, as it will
force those processes to preempt more, but aren't enough to guarantee
good isolation. We have seen boxes with good NVMe storage having their
throughput reduced to less than half of the original baseline in a short
dive down for the duration of a compaction.

In the long run, our goal is to leverage the CPU scheduler to make sure
that those processes are balanced with respect to all the others.
However, the current state of affairs is causing grievances at this very
moment. Thankfully, those processes live in a seastar::thread, that
ships with its own rudimentary bandwidth control mechanism: the
scheduling group.

The goal of this patch is to wrap background processes together in a
scheduling group, and assign to such group 50 % of our CPU power; the
remainder being left to foreground processes.

While we pride ourselves in dynamically adjusting things to the
workload, we won't be able to do this properly before the CPU scheduler
lands - and let's face it, letting background processes run wild is not
adaptive either. Every workload would benefit most from a different
value for such shares, but 50 % is as fair as it gets if we really need
static partitioning in the meantime.

As a defense against unforeseen consequences, we'll leave the actual
value as an option, but will do our best to hide it - as this is not a
tunable that we want to be part of a normal Scylla setup. The most
convenient place for this tunable is still db::config, so we can easily
pass it down to the database layer - but we will not document it in the
yaml, and will clearly note in the help string that it is not supposed
to be tuned.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-07-18 23:35:33 -04:00
Raphael S. Carvalho
2686e84792 sstables: import TimeWindowCompactionStrategy.java
it will be later converted to C++. Imported from latest
scylla-tools-java repository. Checked that it doesn't lack anything.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-18 18:26:17 -03:00
Raphael S. Carvalho
7dbfebb7dc lcs: remove conditional limit for partial sort
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170711140241.11023-2-raphaelsc@scylladb.com>
2017-07-11 17:18:32 +03:00
Raphael S. Carvalho
ebb5dafef0 lcs: remove useless filter for demotion procedure
there's no way a sstable from a level higher than N+1 will be in
set of candidates that can be either level N or level N + 1.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170711140241.11023-1-raphaelsc@scylladb.com>
2017-07-11 17:18:31 +03:00
Raphael S. Carvalho
6aa2e5be17 lcs: only demote sstable from level higher than target one
if we are compacting level 1 into level 2, we only want to demote
a sstable from level 3 or higher.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:42 -03:00
Raphael S. Carvalho
53b72b473e lcs: improve indentation for get_overlapping_starved_sstables
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:40 -03:00
Raphael S. Carvalho
3639b48d7b lcs: improve indentation for get_compaction_candidates
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:38 -03:00
Raphael S. Carvalho
5a8b8a6ccb lcs: partially sort candidates that will be trimmed
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:37 -03:00
Raphael S. Carvalho
8334086441 lcs: remove quadratic behavior from L0 compaction
L0 compaction triggers quadratic behavior when many newly created
sstables are needed for promotion due to their size being small relative
to the max sstable size parameter. So until L0 is worth promoting,
the strategy will compact every new sstable with all the existing
ones in L0. To fix it, let's do STCS on level 0 until it becomes
worth promoting.

Fixes #2432.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:35 -03:00
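A back-of-the-envelope model of the difference, assuming n equally sized flushes. Both functions are illustrative. The tiered variant is a simplification of STCS that merges whenever a tier holds `fan_in` sstables (4 is an assumed threshold), whereas the real strategy buckets sstables by similar size.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bytes written when every new sstable is compacted with all existing L0 ones.
uint64_t naive_rewritten(uint64_t n, uint64_t s) {
    uint64_t total = 0;
    for (uint64_t k = 2; k <= n; ++k) {
        total += k * s;  // the k-th flush rewrites all k sstables' worth of data
    }
    return total;
}

// Bytes written by a simplified size-tiered scheme.
uint64_t tiered_rewritten(uint64_t n, uint64_t s, uint64_t fan_in = 4) {
    // tier_counts[t] = number of sstables of size s * fan_in^t
    std::vector<uint64_t> tier_counts{0};
    uint64_t total = 0;
    for (uint64_t i = 0; i < n; ++i) {
        std::size_t t = 0;
        uint64_t size = s;
        ++tier_counts[0];
        while (tier_counts[t] == fan_in) {
            tier_counts[t] = 0;   // merge a full tier into one larger sstable
            size *= fan_in;
            total += size;
            if (++t == tier_counts.size()) {
                tier_counts.push_back(0);
            }
            ++tier_counts[t];
        }
    }
    return total;
}
```

In this model, 16 flushes of one unit each cost 135 units of rewriting in the naive scheme and 32 in the tiered one, and the gap widens quadratically as n grows.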
Raphael S. Carvalho
80f1dca328 lcs: introduce private interface
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:33 -03:00
Raphael S. Carvalho
bc71f97116 lcs: make some member functions static
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:32 -03:00
Raphael S. Carvalho
f4b733efe4 lcs: make some functions const qualified
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:28 -03:00
Raphael S. Carvalho
ede0ee16b2 lcs: remove add method
Its code can be inlined because no one besides create() calls it

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:26 -03:00
Raphael S. Carvalho
00ef528e5b lcs: extract code for higher levels compaction from get_candidates_for
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:25 -03:00
Raphael S. Carvalho
a46b73c401 lcs: simplify code to get candidates for higher levels
get rid of unneeded loop for dealing with suspect sstables and
std::advance because vector allows random access.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:19 -03:00
Raphael S. Carvalho
e954af0f0f lcs: extract round-robin heuristic for even distribution of keys into function
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:15 -03:00
Raphael S. Carvalho
3c0028d921 lcs: update outdated comments for level 0 compaction
some comments are no longer relevant, especially the ones that
talk about dealing with busy sstables due to parallel compaction,
which isn't done by us for lcs.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:07 -03:00
Raphael S. Carvalho
62607ba36a lcs: improve worth_promoting_L0_candidates interface
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:35:00 -03:00
Raphael S. Carvalho
c1e42f6528 lcs: do not check if level 0 can be promoted twice
can_promote flag will be used to carry info about whether or not
level 0 can be promoted. That will avoid a single iteration for higher
levels too which can contain tens of thousands of sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:34:49 -03:00
Raphael S. Carvalho
887aab4ae7 lcs: extract code for level 0 compaction from get_candidates_for
I will split code for higher levels compaction into functions first
before putting it into its own function too.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-11 09:34:41 -03:00