Commit Graph

873 Commits

Author SHA1 Message Date
Tzach Livyatan
12fb975282 Fix typos in metrics description
Fixes #2658

Signed-off-by: Tzach Livyatan <tzach@scylladb.com>
Message-Id: <20170803121732.19640-1-tzach@scylladb.com>
2017-08-28 10:48:28 +03:00
Glauber Costa
83323e155e database: add gate for generic async operations to column family
run_with_compaction_disabled(), which is called by truncate, has a
pretty large defer point in remove(). When the code gets to finally
execute, we can't guarantee that the column family will still be alive.

That is true in particular if we issued a drop table command following
truncate: by the time truncate gets to resume, the CF will be gone.
Before the column family is dropped, it will always call its stop()
method, which means we have an opportunity to do some waiting there. We
already wait for flushes and current compactions to end.

Traditionally, we have been solving similar problems by adding a gate
that will catch asynchronous operations and making sure that potentially
asynchronous operations will enter the gate before executing. Let's do
the same thing here. We will close() the gate during stop().

Fixes #2726

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-08-24 13:12:57 -04:00
Glauber Costa
d090e7be35 database: make sure that column family is always stopped when dropped
truncate can throw exceptions. If it does, cf->stop() will never be
called because it is contained in a .then clause instead of finally.

One of the things that truncate does - in a finally block of its own -
is initiate a final compaction. If it returns an exception nobody will
wait for that compaction to finish (since cf->stop() is the one doing
that) and we'll crash.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-08-24 13:01:47 -04:00
Amnon Heiman
abbd78367c Add configuration to disable per keyspace and column family metrics
The number of keysapce and column family metrics reported is
proportional to the number of shards times the number of keysapce/column
families.

This can cause a performance issue both on the reporting system and on
the collecting system.

This patch adds a configuration flag (set to false by default) to enable
or disable those metrics.

Fixes #2701

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170821113843.1036-1-amnon@scylladb.com>
2017-08-22 19:19:54 +03:00
Raphael S. Carvalho
10eaa2339e compaction: Make resharding go through compaction manager
Two reasons for this change:
1) every compaction should be multiplexed to manager which in turn
will make decision when to schedule. improvements on it will
immediately benefit every existing compaction type.
2) active tasks metric will now track ongoing reshard jobs.

Fixes #2671.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170817224334.6402-1-raphaelsc@scylladb.com>
2017-08-20 11:35:14 +03:00
Botond Dénes
e70cfc8f36 incremental_reader_selector: account for possibly disengaged lower bound
In addition to the constructor (fixed previously) the check for no
sstables on the first call to select() also has to be prepared for the
lower bound of the range being disengaged.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4ab1296c71814fcd492996fa36fd00fd7bbbbc7f.1502949875.git.bdenes@scylladb.com>
2017-08-17 10:07:26 +03:00
Botond Dénes
af83b7f57b incremental_reader_selector: use lazy_deref instead of tertiary operator
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4f4b884c6a1f517bd654f3b27608d854b17a66e1.1502948635.git.bdenes@scylladb.com>
2017-08-17 08:45:46 +03:00
Avi Kivity
8df6dd1fa0 database: make incremental_reader_selector robust vs. full-range partition_range
incremental_reader_selector assumes the partition_range it receives has a lower
bound, but it was seen in mutation_test that this is not so.

Fix by checking whether the bound exists or not.
Message-Id: <20170815095852.14149-1-avi@scylladb.com>
2017-08-15 11:03:22 +01:00
Duarte Nunes
7fb6a74302 combined_mutation_reader: Drop exhausted readers if not in FF mode
Exhausted readers can be fast forwarded, so we have to keep them
around. However, if the current reader is not fast forwardable, then
we can drop those readers and their buffers.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-08-14 14:37:27 +02:00
Botond Dénes
9ee9988097 Add combined_mutation_reader_test unit test 2017-08-10 12:38:10 +03:00
Botond Dénes
3e97a5cd6b Remove range_sstable_reader
range_sstable_reader is replaced with combined_mutation_reader, using
the incremental_reader_selector.
2017-08-10 12:38:10 +03:00
Botond Dénes
bfc74f1312 Add incremental_reader_selector
incremental_reader_selector is a specialization of reader_selector for
the case when sstables have narrow and/or disjoint token ranges. To
exploit this it creates new readers on-demand when their sstable's
token range intersects with the current ring position.
2017-08-10 12:38:02 +03:00
Botond Dénes
94fc550e68 sstable_set::incremental_selector: select() now returns a selection
A seletion contains - in addition to the list of sstables - a next_token
which is a hint as to what is the next best token to call select() with.
This should be the smallest token such that at the next call to
select() the least number of new sstables will be returned, without
skipping any.
2017-08-09 16:27:33 +03:00
Glauber Costa
4a911879a3 add active streaming reads metric
In commit f38e4ff3f, we have separated streaming reads from normal reads
for the purpose of determining the maximum number of reads going on.
However, we'll now be totally unaware of how many reads will be
happening on behalf of streaming and that can be important information
when debugging issues.

This patch adds this metric so we don't fly blind.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1501909973-32519-1-git-send-email-glauber@scylladb.com>
2017-08-05 11:06:37 +03:00
Avi Kivity
f38e4ff3f9 database: prevent streaming reads from blocking normal reads
Streaming reads and normal reads share a semaphore, so if a bunch of
streaming reads use all available slots, no normal reads can proceed.

Fix by assigning streaming reads their own semaphore; they will compete
with normal reads once issued, and the I/O scheduler will determine the
winner.

Fixes #2663.
Message-Id: <20170802153107.939-1-avi@scylladb.com>
2017-08-03 10:23:01 +01:00
Avi Kivity
911536960a database: remove streaming read queue length limit
If we fail a streaming read due queue overload, we will fail the entire repair.
Remove the limit for streaming, and trust the caller (repair) to have bounded
concurrency.

Fixes #2659.
Message-Id: <20170802143448.28311-1-avi@scylladb.com>
2017-08-03 10:21:07 +01:00
Duarte Nunes
a85232dd82 Fix compilation errors on GCC 6
GCC 6 inconsistently requires explicitly calling a member function
through "this->" for lambda functions capturing "this".

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170731143755.21970-1-duarte@scylladb.com>
2017-07-31 17:40:44 +03:00
Avi Kivity
3fe6731436 Merge "educe the effect of the latency metrics" from Amnon
"This series reduce that effect in two ways:
1. Remove the latency counters from the system keyspaces
2. Reduce the histogram size by limiting the maximum number of buckets and
   stop the last bucket."

Fixes #2650.

* 'amnon/remove_cf_latency_v2' of github.com:cloudius-systems/seastar-dev:
  database: remove latency from the system table
  estimated histogram: return a smaller histogram
2017-07-31 15:58:30 +03:00
Duarte Nunes
c81431ad16 column_family: Re-acquire flush permit in case of error
If we fail to flush an sstable, after creating the flush_reader, then
we will have released the flush permit when we retry the flush. Ensure
that when retrying, we re-acquire the flush permit.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
Duarte Nunes
9162e016da column_family: Don't hold sstable read lock when retrying flush
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
Duarte Nunes
1a33cc6847 sstables: Release the flush permit before fsyncing
This allows a queued flush to start while we fsync the current
sstable, which helps reduce the overall time new writes are blocked on
dirty memory.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
Duarte Nunes
a2b732c156 dirty_memory_manager: Refactor flush permit lifetime management
This patch refactors how the flush permit lifetime is managed,
dropping the current hash table in favour of a RAII approach.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
Duarte Nunes
f647f5b14a dirty_memory_manager: Invert permit acquisition order
For an upcoming fix it is required to invert the permit acquisition
order: first we acquire the background work permit and then the single
flush permit.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
Duarte Nunes
e371accac8 memtable_list: Register different seal functions for each behaviour
Instead of passing a flush_behaviour to the seal function, use two
different functions for each of the behaviours.

This will be important in the forthcoming patches, which will require
the signatures of those functions to differ.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
Avi Kivity
e855a28fae Revert "Merge "memtable flush: Fixes and improvements" from Duarte"
This reverts commit 733a64a1df, reversing
changes made to e11e66723a.

Breaks sstable_test and perf_fast_forward.
2017-07-31 12:44:28 +03:00
Duarte Nunes
0f1bd81523 column_family: Re-acquire flush permit in case of error
If we fail to flush an sstable, after creating the flush_reader, then
we will have released the flush permit when we retry the flush. Ensure
that when retrying, we re-acquire the flush permit.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
2f4cffc7f6 column_family: Don't hold sstable read lock when retrying flush
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
5e64839e85 sstables: Release the flush permit before fsyncing
This allows a queued flush to start while we fsync the current
sstable, which helps reduce the overall time new writes are blocked on
dirty memory.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
ef1275e9dd dirty_memory_manager: Refactor flush permit lifetime management
This patch refactors how the flush permit lifetime is managed,
dropping the current hash table in favour of a RAII approach.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
cfc8fae33f dirty_memory_manager: Invert permit acquisition order
For an upcoming fix it is required to invert the permit acquisition
order: first we acquire the background work permit and then the single
flush permit.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
7e68e4677d memtable_list: Register different seal functions for each behaviour
Instead of passing a flush_behaviour to the seal function, use two
different functions for each of the behaviours.

This will be important in the forthcoming patches, which will require
the signatures of those functions to differ.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Amnon Heiman
a71b9e498a database: remove latency from the system table
This patch remove the latency histograms from the system table, it also
extend the already existing exclusion to all system keyspaces.

It also uses the new get_histogram API to set a minimal bucket size to
100 microseconds.
2017-07-27 11:41:15 +03:00
Paweł Dziepak
295689d16f db: include counter writes on leader in metrics
Counters write path on leader is completely different than on any other
replica (non-leaders share write path between counters and regular
columns). This patch makes sure that counter writes performed on leader
are added to appropriate metrics.
Message-Id: <20170725153346.31238-1-pdziepak@scylladb.com>
2017-07-25 18:31:43 +02:00
Raphael S. Carvalho
637f3bfa50 db: refresh row cache's underlying data source after compaction
Underlying data source in row cache holds a reference to sstable set
prior to compaction which isn't released until a memtable flush, which
means file descriptors of deleted sstables remains opened, wasting
disk space.
The fix is to refresh underlying data source in row cache.

Fixes #2570.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-24 15:49:11 -03:00
Raphael S. Carvalho
e3ad676433 db: atomically synchronize cache with changes to the snapshot
updates to cache and snapshot (i.e. sstable set) aren't synchronized, so
it may happen that cache update for memtable flush will use wrong snapshot
version, and that violates cache invariant of each partition entry only
reflecting one snapshot.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-24 15:45:05 -03:00
Tomasz Grabiec
714d609605 database: Fix reversed order of keyspace and table names in a log message
Message-Id: <1500649623-25377-1-git-send-email-tgrabiec@scylladb.com>
2017-07-21 17:10:17 +02:00
Tomasz Grabiec
408cea66cd database: Allow disabling auto snapshots during drop/truncate
Message-Id: <1500573920-31478-1-git-send-email-tgrabiec@scylladb.com>
2017-07-21 16:56:29 +02:00
Avi Kivity
c5ee62a6a4 Merge "restrict background writers with scheduling groups" from Glauber
"This patchset restricts background writers - such as compactions,
streaming flushes and memtable flushes to a maximum amount of CPU usage
through a seastar::thread_scheduling_group.

The said maximum is recommended to be set  50 % - it is default
disabled, but can be adjusted through a configuration option until we
are able to auto-tune this.

The second patch in this series provides a preview on how such auto-tune
would look like. By implementing a simple controller we automatically
adjust the quota for the memtable writer processes, so that the rate at
which bytes come in is equal to the rates at which bytes are flushed.

Tail latencies are greatly reduced by this series, and heavy spikes that
previously appeared on CPU-bound workloads are no more."

* 'memtable-controller-v5' of https://github.com/glommer/scylla:
  simple controller for memtable/streaming writer shares.
  restrict background writers to 50 % of CPU.
2017-07-20 10:58:53 +03:00
Calle Wilund
247c36e048 system_schema: Fix remaining places not handing two system keyspaces
Some places remained where code looked directly at
system_keyspace::NAME to determine iff a ks is
considered special/system/protected. Including
schema digest calculation.

Export "is_system_keyspace" and use accordingly.

Message-Id: <1500469809-23546-1-git-send-email-calle@scylladb.com>
2017-07-19 16:18:45 +03:00
Glauber Costa
c9a529ebee simple controller for memtable/streaming writer shares.
This patch introduces a simple controller that will adjust memtables CPU
shares, trying to keep it around the soft limit: if we start going below
it means we're too fast (unless we are idle) and shares are adjusted
downwards. If we start going above it means we're too fast and shares
are adjusted upwards.

I have tested this extensively in a single-CPU setup with various
CPU-bound workloads while tracking virtual dirty and the results are
good, with virtual dirty fluctuating only slightly, somewhere within the
desired range.

Exceptions to this are:
1) when the load is very light - the idle system goes faster, and that's
   ok
2) when the load is very high - as foreground requests dominate we can't
   flush fast enough and hit the hard limit. However, in such scenarios
   the memtable shares do hit its maximum, and the results are no worse
   than they are right now and this will only be fixed by CPU-limiting the
   actual requests.

This feature can be disabled with a config option - that is scheduled to
go away as we acquire more confidence in this. When the feature is
disabled, all background writers (streaming, compaction, memtables) will
share the same scheduling group, with static quotas.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-07-18 23:35:47 -04:00
Glauber Costa
4f01ec0910 restrict background writers to 50 % of CPU.
In scylla, we have foreground processes, which are latency sensitive and
need to be responded to as fast as possible in order to maintain good
latency profiles, and background process, which are less so.

The most important background processes we have during normal write
workload operations are memtable writes and sstable compactions. Those
processes are quite CPU-intensive, and left unchecked will easily
dominate the CPU. Lower values of task-quota usually help, as it will
force those processes to preempt more, but aren't enough to guarantee
good isolation. We have seen boxes with good NVMe storage having their
throughput reduced to less than half of the original baseline in a short
dive down for the duration of a compaction.

In the long run, our goal is to leverage the CPU scheduler to make sure
that those processes are balanced with respect to all the others.
However, the current state of affairs is causing grievances as this very
moment. Thankfully, those processes live in a seastar::thread, that
ships with its own rudimentary bandwidth control mechanism: the
scheduling group.

The goal of this patch is to wrap background processes together in a
scheduling group, and assign to such group 50 % of our CPU power; the
remainder being left to foreground processes.

While we pride ourselves in dynamically adjusting things to the
workload, we won't be able to do this properly before the CPU scheduler
lands - and let's face it, leaving background processes run wild is not
adaptative either. Every workload would benefit most from a different
value for such shares, but 50 % is as fair as it gets if we really need
static partitining in the mean time.

As a defense against unforeseen consequences, we'll leave the actual
value as an option, but will do our best to hide it - as this is not a
tunable that we want to be part of a normal Scylla setup. The most
convenient place for this tunable is still db::config, so we can easily
pass it down to the database layer - but we will not document it in the
yaml, and will clearly note in the help string that it is not supposed
to be tuned.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-07-18 23:35:33 -04:00
Duarte Nunes
2c711922cc database: Drop mutations that raced with truncate
Mutations that race with a truncate can just be dropped.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-16 00:08:05 +02:00
Duarte Nunes
0825c9c805 database: Rename replay_position_reordered_exception
Rename replay_position_reordered_exception to
mutation_reordered_with_truncate_exception for more precision, since
this is the only situation where this exception can be thrown.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-16 00:08:05 +02:00
Duarte Nunes
5f24e9a4a5 memtable: Stop tracking the highest flushed rp
Since we no longer enforce that mutations are applied in memory
ordered by their replay_positions, the way the highest_flush_rp is
being tracked is no longer correct.

The invariant it was used to maintain no longer exists, so we can get
rid of it together with the assertion on the highest_flush_rp on
flush().

Fixes #2074

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:56:06 +02:00
Duarte Nunes
003941cd95 column_family: Stop using flush_queue
Since commitlog ordering requirements have been relaxed, we now keep
the set of replay_positions seen by a memtable in a set, which we then
use to clean up relevant segments in the commitlog. This means that
the guarantees provided by the flush_queue are no longer necessary.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:56:00 +02:00
Duarte Nunes
7e6fe5895e column_family: Don't bother closing the flush_queue on stop()
When stopping a column family we issue a flush(), for which we wait.
Since writes are supposed to have stopped coming in, and also new
flush requests, there's no need to call and wait for the flush_queue
to be closed.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:51:58 +02:00
Duarte Nunes
a1f4536ffb column_family: Don't rely on flush_queue to guarantee flushes finished
We now don't ensure mutations are applied in memory following the
order of their replay positions, so we can't rely on the replay
position to order memtable flushes. So, use a phased_barrier() to
ensure that calling flush() returns a future that completes when all
flushes up to that point have finished.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:51:58 +02:00
Duarte Nunes
1b320496e2 dirty_memory_manager: Remove unnecessary check from flush_one()
We don't need to check whether a memtable is empty in flush_one(), as
that must be checked later, during the actual sealing.

The condition itself is rare and is checked already after the potentially
contented semaphore has been acquired.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:51:57 +02:00
Duarte Nunes
59bdaed02b column_family: More precise count of pending flushes
This patch ensures we update the count of pending flushes in the same
place as we update the stats across column families, which is more
correct since it only accounts for actual flushes and not those of
empty memtables or that have been coalesced together.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:51:25 +02:00
Duarte Nunes
3e27c335a9 column_family: Fix typo in pending_tasks metric name
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:51:25 +02:00