Commit Graph

857 Commits

Author SHA1 Message Date
Duarte Nunes
a85232dd82 Fix compilation errors on GCC 6
GCC 6 inconsistently requires explicitly calling a member function
through "this->" for lambda functions capturing "this".

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170731143755.21970-1-duarte@scylladb.com>
2017-07-31 17:40:44 +03:00
Avi Kivity
3fe6731436 Merge "educe the effect of the latency metrics" from Amnon
"This series reduce that effect in two ways:
1. Remove the latency counters from the system keyspaces
2. Reduce the histogram size by limiting the maximum number of buckets and
   stop the last bucket."

Fixes #2650.

* 'amnon/remove_cf_latency_v2' of github.com:cloudius-systems/seastar-dev:
  database: remove latency from the system table
  estimated histogram: return a smaller histogram
2017-07-31 15:58:30 +03:00
Duarte Nunes
c81431ad16 column_family: Re-acquire flush permit in case of error
If we fail to flush an sstable, after creating the flush_reader, then
we will have released the flush permit when we retry the flush. Ensure
that when retrying, we re-acquire the flush permit.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
Duarte Nunes
9162e016da column_family: Don't hold sstable read lock when retrying flush
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
Duarte Nunes
1a33cc6847 sstables: Release the flush permit before fsyncing
This allows a queued flush to start while we fsync the current
sstable, which helps reduce the overall time new writes are blocked on
dirty memory.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
Duarte Nunes
a2b732c156 dirty_memory_manager: Refactor flush permit lifetime management
This patch refactors how the flush permit lifetime is managed,
dropping the current hash table in favour of a RAII approach.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
Duarte Nunes
f647f5b14a dirty_memory_manager: Invert permit acquisition order
For an upcoming fix it is required to invert the permit acquisition
order: first we acquire the background work permit and then the single
flush permit.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
Duarte Nunes
e371accac8 memtable_list: Register different seal functions for each behaviour
Instead of passing a flush_behaviour to the seal function, use two
different functions for each of the behaviours.

This will be important in the forthcoming patches, which will require
the signatures of those functions to differ.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-31 12:40:19 +02:00
Avi Kivity
e855a28fae Revert "Merge "memtable flush: Fixes and improvements" from Duarte"
This reverts commit 733a64a1df, reversing
changes made to e11e66723a.

Breaks sstable_test and perf_fast_forward.
2017-07-31 12:44:28 +03:00
Duarte Nunes
0f1bd81523 column_family: Re-acquire flush permit in case of error
If we fail to flush an sstable, after creating the flush_reader, then
we will have released the flush permit when we retry the flush. Ensure
that when retrying, we re-acquire the flush permit.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
2f4cffc7f6 column_family: Don't hold sstable read lock when retrying flush
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
5e64839e85 sstables: Release the flush permit before fsyncing
This allows a queued flush to start while we fsync the current
sstable, which helps reduce the overall time new writes are blocked on
dirty memory.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
ef1275e9dd dirty_memory_manager: Refactor flush permit lifetime management
This patch refactors how the flush permit lifetime is managed,
dropping the current hash table in favour of a RAII approach.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
cfc8fae33f dirty_memory_manager: Invert permit acquisition order
For an upcoming fix it is required to invert the permit acquisition
order: first we acquire the background work permit and then the single
flush permit.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Duarte Nunes
7e68e4677d memtable_list: Register different seal functions for each behaviour
Instead of passing a flush_behaviour to the seal function, use two
different functions for each of the behaviours.

This will be important in the forthcoming patches, which will require
the signatures of those functions to differ.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-27 21:09:18 +02:00
Amnon Heiman
a71b9e498a database: remove latency from the system table
This patch remove the latency histograms from the system table, it also
extend the already existing exclusion to all system keyspaces.

It also uses the new get_histogram API to set a minimal bucket size to
100 microseconds.
2017-07-27 11:41:15 +03:00
Paweł Dziepak
295689d16f db: include counter writes on leader in metrics
Counters write path on leader is completely different than on any other
replica (non-leaders share write path between counters and regular
columns). This patch makes sure that counter writes performed on leader
are added to appropriate metrics.
Message-Id: <20170725153346.31238-1-pdziepak@scylladb.com>
2017-07-25 18:31:43 +02:00
Raphael S. Carvalho
637f3bfa50 db: refresh row cache's underlying data source after compaction
Underlying data source in row cache holds a reference to sstable set
prior to compaction which isn't released until a memtable flush, which
means file descriptors of deleted sstables remains opened, wasting
disk space.
The fix is to refresh underlying data source in row cache.

Fixes #2570.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-24 15:49:11 -03:00
Raphael S. Carvalho
e3ad676433 db: atomically synchronize cache with changes to the snapshot
updates to cache and snapshot (i.e. sstable set) aren't synchronized, so
it may happen that cache update for memtable flush will use wrong snapshot
version, and that violates cache invariant of each partition entry only
reflecting one snapshot.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-24 15:45:05 -03:00
Tomasz Grabiec
714d609605 database: Fix reversed order of keyspace and table names in a log message
Message-Id: <1500649623-25377-1-git-send-email-tgrabiec@scylladb.com>
2017-07-21 17:10:17 +02:00
Tomasz Grabiec
408cea66cd database: Allow disabling auto snapshots during drop/truncate
Message-Id: <1500573920-31478-1-git-send-email-tgrabiec@scylladb.com>
2017-07-21 16:56:29 +02:00
Avi Kivity
c5ee62a6a4 Merge "restrict background writers with scheduling groups" from Glauber
"This patchset restricts background writers - such as compactions,
streaming flushes and memtable flushes to a maximum amount of CPU usage
through a seastar::thread_scheduling_group.

The said maximum is recommended to be set  50 % - it is default
disabled, but can be adjusted through a configuration option until we
are able to auto-tune this.

The second patch in this series provides a preview on how such auto-tune
would look like. By implementing a simple controller we automatically
adjust the quota for the memtable writer processes, so that the rate at
which bytes come in is equal to the rates at which bytes are flushed.

Tail latencies are greatly reduced by this series, and heavy spikes that
previously appeared on CPU-bound workloads are no more."

* 'memtable-controller-v5' of https://github.com/glommer/scylla:
  simple controller for memtable/streaming writer shares.
  restrict background writers to 50 % of CPU.
2017-07-20 10:58:53 +03:00
Calle Wilund
247c36e048 system_schema: Fix remaining places not handing two system keyspaces
Some places remained where code looked directly at
system_keyspace::NAME to determine iff a ks is
considered special/system/protected. Including
schema digest calculation.

Export "is_system_keyspace" and use accordingly.

Message-Id: <1500469809-23546-1-git-send-email-calle@scylladb.com>
2017-07-19 16:18:45 +03:00
Glauber Costa
c9a529ebee simple controller for memtable/streaming writer shares.
This patch introduces a simple controller that will adjust memtables CPU
shares, trying to keep it around the soft limit: if we start going below
it means we're too fast (unless we are idle) and shares are adjusted
downwards. If we start going above it means we're too fast and shares
are adjusted upwards.

I have tested this extensively in a single-CPU setup with various
CPU-bound workloads while tracking virtual dirty and the results are
good, with virtual dirty fluctuating only slightly, somewhere within the
desired range.

Exceptions to this are:
1) when the load is very light - the idle system goes faster, and that's
   ok
2) when the load is very high - as foreground requests dominate we can't
   flush fast enough and hit the hard limit. However, in such scenarios
   the memtable shares do hit its maximum, and the results are no worse
   than they are right now and this will only be fixed by CPU-limiting the
   actual requests.

This feature can be disabled with a config option - that is scheduled to
go away as we acquire more confidence in this. When the feature is
disabled, all background writers (streaming, compaction, memtables) will
share the same scheduling group, with static quotas.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-07-18 23:35:47 -04:00
Glauber Costa
4f01ec0910 restrict background writers to 50 % of CPU.
In scylla, we have foreground processes, which are latency sensitive and
need to be responded to as fast as possible in order to maintain good
latency profiles, and background process, which are less so.

The most important background processes we have during normal write
workload operations are memtable writes and sstable compactions. Those
processes are quite CPU-intensive, and left unchecked will easily
dominate the CPU. Lower values of task-quota usually help, as it will
force those processes to preempt more, but aren't enough to guarantee
good isolation. We have seen boxes with good NVMe storage having their
throughput reduced to less than half of the original baseline in a short
dive down for the duration of a compaction.

In the long run, our goal is to leverage the CPU scheduler to make sure
that those processes are balanced with respect to all the others.
However, the current state of affairs is causing grievances as this very
moment. Thankfully, those processes live in a seastar::thread, that
ships with its own rudimentary bandwidth control mechanism: the
scheduling group.

The goal of this patch is to wrap background processes together in a
scheduling group, and assign to such group 50 % of our CPU power; the
remainder being left to foreground processes.

While we pride ourselves in dynamically adjusting things to the
workload, we won't be able to do this properly before the CPU scheduler
lands - and let's face it, leaving background processes run wild is not
adaptative either. Every workload would benefit most from a different
value for such shares, but 50 % is as fair as it gets if we really need
static partitining in the mean time.

As a defense against unforeseen consequences, we'll leave the actual
value as an option, but will do our best to hide it - as this is not a
tunable that we want to be part of a normal Scylla setup. The most
convenient place for this tunable is still db::config, so we can easily
pass it down to the database layer - but we will not document it in the
yaml, and will clearly note in the help string that it is not supposed
to be tuned.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-07-18 23:35:33 -04:00
Duarte Nunes
2c711922cc database: Drop mutations that raced with truncate
Mutations that race with a truncate can just be dropped.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-16 00:08:05 +02:00
Duarte Nunes
0825c9c805 database: Rename replay_position_reordered_exception
Rename replay_position_reordered_exception to
mutation_reordered_with_truncate_exception for more precision, since
this is the only situation where this exception can be thrown.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-16 00:08:05 +02:00
Duarte Nunes
5f24e9a4a5 memtable: Stop tracking the highest flushed rp
Since we no longer enforce that mutations are applied in memory
ordered by their replay_positions, the way the highest_flush_rp is
being tracked is no longer correct.

The invariant it was used to maintain no longer exists, so we can get
rid of it together with the assertion on the highest_flush_rp on
flush().

Fixes #2074

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:56:06 +02:00
Duarte Nunes
003941cd95 column_family: Stop using flush_queue
Since commitlog ordering requirements have been relaxed, we now keep
the set of replay_positions seen by a memtable in a set, which we then
use to clean up relevant segments in the commitlog. This means that
the guarantees provided by the flush_queue are no longer necessary.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:56:00 +02:00
Duarte Nunes
7e6fe5895e column_family: Don't bother closing the flush_queue on stop()
When stopping a column family we issue a flush(), for which we wait.
Since writes are supposed to have stopped coming in, and also new
flush requests, there's no need to call and wait for the flush_queue
to be closed.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:51:58 +02:00
Duarte Nunes
a1f4536ffb column_family: Don't rely on flush_queue to guarantee flushes finished
We now don't ensure mutations are applied in memory following the
order of their replay positions, so we can't rely on the replay
position to order memtable flushes. So, use a phased_barrier() to
ensure that calling flush() returns a future that completes when all
flushes up to that point have finished.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:51:58 +02:00
Duarte Nunes
1b320496e2 dirty_memory_manager: Remove unnecessary check from flush_one()
We don't need to check whether a memtable is empty in flush_one(), as
that must be checked later, during the actual sealing.

The condition itself is rare and is checked already after the potentially
contented semaphore has been acquired.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:51:57 +02:00
Duarte Nunes
59bdaed02b column_family: More precise count of pending flushes
This patch ensures we update the count of pending flushes in the same
place as we update the stats across column families, which is more
correct since it only accounts for actual flushes and not those of
empty memtables or that have been coalesced together.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:51:25 +02:00
Duarte Nunes
3e27c335a9 column_family: Fix typo in pending_tasks metric name
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:51:25 +02:00
Duarte Nunes
a11724c6e1 column_family: More precise count of switched memtables
The memtable_switch_count metric is supposed to count the number of
times a flush has resulted in the memtable being switched out, but we
were incrementing the count regardless of whether we tried to flush an
empty memtable or two or more flushes were coalesced into one. This
patch fixes this by moving the metric to where the memtable is
actually switched.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:51:25 +02:00
Duarte Nunes
bca1b19ce9 commitlog: Always flush latest memtable
We now don't ensure mutations are applied in memory following the
order of their replay positions, so we can't rely on the replay
position to order memtable flushes. When flushing commit log segments,
ensure we flush the latest memtable.

Refs #2074

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-07-13 22:51:25 +02:00
Duarte Nunes
3df6777b9b database: Load views after loading tables
Since base tables no longer look for their views, we need to parse
base tables first so that when we add a view we can fetch and connect
it to its base table.

When announcing view table mutations to other nodes we always include
the base table mutations, so there's no need to expect a view being
added before its base table.

Found out while testing view building.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170712172115.2960-1-duarte@scylladb.com>
2017-07-13 11:14:02 +02:00
Duarte Nunes
136accdbf6 database: Fix typos in metric descriptions
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170709145522.19534-1-duarte@scylladb.com>
2017-07-09 18:35:17 +03:00
Botond Dénes
b1082641f9 Make sure keyspace strategy class is stored in qualified form
Even when it's provided in unqualified (short) form.
Fixes #767

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4379f8864843e64c097d432fd06129ce4025f100.1499322476.git.bdenes@scylladb.com>
2017-07-06 14:50:00 +03:00
Raphael S. Carvalho
972a0237ef database: restore indentation for cleanup_sstables
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170630035324.19881-2-raphaelsc@scylladb.com>
2017-07-03 12:48:54 +03:00
Raphael S. Carvalho
b9d0645199 database: fix potential use-after-free in sstable cleanup
when do_for_each is in its last iteration and with_semaphore defers
because there's an ongoing cleanup, sstable object will be used after
freed because it was taken by ref and the container it lives in was
destroyed prematurely.

Let's fix it with a do_with, also making code nicer.

Fixes #2537.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170630035324.19881-1-raphaelsc@scylladb.com>
2017-07-03 12:48:53 +03:00
Avi Kivity
fc966c0c4c Merge "tombstone removal compaction" from Raphael
"This feature is intended to make compaction more efficient at getting rid of
droppable tombstone and expired data wasting disk space. So far, people have
been dealing with it manually through major compaction.

With strategies other than date tiered, large sstables will be left untouched
for a long time even though it's all expired. Date tiered suffers from it when
mixing data with different TTL because it only includes for compaction sstable
that is fully expired.

sstables keeps as metadata a histogram which allows us to easily estimate
droppable data ratio from gc_before. sstables which droppable data ratio is
above 20% (default value for tombstone_threshold option) will be considered
candidates for the operation.

Like in C*, we will only do tombstone removal compaction when there's nothing
to compact in standard way. It would be interesting to trigger it too when
disk usage is above a given threshold, but I decided to leave this for later.

Fixes #2306."

* 'tombstone_removal_compaction_v4' of github.com:raphaelsc/scylla:
  tests: more testing for tombstone compaction options
  tests: basic tombstone compaction test for date tiered
  compaction/dtcs: add support for tombstone compaction
  tests: basic test of tombstone compaction with lcs
  compaction/lcs: add support for tombstone compaction
  tests: basic tombstone compaction test for size tiered
  compaction/stcs: add support for tombstone compaction
  tests: add test for estimation of droppable tombstone ratio
  sstables: introduce function to estimate droppable tombstone ratio
  compaction_manager: periodically submit cfs for compaction
  streaming_histogram: fix coding style
  tests: add streaming_histogram_test
  streaming_histogram: implement sum
  tests: add test for sstable with bad tombstone histogram
  sstables: discard bad streaming histogram for future use
  tests: add sstable tombstone histogram test
  streaming_histogram: fix update
  streaming_histogram: move it to utils
  streaming_histogram: do not limit it to be used by sstables
  sstables: update tombstone_histogram for cells with expiration time
2017-06-29 10:19:59 +03:00
Raphael S. Carvalho
a3a73899bc database: remove outdated FIXME comments
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170621002253.29660-1-raphaelsc@scylladb.com>
2017-06-28 11:06:02 +02:00
Raphael S. Carvalho
fb9bc609c6 streaming_histogram: do not limit it to be used by sstables
streaming histogram will later be placed in /utils, so we want
it to use std::unordered_map<> instead of disk_hash<>.
That also requires implementing serialization/deserialization
functions for it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-06-27 16:51:52 -03:00
Nadav Har'El
6cf44f6817 Optimize column_family::make_sstable_reader() for one partition
This patch does the same thing to column_family::make_sstable_reader() as
commit 186f031 did to sstable::as_mutation_source().

Although usually one can fast_forward_to() on the result of a
column_family::make_sstable_reader(), earlier we had an optimization
where if a single partition was specified, it was read exactly,
and fast_forward_to() was *NOT* allowed.

With the mutation_reader::forwarding flag patch, when this flag
was on - requesting fast_forward_to() - we disabled this optimization.
This makes sense, but is not backward compatible with the code which
previously assumes this optimization exists. In particular,
column_family::data_query() does a single partition read but does not
specify forwarding::no explicitly.
So this patch returns this optimization, despite this meaning that we
blatently ignore the fwd_mr flag in that case.

Fixes #2524.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170626141121.30322-1-nyh@scylladb.com>
2017-06-26 17:13:03 +03:00
Avi Kivity
9b21a9bfb6 Merge "Implement partial cache" from Tomasz and Piotr
"This series enables cache to keep partial partitions.
Reads no longer have to read whole partition from sstables
in order to cache the result.

The 10MB threshold for partition size in cache is lifted.

Known issues:

 - There is no partial eviction yet, whole partitions are still evicted,
   and partition snapshots held by active reads are not evictable at all
 - Information about range continuity is not recorded if that
   would require inserting a dummy entry, or if previous entry
   doesn't belong to the latest snapshot
 - Cache update after memtable flush happening concurrently with reads
   may inhibit that reads' ability to populate cache (new issue)
 - Cache update from flushed memtables has partition granularity,
   so may cause latency problems with large partition
 - Schema is still tracked per-partition, so after schema changes
   reads may induce high latency due to whole partition needing
   to be converted atomically
 - Range tombstones are repeated in the stream for every range between
   cache entries they cover (new issue)
 - Populating scans for both small and large partitions (perf_fast_forward)
   experienced a 40% reduction of throughput, CPU bound

How was this tested:

 - test.py --mode release
 - row_cache_stress_test -c1 -m1G
 - perf_fast_forward, passes except for the test case checking range continuity population
   which would require inserting a dummy entry (mentioned above)
 - perf_simple_query (-c1 -m1G --duration 32):
     before: 90k [ops/s] stdev: 4k [ops/s]
     after:  94k [ops/s] stdev: 2k [ops/s]"

* tag 'tgrabiec/introduce-partial-cache-v8' of github.com:cloudius-systems/seastar-dev: (130 commits)
  tests: row_cache: Add test_tombstone_merging_in_partial_partition test case
  tests: Introduce row_cache_stress_test
  utils: Add helpers for dealing with nonwrapping_range<int>
  tests: simple_schema: Allow passing the tombstone to make_range_tombstone()
  tests: simple_schema: Accept value by reference
  tests: simple_schema: Make add_row() accept optional timestamp
  tests: simple_schema: Make new_timestamp() public
  tests: simple_schema: Introduce make_ckeys()
  tests: simple_schema: Introduce get_value(const clustered_row&) helper
  tests: simple_schema: Fix comment
  tests: simple_schema: Add missing include
  row_cache: Introduce evict()
  tests: Add cache_streamed_mutation_test
  tests: mutation_assertions: Allow expecting fragments
  mutation_fragment: Implement equality check
  tests: row_cache: Add test for population of random partitions
  tests: row_cache: Add test for partition tombstone population
  tests: row_cache: Test reading randomly populated partition
  tests: row_cache: Add test_single_partition_update()
  tests: row_cache: Add test_scan_with_partial_partitions
  ...
2017-06-26 14:54:37 +03:00
Avi Kivity
555621b537 Disentable memtables from sstables
Remove sstable::write_components(memtable), replacing it with a helper.

Fixes #2354
Message-Id: <20170624142639.16662-1-avi@scylladb.com>
2017-06-26 09:37:11 +02:00
Tomasz Grabiec
1828e28bbb database: Invalidate cache atomically with attaching streaming sstables
Not doing so may cause reads to see partial writes, if another
update+read happens in between.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
896196b841 database: Invalidate cache from seal_active_streaming_memtable_immediate()
Cache must be synchronized atomically with changing the underlying
mutation source, otherwise write atomicity may not hold.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
8ba6366610 row_cache: Switch to using snapshot_source
Currently every time cache needs to create reader for missing data it
obtains a reader which is most up to date. That reader includes writes
from later populate phases, for which update() was not yet
called. This will be problematic once we allow partitions to be
partially populated, because different parts of the partition could be
partially populated using readers using different sets of writes, and break
write atomicity.

The solution will be to always populate given partition using the same
set of writes, using reader created from the current snapshot. The
snapshot changes only on update(), with update() gradually converting
each partition to the new snapshot.
2017-06-24 18:06:11 +02:00