Commit Graph

88 Commits

Author SHA1 Message Date
Raphael S. Carvalho
f93912f344 Revert "Revert "streaming: Do not invalidate cache if no sstable is added in flush_streaming_mutations""
With #4446 fixed, this commit can be reverted.

This reverts commit 454e7e0109.
2020-02-20 10:55:50 -03:00
Raphael S. Carvalho
fb81f2aa7c table: Fix stale data being returned due to lack of cache invalidation
Row cache needs to be invalidated whenever data in sstables changes. Cleanup removes
data from sstables which doesn't belong to the node anymore, which means cache must
be invalidated on cleanup.
Currently, stale data can be returned when a node re-owns ranges which data are still
stored in the node's row cache, because cleanup didn't invalidate the cache.

To prevent data that belongs to the node from being purged from the row cache, cleanup
will only invalidate the cache with a set of token ranges that will not overlap with
any of ranges owned by the node.

update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_decommission_node_2_test
now passes.

Fixes #4446.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-02-20 10:55:50 -03:00
Raphael S. Carvalho
65b4fc8bcd sstables/compaction: Introduce compaction_completion_desc
This descriptor contain all information needed for table to be properly
updated on compaction completion. A new member will be added to it soon,
which will store ranges to be invalidated in row cache on behalf of
cleanup compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-02-19 19:29:32 -03:00
Avi Kivity
454e7e0109 Revert "streaming: Do not invalidate cache if no sstable is added in flush_streaming_mutations"
This reverts commit 5e9925b9f0. It causes
data resurrection in simple_decommission_node_2_test.

Fixes #5838.
2020-02-18 20:13:10 +02:00
Avi Kivity
6c7aa18238 Merge "Introduce schema::get_partitioner" from Piotr
"
Introduce schema::get_partitioner and use it instead of dht::global_partitioner.

Fixes #5493

Tests: unit(dev, release, debug)
"

* 'per_table_partitioner_prep' of https://github.com/haaawk/scylla: (35 commits)
  cdc: stop using partitioners
  partitioner_test: stop calling set_global_partitioner
  storage_service: stop calling global_partitioner()
  mutation_writer_test: stop calling global_partitioner()
  schema: reduce number of global_partitioner() calls
  test_services: stop calling global_partitioner()
  sstable_utils: stop calling global_partitioner()
  sstable_resharding_test: stop depending on global partitioner
  sstable_mutation_test: stop calling global_partitioner()
  sstable_data_file_test: stop calling global_partitioner()
  random_schema: stop taking partitioner in constructor
  mutation_reader_test: stop calling global_partitioner()
  multishard_mutation_query_test: stop calling global_partitioner()
  row_level repair: stop calling global_partitioner()
  distribute_reader_and_consume_on_shards: don't take partitioner
  thrift: reduce global_partitioner() calls
  binary_search: stop calling global_partitioner()
  index_entry: stop calling global_partitioner()
  mc writer: stop calling global_partitioner()
  sstable: stop calling global_partitioner()
  ...
2020-02-17 18:12:53 +02:00
Tomasz Grabiec
76d1dd7ec6 Merge "nodetool scrub: implement validation and the skip-corrupted flag
" from Botond

Nodetool scrub rewrites all sstables, validating their data. If corrupt
data is found the scrub is aborted. If the skip-corrupted flag is set,
corrupt data is instead logged (just the keys) and skipped.

The scrubbing algorithm itself is fairly simple, especially that we
already have a mutation stream validator that we can use to validate the
data. However currently scrub is piggy-backed on top of cleanup
compaction. To implement this flag, we have to make scrub a separate
compaction type and propagate down the flag. This required some
massaging of the code:
* Add support for more than two (cleanup or not) compaction types.
* Allow passing custom options for each compaction type.
* Allow stopping a compaction without the manager retrying it later.

Additionally the validator itself needed some changes to allow different
ways to handle errors, as needed by the scrub.

Fixes: #5487

* https://github.com/denesb/nodetool-scrub-skip-corrupted/v7:
  table: cleanup_sstables(): only short-circuit on actual cleanup
  compaction: compaction_type: add Upgrade
  compaction: introduce compaction_options
  compaction: compaction_descriptor: use compaction options instead of
    cleanup flag
  compaction_manager: collect all cleanup related logic in
    perform_cleanup()
  sstables: compaction_stop_exception: add retry flag
  mutation_fragment_stream_validator: split into low-level and
    high-level API
  compaction: introduce scrub_compaction
  compaction_manager: scrub: don't piggy-back on upgrade_sstables()
  test: sstable_datafile_test: add scrub unit test
2020-02-17 15:28:07 +02:00
Piotr Jastrzebski
2d7532f87f dht: add dht::get_token
and replace all calls to dht::global_partitioner().get_token

dht::get_token is better because it takes schema and uses it
to obtain partitioner instead of using a global partitioner.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:59:15 +01:00
Piotr Jastrzebski
ca4a89d239 dht: add dht::decorate_key
and replace all dht::global_partitioner().decorate_key
with dht::decorate_key

It is an improvement because dht::decorate_key takes schema
and uses it to obtain partitioner instead of using global
partitioner as it was before.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:59:06 +01:00
Piotr Jastrzebski
abd76e566f dht::shard_of: stop calling global_partitioner()
Take const schema& as a parameter of shard_of and
use it to obtain partitioner instead of calling
global_partitioner().

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-02-17 10:23:16 +01:00
Asias He
5e9925b9f0 streaming: Do not invalidate cache if no sstable is added in flush_streaming_mutations
The table::flush_streaming_mutations is used in the days when streaming
data goes to memtable. After switching to the new streaming, data goes
to sstables directly in streaming, so the sstables generated in
table::flush_streaming_mutations will be empty.

It is unnecessary to invalidate the cache if no sstables are added. To
avoid unnecessary cache invalidating which pokes hole in the cache, skip
calling _cache.invalidate() if the sstables is empty.

The steps are:

- STREAM_MUTATION_DONE verb is sent when streaming is done with old or
  new streaming
- table::flush_streaming_mutations is called in the verb handler
- cache is invalidated for the streaming ranges

In summary, this patch will avoid a lot of cache invalidation for
streaming.

Backports: 3.0 3.1 3.2
Fixes: #5769
2020-02-16 11:22:30 +02:00
Pavel Emelyanov
b11cf6e950 cql3/query_processor.hh: Debloat from other headers
This gives ~30% less (251 jobs -> 181 jobs) recompile when touching it

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200212225828.3374-1-xemul@scylladb.com>
2020-02-16 11:22:30 +02:00
Botond Dénes
8014c7124d compaction_manager: collect all cleanup related logic in perform_cleanup()
Currently the call chain for a cleanup collection looks like this:
compaction_manager::perform_cleanup()
    compaction_manager::rewrite_sstables()
        table::cleanup_sstables()
            ...

`perform_cleanup()` is essentially empty, immediately deferring to
`rewrite_sstables()`. Cleanup related logic is scattered between the
latter two methods on the call chain. These methods however recently
started serving as generic methods for compactions that want to
rewrite each sstable one-by-one, collecting cleanup related ifs in
various places.
The reason is historic, we first had cleanup, then bolted others on top,
trying to share the underlying code as much as possible.

It is time this is cleaned up (pun intended). Make `perform_cleanup()`
the place where all cleanup related logic is, with the rest of the stack
made truly generic.
2020-02-11 17:47:44 +02:00
Botond Dénes
b2dc5d4895 compaction: compaction_descriptor: use compaction options instead of cleanup flag
Instead of the restrictive `cleanup` boolean flag, which allows for choosing
between only two compaction types, use `compaction_options`, which in
addition to allowing any number of compaction types to be selected,
also allows seamlessly passing specific options to them.
2020-02-11 17:47:44 +02:00
Botond Dénes
0b53ccaecd table: cleanup_sstables(): only short-circuit on actual cleanup
Currently the cleanup call is short circuited if it is determined that
cleanup is not needed for the sstable to-be-cleaned-up. This is
undesired because actually not just cleanup uses this routine to rewrite
sstables, sstable-upgrade and sstable-scrub also uses it, and they want
to go on with the cleanup compaction sstables even if all data in it
belongs to the current node.

Fix: #5699
2020-02-11 17:47:44 +02:00
Eliran Sinvani
8cfc2aad57 internalize storage proxy statistics metric registration
The storage proxy statistics structure did not contain
a method for registering the statistics for metric
groups, instead, each user had to register some
of the metrics by itself. There is no real reason
for separating the metrics registration from
the statistics data. There is even less justification
for doing this only for part of the stats as is
the case for those statistics.
This commit internalize the metrics registration
in the storage_proxy stats structures.

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
2020-01-30 15:01:40 +01:00
Botond Dénes
dfc8b2fc45 treewide: replace reader_resource_tracer with reader_permit
The former was never really more than a reader_permit with one
additional method. Currently using it doesn't even save one from any
includes. Now that readers will be using reader_permit we would have to
pass down both to mutation_source. Instead get rid of
reader_resource_tracker and just use reader_permit. Instead of making it
a last and optional parameter that is easy to ignore, make it a
first class parameter, right after schema, to signify that permits are
now a prominent part of the reader API.

This -- mostly mechanical -- patch essentially refactors mutation_source
to ask for the reader_permit instead of reader_resource_tracking and
updates all usage sites.
2020-01-28 08:13:16 +02:00
Amnon Heiman
028525daeb database: add schema.cql file when creating a snapshot
When creating a snapshot we need to add a schema.cql file in the
snapshot directory that describes the table in that snapshot.

This patch adds the file using the schema describe method.

get_snapshot_details and manifest_json_filter were modified to ignore
the schema.cql file.

Fixes #4192

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-01-15 15:06:00 +02:00
Benny Halevy
718e9eb341 table: move_sstables_from_staging: fix use after free of shared_sstable
Introduced in 4b3243f5b9

Reproducible with materialized_views_test:TestMaterializedViews.mv_populating_from_existing_data_during_node_remove_test
and read_amplification_test:ReadAmplificationTest.no_read_amplification_on_repair_with_mv_test

==955382==ERROR: AddressSanitizer: heap-use-after-free on address 0x60200023de18 at pc 0x00000051d788 bp 0x7f8a0563fcc0 sp 0x7f8a0563fcb0
READ of size 8 at 0x60200023de18 thread T1 (reactor-1)
    #0 0x51d787 in seastar::lw_shared_ptr<sstables::sstable>::lw_shared_ptr(seastar::lw_shared_ptr<sstables::sstable> const&) /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/shared_ptr.hh:289
    #1 0x10ba189 in apply<table::move_sstables_from_staging(std::vector<seastar::lw_shared_ptr<sstables::sstable> >)::<lambda()>::<lambda(std::set<seastar::basic_sstring<char, unsigned int, 15> >&)>::<lambda(sstables::shared_sstable)>&, const seastar::lw_shared_ptr<sstables::sstabl
e>&> /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future.hh:1530
    #2 0x109c4f1 in apply<table::move_sstables_from_staging(std::vector<seastar::lw_shared_ptr<sstables::sstable> >)::<lambda()>::<lambda(std::set<seastar::basic_sstring<char, unsigned int, 15> >&)>::<lambda(sstables::shared_sstable)>&, const seastar::lw_shared_ptr<sstables::sstabl
e>&> /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future.hh:1556
    #3 0x106941a in do_for_each<__gnu_cxx::__normal_iterator<const seastar::lw_shared_ptr<sstables::sstable>*, std::vector<seastar::lw_shared_ptr<sstables::sstable> > >, table::move_sstables_from_staging(std::vector<seastar::lw_shared_ptr<sstables::sstable> >)::<lambda()>::<lambda(
std::set<seastar::basic_sstring<char, unsigned int, 15> >&)>::<lambda(sstables::shared_sstable)> > /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future-util.hh:618
    #4 0x1069203 in operator() /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future-util.hh:626
    #5 0x10ba589 in apply /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/apply.hh:36
    #6 0x10ba668 in apply<seastar::do_for_each(Iterator, Iterator, AsyncAction) [with Iterator = __gnu_cxx::__normal_iterator<const seastar::lw_shared_ptr<sstables::sstable>*, std::vector<seastar::lw_shared_ptr<sstables::sstable> > >; AsyncAction = table::move_sstables_from_staging
(std::vector<seastar::lw_shared_ptr<sstables::sstable> >)::<lambda()>::<lambda(std::set<seastar::basic_sstring<char, unsigned int, 15> >&)>::<lambda(sstables::shared_sstable)>]::<lambda()>&> /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/apply.hh:44
    #7 0x10ba7c0 in apply<seastar::do_for_each(Iterator, Iterator, AsyncAction) [with Iterator = __gnu_cxx::__normal_iterator<const seastar::lw_shared_ptr<sstables::sstable>*, std::vector<seastar::lw_shared_ptr<sstables::sstable> > >; AsyncAction = table::move_sstables_from_staging
(std::vector<seastar::lw_shared_ptr<sstables::sstable> >)::<lambda()>::<lambda(std::set<seastar::basic_sstring<char, unsigned int, 15> >&)>::<lambda(sstables::shared_sstable)>]::<lambda()>&> /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future.hh:1563
    ...

0x60200023de18 is located 8 bytes inside of 16-byte region [0x60200023de10,0x60200023de20)
freed by thread T1 (reactor-1) here:
    #0 0x7f8a153b796f in operator delete(void*) (/lib64/libasan.so.5+0x11096f)
    #1 0x6ab4d1 in __gnu_cxx::new_allocator<seastar::lw_shared_ptr<sstables::sstable> >::deallocate(seastar::lw_shared_ptr<sstables::sstable>*, unsigned long) /usr/include/c++/9/ext/new_allocator.h:128
    #2 0x612052 in std::allocator_traits<std::allocator<seastar::lw_shared_ptr<sstables::sstable> > >::deallocate(std::allocator<seastar::lw_shared_ptr<sstables::sstable> >&, seastar::lw_shared_ptr<sstables::sstable>*, unsigned long) /usr/include/c++/9/bits/alloc_traits.h:470
    #3 0x58fdfb in std::_Vector_base<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > >::_M_deallocate(seastar::lw_shared_ptr<sstables::sstable>*, unsigned long) /usr/include/c++/9/bits/stl_vector.h:351
    #4 0x52a790 in std::_Vector_base<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > >::~_Vector_base() /usr/include/c++/9/bits/stl_vector.h:332
    #5 0x52a99b in std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable> > >::~vector() /usr/include/c++/9/bits/stl_vector.h:680
    #6 0xff60fa in ~<lambda> /local/home/bhalevy/dev/scylla/table.cc:2477
    #7 0xff7202 in operator() /local/home/bhalevy/dev/scylla/table.cc:2496
    #8 0x106af5b in apply<table::move_sstables_from_staging(std::vector<seastar::lw_shared_ptr<sstables::sstable> >)::<lambda()> > /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future.hh:1573
    #9 0x102f5d5 in futurize_apply<table::move_sstables_from_staging(std::vector<seastar::lw_shared_ptr<sstables::sstable> >)::<lambda()> > /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future.hh:1645
    #10 0x102f9ee in operator()<seastar::semaphore_units<seastar::named_semaphore_exception_factory> > /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/semaphore.hh:488
    #11 0x109d2f1 in apply /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/apply.hh:36
    #12 0x109d42c in apply<seastar::with_semaphore(seastar::basic_semaphore<ExceptionFactory, Clock>&, size_t, Func&&) [with ExceptionFactory = seastar::named_semaphore_exception_factory; Func = table::move_sstables_from_staging(std::vector<seastar::lw_shared_ptr<sstables::sstable>
 >)::<lambda()>; Clock = std::chrono::_V2::steady_clock]::<lambda(auto:51)>&, seastar::semaphore_units<seastar::named_semaphore_exception_factory, std::chrono::_V2::steady_clock>&&> /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/apply.hh:44
    #13 0x109d595 in apply<seastar::with_semaphore(seastar::basic_semaphore<ExceptionFactory, Clock>&, size_t, Func&&) [with ExceptionFactory = seastar::named_semaphore_exception_factory; Func = table::move_sstables_from_staging(std::vector<seastar::lw_shared_ptr<sstables::sstable>
 >)::<lambda()>; Clock = std::chrono::_V2::steady_clock]::<lambda(auto:51)>&, seastar::semaphore_units<seastar::named_semaphore_exception_factory, std::chrono::_V2::steady_clock>&&> /local/home/bhalevy/dev/scylla/seastar/include/seastar/core/future.hh:1563
    ...

Fixes #5511

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191222214326.1229714-1-bhalevy@scylladb.com>
2019-12-23 15:20:41 +02:00
Benny Halevy
4b3243f5b9 table: move_sstables_from_staging_in_thread with _sstable_deletion_sem
Hold the _sstable_deletion_sem while moving sstables from the staging directory
so not to move them under the feet of table::snapshot.

Fixes #5340

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-12-17 12:20:20 +02:00
Benny Halevy
6efef84185 sstable: return future from move_to_new_dir
distributed_loader::probe_file needlessly creates a seastar
thread for it and the next patch will use it as part of
a parallel_for_each loop to move a list of sstables
(and sync the directories once at the end).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-12-17 12:20:20 +02:00
Piotr Sarna
79c3a508f4 table: Reduce read amplification in view update generation
This commit makes sure that single-partition readers for
read-before-write do not have fast-forwarding enabled,
as it may lead to huge read amplification. The observed case was:
1. Creating an index.
  CREATE INDEX index1  ON myks2.standard1 ("C1");
2. Running cassandra-stress in order to generate view updates.
cassandra-stress write no-warmup n=1000000 cl=ONE -schema \
  'replication(factor=2) compaction(strategy=LeveledCompactionStrategy)' \
  keyspace=myks2 -pop seq=4000000..8000000 -rate threads=100 -errors
  skip-read-validation -node 127.0.0.1;

Without disabling fast-forwarding, single-partition readers
were turned into scanning readers in cache, which resulted
in reading 36GB (sic!) on a workload which generates less
than 1GB of view updates. After applying the fix, the number
dropped down to less than 1GB, as expected.

Refs #5409
Fixes #4615
Fixes #5418
2019-12-05 11:58:34 +02:00
Avi Kivity
fd951a36e3 Merge "Let compaction wait on background deletions" from Benny
"
In several cases in distributed testing (dtest) we trigger compaction using nodetool compact assuming that when it is done, it is indeed really done.
However, the way compaction is currently implemented in scylla, it may leave behind some background tasks to delete the old sstables that were compacted.

This commit changes major compaction (triggered via the ss::force_keyspace_compaction api) so it would wait on the background deletes and will return only when they finish.

Fixes #4909

Tests: unit(dev), nodetool_refresh_with_data_perms_test, test_nodetool_snapshot_during_major_compaction
"
2019-12-04 11:18:41 +02:00
Piotr Sarna
9c5a5a5ac2 treewide: add names to semaphores
By default, semaphore exceptions bring along very little context:
either that a semaphore was broken or that it timed out.
In order to make debugging easier without introducing significant
runtime costs, a notion of named semaphore is added.
A named semaphore is simply a semaphore with statically defined
name, which is present in its errors, bringing valuable context.
A semaphore defined as:

  auto sem = semaphore(0);

will present the following message when it breaks:
"Semaphore broken"
However, a named semaphore:

  auto named_sem = named_semaphore(0, named_semaphore_exception_factory{"io_concurrency_sem"});

will present a message with at least some debugging context:

  "Semaphore broken: io_concurrency_sem"

It's not much, but it would really help in pinpointing bugs
without having to inspect core dumps.

At the same time, it does not incur any costs for normal
semaphore operations (except for its creation), but instead
only uses more CPU in case an error is actually thrown,
which is considered rare and not to be on the hot path.

Refs #4999

Tests: unit(dev), manual: hardcoding a failure in view building code
2019-11-26 15:14:21 +02:00
Benny Halevy
f9e93bba38 sstables: compaction: move cleanup parameter to compaction_descriptor
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20191117165806.3234-1-bhalevy@scylladb.com>
2019-11-18 10:52:20 +01:00
Kamil Braun
a67e887dea sstables: fix sstable file I/O CQL tracing when reading multiple files (#5285)
CQL tracing would only report file I/O involving one sstable, even if
multiple sstables were read from during the query.

Steps to reproduce:

create a table with NullCompactionStrategy
insert row, flush memtables
insert row, flush memtables
restart Scylla
tracing on
select * from table
The trace would only report DMA reads from one of the two sstables.

Kudos to @denesb for catching this.

Related issue: #4908
2019-11-17 00:38:37 -08:00
Piotr Dulikowski
59fbbb993f memtables: add partition/row hit/miss counters
Adds per-table metrics for counting partition and row reuse
in memtables. New metrics are as follows:
    - memtable_partition_writes - number of write operations performed
          on partitions in memtables,
    - memtable_partition_hits - number of write operations performed
          on partitions that previously existed in a memtable,
    - memtable_row_writes - number of row write operations performed
          in memtables,
    - memtable_row_hits - number of row write operations that ovewrote
          rows previously present in a memtable.

Tests: unit(release)
2019-11-12 13:35:41 +01:00
Vladimir Davydov
b75862610e paxos_state: account paxos round latency
This patch adds the following per table stats:

  cas_prepare_latency
  cas_propose_latency
  cas_commit_latency

They are equivalent to CasPropose, CasPrepare, CasCommit metrics exposed
by Cassandra.
2019-10-29 19:26:18 +03:00
Kamil Braun
394c36835a sstables: report sstable data file I/O in CQL tracing
Use tracing::make_traced_file when creating an sstable input_stream.
To achieve that, trace_state needs to be plumbed down through some
functions.
2019-10-25 14:10:28 +02:00
Raphael S. Carvalho
7f1a2156c7 table: Don't account for shared SSTables in compaction backlog tracker
We don't want to add shared sstables to table's backlog tracker because:
1) table's backlog tracker has only an influence on regular compaction
2) shared sstables are never regular compacted, they're worked by
resharding which has its own backlog tracker.

Such sstables belong to more than one shard, meaning that currently
they're added to backlog tracker of all shards that own them.
But the thing is that such sstables ends up being resharded in shard
that may be completely random. So increasing backlog of all shards
such sstables belong to, won't lead to faster resharding. Also, table's
backlog tracker is supposed to deal only with regular compaction.

Accounting for shared sstables in table's tracker may lead to incorrect
speed up of regular compactions because the controller is not aware
that some relevant part of the backlog is due to pending resharding.
The fix is about ignoring sstables that will be resharded and let
table's backlog tracker account only for sstables that can be worked on
by regular compaction, and rely on resharding controlling itself
with its own tracker.
NOTE: this doesn't fix the resharding controlling issue completely,
as described in #4952. We'll still need to throttle regular compaction
on behalf of resharding. So subsequent work may be about:
- move resharding to its own priority class, perhaps streaming.
- make a resharding's backlog tracker accounts for sstables in all of
its pending jobs, not only the ongoing ones (currently limited to 1 by shard).
- limit compaction shares when resharding is in progress.
THIS only fixes the issue in which controller for regular compaction
shouldn't account sstables completely exclusive to resharding.

Fixes #5077.
Refs #4952.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190924022109.17400-1-raphaelsc@scylladb.com>
2019-10-13 10:14:13 +03:00
Tomasz Grabiec
b0e0f29b06 db: read: Filter-out sstables using its first and last keys
Affects single-partition reads only.

Refs #5113

When executing a query on the replica we do several things in order to
narrow down the sstable set we read from.

For tables which use LeveledCompactionStrategy, we store sstables in
an interval set and we select only sstables whose partition ranges
overlap with the queried range. Other compaction strategies don't
organize the sstables and will select all sstables at this stage. The
reasoning behind this is that for non-LCS compaction strategies the
sstables' ranges will typically overlap and using interval sets in
this case would not be effective and would result in quadratic (in
sstable count) memory consumption.

The assumption for overlap does not hold if the sstables come from
repair or streaming, which generates non-overlapping sstables.

At a later stage, for single-partition queries, we use the sstables'
bloom filter (kept in memory) to drop sstables which surely don't
contain given partition. Then we proceed to sstable indexes to narrow
down the data file range.

Tables which don't use LCS will do unnecessary I/O to read index pages
for single-partition reads if the partition is outside of the
sstable's range and the bloom filter is ineffective (Refs #5112).

This patch fixes the problem by consulting sstable's partition range
in addition to the bloom filter, so that the non-overlapping sstables
will be filtered out with certainty and not depend on bloom filter's
efficiency.

It's also faster to drop sstables based on the keys than the bloom
filter.

Tests:
  - unit (dev)
  - manual using cqlsh

Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190927122505.21932-1-tgrabiec@scylladb.com>
2019-09-28 19:42:57 +03:00
Benny Halevy
19b67d82c9 table::on_compaction_completion: fix indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-09-02 12:15:38 +03:00
Benny Halevy
8dd6e13468 table::on_compaction_completion: wait for background deletes
Don't let background deletes accumulate uncontrollably.

Fixes #4909

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-09-02 12:15:38 +03:00
Benny Halevy
da6645dc2c table: refresh_snapshot before deleting any sstables
The row cache must not hold refrences to any sstable we're
about to delete.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-09-02 12:15:29 +03:00
Botond Dénes
136fc856c5 treewide: silence discarded future warnings for questionable discards
This patches silences the remaining discarded future warnings, those
where it cannot be determined with reasonable confidence that this was
indeed the actual intent of the author, or that the discarding of the
future could lead to problems. For all those places a FIXME is added,
with the intent that these will be soon followed-up with an actual fix.
I deliberately haven't fixed any of these, even if the fix seems
trivial. It is too easy to overlook a bad fix mixed in with so many
mechanical changes.
2019-08-26 19:28:43 +03:00
Botond Dénes
fddd9a88dd treewide: silence discarded future warnings for legit discards
This patch silences those future discard warnings where it is clear that
discarding the future was actually the intent of the original author,
*and* they did the necessary precautions (handling errors). The patch
also adds some trivial error handling (logging the error) in some
places, which were lacking this, but otherwise look ok. No functional
changes.
2019-08-26 18:54:44 +03:00
Piotr Sarna
1ab07b80b4 database: assign proper io priority for streaming view updates
Streamed view updates parasitized on writing io priority, which is
reserved for user writes - it's now properly bound to streaming
write priority.
2019-08-20 00:24:50 +02:00
Avi Kivity
77686ab889 Merge "Make SSTable cleanup run aware" from Raphael
"
Fixes #4663.
Fixes #4718.
"

* 'make_cleanup_run_aware_v3' of https://github.com/raphaelsc/scylla:
  tests/sstable_datafile_test: Check cleaned sstable is generated with expected run id
  table: Make SSTable cleanup run aware
  compaction: introduce constants for compaction descriptor
  compaction: Make it possible to config the identifier of the output sstable run
  table: do not rely on undefined behavior in cleanup_sstables
2019-07-31 19:10:22 +03:00
Tomasz Grabiec
7604980d63 database: Add missing partition slicing on streaming reader recreation
streaming_reader_lifecycle_policy::create_reader() was ignoring the
partition_slice passed to it and always creating the reader for the
full slice.

That's wrong because create_reader() is called when recreating a
reader after it's evicted. If the reader stopped in the middle of
partition we need to start from that point. Otherwise, fragments in
the mutation stream will appear duplicated or out of ordre, violating
assumptions of the consumers.

This was observed to result in repair writing incorrect sstables with
duplicated clustering rows, which results in
malformed_sstable_exception on read from those sstables.

Fixes #4659.

In v2:

  - Added an overload without partition_slice to avoid changing existing users which never slice

Tests:

  - unit (dev)
  - manual (3 node ccm + repair)

Backport: 3.1
Reviewd-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <1563451506-8871-1-git-send-email-tgrabiec@scylladb.com>
2019-07-18 18:35:28 +03:00
Raphael S. Carvalho
332c2ff710 table: Make SSTable cleanup run aware
The cleanup procedure will move any sstable out of its sstable run
because sstables are cleaned up individually and they end up receiving
a new run identifier, meaning a table may potentially end up with a
new sstable run for each of the sstables cleaned.

SStable cleanup needs to be run aware, so that the run structure is
not messed up after the operation is done. Given that only one fragment
or other, composing a sstable run, may need cleanup, it's better to keep
them in their original sstable run.

Fixes #4663.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2019-07-15 23:39:47 -03:00
Raphael S. Carvalho
0e732ed1cf table: do not rely on undefined behavior in cleanup_sstables
It shouldn't rely on argument evaluation order, which is ub.

Fixes #4718.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2019-07-15 23:39:22 -03:00
Benny Halevy
0e4567c881 table: document _sstables_lock/_sstable_deletion_sem locking order
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-07-15 19:20:35 +03:00
Benny Halevy
6dad9baa1c table: disable_sstable_write: acquire _sstable_deletion_sem
`disable_sstable_write` needs to acquire `_sstable_deletion_sem`
to properly synchronize with background deletions done by
`on_compaction_completion` to ensure no sstables will be created
or deleted during `reshuffle_sstables` after
`storage_service::load_new_sstables` disables sstable writes.

Fixes #4622

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-07-11 12:14:44 +03:00
Benny Halevy
bbbd749f70 table: uninline enable_sstable_write
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-07-11 12:14:44 +03:00
Benny Halevy
c6bad3f3c2 table: reshuffle_sstables: add log message
To mark the point in time writes are disabled and
scanning of the data directory is beginning.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-07-11 12:14:44 +03:00
Kamil Braun
d6736a304a Add metric for failed memtable flushes
Resolves #3316.

Signed-off-by: Kamil Braun <kbraun@scylladb.com>
2019-07-10 11:30:10 +03:00
Dejan Mircevski
8dcb35913a table: Avoid needless allocation of cell lockers
All `table` instances currently unconditionally allocate a cell locker
for counter cells, though not all need one.  Since the lockers occupy
quite a bit of memory (as reported in #4441), it's wasteful to
allocate them when unneeded.

Fixes #4441.

Tests: unit (dev, debug)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Message-Id: <20190515190910.87931-1-dejan@scylladb.com>
2019-05-16 11:10:38 +03:00
Avi Kivity
add20eb9a6 table: fix potentially wrong schema when reading from zero sstables
We use the schema during creation of the mutation_source rather than
during the query itself. Likely they're the same, and since no rows
are returned from a zero-sstable query, harmless. But gcc 9 complains.

Fix by using the query's schema.
2019-05-07 09:56:30 +03:00
Benny Halevy
5a99023d4a treewide: use lambda for io_check of *touch_directory
To prepare for a seastar change that adds an optional file_permissions
parameter to touch_directory and recursive_touch_directory.
This change messes up the call to io_check since the compiler can't
derive the Func&& argument.  Therefore, use a lambda function instead
to wrap the call to {recursive_,}touch_directory.

Ref #4395

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190421085502.24729-1-bhalevy@scylladb.com>
2019-04-21 12:04:39 +03:00
Raphael S. Carvalho
d59f716e1c table: fix wild disk usage stat after sstables are discarded by truncate
Truncate would make disk usage stat go wild because it isn't updated
when sstables are removed in table::discard_sstables(). Let's update
the stat after sstables are removed from the sstable set.

Fixes #3624.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190328154918.25404-1-raphaelsc@scylladb.com>
2019-04-01 13:55:11 +03:00
Avi Kivity
4b330b3911 Merge "introduce sstables manager" from Benny
"
This series introduce a rudimentary sstables manager
that will be used for making and deleting sstables, and tracking
of thereof.

The motivation for having a sstables manager is detailed in
https://github.com/scylladb/scylla/issues/4149.
The gist of it is that we need a proper way to manage the life
cycle of sstables to solve potential races between compaction
and various consumers of sstables, so they don't get deleted by
compaction while being used.

In addition, we plan to add global statistics methods like returning
the total capacity used by all sstables.

This patchset changes the way class sstable gets the large_data_handler.
Rather than passing it separately for writing the sstable and when deleting
sstables, we provide the large_data_handler when the sstable object is
constructed and then use it when needed.

Refs #4149
"

* 'projects/sstables_manager/v3' of https://github.com/bhalevy/scylla:
  sstables: provide large_data_handler to constructor
  sstables_manager: default_sstable_buffer_size need not be a function
  sstables: introduce sstables_manager
  sstables: move shareable_components def to its own header
  tests: use global nop_lp_handler in test_services
  sstables: compress.hh: add missing include
  sstables: reorder entry_descriptor constructor params
  sstables: entry_descriptor: get rid of unused ctor
  sstables: make load_shared_components a method of sstable
  sstables: remove default params from sstable constructor
  database: add table::make_sstable helper
  distributed_loader: pass column_family to load_sstables_with_open_info
  distributed_loader: no need for forward declaration of load_sstables_with_open_info
  distributed_loader: reshard: use default params for make_sstable
2019-03-26 16:31:40 +02:00