Commit Graph

80 Commits

Author SHA1 Message Date
Benny Halevy
29002e3b48 flat_mutation_reader: return future from next_partition
To allow it to asynchronously close underlying readers
on next_partition().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-01-13 17:35:07 +02:00
Avi Kivity
96d64b7a1f Merge "Wire interposer consumer for memtable flush" from Raphael
"
Without interposer consumer on flush, it could happen that a new sstable,
produced by memtable flush, will not conform to the strategy invariant.
For example, with TWCS, this new sstable could span multiple time windows,
making it hard for the strategy to purge expired data. If interposer is
enabled, the data will be correctly segregated into different sstables,
each one spanning a single window.
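
The segregation idea can be sketched as follows. This is only an illustration under simplifying assumptions (fixed window size, non-negative integer timestamps); `window_of` and `segregate_by_window` are hypothetical names, not the interposer consumer API.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical helper: lower bound of the time window containing `ts`.
// Assumes ts >= 0 and window_size > 0.
inline int64_t window_of(int64_t ts, int64_t window_size) {
    return (ts / window_size) * window_size;
}

// Split the write timestamps of a flushed memtable into one bucket per
// window; each bucket stands in for "one sstable spanning a single window".
inline std::map<int64_t, std::vector<int64_t>>
segregate_by_window(const std::vector<int64_t>& timestamps, int64_t window_size) {
    std::map<int64_t, std::vector<int64_t>> buckets;
    for (int64_t ts : timestamps) {
        buckets[window_of(ts, window_size)].push_back(ts);
    }
    return buckets;
}
```

Without this step, all four timestamps in a call such as `segregate_by_window({5, 65, 10, 130}, 60)` would land in a single output spanning three windows.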

Fixes #4617.

tests:
    - mode(dev).
    - manually tested it by forcing a flush of memtable spanning many windows
"

* 'segregation_on_flush_v2' of github.com:raphaelsc/scylla:
  test: Add test for TWCS interposer on memtable flush
  table: Wire interposer consumer for memtable flush
  table: Add write_memtable_to_sstable variant which accepts flat_mutation_reader
  table: Allow sstable write permit to be shared across monitors
  memtable: Track min timestamp
  table: Extend cache update to operate a memtable split into multiple sstables
2021-01-13 11:07:29 +02:00
Botond Dénes
4b254a26ab test/boost/sstable_datafile_test: sstable_scrub_test: disable key validation
The test violates clustering key order on purpose to produce a corrupt
sstable (to test scrub). Disable key validation so when we move the
validator into the writer itself in the next patch it doesn't abort the
test.
2021-01-11 09:12:56 +02:00
Raphael S. Carvalho
d265bb9bdb test: Add test for TWCS interposer on memtable flush
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-01-04 16:55:06 -03:00
Raphael S. Carvalho
e4b55f40f3 sstables: Fix sstable reshaping for STCS
The heuristic of STCS reshape is correct, and it built the compaction
descriptor correctly, but forgot to return it to the caller, so no
reshape was ever done on behalf of STCS even when the strategy
needed it.

Fixes #7774.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201209175044.1609102-1-raphaelsc@scylladb.com>
2020-12-10 12:45:25 +02:00
Avi Kivity
f802356572 Revert "Revert "Merge "raft: fix replication if existing log on leader" from Gleb""
This reverts commit dc77d128e9. It was reverted
due to a strange and unexplained diff, which is now explained. The
HEAD on the working directory being pulled from was set back, so git
thought it was merging the intended commits, plus all the work that was
committed from HEAD to master. So it is safe to restore it.
2020-12-08 19:19:55 +02:00
Avi Kivity
dc77d128e9 Revert "Merge "raft: fix replication if existing log on leader" from Gleb"
This reverts commit 0aa1f7c70a, reversing
changes made to 72c59e8000. The diff is
strange, including unrelated commits. There is no understanding of the
cause, so to be safe, revert and try again.
2020-12-06 11:34:19 +02:00
Avi Kivity
e8ff77c05f Merge 'sstables: a bunch of refactors' from Kamil Braun
1. sstables: move `sstable_set` implementations to a separate module

    All the implementations were kept in sstables/compaction_strategy.cc
    which is quite large even without them. `sstable_set` already had its
    own header file, now it gets its own implementation file.

    The declarations of implementation classes and interfaces (`sstable_set_impl`,
    `bag_sstable_set`, and so on) were also exposed in a header file,
    sstable_set_impl.hh, for the purposes of potential unit testing.

2. mutation_reader: move `mutation_reader::forwarding` to flat_mutation_reader.hh

    Files which need this definition won't have to include
    mutation_reader.hh, only flat_mutation_reader.hh (so the inclusions are
    in total smaller; mutation_reader.hh includes flat_mutation_reader.hh).

3. sstables: move sstable reader creation functions to `sstable_set`

    Lower level functions such as `create_single_key_sstable_reader`
    were made methods of `sstable_set`.

    The motivation is that each concrete sstable_set
    may decide to use a better sstable reading algorithm specific to the
    data structures used by this sstable_set. For this it needs to access
    the set's internals.

    A nice side effect is that we moved some code out of table.cc
    and database.hh which are huge files.

4. sstables: pass `ring_position` to `create_single_key_sstable_reader`

    instead of `partition_range`.

    It would be best to pass `partition_key` or `decorated_key` here.
    However, the implementation of this function needs a `partition_range`
    to pass into `sstable_set::select`, and `partition_range` must be
    constructed from `ring_position`s. We could create the `ring_position`
    internally from the key but that would involve a copy which we want to
    avoid.

5. sstable_set: refactor `filter_sstable_for_reader_by_pk`

    Introduce a `make_pk_filter` function, which given a ring position,
    returns a boolean function (a filter) that given a sstable, tells
    whether the sstable may contain rows with the given position.

    The logic has been extracted from `filter_sstable_for_reader_by_pk`.
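
The shape of item 5 (a function that returns a filter) might look roughly like the sketch below. The types are hypothetical stand-ins: positions are plain integers and an sstable is modelled only by its first and last key, which is an assumption and not how the real `make_pk_filter` is implemented.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in: an sstable described only by its key range.
struct sstable_model {
    int first_key;
    int last_key;
};

// Given a position, return a predicate that tells whether an sstable
// may contain data for that position.
inline std::function<bool(const sstable_model&)> make_pk_filter(int pos) {
    return [pos](const sstable_model& sst) {
        return sst.first_key <= pos && pos <= sst.last_key;
    };
}
```

Returning the predicate once and applying it to many sstables keeps the per-position setup out of the per-sstable loop.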

Split from #7437.

Closes #7655

* github.com:scylladb/scylla:
  sstable_set: refactor filter_sstable_for_reader_by_pk
  sstables: pass ring_position to create_single_key_sstable_reader
  sstables: move sstable reader creation functions to `sstable_set`
  mutation_reader: move mutation_reader::forwarding to flat_mutation_reader.hh
  sstables: move sstable_set implementations to a separate module
2020-11-24 09:23:57 +02:00
Kamil Braun
d158921966 sstables: add may_have_partition_tombstones method
For sstable versions greater than or equal to md, the `min_max_column_names`
sstable metadata gives a range of position-in-partitions such that all
clustering rows stored in this sstable have positions in this range.

Partition tombstones in this context are understood as covering the
entire range of clustering keys; thus, if the sstable contains at least
one partition tombstone, the sstable position range is set to be the
range of all clustered rows.

Therefore, by checking that the position range is *not* the range of all
clustered rows we know that the sstable cannot have any partition tombstones.
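
A minimal sketch of that reasoning, assuming a simplified model in which an absent bound means "unbounded" and a flag says whether the metadata is trustworthy (md or newer). The struct and the extra parameter are hypothetical, chosen only to illustrate the check.

```cpp
#include <cassert>
#include <optional>

// Hypothetical model of the sstable position range.
struct position_range_model {
    std::optional<int> min;  // empty: unbounded below
    std::optional<int> max;  // empty: unbounded above
    bool is_all_clustered_rows() const { return !min && !max; }
};

// Conservative answer: older formats cannot be trusted, so say "maybe".
// For md and newer, only the full range leaves room for a partition tombstone.
inline bool may_have_partition_tombstones(bool md_or_newer,
                                          const position_range_model& r) {
    if (!md_or_newer) {
        return true;
    }
    return r.is_all_clustered_rows();
}
```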

Closes #7678
2020-11-23 23:30:19 +02:00
Kamil Braun
40d8bfa394 sstables: move sstable reader creation functions to sstable_set
Lower level functions such as `create_single_key_sstable_reader`
were made methods of `sstable_set`.

The motivation is that each concrete sstable_set
may decide to use a better sstable reading algorithm specific to the
data structures used by this sstable_set. For this it needs to access
the set's internals.

A nice side effect is that we moved some code out of table.cc
and database.hh which are huge files.
2020-11-19 17:52:39 +01:00
Avi Kivity
58e02c216a test: sstable_datafile_test: sstable_run_based_compaction_test: prevent use of uninitialized variable observer
The variable 'observer' (an std::optional) may be left uninitialized
if 'incremental_enabled' is false. However, it is used afterwards
with a call to disconnect, accessing garbage.

Fix by accessing it via the optional wrapper. A call to optional::reset()
destroys the observable, which in turn calls disconnect().
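
The pattern can be illustrated with a self-contained sketch. `observer_model` is a hypothetical type whose destructor plays the role of disconnect(); the point is only that `reset()` is safe whether or not the optional holds a value.

```cpp
#include <cassert>
#include <optional>

// Hypothetical observer: destroying it "disconnects" and bumps a counter.
struct observer_model {
    int* disconnects;
    explicit observer_model(int* counter) : disconnects(counter) {}
    observer_model(const observer_model&) = delete;
    observer_model& operator=(const observer_model&) = delete;
    ~observer_model() { ++*disconnects; }
};

// Returns how many disconnects happened during teardown.
inline int disconnects_after_teardown(bool incremental_enabled) {
    int disconnects = 0;
    std::optional<observer_model> observer;
    if (incremental_enabled) {
        observer.emplace(&disconnects);
    }
    // reset() on an empty optional is a no-op, so this never touches
    // a non-existent observer.
    observer.reset();
    return disconnects;
}
```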

Closes #7380
2020-10-11 17:36:08 +03:00
Botond Dénes
6ca0464af5 mutation_fragment: add schema and permit
We want to start tracking the memory consumption of mutation fragments.
For this we need schema and permit during construction, and on each
modification, so the memory consumption can be recalculated and passed to
the permit.

In this patch we just add the new parameters and go through the insane
churn of updating all call sites. They will be used in the next patch.
2020-09-28 11:27:23 +03:00
Botond Dénes
3fab83b3a1 flat_mutation_reader: impl: add reader_permit parameter
Not used yet, this patch does all the churn of propagating a permit
to each impl.

In the next patch we will use it to track the memory
consumption of `_buffer`.
2020-09-28 10:53:48 +03:00
Avi Kivity
3976066156 test: sstable_datafile_test: prepare for asynchronously closed sstables_manager
sstables_manager will soon be closed asynchronously, with a future-returning
close() function. To prepare for that, make the following changes:
 - replace on-stack test_env with test_env::do_with()
 - use the variant of column_family_for_tests that accepts an sstables_manager
 - replace test_sstables_manager with an sstables_manager obtained from test_env

These changes allow lifetime management of the sstables_manager used
in the tests to be centralized in test_env.

Since test_env now calls await_background_jobs on termination, those
calls are dropped.
2020-09-23 20:55:12 +03:00
Avi Kivity
1c1a737eda test: sstable_datafile_test: drop bad 'return'
The pattern

    return function_returning_a_future().get();

is legal, but confusing. It returns an unexpected std::tuple<>. Here,
it doesn't do any harm, but if we try to coerce the surrounding code
into a signature (void ()), then that will fail.

Remove the unneeded and unexpected return.
2020-09-23 20:55:06 +03:00
Avi Kivity
c27c2a06bb test: sstable_datafile_test: reorder table stop in compaction_manager_test
Stopping a table will soon close its sstables; so the next check will fail
as the number of sstables for the table will be zero.

Reorder the stop() call to make it safe.

We don't need the stop() for the check, since the previous loop made sure
compactions completed.
2020-09-23 20:55:03 +03:00
Pavel Emelyanov
a6e6856e1f compaction: Keep database reference on cleanup options
The database is available at both places that create the options --
tests and API perform_cleanup call.

The options object doesn't outlive the returned future, so it's
safe to keep the reference on it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-21 14:58:40 +03:00
Rafael Ávila de Espíndola
6363716799 schema: Pass an rvalue to set_compaction_strategy_options
This produces less code and makes sure every caller moves the value.
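
A small sketch of why an rvalue-reference parameter forces callers to move. `schema_builder_model` is a hypothetical stand-in, not the real schema builder.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

using options_map = std::map<std::string, std::string>;

// Hypothetical stand-in for the builder that stores the options.
struct schema_builder_model {
    options_map options;
    // An rvalue-reference parameter rejects lvalue arguments at compile
    // time, so callers must pass a temporary or write std::move(...).
    void set_compaction_strategy_options(options_map&& opts) {
        options = std::move(opts);
    }
};
```

With this signature, `b.set_compaction_strategy_options(o)` for a named `o` fails to compile, while `b.set_compaction_strategy_options(std::move(o))` works.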

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-08-19 14:02:35 -07:00
Raphael S. Carvalho
3be1420083 test: Check that TWCS properly performs size-tiered compaction on past windows
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-08-18 15:14:09 -03:00
Raphael S. Carvalho
f2b588cfc4 compaction/twcs: Make newest_bucket() non-static
To fix #6928, newest_bucket() will have to access the class fields.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-08-17 12:29:34 -03:00
Raphael S. Carvalho
11df96718a compaction: Prevent non-regular compaction from picking compacting SSTables
After 8014c7124, cleanup can potentially pick a compacting SSTable.
Upgrade and scrub can also pick a compacting SSTable.
The problem is that table::candidates_for_compaction() was badly named.
It misleads the user into thinking that the SSTables returned are perfect
candidates for compaction, but the manager still needs to filter out the
compacting SSTables from the returned set. So it's being renamed.

When the same SSTable is compacted in parallel, the strategy invariant
can be broken (for example, overlapping can be introduced in LCS), and
deletions can fail because more than one compaction process would try
to delete the same files.

Let's fix scrub, cleanup and upgrade by calling the manager function
which gets the correct candidates for compaction.

Fixes #6938.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200811200135.25421-1-raphaelsc@scylladb.com>
2020-08-16 17:31:03 +03:00
Piotr Jastrzebski
c001374636 codebase wide: replace count with contains
C++20 introduced `contains` member functions for maps and sets for
checking whether an element is present in the collection. Previously, the
`count` function was often used in various ways.

`contains` not only expresses the intent of the code better but also
does so in a more uniform way.

This commit replaces all the occurrences of `count` with
`contains`.

Tests: unit(dev)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <b4ef3b4bc24f49abe04a2aba0ddd946009c9fcb2.1597314640.git.piotr@scylladb.com>
2020-08-15 20:26:02 +03:00
Rafael Ávila de Espíndola
bd2f9fc685 test: Move sstable_run_based_compaction_strategy_for_tests.hh to test/lib
This is in preparation for moving the code to a .cc file.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-08-11 11:48:41 -07:00
Avi Kivity
3530e80ce1 Merge "Support md format" from Benny
"
This series adds support for the "md" sstable format.

Support is based on the following:

* do not use clustering based filtering in the presence
  of static row, tombstones.
* Disabling min/max column names in the metadata for
  formats older than "md".
* When updating the metadata, reset and disable min/max
  in the presence of range tombstones (like Cassandra does
  and until we process them accurately).
* Fix the way we maintain min/max column names by:
  keeping whole clustering key prefixes as min/max
  rather than calculating min/max independently for
  each component, like Cassandra does in the "md" format.

Fixes #4442

Tests: unit(dev), cql_query_test -t test_clustering_filtering* (debug)
md migration_test dtest from git@github.com:bhalevy/scylla-dtest.git migration_test-md-v1
"

* tag 'md-format-v4' of github.com:bhalevy/scylla: (27 commits)
  config: enable_sstables_md_format by default
  test: cql_query_test: add test_clustering_filtering unit tests
  table: filter_sstable_for_reader: allow clustering filtering md-format sstables
  table: create_single_key_sstable_reader: emit partition_start/end for empty filtered results
  table: filter_sstable_for_reader: adjust to md-format
  table: filter_sstable_for_reader: include non-scylla sstables with tombstones
  table: filter_sstable_for_reader: do not filter if static column is requested
  table: filter_sstable_for_reader: refactor clustering filtering conditional expression
  features: add MD_SSTABLE_FORMAT cluster feature
  config: add enable_sstables_md_format
  database: add set_format_by_config
  test: sstable_3_x_test: test both mc and md versions
  test: Add support for the "md" format
  sstables: mx/writer: use version from sstable for write calls
  sstables: mx/writer: update_min_max_components for partition tombstone
  sstables: metadata_collector: support min_max_components for range tombstones
  sstable: validate_min_max_metadata: drop outdated logic
  sstables: rename mc folder to mx
  sstables: may_contain_rows: always true for old formats
  sstables: add may_contain_rows
  ...
2020-08-11 13:29:11 +03:00
Piotr Jastrzebski
80e3923b3c codebase wide: replace find(...) != end() with contains
C++20 introduced `contains` member functions for maps and sets for
checking whether an element is present in the collection. Previously
the code pattern looked like:

<collection>.find(<element>) != <collection>.end()

In C++20 the same can be expressed with:

<collection>.contains(<element>)

This is not only more concise but also expresses the intent of the code
more clearly.

This commit replaces all the occurrences of the old pattern with the new
approach.

Tests: unit(dev)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <f001bbc356224f0c38f06ee2a90fb60a6e8e1980.1597132302.git.piotr@scylladb.com>
2020-08-11 13:28:50 +03:00
Benny Halevy
bd4383a842 sstables: mx/writer: update_min_max_components for partition tombstone
Partition tombstones represent an implicit clustering range
that is unbounded on both sides, so reflect that in min/max
column names metadata using empty clustering key prefixes.

If we don't do that, when using the sstable for filtering, we have no
other way of distinguishing range tombstones from partition tombstones
given the sstable metadata and we would need to include any sstable
with tombstones, even if those are range tombstones, for which
we can do a better filtering job, using the sstable min/max
column names metadata.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 18:53:04 +03:00
Benny Halevy
68acae5873 sstables: metadata_collector: support min_max_components for range tombstones
We essentially treat min/max column names as range bounds
with min as incl_start and max as incl_end.

By generating a bound_view for min/max column names on the fly,
we can correctly track and compare also short clustering
key prefixes that may be used as bounds for range tombstones.

Extend the sstable_tombstone_metadata_check unit test
to cover these cases.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 18:53:04 +03:00
Pekka Enberg
a37eaaa022 sstables: Add support for the "md" format enum value
Add the sstable_version_types::md enum value
and logically extend sstable_version_types comparisons to cover
also the > sstable_version_types::mc cases.
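
A sketch of why ordered comparisons extend naturally to a new value. The enum below is a hypothetical mirror that assumes the versions are declared oldest to newest; it is not the real `sstable_version_types` definition.

```cpp
#include <cassert>

// Hypothetical mirror of the version enum, ordered oldest to newest.
enum class sstable_version_model { ka, la, mc, md };

// A ">= mc" check automatically covers md (and any later value), whereas
// an "== mc" check would silently exclude the new format.
inline bool is_mc_or_newer(sstable_version_model v) {
    return v >= sstable_version_model::mc;
}
```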

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 18:53:04 +03:00
Benny Halevy
9f114d821a sstables: keep whole clustering_key_prefix as min/max_column_names
Currently we compare each min/max component independently.
This may lead to suboptimal, inclusive clustering ranges
that do not indicate any actual key we encountered.

For example: ['a', 2], ['b', 1] will lead to min=['a', 1], max=['b', 2]
instead of the keys themselves.

This change keeps the min or max keys as a whole.

It considers shorter clustering prefixes (that are possible with compact
storage) as range tombstone bounds, so that a shorter key is considered
less than the minimum if the latter has a common prefix, and greater
than the maximum if the latter has a common prefix.

Extend the min_max_clustering_key_test to test for this case.
Previously {"a", "2"}, {"b", "1"} clustering keys would erroneously
end up with min={"a", "1"} max={"b", "2"} while we want them to be
min={"a", "2"} max={"b", "1"}.
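
The difference between the two approaches can be reproduced with a short sketch. It assumes non-empty input and keys of equal length, and models a clustering key as a vector of strings; both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using key = std::vector<std::string>;

// Old behaviour: min and max are computed independently per component,
// which can produce keys that were never written.
inline std::pair<key, key> per_component_min_max(const std::vector<key>& keys) {
    key lo = keys.front(), hi = keys.front();
    for (const key& k : keys) {
        for (std::size_t i = 0; i < k.size(); ++i) {
            lo[i] = std::min(lo[i], k[i]);
            hi[i] = std::max(hi[i], k[i]);
        }
    }
    return {lo, hi};
}

// New behaviour: whole keys are compared lexicographically, so min and max
// are always keys that were actually seen.
inline std::pair<key, key> whole_key_min_max(const std::vector<key>& keys) {
    auto [lo, hi] = std::minmax_element(keys.begin(), keys.end());
    return {*lo, *hi};
}
```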

Adjust sstable_3_x_test to ignore original mc sstables that were
previously computed with different min/max column names.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-08-10 18:53:03 +03:00
Avi Kivity
257c17a87a Merge "Don't depend on seastar::make_(lw_)?shared idiosyncrasies" from Rafael
"
While working on another patch I was getting odd compiler errors
saying that a call to ::make_shared was ambiguous. The reason was that
seastar has both:

template <typename T, typename... A>
shared_ptr<T> make_shared(A&&... a);

template <typename T>
shared_ptr<T> make_shared(T&& a);

The second variant doesn't exist in std::make_shared.

This series drops the dependency in scylla, so that a future change
can make seastar::make_shared a bit more like std::make_shared.
"

* 'espindola/make_shared' of https://github.com/espindola/scylla:
  Everywhere: Explicitly instantiate make_lw_shared
  Everywhere: Add a make_shared_schema helper
  Everywhere: Explicitly instantiate make_shared
  cql3: Add a create_multi_column_relation helper
  main: Return a shared_ptr from defer_verbose_shutdown
2020-08-02 19:51:24 +03:00
Botond Dénes
fe127a2155 sstables: clamp estimated_partitions to [1, +inf) in writers
In some cases the estimated number of partitions can be 0, which, albeit a
legitimate estimation result, breaks a lot of low-level sstable writer code, so
some of these writers have assertions to ensure estimated partitions is > 0.
To avoid hitting this assert, all users of the sstable writers do the
clamping, to ensure estimated partitions is at least 1. However, leaving
this to the callers is error-prone, as #6913 has shown. As this
clamping is standard practice, it is better to do it in the writers
themselves, avoiding this problem altogether. This is exactly what this
patch does. It also adds two unit tests, one that reproduces the crash
in #6913, and another one that ensures all sstable writers are fine with
estimated partitions being 0 now. Call sites previously doing the
clamping are changed to not do it, it is unnecessary now as the writer
does it itself.
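
The clamping itself is a one-liner. The helper name below is hypothetical; the sketch only shows the clamp to [1, +inf) being done once, on the writer side.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical writer-side helper: 0 is a legitimate estimate, but the
// sizing code downstream assumes at least one partition.
inline uint64_t clamp_estimated_partitions(uint64_t estimated) {
    return std::max<uint64_t>(estimated, 1);
}
```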

Fixes #6913

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200724120227.267184-1-bdenes@scylladb.com>
2020-07-27 09:19:37 +02:00
Rafael Ávila de Espíndola
e15c8ee667 Everywhere: Explicitly instantiate make_lw_shared
seastar::make_lw_shared has a constructor taking a T&&. There is no
such constructor in std::make_shared:

https://en.cppreference.com/w/cpp/memory/shared_ptr/make_shared

This means that we have to move from

    make_lw_shared(T(...))

to

    make_lw_shared<T>(...)

if we don't want to depend on the idiosyncrasies of
seastar::make_lw_shared.
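
The portable spelling can be shown with std::make_shared, which only supports the explicit form. `schema_model` is a hypothetical type used for illustration.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical type standing in for T.
struct schema_model {
    std::string name;
    int version;
    schema_model(std::string n, int v) : name(std::move(n)), version(v) {}
};

// Name the type explicitly and forward the constructor arguments.
// std::make_shared cannot deduce T from a T&& argument, so
// std::make_shared(schema_model("ks", 1)) would not compile.
inline std::shared_ptr<schema_model> make_model() {
    return std::make_shared<schema_model>("ks", 1);
}
```

The explicit form also constructs the object in place instead of building a temporary and moving it.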

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-07-21 10:33:49 -07:00
Rafael Ávila de Espíndola
efeaded427 Everywhere: Add a make_shared_schema helper
This replaces a lot of make_lw_shared(schema(...)) with
make_shared_schema(...).

This makes it easier to drop a dependency on the differences between
seastar::make_shared and std::make_shared.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-07-21 10:33:49 -07:00
Rafael Ávila de Espíndola
66d866427d sstable_datafile_test: Use BOOST_REQUIRE_EQUAL
This only works for types that can be printed, but produces a better
error message if the check fails.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200716232700.521414-1-espindola@scylladb.com>
2020-07-17 11:58:58 +03:00
Raphael S. Carvalho
cf352e7c14 sstables: optimize procedure that checks if a sstable needs cleanup
needs_cleanup() returns true if a sstable needs cleanup.

Turns out it's very slow because it iterates through all the local
ranges for all sstables in the set, making its complexity:
	O(num_sstables * local_ranges)

We can optimize it by taking into account that abstract_replication_strategy
documents that get_ranges() will return a list of ranges that is sorted
and non-overlapping. Compaction for cleanup already takes advantage of that
when checking if a given partition can be actually purged.

So needs_cleanup() can be optimized into O(num_sstables * log(local_ranges)).

With num_sstables=1000, RF=3, then local_ranges=256(num_tokens)*3, it means
the max # of checks performed will go from 768000 to ~9584.
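
A sketch of the binary-search idea, under simplifying assumptions: tokens are integers, ranges are closed intervals, and an sstable is modelled by its first and last token. `token_range` and this `needs_cleanup` signature are hypothetical, not the real code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical closed token range [start, end].
struct token_range {
    int64_t start;
    int64_t end;
};

// Assumes `owned` is sorted and non-overlapping, as get_ranges() is
// documented to return. An sstable needs cleanup unless [first, last]
// lies fully inside a single owned range.
inline bool needs_cleanup(const std::vector<token_range>& owned,
                          int64_t first, int64_t last) {
    // First range whose end is >= first: the only one that can contain it.
    auto it = std::lower_bound(owned.begin(), owned.end(), first,
        [](const token_range& r, int64_t t) { return r.end < t; });
    if (it == owned.end()) {
        return true;
    }
    return !(it->start <= first && last <= it->end);
}
```

Each sstable now costs one O(log(local_ranges)) lookup instead of a scan, which is where the drop from 1000 * 768 = 768000 checks to roughly 1000 * log2(768) comes from.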

Fixes #6730.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200629171355.45118-2-raphaelsc@scylladb.com>
2020-06-30 12:58:43 +03:00
Raphael S. Carvalho
8e47f61df7 compaction: Enable tombstone expiration based on the presence of the sstable set
For tombstone expiration to proceed correctly without the risk of resurrecting
data, the sstable set must be present.
Regular compaction and derivatives provide the sstable set, so they're able
to expire tombstones with no resurrection risk.
Resharding, on the other hand, can run on any shard, not necessarily on the
same shard that one of the input sstables belongs to, so it currently cannot
provide a sstable set for tombstone expiration to proceed safely.
That being said, let's only do expiration based on the presence of the set.
This makes room for the sstable set to be fed to compaction via descriptor,
allowing even resharding to do expiration. Currently, compaction thinks that
sstable set can only come from the table, and that also needs to be changed
for further flexibility.

It's theoretically possible that a given resharding job will resurrect data if
a fully expired SSTable is resharded at a shard which it doesn't belong to.
Resharding will have no way to tell that expiring all that data will lead to
resurrection because the relevant SSTables are at different shards.
This is fixed by checking for fully expired sstables only on presence of
the sstable set.

Fixes #6600.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200605200954.24696-1-raphaelsc@scylladb.com>
2020-06-07 11:46:48 +03:00
Raphael S. Carvalho
fb6976f1b9 Make sure SSTables created by streaming are added to backlog tracker
New SSTables are only added to the backlog tracker if set_unshared() was
called on their behalf. SSTables created for streaming are not being
added to the tracker because make_streaming_sstable_for_write()
doesn't call set_unshared(), and neither does its caller. As a result, the
backlog does not account for their existence, which means the backlog will
be much lower than expected.

This problem could be fixed by adding a set_unshared() call, but it
turns out we don't even need set_unshared() anymore. It was introduced
when Scylla metadata didn't exist; now an SSTable has built-in knowledge
of whether or not it's shared. Relying on every SSTable creator calling
set_unshared() is bug-prone. Let's get rid of it and let the SSTable
itself say whether or not it's shared. If an imported SSTable has no
Scylla metadata, Scylla will still be able to compute shards using
token range metadata.

Refs #6021.
Refs #6227.
Fixes #6441.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200512220226.134481-1-raphaelsc@scylladb.com>
2020-06-03 17:35:22 +03:00
Avi Kivity
0c6bbc84cd Merge "Classify queries based on their initiator, rather than their target" from Botond
"
Currently we classify queries as "system" or "user" based on the table
they target. The class of a query determines how the query is treated,
currently: timeout, limits for reverse queries and the concurrency
semaphore. The catch is that users are also allowed to query system
tables and when doing so they will bypass the limits intended for user
queries. This has caused performance problems in the past, yet the
reason we decided to finally address this is that we want to introduce a
memory limit for unpaged queries. Internal (system) queries are all
unpaged and we don't want to impose the same limit on them.

This series uses scheduling groups to distinguish user and system
workloads, based on the assumption that user workloads will run in the
statement scheduling group, while system workloads will run in the main
(or default) scheduling group, or perhaps something else, but in any
case not in the statement one. Currently the scheduling group of reads
and writes is lost when going through the messaging service, so to be
able to use scheduling groups to distinguish user and system reads this
series refactors the messaging service to retain this distinction across
verb calls. Furthermore, we execute some system reads/writes as part of
user reads/writes, such as auth and schema sync. These processes are
tagged to run in the main group.
This series also centralises query classification on the replica and
moves it to a higher level. More specifically, queries are now
classified -- the scheduling group they run in is translated to the
appropriate query class specific configuration -- on the database level
and the configuration is propagated down to the lower layers.
Currently this query class specific configuration consists of the reader
concurrency semaphore and the max memory limit for otherwise unlimited
queries. A corollary of the semaphore being selected on the database
level is that the read permit is now created before the read starts. A
valid permit is now available during all stages of the read, enabling
tracking the memory consumption of e.g. the memtable and cache readers.
This change aligns nicely with the needs of more accurate reader memory
tracking, which also wants a valid permit that is available in every layer.

The series can be divided roughly into the following distinct patch
groups:
* 01-02: Give system read concurrency a boost during startup.
* 03-06: Introduce user/system statement isolation to messaging service.
* 07-13: Various infrastructure changes to prepare for using read
  permits in all stages of reads.
* 14-19: Propagate the semaphore and the permit from database to the
  various table methods that currently create the permit.
* 20-23: Migrate away from using the reader concurrency semaphore for
  waiting for admission, use the permit instead.
* 24: Introduce `database::make_query_config()` and switch the database
  methods needing such a config to use it.
* 25-31: Get rid of all uses of `no_reader_permit()`.
* 32-33: Ban empty permits for good.
* 34: querier_cache: use the queriers' permits to obtain the semaphore.

Fixes: #5919

Tests: unit(dev, release, debug),
dtest(bootstrap_test.py:TestBootstrap.start_stop_test_node), manual
testing with a 2 node mixed cluster with extra logging.
"
* 'query-class/v6' of https://github.com/denesb/scylla: (34 commits)
  querier_cache: get semaphore from querier
  reader_permit: forbid empty permits
  reader_permit: fix reader_resources::operator bool
  treewide: remove all uses of no_reader_permit()
  database: make_multishard_streaming_reader: pass valid permit to multi range reader
  sstables: pass valid permits to all internal reads
  compaction: pass a valid permit to sstable reads
  database: add compaction read concurrency semaphore
  view: use valid permits for reads from the base table
  database: use valid permit for counter read-before-write
  database: introduce make_query_class_config()
  reader_concurrency_semaphore: remove wait_admission and consume_resources()
  test: move away from reader_concurrency_semaphore::wait_admission()
  reader_permit: resource_units: introduce add()
  mutation_reader: restricted_reader: work in terms of reader_permit
  row_cache: pass a valid permit to underlying read
  memtable: pass a valid permit to the delegate reader
  table: require a valid permit to be passed to most read methods
  multishard_mutation_query: pass a valid permit to shard mutation sources
  querier: add reader_permit parameter and forward it to the mutation_source
  ...
2020-05-29 10:11:44 +03:00
Raphael S. Carvalho
097a5e9e07 compaction: Disable garbage collected writer if interposer consumer is used
GC writer, used for incremental compaction, cannot be currently used if interposer
consumer is used. That's because compaction assumes that GC writer will be operated
only by a single compaction writer at a given point in time.
With interposer consumer, multiple writers will concurrently operate on the same
GC writer, leading to a race condition which can potentially result in use-after-free.

Let's disable GC writer if interposer consumer is enabled. We're not losing anything
because GC writer is currently only needed on strategies which don't implement an
interposer consumer. Resharding will always disable GC writer, which is the expected
behavior because it doesn't support incremental compaction yet.
The proper fix, which allows GC writer and interposer consumer to work together,
will require more time to implement and test, and for that reason, I am postponing
it as #6472 is a showstopper for the current release.

Fixes #6472.

tests: mode(dev).

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200526195428.230472-1-raphaelsc@scylladb.com>
2020-05-29 08:26:43 +02:00
Botond Dénes
d68ac8bf18 treewide: remove all uses of no_reader_permit()
2020-05-28 11:34:35 +03:00
Botond Dénes
e4c591aa67 database: introduce make_query_class_config()
And use it to obtain any query-class specific configuration that was
obtained from `table::config` before, such as the read concurrency
semaphore and the max memory limit for unlimited queries. As all users
of these items get these from the query class config now, we can remove
them from `table::config`.
2020-05-28 11:34:35 +03:00
Botond Dénes
cc5137ffe3 table: require a valid permit to be passed to most read methods
Now that the most prevalent users (range scan and single partition
reads) all pass valid permits we require all users to do so and
propagate the permit down towards `make_sstable_reader()`. The plan is
to use this permit for restricting the sstable readers, instead of the
semaphore the table is configured with. The various
`make_streaming_*reader()` overloads keep using the internal semaphores
as before, but they also create the permit before the read starts and pass it to
`make_sstable_reader()`.
2020-05-28 11:34:35 +03:00
Glauber Costa
e29701ca1c compaction_manager: expand state to be able to differentiate between enabled and stopped
We are having many issues with the stop code in the compaction_manager.
Part of the reason is that the "stopped" state has its meaning overloaded
to indicate both "compaction manager is not accepting compactions" and
"compaction manager is not ready or destructed".

In a later step we could default to enabled-at-start, but right now we
maintain current behavior to minimize noise.

It is only possible to stop the compaction manager once.
It is possible to enable / disable the compaction manager many times.
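
A minimal sketch of the state split described above, assuming three states and
a manager that starts out disabled. The class name, the `bool` return values
and the choice to ignore enable/disable after stop are assumptions made for
illustration, not the real compaction_manager interface.

```cpp
#include <cassert>

// Hypothetical model: "disabled" means not accepting compactions,
// "stopped" is terminal and can be entered only once.
class compaction_manager_sketch {
public:
    enum class state { disabled, enabled, stopped };

    state current() const { return _state; }
    bool can_submit() const { return _state == state::enabled; }

    // May be called any number of times until the manager is stopped.
    bool enable() { return transition(state::enabled); }
    bool disable() { return transition(state::disabled); }

    // Returns false if the manager was already stopped.
    bool stop() { return transition(state::stopped); }

private:
    bool transition(state next) {
        if (_state == state::stopped) {
            return false; // terminal state, no further transitions
        }
        _state = next;
        return true;
    }

    state _state = state::disabled; // assumption: not enabled at start
};
```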

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-05-13 16:51:25 -04:00
Glauber Costa
70a89ab4ab compaction: do not assume I/O priority class
We shouldn't assume the I/O priority class for compactions.  For
instance, if we are dealing with offstrategy compactions we may want to
use the maintenance group priority for them.

For now, all compactions are put in the compaction class.  Rewrite
compactions (scrub, cleanup) could be maintenance, but we don't have
clear access to the database object at this time to derive the
equivalent CPU priority. This is planned to be changed in the future,
and when we do change it, we'll adjust.

Same goes for resharding: while we could change it at this point, we'd
risk memory pressure since resharding is run online and sstables are
shared until resharding is done. When we move it to offline execution
we'll do it with maintenance priority.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200512002233.306538-3-glauber@scylladb.com>
2020-05-12 08:23:19 +03:00
Ivan Prisyazhnyy
84e25e8ba4 api: support table auto compaction control
The patch implements:

- /storage_service/auto_compaction API endpoint
- /column_family/autocompaction/{name} API endpoint

These APIs allow controlling and querying the status of background
compaction jobs for existing tables.

The implementation introduces table::_compaction_disabled_by_user.
The CompactionManager then checks whether it can push a background
compaction job for the corresponding table.

New members
===

    table::enable_auto_compaction();
    table::disable_auto_compaction();
    bool table::is_auto_compaction_disabled_by_user() const
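
A rough sketch of how these members could fit together. The member names come
from this message; `table_sketch`, the default value of the flag and the
`can_submit_background_compaction` helper are assumptions for illustration only.

```cpp
#include <cassert>

// Hypothetical stand-in for the relevant part of `table`.
class table_sketch {
public:
    void enable_auto_compaction() { _compaction_disabled_by_user = false; }
    void disable_auto_compaction() { _compaction_disabled_by_user = true; }
    bool is_auto_compaction_disabled_by_user() const {
        return _compaction_disabled_by_user;
    }

private:
    bool _compaction_disabled_by_user = false; // assumption: enabled by default
};

// Hypothetical check the compaction manager could run before pushing
// a background compaction job for a table.
inline bool can_submit_background_compaction(const table_sketch& t) {
    return !t.is_auto_compaction_disabled_by_user();
}
```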

Test
===
Tests: unit(sstable_datafile_test autocompaction_control_test), manual

    $ ninja build/dev/test/boost/sstable_datafile_test
    $ ./build/dev/test/boost/sstable_datafile_test --run_test=autocompaction_control_test -- -c1 -m2G --overprovisioned --unsafe-bypass-fsync 1 --blocked-reactor-notify-ms 2000000

The test tries to submit a compaction job after toggling
the table's autocompaction control switch. However, there is
no reliable way to hook a pending compaction task. The code
assumed that the with_scheduling_group() closure would never
preempt execution of the stats check.

Revert
===
Reverts commit c8247ac. In the previous version, execution
sometimes resulted in the following error:

    test/boost/sstable_datafile_test.cc(1076): fatal error: in "autocompaction_control_test":
    critical check cm->get_stats().pending_tasks == 1 || cm->get_stats().active_tasks == 1 has failed

This version adds a few sstables to the cf, starts
the compaction and waits until it is finished.

API change
===

- `/column_family/autocompaction/` previously always returned `true` in answer to the question "is autocompaction disabled?" (see https://github.com/scylladb/scylla-jmx/blob/master/src/main/java/org/apache/cassandra/db/ColumnFamilyStore.java#L321). It now answers the question "is autocompaction enabled for the specific table?". The logic of the question is inverted, so a patch to the JMX is required. However, the change is acceptable because all old values were invalid (it always reported that all compactions are disabled).
- `/column_family/autocompaction/` got support for POST/DELETE per table

Fixes
===

Fixes #1488
Fixes #1808
Fixes #440

Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
Reviewed-by: Glauber Costa <glauber@scylladb.com>
2020-05-07 16:23:38 +03:00
Raphael S. Carvalho
a214ccdf89 sstables/compaction: Don't invalidate row cache when adding GC SSTable to SSTable set
A garbage-collected (GC) SSTable is incorrectly added to the SSTable set with a function
that invalidates the row cache. This problem is fixed by adding the GC SSTable
to the set using the mechanism which replaces old sstables with new sstables.

Also, adding the GC SSTable to the set in a separate call is not correct.
We should make sure that the GC SSTable reaches the SSTable set at the same time
its respective old (input) SSTable is removed from the set, and that's done
using a single request call to table.

Fixes #5956.
Fixes #6275.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-05-05 12:03:19 -03:00
Raphael S. Carvalho
8f4458f1d5 sstables/compaction: Change meaning of compaction_completion_desc input and output fields
input_sstables is renamed to old_sstables and is about old SSTables that should be
deleted and removed from the SSTable set.
output_sstables is renamed to new_sstables and is about new SSTables that should be
added to the SSTable set, replacing the old ones.

This will allow us, for example, to add auxiliary SSTables to the SSTable set using
the same call which replaces input SSTables with output SSTables in compaction.
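
As a sketch of the idea, under the assumption that sstables can be modelled as
plain names and the SSTable set as a vector: the field names old_sstables and
new_sstables come from this message, while `compaction_completion_desc_sketch`
and `replace_sstables` are hypothetical and not the actual table interface.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

using sstable_name = std::string; // stand-in for a shared sstable pointer

struct compaction_completion_desc_sketch {
    std::vector<sstable_name> old_sstables; // removed from the set and deleted
    std::vector<sstable_name> new_sstables; // added to the set, replacing the old ones
};

// Builds the next set first and swaps it in with one assignment, so old
// sstables leave the set at the same time as the new ones (including any
// auxiliary sstable, such as a GC one) enter it.
inline void replace_sstables(std::vector<sstable_name>& set,
                             const compaction_completion_desc_sketch& desc) {
    std::vector<sstable_name> next;
    for (const auto& s : set) {
        const bool is_old = std::find(desc.old_sstables.begin(),
                                      desc.old_sstables.end(), s)
                            != desc.old_sstables.end();
        if (!is_old) {
            next.push_back(s);
        }
    }
    next.insert(next.end(), desc.new_sstables.begin(), desc.new_sstables.end());
    set = std::move(next);
}
```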

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2020-05-05 12:03:08 -03:00
Glauber Costa
55f5ca39a9 sstable_test: rework test to use a thread
The compaction_manager test lives inside a thread but does not take
advantage of it, with continuations all over.

One of the side effects of this is that the test is calling stop() twice
on the compaction_manager.  While this works today, it is not good
practice. A change I am making is just about to break it.

This patch converts the test to fully use .get() instead of chained
continuations and in doing so also guarantees that the compaction
manager will be RAII-stopped just once, from a defer object.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200503161420.8346-2-glauber@scylladb.com>
2020-05-03 19:54:04 +03:00
Pekka Enberg
c8247aced6 Revert "api: support table auto compaction control"
This reverts commit 1c444b7e1e. The test
it adds sometimes fails as follows:

  test/boost/sstable_datafile_test.cc(1076): fatal error: in "autocompaction_control_test":
  critical check cm->get_stats().pending_tasks == 1 || cm->get_stats().active_tasks == 1 has failed

Ivan is working on a fix, but let's revert this commit to avoid blocking
next promotion with a test that fails from time to time.
2020-04-11 17:56:02 +03:00
Ivan Prisyazhnyy
1c444b7e1e api: support table auto compaction control
This patch adds the API endpoint /column_family/autocompaction/{name},
which listens for GET and POST requests to query and control a table's
background compactions.

To implement that, the patch introduces the "_compaction_disabled_by_user"
flag, which determines whether CompactionManager is allowed to push background
compaction jobs into the work.

It introduces

    table::enable_auto_compaction();
    table::disable_auto_compaction();
    bool table::is_auto_compaction_disabled_by_user() const

to control auto compaction state.

Fixes #1488
Fixes #1808
Fixes #440
Tests: unit(sstable_datafile_test autocompaction_control_test), manual
2020-04-08 21:18:38 +03:00