Commit Graph

29204 Commits

Jan Ciolek
7bbfa48bc5 cql3: Add has_eq_restriction_on_column function
Adds a function that checks whether a given expression has an EQ restriction
on the specified column.

It finds restrictions like
col = ...
or
(col, col2) = ...

IN restrictions don't count; they aren't EQ restrictions.
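
A minimal sketch of such a check, over deliberately simplified, hypothetical expression types (the real cql3::expr machinery is far richer):
```
#include <string>
#include <variant>
#include <vector>

// Simplified stand-ins for the real cql3 expression types.
struct column { std::string name; };
struct eq_restriction { std::vector<column> columns; };  // col = ... or (col, col2) = ...
struct in_restriction { std::vector<column> columns; };  // col IN (...): must not match
using restriction = std::variant<eq_restriction, in_restriction>;

bool has_eq_restriction_on_column(const std::vector<restriction>& where,
                                  const std::string& col_name) {
    for (const auto& r : where) {
        // Only EQ restrictions count; IN restrictions are deliberately skipped.
        if (const auto* eq = std::get_if<eq_restriction>(&r)) {
            for (const auto& c : eq->columns) {
                if (c.name == col_name) {
                    return true;
                }
            }
        }
    }
    return false;
}
```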

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-12-09 12:06:43 +01:00
Jan Ciolek
f76a1cd4bf cql3: Reorganize orderings code
Reorganized the code that handles column ordering (ASC or DESC).
I feel that it's now clearer and easier to understand.

Added an enum that describes column ordering.
It has two possible values: ascending or descending.
It used to be a bool that was sometimes called 'reversed',
which could mean multiple things.

Instead of column.type->is_reversed() != <ordering bool>
there is now a function called are_column_select_results_reversed.

Split checking whether the ordering is reversed and verifying that it's correct into two functions.
Before, all of this was done by is_reversed().
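
A hedged sketch of the enum and the reversal check described above; only the names taken from the commit text are real, the surrounding types are invented stand-ins:
```
// Illustrative stand-ins for the real schema types.
enum class ordering { ascending, descending };

struct column_type { bool reversed; };  // e.g. set by "WITH CLUSTERING ORDER BY (c DESC)"

// True when results for this column come back in the opposite order to the
// requested one, i.e. the column's native order and the requested order differ.
bool are_column_select_results_reversed(const column_type& type, ordering requested) {
    bool native_descending = type.reversed;
    bool requested_descending = (requested == ordering::descending);
    return native_descending != requested_descending;
}
```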

This is a preparation to later allow skipping ORDER BY restrictions on some columns.
Adding this to the existing code caused it to get quite complex,
but this new version is better suited for the task.

The diff is a bit messy because I moved all ordering functions to one place;
it's better to read select_statement.cc lines 1495-1651 directly.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-12-09 12:06:42 +01:00
Piotr Sarna
337906bc1c alternator: precompute scan range parameters in a function
This commit addresses a very simple FIXME left in the alternator TTL
implementation - it reduces the number of parameters passed
to scan_table_ranges() by enclosing them in a separate
object.
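
A hedged illustration of the parameter-object refactoring; the field names here are invented for the example:
```
#include <string>

// Hypothetical parameter object; the real fields in the TTL code differ.
struct scan_ranges_context {
    std::string table_name;
    unsigned shard;
    bool include_expired;
};

// One struct parameter instead of a long positional argument list.
void scan_table_ranges(const scan_ranges_context& ctx) {
    // ... scan using ctx.table_name, ctx.shard, ctx.include_expired ...
    (void)ctx;
}
```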

Tests: unit(release)

Message-Id: <214afcd9d5c1968182ad98550105f82add216c80.1638354094.git.sarna@scylladb.com>
2021-12-02 10:04:05 +02:00
Raphael S. Carvalho
de165b864c repair: Enable off-strategy compaction for rebuild
Let's enable off-strategy compaction for repair-based rebuild, so it can take
advantage of off-strategy benefits, one of the most important
being that compaction doesn't act aggressively, which is important
both for reducing operation time and for delivering good latency
while the operation is running.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211130115957.13779-1-raphaelsc@scylladb.com>
2021-12-02 09:58:58 +02:00
David Garcia
954d5d5d63 Fix cql docs error
Closes #9613
2021-12-02 09:58:58 +02:00
Avi Kivity
ef3edcf848 test: refine test suite names exposed via xunit format
The test suite names seen by Jenkins are suboptimal: there is
no distinction between modes, and the ".cc" suffix of file names
is interpreted as a class name, which is converted to a tree node
that must be clicked to expand. Massage the names to remove
unnecessary information and add the mode.

Closes #9696
2021-12-02 09:58:58 +02:00
Avi Kivity
9edd86362a test: sstable_test: don't read compressed file size from closed file
We read the compressed file size from a file that was already closed,
resulting in EBADF on my machine. Not sure why it works for everyone
else.

Fix by reading the size using the path.
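
The idea of the fix in a few lines, assuming Seastar's file_size() free function (exact signatures differ across Seastar versions):
```
#include <cstdint>
#include <seastar/core/future.hh>
#include <seastar/core/seastar.hh>
#include <seastar/core/sstring.hh>

// Query the size via the filesystem (by path), not via a file handle that
// may already have been closed.
seastar::future<uint64_t> compressed_file_size(seastar::sstring path) {
    return seastar::file_size(path);
}
```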

Closes #9675
2021-12-01 16:28:46 +02:00
Raphael S. Carvalho
f23e0d7f2d compaction_manager: Disregard inactive tasks when filtering sstables
After commit 1f5b17f, overlapping can be introduced in level 1 because the
procedure that filters out sstables from partial runs considers
inactive tasks, so L1 sstables can be incorrectly filtered out from
the next compaction attempt. When L0 is merged into L1, overlapping is
then introduced in L1 because the old L1 sstables weren't considered in
the L0 -> L1 compaction.

From now on, compaction_manager::get_candidates() will only consider
active tasks, to make sure actual partial runs are filtered out.
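
A hypothetical, heavily simplified sketch of the filtering rule after the fix; types and member names are illustrative:
```
#include <memory>
#include <unordered_set>
#include <vector>

struct sstable { int run_id; };
struct task { bool compaction_running; std::unordered_set<int> run_ids; };

std::vector<sstable> get_candidates(const std::vector<sstable>& all,
                                    const std::vector<std::shared_ptr<task>>& tasks) {
    std::unordered_set<int> active_runs;
    for (const auto& t : tasks) {
        if (t->compaction_running) {  // inactive tasks are now ignored
            active_runs.insert(t->run_ids.begin(), t->run_ids.end());
        }
    }
    std::vector<sstable> candidates;
    for (const auto& sst : all) {
        // Only sstables belonging to *actual* partial runs are filtered out.
        if (!active_runs.count(sst.run_id)) {
            candidates.push_back(sst);
        }
    }
    return candidates;
}
```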

Fixes #9693.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211129180459.125847-1-raphaelsc@scylladb.com>
2021-12-01 16:11:44 +02:00
Raphael S. Carvalho
9de7abdc80 compaction: LCS: Fix inefficiency when pushing SSTables to higher levels
To satisfy the backlog controller, commit 28382cb25c changed LCS to
incrementally push sstables to the highest level *when* there's nothing else
to be done.

That's overkill, because the controller will be satisfied with level L being
fanout times larger than L-1. There's no need to push everything to the last
level; that's even worse than a major compaction, because any file being
promoted will overlap with ~10 files in the next level. The cost is at least
amortized by multiple iterations, but the terrible write amplification is
still there, reducing overall efficiency.
For example, it might happen that LCS in table A starts pushing everything
to the highest level while table B needs resources for compaction to reduce
its backlog. The increased write amplification in A may prevent other tables
from reducing their backlogs in a timely manner.

It's clear that LCS should stop promoting as soon as level L is 10x
larger than L-1; the strategy will still be satisfied while the
inefficiency problem is fixed.

Now the layout will look as follows:
SSTables in each level: [0, 2, 15, 121]

Previously, once the table stopped being written to, it looked like:
SSTables in each level: [0, 0, 0, 138]

It's always good to have everything in a single run, but that comes
with a high write amplification cost which we cannot afford in steady
state. With this change, the layout will still be good enough to make
everybody happy.
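
A hedged sketch of the promotion cut-off described above; names and types are illustrative only:
```
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint64_t fanout = 10;  // LCS default fanout

// Promote sstables from level l-1 into level l (l >= 1) only while l is
// still smaller than fanout * size(l-1); beyond that the backlog
// controller is already satisfied.
bool should_promote(const std::vector<uint64_t>& level_bytes, std::size_t l) {
    return level_bytes[l] < fanout * level_bytes[l - 1];
}
```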

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211129143606.71257-1-raphaelsc@scylladb.com>
2021-12-01 16:10:25 +02:00
Gleb Natapov
f2ab5f4e60 raft service: insert a new raft instance into the servers' list only after it is started
The RPC module starts dispatching calls to a server the moment it is in the
servers' list, but until raft::server::start() completes the instance is
not fully created yet and is not ready to accept anything. Fix the code
that initializes a new raft group so it inserts the new raft instance into
the list only after it is started.
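
A minimal sketch of the start-before-publish ordering using Seastar coroutines; the raft types here are illustrative stand-ins, not the actual Scylla code:
```
#include <map>
#include <memory>
#include <utility>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

// Illustrative stand-ins for the real raft server/registry types.
struct raft_server {
    seastar::future<> start() { return seastar::make_ready_future<>(); }
};
using group_id = int;

struct group_registry {
    // RPC dispatch looks servers up here, so insertion makes them reachable.
    std::map<group_id, std::unique_ptr<raft_server>> servers;

    seastar::future<> add_group(group_id gid, std::unique_ptr<raft_server> srv) {
        co_await srv->start();                 // finish initialization first...
        servers.emplace(gid, std::move(srv));  // ...only then publish to RPC
    }
};
```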

Message-Id: <YZTFFW9v0NlV7spR@scylladb.com>
2021-12-01 13:11:49 +01:00
Nadav Har'El
94eb5c55c8 Merge 'Loading cache improve eviction use policy' from Vladislav Zolotarov
This series introduces a new version of a loading_cache class.
The old implementation was susceptible to a "pollution" phenomenon where a frequently used entry could get evicted by an intensive burst of "used once" entries pushed into the cache.

The new version has privileged and unprivileged cache sections, and there's a new loading_cache template parameter, SectionHitThreshold. The new cache algorithm goes as follows (a sketch of the policy follows the list):
  * We define 2 dynamic cache sections whose total size should not exceed the maximum cache size.
  * A new cache entry is always added to the "unprivileged" section.
  * After a cache entry is read more than SectionHitThreshold times, it moves to the second, privileged cache section.
  * Both sections' entries obey expiration and reload rules in the same way as before this patch.
  * When cache entries need to be evicted due to a size restriction, the "unprivileged" section's
    least recently used entries are evicted first.
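
A hedged sketch of the two-section policy (expiration, reload, and the async loading machinery omitted; all names illustrative):
```
#include <cstddef>
#include <list>
#include <string>
#include <utility>

template <int SectionHitThreshold>
class two_section_cache {
    struct entry { std::string key, value; int hits = 0; bool privileged = false; };
    std::list<entry> _unprivileged;  // LRU order: front = most recent; evicted first
    std::list<entry> _privileged;    // evicted only when the other section is empty
    std::size_t _max_size;

public:
    explicit two_section_cache(std::size_t max_size) : _max_size(max_size) {}

    void put(std::string key, std::string value) {
        // Shrink *before* inserting, so the newcomer can't be evicted right
        // away even when every other entry is privileged (see the Note in
        // the individual commit further down).
        shrink_to(_max_size > 0 ? _max_size - 1 : 0);
        _unprivileged.push_front({std::move(key), std::move(value)});
    }

    void on_hit(entry& e) {
        if (!e.privileged && ++e.hits > SectionHitThreshold) {
            e.privileged = true;  // promote (list splicing omitted for brevity)
        }
    }

private:
    void shrink_to(std::size_t target) {
        while (_unprivileged.size() + _privileged.size() > target) {
            if (!_unprivileged.empty()) {
                _unprivileged.pop_back();  // unprivileged LRU goes first
            } else {
                _privileged.pop_back();
            }
        }
    }
};
```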

More details may be found in #8674.

In addition, during testing another issue was found in the authorized_prepared_statements_cache: #9590.
There is a patch that fixes it as well.

Closes #9708

* github.com:scylladb/scylla:
  loading_cache: account unprivileged section evictions
  loading_cache: implement a variation of least frequent recently used (LFRU) eviction policy
  authorized_prepared_statements_cache: always "touch" a corresponding cache entry when accessed
  loading_cache::timestamped::lru_entry: refactoring
  loading_cache.hh: rearrange the code (no functional change)
  loading_cache: use std::pmr::polymorphic_allocator
2021-12-01 13:13:53 +02:00
Calle Wilund
3e21fea2b6 test_streams: test_streams_starting_sequence_number fix 'LastEvaluatedShardId' usage
It is not part of the raw response, but of the 'StreamDescription' object.
The test fails intermittently depending on PK randomization.

Closes #9710
2021-12-01 11:05:40 +02:00
Avi Kivity
03755b362a Merge 'compaction_manager api: stop ongoing compactions' from Benny Halevy
This series extends `compaction_manager::stop_ongoing_compaction` so it can be used from the API layer for:
- table::disable_auto_compaction
- compaction_manager::stop_compaction

Fixes #9313
Fixes #9695

Test: unit(dev)

Closes #9699

* github.com:scylladb/scylla:
  compaction_manager: stop_compaction: wait for ongoing compactions to stop
  compaction_manager: stop_ongoing_compactions: log Stopping 0 tasks at debug level
  compaction_manager: unify stop_ongoing_compactions implementations
  compaction_manager: stop_ongoing_compactions: add compaction_type option
  compaction_manager: get_compactions: get a table* parameter
  table: disable_auto_compaction: stop ongoing compactions
  compaction_manager: make stop_ongoing_compactions public
  table: futurize disable_auto_compactions
2021-11-30 19:08:14 +02:00
Avi Kivity
595cc328b1 Merge 'cql3: Remove term, replace with expression' from Jan Ciołek
This PR finally removes the `term` class and replaces it with `expression`.

* There was some trouble with `lwt_cache_id` in `expr::function_call`.
  The current code works the following way:
  * for each `function_call` inside a `term` that describes a pk restriction, `prepare_context::add_pk_function_call` is called.
  * `add_pk_function_call` takes a `::shared_ptr<cql3::functions::function_call>`, sets its `cache_id` and pushes this shared pointer onto a vector of all collected function calls
  * Later, when some condition is met, we want to clear the cache ids of all those collected function calls. To do this we iterate through the shared pointers collected in `prepare_context` and clear the cache id for each of them.

  This doesn't work with `expr::function_call` because it isn't kept inside a shared pointer.
  To solve this I put the `lwt_cache_id` inside a shared pointer and then `prepare_context` collects these shared pointers to cache ids.

  I also experimented with doing this without any shared pointers: maybe we could just walk through the expression and clear the cache ids ourselves. But the problem is that expressions are copied all the time; we could clear the cache in one place but forget about a copy. Doing it with shared pointers more closely matches the original behaviour (a sketch of this approach follows the list below).
The experiment is on the [term2-pr3-backup-altcache](https://github.com/cvybhu/scylla/tree/term2-pr3-backup-altcache) branch
* `shared_ptr<term>` being `nullptr` could mean:
  * It represents a cql value `null`
  * That there is no value, like `std::nullopt` (for example in `attributes.hh`)
  * That it's a mistake, it shouldn't be possible

  A good way to distinguish between optional and mistake is to look for `my_term->bind_and_get()`; we then know that it's not an optional value.

* On the other hand, `raw_value` cast to bool means:
   * `false` - null or unset
   * `true` - some value, maybe empty

I ran a simple benchmark on my laptop to see how performance is affected:
```
build/release/test/perf/perf_simple_query --smp 1 -m 1G --operations-per-shard 1000000 --task-quota-ms 10
```
* On master (a21b1fbb2f) I get:
  ```
  176506.60 tps ( 77.0 allocs/op,  12.0 tasks/op,   45831 insns/op)

  median 176506.60 tps ( 77.0 allocs/op,  12.0 tasks/op,   45831 insns/op)
  median absolute deviation: 0.00
  maximum: 176506.60
  minimum: 176506.60
  ```
* On this branch I get:
  ```
  172225.30 tps ( 75.1 allocs/op,  12.1 tasks/op,   46106 insns/op)

  median 172225.30 tps ( 75.1 allocs/op,  12.1 tasks/op,   46106 insns/op)
  median absolute deviation: 0.00
  maximum: 172225.30
  minimum: 172225.30
  ```

Closes #9481

* github.com:scylladb/scylla:
  cql3: Remove remaining mentions of term
  cql3: Remove term
  cql3: Rename prepare_term to prepare_expression
  cql3: Make prepare_term return an expression instead of term
  cql3: expr: Add size check to evaluate_set
  cql3: expr: Add expr::contains_bind_marker
  cql3: expr: Rename find_atom to find_binop
  cql3: expr: Add find_in_expression
  cql3: Remove term in operations
  cql3: Remove term in relations
  cql3: Remove term in multi_column_restrictions
  cql3: Remove term in term_slice, rename to bounds_slice
  cql3: expr: Remove term in expression
  cql3: expr: Add evaluate_IN_list(expression, options)
  cql3: Remove term in column_condition
  cql3: Remove term in select_statement
  cql3: Remove term in update_statement
  cql3: Use internal cql format in insert_prepared_json_statement cache
  types: Add map_type_impl::serialize(range of <bytes, bytes>)
  cql3: Remove term in cql3/attributes
  cql3: expr: Add constant::view() method
  cql3: expr: Implement fill_prepare_context(expression)
  cql3: expr: add expr::visit that takes a mutable expression
  cql3: expr: Add receiver to expr::bind_variable
2021-11-30 16:39:39 +02:00
Avi Kivity
078f69c133 Merge "raft: (service) implement group 0 as a service" from Kostja
"
To ensure consistency of schema and topology changes,
Scylla needs linearizable storage for this data
available at every member of the database cluster.

The series introduces such storage as a service,
available to all Scylla subsystems. Using this service, any other
internal service such as gossip or migrations (schema) could
persist changes to cluster metadata and expect this to be done in
a consistent, linearizable way.

The series uses the built-in Raft library to implement a
dedicated Raft group, running on shard 0, which includes all
members of the cluster (group 0); adds hooks to topology change
events, such as adding or removing nodes of the cluster, to update
group 0 membership; and ensures the group is started when the
server boots.

The state machine for the group, i.e. the actual storage
for cluster-wide information still remains a stub. Extending
it to actually persist changes of schema or token ring
is subject to a subsequent series.

Another Raft-related service was implemented earlier: the Raft Group
Registry. The purpose of the registry is to allow Scylla to have an
arbitrary number of groups, each with its own subset of cluster
members and a relevant state machine, sharing a common transport.
Group 0 is one (the first) group among many.
"

* 'raft-group-0-v12' of github.com:scylladb/scylla-dev:
  raft: (server) improve tracing
  raft: (metrics) fix spelling of waiters_awaken
  raft: make forwarding optional
  raft: (service) manage Raft configuration during topology changes
  raft: (service) break a dependency loop
  raft: (discovery) introduce leader discovery state machine
  system_keyspace: mark scylla_local table as always-sync commitlog
  system_keyspace: persistence for Raft Group 0 id and Raft Server Id
  raft: add a test case for adding entries on follower
  raft: (server) allow adding entries/modify config on a follower
  raft: (test) replace virtual with override in derived class
  raft: (server) fix a typo in exception message
  raft: (server) implement id() helper
  raft: (server) remove apply_dummy_entry()
  raft: (test) fix missing initialization in generator.hh
2021-11-30 16:24:51 +02:00
Raphael S. Carvalho
0d5ac845e1 compaction: Make cleanup withstand better disk pressure scenario
It's not uncommon for cleanup to be issued against an entire keyspace,
which may be composed of tons of tables. To increase the chances of success
when low on space, cleanup will now start from the smaller tables first, so
that the bigger tables will have more space available, once they're reached,
to satisfy their space requirements.

parallel_for_each() is dropped; it wasn't needed given that the manager
performs per-shard serialization of cleanup jobs.
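
A hedged sketch of the smallest-first ordering; types and names are illustrative:
```
#include <algorithm>
#include <cstdint>
#include <vector>

struct table_info { uint64_t live_disk_space; };  // illustrative

// Smaller tables first: each completed cleanup frees space that the larger
// tables will need once their turn comes.
void order_for_cleanup(std::vector<table_info>& tables) {
    std::sort(tables.begin(), tables.end(),
              [](const table_info& a, const table_info& b) {
                  return a.live_disk_space < b.live_disk_space;
              });
}
```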

Refs #9504.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211130133712.64517-1-raphaelsc@scylladb.com>
2021-11-30 16:15:24 +02:00
Benny Halevy
957003e73f compaction_manager: stop_compaction: wait for ongoing compactions to stop
Similar to #9313, stop_compaction should also reuse the
stop_ongoing_compactions() infrastructure and wait for ongoing
compactions of the given type to stop.

Fixes #9695

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-30 16:09:11 +02:00
Benny Halevy
b9ba181d3c compaction_manager: stop_ongoing_compactions: log Stopping 0 tasks at debug level
Normally, "Stopping 0 tasks for 0 ongoing compactions for table ..."
is not very interesting so demote its log_level to debug.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-30 16:09:11 +02:00
Benny Halevy
03e969dbef compaction_manager: unify stop_ongoing_compactions implementations
Now stop_ongoing_compactions(reason) is equivalent to
stop_ongoing_compactions(reason, nullptr, std::nullopt),
so share the code of the latter for the former entry point.
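
The unification, reduced to a self-contained sketch (the real signatures in compaction_manager differ in detail):
```
#include <optional>
#include <string>
#include <utility>

struct table;  // a forward declaration is enough for a pointer parameter
enum class compaction_type { regular, cleanup, upgrade, scrub };

struct compaction_manager {
    // General entry point (body elided in this sketch).
    void stop_ongoing_compactions(std::string reason, table* t,
                                  std::optional<compaction_type> type) {
        // ... stop every task matching (t, type); all tasks if unset ...
        (void)reason; (void)t; (void)type;
    }

    // The old entry point is now just sugar over the general one.
    void stop_ongoing_compactions(std::string reason) {
        stop_ongoing_compactions(std::move(reason), nullptr, std::nullopt);
    }
};
```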

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-30 16:09:07 +02:00
Benny Halevy
94011bdcca compaction_manager: stop_ongoing_compactions: add compaction_type option
And make the table optional as well, so it can be used
by stop_compaction() to stop a particular compaction type on all tables.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-30 16:07:47 +02:00
Benny Halevy
a419759835 compaction_manager: get_compactions: get a table* parameter
Optionally get the running compactions on the provided table.
This is required for stop_ongoing_compactions on a given table.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-30 16:06:34 +02:00
Benny Halevy
4affa801a5 table: disable_auto_compaction: stop ongoing compactions
The API call prevents new regular compaction jobs from starting,
but it doesn't wait for ongoing compactions to stop, which makes it
much less useful.

Returning only after stopping regular compaction jobs and waiting
for them to stop guarantees that no regular compaction jobs are
running when nodetool disableautocompaction returns successfully.

Fixes #9313

Test: sstable_compaction_test,sstable_directory_test(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-30 16:06:34 +02:00
Benny Halevy
3c721eb228 compaction_manager: make stop_ongoing_compactions public
So it can be used directly by table code
in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-30 16:06:29 +02:00
Raphael S. Carvalho
3006394312 compaction: Allow incremental compaction with interposer consumer
Until commit c94e6f8567, the interposer consumer wouldn't work
with our GC writer, which is needed for incremental compaction correctness.
Now that the technical debt is gone, let's allow incremental
compaction with the interposer consumer.

The only change needed is serialization of the replacer, as two
consumers cannot step on each other's toes, e.g. when we have concurrent
bucket writers with TWCS.

sstable_compaction_test.test_bug_6472 passes with this change,
which was added when #6472 was fixed by not allowing incremental
compaction with interposer consumer.

Refs #6472.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211126191000.43292-1-raphaelsc@scylladb.com>
2021-11-30 15:24:17 +02:00
Eliran Sinvani
ddd7248b3b testlib: close index_reader to avoid race condition
In order to avoid the race condition introduced in 9dce1e4, the
index_reader should be closed prior to its destruction.
Only 4.4 and earlier releases are exposed to this specific race.
However, it is always a good idea to first close the index reader
and only then destroy it, since that is what developers who change
the index reader in the future are most likely to assume.

Ref #9704 (because 4.4 and earlier releases are vulnerable).

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>

Closes #9705
2021-11-30 13:05:24 +01:00
Benny Halevy
b60d697084 table: futurize disable_auto_compactions
So it can stop ongoing compactions and wait
for them to complete.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-30 08:33:04 +02:00
Vlad Zolotarov
4cb245fe3c loading_cache: account unprivileged section evictions
Provide a template parameter for a static callbacks object that
increments a counter of evictions from the unprivileged section.

Entries getting evicted from the cache while still in the unprivileged
section indicates inefficient usage of the cache and should be
investigated.

This patch instruments the authorized_prepared_statements_cache and
prepared_statements_cache objects to provide non-empty callbacks.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2021-11-29 21:45:53 -05:00
Vlad Zolotarov
1a9c6d9fd3 loading_cache: implement a variation of least frequent recently used (LFRU) eviction policy
This patch implements a simple variation of the LFRU eviction policy:
  * We define 2 dynamic cache sections whose total size should not exceed the maximum cache size.
  * A new cache entry is always added to the "unprivileged" section.
  * After a cache entry is read more than SectionHitThreshold times, it moves to the second, privileged cache section.
  * Both sections' entries obey expiration and reload rules in the same way as before this patch.
  * When cache entries need to be evicted due to a size restriction, the "unprivileged" section's
    least recently used entries are evicted first.

Note:
With a two-section cache it's not enough for a new entry to have the latest timestamp
in order not to be evicted right after insertion: e.g. if all other entries
are in the privileged section.

And obviously we want to allow new cache entries to be added to the cache.

Therefore we can no longer first add a new entry and then shrink the cache.
Switching the order of these two operations resolves the problem.

Fixes #8674

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2021-11-29 21:45:21 -05:00
Pavel Solodovnikov
e3f922c48b raft: write raft log in user memory
System dirty memory space is limited to 10MB.
This means that memtables cannot accumulate more than
5MB before they are flushed to sstables.

This can impact performance under load.

Move the `system.raft` table to the regular dirty
memory space.

Fixes: #9692
Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20211129200044.1144961-1-pa.solodovnikov@scylladb.com>
2021-11-29 23:51:24 +01:00
Vlad Zolotarov
66c150769b authorized_prepared_statements_cache: always "touch" a corresponding cache entry when accessed
Always "touch" a prepared_statements_cache entry when it's accessed via
authorized_prepared_statements_cache.

If we don't do this, it may turn out that the most recently used prepared statement doesn't have
the newest last_read timestamp and can get evicted before a not-so-recently-read statement when
we need to create space in the prepared statements cache for a new entry.

This in turn triggers an eviction of the corresponding entry from the authorized_prepared_cache,
breaking the LRU paradigm of these caches.

Fixes #9590

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2021-11-29 17:37:25 -05:00
Nadav Har'El
d9c5c4eab6 test/alternator: tests for Select parameter in GSI and LSI
We already have tests for the behavior of the "Select" parameter when
querying a base table, but this patch adds additional tests for its
behavior when querying a GSI or an LSI. There are some differences:
Select=ALL_PROJECTED_ATTRIBUTES is not allowed for base tables, but is
allowed - and in fact is the default - for GSI and LSI. Also, GSI may
not allow ALL_ATTRIBUTES (which is the default for base tables) if
only a subset of the attributes were projected.

The new tests xfail because the Select and Projection features have
not yet been implemented in Alternator. They pass in DynamoDB.
After this patch we have (hopefully) complete test coverage of the
Select feature, which will be helpful when we start implementing it.

Refs #5058 (Select)
Refs #5036 (Projection)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211125100443.746917-1-nyh@scylladb.com>
2021-11-29 20:28:43 +01:00
Nadav Har'El
1c279118f4 test/alternator: more test cases for Select parameter
Add to the existing tests for the Select parameter of the Query and Scan
operations another check: That when Select is ALL_ATTRIBUTES or COUNT,
specifying AttributesToGet or ProjectionExpression is forbidden -
because the combination doesn't make sense.

The expanded test continues to xfail on Alternator (because the Select
parameter isn't yet implemented), and passes on DynamoDB. Strengthening
the tests for this feature will be helpful when we decide to implement it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211125074128.741677-1-nyh@scylladb.com>
2021-11-29 20:28:25 +01:00
Vlad Zolotarov
cbabde9622 loading_cache::timestamped::lru_entry: refactoring
   * Store a reference to the parent (loading_cache) object instead of holding
     references to separate fields.
   * Access loading_cache fields via accessors.
   * Move the LRU "touch" logic to the loading_cache.
   * Keep only plain "list entry" logic in the lru_entry class.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2021-11-29 14:24:56 -05:00
Vlad Zolotarov
9125b4545e loading_cache.hh: rearrange the code (no functional change)
Hide internal classes inside the loading_cache class:
  * Simpler calls - no need for tricky back-referencing to access loading_cache fields.
  * Cleaner interface.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2021-11-29 14:24:56 -05:00
Vlad Zolotarov
fd92718f48 loading_cache: use std::pmr::polymorphic_allocator
Use std::pmr::polymorphic_allocator instead of
std::allocator - the former doesn't require the allocated
type to be defined at template-specification time.

As a result we won't have to define lru_entry
before loading_cache, which in turn allows us
to rearrange the classes, making them all internal to
loading_cache and hence simplifying the interface.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2021-11-29 14:24:56 -05:00
Raphael S. Carvalho
1f3135abb4 sstable_set: use for_each_sstable() in make_crawling_reader()
sstable_set_impl::all() may have to copy all sstables from multiple
sets if compound. Let's avoid this overhead by using
sstable_set_impl::for_each_sstable().

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211127181037.56542-1-raphaelsc@scylladb.com>
2021-11-29 19:59:39 +02:00
Michael Livshin
f0e2ada748 fix mutation_source::operator bool() for v2 factories
A mutation source is valid when it has either a v1 or v2 flat mutation
reader factory, but `operator bool()` only checks for the former.
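
A sketch of the fix; the member names are illustrative stand-ins for the real reader-factory types:
```
#include <functional>

struct mutation_source {
    std::function<int()> _fn;     // v1 reader factory (stand-in)
    std::function<int()> _fn_v2;  // v2 reader factory (stand-in)

    explicit operator bool() const {
        // Before the fix only _fn was checked, so a v2-only source
        // incorrectly reported itself as invalid.
        return bool(_fn) || bool(_fn_v2);
    }
};
```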

Fixes #9697

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>

Closes #9698
2021-11-29 19:50:37 +02:00
Nadav Har'El
8618346331 config: automate experimental_features_t::all()
experimental_features_t has an all() method, supposedly returning
all values of the enum - but it's easy to forget to update it when
adding a new experimental feature, and it's currently out of sync
(it's missing the ALTERNATOR_TTL option).
We already have another method, map(), where a new experimental feature
must be listed or else it can't be used, so let's just take all()'s
values from map(), automatically, instead of forcing developers to keep
both lists up to date.
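
A hedged sketch of deriving all() from map() so the two can never diverge; the real code uses Scylla's enum_option machinery, and the feature names shown here are only examples:
```
#include <map>
#include <string>
#include <vector>

struct experimental_features_t {
    enum class feature { UDF, ALTERNATOR_STREAMS, ALTERNATOR_TTL };

    // map() is the single source of truth: a feature must appear here
    // to be usable at all.
    static const std::map<std::string, feature>& map() {
        static const std::map<std::string, feature> m = {
            {"udf", feature::UDF},
            {"alternator-streams", feature::ALTERNATOR_STREAMS},
            {"alternator-ttl", feature::ALTERNATOR_TTL},
        };
        return m;
    }

    // all() is generated from map(), so adding a feature in one place suffices.
    static std::vector<feature> all() {
        std::vector<feature> features;
        for (const auto& kv : map()) {
            features.push_back(kv.second);
        }
        return features;
    }
};
```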

Note that using the all() function to enable all experimental features
is not recommended - the best practice is to enable specific experimental
features, not all of them. Nevertheless, this all() function is still used
in one place - in the cql_repl tool - which uses it to enable all
experimental features.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211108135601.78460-1-nyh@scylladb.com>
2021-11-29 18:44:23 +02:00
Tomasz Grabiec
3226c5bf9d Merge 'sstables: mx: enable position fast-forwarding in reverse mode' from Kamil Braun
Most of the machinery was already implemented since it was used when
jumping between clustering ranges of a query slice. We need only perform
one additional thing when performing an index skip during
fast-forwarding: reset the stored range tombstone in the consumer (which
may only be stored in fast-forwarding mode, so it didn't matter that it
wasn't reset earlier). Comments were added to explain the details.

As a preparation for the change, we extend the sstable reversing reader
random schema test with a fast-forwarding test and include some minor
fixes.

Fixes #9427.

Closes #9484

* github.com:scylladb/scylla:
  query-request: add comment about clustering ranges with non-full prefix key bounds
  sstables: mx: enable position fast-forwarding in reverse mode
  test: sstable_conforms_to_mutation_source_test: extend `test_sstable_reversing_reader_random_schema` with fast-forwarding
  test: sstable_conforms_to_mutation_source_test: fix `vector::erase` call
  test: mutation_source_test: extract `forwardable_reader_to_mutation` function
  test: random_schema: fix clustering column printing in `random_schema::cql`
2021-11-29 16:01:53 +01:00
Raphael S. Carvalho
80a1ebf0f3 compaction_manager: Fix race when selecting sstables for rewrite operations
Rewrite operations are scrub, cleanup and upgrade.

The race can happen because 'selection of sstables' and 'marking sstables
as compacting' are decoupled, so any deferring point in between can lead
to a parallel compaction picking the same files. After commit 2cf0c4bbf,
files are marked as compacting before rewrite starts, but that didn't
take into account commit c84217ad, which moved retrieval of
candidates to a deferring thread, before rewrite_sstables() is even
called.

Scrub isn't affected by this because it uses a coarse-grained approach
where the whole operation is run with compaction disabled, which isn't good
because regular compaction cannot run until its completion.

From now on, selection of files and marking them as compacting will
be serialized by running them with compaction disabled.

Now cleanup will also retrieve sstables with compaction disabled,
meaning it will no longer leave uncleaned files behind, which is
important to avoid data resurrection if the node regains ownership of
data in uncleaned files.

Fixes #8168.
Refs #8155.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211129133107.53011-1-raphaelsc@scylladb.com>
2021-11-29 16:27:29 +02:00
Avi Kivity
bcadd8229b Merge "memtable-sstable: Add compacting reader when flushing memtable." from Mikołaj
"
When a memtable contains both mutations and tombstones that delete them,
the output flushed to sstables contains both. Inserting a
compacting reader results in writing smaller sstables and saves
compaction work later.

There are mixed performance implications of this change:
- If no rows are removed, there is a ~12% penalty on writing. Read times
  are not affected. A heuristic is implemented to avoid this problem -
  compaction is executed only if there are tombstones.
- Read and write performance linearly improves with percentage of rows
  removed. At ~15% of rows removed, writes become faster than without
  compaction.

In the tables below in columns 4 and 7, values below 100% denote
improvement and values over 100% denote regression. The tests were
performed on a table with 5 columns and the exact percentages will vary
across different table schemas.

1. percentage removed
2. write duration/row no compaction
3. write duration/row with compaction
4. write performance new/old
5. read duration/row no compaction
6. read duration/row with compaction
7. read performance new/old
  1           2           3          4           5           6          7
5      8.91E-07    9.64E-07    108.25%    6.05E-07    5.76E-07    95.23%
10     9.28E-07    9.94E-07    107.15%    6.14E-07    5.56E-07    90.55%
15     9.27E-07    9.21E-07    99.43%     6.24E-07    5.27E-07    84.39%
20     9.28E-07    9.03E-07    97.31%     6.19E-07    4.83E-07    78.03%
25     9.49E-07    8.58E-07    90.40%     6.40E-07    4.59E-07    71.76%
30     9.68E-07    8.28E-07    85.61%     6.35E-07    4.20E-07    66.07%
35     9.81E-07    8.07E-07    82.26%     6.38E-07    3.88E-07    60.85%
40     9.97E-07    7.81E-07    78.35%     6.43E-07    3.59E-07    55.91%
45     1.01E-06    7.59E-07    75.28%     6.45E-07    3.34E-07    51.75%
50     1.02E-06    7.30E-07    71.52%     6.55E-07    3.00E-07    45.78%
55     1.06E-06    7.08E-07    66.97%     6.65E-07    2.70E-07    40.56%
60     1.04E-06    6.87E-07    66.20%     6.62E-07    2.40E-07    36.22%
65     1.05E-06    6.56E-07    62.49%     6.60E-07    2.12E-07    32.04%
70     1.06E-06    6.34E-07    59.58%     6.66E-07    1.80E-07    27.07%
75     1.07E-06    6.09E-07    56.90%     6.69E-07    1.50E-07    22.38%
80     1.09E-06    5.84E-07    53.58%     6.80E-07    1.20E-07    17.62%
85     1.10E-06    5.56E-07    50.49%     6.83E-07    9.00E-08    13.18%
90     1.11E-06    5.33E-07    47.92%     6.90E-07    5.97E-08    8.66%
95     1.12E-06    5.07E-07    45.10%     6.93E-07    3.04E-08    4.39%
100    1.14E-06    4.87E-07    42.77%     6.97E-07    6.56E-12    0.00%

1. percentage removed
2. write instructions retired/row no compaction
3. write instructions retired/row with compaction
4. write performance new/old
5. read instructions retired/row no compaction
6. read instructions retired/row with compaction
7. read performance new/old
  1     2       3         4    5       6          7
5   10276   11188   108.88% 7735    7297    94.34%
10  10463   10891   104.09% 7797    6913    88.66%
15  10633   10596   99.65%  7852    6529    83.15%
20  10811   10300   95.27%  7910    6145    77.69%
25  10997   9998    90.92%  7976    5755    72.15%
30  11177   9707    86.85%  8033    5376    66.92%
35  11353   9412    82.90%  8092    4992    61.69%
40  11522   9111    79.07%  8143    4604    56.54%
45  11708   8819    75.32%  8208    4224    51.46%
50  11877   8520    71.74%  8259    3836    46.45%
55  12064   8228    68.20%  8325    3456    41.51%
60  12240   7928    64.77%  8382    3069    36.61%
65  12419   7635    61.48%  8440    2688    31.85%
70  12598   7339    58.26%  8499    2304    27.11%
75  12768   7043    55.16%  8549    1920    22.46%
80  12977   6747    51.99%  8616    1536    17.83%
85  13131   6451    49.13%  8673    1152    13.28%
90  13311   6155    46.24%  8731    767     8.78%
95  13487   5858    43.43%  8790    383     4.36%
100 13657   5562    40.73%  8841    0       0.00%
"

* 'add-compacting-reader-when-flushing-memtable-v6' of github.com:mikolajsieluzycki/scylla:
  memtable-sstable: Add compacting reader when flushing memtable.
  memtable-sstable: Track existence of tombstones in memtable.
2021-11-29 15:15:59 +02:00
Mikołaj Sielużycki
a88f7df195 memtable-sstable: Add compacting reader when flushing memtable.
When a memtable contains both mutations and tombstones that delete them,
the output flushed to sstables contains both. Inserting a
compacting reader results in writing smaller sstables and saves
compaction work later.

Performance tests of this change have shown a regression in a common
case where there are no deletes. A heuristic is employed to skip
compaction unless there are tombstones in the memtable to minimise
the impact of that issue.
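
A hedged sketch of the heuristic, with illustrative stand-ins for the real memtable and reader types:
```
#include <utility>

// Illustrative stand-ins for the real Scylla reader/memtable types.
struct flat_mutation_reader {};
struct memtable {
    bool has_tombstones = false;  // flag maintained as mutations are applied
    flat_mutation_reader make_reader() { return {}; }
};
flat_mutation_reader make_compacting_reader(flat_mutation_reader rd) { return rd; }

// Heuristic from the commit: compact on flush only when it can pay off.
flat_mutation_reader make_flush_reader(memtable& mt) {
    auto rd = mt.make_reader();
    if (!mt.has_tombstones) {
        return rd;                                 // no deletes: skip the ~12% write penalty
    }
    return make_compacting_reader(std::move(rd));  // drop shadowed data early
}
```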
2021-11-29 13:19:42 +01:00
Mikołaj Sielużycki
6dd9f63f3b memtable-sstable: Track existence of tombstones in memtable.
Add a flag indicating whether a memtable contains tombstones. It can be used
as a heuristic to determine whether a memtable should be compacted on
flush. It's an intermediate step until we can compact while applying
mutations to a memtable.
2021-11-29 13:06:12 +01:00
Kamil Braun
b2b242d0ad query-request: add comment about clustering ranges with non-full prefix key bounds
2021-11-29 11:10:49 +01:00
Kamil Braun
8722e0d23c sstables: mx: enable position fast-forwarding in reverse mode
Most of the machinery was already implemented since it was used when
jumping between clustering ranges of a query slice. We need only perform
one additional thing when performing an index skip during
fast-forwarding: reset the stored range tombstone in the consumer (which
may only be stored in fast-forwarding mode, so it didn't matter that it
wasn't reset earlier). Comments were added to explain the details.
2021-11-29 11:10:49 +01:00
Kamil Braun
ea6310961c test: sstable_conforms_to_mutation_source_test: extend test_sstable_reversing_reader_random_schema with fast-forwarding
The test would check whether the forward and reverse readers returned
consistent results when created in non-forwarding mode with slicing.

Do the same but using fast-forwarding instead of slicing.

To do this we require a vector of `position_range`s. We also need a
vector of `clustering_range`s for the existing test. We modify the
existing `random_ranges` function to return `position_range`s instead of
`clustering_range`s since `position_range`s are easier to reason about,
especially when we consider non-full clustering key prefixes. A function
is introduced to convert a `position_range` to a `clustering_range` for
the existing test.
2021-11-29 11:10:46 +01:00
Benny Halevy
cf528d7df9 database: shutdown: don't shutdown keyspaces yet
Don't shut down the keyspaces just yet,
since they are needed during shutdown.

FIXME: restore when #8995 is fixed and no queries are issued
after the database shuts down.

Refs #8995
Fixes #9684

Test: unit(dev)
- scylla-gdb test fails locally with #9677

DTest: update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_new_node_while_adding_info_{1,2}_test(dev)
- running now into #8995. dtest fails with unexpected error: "storage_proxy - Exception when communicating with
  127.0.62.4, to read from system_distributed.service_levels:
  seastar::gate_closed_exception (gate closed)"

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211127083348.146649-2-bhalevy@scylladb.com>
2021-11-29 11:59:45 +02:00
Benny Halevy
93367ba55f effective_replication_map_factory: temporarily unregister outstanding maps when destroyed
The next patch will disable stopping the keyspaces
in database shutdown due to #9684.

This will leave outstanding e_r_m:s when the factory
is destroyed. They must be unregistered from the factory
so they won't try to submit_background_work()
to gently clear their contents.

Support that temporarily until shutdown is fixed
to ensure there are no outstanding e_r_m:s when
the factory is destroyed, at which point this
can turn into an internal error.

Refs #8995
Refs #9684

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211127083348.146649-1-bhalevy@scylladb.com>
2021-11-29 11:59:44 +02:00
Nadav Har'El
1e2ecd282a Merge 'Harden compaction manager remove' from Benny Halevy
This series hardens compaction_manager::remove by:
- adding debug logging around task execution and stopping.
- accessing compaction_state as lw_shared_ptr rather than via a raw pointer.
  - with that, detaching it from `_compaction_state` in `compaction_manager::remove` right away, to prevent further use of it while compactions are stopped.
- adding a write_lock in `remove` to make sure the lock is not held by any stray task.

Test: unit(dev), sstable_compaction_test(debug)
Dtest: alternator_tests.py:AlternatorTest.test_slow_query_logging (debug)

Closes #9636

* github.com:scylladb/scylla:
  compaction_manager: add compaction_state when table is constructed
  compaction_manager: remove: fixup indentation
  compaction_manager: remove: detach compaction_state before stopping ongoing compactions
  compaction_manager: remove: serialize stop_ongoing_compactions and gate.close
  compaction_manager: task: keep a reference on compaction_state
  test: sstable_compaction_test: incremental_compaction_data_resurrection_test: stop table before it's destroyed.
  test: sstable_utils: compact_sstables: deregister compaction also on error path
  test: sstable_compaction_test: partial_sstable_run_filtered_out_test: deregister_compaction also on error path
  test: compaction_manager_test: add debug logging to register/deregister compaction
  test: compaction_manager_test: deregister_compaction: erase by iterator
  test: compaction_manager_test: move methods out of line
  compaction_manager: compaction_state: use counter for compaction_disabled
  compaction_manager: task: delete move and copy constructors
  compaction_manager: add per-task debug log messages
  compaction_manager: stop_ongoing_compactions: log number of tasks to stop
2021-11-28 22:12:52 +02:00
Avi Kivity
b23af15432 tests: consolidate boost xunit result files
The recent parallelization of boost unit tests caused an increase
in XML result files. This is challenging for Jenkins, since it
appears to use rpc-over-ssh to read the result files; as a result
it takes more than an hour to read all result files when the Jenkins
main node is not on the same continent as the agent.

To fix this, merge the result files in test.py and leave one result
file per mode. Later we can leave one result file overall (integrating
the mode into the testsuite name), but that can wait.

Tested on a local Jenkins instance (just reading the result files,
not the entire build).

Closes #9668
2021-11-28 22:12:52 +02:00