Commit Graph

155 Commits

Author SHA1 Message Date
Raphael S. Carvalho
f5715d3f0b replica: Move memtables to compaction_group
Now memtables live in compaction_group. Also introduced a function
that selects a group based on token, but today the table always returns
the single group it manages. Once multiple groups are supported,
the function will interpret the token content to select the
group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Raphael S. Carvalho
f4579795e6 replica: move compound SSTable set to compaction group
The group is now responsible for providing the compound set.
The table still has one compound set, which spans all groups for
the cases where we want to ignore group isolation.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Raphael S. Carvalho
6717d96684 replica: move maintenance SSTable set to compaction_group
This commit is restricted to moving maintenance set into compaction_group.
Next, we'll introduce compound set into it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Raphael S. Carvalho
ce8e5f354c replica: move main SSTable set to compaction_group
This commit is restricted to moving the main set into compaction_group.
Next, we'll move the maintenance set into it and finally the memtable.

A method is introduced to figure out which group an sstable belongs
to, but it's still unimplemented as the table is still limited to
a single group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Raphael S. Carvalho
4871f1c97c replica: Introduce compaction_group
Compaction group is a new abstraction used to group SSTables
that are eligible to be compacted together. By this definition,
a table in a given shard has a single compaction group.
The problem with this approach is that data from different vnodes
is intermixed in the same sstable, making it hard to move data
in a given sstable around.
Therefore, we'll want to have multiple groups per table.

A group can be thought of as an isolated LSM tree whose memtable
and sstable files are isolated from other groups.

As for the implementation, the idea is to take a very incremental
approach.
In this commit, we're introducing a single compaction group to
table.
Next, we'll migrate sstable and maintenance set from table
into that single compaction group. And finally, the memtable.

Cache will be shared among the groups, for simplicity.
It works due to its ability to invalidate a subset of the
token range.

There will be 1:1 relationship between compaction_group and
table_state.
We can later rename table_state to compaction_group_state.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Raphael S. Carvalho
a6ecadf3de replica: convert table::stop() into coroutine
await_pending_ops() is marked noexcept today, so it doesn't have to
be implemented with finally() semantics.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-11 14:26:59 -03:00
Benny Halevy
7e4612d3aa mutation_readers: pass tombstone_gc_state to compacting_reader
To be passed further down to `compact_mutation_state` in
a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-07 07:43:14 +03:00
Benny Halevy
2cd3fc2f36 compaction: table_state: add virtual get_tombstone_gc_state method
and override it in table::table_state to get the tombstone_gc_state
from the table's compaction_manager.

It is going to be used in the next patches to pass the gc state
from the compaction_strategy down to sstables and compaction.

table_state_for_test was modified to just keep a null
tombstone_gc_state.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 23:05:39 +03:00
Benny Halevy
71ede6124a db: view: pass base table to view_update_builder
To be used by generate_update() for getting the
tombstone_gc_state via the table's compaction_manager.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 23:04:23 +03:00
Raphael S. Carvalho
631b2d8bdb replica: rename table::on_compaction_completion and coroutinize it
on_compaction_completion() is not very descriptive, so let's rename
it, following the example of
update_sstable_lists_on_off_strategy_completion().

Also, let's coroutinize it, so as to remove the restriction of running
it only inside a thread.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #11407
2022-08-31 06:17:20 +03:00
Benny Halevy
a980510654 table: seal_active_memtable: handle ENOSPC error
Aborting too soon on ENOSPC is too harsh, leading to loss of
availability of the node for reads, while restarting it won't
solve the ENOSPC condition.

Fixes #11245

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11246
2022-08-23 17:58:20 +02:00
Avi Kivity
e9cbc9ee85 Merge 'Add support for empty replica pages' from Botond Dénes
Many tombstones in a partition are a problem that has been plaguing queries since the inception of Scylla (and even before that, as they are a pain in Apache Cassandra too). Tombstones don't count towards the query's page limit, neither the size limit nor the row-count limit. Hence, large spans of tombstones (be they row- or range-tombstones) are problematic: the query can time out while processing such a span of tombstones, as it waits for more live rows to fill the page. In the extreme case a partition becomes entirely unreadable, all read attempts timing out, until compaction manages to purge the tombstones.
The solution proposed in this PR is to pass down a tombstone limit to replicas: when this limit is reached, the replica cuts the page and marks it as a short one, even if the page is currently empty. To make this work, we use the last-position infrastructure added recently by 3131cbea62, so that replicas can provide the position of the last processed item to continue the next page from. Without this, no forward progress could be made in the case of an empty page: the query would continue from the same position on the next page, having to process the same span of tombstones.
The limit can be configured with the newly added `query_tombstone_limit` configuration item, defaulted to 10000. The coordinator will pass this to the newly added `tombstone_limit` field of `read_command`, if the `replica_empty_pages` cluster feature is set.

Upgrade sanity test was conducted as follows:
* Created a cluster of 3 nodes with RF=3 on the master version.
* Wrote a small dataset of 1000 rows.
* Deleted a prefix of 980 rows.
* Started read workload: `scylla-bench -mode=read -workload=uniform -replication-factor=3 -nodes 127.0.0.1,127.0.0.2,127.0.0.3 -clustering-row-count=10000 -duration=10m -rows-per-request=9000 -page-size=100`
* Also did some manual queries via `cqlsh` with smaller page size and tracing on.
* Stopped and upgraded each node one-by-one. New nodes were started with `--query-tombstone-page-limit=10`.
* Confirmed there are no errors or read-repairs.

Perf regression test:
```
build/release/test/perf/perf_simple_query_g -c1 -m2G --concurrency=1000 --task-quota-ms 10 --duration=60
```
Before:
```
median 133665.96 tps ( 62.0 allocs/op,  12.0 tasks/op,   43007 insns/op,        0 errors)
median absolute deviation: 973.40
maximum: 135511.63
minimum: 104978.74
```
After:
```
median 129984.90 tps ( 62.0 allocs/op,  12.0 tasks/op,   43181 insns/op,        0 errors)
median absolute deviation: 2979.13
maximum: 134538.13
minimum: 114688.07
```
Diff: +~200 instructions/op.

Fixes: https://github.com/scylladb/scylla/issues/7689
Fixes: https://github.com/scylladb/scylla/issues/3914
Fixes: https://github.com/scylladb/scylla/issues/7933
Refs: https://github.com/scylladb/scylla/issues/3672

Closes #11053

* github.com:scylladb/scylladb:
  test/cql-pytest: add test for query tombstone page limit
  query-result-writer: stop when tombstone-limit is reached
  service/pager: prepare for empty pages
  service/storage_proxy: set smallest continue pos as query's continue pos
  service/storage_proxy: propagate last position on digest reads
  query: result_merger::get() don't reset last-pos on short-reads and last pages
  query: add tombstone-limit to read-command
  service/storage_proxy: add get_tombstone_limit()
  query: add tombstone_limit type
  db/config: add config item for query tombstone limit
  gms: add cluster feature for empty replica pages
  tree: don't use query::read_command's IDL constructor
2022-08-10 13:38:06 +03:00
Raphael S. Carvalho
ace6334619 replica: table: kill unused _sstables_staging
A good change, as it's one less thing to worry about for compaction
groups.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-08-10 12:32:13 +03:00
Botond Dénes
7730419f5c query-result-writer: stop when tombstone-limit is reached
The query result writer now counts tombstones and cuts the page (marking
it as a short one) when the tombstone limit is reached. This is to avoid
timing out on large spans of tombstones, especially prefixes.
In the case of unpaged queries, we fail the read instead, similarly to
what we do with max result size.
If the limit is 0, the previous behaviour is kept: tombstones are not
taken into consideration at all.
2022-08-10 06:03:38 +03:00
Avi Kivity
8d37370a71 Revert "Merge "memtable-sstable: Add compacting reader when flushing memtable." from Mikołaj"
This reverts commit bcadd8229b, reversing
changes made to cf528d7df9. Since
4bd4aa2e88 ("Merge 'memtable, cache: Eagerly
compact data with tombstones' from Tomasz Grabiec"), memtable is
self-compacting and the extra compaction step only reduces throughput.

The unit test in memtable_test.cc is not reverted as proof that the
revert does not cause a regression.

Closes #11243
2022-08-09 11:23:29 +03:00
Benny Halevy
45ce635527 table: reindent write_schema_as_cql
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
3b2cce068a table: coroutinize write_schema_as_cql
and make sure to always close the output stream.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
dbae7807d1 table: seal_snapshot: maybe_yield when iterating over the table names
Add maybe_yield calls in a tight loop that potentially runs
over thousands of sstable names, to prevent reactor stalls.
Although the per-sstable cost is very small, we've experienced
stalls related to printing in O(#sstables) in compaction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
3ba0c72b77 table: reindent seal_snapshot
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
41a2d09a5d table: coroutinize seal_snapshot
Handle exceptions, making sure the output
stream is properly closed in all cases,
and an intermediate error, if any, is returned as the
final future.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
5316dbbe78 table: delete unused snapshot_manager and pending_snapshots
Now that snapshot orchestration in snapshot_on_all_shards
doesn't use snapshot_manager, get rid of the data structure.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
cca9068cfb table: delete unused snapshot function
Now that snapshot orchestration is done solely
in snapshot_on_all_shards, the per-shard
snapshot function can be deleted.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
351a3a313d table: snapshot_on_all_shards: orchestrate snapshot process
Call take_snapshot on each shard and collect the
returned snapshot_file_sets.

When all are done, move the vector<snapshot_file_set>
to finalize_snapshot.

All that without resorting to using the snapshot_manager
or calling table::snapshot.

Both will be deleted in the following patches.

Fixes #11132

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
84dfd2cabb table: snapshot: move pending_snapshots.erase from seal_snapshot
Now that seal_snapshot doesn't need to look up
the snapshot_manager in pending_snapshots to
get to the file_sets, erasing the snapshot_manager
object can be done in table::snapshot, which
also inserted it there.

This will make it easier to get rid of it in a later patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
39276cacc3 table: finalize_snapshot: take the file sets as a param
and pass them to seal_snapshot, so that the latter won't
need to look up and access the snapshot_manager object.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
4dd56bbd6d table: make seal_snapshot a static member
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
7cb0a3f6f4 table: finalize_snapshot: reindent
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
12716866a9 table: refactor finalize_snapshot out of snapshot
Write schema.cql and the files manifest in finalize_snapshot.
Currently call it from table::snapshot, but it will
be called in a later patch by snapshot_on_all_shards.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
240f83546d table: snapshot: keep per-shard file sets in snapshot_manager
To simplify processing of the per-shard file names
for generating the manifest.

We only need to print them to the manifest at the
end of the process, so there's no point in copying
them around along the way; just move the
foreign unique unordered_set.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
5100c1ba68 table: take_snapshot: return foreign unique ptr
Currently, the sstable file names are created
and destroyed on each shard and are copied by the
"coordinator" shard using submit_to, while the
coroutine holds the source on its stack frame.

To prepare for the next patches that refactor this
code so that the coordinator shard will submit_to
each shard to perform `take_snapshot` and return
the set of sstrings in the future result, we need
to wrap the result in a foreign_ptr so it gets
freed on the shard that created it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
b54626ad0e table: take_snapshot: maybe yield in per-sstable loop
There could be thousands of sstables, so we had better
consider yielding in the tight loop that copies
the sstable names into the unordered_set we return.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
24a1a4069e table: take_snapshot: simplify tables construction code
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
75e38ebccc table: take_snapshot: reindent
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
67c1d00f44 table: take_snapshot: simplify error handling
Don't catch exceptions but rather just return
them in the returned future, as the exception
is handled by the caller.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
ff6508aa53 table: refactor take_snapshot out of snapshot
Do the actual snapshot-taking code in a per-shard
take_snapshot function, to be called from
snapshot_on_all_shards in a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Benny Halevy
4d4ca40c38 table: add snapshot_on_all_shards
Called from the respective database entry points.

Will also be called from the database drop / truncate path
and will be used for central coordination of per-shard
table::snapshot, so we don't have to depend on the snapshot_manager
mechanism, which is fragile and currently causes an abort if we fail
to allocate it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-07 12:53:05 +03:00
Kamil Braun
7b4146dd2a replica: table: get_hit_rate: take const gossiper&
It doesn't use any non-const members.
2022-08-04 12:16:09 +02:00
Benny Halevy
f26e655646 compaction_manager: add maybe_wait_for_sstable_count_reduction
Called from try_flush_memtable_to_sstable,
maybe_wait_for_sstable_count_reduction will wait for
compaction to catch up with memtable flush if
the bucket to compact is inflated, having too many
sstables.  In that case we don't want to add fuel
to the fire by creating yet another sstable.

Fixes #4116

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-28 14:43:30 +03:00
Avi Kivity
2c0932cc41 Merge 'Reduce the amount of per-table metrics' from Amnon Heiman
This series is the first step in the effort to reduce the number of metrics reported by Scylla.
The series focuses on the per-table metrics.

The combination of histograms, per-tables, and per shard makes the number of metrics in a cluster explode.
The following series uses multiple tools to reduce the number of metrics.
1. Several metrics should only be reported for user tables, but the condition that checked this was not updated when more non-user keyspaces were added.
2. Instead of a histogram per table, per shard, it will report a summary per table, per shard, and a single histogram per node.
3. Histograms, summaries, and counters will be reported only if they are used (for example, the cas-related metrics will not be reported for tables that are not using cas).

Closes #11058

* github.com:scylladb/scylla:
  Add summary_test
  database: Reduce the number of per-table metrics
  replica/table.cc: Do not register per-table metrics for system
  histogram_metrics_helper.hh: Add to_metrics_summary function
  Unified histogram, estimated_histogram, rates, and summaries
  Split the timed_rate_moving_average into data and timer
  utils/histogram.hh: should_sample should use a bitmask
  estimated_histogram: add missing getter method
2022-07-27 22:01:08 +03:00
Amnon Heiman
99a060126d database: Reduce the number of per-table metrics
This patch reduces the number of metrics that are reported per table
when the per-table flag is on.

When possible, it moves from time_estimated_histogram and
timed_rate_moving_average_and_histogram to use the unified timer.

Instead of a histogram per shard, it will now report a summary per shard
and a histogram per node.

Counters, histograms, and summaries will not be reported if they were
never used.

The API was updated accordingly so it would not break.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2022-07-27 16:58:52 +03:00
Amnon Heiman
c31a58f2e9 replica/table.cc: Do not register per-table metrics for system
There is a set of per-table metrics that should only be registered for
user tables.
Over time, more non-user keyspaces have been added, and there is
now a function that covers all those cases.

This patch replaces the implementation to use is_internal_keyspace.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2022-07-27 16:58:52 +03:00
Benny Halevy
b5abbb971f test: memtable_test: failed_flush_prevents_writes: extend error injection
Inject errors into all of seal_active_memtable's distinct error
handling sites.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 14:06:59 +03:00
Benny Halevy
a5911619c0 table: seal_active_memtable: abort if retried for too long
If we haven't been able to flush the memtable
in ~30 minutes (based on the number of retries),
just abort, assuming that the OOM
condition is permanent rather than transient.

Refs #4344

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 14:06:59 +03:00
Benny Halevy
bc18f750c6 table: seal_active_memtable: abort on unexpected error
Currently, when we can't write the flushed sstable
due to corruption in the memtable, we get into
an infinite retry loop (see #10498).

Until we can go into maintenance mode, the next best thing
is to abort, though there is still a risk that
commitlog replay will reproduce the corruption in the
memtable and we'd end up with an infinite crash loop
(hence #10498 is not fixed by this patch).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 14:06:57 +03:00
Benny Halevy
f0a597a252 table: try_flush_memtable_to_sstable: propagate errors to seal_active_memtable
And let seal_active_memtable decide about how to handle them
as now all flush error handling logic is implemented there.

In particular, unlike today, sstable write errors will
cause an internal error rather than looping forever.
Also, check for shutdown earlier to ignore errors
like semaphore_broken that might happen when
the table is stopped.

Refs #10498

(The issue will be considered fixed when going
into maintenance mode on write errors rather than
throwing internal error and potentially retrying forever)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 14:04:55 +03:00
Benny Halevy
d55a2ac762 dirty_memory_manager: flush_when_needed: move error handling to flush_one/seal_active_memtable
Currently flush is retried both by dirty_memory_manager::flush_when_needed
and table::seal_active_memtable, which may be called by other paths
like table::flush.

Unify the retry logic into seal_active_memtable so that
we have similar error handling semantics on all paths.

Refs #4174
Refs #10498

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
67479e4243 table: reindent seal_active_memtable
2022-07-27 13:43:17 +03:00
Benny Halevy
00941452d5 table: coroutinize seal_active_memtable
As a first step to making it robust using
state machine driven retries.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
863e9d9e6a dirty_memory_manager: flush_when_needed: target error handling at flush_one
Now that everything prior to flush_one is noexcept,
make table::seal_active_memtable and the paths that call it
noexcept, making sure that any errors are returned only
as exceptional futures, and handle them in flush_when_needed().

The original handle_exception had a broader scope than now needed,
so this change is mostly technical, to show that we can narrow down
the error handling to the continuation of flush_one - and verify that
the unit test is not broken.
A later patch moves this error handling logic away to seal_active_memtable.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
f60ff44fdf table: try_flush_memtable_to_sstable: consume: close reader on error
If an exception is thrown in `consume` before
write_memtable_to_sstable is called, or if the latter fails,
we must close the reader passed to it.

Fixes #11075

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-19 16:35:59 +03:00