scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-26 19:35:12 +00:00

Avi Kivity aa8f135f64 Merge 'Block flush until compaction finishes if sstables accumulate' from Mikołaj Sielużycki

If the flush rate exceeds the compaction rate, we may end up with an
arbitrarily large number of sstables on disk. If a read is executed in
such a case, the amount of memory required is proportional to the
number of sstables for the given shard, which in extreme cases can
lead to OOM.

In the wild, this was observed in two scenarios:
- A node with >10 shards creates a keyspace with thousands of tables,
  drops the keyspace and shuts down before compaction finishes. Dropping
  the keyspace drops its tables, and each dropped table results in
  smp::count writes to the system.local table, with a flush after each
  write, which creates tens of thousands of sstables. The bootstrap read
  from system.local then runs out of memory.
- A failure to agree on the table schema (due to a code bug) between
  nodes during repair resulted in excessive flushing of small sstables,
  which compaction couldn't keep up with.

The unit test introduced in this patch series shows that even
hard-setting maximum shares for compaction and minimum shares for
flushing doesn't tilt the balance towards compaction enough to prevent
the problem. Since it's a fast-producer, slow-consumer problem, the
remaining solution is to block the producer until the consumer catches
up. If there are too many sstable runs originating from memtables, we
block the current flush until the number of sstables is reduced (via
ongoing compaction or a truncate operation).
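
The blocking scheme described above can be sketched roughly as follows.
This is only an illustration, not the code from this series: it swaps
Seastar's futures for plain std::thread and std::condition_variable, and
flush_gate, compact_all and run_demo are hypothetical names. The
threshold formula and the <= comparison are borrowed from the v3
changelog entry below; the wait uses a predicate so that spurious
wakeups are harmless, as noted for v5.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the per-table state; not ScyllaDB's table class.
class flush_gate {
    std::mutex _m;
    std::condition_variable _sstables_changed;
    std::size_t _runs = 0;   // sstable runs produced by memtable flushes
    std::size_t _peak = 0;   // highest value _runs has ever reached
    std::size_t _threshold;
    bool _stopping = false;
public:
    explicit flush_gate(std::size_t max_compaction_threshold)
        : _threshold(std::max<std::size_t>(max_compaction_threshold, 32)) {}

    // Producer side: block until the run count is at or below the threshold,
    // then account for the new run. Returns false if stop() was called.
    bool flush() {
        std::unique_lock<std::mutex> lk(_m);
        // The predicate is re-checked after every wakeup, spurious or not.
        _sstables_changed.wait(lk, [this] {
            return _stopping || _runs <= _threshold;
        });
        if (_stopping) {
            return false;
        }
        assert(_runs <= _threshold);
        ++_runs;
        _peak = std::max(_peak, _runs);
        return true;
    }

    // Consumer side: pretend compaction merged all existing runs into one,
    // then wake any flush waiting for the count to drop.
    void compact_all() {
        {
            std::lock_guard<std::mutex> lk(_m);
            if (_runs > 1) {
                _runs = 1;
            }
        }
        _sstables_changed.notify_all();
    }

    // Shutdown: release any blocked flush so it can bail out.
    void stop() {
        {
            std::lock_guard<std::mutex> lk(_m);
            _stopping = true;
        }
        _sstables_changed.notify_all();
    }

    std::size_t threshold() const { return _threshold; }

    std::size_t peak() {
        std::lock_guard<std::mutex> lk(_m);
        return _peak;
    }
};

// Fast producer, slow consumer: issue `flushes` flushes while a background
// thread compacts roughly once per millisecond. Returns the peak run count.
std::size_t run_demo(std::size_t flushes) {
    flush_gate gate(4);  // threshold becomes max(4, 32) = 32
    std::atomic<bool> done{false};
    std::thread compactor([&] {
        while (!done.load()) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
            gate.compact_all();
        }
    });
    for (std::size_t i = 0; i < flushes; ++i) {
        gate.flush();
    }
    done.store(true);
    compactor.join();
    return gate.peak();
}
```

With this gate in place the run count in the sketch can never exceed
threshold + 1, no matter how far the producer outpaces the consumer,
because a flush is admitted only while the count is at or below the
threshold.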

Fixes https://github.com/scylladb/scylla/issues/4116

Changelog:
v5:
- added a nicer way of timing the stalls caused by waiting for flush
- added a predicate on the signal when waiting for a reduction in the number of sstables, to correctly handle spurious wakeups
- added a comment explaining why we trigger compaction before waiting for the sstable count to drop
- removed unnecessary cv.signal from table::stop

v4:
- removed conversion of table::stop to coroutines. It's an orthogonal change and doesn't need to go into this patchset

v3:
- removed unnecessary change to scheduling groups from v2
- moved sstables_changed signalling to suggested place in table::stop
- added a log message reporting how long the table flush was blocked
- changed the threshold to max(schema()->max_compaction_threshold(), 32) and comparison to <=

v2:
- Reimplemented the waiting algorithm based on reviewers' feedback. It's confined to the table class and it waits in a loop until the number of sstable runs goes below the threshold. It uses a condition variable that is signaled on sstable set refresh. It handles node shutdown as well.
- Converted table::stop to coroutines.
- Reordered commits so that the test is committed after the fix, so it doesn't trip up bisection.

Closes #10717

* github.com:scylladb/scylla:
  table: Add test where compaction doesn't keep up with flush rate.
  random_mutation_generator: Add option to specify ks_name and cf_name
  table: Prevent creating unbounded number of sstables

2022-06-15 14:51:08 +03:00

alternator_test_env.cc

treewide: use Software Package Data Exchange (SPDX) license identifiers

2022-01-18 12:15:18 +01:00

alternator_test_env.hh

treewide: use Software Package Data Exchange (SPDX) license identifiers

2022-01-18 12:15:18 +01:00

cql_assertions.cc

treewide: use Software Package Data Exchange (SPDX) license identifiers

2022-01-18 12:15:18 +01:00

cql_assertions.hh

treewide: use Software Package Data Exchange (SPDX) license identifiers