Paweł Dziepak 62d0639fe9 Merge "Avoid reactor stalls in cache with large partitions" from Tomasz
"
We currently suffer from reactor stalls caused by non-preemptible processing
of large partitions in the following places:

  (1) dropping partition entries from cache or memtables does not defer

  (2) dropping partition versions abandoned by detached snapshots does not defer

  (3) merging of partition versions when snapshots go away does not defer

  (4) cache update from memtable processes partition entries without deferring (#2578)

  (5) partition entries are upgraded to new schema atomically

This series fixes problems (1), (2) and (4), but not (3) and (5).

(1) and (2) are fixed by introducing mutation_cleaner objects, which are
containers for garbage partition versions that delay the actual freeing.
Freeing happens from memory reclaimers and is incremental.
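
The idea of deferring and bounding the freeing work can be sketched roughly as
follows. This is only an illustration, not the actual mutation_cleaner code:
toy_version, toy_cleaner, add_garbage() and clear_some() are hypothetical names,
and a plain vector of ints stands in for the rows of a partition version.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a partition version: just a bag of rows.
struct toy_version {
    std::vector<int> rows;
};

// Toy cleaner: owns garbage versions and frees them a bounded number of
// rows at a time, so that no single call has to process a whole large partition.
class toy_cleaner {
    std::deque<std::unique_ptr<toy_version>> _garbage;
public:
    // Take ownership of a version that is no longer needed; nothing is freed yet.
    void add_garbage(std::unique_ptr<toy_version> v) {
        _garbage.push_back(std::move(v));
    }

    bool empty() const { return _garbage.empty(); }

    // Free at most max_rows rows and return, so the caller can yield.
    // Returns the number of rows actually freed.
    std::size_t clear_some(std::size_t max_rows) {
        std::size_t freed = 0;
        while (freed < max_rows && !_garbage.empty()) {
            auto& rows = _garbage.front()->rows;
            const std::size_t n = std::min(max_rows - freed, rows.size());
            rows.erase(rows.end() - static_cast<std::ptrdiff_t>(n), rows.end());
            freed += n;
            if (rows.empty()) {
                _garbage.pop_front(); // version fully drained, drop it
            }
        }
        return freed;
    }
};
```

In this sketch a caller (for example a memory reclaimer) would call clear_some()
repeatedly with a small budget until empty() returns true.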

(3) and (5) are not solved yet.

(4) is solved by having partition merging process partitions with row
granularity and defer in the middle of a partition. To preserve update
atomicity at the partition level as perceived by reads, when an update starts we
create a snapshot of the current version of the partition and process the memtable
entry by inserting data into a separate partition version. This way, if the update
defers in the middle of a partition, reads can still go to the old version and
not see partial writes. Snapshots are marked with phase numbers, and reads
will use the previous phase until the whole partition is updated. When the partition
is finally merged, the snapshots go away and the new version will eventually
be merged into the old version. Due to (3), however, this merging may still add
latency to the update path.
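
A much simplified model of this scheme is sketched below. It is an assumption-laden
illustration rather than the real cache code: toy_entry, begin_update(),
update_some() and read() are hypothetical names, a std::map stands in for a
partition version, and a single flag stands in for phase numbers.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>

using rows = std::map<int, int>; // toy model: clustering key -> value

// Toy cache entry: reads see the old version until the incoming
// memtable entry has been applied in full.
class toy_entry {
    rows _committed;        // version visible to reads (previous phase)
    rows _pending;          // separate version receiving memtable rows
    rows _incoming;         // memtable rows not yet applied
    bool _updating = false;
public:
    explicit toy_entry(rows initial) : _committed(std::move(initial)) {}

    // Start an update; the currently committed version acts as the snapshot.
    void begin_update(rows from_memtable) {
        _incoming = std::move(from_memtable);
        _pending.clear();
        _updating = true;
    }

    // Apply at most max_rows rows into the pending version, then return so
    // the caller can defer. Returns true once the whole partition is updated.
    bool update_some(std::size_t max_rows) {
        for (std::size_t i = 0; i < max_rows && !_incoming.empty(); ++i) {
            auto it = _incoming.begin();
            _pending[it->first] = it->second;
            _incoming.erase(it);
        }
        if (_updating && _incoming.empty()) {
            // Merge the new version into the old one. In this sketch the merge
            // is done in one go, which mirrors problem (3) above.
            for (const auto& kv : _pending) {
                _committed[kv.first] = kv.second;
            }
            _pending.clear();
            _updating = false;
        }
        return !_updating;
    }

    // Reads never observe a partially applied update.
    const rows& read() const { return _committed; }
};
```

Calling update_some() with a small budget between deferral points leaves read()
returning the old rows until the final call completes the update.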

Remaining work:

  - Solving problem (3). I think the approach to take here would be to
    move the task of merging versions to the background, maybe into mutation_cleaner.

  - Merging range tombstones incrementally.

Performance
===========

Performance improvements were evaluated using tests/perf_row_cache_update -c1 -m1G,
which measures the time it takes to update the cache from a memtable for various
workloads and schemas.

For a large partition with lots of small rows we see a significant reduction of
scheduling latency from ~550ms to ~23ms. The cause of the remaining latency is
problem (3) stated above. The run time is reduced by 70%.

For the small partition case without clustering columns we see no degradation.

For the small partition case with a clustering key, but only 3 small rows per partition,
we see a 30% degradation in run time.

For a large partition with lots of range tombstones we see a degradation of 15% in
run time and scheduling latency.

Below you can see full statistics for cache update run time:

=== Small partitions, no overwrites:

Before:

  avg = 433.965155
  stdev = 35.958024
  min = 340.093201
  max = 468.564514

After:

  avg = 436.929447 (+1%)
  stdev = 37.130237
  min = 349.410339
  max = 489.953400

=== Small partition with a few rows:

Before:

  avg = 315.379316
  stdev = 30.059120
  min = 240.340561
  max = 342.408295

After:

  avg = 407.232691 (+30%)
  stdev = 53.918717
  min = 269.514648
  max = 444.846649

=== Large partition, lots of small rows:

Before:

  avg = 412.870689
  stdev = 227.411317
  min = 286.990631
  max = 1263.417847

After:

  avg = 124.351705 (-70%)
  stdev = 4.705762
  min = 110.063255
  max = 129.643387

=== Large partition, lots of range tombstones:

Before:

  avg = 601.172644
  stdev = 121.376866
  min = 223.502136
  max = 874.111572

After:

  avg = 695.627588 (+15%)
  stdev = 135.057004
  min = 337.173950
  max = 784.838745
"

* tag 'tgrabiec/clear-gently-all-partitions-v3' of github.com:tgrabiec/scylla:
  mvcc: Use small_vector<> in partition_snapshot_row_cursor
  utils: Extract small_vector.hh
  mvcc: Erase rows gradually in apply_to_incomplete()
  mvcc: partition_snapshot_row_cursor: Avoid row copying in consume() when possible
  cache: real_dirty_memory_accounter: Move unpinning out of the hot path
  mvcc: partition_snapshot_row_cursor: Reduce lookups in ensure_entry_if_complete()
  mutation_partition: Reduce row lookups in apply_monotonically()
  cache: Release dirty memory with row granularity
  cache: Defer during partition merging
  mvcc: partition_snapshot_row_cursor: Introduce consume_row()
  mvcc: partition_snapshot_row_cursor: Introduce maybe_refresh_static()
  mvcc: Make apply_to_incomplete() work with attached versions
  cache: Propagate phase to apply_to_incomplete()
  cache: Prepare for incremental apply_to_incomplete()
  Introduce a coroutine wrapper
  tests: mvcc: Encapsulate memory management details
  tests: cache: Take into account that update() may defer
  cache: real_dirty_memory_accounter: Allow construction without memtable
  cache: Extract real_dirty_memory_accounter
  mvcc: Destroy memtable partition versions gently
  memtable: Destroy partitions incrementally from clear_gently()
  mvcc: Remove rows from tracker gently
  cache: Destroy partition versions incrementally
  Introduce mutation_cleaner
  mvcc: Introduce partition_version_list
  mvcc: Fix move constructor of partition_version_ref() not preserving _unique_owner
  database: Add API for incremental clearing of partition entries
  cache: Define trivial methods inline
  tests: Improve perf_row_cache_update
  mutation_reader: Make empty mutation source advertize no partitions
2018-05-30 14:12:29 +01:00