mirror of
https://github.com/scylladb/scylladb.git
synced 2026-06-01 04:26:48 +00:00

"
We currently suffer from reactor stalls caused by non-preemptible
processing of large partitions in the following places:

  (1) dropping partition entries from cache or memtables does not defer

  (2) dropping partition versions abandoned by detached snapshots does
      not defer

  (3) merging of partition versions when snapshots go away does not defer

  (4) cache update from memtable processes partition entries without
      deferring (#2578)

  (5) partition entries are upgraded to new schema atomically

This series fixes problems (1), (2) and (4), but not (3) and (5).

(1) and (2) are fixed by introducing mutation_cleaner objects, which are
containers for garbage partition versions that delay the actual freeing.
Freeing happens from memory reclaimers and is incremental.

(3) and (5) are not solved yet.

(4) is solved by having partition merging process partitions with row
granularity and defer in the middle of a partition. In order to preserve
update atomicity at the partition level as perceived by reads, when an
update starts we create a snapshot of the current version of the
partition and process the memtable entry by inserting data into a
separate partition version. This way, if the update defers in the middle
of a partition, reads can still go to the old version and not see
partial writes. Snapshots are marked with phase numbers, and reads will
use the previous phase until the whole partition is updated. When the
partition is finally merged, the snapshots go away and the new version
will eventually be merged into the old version. Due to (3), however,
this merging may still add latency to the update path.

Remaining work:

  - Solving problem (3). I think the approach to take here would be to
    move the task of merging versions to the background, maybe into
    mutation_cleaner.

  - Merging range tombstones incrementally.

Performance
===========

Performance improvements were evaluated using
tests/perf_row_cache_update -c1 -m1G, which measures the time it takes
to update the cache from a memtable for various workloads and schemas.
For a large partition with lots of small rows we see a significant
reduction of scheduling latency, from ~550ms to ~23ms. The cause of the
remaining latency is problem (3) stated above. The run time is reduced
by 70%.

For the small partition case without clustering columns we see no
degradation.

For the small partition case with a clustering key, but only 3 small
rows per partition, we see a 30% degradation in run time.

For a large partition with lots of range tombstones we see a degradation
of 15% in run time and scheduling latency.

Below you can see full statistics for cache update run time:

=== Small partitions, no overwrites:

Before:
  avg = 433.965155
  stdev = 35.958024
  min = 340.093201
  max = 468.564514

After:
  avg = 436.929447 (+1%)
  stdev = 37.130237
  min = 349.410339
  max = 489.953400

=== Small partition with a few rows:

Before:
  avg = 315.379316
  stdev = 30.059120
  min = 240.340561
  max = 342.408295

After:
  avg = 407.232691 (+30%)
  stdev = 53.918717
  min = 269.514648
  max = 444.846649

=== Large partition, lots of small rows:

Before:
  avg = 412.870689
  stdev = 227.411317
  min = 286.990631
  max = 1263.417847

After:
  avg = 124.351705 (-70%)
  stdev = 4.705762
  min = 110.063255
  max = 129.643387

=== Large partition, lots of range tombstones:

Before:
  avg = 601.172644
  stdev = 121.376866
  min = 223.502136
  max = 874.111572

After:
  avg = 695.627588 (+15%)
  stdev = 135.057004
  min = 337.173950
  max = 784.838745
"

* tag 'tgrabiec/clear-gently-all-partitions-v3' of github.com:tgrabiec/scylla:
  mvcc: Use small_vector<> in partition_snapshot_row_cursor
  utils: Extract small_vector.hh
  mvcc: Erase rows gradually in apply_to_incomplete()
  mvcc: partition_snapshot_row_cursor: Avoid row copying in consume() when possible
  cache: real_dirty_memory_accounter: Move unpinning out of the hot path
  mvcc: partition_snapshot_row_cursor: Reduce lookups in ensure_entry_if_complete()
  mutation_partition: Reduce row lookups in apply_monotonically()
  cache: Release dirty memory with row granularity
  cache: Defer during partition merging
  mvcc: partition_snapshot_row_cursor: Introduce consume_row()
  mvcc: partition_snapshot_row_cursor: Introduce maybe_refresh_static()
  mvcc: Make apply_to_incomplete() work with attached versions
  cache: Propagate phase to apply_to_incomplete()
  cache: Prepare for incremental apply_to_incomplete()
  Introduce a coroutine wrapper
  tests: mvcc: Encapsulate memory management details
  tests: cache: Take into account that update() may defer
  cache: real_dirty_memory_accounter: Allow construction without memtable
  cache: Extract real_dirty_memory_accounter
  mvcc: Destroy memtable partition versions gently
  memtable: Destroy partitions incrementally from clear_gently()
  mvcc: Remove rows from tracker gently
  cache: Destroy partition versions incrementally
  Introduce mutation_cleaner
  mvcc: Introduce partition_version_list
  mvcc: Fix move constructor of partition_version_ref() not preserving _unique_owner
  database: Add API for incremental clearing of partition entries
  cache: Define trivial methods inline
  tests: Improve perf_row_cache_update
  mutation_reader: Make empty mutation source advertize no partitions
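
Illustration only: the incremental freeing that the cover letter attributes to mutation_cleaner can be sketched in standalone C++ roughly as follows. This is not the code from the series. The names garbage_cleaner_sketch, partition_version_sketch, destroy_later() and clear_some() are hypothetical, rows are modeled as strings in a std::list, and the "reclaimer" is assumed to be whoever calls clear_some() with a row budget.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <list>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a partition version: just a list of rows.
struct partition_version_sketch {
    std::list<std::string> rows;
};

// Hypothetical, simplified analogue of the mutation_cleaner idea:
// it takes ownership of garbage versions and frees them a bounded
// number of rows at a time instead of all at once.
class garbage_cleaner_sketch {
    std::deque<std::unique_ptr<partition_version_sketch>> _garbage;
public:
    // Hand over a version that readers can no longer reach.
    // Nothing is freed here; the work is only queued.
    void destroy_later(std::unique_ptr<partition_version_sketch> v) {
        assert(v);
        _garbage.push_back(std::move(v));
    }

    // Free at most `max_rows` rows. Returns true when no garbage is
    // left, false when the budget ran out and the caller should yield
    // and call again later.
    bool clear_some(std::size_t max_rows) {
        assert(max_rows > 0);
        while (!_garbage.empty()) {
            auto& rows = _garbage.front()->rows;
            while (!rows.empty() && max_rows > 0) {
                rows.pop_front(); // frees exactly one list node
                --max_rows;
            }
            if (!rows.empty()) {
                return false; // budget exhausted in the middle of a version
            }
            _garbage.pop_front(); // version is empty, destroying it is cheap
        }
        return true;
    }

    bool empty() const { return _garbage.empty(); }
};
```

In this sketch a caller would invoke clear_some() repeatedly with a small budget and yield to other work whenever it returns false. For example, one queued version with ten rows and a budget of four rows per call is drained in three calls.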