scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-31 03:56:42 +00:00

Files

Avi Kivity 628e6ffd33 Merge 'database, storage_proxy: Reconcile pages with dead rows and partitions incrementally' from Botond Dénes

Currently, mutation query on replica side will not respond with a result which doesn't have at least one live row. This causes problems if there is a lot of dead rows or partitions before we reach a live row, which stem from the fact that resulting reconcilable_result will be large:

1. Large allocations.  Serialization of reconcilable_result causes large allocations for storing result rows in std::deque
2. Reactor stalls. Serialization of reconcilable_result on the replica side and on the coordinator side causes reactor stalls. This impacts not only the query at hand. For 1M dead rows, freezing takes 130ms, unfreezing takes 500ms. Coordinator  does multiple freezes and unfreezes. The reactor stall on the coordinator side is >5s
3. Too large repair mutations. If reconciliation works on large pages, repair may fail due to too large mutation size. 1M dead rows is already too much: Refs https://github.com/scylladb/scylladb/issues/9111.

This patch fixes all of the above by making mutation reads respect the memory accounter's limit for the page size, even for dead rows.

This patch also addresses the problem of client-side timeouts during paging. Reconciling queries processing long strings of tombstones will now properly page tombstones,like regular queries do.

My testing shows that this solution even increases efficiency. I tested with a cluster of 2 nodes, and a table of RF=2. The data layout was as follows (1 partition):
* Node1: 1 live row, 1M dead rows
* Node2: 1M dead rows, 1 live row

This was designed to trigger reconciliation right from the very start of the query.

Before:
```
Running query (node2, CL=ONE, cold cache)
Query done, duration: 140.0633503ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 66.7195275ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 873.5400742ms, pages: 2, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
```

After:
```
Running query (node2, CL=ONE, cold cache)
Query done, duration: 136.9035122ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 69.5286021ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 162.6239498ms, pages: 100, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
```

Non-reconciling queries have almost identical duration (1 few ms changes can be observed between runs). Note how in the after case, the reconciling read also produces 100 pages, vs. just 2 pages in the before case, leading to a much lower duration (less than 1/4 of the before).

Refs https://github.com/scylladb/scylladb/issues/7929
Refs https://github.com/scylladb/scylladb/issues/3672
Refs https://github.com/scylladb/scylladb/issues/7933
Fixes https://github.com/scylladb/scylladb/issues/9111

Closes #14923

* github.com:scylladb/scylladb:
  test/topology_custom: add test_read_repair.py
  replica/mutation_dump: detect end-of-page in range-scans
  tools/scylla-sstable: write: abort parser thread if writing fails
  test/pylib: add REST methods to get node exe and workdir paths
  test/pylib/rest_client: add load_new_sstables, keyspace_{flush,compaction}
  service/storage_proxy: add trace points for the actual read executor type
  service/storage_proxy: add trace points for read-repair
  storage_proxy: Add more trace-level logging to read-repair
  database: Fix accounting of small partitions in mutation query
  database, storage_proxy: Reconcile pages with no live rows incrementally

2023-09-11 19:20:19 +03:00