Currently, a mutation query on the replica side will not respond with a
result that doesn't have at least one live row. This causes problems if
there are a lot of dead rows or partitions before we reach a live row,
because the resulting reconcilable_result will be large:
* Large allocations. Serialization of reconcilable_result causes large
allocations for storing result rows in std::deque
* Reactor stalls. Serialization of reconcilable_result on the replica
side and on the coordinator side causes reactor stalls. This impacts
not only the query at hand. For 1M dead rows, freezing takes 130ms,
unfreezing takes 500ms. Coordinator does multiple freezes and
unfreezes. The reactor stall on the coordinator side is >5s.
* Large repair mutations. If reconciliation works on large pages, repair
may fail because the resulting mutation is too large. 1M dead rows is
already too much: Refs #9111.
This patch fixes all of the above by making mutation reads respect the
memory accounter's limit for the page size, even for dead rows.
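As an illustration of the idea only, here is a minimal sketch of a page
builder that charges dead rows to the accounter just like live ones. The
names row, page, memory_accounter and build_page are hypothetical
stand-ins written for this example, not the actual types or functions
touched by the patch, and the byte sizes are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for illustration; not ScyllaDB's real types.
struct row {
    bool live;          // false means the row is dead (a tombstone)
    std::size_t bytes;  // approximate serialized size of the row
};

struct page {
    std::vector<row> rows;
    std::size_t next = 0;  // input index at which the next page resumes
    bool more = false;     // true if the input was not exhausted
};

// Simplified accounter: tracks bytes consumed against a page size limit.
class memory_accounter {
    std::size_t _limit;
    std::size_t _used = 0;
public:
    explicit memory_accounter(std::size_t limit) : _limit(limit) {
        assert(limit > 0);
    }
    // Charges n bytes; returns true once the page should be cut.
    bool update_and_check(std::size_t n) {
        _used += n;
        return _used >= _limit;
    }
};

// Builds one page starting at index `from`. Every row, live or dead, is
// charged to the accounter, so a long run of dead rows ends the page even
// though no live row has been seen yet.
page build_page(const std::vector<row>& input, std::size_t from, std::size_t limit) {
    memory_accounter acc(limit);
    page p;
    std::size_t i = from;
    while (i < input.size()) {
        p.rows.push_back(input[i]);
        const bool full = acc.update_and_check(input[i].bytes);
        ++i;
        if (full) {
            break;
        }
    }
    p.next = i;
    p.more = i < input.size();
    return p;
}
```

A caller would invoke build_page repeatedly, passing the previous page's
next as from, until more is false. With this accounting, the size of a
page is bounded by the limit no matter how many dead rows precede the
first live one.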
This patch also addresses the problem of client-side timeouts during
paging. Reconciling queries processing long strings of tombstones will
now properly page tombstones, like regular queries do.
My testing shows that this solution even increases efficiency. I tested
with a cluster of 2 nodes, and a table of RF=2. The data layout was as
follows (1 partition):
Node1: 1 live row, 1M dead rows
Node2: 1M dead rows, 1 live row
This was designed to trigger reconciliation right from the very start of
the query.
Before:
Running query (node2, CL=ONE, cold cache)
Query done, duration: 140.0633503ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 66.7195275ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 873.5400742ms, pages: 2, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
After:
Running query (node2, CL=ONE, cold cache)
Query done, duration: 136.9035122ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 69.5286021ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 162.6239498ms, pages: 100, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
Non-reconciling queries have almost identical duration (changes of a
few ms can be observed between runs). Note how in the after case, the
reconciling read also produces 100 pages, vs. just 2 pages in the before
case, leading to a much lower duration (162ms vs. 873ms, less than 1/4
of the before).
Refs #7929
Refs #3672
Refs #7933
Fixes #9111