mirror of
https://github.com/scylladb/scylladb.git
synced 2026-06-01 04:26:48 +00:00
Compaction prevents data resurrection by making sure that data shadowed
by a GC'able tombstone can never survive alone, for example after a
failure.

Consider the following scenario:
We have two runs, A and B, each divided into 5 fragments, A1..A5 and
B1..B5.

They have the following token ranges:
 A: A1=[0, 3]  A2=[4, 7]  A3=[8, 11]  A4=[12, 15]  A5=[16, 18]
B has the same ranges as A, offset by 1:
 B: B1=[1, 4]  B2=[5, 8]  B3=[9, 12]  B4=[13, 16]  B5=[17, 19]

Let's say we have finished flushing output up to position 10 in the
compaction. We are currently working on A3 and B3, so obviously those
cannot be deleted. Because B2 overlaps with A3, we cannot delete B2
either. Otherwise, B2 could hold a GC'able tombstone that shadows data
in A3, and once B2 is gone, dead data in A3 could be resurrected
*on failure*.
Now, A2 overlaps with B2, which we couldn't delete yet, so we can't
delete A2.
A2 in turn overlaps with B1, so we can't delete B1.
And B1 overlaps with A1, so we can't delete A1.
So we can't delete any fragment.

The problem with this approach is obvious: data dependencies between
fragments can prevent any of them from being released, so the
efficiency of incremental compaction is severely reduced.
To fix it, let's not purge GC'able tombstones right away in the
mutation compactor step. Instead, let's have compaction write them to a
separate sstable run that is deleted at the end of compaction.
Because tombstone information from all compacting sstables is no longer
lost, incremental compaction does not need to impose all those
restrictions on which fragments can be released. Now, any sstable whose
data is safely stored in a new sstable can be released right away. In
addition, incremental compaction will only take place if the compaction
procedure is working with at least one multi-fragment sstable run.

Fixes #4531.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
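
The dependency chain in the A/B scenario can be reproduced with a small
standalone sketch. It is illustrative only and is not ScyllaDB code: the
names Fragment, overlaps, releasable_with_dependencies and
releasable_without_dependencies are hypothetical, and it assumes that a
fragment counts as fully consumed when its last token is below the
flushed position.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    name: str
    first: int  # first token covered, inclusive
    last: int   # last token covered, inclusive


def overlaps(a: Fragment, b: Fragment) -> bool:
    """Two inclusive token ranges overlap if neither ends before the other starts."""
    return a.first <= b.last and b.first <= a.last


def releasable_with_dependencies(fragments, flushed_until):
    """Old behaviour: a fragment is pinned if it is still being read, or if
    it overlaps (directly or transitively) with a pinned fragment."""
    # Assumption: a fragment is fully consumed when last < flushed_until.
    pinned = {f for f in fragments if f.last >= flushed_until}
    changed = True
    while changed:  # propagate pinning until a fixed point is reached
        changed = False
        for f in fragments:
            if f not in pinned and any(overlaps(f, p) for p in pinned):
                pinned.add(f)
                changed = True
    return sorted(f.name for f in fragments if f not in pinned)


def releasable_without_dependencies(fragments, flushed_until):
    """New behaviour: GC'able tombstones are preserved in a separate run,
    so every fully consumed fragment can be released right away."""
    return sorted(f.name for f in fragments if f.last < flushed_until)


RUN_A = [Fragment("A1", 0, 3), Fragment("A2", 4, 7), Fragment("A3", 8, 11),
         Fragment("A4", 12, 15), Fragment("A5", 16, 18)]
RUN_B = [Fragment("B1", 1, 4), Fragment("B2", 5, 8), Fragment("B3", 9, 12),
         Fragment("B4", 13, 16), Fragment("B5", 17, 19)]
FRAGMENTS = RUN_A + RUN_B

if __name__ == "__main__":
    print(releasable_with_dependencies(FRAGMENTS, 10))
    print(releasable_without_dependencies(FRAGMENTS, 10))
```

With the flushed position at 10, the first function reports no
releasable fragment, matching the chain B2 -> A2 -> B1 -> A1 described
above, while the second reports A1, A2, B1 and B2.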