Raphael S. Carvalho 65297d01fd replica: Fix race of some operations like cleanup with snapshot
There are two semaphores in table for synchronizing changes to the sstable list:

sstable_set_mutation_sem: used to serialize two concurrent operations updating
the list, to prevent them from racing with each other.

sstable_deletion_sem: A deletion guard, used to serialize deletion and
iteration over the list, to prevent iteration from finding deleted files on
disk.

They're always taken in this order to avoid deadlocks:
sstable_set_mutation_sem -> sstable_deletion_sem.

Problem:

A = tablet cleanup
B = take_snapshot()

1) A acquires sstable_set_mutation_sem to update the list
2) A acquires sstable_deletion_sem, then deletes an sstable before updating the list
3) A releases sstable_deletion_sem, then yields
4) B acquires sstable_deletion_sem
5) B iterates through the list and bumps into the sstable deleted in step 2
6) B fails since it cannot find the file on disk
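
The interleaving above can be replayed deterministically with a toy model.
This is only a sketch: toy_table, cleanup_delete_before_list_update(),
snapshot_missing_files() and replay_race() are hypothetical names made up for
illustration, not code from the tree, and the semaphores are left out because
in this ordering they do not prevent B from running between the file deletion
and the list update.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy model; none of these names exist in ScyllaDB.
struct toy_table {
    std::vector<std::string> sstable_list; // in-memory sstable list
    std::set<std::string> files_on_disk;   // files that actually exist
};

// Steps 2-3: cleanup (A) deletes the file but yields before updating the list.
void cleanup_delete_before_list_update(toy_table& t, const std::string& name) {
    t.files_on_disk.erase(name);
    // The list still references `name` at this point.
}

// Steps 4-6: snapshot (B) iterates the list and collects entries whose file
// is no longer on disk. A non-empty result models the snapshot failure.
std::vector<std::string> snapshot_missing_files(const toy_table& t) {
    std::vector<std::string> missing;
    for (const auto& name : t.sstable_list) {
        if (t.files_on_disk.count(name) == 0) {
            missing.push_back(name);
        }
    }
    return missing;
}

// Replays steps 1-6 in order and returns what the snapshot could not find.
std::vector<std::string> replay_race() {
    toy_table t;
    t.sstable_list = {"sst-1", "sst-2"};
    t.files_on_disk = {"sst-1", "sst-2"};
    cleanup_delete_before_list_update(t, "sst-1"); // A, steps 2-3
    return snapshot_missing_files(t);              // B, steps 4-6
}
```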

The initial reaction is to say that no procedure should delete an sstable
before updating the list, and that's true.

But we also want an iteration, running concurrently with cleanup, to not find
sstables being removed from the system. Otherwise, e.g. snapshot could work
with sstables of a tablet that was just cleaned up. That's achieved by
serializing iteration with list update.
Since sstable_deletion_sem is used within the scope of deletion only, it's
useless for achieving this. Cleanup could acquire the deletion sem when
preparing list updates, and then pass the "permit" to deletion function, but
then sstable_deletion_sem would essentially become sstable_set_mutation_sem,
which was created exactly to protect the list update.

That being said, it makes sense to merge both semaphores. Also things become
easier to reason about, and we don't have to worry about deadlocks anymore.

The deletion goes through sstable_list_builder, which holds a permit throughout
its lifetime, which guarantees that list updates and deletion are atomic to
other concurrent operations. The interface becomes less error prone with that.
It allowed us to find that discard_sstables() was doing deletion without any
permit, meaning another race could happen between truncate and snapshot.
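
A minimal sketch of the idea, assuming a plain std::mutex as a stand-in for
the merged semaphore: the builder takes the permit in its constructor and
keeps it until it is destroyed, so the list update and the file deletion are
seen as one step by any iteration that takes the same permit. All names here
(guarded_table, toy_list_builder, count_missing_in_snapshot,
cleanup_then_snapshot) are hypothetical and are not the real
sstable_list_builder interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy model; std::mutex stands in for the single merged semaphore.
struct guarded_table {
    std::mutex sstable_set_mutation_mutex;
    std::vector<std::string> sstable_list;
    std::set<std::string> files_on_disk;
};

// Holds the permit for its whole lifetime, so list update and deletion
// cannot be observed separately by other permit holders.
class toy_list_builder {
    guarded_table& _t;
    std::unique_lock<std::mutex> _permit;
public:
    explicit toy_list_builder(guarded_table& t)
        : _t(t), _permit(t.sstable_set_mutation_mutex) {}

    // Update the list first, then delete the file, both under the permit.
    void remove_and_delete(const std::string& name) {
        auto& l = _t.sstable_list;
        l.erase(std::remove(l.begin(), l.end(), name), l.end());
        _t.files_on_disk.erase(name);
    }
};

// Iteration takes the same permit, so it never runs in the middle of a
// builder's work. Returns how many listed sstables are missing on disk.
std::size_t count_missing_in_snapshot(guarded_table& t) {
    std::lock_guard<std::mutex> permit(t.sstable_set_mutation_mutex);
    std::size_t missing = 0;
    for (const auto& name : t.sstable_list) {
        if (t.files_on_disk.count(name) == 0) {
            ++missing;
        }
    }
    return missing;
}

// Cleanup followed by snapshot; the permit is released when the builder
// goes out of scope.
std::size_t cleanup_then_snapshot() {
    guarded_table t;
    t.sstable_list = {"sst-1", "sst-2"};
    t.files_on_disk = {"sst-1", "sst-2"};
    {
        toy_list_builder builder(t);
        builder.remove_and_delete("sst-1");
    }
    return count_missing_in_snapshot(t);
}
```

With a single permit there is no lock ordering left to get wrong, which is
the simplification the merge is after.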

So we're fixing the race of (truncate|cleanup) with take_snapshot, as far as we
know. It's possible that other unknown races are fixed as well.

Fixes #23049.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#23117

(cherry picked from commit fedd838b9d)

Closes scylladb/scylladb#23279
2025-06-13 14:35:53 +03:00