Consider the following scenario:
1. A table has RF=3 and writes use CL=QUORUM
2. One node is down
3. There is a pending tablet migration from the unavailable node
that is reverted
During the revert, there can be a time window where the pending replica
being cleaned up still accepts writes. This leads to write failures,
as only two nodes (out of four) are able to acknowledge writes.
This patch fixes the issue by adding a barrier to the cleanup_target
tablet transition state, ensuring that the coordinator switches back to
the previous replica set before cleanup is triggered.
Fixes https://github.com/scylladb/scylladb/issues/26512
It's a pre existing issue. Backport is required to all recent 2025.x versions.
- (cherry picked from commit 669286b1d6)
- (cherry picked from commit 67f1c6d36c)
- (cherry picked from commit 6163fedd2e)
Parent PR: #27413Closesscylladb/scylladb#27425
* github.com:scylladb/scylladb:
topology_coordinator: Fix the indentation for the cleanup_target case
topology_coordinator: Add barrier to cleanup_target
test_node_failure_during_tablet_migration: Increase RF from 2 to 3