Recently we suffered a regression on how Alternator TTL behaves when a node goes down when tablets are used.
Usually, expiration of data in a particular tablet are handled by this tablet's "primary replica". However, if that node is down, we want another node to perform these expiration until the primary replica goes back online. We created a function `tablet_map::get_secondary_replica()` to select that "other node". We don't care too much what the "secondary replica" means, but we do care that it's different from the primary replica - if it's the same the expiration of that tablet will never be done.
It turns out that recently, in commits 817fdad and d88036d, the implementation of get_primary_replica() changed without a corresponding change to get_secondary_replica(). After those changes, the two functions are mismatched, and sometimes return the same node for both primary and secondary replica.
Unfortunately, although we had a dtest for the handling of a dead node in Alternator TTL, it failed to reproduce this bug, so this regression was missed - nothing else besides Alternator TTL ever used the get_secondary_replica() function.
So this series, in addition to fixing the bug, we add two tests that reproduce this bug (fail before the fix, pass with the fix):
1. A unit test that checks that get_secondary_replica() always returns a different node from get_primary_replica()
2. A cluster test based on the original dtest, which does reproduce this bug in Alternator TTL where some of the data was never expired (but only failed in release build, for an unknown reason).
Fixes SCYLLADB-777.
Closesscylladb/scylladb#28771
* github.com:scylladb/scylladb:
test: add unit test for tablet_map::get_secondary_replica()
test, alternator: add test for TTL expiration with a node down
locator: fix get_secondary_replica() to match get_primary_replica()