topology_coordinator: Fix missed notification on abort

If _as is aborted while the coordinator is in the middle of handling
work, and the coordinator then decides to go to sleep, it may go to
sleep without noticing that it was aborted. Fix by checking _as before
blocking on the condition variable.

In general, every condition that can cause signal() should be checked
before calling when(). This patch doesn't fix all such cases. For
example, signal() can be called when a new topology request arrives.
This can happen after the coordinator has checked for work, because it
releases the guard before calling when().
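
The check-before-wait rule can be illustrated outside of Seastar. The
following is a minimal sketch in standard C++, not the coordinator's
actual code: event_waiter, request_abort() and the sticky _signaled
flag are hypothetical names introduced for illustration only. It
assumes a mutex-protected state, which lets every wake-up condition be
re-checked before blocking, so neither an earlier abort nor an earlier
signal is missed.

```cpp
#include <condition_variable>
#include <mutex>
#include <stdexcept>

// Hypothetical stand-in for an abort source paired with an event.
// This is an analogy in standard C++, not Seastar's condition_variable.
class event_waiter {
    std::mutex _m;
    std::condition_variable _cv;
    bool _aborted = false;   // set once, never cleared
    bool _signaled = false;  // sticky: remembers a signal sent before the wait
public:
    // Request an abort and wake up any current waiter.
    void request_abort() {
        std::lock_guard<std::mutex> lk(_m);
        _aborted = true;
        _cv.notify_all();
    }

    // Record an event and wake up any current waiter.
    void signal() {
        std::lock_guard<std::mutex> lk(_m);
        _signaled = true;
        _cv.notify_all();
    }

    // Check every wake-up condition before blocking. wait() with a
    // predicate evaluates the predicate first, so an abort or signal
    // that happened before this call is seen instead of being lost.
    void await_event() {
        std::unique_lock<std::mutex> lk(_m);
        _cv.wait(lk, [this] { return _aborted || _signaled; });
        if (_aborted) {
            throw std::runtime_error("aborted");
        }
        _signaled = false;  // consume the event
    }
};
```

In this sketch, calling request_abort() and then await_event() throws
immediately instead of sleeping. The sticky flag also covers a signal
that arrives before the wait, which is the kind of case this patch
leaves unfixed in the coordinator.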
Tomasz Grabiec
2023-07-16 23:26:02 +02:00
parent e338679266
commit 2811b1df0a


@@ -1744,6 +1744,11 @@ class topology_coordinator {
// Returns true if the state machine was transitioned into tablet migration path.
future<bool> maybe_start_tablet_migration(group0_guard);
+future<> await_event() {
+    _as.check();
+    co_await _topo_sm.event.when();
+}
public:
topology_coordinator(
sharded<db::system_distributed_keyspace>& sys_dist_ks,
@@ -1817,7 +1822,7 @@ future<> topology_coordinator::run() {
if (!had_work) {
// Nothing to work on. Wait for topology change event.
slogger.trace("raft topology: topology coordinator fiber has nothing to do. Sleeping.");
-co_await _topo_sm.event.when();
+co_await await_event();
slogger.trace("raft topology: topology coordinator fiber got an event");
}
} catch (raft::request_aborted&) {