cql_test_env/gossip: Prevent double shutdown call crash

Fixes #21159

When an exception is thrown in sstable write etc such that
storage_manager::isolate is initiated, we start a shutdown chain
for message service, gossip etc. These are synced (properly) in
storage_manager::stop, but if we somehow call gossiper::shutdown
outside the normal service::stop cycle, we can end up running the
method simultaneously, intertwined (missing the guard because of
the state change between check and set). We then end up co_awaiting
an invalid future (_failure_detector_loop_done) - a second wait.

Fixed by
a.) Remove superfluous gossiper::shutdown in cql_test_env. This was added
    in 20496ed, ages ago. However, it should not be needed nowadays.
b.) Ensure _failure_detector_loop_done is always waitable. Just to be sure.

(cherry picked from commit c28a5173d9)
This commit is contained in:
Calle Wilund
2024-10-30 14:52:22 +00:00
committed by Mergify
parent 51f7ff8697
commit 242011c4d4
2 changed files with 14 additions and 2 deletions

View File

@@ -2335,7 +2335,15 @@ future<> gossiper::do_stop_gossiping() {
// Take the semaphore makes sure existing gossip loop is finished
auto units = co_await get_units(_callback_running, 1);
co_await container().invoke_on_all([] (auto& g) {
return std::move(g._failure_detector_loop_done);
// #21159
// gossiper::shutdown can be called from more than once place - both
// storage_service::isolate and normal gossip service stop. The former is
// waited for in storage_service::stop, but if we, as was done in cql_test_env,
// call shutdown independently, we could still end up here twite, and not hit
// the _enabled guard (because we do waiting things before setting it, and setting it
// is also waiting). However, making sure we don't leave an invalid future
// here should ensure even if we reenter this method in such as way, we don't crash.
return std::exchange(g._failure_detector_loop_done, make_ready_future<>());
});
logger.info("Gossip is now stopped");
}

View File

@@ -968,7 +968,11 @@ private:
}).get();
auto deinit_storage_service_server = defer([this] {
_gossiper.invoke_on_all(&gms::gossiper::shutdown).get();
// #21159 don't shutdown gossip here - we don't in main.cc, and we should
// strive to keep the two paths aligned. Doing a gossip::shutdown here
// can, if we've provoked a storage_manager::isolate, cause parallel
// double execution of the shutdown method, which causes waiting for
// an invalid future if we're unlucky.
_auth_service.stop().get();
});