Unmounts can be slow and break quorum-heartbeat-timeout

We observe that the unmount in this test can take up to 10 seconds before the test proceeds to record heartbeat timeout elections by followers. When this happens, elections complete and new leaders take over before the unmount even finishes. This indicates that heartbeat packets from the unmounting server cease immediately, but the unmount spends the rest of its time doing other things. The timeouts then trigger, possibly during the unmount.

The result is that with timeouts of 3 seconds, we're not actually waiting for an election at all. It already happened 7 seconds ago. The code here just "sees" the election a few hundred ms after it starts looking for it.

There are a few ways to fix this. We could record the actual timestamp of the election and compare it with the actual timestamp of the last heartbeat packet. This would be conclusive, and would be unaffected by the unmount taking too long. But it also means adding timestamping in various places, or relying on tcpdump with packet processing.

We can't just record $start before the unmount. We would still violate the part of the test that checks that elections didn't happen too late, especially in the 3 second test case if the unmount takes 10 seconds.

The simplest solution is to run the unmount as a background job, and circle back later to `wait` for it to ensure we can re-mount without ill effect.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-12-23 05:25:18 +00:00 · 2025-12-18 14:36:14 -08:00
parent 50bff13f21
commit d41fc376b7
1 changed file with 3 additions and 1 deletion
--- a/tests/tests/quorum-heartbeat-timeout.sh
+++ b/tests/tests/quorum-heartbeat-timeout.sh
@@ -62,7 +62,7 @@ test_timeout()
 	sleep 1

 	# tear down the current server/leader
-	t_force_umount $sv
+	t_force_umount $sv &

 	# see how long it takes for the next leader to start
 	start=$(time_ms)
@@ -73,6 +73,8 @@ test_timeout()
 	echo "to $to delay $delay" >> $T_TMP.delay

 	# restore the mount that we tore down
+	wait
+	sleep 1
 	t_mount $sv

 	# make sure the new leader delay was reasonable, allowing for some slack
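
The background-and-wait pattern this change relies on can be sketched in isolation as below. This is a minimal illustration rather than the test's code: `now_ms` and `slow_teardown` are hypothetical stand-ins (the real test uses its own `time_ms` and `t_force_umount` helpers), the sleeps merely simulate a slow unmount and an election, and `date +%s%N` assumes GNU date.

```shell
#!/bin/bash

# Hypothetical helper: wall-clock time in milliseconds (assumes GNU date).
now_ms() { echo $(( $(date +%s%N) / 1000000 )); }

# Hypothetical stand-in for a forced unmount that takes a long time.
slow_teardown() { sleep 3; }

# Start the teardown as a background job so timing can begin immediately.
slow_teardown &
teardown_pid=$!

# Measure the event of interest while the teardown is still running.
start=$(now_ms)
sleep 1                              # simulates waiting for the new leader
delay=$(( $(now_ms) - start ))

# Block until the background teardown has finished before restoring state.
wait "$teardown_pid"
teardown_status=$?

echo "delay ${delay}ms, teardown exit status ${teardown_status}"
```

Here the measured delay reflects only the simulated election wait, not the teardown's duration, and `wait` with an explicit PID also returns that job's exit status. The committed change uses a bare `wait`, which waits for all background jobs of the shell instead.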