mirror of
https://github.com/versity/scoutfs.git
synced 2025-12-23 05:25:18 +00:00
Unmounts can be slow and break quorum-heartbeat-timeout
We observe that unmount in this test can take up to 10 seconds before the test proceeds to record heartbeat timeout elections by followers. When this happens, elections complete and new leaders take over before the unmount even finishes. This indicates that heartbeat packets from the unmounting node cease immediately, while the unmount itself takes longer doing other things. The timeouts then trigger, possibly during the unmount.

The result is that with a timeout of 3 seconds, we're not actually waiting for an election at all. It already happened 7 seconds ago. The code here just "sees" the election a few hundred ms after it starts looking for it.

There are a few ways to fix this. We could record the actual timestamp of the election and compare it with the actual timestamp of the last heartbeat packet. This would be conclusive, and would sidestep any complication from umount taking too long. But it also means adding timestamping in various places, or relying on tcpdump with packet processing.

We can't just record $start before unmount. That would still violate the part of the test that checks that elections didn't happen too late, especially in the 3sec test case if unmount takes 10sec.

The simplest solution is to unmount in the background, and circle back later to `wait` for it, to ensure we can re-mount without ill effect.

Signed-off-by: Auke Kok <auke.kok@versity.com>
@@ -62,7 +62,7 @@ test_timeout()
 sleep 1
 
 # tear down the current server/leader
-t_force_umount $sv
+t_force_umount $sv &
 
 # see how long it takes for the next leader to start
 start=$(time_ms)
@@ -73,6 +73,8 @@ test_timeout()
 echo "to $to delay $delay" >> $T_TMP.delay
 
 # restore the mount that we tore down
+wait
+sleep 1
 t_mount $sv
 
 # make sure the new leader delay was reasonable, allowing for some slack
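The pattern in this change can be tried in isolation. The following is a minimal sketch and not the test's code: `slow_teardown` is a hypothetical stand-in for `t_force_umount`, the `time_ms` defined here is an assumed implementation using GNU `date` (the test's own helper may differ), and the fixed `sleep 0.5` stands in for polling until the next leader appears. It shows that backgrounding the slow step lets the measured delay start right away, while `wait` still guarantees the teardown has finished before any re-mount.

```shell
#!/bin/bash

# Assumed helper: milliseconds since the epoch (requires GNU date).
time_ms() {
	echo $(( $(date +%s%N) / 1000000 ))
}

# Hypothetical stand-in for a slow unmount that takes 2 seconds.
slow_teardown() {
	sleep 2
}

# Run the teardown in the background so timing starts immediately.
slow_teardown &
teardown_pid=$!

start=$(time_ms)

# Stand-in for waiting until the next leader is observed.
sleep 0.5
delay=$(( $(time_ms) - start ))
echo "delay ${delay}ms"

# Collect the background teardown before doing anything that depends on it.
wait "$teardown_pid"
total=$(( $(time_ms) - start ))
echo "teardown finished after ${total}ms"
```

The sketch waits on a specific PID, while the change itself uses a bare `wait`, which waits for all background jobs of the shell. Both work when only one job is outstanding.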