Compare commits

...

2 Commits

Author SHA1 Message Date
Auke Kok
a5084c3548 Test setup: loop 30x quorum-heartbeat-timeout
Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-12-18 14:52:49 -08:00
Auke Kok
d41fc376b7 Unmounts can be slow and break quorum-heartbeat-timeout
We observe that unmount in this test can take up to 10sec before the
test proceeds to record heartbeat timeout elections by followers.

When this happens, elections complete and new leaders emerge before the
unmount even finishes. This indicates that heartbeat packets from the
unmounting node cease immediately, while the unmount takes longer doing
other things. The timeouts then trigger, possibly during the unmount.

The result is that with timeouts of 3 seconds, we're not actually
waiting for an election at all. It already happened 7 seconds ago. The
code here just "sees" that it happened a few hundred ms after it started
looking for it.

There are a few ways to fix this. We could record the actual timestamp
of the election and compare it with the actual timestamp of the last
heartbeat packet. This would be conclusive and would sidestep any
complication from umount taking too long. But it also means adding
timestamping in various places, or relying on tcpdump with packet
processing.

We can't just record $start before unmount. That would still violate the
part of the test that checks that elections didn't happen too late,
especially in the 3sec test case if unmount takes 10sec.
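
The arithmetic behind the three options can be sketched as below. This is
an illustration only, not part of the change: it assumes heartbeats stop
the moment the unmount begins, takes the 3sec and 10sec figures from the
text above, and uses a made-up 200 ms for the "few hundred ms" it takes
the test to notice an election that already happened.

```shell
#!/usr/bin/env bash
to=3000          # heartbeat timeout under test, in ms (the 3sec case)
umount_ms=10000  # slow unmount duration, in ms (the 10sec observation)
poll_ms=200      # assumed latency of noticing a past election (hypothetical)

# Assumption: heartbeats stop at t=0, when the unmount begins,
# so followers elect a new leader at t=$to.
election_at=$to

# A: blocking unmount, $start recorded after it returns (the old test).
# The test only starts looking at t=$umount_ms and sees the election late.
seen_at=$(( umount_ms + poll_ms ))
measured_a=$(( seen_at - umount_ms ))
stale_by=$(( umount_ms - election_at ))

# B: blocking unmount, $start recorded before it (t=0).
measured_b=$(( seen_at - 0 ))

# C: unmount in the background, $start recorded right after launching it.
measured_c=$(( election_at - 0 ))

echo "A: measured ${measured_a} ms, election was ${stale_by} ms old"
echo "B: measured ${measured_b} ms against a ${to} ms timeout"
echo "C: measured ${measured_c} ms against a ${to} ms timeout"
```

Under these assumptions only case C measures a delay that reflects the
timeout itself: A reports 200 ms for an election that is 7000 ms old, and
B reports 10200 ms, which fails any "not too late" check.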

The simplest solution is to unmount in a background job, and circle
around later to `wait` for it to ensure we can re-mount without ill
effect.
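
A minimal standalone sketch of that pattern follows. It is not the test
itself: `slow_teardown` is a hypothetical stand-in for `t_force_umount`,
the 1 second sleep is arbitrary, and this `time_ms` is an assumed
implementation that relies on GNU date's `%N`.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for t_force_umount: a teardown that takes a while.
slow_teardown() { sleep 1; }

# Assumed millisecond clock; %N requires GNU date.
time_ms() { echo $(( $(date +%s%N) / 1000000 )); }

before=$(time_ms)
slow_teardown &                     # returns immediately, teardown keeps running
start=$(time_ms)                    # timer starts while teardown is in flight
launch_cost=$(( start - before ))

# ... the election would be observed and timed here ...

wait                                # block until the background teardown exits
total=$(( $(time_ms) - before ))

echo "timer started ${launch_cost} ms after launch"
echo "teardown finished after ${total} ms"
```

The timer starts almost immediately after the teardown is launched, and
`wait` guarantees the teardown has exited before anything that depends on
it, such as a re-mount, runs.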

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-12-18 14:36:14 -08:00
2 changed files with 5 additions and 2 deletions

@@ -92,7 +92,8 @@ done
 T_TRACE_DUMP="0"
 T_TRACE_PRINTK="0"
 T_PORT_START="19700"
-T_LOOP_ITER="1"
+T_LOOP_ITER="30"
+T_INCLUDE="quorum-heartbeat-timeout"
 # array declarations to be able to use array ops
 declare -a T_TRACE_GLOB

@@ -62,7 +62,7 @@ test_timeout()
 sleep 1
 # tear down the current server/leader
-t_force_umount $sv
+t_force_umount $sv &
 # see how long it takes for the next leader to start
 start=$(time_ms)
@@ -73,6 +73,8 @@ test_timeout()
 echo "to $to delay $delay" >> $T_TMP.delay
 # restore the mount that we tore down
+wait
+sleep 1
 t_mount $sv
 # make sure the new leader delay was reasonable, allowing for some slack