Unmounts can be slow and break quorum-heartbeat-timeout

We observe that unmount in this test can take up to 10 seconds before the test proceeds to record heartbeat timeout elections by followers. When this happens, elections and new leaders happen before the unmount even completes. This indicates that heartbeat packets from the unmounting mount cease immediately, but the unmount takes longer doing other things. The timeouts then trigger, possibly during the unmount.

The result is that with timeouts of 3 seconds, we're not actually waiting for an election at all. It already happened 7 seconds ago. The code here just "sees" that it happened a few hundred ms after it started looking for it.

There are a few ways to fix this. We could record the actual timestamp of the election and compare it with the actual timestamp of the last heartbeat packet. This would be conclusive, and could disregard any complication from umount taking too long. But it also means adding timestamping in various places, or having to rely on tcpdump with packet processing.

We can't just record $start before unmount. We would still violate the part of the test that checks that elections didn't happen too late, especially in the 3sec test case if unmount takes 10sec.

The simplest solution is to unmount in a background job, and circle around later to `wait` for it to ensure we can re-mount without ill effect.

Signed-off-by: Auke Kok <auke.kok@versity.com>
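
The background-unmount idea described above can be sketched in plain shell. This is only an illustration under stated assumptions, not the test's actual code: `slow_teardown` is a hypothetical stand-in for `t_force_umount`, the `time_ms` helper here is an assumed implementation built on GNU `date`, and the short `sleep` stands in for polling for the new leader.

```shell
#!/bin/bash
# Hypothetical helper: current time in milliseconds (assumes GNU date's %N).
time_ms() {
	echo $(( $(date +%s%N) / 1000000 ))
}

# Hypothetical stand-in for a slow forced unmount.
slow_teardown() {
	sleep 1
}

# Start the teardown in the background so timing can begin right away.
slow_teardown &
teardown_pid=$!

start=$(time_ms)
# The real test would poll for the next leader here; this is a placeholder.
sleep 0.2
delay=$(( $(time_ms) - start ))

# Reap the background job before touching the mount again.
wait "$teardown_pid"
teardown_rc=$?
total=$(( $(time_ms) - start ))

echo "delay ${delay}ms total ${total}ms rc ${teardown_rc}"
```

Because the teardown runs concurrently, the measured `delay` no longer includes the time the teardown itself takes, while `wait` still guarantees it has finished before anything is re-mounted.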
8 changed files with 43 additions and 17 deletions
--- a/kmod/src/server.c
+++ b/kmod/src/server.c
@@ -1618,7 +1618,8 @@ static int server_get_log_trees(struct super_block *sb,
 		goto update;
 	}

-	ret = alloc_move_empty(sb, &super->data_alloc, &lt.data_freed, 100);
+	ret = alloc_move_empty(sb, &super->data_alloc, &lt.data_freed,
+			       COMMIT_HOLD_ALLOC_BUDGET / 2);
 	if (ret == -EINPROGRESS)
 		ret = 0;
 	if (ret < 0) {
@@ -1913,9 +1914,11 @@ static int reclaim_open_log_tree(struct super_block *sb, u64 rid)
 	       scoutfs_alloc_splice_list(sb, &server->alloc, &server->wri, server->other_freed,
 					 &lt.meta_avail)) ?:
 	      (err_str = "empty data_avail",
-	       alloc_move_empty(sb, &super->data_alloc, &lt.data_avail, 100)) ?:
+	       alloc_move_empty(sb, &super->data_alloc, &lt.data_avail,
+				COMMIT_HOLD_ALLOC_BUDGET / 2)) ?:
 	      (err_str = "empty data_freed",
-	       alloc_move_empty(sb, &super->data_alloc, &lt.data_freed, 100));
+	       alloc_move_empty(sb, &super->data_alloc, &lt.data_freed,
+				COMMIT_HOLD_ALLOC_BUDGET / 2));
 	mutex_unlock(&server->alloc_mutex);

 	/* only finalize, allowing merging, once the allocators are fully freed */
@@ -3036,7 +3039,13 @@ static int server_commit_log_merge(struct super_block *sb,
 				  SCOUTFS_LOG_MERGE_STATUS_ZONE, 0, 0,
 				  &stat, sizeof(stat));
 	if (ret < 0) {
-		err_str = "getting merge status item";
+		/*
+		 * During a retransmission, it's possible that the server
+		 * already committed and resolved this log merge. ENOENT
+		 * is expected in that case.
+		 */
+		if (ret != -ENOENT)
+			err_str = "getting merge status item";
 		goto out;
 	}

--- a/tests/fenced-local-force-unmount.sh
+++ b/tests/fenced-local-force-unmount.sh
@@ -9,7 +9,7 @@
 echo "$0 running rid '$SCOUTFS_FENCED_REQ_RID' ip '$SCOUTFS_FENCED_REQ_IP' args '$@'"

 echo_fail() {
-	echo "$@" >> /dev/stderr
+	echo "$@" >&2
 	exit 1
 }

@@ -27,8 +27,7 @@ for fs in /sys/fs/scoutfs/*; do
 	nr="$(quiet_cat $fs/data_device_maj_min)"
 	[ ! -d "$fs" -o "$fs_rid" != "$rid" ] && continue

-	mnt=$(findmnt -l -n -t scoutfs -o TARGET -S $nr) || \
-		echo_fail "findmnt -t scoutfs -S $nr failed"
+	mnt=$(findmnt -l -n -t scoutfs -o TARGET -S $nr)
 	[ -z "$mnt" ] && continue

 	if ! umount -qf "$mnt"; then
--- a/tests/funcs/filter.sh
+++ b/tests/funcs/filter.sh
@@ -170,6 +170,9 @@ t_filter_dmesg()
 	# some ci test guests are unresponsive
 	re="$re|longest quorum heartbeat .* delay"

+	# creating block devices may trigger this
+	re="$re|block device autoloading is deprecated and will be removed."
+
 	egrep -v "($re)" | \
 		ignore_harmless_unwind_kasan_stack_oob
 }
--- a/tests/funcs/tap.sh
+++ b/tests/funcs/tap.sh
@@ -43,9 +43,14 @@ t_tap_progress()
 	local testname=$1
 	local result=$2

+	local stmsg=""
 	local diff=""
 	local dmsg=""

+	if [[ -s $T_RESULTS/tmp/${testname}/status.msg ]]; then
+		stmsg="1"
+	fi
+
 	if [[ -s "$T_RESULTS/tmp/${testname}/dmesg.new" ]]; then
 		dmsg="1"
 	fi
@@ -61,6 +66,7 @@ t_tap_progress()
 		echo "# ${testname} ** skipped - permitted **"
 	else
 		echo "not ok ${i} - ${testname}"
+
 		case ${result} in
 		101)
 			echo "# ${testname} ** skipped **"
@@ -70,6 +76,13 @@ t_tap_progress()
 			;;
 		esac

+		if [[ -n "${stmsg}" ]]; then
+			echo "#"
+			echo "# status:"
+			echo "#"
+			cat $T_RESULTS/tmp/${testname}/status.msg | sed 's/^/# - /'
+		fi
+
 		if [[ -n "${diff}" ]]; then
 			echo "#"
 			echo "# diff:"
--- a/tests/tests/get-referring-entries.sh
+++ b/tests/tests/get-referring-entries.sh
@@ -72,7 +72,7 @@ touch $T_D0/dir/file
 mkdir $T_D0/dir/dir
 ln -s $T_D0/dir/file $T_D0/dir/symlink
 mknod $T_D0/dir/char c 1 3 # null
-mknod $T_D0/dir/block b 7 0 # loop0
+mknod $T_D0/dir/block b 42 0 # SAMPLE block dev - nonexistant/demo use only number
 for name in $(ls -UA $T_D0/dir | sort); do
 	ino=$(stat -c '%i' $T_D0/dir/$name)
 	$GRE $ino | filter_types
--- a/tests/tests/quorum-heartbeat-timeout.sh
+++ b/tests/tests/quorum-heartbeat-timeout.sh
@@ -62,7 +62,7 @@ test_timeout()
 	sleep 1

 	# tear down the current server/leader
-	t_force_umount $sv
+	t_force_umount $sv &

 	# see how long it takes for the next leader to start
 	start=$(time_ms)
@@ -73,6 +73,8 @@ test_timeout()
 	echo "to $to delay $delay" >> $T_TMP.delay

 	# restore the mount that we tore down
+	wait
+	sleep 1
 	t_mount $sv

 	# make sure the new leader delay was reasonable, allowing for some slack
--- a/tests/tests/renameat2-noreplace.sh
+++ b/tests/tests/renameat2-noreplace.sh
@@ -8,19 +8,19 @@ t_require_mounts 2
 echo "=== renameat2 noreplace flag test"

 # give each mount their own dir (lock group) to minimize create contention
-mkdir $T_M0/dir0
-mkdir $T_M1/dir1
+mkdir $T_D0/dir0
+mkdir $T_D1/dir1

 echo "=== run two asynchronous calls to renameat2 NOREPLACE"
 for i in $(seq 0 100); do
        # prepare inputs in isolation
-        touch "$T_M0/dir0/old0"
-        touch "$T_M1/dir1/old1"
+        touch "$T_D0/dir0/old0"
+        touch "$T_D1/dir1/old1"

        # race doing noreplace renames, both can't succeed
-        dumb_renameat2 -n "$T_M0/dir0/old0" "$T_M0/dir0/sharednew" 2> /dev/null &
+        dumb_renameat2 -n "$T_D0/dir0/old0" "$T_D0/dir0/sharednew" 2> /dev/null &
        pid0=$!
-        dumb_renameat2 -n "$T_M1/dir1/old1" "$T_M1/dir0/sharednew" 2> /dev/null &
+        dumb_renameat2 -n "$T_D1/dir1/old1" "$T_D1/dir0/sharednew" 2> /dev/null &
        pid1=$!

        wait $pid0
@@ -31,7 +31,7 @@ for i in $(seq 0 100); do
        test "$rc0" == 0 -a "$rc1" == 0 && t_fail "both renames succeeded"

        # blow away possible files for either race outcome
-        rm -f "$T_M0/dir0/old0" "$T_M1/dir1/old1" "$T_M0/dir0/sharednew" "$T_M1/dir1/sharednew"
+        rm -f "$T_D0/dir0/old0" "$T_D1/dir1/old1" "$T_D0/dir0/sharednew" "$T_D1/dir1/sharednew"
 done

 t_pass
--- a/utils/fenced/scoutfs-fenced
+++ b/utils/fenced/scoutfs-fenced
@@ -7,7 +7,7 @@ message_output()

 error_message()
 {
-	message_output "$@" >> /dev/stderr
+	message_output "$@" >&2
 }

 error_exit()
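
The `>&2` form used in the two helpers above duplicates the already-open file descriptor 2 rather than opening `/dev/stderr` by path, so it does not depend on the permissions of that device node. A minimal sketch of the behavior, assuming a simplified `error_message` that calls `echo` directly (the real helper calls `message_output`):

```shell
#!/bin/bash
# Simplified helper, loosely modeled on the diff: write a message to stderr
# by duplicating fd 2 instead of opening /dev/stderr by path.
error_message() {
	echo "$@" >&2
}

# Capture stdout only: the message must not show up here.
out=$(error_message "fence failed" 2>/dev/null)

# Capture stderr only: point fd 2 at the capture pipe, then discard stdout.
err=$(error_message "fence failed" 2>&1 >/dev/null)

echo "stdout='$out' stderr='$err'"
```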
Author	SHA1	Message	Date
Auke Kok	d41fc376b7	Unmounts can be slow and break quorum-heartbeat-timeout We observe that unmount in this test can take up to 10 seconds before the test proceeds to record heartbeat timeout elections by followers. When this happens, elections and new leaders happen before the unmount even completes. This indicates that heartbeat packets from the unmounting mount cease immediately, but the unmount takes longer doing other things. The timeouts then trigger, possibly during the unmount. The result is that with timeouts of 3 seconds, we're not actually waiting for an election at all. It already happened 7 seconds ago. The code here just "sees" that it happened a few hundred ms after it started looking for it. There are a few ways to fix this. We could record the actual timestamp of the election and compare it with the actual timestamp of the last heartbeat packet. This would be conclusive, and could disregard any complication from umount taking too long. But it also means adding timestamping in various places, or having to rely on tcpdump with packet processing. We can't just record $start before unmount. We would still violate the part of the test that checks that elections didn't happen too late, especially in the 3sec test case if unmount takes 10sec. The simplest solution is to unmount in a background job, and circle around later to `wait` for it to ensure we can re-mount without ill effect. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-12-18 14:36:14 -08:00
Zach Brown	50bff13f21	Merge pull request #266 from versity/zab/increase_move_empty_budget Increase server commit block budget for alloc move	2025-12-18 12:44:20 -08:00
Zach Brown	de70ca2372	Increase server commit block budget for alloc move A few callers of alloc_move_empty in the server were providing a budget that was too small. Recent changes to extent_mod_blocks increased the max budget that is necessary to move extents between btrees. The existing WAG of 100 was too small for trees of height 2 and 3. This caused looping in production. We can increase the move budget to half the overall commit budget, which leaves room for a height of around 7 each. This is much greater than we see in practice because the size of the per-mount btrees is effectively limited by both watermarks and thresholds to commit and drain. Signed-off-by: Zach Brown <zab@versity.com>	2025-12-17 14:22:04 -06:00
Zach Brown	5af1412d5f	Merge pull request #270 from versity/auke/bdev_autoloading Avoid block device autoloading warning.	2025-12-17 11:06:32 -08:00
Zach Brown	0a2b2ad409	Merge pull request #269 from versity/auke/tap_status_msg Include t_fail status in tap output.	2025-12-17 11:04:00 -08:00
Auke Kok	6c4590a8a0	Avoid block device autoloading warning. It's possible to trigger the block device autoloading mechanism with a mknod()/stat(), and this mechanism has long been declared obsolete, thus triggering a dmesg warning since el9_7, which then fails the test. You may need to `rmmod loop` to reproduce. Avoid this by not triggering a loop autoload - we just make a different blockdev. Choosing `42` here should avoid any autoload mechanism as this number is explicitly for demo drivers and should never trigger an autoload. We also just ignore the warning line in dmesg. Other tests can and might perhaps still trigger this, as well as background noise running during the test. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-12-08 13:04:58 -08:00
Zach Brown	1768f69c3c	Merge pull request #224 from versity/auke/renameat2-test-sub-dir Use T_D0/1 instead of T_M0 here.	2025-12-08 10:05:46 -08:00
Zach Brown	dcb0fd5805	Merge pull request #268 from versity/auke/dont_use_bash_special_stdfiles Avoid using bash special device nodes.	2025-12-08 09:47:19 -08:00
Auke Kok	660f874488	Use T_D0/1 instead of T_M0 here. Use of T_M0 and variants should be reserved for e.g. scoutfs <subcommand> -p <mountpoint> type of usages. Tests should create individual content files in the assigned subdirectory. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-12-04 14:34:02 -05:00
Auke Kok	e1a6689a9b	Include t_fail status in tap output. The tap output file was not yet complete as it failed to include the contents of `status.msg`. In a few cases, that would mean it lacks important context. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-12-04 14:09:39 -05:00
Auke Kok	2884a92408	Avoid using bash special device nodes. Bash has special handling for these standard IO files, but there are cases where customers have special restrictions set on them, likely to avoid leaking error data out of system logs as part of IDS software. In any case, we can just reopen existing file descriptors here in both these cases to avoid this entirely. This will always work. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-12-04 13:24:48 -05:00
Zach Brown	e194714004	Merge pull request #264 from versity/auke/findmnt_retval Findmnt returns 1 when no matching entries found	2025-12-03 14:29:31 -08:00
Auke Kok	8bb2f83cf9	Findmnt returns 1 when no matching entries found Our local fence script attempts to interpret errors executing `findmnt` as critical errors, but the program explicitly returns EXIT_FAILURE when the total number of matching mount entries is zero. This can happen if the mount disappeared while we're attempting to fence the mount, but the scoutfs sysfs files are still in place as we read them. It's a small window, but it's a fork/exec plus a full parse of /etc/fstab, and a lot can happen in the 0.015s findmnt takes on my system. There are no exit codes from findmnt other than 0 and 1. At that point, we can only assume that if the stdout is empty, the mount isn't there anymore. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-12-02 12:55:11 -08:00
Zach Brown	6a9a6789d5	Merge pull request #267 from versity/clk/merge_enoent Handle ENOENT when getting log merge status item	2025-12-02 09:34:28 -08:00
Chris Kirby	ee630b164f	Handle ENOENT when getting log merge status item Tests that cause client retries can fail with this error from server_commit_log_merge(): error -2 committing log merge: getting merge status item This can happen if the server has already committed and resolved the log merge that is being retried. We can safely ignore ENOENT here just like we do a few lines later. Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-12-01 08:58:24 -06:00
Zach Brown	1c7678b6f5	Merge pull request #263 from versity/zab/v1.26 v1.26 Release	2025-11-18 09:39:27 -08:00
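
The findmnt change in the log above (8bb2f83cf9) decides on the command's output rather than its exit status. A minimal sketch of that pattern follows. It assumes two hypothetical stand-in functions in place of the real `findmnt` call so that it can run without a scoutfs mount; `resolve_mount` is likewise an illustrative name, not a function from the fence script.

```shell
#!/bin/bash
# Hypothetical stand-ins for the lookup: one finds a mount, one finds none.
# The second mimics a lookup that exits non-zero when nothing matches.
fake_findmnt_found()   { echo "/mnt/test.0"; return 0; }
fake_findmnt_missing() { return 1; }

# Resolve a mount point, deciding on output alone rather than exit status.
# $1 is the lookup command (an assumption made so this sketch is runnable).
resolve_mount() {
	local lookup=$1
	local mnt
	mnt=$("$lookup") || true    # deliberately ignore the exit status
	[ -z "$mnt" ] && return 1   # empty stdout: treat the mount as gone
	echo "$mnt"
}

if found=$(resolve_mount fake_findmnt_found); then found_rc=0; else found_rc=$?; fi
if missing=$(resolve_mount fake_findmnt_missing); then missing_rc=0; else missing_rc=$?; fi

echo "found='$found' rc=$found_rc missing='$missing' rc=$missing_rc"
```

In the fence script itself the equivalent of the non-zero return is simply `continue`, which moves on to the next sysfs entry instead of failing the fence request.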