Compare commits


31 Commits

Author SHA1 Message Date
Zach Brown
c73e9994c5 v1.27 Release
Finish the release notes for the 1.27 release.

Signed-off-by: Zach Brown <zab@versity.com>
2026-01-07 10:31:54 -08:00
Zach Brown
50bff13f21 Merge pull request #266 from versity/zab/increase_move_empty_budget
Increase server commit block budget for alloc move
2025-12-18 12:44:20 -08:00
Zach Brown
de70ca2372 Increase server commit block budget for alloc move
A few callers of alloc_move_empty in the server were providing a budget
that was too small.  Recent changes to extent_mod_blocks increased the
max budget that is necessary to move extents between btrees.  The
existing WAG of 100 was too small for trees of height 2 and 3.  This
caused looping in production.

We can increase the move budget to half the overall commit budget, which
leaves room for a height of around 7 each.  This is much greater than we
see in practice because the size of the per-mount btrees is effectively
limited by both watermarks and thresholds to commit and drain.

Signed-off-by: Zach Brown <zab@versity.com>
2025-12-17 14:22:04 -06:00
Zach Brown
5af1412d5f Merge pull request #270 from versity/auke/bdev_autoloading
Avoid block device autoloading warning.
2025-12-17 11:06:32 -08:00
Zach Brown
0a2b2ad409 Merge pull request #269 from versity/auke/tap_status_msg
Include t_fail status in tap output.
2025-12-17 11:04:00 -08:00
Auke Kok
6c4590a8a0 Avoid block device autoloading warning.
A mknod()/stat() can trigger the block device autoloading mechanism,
which has long been declared obsolete and, since el9_7, emits a dmesg
warning that then fails the test. You may need to `rmmod loop` to
reproduce.

Avoid this by not triggering a loop autoload at all - we just make a
different blockdev. Choosing `42` here should avoid any autoload
mechanism, as this number is explicitly reserved for demo drivers and
should never trigger an autoload.
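
A hedged illustration of the idea (the path below is made up for this
example, it's not the test's):

    # major 42 is set aside for demo/sample use, so no modprobe alias matches it
    mknod /tmp/demo-block b 42 0
    stat /tmp/demo-block > /dev/null
    rm -f /tmp/demo-block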

We also just ignore the warning line in dmesg, since other tests, or
background noise running during the test, might still trigger it.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-12-08 13:04:58 -08:00
Zach Brown
1768f69c3c Merge pull request #224 from versity/auke/renameat2-test-sub-dir
Use T_D0/1 instead of T_M0 here.
2025-12-08 10:05:46 -08:00
Zach Brown
dcb0fd5805 Merge pull request #268 from versity/auke/dont_use_bash_special_stdfiles
Avoid using bash special device nodes.
2025-12-08 09:47:19 -08:00
Auke Kok
660f874488 Use T_D0/1 instead of T_M0 here.
Use of T_M0 and its variants should be reserved for usages like
`scoutfs <subcommand> -p <mountpoint>`. Tests should create their
individual content files in the assigned subdirectory.
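
A minimal sketch of the convention (the directory and file names are
illustrative only):

    # per-test content belongs under the assigned T_D* subdirectories
    mkdir "$T_D0/dir0"
    touch "$T_D0/dir0/old0"
    # "$T_M0" stays reserved for commands that act on the whole mountpoint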

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-12-04 14:34:02 -05:00
Auke Kok
e1a6689a9b Include t_fail status in tap output.
The tap output file was not yet complete because it failed to include
the contents of `status.msg`. In a few cases, that meant it lacked
important context.
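
A sketch of the kind of detail this adds to the TAP stream, assuming
the harness layout shown in the diff below:

    if [[ -s "$T_RESULTS/tmp/${testname}/status.msg" ]]; then
            echo "#"
            echo "# status:"
            echo "#"
            sed 's/^/# - /' "$T_RESULTS/tmp/${testname}/status.msg"
    fi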

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-12-04 14:09:39 -05:00
Auke Kok
2884a92408 Avoid using bash special device nodes.
Bash has special handling for these standard IO device nodes, but
there are cases where customers have special restrictions set on
them, likely to avoid leaking error data out of system logs as part
of IDS software.

In both these cases we can simply reuse the already-open file
descriptors, which avoids the problem entirely and will always work.
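
A one-line illustration of the substitution (the message text is
arbitrary):

    # duplicate the already-open stderr fd instead of opening /dev/stderr by path
    echo "something went wrong" >&2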

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-12-04 13:24:48 -05:00
Zach Brown
e194714004 Merge pull request #264 from versity/auke/findmnt_retval
Findmnt returns 1 when no matching entries found
2025-12-03 14:29:31 -08:00
Auke Kok
8bb2f83cf9 Findmnt returns 1 when no matching entries found
Our local fence script interprets errors executing `findmnt` as
critical errors, but the program explicitly exits with EXIT_FAILURE
when the total number of matching mount entries is zero.

This can happen if the mount disappeared while we're attempting to
fence it but the scoutfs sysfs files are still in place as we read
them. It's a small window, but it involves a fork/exec plus a full
parse of /etc/fstab, and a lot can happen in the 0.015s findmnt takes
on my system.

findmnt has no exit codes other than 0 and 1. At that point, we can
only assume that if stdout is empty, the mount isn't there anymore.
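
A sketch of the resulting pattern, mirroring the fence script change
in the diff below (the variable names come from that script):

    nr="$(cat "$fs/data_device_maj_min" 2>/dev/null)"
    mnt=$(findmnt -l -n -t scoutfs -o TARGET -S "$nr")
    # exit code 1 with empty output just means the mount is already gone
    [ -z "$mnt" ] && continue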

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-12-02 12:55:11 -08:00
Zach Brown
6a9a6789d5 Merge pull request #267 from versity/clk/merge_enoent
Handle ENOENT when getting log merge status item
2025-12-02 09:34:28 -08:00
Chris Kirby
ee630b164f Handle ENOENT when getting log merge status item
Tests that cause client retries can fail with this error
from server_commit_log_merge():

error -2 committing log merge: getting merge status item

This can happen if the server has already committed and resolved
the log merge that is being retried. We can safely ignore ENOENT here
just like we do a few lines later.

Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-12-01 08:58:24 -06:00
Zach Brown
1c7678b6f5 Merge pull request #263 from versity/zab/v1.26
v1.26 Release
2025-11-18 09:39:27 -08:00
Zach Brown
22b5e79bbd v1.26 Release
Finish the release notes for the 1.26 release.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-17 14:42:14 -08:00
Zach Brown
259e639271 Merge pull request #262 from versity/zab/ino_alloc_per_lock
Add ino_alloc_per_lock option
2025-11-14 13:57:49 -08:00
Zach Brown
4d66c38c71 Remove redundant WARN in commit_log_trees
The server's commit_log_trees has an error message that includes the
source of the error, but it's not used for all errors.  The WARN_ON is
redundant with the message and is removed because it isn't filtered out
when we see errors from forced unmount.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-14 10:04:30 -08:00
Zach Brown
7ef62894bd Add ino_alloc_per_lock option
Add an option that can limit the number of inode numbers that are
allocated per lock group.
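
A hedged usage sketch; the device paths are placeholders and the exact
sysfs location is an assumption based on the option attributes added
below:

    mount -t scoutfs -o metadev_path=/dev/vg/meta,ino_alloc_per_lock=64 \
            /dev/vg/data /mnt/scoutfs
    # tune a mounted fs at runtime (sysfs directory name assumed)
    for d in /sys/fs/scoutfs/*/options; do
            echo 64 > "$d/ino_alloc_per_lock"
    done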

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 17:19:04 -08:00
Zach Brown
1f363a1ead Merge pull request #261 from versity/zab/log_merge_double_free
Zab/log merge double free
2025-11-13 17:18:30 -08:00
Zach Brown
8ddf9b8c8c Handle disappearing fencing requests and targets
The userspace fencing process wasn't careful about handling underlying
directories that disappear while it was working.

On the server/fenced side, fencing requests can linger after they've
been resolved by writing 1 to fenced or error.  The script could come
back around to see the directory before the server finally removes it,
causing all later uses of the request dir to fail.  We saw this in the
logs as a bunch of cat errors for the various request files.

On the local fence script side, all the mounts can be in the process of
being unmounted, so both the /sys/fs dirs and the mount itself can be
removed while we're working.

For both, when we're working with the /sys/fs files we read them without
logging errors and then test that the dir still exists before using what
we read.  When fencing a mount, we stop if findmnt doesn't find the
mount and then raise a umount error if the /sys/fs dir exists after
umount fails.
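
A sketch of the read-then-recheck pattern, mirroring the fenced script
change in the diff below:

    fenced="$(cat "$fence/fenced" 2>/dev/null)"
    error="$(cat "$fence/error" 2>/dev/null)"
    # request dirs can linger and then vanish after fenced/error is set
    [ ! -d "$fence" -o "$fenced" == "1" -o "$error" == "1" ] && continue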

And while we're at it, we have each script's logging append instead of
truncate (if, say, output is a log file instead of an interactive tty).

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 12:43:31 -08:00
Zach Brown
fd80c17ab6 Filter out kernel message when guests are slow
Ignore more kernel messages when debug guests are being slow.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 12:43:31 -08:00
Zach Brown
991e2cbdf8 Ignore slow quorum hb transfers in tests
We're getting test failures from messages warning that our guests can be
unresponsive.  They sure can be.  We don't need to fail for this one
specific case.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 12:43:31 -08:00
Zach Brown
92ac132873 Silence merge splice error when forcing
Silence another error warning and assertion that assume that the
result of the errors is going to be persistent.  When we're forcing an
unmount we've severed storage and networking.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 12:43:31 -08:00
Auke Kok
ad078cd93c Avoid lock stalling mmap_stress
mmap_stress gets completely stalled in lock messaging, starving most
of the mmap_stress threads, which causes it to delay and even time
out in CI.

Instead of spawning threads over all 5 test nodes, we artificially
reduce it to spawning over only 2. This still does a good number of
operations on those nodes, and now the work is spread across the two
nodes evenly.

Additionally, I've added a minuscule (10ms) delay between operations
that should hopefully be sufficient for other locking attempts to
settle and allow the threads to better spread the work.

With this, all the threads exit within 0.25s on my test machine, which
is a lot better than the 40s variation I was seeing locally. Hopefully
this fares better in CI.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-11-13 12:43:31 -08:00
Auke Kok
90cb458cd5 Make mmap_stress not exceed a fixed amount of time.
There's a scenario where mmap_stress gets enough resources that two
of the threads will starve the others, which then all take a very
long time catching up committing changes.

Because this test program didn't finish until all the threads had
completed a fixed amount of work, these threads essentially ended up
tripping over each other. In CI this could take more than 6 hours,
while I originally intended it to run in about 100s or so.

Instead, cap the run time to ~30s by default. If threads exceed
this time, they will immediately exit, which causes any clog in
contention between the threads to drain relatively quickly.
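
A usage sketch taken from the test invocation in the diff below; the
second argument is now a duration in seconds rather than an operation
count:

    mmap_stress 8192 30 "$T_D0/mmap_stress" "$T_D0/mmap_stress" \
            "$T_D0/mmap_stress" "$T_D3/mmap_stress" "$T_D3/mmap_stress"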

Signed-off-by: Auke Kok <auke.kok@versity.com>
2025-11-13 12:43:31 -08:00
Zach Brown
1ab798e7eb Silence inconsistent srch on forced unmount
Assembling a srch compaction operation creates an item and populates it
with allocator state.  If filling the allocator fails, it doesn't
cleanly unwind the allocation or undo the compaction item change, and
it issues a warning.

This warning isn't needed if the error shows that we're in forced
unmount.  The inconsistent state won't be applied; it will be dropped
on the floor as the mount is torn down.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 12:43:31 -08:00
Zach Brown
e182914e51 Fix double free of metadata blocks in log merging
The log merging process is meant to provide parallelism across workers
in mounts.  The idea is that the server hands out a bunch of concurrent
non-intersecting work that's based on the structure of the stable input
fs_root btree.

The nature of the parallel work (cow of the blocks that intersect a key
range) means that the ranges of concurrently issued work can't overlap
or the work will all cow the same input blocks, freeing that input
stable block multiple times.  We're seeing this in testing.

Correctness was intended by having an advancing key that sweeps sorted
ranges.  Duplicate ranges would never be hit as the key advanced past
each range it visited.  This was broken by the mapping of the fs item
keys to log merge tree keys, which clobbers the sk_zone key value and
effectively interleaves the ranges of each zone in the fs root (meta
indexes, orphans, fs items).  With just the right log merge conditions,
involving logged items in the right places and partially completed work
inserting remaining ranges behind the key, ranges can be stored at
mapped keys that end up out of order.  The server iterates over these
and ends up issuing overlapping work, which results in duplicated frees
of the input blocks.

The fix, without changing the format of the stored log tree items, is to
perform a full sweep of all the range items and determine the next item
by looking at the full precision stored keys.  This ensures that the
processed ranges always advance and never overlap.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 12:43:31 -08:00
Zach Brown
8484a58dd6 Have xfstest pass when using args
Our xfstest's golden output includes the full set of tests we expect to
run when no args are specified.  If we specify args then the set of
tests can change, and the test will always fail when it does.

This fixes that by having the test check the set of tests itself, rather
than relying on golden output.  If args are specified then our xfstest
only fails if any of the executed xfstest tests failed.  Without args,
we perform the same scraping of the check output and compare it against
the expected results ourselves.
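
A hedged sketch of the args case; the variable holding check's
captured output and the summary line format are assumptions:

    # fail only if xfstests' own summary reports failed tests
    failed=$(grep '^Failures:' "$check_output" | sed 's/^Failures: *//')
    [ -n "$failed" ] && t_fail "xfstests reported failures: $failed"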

It would have been a bit much to put that large file inline in the test
file, so we add a dir of per-test files in revision control.  We can
also put the list of exclusions there.

We can also clean up the output redirection helper functions to make
them clearer.  After xfstests has executed we want to redirect output
back to the compared output so that we can catch any unexpected output.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 12:43:31 -08:00
Zach Brown
a077104531 Add crash monitor to run-tests
Add a little background function that runs during the test which
triggers a crash if it finds catastrophic failure conditions.

This is the second bg task we want to kill and we can only have one
function run on the EXIT trap, so we create a generic process killing
trap function.

We feed it the fenced pid as well.  run-tests didn't log much of value
into the fenced log, and we're not logging the kills into it anymore,
so we just remove run-tests fenced logging.

Signed-off-by: Zach Brown <zab@versity.com>
2025-11-13 12:43:31 -08:00
16 changed files with 315 additions and 99 deletions

View File

@@ -1,6 +1,56 @@
Versity ScoutFS Release Notes
=============================
---
v1.27
\
*Jan 7, 2026*
Fix a server livelock case that can happen while committing client
transactions that contain a large amount of freed file data extents.
This would present as client tasks hanging and a server task spinning
consuming cpu.
Fix a rare server request processing failure that doesn't deal with
retransmission of a request that a previous server partially processed.
This would present as hung client tasks and repeated "error -2
committing log merge: getting merge status item" kernel messages.
---
v1.26
\
*Nov 17, 2025*
Add the ino\_alloc\_per\_lock mount option. This changes the number of
inode numbers allocated under each cluster lock and can alleviate lock
contention for some patterns of larger file creation.
Add the tcp\_keepalive\_timeout\_ms mount option. This can enable the
system to survive longer periods of networking outages.
Fix a rare double free of internal btree metadata blocks when merging
log trees. The duplicated freed metadata block numbers would cause
persistent errors in the server, preventing the server from starting and
hanging the system.
Fix the data\_wait interface to not require the correct data\_version of
the inode when raising an error. This lets callers raise errors when
they're unable to recall the details of the inode to discover its
data\_version.
Change scoutfs to more aggressively reclaim cached memory when under
memory pressure. This makes scoutfs behave more like other kernel
components and it integrates better with the reclaim policy heuristics
in the VM core of the kernel.
Change scoutfs to more efficiently transmit and receive socket messages.
Under heavy load this can process messages quickly enough to
avoid hung task messages for tasks that were waiting for cluster lock
messages to be processed.
Fix faulty server block commit budget calculations that were generating
spurious "holders exceeded alloc budget" console messages.
---
v1.25
\

View File

@@ -1482,12 +1482,6 @@ static int remove_index_items(struct super_block *sb, u64 ino,
* Return an allocated and unused inode number. Returns -ENOSPC if
* we're out of inode.
*
* Each parent directory has its own pool of free inode numbers. Items
* are sorted by their inode numbers as they're stored in segments.
* This will tend to group together files that are created in a
* directory at the same time in segments. Concurrent creation across
* different directories will be stored in their own regions.
*
* Inode numbers are never reclaimed. If the inode is evicted or we're
* unmounted the pending inode numbers will be lost. Asking for a
* relatively small number from the server each time will tend to
@@ -1497,12 +1491,18 @@ static int remove_index_items(struct super_block *sb, u64 ino,
int scoutfs_alloc_ino(struct super_block *sb, bool is_dir, u64 *ino_ret)
{
DECLARE_INODE_SB_INFO(sb, inf);
struct scoutfs_mount_options opts;
struct inode_allocator *ia;
u64 ino;
u64 nr;
int ret;
ia = is_dir ? &inf->dir_ino_alloc : &inf->ino_alloc;
scoutfs_options_read(sb, &opts);
if (is_dir && opts.ino_alloc_per_lock == SCOUTFS_LOCK_INODE_GROUP_NR)
ia = &inf->dir_ino_alloc;
else
ia = &inf->ino_alloc;
spin_lock(&ia->lock);
@@ -1523,6 +1523,17 @@ int scoutfs_alloc_ino(struct super_block *sb, bool is_dir, u64 *ino_ret)
*ino_ret = ia->ino++;
ia->nr--;
if (opts.ino_alloc_per_lock != SCOUTFS_LOCK_INODE_GROUP_NR) {
nr = ia->ino & SCOUTFS_LOCK_INODE_GROUP_MASK;
if (nr >= opts.ino_alloc_per_lock) {
nr = SCOUTFS_LOCK_INODE_GROUP_NR - nr;
if (nr > ia->nr)
nr = ia->nr;
ia->ino += nr;
ia->nr -= nr;
}
}
spin_unlock(&ia->lock);
ret = 0;
out:

View File

@@ -35,6 +35,12 @@ do { \
} \
} while (0) \
#define scoutfs_bug_on_err(sb, err, fmt, args...) \
do { \
__typeof__(err) _err = (err); \
scoutfs_bug_on(sb, _err < 0 && _err != -ENOLINK, fmt, ##args); \
} while (0)
/*
* Each message is only generated once per volume. Remounting resets
* the messages.

View File

@@ -33,6 +33,7 @@ enum {
Opt_acl,
Opt_data_prealloc_blocks,
Opt_data_prealloc_contig_only,
Opt_ino_alloc_per_lock,
Opt_log_merge_wait_timeout_ms,
Opt_metadev_path,
Opt_noacl,
@@ -47,6 +48,7 @@ static const match_table_t tokens = {
{Opt_acl, "acl"},
{Opt_data_prealloc_blocks, "data_prealloc_blocks=%s"},
{Opt_data_prealloc_contig_only, "data_prealloc_contig_only=%s"},
{Opt_ino_alloc_per_lock, "ino_alloc_per_lock=%s"},
{Opt_log_merge_wait_timeout_ms, "log_merge_wait_timeout_ms=%s"},
{Opt_metadev_path, "metadev_path=%s"},
{Opt_noacl, "noacl"},
@@ -136,6 +138,7 @@ static void init_default_options(struct scoutfs_mount_options *opts)
opts->data_prealloc_blocks = SCOUTFS_DATA_PREALLOC_DEFAULT_BLOCKS;
opts->data_prealloc_contig_only = 1;
opts->ino_alloc_per_lock = SCOUTFS_LOCK_INODE_GROUP_NR;
opts->log_merge_wait_timeout_ms = DEFAULT_LOG_MERGE_WAIT_TIMEOUT_MS;
opts->orphan_scan_delay_ms = -1;
opts->quorum_heartbeat_timeout_ms = SCOUTFS_QUORUM_DEF_HB_TIMEO_MS;
@@ -238,6 +241,18 @@ static int parse_options(struct super_block *sb, char *options, struct scoutfs_m
opts->data_prealloc_contig_only = nr;
break;
case Opt_ino_alloc_per_lock:
ret = match_int(args, &nr);
if (ret < 0 || nr < 1 || nr > SCOUTFS_LOCK_INODE_GROUP_NR) {
scoutfs_err(sb, "invalid ino_alloc_per_lock option, must be between 1 and %u",
SCOUTFS_LOCK_INODE_GROUP_NR);
if (ret == 0)
ret = -EINVAL;
return ret;
}
opts->ino_alloc_per_lock = nr;
break;
case Opt_tcp_keepalive_timeout_ms:
ret = match_int(args, &nr);
ret = verify_tcp_keepalive_timeout_ms(sb, ret, nr);
@@ -393,6 +408,7 @@ int scoutfs_options_show(struct seq_file *seq, struct dentry *root)
seq_puts(seq, ",acl");
seq_printf(seq, ",data_prealloc_blocks=%llu", opts.data_prealloc_blocks);
seq_printf(seq, ",data_prealloc_contig_only=%u", opts.data_prealloc_contig_only);
seq_printf(seq, ",ino_alloc_per_lock=%u", opts.ino_alloc_per_lock);
seq_printf(seq, ",metadev_path=%s", opts.metadev_path);
if (!is_acl)
seq_puts(seq, ",noacl");
@@ -481,6 +497,45 @@ static ssize_t data_prealloc_contig_only_store(struct kobject *kobj, struct kobj
}
SCOUTFS_ATTR_RW(data_prealloc_contig_only);
static ssize_t ino_alloc_per_lock_show(struct kobject *kobj, struct kobj_attribute *attr,
char *buf)
{
struct super_block *sb = SCOUTFS_SYSFS_ATTRS_SB(kobj);
struct scoutfs_mount_options opts;
scoutfs_options_read(sb, &opts);
return snprintf(buf, PAGE_SIZE, "%u", opts.ino_alloc_per_lock);
}
static ssize_t ino_alloc_per_lock_store(struct kobject *kobj, struct kobj_attribute *attr,
const char *buf, size_t count)
{
struct super_block *sb = SCOUTFS_SYSFS_ATTRS_SB(kobj);
DECLARE_OPTIONS_INFO(sb, optinf);
char nullterm[20]; /* more than enough for octal -U32_MAX */
long val;
int len;
int ret;
len = min(count, sizeof(nullterm) - 1);
memcpy(nullterm, buf, len);
nullterm[len] = '\0';
ret = kstrtol(nullterm, 0, &val);
if (ret < 0 || val < 1 || val > SCOUTFS_LOCK_INODE_GROUP_NR) {
scoutfs_err(sb, "invalid ino_alloc_per_lock option, must be between 1 and %u",
SCOUTFS_LOCK_INODE_GROUP_NR);
return -EINVAL;
}
write_seqlock(&optinf->seqlock);
optinf->opts.ino_alloc_per_lock = val;
write_sequnlock(&optinf->seqlock);
return count;
}
SCOUTFS_ATTR_RW(ino_alloc_per_lock);
static ssize_t log_merge_wait_timeout_ms_show(struct kobject *kobj, struct kobj_attribute *attr,
char *buf)
{
@@ -621,6 +676,7 @@ SCOUTFS_ATTR_RO(quorum_slot_nr);
static struct attribute *options_attrs[] = {
SCOUTFS_ATTR_PTR(data_prealloc_blocks),
SCOUTFS_ATTR_PTR(data_prealloc_contig_only),
SCOUTFS_ATTR_PTR(ino_alloc_per_lock),
SCOUTFS_ATTR_PTR(log_merge_wait_timeout_ms),
SCOUTFS_ATTR_PTR(metadev_path),
SCOUTFS_ATTR_PTR(orphan_scan_delay_ms),

View File

@@ -8,6 +8,7 @@
struct scoutfs_mount_options {
u64 data_prealloc_blocks;
bool data_prealloc_contig_only;
unsigned int ino_alloc_per_lock;
unsigned int log_merge_wait_timeout_ms;
char *metadev_path;
unsigned int orphan_scan_delay_ms;

View File

@@ -994,10 +994,11 @@ static int for_each_rid_last_lt(struct super_block *sb, struct scoutfs_btree_roo
}
/*
* Log merge range items are stored at the starting fs key of the range.
* The only fs key field that doesn't hold information is the zone, so
* we use the zone to differentiate all types that we store in the log
* merge tree.
* Log merge range items are stored at the starting fs key of the range
* with the zone overwritten to indicate the log merge item type. This
* day0 mistake loses sorting information for items in the different
* zones in the fs root, so the range items aren't strictly sorted by
* the starting key of their range.
*/
static void init_log_merge_key(struct scoutfs_key *key, u8 zone, u64 first,
u64 second)
@@ -1029,6 +1030,51 @@ static int next_log_merge_item_key(struct super_block *sb, struct scoutfs_btree_
return ret;
}
/*
* The range items aren't sorted by their range.start because
* _RANGE_ZONE clobbers the range's zone. We sweep all the items and
* find the range with the next least starting key that's greater than
* the caller's starting key. We have to be careful to iterate over the
* log_merge tree keys because the ranges can overlap as they're mapped
* to the log_merge keys by clobbering their zone.
*/
static int next_log_merge_range(struct super_block *sb, struct scoutfs_btree_root *root,
struct scoutfs_key *start, struct scoutfs_log_merge_range *rng)
{
struct scoutfs_log_merge_range *next;
SCOUTFS_BTREE_ITEM_REF(iref);
struct scoutfs_key key;
int ret;
key = *start;
key.sk_zone = SCOUTFS_LOG_MERGE_RANGE_ZONE;
scoutfs_key_set_ones(&rng->start);
do {
ret = scoutfs_btree_next(sb, root, &key, &iref);
if (ret == 0) {
if (iref.key->sk_zone != SCOUTFS_LOG_MERGE_RANGE_ZONE) {
ret = -ENOENT;
} else if (iref.val_len != sizeof(struct scoutfs_log_merge_range)) {
ret = -EIO;
} else {
next = iref.val;
if (scoutfs_key_compare(&next->start, &rng->start) < 0 &&
scoutfs_key_compare(&next->start, start) >= 0)
*rng = *next;
key = *iref.key;
scoutfs_key_inc(&key);
}
scoutfs_btree_put_iref(&iref);
}
} while (ret == 0);
if (ret == -ENOENT && !scoutfs_key_is_ones(&rng->start))
ret = 0;
return ret;
}
static int next_log_merge_item(struct super_block *sb,
struct scoutfs_btree_root *root,
u8 zone, u64 first, u64 second,
@@ -1572,7 +1618,8 @@ static int server_get_log_trees(struct super_block *sb,
goto update;
}
ret = alloc_move_empty(sb, &super->data_alloc, &lt.data_freed, 100);
ret = alloc_move_empty(sb, &super->data_alloc, &lt.data_freed,
COMMIT_HOLD_ALLOC_BUDGET / 2);
if (ret == -EINPROGRESS)
ret = 0;
if (ret < 0) {
@@ -1682,6 +1729,7 @@ static int server_commit_log_trees(struct super_block *sb,
int ret;
if (arg_len != sizeof(struct scoutfs_log_trees)) {
err_str = "invalid message log_trees size";
ret = -EINVAL;
goto out;
}
@@ -1745,7 +1793,7 @@ static int server_commit_log_trees(struct super_block *sb,
ret = scoutfs_btree_update(sb, &server->alloc, &server->wri,
&super->logs_root, &key, &lt, sizeof(lt));
BUG_ON(ret < 0); /* dirtying should have guaranteed success */
BUG_ON(ret < 0); /* dirtying should have guaranteed success, srch item inconsistent */
if (ret < 0)
err_str = "updating log trees item";
@@ -1753,11 +1801,10 @@ unlock:
mutex_unlock(&server->logs_mutex);
ret = server_apply_commit(sb, &hold, ret);
out:
if (ret < 0)
scoutfs_err(sb, "server error %d committing client logs for rid %016llx, nr %llu: %s",
ret, rid, le64_to_cpu(lt.nr), err_str);
out:
WARN_ON_ONCE(ret < 0);
return scoutfs_net_response(sb, conn, cmd, id, ret, NULL, 0);
}
@@ -1867,9 +1914,11 @@ static int reclaim_open_log_tree(struct super_block *sb, u64 rid)
scoutfs_alloc_splice_list(sb, &server->alloc, &server->wri, server->other_freed,
&lt.meta_avail)) ?:
(err_str = "empty data_avail",
alloc_move_empty(sb, &super->data_alloc, &lt.data_avail, 100)) ?:
alloc_move_empty(sb, &super->data_alloc, &lt.data_avail,
COMMIT_HOLD_ALLOC_BUDGET / 2)) ?:
(err_str = "empty data_freed",
alloc_move_empty(sb, &super->data_alloc, &lt.data_freed, 100));
alloc_move_empty(sb, &super->data_alloc, &lt.data_freed,
COMMIT_HOLD_ALLOC_BUDGET / 2));
mutex_unlock(&server->alloc_mutex);
/* only finalize, allowing merging, once the allocators are fully freed */
@@ -2094,7 +2143,7 @@ static int server_srch_get_compact(struct super_block *sb,
apply:
ret = server_apply_commit(sb, &hold, ret);
WARN_ON_ONCE(ret < 0 && ret != -ENOENT); /* XXX leaked busy item */
WARN_ON_ONCE(ret < 0 && ret != -ENOENT && ret != -ENOLINK); /* XXX leaked busy item */
out:
ret = scoutfs_net_response(sb, conn, cmd, id, ret,
sc, sizeof(struct scoutfs_srch_compact));
@@ -2472,10 +2521,9 @@ out:
}
}
if (ret < 0)
scoutfs_err(sb, "server error %d splicing log merge completion: %s", ret, err_str);
BUG_ON(ret); /* inconsistent */
/* inconsistent */
scoutfs_bug_on_err(sb, ret,
"server error %d splicing log merge completion: %s", ret, err_str);
return ret ?: einprogress;
}
@@ -2720,10 +2768,7 @@ restart:
/* find the next range, always checking for splicing */
for (;;) {
key = stat.next_range_key;
key.sk_zone = SCOUTFS_LOG_MERGE_RANGE_ZONE;
ret = next_log_merge_item_key(sb, &super->log_merge, SCOUTFS_LOG_MERGE_RANGE_ZONE,
&key, &rng, sizeof(rng));
ret = next_log_merge_range(sb, &super->log_merge, &stat.next_range_key, &rng);
if (ret < 0 && ret != -ENOENT) {
err_str = "finding merge range item";
goto out;
@@ -2994,7 +3039,13 @@ static int server_commit_log_merge(struct super_block *sb,
SCOUTFS_LOG_MERGE_STATUS_ZONE, 0, 0,
&stat, sizeof(stat));
if (ret < 0) {
err_str = "getting merge status item";
/*
* During a retransmission, it's possible that the server
* already committed and resolved this log merge. ENOENT
* is expected in that case.
*/
if (ret != -ENOENT)
err_str = "getting merge status item";
goto out;
}

View File

@@ -8,36 +8,33 @@
echo "$0 running rid '$SCOUTFS_FENCED_REQ_RID' ip '$SCOUTFS_FENCED_REQ_IP' args '$@'"
log() {
echo "$@" > /dev/stderr
echo_fail() {
echo "$@" >&2
exit 1
}
echo_fail() {
echo "$@" > /dev/stderr
exit 1
# silence error messages
quiet_cat()
{
cat "$@" 2>/dev/null
}
rid="$SCOUTFS_FENCED_REQ_RID"
shopt -s nullglob
for fs in /sys/fs/scoutfs/*; do
[ ! -d "$fs" ] && continue
fs_rid="$(quiet_cat $fs/rid)"
nr="$(quiet_cat $fs/data_device_maj_min)"
[ ! -d "$fs" -o "$fs_rid" != "$rid" ] && continue
fs_rid="$(cat $fs/rid)" || \
echo_fail "failed to get rid in $fs"
if [ "$fs_rid" != "$rid" ]; then
continue
mnt=$(findmnt -l -n -t scoutfs -o TARGET -S $nr)
[ -z "$mnt" ] && continue
if ! umount -qf "$mnt"; then
if [ -d "$fs" ]; then
echo_fail "umount -qf $mnt failed"
fi
fi
nr="$(cat $fs/data_device_maj_min)" || \
echo_fail "failed to get data device major:minor in $fs"
mnts=$(findmnt -l -n -t scoutfs -o TARGET -S $nr) || \
echo_fail "findmnt -t scoutfs -S $nr failed"
for mnt in $mnts; do
umount -f "$mnt" || \
echo_fail "umout -f $mnt failed"
done
done
exit 0

View File

@@ -121,6 +121,7 @@ t_filter_dmesg()
# in debugging kernels we can slow things down a bit
re="$re|hrtimer: interrupt took .*"
re="$re|clocksource: Long readout interval"
# fencing tests force unmounts and trigger timeouts
re="$re|scoutfs .* forcing unmount"
@@ -166,6 +167,12 @@ t_filter_dmesg()
# perf warning that it adjusted sample rate
re="$re|perf: interrupt took too long.*lowering kernel.perf_event_max_sample_rate.*"
# some ci test guests are unresponsive
re="$re|longest quorum heartbeat .* delay"
# creating block devices may trigger this
re="$re|block device autoloading is deprecated and will be removed."
egrep -v "($re)" | \
ignore_harmless_unwind_kasan_stack_oob
}

View File

@@ -43,9 +43,14 @@ t_tap_progress()
local testname=$1
local result=$2
local stmsg=""
local diff=""
local dmsg=""
if [[ -s $T_RESULTS/tmp/${testname}/status.msg ]]; then
stmsg="1"
fi
if [[ -s "$T_RESULTS/tmp/${testname}/dmesg.new" ]]; then
dmsg="1"
fi
@@ -61,6 +66,7 @@ t_tap_progress()
echo "# ${testname} ** skipped - permitted **"
else
echo "not ok ${i} - ${testname}"
case ${result} in
101)
echo "# ${testname} ** skipped **"
@@ -70,6 +76,13 @@ t_tap_progress()
;;
esac
if [[ -n "${stmsg}" ]]; then
echo "#"
echo "# status:"
echo "#"
cat $T_RESULTS/tmp/${testname}/status.msg | sed 's/^/# - /'
fi
if [[ -n "${diff}" ]]; then
echo "#"
echo "# diff:"

View File

@@ -39,20 +39,6 @@ cmd() {
die "cmd failed (check the run.log)"
}
# we can record pids to kill as we exit, we kill in reverse added order
declare -a atexit_kill_pids
atexit_kill()
{
local pid
for pid in $(echo ${atexit_kill_pids[*]} | rev); do
if test -e "/proc/$pid/status" ; then
kill "$pid"
fi
done
}
trap atexit_kill EXIT
show_help()
{
cat << EOF
@@ -452,6 +438,30 @@ cmd grep . /sys/kernel/debug/tracing/options/trace_printk \
/sys/kernel/debug/tracing/buffer_size_kb \
/proc/sys/kernel/ftrace_dump_on_oops
# we can record pids to kill as we exit, we kill in reverse added order
atexit_kill_pids=""
add_atexit_kill_pid()
{
atexit_kill_pids="$1 $atexit_kill_pids"
}
atexit_kill()
{
local pid
# suppress bg function exited messages
exec {ERR}>&2 2>/dev/null
for pid in $atexit_kill_pids; do
if test -e "/proc/$pid/status" ; then
kill "$pid"
wait "$pid"
fi
done
exec 2>&$ERR {ERR}>&-
}
trap atexit_kill EXIT
#
# Build a fenced config that runs scripts out of the repository rather
# than the default system directory
@@ -467,7 +477,7 @@ T_FENCED_LOG="$T_RESULTS/fenced.log"
$T_UTILS/fenced/scoutfs-fenced > "$T_FENCED_LOG" 2>&1 &
fenced_pid=$!
atexit_kill_pids+=($fenced_pid)
add_atexit_kill_pid $fenced_pid
#
# some critical failures will cause fs operations to hang. We can watch
@@ -496,13 +506,12 @@ crash_monitor()
if [ "$bad" != 0 ]; then
echo "run-tests monitor triggering crash"
echo c > /proc/sysrq-trigger
# bg function doesn't reload bash, $$ is parent run-tests.sh
kill -9 $$
exit 1
fi
done
}
crash_monitor &
atexit_kill_pids+=($!)
add_atexit_kill_pid $!
# setup dm tables
echo "0 $(blockdev --getsz $T_META_DEVICE) linear $T_META_DEVICE 0" > \

View File

@@ -19,6 +19,7 @@
#include <sys/types.h>
#include <stdio.h>
#include <sys/stat.h>
#include <inttypes.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
@@ -29,7 +30,7 @@
#include <errno.h>
static int size = 0;
static int count = 0; /* XXX make this duration instead */
static int duration = 0;
struct thread_info {
int nr;
@@ -41,6 +42,8 @@ static void *run_test_func(void *ptr)
void *buf = NULL;
char *addr = NULL;
struct thread_info *tinfo = ptr;
uint64_t seconds = 0;
struct timespec ts;
int c = 0;
int fd;
ssize_t read, written, ret;
@@ -61,9 +64,15 @@ static void *run_test_func(void *ptr)
usleep(100000); /* 0.1sec to allow all threads to start roughly at the same time */
clock_gettime(CLOCK_REALTIME, &ts); /* record start time */
seconds = ts.tv_sec + duration;
for (;;) {
if (++c > count)
break;
if (++c % 16 == 0) {
clock_gettime(CLOCK_REALTIME, &ts);
if (ts.tv_sec >= seconds)
break;
}
switch (rand() % 4) {
case 0: /* pread */
@@ -99,6 +108,8 @@ static void *run_test_func(void *ptr)
memcpy(addr, buf, size); /* noerr */
break;
}
usleep(10000);
}
munmap(addr, size);
@@ -120,7 +131,7 @@ int main(int argc, char **argv)
int i;
if (argc != 8) {
fprintf(stderr, "%s requires 7 arguments - size count file1 file2 file3 file4 file5\n", argv[0]);
fprintf(stderr, "%s requires 7 arguments - size duration file1 file2 file3 file4 file5\n", argv[0]);
exit(-1);
}
@@ -130,9 +141,9 @@ int main(int argc, char **argv)
exit(-1);
}
count = atoi(argv[2]);
if (count < 0) {
fprintf(stderr, "invalid count, must be greater than 0\n");
duration = atoi(argv[2]);
if (duration < 0) {
fprintf(stderr, "invalid duration, must be greater than or equal to 0\n");
exit(-1);
}

View File

@@ -72,7 +72,7 @@ touch $T_D0/dir/file
mkdir $T_D0/dir/dir
ln -s $T_D0/dir/file $T_D0/dir/symlink
mknod $T_D0/dir/char c 1 3 # null
mknod $T_D0/dir/block b 7 0 # loop0
mknod $T_D0/dir/block b 42 0 # SAMPLE block dev - nonexistant/demo use only number
for name in $(ls -UA $T_D0/dir | sort); do
ino=$(stat -c '%i' $T_D0/dir/$name)
$GRE $ino | filter_types

View File

@@ -5,7 +5,7 @@
t_require_commands mmap_stress mmap_validate scoutfs xfs_io
echo "== mmap_stress"
mmap_stress 8192 2000 "$T_D0/mmap_stress" "$T_D1/mmap_stress" "$T_D2/mmap_stress" "$T_D3/mmap_stress" "$T_D4/mmap_stress" | sed 's/:.*//g' | sort
mmap_stress 8192 30 "$T_D0/mmap_stress" "$T_D0/mmap_stress" "$T_D0/mmap_stress" "$T_D3/mmap_stress" "$T_D3/mmap_stress" | sed 's/:.*//g' | sort
echo "== basic mmap/read/write consistency checks"
mmap_validate 256 1000 "$T_D0/mmap_val1" "$T_D1/mmap_val1"

View File

@@ -8,19 +8,19 @@ t_require_mounts 2
echo "=== renameat2 noreplace flag test"
# give each mount their own dir (lock group) to minimize create contention
mkdir $T_M0/dir0
mkdir $T_M1/dir1
mkdir $T_D0/dir0
mkdir $T_D1/dir1
echo "=== run two asynchronous calls to renameat2 NOREPLACE"
for i in $(seq 0 100); do
# prepare inputs in isolation
touch "$T_M0/dir0/old0"
touch "$T_M1/dir1/old1"
touch "$T_D0/dir0/old0"
touch "$T_D1/dir1/old1"
# race doing noreplace renames, both can't succeed
dumb_renameat2 -n "$T_M0/dir0/old0" "$T_M0/dir0/sharednew" 2> /dev/null &
dumb_renameat2 -n "$T_D0/dir0/old0" "$T_D0/dir0/sharednew" 2> /dev/null &
pid0=$!
dumb_renameat2 -n "$T_M1/dir1/old1" "$T_M1/dir0/sharednew" 2> /dev/null &
dumb_renameat2 -n "$T_D1/dir1/old1" "$T_D1/dir0/sharednew" 2> /dev/null &
pid1=$!
wait $pid0
@@ -31,7 +31,7 @@ for i in $(seq 0 100); do
test "$rc0" == 0 -a "$rc1" == 0 && t_fail "both renames succeeded"
# blow away possible files for either race outcome
rm -f "$T_M0/dir0/old0" "$T_M1/dir1/old1" "$T_M0/dir0/sharednew" "$T_M1/dir1/sharednew"
rm -f "$T_D0/dir0/old0" "$T_D1/dir1/old1" "$T_D0/dir0/sharednew" "$T_D1/dir1/sharednew"
done
t_pass

View File

@@ -62,31 +62,27 @@ test -x "$SCOUTFS_FENCED_RUN" || \
# files disappear.
#
# generate failure messages to stderr while still echoing 0 for the caller
careful_cat()
# silence error messages
quiet_cat()
{
local path="$@"
cat "$@" || echo 0
cat "$@" 2>/dev/null
}
while sleep $SCOUTFS_FENCED_DELAY; do
shopt -s nullglob
for fence in /sys/fs/scoutfs/*/fence/*; do
# catches unmatched regex when no dirs
if [ ! -d "$fence" ]; then
continue
fi
# skip requests that have been handled
if [ "$(careful_cat $fence/fenced)" == 1 -o \
"$(careful_cat $fence/error)" == 1 ]; then
continue
fi
srv=$(basename $(dirname $(dirname $fence)))
rid="$(cat $fence/rid)"
ip="$(cat $fence/ipv4_addr)"
reason="$(cat $fence/reason)"
fenced="$(quiet_cat $fence/fenced)"
error="$(quiet_cat $fence/error)"
rid="$(quiet_cat $fence/rid)"
ip="$(quiet_cat $fence/ipv4_addr)"
reason="$(quiet_cat $fence/reason)"
# request dirs can linger then disappear after fenced/error is set
if [ ! -d "$fence" -o "$fenced" == "1" -o "$error" == "1" ]; then
continue
fi
log_message "server $srv fencing rid $rid at IP $ip for $reason"

View File

@@ -55,6 +55,14 @@ with initial sparse regions (perhaps by multiple threads writing to
different regions) and wasted space isn't an issue (perhaps because the
file population contains few small files).
.TP
.B ino_alloc_per_lock=<number>
This option determines how many inode numbers are allocated in the same
cluster lock. The default, and maximum, is 1024. The minimum is 1.
Allocating fewer inodes per lock can allow more parallelism between
mounts because there are more locks that cover the same number of
created files. This can be helpful when working with smaller numbers of
large files.
.TP
.B log_merge_wait_timeout_ms=<number>
This option sets the amount of time, in milliseconds, that log merge
creation can wait before timing out. This setting is per-mount, only