Fix commit budget calculation with multiple holders

The try_drain_data_freed() path was generating errors about overrunning its commit budget:

    scoutfs f.2b8928.r.02689f error: 1 holders exceeded alloc budget av: bef 8185 now 8036, fr: bef 8185 now 7602

The budget overrun check was using the current number of commit holders (in this case one) instead of the maximum number of concurrent holders (in this case two). So even well-behaved paths like try_drain_data_freed() can appear to exceed their commit budget if other holders dirty some blocks and apply their commits before the try_drain_data_freed() thread does its final budget reconciliation.

Signed-off-by: Chris Kirby <ckirby@versity.com>
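To make the failure mode concrete, here is a minimal, hypothetical C model of the holder accounting. It is not the scoutfs code: `struct cusers_model`, the `model_*()` helpers and the `HOLD_BUDGET` value of 500 are assumptions for illustration, since the real `COMMIT_HOLD_ALLOC_BUDGET` value is not shown in this change. The model replays the scenario from the error message: try_drain_data_freed() holds, a second holder joins and then applies first, and the final check sees 583 freed blocks used (8185 - 7602). The old check scales the budget by the holders present at check time; the new one keeps the largest budget granted in hold_commit() until commit_end() resets it.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * HOLD_BUDGET is a hypothetical stand-in for COMMIT_HOLD_ALLOC_BUDGET;
 * its real value is not part of this change.
 */
#define HOLD_BUDGET 500u

/* Reduced, hypothetical model of the patched struct commit_users. */
struct cusers_model {
	unsigned int nr_holders;
	unsigned int budget;	/* high-water budget since the last commit */
};

static unsigned int max_uint(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

/* Like hold_commit(): +2 covers this hold and the server's final commit work. */
static void model_hold(struct cusers_model *c)
{
	c->budget = max_uint(c->budget, (c->nr_holders + 2) * HOLD_BUDGET);
	c->nr_holders++;
}

/* A holder applies its commit and drops out of the holder count. */
static void model_apply(struct cusers_model *c)
{
	assert(c->nr_holders > 0);
	c->nr_holders--;
}

/* Like commit_end(): the budget only resets once the commit is written. */
static void model_commit_end(struct cusers_model *c)
{
	c->budget = 0;
}

/* Old check: budget scaled by the holders present at check time. */
static bool old_exceeded(const struct cusers_model *c, unsigned int used)
{
	return used > c->nr_holders * HOLD_BUDGET;
}

/* New check: compare against the retained high-water budget. */
static bool new_exceeded(const struct cusers_model *c, unsigned int used)
{
	return used > c->budget;
}
```

With these assumed numbers, after two holds and one apply the old check compares 583 against 1 × 500 and reports an overrun, while the retained budget of 3 × 500 = 1500 from when two holders were present does not.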
Fix dirtied block calculation in extent_mod_blocks()
2026-04-30 09:56:55 +00:00 · 2025-06-17 11:38:07 -05:00 · 2025-06-17 11:38:07 -05:00 · 2025-06-04 11:21:25 -07:00 · 2025-06-03 13:35:42 -07:00 · 2025-05-12 12:21:02 -07:00
11 changed files with 218 additions and 109 deletions
--- a/ReleaseNotes.md
+++ b/ReleaseNotes.md
@@ -1,6 +1,27 @@
 Versity ScoutFS Release Notes
 =============================

+---
+v1.25
+\
+*Jun 3, 2025*
+
+Fix a bug that could cause indefinite retries of failed client commits.
+Under specific error conditions the client and server's understanding of
+the current client commit could get out of sync.  The client would
+indefinitely retry commits that could never succeed.  This manifested as
+infinite "critical transaction commit failure" messages in the kernel
+log on the client and matching "error <nr> committing client logs" on
+the server.
+
+Fix a bug in a specific case of server error handling that could result
+in sending references to unwritten blocks to the client.  The client
+would try to read blocks that hadn't been written and return spurious
+errors.  This was seen under low free space conditions on the server and
+resulted in error messages with error code 116 (the errno value for
+ESTALE, the client's indication that it couldn't read the blocks that it
+expected).
+
 ---
 v1.24
 \
--- a/kmod/src/alloc.c
+++ b/kmod/src/alloc.c
@@ -86,18 +86,47 @@ static u64 smallest_order_length(u64 len)
 }

 /*
- * An extent modification dirties three distinct leaves of an allocator
- * btree as it adds and removes the blkno and size sorted items for the
- * old and new lengths of the extent.  Dirtying the paths to these
- * leaves can grow the tree and grow/shrink neighbours at each level.
- * We over-estimate the number of blocks allocated and freed (the paths
- * share a root, growth doesn't free) to err on the simpler and safer
- * side.  The overhead is minimal given the relatively large list blocks
- * and relatively short allocator trees.
+ * Moving an extent between trees can dirty blocks in several ways. This
+ * function calculates the worst-case number of blocks across these scenarios.
+ * We treat the alloc and free counts independently, so the values below are
+ * max(allocated, freed), not the sum.
+ *
+ * We track extents with two separate btree items: by block number and by size.
+ *
+ * If we're removing an extent from the btree (allocating), we can dirty
+ * two blocks if the keys are in different leaves. If we wind up merging
+ * leaves because we fall below the low water mark, we can wind up freeing
+ * three leaves.
+ *
+ * That sequence is as follows, assuming the original keys are removed from
+ * blocks A and B:
+ *
+ * Allocate new dirty A' and B'
+ * Free old stable A and B
+ * B' has fallen below the low water mark, so copy B' into A'
+ * Free B'
+ *
+ * An extent insertion (freeing an extent) can dirty up to five distinct items
+ * in the btree as it adds and removes the blkno and size sorted items for the
+ * old and new lengths of the extent:
+ *
+ * In the by-blkno portion of the btree, we can dirty (allocate for COW) up
+ * to two blocks: either by merging adjacent extents, which can cause us to
+ * join leaf blocks; or by an insertion that causes a split.
+ *
+ * In the by-size portion, we never merge extents, so normally we just dirty
+ * a single item with a size insertion. But if we merged adjacent extents in
+ * the by-blkno portion of the tree, we might be working with three by-size
+ * items: removing the two old ones that were combined in the merge; and
+ * adding the new one for the larger, merged size.
+ *
+ * Finally, dirtying the paths to these leaves can grow the tree and grow/shrink
+ * neighbours at each level, so we multiply by the height of the tree after
+ * accounting for a possible new level.
 */
 static u32 extent_mod_blocks(u32 height)
 {
-	return ((1 + height) * 2) * 3;
+	return ((1 + height) * 3) * 5;
 }

 /*
--- a/kmod/src/scoutfs_trace.h
+++ b/kmod/src/scoutfs_trace.h
@@ -1966,15 +1966,17 @@ DEFINE_EVENT(scoutfs_server_client_count_class, scoutfs_server_client_down,
 );

 DECLARE_EVENT_CLASS(scoutfs_server_commit_users_class,
-        TP_PROTO(struct super_block *sb, int holding, int applying, int nr_holders,
-		 u32 avail_before, u32 freed_before, int committing, int exceeded),
-        TP_ARGS(sb, holding, applying, nr_holders, avail_before, freed_before, committing,
-		exceeded),
+        TP_PROTO(struct super_block *sb, int holding, int applying,
+		 int nr_holders, u32 budget,
+		 u32 avail_before, u32 freed_before,
+		 int committing, int exceeded),
+        TP_ARGS(sb, holding, applying, nr_holders, budget, avail_before, freed_before, committing, exceeded),
        TP_STRUCT__entry(
 		SCSB_TRACE_FIELDS
 		__field(int, holding)
 		__field(int, applying)
 		__field(int, nr_holders)
+		__field(u32, budget)
 		__field(__u32, avail_before)
 		__field(__u32, freed_before)
 		__field(int, committing)
@@ -1985,35 +1987,45 @@ DECLARE_EVENT_CLASS(scoutfs_server_commit_users_class,
 		__entry->holding = !!holding;
 		__entry->applying = !!applying;
 		__entry->nr_holders = nr_holders;
+		__entry->budget = budget;
 		__entry->avail_before = avail_before;
 		__entry->freed_before = freed_before;
 		__entry->committing = !!committing;
 		__entry->exceeded = !!exceeded;
        ),
-	TP_printk(SCSBF" holding %u applying %u nr %u avail_before %u freed_before %u committing %u exceeded %u",
-		  SCSB_TRACE_ARGS, __entry->holding, __entry->applying, __entry->nr_holders,
-		  __entry->avail_before, __entry->freed_before, __entry->committing,
-		  __entry->exceeded)
+	TP_printk(SCSBF" holding %u applying %u nr %u budget %u avail_before %u freed_before %u committing %u exceeded %u",
+		  SCSB_TRACE_ARGS, __entry->holding, __entry->applying,
+		  __entry->nr_holders, __entry->budget,
+		  __entry->avail_before, __entry->freed_before,
+		  __entry->committing, __entry->exceeded)
 );
 DEFINE_EVENT(scoutfs_server_commit_users_class, scoutfs_server_commit_hold,
-        TP_PROTO(struct super_block *sb, int holding, int applying, int nr_holders,
-		 u32 avail_before, u32 freed_before, int committing, int exceeded),
-        TP_ARGS(sb, holding, applying, nr_holders, avail_before, freed_before, committing, exceeded)
+        TP_PROTO(struct super_block *sb, int holding, int applying,
+		 int nr_holders, u32 budget,
+		 u32 avail_before, u32 freed_before,
+		 int committing, int exceeded),
+        TP_ARGS(sb, holding, applying, nr_holders, budget, avail_before, freed_before, committing, exceeded)
 );
 DEFINE_EVENT(scoutfs_server_commit_users_class, scoutfs_server_commit_apply,
-        TP_PROTO(struct super_block *sb, int holding, int applying, int nr_holders,
-		 u32 avail_before, u32 freed_before, int committing, int exceeded),
-        TP_ARGS(sb, holding, applying, nr_holders, avail_before, freed_before, committing, exceeded)
+        TP_PROTO(struct super_block *sb, int holding, int applying,
+		 int nr_holders, u32 budget,
+		 u32 avail_before, u32 freed_before,
+		 int committing, int exceeded),
+        TP_ARGS(sb, holding, applying, nr_holders, budget, avail_before, freed_before, committing, exceeded)
 );
 DEFINE_EVENT(scoutfs_server_commit_users_class, scoutfs_server_commit_start,
-        TP_PROTO(struct super_block *sb, int holding, int applying, int nr_holders,
-		 u32 avail_before, u32 freed_before, int committing, int exceeded),
-        TP_ARGS(sb, holding, applying, nr_holders, avail_before, freed_before, committing, exceeded)
+        TP_PROTO(struct super_block *sb, int holding, int applying,
+		 int nr_holders, u32 budget,
+		 u32 avail_before, u32 freed_before,
+		 int committing, int exceeded),
+        TP_ARGS(sb, holding, applying, nr_holders, budget, avail_before, freed_before, committing, exceeded)
 );
 DEFINE_EVENT(scoutfs_server_commit_users_class, scoutfs_server_commit_end,
-        TP_PROTO(struct super_block *sb, int holding, int applying, int nr_holders,
-		 u32 avail_before, u32 freed_before, int committing, int exceeded),
-        TP_ARGS(sb, holding, applying, nr_holders, avail_before, freed_before, committing, exceeded)
+        TP_PROTO(struct super_block *sb, int holding, int applying,
+		 int nr_holders, u32 budget,
+		 u32 avail_before, u32 freed_before,
+		 int committing, int exceeded),
+        TP_ARGS(sb, holding, applying, nr_holders, budget, avail_before, freed_before, committing, exceeded)
 );

 #define slt_symbolic(mode)						\
--- a/kmod/src/server.c
+++ b/kmod/src/server.c
@@ -65,6 +65,7 @@ struct commit_users {
 	struct list_head holding;
 	struct list_head applying;
 	unsigned int nr_holders;
+	u32 budget;
 	u32 avail_before;
 	u32 freed_before;
 	bool committing;
@@ -84,8 +85,9 @@ static void init_commit_users(struct commit_users *cusers)
 do {												\
 	__typeof__(cusers) _cusers = (cusers);							\
 	trace_scoutfs_server_commit_##which(sb, !list_empty(&_cusers->holding),			\
-		!list_empty(&_cusers->applying), _cusers->nr_holders, _cusers->avail_before,	\
-		_cusers->freed_before, _cusers->committing, _cusers->exceeded);			\
+		!list_empty(&_cusers->applying), _cusers->nr_holders, _cusers->budget,		\
+		_cusers->avail_before, _cusers->freed_before, _cusers->committing,		\
+		_cusers->exceeded);								\
 } while (0)

 struct server_info {
@@ -303,7 +305,6 @@ static void check_holder_budget(struct super_block *sb, struct server_info *serv
 	u32 freed_used;
 	u32 avail_now;
 	u32 freed_now;
-	u32 budget;

 	assert_spin_locked(&cusers->lock);

@@ -318,15 +319,14 @@ static void check_holder_budget(struct super_block *sb, struct server_info *serv
 	else
 		freed_used = SCOUTFS_ALLOC_LIST_MAX_BLOCKS - freed_now;

-	budget = cusers->nr_holders * COMMIT_HOLD_ALLOC_BUDGET;
-	if (avail_used <= budget && freed_used <= budget)
+	if (avail_used <= cusers->budget && freed_used <= cusers->budget)
 		return;

 	exceeded_once = true;
 	cusers->exceeded = cusers->nr_holders;

-	scoutfs_err(sb, "%u holders exceeded alloc budget av: bef %u now %u, fr: bef %u now %u",
-		    cusers->nr_holders, cusers->avail_before, avail_now,
+	scoutfs_err(sb, "holders exceeded alloc budget %u av: bef %u now %u, fr: bef %u now %u",
+		    cusers->budget, cusers->avail_before, avail_now,
 		    cusers->freed_before, freed_now);

 	list_for_each_entry(hold, &cusers->holding, entry) {
@@ -349,7 +349,7 @@ static bool hold_commit(struct super_block *sb, struct server_info *server,
 {
 	bool has_room;
 	bool held;
-	u32 budget;
+	u32 new_budget;
 	u32 av;
 	u32 fr;

@@ -367,8 +367,8 @@ static bool hold_commit(struct super_block *sb, struct server_info *server,
 	}

 	/* +2 for our additional hold and then for the final commit work the server does */
-	budget = (cusers->nr_holders + 2) * COMMIT_HOLD_ALLOC_BUDGET;
-	has_room = av >= budget && fr >= budget;
+	new_budget = max(cusers->budget, (cusers->nr_holders + 2) * COMMIT_HOLD_ALLOC_BUDGET);
+	has_room = av >= new_budget && fr >= new_budget;
 	/* checking applying so holders drain once an apply caller starts waiting */
 	held = !cusers->committing && has_room && list_empty(&cusers->applying);

@@ -388,6 +388,7 @@ static bool hold_commit(struct super_block *sb, struct server_info *server,
 		list_add_tail(&hold->entry, &cusers->holding);

 		cusers->nr_holders++;
+		cusers->budget = new_budget;

 	} else if (!has_room && cusers->nr_holders == 0 && !cusers->committing) {
 		cusers->committing = true;
@@ -516,6 +517,7 @@ static void commit_end(struct super_block *sb, struct commit_users *cusers, int
 	list_for_each_entry_safe(hold, tmp, &cusers->applying, entry)
 		list_del_init(&hold->entry);
 	cusers->committing = false;
+	cusers->budget = 0;
 	spin_unlock(&cusers->lock);

 	wake_up(&cusers->waitq);
@@ -1299,12 +1301,10 @@ static int finalize_and_start_log_merge(struct super_block *sb, struct scoutfs_l
 * is nested inside holding commits so we recheck the persistent item
 * each time we commit to make sure it's still what we think.   The
 * caller is still going to send the item to the client so we update the
- * caller's each time we make progress.  This is a best-effort attempt
- * to clean up and it's valid to leave extents in data_freed we don't
- * return errors to the caller.  The client will continue the work later
- * in get_log_trees or as the rid is reclaimed.
+ * caller's each time we make progress.  If we hit an error applying the
+ * changes we make then we can't send the log_trees to the client.
 */
-static void try_drain_data_freed(struct super_block *sb, struct scoutfs_log_trees *lt)
+static int try_drain_data_freed(struct super_block *sb, struct scoutfs_log_trees *lt)
 {
 	DECLARE_SERVER_INFO(sb, server);
 	struct scoutfs_super_block *super = DIRTY_SUPER_SB(sb);
@@ -1313,6 +1313,7 @@ static void try_drain_data_freed(struct super_block *sb, struct scoutfs_log_tree
 	struct scoutfs_log_trees drain;
 	struct scoutfs_key key;
 	COMMIT_HOLD(hold);
+	bool apply = false;
 	int ret = 0;
 	int err;

@@ -1321,22 +1322,27 @@ static void try_drain_data_freed(struct super_block *sb, struct scoutfs_log_tree
 	while (lt->data_freed.total_len != 0) {
 		server_hold_commit(sb, &hold);
 		mutex_lock(&server->logs_mutex);
+		apply = true;

 		ret = find_log_trees_item(sb, &super->logs_root, false, rid, U64_MAX, &drain);
-		if (ret < 0)
+		if (ret < 0) {
+			ret = 0;
 			break;
+		}

 		/* careful to only keep draining the caller's specific open trans */
 		if (drain.nr != lt->nr || drain.get_trans_seq != lt->get_trans_seq ||
 		    drain.commit_trans_seq != lt->commit_trans_seq || drain.flags != lt->flags) {
-			ret = -ENOENT;
+			ret = 0;
 			break;
 		}

 		ret = scoutfs_btree_dirty(sb, &server->alloc, &server->wri,
 					  &super->logs_root, &key);
-		if (ret < 0)
+		if (ret < 0) {
+			ret = 0;
 			break;
+		}

 		/* moving can modify and return errors, always update caller and item */
 		mutex_lock(&server->alloc_mutex);
@@ -1352,19 +1358,19 @@ static void try_drain_data_freed(struct super_block *sb, struct scoutfs_log_tree
 		BUG_ON(err < 0); /* dirtying must guarantee success */

 		mutex_unlock(&server->logs_mutex);
-
 		ret = server_apply_commit(sb, &hold, ret);
-		if (ret < 0) {
-			ret = 0; /* don't try to abort, ignoring ret */
+		apply = false;
+
+		if (ret < 0)
 			break;
-		}
 	}

-	/* try to cleanly abort and write any partial dirty btree blocks, but ignore result */
-	if (ret < 0) {
+	if (apply) {
 		mutex_unlock(&server->logs_mutex);
-		server_apply_commit(sb, &hold, 0);
+		server_apply_commit(sb, &hold, ret);
 	}
+
+	return ret;
 }

 /*
@@ -1572,9 +1578,9 @@ out:
 		scoutfs_err(sb, "error %d getting log trees for rid %016llx: %s",
 			    ret, rid, err_str);

-	/* try to drain excessive data_freed with additional commits, if needed, ignoring err */
+	/* try to drain excessive data_freed with additional commits, if needed */
 	if (ret == 0)
-		try_drain_data_freed(sb, &lt);
+		ret = try_drain_data_freed(sb, &lt);

 	return scoutfs_net_response(sb, conn, cmd, id, ret, &lt, sizeof(lt));
 }
@@ -4149,7 +4155,7 @@ static void fence_pending_recov_worker(struct work_struct *work)
 	struct server_info *server = container_of(work, struct server_info,
 						  fence_pending_recov_work);
 	struct super_block *sb = server->sb;
-	union scoutfs_inet_addr addr;
+	union scoutfs_inet_addr addr = {{0,}};
 	u64 rid = 0;
 	int ret = 0;

--- a/kmod/src/trans.c
+++ b/kmod/src/trans.c
@@ -159,6 +159,58 @@ static bool drained_holders(struct trans_info *tri)
 	return holders == 0;
 }

+static int commit_current_log_trees(struct super_block *sb, char **str)
+{
+	DECLARE_TRANS_INFO(sb, tri);
+
+	return (*str = "data submit", scoutfs_inode_walk_writeback(sb, true)) ?:
+	       (*str = "item dirty", scoutfs_item_write_dirty(sb))  ?:
+	       (*str = "data prepare", scoutfs_data_prepare_commit(sb))  ?:
+	       (*str = "alloc prepare", scoutfs_alloc_prepare_commit(sb, &tri->alloc, &tri->wri)) ?:
+	       (*str = "meta write", scoutfs_block_writer_write(sb, &tri->wri))  ?:
+	       (*str = "data wait", scoutfs_inode_walk_writeback(sb, false)) ?:
+	       (*str = "commit log trees", commit_btrees(sb)) ?:
+	       scoutfs_item_write_done(sb);
+}
+
+static int get_next_log_trees(struct super_block *sb, char **str)
+{
+	return (*str = "get log trees", scoutfs_trans_get_log_trees(sb));
+}
+
+static int retry_forever(struct super_block *sb, int (*func)(struct super_block *sb, char **str))
+{
+	bool retrying = false;
+	char *str;
+	int ret;
+
+	do {
+		str = NULL;
+
+		ret = func(sb, &str);
+		if (ret < 0) {
+			if (!retrying) {
+				scoutfs_warn(sb, "critical transaction commit failure: %s = %d, retrying",
+					    str, ret);
+				retrying = true;
+			}
+
+			if (scoutfs_forcing_unmount(sb)) {
+				ret = -EIO;
+				break;
+			}
+
+			msleep(2 * MSEC_PER_SEC);
+
+		} else if (retrying) {
+			scoutfs_info(sb, "retried transaction commit succeeded");
+		}
+
+	} while (ret < 0);
+
+	return ret;
+}
+
 /*
 * This work func is responsible for writing out all the dirty blocks
 * that make up the current dirty transaction.  It prevents writers from
@@ -184,8 +236,6 @@ void scoutfs_trans_write_func(struct work_struct *work)
 	struct trans_info *tri = container_of(work, struct trans_info, write_work.work);
 	struct super_block *sb = tri->sb;
 	struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
-	bool retrying = false;
-	char *s = NULL;
 	int ret = 0;

 	tri->task = current;
@@ -214,37 +264,9 @@ void scoutfs_trans_write_func(struct work_struct *work)

 	scoutfs_inc_counter(sb, trans_commit_written);

-	do {
-		ret = (s = "data submit", scoutfs_inode_walk_writeback(sb, true)) ?:
-		      (s = "item dirty", scoutfs_item_write_dirty(sb))  ?:
-		      (s = "data prepare", scoutfs_data_prepare_commit(sb))  ?:
-		      (s = "alloc prepare", scoutfs_alloc_prepare_commit(sb, &tri->alloc,
-									 &tri->wri))  ?:
-		      (s = "meta write", scoutfs_block_writer_write(sb, &tri->wri))  ?:
-		      (s = "data wait", scoutfs_inode_walk_writeback(sb, false)) ?:
-		      (s = "commit log trees", commit_btrees(sb)) ?:
-		      scoutfs_item_write_done(sb) ?:
-		      (s = "get log trees", scoutfs_trans_get_log_trees(sb));
-		if (ret < 0) {
-			if (!retrying) {
-				scoutfs_warn(sb, "critical transaction commit failure: %s = %d, retrying",
-					    s, ret);
-				retrying = true;
-			}
-
-			if (scoutfs_forcing_unmount(sb)) {
-				ret = -EIO;
-				break;
-			}
-
-			msleep(2 * MSEC_PER_SEC);
-
-		} else if (retrying) {
-			scoutfs_info(sb, "retried transaction commit succeeded");
-		}
-
-	} while (ret < 0);
-
+	/* retry {commit,get}_log_trees until they succeed, can only fail when forcing unmount */
+	ret = retry_forever(sb, commit_current_log_trees) ?:
+	      retry_forever(sb, get_next_log_trees);
 out:
 	spin_lock(&tri->write_lock);
 	tri->write_count++;
--- a/tests/funcs/exec.sh
+++ b/tests/funcs/exec.sh
@@ -80,3 +80,15 @@ t_compare_output()
 {
 	"$@" >&7 2>&1
 }
+
+#
+# usually bash prints an annoying output message when jobs
+# are killed.  We can avoid that by redirecting stderr for
+# the bash process when it reaps the jobs that are killed.
+#
+t_silent_kill() {
+	exec {ERR}>&2 2>/dev/null
+	kill "$@"
+	wait "$@"
+	exec 2>&$ERR {ERR}>&-
+}
--- a/tests/funcs/filter.sh
+++ b/tests/funcs/filter.sh
@@ -160,6 +160,9 @@ t_filter_dmesg()
 	re="$re|Pipe handler or fully qualified core dump path required.*"
 	re="$re|Set kernel.core_pattern before fs.suid_dumpable.*"

+	# perf warning that it adjusted sample rate
+	re="$re|perf: interrupt took too long.*lowering kernel.perf_event_max_sample_rate.*"
+
 	egrep -v "($re)" | \
 		ignore_harmless_unwind_kasan_stack_oob
 }
--- a/tests/run-tests.sh
+++ b/tests/run-tests.sh
@@ -532,12 +532,15 @@ for t in $tests; do
 	cmd rm -rf "$T_TMPDIR"
 	cmd mkdir -p "$T_TMPDIR"

-	# create a test name dir in the fs
+	# create a test name dir in the fs, clean up old data as needed
 	T_DS=""
 	for i in $(seq 0 $((T_NR_MOUNTS - 1))); do
 		dir="${T_M[$i]}/test/$test_name"

-		test $i == 0 && cmd mkdir -p "$dir"
+		test $i == 0 && (
+			test -d "$dir" && cmd rm -rf "$dir"
+			cmd mkdir -p "$dir"
+		)

 		eval T_D$i=$dir
 		T_D[$i]=$dir
--- a/tests/tests/enospc.sh
+++ b/tests/tests/enospc.sh
@@ -88,6 +88,11 @@ rm -rf "$SCR/xattrs"

 echo "== make sure we can create again"
 file="$SCR/file-after"
+C=120
+while (( C-- )); do
+	touch $file 2> /dev/null && break
+	sleep 1
+done
 touch $file
 setfattr -n user.scoutfs-enospc -v 1 "$file"
 sync
--- a/tests/tests/lock-recover-invalidate.sh
+++ b/tests/tests/lock-recover-invalidate.sh
@@ -38,6 +38,6 @@ while [ "$SECONDS" -lt "$END" ]; do
 done

 echo "== stopping background load"
-kill $load_pids
+t_silent_kill $load_pids

 t_pass
--- a/tests/tests/orphan-inodes.sh
+++ b/tests/tests/orphan-inodes.sh
@@ -5,18 +5,6 @@
 t_require_commands sleep touch sync stat handle_cat kill rm
 t_require_mounts 2

-#
-# usually bash prints an annoying output message when jobs
-# are killed.  We can avoid that by redirecting stderr for
-# the bash process when it reaps the jobs that are killed.
-#
-silent_kill() {
-	exec {ERR}>&2 2>/dev/null
-	kill "$@"
-	wait "$@"
-	exec 2>&$ERR {ERR}>&-
-}
-
 #
 # We don't have a great way to test that inode items still exist.   We
 # don't prevent opening handles with nlink 0 today, so we'll use that.
@@ -52,7 +40,7 @@ inode_exists $ino || echo "$ino didn't exist"

 echo "== orphan from failed evict deletion is picked up"
 # pending kill signal stops evict from getting locks and deleting
-silent_kill $pid
+t_silent_kill $pid
 t_set_sysfs_mount_option 0 orphan_scan_delay_ms 1000
 sleep 5
 inode_exists $ino && echo "$ino still exists"
@@ -70,7 +58,7 @@ for nr in $(t_fs_nrs); do
 	rm -f "$path"
 done
 sync
-silent_kill $pids
+t_silent_kill $pids
 for nr in $(t_fs_nrs); do
 	t_force_umount $nr
 done
@@ -82,7 +70,15 @@ done
 # wait for orphan scans to run
 t_set_all_sysfs_mount_options orphan_scan_delay_ms 1000
 # also have to wait for delayed log merge work from mount
-sleep 15
+C=120
+while (( C-- )); do
+	brk=1
+	for ino in $inos; do
+		inode_exists $ino && brk=0
+	done
+	test $brk -eq 1 && break
+	sleep 1
+done
 for ino in $inos; do
 	inode_exists $ino && echo "$ino still exists"
 done
@@ -131,7 +127,7 @@ while [ $SECONDS -lt $END ]; do
 	done

 	# trigger eviction deletion of each file in each mount
-	silent_kill $pids
+	t_silent_kill $pids

 	wait || t_fail "handle_fsetxattr failed"
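As a worked check of the extent_mod_blocks() change in kmod/src/alloc.c, this small C sketch evaluates the old and new formulas side by side. The `BLKNO_ITEMS`, `SIZE_ITEMS` and `BLOCKS_PER_LEVEL` names are hypothetical labels for the factors described in the new comment, not identifiers from scoutfs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the factors in the new extent_mod_blocks() comment. */
#define BLKNO_ITEMS		2	/* by-blkno: merge two neighbours, or split on insert */
#define SIZE_ITEMS		3	/* by-size: remove two merged sizes, add the merged one */
#define BLOCKS_PER_LEVEL	3	/* max(2 dirtied A' and B', 3 freed A, B and B') */

static_assert(BLKNO_ITEMS + SIZE_ITEMS == 5, "five items per extent change");

/* Previous formula: ((1 + height) * 2) * 3 */
static uint32_t old_extent_mod_blocks(uint32_t height)
{
	return ((1 + height) * 2) * 3;
}

/* New formula: (1 + height) allows for the tree growing a level. */
static uint32_t new_extent_mod_blocks(uint32_t height)
{
	return ((1 + height) * BLOCKS_PER_LEVEL) * (BLKNO_ITEMS + SIZE_ITEMS);
}
```

For a tree of height 3 the old estimate is 24 blocks and the new one is 60. Because both formulas are linear in 1 + height (6 versus 15 blocks per level), the new estimate is always 2.5 times the old one.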
Author	SHA1	Message	Date
Chris Kirby	2fcd56d0e2	Fix commit budget calculation with multiple holders The try_drain_data_freed() path was generating errors about overrunning its commit budget: scoutfs f.2b8928.r.02689f error: 1 holders exceeded alloc budget av: bef 8185 now 8036, fr: bef 8185 now 7602 The budget overrun check was using the current number of commit holders (in this case one) instead of the maximum number of concurrent holders (in this case two). So even well-behaved paths like try_drain_data_freed() can appear to exceed their commit budget if other holders dirty some blocks and apply their commits before the try_drain_data_freed() thread does its final budget reconciliation. Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-06-17 11:38:07 -05:00
Chris Kirby	e0d2aec2c0	Fix dirtied block calculation in extent_mod_blocks() Free extents are stored in two btrees: one sorted by block number, one by size. So if you insert a new extent between two existing extents, you can be modifying two items in the by-block-number tree. And depending on the size of those items, that can result in three items over in the by-size tree. So that's a 5x multiplier per level. If we're shrinking the tree and adding more freed blocks, we're conceptually dirtying two blocks at each level to merge (currently 2 in the code). But if they fall under the low water mark then one of them is freed, so we can have 3 per level in this case. Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-06-17 11:38:07 -05:00
Zach Brown	9741d40e10	Merge pull request #229 from versity/zab/v1.25 v1.25 Release	2025-06-04 11:21:25 -07:00
Zach Brown	48ac7bdf7c	v1.25 Release Finish the release notes for the 1.25 release. Signed-off-by: Zach Brown <zab@versity.com>	2025-06-03 13:35:42 -07:00
Zach Brown	7865ee9f54	Merge pull request #223 from versity/auke/el9_5_wmaybe-uninit Fix -Wmaybe-uninitalized since rhel9.5	2025-05-12 12:21:02 -07:00
Zach Brown	624eb128c6	Merge pull request #221 from versity/auke/enospc-test Give enospc test more time to commit unlink.	2025-05-09 11:27:04 -07:00
Zach Brown	091eb3b683	Merge pull request #219 from versity/auke/fix-tests-failing-dirty-test-dirs Fix test cases that don't run cleanly in a semi-dirty env.	2025-05-09 11:17:24 -07:00
Zach Brown	04e8cc6295	Merge pull request #220 from versity/auke/orphan-inodes Extend orphan-inodes timeout.	2025-05-09 11:15:13 -07:00
Zach Brown	0f6fdb3eb5	Merge pull request #222 from versity/auke/t_kill_silent Properly silently kill background tasks.	2025-05-09 11:11:24 -07:00
Auke Kok	2f48a606e8	Fix -Wmaybe-uninitalized since rhel9.5 Looks like the compiler isn't smart enough to understand the pass by pointer value, and we can initialize it here easily. make[1]: Entering directory '/usr/src/kernels/5.14.0-503.26.1.el9_5.x86_64' CC [M] /home/auke/scoutfs/kmod/src/server.o /home/auke/scoutfs/kmod/src/server.c: In function ‘fence_pending_recov_worker’: /home/auke/scoutfs/kmod/src/server.c:4170:23: error: ‘addr.v4.addr’ may be used uninitialized in this function [-Werror=maybe-uninitialized] 4170 | ret = scoutfs_fence_start(sb, rid, le32_to_be32(addr.v4.addr), | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 4171 | SCOUTFS_FENCE_CLIENT_RECOVERY); | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cc1: all warnings being treated as errors There's still the obvious issue here that we'd intended to support ipv6 but just disregard that here. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-05-08 15:20:50 -07:00
Auke Kok	377e49caf1	Properly silently kill background tasks. Occasionally, we have some tests fail because these kills produce: tests/lock-recover-invalidate.sh: line 42: 9928 Terminated Even though we expected them to be silent. In these particular cases we already don't care about this output. We borrow the silent_kill() function from orphan-inodes and promote it to t_silent_kill() in funcs/exec.sh, and then use it everywhere where appropriate. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-05-08 12:03:04 -07:00
Auke Kok	d08eb66adc	Give enospc test more time to commit unlink. The current test sequence performs the unlink and immediately tests whether enough resources are available to create new files again, and this consistently fails. One of my crummy VMs takes a good 12 seconds before the `touch` actually succeeds. We care about the filesystem eventually returning from ENOSPC, and certainly we don't want it to take forever, but there is a period after our first ENOSPC error and cleanup that we expect ENOSPC to fail for a bit longer. Make the timeout 120s. As soon as the `touch` completes, exit the wait loop. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-05-08 11:40:13 -07:00
Zach Brown	6f19d0bd36	Merge pull request #216 from versity/zab/stop_ending_dirty_data_freed Zab/stop ending dirty data freed	2025-05-08 11:18:23 -07:00
Auke Kok	1d0cde7cc3	Clean up old test data as needed. If run without `-m` (explicit mkfs) in subsequent testing, old test data files may break several tests. Most failures are -EEXIST, but there are some more subtle ones. This change erases any existing test dir as needed just before we run the tests, and avoids the issue entirely. I considered doing a `mv dir dir.$$ && rm -rf dir.$$ &` alternative solution but that likely will interfere disproportionally with tests that do disconnects and other things that can be impacted by an unlink storm. This has an obvious performance aspect - tests will be a little slower to start on subsequent runs. In CI, this will effectively be a no-op though. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-05-08 10:10:01 -07:00
Auke Kok	138c7c6b49	Extend orphan-inodes timeout. This test regularly fails in CI when the 15 seconds elapses and the system still hasn't concluded the mount log merges and orphan inode scans needed to unlink the test files. Instead of just extending the timeout value, we test-and-retry for 120s. This hopefully is faster in most cases. My smallest VM needs about 6s-8s on average. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-05-08 09:56:45 -07:00
Zach Brown	8aa1a98901	Merge pull request #210 from versity/auke/perf-irq-took-too-long Filter out perf `interrupt took too long` dmesg.	2025-04-30 10:04:00 -07:00
Zach Brown	888b1394a6	Retry client commit and get log trees separately The client transaction commit worker has a series of functions that it calls to commit the current transaction and open the next one. If any of them fail, it retries all of them from the beginning each time until they all succeed. This pattern behaves badly since we added the strict get_trans_seq and commit_trans_seq latching in the log_trees. The server will only commit the items for a get or commit request once, and will fail a commit request if it isn't given the seq that matches the current item. If the server gets an error it can have persisted items while sending an error to the client. If this error was for a get request, then the client will retry all of its transaction write functions. This includes the commit request which is now using a stale seq and will fail indefinitely. This is visible in the server log as: error -5 committing client logs for rid e57e37132c919c4f: invalid log trees item get_trans_seq The solution is to retry the commit and get phases independently. This way a failed get will be retried on its own without running through the commit phase that had succeeded. The client will eventually get the next seq that it can then safely commit. Signed-off-by: Zach Brown <zab@versity.com>	2025-04-29 11:46:38 -07:00
Zach Brown	e457694f19	Don't send dirty data_freed blocks to client At the end of get_log_trees we can try and drain the data_freed extent tree, which can take multiple commits. If a commit fails then the blocks are still dirty in memory. We can't send references to those blocks to the client. We have to return an error and not send the log_trees, like the main get_log_trees does. The client will retry and eventually get a log_trees that references blocks that were successfully committed. Signed-off-by: Zach Brown <zab@versity.com>	2025-04-29 11:46:38 -07:00
Auke Kok	1b47e9429e	Filter out perf `interrupt took too long` dmesg. Example: ``` [ 2469.638414] perf: interrupt took too long (2507 > 2500), lowering kernel.perf_event_max_sample_rate to 79000 ``` Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-04-14 12:06:58 -07:00
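The "Retry client commit and get log trees separately" commit (888b1394a6) describes a livelock: the server persists the next log_trees item during a get but replies with an error, so a client that retries commit and get together keeps presenting a stale seq to commit. The toy C model below sketches that interaction under loud assumptions: `struct server_model`, `struct client_model`, the injected `get_failures_left` counter and the bounded `max_tries` (standing in for retry_forever() looping until a forced unmount) are all hypothetical and do not reflect the real network protocol.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical server state: seq of the client's current log_trees item. */
struct server_model {
	unsigned long long seq;
	int get_failures_left;	/* injected: persist the next item but reply with an error */
};

/* Hypothetical client state: the seq it believes it is committing. */
struct client_model {
	unsigned long long seq;
};

/* Commit only succeeds if the client presents the current item's seq. */
static int model_commit(struct server_model *s, struct client_model *c)
{
	return c->seq == s->seq ? 0 : -EINVAL;
}

/* Get persists the next item, but may still send an error to the client. */
static int model_get(struct server_model *s, struct client_model *c)
{
	s->seq++;
	if (s->get_failures_left > 0) {
		s->get_failures_left--;
		return -EIO;
	}
	c->seq = s->seq;
	return 0;
}

typedef int (*phase_fn)(struct server_model *, struct client_model *);

/* Old pattern: any failure restarts the whole commit + get sequence. */
static int retry_together(struct server_model *s, struct client_model *c, int max_tries)
{
	assert(max_tries > 0);
	for (int i = 0; i < max_tries; i++)
		if (model_commit(s, c) == 0 && model_get(s, c) == 0)
			return 0;
	return -EIO;
}

/* New pattern: retry a single phase on its own until it succeeds. */
static int retry_phase(phase_fn fn, struct server_model *s, struct client_model *c, int max_tries)
{
	assert(max_tries > 0);
	for (int i = 0; i < max_tries; i++)
		if (fn(s, c) == 0)
			return 0;
	return -EIO;
}

/* Commit once, then retry only the get, as in the patched write func. */
static int retry_separately(struct server_model *s, struct client_model *c, int max_tries)
{
	int ret = retry_phase(model_commit, s, c, max_tries);

	return ret ? ret : retry_phase(model_get, s, c, max_tries);
}
```

With one injected get failure, retry_together() commits successfully on its first pass, but every later pass fails the commit because the client's seq is now behind the server's. retry_separately() leaves the successful commit alone and only retries the get, after which the client holds the server's current seq and can safely commit it next time.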