v1.7 Release

Finish the release notes for the 1.7 release.

Signed-off-by: Zach Brown <zab@versity.com>
Merge pull request #98 from versity/zab/move_freed_many_commits
2026-01-06 12:06:26 +00:00 · 2022-08-26 11:38:23 -07:00 · 2022-08-01 09:09:28 -07:00 · 2022-07-29 11:25:01 -07:00 · 2022-07-29 11:25:01 -07:00 · 2022-07-29 11:25:01 -07:00
35 changed files with 1442 additions and 499 deletions
--- a/ReleaseNotes.md
+++ b/ReleaseNotes.md
@@ -2,9 +2,125 @@ Versity ScoutFS Release Notes
 =============================

 ---
-v1.2-rc
+v1.7
 \
-*TBD*
+*Aug 26, 2022*
+
+* **Fixed possible persistent errors moving freed data extents**
+\
+  Fixed a case where the server could hit persistent errors trying to
+  move a client's freed extents in one commit.  The client had to free
+  a large number of extents that occupied distant positions in the
+  global free extent btree.  Very large fragmented files could cause
+  this.  The server now moves the freed extents in multiple commits and
+  can always ensure forward progress.
+
+* **Fixed possible persistent errors from freed duplicate extents**
+\
+  Background orphan deletion wasn't properly synchronizing with
+  foreground tasks deleting very large files.  If a deletion took long
+  enough then background deletion could also attempt to delete inode items
+  while the deletion was making progress.  This could create duplicate
+  deletions of data extent items which caused the server to abort when
+  it later discovered the duplicate extents as it merged free lists.
+
+---
+v1.6
+\
+*Jul 7, 2022*
+
+* **Fix memory leaks in rare corner cases**
+\
+  Analysis tools found a few corner cases that leaked small structures,
+  generally around error handling or startup and shutdown.
+
+* **Add --skip-likely-huge scoutfs print command option**
+\
+  Add an option to scoutfs print to reduce the size of the output
+  so that it can be used to see system-wide metadata without being
+  overwhelmed by file-level details.
+
+---
+v1.5
+\
+*Jun 21, 2022*
+
+* **Fix persistent error during server startup**
+\
+  Fixed a case where the server would always hit a consistent error on
+  startup, preventing the system from mounting.  This required a rare
+  but valid state across the clients.
+
+* **Fix a client hang that would lead to fencing**
+\
+  The client module's use of in-kernel networking was missing an annotation
+  that could lead to communication hanging.  The server would fence the
+  client when it stopped communicating.  This could be identified by the
+  server fencing a client after it disconnected with no attempt by the
+  client to reconnect.
+
+---
+v1.4
+\
+*May 6, 2022*
+
+* **Fix possible client crash during server failover**
+\
+  Fixed a narrow window during server failover and lock recovery that
+  could cause a client mount to believe that it had an inconsistent item
+  cache and panic.  This required very specific lock state and messaging
+  patterns between multiple mounts and multiple servers which made it
+  unlikely to occur in the field.
+
+---
+v1.3
+\
+*Apr 7, 2022*
+
+* **Fix rare server instability under heavy load**
+\
+  Fixed a case of server instability under heavy load due to concurrent
+  work fully exhausting metadata block allocation pools reserved for a
+  single server transaction.  This would cause a brief interruption as the
+  server shut down and the next server started up and made progress as
+  pending work was retried.
+
+* **Fix slow fencing preventing server startup**
+\
+  If a server had to process many fence requests with a slow fencing
+  mechanism it could be interrupted before it finished.  The server
+  now makes sure heartbeat messages are sent while it is making progress
+  on fencing requests so that other quorum members don't interrupt the
+  process.
+
+* **Performance improvement in getxattr and setxattr**
+\
+  Kernel allocation patterns in the getxattr and setxattr
+  implementations were causing significant contention between CPUs.  Their
+  allocation strategy was changed so that concurrent tasks can call these
+  xattr methods without degrading performance.
+
+---
+v1.2
+\
+*Mar 14, 2022*
+
+* **Fix deadlock between fallocate() and read() system calls**
+\
+  Fixed a lock inversion that could cause two tasks to deadlock if they
+  performed fallocate() and read() on a file at the same time.  The
+  deadlock was uninterruptible so the machine needed to be rebooted.  This
+  was relatively rare as fallocate() is usually used to prepare files
+  before they're used.
+
+* **Fix instability from heavy file deletion workloads**
+\
+  Fixed rare circumstances under which background file deletion cleanup
+  tasks could try to delete a file while it is being deleted by another
+  task.  Heavy load across multiple nodes, either many files being deleted
+  or large files being deleted, increased the chances of this happening.
+  Heavy staging could cause this problem because staging can create many
+  internal temporary files that need to be deleted.

 ---
 v1.1
--- a/kmod/src/alloc.c
+++ b/kmod/src/alloc.c
@@ -84,6 +84,21 @@ static u64 smallest_order_length(u64 len)
 	return 1ULL << (free_extent_order(len) * 3);
 }

+/*
+ * An extent modification dirties three distinct leaves of an allocator
+ * btree as it adds and removes the blkno and size sorted items for the
+ * old and new lengths of the extent.  Dirtying the paths to these
+ * leaves can grow the tree and grow/shrink neighbours at each level.
+ * We over-estimate the number of blocks allocated and freed (the paths
+ * share a root, growth doesn't free) to err on the simpler and safer
+ * side.  The overhead is minimal given the relatively large list blocks
+ * and relatively short allocator trees.
+ */
+static u32 extent_mod_blocks(u32 height)
+{
+	return ((1 + height) * 2) * 3;
+}
+
 /*
 * Free extents don't have flags and are stored in two indexes sorted by
 * block location and by length order, largest first.  The location key
@@ -877,6 +892,13 @@ static int find_zone_extent(struct super_block *sb, struct scoutfs_alloc_root *r
 * -ENOENT is returned if we run out of extents in the source tree
 * before moving the total.
 *
+ * If meta_budget is non-zero then -EINPROGRESS can be returned if the
+ * caller's budget is consumed in the allocator during this call
+ * (though not necessarily by us, we don't have per-thread tracking of
+ * allocator consumption :/).  The call can still have made progress and
+ * the caller is expected to commit the dirty trees and examine the
+ * resulting modified trees to see if they need to continue moving extents.
+ *
 * The caller can specify that extents in the source tree should first
 * be found based on their zone bitmaps.  We'll first try to find
 * extents in the exclusive zones, then vacant zones, and then we'll
@@ -891,7 +913,7 @@ int scoutfs_alloc_move(struct super_block *sb, struct scoutfs_alloc *alloc,
 		       struct scoutfs_block_writer *wri,
 		       struct scoutfs_alloc_root *dst,
 		       struct scoutfs_alloc_root *src, u64 total,
-		       __le64 *exclusive, __le64 *vacant, u64 zone_blocks)
+		       __le64 *exclusive, __le64 *vacant, u64 zone_blocks, u64 meta_budget)
 {
 	struct alloc_ext_args args = {
 		.alloc = alloc,
@@ -899,6 +921,8 @@ int scoutfs_alloc_move(struct super_block *sb, struct scoutfs_alloc *alloc,
 	};
 	struct scoutfs_extent found;
 	struct scoutfs_extent ext;
+	u32 avail_start = 0;
+	u32 freed_start = 0;
 	u64 moved = 0;
 	u64 count;
 	int ret = 0;
@@ -909,6 +933,9 @@ int scoutfs_alloc_move(struct super_block *sb, struct scoutfs_alloc *alloc,
 		vacant = NULL;
 	}

+	if (meta_budget != 0)
+		scoutfs_alloc_meta_remaining(alloc, &avail_start, &freed_start);
+
 	while (moved < total) {
 		count = total - moved;

@@ -941,6 +968,14 @@ int scoutfs_alloc_move(struct super_block *sb, struct scoutfs_alloc *alloc,
 		if (ret < 0)
 			break;

+		if (meta_budget != 0 &&
+		    scoutfs_alloc_meta_low_since(alloc, avail_start, freed_start, meta_budget,
+						 extent_mod_blocks(src->root.height) +
+						 extent_mod_blocks(dst->root.height))) {
+			ret = -EINPROGRESS;
+			break;
+		}
+
 		/* searching set start/len, finish initializing alloced extent */
 		ext.map = found.map ? ext.start - found.start + found.map : 0;
 		ext.flags = found.flags;
@@ -1065,15 +1100,6 @@ out:
 * than completely exhausting the avail list or overflowing the freed
 * list.
 *
- * An extent modification dirties three distinct leaves of an allocator
- * btree as it adds and removes the blkno and size sorted items for the
- * old and new lengths of the extent.  Dirtying the paths to these
- * leaves can grow the tree and grow/shrink neighbours at each level.
- * We over-estimate the number of blocks allocated and freed (the paths
- * share a root, growth doesn't free) to err on the simpler and safer
- * side.  The overhead is minimal given the relatively large list blocks
- * and relatively short allocator trees.
- *
 * The caller tells us how many extents they're about to modify and how
 * many other additional blocks they may cow manually.  And finally, the
 * caller could be the first to dirty the avail and freed blocks in the
@@ -1082,7 +1108,7 @@ out:
 static bool list_has_blocks(struct super_block *sb, struct scoutfs_alloc *alloc,
 			    struct scoutfs_alloc_root *root, u32 extents, u32 addl_blocks)
 {
-	u32 tree_blocks = (((1 + root->root.height) * 2) * 3) * extents;
+	u32 tree_blocks = extent_mod_blocks(root->root.height) * extents;
 	u32 most = 1 + tree_blocks + addl_blocks;

 	if (le32_to_cpu(alloc->avail.first_nr) < most) {
@@ -1318,6 +1344,38 @@ bool scoutfs_alloc_meta_low(struct super_block *sb,
 	return lo;
 }

+void scoutfs_alloc_meta_remaining(struct scoutfs_alloc *alloc, u32 *avail_total, u32 *freed_space)
+{
+	unsigned int seq;
+
+	do {
+		seq = read_seqbegin(&alloc->seqlock);
+		*avail_total = le32_to_cpu(alloc->avail.first_nr);
+		*freed_space = list_block_space(alloc->freed.first_nr);
+	} while (read_seqretry(&alloc->seqlock, seq));
+}
+
+/*
+ * Returns true if the caller's consumption of nr from either avail or
+ * freed would end up exceeding their budget relative to the starting
+ * remaining snapshot they took.
+ */
+bool scoutfs_alloc_meta_low_since(struct scoutfs_alloc *alloc, u32 avail_start, u32 freed_start,
+				  u32 budget, u32 nr)
+{
+	u32 avail_use;
+	u32 freed_use;
+	u32 avail;
+	u32 freed;
+
+	scoutfs_alloc_meta_remaining(alloc, &avail, &freed);
+
+	avail_use = avail_start - avail;
+	freed_use = freed_start - freed;
+
+	return ((avail_use + nr) > budget) || ((freed_use + nr) > budget);
+}
+
 bool scoutfs_alloc_test_flag(struct super_block *sb,
 			    struct scoutfs_alloc *alloc, u32 flag)
 {
--- a/kmod/src/alloc.h
+++ b/kmod/src/alloc.h
@@ -131,7 +131,7 @@ int scoutfs_alloc_move(struct super_block *sb, struct scoutfs_alloc *alloc,
 		       struct scoutfs_block_writer *wri,
 		       struct scoutfs_alloc_root *dst,
 		       struct scoutfs_alloc_root *src, u64 total,
-		       __le64 *exclusive, __le64 *vacant, u64 zone_blocks);
+		       __le64 *exclusive, __le64 *vacant, u64 zone_blocks, u64 meta_budget);
 int scoutfs_alloc_insert(struct super_block *sb, struct scoutfs_alloc *alloc,
 			 struct scoutfs_block_writer *wri, struct scoutfs_alloc_root *root,
 			 u64 start, u64 len);
@@ -158,6 +158,9 @@ int scoutfs_alloc_splice_list(struct super_block *sb,

 bool scoutfs_alloc_meta_low(struct super_block *sb,
 			    struct scoutfs_alloc *alloc, u32 nr);
+void scoutfs_alloc_meta_remaining(struct scoutfs_alloc *alloc, u32 *avail_total, u32 *freed_space);
+bool scoutfs_alloc_meta_low_since(struct scoutfs_alloc *alloc, u32 avail_start, u32 freed_start,
+				  u32 budget, u32 nr);
 bool scoutfs_alloc_test_flag(struct super_block *sb,
 			    struct scoutfs_alloc *alloc, u32 flag);

--- a/kmod/src/btree.c
+++ b/kmod/src/btree.c
@@ -2449,7 +2449,7 @@ int scoutfs_btree_free_blocks(struct super_block *sb,
 			      struct scoutfs_alloc *alloc,
 			      struct scoutfs_block_writer *wri,
 			      struct scoutfs_key *key,
-			      struct scoutfs_btree_root *root, int alloc_low)
+			      struct scoutfs_btree_root *root, int free_budget)
 {
 	u64 blknos[SCOUTFS_BTREE_MAX_HEIGHT];
 	struct scoutfs_block *bl = NULL;
@@ -2459,11 +2459,15 @@ int scoutfs_btree_free_blocks(struct super_block *sb,
 	struct scoutfs_avl_node *node;
 	struct scoutfs_avl_node *next;
 	struct scoutfs_key par_next;
+	int nr_freed = 0;
 	int nr_par;
 	int level;
 	int ret;
 	int i;

+	if (WARN_ON_ONCE(free_budget <= 0))
+		return -EINVAL;
+
 	if (WARN_ON_ONCE(root->height > ARRAY_SIZE(blknos)))
 		return -EIO; /* XXX corruption */

@@ -2538,8 +2542,7 @@ int scoutfs_btree_free_blocks(struct super_block *sb,
 		while (node) {

 			/* make sure we can always free parents after leaves */
-			if (scoutfs_alloc_meta_low(sb, alloc,
-						   alloc_low + nr_par + 1)) {
+			if ((nr_freed + 1 + nr_par) > free_budget) {
 				ret = 0;
 				goto out;
 			}
@@ -2553,6 +2556,7 @@ int scoutfs_btree_free_blocks(struct super_block *sb,
 						le64_to_cpu(ref.blkno));
 			if (ret < 0)
 				goto out;
+			nr_freed++;

 			node = scoutfs_avl_next(&bt->item_root, node);
 			if (node) {
@@ -2568,6 +2572,7 @@ int scoutfs_btree_free_blocks(struct super_block *sb,
 							       blknos[i]);
 			ret = scoutfs_free_meta(sb, alloc, wri, blknos[i]);
 			BUG_ON(ret); /* checked meta low, freed should fit */
+			nr_freed++;
 		}

 		/* restart walk past the subtree we just freed */
--- a/kmod/src/btree.h
+++ b/kmod/src/btree.h
@@ -125,7 +125,7 @@ int scoutfs_btree_free_blocks(struct super_block *sb,
 			      struct scoutfs_alloc *alloc,
 			      struct scoutfs_block_writer *wri,
 			      struct scoutfs_key *key,
-			      struct scoutfs_btree_root *root, int alloc_low);
+			      struct scoutfs_btree_root *root, int free_budget);

 void scoutfs_btree_put_iref(struct scoutfs_btree_item_ref *iref);

--- a/kmod/src/counters.h
+++ b/kmod/src/counters.h
@@ -157,6 +157,7 @@
 	EXPAND_COUNTER(orphan_scan_error)			\
 	EXPAND_COUNTER(orphan_scan_item)			\
 	EXPAND_COUNTER(orphan_scan_omap_set)			\
+	EXPAND_COUNTER(quorum_candidate_server_stopping)	\
 	EXPAND_COUNTER(quorum_elected)				\
 	EXPAND_COUNTER(quorum_fence_error)			\
 	EXPAND_COUNTER(quorum_fence_leader)			\
--- a/kmod/src/inode.c
+++ b/kmod/src/inode.c
@@ -1685,6 +1685,7 @@ static int try_delete_inode_items(struct super_block *sb, u64 ino)
 	struct scoutfs_lock *lock = NULL;
 	struct scoutfs_inode sinode;
 	struct scoutfs_key key;
+	bool clear_trying = false;
 	u64 group_nr;
 	int bit_nr;
 	int ret;
@@ -1704,6 +1705,7 @@ static int try_delete_inode_items(struct super_block *sb, u64 ino)
 		ret = 0;
 		goto out;
 	}
+	clear_trying = true;

 	/* can't delete if it's cached in local or remote mounts */
 	if (scoutfs_omap_test(sb, ino) || test_bit_le(bit_nr, ldata->map.bits)) {
@@ -1730,7 +1732,7 @@ static int try_delete_inode_items(struct super_block *sb, u64 ino)

 	ret = delete_inode_items(sb, ino, &sinode, lock, orph_lock);
 out:
-	if (ldata)
+	if (clear_trying)
 		clear_bit(bit_nr, ldata->trying);

 	scoutfs_unlock(sb, lock, SCOUTFS_LOCK_WRITE);
--- a/kmod/src/lock.c
+++ b/kmod/src/lock.c
@@ -289,6 +289,7 @@ static struct scoutfs_lock *lock_alloc(struct super_block *sb,
 	lock->sb = sb;
 	init_waitqueue_head(&lock->waitq);
 	lock->mode = SCOUTFS_LOCK_NULL;
+	lock->invalidating_mode = SCOUTFS_LOCK_NULL;

 	atomic64_set(&lock->forest_bloom_nr, 0);

@@ -666,7 +667,9 @@ struct inv_req {
 *
 * Before we start invalidating the lock we set the lock to the new
 * mode, preventing further incompatible users of the old mode from
- * using the lock while we're invalidating.
+ * using the lock while we're invalidating.  We record the previously
+ * granted mode so that we can send lock recover responses with the old
+ * granted mode during invalidation.
 */
 static void lock_invalidate_worker(struct work_struct *work)
 {
@@ -691,7 +694,8 @@ static void lock_invalidate_worker(struct work_struct *work)
 		if (!lock_counts_match(nl->new_mode, lock->users))
 			continue;

-		/* set the new mode, no incompatible users during inval */
+		/* set the new mode, no incompatible users during inval, recov needs old */
+		lock->invalidating_mode = lock->mode;
 		lock->mode = nl->new_mode;

 		/* move everyone that's ready to our private list */
@@ -734,6 +738,8 @@ static void lock_invalidate_worker(struct work_struct *work)
 		list_del(&ireq->head);
 		kfree(ireq);

+		lock->invalidating_mode = SCOUTFS_LOCK_NULL;
+
 		if (list_empty(&lock->inv_list)) {
 			/* finish if another request didn't arrive */
 			list_del_init(&lock->inv_head);
@@ -824,6 +830,7 @@ int scoutfs_lock_recover_request(struct super_block *sb, u64 net_id,
 {
 	DECLARE_LOCK_INFO(sb, linfo);
 	struct scoutfs_net_lock_recover *nlr;
+	enum scoutfs_lock_mode mode;
 	struct scoutfs_lock *lock;
 	struct scoutfs_lock *next;
 	struct rb_node *node;
@@ -844,10 +851,15 @@ int scoutfs_lock_recover_request(struct super_block *sb, u64 net_id,

 	for (i = 0; lock && i < SCOUTFS_NET_LOCK_MAX_RECOVER_NR; i++) {

+		if (lock->invalidating_mode != SCOUTFS_LOCK_NULL)
+			mode = lock->invalidating_mode;
+		else
+			mode = lock->mode;
+
 		nlr->locks[i].key = lock->start;
 		nlr->locks[i].write_seq = cpu_to_le64(lock->write_seq);
-		nlr->locks[i].old_mode = lock->mode;
-		nlr->locks[i].new_mode = lock->mode;
+		nlr->locks[i].old_mode = mode;
+		nlr->locks[i].new_mode = mode;

 		node = rb_next(&lock->node);
 		if (node)
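
The recovery change above reports the previously granted mode while an invalidation is in flight. The sketch below isolates that selection. The enum values, `struct lock_sketch` and the three helpers are hypothetical names for illustration; only the rule "use invalidating_mode when it is not the null mode, else use mode" comes from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical modes; the null mode doubles as "not invalidating". */
enum lock_mode { LOCK_NULL, LOCK_READ, LOCK_WRITE };

struct lock_sketch {
	enum lock_mode mode;              /* mode users see right now */
	enum lock_mode invalidating_mode; /* old granted mode during inval */
};

/* Record the old granted mode, then switch to the new one. */
void begin_invalidate(struct lock_sketch *l, enum lock_mode new_mode)
{
	assert(l != NULL);
	l->invalidating_mode = l->mode;
	l->mode = new_mode;
}

/* Invalidation finished: recovery should report the current mode again. */
void end_invalidate(struct lock_sketch *l)
{
	assert(l != NULL);
	l->invalidating_mode = LOCK_NULL;
}

/* Mode that a lock recover response would carry for this lock. */
enum lock_mode recover_mode(const struct lock_sketch *l)
{
	if (l->invalidating_mode != LOCK_NULL)
		return l->invalidating_mode;
	return l->mode;
}
```

In this model a lock that is being invalidated from write to read keeps reporting write until `end_invalidate()` runs, so a recovering server would see the mode it last granted rather than the mode the client is moving to.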
--- a/kmod/src/lock.h
+++ b/kmod/src/lock.h
@@ -39,6 +39,7 @@ struct scoutfs_lock {
 	struct list_head cov_list;

 	enum scoutfs_lock_mode mode;
+	enum scoutfs_lock_mode invalidating_mode;
 	unsigned int waiters[SCOUTFS_LOCK_NR_MODES];
 	unsigned int users[SCOUTFS_LOCK_NR_MODES];

--- a/kmod/src/lock_server.c
+++ b/kmod/src/lock_server.c
@@ -749,7 +749,7 @@ out:
 	if (ret < 0) {
 		scoutfs_err(sb, "lock server err %d during client rid %016llx farewell, shutting down",
 			    ret, rid);
-		scoutfs_server_abort(sb);
+		scoutfs_server_stop(sb);
 	}

 	return ret;
--- a/kmod/src/net.c
+++ b/kmod/src/net.c
@@ -355,6 +355,7 @@ static int submit_send(struct super_block *sb,
 		}
 		if (rid != 0) {
 			spin_unlock(&conn->lock);
+			kfree(msend);
 			return -ENOTCONN;
 		}
 	}
@@ -991,6 +992,8 @@ static void scoutfs_net_listen_worker(struct work_struct *work)
 		if (ret < 0)
 			break;

+		acc_sock->sk->sk_allocation = GFP_NOFS;
+
 		/* inherit accepted request funcs from listening conn */
 		acc_conn = scoutfs_net_alloc_conn(sb, conn->notify_up,
 						  conn->notify_down,
@@ -1053,6 +1056,8 @@ static void scoutfs_net_connect_worker(struct work_struct *work)
 	if (ret)
 		goto out;

+	sock->sk->sk_allocation = GFP_NOFS;
+
 	/* caller specified connect timeout */
 	tv.tv_sec = conn->connect_timeout_ms / MSEC_PER_SEC;
 	tv.tv_usec = (conn->connect_timeout_ms % MSEC_PER_SEC) * USEC_PER_MSEC;
@@ -1292,7 +1297,7 @@ restart:
 				if (ret) {
 					scoutfs_err(sb, "client fence returned err %d, shutting down server",
 						    ret);
-					scoutfs_server_abort(sb);
+					scoutfs_server_stop(sb);
 				}
 			}
 			destroy_conn(acc);
@@ -1341,10 +1346,12 @@ scoutfs_net_alloc_conn(struct super_block *sb,
 	if (!conn)
 		return NULL;

-	conn->info = kzalloc(info_size, GFP_NOFS);
-	if (!conn->info) {
-		kfree(conn);
-		return NULL;
+	if (info_size) {
+		conn->info = kzalloc(info_size, GFP_NOFS);
+		if (!conn->info) {
+			kfree(conn);
+			return NULL;
+		}
 	}

 	conn->workq = alloc_workqueue("scoutfs_net_%s",
@@ -1450,6 +1457,8 @@ int scoutfs_net_bind(struct super_block *sb,
 	if (ret)
 		goto out;

+	sock->sk->sk_allocation = GFP_NOFS;
+
 	optval = 1;
 	ret = kernel_setsockopt(sock, SOL_SOCKET, SO_REUSEADDR,
 				(char *)&optval, sizeof(optval));
--- a/kmod/src/omap.c
+++ b/kmod/src/omap.c
@@ -157,6 +157,15 @@ static int free_rid(struct omap_rid_list *list, struct omap_rid_entry *entry)
 	return nr;
 }

+static void free_rid_list(struct omap_rid_list *list)
+{
+	struct omap_rid_entry *entry;
+	struct omap_rid_entry *tmp;
+
+	list_for_each_entry_safe(entry, tmp, &list->head, head)
+		free_rid(list, entry);
+}
+
 static int copy_rids(struct omap_rid_list *to, struct omap_rid_list *from, spinlock_t *from_lock)
 {
 	struct omap_rid_entry *entry;
@@ -804,6 +813,10 @@ void scoutfs_omap_server_shutdown(struct super_block *sb)
 	llist_for_each_entry_safe(req, tmp, requests, llnode)
 		kfree(req);

+	spin_lock(&ominf->lock);
+	free_rid_list(&ominf->rids);
+	spin_unlock(&ominf->lock);
+
 	synchronize_rcu();
 }

@@ -864,6 +877,10 @@ void scoutfs_omap_destroy(struct super_block *sb)
 		rhashtable_walk_stop(&iter);
 		rhashtable_walk_exit(&iter);

+		spin_lock(&ominf->lock);
+		free_rid_list(&ominf->rids);
+		spin_unlock(&ominf->lock);
+
 		rhashtable_destroy(&ominf->group_ht);
 		rhashtable_destroy(&ominf->req_ht);
 		kfree(ominf);
--- a/kmod/src/quorum.c
+++ b/kmod/src/quorum.c
@@ -105,6 +105,8 @@ enum quorum_role { FOLLOWER, CANDIDATE, LEADER };
 struct quorum_status {
 	enum quorum_role role;
 	u64 term;
+	u64 server_start_term;
+	int server_event;
 	int vote_for;
 	unsigned long vote_bits;
 	ktime_t timeout;
@@ -117,7 +119,6 @@ struct quorum_info {
 	bool shutdown;

 	int our_quorum_slot_nr;
-	unsigned long flags;
 	int votes_needed;

 	spinlock_t show_lock;
@@ -128,8 +129,6 @@ struct quorum_info {
 	struct scoutfs_sysfs_attrs ssa;
 };

-#define QINF_FLAG_SERVER 0
-
 #define DECLARE_QUORUM_INFO(sb, name) \
 	struct quorum_info *name = SCOUTFS_SB(sb)->quorum_info
 #define DECLARE_QUORUM_INFO_KOBJ(kobj, name) \
@@ -494,16 +493,6 @@ static int update_quorum_block(struct super_block *sb, int event, u64 term, bool
 	return ret;
 }

-/*
- * The calling server has fenced previous leaders and reclaimed their
- * resources.  We can now update our fence event with a greater term to
- * stop future leaders from doing the same.
- */
-int scoutfs_quorum_fence_complete(struct super_block *sb, u64 term)
-{
-	return update_quorum_block(sb, SCOUTFS_QUORUM_EVENT_FENCE, term, true);
-}
-
 /*
 * The calling server has been elected and has started running but can't
 * yet assume that it has exclusive access to the metadata device.  We
@@ -593,15 +582,9 @@ int scoutfs_quorum_fence_leaders(struct super_block *sb, u64 term)
 	}

 out:
-	if (fence_started) {
-		err = scoutfs_fence_wait_fenced(sb, msecs_to_jiffies(SCOUTFS_QUORUM_FENCE_TO_MS));
-		if (ret == 0)
-			ret = err;
-	} else {
-		err = scoutfs_quorum_fence_complete(sb, term);
-		if (ret == 0)
-			ret = err;
-	}
+	err = scoutfs_fence_wait_fenced(sb, msecs_to_jiffies(SCOUTFS_QUORUM_FENCE_TO_MS));
+	if (ret == 0)
+		ret = err;

 	if (ret < 0)
 		scoutfs_inc_counter(sb, quorum_fence_error);
@@ -609,12 +592,26 @@ out:
 	return ret;
 }

+/*
+ * The main quorum task maintains its private status.  It seemed cleaner
+ * to occasionally copy the status for showing in sysfs/debugfs files
+ * than to have the two lock access to shared status.  The show copy is
+ * updated after being modified before the quorum task sleeps for a
+ * significant amount of time, either waiting on timeouts or interacting
+ * with the server.
+ */
+static void update_show_status(struct quorum_info *qinf, struct quorum_status *qst)
+{
+	spin_lock(&qinf->show_lock);
+	qinf->show_status = *qst;
+	spin_unlock(&qinf->show_lock);
+}
+
 /*
 * The quorum work always runs in the background of quorum member
 * mounts.  It's responsible for starting and stopping the server if
- * it's elected leader, and the server can call back into it to let it
- * know that it has shut itself down (perhaps due to error) so that the
- * work should stop sending heartbeats.
+ * it's elected leader.  While it's leader it sends heartbeats to
+ * suppress other quorum work from standing for election.
 */
 static void scoutfs_quorum_worker(struct work_struct *work)
 {
@@ -622,7 +619,7 @@ static void scoutfs_quorum_worker(struct work_struct *work)
 	struct super_block *sb = qinf->sb;
 	struct sockaddr_in unused;
 	struct quorum_host_msg msg;
-	struct quorum_status qst;
+	struct quorum_status qst = {0,};
 	int ret;
 	int err;

@@ -631,9 +628,7 @@ static void scoutfs_quorum_worker(struct work_struct *work)

 	/* start out as a follower */
 	qst.role = FOLLOWER;
-	qst.term = 0;
 	qst.vote_for = -1;
-	qst.vote_bits = 0;

 	/* read our starting term from greatest in all events in all slots */
 	read_greatest_term(sb, &qst.term);
@@ -651,6 +646,8 @@ static void scoutfs_quorum_worker(struct work_struct *work)

 	while (!(qinf->shutdown || scoutfs_forcing_unmount(sb))) {

+		update_show_status(qinf, &qst);
+
 		ret = recv_msg(sb, &msg, qst.timeout);
 		if (ret < 0) {
 			if (ret != -ETIMEDOUT && ret != -EAGAIN) {
@@ -667,24 +664,6 @@ static void scoutfs_quorum_worker(struct work_struct *work)
 		    msg.term < qst.term)
 			msg.type = SCOUTFS_QUORUM_MSG_INVALID;

-		/* if the server has shutdown we become follower */
-		if (!test_bit(QINF_FLAG_SERVER, &qinf->flags) &&
-		    qst.role == LEADER) {
-			qst.role = FOLLOWER;
-			qst.vote_for = -1;
-			qst.vote_bits = 0;
-			qst.timeout = election_timeout();
-			scoutfs_inc_counter(sb, quorum_server_shutdown);
-
-			send_msg_others(sb, SCOUTFS_QUORUM_MSG_RESIGNATION,
-					qst.term);
-			scoutfs_inc_counter(sb, quorum_send_resignation);
-		}
-
-		spin_lock(&qinf->show_lock);
-		qinf->show_status = qst;
-		spin_unlock(&qinf->show_lock);
-
 		trace_scoutfs_quorum_loop(sb, qst.role, qst.term, qst.vote_for,
 					  qst.vote_bits,
 					  ktime_to_timespec64(qst.timeout));
@@ -695,7 +674,6 @@ static void scoutfs_quorum_worker(struct work_struct *work)
 			if (qst.role == LEADER) {
 				scoutfs_warn(sb, "saw msg type %u from %u for term %llu while leader in term %llu, shutting down server.",
 					     msg.type, msg.from, msg.term, qst.term);
-				scoutfs_server_stop(sb);
 			}
 			qst.role = FOLLOWER;
 			qst.term = msg.term;
@@ -717,6 +695,13 @@ static void scoutfs_quorum_worker(struct work_struct *work)
 		/* followers and candidates start new election on timeout */
 		if (qst.role != LEADER &&
 		    ktime_after(ktime_get(), qst.timeout)) {
+			/* .. but only if their server has stopped */
+			if (!scoutfs_server_is_down(sb)) {
+				qst.timeout = election_timeout();
+				scoutfs_inc_counter(sb, quorum_candidate_server_stopping);
+				continue;
+			}
+
 			qst.role = CANDIDATE;
 			qst.term++;
 			qst.vote_for = -1;
@@ -758,29 +743,69 @@ static void scoutfs_quorum_worker(struct work_struct *work)
 					qst.term);
 			qst.timeout = heartbeat_interval();

+			update_show_status(qinf, &qst);
+
 			/* record that we've been elected before starting up server */
 			ret = update_quorum_block(sb, SCOUTFS_QUORUM_EVENT_ELECT, qst.term, true);
 			if (ret < 0)
 				goto out;

-			/* make very sure server is fully shut down */
-			scoutfs_server_stop(sb);
-			/* set server bit before server shutdown could clear */
-			set_bit(QINF_FLAG_SERVER, &qinf->flags);
+			qst.server_start_term = qst.term;
+			qst.server_event = SCOUTFS_QUORUM_EVENT_ELECT;
+			scoutfs_server_start(sb, qst.term);
+		}

-			ret = scoutfs_server_start(sb, qst.term);
-			if (ret < 0) {
-				clear_bit(QINF_FLAG_SERVER, &qinf->flags);
-				/* store our increased term */
-				err = update_quorum_block(sb, SCOUTFS_QUORUM_EVENT_STOP, qst.term,
-							  true);
-				if (err < 0) {
-					ret = err;
-					goto out;
-				}
-				ret = 0;
-				continue;
+		/*
+		 * This leader's server is up, having finished fencing
+		 * previous leaders.  We update the fence event with the
+		 * current term to let future leaders know that previous
+		 * servers have been fenced.
+		 */
+		if (qst.role == LEADER && qst.server_event != SCOUTFS_QUORUM_EVENT_FENCE &&
+		    scoutfs_server_is_up(sb)) {
+			ret = update_quorum_block(sb, SCOUTFS_QUORUM_EVENT_FENCE, qst.term, true);
+			if (ret < 0)
+				goto out;
+			qst.server_event = SCOUTFS_QUORUM_EVENT_FENCE;
+		}
+
+		/*
+		 * Stop a running server if we're no longer leader in
+		 * its term.
+		 */
+		if (!(qst.role == LEADER && qst.term == qst.server_start_term) &&
+		    scoutfs_server_is_running(sb)) {
+			scoutfs_server_stop(sb);
+		}
+
+		/*
+		 * A previously running server has stopped.  The quorum
+		 * protocol might have shut it down by changing roles or
+		 * it might have stopped on its own, perhaps on errors.
+		 * If we're still a leader then we become a follower and
+		 * send resignations to encourage the next election.
+		 * Always update the _STOP event to stop connections and
+		 * fencing.
+		 */
+		if (qst.server_start_term > 0 && scoutfs_server_is_down(sb)) {
+			if (qst.role == LEADER) {
+				qst.role = FOLLOWER;
+				qst.vote_for = -1;
+				qst.vote_bits = 0;
+				qst.timeout = election_timeout();
+				scoutfs_inc_counter(sb, quorum_server_shutdown);
+
+				send_msg_others(sb, SCOUTFS_QUORUM_MSG_RESIGNATION,
+						qst.server_start_term);
+				scoutfs_inc_counter(sb, quorum_send_resignation);
 			}
+
+			ret = update_quorum_block(sb, SCOUTFS_QUORUM_EVENT_STOP,
+						  qst.server_start_term, true);
+			if (ret < 0)
+				goto out;
+
+			qst.server_start_term = 0;
 		}

 		/* leaders regularly send heartbeats to delay elections */
@@ -817,12 +842,19 @@ static void scoutfs_quorum_worker(struct work_struct *work)
 		}
 	}

+	update_show_status(qinf, &qst);
+
 	/* always try to stop a running server as we stop */
-	if (test_bit(QINF_FLAG_SERVER, &qinf->flags)) {
-		scoutfs_server_stop(sb);
-		scoutfs_fence_stop(sb);
-		send_msg_others(sb, SCOUTFS_QUORUM_MSG_RESIGNATION,
-				qst.term);
+	if (scoutfs_server_is_running(sb)) {
+		scoutfs_server_stop_wait(sb);
+		send_msg_others(sb, SCOUTFS_QUORUM_MSG_RESIGNATION, qst.term);
+
+		if (qst.server_start_term > 0) {
+			err = update_quorum_block(sb, SCOUTFS_QUORUM_EVENT_STOP,
+						  qst.server_start_term, true);
+			if (err < 0 && ret == 0)
+				ret = err;
+		}
 	}

 	/* record that this slot no longer has an active quorum */
@@ -834,21 +866,6 @@ out:
 	}
 }

-/*
- * The calling server has shutdown and is no longer using shared
- * resources.  Clear the bit so that we stop sending heartbeats and
- * allow the next server to be elected.  Update the stop event so that
- * it won't be considered available by clients or fenced by the next
- * leader.
- */
-void scoutfs_quorum_server_shutdown(struct super_block *sb, u64 term)
-{
-	DECLARE_QUORUM_INFO(sb, qinf);
-
-	clear_bit(QINF_FLAG_SERVER, &qinf->flags);
-	update_quorum_block(sb, SCOUTFS_QUORUM_EVENT_STOP, term, true);
-}
-
 /*
 * Clients read quorum blocks looking for the leader with a server whose
 * address it can try and connect to.
@@ -970,6 +987,8 @@ static ssize_t status_show(struct kobject *kobj, struct kobj_attribute *attr,
 		     qinf->our_quorum_slot_nr);
 	snprintf_ret(buf, size, &ret, "term %llu\n",
 		     qst.term);
+	snprintf_ret(buf, size, &ret, "server_start_term %llu\n", qst.server_start_term);
+	snprintf_ret(buf, size, &ret, "server_event %d\n", qst.server_event);
 	snprintf_ret(buf, size, &ret, "role %d (%s)\n",
 		     qst.role, role_str(qst.role));
 	snprintf_ret(buf, size, &ret, "vote_for %d\n",
--- a/kmod/src/quorum.h
+++ b/kmod/src/quorum.h
@@ -2,14 +2,12 @@
 #define _SCOUTFS_QUORUM_H_

 int scoutfs_quorum_server_sin(struct super_block *sb, struct sockaddr_in *sin);
-void scoutfs_quorum_server_shutdown(struct super_block *sb, u64 term);

 u8 scoutfs_quorum_votes_needed(struct super_block *sb);
 void scoutfs_quorum_slot_sin(struct scoutfs_super_block *super, int i,
 			     struct sockaddr_in *sin);

 int scoutfs_quorum_fence_leaders(struct super_block *sb, u64 term);
-int scoutfs_quorum_fence_complete(struct super_block *sb, u64 term);

 int scoutfs_quorum_setup(struct super_block *sb);
 void scoutfs_quorum_shutdown(struct super_block *sb);
--- a/kmod/src/scoutfs_trace.h
+++ b/kmod/src/scoutfs_trace.h
@@ -1843,6 +1843,53 @@ DEFINE_EVENT(scoutfs_server_client_count_class, scoutfs_server_client_down,
 	TP_ARGS(sb, rid, nr_clients)
 );

+DECLARE_EVENT_CLASS(scoutfs_server_commit_users_class,
+        TP_PROTO(struct super_block *sb, int holding, int applying, int nr_holders,
+		 u32 avail_before, u32 freed_before, int exceeded),
+        TP_ARGS(sb, holding, applying, nr_holders, avail_before, freed_before, exceeded),
+        TP_STRUCT__entry(
+		SCSB_TRACE_FIELDS
+		__field(int, holding)
+		__field(int, applying)
+		__field(int, nr_holders)
+		__field(__u32, avail_before)
+		__field(__u32, freed_before)
+		__field(int, exceeded)
+        ),
+        TP_fast_assign(
+		SCSB_TRACE_ASSIGN(sb);
+		__entry->holding = !!holding;
+		__entry->applying = !!applying;
+		__entry->nr_holders = nr_holders;
+		__entry->avail_before = avail_before;
+		__entry->freed_before = freed_before;
+		__entry->exceeded = !!exceeded;
+        ),
+	TP_printk(SCSBF" holding %u applying %u nr %u avail_before %u freed_before %u exceeded %u",
+		  SCSB_TRACE_ARGS, __entry->holding, __entry->applying, __entry->nr_holders,
+		  __entry->avail_before, __entry->freed_before, __entry->exceeded)
+);
+DEFINE_EVENT(scoutfs_server_commit_users_class, scoutfs_server_commit_hold,
+        TP_PROTO(struct super_block *sb, int holding, int applying, int nr_holders,
+		 u32 avail_before, u32 freed_before, int exceeded),
+        TP_ARGS(sb, holding, applying, nr_holders, avail_before, freed_before, exceeded)
+);
+DEFINE_EVENT(scoutfs_server_commit_users_class, scoutfs_server_commit_apply,
+        TP_PROTO(struct super_block *sb, int holding, int applying, int nr_holders,
+		 u32 avail_before, u32 freed_before, int exceeded),
+        TP_ARGS(sb, holding, applying, nr_holders, avail_before, freed_before, exceeded)
+);
+DEFINE_EVENT(scoutfs_server_commit_users_class, scoutfs_server_commit_start,
+        TP_PROTO(struct super_block *sb, int holding, int applying, int nr_holders,
+		 u32 avail_before, u32 freed_before, int exceeded),
+        TP_ARGS(sb, holding, applying, nr_holders, avail_before, freed_before, exceeded)
+);
+DEFINE_EVENT(scoutfs_server_commit_users_class, scoutfs_server_commit_end,
+        TP_PROTO(struct super_block *sb, int holding, int applying, int nr_holders,
+		 u32 avail_before, u32 freed_before, int exceeded),
+        TP_ARGS(sb, holding, applying, nr_holders, avail_before, freed_before, exceeded)
+);
+
 #define slt_symbolic(mode)						\
 	__print_symbolic(mode,					\
 		{ SLT_CLIENT,		"client" },	\
--- a/kmod/src/server.c
+++ b/kmod/src/server.c
--- a/kmod/src/server.h
+++ b/kmod/src/server.h
@@ -64,8 +64,6 @@ int scoutfs_server_lock_response(struct super_block *sb, u64 rid, u64 id,
 				 struct scoutfs_net_lock *nl);
 int scoutfs_server_lock_recover_request(struct super_block *sb, u64 rid,
 					struct scoutfs_key *key);
-void scoutfs_server_hold_commit(struct super_block *sb);
-int scoutfs_server_apply_commit(struct super_block *sb, int err);
 void scoutfs_server_recov_finish(struct super_block *sb, u64 rid, int which);

 int scoutfs_server_send_omap_request(struct super_block *sb, u64 rid,
@@ -77,9 +75,12 @@ u64 scoutfs_server_seq(struct super_block *sb);
 u64 scoutfs_server_next_seq(struct super_block *sb);
 void scoutfs_server_set_seq_if_greater(struct super_block *sb, u64 seq);

-int scoutfs_server_start(struct super_block *sb, u64 term);
-void scoutfs_server_abort(struct super_block *sb);
+void scoutfs_server_start(struct super_block *sb, u64 term);
 void scoutfs_server_stop(struct super_block *sb);
+void scoutfs_server_stop_wait(struct super_block *sb);
+bool scoutfs_server_is_running(struct super_block *sb);
+bool scoutfs_server_is_up(struct super_block *sb);
+bool scoutfs_server_is_down(struct super_block *sb);

 int scoutfs_server_setup(struct super_block *sb);
 void scoutfs_server_destroy(struct super_block *sb);
--- a/kmod/src/super.c
+++ b/kmod/src/super.c
@@ -496,7 +496,7 @@ static int scoutfs_fill_super(struct super_block *sb, void *data, int silent)

 	ret = assign_random_id(sbi);
 	if (ret < 0)
-		return ret;
+		goto out;

 	spin_lock_init(&sbi->next_ino_lock);
 	spin_lock_init(&sbi->data_wait_root.lock);
@@ -505,7 +505,7 @@ static int scoutfs_fill_super(struct super_block *sb, void *data, int silent)
 	/* parse options early for use during setup */
 	ret = scoutfs_options_early_setup(sb, data);
 	if (ret < 0)
-		return ret;
+		goto out;
 	scoutfs_options_read(sb, &opts);

 	ret = sb_set_blocksize(sb, SCOUTFS_BLOCK_SM_SIZE);
--- a/kmod/src/sysfs.c
+++ b/kmod/src/sysfs.c
@@ -37,6 +37,15 @@ struct attr_funcs {
 #define ATTR_FUNCS_RO(_name) \
 	static struct attr_funcs _name##_attr_funcs = __ATTR_RO(_name)

+static ssize_t data_device_maj_min_show(struct kobject *kobj, struct attribute *attr, char *buf)
+{
+	struct super_block *sb = KOBJ_TO_SB(kobj, sb_id_kobj);
+
+	return snprintf(buf, PAGE_SIZE, "%u:%u\n",
+			MAJOR(sb->s_bdev->bd_dev), MINOR(sb->s_bdev->bd_dev));
+}
+ATTR_FUNCS_RO(data_device_maj_min);
+
 static ssize_t format_version_show(struct kobject *kobj, struct attribute *attr,
 			 char *buf)
 {
@@ -101,6 +110,7 @@ static ssize_t attr_funcs_show(struct kobject *kobj, struct attribute *attr,


 static struct attribute *sb_id_attrs[] = {
+	&data_device_maj_min_attr_funcs.attr,
 	&format_version_attr_funcs.attr,
 	&fsid_attr_funcs.attr,
 	&rid_attr_funcs.attr,
--- a/kmod/src/xattr.c
+++ b/kmod/src/xattr.c
@@ -57,12 +57,6 @@ static u32 xattr_names_equal(const char *a_name, unsigned int a_len,
 	return a_len == b_len && memcmp(a_name, b_name, a_len) == 0;
 }

-static unsigned int xattr_full_bytes(struct scoutfs_xattr *xat)
-{
-	return offsetof(struct scoutfs_xattr,
-		        name[xat->name_len + le16_to_cpu(xat->val_len)]);
-}
-
 static unsigned int xattr_nr_parts(struct scoutfs_xattr *xat)
 {
 	return SCOUTFS_XATTR_NR_PARTS(xat->name_len,
@@ -137,12 +131,29 @@ int scoutfs_xattr_parse_tags(const char *name, unsigned int name_len,
 }

 /*
- * Find the next xattr and copy the key, xattr header, and as much of
- * the name and value into the callers buffer as we can.  Returns the
- * number of bytes copied which include the header, name, and value and
- * can be limited by the xattr length or the callers buffer.  The caller
- * is responsible for comparing their lengths, the header, and the
- * returned length before safely using the xattr.
+ * xattrs are stored in multiple items.   The first item is a
+ * concatenation of an initial header, the name, and then as much of the
+ * value as fits in the remainder of the first item.  This returns the
+ * size of the first item that'd store an xattr with the given name
+ * length and value payload size.
+ */
+static int first_item_bytes(int name_len, size_t size)
+{
+	if (WARN_ON_ONCE(name_len <= 0) ||
+	    WARN_ON_ONCE(name_len > SCOUTFS_XATTR_MAX_NAME_LEN))
+		return 0;
+
+	return min_t(int, sizeof(struct scoutfs_xattr) + name_len + size,
+			  SCOUTFS_XATTR_MAX_PART_SIZE);
+}
+
+/*
+ * Find the next xattr, set the caller's key, and copy as much of the
+ * first item into the caller's buffer as we can.  Returns the number of
+ * bytes copied which can include the header, name, and start of the
+ * value from the first item.  The caller is responsible for comparing
+ * their lengths, the header, and the returned length before safely
+ * using the buffer.
 *
 * If a name is provided then we'll iterate over items with a matching
 * name_hash until we find a matching name.  If we don't find a matching
@@ -154,20 +165,17 @@ int scoutfs_xattr_parse_tags(const char *name, unsigned int name_len,
 * Returns -ENOENT if it didn't find a next item.
 */
 static int get_next_xattr(struct inode *inode, struct scoutfs_key *key,
-			  struct scoutfs_xattr *xat, unsigned int bytes,
+			  struct scoutfs_xattr *xat, unsigned int xat_bytes,
 			  const char *name, unsigned int name_len,
 			  u64 name_hash, u64 id, struct scoutfs_lock *lock)
 {
 	struct super_block *sb = inode->i_sb;
 	struct scoutfs_key last;
-	u8 last_part;
-	int total;
-	u8 part;
 	int ret;

 	/* need to be able to see the name we're looking for */
-	if (WARN_ON_ONCE(name_len > 0 && bytes < offsetof(struct scoutfs_xattr,
-							  name[name_len])))
+	if (WARN_ON_ONCE(name_len > 0 &&
+			 xat_bytes < offsetof(struct scoutfs_xattr, name[name_len])))
 		return -EINVAL;

 	if (name_len)
@@ -176,26 +184,15 @@ static int get_next_xattr(struct inode *inode, struct scoutfs_key *key,
 	init_xattr_key(key, scoutfs_ino(inode), name_hash, id);
 	init_xattr_key(&last, scoutfs_ino(inode), U32_MAX, U64_MAX);

-	last_part = 0;
-	part = 0;
-	total = 0;
-
 	for (;;) {
-		key->skx_part = part;
-		ret = scoutfs_item_next(sb, key, &last,
-					(void *)xat + total, bytes - total,
-					lock);
-		if (ret < 0) {
-			/* XXX corruption, ran out of parts */
-			if (ret == -ENOENT && part > 0)
-				ret = -EIO;
+		ret = scoutfs_item_next(sb, key, &last, xat, xat_bytes, lock);
+		if (ret < 0)
 			break;
-		}

 		trace_scoutfs_xattr_get_next_key(sb, key);

 		/* XXX corruption */
-		if (key->skx_part != part) {
+		if (key->skx_part != 0) {
 			ret = -EIO;
 			break;
 		}
@@ -205,8 +202,7 @@ static int get_next_xattr(struct inode *inode, struct scoutfs_key *key,
 		 * the first part and if the next xattr name fits in our
 		 * buffer then the item must have included it.
 		 */
-		if (part == 0 &&
-		    (ret < sizeof(struct scoutfs_xattr) ||
+		if ((ret < sizeof(struct scoutfs_xattr) ||
 		     (xat->name_len <= name_len &&
 		      ret < offsetof(struct scoutfs_xattr,
 				     name[xat->name_len])) ||
@@ -216,7 +212,7 @@ static int get_next_xattr(struct inode *inode, struct scoutfs_key *key,
 			break;
 		}

-		if (part == 0 && name_len) {
+		if (name_len > 0) {
 			/* ran out of names that could match */
 			if (le64_to_cpu(key->skx_name_hash) != name_hash) {
 				ret = -ENOENT;
@@ -224,64 +220,126 @@ static int get_next_xattr(struct inode *inode, struct scoutfs_key *key,
 			}

 			/* keep looking for our name */
-			if (!xattr_names_equal(name, name_len,
-					       xat->name, xat->name_len)) {
-				part = 0;
+			if (!xattr_names_equal(name, name_len, xat->name, xat->name_len)) {
 				le64_add_cpu(&key->skx_id, 1);
 				continue;
 			}
-
-			/* use the matching name we found */
-			last_part = xattr_nr_parts(xat) - 1;
 		}

-		total += ret;
-		if (total == bytes || part == last_part) {
-			/* copied as much as we could */
-			ret = total;
-			break;
-		}
-		part++;
+		/* found next name */
+		break;
 	}

 	return ret;
 }

+/*
+ * The caller has already read and verified the xattr's first item.
+ * Copy the value from the tail of the first item and from any future
+ * items into the destination buffer.
+ */
+static int copy_xattr_value(struct super_block *sb, struct scoutfs_key *xat_key,
+			    struct scoutfs_xattr *xat, int xat_bytes,
+			    char *buffer, size_t size,
+			    struct scoutfs_lock *lock)
+{
+	struct scoutfs_key key;
+	size_t copied = 0;
+	int val_tail;
+	int bytes;
+	int ret;
+	int i;
+
+	/* must have first item up to value */
+	if (WARN_ON_ONCE(xat_bytes < sizeof(struct scoutfs_xattr)) ||
+	    WARN_ON_ONCE(xat_bytes < offsetof(struct scoutfs_xattr, name[xat->name_len])))
+		return -EINVAL;
+
+	/* only ever copy up to the full value */
+	size = min_t(size_t, size, le16_to_cpu(xat->val_len));
+
+	/* must have full first item if caller needs value from second item */
+	val_tail = SCOUTFS_XATTR_MAX_PART_SIZE -
+		   offsetof(struct scoutfs_xattr, name[xat->name_len]);
+	if (WARN_ON_ONCE(size > val_tail && xat_bytes != SCOUTFS_XATTR_MAX_PART_SIZE))
+		return -EINVAL;
+
+	/* copy from tail of first item */
+	bytes = min_t(unsigned int, size, val_tail);
+	if (bytes > 0) {
+		memcpy(buffer, &xat->name[xat->name_len], bytes);
+		copied += bytes;
+	}
+
+	key = *xat_key;
+	for (i = 1; copied < size; i++) {
+		key.skx_part = i;
+		bytes = min_t(unsigned int, size - copied, SCOUTFS_XATTR_MAX_PART_SIZE);
+
+		ret = scoutfs_item_lookup(sb, &key, buffer + copied, bytes, lock);
+		if (ret >= 0 && ret != bytes)
+			ret = -EIO;
+		if (ret < 0)
+			return ret;
+
+		copied += ret;
+	}
+
+	return copied;
+}
+
+/*
+ * The caller is working with items that are either in the allocated
+ * first compound item or further items that are offsets into a value
+ * buffer.  Give them a pointer and length of the start of the item.
+ */
+static void xattr_item_part_buffer(void **buf, int *len, int part,
+				   struct scoutfs_xattr *xat, unsigned int xat_bytes,
+				   const char *value, size_t size)
+{
+	int off;
+
+	if (part == 0) {
+		*buf = xat;
+		*len = xat_bytes;
+	} else {
+		off = (part * SCOUTFS_XATTR_MAX_PART_SIZE) -
+		      offsetof(struct scoutfs_xattr, name[xat->name_len]);
+		BUG_ON(off >= size); /* calls limited by number of parts */
+		*buf = (void *)value + off;
+		*len = min_t(size_t, size - off, SCOUTFS_XATTR_MAX_PART_SIZE);
+	}
+}
+
 /*
 * Create all the items associated with the given xattr.  If this
 * returns an error it will have already cleaned up any items it created
 * before seeing the error.
 */
-static int create_xattr_items(struct inode *inode, u64 id,
-			      struct scoutfs_xattr *xat, unsigned int bytes,
+static int create_xattr_items(struct inode *inode, u64 id, struct scoutfs_xattr *xat,
+			      int xat_bytes, const char *value, size_t size, u8 new_parts,
 			      struct scoutfs_lock *lock)
 {
 	struct super_block *sb = inode->i_sb;
 	struct scoutfs_key key;
-	unsigned int part_bytes;
-	unsigned int total;
-	int ret;
+	int ret = 0;
+	void *buf;
+	int len;
+	int i;

 	init_xattr_key(&key, scoutfs_ino(inode),
 		       xattr_name_hash(xat->name, xat->name_len), id);

-	total = 0;
-	ret = 0;
-	while (total < bytes) {
-		part_bytes = min_t(unsigned int, bytes - total,
-				   SCOUTFS_XATTR_MAX_PART_SIZE);
+	for (i = 0; i < new_parts; i++) {
+		key.skx_part = i;
+		xattr_item_part_buffer(&buf, &len, i, xat, xat_bytes, value, size);

-		ret = scoutfs_item_create(sb, &key,
-					  (void *)xat + total, part_bytes,
-					  lock);
-		if (ret) {
+		ret = scoutfs_item_create(sb, &key, buf, len, lock);
+		if (ret < 0) {
 			while (key.skx_part-- > 0)
 				scoutfs_item_delete(sb, &key, lock);
 			break;
 		}
-
-		total += part_bytes;
-		key.skx_part++;
 	}

 	return ret;
@@ -329,20 +387,20 @@ out:
 * deleted items.
 */
 static int change_xattr_items(struct inode *inode, u64 id,
-			      struct scoutfs_xattr *new_xat,
-			      unsigned int new_bytes, u8 new_parts,
-			      u8 old_parts, struct scoutfs_lock *lock)
+			      struct scoutfs_xattr *xat, int xat_bytes,
+			      const char *value, size_t size,
+			      u8 new_parts, u8 old_parts, struct scoutfs_lock *lock)
 {
 	struct super_block *sb = inode->i_sb;
 	struct scoutfs_key key;
 	int last_created = -1;
-	int bytes;
-	int off;
+	void *buf;
+	int len;
 	int i;
 	int ret;

 	init_xattr_key(&key, scoutfs_ino(inode),
-		       xattr_name_hash(new_xat->name, new_xat->name_len), id);
+		       xattr_name_hash(xat->name, xat->name_len), id);

 	/* dirty existing old items */
 	for (i = 0; i < old_parts; i++) {
@@ -354,13 +412,10 @@ static int change_xattr_items(struct inode *inode, u64 id,

 	/* create any new items past the old */
 	for (i = old_parts; i < new_parts; i++) {
-		off = i * SCOUTFS_XATTR_MAX_PART_SIZE;
-		bytes = min_t(unsigned int, new_bytes - off,
-			      SCOUTFS_XATTR_MAX_PART_SIZE);
-
 		key.skx_part = i;
-		ret = scoutfs_item_create(sb, &key, (void *)new_xat + off,
-					  bytes, lock);
+		xattr_item_part_buffer(&buf, &len, i, xat, xat_bytes, value, size);
+
+		ret = scoutfs_item_create(sb, &key, buf, len, lock);
 		if (ret)
 			goto out;

@@ -369,13 +424,10 @@ static int change_xattr_items(struct inode *inode, u64 id,

 	/* update dirtied overlapping existing items, last partial first */
 	for (i = min(old_parts, new_parts) - 1; i >= 0; i--) {
-		off = i * SCOUTFS_XATTR_MAX_PART_SIZE;
-		bytes = min_t(unsigned int, new_bytes - off,
-			      SCOUTFS_XATTR_MAX_PART_SIZE);
-
 		key.skx_part = i;
-		ret = scoutfs_item_update(sb, &key, (void *)new_xat + off,
-					  bytes, lock);
+		xattr_item_part_buffer(&buf, &len, i, xat, xat_bytes, value, size);
+
+		ret = scoutfs_item_update(sb, &key, buf, len, lock);
 		/* only last partial can fail, then we unwind created */
 		if (ret < 0)
 			goto out;
@@ -412,7 +464,7 @@ ssize_t scoutfs_getxattr(struct dentry *dentry, const char *name, void *buffer,
 	struct scoutfs_xattr *xat = NULL;
 	struct scoutfs_lock *lck = NULL;
 	struct scoutfs_key key;
-	unsigned int bytes;
+	unsigned int xat_bytes;
 	size_t name_len;
 	int ret;

@@ -423,9 +475,8 @@ ssize_t scoutfs_getxattr(struct dentry *dentry, const char *name, void *buffer,
 	if (name_len > SCOUTFS_XATTR_MAX_NAME_LEN)
 		return -ENODATA;

-	/* only need enough for caller's name and value sizes */
-	bytes = sizeof(struct scoutfs_xattr) + name_len + size;
-	xat = __vmalloc(bytes, GFP_NOFS, PAGE_KERNEL);
+	xat_bytes = first_item_bytes(name_len, size);
+	xat = kmalloc(xat_bytes, GFP_NOFS);
 	if (!xat)
 		return -ENOMEM;

@@ -435,40 +486,32 @@ ssize_t scoutfs_getxattr(struct dentry *dentry, const char *name, void *buffer,

 	down_read(&si->xattr_rwsem);

-	ret = get_next_xattr(inode, &key, xat, bytes,
-			     name, name_len, 0, 0, lck);
-
-	up_read(&si->xattr_rwsem);
-	scoutfs_unlock(sb, lck, SCOUTFS_LOCK_READ);
+	ret = get_next_xattr(inode, &key, xat, xat_bytes, name, name_len, 0, 0, lck);

 	if (ret < 0) {
 		if (ret == -ENOENT)
 			ret = -ENODATA;
-		goto out;
+		goto unlock;
 	}

 	/* the caller just wants to know the size */
 	if (size == 0) {
 		ret = le16_to_cpu(xat->val_len);
-		goto out;
+		goto unlock;
 	}

 	/* the caller's buffer wasn't big enough */
 	if (size < le16_to_cpu(xat->val_len)) {
 		ret = -ERANGE;
-		goto out;
+		goto unlock;
 	}

-	/* XXX corruption, the items didn't match the header */
-	if (ret < xattr_full_bytes(xat)) {
-		ret = -EIO;
-		goto out;
-	}
-
-	ret = le16_to_cpu(xat->val_len);
-	memcpy(buffer, &xat->name[xat->name_len], ret);
+	ret = copy_xattr_value(sb, &key, xat, xat_bytes, buffer, size, lck);
+unlock:
+	up_read(&si->xattr_rwsem);
+	scoutfs_unlock(sb, lck, SCOUTFS_LOCK_READ);
 out:
-	vfree(xat);
+	kfree(xat);
 	return ret;
 }

@@ -596,7 +639,8 @@ static int scoutfs_xattr_set(struct dentry *dentry, const char *name,
 	bool undo_totl = false;
 	LIST_HEAD(ind_locks);
 	u8 found_parts;
-	unsigned int bytes;
+	unsigned int xat_bytes_totl;
+	unsigned int xat_bytes;
 	unsigned int val_len;
 	u64 ind_seq;
 	u64 total;
@@ -629,9 +673,12 @@ static int scoutfs_xattr_set(struct dentry *dentry, const char *name,
 	if (tgs.totl && ((ret = parse_totl_key(&totl_key, name, name_len)) != 0))
 		return ret;

-	bytes = sizeof(struct scoutfs_xattr) + name_len + size;
-	/* alloc enough to read old totl value */
-	xat = __vmalloc(bytes + SCOUTFS_XATTR_MAX_TOTL_U64, GFP_NOFS, PAGE_KERNEL);
+	/* allocate enough to always read an existing xattr's totl */
+	xat_bytes_totl = first_item_bytes(name_len,
+					  max_t(size_t, size, SCOUTFS_XATTR_MAX_TOTL_U64));
+	/* but store partial first item that only includes the new xattr's value */
+	xat_bytes = first_item_bytes(name_len, size);
+	xat = kmalloc(xat_bytes_totl, GFP_NOFS);
 	if (!xat) {
 		ret = -ENOMEM;
 		goto out;
@@ -645,9 +692,7 @@ static int scoutfs_xattr_set(struct dentry *dentry, const char *name,
 	down_write(&si->xattr_rwsem);

 	/* find an existing xattr to delete, including possible totl value */
-	ret = get_next_xattr(inode, &key, xat,
-			     sizeof(struct scoutfs_xattr) + name_len + SCOUTFS_XATTR_MAX_TOTL_U64,
-			     name, name_len, 0, 0, lck);
+	ret = get_next_xattr(inode, &key, xat, xat_bytes_totl, name, name_len, 0, 0, lck);
 	if (ret < 0 && ret != -ENOENT)
 		goto unlock;

@@ -683,7 +728,7 @@ static int scoutfs_xattr_set(struct dentry *dentry, const char *name,
 		le64_add_cpu(&tval.total, -total);
 	}

-	/* prepare our xattr */
+	/* prepare the xattr header, name, and start of value in first item */
 	if (value) {
 		if (found_parts)
 			id = le64_to_cpu(key.skx_id);
@@ -693,7 +738,9 @@ static int scoutfs_xattr_set(struct dentry *dentry, const char *name,
 		xat->val_len = cpu_to_le16(size);
 		memset(xat->__pad, 0, sizeof(xat->__pad));
 		memcpy(xat->name, name, name_len);
-		memcpy(&xat->name[xat->name_len], value, size);
+		memcpy(&xat->name[name_len], value,
+		       min(size, SCOUTFS_XATTR_MAX_PART_SIZE -
+			         offsetof(struct scoutfs_xattr, name[name_len])));

 		if (tgs.totl) {
 			ret = parse_totl_u64(value, size, &total);
@@ -741,14 +788,15 @@ retry:
 	}

 	if (found_parts && value)
-		ret = change_xattr_items(inode, id, xat, bytes,
+		ret = change_xattr_items(inode, id, xat, xat_bytes, value, size,
 					 xattr_nr_parts(xat), found_parts, lck);
 	else if (found_parts)
 		ret = delete_xattr_items(inode, le64_to_cpu(key.skx_name_hash),
 					 le64_to_cpu(key.skx_id), found_parts,
 					 lck);
 	else
-		ret = create_xattr_items(inode, id, xat, bytes, lck);
+		ret = create_xattr_items(inode, id, xat, xat_bytes, value, size,
+					 xattr_nr_parts(xat), lck);
 	if (ret < 0)
 		goto release;

@@ -778,7 +826,7 @@ unlock:
 	scoutfs_unlock(sb, lck, SCOUTFS_LOCK_WRITE);
 	scoutfs_unlock(sb, totl_lock, SCOUTFS_LOCK_WRITE_ONLY);
 out:
-	vfree(xat);
+	kfree(xat);

 	return ret;
 }
@@ -807,7 +855,7 @@ ssize_t scoutfs_list_xattrs(struct inode *inode, char *buffer,
 	struct scoutfs_xattr *xat = NULL;
 	struct scoutfs_lock *lck = NULL;
 	struct scoutfs_key key;
-	unsigned int bytes;
+	unsigned int xat_bytes;
 	ssize_t total = 0;
 	u32 name_hash = 0;
 	bool is_hidden;
@@ -820,8 +868,8 @@ ssize_t scoutfs_list_xattrs(struct inode *inode, char *buffer,
 		id = *id_pos;

 	/* need a buffer large enough for all possible names */
-	bytes = sizeof(struct scoutfs_xattr) + SCOUTFS_XATTR_MAX_NAME_LEN;
-	xat = kmalloc(bytes, GFP_NOFS);
+	xat_bytes = first_item_bytes(SCOUTFS_XATTR_MAX_NAME_LEN, 0);
+	xat = kmalloc(xat_bytes, GFP_NOFS);
 	if (!xat) {
 		ret = -ENOMEM;
 		goto out;
@@ -834,8 +882,7 @@ ssize_t scoutfs_list_xattrs(struct inode *inode, char *buffer,
 	down_read(&si->xattr_rwsem);

 	for (;;) {
-		ret = get_next_xattr(inode, &key, xat, bytes,
-				     NULL, 0, name_hash, id, lck);
+		ret = get_next_xattr(inode, &key, xat, xat_bytes, NULL, 0, name_hash, id, lck);
 		if (ret < 0) {
 			if (ret == -ENOENT)
 				ret = total;
--- a/tests/Makefile
+++ b/tests/Makefile
@@ -10,7 +10,8 @@ BIN := src/createmany			\
 	src/bulk_create_paths		\
 	src/stage_tmpfile		\
 	src/find_xattrs			\
-	src/create_xattr_loop
+	src/create_xattr_loop		\
+	src/fragmented_data_extents

 DEPS := $(wildcard src/*.d)

--- a/tests/fenced-local-force-unmount.sh
+++ b/tests/fenced-local-force-unmount.sh
@@ -1,5 +1,18 @@
 #!/usr/bin/bash

+#
+# This fencing script is used for testing clusters of multiple mounts on
+# a single host.  It finds mounts to fence by looking for their rids and
+# only knows how to "fence" by using forced unmount.
+#
+
+echo "$0 running rid '$SCOUTFS_FENCED_REQ_RID' ip '$SCOUTFS_FENCED_REQ_IP' args '$@'"
+
+log() {
+	echo "$@" > /dev/stderr
+	exit 1
+}
+
 echo_fail() {
 	echo "$@" > /dev/stderr
 	exit 1
@@ -7,29 +20,24 @@ echo_fail() {

 rid="$SCOUTFS_FENCED_REQ_RID"

-#
-# Look for a local mount with the rid to fence.  Typically we'll at
-# least find the mount with the server that requested the fence that
-# we're processing.   But it's possible that mounts are unmounted
-# before, or while, we're running.
-#
-mnts=$(findmnt -l -n -t scoutfs -o TARGET) || \
-	echo_fail "findmnt -t scoutfs failed" > /dev/stderr
+for fs in /sys/fs/scoutfs/*; do
+	[ ! -d "$fs" ] && continue

-for mnt in $mnts; do
-	mnt_rid=$(scoutfs statfs -p "$mnt" -s rid) || \
-		echo_fail "scoutfs statfs $mnt failed"
-
-	if [ "$mnt_rid" == "$rid" ]; then
-		umount -f "$mnt" || \
-			echo_fail "umout -f $mnt"
-
-		exit 0
+	fs_rid="$(cat $fs/rid)" || \
+		echo_fail "failed to get rid in $fs"
+	if [ "$fs_rid" != "$rid" ]; then
+		continue
 	fi
+
+	nr="$(cat $fs/data_device_maj_min)" || \
+		echo_fail "failed to get data device major:minor in $fs"
+
+	mnts=$(findmnt -l -n -t scoutfs -o TARGET -S $nr) || \
+		echo_fail "findmnt -t scoutfs -S $nr failed"
+	for mnt in $mnts; do
+		umount -f "$mnt" || \
+			echo_fail "umount -f $mnt failed"
+	done
 done

-#
-# If the mount doesn't exist on this host then it can't access the
-# devices by definition and can be considered fenced.
-#
 exit 0
--- a/tests/funcs/fs.sh
+++ b/tests/funcs/fs.sh
@@ -75,6 +75,20 @@ t_fs_nrs()
 	seq 0 $((T_NR_MOUNTS - 1))
 }

+#
+# outputs "1" if the fs number has "1" in its quorum/is_leader file.
+# All other cases output 0, including the fs nr being a client which
+# won't have a quorum/ dir.
+#
+t_fs_is_leader()
+{
+	if [ "$(cat $(t_sysfs_path $1)/quorum/is_leader 2>/dev/null)" == "1" ]; then
+		echo "1"
+	else
+		echo "0"
+	fi
+}
+
 #
 # Output the mount nr of the current server.  This takes no steps to
 # ensure that the server doesn't shut down and have some other mount
@@ -83,7 +97,7 @@ t_fs_nrs()
 t_server_nr()
 {
 	for i in $(t_fs_nrs); do
-		if [ "$(cat $(t_sysfs_path $i)/quorum/is_leader)" == "1" ]; then
+		if [ "$(t_fs_is_leader $i)" == "1" ]; then
 			echo $i
 			return
 		fi
@@ -101,7 +115,7 @@ t_server_nr()
 t_first_client_nr()
 {
 	for i in $(t_fs_nrs); do
-		if [ "$(cat $(t_sysfs_path $i)/quorum/is_leader)" == "0" ]; then
+		if [ "$(t_fs_is_leader $i)" == "0" ]; then
 			echo $i
 			return
 		fi
--- a/tests/golden/large-fragmented-free
+++ b/tests/golden/large-fragmented-free
@@ -0,0 +1,3 @@
+== creating fragmented extents
+== unlink file with moved extents to free extents per block
+== cleanup
--- a/tests/golden/lock-recover-invalidate
+++ b/tests/golden/lock-recover-invalidate
@@ -0,0 +1,3 @@
+== starting background invalidating read/write load
+== 60s of lock recovery during invalidating load
+== stopping background load
--- a/tests/golden/lock-rever-invalidate
+++ b/tests/golden/lock-rever-invalidate
--- a/tests/run-tests.sh
+++ b/tests/run-tests.sh
@@ -380,13 +380,14 @@ cmd grep .  /sys/kernel/debug/tracing/options/trace_printk \
 # Build a fenced config that runs scripts out of the repository rather
 # than the default system directory
 #
-conf="$T_RESULTS/scoutfs-fencd.conf"
+conf="$T_RESULTS/scoutfs-fenced.conf"
 cat > $conf << EOF
 SCOUTFS_FENCED_DELAY=1
 SCOUTFS_FENCED_RUN=$T_TESTS/fenced-local-force-unmount.sh
-SCOUTFS_FENCED_RUN_ARGS=""
+SCOUTFS_FENCED_RUN_ARGS="ignored run args"
 EOF
 export SCOUTFS_FENCED_CONFIG_FILE="$conf"
+T_FENCED_LOG="$T_RESULTS/fenced.log"

 #
 # Run the agent in the background, log its output, and kill it if we
@@ -394,7 +395,7 @@ export SCOUTFS_FENCED_CONFIG_FILE="$conf"
 #
 fenced_log()
 {
-	echo "[$(timestamp)] $*" >> "$T_RESULTS/fenced.stdout.log"
+	echo "[$(timestamp)] $*" >> "$T_FENCED_LOG"
 }
 fenced_pid=""
 kill_fenced()
@@ -405,7 +406,7 @@ kill_fenced()
 	fi
 }
 trap kill_fenced EXIT
-$T_UTILS/fenced/scoutfs-fenced > "$T_RESULTS/fenced.stdout.log" 2> "$T_RESULTS/fenced.stderr.log" &
+$T_UTILS/fenced/scoutfs-fenced > "$T_FENCED_LOG" 2>&1 &
 fenced_pid=$!
 fenced_log "started fenced pid $fenced_pid in the background"

--- a/tests/sequence
+++ b/tests/sequence
@@ -9,6 +9,7 @@ fallocate.sh
 setattr_more.sh
 offline-extent-waiting.sh
 move-blocks.sh
+large-fragmented-free.sh
 enospc.sh
 srch-basic-functionality.sh
 simple-xattr-unit.sh
@@ -17,6 +18,7 @@ lock-refleak.sh
 lock-shrink-consistency.sh
 lock-pr-cw-conflict.sh
 lock-revoke-getcwd.sh
+lock-recover-invalidate.sh
 export-lookup-evict-race.sh
 createmany-parallel.sh
 createmany-large-names.sh
--- a/tests/src/fragmented_data_extents.c
+++ b/tests/src/fragmented_data_extents.c
@@ -0,0 +1,113 @@
+/*
+ * Copyright (C) 2021 Versity Software, Inc.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+
+/*
+ * This creates fragmented data extents.
+ *
+ * A file is created that has alternating free and allocated extents.
+ * This also results in the global allocator having the matching
+ * fragmented free extent pattern.  While that file is being created,
+ * occasionally an allocated extent is moved to another file.   This
+ * results in a file that has fragmented extents at a given stride that
+ * can be deleted to create free data extents with a given stride.
+ *
+ * We don't have hole punching so to do this quickly we use a goofy
+ * combination of fallocate, truncate, and our move_blocks ioctl.
+ */
+
+#ifndef _GNU_SOURCE
+#define _GNU_SOURCE
+#endif
+#include <unistd.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <fcntl.h>
+#include <errno.h>
+#include <linux/types.h>
+#include <assert.h>
+
+#include "ioctl.h"
+
+#define BLOCK_SIZE 4096
+
+int main(int argc, char **argv)
+{
+	struct scoutfs_ioctl_move_blocks mb = {0,};
+	unsigned long long freed_extents;
+	unsigned long long move_stride;
+	unsigned long long i;
+	int alloc_fd;
+	int trunc_fd;
+	off_t off;
+	int ret;
+
+	if (argc != 5) {
+		printf("%s <freed_extents> <move_stride> <alloc_file> <trunc_file>\n", argv[0]);
+		return 1;
+	}
+
+	freed_extents = strtoull(argv[1], NULL, 0);
+	move_stride = strtoull(argv[2], NULL, 0);
+
+	alloc_fd = open(argv[3], O_RDWR | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
+	if (alloc_fd == -1) {
+		fprintf(stderr, "error opening %s: %d (%s)\n", argv[3], errno, strerror(errno));
+		exit(1);
+	}
+
+	trunc_fd = open(argv[4], O_RDWR | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
+	if (trunc_fd == -1) {
+		fprintf(stderr, "error opening %s: %d (%s)\n", argv[4], errno, strerror(errno));
+		exit(1);
+	}
+
+	for (i = 0, off = 0; i < freed_extents; i++, off += BLOCK_SIZE * 2) {
+
+		ret = fallocate(alloc_fd, 0, off, BLOCK_SIZE * 2);
+		if (ret < 0) {
+			fprintf(stderr, "fallocate at off %llu error: %d (%s)\n",
+				(unsigned long long)off, errno, strerror(errno));
+			exit(1);
+		}
+
+		ret = ftruncate(alloc_fd, off + BLOCK_SIZE);
+		if (ret < 0) {
+			fprintf(stderr, "truncate to off %llu error: %d (%s)\n",
+				(unsigned long long)off + BLOCK_SIZE, errno, strerror(errno));
+			exit(1);
+		}
+
+		if ((i % move_stride) == 0) {
+			mb.from_fd = alloc_fd;
+			mb.from_off = off;
+			mb.len = BLOCK_SIZE;
+			mb.to_off = i * BLOCK_SIZE;
+
+			ret = ioctl(trunc_fd, SCOUTFS_IOC_MOVE_BLOCKS, &mb);
+			if (ret < 0) {
+				fprintf(stderr, "move from off %llu error: %d (%s)\n",
+					(unsigned long long)off,
+					errno, strerror(errno));
+			}
+		}
+	}
+
+	if (alloc_fd > -1)
+		close(alloc_fd);
+	if (trunc_fd > -1)
+		close(trunc_fd);
+
+	return 0;
+}
--- a/tests/tests/fence-and-reclaim.sh
+++ b/tests/tests/fence-and-reclaim.sh
@@ -45,6 +45,18 @@ check_read_write()
 	fi
 }

+# verify that fenced ran our testing fence script
+verify_fenced_run()
+{
+	local rids="$@"
+	local rid
+
+	for rid in $rids; do
+		grep -q ".* running rid '$rid'.* args 'ignored run args'" "$T_FENCED_LOG" || \
+			t_fail "fenced didn't execute RUN script for rid $rid"
+	done
+}
+
 echo "== make sure all mounts can see each other"
 check_read_write

@@ -62,12 +74,14 @@ done
 while t_rid_is_fencing $rid; do
 	sleep .5
 done
+verify_fenced_run $rid
 t_mount $cl
 check_read_write

 echo "== force unmount all non-server, connection timeout, fence nop, mount"
 sv=$(t_server_nr)
 pattern="nonsense"
+rids=""
 sync
 for cl in $(t_fs_nrs); do
 	if [ $cl == $sv ]; then
@@ -75,6 +89,7 @@ for cl in $(t_fs_nrs); do
 	fi

 	rid=$(t_mount_rid $cl)
+	rids="$rids $rid"
 	pattern="$pattern|$rid"
 	echo "cl $cl sv $sv rid $rid" >> "$T_TMP.log"

@@ -89,6 +104,7 @@ done
 while test -d $(echo /sys/fs/scoutfs/*/fence/* | cut -d " " -f 1); do
 	sleep .5
 done
+verify_fenced_run $rids
 # remount all the clients
 for cl in $(t_fs_nrs); do
 	if [ $cl == $sv ]; then
@@ -109,11 +125,17 @@ t_wait_for_leader
 while t_rid_is_fencing $rid; do
 	sleep .5
 done
+verify_fenced_run $rid
 t_mount $sv
 check_read_write

 echo "== force unmount everything, new server fences all previous"
 sync
+rids=""
+# get rids before forced unmount breaks scoutfs statfs
+for nr in $(t_fs_nrs); do
+	rids="$rids $(t_mount_rid $nr)"
+done
 for nr in $(t_fs_nrs); do
 	t_force_umount $nr
 done
@@ -122,6 +144,7 @@ t_mount_all
 while test -d $(echo /sys/fs/scoutfs/*/fence/* | cut -d " " -f 1); do
 	sleep .5
 done
+verify_fenced_run $rids
 check_read_write

 t_pass
--- a/tests/tests/large-fragmented-free.sh
+++ b/tests/tests/large-fragmented-free.sh
@@ -0,0 +1,22 @@
+#
+# Make sure the server can handle a transaction with a data_freed whose
+# blocks all hit different btree blocks in the main free list.  It
+# probably has to be merged in multiple commits.
+#
+
+t_require_commands fragmented_data_extents
+
+EXTENTS_PER_BTREE_BLOCK=600
+EXTENTS_PER_LIST_BLOCK=8192
+FREED_EXTENTS=$((EXTENTS_PER_BTREE_BLOCK * EXTENTS_PER_LIST_BLOCK))
+
+echo "== creating fragmented extents"
+fragmented_data_extents $FREED_EXTENTS $EXTENTS_PER_BTREE_BLOCK "$T_D0/alloc" "$T_D0/move"
+
+echo "== unlink file with moved extents to free extents per block"
+rm -f "$T_D0/move"
+
+echo "== cleanup"
+rm -f "$T_D0/alloc"
+
+t_pass
--- a/tests/tests/lock-recover-invalidate.sh
+++ b/tests/tests/lock-recover-invalidate.sh
@@ -0,0 +1,43 @@
+#
+# trigger server failover and lock recovery during heavy invalidating
+# load on multiple mounts
+#
+
+majority_nr=$(t_majority_count)
+quorum_nr=$T_QUORUM
+
+test "$quorum_nr" == "$majority_nr" && \
+        t_skip "need remaining majority when leader unmounted"
+
+test "$T_NR_MOUNTS" -lt "$((quorum_nr + 2))" && \
+        t_skip "need at least 2 non-quorum load mounts"
+
+echo "== starting background invalidating read/write load"
+touch "$T_D0/file"
+load_pids=""
+for i in $(t_fs_nrs); do
+	if [ "$i" -ge "$quorum_nr" ]; then
+		eval path="\$T_D${i}/file"
+
+		(while true; do touch $path > /dev/null 2>&1; done) &
+		load_pids="$load_pids $!"
+		(while true; do stat $path > /dev/null 2>&1; done) &
+		load_pids="$load_pids $!"
+	fi
+done
+
+# had it reproduce in ~40s on wimpy debug kernel guests
+LENGTH=60
+echo "== ${LENGTH}s of lock recovery during invalidating load"
+END=$((SECONDS + LENGTH))
+while [ "$SECONDS" -lt "$END" ]; do
+        sv=$(t_server_nr)
+        t_umount $sv
+        t_mount $sv
+	# new server had to process greeting for mount to finish
+done
+
+echo "== stopping background load"
+kill $load_pids
+
+t_pass
--- a/utils/fenced/scoutfs-fenced
+++ b/utils/fenced/scoutfs-fenced
@@ -55,9 +55,21 @@ test -x "$SCOUTFS_FENCED_RUN" || \
 	error_exit "SCOUTFS_FENCED_RUN '$SCOUTFS_FENCED_RUN' isn't executable"

 #
-# main loop watching for fence request across all filesystems 
+# Main loop watching for fence requests across all filesystems.  The
+# server can shut down without waiting for pending fence requests to
+# finish.  All of the interaction with the fence directory and files can
+# fail at any moment.  We will generate log messages when the dir or
+# files disappear.
 #

+# generate failure messages to stderr while still echoing 0 for the caller
+careful_cat()
+{
+	local path="$1"
+
+	cat "$path" || echo 0
+}
+
 while sleep $SCOUTFS_FENCED_DELAY; do
 	for fence in /sys/fs/scoutfs/*/fence/*; do
 		# catches unmatched regex when no dirs
@@ -66,7 +78,8 @@ while sleep $SCOUTFS_FENCED_DELAY; do
 		fi

 		# skip requests that have been handled
-		if [ $(cat "$fence/fenced") == 1 -o $(cat "$fence/error") == 1 ]; then
+		if [ "$(careful_cat $fence/fenced)" == 1 -o \
+		     "$(careful_cat $fence/error)" == 1 ]; then
 			continue
 		fi

@@ -81,10 +94,10 @@ while sleep $SCOUTFS_FENCED_DELAY; do
 		export SCOUTFS_FENCED_REQ_RID="$rid"
 		export SCOUTFS_FENCED_REQ_IP="$ip"

-		$run $SCOUTFS_FENCED_RUN_ARGS
+		$SCOUTFS_FENCED_RUN $SCOUTFS_FENCED_RUN_ARGS
 		rc=$?
 		if [ "$rc" != 0 ]; then
-			log_message "server $srv fencing rid $rid saw error status $rc from $run"
+			log_message "server $srv fencing rid $rid saw error status $rc"
 			echo 1 > "$fence/error"
 			continue
 		fi
--- a/utils/man/scoutfs.8
+++ b/utils/man/scoutfs.8
@@ -597,7 +597,7 @@ format.
 .PD

 .TP
-.BI "print META-DEVICE"
+.BI "print [-S|--skip-likely-huge] META-DEVICE"
 .sp
 Prints out all of the metadata in the file system.  This makes no effort
 to ensure that the structures are consistent as they're traversed and
@@ -607,6 +607,20 @@ output.
 .PD 0
 .TP
 .sp
+.B "-S, --skip-likely-huge"
+Skip printing structures that are likely to be very large.  The skipped
+structures tend to be global, with sizes that scale with the size of
+the volume.  Examples of skipped structures include
+the global fs items, srch files, and metadata and data
+allocators.  Similar structures that are not skipped are related to the
+number of mounts and are maintained at a relatively reasonable size.
+These include per-mount log trees, srch files, allocators, and the
+metadata allocators used by server commits.
+.sp
+Skipping the larger structures limits the print output to a relatively
+constant size, rather than a large multiple of the used metadata space
+of the volume, which makes the output much more useful for inspection.
+.TP
 .B "META-DEVICE"
 The path to the metadata device for the filesystem whose metadata will be
 printed.  Since this command reads via the host's buffer cache, it may not
--- a/utils/src/print.c
+++ b/utils/src/print.c
@@ -8,6 +8,7 @@
 #include <errno.h>
 #include <string.h>
 #include <stdarg.h>
+#include <stdbool.h>
 #include <ctype.h>
 #include <uuid/uuid.h>
 #include <sys/socket.h>
@@ -989,9 +990,10 @@ static void print_super_block(struct scoutfs_super_block *super, u64 blkno)

 struct print_args {
 	char *meta_device;
+	bool skip_likely_huge;
 };

-static int print_volume(int fd)
+static int print_volume(int fd, struct print_args *args)
 {
 	struct scoutfs_super_block *super = NULL;
 	struct print_recursion_args pa;
@@ -1041,23 +1043,26 @@ static int print_volume(int fd)
 			ret = err;
 	}

-	for (i = 0; i < array_size(super->meta_alloc); i++) {
-		snprintf(str, sizeof(str), "meta_alloc[%u]", i);
-		err = print_btree(fd, super, str, &super->meta_alloc[i].root,
+	if (!args->skip_likely_huge) {
+		for (i = 0; i < array_size(super->meta_alloc); i++) {
+			snprintf(str, sizeof(str), "meta_alloc[%u]", i);
+			err = print_btree(fd, super, str, &super->meta_alloc[i].root,
+					  print_alloc_item, NULL);
+			if (err && !ret)
+				ret = err;
+		}
+
+		err = print_btree(fd, super, "data_alloc", &super->data_alloc.root,
 				  print_alloc_item, NULL);
 		if (err && !ret)
 			ret = err;
 	}

-	err = print_btree(fd, super, "data_alloc", &super->data_alloc.root,
-			  print_alloc_item, NULL);
-	if (err && !ret)
-		ret = err;
-
 	err = print_btree(fd, super, "srch_root", &super->srch_root,
 			  print_srch_root_item, NULL);
 	if (err && !ret)
 		ret = err;
+
 	err = print_btree(fd, super, "logs_root", &super->logs_root,
 			  print_log_trees_item, NULL);
 	if (err && !ret)
@@ -1065,19 +1070,23 @@ static int print_volume(int fd)

 	pa.super = super;
 	pa.fd = fd;
-	err = print_btree_leaf_items(fd, super, &super->srch_root.ref,
-				     print_srch_root_files, &pa);
-	if (err && !ret)
-		ret = err;
+	if (!args->skip_likely_huge) {
+		err = print_btree_leaf_items(fd, super, &super->srch_root.ref,
+					     print_srch_root_files, &pa);
+		if (err && !ret)
+			ret = err;
+	}
 	err = print_btree_leaf_items(fd, super, &super->logs_root.ref,
 				     print_log_trees_roots, &pa);
 	if (err && !ret)
 		ret = err;

-	err = print_btree(fd, super, "fs_root", &super->fs_root,
-			  print_fs_item, NULL);
-	if (err && !ret)
-		ret = err;
+	if (!args->skip_likely_huge) {
+		err = print_btree(fd, super, "fs_root", &super->fs_root,
+				  print_fs_item, NULL);
+		if (err && !ret)
+			ret = err;
+	}

 out:
 	free(super);
@@ -1098,7 +1107,7 @@ static int do_print(struct print_args *args)
 		return ret;
 	}

-	ret = print_volume(fd);
+	ret = print_volume(fd, args);
 	close(fd);
 	return ret;
 };
@@ -1108,6 +1117,9 @@ static int parse_opt(int key, char *arg, struct argp_state *state)
 	struct print_args *args = state->input;

 	switch (key) {
+	case 'S':
+		args->skip_likely_huge = true;
+		break;
 	case ARGP_KEY_ARG:
 		if (!args->meta_device)
 			args->meta_device = strdup_or_error(state, arg);
@@ -1125,8 +1137,13 @@ static int parse_opt(int key, char *arg, struct argp_state *state)
 	return 0;
 }

+static struct argp_option options[] = {
+	{ "skip-likely-huge", 'S', NULL, 0, "Skip large structures to minimize output size"},
+	{ NULL }
+};
+
 static struct argp argp = {
-	NULL,
+	options,
 	parse_opt,
 	"META-DEV",
 	"Print metadata structures"