Walk meta_seq for inodes to open_by_handle_at() = -ESTALE, and trace.

This walks the meta_seq from 0, or from a number passed in, to the current meta_seq, and polls for further advances, until it approaches (but doesn't reach) the current stable meta_seq. Inodes found in the index are tested to see whether they return -ESTALE when passed to open_by_handle_at(). If they do, further tests ensure the file exists and can be resolved. If that is the case, we ftrace the open_by_handle_at() on the inode and write out the trace file.

The program can be ^C'd and will print out how far in the meta_seq it got, so it can be resumed where it left off. If tracing happens to be on when a TERM or INT signal is received, we immediately turn tracing off again.

The program runs without producing needless terminal output. If traces are generated, the file name is printed out. If the program terminates, for whatever reason, it prints out how far it has advanced through the meta_seq, and that information can be used to resume testing from there.

The program can be run in tailing (normal) mode, the default, in which it continues to wait for new work to appear. Alternatively, the command line option `-e` tells the program to stop once the current stable meta_seq has been reached.

Signed-off-by: Auke Kok <auke.kok@versity.com>
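The -ESTALE check described above can be sketched in userspace C as follows. This is only an illustration, not the tool's code: `probe_path()` and `classify_open()` are hypothetical names, and the sketch builds the handle from a path with name_to_handle_at(), whereas the tool works from inodes found in the index. open_by_handle_at() normally needs CAP_DAC_READ_SEARCH, so the probe has to run privileged.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

enum probe_result { PROBE_OK, PROBE_STALE, PROBE_ERROR };

/* Map an open_by_handle_at() return value and saved errno to a verdict. */
enum probe_result classify_open(int fd, int err)
{
	if (fd >= 0)
		return PROBE_OK;
	return err == ESTALE ? PROBE_STALE : PROBE_ERROR;
}

/*
 * Hypothetical helper: build a handle for path, then try to open it
 * through mount_fd (an fd on the same mounted filesystem).
 */
enum probe_result probe_path(int mount_fd, const char *path)
{
	struct file_handle *fh;
	enum probe_result res;
	int mount_id;
	int fd;
	int err;

	assert(path != NULL);

	fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
	if (!fh)
		return PROBE_ERROR;
	fh->handle_bytes = MAX_HANDLE_SZ;

	if (name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0) < 0) {
		free(fh);
		return PROBE_ERROR;
	}

	fd = open_by_handle_at(mount_fd, fh, O_RDONLY);
	err = errno;	/* save before close() can overwrite it */
	if (fd >= 0)
		close(fd);

	res = classify_open(fd, err);
	free(fh);
	return res;
}
```

A caller would open the mount point once, pass that descriptor as `mount_fd`, and only start tracing for paths where the result is `PROBE_STALE`.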
Merge pull request #223 from versity/auke/el9_5_wmaybe-uninit
2026-01-11 22:12:53 +00:00 · 2025-05-21 10:00:35 -07:00 · 2025-05-12 12:21:02 -07:00 · 2025-05-09 11:27:04 -07:00 · 2025-05-09 11:17:24 -07:00 · 2025-05-09 11:15:13 -07:00
31 changed files with 1923 additions and 332 deletions
--- a/ReleaseNotes.md
+++ b/ReleaseNotes.md
@@ -1,6 +1,22 @@
 Versity ScoutFS Release Notes
 =============================

+---
+v1.24
+\
+*Mar 14, 2025*
+
+Add support for coherent read and write mmap() mappings of regular file
+data between mounts.
+
+Fix a bug that was causing scoutfs utilities to parse and change some
+file names before passing them on to the kernel for processing.  This
+fixes spurious scoutfs command errors for files with the offending
+patterns in their names.
+
+Fix a bug where rename wasn't updating the ctime of the inode at the
+destination name if it existed.
+
 ---
 v1.23
 \
--- a/kmod/src/Makefile.kernelcompat
+++ b/kmod/src/Makefile.kernelcompat
@@ -6,26 +6,6 @@

 ccflags-y += -include $(src)/kernelcompat.h

-#
-# v3.10-rc6-21-gbb6f619b3a49
-#
-# _readdir changes from fop->readdir() to fop->iterate() and from
-# filldir(dirent) to dir_emit(ctx).
-#
-ifneq (,$(shell grep 'iterate.*dir_context' include/linux/fs.h))
-ccflags-y += -DKC_ITERATE_DIR_CONTEXT
-endif
-
-#
-# v3.10-rc6-23-g5f99f4e79abc
-#
-# Helpers including dir_emit_dots() are added in the process of
-# switching dcache_readdir() from fop->readdir() to fop->iterate()
-#
-ifneq (,$(shell grep 'dir_emit_dots' include/linux/fs.h))
-ccflags-y += -DKC_DIR_EMIT_DOTS
-endif
-
 #
 # v3.18-rc2-19-gb5ae6b15bd73
 # 
@@ -431,3 +411,26 @@ endif
 ifneq (,$(shell grep 'struct file.*bdev_file_open_by_path.const char.*path' include/linux/blkdev.h))
 ccflags-y += -DKC_BDEV_FILE_OPEN_BY_PATH
 endif
+
+# v4.0-rc7-1796-gfe0f07d08ee3
+#
+# direct-io changes modify inode_dio_done to now be called inode_dio_end
+ifneq (,$(shell grep 'void inode_dio_end.struct inode' include/linux/fs.h))
+ccflags-y += -DKC_INODE_DIO_END
+endif
+
+#
+# v5.0-6476-g3d3539018d2c
+#
+# page fault handlers return a bitmask vm_fault_t instead
+# Note: el8's header has a slightly modified prefix here
+ifneq (,$(shell grep 'typedef.*__bitwise unsigned.*int vm_fault_t' include/linux/mm_types.h))
+ccflags-y += -DKC_MM_VM_FAULT_T
+endif
+
+# v3.19-499-gd83a08db5ba6
+#
+# .remap pages becomes obsolete
+ifneq (,$(shell grep 'int ..remap_pages..struct vm_area_struct' include/linux/mm.h))
+ccflags-y += -DKC_MM_REMAP_PAGES
+endif
--- a/kmod/src/data.c
+++ b/kmod/src/data.c
@@ -560,7 +560,7 @@ static int scoutfs_get_block(struct inode *inode, sector_t iblock,
 	u64 offset;
 	int ret;

-	WARN_ON_ONCE(create && !inode_is_locked(inode));
+	WARN_ON_ONCE(create && !rwsem_is_locked(&si->extent_sem));

 	/* make sure caller holds a cluster lock */
 	lock = scoutfs_per_task_get(&si->pt_data_lock);
@@ -1551,13 +1551,17 @@ int scoutfs_data_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 	struct super_block *sb = inode->i_sb;
 	const u64 ino = scoutfs_ino(inode);
 	struct scoutfs_lock *lock = NULL;
+	struct scoutfs_extent *info = NULL;
+	struct page *page = NULL;
 	struct scoutfs_extent ext;
 	struct scoutfs_extent cur;
 	struct data_ext_args args;
 	u32 last_flags;
 	u64 iblock;
 	u64 last;
+	int entries = 0;
 	int ret;
+	int complete = 0;

 	if (len == 0) {
 		ret = 0;
@@ -1568,16 +1572,11 @@ int scoutfs_data_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 	if (ret)
 		goto out;

-	inode_lock(inode);
-	down_read(&si->extent_sem);
-
-	ret = scoutfs_lock_inode(sb, SCOUTFS_LOCK_READ, 0, inode, &lock);
-	if (ret)
-		goto unlock;
-
-	args.ino = ino;
-	args.inode = inode;
-	args.lock = lock;
+	page = alloc_page(GFP_KERNEL);
+	if (!page) {
+		ret = -ENOMEM;
+		goto out;
+	}

 	/* use a dummy extent to track */
 	memset(&cur, 0, sizeof(cur));
@@ -1586,48 +1585,93 @@ int scoutfs_data_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 	iblock = start >> SCOUTFS_BLOCK_SM_SHIFT;
 	last = (start + len - 1) >> SCOUTFS_BLOCK_SM_SHIFT;

+	args.ino = ino;
+	args.inode = inode;
+
+	/* outer loop */
 	while (iblock <= last) {
-		ret = scoutfs_ext_next(sb, &data_ext_ops, &args,
-				       iblock, 1, &ext);
-		if (ret < 0) {
-			if (ret == -ENOENT)
+		/* lock */
+		inode_lock(inode);
+		down_read(&si->extent_sem);
+
+		ret = scoutfs_lock_inode(sb, SCOUTFS_LOCK_READ, 0, inode, &lock);
+		if (ret) {
+			up_read(&si->extent_sem);
+			inode_unlock(inode);
+			break;
+		}
+
+		args.lock = lock;
+
+		/* collect entries */
+		info = page_address(page);
+		memset(info, 0, PAGE_SIZE);
+		while (entries < (PAGE_SIZE / sizeof(struct fiemap_extent)) - 1) {
+			ret = scoutfs_ext_next(sb, &data_ext_ops, &args,
+					       iblock, 1, &ext);
+			if (ret < 0) {
+				if (ret == -ENOENT)
+					ret = 0;
+				complete = 1;
+				last_flags = FIEMAP_EXTENT_LAST;
+				break;
+			}
+
+			trace_scoutfs_data_fiemap_extent(sb, ino, &ext);
+
+			if (ext.start > last) {
+				/* not setting _LAST, it's for end of file */
 				ret = 0;
-			last_flags = FIEMAP_EXTENT_LAST;
-			break;
+				complete = 1;
+				break;
+			}
+
+			if (scoutfs_ext_can_merge(&cur, &ext)) {
+				/* merged extents could be greater than input len */
+				cur.len += ext.len;
+			} else {
+				/* fill it */
+				memcpy(info, &cur, sizeof(cur));
+
+				entries++;
+				info++;
+
+				cur = ext;
+			}
+
+			iblock = ext.start + ext.len;
 		}

-		trace_scoutfs_data_fiemap_extent(sb, ino, &ext);
+		/* unlock */
+		scoutfs_unlock(sb, lock, SCOUTFS_LOCK_READ);
+		up_read(&si->extent_sem);
+		inode_unlock(inode);

-		if (ext.start > last) {
-			/* not setting _LAST, it's for end of file */
-			ret = 0;
+		if (ret)
 			break;
-		}

-		if (scoutfs_ext_can_merge(&cur, &ext)) {
-			/* merged extents could be greater than input len */
-			cur.len += ext.len;
-		} else {
-			ret = fill_extent(fieinfo, &cur, 0);
+		/* emit entries */
+		info = page_address(page);
+		for (; entries > 0; entries--) {
+			ret = fill_extent(fieinfo, info, 0);
 			if (ret != 0)
-				goto unlock;
-			cur = ext;
+				goto out;
+			info++;
 		}

-		iblock = ext.start + ext.len;
+		if (complete)
+			break;
 	}

+	/* still one left, it's in cur */
 	if (cur.len)
 		ret = fill_extent(fieinfo, &cur, last_flags);
-unlock:
-	scoutfs_unlock(sb, lock, SCOUTFS_LOCK_READ);
-	up_read(&si->extent_sem);
-	inode_unlock(inode);

 out:
 	if (ret == 1)
 		ret = 0;
-
+	if (page)
+		__free_page(page);
 	trace_scoutfs_data_fiemap(sb, start, len, ret);

 	return ret;
@@ -1914,6 +1958,236 @@ int scoutfs_data_waiting(struct super_block *sb, u64 ino, u64 iblock,
 	return ret;
 }

+#ifdef KC_MM_VM_FAULT_T
+static vm_fault_t scoutfs_data_page_mkwrite(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+#else
+static int scoutfs_data_page_mkwrite(struct vm_area_struct *vma,
+				     struct vm_fault *vmf)
+{
+#endif
+	struct page *page = vmf->page;
+	struct file *file = vma->vm_file;
+	struct inode *inode = file_inode(file);
+	struct scoutfs_inode_info *si = SCOUTFS_I(inode);
+	struct super_block *sb = inode->i_sb;
+	struct scoutfs_lock *lock = NULL;
+	SCOUTFS_DECLARE_PER_TASK_ENTRY(pt_ent);
+	DECLARE_DATA_WAIT(dw);
+	struct write_begin_data wbd;
+	u64 ind_seq;
+	loff_t pos;
+	loff_t size;
+	unsigned int len = PAGE_SIZE;
+	vm_fault_t ret = VM_FAULT_SIGBUS;
+	int err;
+
+	pos = vmf->pgoff << PAGE_SHIFT;
+
+	sb_start_pagefault(sb);
+
+	err = scoutfs_lock_inode(sb, SCOUTFS_LOCK_WRITE,
+				 SCOUTFS_LKF_REFRESH_INODE, inode, &lock);
+	if (err) {
+		ret = vmf_error(err);
+		goto out;
+	}
+
+	size = i_size_read(inode);
+
+	if (scoutfs_per_task_add_excl(&si->pt_data_lock, &pt_ent, lock)) {
+		/* data_version is per inode, whole file must be online */
+		err = scoutfs_data_wait_check(inode, 0, size,
+					      SEF_OFFLINE,
+					      SCOUTFS_IOC_DWO_WRITE,
+					      &dw, lock);
+		if (err != 0) {
+			if (err < 0)
+				ret = vmf_error(err);
+			goto out_unlock;
+		}
+	}
+
+
+	/* scoutfs_write_begin */
+	memset(&wbd, 0, sizeof(wbd));
+	INIT_LIST_HEAD(&wbd.ind_locks);
+	wbd.lock = lock;
+
+	/*
+	 * Start transaction before taking page locks - we want to make sure we're
+	 * not locking a page, then waiting for trans, because writeback might race
+	 * against it and cause a lock inversion hang - as demonstrated by both
+	 * holetest and fsstress tests in xfstests.
+	 */
+	do {
+		err = scoutfs_inode_index_start(sb, &ind_seq) ?:
+			scoutfs_inode_index_prepare(sb, &wbd.ind_locks, inode,
+						    true) ?:
+			scoutfs_inode_index_try_lock_hold(sb, &wbd.ind_locks,
+							  ind_seq, false);
+	} while (err > 0);
+	if (err < 0) {
+		ret = vmf_error(err);
+		goto out_trans;
+	}
+
+	down_write(&si->extent_sem);
+
+	if (!trylock_page(page)) {
+		ret = VM_FAULT_NOPAGE;
+		goto out_sem;
+	}
+	ret = VM_FAULT_LOCKED;
+
+	if ((page->mapping != inode->i_mapping) ||
+	    (!PageUptodate(page)) ||
+	    (page_offset(page) > size))	 {
+		unlock_page(page);
+		ret = VM_FAULT_NOPAGE;
+		goto out_sem;
+	}
+
+	if (page->index == (size - 1) >> PAGE_SHIFT)
+		len = ((size - 1) & ~PAGE_MASK) + 1;
+
+	err = __block_write_begin(page, pos, PAGE_SIZE, scoutfs_get_block);
+	if (err) {
+		ret = vmf_error(err);
+		unlock_page(page);
+		goto out_sem;
+	}
+	/* end scoutfs_write_begin */
+
+	/*
+	 * We mark the page dirty already here so that when freeze is in
+	 * progress, we are guaranteed that writeback during freezing will
+	 * see the dirty page and writeprotect it again.
+	 */
+	set_page_dirty(page);
+	wait_for_stable_page(page);
+
+	/* scoutfs_write_end */
+	scoutfs_inode_set_data_seq(inode);
+	scoutfs_inode_inc_data_version(inode);
+
+	file_update_time(vma->vm_file);
+
+	scoutfs_update_inode_item(inode, wbd.lock, &wbd.ind_locks);
+	scoutfs_inode_queue_writeback(inode);
+
+out_sem:
+	up_write(&si->extent_sem);
+out_trans:
+	scoutfs_release_trans(sb);
+	scoutfs_inode_index_unlock(sb, &wbd.ind_locks);
+	/* end scoutfs_write_end */
+
+out_unlock:
+	scoutfs_per_task_del(&si->pt_data_lock, &pt_ent);
+	scoutfs_unlock(sb, lock, SCOUTFS_LOCK_WRITE);
+
+out:
+	sb_end_pagefault(sb);
+
+	if (scoutfs_data_wait_found(&dw)) {
+		/*
+		 * It'd be really nice to not hold the mmap_sem lock here
+		 * before waiting for data, and then return VM_FAULT_RETRY
+		 */
+		err = scoutfs_data_wait(inode, &dw);
+		if (err == 0)
+			ret = VM_FAULT_NOPAGE;
+		else
+			ret = vmf_error(err);
+	}
+
+	trace_scoutfs_data_page_mkwrite(sb, scoutfs_ino(inode), pos, (__force u32)ret);
+
+	return ret;
+}
+
+#ifdef KC_MM_VM_FAULT_T
+static vm_fault_t scoutfs_data_filemap_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+#else
+static int scoutfs_data_filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+#endif
+	struct file *file = vma->vm_file;
+	struct inode *inode = file_inode(file);
+	struct scoutfs_inode_info *si = SCOUTFS_I(inode);
+	struct super_block *sb = inode->i_sb;
+	struct scoutfs_lock *inode_lock = NULL;
+	SCOUTFS_DECLARE_PER_TASK_ENTRY(pt_ent);
+	DECLARE_DATA_WAIT(dw);
+	loff_t pos;
+	int err;
+	vm_fault_t ret = VM_FAULT_SIGBUS;
+
+	pos = vmf->pgoff;
+	pos <<= PAGE_SHIFT;
+
+retry:
+	err = scoutfs_lock_inode(sb, SCOUTFS_LOCK_READ,
+				 SCOUTFS_LKF_REFRESH_INODE, inode, &inode_lock);
+	if (err < 0)
+		return vmf_error(err);
+
+	if (scoutfs_per_task_add_excl(&si->pt_data_lock, &pt_ent, inode_lock)) {
+		/* protect checked extents from stage/release */
+		atomic_inc(&inode->i_dio_count);
+
+		err = scoutfs_data_wait_check(inode, pos, PAGE_SIZE,
+					      SEF_OFFLINE, SCOUTFS_IOC_DWO_READ,
+					      &dw, inode_lock);
+		if (err != 0) {
+			if (err < 0)
+				ret = vmf_error(err);
+			goto out;
+		}
+	}
+
+#ifdef KC_MM_VM_FAULT_T
+	ret = filemap_fault(vmf);
+#else
+	ret = filemap_fault(vma, vmf);
+#endif
+
+out:
+	if (scoutfs_per_task_del(&si->pt_data_lock, &pt_ent))
+		kc_inode_dio_end(inode);
+	scoutfs_unlock(sb, inode_lock, SCOUTFS_LOCK_READ);
+	if (scoutfs_data_wait_found(&dw)) {
+		err = scoutfs_data_wait(inode, &dw);
+		if (err == 0)
+			goto retry;
+
+		ret = VM_FAULT_RETRY;
+	}
+
+	trace_scoutfs_data_filemap_fault(sb, scoutfs_ino(inode), pos, (__force u32)ret);
+
+	return ret;
+}
+
+static const struct vm_operations_struct scoutfs_data_file_vm_ops = {
+	.fault		= scoutfs_data_filemap_fault,
+	.page_mkwrite	= scoutfs_data_page_mkwrite,
+#ifdef KC_MM_REMAP_PAGES
+	.remap_pages	= generic_file_remap_pages,
+#endif
+};
+
+static int scoutfs_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	file_accessed(file);
+	vma->vm_ops = &scoutfs_data_file_vm_ops;
+	return 0;
+}
+
 const struct address_space_operations scoutfs_file_aops = {
 #ifdef KC_MPAGE_READ_FOLIO
 	.dirty_folio		= block_dirty_folio,
@@ -1945,6 +2219,7 @@ const struct file_operations scoutfs_file_fops = {
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 #endif
+	.mmap		= scoutfs_file_mmap,
 	.unlocked_ioctl	= scoutfs_ioctl,
 	.fsync		= scoutfs_file_fsync,
 	.llseek		= scoutfs_file_llseek,
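The reworked scoutfs_data_fiemap() above, like the readdir and ioctl changes in this diff, gathers entries into a page while the locks are held and only hands them to the consumer after the locks are dropped. A minimal userspace sketch of that shape, with hypothetical names (`toy_lock`, `walk_batched`, `sum_emit`) and a small `BATCH` constant standing in for the number of entries that fit in a page:

```c
#include <assert.h>
#include <stddef.h>

#define BATCH 4	/* stand-in for "entries that fit in one page" */

/* Toy lock that only records whether it is held. */
struct toy_lock { int held; };

void toy_lock_acquire(struct toy_lock *l) { assert(!l->held); l->held = 1; }
void toy_lock_release(struct toy_lock *l) { assert(l->held); l->held = 0; }

/* Consumer callback: returns nonzero to stop the walk early. */
typedef int (*emit_fn)(int value, void *arg);

/* Example consumer: adds each value to a running sum. */
int sum_emit(int value, void *arg)
{
	*(long *)arg += value;
	return 0;
}

/* Walk src[0..n) in batches; returns how many values were emitted. */
size_t walk_batched(struct toy_lock *lock, const int *src, size_t n,
		    emit_fn emit, void *arg)
{
	int buf[BATCH];
	size_t pos = 0;
	size_t emitted = 0;
	int complete = 0;

	while (!complete) {
		size_t entries = 0;

		/* collect under the lock, never calling the consumer */
		toy_lock_acquire(lock);
		while (entries < BATCH) {
			if (pos == n) {
				complete = 1;
				break;
			}
			buf[entries++] = src[pos++];
		}
		toy_lock_release(lock);

		/* emit with the lock dropped; the consumer may block */
		for (size_t i = 0; i < entries; i++) {
			assert(!lock->held);
			if (emit(buf[i], arg))
				return emitted;
			emitted++;
		}
	}
	return emitted;
}
```

The point of the split is that the consumer, which in the kernel may fault on user memory, never runs while the lock is held. The cost is that the lock is retaken once per batch.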
--- a/kmod/src/dir.c
+++ b/kmod/src/dir.c
@@ -11,11 +11,13 @@
 * General Public License for more details.
 */
 #include <linux/kernel.h>
+#include <linux/stddef.h>
 #include <linux/fs.h>
 #include <linux/slab.h>
 #include <linux/uio.h>
 #include <linux/xattr.h>
 #include <linux/namei.h>
+#include <linux/mm.h>

 #include "format.h"
 #include "file.h"
@@ -434,6 +436,15 @@ out:
 		return d_splice_alias(inode, dentry);
 }

+/*
+ * Helper to make iterating through dirent ptrs aligned
+ */
+static inline struct scoutfs_dirent *next_aligned_dirent(struct scoutfs_dirent *dent, u8 len)
+{
+	return (void *)dent +
+		ALIGN(offsetof(struct scoutfs_dirent, name[len]), __alignof__(struct scoutfs_dirent));
+}
+
 /*
 * readdir simply iterates over the dirent items for the dir inode and
 * uses their offset as the readdir position.
@@ -441,76 +452,112 @@ out:
 * It will need to be careful not to read past the region of the dirent
 * hash offset keys that it has access to.
 */
-static int KC_DECLARE_READDIR(scoutfs_readdir, struct file *file,
-			      void *dirent, kc_readdir_ctx_t ctx)
+static int scoutfs_readdir(struct file *file, struct dir_context *ctx)
 {
 	struct inode *inode = file_inode(file);
 	struct super_block *sb = inode->i_sb;
 	struct scoutfs_lock *dir_lock = NULL;
 	struct scoutfs_dirent *dent = NULL;
+/* we'll store name_len in dent->__pad[0] */
+#define hacky_name_len __pad[0]
 	struct scoutfs_key last_key;
 	struct scoutfs_key key;
+	struct page *page = NULL;
 	int name_len;
 	u64 pos;
+	int entries = 0;
 	int ret;
+	int complete = 0;
+	struct scoutfs_dirent *end;

-	if (!kc_dir_emit_dots(file, dirent, ctx))
+	if (!dir_emit_dots(file, ctx))
 		return 0;

-	dent = alloc_dirent(SCOUTFS_NAME_LEN);
-	if (!dent) {
+	page = alloc_page(GFP_KERNEL);
+	if (!page)
 		return -ENOMEM;
-	}
+
+	end = page_address(page) + PAGE_SIZE;

 	init_dirent_key(&last_key, SCOUTFS_READDIR_TYPE, scoutfs_ino(inode),
 			SCOUTFS_DIRENT_LAST_POS, 0);

-	ret = scoutfs_lock_inode(sb, SCOUTFS_LOCK_READ, 0, inode, &dir_lock);
-	if (ret)
-		goto out;
-
+	/*
+	 * lock and fetch dirent items, until the page no longer fits
+	 * a max size dirent (288b). Then unlock and dir_emit the ones
+	 * we stored in the page.
+	 */
 	for (;;) {
-		init_dirent_key(&key, SCOUTFS_READDIR_TYPE, scoutfs_ino(inode),
-				kc_readdir_pos(file, ctx), 0);
+		/* lock */
+		ret = scoutfs_lock_inode(sb, SCOUTFS_LOCK_READ, 0, inode, &dir_lock);
+		if (ret)
+			break;

-		ret = scoutfs_item_next(sb, &key, &last_key, dent,
-					dirent_bytes(SCOUTFS_NAME_LEN),
-					dir_lock);
-		if (ret < 0) {
-			if (ret == -ENOENT)
+		dent = page_address(page);
+		pos = ctx->pos;
+		while (next_aligned_dirent(dent, SCOUTFS_NAME_LEN) < end) {
+			init_dirent_key(&key, SCOUTFS_READDIR_TYPE, scoutfs_ino(inode),
+					pos, 0);
+
+			ret = scoutfs_item_next(sb, &key, &last_key, dent,
+						dirent_bytes(SCOUTFS_NAME_LEN),
+						dir_lock);
+			if (ret < 0) {
+				if (ret == -ENOENT) {
+					ret = 0;
+					complete = 1;
+				}
+				break;
+			}
+
+			name_len = ret - sizeof(struct scoutfs_dirent);
+			dent->hacky_name_len = name_len;
+			if (name_len < 1 || name_len > SCOUTFS_NAME_LEN) {
+				scoutfs_corruption(sb, SC_DIRENT_READDIR_NAME_LEN,
+						   corrupt_dirent_readdir_name_len,
+						   "dir_ino %llu pos %llu key "SK_FMT" len %d",
+						   scoutfs_ino(inode),
+						   pos,
+						   SK_ARG(&key), name_len);
+				ret = -EIO;
+				break;
+			}
+
+			pos = le64_to_cpu(dent->pos) + 1;
+
+			dent = next_aligned_dirent(dent, name_len);
+			entries++;
+		}
+
+		/* unlock */
+		scoutfs_unlock(sb, dir_lock, SCOUTFS_LOCK_READ);
+
+		if (ret < 0)
+			break;
+
+		dent = page_address(page);
+		for (; entries > 0; entries--) {
+			ctx->pos = le64_to_cpu(dent->pos);
+			if (!dir_emit(ctx, dent->name, dent->hacky_name_len,
+					le64_to_cpu(dent->ino),
+					dentry_type(dent->type))) {
 				ret = 0;
+				goto out;
+			}
+
+			dent = next_aligned_dirent(dent, dent->hacky_name_len);
+
+			/* always advance ctx->pos past */
+			ctx->pos++;
+		}
+
+		if (complete)
 			break;
-		}
-
-		name_len = ret - sizeof(struct scoutfs_dirent);
-		if (name_len < 1 || name_len > SCOUTFS_NAME_LEN) {
-			scoutfs_corruption(sb, SC_DIRENT_READDIR_NAME_LEN,
-					   corrupt_dirent_readdir_name_len,
-					   "dir_ino %llu pos %llu key "SK_FMT" len %d",
-					   scoutfs_ino(inode),
-					   kc_readdir_pos(file, ctx),
-					   SK_ARG(&key), name_len);
-			ret = -EIO;
-			goto out;
-		}
-
-		pos = le64_to_cpu(key.skd_major);
-		kc_readdir_pos(file, ctx) = pos;
-
-		if (!kc_dir_emit(ctx, dirent, dent->name, name_len, pos,
-				le64_to_cpu(dent->ino),
-				dentry_type(dent->type))) {
-			ret = 0;
-			break;
-		}
-
-		kc_readdir_pos(file, ctx) = pos + 1;
 	}

 out:
-	scoutfs_unlock(sb, dir_lock, SCOUTFS_LOCK_READ);
-
-	kfree(dent);
+	if (page)
+		__free_page(page);
 	return ret;
 }

@@ -1765,7 +1812,7 @@ retry:
 	}
 	old_inode->i_ctime = now;
 	if (new_inode)
-		old_inode->i_ctime = now;
+		new_inode->i_ctime = now;

 	inode_inc_iversion(old_dir);
 	inode_inc_iversion(old_inode);
@@ -1973,7 +2020,7 @@ const struct inode_operations scoutfs_symlink_iops = {
 };

 const struct file_operations scoutfs_dir_fops = {
-	.KC_FOP_READDIR	= scoutfs_readdir,
+	.iterate	= scoutfs_readdir,
 #ifdef KC_FMODE_KABI_ITERATE
 	.open		= scoutfs_dir_open,
 #endif
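The next_aligned_dirent() helper in this file packs variable-length dirents into one page at aligned offsets. A standalone sketch of the same idea follows; `struct toy_dirent`, `toy_dirent_span()`, `pack_names()` and `total_name_len()` are hypothetical and not the scoutfs structures. It uses the portable `offsetof(..., name) + len` form where the kernel code indexes `name[len]` inside offsetof(), and it checks the actual record size where the kernel loop reserves room for a maximum-length name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ALIGN_UP(x, a) (((x) + ((a) - 1)) & ~((size_t)(a) - 1))

/* Toy variable-length record: fixed header followed by the name bytes. */
struct toy_dirent {
	uint64_t ino;
	uint8_t name_len;
	char name[];
};

/* Bytes from the start of one record to the start of the next. */
size_t toy_dirent_span(uint8_t name_len)
{
	return ALIGN_UP(offsetof(struct toy_dirent, name) + name_len,
			_Alignof(struct toy_dirent));
}

/* Pack names into buf (suitably aligned); returns records stored. */
size_t pack_names(void *buf, size_t buf_len, const char **names, size_t n)
{
	unsigned char *p = buf;
	unsigned char *end = p + buf_len;
	size_t packed = 0;

	for (size_t i = 0; i < n; i++) {
		size_t len = strlen(names[i]);
		struct toy_dirent *d = (struct toy_dirent *)p;

		if (len > UINT8_MAX ||
		    (size_t)(end - p) < toy_dirent_span((uint8_t)len))
			break;

		d->ino = 100 + i;	/* made-up inode numbers */
		d->name_len = (uint8_t)len;
		memcpy(d->name, names[i], len);

		p += toy_dirent_span(d->name_len);
		packed++;
	}
	return packed;
}

/* Walk count packed records and add up their name lengths. */
size_t total_name_len(const void *buf, size_t count)
{
	const unsigned char *p = buf;
	size_t total = 0;

	while (count--) {
		const struct toy_dirent *d = (const struct toy_dirent *)p;

		total += d->name_len;
		p += toy_dirent_span(d->name_len);
	}
	return total;
}
```

Because every span is a multiple of the struct's alignment, each record's `ino` field stays naturally aligned as long as the buffer itself starts aligned, as a page does.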
--- a/kmod/src/ioctl.c
+++ b/kmod/src/ioctl.c
@@ -58,25 +58,23 @@
 * key space after we find no items in a given lock region.  This is
 * relatively cheap because reading is going to check the segments
 * anyway.
- *
- * This is copying to userspace while holding a read lock.  This is safe
- * because faulting can send a request for a write lock while the read
- * lock is being used.  The cluster locks don't block tasks in a node,
- * they match and the tasks fall back to local locking.  In this case
- * the spin locks around the item cache.
 */
 static long scoutfs_ioc_walk_inodes(struct file *file, unsigned long arg)
 {
 	struct super_block *sb = file_inode(file)->i_sb;
 	struct scoutfs_ioctl_walk_inodes __user *uwalk = (void __user *)arg;
 	struct scoutfs_ioctl_walk_inodes walk;
-	struct scoutfs_ioctl_walk_inodes_entry ent;
+	struct scoutfs_ioctl_walk_inodes_entry *ent = NULL;
+	struct scoutfs_ioctl_walk_inodes_entry *end;
 	struct scoutfs_key next_key;
 	struct scoutfs_key last_key;
 	struct scoutfs_key key;
 	struct scoutfs_lock *lock;
+	struct page *page = NULL;
 	u64 last_seq;
+	u64 entries = 0;
 	int ret = 0;
+	int complete = 0;
 	u32 nr = 0;
 	u8 type;

@@ -107,6 +105,10 @@ static long scoutfs_ioc_walk_inodes(struct file *file, unsigned long arg)
 		}
 	}

+	page = alloc_page(GFP_KERNEL);
+	if (!page)
+		return -ENOMEM;
+
 	scoutfs_inode_init_index_key(&key, type, walk.first.major,
 				     walk.first.minor, walk.first.ino);
 	scoutfs_inode_init_index_key(&last_key, type, walk.last.major,
@@ -115,77 +117,107 @@ static long scoutfs_ioc_walk_inodes(struct file *file, unsigned long arg)
 	/* cap nr to the max the ioctl can return to a compat task */
 	walk.nr_entries = min_t(u64, walk.nr_entries, INT_MAX);

-	ret = scoutfs_lock_inode_index(sb, SCOUTFS_LOCK_READ, type,
-				       walk.first.major, walk.first.ino,
-				       &lock);
-	if (ret < 0)
-		goto out;
+	end = page_address(page) + PAGE_SIZE;

-	for (nr = 0; nr < walk.nr_entries; ) {
+	/* outer loop */
+	for (nr = 0;;) {
+		ent = page_address(page);
+		/* make sure _pad and minor are zeroed */
+		memset(ent, 0, PAGE_SIZE);

-		ret = scoutfs_item_next(sb, &key, &last_key, NULL, 0, lock);
-		if (ret < 0 && ret != -ENOENT)
+		ret = scoutfs_lock_inode_index(sb, SCOUTFS_LOCK_READ, type,
+					       le64_to_cpu(key.skii_major),
+					       le64_to_cpu(key.skii_ino),
+					       &lock);
+		if (ret)
 			break;

-		if (ret == -ENOENT) {
-
-			/* done if lock covers last iteration key */
-			if (scoutfs_key_compare(&last_key, &lock->end) <= 0) {
-				ret = 0;
+		/* inner loop 1 */
+		while (ent + 1 < end) {
+			ret = scoutfs_item_next(sb, &key, &last_key, NULL, 0, lock);
+			if (ret < 0 && ret != -ENOENT)
 				break;
+
+			if (ret == -ENOENT) {
+				/* done if lock covers last iteration key */
+				if (scoutfs_key_compare(&last_key, &lock->end) <= 0) {
+					ret = 0;
+					complete = 1;
+					break;
+				}
+
+				/* continue iterating after locked empty region */
+				key = lock->end;
+				scoutfs_key_inc(&key);
+
+				scoutfs_unlock(sb, lock, SCOUTFS_LOCK_READ);
+				/* avoid double-unlocking here after break */
+				lock = NULL;
+
+				ret = scoutfs_forest_next_hint(sb, &key, &next_key);
+				if (ret < 0 && ret != -ENOENT)
+					break;
+
+				if (ret == -ENOENT ||
+				    scoutfs_key_compare(&next_key, &last_key) > 0) {
+					ret = 0;
+					complete = 1;
+					break;
+				}
+
+				key = next_key;
+
+				ret = scoutfs_lock_inode_index(sb, SCOUTFS_LOCK_READ,
+							type,
+							le64_to_cpu(key.skii_major),
+							le64_to_cpu(key.skii_ino),
+							&lock);
+				if (ret)
+					break;
+
+				continue;
 			}

-			/* continue iterating after locked empty region */
-			key = lock->end;
+			ent->major = le64_to_cpu(key.skii_major);
+			ent->ino = le64_to_cpu(key.skii_ino);
+
 			scoutfs_key_inc(&key);

-			scoutfs_unlock(sb, lock, SCOUTFS_LOCK_READ);
+			ent++;
+			entries++;

-			ret = scoutfs_forest_next_hint(sb, &key, &next_key);
-			if (ret < 0 && ret != -ENOENT)
-				goto out;
+			if (nr + entries >= walk.nr_entries) {
+				complete = 1;
+				break;
+			}
+		}

-			if (ret == -ENOENT ||
-			    scoutfs_key_compare(&next_key, &last_key) > 0) {
-				ret = 0;
+		scoutfs_unlock(sb, lock, SCOUTFS_LOCK_READ);
+		if (ret < 0)
+			break;
+
+		/* inner loop 2 */
+		ent = page_address(page);
+		for (; entries > 0; entries--) {
+			if (copy_to_user((void __user *)walk.entries_ptr, ent,
+					 sizeof(struct scoutfs_ioctl_walk_inodes_entry))) {
+				ret = -EFAULT;
 				goto out;
 			}
-
-			key = next_key;
-
-			ret = scoutfs_lock_inode_index(sb, SCOUTFS_LOCK_READ,
-						key.sk_type,
-						le64_to_cpu(key.skii_major),
-						le64_to_cpu(key.skii_ino),
-						&lock);
-			if (ret < 0)
-				goto out;
-
-			continue;
+			walk.entries_ptr += sizeof(struct scoutfs_ioctl_walk_inodes_entry);
+			ent++;
+			nr++;
 		}

-		ent.major = le64_to_cpu(key.skii_major);
-		ent.minor = 0;
-		ent.ino = le64_to_cpu(key.skii_ino);
-
-		if (copy_to_user((void __user *)walk.entries_ptr, &ent,
-				 sizeof(ent))) {
-			ret = -EFAULT;
+		if (complete)
 			break;
-		}
-
-		nr++;
-		walk.entries_ptr += sizeof(ent);
-
-		scoutfs_key_inc(&key);
 	}

-	scoutfs_unlock(sb, lock, SCOUTFS_LOCK_READ);
-
 out:
+	if (page)
+		__free_page(page);
 	if (nr > 0)
 		ret = nr;
-
 	return ret;
 }

@@ -1163,11 +1195,15 @@ static long scoutfs_ioc_get_allocated_inos(struct file *file, unsigned long arg)
 	struct scoutfs_lock *lock = NULL;
 	struct scoutfs_key key;
 	struct scoutfs_key end;
+	struct page *page = NULL;
 	u64 __user *uinos;
 	u64 bytes;
-	u64 ino;
+	u64 *ino;
+	u64 *ino_end;
+	int entries = 0;
 	int nr;
 	int ret;
+	int complete = 0;

 	if (!(file->f_mode & FMODE_READ)) {
 		ret = -EBADF;
@@ -1189,47 +1225,83 @@ static long scoutfs_ioc_get_allocated_inos(struct file *file, unsigned long arg)
 		goto out;
 	}

+	page = alloc_page(GFP_KERNEL);
+	if (!page) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	ino_end = page_address(page) + PAGE_SIZE;
+
 	scoutfs_inode_init_key(&key, gai.start_ino);
 	scoutfs_inode_init_key(&end, gai.start_ino | SCOUTFS_LOCK_INODE_GROUP_MASK);
 	uinos = (void __user *)gai.inos_ptr;
 	bytes = gai.inos_bytes;
 	nr = 0;

-	ret = scoutfs_lock_ino(sb, SCOUTFS_LOCK_READ, 0, gai.start_ino, &lock);
-	if (ret < 0)
-		goto out;
+	for (;;) {

-	while (bytes >= sizeof(*uinos)) {
+		ret = scoutfs_lock_ino(sb, SCOUTFS_LOCK_READ, 0, gai.start_ino, &lock);
+		if (ret < 0)
+			goto out;

-		ret = scoutfs_item_next(sb, &key, &end, NULL, 0, lock);
-		if (ret < 0) {
-			if (ret == -ENOENT)
+		ino = page_address(page);
+		while (ino < ino_end) {
+
+			ret = scoutfs_item_next(sb, &key, &end, NULL, 0, lock);
+			if (ret < 0) {
+				if (ret == -ENOENT) {
+					ret = 0;
+					complete = 1;
+				}
+				break;
+			}
+
+			if (key.sk_zone != SCOUTFS_FS_ZONE) {
 				ret = 0;
-			break;
+				complete = 1;
+				break;
+			}
+
+			/* all fs items are owned by allocated inodes, and _first is always ino */
+			*ino = le64_to_cpu(key._sk_first);
+			scoutfs_inode_init_key(&key, *ino + 1);
+
+			ino++;
+			entries++;
+			nr++;
+
+			bytes -= sizeof(*uinos);
+			if (bytes < sizeof(*uinos)) {
+				complete = 1;
+				break;
+			}
+
+			if (nr == INT_MAX) {
+				complete = 1;
+				break;
+			}
 		}

-		if (key.sk_zone != SCOUTFS_FS_ZONE) {
-			ret = 0;
-			break;
-		}
+		scoutfs_unlock(sb, lock, SCOUTFS_LOCK_READ);

-		/* all fs items are owned by allocated inodes, and _first is always ino */
-		ino = le64_to_cpu(key._sk_first);
-		if (put_user(ino, uinos)) {
+		if (ret < 0)
+			break;
+
+		ino = page_address(page);
+		if (copy_to_user(uinos, ino, entries * sizeof(*uinos))) {
 			ret = -EFAULT;
-			break;
+			goto out;
 		}

-		uinos++;
-		bytes -= sizeof(*uinos);
-		if (++nr == INT_MAX)
+		uinos += entries;
+		entries = 0;
+
+		if (complete)
 			break;
-
-		scoutfs_inode_init_key(&key, ino + 1);
 	}
-
-	scoutfs_unlock(sb, lock, SCOUTFS_LOCK_READ);
 out:
+	if (page)
+		__free_page(page);
 	return ret ?: nr;
 }

--- a/kmod/src/kernelcompat.h
+++ b/kmod/src/kernelcompat.h
@@ -29,50 +29,6 @@ do {						\
 })
 #endif

-#ifndef KC_ITERATE_DIR_CONTEXT
-typedef filldir_t kc_readdir_ctx_t;
-#define KC_DECLARE_READDIR(name, file, dirent, ctx) name(file, dirent, ctx)
-#define KC_FOP_READDIR readdir
-#define kc_readdir_pos(filp, ctx) (filp)->f_pos
-#define kc_dir_emit_dots(file, dirent, ctx) dir_emit_dots(file, dirent, ctx)
-#define kc_dir_emit(ctx, dirent, name, name_len, pos, ino, dt) \
-	(ctx(dirent, name, name_len, pos, ino, dt) == 0)
-#else
-typedef struct dir_context * kc_readdir_ctx_t;
-#define KC_DECLARE_READDIR(name, file, dirent, ctx) name(file, ctx)
-#define KC_FOP_READDIR iterate
-#define kc_readdir_pos(filp, ctx) (ctx)->pos
-#define kc_dir_emit_dots(file, dirent, ctx) dir_emit_dots(file, ctx)
-#define kc_dir_emit(ctx, dirent, name, name_len, pos, ino, dt) \
-	dir_emit(ctx, name, name_len, ino, dt)
-#endif
-
-#ifndef KC_DIR_EMIT_DOTS
-/*
- * Kernels before ->iterate and don't have dir_emit_dots so we give them
- * one that works with the ->readdir() filldir() method.
- */
-static inline int dir_emit_dots(struct file *file, void *dirent,
-				filldir_t filldir)
-{
-	if (file->f_pos == 0) {
-		if (filldir(dirent, ".", 1, 1,
-			    file->f_path.dentry->d_inode->i_ino, DT_DIR))
-			return 0;
-		file->f_pos = 1;
-	}
-
-	if (file->f_pos == 1) {
-		if (filldir(dirent, "..", 2, 1,
-			    parent_ino(file->f_path.dentry), DT_DIR))
-			return 0;
-		file->f_pos = 2;
-	}
-
-	return 1;
-}
-#endif
-
 #ifdef KC_POSIX_ACL_VALID_USER_NS
 #define kc_posix_acl_valid(user_ns, acl) posix_acl_valid(user_ns, acl)
 #else
@@ -438,4 +394,20 @@ static inline int kc_tcp_sock_set_nodelay(struct socket *sock)
 }
 #endif

+#ifdef KC_INODE_DIO_END
+#define kc_inode_dio_end inode_dio_end
+#else
+#define kc_inode_dio_end inode_dio_done
+#endif
+
+#ifndef KC_MM_VM_FAULT_T
+typedef unsigned int vm_fault_t;
+static inline vm_fault_t vmf_error(int err)
+{
+	if (err == -ENOMEM)
+		return VM_FAULT_OOM;
+	return VM_FAULT_SIGBUS;
+}
+#endif
+
 #endif
--- a/kmod/src/lock.c
+++ b/kmod/src/lock.c
@@ -302,6 +302,7 @@ static void lock_inc_count(unsigned int *counts, enum scoutfs_lock_mode mode)
 static void lock_dec_count(unsigned int *counts, enum scoutfs_lock_mode mode)
 {
 	BUG_ON(mode < 0 || mode >= SCOUTFS_LOCK_NR_MODES);
+	BUG_ON(counts[mode] == 0);
 	counts[mode]--;
 }

--- a/kmod/src/scoutfs_trace.h
+++ b/kmod/src/scoutfs_trace.h
@@ -286,6 +286,52 @@ TRACE_EVENT(scoutfs_data_alloc_block_enter,
 		  STE_ENTRY_ARGS(ext))
 );

+TRACE_EVENT(scoutfs_data_page_mkwrite,
+	TP_PROTO(struct super_block *sb, __u64 ino, __u64 pos, __u32 ret),
+
+	TP_ARGS(sb, ino, pos, ret),
+
+	TP_STRUCT__entry(
+		SCSB_TRACE_FIELDS
+		__field(__u64, ino)
+		__field(__u64, pos)
+		__field(__u32, ret)
+	),
+
+	TP_fast_assign(
+		SCSB_TRACE_ASSIGN(sb);
+		__entry->ino = ino;
+		__entry->pos = pos;
+		__entry->ret = ret;
+	),
+
+	TP_printk(SCSBF" ino %llu pos %llu ret %u ",
+		  SCSB_TRACE_ARGS, __entry->ino, __entry->pos, __entry->ret)
+);
+
+TRACE_EVENT(scoutfs_data_filemap_fault,
+	TP_PROTO(struct super_block *sb, __u64 ino, __u64 pos, __u32 ret),
+
+	TP_ARGS(sb, ino, pos, ret),
+
+	TP_STRUCT__entry(
+		SCSB_TRACE_FIELDS
+		__field(__u64, ino)
+		__field(__u64, pos)
+		__field(__u32, ret)
+	),
+
+	TP_fast_assign(
+		SCSB_TRACE_ASSIGN(sb);
+		__entry->ino = ino;
+		__entry->pos = pos;
+		__entry->ret = ret;
+	),
+
+	TP_printk(SCSBF" ino %llu pos %llu ret %u ",
+		  SCSB_TRACE_ARGS, __entry->ino, __entry->pos, __entry->ret)
+);
+
 DECLARE_EVENT_CLASS(scoutfs_data_file_extent_class,
 	TP_PROTO(struct super_block *sb, __u64 ino, struct scoutfs_extent *ext),

--- a/kmod/src/server.c
+++ b/kmod/src/server.c
@@ -1299,12 +1299,10 @@ static int finalize_and_start_log_merge(struct super_block *sb, struct scoutfs_l
 * is nested inside holding commits so we recheck the persistent item
 * each time we commit to make sure it's still what we think.   The
 * caller is still going to send the item to the client so we update the
- * caller's each time we make progress.  This is a best-effort attempt
- * to clean up and it's valid to leave extents in data_freed we don't
- * return errors to the caller.  The client will continue the work later
- * in get_log_trees or as the rid is reclaimed.
+ * caller's each time we make progress.  If we hit an error applying the
+ * changes we make then we can't send the log_trees to the client.
 */
-static void try_drain_data_freed(struct super_block *sb, struct scoutfs_log_trees *lt)
+static int try_drain_data_freed(struct super_block *sb, struct scoutfs_log_trees *lt)
 {
 	DECLARE_SERVER_INFO(sb, server);
 	struct scoutfs_super_block *super = DIRTY_SUPER_SB(sb);
@@ -1313,6 +1311,7 @@ static void try_drain_data_freed(struct super_block *sb, struct scoutfs_log_tree
 	struct scoutfs_log_trees drain;
 	struct scoutfs_key key;
 	COMMIT_HOLD(hold);
+	bool apply = false;
 	int ret = 0;
 	int err;

@@ -1321,22 +1320,27 @@ static void try_drain_data_freed(struct super_block *sb, struct scoutfs_log_tree
 	while (lt->data_freed.total_len != 0) {
 		server_hold_commit(sb, &hold);
 		mutex_lock(&server->logs_mutex);
+		apply = true;

 		ret = find_log_trees_item(sb, &super->logs_root, false, rid, U64_MAX, &drain);
-		if (ret < 0)
+		if (ret < 0) {
+			ret = 0;
 			break;
+		}

 		/* careful to only keep draining the caller's specific open trans */
 		if (drain.nr != lt->nr || drain.get_trans_seq != lt->get_trans_seq ||
 		    drain.commit_trans_seq != lt->commit_trans_seq || drain.flags != lt->flags) {
-			ret = -ENOENT;
+			ret = 0;
 			break;
 		}

 		ret = scoutfs_btree_dirty(sb, &server->alloc, &server->wri,
 					  &super->logs_root, &key);
-		if (ret < 0)
+		if (ret < 0) {
+			ret = 0;
 			break;
+		}

 		/* moving can modify and return errors, always update caller and item */
 		mutex_lock(&server->alloc_mutex);
@@ -1352,19 +1356,19 @@ static void try_drain_data_freed(struct super_block *sb, struct scoutfs_log_tree
 		BUG_ON(err < 0); /* dirtying must guarantee success */

 		mutex_unlock(&server->logs_mutex);
-
 		ret = server_apply_commit(sb, &hold, ret);
-		if (ret < 0) {
-			ret = 0; /* don't try to abort, ignoring ret */
+		apply = false;
+
+		if (ret < 0)
 			break;
-		}
 	}

-	/* try to cleanly abort and write any partial dirty btree blocks, but ignore result */
-	if (ret < 0) {
+	if (apply) {
 		mutex_unlock(&server->logs_mutex);
-		server_apply_commit(sb, &hold, 0);
+		server_apply_commit(sb, &hold, ret);
 	}
+
+	return ret;
 }

 /*
@@ -1572,9 +1576,9 @@ out:
 		scoutfs_err(sb, "error %d getting log trees for rid %016llx: %s",
 			    ret, rid, err_str);

-	/* try to drain excessive data_freed with additional commits, if needed, ignoring err */
+	/* try to drain excessive data_freed with additional commits, if needed */
 	if (ret == 0)
-		try_drain_data_freed(sb, &lt);
+		ret = try_drain_data_freed(sb, &lt);

 	return scoutfs_net_response(sb, conn, cmd, id, ret, &lt, sizeof(lt));
 }
@@ -4149,7 +4153,7 @@ static void fence_pending_recov_worker(struct work_struct *work)
 	struct server_info *server = container_of(work, struct server_info,
 						  fence_pending_recov_work);
 	struct super_block *sb = server->sb;
-	union scoutfs_inet_addr addr;
+	union scoutfs_inet_addr addr = {{0,}};
 	u64 rid = 0;
 	int ret = 0;

--- a/kmod/src/trans.c
+++ b/kmod/src/trans.c
@@ -159,6 +159,58 @@ static bool drained_holders(struct trans_info *tri)
 	return holders == 0;
 }

+static int commit_current_log_trees(struct super_block *sb, char **str)
+{
+	DECLARE_TRANS_INFO(sb, tri);
+
+	return (*str = "data submit", scoutfs_inode_walk_writeback(sb, true)) ?:
+	       (*str = "item dirty", scoutfs_item_write_dirty(sb))  ?:
+	       (*str = "data prepare", scoutfs_data_prepare_commit(sb))  ?:
+	       (*str = "alloc prepare", scoutfs_alloc_prepare_commit(sb, &tri->alloc, &tri->wri)) ?:
+	       (*str = "meta write", scoutfs_block_writer_write(sb, &tri->wri))  ?:
+	       (*str = "data wait", scoutfs_inode_walk_writeback(sb, false)) ?:
+	       (*str = "commit log trees", commit_btrees(sb)) ?:
+	       scoutfs_item_write_done(sb);
+}
+
+static int get_next_log_trees(struct super_block *sb, char **str)
+{
+	return (*str = "get log trees", scoutfs_trans_get_log_trees(sb));
+}
+
+static int retry_forever(struct super_block *sb, int (*func)(struct super_block *sb, char **str))
+{
+	bool retrying = false;
+	char *str;
+	int ret;
+
+	do {
+		str = NULL;
+
+		ret = func(sb, &str);
+		if (ret < 0) {
+			if (!retrying) {
+				scoutfs_warn(sb, "critical transaction commit failure: %s = %d, retrying",
+					    str, ret);
+				retrying = true;
+			}
+
+			if (scoutfs_forcing_unmount(sb)) {
+				ret = -EIO;
+				break;
+			}
+
+			msleep(2 * MSEC_PER_SEC);
+
+		} else if (retrying) {
+			scoutfs_info(sb, "retried transaction commit succeeded");
+		}
+
+	} while (ret < 0);
+
+	return ret;
+}
+
 /*
 * This work func is responsible for writing out all the dirty blocks
 * that make up the current dirty transaction.  It prevents writers from
@@ -184,8 +236,6 @@ void scoutfs_trans_write_func(struct work_struct *work)
 	struct trans_info *tri = container_of(work, struct trans_info, write_work.work);
 	struct super_block *sb = tri->sb;
 	struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
-	bool retrying = false;
-	char *s = NULL;
 	int ret = 0;

 	tri->task = current;
@@ -214,37 +264,9 @@ void scoutfs_trans_write_func(struct work_struct *work)

 	scoutfs_inc_counter(sb, trans_commit_written);

-	do {
-		ret = (s = "data submit", scoutfs_inode_walk_writeback(sb, true)) ?:
-		      (s = "item dirty", scoutfs_item_write_dirty(sb))  ?:
-		      (s = "data prepare", scoutfs_data_prepare_commit(sb))  ?:
-		      (s = "alloc prepare", scoutfs_alloc_prepare_commit(sb, &tri->alloc,
-									 &tri->wri))  ?:
-		      (s = "meta write", scoutfs_block_writer_write(sb, &tri->wri))  ?:
-		      (s = "data wait", scoutfs_inode_walk_writeback(sb, false)) ?:
-		      (s = "commit log trees", commit_btrees(sb)) ?:
-		      scoutfs_item_write_done(sb) ?:
-		      (s = "get log trees", scoutfs_trans_get_log_trees(sb));
-		if (ret < 0) {
-			if (!retrying) {
-				scoutfs_warn(sb, "critical transaction commit failure: %s = %d, retrying",
-					    s, ret);
-				retrying = true;
-			}
-
-			if (scoutfs_forcing_unmount(sb)) {
-				ret = -EIO;
-				break;
-			}
-
-			msleep(2 * MSEC_PER_SEC);
-
-		} else if (retrying) {
-			scoutfs_info(sb, "retried transaction commit succeeded");
-		}
-
-	} while (ret < 0);
-
+	/* retry {commit,get}_log_trees until they succeed, can only fail when forcing unmount */
+	ret = retry_forever(sb, commit_current_log_trees) ?:
+	      retry_forever(sb, get_next_log_trees);
 out:
 	spin_lock(&tri->write_lock);
 	tri->write_count++;
--- a/tests/.gitignore
+++ b/tests/.gitignore
@@ -10,3 +10,5 @@ src/stage_tmpfile
 src/create_xattr_loop
 src/o_tmpfile_umask
 src/o_tmpfile_linkat
+src/mmap_stress
+src/mmap_validate
--- a/tests/Makefile
+++ b/tests/Makefile
@@ -13,7 +13,10 @@ BIN := src/createmany			\
 	src/create_xattr_loop		\
 	src/fragmented_data_extents	\
 	src/o_tmpfile_umask		\
-	src/o_tmpfile_linkat
+	src/o_tmpfile_linkat		\
+	src/mmap_stress			\
+	src/mmap_validate		\
+	src/walk_inodes_for_estale

 DEPS := $(wildcard src/*.d)

@@ -23,8 +26,10 @@ ifneq ($(DEPS),)
 -include $(DEPS)
 endif

+src/mmap_stress: LIBS+=-lpthread
+
 $(BIN): %: %.c Makefile
-	gcc $(CFLAGS) -MD -MP -MF $*.d $< -o $@
+	gcc $(CFLAGS) -MD -MP -MF $*.d $< -o $@ $(LIBS)

 .PHONY: clean
 clean:
--- a/tests/funcs/exec.sh
+++ b/tests/funcs/exec.sh
@@ -80,3 +80,15 @@ t_compare_output()
 {
 	"$@" >&7 2>&1
 }
+
+#
+# usually bash prints an annoying output message when jobs
+# are killed.  We can avoid that by redirecting stderr for
+# the bash process when it reaps the jobs that are killed.
+#
+t_silent_kill() {
+	exec {ERR}>&2 2>/dev/null
+	kill "$@"
+	wait "$@"
+	exec 2>&$ERR {ERR}>&-
+}
--- a/tests/funcs/filter.sh
+++ b/tests/funcs/filter.sh
@@ -160,6 +160,9 @@ t_filter_dmesg()
 	re="$re|Pipe handler or fully qualified core dump path required.*"
 	re="$re|Set kernel.core_pattern before fs.suid_dumpable.*"

+	# perf warning that it adjusted sample rate
+	re="$re|perf: interrupt took too long.*lowering kernel.perf_event_max_sample_rate.*"
+
 	egrep -v "($re)" | \
 		ignore_harmless_unwind_kasan_stack_oob
 }
--- a/tests/funcs/tap.sh
+++ b/tests/funcs/tap.sh
@@ -0,0 +1,88 @@
+
+#
+# Generate TAP format test results
+#
+
+t_tap_header()
+{
+	local runid=$1
+	local sequence=( $(echo $tests) )
+	local count=${#sequence[@]}
+
+	# avoid recreating the same TAP result over again - harness sets this
+	[[ -z "$runid" ]] && runid="*test*"
+
+	cat > $T_RESULTS/scoutfs.tap <<TAPEOF
+TAP version 14
+1..${count}
+#
+# TAP results for run ${runid}
+#
+# host/run info:
+#
+#   hostname: ${HOSTNAME}
+#   test start time: $(date --utc)
+#   uname -r: $(uname -r)
+#   scoutfs commit id: $(git describe --tags)
+#
+# sequence for this run:
+#
+TAPEOF
+
+	# Sequence
+	for t in ${tests}; do
+		 echo ${t/.sh/}
+	done | cat -n | expand | column -c 120 | expand | sed 's/^ /#/' >> $T_RESULTS/scoutfs.tap
+	echo "#" >> $T_RESULTS/scoutfs.tap
+}
+
+t_tap_progress()
+{
+(
+	local i=$(( testcount + 1 ))
+	local testname=$1
+	local result=$2
+
+	local diff=""
+	local dmsg=""
+
+	if [[ -s "$T_RESULTS/tmp/${testname}/dmesg.new" ]]; then
+		dmsg="1"
+	fi
+
+	if ! cmp -s golden/${testname} $T_RESULTS/output/${testname}; then
+		diff="1"
+	fi
+
+	if [[ "${result}" == "100" ]] && [[ -z "${dmsg}" ]] && [[ -z "${diff}" ]]; then
+		echo "ok ${i} - ${testname}"
+	elif [[ "${result}" == "103" ]]; then
+		echo "ok ${i} - ${testname}"
+		echo "# ${testname} ** skipped - permitted **"
+	else
+		echo "not ok ${i} - ${testname}"
+		case ${result} in
+		101)
+			echo "# ${testname} ** skipped **"
+			;;
+		102)
+			echo "# ${testname} ** failed **"
+			;;
+		esac
+
+		if [[ -n "${diff}" ]]; then
+			echo "#"
+			echo "# diff:"
+			echo "#"
+			diff -u golden/${testname} $T_RESULTS/output/${testname} | expand | sed 's/^/#   /'
+		fi
+
+		if [[ -n "${dmsg}" ]]; then
+			echo "#"
+			echo "# dmesg:"
+			echo "#"
+			cat "$T_RESULTS/tmp/${testname}/dmesg.new" | sed 's/^/#   /'
+		fi
+	fi
+) >> $T_RESULTS/scoutfs.tap
+}
--- a/tests/golden/mmap
+++ b/tests/golden/mmap
@@ -0,0 +1,27 @@
+== mmap_stress
+thread 0 complete
+thread 1 complete
+thread 2 complete
+thread 3 complete
+thread 4 complete
+== basic mmap/read/write consistency checks
+== mmap read from offline extent
+0: offset: 0 length: 2 flags: O.L
+extents: 1
+1
+00000200:  ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea  ................
+0
+0: offset: 0 length: 2 flags: ..L
+extents: 1
+== mmap write to an offline extent
+0: offset: 0 length: 2 flags: O.L
+extents: 1
+1
+0
+0: offset: 0 length: 2 flags: ..L
+extents: 1
+00000000  ea ea ea ea ea ea ea ea  ea ea ea ea ea ea ea ea  |................|
+00000010  11 11 11 11 11 11 11 11  11 11 11 11 11 11 11 11  |................|
+00000020  ea ea ea ea ea ea ea ea  ea ea ea ea ea ea ea ea  |................|
+00000030
+== done
--- a/tests/golden/simple-readdir
+++ b/tests/golden/simple-readdir
@@ -0,0 +1,97 @@
+== create content
+== readdir all
+00000000: d_off: 0x00000001 d_reclen: 0x18 d_type: DT_DIR d_name: .
+00000001: d_off: 0x00000002 d_reclen: 0x18 d_type: DT_DIR d_name: ..
+00000002: d_off: 0x00000003 d_reclen: 0x18 d_type: DT_REG d_name: a
+00000003: d_off: 0x00000004 d_reclen: 0x20 d_type: DT_REG d_name: aaaaaaaa
+00000004: d_off: 0x00000005 d_reclen: 0x28 d_type: DT_REG d_name: aaaaaaaaaaaaaaa
+00000005: d_off: 0x00000006 d_reclen: 0x30 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaa
+00000006: d_off: 0x00000007 d_reclen: 0x38 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000007: d_off: 0x00000008 d_reclen: 0x38 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000008: d_off: 0x00000009 d_reclen: 0x40 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000009: d_off: 0x0000000a d_reclen: 0x48 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+0000000a: d_off: 0x0000000b d_reclen: 0x50 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+0000000b: d_off: 0x0000000c d_reclen: 0x58 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+0000000c: d_off: 0x0000000d d_reclen: 0x60 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+0000000d: d_off: 0x0000000e d_reclen: 0x68 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+0000000e: d_off: 0x0000000f d_reclen: 0x70 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+0000000f: d_off: 0x00000010 d_reclen: 0x70 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000010: d_off: 0x00000011 d_reclen: 0x78 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000011: d_off: 0x00000012 d_reclen: 0x80 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000012: d_off: 0x00000013 d_reclen: 0x88 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000013: d_off: 0x00000014 d_reclen: 0x90 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000014: d_off: 0x00000015 d_reclen: 0x98 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000015: d_off: 0x00000016 d_reclen: 0xa0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000016: d_off: 0x00000017 d_reclen: 0xa8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000017: d_off: 0x00000018 d_reclen: 0xa8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000018: d_off: 0x00000019 d_reclen: 0xb0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000019: d_off: 0x0000001a d_reclen: 0xb8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+0000001a: d_off: 0x0000001b d_reclen: 0xc0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+0000001b: d_off: 0x0000001c d_reclen: 0xc8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+0000001c: d_off: 0x0000001d d_reclen: 0xd0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+0000001d: d_off: 0x0000001e d_reclen: 0xd8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+0000001e: d_off: 0x0000001f d_reclen: 0xe0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+0000001f: d_off: 0x00000020 d_reclen: 0xe0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000020: d_off: 0x00000021 d_reclen: 0xe8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000021: d_off: 0x00000022 d_reclen: 0xf0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000022: d_off: 0x00000023 d_reclen: 0xf8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000023: d_off: 0x00000024 d_reclen: 0x100 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000024: d_off: 0x00000025 d_reclen: 0x108 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000025: d_off: 0x00000026 d_reclen: 0x110 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+== readdir offset
+00000014: d_off: 0x00000015 d_reclen: 0x98 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000015: d_off: 0x00000016 d_reclen: 0xa0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000016: d_off: 0x00000017 d_reclen: 0xa8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000017: d_off: 0x00000018 d_reclen: 0xa8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000018: d_off: 0x00000019 d_reclen: 0xb0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000019: d_off: 0x0000001a d_reclen: 0xb8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+0000001a: d_off: 0x0000001b d_reclen: 0xc0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+0000001b: d_off: 0x0000001c d_reclen: 0xc8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+0000001c: d_off: 0x0000001d d_reclen: 0xd0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+0000001d: d_off: 0x0000001e d_reclen: 0xd8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+0000001e: d_off: 0x0000001f d_reclen: 0xe0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+0000001f: d_off: 0x00000020 d_reclen: 0xe0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000020: d_off: 0x00000021 d_reclen: 0xe8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000021: d_off: 0x00000022 d_reclen: 0xf0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000022: d_off: 0x00000023 d_reclen: 0xf8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000023: d_off: 0x00000024 d_reclen: 0x100 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000024: d_off: 0x00000025 d_reclen: 0x108 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000025: d_off: 0x00000026 d_reclen: 0x110 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+== readdir len (bytes)
+00000000: d_off: 0x00000001 d_reclen: 0x18 d_type: DT_DIR d_name: .
+00000001: d_off: 0x00000002 d_reclen: 0x18 d_type: DT_DIR d_name: ..
+00000002: d_off: 0x00000003 d_reclen: 0x18 d_type: DT_REG d_name: a
+00000003: d_off: 0x00000004 d_reclen: 0x20 d_type: DT_REG d_name: aaaaaaaa
+00000004: d_off: 0x00000005 d_reclen: 0x28 d_type: DT_REG d_name: aaaaaaaaaaaaaaa
+00000005: d_off: 0x00000006 d_reclen: 0x30 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaa
+00000006: d_off: 0x00000007 d_reclen: 0x38 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+== introduce gap
+00000000: d_off: 0x00000001 d_reclen: 0x18 d_type: DT_DIR d_name: .
+00000001: d_off: 0x00000002 d_reclen: 0x18 d_type: DT_DIR d_name: ..
+00000002: d_off: 0x00000003 d_reclen: 0x18 d_type: DT_REG d_name: a
+00000003: d_off: 0x00000004 d_reclen: 0x20 d_type: DT_REG d_name: aaaaaaaa
+00000004: d_off: 0x00000005 d_reclen: 0x28 d_type: DT_REG d_name: aaaaaaaaaaaaaaa
+00000005: d_off: 0x00000006 d_reclen: 0x30 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaa
+00000006: d_off: 0x00000007 d_reclen: 0x38 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000007: d_off: 0x00000008 d_reclen: 0x38 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000008: d_off: 0x00000009 d_reclen: 0x40 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000009: d_off: 0x00000014 d_reclen: 0x48 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000014: d_off: 0x00000015 d_reclen: 0x98 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000015: d_off: 0x00000016 d_reclen: 0xa0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000016: d_off: 0x00000017 d_reclen: 0xa8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000017: d_off: 0x00000018 d_reclen: 0xa8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000018: d_off: 0x00000019 d_reclen: 0xb0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000019: d_off: 0x0000001a d_reclen: 0xb8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+0000001a: d_off: 0x0000001b d_reclen: 0xc0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+0000001b: d_off: 0x0000001c d_reclen: 0xc8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+0000001c: d_off: 0x0000001d d_reclen: 0xd0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+0000001d: d_off: 0x0000001e d_reclen: 0xd8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+0000001e: d_off: 0x0000001f d_reclen: 0xe0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+0000001f: d_off: 0x00000020 d_reclen: 0xe0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000020: d_off: 0x00000021 d_reclen: 0xe8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000021: d_off: 0x00000022 d_reclen: 0xf0 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000022: d_off: 0x00000023 d_reclen: 0xf8 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000023: d_off: 0x00000024 d_reclen: 0x100 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000024: d_off: 0x00000025 d_reclen: 0x108 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+00000025: d_off: 0x00000026 d_reclen: 0x110 d_type: DT_REG d_name: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+== cleanup
--- a/tests/golden/xfstests
+++ b/tests/golden/xfstests
@@ -22,6 +22,8 @@ generic/024
 generic/025
 generic/026
 generic/028
+generic/029
+generic/030
 generic/031
 generic/032
 generic/033
@@ -53,6 +55,7 @@ generic/073
 generic/076
 generic/078
 generic/079
+generic/080
 generic/081
 generic/082
 generic/084
@@ -81,10 +84,12 @@ generic/116
 generic/117
 generic/118
 generic/119
+generic/120
 generic/121
 generic/122
 generic/123
 generic/124
+generic/126
 generic/128
 generic/129
 generic/130
@@ -95,6 +100,7 @@ generic/136
 generic/138
 generic/139
 generic/140
+generic/141
 generic/142
 generic/143
 generic/144
@@ -153,6 +159,7 @@ generic/210
 generic/211
 generic/212
 generic/214
+generic/215
 generic/216
 generic/217
 generic/218
@@ -173,6 +180,9 @@ generic/238
 generic/240
 generic/244
 generic/245
+generic/246
+generic/247
+generic/248
 generic/249
 generic/250
 generic/252
@@ -231,6 +241,7 @@ generic/317
 generic/319
 generic/322
 generic/324
+generic/325
 generic/326
 generic/327
 generic/328
@@ -244,6 +255,7 @@ generic/337
 generic/341
 generic/342
 generic/343
+generic/346
 generic/348
 generic/353
 generic/355
@@ -305,7 +317,9 @@ generic/424
 generic/425
 generic/426
 generic/427
+generic/428
 generic/436
+generic/437
 generic/439
 generic/440
 generic/443
@@ -315,6 +329,7 @@ generic/448
 generic/449
 generic/450
 generic/451
+generic/452
 generic/453
 generic/454
 generic/456
@@ -438,6 +453,7 @@ generic/610
 generic/611
 generic/612
 generic/613
+generic/614
 generic/618
 generic/621
 generic/623
@@ -451,6 +467,7 @@ generic/632
 generic/634
 generic/635
 generic/637
+generic/638
 generic/639
 generic/640
 generic/644
@@ -862,4 +879,4 @@ generic/688
 generic/689
 shared/002
 shared/032
-Passed all 495 tests
+Passed all 512 tests
--- a/tests/run-tests.sh
+++ b/tests/run-tests.sh
@@ -512,6 +512,11 @@ msg "running tests"
 > "$T_RESULTS/skip.log"
 > "$T_RESULTS/fail.log"

+# generate a test ID to make sure we can de-duplicate TAP results in aggregation
+. funcs/tap.sh
+t_tap_header $(uuidgen)
+
+testcount=0
 passed=0
 skipped=0
 failed=0
@@ -527,12 +532,15 @@ for t in $tests; do
 	cmd rm -rf "$T_TMPDIR"
 	cmd mkdir -p "$T_TMPDIR"

-	# create a test name dir in the fs
+	# create a test name dir in the fs, clean up old data as needed
 	T_DS=""
 	for i in $(seq 0 $((T_NR_MOUNTS - 1))); do
 		dir="${T_M[$i]}/test/$test_name"

-		test $i == 0 && cmd mkdir -p "$dir"
+		test $i == 0 && (
+			test -d "$dir" && cmd rm -rf "$dir"
+			cmd mkdir -p "$dir"
+		)

 		eval T_D$i=$dir
 		T_D[$i]=$dir
@@ -637,6 +645,11 @@ for t in $tests; do

 		test -n "$T_ABORT" && die "aborting after first failure"
 	fi
+
+	# record results for TAP format output
+	t_tap_progress $test_name $sts
+	((testcount++))
+
 done

 msg "all tests run: $passed passed, $skipped skipped, $skipped_permitted skipped (permitted), $failed failed"
--- a/tests/sequence
+++ b/tests/sequence
@@ -6,6 +6,7 @@ inode-items-updated.sh
 simple-inode-index.sh
 simple-staging.sh
 simple-release-extents.sh
+simple-readdir.sh
 get-referring-entries.sh
 fallocate.sh
 basic-truncate.sh
@@ -17,6 +18,7 @@ projects.sh
 large-fragmented-free.sh
 format-version-forward-back.sh
 enospc.sh
+mmap.sh
 srch-safe-merge-pos.sh
 srch-basic-functionality.sh
 simple-xattr-unit.sh
--- a/tests/src/mmap_stress.c
+++ b/tests/src/mmap_stress.c
@@ -0,0 +1,181 @@
+#define _GNU_SOURCE
+/*
+ * mmap() stress test for scoutfs
+ *
+ * This test exercises the scoutfs kernel module's locking by
+ * repeatedly reading/writing using mmap and pread/write calls
+ * across 5 clients (mounts).
+ *
+ * Each thread operates on a single thread/client, and performs
+ * operations in a random order on the file.
+ *
+ * The goal is to assure that locking between _page_mkwrite vfs
+ * calls and the normal read/write paths do not cause deadlocks.
+ *
+ * No content validation is performed. The test only checks that
+ * the program continues without errors.
+ */
+
+#include <sys/types.h>
+#include <stdio.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stdbool.h>
+#include <sys/mman.h>
+#include <pthread.h>
+#include <errno.h>
+
+static int size = 0;
+static int count = 0; /* XXX make this duration instead */
+
+struct thread_info {
+	int nr;
+	int fd;
+};
+
+static void *run_test_func(void *ptr)
+{
+	void *buf = NULL;
+	char *addr = NULL;
+	struct thread_info *tinfo = ptr;
+	int c = 0;
+	int fd;
+	ssize_t read, written, ret;
+	int preads = 0, pwrites = 0, mreads = 0, mwrites = 0;
+
+	fd = tinfo->fd;
+
+	if (posix_memalign(&buf, 4096, size) != 0) {
+		perror("posix_memalign");
+		exit(-1);
+	}
+
+	addr = mmap(NULL, size, PROT_WRITE | PROT_READ, MAP_SHARED, fd, 0);
+	if (addr == MAP_FAILED) {
+		perror("mmap");
+		exit(-1);
+	}
+
+	usleep(100000); /* 0.1sec to allow all threads to start roughly at the same time */
+
+	for (;;) {
+		if (++c > count)
+			break;
+
+		switch (rand() % 4) {
+		case 0: /* pread */
+			preads++;
+			for (read = 0; read < size;) {
+				ret = pread(fd, buf, size - read, read);
+				if (ret < 0) {
+					perror("pread");
+					exit(-1);
+				}
+				read += ret;
+			}
+			break;
+		case 1: /* pwrite */
+			pwrites++;
+			memset(buf, (char)(c & 0xff), size);
+			for (written = 0; written < size;) {
+				ret = pwrite(fd, buf, size - written, written);
+				if (ret < 0) {
+					perror("pwrite");
+					exit(-1);
+				}
+				written += ret;
+			}
+			break;
+		case 2: /* mmap read */
+			mreads++;
+			memcpy(buf, addr, size); /* noerr */
+			break;
+		case 3: /* mmap write */
+			mwrites++;
+			memset(buf, (char)(c & 0xff), size);
+			memcpy(addr, buf, size); /* noerr */
+			break;
+		}
+	}
+
+	munmap(addr, size);
+
+	free(buf);
+
+	printf("thread %u complete: preads %u pwrites %u mreads %u mwrites %u\n", tinfo->nr,
+		preads, pwrites, mreads, mwrites);
+
+	return NULL;
+}
+
+int main(int argc, char **argv)
+{
+	pthread_t thread[5];
+	struct thread_info tinfo[5];
+	int fd[5];
+	int ret;
+	int i;
+
+	if (argc != 8) {
+		fprintf(stderr, "%s requires 7 arguments - size count file1 file2 file3 file4 file5\n", argv[0]);
+		exit(-1);
+	}
+
+	size = atoi(argv[1]);
+	if (size <= 0) {
+		fprintf(stderr, "invalid size, must be greater than 0\n");
+		exit(-1);
+	}
+
+	count = atoi(argv[2]);
+	if (count < 0) {
+		fprintf(stderr, "invalid count, must not be negative\n");
+		exit(-1);
+	}
+
+	/* create and truncate one fd */
+	fd[0] = open(argv[3], O_RDWR | O_CREAT | O_TRUNC, 00644);
+	if (fd[0] < 0) {
+		perror("open");
+		exit(-1);
+	}
+
+	/* make it the test size */
+	if (posix_fallocate(fd[0], 0, size) != 0) {
+		perror("fallocate");
+		exit(-1);
+	}
+
+	/* now open the rest of the fds */
+	for (i = 1; i < 5; i++) {
+		fd[i] = open(argv[3+i], O_RDWR);
+		if (fd[i] < 0) {
+			perror("open");
+			exit(-1);
+		}
+	}
+
+	/* start threads */
+	for (i = 0; i < 5; i++) {
+		tinfo[i].fd = fd[i];
+		tinfo[i].nr = i;
+		ret = pthread_create(&thread[i], NULL, run_test_func, (void*)&tinfo[i]);
+
+		if (ret) {
+			perror("pthread_create");
+			exit(-1);
+		}
+	}
+
+	/* wait for complete */
+	for (i = 0; i < 5; i++)
+		pthread_join(thread[i], NULL);
+
+	for (i = 0; i < 5; i++)
+		close(fd[i]);
+
+	exit(0);
+}
--- a/tests/src/mmap_validate.c
+++ b/tests/src/mmap_validate.c
@@ -0,0 +1,159 @@
+#define _GNU_SOURCE
+/*
+ * mmap() content consistency checking for scoutfs
+ *
+ * This test program validates that content from memory mappings
+ * is consistent across clients, whether written/read with mmap or
+ * normal writes/reads.
+ *
+ * One side (read or write) will always be memory mapped. Both
+ * sides are memory mapped 33% of the time.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <fcntl.h>
+#include <errno.h>
+
+static int count = 0;
+static int size = 0;
+
+static void run_test_func(int fd1, int fd2)
+{
+	void *buf1 = NULL;
+	void *buf2 = NULL;
+	char *addr1 = NULL;
+	char *addr2 = NULL;
+	int c = 0;
+	ssize_t read, written, ret;
+
+	/* buffers for both sides to compare */
+	if (posix_memalign(&buf1, 4096, size) != 0) {
+		perror("posix_memalign1");
+		exit(-1);
+	}
+
+	if (posix_memalign(&buf2, 4096, size) != 0) {
+		perror("posix_memalign2");
+		exit(-1);
+	}
+
+	/* memory maps for both sides */
+	addr1 = mmap(NULL, size, PROT_WRITE | PROT_READ, MAP_SHARED, fd1, 0);
+	if (addr1 == MAP_FAILED) {
+		perror("mmap1");
+		exit(-1);
+	}
+
+	addr2 = mmap(NULL, size, PROT_WRITE | PROT_READ, MAP_SHARED, fd2, 0);
+	if (addr2 == MAP_FAILED) {
+		perror("mmap2");
+		exit(-1);
+	}
+
+	for (;;) {
+		if (++c > count) /* iteration count comes from argv[2] */
+			break;
+
+		/* put a pattern in buf1 */
+		memset(buf1, c & 0xff, size);
+
+		/* pwrite or mmap write from buf1 */
+		switch (c % 3) {
+		case 0:	/* pwrite */
+			for (written = 0; written < size;) {
+				ret = pwrite(fd1, buf1, size - written, written);
+				if (ret < 0) {
+					perror("pwrite");
+					exit(-1);
+				}
+				written += ret;
+			}
+			break;
+		default: /* mmap write */
+			memcpy(addr1, buf1, size);
+			break;
+		}
+
+		/* pread or mmap read to buf2 */
+		switch (c % 3) {
+		case 2: /* pread */
+			for (read = 0; read < size;) {
+				ret = pread(fd2, buf2, size - read, read);
+				if (ret < 0) {
+					perror("pread");
+					exit(-1);
+				}
+				read += ret;
+			}
+			break;
+		default: /* mmap read */
+			memcpy(buf2, addr2, size);
+			break;
+		}
+
+		/* compare bufs */
+		if (memcmp(buf1, buf2, size) != 0) {
+			fprintf(stderr, "memcmp() failed\n");
+			exit(-1);
+		}
+	}
+
+	munmap(addr1, size);
+	munmap(addr2, size);
+
+	free(buf1);
+	free(buf2);
+}
+
+int main(int argc, char **argv)
+{
+	int fd[2];
+
+	if (argc != 5) {
+		fprintf(stderr, "%s requires 4 arguments - size count file1 file2\n", argv[0]);
+		exit(-1);
+	}
+
+	size = atoi(argv[1]);
+	if (size <= 0) {
+		fprintf(stderr, "invalid size, must be greater than 0\n");
+		exit(-1);
+	}
+
+	count = atoi(argv[2]);
+	if (count < 3) {
+		fprintf(stderr, "invalid count, must be at least 3\n");
+		exit(-1);
+	}
+
+	/* create and truncate one fd */
+	fd[0] = open(argv[3], O_RDWR | O_CREAT | O_TRUNC, 00644);
+	if (fd[0] < 0) {
+		perror("open");
+		exit(-1);
+	}
+
+	fd[1] = open(argv[4], O_RDWR , 00644);
+	if (fd[1] < 0) {
+		perror("open");
+		exit(-1);
+	}
+
+	/* make it the test size */
+	if (posix_fallocate(fd[0], 0, size) != 0) {
+		perror("fallocate");
+		exit(-1);
+	}
+
+	/* run the test function */
+	run_test_func(fd[0], fd[1]);
+
+	close(fd[0]);
+	close(fd[1]);
+
+	exit(0);
+}
--- a/tests/src/walk_inodes_for_estale.c
+++ b/tests/src/walk_inodes_for_estale.c
@@ -0,0 +1,464 @@
+
+/*
+ * Copyright (C) 2025 Versity Software, Inc.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+
+#ifndef _GNU_SOURCE
+#define _GNU_SOURCE
+#endif
+#include <errno.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <limits.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stddef.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <time.h>
+#include <unistd.h>
+#include <linux/types.h>
+#include <sys/ioctl.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+
+#include "ioctl.h"
+
+#define array_size(arr) (sizeof(arr) / sizeof(arr[0]))
+
+#define FILEID_SCOUTFS			0x81
+#define FILEID_SCOUTFS_WITH_PARENT	0x82
+
+static uint64_t meta_seq = 0;
+static volatile sig_atomic_t sig_received = false;
+static volatile sig_atomic_t tracing_on = false;
+static bool exit_on_current = false;
+static bool exiting = false;
+static uint64_t count = 0;
+
+struct our_handle {
+	struct file_handle handle;
+	/*
+	 * scoutfs file handle can be ino or ino/parent. The
+	 * handle_type field of struct file_handle denotes which
+	 * version is in use. We only use the ino variant here.
+	 */
+	__le64 scoutfs_ino;
+};
+
+static void exit_usage(void)
+{
+	printf(
+		" -e            exit once stable meta_seq has been reached\n"
+		" -m <string>   scoutfs mount path string for seq walk\n"
+		" -s <number>   start from meta_seq number, instead of 0\n"
+		);
+	exit(1);
+}
+
+static int write_at(int tracefd, char *path, char *val)
+{
+	int fd = -1;
+	int ret;
+
+	fd = openat(tracefd, path, O_TRUNC | O_RDWR);
+	if (fd < 0)
+		return errno;
+	/* capture errno before close() can change it, and report it */
+	ret = write(fd, val, strlen(val)) < 0 ? errno : 0;
+
+	close(fd);
+
+	return ret;
+}
+
+static int do_trace(int fd, uint64_t ino)
+{
+	struct our_handle handle;
+	int tracefd = -1;
+	int targetfd = -1;
+	int outfd = -1;
+	int infd = -1;
+	char *pidstr;
+	char *name;
+	char *buf;
+	ssize_t bytes;
+	ssize_t written;
+	ssize_t off = 0;
+	unsigned long e = 0;
+	int ret;
+
+	if (asprintf(&pidstr, "%u", getpid()) < 0)
+		return ENOMEM;
+
+	if (asprintf(&name, "trace.scoutfs.open_by_handle_at.ino-%lu", ino) < 0)
+		return ENOMEM;
+
+	buf = malloc(4096);
+	if (!buf)
+		return ENOMEM;
+
+	handle.handle.handle_bytes = sizeof(struct our_handle);
+	handle.handle.handle_type = FILEID_SCOUTFS;
+	handle.scoutfs_ino = htole64(ino);
+
+	/* keep a quick dirfd around for easy writing sysfs files */
+	tracefd = open("/sys/kernel/debug/tracing", 0);
+	if (tracefd < 0)
+		return errno;
+
+	/* start tracing */
+	ret = write_at(tracefd, "current_tracer", "nop") ?:
+	      write_at(tracefd, "current_tracer", "function_graph") ?:
+	      write_at(tracefd, "set_ftrace_pid", pidstr) ?:
+	      write_at(tracefd, "tracing_on", "1");
+
+	tracing_on = true;
+
+	if (ret)
+		goto out;
+
+	targetfd = open_by_handle_at(fd, &handle.handle, O_RDWR);
+	e = errno;
+
+out:
+	/* turn off tracing first */
+	ret = write_at(tracefd, "tracing_on", "0");
+	if (ret)
+		return ret;
+
+	tracing_on = false;
+
+	if (targetfd != -1) {
+		close(targetfd);
+		return 0;
+	}
+
+	if (e == ESTALE) {
+		/* capture trace */
+		outfd = open(name, O_CREAT | O_TRUNC | O_RDWR, 0644);
+		if (outfd < 0) {
+			fprintf(stderr, "Error opening trace\n");
+			return errno;
+		}
+		infd = openat(tracefd, "trace", O_RDONLY);
+		if (infd < 0) {
+			fprintf(stderr, "Error opening trace output\n");
+			return errno;
+		}
+		for (;;) {
+			bytes = pread(infd, buf, 4096, off);
+			if (bytes < 0)
+				return errno;
+			if (bytes == 0)
+				break;
+			written = pwrite(outfd, buf, bytes, off);
+			if (written < 0)
+				return errno;
+			if (written != bytes)
+				return EIO;
+			off += bytes;
+		}
+		close(outfd);
+		close(infd);
+
+		fprintf(stderr, "Wrote \"%s\"\n", name);
+	}
+
+	/* cleanup */
+	ret = write_at(tracefd, "current_tracer", "nop");
+
+	free(pidstr);
+	free(name);
+	free(buf);
+	close(tracefd);
+	/* report whether resetting the tracer succeeded */
+	return ret;
+}
+
+/*
+ * lookup path for ino using ino_path
+ */
+struct ino_args {
+	char *path;
+	__u64 ino;
+};
+
+static int do_resolve(int fd, uint64_t ino, char **path)
+{
+	struct scoutfs_ioctl_ino_path ioctl_args = {0};
+	struct scoutfs_ioctl_ino_path_result *res;
+	unsigned int result_bytes;
+	int ret;
+
+	result_bytes = offsetof(struct scoutfs_ioctl_ino_path_result,
+				path[PATH_MAX]);
+
+	res = malloc(result_bytes);
+	if (!res)
+		return ENOMEM;
+
+	ioctl_args.ino = ino;
+	ioctl_args.dir_ino = 0;
+	ioctl_args.dir_pos = 0;
+	ioctl_args.result_ptr = (intptr_t)res;
+	ioctl_args.result_bytes = result_bytes;
+
+	ret = ioctl(fd, SCOUTFS_IOC_INO_PATH, &ioctl_args);
+	if (ret < 0) {
+		if (errno == ENOENT) {
+			*path = NULL;
+			return 0;
+		}
+		return errno;
+	}
+
+	ret = asprintf(path, "%.*s", res->path_bytes, res->path);
+	if (ret <= 0)
+		return ENOMEM;
+
+	free(res);
+
+	return 0;
+}
+
+static int do_test_ino(int fd, uint64_t ino)
+{
+	struct our_handle handle = {{0}};
+	struct stat sb = {0};
+	char *path = NULL;
+	int targetfd = -1;
+	int ret;
+
+	/* filter: open_by_handle_at() must fail */
+	handle.handle.handle_bytes = sizeof(struct our_handle);
+	handle.handle.handle_type = FILEID_SCOUTFS;
+	handle.scoutfs_ino = htole64(ino);
+
+	targetfd = open_by_handle_at(fd, &handle.handle, O_RDWR);
+	if (targetfd != -1) {
+		close(targetfd);
+		return 0;
+	}
+
+	/* filter: errno must be ESTALE */
+	if (errno != ESTALE)
+		return 0;
+
+	/* filter: path resolution succeeds to an actual file entry */
+	ret = do_resolve(fd, ino, &path);
+	if (ret)
+		return ret;
+	if (path == NULL)
+		return 0;
+
+	/* filter: stat() must succeed on resolved path */
+	ret = fstatat(fd, path, &sb, AT_SYMLINK_NOFOLLOW);
+	free(path);
+	if (ret != 0) {
+		if (errno == ENOENT)
+			/* doesn't exist */
+			return 0;
+		return errno;
+	}
+
+	return do_trace(fd, ino);
+}
+
+static uint64_t do_get_meta_seq_stable(int fd)
+{
+	struct scoutfs_ioctl_stat_more stm;
+
+	if (ioctl(fd, SCOUTFS_IOC_STAT_MORE, &stm) < 0)
+		return 0; /* the caller treats 0 as "no stable meta_seq" */
+
+	return stm.meta_seq;
+}
+
+static int do_walk_seq(int fd)
+{
+	struct scoutfs_ioctl_walk_inodes_entry ents[128];
+	struct scoutfs_ioctl_walk_inodes walk = {{0}};
+	struct timespec ts;
+	time_t seconds;
+	int ret;
+	uint64_t total = 0;
+	uint64_t stable;
+	int i;
+	int j;
+
+	walk.index = SCOUTFS_IOC_WALK_INODES_META_SEQ;
+
+	/* make sure not to advance to stable meta_seq, we can just trail behind */
+	stable = do_get_meta_seq_stable(fd);
+	if (stable == 0)
+		return 0;
+	if (meta_seq >= stable - 1) {
+		if (exit_on_current)
+			exiting = true;
+		return 0;
+	}
+
+	meta_seq = meta_seq ? meta_seq + 1 : 0;
+
+	walk.first.major = meta_seq;
+	walk.first.minor = 0;
+	walk.first.ino = 0;
+
+	walk.last.major = stable - 1;
+	walk.last.minor = ~0;
+	walk.last.ino = ~0ULL;
+
+	walk.entries_ptr = (unsigned long)ents;
+	walk.nr_entries = array_size(ents);
+
+	clock_gettime(CLOCK_REALTIME, &ts);
+	seconds = ts.tv_sec;
+
+	for (j = 0;; j++) {
+		if (sig_received)
+			return 0;
+
+		ret = ioctl(fd, SCOUTFS_IOC_WALK_INODES, &walk);
+		if (ret < 0)
+			return ret;
+
+		if (ret == 0)
+			break;
+
+		for (i = 0; i < ret; i++) {
+			int err;
+
+			meta_seq = ents[i].major;
+			if (ents[i].ino == 1)
+				continue;
+
+			/* poke at it, leaving the batch size in ret intact */
+			err = do_test_ino(fd, ents[i].ino);
+			count++;
+			if (err)
+				return err;
+		}
+
+		total += i;
+
+		walk.first = ents[i - 1];
+		if (++walk.first.ino == 0 && ++walk.first.minor == 0)
+			walk.first.major++;
+
+		/* yield once in a while */
+		if (j % 32 == 0) {
+			clock_gettime(CLOCK_REALTIME, &ts);
+			if (ts.tv_sec > seconds + 1)
+				break;
+		}
+	}
+
+	return 0;
+}
+
+void handle_signal(int sig)
+{
+	int tracefd = -1;
+
+	sig_received = true;
+
+	if (!tracing_on)
+		return;
+
+	tracefd = open("/sys/kernel/debug/tracing", 0);
+	write_at(tracefd, "tracing_on", "0");
+	close(tracefd);
+}
+
+int main(int argc, char **argv)
+{
+	char *mnt = NULL;
+	int c;
+	int mntfd;
+	int ret;
+
+	meta_seq = 0;
+
+	/* All we need is the mount point arg */
+	while ((c = getopt(argc, argv, "+em:s:")) != -1) {
+		switch (c) {
+			case 'e':
+				exit_on_current = true;
+				break;
+			case 'm':
+				mnt = strdup(optarg);
+				break;
+			case 's':
+				meta_seq = strtoull(optarg, NULL, 0);
+				break;
+			case '?':
+				printf("unknown argument: %c\n", optopt);
+			case 'h':
+				exit_usage();
+		}
+	}
+
+	if (!mnt) {
+		fprintf(stderr, "Must provide a mount point with -m\n");
+		exit(EXIT_FAILURE);
+	}
+
+	if (meta_seq > 0)
+		fprintf(stdout, "Starting from meta_seq = %lu\n", meta_seq);
+
+	/* lower prio */
+	ret = nice(10);
+	if (ret == -1)
+		fprintf(stderr, "Error setting nice value\n");
+	ret = syscall(SYS_ioprio_set, 1, 0, 3 << 13); /* IOPRIO_WHO_PROCESS = 1, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE = 3, 0), class shift is 13 */
+	if (ret == -1)
+		fprintf(stderr, "Error setting ioprio value\n");
+
+	signal(SIGINT, handle_signal);
+	signal(SIGTERM, handle_signal);
+
+	for (;;) {
+		if (sig_received)
+			break;
+
+		mntfd = open(mnt, O_RDONLY);
+		if (mntfd == -1) {
+			perror("open(mntfd)");
+			exit(EXIT_FAILURE);
+		}
+
+		ret = do_walk_seq(mntfd);
+		/* handle unmounts? EAGAIN? */
+		if (ret)
+			break;
+
+		close(mntfd);
+
+		if (exiting)
+			break;
+
+		/* yield */
+		if (!sig_received)
+			sleep(5);
+	}
+
+	free(mnt);
+
+	fprintf(stdout, "Last meta_seq = %lu\n", meta_seq);
+
+	if (ret)
+		fprintf(stderr, "Error walking inodes: %s(%d)\n", strerror(errno), ret);
+
+	exit(ret);
+}
--- a/tests/tests/basic-truncate.sh
+++ b/tests/tests/basic-truncate.sh
@@ -11,7 +11,7 @@ FILE="$T_D0/file"
 # final block as we truncated past it.
 #
 echo "== truncate writes zeroed partial end of file block"
-yes | dd of="$FILE" bs=8K count=1 status=none iflag=fullblock
+yes 2>/dev/null | dd of="$FILE" bs=8K count=1 status=none iflag=fullblock
 sync

 # not passing iflag=fullblock causes the file occasionally to just be
--- a/tests/tests/enospc.sh
+++ b/tests/tests/enospc.sh
@@ -88,6 +88,11 @@ rm -rf "$SCR/xattrs"

 echo "== make sure we can create again"
 file="$SCR/file-after"
+C=120
+while (( C-- )); do
+	touch $file 2> /dev/null && break
+	sleep 1
+done
 touch $file
 setfattr -n user.scoutfs-enospc -v 1 "$file"
 sync
--- a/tests/tests/lock-recover-invalidate.sh
+++ b/tests/tests/lock-recover-invalidate.sh
@@ -38,6 +38,6 @@ while [ "$SECONDS" -lt "$END" ]; do
 done

 echo "== stopping background load"
-kill $load_pids
+t_silent_kill $load_pids

 t_pass
--- a/tests/tests/mmap.sh
+++ b/tests/tests/mmap.sh
@@ -0,0 +1,54 @@
+#
+# test mmap() and normal read/write consistency between different nodes
+#
+
+t_require_commands mmap_stress mmap_validate scoutfs xfs_io
+
+echo "== mmap_stress"
+mmap_stress 8192 2000 "$T_D0/mmap_stress" "$T_D1/mmap_stress" "$T_D2/mmap_stress" "$T_D3/mmap_stress" "$T_D4/mmap_stress" | sed 's/:.*//g' | sort
+
+echo "== basic mmap/read/write consistency checks"
+mmap_validate 256 1000 "$T_D0/mmap_val1" "$T_D1/mmap_val1"
+mmap_validate 8192 1000 "$T_D0/mmap_val2" "$T_D1/mmap_val2"
+mmap_validate 88400 1000 "$T_D0/mmap_val3" "$T_D1/mmap_val3"
+
+echo "== mmap read from offline extent"
+F="$T_D0/mmap-offline"
+touch "$F"
+xfs_io -c "pwrite -S 0xEA 0 8192" "$F" > /dev/null
+cp "$F" "${F}-stage"
+vers=$(scoutfs stat -s data_version "$F")
+scoutfs release "$F" -V "$vers" -o 0 -l 8192
+scoutfs get-fiemap -L "$F"
+xfs_io -c "mmap -rwx 0 8192" \
+	-c "mread -v 512 16" "$F" &
+sleep 1
+# should be 1 - data waiting
+jobs | wc -l
+scoutfs stage "${F}-stage" "$F" -V "$vers" -o 0 -l 8192
+# xfs_io thread <here> will output 16 bytes of read data
+sleep 1
+# should be 0 - no more waiting jobs, xfs_io should have exited
+jobs | wc -l
+scoutfs get-fiemap -L "$F"
+
+echo "== mmap write to an offline extent"
+# reuse the same file
+scoutfs release "$F" -V "$vers" -o 0 -l 8192
+scoutfs get-fiemap -L "$F"
+xfs_io -c "mmap -rwx 0 8192" \
+	-c "mwrite -S 0x11 528 16" "$F" &
+sleep 1
+# should be 1 job waiting
+jobs | wc -l
+scoutfs stage "${F}-stage" "$F" -V "$vers" -o 0 -l 8192
+# no output here from write
+sleep 1
+# should be 0 - no more waiting jobs, xfs_io should have exited
+jobs | wc -l
+scoutfs get-fiemap -L "$F"
+# read back contents to assure write changed the file
+dd status=none if="$F" bs=1 count=48 skip=512 | hexdump -C
+
+echo "== done"
+t_pass
--- a/tests/tests/orphan-inodes.sh
+++ b/tests/tests/orphan-inodes.sh
@@ -5,18 +5,6 @@
 t_require_commands sleep touch sync stat handle_cat kill rm
 t_require_mounts 2

-#
-# usually bash prints an annoying output message when jobs
-# are killed.  We can avoid that by redirecting stderr for
-# the bash process when it reaps the jobs that are killed.
-#
-silent_kill() {
-	exec {ERR}>&2 2>/dev/null
-	kill "$@"
-	wait "$@"
-	exec 2>&$ERR {ERR}>&-
-}
-
 #
 # We don't have a great way to test that inode items still exist.   We
 # don't prevent opening handles with nlink 0 today, so we'll use that.
@@ -52,7 +40,7 @@ inode_exists $ino || echo "$ino didn't exist"

 echo "== orphan from failed evict deletion is picked up"
 # pending kill signal stops evict from getting locks and deleting
-silent_kill $pid
+t_silent_kill $pid
 t_set_sysfs_mount_option 0 orphan_scan_delay_ms 1000
 sleep 5
 inode_exists $ino && echo "$ino still exists"
@@ -70,7 +58,7 @@ for nr in $(t_fs_nrs); do
 	rm -f "$path"
 done
 sync
-silent_kill $pids
+t_silent_kill $pids
 for nr in $(t_fs_nrs); do
 	t_force_umount $nr
 done
@@ -82,7 +70,15 @@ done
 # wait for orphan scans to run
 t_set_all_sysfs_mount_options orphan_scan_delay_ms 1000
 # also have to wait for delayed log merge work from mount
-sleep 15
+C=120
+while (( C-- )); do
+	brk=1
+	for ino in $inos; do
+		inode_exists $ino && brk=0
+	done
+	test $brk -eq 1 && break
+	sleep 1
+done
 for ino in $inos; do
 	inode_exists $ino && echo "$ino still exists"
 done
@@ -131,7 +127,7 @@ while [ $SECONDS -lt $END ]; do
 	done

 	# trigger eviction deletion of each file in each mount
-	silent_kill $pids
+	t_silent_kill $pids

 	wait || t_fail "handle_fsetxattr failed"

--- a/tests/tests/simple-readdir.sh
+++ b/tests/tests/simple-readdir.sh
@@ -0,0 +1,37 @@
+#
+# verify d_off output of xfs_io is consistent.
+#
+
+t_require_commands xfs_io
+
+filt()
+{
+	grep d_off | cut -d ' ' -f 1,4-
+}
+
+echo "== create content"
+for s in $(seq 1 7 250); do
+	f=$(printf '%*s' $s | tr ' ' 'a')
+	touch ${T_D0}/$f
+done
+
+echo "== readdir all"
+xfs_io -c "readdir -v" $T_D0 | filt
+
+echo "== readdir offset"
+xfs_io -c "readdir -v -o 20" $T_D0 | filt
+
+echo "== readdir len (bytes)"
+xfs_io -c "readdir -v -l 193" $T_D0 | filt
+
+echo "== introduce gap"
+for s in $(seq 57 7 120); do
+	f=$(printf '%*s' $s | tr ' ' 'a')
+	rm -f ${T_D0}/$f
+done
+xfs_io -c "readdir -v" $T_D0 | filt
+
+echo "== cleanup"
+rm -rf $T_D0
+
+t_pass
--- a/tests/tests/xfstests.sh
+++ b/tests/tests/xfstests.sh
@@ -65,26 +65,14 @@ EOF

 cat << EOF > local.exclude
 generic/003	# missing atime update in buffered read
-generic/029	# mmap missing
-generic/030	# mmap missing
 generic/075	# file content mismatch failures (fds, etc)
-generic/080	# mmap missing
 generic/103	# enospc causes trans commit failures
 generic/108	# mount fails on failing device?
 generic/112	# file content mismatch failures (fds, etc)
-generic/120	# (can't exec 'cause no mmap)
-generic/126	# (can't exec 'cause no mmap)
-generic/141	# mmap missing
 generic/213	# enospc causes trans commit failures
-generic/215	# mmap missing
-generic/246	# mmap missing
-generic/247	# mmap missing
-generic/248	# mmap missing
 generic/318	# can't support user namespaces until v5.11
 generic/321	# requires selinux enabled for '+' in ls?
-generic/325	# mmap missing
 generic/338	# BUG_ON update inode error handling
-generic/346	# mmap missing
 generic/347	# _dmthin_mount doesn't work?
 generic/356	# swap
 generic/357	# swap
@@ -92,16 +80,13 @@ generic/409	# bind mounts not scripted yet
 generic/410	# bind mounts not scripted yet
 generic/411	# bind mounts not scripted yet
 generic/423	# symlink inode size is strlen() + 1 on scoutfs
-generic/428	# mmap missing
 generic/430	# xfs_io copy_range missing in el7
 generic/431	# xfs_io copy_range missing in el7
 generic/432	# xfs_io copy_range missing in el7
 generic/433	# xfs_io copy_range missing in el7
 generic/434	# xfs_io copy_range missing in el7
-generic/437	# mmap missing
 generic/441	# dm-mapper
 generic/444	# el9's posix_acl_update_mode is buggy ?
-generic/452	# exec test - no mmap
 generic/467	# open_by_handle ESTALE
 generic/472	# swap
 generic/484	# dm-mapper
@@ -118,11 +103,9 @@ generic/565	# xfs_io copy_range missing in el7
 generic/568	# falloc not resulting in block count increase
 generic/569	# swap
 generic/570	# swap
-generic/614	# mmap missing
 generic/620	# dm-hugedisk
-generic/633	# mmap, id-mapped mounts missing in el7
+generic/633	# id-mapped mounts missing in el7
 generic/636	# swap
-generic/638	# mmap missing
 generic/641	# swap
 generic/643	# swap
 EOF
--- a/utils/src/util.c
+++ b/utils/src/util.c
@@ -7,7 +7,6 @@
 #include <errno.h>
 #include <stdio.h>
 #include <stdlib.h>
-#include <wordexp.h>

 #include "util.h"
 #include "format.h"
@@ -18,26 +17,15 @@

 static int open_path(char *path, int flags)
 {
-	wordexp_t exp_result;
 	int ret;

-	ret = wordexp(path, &exp_result, WRDE_NOCMD | WRDE_SHOWERR | WRDE_UNDEF);
-	if (ret) {
-		fprintf(stderr, "wordexp() failure for \"%s\": %d\n", path, ret);
-		ret = -EINVAL;
-		goto out;
-	}
-
-	ret = open(exp_result.we_wordv[0], flags);
+	ret = open(path, flags);
 	if (ret < 0) {
 		ret = -errno;
 		fprintf(stderr, "failed to open '%s': %s (%d)\n",
 			path, strerror(errno), errno);
 	}

-out:
-	wordfree(&exp_result);
-
 	return ret;
 }
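
The run-tests.sh hunk above sources `funcs/tap.sh` and calls `t_tap_header` and `t_tap_progress`, but that file is not part of this listing. As a rough sketch only, TAP 14 output of the kind those calls suggest could be produced as follows. The function names, the `tap_count` variable, the comment layout, and the assumption that a status of 0 means "passed" are all illustrative and are not taken from `funcs/tap.sh`.

```shell
# Hypothetical stand-ins for the helpers in funcs/tap.sh (not shown in this diff).
tap_count=0

tap_header() {
	# A TAP 14 stream starts with a version line; the run ID rides along as a comment.
	echo "TAP version 14"
	echo "# run id: $1"
}

tap_progress() {
	# $1 = test name, $2 = status; 0 is assumed to mean "passed".
	tap_count=$((tap_count + 1))
	if [ "$2" -eq 0 ]; then
		echo "ok $tap_count - $1"
	else
		echo "not ok $tap_count - $1"
	fi
}

tap_plan() {
	# The plan line may trail the test points when the count is not known up front.
	echo "1..$tap_count"
}
```

Calling `tap_header "$(uuidgen)"` once, `tap_progress "$test_name" "$sts"` per test, and `tap_plan` at the end would yield a stream that a TAP consumer can parse; the real helpers may record more metadata than this sketch does.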
Author	SHA1	Message	Date
Auke Kok	68f7d1f2d0	Walk met_seq for inodes to open_by_handle_at() = -ESTALE, and trace. This walks the meta_seq from 0, or, a passed in number to the current meta_seq, and polls further advances, until it approaches (but doesn't reach) the current stable meta_seq. Inodes found in the index will be tested to see if they return -ESTALE when open_by_handle_at(). If they do, further tests assure the file exists and can be resolved. If that happens to be the case, we ftrace the open_by_handle_at() on the inode, and write out the tracefile. The program can be ^C'd and will print out how far in the meta_seq it got, so it can be resumed where it left off. If tracing happens to be on while a TERM or INT signal is received, we immediately turn off tracing again. The program runs without producing needless terminal output. If traces are generated, the file name is printed out. If the program terminates, for whatever reason, it prints out how far it has advanced through the meta_seq, and that information can be used to resume testing from there on. The program can be run in tailing (normal) mode, the default, and continue to wait for new work to appear. Alternatively, the command line option `-e` tells the program to stop execution once the current stable meta_seq has been reached. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-05-21 10:00:35 -07:00
Zach Brown	7865ee9f54	Merge pull request #223 from versity/auke/el9_5_wmaybe-uninit Fix -Wmaybe-uninitalized since rhel9.5	2025-05-12 12:21:02 -07:00
Zach Brown	624eb128c6	Merge pull request #221 from versity/auke/enospc-test Give enospc test more time to commit unlink.	2025-05-09 11:27:04 -07:00
Zach Brown	091eb3b683	Merge pull request #219 from versity/auke/fix-tests-failing-dirty-test-dirs Fix test cases that don't run cleanly in a semi-dirty env.	2025-05-09 11:17:24 -07:00
Zach Brown	04e8cc6295	Merge pull request #220 from versity/auke/orphan-inodes Extend orphan-inodes timeout.	2025-05-09 11:15:13 -07:00
Zach Brown	0f6fdb3eb5	Merge pull request #222 from versity/auke/t_kill_silent Properly silently kill background tasks.	2025-05-09 11:11:24 -07:00
Auke Kok	2f48a606e8	Fix -Wmaybe-uninitalized since rhel9.5 Looks like the compiler isn't smart enough to understand the pass by pointer value, and we can initialize it here easily. make[1]: Entering directory '/usr/src/kernels/5.14.0-503.26.1.el9_5.x86_64' CC [M] /home/auke/scoutfs/kmod/src/server.o /home/auke/scoutfs/kmod/src/server.c: In function ‘fence_pending_recov_worker’: /home/auke/scoutfs/kmod/src/server.c:4170:23: error: ‘addr.v4.addr’ may be used uninitialized in this function [-Werror=maybe-uninitialized] 4170 \| ret = scoutfs_fence_start(sb, rid, le32_to_be32(addr.v4.addr), \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 4171 \| SCOUTFS_FENCE_CLIENT_RECOVERY); \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cc1: all warnings being treated as errors There's still the obvious issue here that we'd intended to support ipv6 but just disregard that here. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-05-08 15:20:50 -07:00
Auke Kok	377e49caf1	Properly silently kill background tasks. Occasionally, we have some tests fail because these kills produce: tests/lock-recover-invalidate.sh: line 42: 9928 Terminated Even though we expected them to be silent. In these particular cases we already don't care about this output. We borrow the silent_kill() function from orphan-inodes and promote it to t_silent_kill() in funcs/exec.sh, and then use it everywhere where appropriate. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-05-08 12:03:04 -07:00
Auke Kok	d08eb66adc	Give enospc test more time to commit unlink. The current test sequence performs the unlink and immediately tests whether enough resources are available to create new files again, and this consistently fails. One of my crummy VMs takes a good 12 seconds before the `touch` actually succeeds. We care about the filesystem eventually returning from ENOSPC, and certainly we don't want it to take forever, but there is a period after our first ENOSPC error and cleanup that we expect ENOSPC to fail for a bit longer. Make the timeout 120s. As soon as the `touch` completes, exit the wait loop. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-05-08 11:40:13 -07:00
Zach Brown	6f19d0bd36	Merge pull request #216 from versity/zab/stop_ending_dirty_data_freed Zab/stop ending dirty data freed	2025-05-08 11:18:23 -07:00
Auke Kok	1d0cde7cc3	Clean up old test data as needed. If run without `-m` (explicit mkfs) in subsequent testing, old test data files may break several tests. Most failures are -EEXIST, but there are some more subtle ones. This change erases any existing test dir as needed just before we run the tests, and avoids the issue entirely. I considered doing a `mv dir dir.$$ && rm -rf dir.$$ &` alternative solution but that likely will interfere disproportionally with tests that do disconnects and other thing that can be impacted by an unlink storm. This has an obvious performance aspect - tests will be a little slower to start on subsequent runs. In CI, this will effectively be a no-op though. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-05-08 10:10:01 -07:00
Auke Kok	138c7c6b49	Extend orphan-inodes timeout. This test regularly fails in CI when the 15 seconds elapses and the system still hasn't concluded the mount log merges and orphan inode scans needed to unlink the test files. Instead of just extending the timeout value, we test-and-retry for 120s. This hopefully is faster in most cases. My smallest VM needs about 6s-8s on average. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-05-08 09:56:45 -07:00
Zach Brown	8aa1a98901	Merge pull request #210 from versity/auke/perf-irq-took-too-long Filter out perf `interrupt took too long` dmesg.	2025-04-30 10:04:00 -07:00
Zach Brown	888b1394a6	Retry client commit and get log trees separately The client transaction commit worker has a series of functions that it calls to commit the current transaction and open the next one. If any of them fail, it retries all of them from the beginning each time until they all succeed. This pattern behaves badly since we added the strict get_trans_seq and commit_trans_seq latching in the log_trees. The server will only commit the items for a get or commit request once, and will fail a commit request if it isn't given the seq that matches the current item. If the server gets an error it can have persisted items while sending an error to the client. If this error was for a get request, then the client will retry all of its transaction write functions. This includes the commit request which is now using a stale seq and will fail indefinitely. This is visible in the server log as: error -5 committing client logs for rid e57e37132c919c4f: invalid log trees item get_trans_seq The solution is to retry the commit and get phases independently. This way a failed get will be retried on its own without running through the commit phase that had succeeded. The client will eventually get the next seq that it can then safely commit. Signed-off-by: Zach Brown <zab@versity.com>	2025-04-29 11:46:38 -07:00
Zach Brown	e457694f19	Don't send dirty data_freed blocks to client At the end of get_log_trees we can try and drain the data_freed extent tree, which can take multiple commits. If a commit fails then the blocks are still dirty in memory. We can't send references to those blocks to the client. We have to return an error and not send the log_trees, like the main get_log_trees does. The client will retry and eventually get a log_trees that references blocks that were successfully committed. Signed-off-by: Zach Brown <zab@versity.com>	2025-04-29 11:46:38 -07:00
Zach Brown	459de5b478	Merge pull request #211 from versity/auke/tapf-output TAP formatted output.	2025-04-15 14:25:06 -07:00
Auke Kok	24031cde1d	TAP formatted output. Stored as `results/scoutfs.tap`, this file contains TAP format 14 generated test results. Embedded in the output is some metadata so that these files can be aggregated and stored in a unique and deduplicating way, by using a generated UUID at the start of testing. The file itself also captures the git ID, date, and kernel version, as well as the (possibly altered) test sequence used. Any test that has diff or dmesg output will be considered failed, and a copy of the relevant data is included as comments. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-04-15 12:02:41 -07:00
Zach Brown	04cc41719c	Merge pull request #209 from versity/auke/basic-truncate-yes-pipefail Ignore pipefail alternative error when not a tty.	2025-04-14 13:15:03 -07:00
Auke Kok	1b47e9429e	Filter out perf `interrupt took too long` dmesg. Example: ``` [ 2469.638414] perf: interrupt took too long (2507 > 2500), lowering kernel.perf_event_max_sample_rate to 79000 ``` Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-04-14 12:06:58 -07:00
Auke Kok	7ea084082d	Ignore pipefail alternative error when not a tty. This happens with the basic-truncate test, only. It's the only user of the `yes` program. The `yes` command normally fails gracefully under the usual runs that are attached to some terminal. But when the test script runs entirely under something else, it will throw a needless error message that pollutes the test output: `yes: standard output: Broken pipe` Adjust the redirect to omit all stderr for `yes` in this case. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-04-14 11:13:39 -07:00
Zach Brown	f565451f76	Merge pull request #208 from versity/zab/v1.24 v1.24 Release	2025-03-17 11:18:42 -07:00
Zach Brown	05f14640fb	v1.24 Release Finish the release notes for the 1.24 release. Signed-off-by: Zach Brown <zab@versity.com>	2025-03-14 12:19:30 -07:00
Zach Brown	609fc56cd6	Merge pull request #203 from versity/auke/new_inode_ctime Fix new_inode ctime assignment.	2025-02-25 15:23:16 -08:00
Zach Brown	a4b5a256eb	Merge pull request #175 from versity/auke/mmap Support for mmap() writable mappings.	2025-02-20 14:03:01 -08:00
Zach Brown	f701ce104c	Merge pull request #204 from versity/zab/remove_wordexp Remove wordexp expansion of utils path argument	2025-02-19 09:27:15 -08:00
Zach Brown	c6dab3c306	Remove wordexp expansion of utils path argument scoutfs cli commands were using a helper that tried to perform word expansion on the path argument. This was done with the intent of providing the convenience of shell expansion (env vars, ~) within the cli command argument. But it breaks paths that accidentally have their file names match the syntax that wordexp supports. "[ ]" tripped up files in the wild. We don't need to provide shell expansion functionality in our argument parsing. The shell can do that. The cli must pass the arguments straight through, no parsing at all. Signed-off-by: Zach Brown <zab@versity.com>	2025-02-18 11:55:37 -08:00
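
As a rough illustration of why c6dab3c306 removes the expansion, the sketch below counts how many words POSIX `wordexp()` produces for one argument. The helper name `count_expanded_words` is hypothetical and is not the removed scoutfs helper. A file name containing a space is split into more than one word, so the string that reaches the kernel is no longer the path the user typed.

```c
#include <stddef.h>
#include <wordexp.h>

/*
 * Return how many words wordexp() produces for a single argument, or -1
 * if wordexp() rejects the string. WRDE_NOCMD refuses command
 * substitution so the example never runs anything.
 */
int count_expanded_words(const char *arg)
{
	wordexp_t we;
	int count;

	if (wordexp(arg, &we, WRDE_NOCMD) != 0)
		return -1;

	count = (int)we.we_wordc;
	wordfree(&we);
	return count;
}
```

A plain name such as `plain` stays one word, while `my file.txt` becomes two. Passing `argv` entries through untouched avoids this class of surprise entirely.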
Auke Kok	e3e2cfceec	Fix new_inode ctime assignment. Very old copy/paste bug here, we want to update new_inode's ctime instead. old_inode already is updated. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-02-18 13:15:49 -05:00
Auke Kok	e9d147260c	Fix ctx->pos updating to properly handle dent gaps We need to assure we're emitting dents with the proper position and we already have them as part of our dent. The only caveat is to increment ctx->pos once beyond the list to make sure the caller doesn't call us once more. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-01-27 14:49:04 -05:00
Auke Kok	6c85879489	Assert unlock doesn't underflow lock user count. While debugging a double unlock error we hit this condition and debugging would have been a lot easier had we enforced this simple constraint that we can't decrement the lock users count if it's already 0. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-01-27 14:49:04 -05:00
Auke Kok	8b76a53cf3	Avoid cluster locking while put_user() in _allocated_inos. Similar to fiemap, readdir and walk_inodes, this method could page fault in put_user() while cluster locked, potentially causing a deadlock. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-01-27 14:49:04 -05:00
Auke Kok	e76a171c40	Avoid faulting while cluster locked in _walk_inodes. Similar to readdir and fiemap vfs methods, we can't copy to user while holding cluster locks. The previous comment about it being safe no longer applies, and this could deadlock. Rewrite the loop to iterate and store entries in a page, then flush the page contents while not holding a cluster lock. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-01-27 14:49:04 -05:00
Auke Kok	8cb08507d6	Do not copy to user while holding locks in scoutfs_data_fiemap() Now that we support mmap writes, at any point in time we could page fault and lock for writes. That means, just like readdir, we can no longer lock and copy_to_user, since it also may page fault and thus deadlock. We statically allocate 32 extent entries on the stack and use these to shuffle out fiemap entries a batch at a time, locking and unlocking around collecting and fiemap_fill_extent_next. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-01-27 14:49:04 -05:00
Auke Kok	cad12d5ce8	Avoid deadlock in _readdir() due to copy_to_user(). dir_emit() will copy_to_user, which can pagefault. If this happens while cluster locked, we could deadlock. We use a single page to stage dir_emit data, and iterate between fetching dirents while locked, and emitting them while not locked. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-01-27 14:49:04 -05:00
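
The stage-then-emit pattern shared by the readdir, fiemap, walk_inodes and allocated_inos fixes above can be sketched in user space as follows. This is an illustration under assumptions, not the kernel code: `demo_lock` stands in for a cluster lock, `STAGE_MAX` for the single staging page, and the `emit` callback for `dir_emit()` or `copy_to_user()`, which may fault and so must run without the lock held.

```c
#include <assert.h>
#include <stddef.h>

#define STAGE_MAX 4	/* stands in for one page of staged entries */

/* Hypothetical stand-in for a cluster lock; only tracks held state. */
struct demo_lock {
	int held;
};

static void demo_lock_acquire(struct demo_lock *lock)
{
	assert(!lock->held);
	lock->held = 1;
}

static void demo_lock_release(struct demo_lock *lock)
{
	assert(lock->held);
	lock->held = 0;
}

/* Emit callback: returns nonzero when the consumer can take no more. */
typedef int (*emit_fn)(int entry, void *arg);

/*
 * Copy up to STAGE_MAX entries while locked, drop the lock, then emit
 * the staged entries. *pos only advances past entries that were
 * emitted, so a later call resumes at the right place.
 */
size_t walk_staged(struct demo_lock *lock, const int *src, size_t count,
		   size_t *pos, emit_fn emit, void *arg)
{
	int stage[STAGE_MAX];
	size_t emitted = 0;

	while (*pos < count) {
		size_t n = 0;
		size_t i;

		demo_lock_acquire(lock);
		while (n < STAGE_MAX && *pos + n < count) {
			stage[n] = src[*pos + n];
			n++;
		}
		demo_lock_release(lock);

		for (i = 0; i < n; i++) {
			if (emit(stage[i], arg))
				return emitted;
			(*pos)++;
			emitted++;
		}
	}

	return emitted;
}

/* Example consumer: sums entries and records if the lock was ever held. */
struct emit_ctx {
	struct demo_lock *lock;
	int sum;
	int lock_was_held;
};

int check_emit(int entry, void *arg)
{
	struct emit_ctx *ctx = arg;

	if (ctx->lock->held)
		ctx->lock_was_held = 1;
	ctx->sum += entry;
	return 0;
}
```

The cost of this design is an extra copy through the staging buffer and more lock round trips. In exchange, the potentially faulting step never runs while the lock is held.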
Auke Kok	e59a5f8ebd	Readdir w/offset validation. Verify using xfs_io that readdir offsets match expected output. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-01-27 14:49:04 -05:00
Auke Kok	1bcd1d4d00	Drop readdir pre-.iterate() compat (el7.5ish). These 2 sections of compat for readdir are wholly obsolete and can be hard dropped, which restores the method to look like current upstream code. This was added in `ddd1a4e`. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-01-23 14:28:40 -05:00
Auke Kok	b944f609aa	remap_pages ops becomes obsolete.	2025-01-23 14:28:40 -05:00
Auke Kok	519b47a53c	mmap() trace events. We merely trace exit values and position, and ignore length. Because vm_fault_t is __bitwise, sparse will loudly complain about a plain cast to u32, so we must __force (on el8). ret will be 512 in normal cases. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-01-23 14:28:40 -05:00
Auke Kok	92f704d35a	Enable all xfstests mmap() tests. Now that all of these should be passing, we enable all mmap() tests in xfstests, and update the golden output with the new tests. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-01-23 14:28:40 -05:00
Auke Kok	311bf75902	Add mmap tests. Two test programs are added. The run time is about 1min on my el7 instance. The test script finishes up with a read/write mmap test on offline extents to verify the data wait paths in those functions. One program will perform vfs read/write and mmap read/write calls on the same file from across 5 threads (mounts) repeatedly. The goal is to assure there are no locking issues between read/write paths. The second test program performs consistency checking on a file that is repeatedly written/read using memory maps and normal reads and writes, and the content is verified after every operation. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-01-23 14:28:40 -05:00
Benjamin LaHaise	3788d67101	Add support for writable shared mmap()ings Add support for writable MAP_SHARED mmap()ings. Avoid issues with late writepage()s building transactions by doing the block_write_begin() work in scoutfs_data_page_mkwrite(). Ensure the page is marked dirty and prepared for write, then let the VM complete the write when the page is flushed or invalidated. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-01-23 14:28:40 -05:00
Benjamin LaHaise	b7a3d03711	Add support for read only mmap() Adds the required memory mapped ops struct and page fault handler for reads. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-01-23 14:28:40 -05:00